
Indexing Genomic Data on Hadoop

Peter Büchler

-Master of Science Thesis- TRITA-ICT-EX-2014:111

KTH ROYAL INSTITUTE OF TECHNOLOGY, STOCKHOLM, SWEDEN

Date August 11, 2014

Examiner Prof. Seif Haridi


Abstract

In recent years Hadoop has become a standard backend for big data applications. Its best-known application, MapReduce, provides a powerful parallel programming paradigm. Big companies storing petabytes of data, such as Facebook and Yahoo, have deployed their own Hadoop distributions for data analytics, interactive services and more. Nevertheless, the simplicity of MapReduce's map stage always leads to a full scan of the input data and thus potentially wastes resources.

Recently, new sources of big data have appeared, e.g. the 4K video format or genomic data. Genomic data in its raw file format (FastQ) can take up hundreds of gigabytes per file. Simply using MapReduce for a population analysis would easily end up in a full scan of terabytes of data. There is an obvious need for more efficient ways of accessing the data by reducing the amount of data considered for the computation. Existing approaches introduce indexing structures into their respective Hadoop distributions. While some of them are specifically made for certain data structures, e.g. key-value pairs, others strongly depend on the existence of a MapReduce framework.

To overcome these problems we integrated an indexing structure into Hadoop's file system, the Hadoop Distributed File System (HDFS), working independently of MapReduce. This structure supports the definition of custom input formats and individual indexing strategies. The index building process is integrated into the file writing process and is independent of software working in higher layers of Hadoop. As a proof of concept, MapReduce has nevertheless been given the possibility to use these indexing structures by simply adding a new parameter to its job definition. A prototype and its evaluation show the advantages of using those structures with genomic data (FastQ and SAM files) as a use case.


Acknowledgments

I would like to express my thanks and special appreciation to my supervisor Dr. Jim Dowling, who inspired me in many ways during this project. My special thanks also go to Mahmoud Ismail and Salman Niazi, who spent hours supporting and guiding us even through the simplest problems.

Last but not least I want to thank SICS and all the people working there for providing a great working environment, board game evenings and, most importantly, lots of cake.


Contents

List of Figures

List of Tables

Listings

1 Introduction
  1.1 Problem Description

2 Background
  2.1 Hadoop Architecture
    2.1.1 HDFS
    2.1.2 MapReduce
  2.2 MySQL Cluster
  2.3 Hadoop Open Platform as a Service
  2.4 Genomic Data
    2.4.1 The FastQ Format
    2.4.2 SAM & BAM Format
  2.5 ADAM

3 Related Work
  3.1 HAIL
  3.2 Manimal

4 Method
  4.1 Requirements
  4.2 Implementation
    4.2.1 Requirements for Genomic Data
    4.2.2 The File Cutter & Indexer
    4.2.3 Cutting Files into Blocks
    4.2.4 Building the index
    4.2.5 Using the index

5 Evaluation
  5.1 Improvements through index queries
  5.2 Cutting Performance
  5.3 Block waste
  5.4 Indexing Performance

6 Future Work
  6.1 Byte Buffer for file cutting
  6.2 Query Extensions
  6.3 File Indexer as terminal parameter
  6.4 Index Management

7 Conclusions

Bibliography

A SAM Flags
B Cigar
C The FastQ Indexer

List of Figures

2.1 The HDFS copying process
2.2 The HDFS architecture
2.3 MapReduce phases
2.4 HDFS in HOP
2.5 Costs of sequencing a single genome. Raw data taken from [15]
2.6 Formats of genomic data
2.7 Comparative Visualization of BAM and ADAM File Formats [18]
3.1 The HAIL uploading pipeline. Adapted from [19]
4.1 Sending file parts among datanodes
4.2 Creating a file
4.3 Overview of most important classes
4.4 A structural overview of a query
4.5 An index query for a SAM file
5.1 Different ranges of block waste
5.2 Durations of the writing and indexing process

List of Tables

2.1 The FastQ header format
2.2 The SAM file format
5.1 Files used for the block waste tests
5.2 The summed up block waste in relation to the block size

Listings

4.1 The FileIndexer interface
4.2 Quota updates
4.3 Starting the index manager
4.4 The SAM indexer
4.5 Using an indexer as a job parameter
4.6 Creating a query
4.7 The PredicateBinOp class
6.1 Usage of a byte buffer
C.1 The FastQ indexer

List of Acronyms and Abbreviations

HDFS   Hadoop Distributed File System
GFS    Google File System
YARN   Yet Another Resource Negotiator
HOP    Hadoop Open Platform as a Service
SICS   Swedish Institute of Computer Science
DNA    deoxyribonucleic acid
SAM    Sequence Alignment/Map
BAM    Binary Alignment/Map
VCF    Variant Calling Format
BGZF   Blocked GNU Zip Format
HAIL   Hadoop Aggressive Indexing Library
CSV    Comma-Separated Values
GUI    Graphical User Interface

1 Introduction

In recent years the focus on the topic of Big Data has been constantly increasing and "[e]ntire business sectors are being reshaped" [1] by it. Already by 2011, Facebook had 30 petabytes¹ stored on its own Hadoop cluster [2]. New sources of data, e.g. the 4K video format or genomic data, have appeared and increased the need for new tools for storing and analysing huge amounts of data (see [3]). In the case of genomic data, files in the raw file format (FastQ) can take up hundreds of gigabytes per file. Hence, genomic data has immense storage and computing requirements when being analysed (see Chapter 2.4). Many projects (e.g. [4], also see [5]) have been devoted to addressing the increasing problems of storing and analysing these huge amounts of data in different ways. Alignment processes can improve over time, which requires FastQ files to be read more than once. Although processed data (e.g. in the SAM file format) requires much less storage, many applications require the analysis of more than one file. A population analysis, for example, would most likely require an analysis of thousands of files (see for example [6]). In most cases reading an entire FastQ or SAM file is not necessary.

Hadoop has become a standard distributed system for such tools and applications (see [7]). In order to guarantee a certain level of failure tolerance it usually replicates data among its nodes. Its most famous application, MapReduce, offers a powerful parallel programming paradigm which has become a standard in itself for parallel programming.

1.1 Problem Description

Human nucleotide diversity is less than 0.5% (see [8]). This also means that only tiny parts of sequence files may be interesting for a population analysis. Considering a maximum diversity of 0.5%, reading only this specific part of a human genome can potentially save up to 99.5% of the reading time compared to a full data read. Genomic data is one of many examples where a form of random access can potentially improve performance.

Right now HDFS does not offer any way of storing or using meta information in order to filter out parts of its files. This means that applications like MapReduce have no choice but to read all input data. This simplicity is often countered simply by a higher number of machines used for the parallel computations.

¹ 1 petabyte = 1000 terabytes = 1 million gigabytes


This thesis introduces an indexing structure that has been integrated into HDFS to overcome these problems. A first prototype has been developed that works independently of higher layers of Hadoop (e.g. the application layer, including MapReduce). This index structure allows applications to filter out parts of their input data by using existing indexing structures. At the same time, the system allows the user to define individual indexing solutions for new data types.


2 Background

2.1 Hadoop Architecture

In its current version (2.3.0), Hadoop consists of three main software packages:

1. HDFS [9]

Although not all parts of Hadoop strictly depend on it, Hadoop comes with its own file system, HDFS, which is inspired by the Google File System (GFS) [10]. It is made for storing huge files on commodity hardware and at the same time offers a highly available and fault-tolerant platform (see Chapter 2.2).

2. MapReduce [11]

MapReduce is an implementation of the MapReduce algorithm, also first introduced by Google [12]. Using the MapReduce package, it is possible to distribute computations among several machines and thus to parallelize them. In the first versions of Hadoop it was the only application running on Hadoop. This changed with the introduction of YARN (Yet Another Resource Negotiator), which now accepts any kind of program.

Still, MapReduce is the application used to show the functionality of this thesis' prototype.

3. YARN [13]

Yet Another Resource Negotiator (YARN) is Hadoop's own resource manager. With its introduction, Hadoop allowed other applications next to MapReduce. This change also resulted in an adjustment of MapReduce to the new resource management and thus the introduction of MapReduce version 2 (MRv2). YARN itself is not relevant to this thesis' work, which only touches the functionality of HDFS.

The following sections present details about the architecture and functionality of the software packages that are most relevant for this thesis, i.e. HDFS and MapReduce.


2.1.1 HDFS

HDFS is Hadoop's own file system. One of its most important features is the splitting of large files into smaller blocks. In a standard configuration HDFS splits each file into blocks of 64 megabytes each. This feature is a requirement for easily distributing blocks (and thus files) among several datanodes.

To guarantee a certain level of failure tolerance, HDFS can replicate blocks and distribute the respective replicas as well. The replication factor in a standard configuration is 3. Figure 2.1 shows a simplified overview of a standard process of copying a large file into HDFS.

Figure 2.1: The HDFS copying process

Blocks are distributed among all datanodes. Meta information (e.g. the locations of replicas) is stored on a namenode. This namenode regularly receives block reports from its datanodes to update meta information regarding replicas. As a reaction to these block reports (and only then) the namenode can send commands to datanodes, e.g. to delete or create block replicas. The client can communicate with the namenode in order to receive or update meta information or to define the previously mentioned reactions, for example by changing the replication factor of a single file. Figure 2.2 shows the namenode in a standard HDFS architecture communicating with clients and datanodes. The meta information is, like any other kind of data, simply stored within HDFS.

Figure 2.2: The HDFS architecture

When accessing a file, a client needs to request the block locations from the namenode. The central position of the namenode makes it a single point of failure. Newer versions of Hadoop, however, allow more than one namenode to ensure some level of failure tolerance.
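For illustration, the block size and replication factor described above can also be set per file through the standard HDFS client API. The following is a minimal sketch, not part of this thesis' prototype; the namenode URI, path and payload are placeholder values.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/genomes/sample.fastq");
        int bufferSize = 4096;
        short replication = 3;                 // standard replication factor
        long blockSize = 64L * 1024 * 1024;    // 64 MB blocks, as in the standard configuration

        // The client splits the stream into 64 MB blocks; the namenode takes
        // care of distributing the replicas among the datanodes.
        FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
        try {
            out.writeBytes("@SEQ_ID\nACTG\n+\nIIII\n");
        } finally {
            out.close();
        }

        // Per-file change of the replication factor, as described above.
        fs.setReplication(file, (short) 2);
    }
}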

2.1.2 MapReduce

The MapReduce package contains an implementation of the MapReduce algorithm. MapReduce itself works in two major phases (see Figure 2.3):

1. Map

The input data is distributed to several map processes that run a user-defined function. Depending on the user-defined requirements, these processes distribute their results to central locations. During the map phase the MapReduce client asks the namenode for all block locations of the input data. It will always receive all data, no matter which data is actually used in the end. If only a specific part of the data is needed, the map phase itself has to take care of the filtering.

2. Reduce

This result data is then processed by exactly one reduce process per result "location". The reduce processes independently store their respective output in a location chosen by the user.

Both phases can easily be parallelized, as each map process is supposed to work independently of every other map process (the same holds for reduce processes). The mapping of intermediate results to a reduce process (i.e. the "shuffle phase") can optionally be defined by the user too.

Figure 2.3: MapReduce phases

MapReduce's main advantage is the simple way of implementing a scalable parallel process. On the other hand, this simplicity strongly limits MapReduce's capabilities, starting with the impossibility of implementing iterative processes (other than just starting several MapReduce jobs).
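To make the programming model concrete, the following is a minimal sketch of an MRv2 job that counts SAM data lines per reference sequence. It only uses the standard Hadoop API and is not part of this thesis' prototype; class names and the job setup are ours.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SamReadCount {

    // Map: emit (reference sequence name, 1) for every SAM data line.
    public static class SamMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text refName = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (line.isEmpty() || line.startsWith("@")) {
                return; // skip header lines
            }
            refName.set(line.split("\t")[2]); // RNAME column
            ctx.write(refName, ONE);
        }
    }

    // Reduce: sum the counts per reference sequence.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sam read count");
        job.setJarByClass(SamReadCount.class);
        job.setMapperClass(SamMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Without an index, such a job still reads every block of the input file; any filtering has to happen inside the map function itself.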


2.2 MySQL Cluster

MySQL Cluster is a highly available version of MySQL, using NDB Cluster as its storage engine. NDB Cluster itself can be accessed (concurrently) via several APIs. These include the MySQL Server API and the more limited but efficient ClusterJ API, which accesses the NDB storage engine directly (as the MySQL server does). Both APIs are used for this thesis' work. While ClusterJ is more efficient for primary key queries, it does not offer more complex queries, e.g. joins. For those kinds of queries the MySQL Server API is used.
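For illustration, a primary-key lookup via ClusterJ roughly looks as follows. The connection properties and the column mapping are assumptions for this sketch, not the actual HOP schema (only the table name inode_metas is mentioned later in this thesis).

import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

public class ClusterJLookupSketch {

    // Hypothetical mapping of a metadata table to a ClusterJ interface.
    @PersistenceCapable(table = "inode_metas")
    public interface InodeMeta {
        @PrimaryKey
        @Column(name = "inode_id")
        long getInodeId();
        void setInodeId(long id);

        @Column(name = "index_table")
        String getIndexTable();
        void setIndexTable(String name);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("com.mysql.clusterj.connectstring", "localhost:1186"); // NDB management server (placeholder)
        props.put("com.mysql.clusterj.database", "hop");                 // schema name (placeholder)

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();
        try {
            // Efficient primary-key read that bypasses the MySQL server layer.
            InodeMeta meta = session.find(InodeMeta.class, 42L);
            System.out.println(meta == null ? "not found" : meta.getIndexTable());
        } finally {
            session.close();
        }
    }
}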

2.3 Hadoop Open Platform as a Service

The Hadoop Open Platform as a Service (HOP) [14] is a project based on Hadoop version 2. It is currently developed at SICS (Swedish Institute of Computer Science). This thesis' work is based on this specific Hadoop distribution.

HOP moves the metadata of HDFS to a MySQL Cluster. By not storing any state inside the namenode it can make use of several namenodes at the same time and thus make HDFS highly available and more scalable (see Figure 2.4). Other parts of HDFS have mostly been kept untouched.

Figure 2.4: HDFS in HOP


2.4 Genomic Data

Genomic data is the data retrieved by using DNA (deoxyribonucleic acid) sequencers. DNA sequencers are machines used for automatically reading a genome sequence, i.e. the order of the DNA bases adenine, guanine, cytosine, and thymine. Six years ago this process would have cost more than $100,000 for a single human genome. These costs have constantly decreased in recent years and are now around $1,000 per genome (see Figure 2.5). This trend will most likely continue, so even private users will soon be able to make use of such offers.

Figure 2.5: Costs of sequencing a single genome. Raw data taken from [15].

Results of the sequencing process are given in a text-based format. The output of a single read is always a part of the genome's sequence, given as a string containing the letters A, G, C and T (each representing one of the DNA bases). Output files usually contain several reads. Single files in this raw format (FastQ files) can currently take up to 250 GB of storage. Further processing of these files can drastically decrease the size, which is sometimes connected with a loss of information and thus depends on the applications' requirements. Figure 2.6 gives an overview of common file formats for storing genomic data.


Figure 2.6: Formats of genomic data

After retrieving a set of partial sequences (FastQ) the next obvious step is to map those sequences to the genome. As already mentioned in Chapter 1.1, the human genome has a maximum diversity of 0.5%. Hence, the partial sequences can easily be mapped and compared to a so-called "reference genome". This information is saved in the SAM (Sequence Alignment/Map) file format, which also comes in a compressed version: the BAM (Binary Alignment/Map) file format. Later processing can go as far as only saving the genomic difference to a specific reference genome. This information requires much less storage and is stored using the Variant Calling Format (VCF).

In the following sections the two formats FastQ and SAM are described in detail. Being located at an early stage of genomic data processing, these two formats contain the highest amount of information and thus are most interesting for indexing solutions. At the same time, the cost factor is one reason not to delete the source files, even though one might currently not need all of their information. Furthermore, analysis processes improve over time, so source files are potentially read several times. These reasons make the FastQ and SAM file formats most interesting for this thesis' work.


2.4.1 The FastQ Format

FastQ is a currently used standard for storing nucleotide sequences in a text-based format. It always includes a per-base quality score that describes the accuracy of the readings (see [16]). The format uses four lines for each sequence:

1. The header line

This line always starts with an @ and contains meta information about the reading, i.e. an identifier and descriptions.

2. The sequence line

This line contains the sequence itself, i.e. the letters describing the raw sequence’s bases.

3. Additional description line

This optional line (still always starting with a +) again contains the sequence identifier and descriptions.

4. Quality Scores

This line contains the per base quality scores and thus is supposed to have the same length as the sequence line.

The following is an exemplary FastQ sequence entry with all of its four lines:

@SEQUENCE_ID EXAMP_SOFT:2:1:583:662/1
ACTCACCTAACAGAGAAGAACCTTCCTTTTGACAGAGCAGTTTTGATAC
+
IHDIIGHFIHDIBHFDHECHFFBGGGGEGFH@FEE8BDBB@EECE@CDE

The format is similar to the format used for the output of the Illumina sequencing software. The following parts are the most common in a FastQ header:

SEQUENCE_ID   Additional identifier added for storage in a library
EXAMP_SOFT    Illumina identifier
2             Flowcell lane
1             Tile number within the flowcell lane
583           x-coordinate of the cluster within the tile
662           y-coordinate of the cluster within the tile
/1            /1 and /2 are used for the members of a paired read

Table 2.1: Common FastQ header fields.
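Since every entry consists of exactly four lines, FastQ files can be consumed entry by entry with very little code. The following reader is only a sketch (class names are ours; well-formed input is assumed):

import java.io.BufferedReader;
import java.io.IOException;

public class FastqReader {

    /** One FastQ entry: header, sequence, optional description and quality scores. */
    public static class FastqRecord {
        public final String header, sequence, description, quality;
        FastqRecord(String h, String s, String d, String q) {
            header = h; sequence = s; description = d; quality = q;
        }
    }

    /** Reads the next four-line entry, or returns null at the end of the stream. */
    public static FastqRecord next(BufferedReader in) throws IOException {
        String header = in.readLine();
        if (header == null) {
            return null;
        }
        if (!header.startsWith("@")) {
            throw new IOException("Malformed FastQ entry, expected '@': " + header);
        }
        String sequence = in.readLine();
        String description = in.readLine(); // the line starting with '+'
        String quality = in.readLine();
        if (quality == null || quality.length() != sequence.length()) {
            throw new IOException("Quality line must have the same length as the sequence line");
        }
        return new FastqRecord(header, sequence, description, quality);
    }
}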

2.4.2 SAM & BAM Format

The Sequence Alignment/Map (SAM) format [17] is a text-based file format for storing the output of sequence alignments. Using FastQ files as input, aligners assign sequences to a position within a known reference genome. The SAM format stores its data in tab-delimited ASCII columns. The file starts with an arbitrary number of optional header lines (again starting with an @). Each header line starts with an acronym identifying the purpose of the line and uses a simple tab-delimited key:value format for storing information. The following is an example of a set of header lines, followed by a brief description:

@HD VN:1.0 SO:coordinate

@SQ SN:seq1 LN:5000

@CO Example of SAM/BAM file format

The @HD key is mostly used for general information, e.g. the file's version number (VN) or the sorting order of the data lines (SO).

For each of the used reference genomes there can be an @SQ key followed by information about this genome, e.g. its name (SN) or its length (LN).

The @CO key is simply used for comments for the file.

This was only an extract of all possible keys. Other parts of the header can, for example, give information about the software used for aligning the sequences or about the sequencing center that produced the results.


More important for this report are the data lines directly following the header. Table 2.2 shows and explains the individual fields of an exemplary SAM data line:

Example          Field Name   Description
r001             QNAME        Query/read name. It groups alignments by the read's name and is used for paired reads or multiple mappings.
163              FLAG         A bitwise flag giving general information about this read, e.g. whether it was a proper alignment or not (see Appendix A for a complete listing).
seq1             RNAME        The reference sequence's name.
7                POS          The position of the sequence within the reference sequence.
30               MAPQ         The mapping quality.
8M 2I 4M         CIGAR        The CIGAR string, specifying the alignment (see Appendix B).
*                MRNM         Reference name of the mate (i.e. of a paired read).
37               MPOS         Position of the mate.
39               ISIZE        Observed template length of the reads.
TTAGATAAAG       SEQ          Segment sequence.
>>. >>.,>7,      QUAL         Quality values of the sequence.
MF:i:18 Aq:i:0   TAGs         Meta information in a TAG:TYPE:VALUE format.

Table 2.2: Data fields of the SAM format.
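For illustration, the FLAG column can be decoded with plain bit tests. The flag values below follow the SAM specification (see also Appendix A); the helper class itself is only a sketch.

public class SamFlags {
    public static final int PAIRED         = 0x1;
    public static final int PROPER_PAIR    = 0x2;
    public static final int UNMAPPED       = 0x4;
    public static final int MATE_UNMAPPED  = 0x8;
    public static final int REVERSE        = 0x10;
    public static final int MATE_REVERSE   = 0x20;
    public static final int FIRST_IN_PAIR  = 0x40;
    public static final int SECOND_IN_PAIR = 0x80;

    public static boolean isSet(int flag, int bit) {
        return (flag & bit) != 0;
    }

    public static void main(String[] args) {
        int flag = 163; // the FLAG value from the example line above
        System.out.println("paired:          " + isSet(flag, PAIRED));          // true
        System.out.println("proper pair:     " + isSet(flag, PROPER_PAIR));     // true
        System.out.println("mate on reverse: " + isSet(flag, MATE_REVERSE));    // true
        System.out.println("second in pair:  " + isSet(flag, SECOND_IN_PAIR));  // true
    }
}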

The BAM (Binary Alignment/Map) file format stores exactly the same information but is a compressed version of the SAM format. It uses BGZF (Blocked GNU Zip Format) compression, a variation of the standard gzip file format. The main difference is that BAM files allow random access if they are indexed during compression. The index is usually provided via a separate BAM index file. Due to these special features, BAM files have not yet been considered for this thesis (as a missing index file may lead to an important design decision regarding the storing procedure of BAM files). The implementation, however, allows new formats to be added, including individual indexing strategies.


2.5 ADAM

ADAM [18] is a "set of formats, APIs, and processing stage implementations for genomic data". Most interesting is the format defined as an alternative to the SAM/BAM format. Normally a SAM file first contains all header lines and then all data lines. A problem here is that header lines can refer to data lines and add meta information to them. The ADAM format tries to solve this problem by putting the data lines and headers together. The format was created in order to support its use within distributed systems. It also uses Parquet as the underlying data storage and thus stores ADAM files column-wise. Figure 2.7 shows an exemplary conversion from the SAM/BAM file format to the ADAM format.

Figure 2.7: Comparative Visualization of BAM and ADAM File Formats [18]

The column storage allows ADAM to compress data efficiently and to apply data filters quickly. Though the data storage is different, ADAM's data format can potentially be used as an input format for this thesis' prototype. As the process of cutting files into blocks can be individualised, blocks can simply be cut when a column changes. The indexing structure can then record which block stores which column (potentially including a range). The first focus, however, has been put on more conservative formats, i.e. FastQ and SAM.


3 Related Work

3.1 HAIL

HAIL (Hadoop Aggressive Indexing Library) is one of many solutions for including index structures in Hadoop. HAIL adds an index file for each block (see [19]). This index file enables faster random access within the respective block. In situations where a block cannot easily be indexed, HAIL restructures the block. With a replication factor of three, HAIL can potentially have three (physically) different replicas per (logical) block (see Figure 3.1). At the beginning of a MapReduce job all these index files can be queried about which part of the replica is relevant to the job.

Figure 3.1: The HAIL uploading pipeline. Adapted from [19]

One of the differences to older solutions of this problem is that HAIL does not simply write MapReduce jobs for indexing the data but rather includes indexing in the file copying process without taking more time for this process. This independence from the application layer was also one of the requirements for this thesis, as it gives any application the possibility to use the indexing structures.


3.2 Manimal

Manimal [20] is a system for reducing the running time of MapReduce jobs, working completely in the background. The system analyses MapReduce programs and makes them use only the necessary data. This task is achieved through the usage of index structures. Based on the information gained from these structures, Manimal can optimize MapReduce programs by enabling relational-style optimizations. A basic aspect of Manimal is to map Java programs to SQL-like queries.

To do so it assumes that programmers use common programming idioms in their MapReduce programs. These idioms are detected and optimized by Manimal. Most of its work has been done for the map phase, which is responsible for the choice of data. The reduce phase, which would most likely be mapped to a combination of SQL's GROUP BY and HAVING commands, is supposed to be analysed in future work.

Unfortunately the whole system focuses too strongly on improvements for MapReduce. With the introduction of YARN, MapReduce is only one of many potential applications using HDFS. This thesis results in a prototype working independently of MapReduce. This way Manimal can potentially be set up on top of it and use this thesis' own index structures for its improvements. Nevertheless, this should happen in a generic way that allows applications like Manimal to be added.


4 Method

The goal of this thesis' work is to introduce an indexing structure for HOP. This includes the possibility to define individual indexes for specific files and to use them when running jobs on those files. An index of a file within HDFS maps its content to specific block IDs and thus creates a possibility to filter out certain blocks when necessary. This thesis' hypothesis is that indexing structures can potentially save job resources without critically slowing down other parts of the system, i.e. the copying process. In order to validate this hypothesis a prototype has been developed, following an experimental approach.

The result of this thesis and its success is directly determined by a statistical evaluation of the prototype regarding performance improvements and slowdowns. There are many ways of including an indexing process in Hadoop. One of the requirements of this thesis, though, is that it has to be as independent as possible from higher layers of Hadoop. Including the indexing process as a MapReduce job would simply break this requirement. Hence, the process is included in the copying process of a file into HDFS.

Only two components in Hadoop can analyse a block's data during this process: the client (while splitting the file into blocks) and the respective datanode (while receiving the block). Appending the indexing process to the client's workload would drastically increase the upload time, as blocks are handled sequentially. The datanodes, on the other hand, can potentially parallelize the work of indexing all blocks of a file.

4.1 Requirements

These goals require a set of changes to Hadoop's standard procedures for copying files. For this, the following work packages and requirements have been defined:

1. Implement a possibility to define "input formats" for different files. Those input formats shall define which index to use.

a) Give as much freedom as possible to individualise own formats.

2. Append the indexing process to the datanode's process of writing a block to its disk.

a) Add this process without slowing down the normal writing process.


3. Change the process of cutting a file into blocks in order to get "clean" cuts (e.g. always cut a block at a line break).

a) Again add this package without slowing down existing processes.

b) Use the input formats to define this process.

4. Add a possibility to create an index query and send it with a job to reduce the number of blocks used for the computation.

a) Let the input formats handle queries instead of adding many new classes.

b) Allow applications to easily access and use queries for their own needs.

5. Integrate all these packages in existing structures, i.e. existing functions of HDFS, job classes used by MapReduce etc.

A general goal was to create a solution that is as generic as possible. Though genomic data is the use case, the solution should be applicable to any other file format too. This is mostly achieved by the first work package, which gives the possibility to define an arbitrary input format. When a new file is copied into HDFS it is processed according to the given input format. This includes the cutting process and the indexing itself. Later, when using the index, the input format is also supposed to help the user create index queries.

4.2 Implementation

The following sections cover the implementation of the previously defined work packages (see Chapter 4.1). Though most of the implementation can be applied to the official version of HDFS, some details are influenced by the special structure of the namenode, i.e. the queries to the MySQL Cluster.

4.2.1 Requirements for Genomic Data

Genomic data can appear in different forms. Its raw file format (FastQ) is a simple line-based text format where each "entity" consists of exactly four lines (see Chapter 2.4.1). In a situation where a datanode is supposed to index or analyse a block of a FastQ file, it would be advantageous to always have all four lines of an entity in a block (or none of them). The usual process of cutting files into blocks, which is located in the client, does not consider a dynamic block size. It rather cuts the file exactly at the preferred block size, e.g. 64 or 128 megabytes. Assuming that three consecutive blocks are distributed among three datanodes (ignoring the replication), a single datanode would most likely have to be in contact with two other datanodes to share information about partial lines (see Figure 4.1).

Figure 4.1: Sending file parts among datanodes

To avoid this communication one could simply cut blocks right before the header line that is closest to the preferred block size. Even for huge FastQ files the file size is mostly determined by the number of lines, not their respective length. Hence, the size of an entire entity, i.e. four consecutive lines, would not be larger than a few kilobytes. Assuming a block size of 64 MB, even 100 kB (an average of 25,000 symbols per line) would only amount to about 0.15% of the block size and thus would be an acceptable "waste" of free space.

SAM files have a similar but still different structure. They start with an arbitrary number of header lines, followed by an (again arbitrary) number of data lines (see Chapter 2.4.2). Generally all lines are independent of each other. This means that a new cutting function would only have to look for line breaks.

Unfortunately both FastQ and SAM files have another detail that could affect the file cutting process. Both formats can have paired reads that are in relation to each other. Whether those reads can be separated among datanodes or not strongly depends on the potential applications. In the following implementation this fact has been ignored, as only sequence coordinates have been used for the index structures.


4.2.2 The File Cutter & Indexer

The file cutter and indexer form the input format that can be defined for every kind of data. To define a new input format, the interface FileIndexer has to be implemented. Its functions will be explained one after another in the following sections. The following is an overview of all functions within the interface:

public int giveCutLength(byte[] b);

public int getCutThreshold();

public String[][] indexBlock(InputStream blockStream);

public PQuery createQuery();

Listing 4.1: The FileIndexer interface
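To illustrate the interface, the following is a hypothetical indexer for a simple line-based text format: it cuts blocks at the last line break and indexes each block by the first and last value of its first tab-separated column. The no-cut return value, the threshold value and the PQuery construction are assumptions about the prototype, not taken from it.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class LineRangeIndexer implements FileIndexer {

    @Override
    public int giveCutLength(byte[] b) {
        // Cut directly after the last complete line in this byte array.
        for (int i = b.length - 1; i >= 0; i--) {
            if (b[i] == '\n') {
                return i + 1;
            }
        }
        return -1; // no line break found (assumed convention: caller falls back to a plain cut)
    }

    @Override
    public int getCutThreshold() {
        // Start looking for a cut once fewer than 8 kB are left in the block.
        return 8 * 1024;
    }

    @Override
    public String[][] indexBlock(InputStream blockStream) {
        String first = null;
        String last = null;
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(blockStream));
            String line;
            while ((line = br.readLine()) != null) {
                String key = line.split("\t")[0];
                if (first == null) {
                    first = key;
                }
                last = key;
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not index block", e);
        }
        List<String[]> entries = new ArrayList<String[]>();
        if (first != null) {
            entries.add(new String[] { first, last });
        }
        return entries.toArray(new String[entries.size()][]);
    }

    @Override
    public PQuery createQuery() {
        return new PQuery(); // assumed default constructor of the prototype's query class
    }
}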

4.2.3 Cutting Files into Blocks

The process of cutting files into blocks is mostly realized within the class DFSOutputStream. Indirectly it inherits from the standard OutputStream class to let HDFS appear like any other file system. Within Java, files can simply be written to by calling the write(byte[] b) function (or similar ones).

In a case where the given byte array is too large to fit into the block, the client fills up the block, flushes all packets if necessary and starts a new block (beginning with the rest of the byte array). In a standard scenario the client has no knowledge about following or previous byte arrays. Hence, the decision where to cut the block has to be made exclusively with the last array. This is what the function giveCutLength(byte[] b) of the interface has to decide. Given a byte array, it returns a potential cutting position that is closest to the end of that array. The function is called through an instance of the respective file indexer.

This file indexer can be given as a parameter to the create(fileName, ...) function of HDFS, i.e. of the class DistributedFileSystem. In the process of creating the file, the client also informs the namenode about which indexer is being used for the file cutting. Currently this is simply done by sending the name of the index that the file indexer refers to (see Figure 4.2).


Figure 4.2: Creating a file

Actually, the DFSOutputStream class normally does not handle the write process by itself, but rather uses the write(byte[] b) function of its super class FSOutputSummer, which itself extends OutputStream (see Figure 4.3).

Figure 4.3: Overview of most important classes

To gain more control over this functionality, the write function was also added to DFSOutputStream, thus overriding its super class' function. Besides implementing the standard write behavior, the super class is also responsible for updating statistics about the number of bytes written to datanodes. The functionality to finish and start blocks, however, is entirely located in DFSOutputStream. Hence, the following simple steps can be implemented within the write function:

1. Call the super class' write function for only a part of the byte array
2. Flush packets, finish the block and start a new one
3. Call the super class' write function for the rest of the byte array


This order of commands is used when a file indexer has been given to the stream instance. Otherwise it can simply call the function of the super class.
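A rough sketch of how these three steps could look inside the overridden write(byte[]) function is given below. The field fileIndexer and the helpers bytesLeftInBlock() and finishBlockAndStartNew() are placeholders for the prototype's internal state and block handling, not actual Hadoop methods.

// Sketch only: splits a write call at a clean cut position found by the file indexer.
@Override
public synchronized void write(byte[] b) throws IOException {
    if (fileIndexer == null || b.length < bytesLeftInBlock()) {
        super.write(b);                          // default behaviour of FSOutputSummer
        return;
    }
    int cut = fileIndexer.giveCutLength(b);      // last clean cut inside this byte array
    if (cut <= 0) {
        super.write(b);                          // no cut found: fill the block as usual
        return;
    }
    super.write(b, 0, cut);                      // 1. fill the current block up to the cut
    finishBlockAndStartNew();                    // 2. flush packets, finish the block, start a new one
    super.write(b, cut, b.length - cut);         // 3. write the remainder into the new block
}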

This approach though has a weakness. Assuming a block size of 20 bytes and a constant byte array size of 10 bytes, the following scenario is possible:

1. The first block is started. The file indexer has to find a cut position within the second (and last) byte array. It cuts after the 18th byte (exemplary value) and writes 2 bytes to the next block.

Size of first block: 18 bytes
Space left in next block: 18 bytes

2. The second block has 18 bytes left. Again the second byte array will be considered. But this time it is limited to 18 bytes, as the block size of 20 bytes is the limit (2+10+10 > 20). The file indexer finds a cut position after 15 bytes (exemplary value) and thus the last 5 bytes are written to the third block.

Size of second block: 17 bytes
Space left in next block: 15 bytes

3. The third block now can fill up to 15 bytes. Again only the second byte array can be considered. This time though it can only find a cut position within the first 5 bytes (5+10+5=20).

A continuation of this process might lead to a situation in which the file indexer cannot find a cut position because the byte array is simply too small. It would then fill up the block (the standard behaviour) and create a break within the block structure.

This is where the getCutThreshold() function is used by the output stream. Its return value, the cut threshold, decides at what distance to the block size the client should start checking for cuts.

In the previous example there was a situation where only 15 bytes were left in the next block (right after the second step). Assuming a cut threshold of 6 bytes, the following would happen:

The third block can fill up to 15 bytes. The client knows that the last byte array's size (actually only the region of it that falls inside the block, i.e. the first 5 bytes) would be below the threshold. It therefore already starts checking for cuts in the very next byte array. If it finds a cut, it finishes the block and starts a new one.

The obvious trade-off of this solution is that the client, in this exemplary situation, would not check the last byte array for cuts. It potentially ignores a cut position in the last block's first 5 bytes. The potential waste of free space depends on the byte array size used and on the block size.

A possible solution is to buffer as many bytes as given by the threshold value.

This solution is mentioned again in the Future Work chapter.

Interestingly, Hadoop is hardcoded in many places and not very flexible. Every time a block is closed, DFSOutputStream invokes a function which updates Hadoop's quota statistics if this was the last block. Quotas are user limits for the namespace and thus are always updated after writing a file. Unfortunately, Hadoop checks the block size to determine whether it was the last block or not:

long diff = getPreferredBlockSize() - block.getNumBytes();
if (diff > 0) {
    String path = leaseManager.findPath(fileINode);
    dir.updateSpaceConsumed(...);
}

Listing 4.2: Quota updates

Obviously this check would most likely no longer be valid when using the mentioned file cutters. To ensure correct functionality, the respective function has been extended by a Boolean parameter which informs it whether this really was the last block. Updating the quota itself is actually not the source of an error. The problem is that the file, as long as the stream is writing to it, would still be locked when this update needs to access it (by asking for its path, see the code excerpt above).

4.2.4 Building the index

Building the index is done in parallel to the process of writing a block to the datanode's disk. In the case of HOP, every index is a table in the MySQL Cluster, which usually has the following structure:

inode_id × block_id → key_index_1 × ... × key_index_n

When a datanode prepares to receive a block's packets it now also starts a thread of the new class DataNodeBlockIndexer (inheriting from the Thread class, see Listing 4.3).


// only for the last datanode in the pipeline
if (downstreams.length == 0) {
    InputStream blockStream = datanode.getFSDataset()
                                      .getBlockInputStream(block);
    DataNodeBlockIndexer indexer;
    indexer = new DataNodeBlockIndexer(datanode, blockStream, block);
    indexer.start();
}

Listing 4.3: Starting the index manager

This class sends a request to the namenode, asking for the proper file indexer. As this information is only stored per iNode ID, the function actually first has to ask the namenode for the iNode ID fitting the block ID. In the end the iNode ID never reaches the datanode, as it would only use this information to send it directly back to the namenode. The namenode stores the mapping of iNode IDs to index tables within the table inode_metas, which in the future may store more information than just the index table's name.

After receiving the index information, it immediately calls the file indexer's function indexBlock(InputStream blockStream). This function accesses the underlying block file that the datanode is writing to. While the datanode is writing to this file, the indexer thread immediately reads the available data to prepare the index values. The indexBlock function is supposed to return a set (here: simply an array) of String arrays. Each String array holds values that are going to be stored in the index table. In a later process those values are combined with the iNode and block ID to send a complete "index report" to the namenode. Listing 4.4 is a shortened version of the indexBlock function used for SAM files.


public String[][] indexBlock(InputStream blockStream) {
    // this is what is going to be returned,
    // later it will be transformed into an array
    ArrayList<String[]> valueList = new ArrayList<String[]>();

    String curSeq = null;
    String minPos = null;
    String maxPos = null;

    try {
        BufferedReader br = new BufferedReader(new InputStreamReader(blockStream));
        String line;
        while ((line = br.readLine()) != null) {
            if (line.startsWith("@")) {
                // header lines: nothing to do for now
                continue;
            }
            // data lines
            String[] lineParts = line.split("\t");
            String seq = lineParts[2];   // RNAME, the reference sequence's name
            String pos = lineParts[3];   // POS, the position within the reference

            if (curSeq == null) {
                // first data line of the block
                curSeq = seq;
                minPos = pos;
            } else if (!curSeq.equals(seq)) {
                // a new reference sequence starts: store the finished range
                valueList.add(new String[] { curSeq, minPos, maxPos });
                curSeq = seq;
                minPos = pos;
            }
            maxPos = pos;
        }
    } catch (IOException e) {
        throw new RuntimeException("Failed to index block", e);
    }

    if (curSeq != null) {
        valueList.add(new String[] { curSeq, minPos, maxPos });
    }
    return valueList.toArray(new String[valueList.size()][]);
}

Listing 4.4: The SAM indexer

The indexer iterates through all lines of the block's content. Whenever a data line appears, it either updates the position values for the current reference sequence or starts collecting position values for the next reference sequence.


By now the prototype comes with indexers for the following file formats:

1. CSV (Comma-Separated Values) files

2. FastQ files (a special case of CSV files, with the line break also acting as field delimiter; see Appendix C for a code excerpt)

3. SAM files

In the case of the FastQ format, many test files were at least ordered by the flowcell lane and the tile number. Accordingly, the index keys are the sequence ID and the minimum and maximum of the flowcell lane and the tile number respectively:

inode_id × block_id → cell_lane_start × cell_lane_end × tile_start × tile_end

In the case of the SAM file format, files are usually ordered first by their reference sequence's name, with the position as a secondary order. Similar keys have been chosen for the index. This index can easily contain more than one entry per block, i.e. one entry per reference sequence:

inode_id × block_id × reference_sequence → position_start × position_end

In order to first receive index information and then send index reports, new functions had to be implemented in the datanode and namenode respectively.

The same had to be done for the Protocol Buffer classes that both datanode and namenode use for their communication with each other. The following is a shortened call hierarchy for sending the index report:

BlockReceiver.java
--> DatanodeBlockIndexer.run()
--> FileIndexer.indexBlock(InputStream blockStream)
--> DataNode.indexReport(IndexReport report)

At this point the report has to be sent via Protocol Buffers, the format used for communication between clients, datanodes and namenodes. This protocol is the reason why the indexBlock() function returns arrays of strings, not objects. The actual types of the fields are retrieved in a later process, when the namenode simply asks the database which types are expected.


--> DatanodeProtocolClientSideTranslatorPB.indexReport(...)
      <Protocol Buffer>
--> DatanodeProtocolServerSideTranslatorPB.indexReport(...)
--> NameNodeRpcServer.indexReport(...)
--> BlockManager.indexReport(...)
--> IndexingClusterJ.indexReport(...)
      <Query to MySQL Cluster>

All of the queries mentioned so far (receiving index information, sending index reports) can be implemented with the limited but efficient ClusterJ API, as all of them can make use of primary key indexes.

4.2.5 Using the index

The index can be used by adding an index query to functions that are normally responsible for requesting all block locations of a file, e.g. FileSystem.getFileStatus(..). If this new parameter is present, these functions only return the blocks matching the query. In the case of the MapReduce project, the FileInputFormat class is responsible for calling this function. To make it pass the query as a parameter, one simply has to add a query to the job configuration. To enable such functionality a new kind of parameter was added to the job's configuration: serialisable values. This way any serialisable object can be stored as a job parameter. Hence, PQuery simply had to implement the interface Serializable.

The query can be initialized by calling the createQuery() function of a file indexer instance.

Job job = new Job(..);
...
SAMFileIndexer indexer = new SAMFileIndexer(..);
PQuery query = indexer.createQuery();

String posStart = SAMFileIndexer.POS_START;
Predicate predicate = query.getObj(posStart).greaterThan(1000);
query.where(predicate);

job.setQuery(query);

Listing 4.5: Using an indexer as a job parameter

The class used for the new queries is for now called PQuery to avoid any confusion with the ClusterJ API (and its own Query class), although it was mostly inspired by the ClusterJ API.


Query Structure

A PQuery instance can have exactly one predicate. A predicate (implemented through the Predicate class) can either be a simple binary comparison (implemented by PredicateBinOp) or an operation between two predicates. Hence, it has the same structure as a binary tree (with predicates as nodes and PredicateBinOps as leaves). A PredicateBinOp holds the name of a specific index key and a value it is supposed to be compared with. The comparison operation can be chosen from an enum type within the class.

SAM files were only indexed by their (reference) sequence's name and their position within the genome. Figure 4.4 shows the object structure that would represent the following statement:

reference_sequence = EXAMP_SEQ AND (position_start > 500 OR position_end < 1000)

(The statement looks for alignments to a specific reference genome that overlap the position range 500 to 1000.)


Figure 4.4: A structural overview of a query


In Java, the configuration of a query to achieve this structure would look like this:

Predicate ref_seq, pos_min, pos_max;

String refField      = SAMFileIndexer.REF_SEQ;
String posStartField = SAMFileIndexer.POS_START;
String posEndField   = SAMFileIndexer.POS_END;

ref_seq = query.getObj(refField).equal("EXAMP_SEQ");
pos_min = query.getObj(posStartField).greaterThan(500);
pos_max = query.getObj(posEndField).lessThan(1000);

query.where(ref_seq.and(pos_min.or(pos_max)));

Listing 4.6: Creating a query

As can be seen in this code example, the index keys can be accessed through the indexer's constants. These constants simply store the index keys' names as String values. The getObj(String key) function returns a PredicateBinOp object. This object offers methods to compare the respective index key to a value of the user's choice:

public class PredicateBinOp<T> {
    String name;
    Ops op;
    T value;

    public enum Ops {
        EQUAL, GREATER_THAN, GREATER_EQUAL_THAN,
        SMALLER_THAN, SMALLER_EQUAL_THAN;
    }

    private Predicate setComparison(Ops op, T value) {
        this.op = op;
        this.value = value;
        return new Predicate(this, true);
    }

    public Predicate equal(T value) {
        return this.setComparison(Ops.EQUAL, value);
    }

    public Predicate greaterThan(T value) {
        return this.setComparison(Ops.GREATER_THAN, value);
    }
    ...

Listing 4.7: The PredicateBinOp class


By calling one of its comparison functions, e.g. equal(T value), it returns a predicate that can be used for the query. Predicates have a similar structure to PredicateBinOps, but can only be combined with other predicates. Hence, the possible operators change to a set of Boolean operators (currently only AND and OR).

At the moment the query simply has a function that returns a String which can be used within SQL statements, i.e. in the WHERE clause. Any other backend, though, can also iterate through the query's structure to implement its own representation. In the case of MySQL Cluster the index table is first reduced to all blocks matching the index query and then joined with the normal block_info table, which holds all necessary information about the blocks (see Figure 4.5).

Figure 4.5: An index query for a SAM file

To include a minimum level of failure tolerance the join is actually a right outer join. This means that blocks that do not appear in the index table (most likely because of a badly formatted file or simply a badly written indexer) are still selected for a job.
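As a sketch of how a backend could translate a single PredicateBinOp into such a WHERE fragment (the operator mapping and value quoting are our assumptions, not the prototype's actual implementation):

// Hypothetical helper: maps one binary comparison to an SQL fragment.
static String toSql(String column, PredicateBinOp.Ops op, Object value) {
    String symbol;
    switch (op) {
        case EQUAL:              symbol = "=";  break;
        case GREATER_THAN:       symbol = ">";  break;
        case GREATER_EQUAL_THAN: symbol = ">="; break;
        case SMALLER_THAN:       symbol = "<";  break;
        default:                 symbol = "<="; break; // SMALLER_EQUAL_THAN
    }
    // Numbers are inlined, everything else is quoted as a string literal.
    String literal = (value instanceof Number) ? value.toString() : "'" + value + "'";
    return column + " " + symbol + " " + literal;
}

A full query string is then obtained by walking the predicate tree and joining such fragments with AND and OR.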


5 Evaluation

In the following sections the prototype is evaluated by analysing its performance. For the experiments, seven datanodes and one namenode were set up, each running on a separate machine. If necessary, the resource manager was running on the same machine as the namenode (most experiments only needed HDFS). All machines had two AMD Opteron 2435 processors with 6 cores each (2.6 GHz) and 32 GB of RAM. Hadoop was running on a 64-bit version of Ubuntu 11.04 using the Java 6 SDK.

5.1 Improvements through index queries

The evaluation of improvements through the usage of queries is actually trivial. All queries end up in a single database query (optimized by the usage of primary keys), which simply replaces the "normal" query. This query filters out a number of blocks and thus has a certain selectivity (returning 4 out of 10 blocks would be a selectivity of 40%). Exactly this selectivity is the percentage of resources needed to work on those blocks. In a sequential environment this directly maps to the amount of time needed, whereas in an entirely parallel environment it may be the number of machines working in parallel. Thus an appropriate definition of "resource" would end up in a statistic showing that a selectivity of X% maps to a use of X% of the resources that would have been used without index structures. Even considering that all blocks have different sizes, one could simply define the selectivity through the summed-up block sizes and would come to a similar conclusion.

5.2 Cutting Performance

In the process of cutting a file into several blocks, additional functions are now called to analyse a certain number of bytes. This process is currently limited to analysing one byte array after another until a cut position has been found. By adjusting the cut threshold value the user can make sure to find a cut position that is close to the block size. Nevertheless, this functionality adds a certain amount of computation time to the process. To change the standard size of the used byte arrays, the option io.file.buffer.size was adjusted; its standard value is 4096 bytes. All experiments were done by copying a single 1.8 GB file into HDFS. For byte arrays smaller than a megabyte it was not possible to measure a difference in computation time. Increasing the size to 32 MB finally led to an additional computation time of 40 ms. A value of 32 MB is 8192 times as large as the standard value and already close to the standard block size of 64 MB. As long as the byte array size is not larger than a megabyte, which would still be 256 times the standard, the additional time can be ignored. Even 40 ms can simply be ignored compared to the time it takes to send the data to a datanode and wait for it to be written to the datanode's disk. These results are explained by the way the cutter works: it iterates through a byte array that is already in main memory. Depending on the file type, different cutters can have more complex ways of finding a cut and thus require more time.
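For reference, the client buffer size used in these measurements is the standard Hadoop setting io.file.buffer.size; it can also be raised programmatically, as sketched below (the chosen value only mirrors the experiment).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BufferSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default is 4096 bytes; values up to about one megabyte showed no measurable overhead.
        conf.setInt("io.file.buffer.size", 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("buffer size: " + conf.getInt("io.file.buffer.size", 4096));
    }
}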

5.3 Block waste

This section analyses the "block waste" of different indexers. Block waste is defined here as the difference between the preferred block size and the actual block size. Ignoring the last block of a file, this would normally always be a difference of 0 MB. In this prototype, though, the file cutter leads to a situation in which all blocks have different sizes that are most likely smaller than the preferred block size. Figure 5.1 shows the spread of block waste for the different tested files. The reason for this analysis is to find an appropriate threshold (see Chapter 4.2.3) for the respective file types. The following is an overview of the tested files:

File Format   File Size   Block Size   Number of Blocks
FastQ         4 GB        64 MB        65
SAM           1.8 GB      64 MB        28
SAM           58.8 GB     256 MB       236

Table 5.1: Files used for the block waste tests


Figure 5.1: Different ranges of block waste

In the case of SAM files the average block waste apparently depends on the entire file's size: larger SAM files also tend to have longer lines. The 58.8 GB file was taken from the 1000 Genomes Project [21] and thus gives more representative values. Nevertheless, all three test files always had a block waste of less than 500 bytes. Additionally, one can look at the summed-up block waste:

File Format   File Size   Block Waste in Bytes (sum)   (Block Waste / Block Size) in %
FastQ         4 GB        9773                         1.4 · 10⁻²
SAM           1.8 GB      2767                         0.4 · 10⁻²
SAM           58.8 GB     48435                        1.8 · 10⁻²

Table 5.2: The summed up block waste in relation to the block size

One could imagine the block waste to be responsible for an additional block from the very beginning of the file, reserved only for bytes coming from block waste. Assuming a constant block waste of 1 KB and a very conservative block size of 64 MB, this additional block would only be filled up after about 64,000 block cuts (64 MB / 1 KB). This would only occur at a file size of about 4 terabytes. At such a file size another block entry inside the database should be an acceptable amount of overhead. The average block waste in Figure 5.1, though, was not even close to 1 KB but rather between 150 and 200 bytes.

5.4 Indexing Performance

As the gains through indexing were rather trivial to quantify and the cutting overhead turned out to be irrelevant, the last potentially important performance factor is the actual time it takes to index a block. This process is started in parallel to the writing process on a datanode. The writing process, however, does not synchronise with the indexing thread, as the client would otherwise potentially have to wait longer for a response. Figure 5.2 shows the indexing times compared to the time it took to write the blocks to disk. For these tests a 1.8 GB SAM file was copied into HDFS. The results for the larger block sizes (128 and 256 MB) were both confirmed by also writing a 58.8 GB SAM file.

Figure 5.2: Durations of the writing and indexing process.


It is no surprise that both the time needed for writing a block and the time it takes to index it grow proportionally with the block size. As can be seen in this figure, the indexing time is about twice as large as the disk writing time.

Even though this process runs in parallel to the disk writing, it ultimately keeps the datanode busy even after the block itself has been written. This can potentially slow down other processes (e.g. MapReduce jobs) on this node.

This statistic, though, only measures the time it takes to iterate through the block in order to create the index entries. Actually building the index, i.e. sending the entries to the namenode and storing them, takes additional time. Fortunately, the measured time for this step was less than 15 ms for several entries per block. Compared to the time needed for iterating through the block, sending the entries takes a negligible amount of time.


6 Future Work

The prototype was implemented in a generic way to create many possibilities for extensions. A very simple extension is the implementation of a new input format by implementing the FileIndexer interface. Nevertheless, the functionality of the prototype itself can easily be extended too. The following sections give an overview of some potential extensions left for future work.

6.1 Byte Buffer for file cutting

In the current implementation the file cutting process always tries to use the last byte array to find a potential cut position (see Chapter 4.2.3). If this array does not provide enough data, i.e. if it is too small, the process simply uses an earlier array. As already mentioned, this process can miss the optimal cut position in favour of one that is still "good". A reconstruction of this process can change that. If the Java class simply buffered a certain number of bytes, it could at least guarantee a minimum number of bytes available for finding a cut position. It would only need to buffer the bytes that are closest to the block size (see Listing 6.1 to get an idea of this).

public void write(byte[] b) {
    int freeBlockBytes, cutThreshold, bytesOverThreshold;
    ...
    // how far does this array exceed the cut threshold?
    bytesOverThreshold = b.length - (freeBlockBytes - cutThreshold);
    // the byte array reaches into the threshold area
    if (bytesOverThreshold > 0) {
        int firstPart = b.length - bytesOverThreshold;
        // write the bytes below the threshold directly to the block
        write(Arrays.copyOfRange(b, 0, firstPart));
        // buffer the bytes closest to the block boundary for the cut decision
        writeToBuffer(Arrays.copyOfRange(b, firstPart, b.length));
    }
    ...
}

Listing 6.1: Usage of a byte buffer


6.2 Query Extensions

A PQuery instance is usually filled with a predicate by using its where function. Currently, however, only the PredicateBinOp class can create predicates.

This limits all queries to binary operations. Though this is theoretically enough to cover many potential statements, it is not the most convenient solution. Extending this class to n-ary relations would make it possible to offer functions like between(value, min, max).
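A sketch of what such an extension could look like is given below. None of these classes are the prototype's actual PQuery or PredicateBinOp types; they only illustrate that between(value, min, max) is naturally a predicate with two operands, which a purely binary predicate class cannot express in a single object.

// Sketch only: these classes are not the prototype's PQuery or PredicateBinOp
// types; they merely illustrate the binary vs. n-ary predicate distinction.
interface IndexPredicate {
    boolean matches(long keyValue);
}

final class BinaryPredicate implements IndexPredicate {
    enum Op { LT, LE, EQ, GE, GT }

    private final Op op;
    private final long operand;

    BinaryPredicate(Op op, long operand) {
        this.op = op;
        this.operand = operand;
    }

    @Override
    public boolean matches(long v) {
        switch (op) {
            case LT: return v <  operand;
            case LE: return v <= operand;
            case EQ: return v == operand;
            case GE: return v >= operand;
            default: return v >  operand;   // GT
        }
    }
}

// The n-ary case: internally it could still be rewritten as the conjunction of
// two binary comparisons (key >= min AND key <= max).
final class BetweenPredicate implements IndexPredicate {
    private final long min;
    private final long max;

    BetweenPredicate(long min, long max) {
        this.min = min;
        this.max = max;
    }

    @Override
    public boolean matches(long v) {
        return v >= min && v <= max;
    }
}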

Another drawback of the existing prototype is its restriction to comparing index keys only against constant values. A direct comparison of two index keys is not possible. Supporting this would require a new structure for the entire query construct.

More complex constructs, like joins, group-by operations etc., are not possible either. This restriction, as many others, mostly comes from the implementation being very close to the ClusterJ API, which itself has many restrictions.

6.3 File Indexer as terminal parameter

The file indexer, which is also used for the process of cutting a file, can be given as a parameter to the FileSystem.create() function. Hence, it is possible to make use of the prototype via the Java interface. End users, though, would most likely use either a Graphical User Interface (GUI) or the terminal for copying files into HDFS. In the case of the terminal, the copying functions would have to accept a new parameter, the file indexer. The terminal functions are not simply hardcoded calls to the Java interface of HDFS; they are implemented through complex relations between different superclasses and interfaces. This structure did not allow a quick change (adding a parameter) without restructuring many files.
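For reference, using the prototype from Java currently looks roughly like the sketch below. The exact signature of the create() overload that accepts the indexer, as well as the SamFileIndexer name, are assumptions made for illustration; only the idea of passing the indexer at file creation time comes from the prototype.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyWithIndexer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical overload of the prototype's create(): the indexer drives
        // both the record-aware block cutting and the per-block index entries.
        FSDataOutputStream out =
                fs.create(new Path("/genomics/sample.sam"), new SamFileIndexer());
        // ... stream the local file's bytes into 'out' ...
        out.close();
    }
}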

6.4 Index Management

This thesis has explained the potential usage of index structures. Besides building and using the index, these structures also have to be managed through an API in order to:

1. Create and delete index structures
2. Re-index files/blocks


3. Remove indexed data

Until now, index structures have to be created manually within the MySQL Cluster. Though this may not happen very often, it is a future goal to do this through an HDFS API rather than by connecting directly to the database.
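A possible shape for such a management API is sketched below. None of these methods exist in the prototype yet; the interface and method names are placeholders for the three management tasks listed above.

import java.io.IOException;

// Placeholder sketch of the envisioned management API; the names are
// illustrative only and do not exist in the prototype.
public interface IndexAdmin {

    // 1. Create and delete index structures (currently done by hand in MySQL Cluster).
    void createIndex(String indexName, String fileIndexerClass) throws IOException;
    void deleteIndex(String indexName) throws IOException;

    // 2. Re-index a whole file or a single block, e.g. after an append.
    void reindexFile(String hdfsPath) throws IOException;
    void reindexBlock(String hdfsPath, long blockId) throws IOException;

    // 3. Remove indexed data when the underlying file is removed.
    void removeIndexedData(String hdfsPath) throws IOException;
}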

Next to this general management, updates of files are not handled yet. This includes simple tasks like deleting or updating a file, which should lead to a deletion or update of the respective index. Deleting the index whenever the file is deleted is a task that can easily be added to the prototype. Whenever a block is updated, it most likely has to be indexed again. Currently Hadoop only allows appending bytes to a file rather than updating single blocks; still, an append would lead to a re-indexing of the last block (and an indexing process for all new ones).


7 Conclusions

This thesis integrated an extendable indexing structure into HDFS. It showed easy ways of extending the indexer with new input formats, as was done for FastQ and SAM files by writing classes of no more than 100 lines each.

The evaluation showed that the additional functionality does not critically slow down HDFS, while still leaving room for further optimizations. It also made clear that any kind of application can make good use of indexing structures, if available, as they simply reduce the amount of resources needed. If indexes are chosen wisely, this can lead to drastic improvements and overcome known weaknesses of applications, e.g. those of MapReduce.

Chapter 6 suggested ways for further improvements to make the system either more efficient or more functional. Though the general „backbone" for indexing structures, including a running prototype for selected data types, exists, this thesis makes clear that a lot of work can still be invested in increasing usability and efficiency.


