
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Hive, Spark, Presto for Interactive Queries on Big Data

NIKITA GUREEV

TRITA-EECS-EX-2018:468


Abstract

Traditional relational database systems cannot be used efficiently to analyze data of large volume and varied formats, i.e. big data.

Apache Hadoop is one of the first open-source tools that provides a distributed data storage system and resource manager. The space of big data processing has been growing fast over the past years and many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data, and some of the early tools have become widely adopted, with Apache Hive being one of them. However, with the recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark and Presto.

In this thesis these technologies are examined and benchmarked in order to determine their performance for the task of interactive business intelligence queries. The benchmark is representative of interactive business intelligence queries and uses a star-shaped schema. The performance of Hive on Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data at different volumes and levels of concurrency. A short analysis and conclusions are presented with reasoning about the choice of framework and data format for a system that would run interactive queries on big data.

Keywords: Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL, Presto, Big Data


Abstract

Traditional relational database systems cannot be used efficiently to analyze large data volumes and varied file formats, i.e. big data. Apache Hadoop is one of the first open-source tools that provides a distributed data storage and resource management system. The area of big data processing has grown quickly over the past years and many technologies have been introduced in the big data ecosystem to handle the problem of processing large data volumes, and some of the early tools have become widely adopted, Apache Hive being one of them. With new advances in the field there are now other tools better suited for interactive analysis of big data, such as Apache Spark and Presto.

In this thesis these technologies are analyzed with benchmarks in order to determine their performance for the task of interactive business intelligence queries. The benchmarks are representative of interactive business intelligence queries and use a star-shaped schema. The performance of Hive on Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data for different volumes and degrees of parallelism. A short analysis and summary are presented with reasoning about the choice of framework and data format for a system that executes interactive queries on big data.

Keywords: Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL, Presto, Big Data


Contents

1 Introduction
  1.1 Problem
  1.2 Purpose
  1.3 Goals
  1.4 Benefits, Ethics and Sustainability
  1.5 Methods
  1.6 Outline

2 Big Data
  2.1 Hadoop
  2.2 Hadoop Distributed File System
  2.3 YARN

3 SQL-on-Hadoop
  3.1 Hive
  3.2 Presto
  3.3 Spark
  3.4 File Formats

4 Experiments
  4.1 Data
  4.2 Experiment Setup
  4.3 Performance Tuning

5 Results
  5.1 Single User Execution
  5.2 File Format Comparison
  5.3 Concurrent Execution

6 Conclusions
  6.1 Single User Execution
  6.2 File Format Comparison
  6.3 Concurrent Execution
  6.4 Future Work


1 Introduction

The space of big data processing has been growing fast over the past years [1].

Companies are making analytics of big data a priority, meaning that interactive querying of the collected data becomes an important part of decision making. With growing data volume the process of analytics becomes less interactive, as it takes a lot of time to process the data before the business receives insights. Recent advances in big data processing allow interactive queries, as opposed to only long-running data processing jobs, to be performed on big data.

Interactive queries are low-latency, sometimes ad hoc queries that analysts can run over the data to gain valuable insights. The most important feature in this case is a fast response from the data processing tool, making the feedback loop shorter and data exploration more interactive for the analyst.

Many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data, and some of the early tools have become widely adopted [2], with Apache Hive1 being one of them. However, with recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark2 and Presto3. In this thesis Hive, Spark, and Presto are examined and benchmarked in order to determine their relative performance for the task of interactive queries.

There are several works taken into account during the writing of this thesis. Similar work was performed by atScale in 2016 [3], which claims to be the first work on this topic. The report is well executed, but the main issue is that, with the current pace of technology development, results from several years ago can become outdated and less relevant when deciding which data processing framework to use. Another work in a similar vein is SQL Engines for Big Data Analytics [4], but the main focus of that work is the domain of bioinformatics, which lessens its relevance for business intelligence.

The work was also done in 2015, making it even older than the atScale report.

Performance Comparison of Hive, Impala and Spark SQL [5] from 2015 was also considered, but has its drawbacks. Several other works served as references in choosing the method and setting up the benchmark, including Sparkbench [6], BigBench [7], and Making Sense of Performance in Data Analytics Frameworks [8].

1.1 Problem

How is the performance on interactive business intelligence queries impacted by using Hive, Spark or Presto with variable data volume, file format, and number of concurrent users?

1Apache Hive - https://hive.apache.org/

2Apache Spark - https://spark.apache.org/

3Presto - https://prestodb.io/


1.2 Purpose

The purpose of this thesis is to assess the possible performance impact of switching from Hive to Spark or Presto for interactive queries. Usage of the latest versions of the frameworks makes the work more relevant, as all three frameworks are undergoing rapid development. Considering the focus on interactive queries, several aspects of the experiments are changed from the previous works, including the choice of benchmark, experimental environment, and file format.

1.3 Goals

The main goal of this thesis is to produce an assessment of Hive, Spark, and Presto for interactive queries on big data of different volumes, data formats, and numbers of concurrent users. The results are used to motivate a suggested choice of framework for interactive queries when an existing system is reworked or a new system is planned.

1.4 Benefits, Ethics and Sustainability

The main benefit of this thesis is an independently conducted, fair comparison of several big data processing frameworks for interactive queries.

This will help with the choice of tools when implementing a system for analytical querying with constraints on responsiveness and speed, on hardware and data corresponding to the setup in this work.

As this thesis uses recent, state-of-the-art versions of the frameworks in question, it includes all of the improvements that were absent from previous similar works, while ensuring that no framework is operating under suboptimal conditions and no framework is given special treatment or tuning.

1.5 Methods

An empirical method is used, as analytical methods cannot be efficiently applied to the presented problem within the resource and time constraints [9]. The results are collected by generating data of different volumes, implementing an interactive query suite, tuning the performance of the frameworks, and running the query suite on the data. This follows the approach established by the most relevant previous works [3], [4], [5], with changes in line with the focus of this thesis.


1.6 Outline

In the Big Data section the big data ecosystem is described, with emphasis on Hadoop and YARN. In the SQL-on-Hadoop section the data processing frameworks are presented, first Hive, then Presto, then Spark. The ORC and Parquet file formats are also briefly described. In the Experiments section the benchmark and experimental setup are described. In the Results section all of the experimental results are outlined and briefly described. In the Conclusions section the results are summarized, conclusions are drawn, and future work is outlined.


2 Big Data

This thesis project is focused on comparing the performance of several big data frameworks in the domain of interactive business intelligence queries. Initially, work in the big data space focused on long-running jobs, but with the advance of big data processing tools it has become more common for companies to execute interactive queries over aggregated data. In this section the big data ecosystem is described, along with a common Hadoop setup.

2.1 Hadoop

Apache Hadoop is a data processing framework targeted at distributed processing of large volumes of data on one or more clusters of nodes running on commodity hardware. Hadoop is an open-source project under the Apache Foundation, with Java being the main implementation language [10].

The main components of Hadoop project are:

• Hadoop Common: The common utilities that support other Hadoop modules

• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

• Hadoop YARN: A framework for job scheduling and cluster resource management

• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets

Hadoop is an efficient solution to big data processing as it enables large-scale data processing workloads relatively cheaply by using commodity hardware clusters [11]. Hadoop provides fast and efficient data processing and fault-tolerance.

2.1.1 MapReduce

MapReduce [12] is a shared-nothing architecture for processing large datasets with a distributed algorithm on clusters of commodity hardware, meaning that every node is independent and self-sufficient. Hadoop uses MapReduce as the underlying programming paradigm. MapReduce expresses distributed computations on large volumes of data as a sequence of distributed operations on key-value pair datasets. The Hadoop MapReduce framework utilizes a cluster of nodes and executes MapReduce jobs defined by a user across the machines in the cluster.


Figure 1: A MapReduce computation

Figure 1 shows a MapReduce computation. Each MapReduce computation can be separated into three phases: map, combine, and reduce. In the map phase the input data is split by the framework into a large number of small fragments, each of which is assigned to a map task. The framework takes care of distributing the map tasks across the cluster of nodes on which it operates. Each of these map tasks starts to consume key-value pairs from the fragment that was assigned to it and produces intermediate key-value pairs after processing it [12].

The combine or shuffle-and-sort phase is performed next on the intermediate key-value pairs. The main objective of this phase is to prepare all of the tuples for the reduce phase by placing the tuples with the same key together and partitioning them by the number of machines that are used in the reduce phase.

Finally, in the reduce phase each reduce task consumes the fragment of intermediate tuples that is assigned to it. For each of the tuples a user-defined reduce function is invoked and produces an output key-value pair. The framework distributes the workload across the cluster of nodes.

One important aspect of MapReduce is that each map or reduce task is completely independent of all other concurrent map or reduce tasks, so tasks can safely run in parallel on different keys and data. Hadoop provides locality awareness, meaning that on a large cluster of machines it tries to match map operations to the machines that store the data that the map needs to run on.
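As a minimal sketch of the programming model only (plain Python rather than Hadoop's Java API; the fragment contents and function names are invented for illustration), a word count can be expressed as a map function emitting key-value pairs and a reduce function aggregating the values per key:

# A minimal sketch of the MapReduce programming model in plain Python
# (Hadoop would distribute these functions across the cluster).
from itertools import groupby
from operator import itemgetter


def map_phase(fragment):
    """Consume a text fragment and emit intermediate (word, 1) pairs."""
    for line in fragment:
        for word in line.split():
            yield (word, 1)


def reduce_phase(key, values):
    """Aggregate all values emitted for one key into a single output pair."""
    return (key, sum(values))


if __name__ == "__main__":
    fragment = ["big data big clusters", "big data processing"]

    # Map: produce intermediate key-value pairs.
    intermediate = list(map_phase(fragment))

    # Combine / shuffle and sort: group tuples with the same key together.
    intermediate.sort(key=itemgetter(0))
    grouped = groupby(intermediate, key=itemgetter(0))

    # Reduce: one call per key, as a reduce task would do on its partition.
    results = [reduce_phase(key, (v for _, v in pairs)) for key, pairs in grouped]
    print(results)  # [('big', 3), ('clusters', 1), ('data', 2), ('processing', 1)]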


Figure 2: Hadoop MapReduce architecture

Figure 2 shows the architecture of the Hadoop MapReduce framework. It is implemented using a master/worker architecture. There is a single master server called the Jobtracker and a number of worker servers called Tasktrackers, one per node in the cluster. Users interact with the Jobtracker by sending MapReduce jobs to it, which are subsequently put in the pending jobs queue. The jobs in the queue are executed first-in-first-out (FIFO). The Jobtracker assigns the individual map and reduce tasks to Tasktrackers, which handle the task execution and data motion across MapReduce phases. The data itself is commonly stored on a distributed file system, frequently the Hadoop Distributed File System [13].

2.2 Hadoop Distributed File System

The main purpose of the Hadoop Distributed File System (HDFS) [13] is to reliably store, across the nodes of a large cluster, very large files that do not fit on the hard drive of a single node. The initial inspiration for HDFS was the Google File System [14].

HDFS is highly fault-tolerant and is designed to run on commodity machines.

A commodity machine is an already-available computing component for parallel computing, used to get the greatest amount of useful computation at low cost [11]. In this environment the failure of hardware is expected and needs to be handled routinely. One HDFS instance can consist of thousands of server nodes with each node being responsible for parts of the data. HDFS makes failure detection and automatic recovery a core architectural goal.


Some of the projects in the Hadoop ecosystem rely on streaming access to the data sets. The initial goal of HDFS was to provide batch rather than interactive access, with emphasis on high throughput rather than low latency of data access. With this in mind HDFS was designed with some of the POSIX4 semantics traded off for an increase in data throughput and to enable streaming access to file system data.

One of the assumptions HDFS makes is that applications relying on it require a write-once-read-many access model for files. Once a file is created, written to and closed, it does not need to be changed. This assumption simplifies the resolution of data coherency issues and enables the high throughput for data.

Data replication is one of the main architectural goals of HDFS. It is designed to reliably store very large files across many nodes in a cluster. Each file is stored as a sequence of blocks and all of them are the same fixed size, except for the last block, which can be smaller than or equal to the configured block size. The blocks are replicated in order to provide fault tolerance, with the block size and replication factor being configurable by an application.

Figure 3: HDFS architecture. Image adapted from HDFS Design page in Apache wiki6

4POSIX - http://www.opengroup.org/austin/papers/posix_faq.html

6HDFS Design page in Apache wiki - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html


Figure 3 shows the architecture of HDFS. HDFS has a master/worker architecture and a single cluster consists of a single master node called the Namenode and multiple worker nodes called Datanodes, one per node in the cluster. The Namenode manages the file system namespace and regulates file access by clients. The Datanodes manage the storage that they are attached to and are responsible for serving read and write requests from clients. HDFS exposes the file system namespace and allows users to store data in files. A file is split into one or more blocks and the Datanodes are responsible for their storage. The Namenode executes the file system operations and is responsible for opening, closing, and renaming files and directories. The Datanodes execute block creation, deletion, and replication on requests from the Namenode.

HDFS supports a hierarchical file organization. Users can create and remove files, move a file from one directory to another, or rename a file. The Namenode maintains the file system namespace and records any change to the namespace or its properties. The number of replicas of a file that should be maintained by HDFS can be specified by a user and is called the replication factor of that file, which is stored by the Namenode.

In addition, the Namenode is responsible for the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the Datanodes in the cluster. A Datanode sends a list of all blocks that are stored on it in a Blockreport. If a Heartbeat is received then the Datanode is functioning and healthy. The absence of a Heartbeat points to a network partition, a subset of Datanodes losing connectivity to the Namenode. The Namenode then marks all of the Datanodes without recent Heartbeats as dead and stops forwarding any new requests to them. All of the data stored on dead Datanodes is unavailable to HDFS, leading to new replica placement in case the death of Datanodes lets the replication factor of some files fall below the specified value.

Replica placement is a critical factor in providing fault tolerance. HDFS aims to place replicas of data on different racks in order to prevent data loss in case a whole rack of servers experiences failure. There is a simple but non-optimal policy that places each replica on a unique rack, providing fault tolerance against entire-rack failure and the bandwidth of reading from separate racks, at an increased cost for writes, which need to be transferred across multiple racks. For the common case of a replication factor of three, HDFS places one replica on the local rack, another on a remote rack, and a final one on a different machine in the same remote rack. The rationale behind this is that node failure is much more common than rack failure, and having two copies on the same remote rack improves write performance without trading off too much reliability and read performance.

HDFS provides data integrity by using checksums. Corruption of the data stored in an HDFS block is possible, due to either disk storage, network faults, or bugs in software. When a client retrieves a file it verifies the content against a checksum stored in a hidden file in the same HDFS namespace as the file.


2.3 YARN

Yet Another Resource Negotiator (YARN) [15] is an Apache Hadoop component that is dedicated to resource management for the Hadoop ecosystem. Initially Hadoop was focused on running MapReduce jobs for web crawls, but it later became used more widely for very different applications. The initial design tightly coupled the programming model with the resource management infrastructure and centralized the handling of job control flow, leading to scalability issues in the scheduler. The aim of YARN is to remedy that. The main requirements for YARN are listed as follows.

1. Scalability - an inherent requirement from running on big data

2. Multi-tenancy - as resources are typically requested by multiple concurrent jobs

3. Serviceability - decoupling of upgrade dependencies in order to accommodate slight differences in Hadoop versions

4. Locality Awareness - a key requirement to minimize the overhead of sending data over the network

5. High Cluster Utilization - making the cluster economical and minimizing time spent unused

6. Reliability/Availability - continuous monitoring of jobs to provide fault-tolerance

7. Secure and Auditable Operation - a critical feature for a multi-tenant system

8. Support for Programming Model Diversity - required by the ever growing Hadoop ecosystem

9. Flexible Resource Model - separation of map and reduce tasks can bottleneck resources and requires careful management of resources

10. Backward Compatibility - to facilitate adoption


Figure 4: YARN System Architecture. Image adapted from YARN Design page in Apache wiki8

Figure 4 shows the architecture of YARN. The two main entities are the global ResourceManager (RM) and an ApplicationMaster (AM), one per application.

The ResourceManager monitors how the resources are used and the liveness of nodes, enforces the resource allocation invariants, and acts as an arbiter in resource contention among tenants. The responsibilities of the ApplicationMaster are coordinating the logical plan for a job by issuing resource requests to the ResourceManager, generating the physical plan from the received resources, and coordinating the plan execution.

The ResourceManager, acting as a central authority, can ensure the fairness and locality requirements across tenants. Based on the requests issued by the applications, the ResourceManager can dynamically allocate leases, called containers, to applications, to be executed on particular nodes. Each node runs a specific daemon called the NodeManager (NM) that helps enforce and monitor these assignments. NodeManagers are also responsible for tracking resource availability, reporting node failures, and managing the lifecycle of containers. From these snapshots of state from the NMs the RM can create a global view of the state.

Each job submitted to the RM goes through an admission control phase, during which the security credentials are validated and administrative checks are performed, due to the secure and auditable operation requirement. If everything is in order, the job state is set to accepted and it is passed to the Scheduler.

8YARN Design page in Apache wiki - https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html


All of the accepted applications are recorded in persistent storage in order to recover in case of ResourceManager failure. After that the Scheduler acquires the necessary resources and sets the job state to running: a container is allocated for the ApplicationMaster and it is spawned on one of the nodes in the cluster.

The ApplicationMaster manages all of the details of the lifecycle, such as resource consumption elasticity, management of flow execution, failure handling, and local optimizations. This provides the scalability and programming model flexibility, as the ApplicationMaster can be implemented in any programming language, as long as it satisfies the few assumptions YARN makes.

Commonly, the ApplicationMaster will require resources from multiple nodes in the cluster. In order to obtain the leases, the AM sends a resource request to the RM with the locality preferences and container properties. The ResourceManager makes an attempt to satisfy the requests from each application according to the specified availability and scheduling policies. Each time a resource request is granted, a lease is generated and pulled by the AM with its heartbeat. A token-based security mechanism guarantees that a request is authentic when it is passed from the AM to an NM. On container start the AM encodes a launch request that is specific to the application. Running containers can directly report status and liveness to the AM and receive framework-specific commands from it without any dependencies on YARN.


3 SQL-on-Hadoop

With the development of the big data ecosystem the tools for big data processing have become more sophisticated and easy to use. One of the major requirements for analytics tools is the ability to execute ad hoc queries on the data stored in the system. SQL, being one of the main languages used for data querying9, became the standard most frameworks try to integrate, creating a number of SQL-like languages. In this section several big data processing frameworks, their architecture, and their usage are presented.

3.1 Hive

Apache Hive [16] is a data warehouse system that supports read and write operations over datasets in distributed storage. Queries are written in an SQL-like syntax in a separate query language called HiveQL10. HiveServer2 (HS2) is a service enabling clients to run Hive queries11.

Figure 5: Hive System Architecture. Image adapted from Hive Design page in Apache wiki13

Figure 5 shows the major components of Hive and how it interacts with Hadoop.


9https://www.tiobe.com/tiobe-index/

10Hive Language Manual - https://cwiki.apache.org/confluence/display/Hive/LanguageManual

11https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview

13Hive Design page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/Design


There is a User Interface (UI) that is used for system interaction, initially only a command line interface (CLI). The Driver is the component that receives the queries, provides the session handles, and exposes an API modelled on the JDBC/ODBC interfaces14. The Compiler parses the query, does semantic analysis of the query blocks and expressions, and is responsible for creating an execution plan with respect to the table and partition metadata, which is stored in the Metastore.

The Metastore stores all of the structural information about the different tables and partitions in the warehouse, including the information about columns and their types, the serializers and deserializers used to read and write data to HDFS, and the location of the data. The Execution Engine is the component that actually executes the plan that was created by the Compiler. The plan is represented by a Directed Acyclic Graph (DAG) of stages. The component is responsible for dependency management between stages and for the execution of stages on the different components. Figure 5 also shows how a query typically flows through the system. A call from the UI to the Driver is issued (step 1). Next, the Driver creates a session handle for that query and sends the query to the Compiler, which generates an execution plan for the query (step 2). After that the Compiler gets the required metadata from the Metastore (steps 3, 4). Then the Compiler can generate the plan, while type-checking the query and pruning partitions based on the predicates in the query (step 5). The execution plan consists of stages, each being a map or reduce job, a metadata operation, or an HDFS operation.

The Execution Engine submits the stages to the responsible components (steps 6-6.3). In each map or reduce task the deserializer associated with the table is used to read the rows, and intermediate results are written to an HDFS file using a serializer.

These files are used for the map and reduce tasks that are executed afterwards, with a fetch-results call from the Driver retrieving the final query result (steps 7-9).

3.1.1 Hive Data Model

There are three main abstractions in the Hive data model: tables, partitions, and buckets. Hive tables can be viewed as similar to tables in relational databases.

Rows in Hive tables are separated into columns with types. Tables support filter, union, and join operations. All the data in a table is stored in an HDFS directory. External tables, which can be created on pre-existing files or directories in HDFS, are also supported by Hive [17]. Each table can have one or more partition keys that determine how the data is stored. The partitions allow the system to prune data based on query predicates, in order not to go through all of the stored data for every query. The data stored in a partition can further be divided into buckets based on the hash of a column, and each bucket is stored as a file in the partition directory. Buckets allow more efficient evaluation of queries that depend on a sample of the data.
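As a hypothetical illustration of partitioning and bucketing (table, column, and connection details are invented; a HiveServer2 instance on localhost:10000 and the PyHive client package are assumed), a table could be declared as follows:

# A hedged, hypothetical sketch of the partitioning and bucketing described
# above. Table and column names are invented; assumes a HiveServer2 instance
# on localhost:10000 and the PyHive client package.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="analyst")
cursor = conn.cursor()

# One HDFS directory per sale_date value (the partition key); within each
# partition, rows are hashed on item_id into 32 bucket files.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        item_id BIGINT,
        price   DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (item_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# A predicate on the partition column lets Hive prune whole directories.
cursor.execute("SELECT count(*) FROM sales WHERE sale_date = '2018-01-01'")
print(cursor.fetchall())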

In addition to supporting primitive types, Hive has support for arrays and maps.

Users can also create user-defined types composed of primitives and collections.

14https://docs.oracle.com/javase/8/docs/technotes/guides/jdbc/


The Serializer and Deserializer (SerDe) and object inspector interfaces provide the necessary hooks to serialize and deserialize data, and User-Defined Functions (UDFs) can have their own object inspectors.

3.1.2 Tez

The initial design of Hive relied on MapReduce for execution, but MapReduce was later replaced by Apache Tez [18]. The foundation for building Tez was laid when resource management was broken out of Hadoop into YARN [15], which enabled an architectural shift: MapReduce could be viewed as just one of the applications to be run on top of YARN. Tez made the next step by creating a flexible and extensible foundation with support for arbitrary data-flow oriented frameworks.

Apache Tez provides an API that allows clear modelling of logical and physical data flow graphs. This allows users to model computation as a DAG with finer decomposition compared to classic vertex-edge models. In addition, Application Programming Interfaces (APIs) to modify the DAG definition on the fly are available, enabling more sophisticated optimizations in query execution. Tez also supports an efficient and scalable implementation of YARN features - locality awareness, security, reuse of resources, and fault-tolerance.

Tez solves the problem of orchestrating and executing a distributed data application on Hadoop, providing cluster resource negotiation, fault tolerance, elasticity, security, and performance optimization. Tez is composed of several key APIs that define the data processing and orchestration framework that applications need to implement to provide an execution context. This allows Tez to be application agnostic. Computation in Tez is modelled as an acyclic graph, which is natural as data flows from sources to sinks with transformations happening at the vertices in between. A single vertex in the Tez DAG API is a representation of some data processing and is a single step in the transformation. A user provides a processor that defines the underlying logic to handle data. It is quite common for multiple vertices to be executed in parallel across many concurrent tasks. An edge of the DAG is a physical and logical representation of data movement from the vertex that produces data to the one that consumes it. Tez supports one-to-one, broadcast, and scatter-gather data movement between producers and consumers.

After the DAG API defines the structure of the data pipeline, the Runtime API is used to inject application logic. While a DAG vertex represents a data processing step, the actual transformation is applied by executing tasks on the machines in the cluster. Each of the tasks is defined as a composition of a set of inputs, a processor, and outputs (IPO). The processor is defined for each task by the vertex, the output classes of the incoming edges define the inputs, and the input classes of the outgoing edges define the outputs of the vertex. Some of the complexity is hidden by this representation, such as the underlying data transport, partitioning, and aggregation of shards, which are configured after the creation of the IPO objects. IPO configuration is done by binary payloads. The processor is presented with inputs and outputs and communicates with the framework by event exchange through a context object. Fault-tolerance is achieved by re-executing tasks to regenerate data on reception of an ErrorEvent.

Tez does not specify any format of data and is not a part of the data plane during the execution of the DAG. The data transfer is performed by the inputs and outputs with Tez serving as a router for producers and consumers. This allows Tez to have minimal overhead and makes Tez data format agnostic, allowing the inputs, processors, and outputs to choose their data formats themselves.

To enable dynamic reconfiguration of the DAG and to adapt the execution on the fly, Tez uses a VertexManager. Each vertex in the DAG is associated with a VertexManager responsible for reconfiguration of vertices during runtime. A number of different state machines are used to control the lifecycle of vertices and tasks that interact with VertexManager through states. The VertexManager is notified of the state transitions through a context object, facilitating the decisions on the DAG configuration change.

Apache Tez provides automatic partition cardinality estimation as a runtime optimization. For instance, this is used to solve a common problem in MapReduce - determining the required number of tasks for the reduce phase, which depends on the volume of data shuffled from the mappers. Tez can produce a statistical estimate of the total data size and estimate the total required number of reducers at runtime. In addition, by having fine-grained control over DAG vertices, Tez provides scheduling optimizations: locality-aware scheduling, minimizing out-of-order scheduling, reducing the number of scheduling deadlocks, and specific deadlock detection and preemption to take care of these situations.

Multi-tenancy is a common requirement in data processing and the discrete task-based processing model of Apache Tez provides very good support for it.

Resource allocation and deallocation with a task as the single unit enables high utilization of cluster resources, with computational power being transferred from applications that no longer require it to those that do. In Tez each task is executed in a container process, which guarantees this resource scalability and elasticity.

Hive 0.13 was the first version to take advantage of Tez, utilizing the increased efficiency of translating SQL queries into Tez DAGs instead of using MapReduce.


Figure 6: MapReduce and Tez for Hive. Images adapted from Tez page in Apache wiki16

Before switching to Tez, Hive used MapReduce for query execution. That led to potential inefficiencies, due to several factors. One instance is queries with more than one reduce sink that cannot be combined due to the absence of correlation in the partition keys. In MapReduce this leads to separate MapReduce jobs being executed for a single query, as shown in Figure 6. Each of the MapReduce jobs would read from and write to HDFS, in addition to shuffling data. In Tez, on the other hand, the query plan is pipelined and linked directly, referred to as a map-reduce-reduce pattern.

3.1.3 LLAP

In Hive 2.0 the Live Long And Process (LLAP) functionality was added17. In addition to the improvements introduced in [17] and [18], a new initiative was created to introduce asynchronous spindle-aware I/O, pre-fetching and caching of column chunks, and multithreaded JIT-friendly operator pipelines. LLAP provides a hybrid execution model that consists of a long-lived daemon, which replaces the direct communication with the HDFS Datanode, and a tightly integrated framework based on DAGs.

LLAP daemons include caching, pre-fetching, and some query processing functionality. Small and short queries are mostly processed directly by the daemon, while heavy queries remain the responsibility of standard YARN containers.

In addition, similar to how the Datanode is designed, LLAP daemons are accessible to other applications, which can be useful when a relational view of the data is preferable to a file-centric one. The API offers an InputFormat that can be used by other data processing frameworks.

16Tez page in Apache wiki - http://tez.apache.org/

17LLAP page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LLAP



Figure 7: LLAP. Image adapted from LLAP page in Apache wiki18

Figure 7 shows how a job is processed with LLAP and a Tez AM, which coordinates the whole execution. Initially, the input query is passed into LLAP.

After that, costly operations such as the shuffles are performed in separate containers during the reduce stage. LLAP can be accessed by multiple concurrent queries or applications at the same time.

To achieve the goals of providing JIT optimization and caching while reducing startup costs, the daemon is executed on worker nodes in the cluster and handles I/O operations, caching, and execution of query fragments. Any request to an LLAP node contains the location of the data and the associated metadata. While the concerns about data locality are left to YARN, the processing of local and remote data locations is handled by LLAP. Fault-tolerance overall is simplified, as any data node can be used to execute any query fragment, which is handled by the Tez AM. Similar to Tez, direct communication between Hive nodes is permitted.

The daemon always tries to use multiple threads for I/O and for reading from a compressed format. As soon as data becomes ready it is passed on to execution, so that the previous batch can be processed concurrently with the preparation of the next one. LLAP uses an RLE-encoded columnar format that is used for caching, minimizing copy operations between I/O, execution, and cache, and also in vectorized processing. The daemon caches the data itself as well as indexes and metadata for input files, sometimes even for data that is not currently in the cache. The eviction policy is pluggable, but the default is an LRFU policy. A column chunk is the unit of data in the cache, making a compromise between storage efficiency and low overhead in data processing. Dynamic runtime filtering is achieved using a bloom filter [19].

18LLAP page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LLAP



In order to preserve the scalability and flexibility of Hive, LLAP works within the existing process-based Hive execution. The daemons are optional and Hive can bypass them, even when they are deployed and running. Unlike MapReduce or Tez, LLAP is not an execution engine by itself. The existing Hive execution engines are used to schedule and monitor the overall execution.

LLAP results can be a partial result of a Hive query or can be passed to external Hive tasks. Resource management is still the responsibility of YARN, with the YARN container delegation model being used by LLAP. The data caching happens off-heap in order to overcome the limitations of JVM memory settings - the daemon can initially use only a small amount of CPU and memory, but additional resources can later be allocated based on the workload.

For partial execution LLAP can take fragments of a query, e.g. a projection or a filter, and work with them. For security and stability only Hive code and some UDFs are accepted by LLAP, and the code is localized and executed on the fly. Concurrent execution of several query fragments from different queries and even sessions is allowed. Users can gain direct access to LLAP nodes using the client API.

3.2 Presto

Presto [20] is a distributed, massively parallel query execution engine developed by Facebook in 201319. Presto can process data from multiple sources such as HDFS, Hive, and Cassandra, and there are many other connectors available, making the integration of new data sources easy.

Presto is designed to query data where it is stored, rather than moving it into a single storage system. Presto is able to combine multiple data sources in a single query, which makes data integration easier. However, Presto cannot act as a general-purpose relational database, as it is not designed to handle Online Transaction Processing (OLTP).

19https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920/


Figure 8: Presto System Architecture

Figure 8 shows the architecture of Presto. There are two types of nodes in a Presto cluster: Coordinators and Workers. The main responsibilities of the Presto coordinator are parsing the input queries, planning query execution, and managing the worker nodes. A Presto worker is responsible for executing tasks and processing data fetched from connectors and exchanged between workers.


Figure 9: Presto Task Execution

Presto does not use MapReduce, but instead uses message passing to execute queries, removing the overhead of transitions between map and reduce phases that is present in Hive, as shown in Figure 9. That leads to improvements in performance, due to all of the stages being pipelined, without any overhead from disk interaction. However, that reduces fault-tolerance, as lost data will need to be recomputed. Another limitation is that data chunks need to fit in memory.

Presto is first and foremost a distributed system running on a cluster of nodes.

The Presto query engine is optimized for interactive analysis of large volumes of data and supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto supports most of the standard data types commonly used in SQL.


When an input query is sent from the client node it is first received by the coordinator, which parses it and creates a query execution plan. Then the scheduler assigns worker nodes that are closest to the data to minimize the network overhead of data transfer. The workers perform data extraction through connectors when necessary, execute the created distributed query plan, and return results to the coordinator, which returns the final result of the query to the client.

Presto is not specific to Hadoop and can work with multiple data sources in the same query, making Presto more suitable for cloud environments that do not use HDFS as storage. There are a variety of connectors to data sources, including HDFS, Amazon S3, MySQL, Apache Hive, Apache Kafka, Apache Cassandra, PostgreSQL, Redis, and more20.
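As a hedged illustration of this catalog-based federation (the presto-python-client package and a coordinator on localhost:8080 are assumed; the catalog, schema, and table names are invented), a single query can join a Hive table with a MySQL table:

# Hypothetical sketch: one Presto query spanning two catalogs (Hive and MySQL).
# Assumes the presto-python-client package and a coordinator at localhost:8080;
# catalog/schema/table names are invented for the example.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cursor = conn.cursor()

# Tables are addressed as catalog.schema.table, so a single statement can
# combine data that lives in HDFS (hive catalog) with data in MySQL.
cursor.execute("""
    SELECT c.region, sum(o.total) AS revenue
    FROM hive.default.orders AS o
    JOIN mysql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
""")
for row in cursor.fetchall():
    print(row)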

3.3 Spark

Apache Spark is a unified engine for distributed data processing that was created in 2009 at the University of California, Berkeley [21]. Its first release was in 2010 and since then Apache Spark has become one of the most active open source projects in big data processing. It is part of the Apache Foundation and has had over 1000 contributors [22]. Spark offers several frameworks (figure 10), including Spark SQL for analytics, Spark Streaming for stream processing, MLlib for machine learning, and GraphX for graph-specific processing.

Figure 10: Spark ecosystem. Image adapted from Apache Spark page21

20Presto Connectors - https://prestodb.io/docs/current/connector.html

21Apache Spark - https://spark.apache.org/


3.3.1 Architecture

Each Spark program has a driver that runs the execution of the different concurrent operations on the nodes in the cluster. The main abstraction is the resilient distributed dataset (RDD), which represents a collection of objects partitioned over the Spark nodes in the cluster. Most commonly it is created by reading in a file from HDFS. Another important abstraction in Spark is the concept of shared variables. With the default configuration, during each concurrent execution of a function a copy of each variable is sent to each task. However, in some cases variables need to be shared across tasks, or with the driver. For these cases Spark supports two types of shared variables: broadcast variables, which are used to cache a value in memory on all nodes, and accumulators, which are mutated only by adding to them, for instance sums and counts.
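A brief, hedged PySpark sketch of the two kinds of shared variables (the lookup table and input values are invented):

# Minimal PySpark sketch of broadcast variables and accumulators.
from pyspark import SparkContext

sc = SparkContext(appName="shared-variables-example")

# Broadcast: ship a read-only lookup table to every executor once,
# instead of with every task.
country_names = sc.broadcast({"SE": "Sweden", "US": "United States"})

# Accumulator: workers only add to it; the driver reads the final value.
unknown_codes = sc.accumulator(0)

def resolve(code):
    if code in country_names.value:
        return country_names.value[code]
    unknown_codes.add(1)
    return "unknown"

codes = sc.parallelize(["SE", "US", "SE", "XX"])
print(codes.map(resolve).collect())   # ['Sweden', 'United States', 'Sweden', 'unknown']
print(unknown_codes.value)            # 1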

Figure 11: Spark architecture. Image adapted from Cluster Overview page in Apache wiki23

Figure 11 shows the architecture of Apache Spark. Spark utilizes a master/worker architecture with a single Driver node and multiple Worker nodes. Each Worker node contains an Executor that receives and executes the application code from the SparkContext running on the Driver. Each application has separate executor processes that are independent from other applications and remain for the whole lifecycle of the application. The executor processes can run tasks in multiple threads. This provides isolation of applications at the cost of them being unable to directly share data without using an external storage system.

Spark supports four different cluster managers: the Spark standalone cluster manager, Apache Mesos, Hadoop YARN, and Kubernetes.

23Cluster Overview page in Apache wiki - https://spark.apache.org/docs/latest/cluster-overview.html


Spark is agnostic to the underlying cluster manager. The only requirement is that Spark is able to spawn executor processes that can communicate with each other. Spark can run even on a cluster manager that also supports multiple other applications, as in the case of Mesos or YARN.

3.3.2 RDD

The key programming abstraction in Spark is the Resilient Distributed Dataset (RDD), a fault-tolerant collection of objects partitioned over the Spark nodes in the cluster that can be processed concurrently. Spark offers APIs in Java, Scala, Python, and R, through which users can pass functions to be executed on the Spark cluster. Commonly, RDDs are first read from an external source, such as HDFS, and then transformed using operations such as map, filter, or groupBy. RDDs are lazily evaluated, so an efficient execution plan can be created for the transformations specified by the user.

Some operations performed in Spark will trigger a shuffle event. The shuffle is the mechanism that Spark uses to redistribute data across partitions so that it is grouped differently. Commonly this includes copying the data across nodes and executors, which makes shuffling costly and complex.

In Spark the distribution of data across Spark nodes is typically not done in preparation for any specific operation. A single task operates on a single partition, so to organize all of the data for a single reduceByKey an all-to-all operation is required: Spark reads all of the values for each key across all of the partitions and brings the values together in order to compute the result for each key. The performance impact of the shuffle operation is significant. A shuffle involves expensive I/O operations on disk, CPU load for data serialization, and I/O over the network to transmit the data to other Spark nodes. The shuffle operation consists of map and reduce phases, not to be confused with the map and reduce Spark operations. Spark keeps the results of individual map tasks in memory until they do not fit, then spills them to disk, sorted by target partition and stored in a single file. Reduce tasks read the blocks relevant to them. In addition to simple read/write interaction with the disk, the shuffle operation generates a lot of intermediate files. Depending on the Spark version, these files can be preserved until the corresponding RDDs are no longer used and only then garbage collected. This can lead to long-running Spark jobs consuming a lot of space on disk.
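A hedged PySpark sketch of lazy transformations and of a wide transformation that triggers a shuffle (the HDFS path and key extraction are hypothetical):

# Hedged PySpark sketch of lazy transformations and a shuffle.
# The HDFS path is hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="shuffle-example")

lines = sc.textFile("hdfs:///data/events.log")      # nothing is read yet

# Narrow transformation: each output partition depends on one input partition.
pairs = lines.map(lambda line: (line.split(",")[0], 1))

# reduceByKey is a wide transformation: values for one key may live in many
# partitions, so Spark plans a shuffle (map side, then reduce side).
counts = pairs.reduceByKey(lambda a, b: a + b)

# Only this action triggers the whole lazily-built plan, including the shuffle.
print(counts.take(10))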

In order to provide fault-tolerance Spark uses a lineage-based approach. In contrast with the common strategies of replication or checkpointing, Spark tracks the graph of transformations leading to each partition of the RDD and reruns the necessary operations on the base data if an RDD partition is lost. For the shuffle operation the data is persisted locally on the sender node in case of receiver failure. This approach can significantly increase the efficiency of data-intensive workloads.


Spark also provides options for explicit data sharing across computations. RDDs are not persisted by default and are recomputed, but there is a setting to persist them in memory for rapid reuse. The data can spill to disk if it does not fit in the node's memory, and the option to persist data to disk is also available.

The data sharing can provide a large increase in speed, up to several orders of magnitude, for interactive queries and iterative algorithms that reuse data. The persist() and cache() methods of an RDD are used for that. Different storage levels can be specified for these functions, for instance allowing an RDD to be stored as serialized or deserialized Java objects in memory, with or without spilling to disk, or directly on disk, with additional options that allow partitions to be replicated on another Spark node in the cluster. The storage levels provide the developer with a trade-off in memory and CPU usage. There are several factors that should be taken into consideration when deciding which storage level to use. First, if the data fits in memory it is best to leave it stored there, as this is the most CPU-efficient option and allows operations on the RDD to execute the fastest. Second, if the data does not fit into memory, then storing it in memory but serialized should be the second choice. It is more efficient in terms of space, but introduces serialization and deserialization overhead, while still being reasonably fast.

Third, if both previous levels are impossible, the usage of disk should be a last resort. Only if the computation of a partition is so expensive that reading from disk would be faster should the data be persisted on disk; in many cases the recomputation of a partition is less costly. Finally, the replication option should be used only for faster fault recovery. By default, Spark already provides fault-tolerance for all partitions on all storage levels [23].
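A short, hedged sketch of explicit caching in PySpark (the dataset and path are invented; the commented lines show alternative storage levels):

# Hedged sketch of explicit caching in PySpark; the dataset is invented.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-example")

logs = sc.textFile("hdfs:///data/events.log").filter(lambda l: "ERROR" in l)

# cache() is shorthand for persist() with the default in-memory level.
logs.cache()

# Alternatively, pick a level explicitly: spill to disk when memory runs out,
# or store only on disk for very expensive but rarely reused data.
# logs.persist(StorageLevel.MEMORY_AND_DISK)
# logs.persist(StorageLevel.DISK_ONLY)

print(logs.count())   # first action: computes and caches the partitions
print(logs.count())   # second action: served from the cached partitions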

3.3.3 Dataframe, Dataset, and Spark SQL

Unlike the basic RDD API, the interface provided by Spark SQL contains significantly more information about how the data is structured and how the computation is performed. Dataframes are distributed collections of data added to Spark in order to execute SQL-like queries on top of the Spark engine. Datasets are an improvement over Dataframes that combines the benefits of RDDs, such as strong typing and powerful lambda functions, with the extra optimizations that Spark is able to do for Dataframes using the Catalyst optimizer. Datasets can be constructed from Java objects and manipulated using functional transformations.

The Dataset API is available only in the Java and Scala programming languages.
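A hedged sketch of the DataFrame side of this API, which is available in Python; the same aggregation can be written through DataFrame operations or as SQL on a temporary view (the data is invented):

# Hedged sketch of the DataFrame API and Spark SQL (data is invented).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

df = spark.createDataFrame(
    [("SE", 120.0), ("US", 80.0), ("SE", 40.0)],
    ["country", "amount"],
)

# Same logical query, expressed through the DataFrame API...
df.groupBy("country").sum("amount").show()

# ...or through SQL on a temporary view; both go through the Catalyst optimizer.
df.createOrReplaceTempView("orders")
spark.sql("SELECT country, sum(amount) FROM orders GROUP BY country").show()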


Figure 12: Spark SQL and Hive integration. Image adapted from Apache Spark SQL25

Spark SQL can read data from an existing Hive installation by communicating with the Metastore (figure 12). Spark does not include all of the dependencies that Hive needs, but if they are detected on the classpath, Spark automatically loads them. The same goes for all of the worker nodes with respect to Hive serializers and deserializers (SerDe). Spark also supports starting up a Thrift server26; in this work, however, direct loading of files from HDFS was used.
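A hedged sketch of the two access paths mentioned above, with hypothetical database, table, and HDFS path names:

# Hedged sketch of the two access paths; database, table, and path names
# are hypothetical.
from pyspark.sql import SparkSession

# (a) Read through the Hive Metastore: Hive support is enabled on the session.
spark = SparkSession.builder.appName("hive-example").enableHiveSupport().getOrCreate()
spark.sql("SELECT count(*) FROM warehouse_db.lineorder").show()

# (b) Load files directly from HDFS and register a temporary view, bypassing
#     the Metastore (similar in spirit to the direct-loading approach above).
lineorder = spark.read.orc("hdfs:///benchmark/sf10/lineorder")
lineorder.createOrReplaceTempView("lineorder")
spark.sql("SELECT count(*) FROM lineorder").show()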

3.4 File Formats

Storage of big data comes with additional challenges. The trade-off between volume and efficiency of access becomes more important, as every improvement in compression rate may mean major differences in the total requirement for storage space. The efficiency of access, on the other hand, is necessary for running analytical queries on the data in an interactive manner. This subsection presents two file formats that are commonly used for storing big data.

3.4.1 ORC File Format

The Optimized Row Columnar (ORC) file format is an efficient way to store data for Hive. It improves Hive performance by having a single output file for each task, reducing the load on the NameNode, introducing a lightweight index that is stored inside the file, and allowing concurrent reading of the same file by several readers. In addition it enables block-mode compression depending on the type of data, with run-length encoding for integer types and dictionary encoding for string types.

The metadata is stored by using Protocol Buffers27, allowing field addition and removal.

25Apache Spark SQL - https://spark.apache.org/sql/

26https://hortonworks.com/tutorial/spark-sql-thrift-server-example/

27https://github.com/google/protobuf/wiki


Figure 13: ORC File Format. Image adapted from ORC Language Manual page in Apache wiki29

Each ORC file consists of stripes, which represent grouped row data. Additional information on the content is stored in the file footer, such as the list of stripes in the file, the number of rows per stripe, and column-level aggregates for count, min, max, and sum. Compression parameters and the size after compression are stored in the postscript. The stripe footer contains the directory of stream locations.
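As a brief, hedged sketch of converting text data to ORC and reading it back, with Spark as one example writer (the paths and the pipe-delimited input layout are assumptions):

# Hedged sketch: converting text data to ORC with Spark and reading it back.
# The paths and the pipe-delimited input layout are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-example").getOrCreate()

text = spark.read.csv("hdfs:///benchmark/text/lineorder", sep="|", inferSchema=True)

# The ORC writer produces the stripe/footer layout described above;
# zlib is one of the supported compression codecs.
text.write.mode("overwrite").option("compression", "zlib").orc("hdfs:///benchmark/orc/lineorder")

# Reading back a single column lets the engine skip the other column streams.
spark.read.orc("hdfs:///benchmark/orc/lineorder").select("_c0").show(5)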

29ORC Language Manual page in Apache wiki - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC


3.4.2 Parquet File Format

Apache Parquet is another columnar file format popular in the Hadoop ecosystem. The main motivation for this format is to provide an efficient columnar data representation with support for complex nested data structures.

Figure 14: Parquet File Format. Image adapted from Parquet documentation page in Apache wiki31

<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

31Parquet documentation page in Apache wiki - https://parquet.apache.org/documentation/latest/

In the layout above and in figure 14, there are N columns in the table, split into M row groups. The file metadata contains the start locations of all the column metadata. Metadata is written after the data to allow for single-pass writing.

Parquet supports efficient compression and encoding schemes on a per-column level and is open for extension. Apache Parquet supports dictionary encoding, bit packing, and run-length encoding to provide compression of the data.

Snappy32 and ZLIB33 are supported codecs. Many data processing frameworks support this file format, including Apache Hive, Apache Drill, Apache Impala, Apache Crunch, Apache Pig, Cassandra, Apache Spark, and Presto34.

32https://github.com/google/snappy

33https://zlib.net/

34https://cwiki.apache.org/confluence/display/Hive/Parquet
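A hedged sketch of writing and reading Parquet from Spark, including a nested schema of the kind the format is designed for (the data, paths, and choice of Snappy are illustrative assumptions):

# Hedged sketch of Parquet's support for nested data in Spark
# (the schema and paths are invented for the example).
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

orders = spark.createDataFrame([
    Row(id=1, customer=Row(name="ACME", region="EUROPE"), items=["p1", "p2"]),
    Row(id=2, customer=Row(name="Initech", region="AMERICA"), items=["p3"]),
])

# Nested structs and arrays are preserved in the columnar layout.
orders.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("hdfs:///benchmark/parquet/orders")

spark.read.parquet("hdfs:///benchmark/parquet/orders").printSchema()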


4 Experiments

The main focus of the thesis is business intelligence queries on big data. With this in mind, several experiments are planned. First, the benchmark is run on data of increasing volume, to gain an understanding of how the frameworks' performance changes with a growing volume of data. Second, several data formats are examined in order to have a better understanding of how each framework deals with each format, and which framework works better with which format. Finally, as the planned system works in a concurrent setting, the performance of the frameworks with different numbers of concurrent users is examined.
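One possible way to drive such a concurrency experiment is sketched below; this is a hypothetical harness, not necessarily the one used in this thesis, and run_query together with the query list are placeholders for a real client connection (e.g. HiveServer2 or Presto per thread):

# Hypothetical sketch of issuing the query suite from several concurrent
# "users"; run_query and QUERIES are placeholders for an actual client.
import time
from concurrent.futures import ThreadPoolExecutor

QUERIES = ["SELECT 1", "SELECT 2"]  # stand-ins for the benchmark queries


def run_query(sql):
    # Placeholder: a real client call would go here.
    time.sleep(0.1)
    return sql


def run_user(user_id):
    """One simulated user runs the whole suite sequentially and is timed."""
    start = time.time()
    for sql in QUERIES:
        run_query(sql)
    return user_id, time.time() - start


concurrent_users = 4
with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
    for user, elapsed in pool.map(run_user, range(concurrent_users)):
        print(f"user {user}: {elapsed:.2f}s")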

4.1 Data

The benchmark used to assess the Big Data frameworks in this thesis is the Star Schema Benchmark [24]. The choice of the benchmark was motivated by the fact that it closely models the typical schemas used in OLAP workloads. The benchmark is based on TPC-H35, but makes a number of modifications to it. It uses a star schema that is much more frequently used for interactive querying.

This leads to a smaller number of joins and makes it more useful for testing interactive business intelligence queries.

35http://www.tpc.org/tpch/


Figure 15: Star Schema Benchmark data model

Figure 15 shows the schema used in the benchmark. The data for this schema was generated for different volumes, defined by scale factor (SF). The tables below show the data volume for scale factors 1, 10, 20, and 75 in different file formats.

          SF1       SF10      SF20      SF75
Customer  2.7 Mb    27.3 Mb   54.8 Mb   206.8 Mb
Lineorder 571.3 Mb  5.8 Gb    11.6 Gb   44.4 Gb
Dates     224.6 Kb  224.6 Kb  224.6 Kb  224.6 Kb
Parts     16.3 Mb   65.7 Mb   82.2 Mb   115.4 Mb
Supplier  162.8 Kb  1.6 Mb    3.2 Mb    12.2 Mb
Total     590.8 Mb  5.9 Gb    11.8 Gb   44.7 Gb

Table 1: Data size, by table and scale factor, Text format


          SF1       SF10      SF20      SF75
Customer  722.6 Kb  7.0 Mb    14.0 Mb   52.7 Mb
Lineorder 118.1 Mb  1.2 Gb    2.4 Gb    9.3 Gb
Dates     10.8 Kb   10.8 Kb   10.8 Kb   10.8 Kb
Parts     1.9 Mb    7.8 Mb    9.7 Mb    13.6 Mb
Supplier  48.7 Kb   473.1 Kb  946.1 Kb  3.5 Mb
Total     120.9 Mb  1.2 Gb    2.5 Gb    9.3 Gb

Table 2: Data size, by table and scale factor, ORC format

          SF1     SF10    SF20    SF75
Customer  1.2 Mb  12 Mb   24 Mb   89 Mb
Lineorder 172 Mb  1.7 Gb  3.4 Gb  14 Gb
Dates     40 Kb   40 Kb   40 Kb   40 Kb
Parts     2.5 Mb  9.3 Mb  11 Mb   16 Mb
Supplier  92 Kb   812 Kb  1.6 Mb  5.9 Mb
Total     176 Mb  1.7 Gb  3.5 Gb  14 Gb

Table 3: Data size, by table and scale factor, Parquet format

4.1.1 Queries

Query 1 The query is meant to quantify the amount of revenue increase that would have resulted from eliminating certain companywide discounts in a given percentage range for products shipped in a given year. In each set of queries the specific queries are denoted as Q1.1, Q1.2, and so on.

select sum(lo_extendedprice * lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
  and d_year = [YEAR]
  and lo_discount between [DISCOUNT] - 1 and [DISCOUNT] + 1
  and lo_quantity < [QUANTITY];

There are three queries generated from this template with specific values for YEAR, DISCOUNT, and QUANTITY. The filtering estimates below are given for scale factor 1. Q1.1 sets YEAR to 1993, DISCOUNT to 2, and QUANTITY to 25, which filters out approximately 11600 rows. For Q1.2 a year and month are specified instead of a year only, namely January 1994; DISCOUNT is set to be between 4 and 6, and QUANTITY between 26 and 35, filtering out approximately 4000 rows. For Q1.3 the filtering is even stricter: only the 6th week of 1994 is specified for the time filter, DISCOUNT is set to between 5 and 7, and QUANTITY is 26 to 35, same as before. That filters out approximately 500 rows.
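As an illustration, Q1.1 instantiated with these values reads as follows (a sketch based on the template above; with DISCOUNT set to 2, the template's [DISCOUNT] - 1 and [DISCOUNT] + 1 become 1 and 3):

-- Q1.1 instantiated: YEAR = 1993, DISCOUNT = 2, QUANTITY = 25.
select sum(lo_extendedprice * lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
  and d_year = 1993
  and lo_discount between 1 and 3
  and lo_quantity < 25;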


Query 2 For the second set of queries, restrictions are placed on two dimensions. The query compares revenue for selected product classes and for suppliers in a selected region, grouped by more restrictive product classes and all years of orders.

select sum(lo_revenue), d_year, p_brand1
from lineorder, date, part, supplier
where lo_orderdate = d_datekey
  and lo_partkey = p_partkey
  and lo_suppkey = s_suppkey
  and p_category = [CATEGORY]
  and s_region = [REGION]
group by d_year, p_brand1
order by d_year, p_brand1;

There are three queries generated from this template with specific values for CATEGORY and REGION. The estimated numbers of rows after filtering are again given for scale factor 1. Q2.1 uses MFGR#12 as CATEGORY, which accounts for approximately 4% of the orders, and sets REGION to AMERICA; that filters out around 48000 rows. Q2.2 switches from a single category to a range of brands, set to be between MFGR#2221 and MFGR#2228, while REGION is changed to ASIA; the added restrictions result in around 10000 rows being selected. Q2.3 limits the search to a single brand, MFGR#2339, and changes REGION to EUROPE, which leaves around 1200 rows in the result. Each of the selections is disjoint and also separate from Q1, meaning there is no overlap and no caching effect between these scenarios.
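For instance, Q2.1 instantiated with these values would read as follows (a sketch based on the template above; the literal quoting of the category and region values is assumed):

-- Q2.1 instantiated: CATEGORY = MFGR#12, REGION = AMERICA.
select sum(lo_revenue), d_year, p_brand1
from lineorder, date, part, supplier
where lo_orderdate = d_datekey
  and lo_partkey = p_partkey
  and lo_suppkey = s_suppkey
  and p_category = 'MFGR#12'
  and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;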

Query 3 In the third query suite, we want to place restrictions on three dimensions, including the remaining dimension, customer. The query is intended to provide revenue volume for lineorder transactions by customer nation and supplier nation and year within a given region, in a certain time period.

select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_region = [REGION]
  and s_region = [REGION]
  and d_year >= [YEAR]
  and d_year <= [YEAR]
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;

There are four queries generated from this template with specific values for YEAR and REGION. Q3.1 uses ASIA as REGION, leading to 20% of the rows being selected, and the years 1992 to 1997, bringing the total number of rows selected to around 205000. Q3.2 specifies a single nation instead of a region and sets it to UNITED STATES, selecting 4%; the years are not changed, and the number of rows returned is approximately 8000. Q3.3 sets tighter restrictions on the location, setting it to two cities in the UK, and adds grouping by them to the query; the year filtering is not changed, and this nets around 320 rows. Finally, in Q3.4 the timespan is set to a single month and the location is unchanged, returning only around 5 rows.
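As an example, Q3.1 with these values would read roughly as follows (a sketch based on the template above, with the region literal quoted):

-- Q3.1 instantiated: REGION = ASIA, years 1992 to 1997.
select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_region = 'ASIA'
  and s_region = 'ASIA'
  and d_year >= 1992
  and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;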

Query 4 The last query suite represents a ”What-If” sequence of the OLAP type. It starts with a group by on two dimensions and rather weak constraints on three dimensions, and measures the aggregate profit, defined as (lo_revenue - lo_supplycost).

select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profit
from date, customer, supplier, part, lineorder
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_partkey = p_partkey
  and lo_orderdate = d_datekey
  and c_region = [REGION]
  and s_region = [REGION]
  and p_mfgr = [CATEGORY]
group by d_year, c_nation
order by d_year, c_nation;

There are three queries generated from this template with specific values for REGION and CATEGORY. Q4.1 specifies AMERICA as the region and two possible categories, leading to around 96000 orders being returned. Q4.2 additionally restricts the time to 1997 or 1998, with all other filtering conditions remaining the same, which leads to around 27000 rows being selected. Q4.3 adds a restriction on nation and further restricts the category to a single one, returning only approximately 500 rows. The complete table of filtering factors for each query can be found in [24].
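For illustration, Q4.1 could be instantiated as below; this is a sketch, since the two manufacturer categories are not named in the text and are assumed here to be MFGR#1 and MFGR#2, following the original Star Schema Benchmark specification [24].

-- Q4.1 sketch: REGION = AMERICA; the two categories are assumed to be
-- MFGR#1 and MFGR#2, as in the Star Schema Benchmark specification [24].
select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profit
from date, customer, supplier, part, lineorder
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_partkey = p_partkey
  and lo_orderdate = d_datekey
  and c_region = 'AMERICA'
  and s_region = 'AMERICA'
  and p_mfgr in ('MFGR#1', 'MFGR#2')
group by d_year, c_nation
order by d_year, c_nation;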

4.2 Experiment Setup

The experiments were run on a cluster of 9 nodes, each having 32 Gb RAM, 12 cores, and 100 Gb SSD disks. The cluster was set up on AWS EMR 5.13.036, which runs Apache Hive 2.3.2, Apache Spark 2.3, and Presto 0.194. The ORC file format version used was 1.4.3, and the Parquet file format version was 1.8.2. The cluster runs Hadoop 2.8.337, with YARN acting as resource manager and HDFS as the distributed file system. As the main focus of the thesis is interactive queries, the containers are assumed to be long-lived. The time needed to spin up containers is not included in the benchmark results; only the actual query execution time is tracked. For each query a timestamp is recorded right before it is submitted, and a second timestamp is recorded right after the results are collected. For concurrent execution several benchmarks are started simultaneously, so resource waiting time impacts the runtime of the query suite.

36 https://aws.amazon.com/about-aws/whats-new/2018/04/support-for-spark-2-3-0-on-amazon-emr-release-5-13-0/

37 https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5130-release

4.3 Performance Tuning

For Hive the following additional settings were altered:

hive.llap.execution.mode=all
hive.prewarm.enabled=true
hive.vectorized.execution.enabled=true
hive.vectorized.execution.reduce.enabled=true
hive.vectorized.execution.reduce.groupby.enabled=true

This ensured that the containers were pre-warmed and that vectorization was enabled for LLAP. The LLAP execution mode all was chosen instead of only so that all queries return results even when not enough resources are available for LLAP to execute the query. In addition, force-local-scheduling was set to true to ensure that all possible calls to HDFS were attempted locally, taking advantage of data locality.
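Assuming these properties were applied per session rather than in hive-site.xml (the text does not say which), the corresponding HiveQL would simply be:

SET hive.llap.execution.mode=all;
SET hive.prewarm.enabled=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.vectorized.execution.reduce.groupby.enabled=true;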

ORC file configuration was:

"orc.compress"="SNAPPY",
"orc.compress.size"="262144",
"orc.create.index"="true",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="3000"

The Parquet file configuration used the same parameter values for compression and index creation.
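The exact Parquet table properties are not listed in the text; assuming the standard Hive/Parquet property names, a table definition mirroring the ORC settings could look roughly like the sketch below (the table is abbreviated to the supplier columns, and the block size value simply mirrors the ORC stripe size above).

-- Hypothetical Hive DDL for one of the Parquet tables; "parquet.compression"
-- and "parquet.block.size" are standard Hive/Parquet properties, and the
-- values shown are assumptions mirroring the ORC configuration.
create table supplier_parquet (
  s_suppkey int,
  s_name    string,
  s_address string,
  s_city    string,
  s_nation  string,
  s_region  string,
  s_phone   string
)
stored as parquet
tblproperties (
  "parquet.compression"="SNAPPY",
  "parquet.block.size"="268435456");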


5 Results

The benchmark results are presented in this section, which is divided into three subsections. The first subsection considers single-user execution, the second presents a file format comparison for each framework, and the third presents the results for concurrent executions, grouped by scale of data and framework, together with a separate comparison of Spark and Presto.

5.1 Single User Execution

In this section the results for single-user execution, grouped by file format and framework, are presented. First the text format results are shown, then ORC, then Parquet. The final subsection presents a direct comparison between Spark on Parquet and Presto on ORC.

5.1.1 Text format

Figure 16: Average execution times for text format, SF1

Figure 17: Average execution times for text format, SF10

Figure 18: Average execution times for text format, SF20

Figure 19: Average execution times for text format, SF75


Figures 16-19 show the charts for average execution times for text format data, scale factors 1, 10, 20, and 75.

Figure 16 shows the results for text format at scale factor 1. Presto was the fastest on all of the queries, with Hive LLAP second fastest; on Q1.2, Q1.3, Q2.3, Q3.3, and Q3.4 their execution times were comparable. Spark was the third fastest, being slower than Hive LLAP on most of the queries; it was faster only on Q3.1 and tied with Hive LLAP on Q1.1 and Q4.1. Hive Tez was the slowest on all of the queries, with by far the worst execution time on Q1.1.

Figure 17 shows the results for text format at scale factor 10. Again, Presto was the fastest, running roughly twice as fast as the closest competitor on most of the queries. One major difference is that Spark became faster than Hive LLAP on most of the queries, with Q1.2, Q1.3, and Q2.1 being the only exceptions. Hive LLAP is faster than Hive Tez on every query except Q2.2, where the execution times are roughly the same, so Tez is still the slowest. There is again a notable spike for Q1.1, where the difference between Hive LLAP and Hive Tez execution times is the largest.

Figure 18 shows the results for text format at scale factor 20. Presto maintains its status as the fastest, with only Hive LLAP almost matching its execution times for Q1.2 and Q1.3. Spark is the second fastest on some of the queries (Q2.1, Q4.1, Q4.2, Q4.3), but Hive LLAP matches its times for Q1.1 and Q2.2 - Q3.4, and is faster for Q1.2 and Q1.3. Another key difference is that the execution times for Hive Tez are much closer to Hive LLAP and Spark, with Q1.1 being the only exception, where Tez is much slower.

Figure 19 shows the results for text format at scale factor 75. Hive LLAP is the fastest framework for Q1.2, Q1.3, and Q4.3, narrowly beating Presto, which ties with it on Q3.4 and is faster on all other queries. Hive Tez is slower than or ties with Hive LLAP on most of the queries. The most striking difference is that Spark is the slowest framework on all of the queries, which was an unexpected result given the previous runs.


Figure 20: Execution time growth, text format, Hive Tez

Figure 21: Execution time growth, text format, Hive LLAP

Figure 22: Execution time growth, text format, Spark

Figure 23: Execution time growth, text format, Presto

Figures 20-23 show the execution time growth for each framework on text format data. The data size growth corresponds to the scale factors 1, 10, 20, and 75.

Figure 20 shows the execution time growth for Hive Tez. The growth is slowest for Q1.2, Q1.3, and Q4.3: execution time rises steeply from SF1 to SF10, but stays stable across SF10, SF20, and SF75. The growth is fastest for Q2.1, especially at SF75. For all of the other queries the growth rate corresponds to the change in scale factor.

Figure 21 shows the execution time growth for Hive LLAP. The growth pattern is similar to that of Hive Tez, but in addition to Q1.2, Q1.3, and Q4.3, the queries Q3.3 and Q3.4 show very slow execution time growth from SF10 to SF75. Again, Q2.1 shows the steepest increase in execution time, while the other queries grow in correspondence with the scale factor.

Figure 22 shows the execution time growth for Spark. In contrast to Hive, Spark shows very stable execution time growth for all of the queries; only Q2.2 shows a slightly sharper increase from SF20 to SF75. However, the absolute execution times for Spark at SF75 are larger than those of the other frameworks.

Figure 23 shows the execution time growth for Presto. Similar to Spark, Presto shows very stable growth in execution time, closely corresponding to the growth in scale. There is no noticeable overhead and the absolute values of execution times are the lowest among all of the frameworks.

5.1.2 ORC

Figure 24: Average execution times for ORC format, SF1

Figure 25: Average execution times for ORC format, SF10

Figure 26: Average execution times for ORC format, SF20

Figure 27: Average execution times for ORC format, SF75

Figures 24-27 show the charts for average execution times for ORC format data, scale factors 1, 10, 20, and 75.

Figure 24 shows the results for ORC at scale factor 1. Similar to the text results, Presto is the fastest by far. Spark is the second fastest on almost all of the queries, except for Q1.1 and Q2.1, where Hive LLAP was faster. Hive Tez was the slowest on all of the queries.
