
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018

Performance of TensorFlow

Examining what factors impact the performance of TensorFlow in distributed systems

HENRIK GLASS
AXEL SWARETZ

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Performance of TensorFlow:

Examining what factors impact the performance of TensorFlow in distributed systems

HENRIK GLASS
AXEL SWARETZ

Date: June 6, 2018

Supervisor: Stefano Markidis
Examiner: Örjan Ekeberg

Swedish title: TensorFlows prestanda: En undersökning om vilka faktorer som påverkar TensorFlows prestanda i distribuerade system

School of Electrical Engineering and Computer Science


Abstract

This study aims to examine what factors affect the scalability and performance of a distributed TensorFlow program. We created a TensorFlow cluster consisting of 3 Raspberry Pi 3 model Bs functioning as workers and one Dell XPS 13 (9350) acting as a parameter server. We benchmarked the performance of our cluster for different sizes and compositions of the TensorFlow graph and for different network configurations between the computers in the cluster. From the results we conclude that both of the previously mentioned factors impact the performance of our distributed TensorFlow program.


Sammanfattning

Denna studies mål är att undersöka vilka faktorer som påverkar TensorFlows prestanda och skalbarhet i distribuerade system. För att uppnå detta skapade vi ett TensorFlow-kluster som består av 3 stycken Raspberry Pi 3 model B som agerar som arbetare och en Dell XPS 13 (9350) som agerar som parameterserver. Vi mätte prestandan av det tidigare nämnda klustret för olika storlekar och uppsättningar av TensorFlow-grafen samt olika nätverkskonfigurationer mellan arbetarna och parameterservern. Från resultaten av dessa mätningar kunde vi dra slutsatsen att båda förutnämnda faktorer påverkar prestandan av det distribuerade TensorFlow-programmet.


Contents

1 Introduction
   1.1 Problem Statement
   1.2 Scope
   1.3 Thesis Overview

2 Background
   2.1 TensorFlow
      2.1.1 How does TensorFlow work?
      2.1.2 The TensorFlow cluster
      2.1.3 Data parallelism
   2.2 Related Work
      2.2.1 Single Board Computers for Deep Machine Learning
      2.2.2 Scaling a Convolutional Neural Network for classification of Adjective Noun Pairs with TensorFlow on GPU Clusters
      2.2.3 Creating a Raspberry Pi-Based Beowulf Cluster

3 Method
   3.1 The Computer Cluster
   3.2 The Software
   3.3 Environment
      3.3.1 Dell XPS 13 (9350)
      3.3.2 Raspberry Pi 3 model B
   3.4 Building TensorFlow to run on the Raspberry Pi 3 model B
      3.4.1 Installing basic dependencies
      3.4.2 Installing a Memory Drive as Swap for Compiling
      3.4.3 Building Bazel
      3.4.4 Building TensorFlow
   3.5 The Benchmarking
      3.5.1 Network performance benchmarks
      3.5.2 Graph size benchmarks

4 Results
   4.1 Ethernet performance
      4.1.1 Three workers
      4.1.2 One worker
   4.2 WiFi performance
      4.2.1 Three workers
      4.2.2 One worker
   4.3 A larger graph
      4.3.1 A deeper graph

5 Discussion
   5.1 The network bottleneck
   5.2 Graph size scalability
      5.2.1 Differences in accuracy
   5.3 Error sources
      5.3.1 Hardware
      5.3.2 WiFi performance inconsistency
   5.4 Applicability
   5.5 Further research

6 Conclusion

Bibliography

A Appended Links


Chapter 1 Introduction

In our modern world, the demand for automation is increasing while the tasks that are to be automated are becoming more and more complex. Machine learning is a valuable and popular tool for solving many of these complex tasks. There are many ways to implement software based on machine learning. This study, however, focuses on the TensorFlow machine learning framework and how it performs on distributed systems.

TensorFlow is a commonplace and valuable framework for machine learning using neural networks. It is a cutting-edge tool developed by Google and is used at the forefront of AI research, often being run on massive distributed systems. A notable example of TensorFlow in use is the training of AlphaZero, which after less than 24 hours of training against itself managed to defeat the state-of-the-art chess engine Stockfish 8 over a tournament of 100 games. However, training AIs such as AlphaZero is a resource intensive task. AlphaGo (Fan), a predecessor to AlphaZero, best known for being the first computer program to beat a human champion at Go, was trained on a distributed system consisting of a total of 176 GPUs[5]. Using TPUs (tensor processing units) is one solution to increasing training performance, but using specialized hardware is often not profitable. If a company or organization is interested in doing machine learning, using already existing and accessible computing power as an alternative to buying specialized hardware is often preferred. Therefore, knowing the limitations, bottlenecks and performance scaling of TensorFlow on distributed systems is of interest.


1.1 Problem Statement

This study examines the scalability and performance of TensorFlow on distributed systems, more specifically, on a distributed system with Raspberry Pi 3 model Bs used for computation. The main purpose of this study is to answer the question: What are some of the factors that impact the performance of TensorFlow in distributed systems?

1.2 Scope

The focus of this study is to examine the scalability and performance of TensorFlow specifically on distributed systems. One inherent constraint of the study is therefore the hardware available. There are many different possible configurations for a system like this, but due to the limitations of the hardware available to us, namely three Raspberry Pi 3 model Bs and one Dell XPS 13 (9350), the scope of this study is constrained. There are also many possible ways of implementing a distributed TensorFlow program. For example, distributed TensorFlow programs can be constructed to use model parallelism, whilst our study only tests data parallel programs; this will be discussed later. Furthermore, there are many factors one could assume would impact the performance of distributed TensorFlow; our study only explores a subset of these.

1.3 Thesis Overview

Chapter 2 introduces how TensorFlow works and relevant concepts in machine learning, as well as previous research related to the topic of this thesis. Chapter 3 describes the setup of the hardware into a distributed system and how testing and measuring of performance was done. Chapter 4 presents the results. In chapter 5 the results are analyzed and possible sources of error are discussed. In chapter 6 the conclusions drawn from the results are presented.


Chapter 2 Background

The aim of this chapter is to introduce and explain TensorFlow and how it and the distributed TensorFlow architecture work. It also presents some related work.

2.1 TensorFlow

TensorFlow is an open source software library for high performance numerical computation. The library functions as an interface for expressing machine learning algorithms and as an implementation of those algorithms. It is flexible in that a computation expressed in TensorFlow can be distributed to a wide range of devices, such as mobile devices and large-scale distributed systems, with little to no changes made to the source code. TensorFlow has been used in several fields relating to machine learning, such as speech recognition, computer vision, robotics, natural language processing, computational drug discovery and more.

2.1.1 How does TensorFlow work?

TensorFlow has two major components in its calculations, the tensor and the graph[13]. The tensor is the central data unit in TensorFlow and consists of a set of primitive values shaped into an array. The array can have any number of dimensions. An important thing to note is that tensors do not hold values themselves but instead act as handles to elements in the computation graph[13]. A graph has two components: the operations, which function as the vertices of the graph, and the tensors, which function as the edges of the graph[1]. The operations are special in that they do not manipulate or change the values of the tensors but instead consume the tensors that are fed into them and use them to produce new tensors that continue to the higher nodes in the graph[13]. Tensors are named after the operation that created them.

To be able to evaluate tensors it is necessary to construct a tf.Session, which encapsulates the state of the TensorFlow runtime and runs TensorFlow operations[13]. All these factors come together to create the graph that enables TensorFlow to make predictions.
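As a minimal illustration of these concepts, the sketch below (assuming the TensorFlow 1.x Python API used elsewhere in this thesis; the tensor values are arbitrary examples) builds a tiny graph and evaluates its output tensor in a tf.Session.

import tensorflow as tf

# Build a small computation graph: two constant tensors feeding a matmul operation.
# The Python variables below are handles to graph elements, not computed values.
a = tf.constant([[1.0, 2.0]])      # shape (1, 2)
b = tf.constant([[3.0], [4.0]])    # shape (2, 1)
product = tf.matmul(a, b)          # a tensor named after the op that produced it

# A tf.Session encapsulates the runtime state and actually runs the operations.
with tf.Session() as sess:
    print(sess.run(product))       # [[11.]]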

Training a neural network

Training a neural network involves passing data through a TensorFlow graph, letting the graph make predictions based on the data and then checking whether the predictions that the graph produces align with the expected output[8]. TensorFlow checks the prediction by examining the loss, using a loss function. In most cases the loss function calculates the mean square error between the produced output and the expected output. The next step is to minimize the loss. Minimizing the loss is achieved by manipulating the weight of each tensor and the bias of each node to adjust the prediction so that there is a smaller difference between the expected output and the prediction[13]. This process is referred to as backpropagation[8]. TensorFlow has built-in optimizers that can take care of the backpropagation with different aspects in mind, such as momentum, update frequency etc.[13]. Optimization is the process of applying a gradient descent algorithm to the TensorFlow graph. Gradient descent iteratively alters the model parameters, gradually trying to find the combination of weights and biases that minimizes the loss.
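As a hedged sketch of these steps (TF 1.x API; the single dense layer, the shapes and the learning rate are illustrative choices, not the exact ones used in the thesis program), a mean squared error loss can be minimized with one of the built-in optimizers:

import tensorflow as tf

# Placeholders for a batch of inputs and the expected outputs.
x = tf.placeholder(tf.float32, [None, 784])
y_expected = tf.placeholder(tf.float32, [None, 10])

# A single dense layer producing the prediction.
weights = tf.Variable(tf.zeros([784, 10]))
bias = tf.Variable(tf.zeros([10]))
y_predicted = tf.matmul(x, weights) + bias

# Loss: mean square error between produced and expected output.
loss = tf.reduce_mean(tf.square(y_predicted - y_expected))

# A built-in optimizer performs the gradient descent / backpropagation step.
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)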

Two terms that are important to understand when training a neural network are epoch and batch size. An epoch refers to a full pass over the training data set[6]. Batch size, on the other hand, refers to how many training examples are used in one iteration[7].

System architecture

The system architecture of TensorFlow is characterized by its three main components: the client, the distributed master and the worker services[14]. Together these components enable TensorFlow to create a computational graph and make predictions based on that graph. The client is written by the user as the main TensorFlow program and is responsible for building the computational graph. This program either directly composes operations or uses libraries to compose a neural network. The client can be written in multiple programming languages, though the main languages that TensorFlow supports are Python and C++[14]. A majority of the training libraries only support Python as of writing. The client initiates a session, meaning that the client applies initial weights and biases and then builds the graph[14]. The creation of the session in turn sends the graph definition to the distributed master.

When an operation in the client is evaluated, the evaluation causes the distributed master to initiate computation. The distributed master is responsible for pruning the graph to obtain the subgraph required to evaluate the nodes requested by the client[14]. The distributed master also applies standard optimizations and coordinates execution of the optimized subgraphs. The worker service is the component that handles requests from the master, schedules execution of the operations and handles the direct communication between tasks[14].

2.1.2 The TensorFlow cluster

TensorFlow enables distributed computing via TensorFlow clusters. A TensorFlow cluster is a set of tasks, each associated with a TensorFlow server. The TensorFlow server encapsulates the set of devices specified in the cluster specification as well as a training target. These servers can be placed on different machines and can communicate with each other. A TensorFlow cluster can be divided into one or more jobs, each containing one or more tasks[12].

In TensorFlow clusters there are two distinct kinds of jobs, parameter servers (PS) and workers. Generally, parameter servers are responsible for holding all the weights and biases in the TensorFlow graph and updating them in response to the workers. The workers are responsible for all the compute intensive parts, i.e. the actual calculations of the gradients that are communicated back to the parameter server[12].
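A hedged sketch of how such a cluster can be described and started with the distributed TensorFlow API (TF 1.x; the host names and ports are hypothetical, but the job layout mirrors the one parameter server and three workers used in this thesis):

import tensorflow as tf

# One "ps" job holding the parameters and one "worker" job doing the computation.
cluster = tf.train.ClusterSpec({
    "ps": ["laptop:2222"],
    "worker": ["rpi1:2222", "rpi2:2222", "rpi3:2222"],
})

# Each machine starts the server for its own task. A parameter server task
# typically just joins and serves variable reads and updates from the workers.
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()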


2.1.3 Data parallelism

Letting multiple tasks in a worker job train the same model on different small batches of data is called data parallelism. There are a couple of different training configurations which achieve this.

Between-graph vs In-graph replication

Between-graph replication and in-graph replication are two different approaches to replicating a TensorFlow graph in a cluster. In-graph replication is the simplest approach: TensorFlow creates only a single graph that contains one set of parameters. In between-graph replication each worker builds a similar graph[12].
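The usual between-graph pattern pins the variables to the ps job while each worker process builds its own copy of the rest of the graph; a minimal sketch under the same assumptions as above (hypothetical addresses, TF 1.x API):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["laptop:2222"],
    "worker": ["rpi1:2222", "rpi2:2222", "rpi3:2222"],
})
task_index = 0  # the index of this particular worker task

# Every worker runs this same code and builds a similar graph.
# replica_device_setter places variables on the ps job and ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    weights = tf.Variable(tf.zeros([784, 10]))  # stored on the parameter server
    bias = tf.Variable(tf.zeros([10]))          # stored on the parameter server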

Asynchronous vs synchronous training

Asynchronous versus synchronous training refers to how each worker's training loop executes and updates the graph parameters. In the asynchronous configuration, the workers execute without coordination. In the synchronous configuration, all the replicas read the same respective values from the graph (referring to either the local copy if using between-graph replication, or the global graph if using in-graph replication), compute the gradients in parallel, and apply them together[12].
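Asynchronous training needs nothing extra: each worker applies its own gradients as they become available. For synchronous training, TF 1.x provides a wrapper optimizer that aggregates gradients from all replicas before applying them; a hedged sketch (the base optimizer and replica count are illustrative):

import tensorflow as tf

base_optimizer = tf.train.GradientDescentOptimizer(0.5)

# Wrap the optimizer so that gradients from all replicas are averaged and
# applied together, instead of each worker updating the parameters on its own.
sync_optimizer = tf.train.SyncReplicasOptimizer(
    base_optimizer,
    replicas_to_aggregate=3,   # e.g. one gradient per Raspberry Pi worker
    total_num_replicas=3)

# sync_optimizer.minimize(loss, global_step=...) is then used like the base
# optimizer, together with sync_optimizer.make_session_run_hook(is_chief).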

2.2 Related Work

In this section we go through some research papers on topics relevant to our study.

2.2.1 Single Board Computers for Deep Machine Learning

This paper by Andrew N. Taylor[11] examines the difference in Floating Point Operations Per Second (FLOPS) between a cluster of Raspberry Pi 3 model Bs, a desktop computer and a General Purpose Graphics Processing Unit (GPGPU), as well as the time taken by each system, with the cluster of Raspberry Pis having several setups of workers. The data sets utilized in this study were the MNIST data set and the Tic Tac Toe data set. The report found that in a distributed environment a synchronous setup was in all cases slower than the asynchronous setup, by a factor of 1.4 to 2.0, though this depends on the data set and the setup of the neural network. It also found that a single Raspberry Pi was on average around 6 times slower than the desktop and around 20 times slower than the GPGPU when examining the FLOPS. When comparing the cluster of Raspberry Pis to the desktop and GPGPU on the MNIST data set, it found that even the most efficient setup of the cluster (4 workers) was around 13 seconds slower than the desktop and 22 seconds slower than the GPGPU, which ran in 2.5 seconds. On the Tic Tac Toe data set the study came to the conclusion that the most efficient setup of the Raspberry Pi cluster (7 workers) was around 25 seconds slower than both the desktop and the GPGPU[11], which ran in around 4 seconds. However, this was on a 3 by 3 playing field, and when the field was increased to 10 by 10 the optimal setup (3 workers) performed better than the desktop but slower than the GPGPU[11].

2.2.2 Scaling a Convolutional Neural Network for classification of Adjective Noun Pairs with TensorFlow on GPU Clusters

This study by Víctor Campos et al.[4] presents how the training of a deep neural network can be parallelized on a distributed GPU cluster. It also brings up what factors impact the training time and how performance and scalability in a distributed setting are affected by different factors. The report is based on a cluster of servers, with each server running TensorFlow on 2 NVIDIA K80s. The report used a mix between synchronous and asynchronous modes. The researchers found that the best throughput relation was 2 workers and 1 parameter server on 2 nodes, 4 workers and 3 parameter servers on 4 nodes, 8 workers and 7 parameter servers on 8 nodes and 16 workers and 7 parameter servers on 16 nodes[4]. The report also found that increasing the number of parameter servers beyond 7 yielded minimal improvements and attributes this to the network overhead.


2.2.3 Creating a Raspberry Pi-Based Beowulf Cluster

This thesis by Ellen-Louise Bleeker and Magnus Reinholdsson[3] describes how to build a Beowulf cluster and then analyzes how TensorFlow performs on said cluster when training a neural network on the MNIST data set. The basis for their cluster was the Raspberry Pi 3 model B, and it was built in a similar way to our method described in section 3.3. The researchers utilized between-graph replication and asynchronous updates. Their findings from running TensorFlow on the cluster were not what they had expected. The performance of the cluster was poor, resulting in low accuracy, which was increased slightly by adding more workers. The more workers the researchers added, the longer the total time required for training became[3]. They attributed this to a bug. At best their cluster with 28 workers achieved an accuracy of 34%. This was outperformed by a single Raspberry Pi 3 model B running a parameter server and a worker simultaneously.


Chapter 3 Method

The TensorFlow machine learning framework comes integrated with tools for doing distributed computing. This study uses these integrated tools to create a simple distributable program for training a neural network on the MNIST data set, using data parallelism with between-graph replication and asynchronous parameter updating.

3.1 The Computer Cluster

The computer cluster used to benchmark the performance of Tensor- Flow consists of a single Dell XPS 13 (9350) acting as a parameter server coupled with three Raspberry Pi 3 model Bs acting as workers.

Figure 3.1: The cluster

We chose to use the Raspberry Pi because of the low price of the system, which enabled us to construct a decently sized cluster at minimal cost. As Taylor[11] brings up, the Raspberry Pi 3 model B has a higher performance per dollar than many other computers and even GPUs. As demonstrated by Bleeker et al.[3] in their usage of the Raspberry Pi for their Beowulf cluster, it is not uncommon to use distributed TensorFlow on clusters of Raspberry Pis, at least for research purposes.

3.2 The Software

Since the main purpose of this study is to benchmark the performance of TensorFlow, we are not particularly interested in the input and output of the program. This gives us a lot of freedom in choosing which data set to train the neural network on, how many hidden layers we want, how many neurons in each layer, what learning rate to use, etc.

Our baseline program, built using Python and the TensorFlow framework, trains a simple feedforward neural network with a single hidden layer on the MNIST data set, with one device acting as a parameter server and three other devices acting as workers. The program uses between-graph replication and asynchronous parameter updating. Using asynchronous parameter updating allows us to maximize the data throughput of the system[9]. The program takes a flattened MNIST image of 784 pixels in the 784 neuron input layer, passes it through a single hidden layer of 100 neurons, and outputs to 10 neurons in the output layer representing the digits 0-9.
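The full program is not reproduced here, but the sketch below shows its general shape under stated assumptions: the TensorFlow 1.x API, the MNIST helper from tensorflow.examples.tutorials, hypothetical host names, and illustrative hyperparameters (a sigmoid hidden layer, learning rate 0.05, and the 20 epochs of batches of 100 mentioned in section 5.3.1). It is a sketch of the approach, not the exact program used for the benchmarks.

import argparse
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

parser = argparse.ArgumentParser()
parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
parser.add_argument("--task_index", type=int, default=0)
args = parser.parse_args()

# Hypothetical addresses: one parameter server (the laptop) and three workers (the Pis).
cluster = tf.train.ClusterSpec({
    "ps": ["laptop:2222"],
    "worker": ["rpi1:2222", "rpi2:2222", "rpi3:2222"],
})
server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

if args.job_name == "ps":
    server.join()  # the parameter server only holds and serves the variables
else:
    # Between-graph replication: every worker builds this same graph;
    # the variables are placed on the parameter server.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % args.task_index,
            cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])   # flattened MNIST image
        y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot digit label

        w1 = tf.Variable(tf.truncated_normal([784, 100], stddev=0.1))
        b1 = tf.Variable(tf.zeros([100]))
        hidden = tf.nn.sigmoid(tf.matmul(x, w1) + b1)  # 100-neuron hidden layer

        w2 = tf.Variable(tf.truncated_normal([100, 10], stddev=0.1))
        b2 = tf.Variable(tf.zeros([10]))
        logits = tf.matmul(hidden, w2) + b2            # 10-neuron output layer

        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
        global_step = tf.Variable(0, trainable=False, name="global_step")
        # A plain optimizer gives asynchronous parameter updates on the PS.
        train_op = tf.train.GradientDescentOptimizer(0.05).minimize(
            loss, global_step=global_step)

    mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(args.task_index == 0)) as sess:
        for _ in range(20 * 550):  # roughly 20 epochs with a batch size of 100
            batch_x, batch_y = mnist.train.next_batch(100)
            sess.run(train_op, feed_dict={x: batch_x, y_: batch_y})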

3.3 Environment

The execution environments are as specified below. All Raspberry Pis run identical environments.

3.3.1 Dell XPS 13 (9350)

• CPU: Intel Core i5-5200U @ 2.7GHz

• Installed memory (RAM): 8 GB

• OS: Ubuntu 16.04 xenial


• kernel: x86_64 Linux 4.4.0-101-generic

• TensorFlow version: 1.2.1

• Python version: 2.7.12

3.3.2 Raspberry Pi 3 model B

• CPU: ARMv7 rev 4 (v7l) @ 1.2GHz

• Installed memory (RAM): 1 GB

• OS: Raspbian 9 (stretch) lite

• kernel: armv7l Linux 4.9.59-v7+

• TensorFlow version: 1.2.1

• Python version: 2.7.13

3.4 Building TensorFlow to run on the Raspberry Pi 3 model B

In order to properly run TensorFlow in the desired configuration on the Raspberry Pi system, a full build from scratch is required. This is a tedious process. Luckily, multiple guides are available online on how to do this. Our process of building TensorFlow to run on the Raspberry Pi 3 model Bs followed the guide provided by Sam J Abrahams that is available on his GitHub[2]. Sections 3.4.1-3.4.4 go through these steps in more detail and describe relevant problems that we encountered. We successfully managed to build an older version of the TensorFlow framework, TensorFlow 1.2.1, which had sufficient support for the TensorFlow functions used in our program, which was built and tested using TensorFlow 1.7. All of the related URLs can be found in Appendix A.

3.4.1 Installing basic dependencies

The first step to building TensorFlow was installing the basic dependencies for Bazel and TensorFlow.


# For Bazel

sudo apt-get install pkg-config zip g++ zlib1g-dev unzip

# For TensorFlow

sudo apt-get install python-pip python-numpy swig python-dev
sudo pip install wheel

To enable certain optimization flags when compiling, we installed the gcc/g++ compiler version 4.8:

sudo apt-get install gcc-4.8 g++-4.8

3.4.2 Installing a Memory Drive as Swap for Compiling

The Raspberry Pi 3 model B does not have enough memory to compile TensorFlow. To remedy this problem a USB storage drive was utilized as swap. This process consisted of unmounting the drive, formatting it as swap via the mkswap command and then registering the swap file. This step involves editing fstab and adding the UUID that was acquired during the formatting.

3.4.3 Building Bazel

TensorFlow is built using Bazel, a multi-language build system. So before building TensorFlow, we first need to build a version of Bazel.

When building Bazel on the Raspberry Pi the maximum heap size needs to be increased to avoid an OutOfMemoryError. To solve this problem, after having downloaded the Bazel source, line 117 in

/bazel/scripts/bootstrap/compile.sh

was edited to include the flag -J-Xmx500M

which sets the max Java heap size to 500 MB. After this, the file

/bazel/tools/cpp/cc_configure.bzl

was edited so that the function _get_cpu_value returns "arm" at the beginning of its body.

After this, Bazel was built without any problems. As a side note, the guide provided by Sam J Abrahams details how to build an older version of Bazel that does not support building newer versions of TensorFlow.

3.4.4 Building TensorFlow

After the TensorFlow repository was cloned, all references to 64 bit implementations were changed to 32 bit implementations with the following command:

grep -Rl 'lib64' | xargs sed -i 's/lib64/lib/g'

We then edited the file at

tensorflow/core/platform/platform.h

where we deleted the line

#define IS_MOBILE_PLATFORM

The reason this line was deleted was to prevent the Raspberry Pi 3 model B from being recognized as a mobile device. The next step in the building process was to edit the WORKSPACE file, located in the tensorflow root folder, at line 283 where we replaced

https_file(
    name = "numericjs_numeric_min_js",
    url = "https://cdnjs.cloudflare.com/ajax/libs/numeric/1.2.6/numeric.min.js",
)

with

http_file(
    name = "numericjs_numeric_min_js",
    url = "http://cdnjs.cloudflare.com/ajax/libs/numeric/1.2.6/numeric.min.js",
)

After this step, the eigen version dependency was replaced. To achieve this the file

tensorflow/workspace.bzl

was edited to replace

native.new_http_archive(
    name = "eigen_archive",
    urls = [
        "http://mirror.bazel.build/bitbucket.org/eigen/eigen/get/f3a22f35b044.tar.gz",
        "https://bitbucket.org/eigen/eigen/get/f3a22f35b044.tar.gz",
    ],
    sha256 = "ca7beac153d4059c02c8fc59816c82d54ea47fe58365e8aded4082ded0b820c4",
    strip_prefix = "eigen-eigen-f3a22f35b044",
    build_file = str(Label("//third_party:eigen.BUILD")),
)

with

native.new_http_archive(
    name = "eigen_archive",
    urls = [
        "http://mirror.bazel.build/bitbucket.org/eigen/eigen/get/d781c1de9834.tar.gz",
        "https://bitbucket.org/eigen/eigen/get/d781c1de9834.tar.gz",
    ],
    sha256 = "a34b208da6ec18fa8da963369e166e4a368612c14d956dd2f9d7072904675d9b",
    strip_prefix = "eigen-eigen-d781c1de9834",
    build_file = str(Label("//third_party:eigen.BUILD")),
)

After this the TensorFlow build was configured, by running the included configure script, to not support Google Cloud Platform, Hadoop File System, OpenCL or CUDA. jemalloc was used as the malloc implementation. The next step was to actually build TensorFlow, a step which takes several hours. After the build had finished, the built binary was used to create a Python wheel by running:

./bazel-bin/tensorflow/tools/pip_package/build_pip_package


Build Issues

We encountered several problems during this step since none of the guides fully included the correct steps. We had to follow the steps of the newest guide in conjunction with an older guide to successfully build TensorFlow 1.2. The eigen version dependency, for example, still needed to be replaced even though the old guide for building TensorFlow 1.2 excluded it. Because of the very long build times of TensorFlow on the Raspberry Pi, it could take hours until an error presented itself. This made building TensorFlow a very tedious process.

3.5 The Benchmarking

Since the primary focus of this study is the performance of TensorFlow in distributed systems, the benchmarks are constructed to test some of the different factors that might affect performance in a distributed system. This study includes testing different sizes of the TensorFlow graph, different network configurations and different numbers of workers. As a baseline, the training program is run across all workers, connected via Ethernet.

3.5.1 Network performance benchmarks

Coordinating parallel computation on the TensorFlow graph across the TensorFlow cluster puts high demands on the bandwidth capabilities of the network, meaning that the distributed system must use the network efficiently[1]. The first set of benchmarks aims to examine the scalability of our program by measuring how much bandwidth is necessary for unhindered execution.

We measured the incoming and outgoing network traffic on the parameter server host and the average batch processing time, sampled every 100 batches, when running all three Raspberry Pi workers.

We repeated the same measurements but with only one worker active.

These same two tests were then repeated in a sub-optimal environment where we tried to induce a performance bottleneck. We achieved this by substituting the Ethernet connection on the PS host computer with WiFi.
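The batch processing time can be measured with ordinary wall-clock timing around each training step; a hedged sketch of the idea (the function and argument names are illustrative, only the 100-batch sampling interval comes from the description above):

import time

def run_and_time_batches(sess, train_op, feed_batches, sample_every=100):
    # Runs one training step per feed_dict and prints the average batch
    # processing time over every window of `sample_every` batches.
    window = []
    for i, feed_dict in enumerate(feed_batches, start=1):
        start = time.time()
        sess.run(train_op, feed_dict=feed_dict)
        window.append(time.time() - start)
        if i % sample_every == 0:
            avg_ms = 1000.0 * sum(window) / len(window)
            print("batch %d: average batch time %.2f ms" % (i, avg_ms))
            window = []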


3.5.2 Graph size benchmarks

Training a neural network involves multiple computational steps. The execution times for some of these steps, such as processing training data and backpropagation through the graph, are directly related to the composition and size of the graph. Hypothetically, increasing the size of the TensorFlow graph, and thus increasing the computation time for these steps, could have some positive effect on the performance. For example, depending on the implementations, built-in TensorFlow functions such as tf.matmul() might have large constant factors in their time complexities, meaning they are less efficient for small graphs. A larger graph might also reduce the proportion of time spent on certain time consuming operations that remain constant regardless of graph size and composition, such as reading training data from disk or certain operations relating to the communication with the parameter server.

The second set of benchmarks aims to test the performance scaling when adjusting the size of the TensorFlow graph. For these tests, we first measured the change in execution time when increasing the number of neurons in the single hidden layer of the baseline configuration of our program. We tested 100, 200, 400, 800 and 1600 neurons in the hidden layer respectively. We also measured the change in execution time when increasing the number of neurons in conjunction with increasing the number of hidden layers in the graph. In this test, following a pattern similar to the one previously mentioned was difficult. We tested 100 neurons in 1 hidden layer, 200 neurons spread uniformly over 2 hidden layers, 400 neurons spread uniformly over 2 hidden layers, 600 neurons spread uniformly over 3 hidden layers and finally 1200 neurons spread uniformly over 3 hidden layers.
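A hedged sketch of how these graph sizes can be expressed (TF 1.x; the helper name, the use of tf.layers.dense and the sigmoid activation are illustrative choices):

import tensorflow as tf

def build_mlp(x, hidden_sizes):
    # One dense layer per entry in hidden_sizes, followed by the
    # 10-neuron output layer used for the MNIST digits.
    layer = x
    for size in hidden_sizes:
        layer = tf.layers.dense(layer, size, activation=tf.nn.sigmoid)
    return tf.layers.dense(layer, 10)

x = tf.placeholder(tf.float32, [None, 784])
logits_wide = build_mlp(x, [1600])            # largest single hidden layer test
logits_deep = build_mlp(x, [400, 400, 400])   # 1200 neurons over 3 hidden layers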


Chapter 4 Results

This chapter summarizes the results of the benchmarks described in the Method chapter.

4.1 Ethernet performance

4.1.1 Three workers

The first of the benchmarks, i.e. our baseline, consists of running the program across all three Raspberry Pi workers over Ethernet-connected LAN. The results are as shown below.

Figure 4.1: Network usage measured on the parameter server (incoming and outgoing traffic in KB/s versus elapsed time in seconds).


Figure 4.2: Average batch processing time in ms on each worker (rpi1, rpi2, rpi3) versus elapsed time in seconds.

The total run time for the program was approximately 840 seconds.

The mean average batch processing time across all workers was 69.06 milliseconds.

4.1.2 One worker

The high network traffic of our baseline test prompted testing the same configuration but using only one worker, to minimize the total network traffic. This was done in order to discover whether the network was acting as a performance bottleneck to our distributed system. The results are as shown below.

Figure 4.3: Network usage measured on the parameter server (incoming and outgoing traffic in KB/s versus elapsed time in seconds).


Figure 4.4: Average batch processing time in ms on the rpi1 worker versus elapsed time in seconds.

The total run time for the program was approximately 820 seconds.

The mean average batch processing time was 68.12 milliseconds.

4.2 WiFi performance

To further test possible performance restrictions of our program induced by network performance, the same baseline benchmarks were run again, but with the parameter server communicating with the Raspberry Pi workers over WiFi.


4.2.1 Three workers

Figure 4.5: Network usage measured on the parameter server (incoming and outgoing traffic in KB/s versus elapsed time in seconds).

Figure 4.6: Average batch processing time in ms on each worker (rpi1, rpi2, rpi3) versus elapsed time in seconds.

The total run time for the program was approximately 3000 seconds. The mean average batch processing time across all workers was 247.8 milliseconds, but with a visibly larger variance compared to the baseline test.


4.2.2 One worker

Figure 4.7: Network usage measured on the parameter server (incoming and outgoing traffic in KB/s versus elapsed time in seconds).

Figure 4.8: Average batch processing time in ms on the rpi1 worker versus elapsed time in seconds.

The total run time for the program was approximately 1840 seconds.

The mean average batch processing time was 152.8 milliseconds.

4.3 A larger graph

To test the performance scalability of TensorFlow graphs of different sizes, we ran the baseline configuration again but with different numbers of edges and neurons. The baseline configuration trains a TensorFlow graph with a single hidden layer containing 100 neurons, which results in 784 ∗ 100 + 100 ∗ 10 = 79 400 edges. In this test we benchmarked configurations with 200, 400, 800 and 1600 neurons in the hidden layer respectively. We introduce the measure approximate total run time (s) per 1 000 edges; the idea behind it is to provide a measure of productivity. The results of the benchmarks are shown in the table below. The test accuracy measurement refers to the accuracy of the trained graph.

Neurons in hidden layer | Edges in graph | Approx. total run time (s) | Mean average batch processing time (ms) | Approx. total run time (s) per 1 000 edges | Test accuracy (%)
100  | 79 400    | 840    | 69.06 | 10.57 | 51
200  | 158 800   | 1 550  | 129.5 | 9.810 | 61
400  | 317 600   | 2 970  | 247.3 | 9.351 | 66
800  | 635 200   | 5 730  | 477.1 | 9.021 | 71
1600 | 1 270 400 | 11 180 | 931.0 | 8.800 | 66

4.3.1 A deeper graph

To further test the performance scalability of TensorFlow graphs of different sizes, we ran tests similar to the benchmarks in the previous section, but instead of only increasing the number of neurons in a single hidden layer to increase graph size, we also increased the number of neurons by adding multiple new hidden layers. All tests are run with an equal number of neurons in all of the hidden layers.

Neurons in hidden layer(s) | Hidden layers | Edges in graph | Approx. total run time (s) | Mean average batch processing time (ms) | Approx. total run time (s) per 1 000 edges | Test accuracy (%)
100   | 1 | 79 400  | 840   | 69.06 | 10.57 | 51
200   | 2 | 89 400  | 910   | 75.59 | 10.21 | 53
400   | 2 | 198 800 | 1 810 | 151.0 | 9.105 | 57
600   | 3 | 238 800 | 2 135 | 177.7 | 8.941 | 53
1 200 | 3 | 637 600 | 5 437 | 452.6 | 8.527 | 62
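The edge counts in both tables follow directly from the layer sizes. A short check in plain Python (the helper function is only illustrative; the hidden layers are assumed uniform, as in the benchmarks):

def edge_count(hidden_sizes, n_inputs=784, n_outputs=10):
    # Number of edges (weights) in a fully connected feedforward graph.
    sizes = [n_inputs] + list(hidden_sizes) + [n_outputs]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

print(edge_count([100]))            # 79400   (baseline)
print(edge_count([1600]))           # 1270400 (largest single hidden layer)
print(edge_count([100, 100]))       # 89400   (200 neurons over 2 layers)
print(edge_count([400, 400, 400]))  # 637600  (1200 neurons over 3 layers)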


Chapter 5 Discussion

This chapter discusses the results of the benchmarks, possible sources of error and suggestions for further research.

5.1 The network bottleneck

The baseline benchmark, running all three workers over Ethernet-connected LAN, resulted in high network traffic on the parameter server host, averaging around 14 MB/s. Another test was run using only one worker, to see if the network interface between the parameter server and the workers was acting as a performance bottleneck. This time, the network traffic averaged around 4.5 MB/s, about a third of the total network traffic for the baseline benchmark. This is what one would expect to find if the network interface did not act as a bottleneck in the baseline benchmark.

Repeating the benchmarks in a sub-optimal environment, i.e. over a WiFi connection, yielded very different results. In this setup, the Raspberry Pi workers were still connected via Ethernet to the network router, while the parameter server host computer was connected via WiFi. This configuration produced much more inconsistent performance and was substantially slower, as shown in both graphs. The approximate total run time nearly quadrupled. Running the same configuration with only one worker gave more stable performance, but still limited in comparison to the Ethernet configuration.

The results suggest that the performance of the network plays a vital role in the overall performance of our program. A data transfer speed of 14 MB/s corresponds to a bit rate of 112 Mbit/s. Although this data rate is manageable for our PS host computer, it is not negligible. Were we, for example, to use a fourth Raspberry Pi 3 model B as our PS host, this would be an issue, as the Raspberry Pi 3 model B only provides 100 Mbit/s support on its Ethernet port. Increasing the number of workers will increase the network traffic proportionally. At some point, the ceiling for what bandwidth the network can handle will be reached; this is essentially what the results of our benchmarks over WiFi illustrate, and is especially true for asynchronous modes. This can be combated somewhat by using synchronous parameter updating. However, synchronous parameter updating introduces other performance problems in the form of a straggler effect, where performance is limited by having to wait for all workers to finish at each parameter synchronization[9]. In the study by Víctor Campos et al. a trade-off solution to these issues is proposed in the form of a mixed parameter updating mode where the model parameters are updated asynchronously but the gradients are averaged synchronously[4].

5.2 Graph size scalability

Given the performance limitations imposed by the network, we wanted to investigate what changes could be made to the composition of the graph to increase the productive efficiency of the program, i.e. what could be done to maximize the amount of information held by the graph relative to the total run time of the program. Increasing the size of the graph helped marginally. Increasing the number of edges in the graph by adding neurons to the existing layers decreased the total run time per edge. Increasing the number of edges by adding neurons in conjunction with increasing the number of layers decreased the total run time per edge at a faster rate. By plotting the approximate total run time per 1 000 edges from both tests in section 4.3 we can observe the trends.


Figure 5.1: Graph size effect on productivity (approximate run time per 1 000 edges versus number of edges in the graph, for the single hidden layer and multiple hidden layer configurations).

It is worth noting that the plot in Figure 5.1 is somewhat flawed in that it excludes the number of hidden layers for each point in the multiple hidden layers data set. With this in mind, there is still a clear difference in performance, albeit small, between graphs of similar size for the small graphs we tested. Although not obvious for the multiple hidden layers data set, the single hidden layer curve hints at diminishing returns for large graphs. This is verified by looking at the approximate total run time change for each data point in the single hidden layer data set. Each step in the table below corresponds to a doubling of the number of edges held by the graph.

Graph size change | Approx. total run time increase
100 to 200 neurons | 84.5%
200 to 400 neurons | 91.6%
400 to 800 neurons | 92.9%
800 to 1600 neurons | 95.1%

5.2.1 Differences in accuracy

When comparing graphs of similar size, the graphs with a single hidden layer performed better in terms of accuracy than the graphs with multiple hidden layers. Comparing a single hidden layer graph of 635 200 edges to a three hidden layer graph of 637 600 edges, the single layer graph scored 9 percentage points higher than the three layer graph, which scored an accuracy of 62%. However, the accuracy score does not exclusively have to be a result of the composition of the graphs. There are many other factors which affect the accuracy. These include the learning rate, the batch size, the number of training epochs etc. Most importantly it includes what type of data the graph is being trained on, since deep neural networks can learn more complex non-linear models[10].

5.3 Error sources

This study researches the performance of TensorFlow in a somewhat narrow scope, and with limited testing capabilities it is very possible that the conclusions only apply to a limited set of hardware and software configurations.

5.3.1 Hardware

The tests were limited by the fact that we were only able to run a cluster with a maximum of three Raspberry Pi workers. The tests were also limited by the fact that the Raspberry Pis could only utilize their CPUs for computation. In real world applications, with larger neural networks to train and where performance is critical, the hardware configurations might look much different from ours, often with much more powerful computers capable of utilizing GPUs.

Another limitation, a consequence of the relatively poor computational power of the Raspberry Pi cluster, was the inability to train larger neural networks in a relatively short space of time. To train the simple graph in the baseline configuration, with 20 epochs and a batch size of 100, the cluster took approximately 840 seconds, or 14 minutes. The graph in the baseline configuration only contains 79 400 edges. To train graphs with a million or more edges, the cluster would take multiple hours. Hence, the tests in this study were only done on relatively small graphs.

5.3.2 WiFi performance inconsistency

The measurements over WiFi produced very inconsistent results. Running our program at different times of the day, moving the parameter server computer etc. produced very different results, so the results of section 4.2 WiFi performance paint a very limited picture. However, the point of testing our configuration over WiFi was to make a comparison to the stable performance of the tests over Ethernet in section 4.1, so some conclusions can still be drawn.

5.4 Applicability

Using Raspberry Pis or other single board computers for doing distributed computing with TensorFlow is in many cases, as implied by Taylor[11], a perfectly valid approach, due to the favorable price-to-performance ratio compared to other solutions. While our study might be less applicable to larger distributed systems for training larger neural networks, due to its narrow scope and limitations, it has some applicability to systems of similar scale. Our motivation behind using the Raspberry Pi as the principal computation unit was primarily the low cost. Hobbyists or other research studies might have similar motivations.

5.5 Further research

Our study only explores a small subset of the factors that might affect the performance and scalability of TensorFlow in distributed environments. While the results of our study may be useful, there definitely exists room for expanding on the subject. Our suggestions include using different training data sets, trying different cluster configurations with more than one parameter server, looking at other possible factors, and testing larger graphs to further examine the performance trends shown in Figure 5.1.

The results of the benchmarks where we tested different sizes and compositions of the TensorFlow graph are probably not exclusive to distributed TensorFlow programs. Another suggestion for future research, not limited to distributed computing, is to repeat these tests for single computer TensorFlow programs and examine if and how the results differ from ours.


Chapter 6 Conclusion

The scalability and performance of TensorFlow in distributed environments, for the configurations we tested, can be severely hindered by the performance of the network between the computers in the cluster.

Our baseline cluster configuration, i.e. having all three Raspberry Pis active, ran with predictable performance and used roughly 14 MB/s worth of bandwidth, equal to about 4.7 MB/s per Raspberry Pi worker. This data rate was verified by running the same benchmarks with only one Raspberry Pi active, where we measured roughly 4.5 MB/s. These high data rates mean that the network can quickly become a performance bottleneck, depending on its bandwidth capabilities.

Another factor impacting the performance of TensorFlow is the composition and size of the graph being trained. Our study finds that the total run time of our distributed TensorFlow program does not scale perfectly linearly with the number of parameters held by the graph.

Comparing a graph with a single hidden layer of 100 neurons to a graph with a single hidden layer of 200 neurons, effectively doubling the number of parameters held by the graph, the latter takes slightly less than double the time to train with the same training configuration. This effect was most notable on smaller graphs. Increasing both the number of neurons and the number of hidden layers in conjunction had an even more prominent effect.


Bibliography

[1] Martín Abadi et al. "TensorFlow: A system for large-scale machine learning". In: CoRR abs/1605.08695 (2016). arXiv: 1605.08695. URL: http://arxiv.org/abs/1605.08695.

[2] Sam J Abrahams. Building TensorFlow for Raspberry Pi: a Step-By-Step Guide. Jan. 2018. URL: https://github.com/samjabrahams/tensorflow-on-raspberry-pi/blob/master/GUIDE.md (visited on 05/10/2018).

[3] Ellen-Louise Bleeker and Magnus Reinholdsson. Creating a Raspberry Pi-Based Beowulf Cluster. 2017.

[4] V. Campos et al. "Scaling a Convolutional Neural Network for Classification of Adjective Noun Pairs with TensorFlow on GPU Clusters". In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). May 2017, pp. 677–682. DOI: 10.1109/CCGRID.2017.110.

[5] Demis Hassabis and David Silver. AlphaGo Zero: Learning from scratch. Oct. 2017. URL: https://deepmind.com/blog/alphago-zero-learning-scratch/ (visited on 05/11/2018).

[6] Google Developers. Machine Learning Glossary. Apr. 2018. URL: https://developers.google.com/machine-learning/glossary/#epoch (visited on 05/10/2018).

[7] Google Developers. Machine Learning Glossary. Apr. 2018. URL: https://developers.google.com/machine-learning/glossary/#batch (visited on 05/10/2018).

[8] Frontline Systems, Inc. Training an Artificial Neural Network - Intro. 2018. URL: https://www.solver.com/training-artificial-neural-network-intro (visited on 05/10/2018).

[9] Peter H. Jin et al. "How to scale distributed deep learning?" In: CoRR abs/1611.04581 (2016). arXiv: 1611.04581. URL: http://arxiv.org/abs/1611.04581.

[10] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. "Deep Neural Networks for Object Detection". In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges et al. Curran Associates, Inc., 2013, pp. 2553–2561. URL: http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf.

[11] Andrew Taylor. "Single Board Computers for Deep Machine Learning". In: The UNSW Canberra at ADFA Journal of Undergraduate Engineering Research 10.1 (2018).

[12] TensorFlow. Distributed TensorFlow. Apr. 2018. URL: https://www.tensorflow.org/deploy/distributed (visited on 05/04/2018).

[13] TensorFlow. Low Level APIs, Introduction. Apr. 2018. URL: https://www.tensorflow.org/programmers_guide/low_level_intro (visited on 05/03/2018).

[14] TensorFlow. TensorFlow Architecture. Apr. 2018. URL: https://www.tensorflow.org/extend/architecture (visited on 05/03/2018).


Appendix A

Appended Links

Building TensorFlow for Raspberry Pi: a Step-By-Step Guide:

https://github.com/samjabrahams/tensorflow-on-raspberry-pi

Python-pip:

https://github.com/pypa/pip

python-numpy:

http://www.numpy.org/

Swig:

http://www.swig.org/

Python-dev:

https://packages.debian.org/sv/jessie/python-dev

Wheel:

https://wheel.readthedocs.io/en/latest/

Bazel:

https://github.com/bazelbuild/bazel
