
IT 18 044

Degree project, 30 credits, September 2018

Scaling the Evolution of Artificial Neural Networks in the Cloud

Felix Runge

Department of Information Technology



Abstract

Scaling the Evolution of Artificial Neural Networks in the Cloud

Felix Runge

Deep Learning is a technique within the field of Machine Learning that is able to solve tasks such as Image Classification or Natural Language Processing. With its recent success, Deep Learning, and therefore Artificial Neural Networks, receive high exposure. However, designing architectures for Artificial Neural Networks is still a time-consuming and complex task. Automating the search for architectures of Artificial Neural Networks is therefore the next step to simplify this process. The automation of this process should consider two aspects: the scalability to adapt to increasing problem sizes in a variety of domains, and the performance of the resulting architectures. For this project, Neuroevolution was distributed in the Cloud using Kubernetes. Neuroevolution, a subfield of Evolutionary Computation, is able to evolve architectures of Artificial Neural Networks. The evolution is parallelisable, and therefore able to scale out. Cloud infrastructures, with their theoretically endless resources, provide scalability and the configuration of preferred hardware (CPUs, GPUs, or TPUs). Kubernetes abstracts the underlying resources and enables a scalable and resilient execution of applications. Image classification is used for the evaluation of the implemented application. The experiments show that the application is able to scale horizontally with respect to both strong and weak scaling. No work imbalance is evident in the distribution, owing to the asynchronous evolution. In addition, the average and maximum accuracy of the evolved Artificial Neural Networks increase, and no overfitting is present in the obtained results. Due to its simplicity and scalability, Neuroevolution represents a viable solution for architecture search.

Printed by: Reprocentralen ITC, IT 18 044

Examiner: Mats Daniels

Subject reviewer: Andreas Hellander
Supervisor: Collin Rogowski


Acknowledgement

I want to dedicate this Master's Thesis to my parents, who have always supported me in every possible way, no matter what I have in mind.

This Master’s Thesis was created in cooperation with inovex GmbH in Germany. I would like to thank Collin Rogowski and Hans-Peter Zorn for their marvellous support throughout my Master’s Thesis. In general, I would like to thank inovex for their support in various ways.

I really enjoyed my time at the company. I particularly want to emphasize the worthwhile conversations with colleagues, not only regarding my project. In addition, I would like to thank Uppsala University, and especially my reviewer Andreas Hellander, for supporting me during this project.

Felix Runge


Table of Contents

List of Figures viii

List of Acronyms x

1 Introduction 1

1.1 Motivation . . . 1

1.2 Project Description . . . 2

1.3 Delimitations . . . 3

2 Theory 4

2.1 Machine Learning . . . 4

2.1.1 Artificial Neural Networks . . . 6

2.1.2 Deep Learning . . . 9

2.1.3 Convolutional Neural Networks . . . 10

2.2 Evolutionary Computation . . . 12

2.2.1 Population . . . 12

2.2.2 Individual . . . 13

2.2.3 Fitness . . . 13

2.2.4 Reproduction . . . 13

2.2.5 Selection . . . 14

2.2.6 Evolution . . . 14

2.2.7 Genetic Algorithms . . . 15

2.2.8 Neuroevolution . . . 16

2.3 Cloud Computing . . . 19

2.3.1 Virtualisation . . . 20

2.3.2 Scalability . . . 21

3 Related Work 24

4 Materials and Method 26

4.1 Fashion-MNIST . . . 26

4.2 Design . . . 27

4.2.1 Evolution . . . 27

4.2.2 Overfitting . . . 28

4.2.3 Distribution . . . 28


4.3 Frameworks and Software . . . 28

4.3.1 Docker . . . 29

4.3.2 Kubernetes . . . 29

4.3.3 Celery . . . 30

4.3.4 TensorFlow . . . 30

4.3.5 Keras . . . 30

4.4 Implementation . . . 31

4.4.1 Distribution . . . 31

4.4.2 Deployment . . . 31

5 Evaluation 33

5.1 Experimental Setup . . . 33

5.2 Experiments . . . 34

5.2.1 Scalability of the Application . . . 34

5.2.2 Performance of Artificial Neural Networks . . . 34

5.3 Scalability . . . 34

5.3.1 Strong Scalability . . . 35

5.3.2 Task Distribution . . . 36

5.3.3 Weak Scalability . . . 37

5.3.4 Conclusion . . . 38

5.4 Performance . . . 38

5.4.1 Average Accuracy . . . 39

5.4.2 Maximum Accuracy . . . 41

5.4.3 Impact of Epochs on Accuracy and Scalability . . . 42

5.4.4 Impact of Number of Nodes on Accuracy . . . 43

5.4.5 Conclusion . . . 43

5.5 Comparison to Existing Approaches . . . 44

6 Future Work 45

7 Conclusion 46

Bibliography 47


List of Figures

2.1 Use of validation set to avoid overfitting . . . 6

2.2 A single neuron . . . 6

2.3 Layers in an Artificial Neural Network . . . 8

2.4 Deep Learning within the domain of image classification . . . 9

2.5 Example of convolution operator . . . 11

2.6 Illustration of max pooling . . . 12

2.7 Pseudocode of simplified evolutionary process . . . 15

2.8 Illustration of one-point crossover operator . . . 16

2.9 Encoding in NEAT . . . 17

2.10 Crossover in NEAT . . . 18

2.11 Virtual Machines and Containers . . . 21

2.12 Amdahl’s law visualised . . . 22

4.1 Example of Fashion-MNIST . . . 26

4.2 Distribution of items in Fashion-MNIST . . . 27

4.3 Deployment of prototype . . . 32

5.1 Strong scaling with initial population . . . 35

5.2 Strong scaling without initial population . . . 36

5.3 Task distribution . . . 37

5.4 Histogram of task lengths . . . 38

5.5 Weak scaling . . . 39

5.6 Average accuracy . . . 40

5.7 Accuracy on validation and test set . . . 41

5.8 Maximum accuracy . . . 42

5.9 Impact of epochs on accuracy . . . 43

5.10 Accuracy with respect to the number of nodes . . . 44


List of Acronyms

AGI Artificial General Intelligence
AI Artificial Intelligence
ANN Artificial Neural Network
API Application Programming Interface
CNN Convolutional Neural Network
CPU Central Processing Unit
DL Deep Learning
EA Evolutionary Algorithm
EC Evolutionary Computation
GA Genetic Algorithm
GPU Graphics Processing Unit
IaaS Infrastructure as a Service
LSTM Long Short Term Memory
ML Machine Learning
NLP Natural Language Processing
NEAT NeuroEvolution of Augmenting Topologies
NIST National Institute of Standards and Technology
PaaS Platform as a Service
RL Reinforcement Learning
SaaS Software as a Service
TWEANN Topology and Weight Evolving Artificial Neural Network
TPU Tensor Processing Unit
VM Virtual Machine


1 Introduction

Automation is changing the world and is an inherent part of the future. Industry 4.0 describes the next industrial revolution, which is currently ongoing[17]. Part of Industry 4.0 is the connection of everything. Based on this connection, it allows autonomous, decentralised decision-making to improve productivity[17]. This is necessary to cope with the huge amount of data being generated today and in the future[35]. The technology research company Gartner, Inc.[35] predicts that one emerging trend within this field is Artificial Intelligence (AI) Everywhere. This trend describes the interconnected use of AI in a variety of fields. Example use cases range from autonomous vehicles to drug research.

One of the technologies listed as part of the trend is Artificial General Intelligence (AGI).

AGI refers to the concept of solving complex problems across different domains, combined with self-control based on feelings, thoughts, etc.[14]. However, the goal of creating general intelligence is very difficult to achieve[13]. Evolutionary Computation (EC) is one approach to the concept of AGI[25][14]. EC describes a group of algorithms that transfer aspects of evolution, including reproduction, selection, and mutation of individuals within a population, from nature to computers[2]. The evolution of Artificial Neural Networks (ANNs), also called Neuroevolution, is a variant of EC in the field of Machine Learning (ML)[42]. Neuroevolution uses Evolutionary Algorithms (EAs) to evolve ANNs[42] and can be applied to the same problem space as Reinforcement Learning (RL)[41]. These problems offer only sparse rewards, unlike problems in the field of supervised learning, where the expected output is specified for every input. Example applications for problems within the field of RL can be found in robotics1 and video game agents2.

1.1 Motivation

According to Gartner[35], Deep Learning (DL) is at the peak of the hype right now. The expectations of DL exceed the actual outcome. Yoshua Bengio[3], known for his work with ANNs and DL, states that this is the result of DL mistakenly being advertised as reaching general human-level intelligence. However, he points out that DL can reach human-level intelligence in very specific domains. Narrow AI describes this ability of solving specific tasks[14]. Bengio emphasizes the importance of long-term research, especially in the field of AI. The focus on short-term goals such as profit could lead to a slow-down in scientific research.

1Example application within robotics: https://www.nature.com/articles/nature14422

2MarI/O, a video game agent for Super Mario World: https://www.youtube.com/watch?v=qv6UVOQ0F44


The technology of DL has already been applied to various fields with success. Image classification, which describes the task of assigning a label from a set of predefined labels to a given input image, represents such an example. GoogLeNet describes one architecture of an ANN that reached state-of-the-art results for image classification in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 20143, a competition to test and evaluate algorithms of research teams[45]. An important factor that contributes to the success of ANNs and DL is their scalability. ANNs can be trained in parallel, and the use of Graphics Processing Units (GPUs) instead of Central Processing Units (CPUs) led to a major improvement in computation time[36].

Due to the current demand for DL, companies invest in automating the process. The architectures of ANNs for DL applications can be large and complex, and researchers spend a large amount of time hand-crafting them. The problem shifts towards automating the architecture search for ANNs. Francois Chollet[8], author of Keras4, a DL library for the programming language Python, mentions that such automation can be one of the changes in how DL is used in the future. Therefore, the demand for Neuroevolution is rising again[42].

Neuroevolution can be used to find architectures for these deep ANNs[37][33] or improve existing ones[44]. Furthermore, Neuroevolution can find new strategies5 through evolution[32].

In addition, Neuroevolution and EC in general are designed to offer the possibility of parallelisation. Horizontal scalability is an important factor in handling the complexity of architecture search for DL solutions.

Cloud Computing[31] enables massive distribution and scalability of scientific applications. Due to the hype around AI and blockchain technologies, the cost of GPUs has exploded[6].

Therefore, the concept of Cloud Computing becomes even more interesting, since it allows a cost-efficient approach[26]. The simple and flexible access to resources in a Cloud environment abstracts the underlying hardware from the computations. Hence, the use of CPUs or GPUs is just a matter of choice. The combination of Cloud Computing and Neuroevolution for DL architectures can lead to a further cost reduction and allows the flexibility of applying this technique to a large number of domains. As a result, architecture search for DL problems is made possible.

1.2 Project Description

The aim of this project is to investigate the scalability of evolving ANNs in a cloud environment, and to evaluate the resulting ANN architectures. For the assessment, an end-to-end prototype is implemented to provide insight into the possibility of automating the architecture search for DL. Overall, the project tries to answer the following research questions:

1. Is it possible to scale the evolution of ANNs in a cloud environment?

2. Is the performance of resulting ANNs sufficient?

3The website of the ILSVRC 2014: http://www.image-net.org/challenges/LSVRC/2014/

4The website of Keras can be found here: https://keras.io/

5For example connections between nodes in an ANN improving the performance of the network.


For the abstraction and easy deployment of the implemented prototype, the container orchestration tool Kubernetes6 is used. For the evaluation of the project, the horizontal scalability with respect to weak and strong scaling is assessed. As an example application for narrow AI, image classification is conducted. The Fashion-MNIST dataset is used for the evaluation of the performance of the resulting ANNs. Since every ANN within the population can be trained separately, the training is done in parallel by a defined number of workers. Celery is used to distribute the training of ANNs in the form of tasks among the workers. The prototype constructed for the benchmark is based on a public implementation of ANN evolution7.
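As a rough illustration of this design, the sketch below shows how ANN training could be dispatched as Celery tasks. The broker URL, task name, and stubbed fitness evaluation are assumptions for illustration only and are not the prototype's actual code.

from celery import Celery

# Broker/backend URLs and the task layout are assumptions for illustration only.
app = Celery('neuroevolution',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

@app.task
def evaluate_individual(genome):
    """Train the ANN encoded by `genome` and return its fitness (e.g. validation accuracy).
    A real worker would build and fit a Keras model here; this stub only keeps the
    sketch self-contained."""
    return 0.0

# Driver side: fan out one task per individual and collect the fitness values.
# results = [evaluate_individual.delay(genome) for genome in population]
# fitness = [result.get() for result in results]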

The evaluation is two-fold. On the one hand, the scalability of the proposed idea is analysed.

On the other hand, the performance of the resulting ANNs is evaluated. In addition, the impact of scaling the prototype on the resulting performance of ANNs is assessed.

1.3 Delimitations

The aim of this project is not to design and implement a fully-functional framework for the distributed evolution of ANNs. Instead, a prototype is designed and implemented to support the benchmark process. With respect to the computing resources used, the resulting ANNs are not meant to compete with state-of-the-art solutions in this field. Hence, the project should rather be seen as an indicator of the potential of this approach. This includes the configuration of the evolutionary process. Only selected parameters will be modified for the experiments to see their impact on the scalability and performance of the resulting ANNs.

6Website of Kubernetes: https://kubernetes.io/

7The repository can be found here: https://github.com/joeddav/devol


2 Theory

In the following, the underlying theory within the fields of Machine Learning, Evolutionary Computation, and Cloud Computing will be introduced.

2.1 Machine Learning

Machine Learning, according to Murphy[34], represents a solution to handle the huge amounts of data being generated today. In general, ML represents a collection of methods and algorithms. To achieve this goal, the algorithms try to find patterns in data. Based on these patterns, predictions can be generated. In addition, Goodfellow et al.[16] state that the ability to learn from data, which is what ML provides, can be seen as a way of understanding the fundamental principles of intelligence.

ML utilises learning to be capable of performing specific tasks. There are different kinds of learning strategies. Two major strategies are unsupervised and supervised learning. A dataset is defined as a collection of data points x_i (also known as examples), where i denotes the index of the data point. Every data point is a vector of features, and is therefore defined by x_i ∈ R^n. Every feature x_ij in x_i describes a property of a data point that has been observed or measured. An example feature could be the colour of a car: some cars could be red, whereas other cars could be blue[16].

Unsupervised learning methods study an entire dataset and its features to learn useful properties and information about the dataset. This information can be used to learn the underlying distribution of the data or to apply other algorithms such as clustering.

Clustering describes the separation of the dataset into similar clusters. An example would be the separation of cars into clusters. Each cluster contains cars that are similar among themselves. Thus, it could create a cluster that contains sports cars and a second cluster that consists of family cars[16].

Supervised learning methods study an entire dataset, too, but in addition, every data point contains a label, also called a class or target. For example, a car dataset contains a number of features, among them the colour of a car. One car in this dataset has the colour red, but this car also has the label Ferrari, which refers to the brand of the car.

Classification describes a method in which, based on the feature vector of a data point, a corresponding label is predicted for that specific data point. For example, based on the features of a car, its brand can be predicted. The term supervised refers to the concept of supervising the learning process by providing the outcome of a prediction.


Therefore, this can be seen as a teaching process. Unsupervised learning is conducted without any provision of the actual outcome. Hence, the unsupervised method has to come up with the solution by itself[16].

For the actual supervised learning process, the dataset can be split into a number of distinct subsets. The most common approach is to use a training and a test set. The training set is used during the training phase of an algorithm. An algorithm will learn about the connections between the features and labels of the data points, and thus find patterns in the dataset. The training performance can be measured as a value E called the error. The error describes how far apart the predicted and expected values are. Hence, the aim is to reduce the error. Based on the error on the training set, an algorithm will adjust its parameters to improve its performance. The actual process of how this learning is conducted depends on the algorithm used. The resulting output of an algorithm is referred to as a model. After the training process is finished, the model should be able to predict the labels of data points.

The test set should contain data points that the algorithm has not seen during the training phase. The performance on unseen data points describes the ability to generalise.

This generalisation ability is very important since a model is typically used on different, unseen data. The problem is also addressed by the term overfitting. Overfitting on the training set means that the model can represent the training set perfectly but loses the flexibility and generalisation ability needed to be used with unseen data points[4].

The use of an additional dataset called the validation set is one approach to prevent overfitting on the training set. Like the test set, the validation set consists only of unseen data points. Again, during the training phase, only the training set is used for learning the representation of the data. However, the errors on both the training and validation sets are recorded. The error on the training set should decrease monotonically. The error on the validation set should also decrease, but at some point, when the model starts overfitting on the data, the error on the validation set goes up. This is the point at which training should be stopped to preserve the generalisation ability. This process is illustrated in Figure 2.1 and is referred to as early stopping[4].
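As a hedged illustration (not part of the thesis prototype), early stopping is available as a ready-made callback in Keras, the library used later in this project; the monitored quantity and patience below are arbitrary example values.

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',          # error on the validation set
                           patience=3,                  # tolerate 3 epochs without improvement
                           restore_best_weights=True)   # keep the weights from the best epoch

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=100,
#           callbacks=[early_stop])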

To analyse the performance of models, the accuracy can be used as a metric. The accuracy of a classification algorithm describes the percentage of correctly predicted labels, as can be seen in Equation 2.1[16].

accuracy = |Correct Predictions| / |All Predictions|    (2.1)

However, it is important to point out that the accuracy can be misleading if a class imbalance exists in the dataset. A class imbalance describes the problem of a dataset that mostly contains data points of the same class. If, for example, 99% of the dataset consists of cars with the label Ferrari and the remaining 1% have different labels, a classification algorithm that always outputs Ferrari as a label would score an accuracy of 99%. Therefore, it is very important to balance the distribution of the classes in a dataset or to use algorithms that are able to cope with the problem of class imbalance[30].
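The following small sketch evaluates Equation 2.1 and reproduces the class-imbalance pitfall described above; the labels are made up for illustration.

def accuracy(predicted, expected):
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

labels = ['Ferrari'] * 99 + ['Volvo']       # 99% of the data points share one class
always_ferrari = ['Ferrari'] * 100          # a classifier that ignores its input
print(accuracy(always_ferrari, labels))     # 0.99, although the classifier learned nothing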

There are a variety of methods and algorithms in the domain of Machine Learning. One successful method called Artificial Neural Networks will be introduced in the next section.


Figure 2.1: Example of how a validation set is used to enable early stopping and prevent overfitting on the training set. [based on [4]]

2.1.1 Artificial Neural Networks

An overview of Artificial Neural Networks (ANNs) is given by Zou et al.[50]. The concept of ANNs is based on the biological neurons that can be found in the brain. Each neuron in the brain is able to perform some simple task in response to some input. The connection of multiple neurons enables the realisation of more complex tasks. Thus, for example, speech recognition or pattern recognition in pictures is possible. For communication, electrochemical signals are used. If the received total signal reaches a threshold in the neuron, the neuron starts to fire, which refers to relaying a signal to connected (also called neighbouring) neurons. The change in the strength of a connection between two neurons based on previous experience is assumed to represent the basis of memory within a brain[50].

Figure 2.2: Single neuron with 2 inputs, bias and a resulting output. [based on [50]]

Figure 2.2 illustrates the concept of the biological neuron transferred to computers. The result is an artificial neuron (also referred to as a node). The neuron in Figure 2.2 has two inputs x_1 and x_2 that are weighted with the respective weights w_1 and w_2. In addition, a bias (also referred to as a threshold) is used, which represents the respective threshold in the biological neuron. The neuron itself computes the activation of the weighted sum of the inputs and the bias. Equation 2.2 illustrates the calculation of the resulting output or activation of a neuron[50].

output = f( Σ_{i=1}^{n} w_i · x_i − bias )    (2.2)

The activation of a neuron is computed by a given activation function f (also referred to as a transfer function). The activation function can either be linear or non-linear. However, the function has to be differentiable to enable the training of ANNs[50].
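The minimal sketch below computes the output of the single neuron of Figure 2.2 according to Equation 2.2; the sigmoid used here is only one common, differentiable choice of activation function, and the input values are arbitrary.

import math

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs minus the bias (Equation 2.2) ...
    z = sum(w * x for w, x in zip(weights, inputs)) - bias
    # ... passed through a differentiable activation function f (here: sigmoid).
    return 1.0 / (1.0 + math.exp(-z))

print(neuron_output(inputs=[0.5, -1.0], weights=[0.8, 0.2], bias=0.1))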

An ANN typically consists of the connection of multiple neurons to be able to solve complex tasks. Therefore, an ANN contains several layers where each layer is a group of neurons[50]:

• Input layer: The features used by an ANN to conduct classification or regression1. For each feature, a respective node is present in the input layer.

• Output layer: The resulting output of the ANN. For each label, a respective node is present in the output layer.

• Hidden layer(s): One or more layers, possibly with different numbers of neurons, that represent the relationship between features and classes using weights.

The design of an ANN depends on the number of neurons within a hidden layer, the number of hidden layers, and the connection between neurons and hence between layers.

Figure 2.3 illustrates an ANN with one input layer, one hidden layer, and one output layer.

A feed-forward network represents an ANN where the features are only propagated forward.

Therefore, no connection ends in the same or previous layer. In contrast, recurrent networks (also referred to as feed-back networks) can contain connections that end in the same or previous layer[50]. Fully-connected layers describe layers where every neuron in the connected layer contains connections to every neuron in the previous layer[16].

For the training of ANNs, the so-called backpropagation algorithm is used. The main idea of the algorithm is to propagate the features and information forward through the layers. The resulting error is propagated backwards to adjust the weights of the connections between the layers. The aim is to reduce the error created by the ANN. This refers to an optimisation problem where the goal is to minimise a cost function. The weight adjustment process is conducted until the error converges. The weight updates themselves are calculated using a technique called gradient descent[50]. Goodfellow et al.[16] describe the gradient descent algorithm as the use of the derivative of the cost function in order to minimise the error. Using the derivative, small steps towards the minimum error value can be made, and thus the error is reduced over time until it converges. For the training, (mini-)batches are sampled uniformly from the dataset. For each batch the gradient is calculated and used to reduce the error2. A single epoch is the forward propagation of every example within a dataset through the ANN and the backward propagation of the error[16].

1Regression describes the task where a continuous value is calculated based on the features, unlike classification, where categorical values are returned[16].

2This technique is referred to as stochastic gradient descent because of the uniform sampling of examples from the dataset.
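To make the weight-update loop concrete, the sketch below illustrates mini-batch stochastic gradient descent on a single linear neuron with a squared-error cost. The dataset, learning rate, and batch size are invented for illustration and are unrelated to the experiments in this thesis.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 3))                      # 256 examples with 3 features each
y = x @ np.array([1.5, -2.0, 0.5])                 # targets from an unknown mapping
w = np.zeros(3)                                    # weights to be learned
learning_rate, batch_size = 0.1, 32

for epoch in range(5):                             # one epoch = every example seen once
    order = rng.permutation(len(x))                # mini-batches sampled uniformly
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]
        error = x[batch] @ w - y[batch]            # forward pass and error
        gradient = x[batch].T @ error / len(batch) # derivative of the cost w.r.t. the weights
        w -= learning_rate * gradient              # small step towards the minimum
    print(epoch, np.mean((x @ w - y) ** 2))        # the training error should decrease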


Figure 2.3: ANN with one input, one hidden, and one output layer. [based on [50]]

Dropout

Dropout is a technique invented by Srivastava et al.[40], and it addresses two problems:

1. Reduce overfitting of ANNs

2. Efficient combination of ANNs into a single architecture to enhance the classification or regression performance

To be able to overcome these problems, units3 of an ANN are temporarily dropped out, and therefore they refer to this technique as dropout. The units are dropped randomly with a given probability of 1 − p4. Thus, they sample thinned ANNs from the original ANN.

During the training phase, all thinned ANNs share the same weights for their connections.

Afterwards, a single ANN is used for the test phase. The weights of the connections in this ANN are set to their expected values from training time. Therefore, the weights are multiplied by the given probability p of keeping a unit during the training phase. Using this approach, they are able to obtain better generalisation and to efficiently combine multiple ANNs into a single ANN.[40]
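A toy sketch of the dropout scheme just described (not the original implementation): units are kept with probability p during training, and at test time the activations (equivalently, the outgoing weights) are scaled by p to match their expected value. The numbers are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.8                                      # probability p of keeping a unit
activations = rng.normal(size=(4, 10))            # one layer's activations for a small batch

mask = rng.random(activations.shape) < p_keep     # sample a thinned network
train_output = activations * mask                 # dropped units output zero during training

test_output = activations * p_keep                # scale by p at test time (expected value)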

Batch Normalisation

Batch normalisation was published by Ioffe and Szegedy[21] and represents a strategy to cope with the change of the distribution of the inputs of layers. More precisely, they state that the distributions of the inputs change during training when the weights are adjusted, and therefore hyperparameters have to be tuned more carefully. Thus, training could take longer. The larger an ANN, the bigger the impact of these changes in the distributions. Their idea to cope with this issue is to include normalisation of the inputs of layers as part of the architecture of ANNs. Thus, the change of inputs is more robust, and the weight adjustment does not depend on the change of distributions of activations produced by previous layers. This achieved independence explains the speedup during training time.[21]

3Neurons and their respective incoming and outgoing connections to other neurons.

4The probability p refers to the chance of keeping a neuron and its connections during the training phase.

Figure 2.4: Example of how Deep Learning refers to the composition of simple concepts to build complex concepts within the domain of image classification. [from [16]]

Scalability

The massive scalability that is achievable when working with ANNs is one of the most important factors contributing to their success. Raina et al.[36] demonstrate the possible use of GPUs to power the training of ANNs. In addition, advancements in hardware such as the Tensor Processing Unit (TPU)5, or other training algorithms such as Genetic Algorithms[44], provide possibilities to further improve the performance of ANNs.

Artificial Neural Networks provide the basis for Deep Learning, which is at the peak of the technology hype, as stated by Gartner Inc.[35]

2.1.2 Deep Learning

According to Goodfellow et al.[16], Deep Learning represents the composition of simpler concepts to build complex concepts. An example application is image classification. For computers, the recognition of objects from an input of raw pixels is a difficult task. Figure 2.4 illustrates the Deep Learning approach to this problem. Instead of a direct mapping, the recognition of objects is built up from simpler concepts. Since an ANN is made up of layers, the knowledge represented can increase with the number of hidden layers. Each layer can represent and extract features by using the knowledge of previous layers. These features describe information that can be extracted from an image or, in general, from given data. Features are learned by the DL algorithm, since constructing them by hand can be very complicated for humans. For example, the first hidden layer is able to detect edges in an image by comparing neighbouring pixels. Based on the edges, corners and other contours can be found. In the following layers, even more complex features or whole objects can be detected. Finally, the ANN will return the label of the classified object. Using this technique, image classification can be used in various domains.[16]

5A performance analysis can be found in [22].

Deep Learning is applied in a variety of fields. In addition to image classification, Natural Language Processing (NLP) and speech recognition are application areas. NLP describes the processing of written human languages (e.g. Swedish or German). Speech recognition describes the concept of understanding human speech[16]. Dialogflow6, by Google, combines both technologies to enable the design of conversational agents. These agents can be used to provide an interface for users to access services of a product.

2.1.3 Convolutional Neural Networks

Convolutional Neural Networks (CNNs), according to [16][27], are a specific type of Artificial Neural Network used for processing data that has a grid-like topology. Hence, the architecture of a CNN contains special operations to cope with such data. An example of a 1-dimensional use case is time-series data7. Image data represents a use case for 2-dimensional data: the pixels of the image can be represented in a 2-dimensional grid.[16][27]

One inherent part of processing such data is the convolution operator. Convolution in the domain of CNNs describes the concept of scanning an input, which could for example represent the pixels of an input image or the output of a previous layer, with a kernel (also known as a filter). This kernel is shifted along the pixels, and the values of the kernel are multiplied with the values of the input at the kernel's position and added up. The output of the operation is a feature map (also known as an activation map). The stride of a kernel describes the number of pixels that the kernel is shifted in a direction. An illustration of how this operator works can be found in Figure 2.5. The values of the kernel are adapted by training the CNN with labelled data9. The adapted weights in a kernel refer to features that can be found in the inputs to a layer. Therefore, the kernel will learn the features in an underlying dataset during training. The convolution operation utilises three ideas to improve ML algorithms[16][29]:

1. Sparse interactions: The kernel size is made smaller than the input size. The important features within a picture can be significantly smaller than the overall input picture. Therefore, less memory is necessary to store the weights of a kernel. In addition, fewer computations are needed to produce the output.

2. Parameter sharing: Since the kernel is shifted along the input, every weight is used at every position in the input. As the kernel is smaller than the input, less memory is necessary.

3. Equivariant representations: As the input changes, the output changes in the same way. For example, if an input image is shifted by 1 pixel, the output will also shift by 1 pixel.

6The website of Dialogflow can be found here: https://dialogflow.com/

7Time-series data can be thought of as a 1-dimensional grid where every cell corresponds to a sample taken at regular time intervals[16].

8When zero-padding is used, cells with the value 0 are added to the matrix to ensure that the kernel can be applied to every element. Without zero-padding, the dimensions of the result can be smaller.

9Similarly to how weights in an ANN are adapted by training with labelled data.

Figure 2.5: Example of convolution operation performed with an input matrix of size 3 × 4 and a kernel of size 2 × 2 with a stride of 1, resulting in an output of size 2 × 3 (without zero-padding8). [from [16]]

Generalisation, and therefore the prevention of overfitting, is an important property for ML algorithms. In image classification, this describes invariance to, for example, (small) translations, which is the shifting of pixels in some direction. Pooling is one way to achieve this. Pooling summarises nearby features. Since this process reduces the input size of the next layer, it decreases the necessary memory and computations. One example of a pooling function is max pooling. This function returns the maximum output of a neighbourhood. For pooling, the stride is typically equal to the width of the kernel[16]. An example can be found in Figure 2.6.
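The sketch below reproduces the two operations on small matrices: a valid (unpadded) convolution with stride 1 as in Figure 2.5, and max pooling with a stride equal to the kernel width as in Figure 2.6. The input values are arbitrary.

import numpy as np

def conv2d(image, kernel, stride=1):
    # Shift the kernel along the input, multiply element-wise and add up (no zero-padding).
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool(feature_map, size=2):
    # Return the maximum of each size x size neighbourhood (stride equal to the kernel width).
    out_h, out_w = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

image = np.arange(12, dtype=float).reshape(3, 4)                   # 3 x 4 input as in Figure 2.5
kernel = np.ones((2, 2))                                           # 2 x 2 kernel
print(conv2d(image, kernel).shape)                                 # (2, 3) without zero-padding
print(max_pool(np.arange(16, dtype=float).reshape(4, 4)).shape)    # (2, 2) as in Figure 2.6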

Using the previously described operations, a CNN is able to handle difficult tasks such as image classification by stacking layers. The result is a deep CNN. Hence, CNNs represent a successful solution used in the field of DL.

10Based on https://en.wikipedia.org/wiki/Convolutional_neural_network, accessed July 20th, 2018.


Figure 2.6: Example of max pooling function performed with an input matrix of size 4 × 4, and a kernel of size 2 × 2 with a stride of 2, resulting in an output of size 2 × 2. [From 10]

2.2 Evolutionary Computation

Evolutionary Computation, according to De Jong[11], describes a community of algorithms, applications, and ideas based on the concept of transferring evolutionary processes, which can be found in nature, to computers. This supports the understanding and analysis of these complex processes. Moreover, the concepts of evolution can be transferred to AI to find new ways of solving problems that are described by some unknown, underlying function[11]. Unlike (stochastic) gradient descent, it is a derivative-free optimisation. Since it does not use the derivative of the cost function that is to be optimised, it can be used to solve problems where the derivative is unreliable or unavailable[9]. It is a very flexible and powerful approach[39]. However, for problems where the derivative is useful, this method would likely be slower, which is also expressed in the no-free-lunch theorem[48]. This theorem states that if an algorithm performs well on one class of problems, there will always be a class of problems on which it performs worse compared to other algorithms. EC techniques can be seen as stochastic optimisation methods since they are used to find near-optimum solutions[10].

Stochastic optimisation is a strategy to handle the highly non-linear and high-dimensional problems that can be found in a variety of domains, including science, engineering, and business. An inherent property of stochastic optimisation for coping with these problems is randomisation. In terms of EC, for example, this randomisation can be found in the mutation operator which, as the algorithm iterates, makes random choices in the search direction towards a solution. Thus, the convergence of the algorithm is sped up, and the problem of getting stuck in local minima is avoided, because the randomisation can lead to new solutions[39].

To be able to represent the evolutionary processes and find a near-optimum solution, EC methods include components that reflect their background in nature. The description of these components is based on [11], [10] and [2].

2.2.1 Population

The population is a collection of candidate solutions, which are called individuals in the following. This means that each individual represents one possible approach to solving the problem. This is the property of EC that distinguishes it from other stochastic optimisation methods: stochastic search works with one candidate solution that is altered, whereas EC uses multiple candidate solutions to move towards the optimum solution[39]. Therefore, the use of a population can be seen as a combination and utilisation of the learning processes of the individuals.

2.2.2 Individual

Each individual represents a candidate solution. An individual is made up of its genotype and the resulting phenotype. The genotype can be seen as the genetic make-up of an individual. An example is illustrated in Equation 2.3.

genotype = < hair color, eye color, skin color, height, weight >    (2.3)

Here the genotype is made up of 5 genes. The phenotype represents the expressed behavioural traits or expressed appearance with respect to its genotype. Hence, the phenotype of an individual is used for the evaluation of its fitness. Evolution aims to create individuals with phenotypes whose behaviour is beneficial for handling a problem. However, this task is not simple, due to the effects of pleiotropy and polygeny. On the one hand, pleiotropy describes the effect of a single gene possibly affecting several phenotypic traits. On the other hand, polygeny represents the effect of a single phenotypic trait possibly being affected by several genes. Hence, the many-to-many mapping between genotypes and phenotypes complicates the search for an individual with a high fitness value, which represents a (near-)optimum solution.

2.2.3 Fitness

The fitness of an individual is a value that represents the capability of solving the problem.

This value is calculated by a fitness function that is tied to a problem. EC can be seen as a maximisation problem. Therefore, the aim is to have individuals with high fitness values.

The fitness is improved over generations which describes the iterative use of reproduction operators and selection to construct a new population from a given population. The fitness values of individuals in the population should increase over time to improve the capability of solving a problem.

2.2.4 Reproduction

Reproduction refers to a set of operations which are used to evolve the population by improving the average and maximum fitness of the individuals. Thus, the capability of solving the problem is advanced over time. The individuals created as a result of the reproduction of parent individuals are called offspring. The two main operations are recombination and mutation. Recombination refers to the combination of two parent individuals to produce offspring. On a higher level, recombination can be seen as the exchange of information between the parents to create an individual that has a higher capability of solving the problem. A mutation is described as the erroneous self-replication of an individual, which means the modification of its genotype. Small modifications are more likely than larger ones.


2.2.5 Selection

Selection is a probabilistic mechanism to select individuals that are deleted or transferred over to the population of the next generation. The selection can be biased or non-biased.

Non-biased means that every individual has the same probability of being selected. On the contrary, biased selection refers to a selection that utilises some information such as fitness to change the probability of an individual being selected. One form of biased selection is elitism.

Elitism selects a definable number of individuals and transfers these without modification directly into the following population.

Both reproduction and selection form the stochastic part of EC. To ensure diversity within the population and avoid premature convergence11, the specific choice of selection and mutation mechanisms, as well as their parameters, has to be made carefully. There is no optimal combination of operators and parameters that fits all problems.

2.2.6 Evolution

The pseudocode in Figure 2.7 illustrates one way to implement EC for solving optimisation problems. It utilises the previously defined concepts to transfer the evolutionary process into actual code. The implementation is based on [2].12

The evolution function is used to evolve a population for solving an optimisation problem. The function has three parameters: num_generations (the number of generations), population_size (the size of the population) and t_fitness (a threshold for the fitness value). The first step is to create an initial population with the specified number population_size of individuals. Afterwards, the fitness of the initial population is evaluated. This evaluation function is the fitness function that has to be implemented according to the problem that is to be solved. Together with the population and its fitness, the evolution can be started. For a defined number of generations num_generations, the population is evolved. First, the population is reproduced: the current population is recombined using the recombine function and then mutated to provide fresh candidate solutions, and thus new approaches to solving the problem. Afterwards, a selection strategy is used to select the individuals that will participate in the next generation. Furthermore, the evolution is stopped if an individual reaches the fitness threshold t_fitness.

This expresses that the individual is able to solve the problem to a defined extent. If either the number of generations or the fitness threshold is reached, the populations and their fitnesses are returned.

This outlines a simple example and illustrates the capability of EC to generalise and be used as an abstract way to solve a variety of problems. Depending on the problem, the fitness evaluation can be modified, and other reproduction and selection functions can be used.

Another important aspect of EC, stated by Fogel[12], is the ability to be parallelised over the population of individuals.

11Premature convergence refers to the process that one individual and its close offspring take over the population.

12The definition of the evolutionary process and its mandatory operations can differ among the literature.

Therefore, the example is not trying to illustrate the only solution but rather one possible definition.


def evolution(num_generations, population_size, t_fitness):
    # Initialise population
    t = 0
    population, fitness = {}, {}
    population[t] = initialize(population_size)
    fitness[t] = evaluate(population[t])

    # Evolve population
    for _ in range(num_generations):
        # Reproduction
        population[t + 1] = recombine(population[t])
        population[t + 1] = mutate(population[t + 1])
        # Selection
        population[t + 1] = select(population[t + 1])
        fitness[t + 1] = evaluate(population[t + 1])

        # Check if the fitness threshold is reached
        if max(fitness[t + 1]) > t_fitness:
            break
        t = t + 1

    return population, fitness

Figure 2.7: Pseudocode of a simplified evolutionary process.

He mentions that the selection, evaluation, and mutation of every individual and their offspring can be independent, which allows parallel execution of the operators.

2.2.7 Genetic Algorithms

Genetic Algorithms (GAs) are a popular form of EC[39]. They were first introduced by Holland[19] in 1975. He states that the crossover operator, which is a form of recombination, is the main operation that distinguishes GA from other forms of EC.

The encoding of the genotype of individuals in the population is also referred to as a chromosome. For the evaluation of an individual, its chromosome is decoded and a fitness value is assigned. Typically, a chromosome is represented by a binary encoding, but other encodings, including real numbers, are possible. However, recombination and mutation operators have to be designed with respect to the used encoding and the problem to be solved[2].

In Figure 2.8 the crossover operator for binary-encoded chromosomes is illustrated. Two parents encoded by a 10-bit binary string produce offspring in the form of two individuals. The individuals are the result of swapping parts of their parents' chromosomes. The depicted crossover operation is called one-point crossover. This means that the parents' chromosomes are interchanged at one random position (also called a locus) in the chromosome. This is also referred to as a splice. The genes from this position until the end of the chromosome are swapped[39]. Mutation with binary-encoded chromosomes works as bit flipping: every bit of a chromosome is flipped with some probability that has to be defined[2].


Figure 2.8: Example of one-point crossover operation. The chromosome of the offspring is the result of a crossover between the parents’ chromosomes at a defined point. [Based on [39]]

Whitley et al.[47] found that although crossover is depicted as the main operator in the context of GAs, mutation by itself is already effective enough to provide near-optimum solutions for a large number of problems.

A popular selection strategy used with GA is tournament selection[2]. Tournament selection describes the random selection of n individuals13 (with or without replacement). From this group of individuals, the best individual is selected and used for further evolution, which refers to reproduction in the form of mutation and recombination. Due to the simplicity of this algorithm, it is very easy to implement. In addition, tournament selection can be conducted in parallel since the selection only works on a subset of the population[15].
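The sketch below gives minimal implementations of the three operators discussed above: one-point crossover (Figure 2.8), bit-flip mutation, and tournament selection. Chromosome length, mutation rate, and tournament size are arbitrary example values.

import random

def one_point_crossover(parent_a, parent_b):
    # Swap the tails of two chromosomes at one random locus.
    locus = random.randint(1, len(parent_a) - 1)
    return parent_a[:locus] + parent_b[locus:], parent_b[:locus] + parent_a[locus:]

def bit_flip_mutation(chromosome, p_flip=0.05):
    # Flip every bit independently with a small, predefined probability.
    return [1 - bit if random.random() < p_flip else bit for bit in chromosome]

def tournament_selection(population, fitness, n=2):
    # Pick n random individuals and return the fittest one (binary tournament for n = 2).
    contestants = random.sample(range(len(population)), n)
    return population[max(contestants, key=lambda i: fitness[i])]

parent_a, parent_b = [0] * 10, [1] * 10        # 10-bit chromosomes as in Figure 2.8
child_a, child_b = one_point_crossover(parent_a, parent_b)
child_a = bit_flip_mutation(child_a)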

GA provides flexibility for solving problems. The encoding and reproduction operations have to be chosen with respect to the problem. The evolution of ANNs using GA will be described in the next section.

2.2.8 Neuroevolution

Stanley[42] describes Neuroevolution as the use of evolutionary algorithms to evolve ANNs.

He relates Neuroevolution to the evolution of brains, transferred to computers. The resulting evolved ANNs are constructed with the aim of solving a specific task. Early approaches focused on evolving weights for a fixed architecture (also referred to as a topology). Hence, it can be seen as another way of training ANNs, and therefore a replacement for other algorithms (gradient-based ones such as backpropagation). However, due to the fixed topology, these ANNs are not able to grow or change their structure (add or remove connections between neurons, or add neurons to existing ANNs). The next step was to introduce algorithms that are able to evolve both topology and weights. This concept is called Topology and Weight Evolving Artificial Neural Networks (TWEANNs)14[43].

NeuroEvolution of Augmenting Topologies (NEAT) represents a popular approach to the concept of TWEANNs. NEAT was designed by Stanley and Miikkulainen[43] in 200215. The algorithm uses a GA to evolve ANNs. A key feature of this method is to reduce the complexity of the produced networks.

13If n = 2, this is also called binary tournament selection[39].

14Therefore, the used encoding of ANNs in these algorithms can have a variable length.

15The users page of NEAT which includes a description, software catalogue and more can be found here:

https://www.cs.ucf.edu/~kstanley/neat.html


Figure 2.9: Example of encoding used in NEAT and how it maps to the resulting phenotype.

[from [43]]

This is achieved by starting off with simple topologies and gradually increasing the topology of the networks over time. Minimising the structure leads to an improved learning speed[43].

The encoding used in NEAT is illustrated in Figure 2.9. It contains the node genes, which specify the number of nodes and their types16. In addition, it consists of the connection genes, each specifying a connection in the resulting ANN. Each connection gene contains its weight and the indices of the input and output neuron. Furthermore, it has a flag which states whether the gene is enabled or disabled. If a connection is disabled, the gene is recessive, which means it is not expressed in the resulting phenotype. As a result, the connection is not used in the resulting ANN. Moreover, the connection genes contain a so-called innovation number. This number can be seen as a historical marker to trace back the genome's creation steps and allow comparison to other genomes. For every added connection, an innovation number is assigned, and the global counter is increased to be able to distinguish the origins of genomes over time[43].

Mutation in NEAT works on two levels: weight and structural mutation17. Weight mutations of connections slightly alter the weight18 or replace the weight with a randomised one. Structural mutations change the topology of ANNs: new nodes or connections can be added. New connections between nodes are initialized with a random weight. If a new node is added, a connection between two other nodes is split into two connections with the new node in the middle. The old connection is disabled and two new connections are added. This enables the traceability of the genome's history. To reduce the impact of adding a new node to the topology, the new connection having the new node as its output receives a weight of 1. The connection from the new node uses the weight that the old connection had. Due to the introduction of these mutations, ANNs are able to grow gradually in size over time. Thus, the complexity of the resulting ANNs is reduced[43].
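A simplified sketch of the add-node mutation just described: the old connection is disabled and replaced by two new connections through the new node, the first with weight 1 and the second reusing the old weight. The genome representation and the innovation counter are assumptions for illustration, not NEAT's actual data structures.

import itertools

innovation_counter = itertools.count(1)     # global counter of innovation numbers

def add_node_mutation(genome, connection_index, new_node_id):
    old = genome['connections'][connection_index]
    old['enabled'] = False                                  # keep the gene for traceability
    genome['nodes'].append({'id': new_node_id, 'type': 'hidden'})
    genome['connections'].append({'in': old['in'], 'out': new_node_id,
                                  'weight': 1.0,            # minimise the impact of the new node
                                  'enabled': True,
                                  'innovation': next(innovation_counter)})
    genome['connections'].append({'in': new_node_id, 'out': old['out'],
                                  'weight': old['weight'],  # reuse the old connection's weight
                                  'enabled': True,
                                  'innovation': next(innovation_counter)})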

The introduced innovation numbers simplify the process of reproduction in the form of crossover, since two genomes can be aligned according to their comparable origin.

16Sensor type neurons refer to neurons in the input layer.

17The source code of the original implementation can be found here: https://github.com/FernandoTorres/NEAT/blob/master/src/genome.cpp

18For example, drawn from a Gaussian distribution where the mean is the current weight.


Figure 2.10: Example of crossover used in NEAT. The two mating genomes are aligned with respect to the innovation numbers of their genes. Hence, matching, disjoint and excess genes can identified. [from [43]]

The crossover operation used in NEAT is illustrated in Figure 2.10. Due to the increasing global innovation number, the genes of two genomes can align in three different ways. The matching genes, which are the connection genes with the same innovation number, can be aligned side by side. They are inherited randomly from both parents. Genes that only appear in one of the genomes are either disjoint or excess genes. They contain information that the other genome does not have. Disjoint and excess genes are inherited from the parent with the higher fitness value19. The resulting crossover operator represents a more informed way of reproduction. The resulting offspring inherits the innovation numbers of its received genes, and therefore traceability is possible. However, it is important to point out that genes with the same innovation number can contain different weights for their connections[43].

Both mutation and crossover are ways to ensure diversity within the population of ANNs in NEAT. Another concept to prohibit premature convergence, by preventing individuals from rapidly taking over the population, is called speciation. Speciation specifically protects innovation. The fitness of mutated and reproduced individuals tends to drop in the beginning of the evolutionary process. However, they could contain innovation that will eventually be important to solve the task, if given enough time. The idea of speciation is to divide the population into groups referred to as species. Individuals are able to evolve and optimise their structures independently, without the danger of being removed due to their initially low fitness[43].

19In case both parents have the same fitness value, the disjoint and excess genes are inherited randomly from both parents.

To be able to divide the population into species, the origins of each genome in the form of innovation numbers are used. The number of disjoint and excess genes between two genomes represents the distance between the respective genomes. It allows the comparison of their similarity. In NEAT, the compatibility distance δ depends on the number of disjoint and excess genes, and the differences of weights in matching connection genes. Equation 2.4 shows the calculation of the compatibility distance[43].

δ = c1 · D / N + c2 · E / N + c3 · W    (2.4)

The coefficients c1, c2 and c3 are used to modify the importance of the number of disjoint genes D, number of excess genes E, and average weight difference W . To normalise the number of disjoint and excess genes, the number of genes N of the longer genome is used.

Using a compatibility threshold δt, newly produced or mutated individuals can be assigned their species using an ordered list of existing species. An individual is compared sequentially with a random individual of each available species. The individual will be placed into the first species where the compatibility distance is lower than δt. If there are no species or the individual does not fit into any species, a new species will be created[43].
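A small sketch of Equation 2.4 and of the speciation test described above; counting the matching, disjoint, and excess genes via innovation numbers is assumed to happen elsewhere, and the coefficient values are placeholders rather than NEAT's tuned settings.

def compatibility_distance(n_disjoint, n_excess, avg_weight_diff, n_genes,
                           c1=1.0, c2=1.0, c3=0.4):
    n = max(n_genes, 1)                      # number of genes in the longer genome
    return c1 * n_disjoint / n + c2 * n_excess / n + c3 * avg_weight_diff

def assign_species(distances_to_representatives, delta_t):
    # Return the index of the first species whose representative is closer than delta_t,
    # or None if the individual should found a new species.
    for index, delta in enumerate(distances_to_representatives):
        if delta < delta_t:
            return index
    return None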

Using the concepts described above, NEAT was able to achieve state-of-the-art results in reinforcement learning tasks20[43]. However, researchers are now combining the fields of Neuroevolution and supervised learning to automate the architecture search for ANNs. Recent work will be presented in Chapter 3.

2.3 Cloud Computing

Cloud computing is a widely used term with various definitions. The National Institute of Standards and Technology (NIST)[31]21 defines Cloud computing as follows:

Cloud computing is a model for enabling ubiquitous, convenient, on-demand net- work access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.

20Reinforcement learning tasks refer to tasks with sparse feedback: unlike in supervised learning, where every input is labelled, there is no label or feedback for every action taken[41].

21NIST is part of the U.S. Department of Commerce[31].

The five characteristics inherent to Cloud Computing that NIST mentions can be summarised as follows[31]:

• On-demand self-service: The resources requested by the user are configured and provisioned automatically.

• Broad network access: The resources can be accessed over the internet using standard services (a client in a web browser or SSH connections).

• Resource pooling: The resources, including storage, network, or computing units, are shared among customers (also referred to as a multi-tenant model). The users are not able to see or trace the exact underlying resources, including their location. However, users may be able to choose a higher-level location (e.g. regions defined by the provider, country, etc.).

• Rapid elasticity: Resources can be changed elastically depending on the user’s needs.

This enables scalability in both directions.

• Measured service: The provided cloud service should give a transparent view of the costs. The calculation of the costs can vary depending on the resources used (storage, network bandwidth, CPU time).

Furthermore, the NIST distinguishes between three service models namely Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).

IaaS describes the configuration and provisioning of computing resources such as network (managing incoming and outgoing connections), storage, processing units (number of virtual CPUs). These resources can be used to install operating systems and deploy applications, or any other necessary tools[31]. The Google Compute Engine (GCE)22represents one example for an IaaS.

PaaS offers the deployment of applications. The cloud infrastructure is abstracted, but the user can configure the underlying platform. The platform includes the used software, programming languages, libraries, and other tools used for the deployment of created applications. The Google App Engine23 represents one example of PaaS. Google specifically states in their documentation that they provide a managed environment which enables the user to focus on the programming part[31].

SaaS provides an application that is running on a cloud infrastructure. This includes the abstraction of both infrastructure and platform. Hence, the user does not need to manage any underlying resources. Examples of SaaS are Google applications such as Google Mail24[31].

With these three service models, users can choose the model that best fits their requirements.

2.3.1 Virtualisation

In Cloud Computing, operating systems, environments, and applications are running on virtualised hardware. Virtualisation as described by Joy[23] enables the abstraction of the used hardware. Users of a cloud service are only able to see the virtualised hardware, but not the actual underlying hardware. This is necessary to conduct resource pooling, one of the characteristics of Cloud Computing[23].

22The website of the GCE can be found here: https://cloud.google.com/compute/

23The documentation of the Google App Engine can be found here: https://cloud.google.com/appengine/

24Google Mail can be found here: https://mail.google.com/


Figure 2.11: Comparison between virtualisation techniques: VMs (left) and containers (right). [based on [23]]

In the following, two approaches to perform virtualisation, namely Virtual Machines (VMs) and containers, will be introduced.

Virtual Machine

One technique to achieve virtualisation is the use of VMs. To be able to run a VM on a host system, a so-called hypervisor is used. The hypervisor is capable of running different guest operating systems on top of the physical host. The architecture for this approach is shown in Figure 2.11 on the left-hand side. To achieve this, the hypervisor abstracts from the actual hardware and provides virtualised hardware for the VMs. Thus, computations are more expensive compared to directly accessing the hardware or host operating system. However, the abstraction provides security and prevents processes from accessing resources outside the VM[23].

Container

The other technique that Joy describes is the use of containers. Containers represent a lightweight alternative to VMs where, instead of a hypervisor, a so-called container engine is used. Therefore, containers are able to use the actual hardware instead of virtualised hardware. Figure 2.11 illustrates this approach on the right-hand side. Building, starting, and stopping containers is very fast, which reduces turnaround time during development, since VMs are more costly with respect to these steps. In addition, underlying resources can be utilised more efficiently since no hypervisor is used and containers do not require a full-fledged guest operating system. This allows large numbers of containers to be used if necessary[23].
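As a hedged illustration (not part of the thesis setup), the snippet below uses the Docker SDK for Python to start a short-lived container from Python; it assumes a local Docker daemon and that the `alpine` image is available or can be pulled.

```python
import time

import docker  # Docker SDK for Python

client = docker.from_env()              # connect to the local container engine

start = time.time()
output = client.containers.run(         # create, start, and run a throw-away container
    "alpine",
    ["echo", "hello from a container"],
    remove=True,                        # remove the container once it exits
)
print(output.decode().strip())
print(f"Container lifecycle took {time.time() - start:.2f} s")
```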

2.3.2 Scalability

To be able to utilise the resources that Cloud Computing provides, applications have to be designed in a way that allows them to scale. Lantz[28] describes the properties of algorithms that enable scalability: the work involved consists of a number of tasks that can be executed as independently as possible from each other.


Figure 2.12: Amdahl’s law: impact of the serial portion of an application on its scalability. [Based on [28]]

In addition, the communication overhead between tasks and in the entire application should be as small as possible. Each task should involve long computations, meaning that the runtime of a single task should be larger than the communication overhead. Furthermore, it should be worthwhile for the application to scale to as many tasks as possible. Most importantly, the number of scheduled tasks should not interfere with the above properties[28].
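A minimal sketch of this task structure is shown below (illustrative only, not the thesis code): each fitness evaluation is an independent, long-running task, and the only communication is sending a candidate to a worker and receiving its fitness.

```python
from multiprocessing import Pool

def evaluate(candidate):
    # Placeholder for a long computation, e.g. training a candidate ANN
    # and returning its validation accuracy.
    return sum(candidate) / len(candidate)

if __name__ == "__main__":
    population = [[0.1, 0.5, 0.9], [0.2, 0.4], [0.7, 0.3, 0.8, 0.6]]
    with Pool(processes=4) as pool:
        # Each candidate is evaluated independently of the others.
        fitnesses = pool.map(evaluate, population)
    print(fitnesses)
```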

Vaquero et al.[46] describe two ways of extending the infrastructure to support scalability: horizontal and vertical scaling. They outline that horizontal scaling refers to adding new servers25, whereas vertical scaling describes the process of adding hardware (for example CPUs) to running servers. However, they also state that vertical scaling can be difficult to achieve since common operating systems are not able to dynamically adjust to the extended hardware[46].

Lantz distinguishes two measures of scalability: strong and weak scaling.

Strong scaling describes the scalability of an application when the problem size is fixed. The goal is to compute the solution N times faster. Equation 2.5 describes how to calculate the speedup S with respect to the number of CPUs N for an application. The speedup describes how many times faster a computation is using a given number of CPUs compared to using a single CPU[28].

\[ S(N) = \frac{T(1)}{T(N)} \qquad (2.5) \]

T(1) describes the time it takes to compute the problem with one CPU. Likewise, T(N) describes the time it takes to compute the problem on N CPUs.

25Also referred to as nodes here.


A linear speedup is achieved when S(N) = N. Due to Amdahl’s law, this is very hard to achieve because of the serial parts of an application that are not parallelisable. Taking the serial portion into account, the equation can be rewritten as Equation 2.6[28].

\[ S(N) = \frac{s + p}{s + \frac{p}{N}} \qquad (2.6) \]

The constants s and p describe the serial and parallelisable portion of an application where s + p = 1. Figure 2.12 illustrates the impact of the serial portion on the scalability[28].

The figure shows the speedup S for different parallelisable portions p (100%, 99.9%, 99%, and 90%) with respect to the number of CPUs N. The blue line indicates the optimum: an application that has no serial parts and is therefore 100% parallelisable. Even if just 0.1% of an application is not parallelisable, the speedup will eventually plateau and diverge significantly from the optimal speedup. For a serial portion of 10%, the speedup stagnates even earlier. Overall, it is not possible to reach the optimal speedup even if the serial portion is far smaller than the parallel portion. In addition, with every CPU added, the improvement in speedup decreases until it eventually plateaus[28].
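The plateau can be reproduced directly from Equation 2.6; the following short sketch (illustrative numbers only) prints the speedup for the parallelisable portions shown in the figure.

```python
def amdahl_speedup(n_cpus, p):
    """Speedup according to Equation 2.6 with serial portion s = 1 - p."""
    s = 1.0 - p
    return (s + p) / (s + p / n_cpus)

for p in (1.0, 0.999, 0.99, 0.9):
    speedups = [amdahl_speedup(n, p) for n in (1, 16, 256, 4096)]
    print(f"p = {p:5.3f}: " + ", ".join(f"{x:8.1f}" for x in speedups))
# Even for p = 0.999 the speedup saturates near 1/(1 - p) = 1000,
# far below the linear optimum S(N) = N.
```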

Weak scaling makes a different assumption about the problem: the problem size grows with the number of CPUs used. Therefore, the relative serial portion shrinks as more CPUs are used. This assumption is realistic for various problems, although it depends on how the serial part grows; to limit its impact, it should either be constant or grow slowly. Optimal weak scaling would therefore lead to a constant runtime for solving the (growing) problem[28].
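The sketch below illustrates this assumption with made-up timings: the parallel work per CPU stays constant, while the serial part is assumed (purely for illustration) to grow slowly, here logarithmically, with the number of CPUs.

```python
import math

def weak_runtime(n_cpus, t_parallel=10.0, t_serial=1.0):
    # Parallel work per CPU is constant; the serial part grows slowly with N.
    return t_serial * (1.0 + math.log2(n_cpus)) + t_parallel

for n in (1, 8, 64, 512):
    efficiency = weak_runtime(1) / weak_runtime(n)
    print(f"N = {n:4d}: runtime = {weak_runtime(n):5.1f}, efficiency = {efficiency:.2f}")
```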

Both measures can give insight into the scalability of an application. For both ANNs and EC, scalability is essential to reduce the overall computation time.

Joy points out that containers are very scalable due to their lightweight architecture. Thus, they are very suitable for cloud infrastructures.


3 Related Work

Recent work moves towards combining Neuroevolution and DL, so that the evolution of ANN architectures and their training via supervised learning can be integrated. This automates the search for optimised ANNs for any problem with a given labelled dataset. In the following, related work in this field will be presented.

Recent papers illustrate that these methods can reach state-of-the-art results on image classification tasks. Miikkulainen et al.[33] introduced a modification of the NEAT algorithm called CoDeepNEAT. They added two features to be able to evolve deep ANNs.

The first extension is to use layers (for example fully connected, recurrent, or convolutional) instead of single neurons to construct these networks. Each layer consists of its hyperparameters (for example the kernel size of a convolutional layer), which are evolved. The second extension is a modularised approach to building ANNs1. Each module can be seen as a small ANN, and these modules are used repeatedly to build larger ANNs. During evolution, two populations are used: one population evolves modules, while the second evolves blueprints, which are an abstract description of an ANN’s architecture.

The nodes in a blueprint contain species of modules. For the evaluation, a blueprint and the modules it uses are combined to construct the final ANNs. This technique has been applied successfully to different domains. For image classification, their best network achieved an accuracy of 92.7% on the CIFAR-10 dataset2.

In addition, they constructed recurrent ANNs using Long Short Term Memory (LSTM)[18] cells for language modelling3. These cells can be seen as modules that can memorise information over time. With the use of blueprints and modules, CoDeepNEAT is able to find new architectures for memory cells. Furthermore, the algorithm is able to find new ways of connecting LSTM cells. Overall, Miikkulainen et al. state that their approach of evolving ANNs is flexible and able to reach the level of ANNs designed by humans[33].

Another approach, proposed by Real et al.[38], reached an accuracy of 94.6% on CIFAR-10. To achieve these results, they focused on the scalability of Neuroevolution. To prevent the workers that evolve the population from idling, binary tournament selection was chosen as an asynchronous selection strategy. The binary comparison allows workers to evaluate individuals independently. Their individuals are encoded as graphs of convolutional layers.
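One possible form of such an asynchronous binary tournament step is sketched below; this is an illustration of the idea, not the implementation of Real et al., and `mutate` as well as the stored fitness values are assumptions for the example.

```python
import copy
import random

def binary_tournament_step(population, mutate):
    """A single worker samples two individuals, keeps the fitter one as parent,
    and appends a mutated copy; no generation-wide synchronisation is needed."""
    parent_a, parent_b = random.sample(population, 2)
    winner = parent_a if parent_a["fitness"] >= parent_b["fitness"] else parent_b
    child = mutate(copy.deepcopy(winner))   # offspring via mutation only (no crossover)
    population.append(child)
    return child
```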

1An example for a modular ANN can be found in the GoogLeNet[45] with its inception modules.

2The description of the CIFAR-10 dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html

3Language modelling refers to the prediction of the next word in a text[33].


They introduced a variety of mutations, which are similar to decisions that a human could make when altering the architecture of an ANN[38]. In a subsequent paper, Real et al.[37] propose another approach that uses normal and reduction cells. This approach is based on the work of Zoph et al., which uses Reinforcement Learning to search for architectures of ANNs. Each cell is a convolutional layer and is used to either retain or reduce the input size of an image4. The structure of the cells can be altered during a search phase. The architecture of the resulting ANN depends on two parameters, N (cell stacking depth) and F (channel depth): N refers to the number of normal cells that are repeatedly stacked in the final ANN, and F describes the number of filters used in the initial convolutions. The resulting models set a state-of-the-art mean accuracy of 97.87% on CIFAR-10[37]. Neither paper makes use of crossover or any other recombination during the evolution, but this is pointed out as possible future work[38][37].

Due to differing experimental setups and hardware, comparing the resulting performances can be difficult. Overall, the number of recently published papers reflects the need for and interest in Neuroevolution. Scalable ML platforms such as Microsoft Azure5 or Deep Learning in the cloud with Kubernetes6 show the current state of distributed ML.

4In the reduction cell a stride of 2 is used for the initial operation of the cell to reduce the size of the resulting output.

5Microsoft Azure: The website of Microsoft Azure can be found here: https://azure.microsoft.com/en-us/services/machine-learning-studio/

6RiseML offers a Machine Learning platform which is running on Kubernetes: https://riseml.com/

References
