Privacy-Preserved Federated Learning
A survey of applicable machine learning algorithms in a federated environment Robert Carlsson
Department of Information Technology, Uppsala University
There is potential in the fields of medicine and finance for collaborative machine learning. These areas gather data that can be used to develop machine learning models predicting everything from illness in patients to economic crimes such as fraud. The problem is that the collected data is mostly of a confidential nature and should be handled with precaution. This makes the standard way of doing machine learning - gathering data at one centralized server - undesirable, since the safety of the data has to be taken into account. In this project we explore the Federated learning approach of "bringing the code to the data, instead of the data to the code". It is a decentralized way of doing machine learning where models are trained on connected devices and data is never shared, keeping the data privacy-preserved.
Printed by: Reprocentralen ITC, UPTEC IT 20041
Examiner: Lars-Åke Nordén
Subject reader: Andreas Hellander
Supervisor: Salman Toor
uous positive attitude towards the learning process in a project. I also want to extend my thanks to my reviewer Andreas Hellander for his guidance and assistance in the project.
Contents

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Project objective
2 Related Work
3 Theory
  3.1 Machine learning
    3.1.1 Support-vector machines
    3.1.2 Neural networks
  3.2 Federated machine learning
    3.2.1 Local Model
    3.2.2 Global Model
    3.2.3 Model aggregation
    3.2.4 Federated support-vector machine
    3.2.5 Federated neural networks
4 System design
  4.1 Design choices
  4.2 System components
    4.2.1 Federated Server
    4.2.2 Flask web server - Communication protocol
    4.2.3 Configuration phase
    4.2.4 Device configuration phase
    4.2.5 Training phase
  4.3 Device
5 Data set
  5.1 IID and Non-IID
  5.2 Unbalanced data sets
6 Use cases
  6.1 Federated Support-vector machines
    6.1.1 Use case 1
    6.1.2 Use case 2
    6.1.3 Use case 3
  6.2 Federated neural network
    6.2.1 Use case 4
    6.2.2 Use case 5
7 Results
  7.1 Federated support-vector machine
    7.1.1 Use case 1
    7.1.2 Use case 2
    7.1.3 Use case 3
  7.2 Federated neural network
    7.2.1 Use case 4
    7.2.2 Use case 5
    7.2.3 Use case 6
    7.2.4 Use case 7
8 Conclusion and Future work
  8.1 Future work
References
1 Introduction
1.1 Background
Machine learning has the potential to add value in fields like medicine and finance. According to Obermeyer et al. [13], machine learning algorithms will transform the field of medicine. One of those transformations could be collaborative machine learning between hospitals, where each hospital has its own data that could be beneficial for the others. The confidential nature of medical data makes it unattractive to transfer it to a centralized server, since doing so could cause problems in keeping privacy. In the cases of medical screening [14] and disease outbreak discovery [9], search query logs were used to predict the wanted outcomes. In these cases the data was anonymous, but traditional data anonymization methods [12] have weaknesses that can make it possible to match data even when anonymized. Mohammed et al. [12] work on a collaboration between two different parties, a medical institution and a blood transfusion service. When either party wants to send information about their patients, strict privacy procedures are taken into account to enforce anonymity.
A solution to the problem of keeping data privacy-preserved was suggested by McMahan et al. [10]. They developed a decentralized machine learning algorithm that solves the issue of transferring data. Instead of handling the data with care when transferring it to a designated server, they advocate leaving the data at its source, keeping it distributed. The training is then performed at each local participant, and the result of the training is gathered instead of the data. They termed this approach Federated Learning. The focus of their research was the large amount of data that comes from mobile devices, using it to train language and image models to improve the user experience.
1.2 Problem statement
• Which machine learning algorithms are suitable in a federated learning process, and under what circumstances?
• What is needed to create a federated learning platform that could be adopted by medical institutes or financial organizations like banks?
This thesis project tests the ideas of Federated Learning in different scenarios and with different machine learning algorithms to get an understanding of the usability of the method. It also evaluates the FederatedAveraging algorithm [10], which combines local stochastic gradient descent (SGD) with a server that performs model aggregation. The algorithms used are support-vector machines (SVM) and neural networks (NN), tested for their efficiency in this environment.
After testing, the project aims to create the basis for a platform that could support federated learning. When such a system is created and maintained, new participants can join, benefit from the global model, and contribute to it. The system does not aim to work for a large corpus of users, as McMahan et al. proposed, but instead targets a limited number of users.
2 Related Work
The two machine learning algorithms chosen for this project were the SVM and the NN. Zanaty et al. [16] compare these two algorithms across different data sets. They wanted to answer the question of which kernel lets an SVM achieve a high accuracy compared to a NN. The work presents four proposed kernels that are tested and compared to a NN, where the last one, the Gaussian Radial Basis Polynomial Function (GRPF), outperforms all the other kernels as well as the neural network. This shows good promise for using SVMs in this project for further research in Federated Learning. One problem that can arise, however, is that when the kernel is of a non-linear nature, it has to be explored whether the standard model aggregation can still be used.
Neural networks have shown their capacity in handling many different tasks, such as image classification [7], where Krizhevsky et al. use deep convolutional neural networks to achieve a low error rate when classifying a large amount of high-resolution images. Their final network is a set of five convolutional and three fully-connected layers, spanning over 60 million parameters, and took five to six days to train on two - at that time - strong GPUs. This was then the state of the art of image classification and showed what was possible. This project will aim for an image classification problem, but will not handle such an advanced neural network or hard classification task.
Another neural network method, the recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units [17], has shown to be very efficient in language models [11] and has been used in many variants of federated learning. Hard et al. [4] specifically target Gboard - the Google keyboard^1 - and its next-word prediction. They used a variant of an RNN with LSTM called Coupled Input and Forget Gate (CIFG) and evaluate the methods by the recall of next-word predictions. Their results show that the federated CIFG outperforms the N-gram method and is on par with the centralized counterpart where all the data is gathered at a centralized server.
In the classical way of machine learning, data is first collected at a centralized server and then either trained on there or distributed to other devices in a private data center. This produces two obstacles, as McMahan et al. [6] state: data can be large, so transferring it costs time and resources, and data can be of a private nature, where sharing can inflict privacy issues [15]. To handle these situations, McMahan et al. [10] conceptualized a decentralized approach that leaves the data at its source and uses the holding device to do the machine learning, and termed it Federated Learning. Their main objective was to leave the training data at the devices holding it, and to set up a centralized server that conveys a federated learning algorithm for a large group of devices to collaboratively do machine learning, with the goal of creating a global model for all participants to share while keeping the data privacy-preserved.
3 Theory
3.1 Machine learning
This project covers two machine learning algorithms and tests their functionality in a federated learning environment. Both are explained in detail below.
3.1.1 Support-vector machines
The support-vector machine (SVM) is a supervised machine learning algorithm used to analyze data for classification [5]. It finds one or several hyperplanes that separate and classify data in a high-dimensional space. The goal is the hyperplane that separates the two groups with the largest margin. In figure 1, two of the red data points from the negative group and one blue point from the positive group border the margin; they act as support-vectors for defining the largest margin.
^1 gboard.app.goo.gl/get
Figure 1: support-vector machine
w · x + b = 0 (1)
y = mx + b (2)
Equation 1 is the formula for a hyperplane, representing the line in the middle of the margin (compare with the familiar line equation 2). To create a decision rule for classifying data points into the positive or negative group, we need a mathematical expression that shows whether a point is above or below the median of the margin. First we have the vector w, which is perpendicular to this median and of arbitrary length. Then we take an unknown data point, represented as the vector u, and want to check whether it is a negative or positive example. We project u onto w, which tells us on which side of the median it lies. When equation 3 is fulfilled, the data point x is a positive example.
w · x + b ≥ 0 (3)
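As a concrete illustration (not part of the thesis's experiments), the decision rule of equation 3 can be evaluated numerically; the hyperplane parameters and sample points below are made-up values:

```python
def classify(w, u, b):
    """Decision rule from equation 3: positive class iff w·u + b >= 0."""
    dot = sum(wi * ui for wi, ui in zip(w, u))
    return 1 if dot + b >= 0 else -1

# Illustrative hyperplane: w = (1, 1), b = -3, i.e. the line x + y = 3.
w, b = (1.0, 1.0), -3.0
print(classify(w, (2.0, 2.0), b))  # point above the line -> 1
print(classify(w, (0.5, 0.5), b))  # point below the line -> -1
```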
A single decision rule for all the data points is wanted instead of one per class. Two new equations are introduced, similar to the one before. For a known positive sample the expression will be greater than or equal to 1, and for a known negative sample less than or equal to -1. We also introduce a new variable y that takes the value 1 or -1 depending on the sample being processed.

w · x_+ + b ≥ 1
w · x_− + b ≤ −1    (4)

y = 1 if x is a positive sample, −1 if x is a negative sample
Now equation 5 can be created, expressing the decision rule for all the data points with the help of the variable y.
y_i (w · x_i + b) − 1 = 0    (5)
The best hyperplane separating the classes is the one with the largest margin. Equation 6 expresses the width of the margin: two support-vectors lying at opposite edges of the margin are subtracted from each other to get the vector in the figure, projected onto w, and finally normalized by the size of w.

width = ((x_+ − x_−) · w) / ||w||    (6)
Using the decision rule of equation 5 for the support-vectors on the margin, algebraic reduction gives the next expression:

width = 2 / ||w||    (7)

Maximizing this width is equivalent to minimizing (1/2)||w||^2. Introducing Lagrange multipliers α_i for the constraints of equation 5 gives the Lagrangian:

L = (1/2)||w||^2 − Σ_i α_i [y_i (w · x_i + b) − 1]    (8)
Differentiating with respect to w gives equation 9, and differentiating with respect to b gives equation 10. These two equations will be used to rewrite equation 8.
w = Σ_i α_i y_i x_i    (9)

Σ_i α_i y_i = 0    (10)
Substituting these back into equation 8 and simplifying, we end up with equation 11:

L = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)    (11)
In the process of training an SVM there are parameters that cover different aspects of the training: kernel, regularization, gamma and margin. The kernel is the transformation function that takes the data to another dimensional space where the SVM can linearly separate the classes. Regularization, known as the parameter C, represents how much the SVM will focus on avoiding misclassifying the data points. Gamma controls how much impact far-away data points have on the classification.
This project will only focus on the linear SVM, since there the coefficients represent the hyperplanes and the aggregation of these models makes mathematical sense. The model is represented as the coefficients of the vector orthogonal to the hyperplane together with the intercept of this vector. These arrays of weights can be aggregated and combined into a global model.
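A minimal sketch of this aggregation, assuming the per-device hyperplane parameters have been extracted in the array layout scikit-learn uses for `coef_` and `intercept_` (the concrete numbers are invented for illustration):

```python
import numpy as np

def aggregate_linear_svm(coefs, intercepts):
    """Average per-device hyperplane parameters into a global model.

    coefs      : list of arrays, one coef_-style array per device
    intercepts : list of arrays, one intercept_-style array per device
    """
    global_coef = np.mean(coefs, axis=0)
    global_intercept = np.mean(intercepts, axis=0)
    return global_coef, global_intercept

# Two hypothetical devices with similar hyperplanes.
coefs = [np.array([[1.0, 2.0]]), np.array([[3.0, 4.0]])]
intercepts = [np.array([-1.0]), np.array([-3.0])]
w, b = aggregate_linear_svm(coefs, intercepts)
print(w)  # [[2. 3.]]
print(b)  # [-2.]
```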
3.1.2 Neural networks
Neural networks are machine learning algorithms inspired by biological neural networks, better known as brains. The primary part of a neural network is the neuron, also known as a node. The network is structured with these nodes in layers, where each layer has a certain number of connections to the nodes as inputs and outputs. The node is modeled as a function that uses previously known information together with its own bias value to calculate its activation value. The previously known information can be either inputs from a data source or, for a hidden or output layer, the activation values of the previous layer with their corresponding weights.
Figure 2: Parts and connections of a node. It exists inside a layer
Two different versions of neural networks are the standard feedforward neural network and the RNN. The feedforward neural network is in principle a straight line of information from the input, through a number of layers, to the output. It handles one input at a time and is trained at every such session, so no recollection of the previous input can be taken into account. This is where the RNN differs: it has the ability to retain temporal information from previous input examples. It is built with the normal layering from input to output, but then sends information on to the next iteration of the network, giving the next input information about the previous example.
This forwarding of information introduces a problem for the RNN. Like the feedforward network it uses backpropagation to train itself, but in this variant it suffers from the vanishing gradient problem. The solution is the Long Short-Term Memory (LSTM): a set of gates that holds the information of the system for each step during the training. [17]
RNNs are very useful for language recognition tasks, as shown in the keyboard prediction article from Google [4]. This project will focus on image recognition of handwritten digits and will make use of a standard NN instead of an RNN.
A neural network is made up of at least three different parts: one input layer, zero, one or more hidden layers, and one output layer, with all but the input layer having a non-linear activation function that activates the nodes in that layer. Each layer is represented as a number of nodes, where in a standard densely connected layer each node has a connection to all nodes in the next layer. All these connections between the nodes in the network are represented as weights: floating-point numbers representing the strength of the connection between the nodes.
Figure 3: The different layers and possible structure of a neural network
In the standard setting each node has a bias value, and the bias values for the nodes in a layer together make up a bias vector. The bias vectors together with the weights of the connections represent a neural network model. Neural networks utilize a supervised learning technique called backpropagation, which starts from the outcome at the output layer and works backwards to create the next version of the model.
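The role of weights and biases described above can be sketched as the forward pass of one densely connected layer; this is an illustrative toy in pure Python, not the project's code, and the numbers are made up:

```python
import math

def sigmoid(x):
    """A common non-linear activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_forward(inputs, weights, biases):
    """Forward pass of one densely connected layer.

    Each output node j computes sigmoid(sum_i inputs[i] * weights[i][j] + biases[j]),
    i.e. incoming activations scaled by connection weights, plus the node's bias.
    """
    n_out = len(biases)
    return [
        sigmoid(sum(a * weights[i][j] for i, a in enumerate(inputs)) + biases[j])
        for j in range(n_out)
    ]

# Two inputs feeding a layer of two nodes.
inputs = [1.0, 0.5]
weights = [[0.4, -0.2],   # connections from input 0 to nodes 0 and 1
           [0.3,  0.6]]   # connections from input 1 to nodes 0 and 1
biases = [0.0, 0.1]
print(dense_forward(inputs, weights, biases))
```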
3.2 Federated machine learning
In federated machine learning it is important to differentiate between two types of models: the local and the global model. The local one is the model at the device level, and the global model is a summary of the local models for the whole federation.
3.2.1 Local Model
The local model resides at the device level and is a representation of the knowledge of that specific device. It is trained with the local data set and then sent to the centralized unit that collects and refines these models. It is affected by any limitations that the local data set has, such as having too few data points or not being representative of the whole collection of data sets.
3.2.2 Global Model
The federated learning process circulates around the use of a global model, which in essence is a combination of the local models. This is both the strength and the weakness of the system: if the local models are a good representation of the underlying data structure and are similar to each other, then an aggregation will create a stronger and better global model.
3.2.3 Model aggregation
In order to create a global model from the local ones, the process of model aggregation is used. It is a way of combining several models into a single one, with the purpose that the outcome should be more robust and accurate. In the paper by McMahan et al. [6] the model aggregation method used is named FederatedAveraging. This method aggregates the updated local models using a weight for each model: the amount of data points the specific device used when training, relative to the whole set of the federation.
Three different algorithms are presented here for achieving such aggregation. The first algorithm shown is a version of FederatedAveraging with a few changes from the original. In this version all participants train each round, instead of a subset being picked randomly from the whole federation. The weight of each model is not tied to the relative amount of data points; the models are instead simply added together and divided by the number of participants. This last change was made so that no information about the data sets on the local devices leaves its source. It also strengthens the influence of unbalanced data sets, which, if the algorithm still performs well, would demonstrate even greater robustness.
Algorithm 1 FederatedAveraging (modified). The K clients are indexed by k; B is the local minibatch size, E is the number of local epochs, and η is the learning rate.
Server executes:
    initialize w_0
    for each round t = 1, 2, ... do
        for each client k ∈ K in parallel do
            w_{t+1}^k ← DeviceUpdate(k, w_t)
        w_{t+1} ← (1/K) Σ_{k=1}^{K} w_{t+1}^k

DeviceUpdate(k, w):
    B ← (split P_k into batches of size B)    // the NN will use batch size B = 32
    for each local epoch i from 1 to E do
        for batch b in B do
            w ← w − η ∇ℓ(w; b)
    return w to server
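The server-side step of the modified algorithm - an unweighted mean over all K participants - can be sketched in Python; here a model is represented simply as a flat list of weights rather than the layered arrays a real network would have:

```python
def federated_average(local_models):
    """Unweighted FederatedAveraging: w_{t+1} = (1/K) * sum_k w_{t+1}^k.

    local_models: list of K weight vectors (one per client), all the same length.
    """
    K = len(local_models)
    n_weights = len(local_models[0])
    return [sum(model[i] for model in local_models) / K for i in range(n_weights)]

# Three hypothetical clients returning updated weights for the same round.
clients = [[0.1, 0.2, 0.3],
           [0.3, 0.2, 0.1],
           [0.2, 0.2, 0.2]]
print(federated_average(clients))  # ≈ [0.2, 0.2, 0.2]
```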
The second algorithm is FederatedHighest, another take on the same idea of combining the local models. As the name indicates, the algorithm searches for the highest-scoring local model and introduces this as the new global model, distributing it to the other devices for the next iteration.
Algorithm 2 FederatedHighest. The K clients are indexed by k.
Server executes:
    initialize w_0
    for each round t = 1, 2, ... do
        i ← 0
        maxScore ← 0
        for each client k ∈ K in parallel do
            w_{t+1}^k ← DeviceUpdate(k, w_t)    // see FederatedAveraging
            s_k ← score of w_{t+1}^k on test data
            if s_k > maxScore then
                maxScore ← s_k
                i ← k
        w_{t+1} ← w_{t+1}^i
The FederatedRandom algorithm instead removes some information from the set of models: it removes one random model from the set and then calculates the average of the remaining models' weights, taking this as the new global model.
Algorithm 3 FederatedRandom. The K clients are indexed by k.
Server executes:
    initialize w_0
    for each round t = 1, 2, ... do
        for each client k ∈ K in parallel do
            w_{t+1}^k ← DeviceUpdate(k, w_t)    // see FederatedAveraging
        i ← random value from 1 to K
        remove w_{t+1}^i from the saved models
        w_{t+1} ← (1/(K−1)) Σ over the remaining k of w_{t+1}^k
The efficiency of these algorithms will be tested over a number of settings, including whether either of the two new variants, FederatedHighest and FederatedRandom, can outperform the standard FederatedAveraging.
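Under the same flat-list model representation, the two variants differ from FederatedAveraging only in the server-side step; a sketch follows, where the scoring function standing in for test-set accuracy is a made-up placeholder:

```python
import random

def federated_highest(local_models, score):
    """FederatedHighest: the best-scoring local model becomes the global model."""
    return max(local_models, key=score)

def federated_random(local_models, rng=random):
    """FederatedRandom: drop one random local model, average the rest."""
    survivors = list(local_models)
    survivors.pop(rng.randrange(len(survivors)))
    K = len(survivors)
    return [sum(m[i] for m in survivors) / K for i in range(len(survivors[0]))]

models = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
# Hypothetical score: here just the first weight, standing in for test accuracy.
print(federated_highest(models, score=lambda m: m[0]))  # [0.9, 0.1]
print(federated_random(models, rng=random.Random(0)))   # mean of two random models
```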
3.2.4 Federated support-vector machine
In order to aggregate models when working with an SVM, we need access to its support-vectors and its intercepts; these are the only parts needed to fully represent a model. They are sent to the server where the aggregation takes place. In the scope of this project the Federated SVM experiments are run as a simulation on one machine instead of over a network. This, however, affects neither the training nor the aggregation of the models.
The chosen library for working with SVMs is the scikit-learn^2 library for Python, which in turn is built on NumPy, SciPy and matplotlib. The library does not support directly setting the weights of a model, but weights can be introduced in the initialization process, and at least one training round has to occur before the accuracy can be tested. This means that a global model cannot be implemented and tested directly; at least one training round must take place, influencing the model with some data. This is not a problem for producing the global model, but it is for analyzing it.
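One way this initialization route looks in scikit-learn is through the SGD-based linear models, whose `fit` method accepts `coef_init` and `intercept_init`; the sketch below uses a toy data set and invented global weights, as an illustration of the mechanism rather than the project's exact code:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy two-class data set standing in for one device's local data.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])

# Hypothetical global model weights received from the federated server.
global_coef = np.array([[0.5, -0.5]])
global_intercept = np.array([0.0])

clf = SGDClassifier(loss="hinge", random_state=0)  # hinge loss ~ linear SVM
# scikit-learn cannot set weights directly, but fit() can start from them,
# so one local training round is seeded with the global weights.
clf.fit(X, y, coef_init=global_coef, intercept_init=global_intercept)
print(clf.coef_.shape, clf.intercept_.shape)  # (1, 2) (1,)
```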
3.2.5 Federated neural networks

Keras^3 is the library used to do the experiments and construct the platform for the federated neural network. Keras enables quick experimentation and is easy to both set up and use in a federated system. The structure of a neural network model can be saved and moved to replicate itself on other devices, which is used by the federated server to initiate the devices for learning. The model weights can be extracted from the models and then moved, which is needed for the devices to send their weights to the federated server for aggregation. Model weights can also be replaced by those of another model, which is what is needed for the global model to be incorporated on the devices.

^2 https://scikit-learn.org
^3 https://keras.io/
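This weight movement maps onto Keras's `get_weights`/`set_weights` methods; a minimal sketch with a made-up model architecture (not the thesis's actual model):

```python
import numpy as np
from tensorflow import keras

def make_model():
    # Tiny stand-in for the shared model architecture all devices replicate.
    return keras.Sequential([
        keras.Input(shape=(3,)),
        keras.layers.Dense(4, activation="relu"),
        keras.layers.Dense(2, activation="softmax"),
    ])

device_model = make_model()
global_model = make_model()

# Device -> server: extract the weights (a list of NumPy arrays) for aggregation.
local_weights = device_model.get_weights()

# Server -> device: replace a model's weights, e.g. with the aggregated global ones.
global_model.set_weights(local_weights)

same = all(np.array_equal(a, b)
           for a, b in zip(global_model.get_weights(), device_model.get_weights()))
print(same)  # True
```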
4 System design
Link to github repository: https://github.com/robertcarlsson/federated-learning-system
4.1 Design choices
Based on a short pre-research period leading up to this report, Python^4 was found to be a well-suited programming language for working with machine learning. It has a good set of libraries with the necessary building blocks for creating the wanted system, such as Keras^5, a high-level API written in Python that runs on top of TensorFlow^6. TensorFlow is a well-known and widely used machine learning library and has great support for working with neural networks, which is the machine learning algorithm that will run in this environment.
Bonawitz et al. [2], with their research on system design, were a great inspiration for the development of the federated system. They created a communication protocol that handles the connection and data transfer from devices to the server in a federated system. Their protocol is a foundation for the one created to support the system in this project, which is thoroughly explained in the Flask web server section below. Flask^7 is a lightweight web application framework for Python and one of the most popular ones. Since the protocol is small and has low complexity, Flask is a good choice thanks to its simplicity and ease of implementation.
To create the environment and platform, the devices and the server need to be represented as separate entities. This can be done with the container technology Docker^8, a way of handling operating-system virtualization. These stand-alone virtualizations are called containers and are isolated from one another, with their own libraries and configuration files to support the software they run.
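For flavor, a Flask server carrying such a protocol could look like the sketch below; the endpoint names and JSON fields are invented for illustration and are not the project's actual protocol (which is described in the Flask web server section):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-ins for the federated server's state.
global_model = {"round": 0, "weights": [0.0, 0.0, 0.0]}
received = []

@app.route("/model", methods=["GET"])
def get_model():
    # Devices fetch the current global model before a training round.
    return jsonify(global_model)

@app.route("/update", methods=["POST"])
def post_update():
    # Devices post their locally trained weights for aggregation.
    received.append(request.get_json()["weights"])
    return jsonify({"status": "received", "count": len(received)})
```

The server would be started with `flask run`, with each device container polling `/model` and posting to `/update` once its local round finishes.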
4.2 System components
This section will present the design of the federated learning system. The system is designed to be a testing ground for further development in the field of federated learning,
^4 https://www.python.org/
^5 https://keras.io/
^6 https://www.tensorflow.org/
^7 https://www.palletsprojects.com/p/flask/
^8