Privacy-Preserved Federated Learning
A survey of applicable machine learning algorithms in a federated environment Robert Carlsson
Department of Information Technology, Uppsala University
There is potential in the fields of medicine and finance for collaborative machine learning. These areas gather data that can be used to develop machine learning models predicting everything from illness in patients to economic crimes such as fraud. The problem is that the collected data is mostly of a confidential nature and should be handled with precaution. This makes the standard way of doing machine learning - gathering data at one centralized server - undesirable, since the safety of the data has to be taken into account. In this project we explore the Federated learning approach of "bringing the code to the data, instead of the data to the code". It is a decentralized way of doing machine learning where models are trained on connected devices and data is never shared, keeping the data privacy-preserved.
Printed by: Reprocentralen ITC, UPTEC IT 20041
Examiner: Lars-Åke Nordén
Subject reader: Andreas Hellander
Supervisor: Salman Toor
uous positive attitude towards the learning process in a project. I also want to extend my thanks to my reviewer Andreas Hellander for his guidance and assistance in the project.
Contents

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Project objective
2 Related Work
3 Theory
  3.1 Machine learning
    3.1.1 Support-vector machines
    3.1.2 Neural networks
  3.2 Federated machine learning
    3.2.1 Local Model
    3.2.2 Global Model
    3.2.3 Model aggregation
    3.2.4 Federated support-vector machine
    3.2.5 Federated neural networks
4 System design
  4.1 Design choices
  4.2 System components
    4.2.1 Federated Server
    4.2.2 Flask web server - Communication protocol
    4.2.3 Configuration phase
    4.2.4 Device configuration phase
    4.2.5 Training phase
  4.3 Device
5 Data set
  5.1 IID and Non-IID
  5.2 Unbalanced data sets
6 Use cases
  6.1 Federated Support-vector machines
    6.1.1 Use case 1
    6.1.2 Use case 2
    6.1.3 Use case 3
  6.2 Federated neural network
    6.2.1 Use case 4
    6.2.2 Use case 5
7 Results
  7.1 Federated support-vector machine
    7.1.1 Use case 1
    7.1.2 Use case 2
    7.1.3 Use case 3
  7.2 Federated neural network
    7.2.1 Use case 4
    7.2.2 Use case 5
    7.2.3 Use case 6
    7.2.4 Use case 7
8 Conclusion and Future work
  8.1 Future work
References
1 Introduction
1.1 Background
Machine learning has the potential to add value in fields like medicine and finance. According to Obermeyer et al. [13], machine learning algorithms will transform the field of medicine. One of those transformations could be collaborative machine learning between hospitals, where each hospital has its own data that could be beneficial for the others. The confidential nature of medical data makes it unattractive to transfer it to a centralized server, since doing so could cause problems in keeping privacy. In the cases of medical screening [14] and disease outbreak discovery [9], search query logs were used to predict the wanted outcomes. In these cases the data was anonymous, but traditional data anonymization methods [12] have weaknesses that can make it possible to match data even when anonymized. Mohammed et al. [12] work on a collaboration between two different parties, a medical institution and a blood transfusion service. When either party wants to send information about their patients, strict privacy procedures are taken into account to enforce anonymity.
A solution to the problem of keeping data privacy-preserved was suggested by McMahan et al. [10]. They developed a decentralized machine learning algorithm that solves the issue of transferring data. Instead of handling the data with care when transferring it to a designated server, they advocate leaving the data at its source, keeping it distributed. The training is then performed at each local participant, and the result of the training is gathered instead of the data. They termed this approach Federated Learning. The focus of their research was the large amount of data that comes from mobile devices, using it to train language and image models to improve the user experience.
1.2 Problem statement
• Which machine learning algorithms are suitable in a federated learning process, and under what circumstances?
• What is needed to create a federated learning platform that could be adopted by medical institutes or financial organizations like banks?
This thesis project tests the ideas of Federated Learning in different scenarios and with different machine learning algorithms to get an understanding of the usability of the method. It also evaluates the FederatedAveraging algorithm [10], which combines local stochastic gradient descent (SGD) with a server that performs model aggregation. The algorithms used are support-vector machines (SVM) and neural networks (NN), tested for their efficiency in this environment.
After testing, the project aims to create the basis for a platform that could support federated learning. When such a system is created and maintained, new participants can join, benefit from the global model, and contribute to it. The system does not aim to work for a large corpus of users, as McMahan et al. proposed, but instead targets a limited number of users.
2 Related Work
The two machine learning algorithms chosen for this project were the SVM and the NN. Zanaty et al. [16] compare these two algorithms across different data sets. They wanted to answer the question of which kernel lets an SVM achieve a high accuracy compared to a NN. The work presents four proposed kernels that are tested and compared to a NN, where the last one, the Gaussian Radial Basis Polynomial Function (GRPF), outperforms all the other kernels as well as the neural network. This shows good promise for using SVMs in this project for further research in Federated Learning. One problem that can arise, however, is that when the kernel is of a non-linear nature, it has to be explored whether the standard model aggregation can still be used.
Neural networks have shown their capacity in handling many different tasks, such as image classification [7], where Krizhevsky et al. use deep convolutional neural networks to achieve a low error rate when classifying a large amount of high-resolution images. Their final network is a set of five convolutional and three fully-connected layers, spanning over 60 million parameters, and took five to six days to train on two - at that time - strong GPUs. This was then the state of the art of image classification and showed what was possible. This project will aim for an image classification problem, but will not handle such an advanced neural network or hard classification task.
Another neural network method, the recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units [17], has shown to be very efficient in language models [11] and has been used in many variants of federated learning. Hard et al. [4] specifically target Gboard - the Google keyboard^1 - and its next-word prediction. They used a variant of an RNN with LSTM called Coupled Input and Forget Gate (CIFG) and evaluate the methods by the recall of next-word predictions. Their results show that the federated CIFG outperforms the N-gram method and is on par with the centralized counterpart where all the data is gathered at a centralized server.
In the classical way of machine learning, data is first collected at a centralized server and then either trained on there or distributed to other devices in a private data center. This produces two obstacles, as McMahan et al. [6] state: data can be large, so transferring it costs time and resources, and data can be of a private nature, where sharing can inflict privacy issues [15]. To handle these situations, McMahan et al. [10] conceptualized a decentralized approach that leaves the data at its source and uses the holding device to do the machine learning, and termed it Federated Learning. Their main objective was to leave the training data at the devices holding it, and to set up a centralized server that conveys a federated learning algorithm for a large group of devices to collaboratively do machine learning, with the goal of creating a global model for all participants to share while keeping the data privacy-preserved.
3 Theory
3.1 Machine learning
This project covers two machine learning algorithms and tests their functionality in a federated learning environment. Both are explained in detail below.
3.1.1 Support-vector machines
The support-vector machine (SVM) is a supervised machine learning algorithm used to analyze data for classification [5]. It finds one or several hyperplanes that separate and classify data in a high-dimensional space. The goal is the hyperplane that separates the two groups with the largest margin. In figure 1, two of the red data points from the negative group and one blue point from the positive group border the margin; they act as support-vectors for defining the largest margin.
^1 gboard.app.goo.gl/get
Figure 1: support-vector machine
w · x + b = 0 (1)
y = mx + b (2)
Equation 1 is the formula for a hyperplane, representing the line in the middle of the margin (compare with the familiar line equation 2). To create a decision rule for classifying data points into the positive or negative group, we need a mathematical expression that shows whether a point is above or below the median of the margin. First we have the vector w, which is perpendicular to this median and of arbitrary length. Then we take an unknown data point, represented as the vector u, and want to check whether it is a negative or positive example. We project u onto w, which tells us on which side of the median it lies. When equation 3 is fulfilled, the data point x is a positive example.
w · x + b ≥ 0 (3)
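As a concrete illustration (not part of the thesis's experiments), the decision rule of equation 3 can be evaluated numerically; the hyperplane parameters and sample points below are made-up values:

```python
def classify(w, u, b):
    """Decision rule from equation 3: positive class iff w·u + b >= 0."""
    dot = sum(wi * ui for wi, ui in zip(w, u))
    return 1 if dot + b >= 0 else -1

# Illustrative hyperplane: w = (1, 1), b = -3, i.e. the line x + y = 3.
w, b = (1.0, 1.0), -3.0
print(classify(w, (2.0, 2.0), b))  # point above the line -> 1
print(classify(w, (0.5, 0.5), b))  # point below the line -> -1
```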
A single decision rule for all the data points is wanted instead of one per class. Two new equations are introduced, similar to the one before. For a known positive sample the expression will be greater than or equal to 1, and for a known negative sample less than or equal to -1. We also introduce a new variable y that takes the value 1 or -1 depending on the sample being processed.

w · x_+ + b ≥ 1
w · x_− + b ≤ −1    (4)

y = 1 if x is a positive sample, −1 if x is a negative sample
Now equation 5 can be created, expressing the decision rule for all the data points with the help of the variable y.
y_i (w · x_i + b) − 1 = 0    (5)
The best hyperplane separating the classes is the one with the largest margin. Equation 6 expresses the width of the margin: two support-vectors lying at opposite edges of the margin are subtracted from each other to get the vector in the figure, projected onto w, and finally normalized by the size of w.

width = ((x_+ − x_−) · w) / ||w||    (6)
Using the decision rule of equation 5 for the support-vectors on the margin, algebraic reduction gives the next expression:

width = 2 / ||w||    (7)

Maximizing this width is equivalent to minimizing (1/2)||w||^2. Introducing Lagrange multipliers α_i for the constraints of equation 5 gives the Lagrangian:

L = (1/2)||w||^2 − Σ_i α_i [y_i (w · x_i + b) − 1]    (8)
Differentiating with respect to w gives equation 9, and differentiating with respect to b gives equation 10. These two equations will be used to rewrite equation 8.
w = Σ_i α_i y_i x_i    (9)

Σ_i α_i y_i = 0    (10)
Substituting these back into equation 8 and simplifying, we end up with equation 11:

L = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)    (11)
In the process of training an SVM there are parameters that cover different aspects of the training: kernel, regularization, gamma and margin. The kernel is the transformation function that takes the data to another dimensional space where the SVM can linearly separate the classes. Regularization, known as the parameter C, represents how much the SVM will focus on avoiding misclassifying the data points. Gamma controls how much impact far-away data points have on the classification.
This project will only focus on the linear SVM, since there the coefficients represent the hyperplanes and the aggregation of these models makes mathematical sense. The model is represented as the coefficients of the vector orthogonal to the hyperplane together with the intercept of this vector. These arrays of weights can be aggregated and combined into a global model.
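A minimal sketch of this aggregation, assuming the per-device hyperplane parameters have been extracted in the array layout scikit-learn uses for `coef_` and `intercept_` (the concrete numbers are invented for illustration):

```python
import numpy as np

def aggregate_linear_svm(coefs, intercepts):
    """Average per-device hyperplane parameters into a global model.

    coefs      : list of arrays, one coef_-style array per device
    intercepts : list of arrays, one intercept_-style array per device
    """
    global_coef = np.mean(coefs, axis=0)
    global_intercept = np.mean(intercepts, axis=0)
    return global_coef, global_intercept

# Two hypothetical devices with similar hyperplanes.
coefs = [np.array([[1.0, 2.0]]), np.array([[3.0, 4.0]])]
intercepts = [np.array([-1.0]), np.array([-3.0])]
w, b = aggregate_linear_svm(coefs, intercepts)
print(w)  # [[2. 3.]]
print(b)  # [-2.]
```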
3.1.2 Neural networks
Neural networks are machine learning algorithms inspired by biological neural networks, better known as brains. The primary part of a neural network is the neuron, also known as a node. The network is structured with these nodes in layers, where each layer has a certain number of connections to the nodes as inputs and outputs. The node is modeled as a function that uses previously known information together with its own bias value to calculate its activation value. The previously known information can be either inputs from a data source or, for a hidden or output layer, the activation values of the previous layer with their corresponding weights.
Figure 2: Parts and connections of a node. It exists inside a layer
Two different versions of neural networks are the standard feedforward neural network and the RNN. The feedforward neural network is in principle a straight line of information from the input, through a number of layers, to the output. It handles one input at a time and is trained at every such session, so no recollection of the previous input can be taken into account. This is where the RNN differs: it has the ability to retain temporal information from previous input examples. It is built with the normal layering from input to output, but then sends information on to the next iteration of the network, giving the next input information about the previous example.
This forwarding of information introduces a problem for the RNN. Like the feedforward network it uses backpropagation to train itself, but in this variant it suffers from the vanishing gradient problem. The solution is the Long Short-Term Memory (LSTM): a set of gates that holds the information of the system for each step during the training. [17]
RNNs are very useful for language recognition tasks, as shown in the keyboard prediction article from Google [4]. This project will focus on image recognition of handwritten digits and will make use of a standard NN instead of an RNN.
A neural network is made up of at least three different parts: one input layer, zero, one or more hidden layers, and one output layer, with all but the input layer having a non-linear activation function that activates the nodes in that layer. Each layer is represented as a number of nodes, where in a standard densely connected layer each node has a connection to all nodes in the next layer. All these connections between the nodes in the network are represented as weights: floating-point numbers representing the strength of the connection between the nodes.
Figure 3: The different layers and possible structure of a neural network
In the standard setting each node has a bias value, and the bias values for the nodes in a layer together make up a bias vector. The bias vectors together with the weights of the connections represent a neural network model. Neural networks utilize a supervised learning technique called backpropagation, which starts from the outcome at the output layer and works backwards to create the next version of the model.
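The role of weights and biases described above can be sketched as the forward pass of one densely connected layer; this is an illustrative toy in pure Python, not the project's code, and the numbers are made up:

```python
import math

def sigmoid(x):
    """A common non-linear activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_forward(inputs, weights, biases):
    """Forward pass of one densely connected layer.

    Each output node j computes sigmoid(sum_i inputs[i] * weights[i][j] + biases[j]),
    i.e. incoming activations scaled by connection weights, plus the node's bias.
    """
    n_out = len(biases)
    return [
        sigmoid(sum(a * weights[i][j] for i, a in enumerate(inputs)) + biases[j])
        for j in range(n_out)
    ]

# Two inputs feeding a layer of two nodes.
inputs = [1.0, 0.5]
weights = [[0.4, -0.2],   # connections from input 0 to nodes 0 and 1
           [0.3,  0.6]]   # connections from input 1 to nodes 0 and 1
biases = [0.0, 0.1]
print(dense_forward(inputs, weights, biases))
```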
3.2 Federated machine learning
In federated machine learning it is important to differentiate between two types of models: the local and the global model. The local one is the model at the device level, and the global model is a summary of the local models for the whole federation.
3.2.1 Local Model
The local model resides at the device level and is a representation of the knowledge of that specific device. It is trained with the local data set and then sent to the centralized unit that collects and refines these models. It is affected by any limitations that the local data set has, such as having too few data points or not being representative of the whole collection of data sets.
3.2.2 Global Model
The federated learning process circulates around the use of a global model, which in essence is a combination of the local models. This is both the strength and the weakness of the system: if the local models are a good representation of the underlying data structure and are similar to each other, then an aggregation will create a stronger and better global model.
3.2.3 Model aggregation
In order to create a global model from the local ones, the process of model aggregation is used. It is a way of combining several models into a single one, with the purpose that the outcome should be more robust and accurate. In the paper by McMahan et al. [6] the model aggregation method used is named FederatedAveraging. This method aggregates the updated local models using a weight for each model: the amount of data points the specific device used when training, relative to the whole set of the federation.
Three different algorithms are presented here for achieving such aggregation. The first algorithm shown is a version of FederatedAveraging with a few changes from the original. In this version all participants train each round, instead of a subset being picked randomly from the whole federation. The weight of each model is not tied to the relative amount of data points; the models are instead simply added together and divided by the number of participants. This last change was made so that no information about the data sets on the local devices leaves its source. It also strengthens the influence of unbalanced data sets, which, if the algorithm still performs well, would demonstrate even greater robustness.
Algorithm 1 FederatedAveraging (modified). The K clients are indexed by k; B is the local minibatch size, E is the number of local epochs, and η is the learning rate.
Server executes:
    initialize w_0
    for each round t = 1, 2, ... do
        for each client k ∈ K in parallel do
            w_{t+1}^k ← DeviceUpdate(k, w_t)
        w_{t+1} ← (1/K) Σ_{k=1}^{K} w_{t+1}^k

DeviceUpdate(k, w):
    B ← (split P_k into batches of size B)    // the NN will use batch size B = 32
    for each local epoch i from 1 to E do
        for batch b in B do
            w ← w − η ∇ℓ(w; b)
    return w to server
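The server-side step of the modified algorithm - an unweighted mean over all K participants - can be sketched in Python; here a model is represented simply as a flat list of weights rather than the layered arrays a real network would have:

```python
def federated_average(local_models):
    """Unweighted FederatedAveraging: w_{t+1} = (1/K) * sum_k w_{t+1}^k.

    local_models: list of K weight vectors (one per client), all the same length.
    """
    K = len(local_models)
    n_weights = len(local_models[0])
    return [sum(model[i] for model in local_models) / K for i in range(n_weights)]

# Three hypothetical clients returning updated weights for the same round.
clients = [[0.1, 0.2, 0.3],
           [0.3, 0.2, 0.1],
           [0.2, 0.2, 0.2]]
print(federated_average(clients))  # ≈ [0.2, 0.2, 0.2]
```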
The second algorithm is FederatedHighest, another take on the same idea of combining the local models. As the name indicates, the algorithm searches for the highest-scoring local model and introduces this as the new global model, distributing it to the other devices for the next iteration.
Algorithm 2 FederatedHighest. The K clients are indexed by k.
Server executes:
    initialize w_0
    for each round t = 1, 2, ... do
        i ← 0
        maxScore ← 0
        for each client k ∈ K in parallel do
            w_{t+1}^k ← DeviceUpdate(k, w_t)    // see FederatedAveraging
            s_k ← score of w_{t+1}^k on test data
            if s_k > maxScore then
                maxScore ← s_k
                i ← k
        w_{t+1} ← w_{t+1}^i
The FederatedRandom algorithm instead removes some information from the set of models: it removes one random model from the set and then calculates the average of the remaining models' weights, taking this as the new global model.
Algorithm 3 FederatedRandom. The K clients are indexed by k.
Server executes:
    initialize w_0
    for each round t = 1, 2, ... do
        for each client k ∈ K in parallel do
            w_{t+1}^k ← DeviceUpdate(k, w_t)    // see FederatedAveraging
        i ← random value from 1 to K
        remove w_{t+1}^i from the saved models
        w_{t+1} ← (1/(K−1)) Σ over the remaining k of w_{t+1}^k
The efficiency of these algorithms will be tested over a number of settings, including whether either of the two new variants, FederatedHighest and FederatedRandom, can outperform the standard FederatedAveraging.
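Under the same flat-list model representation, the two variants differ from FederatedAveraging only in the server-side step; a sketch follows, where the scoring function standing in for test-set accuracy is a made-up placeholder:

```python
import random

def federated_highest(local_models, score):
    """FederatedHighest: the best-scoring local model becomes the global model."""
    return max(local_models, key=score)

def federated_random(local_models, rng=random):
    """FederatedRandom: drop one random local model, average the rest."""
    survivors = list(local_models)
    survivors.pop(rng.randrange(len(survivors)))
    K = len(survivors)
    return [sum(m[i] for m in survivors) / K for i in range(len(survivors[0]))]

models = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
# Hypothetical score: here just the first weight, standing in for test accuracy.
print(federated_highest(models, score=lambda m: m[0]))  # [0.9, 0.1]
print(federated_random(models, rng=random.Random(0)))   # mean of two random models
```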
3.2.4 Federated support-vector machine
In order to aggregate models when working with an SVM, we need access to its support-vectors and its intercepts; these are the only parts needed to fully represent a model. They are sent to the server where the aggregation takes place. In the scope of this project the Federated SVM experiments are run as a simulation on one machine instead of over a network. This, however, affects neither the training nor the aggregation of the models.
The chosen library for working with SVMs is the scikit-learn^2 library for Python, which in turn is built on NumPy, SciPy and matplotlib. The library does not support directly setting the weights of a model, but weights can be introduced in the initialization process, and at least one training round has to occur before the accuracy can be tested. This means that a global model cannot be implemented and tested directly; at least one training round must take place, influencing the model with some data. This is not a problem for producing the global model, but it is for analyzing it.
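One way this initialization route looks in scikit-learn is through the SGD-based linear models, whose `fit` method accepts `coef_init` and `intercept_init`; the sketch below uses a toy data set and invented global weights, as an illustration of the mechanism rather than the project's exact code:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy two-class data set standing in for one device's local data.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])

# Hypothetical global model weights received from the federated server.
global_coef = np.array([[0.5, -0.5]])
global_intercept = np.array([0.0])

clf = SGDClassifier(loss="hinge", random_state=0)  # hinge loss ~ linear SVM
# scikit-learn cannot set weights directly, but fit() can start from them,
# so one local training round is seeded with the global weights.
clf.fit(X, y, coef_init=global_coef, intercept_init=global_intercept)
print(clf.coef_.shape, clf.intercept_.shape)  # (1, 2) (1,)
```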
3.2.5 Federated neural networks

Keras^3 is the library used to do the experiments and construct the platform for the federated neural network. Keras enables quick experimentation and is easy to both set up and use in a federated system. The structure of a neural network model can be saved and moved to replicate itself on other devices, which is used by the federated server to initiate the devices for learning. The model weights can be extracted from the models and then moved, which is needed for the devices to send their weights to the federated server for aggregation. Model weights can also be replaced by those of another model, which is what is needed for the global model to be incorporated on the devices.

^2 https://scikit-learn.org
^3 https://keras.io/
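This weight movement maps onto Keras's `get_weights`/`set_weights` methods; a minimal sketch with a made-up model architecture (not the thesis's actual model):

```python
import numpy as np
from tensorflow import keras

def make_model():
    # Tiny stand-in for the shared model architecture all devices replicate.
    return keras.Sequential([
        keras.Input(shape=(3,)),
        keras.layers.Dense(4, activation="relu"),
        keras.layers.Dense(2, activation="softmax"),
    ])

device_model = make_model()
global_model = make_model()

# Device -> server: extract the weights (a list of NumPy arrays) for aggregation.
local_weights = device_model.get_weights()

# Server -> device: replace a model's weights, e.g. with the aggregated global ones.
global_model.set_weights(local_weights)

same = all(np.array_equal(a, b)
           for a, b in zip(global_model.get_weights(), device_model.get_weights()))
print(same)  # True
```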
4 System design
Link to github repository: https://github.com/robertcarlsson/federated-learning-system
4.1 Design choices
Based on a short pre-research period leading up to this report, Python^4 was found to be a well-suited programming language for working with machine learning. It has a good set of libraries with the necessary building blocks for creating the wanted system, such as Keras^5, a high-level API written in Python that runs on top of TensorFlow^6. TensorFlow is a well-known and widely used machine learning library and has great support for working with neural networks, which is the machine learning algorithm that will run in this environment.
Bonawitz et al. [2], with their research on system design, were a great inspiration for the development of the federated system. They created a communication protocol that handles the connection and data transfer from devices to the server in a federated system. Their protocol is a foundation for the one created to support the system in this project, which is thoroughly explained in the Flask web server section below. Flask^7 is a lightweight web application framework for Python and one of the most popular ones. Since the protocol is small and has low complexity, Flask is a good choice thanks to its simplicity and ease of implementation.
To create the environment and platform, the devices and the server need to be represented as separate entities. This can be done with the container technology Docker^8, a way of handling operating-system virtualization. These stand-alone virtualizations are called containers and are isolated from one another, with their own libraries and configuration files to support the software they run.
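For flavor, a Flask server carrying such a protocol could look like the sketch below; the endpoint names and JSON fields are invented for illustration and are not the project's actual protocol (which is described in the Flask web server section):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-ins for the federated server's state.
global_model = {"round": 0, "weights": [0.0, 0.0, 0.0]}
received = []

@app.route("/model", methods=["GET"])
def get_model():
    # Devices fetch the current global model before a training round.
    return jsonify(global_model)

@app.route("/update", methods=["POST"])
def post_update():
    # Devices post their locally trained weights for aggregation.
    received.append(request.get_json()["weights"])
    return jsonify({"status": "received", "count": len(received)})
```

The server would be started with `flask run`, with each device container polling `/model` and posting to `/update` once its local round finishes.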
4.2 System components
This section will present the design of the federated learning system. The system is designed to be a testing ground for further development in the field of federated learning,
^4 https://www.python.org/
^5 https://keras.io/
^6 https://www.tensorflow.org/
^7 https://www.palletsprojects.com/p/flask/
^8