
Federated Learning in Large Scale Networks: Exploring Hierarchical Federated Learning

HENRIK ERIKSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Federated Learning in Large Scale Networks: Exploring Hierarchical Federated Learning

HENRIK ERIKSSON

MSc in Electrical Engineering
Date: January 4, 2021
Supervisors: Johan Haraldsson, Ezeddin Al Hakim
Examiner: Prof. Carlo Fischione
School of Electrical Engineering and Computer Science
Host company: Ericsson AB

Swedish title: Federerad Inlärning i Storskaliga Nätverk: Utforskande av Hierarkisk Federerad Inlärning


Abstract

Federated learning faces a challenge when dealing with highly heterogeneous data, and it can sometimes be inadequate to train a single model for use at all nodes in the network. Different approaches have been investigated to overcome this issue, such as adapting the trained model to each node, or clustering the nodes in the network and training a separate model for each cluster, within which the data is less heterogeneous. In this work we study the possibilities of improving local model performance by utilizing the hierarchical setup that comes with clustering the participating clients in the network.

Experiments are carried out featuring a Long Short-Term Memory network performing time series forecasting, in order to evaluate different approaches that utilize the hierarchical setup and compare them to standard federated learning approaches. The experiments use a dataset collected by Ericsson AB consisting of handovers recorded at base stations in a European city. The hierarchical approaches did not show any benefit over common two-level approaches.

Keywords

Federated learning, personalization, time series forecasting, clustering, hierarchical federated learning, model interpolation, Mapper, HierFAvg, base stations, non-IID, LSTM


Sammanfattning (Swedish abstract)

Federated learning faces a challenge when handling data with a high degree of heterogeneity, and in some cases it can be unsuitable to use an approach where one and the same model is trained for use at all nodes in the network. Different approaches to handle this problem have been investigated, such as adapting the trained model to each node, and clustering the nodes in the network and training a separate model for each cluster within which the data is less heterogeneous. In this work, the possibilities of improving the performance of the local models by exploiting the hierarchical arrangement that arises when the participating nodes in the network are grouped into clusters are studied. Experiments are carried out with a Long Short-Term Memory network performing time series forecasting, to evaluate different approaches that exploit the hierarchical arrangement and compare them with common federated learning approaches. The experiments use a dataset collected by Ericsson AB, consisting of handovers from base stations in a European city. The hierarchical approaches showed no advantages over the common two-level approaches.

Nyckelord (Keywords)

Federated learning, time series forecasting, personalization, clustering, hierarchical federated learning, model interpolation, Mapper, HierFAvg, base stations


Acknowledgements

Firstly, I would like to express my gratitude to my industrial supervisors Johan Haraldsson and Ezeddin Al Hakim at Ericsson for all the time spent guiding and supporting me throughout this project. I also want to thank my examiner, Prof. Carlo Fischione, for clear guidance at crucial stages of the project.

I would also like to thank my colleagues Yeongwoo Kim and Mayank Gulati for valuable input and interesting discussions exploring new subjects.

Lastly, I want to give special thanks to Martin Larsson, Gustaf Jacobzon and Victor Löfgren for their inspiration, support and encouragement throughout the project.

Stockholm, January 4, 2021
Henrik Eriksson


Acronyms

FL      Federated Learning
FT      Fine-tuned
IID     Independent and Identically Distributed
KPI     Key Performance Indicator
LSTM    Long Short-Term Memory
MAE     Mean Absolute Error
MASE    Mean Absolute Scaled Error
MdRAE   Median Relative Absolute Error
ML      Machine Learning
SGD     Stochastic Gradient Descent
sMAPE   Symmetric Mean Absolute Percentage Error


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Objectives
  1.5 Delimitations

2 Overview of Federated Learning
  2.1 Federated Learning
    2.1.1 Federated Averaging Algorithm
  2.2 Hierarchical Federated Learning
    2.2.1 HierFAvg Algorithm
  2.3 Model Interpolation
    2.3.1 The Mapper Algorithm
  2.4 Clustering
    2.4.1 Time Series Feature Extraction

3 Methods
  3.1 Proposed Algorithms
    3.1.1 Usability Updates to The Mapper Algorithm
    3.1.2 Three-Level Model Interpolation
  3.2 Method Evaluation
    3.2.1 Metrics: MASE
    3.2.2 Data
    3.2.3 Model
  3.3 Algorithms
    3.3.1 Vanilla Federated Learning
    3.3.2 Fine-tuned Vanilla Federated Learning
    3.3.3 Model Interpolation
    3.3.4 Clustered Vanilla Federated Learning
    3.3.5 Fine-tuned HierFAvg
    3.3.6 Hierarchical Model Interpolation
  3.4 Experiments
    3.4.1 Experiment 1: Full Train Dataset
    3.4.2 Experiment 2: 25% Train Dataset
    3.4.3 Experiment 3: 12% Train Dataset
    3.4.4 Clustering

4 Results
  4.1 Experiment 1: Full Dataset
  4.2 Experiment 2: 25% Dataset
  4.3 Experiment 3: 12% Dataset
  4.4 Overall

5 Discussion

6 Conclusion

Bibliography

A Cluster Examples

1 Introduction

1.1 Background

The standard way to train machine learning models is to do it centralized, on a single machine or in a data center with all the data locally available. In cases where data is produced at many clients, it is not efficient, or even feasible, to transfer the data from all clients to train models centrally, due to privacy concerns or communication costs and constraints.

Examples of such cases are smartphones, where user input, sensors, microphones and cameras produce vast amounts of data that is often private in nature, and the increasing number of IoT devices with large amounts of sensor data.

The clients could also be base station antennas in a mobile network, with data and KPIs collected from the antennas; this is the particular domain investigated in this degree project. Generally, devices can benefit from running a shared model trained on the rich data produced by all devices.

To facilitate this scenario, [1] presented an approach, Federated Learning, and specifically an algorithm, FederatedAveraging, that trains a shared machine learning model without the clients having to share their data. The technique is now deployed in various digital products [2]. Notably, Google uses federated learning in their keyboard, Gboard [3–7], and Apple has started using it in their products as well [8].

Federated learning is also used in pharmaceutical discovery [9] and predictive models for Electronic Health Records [10], and it is being explored for predicting financial risk in reinsurance [11], smart manufacturing [12] and medical imaging [13].

In [2] the current state of federated learning is discussed. Several techniques are considered under the federated learning concept: fully decentralized (peer-to-peer) learning, cross-silo learning and cross-device learning, where the two latter are orchestrated by a central server. The common technique for cross-device federated learning is the FederatedAveraging algorithm presented in [1].

It works by initializing a model on a central server and distributing it to the clients, where some epochs of training occur before the updated model is sent back to the central server and averaged with the models from other clients. The new averaged model is then sent back to the clients for another round of training, and this is repeated until convergence.

Depending on the relation between the data of different devices, there are two approaches to federated learning. Vertical federated learning can be used in cases where different clients hold similar data but with differences in the features of the data; such cases can be found, for example, in bank data where privacy is of high concern. The more frequent case is horizontal federated learning, where different clients have similar features but different data [14].

Clustering has been used to produce multiple models, instead of one global model or fully separate models, for better performance when dealing with heterogeneous data that still contains similarities [15]. It has also been used in a federated learning setup [16], and [17] introduces a reclustering method called Hypcluster, which updates the clusters after each round of cluster model updates: each client tests its performance on each cluster model and gets assigned to the cluster with the most fitting model.

In clustered setups, the federated learning algorithm can be arranged hierarchically, with the clients updating a model for their cluster, which after some updates is aggregated with the models from all clusters into a global model [18].

In [19] a hierarchical federated learning setup is proposed where clients update a cluster model for some iterations before the cluster model is used to update the global model. The incentive for the approach is to limit communication cost. Methods to further limit communication cost, such as sparsification, are also implemented; these somewhat impair accuracy, although running a hierarchical setup still improves accuracy.

In most federated learning cases, the data is non-IID between the different clients, which makes it challenging to find one global model that performs well on all clients. This is particularly apparent when adopting methods to improve privacy [20], such as differential privacy and median aggregation [21]. This warrants adapting the general global model to the local data, i.e., personalizing the model. There are several methods for personalization, such as fine-tuning the model on the local data after the federated learning training has finished [22]. Federated learning can also be interpreted as a meta-learning algorithm, focusing on producing a global model well suited for fine-tuning [23, 24].

Personalization can also be seen as a multi-task learning problem [20, 25], where elastic weight consolidation helps avoid catastrophic forgetting [26]; the model's Fisher information is used to determine how much importance each parameter has for the current task. Using Fisher information to determine the importance of parameters has previously been used for pruning neural nets in computer vision [27, 28]. Knowledge distillation [29] is another proposed method for personalization [20]: the federated model is seen as a "teacher" and the local model as a "student", and the "student" is trained to mimic the behaviour of the "teacher".

Interpolating a local and a global model and training them to complement each other is the approach [17] proposes with their Mapper algorithm: the global model learns only what is common to the data of all clients, and the local model compensates with what is unique to the local client's data.

In another attempt to handle heterogeneity, [30] proposes the FedProx algorithm, which builds on the common FederatedAveraging algorithm. With FedProx, deviations from the global model are penalized during the local updates on the clients. This makes the optimization more robust and less sensitive to differences between clients.

1.2 Problem

The scenario considered in this thesis is one where we have a large number of clients with a high level of heterogeneity, each client with a considerably limited amount of data available for training the model. The limited amount of data warrants the use of federated learning to produce a model that generalizes well. However, producing high-performing machine learning models with federated learning can be difficult with high levels of heterogeneity among clients.

In this scenario, the vanilla federated learning approach would face the problem of producing a model that performs well on some clients and badly on others, or a model that performs only moderately well on all clients, since the distributions of the data differ significantly between clients. Fine-tuning the global model to better fit the different clients would be difficult due to the limited amount of local data. Clustering similar clients together and producing a separate model for each cluster could help produce models better fitted to the different clients, in an attempt to balance generalization and local fit. However, this could still lead to unsatisfactory results.

To produce models that generalize well and at the same time are well fitted to the local client, it is beneficial if knowledge can be shared between clients. Assuming a clustered setup, there is knowledge common to all clients, knowledge common to the clients in a cluster, and some knowledge unique to each client. Combining this knowledge could produce the best-performing model.

In the end, the problem to solve can be defined as

$$\min_{\{\omega_k\}_{k \in K}} \sum_{k \in K} F_k(\omega_k), \quad \text{where } F_k(\omega_k) \equiv \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} f_i(\omega_k), \tag{1.1}$$

and f_i(·) denotes the loss of the i:th data sample; further notation is explained in Section 2.1.

1.3 Purpose

We assume that the models at the different levels, from local to cluster to global level, each hold some knowledge about the data at the local client that cannot be found at the other levels. Based on this assumption, we want to find a method to incorporate the unique knowledge from all models into one model, and thereby improve the model's accuracy on the local data. With this aim in mind, the research question is formulated as follows:

“Can a hierarchical federated learning setup be advantageous over a normal two-level federated learning setup in terms of model performance?”

Hypothesis:

“When dealing with heterogeneous data, clustering clients and extracting knowledge from each level can benefit model performance compared to the normal two-level approach.”

1.4 Objectives

• Propose method(s) to combine knowledge from all levels in a hierarchical, clustered federated learning setup that could produce models that perform better than models produced by two-level federated learning.

• Set up and adapt an existing framework for federated learning to handle the proposed method(s).

• Perform experiments on an Ericsson provided dataset.

• Evaluate the performance of the proposed method(s) and compare it with the performance of the vanilla hierarchical approach and similar two-level approaches.

1.5 Delimitations

This thesis focuses on studying large-scale federated learning, mainly exploring the possible gains of clustering clients. Within the area of clustering clients in federated learning there are many routes to explore; for instance, there are many ways to cluster the clients. This thesis will not explore ways of clustering clients but rather use previously applied clustering methods and rely on their performance.

The principal of the project is Ericsson, and it is in their interest that the experiments and solutions in this thesis fit their line of use. The experiments are therefore limited to mainly focus on time series forecasting using an Ericsson-provided dataset.

There are different types of solutions for a time series forecasting problem. Ericsson is looking to use machine learning to solve this problem, and it is therefore not of interest to use other methods, such as statistical methods, as baselines in the experiments. Moreover, this is a study of different ways to optimize a given model; the work is therefore limited to assuming a given model and finding the best way to optimize it given the circumstances, rather than searching for the optimal model.

2 Overview of Federated Learning

This chapter provides theoretical background on the algorithms and methods used. Specifically, federated learning and existing alterations to the vanilla federated learning algorithm, such as the Hierarchical Federated Averaging (HierFAvg) algorithm and the Mapper algorithm, are explained. A brief background on clustering of nodes in a federated learning setup is also provided, as well as on the clustering algorithm used in this project to group the clients based on their local time series.

2.1 Federated Learning

Consider a setup with a set of clients K, each client k with its own data D_k ⊂ D and model parameters ω_k, all connected to a common central server G. The data D_k is not accessible at the central server but only at client k.

The clients collaborate to jointly solve the optimization problem

$$\min_{\omega} f(\omega), \quad \text{where } f(\omega) \equiv \frac{1}{|D|} \sum_{i=1}^{|D|} f_i(\omega), \tag{2.1}$$

where ω are the model parameters [1, 19]. Since the data is partitioned over the clients and only exists locally, we can rewrite the objective from Eq. (2.1) as

$$f(\omega) = \sum_{k \in K} \frac{|D_k|}{|D|} F_k(\omega), \tag{2.2}$$

where

$$F_k(\omega) = \frac{1}{|D_k|} \sum_{i \in D_k} f_i(\omega). \tag{2.3}$$

We try to find argmin_{ω_k} F_k(ω_k) by optimizing the model using stochastic gradient descent (SGD) according to

$$\omega_k^{t} = \omega_k^{t-1} - \eta \nabla F_k(\omega_k^{t-1}), \tag{2.4}$$

where t denotes the current update step and η is the learning rate. The global model, which optimizes over all data D, is updated by averaging the updates from a subset of all clients, S ⊂ K:

$$\omega_G^{t+1} = \sum_{k \in S} \frac{|D_k|}{|D_S|} \omega_k, \tag{2.5}$$

where D_S denotes the combined data of the clients in S.

2.1.1 Federated Averaging Algorithm

To optimize the model using federated learning, the FederatedAveraging [1] algorithm is most commonly used. It works by first randomly initializing the central model ω_G. A subset S of random clients is then selected, to which the central model ω_G is transmitted. In parallel, each client in the subset S performs a number of stochastic gradient descent update steps on the model ω_G using the local data D_k available at the client, according to Eq. (2.4). The updated models are then sent back to the server and aggregated according to Eq. (2.5), producing a new updated central model ω_G. After this, a new subset S of clients is selected, and the procedure is repeated until the performance of the model has converged, after which the optimized model ω_G is distributed to all clients k ∈ K for deployment. The pseudocode for FederatedAveraging can be found in Algorithm 1.
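For concreteness, the following is a minimal NumPy sketch of the training loop in Algorithm 1. It is not code from the thesis: the model is represented as a flat parameter vector, and grad_fn (the gradient of the local loss) as well as the per-client datasets are hypothetical placeholders.

```python
# Minimal sketch of FederatedAveraging with the model as a flat parameter vector.
import numpy as np

def client_update(w_global, data, grad_fn, lr=0.01, epochs=1, batch_size=4):
    """Run E local epochs of mini-batch SGD starting from the global model (Eq. 2.4)."""
    w = w_global.copy()
    for _ in range(epochs):
        np.random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * grad_fn(w, batch)          # one SGD step on the batch
    return w

def federated_averaging(clients, grad_fn, dim, rounds=100, frac=0.1):
    """clients: list of per-client datasets (lists of samples)."""
    w_global = np.zeros(dim)                      # initialize the central model
    for _ in range(rounds):
        m = max(int(frac * len(clients)), 1)
        subset = np.random.choice(len(clients), m, replace=False)
        updates, sizes = [], []
        for k in subset:
            updates.append(client_update(w_global, list(clients[k]), grad_fn))
            sizes.append(len(clients[k]))
        # data-size weighted average of the returned client models (Eq. 2.5)
        w_global = np.average(np.stack(updates), axis=0,
                              weights=np.array(sizes, dtype=float))
    return w_global
```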

2.2 Hierarchical Federated Learning

The hierarchical federated learning setup builds on the conventional federated learning setup in Section 2.1 by adding a cluster level. Consider a setup with a large number of clients grouped into clusters by some similarity measure, so that each cluster c ⊂ K consists of several clients k ∈ c. Each cluster has its own set of model parameters ω_c and its own loss function,

$$F_c(\omega) = \sum_{k \in c} \frac{|D_k|}{|D_c|} F_k(\omega), \tag{2.6}$$

where F_k(ω) is defined in Eq. (2.3) and D_c is the combined data of all clients in the cluster. The global loss function can then be rewritten as the weighted average of the cluster losses,

$$f(\omega) = \sum_{c \in C} \frac{|D_c|}{|D|} F_c(\omega). \tag{2.7}$$

Algorithm 1 FederatedAveraging. B is the local batch size, R is the proportion of clients used in each round, η is the learning rate and E is the number of local epochs.

On server:
  initialize ω_G
  for each round t = 1, 2, ... do
    m ← max(R · |K|, 1)
    S_t ← random subset of m clients
    for each client k ∈ S_t do
      ω_k ← ClientUpdate(ω_G, k)
    end for
    ω_G ← average of the ω_k
  end for

ClientUpdate(ω, k):   // performed at client k
  B ← split D_k into batches of size B
  for each local epoch i = 1 to E do
    for all batches b ∈ B do
      ω ← ω − η ∇F_b(ω)
    end for
  end for
  return ω

2.2.1 HierFAvg Algorithm

The HierFAvg [18] algorithm can be used to perform federated learning in a hierarchical setup to optimize the central model ω_G. It can be seen as two levels of federated learning, where the top level distributes the central model to the clusters and aggregates the models that have been optimized at each cluster. Each cluster can be seen as a small federated learning setup and performs updates according to the FederatedAveraging algorithm explained in Section 2.1.1. However, in this scenario the cluster model ω_c is not optimized until convergence but only updated a limited number of steps before being sent to the central server for aggregation. Note that the HierFAvg algorithm was initially designed to perform the cluster update calculations at a cluster entity such as a network base station. However, it is not necessary for the cluster calculations to be done on a separate entity; they could just as well be done on the central server, as is done in the experiments in this thesis.

Algorithm 2 HierFAvg. B is the local batch size, R is the proportion of clients used in each round, η is the learning rate and E is the number of local epochs.

On server:
  initialize ω_G
  for each round t = 1, 2, ... do
    for each cluster c ∈ C do
      ω_c ← ClusterUpdate(ω_G, c)
    end for
    ω_G ← average of the ω_c
  end for

ClusterUpdate(ω, c):
  for each cluster round u = 1, 2, ... do
    m ← max(R · |c|, 1)
    S_u ← random subset of m clients
    for each client k ∈ S_u do
      ω_k ← ClientUpdate(ω, k)
    end for
    ω ← average of the ω_k
  end for
  return ω

ClientUpdate(ω, k):   // performed at client k
  B ← split D_k into batches of size B
  for each local epoch i = 1 to E do
    for all batches b ∈ B do
      ω ← ω − η ∇F_b(ω)
    end for
  end for
  return ω
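As a complement to Algorithm 2, the sketch below shows the two aggregation levels of HierFAvg in the same NumPy style as the FederatedAveraging sketch above. It is a sketch under assumptions: client_update is redefined here for self-containment, and grad_fn and the cluster/client datasets are hypothetical placeholders.

```python
# Minimal sketch of the two aggregation levels in HierFAvg.
import numpy as np

def client_update(w, data, grad_fn, lr=0.01, epochs=1, batch_size=4):
    """E local epochs of mini-batch SGD (same role as in the FederatedAveraging sketch)."""
    w = w.copy()
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            w -= lr * grad_fn(w, data[start:start + batch_size])
    return w

def cluster_update(w_global, cluster_clients, grad_fn, cluster_rounds=5, frac=0.5):
    """Run a few FederatedAveraging rounds inside one cluster and return the cluster model."""
    w_c = w_global.copy()
    for _ in range(cluster_rounds):
        m = max(int(frac * len(cluster_clients)), 1)
        subset = np.random.choice(len(cluster_clients), m, replace=False)
        updates = [client_update(w_c, list(cluster_clients[k]), grad_fn) for k in subset]
        sizes = np.array([len(cluster_clients[k]) for k in subset], dtype=float)
        w_c = np.average(np.stack(updates), axis=0, weights=sizes)
    return w_c

def hierfavg(clusters, grad_fn, dim, global_rounds=50):
    """clusters: list of clusters, each a list of per-client datasets."""
    w_global = np.zeros(dim)
    for _ in range(global_rounds):
        cluster_models = [cluster_update(w_global, c, grad_fn) for c in clusters]
        sizes = np.array([sum(len(d) for d in c) for c in clusters], dtype=float)
        # aggregate the cluster models into a new global model (weights as in Eq. 2.7)
        w_global = np.average(np.stack(cluster_models), axis=0, weights=sizes)
    return w_global
```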


2.3 Model Interpolation

This method is based on the Mapper algorithm proposed by [17], which interpolates a local model with a global model and, during training, lets the two models learn to complement each other.

We introduce the inference model,

$$\omega_I = \lambda \omega_L + (1 - \lambda)\,\omega_G, \tag{2.8}$$

which is an interpolation between the global model and the local model and is used locally when running inference. Neither ω_L nor ω_G is used separately to run inference. The optimization over all clients can be set up as follows:

$$\min_{\omega_{L,K},\,\omega_G,\,\lambda_K} \sum_{k \in K} \frac{|D_k|}{|D|}\, f\big(\lambda_k \omega_{L,k} + (1 - \lambda_k)\,\omega_G\big). \tag{2.9}$$
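The interpolation in Eq. (2.8) is simply a convex combination of two parameter vectors. The short sketch below illustrates it; the parameter vectors and the value of lambda are illustrative placeholders.

```python
# Sketch of the interpolated inference model in Eq. (2.8).
import numpy as np

def interpolate(w_local, w_global, lam):
    """Return the inference model w_I = lam * w_L + (1 - lam) * w_G."""
    return lam * np.asarray(w_local) + (1.0 - lam) * np.asarray(w_global)

# lam = 0.5 corresponds to the equal-weight choice used later in Eq. (3.1)
w_I = interpolate(np.ones(4), np.zeros(4), lam=0.5)   # -> array([0.5, 0.5, 0.5, 0.5])
```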

2.3.1 The Mapper Algorithm

To find the ω_{L,K}, ω_G and λ_K that optimize Eq. (2.9), the Mapper algorithm is used. At the server the global model ω_G is initialized and then sequentially sent to random clients for updating. At the client, a set Λ ⊂ [0, 1] of candidate interpolation weights is chosen, and for every λ ∈ Λ the optimal local model

$$\omega_L(\lambda) = \arg\min_{\omega_L} f\big(\lambda \omega_L + (1 - \lambda)\,\omega_G\big) \tag{2.10}$$

is found. The second stage is to find the most fitting

$$\lambda^* = \arg\min_{\lambda} f\big(\lambda\, \omega_L(\lambda) + (1 - \lambda)\,\omega_G\big) \tag{2.11}$$

and then use λ* and the updated ω_L(λ*) to update the central model ω_G with a gradient descent step,

$$\omega_G = \omega_G - \eta \nabla f\big(\lambda^* \omega_L(\lambda^*) + (1 - \lambda^*)\,\omega_G\big). \tag{2.12}$$

The central model ω_G is then sent to the next client for further updates. The full process is repeated until convergence, and subsequently the central model ω_G is distributed to all clients. At each client, the optimal ω_L(λ) and λ are then found according to Eq. (2.10) and Eq. (2.11), and the locally optimal inference model ω_I is deployed.

Algorithm 3 The Mapper algorithm. B is the local batch size, η is the global learning rate and E is the number of local epochs.

On server:
  initialize ω_G
  for each round t = 1, 2, ... do
    k_t ← random client
    ω_G ← ClientUpdate(ω_G, k_t)
  end for

ClientUpdate(ω_G, k):
  B ← split D_k into batches of size B
  ω_L ← ω_G
  for each λ ∈ Λ do
    for each local epoch i = 1 to E do
      for all batches b ∈ B do
        ω_L(λ) ← argmin_{ω_L} f(λ ω_L + (1 − λ) ω_G)
      end for
    end for
  end for
  λ* ← argmin_λ f(λ ω_L(λ) + (1 − λ) ω_G)
  ω_G ← ω_G − η ∇f(λ* ω_L(λ*) + (1 − λ*) ω_G)
  return ω_G

2.4 Clustering

Clustering clients in a federated learning setup can be done in different ways and with different metrics. One approach that has shown good results for clustering time series is agglomerative clustering using features extracted from the time series [16]. Using features instead of the actual time series reduces the amount of detail leaving the client, which helps protect privacy, and also reduces the dimensionality, which speeds up the distance calculations between nodes during clustering.

Agglomerative clustering is, together with divisive clustering, a hierarchical clustering method. It works by defining all clients as nodes and pairwise merging the two nodes with the shortest distance into a new node. This continues until the desired number of clusters is reached.


2.4.1 Time Series Feature Extraction

Table 2.1: Time series features to be extracted as proposed by [31]. Table adapted from [15].

Feature      Description
Mean         Mean
Var          Variance
ACF1-x       First-order autocorrelation
Trend        Strength of trend
Linearity    Strength of linearity
Curvature    Strength of curvature
Season       Strength of seasonality
Peak         Strength of peaks
Trough       Strength of troughs
Entropy      Spectral entropy
Lumpiness    Changing variance in remainder
Spikiness    Strength of spikiness
Lshift       Level shift using rolling window
Vchange      Variance change
Fspots       Flat spots using discretization
Cpoints      Number of crossing points
KLscore      Kullback-Leibler score
Change.idx   Index of the maximum KL score

To deal with privacy constraints when clustering clients based on time series data, features are extracted from the time series, providing a different set of variables to use when clustering. In [16], the most proficient method for clustering clients based on time series was agglomerative clustering on the set of features proposed by [31], listed in Table 2.1.
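The sketch below illustrates this clustering approach with scikit-learn. It is not the thesis implementation: only a few of the features in Table 2.1 are computed (the full set is defined in [31]), and the input series are assumed to already be normalized per client.

```python
# Sketch: extract a small feature vector per client and cluster the feature vectors.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def simple_features(series):
    """Mean, variance and lag-1 autocorrelation of one time series."""
    x = np.asarray(series, dtype=float)
    xc = x - x.mean()
    denom = xc @ xc
    acf1 = (xc[:-1] @ xc[1:]) / denom if denom > 0 else 0.0
    return np.array([x.mean(), x.var(), acf1])

def cluster_clients(series_list, n_clusters=10):
    features = np.stack([simple_features(s) for s in series_list])
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(features)
    return labels   # labels[k] is the cluster index of client k
```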

3 Methods

3.1 Proposed Algorithms

3.1.1 Usability Updates to The Mapper Algorithm

The Mapper algorithm has two features that make it cumbersome to run on devices with limited computation power. The first is the dynamic choice of λ, where for each candidate λ ∈ Λ the minimization in Eq. (2.10) has to be rerun. The second is that there is no built-in parallelism: computation is done on only one client at a time, not drawing any benefit from federated averaging [1], which leads to slow wall-clock convergence. To deal with the limitations in computing power, we constrain the number of balancing choices to Λ = {1/n_l}, where n_l is the number of levels. In the two-level setup we get λ = 1/2 and specify the inference model from Eq. (2.8) to be

$$\omega_I = \frac{\omega_L}{2} + \frac{\omega_G}{2}. \tag{3.1}$$

Adapted Model Interpolation Algorithm

Given the updates from Section 3.1.1, and the fact that gradient descent is used to perform the local optimization in Eq. (2.10), the adapted algorithm, shown in Algorithm 4, looks somewhat different from the standard Mapper algorithm. At the server level it works in the same way as the FederatedAveraging algorithm described in Section 2.1.1: in each round the server sends the central model ω_G to a subset of clients and aggregates the updated versions it receives back. At the client, the procedure is a bit simpler, since only a single, fixed λ is used. Each batch update begins by setting the interpolation model

$$\omega_I = \frac{\omega_L}{2} + \frac{\omega_G}{2} \tag{3.2}$$

and then taking an update step on the local model,

$$\omega_L = \omega_L - \eta_L \nabla \ell_{\text{cross}}(\omega_I, D_k). \tag{3.3}$$

After the local epochs, an update step is performed on the global model,

$$\omega_G = \omega_G - \eta_G \nabla \ell_{\text{cross}}(\omega_I, D_k), \tag{3.4}$$

before it is sent back to the server for aggregation. The full process is repeated until convergence, and subsequently the central model ω_G is distributed to all clients, where the optimal ω_L is found through fine-tuning and the locally optimal inference model ω_I is deployed.

Algorithm 4 The adapted Mapper algorithm. B is the local batch size, R is the proportion of clients used in each round, η_L and η_G are the local and global learning rates and E is the number of local epochs.

On server:
  initialize ω_G
  for each round t = 1, 2, ... do
    m ← max(R · |K|, 1)
    S_t ← random subset of m clients
    for each client k ∈ S_t do
      ω_{G,k} ← ClientUpdate(ω_G, k)
    end for
    ω_G ← average of the ω_{G,k}
  end for

ClientUpdate(ω_G, k):
  B ← split D_k into batches of size B
  ω_L ← ω_G
  for each local epoch i = 1 to E do
    for all batches b ∈ B do
      ω_I ← ω_L/2 + ω_G/2
      ω_L ← ω_L − η_L ∇f(ω_I)
    end for
  end for
  ω_G ← ω_G − η_G ∇f(ω_I)
  return ω_G
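The following is a minimal NumPy sketch of one ClientUpdate call in Algorithm 4: the fixed λ = 1/2 interpolation of Eq. (3.2), the local steps of Eq. (3.3), and the final global step of Eq. (3.4). grad_fn and the local dataset are hypothetical placeholders, not code from the thesis.

```python
# Sketch of the client-side step in the adapted Mapper algorithm.
import numpy as np

def adapted_mapper_client_update(w_global, data, grad_fn,
                                 lr_local=0.01, lr_global=0.01,
                                 epochs=1, batch_size=4):
    w_L = w_global.copy()                 # local model initialized from the global model
    w_G = w_global.copy()
    w_I = 0.5 * w_L + 0.5 * w_G
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w_I = 0.5 * w_L + 0.5 * w_G               # Eq. (3.2)
            w_L -= lr_local * grad_fn(w_I, batch)      # Eq. (3.3): local-model step
    w_G -= lr_global * grad_fn(w_I, data)              # Eq. (3.4): one global-model step
    return w_G                                         # sent back to the server for averaging
```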

3.1.2 Three-Level Model Interpolation

To expand the model interpolation method to handle the introduction of the cluster level, we first extend the interpolation model in Eq. (2.8) to include a cluster model ω_c. For simplicity, and because of the computation limitations discussed in Section 3.1.1, we give all three models equal weights and end up with

$$\omega_I = \frac{\omega_L}{3} + \frac{\omega_G}{3} + \frac{\omega_c}{3} \tag{3.5}$$

as the inference model. Subsequently, there needs to be a method for updating the cluster model; the proposed solution is a mix between the adapted two-level model interpolation algorithm from Section 3.1.1 and the HierFAvg algorithm presented in Section 2.2.1.

Three-level Model Interpolation Algorithm

On the server, the algorithm works in the same way as the HierFAvg algorithm (see Section 2.2.1), as can be seen in Algorithm 5: the global model is initialized, and in each round the latest global model is sent to each cluster for updates, after which the returned models are aggregated to form the updated global model. At each cluster, the cluster model ω_c is initialized as the global model ω_G, and in a similar manner as in the adapted Mapper algorithm the cluster model ω_c is sent out for updates together with the global model ω_G. At the client, the procedure continues to be similar to the adapted Mapper algorithm; however, the interpolation model is set to

$$\omega_I = \frac{\omega_L}{3} + \frac{\omega_G}{3} + \frac{\omega_c}{3} \tag{3.6}$$

before each update of the local model,

$$\omega_L = \omega_L - \eta_L \nabla \ell_{\text{cross}}(\omega_I, D_k). \tag{3.7}$$

Then an update is performed on the cluster model according to

$$\omega_c = \omega_c - \eta_c \nabla \ell_{\text{cross}}(\omega_I, D_k), \tag{3.8}$$

and the cluster model is sent back to the cluster for aggregation. After all the cluster rounds have finished, one more round with random clients is carried out, but this time the global model is updated,

$$\omega_G = \omega_G - \eta_G \nabla \ell_{\text{cross}}(\omega_I, D_k), \tag{3.9}$$

and aggregated at the cluster level before being sent back to the server for aggregation. After a number of global rounds have been carried out, the global model ω_G is distributed to all clusters, where the cluster models ω_c are updated according to the earlier procedure. Finally, the global model ω_G and the cluster models ω_c are sent to the clients in the respective clusters, and the final fine-tuning of the local model ω_L is performed before the optimized interpolation model ω_I is deployed at the client.

Algorithm 5 The extended Mapper algorithm. B is the local batch size, R is the proportion of clients used in each round, η_L, η_C and η_G are the local, cluster and global learning rates and E is the number of local epochs.

On server:
  initialize ω_G
  for each round t = 1, 2, ... do
    for each cluster c ∈ C do
      ω_{G,c} ← ClusterUpdate(ω_G, c)
    end for
    ω_G ← average of the ω_{G,c}
  end for

ClusterUpdate(ω_G, c):
  ω_c ← ω_G
  for each cluster round u = 1, 2, ... do
    m ← max(R · |c|, 1)
    S_u ← random subset of m clients
    for each client k ∈ S_u do
      ω_{c,k} ← ClientUpdate(ω_G, ω_c, k, global_round = False)
    end for
    ω_c ← average of the ω_{c,k}
  end for
  ω_{G,c} ← ClientUpdate(ω_G, ω_c, k, global_round = True)
  return ω_{G,c}

ClientUpdate(ω_G, ω_C, k, global_round):
  B ← split D_k into batches of size B
  ω_L ← ω_C
  for each local epoch i = 1 to E do
    for all batches b ∈ B do
      ω_I ← ω_L/3 + ω_C/3 + ω_G/3
      ω_L ← ω_L − η_L ∇f(ω_I)
    end for
  end for
  if global_round then
    ω_G ← ω_G − η_G ∇f(ω_I)
    return ω_G
  else
    ω_C ← ω_C − η_C ∇f(ω_I)
    return ω_C
  end if
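Analogously to the two-level sketch, the following is a minimal NumPy sketch of the client-side step in Algorithm 5: the equal-weight interpolation of Eq. (3.6), the local steps of Eq. (3.7), and then either a cluster step (Eq. 3.8) or a global step (Eq. 3.9) depending on the round type. grad_fn and the local dataset are hypothetical placeholders.

```python
# Sketch of the client-side step in the three-level model interpolation.
import numpy as np

def three_level_client_update(w_global, w_cluster, data, grad_fn,
                              global_round=False, lr_local=0.01,
                              lr_cluster=0.01, lr_global=0.01,
                              epochs=1, batch_size=4):
    w_L = w_cluster.copy()                         # local model starts from the cluster model
    w_I = (w_L + w_cluster + w_global) / 3.0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w_I = (w_L + w_cluster + w_global) / 3.0    # Eq. (3.6)
            w_L -= lr_local * grad_fn(w_I, batch)        # Eq. (3.7)
    if global_round:
        return w_global - lr_global * grad_fn(w_I, data)    # Eq. (3.9)
    return w_cluster - lr_cluster * grad_fn(w_I, data)      # Eq. (3.8)
```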

3.2 Method Evaluation

3.2.1 Metrics: MASE

To measure time series forecast error, there are different metrics that can be used, such as MSE, sMAPE and MdRAE. In 2006, [32] proposed that MASE should be used for comparing time series forecasts, as it is suitable for all situations. MASE works by scaling the error by the in-sample MAE of the naïve forecast method. The mean absolute scaled error is defined as

$$\mathrm{MASE} = \frac{\tfrac{1}{J}\sum_{t=1}^{J}\lvert Y_t - F_t\rvert}{\tfrac{1}{T-1}\sum_{t=2}^{T}\lvert Y_t - Y_{t-1}\rvert}, \tag{3.10}$$

where T is the length of the in-sample period, J is the length of the forecast, F_t is the prediction at time t and Y_t is the time series value at time t.
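A small sketch of Eq. (3.10) in NumPy, included for clarity; the example numbers are illustrative only.

```python
# MASE: mean absolute forecast error scaled by the in-sample MAE of the naive forecast.
import numpy as np

def mase(in_sample, actual, forecast):
    """in_sample: training series; actual/forecast: the J forecast steps."""
    in_sample = np.asarray(in_sample, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    naive_mae = np.mean(np.abs(np.diff(in_sample)))   # denominator of Eq. (3.10)
    return np.mean(np.abs(actual - forecast)) / naive_mae

# Example: naive in-sample error of 1 and forecast errors of 0.5 give MASE = 0.5
print(mase([1, 2, 3, 4, 5], [6, 7], [5.5, 6.5]))
```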

3.2.2 Data

The dataset used was collected from 323 base stations located in a European metropolitan city, each contributing one time series of hourly handovers spanning eight weeks. A handover, or hand-off, in a cellular network is when a mobile connection is transferred from one base station to another without losing the active transmission. Some examples of the time series can be seen in Figure 3.1. The base stations are also hierarchically grouped by geographical location, where each group creates an aggregated time series consisting of the sum of the time series of the base stations in that group. These time series are treated the same as the original base station time series and added to the lot for the sake of increasing the dataset. This results in 510 non-IID time series, each seen as a client in the network.

Figure 3.1: Raw time series from clients 39, 139 and 200.

Figure 3.2: Example of two train instances, from client 39 and client 200.

Preprocessing

Each time series is individually normalized with min/max scaling to make sure that the shape of the time series is the main distinguishing factor between them, both when clustering the clients and when training the forecasting model.

As the forecasting problem is defined as taking two days' worth of time series as input and forecasting the next 24 hours, the time series are organised into 24-hour blocks. Each 24-hour block is seen as the true output and is paired with the previous 48 hours as input, together forming one dataset instance, as can be seen in Figure 3.2. The resulting instances are then split into train, validation and test sets by taking the last seven output blocks of the time series, with their associated input blocks, as the test set, the seven blocks before those as the validation set, and the rest as training instances.
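For illustration, a minimal sketch of this preprocessing is shown below: per-client min/max normalization followed by slicing into (48-hour input, 24-hour output) instances aligned to whole days. The synthetic ramp standing in for eight weeks of hourly data, and the helper names, are assumptions for the example only.

```python
# Sketch: per-client normalization and windowing into 48-in / 24-out instances.
import numpy as np

def min_max_normalize(series):
    x = np.asarray(series, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def make_instances(series, input_hours=48, output_hours=24):
    x = min_max_normalize(series)
    instances = []
    # each output block starts at a midnight preceded by two full days of input
    for start in range(input_hours, len(x) - output_hours + 1, output_hours):
        instances.append((x[start - input_hours:start], x[start:start + output_hours]))
    return instances   # list of (input window, true output) pairs

instances = make_instances(np.arange(24 * 7 * 8))        # eight weeks of hourly values
train, val, test = instances[:-14], instances[-14:-7], instances[-7:]   # 40 / 7 / 7
```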

Table 3.1: Model parameters for the LSTM network used in all experiments.

Model Parameter      Value
Batch size           4
LSTM cells           64
L2 regularization    0.0005
Optimizer            RMSProp

3.2.3 Model

The problem to solve is a forecasting problem where the next 24 hours are to be predicted at each midnight based on the readings from the previous 48 hours. For this we use an LSTM network with one hidden layer consisting of 64 LSTM cells. The network is set up with an input window size of 48 and a forecast horizon of 24; further hyperparameters are listed in Table 3.1.

3.3 Algorithms

Six different training algorithms are compared in the experiments: three that require the clients to be clustered and three that work in a non-clustered setup.

3.3.1 Vanilla Federated Learning

We denote the standard FL algorithm described in Section 2.1.1 as Vanilla. The algorithm results in a single global model, which is tested on all clients.

3.3.2 Fine-tuned Vanilla Federated Learning

The fine-tuned Vanilla FL is obtained by first training a global model according to the algorithm in Section 2.1.1, after which the global model is fine-tuned locally on each client, resulting in a unique model for each client in the network.
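The local fine-tuning step is ordinary continued training of a copy of the converged global model on the client's own data. A minimal tf.keras sketch of this step is given below; global_model, x_local, y_local and the number of fine-tuning epochs are hypothetical placeholders, and the loss choice is an assumption.

```python
# Sketch: per-client fine-tuning of the converged global model.
import tensorflow as tf

def fine_tune(global_model, x_local, y_local, epochs=5, batch_size=4):
    local_model = tf.keras.models.clone_model(global_model)   # same architecture
    local_model.set_weights(global_model.get_weights())       # start from global weights
    local_model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss="mae")
    local_model.fit(x_local, y_local, epochs=epochs, batch_size=batch_size, verbose=0)
    return local_model   # unique per-client model used for local inference
```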

3.3.3 Model Interpolation

The Model Interpolation approach follows the adapted Mapper algorithm described in Section 3.1.1. This algorithm is also fine-tuned, in a similar manner as Vanilla FT, but following the structure of the Mapper algorithm, where only the local model is updated during fine-tuning. This algorithm results in a unique model for each client in the network.

3.3.4 Clustered Vanilla Federated Learning

The clustered Vanilla algorithm is run in the same way as the Vanilla FL algorithm described in Section 2.1.1, but separately on each cluster, resulting in a unique model per cluster that is tested on all clients in that cluster.

3.3.5 Fine-tuned HierFAvg

The fine-tuned HierFAvg algorithm works by first producing a global model according to the HierFAvg algorithm described in Section 2.2.1. The global model is then separately updated on each cluster, in the same way as in the HierFAvg algorithm, to produce models adapted to each cluster. These models are thereafter fine-tuned locally on each client, in the same manner as the fine-tuned Vanilla. This results in a unique model for each client.

3.3.6 Hierarchical Model Interpolation

The Hierarchical Model Interpolation approach is the proposed extended Mapper algorithm that interpolates between a global, a cluster and a local model. It is described in Section 3.1.2 and extends the adapted Mapper algorithm to three levels, in the manner of the HierFAvg algorithm. It fine-tunes in the same way as the fine-tuned HierFAvg algorithm: after the steps taken to optimize the global model, the cluster model is optimized, and then local fine-tuning of the local model is done in the same manner as in the two-level Model Interpolation approach in Section 3.3.3. It thus also produces a unique model for each client.

3.4 Experiments

The experiments are intended to exploit the differences between the six algorithms and put them to the test. Performance is evaluated by calculating the mean MASE (see Section 3.2.1) between the predictions and the true values over all test instances for all clients. All setups are run three times, and the presented results are the averages over the three runs. The validation performance is used to find the best hyperparameters for each algorithm in each experiment.

Early test runs on the full train dataset indicated that the amount of locally available training data was enough to model the distribution without much help from data available at other clients, making federated learning somewhat redundant for generalization purposes on this task. In an attempt to simulate a scenario where the locally available data cannot by itself give a good representation of the distribution, we reduce the amount of training data. To generalize well and produce a high-performing model, training then relies more on data from several clients, making the experiment a more suitable environment for comparing federated learning algorithms.

3.4.1 Experiment 1: Full Train Dataset

In Experiment 1, all available training data is used to produce the best prediction. This results in 40 train instances spanning six weeks of the time series for each client.

3.4.2 Experiment 2: 25% Train Dataset

In the second experiment the train dataset is limited to a quarter of the training data, resulting in 10 instances randomly sampled from the first 14 instances, covering the first two weeks of the time series. This creates some distance in time between the training data and the validation and test data. Under the assumption that the time series change somewhat over time, the distribution the train-set instances stem from might differ from the distribution the validation and test sets stem from. This further increases the challenge of producing a well-generalized model.

3.4.3 Experiment 3: 12% Train Dataset

In the third experiment we starve the train set further, continuing the pattern from Experiment 2 but this time with only 4 train instances randomly selected from the first two weeks of the time series. Since many of the time series show a weekly periodicity, with weekdays showing similar patterns over time, restricting the train set so that it does not cover a full week further forces the model to rely on information from other clients to generalize well. Specifically, information about weekdays not covered in the local train set would be found at clients in the same cluster.


Figure 3.3: Time series samples from Cluster 9 from different clients but for the same time step.

Figure 3.4: Time series samples from Cluster 9 from different clients but for the same time step.

3.4.4 Clustering

To produce the clusters used by the three algorithms that require a clustered network, agglomerative clustering is applied to the features extracted (see Section 2.4.1) from the train-set part of each time series. The clients are divided into ten clusters, and the same clustering setup is used for all experiments. Since this thesis does not focus on different ways of clustering, a qualitative analysis is done to deem the clustering good enough. In Figure 3.3 and Figure 3.4 we see some train instances from clients belonging to cluster 9, and in Figure 3.5 and Figure 3.6 some train instances from clients belonging to cluster 6. From studying figures like these (samples from all clusters can be found in Appendix A), we judge that the time series within a cluster are similar to each other and less similar to time series from other clusters.

Figure 3.5: Time series samples from Cluster 6 from different clients but for the same time step.

Figure 3.6: Time series samples from Cluster 6 from different clients but for the same time step.

4 Results

Here the results for the three different experiments are presented.

4.1 Experiment 1: Full Dataset

Table 4.1: The MASE for the different algorithms in Experiment 1, for all clients as well as for the worst-performing 10% of the clients, the 90th percentile (p90).

Full Train Dataset
        Two-level                            Clustered
        Vanilla  Vanilla FT  Mapper          Vanilla  HierFAvg FT  HierMapper
MASE    1.70     1.00        1.12            1.41     1.06         1.18
p90     2.40     1.40        1.55            2.06     1.44         1.63

We can see in Table 4.1 that the best-performing algorithm when trained on the full dataset is the two-level fine-tuned Vanilla algorithm, with a mean MASE of 1.00, with the three-level fine-tuned HierFAvg algorithm performing nearly as well, with a mean MASE of 1.06. The mean MASE for the 90th percentile for these two algorithms is also the lowest, at 1.40 and 1.44. Slightly worse are the two-level and three-level Model Interpolation approaches, both showing an increase in the mean MASE and the 90th-percentile mean MASE over the two fine-tuning algorithms. The clustered Vanilla setup performs substantially worse than the fine-tuned algorithms but still shows an improvement over the non-clustered Vanilla approach, which has the highest mean MASE and the highest 90th-percentile mean MASE.

Table 4.2: The MASE for the different algorithms in Experiment 2, for all clients as well as for the worst-performing 10% of the clients, the 90th percentile (p90).

25% Train Dataset
        Two-level                            Clustered
        Vanilla  Vanilla FT  Mapper          Vanilla  HierFAvg FT  HierMapper
MASE    1.71     1.21        1.25            1.42     1.25         1.24
p90     2.45     1.73        1.72            2.07     1.77         1.77

Table 4.3: The MASE for the different algorithms in Experiment 3, for all clients as well as for the worst-performing 10% of the clients, the 90th percentile (p90).

12% Train Dataset
        Two-level                            Clustered
        Vanilla  Vanilla FT  Mapper          Vanilla  HierFAvg FT  HierMapper
MASE    1.86     1.28        1.39            1.62     1.27         1.35
p90     2.45     1.79        2.03            2.23     1.82         2.01

4.2 Experiment 2: 25% Dataset

Using only 25% of the train dataset, we see little to no decrease in performance for the non-locally-adapted algorithms, Vanilla and clustered Vanilla, in Table 4.2. For the fine-tuned Vanilla algorithm there is a substantial increase in the mean MASE and the 90th-percentile mean MASE, as there is for the fine-tuned HierFAvg algorithm. The performance of the two-level and three-level Mapper algorithms also decreases a bit, but they end up performing similarly to the fine-tuned Vanilla and fine-tuned HierFAvg, both in terms of the total mean MASE and the 90th-percentile mean MASE.

4.3 Experiment 3: 12% Dataset

In the case with very limited training data, only 1/8th of the train dataset, we can observe in Table 4.3 a more or less general decrease in performance for all of the algorithms compared to the experiments with the full and 1/4th of the train dataset. However, compared to the experiment with 1/4th of the train data, the performance decrease for the fine-tuned Vanilla and fine-tuned HierFAvg algorithms is limited, both in terms of the mean MASE and the 90th-percentile mean MASE. Both of the Mapper approaches show a significant decrease in performance compared to the 1/4th train-set experiment, and their performance is thus significantly worse than that of the two fine-tuned algorithms, which hold the best numbers in the 1/8th train-dataset experiment. The clustered Vanilla approach also shows decreased performance at this level of train-dataset restriction, but it still performs better than the two-level Vanilla approach. The two-level Vanilla mean MASE increases to 1.86 in this experiment, but its 90th-percentile mean MASE is the same as for the 1/4th train-dataset experiment and only slightly higher than for the full train-dataset experiment.

4.4 Overall

Regardless of the amount of training data, the two-level Vanilla approach is the worst-performing algorithm. However, it does not drop much in performance when trained on limited amounts of training data. The clustered Vanilla approach always shows an improvement over the Vanilla approach, but it does not beat any of the locally adapted algorithms in any of the three experiments. The two-level and three-level Mapper algorithms follow the same performance changes between the different experiments and always show similar performance to each other. They are both outperformed by the two fine-tuned algorithms, except in the experiment with 1/4th of the data, where all four locally adapting algorithms perform similarly. The two fine-tuned algorithms also follow each other, showing similar results in all three experiments, and they show the best performance in all three experiments.

5 Discussion

Comparing the results from Experiment 1 and Experiment 2, we can see that the Vanilla and clustered Vanilla approaches do not drop in performance, whereas all the locally adapted models suffer. The decrease in performance of the locally adapted methods could be due to the distance in time between where the training instances and the test instances are taken from in the time series. This is further supported by the fact that the Vanilla and the clustered Vanilla methods do not decrease in performance between Experiment 1 and 2, which can be attributed to the notion that they are more generalized and not fitted that strongly to any specific time series, and thus do not suffer when the distance between train and test set increases. This can be translated to the concept of distributions: in the case of the fine-tuning algorithms, the distribution of the train set differs from that of the test set not only because of the low amount of training data, but also because the samples of the two sets can be seen as drawn from two different distributions, since the attributes of a time series may change with time. In the case of the Vanilla and clustered Vanilla algorithms, the train set can be seen as a union of all train sets, as can the test set, and because of the heterogeneity between the different clients, the early part of one time series can be similar to a later part of another client's series, smoothing out the time difference and making the overall distribution of the training data more similar to the overall distribution of the combined test set. This could explain why the performance of the Vanilla approaches does not drop between Experiment 1 and 2.

The Mapper algorithm does not show any increase in performance over running Vanilla FL with local fine-tuning in any of the three experiments. Rather, it falls behind the fine-tuned Vanilla solution in more or less all of the experiments, which, in combination with its increased number of hyperparameters and increased need for computation, makes it not recommendable for this kind of dataset and task.

Comparing the Mapper algorithm with the Hierarchical Mapper we can see that they follow each other in performance throughout the three experiments.

The fine-tuned Vanilla and the fine-tuned HierFAvg algorithms show very similar performance to each other. This can be an indication that the dataset is generally very susceptible to fine-tuning and that the fine-tuning does not rely much on which initial model is being fine-tuned. The potential benefit of the HierFAvg algorithm, adapting the global model to the cluster before local fine-tuning, does not show any performance increase compared to the vanilla fine-tuned model, which lets all clients use the global model to initialize the fine-tuning process. We also notice that the limited number of global updates occurring in the HierFAvg algorithm does not hinder performance after the models have been fine-tuned.

In none of the experiments is there evidence that training an ML model in a hierarchical federated learning setup leads to greater performance than training it in a standard two-level setup. The results are limited to this dataset, where the task appears to be simple; however, even with the highly constrained dataset the results are the same. The clustering method used in the experiments is not necessarily the optimal one, and it is possible that another clustering setup would have produced better results for the hierarchical approaches.

Separating the clients into clusters and training each cluster with federated learning separately shows an improvement over training a global vanilla model in all three experiments. However, in Experiment 3, with the least amount of local data, the difference is not that big, indicating that the drastically lower amount of total training data within a cluster might make it more difficult for the model to generalize as well as in the experiments with more data.

6 Conclusion

To sum up what we can gather from this work, and to answer the research question in Section 1.3, we conclude that we found no performance benefit from training the models in a hierarchical setup over the two-level setup. For tasks and datasets such as the one used in this thesis, the best performance is achieved by locally fine-tuning the global model produced by vanilla federated learning.

We can also see that the Mapper algorithm does not provide better results than simply fine-tuning a global model; rather, it produces similar results at best and otherwise performs worse. Fairly simple clustering can provide performance increases when dealing with heterogeneous data, even with smaller amounts of local data, compared to using one global model to fit all clients.

The time series forecasting task on this dataset might be too simple to fully represent the problem stated in the problem formulation. Running similar experiments using a different dataset and a more complex task would be an interesting way forward.

Bibliography

[1] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. "Communication-Efficient Learning of Deep Networks from Decentralized Data". In: arXiv e-prints, arXiv:1602.05629 (Feb. 2016). arXiv: 1602.05629 [cs.LG].

[2] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. "Advances and Open Problems in Federated Learning". In: arXiv e-prints, arXiv:1912.04977 (Dec. 2019). arXiv: 1912.04977 [cs.LG].

[3] Sundar Pichai. "Google's Sundar Pichai: Privacy Should Not Be a Luxury Good". In: New York Times (May 2019).

[4] Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. "Federated Learning for Mobile Keyboard Prediction". In: arXiv e-prints, arXiv:1811.03604 (Nov. 2018). arXiv: 1811.03604 [cs.CL].

[5] Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong, Daniel Ramage, and Françoise Beaufays. "Applied Federated Learning: Improving Google Keyboard Query Suggestions". In: arXiv e-prints, arXiv:1812.02903 (Dec. 2018). arXiv: 1812.02903 [cs.LG].

[6] Mingqing Chen, Rajiv Mathews, Tom Ouyang, and Françoise Beaufays. "Federated Learning Of Out-Of-Vocabulary Words". In: arXiv e-prints, arXiv:1903.10635 (Mar. 2019). arXiv: 1903.10635 [cs.CL].

[7] Swaroop Ramaswamy, Rajiv Mathews, Kanishka Rao, and Françoise Beaufays. "Federated Learning for Emoji Prediction in a Mobile Keyboard". In: arXiv e-prints, arXiv:1906.04329 (June 2019). arXiv: 1906.04329 [cs.CL].

[8] Apple. Designing for Privacy. https://developer.apple.com/videos/play/wwdc2019/708. 2019.

[9] EU CORDIS. Machine Learning Ledger Orchestration for Drug Discovery. https://cordis.europa.eu/project/id/831472?WT.mc_id=RSS-Feed&WT.rss_f=project&WT.rss_a=223634&WT.rss_ev=a. 2019.

[10] Theodora S. Brisimi, Ruidi Chen, Theofanie Mela, Alex Olshevsky, Ioannis Ch. Paschalidis, and Wei Shi. "Federated learning of predictive models from federated Electronic Health Records". In: International Journal of Medical Informatics 112 (2018), pp. 59–67. ISSN: 1386-5056. DOI: https://doi.org/10.1016/j.ijmedinf.2018.01.007. URL: http://www.sciencedirect.com/science/article/pii/S138650561830008X.

[11] Life Insurance International. WeBank and Swiss Re Signed Cooperation MOU. https://www.lifeinsuranceinternational.com/news/swiss-re-webank/. 2019.

[12] MUSKETEER. About. https://musketeer.eu/project/. 2020.

[13] Intel AI. https://www.intel.ai/federated-learning-for-medical-imaging/#gs.117zw3. 2019.

[14] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. "Federated Machine Learning: Concept and Applications". In: ACM Trans. Intell. Syst. Technol. 10.2 (Jan. 2019). ISSN: 2157-6904. DOI: 10.1145/3298981. URL: https://doi.org/10.1145/3298981.

[15] Kasun Bandara, Christoph Bergmeir, and Slawek Smyl. "Forecasting Across Time Series Databases using Recurrent Neural Networks on Groups of Similar Series: A Clustering Approach". In: arXiv e-prints, arXiv:1710.03222 (Oct. 2017). arXiv: 1710.03222 [cs.LG].

[16] Fernando Díaz González. "Federated Learning for Time Series Forecasting Using LSTM Networks: Exploiting Similarities Through Clustering". MA thesis. KTH, School of Electrical Engineering and Computer Science (EECS), 2019, p. 70.

[17] Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. "Three Approaches for Personalization with Applications to Federated Learning". In: arXiv e-prints, arXiv:2002.10619 (Feb. 2020). arXiv: 2002.10619 [cs.LG].

[18] Lumin Liu, Jun Zhang, S. H. Song, and Khaled B. Letaief. "Client-Edge-Cloud Hierarchical Federated Learning". In: arXiv e-prints, arXiv:1905.06641 (May 2019). arXiv: 1905.06641 [cs.NI].

[19] Mehdi Salehi Heydar Abad, Emre Ozfatura, Deniz Gunduz, and Ozgur Ercetin. "Hierarchical Federated Learning Across Heterogeneous Cellular Networks". In: arXiv e-prints, arXiv:1909.02362 (Sept. 2019). arXiv: 1909.02362 [cs.LG].

[20] Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. "Salvaging Federated Learning by Local Adaptation". In: arXiv e-prints, arXiv:2002.04758 (Feb. 2020). arXiv: 2002.04758 [cs.LG].

[21] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. "Exploiting unintended feature leakage in collaborative learning". In: 2019 IEEE Symposium on Security and Privacy (SP). IEEE. 2019, pp. 691–706.

[22] Khe Chai Sim, Petr Zadrazil, and Françoise Beaufays. "An Investigation Into On-device Personalization of End-to-end Automatic Speech Recognition Models". In: arXiv e-prints, arXiv:1909.06678 (Sept. 2019). arXiv: 1909.06678 [eess.AS].

[23] Yihan Jiang, Jakub Konečný, Keith Rush, and Sreeram Kannan. "Improving Federated Learning Personalization via Model Agnostic Meta Learning". In: arXiv e-prints, arXiv:1909.12488 (Sept. 2019). arXiv: 1909.12488 [cs.LG].

[24] Chelsea Finn, Pieter Abbeel, and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks". In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. ICML'17. Sydney, NSW, Australia: JMLR.org, 2017, pp. 1126–1135.

[25] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. "Overcoming catastrophic forgetting in neural networks". In: arXiv e-prints, arXiv:1612.00796 (Dec. 2016). arXiv: 1612.00796 [cs.LG].

[26] Robert French. "Catastrophic forgetting in connectionist networks". In: Trends in Cognitive Sciences 3 (May 1999), pp. 128–135. DOI: 10.1016/S1364-6613(99)01294-2.

[27] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. "Faster gaze prediction with dense networks and Fisher pruning". In: arXiv e-prints, arXiv:1801.05787 (Jan. 2018). arXiv: 1801.05787 [cs.CV].

[28] Qing Tian, Tal Arbel, and James J. Clark. "Structured deep Fisher pruning for efficient facial trait classification". In: Image and Vision Computing 77 (2018), pp. 45–59. ISSN: 0262-8856. DOI: https://doi.org/10.1016/j.imavis.2018.06.008. URL: http://www.sciencedirect.com/science/article/pii/S0262885618301045.

[29] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network". In: arXiv e-prints, arXiv:1503.02531 (Mar. 2015). arXiv: 1503.02531 [stat.ML].

[30] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. "Federated optimization in heterogeneous networks". In: arXiv preprint arXiv:1812.06127 (2018).

[31] R. J. Hyndman, E. Wang, and N. Laptev. "Large-Scale Unusual Time Series Detection". In: 2015 IEEE International Conference on Data Mining Workshop (ICDMW). Nov. 2015, pp. 1616–1619. DOI: 10.1109/ICDMW.2015.104.

[32] Rob J. Hyndman and Anne B. Koehler. "Another look at measures of forecast accuracy". In: International Journal of Forecasting 22.4 (2006), pp. 679–688. ISSN: 0169-2070. DOI: https://doi.org/10.1016/j.ijforecast.2006.03.001. URL: http://www.sciencedirect.com/science/article/pii/S0169207006000239.

A Cluster Examples

Figure A.1: Time series samples from Cluster 1 from the same time step.
Figure A.2: Time series samples from Cluster 1 from the same time step.
Figure A.3: Time series samples from Cluster 2 from the same time step.
Figure A.4: Time series samples from Cluster 2 from the same time step.
Figure A.5: Time series samples from Cluster 3 from the same time step.
Figure A.6: Time series samples from Cluster 3 from the same time step.
Figure A.7: Time series samples from Cluster 4 from the same time step.
Figure A.8: Time series samples from Cluster 4 from the same time step.
Figure A.9: Time series samples from Cluster 5 from the same time step.
Figure A.10: Time series samples from Cluster 5 from the same time step.
Figure A.11: Time series samples from Cluster 6 from the same time step.
Figure A.12: Time series samples from Cluster 6 from the same time step.
Figure A.13: Time series samples from Cluster 7 from the same time step.
Figure A.14: Time series samples from Cluster 7 from the same time step.
Figure A.15: Time series samples from Cluster 8 from the same time step.
Figure A.16: Time series samples from Cluster 8 from the same time step.
Figure A.17: Time series samples from Cluster 9 from the same time step.
Figure A.18: Time series samples from Cluster 9 from the same time step.
Figure A.19: Time series samples from Cluster 10 from the same time step.
Figure A.20: Time series samples from Cluster 10 from the same time step.

www.kth.se
