Identifying regions most likely to contribute to an epidemic outbreak in a human mobility network

Alexander Bridgwater

Computer Science and Engineering, bachelor's level 2021

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering


Abstract

The importance of modelling the spreading of infectious diseases as part of a public health strategy has been highlighted by the ongoing coronavirus pandemic. This includes identifying the geographical areas or travel routes most likely to contribute to the spreading of an outbreak. These areas and routes can then be monitored as part of an early warning system, be part of intervention strategies, e.g. lockdowns, aiming to mitigate the spreading of the disease or be a focus of vaccination campaigns.

This thesis focuses on developing a network-based infection model between the municipalities of Sweden in order to identify the areas most likely to contribute to an epidemic. First, a human mobility model is constructed based on the well-known radiation model. Then a network-based SEIR compartmental model is employed to simulate epidemic outbreaks with various parameters. Finally, the influence maximization problem known from network science is adopted to identify the municipalities having the largest impact on the spreading of infectious diseases.

The resulting superspreading municipalities point towards confirmation of the known fact that central, highly populated regions in highly populated areas initially carry a greater risk than their neighbours. However, once these areas are targeted, the other resulting nodes show a greater variety in geographical location than expected. Furthermore, a correlation can be seen between increased infection time and greater variety, although more empirical data is required to support this claim.

For further evaluation of the model, the mobility network was studied due to its central role in creating data for the model parameters. Commuting data in the Gothenburg region were compared to the estimations, showing an overall good accuracy with major deviations in a few cases.

Keywords

Complex networks, Influence maximization, Epidemic modelling, Human mobility, Artificial intelligence


Acknowledgements

I would like to express my appreciation to my supervisor András for keeping me motivated throughout the thesis and providing me with challenging yet very exciting paths in the subject of epidemic modelling. Our discussions have given me many insights about research and essay writing that I will carry with me for the future.


Authors

Alexander Bridgwater <alebir8@student.ltu.se>

Computer Engineering

LTU Luleå University of Technology

Place for Project

Luleå, Sweden

Examiner

Marcus Liwicki

LTU Luleå University of Technology

Supervisor

András Bota


1 Introduction

The first record of an epidemic originates in Babylon in 1200 BC [20]. Since then, hundreds of epidemics have emerged, the latest major one being Covid-19. Scientific efforts in creating vaccines to reduce propagation have vastly reduced the toll on human life in recent centuries. Yet when a virus appears, scientists are always faced with a recurring factor: time. During the time before a vaccine is available, the proper strategic steps to restrict infrastructure should be taken. There is a large body of literature covering these steps [3][19][23]. This report covers one of the tools that assist in deciding on these precautions, namely epidemic modelling.

1.1 Motivation

As Bill Gates put it in 2015: "We're not ready for another pandemic" [10]. He was correct. Covid-19 has shown the world's unpreparedness for handling a highly contagious virus, independent of nations' wealth, with some data even showing a positive correlation between GDP and infections [29]. The unpreparedness has become evident from the different approaches taken by nations with similar conditions, in particular Sweden compared to its neighbouring countries. This experimenting with strategies is done with high stakes, to say the least.

After the Covid-19 epidemic, a lot of time will be spent preparing for the next highly contagious virus to emerge. By that time, the data will be more substantial and our ability to predict the stochastic processes will have improved. At that point, strategies could be implemented with greater certainty by using a model to experiment with different scenarios, without putting human life at risk.

Propagation between regions has been vastly facilitated by modern travel frequencies. As a result, highly contagious diseases have an efficient platform to propagate on. But today's society also offers higher traceability of human mobility, mainly through social media [21]. These mobility patterns constitute the single most important data in combating diseases through epidemic modelling.

1.2 Problem

By following previous research within the area, it is possible to create a model simulating real outbreaks and providing us with more information given the situation. The most restrictive measures taken by governments in the current epidemic are lockdowns, imposed in a controlled manner so as not to affect the supply chain too heavily. This is one of the scenarios where it may be important to highlight the regions with the highest rate of infection, since imposing a lockdown on a large part of a nation can have a detrimental economic impact. This raises the question: is there a systematic way to find these regions using epidemic modelling?

1.3 Purpose

The purpose of this thesis is to highlight an uncommon yet powerful tool for decision making regarding the propagation of viruses and to argue for the accuracy of the solution. An additional purpose is to provide the reader with common tools and methods of epidemic modelling to allow for future work.

1.4 Goal

The goal of this thesis is to create a network-based epidemic model for Sweden and implement an influence maximization algorithm to identify the cities most likely to contribute to an outbreak.

1.5 Delimitations

Limitations for models are unavoidable when dealing with many stochastic processes.

Firstly, centroids are used when approximating the population of the regions that fall within the radius of the radiation model. This may cause over- and underestimations from case to case, and performs considerably worse for a short radius with large populations, since wrongly excluding or including a highly populated area could greatly affect the flux. The radiation model used for estimating travel is also known to suffer from inaccuracy for short distances [27].

The assumption of an equal distribution of individual conditions across regions is also a simplification of reality. For instance, cities largely populated by university students, such as Luleå and Lund, have a different age distribution and are therefore affected by a pandemic differently than other regions. This also applies to viewing one population as a collective organism assuming a collective state; although this has proven relatively accurate, it does not fully agree with reality.

1.6 Ethics

The data gathered from the sources follows all laws on human rights and data protection. The data is minimized to contain only what is necessary to fit the parameters and create a model. Furthermore, the data is gathered from publicly available information and does not contain any person-specific data.

1.7 Sustainability

The motivation for this thesis lies mainly in assisting governments in protecting their citizens. That is why this thesis carries a strong sustainability aspect. It directly supports the UN's third goal, "Good Health and Well-being" [28], but also many others indirectly, due to the multifaceted toll an epidemic takes on people.


2 Theoretical Background

The model used in epidemic modelling is often seen as a complex network, as it is represented as a mathematical graph with nontrivial topological features, which is often the case for networks representing real systems. Hence, this section defines mathematical graphs and their use in epidemic modelling. It also covers the research applied in this thesis and, finally, the data structures used for an efficient network.

2.1 History of epidemic modelling

The first epidemic model was a mathematical one developed by the physician Daniel Bernoulli in the 18th century, when smallpox was causing 400 000 deaths each year in Europe [22] [13], 36 years before the modern vaccine was developed. His efforts went into investigating the effect of variolation, a precursor to inoculation meant to provide immunity by passing on the virus from a diseased patient through superficial procedures, in the hope of a mild reaction compared to naturally acquired smallpox. He managed to show that applying this method of providing immunity universally would increase life expectancy by three years and two months.

Compartmental models were first introduced in 1927, when Anderson Gray McKendrick together with William Ogilvy Kermack introduced what is now known as the Kermack-McKendrick model [16]. In their research they introduce the compartments susceptible, infected, and immune, and the relationships between these groups. The compartments and their rates of change are described by the following differential equations

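\[
\frac{dS}{dt} = -\lambda S, \qquad
\frac{\partial i}{\partial t} + \frac{\partial i}{\partial a} = \delta(a)\lambda S - \gamma(a)\, i, \qquad
\frac{dR}{dt} = \int_0^{\infty} \gamma(a)\, i(a, t)\, da
\]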

As can be seen, the infected density i depends on time t and on the time since infection a. The variable λ expresses the pressure of propagation

\[
\lambda = \int_0^{\infty} \beta(a)\, i(a, t)\, da
\]

2.2 SEIR

Epidemic modelling generally assumes that the disease states of the nodes can be divided into multiple compartments [21]. Typically these compartments consist of nodes that are Susceptible (the default state, signalling that the node is vulnerable to the virus), Exposed (the virus has been contracted but the node is not yet contagious), Infected (contagious), and Removed (immune through antibodies, or deceased). These four compartments form the foundation of the most used models.

Each model is relevant to some kind of virus or disease. The most basic of these models is the SI model, where the acronym refers to the order of the states, Susceptible→Infected. An entity starts as a susceptible node that is later infected by an infectious node. This model assumes the nodes are born with or without immunity. Once infected, they remain infectious for the remainder of the simulation. This behaviour is similar to that of the herpes family, which could therefore be modelled using these states [25]. Because it is less general, this model is seldom used.

The next is an extension of the SI model called the SIS model, which has the states Susceptible→Infected→Susceptible. This model acts like the SI model with one exception: once the infection is over, no immunity is provided, so the nodes return to being susceptible. These simulations can form a cyclic process where nodes keep becoming infectious indefinitely under some circumstances [21]. With time the numbers of susceptible and infectious nodes reach a steady state, as can be seen in Figure 2.1, where blue and green represent susceptible and infected nodes respectively. When a constant rate of infection is reached, as in this case, the disease is said to be endemic [26]. The model is suited to simple simulations of infections that do not result in long-lasting immunity and have an insignificant time of exposure, such as the common cold [6].


Figure 2.1: Steady state for SIS, where blue and green represent susceptible and infected nodes respectively [8]

The next model, SIR, is very similar but replaces the last state Susceptible with Removed. This means entities either become infected and then permanently removed, or remain susceptible, making the model noncyclical and nonstationary, unlike SIS. Use cases for SIR are similar to those for SIS, but SIR is a better fit when infection causes permanent immunity or death, as with measles. As can be seen in Figure 2.2, nodes tend to become removed, depending on the parameters and model.

Figure 2.2: Steady state for SIR [14]


Figure 2.3: Compartmental model’s transition rates[17]

Additionally, the SIR model can be extended to apply temporary immunity by allowing removed nodes to become susceptible after being immune for η time steps. Because of this last transition, it is named SIRS. This allows for a more complicated simulation, and it is therefore difficult to generalise its long-term behaviour, as it largely depends on the input.

The SEIR model is the most general of the classical models and introduces the previously mentioned Exposed compartment. This additional compartment makes it possible to model the many infections with an incubation time, during which nodes have been exposed but are not yet contagious. Using these four compartments, a rough classification of the stages of a virus can be used to create a realistic simulation. This also introduces a variable δ describing the transition rate from exposed to infected. Worth noting is that these transition rates vary depending on the model. In some cases, as mentioned in [21], β can be constructed as β̄ = βk, where β describes the infection for every effective contact and k the contact with other nodes. This text will not delve into the mathematical aspects of transition rates for the model but will provide enough background to make future work possible.

As in the SIR model, permanent immunity makes the simulation finite.

We have now covered the core of compartmental models. Other compartmental models typically use the same foundation, such as Susceptible and Infected, but may add compartments specific to the disease being modelled, such as separating Removed from Deceased, or Maternally-derived immunity for newborns protected by maternal antibodies, and so on [9] [1]. As seen with SIS, some models vary in cyclical elements, but generally the consideration remains whether nodes that were once infectious should be able to become infected again. There are countless additional ways to extend this framework, such as grouping individuals by age, gender, socioeconomic status, etc. The benefit is a more accurate model, assuming a reasonable grouping, as some groups of society are often not affected in the same way as others. However, this is outside the scope of this thesis.

As seen in Figure 2.3, the states of this model follow in order, with the transition rates β, δ, γ.
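For orientation, the SEIR transitions can be written in the standard textbook mean-field form below. This is a sketch and not the network model used later in the thesis: it assumes a well-mixed population of constant size N (a symbol introduced here) and assumes that β, δ, γ are the rates of the transitions S→E, E→I and I→R in the order listed.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N}, &
\frac{dE}{dt} &= \beta \frac{S I}{N} - \delta E, \\
\frac{dI}{dt} &= \delta E - \gamma I, &
\frac{dR}{dt} &= \gamma I .
\end{aligned}
```

The four right-hand sides sum to zero, so S + E + I + R = N stays constant, which matches the idea that nodes only move between compartments.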


Figure 2.4: Example of directed weighted graph[31]

2.3 Mathematical Graphs

As mentioned, the network representing Sweden is in this case represented by a mathematical graph, as is standard in epidemic modelling [21]. A graph is a way of relating entities to each other. The entities are called nodes, and a relation between a pair of nodes is called an edge, denoted E in the text. The nodes to which a node has relations are often referred to as its neighbours or adjacent nodes. If relationships between nodes are mutual, like a handshake, we say the graph is undirected. The opposite is called a directed graph, which in contrast is like a pat on the back. When using graphs to, e.g., figure out the shortest path to a certain node, it is appropriate to assign a weight to each edge, making it a weighted graph.

For reasons which will become obvious later, we will use a directed weighted graph. This boils down to relations in a real system typically not being mutual, and weights being an appropriate tool in the model.

Figure 2.4 shows an example of a directed weighted graph with nodes A-E and weights 1, 1, .... In epidemic modelling, each node constitutes an entity prone to propagating the virus. Normally it represents people, cities, countries, etc. In this case, it represents municipalities. The weight between nodes i and j is the probability that i will infect j once i is infected. This attempt at infection may occur numerous times, depending on the model.


2.4 Influence models for complex networks

2.4.1 Threshold model

Complex networks are, as mentioned, models common to many fields, one of which is social networks. As this setting plays a part in the development of some of the most important research on which this thesis is based, I will briefly mention the Threshold model. The Threshold model is used in networks to represent pressure from adjacent nodes to change state; a node changes state once the pressure has reached a certain threshold [15]. This model plays a big part in the modelling of social networks, as it is well suited to representing social pressure.

2.4.2 Epidemic model

In epidemic modelling, infection depends on independent probabilities for each interaction with an infectious individual, rather than on an accumulation of interactions adding up to an infection [2]. Hence, at each time step an infectious node has a probability of infecting each neighbour, and this probability is specific to each edge. The stochastic process of testing the probability is known as a coin flip [7]. If the coin flip indicates transmission, the neighbour node becomes infected and is in turn infectious in the same way for Γ time steps.
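The coin flip can be sketched in a few lines of Java. This is a minimal illustration, not the thesis code: the class `CoinFlip` and its method names are hypothetical, and the edge probabilities are passed in as a plain array.

```java
import java.util.Random;

/** Hypothetical helper illustrating the per-edge coin flip. */
final class CoinFlip {
    /** Returns true if transmission occurs along an edge with probability p in [0, 1]. */
    static boolean transmits(double p, Random rng) {
        // nextDouble() is uniform on [0, 1), so this succeeds with probability p.
        return rng.nextDouble() < p;
    }

    /** Counts how many of the given outgoing edges transmit in one time step. */
    static int countTransmissions(double[] edgeProbabilities, Random rng) {
        int hits = 0;
        for (double p : edgeProbabilities) {
            if (transmits(p, rng)) {
                hits++; // each edge gets its own independent draw
            }
        }
        return hits;
    }
}
```

Each edge is tested with its own random draw, so one infectious node may infect several neighbours, or none, in the same time step.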

2.4.3 Edge Probabilities

On a small scale, where a node represents an individual, the probability of an edge can for instance depend on the number of times the infectious individual is in contact with the susceptible individual. If we scale this up to regions, contact can be seen as travel between regions. The edge probabilities are then given by a formula that typically assigns the highest probabilities to the largest fluxes.

2.5 Migration patterns

In a real system, the cause of propagation is the rate of contact between people. Thus, there needs to be a way to quantify contact both on a local and global scale even when data is limited. For a small scale, the data of interaction between individuals is needed. On a large scale however, the interaction can be vastly simplified as migration patterns between regions.

To quantify the edges for such a model, real data on the directed migration patterns between all regions would have to be available. As this is often not the case, researchers in the area have been using two models to estimate these patterns, called the gravity model and the radiation model. Other models invented in recent years are mainly data-driven implementations of a model derived from either of the two [92].

2.5.1 Gravity model

The first model to consider is the gravity model which, like Newton's law of gravity, assumes that the flow T_ij, known as flux, of individuals between two cities i and j is proportional to the populations of i and j raised to some power, and decays as the distance r_ij between them increases [27].

\[
T_{ij} = \frac{m_i^{\alpha}\, m_j^{\beta}}{f(r_{ij})} \tag{1}
\]

The function f is chosen from empirical data, while α and β are fitted to historic traffic data. However, as mentioned in [gravity], obtaining systematic travel data between all regions is problematic, particularly in epidemic modelling. In Sweden, for example, there is no data available for traffic between regions on a large scale, whether at the level of municipalities or of cities as a whole. This calls for another way to estimate traffic, using assumptions based on other data.
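As a small illustration of Equation 1, the gravity flux can be computed as below. The power-law deterrence function f(r) = r^γ is an assumption made only for this sketch (the text leaves f to be chosen from data), and the class and parameter names are hypothetical.

```java
/** Hypothetical sketch of the gravity model with an assumed deterrence f(r) = r^gamma. */
final class GravityModel {
    /**
     * Computes T_ij = m_i^alpha * m_j^beta / f(r_ij).
     * mi, mj: populations of the two cities; r: distance between them.
     */
    static double flux(double mi, double mj, double alpha, double beta,
                       double r, double gamma) {
        if (r <= 0) {
            throw new IllegalArgumentException("distance must be positive");
        }
        double deterrence = Math.pow(r, gamma); // assumed form of f(r)
        return Math.pow(mi, alpha) * Math.pow(mj, beta) / deterrence;
    }
}
```

The three exponents are exactly the quantities that would have to be fitted to traffic data, which is the limitation the radiation model removes.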

2.5.2 Radiation model

A paper published by Filippo Simini et al. (2012) introduces a new variant of the gravity model called the radiation model [27]. They start from Eq. (1) and step by step strip away limitations, namely the need for traffic data. Instead, they assume the fluxes can be determined solely by job selection. Furthermore, it is shown that the model can be made independent of the distribution of job benefits that can impact job selection. However, the commuting distance is still considered indirectly, as are the populations of the source and destination and the distribution of people within a certain radius of the source, thought to represent the distribution of job opportunities. The model estimates the fluxes between two nodes i → j using the formula

\[
T_{ij} = T_i\, \frac{n_i n_j}{(n_i + s_{ij})(n_i + n_j + s_{ij})} \tag{2}
\]

where n_i and n_j are the populations of regions i and j respectively. The quantity s_ij is the population within the circle centred on i whose radius is the distance i → j, excluding the populations of i and j. The proportionality factor T_i ≡ Σ_{j≠i} T_ij is expressed as T_i = n_i(N_c/N), where N is the total population of the model and N_c the total number of commuters travelling out of their region, since the number of commuters leaving the origin is taken to be proportional to its population. Thus, a model is created based solely on population and its distribution, and the commuting out of regions.

Numerous improvements have been made to the radiation model to fit particular situations. One is to address the underestimations that often occur with the model, as it is derived for an infinite system but used in a finite one [18]. By normalizing T_i, however, these estimations can become more accurate. In the paper, the recommended T_i was expressed as

\[
T_i = \frac{n_i (N_c / N)}{1 - \frac{n_i}{N}} \tag{3}
\]
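Equations 2 and 3 translate directly into code. The following is a minimal sketch, assuming that s_ij has already been computed and that the denominator of Equation 3 uses the total population N; the class and method names are hypothetical and not taken from the thesis implementation.

```java
/** Hypothetical sketch of the radiation model formulas. */
final class RadiationModel {
    /** Eq. (2): T_ij = T_i * n_i n_j / ((n_i + s_ij)(n_i + n_j + s_ij)). */
    static double flux(double ti, double ni, double nj, double sij) {
        return ti * (ni * nj) / ((ni + sij) * (ni + nj + sij));
    }

    /**
     * Outflow T_i = n_i (N_c / N). When normalised is true, the finite-size
     * correction of Eq. (3) divides the result by (1 - n_i / N).
     */
    static double outflow(double ni, double commuters, double total, boolean normalised) {
        double ti = ni * (commuters / total);
        return normalised ? ti / (1.0 - ni / total) : ti;
    }
}
```

For a fixed origin, a larger intervening population s_ij lowers the flux, which is how distance enters the model indirectly.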

2.6 Maximum influence

Finding maximum influence in a social network is applicable to many phenomena occurring in such networks. It can explain the spread of political ideas, rumour spreading, and many other processes which have been studied in the social sciences [15]. It revolves around finding the k nodes with the greatest impact on other nodes in the network. In rumour spreading this can boil down to finding the persons in the network who will spread a rumour to the largest extent, both directly to people and through those people spreading it further.

In social networks, such processes are typically supported by a linear threshold model to mimic a bar for social pressure. The idea of finding the biggest influence in a network is very valuable to marketing. This is where the problem first arose, in a paper by P. Domingos and M. Richardson (2001) on marketing [5]. Unlike the conventional way of considering the ratio of marketing cost per person to their spending, they posed the idea of considering influence in social networks as a way to improve marketing further. This way, marketing can be directed towards individuals with a lot of influence, who will impose social pressure on their peers to adopt the product. In the paper, they credit the early growth of Microsoft's Hotmail email service, relative to its small marketing budget, to the service promoting itself in emails sent by its users.

The first solution to this problem was proposed by Richardson and is a greedy hill-climbing approach, meaning it systematically modifies solutions to find a good, though possibly suboptimal, solution [68]. The paper discusses its implementation in a social network with inputs consisting of marketing actions directed at individuals, such as offering discounts, information regarding who has bought the product, and a set of descriptions of the product. Potential gains from actions are expressed by a function, Expected Lift in Profits (ELP), considering the profits gained by the configurations of targeted marketing. The set of individuals with maximum ELP is sought. The approach is to start by marketing to one person at a time, measuring their influence with respect to ELP, and doing so for each individual. Then search through the respective profits generated and pick out the maximum in a greedy fashion, which will be the person with the single largest influence. Store that person in memory and combine with this solution when hill-climbing to find the person with the second largest influence.

Since small tweaks in a large system may cause very different results, finding the optimum combination can only be guaranteed by trying all combinations, making the problem NP-hard. Because of this, a proof was needed to account for the accuracy of the greedy approach. Furthermore, to be able to adopt a similar idea in other areas, a more generalized framework is required. In 2010, nine years later, a paper was published by researchers at Cornell about maximizing influence in a social network, where they provide a generalized framework inspired by Domingos' approach, along with a proof guaranteeing a solution within 63% of the optimum [15]. This proof extends to the threshold model as well as to a general version of the epidemic model called the independent cascade model.
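The 63% figure is the usual approximation guarantee for greedy maximization of a monotone submodular function. Written out, with σ(S) denoting the expected number of nodes reached from a start set S and S* an optimal set of k nodes (both symbols are notation assumed here, not defined in the thesis):

```latex
\sigma(S_{\text{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) \sigma(S^{*}) \;\approx\; 0.632\, \sigma(S^{*})
```

In words, the greedily chosen set is guaranteed to reach at least about 63% of what the best possible set of the same size would reach, provided σ is estimated accurately.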

2.7 Data structures

This section introduces the data structures used for efficient implementation, which will be applied in the method section.

2.7.1 Adjacency Matrix

The most common way to represent a graph’s connections is using an adjacency matrix.

Given a graph of N nodes, the representation consists of an N × N square matrix with rows and columns indexed from the first node's index to the last, typically [0, N − 1]. In a directed graph, the direction from→to is expressed as row→column, as seen in Figure 2.5. For an undirected graph, the matrix is symmetric.


Figure 2.5: Example of an adjacency matrix[11]

For a weighted graph, the value in a cell is the weight of the corresponding edge, whereas in an unweighted graph a binary value is used to indicate whether the edge exists or not.
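A directed weighted graph stored as an adjacency matrix can be sketched as follows. The class `WeightedDigraph` is a hypothetical illustration; it assumes that a weight of 0.0 means "no edge", which is a convention chosen for this example.

```java
/** Hypothetical adjacency-matrix representation of a directed weighted graph. */
final class WeightedDigraph {
    private final double[][] w; // w[from][to]; 0.0 is taken to mean "no edge"

    WeightedDigraph(int n) {
        w = new double[n][n];
    }

    void setEdge(int from, int to, double weight) {
        w[from][to] = weight; // row = source node, column = target node
    }

    double weight(int from, int to) {
        return w[from][to];
    }

    boolean hasEdge(int from, int to) {
        return w[from][to] != 0.0;
    }

    /** True if every edge has a mirrored edge of equal weight (undirected graph). */
    boolean isSymmetric() {
        for (int i = 0; i < w.length; i++) {
            for (int j = i + 1; j < w.length; j++) {
                if (w[i][j] != w[j][i]) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Lookup and update of a single edge take constant time, at the cost of N × N memory regardless of how many edges exist.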

2.7.2 Linked List

Linked lists are a classical data structure that also plays a part in other data structures, such as chaining in hash tables. They are used to build a linear data set where each piece of data contains a pointer to the next value. New values are attached as the next element of the formerly last one. The structure itself only keeps a pointer to the first value, known as the head. This makes it a minimalist data structure and thus a very efficient one.

Figure 2.6: Structure of a linked list[4]

Because of its linear structure with a pointer on its head, it is often applied to create a first in first out queue, also known as FIFO. In such a queue the node next in the queue is always the head, which is eventually removed by assigning the head pointer to the node’s next pointer.
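The FIFO behaviour can be demonstrated with Java's standard `java.util.LinkedList` used through the `Queue` interface. The class `FifoDemo` and its method are hypothetical names for this sketch.

```java
import java.util.LinkedList;
import java.util.Queue;

/** Hypothetical demonstration of a linked list used as a FIFO queue. */
final class FifoDemo {
    /** Enqueues the given ids, then dequeues them all and returns the removal order. */
    static int[] drain(int... ids) {
        Queue<Integer> queue = new LinkedList<>();
        for (int id : ids) {
            queue.add(id); // append at the tail
        }
        int[] order = new int[ids.length];
        for (int k = 0; !queue.isEmpty(); k++) {
            order[k] = queue.remove(); // always removes the current head
        }
        return order;
    }
}
```

Elements leave the queue in exactly the order they entered, which is the property the simulation relies on later.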


3 Method

This section is divided into three core parts of building the model: travel estimations between municipalities, the SEIR implementation, and applying the maximization algorithm.

3.1 Inputs

This section goes through the data needed to satisfy all parameters and its origin. It also briefly covers parts of the data processing used to parse the data.

3.1.1 Travel Estimations

To estimate commuting between regions, the radiation model was used. The radiation model is a parameter-free model depending only on population and distance, contrary to the traditional gravity model [27]. The model estimates the fluxes between two nodes i → j using the formula in Equation 3.

The first step in estimating the fluxes was to convert the gathered data from Excel format into .csv format to make use of its cell separation. While converting from Excel, the characters {ÅÄÖ} were temporarily swapped to {._} to avoid compilation faults and later reversed. This file was then read, assigning the respective population and commuting data to each region, represented in the program by an object Node. The next step was evaluating the radius parameter.

Determining the population within a radius can become quite tricky when the radius covers fractions of a region, not to mention varying population density. Therefore, the radiation model was implemented to only include the population of regions whose centroids, meaning geometrical centres, are within the radius. By doing so we assume uniform density, and calculations of the population in an area become vastly simplified. Figure 3.1 shows the centroids of the regions whose populations are included in s_ij marked in red, where i and j represent Stockholm and Lidingö municipality respectively.


Figure 3.1: Visualisation of the centroid approach using the radiation model

Existing geographical data for each municipality was gathered and then managed with the geographical information system QGIS [30][24]. With QGIS, the centroid positions in (latitude, longitude) for each region were obtained and then exported with their respective names to be merged in Java with the prior information.

3.1.2 Probability of propagation between nodes

As mentioned earlier, we will now estimate the probability of each city infecting its adjacent nodes. An adjacency matrix similar to the previous one is used to represent the directed probabilities between all regions, which we will now refer to as weights. The weights represent a probability between 0 and 1 and were determined using our knowledge of the fluxes E_f. Each weight E_p from region i to region j was scaled by the function

\[
E_p(i, j) = \frac{E_f(i, j)}{\max(E_f(0, 0), \ldots, E_f(n-1, n-1))} \tag{4}
\]

Simply dividing the fluxes of the edges by the greatest flux will cause some travel-dense edges to have probabilities as high as 1. To adjust for this, all edges are scaled by a factor s, which according to the literature should be set to s = 0.9. This gives us a slightly different function

\[
E_p(i, j) = s\, \frac{E_f(i, j)}{\max(E_f(0, 0), \ldots, E_f(n-1, n-1))} \tag{5}
\]

where E_p falls in the range [0, 0.9]. This probability constitutes the chance of an infected node propagating to another node at each time step during which it is infected.

The parameters needed for Equation 5 can easily be obtained by iterating through the flux matrix to find the maximum weight and then applying the formula to all edges.
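The scaling step can be sketched as two passes over the flux matrix. This is an illustration under the assumption that the fluxes are held in a plain `double[][]`; the class `EdgeProbabilities` and its method are hypothetical names.

```java
/** Hypothetical sketch of scaling fluxes into edge probabilities (Equation 5). */
final class EdgeProbabilities {
    /** Returns p[i][j] = s * flux[i][j] / max(flux), with the maximum taken over all entries. */
    static double[][] scale(double[][] flux, double s) {
        int n = flux.length;
        double max = 0.0;
        for (double[] row : flux) {          // first pass: find the largest flux
            for (double f : row) {
                max = Math.max(max, f);
            }
        }
        double[][] p = new double[n][n];
        if (max == 0.0) {
            return p;                        // no travel at all: every probability stays 0
        }
        for (int i = 0; i < n; i++) {        // second pass: apply the scaling
            for (int j = 0; j < n; j++) {
                p[i][j] = s * flux[i][j] / max;
            }
        }
        return p;
    }
}
```

With s = 0.9 the busiest edge gets probability 0.9 and every other edge is scaled proportionally.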

3.2 Implementations

With the inputs in place, the methods of the previous sections are now ready for implementation. The implementations consist mainly of the algorithms used and the QGIS system, which will now be covered.

3.2.1 Travel Estimations

Each municipality was at this point represented by a class Nodes containing the data: Municipality, Citizens, Commuters out of the region, Coordinates. The n instances were placed in an ArrayList in arbitrary order.

The next step was, for every edge i → j, to iterate over all regions to find those lying within the radius, that is, the cities within distance d_ij of i. The fluxes between regions were saved in an adjacency matrix A of size n × n, so that all indexes i, j ∈ [1, n] have a corresponding flux T_ij stored at A_ij.

The following is pseudocode of the implementation.

Algorithm 1: Calculating the fluxes
Result: Computes the fluxes for all nodes, saving them into a matrix
Input: NodeList
Fluxes = [][]
for i in NodeList do
    for j in NodeList do
        radiusPop = 0
        for r in NodeList do
            if r is neither i nor j and distance(i, r) ≤ distance(i, j) then
                radiusPop += r.population
            end
        end
        Fluxes[i.index][j.index] = flux from Equation 2 with s_ij = radiusPop
    end
end

resulting in fluxes describing travel between all regions.

As will be reviewed later, the estimations suffer from inconsistencies for short distances. Therefore, commuting data between nearby regions was searched for extensively. The very limited data found, covering short-distance travel between municipalities in the Gothenburg region, was applied [12].
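The inner loop of Algorithm 1, which gathers the population s_ij from the centroids, can be sketched as below. The thesis does not state how distances between centroids were computed, so the great-circle (haversine) distance used here is an assumption, as are the class name `RadiusPopulation`, the parallel-array layout, and the Earth radius constant.

```java
/** Hypothetical sketch of the centroid approach for s_ij. */
final class RadiusPopulation {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius, assumed

    /** Haversine distance in km between two (latitude, longitude) points given in degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1 to guard against rounding slightly above the valid asin domain.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** s_ij: population of all regions other than i and j whose centroid lies within d_ij of i. */
    static double sij(int i, int j, double[] lat, double[] lon, double[] population) {
        double radius = distanceKm(lat[i], lon[i], lat[j], lon[j]);
        double sum = 0.0;
        for (int r = 0; r < population.length; r++) {
            if (r == i || r == j) {
                continue; // the populations of i and j themselves are excluded
            }
            if (distanceKm(lat[i], lon[i], lat[r], lon[r]) <= radius) {
                sum += population[r];
            }
        }
        return sum;
    }
}
```

Because a region counts either fully or not at all, a centroid lying just outside the radius can remove a whole city from s_ij, which is the source of the short-distance errors mentioned above.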

3.2.2 The model

Using the previously defined SEIR model, the regions can now pass through the four different states in the following order: Susceptible, Exposed, Infected, Removed. The regions not abiding by this order of states are those that start as infectious, given as parameters for each run, and those that are never infected. The start nodes are the nodes from which the virus has a probability to propagate in each simulation. Other parameters used for the model are τ_e and τ_i, which determine the discrete time duration of the exposed and infected states respectively.

So let's summarize our current knowledge before moving on. The regions are represented by a List of nodes containing all independent information about each region. For each connection between regions, there is a directed probability of infection at each time step. The last-mentioned parameters (start nodes, τ_e and τ_i) are free parameters that will be varied later.

Combining this knowledge a model can be put together.

The simulation is meant to play out in discrete time until the states of all regions are constant. This means all of our nodes have to be in one of our constant states, Susceptible or Removed. Therefore, a while loop was implemented to run while the size of either of the two lists referencing Exposed and Infected nodes was greater than zero.

Since all the nodes had the same τ_e, τ_ithey could be seen as a FIFO Queue, where the node first to be exposed is first to become infected. Hence, all exposed and infected nodes’ references were stored in a linked list. In this case, it can simply check if the first in the queue is ready, if it is not the rest is not either. This reduces time complexity from O(n· t) to O(n + t) contrary to looping through all of them each time step, where t is time.

For every time step propagation is attempted by the infected nodes. To simulate it using the edges E_pwe simply use Java’s Random() function to create a unique stochastic variable ξ in range [0, 1] for each comparison and time step. This means our edges are at most 90% likely to propagate and does so if E_p(i, j)≥ ξijt.

The following pseudocode features the main parts of the implementation just discussed; undeclared variables such as time are implicit.

Algorithm 2: SEIR algorithm

Result: Runs the SEIR configurations

LinkedList exposed
LinkedList infected
while exposed or infected not empty do
    for node in infected do
        for edges from node do
            if probs of edge to j > random ξ then
                add j to exposed
                set j infectious at time + τ_e
            end
        end
    end
    while exposed.first.timeWhenInfected = time do
        remove node from exposed
        add node to infected
        set node removed at time + τ_i
    end
    while infected.first.timeWhenRemoved = time do
        remove node from infected
        set state of node to removed
    end
end
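The queue-based simulation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: edge_prob is a hypothetical adjacency matrix of edge probabilities, only susceptible nodes can be exposed (an explicit check that the pseudocode leaves implicit), and state transitions are processed before the infection attempts of each time step so that a node is infectious for exactly τ_i steps.

```python
import random
from collections import deque


def run_seir(edge_prob, start_nodes, tau_e, tau_i, rng=None):
    """Simulate one outbreak and return the final state of every node."""
    rng = rng or random.Random()
    state = ["S"] * len(edge_prob)
    exposed = deque()    # (node, time it becomes infected), oldest first
    infected = deque()   # (node, time it becomes removed), oldest first
    for s in start_nodes:
        state[s] = "I"
        infected.append((s, tau_i))
    time = 0
    while exposed or infected:
        # All nodes share tau_e and tau_i, so the queues are ordered by
        # transition time and only the heads need to be checked.
        while exposed and exposed[0][1] <= time:
            node, _ = exposed.popleft()
            state[node] = "I"
            infected.append((node, time + tau_i))
        while infected and infected[0][1] <= time:
            node, _ = infected.popleft()
            state[node] = "R"
        # Every infected node attempts to infect its susceptible neighbours.
        for node, _ in infected:
            for j, p in enumerate(edge_prob[node]):
                if state[j] == "S" and p > rng.random():
                    state[j] = "E"
                    exposed.append((j, time + tau_e))
        time += 1
    return state
```

Because each node is exposed at most once, both queues eventually drain and the loop terminates with every node either susceptible or removed.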

3.2.3 Maximization of influence

The propagation of each candidate node was calculated by adding one node at a time to the list of start nodes and running the simulation r times. A simulation is considered done when all nodes reach a steady state, which means they are either susceptible or removed. Hence, for each iteration the number of infections is calculated by iterating through the states of all nodes and counting the removed nodes. This count is then added to an array holding, at the index of the candidate node, the sum of removed nodes over all iterations. Then the states of the nodes are reset. Once the r iterations have been run for a node, it is removed from the list of start nodes so that it can be considered again later. After running this procedure for each node not yet in the list of start nodes (to avoid duplicates), a new permanent start node is determined by the maximum propagation: the array of summed removed nodes is searched for its highest value, and the node with the corresponding index is added.


Algorithm 3: Maximum influence algorithm

Result: Computes a number of k nodes with the combined largest propagation

infections = []
nodes = List<Nodes>
startNodes = List<Nodes>
for k times do
    for node in nodes do
        startNodes.add(node)
        for r times do
            runSim()
            infections[node.index] += countRemoved(nodes) - startNodes.length
            reset(nodes)
        end
        startNodes.removeLast()
    end
    startNodes.add(getMaxNode(nodes, infections))
end

As explained earlier, the algorithm runs in a greedy fashion by systematically testing which node, together with the current start nodes, produces the most infections. This continues until the number of start nodes reaches a predefined value k. For each tested node, the algorithm runs r times, summing up the total number of infections propagated over the runs. The sums are then compared, and the node with the maximum sum is chosen as the one of maximum influence. Since the process of propagation is stochastic, the value of r determines the reliability of the results.
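The greedy selection can be sketched in Python as below. This is a simplified illustration rather than the thesis code: simulate is a hypothetical callable that runs one stochastic outbreak from the given start nodes and returns the number of removed nodes, and nodes are identified by their index.

```python
def greedy_max_influence(num_nodes, k, r, simulate):
    """Greedily pick k start nodes maximising the summed spread over r runs."""
    start_nodes = []
    for _ in range(k):
        totals = {}
        for node in range(num_nodes):
            if node in start_nodes:
                continue  # skip nodes that are already permanent start nodes
            candidate = start_nodes + [node]
            # Sum the number of removed nodes over r independent runs.
            totals[node] = sum(simulate(candidate) for _ in range(r))
        # The candidate with the largest sum becomes a permanent start node.
        start_nodes.append(max(totals, key=totals.get))
    return start_nodes
```

The number of simulations is roughly k · n · r for n nodes, which is consistent with the long runtimes reported for r = 10000.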

3.2.4 Animating the data in QGIS

The onset of each of the nodes’ states was also stored in order to later be animated in QGIS, where Susceptible was considered the default state starting at discrete time = 0. In order to identify the nodes when writing timestamps to a file, each node was given the id of the corresponding feature in our Shapefile. By doing so, the table of municipalities in QGIS could easily be joined with our timestamp information. Once joined, the table takes the form of Figure 3.2.

Municipality  ID   Exposed   Infected  Removed
Stockholm     15   10050101  10100101  10250101
...           ...  ...       ...       ...


The reason for expressing our discrete time as a date of the format YYYYMMDD is a limitation of the temporal setting in QGIS. Therefore, MMDD was set to 0101 by default, with the year starting at 1000 and incrementing by one for every time step.
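The encoding described above can be sketched in Python as follows; the function names are hypothetical and the base year of 1000 is taken from the text.

```python
def timestep_to_date(step: int, base_year: int = 1000) -> str:
    """Encode a discrete time step as a YYYYMMDD string with MMDD fixed to 0101."""
    year = base_year + step
    if step < 0 or year > 9999:
        raise ValueError("time step does not fit into a four-digit year")
    return f"{year:04d}0101"


def date_to_timestep(stamp: str, base_year: int = 1000) -> int:
    """Decode a YYYYMMDD string produced by timestep_to_date."""
    return int(stamp[:4]) - base_year
```

For the Stockholm row of Figure 3.2, the stamps 10050101, 10100101 and 10250101 decode to time steps 5, 10 and 25.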

The next step was to utilize the temporal function of QGIS 3.16, where layers can be displayed between a given start and end time from the table. For Susceptible nodes, the start time was left blank and the end time was set to the onset of the exposed state. For Susceptible nodes that were never affected, i.e. those with no exposed or removed time, some manipulation with an additional layer had to be done.

Start and end points were set for the remaining states. The onset for each layer simply states when the polygon for a municipality in a certain state should reveal itself, thus providing us with an animation of the outbreak, which will be displayed later.


4 Result

Next, we will discuss the results of the thesis. First, the results of the mobility network will be reviewed, highlighting the municipalities with the largest and smallest ingoing fluxes. Second, the probabilities will be reviewed in the same manner, showing the largest and smallest sums of ingoing percentages. Finally, the results of the model will be reviewed, including the solution to the problem as well as a showcase of its time complexity.

4.1 Mobility network

First, we will cover the results of the mobility estimations and compare them with actual data. Since the reason for estimating the patterns was a lack of data, the comparisons available are limited. The comparisons will be made with the largest available data set on commuting between regions, gathered by Göteborgs Hyresgästförening to promote improvements of infrastructure in the region [12]. Below you can see the data showing the directed commuting between the thirteen municipalities of the Gothenburg region

From/To        Ale  Göteborg  Härryda  Kungsbacka  Kungälv  Lerum  Mölndal  Orust  Partille  Stenungsund  Tjörn  Varberg  Öckerö
Ale              -      7342      182          58     1261    105      582      9       169          182     28       18      11
Göteborg      1163         -     4304        3050     3236   1578    16080    158      4128          924    268      610     557
Härryda         51      8796        -         250       93    129     1878      7       537           32     11       32      12
Kungsbacka      58     14804      544           -      122     68     4199      7       215           37     14     1100      23
Kungälv        499      9432      215          81        -     49      720     39       186          812    107       30      22
Lerum          118      9806      461         114      165      -      974      6      1029           36     14       17       7
Mölndal         98     19178     1099        1443      223    120        -      9       349           76     19      158      28
Orust           23       864       17          16      207      6       72      -        10          984    313        7       3
Partille        82     11211      761         127      158    418     1154      5         -           55     12       32      18
Stenungsund    124      3507       71          21     1171     22      250    208        61            -    577       11       5
Tjörn           44      1378       20          20      354     18      138    219        39         1543      -        4       9
Varberg         11      1950       57        1010       32     17      450      2        25           21      3        -      10
Öckerö          13      2763       39          21       34     17      203      3        34            7      7        1       -

Table 4.1: Data depicting directed commuting in the Gothenburg area [12]

and the following is the data estimated by the mobility network

From/To        Ale  Göteborg  Härryda  Kungsbacka  Kungälv  Lerum  Mölndal  Orust  Partille  Stenungsund  Tjörn  Varberg  Öckerö
Ale              -       599      121          16      511   6391       24      4       280          462      6        8       4
Göteborg      1195         -     2117        3660     2894   2144     5123    414      3436          924    577     1333    1227
Härryda         19      1934        -          59       19    793     3159      3      6912           10      5       21       5
Kungsbacka      53      6454     2616           -       94     97    10841     16       117           32     20      170      31
Kungälv         36      5278       28          40        -     45       59     14        50         3051   3506       18     897
Lerum         2144      1586     1399          36       37      -      798      5      7123           19      6       18       7
Mölndal         58      9625     4216         254       94    108        -     16      9225           43     18       96      30
Orust           35       107        1           2       94      2        2      -         2          419   1753        1       7
Partille        28      3355     7709          62       45     47     3772      5         -           16      8       22      10
Stenungsund    428       278        6          10     1495     98       11    142        54            -   1303        4      15
Tjörn           82       294        2           3     3161      3        4    213         3          225      -        1      71
Varberg          8       523      177        1981       13     13      468      2        85            5      3        -       3
Öckerö           1      3314        2           4        5      2        5      0         3            1      1        1       -

Table 4.2: Data generated from the mobility network, corresponding to Table 4.1

The difference between the estimations and the data can be displayed further in a table

From/To        Ale  Göteborg  Härryda  Kungsbacka  Kungälv  Lerum  Mölndal  Orust  Partille  Stenungsund  Tjörn  Varberg  Öckerö
Ale              -      6743       61          42      750   6286      558      5       111          280     22       10       7
Göteborg        32         -     2187         610      342    566    10957    256       692            0    309      723     670
Härryda         32      6862        -         191       74    664     1281      4      6375           22      6       11       7
Kungsbacka       5      8350     2072           -       28     29     6642      9        98            5      6      930       8
Kungälv        463      4154      187          41        -      4      661     25       136         2239   3399       12     875
Lerum         2026      8220      938          78      128      -      176      1      6094           17      8        1       0
Mölndal         40      9553     3117        1189      129     12        -      7      8876           33      1       62       2
Orust           12       757       16          14      113      4       70      -         8          565   1440        6       4
Partille        54      7856     6948          65      113    371     2618      0         -           39      4       10       8
Stenungsund    304      3229       65          11      324     76      239     66         7            -    726        7      10
Tjörn           38      1084       18          17     2807     15      134      6        36         1318      -        3      62
Varberg          3      1427      120         971       19      4       18      0        60           16      0        -       7
Öckerö          12       551       37          17       29     15      198      3        31            6      6        0       -

Table 4.3: Absolute difference between estimations and data, σ = 192.
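As a small worked example of how an entry of Table 4.3 follows from Tables 4.1 and 4.2, the Python sketch below recomputes the Ale row. The helper names are hypothetical, and the relative error is included only as one possible way to put the absolute deviations into proportion.

```python
# Row "Ale" of Table 4.1 (observed) and Table 4.2 (estimated), diagonal omitted.
observed_ale = [7342, 182, 58, 1261, 105, 582, 9, 169, 182, 28, 18, 11]
estimated_ale = [599, 121, 16, 511, 6391, 24, 4, 280, 462, 6, 8, 4]


def absolute_differences(observed, estimated):
    return [abs(o - e) for o, e in zip(observed, estimated)]


def relative_errors(observed, estimated):
    # Relative to the observed value; assumes no observed value is zero.
    return [abs(o - e) / o for o, e in zip(observed, estimated)]


diff_ale = absolute_differences(observed_ale, estimated_ale)
print(diff_ale)  # [6743, 61, 42, 750, 6286, 558, 5, 111, 280, 22, 10, 7]
```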

To give a brief overview of the results without including all municipalities, it is deemed a good strategy to explore the most extreme values. To gain information on the impact the mobility network has on each node, the choice was between analyzing the outgoing or the ingoing fluxes of each node. In order to study the susceptibility of nodes, the sum of ingoing fluxes was chosen, as it is more explanatory of infections reaching a node.

Municipalities Ingoing fluxes

Arjeplog 157

Åsele 232

Pajala 329

Sorsele 361

Jokkmokk 459

Dorotea 446

Överkalix 478

Bjurholm 528

Storuman 602

Övertorneå 620

Table 4.4: The ten municipalities with least ingoing fluxes and their sum of ingoing travel

Stockholm 187896

Göteborg 123266

Solna 61573

Huddinge 60767

Sollentuna 56045

Nacka 54583

Malmö 48439

Järfälla 47290

Sundbyberg 41475

Mölndal 40171

Table 4.5: The ten municipalities with the largest ingoing fluxes and their sum of ingoing travel


4.2 Edge probabilities

As the edge probabilities are a scaled ratio given by the mobility network, considering the least infectious edges does not provide us with much information. However, by summing up all ingoing edges of each node, it becomes apparent where the scaling falls short and hence which nodes practically never become infected. To make a transparent comparison later on, the opposite extremes will also be listed, as these irregularities are more easily observed. Note that the sum of E_p cannot be seen as the chance that node j becomes infected in a time step given that all its neighbors are infected, as this follows a more complicated distribution and is not within the scope of this comparison.
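To illustrate why the summed probabilities can exceed 100% without being a probability themselves, the following Python sketch uses two hypothetical helpers. The first assumes that edge probabilities are obtained by scaling the fluxes linearly so that the largest flux maps to 90%, an assumption consistent with the 90% upper bound mentioned earlier but not confirmed as the exact scheme used. The second gives the chance that at least one infected neighbor transmits in a single time step, assuming the edges act independently.

```python
from math import prod


def scale_to_probabilities(fluxes, p_max=0.9):
    """Linearly scale a flux matrix so that the largest flux maps to p_max (assumed scheme)."""
    largest = max(max(row) for row in fluxes)
    return [[p_max * f / largest for f in row] for row in fluxes]


def infection_probability(incoming_probs):
    """P(at least one infected neighbor transmits in one step), assuming independent edges."""
    return 1.0 - prod(1.0 - p for p in incoming_probs)


# Two ingoing edges of 50% each sum to 100%, yet the per-step
# chance of infection is only 75%.
print(infection_probability([0.5, 0.5]))  # 0.75
```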

Arjeplog 0.19%

Åsele 0.28%

Pajala 0.39%

Sorsele 0.43%

Jokkmokk 0.55%

Dorotea 0.53%

Överkalix 0.57%

Bjurholm 0.62%

Storuman 0.72%

Övertorneå 0.74%

Table 4.6: The municipalities with the least ingoing fluxes and their sum of ingoing probabilities

Municipalities Ingoing probabilities

Stockholm 223.36%

Göteborg 146.53%

Solna 73.20%

Huddinge 72.23%

Sollentuna 66.62%

Nacka 64.89%

Malmö 57.58%

Järfälla 56.21%

Sundbyberg 49.30%

Mölndal 47.75%

Table 4.7: The municipalities with largest ingoing fluxes and their sum of ingoing probabilities

4.3 Model

When animating the spread for different cases, it became apparent that the infections varied largely depending on where in Sweden the virus was artificially deployed. Therefore, I decided to simulate each region as the only start node and record its respective average spread of infection, hence possible decimal values. This resulted in the following heatmap

Figure 4.1: Spread of infection for each region selected as start node, r = 10000, k = 1

Finally, to answer the question posed by the thesis, the model was run at r = 10000 and k = 10, where ten was the number of regions chosen to represent those most likely to lead to an outbreak. The parameters τ_e and τ_i were chosen arbitrarily within the ranges suggested in the literature, [3, 5] and [5, 15] respectively, to fit the most common diseases. Since the simulations take a lot of time, I was limited to studying only the changes in results when varying τ_i, setting τ_e = 5.


Order/τ_i  5            10            15
1          Stockholm    Stockholm     Stockholm
2          Göteborg     Malmö         Malmö
3          Malmö        Göteborg      Kungsbacka
4          Uppsala      Örebro        Karlstad
5          Haninge      Helsingborg   Kristianstad
6          Helsingborg  Vänersborg    Timrå
7          Örebro       Kristianstad  Vänersborg
8          Österåker    Hammarö       Falun
9          Norrköping   Gävle         Mörbylånga
10         Vänersborg   Värmdö        Boden
Runtime    7 hours      30 hours      65 hours

Table 4.8: The most dangerous regions for different infection durations, with k = 10, r = 10000, τ_e = 5 and τ_i = 5, 10, 15.

These results, put into a visualization, look like this

Figure 4.2: Superspreading municipalities with τ_i = 5, 10, 15 from left to right.

To study the time complexity of the simulations from Table 4.8 and how the runtime varied with ascending values of τ_i, a plot was constructed to fit the limited data. The trendline formed by the three data points is a second-degree polynomial function

Figure 4.3: Trendline fitting the runtime with equation 0.24τ_i² + τ_i − 4.
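The trendline can be reproduced from the three runtimes in Table 4.8 with the short Python sketch below (the function name is hypothetical). Note that three points always determine a second-degree polynomial exactly, so the perfect fit does not by itself confirm quadratic growth; more data points would be needed for that.

```python
from fractions import Fraction


def quadratic_through(points):
    """Exact coefficients (a, b, c) of a*x^2 + b*x + c through three points."""
    (x0, y0), (x1, y1), (x2, y2) = [(Fraction(x), Fraction(y)) for x, y in points]
    # Newton divided differences
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    a = (d12 - d01) / (x2 - x0)
    b = d01 - a * (x0 + x1)
    c = y0 - a * x0 ** 2 - b * x0
    return a, b, c


# (τ_i, runtime in hours) from Table 4.8
a, b, c = quadratic_through([(5, 7), (10, 30), (15, 65)])
print(float(a), float(b), float(c))  # 0.24 1.0 -4.0
```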


5 Conclusions

By using the mobility network, we managed to create an estimator of the fluxes between all regions that requires only small amounts of data per region. The network was validated against a sample of the data in the Gothenburg region. The distances compared were relatively short and the populations considered normal relative to the rest of the model, hence the vast majority of edges, which cover longer distances, are expected to perform even better.

To showcase the extremes of the estimation, both the ingoing fluxes and their percentages were calculated, to be further discussed as a possible cause of some regions’ low infection rates. Unsurprisingly, the resulting list features the same nodes as those with the lowest ingoing mobility, in the same order. This is no coincidence, considering that the edge probabilities depend solely on the ratio between the flux and the maximum flux, times a factor.

However, the estimations also had great successes, with several fluxes predicted exactly. Typically, these estimations were made for the larger distances in the table. Considering the small scale of the sample, these types of outliers are expected to diminish vastly on a larger scale due to the implementation.

Furthermore, the model showed some indication of its correctness with regard to the mobility network. Although this can be argued for the first node based on our results in Tables 4.5 and 4.7, the other nodes are selected depending on how ”well” they work together with the already selected ones. So even though municipalities near Stockholm such as Solna, Sollentuna and Sundbyberg made it to the top ten for ingoing fluxes, they did not result in as many infections as other combinations when Stockholm was already selected.

Finally, the results agree with existing findings that large population centers are ideal for propagation.

5.1 Discussion

As previously mentioned, the radiation model suffers at short distances. Furthermore, the simplification of using centroids may also suffer at short distances, mainly since the binary classification of whether or not to include a city’s population in s_ij can impact the estimation vastly. This happens when the radius falls an insignificant distance short of including another centroid with a large population, or just barely includes it, as opposed to taking the population actually within the area. However, on a larger scale these rough estimations make sense, as corner cases do not carry the same significance to s_ij. This seems to be the case for this sample as well when comparing the distance between the greatest deviation in fluxes to their relative big difference when scaling. For instance, an undershoot by 4 commuters when the actual value is 8 may be a great estimation, and here that is the case. But if the same 50% accuracy persists when scaling up the model, it may not be as enticing. Therefore, it is important to explore the shortcomings and successes on a case-by-case basis in the table. In my opinion, the overall results performed incredibly well considering the parameters and model used, with the exceptions of Ale→Göteborg (7342 actual, 599 predicted), Ale→Lerum (105 actual, 6391 predicted) and Mölndal→Partille (349 actual, 9225 predicted), out of 156 fluxes.

During the simulations, it became obvious that some regions were practically never infected. Thus, I decided to investigate this by looking into the edges and their fluxes as represented in the tables. To approach this one step at a time, I had to create the heatmap, notice the irregularity, examine the probabilities, and then make sure that they correlated with the mobility, which they did. Since the real data reviewed correlated with the model, one can assume the method for determining edge probabilities is at fault. Although a real-world scenario such as COVID-19 contains more factors such as air travel, etc., it is safe to assume that an origin in one of the nodes in the north of Sweden would result in more than that origin node becoming infected.

The final results answering the thesis question, best visualized in Figure 4.2, show an unexpected variation of nodes depending on τ_i. They seem to indicate a greater variation as the infection time increases. This could in simple terms be explained by larger infection times covering a larger radius around the infected node. So if a disease were to begin in Stockholm at a relatively high τ_i, we can be relatively certain it will spread throughout the nearby highly populated municipalities. For a lower value of τ_i, on the other hand, an infection starting in Stockholm might not spread to all nearby highly populated areas, hence a lot of susceptible individuals still exist for another nearby node to be selected and infect. In the first instance, where τ_i = 5, this seems to be the case, as many of the selected nodes are close to Stockholm, Gothenburg and Malmö. In the second part of the figure, the infections begin to spread to municipalities deemed independently less infectious according to Figure 4.1, such as Hammarö and Kristianstad. Finally, for τ_i = 15 we see the same pattern, where even a northern municipality, Boden, becomes a start node despite falling into the second least infectious category.

It was shown in Figure 4.3 that different values of τ_i had a large impact on the runtime of the simulations. Most likely this is due not only to the increased time for the infectious nodes, but also to the cascade effect an increased infection time entails, as it provides further iterations in which infectious nodes can potentially infect.


5.2 Future Work

As discussed, the greatest potential fault lies in the edge probabilities. Therefore, the most beneficial extension of this work would be to implement the various formulas in the literature for scaling E_p until a heatmap is produced leading to a more reasonable infection rate for the currently least infectious nodes.

To improve the mobility network further one would have to reconsider the centroid simplification. For instance, it might be possible to use QGIS to assume uniform population distribution, as we did here, and calculate the area of municipalities included in the radius, then by using the uniform population density get a more accurate estimation of s_ij. It could also be beneficial to replace the radiation model with the gravity model in close distances where the radiation model lacks, supposing data for the parameters is sufficient.

Further improvement of the SEIR implementation lies in efficiency. Although various data structures were used to limit the growth in running time of the algorithms listed, a few more cumbersome options for optimization remain. One is to sort the edge probabilities for each row in the adjacency matrix while keeping track of the direction, then produce a random number once for each row and iterate in ascending order until ξ > E_p, similar to the infection queue.

Assuming a good model, an interesting path would be to consider the impact a lockdown imposes on different municipalities. This could be done either by locking down the most dangerous nodes produced and deploying the virus in arbitrary nodes, perhaps even using common mitigation rates during lockdowns, and then comparing the differences, or, in an equally greedy fashion to the maximization problem, by assuming the most dangerous node is in lockdown and finding the remaining nodes using the maximization algorithm.


References

[1] Bailey, Norman T. J. The mathematical theory of infectious diseases and its applications. 2nd ed. London: Griffin, 1975.

[2] Bóta, András and Gardner, Lauren. “A generalized framework for the estimation of edge infection probabilities”. In: Draft available at: http://arxiv.org/abs/1706.07532 (June 2017).

[3] Chaudhry, Rabail et al. “A country level analysis measuring the impact of government actions, country preparedness and socioeconomic factors on COVID-19 mortality and related health outcomes”. In: EClinicalMedicine 25 (Aug. 2020), p. 100464. DOI: 10.1016/j.eclinm.2020.100464. URL: https://doi.org/10.1016/j.eclinm.2020.100464.

[4] Data Structures Algorithms in JavaScript (Single Linked List) Part 1. https://dev.to/swarup260/data-structures-algorithms-in-javascript-single-linked-list-part-1-3ghg. Accessed on 2021-04-08. Nov. 2019.

[5] Domingos, Pedro and Richardson, Matt. “Mining the network value of customers”. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’01. ACM Press, 2001. DOI: 10.1145/502512.502525. URL: https://doi.org/10.1145/502512.502525.

[6] Dong, Wen, Heller, Katherine, and Pentland, Alex (Sandy). “Modeling Infection with Multi-agent Dynamics”. In: Social Computing, Behavioral-Cultural Modeling and Prediction. Ed. by Shanchieh Jay Yang, Ariel M. Greenberg, and Mica Endsley. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 172–179.

[7] Easley, David and Kleinberg, Jon. Networks, Crowds, and Markets. Cambridge University Press, July 2010. DOI: 10.1017/cbo9780511761942. URL: https://doi.org/10.1017/cbo9780511761942.

[8] Fiorillo, Luca et al. “Virtual reality and massive multiplayer online role-playing games as possible prophylaxis mathematical model: focus on COVID-19 spreading”. In: Epidemiologic Methods 9.s1 (2020). DOI: 10.1515/em-2020-0003. URL: https://doi.org/10.1515/em-2020-0003.

[9] Fouchet, David et al. “The role of maternal antibodies in the emergence of severe disease as a result of fragmentation”. In: Journal of The Royal Society Interface 4.14 (Dec. 2006), pp. 479–489. DOI: 10.1098/rsif.2006.0189. URL: https://doi.org/10.1098/rsif.2006.0189.


[10] Gates, Bill. In: (2015). URL: https://www.ted.com/talks/bill_gates_the_next_outbreak_we_re_not_ready?language=en.

[11] Graphs. https://guides.codepath.com/compsci/Graphs. Accessed on 2021-04-08. Aug. 2018.

[12] hurvibor.se. Pendling präglar kranskommunerna. https://hurvibor.se/wp-content/uploads/O_bef_pendling_2008.pdf. Accessed on 2021-02-22. Mar. 2021.

[13] Hethcote, H. W. The mathematics of infectious diseases. Vol. 42. Society for Industrial and Applied Mathematics, 2000, pp. 599–653.

[14] Keller, Klaus-Dieter. SIR-Model. https://en.wikipedia.org/wiki/File:SIR-Modell.svg. Accessed on 2021-03-25. Mar. 2019.

[15] Kempe, David, Kleinberg, Jon, and Tardos, Éva. “Maximizing the spread of influence through a social network”. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’03. ACM Press, 2003. DOI: 10.1145/956750.956769. URL: https://doi.org/10.1145/956750.956769.

[16] Kermack, W. O. and McKendrick, A. G. “Contributions to the mathematical theory of epidemics—I”. In: Bulletin of Mathematical Biology 53.1-2 (Mar. 1991), pp. 33–55. DOI: 10.1007/bf02464423. URL: https://doi.org/10.1007/bf02464423.

[17] Mahmud, Aidalina and Lim, Poh Ying. “Applying the SEIR Model in Forecasting The COVID-19 Trend in Malaysia: A Preliminary Study”. In: medRxiv (2020). DOI: 10.1101/2020.04.14.20065607. eprint: https://www.medrxiv.org/content/early/2020/04/17/2020.04.14.20065607.full.pdf. URL: https://www.medrxiv.org/content/early/2020/04/17/2020.04.14.20065607.

[18] Masucci, A. Paolo et al. “Gravity versus radiation models: On the importance of scale and heterogeneity in commuting flows”. In: Physical Review E 88.2 (Aug. 2013). DOI: 10.1103/physreve.88.022812. URL: https://doi.org/10.1103/physreve.88.022812.

[19] Matua, Gerald Amandu, Wal, Dirk Mostert Van der, and Locsin, Rozzano C. “Ebola hemorrhagic fever outbreaks: strategies for effective epidemic management, containment and control”. en. In: Brazilian Journal of Infectious Diseases 19 (June 2015), pp. 308–313. ISSN: 1413-8670. URL: http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1413-86702015000300308&nrm=iso.

[20] Mouritz, A. “The flu”. In: (1921). URL: https://www.gutenberg.org/ebooks/61607.
