
Temporal Convolutional Networks for Forecasting Patient Volumes in Digital Healthcare

JONATHAN BERGLIND

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Computer Science
Date: June 27, 2019
Supervisor: Johan Gustavsson
Examiner: Olof Bälter
School of Electrical Engineering and Computer Science
Host company: KRY International AB
Swedish title: Temporala faltningsnätverk för förutsägelse av patientmängder inom digital vård


Abstract

Patient volume forecasting is an important tool for staffing clinicians to meet patient demands. In traditional care, the problem has been studied by multiple authors with inconclusive results. Recent advances in using recurrent and convolutional models in the neighbouring area of sequence modeling have not yet been replicated in the area of patient volume forecasting in traditional healthcare. In the growing area of digital care, only one study has attempted the problem to date.

In this study, a Long Short-Term Memory Network (LSTM) and a Temporal Convolutional Network (TCN) were implemented and fit in a one-step forecasting problem using historical hourly patient volumes of a digital caregiver, both with and without explicit weekday annotations. The models were evaluated in one-step and multi-step forecasting with a horizon of up to 168 time steps (1 week), and compared to statistical baseline models.

In the one-step forecasting evaluation the univariate TCN achieved a Mean Squared Error (MSE) of 93.4 ± 2.4, outperforming the univariate LSTM (122.2 ± 5.9 MSE) and all baseline models (best: 193 MSE). In the 168-step forecasting evaluation, the univariate TCN achieved a mean MSE (MMSE) for each step in the forecasted horizon of 143.2 ± 5.5, outperforming the LSTM (261.5 ± 63.0 MMSE) and baseline models (best: 195.8 MMSE). The performance of the LSTM and TCN models was shown to deteriorate for each step ahead in the multi-step forecasts, the LSTM at a faster rate than the TCN. The results indicated that the models learned to approximate the seasonality of the dataset, but when the data deviated, the accuracy of all models worsened. The use of multivariate data lowered the errors slightly. The computational performance of the TCN, attributed to its parallelizable architecture, was shown to be a major advantage over the LSTM.

It was concluded that the TCN is a promising alternative to the LSTM in the context of the specific problem, both in terms of accuracy and usability, but that more studies are needed to say anything about the general problem of patient volume forecasting in digital healthcare.


Sammanfattning

An important tool for staffing in healthcare is forecasts of future patient volumes. Several studies have examined the problem in traditional care, with conflicting results. Recent advances in the neighbouring area of sequence modeling have not yet been replicated in the area of patient volume forecasting in traditional care. In the growing area of digital care, only one previous study has attempted the problem to date.

In this study, two models, a Long Short-Term Memory network (LSTM) and a Temporal Convolutional Network (TCN), were evaluated in forecasting patient volumes using historical hourly patient counts from a digital caregiver, both with and without explicit weekday annotations. The models were trained on the one-step problem and evaluated in both one-step and multi-step problems with a horizon of up to 168 time steps (1 week). The results were compared with those of standard statistical models.

In the one-step problem, the univariate TCN model achieved a Mean Squared Error (MSE) of 93.4 ± 2.4, which was better than the LSTM model (122.2 ± 5.9 MSE) and all tested standard models (best: 193 MSE). In the multi-step problem with a horizon of 168 time steps, the TCN model achieved a mean MSE (MMSE) over the steps of the forecast of 143.2 ± 5.5, which was better than the LSTM model's error of 261.5 ± 63.0 MMSE and the standard models (best: 195.8 MMSE). The performance of the LSTM and TCN models was shown to deteriorate for each step ahead in the multi-step forecasts, at a faster rate for the LSTM model. The results indicated that the models had learned to approximate seasonal patterns in the data, but when the data deviated from the norm, the results of all models worsened. Using data with explicit weekday annotations lowered the errors slightly. The computational performance of the TCN model, attributed to its parallelizable architecture, was shown to be a major advantage over the LSTM model.

The TCN model is a promising alternative to the LSTM model in the context of the specific problem, both in terms of accuracy and usability, but more studies are needed before conclusions can be drawn about the more general problem of forecasting patient volumes in digital care.

Contents

1 Introduction
  1.1 Research Question
  1.2 The Principal
  1.3 Limitations
2 Background
  2.1 Introduction to Research Areas
    2.1.1 Sequence Modeling and Time Series Forecasting
    2.1.2 Patient Volume Forecasting
  2.2 Theory
    2.2.1 Artificial Neural Networks
    2.2.2 Long Short-Term Memory (LSTM)
    2.2.3 Temporal Convolutional Networks (TCN)
  2.3 Related Work
    2.3.1 Patient Volume Forecasting in Traditional Care
    2.3.2 Patient Volume Forecasting in Digital Care
  2.4 Summary
3 Methodology
  3.1 Data
  3.2 Models
    3.2.1 LSTM
    3.2.2 TCN
    3.2.3 Baseline Models
  3.3 Model Fitting
  3.4 Hyperparameter Optimization
  3.5 Evaluation
4 Results
  4.1 Hyperparameter Optimization
    4.1.1 LSTM
    4.1.2 TCN
  4.2 Model Fitting
  4.3 One-step Forecasting
  4.4 Multi-step Forecasting
5 Discussion
  5.1 Model Comparison
    5.1.1 One-step Forecasting
    5.1.2 Multi-step Forecasting
    5.1.3 Interpretability and Usage
  5.2 Univariate vs. Multivariate Data
  5.3 Methodology and Limitations
    5.3.1 Hyperparameter Optimization
    5.3.2 Evaluation
    5.3.3 Data
  5.4 Ethics, Sustainability and Social Aspects
  5.5 Future Research
    5.5.1 Multi-output Models
    5.5.2 Use of Non-seasonal Multivariate Data
6 Conclusions

1 Introduction

Digital healthcare in the form of online clinicians has seen a lot of growth in recent years, with many new caregivers appearing [1, 2, 3]. These services allow patients to meet with clinicians, such as doctors and psychologists, via video calls or text chats to get help with and treat symptoms that do not require a physical examination [1].

As more people start to use online clinician services, it becomes increasingly important to staff the right number of clinicians each day, in order to keep up with the patient demand. To staff clinicians ahead of time, forecasts that predict the future patient volume can be used. The better the accuracy of a forecast, the easier it will be to staff the right number of clinicians to meet demands precisely. Optimal staffing makes efficient use of clinicians and gives them an opportunity to treat more patients, while sub-optimal staffing might mean that clinicians that could have been of help elsewhere sit idle, or that patients do not get the help they need. The true patient demand may be affected by unknown events, which makes the problem of forecasting non-trivial.

Patient volume forecasting in the context of traditional healthcare has been studied by multiple authors to date [4, 5]. Results from studies in traditional healthcare could be, but are not necessarily, applicable to digital care. Reasons for this could for example be that a digital caregiver can operate at a larger scale than a physical one, or that the patients come from larger geographical areas. To date, only one study has attempted the problem in the context of digital care.

A forecast can be based on different kinds of data, but often includes historical time series data of some kind, for example past values of patient demand. The research areas of Sequence Modeling and Time Series Forecasting have in recent years been dominated by learning models of the type Artificial Neural Networks (ANN). ANNs of the recurrent type, such as the Long Short-Term Memory Network (LSTM), have long been regarded as the natural starting point for sequence modeling problems [6, 7]. However, recent results of a convolutional type of network known as a Temporal Convolutional Network (TCN) indicate that this association should be reconsidered [8, 7].

The research area of patient volume forecasting has not yet taken advantage of the advances in the areas of sequence modeling and time series forecasting. In the traditional context, the area is still dominated by statistical models, and to date, no study has evaluated recurrent or convolutional networks. In the context of digital healthcare, a single study published last year evaluated an LSTM model in one-step forecasting [9] and showed promising results. However, there is a general need for more studies in the area that explore the use of new models.

The aim of this study is to evaluate if the recent results of the TCN in sequence modeling tasks can be replicated in the task of one-step and multi-step forecasting of patient volumes in the context of digital care, and if the TCN model is more suitable for the task than the LSTM.

1.1 Research Question

Are Temporal Convolutional Networks more suitable than Long Short-Term Memory networks, in terms of forecasting errors and usability, for forecasting patient volumes in digital healthcare using historical time series data?

1.2 The Principal

The principal for this thesis is the digital healthcare provider KRY (www.kry.se/en/). KRY is active in Europe and has been helping patients via online clinicians since 2014. Staffing the right number of clinicians to meet patient demands each day becomes increasingly important as the company grows. Forecasting future patient volumes is therefore in their interest, to keep waiting times low and to help all patients seeking care.


1.3 Limitations

Models have been implemented, fit and evaluated using real historical data of patient volumes from a digital caregiver. All data used comes from a single digital caregiver and from a single market, the Swedish one. This fact limits the generality of any conclusions drawn. A further limitation is the lack of previous work within the digital context of the project area to compare with.

A final limitation is that the data used in this study represents the historical patient volumes of a caregiver and not the historical patient demands. Although this is in line with previous studies, it should be noted that patient demand forecasting may be of greater use for staffing of clinicians than patient volume forecasting. However, the collection of historical patient demands is regarded as more complex than the collection of patient volumes, which is the reason for this limitation.

2 Background

2.1 Introduction to Research Areas

2.1.1 Sequence Modeling and Time Series Forecasting

The research area of Sequence Modeling (also: Sequence-to-sequence Learning) concerns a class of supervised learning problems in which an input sequence is mapped to an output sequence. What differentiates it from other supervised learning problems is that the length of the input sequence can vary [10]. Additionally, the lengths of the input and output sequences may differ [6].

Data in the form of sequences appears naturally, making it a common type of problem and an active research area. Classic examples of sequence modeling problems are machine translation, where a sequence of words is to be translated into a sequence of words in another language, or speech recognition, where an audio sequence is to be mapped to words [6].

The research area of sequence modeling is currently dominated by machine learning models of the type Artificial Neural Networks. The class of ANNs known as Recurrent Networks has long been synonymous with sequence modeling [6] but has in recent years seen competition from Convolutional Networks [7, 6]. These concepts are surveyed in sections 2.2.2 and 2.2.3.

Time Series Forecasting

A Time Series is a special type of sequence that indexes data points by temporal order, often with equally spaced intervals between each point [11]. Time series data occurs naturally when observations have been collected over time, such as by a sensor.

A common time series problem is to try to extend the series by forecasting values of future timesteps, known as Time Series Forecasting. Time series forecasting is a form of extrapolation and thus relies on the assumption that the future will be like the past, an assumption that does not always hold [11], especially for forecasts with longer horizons.

2.1.2 Patient Volume Forecasting

The Patient Volume is a measure of how many patients are visiting a healthcare provider at a specific point in time. The patient volume may vary over time and may be affected by a large number of parameters that could vary between different healthcare providers.

Patient Volume Forecasting (also: Patient Visit Forecasting, Patient Admission Forecasting) is the problem of forecasting the future patient volume of a healthcare provider. For that purpose, the historical patient volume, usually represented as a time series, is often used [4, 5], making it a time series forecasting problem. The uncertainty of the future and of what factors affect the volume makes it a non-trivial problem, as shown by the great number of studies published (reviewed in [4, 5]). If done with accuracy, the forecasted patient volume could be used when staffing ahead of time, to meet the demands precisely, making better use of coveted resources like clinicians.

Due to the global usefulness of a solution to the problem of patient volume forecasting, the area has been explored in a number of studies over the years. Section 2.3.1 surveys the area further.

Patient Volume Forecasting in Digital Healthcare

Digital Healthcare in the form of online doctors is a growing area. In Europe, many new caregivers have become available over the last years. The services allow patients to meet with clinicians, such as doctors and psychologists, via video calls or text chats to get help with and treat symptoms that do not require a physical examination. [1, 2, 3]

Digital care is different from physical care in the way that it does not have to be geographically bound to physical clinics. This makes its receptive field of patients larger and removes constraints on the number of patients that can be received. The limitation of only treating symptoms that do not require a physical examination possibly affects the predictability of forecasts, and the possibility for clinicians to work part-time from anywhere could make it possible to adjust staffing to forecasts more quickly than in traditional care.

The research area of patient volume forecasting within digital healthcare is relatively new and has to date not received much scrutiny from academia. Results from the neighbouring area focusing on traditional healthcare are not necessarily applicable and need to be verified in this new domain. Results to date are surveyed in section 2.3.2.

2.2 Theory

2.2.1 Artificial Neural Networks

An Artificial Neural Network is a computational model for supervised learning. It consists of computational units called neurons, connected together in a network, similar to how the human brain consists of neurons and synapses [12].

Each neuron in an ANN may have multiple inputs and produces a single output by multiplying each input with a weight, summing them together with a bias value, and then applying an activation function that introduces non-linearity and squashes the result into a smaller range [13, 14].
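In symbols, for inputs $x_1, \dots, x_m$ with weights $w_1, \dots, w_m$, bias $b$ and activation function $\phi$, the output of a single neuron is

$$y = \phi\left(\sum_{i=1}^{m} w_i x_i + b\right).$$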

Figure 2.1 shows a simple ANN called a Multilayer Perceptron (MLP) with two inputs, six neurons arranged in two layers, finished by a single output neuron.


Figure 2.1: Multilayer Perceptron with two inputs, two hidden layers of three neurons each and one output neuron.

ANNs are commonly fit using a process called backpropagation that works by repeatedly calculating the results of the training samples and adjusting the weights and biases of each neuron to minimize the error over all training samples [13, 14]. Forwarding a new sample through the network after fitting yields an approximation of what the true function that generated the sample would produce.

ANNs are universal function approximators that can learn to approximate complex functions by fitting on a large number of samples [14]. This flexibility allows them to be used for many different problems, which has motivated multiple studies and led to advances within their respective fields.

2.2.2 Long Short-Term Memory (LSTM)

A Recurrent Neural Network (RNN) is an ANN with a recurrent structure where the outputs of the network are fed back to it as inputs. This structure introduces the concept of time by creating an internal memory in the network [13]. In theory, each output of the network is based on the current input together with all previously observed inputs [7]. This is in contrast to Feed Forward Networks like the MLP, where the state is lost between samples. The recurrent structure makes them especially suitable for processing sequences of inputs with dependencies between them, such as time series data.

Figure 2.2 shows a simple RNN in two different representations. As can be seen in the unfolded representation, the depth of an RNN is directly related to the length of the sequence of inputs, which can become a problem during training. Backpropagation determines how to adjust each weight by calculating the partial derivatives of the loss function with respect to the weight. For a specific neuron, this expression depends on the neurons it is connected to in later layers. The Vanishing Gradient Problem commonly appears in the early layers of deep networks, where the expression is a long multiplicative chain of values. When the gradient goes to zero, or vanishes, updates to the network cease and training stops [13, 14]. The depth of an RNN together with the sharing of weights across timesteps further amplifies this problem [13].



Figure 2.2: Recurrent Neural Network in two different representations.

The Long Short-Term Memory Network (LSTM) [15] is a recurrent network designed for solving the problem of vanishing gradients. It replaces the recurrent neurons of an RNN with memory cells. A memory cell contains a memory node, a neuron with an unweighted connection to itself, and gates that learn to open and close access to the memory node [15, 13].

A forget gate controls how much of the memory in the memory node should be forgotten. An input gate controls how much of the training sample and the output of the previous timestep gets added to the memory, and an output gate controls how much of the memory becomes the output of the memory cell. The unweighted self-connection of the memory node lets the error pass through it to earlier layers during the backpropagation process, which mitigates the problem of vanishing gradients [13, 15]. Figure 2.3 shows a single timestep of an LSTM model.


Figure 2.3: Single timestep of an LSTM model.
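For reference, a common formulation of the gated update sketched in Figure 2.3, where $\sigma$ denotes the sigmoid function, $\odot$ element-wise multiplication, $x_t$ the input, $h_{t-1}$ the output of the previous timestep and $c_{t-1}$ the memory from the previous timestep (the weight matrices $W$, $U$ and biases $b$ are learned), is:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(output)}
\end{aligned}
$$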

The LSTM architecture has defined the state-of-the-art in classic sequence modeling problems such as speech recognition [16] and machine translation [10]. At the time of this study, RNNs are regarded as the starting point for sequence modeling tasks [6, 7], and the LSTM architecture is one of the most widely used RNNs [7].

The wide adoption of the LSTM architecture has led to many versions extended from the original LSTM first introduced in 1997. Forget gates were not part of the original design but are today included in most implementations. In a large study from 2016 by Greff et al. [17], different extensions to the original design were evaluated in three classic sequence modelling problems. It was found that none of the extensions improved significantly over the original design, but that forget gates and output activations were the most critical extensions.

A notable drawback of the LSTM and RNNs in general is that their sequential structure makes them hard to parallelize, since the output for a certain timestep depends on the output of previous timesteps.

2.2.3 Temporal Convolutional Networks (TCN)

A Convolutional Neural Network (CNN) is an ANN that includes convolutional layers to make use of spatial relationships in the input data. A convolutional layer consists of a number of learnable filters or kernels of a given size. The filters are made up of neurons and are convolved with the input sample to produce feature maps that contain features extracted from the sample [18].

For example, in 2D images neighbouring pixels are likely correlated. A filter in a convolutional layer applied to an image might learn to detect certain types of edges in the image, and its feature map would then indicate the existence of an edge in a certain spatial region of the image. By stacking multiple convolutional layers, the filters can learn to extract higher-order features from the data. Figure 2.4 shows an example of a single convolutional layer applied to a two-dimensional sample.


Figure 2.4: Single convolutional layer applied to a two-dimensional sample.

A Temporal Convolutional Network (TCN) is a convolutional network designed for working with sequential data. In a normal CNN, the filters are convolved with a single input at a time. In a TCN, the filters have a temporal dimension and are convolved with a sequence of inputs instead. The convolutions of a TCN are causal, meaning that a sample for a timestep is only convolved with samples from earlier timesteps, preventing leakage of information from the future into the past [8, 7].

The receptive field of a sequence model denotes the number of sequential samples that are used to compute an output value. To increase the receptive field of a TCN model, convolutional layers are stacked. Stacking many layers leads to deep models, which can make the model harder to train. The TCN architecture achieves large receptive fields without stacking too many layers by using dilated convolutions. A dilated convolution is a convolution that skips inputs at regular intervals; by stacking dilated convolutional layers with an exponentially increasing dilation rate, it is possible to get an exponentially increasing receptive field while still hitting every input [8, 7].


Figure 2.5 shows a simplified TCN model with three stacked layers of dilated causal convolutions. The dilation factor is increased by a factor of 2 for each level. The model shown in the figure has a receptive field of 8 timesteps.
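For such a stack of dilated causal convolutional layers with a shared kernel size $k$ and dilation factors $1, 2, \dots, 2^{L-1}$, the receptive field is

$$\mathrm{RF} = 1 + (k - 1)\sum_{i=0}^{L-1} 2^{i},$$

which for the model in Figure 2.5 ($k = 2$, $L = 3$) gives $1 + 1 \cdot (1 + 2 + 4) = 8$ timesteps.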


Figure 2.5: TCN using 3 convolutional layers with increasing dilation.

The idea of using convolutional layers along the temporal dimension of the dataset is not new [19], but it is only in recent years that it has started to take off, thanks to good results in the recent studies outlined below.

In 2016, Van Den Oord et al. [8] introduced WaveNet, an ANN model for audio generation. WaveNet used a structure where dilated causal convolutions were stacked to get a large receptive field. In addition to this, residual [20] and skip connections were used "to speed up convergence and enable training of much deeper models". The model was trained in text-to-speech synthesis, a time series problem, and yielded state-of-the-art performance, with human listeners rating it more natural sounding than existing state-of-the-art models.

Inspired by the success of WaveNet and other studies using CNNs for sequence modelling problems, in 2018 Bai et al. [7] published a study that aimed to compare the performance of the LSTM to a CNN-based architecture that was inspired by the recent studies, but simplified. The model used residual connections and dropout layers. The authors adopted the name Temporal Convolutional Network for describing the architecture, noting that it was used "not as a label for a truly new architecture, but as a simple descriptive term for a family of architectures". The models were evaluated in different sequence modelling tasks that have commonly been used to benchmark recurrent models. The TCN outperformed the LSTM in all but one problem. The authors concluded that "convolutional networks should be regarded as a natural starting point and a powerful toolkit for sequence modeling".


Exactly what a TCN entails has not been clearly defined by anyone at the time of this study. Residual connections [20], skip connections and dropout [21], as included in the surveyed studies, are general extensions that can be, and have been, used with other ANN architectures but may be regarded as defaults for the TCN architecture in the future.

In contrast to the LSTM architecture presented in section 2.2.2, the TCN does not exhibit a dependency between outputs and can therefore be trained over each timestep in the dataset in parallel, which is a notable benefit. However, it should be noted that the receptive field of a TCN model is fixed and must be determined at implementation and tailored to each problem, while the recurrent structure of the LSTM lets it learn its receptive field.

2.3 Related Work

2.3.1 Patient Volume Forecasting in Traditional Care

Multiple studies have focused on the problem of forecasting patient volumes in the context of traditional healthcare over the years [4, 5]. The large number of studies using different models, data and metrics of evaluation makes it hard to draw general conclusions about the suitability of the different models used. There is a general need for larger studies and review articles summarizing the many results within the area.

Early studies tried to model the forecast as a linear combination of independent variables using calendar data and in some cases weather data [4]. Later studies attempted more sophisticated statistical models like the Autoregressive Integrated Moving Average (ARIMA) model, which models the forecast based on the historical time series within a short range [4, 5]. To date, statistical models like ARIMA and various extensions of it seem to be dominating the field. A few studies have attempted to model the forecasts using machine learning models like ANNs [5].

In a study from 2008, Jones et al. [22] included a single-layer MLP model in a comparison with multiple linear regression models and a seasonal ARIMA model and found that the MLP did not provide consistently accurate forecasts. Hellstenius [9] later commented on the shallow architecture of the evaluated model and the lack of clarity regarding the methodology used. In a more recent study from 2017, Jiang et al. [23] used a more advanced MLP than the one used by Jones et al. [22], with multiple layers and overfitting mitigation using dropout [21] and L2 regularization. The structure of the model was hyperparameter-optimized using Grid Search [24] and the features of the data were selected with a genetic algorithm. The model was able to outperform frequently used statistical models like ARIMA, and additionally modern machine learning models like Support Vector Machines. No studies using recurrent or convolutional ANNs to forecast patient volumes in the context of traditional healthcare have been identified.

2.3.2 Patient Volume Forecasting in Digital Care

Only one study attempting patient volume forecasting in digital care has been identified. In a study from 2018, Hellstenius [9] compared two ANNs, an MLP and an LSTM, to an autoregressive (AR) model in the problem of one-step forecasting of patient volume using data from a digital healthcare provider (the same caregiver as the principal of this study). Both the MLP and the LSTM outperformed the AR model, indicating that non-linear models are suitable for the problem. The LSTM obtained the lowest error of the evaluated models.

The LSTM model was further evaluated with multivariate datasets, annotating the historical time series with either weekday or holiday data. The results showed no significant improvements using the annotated data, contradicting previous studies within traditional care. Hellstenius concluded that the LSTM is an appropriate choice for this type of forecasting problem but mentioned the recent results of Bai et al. [7] on Temporal Convolutional Networks, indicating that the field is constantly changing.

It should be noted that while one-step ahead forecasting of patient volume, as studied by Hellstenius, is an interesting sequence modeling problem, its usability for staffing of medical personnel is limited due to the short horizon of the forecasts.

2.4 Summary

To summarize the content of this chapter: the recent results of the TCN architecture on sequence modeling problems are promising, but there is still a need to verify the generality of the results in different problem settings. The area of patient volume forecasting has not yet been shown to benefit from the advances in learning models in the related area of sequence modeling, and there is a need to evaluate those models. Lastly, the novel area of patient volume forecasting within digital care needs to be scrutinized in more studies, preferably with larger forecasting horizons.

3 Methodology

3.1 Data

The primary data used in this study consisted of timestamps of video meetings taking place between patients and clinicians at the digital healthcare provider KRY from March 2015 until February 2019. The data was supplied by KRY. The timestamps come from two different types of meetings, drop-in meetings and pre-booked meetings, from the Swedish market. A drop-in meeting is used when the patient wants to meet a clinician as soon as possible. For drop-in meetings, the time of entering the drop-in queue represents the timestamp in the data. A pre-booked meeting is used when the patient wants to meet a clinician at a specific time. For pre-booked meetings, the time scheduled for the meeting represents the timestamp in the data.

The supplied data was complete in the sense that there were no gaps or periods of time where meetings had occurred without their timestamps being present in the dataset.

The timestamps from the two types of meetings were joined into a single dataset and resampled into a count of timestamps per hour, representing the historical absolute patient volume for each timestep/hour. "Univariate" is used as a suffix for the rest of the paper to refer to models being fit using this dataset, for example "TCN Univariate". Due to confidentiality restrictions, there is no visualization of the complete dataset. The data shows a clear growth trend from start to end with indications of daily and weekly seasonality.
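A minimal sketch of this resampling step in Python, under the assumption that the merged meeting timestamps are available as a list of datetime strings (the values below are made up for illustration):

```python
import pandas as pd

# merged drop-in and pre-booked meeting timestamps (illustrative values only)
timestamps = ["2018-01-01 08:15", "2018-01-01 08:40", "2018-01-01 09:05"]

events = pd.Series(1, index=pd.DatetimeIndex(timestamps)).sort_index()

# hourly count of meetings, i.e. the absolute patient volume per timestep
volumes = events.resample('H').sum()
```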

In addition to the univariate dataset a multivariate dataset was created by annotating each timestep in the dataset with the day of the week (0-6).


This dataset is denoted “Multivariate” for the rest of the paper and used as a suffix to model names.

The univariate and multivariate datasets were both split chronologically into training, validation and test sets. The first 80% of the data was used as the training set, the 80-90% range for the validation set and the 90-100% range was used as the test set.

Each dataset was transformed into a supervised learning problem by setting the target for each timestep to be the change in demand from the current timestep to the next. At each timestep, the forecasting model gets the data of all timesteps up to timestep t and is tasked with predicting the change in demand to timestep t + 1. The decision to set the target to the change in demand instead of the actual demand of the next timestep, as done by Hellstenius [9], was taken after early tests showed models mainly learning to model the naive forecast of predicting the last input as output.
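As an illustration, a minimal sketch of this transformation using a fixed-length input window (the 168-hour window length is an assumption made for the example; `volumes` is the hourly series described above):

```python
import numpy as np

def to_supervised(volumes, window=168):
    """Build (input window, target) pairs where the target is the change
    in volume from the last timestep of the window to the next hour."""
    values = np.asarray(volumes, dtype=float)
    X, y = [], []
    for t in range(window - 1, len(values) - 1):
        X.append(values[t - window + 1:t + 1])  # the `window` most recent hours, ending at timestep t
        y.append(values[t + 1] - values[t])     # change in volume from timestep t to t + 1
    return np.asarray(X)[..., np.newaxis], np.asarray(y)
```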

3.2 Models

Two different model architectures, an LSTM and a TCN, were implemented in Keras (www.keras.io), a Python library for deep learning.

Both models can be used with both univariate and multivariate data. When models are mentioned for the rest of the paper they may have suffixes denoting what data they are fit and used with. For example, LSTM Multivariate refers to an LSTM model using the multivariate dataset.

3.2.1 LSTM

The LSTM model was implemented using the Keras built-in CuDNNLSTM layer, a GPU-optimized version of the standard LSTM implementation that uses the Forget Gate and Output Activation extensions of the original LSTM design. The model implemented consists of a single CuDNNLSTM layer, where the dimensionality of the memory node and output is configurable as a hyperparameter, finished by a 1-unit fully connected layer with a linear activation function.

The CuDNNLSTM layer uses Sigmoid activations for the recurrent activations in the LSTM cell and Tanh activations for the output activations.
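A minimal sketch of how such a model might be assembled in Keras; the window length and number of input features are placeholders, and the training configuration is described in section 3.3:

```python
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

def build_lstm(units, window, n_features=1):
    """Single CuDNNLSTM layer followed by a 1-unit linear output layer."""
    model = Sequential()
    model.add(CuDNNLSTM(units, input_shape=(window, n_features)))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='adam', loss='mse')
    return model
```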


3.2.2 TCN

The TCN was implemented by recreating the implementation used in the 2018 paper by Bai et al. [7], with the exception of weight normalization [25], which was not available in Keras at the time of the study. Recreation of weight normalization was deemed out of scope for the study.

The model consists of a number of residual blocks that are stacked in a sequential fashion. Each residual block performs a dilated convolution, followed by a ReLU activation and dropout. This series of transformations is repeated twice. Alongside the series of transformations, a convolution with a kernel size of 1 is applied separately to the input and the result is added to the output of the series of transformations, which makes the block learn the residual mapping from the inputs to the outputs. Figure 3.1 shows a residual block.

Figure 3.1: A single residual block. k, f, and d refer to the kernel size, the number of filters and the dilation factor of the convolutional layers within the block.

The number of residual blocks to use in the model is configurable as a hyperparameter. For each block stacked, the dilation factor used in the dilated convolutions of that block is increased. How quickly the dilation factor increases for each block is determined by the dilation rate of the model. A dilation rate of 2 means that the dilation factor is doubled for each residual block used. The dilation rate is configurable as a hyperparameter. The kernel size of the convolutions, the number of filters of the convolutions and the dropout rate are the same for every residual block used and are configurable as hyperparameters of the model.

The sequence of residual blocks was finished by a 1-unit fully connected layer with a linear activation function.
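A minimal sketch of one such residual block in Keras, using Conv1D with causal padding for the dilated convolutions (weight normalization is omitted, as noted above):

```python
from keras.layers import Conv1D, Activation, Dropout, Add

def residual_block(x, kernel_size, filters, dilation, dropout_rate):
    """Two dilated causal convolutions with ReLU and dropout, plus a 1x1
    convolution on the input so that the residual addition matches in shape."""
    y = x
    for _ in range(2):
        y = Conv1D(filters, kernel_size, padding='causal', dilation_rate=dilation)(y)
        y = Activation('relu')(y)
        y = Dropout(dropout_rate)(y)
    shortcut = Conv1D(filters, 1, padding='same')(x)
    return Add()([y, shortcut])
```

Stacking such blocks with a doubling dilation factor and finishing with a 1-unit fully connected layer would give a model of the kind described above.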

3.2.3 Baseline Models

In addition to the LSTM and TCN models, baseline models were implemented to aid in comparisons. Three different naive forecasting models were implemented:

• Naive 1h - Periodicity of 1 hour. Outputs the last observed value.
• Naive 24h - Periodicity of 24 hours. Outputs the value from the same hour the previous day. Assumes daily seasonality in the data.
• Naive 168h - Periodicity of 1 week. Outputs the value from the same hour of the same weekday the previous week. Assumes daily and weekly seasonality in the data.

In addition to the naive models, an 8-Week Moving Percentile (8w-MP) model was implemented. To forecast the next timestep/hour, the 8w-MP takes the 80th percentile of that hour's percentage contribution to the weekly total over the last eight calendar weeks, and multiplies it by the weekly total of the last complete calendar week.

The 8w-MP model is similar to a highly ranked candidate model at the principal that has been evaluated for staffing internally, and was selected on that basis.
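A sketch of these baselines, under the simplifying assumption that weeks are treated as trailing 168-hour blocks rather than calendar weeks (`history` is a one-dimensional array of hourly volumes, with the hour to be forecast immediately following its last element):

```python
import numpy as np

def naive_forecast(history, period):
    """Naive 1h/24h/168h: repeat the value observed `period` hours ago."""
    return history[-period]

def moving_percentile_forecast(history, n_weeks=8, q=80, week=168):
    """Sketch of the 8w-MP baseline under the assumptions stated above."""
    history = np.asarray(history, dtype=float)
    shares = []
    for w in range(1, n_weeks + 1):
        start = len(history) - w * week
        same_hour = history[start]                      # same hour of the week, w weeks back
        week_total = history[start:start + week].sum()  # total of the week starting at that hour
        shares.append(same_hour / week_total)
    last_week_total = history[-week:].sum()
    return np.percentile(shares, q) * last_week_total
```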

3.3 Model Fitting

The LSTM and TCN models were fit on the training set using the ADAM optimizer [26] with the Keras default learning rate of 0.001. Mean Squared Error, depicted in equation 3.1, was chosen as the loss function to be minimized.

The models were fit using early stopping, terminating the fitting process if the MSE on the validation set had not been lowered by more than 0.5 for more than patience epochs in a row. The LSTM model was fit for a maximum of 300 epochs with the early stopping patience set to 20 epochs.


$$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2 \qquad (3.1)$$

Equation 3.1: Mean Squared Error. $n$ is the number of timesteps, $y_t$ is the true target of a timestep and $\hat{y}_t$ is the prediction.

The TCN model was fit for a maximum of 500 epochs with the early stopping patience set to 80 epochs.

When the fitting process was stopped early, the weights from the best performing epoch on the validation set were restored and used in further evaluation.
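A minimal sketch of this early-stopping setup using the Keras callback (the fit call is shown as a comment since `model` and the data arrays are placeholders):

```python
from keras.callbacks import EarlyStopping

# stop if the validation MSE has not improved by more than 0.5 for `patience`
# consecutive epochs, and restore the weights of the best epoch
# (patience: 20 epochs for the LSTM, 80 for the TCN)
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.5,
                           patience=20, restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=300, callbacks=[early_stop])
```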

3.4 Hyperparameter Optimization

A hyperparameter search was conducted for each model to find hyperparameters suitable for the specific problem. Due to time constraints, hyperparameters were only optimized on the univariate dataset and then reused in evaluation with the multivariate dataset.

Since the LSTM model only had a single hyperparameter, a Grid Search [24] was deemed suitable. The number of units was varied between 250 and 2000 in increments of 250. For each configuration, the model was fit in 10 independent trials and the validation loss was recorded. The hyperparameter configuration that resulted in the lowest loss on the validation set was selected for further evaluation and is presented in section 4.1.1.

Due to the large number of hyperparameters of the TCN and the possible interactions between these, a Random Search [24] was deemed suitable. Table 3.1 shows the search space of the TCN hyperparameter optimization process. Hyperparameter configurations were randomly generated from this search space using a uniform distribution. A total of 600 trials were performed. The hyperparameter configuration that resulted in the lowest loss on the validation set was selected for further evaluation and is presented in section 4.1.2.


Parameter       Low Boundary   High Boundary   Step
Kernel Size     2              6               1
Dilation Rate   2              3               1
Blocks          2              6               1
Filters         32             256             32
Dropout         0.0            0.3             1/∞ (continuous)

Table 3.1: TCN hyperparameter search space. The allowed values for each parameter vary with an optional step between the low and high values inclusively.
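A minimal sketch of how configurations could be drawn uniformly from the search space in Table 3.1 (the trial loop and model construction are omitted):

```python
import random

SEARCH_SPACE = {
    'kernel_size': list(range(2, 7)),     # 2-6, step 1
    'dilation_rate': list(range(2, 4)),   # 2-3, step 1
    'blocks': list(range(2, 7)),          # 2-6, step 1
    'filters': list(range(32, 257, 32)),  # 32-256, step 32
}

def sample_configuration():
    """Draw one hyperparameter configuration uniformly at random."""
    config = {name: random.choice(values) for name, values in SEARCH_SPACE.items()}
    config['dropout'] = random.uniform(0.0, 0.3)  # continuous in [0.0, 0.3]
    return config

configurations = [sample_configuration() for _ in range(600)]  # 600 trials
```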

3.5 Evaluation

The best performing hyperparameters from the hyperparameter search of each model were selected for evaluation. The models were evaluated with both univariate and multivariate data in 10 different trials for each dataset and model.

In each trial, the model was first fit using the procedure described in section 4.2. Results from the fitting procedure were recorded. The model was then evaluated in both one-step ahead and multi-step ahead forecasting.

In the one-step ahead forecasting evaluation, the model was used to forecast the change in demand from the current timestep to the next for each timestep in the test set. This yielded one forecast for each timestep. Mean Squared Error (MSE), depicted in equation 3.1, was chosen as the evaluation metric and was calculated in the same way as the loss during the fitting step.

In the multi-step ahead forecasting evaluation, the model was used to forecast the change in demand for the next n timesteps for each timestep in the test set. For each timestep, an n-step ahead forecast was generated by forecasting the demand of the next timestep and then using that forecasted demand as input to forecast the demand for the step after the next one. This was repeated in a recursive fashion to yield an n-step ahead forecast from each timestep in the test set.
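A sketch of this recursive procedure; `one_step` is a hypothetical callable wrapping a fitted model that returns the forecasted change in volume for the next hour, given the history seen so far:

```python
def recursive_forecast(one_step, history, n_steps):
    """Roll a one-step model forward `n_steps` hours by feeding its own
    forecasts back in as if they were observed volumes."""
    window = list(history)
    forecasts = []
    for _ in range(n_steps):
        next_value = window[-1] + one_step(window)  # add the forecasted change
        forecasts.append(next_value)
        window.append(next_value)                   # the forecast becomes input for the next step
    return forecasts
```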

For each n-step forecast the MSE was calculated, yielding one MSE value per timestep. The Mean MSE (MMSE) was calculated by taking the mean of the MSE values and was chosen as the evaluation metric.

The multi-step ahead forecasting evaluation was repeated with n set to 24, 72 and 168.


All evaluation was conducted on an Amazon EC2 p2.xlarge instance (1x NVIDIA Tesla K80 GPU, 64 GiB RAM, 4x vCPU) running Ubuntu 16.04, using the GPU version of TensorFlow 1.13 as the Keras backend.


4 Results

This chapter starts by presenting the results of the hyperparameter optimization step and the hyperparameter configurations chosen for further evaluation. Results from the model fitting process are then presented, followed by the results of the evaluations of each model in both the one-step ahead and the multi-step ahead forecasting problems.

4.1 Hyperparameter Optimization

4.1.1 LSTM

Figure 4.1 shows the MSE and mean time per epoch for the Univariate LSTM on the validation set when the number of units used in the LSTM cell was varied between 250 and 2000 in increments of 250. 10 trials were performed for each configuration.

The mean MSE trended downwards as the number of units was increased, to a minimum of 107.5 at 1500 units, after which both the mean MSE and the variance started to trend upwards. The mean time per epoch increased with the number of units.

The configuration using 1500 units was selected for further evaluation. For the rest of the paper, LSTM Univariate and LSTM Multivariate refer to LSTM models using a configuration of 1500 units.


Figure 4.1: MSE and mean time per epoch for the Univariate LSTM on the validation set when varying the number of units. Results of 10 trials of each configuration.

4.1.2 TCN

Table 4.1 shows the hyperparameters of the model that resulted in the smallest loss on the validation set out of 600 randomly generated configurations from the search space previously defined in section 3.4.

The configuration presented in the table was selected for further evaluation. For the rest of the paper, TCN Univariate and TCN Multivariate refer to TCN models using this configuration. The kernel size, dilation rate and number of blocks of this configuration result in a receptive field of 249 timesteps, meaning that a TCN model configured with these hyperparameters will use up to 249 historical timesteps to forecast the next target.

Kernel Size Dilation Rate Blocks Dropout Filters MSE

5 2 5 0.05 160 90.214859

Table 4.1: TCN hyperparameters that resulted in the lowest loss on the validation dataset in the hyperparameter search on the univariate dataset.
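Assuming two dilated convolutions per residual block, as described in section 3.2.2, and a dilation factor that doubles for each of the $B$ blocks, the receptive field of this configuration works out as

$$\mathrm{RF} = 1 + 2(k - 1)\sum_{i=0}^{B-1} 2^{i} = 1 + 2 \cdot 4 \cdot (1 + 2 + 4 + 8 + 16) = 249,$$

with kernel size $k = 5$ and $B = 5$ blocks, consistent with the 249 timesteps stated above.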

4.2 Model Fitting

Table 4.2 shows the mean results of fitting each of the models on the training set 10 times.

Since the models were fit using early stopping with restoration of weights, “Epochs” in the table refers to the epoch with the best performance on the validation set, and not the total number of epochs until termination. “Total Fitting Time” refers to the total time used to fit the model up until termination of fitting either by early stopping or by reaching the maximum number of epochs.

As shown in the table, the LSTM models required significantly more time per epoch than the TCN models but converged in fewer epochs. There were no significant differences in number of epochs, time per epoch or total training time between the univariate and multivariate versions of the same model.

Model               Epochs         Time/Epoch       Total Fitting Time
LSTM Univariate     93.7 ± 16.6    25.29s ± 0.23s   2874s ± 410s
LSTM Multivariate   87.6 ± 13.8    25.35s ± 0.22s   2727s ± 342s
TCN Univariate      282.4 ± 57.0   0.33s ± 0.01s    120s ± 18s
TCN Multivariate    298.9 ± 30.0   0.33s ± 0.01s    126s ± 10s

Table 4.2: Mean number of epochs, time per epoch and total training time when fitting each model 10 times.

4.3 One-step Forecasting

Table 4.3 presents the mean MSE of each model when evaluated in one-step ahead forecasting on the test set, along with the results of the baseline models described in section 3.2.

As shown in the table, both the univariate and multivariate TCN and LSTM models outperformed the baseline models by a fair margin in terms of MSE. Both TCN models also performed significantly better than the LSTM models. For both the LSTM and TCN models, the use of multivariate data resulted in runs with slightly lower mean MSE and variance.


Model               MSE
Naive 1h            302
Naive 24h           248
Naive 168h          193
8w-MP               232.5
LSTM Univariate     122.2 ± 5.9
LSTM Multivariate   121.3 ± 4.8
TCN Univariate      93.4 ± 2.4
TCN Multivariate    89.8 ± 1.9

Table 4.3: Mean MSE of each model when evaluated in one-step ahead forecasting. Mean results of 10 trials per model.

Figure 4.2 shows a visualization of the one-step ahead forecasts of the best performing multivariate LSTM and TCN models along with the Naive 1h model, together with the absolute error from the target for each timestep. It should be noted that the visualization only shows a small part of the test set that might not be representative of the whole dataset. Also note that for each predicted timestep t, the models have had access to all historical targets up to timestep t-1, as per the definition of one-step ahead forecasting.


Figure 4.2: The top graph shows a visualization of the one-step ahead forecasted targets for the best performing multivariate TCN and LSTM models together with the Naive 1h forecasting model. The black line represents the true targets to be forecasted at each timestep. The bottom graph shows the absolute error for each forecasted target.

4.4 Multi-step Forecasting

Table 4.4 presents the mean MSE of each model when evaluated in multi-step ahead forecasting on the test set, together with the results of the Naive 168h and 8w-MP baseline models described in section 3.2.3. For the TCN Univariate model, two trials resulted in errors diverging more and more for each timestep ahead predicted. These two trials are not included in the results in table 4.4.

As shown in the table, both the univariate and multivariate TCN and LSTM models outperformed the baseline models by a fair margin when evaluated in 24-step forecasting, with the TCN outperforming the baseline models on all tested forecasting horizons.

Both TCN models also performed significantly better than the LSTM models on all tested forecasting horizons. For the TCN models, the use of multivariate data resulted in runs with lower mean MSE and variance on all tested forecasting horizons. For the LSTM models, the use of multivariate data resulted in runs with lower mean MSE on all forecasting horizons and lower variance in the 24- and 72-step cases.

Model               24h MMSE       72h MMSE       168h MMSE
Naive 168h          193.5          194.4          195.8
8w-MP               233.5          235.3          238.5
LSTM Univariate     157.6 ± 13.3   193.0 ± 38.6   261.5 ± 63.0
LSTM Multivariate   150.0 ± 4.1    184.0 ± 25.5   241.8 ± 67.2
TCN Univariate      133.5 ± 3.8    141.5 ± 5.0    143.2 ± 5.5
TCN Multivariate    128.7 ± 1.6    133.6 ± 1.7    136.6 ± 2.2

Table 4.4: Mean MSE of each model when evaluated in multi-step ahead forecasting. Mean results of 10 trials per model with standard deviation.

Figure 4.3 shows a visualization of a 72-step ahead forecast of the best performing multivariate LSTM and TCN models along with the 8w-MP baseline model, together with the absolute error from the target for each timestep. Again, note that the visualization only shows a small part of the test set that might not be representative of the whole dataset. For each forecasted timestep t, the models had access to historical targets in the dataset from timestep 0 (not shown in the figure), along with their own predicted targets up to timestep t − 1.

Figure 4.4 shows a visualization of a 72-step ahead forecast on another part of the test set. Note the clear differences in accuracy for each model in figure 4.3 versus figure 4.4.


Figure 4.3: The top graph shows a visualization of the forecasted targets of a 72-step ahead forecast for the best performing multivariate TCN and LSTM models together with the 8-Week Baseline model and the true targets. The bottom graph shows the absolute error for each forecasted target.


Figure 4.4: The top graph shows a visualization of the forecasted targets of a 72-step ahead forecast for the best performing multivariate TCN and LSTM models together with the 8-Week Baseline model and the true targets. The bottom graph shows the absolute error for each forecasted target.

Figure 4.5 shows the mean MSE for every step ahead in the 24-step forecast for the LSTM and TCN models. For both models, the MSE increases quickly from step one to step two. For the TCN models, the MSE trends upward in a linear fashion from step 2 to step 24. For both LSTM models, the error grows to step 15 and then decreases for some steps. For both univariate models, the error is higher in each step than for their multivariate counterparts.


Figure 4.5: A visualization of the mean MSE for each step ahead in a 24-step forecast. Results of 10 trials of each model.

5 Discussion

The aim of the study was to evaluate the suitability of the relatively untested TCN model, compared to the more widely used LSTM model, in the context of one-step and multi-step forecasting of patient demand in a digital healthcare setting, and to contribute to the understanding of its strengths and limitations in general. A TCN model and an LSTM model were implemented and evaluated in both one-step and multi-step ahead forecasting problems using data both with and without explicit seasonal features. The findings and the methodology of the study are discussed in the following sections.

5.1 Model Comparison

5.1.1 One-step Forecasting

Both the LSTM and the TCN models achieved lower MSE values, using both univariate and multivariate data, compared to the Naive and 8w-MP baseline models in the evaluation of the one-step ahead forecasting problem. They performed better than all tested baseline models, showing their suitability for solving the problem. This is in line with previous work by Hellstenius [9] from 2018, which compared two ANN models, one being an LSTM, to an autoregressive baseline model in a similar problem setting. It should be noted, however, that the baseline models used in this study do not always make use of the data of the latest available timestep to make predictions. This is discussed as a limitation of the study later on in section 5.3.2.

The TCN model was able to consistently achieve significantly lower MSE values than the LSTM model regardless of the dataset used. This is in line with previous work by Bai et al. [7], which found TCN models consistently outperforming LSTM models in different sequence modeling tasks, and further supports the TCN model's suitability over the LSTM in different types of problems. It is also in line with the results of Van Den Oord et al. [8], who saw state-of-the-art performance of a model similar to the TCN used in this study when evaluated in the time series problem of speech synthesis.

The LSTM showed higher MSE values when evaluated on the test set compared to the results recorded on the validation set from the hyperparameter optimization phase. The same observation could not be made for the TCN. This indicates that the LSTM did not generalize as well as the TCN on this type of problem, which could be due to a bias developed towards the validation set through the hyperparameter search and early stopping in the fitting process.

The results of the naive models, where the Naive 24h model performed better than the Naive 1h model and the Naive 168h model performed better than the Naive 24h model, indicate that the dataset used exhibits strong daily and weekly seasonality patterns.

5.1.2 Multi-step Forecasting

Both the LSTM and TCN models achieved lower MMSE values, with both univariate and multivariate data, compared to the Naive 168h and 8w-MP baseline models in the evaluation of the multi-step ahead forecasting problem with a horizon of 24 hours. The TCN achieved continued good results as the horizon was increased to 72 and 168 hours, while the errors of the LSTM model quickly grew.

Figure 4.5 shows how the errors of both models increase for each step ahead in the multi-step forecasting evaluation. The trend is to be expected, as the long-term future is assumed to be more uncertain than the short-term future. The recursive nature of the multi-step forecast, where the predicted demand for a timestep is used when predicting the next timestep, could also be a contributing factor to the trend, as the error is allowed to propagate. The error increases quickly from step one to step two. This is believed to be due to the fact that the first step is the only predicted value for which the model has used only the true history and not its own previously predicted values as inputs.

The LSTM errors increased at a faster rate than the TCN errors for each step ahead, which is in line with the results of the one-step forecasting evaluation and the recursive nature of the forecast as noted in the previous paragraph. In section 5.5: Future Research, a possible mitigation of the error propagation is proposed.

Figures 4.3 and 4.4 show 72h multi-step forecasts for the TCN and LSTM models together with the 8w-MP baseline model and the true targets, on two different parts of the test set. In figure 4.3, the forecasted values of the baseline model are close to the true targets, indicating that the visualized part of the dataset follows the seasonality observed in previous data. In figure 4.4, the baseline model performs considerably worse, which indicates that the visualized part of the dataset deviates from previous data. In both figures, the forecasted values of the TCN model are close to those of the baseline model, indicating that the TCN model learned to approximate the general seasonality in the dataset. This is unsurprising, given that it has been fit using the historical time series data in which the seasonality is encoded, and additional calendar data in the multivariate setting.

In both figures, the forecasts produced by the LSTM are far off the targets, even though it achieved lower MMSE values than both baseline models when evaluated with a horizon of 24 hours. It is not obvious why this is, but a theory that is strengthened by figure 4.4 is that it generally performs worse than the baseline models, but for certain parts of the data that deviate from the seasonality, it achieves significantly better results than the baseline models. The Mean Squared Error metric used in the evaluation penalizes large errors more than small ones, which could misrepresent the performance of the models in these cases. It should again be noted that the baseline models used in this study do not always make use of the data of the latest available timestep to make predictions, meaning that their forecasts do not change for every step in the evaluation. This could help keep the errors low for the TCN and LSTM models when the data deviates from the norm and is discussed as a limitation of the study later on in section 5.3.2.

5.1.3 Interpretability and Usage

Both the LSTM and the TCN are ANNs, while all of the baseline models are statistical. A neural network is essentially a black box: for a predicted value in a forecast, it is hard to know exactly why that value was predicted. This is in contrast to the statistical models, where it is easy to understand exactly why a value was forecast. Another drawback is that large amounts of data are needed to fit the ANN models, while the statistical models require no fitting procedure.

An advantage of the ANN models over the statistical models is their flexibility in modelling different problems. For a dataset with clear patterns, such as seasonal ones, it might be possible to find a suitable statistical model and apply it successfully. With a more complex dataset this could be harder to do, especially if there are non-linear relationships between the targets and the predictors. The ANNs' ability to model such relationships from large amounts of data makes them very flexible and avoids biasing the model towards assumptions. Another advantage is that the models require a relatively small amount of data to make predictions after training. The TCN evaluated in this study had a receptive field of 249 timesteps (about 1.5 weeks), while the 8-Week Percentile model has a receptive field of 1344 timesteps (8 weeks).

The LSTM model has a complex structure with multiple data paths (see section 2.2.2) that can make it hard to grasp. Thanks to good results on many different problems across multiple studies, it has become widely adopted and readily available. Different implementation choices have been tested [17], and it has reached a state where there are accepted choices for its parameters, meaning that little to no hyperparameter searching is required to use it. It is implemented in popular deep learning frameworks, such as Keras, which makes it easy to use without understanding the exact implementation.
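As an illustration of this availability, a one-step LSTM forecaster can be sketched in a few lines of Keras. The unit count, window length and optimizer below are placeholders, not the configuration reported in section 4.1.

    from tensorflow import keras

    window, n_features = 168, 1                  # illustrative input window (1 week of hours)
    model = keras.Sequential([
        keras.layers.LSTM(250, input_shape=(window, n_features)),
        keras.layers.Dense(1),                   # predicted patient volume for the next hour
    ])
    model.compile(optimizer="adam", loss="mse")  # MSE matches the evaluation metric used here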

Exactly what constitutes a TCN model is not clear at the time of this study. The TCN model used in this study is very similar to the one used by Bai et al. [7] in 2018, consisting of causal dilated convolutions stacked together with dropout layers in residual connections. Dropout and residual connections are standalone concepts that have been used with other models before and shown to benefit the training of deep models [21, 20]. Perhaps causal dilated convolutions are therefore the only significant requirement for a model to be defined as a TCN.

Compared to the LSTM, the structure of a TCN that only uses causal dilated convolutions is straightforward, which makes it easy to sketch and to understand how the data flows. Unlike the LSTM, the TCN does not automatically learn its receptive field; it is determined by the kernel size and dilation rates used in the convolutions and by the number of stacked convolution layers. This means that a hyperparameter search is absolutely necessary to get good performance on problems where the needed receptive field is unknown. At the time of this study, the TCN is yet to be included in any of the major deep learning frameworks, which makes it somewhat less available than the LSTM. It is likely to become more available and easier to use as more studies evaluate it and it becomes more clearly defined.
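As an example of how the receptive field follows from these choices, the sketch below computes it for a Bai-style TCN with two dilated convolutions per residual block and dilations doubling per level. The kernel size and depth shown are illustrative values that happen to yield the 249-timestep receptive field mentioned above; they are not necessarily the configuration from section 4.1.

    def tcn_receptive_field(kernel_size, n_levels, convs_per_block=2):
        """Receptive field of a TCN with dilations 1, 2, 4, ..., 2^(n_levels - 1)
        and `convs_per_block` causal dilated convolutions in each residual block."""
        dilations = (2 ** level for level in range(n_levels))
        return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

    print(tcn_receptive_field(kernel_size=5, n_levels=5))  # 249 timesteps (about 1.5 weeks)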


5.1.4 Computational Performance

The results of the fitting procedure showed that the TCN model converged significantly faster than the LSTM model. This was in line with expectations, since the structure of the TCN allows for parallelization while the structure of the LSTM does not, due to dependencies between timesteps. The highly parallel structure of the GPU is believed to have been an advantage for the TCN model, which could make use of it. It should however be noted that the LSTM cells used in the LSTM model were of the type CuDNNLSTM, a GPU-accelerated version specifically created to allow the LSTM to run faster on a GPU device.

The computational performance and the parallel nature of the TCN are considered a major advantage over the LSTM, as GPU devices have become cheaper and more available.

5.2 Univariate vs. Multivariate Data

The TCN and LSTM models were both evaluated with two datasets. The univariate dataset simply consisted of the historical patient demand recorded at each timestep. The multivariate dataset consisted of the same time series, annotated with a value indicating the day of the week (0-6) for each timestep. As previously mentioned, the results of the naive models indicated that the dataset exhibits strong daily and weekly seasonality patterns. The results of the multi-step evaluation show that both models were able to pick up and learn this seasonality using the univariate dataset. With the multivariate dataset, the MSE was lowered slightly. This is in contrast to the results of Hellstenius [9] from 2018, who found no significant improvement when using multivariate weekday data in a similar study.

It is likely that having an explicit feature in the dataset tied to the seasonality made it easier for the model to pick up on it, possibly also making it easier for the model to focus on other patterns in the data. A feature representing the hour of the day would likely have benefited the models as well.
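A sketch of how such calendar features could be derived from a timestamped hourly series using pandas is shown below. The synthetic data and column names are placeholders; only the weekday column corresponds to the annotation actually used in this study, while the hour column is the additional feature suggested above.

    import numpy as np
    import pandas as pd

    # Synthetic hourly demand series standing in for the real data.
    index = pd.date_range("2019-01-01", periods=24 * 7 * 8, freq="H")
    demand = pd.Series(np.random.poisson(50, size=len(index)), index=index)

    multivariate = pd.DataFrame({
        "patients": demand,          # the univariate target series
        "weekday": index.dayofweek,  # 0-6, the annotation used in the multivariate dataset
        "hour": index.hour,          # hour-of-day feature suggested above (not used in the study)
    })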

Multivariate data that is not seasonal and can be forecasted separately would be of interest to create better forecasts, especially for the cases where the data deviates from its regular seasonality patterns. This is discussed further as future research in section 5.5.2.


5.3 Methodology and Limitations

5.3.1 Hyperparameter Optimization

The fact that the LSTM and TCN are not fixed models, but rather model architectures, means that some decisions regarding their structure were made during implementation. Two models may conceptually share the same architecture while having very different implementations, which makes it non-trivial to make fair comparisons between model architectures and to draw general conclusions from the results. In this study, the decisions to treat a certain option as a hyperparameter while fixing another were made based on the assumption that the option had a large impact on the performance of the model, an assumption that may or may not hold.

The hyperparameter optimization step was assumed to lead to a fairer comparison of the evaluated models, but the success of the hyperparameter search depends on the adequacy of the chosen search space. The search space used in this study was limited by the available computational resources and chosen thereafter. The best performing parameters for each model, as presented in section 4.1, do not indicate that the search space was too small, but there is a possibility that the found parameters did not constitute the global minimum, and that a larger search space might have resulted in different parameter configurations for the evaluated models, which in turn would have led to different results.

Hyperparameters were optimized using different techniques for the LSTM and the TCN models. The decision to use different techniques was made based on the dimensionality of the search spaces and the estimated training duration of each model. The LSTM grid search evaluated each configuration in 10 trials, while the TCN random search only evaluated each configuration once. Due to the stochastic nature of the weight initialization of each model, this could be viewed as a drawback for the TCN hyperparameter search and the configuration selected for evaluation. On the other hand, the LSTM search was done in rather large increments of 250 units, which carries the risk of the optimal number of units lying in between increments. There is a trade-off between how much of the search space can be explored and the validity of the results. In retrospect, lowering the number of trials in the LSTM search while decreasing the increments, and doing multiple trials for each configuration in the TCN search, could have been a better option.


Hyperparameters were searched for on the univariate dataset and the best performing parameter configurations were then reused for the evaluation of the models with the multivariate datasets. It can be argued that this makes the comparison of the univariate and multivariate results less fair. On the other hand, the preceding paragraphs highlight the possible issues of searching the univariate and multivariate models separately.

5.3.2 Evaluation

The LSTM and the TCN models were evaluated, along with the baseline models, in both one-step and multi-step forecasting. In the one-step case, each model yields one prediction for each timestep in the dataset, making evaluation straightforward. In the multi-step case, the model yields multiple predictions (one for each step ahead) for each timestep in the dataset. The predictions at each timestep may differ based on from which timestep each forecast originated.

In this study, the multi-step evaluation was performed by evaluating the multi-step forecast for each timestep in the test set, and then using the mean error of all forecasts as the evaluation metric. This makes sense for the LSTM and TCN models, where a forecasted timestep may have a different predicted value depending on from which timestep the forecast originated. The baseline models, on the other hand, generate the same prediction for a certain timestep, irrespective of from which timestep the forecast originated. This could be seen as a possible flaw in the comparison between the ML models and the baseline models. It can also be seen as a strength of the ML models, since they have the ability to adapt quickly as new data is made available.
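The sketch below illustrates this evaluation scheme: a full multi-step forecast is issued from every timestep in the test region and the squared errors are averaged. The function `forecast_fn` is a placeholder for any of the evaluated models and is not part of the thesis code.

    import numpy as np

    def mean_mse_over_origins(forecast_fn, series, test_start, horizon):
        """Average MSE of `horizon`-step forecasts issued from every timestep
        in the test region (a sketch of the evaluation described above)."""
        errors = []
        for origin in range(test_start, len(series) - horizon + 1):
            y_hat = np.asarray(forecast_fn(series[:origin], horizon))  # forecast from this origin
            y_true = np.asarray(series[origin:origin + horizon])
            errors.append(np.mean((y_hat - y_true) ** 2))
        return float(np.mean(errors))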

An alternative to the evaluation conducted could have been to perform a multi-step forecast at every nth timestep, where n is equal to the forecasting horizon, and then using the mean of those forecasts as the evaluation metric. Likely, this would have benefited the baseline models in the comparisons, with the added risk of misrepresenting the results of the LSTM and TCN models.

5.3.3 Data

All models were fit and evaluated on a dataset consisting of a single time series. The dataset was split chronologically into training, validation and test sets, with the last 20% held out from training. The data held out from training might contain useful observations for modeling the problem, especially considering the growth trend of the dataset. K-fold cross validation is commonly used to make use of the whole dataset while also flagging bias and overfitting of the model. K-fold cross validation was considered but decided against, with the argument that a model trained on a future observation would be cheating. With a larger dataset of multiple time series from different caregivers, this type of validation could be performed by holding out one whole time series at a time.

An alternative to K-fold cross validation for time series data is forward chaining/rolling origin cross validation, where the point at which the time series is split into training, validation and testing sets is rolled forward one step at a time, generating one split for each observation. This approach preserves the temporal ordering of the sets while still making use of more of the data and flagging bias and overfitting. Due to computational and time constraints, this technique was not used, but it is an interesting consideration for future work.
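A minimal sketch of such forward-chaining splits is shown below, using scikit-learn's TimeSeriesSplit with a handful of folds rather than one split per observation; the number of splits and the placeholder series are illustrative assumptions.

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    series = np.arange(1000)  # placeholder for the hourly patient volume series

    # Each fold trains on an expanding prefix and validates on the steps that follow,
    # preserving the temporal ordering of the data.
    for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(series):
        print(f"train: 0..{train_idx[-1]}  validate: {val_idx[0]}..{val_idx[-1]}")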

All data used in the study came from a single digital caregiver. All results and findings are therefore biased by this fact, which must be taken into account when interpreting the results. It also makes it hard to draw conclusions about the general case of forecasting patient demand in digital healthcare, and even more so about the general case of time series forecasting.

5.4 Ethics, Sustainability and Social Aspects

The data used in this study included no information that could be tied to a specific person, which frees it from some ethical obligations commonly seen in similar studies. One could imagine personal data of patients connected to a certain caregiver being used to proactively predict when a patient is in need of care, which could then serve as an input to a forecast. This could be an ethical consideration for future studies.

If the models used in this study were to be implemented at a caregiver, the effects would depend on what they replace. If a model provides more accurate forecasts than previous methods, a positive effect would be that coveted resources like clinicians can be better utilized. This would lead to more patients getting treated, or alternatively, less pressured clinicians possibly providing better care. Either way, it should benefit society.

On the other hand, there is a risk that the model provides less accurate forecasts than existing solutions in some cases. The evaluation method used in this study calculates the accuracy with respect to the whole dataset, but as shown in figures 4.3 and 4.4, the accuracy varies for different parts of the data. Inaccurate forecasts would lead to under- or overstaffing, resulting in the opposite of the positive effects mentioned. Furthermore, the black-box nature of the neural networks might make it harder to discover these performance anomalies in time than for certain other methods.

5.5 Future Research

This study evaluated two different ANN models in the problem of forecasting patient volumes using the data of a single digital caregiver. A previous study from 2018 [9] did a similar evaluation with data from the same digital caregiver. This makes it hard to draw conclusions about the general area of patient volume forecasting from the results. There is a general need to validate the results found in this study, and in previous studies, on data from different digital caregivers.

5.5.1 Multi-output Models

In this study, the models trained to make one-step ahead forecasts were used in a recursive fashion to yield multi-step forecasts. Some flaws of this technique, such as the error increasing abruptly from timestep one to timestep two, were shown in the results. It would be interesting to create and test a multi-output model capable of forecasting the whole horizon in one go. A benefit of such a model would be that it is optimized to minimize the loss over the whole forecasting horizon instead of only the next timestep, possibly resulting in more accurate forecasts. A possible drawback would be a loss of flexibility, as the forecasting horizon would have to be determined at an early point in time.
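A direct multi-output variant could, for instance, be sketched in Keras as below, where the final layer emits one value per step in the horizon. The layer sizes and the horizon are illustrative assumptions rather than a proposed configuration.

    from tensorflow import keras

    window, n_features, horizon = 168, 1, 24     # illustrative values
    model = keras.Sequential([
        keras.layers.LSTM(250, input_shape=(window, n_features)),
        keras.layers.Dense(horizon),             # one output per step ahead
    ])
    # The loss is minimized over all `horizon` steps jointly instead of only the next one.
    model.compile(optimizer="adam", loss="mse")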

5.5.2 Use of Non-seasonal Multivariate Data

The data used in this study showed daily and weekly seasonality patterns that the models were able to pick up on. When the data deviated from the normal seasonality patterns, as can be seen in figure 4.4, the performance of the models worsened. The data used to train the models included the historical patient volume, together with the day of the week for each timestep in the multivariate case. In hindsight, it is unsurprising that the models were not able to perform well when the data deviated from the norm. A possible remedy would have been to include non-seasonal features that can be forecasted separately, as noted in section 5.2.
