Developing Markov chain models for train delay evolution in winter climate


Master’s thesis in Engineering Physics, 30 hp
Department of Mathematics and Mathematical Statistics

Autumn term 2020

Developing Markov chain models for train delay evolution in winter climate

Frej Sundqvist


Developing Markov chain models for train delay evolution in winter climate

Master’s thesis in Engineering Physics at Umeå University

January 2021

Authors

Frej Sundqvist

frejsundqvist@protonmail.com

Place of Work

Department of Mathematics and Mathematical Statistics

Umeå University

Sweden

Supervisors

Jun Yu, Department of Mathematics and Mathematical Statistics

Jianfeng Wang, Department of Mathematics and Mathematical Statistics

Examiner

Markus Ådahl, Department of Mathematics and Mathematical Statistics


Abstract

The traffic on Swedish railways is increasing, and punctuality is of great importance for both passenger and freight trains. Modeling train delay evolution is complex, since conflicts between trains can occur and a delay can have a wide variety of causes. In addition, Swedish railways face a harsh winter climate. Studies of railways in Scandinavia have shown that harsh winter conditions decrease the punctuality of trains. This thesis investigates the possibilities of modeling train delay evolution as continuous time Markov processes and which specific modeling choices are preferable. It also assesses the impact of a harsh winter climate on the delay evolution. The studied segments are Stockholm - Umeå and Luleå - Kiruna, both over one winter season. It was found that a change in the time schedule, which in a way redefines the delay, allows for a better fit and better prediction capabilities; it reduced the MSE of the prediction by 50%. As for the weather variables, four variables were included together with their week-long moving averages. Low temperatures were found to increase the risk of a delay (hazard ratio of 1.10) and to decrease the chance of recovering from a delay (hazard ratio of 0.91).

No other significant weather impacts were found.


Acknowledgements

I would like to express my gratitude to my supervisors Jianfeng Wang and Jun Yu for their help and guidance and for answering my questions throughout the whole project. I would also like to thank the Swedish Transport Administration for providing data. Finally, I would like to thank Carl-William Palmqvist, who has much experience in this field, for a helpful phone call.


Contents

1 Introduction
  1.1 Background
  1.2 Aim
  1.3 Definition of terms
  1.4 Existing research
  1.5 Limitations
  1.6 Structure of the thesis

2 Theory
  2.1 Continuous time Markov chains
  2.2 Model assessment

3 Data preprocessing
  3.1 About the data
  3.2 Missing data

4 Method
  4.1 Model description
  4.2 Approach 1: Modelling both running and dwelling processes, excluding passages
  4.3 Approach 2: Combining running and dwelling processes
  4.4 Altering the time schedule
  4.5 Defining state space
  4.6 The weather variables studied
  4.7 Model selection procedure
  4.8 Model assessment and comparison

5 Results
  5.1 Approach 1
    5.1.1 Standard schedule
    5.1.2 Altered schedule
  5.2 Approach 2
    5.2.1 Standard schedule
    5.2.2 Altered schedule
  5.3 Comparison of the models

6 Discussion

7 Conclusion

8 Flaws and possible improvements


1 Introduction

1.1 Background

The traffic on railways in Sweden has doubled in the last 30 years. More people are using the railway for transportation, and at the same time journeys are getting longer and trains are going faster. For passenger trains, punctuality is considered the second most important matter after safety [1]. The long winters in Scandinavia make it more difficult to meet these expectations of punctuality. The railway is already a delicate system, where one minor delay of one train can affect other trains that share the tracks. Extreme weather conditions complicate this even further. Both rolling stock, which refers to powered and unpowered railway vehicles, and the surrounding infrastructure are susceptible to problems caused by weather. The problems that can occur are known, but they are difficult to predict and costly to prevent.

The Swedish Transport Administration (STA) is greatly concerned with improving punctuality. Since 2013, the STA has been collaborating with multiple branch operators in a project called "Tillsammans för tåg i tid" ("Together for Trains on Time"), whose aim is to find which actions should be taken to increase punctuality across the entire Swedish railway. The goal is that 95% of all trains are on time to their final destination and that 80% of passengers are content or neutral with the information in delay situations. As of now, in the entire railway system, nine out of ten trains are on time. Unfortunately, punctuality has not improved since the beginning of the project. The latest report, covering 2018, even showed a clear decrease of punctuality that year. One has to keep in mind, though, that the usage of the railways is increasing: since 2013, the total number of train kilometers has increased by nearly 13 percent [4]. The report from 2019 showing this is extensive and solution-driven, and concludes in its analysis that the entire decrease in punctuality in 2018 could be explained by the extreme weather conditions that occurred in the spring and summer. The report divides all possible causes of non-punctuality into eight sections, each describing what has been done and what will be done to reduce the problems in that area. Worth noting is that none of these sections explicitly deals with weather-related problems. Weather conditions instead appear in many of the sections, because weather can cause such a large variety of problems that it is not sensible to group them together. The same goes for the delay reports coded into categories.

This is a system used by the STA where the train driver reports a presumed cause of the delay when the delay is large enough. For example, snow or ice can block switches; this is categorized as an infrastructure problem. If snow gathers in the bogie of the train, it is a vehicle problem. Moreover, if an animal is hit, it is an accident- or incident-caused problem. But it is well known that great snow depths bring moose and reindeer to the track, where they can stroll more easily; this last example is more indirect.

For high speed trains in Scandinavia, that is, velocities over 200 km/h, it is not very well known how winter conditions directly affect punctuality, since high speed traffic barely exists here. In the rest of Europe, the situation is different. Unlike the rest of Europe, the railways in Scandinavia face persistent harsh winter climate. In France and Germany, extreme winter climate is rarer and is treated as an exception; the common solution is simply to lower the train speed and alter the timetable. This solution does not seem to be accepted in Scandinavia, where these conditions are normal and it is expected that the problems are solved rather than avoided [8].

This thesis is conducted under the research project Nordic Icing Centre of Expertise (NoICE).

See https://www.icingcenter.eu for information about NoICE.

1.2 Aim

This thesis work will investigate the possibilities of modeling train delay evolution in Sweden with Markov chains.

More specifically, it aims to investigate how the Markov states should be defined to get an accurate model and which predictors to include in order to get a better fit. Modelling train delays as stochastic processes has been done before, as discussed in Section 1.4, but not on Swedish train data and not involving weather variables to this extent. Markov chains are generally used when trying to predict a behaviour or an event. This study will, however, mainly investigate the inference of the weather variables' effects on delay, but also, if one model turns out to fit nicely, serve as a predictive model.

1.3 Definition of terms

What follows is a list of terms that will be used throughout this thesis. The definitions are specific for this thesis only.

• On time - the delay is less than or equal to five minutes.

• Delay - The deviation from the scheduled arrival/departure time in minutes.


• Segment delay - The additional delay on a certain segment. Put another way, the segment delay is the deviation from the planned driving time between two stations.

• Segment - A certain part of the train run. The segment is delimited by two measuring points. Either a station or a passage.

• Station - A measuring point where the train actually stops. Not necessarily for passengers to get on/off.

• Passage - A measuring point where the train doesn’t stop. It can be seen as an arrival and a departure occurring simultaneously.

• Event - Either a departure or an arrival, where the time of the event is measured.

• Running process - The process between a departure and the subsequent arrival. The train is running during this process.

• Dwelling process - The process between an arrival and the subsequent departure. The train is standing still at a station; it is ”dwelling”.

1.4 Existing research

The investigation of previous research in this topic focuses on the effect of weather variables on the train delays, as well as attempts on modeling the evolution of delays with stochastic processes.

Palmqvist et al. [10] made a broad study to quantify how variables in the categories weather, timetable, operational and infrastructural affect punctuality. For the weather variables, they found that high and low temperatures affect punctuality exponentially. Punctuality is here a binary response variable that tells whether the train arrived on time at the final station. What ”on time” means can differ, but usually five minutes is used, so a train that is not more than five minutes delayed is considered punctual. Punctuality is defined the same way in the works by Zakeri and Olsson below.

Zakeri and Olsson [14] showed that temperatures below -15 °C together with snow depths of at least 30 cm correlated strongly with non-punctuality of passenger trains in the Oslo area. The data points in this study were taken on a weekly basis, counting the number of days with certain conditions and the number of non-punctual train runs. The study showed that extreme winter conditions affect punctuality strongly.

Zakeri and Olsson [13] investigated the Nordland line for a period of 10 years from 2007 to 2016 using Pearson correlation and multiple regression approaches. There they concluded that snow depth is the variable that mainly explains the variation in punctuality. They also concluded that the effect is much greater in harsh winters compared to mild winters, where the performance can sometimes be in line with, or better than, the rest of the season. This motivates a study over more than one winter.

Qin et al. [11] used a Wiener diffusion process approach on a data set of train movement data, weather data and railway failure data in the Stockholm region from 2009. The response variable studied was the arrival delay. The study quantified the effect of weather variables and operational variables. It was shown that a model accounting for weather variables could better predict new data. Although said to have low precision, it was suggested to be a first step in realizing real-time arrival delay prediction.

Bergström and Krüger [2] found clear correlations of delay with travel days and with the number of stations.

The study analyzed the distribution of delays for certain segments in southern Sweden from 2008 and 2009. They stressed the problem of low-probability, high-impact events in delay modeling, and showed that the distribution of delays is not normal but rather partly a negative exponential distribution, whose tail is responsible for a major share of all delay time. This means that the large delays cannot be neglected even though they occur seldom.

Ludvigsen and Kläbo [9] investigated how harsh winter climate affected freight operation in Sweden, Norway, Switzerland and Poland. They found that these rail operators were ill-prepared for long periods of cold temperatures, heavy snowfall and strong winds. Especially low temperature in combination with snowfall could explain a large amount of the variance.

Sahin [15] modeled train delays in Turkey with Markov chains. The extensive description of the states used and of how to incorporate the buffer time at the train station inspired parts of the modeling in this study. It did not, however, account for external predictors, but focused on conflicts between trains using the same railway. The circumstances regarding weather are undoubtedly different in Turkey.

Kecman et al. [7] used a Markov process to model high speed trains between Beijing and Shanghai. The result was a model for real time prediction. No weather variables were used. The accuracy of the predictions was analyzed, and it was claimed that the presented method could improve the reliability of the previous predictions by 71%.

As it seems from previous studies, much is known about what sorts of weather conditions cause delays, but there have not been many attempts to model train delay in a Northern climate with stochastic process models.


Regarding modelling approaches, from the previous research above, Kecman et al. [7] used Markov processes for modeling train delay evolution, but under quite different circumstances. The expectations of punctuality between Beijing and Shanghai are probably higher than in Scandinavia, and the journey is probably not as troubled by harsh weather conditions. This means that the planned schedules will be different, as well as the distribution of delays. Moreover, the railway between Beijing and Shanghai is a double track line; there will therefore be fewer conflicts with other trains and a different delay propagation. These clear differences motivate an investigation of the Swedish railway with similar models.

Qin et al. [11] also used a stochastic process modelling approach, but not a Markov process approach. This suggests that a Markov process approach including a study of weather variables in Scandinavia is indeed possible and worthwhile to examine.

1.5 Limitations

Freight train operation data will not be included in the analysis, but freight trains are still studied for understanding problems that can occur in winter conditions. A problem with freight trains is that they can depart before their scheduled departure, which a passenger train never would, and they can also skip scheduled stops. They do, however, interfere with passenger trains: many of the passenger train delays are registered as being disturbed by another train, which could be a freight train.

There is, as mentioned above, a great number of possible causes for a delay, and many of them have nothing to do with weather. Therefore, and because of the complexity of modeling train delay evolution, this study does not have high expectations on predictive capabilities. Rather, the focus will lie on the effects of weather variables and on the suitability of modeling train delay evolution with Markov processes in the way presented. However, different models can be compared through their prediction capabilities.

Since this thesis work is time limited, there is a risk of ending the study abruptly. Even if more investigation is clearly needed, it will be left out and suggested as further studies if the project time runs out. Approximately six months were planned for this thesis.

Further limitations concern the data handling. Firstly, the STA’s capability to contribute data was limited. Secondly, as the work was done exclusively on my personal computer, there were struggles handling large data sets and time-consuming runs in the model fitting. The data eventually used is described in Section 3.

1.6 Structure of the thesis

Section 1 introduces the situation and motivates why an investigation of this sort is necessary. It defines terms and sets an aim for this thesis work. Section 2 goes through the theory of Markov chains and how it will be applied to the problem at hand. It might be helpful to first read Sections 3 and 4 to get a grasp of the setup. Section 3 explains where the data comes from, what it looks like and how missing data is handled. Section 4 describes the setup of the problem and also motivates why it was chosen the way it was. Though titled "Method", this section is also the result of an investigation about what setup could be suitable. Further on, the results, discussion and conclusion are presented in Sections 5, 6 and 7, respectively. Some flaws and possible improvements for further study are presented in Section 8.


2 Theory

The msm package, a multi-state modeling package in R, was found to be the most suitable package for the purpose of this modeling. This section goes through the theory necessary to understand how the model fitting is done in msm. The assumptions made about the underlying system are mentioned here, and their validity and what they mean when modeling train delay evolution are discussed in Section 4. The manual of the msm package [5] explains the mathematical theory behind the package carefully; steps omitted in this section can be found in that manual, and otherwise other sources are referenced. This section tries to summarize the theory needed and to capture what will be necessary when modeling the specific case of train delay evolution.

2.1 Continuous time Markov chains

In discrete time, a Markov chain remains in each state for one unit of time before the next transition. The family of random variables is denoted by

{X_n, n ∈ [0, N]}.  (1)

This stochastic process starts with a value X_0 at time n = 0, and ends at time n = N for some N ∈ ℕ₀. The set of all possible values that X_n can attain is called the state space, S. In order for a stochastic process to be called a (homogeneous) Markov process, it must fulfill the Markov condition, that is, for all i, j, k, l, … ∈ S,

P(i, j) = P(X_{n+1} = j | X_n = i) = P(X_{n+1} = j | X_n = i, X_{n-1} = k, X_{n-2} = l, …)  (2)

Here, P(i, j), known as the transition probability from state i to state j, depends conditionally only on i. It depends neither on the previous states k, l, … nor on the time n. By ordering all possible transitions into a matrix, we can form a so-called transition matrix. Let the state space be defined by S = {s_1, s_2}; then we get the transition matrix

P = [ P(s_1, s_1)  P(s_1, s_2)
      P(s_2, s_1)  P(s_2, s_2) ]  (3)
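To make the discrete-time picture concrete, here is a minimal numerical sketch (the probabilities are invented for illustration and are not from the thesis) of a two-state chain where state 0 is "on time" and state 1 is "delayed". The distribution after n steps is obtained by repeated multiplication with P.

```python
import numpy as np

# Hypothetical two-state transition matrix; each row sums to 1 and
# P[i, j] is the one-step probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Starting on time (state 0), the state distribution after n steps is p0 @ P^n.
p0 = np.array([1.0, 0.0])
p3 = p0 @ np.linalg.matrix_power(P, 3)  # distribution after three transitions
```

With these toy numbers, p3 works out to approximately (0.825, 0.175), i.e. the chain is still most likely on time after three steps.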

A stochastic process that fulfills equation (2) is called a discrete time Markov process or Markov chain (DTMC), for which there exists a vast amount of theory on the development of the process. To study continuous time Markov chains (CTMC), we need to relax the condition in equation (1) to let the process spend any continuous amount of time in each state. The family of random variables in continuous time is

{X(t), 0 < t  T } (4)

and the corresponding continuous time Markov property is, for all i, j, k_u ∈ S,

P_ij(t) = P(X(t + s) = j | X(s) = i) = P(X(t + s) = j | X(s) = i, {X(u) = k_u, s > u ≥ 0})  (5)

Here, P_ij(t) represents the probability of a transition happening within an interval of time t. All possible transitions then form a transition matrix just like in equation (3). Moreover, one can define the instantaneous risk of transition, called the intensity q_ij, as

q_ij = lim_{δt → 0} P(X(t + δt) = j | X(t) = i) / δt  (6)

and also allow q to depend on time t and on some possibly time dependent covariates w(t), so that

q_ij = q_ij(t, w(t))  (7)

These intensities are what is estimated when using the msm-package. When qij is independent of time, we call it a time-homogeneous CTMC. Otherwise, we call it a time-inhomogeneous CTMC.

An interpretation of the intensity for the time-homogeneous CTMC is that it is the negative rate of the exponential distribution of the sojourn time in a state. To clarify, the sojourn time of a state is the time the system is expected to stay in that state. One can show that the sojourn time follows a negative exponential distribution with rate −q_ii. Just like the probabilities of transition form a transition matrix, the intensities q_ij form a transition intensity matrix

Q = [ q_11  q_12
      q_21  q_22 ]  (8)

with four entries in the case of a two-dimensional state space S. To eventually calculate the likelihood of a Q-matrix, msm needs the transition matrix P(t). So after guessing Q, msm calculates the transition matrix P(t), whose entries are P_ij(t), by taking the matrix exponential

P(t) = exp(tQ).  (9)
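Equation (9) can be checked numerically. The sketch below uses a hypothetical Q (invented, not fitted to any data) and a plain truncated Taylor series for the matrix exponential; msm computes this by more sophisticated means, but the illustration shows the key property that each row of the resulting P(t) sums to one, as a transition matrix must.

```python
import numpy as np

# Hypothetical two-state intensity matrix: rows sum to zero,
# off-diagonal entries are nonnegative transition intensities.
Q = np.array([[-0.2, 0.2],
              [0.5, -0.5]])

def matrix_exp(A, terms=30):
    """Truncated Taylor series exp(A) = sum_k A^k / k! (adequate for small ||A||)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

P1 = matrix_exp(1.0 * Q)  # transition probabilities over one time unit
```

A useful sanity check is the semigroup property P(2) = P(1)P(1), which holds because Q commutes with itself.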


The exact details of why (9) holds are omitted here; see the source mentioned in the beginning of this section. Now note that P_ij(t) only gives the probability of the process being in state j a time t after observing it in state i. This reveals nothing about when the transition took place or whether any other transitions took place in between. To illustrate this, Figure 1 shows how the underlying Markov process could look and what we observe. From the figure one could ask: does the sampling time affect the fitted model? The answer is that some observation schemes do, and some don't. For example, predetermined fixed sampling times and random sampling times are unbiased, or non-informative; these do not bias the model. Other schemes, where the sampling time somehow depends on the state itself, will introduce a bias; these are called informative. More details as well as references on this issue can be found in the msm manual. The msm package assumes non-informative observation schemes, which means that we need to assume this for the purpose of train delay modeling. The validity of this assumption is discussed in Section 6.

Figure 1 – Example of a Markov process and how it is being observed. The dashed vertical lines are the observation times. Figure source: the msm-manual [5].
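The situation in Figure 1, a continuously evolving process observed only at discrete times, can be illustrated by simulation. This is a sketch with an invented two-state Q, not part of the thesis: sojourn times are drawn from the exponential distribution with rate −q_ii, and the trajectory is then read off only at fixed observation times, which hides any jumps in between.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical intensity matrix; state 0 = "on time", state 1 = "delayed".
Q = np.array([[-0.2, 0.2],
              [0.5, -0.5]])

def simulate(Q, state, t_end):
    """One CTMC trajectory: exponential sojourns followed by jumps."""
    t, path = 0.0, [(0.0, state)]
    while True:
        t += rng.exponential(1.0 / -Q[state, state])  # sojourn ~ Exp(-q_ii)
        if t >= t_end:
            return path
        jump = Q[state].copy()
        jump[state] = 0.0
        state = rng.choice(len(jump), p=jump / jump.sum())  # q_ij / (-q_ii)
        path.append((t, state))

path = simulate(Q, state=0, t_end=50.0)

# Observing only every 5 time units: the state most recently attained
# before each observation time, with any intermediate jumps invisible.
observed = [next(s for u, s in reversed(path) if u <= t) for t in range(0, 50, 5)]
```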

Next, the choice of the intensity matrix Q is evaluated by maximum likelihood estimation. Looking at one individual x, that is, one independent Markov process, and one transition, say from i to j, we denote this likelihood contribution by L_{x,ij}. To get the full likelihood L(Q), we need to take the product of L_{x,ij} over all individuals and transitions. The contribution from one individual and one transition is simply

L_{x,ij} = P_ij(t_j − t_i)  (10)

where t_j and t_i are the times at which the observations of j and i were made. This assumes that we allow transitions to take place in between observations.
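Equation (10) turns a given Q into a likelihood for panel-observed data. The following sketch, with a hypothetical Q and made-up observation times, computes the log-likelihood of a single run by summing the log of each transition probability entry:

```python
import numpy as np

# Hypothetical intensity matrix (same toy form as above, not fitted).
Q = np.array([[-0.2, 0.2],
              [0.5, -0.5]])

def matrix_exp(A, terms=30):
    """Truncated Taylor series for exp(A)."""
    out, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def log_likelihood(observations, Q):
    """observations: (time, state) pairs; each pair of consecutive
    observations contributes log P_{s0 s1}(t1 - t0), cf. eq. (10)."""
    return sum(
        np.log(matrix_exp((t1 - t0) * Q)[s0, s1])
        for (t0, s0), (t1, s1) in zip(observations, observations[1:])
    )

# A made-up train run: on time, on time, delayed, recovered.
obs = [(0.0, 0), (1.5, 0), (2.0, 1), (4.0, 0)]
ll = log_likelihood(obs, Q)
```

Maximizing this quantity over the entries of Q (and over covariate coefficients, below) is what msm's fitting routine does.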

The transition probabilities will not only depend on the process itself, but also on chosen covariates that are time dependent. These are variables that might affect the goodness of fit of the model. As a measure of how significant the effect of each covariate is, msm provides a hazard ratio for each covariate. This measures the effect of w(t) on each transition intensity q_ij by setting

q_ij(w(t)) = q_ij^(0) exp(β_ij^T w(t)),  (11)

where exp(β_ij) is the hazard ratio. A zero estimate in the vector β_ij (hazard ratio of 1) after fitting the model would imply no effect of the corresponding covariate. Estimates in β_ij apart from zero imply that the corresponding covariate in w(t) affects the transition intensity to some degree. The likelihood L(Q) is then calculated from the new Q-matrix. Another assumption that msm must make here is that the time-dependent covariates w(t) are piece-wise constant, so that they are unchanged between the observation times.
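As a toy illustration of equation (11) with a single scalar covariate: the numbers below are invented, except that the hazard ratio 1.10 echoes the temperature effect reported in the abstract. A one-unit increase in the covariate scales the baseline intensity multiplicatively.

```python
import numpy as np

# Hypothetical baseline intensity for one transition, and a coefficient
# beta chosen so that the hazard ratio exp(beta) equals 1.10.
q0 = 0.2
beta = np.log(1.10)

def intensity(w):
    """q_ij(w) = q_ij^(0) * exp(beta * w), eq. (11) for a scalar covariate."""
    return q0 * np.exp(beta * w)
```

At the covariate's reference level (w = 0) the intensity is the baseline 0.2; at w = 1 it is 0.2 × 1.10 = 0.22.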

To predict a future distribution of states given a fitted model, the msm package uses the P-matrix in equation (9) and feeds it with the time t and the values of the covariates w(t). This is used, for example, in Figure 11 to compute the expected prevalence of each state.

2.2 Model assessment

In the model assessment, comparison of prediction capabilities is performed by computing the mean squared error (MSE) of observed and predicted values. The MSE is defined as

MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)²  (12)

where n is the number of samples, Y_i are the observed values and Ŷ_i are the estimated values.
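Equation (12) in code form, with toy numbers rather than the thesis data:

```python
# Mean squared error between observed values y and predictions y_hat, eq. (12).
def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

# Toy example: errors of 1, 0 and -2 minutes give (1 + 0 + 4) / 3.
example = mse([3, 5, 2], [2, 5, 4])
```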


3 Data preprocessing

3.1 About the data

In this thesis, three datasets were at hand. Firstly, long-distance passenger train data from the line Stockholm-Umeå, both directions, 2016/12-2017/02. This dataset contains information about the schedules and the outcomes of the journeys; the Swedish Transport Administration is its original supplier. Secondly, a similar dataset with passenger trains for the line Luleå-Kiruna, both directions, over the same time period. These two are used together in the analysis.

Thirdly, weather data, generated using the weather research and forecasting (WRF) model (Skamarock et al. [12]) at the Group of Atmospheric Science at Luleå University of Technology, was used to fetch the four weather variables studied. WRF is a state-of-the-art mesoscale numerical weather prediction system, aimed at both operational forecasting and atmospheric research needs. The variables chosen were temperature, humidity, snow depth and ”icing”, all quite self-explanatory except the last. The icing variable is the hourly accumulated combination of snow and ice precipitation, which in turn takes many other weather parameters into account. The key here is that this variable indicates how much ice is forming on the rolling stock and infrastructure. As the weather data has very high resolution, one could match a value of each weather variable to every train observation, i.e. departure or arrival. After matching the weather variables for each arrival and departure event, we have the two data sets shown in Table 1, which then include the weather variables. Figure 2 shows the distribution of the values.

Table 1 – Information about the data sets. The averages are taken over the observed events, so they may differ from a daily average.

                                     Stockholm - Umeå   Luleå - Kiruna
Time period                          2016/12-2017/02    2016/12-2017/02
Average travel time (h)              6.41               3.71
Number of entries incl. passages     64784              15660
Number of entries excl. passages     13392              6294
Average temperature (°C)             -1.2               -5.7
Average humidity (%)                 84.15              86.34
Average icing (mm)                   0.02               0.06
Average snow depth (m)               0.03               0.18


Figure 2 – Histograms of the weather variables. The ”icing” variable appears zero-inflated; for this reason, it will be transformed to a binary variable.

Figure 3 – Histogram of delays for Stockholm-Umeå and Luleå-Kiruna, winter 2016/2017. Values above 20 or below -5, which make up about 4.8% of the data, have been omitted for illustrative purposes. The delays are similarly distributed.


Figure 4 – Histogram of segment delays for Stockholm-Umeå and Luleå-Kiruna. Values above 10 or below -10, which make up about 1.7% of the data, have been omitted for illustrative purposes.

3.2 Missing data

For each station and passage, the arrival and departure time, truncated to the minute, is supposed to be reported. The percentage of missing departure and arrival times for the entire data set is 15%. However, among reported times from stations only, the percentage of missing values is only 4.6%.

Based on these numbers, missing values are handled as follows. If a model is fitted on data that only contains observations at stations, the missing values are simply imputed. With such a low percentage of missing values, it is assumed that even a biased imputation procedure will not affect the results greatly. The imputation is performed by assuming that the train keeps its previous delay; another way to put it is to impute such that the segment delay is zero, which is the most likely outcome according to Figure 4. If a model is fitted on data that also contains observations at passages, one might consider other methods of handling missing data. In this thesis work, passage data was not used; therefore, the imputation method described above is the only one used.
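The imputation rule just described (carry the previous delay forward, so that the imputed segment delay is zero) can be sketched as follows. The helper and its argument format are illustrative, not the actual preprocessing code; None marks a missing observation.

```python
# Carry-forward imputation of delays along one train run: a missing value
# inherits the previous observed delay, i.e. the imputed segment delay is 0.
def impute(delays, first=0):
    out, last = [], first
    for d in delays:
        last = last if d is None else d
        out.append(last)
    return out

filled = impute([2, None, None, 5, None])  # [2, 2, 2, 5, 5]
```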


4 Method

In this section, the fundamental assumptions of the modeling approaches are first explained and motivated. Later on, the specifics of each modeling approach are described.

4.1 Model description

In reality, the railway is a complex system of train routes that affect each other. If a train is behind its schedule, conflicts will often occur with other trains sharing the same infrastructure, for example on a single track line, or via passenger connections or crew connections. In this modeling, conflicts with other trains are neglected and each train run is considered an independent stochastic process. It starts at the first station, is then observed along the way at stations or passages, and ends at the final station. Each such run is one sample from the underlying stochastic process. This assumption of independent train runs was also made by Kecman et al. [7]; in their case, though, the studied railway was a double track line, which perhaps better justifies the assumption.

The random variable studied is delay, that is, either departure or arrival delay measured as the deviation from the time schedule in minutes. Studying this variable, one can motivate the use of Markov chains. By referring to the Markov property, see equation (5), we assume that future delays only depend on the current delay and not on the past delays of the train. More concretely, a certain train's departure time from a station will only depend on its arrival at that station, and not on what happened before that. A different random variable one could study, for which Markov chains are not as suitable, is the segment delay; see Section 1.3 for the definition.

We then describe the stochastic process of delays as in equation (4),

{X(t), 0 < t  T } (13)

where each random variable X(t) represents the delay at time t. The transitions between the states occur continuously; that is, a train can change state, and thus get less or more delayed, at any part of the journey. The process is not, however, observed continuously, but rather discretely at stations or passages. The process we observe can be described as

{X_i, i = 1, …, N}  (14)

where i is the index of the observation event. The event is either a departure or an arrival. An example of a train run is illustrated in Figure 5, where one can see, for example, the event numbering. The passage between the second and third station in the figure is simply a measuring point between two stations. It can be seen as an arrival event and a departure event occurring simultaneously.

Figure 5 – Illustration of a general example of a train run.

Another important modeling choice to consider is whether to assume time-homogeneity or not. A time-homogeneous stochastic process would in this case mean that the evolution of delays does not depend on how long the train has traveled or which station it is at. Kecman et al. [7] did not assume time-homogeneity and modeled each segment with a different transition matrix. That is, each transition of the chain is modeled with a separate transition matrix; the resulting model is then a list of Markov transition matrices, in their case four. By modeling with a list of Markov chains, one can be sure that the difference between running and dwelling processes, as well as station specific variations and time-dependent covariates, are accounted for. The authors also motivated why assuming time-homogeneity is limiting in modeling train delay evolution.


Sahin [15] did assume time-homogeneity but distinguished between two kinds of processes, defining a ”recovery process” and a ”deterioration process”, thereby ending up with two different transition matrices. It was suggested that developing non-homogeneous Markov chains could be a model improvement. So, it might be too simplistic to assume time-homogeneity. Examining the data, no obvious time-dependencies of the delays could be found. However, Figure 6 shows a box plot of the segment delay for each segment, that is, either a running process or a dwelling process. Here we see a quite clear difference between the two processes: the running processes are colored pink, and the dwelling processes purple. The difference we see in this plot presumably stems from the fact that a passenger train can arrive before schedule but will never depart before schedule. This motivates why one ought to discriminate between these two types of transitions. For this reason, and because of what was suggested in [15] and [7], time-homogeneity will not be assumed here.

A few methods of dealing with non-homogeneity will be considered. Firstly, one could use separate transition matrices as in [7].

For the line Stockholm - Umeå, this would result in 23 matrices; as one can see in Figure 6, the line consists of 23 segments. Note that those are then only valid for one direction, from Stockholm to Umeå. One could of course down-sample further to obtain any number of separate transition matrices fewer than 23. This method will not be used here, partly because of a lack of available data, partly because of a lack of time. As another way of dealing with non-homogeneity, it was suggested in the msm manual [5] that one could model the ”age” of the process as a covariate. The transition intensities would then vary with the travel time of the train. Moreover, in the two modeling approaches presented below, some time-independent, station-specific covariates are assumed to affect the transition rates. Those are proposed as an alternative to modeling with separate transition matrices and are thought to serve the same purpose. The time-independent, station-specific covariates used in each model are presented in Sections 4.2 and 4.3 below. As a final way to refine the models, a change in the time schedule is proposed in Section 4.4.
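In the msm framework, covariates act on the transition intensities through a proportional-hazards form, q(z) = q0 · exp(β'z), which is also what makes a Hazard ratio interpretable as exp(β). A minimal sketch of this relationship (the function name and all numerical values below are hypothetical, not taken from the thesis):

```python
import math

def adjusted_intensity(baseline_q, betas, covariates):
    """Covariate-adjusted transition intensity in the proportional-hazards
    form q(z) = q0 * exp(sum_i beta_i * z_i)."""
    return baseline_q * math.exp(sum(b * z for b, z in zip(betas, covariates)))

# Hypothetical numbers: baseline intensity 0.2 transitions/hour and one
# covariate with hazard ratio 1.10, i.e. beta = log(1.10).
q = adjusted_intensity(0.2, [math.log(1.10)], [1.0])  # one unit increase
```

A one-unit increase in a standardized covariate thus multiplies the intensity by its hazard ratio, here 0.2 · 1.10 = 0.22.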

Figure 6 – Boxplot of segment delays. Each segment number represents either a running process (pink) or a dwelling process (purple) between Stockholm and Umeå.

4.2 Approach 1: Modelling both running and dwelling processes, excluding passages.

This is the most intuitive way to set up the transitions. It was used in Kecman et al. [7]. The state of the train is observed at both the arrival at and the departure from a station, see Figure 5 for the general case. What is new here is that passages are excluded. These data points are believed only to bring noise, as the passages come frequently but the registered times are not very accurate. For example, there can be two minutes of driving time between two passages, and each registers a time that is truncated to the nearest minute. For this reason, the passages are excluded, which leads to a data reduction of about 80 percent (for Stockholm-Umeå, winter 2016/17).

The updated model from the general model can be seen in Figure 7. The time-independent, station-specific covariates that will be examined in this approach are

• ”Process type” - A binary variable that can be either "running" or "dwelling".

• ”Station buffer time” - The planned ”dwelling time” at a station.


Figure 7 – Illustration of an example of a train run with modeling Approach 1.

4.3 Approach 2: Combining running and dwelling processes

The idea here is to observe the train only at the arrivals. This creates even more down-sampling, which could be an advantage since many of the stops are just one or two minutes long. It also removes the need for the ”Process type” variable that was needed in the previous model. However, combining running and dwelling processes, two rather different processes in nature, might be too simplistic. This model is less complex than the previous one; there is therefore less risk of over-fitting, but it also might not be complex enough to fit the data well. The updated model from the general model can be seen in Figure 8. Here the passages can be chosen to be included or excluded, but they will be excluded throughout the analysis. Only one time-independent, station-specific covariate will be examined in this model:

• ”Station buffer time” - The planned ”dwelling time” at a station.

Figure 8 – Illustration of an example of a train run with modeling Approach 2.

4.4 Altering the time schedule

At the end of Section 4.1 we saw that some segment runs were systematically shorter than planned. In this section, we investigate this further, which will lead to a proposed alteration of the time schedule. The idea is that modeling might work better if the arrival/departure time in the time schedule is also the expected time of arrival/departure.

Most often this is the case, but not always. When a train almost always arrives a few minutes early at a certain station, is that a deliberate choice to have some margin on that segment, or is it an ill-constructed time schedule? Either way, as we are not using separate transition matrices but the same matrix for each transition, it might be beneficial if each deviation from the schedule is an anomaly that can be explained by the covariates we include. The alteration of the time schedule explained in this section is proposed as an alternative to using separate transition matrices. The effects of the new schedule are assessed by comparing its prediction capabilities to those of models using the standard schedule.

In Figure 9, the delays for each station on the runs Luleå-Kiruna are illustrated in box plots similar to Figure 6. Note that here the actual delays are plotted, as opposed to the segment delays. Departures are plotted in pink and arrivals in purple. One event that stands out in this plot is the arrival to Murjek station, numbered 8 in the plot. The arrival is often ahead of schedule. In Figure 10, the delays according to a new schedule are plotted.

The new schedule is constructed by shifting the old schedule so that the median delay at every station is zero.

Mostly, no change in the schedule is necessary. But in a few cases, like the arrival to Murjek, the schedule is altered by as much as three minutes. One can clearly see in Figure 10 that the median arrival/departure delay is then zero.
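The construction of the altered schedule can be sketched as follows; the data structures and the event name are hypothetical illustrations, not taken from the thesis code:

```python
import statistics

def altered_schedule(scheduled_times, observed_delays):
    """Shift each scheduled event time by the median observed delay at that
    event, so that the median delay under the new schedule becomes zero."""
    return {event: scheduled_times[event] + statistics.median(observed_delays[event])
            for event in scheduled_times}

# Hypothetical example: an arrival that is systematically about 3 minutes early.
schedule = {"arrival_Murjek": 120}  # scheduled time in minutes
delays = {"arrival_Murjek": [-3, -4, -2, -3, 0, -3]}
new_schedule = altered_schedule(schedule, delays)  # moved 3 minutes earlier
```

Under the new schedule, the median delay at the event is zero by construction, which is exactly the property seen in Figure 10.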

Figure 9 – Box plot of delays for each event on the line Luleå-Kiruna. The delays are defined according to the standard time schedule.

Figure 10 – Box plot of delays for each event on the line Luleå-Kiruna. The delays are defined according to the altered time schedule.

4.5 Defining state space

In order to model with the Markov chain approach, the random variable X in equation (13) cannot be continuous but has to be categorized into discretely defined states, and it is up to the modeler to define that state space. The trade-off to keep in mind is that defining too many states runs the risk of over-fitting and very long calculation times, and requires more data since there are more possible transitions. Too few states, on the other hand, might not capture the difference between a small and a large delay, which might have widely different causes.

Kecman et al. [7] defined three states, namely

1 : d ≤ 0
2 : 0 < d ≤ 5
3 : d > 5

(18)

where d is the delay in minutes. The border value of ”5” was motivated by the definition of when a train is considered to be delayed, which is at 5 minutes. Sahin [15] used six states:

1 : d < 0
2 : 0 ≤ d ≤ 2
3 : 3 ≤ d ≤ 5
4 : 6 ≤ d ≤ 8
5 : 9 ≤ d ≤ 11
6 : d > 11

The choice of these states was said to be ”based on the data”, which probably means that it was ensured that enough data existed to estimate the transitions between the states. From the above, it seems the choices of states are quite arbitrary. To get a sense of how to choose the states, Figure 3 is referred to, where the distribution of the delays, both arrival and departure delay, excluding passages, is plotted. The figure shows that a large part of the observed delays are close to zero. The choice of the state space in this thesis is as follows. To start with, delays less than zero will, as opposed to Sahin [15], be grouped in the same state as zero. If these were given different states, then a train arriving ahead of schedule would always transition to a different state while waiting at the station to depart on time. Moreover, there are certain delay thresholds defined by railway companies that could be used to define a state space. For example, the threshold of punctuality is widely considered to be 5 minutes. Furthermore, the STA reports a cause of the delay if it exceeds 3 minutes.

But using these thresholds would only make sense if one were to investigate punctuality or somehow include the cause of the delay in the model. Therefore, these thresholds will be disregarded and instead the state space will be chosen from predetermined quantiles. The choice of which quantiles define a state is quite arbitrary, but it takes into account that choosing a too low state boundary will result in many transitions that are in fact artifacts of the rough rounding of the observation times, while choosing a too high state boundary might result in very few observations in the most delayed state. Therefore, the following states are defined

1 : 0 – 85% quantile
2 : 85 – 95% quantile
3 : 95 – 100% quantile

For the dataset Luleå-Kiruna 2016-2017 combined with Stockholm-Umeå 2016-2017, this translates to

Table 2 – The state space definition, where d is the arrival or departure delay in minutes. The average prevalence shows what proportion of all the observations fall into that state.

State     Definition     Average prevalence
State 1   d < 3          85%
State 2   3 ≤ d < 11     10%
State 3   d ≥ 11         5%
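The quantile-based discretization can be sketched as below. This is a simplified illustration with toy data (the thesis analysis was done in R, and the boundaries in Table 2 come from the real delay data; the nearest-rank quantile used here is an assumption):

```python
def quantile_states(delays, quantiles=(0.85, 0.95)):
    """Assign each delay to state 1, 2 or 3, using the empirical 85% and 95%
    quantiles of the delay distribution as state boundaries."""
    s = sorted(delays)
    # Nearest-rank empirical quantile boundaries (a simplification).
    bounds = [s[int(q * (len(s) - 1))] for q in quantiles]
    # A delay above k boundaries lands in state k + 1.
    return [1 + sum(d > b for b in bounds) for d in delays]

# Toy data: delays of 0..99 minutes, one observation each.
states = quantile_states(list(range(100)))
# By construction, 85% fall in state 1, 10% in state 2 and 5% in state 3.
```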

4.6 The weather variables studied

Apart from the three variables explained above, four weather-related variables will be considered in the modeling.

To investigate prolonged periods of certain weather conditions, a seven-day simple moving average (SMA) is created from each of them. The variables are listed in Table 3. For each train arrival/departure observation, the value of every weather variable is matched by time and location. Note that the moving averages are taken at a fixed location, so they do not describe the weather conditions that the train itself experiences, but rather the conditions that the non-rolling stock infrastructure experiences and could be affected by. For computational reasons, all the weather variables except ”Icing” are centered at zero and scaled to have standard deviation one. ”Icing” is left out because it has a large proportion of zeros and was therefore chosen to be a binary variable with values ”no icing” or ”icing”. The seven-day SMA of ”Icing”, however, uses the continuous values.


Table 3 – Weather variables

Variable name      Type
Snow depth         Continuous
Temperature        Continuous
Humidity           Continuous
Icing              Categorical
SMA Snow depth     Continuous
SMA Temperature    Continuous
SMA Humidity       Continuous
SMA Icing          Continuous
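The seven-day moving average and the standardization can be sketched as follows. This sketch assumes a regularly spaced daily series at one location; the real data are matched by time and location, which is not shown here:

```python
import statistics

def moving_average(series, window=7):
    """Trailing simple moving average; early points use the shorter
    history that is available."""
    return [sum(series[max(0, i - window + 1): i + 1]) / min(i + 1, window)
            for i in range(len(series))]

def standardize(series):
    """Center at zero and scale to unit (population) standard deviation."""
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)
    return [(x - mean) / sd for x in series]

# Toy daily temperature series: six calm days, then a sudden cold drop.
temps = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -14.0]
sma = moving_average(temps)  # last value is the week-long average, -2.0
z = standardize(temps)       # centered at zero, standard deviation one
```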

4.7 Model selection procedure

Each model and each schedule will be examined and optimized separately by performing a variable selection. The variables chosen for each method are then examined further. A ”best subset selection”, where every possible variable subset is tested, was considered, but was estimated to take too much time. Therefore, the selection is done using ”forward stepwise selection” as described in James et al. [6, pages 207-208]. Note that this is performed for each approach (1 and 2) and for each schedule (standard and altered), that is, for four different cases.

1. Let M0 denote the null model, which contains no predictors.

2. For k = 0, ..., p − 1:

(a) Consider all p − k models that augment the predictors in Mk with one additional predictor.

(b) Choose the best among these p − k models, and call it Mk+1. Here, best is defined as having the smallest RSS of the observed and expected prevalence. The prevalence at a certain time is the percentage of the individuals being in a certain state at that time.

3. Select a single best model from among M0, ..., Mp using AIC. The likelihood is acquired from the fitted model as described in Section 2.

In step 3 above, the nested model with the lowest AIC is generally the one chosen. However, models with an AIC score close to the lowest can also be considered to have strong support. Therefore, when motivated, the model with the lowest AIC score is not necessarily chosen. The ”rule of thumb” that the choice is based on is outlined in Burnham and Anderson [3, pages 70-71]. With a best model chosen (given an approach and a schedule), we can produce plots of the observed and expected prevalence. The prevalence is, as described in the algorithm above, the percentage of the individuals being in a certain state at a certain time, where ”individuals” refers to the different train journeys. The prevalence plots are Figures 11-14 in the results section. These plots give a rough indication of the goodness of fit. Note that no prediction is made here; only the expected prevalence from the fit is computed. Prediction is performed in the next subsection about model comparison. Apart from the prevalence plots, we also produce tables of Hazard ratios, which provide information about the influence of each variable in the fitted model. Hazard ratios are interpreted as follows: a Hazard ratio larger than 1 means that a large value of the variable increases the probability of a transition and a small value reduces it; a Hazard ratio smaller than 1 does the opposite. If the confidence interval of the Hazard ratio does not contain 1, we say it is significant.
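The greedy search in steps 1-2 can be sketched as follows. The model fitting itself (done with the msm package in R in this thesis) is abstracted into a score function, and the toy score below, with hypothetical variable names and gains, stands in for the RSS of observed and expected prevalence:

```python
def forward_stepwise(candidates, score):
    """Forward stepwise selection: starting from the null model, repeatedly
    add the candidate that minimizes the score (lower is better) and record
    the sequence of nested models M0, ..., Mp."""
    selected, remaining = [], list(candidates)
    nested = [[]]  # M0, the null model
    while remaining:
        best = min(remaining, key=lambda v: score(selected + [v]))
        selected.append(best)
        remaining.remove(best)
        nested.append(list(selected))
    return nested

# Toy score: each variable lowers the score by a fixed, hypothetical amount.
gains = {"Travel time": 5.0, "Temperature": 2.0, "Snow depth": 4.0}
score = lambda subset: 10.0 - sum(gains[v] for v in subset)
nested = forward_stepwise(["Temperature", "Travel time", "Snow depth"], score)
# Greedy order: Travel time first, then Snow depth, then Temperature.
```

Among the returned nested models, the final choice would then be made by AIC from the fitted likelihoods, as in step 3.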

4.8 Model assessment and comparison

The ”Model selection procedure” above described the selection of the best model given a certain approach (1 or 2) and a certain schedule (altered or standard). Among these four cases, we would like to assess which performs better. To compare the approaches and the schedules, which involve different response variables and are based on different data, we need to compare their prediction capabilities. Note that the purpose is not to assess how well each model predicts, but rather to compare the models and see which predicts better. The model that predicts best will then be given the most attention when analysing the Hazard ratios. After all, referring to the aim of this thesis work, the influence of the variables is the main focus here; good prediction capability is a bonus, but not expected.

To compare the prediction abilities of the best models, a five-fold cross-validation is performed (for each approach and for each schedule). In each fold, the procedure described in Section 4.7 above is performed on the training data. Since the number of folds is five, the training data consists of 80 percent of the data. Then, with the model chosen in the selection procedure, the prevalence of the testing data is predicted. The performance of the prediction is assessed by measuring the MSE of the observed and predicted prevalence. The MSE is defined in (12). We are not interested in the absolute value of the MSE, but rather in how the MSEs of the four cases (two approaches and two schedules) compare to each other. Therefore, we scale them by the smallest MSE. For example, the four MSE values (5, 100, 40, 12) would be scaled to (1, 20, 8, 2.4). The case with the lowest MSE index is then considered the most reliable.
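The index scaling is then simply the following (using the hypothetical MSE values from the text):

```python
def mse_index(mses):
    """Scale each MSE by the smallest one, so the best case gets index 1."""
    best = min(mses)
    return [m / best for m in mses]

indices = mse_index([5, 100, 40, 12])  # -> [1.0, 20.0, 8.0, 2.4]
```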

The cross-validation also gives information about the stability of the variable selection. Consider Mp, the model containing all the predictors. If the order in which the variables were chosen into Mp varies greatly between the five folds, the variable selection can be said to be unstable; if the order tends to be the same, the selection is stable.
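The stability summaries reported in Tables 13-16 (mean and variance of each variable's selection position across folds) can be computed as sketched below; the fold orders in the toy example are hypothetical:

```python
import statistics

def selection_stability(fold_orders):
    """For each variable, the mean and sample variance of its selection
    position (1-indexed) across the cross-validation folds."""
    return {v: (statistics.fmean([order.index(v) + 1 for order in fold_orders]),
                statistics.variance([order.index(v) + 1 for order in fold_orders]))
            for v in fold_orders[0]}

# Toy example with three folds and three variables.
folds = [["Travel time", "Icing", "Humidity"],
         ["Travel time", "Humidity", "Icing"],
         ["Travel time", "Icing", "Humidity"]]
stability = selection_stability(folds)
# "Travel time" is always chosen first: mean position 1, variance 0.
```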


5 Results

Here is a summary of what variables will be analyzed and the settings that will be used.

The data are from the line Stockholm-Umeå, December 2016 - February 2017, both directions, and Luleå-Kiruna, same time period, both directions. The variables in Table 4 are considered for variable selection. The three variables ”Travel time”, ”Station buffer time” and ”Process type” (for Approach 1 only) were theorized to explain a large part of the variation and to be necessary in the model. The rest are weather-related variables whose effects are to be examined.

The state space will consist of the three states described in Table 2. For each model and for each schedule, variable selection will be done by forward stepwise selection.

Table 4 – The variables studied

Class Variable name Type

Weather Snow depth Continuous

Weather Temperature Continuous

Weather Humidity Continuous

Weather Icing Categorical, Binary

Weather SMA Snow depth Continuous

Weather SMA Temperature Continuous

Weather SMA Humidity Continuous

Weather SMA Icing Continuous

Station-specific Station buffer time Piece-wise constant (minutes)

Station-specific Process type Categorical, Binary

Age Travel time Continuous

5.1 Approach 1

5.1.1 Standard schedule

The result of the forward stepwise selection for Approach 1 and the standard schedule is presented here. In Table 5, the order of the variables as well as the AIC score for each nested model is listed.

We see that Model 5 has the lowest AIC score and is thereby the most supported model. We identify no other models as having strong support. Therefore, Model 5 is chosen for further investigation. The covariates in Model 5 are then used to fit a continuous time Markov chain to the data. To evaluate the fit, the prevalence is used and the goodness of fit is visually assessed. In Figure 11, the observed and expected prevalence in all three states are plotted for Model 5. In Table 6, the Hazard ratios are listed for each variable and for each transition in the selected model.

Table 5 – The eleven nested models yielded in the forward stepwise selection for Approach 1 and using the standard schedule. They are listed with an increasing number of variables together with the corresponding AIC-scores.

Model Nr Variables included AIC Score

1 Travel time 10233.8

2 Travel time, Station buffer time 10237.9

3 Travel time, Station buffer time, Process type 9672.6

4 Travel time, Station buffer time, Process type, SMA Humidity 9660.5
5 Travel time, Station buffer time, Process type, ..., SMA Snow depth 9654.5
6 Travel time, Station buffer time, Process type, ..., Humidity 9660.6
7 Travel time, Station buffer time, Process type, ..., Icing 9664.3
8 Travel time, Station buffer time, Process type, ..., SMA Icing 9667.2
9 Travel time, Station buffer time, Process type, ..., Snow depth 9668.5
10 Travel time, Station buffer time, Process type, ..., Temperature 9671.2
11 Travel time, Station buffer time, Process type, ..., SMA Temperature 9675.3



Figure 11 – The prevalence of each state for Model 5 in Table 5. Each plot shows observed prevalence (blue) and expected prevalence (red).

Table 6 – Hazard ratios of the fitted CTMC using Model 5 in Table 5.

Covariate Transition Hazard ratio Lower CI Upper CI

Travel time

State 1 - State 2 1.16 1.12 1.21

State 2 - State 1 1.01 0.96 1.06

State 2 - State 3 0.81 0.73 0.89

State 3 - State 2 0.88 0.78 1.01

Station buffer time

State 1 - State 2 0.94 0.92 0.97

State 2 - State 1 1.01 0.98 1.05

State 2 - State 3 0.98 0.91 1.05

State 3 - State 2 0.98 0.94 1.03

Process type

State 1 - State 2 0.17 0.15 0.20

State 2 - State 1 0.56 0.44 0.70

State 2 - State 3 0.19 0.13 0.28

State 3 - State 2 0.39 0.19 0.78

SMA Humidity

State 1 - State 2 0.88 0.81 0.95

State 2 - State 1 1.05 0.96 1.16

State 2 - State 3 1.04 0.86 1.25

State 3 - State 2 1.14 0.86 1.52

SMA Snow depth

State 1 - State 2 1.05 0.98 1.14

State 2 - State 1 0.93 0.85 1.01

State 2 - State 3 1.17 0.99 1.38

State 3 - State 2 1.30 1.02 1.66

5.1.2 Altered schedule

The same procedure as in the previous section is performed here. Model 4 in Table 7 is chosen for further investigation. The prevalence plots in Figure 12 and the Hazard ratios in Table 8 are based on the fitted CTMC with the covariates listed in Model 4. In the coming sections, the same procedure is used, and the choice of the ”best” model is based on the nested model with the lowest AIC score.


Table 7 – The eleven nested models yielded in the forward stepwise selection for Approach 1 and using the altered schedule. They are listed with an increasing number of variables together with the corresponding AIC-scores

Model Nr Variables included AIC Score

1 Process type 9257.2

2 Process type, Station buffer time 9260.1

3 Process type, Station buffer time, Travel time 9253.0

4 Process type, Station buffer time, Travel time, Temperature 9244.2
5 Process type, Station buffer time, Travel time, ..., Humidity 9247.8
6 Process type, Station buffer time, Travel time, ..., SMA Humidity 9254.8
7 Process type, Station buffer time, Travel time, ..., Snow depth 9249.7
8 Process type, Station buffer time, Travel time, ..., SMA Snow depth 9252.5
9 Process type, Station buffer time, Travel time, ..., SMA Icing 9256.1
10 Process type, Station buffer time, Travel time, ..., Icing 9257.8
11 Process type, Station buffer time, Travel time, ..., SMA Temperature 9262.6


Figure 12 – The prevalence of each state for Model 4 in Table 7. Each plot shows observed prevalence (blue) and expected prevalence (red).

(24)

Table 8 – Hazard ratios of the fitted CTMC using Model 4 in Table 7

Covariate Transition Hazard ratio Lower CI Upper CI

Process type

State 1 - State 2 0.25 0.21 0.29

State 2 - State 1 0.15 0.12 0.18

State 2 - State 3 0.28 0.18 0.41

State 3 - State 2 0.09 0.06 0.16

Station buffer time

State 1 - State 2 1.02 0.99 1.05

State 2 - State 1 1.03 0.99 1.06

State 2 - State 3 0.98 0.91 1.05

State 3 - State 2 0.96 0.92 1.01

Travel time

State 1 - State 2 0.96 0.92 1.01

State 2 - State 1 0.93 0.88 0.97

State 2 - State 3 0.93 0.85 1.02

State 3 - State 2 0.89 0.79 1.01

Temperature

State 1 - State 2 0.91 0.84 0.98

State 2 - State 1 1.10 1.01 1.19

State 2 - State 3 0.98 0.84 1.15

State 3 - State 2 0.93 0.74 1.18

5.2 Approach 2

5.2.1 Standard schedule

The same procedure as in the previous section is used here. The fitted models use Approach 2, described in Section 4.3, and the standard time schedule.

Table 9 – The ten nested models yielded in the forward stepwise selection for Approach 2 using the standard schedule. They are listed with an increasing number of variables together with the corresponding AIC-scores

Model Nr Variables included AIC Score

1 Travel time 6577.1

2 Travel time, SMA Humidity 6568.0

3 Travel time, SMA Humidity, Station buffer time 6567.7

4 Travel time, SMA Humidity, Station buffer time, Humidity 6574.1
5 Travel time, SMA Humidity, Station buffer time, ..., SMA Icing 6567.5
6 Travel time, SMA Humidity, Station buffer time, ..., SMA Snow depth 6568.7
7 Travel time, SMA Humidity, Station buffer time, ..., SMA Temperature 6563.5
8 Travel time, SMA Humidity, Station buffer time, ..., Temperature 6569.0
9 Travel time, SMA Humidity, Station buffer time, ..., Snow depth 6575.4
10 Travel time, SMA Humidity, Station buffer time, ..., Icing 6581.1



Figure 13 – The prevalence of each state for Model 7 in Table 9. Each plot shows observed prevalence (blue) and expected prevalence (red).

Table 10 – Hazard ratios of the fitted CTMC using Model 7 in Table 9

Covariate Transition Hazard ratio Lower CI Upper CI

Travel time

State 1 - State 2 1.04 0.99 1.10

State 2 - State 1 0.94 0.89 1.00

State 2 - State 3 0.87 0.78 0.97

State 3 - State 2 0.85 0.73 0.98

SMA Humidity

State 1 - State 2 0.94 0.84 1.06

State 2 - State 1 1.19 1.03 1.36

State 2 - State 3 1.21 0.95 1.55

State 3 - State 2 1.35 0.95 1.92

Station buffer time

State 1 - State 2 1.02 0.98 1.06

State 2 - State 1 1.03 0.99 1.08

State 2 - State 3 0.96 0.93 0.99

State 3 - State 2 1.00 0.95 1.06

Humidity

State 1 - State 2 1.05 0.95 1.16

State 2 - State 1 1.03 0.93 1.15

State 2 - State 3 1.04 0.85 1.27

State 3 - State 2 1.01 0.69 1.48

SMA Icing

State 1 - State 2 1.04 0.95 1.15

State 2 - State 1 0.90 0.80 1.00

State 2 - State 3 1.05 0.88 1.25

State 3 - State 2 1.04 0.82 1.32

SMA Snow depth

State 1 - State 2 1.10 0.99 1.23

State 2 - State 1 0.97 0.85 1.10

State 2 - State 3 1.03 0.83 1.29

State 3 - State 2 1.25 0.91 1.70

SMA Temperature

State 1 - State 2 1.00 0.89 1.12

State 2 - State 1 0.99 0.87 1.12

State 2 - State 3 0.80 0.64 1.01

State 3 - State 2 0.52 0.35 0.77

5.2.2 Altered schedule

The same procedure as in the previous section is used here. The fitted models use Approach 2, described in Section 4.3, and the altered time schedule.


Table 11 – The ten nested models yielded in the forward stepwise selection for Approach 2 using the altered schedule. They are listed with an increasing number of variables together with the corresponding AIC-scores

Model Nr Variables included AIC Score

1 Snow depth 6770.7

2 Snow depth, Station buffer time 6769.1

3 Snow depth, Station buffer time, Icing 6767.9

4 Snow depth, Station buffer time, Icing, SMA Humidity 6773.2
5 Snow depth, Station buffer time, Icing, ..., Temperature 6780.2
6 Snow depth, Station buffer time, Icing, ..., Humidity 6787.6
7 Snow depth, Station buffer time, Icing, ..., SMA Temperature 6789.9
8 Snow depth, Station buffer time, Icing, ..., SMA Snow depth 6792.5
9 Snow depth, Station buffer time, Icing, ..., Travel time 6795.1
10 Snow depth, Station buffer time, Icing, ..., SMA Icing 6798.3


Figure 14 – The prevalence of each state for Model 3 in Table 11. Each plot shows observed prevalence (blue) and expected prevalence (red).

Table 12 – Hazard ratios of the fitted CTMC using Model 3 in Table 11

Covariate Transition Hazard ratio Lower CI Upper CI

Snow depth

State 1 - State 2 1.05 0.96 1.15

State 2 - State 1 0.91 0.81 1.01

State 2 - State 3 1.20 1.00 1.43

State 3 - State 2 1.40 1.07 1.84

Station buffer time

State 1 - State 2 1.04 0.99 1.08

State 2 - State 1 1.04 0.99 1.09

State 2 - State 3 0.96 0.93 0.99

State 3 - State 2 0.99 0.94 1.05

Icing

State 1 - State 2 1.19 0.98 1.45

State 2 - State 1 0.90 0.73 1.11

State 2 - State 3 1.10 0.75 1.61

State 3 - State 2 1.47 0.84 2.58

5.3 Comparison of the models

To compare the models, the procedure described in Section 4.8 is used. Here follows a short summary: a five-fold cross-validation is used on the procedure above. For each fold in the cross-validation, variable selection is performed, a model is fitted to the training data, and the model is then used to predict the prevalence of the testing data. The performance of the prediction is assessed by measuring the MSE of the observed and predicted prevalence.

It is then scaled so that the lowest value becomes 1; we call the result an index value. In this case, Approach 1 with the altered schedule performs best in prediction, and the other index values thus tell us how much larger the MSE is in each case. An example of what a prediction plot looks like is given in Figure 15.

The stability of the forward stepwise variable selection is examined by looking at how much the order of the selected variables changes from fold to fold. What follows are the results of the prediction error index and the variable selection stability for each approach and each schedule.


Figure 15 – An example of what a prediction plot looks like. This one is from Approach 1, altered schedule, fold 3 in the cross-validation. It is clear from the observed prevalence (blue) that it is based on less data (20 percent) than that in Figures 11-14.

Approach 1 with Standard schedule

• Prediction error index value: 2.02

• Stability of the covariates:

Table 13 – Summary of the five different variable selections in the five-fold cross-validation for Approach 1 and the standard schedule. The table shows the order in which the variables were selected.

Variable Selection position mean Selection position variance

Travel time 1.0 0.0

Station buffer time 2.0 0.0

Process type 3.0 0.0

Temperature 8.2 4.7

Snow depth 9.0 2.0

Humidity 8.6 6.8

Icing 5.6 1.8

SMA Temperature 10.0 0.5

SMA Snow depth 7.6 2.8

SMA Humidity 4.0 0.0

SMA Icing 7.0 1.0


Approach 1 with Altered schedule

• Prediction error index value: 1.00 (reference value)

• Stability of the covariates:

Table 14 – Summary of the five different variable selections in the five-fold cross-validation for Approach 1 and the altered schedule. The table shows the order in which the variables were selected.

Variable Selection position mean Selection position variance

Travel time 2.6 0.8

Station buffer time 8.6 14.3

Process type 1.0 0.0

Temperature 8.2 8.7

Snow depth 5.6 3.3

Humidity 5.8 3.7

Icing 8.6 5.3

SMA Temperature 9.8 1.7

SMA Snow depth 5.8 5.2

SMA Humidity 4.4 1.8

SMA Icing 5.6 2.3

Approach 2 with Standard schedule

• Prediction error index value: 2.33

• Stability of the covariates:

Table 15 – Summary of the five different variable selections in the five-fold cross-validation for Approach 2 and the standard schedule. The table shows the order in which the variables were selected.

Variable Selection position mean Selection position variance

Travel time 1.0 0.0

Station buffer time 4.8 8.7

Temperature 8.0 1.0

Snow depth 8.6 0.3

Humidity 3.4 0.3

Icing 9.6 0.8

SMA Temperature 6.0 0.5

SMA Snow depth 6.0 2.5

SMA Humidity 2.0 0.0

SMA Icing 5.6 0.8

Approach 2 with Altered schedule

• Prediction error index value: 1.10

• Stability of the covariates:


Table 16 – Summary of the five different variable selections in the five-fold cross-validation for Approach 2 and the altered schedule. The table shows the order in which the variables were selected.

Variable Selection position mean Selection position variance

Travel time 6.2 6.7

Station buffer time 3.2 14.7

Temperature 5.2 1.7

Snow depth 1.4 0.3

Humidity 3.8 0.7

Icing 5.8 6.2

SMA Temperature 6.8 5.7

SMA Snow depth 8.6 2.3

SMA Humidity 5.2 2.2

SMA Icing 8.8 2.7

Comparison summary

Table 17 – Summary of the model comparison, including the stability of the variable selection. Computing the sum of the variances is hardly a formal way to assess the stability, but serves here as a guideline for classifying the stability as ”High” or ”Low”.

Case Prediction error index Sum of variances Relative selection stability

Approach 1 with Standard schedule 2.02 19.6 High

Approach 1 with Altered schedule 1.00 47.1 Low

Approach 2 with Standard schedule 2.33 14.9 High

Approach 2 with Altered schedule 1.10 43.2 Low

References
