Academic year: 2022

Master thesis, 30 credits
Master of Science in Industrial Engineering and Management

A study of the effects of winter climate and atmospheric icing on high-speed passenger trains

Markus Granlöf


A study of the effects of winter climate and atmospheric icing on high-speed passenger trains

Department of Mathematics and Mathematical Statistics
Umeå University
901 87 Umeå, Sweden

Supervisors:
Jun Yu, Umeå University
Jianfeng Wang, Umeå University
Anders Nilsson, Swedish Transport Administration

Examiner:
Leif Nilsson, Umeå University


Abstract

Harsh winter climate causes various problems for both the public and the private sector in Sweden, especially in the northern part, and the railway industry is no exception. This master's thesis investigates the effects of winter climate and a phenomenon called atmospheric icing on train performance in the Botnia-Atlantica region. The investigation was done on data for the short period January–February 2017, with simulated weather data from the Weather Research and Forecasting model, compared with the period October–December 2016. The investigation included only high-speed trains.

The trains have been analysed based on two performance measures: the cumulative delay, which is the increment in delay over a section, and the current delay, which is the delay compared to the schedule. Cumulative delays are investigated with survival analysis and the current delay with a multi-state Markov model.

The results show that the weather can affect train performance: the survival analysis detected a connection between the weather and cumulative delays. The Markov model also showed a connection between the weather and delayed trains, including that the presence of atmospheric icing had a negative effect on remaining in a non-delayed state.

Keywords: survival analysis, continuous Markov model, cumulative delay, high-speed train, weather, atmospheric icing


Sammanfattning

Harsh winter climate creates various problems for both the general public and the private sector in Sweden, especially in the northern part, and the railway industry is no exception. This master's thesis investigates the effects that the winter climate and a phenomenon called atmospheric icing have on train performance in the Botnia-Atlantica region. The investigation was done on data for the short period January–February 2017, with simulated weather data from a weather research and forecasting model, compared with the period October–December 2016. The investigation included only high-speed trains.

The trains have been analysed based on two performance measures. Cumulative delay refers to the increase in delay over a section, and current delay refers to the actual time compared with the schedule. Cumulative delays are studied with survival analysis and current delay with a multi-state Markov model.

The results show that the weather can affect train performance. The survival analysis showed a connection between the weather and cumulative delays. The Markov model also showed a connection between the weather and current delay, where, for example, the presence of atmospheric icing had a negative effect on the train remaining in a non-delayed state.

Nyckelord: survival analysis, continuous Markov model, cumulative delay, high-speed train, weather, atmospheric icing


Acknowledgements

I would like to express my gratitude to my two supervisors, Jianfeng Wang and Jun Yu at Umeå University, for their help and guidance throughout the project. I would also like to thank the Swedish Transport Administration and Anders Nilsson for their guidance and for providing the data. Finally, I would like to thank the Atmospheric Science Group (ASG) at Luleå University of Technology for simulating the weather data.


Contents

1 Introduction
  1.1 Background
  1.2 Project Description
  1.3 Goal
  1.4 Assumptions and limitations
  1.5 Related work

2 Theory
  2.1 Survival analysis
    2.1.1 Censoring
    2.1.2 Terminology and Notation
    2.1.3 The survival function
    2.1.4 The hazard function
    2.1.5 Life data
    2.1.6 Kaplan–Meier estimator
    2.1.7 Cox proportional-hazard model
    2.1.8 Estimation of coefficients with partial likelihood
    2.1.9 Hazard ratio and proportionality
    2.1.10 Heaviside function
    2.1.11 Schoenfeld residuals
  2.2 Recurrent events
    2.2.1 Counting process approach
    2.2.2 Counting process for time-varying variables
    2.2.3 Robust estimation
  2.3 Tests
    2.3.1 Spearman correlation
    2.3.2 Likelihood-ratio test
  2.4 Stochastic processes
    2.4.1 Continuous-time Markov chain
    2.4.2 Continuous multi-state Markov model
    2.4.3 Pearson-type goodness-of-fit test

3 Material
  3.1 Background information
    3.1.1 High-speed train
    3.1.2 Sections
    3.1.3 Delay and cumulative delay
    3.1.4 Weather and forecast system
    3.1.5 Atmospheric icing
  3.2 Description of the train operation and the train line
  3.3 Data collection
    3.3.1 Train operation data
    3.3.2 Weather data
  3.4 Tools

4 Method
  4.1 Data extraction
  4.2 Data management
    4.2.1 Trip identification
    4.2.2 Combination of time variables
    4.2.3 Planned and actual driving time
    4.2.4 Cumulative delay
    4.2.5 Delayed
    4.2.6 Missing values in the operation data
    4.2.7 Merging the weather data and the operation data
    4.2.8 Section mean of weather variables
  4.3 Exploratory analysis
  4.4 Survival analysis
    4.4.1 Response variable
    4.4.2 Predictors
    4.4.3 Observation time
    4.4.4 Kaplan–Meier estimation
    4.4.5 The extended Cox proportional-hazard model with recurrent events
    4.4.6 Validation
  4.5 Multi-state Markov model for delayed
    4.5.1 Response variable
    4.5.2 Predictors
    4.5.3 Time unit
    4.5.4 Markov model

5 Results
  5.1 Exploratory analysis
    5.1.1 Predictors
    5.1.2 Examination of the imputed values
    5.1.3 Examination of the cumulative delays and the delayed state
  5.2 Survival analysis
    5.2.1 Kaplan–Meier estimation
    5.2.2 The extended Cox proportional-hazard model with recurrent events
  5.3 Stochastic model results

6 Conclusion and discussion
  6.1 Conclusion
  6.2 Discussion
    6.2.1 Data uncertainty
    6.2.2 Performance measure for the train
    6.2.3 Model approach
    6.2.4 Time period
    6.2.5 Future work

References

A Survival analysis appendix
  A.1 Proportionality test
  A.2 Results, Cox model
  A.3 Results, reference period

B Stochastic process appendix
  B.1 Results, reference period, non-delayed to delayed
  B.2 Results, reference period, delayed to non-delayed


1 Introduction

1.1 Background

Botnia-Atlantica is an EU programme created to finance collaboration projects between Finland, Norway and Sweden. It covers a region consisting of one municipality and six regions in these three countries. The counties of Västerbotten and Västernorrland, as well as the municipality of Nordanstig, make up the Swedish part of the region.

A visualisation of the area can be seen in Figure 1 [1].

Figure 1: A map of the Botnia-Atlantica region

One of the projects financed by Botnia-Atlantica is called the Nordic Icing Center of Expertise (NoICE). The purpose of this project is to gather knowledge and conduct research about how atmospheric icing affects the private and the public sector in the Botnia-Atlantica region. The project has a budget of 1.9 million euros and runs from September 2018 to August 2021. At the moment, four universities are working on this project, one of which is Umeå University in Västerbotten [1].

The climate of the Nordic countries is known for changing considerably over the year. The winters can be cold and harsh, putting a strain on infrastructure, companies and people. It is therefore central to gather knowledge and methods to counter problems caused by the climate. One phenomenon that occurs during the winter is atmospheric icing. Atmospheric icing refers to all types of accumulation of frozen water and is usually divided into two types: in-cloud icing and precipitation icing. It causes ice to accumulate on different sorts of equipment and limits their ability to work properly.

The railway system is an important part of the infrastructure, and a large number of people as well as companies depend on it for transportation of different sorts. One key requirement for the railway industry is that trains should be on time, in order to minimise costs and maintain confidence in the railway system. An important task is therefore to investigate how train delays are affected by different climate factors.

1.2 Project Description

The main purpose of the project is to understand the effects of the winter climate and atmospheric icing on the train service in the Botnia-Atlantica region. The performance of the train service will be measured by the cumulative delay for each section, which is the increment in delay within a section. The project will also focus on the delayed state, defined as the train having a delay of 5 minutes or more compared to the schedule within a section during the trip.

The project will be conducted using information from earlier research on the subject together with statistical analysis of acquired train operation data and simulated weather data. This should give an indication of which factors influence the train service, and to what extent.

1.3 Goal

The goal of this thesis is to answer the following questions with the help of statistical methods.

1. In what way do the winter climate and atmospheric icing affect the occurrence of cumulative delays for trains in the Botnia-Atlantica region?

2. What effect do the winter climate and atmospheric icing have on the transition from a non-delayed state to a delayed state, and on the transition from a delayed state back to a non-delayed state?

Answering these questions gives an increased understanding of the performance of trains in winter climate and could help the railway industry improve train operations in winter conditions.

1.4 Assumptions and limitations

The Swedish train network is large and only a small part of it is located in the Botnia-Atlantica region. The project focuses on this region, and it is therefore important to analyse train trips within it. The approach is to use the part of the train network that is located in the Botnia-Atlantica region and to use data for only one specific type of train. The conditions used for the selection of train operation data are:

1. The train operation data includes only trips of trains operating between Umeå and Stockholm.

2. The train operation data includes only trips of high-speed passenger trains.

3. Simulated weather data has to exist for the time period of the train operation data.

1.5 Related work

Earlier studies of train performance and the effects of weather have been conducted. Xia et al. [2] fitted a linear model and showed that weather variables such as snow, temperature, precipitation and wind had a significant effect on the punctuality of trains in the Netherlands. In a more recent study, Zhang and Wang [3] used a machine learning approach to build a model that predicts the current delay at each station of a train line in China with the help of weather observations, with some success.

A train travels between locations in time, which can be seen as time-to-event data. Amorim and Cai [4] explored different approaches to survival analysis for time-to-event data and suggested the counting process approach for modelling recurrent events. According to Anderson et al. [5], using a time-homogeneous Markov model for time-to-event data is reasonable for monitoring a process that moves between different states. A similar project was done by Ottosson [6], using count models such as negative binomial regression and zero-inflated models with some success, showing that some weather variables had a significant effect on punctuality and cumulative delays, based on weather data from meteorological stations.


2 Theory

2.1 Survival analysis

Survival analysis is a collection of methods for analysing the time between a predefined starting point and an event of interest. The observation time is the time from the start of monitoring an observation until an event happens or the study ends. Time can be measured in various units, for example hours, minutes or years. An event is a predefined type of occurrence of interest, for example death, recovery or relapse. Survival analysis therefore has a broad area of application and is used in many fields [7].

2.1.1 Censoring

A problem that often occurs in survival analysis is censoring. This happens when the survival time of an observation is only partly known, for example when an observation drops out early without an event having occurred, or when no event has happened by the time the study ends. In both cases the survival time of the observation is unknown.

The most commonly used censoring conditions are right, left and interval censoring. Right censoring is when the time of the event is known to be above a certain threshold. Left censoring is when the time of the event is known to be below a certain threshold. Interval censoring combines the two: the time of the event is known to lie between two thresholds. Table 1 gives an example of each censoring condition [7].

Table 1: Examples of causes for the three common censoring conditions.

Type of censoring   Example
Right               No event has occurred for the observation before the study ends.
Left                The observation drops out early for some reason other than the event being studied.
Interval            The observation misses a follow-up time or is lost during the study.

2.1.2 Terminology and Notation

To explain survival analysis further, some basic terminology and notation are introduced. The survival time of an observation is denoted by T. Secondly, the lower-case letter t denotes a point in time of interest for the analysis; for example, if the study investigates whether an observation survives more than 10 time units, then t = 10. Lastly, θ denotes a variable indicating whether the event happened: 1 indicates that the event occurred and 0 indicates that the survival time is censored [7].


2.1.3 The Survival function

Two functions are central to survival analysis; one of these is the survival function. The survival function, often denoted S(t), gives the probability that an observation survives longer than t. All survival functions share three attributes:

1. A survival function is non-increasing: S(t) does not increase as t increases.

2. S(0) = 1: at the beginning of a study, no observation can yet have had the event.

3. S(∞) = 0: as t grows, every observation eventually has the event.

The survival function is defined as

S(t) = 1 − F(t) = P(T > t),

where F(t) is the cumulative distribution function. The cumulative distribution function (cdf) is defined as F(t) = P(T ≤ t), the probability that the event has occurred at or before time t, and can be written as F(t) = ∫_0^t f(u) du, where f(u) is the probability density function (pdf). The pdf gives the relative likelihood that the event happens exactly at time u. This means that the survival function [7] can also be written as

S(t) = 1 − F(t) = P(T > t) = ∫_t^∞ f(u) du.

2.1.4 The hazard function

The other important function in survival analysis is the hazard function, commonly denoted h(t). The hazard function gives the instantaneous risk that the event happens at time t, given that it has not yet occurred. Mathematically,

h(t) = lim_{∆t→0} P(t ≤ T < t + ∆t | T ≥ t) / ∆t.

The expression P(t ≤ T < t + ∆t | T ≥ t) gives the probability that the event happens in the interval (t, t + ∆t), given that no event has happened before t. For survival processes where an event can occur multiple times for the same subject, the hazard function is called the intensity function. The hazard function can also be expressed through the survival function as [7]

h(t) = f(t) / S(t).
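As a numeric sanity check (not from the thesis), the identity h(t) = f(t)/S(t) can be verified for an exponentially distributed survival time, whose hazard is constant and equal to the rate parameter; the rate value below is arbitrary:

```python
import math

def exp_pdf(t, rate):
    """Probability density f(t) of an Exponential(rate) survival time."""
    return rate * math.exp(-rate * t)

def exp_survival(t, rate):
    """Survival function S(t) = P(T > t) = 1 - F(t)."""
    return math.exp(-rate * t)

def hazard(t, rate):
    """Hazard h(t) = f(t) / S(t)."""
    return exp_pdf(t, rate) / exp_survival(t, rate)

# For the exponential distribution the hazard is constant: h(t) = rate.
rate = 0.5
for t in [0.1, 1.0, 5.0]:
    assert abs(hazard(t, rate) - rate) < 1e-12
```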


2.1.5 Life data

In order to perform survival analysis, a specific data structure has to exist. The data analysed in survival analysis is commonly known as life data. Table 2 shows an example of life data [7].

Table 2: An example of the structure of life data, illustrated with arbitrary values. Obs. # is an identification variable used to track each observation, t (Section 2.1.2) is the observation time, θ (Section 2.1.2) indicates whether the event happened, and X1, ..., Xp are the values of the p predictors.

Obs. #   t     θ     X1 ... Xp
1        t1    θ1    x11 ... x1p
2        t2    θ2    x21 ... x2p
...      ...   ...   ...
n        tn    θn    xn1 ... xnp

2.1.6 Kaplan–Meier estimator

In practice, most survival curves are unknown and therefore need to be estimated. The Kaplan–Meier (KM) estimator is a common method for estimating survival curves when the data includes censored observations. The KM method requires the life data to be ordered: if t1, t2, ..., tn are the times at which an event or censoring occurred, the observations are ordered so that t1 ≤ t2 ≤ ... ≤ tn. Let di denote the number of events at time ti, and let R(ti) denote the risk set, i.e. the observations that survive at least to time ti, with size Ri. The KM estimator [7] is then

Ŝ(t) = ∏_{i: ti ≤ t} (1 − di / Ri).
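A minimal sketch of the Kaplan–Meier estimator on invented toy life data (an observation time and an event indicator θ per subject, 1 = event, 0 = censored); the helper name `kaplan_meier` is hypothetical:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (t_i, S_hat(t_i)) at each event time.

    times  : observation times (any order accepted)
    events : 1 if the event occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s_hat = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(e for tt, e in data if tt == t)        # events at time t
        removed = sum(1 for tt, _ in data if tt == t)  # leave the risk set at t
        if d > 0:
            s_hat *= (1 - d / n_at_risk)
            curve.append((t, s_hat))
        n_at_risk -= removed
        i += removed
    return curve

# Toy data: events at t=2 and t=4, censoring at t=3 and t=5.
curve = kaplan_meier([2, 3, 4, 5], [1, 0, 1, 0])
assert curve == [(2, 0.75), (4, 0.375)]
```

Note how the censored observation at t = 3 shrinks the risk set without lowering the curve, which is exactly why censoring must be handled explicitly.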

2.1.7 Cox proportional-hazard model

The Cox proportional-hazard model is a regression model used to examine the relationship between the survival time of subjects and one or more predictors. The aim of the model is to investigate the effects of the predictors on the hazard rate.

Essentially, the Cox model is an expression for the hazard function h(t), which gives the risk of an event at time t. The Cox model [7] is defined as

h(t) = h0(t) ∗ e^(β1x1 + β2x2 + ... + βpxp),

where

• t is a point in time,

• h(t) is the hazard rate at time t,

• (x1, x2, ..., xp) are the predictors, where p is the total number of predictors,

• h0(t) is the baseline hazard, i.e. the hazard rate when all predictors xi are equal to zero, and

• the effect of the predictors is represented by the exponential of the coefficients β1, β2, ..., βp.

A Cox proportional-hazard model that includes variables that change over time is called the extended Cox proportional-hazard model and is defined as

h(t) = h0(t) ∗ e^(β1x1(t) + β2x2(t) + ... + βpxp(t)).

The hazard rate for a subject at time t thus depends on the values of the time-dependent predictor variables at time t.

2.1.8 Estimation of coefficients with partial likelihood

The coefficients are estimated with partial likelihood, which means that the likelihood is based on conditional probabilities and is a product over the event times of the subjects that fail. Let Yi, i = 1, 2, ..., n, be the ordered survival times of the subjects that are not censored. The partial likelihood is then

L(β) = ∏_{i=1}^{n} e^(βᵀXi) / Σ_{j∈R(Yi)} e^(βᵀXj),

where β = (β1, β2, ..., βp), Xi = (x1i, x2i, ..., xpi), and R(t) = {i : Yi ≥ t} is the set of subjects at risk for an event at time t [7].
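For a toy data set with a single predictor, the (log) partial likelihood above can be evaluated directly; the data values and the helper name are invented for illustration:

```python
import math

def cox_partial_log_likelihood(beta, times, events, x):
    """Log partial likelihood for a one-predictor Cox model.

    Every subject i with an event contributes
    beta*x_i - log( sum_{j in R(Y_i)} exp(beta*x_j) ),
    where the risk set R(Y_i) contains all subjects with Y_j >= Y_i.
    """
    ll = 0.0
    for i in range(len(times)):
        if events[i] == 1:
            risk = [j for j in range(len(times)) if times[j] >= times[i]]
            ll += beta * x[i] - math.log(sum(math.exp(beta * x[j]) for j in risk))
    return ll

times  = [1.0, 2.0, 3.0]
events = [1, 1, 0]
x      = [0.0, 1.0, 0.5]

# At beta = 0 every subject in a risk set is equally likely to fail,
# so the partial likelihood is (1/3) * (1/2) and its log is log(1/6).
assert abs(cox_partial_log_likelihood(0.0, times, events, x) - math.log(1/6)) < 1e-12
```

In practice the coefficient estimate is the beta that maximises this function; an optimiser would be applied to it rather than evaluating it at a single point.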

2.1.9 Hazard Ratio and proportionality

A hazard ratio (HR) is calculated for each predictor and indicates whether the predictor has a positive or negative relationship with the probability of an event. The hazard ratio for a predictor is obtained by taking the exponential of its coefficient, and a hazard ratio larger or smaller than 1 indicates that the predictor is related to the hazard rate:

e^β > 1 gives an increased hazard rate,
e^β < 1 gives a decreased hazard rate.

The hazard ratio is the ratio between two hazard rates corresponding to two different levels of a predictor. Consider two observations, A and B, with corresponding predictor values XA and XB. The hazard function for each observation can then be written as [7]

hA(t) = h0(t) ∗ e^(βXA),
hB(t) = h0(t) ∗ e^(βXB).


The hazard ratio is then obtained by dividing one hazard function by the other:

HR = hA(t) / hB(t) = e^(β(XA − XB)),

since the baseline hazard h0(t) cancels. The Cox model is a proportional-hazard regression model, which means that the hazard ratio between any two observations is constant over time. This is visualised in Figure 2.

Figure 2: Comparison between a proportional and a non-proportional hazard ratio over time. (a) Proportional hazard ratio in time; (b) non-proportional hazard ratio in time.

2.1.10 Heaviside function

If the proportionality assumption is not satisfied for a variable fitted in a Cox proportional-hazard model, a heaviside function g(t) can be used to fix the proportionality violation. The purpose of a heaviside function is to divide the observation time into intervals and compute a separate hazard rate for that variable in each interval. A heaviside function is therefore dependent on time. An example of a heaviside function is defined in (1), where K indicates the break point in time between the two intervals [7]:

g(t) = 1 if t ≤ K,
g(t) = 0 if t > K.    (1)

2.1.11 Schoenfeld residuals

Schoenfeld residuals are a common way to assess the proportionality of a fitted Cox model. They are defined, for each predictor in the model, for every subject that has had an event. The Schoenfeld residual is

sk = xk − x̄(β̂, tk),

where xk is the predictor value for a subject with an event at time tk, and x̄(β̂, tk) is the weighted average of that predictor value over all subjects still at risk at tk [7].


2.2 Recurrent events

A survival process can in some instances have multiple events for the same observation. Such events are referred to as recurrent events. There are different approaches to handling data with recurrent events, which mostly concern how the data is set up before the analysis [8].

2.2.1 Counting process approach

A common way of handling recurrent events is the counting process approach. This method treats each event as independent and defines the observation time as a set of intervals. The total observation time is divided into intervals, where an interval starts at the beginning of the study or at the end of the previous event, and ends at the occurrence of an event or at the end of the study [8].
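The interval construction described above can be sketched as follows, assuming one subject followed from time 0; the helper name and the toy event times are invented:

```python
def counting_process_intervals(event_times, end_of_study):
    """Split one subject's follow-up into (start, stop, event) intervals.

    Each interval starts at study entry (time 0) or at the previous event,
    and ends at the next event (event = 1) or at the end of the study
    (event = 0, i.e. censored).
    """
    intervals = []
    start = 0.0
    for t in sorted(event_times):
        intervals.append((start, t, 1))
        start = t
    if start < end_of_study:
        intervals.append((start, end_of_study, 0))
    return intervals

# A subject with events at times 3 and 7, followed until time 10:
assert counting_process_intervals([3, 7], 10) == [(0.0, 3, 1), (3, 7, 1), (7, 10, 0)]
```

Each row of this layout can then be treated as its own observation when fitting the extended Cox model.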

2.2.2 Counting process for time varying variables

If time-varying variables are present in the data, the counting process approach is modified: a new interval is created each time a time-varying variable changes value or an event occurs. The variables are thus assumed to be constant within each interval [8].

2.2.3 Robust Estimation

When the coefficients are estimated with partial likelihood, the model assumes that each observation is independent. This does not hold when a subject can have multiple events. Robust estimation [7] is a way to handle that problem; the procedure only affects the variance of the coefficients, not the estimates. For the estimated covariance matrix V̂ar(β̂) and the matrix of score residuals Rs, the robust variance is estimated as

R̂(β̂) = V̂ar(β̂) [Rs′ Rs] V̂ar(β̂).

2.3 Tests

2.3.1 Spearman correlation

The Spearman correlation is a statistical measure of the strength of association between two variables. Let v and s be two variables with n measurements each. Each variable's measurements are ranked from highest to lowest, the ranks are compared within each subject, and the difference is denoted di. The Spearman correlation is then defined as

ρ = 1 − 6 Σ di² / (n(n² − 1)).

A large spread of ranks between the variables gives low correlation; a small spread gives high correlation.
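Assuming no ties in the measurements (so the simple rank-difference formula applies), the definition above can be implemented directly; the toy data are invented:

```python
def spearman_rho(v, s):
    """Spearman correlation via rho = 1 - 6*sum(d_i^2) / (n*(n^2 - 1))."""
    n = len(v)

    def ranks(values):
        # Rank from highest (rank 1) to lowest (rank n), assuming no ties.
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rv, rs = ranks(v), ranks(s)
    d2 = sum((rv[i] - rs[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly concordant rankings give rho = 1,
# perfectly discordant rankings give rho = -1.
assert spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]) == 1.0
assert spearman_rho([1, 2, 3, 4], [40, 30, 20, 10]) == -1.0
```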


2.3.2 Likelihood-ratio test

A likelihood-ratio test evaluates goodness of fit by comparing two nested models through the ratio of their likelihoods. The test statistic is

λLR = −2 log( Ls(θ̂s) / Lk(θ̂k) ),

where Ls and Lk are the likelihoods of two models with parameter sets θs and θk respectively, and θs is a subset of θk. The statistic λLR is then compared to a χ² distribution with degrees of freedom equal to the difference in the number of parameters between the two models.
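A small sketch of the statistic, working on log likelihoods so that the ratio becomes a difference; the log-likelihood values are invented:

```python
def likelihood_ratio_statistic(log_lik_small, log_lik_large):
    """lambda_LR = -2 * log(L_s / L_k) = -2 * (log L_s - log L_k).

    log_lik_small: maximised log likelihood of the nested (smaller) model
    log_lik_large: maximised log likelihood of the larger model
    """
    return -2.0 * (log_lik_small - log_lik_large)

# The larger model can never fit worse, so the statistic is non-negative.
stat = likelihood_ratio_statistic(-110.0, -105.0)
assert stat == 10.0
```

The resulting value would then be compared with a χ² quantile whose degrees of freedom equal the number of extra parameters in the larger model.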

2.4 Stochastic process

A stochastic process {X(t), t ∈ I} is a collection of random variables indexed by a parameter t belonging to an index set I. The process has discrete time if the index set I is discrete, I = {1, 2, ...}, and continuous time if I is continuous, I = [0, ∞).

A stochastic process takes values in a state space S. The process is discrete if S is a discrete set, S = {1, 2, ...}, and continuous if S is continuous, S = [0, ∞) [9].

A stochastic process is said to be stationary if the transition probabilities (for a discrete-time process) or the transition intensities (for a continuous-time process) are constant over time.

2.4.1 Continuous time Markov chain

Let X(t) be a stochastic process in continuous time t and let HX(s) represent all information about the process up to time s, where s ≤ t. X(t) is a continuous-time Markov chain if (2) holds [9]:

P{X(t) = j | HX(s)} = P{X(t) = j | X(s)}.    (2)

This means that the process satisfies the Markov property: the next state depends solely on the current state. The process is time-homogeneous if (3) holds, meaning that the probability of moving from one state to another depends solely on the time difference between the two time points [9]:

P(X(t + s) = j | X(s) = k) = P(X(t) = j | X(0) = k).    (3)

Transitions between states are governed by transition intensities qij, where i and j denote two different states. The transition intensity qij is the instantaneous rate at which the Markov chain moves from state i to state j and is defined as

qij = lim_{∆t→0} pij(t, t + ∆t) / ∆t.

The transition intensity matrix Q holds the transition intensities for each pair of states. A continuous Markov chain with N states has an N × N transition intensity matrix. For two states, Q can be written as [9]

Q = [ −q12   q12 ]
    [  q21  −q21 ],

where in general the entries are

qij = q(i, j) for i ≠ j, and qii = −Σ_{j≠i} q(i, j),

so that each row sums to zero. The relationship between the transition intensities and the transition probabilities is defined as

P(t) = e^(tQ),

where P(t) is the transition probability matrix for time t.
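The relation P(t) = e^(tQ) can be illustrated for a two-state intensity matrix by evaluating the matrix exponential with a truncated power series; the intensity values are invented, and the row-sum check confirms that P(t) is a stochastic matrix:

```python
def mat_mul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_exp(m, terms=60):
    """Matrix exponential of a 2x2 matrix via the truncated power series."""
    result = [[1.0, 0.0], [0.0, 1.0]]  # identity = zeroth term
    power = [[1.0, 0.0], [0.0, 1.0]]
    fact = 1.0
    for k in range(1, terms):
        power = mat_mul(power, m)  # m^k
        fact *= k                  # k!
        for i in range(2):
            for j in range(2):
                result[i][j] += power[i][j] / fact
    return result

# Two-state intensity matrix: rows sum to zero, off-diagonals non-negative.
q12, q21, t = 0.3, 0.1, 2.0
Q = [[-q12, q12], [q21, -q21]]
P = mat_exp([[t * Q[i][j] for j in range(2)] for i in range(2)])

# P(t) is a stochastic matrix: each row of transition probabilities sums to 1.
for row in P:
    assert abs(sum(row) - 1.0) < 1e-9
    assert all(p >= 0 for p in row)
```

Because the rows of Q sum to zero, every power of tQ also has zero row sums, which is why the rows of the exponential sum to exactly one.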

2.4.2 Continuous multi-state Markov model

A continuous multi-state Markov model S(t) uses a continuous Markov chain to describe a process where subjects move between different states in continuous time. The transitions between states are governed by the transition intensities qij, which may also depend on time-dependent variables x(t) [10].

Let qij(0) denote the baseline transition intensity. For a subject m with variable values xm(t), the transition intensity from state i to state j is

qij(xm(t)) = qij(0) ∗ e^(βijᵀ xm(t)),

where βij is a vector of log hazard ratios. The baseline transition intensities are defined as

qij(0) = lim_{∆t→0} P(S(t + ∆t) = j | S(t) = i) / ∆t.

The transition intensities qij are estimated by maximising the likelihood

L(Q) = ∏_{m,r} P_{S(t_{m,r}), S(t_{m,r+1})}(t_{m,r+1} − t_{m,r}),

over the baseline intensities and the log hazard ratios βij, where the product runs over all subjects m and observation times r.
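A sketch of how a covariate scales a transition intensity, qij(x) = qij(0) ∗ e^(βijᵀ x); the baseline intensity and the log hazard ratio for an icing indicator are invented values, not estimates from the thesis:

```python
import math

def transition_intensity(q0_ij, beta_ij, x):
    """q_ij(x(t)) = q_ij^(0) * exp(beta_ij' x(t)) for one covariate vector x."""
    return q0_ij * math.exp(sum(b * xv for b, xv in zip(beta_ij, x)))

# Hypothetical values: baseline non-delayed -> delayed intensity and a
# positive log hazard ratio for a binary icing indicator.
q0 = 0.05
beta = [0.4]
q_no_ice = transition_intensity(q0, beta, [0.0])
q_ice = transition_intensity(q0, beta, [1.0])

# With the covariate at zero the intensity equals the baseline, and the
# ratio of the two intensities is exactly the hazard ratio exp(beta).
assert abs(q_no_ice - q0) < 1e-15
assert abs(q_ice / q_no_ice - math.exp(0.4)) < 1e-12
```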

2.4.3 Pearson-type goodness-of-fit test

The estimated transition probabilities of a multi-state Markov model depend on the time between observations, the sequence of observed states and the variable pattern. When testing whether a Markov model is stationary, all these aspects have to be considered when calculating a Pearson-type goodness-of-fit statistic [11].

1. Let C denote the total number of variable combinations; if the variables are continuous, quantiles of the variables' value space are used so that C is finite.

2. Let H denote the total number of time intervals into which the Markov process is classified.

3. Let R denote the total number of states.

4. Let L denote the number of different interval lengths.

Let e_{h,l,r,c} and o_{h,l,r,c} be the expected and observed numbers of transitions for the cell (h, l, r, c). A Pearson-type goodness-of-fit statistic can then be calculated as

T = Σ_{h=1}^{H} Σ_{l=1}^{L} Σ_{r=1}^{R} Σ_{c=1}^{C} (o_{h,l,r,c} − e_{h,l,r,c})² / e_{h,l,r,c}.

The test can then be used to validate the stationarity assumption of the multi-state Markov model. The statistic T is approximately χ²-distributed, and a p-value is obtained by comparing T with the appropriate χ² distribution [11].
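The statistic can be computed directly once observed and expected transition counts per cell are available; the cell counts below are invented for illustration:

```python
def pearson_type_statistic(observed, expected):
    """T = sum over all cells of (o - e)^2 / e.

    observed, expected: dicts mapping a cell key (h, l, r, c) to counts.
    """
    return sum((observed[cell] - expected[cell]) ** 2 / expected[cell]
               for cell in expected)

# Two invented cells, keyed by (h, l, r, c):
observed = {(1, 1, 1, 1): 12, (1, 1, 2, 1): 8}
expected = {(1, 1, 1, 1): 10.0, (1, 1, 2, 1): 10.0}

# (12-10)^2/10 + (8-10)^2/10 = 0.4 + 0.4 = 0.8
assert abs(pearson_type_statistic(observed, expected) - 0.8) < 1e-12
```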


3 Material

3.1 Background information

3.1.1 High speed train

A high-speed train is a passenger train with a top speed between 200 and 250 km/h. This type of train often travels longer distances and is therefore more prone to experiencing disturbances.

3.1.2 Sections

A train line is divided into parts called sections. A station or an operation spot marks the start and the end of each section. At a station, passengers can board and alight, whereas an operation spot is a place where trains pass without stopping. For both operation spots and stations, operational times such as departure and arrival times are recorded [12].

3.1.3 Delay and Cumulative delay

The current positive difference between the planned departure or arrival time and the actual departure or arrival time at an operation spot or station is defined as the degree of delay. If the degree of delay is more than 5 minutes at the final station, the train is defined as non-punctual. The degree of delay can change during a train trip, which is the reason for having a measure such as cumulative delay. A cumulative delay, also called a disturbance, is defined as the increment in current delay over a section. All disturbances of 3 minutes or more are labelled with a causal factor and defined as disruptions [12].
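A minimal sketch of how current delay and cumulative delay per section could be computed from planned and actual times under the definitions above; the helper name and the trip times are invented:

```python
def delays(planned, actual):
    """Current delay at each point and cumulative delay per section.

    planned, actual: times in minutes at consecutive stations/operation spots.
    The current delay is max(actual - planned, 0); the cumulative delay of a
    section is the positive increment in current delay over that section.
    """
    current = [max(a - p, 0) for p, a in zip(planned, actual)]
    cumulative = [max(current[i + 1] - current[i], 0)
                  for i in range(len(current) - 1)]
    return current, cumulative

# Invented trip: the train loses 4 minutes on the first section,
# 2 more on the second, then recovers 3 minutes on the third.
planned = [0, 10, 20, 30]
actual  = [0, 14, 26, 33]
current, cumulative = delays(planned, actual)
assert current == [0, 4, 6, 3]
assert cumulative == [4, 2, 0]
```

Note that a section where the train recovers time contributes a cumulative delay of zero rather than a negative value, matching the definition of a disturbance as an increment.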

3.1.4 Weather and forecast system

The Weather Research and Forecasting (WRF) model is a numerical weather prediction system used for both research and operational purposes. The model can be set up with different configurations and over different regions, and either actual atmospheric conditions or idealised conditions can be used. The WRF model simulates estimates of the desired weather variables over grids. The grids cover the region of choice, and the size of each grid square is decided by the resolution setting: higher resolution gives smaller grid squares and more precise simulations. The update rate is decided by the time step; smaller time steps give more frequent simulations. Higher resolution and smaller time steps are naturally more computationally demanding and increase the simulation time [13].

3.1.5 Atmospheric Icing

Atmospheric icing is a phenomenon that can occur in the atmosphere when droplets of water freeze on objects that they come in contact with. There are two different sorts of atmospheric icing: in-cloud icing and precipitation icing. In-cloud icing is the result of super-cooled droplets that have come in contact with a surface and frozen. Precipitation icing is the result of rain and snow that freeze after coming in contact with a surface [14].

3.2 Description of the train operation and the train line

The train line between Ume˚a and Stockholm has 14 stations and 120 sections. The total length of the train line is 711 km, and the planned driving time for the whole line is approximately six and a half hours. Multiple trains are ordered to drive Ume˚a-Stockholm or Stockholm-Ume˚a every day. The section lengths vary from 0.3 km to 15 km.

3.3 Data collection

The data used in this project was collected from two different sources, which are further explained below.

3.3.1 Train operation data

The train operation data was provided by the Swedish Traffic Administration (STA) and contains train operations from 2011 to 2018. The data is stored in csv files, each containing one year of train operations. The most important variables are further explained in Table 3 with the name of and information about each variable.

Table 3: Information about variables in the operation data except for operation times

Variables            Description
Train Number         An identification number for the train used in the trip
Arrival location     Name of the arrival location
Departure location   Name of the departure location
Departure date       The departure date for a train at a location. The format used is: yyyy-mm-dd
Arrival date         The arrival date for a train at a location. The format used is: yyyy-mm-dd
Train type           Type of train, for example: high-speed, commuter and regional
Contract company     Name of the company that was responsible for the trip
Cancel code          Shows if the trip was cancelled or not: J for cancelled and N for not cancelled
Section Length       Length of the section in kilometres (km)
Section Number       A number to identify a section of the train line. Numbers: 1-120

Each observation represents a section and includes the variables in Table 3 together with the planned and actual operation times, which can be seen in Table 4.


Table 4: Information about the operation times

Variables               Description
Planned departure time  The planned departure time from a location
Planned arrival time    The planned arrival time at a location
Actual departure time   The actual departure time from a location
Actual arrival time     The actual arrival time at a location

The operation times are stored as date, hour and minutes separately, which means that a combination is required. This is further explained in the data processing section. The time is rounded down to the nearest minute by the signalling system. For example, if the actual time is 14:03:05 then the recorded time will be 14:03:00.

3.3.2 Weather data

The weather data was obtained through the Atmospheric Science Group (ASG) at Lule˚a University of Technology. The simulations were done for the period October 2016 to February 2017. The resolution was set to 3x3 km, which means that each grid square has an area of 9 km2. The time step was set to 1 h, which means that there are simulated values for each hour. The region can be seen in Figure 3, which is the region where the train line of interest is located; the grid points used lay in close proximity to it. The blue line represents the train line.

Figure 3: Train line in the region of simulated weather variables

The weather variables used and extracted from the weather data set can be seen in Table 5. These variables are chosen because they are easy to understand and are suitable indicators of the weather. The icing variable is chosen because of the purpose of the NoICE project, and is simulated with the WRF model with the help of Makkonen's icing model [15].

Table 5: Information about the weather variables

Variables    Description
Temperature  The temperature in kelvin (K) 2 metres above the ground
Humidity     Relative humidity at 2 metres (%)
Snow depth   The snow depth in metres (m)
Icing Rate   Accumulated snow and ice from the microphysics scheme between model output steps (mm)

3.4 Tools

Throughout the project, R has been the programming language used, together with different packages. The packages have simplified the process of data handling and analysis. The package dplyr has been used for data handling, which is a package with simple commands for data manipulation. The analysis has been carried out with the survival package and the plots have been made with ggplot2. For the multi-state Markov model the msm package was used.


4 Method

4.1 Data extraction

All train trips from October 2016 - February 2017 are extracted from the operation data. This period is chosen because the weather data is only available for this period. The period October 2016 - December 2016 is extracted to be used as reference data, and January - February 2017 is extracted as data representing a winter period with more extreme weather. Only high-speed trains that had SJ AB as the contract company, trafficking the train line Stockholm-Ume˚a or Ume˚a-Stockholm, are extracted. This train line was chosen because the majority of it is located in the Botnia-Atlantica area. Train trips that had been cancelled are removed.

4.2 Data management

Before the analysis is conducted, some data management is done. The purpose of this section is to explain each step taken to manage the data.

4.2.1 Trip Identification

In the operation data, multiple train trips are stored in the same file. To be able to analyse the data there has to be some way to uniquely separate observations from different trips. One trip is divided into 120 sections, which means that each trip has 120 observations. A new variable named tripID is created, which is a combination of the train order number and the departure date from the starting station. This gives observations from the same trip a unique identification. The tripID has the structure yyyy-mm-dd1000, where the y's stand for the year, the m's for the month, the d's for the day and the number at the end is the order number.
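The tripID construction described above can be sketched as a simple string concatenation. The thesis does this in R; the following is an illustrative Python version, with hypothetical function and variable names:

```python
def make_trip_id(departure_date, order_number):
    # Concatenate the departure date (yyyy-mm-dd) from the starting
    # station with the train order number, as described in the text.
    return f"{departure_date}{order_number}"

# A train with order number 1000 departing on 15 January 2017:
print(make_trip_id("2017-01-15", 1000))  # 2017-01-151000
```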

4.2.2 Combination of time variables

One section in the operation data has planned and actual times for arrival and departure. These times are divided into minutes, hours and date, and have to be converted into a time class and combined. This is done with the help of a base function in R called as.POSIXct. The conversion and combination result in 4 new time variables, called planned departure time, actual departure time, planned arrival time and actual arrival time, for each section observation.
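The combination step can be illustrated with a small sketch. The thesis uses as.POSIXct in R; this Python version assumes the same field layout (date string plus separate hour and minute fields):

```python
from datetime import datetime

def combine_operation_time(date_str, hour, minute):
    # Build one timestamp from the separately stored date, hour and
    # minute fields; seconds are zero since the signalling system
    # rounds down to whole minutes.
    return datetime.strptime(date_str, "%Y-%m-%d").replace(hour=hour, minute=minute)

# A recorded event at 14:03:05 is stored as hour=14, minute=3:
t = combine_operation_time("2017-01-15", 14, 3)
print(t)  # 2017-01-15 14:03:00
```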

4.2.3 Planned and actual driving time

The difference between the planned driving time and the actual driving time has to be calculated in order to compute the cumulative delay for each section. This is done with the help of a base function in R called difftime, which makes it possible to subtract the planned departure time from the planned arrival time to calculate the planned driving time. The same is done for the actual driving time, but instead using difftime to subtract the actual departure time from the actual arrival time.


4.2.4 Cumulative delay

To calculate the cumulative delay for a section, the planned and actual driving times are compared. The planned driving time is subtracted from the actual driving time, which gives the cumulative delay for each section.
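As a worked example of the two subtractions above (all timestamps are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical operation times for one section.
planned_dep = datetime(2017, 1, 15, 14, 0)
planned_arr = datetime(2017, 1, 15, 14, 10)
actual_dep = datetime(2017, 1, 15, 14, 2)
actual_arr = datetime(2017, 1, 15, 14, 16)

# Driving time = arrival time minus departure time.
planned_drive = planned_arr - planned_dep  # 10 minutes
actual_drive = actual_arr - actual_dep     # 14 minutes

# Cumulative delay: the increment in delay over the section.
cumulative_delay = actual_drive - planned_drive
print(cumulative_delay)  # 0:04:00
```

A cumulative delay of 4 minutes would make this section a disruption under the 3-minute threshold defined earlier.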

4.2.5 Delayed

To determine if the train is currently delayed, the difference between the actual arrival time and the planned arrival time is computed for every section and trip.

4.2.6 Missing values in the operation data

A section in the operation data for a trip sometimes has missing values. This led to an investigation of the missing values to get a better picture of their nature. It concluded that all trips had missing values, and these are classified into three different classes which are defined in Table 6.

Table 6: Missing operation times classes

Class Departure time missing Arrival time missing

1 True False

2 False True

3 True True

To handle the missing values and repair the data, an imputation is proposed. The operation data for a trip can be seen as a time series where the arrival and departure times are logged for each section. A common method to impute missing values in this type of data is last observation carried forward (LOCF) imputation: the latest recorded value is used to impute the missing value. The advantages of using LOCF are that the number of observations removed from the study decreases, and that it makes it possible to study all subjects over the whole time period. A disadvantage of the method is the introduction of bias in the estimates if the values change considerably with time and the time period between the most recent value and the missing value is long. A modified version of this method can be used on the train operation data. The modified version is explained further below.

1. Start from the beginning of the trip and save the latest arrival or departure time until a missing time occurs.

2. If the missing time is an arrival time then apply (a); if it is a departure time then apply (b):

(a) Replace the missing arrival time with the latest departure time + the planned driving time for the previous section.

(b) Replace the missing departure time with the latest arrival time + the planned station stop time.

3. Save the imputed time as the new latest time.

4. If the section is not the last section of the trip, go back to 1.

The reason this method is used is that the intervals with missing values are short, which decreases the risk of bias. The imputed operation times are examined to ensure that the imputation process does not introduce large bias.
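The modified LOCF procedure above can be sketched as follows. This is an illustrative Python version under assumed data layout (a trip as a list of per-station records with possibly missing times), not the thesis's actual R implementation:

```python
from datetime import datetime, timedelta

def impute_trip(stations):
    """Modified last-observation-carried-forward imputation sketch.
    `stations` is a list of dicts with keys 'arr' and 'dep' (None if
    missing), plus 'planned_drive' (driving time of the preceding
    section) and 'planned_stop' (planned station stop), as timedeltas.
    The record structure is an assumption for illustration."""
    latest = None
    for s in stations:
        if s['arr'] is None and latest is not None:
            # Rule (a): latest departure + planned driving time.
            s['arr'] = latest + s['planned_drive']
        latest = s['arr']
        if s['dep'] is None and latest is not None:
            # Rule (b): latest arrival + planned station stop time.
            s['dep'] = latest + s['planned_stop']
        latest = s['dep']
    return stations

trip = [
    {'arr': datetime(2017, 1, 15, 14, 0), 'dep': datetime(2017, 1, 15, 14, 2),
     'planned_drive': timedelta(0), 'planned_stop': timedelta(minutes=2)},
    {'arr': None, 'dep': None,  # class 3: both times missing
     'planned_drive': timedelta(minutes=10), 'planned_stop': timedelta(minutes=1)},
]
impute_trip(trip)
print(trip[1]['arr'], trip[1]['dep'])  # 14:12:00 and 14:13:00
```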

4.2.7 Merge the weather data and the operation data

The weather data and the operation data have to be merged. This is done by finding the coordinates of every departure or arrival location. All arrival and departure locations existing on the train line between Stockholm and Ume˚a are then matched with the closest grid point with the help of the Euclidean distance. All departures and arrivals are then matched with weather values from the closest grid point by date and time. The time is rounded to the nearest hour.
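The nearest-grid-point matching can be sketched as a minimum over Euclidean distances. The coordinates below are hypothetical and assumed to be projected into kilometres:

```python
import math

def nearest_grid_point(location, grid_points):
    # Return the grid point with the smallest Euclidean distance
    # to the station coordinate.
    return min(grid_points,
               key=lambda g: math.hypot(g[0] - location[0], g[1] - location[1]))

grid = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]  # hypothetical 3x3 km grid points
print(nearest_grid_point((2.0, 0.5), grid))  # (3.0, 0.0)
```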

4.2.8 Section mean of weather variables

The section means of the weather variables are computed by taking the mean of the weather variables at the starting point and the end point of a section.

4.3 Exploratory analysis

Before any modelling, exploratory analysis is done. All weather variables are examined and the Spearman correlation (Section 2.3.1) is computed for each pair of predictors. This is done to ensure that no large multicollinearity is present in the data. The disruptions are also examined, mostly to assess the effect of the imputation process by looking at the disruptions that had a connection with the imputed times and at the number of disruptions.

4.4 Survival analysis

A train trip can be seen as a survival process that starts when the train leaves the first station and ends at the final destination. During the trip the train is vulnerable to disruptions that in the end could lead to a non-punctual train. There could be factors that increase or decrease the risk of disruptions.

4.4.1 Response variable

A requirement for conducting survival analysis is a response variable. In this approach the occurrence of disruptions is studied. A binary variable is therefore created, where 1 indicates the occurrence of a disruption and 0 no occurrence. The response variable is defined in Table 7. A disruption can occur multiple times per trip but only once per section.

Table 7: Definitions of disruptions and non-disruptions

Class       Description                Sign
None        Cumulative delay < 3 min   0
Disruption  Cumulative delay ≥ 3 min   1


4.4.2 Predictors

For each section, the mean of each weather variable at the starting point and the end point of the section is derived. The assumption is that the mean of the weather variables over a section is more representative than using the values at the start or the end of the section. The predictors are defined in Table 8.

Table 8: Predictors for the modelling part

Predictors           Description
Section temperature  The mean temperature of the departure and arrival location for a section
Section humidity     The mean humidity of the departure and arrival location for a section
Section snow depth   The mean snow depth of the departure and arrival location for a section
Section icing group  None: mean icing of the departure and arrival location for a section = 0; Ice: mean icing of the departure and arrival location for a section > 0

4.4.3 Observation time

The observation time used for the modelling is the travelled kilometres. This is used because sections have different lengths, and travel time is hard to use because the response variable is built on it.

4.4.4 Kaplan-Meier estimation

A Kaplan-Meier estimation (Section 2.1.6) is an estimation of the survival curve (Section 2.1.3) for different groups. This method can be used when looking at the first event of a survival process and can give an indication of how predictors affect the rate of survival. Each trip is divided into intervals: for a trip with no disruptions the interval is the whole trip, and for a trip with disruptions it is the interval from the start to the first disruption. The observation time is the driven kilometres.

The latest section weather observation is used in the estimation because a Kaplan-Meier estimator cannot handle time-varying predictors. The disruptions are divided into two groups, based on the median value for temperature, snow depth and humidity, and on the presence or absence of atmospheric icing for the icing predictor. The groups are defined in Tables 9 and 10.

Table 9: Groups for: Temperature, humidity and snow depth

Class Description

High Predictor value > Median(Predictor) Low Predictor value ≤ Median(Predictor)


Table 10: Groups for: Atmospheric icing

Class Description

None Atmospheric icing = 0

Icing Atmospheric icing > 0

A Kaplan-Meier estimate of the survival curve is fitted and plotted for each group.
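The product-limit estimator underlying these curves can be sketched compactly. The thesis uses the R survival package; the following is an illustrative Python version with hypothetical data (kilometres to the first disruption, with trips reaching the end station censored):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimator sketch. `times` are
    observation times (here driven kilometres), `events` 1 = disruption,
    0 = censored. Returns (time, S(t)) pairs at event times; assumes
    distinct event times for simplicity."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    at_risk, surv, curve = n, 1.0, []
    for i in order:
        if events[i] == 1:
            surv *= (at_risk - 1) / at_risk  # multiply by conditional survival
            curve.append((times[i], surv))
        at_risk -= 1  # subject leaves the risk set after its time
    return curve

# Five hypothetical trips: disruption at 120, 300 and 450 km;
# two trips reach the 711 km end station without a disruption.
print(kaplan_meier([120, 300, 450, 711, 711], [1, 1, 1, 0, 0]))
```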

4.4.5 The extended Cox proportional-hazard model with recurrent events

In order to account for all weather values and disruptions during a trip, and to assess all predictors simultaneously, an extended Cox proportional-hazard model (Section 2.1.7) is fitted for the time period January 2017 - February 2017.

The counting approach (Section 2.2.1) is used because of the time-dependent predictors, which means that each trip is divided into intervals. The intervals are defined by the sections: the travelled kilometres before a section and the travelled kilometres at the end of a section define the time interval. For the counting approach the predictors are assumed to be constant within each interval; this is not strictly true in this case because the weather changes all the time. The assumption is made that weather changes within a section would not affect the analysis in a significant way, because the changes should be small. Robust estimation (Section 2.2.3) is used to account for correlation within each trip.

The predictor temperature did not satisfy the assumption of proportionality (Section 2.1.9) and seemed to have a larger effect at the start of a trip. This is handled by using two heaviside functions (Section 2.1.10), g1(t) and g2(t), defined in equations (4) and (5), and creating an interaction between these functions and temperature; the break point is 100 kilometres.

g1(t) = 1 if t ≤ 100, 0 if t > 100   (4)

g2(t) = 1 if t > 100, 0 if t ≤ 100   (5)

This means that one hazard ratio is estimated for the interval before 100 kilometres and one after. The breaking point is chosen after inspecting the Schoenfeld residuals plotted against time, which indicated that the interval 0 km - 200 km had a different hazard ratio than the interval from 200 km to the end; 100 km is in the middle of this interval and seemed an appropriate choice. All predictors are kept in the model, because the purpose is to investigate all predictors together. Interactions are tested for some combinations of variables; no large interaction effect is found, so no interactions are included, which makes the interpretation of the main effects easier. Another model is fitted in the same way for the reference data with the time period October - December 2016.
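The heaviside interaction terms can be sketched as two derived columns per counting-process interval. This is an illustrative Python sketch (the thesis fits the model in R; the function name and the use of the interval's end kilometre as t are assumptions):

```python
def heaviside_terms(interval_end_km, temperature, breakpoint=100.0):
    """Create the two interaction terms g1(t)*temperature and
    g2(t)*temperature for one counting-process interval, using the
    interval's end kilometre as t and a 100 km break point."""
    g1 = 1.0 if interval_end_km <= breakpoint else 0.0
    g2 = 1.0 - g1  # g1 and g2 partition the time axis
    return g1 * temperature, g2 * temperature

# Before the break point only the first term carries the temperature;
# after it, only the second term does.
print(heaviside_terms(80.0, -5.0))
print(heaviside_terms(150.0, -5.0))
```

Because g1 and g2 partition the time axis, exactly one of the two columns is non-zero in any interval, giving one temperature hazard ratio per regime.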


4.4.6 Validation

The Cox proportional-hazard model makes an assumption of proportionality, which is tested by plotting Schoenfeld residuals (Section 2.1.11) against time and by a score test using the cox.zph function from the survival package. A log-ratio test (Section 2.3.2) is done comparing a model with no weather predictors to the fitted model, to see if the model is significant.

4.5 Multi-state Markov model for delayed

A train trip can be seen as a stochastic process where the train occupies different states. At the start of the trip the train is in one state, and during the trip it can move between the possible states until it reaches the end station.

4.5.1 Response variable

A train is considered delayed at a section if it arrives 5 minutes or more after the scheduled time. During a trip the train can move between being delayed and not delayed, which is visualised in Figure 4.

Figure 4: Visualisation of the movement between the states

This can be modelled with a multi-state Markov model (Section 2.4.2). Each state is defined in Table 11.

Table 11: Definitions of the two states

Class    Description            Sign
None     Current delay < 5 min  1
Delayed  Current delay ≥ 5 min  2

4.5.2 Predictors

The predictors used for the modelling part can be seen in Table 12.


Table 12: Predictors for the modelling part

Predictors                   Description
Current section temperature  The mean temperature of the departure and arrival location for a section
Current section humidity     The mean humidity of the departure and arrival location for a section
Current section snow depth   The mean snow depth of the departure and arrival location for a section
Current section icing group  None: mean icing of the departure and arrival location for a section = 0; Ice: mean icing of the departure and arrival location for a section > 0
Current state                The current state occupied by the train

4.5.3 Time unit

The time unit used for this problem is the continuous variable driven kilometres, similar to the survival analysis. The argument for this is that the response variable is based on the actual time, which could otherwise affect the analysis.

4.5.4 Markov model

A continuous two-state model (Section 2.4.2) with the predictors from Table 12 is fitted with the help of the function msm from the msm package. The baseline transition intensities are estimated together with the hazard ratios and respective confidence intervals for each predictor. The model is tested with a Pearson-type goodness-of-fit test (Section 2.4.3). The model is also compared to a model without any weather predictors with a log-ratio test (Section 2.3.2). Lastly, another model with the weather predictors is fitted for the period October - December, to see if the weather predictors have the same effect in less extreme weather.
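The mechanics of a two-state continuous-time Markov chain can be made concrete with its closed-form transition probability matrix P(t) = exp(Qt). The sketch below uses illustrative intensities per driven kilometre; it is not the fitted msm model:

```python
import math

def two_state_probs(a, b, t):
    """Transition probability matrix P(t) for a two-state continuous-time
    Markov chain with intensity matrix Q = [[-a, a], [b, -b]], where
    a = intensity state 1 -> 2 and b = intensity state 2 -> 1.
    Uses the standard closed form of exp(Qt) for two states."""
    s = a + b
    e = math.exp(-s * t)
    p11 = b / s + (a / s) * e  # P(stay in state 1 over time t)
    p22 = a / s + (b / s) * e  # P(stay in state 2 over time t)
    return [[p11, 1 - p11], [1 - p22, p22]]

# Illustrative intensities (per km) and a 100 km horizon.
P = two_state_probs(0.0015, 0.012, 100.0)
print(P)
```

Each row of P(t) sums to one, and as t grows the rows converge to the stationary distribution (b/(a+b), a/(a+b)).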


5 Results

5.1 Exploratory analysis

5.1.1 Predictors

Table 13 shows summarised data of the mean of each predictor over the sections. K stands for kelvin and C for degrees Celsius.

Table 13: Summarising data about each predictor: extreme period

Predictors   Min                 Max               Mean
Temperature  255.4 K (−17.65°C)  281.2 K (8.05°C)  271.9 K (−1.5°C)
Humidity     43.78 %             100.00 %          85.52 %
Icing        0 mm/h              2.59 mm/h         0.027 mm/h
Snow depth   0 cm                28.82 cm          3.33 cm

Table 14 shows the proportion of non-zero values for the predictors icing and snow depth. Icing has non-zero values in 26% of the observations, while snow depth has non-zero values in 89% of the observations.

Table 14: Proportion of non-zero values for snow depth and icing

Predictors  Proportion non-zero
Icing       26 %
Snow depth  89 %

The examination of the Spearman correlation (Section 2.3.1) between the predictors shows that some correlation exists. The results can be seen in Figure 5. A value of 0 indicates no correlation, negative values indicate negative correlation and positive values indicate positive correlation.


Figure 5: The Spearman correlation between the predictors

No pair of predictors seems to have high correlation; the highest is between temperature and humidity with a correlation of 0.38.

5.1.2 Examination of the imputed values

Table 15 shows the number of section observations with a missing actual arrival or actual departure time. 12.74% of the section observations have a missing value of class 1, 2 or 3 from Table 6 (Section 3.5.6).

Table 15: Information about the imputed travel times

Type                                      Number of observations  % of total
Sections with imputed operation times     6147                    12.74%
Sections with no imputed operation times  42109                   87.26%

Table 16 shows the number of disruptions that have a connection to the imputation process and the number of disruptions that are true disruptions. The proportion with a connection to the imputation process is 13.29%.


Table 16: Information about the disruptions and imputed disruptions

Type                                                       Number of disruptions  % of total
Disruptions with a connection to imputed operation times   57                     13.29%
Disruptions with no connection to imputed operation times  429                    86.71%

5.1.3 Examine the cumulative delays and the delayed state

Table 17 shows the total number of disruptions for the period January - February over the 416 trips in total. The proportion of section observations with a disruption is 0.89%.

Table 17: Table with information about the number and proportion of cumulative delays

Type                          Number of observations  % of total
Cumulative delay ≥ 3 minutes  429                     0.89%
Cumulative delay < 3 minutes  48256                   99.11%

Table 18 shows the total number of sections where a train had delayed status for the period January - February over the 416 trips in total. 12.54% of the section observations are in a delayed state.

Table 18: Table with information about the number of sections with delayed status and non-delayed status

Type                                    Number of observations  % of total
Delayed: current delay ≥ 5 minutes      5153                    12.54%
Non-delayed: current delay < 5 minutes  43103                   87.46%

5.2 Survival Analysis

5.2.1 Kaplan-Meier estimation

In Figure 6, survival curves are plotted for the humidity and snow depth groups. The high group seems to have lower survival until the first disruption than the low group, for both humidity and snow depth.


Figure 6: Kaplan-Meier estimation of the survival curve for humidity and snow depth

In Figure 7, survival curves are plotted for the temperature and icing groups. The no-icing group seems to have higher survival than the icing group. The high temperature group seems to have higher survival than the low temperature group until the first disruption.

Figure 7: Kaplan-Meier estimation of the survival curve for atmospheric icing and temperature


5.2.2 The extended Cox proportional-hazard model with recurrent events

The results from the fitted extended Cox proportional-hazard model with 4 predictors can be found in Table 19. Two predictors seem to have significant effects on the time until a disruption: temperature and humidity. Lower temperature and higher humidity seem to increase the hazard rate.

Table 19: Results from the fitted model

Predictor                           Coefficient  Hazard ratio  Robust standard error  z-value   p(>z)
Temperature: before 150 kilometres  -0.19        0.83          0.055                  -3.39     0.00070*
Temperature: after 150 kilometres   -0.062       0.94          0.025                  -2.46952  0.014*
Humidity                            0.025        1.026         0.0081                 3.14      0.0017*
Snow depth                          0.0073       1.0074        0.015                  0.48      0.63
Icing group: None                   -0.16        0.85          0.15                   -1.061    0.29

The proportionality test after fixing the violation of the proportionality assumption for temperature can be seen in Table 20. None of the predictors, nor the global test, has a small p-value; the lowest p-value is 0.45, indicating that the proportionality assumption cannot be rejected. The results from the fitted model before fixing the violation can be seen in Appendix A1 and A2.

Table 20: Results from proportionality test

Predictor P-value

Temperature 0.45

Humidity 0.54

Snow depth 0.51

Icing group 0.48

Global 0.78

The likelihood ratio test comparing the fitted model with a model without any covariates is shown in Table 21, showing that the model with weather predictors is significantly better than the model without weather predictors.

Table 21: Log-likelihood test: Model with weather predictors compared to a model without predictors

Model            Degrees of freedom  Chisq   Pr(>Chisq)
With predictors  5                   49.052  2.17e-09 *


The model fitted over the period October 2016 - December 2016 showed no significant effect of the predictors; all of them have large p-values. This can be seen in Appendix A3.

5.3 Stochastic model results

The baseline intensities for the continuous Markov chain model can be seen in Table 22. The baseline transition intensity is higher for the delayed to non-delayed transition than for the non-delayed to delayed transition.

Table 22: Baseline transition intensities between non-delayed and delayed state

                   To: Non-delayed  To: Delayed
From: Non-delayed  -0.0015          0.0015
From: Delayed      0.012            -0.012
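These intensities translate directly into expected sojourn distances: for a continuous-time Markov chain, the expected sojourn time in state i is -1/q_ii. A small illustrative computation with the baseline intensities above (time unit: driven kilometres):

```python
# Expected sojourn time in a state of a continuous-time Markov chain
# is -1 / q_ii; applied to the baseline intensities in Table 22.
q_nondelayed = -0.0015  # diagonal entry for the non-delayed state
q_delayed = -0.012      # diagonal entry for the delayed state

sojourn_nondelayed = -1 / q_nondelayed  # about 667 km
sojourn_delayed = -1 / q_delayed        # about 83 km
print(round(sojourn_nondelayed, 1), round(sojourn_delayed, 1))
```

This quantifies the observation that trains are expected to stay much longer in the non-delayed state than in the delayed state at baseline.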

The hazard ratio for each predictor for the non-delayed to delayed transition can be seen in Table 23. Temperature and icing group are the only predictors with a confidence interval not containing 1, which means that they are significant. The hazard ratio captures the change in the transition intensity given a change of one unit in the predictor or a change of group.

Table 23: Non-delayed to delayed

Predictor         Hazard ratio  CI: Lower  CI: Upper
Temperature       0.97          0.94       0.99
Humidity          0.99          0.98       1.002
Snow depth        1             0.97       1.024
Icing group: Ice  1.460         1.163      1.83

The hazard ratios for each predictor for the delayed to non-delayed transition can be seen in Table 24. Temperature, snow depth and humidity seem to be significant, with confidence intervals not including 1.

Table 24: Delayed to non-delayed

Predictor         Hazard ratio  CI: Lower  CI: Upper
Temperature       1.033         1.0010     1.065
Humidity          0.98          0.96       0.99
Snow depth        0.95          0.92       0.99
Icing group: Ice  1.19          0.945      1.50

Figure 8 shows two lines: the blue line is the observed percentage of trains occupying each state over the time period, and the red line shows the model's expected percentage of trains occupying each state. It is clear that the model expectation and the observed percentages differ, indicating a bad fit.

Figure 8: The expected prevalence by the model and the observed prevalence

The Pearson-type goodness-of-fit test (Section 2.4.3) gives a p-value < 0.05, which indicates that the model fits the data badly and that the process is not time homogeneous. The log-likelihood test against a model with no weather predictors is shown in Table 25, showing that the model with weather predictors is significantly better than a model without the weather predictors.

Table 25: Log-likelihood test: Model with weather predictors compared to a model without predictors

Model            Degrees of freedom  Chisq  Pr(>Chisq)
With predictors  8                   50.91  2.74e-08 *

The model fitted over the period October 2016 - December 2016 shows no significant effect of the predictors, except for snow depth on the transition from non-delayed to delayed; in all other cases the confidence interval included 1. This can be seen in Appendix B1 and B2.


6 Conclusion and discussion

6.1 Conclusion

The purpose of this study was to investigate if and how the weather and atmospheric icing affect the occurrence of disruptions and the transitions between a delayed and a non-delayed state. The disruptions were investigated with survival analysis and the Cox proportional-hazard model with a counting approach, which showed that temperature and humidity seem to have a significant effect on the time until disruptions. Lower temperatures and higher humidity give a shorter time until a disruption, meaning that these conditions are less favourable for avoiding disruptions. Temperature seems to have a larger effect at the start of a train trip, indicating that the effect of temperature is not constant over time, or it could be an effect of the sample size. It could also be an effect of trips going in two different directions and of the temperature in Ume˚a being lower than in Stockholm on average, which could have affected the results.

The delayed and non-delayed states were investigated with a two-state continuous Markov chain model. This showed that temperature and the presence of icing seem to have a significant effect on the transition rate from non-delayed to delayed. Lower temperatures and the presence of icing seem to decrease the time until a transition from the non-delayed to the delayed state, which means that these seem to be less favourable conditions for staying in a non-delayed state.

Humidity, temperature and snow depth seem to have a significant effect on the transition from the delayed state to the non-delayed state. Lower temperatures, higher snow depth and higher humidity seem to increase the time until a transition from the delayed state to the non-delayed state, meaning that these conditions seem less favourable for recovering from a delayed state. The Markov model does not fit the data well, though, and is not time homogeneous, which means that the transition intensities change over the trips or that important variables, which may or may not have a connection to the weather, are missing from the model. The baseline intensity is larger for the delayed to non-delayed transition than for non-delayed to delayed, which indicates that the expected time spent in the non-delayed state is longer.

Both models show that temperature and humidity seem to have a general effect on the performance of a train in the winter climate. Both approaches indicate that lower temperature and higher humidity are worse conditions for a train, decreasing the time until a disruption and increasing the time spent in a delayed state. The models fitted over the period October - December showed no significant effect of the weather predictors, except for snow depth on the transition from non-delayed to delayed. This could indicate that the weather has a more significant impact on train performance during more extreme winter conditions.


6.2 Discussion

6.2.1 Data uncertainty

The operation data had missing operation times that were fixed with an imputation process. This process could have introduced some bias in the data, which could have affected the results. No record had more than two consecutive operation times missing, which decreases the risk of bias. Still, the missing operation times could have some effect: a delayed train could be seen as delayed for a longer period than it should, and conversely a non-delayed train could be seen as non-delayed for a longer period. The imputation process was most problematic for the disruptions, because fake disruptions could be created, or true disruptions could become invisible. Sections with missing operation times send their cumulative delay forward, which means that the next section with true operation times adds the forwarded cumulative delay and could create a fake disruption, or remove a true disruption if the forwarded cumulative delay is negative. The share of disruptions affected by the imputation process was 13.29%, which means that a large majority of the disruptions are unaffected by it. This project also includes all disruptions during the period; for example, if the train had to stop because a person was on the rail and a disruption occurred, it could be disputed whether that disruption should be excluded.
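The forwarding mechanism can be sketched as follows. This is an assumed reconstruction of the logic described above, not the thesis code; the per-section increments and the three-minute disruption threshold are used for illustration.

```python
# Sketch of how a missing operation time forwards its delay increment
# onto the next observed section, creating a "fake" disruption there.
DISRUPTION_THRESHOLD = 3  # minutes; a disruption is an increment >= 3

true_increments = [1, 2, 2, 1]        # delay increment per section (minutes)
missing = [False, False, True, False]  # third section lacks an operation time

recorded = []
carried = 0
for inc, miss in zip(true_increments, missing):
    if miss:
        carried += inc         # increment cannot be measured; carried forward
        recorded.append(None)  # no observation at this section
    else:
        recorded.append(inc + carried)
        carried = 0

# No true increment reaches the threshold, but the recorded increment at
# the last section absorbs the carried 2 minutes (2 + 1 = 3), so a fake
# disruption appears there.
fake = [r is not None and r >= DISRUPTION_THRESHOLD for r in recorded]
```

The symmetric case also follows: if the carried increment is negative, a true disruption at the next observed section can be masked.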

The weather data are simulated data from a WRF model, which means that the values could deviate from the true weather values, and this could also have affected the results. AIG, the research group in Luleå, validated the simulated values by comparing model simulations with true weather observations in the region, showing that the simulated values followed the true observations most of the time and usually captured the overall trend.

6.2.2 Performance measure for the train

The choice of performance measures was made with guidance from the supervisors, to capture different aspects of the performance of a train. Disturbances greater than or equal to three minutes are defined as disruptions; these are rarer than smaller disturbances and generally have an impact on the punctuality of the trains. Non-punctual trains are in the delayed state at the final station regardless of the states the train has occupied during the trip, which is why studying the delayed state over the whole trip could increase the knowledge about the movement between the delayed and non-delayed states.
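The two measures can be sketched together. The station-level delays are invented for illustration, and the cut-off used to call the final-station state "delayed" is assumed to equal the three-minute disruption threshold; the thesis's punctuality cut-off may differ.

```python
# Sketch of the two performance measures: cumulative delay (increment in
# current delay over a section) and current delay (delay vs. schedule).
THRESHOLD = 3.0  # minutes; also used here as the delayed-state cut-off

# Current delay (minutes behind schedule) at four successive stations.
current = [0.0, 1.0, 4.5, 4.0]

# Cumulative delay per section: the change in current delay.
increments = [b - a for a, b in zip(current, current[1:])]  # [1.0, 3.5, -0.5]

# A disruption is a cumulative delay of at least three minutes.
disruptions = [d >= THRESHOLD for d in increments]

# Punctuality depends only on the state at the final station.
punctual_at_final = current[-1] < THRESHOLD
```

Here the train suffers one disruption on the second section, recovers slightly on the third, and still arrives non-punctual, illustrating why the final-station state alone hides the within-trip movement between states.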

6.2.3 Model approach

The modelling approach was chosen to complement the research already done by Ottoson (2019), who used count models to examine disruptions and logistic regression to examine punctuality. Both survival analysis and the Markov model include a time component in the modelling, which she mentioned as a future research area in her report.

References
