Data-driven methods for estimation of dynamic OD matrices


(1) LiU-ITN-TEK-A--21/040-SE. Data-driven methods for estimation of dynamic OD matrices. Ina Eriksson, Lina Fredriksson. 2021-06-18. Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden. Institutionen för teknik och naturvetenskap, Linköpings universitet, 601 74 Norrköping.

(2) LiU-ITN-TEK-A--21/040-SE. Data-driven methods for estimation of dynamic OD matrices. The thesis work was carried out in Electrical Engineering (Elektroteknik) at Tekniska högskolan, Linköpings universitet (the Institute of Technology at Linköping University). Ina Eriksson, Lina Fredriksson. Norrköping, 2021-06-18.

(3) Copyright. The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/. © Ina Eriksson, Lina Fredriksson.

(4) Abstract. As the number of users in the traffic network increases, transportation planners are left with the difficult task of meeting the need for accessibility while solving problems with congestion, accidents, and pollution. Online control of the traffic network, including travel demand, is an important tool that helps transportation planners make appropriate decisions and take appropriate actions. Travel demand is often described in terms of an Origin Destination (OD) matrix, which represents the number of trips from an origin zone to a destination zone in a geographic area. The idea behind this report is based on the fact that not only the number of users in the traffic network is increasing: the number of connected devices, such as probe vehicles and mobile sources, has also increased dramatically in the last decade. These connected devices provide large-scale mobility data and new opportunities to analyze the current traffic situation as they traverse the network and continuously send out different types of information, such as Global Positioning System (GPS) data and Mobile Network Data (MND). The aim of this master's thesis is to develop and evaluate a data-driven method for estimation of dynamic OD matrices using unsupervised learning, sensor fusion, and large-scale mobility data. Unsupervised learning is used to handle the effect of sparse data related to large-scale mobility data, and sensor fusion is used to combine different data sources and enable online estimation. Traditionally, OD matrices are estimated based on travel surveys and link counts. The problem is that these sources of information do not provide the quality required for online control of the traffic network. A method consisting of an offline process and an online process has therefore been developed. The offline process uses historical large-scale mobility data to improve an inaccurate prior OD matrix.
The online process uses the results and tuning parameters from the offline estimation in combination with real-time observations to describe the current traffic situation. A simulation study on a toy network with synthetic data was used to evaluate the data-driven estimation method. Observations based on GPS data, MND, and link counts were simulated with a traffic simulation tool. A new approach for creating a data-driven assignment matrix, which maps OD demand to link counts, was also tested. The results showed that the sensor fusion algorithms Kalman filter and Kalman filter smoothing can be used when estimating dynamic OD matrices. Kalman filter smoothing is the better choice when no new observations are obtained during the estimation process, which is the scenario for the offline estimation. The results also showed that the quality of the data sources used for the estimation is of high importance. More probe vehicles providing GPS data and more vehicles providing MND in the traffic network improve the estimation of dynamic OD matrices. Aggregating large-scale mobility data such as GPS data and MND with the unsupervised learning method Principal Component Analysis (PCA) improves the quality of the data and thereby the estimation results.

(5) Table of Contents
1. Introduction
   1.1 Background
   1.2 Aim
   1.3 Methodology
   1.4 Limitations
   1.5 Outline
2. Literature review
   2.1 OD matrices
   2.2 Unsupervised learning
   2.3 Sensor fusion
       2.3.1 Kalman filter
       2.3.2 Kalman filter smoothing
       2.3.3 Colored noise Kalman filter
   2.4 Previous research
       2.4.1 Estimating OD matrices using sensor fusion
       2.4.2 Unsupervised learning in the context of OD matrices
   2.5 Error metrics
3. Data-driven estimation method
   3.1 Assignment matrix
   3.2 Estimation methods
       3.2.1 Offline estimation
       3.2.2 Online estimation
   3.3 Aggregation of large-scale mobility data
   3.4 Sensitivity analysis
   3.5 Software
4. Simulation study
   4.1 Toy network layout
   4.2 Synthetic OD demand
   4.3 Traffic simulation
   4.4 Synthetic observations
       4.4.1 Link count observations
       4.4.2 GPS and MND observations
5. Results and analysis
   5.1 Aggregation of large-scale mobility data
   5.2 Offline estimation
       5.2.1 Sensor fusion method
       5.2.2 Observations
       5.2.3 Prior OD demand
       5.2.4 Summary of the sensitivity analysis for offline estimation
   5.3 Online estimation
       5.3.1 Assignment matrix
       5.3.2 OD demand used as input to online estimation
       5.3.3 Summary of the sensitivity analysis for online estimation
6. Discussion
   6.1 Synthetic data
   6.2 Sensor fusion
   6.3 Unsupervised learning
   6.4 Future work
7. Conclusion
8. Reference List

(7) List of Figures
Figure 1. Overview of the offline process and online process
Figure 2. Illustration of two PCs for a data set with observations
Figure 3. Flow chart representing the OD matrix estimation method
Figure 4. Detailed overview of the data-driven estimation methods
Figure 5. Illustration of the toy network
Figure 6. Time profile of total OD demand for the ground truth typical historical weekday
Figure 7. OD demand distributed over OD pairs and time
Figure 8. The process of how the different OD demands were simulated
Figure 9. Illustration of the toy network with loop detectors
Figure 10. The correlation between simulated link counts and link counts based on the assignment matrices (a) 𝐴𝑠𝑖𝑚𝑝𝑙𝑒𝑥 and (b) 𝐴𝑑𝑎𝑡𝑎𝑑𝑟𝑖𝑣𝑒𝑛𝑥
Figure 11. A histogram of the actual penetration rates for all OD pairs
Figure 12. Time profile of the OD demand and scaled observations for a specific OD pair
Figure 13. The time profile for OD pair 78 together with GPS and MND observations (a) 𝑇 = 10 and (b) 𝑇 = 60
Figure 14. The R² value after recreation of GPS observations depending on the number of PCs and 𝑇 = 60
Figure 15. The cumulative PVE based on GPS observations and 𝑇 = 60
Figure 16. PCA-aggregated observations together with ground truth observations for OD pair 78 (a) 𝑇 = 10 and (b) 𝑇 = 60
Figure 17. Correlation between the ground truth observations and original observations versus PCA-aggregated observations and 𝑇 = 60. (a) synthetic original observations, (b) synthetic PCA-aggregated observations
Figure 18. Biplot of PCs representing GPS observations
Figure 19. Temporal variability for the first two PCs
Figure 20. Time profile of GPS observations for two OD pairs
Figure 21. Time profile of the total demand for different types of OD demand
Figure 22. Time profile of the OD demand for the OD pair with highest OD demand
Figure 23. Error metric over time and OD pairs (a) NRMSE over time, (b) NRMSE over time and OD pairs sorted by largest total OD demand
Figure 24. Time profile of the total demand for different types of OD demand in online estimation
Figure 25. Time profile of the OD demand for the OD pair with highest OD demand with real-time observations
Figure 26. Error metric over time and OD pairs in online estimation (a) NRMSE over time, (b) NRMSE over time and OD pairs sorted by largest total OD demand

(8) List of Tables
Table 1. List of notations used for OD matrices
Table 2. List of notations used in the Kalman filter equations
Table 3. List of notations used in the Kalman filter algorithm
Table 4. R² values for observations compared to ground truth observations
Table 5. Sensitivity analysis of sensor fusion method
Table 6. Sensitivity analysis of assignment matrix
Table 7. Sensitivity analysis of APR
Table 8. Sensitivity analysis of different settings of observations
Table 9. Sensitivity analysis of the prior OD demand
Table 10. Setup for offline estimation
Table 11. Sensitivity analysis of assignment matrix
Table 12. Sensitivity analysis of OD demand used as input to online estimation

(9)

(10) 1. Introduction

As the number of users in the traffic network increases, transportation planners are left with the difficult task of meeting the need for accessibility while solving problems with congestion, accidents, and pollution. Online control of the traffic network is an important tool that helps transportation planners make appropriate decisions and take appropriate actions. Such a tool requires not only real-time data describing the current traffic situation but also a reliable and efficient method that uses the data to estimate travel demand.

Travel demand is often described in terms of an Origin Destination (OD) matrix, which represents the number of trips from an origin zone to a destination zone in a geographic area. Traffic analysis zones are in many cases created according to socio-economic criteria, and the number of zones can range from a couple of hundred to several thousand depending on the size of the city. A large number of zones yields a large OD matrix. In addition, an OD matrix often includes a temporal aspect that further increases its dimension. Because travel behavior varies over time, travel demand is better described with a dynamic OD matrix than with a static one. For example, travel demand often peaks in the morning and in the afternoon due to trips made to and from work. The large dimensions of OD matrices make it cumbersome to estimate OD demand online. Therefore, it is important to enable efficient processing for online estimation.

The idea behind this report is based on the fact that not only the number of users in the traffic network is increasing: the number of connected devices, such as probe vehicles and mobile sources, has also increased dramatically in the last decade. These connected devices provide large-scale mobility data and new opportunities to analyze the current traffic situation as they traverse the network and continuously send out different types of information. Global Positioning System (GPS) data, Mobile Network Data (MND), and Bluetooth data are examples of the types of information that devices continuously send out, depending on the area of use. One question that arises is whether there is a data-driven method that uses large-scale mobility data to estimate OD demand.

1.1. Background

The estimation of OD matrices has been a research question for many years. Traditionally, OD matrices are estimated based on travel surveys and link counts. The problem is that these sources of information do not provide the quality required for online control of the traffic network, since they do not describe the traffic situation in real time. Researchers have therefore redirected their focus to include additional data sources that provide information about the current traffic situation. License plate recognition, Bluetooth, and WiFi detection are data sources mentioned in [1] that can be used to retrieve information about the current traffic situation. The authors of [2], for example, use data produced by Bluetooth detection of mobile devices on board vehicles to estimate dynamic OD matrices. Average speeds, travel times between detectors, and link counts are all obtained from the Bluetooth detection.

Probe vehicles provide GPS data, which give information about the position of the vehicles at specific time stamps. From GPS data, trajectories, speeds, and direct OD demand observations can be retrieved. The GPS data are a reliable source of data, meaning that the position of the vehicles can be trusted. However, GPS observations are only available for a small fraction of

(11) the vehicles in the traffic network; the Average Penetration Rate (APR) of probe vehicles is estimated at 3–10% of all vehicles in the traffic network [3]–[5].

MND are another type of large-scale mobility data that contain information about travel behavior. MND are obtained from mobile phone users who are connected to the mobile network. When a mobile phone user traverses the network, the phone switches between base stations, and each switch is registered as an event. The events are stored by the mobile operators and, after some processing, OD demand can be retrieved. It is important to note that the user data stored by the mobile operators are anonymized before being processed and that privacy is carefully handled. Tallgren [6] and Breyer [7] describe how the OD demand can be extracted from MND. The start and stop of a trip are identified when a user has not moved for a longer time period. Because of the many assumptions made in the data processing, the directly observable OD demand from MND is not as reliable as that retrieved from GPS data. It is reasonable to expect that the observations retrieved from MND include errors both in time and between nearby zones. However, the fraction of mobile phone users is larger than the fraction of probe vehicles. The APR of mobile phone users is approximately 20–40%, which is related to the mobile operators' market share [1, p. 10].

For a dynamic OD matrix describing the travel patterns in a large network, the small APR of both GPS data and MND leads to OD pairs that do not get any observations. This does not mean that no vehicle uses the path, and it is important to distinguish between having no observations and observing zero travelers. The problem of missing observations is called sparse data and can be handled in many different ways. The authors of [9] suggest aggregating the data in time or applying unsupervised learning to handle sparse data.
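The effect of a low penetration rate, and of aggregating in time as suggested in [9], can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the demand values, the 5% penetration rate, the network size, and the function names are hypothetical and not taken from the thesis. Note that a count of zero here means that no probe was observed, which is not the same as knowing that the true demand is zero.

```python
import numpy as np


def aggregate_in_time(counts, factor):
    """Sum consecutive departure periods in groups of `factor`.

    counts has shape (n_od_pairs, n_periods); with factor=6, six
    10-minute periods become one 60-minute period.
    """
    n_pairs, n_periods = counts.shape
    if n_periods % factor != 0:
        raise ValueError("number of periods must be divisible by factor")
    return counts.reshape(n_pairs, n_periods // factor, factor).sum(axis=2)


def share_without_observations(counts):
    """Fraction of (OD pair, period) cells in which no traveler was observed."""
    return float(np.mean(counts == 0))


rng = np.random.default_rng(0)
# Hypothetical true demand: 50 OD pairs, 36 departure periods of 10 minutes.
demand = rng.poisson(lam=20.0, size=(50, 36))
# Each traveler is observed independently with probability 0.05 (assumed APR).
observed = rng.binomial(demand, 0.05)
observed_60 = aggregate_in_time(observed, 6)

print("share of empty cells, T = 10:", share_without_observations(observed))
print("share of empty cells, T = 60:", share_without_observations(observed_60))
```

Aggregation never increases the share of empty cells, since an aggregated cell is empty only if all of its constituent periods are empty. The price is a coarser temporal resolution.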
Link counts can be provided by loop detectors and are the most frequently used input to the OD demand estimation problem [10]. Link counts provide indirect information about the OD demand because observations on a link cannot be directly associated with an OD pair in the OD matrix. There are many different routes between an OD pair, and a link or position can therefore be used by many OD pairs. The mapping between link counts and demand in an OD matrix is a complicated process, but there are traffic assignment models that solve this problem. The aim of traffic assignment modelling is to allocate the demand of a set of OD pairs to an existing road network based on route choices. An assignment matrix describes the fraction of vehicles related to an OD pair that passes a link with a detector in the traffic network. As brought up by [11], the link counts in a traffic network can be explained by many different OD matrices, which implies that the OD matrix estimation problem is underdetermined when only link counts are used. The solution to an estimation problem therefore often includes a prior OD matrix based on, for example, travel surveys or previous estimations, where direct or indirect observations can be used to improve the prior OD matrix. In this master's thesis, the prior OD demand will be improved in an offline estimation process before being used as input to an online estimation.

Sensor fusion algorithms are widely used to combine observations with different modality. According to [7, p. 1], the definition of sensor fusion is “the combining of sensory data or data derived from sensory data from disparate sources such that the resulting information is in some sense better than would be possible when these sources were used individually”. The master's thesis will use a sensor fusion algorithm to combine different kinds of data sources, such as traditional link counts and large-scale mobility data in the form of GPS data and MND, to estimate an OD matrix.
Another positive property of some sensor fusion algorithms is that they are recursive, which is preferable for an online system.
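The underdetermination noted by [11] can be illustrated with a small numerical sketch. The assignment matrix `A`, the demand values, and the function name below are hypothetical and not taken from the thesis, and the pseudoinverse correction toward a prior is only one simple way of using a prior, not the estimation method developed here.

```python
import numpy as np

# Hypothetical assignment matrix: 2 links with detectors, 3 OD pairs.
# A[l, i] is the fraction of the demand of OD pair i that passes link l.
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]])

x_true = np.array([100.0, 40.0, 60.0])   # hypothetical OD demand
y = A @ x_true                           # resulting link counts

# [1, -2, 1] lies in the null space of A, so adding a multiple of it
# changes the OD demand without changing any link count.
null_direction = np.array([1.0, -2.0, 1.0])
x_other = x_true + 10.0 * null_direction


def closest_to_prior(A, y, x_prior):
    """Smallest (least-squares) change to x_prior that reproduces the counts y."""
    return x_prior + np.linalg.pinv(A) @ (y - A @ x_prior)


x_prior = np.array([90.0, 50.0, 50.0])   # hypothetical outdated prior
x_hat = closest_to_prior(A, y, x_prior)

print("link counts from x_true: ", y)
print("link counts from x_other:", A @ x_other)
print("estimate anchored to the prior:", x_hat)
```

Both `x_true` and `x_other` explain the link counts equally well, so the counts alone cannot decide between them. The prior selects one of the infinitely many consistent solutions, which is why its quality matters.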

(12) The problem of estimating OD matrices is commonly solved with an optimization method such as Simultaneous Perturbation Stochastic Approximation (SPSA) [13]. SPSA is an appropriate method for optimization problems with noisy objective functions. In the context of online estimation, other methods are more appropriate, for example recursive sensor fusion algorithms.

1.2. Aim

The aim of the master's thesis is to develop and evaluate a data-driven method for estimation of dynamic OD matrices using unsupervised learning, sensor fusion, and large-scale mobility data. The aim can be concretized in the following research questions:

1. How can sensor fusion be used for estimation of dynamic OD matrices?
2. How can unsupervised learning be used to improve estimation of dynamic OD matrices?
3. How do the different types of data sources affect the estimation of dynamic OD matrices?

1.3. Methodology

This master's thesis contains a literature study that covers the definition of static and dynamic OD matrices, unsupervised learning, sensor fusion, and how these methods have been used for dynamic OD matrix estimation. The knowledge acquired in the literature review is the basis for developing and evaluating a method for estimation of dynamic OD matrices. The thesis covers the whole process from an inaccurate prior OD matrix to an estimated dynamic OD matrix that describes the traffic situation in real time. The method is divided into two processes: an offline process and an online process. An overview of the processes is shown in Figure 1.

Figure 1. Overview of the offline process and online process

The idea of the offline estimation is to use historical large-scale mobility data and link counts to improve a prior OD matrix. The prior OD matrix is assumed to be outdated and not accurate enough for the online estimation.
It is of high importance that the output from the offline estimation is reliable so that the estimated historical dynamic OD matrix used for the online estimation is accurate. For the online estimation, the highest priority is efficiency. The online estimation combines the estimated historical dynamic OD matrix with real-time large-scale mobility data and link counts to estimate an OD matrix that describes the current traffic situation.

(13) One suggested sensor fusion approach for OD matrix estimation is Kalman filtering, which is used for both the offline estimation and the online estimation. To deal with the effects of sparse data, the historical and real-time large-scale mobility data are processed before being used as input to the estimation methods. Unsupervised learning is a suggested approach for this.

To evaluate the method, a simulation study based on a toy network is carried out. The proposed data-driven estimation method is compared to a benchmark method, and a sensitivity analysis is performed to see how different factors in the method affect the results. Link counts and large-scale mobility data such as MND and GPS data from a toy network are used as input for the evaluation. All data are synthetic and based on traffic simulations; this means that no empirical data are used.

1.4. Limitations

The main limitation of this master's thesis is that all data used for evaluation are synthetic. The synthetic data are perturbed to imitate reality based on previous research. Even though the synthetic data have been created according to previous research, it is difficult to imitate human choices and travel behavior. The data-driven estimation method is not tested with another network or with empirical data, and the results may be unique to this synthetic data. Another limitation is that not all available data sources are investigated. Only GPS data, MND, and link counts are used as input to the method. From these data sources, only the actual counts of OD demand and the trajectories from GPS data are used. Speeds from the GPS data are, for example, not used when creating the assignment matrix.

1.5. Outline

The thesis is structured as follows. Section 2 is the literature review, which reviews results and methods from relevant research studies.
In Section 3, the proposed method, which includes both offline estimation and online estimation, is presented. Section 4 presents a simulation study for a toy network and describes how the synthetic data were created. In Section 5, the results generated from the data-driven estimation method and the simulation setup are presented and analyzed. Section 6 discusses the results and presents future work. Section 7 is the conclusion.

(14) 2. Literature review

In this section, relevant theory for the master's thesis is presented. Section 2.1 defines static and dynamic OD matrices. Sections 2.2 and 2.3 cover theory on unsupervised learning and sensor fusion, respectively. Section 2.4 reviews previous research on estimation of OD matrices using unsupervised learning and sensor fusion. Section 2.5 presents error metrics.

2.1. OD matrices

Travel demand is often represented with an OD matrix. An OD matrix contains the number of trips from an origin node $o$ to a destination node $d$. An example of a representation of a static OD matrix is the following:

$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,D} \\ \vdots & \ddots & \vdots \\ x_{O,1} & \cdots & x_{O,D} \end{bmatrix} \tag{1}$$

where each row contains the number of trips generated from an origin $o$ and each column contains the number of trips attracted by a destination $d$. $O$ is the total number of origins and $D$ is the total number of destinations. An OD pair $i$ is a combination of an origin $o$ and a destination $d$. $I$ is the total number of OD pairs and $i = 1, \ldots, I$.

The number of trips in a city is not constant over time, and a dynamic OD matrix can be constructed to describe the travel demand over a period of time. An OD matrix for a dynamic network is defined as the number of trips that start in the departure period $h$ from an origin node $o$ and end in a destination node $d$ after some time. The dynamic OD demand can be expressed as $x_{i,h}$, where $i$ denotes the OD pair and $h$ denotes the departure period. $H$ is the total number of departure periods and $h = 1, \ldots, H$. The duration of one departure period $h$ is $T$.

Table 1. List of notations used for OD matrices

Notation    Description
$O$         Total number of origins
$D$         Total number of destinations
$I$         Total number of OD pairs
$T$         Duration of departure period (minutes)
$H$         Total number of departure periods
$o$         Origin
$d$         Destination
$i$         OD pair
$h$         Departure period
$x_{i,h}$   Demand of OD pair $i$ starting in departure period $h$
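The notation for static and dynamic OD matrices maps naturally onto arrays. The sketch below is a minimal illustration, assuming a hypothetical network with three zones, four departure periods, an invented time profile, and OD pairs defined as all combinations with different origin and destination; none of the numbers come from the thesis, and indices are 0-based in the code while the text uses 1-based indices.

```python
import numpy as np

n_origins, n_destinations = 3, 3      # O and D in the notation
n_periods, period_minutes = 4, 15     # H and T in the notation

# Static OD matrix: X[o, d] is the number of trips from origin o to destination d.
X = np.array([[0, 12, 5],
              [8, 0, 3],
              [4, 9, 0]])

trips_generated = X.sum(axis=1)   # row sums: trips generated from each origin
trips_attracted = X.sum(axis=0)   # column sums: trips attracted by each destination

# Enumerate OD pairs i = (o, d); intra-zonal pairs are skipped (an assumption).
od_pairs = [(o, d) for o in range(n_origins)
            for d in range(n_destinations) if o != d]
n_pairs = len(od_pairs)           # I in the notation

# Dynamic OD demand: x[i, h] is the demand of OD pair i in departure period h.
# A hypothetical time profile spreads each static demand over the periods.
profile = np.array([0.1, 0.4, 0.3, 0.2])
x = np.array([[X[o, d] * profile[h] for h in range(n_periods)]
              for (o, d) in od_pairs])

print("generated:", trips_generated, "attracted:", trips_attracted)
print("dynamic OD demand has shape (I, H) =", x.shape)
```

Storing the dynamic demand as an array of shape (I, H) makes the growth in dimension explicit: the number of unknowns is the number of OD pairs multiplied by the number of departure periods.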

2.2. Unsupervised learning

Machine learning methods can be divided into three categories: unsupervised learning, supervised learning, and reinforcement learning. Unsupervised learning is widely used to investigate patterns in large unlabeled data sets [14]. Supervised learning is instead used for labeled data sets. In reinforcement learning, an agent improves its behavior through rewards and punishments. Principal Component Analysis (PCA) and clustering are two examples of unsupervised learning [14]. Unsupervised learning has no correct answer to compare against, which means that the analysis of the results is explorative.

According to [14], K-means clustering and hierarchical clustering are the two best-known clustering approaches. K-means clustering assigns each observation to exactly one of a specified number of clusters. Hierarchical clustering instead results in a dendrogram.

PCA was introduced by Pearson [15] as early as 1901 and later developed by Hotelling [16] to describe the variation in a multivariate data set in terms of a set of uncorrelated variables. Since then, PCA has found application in many scientific fields where it is preferable to reduce the dimensions of data. PCA generates Principal Components (PCs), where the first PC is a vector in the direction that captures as much of the variance of the observations as possible [14]. The first PC is chosen so that, if all observations are projected onto a line, the projected observations are as close as possible to the original observations. The second PC is an orthogonal vector in the direction that explains the second largest amount of variance.

Figure 2 shows an example of how the PCs are defined for two variables. The dots are observations, and the straight line represents the first PC. This line is as close as possible to the observations and minimizes the sum of the squared distances between each point and the line. The dashed line is the second PC, which is orthogonal to the first PC and explains the second largest amount of variance.

Figure 2. Illustration of two PCs for a data set with observations.
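The geometric description of the PCs can be checked numerically. The following is a minimal sketch, assuming NumPy and a singular value decomposition of the centered data; the function name `pca` and the synthetic two-variable data set are illustrative and not taken from the thesis.

```python
import numpy as np


def pca(X):
    """PCA of a data matrix X with observations as rows and variables as columns.

    Returns the loading vectors (as columns), the projections of the
    observations on the PCs, and each PC's share of the total variance.
    """
    Xc = X - X.mean(axis=0)                      # center every variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt.T                              # column m is the direction of PC m
    scores = Xc @ loadings                       # observations projected on the PCs
    var_share = s**2 / np.sum(s**2)              # share of variance per PC
    return loadings, scores, var_share


# Two strongly correlated synthetic variables (illustrative data only).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t + 0.1 * rng.normal(size=200),
                     2 * t + 0.1 * rng.normal(size=200)])

loadings, scores, var_share = pca(X)
```

In this example the two PC directions are orthogonal unit vectors, and almost all variance falls on the first PC because the two variables are nearly proportional to each other.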

Given a data set $X = [X_1, X_2, \ldots, X_r]$ with 𝑛 observations and 𝑟 variables, $X_1, X_2, \ldots, X_r$ are vectors that each contain 𝑛 elements 𝑥. The number of PCs that can be calculated is min(𝑛 − 1, 𝑟), and the first PC is defined as:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{r1} X_r \tag{2}$$

where $Z_1$ is called the score vector and contains the elements $z_{11}, z_{21}, \ldots, z_{n1}$. The score vector describes the temporal variability for each variable in the data set and is found by projecting all data points onto each PC. $\phi_{11}, \phi_{21}, \ldots, \phi_{r1}$ are called loadings, and $\phi_1 = [\phi_{11}, \phi_{21}, \ldots, \phi_{r1}]^T$ is the loading vector, which describes how much of each variable is represented in the PC. The loading vector is normalized, meaning that $\sum_{j=1}^{r} \phi_{j1}^2 = 1$. When the variables are centered, the average of the score vector is zero. Because the loading vector of the second PC is orthogonal to that of the first PC, their score vectors are uncorrelated.

Proportion of Variance Explained (PVE) measures how much of the variance in the variables is explained by each PC [14]. A higher percentage of explained variance indicates a stronger association. Under the assumption that the variables have been centered to have zero mean, the PVE for the first PC is calculated as follows:

$$PVE_1 = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{r} \phi_{j1} x_{ij} \right)^2}{\sum_{j=1}^{r} \sum_{i=1}^{n} x_{ij}^2} \tag{3}$$

where $\phi_{j1}$ are the loadings for the first PC and $x_{ij}$ are the elements related to the variables in the data set 𝑋. The PCs depend on the scaling of the variables, and to avoid misleading results, the variables are typically scaled to have a standard deviation equal to one before performing PCA. The scaling is not necessary in scenarios where the variables are measured in the same unit.

2.3. Sensor fusion

Sensor fusion combines data of different modality, or data derived from different sources, and aims to produce information with less uncertainty than is available from the data sources individually. Sensor fusion covers several algorithms and methods, one of which is Kalman filtering. Depending on the problem formulation, different Kalman filtering algorithms can be used. The Kalman filter, Kalman filter smoothing, and the Colored noise Kalman filter have different properties and are described in sections 2.3.1, 2.3.2, and 2.3.3, respectively.

2.3.1 Kalman filter

The Kalman filter algorithm was first proposed by Rudolf Kalman in 1960 [17] and is used to estimate the state of a dynamic system. The Kalman filter algorithm is based on a state space model which represents a linear dynamical system.

The state space model consists of two equations, a process model and an observation equation:

$$c_{k+1} = F_k c_k + G_{v,k} v_k, \qquad \mathrm{cov}(v_k) = Q_k \qquad \text{(process model)} \tag{4}$$

$$y_k = H_k c_k + e_k, \qquad \mathrm{cov}(e_k) = R_k \qquad \text{(observation equation)} \tag{5}$$

The Kalman filter estimates the state $c_k$ based on the information of the last known state $c_{k-1}$ together with new observations $y_k$. The Kalman filter is therefore recursive, which makes it very efficient and appropriate for online applications [17]. The notations used in (4) and (5) are listed in Table 2.

Table 2. List of notations used in the Kalman filter equations

Symbol       Description
𝑘            Time period
$c_k$        State variable at time period 𝑘
$F_k$        Process matrix
$G_{v,k}$    Control-input matrix
$v_k$        Process noise
$y_k$        Observation variable
$H_k$        Observation matrix
$e_k$        Observation noise
$Q_k$        Covariance of the process noise
$R_k$        Covariance of the observation noise

The Kalman filter algorithm consists of two steps: time update and measurement update [12]. The recursive procedure of Kalman filtering means that both steps are performed for every new observation. The time update takes the process model into account and predicts the mean and variance of the state variables and the output variable. The measurement update uses sensor observations to correct the predictions: it compares the predicted observation with the actual sensor observation and corrects the state estimate accordingly. The covariance matrices $R_k$ and $Q_k$ describe the uncertainty of the observation model and the process model, respectively. Through the ratio between $Q_k$ and $R_k$, the reliability of the model is weighted against the reliability of the observations.

The algorithm for the Kalman filter is shown in Algorithm 1 [12].

Algorithm 1: The Kalman filter

Step 0. Initialization
Set $k = 1$, $\hat{c}_{1|0} = E(c_0)$, $P_{1|0} = \mathrm{Cov}(c_0)$ and $N$ = number of time intervals.

Step 1. Perform measurement update
$$S_k = H_k P_{k|k-1} H_k^T + R_k$$
$$K_k = P_{k|k-1} H_k^T \left( H_k P_{k|k-1} H_k^T + R_k \right)^{-1}$$
$$\varepsilon_k = y_k - H_k \hat{c}_{k|k-1}$$
$$\hat{c}_{k|k} = \hat{c}_{k|k-1} + K_k \varepsilon_k$$
$$P_{k|k} = P_{k|k-1} - K_k S_k K_k^T$$

Step 2. Perform time update
$$\hat{c}_{k+1|k} = F_k \hat{c}_{k|k}$$
$$P_{k+1|k} = F_k P_{k|k} F_k^T + G_{v,k} Q_k G_{v,k}^T$$
If $k = N$, stop. Otherwise, set $k = k + 1$ and repeat from Step 1.

Table 3. List of notations used in the Kalman filter algorithm

Symbol             Description
$\hat{c}_{k|k}$    The state estimate at time 𝑘 given observations up to and including time 𝑘
$P_{k|k}$          The estimated covariance matrix
$S_k$              The innovation (measurement pre-fit residual) covariance
$K_k$              The optimal Kalman gain
$\varepsilon_k$    The innovation (measurement pre-fit residual)

Kalman filtering is based on two assumptions [12]. If both assumptions are fulfilled, the Kalman filter is the optimal linear estimator, meaning that there is no better estimator of the state of a dynamic system. The assumptions are:
1. Both the process model and the observation model are linear.
2. The noise terms $v_k$ and $e_k$, whose covariances $Q_k$ and $R_k$ describe the uncertainty of the process model and the observation model, are independent, zero-mean, and Gaussian distributed.
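The two update steps of Algorithm 1 can be sketched in a few lines of code. This is a minimal illustration rather than the implementation used in the thesis: it assumes NumPy, time-invariant matrices 𝐹, 𝐻, 𝑄 and 𝑅, and a noise-input matrix equal to the identity. The function name `kalman_filter` and the scalar test signal are hypothetical.

```python
import numpy as np


def kalman_filter(y, F, H, Q, R, c0, P0):
    """Run the measurement and time updates for each observation in y.

    Simplifying assumptions: F, H, Q, R are constant over time and the
    noise-input matrix is the identity. Returns the filtered state
    estimates and their covariance matrices.
    """
    c_pred, P_pred = c0, P0                      # initial prediction and covariance
    estimates, covariances = [], []
    for y_k in y:
        # Measurement update: correct the prediction with the observation.
        S = H @ P_pred @ H.T + R                 # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
        innovation = y_k - H @ c_pred
        c_filt = c_pred + K @ innovation
        P_filt = P_pred - K @ S @ K.T
        estimates.append(c_filt)
        covariances.append(P_filt)
        # Time update: propagate the estimate through the process model.
        c_pred = F @ c_filt
        P_pred = F @ P_filt @ F.T + Q
    return np.array(estimates), np.array(covariances)


# Illustrative example: a nearly constant scalar state observed with noise.
rng = np.random.default_rng(1)
true_value = 5.0
y = true_value + rng.normal(scale=1.0, size=(200, 1))
est, cov = kalman_filter(
    y,
    F=np.eye(1), H=np.eye(1),
    Q=1e-6 * np.eye(1), R=np.eye(1),
    c0=np.zeros(1), P0=100.0 * np.eye(1),
)
```

A small 𝑄 relative to 𝑅 expresses high trust in the process model, so the estimate settles close to the average of the observations and its covariance shrinks as more observations arrive.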

2.3.2 Kalman filter smoothing

As an alternative to ordinary Kalman filtering, Kalman filter smoothing utilizes more than just the current observation. In [12], three types of Kalman smoothing algorithms are mentioned: Fixed-lag smoothing, Fixed-point smoothing, and Fixed-interval smoothing. Fixed-lag smoothing uses a set number of previous observations in the estimation, whereas Fixed-point smoothing aims at finding the best estimate at one selected time using all the observations individually. Fixed-interval smoothing assumes that no new observations are obtained during the estimation process and seeks the best estimates inside a time interval using all observations.

In an offline estimation situation, all measurements are available, and [12] states three different approaches that use all of them. One of the approaches is Fixed-lag smoothing with all observations included in the lag. This approach is naive, and two different forward-backward filters are suggested instead. One forward-backward filter is the Rauch-Tung-Striebel formulas in combination with Fixed-interval smoothing, which is shown in Algorithm 2.

Algorithm 2: Rauch-Tung-Striebel formula for fixed-interval smoothing

Step 1. Run the forward filter.
Using the available observations $y_k$ ($k = 1, \ldots, N$), run the Kalman filter from Algorithm 1, and store the time updates $\hat{c}_{k|k-1}$, $P_{k|k-1}$ and the measurement updates $\hat{c}_{k|k}$, $P_{k|k}$.

Step 2. Run the backward filter.
Calculate the following backwards in time:
$$\hat{c}_{k-1|N} = \hat{c}_{k-1|k-1} + P_{k-1|k-1} F_{k-1}^T P_{k|k-1}^{-1} \left( \hat{c}_{k|N} - \hat{c}_{k|k-1} \right)$$
$$P_{k-1|N} = P_{k-1|k-1} + P_{k-1|k-1} F_{k-1}^T P_{k|k-1}^{-1} \left( P_{k|N} - P_{k|k-1} \right) P_{k|k-1}^{-1} F_{k-1} P_{k-1|k-1}$$

2.3.3 Colored noise Kalman filter

The Colored noise Kalman filter reformulates the original state equation and is useful in scenarios where the observation noise is correlated. As mentioned in section 2.3.1, uncorrelated noise is one of the important assumptions in the Kalman filter. In [10], the state space model of the Colored noise Kalman filter is formulated as follows:

$$c_{k+1} = F_k c_k + G_{v,k} v_k, \qquad \mathrm{cov}(v_k) = Q_k \qquad \text{(process model)} \tag{6}$$

$$y_k = H_k c_k + \zeta_k \qquad \text{(observation equation)} \tag{7}$$

where the process model (6) corresponds to (4) and the observation equation (7) introduces a new term $\zeta_k$ that represents time-correlated observation noise. $\zeta_k$ is modeled as a first-order Gauss-Markov process as follows:

$$\zeta_{k+1} = \Psi \zeta_k + e_k \tag{8}$$

where Ψ is a correlation matrix and $e_k$ is the same noise term as in (5). To reduce the effects of correlated noise, the noise term $\zeta_k$ in the observation equation is modeled with an equation that is driven by white noise. The new representation of the noise affects all variables in the state; otherwise, the procedure of the Colored noise Kalman filter is similar to the one presented in Algorithm 1.

2.4. Previous research

OD matrices are traditionally estimated based on travel surveys and link counts, but in recent years, researchers have directed their focus towards other data sources and methods. This section covers previous research on OD matrix estimation using sensor fusion and on unsupervised learning in the context of OD matrices.

2.4.1 Estimating OD matrices using sensor fusion

To estimate an OD matrix using Kalman filtering, the problem must be formulated as a state space model. Okutani and Stephanedes [18] formulate the dynamic OD demand estimation problem directly as a state space model that predicts the actual number of trips between origins and destinations. Their process model and observation equation are formulated as follows:

$$x_{i,k+1} = \sum_{h=k-q'+1}^{k} f_{i,h} \, x_{i,h} + v_{i,k} \tag{9}$$

$$y_{l,k} = \sum_{h=1}^{k} \sum_{i=1}^{I} a_{l,k}^{i,h} \, x_{i,h} + e_{l,k} \tag{10}$$

In (9), $x_{i,k+1}$ denotes the OD demand of OD pair 𝑖 during departure period 𝑘 + 1, $f_{i,h}$ models the relationship between $x_{i,h}$ and $x_{i,k+1}$, and $q'$ is the order of the autoregressive process. In (10), $a_{l,k}^{i,h}$ is an element of the assignment matrix and contains the fraction of the OD demand $x_{i,h}$, departing in period ℎ, that passes link 𝑙 during time period 𝑘.

The patterns in an OD matrix are based on both spatial and temporal distributions as well as on characteristics of the traffic system. A problem brought up by [11] is that the assumptions behind Kalman filtering do not hold for OD matrices. One such assumption is that the system is linear, which does not hold for congested networks. Whether the assumptions reflect reality is crucial to the success of the Kalman filter. However, as the authors of [11] also state, many real-life models are not linear and therefore do not satisfy the assumptions, and Kalman filtering still results in unbiased estimates for these models.

The approach suggested by Okutani and Stephanedes [18] is also used by Barceló et al. [2], who implement a Kalman filter with success. This is because the prediction interval is narrowed to 1–3 seconds to be able to detect congestion. In addition, the observations are travel times. The tests use a noninformative OD pattern initialization and two fixed OD patterns with static OD flows. The results are successful under both uncongested and congested conditions, but the authors suggest, as future work, that the models should be extended to use an adaptive time-varying model.

Ashok and Ben-Akiva [19] were the first to introduce a redefinition of the state and observation variables to fit the Kalman filter better. The idea is to estimate deviations from historical patterns

instead of OD demands. This approach filters out the structural variations over space and time using historical data. Their process model and observation equation are formulated as follows:

$$x_{k+1} - \tilde{x}_{k+1} = \sum_{h=k-q'+1}^{k} f_k^h \left( x_h - \tilde{x}_h \right) + d_h + v_k \tag{11}$$

$$y_k - \tilde{y}_k = \sum_{h=k-p'}^{k} a_k^h \left( x_h - \tilde{x}_h \right) + b_h + e_k \tag{12}$$

In (11), the state vector is instead defined as the difference between an OD matrix $x_h$ and a prior OD matrix $\tilde{x}_h$, where ℎ is the departure period. $f_k^h$ models the relationship between $x_h$ and $x_{k+1}$, and $q'$ is the order of the autoregressive process. In order to include deviations from more than the preceding time period, the state is constructed utilizing a prior OD demand $\tilde{x}_h$. In (12), the maximum number of departure periods needed to travel between any OD pair is denoted by $p'$. $d_h$ models the time lag effect with $d_h = \sum_{p=k-q'+1}^{k-1} f_h^p \left( \hat{x}_p - \tilde{x}_p \right)$, and $b_h$ models the time lag effect with $b_h = \sum_{p=k-p'}^{k-1} a_h^p \left( \hat{x}_p - \tilde{x}_p \right)$, where $\hat{x}_p$ is the estimate of $x_p$. In this way, vehicles that depart in a previous departure period and are detected in a later time period are included in the observation equation as well. This approach overcomes the difficulty of approximating the state estimation error, since the distribution of the deviations is more symmetric and fits a normal distribution better.

Later, Ashok and Ben-Akiva extended the model with additional measurement and transition equations to introduce stochasticity in the assignment matrix [20]. This models the error that is introduced in the dynamic OD estimation process. The assignment matrix created by Ashok and Ben-Akiva depends on link and path travel times and on route-choice fractions. These sources of information are not known with certainty and therefore introduce an error to the OD estimation process.

In the approach of Zhou and Mahmassani [21], deviations of an OD matrix are extracted using a polynomial trend filter. The idea is to use a Kalman filter similar to the one in Ashok and Ben-Akiva [20], but with captured demand deviations from prior demand estimates describing the structural deviations in demand. One drawback of the polynomial filter is that it is computationally expensive for large-scale networks.

Barceló et al. [22] also formulate a state space model similar to that of Ashok and Ben-Akiva [20]. Their state space model differs in that they avoid the assignment matrix and instead use a subset of the most likely OD path flows. The input data to the model are travel times extracted from vehicles equipped with Bluetooth devices.

2.4.2 Unsupervised learning in the context of OD matrices

Previous research shows that PCA can be used as a tool to find underlying spatiotemporal patterns with a dynamic OD matrix as input. Montero et al. [23] use PCA to identify underlying structures and relationships in a dynamic OD matrix and state that the results from PCA can validate whether an OD matrix coincides with the known mobility patterns within a specific area. Hourly patterns and zones with similar patterns regarding generated and attracted trips are identified. The method proposed in [23] can, for example, be used when new data sources need to be validated before being used within a traffic model.

Djukic et al. [24] also show how the results from PCA can be used to reveal underlying temporal patterns in dynamic OD matrices. The purpose of [24] is to reduce the computational cost that comes with large matrices without any significant loss of accuracy. In the paper, the temporal trends of the PCs are captured by mapping OD demand to each PC. The paper proposes three steps to process a historical dynamic OD matrix with PCA. The first step is to identify temporal variability patterns and classify the OD demand into temporal trends. The second step is to reduce the dimensions by selecting a subset of OD pairs or PCs that includes the structural trends. The third step is to use the reduced OD matrix or a chosen subset of OD pairs in an estimation model.

In [1, Ch. 5], Djukic continues the work of estimating OD matrices with a Kalman filter approach by aggregating OD demand with PCA for efficient processing. PCA preserves structural patterns while the data are dramatically reduced, which lowers the computational cost. The data are instead described with principal components, and the Kalman filter is reformulated to take correlated noise into account. The results show that the computational cost can be significantly reduced, but at the expense of estimation accuracy. Djukic discusses the trade-off between model performance and computational efficiency, where the main purpose of the application decides which of the two is prioritized. The main purpose in [1, Ch. 5] is online estimation, which means that computational efficiency is prioritized over model performance. The conclusion is that the model performance is acceptable given the efficient computational time. For future work, Djukic suggests adapting the model to take additional sensor data into account, which could improve the quality of the estimation.

2.5. Error metrics

A number of methods can be used to evaluate the performance of OD demand estimation algorithms. [25] distinguishes two types of performance indicators: statistical performance and computational efficiency. Statistical performance indicators applied in [25] include Root Mean Square Error (RMSE), Normalized Root Mean Square Error (NRMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Absolute Normalized Error (MANE), Mean Error (ME), and the coefficient of determination, or R-squared ($R^2$). The divergence between the estimated outputs and their ground truth is used to evaluate the overall performance of the OD demand estimation method. Each performance indicator represents only a part of the information contained in the OD demand, and it is therefore of high importance to consider multiple indicators to assess the overall performance [10]. The computational efficiency is evaluated by measuring the Central Processing Unit (CPU) time in [10].

The error metrics presented by [25] are used by many other researchers as well. [19] uses RMSE and NRMSE to compare the OD demand resulting from three different assignment matrices with historical OD demand. [10] uses RMSE and MAE, whereas [1] instead uses NRMSE and $R^2$.

NRMSE of an estimated OD matrix with respect to the ground truth OD matrix describes the overall error of the estimation [10]. The NRMSE is always non-negative, and a lower value is better than a higher value. The total NRMSE is calculated as follows:

$$NRMSE = \frac{\sqrt{\frac{1}{I \times H} \sum_{i=1}^{I} \sum_{h=1}^{H} \left( \hat{x}_{i,h} - \bar{x}_{i,h} \right)^2}}{\mathrm{mean}(\bar{x})} \tag{13}$$

where the notation is the same as in Table 1. $\hat{x}_{i,h}$ is the estimated OD demand for OD pair 𝑖 departing in time period ℎ, and $\bar{x}_{i,h}$ is the OD demand representing the ground truth.

MAPE is an average error of the estimated OD matrix [10]. The total MAPE is calculated as follows:

$$MAPE = \frac{1}{I \times H} \sum_{i=1}^{I} \sum_{h=1}^{H} \left| \frac{\hat{x}_{i,h} - \bar{x}_{i,h}}{\bar{x}_{i,h}} \right| \times 100 \tag{14}$$

where the notation is the same as in Table 1. A low MAPE indicates a better estimated OD demand than a high MAPE.

$R^2$ is a measure that describes how much of the variance in the OD matrix that represents the ground truth is explained by the estimated OD matrix. The total $R^2$ is calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{I} \sum_{h=1}^{H} \left( \bar{x}_{i,h} - \hat{x}_{i,h} \right)^2}{\sum_{i=1}^{I} \sum_{h=1}^{H} \left( \bar{x}_{i,h} - \mathrm{mean}(\bar{x}_{i,h}) \right)^2} \tag{15}$$

If the estimated OD matrix matches the ground truth exactly, $R^2 = 1$. $R^2 < 0$ if a constant average value in the OD matrix gives a better result than the estimated OD matrix.

Another statistical measure, used in [26], is the Total Demand Deviation (TDD). TDD gives the percentage of the total difference between two demands. TDD is calculated as follows:

$$TDD = \frac{\sum_{i=1}^{I} \sum_{h=1}^{H} \hat{x}_{i,h} - \sum_{i=1}^{I} \sum_{h=1}^{H} \bar{x}_{i,h}}{\sum_{i=1}^{I} \sum_{h=1}^{H} \bar{x}_{i,h}} \times 100 \tag{16}$$
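The four metrics NRMSE, MAPE, R-squared and TDD can be computed directly from an estimated and a ground truth OD matrix. The sketch below is an illustration in plain Python, not code from the thesis. The function names and the small example matrices are hypothetical, matrices are stored as nested lists with one row per OD pair and one column per departure period, and MAPE is assumed to be evaluated only where the ground truth demand is non-zero.

```python
from math import sqrt


def _flatten(matrix):
    """Turn an I x H nested list into a flat list of I*H values."""
    return [value for row in matrix for value in row]


def nrmse(est, truth):
    e, t = _flatten(est), _flatten(truth)
    mse = sum((a - b) ** 2 for a, b in zip(e, t)) / len(t)
    return sqrt(mse) / (sum(t) / len(t))         # RMSE divided by mean ground truth


def mape(est, truth):
    # Assumption: no ground-truth entry is zero (the ratio is undefined there).
    e, t = _flatten(est), _flatten(truth)
    return 100 * sum(abs((a - b) / b) for a, b in zip(e, t)) / len(t)


def r_squared(est, truth):
    e, t = _flatten(est), _flatten(truth)
    mean_t = sum(t) / len(t)
    ss_res = sum((b - a) ** 2 for a, b in zip(e, t))
    ss_tot = sum((b - mean_t) ** 2 for b in t)
    return 1 - ss_res / ss_tot


def tdd(est, truth):
    e, t = _flatten(est), _flatten(truth)
    return 100 * (sum(e) - sum(t)) / sum(t)


# Illustrative 2 x 2 example (two OD pairs, two departure periods).
truth = [[10, 20], [30, 40]]
est = [[12, 18], [33, 36]]
scores = {
    "NRMSE": nrmse(est, truth),
    "MAPE": mape(est, truth),
    "R2": r_squared(est, truth),
    "TDD": tdd(est, truth),
}
```

In this example the total estimated demand is 99 against a true total of 100, so TDD is −1 %, while MAPE is 12.5 %. This illustrates why several indicators are considered together: errors of opposite sign cancel in TDD but not in MAPE or NRMSE.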

3. Data-driven estimation method

This section begins with a flow chart of the data-driven estimation method. This is followed by three sections focusing on the offline process, the online process, and the sensitivity analysis separately. Finally, the software used in the method is described.

Figure 3 illustrates a flow chart of the data-driven estimation method, which is divided into an offline process and an online process. The offline process includes the creation of synthetic data, where OD demand and observations are simulated, and an offline estimation of dynamic OD demand. In the online process, outputs from the offline estimation are combined with real-time observations to perform online estimation of OD demand.

Figure 3. Flow chart representing the OD matrix estimation method

Synthetic OD demand and observations are created as a part of the offline process. The creation of synthetic data is specific to this master thesis and is not part of the offline process when empirical data are used. The traffic simulation and the creation of the synthetic OD demand and observations are described in more detail in section 4.

The process including synthetic data starts with a given typical historical OD demand, $\bar{x}_{i,h}$. This OD demand is represented by an OD matrix that describes the demand for a typical historical weekday between 05:00 and 22:00, with a departure period duration of 𝑇 = 10 minutes. A perturbation is made on the typical historical OD demand to create the prior OD demand, $\tilde{x}_{i,h}$. The prior OD demand is assumed to be based on travel surveys, which cannot describe the travel

demand over time. The goal of the perturbation is therefore to simplify the prior OD demand so that it does not reflect travel demand over time very well.

To create historical and real-time OD demand $\ddot{x}_{i,h}$ and observations $y_h$ that reflect weekday variations, temporal variations are added to the typical historical OD demand. Unlike the perturbation made for the prior OD demand, daily variations are added to simulate variance in travel behavior between different weekdays. The historical OD demand is assumed to describe reality, in terms of the actual OD demand for historical weekdays. Each weekday is represented by its own OD matrix and is used to create historical observations.

With the historical OD demand and the real-time OD demand as input to a traffic simulation, historical observations and real-time observations are generated. As Figure 3 illustrates, the historical observations are used in the offline estimation and the real-time observations are used in the online estimation. The historical and real-time demand are not available to the estimation methods. The estimated historical OD demand is dynamic and is used as input to the online estimation. The online estimation utilizes real-time observations to create an estimated OD demand that describes the current traffic situation.

In both evaluations, the estimated OD matrices are compared with a ground truth. This means that in the offline estimation, the estimated historical OD matrix is compared with the typical historical OD demand, and in the online estimation, the estimated OD matrix is compared with the real-time OD demand. Since the data are synthetic, a ground truth exists and can be used for evaluation. This is not the case in a scenario where empirical data are used.

Different combinations and varied quality of the observations are used to evaluate the proposed online estimation method. The proposed method is also compared to a benchmark method. This is further described in section 3.3, which describes the sensitivity analysis. Figure 4 shows a more detailed overview of how both the offline estimation and the online estimation work.

Figure 4. Detailed overview of the data-driven estimation methods

As Figure 4 shows, link counts, large-scale mobility data, prior OD demand, and the assignment matrix are input to the data-driven OD demand estimation, both in the offline estimation and

the online estimation. From the large-scale mobility data there are two paths, indicating that the observations can be aggregated either with PCA or in time before being used as input to the estimation. When the observations are aggregated, they are processed to reduce the effect of sparse data. This is further described in section 3.3. The output from the estimation method is the estimated OD demand.

3.1. Assignment matrix

By utilizing the trajectories from probe vehicles that generate GPS data, two types of assignment matrices can be created, and both are described in this section: one very simple, $A^{simple}$, and one, $A^{datadriven}$, that is created according to the data-driven methods suggested by [26]. The aim of both assignment matrices is to map link counts to OD demand.

The simple assignment matrix $A^{simple}$ is created with the assumption that all vehicles choose the route with the shortest free flow travel time. The trajectories from probe vehicles can be used to decide the route with the shortest free flow travel time for each OD pair. $A^{simple}$ has the dimensions (𝐿 × 𝐼), where 𝐿 is the total number of loop detectors and 𝐼 is the total number of OD pairs. If the route for an OD pair 𝑖 passes a loop detector 𝑙, the entry (𝑙, 𝑖) in the assignment matrix is set to 1, and otherwise to 0. $A^{simple}$ is constant for all departure periods ℎ and does not consider the time lag effect.

It is more reasonable to assume that the route choices are affected by other vehicles within the network and that the route choice fractions vary between the departure periods ℎ. A more reliable assignment matrix therefore describes the fractions of OD demand that departed at time ℎ and passed link 𝑙. Recently, [26] presented work based on the hypothesis that GPS data from probe vehicles include not only trajectories, but also spatial-temporal variations of congestion. The assignment matrix $A^{datadriven}$ in [26] is constructed by utilizing data-driven network loading.
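The construction of the simple assignment matrix described earlier in this section can be sketched as follows. The example is illustrative only: the function name `simple_assignment_matrix`, the zero-based indices, and the three hypothetical routes are assumptions, and each route is given directly as the set of loop detectors it passes instead of being derived from probe vehicle trajectories.

```python
def simple_assignment_matrix(routes, num_detectors):
    """Build a binary (L x I) matrix from one fixed route per OD pair.

    routes[i] is the set of loop detector indices passed by the
    shortest free flow route of OD pair i (zero-based indices).
    """
    A = [[0] * len(routes) for _ in range(num_detectors)]
    for i, detectors in enumerate(routes):
        for l in detectors:
            A[l][i] = 1          # entry (l, i) is 1 if the route of pair i passes l
    return A


# Hypothetical network with L = 3 detectors and I = 3 OD pairs.
routes = [{0, 2}, {1}, {0, 1}]
A_simple = simple_assignment_matrix(routes, num_detectors=3)

# Link counts implied by an OD demand vector for one departure period.
demand = [10, 5, 8]
link_counts = [sum(a * x for a, x in zip(row, demand)) for row in A_simple]
```

Because the matrix is binary and constant over time, every trip of an OD pair is counted on each detector along its route in the same departure period, which is the simplification that ignores route choice fractions and the time lag effect.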
The difference between the two assignment matrices is that $A^{datadriven}$ reflects route choices and propagation effects. $A^{datadriven}$ is based on the historical probe vehicle trajectories. The assignment matrix produced in the offline process is therefore used in the online process as well. [26] states that a higher APR for GPS data improves the accuracy of the assignment matrix.

$A^{datadriven}$ is a matrix that includes the individual assignment matrices $A_h^{datadriven}$ for all demand periods. $A^{datadriven}$ has the dimensions (𝐿𝑆 × 𝐼𝐻), where 𝑆 is the number of detection periods, which is assumed to be the same as the number of departure periods 𝐻. $A_h^{datadriven}$ has the dimensions (𝐿 × 𝐼) and includes the rows (ℎ − 1)𝐿 + 1 to ℎ𝐿 and the columns (ℎ − 1)𝐼 + 1 to ℎ𝐼 from $A^{datadriven}$. With this notation, the link counts from the departure time ℎ are mapped to the OD pairs at the same departure time.

The assignment matrix $A_h^{p,datadriven}$ considers the time lag process in the Kalman filter by mapping the link count observations at previous times 𝑝 to OD demand that departs at time ℎ. $A_h^{p,datadriven}$ has the dimensions (𝐿 × (ℎ − 1)𝐼) and includes the rows (ℎ − 1)𝐿 + 1 to ℎ𝐿 and the columns 1 to (ℎ − 1)𝐼 from $A^{datadriven}$. With this notation, the link counts from the departure time ℎ are mapped to all previous departure periods for all OD pairs.

3.2. Estimation methods

This section describes the offline estimation and the online estimation.

3.2.1 Offline estimation

The OD demand estimation problem is written in state space form and, in accordance with (4) and (5), formulated as:

$$\varphi c_{k+1} = F_k \, \varphi c_k + v_k, \qquad \mathrm{cov}(v_k) = Q_k \tag{17}$$

$$\varphi y_k = H_k \, \varphi c_k + b_k + e_k, \qquad \mathrm{cov}(e_k) = R_k \tag{18}$$

where (17) is the process model and (18) is the observation equation. $F_k$ is a process matrix that models the relationship between $c_{k+1}$ and $c_k$. $H_k$ is an observation matrix that connects the observations $y_k$ to the state variables in $c_k$. $b_k$ corresponds to the link observations during prior demand periods. The time lag effect on the state variables is modeled with backward filtering.

The problem formulation fulfills the Kalman filter assumption that the process and observation models are linear, and according to [11] the noise terms described by $Q_k$ and $R_k$ can be assumed to be independent, zero-mean, and Gaussian distributed for OD demand if they are specified as the sum of squared errors. Because the Kalman filter assumptions are fulfilled and the estimation is done offline with all observations available, both the Kalman filter and the Kalman filter smoothing algorithm are chosen to solve the offline estimation problem.

The state vector is defined as the difference between the OD demand $x_{i,h}$ and a prior OD demand $\tilde{x}_{i,h}$, which means that what is estimated is how much the OD demand deviates from the historical pattern. Since it is the deviation that is estimated, the prior OD demand needs to be added to the deviation in order to reconstruct the estimated historical OD matrix. This approach is inspired by Ashok and Ben-Akiva [19] and is used to avoid estimating an underdetermined problem. The state is formulated as:

$$\varphi c_k = \hat{x}_k - \tilde{x}_k \tag{19}$$

where $\hat{x}_k$ is a vector that includes all estimated OD demand $\hat{x}_{i,k}$, and $\tilde{x}_k$ is a vector that includes all OD demand $\tilde{x}_{i,k}$ from the prior OD demand. Written out, $\varphi c_k$ is as follows:

$$\varphi c_k = \begin{bmatrix} \hat{x}_{1,k} \\ \hat{x}_{2,k} \\ \hat{x}_{3,k} \\ \vdots \\ \hat{x}_{I,k} \end{bmatrix} - \begin{bmatrix} \tilde{x}_{1,k} \\ \tilde{x}_{2,k} \\ \tilde{x}_{3,k} \\ \vdots \\ \tilde{x}_{I,k} \end{bmatrix} \tag{20}$$

The size of the vector $\varphi c_k$ is (𝐼 × 1), where 𝐼 is the total number of OD pairs. The observations are also defined as a difference:

$$\varphi y_k = y_k - \tilde{y}_k \tag{21}$$

where $\tilde{y}_k$ are synthetic observations from the prior OD demand $\tilde{x}_k$ and $y_k$ are the historical observations. The size of the vector $\varphi y_k$ depends on how many sources of observations are included, the number of OD pairs 𝐼, and how many links have loop detectors. The size of the vector $\varphi y_k$ also depends on how many observations of the same kind are considered at each time 𝑘. Using

observations for several days will increase the size, as the observations for time 𝑘 are grouped together. An example of $y_k$ is as follows:

$$y_k = \begin{bmatrix} y_k^{GPS} \\ y_k^{MND} \\ y_k^{LC} \end{bmatrix} \tag{22}$$

where $y_k^{GPS}$ are the observations obtained from probe vehicles, $y_k^{MND}$ are the observations obtained from MND, and $y_k^{LC}$ are the observations from link counts.

The observations $y_k^{GPS}$, $y_k^{MND}$ and $y_k^{LC}$ are represented as follows:

$$y_k^{GPS} = \begin{bmatrix} y_k^{1,GPS} \\ \vdots \\ y_k^{I,GPS} \end{bmatrix}, \qquad y_k^{MND} = \begin{bmatrix} y_k^{1,MND} \\ \vdots \\ y_k^{I,MND} \end{bmatrix}, \qquad y_k^{LC} = \begin{bmatrix} y_k^{1,LC} \\ \vdots \\ y_k^{L,LC} \end{bmatrix} \tag{23}$$

where $y_k^{i,GPS}$ and $y_k^{i,MND}$ describe the OD demand at time 𝑘 for OD pair 𝑖. The sizes of $y_k^{GPS}$ and $y_k^{MND}$ are each (𝐼 × 1), where 𝐼 is the total number of OD pairs. $y_k^{l,LC}$ describes the link flow at time 𝑘 for link 𝑙. 𝐿 is the total number of observable links and 𝑙 = 1, … , 𝐿. The size of $y_k^{LC}$ is (𝐿 × 1).

The time lag for the link count observations is modeled as:

$$b_k = A_k^p \left( \hat{x}_p - \tilde{x}_p \right) \tag{24}$$

where $A_k^p$ is the assignment matrix that maps the link count observations at previous times 𝑝 to OD demand that departs at time 𝑘, $\hat{x}_p$ is the estimated OD demand at all previous times 𝑝, and $\tilde{x}_p$ is the prior OD demand for all previous times 𝑝. Note that (24) is ignored when $A^{simple}$ is chosen, since that assignment matrix is static and does not consider the time lag.

The size of the vector $\varphi y_k$ that includes observations for one historical weekday is (𝑌 × 1), where 𝑌 = (𝐼 + 𝐼 + 𝐿). The size of the vector $\varphi y_k$ grows linearly as more historical weekdays are added. With two historical weekdays, 𝑌 = 2(𝐼 + 𝐼 + 𝐿).

The observation matrix $H_k$ has the size (𝑌 × 𝐼) and includes different elements depending on the type of observations and the departure period 𝑘. $H_k$ is described as follows:

$$H_k = \begin{bmatrix} eye(I) \\ eye(I) \\ A_k \end{bmatrix} \tag{25}$$

where the first two identity matrices are a direct mapping between the observations and the OD demand. $A_k$ is one of the assignment matrices mentioned in section 3.1, with the size (𝐿 × 𝐼), and maps the link counts to the state.
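The block structure of the observation matrix in (25) can be illustrated with a small numerical sketch. The sizes, the assignment matrix and the deviation vector below are hypothetical values chosen for illustration, and NumPy is assumed. The example covers one historical weekday, so 𝑌 = 𝐼 + 𝐼 + 𝐿.

```python
import numpy as np

I, L = 3, 2                                   # hypothetical numbers of OD pairs and links

# Hypothetical (L x I) assignment matrix for one departure period.
A_k = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])

# Stack the two direct mappings (GPS and MND) on top of the link count mapping.
H_k = np.vstack([np.eye(I), np.eye(I), A_k])  # shape (I + I + L, I)

# A hypothetical state: deviations of the OD demand from the prior demand.
deviation = np.array([2.0, -1.0, 0.5])

# Observation deviations implied by the state, ignoring noise and time lag.
predicted = H_k @ deviation
```

The first 2𝐼 entries of `predicted` repeat the state itself, since the GPS and MND observations observe the OD demand directly, while the last 𝐿 entries are the link count deviations obtained through the assignment matrix.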
The process matrix $F_k$ is an identity matrix with size (𝐼 × 𝐼), and $Q_k$ and $R_k$ are diagonal matrices with the sizes (𝐼 × 𝐼) and (𝑌 × 𝑌), respectively. All observations are available in the

(29) offline process, and it is possible to calculate how much the observations from the prior OD demand deviates from the simulated historical observations. The diagonal in 𝑅- is set to the calculated variance between different historical weekdays. For a scenario where empirical data are used, the historical OD demand is not known in the offline process and therefore it is not possible to calculate the variances for 𝑄- . Therefor, 𝑄- is tuned with a constant value in the diagonal to find a relationship between the trust in the model versus the observations. Step 0 in Algorithm 1 requires an initialization of the covariance matrix 𝑃3 and the estimate 𝑥w3 . 𝑃3 is usually initialized as a matrix with large diagonal entries, reflecting the fact that the estimate of 𝑥w3 is highly uncertain [11]. Since the data is synthetic and the true OD demand is available, 𝑥w3 is set to the true OD demand deviation and the diagonal entries in 𝑃3 are therefore set to very small values to reflect a correct estimation. The estimated OD demand deviations 𝜑𝑥- are added to the prior OD demand 𝑥v- to get the estimated historical OD demand, 𝑥w . When using the Kalman filter, it is possible to retrieve a negative estimation of the OD demand. If the estimated OD demand deviations are larger than the corresponding prior OD demand, the final OD demand will be negative. Negative OD demand does not exist, and if this situation occur, the OD deviation is interpreted as a low OD demand and the final estimated OD demand is set to zero. 3.2.2 Online estimation The online estimation has many similarities with the offline estimation, but the main difference is that the online estimation does not have access to observations in the future. This affect the choice of sensor fusion method, since it is not possible to use Kalman filter smoothing. Only the standard Kalman filter is used in the online estimation. 
The definition of the state space form presented in (17) and (18) also holds for the online estimation. Equations (19), (21), and (24) are reformulated to include the estimated historical OD demand and observations instead of the prior OD demand and observations. The initial estimated deviation 𝑥̂₀ required in Algorithm 1 is set to the negative of the estimated historical OD demand, which corresponds to guessing that the OD demand is zero in the early morning. The covariance matrix 𝑅- and the assignment matrix 𝐴^datadriven are both based on historical observations and trajectories, and are therefore reused in the online estimation. The tuning of the covariance matrix 𝑄- made in the offline estimation is also reused in the online estimation. In addition, the observations are only aggregated in time, not with PCA, since PCA demands a large set of observations, which is not available in an online situation.

3.3 Aggregation of large-scale mobility data

To handle the effect of sparse data, the large-scale mobility data, GPS and MND, are aggregated. As [9] mentions, this can be done by aggregating the observations in time or by performing PCA. With PCA, the large-scale mobility data are aggregated both in time and across OD pairs, and are expressed as PCs instead of OD demand. The main goal of the PCA is to reduce the variations due to sparse data. It is important to understand what the PCs consist of in order to analyze the PCA-aggregated observations properly. The results from the PCA are therefore used to create plots that help analyze and understand the results of the dimension reduction. The plots are inspired by [10] and investigate spatial and temporal patterns.
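The PCA aggregation can be sketched as follows in Python with NumPy (an illustration, not the thesis' MATLAB code). The input is assumed to be a matrix with OD pairs as rows and departure periods as columns. The observations are approximated from the first `n_components` PCs, and the proportion of variance explained (PVE) is returned for each PC. Centering the columns before the decomposition and adding the mean back afterwards is an assumption of this sketch. The function name and arguments are hypothetical.

```python
import numpy as np


def pca_aggregate(obs, n_components):
    """Approximate obs (n OD pairs x r periods) from its first PCs.

    Returns the PCA-aggregated observations and the PVE of each PC.
    """
    mean = obs.mean(axis=0)
    centered = obs - mean
    # Rows of Vt are the loading vectors, ordered by explained variance.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    loadings = Vt.T
    scores = centered @ loadings
    variances = s ** 2
    pve = variances / variances.sum()
    m = n_components
    # Scores times transposed loadings, restricted to the first m PCs.
    aggregated = scores[:, :m] @ loadings[:, :m].T + mean
    return aggregated, pve
```

Using all PCs returns the original observations up to rounding, while a small `n_components` keeps only the dominant patterns and discards the remaining variation.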

PCA is performed on the GPS observations and the MND observations separately, after they have been scaled up to represent the whole population. Before applying PCA, each dataset is arranged as a matrix of size (𝐼 × 𝐻), so that the OD pairs are treated as observations 𝑛 and the departing periods are treated as variables 𝑟. The PCA results in min(𝑛 − 1, 𝑟) PCs. To understand what the PCs consist of, the PVE is calculated for each PC to examine how much of the variance is explained by each PC. To investigate spatial and temporal patterns, two PCs are used to create a biplot. The values from the loading vectors 𝜙 and the score vectors 𝑍 are represented in a two-dimensional space to show which spatial and temporal aspects are included in the selected PCs. From this analysis, the number of PCs to use, 𝑀, is decided.

The observations are recreated before being used as input to the estimation methods. By using all PCs, the exact observations can be recreated. By using only the first 𝑀 PCs, PCA-aggregated observations are created. The observations are recreated by multiplying the rotated data with the variable loadings as follows:

\[
y^{\text{PCA-aggregated}} = \begin{bmatrix} Z_1 & \dots & Z_M \end{bmatrix} \times \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_M \end{bmatrix} \qquad (26)
\]

For sparse real-time observations, aggregation in time is the only option used within this master thesis, since PCA demands a large set of observations, which is not available in an online situation.

3.4 Sensitivity analysis

It is of interest to evaluate and investigate how different factors affect the proposed estimation method. A sensitivity analysis is carried out separately for the offline estimation and the online estimation, and shows which factors are most important for the final data-driven estimation method.
According to [22], the quality of the estimation depends on three factors: the penetration rate of the observations, the detection layout, and the quality of the prior OD matrix that is used as an initial estimate. The penetration rate of large-scale mobility data is the only factor that cannot be controlled in reality [1]. For the offline estimation, many factors can be varied and tested, but three main factors have been identified:

• the choice of sensor fusion method and assignment matrix,
• the quality of the observations,
• the quality of the prior OD demand.

The two sensor fusion methods, Kalman filter and Kalman filter smoothing, are tested to investigate how much Kalman filter smoothing improves the results. The performance of the sensor fusion methods is also affected by the choice of assignment matrix and the tuning of 𝑄 and 𝑅, and it is therefore also of interest to investigate how sensitive the methods are to tuning.

The sensor fusion methods are also compared to a simple benchmark method that estimates the OD demand as the average value of the GPS and MND observations.

Since all observations are synthetic, it is possible to vary how well the observations reflect the ground truth observations. The type of data source, the APR of GPS data and MND, and the method for handling sparse data are examples of factors that can be varied to change the quality of the observations.

How well the prior OD demand reflects the typical historical OD demand may also affect the results. Different qualities of prior OD demand are used to investigate how sensitive the offline estimation is to the prior OD demand.

As for the offline estimation, many factors can be varied in the online estimation as well. Two main factors have been identified:

• the choice of assignment matrix,
• the quality of the historical OD demand used as input to the online estimation.

The standard Kalman filter is the only method used in the online estimation. Since the tuning of 𝑄 and 𝑅 is done in the offline estimation, these variables are not investigated for the online estimation. Instead, the effect of the assignment matrix on the sensor fusion method is investigated.

[25] states that the OD matrices used as input to a traffic model often include low-quality information. Because of this, it is hard to evaluate what causes the estimation errors. Errors can, for example, be caused by modeling mistakes or incorrectly calibrated models. The quality of the OD matrix that is used as input to the online estimation is therefore a key factor, and different qualities of the estimated historical OD demand are tested to see how sensitive the online estimation is to low-quality information. An OD matrix with constant values does not reflect the ground truth OD demand and therefore has lower quality than an OD matrix that better represents the ground truth OD demand.

3.5 Software

MATLAB [27] is a software environment for matrix manipulation, plotting of functions and data, and implementation of algorithms. In this master thesis, MATLAB is mainly used to implement the Kalman filter algorithms and the dimensionality reduction. The Kalman filter algorithms are implemented from scratch, while the built-in function "pca" simplifies the implementation of PCA and dimensionality reduction. MATLAB is also used to handle the results from the traffic simulation made in Aimsun Next.

Aimsun Next is a traffic modelling software offering services for traffic planning and simulation [28]. For experiments where it is not possible to derive OD demand using traditional methods such as travel surveys, OD demand and observations can be retrieved by simulation. Aimsun Next can be used to build a model that represents a digital twin of the traffic network in a city. OD demand can be simulated and assigned to the traffic network. Dynamic User Equilibrium (DUE) is used to assign OD demand to a traffic network.

For simulation tools in general, warm-up periods are used. A warm-up period is necessary for a model to mimic a system that is not empty at the beginning of the simulation period. If no warm-up period is used, it is assumed that the model is empty when the simulation starts. Another general setting for simulation tools is the number of replications. A replication is a repeated run of a simulation experiment and is necessary when the experiment is based on
