
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

A Multi-Target Graph-Constrained HMM Localisation Approach using Sparse Wi-Fi Sensor Data

SIMON DANIELSSON
JAKOB FLYGARE

KTH ROYAL INSTITUTE OF TECHNOLOGY

Degree Projects in Optimization and Systems Theory (30 ECTS credits)
Degree Programme in Industrial Engineering and Management
KTH Royal Institute of Technology, 2018
Supervisor at IBM: Holger Hellebro
Supervisor at KTH: Johan Karlsson
Examiner at KTH: Johan Karlsson

TRITA-SCI-GRU 2018:287
MAT-E 2018:66

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Abstract

This thesis explored the possibilities of using a Hidden Markov Model approach for multi-target localisation in an urban environment, with observations generated from Wi-Fi sensors. The area is modelled as a network of nodes and arcs, where the arcs represent sidewalks in the area and constitute the hidden states of the model. The output of the model is the expected number of people on each road segment throughout the day. In addition, two methods for analysing the impact of events in the area are proposed. The first method is based on a time series analysis, and the second on the transition matrix updated by the Baum-Welch algorithm.

Both methods reveal which road segments are most heavily affected by a surge of traffic in the area, as well as potential bottleneck areas where congestion is likely to have occurred.

Keywords: Hidden Markov Model, Localisation, Wi-Fi Sensors, Forward-Backward Algorithm, Baum-Welch Algorithm, Machine Learning, Time Series Analysis.


Sammanfattning

In this degree project, localisation of pedestrians has been carried out using Hidden Markov Models. The localisation is built on data from Wi-Fi sensors in an area of Stockholm. The area is modelled as a graph-based network where the arcs between the nodes represent the possible paths a person can be on. The result for each individual is aggregated to show the expected number of people on each segment over a full day. Two methods for analysing how events affect the area are introduced and described. The first is based on time series analysis and the second is a machine learning method built on the Baum-Welch algorithm. Both methods show which segments are most affected by a rapid increase in traffic in the area and where congestion is likely to occur.

Keywords: Hidden Markov Model, Localisation, Wi-Fi Sensors, Forward-Backward Algorithm, Baum-Welch Algorithm, Machine Learning, Time Series Analysis.


Acknowledgements

The authors would like to thank IBM for the opportunity to be part of the project team. Special thanks to Holger Hellebro for assisting with communication with stakeholders and providing help at IBM whenever needed. We would also like to thank our supervisor Johan Karlsson for providing valuable insights and guidance regarding the mathematical modelling.


Contents

List of Figures

List of Tables

1 Introduction
  1.1 Problematization
  1.2 Purpose
  1.3 Research Questions
  1.4 Limitations
  1.5 Previous Work
  1.6 Outline

2 Background
  2.1 Hidden Markov Models
  2.2 Assumptions for HMM
  2.3 Forward-Backward Algorithm
    2.3.1 Forward Probabilities
    2.3.2 Backward Probabilities
    2.3.3 Smoothing
  2.4 The Baum-Welch Algorithm
  2.5 The Viterbi Algorithm

3 Method
  3.1 Data Overview
    3.1.1 Field Study Methodology
    3.1.2 RSSI vs Distance
    3.1.3 Detection Rate
    3.1.4 Detection at Different Heights
    3.1.5 Data Assumptions
  3.2 Mathematical Formulation
    3.2.1 Network Representation
    3.2.2 Transformation of Data to Observations for HMM
    3.2.3 Initial Distribution
    3.2.4 Transition Probabilities
    3.2.5 Emission Probabilities
    3.2.6 HMM Implementation
    3.2.7 Difference in Arrival Distributions

4 Results
  4.1 Method Comparison for Individual Flows
  4.2 Daily Flow Representation
    4.2.1 Weekday Flow
    4.2.2 Weekend Flow
  4.3 Impact of Events
    4.3.1 Event Flow
    4.3.2 Difference in Arrival Distributions
    4.3.3 Congestion Detection

5 Discussion
  5.1 Viterbi Algorithm for Individual Flow
  5.2 Performance of the Forward-Backward algorithm
    5.2.1 Accuracy
    5.2.2 Sensor Proximity Bias
    5.2.3 Maximum Time Difference
    5.2.4 Scalability
  5.3 Impact of Events
    5.3.1 Difference in Arrival Distributions
    5.3.2 Congestion Detection
  5.4 Model Improvements
    5.4.1 Activity States
    5.4.2 Specific Prediction Areas

6 Conclusion
  6.1 Further Research

A Graphs

Bibliography

List of Figures

1.1 Overview of the specific area studied in this thesis with sensor locations and a picture of one of the sensors.
2.1 One step in the recursion when calculating the forward probabilities.
2.2 One step in the recursion when calculating the backward probabilities.
2.3 Calculation of the normalised Forward-Backward probability for time step $t$ at state $x_i$.
2.4 Flow chart of the logic behind the Baum-Welch algorithm.
2.5 One step in the Viterbi algorithm visualised for state $x_j$ at time $t$.
3.1 An illustration of the observation and hidden state sequences of the HMM in the context of this work.
3.2 Sensor detection range.
3.3 Distribution over the received signal strengths (RSSI) at different distances from the sensor.
3.4 Time between detections of the same device for two sensors outside the Tele2 arena. The sensors are located at different height levels separated by a concrete floor.
3.5 Overview of the specific area studied in this thesis. The left-hand figure shows the locations of the Wi-Fi sensors, and the right-hand figure is the area represented with nodes and arcs.
4.1 The constructed observation sequence, where each used sensor is shown with the associated time of observation.
4.2 Localisation using the Viterbi and the Forward-Backward algorithm. The star at each time step represents the most likely state from the Viterbi algorithm, while the thickness of the lines represents the probabilities from the Forward-Backward algorithm at each time step.
4.3 A heat map visualisation of the average flow between 12:00-13:00 in the daily table.
4.4 Total number of people in the area over an average weekday.
4.5 Expected number of people for segments near the night club over an average weekday. The data for the two left graphs includes the Friday while the right one does not.
4.6 Expected number of people for segments outside the Ericsson Globe (1,28) and the north subway station (19,18). The data is over an average weekday.
4.7 Total number of people in the area over an average weekend.
4.8 Expected number of people for segments outside the Ericsson Globe and close to the north subway station during an average weekend.
4.9 Comparison between an average weekday and weekend for the total number of people in the area during the day.
4.10 Total number of people in the area during the Ericsson Globe event compared to an average weekday.
4.11 Expected number of people over time during the day of the Ericsson Globe event. The two upper graphs display the segment outside the Ericsson Globe in both directions, and the bottom two the segment near the northern subway station.
4.12 Total number of people in the area during the Tele2 arena event compared to an average weekday.
4.13 Expected number of people on arcs adjacent to the Tele2 arena during the day of the event. The upper two graphs display the segments at the north exit, and the bottom two the east exit from the Tele2 arena, in both directions.
4.14 Expected number of people on arcs remote from the Tele2 arena during the day of the event. The upper two graphs display the segments outside the Ericsson Globe, and the bottom two the segments adjacent to the north-western subway station, in both directions.
4.15 Map visualisation of the difference between the time series for both events and an average weekday. The size of the segment is proportional to the difference.
4.16 Map visualisation of the percentage increase in expected travel time for each segment during events, as calculated by the Baum-Welch algorithm.
5.1 Relation between the distance to the closest sensor and the flow on the arc. Each point corresponds to an arc and its size depends on the number of sensors within a 75 meter radius from the arc.
A.1 All differences between the time series for the Ericsson Globe event and an average weekday.
A.2 All differences between the time series for the Tele2 arena event and an average weekday.
A.3 Percentage change of the diagonal element in the transition matrix between the trained and the original one, for all arcs, at the Tele2 arena event.
A.4 Percentage change of the diagonal element in the transition matrix between the trained and the original one, for all arcs, at the Ericsson Globe event.
A.5 The change in probability for each arc from the Forward-Backward algorithm over time, compared against the Viterbi algorithm's choice of path. The graph is based on the specially made observation sequence.

List of Tables

3.1 Description of the data structure.
3.2 Summary metrics: distance vs RSSI.
3.3 Field test of detection rate of pedestrians.
3.4 Field test of detection rate of vehicles.
4.1 How often the Viterbi algorithm violates expected travel time, for two different days and two different travel speeds.
4.2 Distribution of type of day in the data set.

CHAPTER 1

Introduction

With more than half the world's population living in urban areas (UN 2014), the need for smarter city development is greater and has higher potential impact than ever before. The recent development of the Internet of Things (IoT) opens up possibilities for data gathering at a larger scale, which warrants the need for tools and methods to interpret and analyse the data for better decision making (Ahlgren et al. 2016). This has led the European Commission to support several “Smart City and Development” projects under Horizon 2020. One of those projects is led by GrowSmarter, which aims to transform cities for a smart, sustainable Europe by implementing 12 smart city solutions in energy, infrastructure and transport (GrowSmarter 2017). These solutions will be carried out and evaluated in three cities: Stockholm, Cologne, and Barcelona. More than twenty industry partners are working in collaboration with GrowSmarter to develop these solutions, one of which concerns an area in Stockholm where Wi-Fi sensors have been mounted for crowd-monitoring purposes. Data on the location of people is of interest for traffic planners and event managers alike who are looking to make more well-informed decisions, e.g. adding signs or pathways that lead pedestrians around crowded areas to reduce congestion caused by a busy crosswalk. Location data from GPS devices such as smartphones is generally not publicly accessible, and GPS requires a clear view of the orbiting satellites, which reduces its effectiveness in urban canyons where such a view is not always guaranteed (Cheng et al. 2005). Other surveillance methods such as video cameras impose serious privacy concerns, and video footage is generally difficult to model effectively. Thus, there is an incentive to develop analytical methods applicable to data generated by other sources such as Wi-Fi sensors.

One of the companies involved in the project is IBM, where this thesis was written. This work contributes to the analysis of the streams of data collected by the Wi-Fi sensors. There are in total 21 sensors in the area, spread out over approximately 0.3 km², and most of them are located near the two large arenas, see Figure 1.1. The two arenas, the Ericsson Globe and the Tele2 arena, frequently attract large groups of people to the area, which makes it ideal for crowd-monitoring purposes. The Wi-Fi sensors receive signals from devices such as mobile phones and laptop computers. These signals occur when a device is searching the area for a Wi-Fi network, which it does repeatedly as long as its Wi-Fi setting is turned on. Every phone is given an encrypted identification code, which allows the tracking of a single phone throughout the area without gathering any personal information about the user.

(a) Sensor locations (b) One of the Wi-Fi sensors (white box)

Figure 1.1: Overview of the specific area studied in this thesis with sensor locations and a picture of one of the sensors.

1.1 Problematization

Given the large amounts of data gathered from IoT devices, an important problem lies in analysing them. The data itself contains a lot of noise and provides little to no guidance on its own; it has to be understood, categorised, and represented in simpler ways to be useful for decision makers.

There is an added degree of uncertainty in areas between sensors where there is no sensor coverage. This is especially relevant in the western region of the area, towards the left-most sensor in Figure 1.1 a). Thus, a problem that must be addressed by the localisation method is how it handles these “dead zones”. The specific problem in this thesis is how mathematical models can turn large amounts of low-level sensor data into a representation of the movement of people in an urban environment.

1.2 Purpose

The purpose of this report is to explore the possibilities of using mathematical models to create useful insights about crowd dynamics from large amounts of data from IoT devices.

1.3 Research Questions

MRQ: How can low-level sensor data from devices in an urban area be used in mathematical models to determine the location and movement of large crowds throughout a day?


RQ1: How can the model be used to analyse to what extent a sudden surge of visitors impacts different parts of the area?

1.4 Limitations

The locations of the sensors are already determined, and the sensors can neither be moved nor can new ones be added. Their specific locations impose some limitations on the data gathering. The sensors can only collect data from devices that are trying to connect to a Wi-Fi network. Therefore, not everyone is detected by these sensors.

1.5 Previous Work

The literature review covers the different approaches to location estimation that have previously been employed. The major varying parameters among the articles presented below are sensing technology and sensor density, localisation technique, and indoor or outdoor environment. The fixed parameters for this thesis are the Wi-Fi sensing technology and the sensors' areal density, as well as passive localisation in an outdoor environment. However, articles and previous research with different operational settings will also be discussed and analysed, since they may still provide valuable insights.

Schilit et al. (1993) introduced the idea of location sensing for ubiquitous computing, and it has been a popular area of study since then. Bahl & Padmanabhan (2000) developed a radio-frequency based system called RADAR for locating and tracking users inside a building. The paper mainly focuses on locating stationary users. They used signal strength as an indicator of the user's distance to a sensor, and then triangulation techniques to determine the position with an average error of 2-3 meters. The technique of using the signal strength to nearby sensors is common in WLAN positioning systems and is often referred to as fingerprinting. It involves building a pre-defined map of the signal strengths observed at different positions throughout the area of interest. A position and its corresponding signal strength measurements to nearby sensors are what is referred to as a fingerprint. This may result in an extensive training period as the area and the number of sensors increase. Haeberlen et al. (2004) utilised a process referred to as war-driving during the training phase, which involves walking or driving around the area with a device that continuously records sensor readings and its location simultaneously. In their work, specific point granularity was not needed. Instead they mapped a person to a segment, e.g. a hallway or a specific office. They reported results of 95% accuracy, and when an error occurred it often positioned the user at a nearby segment. The WLAN fingerprint method is most often used in indoor environments, where sensor density is high and fingerprint mapping less extensive.

LaMarca et al. (2005) developed the Place Lab method, which is a radio beacon-based approach intended for metropolitan-scale deployment. Their method utilises already existing radio beacons in the environment, such as WLAN access points. New sensors and their estimated locations were continuously added to the database through war-driving efforts by the community. The method displayed a 20-30 meter median accuracy and close to 100% coverage in a metropolitan area with a sensor density of 1,200 sensors per square kilometre. The research done by Cheng et al. (2005) is an extension of the Place Lab project. They compared different localisation techniques in metropolitan-scale Wi-Fi localisation, where fingerprinting with a war-driving training phase was one of them. It performed relatively well compared to more sophisticated methods, such as the particle filter, with a median error of 20-30 meters. However, they found that the fingerprint method was less robust to noisy GPS data of the sensor locations than the other methods, e.g. the particle filter. They analysed the performance of the different localisation techniques in three different neighbourhoods with sensor densities ranging from 130 to 1,030 per square kilometre. The sensor density for this work is about 70 sensors per square kilometre, which indicates that the sensor density may be too sparse for the fingerprint method to perform well.

One of the other methods tested by Cheng et al. (2005) was the particle filter. The particle filter is a type of Bayesian filtering and has a wide variety of applications, one of which is position estimation (Gustafsson 2010). Cheng et al. (2005) concluded that sophisticated methods such as the particle filter are more valuable in environments with a low density of sensors; in dense environments, simpler methods such as a centroid-based positioning algorithm perform equally well. An advantage of the particle filter is its ability to fuse different types of sensor data, as done by Hightower & Borriello (2004) in their case study. They used a particle filter to locate a robot, equipped with different types of sensor technology, that moved around in an office environment. Their results indicate that the particle filter significantly outperforms deterministic algorithms in terms of the cumulative travel distance error, which indicates that the particle filter better captures the true motion of an object. Patterson et al. (2003) use a particle filter to infer high-level behaviour, such as travel mode and the most likely route of a traveller, from raw GPS data. They model the environment as a graph with nodes and arcs. To reduce computational complexity, the first-order Markov assumption is assumed to hold so that only two time slices have to be considered at once. An Expectation Maximisation (EM) algorithm is implemented to learn the model parameters in an unsupervised manner. After the training phase, the model was able to accurately predict the correct mode of transportation in 84% of the cases. A key difference between their study and the situation in this thesis work is the sensor technology. Using raw GPS data allows Patterson et al. (2003) to follow the traveller along their path. The stationary Wi-Fi sensors employed in this work do not provide that opportunity, which greatly limits the possibility of accurate localisation in areas with sparse sensor coverage.

Thiagarajan et al. (2009) use a graph-based Hidden Markov Model (HMM) to estimate the most likely routes and travel times of cars, with a combination of GPS readings and Wi-Fi sensor technology. They utilise the so-called Viterbi algorithm to compute the most likely state sequences given the observations. The Wi-Fi location estimates are gathered using the Place Lab method developed by LaMarca et al. (2005). Their contribution to the research on HMMs for map matching is developing a model robust to noisy and sparse data. Using only Wi-Fi data, the proposed model produces trajectories with a median error of less than 10%. HMMs have also proven effective in indoor pedestrian localisation, as shown by Ni et al. (2016). They use a WLAN fingerprint system based on the variation in signal strength rather than absolute values. However, the environment in which their model is tested drastically differs from the one in this thesis: it is smaller, more densely covered with sensors, and indoors. Thus, their simplified fingerprint system is not viable for a larger, metropolitan-scale deployment.

The types of localisation approaches mentioned so far are considered active localisation, characterised by the user seeking to identify their own position. The alternative, where an observer seeks to identify the positions of other users, is called passive localisation (Sun et al. 2017). This is the type of localisation of interest for this thesis. One of the difficulties with passive localisation is that the interval at which data is generated is not constant; the time between measurements depends entirely on how frequently the user walks near one of the stationary sensors.

Both the HMM and the particle filter are viable methods for this thesis, and either one could potentially be used. An advantage of the HMM compared to the particle filter is the possibility of exact inference via the Forward-Backward algorithm. This is only feasible for a reasonable number of discrete states, which requires that the area is modelled as a graph of nodes and arcs, as done by Thiagarajan et al. (2009). Additionally, the ability of the HMM to localise targets between sensor readings and obtain the most likely route via the Viterbi algorithm makes it the ideal choice of method for this thesis. The relevant theory regarding an HMM is described in the next chapter.

1.6 Outline

This section presents the structure of the thesis and briefly describes the content of each chapter.

Chapter one introduces the project in the thesis and provides the reader with the necessary context to understand the problem to be examined. Furthermore, the limitations for the project and the research questions are presented in this chapter. A section explaining the previous work done on the topic is presented in order to understand how similar problems have been solved in the past. The reader is given an overview of the different methods for sensor-based localisation and the motivation for choosing Hidden Markov Models is articulated.

Chapter two covers the theoretical parts of the thesis. It is limited to the essential theory of an HMM that is of interest in this specific context. Three different algorithms for inference, training, and decoding are described in detail: the Forward-Backward, the Baum-Welch, and the Viterbi algorithm.

Chapter three describes how the theory is applied to the specific context of the thesis. In order to understand the model choices made, the chapter begins with a detailed description of how the sensors collect data, followed by the corresponding data assumptions. Three models are constructed based on the three different algorithms, each with the purpose of solving a specific sub-problem. Additionally, a time series analysis method is presented and used in conjunction with the Forward-Backward algorithm.

Chapter four presents the empirical results generated from the different models. Firstly, two individual-based localisation approaches are compared. Secondly, a crowd localisation approach is presented, which is an aggregated version of the individual model. Lastly, the impact on the area during events is examined with regard to traffic bottlenecks and the effect on specific segments.

Chapter five critically assesses the performance of the models. For each model, the empirical results are explained via a context-aware analysis. The possibility for generalisation of the methods is also presented. Furthermore, two specific model improvements are explained in detail.

Chapter six provides the reader with the conclusions of the thesis. This chapter connects back to the research questions by explaining how and to what extent the three models answered them. A recommended direction for further research is given.


CHAPTER 2

Background

This chapter presents the theory for the Hidden Markov Model. The concept of HMM is introduced with the corresponding main assumptions, followed by three different algorithms used for decoding and training of the model. The algorithms presented in this chapter are the Forward-Backward algorithm, the Baum-Welch algorithm, and the Viterbi algorithm.

2.1 Hidden Markov Models

A Hidden Markov Model is a Markov model where the states are not directly observable but instead are connected to another stochastic process which can be observed. The set of unobservable states which constitutes the state space of the HMM is defined as $X = \{x_1, x_2, \ldots, x_N\}$, and the number of states $N$ is usually bounded by the physical character of the modelled system.

The observable stochastic process generates a set of observation sequences denoted $\Omega = \{Z^1, Z^2, \ldots, Z^W\}$, where each sequence consists of observations $Z^n = (z_1, z_2, \ldots, z_{T_n})$ over a time span $T_n$. Here, the superscript $n$ refers to the $n$:th individual. However, $n$ is only used in Section 2.4, since all other equations are on an individual basis. There is a predetermined number of possible observations that $z_t$ can take, denoted $V = \{v_1, v_2, \ldots, v_O\}$, where $O$ is the total number of possible observations; e.g. a coin toss can either be heads or tails. Hence, $z_t \in V$ for all time steps $t$. An observation sequence $Z = (z_1, z_2, \ldots, z_T)$ is analysed to obtain a picture of the hidden states $X = \{x_1, x_2, \ldots, x_N\}$ of the HMM.

The HMM consists of three major parts: the transition model, the emission model, and the initial distribution. The transition model specifies the probability of transitioning from one hidden state to another during a time step, i.e.

$$P(q_{t+1} = x_j \mid q_t = x_i), \quad 1 \le i, j \le N, \; t = 1, 2, \ldots, T \qquad (2.1)$$

where $q_t$ denotes the state at time $t$.

As mentioned before, the hidden states are connected to another stochastic process, and it is the emission model that specifies this connection. The emission model works as a bridge between the hidden states and the observations in a probabilistic manner. Specifically, the emission probability is the probability of obtaining observation $z_t = v_k$ given that the system is in the hidden state $x_i$ at time $t$, i.e.

$$P(z_t = v_k \mid q_t = x_i), \quad 1 \le k \le O, \; 1 \le i \le N, \; t = 1, 2, \ldots, T. \qquad (2.2)$$

Since each $z_t$ can take any of the values in $V$, an $N \times O$ matrix containing all emission probabilities for each state $x_i$ and all values of $z_t$ can be constructed. A shorter notation for the emission probability used later is $P(z_t \mid q_t = x_i)$, and an alternative expression for an observation sequence is $Z_{1:t} = (z_1, z_2, \ldots, z_t)$ for $t = 1, 2, \ldots, T$.

Lastly, the initial distribution is the probability of being in state $x_i$ at time step $t = 1$, i.e.

$$P(q_1 = x_i), \quad 1 \le i \le N. \qquad (2.3)$$

The three models mentioned above are the model parameters for the HMM and these must be specified by the user (Rabiner 1989). How these are specified in this thesis is covered in Section 3.2.
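To make the notation concrete, the following minimal Python sketch shows one way the three parameter sets could be represented numerically. The three-state chain, the two observation symbols, and all probability values are invented purely for illustration and have nothing to do with the network used later in this thesis.

import numpy as np

# Toy HMM with N = 3 hidden states and O = 2 observation symbols.
N, O = 3, 2

# Transition model: A[i, j] = P(q_{t+1} = x_j | q_t = x_i); rows sum to one.
A = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

# Emission model: B[i, k] = P(z_t = v_k | q_t = x_i); rows sum to one.
B = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

# Initial distribution: pi[i] = P(q_1 = x_i).
pi = np.array([0.6, 0.3, 0.1])

assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)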

2.2 Assumptions for HMM

There are three main assumptions for Hidden Markov Models, referred to as stationarity, the Markov property, and output independence (Rabiner 1989). The first regards the transition model and implies that the transition probabilities are stationary over time. This means that the probability of transitioning from state $x_i$ to $x_j$ is time independent, i.e.

$$P(q_{t_1+1} = x_j \mid q_{t_1} = x_i) = P(q_{t_2+1} = x_j \mid q_{t_2} = x_i), \quad 1 \le i, j \le N,$$

for any $t_1, t_2$. The second assumption regarding the transition probabilities is that they follow the Markov property, i.e. the state transition probabilities only depend on the current state:

$$P(q_{t+1} = x_j \mid q_t = x_i, q_{t-1}, \ldots, q_1) = P(q_{t+1} = x_j \mid q_t = x_i), \quad 1 \le i, j \le N.$$

The third and final assumption is output independence, which means that the probability of an observation at time $t$ depends only on the current state $q_t$, i.e.

$$P(z_t \mid q_1, q_2, \ldots, q_t, \ldots, q_T, Z) = P(z_t \mid q_t), \quad t = 1, 2, \ldots, T. \qquad (2.4)$$

2.3 Forward-Backward Algorithm

The purpose of the Forward-Backward algorithm is to compute the probability of being in state $x_i$ at time $t$ given the observation sequence $Z_{1:T} = (z_1, z_2, \ldots, z_T)$, i.e.

$$P(q_t = x_i \mid Z_{1:T}), \quad 1 \le i \le N, \; t = 1, 2, \ldots, T \qquad (2.5)$$

which is known as the forward-backward probability. The algorithm is divided into three steps: the forward probabilities, the backward probabilities, and the smoothing process. The probability in eq. (2.5) can be rewritten using the definition of conditional probability.


The definition of conditional probability is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad (2.6)$$

where $P(B) > 0$ (Råde & Westergren 2003, p. 426).

By applying the definition of conditional probability to eq. (2.5), the expression can be written as

$$
\begin{aligned}
P(q_t = x_i \mid Z_{1:T}) &= \frac{P(q_t = x_i, Z_{1:T})}{P(Z_{1:T})} \\
&= \frac{P(q_t = x_i, Z_{1:t}, Z_{t+1:T})}{P(Z_{1:T})} \\
&= \frac{P(Z_{t+1:T} \mid q_t = x_i, Z_{1:t})\, P(Z_{1:t}, q_t = x_i)}{P(Z_{1:T})} \\
&= \frac{P(Z_{t+1:T} \mid q_t = x_i)\, P(Z_{1:t}, q_t = x_i)}{P(Z_{1:T})}, \quad 1 \le i \le N, \; t = 1, 2, \ldots, T \qquad (2.7)
\end{aligned}
$$

where the last equality comes from the fact that $Z_{1:t}$ and $Z_{t+1:T}$ are conditionally independent given $q_t = x_i$. This follows directly from the output independence assumption in eq. (2.4).

Hence, the probability in eq. (2.5) is equal to the probability of obtaining the observation sequence $Z_{t+1:T}$ given the state $x_i$ at time $t$, multiplied by the probability of obtaining the observation sequence $Z_{1:t}$ and being in state $x_i$ at time $t$. Note that the probability $P(Z_{1:T})$ must be computed to normalise eq. (2.5). Without any sophisticated method, this calculation is

$$
\begin{aligned}
P(Z_{1:T}) &= \sum_{\text{all } Q} P(Z_{1:T} \cap Q) && \{\text{applying definition (2.6)}\} \\
&= \sum_{\text{all } Q} P(Z_{1:T} \mid Q)\, P(Q) && \{\text{applying the output independence assumption (2.4)}\} \\
&= \sum_{\text{all } Q} P(Q) \prod_{t=1}^{T} P(z_t \mid q_t) && \qquad (2.8)
\end{aligned}
$$

where $P(Q)$ is the probability of the state sequence, computed using the transition probabilities and the Markov property as

$$P(Q) = P(q_1, q_2, \ldots, q_T) = P(q_1)\, P(q_2 \mid q_1)\, P(q_3 \mid q_2) \cdots P(q_T \mid q_{T-1}) \qquad (2.9)$$

and by inserting this into eq. (2.8), the expression becomes

$$P(Z_{1:T}) = \sum_{\text{all } Q} P(q_1) \prod_{t=1}^{T-1} P(q_{t+1} \mid q_t) \prod_{t=1}^{T} P(z_t \mid q_t)$$

which consists of $(2T-1)N^T$ multiplications and $N^T - 1$ additions. Thus, computing this probability by brute force is not feasible, since the computational complexity grows rapidly with the number of time steps and the number of states in the system (Rabiner 1989). This warrants the need for a more sophisticated method such as the Forward-Backward algorithm. An alternative way of computing $P(Z_{1:T})$ is

$$P(Z_{1:T}) = \sum_{i=1}^{N} P(Z_{1:T}, q_T = x_i)$$

which enables eq. (2.7) to be written as

$$P(q_t = x_i \mid Z_{1:T}) = \frac{P(Z_{1:t}, q_t = x_i)\, P(Z_{t+1:T} \mid q_t = x_i)}{\sum_{i=1}^{N} P(Z_{1:T}, q_T = x_i)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_T(i)} \qquad (2.10)$$

where $\alpha_t(i)$ and $\beta_t(i)$ are called the forward and backward probabilities respectively, and can be computed recursively using the principle of dynamic programming. This reduces the complexity to $O(N^2 T)$ (Rabiner 1989). The next sections detail how to compute the forward and backward probabilities.
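As a purely illustrative comparison, with numbers chosen only for illustration and not corresponding to the network used later in this thesis: for $N = 50$ states and $T = 100$ time steps, the brute-force evaluation would require on the order of $(2T-1)N^T = 199 \cdot 50^{100}$ multiplications, whereas the recursive forward-backward evaluation needs on the order of $N^2 T = 50^2 \cdot 100 = 250\,000$ operations.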

2.3.1 Forward Probabilities

The forward probabilities $\alpha_t(i)$ for all time steps $t$ and states $x_i$ from eq. (2.10) are computed recursively. First, $\alpha_1(i)$ is calculated in the initialisation step as

$$\alpha_1(i) = P(q_1 = x_i)\, P(z_1 \mid q_1 = x_i), \quad z_1 \in V, \; 1 \le i \le N. \qquad (2.11)$$

The other forward probabilities, i.e. $\alpha_t(i)$ for $t = 2, 3, \ldots, T$, are calculated by induction as

$$\alpha_t(j) = P(z_t \mid q_t = x_j) \sum_{i=1}^{N} \alpha_{t-1}(i)\, P(q_t = x_j \mid q_{t-1} = x_i), \quad z_t \in V, \; 1 \le j \le N, \; t = 2, 3, \ldots, T. \qquad (2.12)$$

An overview of this procedure is visualised in Figure 2.1.

Figure 2.1: One step in the recursion when calculating the forward probabilities.


In each time step a scaling constant is introduced to stop $\alpha_t(i)$ from approaching zero as $t$ increases. This occurs since $\alpha_t(i)$ is a sum over two product sums of the transition and emission probabilities, i.e.

$$
\begin{aligned}
\alpha_t(i) &= P(Z_{1:t}, q_t = x_i) \\
&= \sum_{q_1 q_2 \ldots q_{t-1}} P(Z_{1:t}, q_t = x_i \mid q_1 q_2 \ldots q_{t-1})\, P(q_1 q_2 \ldots q_{t-1}) \\
&= \sum_{q_1 q_2 \ldots q_{t-1}} \prod_{s=1}^{t} P(z_s \mid q_s) \prod_{s=1}^{t-1} P(q_{s+1} \mid q_s). \qquad (2.13)
\end{aligned}
$$

As $T$ becomes large, these product sums rapidly approach zero, which is known as underflow (Shen 2008). The scaling constant is introduced in the initial step as

$$\dot{\alpha}_1(i) = \alpha_1(i), \qquad c_1 = \frac{1}{\sum_{i=1}^{N} \dot{\alpha}_1(i)}, \qquad \hat{\alpha}_1(i) = c_1 \dot{\alpha}_1(i) = c_1 \alpha_1(i)$$

and for time steps $t = 2, 3, \ldots, T$ as

$$\dot{\alpha}_t(j) = P(z_t \mid q_t = x_j) \sum_{i=1}^{N} \hat{\alpha}_{t-1}(i)\, P(q_t = x_j \mid q_{t-1} = x_i), \qquad c_t = \frac{1}{\sum_{j=1}^{N} \dot{\alpha}_t(j)}, \qquad \hat{\alpha}_t(j) = c_t \dot{\alpha}_t(j).$$

This makes $\hat{\alpha}_t(i)$ sum to one over all states $i$ in each time step $t$. The scaled forward probabilities can also be expressed as (Shen 2008)

$$\hat{\alpha}_t(j) = \alpha_t(j) \prod_{k=1}^{t} c_k, \quad 1 \le j \le N, \; t = 1, 2, \ldots, T. \qquad (2.14)$$
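As a minimal Python sketch of the scaled recursion in eqs. (2.11)-(2.14), assuming the toy arrays A, B and pi from Section 2.1 and an observation sequence z encoded as integer indices into V, the forward pass could be written as follows. This is only an illustration of the equations above, not the implementation used later in the thesis.

import numpy as np

def scaled_forward(A, B, pi, z):
    """Scaled forward pass: returns alpha_hat (T x N) and the scaling constants c (T,)."""
    T, N = len(z), len(pi)
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)

    # Initialisation, eq. (2.11): alpha_1(i) = P(q_1 = x_i) P(z_1 | q_1 = x_i), then scale.
    alpha_dot = pi * B[:, z[0]]
    c[0] = 1.0 / alpha_dot.sum()
    alpha_hat[0] = c[0] * alpha_dot

    # Induction, eq. (2.12), with the scaling constant of each step.
    for t in range(1, T):
        alpha_dot = B[:, z[t]] * (alpha_hat[t - 1] @ A)
        c[t] = 1.0 / alpha_dot.sum()
        alpha_hat[t] = c[t] * alpha_dot

    return alpha_hat, c

Each row of alpha_hat then sums to one, mirroring the normalisation property of the scaled forward probabilities.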

2.3.2 Backward Probabilities

The recursion for the backward probabilities is initialised with

$$\beta_T(i) = 1, \quad 1 \le i \le N$$

and the algorithm then iterates recursively backwards in time until $t = 1$. For time steps $t = T-1, T-2, \ldots, 1$, $\beta_t(i)$ is calculated as

$$\beta_t(i) = \sum_{j=1}^{N} P(q_{t+1} = x_j \mid q_t = x_i)\, P(z_{t+1} \mid q_{t+1} = x_j)\, \beta_{t+1}(j), \quad z_{t+1} \in V, \; 1 \le i \le N, \; t = T-1, T-2, \ldots, 1.$$


Figure 2.2: One step in the recursion when calculating the backward probabilities.

A graphical illustration of one step in the calculation of the backward probabilities can be seen in Figure 2.2. The same scaling factors as the ones used for the forward probabilities are introduced to stop the backward probabilities from approaching zero as $t$ increases. However, the last backward probability $\beta_T(i)$ is not scaled; instead, the last scaling factor $c_T$ is first used for $\beta_{T-1}(i)$. Thus, the scaling procedure becomes

$$\dot{\beta}_{t-1}(i) = \sum_{j=1}^{N} P(q_t = x_j \mid q_{t-1} = x_i)\, P(z_t \mid q_t = x_j)\, \hat{\beta}_t(j), \qquad \hat{\beta}_{t-1}(i) = \dot{\beta}_{t-1}(i)\, c_t$$

and, similarly to eq. (2.14), $\hat{\beta}_t(i)$ can be written as

$$\hat{\beta}_t(j) = \beta_t(j) \prod_{k=t+1}^{T} c_k, \quad t = 1, 2, \ldots, T-1.$$

By using the same scaling factors as for the forward probabilities, the Forward-Backward probability in eq. (2.10) is normalised at each time step (Rabiner 1989). This is part of the smoothing process of the Forward-Backward algorithm and is covered in the next section.
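Continuing the same sketch, and reusing the scaling constants c produced by the hypothetical scaled_forward above, the scaled backward recursion could look as follows. As in the text, beta_T(i) = 1 is left unscaled and c_T is first applied when computing the scaled beta at time T-1.

import numpy as np

def scaled_backward(A, B, z, c):
    """Scaled backward pass, reusing the forward scaling constants c."""
    T, N = len(z), A.shape[0]
    beta_hat = np.zeros((T, N))

    # Initialisation: beta_T(i) = 1 is not scaled.
    beta_hat[T - 1] = 1.0

    # Backward recursion; the factor c_t scales beta_hat_{t-1}.
    for t in range(T - 1, 0, -1):
        beta_dot = A @ (B[:, z[t]] * beta_hat[t])
        beta_hat[t - 1] = c[t] * beta_dot

    return beta_hat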

2.3.3 Smoothing

The final step in the Forward-Backward algorithm is the smoothing process. This is done by multiplying the forward and backward probabilities together. Recall that eq. (2.7) can be expressed as

$$P(q_t = x_i \mid Z_{1:T}) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_T(i)}. \qquad (2.15)$$

Using the expression for $\hat{\alpha}_T(i)$ from eq. (2.14) yields

$$\sum_{i=1}^{N} \hat{\alpha}_T(i) = \prod_{t=1}^{T} c_t \sum_{i=1}^{N} \alpha_T(i) = \prod_{t=1}^{T} c_t \, P(Z_{1:T}) = 1$$


which enables the probability of the observation sequence $P(Z_{1:T})$ to be written as

$$P(Z_{1:T}) = \frac{1}{\prod_{t=1}^{T} c_t}. \qquad (2.16)$$

Using the scaled forward and backward probabilities $\hat{\alpha}_t(i)$ and $\hat{\beta}_t(i)$ to calculate eq. (2.15) yields

$$P(q_t = x_i \mid Z_{1:T}) = \hat{\alpha}_t(i)\, \hat{\beta}_t(i) = \alpha_t(i)\, \beta_t(i) \prod_{k=1}^{t} c_k \prod_{k=t+1}^{T} c_k = \alpha_t(i)\, \beta_t(i) \prod_{k=1}^{T} c_k$$

and the benefit of using the same scaling factor for both $\hat{\alpha}_t(i)$ and $\hat{\beta}_t(i)$ becomes clear, since it normalises the probability $P(q_t = x_i \mid Z_{1:T})$ in the smoothing step as well (Rabiner 1989). A visual representation of this step can be seen in Figure 2.3.

Figure 2.3: Calculation of the normalised Forward-Backward probability for time step $t$ at state $x_i$.
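Putting the two passes together, still under the toy assumptions above and reusing the scaled_forward and scaled_backward helpers sketched earlier in this chapter, the smoothed posterior of eq. (2.15) is simply the elementwise product of the scaled quantities, and eq. (2.16) gives the sequence likelihood from the scaling constants (computed in log form here to avoid underflow).

import numpy as np

def forward_backward(A, B, pi, z):
    alpha_hat, c = scaled_forward(A, B, pi, z)   # sketch from Section 2.3.1
    beta_hat = scaled_backward(A, B, z, c)       # sketch from Section 2.3.2

    # Smoothing, eq. (2.15): P(q_t = x_i | Z_{1:T}) = alpha_hat_t(i) * beta_hat_t(i).
    posterior = alpha_hat * beta_hat

    # Eq. (2.16): P(Z_{1:T}) = 1 / prod_t c_t, i.e. log P(Z_{1:T}) = -sum_t log c_t.
    log_likelihood = -np.log(c).sum()
    return posterior, log_likelihood

# Example with the toy parameters from Section 2.1 and an invented observation sequence.
z = np.array([0, 1, 1, 0, 1])
posterior, ll = forward_backward(A, B, pi, z)
print(posterior.sum(axis=1))  # each row sums to one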

2.4 The Baum-Welch Algorithm

The Baum-Welch algorithm is an unsupervised machine learning algorithm used to update and improve the model parameters based on a set of observation sequences $\Omega = \{Z^1, Z^2, \ldots, Z^W\}$, where $W$ is the total number of sequences. An unsupervised machine learning algorithm uses unlabelled data, which in the context of this thesis means that there is no data set available that specifies which model parameters should be applied to a given observation sequence. Therefore, there are no separate training and validation parts as for a supervised machine learning algorithm (Gentleman & Carey 2008).

In this thesis, only the transition model is of interest for training; the reason is discussed in Section 3.2.6. The algorithm is an iterative process which updates the model parameters in each iteration based on all observation sequences, using the Forward-Backward algorithm.

First, all observation sequences $n = 1, 2, \ldots, W$ are selected and their posterior distributions are calculated using the Forward-Backward algorithm. The posterior distributions from all observation sequences are then used to update the model parameters, which constitutes one iteration of the Baum-Welch algorithm. These new model parameters are then used in the next iteration of the Forward-Backward algorithm when calculating the posterior distributions for all observation sequences. The algorithm continues to iterate until the updated parameters are not significantly different from those of the previous iteration, which is defined in eq. (2.20). An overview of the logic behind the Baum-Welch algorithm is given in Figure 2.4.

Figure 2.4: Flow chart of the logic behind the Baum-Welch algorithm.

However, the algorithm only guarantees a local maximum for the model parameters, due to it being an unsupervised algorithm (Gentleman & Carey 2008). To update the transition model, the probability $P(q_{t+1} = x_j \mid q_t = x_i)$ must be re-estimated using the knowledge obtained from all the observation sequences in $\Omega$. The estimated transition probability can be expressed as the ratio of the expected number of transitions from state $x_i$ to $x_j$ to the expected number of transitions from state $x_i$. Let $\xi_t(i, j)$ denote the probability of being in state $x_i$ at time $t$ and state $x_j$ at time $t+1$ given an observation sequence, i.e.

$$\xi_t^n(i, j) = P(q_t = x_i, q_{t+1} = x_j \mid Z^n_{1:T_n})$$

where the superscript $n$ refers to the $n$:th individual and their corresponding observation sequence. To avoid having to carry the superscript in all equations, it is henceforth dropped for all individual calculations. The calculations below are conducted for each observation sequence in $\Omega$, and in eq. (2.19) all probabilities over these sequences are added together. The superscript is reintroduced in eq. (2.19), where all the individual calculations are aggregated to update the model parameters. The expected number of transitions from state $x_i$ can be expressed using the previously defined $\xi_t(i, j)$ as

$$\text{Expected nr. of transitions from state } x_i = \sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i, j), \quad 1 \le i \le N. \qquad (2.17)$$

Similarly, the expected number of transitions from state $x_i$ to each individual state $x_j$ can be expressed as

$$\text{Expected nr. of transitions from state } x_i \text{ to state } x_j = \sum_{t=1}^{T-1} \xi_t(i, j), \quad 1 \le i, j \le N. \qquad (2.18)$$

The probabilities $\xi_t(i, j)$ can be calculated using the forward and backward probabilities as

$$\xi_t(i, j) = \frac{P(q_{t+1} = x_j, q_t = x_i, Z_{1:T})}{P(Z_{1:T})} = \frac{\alpha_t(i)\, P(q_{t+1} = x_j \mid q_t = x_i)\, P(z_{t+1} \mid q_{t+1} = x_j)\, \beta_{t+1}(j)}{P(Z_{1:T})}.$$
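As a rough Python sketch of how this re-estimation could be carried out for a single observation sequence, again reusing the hypothetical scaled_forward and scaled_backward helpers from Section 2.3 (the aggregation over the W sequences in Omega and the convergence test of eq. (2.20) are omitted here), the xi quantities and the updated transition matrix might be computed as follows.

import numpy as np

def reestimate_transitions(A, B, pi, z):
    """One Baum-Welch style update of the transition matrix from a single sequence."""
    T, N = len(z), A.shape[0]
    alpha_hat, c = scaled_forward(A, B, pi, z)
    beta_hat = scaled_backward(A, B, z, c)

    # xi_t(i, j) is proportional to alpha_hat_t(i) * A[i, j] * B[j, z_{t+1}] * beta_hat_{t+1}(j);
    # normalising each time step over (i, j) recovers the probabilities exactly.
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha_hat[t][:, None] * A * B[:, z[t + 1]][None, :] * beta_hat[t + 1][None, :]
        xi[t] /= xi[t].sum()

    # Eqs. (2.17)-(2.18): expected transition counts, then a new row-stochastic matrix.
    expected_ij = xi.sum(axis=0)                          # transitions i -> j
    expected_i = expected_ij.sum(axis=1, keepdims=True)   # transitions out of i
    return expected_ij / expected_i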
