Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2019
Mobility analysis of zoo visitors
Kim Byström
LiTH-ISY-EX--19/5224--SE
Supervisors: Kamiar Radnosrati, ISY, Linköpings universitet
David Gundlegård, ITN, Linköpings universitet
Examiner: Fredrik Gustafsson, ISY, Linköpings universitet
Division of Automatic Control Department of Electrical Engineering
Linköping University SE-581 83 Linköping, Sweden
Summary
In a collaboration between Kolmården Zoo and Linköping University, sponsored by Norrköping municipality's fund for research and development, movement measurements have been made inside the park. The measurements were made by six WiFi sniffers that collect anonymised MAC addresses from visitors' smartphones. The aim of this work is to analyse these data in order to understand visitor flows in the park, and other statistics, using a model-based movement analysis. The work shows that, with this equipment and statistical methods, one can produce a good prediction of how the geographical visitor distribution looks over time.
Abstract
In a collaboration between Kolmården Zoo and Linköping University, supported by the Norrköping municipality's fund for research and innovation, mobility measurements have been performed inside the zoo. These measurements were made by six WiFi sniffers collecting anonymised MAC addresses from the visitors' smartphones. The aim of this thesis is to analyse these data to understand visitor flows in the park and other statistics using a model-based mobility analysis. The work implies that one can make a rather good prediction of the geographical visitor distribution using this equipment and statistical models.
Acknowledgments
Thanks to my supervisors Kamiar Radnosrati and David Gundlegård and my examiner Fredrik Gustafsson. Your help, support and expertise during this work have been truly valuable and appreciated.
Linköping, May 2019 Kim Byström
Contents
Notation
1 Introduction
1.1 Related work
1.2 Aim and problem formulation
1.3 Thesis outline
2 Data structure
2.1 Technical background
2.1.1 WIFI
2.1.2 Probe request
2.2 Data acquisition
3 Modelling
3.1 Movement prediction
3.2 Dwell time prediction
4 Performance evaluation
4.1 Experimental analysis - Ticket data as ground truth
4.1.1 Datasets
4.1.2 Results
4.2 Prediction of geographical distribution
4.2.1 Results
5 Discussion
5.1 Conclusion
5.2 Application
5.3 Future work
A Ground truth correlation plots
B Distribution of $\hat{p}_{ij}$
C Prediction error
Notation
Model parameters

| Notation | Description |
|---|---|
| $i$ | Index for the different sniffers in the zoo |
| $s_i$ | Position of sniffer $i$ |
| $t_0$ | Opening time of the zoo |
| $t_c$ | Closing time of the zoo |
| $t_k$ | Discrete time, where $k \in [0, c]$ |
| $T_i^m(t)$ | Time stamp for visitor $m$ at station $i$ at time $t$ |
| $T_i^m(t_{first})$ | First time stamp in a series for visitor $m$ at position $i$ |
| $T_i^m(t_{end})$ | Last time stamp in a series for visitor $m$ at position $i$ |
| $x_i(t_k)$ | Number of visitors at position $s_i$ at time $t_k$ |
| $x_{s_i}(t_k)$ | Number of unique MAC addresses sniffed by sniffer $s_i$ at time $t_k$ |
| $x_{s_1,first}(t_k)$ | Number of unique MAC addresses sniffed for the first time by sniffer $s_1$ at time $t_k$ |
| $x_{ticket}(t_k)$ | Number of tickets sold at time $t_k$ |
| $\hat{x}_i(t_k)$ | Estimated number of visitors at position $s_i$ at time $t_k$ |
| $C_1$ | Scaling factor |
| $C_2$ | Time lag |
| $y_{ii}^m$ | The time visitor $m$ stays at station $i$ |
| $y_{ij}^m$ | The time visitor $m$ takes to move from position $s_i$ to $s_j$ |
| $\hat{p}_{ij}(t)$ | Dwell time distribution |
| $A$ | Transition matrix, transition distribution |
| $h$ | Hour within opening hours |
Abbreviations

| Abbreviation | Description |
|---|---|
| AP | Access point, part of WiFi communication |
| MAC address | Media Access Control address, unique identifier for every network card |
| WLAN | Wireless local area network |
1 Introduction
This work investigates whether the geographical distribution of visitors in a zoo can be predicted by monitoring the movement of WIFI-devices in the park. The work uses a Markov chain to model the movement of these devices and continuous probability distributions to model the time dimension. It targets and discusses zoo visitors specifically but could easily be applied in other areas as well. This chapter starts by discussing what has been done on the topic, before presenting the formal aim and problem formulation of the work.
1.1 Related work
Analytics has, for a long time, been important for online applications to understand customer behaviour. For instance, online stores analyse web page visitors' click paths to get a better understanding of who their customers are and how the web page can be rearranged to better satisfy their needs [19]. For a long time, these in-depth analyses did not have the same breakthrough in the physical customer-oriented world because of the absence of tools to gather the data.
In more recent years, smart phones have found a natural place in most people's lives, which has opened up new possibilities. Researchers have started to experiment with the signals that WIFI-devices, such as smart phones, broadcast, in order to track a trajectory with the global positioning system (GPS) as ground truth [14] and to estimate crowd densities and pedestrian flows at airports with the security check as ground truth [18]. These works have shown that one can, with rather simple methods and cheap hardware, create a high-accuracy analysis of human movement patterns.
Additionally, using this WIFI-interface, attempts have been made to model human movement patterns including duration-of-stay with a hidden Markov model (HMM) and an ergodic model [8], with promising results. This work focuses on analysis of individuals' movement patterns and has many similarities to what was done earlier on web pages.
So, is this the technology that enables in-depth analysis in the physical customer-oriented world as well? It might just be! If so, it will play an important role in managing physical facilities such as malls, hotels and amusement parks in the near future. In fact, there are already companies that offer such services [1].
1.2 Aim and problem formulation
Kolmården zoo today has rather good ways of predicting the number of visitors for each day. However, once the visitors have entered the park, there is no good way to automatically estimate how this population will be distributed in the zoo throughout the day. The aim of this work is to investigate whether this is feasible using the discussed WIFI-interface. More specifically:
– Given the number of people that will visit the zoo during a day, can their geographical distribution be predicted with a satisfying accuracy?
This will be considered true if both of the following things are true:
a) the number of probe requests captured in the main entrance can estimate the number of visitors entering the park in such a way that it can be used as ground truth,
b) movement and dwell time of WIFI-devices in the zoo can be estimated with a Markov chain and continuous distribution models.
1.3 Thesis outline
Chapter 2 gives a brief introduction to what a probe request is and how it can be captured. It also explains how, where and when the data used were acquired.
Chapter 3 discusses how the time and position of a WIFI-device in the zoo can be modelled and explains how these models work.
Chapter 4 explores whether the ticket data can be used as ground truth for the probe request data and, further, whether the probe request data can be predicted for future times.
2 Data structure
2.1 Technical background
This chapter will provide an overview of how WIFI works and how a WIFI-device’s position can be traced.
2.1.1 WIFI
WIFI-signals are basically electromagnetic waves transmitted and received by an antenna. To make sense of the information the waves carry, the Institute of Electrical and Electronics Engineers (IEEE) has defined a standard for the communication. This standard, IEEE 802.11 [3], states the fundamental signal properties and the protocol that let different devices communicate with each other. With an access point (AP), which is built into today's routers, one can set up a wireless local area network (WLAN) and further connect it to the internet. In this way, by connecting your smart phone to the WLAN via the AP, you get internet access in your hand.
The connection procedure follows three basic steps: probe request/response, authentication and association (see Figure 2.1). The probe request/response is essentially the WIFI-device's way of scanning the surroundings for APs. It sends out a message with its "name" to alert its surroundings of its existence. The APs within reach of this message respond to let the WIFI-device know they are there. In the authentication step, it is determined whether the WIFI-device is allowed to connect to the network or not. Private WLANs use passwords as authentication, while open networks only require acceptance of the terms. In the association step, the WIFI-device decides to connect to the network and further communication properties, such as encryption, are set.
The components of the AP are rather simple and can be found in every smart phone or laptop. In fact, these devices act as APs when set to share internet. By switching a computer's network card into monitor mode, one can further examine the probe requests broadcast within range using a Python script called Probemon [15].
Figure 2.1: The three steps in WIFI connection.
2.1.2 Probe request
The frequency at which probe requests are sent out varies from a few seconds to a couple of minutes depending on several factors, such as the type of device, the manufacturer, the OS and whether the device is actively in use or in standby [5].
To be able to communicate in a network, every device needs a network card. These network cards have a unique identifier called a MAC address that differentiates the devices from each other in the communication. The MAC address is part of the probe request (the "name") to identify the WIFI-device. This means that it has for a long time been possible to monitor which specific WIFI-devices are within reach using a computer. This act will from now on be referred to as sniffing and the computer performing it as a sniffer.
Since sniffing WIFI-devices is perceived to intrude on the privacy of the device owner, most new smart phones use a randomized MAC address for the public probe requests. Furthermore, they switch MAC address often enough to make it impossible to track a device over time.
Apple introduced this technique in iOS 8, and it was compatible with iPhone 5 and newer [6]. Google introduced it in Android 8 as an available feature for developers [4]. It was then up to the phone manufacturer to use this feature, create their own or not use MAC randomization at all. In a dataset captured between January 2015 and December 2016, Martin et al. [10] showed that at least 53 % of smart phones did not randomize their MAC addresses, with the remark that they "posit that much less than 50% of devices conduct randomization".
There is evidence that this randomization technique can be defeated [11][12]; however, such methods will not be applied in this work. The MAC address can in its original format be considered personal data, which under the General Data Protection Regulation (GDPR) may not be saved within the EU without permission [2].
2.2 Data acquisition
During the preparatory work of this project, six sniffers were placed at different locations in Kolmården zoo (see Figure 2.2) for two days (4/11 - 5/11 2017) to monitor and save probe requests. All personal data were replaced with pseudo-random values (hashes) using a hash function. By using different inputs (salts) to the hash function for the two days, a unique MAC address could only be followed for one day at a time in the zoo. The original data could not be restored from the hashes; hence these actions made the personal data totally anonymous and therefore legal to save.
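The salting scheme described above can be sketched in a few lines. This is a minimal illustration and not the project's actual code: the thesis does not name the hash function, so SHA-256, the helper `anonymise_mac`, the example MAC address and the randomly generated salts are all assumptions.

```python
import hashlib
import secrets


def anonymise_mac(mac: str, salt: bytes) -> str:
    """Return a salted SHA-256 digest (hex) of a MAC address.

    SHA-256 is an assumption; the thesis only says "a hashing function".
    """
    normalised = mac.strip().lower().encode("ascii")
    return hashlib.sha256(salt + normalised).hexdigest()


# One secret salt per measurement day (hypothetical values).
salt_day1 = secrets.token_bytes(16)
salt_day2 = secrets.token_bytes(16)

mac = "AA:BB:CC:DD:EE:FF"  # made-up address for illustration
same_day_a = anonymise_mac(mac, salt_day1)
same_day_b = anonymise_mac(mac, salt_day1)
other_day = anonymise_mac(mac, salt_day2)

# Within one day the same device maps to the same hash, so it can be followed;
# across days the hashes differ, so it cannot be linked.
print(same_day_a == same_day_b, same_day_a == other_day)
```

With a scheme of this kind, a device keeps one identity per day, which is what the per-day path analysis in the following chapters relies on.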
By monitoring how WIFI-devices get sniffed by different sniffers with different time stamps, it is possible to identify a visitor's path throughout the park during the day (see the example in section 3.2). Furthermore, ticket data were provided by the zoo, showing the number of visitors entering the zoo every minute. According to these data (see Table 2.1), 4/11 had 10 742 paying visitors and 5/11 had 5 466. Models presented in this work are calibrated on the data of 4/11 and cross-validated on the data of 5/11.
Figure 2.2: Map of the zoo with sniffer positions.
Table 2.1: Comparison between the sniffed data and the ticket data for the two days of data acquisition.

| Date | Tickets | Unique MAC addresses | Probe requests |
|---|---|---|---|
| 2017-11-04 | 10 742 | 122 757 | 1 083 150 |
| 2017-11-05 | 5 466 | 58 834 | 695 050 |

A comparison of the ticket data and sniffed data in Table 2.1 shows consistently that the number of unique MAC addresses is about eleven times bigger than the number of sold tickets. This can be explained by the fact that many of the WIFI-devices in the zoo use randomized MAC addresses and therefore use multiple addresses through the day. What the ratio between the number of WIFI-devices and the number of visitors really is, we cannot know. Some visitors may have multiple devices while others may have none.
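The "about eleven times" figure can be checked directly from Table 2.1. The short sketch below only repeats that arithmetic; the dictionary layout and variable names are illustrative choices.

```python
# Figures copied from Table 2.1.
table_2_1 = {
    "2017-11-04": {"tickets": 10742, "unique_macs": 122757, "probe_requests": 1083150},
    "2017-11-05": {"tickets": 5466, "unique_macs": 58834, "probe_requests": 695050},
}

# Unique MAC addresses per sold ticket, for each day.
ratios = {day: row["unique_macs"] / row["tickets"] for day, row in table_2_1.items()}

for day, ratio in ratios.items():
    print(f"{day}: {ratio:.2f} unique MAC addresses per ticket")
```

Both days give a ratio between roughly 10.7 and 11.5, which is consistent with the statement in the text.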
3 Modelling
When predicting the visitor flow in the zoo, two types of models are needed: movement predictions and dwell time predictions. A movement prediction answers the question "given a visitor's position, where will they go next?". This is done with a Markov chain. Dwell time refers to a number of quantities, namely: time of entrance and exit, duration at each position and the travel time between various positions. These quantities will be predicted with some common continuous probability distributions. Combining these two models enables us to predict an individual visitor's path in the zoo throughout the day. Later, in chapter 4, we will aggregate this concept to a prediction of the geographical visitor distribution.
What is referred to as a visitor position in this chapter is in fact the position of the sniffer at which a WIFI-device in the filtered dataset (see section 4.1) has been sniffed.
3.1 Movement prediction
The general idea behind a Markov chain is to determine the probabilities of where a visitor $m$ will move next in the zoo by observing how other visitors moved in the past. Assume we know that the position $q_m$ of visitor $m$ at time $t_k$ is $s_i$, so that $P[q_m(t_k) = s_i] = 1$. The probability that the visitor will move to position $s_j$ directly after this is then given by $P[q_m(t_{k+1}) = s_j \mid q_m(t_k) = s_i]$. This probability is estimated by [17]:

$$P[q_m(t_{k+1}) = s_j \mid q_m(t_k) = s_i] = \frac{\text{observed transitions from } s_i \text{ to } s_j}{\text{observed transitions from } s_i} = a_{i,j} \tag{3.1}$$

If, for example, most of the visitors at position $s_1$ moved to $s_2$ in the observation set, visitor $m$ will be given a high probability of making the same transition.
Probabilities $a_{i,j}$ are called transition probabilities and form the transition matrix $A$:

$$A = \begin{pmatrix} a_{1,1} & \cdots & a_{1,j} & \cdots & a_{1,n} \\ \vdots & \ddots & \vdots & & \vdots \\ a_{i,1} & \cdots & a_{i,j} & \cdots & a_{i,n} \\ \vdots & & \vdots & \ddots & \vdots \\ a_{n,1} & \cdots & a_{n,j} & \cdots & a_{n,n} \end{pmatrix} \tag{3.2}$$

where $n$ is the number of possible positions. Note that $\sum_{j=1}^{n} a_{i,j} = 1$. The diagonal elements in $A$ represent the probability of transitioning to the same position as one came from. This will not be considered a valid transition here, hence $a_{i,i} = 0$. Instead, the time that a visitor stays at a position will be modelled as attraction dwell time in section 3.2.
Figure 3.1: (a) Observed transitions from $s_1$ make up the first row of the transition matrix $A$. (b) Observed transitions to $s_1$ make up the first column of the transition matrix $A$.
Since $\sum_{j=1}^{n} a_{i,j} = 1$, each row of matrix (3.2) is a distribution over destinations. Figure 3.1a shows how the rows of matrix (3.2) are based on observed transitions from position $s_i$, and Figure 3.1b shows how the columns of matrix (3.2) are based on observed transitions to $s_j$.
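Estimating the transition matrix from observed paths amounts to counting and normalising. The sketch below is one possible way to do this, assuming a row-stochastic convention in which row $i$ holds the transitions from $s_i$ (so that each row sums to one); the function name `transition_matrix` and the example paths are hypothetical.

```python
from collections import Counter


def transition_matrix(paths, n):
    """Estimate a row-stochastic matrix A, where A[i-1][j-1] approximates P(s_i -> s_j).

    paths: list of position sequences (positions numbered 1..n), one per device.
    """
    counts = Counter()
    for path in paths:
        for src, dst in zip(path, path[1:]):
            if src != dst:  # self-transitions are not valid transitions: a_ii = 0
                counts[(src, dst)] += 1

    A = [[0.0] * n for _ in range(n)]
    for i in range(1, n + 1):
        row_total = sum(counts[(i, j)] for j in range(1, n + 1))
        if row_total == 0:
            continue  # no observed departures from s_i; the row stays all zeros
        for j in range(1, n + 1):
            A[i - 1][j - 1] = counts[(i, j)] / row_total
    return A


# Hypothetical observation set: each list is one device's sequence of sniffer positions.
paths = [[1, 2, 4, 3], [1, 2, 3], [1, 3, 2, 1]]
A = transition_matrix(paths, n=4)
for row in A:
    print([round(a, 2) for a in row])
```

In this toy set, two of the three departures from $s_1$ go to $s_2$, so $a_{1,2} = 2/3$.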
The Markov chain is based on two assumptions: 1) there exists a pattern in the observations that is applicable to future events and 2) this pattern is stable over time. The Markov chain is therefore more powerful for predictions over shorter times [7]. If these assumptions are not met by the data it is applied to, the prediction will be of low accuracy. These limitations are not specific to Markov chains but apply to all pattern-predicting models.
In order to capture the pattern, a large number of observations is needed. If the observations are sensitive to fluctuations in transition patterns over time, it is a good idea to divide the time horizon into fragments and use separate transition matrices for these time intervals. This is accomplished by creating a matrix $A_h$ for each hour $h$ within the opening hours, using only observed transitions from that hour.
It is intuitive to think that the visitors' transition patterns may vary over time in the zoo. For example, they may move deeper into the park in the morning after they have arrived, while they move closer to the exit late in the afternoon. If the zoo offers shows at certain times during the day, these could cause temporary dynamics in the visitor flow that do not hold for the rest of the day. These are examples of possible fluctuations in transition patterns.
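The hourly matrices $A_h$ can be built with the same counting idea, grouped by hour. The following is a sketch under assumptions: observations are taken to be `(hour, from, to)` tuples, rows are normalised to sum to one, and the function name and example data are made up for illustration.

```python
from collections import defaultdict


def hourly_transition_matrices(transitions, n):
    """Build one row-stochastic matrix per hour.

    transitions: iterable of (hour, src, dst) with positions numbered 1..n.
    Returns a dict {hour: matrix}.
    """
    counts = defaultdict(lambda: [[0] * n for _ in range(n)])
    for hour, src, dst in transitions:
        if src != dst:  # self-transitions are excluded, as for the static matrix
            counts[hour][src - 1][dst - 1] += 1

    matrices = {}
    for hour, hour_counts in counts.items():
        matrices[hour] = [
            [c / sum(row) if sum(row) else 0.0 for c in row] for row in hour_counts
        ]
    return matrices


# Hypothetical observations: visitors move inwards at 10, back towards s1 at 15.
observed = [(10, 1, 2), (10, 1, 2), (10, 1, 3), (15, 2, 1), (15, 3, 1), (15, 2, 1)]
A_h = hourly_transition_matrices(observed, n=3)
print(A_h[10][0])  # distribution of moves out of s1 during hour 10
print(A_h[15][1])  # distribution of moves out of s2 during hour 15
```

Each hourly matrix is estimated from fewer observations than a whole-day matrix, which is the trade-off behind the requirement for a large number of observations.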
3.2 Dwell time prediction
The term dwell time is often used to describe how long one stays somewhere. In this context, it will be used in two different ways:
Attraction dwell time: the time spent at each position,
Inter-attraction dwell time: the time spent moving between two positions.
The sniffed data consist of time stamps $T_i^m(t_k)$ for visitor $m$ at station $i = 1, 2, \ldots, 6$ at times $t_k$, $k = 1, 2, \ldots, N$. Let $y_{ii}^m$ denote the attraction dwell time of visitor $m$ at position $s_i$. $y_{ii}^m$ is defined as the time between the first and the last time stamp from $s_i$ in a series (that is, without being sniffed somewhere else in between):

$$y_{ii}^m = T_i^m(t_{end}) - T_i^m(t_{first}) \tag{3.3}$$

Further, let $y_{ij}^m$ denote the inter-attraction dwell time of the same visitor between positions $s_i$ and $s_j$. $y_{ij}^m$ is defined as the time between the last time stamp from $s_i$ and the first time stamp at $s_j$:

$$y_{ij}^m = T_j^m(t_{first}) - T_i^m(t_{end}) \tag{3.4}$$

It should be clarified that dwell times only relate to the positions where sniffers are placed. The dwell time between two positions can therefore be very long if the visitor goes to a non-sniffer position in between.
To model visitors entering and exiting the zoo, it is convenient to add a new state, $s_0$, representing not in the park, i.e. the surroundings. A visitor transitioning from $s_0$ is entering the zoo, and is leaving it when transitioning to $s_0$. The enter time is defined as the time difference between a visitor's first time stamp and the zoo's opening time:

$$y_{0j}^m = T_j^m(t_{first}) - T_0^m(t_0) \tag{3.5}$$

Similarly, the exit time is given by the time difference between a visitor's last time stamp and the zoo's opening time:

$$y_{i0}^m = T_i^m(t_{end}) - T_0^m(t_0) \tag{3.6}$$
Table 3.1 exemplifies how a WIFI-device gets sniffed throughout the park during one day. Figure 3.2 then shows the WIFI-device's dwell times.
Table 3.1: Example of how a WIFI-device gets sniffed throughout the park during one day. Dots represent one or more time stamps by the same sniffer in between. The zoo opens at 10:00.

| Time stamp | Position | Time stamp | Position |
|---|---|---|---|
| 10:25 | 1 | 13:39 | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 10:37 | 1 | 13:59 | 5 |
| 10:45 | 2 | 14:49 | 6 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 11:20 | 2 | 15:14 | 6 |
| 11:30 | 4 | 15:30 | 2 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 12:15 | 4 | 15:32 | 2 |
| 12:20 | 3 | 15:42 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 13:35 | 3 | 15:44 | 1 |

Figure 3.2: The dwell times for the WIFI-device with time stamps in Table 3.1.
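The dwell times defined in (3.3) and (3.4) can be computed for the device in Table 3.1 with a short script. This is an illustrative sketch: the helper names are made up, and only the first and last stamp of each series are listed, since the elided stamps in between do not affect the result.

```python
from datetime import datetime
from itertools import groupby


def minutes(hhmm):
    """Convert an 'HH:MM' string to minutes after midnight."""
    t = datetime.strptime(hhmm, "%H:%M")
    return t.hour * 60 + t.minute


def dwell_times(stamps):
    """stamps: chronologically sorted (time 'HH:MM', position) pairs for one device."""
    # Collapse consecutive stamps from the same sniffer into (position, first, last).
    series = []
    for pos, group in groupby(stamps, key=lambda s: s[1]):
        times = [minutes(t) for t, _ in group]
        series.append((pos, times[0], times[-1]))
    # Eq. (3.3): last minus first stamp within a series.
    attraction = [(pos, last - first) for pos, first, last in series]
    # Eq. (3.4): first stamp at s_j minus last stamp at s_i.
    inter = [((a[0], b[0]), b[1] - a[2]) for a, b in zip(series, series[1:])]
    return attraction, inter


# First and last stamp of each series in Table 3.1.
stamps = [
    ("10:25", 1), ("10:37", 1), ("10:45", 2), ("11:20", 2),
    ("11:30", 4), ("12:15", 4), ("12:20", 3), ("13:35", 3),
    ("13:39", 5), ("13:59", 5), ("14:49", 6), ("15:14", 6),
    ("15:30", 2), ("15:32", 2), ("15:42", 1), ("15:44", 1),
]
attraction, inter = dwell_times(stamps)
print(attraction)  # (position, minutes stayed)
print(inter)       # ((from, to), minutes between positions)
```

For example, the device stays 12 minutes at $s_1$ (10:25 to 10:37) and then takes 8 minutes to reach $s_2$ (10:37 to 10:45).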
By calculating the mean and standard deviation of the dwell times (from their first and second moments) for all observed visitors, the dwell times can be modelled with continuous probability distributions, $\hat{p}_{ij}(t)$, using the method of moments [16]. The distributions that will be tested are the normal, exponential and beta distributions. These models can then be evaluated with a 1-sample K-S test [9]. A 1-sample K-S test measures a statistical distribution's goodness-of-fit as the supremum error between its CDF, $F(x)$, and the empirical data's CDF (ECDF), $F_E(x)$:

$$D = \sup_x |F_E(x) - F(x)| \tag{3.7}$$

For a finite set, the supremum of a function is the same as the maximum.
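A compact way to see the method of moments and the K-S statistic (3.7) working together is sketched below with the standard library only. The dwell times are made-up numbers, not thesis data, and the helper `ks_statistic` is illustrative. The beta distribution is left out because its CDF needs the regularised incomplete beta function, which the standard library does not provide.

```python
import math
from statistics import NormalDist, mean, pstdev


def ks_statistic(sample, cdf):
    """Largest distance between the ECDF of `sample` and a model CDF, eq. (3.7)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i-1)/n to i/n at x, so both sides of the step are checked.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical attraction dwell times in minutes (not thesis data).
dwell = [2, 5, 8, 12, 20, 25, 35, 45, 75, 90]

# Method of moments: choose parameters so the model matches the sample moments.
m, s = mean(dwell), pstdev(dwell)
normal_cdf = NormalDist(mu=m, sigma=s).cdf              # matches mean and std


def exponential_cdf(x):
    return 1.0 - math.exp(-x / m) if x > 0 else 0.0     # rate = 1 / mean


d_normal = ks_statistic(dwell, normal_cdf)
d_exponential = ks_statistic(dwell, exponential_cdf)
print(f"normal: D = {d_normal:.3f}, exponential: D = {d_exponential:.3f}")
```

A smaller $D$ means the model CDF stays closer to the empirical CDF, which is the criterion used to rank the candidate distributions.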
Just like the Markov chain, continuous distribution models are sensitive to temporary fluctuations. Dividing the model into different time fragments could be applied here as well. Notice that if $n$ is the number of sniffer positions, the number of dwell times is $n^2$. Fragmenting this model over time makes the model more complex and will not be applied in this work.
4 Performance evaluation
4.1 Experimental analysis - Ticket data as ground truth
This section will discuss how the official ticket data can be used as ground truth for our prediction models. This is important since one cannot know how many visitors there are at each position, only the number of WIFI-devices. The ticket data give the number of visitors entering the zoo each minute. The idea is to compare the number of visitors entering the zoo with the number of WIFI-devices entering the zoo at each time. A correlation between these two implies that the sniffed data are well suited to model the visitor number.
But first, why would there be a correlation between the ticket data and the sniffed data? Sniffer $s_1$ is placed in the main entrance, which is also the position of the ticket machine. A WIFI-device may therefore be sniffed by $s_1$ when its owner (visitor) goes through the ticket machine and enters the zoo. However, for this to happen, the WIFI-device must send out a probe request while still within reach of the sniffer. This is not always the case: as discussed in chapter 2, the frequency at which a WIFI-device sends out probe requests can vary from a few seconds to a couple of minutes. If no probe request is sent in time, there is no information about when the WIFI-device entered the zoo. To let the sniffed data estimate the number of visitors:
• only the WIFI-devices whose first time stamp is from sniffer $s_1$ are considered, i.e. $T_i(t_{first}) \in T_1(t_k)$.
Devices that do not meet this condition were not sniffed when they entered the zoo and must be discarded.
Figure 4.1: Ticket machines and sniffer $s_1$ have the same position. If a correlation exists between their data, the ticket data can be used as ground truth for all sniffers' data.
When we compared the number of visitors with the number of unique MAC addresses in Table 2.1, the latter was about 11 times bigger. Four factors are identified to cause this difference (especially the last):
• staff wearing WIFI-devices,
• stationary WIFI-devices in the zoo,
• the number of WIFI-devices per visitor may vary (some carry multiple, some none),
• WIFI-devices using randomized MAC addresses create multiple unique MAC addresses.
One may therefore consider filtering the sniffed data based on these criteria as well. The goal is to investigate whether or not there is a correlation between the number of tickets sold, $x_{ticket}(t)$, and the number of unique MAC addresses sniffed by sniffer $s_1$ for the first time, $x_{s_1,first}(t)$, such that

$$x_{ticket}(t_k) = C_1 x_{s_1,first}(t_k + C_2) \tag{4.1}$$

where

$$C_1 = \frac{\|x_{ticket}(t)\|}{\|x_{s_1,first}(t)\|} = \sqrt{\frac{\sum_{i=0}^{c} x_{ticket}^2(t_i)}{\sum_{j=0}^{c} x_{s_1,first}^2(t_j)}} \tag{4.2}$$

scales the amplitude and $C_2$ is the lag in time, which will be matched manually.
If such a correlation exists, the number of visitors at position $s_1$ can be estimated by

$$\hat{x}_1(t_k) = C_1 x_{s_1}(t_k + C_2) \tag{4.3}$$

It will be further assumed that this relation also holds for the general position $s_i$:

$$\hat{x}_i(t_k) = C_1 x_{s_i}(t_k + C_2) \tag{4.4}$$
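The scaling factor in (4.2) is the ratio of the Euclidean norms of the two time series, and (4.3) then rescales and shifts the sniffed counts. The sketch below illustrates this on made-up counts; the function names are hypothetical, and the lag is assumed to be a non-negative whole number of samples.

```python
import math


def scaling_factor(x_ticket, x_first):
    """C1 from eq. (4.2): ratio of the Euclidean norms of the two series."""
    return math.sqrt(sum(v * v for v in x_ticket) / sum(v * v for v in x_first))


def estimate_visitors(x_sniffed, c1, lag=0):
    """Eq. (4.3): scale the sniffed counts by C1 and shift them by `lag` samples (C2).

    Assumes lag >= 0; samples shifted in from beyond the end are set to 0.
    """
    shifted = x_sniffed[lag:] + [0] * lag
    return [c1 * v for v in shifted]


# Hypothetical counts per time interval (not thesis data).
x_ticket = [10, 40, 60, 30]
x_first = [2, 8, 12, 6]

c1 = scaling_factor(x_ticket, x_first)
x_hat = estimate_visitors(x_first, c1)
print(c1, x_hat)
```

In this toy case the ticket series is exactly five times the sniffed series, so $C_1 = 5$ and the estimate reproduces the ticket data; real data would leave a residual, which is what the fit measure in section 4.1.1 quantifies.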
4.1.1 Datasets
Model (4.4) will be formulated for two datasets: a raw dataset and a filtered dataset.
The filtered dataset will be filtered based on the bullet points above under the following assumptions:
A WIFI-device can be used to model the true visitor number through (4.4) if:
1. its MAC address is sniffed at $s_1$ the first time it is sniffed.
A MAC address is considered random if:
2. it appears fewer than 3 times in the log,
3. it only appears within an interval of 5 minutes.
A MAC address is considered to belong to staff or a stationary machine if:
4. it appears more than 30 minutes outside the opening hours,
5. it appears at one position only for more than 3 hours.
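The five assumptions above can be expressed as a single predicate per device. The following is a sketch of one possible reading, not the thesis implementation: rules 2 and 3 are combined with "or", rule 4 is read as "any stamp more than 30 minutes before opening or after closing", the closing time of 17:00 is a made-up value, and the log format and names are hypothetical.

```python
def keep_device(stamps, open_min, close_min):
    """Decide whether one hashed MAC address stays in the filtered dataset.

    stamps: chronologically sorted (minute_of_day, position) pairs.
    open_min, close_min: opening and closing time in minutes after midnight.
    """
    times = [t for t, _ in stamps]
    positions = {p for _, p in stamps}
    span = times[-1] - times[0]

    if stamps[0][1] != 1:                 # rule 1: must be first sniffed at s1
        return False
    if len(stamps) < 3 or span <= 5:      # rules 2-3: likely a randomized address
        return False
    if times[0] < open_min - 30 or times[-1] > close_min + 30:
        return False                      # rule 4: seen well outside opening hours
    if len(positions) == 1 and span > 180:
        return False                      # rule 5: one position for over 3 hours
    return True


# Hypothetical log: zoo opens 10:00 (600) and, as an assumption, closes 17:00 (1020).
log = {
    "a": [(625, 1), (637, 1), (645, 2), (944, 1)],   # ordinary visitor
    "b": [(630, 2), (700, 1), (720, 3)],             # not first seen at s1
    "c": [(640, 1), (642, 1)],                       # too few stamps
    "d": [(500, 1), (700, 1), (1100, 1)],            # seen long before opening
    "e": [(610, 1), (700, 1), (900, 1)],             # stationary at s1
    "f": [(610, 1), (611, 1), (613, 1)],             # only a 3-minute appearance
}
filtered = {mac: s for mac, s in log.items() if keep_device(s, 600, 1020)}
print(sorted(filtered))
```

Only device "a" survives here, which mirrors how strongly the filtering reduces the dataset in practice.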
The raw dataset will be filtered based on assumption 1 only. MAC addresses that fail these criteria are discarded from the datasets. The two datasets will be evaluated visually on how well they follow the ticket data and with the normalized root mean square error (NRMSE):

$$fit = \frac{\|x_{ticket} - \hat{x}_1\|}{\|x_{ticket} - \mathrm{mean}(x_{ticket})\|} \tag{4.5}$$

The model is perfect if its graph covers the ticket data and $fit = 0$.
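The fit measure (4.5) is straightforward to compute. The sketch below uses made-up counts and an illustrative function name; it also shows two reference points that help when reading the reported values.

```python
import math


def nrmse_fit(x_ticket, x_hat):
    """Eq. (4.5): error norm divided by the norm of the mean-centred ticket data."""
    mean_ticket = sum(x_ticket) / len(x_ticket)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_ticket, x_hat)))
    den = math.sqrt(sum((a - mean_ticket) ** 2 for a in x_ticket))
    return num / den


# Hypothetical ticket counts per interval (not thesis data).
x_ticket = [10, 40, 60, 30]

perfect = nrmse_fit(x_ticket, x_ticket)              # estimate equals the ticket data
mean_only = nrmse_fit(x_ticket, [35, 35, 35, 35])    # constant estimate at the mean
print(perfect, mean_only)
```

A value of 0 is a perfect match and a value of 1 is as good as always predicting the mean of the ticket data, so values above 1 indicate a model that does worse than that constant baseline.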
Filtering improves the quality and relevance of the data for the purpose. However, the quantity of data is often reduced heavily, and some relevant data are most probably discarded. For example, it is possible that successful transitions can also be observed from a WIFI-device with a randomized MAC address. Such observations can be used to make movement and dwell time models more robust and accurate.
4.1.2 Results
Table 4.1 shows the size of the two datasets. Not surprisingly, both sets have been reduced significantly (compared to Table 2.1). The raw dataset still has more unique MAC addresses than visitors while the filtered dataset now has fewer.
Table 4.1: Number of unique MAC addresses in each dataset together with ticket data for the two days of data acquisition.

| Date | Tickets | Raw dataset | Filtered dataset |
|---|---|---|---|
| 2017-11-04 | 10 742 | 30 334 | 2 122 |
| 2017-11-05 | 5 466 | 16 162 | 1 067 |
When comparing the raw dataset in Figure 4.2a to the filtered dataset in Figure 4.2b, it is clear that the latter has a stronger correlation to the ticket data (this can be examined further in Appendix A). The NRMSE-fit in Table 4.2 confirms that the filtered dataset can provide a more suitable model. Of the two datasets, the filtered dataset is therefore considered the best. The filtered dataset is further considered to estimate the number of visitors entering the zoo well enough for its purpose and will therefore be used when modelling movement and dwell time. Figure A.1 compares the discussed correlation for different aggregation levels. This comparison shows that the model correlation becomes stronger as the time intervals are aggregated. One should therefore consider what aggregation level is relevant for the application.
Table 4.2: Evaluation of (4.4) for the two datasets.

| Dataset | $C_1$ | $C_2$ | NRMSE-fit |
|---|---|---|---|
| Raw data | 0.445 | 0 | 1.08 |
| Filtered data | 5.01 | 0 | 0.812 |

Figure 4.2: (a) shows how the raw dataset has a poor correlation to the ticket data and (b) shows how the filtered dataset has a rather good correlation to the ticket data.
4.2 Prediction of geographical distribution
From transition probabilities $a_{ij}$ and dwell time probabilities $\hat{p}_{ij}(t)$ one can compute analytically how a given distribution of visitors will be distributed at any given future time. However, these expressions are quite complicated. A much simpler alternative is to perform Monte Carlo simulations.
Monte Carlo methods are a class of techniques that, instead of solving a problem analytically, use a large number of particles spread randomly over the function domain to estimate the function's value. The portion of particles within the function 'area' among the total number of particles is a good estimate of the true function when the number of particles or iterations is high [13].
Figure 4.3 shows an example of what the distribution of transitions from $s_1$ to the other positions may look like. The probability of a transition is proportional to the sub-area of that position. The red dots are particles given random coordinates $x$ and $y$. The particles represent the visitors in position $s_1$, and the position of each particle gives the next position of that visitor in the zoo. If the visitors were instead distributed directly according to the measured distributions, they would most likely be split into fractions.
Figure 4.3: The distribution of transitions from position $s_1$ is proportional to the 'position area' such that the total area is one. The red 'particles' are distributed with Monte Carlo simulation. The simulated distribution estimates the measured distribution but keeps the number of visitors at each position a natural number.
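The difference between distributing visitors "in fractions" and sampling whole visitors can be illustrated with a few lines of code. This is a sketch with a made-up row of transition probabilities; the function name and the fixed random seed are illustrative choices.

```python
import random
from collections import Counter


def sample_next_positions(row, n_visitors, rng):
    """Draw one next position per visitor from one row of A (positions numbered from 1)."""
    positions = list(range(1, len(row) + 1))
    draws = rng.choices(positions, weights=row, k=n_visitors)
    return Counter(draws)


rng = random.Random(0)             # fixed seed so the sketch is repeatable
row_s1 = [0.0, 0.5, 0.3, 0.2]      # hypothetical a_{1,j}, with a_{1,1} = 0

counts = sample_next_positions(row_s1, n_visitors=7, rng=rng)
expected = [7 * p for p in row_s1]  # roughly 0, 3.5, 2.1, 1.4: fractional visitors

print(dict(counts))   # whole visitors per position
print(expected)       # what direct multiplication would give
```

Each particle lands in exactly one position, so the simulated counts are always whole numbers that sum to the number of visitors, while their averages over many runs approach the fractional values.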
The same approach can be used with distributions of dwell time, with the difference that each 'sub-area' represents a number of minutes. The following algorithm will be used:
1. Initialize the number of visitors at each position with the actual distribution of visitors. This number is 0 if the simulation starts at $t_0$.
2. For each visitor $m$,
(a) Draw a sample $y_{ii}^m$ from $\hat{p}_{ii}(t)$.
(b) Draw a random new position $j$ according to the probability $a_{ij}$.
(c) Draw a sample $y_{ij}^m$ from $\hat{p}_{ij}(t)$.
(d) Repeat until the visitor has left the park (position $j = 0$), or until the prediction horizon has been exceeded.
3. For each prediction horizon $t$, check the fraction of visitor locations at that time.
Simplified to one visitor that has not entered the zoo yet, it performs the following:
I Decide when the visitor enters the zoo ($s_i = s_1$) with a beta distribution.
II Decide the attraction dwell time at position $s_i$ with a beta distribution.
III Decide the new position $s_j$ with the Markov chain.
IV Decide the inter-attraction dwell time between $s_i$ and $s_j$ with a beta distribution.
V Set $s_i = s_j$.
VI Repeat from II until the visitor is likely to leave according to the beta distribution.
The same thing is then done for each visitor. Thus, the prediction horizon for the simulation is one day. The algorithm should be iterated many times to produce a solid result.
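Steps I to VI can be sketched as a small simulation. This is a simplified illustration under several assumptions, not the thesis implementation: all parameters are made up, each dwell time is a beta sample scaled by a hypothetical maximum number of minutes, and a visitor leaves when the Markov chain moves to state 0 rather than through a separate exit-time distribution.

```python
import random
from collections import Counter


def simulate_visitor(A, stay, travel, enter, horizon, rng):
    """Simulate one visitor and return a list of (arrival_minute, position).

    A[i][j]: transition probability between states 0..n (0 = not in the park).
    stay[i], travel[(i, j)], enter: (alpha, beta, max_minutes) for scaled betas.
    """
    def draw(params):
        alpha, beta, max_minutes = params
        return rng.betavariate(alpha, beta) * max_minutes

    t = draw(enter)                 # I: entry time, minutes after opening
    pos = 1                         # visitors enter at s1
    path = [(t, pos)]
    while True:
        t += draw(stay[pos])        # II: attraction dwell time
        nxt = rng.choices(range(len(A)), weights=A[pos])[0]   # III: Markov step
        if nxt != 0:
            t += draw(travel[(pos, nxt)])                     # IV: travel time
        if t >= horizon:
            break                   # prediction horizon exceeded
        path.append((t, nxt))       # V: arrival at the new state
        if nxt == 0:
            break                   # the visitor has left the park
        pos = nxt
    return path


def position_at(path, t):
    """State of a visitor at minute t (0 before entering and after leaving)."""
    pos = 0
    for arrival, p in path:
        if arrival > t:
            break
        pos = p
    return pos


# Hypothetical model with two sniffer positions plus state 0.
A = [[0.0, 1.0, 0.0], [0.2, 0.0, 0.8], [0.5, 0.5, 0.0]]
stay = {1: (2, 5, 60), 2: (2, 5, 120)}
travel = {(1, 2): (2, 2, 30), (2, 1): (2, 2, 30)}
enter = (2, 5, 420)
horizon = 420                       # one day of 7 opening hours, in minutes

rng = random.Random(1)
paths = [simulate_visitor(A, stay, travel, enter, horizon, rng) for _ in range(1000)]
counts = Counter(position_at(p, 200) for p in paths)   # step 3 at t = 200 minutes
print(dict(counts))
```

Counting the simulated visitors per state at a chosen minute gives the predicted geographical distribution at that time; repeating the whole run many times, as the text recommends, averages out the sampling noise.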
Figure 4.4: Models used by step 2 of the prediction algorithm. Panels: (a) s1, (b) s2, (c) s2.
4.2.1 Results
According to the results in Tables B.1 - B.3, the beta distributions score best for all the dwell times in the 1-sample K-S test. They also show the most desirable behaviour in terms of fit in Figures B.1 - B.8 (Appendix B).
In total, 7231 observed transitions were used to create the transition models. Figures 4.5, 4.6, C.1 and Table C.1 compare the predicted number of unique MAC addresses with the true value. Both model 1 and model 2 use beta distributions to model dwell times. Model 1 uses a static transition matrix trained on a whole day's observed transitions, while model 2 uses several time-sliced transition matrices trained on one hour each. The results were produced by 5000 iterations of Monte Carlo simulation. Each particle in the simulation represents one WIFI-device.
Figure 4.5: Monte Carlo simulation of the total number of unique MAC addresses over time in the zoo. Both models use beta distributions to predict dwell times. Model 1 uses a general A-matrix while model 2 uses hourly A-matrices.
Figure 4.6: Monte Carlo simulation of the number of unique MAC addresses at positions $s_i$ over time, with panels (a) $s_1$, (b) $s_2$, (c) $s_3$, (d) $s_4$, (e) $s_5$ and (f) $s_6$. Both models use beta distributions to predict dwell times. Model 1 uses a general A-matrix while model 2 uses hourly A-matrices.
5
Discussion
That the hourly transition matrices for movement and beta distribution for dwell times performs the best is not very surprising since these are the most sophis-ticated of the ones tested. Although they demand a bit more calculations, they seem to make the effort worthwhile. As mentioned, the dwell time model could be made even more delicate by create separate models for different time frag-ments, just as made with the Markov chain. This task is left for future work. Notice that it is hard to tell from cross-validating only against one day’s data whether such work calibrates the model to the better or overfits it to a pattern that is not general. Regardless, further evaluation is in place.
The fact that the models were successfully cross-validated against a day with a significantly different number of visitors indicates that the model generalises well. Although the result is considered good, the fact that it is not perfect suggests that there are dynamics in the visitor flow that cannot be captured from the probe request data alone. The model takes a purely quantitative approach to prediction: it uses only observations of how visitors behave, without considering what drives this behaviour. It could be a good idea to broaden the scope and integrate data on the drivers of these dynamics, e.g. weather data.
The choice of using the filtered dataset in the prediction model was natural since it had the strongest correlation with the ticket data. If this work were to be used to create a real-time estimation model, however, filtering the data in real time could be problematic, and the raw dataset would have to be used instead. Since the raw dataset does not discard randomized MAC addresses, it will also include some non-entering WiFi devices when modelling entering visitors. This is particularly evident in the morning, when a queue builds up at the main entrance before the zoo opens, and late in the afternoon, when visitors are leaving the zoo. It is likely that a raw dataset which excludes these time frames would correlate better with the ticket data, and would therefore also give more credible constants $C_1$ and $C_2$ for the visitor estimation.
5.1 Conclusion
This report has investigated whether it is possible to predict how visitors at Kolmården Zoo will be geographically distributed at any future time using probe request data.
The results from the experimental analysis show that, for the filtered dataset, there is a strong correlation between the number of visitors entering the park and the number of unique MAC addresses sniffed for the first time at the main entrance. This means that ticket data can be used as ground truth for the model. Hence, given the number of unique MAC addresses $x_{s_i}(t_k)$ sniffed at a position $s_i$ at time $t_k$, the number of visitors can be estimated as:
\[ \hat{x}_i(t_k) = C_1 \, x_{s_i}(t_k) \tag{5.1} \]
where $C_1 = 5.01$. This correlation only gets stronger when the data are aggregated over a longer time interval. The correlation between the ticket data and the raw dataset was much weaker. A probable reason is that WiFi devices with randomized MAC addresses are counted multiple times, including when leaving the park.
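As an illustration of how a constant such as $C_1$ and the dependence on the aggregation interval could be examined, the sketch below fits a slope through the origin and computes $R^2$ for 1, 5 and 15 minute bins. It is an example under stated assumptions: the data are synthetic (each visitor is assumed to be seen as a unique MAC address with probability 0.2), and the least-squares fit without intercept is an assumed estimator, not necessarily the one used in this work.

```python
import math
import random

def aggregate(counts, width):
    """Sum per-minute counts into consecutive bins of `width` minutes."""
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]

def fit_through_origin(x, y):
    """Least-squares slope C for y = C * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Synthetic morning: entering visitors per minute and first-time MAC sightings.
rng = random.Random(0)
tickets, macs = [], []
for t in range(240):
    expected = 30 * math.exp(-((t - 60) / 40) ** 2)  # hypothetical arrival peak
    visitors = int(expected * rng.uniform(0.5, 1.5))
    tickets.append(visitors)
    macs.append(sum(rng.random() < 0.2 for _ in range(visitors)))

for width in (1, 5, 15):
    x, y = aggregate(macs, width), aggregate(tickets, width)
    print(f"{width:>2} min: C = {fit_through_origin(x, y):.2f}, R^2 = {r_squared(x, y):.2f}")
```

With a detection probability of 0.2 the fitted constant lands near 5, which is of the same order as the reported $C_1 = 5.01$; this agreement is built into the synthetic data and carries no evidential weight.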
The results from the Monte Carlo prediction show that:
• all dwell times are best modelled with a continuous beta distribution,
• movement is best modelled with hourly transition matrices.
This implies that the number of unique MAC addresses sniffed by sniffer $s_i$ at any future time $t_k$ can be predicted as $\hat{x}_{s_i}(t_k)$ using these methods.
Combining the experimental and predicted results shows that the number of visitors at each position $s_i$ at any future time $t_k$ can be predicted by:
\[ \hat{x}_i(t_k) = C_1 \, \hat{x}_{s_i}(t_k) \tag{5.2} \]
The prediction model should be aggregated to time intervals of 5 or 15 minutes to give a more precise and solid result. More work is needed before we can tell exactly how precise it is.
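A minimal sketch of this last step is given below. It assumes that aggregation means averaging the predicted per-minute number of unique MAC addresses within each bin (the aggregation rule is an assumption of this example), and the predicted series is hypothetical. Only the value of $C_1$ is taken from this work.

```python
C1 = 5.01  # scaling constant reported for the filtered dataset

def bin_means(series, width):
    """Average a per-minute series over consecutive bins of `width` minutes."""
    bins = [series[i:i + width] for i in range(0, len(series), width)]
    return [sum(b) / len(b) for b in bins]

def predict_visitors(predicted_macs, width=5, c1=C1):
    """Scale the binned MAC prediction to a visitor prediction, as in (5.2)."""
    return [c1 * m for m in bin_means(predicted_macs, width)]

# Hypothetical predicted unique MAC addresses at one position, per minute.
predicted_macs = [10, 12, 11, 13, 14, 20, 22, 21, 19, 18]
print(predict_visitors(predicted_macs))
```

Averaging over a bin smooths minute-to-minute fluctuations in the Monte Carlo output before the scaling is applied.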
5.2 Application
An application based on this work could serve both a tactical and a strategic purpose. As a tactical tool, it can be used to allocate staff in advance in a smarter way. By doing so, an even smaller workforce could probably respond better to visitor demand than the existing one. As a strategic tool, it can be used to plan walkways, restaurant locations etc. to achieve a better overall visitor flow in the zoo. This could increase the maximum number of visitors and create a better experience for the visitor. These tools could be integrated into the existing enterprise resource planning (ERP) system or delivered as a stand-alone module.
5.3 Future work
As mentioned, the day on which the model was calibrated had almost twice as many visitors as the day against which the models were validated. It is possible that a day with a higher number of visitors suffers from bottlenecks, which would cause different dynamics in the visitor flow. It would therefore be interesting to calibrate and validate the model on days with more similar visitor numbers. If this increases the prediction accuracy, different models could then be used depending on the expected number of visitors. In the same way, it would be interesting to use separate models for other parameters that could affect the visitor flow dynamics, such as different weather forecasts.
Further, the model bases its predictions only on how visitors generally moved in the past. This means it will have trouble predicting the flow when special events occur (e.g. a cage is emptied for cleaning). If such events were planned in advance, it would be useful if managers could incorporate this qualitative information into the model. The probability of a long dwell time at this position (or of moving to it at all) could then be decreased.
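One possible way to encode such a planned event is to scale down the transition probabilities into the affected position and renormalise each row of the transition matrix. The sketch below illustrates this idea only; the function name, the damping factor and the matrix values are hypothetical, and this adjustment has not been implemented or evaluated in this work.

```python
def damp_transitions(A, target, factor):
    """Scale the probability of moving to `target` by `factor`, then renormalise rows."""
    adjusted = []
    for row in A:
        new_row = [p * factor if j == target else p for j, p in enumerate(row)]
        total = sum(new_row)
        # Keep the original row if all of its probability mass was removed.
        adjusted.append([p / total for p in new_row] if total > 0 else list(row))
    return adjusted

# Hypothetical 3-position transition matrix (each row sums to 1).
A = [
    [0.2, 0.3, 0.5],
    [0.4, 0.1, 0.5],
    [0.3, 0.3, 0.4],
]

# Position 2 is closed for cleaning: make moves to it ten times less likely.
A_event = damp_transitions(A, target=2, factor=0.1)
for row in A_event:
    print([round(p, 3) for p in row])
```

Renormalising keeps every row a valid probability distribution, so the adjusted matrix can replace the original one for the hours in which the event takes place.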
A Ground truth correlation plots
(a) Time plot - 1 min interval  (b) R^2 plot - 1 min interval
(c) Time plot - 5 min interval  (d) R^2 plot - 5 min interval
(e) Time plot - 15 min interval  (f) R^2 plot - 15 min interval
Figure A.1: Correlation between the true and the modelled number of visitors entering the zoo, aggregated over different time intervals. The true number is obtained from the ticket data, while the modelled number uses the filtered dataset.
(a) Time plot - 1 min interval  (b) R^2 plot - 1 min interval
(c) Time plot - 5 min interval  (d) R^2 plot - 5 min interval
(e) Time plot - 15 min interval  (f) R^2 plot - 15 min interval
Figure A.2: Correlation between the true and the modelled number of visitors entering the zoo, aggregated over different time intervals. The true number is obtained from the ticket data, while the modelled number uses the raw dataset.
B Distribution of $\hat{p}_{ij}$
Table B.1: D-value from 1-sample K-S test for beta distribution.
To \ From   S1    S2    S3    S4    S5    S6
S1          0.46  0.29  0.30  0.52  0.14  0.26
S2          0.27  0.32  0.17  0.41  0.14  0.10
S3          0.20  0.17  0.50  0.14  0.22  0.32
S4          0.30  0.20  0.28  0.28  0.22  0.26
S5          0.13  0.10  0.24  0.17  0.36  0.20
S6          0.14  0.10  0.51  0.35  0.26  0.65
Table B.2: D-value from 1-sample K-S test for exponential distribution.
To \ From   S1    S2    S3    S4    S5    S6
S1          0.46  0.17  0.21  0.57  0.28  0.28
S2          0.21  0.32  0.25  0.50  0.26  0.09
S3          0.28  0.23  0.50  0.17  0.16  0.33
S4          0.36  0.16  0.20  0.28  0.30  0.23
S5          0.25  0.23  0.23  0.17  0.36  0.15
S6          0.20  0.18  0.51  0.31  0.19  0.65
(a) S1  (b) S1 to S2  (c) S1 to S3  (d) S1 to S4  (e) S1 to S5  (f) S1 to S6
(a) S2 to S1  (b) S2  (c) S2 to S3  (d) S2 to S4  (e) S2 to S5  (f) S2 to S6
(a) S3 to S1  (b) S3 to S2  (c) S3  (d) S3 to S4  (e) S3 to S5  (f) S3 to S6
(a) S4 to S1  (b) S4 to S2  (c) S4 to S3  (d) S4  (e) S4 to S5  (f) S4 to S6
(a) S5 to S1  (b) S5 to S2  (c) S5 to S3  (d) S5 to S4  (e) S5  (f) S5 to S6
(a) S6 to S1  (b) S6 to S2  (c) S6 to S3  (d) S6 to S4  (e) S6 to S5  (f) S6
Table B.3: D-value from 1-sample K-S test for entering and leaving the zoo.
Transition   norm   exp    beta
S0 to S1     -      0.16   0.07
S1 to S0     0.15   -      0.14
S2 to S0     0.16   -      0.08
S3 to S0     0.17   -      0.10
S4 to S0     0.32   -      0.22
S5 to S0     0.20   -      0.20
S6 to S0     0.11   -      0.07
(a) S1 to S0  (b) S2 to S0  (c) S3 to S0  (d) S4 to S0  (e) S5 to S0  (f) S6 to S0
C Prediction error
Table C.1: The mean of the true number of WiFi devices over time compared with the mean absolute prediction error for all sniffer positions over time. Model 1 uses a static A-matrix while Model 2 uses hourly A-matrices.
Position   True value   Model 1 error   Model 2 error
1          121          49              38
2          116          24              25
3          43           14              9
4          52           18              16
5          80           36              35
6          104          36              25
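To make the errors comparable between busy and quiet positions, they can be related to the mean true value at each position. The sketch below does this for the values in Table C.1. The relative error is a derived quantity introduced only for this example and is not a metric reported elsewhere in this work.

```python
# (position, mean true count, Model 1 error, Model 2 error) from Table C.1
table_c1 = [
    (1, 121, 49, 38),
    (2, 116, 24, 25),
    (3, 43, 14, 9),
    (4, 52, 18, 16),
    (5, 80, 36, 35),
    (6, 104, 36, 25),
]

def relative_errors(rows):
    """Mean absolute error divided by the mean true count, per position."""
    return {pos: (e1 / true_mean, e2 / true_mean) for pos, true_mean, e1, e2 in rows}

rel = relative_errors(table_c1)
model2_better = [pos for pos, (r1, r2) in rel.items() if r2 < r1]

for pos, (r1, r2) in rel.items():
    print(f"s{pos}: model 1 {r1:.0%}, model 2 {r2:.0%}")
print("model 2 has the lower error at positions", model2_better)
```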
(a) $s_1$  (b) $s_2$  (c) $s_3$  (d) $s_4$  (e) $s_5$  (f) $s_6$
Figure C.1: Absolute error between the predicted and true number of WiFi devices at position $s_i$ over time.
Bibliography
[1] Foot analytics. http://footanalytics.com/. Accessed: 2019-05-18. [2] Regulation (EU) 2016/679 of the european parliament and of the council
(article 30).
[3] IEEE Standard for Information technology–Telecommunications and infor-mation exchange between systems Local and metropolitan area networks– Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. IEEE Std 802.11-2016 (Re-vision of IEEE Std 802.11-2012), pages 1–3534, Dec 2016. doi: 10.1109/ IEEESTD.2016.7786995.
[4] Android. Privacy - MAC Randomization. https://source.android. com/devices/tech/connect/wifi-mac-randomization. Accessed: 2019-05-24.
[5] Anemyr and Zernis. Analys av wi-fi signaler från mobiltelefoner -analysis of probe requests from mobile phones. 2018.
[6] Apple. iOS Security - iOS 12.3. https://www.apple.com/business/ site/docs/iOS_Security_Guide.pdf, May 2019. Accessed: 2019-05-24.
[7] A. S. C. Ehrenberg. An appraisal of markov brand-switching models. Jour-nal of Marketing Research, 2(4):347–362, 1965. ISSN 00222437. URL http://www.jstor.org/stable/3149481.
[8] Niels J. Geuze. Hidden Markov models for probe request based location prediction. 2016.
[9] Andrey Kolmogorov. Sulla determinazione empirica di una legge di dis-tribuzione. Biometrika, 4:83–91, 1933.
[10] Jeremy Martin, Travis Mayberry, Collin Donahue, Lucas Foppe, Lamont Brown, Chadwick Riggins, Erik C. Rye, and Dane Brown. A study of mac address randomization in mobile devices and when it fails. Proceed-ings on Privacy Enhancing Technologies, 2017, 03 2017. doi: 10.1515/ popets-2017-0054.
48 Bibliography
[11] Jeremy Martin, Travis Mayberry, Collin Donahue, Lucas Foppe, Lamont Brown, Chadwick Riggins, Erik C Rye, and Dane Brown. A study of MAC address randomization in mobile devices and when it fails. Proceedings on Privacy Enhancing Technologies, 2017(4):365–383, 2017.
[12] Célestin Matte, Mathieu Cunche, Franck Rousseau, and Mathy Vanhoef. De-feating mac address randomization through timing attacks. In Proceedings of the 9th ACM Conference on Security & Privacy in Wireless and Mobile Networks, pages 15–20. ACM, 2016.
[13] Nicholas Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949. ISSN 01621459. URL http://www.jstor.org/stable/2280232.
[14] ABM Musa and Jakob Eriksson. Tracking unmodified smartphones using wi-fi monitors. In Proceedings of the 10th ACM conference on embedded network sensor systems, pages 281–294. ACM, 2012.
[15] nikharris0. Probemon, 2016-10-06. URL https://github.com/ nikharris0/probemon.
[16] Karl Pearson. Method of moments and method of maximum likelihood. Biometrika, 28(1/2):34–59, 1936. ISSN 00063444. URL http://www. jstor.org/stable/2334123.
[17] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, Feb 1989. ISSN 0018-9219. doi: 10.1109/5.18626.
[18] Lorenz Schauer, Martin Werner, and Philipp Marcus. Estimating crowd den-sities and pedestrian flows using wi-fi and bluetooth. In Proceedings of the 11th International Conference on Mobile and Ubiquitous Systems: Comput-ing, Networking and Services, pages 171–177. ICST (Institute for Computer Sciences, Social-Informatics and . . . , 2014.
[19] Tak Woon Yan, Matthew Jacobsen, Hector Garcia-Molina, and Umeshwar Dayal. From user access patterns to dynamic hypertext linking. Computer Networks and ISDN Systems, 28(7):1007 – 1014, 1996. ISSN 0169-7552. doi: https://doi.org/10.1016/0169-7552(96)00051-7. URL http://www. sciencedirect.com/science/article/pii/0169755296000517. Proceedings of the Fifth International World Wide Web Conference 6-10 May 1996.