Estimation of player trajectories from context in football games using autoencoders
MANON DEPRETTE
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Computer Science
Date: June 26, 2020
Supervisors: Atsuto MAKI & Gareth LOY
Examiner: Joakim GUSTAFSON
School of Electrical Engineering and Computer Science
Host company: ChyronHego AB
Swedish title: Uppskattning av fotbollsspelares positioner över tid med hjälp av kontext och autoencoders
Abstract
ChyronHego tracks a large number of football games. Occasionally there are errors in the tracked positions of the ball or the players. This thesis investigates to what extent vanilla autoencoders, variational autoencoders and conditional variational autoencoders can recognise patterns in the data and thus be used to predict missing data for individual agents' trajectories in an adversarial multi-agent situation such as a football game. Furthermore, we implement a multi-agent role alignment technique to order the outfield players in the dataset and use their identity, learnt unsupervised, in the predictions. We find that in cases where the model cannot sufficiently rely on the individual agent's trajectory information, it efficiently uses the context, i.e. the other agents' behaviour, to make more accurate predictions of the missing data. However, the identities of the players do not seem to improve the predictions of the models.
Sammanfattning
ChyronHego collects data from a large number of football matches. This data contains the positions of the ball and the players over time, but sometimes correct positions for a player are missing, and there is then a need to be able to predict them. The aim of this work is to investigate how an autoencoder, a variational autoencoder and a conditional variational autoencoder can be used to recognise patterns in the data. Furthermore, the work investigates whether these models can predict correct positions for the players during a football match. In addition to the models mentioned above, we also implement a role alignment technique that uses unsupervised learning of the players' relative ordering, which could yield improved predictions. The results show that when correct data is missing for a player's position, the model can use the other players' movement patterns to predict the correct position. Furthermore, we find no support for the learnt features from the role alignment technique improving the model's predictions.
Firstly, I would like to thank Gareth Loy & Lars Bretzner for their guidance and valuable advice throughout this project, and for the opportunity to work with them at ChyronHego. I would also like to thank the whole team for the pleasant semester I spent there.
Further thanks to my supervisor Atsuto Maki for his support and mentoring to help me carry out this project.
1 Introduction
  1.1 Motivation
  1.2 Research Question
  1.3 Outline of the thesis
2 Background
  2.1 Models
    2.1.1 Autoencoders
    2.1.2 Variational autoencoders
    2.1.3 Conditional variational autoencoders
  2.2 Training a deep neural network
    2.2.1 Adam optimizer
    2.2.2 Batch Normalization
  2.3 Related work
    2.3.1 Forecasting multi-agent motion paths
    2.3.2 Solving the permutational alignment problem
3 Methods
  3.1 Data preprocessing
    3.1.1 Data preparation
    3.1.2 Role alignment
  3.2 Architectures
    3.2.1 Model 1: a variational autoencoder
    3.2.2 Model 2: adapted from the variational autoencoder
    3.2.3 Model 3: a corrective architecture
    3.2.4 Model 4: a conditional variational autoencoder
  3.3 Implementation
4 Results
  4.1 Role alignment
  4.2 Model 1: a variational autoencoder
    4.2.1 Experiments with a single player
    4.2.2 Experiments on the loss function
    4.2.3 Experiments on the role alignment
    4.2.4 Experiments on the encoded agents
    4.2.5 Experiments on the number of predicted frames
    4.2.6 Experiments to improve the training
  4.3 Model 2: adapted from the variational autoencoder
  4.4 Model 3: a corrective network
  4.5 Model 4: a conditional variational autoencoder
    4.5.1 Experiments on the role alignment
5 Discussion and future work
  5.1 Results
  5.2 Mask
  5.3 Connections
  5.4 Architectures
  5.5 Future work
6 Conclusions
Bibliography
A Derivation of the formula for the K-L divergence between the prior and the approximated posterior in the Gaussian case
B Networks architectures
2.1 Schema of the autoencoder architecture. Image downloaded from https://www.jeremyjordan.me/autoencoders/ in January 2020.
2.2 Influence of the size of the latent space on the reconstruction. Image downloaded from https://github.com/ChengBinJin/VAE-Tensorflow/ in October 2019.
2.3 Schema of the variational autoencoder architecture. Image downloaded from https://www.jeremyjordan.me/variational-autoencoders/ in January 2020.
2.4 The reparametrization trick. Image downloaded from https://www.jeremyjordan.me/variational-autoencoders/ in January 2020.
2.5 Illustration of the latent space. Image downloaded from https://jmetzen.github.io/2015-11-27/vae.html in October 2019.
2.6 Schema of the CVAE architecture
2.7 Generation of a digit from a random sample from the latent distribution of a VAE. Code from github: https://github.com/graviraja/pytorch-sample-codes accessed in January 2020.
2.8 Generation with a CVAE with concatenated one-hot encoding to generate the digit 9. Code from github: https://github.com/graviraja/pytorch-sample-codes accessed in January 2020.
2.9 Architecture of the network from [10]
2.10 DESIRE architecture from [12]
2.11 Example of a template from [11]
3.1 Schema of a tensor of the dataset
3.2 Example of a learnt template for the home team
3.3 Schema of an iteration of the tree based learning of the templates
3.4 Schema of our data split to learn the templates.
3.5 Schema of our VAE architecture
3.6 Schema of our adapted VAE architecture
3.7 Schema of our corrective architecture
3.8 Schema of our CVAE
4.1 Templates for each team when the home team is attacking
4.2 Templates for each team when the away team is attacking
4.3 Templates at the leaves of the tree for the home team
4.4 Templates at the leaves of the tree for the away team
4.5 Average position for each role after player ordering
4.6 Examples of reconstruction without context using model 1
4.7 Prediction of 1/3 of a sequence using model 1, the ball is kicked by the corrupted player where the data is missing
4.8 Prediction of 1/3 of the sequence using model 1, the ball is in possession of the corrupted player for most of the sequence
4.9 Prediction of 2/3 of the sequence using model 1, the ball is in possession of the corrupted player where the data is missing
4.10 Correction of the prediction using the ball (20 predicted frames).
4.11 Correction of the prediction using the ball where the correction fails to improve the first estimation of the prediction (20 predicted frames)
4.12 Schema of a sequence to be filled
4.13 Prediction of 1/2 of the sequence using model 4
5.1 Evolution of the MSE on the predicted frames for different encoded agents and the single player using a VAE
5.2 Evolution of the MSE on the predicted frames for different encoded agents and the single player using a CVAE
3.1 Agent ordering in the dataset
4.1 Predictions errors for a single player
4.2 Comparison between training the network to reconstruct the entire output or only the predicted frames of the corrupted player
4.3 Performance of model 1 with and without using the ordering of the players
4.4 Performance of model 1 depending on the encoded agents
4.5 Performance of model 1 with different encoded agents and given a number of predicted frames
4.6 L1 norm of the weights of the mean layer of the network when only the ball is encoded as context
4.7 Experiments on the K-L divergence regularization
4.8 Performance of model 1 with Batch Normalization (BN), with L2 weight decay and without any regularization
4.9 Performances of model 1 and 2 for 20 predicted frames
4.10 Performances of model 1 and 2 for 40 predicted frames
4.11 Performance of model 3 for 20 and 40 predicted frames first for a single player and then depending on the agents provided to the corrective network
4.12 Results for model 4 with varying known frames and encoded agents
4.13 Comparison of performances with and without using the ordering of the players when generating 40 frames and using 20 known frames.
5.1 Average predictions errors for 20 frames filled with each model depending on the context provided
5.2 Average predictions errors for 40 frames filled with each model depending on the context provided
B.1 Model 1 architectures depending on the encoded agents
B.2 Model 2 architectures depending on the encoded agents
B.3 Comparison of the total number of trainable parameters in model 1, 2 and 3
B.4 Model 4 architectures depending on the encoded agents for 20 predicted frames
B.5 Model 4 architectures depending on the encoded agents for 30 and 40 predicted frames
B.6 Comparison of the total number of trainable parameters in each network based on model 4
AE AutoEncoder.
CVAE Conditional Variational AutoEncoder.
ELBO Evidence Lower Bound.
EM Expectation–Maximization.
K-L divergence Kullback-Leibler divergence.
MSE Mean Squared Error.
ReLU Rectified Linear Unit.
RNN Recurrent Neural Network.
ULM Uniform Linear Motion.
VAE Variational AutoEncoder.
1 Introduction
Team sports like basketball and football involve multi-agent interactions characterized by sequences of events that usually reflect strategies. In crowded situations, it can be difficult for a computer vision system to consistently track every agent's trajectory, and the system can fail to correctly retrieve all the agents' paths. In this thesis, we focus on football games where the agents are the outfield players, the goalies, the ball and the referees. Our dataset contains tracking data for these agents from hundreds of professional football games.
We investigate how deep learning models can be used to facilitate the estimation of the position of an agent once the camera has lost sight of it or when its tracking data is inaccurate or noisy. This specific agent will be referred to as the corrupted agent. More specifically, we study architectures based on the AutoEncoder (AE), such as the Variational AutoEncoder (VAE) and the Conditional Variational AutoEncoder (CVAE). The VAE is a generative model that learns a stochastic latent distribution to produce new examples similar to the training data. The objective of this model is to recognize patterns in the data and use them to generate plausible trajectories in place of the missing data.
Derived from the VAE, the CVAE also takes into account conditions that the generated sample has to match. In our case, these are the other agents' behaviour and interactions with the corrupted agent, given the adversarial nature of the game. These conditions represent an aspect of the "context" that we want to use to make the best possible predictions. The players' roles are another aspect of this context. We learn them unsupervised and use them to order the data.
1.1 Motivation
ChyronHego tracks a large number of football games. Occasionally there are errors in the tracked positions of the ball or players. This project aims to facilitate the estimation of the position of a player once the camera has lost sight of it, or to infer missing data points. Hence, the results of this thesis could help increase the robustness of the tracked data.
We focus on football games, but this work can be generalized to other sports involving interacting agents, such as basketball, and by extension to any multi-agent situation.
The project's innovation lies in the conditional approach of filling in the missing data using context.
Moreover, exploring how our models can recognise patterns in the data and thus enable prediction of missing data provides insight into possible further extensions of this project. Indeed, this work can also lead to further questions, such as the prediction of future trajectories given context and the detection of deviations of the trajectories from the expected pattern.
1.2 Research Question
The hypothesis being tested is that an AE or one of its variants can capture patterns in the data and the interdependence between the agents' motions. We also explore the extent to which this can be used to fill in missing points in the data.
The research questions this project is studying are the following:
• Can the manifold of football player/team trajectories be effectively modelled by the latent space of an AE or its variants: VAE and CVAE?
• Using such a manifold, can we fill in missing data in tracking sequences?
• What aspects of the context can be used to improve the prediction of the missing data?
1.3 Outline of the thesis
This thesis is structured as follows:
• In chapter 2, we first introduce relevant theoretical background, including a presentation of the architectures our models are based on, namely AE, VAE, and CVAE, and techniques for training deep neural networks efficiently. After that, related literature is reviewed.
• In chapter 3, we describe our approach to answer the research questions. We explain how we prepared the data and used a role-alignment procedure to learn orderings of the outfield players. We also define the models we train, how we train them and how we evaluate them both quantitatively and qualitatively.
• In chapter 4, we present our results and we explain how the implementation and the models have evolved towards the solution by analyzing the outcomes of our experiments.
• In chapter 5, we discuss the results, their validity considering the approach we have taken and future work on the subject.
• In chapter 6, we summarize our work and draw our conclusions on it.
2 Background
This chapter provides explanations regarding relevant theoretical background and describes several approaches related to this thesis. We start by presenting the AE, VAE and CVAE architectures theoretically and practically, then we introduce two important tools in the training of deep neural networks: the Adam optimizer and Batch Normalization. Finally, we review literature regarding the prediction of multi-agent motion paths and role-alignment implementations.
2.1 Models
The architectures investigated throughout this project are the AutoEncoder, the Variational AutoEncoder and the Conditional Variational AutoEncoder.
In this section, we give an introduction to these types of networks.
2.1.1 Autoencoders
Autoencoders were introduced in 1986 by Rumelhart et al. [1] as an unsupervised learning algorithm. The following year, Ballard [2] also proposed them as a method for unsupervised pre-training of Artificial Neural Networks.
Since then, these auto-associative configurations of neural networks have been increasingly popular, be it for the purpose of finding compressed representations of the data, feature learning or, more recently, as generative models.
The objective of an AE is to find a smaller representation of the data by learning its underlying structure. In order to do that, an encoder is used to compress the data into a space of smaller dimension, and a decoder is used to reconstruct the original input from this "latent" representation. In order to accurately reconstruct the input, the network has to extract the relevant features of the data when encoding it. This structure is shown in figure 2.1. It can use convolutional or fully connected layers.
Figure 2.1: Schema of the autoencoder architecture
With such an architecture, the size of the latent space has an important role when it comes to what accuracy the reconstruction can reach since all information to be decoded is compressed into these variables. We can observe in figure 2.2 that a small latent space leads to fuzzy reconstructions, whereas the information is much sharper when we increase its size. However, as the objective of such a network is to reduce the dimensionality of the input data, the accuracy of the reconstruction has to be balanced with the size of the latent space.
In our case, we want the network to be able to correctly capture the specificities of our data so we tend to add depth to our network. However, a very small latent space leads to poor predictions since we are losing information during the compression.
Figure 2.2: Influence of the size of the latent space on the reconstruction. In this example, the network is trained on the MNIST dataset [3]. On the top left images, there are only two nodes in the latent space of the network which makes the reconstruction very blurry and sometimes inaccurate. The more the size of the latent space is increased, the sharper the reconstructed numbers are.
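The encoder/decoder structure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the architecture used in this thesis: the single-layer encoder and decoder, the layer sizes, the random untrained weights and all function names are hypothetical choices made for readability.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(a):
    """Rectified Linear Unit activation."""
    return np.maximum(a, 0.0)


def init_autoencoder(input_dim, latent_dim):
    """Random, untrained weights for a one-layer encoder and a one-layer decoder."""
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(input_dim, latent_dim)),
        "b_enc": np.zeros(latent_dim),
        "W_dec": rng.normal(0.0, 0.1, size=(latent_dim, input_dim)),
        "b_dec": np.zeros(input_dim),
    }


def encode(params, x):
    """Compress the input into the smaller latent space."""
    return relu(x @ params["W_enc"] + params["b_enc"])


def decode(params, z):
    """Reconstruct the input from its latent representation."""
    return z @ params["W_dec"] + params["b_dec"]


def reconstruction_mse(params, x):
    """Mean squared error between the input and its reconstruction."""
    x_hat = decode(params, encode(params, x))
    return float(np.mean((x - x_hat) ** 2))


# Hypothetical batch of 8 flattened 28x28 inputs and a 2-node latent space.
x = rng.normal(size=(8, 784))
params = init_autoencoder(input_dim=784, latent_dim=2)
z = encode(params, x)
x_hat = decode(params, z)
```

Training would adjust the weights to lower `reconstruction_mse`; increasing `latent_dim` gives the decoder more information to work with, which is the trade-off discussed above.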
2.1.2 Variational autoencoders
The principle
The Variational AutoEncoder (VAE) is a variant of the AE introduced by Kingma et al. [4] that assumes that an underlying distribution p(x) can describe the training data X = {x}. Then, instead of mapping an input to a fixed vector in the latent space, it is mapped to a probability distribution. The latent space of the AE is now replaced by the mean and standard deviation of this distribution. Next, a sample from the distribution is fed into the decoder to reconstruct the input. This principle is shown in figure 2.3. Once the network is trained, one can sample directly from the latent distribution and decode this sample to generate a new data point that is similar to the data the model is trained with. In his tutorial, Doersch [5] gives further explanations about VAEs.
From a probabilistic perspective, the goal is to learn a model $p_\theta$ parametrized by a vector $\theta$ which one can sample from in order to generate new examples similar to those of the training data $X$. Theoretically, one can sample latent variables $z$ from a prior distribution $p_\theta(z)$, generate $x$ from a conditional distribution $p_\theta(x|z)$ and then maximize the marginal likelihood $p_\theta(x)$ defined by:

$$p_\theta(x) = \int p_\theta(x|z)\, p_\theta(z)\, dz \tag{2.1.1}$$

Figure 2.3: Schema of the variational autoencoder architecture

This is the decoder part of the VAE, also called the generative network. However, most $z$ have almost no contribution to $p_\theta(x)$. Hence these samples $z$ are not likely to have produced $x$. For the model to be representative of our dataset, it is important to make sure that a sample $z$ is generating something similar to the training data $x$. However, the true posterior distribution $p_\theta(z|x)$ is intractable so it is approximated using a model $q_\phi(z|x)$ parametrized by a vector $\phi$. So, for a given sample $x$, the learnt distribution $q_\phi$ is trained to give high probability to the samples $z$ from which $x$ could have been generated.
This model is the encoder part of the VAE, also called the inference model.
Hence, instead of maximizing $p_\theta(x) = \int p_\theta(x|z)\, p_\theta(z)\, dz = E_{z \sim p_\theta(z)}[p_\theta(x|z)]$, the quantity $E_{z \sim q_\phi(z|x)}[p_\theta(x|z)]$ is maximized.
The objective function
We now derive the objective function of the VAE. We start by introducing a measure that helps us estimate the "distance" or "dissimilarity" between two distributions, called the Kullback-Leibler divergence (K-L divergence). It is not a distance in the strict sense since it is not symmetric.
This measure is defined by:

$$D_{KL}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\, dx \tag{2.1.2}$$

It can be interpreted as the expectation with respect to the distribution $p$ of the logarithmic difference between the distributions $p$ and $q$. In other words, it measures how $q$ is different from the reference distribution $p$.
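To make the definition concrete, the following sketch evaluates the discrete analogue of the K-L divergence, where the integral becomes a sum over outcomes. The two example distributions and the function name are assumptions made for illustration only.

```python
import math


def kl_divergence(p, q):
    """Discrete K-L divergence D_KL(p || q) = sum_i p_i * log(p_i / q_i).

    Terms with p_i = 0 contribute nothing; q_i is assumed to be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

d_pq = kl_divergence(p, q)   # q measured against the reference p
d_qp = kl_divergence(q, p)   # p measured against the reference q
d_pp = kl_divergence(p, p)   # a distribution compared with itself
```

Comparing `d_pq` with `d_qp` shows that the divergence depends on which distribution is taken as the reference, while `d_pp` is zero.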
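Using the definition of conditional probability $p(z|x) = \frac{p(z,x)}{p(x)}$ and of the K-L divergence (equation 2.1.2), we can derive successively:

$$D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) = E_{q_\phi}[\log q_\phi(z|x)] - E_{q_\phi}[\log p_\theta(x,z)] + \log p_\theta(x) \tag{2.1.3}$$

$$\log p_\theta(x) = D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) + \mathcal{L}(\theta, \phi; x) \tag{2.1.4}$$

where $\mathcal{L}(\theta, \phi; x) = E_{q_\phi}[\log p_\theta(x,z)] - E_{q_\phi}[\log q_\phi(z|x)]$. Since $D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) \geq 0$, we have:

$$\log p_\theta(x) \geq E_{q_\phi}[\log p_\theta(x,z)] - E_{q_\phi}[\log q_\phi(z|x)] = \mathcal{L}(\theta, \phi; x) \tag{2.1.5}$$

For this reason, $\mathcal{L}(\theta, \phi; x)$ is called the variational lower bound on the marginal log-likelihood of the datapoint $x$. Since $\log p_\theta(x)$ does not depend on $\phi$, equation 2.1.4 shows that, in order to minimize the difference between the true posterior distribution $p_\theta(z|x)$ and the approximated posterior $q_\phi(z|x)$, the Evidence Lower BOund (ELBO) $\mathcal{L}(\theta, \phi; x)$ has to be maximized.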
Moreover, using the product rule $p_\theta(x, z) = p_\theta(x|z)\, p_\theta(z)$, we can derive that:

$$\mathcal{L}(\theta, \phi; x) = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) \tag{2.1.6}$$

Hence, the ELBO decomposes into two terms. The first one measures how well $x$ is decoded from $z \sim q_\phi(z|x)$, hence maximizing it corresponds to minimizing the reconstruction error of the network. The second term measures how different the approximated posterior distribution $q_\phi(z|x)$ is from the prior $p_\theta(z)$. It can be seen as a regularization term that forces the learnt posterior to be close to the prior. This way, one can sample from the prior and, from this sample, the decoder generates a plausible $x$.
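As an illustration of the two terms of the objective, the sketch below computes a per-sample loss (the negative ELBO) when the approximated posterior is a diagonal Gaussian and the prior is a standard Gaussian, in which case the K-L term has a closed form. The squared error used as reconstruction term assumes a Gaussian decoder with fixed variance, so it matches the negative log-likelihood only up to a scale and an additive constant. The function names and the `kl_weight` parameter are hypothetical and not taken from the implementation of this thesis.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)


def negative_elbo(x, x_hat, mu, log_var, kl_weight=1.0):
    """Per-sample loss: squared reconstruction error plus weighted K-L regularization.

    Assumption of this sketch: the squared error stands in for -log p(x|z).
    """
    reconstruction = np.sum((x - x_hat) ** 2, axis=-1)
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return reconstruction + kl_weight * kl


# Hypothetical batch: 4 samples, 6 input dimensions, 2 latent dimensions.
x = np.ones((4, 6))
x_hat = np.zeros((4, 6))
mu = np.zeros((4, 2))
log_var = np.zeros((4, 2))
loss = negative_elbo(x, x_hat, mu, log_var)
```

When the posterior equals the prior (`mu = 0`, `log_var = 0`) the K-L term vanishes and only the reconstruction error remains; any other posterior adds a positive penalty.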
The reparametrization trick
On a more practical note, when it comes to back-propagating the gradients, one needs to tackle the issue that the sampling operation is not differentiable.
However, since the latent distribution is Gaussian, one can sample a noise variable $\epsilon$ from a standard normal distribution, multiply it element-wise by the standard deviation $\sigma$ and add the mean $\mu$ of the learnt distribution (equation 2.1.7).

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \tag{2.1.7}$$

This way, the latent variable depends deterministically on the parameters $\mu$ and $\sigma$, the randomness being isolated in $\epsilon$, and one can back-propagate through the network. This is shown in figure 2.4 and is referred to as the reparametrization trick in [4].
Figure 2.4: The reparametrization trick.
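The trick can be checked numerically with a short NumPy sketch. The parametrization of the encoder output as a log-variance, the function name and the example values of the mean and standard deviation are assumptions made for this illustration.

```python
import numpy as np


def reparametrize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    mu and sigma enter through deterministic operations only, so gradients
    can flow through them; the randomness is confined to eps.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(size=mu.shape)
    return mu + sigma * eps


rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
log_var = np.log(np.array([0.25, 4.0]))   # corresponds to sigma = [0.5, 2.0]

# Many draws should reproduce the mean and standard deviation of the latent distribution.
samples = np.stack([reparametrize(mu, log_var, rng) for _ in range(20000)])
```

The empirical mean and standard deviation of `samples` approach `mu` and `sigma`, which confirms that the shifted and scaled noise follows the intended Gaussian.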
A toy example
In this example, a VAE is trained with a 2D latent space on the MNIST dataset.
The recognition network (encoder) takes as input images of digits of dimensions 28x28 and the generator (decoder) learns to reconstruct these samples.
At test time, one can sample from the prior distribution and generate digits from these samples. To illustrate the structure of the latent space in the 2D case, the generator can be used to decode samples at different positions in the latent space and the results are shown in figure 2.5 at the positions for which the samples have been generated.
2.1.3 Conditional variational autoencoders
The architecture we build on throughout this project is a variant of the VAE, the conditional VAE (Sohn et al. [6]), that introduces a supplementary condition in the network by concatenating it to the input of the encoder and the decoder, as shown in figure 2.6. Hence, the encoder now learns a distribution $q_\phi(z|x, c)$ and the decoder a distribution $p_\theta(x|z, c)$, where $c$ is the condition the network has to take into account. All equations derived in subsection 2.1.2 are still valid by conditioning on $c$. We can note that the prior $p_\theta(z|c)$ is still a standard Gaussian distribution because $z$ is sampled independently of $c$ at test time.
Figure 2.5: Illustration of the latent space.
The objective (equation 2.1.6) is modified as in equation 2.1.8:

$$\mathcal{L}_{CVAE}(\theta, \phi; x, c) = E_{q_\phi(z|x,c)}[\log p_\theta(x|z, c)] - D_{KL}(q_\phi(z|x, c) \,\|\, p_\theta(z)) \tag{2.1.8}$$

At test time, like in the VAE case, one can sample from the prior distribution. However, conditions are now concatenated to this sample. This concatenation is fed to the decoder and, from it, a new data point that fits the given conditions is generated. Further details about this architecture can be found in [5] and [6].
Figure 2.6: Schema of the CVAE architecture
In our case, we are interested in creating trajectories that fit a given situation. Hence, we condition our network on some additional information that our predicted trajectory has to match, namely other agents' trajectories, known frames from the corrupted player's sequence or identities of the players.
Toy example
Typically, one can learn to generate data points that have a particular label. If we consider our previous example with the MNIST dataset, one can learn to reconstruct digits while concatenating a one-hot encoding of each of them to the input of the encoder and the decoder at train time. This way, the network learns which digit is reconstructed every time, and when the network is trained, we can choose which digit to generate by conditioning on the chosen label of the digit at test time.
We compare the generation of digits from a sample from the latent distribution using a VAE and a CVAE. In figure 2.7, we use a VAE to generate a random digit. In this case, the generated digit looks like an average reconstruction of three digits at the same time: 3, 8 and 9. Indeed, in order to minimize the Mean Squared Error (MSE) loss, the model learns averaged reconstructions, which is not what we would like. On the other hand, the CVAE learns a reconstruction alongside a one-hot encoding of the digit value. This way, it tends to generate digits that correspond strictly to the given value. In figure 2.8, we concatenate to the sample from the latent distribution a one-hot encoding of the digit 9 and the network correctly generates a 9.
Figure 2.7: Generation of a digit from a random sample from the latent distribution of a VAE.
Figure 2.8: Generation with a CVAE with concatenated one-hot encoding to generate the digit 9.
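The conditioning step of the toy example, i.e. concatenating a one-hot encoding of the chosen digit to a sample from the prior, can be sketched as follows. Only the construction of the decoder input is shown; the decoder itself is omitted, and the helper names and the 2D latent size are assumptions of this sketch.

```python
import numpy as np


def one_hot(label, num_classes=10):
    """One-hot encoding of a digit label."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec


def decoder_input(z, label, num_classes=10):
    """Concatenate a latent sample with the one-hot condition."""
    return np.concatenate([z, one_hot(label, num_classes)])


rng = np.random.default_rng(0)
z = rng.standard_normal(2)           # sample from the prior N(0, I), 2D latent space
dec_in = decoder_input(z, label=9)   # condition the generation on the digit 9
```

The resulting vector has the latent dimensions first and the ten condition entries last; a trained CVAE decoder would map it to an image of the requested digit.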
2.2 Training a deep neural network
In this section, we introduce the Adam optimizer (Kingma and Ba [7]), an adaptive learning rate optimization algorithm, and Batch Normalization (Ioffe and Szegedy [8]), a technique to facilitate the training of deep neural networks.
2.2.1 Adam optimizer
The optimizer that is used to train our networks is called the Adam optimizer (the name is derived from Adaptive Moment Estimation). It uses adaptive learning rates for each parameter based on estimates of the first moment (mean) and second moment (uncentered variance) of the gradients. The second moment is used to diminish the update in dimensions where the gradients vary a lot and to increase the update in dimensions with small variations.
It stores exponentially decaying averages of past gradients and of past squared gradients as estimates of these moments, and corrects the bias towards zero that results from initializing these averages at zero. After that, the weight update is proportional to the bias-corrected first moment estimate divided by the square root of the bias-corrected second moment estimate.
The benefits of this optimizer are that the learning rate is adapted for each parameter based on the magnitudes of the recent gradients. Hence, the learning rate requires less fine tuning. It still remains computationally efficient and is easy to configure as the parameters suggested by the original paper are effective. In [9], Ruder argues that, in comparison to several other optimizers, the Adam optimizer adds bias correction and slightly outperforms other methods at the end of optimization as gradients become sparser.
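The update rule described above can be written out in plain Python. This is a sketch for illustration, not the optimizer code used for the experiments: the quadratic test function, the number of steps and the learning rate of 0.01 are assumptions, while the decay rates and epsilon follow the defaults suggested in [7].

```python
import math


def adam_minimize(grad_fn, theta, steps=5000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function with Adam, given a function returning its gradient."""
    theta = list(theta)
    m = [0.0] * len(theta)   # decaying average of past gradients (first moment)
    v = [0.0] * len(theta)   # decaying average of past squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction: both averages start at zero and are biased towards it.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad(theta):
    """Gradient of the hypothetical test function (theta_0 - 3)^2 + 10 * (theta_1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


theta_opt = adam_minimize(grad, [0.0, 0.0])
```

Although the second coordinate has gradients ten times larger than the first, dividing by the square root of the second moment estimate gives both coordinates steps of comparable size, which is the per-parameter adaptation mentioned above.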
2.2.2 Batch Normalization
Batch normalization aims to reduce the change in the distribution of the hidden units during the training (called the internal covariate shift) by normalizing each batch (subtracting the batch mean and dividing by the batch standard deviation) in order to make the network more stable. Two trainable parameters are added at each layer so that the network is able to restore the unnormalized output before it is passed to the next layer. They correspond to a standard deviation (scale) parameter that multiplies the normalized output and a mean (shift) parameter that is then added to the result. The benefits are that we can use higher learning rates with less risk of vanishing or exploding gradients, and that it slightly regularizes the network, which reduces overfitting and hence the need for dropout. It also adds some noise to each layer's activations and makes the training depend less on the initialization. Finally, as the training is more stable and more efficient, we can decay the learning rate faster.
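The normalization and the two trainable parameters can be illustrated with a minimal training-mode forward pass in NumPy. The running statistics used at inference time are deliberately left out, and the names `gamma` and `beta` for the scale and shift parameters, as well as the example batch, are assumptions of this sketch.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization over the batch axis (axis 0).

    gamma is the trainable scale and beta the trainable shift; eps avoids
    a division by zero for features with no variance in the batch.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta


rng = np.random.default_rng(0)
# Hypothetical batch of 256 activations with 4 features, far from zero mean and unit variance.
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0` each feature of `y` has approximately zero mean and unit standard deviation over the batch; other values let the layer recover any scale and offset it needs.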
2.3 Related work
We now review literature about the prediction of multi-agent motion paths and, after that, about role-alignment techniques.
2.3.1 Forecasting multi-agent motion paths
We further detail two solutions that use a CVAE to predict multi-agent motion paths.
Predicting basketball player trajectories using a CVAE
In their paper [10], Felsen et al. implement a CVAE in order to forecast the motion of some agents given the remaining ones in basketball matches. In this case, the number of agents is fixed and all agents are given at once to the network. They learn a latent representation of the multi-agent trajectories and, using some context (conditions), predict future motions of several players. The architecture of the model is shown in figure 2.9.
Let $A$ be the set of interacting agents that are observed over the time history $[t_0, t_q]$ and $P \subseteq A$ the set of agents whose trajectories are predicted during the time $(t_q, t_f]$. The conditions are, in this case:
• the identities $\varrho$ of the players in $P$ or the identity of the team,
• the future motions of the agents $K$ that are not predicted, $X_K^{(t_q, t_f]}$, and
• one second of history of the predicted agents, $X_P^{[t_q - 1, t_q]}$.
The decoder outputs $Y = \left[ X_P^{[t_q - 1, t_q]},\ Y_P^{(t_q, t_f]} \right]$. The authors argue that adding this one second of observed trajectory history of the agents helps the model to learn to make predictions that are consistent with the history.
As we want to experiment on predicting segments of the trajectory of a player, this consideration is relevant to us. We can choose to only output the reconstructed part of the agent trajectory, since we can use the known data for the rest of the sequence, or the whole agent trajectory.
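The two output choices can be made concrete by splitting one agent's track into a known history window and the frames to predict. The array layout (frames by coordinates), the window length in frames and the function name are hypothetical; the number of frames that corresponds to one second depends on the frame rate of the tracking data, which is not assumed here.

```python
import numpy as np


def split_sequence(track, t_q, history_frames):
    """Split one agent's track of shape (frames, 2) at frame index t_q.

    Returns the last `history_frames` frames up to and including t_q
    (the known history) and all frames after t_q (the part to predict).
    """
    history = track[t_q - history_frames + 1 : t_q + 1]
    future = track[t_q + 1 :]
    return history, future


frames = 50
# Hypothetical track: x grows by one unit per frame, y stays at zero.
track = np.stack([np.arange(frames, dtype=float), np.zeros(frames)], axis=1)

history, future = split_sequence(track, t_q=29, history_frames=10)

# Option 1: the decoder target is only the missing part of the trajectory.
target_prediction_only = future
# Option 2: the target also contains the known history, as in [10].
target_with_history = np.concatenate([history, future], axis=0)
```

Including the history in the target makes the output longer, but it lets the loss penalize predictions that do not connect smoothly to the known frames.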
Moreover, Felsen et al. have observed that by ordering the agents consistently from one play to another, the network learnt more efficiently. This is performed using a tree-based role alignment of the agents introduced by Sha et al. in [11] and detailed in subsection 2.3.2.
A disadvantage of this approach is that a different model has to be trained depending on the period of time $[t_0, t_q]$ used as history to predict the future motions (input size of the encoders) and the period of time $(t_q, t_f]$ that is predicted (output size of the decoder). This issue is not encountered when using a Recurrent Neural Network (RNN), as we will see in the next reviewed solution.
Figure 2.9: Architecture of the network from [10]. The context and identity of the players are encoded separately and given as input to the decoder alongside the last second of the predicted agents' histories and the sample $\hat{z}$ from the variational module. The decoder outputs this last second concatenated with the predictions.
An RNN based CVAE framework for predicting future trajectories of multiple interacting agents
Lee et al. [12] provide a different framework to predict future trajectories of multiple interacting agents called DESIRE, a Deep Stochastic IOC (Inverse Optimal Control) RNN Encoder-decoder framework. Their architecture is also based on a CVAE but the encoder and decoder are RNNs instead of being fully connected layers.
Given that past situations can lead to several plausible future trajectories, they needed stochasticity in their architecture, hence the choice of a CVAE. The predictions are returned by the sample generation module and then ranked and refined to produce the best possible results in the end. These modules and the entire network can be found in figure 2.10.
The ranking and refinement module added after the decoder aims to assign a reward function to each prediction to determine the most likely one and to iteratively refine them using a module called Scene Context Fusion that uses the past motion histories, the semantic scene context and the interactions between the agents to build consistent hidden representations for the RNN. The trajectories are passed through a temporal convolution layer to encourage the network to learn the concept of velocity from adjacent frames before being fed into the RNN encoders. The predicted trajectories of each agent are combined together through a fusion layer, where interaction features among the agents are extracted. The reward function is learnt using Reinforcement Learning principles. The RNN assigns scores to each prediction based on the accumulated rewards to perform better in the long-term. After each iteration, the prediction is updated by a learnt displacement and is fed into the module again until precise predictions are obtained.
This structure is more complicated than the one in [10] and includes a new module to train without which the predictions are poor compared to the ones from [10]. However, this approach does not need any role alignment since the agents are treated separately first and refined afterwards using the scene context. What is more, since this approach is RNN based, we expect the temporal aspect of the data to be more efficiently captured in this case.
Finally, the idea of learning a concept of velocity might also be useful in our approach, to return coherent results in terms of human motion, even though in a first approach, we expect the network to learn it from the data.
Figure 2.10: DESIRE architecture from [12]. Several predictions are made from the CVAE and then iteratively ranked and refined until the predictions are precise enough.
We investigate an approach similar to the one introduced in [10], even though it does not adapt to different numbers of predicted frames, as it has the advantage of not needing any additional refinement module like [12].
2.3.2 Solving the permutational alignment problem
An important question in such a multi-agent context is the role alignment in the input representation. Indeed, the features associated with one input should ideally be consistent from one set to the next to help the network to learn the behaviour of the agents. If different orderings of the inputs are allowed, the model must, in a sense, learn all of them, whereas if a consistent ordering of inputs is used, the problem is simplified.
In [13], Le et al. tackle the issue of learning individual policies alongside multi-agent coordination by using a Reinforcement Learning technique, imitation learning. Given a set of (state, action) pairs from experts that the agents want to imitate, the model has to learn the policy that drives an agent to take action a from state s or its probability distribution p(a). The goal is to find a consistent indexing mechanism to perform the alignment of the experts given some latent variables that represent the roles of the agents. However, the identity of the agents cannot be used since they might not be the same from one sample to another and each agent has a relatively small amount of data samples. Another way would be to learn roles attributed to the agents.
However, some agents may exchange roles during the game or the roles might be different between samples. Instead of defining any role upstream, they are learnt unsupervised, as latent variables.
Similarly, in [11], Sha et al. address the issue of agent alignment to find a consistent ordering over basketball games. In their paper, a template designates an average (over time) player composition for a given team. An example of such a template can be found in figure 2.11 b).
Figure 2.11: Example of a template from [11]. Each color corresponds to a player. In plot (a), we see the positions of the players if they are not ordered. Plot (b) displays the template that will be applied to order the players. The positions of the players after ordering are shown in plot (c).
Assuming that the behaviours of the players consist of many states, the objective is to learn a different template for each state. Sha et al. [11] use an Expectation–Maximization (EM) algorithm to learn the templates unsupervised. With the templates, they can compute the permutation matrix that minimizes the distance between the data point and the learnt templates. After that, they find the best clusters to partition the data into distinct states. Each cluster is characterized by one template that defines the data alignment. The number of clusters is determined by maximizing within-cluster similarity whilst maintaining a significant number of data points in each cluster. They then alternate between those two steps: alignment and data partitioning.
The results of the ordering of the agents are shown in figure 2.11. On a), we can see the player position before ordering and on c), after the players are ordered according to a single learnt template.
Instead of defining any role beforehand, they are learnt unsupervised as latent variables for the same reasons as [13]. In the following sections, the "role" of a player refers to its position in the data tensor after ordering.
This approach is used in [10] and has helped significantly improve their
results by reducing the permutation disorder inherent to their dataset. Indeed,
Felsen et al. argue that without ordering, two similar trajectories could have
very different representations, which hampered the training. This is further
detailed and adapted to our problem in subsection 3.1.2.
Methods
In this chapter, we explain our approach, how we built the dataset and how we adapted the role-alignment technique from [11] to our problem. After that, we describe the architectures of our deep neural networks and we discuss why we expect them to efficiently solve our problem. Finally, we give details about the implementation of the models and the training process.
3.1 Data preprocessing
The objective is to fill in missing data in a player’s trajectory. In this section, we explain how the dataset was prepared to be fed into the models.
3.1.1 Data preparation
Our data is provided by ChyronHego and consists of tracking data for the ball, the players, the goalies and the referees. The data is originally extracted into csv files containing 6 columns for the ball (the frame number, the x, y and z coordinates, a binary flag indicating if the ball is in play or not and a flag indicating which team has possession of the ball) and 3 columns for the other agents (the frame number and the x and y coordinates). The goalies, the referees and the z coordinate of the ball are not used in the experiments but they are still included in the dataset.
The dataset is prepared as follows:
• Since some arenas have different pitch sizes, we scale them to a pitch size of 105m × 68m.
• We reflect half of the data for each match so the home team is always on the left side of the pitch.
• We divide the data of each match into 12-second sequences of 25Hz data frames where the ball has to be in play for the full 12 seconds.
Hence, each of our data points is a 2x26x300 tensor representing a 12-second (300-frame) sequence of the 2D coordinates of the 26 agents, namely the ball, the players, the goalies and the referees. Such a tensor is shown in figure 3.1. The ordering of the agents is consistent within the dataset and is detailed in table 3.1. However, the outfield players do not have any particular order at this stage.
• When loaded, the data is scaled between 0 and 1. It is also downsampled to 5Hz, decreasing the size of the tensor to 2x26x60. This considerably reduces the number of trainable parameters of the network.
Row(s)      Agent
1           Ball
2           Home goalie
3 to 12     Home outfield players (randomly ordered)
13          Away goalie
14 to 23    Away outfield players (randomly ordered)
24          Referee
25          Line referee (positive y)
26          Line referee (negative y)
Table 3.1: Agent ordering in the dataset
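As an illustration only, the scaling and downsampling steps of the data preparation could be sketched as follows. The function name `prepare_sequence` is hypothetical, and two points are assumptions that the text does not specify: coordinates are taken to be in metres with the origin at a corner of the 105m × 68m pitch, and downsampling is done by keeping every fifth frame.

```python
import numpy as np

PITCH_LENGTH_M = 105.0  # reference pitch length used in the text
PITCH_WIDTH_M = 68.0    # reference pitch width used in the text


def prepare_sequence(seq, source_hz=25, target_hz=5):
    """Scale a (2, n_agents, n_frames) sequence by the pitch size and downsample it.

    Assumption: coordinates are in metres, origin at a pitch corner.
    Assumption: downsampling keeps one frame out of every source_hz // target_hz.
    """
    seq = np.asarray(seq, dtype=np.float32)
    scale = np.array([PITCH_LENGTH_M, PITCH_WIDTH_M], dtype=np.float32)
    scaled = seq / scale[:, None, None]   # x divided by 105, y divided by 68
    step = source_hz // target_hz         # 25Hz -> 5Hz keeps every 5th frame
    return scaled[:, :, ::step]
```

Under these assumptions, a 2x26x300 tensor becomes a 2x26x60 tensor, which matches the sizes given above.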
The goal is to predict missing frames in the data. In order to train the network to do so, we mask a given number of frames (from a third to two thirds of the sequence) of the data of a random outfield player from the home team. The mask corresponds to setting those frames to zero. The frames are always situated in the middle of the sequence, as we can assume without loss of generality that we can always find a sequence where the missing frames are centered. The sequence then contains 60 frames for each coordinate and we predict 20, 30 or 40 frames in the middle of it. An example of a tensor with missing frames is shown in figure 3.1.
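The masking step described above can be sketched as follows. This is a minimal example and not the implementation used in the thesis; the helper name `mask_middle_frames` is hypothetical, and the row indices assume the agent ordering of table 3.1 converted to 0-based indexing.

```python
import numpy as np


def mask_middle_frames(seq, player_row, n_masked):
    """Return a copy of seq with the central n_masked frames of one agent set to zero.

    seq: array of shape (2, n_agents, n_frames)
    player_row: 0-based row index of the corrupted agent
    """
    n_frames = seq.shape[-1]
    start = (n_frames - n_masked) // 2
    stop = start + n_masked
    masked = seq.copy()
    masked[:, player_row, start:stop] = 0.0
    return masked, slice(start, stop)


rng = np.random.default_rng(0)
sequence = rng.random((2, 26, 60))
# Rows 3 to 12 of table 3.1 (1-based) are the home outfield players: indices 2..11.
player = int(rng.integers(2, 12))
corrupted, gap = mask_middle_frames(sequence, player, n_masked=20)
```

With 60 frames and 20 masked frames, the gap covers frames 20 to 39 and the 20 frames on either side remain known.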
The dataset is split into a training set of 188766 samples, a validation set of 67978 samples and a test set of 57348 samples. The training set is built from games from seasons 2015-2016, 2016-2017 and 2017-2019 of the Bundesliga whereas the validation and test sets are built from games from season 2018-2019, first and second division, respectively. Hence, the three sets contain samples from different games.
Figure 3.1: Schema of a tensor of the dataset
3.1.2 Role alignment
As mentioned in [10], the randomness in the positions of the players in a data tensor that we input to the network might disturb the learning of the weights of the fully connected network. Hence, we implement a role alignment technique developed in [11] to correct this issue, with the goal of making the network easier to train. This solution is based on learning a set of roles of the outfield players, called a template, and then assigning a role to each player so as to minimize the sum of the distances between the players and their roles.
For each player, we use the average of the trajectory over time to facilitate the ordering. So, in this case, the roles are defined by 2D positions on the pitch and a data point is a tensor of the coordinates of the average of the trajectory of each outfield player. For each team, we have 10 outfield players, so a template is a 2 × 10 matrix. An example of such a template for the home team is shown in figure 3.2. Both teams are treated separately.
The algorithm proposed in [11] allows us to learn different compositions by alternately learning the templates and clustering the data. The dataset is split into several clusters that each have an associated learnt template, so each tensor from a cluster is aligned to its associated template.
Figure 3.2: Example of a learnt template for the home team
Learning a template
In order to learn the templates used to order the players consistently in each sample, we use the following algorithm:
First, we initialize the templates by computing the average over all the data points. Then, we order each data point, one by one, to minimize the sum of distances between the template's roles and the outfield players' averaged positions. This ordering problem is solved using the Hungarian algorithm, an optimisation procedure introduced by Kuhn in [14] and used to solve an assignment problem where "jobs" have to be assigned to "workers" in order to minimize the total cost. In our case, the cost function is the Euclidean distance between the agent from the data point (the "worker") and the role from the template (the "job"). Once each point has been re-ordered according to that template, we re-calculate the template by computing the average over all the re-ordered data points of the dataset. We repeatedly perform these two steps until the difference between two successive templates stops decreasing or reaches a minimum.
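The two alternating steps above can be sketched as follows, using SciPy's `linear_sum_assignment` to solve the assignment problem. This is a minimal sketch under stated assumptions rather than the thesis implementation: the function names are hypothetical, a template is stored as an (n_roles, 2) array instead of the 2 × 10 layout used in the text, and the stopping rule is simplified to a tolerance on the change between successive templates.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_to_template(points, template):
    """Reorder points (n_roles, 2) so that row i holds the player assigned to role i."""
    # cost[i, j] = Euclidean distance between role i and player j
    cost = np.linalg.norm(template[:, None, :] - points[None, :, :], axis=-1)
    _, player_for_role = linear_sum_assignment(cost)
    return points[player_for_role]


def learn_template(data, max_iter=50, tol=1e-6):
    """data: (n_samples, n_roles, 2) time-averaged positions of one team's outfield players."""
    template = data.mean(axis=0)          # initial template: plain average
    for _ in range(max_iter):
        # Step 1: re-order every data point against the current template.
        data = np.stack([align_to_template(sample, template) for sample in data])
        # Step 2: re-compute the template from the re-ordered data points.
        new_template = data.mean(axis=0)
        converged = np.linalg.norm(new_template - template) < tol
        template = new_template
        if converged:                     # simplified stopping rule (assumption)
            break
    return template, data
```

The function returns both the learnt template and the re-ordered data points, so the same routine can be reused for each cluster.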
Tree based role alignment
As data samples can be quite different, if we were to learn a single template for the entire dataset, it would be an average of all the data points and would not be representative of the different states of the players' behaviours, as mentioned before. That is why we split the dataset into several clusters, each of which is represented by a different template. This clustering of the dataset corresponds to a layer of the tree we are building. In order to split the dataset, we use K-Means clustering, based on the average over time of each agent's trajectory, and then learn a template for each cluster. The templates are learnt using the procedure detailed in the previous paragraph. Once this is done, we order the data points according to the template of the cluster they belong to.
The templates after clustering are aligned to the parent's template to enforce a consistent ordering throughout the tree. After that, each cluster is split again into several clusters using the K-Means algorithm and new templates are learnt for each cluster, until we reach a certain depth in the tree or a minimum number of examples in a cluster. This process allows us to learn several templates, since we argue that the whole dataset cannot be represented by a single template.
The templates at the leaves of this tree are the ones that we use to order the data samples from the dataset during training. We load each data point, assign it to a cluster at every layer of the tree and order it using the template at the corresponding leaf of the tree.
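The lookup of the leaf template for a data point could be sketched as follows. The `Node` structure and the `find_leaf` function are hypothetical names introduced for illustration. The sketch assumes that each internal node stores the K-Means centres of its children and that the nearest centre is chosen at every layer; whether a data point is re-aligned to intermediate templates before each lookup is not specified in the text and is omitted here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass(eq=False)
class Node:
    template: np.ndarray                     # (n_roles, 2) template learnt for this cluster
    centroids: Optional[np.ndarray] = None   # (k, n_roles * 2) K-Means centres of the children
    children: List["Node"] = field(default_factory=list)


def find_leaf(root, sample):
    """Descend the tree, picking the child with the nearest centroid at every layer.

    sample: (n_roles, 2) time-averaged positions of one team's outfield players
    """
    node = root
    features = sample.reshape(-1)
    while node.children:
        distances = np.linalg.norm(node.centroids - features, axis=1)
        node = node.children[int(np.argmin(distances))]
    return node
```

The template of the returned leaf is then the one used to order the players of that data point.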
One iteration of this algorithm is shown in figure 3.3.
Figure 3.3: Schema of an iteration of the tree based learning of the templates
Our solution
Since we have information about ball possession during each game, and because the tree approach can still lead to quite "averaged" templates, where most players are situated at the middle of the field, our approach differs a bit from [11]. Using the ball possession information, we first divide the data into three sets.
Our cases are:
• during most of the sequence the home team has possession,
• during most of the sequence the away team has possession,
• both of them have possession.
Additionally, we extract the cases where, most of the time, the ball is on one side of the field, as we assume that one team is attacking while the other is defending.
Because this is very restrictive, we allow some flexibility. We introduce a parameter that sets the amount of flexibility allowed in each sequence. For instance, with a flexibility of 70%, we consider that the home team has possession if it has the ball for at least 70% of the sequence. Additionally, if the ball is on the right side of the field during at least 70% of the sequence, we consider that the home team is attacking.
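The flexibility rule can be illustrated with the following sketch. The function name `classify_sequence` and the per-frame possession encoding ("H" for home, "A" for away) are hypothetical. The sketch also assumes that the x coordinate is scaled to [0, 1] with the home team on the left, and that the away team is considered attacking when the ball is on the left side, by symmetry with the rule stated for the home team.

```python
import numpy as np


def classify_sequence(possession, ball_x, threshold=0.7):
    """Return (owner, attacking) for one sequence.

    possession: per-frame flags, "H" or "A" (hypothetical encoding)
    ball_x: per-frame ball x coordinate scaled to [0, 1], home team on the left
    threshold: flexibility parameter, e.g. 0.7 for 70%
    """
    possession = np.asarray(possession)
    ball_x = np.asarray(ball_x, dtype=float)

    if np.mean(possession == "H") >= threshold:
        owner = "home"
    elif np.mean(possession == "A") >= threshold:
        owner = "away"
    else:
        owner = "both"

    if owner == "home":
        attacking = np.mean(ball_x > 0.5) >= threshold   # ball mostly on the right side
    elif owner == "away":
        attacking = np.mean(ball_x < 0.5) >= threshold   # symmetric rule (assumption)
    else:
        attacking = False
    return owner, bool(attacking)
```

For example, a sequence where the home team has the ball for 75% of the frames and the ball stays on the right side is classified as the home team attacking.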
We split the dataset according to which team has possession and, when a team has possession, according to whether that team is attacking or not. In the attacking cases, we learn the templates directly, one for each team. In the rest of the cases, we use a tree per team.
Once all the templates for a team are learnt, the templates are also ordered between them so the order between the templates is also kept consistent. This is summarized in figure 3.4.
When loading the data for training, we evaluate whether the sequence corresponds to a team attacking considering their positions and possession of the ball, within the flexibility defined. If this is the case, we use the average templates. If not, we use the trees corresponding to each team. For each team, we find the cluster that this sequence belongs to in the appropriate tree and use the template associated with this cluster to order the players in this data sample.
Each team is always aligned separately.
3.2 Architectures
Different types of architectures, based on the AE, VAE and CVAE, are investigated in this thesis. They are detailed and justified in the following sections.
3.2.1 Model 1: a variational autoencoder
We try to fill the gaps in the sequence of a player from the home team by using a VAE and encoding a certain number of agents alongside this player. The architecture is shown in figure 3.5.
The objective was for the network to capture the dependencies between the trajectories. As a player plays with his team and against his opponents, he can pass the ball to his teammates, block his opponents' progression when he is defending and avoid them when he is attacking. The ball has a specific behaviour. It cannot move by itself, as opposed to the players, and its behaviour is entirely dictated by the players. The network should be able to capture this dependence on the players' passes and movements, and the ball motion contains a lot of information as the players tend to run towards the ball, most of the time. Finally, if the ball's trajectory changes suddenly, we can deduce that a player kicked it out of its previous trajectory, which the network should also learn. This kind of information is particularly useful when it comes to predicting missing frames. All these considerations correspond to the "context" and its consequences on the corrupted player's behaviour are investigated throughout this thesis.
Figure 3.4: Schema of our data split to learn the templates. The templates are learnt separately for each team.
We experiment on which agents to encode alongside the corrupted player.
We also implement role alignment and build templates to order the players
from each team.
Figure 3.5: Schema of our VAE architecture
3.2.2 Model 2: adapted from the variational autoencoder
We adapt the VAE architecture in order to reconstruct only the player that has missing frames. We still encode a certain number of agents in order for the encoder to capture the dependencies inherent to the dataset and extract the relevant features within the sequences. However, since we are interested in filling the missing frames only, we decode only this given player and adapt the architecture accordingly, as shown in figure 3.6. The objective here is to decrease the size of the network to make it easier to train while maintaining good performance by focusing on the corrupted player.
Figure 3.6: Schema of our adapted VAE architecture
3.2.3 Model 3: a corrective architecture
This third architecture is created from our observation that, when predicting less than half the sequence, encoding only the corrupted player often yields better results. After experimenting on our first two models, we concluded that the context was not clearly beneficial for the predictions.
Hence, we try to build a different structure that can reveal the context effect on a single prediction. We train a VAE (model 1) without using any context, only the single player’s path, and fill the missing frames from his trajectory.
We have a first plausible way of filling the gaps in the data that we want to
improve so we train a second network to correct this first estimation and we
freeze the first network. The output from model 1 is given as an input to this second network, alongside the surroundings of this player, i.e. the context.
The output from this second network is then the correction of the prediction without context and is added to this first estimation to give a new prediction that takes into account the context. This principle is shown in figure 3.7.
Figure 3.7: Schema of our corrective architecture. On the top left, we have a pre-trained VAE that has frozen weights and gives a first estimation of the reconstruction. This output is concatenated with other agents' trajectories and given as input to a second VAE that we train to correct the first estimation. The correction is summed with the first estimation to give the final prediction.
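The data flow of this corrective architecture can be illustrated with the following sketch. It only shows how the two networks are composed: `first_model` and `correction_model` are hypothetical callables standing in for the frozen VAE and the trained corrective VAE, and the tensor shapes are assumptions chosen for the example.

```python
import numpy as np


def corrective_prediction(first_model, correction_model, player_seq, context_seq):
    """Combine a context-free estimate with a context-aware correction.

    player_seq: (2, n_frames) masked trajectory of the corrupted player
    context_seq: (2, n_context_agents, n_frames) trajectories of the other agents
    """
    # First estimation from the frozen network, which sees only the player.
    first_estimate = first_model(player_seq)
    # The first estimation is concatenated with the context along the agent axis.
    second_input = np.concatenate([first_estimate[:, None, :], context_seq], axis=1)
    # The second network outputs a correction of shape (2, n_frames).
    correction = correction_model(second_input)
    # Final prediction: first estimation plus correction.
    return first_estimate + correction
```

Because the first network is frozen, only the correction depends on the context, which makes the effect of the context on a single prediction directly visible as the difference between the two outputs.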
Since, using this architecture, we can make a first prediction of the player trajectory, we use it in order to estimate the closest players from the opposite team. We try to correct this first estimation using only the closest players from the opposite team instead of the entire opposite team.
3.2.4 Model 4: a conditional variational autoencoder
We want to implement a VAE that can generate the missing frames directly from the latent distribution while taking into account the context that drives the corrupted player behaviour. We choose to use the CVAE architecture. This network is trained end-to-end.
As mentioned before, our experiments showed that the most important context information seemed to be the non-corrupted part of the corrupted player sequence. However, our objective is to investigate how the players from the other team and the ball affect this generation. So, we add as context the other agents' trajectories but since we want to keep an hourglass architecture for our network, this information is encoded before being given as conditions to the decoder. Moreover, since we implement role alignment, we also give as a condition the learnt "role" or "identity" of the player, which corresponds to his position in the data tensor after being aligned. Lastly, to help generate coherent predictions, we also input into the decoder known frames of the corrupted player sequence. The amount of known frames (or "non-corrupted frames") used varies in our experiments. This architecture is shown in figure 3.8.
Figure 3.8: Schema of our CVAE
During the testing phase, we sample from a standard Gaussian distribution and our decoder produces a plausible filling, given the known frames of the corrupted player sequence and the encoded context. We also adapt the number of predicted frames and the number of non-corrupted frames to study the context's influence. Indeed, the more frames the network has to predict, and the fewer known frames we give, the more the network has to rely on the context, i.e. the other agents' trajectories.
3.3 Implementation
Our objective is to fill a given number of frames in the middle of the trajectory of a player of the home team. For our experiments, we applied a "mask" on top of the data, meaning that these frames are set to zero.
The loss function that is used is adapted from the objective function (equation 2.1.6) derived in subsection 2.1.2. The objective function corresponds to the sum of the expected reconstruction error and a regularization term, the K-L divergence, that measures the distance between the learnt posterior $q_\phi(z|x)$, that is the latent distribution, and the prior $p_\theta(z)$. Maximizing this objective corresponds to minimizing its opposite, our loss function.
We choose to use the Mean Squared Error (MSE) as reconstruction loss and the prior to be a standard Gaussian distribution $\mathcal{N}(0, 1)$. Since the learnt posterior distribution $q_\phi(z|x)$ is also considered to be Gaussian, we can derive the expression of the K-L divergence. This is mentioned in [4] and detailed in Appendix A.
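As a hedged illustration, such a loss could be computed as in the sketch below. It assumes that the encoder outputs the mean `mu` and the log-variance `log_var` of a diagonal Gaussian posterior, which is a common parameterisation but is not stated in the text, and it uses the standard closed form of the K-L divergence between a diagonal Gaussian and a standard Gaussian. The reconstruction term is written as a sum of squared errors; an averaged version would only differ by a constant factor. The function name `vae_loss` is hypothetical.

```python
import numpy as np


def vae_loss(x, x_rec, mu, log_var):
    """Reconstruction error plus K-L divergence to a standard Gaussian prior.

    x, x_rec: arrays of the same shape (target and reconstruction)
    mu, log_var: (latent_dim,) mean and log-variance of the posterior (assumed parameterisation)
    """
    reconstruction = np.sum((x - x_rec) ** 2)
    # Closed form of KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return reconstruction + kl
```

When the posterior equals the prior (zero mean and unit variance), the K-L term vanishes and the loss reduces to the reconstruction error.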
Let us denote $x_r$ the reconstruction of $x$ with the gaps filled and $L$ the loss function. We have:
$L = \mathrm{MSE} + D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z))$ (3.3.1)
$= \|x - x_r\|^2 - \frac{1}{2} \sum_{n}$