Estimation of player trajectories from context in football games using autoencoders

MANON DEPRETTE

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science

Date: June 26, 2020

Supervisors: Atsuto MAKI & Gareth LOY

Examiner: Joakim GUSTAFSON

School of Electrical Engineering and Computer Science

Host company: ChyronHego AB

Swedish title: Uppskattning av fotbollsspelares positioner över tid med hjälp av kontext och autoencoders

Abstract

ChyronHego tracks a large number of football games. Occasionally there are errors in the tracked positions of the ball or the players. This thesis aims to investigate to what extent vanilla autoencoders, variational autoencoders and conditional variational autoencoders can recognise patterns in the data and thus be used to predict missing data for individual agents’ trajectories in an adversarial multi-agent situation such as a football game. Furthermore, we implement a multi-agent role alignment technique to order the outfield players in the dataset and use their identities, learnt in an unsupervised manner, in the predictions. We find that in cases where the model cannot sufficiently rely on the individual agent’s trajectory information, it efficiently uses the context, i.e. the other agents’ behaviour, to make more accurate predictions of the missing data. However, the identities of the players do not seem to improve the predictions of the models.

Sammanfattning (Summary)

ChyronHego collects data from a large number of football matches. This data contains the positions of the ball and the players over time, but sometimes correct positions are missing for a player, and there is then a need to be able to predict them. The goal of this work is to investigate how an autoencoder, a variational autoencoder and a conditional variational autoencoder can be used to recognise patterns in the data. Furthermore, the work investigates whether these models can predict correct positions for the players during a football match. In addition to the models mentioned above, we also implement a role alignment technique that uses unsupervised learning of the players’ relative ordering, which could yield improved predictions. The results show that when correct data is missing for a player’s position, the model can use the movement patterns of the other players to predict the correct position. However, we find no support for the claim that the features learnt by the role alignment technique improve the model’s predictions.

Acknowledgements

Firstly, I would like to thank Gareth Loy & Lars Bretzner for their guidance and valuable advice throughout this project, and the opportunity to work with them in ChyronHego. I would also like to thank all the team for the pleasant semester I spent there.

Further thanks to my supervisor Atsuto Maki for his support and mentoring to help me carry out this project.

Contents

1 Introduction
1.1 Motivation
1.2 Research Question
1.3 Outline of the thesis
2 Background
2.1 Models
2.1.1 Autoencoders
2.1.2 Variational autoencoders
2.1.3 Conditional variational autoencoders
2.2 Training a deep neural network
2.2.1 Adam optimizer
2.2.2 Batch Normalization
2.3 Related work
2.3.1 Forecasting multi-agent motion paths
2.3.2 Solving the permutational alignment problem
3 Methods
3.1 Data preprocessing
3.1.1 Data preparation
3.1.2 Role alignment
3.2 Architectures
3.2.1 Model 1: a variational autoencoder
3.2.2 Model 2: adapted from the variational autoencoder
3.2.3 Model 3: a corrective architecture
3.2.4 Model 4: a conditional variational autoencoder
3.3 Implementation
4 Results
4.1 Role alignment
4.2 Model 1: a variational autoencoder
4.2.1 Experiments with a single player
4.2.2 Experiments on the loss function
4.2.3 Experiments on the role alignment
4.2.4 Experiments on the encoded agents
4.2.5 Experiments on the number of predicted frames
4.2.6 Experiments to improve the training
4.3 Model 2: adapted from the variational autoencoder
4.4 Model 3: a corrective network
4.5 Model 4: a conditional variational autoencoder
4.5.1 Experiments on the role alignment
5 Discussion and future work
5.1 Results
5.2 Mask
5.3 Connections
5.4 Architectures
5.5 Future work
6 Conclusions
Bibliography
A Derivation of the formula for the K-L divergence between the prior and the approximated posterior in the Gaussian case
B Networks architectures

List of Figures

2.1 Schema of the autoencoder architecture. Image downloaded from https://www.jeremyjordan.me/autoencoders/ in January 2020.
2.2 Influence of the size of the latent space on the reconstruction. Image downloaded from https://github.com/ChengBinJin/VAE-Tensorflow/ in October 2019.
2.3 Schema of the variational autoencoder architecture. Image downloaded from https://www.jeremyjordan.me/variational-autoencoders/ in January 2020.
2.4 The reparametrization trick. Image downloaded from https://www.jeremyjordan.me/variational-autoencoders/ in January 2020.
2.5 Illustration of the latent space. Image downloaded from https://jmetzen.github.io/2015-11-27/vae.html in October 2019.
2.6 Schema of the CVAE architecture
2.7 Generation of a digit from a random sample from the latent distribution of a VAE. Code from github: https://github.com/graviraja/pytorch-sample-codes accessed in January 2020.
2.8 Generation with a CVAE with concatenated one-hot encoding to generate the digit 9. Code from github: https://github.com/graviraja/pytorch-sample-codes accessed in January 2020.
2.9 Architecture of the network from [10]
2.10 DESIRE architecture from [12]
2.11 Example of a template from [11]
3.1 Schema of a tensor of the dataset
3.2 Example of a learnt template for the home team
3.3 Schema of an iteration of the tree based learning of the templates
3.4 Schema of our data split to learn the templates.
3.5 Schema of our VAE architecture
3.6 Schema of our adapted VAE architecture
3.7 Schema of our corrective architecture
3.8 Schema of our CVAE
4.1 Templates for each team when the home team is attacking
4.2 Templates for each team when the away team is attacking
4.3 Templates at the leaves of the tree for the home team
4.4 Templates at the leaves of the tree for the away team
4.5 Average position for each role after player ordering
4.6 Examples of reconstruction without context using model 1
4.7 Prediction of 1/3 of a sequence using model 1, the ball is kicked by the corrupted player where the data is missing
4.8 Prediction of 1/3 of the sequence using model 1, the ball is in possession of the corrupted player for most of the sequence
4.9 Prediction of 2/3 of the sequence using model 1, the ball is in possession of the corrupted player where the data is missing
4.10 Correction of the prediction using the ball (20 predicted frames).
4.11 Correction of the prediction using the ball where the correction fails to improve the first estimation of the prediction (20 predicted frames)
4.12 Schema of a sequence to be filled
4.13 Prediction of 1/2 of the sequence using model 4
5.1 Evolution of the MSE on the predicted frames for different encoded agents and the single player using a VAE
5.2 Evolution of the MSE on the predicted frames for different encoded agents and the single player using a CVAE

List of Tables

3.1 Agent ordering in the dataset
4.1 Predictions errors for a single player
4.2 Comparison between training the network to reconstruct the entire output or only the predicted frames of the corrupted player
4.3 Performance of model 1 with and without using the ordering of the players
4.4 Performance of model 1 depending on the encoded agents
4.5 Performance of model 1 with different encoded agents and given a number of predicted frames
4.6 L1 norm of the weights of the mean layer of the network when only the ball is encoded as context
4.7 Experiments on the K-L divergence regularization
4.8 Performance of model 1 with Batch Normalization (BN), with L2 weight decay and without any regularization
4.9 Performances of model 1 and 2 for 20 predicted frames
4.10 Performances of model 1 and 2 for 40 predicted frames
4.11 Performance of model 3 for 20 and 40 predicted frames first for a single player and then depending on the agents provided to the corrective network
4.12 Results for model 4 with varying known frames and encoded agents
4.13 Comparison of performances with and without using the ordering of the players when generating 40 frames and using 20 known frames.
5.1 Average predictions errors for 20 frames filled with each model depending on the context provided
5.2 Average predictions errors for 40 frames filled with each model depending on the context provided
B.1 Model 1 architectures depending on the encoded agents
B.2 Model 2 architectures depending on the encoded agents
B.3 Comparison of the total number of trainable parameters in model 1, 2 and 3
B.4 Model 4 architectures depending on the encoded agents for 20 predicted frames
B.5 Model 4 architectures depending on the encoded agents for 30 and 40 predicted frames
B.6 Comparison of the total number of trainable parameters in each network based on model 4

Acronyms

AE AutoEncoder.

CVAE Conditional Variational AutoEncoder.

ELBO Evidence Lower Bound.

EM Expectation–Maximization.

K-L divergence Kullback-Leibler divergence.

MSE Mean Squared Error.

ReLU Rectified Linear Unit.

RNN Recurrent Neural Network.

ULM Uniform Linear Motion.

VAE Variational AutoEncoder.

1 Introduction

Team sports like basketball and football involve multi-agent interactions characterized by sequences of events that usually reflect strategies. In crowded situations, it can be difficult for a computer vision system to consistently track every agent’s trajectory, and the system can fail to correctly retrieve all the agents’ paths. In this thesis, we focus on football games, where the agents are the outfield players, the goalies, the ball and the referees. Our dataset contains tracking data for these agents from hundreds of professional football games.

We investigate how deep learning models can be used to facilitate the estimation of the position of an agent once the camera has lost sight of it or when its tracking data is inaccurate or noisy. This specific agent will be referred to as the corrupted agent. More specifically, we study architectures based on the AutoEncoder (AE), such as the Variational AutoEncoder (VAE) and the Conditional Variational AutoEncoder (CVAE). The VAE is a generative model that learns a stochastic latent distribution to produce new examples similar to the training data. The objective of this model is to recognize patterns in the data and use them to generate plausible trajectories in place of the missing data.

Derived from the VAE, the CVAE also takes into account conditions that the generated sample has to match. In our case, these are the other agents’ behaviour and interactions with the corrupted agent, given the adversarial nature of the game. These conditions represent an aspect of the "context" that we want to use to make the best possible predictions. The players’ roles are another aspect of this context. We learn them in an unsupervised manner and use them to order the data.

1.1 Motivation

ChyronHego tracks a large number of football games. Occasionally, there are errors in the tracked positions of the ball or players. This project aims to facilitate the estimation of the position of a player once the camera has lost sight of them, or to infer missing data points. Hence, the results achieved in this thesis could be used to increase the robustness of the tracked data.

We focus on football games, but this work can be generalized to other sports involving interacting agents, such as basketball, and by extension to any multi-agent situation.

The project’s innovation lies in the conditional approach of filling in the missing data using context.

Moreover, exploring how our models can recognise patterns in the data and thus enable prediction of missing data will provide us with insight into possible further extensions of this project. Indeed, this work can also lead to further questions, such as the prediction of future trajectories given context and the detection of deviations of the trajectories from the expected pattern.

1.2 Research Question

The hypothesis being tested is that an AE or one of its variants can capture patterns in the data and the interdependence between the agents’ motions. We also explore the extent to which this can be used to fill in missing points in the data.

The research questions this project is studying are the following:

• Can the manifold of football player/team trajectories be effectively modelled by the latent space of an AE or its variants: VAE and CVAE?

• Using such a manifold, can we fill in missing data in tracking sequences?

• What aspects of the context can be used to improve the prediction of the missing data?

1.3 Outline of the thesis

This thesis is structured as follows:

• In chapter 2, we first introduce relevant theoretical background, including a presentation of the architectures our models are based on, namely AE, VAE, and CVAE, and techniques for training deep neural networks efficiently. After that, related literature is reviewed.

• In chapter 3, we describe our approach to answer the research questions. We explain how we prepared the data and used a role-alignment procedure to learn orderings of the outfield players. We also define the models we train, how we train them and how we evaluate them both quantitatively and qualitatively.

• In chapter 4, we present our results and we explain how the implementation and the models have evolved towards the solution by analyzing the outcomes of our experiments.

• In chapter 5, we discuss the results, their validity considering the approach we have taken and future work on the subject.

• In chapter 6, we summarize our work and draw our conclusions.

2 Background

This chapter provides explanations regarding relevant theoretical background and describes several approaches related to this thesis. We start by presenting the AE, VAE and CVAE architectures theoretically and practically, then we introduce two important tools in the training of deep neural networks. Finally, we review literature regarding the prediction of multi-agent motion paths and role-alignment implementations.

2.1 Models

The architectures investigated throughout this project are the AutoEncoder, the Variational AutoEncoder and the Conditional Variational AutoEncoder.

In this section, we give an introduction to these types of networks.

2.1.1 Autoencoders

Autoencoders were introduced in 1986 by Rumelhart et al. [1] as an unsupervised learning algorithm. The following year, Ballard [2] also proposed them as a method for unsupervised pre-training of Artificial Neural Networks. Since then, these auto-associative configurations of neural networks have been increasingly popular, be it for the purpose of finding compressed representations of the data, feature learning or, more recently, as generative models.

The objective of an AE is to find a smaller representation of the data by learning its underlying structure. In order to do that, an encoder is used to compress the data into a space of smaller dimension, and a decoder is used to reconstruct the original input from this "latent" representation. In order to accurately reconstruct the input, the network has to extract the relevant features of the data when encoding it. This structure is shown in figure 2.1. It can use convolutional or fully connected layers.

Figure 2.1: Schema of the autoencoder architecture

With such an architecture, the size of the latent space has an important role when it comes to what accuracy the reconstruction can reach since all information to be decoded is compressed into these variables. We can observe in figure 2.2 that a small latent space leads to fuzzy reconstructions, whereas the information is much sharper when we increase its size. However, as the objective of such a network is to reduce the dimensionality of the input data, the accuracy of the reconstruction has to be balanced with the size of the latent space.

In our case, we want the network to be able to correctly capture the specificities of our data so we tend to add depth to our network. However, a very small latent space leads to poor predictions since we are losing information during the compression.

Figure 2.2: Influence of the size of the latent space on the reconstruction. In this example, the network is trained on the MNIST dataset [3]. On the top left images, there are only two nodes in the latent space of the network which makes the reconstruction very blurry and sometimes inaccurate. The more the size of the latent space is increased, the sharper the reconstructed numbers are.

2.1.2 Variational autoencoders

The principle

The Variational AutoEncoder (VAE) is a variant of the AE introduced by Kingma et al. [4] that assumes that an underlying distribution $p(x)$ can describe the training data $X = \{x\}$. Then, instead of mapping an input to a fixed vector in the latent space, it is mapped to a probability distribution. The latent space of the AE is now replaced by the mean and standard deviation of this distribution. Next, a sample from the distribution is fed into the decoder to reconstruct the input. This principle is shown in figure 2.3. Once the network is trained, one can sample directly from the latent distribution and decode this sample to generate a new data point that is similar to the data the model is trained with. In his tutorial, Doersch [5] gives further explanations about VAEs.

Figure 2.3: Schema of the variational autoencoder architecture

From a probabilistic perspective, the goal is to learn a model $p_\theta$ parametrized by a vector $\theta$ which one can sample from in order to generate new examples similar to those of the training data $X$. Theoretically, one can sample latent variables $z$ from a prior distribution $p_\theta(z)$, generate $x$ from a conditional distribution $p_\theta(x|z)$ and then maximize the marginal likelihood $p_\theta(x)$ defined by:

$$p_\theta(x) = \int p_\theta(x|z)\, p_\theta(z)\, dz \tag{2.1.1}$$

This is the decoder part of the VAE, also called the generative network. However, most $z$ have almost no contribution to $p_\theta(x)$. Hence these samples $z$ are not likely to have produced $x$. For the model to be representative of our dataset, it is important to make sure that a sample $z$ is generating something similar to the training data $x$. However, the true posterior distribution $p_\theta(z|x)$ is intractable so it is approximated using a model $q_\phi(z|x)$ parametrized by a vector $\phi$. So, for a given sample $x$, the learnt distribution $q_\phi$ is trained to give high probability to the samples $z$ from which $x$ could have been generated. This model is the encoder part of the VAE, also called the inference model.

Hence, instead of maximizing $p_\theta(x) = \int p_\theta(x|z)\, p_\theta(z)\, dz = \mathbb{E}_{z \sim p_\theta(z)}\left[p_\theta(x|z)\right]$, the quantity $\mathbb{E}_{z \sim q_\phi(z|x)}\left[p_\theta(x|z)\right]$ is maximized.

The objective function

We now derive the objective function of the VAE. We start by introducing a measure of the "distance" or "dissimilarity" between two distributions, called the Kullback-Leibler divergence (K-L divergence). It is defined by:

$$D_{KL}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\, dx \tag{2.1.2}$$

It can be interpreted as the expectation, with respect to the distribution $p$, of the difference between the logarithms of $p$ and $q$. In other words, it measures how $q$ differs from the reference distribution $p$.
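To make the definition concrete, the sketch below evaluates a discrete analogue of equation 2.1.2, in which the integral is replaced by a sum over a finite set of outcomes. The function name `kl_divergence` and the two example distributions are illustrative assumptions and do not come from the thesis.

```python
import math


def kl_divergence(p, q):
    """Discrete K-L divergence D_KL(p || q) = sum_i p_i * log(p_i / q_i).

    Assumes p and q are probability vectors of equal length and that
    q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two illustrative distributions over three outcomes (hypothetical values).
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

kl_pq = kl_divergence(p, q)  # divergence of q from the reference p
kl_qp = kl_divergence(q, p)  # swapping the roles gives a different value
kl_pp = kl_divergence(p, p)  # zero when both distributions coincide

print(f"D_KL(p||q) = {kl_pq:.4f}, D_KL(q||p) = {kl_qp:.4f}, D_KL(p||p) = {kl_pp:.4f}")
```

The two directions give different values, which is why the K-L divergence is not a distance in the strict sense and why the choice of reference distribution matters.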

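Using the definition of conditional probability $p(z|x) = \frac{p(z, x)}{p(x)}$ and of the K-L divergence (equation 2.1.2), we can derive successively:

$$D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) = \mathbb{E}_{q_\phi}[\log q_\phi(z|x)] - \mathbb{E}_{q_\phi}[\log p_\theta(x, z)] + \log p_\theta(x) \tag{2.1.3}$$

$$\log p_\theta(x) = D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) + \mathcal{L}(\theta, \phi; x) \tag{2.1.4}$$

Since $D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) \geq 0$, we have:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi}[\log p_\theta(x, z)] - \mathbb{E}_{q_\phi}[\log q_\phi(z|x)] = \mathcal{L}(\theta, \phi; x) \tag{2.1.5}$$

For this reason, $\mathcal{L}(\theta, \phi; x)$ is called the variational lower bound on the marginal log-likelihood of the datapoint $x$. Since $\log p_\theta(x)$ does not depend on $\phi$, equation 2.1.4 shows that increasing $\mathcal{L}(\theta, \phi; x)$ with respect to $\phi$ decreases the K-L divergence by the same amount. Hence, in order to minimize the difference between the true posterior distribution $p_\theta(z|x)$ and the approximated posterior $q_\phi(z|x)$, the Evidence Lower BOund (ELBO) $\mathcal{L}(\theta, \phi; x)$ has to be maximized.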

Moreover, using the product rule $p_\theta(x, z) = p_\theta(x|z)\, p_\theta(z)$, we can derive that:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) \tag{2.1.6}$$

Hence, the ELBO can be written as a combination of two terms. The first one measures how well $x$ is decoded from $z \sim q_\phi(z|x)$, hence maximizing it corresponds to minimizing the reconstruction error of the network. The second term measures how different the approximated posterior distribution $q_\phi(z|x)$ is from the prior $p_\theta(z)$. It can be seen as a regularization term that forces the learnt posterior to be close to the prior. This way, one can sample from the prior and, from this sample, the decoder generates a plausible $x$.
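As an illustration of the two terms in equation 2.1.6, the sketch below computes the negative ELBO for a single datapoint under assumptions that are not stated in the text above: the approximate posterior is a diagonal Gaussian with mean `mu` and log-variance `log_var`, the prior is a standard normal, the expectation is approximated with a single sample, and a squared error stands in for $-\log p_\theta(x|z)$ (which matches a Gaussian decoder with fixed variance up to a scale and an additive constant). The function names and the numbers are hypothetical.

```python
import math


def gaussian_kl(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def negative_elbo(x, x_hat, mu, log_var):
    """Reconstruction error plus K-L regularization (the quantity to minimize)."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + gaussian_kl(mu, log_var)


# Hypothetical values: a 3-dimensional datapoint and a 2-dimensional latent space.
x = [0.2, 0.5, 0.9]
x_hat = [0.1, 0.5, 1.0]   # decoder output for one sampled z
mu = [0.3, -0.2]          # encoder mean
log_var = [0.0, -1.0]     # encoder log-variance

loss = negative_elbo(x, x_hat, mu, log_var)
print(loss)
```

Minimizing this quantity is equivalent to maximizing the ELBO under the stated assumptions. The K-L term vanishes only when the encoder outputs exactly the prior, that is a zero mean and a unit variance.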

The reparametrization trick

On a more practical note, when it comes to back-propagating the gradients, one needs to tackle the issue that the sampling operation is not differentiable. However, since the latent distribution is Gaussian, one can sample $\epsilon$ from a standard normal distribution, multiply the sample by the standard deviation $\sigma$ and add the mean $\mu$ of the learnt distribution (equation 2.1.7).

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \tag{2.1.7}$$

This way, the latent variable depends deterministically on the parameters µ and σ and one can back-propagate through the network. This is shown in figure 2.4 and is referred to as the reparametrization trick in [4].

Figure 2.4: The reparametrization trick.
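The reparametrization trick can be sketched in a few lines. The example below draws the noise from a standard normal distribution and computes the latent variable as the mean plus the standard deviation times that noise, independently in each dimension. The function name `reparameterize`, the seed and the encoder outputs are hypothetical, and plain Python lists are used instead of a deep learning framework, so no gradients are actually computed here.

```python
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension.

    The randomness lives entirely in eps, so z is a deterministic
    (and differentiable) function of mu and sigma.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
mu = [2.0, -1.0]         # hypothetical encoder outputs: means
sigma = [0.5, 0.1]       # hypothetical encoder outputs: standard deviations

samples = [reparameterize(mu, sigma, rng) for _ in range(20000)]

# The empirical mean and standard deviation of the first latent dimension
# should be close to mu[0] and sigma[0].
first = [z[0] for z in samples]
mean_0 = sum(first) / len(first)
std_0 = (sum((v - mean_0) ** 2 for v in first) / len(first)) ** 0.5
print(mean_0, std_0)
```

With a standard deviation of zero the sample collapses to the mean, which shows that the stochastic part is fully isolated in the noise term.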

A toy example

In this example, a VAE is trained with a 2D latent space on the MNIST dataset. The recognition network (encoder) takes as input images of digits of dimensions 28x28 and the generator (decoder) learns to reconstruct these samples. At test time, one can sample from the prior distribution and generate digits from these samples. To illustrate the structure of the latent space in the 2D case, the generator can be used to decode samples at different positions in the latent space and the results are shown in figure 2.5 at the positions for which the samples have been generated.

2.1.3 Conditional variational autoencoders

The architecture we build on throughout this project is a variant of the VAE, the conditional VAE (Sohn et al. [6]), that introduces a supplementary condition in the network by concatenating it to the input of the encoder and the decoder, as shown in figure 2.6. Hence, the encoder now learns a distribution $q_\phi(z|x, c)$ and the decoder a distribution $p_\theta(x|z, c)$, where $c$ is the condition the network has to take into account. All equations derived in subsection 2.1.2 are still valid by conditioning on $c$. We can note that the prior $p_\theta(z|c)$ is still a standard Gaussian distribution because $z$ is sampled independently of $c$ at test time.

Figure 2.5: Illustration of the latent space.

The objective (equation 2.1.6) is modified as in equation 2.1.8:

$$\mathcal{L}_{CVAE}(\theta, \phi; x, c) = \mathbb{E}_{q_\phi(z|x, c)}[\log p_\theta(x|z, c)] - D_{KL}(q_\phi(z|x, c) \,\|\, p_\theta(z)) \tag{2.1.8}$$

At test time, like in the VAE case, one can sample from the prior distribution. However, conditions are now concatenated to this sample. This concatenation is fed to the decoder and, from it, a new data point that fits the given conditions is generated. Further details about this architecture can be found in [5] and [6].

Figure 2.6: Schema of the CVAE architecture

In our case, we are interested in creating trajectories that fit a given situation. Hence, we condition our network on some additional information that our predicted trajectory has to match, namely other agents’ trajectories, known frames from the corrupted player’s sequence or identities of the players.

Toy example

Typically, one can learn to generate data points that have a particular label. If we consider our previous example with the MNIST dataset, one can learn to reconstruct digits while concatenating a one-hot encoding of each of them to the input of the encoder and the decoder at train time. This way, the network learns which digit is reconstructed every time, and when the network is trained, we can choose which digit to generate by conditioning on the chosen label of the digit at test time.

We compare the generation of digits from a sample from the latent distribution using a VAE and a CVAE. In figure 2.7, we use a VAE to generate a random digit. In this case, the generated digit looks like an average reconstruction of three digits at the same time: 3, 8 and 9. Indeed, in order to minimize the Mean Squared Error (MSE) loss, the model learns averaged reconstructions, which is not what we would like. On the other hand, the CVAE learns a reconstruction alongside a one-hot encoding of the digit value. This way, it tends to generate digits that correspond strictly to the given value. In figure 2.8, we concatenate to the sample from the latent distribution a one-hot encoding of the digit 9 and the network correctly generates a 9.

Figure 2.7: Generation of a digit from a random sample from the latent distribution of a VAE.

Figure 2.8: Generation with a CVAE with concatenated one-hot encoding to generate the digit 9.
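The test-time conditioning of this toy example can be sketched as follows: a latent sample is drawn from the standard normal prior and the one-hot encoding of the requested digit is concatenated to it before decoding. The helper names `one_hot` and `conditioned_decoder_input`, the latent size of 2 and the seed are illustrative assumptions, and the trained decoder itself is not shown.

```python
import random


def one_hot(label, num_classes=10):
    """Return a one-hot list encoding `label` among `num_classes` classes."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditioned_decoder_input(latent_dim, label, rng):
    """Sample z ~ N(0, I) from the prior and concatenate the condition to it."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return z + one_hot(label)


rng = random.Random(0)
decoder_input = conditioned_decoder_input(latent_dim=2, label=9, rng=rng)
# decoder_input holds 2 latent values followed by the 10-dimensional condition;
# a trained decoder (not shown) would map it to an image of the digit 9.
print(len(decoder_input))
```

At train time the same one-hot vector would also be concatenated to the encoder input, so that both halves of the network see the condition.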

2.2 Training a deep neural network

In this section, we introduce the Adam optimizer (Kingma and Ba [7]), an adaptive learning rate optimization algorithm, and Batch Normalization (Ioffe and Szegedy [8]), a technique to facilitate the training of deep neural networks.

2.2.1 Adam optimizer

The optimizer that is used to train our networks is the Adam optimizer (the name is derived from Adaptive Moment Estimation). It uses adaptive learning rates for each parameter based on estimates of the first moment (mean) and second moment (uncentered variance) of the gradients. The second moment is used in order to diminish the update in dimensions whose gradients vary a lot and to increase the update in dimensions with small variations.

It stores exponentially decaying averages of the gradients and of the squared gradients. Because these averages are initialized at zero, they are biased towards zero during the first steps, and Adam corrects this bias. After that, the weight update is proportional to the bias-corrected first moment estimate divided by the square root of the bias-corrected second moment estimate.

The benefit of this optimizer is that the learning rate is computed for each parameter based on the magnitudes of the recent gradients. Hence, the learning rate requires less fine tuning. It remains computationally efficient and is easy to configure, as the parameters suggested by the original paper are effective. In [9], Ruder argues that, in comparison to several other optimizers, the Adam optimizer adds bias correction and slightly outperforms other methods at the end of optimization as gradients become sparser.
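The update rule can be sketched for a single scalar parameter as below. The default hyperparameters are those suggested in the original Adam paper, while the toy objective, the learning rate of 0.1 and the function names `adam_step` and `minimize_quadratic` are assumptions made for illustration only.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(steps, lr=0.1, target=3.0):
    """Minimize the toy objective f(theta) = (theta - target)^2 from theta = 0."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - target)
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta


theta_after_one = minimize_quadratic(steps=1)
theta_after_many = minimize_quadratic(steps=500)
print(theta_after_one, theta_after_many)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient magnitude. This illustrates why the step size is less sensitive to the scale of the gradients.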

2.2.2 Batch Normalization

Batch normalization aims to reduce the change in the distribution of the hidden units during the training (called the internal covariate shift) by normalizing each batch (subtracting the batch mean and dividing by the batch standard deviation) in order to make the network more stable. Two trainable parameters are added at each layer so that the network can restore the unnormalized output that is passed to the next layer. They correspond to a standard deviation parameter that multiplies the normalized output and a mean parameter that is then added to the result. The benefits are that higher learning rates can be used with less risk of vanishing or exploding gradients, and that overfitting is reduced, as batch normalization slightly regularizes the network and hence reduces the need for dropout. It also adds some noise to each layer’s activations and makes the training depend less on the initialization. Finally, as the training is more stable and more efficient, we can decay the learning rate faster.
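The normalization step can be sketched for one feature and one batch as below. The function name `batch_norm`, the small constant `eps` added for numerical stability and the example activations are assumptions for illustration. The scale and shift are fixed numbers here, whereas in a network they would be trained, and the sketch covers only training-time batch statistics.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalar activations, then rescale and shift.

    gamma (scale) and beta (shift) play the role of the two trainable
    parameters; here they are fixed numbers for illustration.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((a - mean) ** 2 for a in batch) / n   # biased batch variance
    return [gamma * (a - mean) / math.sqrt(var + eps) + beta for a in batch]


activations = [1.0, 2.0, 3.0, 4.0]        # hypothetical activations of one unit
normalized = batch_norm(activations)      # roughly zero mean and unit variance
rescaled = batch_norm(activations, gamma=2.0, beta=5.0)

out_mean = sum(rescaled) / len(rescaled)
out_std = math.sqrt(sum((a - out_mean) ** 2 for a in rescaled) / len(rescaled))
print(out_mean, out_std)
```

After rescaling, the batch has a mean equal to the shift parameter and a standard deviation close to the scale parameter, which is how the layer can undo the normalization when that is useful.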

2.3 Related work

We now review literature about the prediction of multi-agent motion paths and, after that, about role-alignment techniques.

2.3.1 Forecasting multi-agent motion paths

We further detail two solutions that use a CVAE to predict multi-agent motion paths.

Predicting basketball player trajectories using a CVAE

In their paper [10], Felsen et al. implement a CVAE in order to forecast the motion of some agents given the remaining ones in basketball matches. In this case, the number of agents is fixed and all agents are given at once to the network. They learn a latent representation of the multi-agent trajectories and, using some context (conditions), predict future motions of several players. The architecture of the model is shown in figure 2.9.

Let $A$ be the set of interacting agents that are observed over the time history $[t_0, t_q]$ and $P \subseteq A$ the set of agents whose trajectories are predicted during the time $(t_q, t_f]$. The conditions are, in this case:

• the identities $\rho$ of the players in $P$ or the identity of the team,

• the future motions $X_K^{(t_q, t_f]}$ of the agents $K$ that are not predicted and,

• one second history of the predicted agents $X_P^{[t_q - 1, t_q]}$.

The decoder outputs $Y = \left[X_P^{[t_q - 1, t_q]},\, Y_P^{(t_q, t_f]}\right]$. The authors argue that adding this one second of observed trajectory history of the agents helps the model to learn to make predictions that are consistent with the history.

As we want to experiment on predicting segments of the trajectory of a player, this consideration is relevant to us. We can choose to only output the reconstructed part of the agent trajectory, since we can use the known data for the rest of the sequence, or the whole agent trajectory.

Moreover, Felsen et al. have observed that by ordering the agents consistently from one play to another, the network learnt more efficiently. This is performed using a tree-based role alignment of the agents introduced by Sha et al. in [11] and detailed in subsection 2.3.2.

A disadvantage of this approach is that a different model has to be trained depending on the period of time $[t_0, t_q]$ used as history to predict the future motions (input size of the encoders) and the period of time $(t_q, t_f]$ that is predicted (output size of the decoder). This issue is not encountered when using a Recurrent Neural Network (RNN), as we will see in the next reviewed solution.

Figure 2.9: Architecture of the network from [10]. The context and identity of the players are encoded separately and given as input to the decoder alongside the last second of the predicted agents’ histories and the sample $\hat{z}$ from the variational module. The decoder outputs this last second concatenated with the predictions.

An RNN based CVAE framework for predicting future trajectories of multiple interacting agents

Lee et al. [12] provide a different framework to predict future trajectories of multiple interacting agents called DESIRE, a Deep Stochastic IOC (Inverse Optimal Control) RNN Encoder-decoder framework. Their architecture is also based on a CVAE but the encoder and decoder are RNNs instead of being fully connected layers.

Given that past situations can lead to several plausible future trajectories, they needed stochasticity in their architecture, hence the choice for a CVAE. The predictions are returned by the sample generation module and then ranked and refined to produce the best possible results in the end. These modules and the entire network can be found in figure 2.10.

The ranking and refinement module added after the decoder aims to assign a reward function to each prediction to determine the most likely one and to iteratively refine them using a module called Scene Context Fusion that uses the past motion histories, the semantic scene context and the interactions between the agents to build consistent hidden representations for the RNN. The trajectories are passed through a temporal convolution layer to encourage the network to learn the concept of velocity from adjacent frames before being fed into the RNN encoders. The predicted trajectories of each agent are combined together through a fusion layer, where interaction features among the agents are extracted. The reward function is learnt using Reinforcement Learning principles. The RNN assigns scores to each prediction based on the accumulated rewards to perform better in the long-term. After each iteration, the prediction is updated by a learnt displacement and is fed into the module again until precise predictions are obtained.

This structure is more complicated than the one in [10] and includes a new module to train without which the predictions are poor compared to the ones from [10]. However, this approach does not need any role alignment since the agents are treated separately first and refined afterwards using the scene context. What is more, since this approach is RNN based, we expect the temporal aspect of the data to be more efficiently captured in this case.

Finally, the idea of learning a concept of velocity might also be useful in our approach, to return coherent results in terms of human motion, even though in a first approach, we expect the network to learn it from the data.

Figure 2.10: DESIRE architecture from [12]. Several predictions are made from the CVAE and then iteratively ranked and refined until the predictions are precise enough.

We investigate an approach similar to the one introduced in [10], even though it does not adapt to different numbers of predicted frames, as it has the advantage of not needing any additional refinement module like [12].

2.3.2 Solving the permutational alignment problem

An important question in such a multi-agent context is role alignment in the input representation. Indeed, the features associated with one input should ideally be consistent from one set to the next to help the network to learn the behaviour of the agents. If different orderings of the inputs are allowed, the model must, in a sense, learn all of them, whereas if a consistent ordering of inputs is used, the problem is simplified.

In [13], Le et al. tackle the issue of learning individual policies alongside multi-agent coordination by using a Reinforcement Learning technique, imitation learning. Given a set of (state, action) pairs from experts that the agents want to imitate, the model has to learn the policy that drives an agent to take action a from state s or its probability distribution p(a). The goal is to find a consistent indexing mechanism to perform the alignment of the experts given some latent variables that represent the roles of the agents. However, the identity of the agents cannot be used since they might not be the same from one sample to another and each agent has a relatively small number of data samples. Another way would be to learn roles attributed to the agents.

However, some agents may exchange roles during the game or the roles might be different between samples. Instead of defining any role upstream, they are learnt unsupervised, as latent variables.

Similarly, in [11], Sha et al. address the issue of agent alignment to find a consistent ordering over basketball games. In their paper, a template designates an average (over time) player composition for a given team. An example of such a template can be found in figure 2.11 b).

Figure 2.11: Example of a template from [11]. Each color corresponds to a player. In plot (a), we see the positions of the players if they are not ordered. Plot (b) displays the template that will be applied to order the players. The positions of the players after ordering are shown in plot (c).

Assuming that the behaviours of the players consist of many states, the objective is to learn a different template for each state. Sha et al. [11] use an Expectation–Maximization (EM) algorithm to learn the templates unsupervised. With the templates, they can compute the permutation matrix that minimizes the distance between the data point and the learnt templates. After that, they find the best clusters to partition the data into distinct states. Each cluster is characterized by one template that defines the data alignment. The number of clusters is determined by maximizing within-cluster similarity whilst maintaining a significant number of data points in each cluster. They then alternate between those two steps: alignment and data partitioning.

The results of the ordering of the agents are shown in figure 2.11. Plot (a) shows the player positions before ordering and plot (c) shows them after the players are ordered according to a single learnt template.

Instead of defining any role beforehand, they are learnt unsupervised as latent variables for the same reasons as [13]. In the following sections, the "role" of a player refers to its position in the data tensor after ordering.

This approach is used in [10] and has helped significantly improve their results by reducing the permutation disorder inherent to their dataset. Indeed, Felsen et al. argue that without ordering, two similar trajectories could have very different representations, which hampered the training. This is further detailed and adapted to our problem in subsection 3.1.2.

Methods

In this chapter, we explain our approach, how we built the dataset and how we adapted the role-alignment technique from [11] to our problem. After that, we describe the architectures of our deep neural networks and we discuss why we expect them to efficiently solve our problem. Finally, we give details about the implementation of the models and the training process.

3.1 Data preprocessing

The objective is to fill in missing data in a player’s trajectory. In this section, we explain how the dataset was prepared to be fed into the models.

3.1.1 Data preparation

Our data is provided by ChyronHego and consists of tracking data for the ball, the players, the goalies and the referees. The data is originally extracted into csv files containing 6 columns for the ball (the frame number, the x, y and z coordinates, a binary flag indicating if the ball is in play or not and a flag indicating which team has possession of the ball) and 3 columns for the other agents (the frame number and the x and y coordinates). The goalies, the referees and the z coordinate of the ball are not used in the experiments but they are still included in the dataset.

The dataset is prepared as follows:

• Since some arenas have different pitch sizes, we scale them to a pitch size of 105m × 68m.


• We reflect half of the data for each match so the home team is always on the left side of the pitch.

• We divide the data of each match into 12-second sequences of 25Hz data frames where the ball has to be in play for the full 12 seconds.

Hence, each of our data points is a 2x26x300 tensor of a 12-second (or 300-frame) sequence of the 2D coordinates of the 26 agents, namely the ball, the players, the goalies, and the referees. Such a tensor is shown in figure 3.1. The ordering of the agents is consistent within the dataset. It is detailed in table 3.1. However, the outfield players do not have any particular order at this stage.

• When loaded, the data is scaled between 0 and 1. It is also downsampled to 5Hz, decreasing the size of the tensor to 2x26x60. This considerably reduces the number of trainable parameters of the network.
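The scaling and downsampling step can be sketched as follows. This is a minimal illustration and not the thesis code: it assumes that x lies in [0, 105] and y in [0, 68] after the pitch normalisation, and that downsampling keeps every fifth frame. The function name `preprocess` is hypothetical.

```python
import numpy as np

PITCH_LENGTH_M = 105.0  # normalised pitch size stated in the text
PITCH_WIDTH_M = 68.0


def preprocess(sequence, step=5):
    """Scale a (2, 26, 300) sequence to [0, 1] and downsample 25 Hz -> 5 Hz.

    Assumption: the coordinate origin is at a pitch corner, so dividing by the
    pitch dimensions maps on-pitch positions to [0, 1].
    """
    seq = np.asarray(sequence, dtype=np.float64)
    scale = np.array([PITCH_LENGTH_M, PITCH_WIDTH_M]).reshape(2, 1, 1)
    scaled = seq / scale
    # Assumption: keep every `step`-th frame (300 frames -> 60 frames).
    return scaled[:, :, ::step]
```

With 300 input frames and a step of 5, the output has the 2x26x60 shape mentioned above.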

Row(s)      Agent
1           Ball
2           Home goalie
3 to 12     Home outfield players (randomly ordered)
13          Away goalie
14 to 23    Away outfield players (randomly ordered)
24          Referee
25          Line referee (positive y)
26          Line referee (negative y)

Table 3.1: Agent ordering in the dataset

The goal is to predict missing frames in the data. In order to train the network to do so, we mask a given number of frames (from a third to two thirds of the sequence) of the data of a random outfield player from the home team. The mask corresponds to setting those frames to zero. The frames are always situated in the middle of the sequence, as we can assume without loss of generality that we can always find a sequence where the missing frames are centered. The sequence then contains 60 frames for each coordinate and we predict 20, 30 or 40 frames in the middle of it. An example of a tensor with missing frames is shown in figure 3.1.
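A minimal sketch of this masking step is given below. The helper name `mask_center` and the zero-based row indices are illustrative assumptions; rows 3 to 12 of table 3.1 become indices 2 to 11.

```python
import numpy as np

# Rows 3 to 12 of table 3.1 (home outfield players), zero-indexed.
HOME_OUTFIELD_ROWS = range(2, 12)


def mask_center(sequence, n_masked, rng):
    """Zero the central n_masked frames of one random home outfield player.

    sequence: array of shape (2, 26, T).
    Returns the masked copy, the chosen player row and the masked frame slice.
    """
    seq = np.array(sequence, dtype=np.float64, copy=True)
    n_frames = seq.shape[-1]
    start = (n_frames - n_masked) // 2  # centre the gap in the sequence
    frames = slice(start, start + n_masked)
    player = int(rng.choice(list(HOME_OUTFIELD_ROWS)))
    seq[:, player, frames] = 0.0  # the mask sets both coordinates to zero
    return seq, player, frames
```

For a 60-frame sequence and 20 masked frames, the gap covers frames 20 to 39.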

The dataset is split into a training set of 188766 samples, a validation set of 67978 samples and a test set of 57348 samples. The training set is built from games from seasons 2015-2016, 2016-2017 and 2017-2019 of the Bundesliga whereas the validation and test sets are built from games from season 2018-2019, first and second division, respectively. Hence, the three sets contain samples from different games.

Figure 3.1: Schema of a tensor of the dataset

3.1.2 Role alignment

As mentioned in [10], the randomness in the positions of the players in a data tensor that we input to the network might disturb the learning of the weights of the fully connected network. Hence, we implement the role alignment technique developed in [11] in order to correct this issue with the goal of making the network easier to train. This solution is based on learning a set of roles of the outfield players, called a template, and then assigning a role to each player so as to minimize the sum of the distances between the players and their roles.

For each player, we use the average of the trajectory over time to facilitate the ordering. So, in this case, the roles are defined by 2D positions on the pitch and a data point is a tensor of the coordinates of the average of the trajectory of each outfield player. For each team, we have 10 outfield players, so a template is a 2 × 10 matrix. An example of such a template for the home team is shown in figure 3.2. Both teams are treated separately.

The algorithm proposed in [11] allows us to learn different compositions by alternately learning the templates and clustering the data. The dataset is split into several clusters that have an associated learnt template so each tensor from a cluster is aligned to its associated template.

Figure 3.2: Example of a learnt template for the home team

Learning a template

In order to learn the templates used to order the players consistently in each sample, we use the following algorithm:

First, we initialize the templates by computing the average over all the data points. Then, we order each data point, one by one, to minimize the sum of distances between the template’s roles and the outfield players’ averaged positions. This ordering problem is solved using the Hungarian algorithm, an optimisation procedure introduced by Kuhn in [14] and used to solve an assignment problem where "jobs" have to be assigned to "workers" in order to minimize the total cost. In our case, the cost function is the Euclidean distance between the agent from the data point (the "worker") and the role from the template (the "job"). Once each point has been re-ordered according to that template, we re-calculate the template by computing the average over all the re-ordered data points of the dataset. We repeatedly perform these two steps until the difference between two successive templates stops decreasing or reaches a minimum.
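The two alternating steps can be sketched as follows, using SciPy's `linear_sum_assignment` to solve the assignment problem. This is a sketch under stated assumptions rather than the thesis implementation: positions are stored as (n_roles, 2) arrays instead of 2 × 10 matrices, the function names are hypothetical, and the stopping rule is simplified to a tolerance on the change between successive templates.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_to_template(points, template):
    """Reorder the rows of points (n_roles, 2) to best match template (n_roles, 2)."""
    # cost[i, j] = Euclidean distance between player i and role j
    cost = np.linalg.norm(points[:, None, :] - template[None, :, :], axis=-1)
    players, roles = linear_sum_assignment(cost)
    ordered = np.empty_like(points)
    ordered[roles] = points[players]  # the player assigned to role j goes to row j
    return ordered


def learn_template(data, max_iter=50, tol=1e-6):
    """data: (n_samples, n_roles, 2) time-averaged positions of one team's outfield players."""
    data = np.asarray(data, dtype=np.float64)
    template = data.mean(axis=0)  # initial template: average over all data points
    for _ in range(max_iter):
        # Step 1: re-order every data point against the current template.
        ordered = np.stack([align_to_template(p, template) for p in data])
        # Step 2: re-compute the template from the re-ordered data points.
        new_template = ordered.mean(axis=0)
        change = np.linalg.norm(new_template - template)
        template = new_template
        if change < tol:  # simplified stopping rule (assumption)
            break
    return template, ordered
```

Since each alignment is only a permutation of the rows, every re-ordered data point contains exactly the same player positions as the original one.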

Tree based role alignment

As data samples can be quite different, if we were to learn a template for the entire dataset, the template would be an average of all the data points and it would not be representative of the different states of the behaviours of the players, as mentioned before. That is why we split the dataset into several clusters based on the average over time of each agent’s trajectory and each of them is represented by a different template. This clustering of the dataset corresponds to a layer of the tree we are building. In order to split the dataset, we use K-Means clustering, based on the average over time of each agent’s trajectory, and then learn a template for each cluster. The templates are learnt using the procedure detailed in the previous paragraph. Once it is done, we order the data points according to the template of the cluster they belong to.

The templates after clustering are aligned to the parent’s template to enforce a consistent ordering throughout the tree. After that, each cluster is split again into several clusters using the K-Means algorithm and new templates are learnt for each cluster, until we reach a certain depth in the tree or a minimum number of examples in a cluster. This process allows us to learn several templates since we argue that the whole dataset cannot be represented by a single template.

The templates at the leaves of this tree are the ones that we use to order the data samples from the dataset during training. We load each data point, assign it to a cluster at every layer of the tree and order it using the template at the corresponding leaf of the tree.

One iteration of this algorithm is shown in figure 3.3.

Figure 3.3: Schema of an iteration of the tree based learning of the templates

Our solution

Since we have information about ball possession during each game, and because the tree approach can still lead to quite "averaged" templates, where most players are situated at the middle of the field, our approach differs a bit from [11]. Using the ball possession information, we first divide the data into three sets.

Our cases are:

• during most of the sequence the home team has possession,


• during most of the sequence the away team has possession,

• both of them have possession.

Additionally, we extract the cases where, most of the time, the ball is on one side of the field, as we assume that one team is attacking while the other is defending.

Because this is very restrictive, we allow some flexibility. We introduce a parameter that sets the amount of flexibility allowed in each sequence. For instance, choosing 70% of flexibility, we consider that the home team has possession if it has the ball for at least 70% of the sequence. Additionally, if the ball is on the right side of the field during at least 70% of the sequence, we consider that the home team is attacking.
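The flexibility rule can be illustrated with the sketch below. The function name, the encoding of the possession flag (0 for the home team, 1 for the away team) and the mirrored rule for the away team are assumptions made for illustration; the text only states the rule for the home team.

```python
import numpy as np


def classify_sequence(possession, ball_x, flexibility=0.7, pitch_length=1.0):
    """Return (possession label, attacking flag) for one sequence.

    possession: per-frame flags, 0 = home team has the ball, 1 = away team
                (assumed encoding).
    ball_x: per-frame ball x coordinate, scaled so the pitch spans [0, pitch_length].
    """
    possession = np.asarray(possession)
    ball_x = np.asarray(ball_x, dtype=np.float64)
    home_share = np.mean(possession == 0)
    away_share = np.mean(possession == 1)
    right_share = np.mean(ball_x > pitch_length / 2)
    left_share = np.mean(ball_x < pitch_length / 2)

    if home_share >= flexibility:
        # The home team is on the left side, so it attacks towards the right.
        return "home", bool(right_share >= flexibility)
    if away_share >= flexibility:
        # Mirrored rule for the away team (assumption, not stated in the text).
        return "away", bool(left_share >= flexibility)
    return "shared", False
```

A sequence labelled "shared" corresponds to the third case of the list above, where neither team reaches the possession threshold.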

We split the dataset according to which team has possession and, when a team has possession, according to whether that team is attacking or not. In the attacking cases, we learn the templates directly, one for each team. In the rest of the cases, we use a tree per team.

Once all the templates for a team are learnt, the templates are also ordered between them so the order between the templates is also kept consistent. This is summarized in figure 3.4.

When loading the data for training, we evaluate if the sequence corresponds to a team attacking considering their positions and possession of the ball, within the flexibility defined. If this is the case, we use the average templates. If not, we use the trees corresponding to each team. For each team, we find the cluster that this sequence belongs to in the appropriate tree and use the template associated to this cluster to order the players in this data sample.

Each team is always aligned separately.

3.2 Architectures

Different types of architectures, based on the AE, VAE and CVAE, are investigated in this thesis. They are detailed and justified in the following sections.

3.2.1 Model 1: a variational autoencoder

We try to fill the gaps in the sequence of a player from the home team by using a VAE and encoding a certain number of agents alongside this player. The architecture is shown in figure 3.5.

The objective was for the network to capture the dependencies between the trajectories. As a player plays with his team and against his opponents, he can pass the ball to his teammates, block his opponents’ progression when he is defending and avoid them when he is attacking. The ball has a specific behaviour. It cannot move by itself, as opposed to the players, and its behaviour is entirely dictated by the players. The network should be able to capture this dependence on the players’ passes and movements, and the ball motion contains a lot of information as the players tend to run towards the ball, most of the time. Finally, if the ball’s trajectory changes suddenly, we can deduce that a player kicked it out of its previous trajectory, which the network should also learn. This kind of information is particularly useful when it comes to predicting missing frames. All these considerations correspond to the "context" and its consequences over the corrupted player behaviour are investigated throughout this thesis.

Figure 3.4: Schema of our data split to learn the templates. The templates are learnt separately for each team.

We experiment on which agents to encode alongside the corrupted player. We also implement role alignment and build templates to order the players from each team.

Figure 3.5: Schema of our VAE architecture

3.2.2 Model 2: adapted from the variational autoencoder

We adapt the VAE architecture in order to reconstruct only the player that has missing frames. We still encode a certain number of agents in order for the encoder to capture the dependencies inherent to the dataset and extract the relevant features within the sequences. However, since we are interested in filling the missing frames only, we decode only this given player and adapt the architecture accordingly, as shown in figure 3.6. The objective here is to decrease the size of the network to make it easier to train while maintaining good performance by focusing on the corrupted player.

Figure 3.6: Schema of our adapted VAE architecture

3.2.3 Model 3: a corrective architecture

This third architecture is created from our observation that, when predicting less than half the sequence, encoding only the corrupted player often yields better results. After experimenting on our first two models, we come to the conclusion that the context was not obviously beneficial for the predictions.

Hence, we try to build a different structure that can reveal the context effect on a single prediction. We train a VAE (model 1) without using any context, only the single player’s path, and fill the missing frames from his trajectory.

We have a first plausible way of filling the gaps in the data that we want to improve so we train a second network to correct this first estimation and we freeze the first network. The output from model 1 is given as an input to this second network, alongside the surroundings of this player, i.e. the context.

The output from this second network is then the correction of the prediction without context and is summed to this first estimation to give a new prediction that takes into account the context. This principle is shown in figure 3.7.

Figure 3.7: Schema of our corrective architecture. On the top left, we have a pre-trained VAE that has frozen weights and gives a first estimation of the reconstruction. This output is concatenated with other agents' trajectories and given as input to a second VAE that we train to correct the first estimation. The correction is summed with the first estimation to give the final prediction.

Since, using this architecture, we can make a first prediction of the player trajectory, we use it in order to estimate the closest players from the opposite team. We try to correct this first estimation using only the closest players from the opposite team instead of the entire opposite team.

3.2.4 Model 4: a conditional variational autoencoder

We want to implement a VAE that can generate the missing frames directly from the latent distribution while taking into account the context that drives the corrupted player behaviour. We choose to use the CVAE architecture. This network is trained end-to-end.

As mentioned before, our experiments showed that the most important context information seemed to be the non-corrupted part of the corrupted player sequence. However, our objective is to investigate how the players from the other team and the ball affect this generation. So, we add as context the other agents’ trajectories but since we want to keep an hourglass architecture for our network, this information is encoded before being given as conditions to the decoder. Moreover, since we implement role alignment, we also give as a condition the learnt "role" or "identity" of the player, which corresponds to his position in the data tensor after being aligned. Lastly, to help generating coherent predictions, we also input into the decoder known frames of the corrupted player sequence. The amount of known frames (or "non-corrupted frames") used varies in our experiments. This architecture is shown in figure 3.8.

Figure 3.8: Schema of our CVAE

During the testing phase, we sample from a standard Gaussian distribution and our decoder produces a plausible filling, given the known frames of the corrupted player sequence and the encoded context. We also adapt the number of predicted frames and the number of non-corrupted frames to study the context’s influence. Indeed, the more frames the network has to predict, and the fewer known frames we give, the more the network has to rely on the context, i.e. the other agents’ trajectories.

3.3 Implementation

Our objective is to fill a given number of frames in the middle of the trajectory of a player of the home team. For our experiments, we applied a "mask" on top of the data, meaning that these frames are set to zero.

The loss function that is used is adapted from the objective function (equation 2.1.6) derived in subsection 2.1.2. The objective function corresponds to the sum of the expected reconstruction error and a regularization term, the K-L divergence, that measures the distance between the learnt posterior $q_\phi(z|x)$, that is the latent distribution, and the prior $p_\theta(z)$. Maximizing this objective corresponds to minimizing its opposite, our loss function.

We choose to use the Mean Squared Error (MSE) as reconstruction loss and the prior to be a standard Gaussian distribution $\mathcal{N}(0, 1)$. Since the learnt posterior distribution $q_\phi(z|x)$ is also considered to be Gaussian, we can derive the expression of the K-L divergence. This is mentioned in [4] and detailed in Appendix A.
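As an illustration, the closed-form K-L divergence between a diagonal Gaussian posterior and a standard Gaussian prior, combined with an MSE reconstruction term, can be sketched in NumPy as follows. The function names, the log-variance parametrisation and the averaging over the batch are assumptions made for this sketch, not details confirmed by the text.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions.

    Assumption: the encoder outputs the mean and the log-variance log(sigma^2).
    """
    mu = np.asarray(mu, dtype=np.float64)
    log_var = np.asarray(log_var, dtype=np.float64)
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)


def vae_loss(x, x_r, mu, log_var):
    """Reconstruction MSE plus the K-L term, averaged over the batch (assumption)."""
    x = np.asarray(x, dtype=np.float64)
    x_r = np.asarray(x_r, dtype=np.float64)
    mse = np.mean((x - x_r) ** 2)
    return mse + np.mean(kl_to_standard_normal(mu, log_var))
```

The K-L term vanishes when the posterior equals the prior (zero mean, unit variance), so in that case the loss reduces to the reconstruction error alone.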

Let us denote $x_r$ the reconstruction of $x$ with the gaps filled and $L$ the loss function. We have:

\begin{align}
L &= \mathrm{MSE} + D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right) \tag{3.3.1} \\
  &= \|x - x_r\|_2 - \frac{1}{2} \sum_n \left(1 + \log(\sigma_n^2) - \mu_n^2 - \sigma_n^2\right) \tag{3.3.2}
\end{align}

At test (and validation) time, for all networks, we fill the gaps in a sequence and evaluate the accuracy of the reconstruction using the MSE and the maximum distance between a point from the ground truth and a point from the prediction, at a given frame number. We also compute the MSE on the speed in order to evaluate to what extent the network captures this aspect of the trajectory. Finally, we compare our results with the case where the missing frames are filled using a simple straight line, meaning that we suppose that the player has a Uniform Linear Motion (ULM). We calculate the MSE for this case and report it as well to help us estimate the performance that can be achieved without learning anything from the dataset.
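The ULM baseline and the position metrics can be sketched as follows. The function names are hypothetical, and the exact averaging used for the MSE is an assumption: here it is the mean over the masked frames of the squared Euclidean distance.

```python
import numpy as np


def ulm_fill(trajectory, start, stop):
    """Fill frames start..stop-1 of a (2, T) trajectory with a straight line.

    The line joins the last known frame before the gap and the first known
    frame after it, with equally spaced points (constant velocity).
    """
    traj = np.array(trajectory, dtype=np.float64, copy=True)
    before, after = traj[:, start - 1], traj[:, stop]
    n_gap = stop - start
    weights = np.arange(1, n_gap + 1) / (n_gap + 1)  # interior interpolation weights
    traj[:, start:stop] = before[:, None] + weights[None, :] * (after - before)[:, None]
    return traj


def gap_errors(truth, prediction, start, stop):
    """Mean squared position error and maximum per-frame distance over the gap."""
    diff = np.asarray(truth)[:, start:stop] - np.asarray(prediction)[:, start:stop]
    distances = np.linalg.norm(diff, axis=0)  # Euclidean distance at each frame
    return float(np.mean(distances**2)), float(np.max(distances))
```

For a player who really moves in a straight line at constant speed, `ulm_fill` recovers the masked frames exactly, which is why this baseline is only informative on curved or irregular trajectories.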

In the CVAE case, we condition on the context but the equations remain valid. We experiment generating trajectories by sampling directly from the prior distribution and concatenating this sample with the conditions provided to the network. These conditions include the other agents, encoded separately, a one-hot encoding of the player identity, in the case that role alignment is used, and the non-corrupted frames of the player we are trying to fill. We evaluate this generation qualitatively, by analyzing the plausibility of the generated trajectory and by trying to infer how the context influenced this generation, by comparing it to a case where no context was used, and quantitatively, using the MSE and the maximum distance on the positions and the MSE on the speed, by comparing the predictions to the ground truth.
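The construction of the decoder input at generation time can be sketched as below. The function name, the argument names and the order of concatenation are assumptions made for illustration; the actual layer sizes are given in appendix B.

```python
import numpy as np


def build_decoder_input(encoded_context, role, n_roles, known_frames, latent_dim, rng):
    """Concatenate a prior sample with the conditions given to the decoder."""
    z = rng.standard_normal(latent_dim)  # sample from the prior N(0, I)
    identity = np.zeros(n_roles)
    identity[role] = 1.0  # one-hot encoding of the role after alignment
    # Assumed order: latent sample, encoded context, identity, known frames.
    return np.concatenate([z, np.ravel(encoded_context), identity, np.ravel(known_frames)])
```

Drawing several samples of z with the same conditions yields several plausible fillings for the same gap, which is what the qualitative evaluation inspects.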

The architecture of each network depends on the number of agents that are encoded. In all cases, the encoders and decoders consist of several fully connected layers where each layer has about half the amount of nodes of its input layer in the encoder and double in the decoder. The architectures for every model are detailed in appendix B. We used Rectified Linear Unit (ReLU) activation functions except for the last layer of the decoder where we used a Sigmoid activation in order to get outputs between 0 and 1. In the corrective architecture case (model 3), the output of the second network is a correction of the first estimation of the missing frames' real values and is summed to it so we want an output between -1 and 1 in this case. Therefore, we use a tanh activation for the last layer of the decoder of the correcting VAE.

The weights are initialized using He initialization introduced by He et al. in [15], where the weights are sampled from a Gaussian distribution with zero mean and a variance that depends on the size of the previous layer. This method has proven to make the gradient descent more efficient and to diminish effects of vanishing and exploding activations.
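For a layer with ReLU activations, He initialization draws the weights from a zero-mean Gaussian with variance 2 / fan_in, where fan_in is the size of the previous layer. A minimal NumPy sketch, with a hypothetical function name, is:

```python
import numpy as np


def he_normal(fan_in, fan_out, rng):
    """Draw a (fan_out, fan_in) weight matrix with entries ~ N(0, 2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_out, fan_in)) * std
```

PyTorch provides this scheme as `torch.nn.init.kaiming_normal_`; whether the thesis code calls that function or an equivalent is not stated in the text.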

We use the ADAM optimizer, a learning rate $l_r = 0.001$ and a batch size $B = 64$. We experiment using Batch Normalization and L2 weight decay. We use a scheduler to decay the learning rate by half if the validation loss is not improving for 10 epochs and implement Early Stopping to stop the training if the validation loss has not improved in the last 20 epochs.
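The interplay between the learning rate scheduler and Early Stopping can be illustrated by replaying a list of validation losses. This is one possible reading of the rule, written in plain Python with hypothetical names; the exact epoch at which a library scheduler triggers may differ by one.

```python
def run_schedule(val_losses, lr=1e-3, decay_patience=10, stop_patience=20, factor=0.5):
    """Replay per-epoch validation losses; return (epochs run, final learning rate)."""
    best = float("inf")
    since_best = 0   # epochs since the best validation loss (Early Stopping)
    since_decay = 0  # epochs without improvement since the last decay
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_decay = loss, 0, 0
        else:
            since_best += 1
            since_decay += 1
        if since_decay >= decay_patience:
            lr *= factor  # halve the learning rate after 10 stagnating epochs
            since_decay = 0
        if since_best >= stop_patience:
            return epoch, lr  # stop after 20 epochs without improvement
    return len(val_losses), lr
```

With these settings, the learning rate is halved at most twice between the last improvement and the end of training.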

The implementation is done in Python using the framework PyTorch.


Results

The outcomes of our experiments are detailed in this chapter. First, we describe our role alignment results and then, for each of our models, we explain quantitatively and qualitatively how they perform. Their architectures can be found in Appendix B.

4.1 Role alignment

We implement our proposed solution (subsection 3.1.2). The average templates in case one team is attacking and the other defending are shown in figures 4.1 and 4.2. In these plots, each color corresponds to a team. We build two-layer trees excluding the root. The templates learnt for each cluster are shown in figure 4.3 for the home team and in figure 4.4 for the away team. In these plots, each color corresponds to a template, that is to a leaf of the tree, which is a cluster of the dataset. We can notice that they are spread across the pitch, which was our main reason for building these trees.

We load 100 data samples from the test set and align the players for the home team only. We plot where each player with a given role was on the pitch, to see if our ordering is consistent enough for the test set, using all these different templates, since they are built using only the training set.

The result is shown in figure 4.5. Each color corresponds to a given role in the ordering of the players. The templates are aligned between each other so a structure appears where players with the same role are relatively close to each other in the pitch.

From these figures, we can consider that the ordering of the agents should simplify the network’s training. We experiment on this in the following sections.

Figure 4.1: Templates for each team when the home team is attacking

Figure 4.2: Templates for each team when the away team is attacking

Figure 4.3: Templates at the leaves of the tree for the home team

Figure 4.4: Templates at the leaves of the tree for the away team

4.2 Model 1: a variational autoencoder

We start by trying to fill the gaps by encoding the player that misses some frames, alone and alongside other agents. At first, we do not use K-L divergence regularization. The network is then considered equivalent to an AE.

4.2.1 Experiments with a single player

We start by trying to fill the gaps of a random player from the home team without using any context. We simply encode his masked trajectory and decode it with the gaps filled. Some predictions are shown in figure 4.6 and the MSE between the masked frames and the prediction on the unseen data of the test set is shown in table 4.1 (column "Average"). We also report the maximum distance between a frame of the prediction and a frame of the ground truth at the same time instance (column "Max") and the MSE on the speed (column "Speed"). When computing the maximum distance between a point of the ground truth and a point of the prediction, as displayed in the figures, we only consider the missing frames and not the known frames. We also compare with the case where the gap is filled using a simple straight line with equally spaced points, meaning that the player is considered to have a motion in one direction with constant velocity during this time lapse. It is referred to in table 4.1 as Uniform Linear Motion (ULM) and it gives us an order of magnitude of what error we could achieve if we were not learning anything about the players’ behaviour.

Figure 4.5: Average position for each role after player ordering

                 Average    Max       Speed
Single player    0.47m      8.36m     0.43m/s
ULM              1.63m      10.38m    1.00m/s

Table 4.1: Prediction errors for a single player. We display the MSE on the position (Average), the maximum L2 error on the position (Max) and the MSE on the speed (Speed) compared to the real trajectory of the corrupted player. No context is used in this case. We also compare with the case where the prediction is a simple Uniform Linear Motion (ULM).

We can observe in the figures that the network is able to fill the gaps with a coherent trajectory. It seems to also capture the speed of the player and the trajectories are smooth and linked to the non-corrupted frames. When the player is running almost in a straight line (plot c), the prediction is very close to the real trajectory. However, when the player has a curved trajectory, when he changes path suddenly or makes a U-turn (plots b and d), the network predicts a plausible trajectory that is actually far from the ground truth. What we want to study in this thesis is how we can use the context, the surrounding of the players, to improve these predictions, since the player’s behaviour is not independent of the other agents’.