
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Biomedical Engineering

Master's thesis, 30 ECTS | Computer Science

2019 | LIU-IMT/TFK-A-M--19/41--SE

Football Shot Detection using Convolutional Neural Networks

Simeon Jackman, simeon@jackman.ch

Supervisor: Felix Jaremo-Lawin, felix.jaremo-lawin@liu.se
Examiner: Anders Eklund, anders.eklund@liu.se



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

In this thesis, three different neural network architectures are investigated for detecting the action of a shot within a football game using video data. The first architecture uses conventional convolution and pooling layers for feature extraction. It acts as a baseline and gives insight into the challenges faced during shot detection. The second architecture uses a pre-trained feature extractor. The last architecture uses three-dimensional convolution. All these networks are trained using short video clips extracted from football game video streams. Apart from investigating network architectures, different sampling methods are evaluated as well. This thesis shows that amongst the three evaluated methods, the approach using MobileNetV2 as a feature extractor works best. However, when applying the networks to a video stream there are a multitude of challenges, such as false positives and incorrect annotations, that inhibit the potential of detecting shots.


Acknowledgments

First of all I would like to thank Signality for making this thesis work possible. Also, I would like to thank Felix Järemo Lawin, David Habrman, Ludwig Jacobsson and Mikael Rousson for their help and patience during my thesis work. Finally, I would like to thank Vanessa Rodrigues for proofreading my thesis multiple times.


Contents

Abstract
Acknowledgments
Contents
1 Introduction
   1.1 Signality
   1.2 Motivation
   1.3 Aim
   1.4 Research questions
   1.5 Delimitations
2 Related Work
   2.1 Deep Learning and Neural Networks
   2.2 Action Detection
       Benchmark Datasets
       Methods
3 Theory
   3.1 Empirical Risk, Loss functions
   3.2 Network Training
   3.3 Optimizers
       Stochastic Gradient Descent
       Adaptive Moment Estimator
   3.4 Convolutional Neural Networks
   3.5 Residual Networks
4 Method
   4.1 Data
       Initial Dataset
       Addition of Hard Negatives
   4.2 Sampling
   4.3 Temporal Shifting - Temporal Data Augmentation
   4.4 Input resolution normalisation
   4.5 Cropping
   4.6 Framework
   4.7 Model 1 - Baseline Method
   4.8 Model 2 - MobileNetV2 Feature Extraction
   4.9 Model 3 - 3D Convolution
   4.10 Network training
5 Results
   5.1 Results on the Initial Dataset
   5.2 Results on the Full Dataset
   5.3 Validating on football game video streams
       Detections Post Processing
   5.4 Results on football game video streams
   5.5 Deeper analysis of a single game
6 Discussion
   6.1 Results
   6.2 Method
       Optimisers
       Data
       Limitations
       Replicability, Reliability and Validity
   6.3 Future Work
7 Conclusion
Bibliography
A Appendix
   A.1 Detections of all test games


1 Introduction

Since the beginning of the 20th century, sport has turned from a minor affair into a huge business. Sport has become increasingly important in society, given the technological advancements that make live broadcasts available to everyone. Football in particular has grown immensely, with the so-called big five, the English Premier League, the Spanish LaLiga, the German Bundesliga, the Italian Serie A and the French Ligue 1, nowadays known all around the world. According to the Deloitte annual review of football finance of 2018, the European football market had a total revenue of €25.5 billion in the 16/17 season, an increase of 4% over the 15/16 season. The big five leagues contributed more than 50% of that total revenue.

As there is a vast amount of money involved, teams push as hard as they can to be on top. Since there is an urge to improve performance, data should be available as quickly, as precisely and in as much detail as possible to give teams an edge over their competitors. Indeed, over the past years, sophisticated sports analytics has been adopted in most big sports leagues, not only in football. Metrics and visualizations introduced by Courtvision [16] in the NBA and SnapShot [45] in the NHL have been widely used. In the early 2010s, chest tracking devices [3, 6] emerged as the new technological edge for football teams. But while those yield precise information about one team's players during practice and matches, they do not generate any knowledge about the movement of the opposing team. Video-based solutions are able to obtain such information about both teams. Furthermore, they do not require a player to put on specific gear before a match or a training session.

The player and game statistics that result from analysing football games have multiple purposes. They can help coaches improve team tactics, as they can give valuable insights into the performance of players in specific situations or information about their general fitness level. The extracted data can also be used by fans who want to know as much as possible about their favourite player or team.

Extracting relevant data from football games is done by manually assigning labels to specific events such as shots, goals, offsides, etc. As there are hundreds of professional league games played each week, the need for manual labelling is huge. Using automated techniques would have many advantages over manual labelling. First and foremost, it reduces cost significantly. Secondly, using artificial intelligence is faster than human annotation and can generate data with very little inference time. Finally, automated approaches are much more


consistent in the way annotations are made. Different human annotators may have different perceptions of when possession changes, while an algorithm makes annotations in a replicable and unbiased way.

A possible automated approach to this task is to use a neural network. Such a network has the capability to learn from existing training data by adjusting its internal structure. The learned knowledge can then be applied to unseen data. In the context of this thesis, a network can be trained on a set of games by extracting clips of shot situations and clips of situations not depicting shots. The images within a video clip are processed through the network by applying a combination of convolution and pooling operations to extract deep features. Finally, the network can be run on a video stream, applying the learned knowledge to detect shots within a football game.

1.1 Signality

Signality is a start-up company based in Linköping performing sports analytics. In the context of football, two 4K cameras are used to collect video data at 25 frames per second from live games. A broad range of machine learning techniques, such as deep neural networks, is used to automatically extract metrics such as player movements, passes, ball position and possession from the video data. This is then abstracted (as seen in Figure 1.1) and used to give insights to customers.

Figure 1.1: Example showing the two angles of the 4K cameras. The boxes around the players are the result of pre-existing player tracking software.

1.2 Motivation

So far, the system used by Signality for sports analytics does not recognise shots being taken within a game. Considering how important shots are for analysing a game of football, detecting shots is an extension to their system that will enhance the insights gained by their software. Moreover, using convolutional neural networks (CNNs) for shot detection in the context of football games represents a novel approach that fills a methodological gap in the extant literature.


1.3 Aim

The aim of this thesis is to detect shot events within video data using convolutional neural networks. First, data from real football games will be extracted and cleaned. Then a baseline model will be implemented, trained and tested. After evaluation of the baseline method, two additional approaches will be tested, based on the current literature on action detection algorithms. As part of the thesis, these methods should be transferred from a more abstract environment and adapted to perform as well as possible on the problem of shot detection in live football games. In addition, different sampling methods, i.e. the number of video frames and the rate at which frames are fed as input, should be evaluated.

1.4 Research questions

The research questions presented in this section describe what has been investigated in this thesis. All of the research questions will be put into scientific context in Chapter 2.

1. Is it possible to create a CNN that can predict shots from football game videos well enough for production? For the use case of Signality it is vital that any method used for shot detection complies with two requirements which allow an algorithm to be applied in a production scenario. First, the method cannot have an inference time longer than a few seconds, as this would not allow it to be used in real time. Second, the method has to have an accuracy of over 95%, meaning that the accuracy has to be approximately as good as manual annotation.

2. How well do action detection methods transfer to the setting of shot detection in the context of football? The methods used for action detection are tested on more abstract datasets, where the actions and the backgrounds differ greatly amongst the samples. The two benchmark datasets consist of short video clips, each of which contains exactly one action. In the context of this thesis, the football videos have long sequences without a shot instance, meaning that the task at hand is not only classifying whether there is a shot, but also determining when a shot actually happens.

3. How can the temporal information be processed in a way that leads to high classification accuracy? Applying CNNs to static image data is a well-researched subject. However, the temporal information encapsulated in videos allows for multiple viable approaches to overcome the challenge of incorporating the additional dimension. The literature mentions a multitude of possible approaches, but as these approaches are applied in a different domain, their transferability to football games has to be evaluated.

4. How do different sampling methods perform in the context of shot detection? Different sampling methods can have a great effect on the volume of data processed during network training and on its accuracy. As part of this thesis, different sampling methods and techniques will be compared and evaluated. Having too much information leads to long training and inference times, while having too little information has a negative effect on prediction accuracy.

5. Can the utilisation of weights pre-trained on a static image dataset improve the training of CNNs in the context of action detection within football games? Pre-training has been used widely in the field of machine learning because most algorithms are very data-hungry, and in many cases there is insufficient data available to make a network generalise well enough. But does pre-training on static images translate well to the spatio-temporal domain?


1.5 Delimitations

The action of a shot can in general be ambiguous. There are situations within a football game where it is unclear whether a shot, a block, or a scramble for the ball is taking place. Furthermore, when the ball is occluded by a player or the referee, it becomes considerably more difficult to predict the right temporal boundary of the action. Hence the annotations of such events are not consistent: depending on the annotator, the labels can differ. A neural network cannot learn what is not labelled correctly. There will always be specific situations where any algorithm will struggle to perform well.

This thesis does not delve deeply into the post-processing of shot detections. In any production pipeline, the predictions made would be post-processed for multiple reasons, such as noise in the predictions or the reduction of false positives. In this thesis, the emphasis is put on the performance of the neural network architectures and how different sampling methods affect it. To evaluate the networks, some post-processing is used, but these post-processing methods are in no way intended for production; they should rather enable a fair comparison between the methods.


2 Related Work

In this chapter the related work and the scientific background are presented. Furthermore, the research questions mentioned in Section 1.4 are put into context.

2.1 Deep Learning and Neural Networks

Deep learning is a supervised or unsupervised learning technique using network-like structures. According to [7] it is defined as:

A class of machine learning techniques that exploit many layers of non-linear information processing for supervised or unsupervised feature extraction and transformation, and for pattern analysis and classification.

However, there is no consensus on who exactly invented deep learning, nor is there a specific year in which neural networks suddenly appeared from nothing. In 1943, McCulloch and Pitts [39] created the mathematical model for neural networks. In 1980, Kunihiko Fukushima introduced a self-organizing neural network for visual pattern recognition [13], which is considered a milestone in the long development of neural networks. This network, called the "Neocognitron", was an extension of Fukushima's Cognitron [12] (hence the name) and was able to "learn without a teacher". Fukushima's intention was to create a network model capable of pattern recognition similar to the human brain. Unlike the earlier attempts by Kabrinsky [26] in 1964 and later Giebel [14] in 1971, the Neocognitron was capable of detecting patterns regardless of their position in the input. A step further was taken by Yann LeCun a few years after Fukushima. LeCun et al. [35] came up with LeNet-5, a convolutional network trained to detect digits. There are multiple ideas in this paper that inspired further work in the field. LeCun mentions that pixel convolutions should take place right after the input, as pixels in images usually have high spatial correlation, and extracting features through convolution in the early stages takes advantage of that fact. Another novel idea was the architecture used in LeNet-5: it has three consecutive layers where a convolution step precedes a pooling step, making the function incorporated by the network non-linear. This network was trained using gradient-descent back-propagation, also proposed by LeCun et al. in 1989 [34]. In 2006, Hinton and Salakhutdinov [22] introduced the densely connected Deep Belief Network, using many hidden layers and thus coining the term "Deep Learning". This sparked a renaissance of neural network


learning, leading to deep learning methods slowly replacing the Support Vector Machine approach introduced by Cortes and Vapnik [5] in 1995, which was considered state-of-the-art for most classification problems at that time [37, 41, 43].

2.2 Action Detection

The term action detection refers to the task of extracting an action label given an image or a video sequence. The action itself can be anything from facial expressions such as smiling or crying [44] to more compound actions such as drinking a cup of coffee or smoking a cigarette [32].

Action detection in the context of this thesis is the task of analysing football videos, spatially and temporally, to locate the time of the "shot" action within a game. Through the years, state-of-the-art methods in general action detection have relied on various approaches. The biggest difference between them is how they incorporate the temporal component encapsulated by a video. Having a third dimension through time is also what distinguishes action detection from classical image classification challenges. This section outlines two different benchmark datasets, the approaches taken in action detection, how they process the temporal component, which sampling methods are state-of-the-art and how well they perform on said benchmarks.

Benchmark Datasets

To allow fair comparison between action detection algorithms, benchmark datasets are required. Throughout the literature of the last few years, two datasets appear on a regular basis. The UCF-101 dataset [53] was introduced in 2012 and contains 13320 videos of 101 actions which are divided into five different topics: human-object interaction, body motion, human-human interaction, playing musical instruments and sports. All the videos in the dataset have been extracted from YouTube.com.

The second commonly used benchmark dataset is HMDB-51, introduced in 2013 [31]. This dataset incorporates 51 different actions in 6849 video clips. The actions are also divided into five categories: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction and body movements for human interaction. The actions in this dataset are harder to distinguish than the actions in the UCF-101 dataset and, as a consequence, the HMDB-51 dataset is considered the more difficult of the two benchmarks for action detection. One of the actions within the UCF-101 is "soccer penalty", while one action in the HMDB-51 dataset is "kick ball", hinting that the methods applied to these datasets could be useful in the context of shot detection in a football game.

Methods

Expanding from classical 2D image detection into the spatio-temporal domain of action recognition has been tried with methods based on spatio-temporal gradients [30], bag-of-features [59], Random Forests [66], or dense optical flow [58]. However, after the 2010s, approaches using deep CNNs have surpassed other methods in performance. In early 2014, Karpathy et al. [27] tried to improve the classification accuracy on the benchmark datasets by using stacked color images to predict actions. The authors used two separate streams in their network architecture: a context stream which depicts the whole image in a low resolution and a fovea stream which contains a higher-resolution center crop of the image. The two streams are then fused "slowly" in a stepwise fashion, hence the name "Slow Fusion" (see Table 2.1). They argue that the information that is crucial for detecting actions is usually concentrated in the center of an image. However, the approach performed considerably worse than the state-of-the-art method at that time. They investigated various fusion techniques and concluded that slow fusion works best when applied to action detection.


Table 2.1: Progress in classification accuracies on the two commonly used benchmark video datasets mentioned in Section 2.2. All methods in the table use a deep neural network to predict the label of a given action which is represented through a video clip.

Method                        Year   HMDB-51 [31]   UCF-101 [53]
HOF+MBH [58]                  2013   57.2%          -
Slow Fusion [27]              2014   -              65.4%
Two-Stream [11]               2014   59.4%          88.0%
Deep Network [67]             2015   -              88.6%
TSN [62]                      2016   69.4%          94.2%
I3D [2]                       2017   80.9%          97.8%
Discriminative Pooling [60]   2018   81.3%          -
DTPP [68]                     2018   82.1%          98.0%

They also state that a single-frame model, instead of frame stacks, can already yield strong results, and presume that local motion cues might not be critically important, even for action detection.

Building on Karpathy et al. [27], a different and promising approach to incorporating motion was introduced in 2014 by Simonyan and Zisserman [52]. They proposed a then novel two-stream method where the temporal and spatial channels are treated completely separately by inputting an RGB stream and a separate optical flow [23] stream to the network. They merge the output of both streams at the end by using a Support Vector Machine to make predictions. As a result, the HMDB-51 [31] benchmark tests showed that the optical flow stream was able to improve on the benchmark (see Table 2.1) by incorporating temporal information better than plain RGB frames could. This two-stream approach has since been refined in multiple ways [2, 8, 11, 15, 36, 56, 61, 62, 68], and the improvements have led to even further increases in accuracy on the two benchmark datasets UCF-101 [53] and HMDB-51 [31].

Probably the most obvious extension to a basic two-stream network is expanding the filters and pooling kernels from 2D to 3D. This was initially proposed in 2015 by Tran et al. [56] (C3D). They use ImageNet [48] to train a spatial feature extractor and then apply three-dimensional filters and pooling kernels to merge the obtained features. This approach has been further refined by Carreira and Zisserman in 2017 [2], and by Varol, Laptev and Schmid in 2018 [57], by pre-training the network on action datasets rather than static datasets.

Feichtenhofer, Pinz and Zisserman [11] evaluated different fusing mechanisms for the spatial and the temporal stream. The original two-stream network has the flaw that it never learns the pixel-wise correspondence of spatial and temporal information, as the channels are only fused at the end. They therefore proposed an enhanced version of the two-stream network [52] by fusing the two channels early, while still keeping both channels after fusion, thereby conserving the temporal channel for a longer duration. Le Wang et al. [61] introduced attention-aware temporal weighted CNNs. They criticise that current approaches sample random frames from the video sequence, leading to cases where useless or redundant information is fed to the network. Their proposed approach handles that limitation, as their model implicitly assigns higher temporal weights to semantically critical segments. For example, sequences with disadvantageous camera angles are not taken into account as much as clean shots. This is done by adding attention model layers after feature extraction. But even with the temporal weighting mechanism, the problem of sampling remains. Using a whole high-resolution video sequence as input (dense sampling) is too computationally expensive and feeds redundant information into the network [11, 62]. Sparse sampling [60, 67] and segment sampling [61], where a video is cut into segments and frames are randomly sampled from each segment, are therefore the more frequently used procedures in former state-of-the-art methods. In 2017, Zhu, Zou and Zhu proposed a new pooling method for the temporal component [68]. They call their method Deep networks with Temporal


Pyramid Pooling (DTPP). Basically, they transfer the spatial pyramid pooling method [19] into the temporal domain. Similar to Le Wang et al. [61], the video is split into segments and a frame is sampled from each segment. Then frame features are extracted using CNNs. But instead of using the sampled frame features directly as input, a pyramid pooling layer is placed on top of them. The main advantage of DTPP over older methods is that the network incorporates different granularities of temporal scope. As seen in Table 2.1, this method is to date the state-of-the-art action detection method on the HMDB-51 [31] dataset, reaching a classification accuracy of 82.1%.

Another approach for general action detection is using the pose of the actor as input to a convolutional neural network. Luvizon, Picard and Tabia [38] have done so by using a multi-task network where action detection and pose estimation are done simultaneously. Actions can be detected through temporal sequences of body joint movements. The detection from joints is then combined with visual appearance recognition through a softmax layer to improve action detection. Baradel, Wolf, and Mille [1] proposed a pose-conditioned convolutional neural network with spatio-temporal attention for human action recognition. It has two channels and uses the Long Short-Term Memory (LSTM) mechanism with joint positions as input.

Using Recurrent Neural Networks (RNNs) and LSTMs is another approach to tackle action detection. This method has a natural way of incorporating the temporal component, as it is recurrent. Donahue et al. [9] and Yue-Hei Ng et al. [67] have both proposed using LSTMs for action detection. This approach was considered state-of-the-art for benchmarks in 2015, but has since been surpassed by multi-stream CNNs.

Mehrasa et al. [40] propose a different approach to action detection. They suggest using only the trajectory of the players in ice hockey to detect actions such as shooting and passing. The authors use a network architecture they call "shared-compare". From the raw player trajectories, features are extracted using a "shared trajectory" network, then those features are fed into a "shared comparison" network. The latter network is then used to predict actions.

Amongst others, Zhu et al. [64] and Wang et al. [61] have shown that deep CNNs can be pre-trained on common image or video classification datasets such as ImageNet [48] and Kinetics [2] to improve performance. They show that even the temporal channel can benefit from being pre-trained on static images by stacking the same picture multiple times. This approach can be particularly useful if the training dataset is relatively small, as the networks will converge faster when using pre-training, or the pre-trained network does not need further training at all, reducing training time and complexity.

Another frequently discussed structural entity of these neural networks is the pooling method. Zhu et al. [68] propose a novel pooling method: temporal pyramid pooling is capable of handling variable-length input videos and also considers different scopes of the video for its final prediction. Wang et al. [60] use so-called discriminative pooling. They train a Support Vector Machine to do the pooling by weighing the different streams appropriately. Girdhar et al. [15] investigate the performance of different pooling methods. They come to the conclusion that the pooling method they propose, ActionVLAD, which clusters the appearance and motion features and aggregates their residuals from the nearest cluster centers, is a competitive approach as well.

Most of the aforementioned methods have been evaluated on datasets such as HMDB-51 [31], UCF-101 [53] and Sports-1M [27], in which kicking a ball and taking a penalty are among the actions in the set. This suggests that these methods may be applicable to shot detection in the context of this thesis. But as mentioned by Herath, Harandi and Porikli [21], the UCF-101 and HMDB-51 datasets contain data that is well cropped in the temporal domain, possibly making it hard to find the exact temporal boundaries of shots with said methods.


3 Theory

The theory chapter presents the foundations of convolutional neural networks and residual networks, how to train them, and the theory behind the different optimizer strategies used for network training.

3.1 Empirical Risk, Loss functions

In common machine learning problems the true distribution $p_{\text{data}}(x, y)$ of the data $x$ and the corresponding label $y$ is not known, but a subset of the data drawn from that distribution is available for training. A combination $(x, y)$ is also referred to as a sample. Neural networks try to learn a set of parameters $\theta$ from the training set by minimizing a cost function $J$ constituted of the expectation of the loss over the training samples:

$$J(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} \, L(f(x; \theta), y) \qquad (3.1)$$

The loss function $L$ is an error function which measures the deviation of $f(x; \theta)$ from the true label $y$ [17]. It usually has a form similar to the cross entropy over the classes $c$ in a dataset:

$$L(f(x; \theta), y) = -\sum_{c=1}^{C} y_c \log(f(x; \theta)_c) \qquad (3.2)$$

$C$ denotes the number of classes to be found within the data. The knowledge learned through training should generalize to an unseen test set, as the underlying distribution $p_{\text{data}}(x, y)$ which generates the training and the test data is assumed to be the same, or at least similar. Minimizing the cost function in Equation 3.1 is called empirical risk minimisation, as the cost function is applied to an empirical subset of the data $(x_t, y_t)$ sampled from the true distribution $p_{\text{data}}(x, y)$ [42].
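As an illustration, the following minimal NumPy sketch computes the cross-entropy loss of Equation 3.2 averaged over a batch, i.e. the empirical risk of Equation 3.1 on that batch. One-hot labels and the variable names are assumptions of the sketch, not code from the thesis.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy loss (Equation 3.2), averaged over a batch.

    probs:  (batch, C) predicted class probabilities f(x; theta)
    labels: (batch, C) one-hot encoded true labels y
    """
    # Clip to avoid log(0); sum over classes, then average over the
    # batch to obtain the empirical risk of Equation 3.1.
    per_sample = -np.sum(labels * np.log(np.clip(probs, eps, 1.0)), axis=1)
    return float(np.mean(per_sample))

# Two-class example, as in shot / non-shot classification:
probs = np.array([[0.9, 0.1], [0.3, 0.7]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
print(cross_entropy(probs, labels))  # (0.105 + 0.357) / 2, about 0.23
```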

3.2 Network Training

Before being able to make sophisticated predictions, any neural network has to be trained, either from scratch or from a specific state using transfer learning. The general idea of network training is to pass the input through the network using forward propagation, calculate


a gradient using a loss function, and then propagate that gradient backwards through the network to update the weights accordingly. As mentioned in Section 3.3, neural networks have an increased level of complexity over other machine learning algorithms in the sense that their loss functions are non-convex in most cases and thus have no global convergence guarantee [17]. Therefore, iterative gradient-based methods are used to train them. Their goal is to push the cost function to a low value by using the gradient.

Models with high complexity can simply memorize the training set, leading to overfitting. This happens when the parameters of a model are over-optimised for the training set, making the model lose its generalisability. To determine the moment overfitting takes place, a validation set is used [46]. The model is then trained until the loss on the validation set no longer improves [50].
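In Keras, the deep learning framework used later in this thesis, this validation-based stopping criterion can be expressed with the EarlyStopping callback. The patience value below is an illustrative assumption, not a setting taken from the thesis.

```python
from keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for a few
# consecutive epochs, which is the overfitting criterion described above.
early_stopping = EarlyStopping(monitor='val_loss', patience=3)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stopping])
```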

3.3 Optimizers

Optimizing a cost function for a neural network brings multiple challenges. First and foremost, the cost function is in most cases non-convex, meaning that optimizers can get stuck in local minima, as there is no global convergence criterion. Secondly, the high-dimensional surface of the cost function commonly has flat regions with no or only very little gradient. Lastly, cliffs which yield huge gradients can lead to skipping over interesting regions, possibly missing a good solution. All the aforementioned challenges have to be taken into account when considering optimization algorithms for neural networks. In the subsequent subsections, the two optimizers used in this thesis are briefly introduced.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) [28, 47] is an iterative method to optimise a differentiable loss function. It uses the unbiased average gradient of a batch to traverse the loss landscape. The algorithm can be seen in Algorithm 1. After initialising the learning rate $e_k$ and the parameters of the model $\theta$, a subset of the samples is used to calculate the gradient of the loss function. As obtaining the gradient of the cost function has a high computational complexity, it can be extremely expensive to calculate over the full training set. Using a subset of the training data greatly reduces this cost, as batches are usually orders of magnitude smaller than the whole training set. However, since randomly choosing a subset as a batch introduces noise, the gradient can be non-zero even if the current position on the loss surface is at a minimum. After propagating the loss through the network, the weights of the model are updated. Within an epoch, each sample will be represented in exactly one batch. This assures that the training is not biased towards specific samples.

Algorithm 1: SGD algorithm as in Goodfellow et al. [17]

1: Require: learning rate $e_k$
2: Require: initial parameters $\theta$
3: while stopping criterion not met do
4:     Sample a batch of $m$ examples $\{x^{(1)}, \dots, x^{(m)}\}$ from the training set, with targets $y^{(i)}$
5:     Compute gradient estimate: $\hat{g} \leftarrow \frac{1}{m} \nabla_{\theta} \sum_i L(f(x^{(i)}; \theta), y^{(i)})$
6:     Apply update: $\theta \leftarrow \theta - e_k \, \hat{g}$
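A minimal NumPy transcription of Algorithm 1 follows. The grad_loss function, assumed to return the average gradient over a batch, and the data layout are illustrative assumptions.

```python
import numpy as np

def sgd(theta, x_train, y_train, grad_loss, lr=0.01, batch_size=32,
        epochs=10):
    """Minibatch SGD as in Algorithm 1."""
    n = len(x_train)
    for _ in range(epochs):
        # Shuffle so that each sample appears in exactly one batch per
        # epoch, keeping the training unbiased towards specific samples.
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Line 5: gradient estimate averaged over the batch
            g = grad_loss(theta, x_train[idx], y_train[idx])
            # Line 6: parameter update
            theta = theta - lr * g
    return theta
```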

Adaptive Moment Estimator

The Adaptive Moment Estimator, also known as the ADAM optimizer, was introduced in 2015 by Kingma and Ba [29]. It is based on AdaGrad [10] and RMSProp [55]. ADAM unites the advantages of both methods: it can deal with sparse gradients like AdaGrad, as well


as with non-stationary objectives like RMSProp. In contrast to SGD, the ADAM method has an exponentially decaying momentum computed from the averages of gradients over time. In theory, the momentum allows the optimization to continue over small saddle points or local minima.
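Both optimizers are available off the shelf in Keras; the learning rates shown here are the library defaults and illustrative, not the values used in the experiments of this thesis.

```python
from keras.optimizers import SGD, Adam

sgd_opt = SGD(lr=0.01)         # plain stochastic gradient descent
adam_opt = Adam(lr=0.001,
                beta_1=0.9,    # decay of the first-moment (momentum) average
                beta_2=0.999)  # decay of the second-moment average

# model.compile(optimizer=adam_opt,
#               loss='categorical_crossentropy', metrics=['accuracy'])
```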

3.4 Convolutional Neural Networks

Convolutional Neural Networks [33] are a variation of conventional neural networks specialized for grid-like data such as images or time series. What all CNNs have in common is that they perform a linear operation called convolution in one or more of their layers. A 2D convolution with a kernel of size $m \times n$ over an image with pixel coordinates $i$ and $j$ has the form:

$$S(i, j) = (I \ast K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n) \qquad (3.3)$$

$I$ denotes the input while $K$ denotes a kernel function. As convolution is commutative, Equation 3.3 can be rewritten as:

$$S(i, j) = (I \ast K)(i, j) = \sum_m \sum_n I(i - m, j - n)\, K(m, n) \qquad (3.4)$$

These convolutions can be applied to spatial features (e.g. within an image) or in a temporal fashion. In Figure 3.1 the different dimensions can be seen. Convolving along the x and y axes is spatial convolution (i.e. convolution in space), while convolving along the t axis is temporal convolution (i.e. convolution in time). In general, convolutions can be applied to any number of dimensions within the input space by expanding the kernels to the desired dimensionality. For spatio-temporal data such as videos, 3D convolution is an intuitive extension of common 2D convolution, as it also incorporates the temporal channel in the operation [17].

Figure 3.1: This figure shows the temporal (t) and spatial (x, y) dimensions of a stack of frames t, t + 1 and t + 2.

Using convolutions to process images has the advantage that fewer weights are required, as convolutional layers have sparse interactions and shared weights. Unlike traditional dense layers, where the weight matrix is multiplied with the input matrix, convolutional layers share weights through kernels. Not only does this require less space, it is also faster to compute.


Nearly all CNN architectures apply a method called pooling after convolutional steps. This pooling method outputs a summary of the activations over a region of convolutions. One of the most commonly used pooling methods is max-pooling, where the maximum value of a region is selected as the output of the pooling layer. In general, every convolution and pooling layer combination consists of three substeps. Given a grid-like input, a convolution is performed as described previously. Then, a non-linear function is applied to each output of the convolution. Finally, a pooling method selects a subset of the whole region through a pooling operation, generating the output of the layer.

For convolution and pooling layers, the stride length determines how often a convolution or pooling operation is applied. A stride length of 1 corresponds to applying an operation on every pixel of the input, while a stride length of 2 applies an operation on every other pixel. If the stride is set higher than 1, the operation reduces the dimensionality of the data. Max-pooling over an image with a stride length of two will output an image with a fourth of its original resolution, because both the width and the height have been halved.
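The following NumPy sketch illustrates one convolution-pooling stage on a single-channel image: a valid 2D convolution as in Equation 3.4 (note the kernel flip), a ReLU non-linearity, and 2x2 max-pooling with stride 2, which halves width and height. It is a didactic sketch, not the implementation used in the thesis.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (Equation 3.4) on a single-channel image."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

def max_pool(x, size=2, stride=2):
    """Max-pooling; with size = stride = 2 the resolution is quartered."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

image = np.random.rand(8, 8)
feature_map = np.maximum(conv2d(image, np.random.rand(3, 3)), 0)  # ReLU
pooled = max_pool(feature_map)  # (6, 6) -> (3, 3)
```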

3.5 Residual Networks

Residual networks, introduced by He et al. in 2016, expand the toolbox of deep learning with the skip layer connection [18]. These skip layers have the capability of jumping over other layers and allow the network to adapt its depth implicitly. The main advantage of residual networks is that, through the skipping of layers, the gradient can propagate more easily through the network. This allows for deeper networks without having to deal with the vanishing gradient problem. Probably the most famous residual network is the ResNet [54]. It surpassed state-of-the-art image classification methods when it was published in 2016, and since then its extensions, such as ResNeXt [63], have also performed well on similar applications. Depending on the version used, the ResNet can have between 25 and 60 million parameters. This makes inference time relatively slow. A more lightweight image classification network is the MobileNet [24]. It has emerged to satisfy the need for less complex models which do not require vast amounts of computational power. The MobileNet has approximately 4.2 million parameters and therefore has shorter inference times. The newest version of the MobileNet is called MobileNetV2 [49]. The second version has even fewer parameters at 3.5 million, while having a higher accuracy on the ImageNet [48] dataset.

The reduction of parameters is due to the use of depth-wise separable convolutions (see Figure 3.2), where full convolution operations are separated into two steps. First a depth-wise convolution is applied, where one filter is used per input channel. Then a 1 x 1 convolution is applied over all the feature maps generated by the depth-wise convolution. The 1 x 1 convolution reduces the dimensionality of the feature maps. This approach has the benefit that, as the spatial and channel convolutions are completely separated, fewer parameters and less computation are needed, while accuracy is still retained.


Figure 3.2: This figure shows the most important building block of the MobileNetV2, the bottleneck residual block: a 1x1 convolution with ReLU6, a 3x3 depthwise convolution with ReLU6 and a linear 1x1 convolution, with a skip connection adding the block input to the output.
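A Keras sketch of the bottleneck residual block from Figure 3.2. The expansion factor of 6 matches the MobileNetV2 paper [49], but the exact hyperparameters are assumptions here, and the batch normalisation layers of the real network are omitted for brevity.

```python
from keras.layers import Input, Conv2D, DepthwiseConv2D, ReLU, Add
from keras.models import Model

def bottleneck_block(x, channels, expansion=6):
    """Bottleneck residual block as in Figure 3.2 (stride 1)."""
    # 1x1 expansion convolution with ReLU6
    y = Conv2D(channels * expansion, (1, 1), padding='same')(x)
    y = ReLU(max_value=6.0)(y)
    # 3x3 depthwise convolution: one filter per input channel, ReLU6
    y = DepthwiseConv2D((3, 3), padding='same')(y)
    y = ReLU(max_value=6.0)(y)
    # Linear 1x1 projection back down to `channels` feature maps
    y = Conv2D(channels, (1, 1), padding='same')(y)
    # Skip connection: add the block input to the block output
    return Add()([x, y])

inp = Input(shape=(64, 64, 32))
model = Model(inp, bottleneck_block(inp, channels=32))
```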


4 Method

This chapter presents a detailed description of the methods used in this thesis. First the data and how its extraction was done are described. Then the facets of sampling are explained, followed by an introduction of the three network architectures, how they have been trained and how they make predictions. Finally the validation methods are presented.

4.1 Data

The data used for this thesis consists of 164 games from the first and second Swedish football leagues (Allsvenskan and Superettan). Each game is represented by two video streams, one depicting the left half of the pitch and one the right. The streams have an average length of approximately two hours, of which a little more than 90 minutes is match data with official annotations. The streams have been recorded using two 4K cameras at 25 frames per second. In Figure 4.1 an example frame of a right-side stream can be seen. The 164 games yielded a total of 3263 officially annotated shot instances. The first step of data extraction was evaluating which games had all the necessary metadata required to process them. In the metadata delivered by Signality, some games did not contain the calibration data used for cropping (see Section 4.5), while other games did not have correctly annotated event data used for the extraction of the shot timestamps. Out of a total of over 300 games, 164 had complete metadata, meaning that they could be processed further without additional supervision.

In order to train any deep learning models, the data had to be extracted from the video streams. Given a video stream of a game, the official shot timestamps have been matched with the video stream time, such that the moment of the shot could be extracted. The annotations have slight inconsistencies, as different annotators interpret the "shot" action slightly differently. To get an idea of how well the shots are annotated, 100 shots from 100 different games have been checked frame by frame to see how precisely the shots were annotated. The correct shot moment is here defined as the frame before the ball leaves a player's foot (for a normal shot, free kick or penalty) or head (for a header). Figure 4.2 shows the histogram of annotation errors. Out of the 100 shots, four instances were wrongly annotated, meaning that they clearly were not shots; their error is not represented in the histogram. Those four instances depicted scrambles with no shots, or clearances. The distribution of errors strongly resembles the normal distribution $\mathcal{N}(\mu \approx 0.5, \sigma \approx 1)$.


Figure 4.1: Example showing the view of one camera. The angle of view can vary greatly depending on the stadium.

Figure 4.2: Histogram of annotation errors (in seconds) and the normal distribution it resembles. The data consists of 100 shots from 100 different games.

For network training, the dataset has been split into training, test and validation subsets. The split has been done by game, such that shots or non-shots from the same game will always be in the same subset. As seen in Figure 4.4, the training subset contained 80% of the data, the test subset 10% and the validation subset 10%. This corresponded to 131, 17 and 16 games in each subset respectively. The training data has been used to actually train the networks and update the weights, while the validation set has been used to determine when to stop training.
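A small sketch of this game-level split, assuming a list of game identifiers. Splitting by game rather than by clip guarantees that clips from one game never end up in different subsets; the rounding below reproduces the 131/17/16 split mentioned above.

```python
import random

def split_by_game(game_ids, train_frac=0.8, seed=0):
    """Split games (not clips) into train, test and validation sets."""
    games = list(game_ids)
    random.Random(seed).shuffle(games)
    n_train = int(train_frac * len(games))  # 131 of 164 games
    rest = games[n_train:]                  # remaining 33 games
    n_test = len(rest) - len(rest) // 2     # 17 test games
    return games[:n_train], rest[:n_test], rest[n_test:]

train_games, test_games, val_games = split_by_game(range(164))
print(len(train_games), len(test_games), len(val_games))  # 131 17 16
```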

Initial Dataset

For an initial training iteration, positive (shot) and negative (non-shot) instances are required to train a network of any architecture. For each shot instance, 50 frames have been extracted at a rate of one frame every 200 ms. The 50 frames span a time of ten seconds, five seconds before and five


seconds after the shot, with the annotated shot frame always in the middle. A sampling interval of less than 200 ms has been tested and discarded, as it would generate an excessive amount of data. With the extraction configuration used, the shots and negatives occupy a total of approximately 200 GB. Before extracting data from the video streams, the correct timestamps had to be found. The positive timestamps could simply be extracted by matching the official data with the video stream. Since there should be no overlap between positive and negative data, a temporal boundary starting five seconds before and ending five seconds after every positive instance has been set. Random timestamps outside of all these "shot boundaries" were then picked to extract the negative data. The resulting data consisted of 50% shots (3263) and 50% non-shots (3263), with the same number of shots and non-shots for each game.
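A sketch of the negative-timestamp sampling under the stated no-overlap constraint, assuming shot timestamps and stream length in seconds; the function and variable names are illustrative.

```python
import random

def sample_negatives(shot_times, stream_length, n_samples, margin=5.0,
                     seed=0):
    """Pick random timestamps at least `margin` seconds away from every
    annotated shot, so negative clips never overlap a shot window."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < n_samples:
        t = rng.uniform(margin, stream_length - margin)
        if all(abs(t - shot) > margin for shot in shot_times):
            negatives.append(t)
    return negatives

# One negative per shot yields the balanced 50/50 initial dataset:
shots = [62.0, 318.4, 1245.1]  # illustrative shot timestamps
negatives = sample_negatives(shots, stream_length=5400.0,
                             n_samples=len(shots))
```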

Addition of Hard Negatives

When sampling negatives randomly according to the settings of the initial dataset, it intuitively becomes clear that most of the extracted samples will be relatively easy. Around half of the negative samples were similar to Figure 4.3. This is because when sampling at a random timestamp on a random side, approximately half of the time the ball will be on the opposing side, meaning that there will not be many players on the extracted side of the pitch. Therefore, hard negatives have been extracted to extend the dataset and increase its difficulty.

Figure 4.3: Example of an easy negative sample. Most of the players are on the opposite side of the pitch. This situation does not even remotely resemble a shot, hence this is an easy negative.

The timestamps of the hard negatives [20] have been extracted by running the baseline model with 5 frames and a 400 ms sampling rate on the video streams of the games in the dataset. A timestamp was used as a hard negative whenever the network predicted a false positive with a probability of more than 90% (the prediction boundary). A sequence of 50 frames was then extracted around that timestamp and added to the already existing dataset. A maximum of 25 hard negatives per game have been added. This limit was required because the network created a lot of false positives on certain games. While raising the prediction boundary to 95% would have reduced the number of false positives for some games, it would have resulted in other games not having any false positives at all. Figure 4.4 shows the composition of the two datasets.
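A sketch of this hard-negative selection rule, assuming per-timestamp shot probabilities produced by running the baseline model over a full game stream; the names and the data layout are illustrative.

```python
def mine_hard_negatives(predictions, shot_times, boundary=0.9,
                        margin=5.0, max_per_game=25):
    """Collect timestamps of confident false positives.

    predictions: iterable of (timestamp, shot_probability) pairs from
    the baseline model run over one game's video stream.
    """
    hard = []
    for t, prob in predictions:
        # A confident detection far away from every annotated shot is
        # a false positive, hence a hard negative.
        if prob > boundary and all(abs(t - s) > margin for s in shot_times):
            hard.append(t)
            if len(hard) >= max_per_game:  # cap of 25 per game
                break
    return hard
```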


Figure 4.4: Figure showing what the two datasets are constituted of. The initial dataset is balanced and consists of 50% shots (3263 instances) and 50% negatives (3263 instances). The full dataset consists of the initial dataset plus the hard negatives (3529 instances), hence it is not balanced. Each group of instances is split 80%/10%/10% into the train, test and validation subsets.

4.2 Sampling

As mentioned in Chapter 2, it is not possible to feed all the data obtained through the video stream into the network: it would not only be too computationally demanding to run the networks on all the data, but would also require too much GPU memory. First, the input resolution has to be chosen, then the number of frames of that resolution to be stacked as input. While single frames are computationally easy to process, they contain no temporal information. Picking more than one frame corresponds to feeding the network a short video. Next to the number of frames, the sampling rate is relevant. When using five frames, for example, sampling every 200 ms will yield a total timespan of 800 ms, a rather short temporal scope. Sampling every 600 ms will yield a total timespan of 2400 ms, giving more global context to the network, but at the drawback of sparser information about the shot movement. This means that sparse temporal sampling loses information about the actual shooting motion, while dense temporal sampling preserves that information but cannot give as much large-scale context.
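The trade-off can be made concrete with a small helper: given the 25 fps streams, a sampling rate in milliseconds maps to a frame step, and the covered timespan grows linearly with both the number of frames and the rate. The function below is illustrative, not code from the thesis.

```python
def clip_frame_indices(center_frame, n_frames=5, rate_ms=200, fps=25):
    """Frame indices of a clip centred on `center_frame`."""
    step = rate_ms * fps // 1000  # 200 ms at 25 fps -> every 5th frame
    half = n_frames // 2
    return [center_frame + i * step for i in range(-half, half + 1)]

def timespan_ms(n_frames, rate_ms):
    """Total time covered between the first and the last frame."""
    return (n_frames - 1) * rate_ms

print(clip_frame_indices(1000))                  # [990, 995, 1000, 1005, 1010]
print(timespan_ms(5, 200), timespan_ms(5, 600))  # 800 2400
```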

4.3 Temporal Shifting - Temporal Data Augmentation

For network training, a specific form of data augmentation has been used through which multiple samples are extracted for each shot instance using a sliding (or shifting) window approach. Figure 4.5 shows an instance of a shot with two frames before and two frames after the shot. Instead of just extracting the frames (1, 2, 3), the variations (0, 1, 2) and (2, 3, 4) can also be extracted, as all three of them contain the frame with the actual shot.

Figure 4.5: This figure shows a single shot instance (frames 0 to 4) with two frames before and two frames after the actual shot frame.


During training, a randomly shifted variation was used as the training sample. There are multiple reasons why this approach is promising for improving the results. Firstly, as seen in Figure 4.2, the annotations of the shots are not always accurate. When using three frames sampled at 200 ms, there is a high probability that the frame with the actual shot lies outside of these three frames. Without temporal shifting, the time window fed into the network corresponds to 400 ms (the first frame is 200 ms before the annotated shot and the last frame is 200 ms after it). When using temporal shifting as in Figure 4.5, the total time window increases to 800 ms, making it more likely that the frame with the actual shot will be fed into the network. The second reason why temporal shifting might improve the network training is that it generates three times the number of distinct training samples: from each unshifted sample, three shifted samples can be generated. This should in theory make the network generalise better, as overfitting is made harder through the increased variety of the training data.
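A sketch of the shifting-window augmentation: every window of n_frames consecutive positions that still contains the annotated shot frame is a valid variation, and one of them is drawn at random per training step. The assumption that the shot frame sits at the centre index of the extracted sequence follows the extraction described in Section 4.1.

```python
import random

def shifted_windows(frames, n_frames=3):
    """All windows of length n_frames that contain the shot frame,
    assumed to sit at the centre of `frames`."""
    shot_idx = len(frames) // 2
    windows = []
    for start in range(shot_idx - n_frames + 1, shot_idx + 1):
        if start >= 0 and start + n_frames <= len(frames):
            windows.append(frames[start:start + n_frames])
    return windows

def random_shifted_sample(frames, n_frames=3):
    # One randomly shifted variation per training step.
    return random.choice(shifted_windows(frames, n_frames))

frames = list(range(5))         # toy sequence, shot frame at index 2
print(shifted_windows(frames))  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```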

4.4 Input resolution normalisation

The architectures in the literature (see Section 2.2) are applied to datasets with much lower resolution. The UCF-101 [53], for example, has a fixed resolution of 320 x 240, while the video data from Signality is in 4K (3840 x 2160). The challenge of choosing an appropriate resolution manifests itself in the fact that a lot of essential information is lost through conventional resizing: the information about the ball or a player's shot motion might be completely lost at a low resolution. As a compromise, a combination of down-sampling through convolution and dimensionality reduction through classical image resizing has been used. Instead of using 4K images as input, all input has been resized to 1024 x 576 for all experiments. This yields a resolution approximately 8 times larger than the input used in the literature. Further down-sampling is achieved by image convolution within the neural networks.
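A minimal sketch of the resizing step using OpenCV. The target of 1024 x 576 preserves the 16:9 aspect ratio of the 4K source; the choice of interpolation method is an assumption.

```python
import cv2

def normalise_resolution(frame, size=(1024, 576)):
    """Resize a 3840 x 2160 frame to 1024 x 576 (width, height)."""
    # INTER_AREA is a common choice when shrinking images.
    return cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
```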

4.5 Cropping

Feeding the whole image into the neural network has the drawback that a lot of useless information is processed. Useless information in this context includes the spectators, the referees and the grass around the actual playing field, amongst other things. None of these regions in the image correlate with whether a shot is taking place. Therefore, cropping the image to interesting regions not only reduces the dimensionality of the problem, but most likely also increases the speed of convergence of the neural networks, as the cropping explicitly tells the network which regions are of semantic interest. A similar approach has been taken by Karpathy et al. [27]: in the context of action detection, they use a center crop to reduce dimensionality and observe a 2-4 times faster runtime, while retaining the classification accuracy.

In the context of a football game, the semantic information of a shot is dense around the penalty box, since every shot on goal will pass through it. The cropping strategy used for this thesis therefore extracts the region around the penalty box. The crop is defined by four corner points that make up the boundary, $x_{min}$, $x_{max}$, $y_{min}$ and $y_{max}$, as seen in Figure 4.6. The four corner points are defined by adding or subtracting a margin from four anchor points: the lower left corner, the upper corner, the upper penalty box corner and the lower penalty box corner. The anchor points have been extracted from the game calibration data. $x_{min}$ is defined as the position of the lower left corner plus a margin. $x_{max}$ is defined as the upper corner plus a margin. $y_{min}$ is the y value of the upper corner of the penalty box minus a margin. $y_{max}$ is defined as the y value of the lower corner of the penalty box plus a margin. This method further reduces the amount of data being fed to the network by approximately 60%. The area has been chosen so that the region in front of the goal is visible and shots from outside of the penalty box can still be seen.


Figure 4.6: Example showing an uncropped image. The margins for the x values and the y values have been chosen separately. The black parts on the top and on the left are pre-cropped by Signality. The arrows show how the margins are applied and point from the anchor points to the corner points.

From all the different stadiums, the maximum $x_{max}$ and the maximum $y_{max}$ have been extracted. If a crop does not fill the whole area, the remaining parts are filled with black. The resulting crop can be seen in Figure 4.7. The cropping area has been verified on multiple stadiums by ensuring that the area around the penalty box is within the crop for all games. This check is necessary because different stadiums have different camera angles. The resulting resolution after cropping was 693 x 360 pixels, the minimum dimensions required to fit all cropped images.

Figure 4.7: Example showing a cropped image. The actual crop has been put in the middle of a black background, because different stadiums have different crop sizes. This method yields the same resolution for every stadium, because the maximum width and height over all stadiums have been used as reference.
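A sketch of the cropping and padding logic described above. The anchor-point data structure and the margin values are illustrative assumptions, since the thesis does not state the exact margins.

```python
import numpy as np

def crop_penalty_area(frame, anchors, margin_x=50, margin_y=30,
                      out_hw=(360, 693)):
    """Crop around the penalty box and centre the crop on a black
    canvas of fixed size (the maximum crop over all stadiums).

    anchors: dict of pixel coordinates of the four anchor points taken
    from the game calibration data.
    """
    # Margins applied to the anchor points as described in the text.
    x_min = anchors['lower_left_corner'][0] + margin_x
    x_max = anchors['upper_corner'][0] + margin_x
    y_min = anchors['penalty_box_upper'][1] - margin_y
    y_max = anchors['penalty_box_lower'][1] + margin_y

    crop = frame[y_min:y_max, x_min:x_max]

    # Centre the crop on a black background so that every stadium
    # yields the same output resolution (693 x 360 in the thesis).
    canvas = np.zeros((out_hw[0], out_hw[1], 3), dtype=frame.dtype)
    oy = (out_hw[0] - crop.shape[0]) // 2
    ox = (out_hw[1] - crop.shape[1]) // 2
    canvas[oy:oy + crop.shape[0], ox:ox + crop.shape[1]] = crop
    return canvas
```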


4.6 Framework

The computational power for the training of the networks has been provided by Signality in the form of the Google Cloud Compute Engine, using a NVIDIA Tesla K80 graphics card. The neural networks have been built using the deep learning framework Keras 2.2.4 (www.keras.io/), which itself is built on top of TensorFlow (www.tensorflow.org/), and have been trained using the Keras fit_generator functionality. This allowed the data to be loaded step by step, without having to load the full dataset of over 300 GB into memory.
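A sketch of the lazy-loading setup using a Keras Sequence together with fit_generator. The assumption that clips are stored as .npy files and the batch layout are illustrative.

```python
import numpy as np
from keras.utils import Sequence

class ClipSequence(Sequence):
    """Loads clips batch by batch, so the full dataset never has to
    reside in memory at once."""

    def __init__(self, clip_paths, labels, batch_size=16):
        self.clip_paths, self.labels = clip_paths, labels
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.clip_paths) / self.batch_size))

    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # Clips stored as .npy arrays on disk (an assumption).
        x = np.array([np.load(p) for p in self.clip_paths[s]])
        y = np.array(self.labels[s])
        return x, y

# model.fit_generator(ClipSequence(train_paths, train_labels),
#                     validation_data=ClipSequence(val_paths, val_labels),
#                     epochs=20)
```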

4.7 Model 1 - Baseline Method

Applying a baseline method to the problem of shot detection had the purpose of determining the feasibility of the problem and producing results against which more advanced methodologies can be compared. The evaluation and acquisition of knowledge about different sampling methods, such as the sampling rate, cropping the image and the number of frames, should give insights into the challenges of shot detection. The architecture follows the late fusion approach by Karpathy et al. [27], where multiple frames are run through a convolution-pooling stage (see Table 4.1 for details). This stage extracts abstract features from the original input RGB image.

Table 4.1: The layers of the feature extraction part of the baseline model. The first four convolution and max-pooling layer pairs follow Karpathy et al. [27]. The last convolution and max-pooling stage has been added to further reduce the resolution.

Layer     Kernel / Pool size   Stride   Units   Activation
Conv2D    (11,11)              (3,3)    64      relu
MaxPool   (2,2)                (2,2)    -       -
Conv2D    (5,5)                (1,1)    128     relu
MaxPool   (2,2)                (2,2)    -       -
Conv2D    (3,3)                (1,1)    196     relu
MaxPool   (2,2)                (2,2)    -       -
Conv2D    (3,3)                (1,1)    256     relu
MaxPool   (2,2)                (2,2)    -       -
Conv2D    (3,3)                (1,1)    256     relu
MaxPool   (2,2)                (2,2)    -       -

The features of all the frames are fused through concatenation of the feature maps. On top of the concatenation, multiple dense layers try to learn the global motion characteristics. As mentioned in Section 4.4, the resolution of the video used for this thesis is more than 10 times larger than the resolution used in the literature. Therefore, slight changes to the architecture have been made. The first four convolutional layers have been copied from Karpathy et al. [27], except that the numbers of filters have been reduced from (96, 256, 384, 384) to (64, 128, 196, 256). After the first convolutional layer, an additional pooling layer has been added to further reduce the dimensions.

To account for the increased input resolution, an additional layer has been added. The last convolution and pooling layers have the same number of units, kernel size and stride as the two layers before. This will further reduce the resolution by a factor of 4. After the last pooling layer, the features are flattened and then the flattened features of all the frames are concatenated together before being fed to the dense layers. Furthermore the dense layer size has been reduced from the original (4096,4096) to (128,128,64). This is because multiple

1 www.keras.io/
2 www.tensorflow.org/


This is because multiple frames generate too many parameters when applying a dense layer with 4096 units after the merging of the feature extraction part, since each pixel of the output of the feature extraction stage will have a connection with each of the units in the dense layer. The complete architecture can be seen in Figure 4.8. All layers have been trained from scratch.

[Figure: frame t-1, frame t, frame t+1 → shared feature extraction → Concatenate → Dense (128, relu) → Dense (128, relu) → Dense (64, relu) → Dense (2, softmax)]

Figure 4.8: This figure shows the architecture of the baseline model when processing three frames. Each frame is first processed separately and then concatenated with the outputs of all other frames. The feature extraction parts of each frame share their weights. Three dense layers are then used to learn the global relationship between the frames.
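A minimal functional-API sketch of this late fusion architecture for three frames is given below, reusing the build_feature_extractor function from the sketch above so that all frames share one set of weights; shapes and layer options are assumptions where not stated.

```python
from keras.layers import Input, Flatten, Dense, concatenate
from keras.models import Model

n_frames = 3
frame_shape = (360, 693, 3)

# One extractor instance applied to every frame => shared weights.
extractor = build_feature_extractor(input_shape=frame_shape)

inputs = [Input(shape=frame_shape) for _ in range(n_frames)]
features = [Flatten()(extractor(x)) for x in inputs]
merged = concatenate(features)              # late fusion of per-frame features
x = Dense(128, activation='relu')(merged)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
output = Dense(2, activation='softmax')(x)  # shot / no-shot
model = Model(inputs=inputs, outputs=output)
```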

4.8 Model 2 - MobileNetV2 Feature Extraction

The second method used in this thesis is somewhat similar to the baseline method. But unlike the baseline method, the second approach uses MobileNetV2 [49], pretrained on ImageNet [48], to extract the visual features. The network is based on the work of Howard et al. [24]. According to Yosinski et al. [65], trained layers of neural networks can be transferred to a different context and still perform well. Therefore, the network and its weights obtained from the training on the ImageNet database are transferred and used in the context of football shot detection, saving training time. Chen et al. [4] have used the MobileNet as a feature extractor in their approach. They remove all conventional convolutional layers from the MobileNet and replace them with depthwise separable convolutions, which is exactly what the MobileNetV2 does as well. Furthermore, the MobileNet has been used as a feature extractor within the SkipNet decoder [51]. In contrast to the baseline method, the feature extraction parts seen in Figure 4.8 are substituted with the MobileNetV2. Hence, every frame runs directly through a MobileNetV2 before further operations are applied. The dense layers are also the same as in the baseline method. The MobileNetV2 has been used with an α-value of 1, and max-pooling was applied right on top of the MobileNetV2, before concatenating the features.


Table 4.2: The number of parameters for each network. All networks used have between one and four million parameters. The baseline method and the MobileNetV2 method share parameters in the feature extraction stage, but more frames lead to a longer vector after flattening. Hence, when increasing the number of frames, the parameter increase of these two methods is greater than that of the 3D method.

Frames   Baseline    MobileNetV2   3D
1        1,684,998   2,446,850     2,817,858
3        2,012,678   2,774,530     2,968,770
5        2,340,358   3,102,210     3,119,682

The number of parameters of the whole model with the dense layers can be seen in Table 4.2. All the layers in the MobileNetV2 were set to be trainable. Therefore, the network could adjust the pre-trained weights to the problem at hand.
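The following sketch illustrates this model for three frames. The pooling on top of the backbone is written here as a global max-pooling, which is an assumption chosen to match the per-frame feature length implied by the parameter counts in Table 4.2; Keras will warn about the non-square input and load the 224 x 224 ImageNet weights.

```python
from keras.applications.mobilenet_v2 import MobileNetV2
from keras.layers import Input, GlobalMaxPooling2D, Dense, concatenate
from keras.models import Model

n_frames = 3
frame_shape = (360, 693, 3)

# ImageNet-pretrained backbone, alpha = 1, kept fully trainable so the
# weights can adapt to football footage.
backbone = MobileNetV2(input_shape=frame_shape, alpha=1.0,
                       include_top=False, weights='imagenet')

inputs = [Input(shape=frame_shape) for _ in range(n_frames)]
# Max-pool on top of the backbone; sketched here as global max-pooling,
# giving 1280 features per frame (consistent with Table 4.2).
features = [GlobalMaxPooling2D()(backbone(x)) for x in inputs]
x = concatenate(features)
for units in (128, 128, 64):
    x = Dense(units, activation='relu')(x)
output = Dense(2, activation='softmax')(x)
model = Model(inputs=inputs, outputs=output)
```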

4.9 Model 3 - 3D Convolution

This approach intends to learn temporal information through 3D convolution, as explained in Section 3.4, rather than through flattening and concatenation of image features as has been done in the first two approaches. The advantage of having a temporal convolution when compared to feature extraction blocks is that the network can learn temporal relations much earlier. For this approach, Tran et al. [56] is used as a basis. They use 16 frames at a resolution of 171 x 128 as input and apply 3D convolution in all of their convolutional layers. In this thesis, only one, three or five frames with much higher resolution are used as input. Hence, the network had to be modified slightly to account for these differences in input. As mentioned by Tran et al. in their paper [56], merging the temporal domain very early is not beneficial for performance, because the temporal information is lost too early. Therefore, for the approach used in this thesis, the temporal information is pooled in the second rather than in the first 3D pooling stage.

In between the two 3D convolutions, a 3D pooling takes place with a pool size of (2,2,1). Hence, it does not pool in the temporal domain, but rather in the spatial domain. The second 3D pooling layer has a pool size of (2,2) in the spatial domain, and the size of its temporal domain corresponds to the number of frames used as input to the network. The spatial kernel size of the convolutional layers has been set to 3x3, as Tran et al. have found this size to perform the best. The temporal length of the 3D convolution layers has been set to the number of frames used as input. The number of filters has been left the same for the first two layers (64 and 128). However, the number of filters for all further convolutional layers has been reduced from 256 to 192.

After the pooling of the temporal domain, the network consists of four blocks of double convolution and single pooling. The number of blocks has been increased from three to four, to account for the increased resolution. These blocks exactly follow Tran et al.'s approach, with the exception of the reduced number of filters. Hence, all the pooling layers have a pooling size of 2x2. After the last pooling layer, the output dimension is 1 x 7 x 192. The same dense layers as in the baseline method in Section 4.7 have been used on top of the convolutional stage. The complete architecture can be seen in Table 4.3.


Table 4.3: The layers of the 3D convolution model. #frames denotes the number of frames used as input for a specific run. The convolutions are applied over the temporal domain in the first two convolutional layers. In the second pooling layer, the temporal dimension is pooled, reducing the dimensions from 3D to 2D. After the convolutional stage, the same dense head is used as in the other two methods.

Layer      Kernel / Pool size   Stride          Units   Activation
Conv3D     (3,3,#frames)        (3,3,1)         64      relu
MaxPool3D  (2,2,1)              (2,2,1)         -       -
Conv3D     (3,3,#frames)        (3,3,1)         128     relu
MaxPool3D  (3,3,#frames)        (2,2,#frames)   -       -
Conv2D     (3,3)                (1,1)           192     relu
Conv2D     (3,3)                (1,1)           192     relu
MaxPool    (2,2)                (2,2)           -       -
Conv2D     (3,3)                (1,1)           192     relu
Conv2D     (3,3)                (1,1)           192     relu
MaxPool    (2,2)                (2,2)           -       -
Conv2D     (3,3)                (1,1)           192     relu
Conv2D     (3,3)                (1,1)           192     relu
MaxPool    (2,2)                (2,2)           -       -
Conv2D     (3,3)                (1,1)           192     relu
Conv2D     (3,3)                (1,1)           192     relu
MaxPool    (2,2)                (2,2)           -       -
Dense      -                    -               128     relu
Dense      -                    -               128     relu
Dense      -                    -               64      relu
Dense      -                    -               2       softmax
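A sketch of this architecture is given below. The 'same' padding of the convolution and pooling layers and the (rows, cols, time, channels) input layout are assumptions, chosen so that the temporal dimension survives until the second pooling stage; with these choices the final feature map differs slightly from the reported 1 x 7 x 192.

```python
from keras import backend as K
from keras.layers import (Input, Conv3D, MaxPooling3D, Lambda,
                          Conv2D, MaxPooling2D, Flatten, Dense)
from keras.models import Model

n_frames = 3
inp = Input(shape=(360, 693, n_frames, 3))  # (rows, cols, time, channels)

# Two 3D convolutions; the temporal kernel spans all input frames.
x = Conv3D(64, (3, 3, n_frames), strides=(3, 3, 1), padding='same',
           activation='relu')(inp)
x = MaxPooling3D(pool_size=(2, 2, 1))(x)                 # spatial-only pooling
x = Conv3D(128, (3, 3, n_frames), strides=(3, 3, 1), padding='same',
           activation='relu')(x)
x = MaxPooling3D(pool_size=(3, 3, n_frames),
                 strides=(2, 2, n_frames))(x)            # collapses time to 1
x = Lambda(lambda t: K.squeeze(t, axis=3))(x)            # 3D -> 2D feature maps

# Four double-convolution blocks with 192 filters each, then the dense head.
for _ in range(4):
    x = Conv2D(192, (3, 3), padding='same', activation='relu')(x)
    x = Conv2D(192, (3, 3), padding='same', activation='relu')(x)
    x = MaxPooling2D((2, 2), padding='same')(x)
x = Flatten()(x)
for units in (128, 128, 64):
    x = Dense(units, activation='relu')(x)
out = Dense(2, activation='softmax')(x)
model = Model(inp, out)
```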

4.10 Network training

The networks are trained on the training set. Early stopping is applied when the loss on the validation set does not improve over eight epochs. To be able to compare sampling methods, the following different sampling configurations are used for network training:

Sampling rate 800ms, 1200ms and 1600ms. The sampling rate is particularly interesting in the context of Figure 4.2. Using only 400ms on each side of the annotated shot yields a shorter total temporal scope, meaning that it becomes more likely that the shot is not within the input fed into the network (see the sketch after this list for the sampling arithmetic).

Resolution The image resolution was fixed to 1024 x 576. As mentioned in Section 4.4, using a smaller resolution is a trade-off between losing information and having too much data. Greater resolutions have been tried, but resulted in excessive GPU memory consumption.

Cropping Cropping was used for all experiments, yielding an input resolution of 693 x 360 when cropping from the original 1024 x 576 image. See Section 4.5.

Augmentation Temporal shifting was used for the training of all networks except the single frame configurations. It increases the generalisability of the networks, as more training data can be generated. See Section 4.3. The single frame configurations do not use shifting because the annotated shot should always be within the sampled time window. Due to time constraints, the augmentation has not been performed spatially.

Frames Due to GPU memory limitations and time constraints, especially when running MobileNetV2, the number of frames was limited to 1, 3, and 5.


The 3D convolution method using a single frame is still called 3D convolution in this thesis due to the usage of the Keras 3D convolutional layer. However, using only one frame leads to convolutions only in space and not in time (2D instead of 3D).

Color All networks have been trained using three-channel RGB images.

Batch Size All network configurations have been trained using a batch size of five, except the 5-frame configurations of the MobileNetV2 and the 3D convolution, which have been trained with a batch size of four due to memory limitations on the GPU.
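The sampling arithmetic referred to above can be sketched as follows, under the assumption that the sampled frames are spread evenly over a window of one sampling rate centred on the annotated shot (so an 800ms rate gives 400ms on each side); the function name and the shift argument are illustrative.

```python
def sample_timestamps(shot_ms, n_frames, rate_ms, shift_ms=0):
    """Return the timestamps (ms) of the frames fed to the network,
    spread over a window of rate_ms centred on the annotated shot and
    optionally shifted in time for augmentation."""
    centre = shot_ms + shift_ms
    if n_frames == 1:
        return [centre]
    half = rate_ms / 2
    step = rate_ms / (n_frames - 1)
    return [centre - half + i * step for i in range(n_frames)]

# Three frames at an 800ms sampling rate around a shot at t = 10 000 ms:
print(sample_timestamps(10000, 3, 800))  # [9600.0, 10000.0, 10400.0]
```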

Out of the above-mentioned configurations, all combinations have been tried out except the combination of 5 frames and a 1600ms sampling rate. This is because the temporal shifting would sample a frame outside of the extracted time window. Hence, six different configurations per network architecture have been evaluated: one configuration for a single frame, three different configurations for three frames and two configurations for five frames. SGD was used as an optimizer with learning rate 0.01 for training the baseline and the 3D convolution network. For the training of the MobileNetV2, ADAM has been used with a learning rate of 0.0001, β1 = 0.9, β2 = 0.999 and no decay. The learning rates have been chosen through experimentation; the β values and decay have been left at their defaults. Two different optimizers have been chosen because the SGD optimizer did not find a signal when being used for the training of the MobileNetV2, while the ADAM optimizer yielded worse results than SGD when applied to the baseline and the 3D convolution training. Moreover, the ADAM optimizer has been sensitive to parameter changes, and it has been a difficult task to find a working set of optimizer parameters for the MobileNetV2 training. Batch normalisation layers have not been used for the baseline and the 3D convolution methods because, as mentioned in [25], using small batch sizes can lead to very noisy estimates of the mean and variance, hence not reducing the covariate shift. Due to time constraints, regularization has not been used for training.
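The optimizer and early stopping setup described above can be written down as the following sketch; the loss function is an assumption inferred from the two-unit softmax output layer.

```python
from keras.optimizers import SGD, Adam
from keras.callbacks import EarlyStopping

# Baseline and 3D convolution networks: plain SGD.
sgd = SGD(lr=0.01)

# MobileNetV2: ADAM with the stated learning rate, default betas, no decay.
adam = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.0)

# Stop training once the validation loss has not improved for eight epochs.
early_stopping = EarlyStopping(monitor='val_loss', patience=8)

# Usage (the categorical cross-entropy loss is an assumption):
# model.compile(optimizer=sgd, loss='categorical_crossentropy',
#               metrics=['accuracy'])
# model.fit_generator(..., callbacks=[early_stopping])
```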


5 Experiments

This chapter presents how the trained networks have been evaluated. To be able to compare the methods mentioned in Chapter 4, two distinct experiments were undertaken in this thesis to evaluate the performance. First, the methods are used to predict shot instances on an unseen test dataset with and without hard negatives. Second, the methods which performed best in the aforementioned experiment were run on multiple unseen football game video streams to determine the performance in a real-life scenario. In this chapter, these two experiments as well as their results will be presented in more detail. The training and inference times for each method and configuration can be seen in Table 5.1.

5.1 Results on the Initial Dataset

The first experiment consisted of making predictions on the test data. The predictions are made with the three methods described in Chapter 4, run in the configurations mentioned in Section 4.10. As seen in Figure 4.4, both the initial and the full dataset have been split into training, validation and test sets.

Table 5.1: This table shows the total training time for each network, as well as the training and inference time for one sample. The total training times depend not only on the network architecture, but also on the dataset used; the full dataset has 50% more data than the initial dataset. Early stopping usually took place after fewer epochs for the MobileNetV2 because, thanks to the pre-trained weights, its initialization is better than that of the other two methods.

Method        Frames   Total training time   Training time per sample   Inference time per sample
Baseline      1        0.5h-1h               0.02s                      0.01s
Baseline      3        2h-5h                 0.07s                      0.06s
Baseline      5        3h-7h                 0.11s                      0.07s
MobileNetV2   1        1h-2h                 0.22s                      0.05s
MobileNetV2   3        3h-8h                 0.65s                      0.08s
MobileNetV2   5        5h-13h                1.12s                      0.15s
3D Conv       1        1h-3h                 0.15s                      0.04s
3D Conv       3        4h-9h                 0.50s                      0.12s
3D Conv       5        8h-16h                1.03s                      0.25s
