
UPTEC STS 17033

Degree project (Examensarbete), 30 credits

November 2017

A deep learning approach for action classification in American football video sequences


Abstract

A deep learning approach for action classification in American football video sequences

Jacob Westerberg

Artificial intelligence is a constant topic of conversation, with a field of research pushed forward by some of the world's largest companies and universities. Deep learning is a branch of machine learning within artificial intelligence based on learning representations of data, such as images and text, by processing the data through deep neural networks. Sports are competitive businesses that have become more data driven over the years. Statistics play a big role in the development of practitioners and of the tactics used to win. Sports organizations maintain large statistics teams, since statistics are still obtained manually. Teaching a machine to recognize patterns and actions with deep learning could therefore save a lot of time. In this thesis, a deep learning approach is examined on the task of classifying the actions pass and run in American football games. A deep learning architecture is first developed and trained on a public video dataset, and then trained to classify run and pass plays on a new American football dataset called the All-22 dataset. The results, together with earlier research, show that deep learning has the potential to automate sports statistics but is not yet ready to take over the role of statistics teams. Further research, bigger and more task-specific datasets, and more complex architectures are required to enhance the performance of this type of deep learning based video recognition.


Popular Science Summary

As far back as ancient Greece, humans have fantasized about machines that can think. Today, artificial intelligence (AI) is a large field of research with many subbranches and applications. AI has the potential to automate tasks; to recognize and understand speech, images and text; and to make diagnoses in healthcare. These machines have the potential to solve problems that humans find difficult, such as heavy mathematical computations. The challenge for AI is instead to solve problems that humans solve easily and automatically but which, precisely because of that automaticity, are harder to describe, such as recognizing spoken words or identifying objects in images.

Deep learning is a branch of machine learning within AI that uses neural networks to model abstractions in data. These neural networks consist of many layers with complex structures built from linear and non-linear transformations. Deep learning has been used successfully in, among other things, speech and image recognition.

Success, tactics and signings in the world of sports have long been associated with gut feeling and following one's instincts. But as technology has evolved, many sports have become more data driven, especially among professional organizations and betting companies. Statistics from sporting events have been given a larger role in, among other things, player evaluations and odds setting, but producing these statistics takes time. Professional teams employ several video analysts who extract statistics, such as passes and shots, from games for later analysis. This is a time-consuming process, and this is where deep learning enters the picture. Given its success in image recognition, it is tempting to investigate what deep learning can accomplish on large datasets of sports games. By training neural networks to recognize game patterns and events, it could be possible to automate parts of the production of sports statistics such as passes, shots and runs.

This study has investigated the possibility of identifying play events in American football. A deep learning model was developed and trained on a dataset of game clips with the aim of identifying two common offensive events. To evaluate the model, it was tested on clips it had not seen before, identifying the events in those clips.


Contents

1 Introduction
  1.1 Thesis Purpose
    1.1.1 Research Question
    1.1.2 Delimitations
2 Deep Learning
  2.1 Models
    2.1.1 Feedforward Neural Network
    2.1.2 Convolutional Neural Network
    2.1.3 Recurrent Neural Network
    2.1.4 Long Short-Term Memory
  2.2 Training
    2.2.1 Gradient Based Learning
    2.2.2 Transfer Learning
    2.2.3 Dropout
  2.3 Thesis Adoption
3 Datasets
  3.1 UCF101
  3.2 All-22 Action Dataset
  3.3 ImageNet
4 Implementation
  4.1 Proposed Architecture
  4.2 Data Preprocessing
  4.3 Optimization
5 Results
  5.1 Performance Measure
  5.2 Results
6 Related Work
  6.1 Comparisons
7 Conclusion
  7.1 Future Work
References


1 Introduction

For a long time, humans have fantasized about thinking machines. Evidence of this dates all the way back to ancient Greece, where several mythical creatures may be regarded as artificial life [1]. Nowadays, artificial intelligence (AI) is a field with a great deal of research and many applications. AI solutions can automate tasks, recognize and understand speech, images and text, and make diagnoses in medicine. In its early days, however, the field focused on solving problems that are hard for us humans but easy for computers, such as heavy mathematical computations. The real challenge for AI was not these computations, but tasks that are easy for humans to perform yet hard for us to describe formally: problems that are solved intuitively, for example recognizing spoken words or identifying figures such as animals in an image.

Deep learning is a branch of machine learning that allows computational models consisting of several processing layers to learn representations of data with several levels of abstraction. It is the many stacked layers that give the technique its name. Deep learning can, for example, be used in speech recognition or to identify objects in images. In short, deep learning detects complicated structures in big datasets by using a back-propagation algorithm. It is this algorithm that tells the machine how it should change its internal parameters, also known as weights. These parameters are used to compute the representation in each layer based on the representation in the previous layer.

Given this success in image recognition, and the potential for video footage of sports games to operate as large datasets, it is alluring to explore the possibilities of automating the production of sports statistics and analytics. By recognizing game patterns and actions, it could be possible to, for example, instantly count actions such as passes, shots, runs and saves in a sports game.

1.1 Thesis Purpose

In this thesis, a deep learning approach will be used for action recognition in American football. The actions pass and run will be classified from a specific lateral camera angle called All-22 [5]. The approach, in the form of a model, will take a sequence of images as input and return the probabilities of different action types.

1.1.1 Research Question

With the thesis purpose in mind, the following research question can be formulated:

• Can a deep learning approach be used to recognize the actions pass and run in American football?

In order to investigate this, a deep learning model will be trained on a dataset and used to predict video sequences from American football games.

1.1.2 Delimitations

In American football, the game is built up of plays. A play is a sequence that is initialized when the ball is snapped (thrown from the line of scrimmage to the quarterback) and terminated when the ball is out of bounds or on the ground, or when the ball carrier is out of bounds or on the ground. During this time, the team in possession of the ball performs an action such as a pass or a run. For the data of this thesis, the run or pass action is annotated for a full play sequence. This means that a video labeled "Pass" is a video of a full play where the ball is snapped, the pass action is performed and the play is terminated. All videos contain either a pass or a run action.

Since the outcome of a pass may vary, only completed passes, i.e., plays where the quarterback passes to a teammate who successfully catches the ball, will be targeted for recognition.

Figure 1: Top: Image description of how a running play is performed by a running back (RB). The arrow from the RB indicates a typical running route. The X indicates the position from where the ball is snapped. Bottom: A pass play, where the QB (quarterback) throws the ball to his wide receivers (WR) instead of running with it. The arrows indicate typical routes the WRs take to make themselves eligible to receive the ball.

2 Deep Learning

The aim of this chapter is to provide a brief understanding of the field of deep learning. Deep learning is described together with the relevant underlying models of neural networks (NN) used in the field.

Deep learning is well suited to the aim of this thesis. The most common form of machine learning is a technique called supervised learning, which is the underlying technique of this thesis. In supervised learning, the data is labeled. For example, when training an algorithm to classify objects in images, the algorithm is presented with the image and what it represents. It produces an output vector with as many elements as there are labels, and the goal is for the element corresponding to the desired label to get the highest predicted value [6].

Nearly every deep learning algorithm contains the same set of tools: a dataset, a cost function, an optimization procedure and a model. The latter is explained more thoroughly in the next section. The dataset usually comes as sets of information, one that the algorithm trains on and one to validate its learning outcomes on. The bigger the dataset, the more training data is available. The aim of many deep learning algorithms is optimization. By optimization, one usually refers to either minimizing or maximizing a function. When minimizing is the goal, the function to be minimized is called the cost function.

2.1 Models

Figure 2: Top: A single-layer perceptron, where the inputs are multiplied with the weights and summed before the activation. Bottom: A feedforward network with hidden layers between the input and output layers.

2.1.1 Feedforward Neural Network

Feedforward neural networks (FNN), or deep feedforward networks, are seen as the quintessential architecture when it comes to deep learning models. Their fundamental purpose is to approximate some function f*. Say this function is a classifier, y = f*(x), that finds a category y for the input x. The task of the FNN is to define a mapping y = f(x; θ) and learn the value of the parameters θ in a way that results in the best possible approximation of this classifier [1]. FNNs are the foundation of deep learning practice, since they are the basic components of several more advanced deep learning models. These types of networks are somewhat inspired by neuroscience and neurons, hence the name neural networks [1]. FNNs are called networks since they are represented by several functions stacked on each other. The number of functions, or layers, stacked together is what gives the depth of the model, and it is from this that "deep learning" originates. For example, three functions, f1, f2 and f3, might be connected into a chain that forms f(x) = f3(f2(f1(x))). In this example, f1 is the first layer of the network, f2 the second, and so on, where input is processed and passed on to the next layer. The last layer in a model is often referred to as the output layer.

The learning algorithm decides how to utilize the layers to produce as accurate an approximation of f* as possible. Between the input layer and the output layer of an FNN, there can be more layers, often referred to as hidden layers; see the bottom figure of Figure 2 [1]. Every hidden layer is commonly vector-valued, and the dimension of the layers determines the width of the model, where each element in the vectors is a neuron (sometimes referred to as a node), as illustrated in Figure 2. A neuron in an NN layer receives, processes and passes on input to other neurons, illustrated in the top of Figure 2. The value $x_i^2$ that a specific neuron $i$ in hidden layer $f^2$ passes on to the neurons in $f^3$ can be described as

$$x_i^2 = \phi\Big(\sum_{j=1}^{n_1} w_{ij}^1 x_j^1\Big),$$

where $n_1$ indicates the last neuron of the previous hidden layer $f^1$, $w_{ij}^1$ is the weight, $x_j^1$ the value of node $j$ in the previous hidden layer, and $\phi$ the activation function.
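In code, this layer computation is a matrix-vector product followed by the activation. The sketch below (the shapes and the tanh activation are arbitrary illustrative choices, not from the thesis) also shows the chain f(x) = f3(f2(f1(x))):

```python
import numpy as np

def layer(x, W, phi):
    # x_i = phi(sum_j W[i, j] * x_j): one matrix-vector product,
    # followed by the activation function phi.
    return phi(W @ x)

# Stacking three layers gives the composition f(x) = f3(f2(f1(x))).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # first hidden layer: 4 inputs -> 8 neurons
W2 = rng.normal(size=(8, 8))   # second hidden layer
W3 = rng.normal(size=(2, 8))   # output layer with 2 neurons
x = rng.normal(size=4)
out = layer(layer(layer(x, W1, np.tanh), W2, np.tanh), W3, np.tanh)
```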


An activation function is applied to this weighted sum in order to obtain non-linearity; this allows the neural network to learn more complex functions than a linear regression. Most neural networks use an activation function as a non-linear function to describe output features. The most common activation function for FNNs is the rectified linear activation function. When used by a unit in a neural network, it is referred to as a rectified linear unit (ReLU). This function yields a non-linear transformation when applied to the output of a linear one. A rectifier is an activation function that can be defined as

$$g(x) = \max(0, x),$$

where $x$ represents the input to a neuron. In other words, the function is thresholded at a value of zero, as illustrated in Figure 3. There are several advantages with ReLU: it converges faster compared to other activation functions, and it is computationally cheap to implement. The downside with ReLU is that it can be sensitive during training, causing a neuron to become inactive in the network; but if the learning rate is not too high, this becomes less frequent [1].

When neural networks are used for classification, the softmax function is a common activation function to use in the output layer. It can be viewed as

$$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}.$$


Figure 3: The rectified linear activation function (ReLU). The function is thresholded at zero.
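As a small illustration (not part of the thesis), both activation functions fit in a few lines of NumPy:

```python
import numpy as np

def relu(x):
    # Rectified linear activation: thresholds the input at zero.
    return np.maximum(0.0, x)

def softmax(x):
    # Subtracting the maximum is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))     # [2.  0.  0.5]
print(softmax(logits))  # probabilities that sum to 1
```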

2.1.2 Convolutional Neural Network

A convolutional neural network (CNN), first introduced in 1989 [7], is a type of neural network that specializes in processing data with a known, grid-like topology. For example, a CNN can process images, which are essentially 2D arrays containing the pixel intensities. Many data modalities are arranged in multiple arrays: signals and sequences, for example, are 1D arrays, and videos or volumetric images are made up of 3D arrays. What distinguishes a CNN from an ordinary NN is that a CNN uses convolutions, a type of linear operation, instead of general matrix multiplication [6].

A typical CNN is composed of different stages. The first stages are built up by two types of layers: the convolutional layer and the pooling layer. In convolutional layers, nodes are organized in feature maps. Feature maps are produced by local receptive fields; these fields traverse across an image and create the feature maps [8], as shown in Figure 4. Within a feature map, each node is connected to local patches in the feature maps of the layers before it by a set of weights referred to as a filter bank [6]. The result of this locally weighted sum is then passed through a non-linear activation function, such as the ReLU. This stage is called the detector stage. All nodes in a feature map share the same filter bank, while different feature maps in a layer use different filter banks. There are two reasons for this: Firstly, in 2D arrays such as images, local groups of values often correlate and form local distinctive motifs. Secondly, local motifs such as edges and curves are invariant to location, which means they can appear anywhere in the image. For this, the CNN utilizes shared weights: sharing the same weights at different locations makes it possible to detect similar patterns in different parts of the image array [6].

The task of the pooling layer is to merge features that are semantically similar into one. A pooling node usually computes the maximum of a local patch of nodes in a feature map; this is called max pooling [9]. In the pooling layers, sub-sampling takes place, which according to Lecun et al. [8] is an advantage of using CNNs for image processing. Sub-sampling reduces the size of the feature maps, which in turn reduces the number of parameters, but also the importance of the exact position of a feature in the input.

In the last stages of a CNN, fully-connected layers are often used. A fully-connected layer has connections to all the activations of the previous layer. They are there to let the network learn a function of the previously learned visual features [1].

Figure 4: An example of a CNN where a local receptive field with the size 2×2 traverses the input to create feature maps.
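To make these stages concrete, a minimal CNN of this form can be sketched in Keras, the library used for the implementation in Chapter 4. The input shape, layer sizes and class count below are illustrative assumptions, not the thesis model:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolution + ReLU (the detector stage), max pooling (sub-sampling),
# then fully-connected layers and a softmax output for classification.
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```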

2.1.3 Recurrent Neural Network

A recurrent neural network (RNN) extends the NN with recurrent edges that span adjacent time steps, which means a notion of time is brought to the model [11]. These recurrent edges can form cycles, self-connections from a node to itself across time, as seen in Figure 5. At time t, a node with a recurrent edge receives input from the current data point at that time step, but also from the values of a hidden node from the network's previous state, which is illustrated in Figure 5. A simple set of equations for how an RNN evolves over time can be described as

$$\hat{y}^t = f(h^t; w),$$
$$h^t = g(h^{t-1}, x^t; w),$$

where $\hat{y}^t$ is the output of the RNN at time $t$, $x^t$ represents the input and $h^t$ is the state of the hidden layer; $w$ represents the weights of the network. The first equation indicates that the output at $t$ depends on the hidden layer, and the second says that the hidden layer at time $t$ depends on the hidden layer at time $t-1$ and the input at time $t$. Thus, an RNN allows past computations to influence present computations.
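A minimal sketch of these two equations follows; note that the text leaves f and g abstract, so the concrete choice of tanh for g and an affine map for f are assumptions made here for illustration:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # h_t = g(h_{t-1}, x_t; w): the new hidden state depends on the
    # previous state and the current input. tanh is one common choice.
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def rnn_output(h_t, W_y, c):
    # y_hat_t = f(h_t; w): the output at time t, read off the hidden state.
    return W_y @ h_t + c

# The same weights (W_h, W_x, W_y) are reused at every time step,
# which is the parameter sharing discussed in the next paragraph.
```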

RNNs have the ability to scale to much longer sequences than networks without this sequence-based architecture. The basic idea of RNNs is to share parameters across several different parts of the model, which makes it possible to expose the model to different forms and lengths of input and generalize across them. With individual parameters for every time value, it would not be possible to generalize to sequence lengths that have not been encountered during training of the network, nor would it be possible to share statistical strength across different sequence lengths and time positions. This ability comes in handy when features can occur at multiple positions within a sequence [1].


Figure 5: A basic RNN with one output. It simply processes the input we choose to call x, incorporates it into a hidden node we name h, and passes it on through time. The right part is the same network unfolded, where each network is associated with one time instance and also receives input from the hidden node of the previous time instance.

2.1.4 Long Short-Term Memory

Even though the main purpose of an RNN is to be able to learn long-term dependencies, it has been shown to be problematic to learn to store this kind of information over longer periods. In order to address this problem, the idea arose to enhance the RNN with an explicit memory. And so, the Long Short-Term Memory (LSTM) was introduced in 1997 and immediately proved better at learning long-term dependencies [12]. The model has been very successful in applications such as handwriting and speech recognition, but also image captioning. Like an RNN, the LSTM handles sequential input, but in distinction to the RNN it has more parameters and a gating system with units that control the information flow, which the RNN does not have. The LSTM also has internal recurrence, whereas the RNN only has outer recurrence, noticeable if one compares Figure 5 and Figure 6. In an LSTM, the hidden layer of the RNN is replaced by a memory cell. This cell is built from simpler nodes in a specific arrangement: it has an input node, an external input gate, an internal state, a forget gate and an output gate, as seen in Figure 6.

The internal state unit $s_i^t$, where $t$ is the time step and $i$ the cell, is a component that has a linear self-loop, which gives the LSTM model its internal recurrence. This loop has a weight that is controlled by a forget gate $f_i^t$, which sets the weight to a value between 0 and 1 [1]:

$$f_i^t = \sigma\Big(b_i^f + \sum_j U_{i,j}^f x_j^t + \sum_j W_{i,j}^f y_j^{t-1}\Big),$$

where $x^t$ is the input and $y^t$ is the current hidden layer vector that contains the outputs of all LSTM cells. $b^f$, $U^f$ and $W^f$ are the biases, input weights and recurrent weights, respectively, of the forget gates, and $i$ and $j$ index cells. The internal LSTM cell state is then updated with the self-loop weight $f_i^t$ from the forget gate:

$$s_i^t = f_i^t s_i^{t-1} + g_i^t \sigma\Big(b_i + \sum_j U_{i,j} x_j^t + \sum_j W_{i,j} y_j^{t-1}\Big).$$

The external input gate $g_i^t$ is computed with a sigmoid unit, similar to the forget gate:

$$g_i^t = \sigma\Big(b_i^g + \sum_j U_{i,j}^g x_j^t + \sum_j W_{i,j}^g y_j^{t-1}\Big).$$

The output $y_i^t$ of the LSTM cell can be shut off by the output gate $q_i^t$, which also uses a sigmoid for gating [1]:

$$y_i^t = \tanh(s_i^t)\, q_i^t, \qquad q_i^t = \sigma\Big(b_i^o + \sum_j U_{i,j}^o x_j^t + \sum_j W_{i,j}^o y_j^{t-1}\Big).$$
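Putting these equations together, a single LSTM time step can be sketched in NumPy as follows. The parameter dictionary p and its key names are assumptions made here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, s_prev, p):
    # p holds biases b*, input weights U* and recurrent weights W* for
    # the forget gate (f), external input gate (g), output gate (o)
    # and the cell input, mirroring the equations above.
    f = sigmoid(p["bf"] + p["Uf"] @ x_t + p["Wf"] @ y_prev)  # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x_t + p["Wg"] @ y_prev)  # input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x_t + p["Wo"] @ y_prev)  # output gate
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x_t + p["W"] @ y_prev)
    y = np.tanh(s) * q                                       # gated output
    return y, s
```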


Figure 6: A look inside the LSTM network cell. The network has input x and output y. The hidden unit of an RNN is replaced by a memory cell with more components. The input feature is computed by a regular neuron unit, and its value can be accumulated into the state if the input gate allows it. All gates have a sigmoid nonlinearity, denoted σ. The internal state unit has a linear self-loop with a weight controlled by the forget gate. Output from the cell can be extracted as the cell's output, used in the next time step, or both.

2.2 Training

During training, one measures the training error, which we want to be as small as possible, and the test error, which we also want to be as small as possible. What we also want is for the gap between these two measures to be small; a large disparity between them is called overfitting [1], an example of which is seen in Figure 10. During training, the model modifies its internal parameters, also known as weights, to reduce the error. In a standard deep learning system, there can be hundreds of millions of weights that define the input-output function of the system.

2.2.1 Gradient Based Learning

When a deep learning model is trained, the goal is to reduce the test error. In order to train an NN, we need a cost function. The training algorithm in deep learning is almost always based on methods that use a gradient to descend the cost function. This technique is referred to as gradient descent (GD) and means using the derivative of a function to follow the function downhill to a minimum. For a classification algorithm, it is common to use cross-entropy as the cost function for the training error [1]. Take the following example.

A classifier is trained to identify whether an image represents a cat, a dog or a frog. Suppose a softmax activation, introduced in Section 2.1.1, is used on the output neurons of the NN so that the outputs can be treated as probabilities. Suppose this yields the following output:

Computed values    Targeted values    Result
0.5, 0.3, 0.2      1, 0, 0 (cat)      Correct
0.1, 0.6, 0.3      0, 1, 0 (dog)      Correct
0.5, 0.4, 0.1      0, 0, 1 (frog)     Incorrect

This indicates that the NN has a classification error of 1/3 and is far from correct in classifying the frog. However, if it were closer on the frog, yet still incorrect, the classification error would remain 1/3. Now, cross-entropy is introduced:

$$H(y, \hat{y}) = \sum_i y_i \log \frac{1}{\hat{y}_i} = -\sum_i y_i \log \hat{y}_i,$$

where $y$ is the true value and $\hat{y}$ is the computed value. For the cat's computed values, we get

$$H = -(1 \cdot \log 0.5 + 0 \cdot \log 0.3 + 0 \cdot \log 0.2) = 0.69.$$

For all three classes, we get $-((\log 0.5 + \log 0.6 + \log 0.1)/3) = 1.16$ as the average cross-entropy error. If the computed value for the frog were slightly better, say 0.4, the error for the full NN would instead be 0.7, which is significantly smaller. If we used classification error only, it would still be 1/3, even though the frog prediction, while still incorrect, is better than in the first observation.
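The same computation can be written in a few lines of NumPy (an illustrative sketch; the numbers are the ones from the example above):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # H(y, y_hat) = -sum_i y_i * log(y_hat_i)
    return -np.sum(y_true * np.log(y_pred))

preds = np.array([[0.5, 0.3, 0.2],   # cat
                  [0.1, 0.6, 0.3],   # dog
                  [0.5, 0.4, 0.1]])  # frog
targets = np.eye(3)  # one-hot rows for cat, dog, frog
errors = [cross_entropy(t, p) for t, p in zip(targets, preds)]
print(np.mean(errors))  # ~1.16, matching the text
```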

NNs are usually trained using iterative, gradient-based optimization that drives a function, in this case a cost function, to a low value. Say we have a function w = f(z), and the derivative of this function is f′(z). The derivative gives us the slope of f(z) at the point z; this tells us how to change the input z in order to make a small improvement in w. When f′(z) = 0, z is called a critical point. A local minimum is a critical point where f(z) is lower than at the neighboring points; here it is no longer possible to decrease f(z) by taking small steps. However, this minimum might not be a global minimum, the absolute lowest value of f(z). This makes optimization difficult, since we want to reach the lowest minimum possible, and in cases where there are several minima, the global minimum can be hard to find. In deep learning, a local minimum is generally accepted even though it is not truly minimal, as long as it corresponds to significantly low values of the cost function [1].
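To make the descent concrete, here is a tiny sketch (illustrative, not from the thesis) of gradient descent on a one-dimensional function:

```python
def gradient_descent(grad_f, z0, lr=0.1, steps=100):
    # Repeatedly step against the derivative f'(z) to move downhill
    # toward a (possibly local) minimum.
    z = z0
    for _ in range(steps):
        z -= lr * grad_f(z)
    return z

# Example: f(z) = (z - 3)^2 has f'(z) = 2(z - 3) and its minimum at z = 3.
print(gradient_descent(lambda z: 2 * (z - 3), z0=0.0))  # ~3.0
```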

Several optimization algorithms adapt the learning rates of the model parameters, such as AdaGrad, RMSProp and Adam; Adam only requires first-order gradients and little memory [1].

When an NN receives input to produce an output, the data is fed forward through the network; this is called forward propagation. The back-propagation algorithm, on the other hand, allows information to flow back through the network in order to compute a gradient. Learning in deep neural networks requires computing gradients of complicated functions [1]. Back-propagation is used to compute the gradients of a cost function with respect to the NN's internal parameters, also known as weights. It is the back-propagation algorithm that tells the model how it should change these parameters. The computation is based on the chain rule and proceeds backward through the model with respect to the computations performed to compute the cost [1].

2.2.2 Transfer Learning

Transfer learning means that features learned on one dataset are used to improve generalization on some other dataset. For example, say we have learned a model of the visual categories of horses and cows, and then learn about another set, say cats and dogs. If we have sufficient data for the horses and cows, this could help learn representations or features useful for generalizing better on the cats and dogs. So, transfer learning exploits what is learned on one set of data to improve generalization on another, especially if the second set is smaller. In image processing this is doable because many visual categories share low-level notions such as edges, visual shapes and geometric changes. Transfer learning can thus be achieved via representation learning, a technique that discovers features with underlying attributes useful for more than detecting a single object. Instead of training on the set of cats and dogs from scratch, transfer learning exploits visual categories from the horse-and-cow set that may share low-level notions of edges and visual shapes with the cat-and-dog set.
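In Keras, the library used in Chapter 4, transfer learning of this kind can be sketched by loading ImageNet-pretrained VGG16 weights and training only a new classifier head. The two-class head and layer sizes below are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Reuse convolutional features learned on ImageNet and train only
# a new classifier head on the smaller target dataset.
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained low-level features

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
```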


2.2.3 Dropout

Dropout is a relatively new method [13] for regularizing models that is also computationally cheap. Given a model and a training set, dropout creates and trains an ensemble of several sub-networks of the full neural network; an example of this can be seen in Figure 7. These sub-networks are created by removing non-output units from the original neural network [1]. One main incentive to use dropout when training a network is to mitigate overfitting. Dropout has proven more effective than other regularization techniques such as weight decay, and can also be combined with other regularization methods for further improvements. Another advantage of dropout is that it is applicable to nearly all types of models, such as FNNs, RNNs and CNNs. Since dropout is a regularization technique, it reduces the effective capacity of the model; to compensate for this, the size of the model is often increased, and it is also common to increase the number of epochs [1].
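In Keras, dropout is simply a layer placed between other layers; a minimal sketch (the layer sizes are illustrative) follows:

```python
from tensorflow import keras
from tensorflow.keras import layers

# During each training step, Dropout zeroes a random half of the units,
# effectively sampling one sub-network of the full model.
model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(64,)),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),
])
```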

Figure 7: Example of a few of the sub-networks that dropout can create out of a simple network.

2.3 Thesis Adoption

With the success of CNNs on image classification and detection [14, 15, 16], it becomes highly relevant to implement a model that adopts what was presented in Section 2.1.2, since a video is a sequence of images.

Because a video is a sequence of images, information about what is happening travels through time. Therefore, it is important to add recurrent components to the model of this thesis. Such tools were introduced in Section 2.1.3 and Section 2.1.4. By adding them to the model, it is possible to take temporal information into consideration.

A more detailed description of how CNNs, RNNs and LSTMs were adapted in the thesis can be found in Chapter 4.

3 Datasets

The following data was used in order to provide training and validation sets for the architecture used in this thesis.

3.1 UCF101

UCF101 is a dataset of 101 human action classes from videos in the wild, collected from YouTube [17, 18].

3.2 All-22 Action Dataset

The All-22 action dataset consists of American football clips extracted from either the National Football League (NFL) or the college football league, NCAAF. The dataset was put together for the purpose of this thesis. The clips contain in-game footage of plays but are condensed so that replays, commentary and commercials are left out, since those sequences are not relevant for action recognition. This means the clips cover close to the effective play time of every game in the dataset (some noise may occur in the clips). Thanks to detailed play-by-play documentation of the games (descriptions of every play of a whole game) accessible online, it is possible to label the clips with their corresponding action classes, so every extracted video sequence gets a detailed description of what is happening in it. The current state of the dataset has two classes, pass play and run play. The resolution of the videos in this dataset is 640×320. Each video is 17 seconds long and contains a full play action, and there is a total of 2610 videos in the dataset.

3.3 ImageNet

ImageNet is an image dataset introduced in 2009 [19] with the purpose of providing good resources for sophisticated and robust algorithms. Its initial size was 3.2 million images, but it has grown to over 14 million, and it has been successful and widely used in publications during its existence [20]. ImageNet has also proven well suited for transfer learning [21], which will be used in this thesis. Apart from taking advantage of the dataset for transfer learning, no training will be performed with it.

4 Implementation

This chapter covers the approach towards action recognition on the video datasets introduced in the last chapter. The implementation was made in Keras, a high-level neural network API written in Python [22]. The training of the architecture was done on a GPU instance on Amazon Web Services.

Because of its success in large-scale image recognition, the VGG16 model was used as the foundation of the architecture. Since the actions occur over time, the architecture also has to handle temporal information, so LSTM layers were implemented in the architecture. The following sections describe the implementation used to recognize American football actions.

4.1 Proposed Architecture

The architectures used in this report are based on the VGG16 model, seen in Figure 8. VGG16 is a deep learning model built on convolutional layers, brought up in Section 2.1.2, designed for large-scale image recognition. It was introduced in 2014 and has been successful on datasets such as ImageNet [23]. It is built up of blocks of convolutional layers and max pooling layers with ReLU as the activation function, as mentioned in Section 2.1.1. At the end of the model, a flattening layer that flattens the input is added, followed by fully-connected layers. For classification tasks, like the task of this thesis, a softmax function is used, as mentioned in Section 2.1.1.

The LSTM layers process the extracted frame features through time and at the last time step output one vector with a probability distribution over the classes. The model computes the error with cross-entropy and has Adam as optimizer, brought up in Section 2.2.1.

Figure 8: Left: The default VGG16 architecture. Right: The proposed architecture.

Figure 9: A look at how data is processed in the last recurrent block of the proposed model. The last LSTM layer has a softmax function as activation function and outputs a vector with a probability distribution over the number of action classes.
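A hedged sketch of such a VGG16 + LSTM architecture in Keras is shown below. The 30-frame input matches the All-22 sequences described in Section 5.2.2, but the layer sizes are assumptions for illustration; the thesis's exact configuration is given in Figure 8:

```python
from tensorflow import keras
from tensorflow.keras import layers

frames = keras.Input(shape=(30, 224, 224, 3))  # a sequence of 30 frames

# Pretrained, frozen VGG16 extracts features from every frame.
vgg = keras.applications.VGG16(weights="imagenet", include_top=False,
                               input_shape=(224, 224, 3))
vgg.trainable = False

x = layers.TimeDistributed(vgg)(frames)            # per-frame CNN features
x = layers.TimeDistributed(layers.Flatten())(x)
x = layers.LSTM(256, return_sequences=True)(x)     # temporal modelling
# Following Figure 9: a final LSTM layer with as many units as action
# classes and a softmax activation emits the class distribution.
x = layers.LSTM(2, activation="softmax")(x)

model = keras.Model(frames, x)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```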

4.2 Data Preprocessing


4.2.1 UCF101

The length of the clips in the UCF101 dataset varies between two and six seconds, at 30 frames per second. Because of the size of the dataset, extracting all frames from all videos would result in a lot of frames and memory usage, so only 10 frames per second were extracted from each clip. Furthermore, a subset of 15 classes was randomly chosen from the dataset and used in training of the network. Those classes were Baby Crawling, Baseball Pitch, Basketball Dunk, Bench Press, Biking, Boxing Punching Bag, Drumming, Golf Swing, Hammering, Javelin Throw, Kayaking, Nunchucks, Skiing, Throw Discus and Volleyball Spiking. 7586 samples formed the training set; 2068 samples were used in the validation set and the test set. Due to the varying lengths of the videos, one sample is a frame sequence of 10 images, since the model requires the inputs to have the same shape; one sequence thus corresponds to one second of video.
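The thesis does not state which tool performed the frame extraction; as a hypothetical sketch, OpenCV can sample a fixed number of frames per second from a clip:

```python
import cv2

def extract_frames(video_path, fps_out=10):
    # Sample frames at a fixed rate (e.g. 10 per second for UCF101,
    # 2 per second for All-22) instead of keeping every frame.
    cap = cv2.VideoCapture(video_path)
    fps_in = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps_in / fps_out)), 1)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames
```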

4.2.2 All-22

The length of the clips in the All-22 dataset is 17 seconds for every clip; therefore, two frames per second are extracted to compensate for the longer clips. Since all clips have the same length, it is possible to represent the full clip in one sequence and include all of its extracted frames. The actions aimed to be recognized are the pass attempt and the run play. Outcomes of the actions, such as pass complete or incomplete and touchdown, were not taken into consideration. Furthermore, each action is global for one clip, which means one label per clip. The quality of the clips in terms of relevant content varied; in some clips, close-ups of players could occur after a play, which is not relevant for the action. If the clips were cut more accurately, it would probably enhance the results, but it would also take much more time. The dataset was split into a training set with 1911 videos, a validation set with 397 videos and a test set with 302 videos. The classes are distributed evenly across the different sets.

4.2.3 Data Simplification


Two actions in the All-22 dataset were used for training and classification, which made the model a binary classifier. This also enhances the chance of classifying the right action, since the complexity is not as high as it would be with more actions included. The datasets were simplified like this in the hope of increasing the accuracy of the models. The simplification also mitigated the limited computational power and memory storage; several previous models have used multiple GPUs for training large datasets [24, 15] or state-of-the-art GPUs [14, 25, 16].

4.3 Optimization

There are several tools for optimizing results presented in Chapter 2. The methods used in this thesis are introduced in this section.

The model was trained on the training and validation sets in epochs until the learning curve evened out; training more epochs from that point does not enhance the model. The model was then tested on a test set, so that the model was not optimized to fit one specific set of video sequences.

4.3.1 Dropout

Dropout, brought up in Section 2.2.3, was used to address the overfitting. A dropout ratio of 50% was used, and the number of epochs was doubled to compensate for the slower growth of the training accuracy. The dropout was added before the final LSTM layer, whose number of neurons corresponds to the number of actions.

5 Results

5.1 Performance Measure

When training an NN for classification, it is relevant to use accuracy as the metric for measuring the performance of the model during training. Using accuracy as the metric when training in Keras yields four outputs after each completed epoch. The first is the training loss, computed with cross-entropy. The second is the training accuracy, which indicates how well the model is doing on the data it is being trained on. The third is the validation loss, also computed with cross-entropy. The fourth output is the validation accuracy, which shows how well the model is doing on the validation set, which it has not been trained on.

When the training was completed and the weights of the model saved, those weights were used to predict the outputs for the test dataset. When a sequence of frames is predicted, the model produces probabilities for each possible label. The highest-predicted label was picked and compared to the correct label, and from this comparison the accuracy was computed.
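Assuming a trained Keras model and one-hot-encoded test data (the variable names below are illustrative, not from the thesis), this prediction-and-comparison step looks like:

```python
import numpy as np

# Predict class probabilities for each test sequence, pick the
# highest-probability label, and compare against the true labels.
probs = model.predict(test_sequences)        # shape: (n_samples, n_classes)
predicted = np.argmax(probs, axis=1)
actual = np.argmax(test_targets, axis=1)     # one-hot -> class index
accuracy = np.mean(predicted == actual)
```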

5.2 Results

5.2.1 UCF101


Figure 10: Left: Graph of the accuracy when training the architecture on the UCF dataset. Right: Graph of the loss for each epoch during training of the architecture.

Figure 11: Left: Graph of the accuracy when adding dropout to the model trained on the UCF dataset. Right: Graph of the corresponding loss.

5.2.2 All-22

Figure 12 shows the results of training the model on the All-22 dataset with two classes. The model was trained on 1911 samples and validated and tested on 397 samples. Each sample was a sequence of 30 frames retrieved from the dataset at a rate of two frames per second. Training accuracy ended up in the high 90s and validation accuracy at 82%. The losses were low for both training and validation, but the validation loss was still higher. In contrast to the UCF dataset, on some runs the training and validation accuracy would get stuck at around 65% for the whole training. This is an indication that the problem is hard for the architecture to learn, and an additional LSTM layer was therefore added to the architecture.

These results also indicated overfitting, which made it relevant to test dropout here as well. But as for the UCF dataset, this did not increase the accuracy much and instead increased the loss, as Figure 13 illustrates.

Figure 12: Left: Graph of the accuracy when training the architecture on the All-22 dataset. Right: Graph of the loss during training.

Figure 13: Left: Graph of the training and validation accuracy of the architecture on the All-22 dataset with dropout applied to the model. Right: Training and validation loss of the model during training of the architecture with dropout activated.

5.2.3 Model Accuracy

The predictions from each model on the test set can be seen in Table 1. For the model trained on the UCF dataset, the results differed little when using dropout. Both predictions, as Table 1 shows, were roughly two-thirds accurate (67.4% and 68.3%), with dropout being slightly better than without it.

The predictions on the All-22 test set were about the same for the architecture with and without dropout; both ended up at around 83%, as shown in Table 1.

Table 1: Accuracy of the model on each dataset, with and without dropout.

Model             Dataset   Dropout   Accuracy
Proposed model    UCF       No        0.674
Proposed model    UCF       Yes       0.683


6 Related Work

There are many successful deep learning methods for image recognition and classification [26, 27, 28]. One big reason is the number of big image datasets, such as ImageNet, CIFAR-10 and MNIST, that provide good conditions for training the networks [29, 30, 31]. Bigger video datasets have started emerging, and several deep learning approaches have recently been used for learning video representations, producing state-of-the-art results. The results of the related work presented in this chapter are found in Table 2.

One approach to large-scale video classification using CNNs introduced a dataset of one million YouTube videos with 487 classes, called Sports-1M [14]. Since CNNs require a long time to optimize their parameters efficiently, this issue was mitigated by introducing two processing streams, which increased the speed of the runtime. (The model of this thesis uses one stream only.) One of the streams learned features on low-resolution frames, and the other was a high-resolution stream that operated on the middle portion of the frame. The stream architecture, called slow fusion, fused temporal information throughout the network in such a way that higher layers progressively received more information from the spatial and temporal dimensions. The architecture was then tested on the Sports-1M dataset as well as another video dataset called UCF101.

Another, similar approach also used a two-stream convolutional network [15]. By placing two VGG16 networks in parallel, feeding them different inputs and fusing the outputs, the architecture could process both spatial and temporal information from the videos, similar to [14]. The spatial information was a single frame extracted from the input video, whereas the temporal information was optical flow computed on consecutive frames of the video. Temporal information is important, since an individual frame is only a small fraction of the full video. Unlike the model presented in this thesis, optical flow was used to handle the temporal information; it describes the motion between video frames, which is supposed to make recognition easier, since similar activities have similar motions. They trained and evaluated the architecture on the datasets UCF-101 and HMDB-51.

In a third approach [32], features of the inputs were extracted by a CNN and then processed by a convolutional pooling model that performed max-pooling over the final convolutional layer of the CNN. An LSTM was also used on the extracted features, similar to the model of this thesis. However, the results of the LSTM model were outperformed by the convolution-pooling model. The full model was trained on the Sports-1M dataset and UCF101.

Another approach modeled actions as transformations from a precondition state to an effect state [16]. Frame features were aggregated via average pooling and after that embedded in a final fully-connected layer; the outputs were then compared to decide an action class.

Table 2: Accuracy results from the different approaches on large video datasets, in percentages.

Approach                           UCF101   Sports-1M   Youtube-8M   HMDB51   All-22
Slow Fusion [14]                   65.4%    80.2%       -            -        -
Two-Stream [15]                    88.0%    -           -            59.4%    -
Optical Flow + ConvPooling [32]    90.8%    89.3%       -            -        -
Improved Trajectories [25]         91.5%    -           -            65.9%    -
Mixture of Experts [33]            -        82.6%       84.8%        -        -
Precondition & Effect [16]         92.4%    -           -            63.4%    -
Proposed architecture              67.0%*   -           -            -        83.0%

* 15 classes out of 101

The use of recurrent neural networks for video recognition has also been tested. One paper introduced an RNN that learns to detect events in basketball while also tracking the players responsible for the event [24]. The architecture was trained to classify 11 different event categories while automatically tracking players with several Bidirectional Long Short-Term Memory (BLSTM) networks, a form of LSTM.

An approach to play type recognition in real-world football video [35] built its detectors in two parts. The kickoff detector focused on identifying how the players aligned and their run-up towards the ball. The punt kick, which occurs when the offensive team runs out of attempts to move the ball, is very similar to the other actions, since the teams line up in a similar way and the ball is kicked and caught by the opponent. Therefore, detecting a punt was solved by making the other detectors reliable at determining when a play is not a punt, and detecting punts by a process of elimination. These detectors were then processed by an HMM, which encoded knowledge about the temporal structure of football games to improve accuracy. The architecture was then evaluated on 10 high-school football games, averaging 146 videos each.

6.1 Comparisons

Compared to previous attempts on the UCF101 dataset in Table 2, the proposed approach of this thesis lagged both in accuracy and in the amount of data processed. The state-of-the-art methods brought up in Chapter 6 do not use pretrained weights but compute them from scratch, which takes time and requires more computational power than was accessible for this thesis. The 67% correct predictions on 15 classes would probably drop with more action classes, since more classes increase the complexity. Since the proposed architecture of this thesis was the first to train on and predict the All-22 dataset, it is hard to compare to other results. However, somewhat similar approaches such as [24] managed to get better results on more classes than the binary classifier of this thesis.

7 Conclusion

The frame extraction also plays a role in the results: not every single frame of the videos was extracted, which could mean information is lost. Extracting frames on the fly instead of storing them locally could make it possible to process every frame of the videos; otherwise, a very large amount of storage would be required. Predicting the correct class two-thirds of the time with 15 possible classes does not compete with official sports statistics, which are more or less 100% accurate. To be able to compete with today's methods of collecting sports statistics, the obtained accuracy would have to be higher, at least in the high nineties, in order to work as some sort of complement to sports statistics teams.

The binary classifier for the All-22 dataset attained a higher accuracy than the one for UCF101, but it also had an easier task. Being able to predict either pass or run 83% of the time still does not compete with the accuracy of today's sports statistics, and if the number of actions were increased, the accuracy would probably drop. To the human eye, seeing whether a ball is thrown or carried is not challenging. This leads us to the fundamental problem of AI and deep learning: tasks that humans solve naturally and intuitively are hard for us to describe formally. There are improvements that could be made, starting with the quality of the All-22 dataset. There are irrelevant frames in the video sequences, mainly at the end of the clips, and removing them could improve the results. Another trait of this dataset is the size of the frames: the videos have a resolution of 640×320, which is large compared to other datasets, for example UCF101 with 320×240 and ImageNet with 224×224. A higher resolution entails a bigger area for the model to cover. Since a pass play is defined by a quarterback and a ball receiver, only a small fraction of the frame sets this action apart from the run play, which is mainly defined by a running back. Every play in American football is initialized by a snap, which means that a fraction of each clip in the dataset is more or less the same for both actions; removing the snap from the video sequences could also improve the results.


7.1 Future Work


References

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[2] M. Lewis, Moneyball. W. W. Norton & Company, 2016.

[3] “Sabermetrics,” http://sabr.org/sabermetrics, accessed: 2017-08-25.

[4] “The People Tracking Every Touch, Pass And Tackle in the World Cup,” https://fivethirtyeight.com/features/the-people-tracking-every-touch-pass-and-tackle-in-the-world-cup/, accessed: 2017-08-25.

[5] “Coaches film,” http://www.nfl.com/coachesfilm, accessed: 2017-08-25.

[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.

[7] Y. LeCun, “Generalization and network design strategies,” in Connectionism in Perspective, R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, Eds. Zurich, Switzerland: Elsevier, 1989; an extended version was published as a technical report of the University of Toronto.

[8] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Intelligent Signal Processing. Institute of Electrical and Electronics Engineers, 2001, pp. 306–351.

[9] Y. Zhou and R. Chellappa, “Computation of optical flow using a neural network,” in Neural Networks. Institute of Electrical and Electronics Engineers, 1988, pp. 71–78.

[10] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Neurocomputing: Foundations of research,” J. A. Anderson and E. Rosenfeld, Eds. MIT Press, 1988, ch. Learning Representations by Back-propagating Errors, pp. 696–699.

[11] Z. C. Lipton, “A critical review of recurrent neural networks for sequence learning,” arXiv:1506.00019, Tech. Rep., June 2015.

[12] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, Jan. 2014.

[14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in The Conference on Computer Vision and Pattern Recognition, 2014.

[15] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’14. MIT Press, 2014, pp. 568–576.

[16] X. Wang, A. Farhadi, and A. Gupta, “Actions ~ transformations,” arXiv:1512.00795, Tech. Rep., December 2015.

[17] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv:1212.0402, Tech. Rep., 2012.

[18] “About UCF101,” http://crcv.ucf.edu/data/UCF101.php, accessed: 2017-08-27.

[19] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in The Conference on Computer Vision and Pattern Recognition, 2009.

[20] “About imagenet,” http://image-net.org/about-publication, accessed: 2017-08-25.

[21] M. Huh, P. Agrawal, and A. A. Efros, “What makes imagenet good for transfer learning?” arXiv:1608.08614, Tech. Rep., August 2016.

[22] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.

[23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, Tech. Rep., September 2014.


[25] L. Wang, Y. Qiao, and X. Tang, “Action recognition with trajectory-pooled deep-convolutional descriptors,” arXiv:1505.04868, Tech. Rep., May 2015.

[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, Tech. Rep., December 2015.

[27] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. C. Courville, and Y. Bengio, “Renet: A recurrent neural network based alternative to convolutional networks,” arXiv:1505.00393, Tech. Rep., May 2015.

[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097–1105.

[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “Imagenet: a large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.

[30] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., May 2012.

[31] Y. Lecun and C. Cortes, “The MNIST database of handwritten digits,” 2009.

[32] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” arXiv:1503.08909, Tech. Rep., March 2015.

[33] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8M: A large-scale video classification benchmark,” arXiv:1609.08675, Tech. Rep., September 2016.

[34] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.

[35] S. Chen, Z. Feng, Q. Lu, B. Mahasseni, T. Fiez, A. Fern, and S. Todorovic, “Play type recognition in real-world football video,” in 2014 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 652–659, 2014.
