DEGREE PROJECT IN COMPUTER SCIENCE, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Deep Learning of Affective Content from Audio for Computing Movie Similarities

IAN ALDEN COOTS

KTH ROYAL INSTITUTE OF TECHNOLOGY


Deep Learning of Affective Content from Audio for Computing Movie Similarities

Att beräkna känslomässiga likhetsmått från filmljud med djupa neurala nätverk

IAN ALDEN COOTS

Master's Thesis in Machine Learning
Supervisor: Roelof Pieters, VionLabs AB
Supervisor: Giampiero Salvi, KTH
Examiner: Anders Lansner, KTH

Stockholm, Sweden 2015


Abstract

Recommendation systems are in wide use by many different services on the internet today. Most commonly, recommendation systems use a technique called collaborative filtering, which makes recommendations for a user using other users' ratings. As a result, collaborative filtering is limited by the number of user ratings in the system. Content-based recommendations perform direct comparisons on the items to be recommended and thus avoid dependence on user input. In order to implement a content-based recommendation engine, pairwise similarity measures must be calculated for all of the entities in the system. When the entities to be recommended are movies, it can be informative to make comparisons using emotional (or affective) content. This work details the investigation of different methodologies for extracting affective content from movie audio using deep neural networks. First, different types of feature vectors in concert with a variety of model parameters for training were examined in order to project input audio data into a three-dimensional valence-arousal-dominance (VAD) space where affective content can be more easily compared and visualized. Finally, two different similarity measures for direct comparison of movies with respect to their affective content were introduced.

Sammanfattning

Recommendation systems are widely used among services on the internet. The most common recommendation systems implement so-called collaborative filtering, which has many shortcomings owing to its dependence on users' ratings of content. Content-based recommendation systems avoid the need for user ratings and instead compare the content directly. To implement such systems, similarity measures must be computable pairwise between the entities in the system. When these entities are films, emotional content can be an informative descriptor. This thesis explores different ways of extracting emotional content from movie audio with the help of multi-layer neural networks. First, an investigation is carried out into different kinds of feature vectors together with varying model training parameters, with the aim of projecting input data into the three-dimensional valence-arousal-dominance (VAD) vector space, where emotions can more easily be compared and visualized. Finally, two different similarity measures are presented that can be applied to compare films directly with respect to their emotional content.


Contents

1 Introduction
  1.1 Background
  1.2 Problem Formulation
  1.3 Prior Work

2 Theory
  2.1 Probabilistic Models
  2.2 Artificial Neural Networks
  2.3 Restricted Boltzmann Machines
    2.3.1 Gibbs Sampling
    2.3.2 Stochastic Gradient Ascent
    2.3.3 Contrastive Divergence
  2.4 Deep Belief Networks
    2.4.1 Learning
    2.4.2 Backpropagation
    2.4.3 Unsupervised Pre-training
    2.4.4 Supervised Fine-Tuning
    2.4.5 Momentum
  2.5 Hidden Markov Models
    2.5.1 Stationary Distribution
    2.5.2 Forward-Backward Algorithm
    2.5.3 Baum-Welch Algorithm
    2.5.4 Similarity Measure

3 Methodology
  3.1 Label Collection
  3.2 Feature Extraction
    3.2.1 Spectrogram Features
    3.2.2 Mel Frequency Cepstral Coefficient (MFCC) Features
  3.3 Model Training
    3.3.1 Pre-Training
    3.3.2 Fine-Tuning
    3.3.3 Feature Types and Parameters
  3.4 Movie Analysis
    3.4.1 Affective Feature Extraction
    3.4.2 Cosine Similarity
    3.4.3 HMM Similarity

4 Results
  4.1 Model Training
    4.1.1 Spectrogram Features
    4.1.2 MFCC Features
    4.1.3 Viability
  4.2 Film Analysis
    4.2.1 VAD Profile
    4.2.2 Similarity

5 Discussion
  5.1 Conclusions
  5.2 Future Work

Bibliography


Introduction

1.1 Background

Recommendation systems are important tools that users of many services rely on to find new products and guide their usage patterns. Many of these systems today, however, use collaborative filtering, which relies on feedback from other users to make recommendations. This tends to favor popular items and can prevent relatively unknown content that a user may enjoy from being discovered. Content-based recommendation systems, on the other hand, attempt to find items with characteristics similar to those that a user likes. This requires the system to be able to reduce the items to be recommended into features that are descriptive and can be easily compared. Emotional (or affective) content varies greatly across movies and TV shows and could potentially be used as a feature for comparison in recommendation systems. Analysis of affective content is difficult, however, due to the subjective nature of emotion. Furthermore, how humans perceive broadcasted emotions is not well understood. Deep neural networks have proven useful in discovering latent underlying features for classification and regression in many fields and may thus prove useful in representing affective content.

1.2 Problem Formulation

This project comprises an exploration of methodologies for extracting features representative of emotional content from movie audio using deep learning techniques.

To enable integration into a content-based recommendation system, items must be directly comparable via a similarity measure that uses these features. To develop a system for comparing films by means of their audio, the following steps must be carried out:

1. Choose a representation for affective content.

2. Train models that perform well for extracting affective content.

3. Determine informative similarity measures for comparing content.


1.3 Prior Work

In the literature, affective content has been represented in a multitude of ways; the most prototypical are categorical and dimensional. A categorical representation labels data with one of a finite number of emotional classifications[1, 2] such as happy, sad, or angry. Dimensional models, on the other hand, are continuous over a set of emotional primitives[3-8]. The emotional primitives most commonly used are valence, arousal and dominance, as they have been shown to be descriptive of and exhibit good separation for basic emotional classes. In affective analysis of music, dominance is often replaced with tension. In many applications, only two dimensions are used and dominance is left out.

There have also been attempts to expand these simple representations into ones that are slightly more complex and hopefully more informative. Examples of such are fuzzy categorical classifications[9, 10], hierarchical categorical classifications[11] and a hierarchical 3-layered categorical/dimensional hybrid representation[12, 13]. Additionally, dimensional models have been expanded in order to project data as distributions with variances[14, 15] or as heatmaps[16] in valence-arousal space.

The input data for emotional content extraction has taken a variety of forms in prior work. Affective content from speech has been analyzed using the widest variety of features, including pitch, duration, power envelope, energy, speaking rate, zero-crossing rate, Mel-frequency Cepstral coefficients (MFCCs), Linear Prediction based Cepstral coefficients (LPCCs), Subband based cepstral (SBC) features, and related statistical features[2-8, 12, 13, 17]. Emotional content has also been extracted from music, which resembles speech data in that both are audio signals. Features used for music affective content include the spectrogram, Spectral Contrast, MFCCs, Echo Nest Timbre features (ENTs), and chroma-based features[14, 15]. Additionally, visual features have been used (usually combined with audio) to extract affective content from film. These include color, motion, shot length, and Facial Animation Parameters (FAPs)[1, 11, 18].

Many different methodologies have been used to extract affective content from input features. Methods for categorical representations include k-Nearest Neighbor estimators (kNN)[3, 7], Support Vector Machines (SVMs)[1], Gaussian Mixture Models (GMMs)[2], Hidden Markov Models (HMMs)[9, 18], and Deep Belief Networks (DBNs)[19]. Methodologies that have been used for projecting input feature vectors into a continuous emotional space include multiple linear regression (MLR), partial least squares (PLS)[20], support vector regression (SVR)[3, 7], fuzzy inference[12] and DBNs with a final linear regression layer[14-16]. The application of fuzzy logic to methodologies that produce categorical classifications has also been used to produce dimensional models, as in the case of fuzzy kNN[3], fuzzy C-mean clustering (FCM)[11] and fuzzy deep belief networks (FDBNs)[10].


Theory

This chapter provides a theoretic overview of the models and methods implemented in this work for the extraction and comparison of affective content.

2.1 Probabilistic Models

Models for machine learning may be broken into two main categories: discriminative and generative. Discriminative methods (from a probabilistic perspective) simply look at the likelihood of class labels given observed data. Generative models, on the other hand, specify a joint distribution over both labels and data. As a result, generative models are generally much more powerful than purely discriminative models due to the assumptions they make regarding how class labels are related to data.

Discriminative models, like Support Vector Machines, tend to be sufficient for many straightforward machine learning applications and go about the task of classification in a simplistic manner. Such models take some predetermined representation of data and attempt to map samples directly to class labels. While this may seem like a reasonable thing to do, it is an oversimplification of the true nature of how data are generated. Because data are generated by a combination of underlying latent factors unique to members of a given class, this approach marginalizes the effect of these latent factors. Generative models attempt to learn a good representation of the latent factors that may have generated the data, and thus are more in line with the reality of how data are generated. Additionally, they confer the added benefit of being able to generate samples of their own.

2.2 Artificial Neural Networks

Artificial Neural Networks (ANNs) are computational models that approximate the computational nature of the brain. They are composed of interconnected computational units called (artificial) neurons. During a single instance of computation, a neuron follows three simple steps to produce output y:

1. Sum all weighted inputs.

2. Add a bias, a.

3. Apply an activation function, ϕ.


Mathematically speaking,

y = \varphi\Big( \sum_i w_i x_i + a \Big) \qquad (2.1)

A visual representation of a neuron is shown in Figure 1. Activation functions may take any number of forms, though the most common are sigmoid and step functions.

Despite the fact that a single neuron performs relatively simple calculations, many neurons working in concert are capable of modelling very complex behaviors.
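As a concrete illustration of Equation 2.1, the following minimal sketch (plain NumPy; all input values and names are hypothetical, not taken from this work) computes the output of the four-input sigmoid neuron of Figure 1:

```python
import numpy as np

def sigmoid(x):
    """Logistic activation function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(x, w, a):
    """Equation 2.1: y = phi(sum_i w_i * x_i + a) with a sigmoid phi."""
    return sigmoid(np.dot(w, x) + a)

# Hypothetical inputs x1-x4, weights w1j-w4j, and bias aj
x = np.array([0.5, -1.0, 0.25, 2.0])
w = np.array([0.1, 0.4, -0.3, 0.2])
a = 0.05
print(neuron_output(x, w, a))
```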

Figure 1: An artificial neuron with 4 inputs x1-x4 and their weights w1j-w4j, bias aj, sigmoid activation function σ(·), and output yj.

Although there is no limitation by definition on the connectivity or complexity of neural networks, many complex configurations are impractical to implement. Neural networks are often organized in clever ways such that connections are limited either in number or directedness in order to make training feasible.

2.3 Restricted Boltzmann Machines

Boltzmann Machines are a variant of stochastic recurrent neural networks with symmetric connections and no self-loops. The Restricted Boltzmann Machine (RBM) is a special case of Boltzmann Machine in which the subsets of visible and hidden nodes form a fully-connected bipartite graph. This means that each visible node is connected to every single hidden node, but to no other visible nodes. The same connection scheme also pertains to the hidden nodes with respect to the visible nodes.

Restricted Boltzmann Machines are energy-based models, meaning that there is a scalar energy value associated with any given configuration of values for the visible and hidden units given the model's parameters, θ = (W, b, a). For an RBM with V visible units and H hidden units, W is the V × H matrix of visible-hidden connection weights, while b and a are the size-V and size-H vectors of the visible unit and hidden unit biases, respectively. Most commonly, both the visible and hidden units in RBMs are binary-valued. Such RBMs are called Bernoulli-Bernoulli RBMs,

Figure 2: Restricted Boltzmann Machine with V visible nodes and H hidden nodes. Each connection between visible unit vi and hidden unit hj has weight wij. Visible node vi has bias term bi and hidden node hj has bias term aj.

and the energy of the joint configuration (v, h) is given by:

E(\mathbf{v}, \mathbf{h} \mid \theta) = -\sum_{i=1}^{V} \sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{i=1}^{V} b_i v_i - \sum_{j=1}^{H} a_j h_j \qquad (2.2)

The joint probability of the configuration (v, h) for energy-based probabilistic models is given by:

p(\mathbf{v}, \mathbf{h} \mid \theta) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{\sum_{\mathbf{u}} \sum_{\mathbf{h}} e^{-E(\mathbf{u}, \mathbf{h})}} \qquad (2.3)

Marginalizing over all values for the hidden units gives the probability distribution of the configurations for the visible vectors:

p(\mathbf{v} \mid \theta) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}}{\sum_{\mathbf{u}} \sum_{\mathbf{h}} e^{-E(\mathbf{u}, \mathbf{h})}} \qquad (2.4)

The probability of the jth hidden unit, h_j, being active given a visible unit configuration v is given by:

p(h_j = 1 \mid \mathbf{v}, \theta) = \sigma\Big( a_j + \sum_{i=1}^{V} w_{ij} v_i \Big) \qquad (2.5)

where σ(·) is the sigmoid function, \sigma(x) = \frac{1}{1 + e^{-x}}.

Likewise, the probability of the ith visible unit, v_i, being active given hidden vector h is

p(v_i = 1 \mid \mathbf{h}, \theta) = \sigma\Big( b_i + \sum_{j=1}^{H} w_{ij} h_j \Big) \qquad (2.6)

Restricted Boltzmann Machines may also be modified to accommodate real-valued inputs by assuming each visible unit's input is normally distributed[21]. Such Gaussian-Bernoulli RBMs have the following energy function:

E(\mathbf{v}, \mathbf{h} \mid \theta) = \sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{i=1}^{V} \sum_{j=1}^{H} w_{ij} \frac{v_i}{\sigma_i} h_j - \sum_{j=1}^{H} a_j h_j \qquad (2.7)

where \sigma_i^2 is the variance of the ith unit. Conditional probabilities then take the form:

p(v_i \mid \mathbf{h}, \theta) = \mathcal{N}\Big( b_i + \sum_{j=1}^{H} w_{ij} h_j, \; \sigma_i^2 \Big) \qquad (2.8)

p(h_j = 1 \mid \mathbf{v}, \theta) = \sigma\Big( a_j + \sum_{i=1}^{V} w_{ij} \frac{v_i}{\sigma_i} \Big) \qquad (2.9)

It is fairly common, however, to standardize data such that each visible unit has a variance of 1. This simplifies the equations for the model’s energy and for the conditional probabilities of activation[22]. The conditional probability of a hidden node being active, p(hj = 1|v, θ), then takes the same form as for Bernoulli-Bernoulli RBMs, given by equation 2.5. Furthermore, it simplifies training of the model since variances do not have to be learned.

E(\mathbf{v}, \mathbf{h} \mid \theta) = \sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2} - \sum_{i=1}^{V} \sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{j=1}^{H} a_j h_j \qquad (2.10)

p(v_i \mid \mathbf{h}, \theta) = \mathcal{N}\Big( b_i + \sum_{j=1}^{H} w_{ij} h_j, \; 1 \Big) \qquad (2.11)

2.3.1 Gibbs Sampling

Gibbs Sampling is a Markov Chain Monte Carlo (MCMC) method for obtaining samples from a multivariate distribution when exact inference is intractable. The method involves sequential sampling of each variable in the model conditioned on current values for all of the other variables. As the values of the variables are iteratively updated, the distribution of samples approaches the joint distribution.

Due to the lack of intra-layer connections in RBMs, all hidden units are conditionally independent of each other given the visible units and vice versa. This means that the sampling of all hidden units, and likewise of all the visible units, may be done concurrently as blocks, essentially reducing each iteration of Gibbs sampling in RBMs to two distinct steps.

The process of Gibbs sampling in an RBM for n iterations is thus:

\mathbf{v}^0 \sim p(\mathbf{v}) \qquad (2.12)

\mathbf{h}^0 \sim p(\mathbf{h} \mid \mathbf{v}^0) \qquad (2.13)

\mathbf{v}^1 \sim p(\mathbf{v} \mid \mathbf{h}^0) \qquad (2.14)

\mathbf{h}^1 \sim p(\mathbf{h} \mid \mathbf{v}^1) \qquad (2.15)

\vdots

\mathbf{v}^n \sim p(\mathbf{v} \mid \mathbf{h}^{n-1}) \qquad (2.16)

where x^i \sim p(x \mid y^j) indicates a sample x^i drawn from the conditional distribution of the random variable x conditioned on a value y^j of the random variable y.
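A minimal NumPy sketch of this block Gibbs procedure for a Bernoulli-Bernoulli RBM is given below; the helper names and the fixed random seed are illustrative assumptions, not part of the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, a):
    """Block-sample all hidden units at once via Equation 2.5."""
    p = sigmoid(a + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b):
    """Block-sample all visible units at once via Equation 2.6."""
    p = sigmoid(b + h @ W.T)
    return (rng.random(p.shape) < p).astype(float), p

def gibbs_chain(v0, W, b, a, n):
    """Run n full up-down Gibbs steps (Equations 2.12-2.16) from v0."""
    v = v0
    for _ in range(n):
        h, _ = sample_h_given_v(v, W, a)
        v, _ = sample_v_given_h(h, W, b)
    return v
```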

2.3.2 Stochastic Gradient Ascent

Restricted Boltzmann Machines are generally trained via stochastic gradient ascent on the log likelihood of the training data[23]. Stochastic gradient ascent involves updating the model parameters in the direction of steepest ascent of a specified functional gradient. Thus,

\theta^{\tau+1} = \theta^{\tau} + \eta \nabla F(\theta^{\tau}) \qquad (2.17)

where \theta^{\tau+1} are the updated parameters, \theta^{\tau} are the current values for the parameters, \eta is the learning rate, and \nabla F(\theta^{\tau}) is the gradient with respect to the current parameters. Likewise, if the goal is to minimize the specified function, like training error for instance, the gradient is followed in the opposite direction. This is termed gradient descent, and the update rule for the parameters is given by:

\theta^{\tau+1} = \theta^{\tau} - \eta \nabla F(\theta^{\tau}) \qquad (2.18)

Gradient following is considered stochastic when the updates to the parameters occur before the entire training set is seen. Updates are generally performed on small subsets of the training data called mini-batches. As a result, during any given update step, stochastic gradient ascent and descent do not follow the true gradient of the full training set; over many full passes through the data (termed epochs), however, they make a good approximation. The stochastic nature of this method also gives it an advantage over methods that follow the true gradient, because it has the ability to escape shallow local optima on the way to better parameter values. Furthermore, the computational time per update is greatly reduced, since many fewer examples are processed at a time.

For maximum-likelihood learning via gradient ascent, the gradient of the likelihood L(θ|X) of the parameters θ given the data X is used. Because log(x) is a monotonically increasing function, maximization of log(x) is equivalent to maximization of x, and the log likelihood is often used in place of the likelihood for mathematical convenience. The derivatives of the log likelihood with respect to the RBM model parameters[24] are given by the following:

\frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} = \langle v_i^0 h_j^0 \rangle - \langle v_i^{\infty} h_j^{\infty} \rangle \qquad (2.19)

\frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial b_i} = \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}} = \langle v_i^0 \rangle - \langle v_i^{\infty} \rangle \qquad (2.20)

\frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial a_j} = \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}} = \langle h_j^0 \rangle - \langle h_j^{\infty} \rangle \qquad (2.21)

where the angle brackets, \langle \cdot \rangle, represent the expected value of the contained term.

The derivatives are thus given by differences between expectations of the data distribution and expectations from the model (i.e. expectations of the Markov chain after convergence from Gibbs sampling). The expectation of the data is easy to compute, but unfortunately, Gibbs sampling often requires a large number of iterations to converge. Furthermore, there is no known method to determine whether or not the Markov chain has reached equilibrium. Maximizing the log likelihood is the same as minimizing the Kullback-Leibler divergence between the data and the model distributions, D_{KL}(P_0 \| P_\theta)[25]. The Kullback-Leibler divergence D_{KL}(P \| Q) between two distributions, P and Q, measures the information lost when Q is used to approximate P.

2.3.3 Contrastive Divergence

The issue of the exact gradient being computationally intractable is often mitigated using a technique called Contrastive Divergence (CD). Contrastive Divergence involves using the statistics from performing Gibbs sampling for a finite number of steps instead of those obtained from the converged Markov chain. Instead of following the gradient of the log likelihood, Contrastive Divergence follows the gradient of the difference of two Kullback-Leibler divergences:

D_{KL}(P_0 \| P_\theta) - D_{KL}(P_\theta^n \| P_\theta) \qquad (2.22)

Contrastive Divergence carried out with k steps of Gibbs sampling per iteration is denoted CD-k. Although Contrastive Divergence is just an approximation of the value ideally minimized, in practice even CD-1 is sufficient to produce viable results.

The parameter updates for CD-k then become:

∆wij = η(hvihjidata− hvihjireconst.) = η(hvi0h0ji − hvkihkji) (2.23)

∆bi = η(hviidata− hviireconst.) = η(hv0ii − hviki) (2.24)

∆aj = η(hhjidata− hhjireconst.) = η(hh0ji − hhkji) (2.25) The samples produced from Contrastive Divergence are termed reconstructions, as they are the reconstructed result of a data vector first encoded, and then decoded by the model.
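The CD-1 updates of Equations 2.23-2.25 might look as follows, reusing the sampling helpers from the Gibbs sketch above. Using hidden probabilities rather than binary samples for the sufficient statistics is a common practical choice, assumed here rather than taken from the thesis:

```python
def cd1_update(v0, W, b, a, eta=0.002):
    """One CD-1 update on a mini-batch v0 of shape (batch, V).
    Returns updated copies of W, b and a (Equations 2.23-2.25)."""
    h0, ph0 = sample_h_given_v(v0, W, a)   # positive phase from the data
    v1, _ = sample_v_given_h(h0, W, b)     # one-step reconstruction
    _, ph1 = sample_h_given_v(v1, W, a)    # negative phase from the reconstruction
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - v1.T @ ph1) / batch  # <v h>_data - <v h>_reconst.
    db = (v0 - v1).mean(axis=0)             # <v>_data - <v>_reconst.
    da = (ph0 - ph1).mean(axis=0)           # <h>_data - <h>_reconst.
    return W + eta * dW, b + eta * db, a + eta * da
```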


2.4 Deep Belief Networks

Deep belief networks (DBNs) are a form of generative graphical model. They consist of an undirected top layer that feeds through any number of hidden layers down to a visible layer. All connections run between consecutive layers; there are no intra-layer links. The undirected top layer acts as an associative memory, and the downward connections decode the internal representation of the memory into an observed vector. DBNs may also be used as feedforward deep neural networks (DNNs) by taking all weights to be directed from the visible layer upwards through the hidden layers. As a deep neural network, an additional output layer is added in order to perform classification or regression with the feedforward activations. A graphical representation of DBNs is shown in Figure 3. DBNs are powerful models due to their ability to model (albeit latently) increasingly complex features in each consecutive hidden layer.

2.4.1 Learning

Supervised learning is the process of training models using annotated data. Supervised learning algorithms involve modifying the parameters or structure of models such that they return labels that are as close to the true labels as possible. Unsupervised learning, on the other hand, is the process of training a model without the use of labeled data. While data itself is usually plentiful, labels tend to be orders of magnitude more scarce and expensive to obtain. Unsupervised learning thus has an advantage over supervised learning in that large quantities of data are more readily available for training. Most machine learning tasks, however, aim to relate input data to human-understandable concepts or values. Without labels, there are no explicit constraints on the nature of the output; the best the system can do is find inherent structure in the data. Semi-supervised learning is an attempt to leverage the advantages of both supervised and unsupervised learning. The general goal of semi-supervised learning is to extrapolate the meaning derived from annotated data from the structure discovered in unlabeled data. With respect to deep neural networks, unsupervised learning constitutes pre-training of the network on a large corpus of unlabeled data. This is then followed by supervised learning, or fine-tuning, where a smaller set of labeled data is used to further train the network from where the unsupervised training left off.

2.4.2 Backpropagation

Feedforward neural networks are generally trained in a supervised manner using backpropagation of errors. Backpropagation compares a neural network's output with the correct values for training data and propagates these differences backwards through the network, updating the network's parameters so as to best minimize the error [26]. This is a form of gradient descent in the error space of the training data.

Figure 3: A Deep Belief Network in its generative (left) and deterministic feedforward deep neural network (right) forms. In its generative form, the top two hidden layers form an associative memory. All connections then feed through each hidden layer downward to the visible layer. As a feedforward DNN, all connections feed upward through the network. An output layer is attached to the final hidden layer for classification or regression.

The error on input example x_n is given by:

E_n = \frac{1}{2} \sum_j (t_{nj} - y_{nj})^2 \qquad (2.26)

where y_{nj} is the jth node's output on example x_n and t_{nj} is its target output value.

The partial derivative of the error with respect to the network weights may be expanded using the chain rule:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}} \qquad (2.27)

where net_j is the input to node j from the network, given by the weighted sum \sum_i w_{ij} x_i. Taking the derivative of this weighted sum, \frac{\partial net_j}{\partial w_{ij}} = x_i, and defining \delta_j = -\frac{\partial E}{\partial net_j}, an update rule for the weights may be derived via Equation 2.17:

\Delta w_{ij} = \eta \delta_j x_i \qquad (2.28)

The value of \delta_j may be derived by further expansion using the chain rule:

\delta_j = -\frac{\partial E}{\partial net_j} = -\frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial net_j} \qquad (2.29)

Assuming node j is an output unit and evaluating the partial derivatives of Equations 2.26 and 2.1, respectively,

\frac{\partial E}{\partial y_j} = -(t_j - y_j) \qquad (2.30)

\frac{\partial y_j}{\partial net_j} = \varphi_j'(net_j) \qquad (2.31)

an expression for \delta_j can be constructed:

\delta_j = (t_j - y_j) \varphi_j'(net_j) \qquad (2.32)

For an internal node j in layer l, the change in error with respect to its output takes a different form:

\frac{\partial E}{\partial y_j} = \sum_k \frac{\partial E}{\partial net_k} \frac{\partial net_k}{\partial y_j} = \sum_k \frac{\partial E}{\partial net_k} \frac{\partial}{\partial y_j} \sum_i w_{ik} x_i \qquad (2.33)

where k iterates over all nodes in the (l+1)th layer. Taking advantage of the fact that the output of one layer is the input to the next, i.e. that y_j^{(l)} = x_j^{(l+1)}, the partial derivative \frac{\partial}{\partial y_j} \sum_i w_{ik} x_i is evaluated as w_{jk}.

\frac{\partial E}{\partial y_j} = \sum_k \frac{\partial E}{\partial net_k} w_{jk} = \sum_k \delta_k w_{jk} \qquad (2.34)

Substituting the newly obtained value for \frac{\partial E}{\partial y_j} into Equation 2.29, \delta_j for an internal node j may be calculated:

\delta_j = \varphi_j'(net_j) \sum_k \delta_k w_{jk} \qquad (2.35)


Deltas are thus recursively computed from the output layer back through to the input layer for each training example, and weights are updated according to Equation 2.28.
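The recursion can be made concrete with a short NumPy sketch for a network with one hidden layer and sigmoid activations; biases are omitted for brevity and all names are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, W2, eta=0.01):
    """One backpropagation update for a single training example.
    Implements Equations 2.28, 2.32 and 2.35; for the sigmoid,
    phi'(net) = y * (1 - y)."""
    y1 = sigmoid(x @ W1)                 # hidden layer output
    y2 = sigmoid(y1 @ W2)                # network output
    d2 = (t - y2) * y2 * (1 - y2)        # output deltas (Equation 2.32)
    d1 = y1 * (1 - y1) * (d2 @ W2.T)     # hidden deltas (Equation 2.35)
    W2 += eta * np.outer(y1, d2)         # Delta w = eta * delta * input (Eq. 2.28)
    W1 += eta * np.outer(x, d1)
    return W1, W2
```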

Deep networks tend to have an exceptionally large number of parameters, making the error space very high-dimensional. Because of this high dimensionality, and because backpropagation is a gradient-following method, it is very common for the weights to converge to sub-optimal local minima in the error space. As a result, the initialization of the weights, i.e. the location in error space from which the gradient following begins, is very important. Another problem with particularly deep networks is that derivatives become exceptionally small as they are propagated through the many layers of the network. As a result, weight updates do not quickly move the network towards a favorable configuration. This is known as the vanishing gradient problem. Because of the high dimensionality of the error space, and due to diminishing error derivatives, it is very difficult (i.e. large amounts of data and time are necessary) to obtain a well-trained deep network using backpropagation alone. For this reason, more practical models, such as Support Vector Machines (SVMs), have become the go-to for the majority of machine learning tasks.

2.4.3 Unsupervised Pre-training

Fortunately, a trick was devised for pre-training deep belief networks in an unsupervised manner in order to find a good initialization for the weights. This trick involves greedily constructing a DBN using RBMs as the building blocks for each layer. The motivation for this trick comes from the fact that an infinitely deep logistic belief net with tied weights is exactly equivalent to a Restricted Boltzmann Machine[25]. In a network with tied weights, W_i = W_{i-1}^T; that is to say, the weights between layers l_i and l_{i+1} are equal to the transpose of the weights between layers l_{i-1} and l_i. Thus, a traversal through two consecutive layers of such a network is equivalent to a single up-down step of Gibbs sampling in an RBM.

Initially, the network is taken to be infinitely deep and is trained as a single RBM using CD-k. After the first RBM is trained, its weights are locked and a second RBM is trained using the hidden layer values of the first RBM as its input. RBMs are stacked one on top of the other, and this process may be repeated until the desired number of layers has been constructed, as sketched below. Assuming the full maximum likelihood method is used to train the RBMs, each consecutive RBM is guaranteed not to decrease the log likelihood of the data[27]. The use of CD-k nullifies this guarantee, although in practice it generally holds when the RBMs are trained sufficiently long.
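In code, the greedy stacking procedure reduces to a short loop. The sketch below assumes a hypothetical train_rbm helper (for example, cd1_update from Section 2.3.3 looped over mini-batches) and the sigmoid defined earlier; neither name comes from the thesis implementation:

```python
def pretrain_dbn(data, layer_sizes, epochs=3000):
    """Greedy layer-wise pre-training: train an RBM, freeze its weights,
    then use its hidden activations as the input to the next RBM."""
    weights = []
    inputs = data
    for n_hidden in layer_sizes:
        W, b, a = train_rbm(inputs, n_hidden, epochs)  # hypothetical helper
        weights.append((W, b, a))
        inputs = sigmoid(a + inputs @ W)               # deterministic up-pass
    return weights
```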

2.4.4 Supervised Fine-Tuning

After a DBN has been constructed by consecutive unsupervised training of stacked RBMs, it may be further trained in a supervised manner in order to perform classification or regression tasks. A final output layer is attached to the deepest layer and the model is discriminatively trained by stochastic gradient descent (SGD) using backpropagation of errors, where updates are given by Equation 2.28. The weights learned by unsupervised pre-training tend to produce values that are particularly close to a good minimum in the model's error space; supervised fine-tuning then further modifies those weights so that this minimum is approached. It has been shown that pre-trained networks perform considerably better than networks trained using backpropagation from a random initialization of weights alone[28].

2.4.5 Momentum

The use of momentum modifies the parameter update rule by keeping track of a velocity vector that accumulates previously traversed gradients. Momentum has been shown to greatly reduce the time to convergence by persisting movement over gradients of low curvature, where the rate of traversal tends to become exceptionally small under normal gradient following[29]. Additionally, momentum can help prevent the model from getting trapped in suboptimal local minima when following the gradient. The update rule using momentum is given by:

v^{\tau+1} = \mu v^{\tau} - \eta \nabla F(\theta^{\tau}) \qquad (2.36)

\theta^{\tau+1} = \theta^{\tau} + v^{\tau+1} \qquad (2.37)

where v^{\tau} is the velocity at step \tau, and \mu \in [0, 1] is the momentum coefficient.
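A minimal sketch of Equations 2.36-2.37; the default values for eta and mu are hypothetical:

```python
def momentum_step(theta, v, grad, eta=0.0005, mu=0.5):
    """Accumulate velocity (Equation 2.36), then move the parameters
    (Equation 2.37). `grad` is the gradient of the error at theta."""
    v = mu * v - eta * grad
    return theta + v, v
```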

2.5 Hidden Markov Models

Hidden Markov Models (HMMs) are a class of statistical model describing a Markov process. A Markov process is a process that fulfills the Markov property; that is to say, the state at a given time step depends only on the state of the system at the previous time step, as shown by the probabilistic graphical model in Figure 4. HMMs are discrete-time Markov chains with a finite number of hidden states, s_t \in \{1, ..., N\}, each with a certain probability of transitioning to another state. Each hidden state then has a probability distribution that governs the nature of the observations it emits. HMMs are thus defined by a parameter set \lambda = \{p, A, B\}. The initial state probability distribution p is an N-dimensional probability vector, the transition matrix A is an N by N right stochastic matrix, and B is the set of N conditional probability distributions, given by

p_j = P(s_1 = j) \qquad (2.38)

a_{ij} = P(s_{t+1} = j \mid s_t = i) \qquad (2.39)

b_j(x) = f_{X|s}(x \mid j) \qquad (2.40)

Figure 4: An HMM chain. At each time step t = 1, ..., T, the model is in a specific hidden state s_t which emits an observation x_t. The probability of being in hidden state s_t = j is given only by the value of s_{t-1}. The observation x_t is drawn from the emission distribution b_j for hidden state s_t = j.

2.5.1 Stationary Distribution

If a Markov Chain is stationary, then the probability of being in a particular state at any time t eventually converges to a distribution known as a stationary distribution.

The stationary distribution, \pi, is left unchanged when multiplied by the transition matrix A, fulfilling the condition:

\pi A = \pi \qquad (2.41)

This is equivalent to the definition of a left eigenvector of matrix A with eigenvalue equal to 1. The stationary distribution may therefore be calculated by finding the left eigenvector of the transition matrix with eigenvalue 1.
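A small NumPy sketch of this computation; the example transition matrix is hypothetical:

```python
import numpy as np

def stationary_distribution(A):
    """Solve pi A = pi: take the left eigenvector of A (an eigenvector
    of A^T) whose eigenvalue is closest to 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(A.T)
    i = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, i])
    return pi / pi.sum()

A = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_distribution(A))  # [0.8333..., 0.1666...]
```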

2.5.2 Forward-Backward Algorithm

The Forward-Backward Algorithm is an algorithm for computing the probability of any state at a time t given a complete observation sequence,

\gamma_{j,t} = P(s_t = j \mid x_1, ..., x_T) \qquad (2.42)

This is done in two passes: the forward and the backward passes. The forward algorithm is a recursive procedure that computes the forward variables,

\alpha_{j,t} = P(s_t = j \mid x_1, ..., x_t) \qquad (2.43)

for all j = 1, ..., N and t = 1, ..., T. These are given by:

\alpha_{j,1} = \frac{1}{Z_1} p_j b_j(x_1) \qquad (2.44)

\alpha_{j,t} = \frac{1}{Z_t} b_j(x_t) \sum_{i=1}^{N} \alpha_{i,t-1} a_{ij} \qquad (2.45)

where Z_t = P(x_t \mid x_1, ..., x_{t-1}) is a normalization constant given by:

Z_1 = \sum_{j=1}^{N} p_j b_j(x_1) \qquad (2.46)

Z_t = \sum_{j=1}^{N} b_j(x_t) \sum_{i=1}^{N} \alpha_{i,t-1} a_{ij} \qquad (2.47)

The forward algorithm is complemented by the backward pass, which recursively computes the backward variables,

\beta_{j,t} \propto P(x_{t+1}, ..., x_T \mid s_t = j) \qquad (2.48)

for all j = 1, ..., N and t = 1, ..., T. These are given by:

\beta_{j,T} = \frac{1}{Z_T} \qquad (2.49)

\beta_{i,t} = \frac{1}{Z_t} \sum_{j=1}^{N} a_{ij} b_j(x_{t+1}) \beta_{j,t+1} \qquad (2.50)

Finally, the conditional state probability may be computed as

\gamma_{j,t} = \alpha_{j,t} \beta_{j,t} Z_t \qquad (2.51)
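The scaled forward recursion (Equations 2.44-2.47) translates directly into code. In the sketch below, the emission likelihoods b_j(x_t) are assumed to be precomputed as a (T, N) array B; this layout is an assumption for illustration:

```python
import numpy as np

def forward_pass(p, A, B):
    """Compute scaled forward variables alpha and normalizers Z.
    p: initial distribution (N,), A: transitions (N, N),
    B: emission likelihoods with B[t, j] = b_j(x_t), shape (T, N)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    Z = np.zeros(T)
    a = p * B[0]                       # unnormalized alpha_{j,1} (Eq. 2.44)
    Z[0] = a.sum()
    alpha[0] = a / Z[0]
    for t in range(1, T):
        a = B[t] * (alpha[t - 1] @ A)  # Eq. 2.45 for all j at once
        Z[t] = a.sum()                 # Eq. 2.47
        alpha[t] = a / Z[t]
    return alpha, Z                    # log-likelihood is sum(log(Z))
```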

2.5.3 Baum-Welch Algorithm

The Baum-Welch algorithm is a specialized case of the Expectation Maximization (EM) algorithm for training Hidden Markov Models. The algorithm iteratively updates model parameters such that the log likelihood of the training data given the model increases.

With R training sequences, the initial probability vector may be updated using the forward-backward variables:

p_j^{new} = \frac{\gamma_j}{\sum_j \gamma_j} \qquad (2.52)

where

\gamma_j = \sum_{r=1}^{R} \gamma_{j,1}^{r} \qquad (2.53)

The transition matrix values are updated using the conditional probabilities of state combinations, \xi_{ij,t}^{r}, for all given sequences, r:

\xi_{ij,t}^{r} = P(s_t^r = i, s_{t+1}^r = j \mid x_1^r, ..., x_{T_r}^r) \qquad (2.54)

This value is calculated from the forward and backward variables

\xi_{ij,t}^{r} = \alpha_{i,t}^{r} a_{ij} b_j(x_{t+1}^{r}) \beta_{j,t+1}^{r} \qquad (2.55)

and then summed over time and over all sequences:

\xi_{ij} = \sum_{r=1}^{R} \sum_{t=1}^{T_r - 1} \xi_{ij,t}^{r} \qquad (2.56)

The update is then given by:

a_{ij}^{new} = \frac{\xi_{ij}}{\sum_j \xi_{ij}} \qquad (2.57)

Updates for the emission distribution parameters are dependent on the distribution. For multivariate Gaussian observation distributions, the parameter updates for state i are as follows:

\mu_i^{new} = \frac{\sum_{t=1}^{T} \gamma_{i,t} x_t}{\sum_{t=1}^{T} \gamma_{i,t}} \qquad (2.58)

\Sigma_i^{new} = \frac{\sum_{t=1}^{T} \gamma_{i,t} (x_t - \mu_i^{new})(x_t - \mu_i^{new})^T}{\sum_{t=1}^{T} \gamma_{i,t}} \qquad (2.59)

The Baum-Welch algorithm, as a variant of the EM algorithm, is not guaranteed to converge to a global maximum. With multimodal distributions, the algorithm can get stuck in local maxima depending on the initial parameters. In practice, the Baum-Welch algorithm is carried out for a finite number of iterations or until the log likelihood does not increase by more than a small threshold value.

2.5.4 Similarity Measure

A natural measure of similarity is something akin to the inverse of divergence. For probabilistic models, a natural measure of divergence is the Kullback-Leibler divergence, D_{KL}(P \| Q). Unfortunately, for HMMs, computation of this value is intractable, as it has no closed form; it is usually approximated stochastically, in a time-consuming process, using Monte Carlo methods. To remedy this, Yoon and Sahraeian[30] proposed a low-complexity similarity measure for HMMs.

In order to define this similarity measure, several new concepts for HMMs are defined:

1. The state similarity, S_e(i, j), between state i \in \{1, ..., N^{(1)}\} in HMM \lambda^{(1)} and state j \in \{1, ..., N^{(2)}\} in \lambda^{(2)}:

State similarity may be defined in inverse or exponential forms:

S_e(i, j) = \frac{1}{D(b_i^{(1)} \| b_j^{(2)})} \qquad (2.60)

and

S_e(i, j) = e^{-k D(b_i^{(1)} \| b_j^{(2)})} \qquad (2.61)

where k is a constant.

With continuous emission distributions, D(P_1 \| P_2) is most easily defined as the symmetric KL divergence, though any distance measure may be used. The symmetric Kullback-Leibler divergence is defined as

\frac{1}{2} \left[ D_{KL}(P_1 \| P_2) + D_{KL}(P_2 \| P_1) \right] \qquad (2.62)

For two d-variate Gaussians, \mathcal{N}_1 and \mathcal{N}_2, the KL divergence has a closed form[31]:

D_{KL}(\mathcal{N}_1 \| \mathcal{N}_2) = \frac{1}{2} \left[ \ln \frac{|\Sigma_2|}{|\Sigma_1|} - d + \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) \right] \qquad (2.63)

2. The expected similarity, ES(\lambda^{(1)}, \lambda^{(2)}):

The expected value of the state similarity is given by:

ES(\lambda^{(1)}, \lambda^{(2)}) = E[S_e(i, j)] = \sum_i \sum_j \pi_i^{(1)} \pi_j^{(2)} S_e(i, j) \qquad (2.64)

where \pi^{(1)} and \pi^{(2)} are the stationary distributions of \lambda^{(1)} and \lambda^{(2)}, respectively.

3. The N^{(1)} \times N^{(2)} correspondence matrix Q:

Each element q_{ij} of the matrix represents the contribution of each state pair ij to the expected similarity:

q_{ij} = \frac{\pi_i^{(1)} \pi_j^{(2)} S_e(i, j)}{ES(\lambda^{(1)}, \lambda^{(2)})} \qquad (2.65)

A similarity measure for two HMMs, S(\lambda^{(1)} \| \lambda^{(2)}), can then be derived from a measure of the sparsity of the correspondence matrix.

S(\lambda^{(1)} \| \lambda^{(2)}) = \frac{1}{2} \left[ \frac{1}{N^{(1)}} \sum_{i=1}^{N^{(1)}} H(r_i) + \frac{1}{N^{(2)}} \sum_{j=1}^{N^{(2)}} H(c_j) \right] \qquad (2.66)

where H(v) is a normalized sparsity measure for vector v, and r_i and c_j correspond to the ith row and the jth column of Q, respectively.

For this application, H(v) is taken to be the normalized Gini Index. The Gini Index is deemed one of the best sparsity measures because it is one of the few that satisfies all six of the criteria of sparsity specified by Hurley and Rickard [32].

H(v) = \frac{D \cdot G(v)}{D - 1} \qquad (2.67)

where the Gini Index, G(v), of the D-dimensional vector v is defined as follows:

G(v) = 1 - 2 \sum_{k=1}^{D} \frac{v_{(k)}}{\|v\|_1} \left( \frac{D - k + \frac{1}{2}}{D} \right) \qquad (2.68)

where v_{(k)} is the kth smallest element of v and \|v\|_1 is the \ell_1 norm of v,

\|v\|_1 = \sum_{i=1}^{D} |v_i| \qquad (2.69)
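Putting Equations 2.66-2.68 together, a minimal sketch of the sparsity-based similarity, starting from a precomputed correspondence matrix Q:

```python
import numpy as np

def gini_index(v):
    """Gini Index G(v) of Equation 2.68."""
    v = np.sort(np.abs(v))                 # v_(k): elements in ascending order
    D = v.size
    k = np.arange(1, D + 1)
    return 1.0 - 2.0 * np.sum((v / v.sum()) * (D - k + 0.5) / D)

def hmm_similarity(Q):
    """Equation 2.66: average normalized Gini sparsity H(v) = D*G(v)/(D-1)
    over the rows and columns of the correspondence matrix Q."""
    def H(v):
        return v.size * gini_index(v) / (v.size - 1)
    rows = np.mean([H(r) for r in Q])
    cols = np.mean([H(c) for c in Q.T])
    return 0.5 * (rows + cols)
```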


Methodology

3.1 Label Collection

Label collection was carried out by means of an elementary web application, shown in Figure 5. The application was developed with Flask, a web microframework written in Python. Five-second audio clips were presented to 25 users to be evaluated on a discrete scale from 0-4 inclusive for valence, arousal, and dominance. Colorized versions of the self-assessment manikins (SAMs) introduced in [33] were used as visual representations for the selectable values in valence-arousal-dominance (VAD) space. A length of five seconds was chosen because it was the shortest duration over which an evaluator could reliably ascertain affective content. Five-second audio clips were extracted with a sliding window of 2.5 seconds from seven feature-length films in a variety of genres: Aladdin, Anchorman: The Legend of Ron Burgundy, Marvel's Avengers, Forrest Gump, When Harry Met Sally, The Shining, and Star Wars: A New Hope. The application was gamified as a simple role-playing game in order to boost user interest in labeling.

Initially, the web application was designed such that labelling a clip increased the probability of directly adjacent clips being presented to the user. This was an attempt to take advantage of adjacency, allowing extrapolation between clips, in hopes of mitigating the fact that labels were assigned to such long durations. With 19,580 audio clips, however, the average user rated less than one percent of the total number of clips. Naturally, this produced a non-uniform sampling of the distribution of clips. The density of labels can be seen in Figure 6. Furthermore, the 2.5-second overlap between adjacent clips caused discontent among users, as many felt they were being presented with content they had already labeled. In order to remedy this issue, clip priority was abandoned and audio samples previously unrated by a given user were instead presented entirely at random.

3.2 Feature Extraction

In order for affective content to be extracted from audio, it must first be converted into a format that can be processed by the models to be used. Many models, deep belief networks included, require input to have a fixed dimensionality. A five-second mono-channel mp3 file with a sampling rate of 44.1 kHz contains over 220,000 points.


Figure 5: Label Collection Web Application

This is far too large to be fed directly into most models, so dimensionality must first be reduced. Feature extraction was carried out in Python using the Essentia audio analysis library. Essentia is a C++ library with a Python wrapper.

3.2.1 Spectrogram Features

Spectrograms are a fairly natural way of visually representing audio data in a way that corresponds closely to how humans perceive sound: as a function of frequency. An audio signal's spectrogram is equal to the squared magnitude of its short-time Fourier transform.

\mathrm{spectrogram}(t, \omega) = |STFT(t, \omega)|^2 \qquad (3.70)

This is essentially a measure of frequency (ω) content over time (t) in the signal. With 1024 points used for the Fourier transform and an overlap of 900 samples, the resulting spectrogram contained 512 discrete frequency levels. A value of 1024 was chosen because most implementations of the Fast Fourier Transform are faster when the number of points used for the transform is a power of 2. To further reduce dimensionality, the spectrograms were truncated at 11 kHz, giving 256 levels, and the frequencies were aggregated over the duration of the five-second clips. Sum, mean, median and standard deviation were taken for each discrete frequency level in the spectrogram. Four variants of spectral feature vectors, shown in Figure 8, were tested. Taking statistics over the entire five seconds resulted in a 1024-dimensional vector, henceforth referred to as 1x. 2048- and 3072-dimensional vectors were obtained by taking statistics over the two halves and three thirds, respectively, of the full clips (2x and 3x, respectively). This was done in an attempt to mitigate the use of aggregators and restore some of the time-dependent information contained within a single clip. A fourth feature vector was constructed by concatenating the 1024-dimensional vector to the 2048-dimensional one in order to obtain another variant of 3072 dimensions (referred to as 2x 1x).

Figure 6: The number of ratings assigned to a specific label corresponds to marker size. Most labels are concentrated in the center of the space; there are few to no examples in the lower left octants of the space.

Figure 7: Spectrogram of a five-second clip from Avengers. Image colors are mapped to a log scale to enhance contrast.

Although somewhat time-consuming to train, Mohamed et al. showed that 3072-dimensional data from audio could produce reasonably good results using deep belief networks[34]. A sample spectrogram is shown in Figure 7 and its corresponding high-dimensional feature vectors are shown in Figure 8.

Figure 8: Feature vectors extracted from the spectrogram in Figure 7. The numbers in parentheses indicate over how many divisions the clips had statistics collected and which ordinal fraction is currently being represented. As an example, (2/3) represents the statistic collected over the second third of the clip's spectrogram. Image colors are mapped to a log scale to enhance contrast; the values of the features themselves are not altered.
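A rough sketch of the 1x feature extraction using SciPy instead of the Essentia pipeline used in the thesis; window and overlap conventions, and exact bin counts, may differ slightly from the original implementation:

```python
import numpy as np
from scipy.signal import stft

def spectrogram_features_1x(audio, fs=44100):
    """1x feature vector: squared-magnitude STFT (1024 points, 900 overlap),
    truncated at 11 kHz, then sum, mean, median and standard deviation per
    frequency bin, concatenated into roughly 4 x 256 dimensions."""
    f, t, Z = stft(audio, fs, nperseg=1024, noverlap=900)
    S = np.abs(Z) ** 2                # Equation 3.70
    S = S[f <= 11000]                 # truncate above 11 kHz
    return np.concatenate([S.sum(axis=1), S.mean(axis=1),
                           np.median(S, axis=1), S.std(axis=1)])
```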

3.2.2 Mel Frequency Cepstral Coefficient (MFCC) Features

Mel Frequency Cepstral Coefficients (MFCCs) are commonly used features in audio processing applications involving speech or music. They are hand-tailored to reflect the logarithmic perception of pitch from audio frequency. There are several implementations of MFCCs; the variant used in this work is termed MFCC FB-40. MFCC FB-40s are extracted from a given frame in the following manner[35]:

1. Calculate the magnitude spectrum as in Equation 3.70.

2. Apply the Mel filter bank. A Mel filter bank is a set of overlapping triangular filters equally spaced on the Mel scale. Filters may be of equal height or equal area, but FB-40 features use 40 filters of equal area. The filters in the range 200-1000 Hz are linearly spaced and only assume logarithmic spacing above 1000 Hz.

3. Take the natural logarithm of the filter bank coefficients.

4. Apply the discrete cosine transform (DCT) to the log filter bank energies:

X_k = \sum_{n=0}^{N-1} x_n \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} \right) k \right], \quad k = 0, ..., N-1 \qquad (3.71)

X_k then represents the kth MFCC. It is customary to use N = 13.

MFCC feature vectors are then computed for a given five-second clip by aggregating statistical measures over frames in the sample. The aggregators examined include the minimum (∧), maximum (∨), mean (μ), median (μ_{1/2}), variance (σ²), and the first and second discrete derivatives of the mean and variance (Δμ, ΔΔμ, Δσ², ΔΔσ²). Aggregators were collected over the full clip (1x) and over two concatenated halves independently (2x).
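The aggregation scheme might be sketched as follows, using librosa as a stand-in for the Essentia MFCC FB-40 extractor used in the thesis; the derivative-based aggregators are omitted for brevity:

```python
import numpy as np
import librosa

def mfcc_features_1x(clip, sr=44100):
    """1x MFCC feature vector: 13 coefficients per frame, aggregated over
    the whole five-second clip with min, max, mean, median and variance."""
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=13)  # shape (13, frames)
    return np.concatenate([mfcc.min(axis=1), mfcc.max(axis=1),
                           mfcc.mean(axis=1), np.median(mfcc, axis=1),
                           mfcc.var(axis=1)])
```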

3.3 Model Training

Model training was carried out using Python and the Pylearn2[36] machine learning library. Pylearn2 is a modular library with a wide variety of machine learning models and algorithms. It is built on top of Theano, a Python library with an optimizing compiler for evaluating symbolic expressions. Theano is tailored for calculations involving multi-dimensional arrays and, due to its symbolic features, is particularly useful for algorithms that require the evaluation of derivatives, such as backpropagation. The most powerful feature of Theano, and thus Pylearn2, is support for GPU computing. For training large models like deep belief networks, this is an invaluable feature. The process for training a deep belief network is visually represented in Figure 9.

Figure 9: Model Training Pipeline. Film sources are segmented into audio clips; feature extraction yields feature vectors and annotation yields labels; the feature vectors drive unsupervised pre-training and, together with the labels, supervised fine-tuning, producing the final DBN.

3.3.1 Pre-Training

Due to the fact that pre-training of DBNs is an unsupervised procedure, it is possible to take advantage of a large number of training examples. Each model was therefore pre-trained on all 19,580 of the five-second clips extracted from the seven movies available for labelling. This corresponds to 27 hours of audio and, for the 3072-dimensional representation, 2.7 GB of data. To reduce training time, computation was carried out on an nVidia GeForce GTX 680 graphics card with 1536 CUDA cores.

All DBNs were pre-trained as stacked RBMs using CD-1 for 3000 epochs. Each DBN contained a uniform number of units in each layer, determined by the input dimensionality. When it comes to choosing a mini-batch size, there are conflicting effects. Larger mini-batches take greater advantage of the speed-up from parallelization using the GPU; however, they do not have correspondingly large maximum stable learning rates. As a result, weight updates are smaller per gradient evaluation[37] and learning as a whole takes longer. A moderate mini-batch size of 100 was chosen. Too large a learning rate causes computational issues, so to obtain a viable value, the learning rate was halved until no computational problems arose. This resulted in a learning rate η = 0.002.

Even with hardware acceleration, a single epoch takes over 3.5 seconds. With 3000 epochs per layer, a 7 layer network (6 sequential RBMs) takes over a day to train. Fortunately, due to the greedy nature of pre-training, it is possible to reuse sequential RBMs to construct networks with varying numbers of layers.

3.3.2 Fine-Tuning

Models were fine-tuned using backpropagation of errors via stochastic gradient descent with momentum. Audio feature vectors for training were taken from six of the seven previously mentioned films, resulting in a training set composed of 2700 examples. Feature vectors from a single movie were held out for test and validation. Training was carried out until the mean squared error on the validation set had not decreased below its lowest encountered value for some predetermined number of iterations, N. Model performance is then measured by the mean squared error on the test set.

3.3.3 Feature Types and Parameters

Models were trained using the two types of features detailed in Subsections 3.2.1 and 3.2.2. In all cases, layers were of uniform size determined by the input dimensionality. This was done because, according to Bengio[38], "using the same size for all layers worked generally better or the same as using a decreasing size (pyramid-like) or increasing size (upside down pyramid), but of course this may be data-dependent." For the spectrogram features, a learning rate of η = 0.0005 was chosen for all training instances. For each of the feature types (1x, 2x, 3x and 2x 1x), models were trained with momentum values of μ = 0.1, 0.25, 0.5, and 0.75. With the MFCC features, many more feature types were explored by varying the number and types of aggregators used; the aggregators used can be seen in Tables 1, 2, and 3. Since the MFCC features are of much smaller dimensionality than the spectrogram features, it was possible to explore more parameter values. Learning rates of both 0.001 and 0.0005 were explored for all MFCC models tested. Smaller learning rates of 1 × 10⁻⁵ and 5 × 10⁻⁵ were briefly examined, but resulted in longer training times and never produced minimal test errors, and as such were quickly abandoned. For every combination of parameters and feature type used, five models were trained, with depth ranging from two to seven layers. In order to enable direct comparison between models, the same test and validation sets were used for each instance of training.

3.4 Movie Analysis

The purpose of extracting affective content from a movie is to enable further analysis of the film's content at higher levels of abstraction. In order to be useful for a content-based recommender system, movies must be able to be evaluated for similarity. This section details two methods for comparing films' affective content. The flow of computation for comparing movie audio is shown in Figure 10.

Figure 10: Movie Comparison Pipeline. A sliding window over the film audio yields feature vectors that are propagated through the DBN to obtain VAD features; these are either summed into a single VAD vector for cosine similarity, or used to train an HMM via Baum-Welch for the HMM similarity measure, yielding similarity matrices.

3.4.1 Affective Feature Extraction

After an optimal DBN has been trained, VAD feature extraction is carried out for a film’s audio in its entirety. This process is detailed below:

1. DBN selection: An optimally performing DBN is selected from the trained models.

2. Feature extraction: Features corresponding to the chosen DBN are extracted from the audio using a sliding window of five-second length.


3. Projection into VAD space: The features for each window are propagated through the DBN to obtain VAD features.

3.4.2 Cosine Similarity

A given movie may be represented as a single vector in VAD space by taking the integral of its VAD curves. With discrete time steps, this is simply the summation of each primitive over the length of the film. Movies may then be directly compared in a naive and computationally efficient manner using cosine similarity. The cosine similarity of two vectors is given by:

\cos(\theta) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} \qquad (3.72)

where \|\mathbf{v}\|_2 is the \ell_2 norm of vector \mathbf{v}, given by:

\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{D} v_i^2} \qquad (3.73)
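A compact sketch of this comparison, assuming each film is given as a (T, 3) array of per-window VAD values; the array layout is an assumption for illustration:

```python
import numpy as np

def vad_cosine_similarity(vad1, vad2):
    """Sum each film's VAD curves over time (the discrete integral), then
    compare the two resulting vectors with Equation 3.72."""
    u = vad1.sum(axis=0)
    v = vad2.sum(axis=0)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```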

3.4.3 HMM Similarity

Cosine similarity is a rather simple and superficial method of comparison for something as complex as the affective content profile of a film. In an attempt to compare more of the intricacies of emotional content, a second similarity measure is considered. In order to capture the time-evolving nature of a movie's emotional content, the sequence of VAD values extracted from a movie can be used to train an HMM with Gaussian emissions using the Baum-Welch algorithm described in Subsection 2.5.3. These HMMs may then be compared directly using the HMM similarity measure of Equation 2.66. Hidden Markov Models were trained using the scikit-learn implementation of the model. Scikit-learn is a basic, general-purpose machine learning library for Python.


Results

4.1 Model Training

This section details the results obtained from training deep belief networks of different sizes on various types of input features.

4.1.1 Spectrogram Features

Figure 11 shows some typical error curves during the fine-tuning step of model training when training is carried out to an arbitrarily large number of epochs. This is done merely to show the behavior of the mean squared error of the training, test, and validation sets with respect to the number of layers in the network. The mean squared error is taken as the mean of the sum of squared errors for the three output primitives: valence, arousal and dominance. Thus, a mean squared error of 3 corresponds to an MSE of approximately 1 for each primitive individually.

The error on the training set continues to decrease as the model parameters are modified to better fit the training data. Fluctuations in the curves are caused by the stochastic nature of Stochastic Gradient Descent. Divergence of the test and validation error curves from the training set curve is caused by overfitting. As the model parameters are modified to better fit the training set, the model’s ability to generalize to unseen data is compromised.

The number of layers in a DBN affects how quickly the validation set error diverges from the training set error. As can be seen in Figure 11, the distance between the validation set and training set error curves increases more drastically as more layers are added. For these 1024-dimensional spectrogram features, it can be seen that as training progresses, the validation set error curve eventually changes concavity. This is of importance because a minimum may only be found in a region where a curve is concave up. For two- to four-layer networks, the validation error curve is still concave up by 1000 epochs. In the five-layer network, the gradual transition from concave up to concave down may be seen. In networks with more than five layers, this transition happens very quickly. A larger number of layers, i.e. a deeper network, allows for less error on the training set with fewer iterations, but quickly overfits. Due to the higher curvature of error curves in deeper networks, the validation error minimum tends to be achieved much sooner with more layers.

Figure 12 shows the performance of models of different depths for the 4 variants of spectrogram features.


Figure 11: Graphs depicting the mean squared error of the training, test, and validation sets during training for 1024-dimensional spectrogram features.