Visualizing the Body Language of a Musical Conductor using Gaussian Process Latent Variable Models

Academic year: 2021



Visualizing the Body Language of a Musical Conductor using Gaussian Process Latent Variable Models

Creating a visualization tool for GP-LVM modelling of motion capture data and investigating an angle-based model for dimensionality reduction

MONA LINDGREN, ANDERS SIVERTSSON

Bachelor's thesis at CSC. Supervisor: Hedvig Kjellström. Examiner: Katarina Gustavsson


Abstract

In this bachelor's thesis we investigate and visualize a Gaussian process latent variable model (GP-LVM), used to model high dimensional motion capture data of a musical conductor in a lower dimensional space.

This work expands upon the degree project of K. Karipidou, "Modelling the body language of a musical conductor using Gaussian Process Latent Variable Models", in which GP-LVMs are used to perform dimensionality reduction of motion capture data of a conductor conducting a string quartet, expressing four different underlying emotional interpretations (tender, angry, passionate and neutral). In Karipidou's work, a GP-LVM coupled with K-means and an HMM is used to classify unseen conducting motions into the aforementioned emotional interpretations.

We develop a graphical user interface (GUI) for visualizing the resulting lower dimensional mapping performed by a GP-LVM side by side with the motion capture data. The GUI and the GP-LVM mapping are implemented in Matlab, while the open source 3D creation suite Blender is used to visualize the motion capture data in greater detail; this visualization is then imported into the GUI.

Furthermore, we develop a new GP-LVM in the same manner as Karipidou, but based on the angles between the motion capture nodes, and compare its accuracy in classifying emotion to that of Karipidou's position-based model. The evaluation of the GUI concludes that it is a very useful tool when a GP-LVM is to be examined and evaluated. However, our angle-based model does not improve the classification result compared to Karipidou's position-based model. Thus, using Euler angles is deemed inappropriate for this application.

Keywords: Gaussian process latent variable model, motion capture, visualization, body language, musical conductor, Euler angles.


Referat

Visualisering av en Dirigents Kroppsspråk genom Latent Variabelmodell med Gaussisk Process

In this bachelor's thesis, a latent variable model with a Gaussian process prior (GP-LVM) is investigated and visualized; the model is used to model and reduce the dimensionality of high dimensional motion capture data of a conductor.

The work is a continuation of K. Karipidou's master's thesis "Modelling the body language of a musical conductor using Gaussian Process Latent Variable Models". Karipidou trains GP-LVMs used for dimensionality reduction of motion capture data from the conducting of a string quartet, where the conductor has communicated four different underlying emotional expressions (tender, angry, passionate and neutral). With a trained GP-LVM, coupled with K-means clustering and an HMM, previously unseen conducting motions are classified into the aforementioned emotional expressions.

We develop a graphical user interface (GUI) that visualizes the low dimensional mapping computed by a GP-LVM side by side with the motion capture data. The GUI and the GP-LVM mapping run in Matlab, while the open source software Blender is used for visualizing the motion capture data, which is then imported into the GUI.

Furthermore, a new GP-LVM is developed and evaluated, constructed in the same way as Karipidou's but based on the angles between the motion capture data points instead of the point positions on which Karipidou's models are trained. The classification precision of the newly developed model is then compared to that of Karipidou's model. The evaluation of the GUI concludes that it is a useful tool when a GP-LVM is to be examined and evaluated. However, the angle-based model yields no improvement in classification results compared to Karipidou's position-based model; rather, a deterioration. The conclusion is that Euler angles, which risk introducing singularities, should be avoided in the data representation.

Keywords: latent variable model with Gaussian process, motion capture, visualization, body language, musical conductor, Euler angles.


Contents

1 Introduction
  1.0.1 Problem formulation
  1.0.2 Outline
2 Background
  2.0.1 Blender Motion Capture Addon
3 Mathematical Theory
  3.1 Gaussian Processes
  3.2 Gaussian Process Latent Variable Models (GP-LVM)
4 Designing the visualization tool
  4.1 The main program: Matlab
  4.2 Blender
    4.2.1 Blender scene
    4.2.2 Python script
    4.2.3 Handling corrupt or missing data
5 GP-LVM and the classification framework
  5.1 The data
  5.2 Feature extraction
    5.2.1 Difficulties with Euler angles
  5.3 The GP-LVM modelling
  5.4 The classification framework
6 Experiments and Results
  6.1 The latent space representation of the body motion data
  6.2 Classification of the conductor's underlying emotion
7 Discussion and conclusions
  7.1 The visualization tool
    7.1.1 Conclusions
    7.1.2 Future directions
  7.2.1 Conclusions
  7.2.2 Future directions


Chapter 1

Introduction

There are many reasons to study body language, not only in psychology and social science, but also in the field of machine learning. Modelling human motion is an area of interest in machine learning, not least because body language plays a big role in communication and conveys much of the underlying intentions and context of what is being said [12]. It also has several applications in the growing fields of robotics and machine learning, where it will be increasingly important to improve the spectrum and nuance of human-computer communication. The ability of computers to read and classify body language could prove to be an integral part of this process. In order to make body language interpretable for computers, effective models that are able to capture subtle nuances of body motion are needed.

Musical conductors are trained throughout their professional careers to communicate a wide spectrum of emotions through their body language. The musical conductor's emotional expressiveness makes her or him interesting in connection with the study of statistical modelling of body language. In 2015, the master's student Kelly Karipidou made a statistical model of a conductor expressing different emotions through the conducting of a musical piece. The model used in Karipidou's work is a Gaussian Process Latent Variable Model (GP-LVM), and it is evaluated through a classification experiment. In this bachelor's thesis, we develop a visual tool that can be helpful in the qualitative analysis of a statistical model of human motion. Moreover, we investigate whether an angle-based model represents the body language in an effective way.

1.0.1 Problem formulation

This bachelor's thesis has two aims. First, to create a tool for visualizing GP-LVM modelling of data. Second, to develop a new, angle-based mathematical model of a musical conductor's body language and to compare this new model to Karipidou's position-based model. The new model is put through an experiment in which unseen motion data is classified, and the results are compared to those of the position-based model.



1.0.2 Outline

In the chapter following this introduction, a brief summary of what has previously been achieved in the subject is given. Foremost, it describes the essentials of K. Karipidou's master's thesis, of which this work is a continuation. Chapter 3 presents the mathematical theory that is necessary to understand the dimensionality reduction and the classification framework. The methods section of this thesis is divided into two main parts: Chapter 4 describes the components of the visualization GUI, while Chapter 5 presents and motivates the statistical model chosen to classify the different emotional interpretations of the musical piece. The results are presented in Chapter 6, followed by discussion and conclusions.


Chapter 2

Background

There is great interest in being able to model human motion patterns, particularly in machine learning, human-computer interaction and the computer games industry. One challenge has been to find a manageable way to represent motion that still preserves most of the information. Motion capture (mocap) is a motion recording technique that has been used in statistical body motion modeling research [16]. A mocap recording of human movement usually results in high dimensional data, and working with high dimensional data can be very computationally heavy. Moreover, high dimensional input data can cause overfitting, which is a result of an overly complex model and too little training data [7, p. 24]. Too many dimensions will teach the classifier exceptions and deteriorate the model's generalizability. To avoid the curse of dimensionality, a dimensionality reduction technique is needed. The GP-LVM, a dimensionality reduction technique introduced by [10], has proven to work very well on human motion data [4, p. 53], outperforming other dimensionality reduction techniques [15].

One specific type of body motion that is used to communicate emotions nonverbally in music is conducting. There have been many successful attempts to build conducting gesture recognition systems; [1] is an example of such. During 2015, Kelly Karipidou [8] worked on statistical modeling and classification of conducting gestures. In Karipidou's work, motion capture was used for feature extraction, resulting in a data set of 3D points in Euclidean space. The data was divided into one training set and one testing set. The training set was used to train a GP-LVM model, which performed a mapping from the observable space to a 2D latent space. The latent space was segmented into clusters using K-means, and these clusters represent the states in an HMM, which was used to classify the conducting gestures. Our work extends Karipidou's thesis, and one of our contributions is to learn a GP-LVM model from data measured in Euler angles instead of the Euclidean space approach. One reason for choosing a data representation in Euler angles is that it results in a natural dimensionality reduction: two points in Euclidean space are replaced by a triplet of Euler angles. Another reason to work with Euler angles is that the data representation would be independent of the conductor's physique.

Our main contribution is the development of a visualization tool that shows body motion side by side with its latent space representation. This tool provides an opportunity to analyze the latent space qualitatively and relatively quickly, and can be a great way to get a feel for how the latent space behaves. It also collects the various processes in one place. The tool is made partly in Matlab and partly in Blender. A user can execute the program entirely from Matlab, with no knowledge of Blender necessary.

2.0.1 Blender Motion Capture Addon

A tool that is well utilized in this work is the Blender Motion Capture Addon [2]. Developed by B. Cook and supervised by Blender as part of Google Summer of Code 2011, the Blender Motion Capture Addon was created to improve the workflow when dealing with motion capture data. The addon, available among Blender's pre-installed addons (as of Blender 2.76b), includes functionality for retargeting.

Retargeting is the practice of connecting a performer source to a 3D model and having the model replicate the motions of the performer source. This is what is done in motion capture, where the recorded data points are referenced to data points on the 3D model. The model may or may not have the same shape as the recorded performer.

This addon has been automated via a custom Python script for the purposes of this tool.


Chapter 3

Mathematical Theory

In this chapter we present the mathematical theory required to understand how the conductor’s motions are modeled and classified.

3.1 Gaussian Processes

The Gaussian process is one of the integral parts of the GP-LVM model. In this section we briefly describe the characteristics of a Gaussian process. For a more detailed explanation, [6] is recommended; it is also the source from which our explanation of Gaussian processes originates.

In order to understand what a Gaussian process is, one can imagine the following generalization, beginning with a normally distributed random variable.

A random variable $Y$ is defined to be a Gaussian random variable (Figure 3.1a) if its probability density function is given by (3.1):

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)} \qquad (3.1)$$

where $\mu$ is the mean value and $\sigma^2 > 0$ is the variance; this is denoted $Y \sim \mathcal{N}(\mu, \sigma^2)$.

The next definition can be interpreted as an extension of a one-dimensional Gaussian random variable to a counterpart in several dimensions. A Gaussian $n$-dimensional random vector (Figure 3.1b) is a vector whose components are jointly Gaussian. It is specified by a vector of mean values and a symmetric covariance matrix.

The Gaussian process is a further generalization of a Gaussian random vector to an infinite number of dimensions. Gallager [6] describes the Gaussian process as a stochastic process $\{X(t);\ t \in T\}$ forming a set of stochastic variables which are jointly Gaussian. This stochastic process has a normally distributed random variable at every value of the continuous parameter $t$, which, for example, can denote time. Since the parameter $t$ is continuous, the stochastic process will consist of infinitely many dimensions. However, when working with real data, which is always finite, only a finite number of dimensions is used.



Figure 3.1: (a) One dimensional Gaussian distribution. (b) Multivariate Gaussian distribution. (c) Samples of a Gaussian process. (Source: Keirstead, James. Digital image. http://www.r-bloggers.com/, 5 Apr. 2012; accessed 19 May 2016.)

A Gaussian process is a distribution over functions with a continuous input domain, which for example can be time. It is defined by a mean value function $m(t)$ and a covariance function $k(t, t')$. A variable $f$ distributed as a Gaussian process is written $f \sim \mathcal{GP}(m(t), k(t, t'))$. Sampling from a Gaussian process at different values of $t$ results in a set of Gaussian distributions. Such a sampling procedure is shown in Figure 3.1c.

3.2 Gaussian Process Latent Variable Models (GP-LVM)

The GP-LVM is a latent variable model and a dimensionality reduction technique that can be used for visualization of complex high dimensional data. It was introduced by [10] and can be described as a more general formulation of probabilistic Principal Component Analysis (PPCA). The GP-LVM is denoted by

$$y_i = f(x_i) + n_i, \qquad (3.2)$$

where $f(x_i) \sim \mathcal{GP}(m(x_i), k(x_i, x_j))$ for $i, j = 1, \dots, N$, $N$ is the number of observable data points, and $n_i$ is zero mean white noise [8].

Figure 3.2: A visualisation of a GP-LVM mapping.

Lawrence has shown that PPCA can be interpreted as a Gaussian process mapping from a lower dimensional latent space to the observable data space. The equivalence arises when the kernel function of the Gaussian process prior constrains the mapping to be linear. Lawrence introduces the GP-LVM as a more general extension of the above by considering a covariance function that allows for non-linear mappings. The GP-LVM thus has the advantage of performing non-linear mappings from a high dimensional data space to a low dimensional latent space.

A Gaussian process mapping will not result in a point estimate, as would for example a one dimensional Gaussian. As discussed in Section 3.1, with Gaussian processes every input in the continuous domain has an associated Gaussian distribution under the mapping. Because of this property, every point in the latent space gives rise to a whole spectrum of probable data points in the observable space.

One important feature of the GP-LVM is that it is a non-parametric model. In comparison, a parametric model makes assumptions about the underlying function, the parameters of which are subject to optimization during a training phase; a parametric model therefore consists of a set of optimized parameters. A non-parametric model does not make any such assumptions about the function, but instead tries to estimate a function that fits the data points as well as possible [7]. One could therefore say that all training data is baked together into one model. Figure 3.2 visualizes what the GP-LVM does: it fits a non-linear, highly complex surface to the data points, and then flattens it to a lower dimensional representation.
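To make the training objective concrete, the quantity optimized when fitting a GP-LVM in Lawrence's formulation is the GP marginal likelihood of the data $Y$ given the latent positions $X$. The sketch below evaluates its negative log for fixed $X$ under an assumed RBF kernel with assumed lengthscale and noise values; in practice the thesis relies on the GPmat toolbox, which also optimizes $X$ itself.

```python
import numpy as np

def gplvm_neg_log_likelihood(X, Y, lengthscale=1.0, noise=0.1):
    """Negative log marginal likelihood of data Y (N x D) given latents X (N x q):
    0.5 * (D*N*log(2*pi) + D*log|K| + tr(K^{-1} Y Y^T))."""
    N, D = Y.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared distances
    K = np.exp(-0.5 * d2 / lengthscale ** 2) + noise * np.eye(N)
    sign, logdet = np.linalg.slogdet(K)
    Kinv_Y = np.linalg.solve(K, Y)                         # K^{-1} Y
    return 0.5 * (D * N * np.log(2 * np.pi) + D * logdet + np.trace(Y.T @ Kinv_Y))
```

Training then amounts to minimizing this quantity jointly over the latent positions and the kernel hyperparameters, for instance with a gradient-based optimizer.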


Chapter 4

Designing the visualization tool

In order to go from recorded data to an end result consisting of a latent representation and a video, there is a multitude of steps that need to be executed in the right order. Therefore, this visualization tool is intended not only to visualize the end results, but also to gather the different steps in a single place, thus simplifying the workflow of the end user. From within this tool, the end user specifies the files to process, and the program handles the two main tasks. First, it computes the latent space representation of the input file by using a GP-LVM model. Second, through Blender, the tool renders the animation of the conductor's movement and visualizes it side by side with its latent space representation. The GUI design can be seen in figure 4.1.

This chapter outlines the components and design choices that have gone into the visualization tool. For an overview, consult the flowchart in figure 4.2.

4.1 The main program: Matlab

The main reason for using Matlab as the main program was to enable compatibility with the GPmat and Netlab Matlab toolboxes, used to create and use GP-LVM models. A main frame (Figure 4.1) with buttons and plot windows was created with GUIDE, Matlab's native tool for designing graphical user interfaces [13]. Each button is connected to a callback function in the main script. When pressing the file browser button, the user is prompted to choose a motion capture recording to be visualized.

The compute button callback function starts two tasks. One of them is a system call to a Blender script that renders an animation of the selected .bvh and .csv file pair; this Blender script is described in detail in section 4.2. The resulting animation consists of consecutive frames in .png format, saved on disk. The second task is a procedure that converts the mocap and csv data to a suitable input format for the GP-LVM mapping. When the input data has been processed, a mapping to the two dimensional latent space is performed using a GP-LVM model (6.1). This mapping can be computationally heavy, especially if all frames are rendered and if the latent space has a high resolution. Therefore, we have chosen to let the user select the frame rate and the latent space resolution through two sliders. The play button callback function reads, at the selected frame rate, the consecutive frames of the conductor's body motion generated by Blender, and displays them side by side with the latent space representation.

Figure 4.1: The visualization tool design. For a regular user, this is intended to be the only interaction needed with the involved processes.
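The system call issued by the compute button can be sketched as follows. Blender's command line interface supports `-b` (run in the background, without the GUI) and `-P` (execute a Python script), and arguments after `--` are passed through to the script. The file names below are placeholders; the exact command assembled by the tool may differ.

```python
def build_blender_command(blend_file, script, bvh_file, csv_file, out_dir):
    """Assemble a headless Blender invocation for the render script."""
    return ["blender", "-b", blend_file, "-P", script,
            "--", bvh_file, csv_file, out_dir]

# Placeholder file names for illustration only.
cmd = build_blender_command("scene.blend", "render_mocap.py",
                            "recording.bvh", "recording.csv", "frames/")
# In Matlab, the equivalent is a system() call with this command joined into one string,
# e.g. system(strjoin(cmd, ' ')).
```

Running in background mode is what lets Blender skip its visual components entirely, freeing resources for the rendering itself.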

4.2 Blender

The open source 3D creation suite Blender is used to make a 3D representation of the motion capture data. This was done in order to visualize the specific motions of the conductor and his baton side by side with their representation in the latent space. The application is intended for use with no knowledge of Blender required and to be launchable from Matlab. Therefore, the Blender application was divided into two parts: a scene with the necessary components predefined, and a script for generating and rendering the final animation.

4.2.1 Blender scene

The Blender scene, which is stored in a .blend file and accessed solely through Blender, holds the components necessary for visualizing the conductor and baton motions.



Figure 4.2: A flowchart of the visualization tool. Within the GUI, the user specifies motion capture data which, if not previously processed, is sent on to Blender for animation of the motion capture and to the GP-LVM model for dimensionality reduction. The finished data is then returned to the GUI, where it can be examined side by side.



The file consists of a single 3D scene with:

• two directional lights
• a mannequin model, representing the conductor
• a cylinder, representing the baton, with a single bone as its rig
• two spheres, representing the motion capture points on both sides of the baton
• an Empty (an object consisting of only a position), used for the calculated midpoint of the baton
• a camera for rendering the scene

Except for the mannequin, all of these are standard assets in Blender. The mannequin model is a Creative Commons model made by Sebastian Lague [9]. In order to allow full control of all bones, the Inverse Kinematic (IK) constraints in the mannequin have been removed. The baton is represented by the two spheres at its sides, the rigged cylinder as the body of the baton, and the empty object at its midpoint. An IK constraint has been added to the cylinder bone, aiming it towards the empty object and thus aligning it with the baton's orientation.

As part of the Motion Capture Addon, a predefined relationship between the naming conventions of the .bvh files' armatures and the armature of the mannequin is also stored in the .blend file, which is used to automate the process. For this reason, the Blender application requires the specific setup and naming conventions detailed in Figure 4.3. If desired, the Blender scene can be adjusted to account for new setups (see [2]).

4.2.2 Python script

A Python script was developed to, upon outer activation, automate the workflow of the Blender Motion Capture Addon and create the sought animation without any user involvement. In order to make the script launchable from Matlab, it can be run directly from the system console (which Matlab can access), with all its required variables passed in a single run command. This also allows Blender to run without any of its visual components, which significantly reduces the workload on the computer and frees more resources for the rendering itself.

The script first loads the motion capture data specified by the command line variables. It imports the .bvh file and assigns it to the mannequin via their predefined relations. In a similar fashion, the .csv file is then read. Since Blender has no support for reading .csv files, this is done via the csv reader in Python and therefore entirely within the script. The data is then iterated through frame by frame, performing the following manipulations of the Blender scene:



Figure 4.3: The naming convention used for the motion capture data. This is the naming convention that the visualization tool is designed for. To adapt the tool for another convention, see [2].



1. The point of origin (location) of the cylinder is updated to match the location data of the baton top.

2. The location of the empty is updated to match the location data of the baton midpoint. (Thus, due to the IK constraint, the cylinder aligns with the baton.)

3. The locations of the two spheres are updated to match the location data of the respective bottom points on the baton.

4. The above changes are stored as keyframes on that specific frame, and we proceed to the next frame.

After this procedure, some rendering settings are set up and the script then renders the entire animation as .png images. Though not the most efficient in terms of storage and playback, storing the animation as separate images such as .png files has some advantages: firstly, the already rendered part of the animation is less susceptible to corruption if the rendering is interrupted; secondly, the Matlab-based GUI would otherwise have to take a video file such as an .avi and break it up into separate images to handle the side-by-side rendering, which would mean unnecessary computation compared with reading already separate images one by one.

4.2.3 Handling corrupt or missing data

In some cases, the raw data is missing a line or data object. To account for this, the csv reading and baton handling simply do not add any keyframe at the corrupt frame, instead allowing the graphics engine to natively move the baton smoothly between the surrounding keyframes.

In other cases, the last column of data in the .csv file is followed by several semicolons at the end. Since this column belongs to one of the spheres at the bottom of the baton, neither sphere is used when animating these cases, meaning that it is more difficult to follow any exact twists of the baton. Everything else (the cylinder, the empty and the mannequin) behaves as normal.
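The skip-the-corrupt-row strategy can be illustrated in plain Python. The csv layout below is a simplified stand-in for the actual baton files: rows that fail to parse yield no keyframe, and playback interpolates across the gap.

```python
def parse_baton_rows(lines):
    """Return {frame: (x, y, z)} keyframes, skipping corrupt or missing rows."""
    keyframes = {}
    for frame, line in enumerate(lines):
        fields = [f for f in line.strip().split(";") if f != ""]
        try:
            x, y, z = (float(v) for v in fields[:3])
        except ValueError:
            continue  # corrupt row: add no keyframe, let interpolation bridge it
        keyframes[frame] = (x, y, z)
    return keyframes

# Simplified example rows: the middle one is corrupt, the last has trailing semicolons.
rows = ["1.0;2.0;3.0", "bad;;data", "2.0;3.0;4.0;;;"]
kf = parse_baton_rows(rows)  # frame 1 gets no keyframe
```

In the real tool the surviving keyframes are written into the Blender scene, and Blender's animation system does the in-between motion natively.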


Chapter 5

GP-LVM and the classification framework

This chapter details how the motion data is processed and how the features are extracted. As the process follows Karipidou's thesis, this work recounts the most important principles needed to understand their application, and refers to the thesis [8] and the original method it follows [10] for more detailed descriptions of the underlying concepts.

5.1 The data

This work uses the motion capture recordings of Norman's Andante Sostenuto from Karipidou's thesis [8]; we therefore refer to that thesis for a more detailed description of the data. In short, the motion capture was recorded in a 3D motion capture studio at KTH, where the conductor was conducting a string quartet. Twenty recordings were made: five recordings for each emotion. The data for the conductor's body is stored in a .bvh file, while the baton data is stored in a .csv file.

The .bvh file holds the rotation channels of each bone in the armature in Euler angles [14]. The .csv file contains the positions of the baton's tip, two bottom points and a calculated midpoint, as well as its rotation data in quaternions [8].

5.2 Feature extraction

From the .bvh files, the Euler angles are trivially extracted, while the quaternions in the .csv files are converted to Euler angles using the Matlab function quat2eul (available in Matlab R2015 and later). As described in section 4.2.3, there are some cases where a few frames are missing in the recording. These were handled by interpolating between the known angle data and taking the results as the intermediate poses.
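The quaternion to Euler angle conversion performed by quat2eul (ZYX rotation sequence by default in Matlab) can be sketched as below. This is an illustrative reimplementation, not the thesis code.

```python
import math

def quat_to_euler_zyx(w, x, y, z):
    """Convert a unit quaternion to ZYX Euler angles (yaw, pitch, roll) in radians."""
    yaw = math.atan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    pitch = math.asin(max(-1.0, min(1.0, 2 * (w * y - z * x))))  # clamped for safety
    roll = math.atan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))
    return yaw, pitch, roll

# A 90° rotation about the z axis: quaternion (cos 45°, 0, 0, sin 45°).
angles = quat_to_euler_zyx(math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
```

The clamp in the pitch computation guards against floating point values marginally outside [-1, 1]; when the argument reaches ±1 exactly, the rotation is at the gimbal lock configuration discussed in section 5.2.1.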

A musical conductor communicates a lot through the arms, hands and baton. We therefore expect great angular movement in the different limbs of the arms and, of course, in the baton. The histogram in Figure 5.2 shows that there is a great difference in variance between the different dimensions. One of the dimensions, the z-rotation of the baton, stands out with a very large variance; the reason behind this is discussed in the next subsection. Because the low variance dimensions will not contribute much information to the model [8], they are sorted out. A threshold is set at a variance of 245, which keeps the 17 dimensions with the highest variance (Figure 5.3b). The remaining dimensions after this procedure are shown in Figure 5.4.

Figure 5.1: Example of an angle bug in the motion capture data. In the seventh recording, some channels experience a sudden rotation of -360°. These occasions are handled in the program by translating the angles back into the interval -180° < θ < 180°.
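The variance based pruning described above amounts to computing the per dimension variance over all frames and keeping only the dimensions whose variance exceeds the threshold (245 in our case). A numpy sketch with made-up data:

```python
import numpy as np

def high_variance_dims(data, threshold):
    """Keep the columns (dimensions) of data whose variance exceeds threshold."""
    variances = data.var(axis=0)
    keep = np.where(variances > threshold)[0]
    return data[:, keep], keep

rng = np.random.default_rng(1)
# Three synthetic angle channels with standard deviations 1, 30 and 0.1 degrees.
frames = rng.normal(scale=[1.0, 30.0, 0.1], size=(1000, 3))
reduced, kept = high_variance_dims(frames, threshold=245.0)
# Only the channel with standard deviation 30 (variance around 900) survives.
```

In the thesis the same selection is applied to the 42 extracted angle dimensions, leaving 17.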

5.2.1 Difficulties with Euler angles

As mentioned above, this work represents the angles as Euler angles. However, Euler angles suffer from a discontinuity when representing angles near 180°, where the value leaps from 180° to -180° [11]. As such, two angles in close proximity in reality can be represented as far apart when expressed in Euler angles. This might generate noise in the latent space, as poses end up far from their expected positions.

In some of the recordings (AS_6, AS_7 and AS_15 as they are called in [8]), a bug was discovered where some channels subtracted 360° or 720° in the middle of the recording. An example of this can be seen in figure 5.1. For the sake of consistency, these were translated in preprocessing to stay within -180° < θ < 180°. Another concern is gimbal lock, where the rotation in Euler angles renders one of the three rotation axes unavailable. This is a difficulty not only when using Euler angles to rotate an object, but also when converting quaternions into Euler angles [3]. Representing all angles with quaternions instead was considered, as they lack several of these problems, but this was not completed due to time constraints.
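The translation back into the -180° to 180° interval used for the buggy recordings amounts to a modular wrap, which can be sketched as:

```python
def wrap_degrees(theta):
    """Map an angle in degrees into [-180, 180) by adding/subtracting multiples of 360."""
    return ((theta + 180.0) % 360.0) - 180.0

# A channel that suddenly subtracted 360° (e.g. 30° recorded as -330°) is restored:
print(wrap_degrees(-330.0))  # 30.0
print(wrap_degrees(-690.0))  # 30.0 (a -720° jump)
```

Note that the wrap maps both 180° and -180° to -180°, the single boundary value of the half-open interval.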



Figure 5.2: A histogram showing the number of dimensions belonging to different ranges of variances.

Figure 5.3: The variances for each of the 42 dimensions and the threshold at variance 245. The numbers 1 to 13 are the Euler angles for the body motion, and number 14 the corresponding angles for the baton. (a) is the full histogram, while (b) is a close-up of the same.



Figure 5.4: The remaining 17 dimensions after maximum variance extraction. They are sorted in descending order, beginning with the baton z-rotation.

5.3 The GP-LVM modelling

From the extracted angle data, a GP-LVM with a dynamical prior is created, following the same procedure as in [8, p. 33]. The GP-LVM is trained on 160 snippets of the training data, 40 from each emotion, each snippet containing 30 consecutive frames. To avoid bias, these snippets are selected at random and thus do not match the snippets from Karipidou's work (which were also selected at random [8]).

5.4 The classification framework

The framework used to classify the four emotional interpretations is the same as the one used in Karipidou's work [8, p. 36] and originally in [4].

A brief summary of the classification procedure in the aforementioned sources is given here:

1. Clustering of the latent space

The lower dimensional latent space representation of the training data, generated by a GP-LVM, is segmented into several clusters using the K-means algorithm. The cluster centroids will represent the states of the HMM.

2. Transitional matrices

First, each frame of the training data is assigned to its most probable cluster mean. For each emotion and all of its corresponding cluster assignment data, a transitional matrix is calculated by counting the actual state traversals that are made, i.e. the number of times state i follows state j, for all i = 1, 2, ..., N and j = 1, 2, ..., N. In order to get the transitional probabilities, the matrix element values are normalized.
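The counting and normalization of step 2 can be sketched as follows: given a sequence of cluster assignments, count how often each state follows each other state and normalize every row into probabilities. An illustrative sketch:

```python
import numpy as np

def transition_matrix(states, n_states):
    """Row-normalized counts: entry [i, j] estimates P(next = j | current = i)."""
    counts = np.zeros((n_states, n_states))
    for cur, nxt in zip(states[:-1], states[1:]):
        counts[cur, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # rows with no outgoing transitions stay all-zero
    return counts / row_sums

# Cluster assignments for six consecutive frames, three states.
A = transition_matrix([0, 0, 1, 2, 0, 1], n_states=3)
```

In the thesis, one such matrix is estimated per emotion from that emotion's cluster assignment sequences.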

3. Observation matrices

One observation matrix is created for each test file. It contains the probabilities of how likely it is that an observation Y, which is one frame of the conductor motion data, was caused by the hidden state c_k. If the number of observations is K and the number of hidden states is N, the observation matrix contains the likelihoods p(Y_i | c_k) for i = 1, 2, ..., K and k = 1, 2, ..., N.

4. The optimal HMM path

For each test file, the optimal state sequence that results from each emotion's associated transition and observation matrix is calculated, using the Viterbi algorithm.

5. Classification

The previous step results in four emotional paths for each test file. For each test file, the probability of each emotional path is calculated: the transitional and observational log probabilities along the path are summed, the exponential of this sum is taken, and the four resulting probabilities are normalized. Each test file is classified as the emotion with the highest probability.
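The counting-and-normalizing of step 2 and the path scoring of step 4 can be sketched as follows. This is a hedged Python illustration of the general HMM machinery, not the thesis' Matlab implementation; the toy state sequences and the two "emotions" are invented for the example:

```python
import math

def transition_matrix(state_seq, n_states, eps=1e-12):
    # Step 2: count how often state j is followed by state i, then
    # row-normalize into transition probabilities (eps avoids log(0)).
    counts = [[eps] * n_states for _ in range(n_states)]
    for prev, nxt in zip(state_seq, state_seq[1:]):
        counts[prev][nxt] += 1
    return [[c / sum(row) for c in row] for row in counts]

def best_path_logprob(obs_loglik, log_trans):
    # Step 4 (Viterbi recursion in the log domain): obs_loglik[k][i] is
    # the log-likelihood of frame k under hidden state i.
    n = len(log_trans)
    delta = list(obs_loglik[0])
    for frame in obs_loglik[1:]:
        delta = [max(delta[j] + log_trans[j][i] for j in range(n)) + frame[i]
                 for i in range(n)]
    return max(delta)

# Toy example: 3 clusters, two "emotions" with distinct dynamics.
models = {"cyclic": [0, 1, 2, 0, 1, 2, 0, 1, 2],
          "sticky": [0, 0, 0, 1, 1, 1, 2, 2, 2]}
log_trans = {name: [[math.log(p) for p in row]
                    for row in transition_matrix(seq, 3)]
             for name, seq in models.items()}
# Observation log-likelihoods for a cyclic test sequence, peaked on the true state.
test_states = [0, 1, 2, 0, 1, 2]
obs = [[0.0 if i == s else math.log(1e-3) for i in range(3)]
       for s in test_states]
scores = {name: best_path_logprob(obs, lt) for name, lt in log_trans.items()}
best = max(scores, key=scores.get)
print(best)  # the cyclic model explains the cyclic test sequence best
```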


Chapter 6

Experiments and Results

Two experiments were made in order to compare an angle based approach to Karipidou's position based model. They recreate the experiments "Experiment 2: The latent space representation of the AS data" and "Experiment 3: Classification" in Karipidou's thesis [8], except utilizing the angle based GP-LVM model instead.

6.1 The latent space representation of the body motion data

In this experiment, a latent space is created (as per the description in chapter 5) from the recordings of different emotions. This latent space is then color coded, using the same colors as in [8]. The latent space created with the training data is shown in figure 6.1, and Karipidou's equivalent is portrayed in figure 6.3. There seems to be little distinction between the different emotional poses, with large overlaps and no distinct regions exclusive to a specific emotional sequence.

As the classification framework investigates sequences of frames rather than independent frames, the overlapping emotions pictured do not give definite information as to whether or not the model can make any distinctions in the emotional interpretations; the next experiment goes into further detail on that topic. However, it is noteworthy that the Tender case is practically entirely overlapped and also somewhat more centered than the other emotional interpretations. In comparison, Karipidou's model is more evenly spread across its latent space, while the angular model is more spread out in one of its two latent dimensions (portrayed along the vertical axis). Another interesting difference is that the angle based model seems to contain noise (that is, isolated points in the latent space) to a much higher degree than the position based model, despite being developed from the very same recordings. More precisely, the sequences in the angle based latent space contain a certain amount of discontinuity.



6.2 Classification of the conductor's underlying emotion

In this experiment, one recording of each emotion is run through the classification framework in accordance with [8, 5] to investigate the precision of the developed model. In order to classify the conductor's movements, a GP-LVM is first created from the high dimensional feature data, as described in chapter 5. This step reduces the 17 maximum variance dimensions of the recordings down to two.

The test recordings were chosen to be the same as in Karipidou's experiment, and the classification was made with five different clustering setups, similar to [8]. The resulting probabilities for the classifications are listed in table 6.1, in which the rows correspond to the conductor's actual interpretation and the columns to the resulting emotion chosen by the classification framework. For the probabilities Karipidou reached, we refer to her thesis [8]. For ease of comparison with those results, even minuscule probabilities outside of the precision of Matlab are accounted for. These probabilities were calculated using the online tool Wolfram Alpha, as done in [8].
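Probabilities as small as those in table 6.1 (e.g. 10^-271) underflow IEEE double precision, which is why an arbitrary-precision tool was needed. As a side note, the final normalization of the four path probabilities can instead be carried out entirely in the log domain with the log-sum-exp trick; the hedged Python sketch below uses invented path scores:

```python
import math

def normalize_log_probs(log_p):
    # Compute p_i = exp(l_i) / sum_j exp(l_j) without underflow by
    # subtracting the maximum log value before exponentiating.
    m = max(log_p)
    z = m + math.log(sum(math.exp(l - m) for l in log_p))
    return [math.exp(l - z) for l in log_p]

# Four path log-probabilities far below what exp() can represent directly.
log_paths = [-5000.0, -5046.0, -5200.0, -5300.0]
probs = normalize_log_probs(log_paths)
print(probs[0])    # very close to 1
print(sum(probs))  # the normalized probabilities sum to 1
```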

The results of the classification imply that the overlap described in section 6.1 renders the angle based model unreliable in classifying any of the four emotional interpretations. This is in contrast to Karipidou's location based model, which could correctly classify the emotions Tender and Neutral in all recordings [8]. In addition, the angle based classification gives an overall higher certainty in its classifications, despite being incorrect in almost all of them.



10 clusters

  Probability of being classified as emotion
  Actual emotion   Tender             Neutral            Passionate          Angry
  Tender           1                  4.21396 · 10^-20   6.34926 · 10^-105   1.21840 · 10^-59
  Neutral          1                  1.18703 · 10^-31   1.12099 · 10^-38    1.22610 · 10^-32
  Passionate       0.99999            2.25823 · 10^-8    4.41028 · 10^-75    8.93868 · 10^-38
  Angry            0.01129            0.98870            3.14361 · 10^-96    1.38432 · 10^-21

20 clusters

  Probability of being classified as emotion
  Actual emotion   Tender             Neutral            Passionate          Angry
  Tender           0.00007            0.99993            3.18006 · 10^-33    3.62804 · 10^-23
  Neutral          5.62529 · 10^-21   1                  1.68233 · 10^-88    7.74318 · 10^-24
  Passionate       1                  1.31530 · 10^-33   2.67776 · 10^-271   4.58510 · 10^-141
  Angry            1                  6.04111 · 10^-10   7.63952 · 10^-232   1.13495 · 10^-124

50 clusters

  Probability of being classified as emotion
  Actual emotion   Tender             Neutral            Passionate          Angry
  Tender           3.72671 · 10^-36   1                  1.34883 · 10^-93    3.05354 · 10^-93
  Neutral          0.99544            0.00455            9.28253 · 10^-112   1.91197 · 10^-72
  Passionate       1                  1.36697 · 10^-44   3.20464 · 10^-115   4.88465 · 10^-163
  Angry            1                  2.61690 · 10^-38   2.58292 · 10^-48    4.07316 · 10^-95

100 clusters

  Probability of being classified as emotion
  Actual emotion   Tender             Neutral            Passionate          Angry
  Tender           1                  8.65339 · 10^-73   4.09934 · 10^-85    7.30680 · 10^-11
  Neutral          1                  7.45291 · 10^-13   2.14992 · 10^-42    3.71383 · 10^-87
  Passionate       2.33397 · 10^-12   1                  4.14773 · 10^-78    1.47737 · 10^-159
  Angry            5.60644 · 10^-8    0.99999            5.88310 · 10^-25    8.25743 · 10^-76

125 clusters

  Probability of being classified as emotion
  Actual emotion   Tender             Neutral            Passionate          Angry
  Tender           3.68393 · 10^-8    0.99999            3.09901 · 10^-173   5.54192 · 10^-107
  Neutral          1                  6.05234 · 10^-52   7.36373 · 10^-228   9.46262 · 10^-112
  Passionate       1                  9.41063 · 10^-105  1.23259 · 10^-131   1.89732 · 10^-126
  Angry            0.01129            1.20091 · 10^-19   5.43475 · 10^-59    1.43440 · 10^-138

Table 6.1: The confusion matrices for each of the cluster setups, depicting the probabilities for classifying the recordings as each emotion.



Figure 6.1: The two dimensional latent data of our 17-dimensional training data, generated by the GP-LVM. The four different emotional interpretations are shown in different colors; yellow - tender, blue - neutral, green - passionate and red - angry. Below, these emotional interpretations are separated for clarity.



Figure 6.3: The two dimensional latent data of Karipidou's 24-dimensional training data. The four different emotional interpretations have the same colors as in figure 6.1: yellow - tender, blue - neutral, green - passionate and red - angry.


Chapter 7

Discussion and conclusions

The aim of this study was, firstly, to build a visualization aid by which quick evaluation of a developed GP-LVM can be made; and secondly, to train a new angle-based GP-LVM and compare this new model to Karipidou's position-based model. Each is given a section below, but together they strive to expand or aid the research on modelling human motion through machine learning, and thus gain further insights into non-verbal communication.

The main conclusion is that the visualization tool proved helpful when a GP-LVM was to be investigated, and that the new angle-based GP-LVM did not improve the classification results compared with Karipidou's position-based model; as such, Euler angles are not recommended.

7.1 The visualization tool

Although the experiment section of this thesis has focused on the GP-LVM classification, and thus no formal analysis of the visualization tool has been performed, some subjective notes can be made on the end result. As is often the case when reusing another's source code, making everything work on another computer and gaining a complete understanding of the program is draining. When developing this tool, care has therefore been taken to make it and its entire content system independent, and to provide a central place for accessing the variables most important for running the GP-LVM. It was also designed to lessen the user's level of involvement in the source code, requiring no experience with Blender nor, essentially, with the machinations of the GP-LVM framework. It is easy to choose a GP-LVM and motion capture data, and once the visualization is finished one gets a quick feel for the way the model maps the data; for instance, it has given us some insights used in the discussion of the classification below.

When it comes to the visualization of the motion capture data, Blender was chosen over the Matlab plot tool used in Karipidou's work because Blender's 3D environment gives a much better perception of depth in the character. This limits potential misinterpretations of the motion which could otherwise be harmful for understanding the relation between the conductor's motions and the motions in the latent space.

One issue that arose during the work on the visualization tool was long processing times, mainly on account of calculating the most likely point in latent space for each frame, which is a time-consuming process. To account for the long loading times, the tool is programmed to check whether the computation the user requests has been done previously, in which case the stored results are loaded instead of recalculated.
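The caching behaviour described above can be sketched as follows. This is a hedged Python illustration of the pattern, not the tool's Matlab code; the function names, the pickle format and the per-file cache key are all assumptions:

```python
import os
import pickle
import tempfile

def latent_points_cached(mocap_file, compute_fn, cache_dir):
    # Return latent-space points for a recording, loading a cached
    # result from disk when the computation was already done once.
    os.makedirs(cache_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(mocap_file))[0] + ".pkl"
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn(mocap_file)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Demo: the expensive computation runs only once for the same file.
calls = []
def expensive(path):
    calls.append(path)  # stand-in for the per-frame GP-LVM projection
    return [(0.12, -0.98), (0.15, -0.91)]

demo_dir = tempfile.mkdtemp()
a = latent_points_cached("take1.bvh", expensive, demo_dir)
b = latent_points_cached("take1.bvh", expensive, demo_dir)
print(len(calls))  # 1 - the second request was served from the cache
```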

7.1.1 Conclusions

As per the discussion above, the tool can be concluded to work well for its purpose. As intended, by visualizing the pose and the resulting mapping just next to each other, the user can get a perception of how certain motion patterns behave in the latent space.

7.1.2 Future directions

The main difficulty with the tool at present is its processing time for new motion capture data. This time is spent rendering the motion capture data in Blender and finding, frame by frame, the most likely point in latent space with the GP-LVM program.

If a graphics card is installed, the rendering in Blender can be reprogrammed to utilize the GPU instead of the CPU. This has two advantages: firstly, as the GPU is geared specifically towards rendering, the rendering time will decrease; secondly, transferring the Blender processes to the GPU frees up more resources on the CPU for the GP-LVM reduction, which is run in parallel.
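As a sketch of the suggested change, the switch to GPU rendering is a few scene settings in Blender's Python API. This only runs inside Blender, and the exact property paths differ between Blender versions (the preferences path below is the 2.7x-era one), so treat it as an assumption to verify against the API documentation of the version in use:

```python
import bpy  # only available inside Blender

scene = bpy.context.scene
scene.render.engine = 'CYCLES'      # Cycles is the engine with GPU compute
scene.cycles.device = 'GPU'

# Enable the detected compute devices (newer Blender versions expose this
# under bpy.context.preferences instead of user_preferences).
prefs = bpy.context.user_preferences.addons['cycles'].preferences
prefs.compute_device_type = 'CUDA'  # or 'OPENCL', depending on the card
for device in prefs.devices:
    device.use = True
```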

In some poses, the directional light used for the director casts rather strong shadows, making it difficult to clearly see a hand or the baton. The directional lights are placed to emphasize the sense of depth in the animation, but some tweaking might be desirable. For such actions, refer to [2].

7.2 Classification

The new angle based GP-LVM model and its latent space mapping were analyzed in a qualitative sense in section 6.1. The sequences of the four different emotional interpretations seemed to overlap in such a way that it would be difficult for a classification algorithm to accurately distinguish them. Indeed, the results in table 6.1 show a poor success rate for the classification algorithm.

Using the newly developed visualization tool, the angle based model was shown to generate points in close proximity in latent space for many of the smaller motions of the conductor, including entire bars of motion. This area also coincides well with the area within which the tender motion is most concentrated, as is expected since the visualizations of the conductor's motions for a tender recording seldom show any sweeping motions. However, this indicates that several sessions have a large concentration of poses within that area (the music bars with smaller motions), and while such poses are then included in all emotional interpretations, the high concentration of the tender interpretation might result in it being considered the most probable solution by the classification algorithm. If so, this could explain why tender is frequently reached by the algorithm. Similarly, a more general sweeping motion seems to lead towards an area where the neutral interpretation has a high concentration, possibly making it the preferred emotional interpretation for some of the files.

As noted in section 6.1, the GP-LVM model trained on body data represented as Euler angles also resulted in discontinuities in the latent space. Seeing as this noise is not present in Karipidou's model, which is trained on the same recordings but with position data, it is suspected to be the result of using Euler angles and the difficulties they pose, which were discussed in section 5.2.

7.2.1 Conclusions

With the results from the qualitative experiment in section 6.1 and the classification results in section 6.2, it can be concluded that, when evaluating the GP-LVM with this particular data set, the position based model proposed by Karipidou outperforms the angle based model in classifying the different sequences. Our model together with the classification algorithm fails to distinguish any such emotional interpretations.

7.2.2 Future directions

First and foremost, the success of a machine learning model such as this depends on having plenty of data for training purposes. An increased amount of training data increases the accuracy and stability of the model, and thus a prime recommendation is to expand the limited data currently available.

As per the discussion above, using Euler angles for this application is not recommended. In this work, quaternions were avoided due to time constraints and uncertainty on the authors' part as to how to investigate their variance when choosing features to extract. If this could be addressed, a quaternion based model would be of interest, as it does not suffer from the same difficulties as Euler angles, as touched upon in section 5.2.1.
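As an illustration of this suggested direction, an Euler triple can be converted to a quaternion with a standard formula; unlike the Euler angles themselves, the quaternion varies smoothly through the pitch singularity. The sketch below is a generic Python implementation, not code from this work, and the Z-Y-X rotation order is an assumption that would need to match the mocap data:

```python
import math

def quat_from_euler_zyx(yaw, pitch, roll):
    # Unit quaternion (w, x, y, z) from Z-Y-X Euler angles in radians,
    # using the standard yaw-pitch-roll conversion.
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    return (cr * cp * cy + sr * sp * sy,
            sr * cp * cy - cr * sp * sy,
            cr * sp * cy + sr * cp * sy,
            cr * cp * sy - sr * sp * cy)

q = quat_from_euler_zyx(0.3, 1.2, -0.5)
print(sum(c * c for c in q))     # unit norm: 1.0
print(quat_from_euler_zyx(0, 0, 0))  # identity rotation: (1.0, 0.0, 0.0, 0.0)
```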

A model containing more concentrated information can be created with better data cleaning as a first step. Each mocap recording contains a calibration phase, which occupies a relatively large section of the whole recording. In this calibration phase, the conductor stands straight with his arms stretched out. Including this calibration phase in the GP-LVM data will affect the transitional probabilities, which in turn can affect the classification result. Therefore, although it was left in within this work to better compare with Karipidou's model, it is recommended to trim the data to begin with the actual performance instead.



Furthermore, in order to improve the classification accuracy, data that represent the intensity of the movement could be added, for example the velocity or acceleration of the motion capture sensors. Adding such information to the model would put a greater emphasis on the dynamics of the motions; currently, only static poses describe the solution.
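Such intensity features could, for instance, be finite-difference estimates per sensor coordinate. The sketch below is a hedged Python illustration; the frame interval `dt` and the single-coordinate input are assumptions for the example:

```python
def dynamics_features(positions, dt):
    # Central-difference velocity and acceleration for one sensor
    # coordinate; interior frames only, to keep the stencil symmetric.
    vel = [(positions[i + 1] - positions[i - 1]) / (2 * dt)
           for i in range(1, len(positions) - 1)]
    acc = [(positions[i + 1] - 2 * positions[i] + positions[i - 1]) / dt ** 2
           for i in range(1, len(positions) - 1)]
    return vel, acc

# Uniform motion: constant velocity, zero acceleration.
xs = [0.5 * i for i in range(6)]
vel, acc = dynamics_features(xs, dt=1.0)
print(vel)  # constant 0.5 at every interior frame
print(acc)  # zero at every interior frame
```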

Bibliography

[1] J. Borchers, E. Lee, W. Samminger, and M. Mühlhäuser. Personal orchestra: a real-time audio/video system for interactive conducting. Multimedia Systems, 9(5):458–465, 2004.

[2] B. Cook. Motion Capture Addon. Google Summer of Code, 2011. https://wiki.blender.org/index.php/User:Benjycook/GSOC/Manual. Accessed: 2016-05-19.

[3] E. B. Dam, M. Koch, and M. Lillholm. Quaternions, interpolation and animation. Datalogisk Institut, Københavns Universitet, 1998.

[4] A. Davies, C. H. Ek, C. Dalton, and N. Campbell. Facial movement based recognition. In Computer Vision/Computer Graphics Collaboration Techniques, pages 51–62. Springer, 2011.

[5] C. H. Ek, P. H. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In Machine learning for multimodal interaction, pages 132–143. Springer, 2008.

[6] R. Gallager. Stochastic processes: theory for applications. Cambridge University Press, 2013. http://www.rle.mit.edu/rgallager/documents/6.262lateweb3.pdf. Accessed: 2016-02-20.

[7] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning. Springer, 2013.

[8] K. Karipidou. Modelling the body language of a musical conductor using Gaussian process latent variable models. Master's thesis, Kungliga Tekniska Högskolan, 2015.

[9] S. Lague. Rigged wooden mannequin. http://www.blendswap.com/blends/view/45969. Accessed: 2016-03-15.

[10] N. D. Lawrence. Gaussian process latent variable models for visualization of high dimensional data. Advances in Neural Information Processing Systems, 16(3):329–336, 2004.

[11] M. H. Ang Jr. and V. D. Tourassis. Singularities of Euler and roll-pitch-yaw representations. Technical report, College of Engineering & Applied Science, University of Rochester, New York, 1986.

[12] M. L. Knapp, J. A. Hall, and T. G. Horgan. Nonverbal communication in human interaction. Cengage Learning, 2013.

[13] MathWorks. Creating Apps with Graphical User Interfaces in MATLAB. http://se.mathworks.com/discovery/matlab-gui.html. Accessed: 2016-03-26.

[14] Computer Graphics Group, University of Wisconsin-Madison. Biovision BVH. http://research.cs.wisc.edu/graphics/Courses/cs-838-1999/Jeff/BVH.html. Accessed: 2010-05-15.

[15] S. Quirion, C. Duchesne, D. Laurendeau, and M. Marchand. Comparing GPLVM approaches for dimensionality reduction in character animation. 2008.

[16] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.
