
Modeling Time-Series with Deep Networks


Örebro Studies in Technology 63

MARTIN LÄNGKVIST

Modeling Time-Series with Deep Networks


© Martin Längkvist, 2014

Title: Modeling Time-Series with Deep Networks
Publisher: Örebro University 2014, www.oru.se/publikationer-avhandlingar
Print: Örebro University, Repro 12/14
ISSN 1650-8580
ISBN 978-91-7529-054-6


Abstract

Martin Längkvist (2014): Modeling Time-Series with Deep Networks. Örebro Studies in Technology 63

Deep learning is a relatively new field that has shown promise in a number of applications and is currently outperforming other algorithms on a variety of commonly used benchmark data sets. One attractive property of deep learning methods is that they take advantage of unlabeled data, which is plentiful and cheaper to obtain than labeled data, in order to construct their own features, which reduces the need for heavy pre-processing of the data. However, much focus has been on applications and methods for static data and not so much on time-series data.

Learning models of complex high-dimensional time-series data introduces a number of challenges that either require modifications to the learning algorithms or special pre-processing of the data. Some of the signals in multivariate time-series data are often redundant since they come from sensors with different properties or spatial locations but measure the same real-world phenomenon. Furthermore, sensors are subject to errors in the measurements due to faults, noise, and sensor bias.

Therefore, a common approach to analysing multivariate time-series data is to heavily pre-process the data to reduce the noise and complexity with noise-reduction techniques, feature extraction, and signal removal. However, many of these steps require expertise of the domain, which is difficult and expensive to acquire, or could even be non-existent. The primary contribution of this thesis is a set of algorithmic modifications to a deep learning algorithm that enable it to better handle multivariate time-series data. The aim is to change the amount of impact each input signal has on the feature learning. This reduces the influence that noisy or task-irrelevant inputs have on the learned features.

The secondary contribution of this thesis is an investigation of the feasibility of constructing features from unlabeled raw time-series data. An advantage of using deep networks is the promise of unsupervised feature learning that removes the need to manually hand-design features. However, many of the reported successful applications that use deep learning, and especially those applied to time-series data, have still used some form of feature extraction as a pre-processing step. This thesis investigates the importance of feature extraction for time-series data by comparing the performance of a deep network trained on raw data with models trained on feature-transformed data.

A final contribution of this thesis is the application of deep learning methods to new data sets that can follow the success deep learning methods have had in computer vision applications. This thesis takes the first step by using new, challenging, and interesting multivariate time-series data sets and suggests that they can be used as benchmark data sets in order to further develop deep learning algorithms specific to multivariate time-series data.

Keywords: multivariate time-series, deep learning, representation learning, unsupervised feature learning, selective attention, weighted cost function, electronic nose, automatic sleep stage classification

Martin Längkvist, School of Science and Technology

Örebro University, SE-701 82 Örebro, Sweden, martin.langkvist@oru.se


Acknowledgements

First of all I would like to thank my supervisor Amy Loutfi for all her guidance during all aspects of my research. From formulating the topic of the thesis to the very last sentence she has guided me in the right directions, usually in the disguise of rhetorical thought-provoking questions. Thank you Amy for allowing me the opportunity to explore the topics that I'm passionate about and for supporting me. I am grateful to my opponent and the committee members that took the time to review my work. A big thanks goes to my co-supervisor Lars Karlsson for his detailed revisions of papers and the final thesis. I am especially grateful to Lars Karlsson, Federico Pecora, and Mathias Broxvall for giving me the initial opportunity to work with The Comfortable Sleeplab project, which taught me a lot about machine learning and inspired the pursuit of new state-of-the-art methods. I also thank Silvia Coradeschi for her encouragement and, together with Amy, for a pleasant visit to SASTRA University in India.

From SASTRA University and the Centre for Nanotechnology & Advanced Biomaterials lab (CeNTAB), which I had the privilege to visit during my research, I would like to give a special thanks to Prof. John Bosco Balaguru Rayappan for the informative and pleasant visits in Sweden and India during the Indo-Sweden project. I also thank Ganesh Kumar Mani and Prabakaran Sankar for being such good friends and making me feel I have a second home in India.

There are many people at the AASS group that I would like to thank. From the Cognitive Robotic Systems lab I thank Marjan Alirezaie, Annica Kristoffersson, Andreas Persson, and Hadi Banaee for the support and being good friends. I am also grateful to Achim Lilienthal, Marco Trincavelli, Victor Hernandez Bennetts, Sepideh Pashami and Sahar Assadi from the Mobile Robotics and Olfaction lab for taking interest in my research regarding electronic nose data analysis.

From Örebro University Hospital, with which I was involved in several projects, I would like to thank Lena Leissner and Meeri Sandelin at the sleep unit of the neuroclinic for the interesting discussions about sleep data analysis.

From the microbiology lab I would like to acknowledge Lena Barkman for the bacteria sample preparation. I also thank Bo Söderquist and Per Thunberg for encouraging research collaborations between the university and the hospital.

A big thank you goes to my family, who always take an interest in whatever project I am involved in. Finally, my biggest gratitude goes to Shan for always cheering for me and giving me suggestions and ideas that provided new perspectives.


List of Publications

This work is a compilation of journal and workshop papers. Below is a list with a description of each publication and an index that is used as reference throughout the rest of this compilation thesis:

Paper I Martin Längkvist, Lars Karlsson, and Amy Loutfi, A Review of Unsupervised Feature Learning and Deep Learning for Time-Series Modeling, Pattern Recognition Letters, Volume 42, Pages 11-24, 2014

This paper reviews recent work using deep learning techniques for time-series problems. The paper first gives an introduction to the properties of time-series data and what makes them challenging. A brief introduction to deep learning models is then provided, as well as how these can be modified to make them more suitable for time-series data. Some additional tools for modeling temporal coherence, such as convolution and regularization, are briefly covered. The rest of the paper looks at the recent work for time-series modeling categorized by application. The applications covered are speech recognition, music recognition, motion capture modeling, electronic nose data classification, and medical data modeling.

Paper II Martin Längkvist, Lars Karlsson, and Amy Loutfi, Sleep Stage Classification using Unsupervised Feature Learning, Advances in Artificial Neural Systems, vol. 2012, Article ID 107046, 9 pages, 2012

This paper applies a deep belief network (DBN) to polysomnography recordings to model brain activity, muscle tension, and eye movements of patients with sleep apnea. The DBN is trained on both raw data and on pre-defined features. The features are chosen by a combination of features used in the literature and hand-designed features. The results are compared to a traditional shallow Gaussian mixture model (GMM) architecture. The best classification results were achieved with the DBN using hand-picked features. A surprising result is that a DBN using raw data outperformed the GMM with features, which means that the DBN is able to construct features from raw data that are more meaningful than some of the ones used in the literature. The trained DBN is then used in an experiment of finding anomalous data collected in a different environment than the training data, namely from data collected in the home of a healthy person using self-applied electrodes. The reconstruction error is measured over time and it is shown how this can give an indication of when an error in the data has occurred.

Paper III Martin Längkvist, Silvia Coradeschi, Amy Loutfi, and John Bosco Balaguru Rayappan, Fast Classification of Meat Spoilage Markers Using Nanostructured ZnO Thin Films and Unsupervised Feature Learning, Sensors, 13(2), 1578-1592, 2013

This paper concerns the use of deep belief networks and stacked auto-encoders for detecting gases that are described in the literature as markers for spoiled meat. Three gas sensors based on nanostructured zinc oxide with different dopants are exposed to gases with varying concentration. The task is to classify each sample into the correct gas, concentration, and type of sensor that is used. These sensors have been custom-made to react well to a specific gas. However, the response time for them could range from 15 seconds to 15 minutes. The focus in this paper is on achieving a fast classification using as little input data as possible. The results are compared to an approach where the features are based on knowing the maximum response. The result was that a deep model using only 6 seconds of input data outperformed a feature-based approach that required the full sensor response time.

Paper IV Martin Längkvist and Amy Loutfi, Learning Feature Representations with a Cost-Relevant Sparse Autoencoder, International Journal of Neural Systems, Accepted for publication, 2014

This paper presents an extension of a sparse auto-encoder that gives the model a selective attention on the input data. The proposed method is evaluated on a number of static image data sets that contain task-irrelevant information. The properties of the proposed method allow the model to learn features that ignore such task-irrelevant information and instead learn more task-relevant features. Two strategies for achieving selective attention are proposed. The first strategy is to use knowledge of the labels with a feature selection algorithm on the input data in order to rank the inputs and then focus more on reconstructing the highly-ranked inputs. The second strategy is unsupervised and does not require knowledge of the labels. Here the selective attention is instead decided by the reconstruction of each input, and an input with a higher reconstruction error is focused on less in order to save the representational capacity for other inputs.

The proposed method achieved better or comparable results on the MNIST variations data set and Jittered-Cluttered NORB compared to previous works that generally used a larger number of hidden units.

Paper V Martin Längkvist, Lars Karlsson, and Amy Loutfi, Selective Attention Auto-encoder for Automatic Sleep Staging, Biomedical Signal Processing and Control, Under revision, 2014

This paper uses the same algorithmic contribution as Paper IV and applies the idea of selective attention to a temporal auto-encoder on multivariate time-series data for the task of automatic sleep staging. While Paper II uses a DBN to construct new features from raw unlabeled data, this paper uses pre-defined features in order to instead focus on the advantage of selective attention. The two methods of a fixed and an adaptive selective attention are further explored. The results are that both methods of selective attention outperformed a standard auto-encoder and that the fixed method, which bases the selective attention on knowledge of the labels, was better than the unsupervised adaptive method.

Paper VI Martin Längkvist and Amy Loutfi, Unsupervised Feature Learning for Electronic Nose Data Applied to Bacteria Identification in Blood, NIPS workshop on Deep Learning and Unsupervised Feature Learning, 2011

This workshop paper is the first paper that used generative deep learning models on electronic nose data.

Paper VII Martin Längkvist and Amy Loutfi, Not all signals are created equal: Dynamic Objective Auto-Encoder for Multivariate Data, NIPS workshop on Deep Learning and Unsupervised Feature Learning, 2012

This workshop paper is the first paper that explored the idea of selective attention for time-series data. The ideas in this paper were further developed and resulted in Paper IV and Paper V.

Paper VIII Ganesh Kumar Mani, Prabakaran Sankar, Martin Längkvist, Amy Loutfi, and John Bosco Balaguru Rayappan, Detection of Spoiled Meat using an Electronic Nose, Sensors, Under revision, 2014

This paper uses a conditional Restricted Boltzmann Machine to detect spoiled meat. While Paper III showed that the gas and concentration of spoiled meat markers could be detected, this paper focuses on the actual application of detecting spoiled goat meat that has been exposed to Indian outdoor temperatures.


Contents

1 Introduction
  1.1 Contributions
  1.2 Organization

2 Representation learning
  2.1 Greedy layer-wise pre-training
  2.2 Overfitting and Regularization
  2.3 Hyperparameters
  2.4 Optimization
  2.5 Classification and regression
  2.6 Modules for deep learning
    2.6.1 Restricted Boltzmann Machine
    2.6.2 Conditional Restricted Boltzmann Machine
    2.6.3 Auto-Encoder
    2.6.4 Temporal Auto-Encoder

3 Representation learning for multivariate time-series data
  3.1 Sleep stage classification
    3.1.1 Odour identification with an electronic nose
  3.2 Discussion

4 Selective attention auto-encoder
  4.1 Proposed model
    4.1.1 Fixed weighting vector
    4.1.2 Adaptive weighting vector
  4.2 Selective Attention auto-encoder for Automatic Sleep Staging
  4.3 Character recognition on MNIST variations
  4.4 Object recognition on Jittered-Cluttered NORB
  4.5 Discussion

5 Conclusions
  5.1 Challenges and contributions
  5.2 Critical assessment
  5.3 Societal impact
  5.4 Future work

References


List of Figures

2.1 (a) RBM, (b) conditional RBM
2.2 (a) Auto-encoder, (b) Temporal auto-encoder
3.1 Polysomnograph data of the five stages of sleep during 30 seconds.
3.2 (a) A hypnogram for one night recording of a patient with sleep apnea. (b) Construction of visible layer from sleep data.
3.3 Data processing of the first experimental setup, raw-DBN.
3.4 The first 20 of 200 learned features in the first layer. The learned features are of various amplitudes and frequencies.
3.5 Estimated sleep stage (red) and true sleep stage (black) before applying HMM.
3.6 Estimated sleep stage (red) and true sleep stage (black) after applying HMM.
3.7 Data processing of the second experimental setup, feat-GOHMM.
3.8 Overall classification accuracy for one acquisition during feature selection with sequential backwards selection (SBS).
3.9 Individual sleep stage classification accuracy during SBS.
3.10 Data processing of the second experimental setup, feat-DBN.
3.11 Electronic nose exposed to bacteria in blood
3.12 Sensor response for (a) undoped ZnO, (b) Mn doped ZnO, and (c) F doped ZnO. Solid line is the response towards ethanol and dotted line is the response towards trimethylamine. The color indicates ppm level where red, black, and blue represent low, medium, and high ppm level, respectively.
4.1 (a) Values of fixed weighting vector. The method can totally ignore one feature for one sleep stage and fully reconstruct it in another sleep stage. (b) Values of adaptive weighting vector. The method focuses more on some features and sleep stages than others. The difference between values for each feature across the sleep stages is not as high as for the fixed method.
4.2 Average reconstruction error for each input unit for standard and fixed weighting vector for one sleep stage. The reconstruction error is generally increased for low values of α and decreased for higher values of α when the fixed method is used.
4.3 (a) Input data, (b) weighting vector for fixed method, and (c) reconstructions using fixed method.
4.4 (a) Input data for Jittered-Cluttered NORB where each row is one category, (b) reconstruction of the input data with a CSSAE, and (c) a subset of the learned features.


List of Tables

3.1 Transition matrix for sleep stage changes
3.2 Confusion matrix for raw-DBN
3.3 Classification accuracy [%] (mean ± std) for five sleep stages with three different setups.
3.4 Classification accuracy [%] for DBNs and cRBMs applied to 10 strains of bacteria in two different media.
3.5 Classification accuracy (mean ± standard deviation) [%] with five-fold cross-validation for the task of classifying material, gas, and ppm level using different set-ups. The number after DBN defines the window width (number of visible units) and the numbers after auto-encoder define the model order in the first and second layer.
4.1 Classification accuracy of sleep stage classification with and without post-processing temporal smoothing for a deep belief net (DBN), sparse auto-encoder (SAE), and selective attention auto-encoder (SA-AE) with two different methods of selecting the weighting vector.
4.2 Classification errors [%] with 95% confidence intervals on MNIST variations using selective attention auto-encoder (SA-AE) with fixed weighting vector, sparse auto-encoder (SAE), supervised neural net (NNet), SVM with Gaussian kernel, 3-layered stacked auto-associator (SAA-3), 1- and 2-layered contractive auto-encoder (CAE-1, CAE-2), 3-layered stacked denoising auto-encoder (SdA-3), deep belief network (DBN-1), and a supervised point-wise gated Boltzmann machine (supervised PGBM).
4.3 Comparison of classification errors on Jittered-Cluttered NORB with previous works and a sparse auto-encoder with cost-sensitive learning with 2000 hidden units.


Chapter 1

Introduction

Applying high-dimensional raw data directly to a machine learning algorithm often gives unsatisfactory results due to the curse of dimensionality [2]. Therefore, practitioners of machine learning algorithms first extract features from the data in order to capture relevant high-level information and reduce the dimensionality of the data, and then apply the feature representations to the learning algorithm. These features are developed by domain-specific experts, and the search for better feature representations is an active research area in most fields since choosing the right representation is the key to any successful application. However, developing domain-specific features for each task is expensive, time-consuming, and requires expertise of the data. For example, consider challenging tasks in artificial intelligence such as computer vision and speech recognition that contain data that are complex, noisy, and high-dimensional. Furthermore, most data from real-world problems also have a temporal component and are called sequential data or time-series data.

Time-series data have unique properties that make them challenging to analyze and model. Finding a way to represent time-series data that accurately captures the temporal information is an open question and is crucial because raw time-series data is hard to manipulate in its original structure.

Prior knowledge or assumptions about the data, such as noise levels, redundancies, and temporal dependencies, are often infused in the chosen model or feature representation [26].

The alternative to using hand-engineered features is to learn the feature representations automatically from the data simply by interacting with it. Unsupervised feature learning [8, 6, 23] does this by learning a layer of feature representations from unlabeled data. The advantage of unsupervised feature learning is that unlabeled data can be utilized, which is plentiful and easy to obtain, and that the features are learned from the data instead of being hand-crafted. These layers of feature representations can then be stacked to create deep networks, which are capable of modeling more complex structures in the data. The field of deep learning aims to find ways to properly construct and train these deep networks. Unsupervised feature learning and deep learning (together lately referred to as representation learning) have been shown to be successful at learning layers of feature representations for static data and have achieved state-of-the-art results on a number of benchmark data sets [6].

However, much of the focus in the deep learning community has been on developing learning algorithms specific for static data, and not so much on time-series data. Some works have applied deep learning algorithms that are suited for static data directly to time-series data, but by doing so the temporal component is lost. Other works modify the learning algorithms to capture the temporal information as well. Many of the experiments that apply deep learning algorithms to time-series data use feature extraction on the original data, which means that the learned features are not built from purely raw data.

Current unsupervised feature learning algorithms treat all inputs equally.

For some multivariate time-series data sets there may be signals that contain more task-relevant information than others, for example if one of the signals is redundant or contains too much noise. In that case, it can sometimes be beneficial to reduce the dimensionality simply by removing those signals. But by doing so valuable information could be lost, and it can be a difficult task to manually distinguish which signals can be removed. For some data sets, it might be the case that some signals are particularly useful during some periods of time and at other times provide less valuable information. Therefore, the ability to focus on a subset of the inputs, and to change this focus over time, is currently lacking in feature learning algorithms.

1.1 Contributions

The contributions of this compilation thesis are:

1. Review the challenges with learning feature representations for structured data and recent progress of applying deep learning models to various time-series problems. (Paper I)

2. Apply deep learning models to novel multivariate time-series data sets.

Some of these novel data sets could be considered as new benchmark sets since they introduce challenges that are suited for deep learning algorithms. This contribution is investigated in Chapter 3. (Paper II, III, V)

3. Show that deep learning models are capable of building useful features from raw multivariate time-series data. Deep learning uses unsupervised feature learning to construct its own features from unlabeled data. Despite this, most experiments performed with deep learning methods on time-series data have used heavily pre-processed data. Applying raw data directly to a classifier often results in poor performance, which explains the current caution about using raw data. This contribution is explored in Chapter 3. (Paper II, III)

4. Algorithmic modifications to one representation learning algorithm that make it more suitable for multivariate time-series data. These modifications are implemented in an auto-encoder, and the feature learning capabilities are compared to standard state-of-the-art deep learning models. This contribution is presented in Chapter 4. (Paper IV, V)

1.2 Organization

Chapter 2 gives an introduction to unsupervised feature learning and deep learning. This chapter presents the construction, training, finetuning, and classification of a deep network with layers of Restricted Boltzmann Machines or auto-encoders. The process of setting the hyperparameters is discussed as well.

Chapter 3 shows how deep learning models can be applied to novel and challenging multivariate time-series problems. The classification accuracy is compared to a number of approaches that use either raw input data or transformed input data.

Chapter 4 presents the algorithmic contribution to an existing representation learning algorithm that makes it more suitable for multivariate time-series data. These suggested modifications are implemented in an auto-encoder and compared with the performance of a standard auto-encoder on novel and well-known data sets.

Chapter 5 gives the conclusions of the thesis and proposes some suggestions for future work.


Chapter 2

Representation learning

The choice of data representation has a significant impact on the performance of a machine learning algorithm, and much research has been done to find good feature representations that capture the task-relevant information in the data. Designing such desirable features often requires expertise of the current domain. Representation learning algorithms [33, 7, 68, 8, 4, 6] instead attempt to create features from unlabeled data without using any prior knowledge. This has the advantage that there is no need to hand-craft the features and that cheap and plentiful unlabeled data can be utilized. Representation learning algorithms are particularly useful for large, high-dimensional, unlabeled, complex data sets where the dynamics are unknown or where traditional shallow methods fail to accurately model the data because of their limited capacity. They have been shown to be successful in a number of challenging AI tasks, such as speech recognition [36, 29], object recognition [55], text classification [18], music recognition [49, 15, 35], activity recognition [46], motion capture modeling [81], emotion classification [87, 93, 37], and in physiological data [91, 54, 75, 88, 90]. The growing interest in representation learning algorithms in recent years is largely due to their use in building deep networks. Deep networks are formed by stacking layers of representation learning modules, which enables the deep network to model more complex data structures than a shallow network since the data is processed through multiple non-linear transformations [4].

2.1 Greedy layer-wise pre-training

The resurgence of models with multiple layers of hidden units came with the discovery of greedy layer-wise pre-training [33, 7]. This solution overcame the problem of vanishing gradients [3], which is the fact that the error signal from the top layer diminishes as it reaches the first layer during supervised learning.

Greedy layer-wise pre-training provides a more useful parameter initialization than random initialization [23] and is the process of training each layer separately in an unsupervised manner before finetuning the whole network in a supervised fashion. The training of a deep network therefore consists of (1) greedy layer-wise pre-training of each layer to initialize the model parameters, and (2) supervised finetuning of the whole network to optimize for a specific task. However, with access to a large amount of labeled data, proper initialization, and particular design choices such as using rectified linear units (ReLUs) as the non-linearity [56, 41, 73], it is possible to skip the pre-training phase.
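The two-stage procedure can be sketched as follows. This is an illustrative example rather than the implementation used in the thesis: each layer is pre-trained as a tied-weight auto-encoder with a linear decoder, and the supervised stage only trains a softmax classifier on top of the learned features (a full finetuning pass would backpropagate through all layers). All layer sizes, learning rates, and the toy data are arbitrary.

```python
# Sketch of greedy layer-wise pre-training followed by a supervised stage.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(X, n_hidden, lr=0.1, epochs=50):
    """Unsupervised pre-training of one layer (tied-weight auto-encoder,
    sigmoid encoder and linear decoder)."""
    n_visible = X.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    b = np.zeros(n_hidden)              # hidden bias
    c = np.zeros(n_visible)             # reconstruction bias
    for _ in range(epochs):
        H = sigmoid(X @ W.T + b)        # encoder
        R = H @ W + c                   # decoder with tied weights
        err = R - X                     # reconstruction error
        dH = (err @ W.T) * H * (1 - H)  # backprop through the encoder path
        W -= lr * (dH.T @ X + H.T @ err) / len(X)
        b -= lr * dH.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b

# Toy data: 500 examples, 20 inputs, 3 classes.
X = rng.standard_normal((500, 20))
y = rng.integers(0, 3, size=500)

# (1) Greedy layer-wise pre-training of a 20-50-30 stack.
layers, H = [], X
for n_hidden in (50, 30):
    W, b = pretrain_layer(H, n_hidden)
    layers.append((W, b))               # keep the pre-trained parameters
    H = sigmoid(H @ W.T + b)            # output of this layer feeds the next

# (2) Supervised stage: train a softmax classifier on the top-layer features.
W_out = np.zeros((3, H.shape[1]))
T = np.eye(3)[y]
for _ in range(200):
    logits = H @ W_out.T
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    W_out -= 0.1 * ((P - T).T @ H) / len(H)

pred = (H @ W_out.T).argmax(axis=1)
print("training accuracy on toy data:", (pred == y).mean())
```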

2.2 Overfitting and Regularization

Overfitting in machine learning occurs when a model that is large relative to the amount of training data starts to describe the noise present in the data instead of the task-relevant patterns. Overfitting is generally prevented by adjusting the complexity of the model or by using regularization. Regularization reduces the allowed parameter space of the model and guides the process of feature learning towards solutions that hopefully generalize better to unseen data [13].

For example, learning sparse feature representations, which are representations where only a small number of the hidden units are "activated" (outputs close to a value of 1) for each training example, is implemented by adding a sparsity constraint [48, 67] to the objective function. Robustness to small changes in the input data can be achieved by adding a penalty term that penalizes the Frobenius norm of the encoder's Jacobian at training points [70] or by reconstructing clean input from corrupted input [86, 85]. Finally, a recently proposed method for regularization is dropout [78], which encourages each feature to learn meaningful representations independently of what the other features have learned by dropping half of the hidden units for each training example.
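As a small illustration of the last idea, dropout on a batch of hidden activations amounts to zeroing a random half of the units for each training example; the snippet below uses the common "inverted dropout" formulation (rescaling at training time) and is a generic sketch, not code from the thesis.

```python
# Minimal sketch of dropout on a batch of hidden activations.
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                              # drop half of the hidden units

def dropout(h, training=True):
    if not training:
        return h                          # use all units at test time
    mask = rng.random(h.shape) >= p_drop  # keep each unit with probability 0.5
    return h * mask / (1.0 - p_drop)      # rescale so the expected activation is unchanged

h = rng.random((4, 10))                   # 4 training examples, 10 hidden units
print(dropout(h)[0])                      # roughly half the entries are zero
```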

2.3 Hyperparameters

One of the challenges for training deep networks is the large number of design choices, such as connectivity, architecture, optimization method, and hyperparameters. Each regularization term comes with one or more hyperparameters.

The choice of optimization also comes with a number of hyperparameters, such as learning rate and momentum. A full grid search over all possible combinations of hyperparameters is impractical for deep networks due to the large number of hyperparameters, so a recommendation is to use random grid search [5] or structured hyperparameter optimization [10, 9]. Some hyperparameters can be set according to practical recommendations [5]. There has been some work done to reduce the amount of design choices, such as automatically setting receptive fields [19], learning rates [72], and the number of hidden units [94], feature selection by grouping [77], or removing hyperparameters by combining regularization terms as done in sparse filtering [57], dropout [78], and the contractive auto-encoder [70]. One recommendation for finding good values for hyperparameters is to look for methods that evaluate and monitor the unsupervised learned model instead of using the final prediction performance, which requires the time-consuming supervised finetuning [6, 5].
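A random search over a few such hyperparameters can be sketched as below. The ranges, the log-uniform sampling for scale-like quantities, and the `train_and_score` stub are illustrative assumptions, not recommendations taken from the thesis.

```python
# Sketch of random hyperparameter search: sample each hyperparameter from a
# range instead of exhaustively enumerating a grid.
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform in [1e-4, 1e-1]
        "momentum":      rng.uniform(0.5, 0.99),
        "n_hidden":      int(rng.integers(50, 500)),
        "weight_decay":  10 ** rng.uniform(-6, -2),
        "sparsity":      rng.uniform(0.01, 0.2),
    }

def train_and_score(config):
    """Placeholder for pre-training a model and returning a validation score,
    e.g. reconstruction error on held-out data, so that full supervised
    finetuning is not needed for every candidate."""
    return -abs(np.log10(config["learning_rate"]) + 2.0)   # dummy score

best = max((sample_config() for _ in range(50)), key=train_and_score)
print("best configuration found:", best)
```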

2.4 Optimization

The choice of optimization algorithm also plays an important role for training deep networks. Many different optimization algorithms have been used, for example stochastic gradient descent (SGD), conjugate gradient (CG), batch methods such as limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), and Hessian-free optimization [52, 53]. Each optimization algorithm comes with its own hyperparameters that need to be optimized together with the model hyperparameters. Some examples of hyperparameters for the optimization algorithm are the learning rate, learning rate decay, mini-batch size, and number of training iterations. L-BFGS has the advantage of automatically setting the learning rate, but for large-scale problems it is recommended to use SGD to reduce training time [14]. It has been argued that the learning rate is the most important hyperparameter to tune correctly [5].

2.5 Classification and regression

Once the layer(s) of hidden units have been pre-trained, the network is specialized for a specific task. Two common tasks are classification and regression. In a classification task the output of the model is a representation of the category that the input data belongs to. A classification system is constructed with a deep network by attaching a layer of "softmax" units on the top hidden layer. Each softmax unit represents one class, and the output from softmax unit i is the probability that the current input data belongs to class i. The weight matrix between the top-layer hidden units and the softmax layer, W_ij, is trained by minimizing the cost function:

J_{\mathrm{softmax}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{k} \left[ y_i^{(n)} \log \hat{y}_i^{(n)} + \left(1 - y_i^{(n)}\right) \log\left(1 - \hat{y}_i^{(n)}\right) \right] + \frac{\lambda}{2} \sum_i \sum_j W_{ij}^2   (2.1)

where N is the number of training examples in the current mini-batch, \hat{y}_i is the prediction that the input data belongs to class i, and y_i is 1 if the input data belongs to category i and 0 otherwise. The output from the softmax units is calculated as:

\hat{y}_i = \frac{\exp\left(\sum_j W_{ij} x_j\right)}{\sum_{i'=1}^{k} \exp\left(\sum_j W_{i'j} x_j\right)}   (2.2)

where k is the number of classes. Similarly, the deep network can be used for a regression task if a regressor is placed on the top-layer hidden units.
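A numerical sketch of Equations (2.1) and (2.2) is given below, computed on a synthetic mini-batch; the batch size, layer sizes, and weight decay strength are arbitrary choices for illustration.

```python
# Sketch of the softmax output layer (Eq. 2.2) and its regularized
# cross-entropy cost (Eq. 2.1) on a synthetic mini-batch.
import numpy as np

rng = np.random.default_rng(0)
N, n_hidden, k = 32, 100, 5             # mini-batch size, top-layer units, classes
lam = 1e-4                              # weight decay strength (lambda)

X = rng.standard_normal((N, n_hidden))  # activations of the top hidden layer
labels = rng.integers(0, k, size=N)
Y = np.eye(k)[labels]                   # y_i^(n): one-hot targets
W = 0.01 * rng.standard_normal((k, n_hidden))

def softmax_outputs(X, W):
    """Eq. (2.2): normalized exponentials of the class scores."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cost(X, Y, W):
    """Eq. (2.1): cross-entropy over the mini-batch plus L2 weight decay."""
    Y_hat = np.clip(softmax_outputs(X, W), 1e-12, 1 - 1e-12)
    ce = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)).sum(axis=1).mean()
    return ce + 0.5 * lam * np.sum(W ** 2)

print("cost with random weights:", softmax_cost(X, Y, W))
```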

Figure 2.1: (a) RBM, (b) conditional RBM.

2.6 Modules for deep learning

There are many different models that can be used for unsupervised feature learning and be stacked to construct deep networks, for example Restricted Boltzmann Machines (RBM) [33, 34, 48], auto-encoders [68, 7], sparse coding [58], deep Boltzmann machines [27], and K-means [17]. Each model has its own advantages and disadvantages, and since the field is still under research, no clear winner of which model is most suitable for deep learning has yet emerged [6]. This section briefly introduces the two most commonly used modules, namely the RBM and the auto-encoder, and how they can be modified to better model sequential data.

2.6.1 Restricted Boltzmann Machine

The Restricted Boltzmann Machine (RBM) is a generative probabilistic undirected graphical model that consists of visible units v, hidden units h, and bias vectors c and b, see Figure 2.1a. The weight matrix W connects the visible and hidden units. There are no visible-to-visible or hidden-to-hidden connections.

The energy function and the joint distribution for a given visible and hidden vector are defined as:

E(v, h) = h^T W v + b^T h + c^T v   (2.3)

P(v, h) = \frac{1}{Z} \exp\left(E(v, h)\right)   (2.4)

where Z is the partition function that ensures that the distribution is normalized. For a Bernoulli-Bernoulli RBM (binary visible and hidden units), the probability that hidden unit h_j is activated given visible vector v is given by:

P(h_j | v) = \sigma\left( b_j + \sum_i W_{ij} v_i \right)   (2.5)

The probability that visible unit v_i is activated given hidden vector h is given by:

P(v_i | h) = \sigma\left( c_i + \sum_j W_{ij} h_j \right)   (2.7)

where \sigma(\cdot) is the activation function. A common choice of activation function has been the sigmoid \sigma(x) = \frac{1}{1 + e^{-x}}, but it is more and more being replaced with rectified linear units [56].

The model parameters θ = {W, b, c} are trained to maximize the log-likelihood of the training data. Calculating the gradient of the likelihood function is intractable because of the partition function, so instead the gradient is estimated using contrastive divergence (CD) [32], persistent contrastive divergence (PCD) [92, 82], or fast-weights persistent contrastive divergence (FPCD) [83].

For contrastive divergence the update rule is:

\frac{\partial \log P(v)}{\partial W_{ij}} \approx \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{recon}}   (2.8)

where \langle \cdot \rangle is the average value over all training samples.

Modules of RBMs can be stacked on top of each other to form a deep network. When RBMs are stacked in this manner they are called a deep belief network (DBN) [33]. The output from a lower-level RBM becomes the input to the next level RBM.
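One contrastive divergence (CD-1) update for a Bernoulli-Bernoulli RBM, following Equations (2.5), (2.7), and (2.8), can be sketched as below on synthetic binary data; the layer sizes, learning rate, and batch size are arbitrary.

```python
# Sketch of one CD-1 parameter update for a Bernoulli-Bernoulli RBM.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, N, lr = 20, 30, 64, 0.05
W = 0.01 * rng.standard_normal((n_hidden, n_visible))
b = np.zeros(n_hidden)                  # hidden bias
c = np.zeros(n_visible)                 # visible bias

v0 = (rng.random((N, n_visible)) < 0.3).astype(float)   # synthetic binary batch

# Positive phase: P(h_j = 1 | v), Eq. (2.5).
ph0 = sigmoid(v0 @ W.T + b)
h0 = (rng.random(ph0.shape) < ph0).astype(float)        # sample hidden states

# Negative phase: one Gibbs step, Eq. (2.7) and then Eq. (2.5) again.
pv1 = sigmoid(h0 @ W + c)                                # "reconstruction"
ph1 = sigmoid(pv1 @ W.T + b)

# Eq. (2.8): <v_i h_j>_data - <v_i h_j>_recon, averaged over the mini-batch.
W += lr * (ph0.T @ v0 - ph1.T @ pv1) / N
b += lr * (ph0 - ph1).mean(axis=0)
c += lr * (v0 - pv1).mean(axis=0)

recon = sigmoid(sigmoid(v0 @ W.T + b) @ W + c)
print("reconstruction error after one update:", np.mean((v0 - recon) ** 2))
```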

2.6.2 Conditional Restricted Boltzmann Machine

The conditional RBM (cRBM) [81] is an extension of the RBM. The cRBM has auto-regressive weights that model short-term temporal dependencies and hidden units that model longer-term temporal structures in multivariate time-series data. A cRBM is similar to an RBM except that the bias vectors for the visible and hidden layers are dynamic and depend on previous visible layers, see Figure 2.1b. The dynamic bias vectors are defined as:

\hat{b}_j = b_j + \sum_{k=1}^{n} B_k v(t - k)   (2.9)

\hat{c}_i = c_i + \sum_{k=1}^{n} A_k v(t - k)   (2.10)

where A_k is the autoregressive connection between the visible layer at time t − k and the current visible layer, and B_k is the weight matrix connecting the visible layer at time t − k to the current hidden layer. The model order is defined by the constant n. The probabilities for going up or down a layer are:

P(h_j | v) = \sigma\left( b_j + \sum_i W_{ij} v_i + \sum_k \sum_i B_{ijk}\, v_i(t - k) \right)   (2.11)

P(v_i | h) = \sigma\left( c_i + \sum_j W_{ij} h_j + \sum_k \sum_{i'} A_{i i' k}\, v_{i'}(t - k) \right)   (2.12)

The parameters W, b, c, A, and B are trained in a similar manner as the RBM using contrastive divergence.
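The sketch below computes the dynamic biases of Equations (2.9)–(2.10) and the resulting conditionals for one time step of a synthetic multivariate series; the model order, layer sizes, and random weights are illustrative assumptions.

```python
# Sketch of the cRBM's dynamic biases: the static biases are shifted by
# autoregressive contributions from the n previous visible frames.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, n_order = 8, 16, 3       # model order n = 3 past frames
W = 0.01 * rng.standard_normal((n_hidden, n_visible))
b = np.zeros(n_hidden)                        # static hidden bias
c = np.zeros(n_visible)                       # static visible bias
A = 0.01 * rng.standard_normal((n_order, n_visible, n_visible))  # past -> visible
B = 0.01 * rng.standard_normal((n_order, n_hidden, n_visible))   # past -> hidden

series = rng.standard_normal((100, n_visible))           # multivariate time-series
t = 50
v_t = series[t]
past = [series[t - k] for k in range(1, n_order + 1)]    # v(t-1), ..., v(t-n)

# Dynamic biases, Eqs. (2.9) and (2.10).
b_dyn = b + sum(B[k] @ past[k] for k in range(n_order))
c_dyn = c + sum(A[k] @ past[k] for k in range(n_order))

# Conditionals with the dynamic biases (compact form of Eqs. 2.11-2.12).
p_h = sigmoid(b_dyn + W @ v_t)
p_v = sigmoid(c_dyn + W.T @ p_h)
print("P(h | v, history), first 5 hidden units:", p_h[:5])
```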

2.6.3 Auto-Encoder

The auto-encoder [7, 4] consists of an encoder and a decoder, see Figure 2.2a.

The goal of the auto-encoder is to reconstruct the input data via one or more layers of hidden units. The feed-forward activation in the encoder from the visible units v_i to the hidden units h_j is expressed as:

h_j = \sigma_f\left( \sum_i W_{ji} v_i + b_j \right)   (2.13)

where W_{ji} is the connection between visible unit i and hidden unit j, and \sigma_f is the activation function. A common activation function is the sigmoid function, which is defined as \sigma_f(x) = \frac{1}{1 + e^{-x}}. In the decoder phase, the reconstruction of the input layer is calculated as:

\hat{v}_i = \sigma_g\left( \sum_j W_{ij} h_j + b_i \right)   (2.14)

Notice the tied weights of the encoder and decoder, that is, the weight matrix for the decoder is the transpose of the weight matrix in the encoder. The use of tied weights acts as a regularizer and reduces the number of parameters to learn [6]. The activation function in the decoder can be the sigmoid function or the linear activation function \sigma_g(x) = x if the values in the input layer are not between 0 and 1.

Figure 2.2: (a) Auto-encoder, (b) Temporal auto-encoder.

The cost function for the auto-encoder to be minimized for a mini-batch of N training examples is expressed as:

L(v, \theta) = \frac{1}{2N} \sum_{n=1}^{N} \sum_i \left( v_i^{(n)} - \hat{v}_i^{(n)} \right)^2   (2.15)

The auto-encoder can also be given a probabilistic interpretation by defining the cost function as the negative log-likelihood of the probability distribution of the visible layer given the hidden layer. With the assumption that this is a Gaussian distribution with mean equal to the reconstruction of the visible layer and identity covariance, we get L = -\log P(v|h) = \frac{1}{2} \sum_i (v_i^{(n)} - \hat{v}_i^{(n)})^2, which is the same as Equation (2.15).

The cost function can be complemented with a number of regularization terms that restrict the allowed parameter space and help prevent overfitting. The weights of the model in each layer l can be kept small by adding an L2 weight decay term, \frac{\lambda}{2} \sum_i \sum_j \sum_l \left(W_{ij}^{(l)}\right)^2. Sparse feature representations can be achieved by adding the Kullback-Leibler (KL) divergence as a sparsity penalty term, \beta \sum_j \mathrm{KL}(\rho \| p_j), where

\mathrm{KL}(\rho \| p_j) = \rho \log \frac{\rho}{p_j} + (1 - \rho) \log \frac{1 - \rho}{1 - p_j}   (2.16)

and p_j is the mean activation of hidden unit j over all training examples in the current mini-batch. Each regularization term comes with one or more hyperparameters (\lambda, \beta, \rho).

Learning in an auto-encoder is the process of finding the model parameters \theta = \{W, b\} that minimize the cost function. The model parameters can be learned with backpropagation. For auto-encoders, the decoder is only used in the pre-training phase and is not used in the supervised finetuning phase or during classification.
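The full objective, reconstruction error (Eq. 2.15) plus weight decay and the KL sparsity penalty (Eq. 2.16), can be evaluated as in the sketch below for a tied-weight auto-encoder on a synthetic mini-batch; the hyperparameter values and layer sizes are placeholders, not values used in the thesis.

```python
# Sketch of the sparse auto-encoder cost: reconstruction error with L2 weight
# decay and the KL-divergence sparsity penalty, using tied weights.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

N, n_visible, n_hidden = 64, 30, 50
lam, beta, rho = 1e-4, 3.0, 0.05           # hyperparameters (lambda, beta, rho)

V = rng.random((N, n_visible))             # mini-batch of inputs scaled to [0, 1]
W = 0.01 * rng.standard_normal((n_hidden, n_visible))
b = np.zeros(n_hidden)
c = np.zeros(n_visible)

H = sigmoid(V @ W.T + b)                   # encoder, Eq. (2.13)
V_hat = sigmoid(H @ W + c)                 # decoder with tied weights, Eq. (2.14)

recon = 0.5 * np.mean(np.sum((V - V_hat) ** 2, axis=1))        # Eq. (2.15)
weight_decay = 0.5 * lam * np.sum(W ** 2)
p = H.mean(axis=0)                         # mean activation p_j of each hidden unit
kl = np.sum(rho * np.log(rho / p) + (1 - rho) * np.log((1 - rho) / (1 - p)))
cost = recon + weight_decay + beta * kl    # total objective to minimize

print("reconstruction:", recon, "KL sparsity:", kl, "total cost:", cost)
```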

2.6.4 Temporal Auto-Encoder

The auto-encoder can be extended to a temporal auto-encoder in order to make it more suitable for multivariate time-series data, see Figure 2.2b. The hidden units depend on the visible units of the current timeframe as well as visible units of previous timeframes. The hidden layer at time t is calculated as:

h_j = \sigma_f\left( \sum_{k=1}^{n} \sum_i B_{kji}\, v_i(t - k) + \sum_i W_{ji} v_i + b_j \right)   (2.17)

where B_n is the weight matrix between the hidden layer and the visible units at time frame t − n. The reconstruction layer is calculated as:

\hat{v}_i = \sigma_g\left( \sum_{k=1}^{n} \sum_j A_{kji}\, v_j(t - k) + \sum_j W_{ij} h_j + b_i \right)   (2.18)

where A_n is the weight matrix between the visible units at time frame t − n and the reconstruction of the visible layer at the current time frame t. There is no reconstruction of past visible layers. The past visible layers act as an extra bias term, much like in a conditional RBM [81].


Chapter 3

Representation learning for multivariate time-series data

This chapter presents in more detail how deep learning models can be applied to multivariate time-series problems. The analysis of time-series has been an active research area for decades [39, 20]. Modeling high-dimensional time-series data shares some of the challenges of modeling static data, such as how to represent the data, how to identify patterns, how to deal with noise, and how to introduce invariance. But it also introduces additional challenges, such as how to capture short and long-term behavior and how to find correlations between signals. The importance of the temporal dependency of time-series data can be observed when two seemingly identical input frames at different times actually should be associated with different predictions depending on the preceding input frames.

Time-series data can either be sampled data from a sensor that measures a real-valued continuous process or the output from simulations of man-made processes. For long-term time-series data it is common to observe measures that vary periodically (seasonality) or a trend, which may be caused by natural factors or depreciation of the sensor equipment. Sensor-based data often contain noise and might produce unreliable measures. When the goal is to capture a complex real-world occurrence it is often not sufficient to use only one sensor. Therefore, multiple sensors that either have different characteristics or are placed at different spatial locations are used to create multivariate time-series data.

The more sensors that are used, the higher the chance is to capture enough information in order to fully understand the examined process. But due to economic and practical considerations the number of sensors is limited, which means that only a subset of the variables that would be required to understand the process is available. Conversely, having too many sensors means higher redundancy and makes the data more difficult to analyze.

Some popular tools for the analysis of time-series data include Dynamic Time Warping (DTW) [11] for measuring similarity between two time-series and motif discovery [16, 74] for finding repeating patterns. But there are also numerous techniques for modeling time-series. These can either be parametric, that is, they measure signal properties such as frequency spectra and signal correlations, or non-parametric. Non-parametric models may contain known equations of the system dynamics or be completely "black-box", which do not use any a priori knowledge of the system dynamics. Such "black-box" models can either be linear, like autoregressive models [51], Linear Dynamical Systems (LDS) [50], and Hidden Markov Models (HMM) [65], or non-linear, like neural networks, RBMs, and auto-encoders. Non-linear, non-parametric models are best suited when the system dynamics are complex or even unknown. The task for such models is to identify the parameters of the assumed model, which is a process called "training" the model. The trained model can then be used for simulation, signal analysis, prediction, fault detection, controller synthesis, or data classification.

This section presents how an RBM and an auto-encoder can be applied for learning features from unlabeled multivariate time-series data. The learned models are then used for the task of classifying sleep stages from polysomnographic (PSG) recordings and identifying odour markers from electronic nose data.

3.1 Sleep stage classification

Sleep stage classification is an important step for diagnosing chronic sleep disorders. The task involves labeling each 30-second segment of data from a polysomnograph (PSG) into pre-defined sleep stages. A PSG recording consists of various physiological signals such as brain activity from an electroencephalography (EEG), eye movements from an electrooculography (EOG), and muscle activity from an electromyography (EMG). Sleep stage labeling is a time-consuming task that is manually performed by human sleep experts. The five pre-defined sleep stages, introduced by Rechtschaffen and Kales (R&K) [69], are awake (W), stage 1 (S1), stage 2 (S2), slow wave sleep/deep sleep (SWS), and rapid-eye-movement sleep (REM). Figure 3.1 shows one epoch of data from a polysomnograph for each of the five stages. The awake stage is characterized by low EEG amplitude, more than 50% EEG alpha activity, high EMG, and the presence of slow eye-movements (SEMs) and blinking artifacts in the EOG. In the first stage of sleep, S1, the amount of alpha activity in the EEG is below 50% and the EMG amplitude is decreased. There might be SEMs visible in the EOG in stage 1. In the next sleep stage, S2, there are no SEMs in the EOG, and K-complexes and sleep spindles can be seen in the EEG. Slow wave sleep (SWS) is characterized by over 20% of the EEG being delta waves. The EMG amplitude is low and there are typically no eye-movements in the EOG. Finally, REM-sleep has the lowest amplitude in the EMG, clearly visible rapid eye-movements (REM) in the EOG, and EEG with low amplitude, mixed frequencies, and no K-complexes or sleep spindles. A graph that shows these five stages over an entire night is called a hypnogram. A typical hypnogram can be seen in Figure 3.2a.

Figure 3.1: Polysomnograph data (EEG, EOG1, EOG2, EMG) of the five stages of sleep during 30 seconds.

A common approach to the time-consuming task of sleep stage classification is to try to capture the valuable information with a set of features and often a feature selection step [63]. The features are designed with help from sleep technicians in order to mimic the R&K system and infuse a priori knowledge of the data. As with most learning algorithms, the choice of features has a big impact on the classification performance. While many different feature sets and methods for creating a computerized automatic sleep stager have been proposed [38, 61, 24, 95, 60, 21], the difficulty of generalizing to different patients, sleep labs, or equipment has prevented automatic sleep stagers from being used in a clinical setting, and it can therefore be said that no universally applicable set of features has yet been found [63, 71]. The ongoing search for an optimal feature set has led to the discovery of new useful features such as frequency bands [40] and the fractal exponent [79], which are not directly obvious from the R&K rules.

Figure 3.2: (a) A hypnogram for one night recording of a patient with sleep apnea. (b) Construction of the visible layer from sleep data.

Instead of using pre-designed features or researching new ones, this section presents how a deep belief network (DBN) can be used to construct its own feature representation for sleep stage classification from unlabeled raw data without using any a priori expert knowledge. The advantage of this approach is that potentially new and better features, which do not necessarily adhere to the R&K system, can be discovered from unlabeled data, better untangle the factors of variation in sleep data, and provide a better generalization.

The data that is used in this section has kindly been provided by St. Vincent's University Hospital and University College Dublin, and can be downloaded from PhysioNet [28]. The dataset consists of 25 acquisitions from subjects with suspected sleep apnea (sleep-disordered breathing). Each recording consists of 2 EEG channels, 2 EOG channels, and 1 EMG channel. Only one of the EEG signals is used in order to reduce the dimensionality. Labeling of the data has been performed by one sleep technician. All signals are slightly pre-processed by notch filtering, bandpass filtering, and downsampling.
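Turning such a recording into input for a DBN amounts to cutting each channel into windows and concatenating the per-channel windows into one visible vector (cf. Figure 3.2b). The sketch below illustrates this on fake data; the sampling rate, window length, and min-max scaling are illustrative choices, not the exact values used in the thesis.

```python
# Sketch: build visible-layer vectors from a multi-channel PSG epoch by
# windowing each channel, scaling to [0, 1], and concatenating the channels.
import numpy as np

rng = np.random.default_rng(0)
fs = 64                                   # assumed sampling rate after downsampling [Hz]
window_sec = 1                            # assumed window length [s]
channels = ["EEG", "EOG1", "EOG2", "EMG"]

# Fake one 30-second epoch of 4-channel data (rows: channels, cols: samples).
epoch = rng.standard_normal((len(channels), 30 * fs))

def make_visible_vectors(epoch, fs, window_sec):
    w = fs * window_sec
    n_windows = epoch.shape[1] // w
    segs = epoch[:, : n_windows * w].reshape(len(channels), n_windows, w)
    # min-max scale each channel to [0, 1] so the values suit sigmoid/binary units
    lo = segs.min(axis=(1, 2), keepdims=True)
    hi = segs.max(axis=(1, 2), keepdims=True)
    segs = (segs - lo) / (hi - lo)
    # concatenate the channels within each window: one row per visible vector
    return segs.transpose(1, 0, 2).reshape(n_windows, -1)

V = make_visible_vectors(epoch, fs, window_sec)
print("visible layer:", V.shape)          # (30 windows, 4 channels * 64 samples)
```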

Figure 3.3: Data processing of the first experimental setup, raw-DBN. Windows of the EEG, EOG, and EMG signals are concatenated into the visible layer of a DBN, and the DBN output is post-processed with an HMM.

Figure 3.4: The first 20 of 200 learned features in the first layer (EEG, EOG1, EOG2, EMG). The learned features are of various amplitudes and frequencies.

Figure 3.5: Estimated sleep stage (red) and true sleep stage (black) before applying the HMM.

Figure 3.6: Estimated sleep stage (red) and true sleep stage (black) after applying the HMM.

The overall classification accuracy before using the HMM was 59.3% and after using the HMM 84.5% for this particular patient.

Table 3.1 shows the transition matrix learned by the HMM. Besides the obvious fact that the probabilities of staying in the same state are very high, it can be observed that the probability of going from the wake state, or the near-to-wake state S1, to slow wave sleep is close to 0.

A full 25-fold leave-one-out cross-validation is performed to obtain an average classification accuracy. In each fold, 1 of the 25 acquisitions is left out as test data while training and validation samples are randomly drawn from the other 24 acquisitions in such a way that each class has an equal amount of training examples. The average classification accuracy over all sleep stages is 67.4 ± 12.9%. The confusion matrix for all training examples is shown in Table 3.2.

The most difficult stage to classify is S1, which is the stage between being awake and falling asleep, followed by REM-sleep.

From \ To [%]   SWS     S2      S1      REM     W
SWS             99.84   0.113   0.015   0.007   0.029
S2              0.065   99.82   0.043   0.036   0.039
S1              0.0     0.317   99.51   0.017   0.155
REM             0.001   0.041   0.027   99.90   0.031
W               0.0     0.021   0.172   0.002   99.81

Table 3.1: Transition matrix for sleep stage changes
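Temporal smoothing with the HMM can be sketched as a Viterbi pass over a Table 3.1-style transition matrix, decoding the most likely stage sequence from the per-epoch class probabilities. Treating the classifier's outputs directly as emission likelihoods, and using a uniform prior, are illustrative simplifications, not necessarily the exact setup used in the thesis.

```python
# Sketch: smooth frame-wise sleep-stage probabilities with a Viterbi pass over
# a transition matrix in the spirit of Table 3.1.
import numpy as np

stages = ["SWS", "S2", "S1", "REM", "W"]
T = np.array([[99.84, 0.113, 0.015, 0.007, 0.029],    # rows: from, cols: to [%]
              [0.065, 99.82, 0.043, 0.036, 0.039],
              [0.0,   0.317, 99.51, 0.017, 0.155],
              [0.001, 0.041, 0.027, 99.90, 0.031],
              [0.0,   0.021, 0.172, 0.002, 99.81]]) / 100.0
logT = np.log(T + 1e-12)

def viterbi_smooth(frame_probs):
    """Most likely stage sequence given per-epoch class probabilities."""
    n, k = frame_probs.shape
    logE = np.log(frame_probs + 1e-12)
    delta = np.log(np.full(k, 1.0 / k)) + logE[0]     # uniform prior over stages
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + logT                # (from, to)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logE[t]
    path = np.zeros(n, dtype=int)
    path[-1] = delta.argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Noisy per-epoch predictions: mostly S2, with one spurious flip to S1.
probs = np.full((20, 5), 0.05)
probs[:, 1] = 0.8
probs[7] = [0.05, 0.15, 0.6, 0.1, 0.1]
print([stages[i] for i in viterbi_smooth(probs)])     # the isolated flip is smoothed out
```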

Figure 3.7: Data processing of the second experimental setup, feat-GOHMM: feature extraction, feature selection, PCA, GMM, and HMM.
