Learning Representations with a Dynamic Objective Sparse Autoencoder


Martin Längkvist and Amy Loutfi

Applied Autonomous Sensor Systems

School of Science and Technology

Örebro University

SE-701 82, Örebro, Sweden

firstname.lastname@oru.se

Abstract

The main objective of an auto-encoder is to reconstruct the input signals via a feature representation of latent variables. The number of latent variables defines the representational capacity limit of the model. For data sets where some or all signals contain noise, an unnecessary amount of capacity is spent on trying to reconstruct these signals. One solution is to increase the number of hidden units so that there is enough capacity to capture the valuable information. Another solution is to pre-process the signals or perform a manual signal selection. In this paper, we propose a method that dynamically changes the objective function depending on the current performance of the model. This is done by weighting the objective function individually for each input unit in order to guide the feature learning and decrease the influence that problematic signals have on the learning of features. We evaluate our method on various multidimensional time-series data sets and handwritten digit recognition data sets and compare our results with a standard sparse auto-encoder.

1 Introduction

A promising tool for solving difficult AI problems with high factors of variation, such as computer vision or multivariate time-series analysis, is the recent development of representation learning algorithms [1, 6, 4, 12]. The two most commonly used methods are restricted Boltzmann machines (RBMs) [13, 15, 23] and auto-encoders [29, 5]. These algorithms have the advantage that they do not require domain-specific knowledge since the features are trained from unlabeled data instead of using hand-crafted features. These models have been used with various modifications, either in attempts to learn better features or to re-apply domain knowledge by making them more suitable for the current domain task. The feature learning process starts from a randomly initialized model (often a neural network), optimizes the model parameters according to an unsupervised training criterion, and then further improves them with a supervised fine-tuning criterion. This step can be repeated, using the output of the first model as input to a second model, in order to construct a deep network that has the ability to model higher levels of abstraction since the data goes through several non-linear transformations. Much of the focus in deep learning research is on finding ways of guiding the learning of features so that the model learns a good generalization of the data and avoids overfitting. However, there are many ways in which one can guide the learning of features. One practical strategy to improve the feature learning is to add or change a term in the objective function (change the regularization) so that an idea about what a good feature representation should look like is translated into the feature learning algorithm. For example, the idea that a feature representation should be sparse was implemented by adding a sparsity constraint [23, 28]. Another example is the Contractive Auto-Encoder (CAE) [31], which learns features that are robust to the input by penalizing changes in the feature representation for small changes in the input. A similar effect is achieved with the de-noising auto-encoder [37, 36], where the goal is to reconstruct clean input from corrupted input. Another recently proposed method for avoiding overfitting is dropout [14], where the co-adaptation between features is removed by randomly dropping half of the hidden units for each training example. This encourages each feature to learn a meaningful representation independently of what the other features have learned, so that the mistakes one feature might make do not depend on being fixed by other features.

In this work, we propose a small modification to the sparse auto-encoder that can be used to deal with problematic signals (noisy signals, dead signals, task-irrelevant signals) and improve the already learned features. It does this by re-shifting the attention the model pays to the input data, dynamically changing the criterion for evaluating what makes a good signal and how much each signal should influence the feature learning. Since there is a representational capacity limit in a model defined by the number of hidden units, it is often advantageous for high-dimensional data to remove noisy or redundant signals. In an RBM or auto-encoder, the representational capacity is spent on attempting to reconstruct every signal, even if that signal is problematic. One solution for such data is to manually remove signals that are suspected to be problematic. However, it can sometimes be a guessing game to decide which signals to remove, especially when it is difficult to distinguish between noise and task-relevant information. Removing signals can also mean a loss of useful information, which is why our method evaluates each signal during learning and shifts the attention between what is considered good and bad signals. The proposed method works both in the supervised and the unsupervised phase of learning. In the unsupervised phase, a good signal is a signal that has a low reconstruction error, meaning it is easy to model. In the supervised phase, a good signal is a signal that improves the classification result.

This work mainly deals with multivariate time-series data, for which unsupervised feature learning and deep learning methods have previously been successful, ranging from modelling human motions [34], audio-visual speech classification [26], symbolic sequences of polyphonic music [9], EEG-based prediction of epileptic seizures [24], and sleep stage classification [17], to bacteria identification with an electronic nose [18]. One difficulty with unsupervised feature learning algorithms is the number of design choices regarding the learning algorithm. This work will, in a sense, automate the process of signal selection. There has been work done to automate other aspects of learning, such as automatically setting receptive fields [11], learning rates [32], hyperparameters [8], and the number of hidden units [38], feature selection by grouping [33], or removing hyperparameters by combining regularization terms as is done in sparse filtering [25].

This paper first gives a background on a regular sparse auto-encoder in Section 2. Our model is presented in Section 3. The data sets used in this work and the experiments performed on them are presented in Sections 4 and 5. Finally, a discussion is given in Section 6.

2 Background

2.1 Auto-Encoder

[Figure 1 about here.]

We start off by describing a regular sparse auto-encoder. An auto-encoder [2] consists of one input layer, one or more hidden layers, and one output layer, see Figure 1. The model consists of an encoder and a decoder. The goal of an auto-encoder is to reconstruct the input data via the hidden layers. It was primarily used as a dimensionality reduction algorithm, with the number of hidden units set to fewer than the number of visible units, but it has shown to be a competent building block for building deep networks even in an over-complete setting. When the network has more than one hidden layer, each hidden layer is first trained individually, followed by a fine-tuning step. The feed-forward activation in the encoder from the visible units to the hidden units in layer i is expressed as:

\[
h^{(i)} = \sigma\left(W_1^{(i)} v^{(i)} + b_1^{(i)}\right) \tag{1}
\]

where σ is a non-linear transformation function. A common choice is the logistic function, σ(x) = 1/(1 + e^{-x}). For inputs that are not between 0 and 1, the last layer can have a linear activation function, σ(x) = x. The next hidden layer is trained in the same manner but with the previous hidden layer as input. The decoder step then maps the last hidden layer back to reconstructions of the previous hidden layers and finally to the reconstruction of the first visible layer. One pass of the decoder in layer i is calculated as:

\[
\hat{v}^{(i)} = \sigma\left(W_2^{(i)} h^{(i)} + b_2^{(i)}\right) \tag{2}
\]
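To make Equations (1) and (2) concrete, here is a minimal NumPy sketch of one encoder/decoder pass for a single layer. The array shapes and variable names (v, W1, b1, W2, b2) are illustrative assumptions; the paper's own implementation is in MATLAB and not shown here.

```python
import numpy as np

def sigmoid(x):
    """Logistic activation, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(v, W1, b1, W2, b2, linear_output=False):
    """One encoder/decoder pass of a single-layer auto-encoder.

    v  : (d_v, N) batch of N input examples
    W1 : (d_h, d_v) encoder weights, b1 : (d_h, 1)
    W2 : (d_v, d_h) decoder weights, b2 : (d_v, 1)
    """
    h = sigmoid(W1 @ v + b1)                            # Eq. (1): hidden activations
    pre = W2 @ h + b2
    v_hat = pre if linear_output else sigmoid(pre)      # Eq. (2): reconstruction
    return h, v_hat
```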

It is common in auto-encoders to have tied weights, i.e., W_1 = W_2^T. This works as a regularizer, since it constrains the allowed parameter space and reduces the number of parameters to learn [4]. The cost function to be minimized is expressed as:

\[
J = \frac{1}{2N}\sum_k \left(v^{(k)} - \hat{v}^{(k)}\right)^2 + \frac{\lambda}{2}\sum_i\sum_j W_{ij}^2 + \beta\sum_i \left[\rho\log\frac{\rho}{p_i} + (1-\rho)\log\frac{1-\rho}{1-p_i}\right] \tag{3}
\]

where p_i is the mean activation of hidden unit i. The first term is the reconstruction error term, the second term is the weight decay term, and the third term is the sparsity penalty term. The inclusion of these regularization terms prevents the trivial learning of a 1-to-1 mapping of the input. Each regularization term comes with one or more hyperparameters (λ, β, ρ) that have to be manually set. This can be done with a full grid search, random grid search [3], or hyperparameter optimization [7].
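A corresponding sketch of the cost in Equation (3), reusing `forward` and `sigmoid` from the sketch above. Whether the weight decay covers one or both weight matrices, and the default hyperparameter values, are assumptions made for illustration.

```python
def sparse_ae_cost(v, W1, b1, W2, b2, lam=1e-4, beta=3.0, rho=0.2):
    """Sparse auto-encoder cost, Eq. (3): reconstruction + weight decay + sparsity."""
    N = v.shape[1]
    h, v_hat = forward(v, W1, b1, W2, b2)
    recon = 0.5 / N * np.sum((v - v_hat) ** 2)
    weight_decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    p = h.mean(axis=1)                           # mean activation of each hidden unit
    kl = np.sum(rho * np.log(rho / p) + (1 - rho) * np.log((1 - rho) / (1 - p)))
    return recon + weight_decay + beta * kl
```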

A fine-tuning step of all layers is performed after all layers have been pre-trained. The cost function for supervised fine-tuning is the same as for unsupervised training, except that the reconstruction error term becomes the cross-entropy loss:

\[
-\frac{1}{N}\sum_i \left[(1 - y^{(i)})\log(1 - \tilde{y}^{(i)}) + y^{(i)}\log(\tilde{y}^{(i)})\right] \tag{4}
\]

where y is the predicted label and ỹ is the correct label.

3 Proposed model

[Figure 2 about here.]

The proposed model is an extension of the sparse auto-encoder. The main idea comes from the observation that if one signal is removed, the reconstruction error of the other signals will generally decrease, especially for noisy data sets and models with limited representational capacity. This has been shown in [20], where the reconstruction error for one signal decreased if another signal was removed, and decreased further if two signals were omitted. However, for some data sets it is not clear which signals are problematic and could be removed. This would require domain-specific knowledge, and valuable information could be lost when completely removing signals. Instead, our model weights the reconstruction error of each input unit in order to reduce the influence that problematic inputs have on the parameter learning. It does this by dynamically changing the objective function during learning. We present two cases where this idea is utilized: unsupervised feature learning and supervised fine-tuning.

3.1 Unsupervised feature learning with a dynamic objective

In order to obtain a dynamic objective we introduce a residual weight vector, α ∈ R^{d_v}, that weights the reconstruction error of each visible unit. If α_j = 1 for all j, the model reduces to a regular sparse auto-encoder. The cost function is the same as for a sparse auto-encoder (Equation (3)), except that the first term becomes:

\[
\frac{1}{2N}\sum_k \left(v^{(k)} - \hat{v}^{(k)}\right)^2 \cdot \alpha \tag{5}
\]

To further understand the influence of α on the parameter updates, consider an auto-encoder with one hidden layer, see Figure 2(a). The activations of the three layers (input layer, hidden layer, and output layer) are given by:

\[
a^{(1)} = v, \qquad
a^{(2)} = \sigma\left(W_1 a^{(1)} + b_1\right), \qquad
a^{(3)} = \sigma\left(W_2 a^{(2)} + b_2\right) \tag{6}
\]

The gradients of θ = {W, b} are calculated as:

\[
\delta^{(3)} = -\left(\left(a^{(1)} - a^{(3)}\right)\cdot \alpha\right)\cdot \left(1 - a^{(3)}\right) \tag{7}
\]
\[
\delta^{(2)} = \left(W_2^T \delta^{(3)} + \beta\left(-\frac{\rho}{p_j^{(2)}} + \frac{1-\rho}{1-p_j^{(2)}}\right)\right)\cdot a^{(2)}\left(1 - a^{(2)}\right) \tag{8}
\]
\[
\Delta W_2 = \frac{1}{N}\delta^{(3)}\left(a^{(2)}\right)^T + \lambda W_2 \tag{9}
\]
\[
\Delta W_1 = \frac{1}{N}\delta^{(2)}\left(a^{(1)}\right)^T + \lambda W_1 \tag{10}
\]
\[
\Delta b_2 = \frac{1}{N}\sum \delta^{(3)} \tag{11}
\]
\[
\Delta b_1 = \frac{1}{N}\sum \delta^{(2)} \tag{12}
\]

For the case of tied weights (W_1 = W_2^T), the gradient of W_1 is calculated as:

\[
\Delta W_1 = \Delta W_1 + \Delta W_2^T \tag{13}
\]

The inclusion of α has the purpose of aiding the learning of representations. In the error term for the final layer (Equation (7)), α weights the reconstruction error of each visible unit. If 0 ≤ α_j < 1, the amount by which the weights and biases in the previous layers are changed during back-propagation is decreased compared to α_j = 1. This makes the modified reconstruction error low, so the model is satisfied with the result and thus "loses interest" in trying to further reduce the reconstruction error for signals that have been assigned a low α. Similarly, for α_j ≥ 1 any reconstruction error is penalized more heavily and, if possible, the model will sacrifice the attention given to other units in order to keep the reconstruction error for that unit low. In a deep learning setting, each layer has a separate α. After a layer has been trained, the α for that layer is no longer used in the training of subsequent layers.
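To make the effect of α on the updates concrete, the following sketch computes the α-weighted gradients of Equations (5)-(12) for one hidden layer, reusing `sigmoid` from the first sketch. The batch layout (one example per column) and the broadcasting of α over the batch are our own assumptions.

```python
def weighted_gradients(v, W1, b1, W2, b2, alpha, lam=1e-4, beta=3.0, rho=0.2):
    """One pass of alpha-weighted gradients for a 1-hidden-layer auto-encoder.

    alpha : (d_v, 1) residual weight vector, one entry per visible unit.
    """
    N = v.shape[1]
    a1 = v
    a2 = sigmoid(W1 @ a1 + b1)                       # Eq. (6)
    a3 = sigmoid(W2 @ a2 + b2)
    p = a2.mean(axis=1, keepdims=True)               # mean hidden activation

    d3 = -((a1 - a3) * alpha) * (1 - a3)             # Eq. (7): alpha-weighted error
    sparsity = beta * (-rho / p + (1 - rho) / (1 - p))
    d2 = (W2.T @ d3 + sparsity) * a2 * (1 - a2)      # Eq. (8)

    dW2 = d3 @ a2.T / N + lam * W2                   # Eq. (9)
    dW1 = d2 @ a1.T / N + lam * W1                   # Eq. (10)
    db2 = d3.sum(axis=1, keepdims=True) / N          # Eq. (11)
    db1 = d2.sum(axis=1, keepdims=True) / N          # Eq. (12)
    return dW1, db1, dW2, db2
```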


3.2 Supervised fine-tuning with a dynamic objective

For supervised fine-tuning with a dynamic objective, we introduce α in a similar way, where α ∈ R^{d_y} weights the classification error of each class. The residual error is expressed as:

\[
-\frac{1}{N}\sum_i \left[(1 - y^{(i)})\log(1 - \tilde{y}^{(i)}) + y^{(i)}\log(\tilde{y}^{(i)})\right]\cdot \alpha \tag{14}
\]

The network structure is similar to the unsupervised case, except that the final layer is replaced with a softmax classifier layer, see Figure 2(b). The activations of each layer become:

\[
a^{(1)} = v, \qquad
a^{(2)} = \sigma\left(W_1 a^{(1)} + b_1\right), \qquad
a^{(3)} = \frac{e^{\,W_s a^{(2)} - \max\left(a^{(2)}\right)}}{\sum a^{(3)}} \tag{15}
\]

and the parameter updates of θ = {W, b} are:

\[
\delta^{(3)} = -\left(\tilde{y} - a^{(3)}\right)\cdot \alpha \tag{16}
\]
\[
\delta^{(2)} = \left(W_2^T \delta^{(3)} + \beta\left(-\frac{\rho}{p_j^{(2)}} + \frac{1-\rho}{1-p_j^{(2)}}\right)\right)\cdot a^{(2)}\cdot\left(1 - a^{(2)}\right) \tag{17}
\]
\[
\Delta W_2 = \frac{1}{N}\delta^{(3)}\left(a^{(2)}\right)^T + \lambda_s W_2 \tag{18}
\]
\[
\Delta W_1 = \frac{1}{N}\delta^{(2)}\left(a^{(1)}\right)^T + \lambda W_1 \tag{19}
\]
\[
\Delta b_1 = \frac{1}{N}\sum \delta^{(2)} \tag{20}
\]
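A small sketch of Equations (15) and (16): the softmax output and the class-weighted error signal. The one-hot encoding of ỹ and the variable names are assumptions for illustration.

```python
def softmax_delta(a2, Ws, y_onehot, alpha_class):
    """Softmax output (Eq. 15) and alpha-weighted error signal (Eq. 16).

    a2          : (d_h, N) hidden activations
    Ws          : (d_y, d_h) softmax layer weights
    y_onehot    : (d_y, N) correct labels, one-hot encoded
    alpha_class : (d_y, 1) residual weight per class
    """
    z = Ws @ a2
    z = z - z.max(axis=0, keepdims=True)        # shift for numerical stability
    e = np.exp(z)
    a3 = e / e.sum(axis=0, keepdims=True)       # normalized class probabilities
    d3 = -(y_onehot - a3) * alpha_class         # class-weighted error signal
    return a3, d3
```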

3.3 Setting α

One important aspect of using a dynamic objective is how α is updated after each epoch. This work will experiment with a couple of possible ways to update α. The update rule depends on the learning setting (unsupervised or supervised) and the data set (amount of noise). In all settings α is initially set to 1.

For unsupervised learning, α is updated after each training epoch according to:

\[
r = \frac{1}{N}\sum \left(a^{(3)} - a^{(1)}\right)^2 \tag{21}
\]
\[
\alpha = \frac{r}{\sum r} \tag{22}
\]

This will increase the α-values for inputs that have a high reconstruction error and decrease the α-values for inputs that have a low reconstruction error. However, for some data sets the opposite effect is desirable. We can instead invert the α-values according to:

\[
\alpha = 1 - r \tag{23}
\]

The values of α can be scaled to guarantee a range from 0 to 1 by:

\[
\alpha = \frac{r - \min r}{\max r - \min r} \tag{24}
\]
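A sketch of these unsupervised update rules, under the assumption that the per-input reconstruction error r is averaged over the N examples of the current batch:

```python
def update_alpha_unsup(a1, a3, invert=False, rescale=False):
    """Update the residual weight vector from per-input reconstruction errors.

    Implements Eqs. (21)-(24): r is the mean squared reconstruction error per
    visible unit; alpha is either normalized (Eq. 22), inverted (Eq. 23), or
    min-max scaled (Eq. 24).
    """
    r = np.mean((a3 - a1) ** 2, axis=1, keepdims=True)    # Eq. (21)
    if rescale:
        return (r - r.min()) / (r.max() - r.min())        # Eq. (24)
    if invert:
        return 1.0 - r                                    # Eq. (23)
    return r / r.sum()                                    # Eq. (22)
```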

For supervised fine-tuning, α is updated to increase α_y for categories that have a low classification accuracy and decrease α_y for categories that have a high classification accuracy. With normalization, the update of α becomes:

\[
r = \frac{1}{N}\sum \log\left(a^{(3)}\right)\cdot \tilde{y} \tag{25}
\]
\[
\alpha = \frac{r}{\sum r} \tag{26}
\]
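A corresponding sketch of the supervised update, assuming ỹ is one-hot so that the product with log a^(3) picks out the log-probability assigned to the correct class of each example:

```python
def update_alpha_sup(a3, y_onehot):
    """Per-class residual weights from the log-probability of the correct class.

    Implements Eqs. (25)-(26): classes that are classified with low confidence
    receive a larger share of alpha after normalization.
    """
    N = y_onehot.shape[1]
    r = np.sum(np.log(a3) * y_onehot, axis=1, keepdims=True) / N   # Eq. (25)
    return r / r.sum()                                             # Eq. (26)
```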

It is also possible to learn α automatically during training instead of setting it based on the current model performance. To do this, we use the same type of penalty as the sparsity penalty, namely the Kullback-Leibler (KL) divergence, as an added regularization term on α. The penalty for deviations of α from 1 is expressed as:

\[
\gamma\sum_j\left(\alpha_p\log\frac{\alpha_p}{\alpha_j} + (2-\alpha_p)\log\frac{2-\alpha_p}{2-\alpha_j}\right) \tag{27}
\]

The gradient of α for unsupervised learning is:

\[
\Delta\alpha = -\frac{1}{2N}\sum\left(a^{(3)} - a^{(1)}\right)^2 + \gamma\left(-\frac{\alpha_p}{\alpha} + \frac{2-\alpha_p}{2-\alpha}\right) \tag{28}
\]

and for supervised learning:

\[
\Delta\alpha = -\frac{1}{N}\sum\log\left(a^{(3)}\right)\cdot y + \gamma\left(-\frac{\alpha_p}{\alpha} + \frac{2-\alpha_p}{2-\alpha}\right) \tag{29}
\]

For the update of α there is a trade-off between the reconstruction error and how far α deviates from α_p = 1. In conclusion, signals with a large reconstruction error will generate a lower α, and the model will thus focus less on those inputs. If α_j deviates from 1, the squared error term will decrease and the α-penalty term will increase. The goal is therefore to find an optimal α that minimizes the total cost function. In this setting, there are two sets of parameters to be optimized: the network parameters, θ = {W, b}, and the residual weight vector, α. Training is done in two steps, similar to how training is done in sparse coding [27]: first fix α and update θ, then update α with the new θ. The added α-penalty term comes with a new hyperparameter, γ. A lower value of γ allows α-values further away from 1, meaning more input signals will be ignored. The algorithm for updating θ and α is presented in Algorithm 1, where α_u and α_s are the residual weight vectors for the unsupervised and supervised phase, respectively.

Table 1 summarizes the desired behavior of α depending on the nature of the data. There are three cases where the use of α is helpful: (1) guiding the feature learning during unsupervised learning on relatively noise-free multi-dimensional data where some signals are easier to reconstruct than others, (2) improving the overall classification accuracy in supervised fine-tuning, and (3) removing the influence of noise in a noisy data set. These three cases are shown in the experimental section.

Algorithm 1 Training multiple layers of a dynamic objective sparse auto-encoder

  initialize θ
  for all layers, l = 1 : L do
      initialize α_u^(l)
      repeat
          θ ← θ, α_u^(l)
      until overfit
      repeat
          θ ← θ, α_u^(l)
          α_u^(l) ← θ, α_u^(l)
      until overfit
  end for
  initialize α_s
  repeat
      θ, α_s ← θ, α_s, ỹ
  until convergence
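Read as code, Algorithm 1 alternates between updating θ with α fixed and updating α with the new θ, first per layer in the unsupervised phase and then once during supervised fine-tuning. The skeleton below is a sketch of that loop; the callables `train_theta`, `update_alpha_u`, `finetune_theta`, and `update_alpha_s`, the `encode` method, and the fixed epoch counts are placeholders of ours (the paper uses early stopping on a validation set instead of fixed epoch counts).

```python
import numpy as np

def train_dynamic_objective(layers, data, labels, train_theta, update_alpha_u,
                            finetune_theta, update_alpha_s,
                            n_epochs_1=50, n_epochs_2=50, n_epochs_ft=50):
    """Skeleton of Algorithm 1: alternate theta- and alpha-updates, layer by layer."""
    x = data
    for layer in layers:                                  # unsupervised, layer by layer
        alpha_u = np.ones((x.shape[0], 1))                # alpha_u initialized to 1
        for _ in range(n_epochs_1):                       # update theta with alpha fixed
            train_theta(layer, x, alpha_u)
        for _ in range(n_epochs_2):                       # alternate theta and alpha updates
            train_theta(layer, x, alpha_u)
            alpha_u = update_alpha_u(layer, x)
        x = layer.encode(x)                               # this layer's output feeds the next
    alpha_s = np.ones((labels.shape[0], 1))               # supervised fine-tuning
    for _ in range(n_epochs_ft):
        finetune_theta(layers, data, labels, alpha_s)
        alpha_s = update_alpha_s(layers, data, labels)
    return layers
```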

[Table 1 about here.]

3.4 Learning a mixture of experts with α

We also present a method for constructing a mixture of experts with the use of a dynamic objective. This is similar to model averaging, which has gained much attention lately with the recent success of model averaging over decision trees and dropout [14] in machine learning competitions. After pre-training, the model is duplicated d_y times, where d_y is the number of classes. Each model θ_i is then fine-tuned with α set to all zeros except for α_i = 1. This makes each model specialized on one class. During inference, the predicted label is the class whose specialized model gives the highest certainty for its own class; a sketch of this inference step is given below.
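A minimal sketch of this inference rule, assuming `models[i]` is the expert fine-tuned for class i and `predict_proba` returns a vector of softmax class probabilities; both names are ours, not the paper's.

```python
def moe_predict(models, predict_proba, x):
    """Pick the class whose specialized expert is most certain about its own class.

    models        : list of d_y fine-tuned copies of the pre-trained network
    predict_proba : function (model, x) -> (d_y,) vector of class probabilities
    """
    certainties = [predict_proba(model, x)[i] for i, model in enumerate(models)]
    return int(np.argmax(certainties))    # predicted class index
```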

4 Datasets

We evaluate our method on various data sets from different domains, see Figure 3. The domains are multi-variate time-series data from an electronic nose and from motion capture, and images of handwritten digits where additional factors of variation have been added.

[Figure 3 about here.]

4.1 Electronic nose

Data from an electronic nose was obtained to discriminate between different types of bacteria that can be found in blood and could lead to septicemia. Identifying bacteria in blood using an electronic nose has been done before with a 22-sensor array [35], as well as with a single sensor [10]. Each sample contains one type of bacteria, and a total of 10 bacteria are used. The sampling system used for this data set is the NST 3220 Emission Analyzer from Applied Sensors, Linköping, Sweden, which is composed of 10 MOS and 12 MOSFET sensors, for a total of 22 sensors. The odour sampling phase and the recovery phase are 30 and 260 seconds long, respectively. With a baseline phase of 10 seconds, the total length of one reading is 5 minutes. The dataset has 800 readings evenly divided among the 10 classes.

4.2 Activity Recognition


CMU locomotive: A dataset with a total length of 164 seconds, containing 4 locomotive styles of motion (jog, jump, run, walk) from the CMU Graphics Lab Motion Capture Database. The sample rate is 120 frames per second.

PAMAP2 activity recognition [30]: This dataset consists of 9 subjects performing 12 activities (lying, sitting, standing, walking, running, cycling, Nordic walking, ascending stairs, descending stairs, vacuum cleaning, ironing, rope jumping) while wearing three Colibri wireless IMU sensors and one heart-rate monitor. The number of dimensions used is 31, the total length of the data set is 2.7 hours, and the frame rate is 100 Hz.

4.3 MNIST variation dataset

This dataset consists of 4 variations of the standard MNIST handwritten digits [21]: added background noise (mnist-back-rand), added background images (mnist-back-images), rotated digits (mnist-rot), and rotated digits with background images (mnist-rot-back-images). We use two of these variations in our experiments, namely mnist-back-rand and mnist-back-images. The input size is 28 × 28 = 784 dimensions and each dataset consists of 10000 training examples, 2000 validation examples, and 50000 test examples.

5 Experiments

The experiments were performed on a GeForce 660 Ti using MATLAB 2012a with the Parallel Computing Toolbox, which has built-in GPU support. We use early stopping to prevent overfitting in both pre-training and fine-tuning: training is stopped if the cost on a held-out validation set has not improved in the last 5 epochs. The optimization method used is Mark Schmidt's minFunc with L-BFGS. We chose a large batch size of 10000 to benefit from the GPU speed-up [22]. Experiments are performed with 5-fold repeated random sub-sampling validation unless the dataset already has a dedicated test set (MNIST variations).

5.1 Electronic nose data

The full data set consists of 800 readings. The data was randomly divided into training and test sets of 50% each, and the validation set was formed from 10% of the training set. To reduce dimensionality, the second half of each reading was discarded and the first half was downsampled by a factor of 2. With 22 signals this gave an input dimension of 1650. The first value of each reading was subtracted from that reading in order to remove bias. Readings were then normalized by dividing by the maximum value over all readings for that signal, in order to keep each value between 0 and 1. These experiments require a reading of at least 2.5 minutes; a method using sparse auto-encoders for fast classification of electronic nose data has been presented in [16].
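A sketch of this preprocessing, assuming the readings are stored as a NumPy array of shape (n_readings, n_samples, 22); the array layout and exact sample counts are assumptions for illustration only.

```python
def preprocess_enose(readings):
    """Keep the first half of each reading, downsample by 2, remove bias,
    normalize per signal to [0, 1], and flatten to one vector per reading."""
    n, t, s = readings.shape
    x = readings[:, : t // 2 : 2, :]                    # first half, every 2nd sample
    x = x - x[:, :1, :]                                 # subtract the first value (bias)
    x = x / x.max(axis=(0, 1), keepdims=True)           # divide by per-signal maximum
    return x.reshape(n, -1)                             # e.g. 75 samples x 22 signals = 1650
```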

The hyperparameters are set by random grid search [3], where the possible choices for the parameters are

β ∈ {0.001, 0.01, 0.1, 1, 10}
λ ∈ {10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}}
ρ = 0.2

The number of hidden units was set to 200. Figure 4 shows the classification accuracy with different model parameters on the e-nose data set.

[Figure 4 about here.]

We first pre-train a regular 1-layer sparse auto-encoder. Then we fine-tune that model with two different approaches, namely with a dynamic objective (updating α after each epoch) and without a dynamic objective (no update of α). Figure 5 shows the classification accuracy and α-values for each bacteria class after each fine-tuning epoch. When a dynamic objective is used, classes that have a high accuracy are assigned a lower α-value according to Equation (26), and vice versa for classes with low accuracy, see Figure 5(c). The initial value of α is set to 1/d_y. The α for the three most difficult classes (STLUG, HINFL, and PSAER) is increased, while the α for the other classes is decreased. The absolute values of α are higher in the beginning of fine-tuning but later approach the initial values. This gives the effect that, while the overall classification accuracy may be lower, classes that were previously difficult to classify now obtain a better classification accuracy at the cost of previously easy-to-classify classes obtaining a lower accuracy, see Figure 5(a) from epoch 0 to 1. For the subsequent fine-tuning epochs, the same process is applied until convergence. Figure 5(b) shows the classification accuracy when a standard fine-tuning without a dynamic objective is applied. Although the overall accuracy is improved from 90.5% to 95.5%, some classes (HINFL, SRFCL and PRMIR) even get a decreased classification accuracy. The fine-tuning is stopped after the first epoch since the optimal solution has been found. For completeness, the values of α, which are initially set to 1 and not updated, are shown in Figure 5(d).

[Figure 5 about here.]

The overall classification results for these experiments and some comparisons can be seen in Table 2. The previous record of 96.2% using a 2-layered RBM [19] and the result of a standard sparse auto-encoder of 95.83% were improved to 98.81% using a dynamic objective.

[Table 2 about here.]

This shows that using a dynamic objective is helpful in the fine-tuning phase. One advantage of this approach is that it can be used on any pre-trained model, so there is no need to re-train the already learned features. There is also no added hyperparameter, so it is fast to examine whether using a dynamic objective in the fine-tuning would improve the results.

5.2 Motion capture data

We evaluate our model on two data sets of multi-dimensional time-series motion capture data.

For the PAMAP2 activity recognition [30] data set we prepared the data by removing the invalid orientation channels from the three IMUs, as well as the 3D-acceleration data with the ±6g scale because of saturation. Each signal is normalized by subtracting the mean and then dividing by the variance of that signal. We use a linear activation function in the final layer since the data are real-valued sensor readings. The data was downsampled by a factor of 4, and we used a window of 50 samples with no overlap, which means that each example is an activity of 2 seconds. For the CMU locomotive data, the ground-plane forward velocity is calculated from (and replaces) the X and Z position coordinates.

Table 3 shows the classification accuracy for the two motion capture data sets.

[Table 3 about here.]

For PAMAP2, the classification accuracy went from 89.2% to 90.89% by updating α in the fine-tuning phase. We can improve this result by using α in the unsupervised phase of learning as well, achieving 91.91% accuracy. Figure 6 shows the mean reconstruction error for each signal and sample point, and the final α-values when using a dynamic objective in the unsupervised phase.


The reconstruction error is higher for some signals than others. We decrease α for the difficult-to-reconstruct signals, which contain much noise. Outlier examples in the current training batch are removed before updating α. Tied weights are used; otherwise, the effect of α on W_1 would be diminished during back-propagation. The final values of α can be seen in Figure 6(b). When using a mixture of experts with a dynamic objective, the results were improved to 94.32%. For each class we set α to 0 except for α_i ∈ {10^1, 10^2, 10^3, 10^4} for class i. Figure 7 shows the classification accuracy for the two data sets with different values of α. It can be seen that the optimal α-value is different for the two data sets.

[Figure 7 about here.]

For the CMU data set, an earlier version of the dynamic objective sAE was presented in [20] and reported a classification accuracy of 90.2 ± 3.5%. In that work, a dynamic objective was only used in the unsupervised phase. A standard sAE gives an accuracy of 89.1% and an sAE with a dynamic objective gives an accuracy of 92.19%. This result was significantly improved to 97.12% by using a mixture of experts with a dynamic objective.

5.3 MNIST variation

For the purpose of demonstrating that our algorithm also works on images, we performed experiments on 2 of the MNIST variation data sets [21]. We use a 1-layer network with 500 hidden units. The input data can be seen in Figure 3(d). Figure 8 shows the learned features with and without using a dynamic objective in the unsupervised phase for mnist-backg-noise. The use of a dynamic objective produces features that are more concentrated on the middle of the image and do not make an effort to reconstruct the noise around the edges.


The classification results are presented in Table 4. For mnist-back-rand, α was automatically updated according to Section 3.3. For mnist-back-images, α was set to focus on the half of the inputs that was most difficult to reconstruct; when using the automatic update of α, the model instead tried to reconstruct the background rather than the digits.

[Table 4 about here.]

Figure 9 shows the α-values for γ = 1 and γ = 0.1. A lower value of γ has the effect that all input pixels get a lower α-value; notice the values on the colorbar.

[Figure 9 about here.]

Figure 10 shows how α changes after each training epoch. The salt-and-pepper noise around the edges is increasingly ignored by the model, and the capacity of the model is instead spent on reconstructing the handwritten digit.

[Figure 10 about here.]

6 Discussion and Future work

This paper shows that an automatic method that weights the input signals improves the learning of feature representations. We have shown that our method improves the classification results when used in either the unsupervised phase or the supervised phase, and the accuracy was improved further when a dynamic objective was used in both phases. Our method also provides a natural way of implementing a mixture of experts. User knowledge about the data can be incorporated in the initial setting of α or in the update rule. The update rule of α can be set manually, or α can be included in the model parameters. If α is included in the model parameters, the added α-penalty term keeps the values close to 1; however, if the reconstruction error is large for one unit, α decreases for that unit and that input unit is ignored. This setting is suitable for data that has many noisy signals.

A dynamic objective should only be applied to a previously trained model, since the update of α depends on the current reconstruction error, which is naturally high in the beginning of training a randomly initialized network. This makes the model suitable as an add-on for previously trained auto-encoders that improves the already learned feature representations.

Future work includes exploring ways of updating α based on criteria other than the reconstruction or classification error in order to improve generalization.

Acknowledgements

The motion capture data used in this project was obtained from mocap.cs.cmu.edu and was created with funding from NSF EIA-0196217.

References

[1] I. Arel, D. Rose, and T. Karnowski. Deep machine learning - a new frontier in artificial intelligence research. IEEE Computational Intelligence Magazine, 14:12-18, Nov 2010.

[2] Yoshua Bengio. Learning deep architectures for AI. Technical Report 1312, Dept. IRO, Universite de Montreal, 2007.

[3] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. Technical Report arXiv:1206.5533, U. Montreal; Lecture Notes in Computer Science Volume 7700, Neural Networks: Tricks of the Trade, Second Edition, Editors: Grégoire Montavon, Geneviève B. Orr, Klaus-Robert Müller, 2012.

[4] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. Technical Report arXiv:1206.5538, U. Montreal, 2012.

[5] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19 (NIPS 2006), pages 153-160, 2006.

[6] Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press, 2007.

[7] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proc. 30th International Conference on Machine Learning (ICML), 2013.

[8] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, February 2012.

[9] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML), 2012.

[10] M. Bruins, A. Bos, P. L. Petit, K. Eadie, A. Rog, R. Bos, G. H. van Ramshorst, and A. van Belkum. Device-independent, real-time identification of bacterial pathogens with a metal oxide-based olfactory sensor. Eur. J. Clin. Microbiol. Infect. Dis., 28:775-780, Jul 2009.

[11] Adam Coates and Andrew Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems (NIPS), 2011.

[12] D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625-660, February 2010.

[13] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.

[14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. http://arxiv.org/abs/1207.0580, 2012.

[15] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, July 2006.

[16] Martin Längkvist, Silvia Coradeschi, Amy Loutfi, and John Bosco Balaguru Rayappan. Fast classification of meat spoilage markers using nanostructured ZnO thin films and unsupervised feature learning. Sensors, 13(2):1578-1592, 2013. doi:10.3390/s130201578.

[17] Martin Längkvist, Lars Karlsson, and Amy Loutfi. Sleep stage classification using unsupervised feature learning. Advances in Artificial Neural Systems, 2012, 2012. doi:10.1155/2012/107046.

[18] Martin Längkvist and Amy Loutfi. Unsupervised feature learning for electronic nose data applied to bacteria identification in blood. In NIPS workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[19] Martin Längkvist and Amy Loutfi. Unsupervised feature learning for electronic nose data applied to bacteria identification in blood. In NIPS workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[20] Martin Längkvist and Amy Loutfi. Not all signals are created equal: Dynamic objective auto-encoder for multivariate data. In NIPS workshop on Deep Learning and Unsupervised Feature Learning, 2012.

[21] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML), 2007.

[22] Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng. On optimization methods for deep learning. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, 2011.

[23] Honglak Lee, Chaitanya Ekanadham, and Andrew Y. Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems 20, pages 873-880, 2008.

[24] Piotr Mirowski, Deepak Madhavan, and Yann LeCun. Time-delay neural networks and independent component analysis for EEG-based prediction of epileptic seizures propagation. In Association for the Advancement of Artificial Intelligence Conference, 2007.

[25] J. Ngiam, P. Koh, Z. Chen, S. Bhaskar, and Andrew Y. Ng. Sparse filtering. In Advances in Neural Information Processing Systems (NIPS), 2011.

[26] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, 2011.

[27] Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.

[28] Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems (NIPS 2007), 2007.

[29] Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In J. Platt et al., editor, Advances in Neural Information Processing Systems (NIPS 2006). MIT Press, 2006.

[30] A. Reiss and D. Stricker. Introducing a new benchmarked dataset for activity monitoring. In The 16th IEEE International Symposium on Wearable Computers, 2012.

[31] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML), 2011.

[32] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. arXiv:1206.1106, 2012.

[33] Kihyuk Sohn, Guanyu Zhou, and Honglak Lee. Jointly learning and selecting features via conditional point-wise mixture RBMs. In NIPS workshop on Deep Learning and Unsupervised Feature Learning, 2012.

[34] Graham Taylor, G. E. Hinton, and Sam Roweis. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems, 2007.

[35] M. Trincavelli, S. Coradeschi, A. Loutfi, B. Söderquist, and P. Thunberg. Direct identification of bacteria in blood culture samples using an electronic nose. IEEE Transactions on Biomedical Engineering, 57(12):2884-2890, 2010.

[36] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371-3408, 2010.

[37] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103, 2008.

[38] Guanyu Zhou, Kihyuk Sohn, and Honglak Lee. Online incremental feature learning with denoising autoencoders. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012. JMLR W&CP 22.

List of Figures

1  (a) Unsupervised and (b) supervised fine-tuning of an auto-encoder with 2 layers.
2  (a) Unsupervised and (b) supervised fine-tuning of a 1-layer dynamic objective auto-encoder.
3  (a) Electronic nose exposed to bacteria in blood (b) Motion capture: CMU locomotive (c) Motion capture: PAMAP2 activity recognition (d) MNIST variations
4  Classification accuracy on e-nose data with different (a) hidden layer sizes, (b) λ-values, and (c) β-values
5  Classification accuracy after fine-tuning (a) with a dynamic objective and (b) without a dynamic objective. Values of α for (c) a dynamic objective and (d) without a dynamic objective. See text for details.
6  (a) Average reconstruction error for training data after pre-training on the PAMAP2 data set. (b) Final α-values for unsupervised learning. Signals with a low reconstruction error have a higher α-value and difficult-to-reconstruct signals have a lower α-value. Noisy signals are thereby ignored and the representational capacity is spent on easier signals.
7  Classification accuracy with different values of α_i for training model θ_i for use in a mixture of experts.
8  Learned features (a) without a dynamic objective and (b) with a dynamic objective
9  α-values for (a) γ = 1 (b) γ = 0.1
10 Progress of α after each epoch of unsupervised learning on mnist-backg-noise.


Figure 1: (a) Unsupervised and (b) supervised fine-tuning of an auto-encoder with 2 layers.


Figure 2: (a) Unsupervised and (b) supervised fine-tuning of a 1-layer dynamic objective auto-encoder.


Figure 3: (a) Electronic nose exposed to bacteria in blood (b) Motion capture: CMU locomotive (c) Motion capture: PAMAP2 activity recognition (d) MNIST variations


[Figure 4 panels: classification accuracy [%] versus (a) hidden layer size {50, 200, 500, 1000}, (b) λ from 10^{-2} down to 10^{-10}, and (c) β from 10 down to 3·10^{-5}; curves for pre-training, fine-tuning, and fine-tuning with α.]

Figure 4: Classification accuracy on e-nose data with different (a) hidden layer sizes, (b) λ-values, and (c) β-values.


[Figure 5 legend: E. coli (ECOLI), Pseudomonas aeruginosa (PSAER), Staphylococcus aureus (STA), Klebsiella oxytoca (KLOXY), Proteus mirabilis (PRMIR), Enterococcus faecalis (SRFCL), Staphylococcus lugdunensis (STLUG), Pasteurella multocida (PASMU), Streptococcus pyogenes (HSA), Hemophilus influenzae (HINFL).]

Figure 5: Classification accuracy after fine-tuning (a) with a dynamic objective and (b) without a dynamic objective. Values of α for (c) a dynamic objective and (d) without a dynamic objective. See text for details.


Figure 6: (a) Average reconstruction error on training data after pre-training on the PAMAP2 data set. (b) Final α-values for unsupervised learning. Signals with a low reconstruction error have a higher α-value and difficult-to-reconstruct signals have a lower α-value. Noisy signals are thereby ignored and the representational capacity is spent on easier signals.


Figure 7: Classification accuracy with different values of α_i for training model θ_i for use in a mixture of experts.


Figure 8: Learned features (a) without dynamic objective and (b) with dynamic objective


[Figure 9 colorbars: α ranges approximately 0.985-0.991 for (a) γ = 1 and 0.85-0.91 for (b) γ = 0.1.]

Figure 9: α values for (a) γ = 1 (b) γ = 0.1


Figure 10: Progress of α after each epoch of unsupervised learning on mnist-backg-noise.


List of Tables

1  Desired behavior of α
2  Classification accuracy [%] on electronic nose data.
3  Classification accuracy (mean ± standard deviation) [%] from recognition of 12 activities from the PAMAP2 data set and 4 styles of motion from the CMU locomotive data set. We use the notation (sup), (unsup), and (unsup+sup) to indicate if a dynamic objective was used in the unsupervised phase, supervised phase, or both. MoE stands for mixture of experts.
4  Classification accuracy [%] from 2 variations of the MNIST data set.


Table 1: Desired behavior of α

  low (0 ≤ α < 1)                 high (α > 1)
  easy-to-reconstruct signals     hard-to-reconstruct signals
  high-accuracy classes           low-accuracy classes
  noisy signals


Table 2: Classification accuracy [%] on electronic nose data.

  Setup                  Bacteria in blood
  Features + SVM [35]    93.7
  RBM-1 [19]             93.8
  RBM-2 [19]             96.2
  cRBM-1 [19]            85.0
  SVM on raw data        92.05 ± 2.44
  sAE-1                  95.83 ± 0.98
  dosAE-1 (sup)          98.81 ± 0.82


Table 3: Classification accuracy (mean ± standard deviation) [%] from recognition of 12 activities from the PAMAP2 data set and 4 styles of motion from the CMU locomotive data set. We use the notation (sup), (unsup), and (unsup+sup) to indicate if a dynamic objective was used in the unsupervised phase, supervised phase, or both. MoE stands for mixture of experts.

  Method                   PAMAP2         CMU
  Softmax on raw data      78.20 ± 1.2    88.74 ± 6.6
  sAE (sup)                89.20 ± 4.9    89.10 ± 5.4
  dosAE (unsup)            90.69 ± 4.3    90.20 ± 3.5
  dosAE (sup)              90.89 ± 5.2    91.89 ± 3.1
  dosAE (unsup+sup)        91.91 ± 2.4    92.19 ± 3.7
  dosAE (unsup+sup MoE)    94.32 ± 2.1    97.12 ± 1.9


Table 4: Classification accuracy [%] from 2 variations of the MNIST data set.

  Method                   mnist-back-rand    mnist-back-images
  Softmax on raw data      72.17              69.13
  sAE                      79.23              78.98
  dosAE (unsup+sup)        81.13              82.41
  dosAE (unsup+sup MoE)    80.53              79.82
