
Neural Networks and Uncertainty Estimation for Financial Asset Predictions

ROBERT CEDERGREN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Neural Networks and Uncertainty Estimation for Financial Asset Predictions

ROBERT CEDERGREN

Master in Computer Science
Date: December 22, 2020

Supervisors at Lynx Asset Management: Hannes Meier, Max Nyström Winsa, Tobias Rydén

Supervisor at KTH: Giampiero Salvi
Examiner: Pawel Herman

School of Electrical Engineering and Computer Science

Swedish title: Neurala Nätverk och Osäkerhetsskattning i Prediktioner av Finansiella Tillgångar


Abstract

With the capability of modeling complex non-linear mappings, neural networks have obtained state-of-the-art performance on various tasks. However, traditional neural networks are prone to overfitting as they tend to be overconfident on unseen, noisy and incorrectly labeled data. Neither do they produce meaningful representations of uncertainty.

Deep ensembles and Bayesian neural networks using mean field approximation as well as multiplicative normalizing flows are methods that all aim to alleviate these issues. Using these methods on financial time series, this work studies the relationship between absolute prediction errors and predictive uncertainty estimates. It further looks at predictive performance and whether the predictive uncertainty estimates can improve predictive performance.

The results obtained in this work are an outcome from a single test dataset. As such, the results demonstrate tendencies and indications rather than support far-reaching conclusions.

All methods obtained a minor positive correlation between predictive uncertainty and absolute prediction error. However, the observed magnitude implies that there hardly exists any meaningful linear relation. Further investigation illustrated the mean absolute prediction error to be an increasing non-linear function of predictive uncertainty percentiles for all methods. In terms of predictive performance, we only observed deep ensembles to obtain better performance metrics than a trivial buy-and-hold trading strategy. To see whether the methods could potentially gain from utilizing the predictive uncertainty estimates when trading, we scaled the predictions by the predictive uncertainty estimates. However, no method demonstrated any potential to gain from this procedure.


Sammanfattning

With the ability to model complex non-linear mappings, neural networks have achieved state-of-the-art results on many different types of problems. Traditional neural networks are, however, prone to overfitting as they tend to be overconfident on unseen, noisy or incorrectly labeled data. Neither do they produce any meaningful representation of uncertainty.

Deep ensembles and Bayesian neural networks using mean field approximation as well as multiplicative normalizing flows are methods that all aim to alleviate these problems.

By applying these methods to financial time series, this work studies the relationship between absolute prediction errors and estimates of the predictive uncertainty, i.e., the uncertainty in predictions. Furthermore, the predictive performance of the models is evaluated, as well as whether the predictive uncertainty estimates can improve predictive performance.

The results obtained in this work stem from a single test dataset. They thus only demonstrate tendencies and indications rather than support far-reaching conclusions.

All methods obtained a minor positive linear relation between predictive uncertainty and absolute prediction errors. The observed magnitude, however, implies that there hardly exists any meaningful linear relation. Further investigation showed the mean absolute prediction error to be an increasing non-linear function of percentiles of predictive uncertainty for all models. In terms of predictive performance, only deep ensembles were seen to obtain better performance metrics than a trivial buy-and-hold strategy. To see whether the methods could potentially gain from using the predictive uncertainty estimates for trading, we scaled the models' predictions by the uncertainty estimates. However, no method demonstrated any potential to gain from this procedure.


Acknowledgments

I want to take the opportunity to thank my supervisors at Lynx Asset Management:

• Hannes Meier (Principal Quantitative Researcher),

• Max Nyström Winsa (Principal Quantitative Researcher),

• Tobias Rydén (Senior Research Partner at Lynx and Adjunct Professor of Computational Statistics at Stockholm University).

Thank you for letting me do this project and thank you for all discussions, meetings and inputs.

I would also like to express my gratitude to my supervisor and examiner at KTH:

• Supervisor: Giampiero Salvi (Associate Professor with KTH and Professor at the Norwegian University of Science and Technology),

• Examiner: Pawel Herman (Associate Professor with KTH).

I would like to thank you both for your invaluable feedback, comments and discussions, which have been of great importance for this report.


Contents

1 Introduction
   1.1 Research Questions and Objective
   1.2 Scope and Limitations
   1.3 Outline

2 Background
   2.1 Financial Background
      2.1.1 Forwards and Futures Contracts
      2.1.2 Sharpe Ratio
      2.1.3 Maximum Drawdown
   2.2 Bayesian Inference
      2.2.1 Bayesian Modeling
      2.2.2 Uncertainty
      2.2.3 Variational Inference
      2.2.4 Normalizing Flows
      2.2.5 Variational Inference with Normalizing Flows
      2.2.6 Auxiliary Random Variables for Variational Methods
   2.3 Models
      2.3.1 Variational Bayesian Neural Networks
      2.3.2 Deep Ensembles
      2.3.3 Complexity by Model
   2.4 Related work
      2.4.1 Uncertainty Calibration for Neural Networks
      2.4.2 Uncertainty Estimation and Approximate Bayesian Inference
      2.4.3 Uncertainty Estimation and Ensembles of Neural Networks
      2.4.4 Neural Networks, Uncertainty Estimation and Financial Asset Predictions

3 Methods
   3.1 Data
      3.1.1 Data Pre-Processing
      3.1.2 Data Split
   3.2 Evaluation
      3.2.1 Absolute Error
      3.2.2 Trading Performance
   3.3 Experimental Settings
      3.3.1 Mean Field Variational Bayesian Neural Networks
      3.3.2 Multiplicative Normalizing Flows for Variational Bayesian Neural Networks
      3.3.3 Deep Ensembles

4 Results
   4.1 Optimization
   4.2 The Uncertainty and Error Relationship
      4.2.1 Correlation
      4.2.2 Mean Absolute Error vs. Uncertainty Percentiles
   4.3 Trading Performance
   4.4 Hyperparameters
   4.5 A Toy Regression Problem

5 Discussion
   5.1 The Dataset
   5.2 Variational Bayesian Neural Networks
   5.3 Uncertainty and Systematic Trading
   5.4 Limitations
   5.5 Adequacy of the Variational Bayesian Neural Networks
   5.6 Ethics, Sustainability and Social Impact

6 Conclusions
   6.1 Future Work

Bibliography

1 Introduction

Neural networks are powerful tools for modeling complex non-linear mappings that have obtained state-of-the-art performance on various tasks. This includes, but is not limited to, applications such as computer vision, speech and speaker recognition, natural language processing as well as robotics.

Standard neural networks provide point estimates of the model parameters as well as of the predictions and can thus be seen as deterministic functions [1, 2, 3, 4]. However, traditional neural networks are prone to overfitting as they tend to be overconfident on unseen, noisy and incorrectly labeled data [1, 2, 3, 5, 6]. Neither do they account for or produce meaningful representations of uncertainty [1, 2, 3, 4, 5, 7]. In many applications, it is desirable that a model provides some reliable quantity of uncertainty or confidence in its predictions, as overconfidence and extrapolation in wrong situations could lead to harmful and unintended behavior [6, 8, 9]. For instance, a self-driving car that predicts that there are no obstacles when driving across a pedestrian crossing should desirably also be reliably confident in this prediction as it acts upon it.

Moreover, learning meaningful uncertainty estimates enables models to know what they do not know. For instance, a discriminative image classifier trained on cats and dogs will, when presented with an image of a cow, predict either cat or dog. However, it should desirably also be able to express a high degree of uncertainty in this prediction.

Uncertainty can be formalized with distributions over model parameters as well as model outputs. The former accounts for model uncertainty, typically referred to as epistemic uncertainty [7, 9, 10]. Modeling the distribution of the output captures noise inherent in the data and can be referred to as modeling the aleatoric uncertainty [7, 9, 10]. The predictive uncertainty can be modeled as a sum of the aleatoric and the epistemic uncertainty [9, 10, 11].


One way of modeling epistemic uncertainty within the scope of neural networks is by using Bayesian neural networks [12, 13]. Bayesian neural networks provide both regularization and uncertainty measures of predictions [2, 3]. Distributions are placed over the network weights and a posterior distribution over the weights can be inferred that captures the parameter uncertainty of the network. However, obtaining the posterior through exact Bayesian inference is typically intractable [6, 9, 10, 11, 14, 15, 16] due to the non-linear nature of neural networks [11, 15, 17]. Hence, one needs to resort to practical approximation schemes [6, 10, 11, 14, 16].

Variational inference [11, 18] is an optimization-based approximation scheme that can make Bayesian inference computationally efficient and scalable to large datasets [11, 16]. A typical variational posterior approximation is the mean field approximation [16, 18, 19, 20], which consists of a fully factorized Gaussian distribution. Using variational inference for Bayesian neural networks, mean field variational Bayesian neural networks [3] (MF-VBNN) efficiently learn distributions over the weights of a neural network. The flexibility of the approximate posterior distribution determines to what extent the complexity of the true posterior distribution can be captured [11, 16, 20]. The mean field approach is efficient [20, 21], yet it does not offer much flexibility [16, 20, 21].

Normalizing flows [20] as well as auxiliary random variables [21, 22, 23, 24] are general recipes that enable flexible, complex and scalable approximate posterior distributions using variational inference. Combining these concepts for Bayesian neural networks, multiplicative normalizing flows for variational Bayesian neural networks [17] (MNF-VBNN) have been demonstrated to efficiently and significantly improve performance over previous methods in terms of both predictive accuracy as well as predictive uncertainty estimates [17].

One way of modeling both epistemic and aleatoric uncertainty is through the use of deep ensembles [6]. Deep ensembles have been demonstrated to improve accuracy, uncertainty estimates as well as out-of-distribution robustness of neural networks [6, 25, 26]. Compared to Bayesian neural networks, which require significant modifications to the training procedure, deep ensembles are parallelizable, relatively simple to implement and require little hyperparameter tuning [6].

As neural networks have proven to be successful in many applications, financial investors are increasingly interested in their capabilities. Predicting financial asset returns is a daunting task, particularly since financial time series are inherently extremely noisy. Value investing, carry investing and momentum investing are a few of numerous investment strategies. Value investing is an investment philosophy where an investor buys securities that appear to be undervalued in terms of some fundamental analysis [27]. Carry trading is a trading strategy where an investor borrows low-yielding assets and invests in assets with a higher rate of return [28, 29]. Momentum investing is based on the empirical observation that securities with strong past performance tend to, on average, outperform securities with poor past performance [27, 30].

Analogously to a self-driving car, a trading algorithm making predictions about future returns should also be reliably confident about its predictions. When faced with an observation, which might be far from the training data, a financial trading algorithm should, analogously to the discriminative image classifier trained on cats and dogs, also be able to express its uncertainty along with the prediction. Suppose a financial trading algorithm learns to output predictions accompanied with meaningful uncertainty estimates, e.g., model error increases as the uncertainty increases. In that case, a systematic trading strategy might benefit from scaling predictions with the corresponding uncertainty estimates. Intuitively, it would make sense for a trading algorithm to make smaller bets when it is less certain about its predictions.

Based on the previous motivations, this work uses financial time series with MF-VBNN, MNF-VBNN and deep ensembles to study predictive uncertainty estimates as well as systematic trading performance.

1.1 Research Questions and Objective

The objective of this work is to employ a range of different neural network architectures such as MF-VBNN, MNF-VBNN and deep ensembles for financial time series prediction in order to contribute to the knowledge development within both academia and the financial industry by answering the following research questions:

1. How is the uncertainty of the posterior predictive distribution related to the absolute prediction error?

2. How do the methods compare in terms of systematic trading?

3. Can a systematic trading strategy be improved for any of the methods by utilizing the uncertainty of the posterior predictive distribution?

1.2 Scope and Limitations

This work is limited to the methods mentioned in Section 1.1 in a regression setting using financial time series. The main focus is on the relationship between the predictive uncertainty and the absolute prediction errors of the models.

Typically, one minimizes the test set error bias by using independent test datasets through k-fold cross-validation in order to be able to test for statistical significance and to draw conclusions. However, as the dataset used in this work is small, only a single test dataset is used. Consequently, the results obtained in this work demonstrate tendencies and indications rather than establish statistical significance or support far-reaching conclusions.

1.3 Outline

The thesis has the following outline. Chapter 2 gives an introduction to the necessary financial and theoretical background as well as the models that this work revolves around. Finally, it outlines related work. Chapter 3 presents the methods, which include data analysis, evaluation metrics and experimental settings. Chapter 4 presents the results of the experiments, followed by Chapter 5 that discusses the results. Chapter 6 presents conclusions and suggests future work.

2 Background

This chapter provides background on the core aspects of the theory of this work as well as related work. Section 2.1 introduces a few necessary financial terms. Section 2.2 describes the theory of Bayesian inference that this work revolves around. Section 2.3 describes the models that are used in the experiments and Section 2.4 outlines related work.

2.1 Financial Background

2.1.1 Forwards and Futures Contracts

A forward contract is an agreement for future delivery of an asset at a stipulated price. Formally, at time $t = 0$ two parties agree to trade an underlying asset $A$ for the forward price $P$ at time $T > 0$, which is when money changes hands. A forward contract serves the purpose of locking in the price. As such, it protects both parties from future price fluctuations.

Futures contracts are the exchange-traded extensions of forward contracts. Similar to forward contracts, futures contracts obligate a buyer/seller to buy/sell an underlying asset at a stipulated price on a specified future date, the maturity date. Futures contracts are standardized in terms of quality, quantity, contract delivery dates, etc. This standardization limits the flexibility of the contracts; however, it has the offsetting advantage of better liquidity due to the fact that investors will focus on the same contracts. Moreover, the exchange solves the counter-party risk associated with forward contracts, as futures investors need to post a deposit, the margin, to guarantee their obligations. Another key difference in relation to forward contracts is that futures contracts are continuously settled. Formally, denoting by $F(t, T, A)$ the futures price at time $t$ with delivery date $T$ of the underlying asset $A$, the buyer receives $F(t, T, A) - F(t-1, T, A)$ at every time step $t$.

2.1.2 Sharpe Ratio

The Sharpe ratio [31] is a performance measure of the risk-adjusted return of an investment. The Sharpe ratio of a risky asset is defined as

$$\text{Sharpe ratio} = \frac{\mathbb{E}[r - r_f]}{\sqrt{\mathrm{Var}[r - r_f]}}, \qquad (2.1)$$

where $r$ is the return of the risky asset and $r_f$ is the risk-free rate. The risk-free rate is the rate that can be earned with certainty, and common practice is to use treasury bills as the risk-free asset.

When trading with futures contracts, no transaction occurs when the contract is entered and an investor can invest the capital, except the margin, at the risk-free rate. Hence, in this work, the risk-free return is assumed to be zero and when we refer to the Sharpe ratio, we will refer to

$$\text{Sharpe ratio} = \frac{\mathbb{E}[r]}{\sigma}, \qquad (2.2)$$

where $\sigma$ is the standard deviation (volatility) of the risky asset. As such, the Sharpe ratio can be interpreted as the expected return per unit of risk associated with the return. This measure is used in this work as a performance measure of trading strategies.
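To make the measure concrete, the Sharpe ratio of Equation 2.2 can be computed from a return series in a few lines. The following is a minimal Python sketch, where the function name and the 252-day annualization factor are illustrative choices rather than anything prescribed by this work:

```python
import numpy as np

def sharpe_ratio(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a return series, assuming a zero risk-free rate.

    Equation 2.2: expected return per unit of volatility, scaled by
    sqrt(periods_per_year) to annualize.
    """
    returns = np.asarray(returns, dtype=float)
    return float(np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1))

# Example: i.i.d. daily returns with a small positive drift.
rng = np.random.default_rng(0)
daily = rng.normal(loc=5e-4, scale=1e-2, size=1000)
print(f"Sharpe ratio: {sharpe_ratio(daily):.2f}")
```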

2.1.3 Maximum Drawdown

Drawdown is a measure of the decline of an asset from a historical peak. Maximum drawdown is a performance measure that measures the largest drawdown over the history of an asset. Let $\{r_t\}_{t=1}^{T}$ denote a return series and let $\{R_t\}_{t=1}^{T}$ denote the corresponding cumulative return series, where $R_t = \sum_{i=1}^{t} r_i$. The maximum drawdown is then defined as

$$\text{Maximum drawdown} = \max_{i \in [1,T]} \Big[ \max_{t \in [1,i]} R_t - R_i \Big]. \qquad (2.3)$$

As such, maximum drawdown is a downside risk indicator, as it quantifies the maximum loss from a peak of a portfolio/asset over a specified time period, and is used in this work as a performance measure of trading strategies. Note that it only measures the size of the largest loss and does not take the frequency of large losses into account.
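The running-maximum form of Equation 2.3 vectorizes directly. Below is a minimal sketch, assuming additive (cumulatively summed) returns exactly as in the definition above; the helper name is illustrative:

```python
import numpy as np

def max_drawdown(returns: np.ndarray) -> float:
    """Maximum drawdown of a return series per Equation 2.3.

    Cumulative returns are formed by summation (as in the definition above);
    the drawdown at step i is the running peak minus the current value.
    """
    cumulative = np.cumsum(np.asarray(returns, dtype=float))
    running_peak = np.maximum.accumulate(cumulative)
    return float(np.max(running_peak - cumulative))

returns = np.array([0.02, -0.01, 0.03, -0.05, -0.02, 0.04])
print(f"Maximum drawdown: {max_drawdown(returns):.4f}")  # largest peak-to-trough loss
```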


2.2 Bayesian Inference

This section introduces the theory of Bayesian inference that this work revolves around. We first describe Bayesian modeling and then move on to uncertainty, variational inference, normalizing flows, variational inference with normalizing flows and, lastly, auxiliary random variables for variational methods.

2.2.1 Bayesian Modeling

Let $\mathcal{D} = \{X, Y\}$ be a dataset of $N$ i.i.d. inputs $X = \{x_1, \ldots, x_N\}$ and outputs $Y = \{y_1, \ldots, y_N\}$, where $x_i$ and $y_i$ are $D_x$-dimensional and $D_y$-dimensional vectors, respectively. In a probabilistic context, one is interested in finding the parameters $\omega$ of the function $y = f^{\omega}(x)$ that is most likely to have generated the outputs given the inputs. This can be done through maximum likelihood estimation (MLE), where the parameters $\omega$ are given by

$$\omega^{\mathrm{MLE}} = \operatorname*{argmax}_{\omega} \log p(Y|X, \omega) \qquad (2.4)$$
$$= \operatorname*{argmax}_{\omega} \sum_{i=1}^{N} \log p(y_i|x_i, \omega), \qquad (2.5)$$

where $p(Y|X, \omega)$ is the likelihood of the data, i.e., the probabilistic model from which the outputs are generated given the inputs and some parameters.

Following a Bayesian approach, we place a prior distribution $p(\omega)$ over the parameters. This represents our prior beliefs about which values of the parameters govern the process that has generated the data. The parameter estimation then amounts to a maximum a posteriori (MAP) estimation, where the parameters $\omega$ are given by

$$\omega^{\mathrm{MAP}} = \operatorname*{argmax}_{\omega} [\log p(Y|X, \omega) + \log p(\omega)]. \qquad (2.6)$$

Setting $p(\omega)$ to be a zero-centered diagonal Gaussian or Laplace distribution corresponds to an L2 or L1 regularization, respectively [3, 11]. Note that MLE is a special case of MAP where the prior is uninformative.

MLE and MAP estimations obtain point estimates of the model parameters $\omega$. Alternatively, by employing Bayesian parameter estimation, we admit that there might be many possible values of the model parameters $\omega$ compatible with the data. In other words, by using Bayes theorem, we can invoke the posterior distribution

$$p(\omega|X, Y) = \frac{p(Y|X, \omega)\, p(\omega)}{p(Y|X)} = \frac{p(Y|X, \omega)\, p(\omega)}{\int p(Y|X, \omega)\, p(\omega)\, d\omega}, \qquad (2.7)$$

which embodies the uncertainty of the model. The posterior distribution captures the most probable parameters given the dataset.

The $p(Y|X)$ term in Equation 2.7 is commonly referred to as the evidence or the marginal likelihood, as the last step in Equation 2.7 marginalizes out the model parameters from the likelihood. Marginalization can be done analytically for relatively small datasets and simple models such as Bayesian linear regression, where the prior and the likelihood are conjugate distributions. However, for big datasets and more complex models such as (non-linear) neural networks, this marginalization becomes intractable [9] and one needs to resort to some approximation scheme, e.g., variational inference, which is covered in Section 2.2.3.

The posterior distribution is employed for the (posterior) predictive distribution for unseen data $x^*, y^*$ as

$$p(y^*|x^*, X, Y) = \int p(y^*|x^*, \omega)\, p(\omega|X, Y)\, d\omega \qquad (2.8)$$
$$= \mathbb{E}_{p(\omega|X,Y)}[p(y^*|x^*, \omega)]. \qquad (2.9)$$

This can be seen as a prediction by an ensemble of an uncountable set of models. The integral can be interpreted as a weighted sum of predictions for all possible values of $\omega$, weighted by the plausibility $p(\omega|X, Y)$. Again, the marginalization can be performed analytically for relatively simple models; however, for more complex models, it needs to be approximated [9].

2.2.2 Uncertainty

As explained in Chapter 1, the concept of (predictive) uncertainty can be divided into two sub-categories, epistemic and aleatoric uncertainty [9].

Epistemic uncertainty refers to model uncertainty. This uncertainty is reducible and can be decreased given enough data [9]. As such, epistemic uncertainty is due to the lack of a model's knowledge, i.e., an insufficient amount of observations. It is thus important for small datasets and necessary in order for a model to detect out-of-distribution samples [10], i.e., samples that are different from the training data.

Aleatoric uncertainty is the uncertainty inherent in the data [10]. Thus, it is irreducible and cannot be explained away with more data [9, 10]. Aleatoric uncertainty can be further categorized into homoscedastic and heteroscedastic uncertainty. Homoscedastic uncertainty assumes constant variance of the output, while heteroscedastic uncertainty means that different inputs have different levels of noisy outputs [9, 10]. Heteroscedastic models are useful when noise levels vary across the observation space [9].

For instance, in a regression setting, the mean squared error (MSE)

$$\mathrm{MSE} \propto \sum_{i=1}^{N} \|y_i - \hat{y}_i\|_2^2, \qquad \hat{y} = f^{\omega}(x), \qquad (2.10)$$

is commonly used as the loss metric/minimization objective [6]. As

$$-\log p(y|x, \omega) = -\log \mathcal{N}(y;\, f^{\omega}(x),\, \sigma^2 I) \qquad (2.11)$$
$$\propto \|y - \hat{y}\|_2^2, \qquad (2.12)$$

we note that minimizing the MSE is equivalent to minimizing a Gaussian negative log-likelihood with constant variance. As such, the MSE assumes homoscedastic aleatoric uncertainty.

By enabling a model to output an estimate of the variance of the likelihood [6, 9, 10] such that $\{\hat{y}, \hat{\sigma}^2\} = f^{\omega}(x)$, we get a heteroscedastic model. The likelihood can then be given by $p(y|x, \omega) = \mathcal{N}(y;\, \hat{y}, \hat{\sigma}^2)$, which amounts to the minimization objective

$$-\frac{1}{N}\sum_{i=1}^{N} \log p(y_i|x_i, \omega) \propto \sum_{i=1}^{N} \left\| \frac{y_i - \hat{y}_i}{\hat{\sigma}_i} \right\|_2^2 + \|\log \hat{\sigma}_i^2\|_1. \qquad (2.13)$$

From Equation 2.13 we note that allowing the model to predict the aleatoric uncertainty enables it to be more robust against noisy data, as inputs with higher aleatoric uncertainty have a smaller effect on the loss. The model is nevertheless discouraged from ignoring the data completely, i.e., predicting high aleatoric uncertainty for all inputs, due to the $\log \hat{\sigma}^2$ term: large aleatoric uncertainties make this term penalize the model. On the contrary, the model is also discouraged from predicting very low aleatoric uncertainty for all inputs, as this would blow up the $\left\| \frac{y - \hat{y}}{\hat{\sigma}} \right\|_2^2$ term.
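As an illustration of Equation 2.13, the criterion below is one way such a heteroscedastic Gaussian negative log-likelihood could be written in PyTorch, assuming the network outputs the mean together with the log-variance so that the predicted variance stays positive; the function name is ours:

```python
import torch

def heteroscedastic_nll(y: torch.Tensor,
                        y_hat: torch.Tensor,
                        log_var: torch.Tensor) -> torch.Tensor:
    """Gaussian negative log-likelihood with per-input variance (cf. Equation 2.13).

    The inverse-variance weight shrinks the squared error of noisy inputs;
    the log-variance term penalizes predicting large uncertainty everywhere.
    """
    inv_var = torch.exp(-log_var)
    return torch.mean(inv_var * (y - y_hat) ** 2 + log_var)

# Example: a bad prediction is down-weighted when its predicted noise is high.
y = torch.tensor([1.0, 2.0])
y_hat = torch.tensor([1.1, 0.0])
log_var = torch.tensor([0.0, 2.0])  # high predicted noise on the bad prediction
print(heteroscedastic_nll(y, y_hat, log_var))
```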

As mentioned in the previous section, the predictive distribution (Equation 2.8) needs to be approximated. By sampling parameters $\hat{\omega} \sim p(\omega|X, Y)$, we can approximate the first and second moments of the predictive distribution empirically [4, 9, 10] using Monte Carlo integration:

$$\mathbb{E}[y^*] \approx \frac{1}{n}\sum_{i=1}^{n} f^{\hat{\omega}_i}(x^*), \qquad (2.14)$$

$$\mathrm{Var}(y^*) \approx \sigma^2 + \frac{1}{n}\sum_{i=1}^{n} f^{\hat{\omega}_i}(x^*)^2 - \mathbb{E}[y^*]^2 \qquad (2.15)$$

$$\approx \underbrace{\sigma^2}_{\text{aleatoric uncertainty}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n} \hat{y}_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \hat{y}_i\right)^2}_{\text{epistemic uncertainty}}, \qquad (2.16)$$

where $\{\hat{y}_i\}_{i=1}^{n}$ is the set of $n$ outputs from the sampled set of parameters $\{\hat{\omega}_i\}_{i=1}^{n}$ of a homoscedastic model.

For a heteroscedastic model the predictive uncertainty can be approximated by

$$\mathrm{Var}(y^*) \approx \underbrace{\frac{1}{n}\sum_{i=1}^{n} \hat{\sigma}_i^2}_{\text{aleatoric uncertainty}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n} \hat{y}_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} \hat{y}_i\right)^2}_{\text{epistemic uncertainty}}, \qquad (2.17)$$

where $\{\hat{y}_i, \hat{\sigma}_i\}_{i=1}^{n}$ is the set of outputs from the sampled set of parameters $\{\hat{\omega}_i\}_{i=1}^{n}$ [6, 9, 10].

Note that aleatoric and epistemic uncertainties are considered independent.
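Given $n$ sampled parameter sets from a heteroscedastic model, Equation 2.17 amounts to averaging the predicted variances and adding the variance of the predicted means. A minimal sketch, assuming the per-sample means and variances have already been collected into arrays:

```python
import numpy as np

def predictive_uncertainty(means: np.ndarray, variances: np.ndarray):
    """Decompose the predictive variance per Equation 2.17.

    means, variances: arrays of shape (n,) holding y_hat_i and sigma_hat_i^2
    from n parameter samples of a heteroscedastic model.
    """
    aleatoric = variances.mean()                          # mean predicted variance
    epistemic = (means ** 2).mean() - means.mean() ** 2   # variance of predicted means
    return means.mean(), aleatoric + epistemic

means = np.array([0.10, 0.12, 0.08, 0.11])
variances = np.array([0.50, 0.55, 0.45, 0.52])
pred, var = predictive_uncertainty(means, variances)
print(f"prediction={pred:.3f}, predictive variance={var:.4f}")
```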

2.2.3 Variational Inference

As touched upon, the true posterior $p(\omega|X, Y)$ can in general not be evaluated analytically. As such, performing exact Bayesian inference is in general intractable and one needs to resort to some approximation scheme. One such family of approximation techniques is called variational inference [11].

Variational inference aims to determine an approximate variational distribution $q_\phi(\omega)$, governed by the variational parameters $\phi$, as close as possible to the true posterior $p(\omega|X, Y)$ [11, 16]. Various divergence measures exist, yet the most commonly used for variational inference is the Kullback-Leibler (KL) divergence [16, 32]. The KL divergence between two distributions $q(x)$ and $p(x)$ is defined as

$$D_{\mathrm{KL}}(q(x)\,\|\,p(x)) = \int q(x) \log \frac{q(x)}{p(x)}\, dx \qquad (2.18)$$

and measures the difference between the unknown distribution $p(x)$ and the approximate distribution $q(x)$ [11, 16]. Note that Equation 2.18 is defined only when $q(x)$ is absolutely continuous with respect to $p(x)$ [9], i.e., $q(x) = 0$ whenever $p(x) = 0$, for all $x$. The KL divergence satisfies $D_{\mathrm{KL}}(q(x)\|p(x)) \geq 0$ with equality if, and only if, $p(x) = q(x)$ [11]. Moreover, we note that it is not a symmetric quantity, i.e., $D_{\mathrm{KL}}(p(x)\|q(x)) \neq D_{\mathrm{KL}}(q(x)\|p(x))$ in general.

The KL divergence between the variational posterior and the true posterior is then

$$D_{\mathrm{KL}}(q_\phi(\omega)\,\|\,p(\omega|X, Y)) = \int q_\phi(\omega) \log \frac{q_\phi(\omega)}{p(\omega|X, Y)}\, d\omega. \qquad (2.19)$$

Following Jordan et al.'s [18] approach, the log-likelihood $\log p(Y|X) = \log p(\mathcal{D})$ can be lower bounded using Jensen's inequality for an arbitrary distribution $q_\phi(\omega)$ [11]:

$$\log p(\mathcal{D}) = \log \int p(\mathcal{D}, \omega)\, d\omega \qquad (2.20)$$
$$= \log \int p(\mathcal{D}, \omega) \frac{q_\phi(\omega)}{q_\phi(\omega)}\, d\omega \qquad (2.21)$$
$$\geq \int q_\phi(\omega) \log p(\mathcal{D}, \omega)\, d\omega - \int q_\phi(\omega) \log q_\phi(\omega)\, d\omega \qquad (2.22)$$
$$= \mathbb{E}_{q_\phi(\omega)}[\log p(\mathcal{D}, \omega)] - \mathbb{E}_{q_\phi(\omega)}[\log q_\phi(\omega)] \qquad (2.23)$$
$$= \mathbb{E}_{q_\phi(\omega)}[\log p(\mathcal{D}|\omega)] + \mathbb{E}_{q_\phi(\omega)}[\log p(\omega)] - \mathbb{E}_{q_\phi(\omega)}[\log q_\phi(\omega)] \qquad (2.24)$$
$$= \mathbb{E}_{q_\phi(\omega)}[\log p(\mathcal{D}|\omega)] - D_{\mathrm{KL}}(q_\phi(\omega)\,\|\,p(\omega)) \qquad (2.25)$$
$$= \mathcal{L}(\phi). \qquad (2.26)$$

The $\mathcal{L}(\phi)$ term is commonly referred to as the evidence lower bound (ELBO) [3, 16]. It can be verified that the difference between $\log p(\mathcal{D})$ and the ELBO is the KL divergence (Equation 2.19), i.e.,

$$\log p(\mathcal{D}) = \mathcal{L}(\phi) + D_{\mathrm{KL}}(q_\phi(\omega)\,\|\,p(\omega|\mathcal{D})). \qquad (2.27)$$

Thus, minimizing the KL divergence between the true posterior and the variational posterior (Equation 2.19) is equivalent to maximizing the ELBO [11, 16]. Note that the KL divergence in the ELBO (Equation 2.25) acts as a regularizer [3] that encourages $q_\phi(\omega)$ to be close to $p(\omega)$ [9]. More specifically, it encourages the variational posterior to match the modes [11] of the prior. Note that without this regularization, the model in question simply performs MLE.

The idea, and dilemma, of variational inference is to specify a variational distribution that belongs to a family of distributions similar to the true posterior, yet simple enough to be evaluated efficiently [11]. A simple and probably the most commonly used variational distribution is the fully factorized Gaussian, i.e., $q_\phi(\omega) = \prod_i q_{\phi_i}(\omega_i)$, where each factor $i$ is a Gaussian. This is commonly referred to as the mean field approximation [18, 19]. The ideal would be to recover the true posterior distribution, i.e., $q_\phi(\omega) = p(\omega|\mathcal{D})$, such that $D_{\mathrm{KL}}(q_\phi(\omega)\|p(\omega|\mathcal{D})) = 0$ [11, 20]. However, this is not possible under the assumption of a diagonal-covariance Gaussian [20]. The main limitation of the mean field approximation is that it severely limits the flexibility and expressiveness of the posterior approximation [11, 16, 20, 21]. The true posterior distribution will be more expressive than such a constrained approximation [20] and, consequently, the mean field assumption leads to biased maximum likelihood estimates of $\omega$ [33].
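For a fully factorized Gaussian posterior and a standard Gaussian prior, the KL term of the ELBO is available in closed form, which is part of what makes the mean field approximation so cheap. A sketch of this standard identity (the function name is ours):

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu: np.ndarray, sigma: np.ndarray) -> float:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over all factors.

    Closed form: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * float(np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2)))

# A posterior that matches the prior exactly has zero divergence.
print(kl_diag_gaussian_to_std_normal(np.zeros(3), np.ones(3)))   # 0.0
print(kl_diag_gaussian_to_std_normal(np.ones(3), 0.5 * np.ones(3)))
```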

2.2.4 Normalizing Flows

The notation in this section follows the notation used in [34].

Normalizing flows offer a general recipe for constructing flexible and arbitrarily complex probability distributions over continuous random variables [20].

The idea of flow-based models is to transform a $D$-dimensional real-valued vector $u$, sampled from a base distribution $p_u(u)$, into a $D$-dimensional vector $x$ through a transformation $T$ [20, 34, 35]:

$$x = T(u), \qquad u \sim p_u(u). \qquad (2.28)$$

The transformed density $p_x(x)$ is then given through the change of variables formula for probability densities [20, 34, 35]:

$$p_x(x) = p_u(u) \left| \det \frac{\partial T(u)}{\partial u} \right|^{-1} \qquad (2.29)$$
$$= p_u(u)\, |\det J_T(u)|^{-1} \qquad (2.30)$$
$$= p_u(T^{-1}(x))\, |\det J_{T^{-1}}(x)|, \qquad (2.31)$$

where $J$ is the Jacobian, i.e., $J_T(u)$ is the $D \times D$ matrix of all partial derivatives of $T$, defined as

$$J_T(u) = \begin{bmatrix} \dfrac{\partial T_1}{\partial u_1} & \cdots & \dfrac{\partial T_1}{\partial u_D} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial T_D}{\partial u_1} & \cdots & \dfrac{\partial T_D}{\partial u_D} \end{bmatrix}. \qquad (2.32)$$

Thus, we note that the key assumptions of flow-based models are that $T$ is invertible and that both $T$ and its inverse $T^{-1}$ are differentiable.

Another important property is that invertible and differentiable transformations are composable [34], such that

$$(T_2 \circ T_1)^{-1} = T_1^{-1} \circ T_2^{-1}, \qquad (2.33)$$
$$\det J_{T_2 \circ T_1}(u) = \det J_{T_2}(T_1(u)) \cdot \det J_{T_1}(u). \qquad (2.34)$$

In other words, given invertible and differentiable transformations, the composition is also invertible and differentiable. In effect, multiple simple transformations can be stacked together without sacrificing the assumptions of invertibility and differentiability while still being able to evaluate the density $p_x(x)$ [34].

A sequence of $K$ invertible and differentiable mappings constructs the flow

$$z_K = T_K \circ \cdots \circ T_2 \circ T_1(z_0), \qquad (2.35)$$

where $z_0 = u$, $z_K = x$ and $T_k: \mathbb{R}^D \mapsto \mathbb{R}^D$ [20, 34, 36]. This renders a sequence of variables that have been sampled from increasingly complex distributions [20].

The corresponding log density [20, 33, 34] is given by

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial T_k(z_{k-1})}{\partial z_{k-1}} \right| \qquad (2.36)$$
$$= \log p_0(z_0) + \sum_{k=1}^{K} \log \left| \det \frac{\partial T_k^{-1}(z_k)}{\partial z_k} \right|. \qquad (2.37)$$

Analogously to Equation 2.35, the inverse mapping [36] is given by

$$z_0 = T_1^{-1} \circ T_2^{-1} \circ \cdots \circ T_K^{-1}(z_K). \qquad (2.38)$$

Thus, by repeatedly applying the rule of change of variables for probability densities through a series of invertible, differentiable and composable transformations, normalizing flows make it possible to transform a simple probability density into an arbitrarily complex one [20]. Note that the complexity of the final distribution is determined by the complexity of the individual mappings as well as the length of the flow.
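The bookkeeping behind Equations 2.35-2.37 is mechanical: push the sample forward through each transformation and accumulate the log absolute Jacobian determinants. A minimal sketch using elementwise affine transformations, chosen here only because their Jacobians are trivial:

```python
import numpy as np

class AffineFlow:
    """Elementwise z -> a * z + b; invertible whenever all a_i != 0."""
    def __init__(self, a: np.ndarray, b: np.ndarray):
        self.a, self.b = a, b

    def forward(self, z: np.ndarray):
        # log|det J| of an elementwise map is the sum of log|a_i|.
        return self.a * z + self.b, float(np.sum(np.log(np.abs(self.a))))

def run_flow(z0: np.ndarray, flows):
    """Equation 2.35 with the log-density bookkeeping of Equation 2.36."""
    z, log_det_sum = z0, 0.0
    for flow in flows:
        z, log_det = flow.forward(z)
        log_det_sum += log_det
    return z, log_det_sum  # log p_K(z_K) = log p_0(z_0) - log_det_sum

z0 = np.array([0.3, -1.2])
flows = [AffineFlow(np.array([2.0, 0.5]), np.array([0.1, 0.0])),
         AffineFlow(np.array([1.5, 3.0]), np.array([-0.2, 0.4]))]
zK, log_det = run_flow(z0, flows)
print(zK, log_det)
```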

Flow-based models offer two functionalities [33, 34]. First, they enable sampling from the model, i.e., sampling from the base distribution and performing forward evaluation using Equation 2.35 [33, 34, 36]. Second, they enable evaluation of the model's density using Equation 2.37 [34, 36]. For sampling, the base distribution needs to be known and easy to sample from. For density evaluation, the base distribution needs to be possible to evaluate, and the inverse mapping (Equation 2.38) as well as its Jacobian determinant need to be efficiently computable [20, 34]. An efficiently computed Jacobian determinant means that it should be computable in linear time with respect to the dimension of the input [20, 34].

The application and specific setting dictate which operations need to be performed as well as which operations are designed to be efficient [34]. In practice, the base distribution is commonly chosen to be a fully factorized Gaussian, and either $T$ or $T^{-1}$, whichever is needed, is implemented with functions that include neural networks [34].

2.2.5 Variational Inference with Normalizing Flows

Normalizing flows are used in variational inference by transforming samples from a simple distribution into samples from a complex distribution [20]. Keeping our notation consistent, $\omega_0$ is sampled from the base posterior distribution $q_0(\omega_0)$, such as a diagonal Gaussian, and then transformed with $K$ flows as in Equation 2.35.

Given a variational posterior $q_\phi(\omega) = q_K(\omega)$ parameterized by a normalizing flow of length $K$, the variational objective (the ELBO) can then be rewritten as

$$\mathcal{L}_{NF}(\phi) = \mathbb{E}_{q_\phi(\omega)}[\log p(\mathcal{D}|\omega)] + \mathbb{E}_{q_\phi(\omega)}[\log p(\omega)] - \mathbb{E}_{q_\phi(\omega)}[\log q_\phi(\omega)] \qquad (2.39)$$
$$= \mathbb{E}_{q_0(\omega_0)}[\log p(\mathcal{D}|\omega)] + \mathbb{E}_{q_0(\omega_0)}[\log p(\omega)] - \mathbb{E}_{q_0(\omega_0)}[\log q_0(\omega_0)] \qquad (2.40)$$
$$\quad + \mathbb{E}_{q_0(\omega_0)}\left[\sum_{k=1}^{K} \log \left| \det \frac{\partial T_k(\omega_{k-1})}{\partial \omega_{k-1}} \right| \right]. \qquad (2.41)$$

For variational inference with normalizing flows, the sampling, the forward mapping (Equation 2.35) and the log-determinant evaluation are the only relevant mechanisms [36]. For density estimation with normalizing flows, the log-likelihood of the data is maximized with the goal of learning mappings from a complex to a simple distribution [33]. As such, both directions of the mappings $T$ and $T^{-1}$ need not be computed, and thus need not even be tractable, in general [33]. As a result, methods developed for density estimation are typically suboptimal for variational inference [33].


Figure 2.1 displays the effect of a normalizing flow on a unit Gaussian and a uniform distribution, and illustrates the ability of normalizing flows to produce multimodal posterior approximations.

Figure 2.1: Normalizing flows. The figure, taken from Rezende et al. [20] (page 5), illustrates examples of the effect of a normalizing flow on two 2-dimensional distributions, a unit Gaussian and a uniform distribution.

2.2.6 Auxiliary Random Variables for Variational Methods

Auxiliary random variables construct more flexible and expressive distributions by introducing latent variables in the posterior distribution [21, 22, 24]. Following Ranganath et al.'s [21] and Louizos et al.'s [17] work, the mean field approximation can be expanded with a mixing density $q_\phi(z)$ such that the approximate posterior becomes

$$q_\phi(\omega) = \int q_\phi(\omega|z)\, q_\phi(z)\, dz, \qquad (2.42)$$

where $q_\phi(\omega|z)$ is a fully factorized distribution.

Substituting Equation 2.42 into the ELBO (Equation 2.23), the entropy $-\mathbb{E}_{q_\phi(\omega)}[\log q_\phi(\omega)]$ becomes intractable, as the integral it contains is analytically intractable in general [21]. From Bayes rule

$$q(z|\omega)\, q(\omega) = q(\omega|z)\, q(z), \qquad (2.43)$$

it can be seen that we can obtain $q(\omega)$ by computing the posterior distribution of the latent variables given the model parameters, $q(z|\omega)$. However, computing this posterior is in general as difficult as the integral in the entropy [21]. We thus introduce an auxiliary distribution $r_\theta(z|\omega)$, parameterized by $\theta$, to approximate $q(z|\omega)$. Following Ranganath et al.'s [21] approach, the entropy can then be bounded as

$$-\mathbb{E}_{q_\phi(\omega)}[\log q_\phi(\omega)] = -\mathbb{E}_{q_\phi(\omega)}[\log q_\phi(\omega) + D_{\mathrm{KL}}(q(z|\omega)\,\|\,q(z|\omega))] \qquad (2.44)$$
$$\geq -\mathbb{E}_{q_\phi(\omega)}[\log q_\phi(\omega) + D_{\mathrm{KL}}(q(z|\omega)\,\|\,r_\theta(z|\omega))] \qquad (2.45)$$
$$= -\mathbb{E}_{q_\phi(\omega)}[\mathbb{E}_{q(z|\omega)}[\log q_\phi(\omega) + \log q(z|\omega) - \log r_\theta(z|\omega)]] \qquad (2.46)$$
$$= -\mathbb{E}_{q_\phi(\omega, z)}[\log q_\phi(\omega) + \log q(z|\omega) - \log r_\theta(z|\omega)] \qquad (2.47)$$
$$= -\mathbb{E}_{q_\phi(\omega, z)}[\log q_\phi(\omega|z) + \log q_\phi(z) - \log r_\theta(z|\omega)], \qquad (2.48)$$

where the last step is a result of Bayes rule (Equation 2.43). Note that the bound is exact when $r_\theta(z|\omega)$ equals the posterior $q(z|\omega)$.

Substituting the bounded entropy (Equation 2.48) into the ELBO (Equation 2.23) gives the tractable lower bound

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\omega)}[\log p(\mathcal{D}|\omega) + \log p(\omega) - \log q_\phi(\omega)] \qquad (2.49)$$
$$\geq \mathbb{E}_{q_\phi(\omega, z)}[\log p(\mathcal{D}|\omega) + \log p(\omega) \qquad (2.50)$$
$$\quad - \log q_\phi(\omega|z) - \log q_\phi(z) + \log r_\theta(z|\omega)] = \mathcal{L}(\phi, \theta). \qquad (2.51)$$

2.3 Models

This section describes the models that are used in the experiments, i.e., MF-VBNN, MNF-VBNN and deep ensembles.

2.3.1 Variational Bayesian Neural Networks

Standard neural networks can be seen as deterministic functions. They learn point estimates of the model parameters that result in point estimates of the predictions. Bayesian neural networks instead learn distributions over the weights. A posterior distribution over the weights can be inferred that captures the parameter uncertainty of the network. However, obtaining the posterior through exact Bayesian inference is typically intractable due to the non-linear nature of neural networks. In this section we apply the concepts for approximate Bayesian inference relying on variational inference presented in Section 2.2.

Bayes by Backprop

In this subsection we present an algorithm, introduced in [3] and called Bayes by Backprop, that uses backpropagation for estimating probability distributions over the weights of a neural network.

Let $W_{1:L}$ denote the weight matrices of a neural network with $L$ fully connected layers. The ELBO is then given by

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(W_{1:L})}[\log p(Y|X, W_{1:L}) + \log p(W_{1:L}) - \log q_\phi(W_{1:L})] \qquad (2.52)$$

and the resulting cost (objective) function to minimize is given by

$$\mathcal{F}(\mathcal{D}, \phi) = D_{\mathrm{KL}}(q_\phi(W_{1:L})\,\|\,p(W_{1:L})) - \mathbb{E}_{q_\phi(W_{1:L})}[\log p(Y|X, W_{1:L})]. \qquad (2.53)$$

The KL divergence in Equation 2.53 acts as a regularizer that encourages the variational posterior $q_\phi(W_{1:L})$ to match the modes of the simpler prior $p(W_{1:L})$. As such, the KL term will penalize the model for fitting a too complex posterior distribution over the weights.

Under certain conditions [37], the derivative of an expectation can be expressed as the expectation of a derivative. Using this fact, Blundell et al. [3] generalized the idea of the re-parameterization trick [15, 38, 39]. The re-parameterization trick refers to the notion that sampling from a distribution $q_\phi(\cdot)$ can be parameterized in terms of a parameter-free noise variable $\epsilon$ and a deterministic function $h(\phi, \epsilon)$ [9, 15, 20]. Instead of using it for enabling backpropagation in latent variable models [15], Blundell et al. [3] applied it in the context of learning weights in neural networks.

Unbiased estimates of the gradients of the expectations in the objective function in Equation 2.53 can then be obtained by evaluating the expectations through Monte Carlo integration [3, 20]:

$$\mathcal{F}(\mathcal{D}, \phi) \approx \frac{1}{n}\sum_{i=1}^{n} \log q_\phi(W_{1:L}^{(i)}) - \log p(W_{1:L}^{(i)}) - \log p(Y|X, W_{1:L}^{(i)}), \qquad (2.54)$$

where $W_{1:L}^{(i)}$ denotes the $i$th Monte Carlo sample drawn from the variational posterior $q_\phi(W_{1:L})$.


Note that this method does not require any closed forms in the objective (Equation 2.53) [3]. As such, Bayes by Backprop (MF-VBNN) introduces uncertainty in the weights of the network while enabling different combinations of prior and variational posterior families [3]. For a diagonal Gaussian variational posterior, the re-parameterization becomes

$$W_{1:L} = h(\phi, \epsilon) = \mu + \log(1 + \exp(\rho)) \odot \epsilon, \qquad (2.55)$$

where $\phi = (\mu, \rho)$ are the variational parameters to be learned, $\odot$ denotes element-wise multiplication and $\epsilon$ are samples from a standard Gaussian [3]. The softplus parameterization of the standard deviation, $\sigma = \log(1 + \exp(\rho))$, ensures that it is non-negative. By sampling from a unit Gaussian, scaling the sample with the standard deviation and then shifting it with $\mu$, a sample of the weights $W_{1:L}$ is obtained from a diagonal Gaussian variational posterior [3, 20].

Following Blundell et al.'s [3] work, each update step in the optimization for a diagonal Gaussian variational posterior, with step size $\alpha$ and $n$ Monte Carlo samples from the variational posterior, proceeds as follows:

1. Sample $\epsilon^{(i)} \sim \mathcal{N}(0, I)$ for $i = 1, \ldots, n$.

2. Let $W_{1:L}^{(i)} = \mu + \log(1 + \exp(\rho)) \odot \epsilon^{(i)}$.

3. Let $\phi = (\mu, \rho)$.

4. Let $f(W_{1:L}^{(i)}, \phi) = \log q_\phi(W_{1:L}^{(i)}) - \log p(W_{1:L}^{(i)}) - \log p(Y|X, W_{1:L}^{(i)})$.

5. Compute the total gradients w.r.t. $\mu$ as

$$\Delta_\mu^{(i)} = \frac{\partial f(W_{1:L}^{(i)}, \phi)}{\partial W_{1:L}^{(i)}} + \frac{\partial f(W_{1:L}^{(i)}, \phi)}{\partial \mu}. \qquad (2.56)$$

6. Compute the total gradients w.r.t. $\rho$ as

$$\Delta_\rho^{(i)} = \frac{\partial f(W_{1:L}^{(i)}, \phi)}{\partial W_{1:L}^{(i)}} \frac{\epsilon^{(i)}}{1 + \exp(-\rho)} + \frac{\partial f(W_{1:L}^{(i)}, \phi)}{\partial \rho}. \qquad (2.57)$$

7. Update the parameters with the expectation of the gradients as

$$\mu \leftarrow \mu - \alpha \frac{1}{n}\sum_{i=1}^{n} \Delta_\mu^{(i)}, \qquad (2.58)$$
$$\rho \leftarrow \rho - \alpha \frac{1}{n}\sum_{i=1}^{n} \Delta_\rho^{(i)}. \qquad (2.59)$$
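Steps 1-7 map directly onto automatic differentiation: sampling the weights via Equation 2.55 inside the forward pass makes backpropagation produce exactly the gradients of Equations 2.56-2.57. Below is a minimal PyTorch sketch of one mean field layer; the layer sizes and initialization are illustrative, and for brevity the KL term is taken in closed form against a standard Gaussian prior rather than estimated by Monte Carlo against the scale mixture prior described next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesByBackpropLinear(nn.Module):
    """Fully connected layer with a diagonal Gaussian variational posterior."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in).normal_(0.0, 0.1))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sigma = F.softplus(self.rho)          # sigma = log(1 + exp(rho))
        eps = torch.randn_like(sigma)         # step 1: parameter-free noise
        w = self.mu + sigma * eps             # step 2: Equation 2.55
        # Closed-form KL against N(0, I), stored for the cost (Equation 2.53).
        self.kl = 0.5 * torch.sum(sigma**2 + self.mu**2 - 1.0 - 2.0 * torch.log(sigma))
        return F.linear(x, w)

layer = BayesByBackpropLinear(4, 2)
out = layer(torch.randn(8, 4))
loss = out.pow(2).mean() + layer.kl / 1000    # toy data term + scaled KL term
loss.backward()                               # gradients w.r.t. mu and rho (steps 5-7)
```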


Following Blundell et al.'s [3] work, the prior can be parameterized with a scale mixture prior [3] of two zero-mean Gaussian densities

$$p(W) = \pi\, \mathcal{N}(W|0, \sigma_1^2) + (1 - \pi)\, \mathcal{N}(W|0, \sigma_2^2),$$

where $0 < \pi < 1$, and where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the first and the second mixture component, respectively. Moreover, $\sigma_1 > \sigma_2$ and the second mixture component is given a small standard deviation, i.e., $\sigma_2 \ll 1$ [3]. As such, most of the probability mass is placed close to zero. The idea is to encourage small absolute values of the elements of the weights while still allowing for some bigger variations. Figure 2.2 illustrates a scale mixture prior of two zero-mean Gaussian densities using $\sigma_1 = 1$ and $\sigma_2 = 0.1$.

Figure 2.2: Scale mixture prior. Three different views of the same scale mixture prior of two zero-mean Gaussian densities using $\pi = 0.5$, $\sigma_1 = 1$ and $\sigma_2 = 0.1$.

Multiplicative Normalizing Flows for Variational Bayesian Neural Networks

This section presents the main theory introduced by Louizos et al. [17] used for this work's implementation of MNF-VBNN using fully connected layers. Let $p(W)$ and $q_\phi(W)$ denote the prior and approximate posterior over the weights of an arbitrary layer, respectively, where $\phi$ refers to the parameters that govern the approximate posterior.

Louizos et al. [17] proposed to use the ideas of normalizing flows and auxiliary random variables to learn the parameters of a neural network. The concept of auxiliary random variables can be used to define the approximate posterior for an arbitrary layer with

$$W \sim q_\phi(W|z), \qquad z \sim q_\phi(z), \qquad (2.60)$$

such that $q_\phi(W) = \int q_\phi(W|z)\, q_\phi(z)\, dz$. To allow for the re-parameterization (Equation 2.55), $q_\phi(W|z)$ is defined as a fully factorized Gaussian:

$$q_\phi(W|z) = \prod_{i=1}^{D_{in}} \prod_{j=1}^{D_{out}} \mathcal{N}(z_i \mu_{ij},\, \sigma_{ij}^2), \qquad (2.61)$$

where $D_{in}$ and $D_{out}$ are the input and output dimensions of the corresponding layer, respectively. Note that $z$ is a $D_{in}$-dimensional vector and that $W$ is a $D_{out} \times D_{in}$-dimensional matrix. The mixing density $q_\phi(z)$ should be chosen to be a known and simple distribution such that samples can be drawn easily [17], e.g., a fully factorized Gaussian.

Normalizing flows can now be applied to samples $z_0$ from $q_\phi(z)$, transforming them into the final iterate $z_{T_K}$. The vector $z$ is of much lower dimensionality than $W$. As such, the problem of increasing the flexibility and expressiveness of the approximation of the posterior $q_\phi(W_{1:L})$ is reduced to increasing the flexibility and expressiveness of the mixing density $q_\phi(z_{1:L})$. In other words, non-linear relationships between the elements of the weights for a particular layer are enabled through the new augmented posterior [17]. The idea is that this possibly multimodal posterior should better capture the true posterior and, thus, provide better predictive performance as well as predictive uncertainty estimates [17]. Louizos et al. [17] use the term multiplicative normalizing flows for this procedure of approximating a variational posterior distribution.

Algorithm 1 describes the forward propagation that can be used together with the re-parameterization (Equation 2.55).

Algorithm 1 [17]: Forward pass for a particular layer for MNF-VBNN using fully connected layers. Let $M_w, \Sigma_w \in \mathbb{R}^{D_{in} \times D_{out}}$ denote the means and variances of the weights of the particular layer, respectively. Moreover, let $\mathrm{NF}(\cdot)$ denote a normalizing flow and let $H \in \mathbb{R}^{N_b \times D_{in}}$ denote the pre-activations of the layer, where $N_b$ denotes the size of a batch of inputs. The variable $Z_0 \in \mathbb{R}^{N_b \times D_{in}}$ consists of $N_b$ sampled $D_{in}$-dimensional vectors. For the first layer, $H$ denotes the batch of inputs from $X$.

Require: $M_w$, $\Sigma_w$, $H$
1: $Z_0 \sim q(z_0)$
2: $Z_{T_K} = \mathrm{NF}(Z_0)$
3: $M_h = (H \odot Z_{T_K})\, M_w$
4: $V_h = H^2\, \Sigma_w$
5: $E \sim \mathcal{N}(0, I)$
6: return $M_h + \sqrt{V_h} \odot E$
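Algorithm 1 admits a direct vectorized implementation in which only per-unit pre-activation means and variances are sampled, never a full weight matrix. A sketch under simplifying assumptions: the identity function stands in for $\mathrm{NF}(\cdot)$ and a standard Gaussian stands in for the learned $q(z_0)$.

```python
import torch

def mnf_layer_forward(h: torch.Tensor,
                      m_w: torch.Tensor,
                      sigma2_w: torch.Tensor,
                      flow=lambda z: z) -> torch.Tensor:
    """Algorithm 1: forward pass of one fully connected MNF layer.

    h:        (N_b, D_in) pre-activations (or inputs for the first layer)
    m_w:      (D_in, D_out) weight means M_w
    sigma2_w: (D_in, D_out) weight variances Sigma_w
    flow:     normalizing flow NF(.); identity here as a stand-in
    """
    z0 = torch.randn_like(h)            # line 1: standard normal stand-in for q(z_0)
    z_tk = flow(z0)                     # line 2: Z_TK = NF(Z_0)
    m_h = (h * z_tk) @ m_w              # line 3: M_h = (H o Z_TK) M_w
    v_h = (h ** 2) @ sigma2_w           # line 4: V_h = H^2 Sigma_w
    e = torch.randn_like(m_h)           # line 5: E ~ N(0, I)
    return m_h + v_h.sqrt() * e         # line 6

out = mnf_layer_forward(torch.randn(8, 4), torch.zeros(4, 6), 0.1 * torch.ones(4, 6))
print(out.shape)  # torch.Size([8, 6])
```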


Following the theory outlined in Section 2.2.6, the ELBO is given by

$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(W_{1:L}, z_{1:L})}[\log p(Y|X, W_{1:L}, z_{1:L}) + \log p(W_{1:L}) \qquad (2.62)$$
$$\quad - \log q_\phi(W_{1:L}|z_{1:L}) - \log q_\phi(z_{1:L}) + \log r_\theta(z_{1:L}|W_{1:L})], \qquad (2.63)$$

where $r_\theta(\cdot)$ is the auxiliary distribution parameterized by $\theta$. Both Ranganath et al. [21] and Louizos et al. [17] parameterize the auxiliary distribution $r_\theta(z|W)$ with inverse mappings (Equation 2.38). For an arbitrary layer, Louizos et al. [17] define

$$r_\theta(z_{T_a}|W) = \prod_{i=1}^{D_{in}} \mathcal{N}(\tilde{\mu}_i, \tilde{\sigma}_i^2), \qquad (2.64)$$
$$\tilde{\mu} = (b_1 \otimes \tanh(c^T W))(\mathbf{1}\, D_{out}^{-1}), \qquad (2.65)$$
$$\tilde{\sigma} = \varsigma\big((b_2 \otimes \tanh(c^T W))(\mathbf{1}\, D_{out}^{-1})\big), \qquad (2.66)$$

where $\varsigma(\cdot)$ is the sigmoid function, $\otimes$ is the outer product, $\mathbf{1}$ is a vector of ones and $b_1$, $b_2$ as well as $c$ are learnable vectors, all with the same dimensionality as $z$. The $D_{in}$-dimensional vector $z_{T_a}$ corresponds to the fully factorized variable obtained from the inverse mapping $z_{T_a} = \mathrm{NF}^{-1}(z_{T_K})$.

With a standard Gaussian as prior $p(W)$ and $q_\phi(W|z)$ being parameterized as in Equation 2.61, we get

$$-D_{\mathrm{KL}}(q_\phi(W)\,\|\,p(W)) = \mathbb{E}_{q(W, z_{T_K})}[-D_{\mathrm{KL}}(q_\phi(W|z_{T_K})\,\|\,p(W)) \qquad (2.67)$$
$$\quad + \log r_\theta(z_{T_K}|W) - \log q_\phi(z_{T_K})], \qquad (2.68)$$

for an arbitrary layer, where

$$D_{\mathrm{KL}}(q_\phi(W|z_{T_K})\,\|\,p(W)) = \frac{1}{2}\sum_{i,j}\left(-\log \sigma_{ij}^2 + \sigma_{ij}^2 + z_{T_K, i}^2\, \mu_{ij}^2 - 1\right), \qquad (2.69)$$

$$\log r_\theta(z_{T_K}|W) = \log r_\theta(z_{T_a}|W) + \sum_{t=T_K}^{T_K + T_a} \log \left| \frac{\partial z_{t+1}}{\partial z_t} \right|, \qquad (2.70)$$

$$\log q_\phi(z_{T_K}) = \log q_\phi(z_0) - \sum_{t=1}^{T_K} \log \left| \frac{\partial z_{t+1}}{\partial z_t} \right|. \qquad (2.71)$$


Following the work of Louizos et al. [17], we use the flow

$$m \sim \mathrm{Bernoulli}(0.5), \qquad (2.72)$$
$$h = \tanh(f(m \odot z_t)), \qquad (2.73)$$
$$\mu = g(h), \qquad (2.74)$$
$$\sigma = \varsigma(k(h)), \qquad (2.75)$$
$$z_{t+1} = m \odot z_t + (1 - m) \odot (\sigma \odot z_t + (1 - \sigma) \odot \mu), \qquad (2.76)$$
$$\log \left| \frac{\partial z_{t+1}}{\partial z_t} \right| = (1 - m)^T \log \sigma, \qquad (2.77)$$

where $f(\cdot)$, $g(\cdot)$ and $k(\cdot)$ are fully connected linear mappings and $m$ is a sampled binary mask. During training, the binary mask is resampled every time such that $z_t$ is split over different dimensions. The transformation $h$ in Equation 2.73 is a neural network with a tanh non-linearity and no hidden layer, which can be extended to involve an arbitrary number of hidden layers.

The flow described in Equations 2.72-2.77 corresponds to the masked Real NVP [40], with the difference of using the parameterization $(\sigma \odot z_t + (1 - \sigma) \odot \mu)$. This specific parameterization aims to obtain more numerically stable updates [17, 35] and was introduced in the Inverse Autoregressive Flow (IAF) [35], inspired by the update equations of LSTMs [35].

The aforementioned flow is appealing as both the data generation and the log-likelihood evaluation can be performed in a single forward pass during training [34, 40, 41]. As only a subset of the variables is updated in each transformation, a long sequence of transformations might be needed in practice [33]. However, Louizos et al. [17] used flows of length two with good results.

The weights $W$ and the latent variables $z$ are re-parameterized using Equation 2.55.
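One step of the masked flow in Equations 2.72-2.77 can be sketched as follows; $f$, $g$ and $k$ are single linear layers as in the text, while the dimensionality and batch size are illustrative:

```python
import torch
import torch.nn as nn

class MaskedFlowStep(nn.Module):
    """One transformation of Equations 2.72-2.77 (masked Real NVP with the
    IAF-style gate sigma*z + (1 - sigma)*mu)."""

    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Linear(dim, dim)   # feature map producing h
        self.g = nn.Linear(dim, dim)   # produces mu
        self.k = nn.Linear(dim, dim)   # produces sigma (pre-sigmoid)

    def forward(self, z: torch.Tensor):
        m = torch.bernoulli(torch.full_like(z, 0.5))         # Eq. 2.72
        h = torch.tanh(self.f(m * z))                        # Eq. 2.73
        mu, sigma = self.g(h), torch.sigmoid(self.k(h))      # Eqs. 2.74-2.75
        z_next = m * z + (1 - m) * (sigma * z + (1 - sigma) * mu)   # Eq. 2.76
        log_det = ((1 - m) * torch.log(sigma)).sum(dim=-1)   # Eq. 2.77
        return z_next, log_det

step = MaskedFlowStep(dim=4)
z_next, log_det = step(torch.randn(8, 4))
print(z_next.shape, log_det.shape)
```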

2.3.2 Deep Ensembles

As seen, variational Bayesian neural networks require approximation schemes, various assumptions and significant modifications to the training procedure of neural networks. Moreover, they typically only model epistemic uncertainty. An alternative method for obtaining predictive uncertainty estimates, deep ensembles, was proposed in [6]. This method aims to increase robustness to model misspecification and out-of-distribution examples [6].

An alternative method for obtaining predictive uncertainty estimates, deep en- sembles, was proposed in [6]. This method aims to increase robustness to model misspecification and out-of-distribution examples [6].

Deep ensembles do not require significant modifications to the training procedure of their counterparts. They are parallelizable, relatively simple to implement and require little hyperparameter tuning. Furthermore, deep ensembles account for both aleatoric and epistemic uncertainties. A drawback of the method is that the number of parameters grows linearly with the number of networks in the ensemble.

The recipe of deep ensembles for regression contains three main ingredients [6]: (1) a trained ensemble of deterministic neural networks, (2) the negative log-likelihood as objective function and (3) adversarial training [42, 43].

Deep ensembles consist of several randomly initialized deterministic neural networks. As such, the method captures model uncertainty by averaging the predictions over the models in the ensemble [6]. It has been demonstrated that an ensemble of randomly initialized neural networks gives rise to a more diverse set of predictions than subspace sampling methods such as Bayesian neural networks and Monte Carlo dropout [26].

For regression problems, neural networks typically output a single-valued prediction for a corresponding input and use the MSE as the training criterion to be minimized [6]. As mentioned in Section 2.2.2, this is the same as assuming a homoscedastic Gaussian distribution for the targets. The separate networks that make up the deep ensemble all output a variance estimate in addition to the point prediction, meaning that the training criterion is the negative log-likelihood. In other words, deep ensembles learn the aleatoric uncertainty by treating the target as a sample from a heteroscedastic distribution. As this technique takes into account the inherent noise in the data, the idea is that the models should be more robust on datasets where the noise level varies across the observation space [6, 10]. Indeed, networks that employ this method have been shown to become more robust against noisy data [10].

Adversarial examples are created through data perturbation, generating new examples 'close' to the original samples in order to fool a model [6, 42, 43]. For instance, an image can be altered in such a way that it is visually indistinguishable from the original image for a human, yet different enough to make a neural network change an already correct classification [6, 43]. Figure 2.3 displays a demonstration of this. Training with adversarial examples (adversarial training) aims to improve the robustness and smoothness of predictive estimates [6].

One method for creating adversarial examples is called the fast gradient sign method [42]. Let $x$ be the input to a model with corresponding target $y$, let $\mathcal{L}(\omega, x, y)$ be the training criterion and let $\epsilon$ be a small value which bounds the max-norm of the perturbation [6, 42]. The fast gradient sign method then generates an adversarial example

$$x' = x + \epsilon\, \mathrm{sign}(\nabla_x \mathcal{L}(\omega, x, y)) \qquad (2.78)$$

to be used for training, where the required gradient can be computed through backpropagation [42]. Intuitively, the method adds a disturbance to the input along the direction that is likely to increase the loss of the network [6].

With the fast gradient sign method, the loss to be minimized becomes

$$\mathcal{L}(\omega, x, x', y) = \frac{1}{2}\mathcal{L}(\omega, x, y) + \frac{1}{2}\mathcal{L}(\omega, x', y). \qquad (2.79)$$

Figure 2.3: Fast gradient sign method. The figure, taken from Goodfellow et al. [42] (page 3), demonstrates the fast gradient sign method on an image of a panda. The altered image is visually indistinguishable from the original image for a human, yet different enough to fool the classifier into changing an already correct classification. Note that the term confidence in this image refers to softmax probabilities.

2.3.3 Complexity by Model

This section elaborates on the number of parameters employed by the different models, going from less complex to more complex models in terms of the number of parameters. For simplicity, bias terms are omitted and the input and output dimensions of the network are assumed to be equal. Let $D_{in}$ and $D_{out}$ denote the input and output dimensions of a particular layer, respectively.

Deep Ensembles

The number of parameters for each layer used is given by

$$D_{in} \times D_{out} \times n_{\mathrm{networks}}. \qquad (2.80)$$
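For example, a fully connected layer with $D_{in} = D_{out} = 64$ in an ensemble of $n_{\mathrm{networks}} = 5$ networks contributes $64 \times 64 \times 5 = 20{,}480$ weight parameters (bias terms omitted, as stated above); these sizes are chosen purely for illustration.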
