
STOCKHOLM, SWEDEN 2016

Backtesting expected shortfall: A quantitative evaluation

JOHAN ENGVALL

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


Backtesting expected shortfall: A quantitative evaluation

JOHAN ENGVALL

Master's Thesis in Financial Mathematics (30 ECTS credits)
Master Programme in Mathematics (120 credits)
Royal Institute of Technology, year 2016
Supervisor at KTH: Boualem Djehiche
Examiner: Boualem Djehiche

TRITA-MAT-E 2016:57
ISRN-KTH/MAT/E--16/57--SE

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

How to measure risk is an important question in finance, and much work has been done on how to measure risk quantitatively. An important part of this work is evaluating the risk estimates against the realised outcomes, a procedure known as backtesting. A common risk measure is Expected shortfall, for which the appropriate way to backtest has been debated. In this thesis we compare four different proposed backtests and see how they perform in a realistic setting. The main finding is that it is possible to find backtests that perform well, but it is important to investigate them thoroughly, as small errors in the model can lead to large errors in the outcome of the backtest.


Summary

How to measure risk is an important question in the financial industry, and much has been written on how to quantify financial risk. An important part of measuring risk is verifying in retrospect that the models have given reasonable estimates of the risk; this procedure is usually called backtesting. A common risk measure is Expected shortfall, for which the appropriate backtesting procedure has been debated.

We present four different methods for doing this and examine how they perform in a realistic setting. We find that it is possible to find methods that work well, but that it is important to test them carefully, since small errors in the methods can lead to large errors in the results.


I would like to thank my supervisor Boualem Djehiche for valuable input when writing this thesis.

Stockholm, December 2016 Johan Engvall


Contents

1 Introduction
2 Background
  2.1 Risk measures
  2.2 Forecast evaluation
  2.3 Likelihood ratio tests
  2.4 Stochastic differential equations
  2.5 Backtesting risk measures
  2.6 Backtesting Value-at-Risk
  2.7 Conclusion
3 Theoretical framework
  3.1 Backtesting expected shortfall
  3.2 Backtest 1
  3.3 Backtest 2
  3.4 Backtest 3
  3.5 Backtest 4
  3.6 Conclusion
4 Data
  4.1 Returns
  4.2 VaR
  4.3 ES
  4.4 Summary
5 Results
  5.1 Introduction to the results
  5.2 Geometric Brownian Motion
  5.3 Student's t
  5.4 Heston's stochastic volatility model
  5.5 Summary
6 Discussion
  6.1 Comparison between the methods
  6.2 Further research
References
Appendices
A Figures
  A.1 Student's t
  A.2 Heston's stochastic volatility model


Abbreviations

CDF   Cumulative distribution function
ES    Expected shortfall
GBM   Geometric Brownian motion
LR    Likelihood ratio
SDE   Stochastic differential equation
VaR   Value-at-Risk

Notation

S(x, y)      Scoring function
L_t          Loss at time t
X_t          Net value at time t
VaR_{t,α}    Value-at-Risk at time t with confidence level α
ES_{t,α}     Expected shortfall at time t with confidence level α
⌊x⌋          The integer part of x
F_X(u)       Cumulative distribution function of the stochastic variable X evaluated at u
Φ(x)         Cumulative distribution function of the standard normal distribution evaluated at x
t_ν(x)       Cumulative distribution function of the Student's t-distribution with ν degrees of freedom evaluated at x
S_t          The price of an asset at time t
W, W1, W2    Brownian motions


1 Introduction

With the increased complexity of the financial markets, quantifying risk has become more important. Supervisors of the financial markets want to make sure that financial companies have adequate capital to cover the risks they are exposed to. Banks also want to keep track of their risks internally to ensure that the risks of individual departments are not too high. This is important in order to avoid situations such as the one at Barings Bank, where large risks accumulated in a small regional office and led to the collapse of a large international bank. This must of course be complemented by qualitative risk management methods.

Currently the most common measure of risk is Value-at-Risk (VaR), which is basically a one-sided confidence interval of the loss distribution. It is defined for a given confidence level and time period. For example, given a one-day 5% VaR, the probability that the loss exceeds the VaR should be 5%, so a loss exceeding this estimate is an event you expect to occur about once a month. The main reason that VaR has become popular is that it is easy to work with and its meaning is easy to understand.

There are however some drawbacks with VaR, which has led to another risk measure being proposed as a replacement: Expected Shortfall (ES). ES is more complex, but solves many of the problems associated with VaR. A move to ES as the primary risk measure has for example been proposed by the Basel Committee on Banking Supervision (cf. [4]).

An important part of estimating risk is to evaluate how accurate the estimates have been. This procedure is generally called backtesting. It is part of the larger field of forecast evaluation. The aim is to look at a forecast and the corresponding outcome and evaluate whether it was accurate or not. What constitutes an accurate estimate of course depends on the type of estimate. If you estimate a mean value you want the distance between the estimate and the outcome to be as small as possible. If you estimate a median you want half the outcomes to be larger and half the outcomes to be smaller than the estimate. For VaR this procedure is quite straightforward, but for ES it is much more complex. An estimate whose accuracy cannot be assessed even in retrospect is of course useless.

This thesis aims to investigate how to evaluate methods of backtesting ES.

This thesis' main contribution is its findings about the implementation of backtests. Many different backtests have been proposed, but the literature regarding actual implementation is scarce. In the implementation of a backtest for ES many assumptions and approximations have to be made, and this thesis shows that great care has to be taken when making these decisions.

The thesis first introduces four different backtests. These are to a large extent built on previously presented ideas for backtests, but the exact implementation is not clearly defined in all cases. In these cases we discuss different ways they could be implemented in practice. This is especially the case for the approximation method proposed by Emmer, Kratz and Tasche (cf. [7]), where no actual method for the testing was proposed, so in this case we present two different methods.

The presented methods are then tested with simulated data. They are tested both with data for which the ES is assumed to be correctly modeled and with data where the ES is assumed to be modeled incorrectly. The aim with this is to see how the methods would perform in a realistic scenario.

The outline of the thesis is the following. Chapter 2 presents some general theory necessary for risk measures and the backtesting of them. In chapter 3 the backtests are presented and discussed. Chapter 4 then presents how the data are simulated. In chapter 5 the results are presented, and in chapter 6 the main conclusions are presented together with some suggestions for further studies.


2 Background

2.1 Risk measures

General properties

To understand what VaR and ES are and why they are important we must first think about what risk is and more importantly how to quantify it.

Artzner et al. (cf. [3]) describe the following five properties that are desirable for a quantitative risk measure. A risk measure for the asset described by the random variable X, denoted ρ(X), should have the following three properties for any random variables X and Y:

Normalization

ρ(0) = 0.

This means that holding no position means you have zero risk.

Translation invariance

∀a ∈ R ρ(X + a(1 + rf)) = ρ(X) − a.

This property ensures that adding a fixed amount of cash to a position will decrease the risk with an equal amount. This is important for the interpre- tation of a risk measure as an amount of capital needed to stay solvent.

Monotonicity

If X ≤ Y then ρ(X) ≥ ρ(Y ).

This property ensures that a portfolio with a lower value will have a higher risk.


A coherent risk measure also has the following two properties.

Subadditivity

ρ(X + Y) ≤ ρ(X) + ρ(Y).

This property means that diversification can never increase the risk, only decrease it.

Positive homogeneity

ρ(aX) = aρ(X) ∀a ≥ 0.

This property means that the risk will increase proportionally to the size of the position. Doubling the position will double the risk.

That a risk measure is coherent is not necessary for it to be used quantitatively to measure risk, but coherence gives some important results, mainly that the effect of diversification cannot be negative, as it can be for a risk measure that is not coherent.

2.1.1 Value-at-Risk

Value-at-Risk (VaR) is defined as a one-sided confidence interval of the loss. The VaR is given by a confidence level and a time period. An n-day α% VaR is given by

VaR_{α,t}(X) = inf{x | P(X_t < −x) ≤ α},

where X_t is the net value of the asset after n days, defined by

X_t = S_t − S_{t−1}.

From this the loss can be defined by

L_t = −X_t.

The VaR satisfies the three properties of a risk measure but it does not satisfy the property of subadditivity and it is therefore not a coherent risk measure.

This means that diversification can affect the VaR negatively, although in practice it is clear that the risk would decrease. Also, VaR does not take into account losses larger than the VaR level. For example, the 5% VaR is not affected by losses less probable than 5%. This means that there could be large but not very probable losses which are not accounted for. From a theoretical point of view this is a large drawback of the VaR, but regardless it is widely used in practice.

If VaR is calculated for longer time periods one should account for the interest rate, but as these methods are only relevant when the VaR is calculated for short time periods, this is disregarded.

2.1.2 Expected shortfall

The Expected shortfall (ES) is defined as (cf. [1])

ES_α(X) = (1/α) ∫_0^α VaR_u(X) du,   (2.1)

which, if X has a continuous distribution, can also be written as the conditional expectation

ES_α(X) = E[L | VaR_α(X) ≤ L].   (2.2)

The main advantage of ES over VaR is that ES satisfies the property of subadditivity and is therefore a coherent risk measure.

2.1.3 Estimation of VaR and ES

Most methods for estimation of risk measures rely on using historical data.

There are two main approaches, parametric and non-parametric methods.

In a parametric method you fit a given probability distribution to the data. Common choices are the normal distribution and Student's t distribution. In a non-parametric method you do not assume anything about the distribution of the data but instead resample from the historical data. The reliance on historical data is a problem because the asset exposed to the risk is often not static, meaning that older data often become irrelevant. But often there are no real alternatives, as the distribution of the loss has to be estimated somehow.
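As a concrete illustration of the two approaches, the sketch below is my own minimal Python code (not from the thesis): a historical-simulation estimate that reads off the empirical tail of a loss sample, and a parametric estimate under a normal assumption. The function names and the standard normal test sample are assumptions made for the example.

```python
import numpy as np
from scipy import stats

def var_es_historical(losses, alpha):
    """Non-parametric (historical simulation) VaR and ES at level alpha."""
    ordered = np.sort(losses)[::-1]            # L_1 >= L_2 >= ... (largest loss first)
    k = int(np.floor(alpha * len(ordered)))    # number of tail observations
    var = ordered[k]                           # L_{floor(alpha*n)+1} in 1-based indexing
    es = ordered[:max(k, 1)].mean()            # average of the alpha-tail losses
    return var, es

def var_es_normal(losses, alpha):
    """Parametric VaR and ES assuming normally distributed losses."""
    mu, sigma = losses.mean(), losses.std(ddof=1)
    z = stats.norm.ppf(1 - alpha)
    var = mu + sigma * z
    es = mu + sigma * stats.norm.pdf(z) / alpha
    return var, es

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(0.0, 1.0, size=1000)   # hypothetical loss sample
    print(var_es_historical(sample, 0.025))
    print(var_es_normal(sample, 0.025))
```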

2.2 Forecast evaluation

An important topic which is the foundation of backtesting is forecast evaluation. Much has been written about how to evaluate forecasts. Forecasts can take different forms and are evaluated differently.

Two important examples of types of forecasts are interval forecasts and point forecasts. A point forecast predicts a point with some property, for example the mean or the median. An interval forecast is a forecast of an interval in which the outcome will lie with some probability. The interval can be both one-sided and two-sided. The median can therefore be viewed as both a point forecast and an interval forecast, and VaR can likewise be regarded as both a one-sided interval forecast and a point forecast.

For a point forecast you have a forecast x and an outcome y. You then want to assess if x is a good forecast for y. To do that you generally use a scoring function denoted S(x, y) which is defined as any mapping

S : O × A → [0, ∞),

where O is the observation domain consisting of all possible outcomes and A is the forecast domain consisting of all possible forecasts. In the more general theory x is often seen as an action and y as an outcome; the forecast domain is therefore often called the action domain, hence the use of A. When dealing with forecasts, O and A are often assumed to be an identical prediction-observation domain D, which amounts to assuming that it is possible to forecast all possible outcomes. If the prediction-observation domain is an interval of real numbers,

D = I × I,  I ⊆ R,

the following conditions must hold (cf. [9]):

• S(x, y) ≥ 0 ∀x ≠ y

• S(x, y) = 0 if x = y

• S(x, y) is continuous in x

• The partial derivative ∂_x S(x, y) exists and is continuous in x whenever x ≠ y

The scoring function can be seen as a distance-like measure between the forecast and the outcome, and therefore as a measure of the error between the prediction and the outcome. Thus a lower score means a more accurate forecast. There is, however, no condition that requires symmetry of the scoring function: forecasting x when the outcome is y can be considerably more wrong than forecasting y when the outcome is x. The requirement of continuity assures that forecasts that are close to each other receive similar scores. Common examples of scoring functions are the squared error

S(x, y) = (x − y)²,

and the absolute error

S(x, y) = |x − y|.


2.2.1 Elicitability

An important property in the area of point forecast evaluation is elicitability. Elicitability is a concept originally introduced by Osband (cf. [14]) and formalized by Lambert, Pennock and Shoham (cf. [12]). A functional φ(Y) of a random variable Y is elicitable if

φ(Y) = argmin_x E[S(x, Y)]

for some S. What this means is that the functional can be seen as the minimizer of a given scoring function. An example of this is the least squares method, where the mean is given by the minimization of the scoring function

S(x, y) = (x − y)².

Gneiting (cf. [9]) shows that VaR is elicitable and that ES is not elicitable. The scoring function that elicits VaR is

S(x, y) = (1_{y≤x} − α)(x − y).   (2.3)

We see that for a low α the scoring function will be considerably higher if x > y than the other way around.

ES is not elicitable, but Emmer, Kratz and Tasche (cf. [7]) show that ES is conditionally elicitable. The conditional elicitability of ES comes from the fact that VaR is an elicitable functional and that ES defined conditionally on the VaR, as in (2.2), means taking a subset of the outcomes. For this subset the ES is then just the expected value, which is elicitable.
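The following small numerical check is my own sketch of the elicitability property (the sample size, grid and seed are arbitrary choices): minimizing the empirical mean of the score in (2.3) over candidate forecasts recovers a quantile of the outcome distribution, at level α under the sign convention of (2.3).

```python
import numpy as np

def score(x, y, alpha):
    """S(x, y) = (1_{y <= x} - alpha)(x - y), the scoring function in (2.3)."""
    return ((y <= x).astype(float) - alpha) * (x - y)

rng = np.random.default_rng(1)
y = rng.standard_normal(200_000)          # simulated outcomes
alpha = 0.05

grid = np.linspace(-3.0, 0.0, 601)        # candidate forecasts x
mean_scores = [score(x, y, alpha).mean() for x in grid]
x_star = grid[int(np.argmin(mean_scores))]

print("score-minimizing forecast:", round(float(x_star), 3))
print("quantile of Y at level alpha:", round(float(np.quantile(y, alpha)), 3))
```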

2.2.2 Hypothesis testing

A statistical hypothesis test is performed by taking two hypotheses, one null hypothesis which is assumed to be true and an alternative hypothesis which one aims to test against. For a given set of observations, the probability that they would be achieved under the null hypothesis is calculated, giving a p-value. The p-value is the probability that these observations would come from the assumed null model; therefore a low p-value means an unlikely outcome. As the definition of a p-value is how unlikely the outcome is to be achieved when the null hypothesis is true, for a true null hypothesis where the tested observations are continuous the p-values are expected to follow the distribution

p ∈ U(0, 1).


In the case that the observations are discrete, the possible p-values will be finite and not necessarily uniformly distributed, but with an increasing number of possible outcomes the distribution will converge towards the uniform distribution, conditional on the null hypothesis being correct.

2.3 Likelihood ratio tests

A common method for comparing the goodness of fit of two models is the likelihood ratio (LR) test. It compares a null model against an alternative model, where the null model is a special case of the alternative model. The test then either rejects or does not reject the null model. If the null model is rejected it means that the data are unlikely to come from that model. Wilks (cf. [15]) shows that minus two times the logarithm of the likelihood ratio is asymptotically χ² distributed. The Neyman-Pearson lemma (cf. [13]) states that the likelihood ratio test is the most powerful test of the goodness of fit for these models.

2.4 Stochastic differential equations

A stochastic process can be described by a stochastic differential equation (SDE), which is written on the form

dS_t = μ(S_t, t) dt + σ(S_t, t) dW_t.

2.4.1 Geometric Brownian motion

An important example is the geometric Brownian motion (GBM)

dS_t = μ S_t dt + σ S_t dW_t.

This SDE has an explicit solution,

S_t = S_0 exp((μ − σ²/2) t + σ W_t).   (2.4)

2.4.2 Heston’s stochastic volatility model

Another important example is the Heston stochastic volatility model (cf. [10]), where the volatility is a stochastic process. The SDE for the asset is

dS_t = μ S_t dt + √ν_t S_t dW_t^1,   (2.5)

with the following SDE for the volatility,

dν_t = κ(θ − ν_t) dt + ξ √ν_t dW_t^2,

and the covariation of the processes

dW_t^1 dW_t^2 = ρ dt.

This SDE has no explicit solution, therefore a numerical solution has to be simulated. The process for the volatility is commonly called a Cox-Ingersoll-Ross process, which stays larger than zero if the following condition is satisfied (cf. [6]):

2κθ ≥ ξ².   (2.6)

Discretization

The stochastic process is approximated with a discretization by the Euler-Maruyama method, which approximates a solution of the SDE

dX_t = a(X_t) dt + b(X_t) dW_t

recursively by

X_{n+1} = X_n + a(X_n) Δt + b(X_n) ΔW_n,

where the ΔW_n are normally distributed and independent. The volatility process is then discretized by

ν_{n+1} = κθ + (1 − κ) ν_n + ξ √ν_n ΔW_n^1,   (2.7)

and the SDE for the asset by

S_{n+1} = S_n exp((μ − ν_n/2) + √ν_n ΔW_n^2),   (2.8)

where the correlation between ΔW_n^1 and ΔW_n^2 is ρ and Δt = 1. If the condition (2.6) is satisfied, ν_n should theoretically be positive; however, because of discretization errors it could become negative, which would be problematic since (2.7) and (2.8) use the square root of ν_n. Because of this, ν_{n+1} is taken as

ν_{n+1} = max(0, κθ + (1 − κ) ν_n + ξ √ν_n ΔW_n^1),

so for negative values the volatility is just taken as zero instead. This should not be a major problem if (2.6) is satisfied, as the values in theory should be positive.
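A minimal simulation sketch of this discretization is given below. It is my own code, not the thesis's implementation; the parameter values are the ones listed for the simulated data in chapter 4, and the flooring at zero follows the description above.

```python
import numpy as np

def simulate_heston(n_steps, s0, v0, mu, kappa, theta, xi, rho, rng):
    """Simulate one discretized Heston path with Delta t = 1, as in (2.7)-(2.8)."""
    s = np.empty(n_steps + 1)
    v = np.empty(n_steps + 1)
    s[0], v[0] = s0, v0
    for n in range(n_steps):
        # correlated standard normal increments with correlation rho
        z1 = rng.standard_normal()
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal()
        # volatility update (2.7), floored at zero as described above
        v[n + 1] = max(0.0, kappa * theta + (1.0 - kappa) * v[n] + xi * np.sqrt(v[n]) * z1)
        # asset update (2.8)
        s[n + 1] = s[n] * np.exp(mu - v[n] / 2.0 + np.sqrt(v[n]) * z2)
    return s, v

# parameter values taken from the simulation setup in chapter 4
rng = np.random.default_rng(2)
path, vol = simulate_heston(1000, s0=1.0, v0=1e-8, mu=1e-5,
                            kappa=0.02, theta=1e-8, xi=1e-5, rho=-0.5, rng=rng)
print(path[-1])
```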


2.5 Backtesting risk measures

The process of evaluating the performance of a risk measure is often called backtesting. This is forecast evaluation applied with the risk measure as the forecast and the realised loss as the outcome. There are two reasons why one may want to do this. Either you want to assess the accuracy of the model, in which case you generally use two-sided tests, or you want to make sure that the risk estimates are not too low, in which case you generally use a one-sided test. The latter is commonly used for regulatory purposes, where you are mainly interested in the estimated risk not being too low and do not care if the estimate of the risk is too high. The former is more interesting if you want to assess how accurate the modelling of the risk measure is. For financial institutions it is of course also important that the risk is not overestimated.

2.6 Backtesting Value-at-Risk

The most common way to backtest VaR is to evaluate it as an interval forecast instead of a point forecast. This is generally done by defining the sequence

I_t = 0 if L_t ≤ VaR_t,
I_t = 1 if L_t > VaR_t.   (2.9)

If the VaR model is correct, the I_t should be independent and identically distributed stochastic variables with a Bernoulli distribution,

I_t ∈ Be(α) ∀t.

This gives two important properties of the sequence that should be fulfilled.

Firstly, that the number of exceedances of the VaR is correct and, secondly, that the exceedances are independent of each other. The first property is often called unconditional coverage while the second property is called independence. The combined property is called conditional coverage. In general, methods, especially earlier ones, have focused on the first property, but the second property is also important. In a more general sense the first property assures that the number of large losses is not too high, whilst the second property assures that large losses do not occur too clustered together.

If losses occur in clusters, the aggregate loss over a period could become high. For example, if you have large losses every day in a week, the aggregate loss for the entire week could be unacceptably high. Dependence could also occur at different orders. A first order dependence would imply that the probability of an exceedance at time t depends on whether there was an exceedance at time t − 1. A second order dependence would imply that the probability of an exceedance at time t depends on whether there were exceedances at times t − 1 and t − 2. Generally, lower order dependence is more interesting to account for.

As VaR is elicitable it should also be possible to evaluate it as a point forecast, but this is not commonly done. What can be seen is that the scoring function in (2.3) looks quite similar to the indicator function in (2.9). The indicator function would however not constitute a valid scoring function, as it is not continuous at L_t = VaR_t and is zero at points other than L_t = VaR_t.

2.6.1 Using a binomial model

An early but often used backtesting framework was proposed by Kupiec (cf. [11]), which relies on the fact that a sum of Bernoulli distributed variables is binomially distributed. Forming the sum

Y = Σ_{t=1}^{N} I_t,

for the model to be correct Y should come from the following distribution:

Y ∈ Bin(N, α).

This method accounts only for unconditional coverage, so any interactions between the exceedances are not accounted for. However, the method currently used for regulatory purposes is based on this method.
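A sketch of how this check can be carried out is given below. It is my own code and uses an exact binomial test as a stand-in for the LR form that Kupiec actually derives; the function name and the simulated indicator sequence are assumptions for the example.

```python
import numpy as np
from scipy import stats

def kupiec_binomial_pvalue(exceedances, alpha):
    """exceedances: 0/1 indicators I_t from (2.9); two-sided p-value of H0: P(I_t = 1) = alpha."""
    n = len(exceedances)
    y = int(np.sum(exceedances))
    return stats.binomtest(y, n, alpha).pvalue

# toy usage: indicators from a correctly specified 5% VaR model
rng = np.random.default_rng(3)
I = (rng.random(1000) < 0.05).astype(int)
print(kupiec_binomial_pvalue(I, 0.05))
```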

2.6.2 Using a Markov chain model

Christoffersen (cf. [5]) proposes that the first order interactions could be modeled as a first order Markov chain with the following transition matrix, assuming that the model is correct,

Π = ( 1 − α   α
      1 − α   α ).

He also designs LR tests for unconditional coverage, independence and conditional coverage. The LR test for unconditional coverage tests

H0 : E[I_t] = α,

against

H1 : E[I_t] ≠ α,

giving the likelihood under the null hypothesis

L(α; I_1, I_2, . . . , I_T) = (1 − α)^{n_0} α^{n_1},

and the likelihood under the alternative

L(π; I_1, I_2, . . . , I_T) = (1 − π)^{n_0} π^{n_1}.

The LR test for unconditional coverage can then be formulated as

LR_uc = −2 log[L(α; I_1, I_2, . . . , I_T)/L(π̂; I_1, I_2, . . . , I_T)] ∼ χ²(1) asymptotically,

where π̂ = n_1/(n_0 + n_1) is the maximum likelihood estimate of π. The LR test of independence uses the Markov chain approximation, with the transition probability matrix

Π_1 = ( 1 − π_01   π_01
        1 − π_11   π_11 ),

where π_ij = P(I_t = j | I_{t−1} = i). The approximate likelihood function for the process is

L(Π_1; I_1, I_2, . . . , I_T) = (1 − π_01)^{n_00} π_01^{n_01} (1 − π_11)^{n_10} π_11^{n_11}.

Maximizing the log-likelihood function gives the estimate

Π̂_1 = ( n_00/(n_00 + n_01)   n_01/(n_00 + n_01)
        n_10/(n_10 + n_11)   n_11/(n_10 + n_11) ).

Independence corresponds to

Π_2 = ( 1 − π_2   π_2
        1 − π_2   π_2 ),

with the likelihood under the null hypothesis

L(Π_2; I_1, I_2, . . . , I_T) = (1 − π_2)^{n_00 + n_10} π_2^{n_01 + n_11}.

The LR test of independence follows the distribution

LR_ind = −2 log[L(Π̂_2; I_1, I_2, . . . , I_T)/L(Π̂_1; I_1, I_2, . . . , I_T)] ∼ χ²(1) asymptotically.

The LR tests for unconditional coverage and independence can be combined into a joint test for conditional coverage by

LR_cc = LR_uc + LR_ind,

following, asymptotically, the distribution

LR_cc ∼ χ²(2).   (2.10)

This method is more complex than the method proposed by Kupiec, but it also accounts for more possible misspecifications in the risk model. Interactions are something you would expect to see in models that do not fully account for stochastic volatility. If you assume a constant volatility you will get more exceedances when the volatility is high and fewer when the volatility is low. This makes the independence property interesting to account for, as it is probable that many risk models do not fully account for stochastic volatility.

Testing only for unconditional coverage should give the same result as the binomial model, and Kupiec derives the same LR test as Christoffersen's unconditional coverage test. One drawback of this method is that it cannot be formulated as a one-sided test. Often you are only interested in whether the risk model underestimates the risk, not whether it overestimates it. The Basel Committee's traffic light test is an example of a one-sided test based on the binomial model.

A model accounting for higher order interactions could be constructed, but it would be even more complex and therefore not likely to be reasonable to implement. The effect of these interactions would probably also be less important than the first order interactions, as it is reasonable to assume that checking for first order interactions should cover most instances of clustering.
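A compact implementation sketch of the three LR tests is given below. It is my own code rather than the thesis's; in particular, the handling of degenerate counts (for example no exceedances at all) is an assumption.

```python
import numpy as np
from scipy import stats

def christoffersen_tests(I, alpha):
    """I: 0/1 array of VaR-exceedance indicators; returns (p_uc, p_ind, p_cc)."""
    I = np.asarray(I, dtype=int)
    n0, n1 = int(np.sum(I == 0)), int(np.sum(I == 1))

    def loglik(p, a, b):
        """log[(1 - p)^a * p^b], with the convention 0^0 = 1."""
        if p <= 0.0 or p >= 1.0:
            return 0.0 if (p <= 0.0 and b == 0) or (p >= 1.0 and a == 0) else -np.inf
        return a * np.log1p(-p) + b * np.log(p)

    # Unconditional coverage: H0 says P(I_t = 1) = alpha.
    pi_hat = n1 / (n0 + n1)
    lr_uc = -2.0 * (loglik(alpha, n0, n1) - loglik(pi_hat, n0, n1))
    p_uc = stats.chi2.sf(lr_uc, df=1)

    # Independence: first-order Markov chain on consecutive pairs (I_{t-1}, I_t).
    prev, curr = I[:-1], I[1:]
    n00 = int(np.sum((prev == 0) & (curr == 0)))
    n01 = int(np.sum((prev == 0) & (curr == 1)))
    n10 = int(np.sum((prev == 1) & (curr == 0)))
    n11 = int(np.sum((prev == 1) & (curr == 1)))
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    pi2 = (n01 + n11) / (n00 + n01 + n10 + n11)
    lr_ind = -2.0 * (loglik(pi2, n00 + n10, n01 + n11)
                     - (loglik(pi01, n00, n01) + loglik(pi11, n10, n11)))
    p_ind = stats.chi2.sf(lr_ind, df=1)

    # Conditional coverage: LR_cc = LR_uc + LR_ind is asymptotically chi2(2).
    p_cc = stats.chi2.sf(lr_uc + lr_ind, df=2)
    return p_uc, p_ind, p_cc

# toy usage: indicators from a correctly specified 5% VaR model
rng = np.random.default_rng(4)
print(christoffersen_tests((rng.random(1000) < 0.05).astype(int), 0.05))
```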

2.7 Conclusion

To quantify risk is crucial in the financial industry. The most widely used measure is Value-at-Risk, followed by Expected shortfall. The main drawbacks of VaR are that it lacks a property called subadditivity and that it does not capture tail risk, which makes ES a better alternative from a theoretical viewpoint.

An important procedure when measuring risk is to backtest your risk measures, that is, to compare the estimates to the outcomes to investigate how accurate the estimates were. The main advantage of VaR is that it is easy to estimate and to backtest, while for ES there has been no consensus regarding how to backtest it.


When backtesting there are two important properties to consider: unconditional coverage and independence. Unconditional coverage means that the probability of an exceedance is correct. Independence means that the probability of an exceedance does not depend on earlier exceedances. Generally, backtests have focused on unconditional coverage, but independence is also an important property. In financial data you often see volatility clustering; therefore, if the risk model does not account for the stochastic volatility you would expect to see clustering of the VaR-exceedances.


3 Theoretical framework

3.1 Backtesting expected shortfall

As pointed out by Acerbi and Szekely (cf. [2]), the discovery by Gneiting (cf. [9]) in 2011 that ES is not elicitable sparked a somewhat confused debate concerning whether or not ES could be backtested at all. While this makes it impossible to backtest ES directly as a point forecast, it is possible to evaluate it in other ways. As pointed out earlier, the primary way to backtest VaR is to evaluate it as an interval forecast, not a point forecast. As ES is not a pure interval this is not directly applicable. Therefore the backtest has to be designed using some other forecast evaluation technique.

3.2 Backtest 1

When the ES is defined as an integral of VaR as in equation (2.1), the integral can be approximated as a sum over different VaR-levels, which gives the approximation

ES_α(X) = (1/α) ∫_0^α VaR_u(X) du ≈ (1/N) Σ_{k=1}^{N} VaR_{kα/N}(X).

Emmer, Kratz and Tasche (cf. [7]) suggest using the approximation

ES_α ≈ (1/4)(VaR_α(X) + VaR_{0.75α}(X) + VaR_{0.5α}(X) + VaR_{0.25α}(X)).

However, why this particular approximation should be used is not specified. The approximation corresponds to a left-point rectangular approximation of the integral. We use the approximation

ES_α ≈ (1/5)(VaR_α(X) + VaR_{0.8α}(X) + VaR_{0.6α}(X) + VaR_{0.4α}(X) + VaR_{0.2α}(X)).   (3.1)

This approximation gives more even VaR-levels when used on ES_{0.025} and ES_{0.05}. This could be beneficial when used on VaR modeled with historical simulation.

These levels should be tested jointly. Applying the Markov chain model by Christoffersen, this would be modeled as a 6-state Markov chain. The problem with this approach is that the transition probabilities between two different levels of VaR-exceedances would be very low, requiring a huge amount of data to test accurately. So that method is not considered in this thesis.

The solution would be to test the levels independently or to test only the unconditional coverage jointly. A problem with testing the levels independently is that they are clearly not independent, and it would be hard to account for this in the testing. It would be hard to determine what accuracy you have in your test, and it would therefore be difficult to justify a high enough significance because of the high risk of type I errors. There are methods for combining multiple tests, but they often assume independence between the tests, which is unrealistic in this scenario. The correlation is probably higher for VaR-levels that are closer to each other. For example, the 0.5% VaR and the 1% VaR are probably more correlated than the 1% VaR and the 5% VaR.

If the tests are assumed to be independent, Fisher's method (cf. [8]) could be used:

χ²_{2k} = −2 Σ_{i=1}^{k} ln(p_i).

On the other hand, taking the minimum p-value as the combined p-value would be reasonable if the values are highly correlated. The correlation matrix for the VaR-exceedances of a correct VaR-model with levels 2.5%, 2%, 1.5%, 1% and 0.5% is approximately

( 1.00  0.89  0.77  0.63  0.44
  0.89  1.00  0.86  0.70  0.50
  0.77  0.86  1.00  0.81  0.57
  0.63  0.70  0.81  1.00  0.71
  0.44  0.50  0.57  0.71  1.00 ).

So it is reasonable to assume that they will be highly correlated. The backtest used in this thesis will therefore take the p-value of the combined test as the minimum p-value, where the individual p-values will be calculated with Christoffersen's method described in (2.10). As the tests are based on the discrete sequence (2.9), the possible p-values will be finite; however, the number of possible outcomes is large enough that they are considered approximately continuous.
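The sketch below illustrates the aggregation used in this backtest: test each of the five indicator sequences separately and take the minimum p-value as the combined p-value. It is my own code, and for brevity it uses a simple binomial test per level as a stand-in for the Christoffersen conditional coverage test that is actually used; the constant VaR forecasts in the toy example are an assumption.

```python
import numpy as np
from scipy import stats

def backtest1_min_pvalue(losses, var_by_level):
    """losses: realised losses L_t.  var_by_level: dict mapping each VaR level
    (e.g. 0.005) to an array of VaR forecasts aligned with the losses.
    Returns the minimum p-value over the levels."""
    pvals = []
    for level, var in var_by_level.items():
        exceed = losses > var                                   # indicator sequence (2.9)
        p = stats.binomtest(int(exceed.sum()), len(exceed), level).pvalue
        pvals.append(p)
    return min(pvals)

# toy usage with constant (hypothetical) VaR forecasts for standard normal losses
rng = np.random.default_rng(5)
L = rng.standard_normal(1000)
levels = [0.025, 0.020, 0.015, 0.010, 0.005]
var_forecasts = {a: np.full(L.shape, stats.norm.ppf(1 - a)) for a in levels}
print(backtest1_min_pvalue(L, var_forecasts))
```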

3.3 Backtest 2

Testing the unconditional coverage jointly with the approximation method would mean that instead of testing against a binomial distribution as in the method by Kupiec, you test against a multinomial distribution. With the given approximation you divide the losses according to

I_t = 0 if L_t ≤ VaR_{t,α},
I_t = 1 if VaR_{t,α} < L_t ≤ VaR_{t,(n−1)α/n},
⋮
I_t = n − 1 if VaR_{t,2α/n} < L_t ≤ VaR_{t,α/n},
I_t = n if VaR_{t,α/n} < L_t.

This is tested against the hypothesis that the probability distribution of I_t is

P(I_t = 0) = 1 − α,  P(I_t = 1) = P(I_t = 2) = · · · = P(I_t = n) = α/n.   (3.2)

This is done with a Pearson chi-squared test. Let X_m be the number of times I_t = m, T be the total number of observations and P_i the probability of event i. The test statistic is then given by

D = Σ_{i=0}^{n} (X_i − T P_i)² / (T P_i),

which approximately follows a chi-square distribution with n degrees of freedom,

D ∈ χ²_n.

The cumulative distribution function of the χ²-distribution gives the p-value of the test. A rejection of the hypothesis means that the ES model can be assumed to be flawed at that confidence level. For example, if the p-value is less than 5% we reject the ES model at the 95% confidence level.

A consideration has to be made regarding the number of VaR-levels tested.


Increasing the number of VaR-levels will make the approximation in (3.1) more correct but will decrease the power of the test in (3.2).

One drawback with this method is that it is impossible to design a one-sided test. From a regulatory perspective this makes the test less interesting, but it could still be interesting from an internal model validation perspective.

Another drawback is that this method cannot account for conditional coverage; it only accounts for unconditional coverage, in the same manner as the Kupiec test does for the backtesting of VaR.

As for backtest 1, the sequence (3.3) that the test is based on is discrete. Also in this case the number of possible outcomes is considered large enough that the resulting p-values can be treated as approximately continuous.
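The sketch below is my own implementation of this test under the bucket definition above: classify each loss by the number of VaR levels it exceeds and compare the bucket counts with the multinomial hypothesis (3.2) using a Pearson chi-squared test. The toy example with constant VaR forecasts is an assumption.

```python
import numpy as np
from scipy import stats

def backtest2_pvalue(losses, var_by_level, alpha):
    """var_by_level: dict {level: VaR forecast array} for the levels
    alpha/n, 2*alpha/n, ..., alpha used in the approximation (3.1)."""
    levels = sorted(var_by_level)              # alpha/n first, alpha last
    n = len(levels)
    counts = np.zeros(n + 1, dtype=int)        # counts[k] = #{t : I_t = k}
    for t, loss in enumerate(losses):
        k = sum(loss > var_by_level[lv][t] for lv in levels)   # levels exceeded
        counts[k] += 1
    probs = np.array([1.0 - alpha] + [alpha / n] * n)          # hypothesis (3.2)
    return stats.chisquare(counts, f_exp=probs * len(losses)).pvalue

# toy usage with constant (hypothetical) VaR forecasts for standard normal losses
rng = np.random.default_rng(6)
L = rng.standard_normal(5000)
alpha, n = 0.025, 5
lv = [alpha * k / n for k in range(1, n + 1)]
var_forecasts = {a: np.full(L.shape, stats.norm.ppf(1 - a)) for a in lv}
print(backtest2_pvalue(L, var_forecasts, alpha))
```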

3.4 Backtest 3

As pointed out by Acerbi and Szekely (cf. [2]), one way to design the backtest is to use the conditional elicitability of ES. In this method the VaR is first backtested, and then the magnitude of the losses, in case the loss is larger than the VaR, is compared to the ES. Using ES in the conditional form (2.2), it can be rewritten as

E[ES_{α,t} − L_t | VaR_{α,t} − L_t < 0] = 0.

This means that for t such that VaR_{α,t} − L_t < 0 the following hypothesis is tested:

H0 : E[ES_{α,t} − L_t] = 0,

against

H1 : E[ES_{α,t} − L_t] < 0.

This hypothesis is tested with a paired difference test, which means that the difference between the ES estimates and the losses in periods with VaR-exceedances is tested. For the VaR-exceedances you form

D = ES_{α,t} − L_t,

which is assumed to follow a Student's t distribution, and the hypothesis that the mean difference is zero is tested against the hypothesis that it is non-zero. The distribution of D may be skewed, because it is bounded above by ES − VaR but not bounded below. This would violate the Student's t assumption for the distribution of D. If this is a problem, the results will show it.

As the test in this method also relies on the underlying VaR model, you should evaluate that model as well and aggregate its p-value with the p-value of the test for the magnitude of the ES. However, in this thesis this is not done, and this test will make the assumption that the VaR model is correct. This is in order to see how this test performs on its own, without the VaR test, which would probably perform better.
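A sketch of this backtest under my reading of the description is given below: collect D = ES_t − L_t on the days where the loss exceeds the VaR and test H0 against H1 with a one-sided one-sample t-test, assuming (as stated above) that the VaR model is correct.

```python
import numpy as np
from scipy import stats

def backtest3_pvalue(losses, var, es):
    """One-sided paired test of H0: E[ES_t - L_t] = 0 on the VaR-exceedance days."""
    exceed = losses > var
    d = es[exceed] - losses[exceed]
    if d.size < 2:
        return np.nan                      # too few exceedances to test
    return stats.ttest_1samp(d, popmean=0.0, alternative="less").pvalue

# toy usage: correct VaR and ES for standard normal losses at alpha = 2.5%
rng = np.random.default_rng(7)
L = rng.standard_normal(5000)
a = 0.025
var = np.full(L.shape, stats.norm.ppf(1 - a))
es = np.full(L.shape, stats.norm.pdf(stats.norm.ppf(1 - a)) / a)
print(backtest3_pvalue(L, var, es))
```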

3.5 Backtest 4

A fourth method is to look at the difference between the estimated ES and the loss at each time point. This is similar to the third method, but instead of using the VaR interval to determine which losses are among the α% largest, the α% largest losses are used directly. It is important to remember here that each loss will be the outcome of a different, unknown probability distribution. This makes comparing which losses are the most unlikely less straightforward, but taking the α% largest of

D = ES_t − L_t

would be a reasonable approximation. This will have a higher probability of rejecting the ES for being too small than for being too large. The main problem will arise when the ES varies in size, which is reasonable to expect because the volatility is stochastic and not constant. The differences are then tested in the same way as in backtest 3. An advantage of this method is that it does not require the underlying VaR model to be tested.
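The sketch below is one possible reading of this backtest (my own assumption, not the thesis's code): take the ⌊αN⌋ largest losses, form D on those days, and apply the same one-sided t-test as in backtest 3. Ranking the days by D instead of by the raw loss, to account for an ES that varies in size, only requires changing the sort key.

```python
import numpy as np
from scipy import stats

def backtest4_pvalue(losses, es, alpha):
    """Form D_t = ES_t - L_t on the floor(alpha*N) days with the largest losses
    and apply the same one-sided t-test as in backtest 3."""
    k = max(int(np.floor(alpha * len(losses))), 2)
    worst = np.argsort(losses)[::-1][:k]       # indices of the k largest losses
    d = es[worst] - losses[worst]
    return stats.ttest_1samp(d, popmean=0.0, alternative="less").pvalue

# toy usage: correct ES for standard normal losses at alpha = 2.5%
rng = np.random.default_rng(8)
L = rng.standard_normal(5000)
a = 0.025
es = np.full(L.shape, stats.norm.pdf(stats.norm.ppf(1 - a)) / a)
print(backtest4_pvalue(L, es, a))
```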

3.6 Conclusion

Some authors claim that the lack of elicitability is a significant shortcoming of ES as a risk measure. However, this is not true. The main drawback compared to VaR is that VaR is a pure interval forecast, which is not that difficult to backtest. ES is a more complicated forecast and there is not much literature on the evaluation of similar forecasts. It is more difficult to backtest, but that is not connected to its lack of elicitability.

As the ES is not directly backtestable, some other approach has to be taken. In this thesis two different types of methods are investigated: one which relies on testing the magnitude of the losses larger than the cut-off level, and another which relies on an approximation of the ES by different levels of VaR. In the second type one also has to decide how the tests of the different levels should be aggregated into a single test. When choosing how to evaluate the accuracy of ES it is important to first decide what you want to achieve with the evaluation: do you want to make sure that the ES estimate is not too low, or do you want to make sure that it is close to the actual ES?

Generally the methods which rely on using the magnitude of the ES will require much more data. This is because for ES at level α you will only be able to use α% of your sample to evaluate the magnitude of the ES. A commonly proposed level of the ES is 2.5%. This means that for every 200 observations you would be able to use 5 to evaluate the magnitude. So if you only have a few hundred data points the statistical power to reject an incorrect model will be low. This drawback means that these types of models are unlikely to be implemented in practice.

Also, if a suitable backtest for ES is found it will probably be much weaker than the backtest for VaR given the same data, meaning that backtests for ES will be far less capable of separating a good ES model from a bad one than a VaR backtest. This means that you would get an increase in either the false positives or the false negatives. This could potentially be a big problem if ES were to be used as the primary risk measure.


4 Data

Simulated data are used to test the methods. The reason for this is to be able to evaluate the performance of the backtests independently of the performance of the methods estimating ES and VaR. If you use real data you do not know if the ES models are correct and therefore you cannot know which models should be rejected. So for a given rejected ES model you do not know if it is a correct rejection of an incorrect model or an incorrect rejection of a correct ES model. Even with simulated data you would not know which models are correct and incorrect; you would only know the probability that a given model is incorrect.

4.1 Returns

4.1.1 Asset paths

The data that form the basis for the risk estimation are daily asset prices, denoted S_t for the asset price at time t. Three different types of models are used to simulate the asset prices: a GBM model, which corresponds to log-normally distributed asset returns; a stochastic process where the normal distribution is replaced by a Student's t distribution; and a stochastic volatility model.

For the GBM the explicit solution given in equation (2.4) is used. This solution is then discretized to generate data corresponding to discrete times

S_{n+1} = S_n exp((μ − σ²/2) + σ ΔW_n),

where the ΔW_n are normally distributed and Δt = 1. Calculating ES with a parametric normal method should give a correct answer for this model. For the simulated data the parameters are

μ = 10⁻⁵,  σ = 10⁻⁴.

The simulated data for N = 1000 are plotted in figure 4.1.

Figure 4.1: 500 discretized GBM paths with 1000 time steps

For the second model the same discretization as for the GBM is used, but instead of assuming that the increments are normally distributed, the increments are assumed to follow a Student's t-distribution. The resulting discrete stochastic process does not correspond to a valid continuous process, as the sum of two Student's t-distributed stochastic variables is not Student's t-distributed. But we are only interested in the discrete case, so that does not pose a problem. As the Student's t-distribution has a fatter tail than the normal distribution, a parametric model that assumes a normal distribution should underestimate the ES when used on these data. To get a good comparison with the GBM data, the same expected value and standard deviation are used for the Student's t distribution, and the degrees of freedom are set to 4, giving the parameters

μ = 10⁻⁵,  ν = 4,  σ = √((ν − 2)/ν) · 10⁻⁴ = 7.07 · 10⁻⁵.

The simulated data for N = 1000 are plotted in figure 4.2.

Figure 4.2: 500 Student’s t paths with 1000 time steps

For the stochastic volatility model the asset prices are simulated with the Heston model, where the asset prices follow the SDE given in equation (2.5). Asset prices following this process should show more clustering of large returns. There is also a negative correlation between the volatility process and the asset process, meaning that large volatility and negative returns are more likely to occur at the same time. The parameters are set to

κ = 0.02,  θ = 10⁻⁸,  ξ = 10⁻⁵,  ρ = −0.5,  S_0 = 1,  ν_0 = 10⁻⁸,  μ = 10⁻⁵.

Looking at the condition in (2.6), which assures that the volatility is positive, we have

2κθ/ξ² = (2 · 0.02 · 10⁻⁸)/(10⁻⁵)² = 4 ≥ 1,

so the condition is fulfilled.

4.1.2 Log-returns

From the time series S_t, log-returns are calculated by the formula

r_t = log(S_t / S_{t−1}),

where log is the natural logarithm. These log-returns are then used to estimate the risk. The main reason for using log-returns is that if the underlying asset follows a geometric Brownian motion, the log-returns will be normally distributed. The assumption that log-returns are independent is also much more reasonable than making the same assumption for asset values or absolute returns.

Figure 4.3: 500 discretized paths from the Heston volatility model with 1000 time steps

4.2 VaR

To calculate VaR you must decide on a method and an estimation window n of historical data. This n corresponds to the number of historical returns you use in your estimation of the risk measure. A longer time period gives a more reliable estimate, but you also risk using older data that are no longer relevant.

4.2.1 Historical simulation

In the historical simulation approach you form the historical distribution from the n last daily losses and assume that the loss the next day will come from this distribution,

P(L_t = L_{t−1}) = P(L_t = L_{t−2}) = · · · = P(L_t = L_{t−n}) = 1/n.

The losses are then sorted,

L_1 ≥ L_2 ≥ · · · ≥ L_n,

and the VaR is calculated as

VaR = L_{⌊αn⌋+1}.

This means that you need at least 1/α observations to reasonably estimate VaR this way. You also get a large degree of uncertainty when 1/α is close to n, as the VaR estimate will jump when 1/α = n. For example, ⌊0.01 · 100⌋ + 1 = 2 but ⌊(0.01 − δ) · 100⌋ + 1 = 1 for any δ > 0. So you can get a large change in the VaR with small changes in, for example, α or n.

4.2.2 Parametric method

In a parametric method you assume a parametric distribution that the data follow and estimate the parameters from the n returns. If you assume that the log-returns are normally distributed, this means that you calculate the expected value and standard deviation

μ_t = E[r_i],  σ_t = SD[r_i],  i ∈ [t − n, t − 1],

where i is an integer. You then assume that the log-return follows the distribution

r_t ∈ N(μ_t, σ_t).

Using this, the loss distribution can be calculated, which gives the VaR from the cumulative distribution function F_L of the stochastic variable L_t:

V̂aR_{t,α} = F_L^{−1}(1 − α).

For the Student’s t distribution the degrees of freedom are set to 4 when estimating the distribution. It would be interesting to estimate all three parameters but with the number of distributional fittings made it would not be possible as it is much more computationally intensive.

4.3 ES

The estimations of ES are similar to the estimation of VaR for both methods.


4.3.1 Historical simulation

The losses are sorted in the same way, and the ES is calculated as the mean of the losses at least as large as the VaR,

ÊS_{α,t} = (1/α) [ (α − ⌊αn⌋/n) L_{⌊αn⌋+1} + Σ_{k=1}^{⌊αn⌋} L_k/n ],

which, when αn is an integer, simplifies to

ÊS_{α,t} = (1/(αn)) Σ_{k=1}^{αn} L_k,

which is just the average of the α% largest losses. From this it can be seen that estimating ES with this method requires more data than estimating VaR with the same method. When αn < 1 the estimates of the ES and the VaR will be the same. The ES estimate will however not experience as large jumps, as it is a mean of multiple values.
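A sketch of this estimator, including the fractional weight when αn is not an integer, is given below (my own code; the simulated sample of 199 standard normal losses is an assumption for the example).

```python
import numpy as np

def es_historical(losses, alpha):
    """Historical-simulation ES, including the fractional weight on L_{floor(alpha*n)+1}."""
    L = np.sort(losses)[::-1]                  # L_1 >= L_2 >= ... >= L_n
    n = len(L)
    k = int(np.floor(alpha * n))
    return ((alpha - k / n) * L[k] + L[:k].sum() / n) / alpha

rng = np.random.default_rng(9)
print(es_historical(rng.standard_normal(199), 0.025))
```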

4.3.2 Parametric methods

To estimate the ES with a parametric method you use the same distributional assumption and parameter estimation as for the VaR estimation. Using the integral definition of ES given in (2.1) we get

ES_α(X) = (1/α) ∫_0^α VaR_u(X) du = (1/α) ∫_{1−α}^1 F_L^{−1}(u) du,

from which you can then derive the ES for the normal method,

ES_{α,t}(X) = S_{t−1} (1 − Φ(Φ^{−1}(α) − σ) e^{μ+σ²/2} / α),

and the ES for the Student's t method,

ES_{α,t} = S_{t−1} (1 − (1/α) ∫_0^α e^{μ+σ t_ν^{−1}(u)} du).

For the normal method this can be evaluated analytically, but for the Student's t method it has to be evaluated with numerical integration.
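The sketch below is my own implementation of the two formulas: the closed-form normal ES and the Student's t ES evaluated by numerical integration. The parameter values in the example are taken from the simulation setup.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def es_normal(s_prev, mu, sigma, alpha):
    """ES of the loss S_{t-1} - S_t when the log-return is N(mu, sigma^2)."""
    z = stats.norm.ppf(alpha)
    return s_prev * (1.0 - stats.norm.cdf(z - sigma) * np.exp(mu + sigma**2 / 2) / alpha)

def es_student_t(s_prev, mu, sigma, nu, alpha):
    """ES when the log-return is mu + sigma * t_nu; the integral is evaluated numerically."""
    integrand = lambda u: np.exp(mu + sigma * stats.t.ppf(u, df=nu))
    integral, _ = quad(integrand, 0.0, alpha)
    return s_prev * (1.0 - integral / alpha)

print(es_normal(1.0, 1e-5, 1e-4, 0.025))
print(es_student_t(1.0, 1e-5, 7.07e-5, 4, 0.025))
```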


4.3.3 Creating samples for evaluation

When creating the sample for comparison you start with N sampled asset prices. You choose an estimation window n for the estimation of the VaR or ES. You then estimate the VaR at time t using the log-returns from the time periods t − n, t − n + 1, . . . , t − 1. You do this for the time periods t = n + 1 to t = N, giving a total of N − n − 1 VaR or ES estimations. You then pair them with the losses from the same time periods, giving the samples (V̂aR_{α,t}, L_t) and (ÊS_{α,t}, L_t). On this paired sample you use the methods described in chapter 3. So for a sample of N observed returns using an estimation window of n, the backtesting will be based on N − n − 1 paired losses and risk measure estimates. As can be seen, overlapping estimation windows will be used; if it were possible, non-overlapping samples would be preferred, but this would require unrealistically long samples.
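The sketch below is my own illustration of this rolling-window construction (the indexing convention and the helper var_normal estimator are assumptions): estimate the risk measure at each day from the n preceding log-returns and pair it with that day's realised loss.

```python
import numpy as np
from scipy import stats

def rolling_pairs(prices, n, estimator):
    """prices: S_0, ..., S_N.  For each day with n preceding returns available,
    estimate the risk measure from those n log-returns and pair it with that
    day's realised loss."""
    r = np.diff(np.log(prices))            # r[i] is the log-return of day i + 1
    losses = -np.diff(prices)              # losses[i] is the loss of day i + 1
    estimates, realised = [], []
    for i in range(n, len(r)):             # day i + 1 uses the n returns before it
        estimates.append(estimator(r[i - n:i], prices[i]))
        realised.append(losses[i])
    return np.array(estimates), np.array(realised)

def var_normal(window, s_prev, alpha=0.025):
    """Parametric normal VaR of the loss S_{t-1}(1 - e^{r_t}) (an assumed helper)."""
    mu, sd = window.mean(), window.std(ddof=1)
    return s_prev * (1.0 - np.exp(mu + sd * stats.norm.ppf(alpha)))

# toy usage on a simulated GBM price path with the parameters of section 4.1.1
rng = np.random.default_rng(10)
increments = (1e-5 - 1e-8 / 2) + 1e-4 * rng.standard_normal(1000)
S = np.concatenate(([1.0], np.exp(np.cumsum(increments))))
var_est, L = rolling_pairs(S, 199, var_normal)
print(len(var_est), float(var_est[0]), float(L[0]))
```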

This could be a problem especially for the historical simulation, when the estimation window n is short and α is low. The historical simulation only uses the tail values for the calculation, meaning that the estimate will only change when the new loss exceeds the previous VaR, or when a return that drops out of the window because it is too far back is larger than or equal to the VaR.

For the parametric method this is a smaller problem. The distribution is fitted using all the values, so the estimate will change as long as the newest value is not exactly the same as the last excluded value.

The rounding in the historical simulation method for VaR could also be a problem. The index ⌊αn⌋ + 1 involves rounding to an integer, so the probability that you actually estimate is

(⌊αn⌋ + 1)/n.   (4.1)

With α = 1% and n = 100 this gives an estimated probability of 2%, but with the same α and n = 99 the estimated probability is 1.01%.

For the test that relies on using multiple VaR levels, described in section 3.2, this could also cause problems if the rounding is uneven, or even make different levels round to the same VaR. This can however only occur when the following holds for the difference between the VaR levels:

Δα < 1/n.

So, for example, if the difference between VaR-levels is 0.5% you need at least 200 returns to be sure that you do not encounter the problem of VaR-levels rounding to the same estimate of the VaR.

Two different values of n will be used, 199 and 299. That 199 and not 200 is used is because of the rounding described in equation (4.1). Three values of N will be used: 500, 1000 and 5000. 5000 is an unrealistically long sample, but it is interesting to see what happens with more data, as it makes the testing more accurate. 500 and 1000 are probably more realistic lengths in practice. Two levels of α are used, 2.5% and 5%.

For n = 199 and α = 0.025 we have

Δα · n = 0.005 · 199 = 0.995 < 1,

so this has to be checked. For the levels 0.025, 0.020, 0.015, 0.010 and 0.005 you get the following indices:

⌊0.025 · 199⌋ + 1 = 5,  ⌊0.020 · 199⌋ + 1 = 4,  ⌊0.015 · 199⌋ + 1 = 3,  ⌊0.010 · 199⌋ + 1 = 2,  ⌊0.005 · 199⌋ + 1 = 1,

so the potential problem of different levels rounding to the same VaR estimate is not present.

4.4 Summary

The data sample is constructed in such a manner as to be close to a realistic scenario but at the same time have all the characteristics known. If the characteristics of the data were not known, it would be impossible to know what constitutes good performance of a backtest. Three types of data with different characteristics are simulated to see how the backtest methods react to different types of data. Both data that the backtests are expected to accept and data that they are expected to reject are used. Different sample lengths are also used to see how the methods perform with different amounts of data. It is interesting to see both how much data are needed to get good results and what the result is when a lot of data are used.


5 Results

5.1 Introduction to the results

When performing the simulations of the backtests, the results take the form of multiple p-values: for each combination of backtest settings, 500 p-values are calculated. From these p-values, conclusions concerning the accuracy of the backtest are drawn. This is done with a qq-plot of the p-values against the quantiles of a uniformly distributed stochastic variable between 0 and 1. If the hypothesis is correct, the p-values should follow that distribution.

This is the same as plotting the p-values as a function of their quantile, which is compared to the function f(p) = p. If the p-values are lower than the reference, it means that more than the expected number of backtested models will be rejected for a given confidence level of the backtest. This would indicate that the models are incorrect, which is something you would expect if you know that the models you are testing are wrong. If the p-values are higher than the reference, it is also an indicator that something is not right. One could mistake this for a very good model. This is however not the case; rather, it is an indication that the backtest is too weak, as you expect unlikely outcomes to appear with small probability. The chance that you in 500 tries would not achieve an outcome that should be rejected at the 5% level just by chance is very low.
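The sketch below is my own version of this type of plot (the quantile levels (i − 0.5)/m are an assumption): sort the p-values, plot them against uniform quantiles, and compare with the reference f(p) = p.

```python
import numpy as np
import matplotlib.pyplot as plt

def pvalue_qqplot(pvalues, title=""):
    """Plot sorted p-values against uniform quantiles and the reference f(p) = p."""
    p = np.sort(np.asarray(pvalues))
    q = (np.arange(1, len(p) + 1) - 0.5) / len(p)   # uniform quantile levels
    plt.plot(q, p, label="observed p-values")
    plt.plot([0, 1], [0, 1], "k--", label="f(p) = p")
    plt.xlabel("quantile")
    plt.ylabel("p-value")
    plt.title(title)
    plt.legend()
    plt.show()

# toy usage: 500 p-values from a correct null hypothesis should hug the diagonal
rng = np.random.default_rng(11)
pvalue_qqplot(rng.uniform(size=500), "uniform p-values")
```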

With four different methods, three different types of data, two different significance levels and two different estimation windows, the total number of resulting plots would be 48. Therefore only the most interesting results are shown in this section; the rest of the results are put in an appendix.


5.2 Geometric Brownian Motion

For the results on the GBM data the plots will be discussed in more detail, as they cover the different possibilities quite well: a non-parametric model, a parametric model with the right distributional assumption and a parametric model with the wrong distributional assumption.

5.2.1 Method 1 and 2

Historical simulation

In figure 5.1 the method for the estimation of the risk measures is historical simulation, which we expect to give a correct estimation of the ES. We see that method 1 works well except for the combination of α = 0.025 and n = 299. When this combination is used you get rounding for the VaR-levels 2.5%, 1.5% and 0.5%, as described in section 4.3.3. The rounding error is not that big, but with a large amount of data it is enough to reject many hypotheses with a high degree of confidence, as the unconditional coverage will be wrong.

Also in figure 5.1 we have method 2 on GBM data with historical simulation, which we expect to be correct. Here we see a low level of rejections except when α = 0.025 and n = 299. The reason that we get a high chance of rejection of the hypothesis is the same rounding problem as for method 1. For the VaR level 0.5% you will get the second worst outcome and for the VaR level 1% you will get the third worst outcome. The hypothesis assumes that the outcomes above the 0.5% VaR and between the 1% VaR and the 0.5% VaR are equally likely, but the rounding makes an outcome above the 0.5% VaR twice as likely, giving very low p-values.

Parametric normal method

In figure 5.2 method 2 is used on GBM data with the parametric normal method, which we expect to be correct. Here we see that the p-values are close to the expected levels, especially with a large amount of data, but even with less data the levels are close to the expected ones, at least for α = 0.05. We can also see that the assumption that the number of possible p-values is large enough for them to be approximately continuous does not seem to hold for α = 0.025, as the p-values are constant on some levels.

Also in figure 5.2, the method used for the risk measure estimation is the parametric normal method, which we expect to give a correct estimation of the ES. We see that with a low amount of data the p-values are quite close to the expected ones, but with a large amount of data the p-values will be lower than expected. The number of rejected hypotheses is on the order of 2-3 times the expected number. This is not that surprising, as the aggregation of the p-values by taking the minimum gives a large rejection region.

Parametric Student’s t method

In figure 5.3 we have method 2 on the GBM data with the parametric Student's t method, which we expect to be wrong. Here we see a large amount of rejections even with a low amount of data. With α = 0.05 and N = 5000 the p-values are even displayed as zero.

Also in figure 5.3, the method used for the risk measure estimation is the parametric Student's t method, which we expect to overestimate the ES. We see that for α = 0.05 the hypothesis is rejected with fewer data points, but for α = 0.025 more data points are required for a high rejection rate. Both settings get a very high rejection rate with a lot of data. The reason that some of the lines are straight in these plots is that a common outcome is zero VaR-exceedances, because the Student's t model gives a high VaR. With a correct model these outcomes would be very uncommon.


Figure 5.1: All plots show the p-values of the hypothesis on the y-axis and quantile of the p-value on the x-axis. The data in all plots come from a geometric Brownian motion and the method used for estimating the risk measures is the historical simulation. The method, α and estimation window for the risk measures are given by the title of the plot.


Figure 5.2: All plots show the p-values of the hypothesis on the y-axis and quantile of the p-value on the x-axis. The data in all plots come from a geometric Brownian motion and the method used for estimating the risk measures is the parametric normal method. The method, α and estimation window for the risk measures are given by the title of the plot.


Figure 5.3: All plots show the p-values of the hypothesis on the y-axis and quantile of the p-value on the x-axis. The data in all plots come from a geometric Brownian motion and the method used for estimating the risk measures is the parametric Student’s t method. The method, α and estima- tion window for the risk measures are given by the title of the plot.


5.2.2 Method 3 and 4

Historical simulation

In figure 5.4 we have data from the GBM with risk measures estimated using historical simulation. We see that for a smaller data set the p-values are close to the anticipated level, but with larger data sets the p-values become much lower than expected.

Parametric normal method

Figure 5.5 shows the p-values with the GBM data modeled with the parametric normal method. For method 3 the p-values are close to the expected level, especially for a large data set. For method 4 the p-values are as expected for smaller data sets and α = 0.025; otherwise they are much lower than expected.

Parametric Student’s t method

Figure 5.6 shows methods 3 and 4 on the GBM data where the risk measures are estimated with the parametric Student's t distribution. As these methods test the magnitude one-sided and the ES in this case should be overestimated, we expect the p-values to be high. This is also the outcome in all cases.


Figure 5.4: All plots show the p-values of the hypothesis on the y-axis and quantile of the p-value on the x-axis. The data in all plots come from a geometric Brownian motion and the method used for estimating the risk measures is the historical simulation. The method, α and estimation window for the risk measures are given by the title of the plot.


Figure 5.5: All plots show the p-values of the hypothesis on the y-axis and quantile of the p-value on the x-axis. The data in all plots come from a geometric Brownian motion and the method used for estimating the risk measures is the parametric normal method. The method, α and estimation window for the risk measures are given by the title of the plot.


Figure 5.6: All plots show the p-values of the hypothesis on the y-axis and quantile of the p-value on the x-axis. The data in all plots come from a geometric Brownian motion and the method used for estimating the risk measures is the parametric Student’s t method. The method, α and estima- tion window for the risk measures are given by the title of the plot.


5.3 Student’s t

The results for methods 1 and 2 with the data from the Student's t distribution are very similar to those for the GBM data when comparing non-parametric methods and parametric methods with the right and the wrong distributional assumptions. It is interesting to see that assuming too light and too heavy tails has the same effect on the p-values. This is not that surprising, as the tests build on a goodness-of-fit test that only tests whether the model assumptions are correct.

5.4 Heston’s stochastic volatility model

For the data from Heston's stochastic volatility model, only the historical simulation and the parametric normal model are used. The Student's t method would be hard to compare against, because the differences between the models are hard to describe. The historical simulation is non-parametric and should therefore be correct. The Heston model is a modified GBM with stochastic volatility, so the parametric normal method should be incorrect if the volatility process fluctuates a lot. If the volatility process is close to constant, a GBM will be a very good approximation. For the parametric normal method the p-values are low, except for method 3 with a low amount of data.

For the historical simulation method 1 has consistently low p-values. For method 2 the outcome is similar to that of the GBM data except for a low amount of data where the p-values are low. For method 3 the p-values are close to the reference with a low amount of data but become low with more data. Method 4 has low p-values in all cases.

5.5 Summary

Many methods have problems when the risk measures are estimated with historical simulation. For the methods based on an approximation, the problem is caused by the rounding in the models. For the methods relying on the difference in magnitude, the uncertainty of the ES estimate with so few data points is the probable cause of the problems.

The methods relying on the difference in the magnitude of the ES will often require a large amount of data in order to get low p-values even if low p-values are expected.


Furthermore, method 4 seems to have problems even with the correct parametric method, making it unreliable in most cases.

References
