
Optimal trading with transaction costs using a PMP gradient method

DANIEL GULLBERG

Master's Thesis in Optimization and Systems Theory (30 ECTS credits)
Master Programme in Applied and Computational Mathematics (120 credits)
Royal Institute of Technology, 2016
Supervisor at Ampfield AB: Torbjörn Hovmark
Supervisor at KTH: Johan Karlsson
Examiner: Johan Karlsson

TRITA-MAT-E 2016:34 ISRN-KTH/MAT/E--16/34--SE

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden


Abstract

This thesis considers a portfolio optimisation problem with linear transaction costs, as interpreted by Ampfield Aktiebolag, and analyses it using a gradient method based on Pontryagin's maximum principle (PMP). First the problem is outlined, and it then turns out that a gradient PMP method is easy to employ and gives reasonable solutions. As with many gradient methods the convergence is very slow, but a good estimate could possibly be found in sub-second time with the right implementation and computer.

The strength of the method is its good complexity: linear in the number of time steps and quadratic in the number of dimensions for each iteration. This compares favourably with quadratic and dynamic programming, which have polynomial and exponential complexity respectively.

The main weakness, apart from slow convergence, lies in the assumptions that have to be made. All functions, such as the volatility and transaction costs, are assumed to depend only on time, not on the transactions made. Using the method in this thesis on a more realistic problem would be difficult, which is why the PMP gradient method is best suited for a preliminary analysis of the problem.


Sammanfattning

This degree project analyses a portfolio optimisation problem with linear transaction costs, as interpreted by Ampfield Aktiebolag, using a gradient method based on Pontryagin's maximum principle, or PMP. First the problem is presented, and it then turns out that a gradient method is simple to apply and gives reasonable solutions. As with many gradient methods the convergence is very slow, but a reasonable approximation can possibly be found in under a second with the right implementation and computer.

The strength of the method is its good complexity: linear in the number of time steps and quadratic in the number of dimensions per iteration. This is compared with quadratic and dynamic programming, which have polynomial and exponential complexity respectively.

The biggest weakness, apart from slow convergence, lies in the assumptions that must be made. All functions, such as the volatility and the transaction costs, are assumed to depend only on time, not on the trades made. Using the method in this thesis on a more realistic problem would be difficult, which is why the gradient method is best suited for a preliminary analysis of the problem.


Acknowledgements

Thanks to Torbjörn Hovmark for guiding the project from start to finish, and to all of Ampfield for an intellectually stimulating environment. Thanks also to Johan Karlsson for valuable comments and feedback throughout the process.


Notation

Symbol          Meaning
F(t)            a traded risky asset
α(t)            the drift of F
σ(t)            the volatility of F
Q(t)            the covariance matrix, Q = σσ^T
σ_i             the volatility of F_i, σ_i = √(Q_ii)
W(t)            a Wiener process affecting F
ρ               the correlation between the assets
x               the portfolio
V(t)            the value process for the portfolio
V_0             the value today
c(t)            transaction costs
|·|             component-wise absolute value
U(V)            utility function
U_t             utility as a stochastic process
γ               risk aversion constant
u(t)            the control variable, u = ẋ
λ(t)            the adjoint variable in PMP
f_0(x, u, t)    the momentary cost/value
f(x, u, t)      ẋ
φ(x(T))         the terminal cost
S_f             the manifold x(T) must belong to
H(x, u, t, λ)   the Hamiltonian, H = f_0 + λ^T f
N               the number of time steps
M               the dimension of x
T               time horizon
Δt              time between grid points
s               the step size in the gradient method


Table of contents

1 Introduction
2 Background
  2.1 Motivation of the problem
  2.2 Discussion of the problem and assumptions
3 Optimal control
  3.1 Pontryagin's maximum principle
  3.2 Numerical PMP methods
    3.2.1 Shooting methods
    3.2.2 Gradient methods
  3.3 Gradient methods in NLP
  3.4 Optimal control in finance
  3.5 Iterative methods
4 Analysing the problem with PMP
  4.1 Motivation of method selection
  4.2 Applying a gradient method
  4.3 Parameter values
5 Gradient method specifications
  5.1 Direction
  5.2 Step size
    5.2.1 Study of λ
    5.2.2 Optimal and bad step sizes
  5.3 Starting guess
  5.4 Stopping criterion
6 Computational performance
  6.1 Computational complexity
  6.2 Memory complexity
  6.3 Time consumption
7 Simulation results
  7.1 Using a fixed step size
    7.1.1 Portfolio
    7.1.2 Control
    7.1.3 Adjoint variable
    7.1.4 Value function and gradient norm
    7.1.5 Analysis of the results
  7.2 Using an optimal step size
  7.3 Choosing a variable or fixed step size
8 Comparison with a convex solver
9 Conclusion
  9.1 Further work
References


1. Introduction

Modern portfolio theory, also known as mean-variance analysis, was introduced by Markowitz in 1952 in his paper 'Portfolio Selection' [14]. He discusses grounds for selecting a portfolio, and argues for a diversified portfolio by also taking the variance of the assets into account instead of only considering their mean.

The portfolio selection problem is interpreted as maximising expected return given a tolerable risk level or minimising the variance given a desired level of return. This thesis will examine a portfolio optimisation problem within this framework, where the expected utility is maximised.

Given that predictions are made of the behaviour of the assets, which Markowitz states is the first step of portfolio selection, it turns out that the case without transaction costs is much easier to handle than the case with them. With no transaction costs the portfolio can at all times be rebalanced to an optimal level. For the problem and method in this thesis that would boil down to a static quadratic optimisation problem that can be solved using standard methods. If transaction costs are in place the problem becomes much more difficult, as the time aspect now has to be considered.

When thinking about trading risky assets most people think of trading on the stock market. The concept is simple: a portfolio of stocks is bought for a certain price, and as this price changes over time the value of the portfolio fluctuates. In this thesis it will instead be assumed that futures contracts are traded, which differ from stocks. Two parties enter a contract to buy and sell an asset at a specified time, for a price agreed on when the contract is entered. Since no goods immediately change hands, entering such a contract carries no cost. As time passes, changes in the price of the contract are settled, usually on a daily basis. This means that although no initial investment is needed, a margin¹ is required in case the prices move against the held position. The daily settlement also has the effect of reducing default risks. At the terminal time of the contract a delivery of the asset is done either physically or by cash settlement.

1 The term for collateral when trading futures.


There are some details to note. Firstly, the margin must be stored somewhere. In theory this could also be invested in a market, but in this thesis the problem of capital placement will be ignored; for simplicity it is assumed to be safely placed in the bank, and all focus will be on the trading of futures contracts. Secondly, although there is no exchange of money between the parties entering a contract, there are transaction costs involved when trading in practice. These come from the bid-ask spread and fees and commission to the exchange and broker. The costs are roughly of the size 10 $, which can be compared to the value of the underlying product, 30000-100000 $, or the tick size of around 10 $. This cost is large enough to warrant an inclusion in formulating the problem later. Note also that the forthcoming problem formulation only depends on the dynamics of the asset, so it can be applied to other assets that behave equivalently. For more information on different types of assets, see for example [5].

The thesis was finished in collaboration with Ampfield Aktiebolag, a company doing automated trading based on mathematical modelling. The goal is to investigate a gradient method based on PMP and see how well it can solve a portfolio optimisation problem with transaction costs. This will be done in several steps.

First, the optimisation problem is formulated, originating from the dynamics of the assets, which has the structure of an optimal control problem. Secondly, the problem is analysed with Pontryagin's maximum principle and a gradient method is used for further analysis. Finally, simulations are made in order to evaluate the method.


2. Background

This Chapter is split into two parts. In the first part, the optimisation problem is motivated by the maximisation of the expected utility of the portfolio, starting with the asset dynamics and using simplifying assumptions. The second part contains a short discussion of the problem at hand and observations that can immediately be made, as well as of the assumptions.

2.1 Motivation of the problem

The problem considered in this thesis is

$$(P)\qquad \max_x \int_0^T \left( \frac{\alpha^T x}{V_0} - \frac{c^T|\dot x|}{V_0} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V_0^2} \right) dt. \tag{2.1}$$

This Section will motivate the problem, as interpreted by Ampfield Aktiebolag and described in [9]. Not all steps are mathematically rigorous; the motivation for this is simply that it makes the problem easier and that it seems to work in practice.

Given that the drift and covariance of the different assets are known or estimated, and given a starting capital of V, the question is how to invest in the market. Let $F(t)$ be a vector representing the prices of M different risky assets, i.e. $F(t) : [0, T] \to \mathbb{R}^M$. The assets are assumed to have the dynamics¹

$$dF(t) = \alpha(t)\,dt + \sigma(t)\,dW(t). \tag{2.2}$$

1 More commonly the geometric Brownian motion dynamics $dF = D(F)\alpha\,dt + D(F)\sigma\,dW$ are used, where $D$ is a diagonal matrix; see for example [5].

The drift $\alpha(t) : [0, T] \to \mathbb{R}^M$ and the volatility $\sigma(t) : [0, T] \to \mathbb{R}^{M \times M}$ are both assumed to be given deterministic functions. The Wiener processes $W(t) \in \mathbb{R}^M$ drive the randomness and have mean zero and variance one. They are by convention also assumed to be independent. The covariance matrix is given as $Q = \sigma\sigma^T$. A portfolio is formed, containing the number of contracts bought or sold, $x(t) \in \mathbb{R}^M$.² The portfolio x is piecewise continuous, jumps occurring when contracts are bought or sold. With this portfolio the value process becomes

$$V(t) = x^T(t)F(t). \tag{2.3}$$

To get the dynamics of the value process the transaction costs must be considered, as well as the dynamics of F. Assuming transaction costs that grow linearly with the absolute value of the number of contracts traded, the value process dynamics are

$$dV(t) = \alpha^T(t)x(t)\,dt + x^T(t)\sigma(t)\,dW(t) - c^T(t)|\dot x(t)|\,dt, \tag{2.4}$$

where $|\cdot|$ denotes the component-wise absolute value.
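To make the discretised meaning of (2.4) concrete, the following is a minimal Euler-Maruyama simulation sketch. It is an illustration only: the function name, the array layout (time along the first axis) and the use of NumPy are assumptions, not taken from the thesis.

```python
import numpy as np

def simulate_value(alpha, sigma, c, x, dt, V0, rng=None):
    """Euler-Maruyama simulation of the value dynamics (2.4).

    alpha: (N, M) drift, sigma: (N, M, M) volatility, c: (N, M) costs,
    x: (N, M) portfolio path; V0 is the initial capital.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, M = x.shape
    V = np.empty(N)
    V[0] = V0
    for k in range(N - 1):
        dW = rng.normal(0.0, np.sqrt(dt), size=M)   # Wiener increments
        xdot = (x[k + 1] - x[k]) / dt               # trading rate, u = dx/dt
        dV = (alpha[k] @ x[k]) * dt \
             + x[k] @ (sigma[k] @ dW) \
             - (c[k] @ np.abs(xdot)) * dt           # linear transaction costs
        V[k + 1] = V[k] + dV
    return V
```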

Simply starting to maximise the expected value of V at a terminal time T would be naïve. In order to properly rank the possible outcomes of a bet, the concept of utility is used, as introduced by von Neumann and Morgenstern [18]. Merton wrote about different utility functions [15], and to consider risk-adjusted relative return, power utility functions are used³

$$U(V) = \frac{V^\gamma}{\gamma}, \quad \gamma < 0 \tag{2.5}$$

$$U(V) = \ln(V), \quad \gamma = 0. \tag{2.6}$$

The dimensionless constant γ is selected by the investor to reflect the willingness to take risks. Small γ ≪ −1 corresponds to risk-averse behaviour, where decisions are based more on the fear of losing money than on a desire to make money. As γ increases, especially if it approaches 0, the behaviour becomes risk neutral and more bets will be accepted. To get some intuitive insight into the utility function, consider the following bet with two outcomes:

• A: Double the capital with probability p

• B: Get 90% of the capital back with probability 1 − p

2 In practice contracts are traded discretely, i.e. $x(t) \in \mathbb{Z}^M$.

3 Sometimes denoted as $\frac{V^\gamma - 1}{\gamma}$ to more easily see the convergence to log-utility when γ approaches zero, but this does not change the internal ranking of the outcomes.


The bet looks very attractive, but with γ = −40, which is what will be used for the simulations, p > 98.5% is needed in order to have a better expected value of the utility function by taking the bet rather than ignoring it.⁴
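To spell out where the 98.5% figure comes from (consistent with footnote 4 below), taking the bet has higher expected utility than ignoring it precisely when

$$p\,\frac{(2V)^\gamma}{\gamma} + (1-p)\,\frac{(0.9V)^\gamma}{\gamma} > \frac{V^\gamma}{\gamma} \quad\Longleftrightarrow\quad p > \frac{1 - 0.9^\gamma}{2^\gamma - 0.9^\gamma},$$

where the inequality direction flips because γ < 0 makes both γ and $2^\gamma - 0.9^\gamma$ negative. With γ = −40 the right-hand side is approximately 0.985.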

Moving on, the goal is thus to maximise the expected utility, where the utility is given by

$$U(V_T) = \frac{(V_T)^\gamma}{\gamma}, \tag{2.7}$$

where T is a chosen terminal time. With some abuse of notation, let

$$U_t = U(V_t) = \frac{(V_t)^\gamma}{\gamma} \tag{2.8}$$

denote the utility as a stochastic process, which will converge to the function $U_T$ at the terminal time. To find the dynamics of the utility, Itô's lemma [5]

$$df = \left( \mu_t \frac{\partial f}{\partial x} + \frac{\sigma_t^2}{2}\frac{\partial^2 f}{\partial x^2} \right) dt + \sigma_t \frac{\partial f}{\partial x}\,dW_t, \tag{2.9}$$

for $f(x)$, a twice-differentiable scalar function of the stochastic process $X_t$ with dynamics $dX_t = \mu_t\,dt + \sigma_t\,dW_t$, will be used. Using equations (2.4) and (2.8) together with the lemma, the utility dynamics become⁵

$$dU_t = \frac{\gamma U_t}{V}\left( \alpha^T x\,dt + x^T\sigma\,dW - c^T|\dot x|\,dt \right) + \frac{1}{2}\gamma(\gamma-1)\frac{U_t}{V^2}\,x^T Q x\,dt. \tag{2.10}$$

The utility at the final time is then

$$U_T = U_0 + \int_0^T \left( \frac{\gamma U_t}{V}\left( \alpha^T x\,dt + x^T\sigma\,dW - c^T|\dot x|\,dt \right) + \frac{1}{2}\gamma(\gamma-1)\frac{U_t}{V^2}\,x^T Q x\,dt \right) \tag{2.11}$$

and the expected value is

$$E(U_T) = U_0 + E\left( \int_0^T \gamma U_t \left( \frac{\alpha^T x}{V} - \frac{c^T|\dot x|}{V} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V^2} \right) dt \right). \tag{2.12}$$

4 In general, $p > \dfrac{1 - 0.9^\gamma}{2^\gamma - 0.9^\gamma}$.

5 Technically, the use of the lemma is not correct. Since $\dot x$ is unbounded, higher order terms cannot be ignored.


The expected value of a Wiener integral is zero [5], i.e.

$$E\left( \int f(s)\,dW(s) \right) = 0$$

for Wiener processes, so the $dW$-term has vanished.

The order of integration between the expected value⁶ and the integral can be swapped as long as the function is integrable over the double integral [5]. By assuming independence between $U_t$ and the rest of the integrand⁷, the expected value can be split into two parts, giving

$$E(U_T) = U_0 + \int_0^T \gamma\,E(U_t)\,E\left( \frac{\alpha^T x}{V} - \frac{c^T|\dot x|}{V} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V^2} \right) dt. \tag{2.13}$$

Taking the differential yields

$$dE(U_t) = \gamma\,E(U_t)\,E\left( \frac{\alpha^T x}{V} - \frac{c^T|\dot x|}{V} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V^2} \right) dt, \tag{2.14}$$

this time without any $dW$-term. The equation is solved by multiplying with an integrating factor, and the solution is

$$E(U_T) = U_0 \exp\left( \gamma\,E\left( \int_0^T \left( \frac{\alpha^T x}{V} - \frac{c^T|\dot x|}{V} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V^2} \right) dt \right) \right). \tag{2.15}$$
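To make the integrating-factor step explicit: writing $g(t)$ for the expected-value factor in (2.14), the equation is the linear ODE $dE(U_t) = \gamma g(t)\,E(U_t)\,dt$, whose solution is

$$E(U_t) = U_0 \exp\left( \gamma \int_0^t g(\tau)\,d\tau \right);$$

evaluating at $t = T$ and swapping the expectation and the time integral once more gives (2.15).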

Maximising $E(U_T)$ is thus reduced to

$$\max_x E(U_T) = \max_x U_0 \exp\left( \gamma\,E\left( \int_0^T \left( \frac{\alpha^T x}{V} - \frac{c^T|\dot x|}{V} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V^2} \right) dt \right) \right). \tag{2.16}$$

Only the expected value of the integral in the exponent can be maximised. Note that both $U_0$ and γ are negative, so it is still a maximisation problem when taking them out, arriving at

6 The expected value is also an integral.

7 Since α and Q are deterministic and V is proportional to x this is heuristically reasonable.


$$\max_x E(U_T) = U_0 \exp\left( \gamma \max_x E\left( \int_0^T \left( \frac{\alpha^T x}{V} - \frac{c^T|\dot x|}{V} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V^2} \right) dt \right) \right). \tag{2.17}$$

The only thing relevant is what can be optimised, so consider

$$\max_x E\left( \int_0^T \left( \frac{\alpha^T x}{V} - \frac{c^T|\dot x|}{V} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V^2} \right) dt \right). \tag{2.18}$$

The problem is difficult due to the stochastic value process in the denominators. The idea to advance from here is that the main interest is how to trade presently. A solution strategy similar to model predictive control (MPC) is employed. First, find x, or ẋ, by assuming that the current value of V holds for the entire time axis. Secondly, when some time has passed, make new observations and solve the problem again. Finally, this is repeated until the terminal time T is reached.

With $V = V_0$, the capital today, the problem is deterministic and can finally be formulated as in (2.1)

$$(P)\qquad \max_x \int_0^T \left( \frac{\alpha^T x}{V_0} - \frac{c^T|\dot x|}{V_0} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V_0^2} \right) dt.$$

2.2 Discussion of the problem and assumptions

Looking at (2.1) there are a few immediate observations that can be made. First, the objective function is concave, meaning that the problem is convex. Secondly, Q is positive definite, unless some assets are perfectly correlated, in which case Q is only positive semi-definite. The latter case might arise by trading different contracts based on the same product, but for the remainder of the thesis Q is assumed to be positive definite. These two points mean that there will be a unique optimal solution for a bounded x.

Next, notice that the choice of $V_0$ will not affect the optimal value of the objective function. Scaling $V_0$ with a factor can be countered by scaling x equally, leaving the same objective value. This is as expected since the utility is based on relative return, the interpretation simply being that with a larger capital larger volumes will be traded, but ratios are unchanged.
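This can be checked directly: for a factor k > 0, replacing $V_0$ by $kV_0$ and $x$ by $kx$ leaves every term of the integrand in (2.1) unchanged,

$$\frac{\alpha^T(kx)}{kV_0} = \frac{\alpha^T x}{V_0}, \qquad \frac{c^T|k\dot x|}{kV_0} = \frac{c^T|\dot x|}{V_0}, \qquad \frac{(kx)^T Q (kx)}{(kV_0)^2} = \frac{x^T Q x}{V_0^2}.$$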

Finally, trading influences the market, i.e. market impact, by for example changing prices. All such effects will be disregarded in this thesis, as the method used does not easily incorporate them. Therefore, functions such as c and Q will only depend on time and not on the trades made. Apart from the assumptions mentioned here and in the problem motivation, another potential factor that was ignored is limits on trading.


3. Optimal control

By setting $\dot x = u$ in (2.1) the structure becomes that of an optimal control problem. Optimal control is about determining the control and state trajectories of a dynamic system that optimise some performance index. The general formulation is

$$\begin{aligned} \max_u \quad & \phi(x(T)) + \int_0^T f_0(t, x(t), u(t))\,dt \\ \text{s.t.} \quad & \dot x(t) = f(t, x(t), u(t)), \quad x(0) = x_0, \quad x(T) \in S_f, \end{aligned} \tag{3.1}$$

where φ is the terminal cost and $S_f$ is a manifold to which x must belong at the final time. There are two main methods in optimal control: the dynamic programming approach via the Hamilton-Jacobi-Bellman equation, and Pontryagin's maximum (minimum) principle. This thesis focuses on using the latter in order to analyse the problem, and the method will be described more closely in this Chapter, together with brief descriptions of related fields.

3.1 Pontryagin's maximum principle

The Pontryagin maximum principle was introduced in the fifties and is derived for example in [10]. To summarise, define an adjoint variable $\lambda(t) : [0, T] \to \mathbb{R}^M$ and the Hamiltonian

$$H(t, x(t), u(t), \lambda(t)) = f_0(t, x(t), u(t)) + \lambda^T(t)f(t, x(t), u(t)). \tag{3.2}$$

The following theorem [10] gives conditions on x, u and λ necessary for an optimal solution.

Theorem 1. If $u(t)$ is an optimal control for (3.1) and $x(t)$ the corresponding trajectory, then there exists an adjoint variable $\lambda(t)$ satisfying

(i) $u(t) = \operatorname*{argmax}_u H(t, x(t), u, \lambda(t))$ $\left( \Rightarrow \frac{\partial H}{\partial u}(t, x(t), u(t), \lambda(t)) = 0 \right)$

(ii) $\dot\lambda(t) = -\frac{\partial H}{\partial x}(t, x(t), u(t), \lambda(t))$ and $\lambda(T) - \frac{\partial \phi}{\partial x}(x(T)) \perp S_f$

(iii) $\dot x(t) = f(t, x(t), u(t))$ with $x(0) = x_0$ and $x(T) \in S_f$.

An important remark is that several stationary points can fulfil the conditions; in that case they need to be compared to find the true optimum. The first condition states that the Hamiltonian is maximised over u. The second and third conditions make up a system of differential equations called the two-point boundary-value problem (TPBVP), named for the fact that there are boundary values for both the initial and terminal time. In practice, the optimal u as a function of x and λ is found and inserted into the TPBVP. Solving the latter then gives the optimal control by plugging the solution back into the expression for u. However, this can only be done analytically in a few cases; most require numerical methods.

3.2 Numerical PMP methods

In the survey [20] Rao writes about different numerical optimal control methods and how their use has varied over the past five or so decades¹. He divides them into two categories, direct and indirect methods. In indirect methods the calculus of variations is used to determine first-order optimality conditions in order to transform the problem into a multiple-point boundary-value problem which gives candidate optimal solutions. A direct method works by immediately discretizing the problem, thus transforming it into a non-linear programming problem (NLP). Rao states that although these methods are based on two different philosophies they have much in common and are merging more and more as time goes by.

With the advances of computers the most popular methods currently are the direct methods and dynamic programming [8]. PMP-based methods fall under the category of indirect methods, and Rao states that the three most common such methods are shooting, multiple shooting and collocation. These are liked because of their conceptual simplicity [10]. For more information Rao references for example [6, 16, 19] among many others.

1 Excluding some work done in the aerospace community.

Typically in numerical PMP methods either the estimate of the control or of λ is iteratively updated, starting from an initial guess. When one of them is given, determining the remaining ones of x, u and λ normally becomes tractable. Two examples of such methods will be briefly described in the following Sections: the shooting and the gradient methods. These and more can be found in e.g. [10]. Note that the latter is not mentioned by Rao, not currently being popular.

3.2.1 Shooting methods

In a shooting method the TPBVP is transformed into an initial-value problem by guessing an initial value for λ and substituting the condition for the terminal time with this new condition. With λ(0) and x(0) both known, simple time-marching techniques allow for integration of λ and x on the entire time axis. Finally, a comparison is made between the resulting x(T) and λ(T) and the problem conditions, and the guess for λ(0) is updated depending on the deviation in the terminal values. The main advantage is the conceptual simplicity, while a disadvantage is that integrating λ forward in time can be unstable.

3.2.2 Gradient methods

An example of a method where u is iteratively improved is the gradient method. With u(t) known for t ∈ [0, T], x(t) can be determined, and with x(t) also known, λ(t) can be computed. The previous u(t) is updated by observing the gradient of the Hamiltonian with respect to u, which is the correct gradient of the value function whilst λ satisfies the adjoint equation in Theorem 1 [12]. An advantage of this method is that λ can be integrated backwards, which is usually more stable, while a disadvantage is that convergence typically is slow.

3.3 Gradient methods in NLP

In the literature, descriptions of gradient methods primarily concern their use in non-linear programming, owing to the popularity of direct methods currently seen [8]. Although the gradient method in NLP differs from the one in PMP, there are still similar ideas that could potentially be employed. One is varying the step size between iterations, for example using a line search method [13, 17, 19].


3.4 Optimal control in finance

Optimal control methods have previously been used on financial problems. Law describes in his dissertation [11] various problems in financial mathematics and analyses a few using PMP. One point made is that adapting PMP to stochastic problems is usually very difficult, especially in the cases where the volatility depends on the control. As mentioned in Section 2.2 all such effects were disregarded in this thesis, but the above statement gives an idea that using PMP on certain more realistic problems might be unsuitable. The optimal control method that is mainly focused on is however the Hamilton-Jacobi-Bellman equation. The same situation can be found in e.g. [2, 3], which Law references. Björk also mentions the Hamilton-Jacobi-Bellman equation in [5].

3.5 Iterative methods

Many of the methods described previously are iterative methods, where the optimal solution is successively found starting from an initial guess. What differs between the methods is how to get from one iteration to the next. The gradient PMP method has many conceptual similarities with iterative methods for solving linear systems of equations, e.g. the Jacobi and Gauss-Seidel methods [4]. The main difference between the two aforementioned methods is the order in which the elements in the solution are updated: either they are all updated simultaneously, or one after another. One iteration is finished when all elements have been updated. The idea of not updating all of them at once is to get better estimates of elements later in the iteration by using the already updated elements rather than their old versions.


4. Analysing the problem with PMP

The problem under consideration is to optimise (2.1) using the control $u(t) = \dot x(t)$:

$$\begin{aligned} \max_u \quad & \int_0^T \left( \frac{\alpha^T(t)x(t)}{V_0} - \frac{c^T(t)|u(t)|}{V_0} - \frac{1}{2}(1-\gamma)\frac{x^T(t)Q(t)x(t)}{V_0^2} \right) dt \\ \text{s.t.} \quad & \dot x(t) = u(t), \quad x(0) = x_0. \end{aligned} \tag{4.1}$$

Recall from Section 2.1 that x is piecewise continuous, so $\dot x$, and thus u, will be a distribution. Applying the PMP machinery outlined in Section 3.1 to the above problem results in

$$f(t, x, u) = u(t), \quad x(0) = x_0, \quad x(T) \text{ free} \tag{4.2}$$

$$f_0(t, x, u) = \frac{\alpha^T(t)x(t)}{V_0} - \frac{c^T(t)|u(t)|}{V_0} - \frac{1}{2}(1-\gamma)\frac{x^T(t)Q(t)x(t)}{V_0^2} \tag{4.3}$$

$$\phi(x(T)) = 0, \tag{4.4}$$

by identification of the different entities. This gives the Hamiltonian

$$H(t, x(t), u(t), \lambda(t)) = f_0(t, x(t), u(t)) + \lambda^T(t)f(t, x(t), u(t)) = \frac{\alpha^T(t)x(t)}{V_0} - \frac{c^T(t)|u(t)|}{V_0} - \frac{1}{2}(1-\gamma)\frac{x^T(t)Q(t)x(t)}{V_0^2} + \lambda^T(t)u(t). \tag{4.5}$$

From this the optimal control, the time derivative of λ and the terminal value λ(T) can be identified, giving

$$u(t, x(t), \lambda(t)) = \operatorname*{argmax}_u H(t, x(t), u(t), \lambda(t)) = \operatorname*{argmax}_u \left( -\frac{c^T(t)|u(t)|}{V_0} + \lambda^T(t)u(t) \right) \tag{4.6}$$

$$\dot\lambda(t) = -\frac{\partial H}{\partial x} = -\frac{\alpha(t)}{V_0} + (1-\gamma)\frac{Q(t)x(t)}{V_0^2}, \qquad \lambda(T) = 0. \tag{4.7}$$

Optimising the Hamiltonian over u to get the optimal control will give

$$u_i = \begin{cases} \infty & \lambda_i > c_i/V_0 \\ \geq 0 & \lambda_i = c_i/V_0 \\ 0 & -c_i/V_0 < \lambda_i < c_i/V_0 \\ \leq 0 & \lambda_i = -c_i/V_0 \\ -\infty & \lambda_i < -c_i/V_0 \end{cases}$$

Having u = ±∞ on an interval gives x = ±∞. In Section 2.2 it was noted that an optimal x is bounded, meaning that the control cannot be infinite on an interval. Therefore, for an optimal control λ must stay within the region

$$\lambda_i \in \left[ -\frac{c_i}{V_0},\; \frac{c_i}{V_0} \right] \tag{4.8}$$

and finally

$$u_i = \begin{cases} \geq 0 & \lambda_i = c_i/V_0 \\ 0 & -c_i/V_0 < \lambda_i < c_i/V_0 \\ \leq 0 & \lambda_i = -c_i/V_0 \end{cases} \tag{4.9}$$

In the cases of equality, the magnitude of u is arbitrary and should be chosen so that λ stays within the allowed region.

To get some intuitive insights into this PMP problem, first observe the simpler case where c = 0. Now (4.1) is simplified and each time step can be considered independently, giving

$$\max_x \left( \frac{\alpha^T x}{V_0} - \frac{1}{2}(1-\gamma)\frac{x^T Q x}{V_0^2} \right), \tag{4.10}$$

which is optimised by simply taking the derivative with respect to x. The optimal level is

$$x(t) = \frac{Q^{-1}(t)\alpha(t)V_0}{1-\gamma}. \tag{4.11}$$
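In code, the cost-free optimum (4.11) is one linear solve per time step. A minimal sketch (the function name and argument conventions are assumptions, not from the thesis):

```python
import numpy as np

def costfree_optimum(Q_t, alpha_t, V0, gamma):
    # x(t) = Q^{-1}(t) alpha(t) V0 / (1 - gamma), eq. (4.11);
    # solving Q x = alpha avoids forming the inverse explicitly.
    return V0 * np.linalg.solve(Q_t, alpha_t) / (1.0 - gamma)
```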


When this holds it is also true that

$$\frac{\partial H}{\partial x}(t, x, u, \lambda) = 0,$$

which implies that

$$\dot\lambda(t) = -\frac{\partial H}{\partial x}(t, x, u, \lambda) = 0.$$

This states that λ is constant when x is at the optimal level in the no-transaction-cost case. This will typically not happen, since with transaction costs x will be rebalanced to lower absolute values; otherwise losses would be incurred when the excessive contracts must be closed again shortly. Therefore, λ will drift between the borders ±c/V0, as can be seen by observing (4.7). Since x is of lower magnitude than when λ̇ = 0, λ̇ will have the opposite sign to x. Hence, if contracts were previously bought, i.e. λ was large, the derivative is negative and λ will drift toward the lower border, and vice versa.

4.1 Motivation of method selection

For a first analysis of the problem a conceptually simple method is suitable. Two such methods are the gradient method and the shooting method, as mentioned in Section 3.2. In this Section the choice of using a gradient-based method will be argued for.

In shooting methods u should be defined in terms of λ to solve the TPBVP by transforming it into an initial value problem. In this case, though, u is not well defined: for λ = ±c/V0 only the sign of u is known, not the size. The next problem is that of stability. Integrating λ forward in time can be unstable, and here u does not specifically appear in the expression for λ̇; the only control over λ is indirect, by varying x.

The gradient method, on the other hand, is very simple to apply: the starting guess of u makes x and λ computable, and u can be updated according to

$$u^{(i+1)} = u^{(i)} + sH_u. \tag{4.12}$$

The step size s has to be chosen, as well as the gradient $H_u$. Normally

$$H_u = \frac{\partial H}{\partial u} = -\frac{c\,\mathrm{sgn}(u)}{V_0} + \lambda \tag{4.13}$$

would be used [10], but there are some problems with this approach. For small |λ| the control should be zero (4.9), but strictly looking at (4.13), the gradient will only be zero when λ is of the same magnitude as c/V0. This is rarely the case, and for small |λ| it is possible that u fluctuates around zero instead of coming to rest. The latter issue is also related to the next point.

Secondly, the absolute value at u = 0 is not differentiable since the left and right derivatives do not coincide, even though they both exist. When finding the ascent direction a decision has to be made as to which derivative to use. This is also a reason why passing zero is a problem.

A third objection is that the slow convergence of gradient methods in general [10] will most likely be especially true in this case, where u is an impulse and will need to tend to large values.

In conclusion, while the gradient method has its problems it is much easier to apply, and it was therefore selected for this thesis. A more thorough discussion on how to work around the problems follows in Section 5.1.

4.2 Applying a gradient method

Before applying any method a discretization of the time axis is done. It is split into N points, Δt = T/(N − 1) apart. The problem is then to find the optimal portfolio level x on each interval. Now the algorithm outlined in Section 3.2.2 can be elaborated for this case, giving the method procedure

1. Get u(t), first from a starting guess and subsequently from the previous iteration
2. Integrate x(t) forward in time using (4.14)
3. Integrate λ(t) backwards in time using (4.15)
4. Find the gradient; see (4.13) and Section 5.1 for a discussion on the subject
5. Update u(t) with (4.12) and begin a new iteration, until some criterion¹ is fulfilled

Applying the discretization on the equations in Section 4 gives

$$x(t + \Delta t) = x(t) + u(t)\Delta t, \qquad x(0) = x_0 \tag{4.14}$$

and

$$\lambda(t - \Delta t) = \lambda(t) + \left( \frac{\alpha(t)}{V_0} - (1-\gamma)\frac{Q(t)x(t)}{V_0^2} \right)\Delta t, \qquad \lambda(T) = 0. \tag{4.15}$$

1 See Sections 5.4 and 7.1.5 for examples of such a criterion. The simulations in this thesis were simply run for a predetermined number of iterations.

It should be noted that u has dimension [contract/time], the Hamiltonian [1/time] and its derivative with respect to u [1/contract]. Thus, the step size s in (4.12) has dimension [contract²/time]. How to interpret this is not clear, but it is not unreasonable to think that s is in some way related to the number of transactions that are made on a time interval.
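Putting (4.12), (4.14) and (4.15) together, one iteration of the procedure can be sketched as below. This is a minimal illustration, not the thesis implementation: it uses the naive gradient (4.13) without the safeguards of Section 5.1, and the array layout is an assumption.

```python
import numpy as np

def pmp_gradient_iteration(u, x0, alpha, Q, c, V0, gamma, dt, s):
    """One iteration: steps 2-5 of the procedure above.

    u, alpha, c: (N, M) arrays; Q: (N, M, M); x0: (M,) initial portfolio.
    """
    N, M = u.shape
    # Step 2: integrate x forward in time, eq. (4.14).
    x = np.empty((N, M))
    x[0] = x0
    for k in range(N - 1):
        x[k + 1] = x[k] + u[k] * dt
    # Step 3: integrate lambda backwards in time, eq. (4.15).
    lam = np.empty((N, M))
    lam[-1] = 0.0
    for k in range(N - 1, 0, -1):
        lam[k - 1] = lam[k] + (alpha[k] / V0
                               - (1 - gamma) * (Q[k] @ x[k]) / V0**2) * dt
    # Step 4: naive gradient of the Hamiltonian, eq. (4.13).
    Hu = -c * np.sign(u) / V0 + lam
    # Step 5: gradient ascent update, eq. (4.12).
    return u + s * Hu, x, lam
```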

4.3 Parameter values

For the simulations performed in the remainder of the thesis, parameter values of roughly the sizes in Table 4.1 were used. The data was of the same kind Ampfield uses in their modelling. Moreover, the plots of α, λ, x and u later only show one dimension, which was chosen because of its relatively large λ-values.

Parameter            Approximate value
α_ij                 ±0-0.3 $/(contract·minute)
Q_ij                 ±0-1000 $²/(contract²·minute)
σ_i = √(Q_ii)        10-100 $/(contract·minute)
ρ = Q_ij/(σ_iσ_j)    ±0-0.7
c                    5-10 $
V_0                  10000000 $
γ                    -40
M                    20
N                    2188
Δt                   1 minute
T                    2 days

Table 4.1: Parameter values used in simulations


5. Gradient method specifications

After the method is selected there are still decisions that have to be made. For a gradient method the most important choices are the direction and the step size, but other interesting aspects are the starting guess and the stopping criterion. In this Chapter all of these choices are discussed in turn.

5.1 Direction

In a gradient method the direction is normally the gradient of the Hamiltonian with respect to u [10], but as mentioned in Section 4.1 there are some problems that have to be addressed. By including restrictions on the gradient they can be circumvented.

The first problem mentioned is that u should be zero when |λ| is small, but by using the gradient in (4.13) u might oscillate around zero instead of coming to rest. In order to avoid this, u = 0 is not allowed to move when |λ| is small, and u is not allowed to go from positive to negative or vice versa between iterations; in that case it is instead set to zero. This will help u come to rest at zero when |λ| is small and make it stay there. To immediately set u to zero at all points where |λ| is small could be too volatile.

The second problem is what derivative to use for u = 0 when |λ| > c/V0, allowing a non-zero gradient. When u is non-zero it will have the same sign as λ, and in order for the gradient not to make a jump as u tends to zero from a non-zero value the sign of u in (4.13) should be interpreted as the sign of λ.

Putting everything together leads to the following definition of the gradient (a code sketch follows the list):

1. Use the gradient in (4.13), but with some exceptions
2. If u = 0 and |λ| < c/V0, there is no change
3. If u = 0 and |λ| ≥ c/V0, the sign of u is defined as the sign of λ
4. If u would switch sign between iterations, it is instead set to zero
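A sketch of how rules 1-4 could look in code, using the same assumed array conventions as in the sketch of Section 4.2; the helper name is illustrative, not from the thesis.

```python
import numpy as np

def update_control(u, lam, c, V0, s):
    """One gradient step with the exceptions 1-4 above applied."""
    bound = c / V0
    at_rest = (u == 0)
    # Rule 3: where u = 0 and |lam| >= c/V0, define sgn(u) as sgn(lam).
    sign_u = np.where(at_rest & (np.abs(lam) >= bound),
                      np.sign(lam), np.sign(u))
    # Rule 1: the gradient (4.13) with the adjusted sign.
    u_new = u + s * (-c * sign_u / V0 + lam)
    # Rule 2: where u = 0 and |lam| < c/V0, u does not move.
    u_new = np.where(at_rest & (np.abs(lam) < bound), 0.0, u_new)
    # Rule 4: where u would switch sign between iterations, set it to zero.
    u_new = np.where((u != 0) & (np.sign(u_new) == -np.sign(u)), 0.0, u_new)
    return u_new
```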

5.2 Step size

After the direction is chosen a step size s must be selected. A large step size typically means quicker convergence to the optimal solution, but if it is too large there might be numerical instabilities. This Section contains ideas on how to select it for this problem.

A first idea, very common in gradient methods for non-linear optimisation, is to vary the step size with each iteration. This was mentioned in Section 3.3 and is usually done by observing the objective function from where the gradient was found. In this case it would not be easy to implement. While it is guaranteed that the Hamiltonian gives the correct gradient for the value function while λ fulfils the adjoint conditions [12], s cannot be picked by looking at it. When running the algorithm there will be values of λ that lie outside the allowed region, so optimising the Hamiltonian would mean setting u to infinity, which will lead to instability. The obvious way to work around this would be to instead look at the value function when determining s.

Next, s could be varied depending on other parameters. One idea would be to base the step size on the current norm of the gradient. Finding working formulas to get from a norm size to a step size might not be so easy, as arguments can be made both for increasing and decreasing the step size when the norm gets smaller. Normally when the norm is small the solution is approached, so greater care is taken to actually converge, meaning the step size is reduced. On the other hand, u will contain impulses. As the norm gets smaller the changes are reduced, and if the step size is also reduced the changes will be extremely small. Getting to the high values in the impulses would then take a very long time. With a small norm maybe the step size can instead be increased without causing instabilities. In Section 5.2.2 this is found to be the case.

Another idea is that the model might be more sensitive in certain parts of the time axis. For example, errors in λ for large t will propagate backwards and affect all previous λ, since the integration is done backwards, while early errors in x and u will propagate forward in time. If a more sensitive part of the time axis could be identified, one that is more prone to causing model failures for larger s, perhaps a smaller s could be run on those parts. Varying s along the time axis for a given gradient is equivalent to having a constant s for a different gradient. What changes should be made for this new gradient is not obvious.

Finally, it might be easiest to use a fixed step size throughout the iterations. If it is sufficiently small there might not be any problems with instability, and time could be saved when there is no need to keep changing the step size.

In Section 7.3 the choice of using a variable or a fixed step size is discussed.

5.2.1 Study of λ

The next part of the Section will try to illustrate why the model collapses when s is too large by observing the adjoint variable. Figures 5.1 through 5.3 depict how λ changes with a good step size and a step size that is too large.

Figure 5.1: λ after 0 iterations (λ [1/contract] against time [minutes], with the borders ±c/V0 marked).

By comparing Figures 5.2 and 5.3 to Figure 5.1 it can be seen that in the former case λ has decreased in value and is more contained within the allowed region, while it has grown in the latter case. The integration of λ has one part that does not change between iterations, stemming from α, and one part that does, the part with x and Q, see (4.15). The latter part is supposed to give a dampening, i.e. 'buy less', but if s is too large the effect might instead become 'sell', and vice versa. Since λ had grown in the latter case, the overreaction will be even greater when continuing the algorithm and the oscillations will continue and amplify. Eventually λ will tend to infinity and cause a collapse.


Figure 5.2: λ after 10 iterations with a good step size (same axes as Figure 5.1).

Figure 5.3: λ after 10 iterations with a too large step size (same axes as Figure 5.1).


5.2.2 Optimal and bad step sizes

Even if great care is needed when selecting s, it is of course desirable to do it as well as possible within the allowed boundaries. Consequently, it can be interesting to note what the optimal step size is for various iterations, which is done in Figures 5.4 and 5.5. The method used to determine an optimal step size was to simply test different s with a fixed increment until the objective value no longer increased. Since the problem is convex this should give a correct approximation of the optimal step size.
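A minimal sketch of that search. Here `objective` is a hypothetical helper that evaluates the value function after a trial gradient step of size s; the name and the interface are assumptions for illustration.

```python
def optimal_step_size(objective, increment=50.0):
    """Increase s by a fixed increment until the objective stops improving."""
    s, best = 0.0, objective(0.0)
    while True:
        candidate = objective(s + increment)
        if candidate <= best:       # no further increase: stop
            return s
        s, best = s + increment, candidate
```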

Figure 5.4: Optimal step sizes with 50 increment (step size s [contract²/minute] against iteration).

As can be seen, the optimal step size fluctuates vigorously. While a slight increase can be noted in the mean as the number of iterations finished gets higher, the variance is high enough so as not to warrant any clear conclusions.


Figure 5.5: Optimal step sizes with 50 increment, on a logarithmic scale.

Next, it is also interesting to note how high s can be set without causing a deterioration of the value function. A decrease of the target function means that the step size is much too large; instead of finding the optimal increase, a decrease is made. The robustness will be much lower than this, because a slightly too high step size for many iterations will also cause a collapse. Figure 5.6 illustrates what step size is needed to get a decrease in the value function, granted that the previous iterations were run with a good step size.

It can be seen that the allowed step size before a value function reduction increases consistently, while the optimal step size fluctuates heavily and only shows a marginal increase in its mean. Because of this it can be concluded that increasing the step size with later iterations should be possible, but it is not easy to determine to which extent it is desirable.


Figure 5.6: Step sizes needed to get a decrease in the value function, with 500 increment (step size s [contract²/minute] against iteration).

5.3 Starting guess

A good starting guess could drastically reduce the number of iterations needed to find an optimal solution. For this problem perhaps a preliminary analysis of the data could be done in order to get an initial guess of u. First, there is information given by α, and secondly there is an easily obtained solution to the cost-free problem available, see Equation (4.11). The former idea is surely too crude, ignoring both c and Q, while the latter has some potential. The problem is of course that a lot more would be traded than in the optimal solution for the cost-including case, meaning that u will be non-zero at many more points than it should be and also have the wrong magnitude. A u which is optimal in the zero-cost case gives rise to a λ which is zero everywhere¹. Thus, changes to u will initially only come from the other part of the gradient, which might be coarse.

A quick test was done to determine if this idea has any merit at all. The portfolio which is optimal in the cost-free case had a negative objective value with transaction costs, compared to a zero starting guess which has value zero. No compensating later improvement was found either.

To conclude, the procedure is probably not worthwhile. The most likely outcome is that u ends up at the wrong level in many points, and adjusting this error takes a lot of time. Instead, a simple starting guess of u = 0 was used. In that case there are no erroneous impulses that need to be reduced or removed, and all that remains is finding the correct ones. It is possible that there are better starting guesses, but it is not obvious what they would be.

1 The derivative is zero everywhere, and so is λ(T).

5.4 Stopping criterion

The classic stopping criterion in a gradient method is when the norm of the gradient gets sufficiently small [10]. In that case not much is changed between iterations, and hopefully the optimal solution has been reached. In Section 7.1.5 this and other possible stopping criteria are discussed more closely.


6. Computational performance

This Chapter examines the computational difficulties. First, the complexity in both time consumed and memory used is examined for the algorithm, which is followed by measures of time consumption together with the number of Flops achieved. This is done in order to get an idea of how good the method is in comparison to other methods, and also to understand how the competitiveness will change as the problem size changes. It should be noted that the implementation is not optimal, but at least it will give an idea of the performance.

6.1 Computational complexity

When using the algorithm various matrix multiplications must be computed. Note that when multiplying an N × M matrix with an M × P matrix the complexity is O(N · M · P). By investigating the various parts of the algorithm the overall complexity can be found. The results for N time steps and M dimensions can be found in Table 6.1; recall the formulas in (4.14), (4.15), (4.12) and (4.3). Also note that there are two possible ways to use the value function, and both are included. The heaviest matrix multiplication needed, Qx, is already computed when integrating λ. Thus, computing the value function with the same x is easier.

If x has changed the computation must instead be done from the beginning. The former case could arise when keeping track of the objective value during iterations, perhaps to help as a stopping criterion, see Section 7.1.5. The latter case would happen if the value function is used to compare different step sizes. Each different s would give rise to a new portfolio.

Depending on the implementation the exact number of operations could differ. The number of floating-point operations the different parts need for the implementation used during simulations can be found in Table 6.2. Note that when updating u manual checks are needed, see Section 5.1. The difficulty of a check does not increase with a larger problem, only the number of times they have to be performed.

Part                     Complexity
x integration            M · N
λ integration            M² · N
Update u                 M · N
Value function (old x)   M · N
Value function (new x)   M² · N

Table 6.1: Computational complexities for parts of the algorithm

Also note that not all logic operations will be needed every time, so the results below are a worst-case scenario. For comparison, the case of naïvely applying the gradient in (4.13) to update u is also included.

Part                       Floating-point operations
x integration              M · N
λ integration              (2M² + 3M) · N
Update u                   34M · N
Update u, naïve gradient   7M · N
Value function (old x)     11M · N
Value function (new x)     (2M² + 9M) · N

Table 6.2: Floating-point operations for parts of the algorithm

In summary, the algorithm will always have complexity O(N · M²) per iteration because of the integration of λ. The heaviest computation is finding the value function with a new x, which is slightly worse than the integration of λ. No investigation was made into whether the number of iterations needed to find a good estimate increases as the problem size grows.

6.2 Memory complexity

When running the algorithm data must be stored in variables, in this implementation consisting of double-precision floating-point variables¹. In Table 6.3 the number of variables needed for different data is listed. Apart from these variables only sporadic memory is needed, to store for example the step size and V0. This will not grow as the problem grows and is thus ignored².

1 binary64 in the IEEE 754 standard.

2 So is x0 even though it grows linearly in the number of dimensions.


Data   Variables
x      M · N
λ      M · N
u      M · N
α      M · N
c      M · N
Q      M² · N

Table 6.3: Memory allocation needed

In the case of M = 20 and N = 2188, keeping in mind that one variable takes up 8 bytes, the total amount of memory needed is slightly above 8 MB.
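As a quick check of that figure, counting the variables in Table 6.3 (five arrays of size M · N and one of size M² · N):

$$(5MN + M^2N) \cdot 8\ \text{B} = (5 \cdot 20 + 20^2) \cdot 2188 \cdot 8\ \text{B} = 8\,752\,000\ \text{B} \approx 8.75\ \text{MB}.$$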

6.3 Time consumption

In order to determine whether or not the method is worthwhile to use in practice, the time consumption has been examined. All the following tests were done with M = 20 and N = 2188 on an Intel Core i3-3220 CPU at 3.30 GHz with 3 MB cache and 8 GB RAM. It should be noted that the memory needed, found in Section 6.2, exceeds the cache memory, meaning that the program most likely is cache bound, i.e. the lack of cache memory is the biggest setback for run time. This is likely a contributing factor to why the results below are far from the theoretical maximum for the processor.

Table 6.4 contains the performance in run time and Flops achieved for the algorithm as a whole and its individual components. In Section 6.1 the number of operations was calculated, which was used for computing the Flops. 'One iteration' in the Table means integrating x and λ and updating u, the three necessary parts of the algorithm. The fact that the numbers of Flops differ much means that there is a large margin for error, but the numbers can be seen as a rough estimate of what times can be expected.

If instead the naïve gradient is used without any alterations (4.13), the average time for updating u is 0.4 ms, corresponding to 0.7 GFlops. The average time for the algorithm is then 1.65 ms, performing 1.4 GFlops. Since this case does not rely on a worst-case scenario analysis the Flops are probably more accurate than the previous analysis, so even though this method was not used the numbers are interesting.


Part                     Time consumed [ms]   Flops [GFlops]
One iteration            1.727                1.98
Integrate x              0.0814               0.54
Integrate λ              1.139                1.65
Update u                 0.5068               2.94
Value function (new x)   1.061                1.90

Table 6.4: Average performance over 500000 iterations


7. Simulation results

This Chapter contains the results of the simulations. They were performed with both a fixed and a variable step size, and the results are presented in turn. Finally, in Section 7.3 the choice of which type of step size to use is discussed.

In order to determine how fast an approximation of the optimal solution can be expected to be found, the appearance of the solution after different numbers of iterations can be observed. Before the results are shown, Figure 7.1 contains a plot of α, included as a reference for how the solution can be expected to look.

Recall that the only way to contribute positively to the objective value is to have a positive portfolio when α is positive and vice versa. The parameter values used are listed in Section 4.3.

Figure 7.1: α for one dimension (α [$/(minute·contract)] against time [minutes]).


7.1 Using a fixed step size

This Section contains the results of using a fixed step size. It shows the results after different numbers of iterations for the portfolio, control, adjoint variable, norm of the gradient and value function. In Section 7.1.5 the results are discussed.

7.1.1 Portfolio

The first examination is of the portfolio, see Figures 7.2 through 7.5.

Figure 7.2: Portfolio after 1000 iterations with fixed s (x [contracts] against time [minutes]).

After 1000 iterations mainly the outline of the solution can be seen; it is not very sharp, and the portfolio is not on the same level as later, lying much below those values. All the later plots are very similar to each other, with levels being much the same and likewise for the shape.


Figure 7.3: Portfolio after 10000 iterations (same axes as Figure 7.2).

Figure 7.4: Portfolio after 30000 iterations (same axes as Figure 7.2).


Figure 7.5: Portfolio after 100000 iterations (same axes as Figure 7.2).

7.1.2 Control

Next, observe the control in Figures 7.6 through 7.9.

Figure 7.6: Control after 1000 iterations with fixed s (u [contracts/minute] against time [minutes]).

Figure 7.7: Control after 10000 iterations (same axes as Figure 7.6).

Figure 7.8: Control after 30000 iterations (same axes as Figure 7.6).

Figure 7.9: Control after 100000 iterations (same axes as Figure 7.6).

Again, the overall look is much the same for different numbers of iterations, although the solution for 1000 iterations is not yet well evolved. However, when it comes to heights they differ greatly in the different plots, with the control becoming more impulse-like as the algorithm is run longer.

7.1.3 Adjoint variable

Finally, the appearance of λ can be studied; see Figures 7.10 through 7.13. Recall that while λ lies outside the allowed region the solution is not optimal.

The situation is similar to the one for the portfolio. Not much changes with many iterations, and already after 10000 iterations λ is almost entirely contained within the allowed region.


Figure 7.10: λ after 1000 iterations with good s (λ [1/contract] against time [minutes], with the borders ±c/V0 marked).

Figure 7.11: λ after 10000 iterations (same axes as Figure 7.10).


Figure 7.12: λ after 30000 iterations (same axes as Figure 7.10).

Figure 7.13: λ after 100000 iterations (same axes as Figure 7.10).


7.1.4 Value function and gradient norm

For this problem the norm of the gradient approximately corresponds to how much of λ lies outside the allowed region, and how this amount changes with the number of iterations is illustrated in Figure 7.14. Figure 7.15 shows how the value function increases over the course of time.

Figure 7.14: Norm of gradient for 100000 iterations (λ outside border [1/contract], approximately the norm of the gradient, against iteration/100).

The graphs in Figures 7.14 and 7.15 are very similar, although mirrored. There is significant improvement in the first approximately 20000 iterations, and afterwards the curves flatten.

References
