
Generative Neural Network for Portfolio Optimization


Academic year: 2021


School of Education, Culture and Communication

Division of Applied Mathematics

MASTER THESIS IN MATHEMATICS / APPLIED MATHEMATICS

Generative Neural Network for Portfolio Optimization

by

Mengxin Liu

Masterarbete i matematik / tillämpad matematik

DIVISION OF APPLIED MATHEMATICS

Mälardalen University SE-721 23 Västerås, Sweden


Date: 2021-01-15

Project name: Generative Neural Network for Portfolio Optimization

Author: Mengxin Liu

Supervisors: George Fodor (Qognica AB), Olha Bodnar (MDH)

Reviewer: Christopher Engström

Examiner: Daniel Andrén

Comprising: 30 ECTS credits


Contents

1 Introduction
1.1 Problem description
1.2 Literature review
1.3 Outline

2 Traditional Portfolio Optimization Method
2.1 Mean-Variance Portfolio Optimization
2.2 CAPM

3 Limitation of Traditional Portfolio Optimization
3.1 Drawbacks within Assumptions
3.2 Drawbacks within Applications

4 Preprocessing
4.1 Distribution of Daily Return
4.1.1 Normal Distribution
4.1.2 Student’s t Distribution
4.1.3 Complete Data
4.2 Whether to Include Technical Indicators
4.3 Scaling
4.3.1 Standard Score
4.3.2 Min-Max Scaler
4.3.3 Robust Scaler
4.3.4 Max Abs Scaler
4.3.5 Power Transform

5 Artificial Neural Network
5.1 Introduction to Neural Network
5.1.1 Relation between Different Concepts
5.1.2 Definition of Artificial Neural Network
5.1.3 Differences between ANN and Statistical Method
5.2 Training Neural Network
5.3 Activation Function
5.3.1 Sigmoid function
5.3.2 Hyperbolic Tangent function
5.3.3 Rectified Linear Unit function
5.3.4 Exponential Linear Unit
5.3.5 Leaky ReLU
5.4 Approaches to Prevent Overfitting
5.4.1 Increase Data Size
5.4.2 Reduce Size of Neural Network
5.4.3 L1 Regularization
5.4.4 L2 Regularization
5.4.5 Dropout
5.5 Supervised Learning and Unsupervised Learning
5.6 Generative Adversarial Network
5.6.1 Cost function
5.7 Implement Neural Network in Portfolio Optimization
5.7.1 How to Optimize Portfolio from Output of Neural Network

6 Empirical Study
6.1 Data Software and Hardware
6.1.1 Data and Data Source
6.1.2 Software Choice
6.1.3 Hardware
6.2 Risk Measurement
6.2.1 Volatility
6.2.2 Value at Risk
6.2.3 Conditional Value at Risk
6.3 Monte Carlo Simulation
6.3.1 Simulated Path of Monte Carlo Simulation
6.3.2 Calculate VaR using Monte Carlo Method
6.3.3 Calculate CVaR using Monte Carlo Method
6.3.4 Markowitz GMV Portfolio Selection
6.4 Studies on GAN
6.4.1 Structure of GAN
6.4.2 Key Point on Selecting Batches
6.4.3 Output from GAN
6.4.4 The Effect of Epoch
6.4.5 The Effect of Batchsize
6.4.6 The Effect of Latent Dimension
6.4.7 A Portfolio Optimization Example

7 Discussion
7.1 Advantages of Generative Neural Network Portfolio Optimization

8 Further Research and Conclusion
8.1 Further Research
8.2 Conclusion

A Weight of GMV portfolio
A.1 Weights of GMV Portfolio

B VaR and CVaR of Stocks Using Normal Monte Carlo
B.1 Part 1
B.2 Part 2
B.3 Part 3

C VaR of the First GAN Result
C.1 Part 1
C.2 Part 2

D Epoch Study

E Batchsize Study


List of Figures

3.1 Rolling correlation coefficient between AAK and ABB
4.1 Histogram of ABB
4.2 Comparison Between Histogram and Normal Distribution PDF
4.3 Comparison Between Histogram and Student’s t Distribution PDF
5.1 A brief description of the relation between three different concepts
5.2 LTU unit
5.3 Neural Network
5.4 Graph of Sigmoid function
5.5 Graph of Hyperbolic Tangent function
5.6 Graph of Rectified Linear Unit function
5.7 Graph of Exponential Linear Unit
5.8 Graph of Leaky ReLU
5.9 Graphical Explanation of Autoencoder
5.10 Graphical Representation of GAN
6.1 One path generated by Monte Carlo simulation
6.2 VaR of ABB using Monte Carlo simulation
6.3 CVaR of ABB using Monte Carlo simulation
6.4 VaR and CVaR of ABB using Monte Carlo simulation
6.5 Value of GMV portfolio in 10 years
6.6 VaR and CVaR of GMV portfolio using Monte Carlo simulation
6.7 Data structure of input
6.8 ABB Price paths generated by GAN
6.9 Histogram comparison
6.10 Histogram comparison (daily return)
6.11 Heatmap of one generated data
6.12 Heatmap real stocks returns data
6.13 Rolling Correlation of Neural Network


Abstract

This thesis aims to overcome the drawbacks of traditional portfolio optimization by employing Generative Deep Neural Networks on real stock data. The proposed framework is capable of generating return data with statistical characteristics similar to those of the original stock data. The result is acquired using the Monte Carlo simulation method and presented in terms of individual risk. The method is tested on real Swedish stock market data. A practical example demonstrates how to optimize a portfolio based on the output of the proposed Generative Adversarial Network.


Acknowledgements

I would like to thank everyone at Qognica AB for giving me the chance to do this thesis. It has been a really enjoyable journey for me. I would also like to thank my supervisor Olha Bodnar for her constructive feedback.


Chapter 1

Introduction

1.1 Problem description

A portfolio is a set of financial assets selected to optimize the trade-off between risk and return. Optimal portfolios define a line in the risk vs. return plane called the efficient frontier. The optimization itself is carried out by portfolio managers. A manager selecting assets will typically consider factors such as the risk aversion of the investor, the risk/return profile of each asset, the risk-free rate and the borrowing rate. Advances in financial engineering have led to increased sophistication both in the optimization instruments and in investors’ understanding of risk. This trend could be accelerated using recent results in machine learning and advanced computerized mathematical modelling tools.

In order to construct a portfolio, it is important to model the assets. Each asset in a portfolio can be represented by a combination of weight, expected return, and risk. The weight $w_i$ represents the proportion of stock $i$ in the portfolio. The expected return $\mu_i$ represents investors’ expectation of the future return of stock $i$. Risk is normally measured by a function of the standard deviation $\sigma$.

The paper by Harry Markowitz [28] shows how to construct a portfolio based on the formulation introduced above. The weights of the stocks can be represented by a weight vector $\vec{W}^T = (w_1, w_2, \cdots, w_N)$. In order to calculate the variance of the whole portfolio, the covariance matrix $\Sigma$ needs to be calculated:

$$\Sigma = \begin{pmatrix} \sigma_{1,1} & \cdots & \sigma_{1,N} \\ \vdots & \ddots & \vdots \\ \sigma_{N,1} & \cdots & \sigma_{N,N} \end{pmatrix}$$

where $\sigma_{N,N} = \sigma_N^2$ is the variance of asset $N$ and $\sigma_{i,j}$ is the covariance between assets $i$ and $j$.

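Now the portfolio’s risk $\sigma_p$ can be calculated using the formula:

$$\sigma_p^2 = \vec{W}^T \Sigma \vec{W}$$

Minimizing this variance subject to the weights summing to one can be solved with a Lagrange multiplier. The resulting portfolio is called the Global Minimum Variance (GMV) Portfolio.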

If the investors want more return while controlling the risk, the optimization problem can be formulated using the concept of the Sharpe ratio [43]. The Sharpe ratio $S_p$ can be calculated with the formula:

$$S_p = \frac{\mu_p - r_f}{\sigma_p}$$

where $r_f$ is the risk-free rate, $\mu_p$ is the expected return of the portfolio, and $\sigma_p$ is the volatility of the portfolio.

Solving the optimization problem that maximizes the Sharpe ratio will give a portfolio called optimal portfolio. We can define the negative Sharpe ratio as cost function, then the optimization problem minimizes this cost function.
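The cost function described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the thesis: the two-asset inputs `mu`, `cov` and `risk_free` are made-up numbers, and the coarse grid search over long-only weights is an assumption chosen only to keep the example short.

```python
import numpy as np

def portfolio_volatility(weights, cov):
    """Portfolio standard deviation sqrt(w^T Sigma w)."""
    return float(np.sqrt(weights @ cov @ weights))

def negative_sharpe(weights, mu, cov, risk_free):
    """Cost function: the negative Sharpe ratio -(mu_p - r_f) / sigma_p."""
    mu_p = float(weights @ mu)
    return -(mu_p - risk_free) / portfolio_volatility(weights, cov)

# Hypothetical two-asset inputs (illustrative numbers, not thesis data).
mu = np.array([0.08, 0.12])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])
risk_free = 0.02

# Coarse grid search over fully invested, long-only weights (w, 1 - w).
grid = np.linspace(0.0, 1.0, 101)
candidates = [np.array([w, 1.0 - w]) for w in grid]
best = min(candidates, key=lambda w: negative_sharpe(w, mu, cov, risk_free))
best_sharpe = -negative_sharpe(best, mu, cov, risk_free)
print("weights:", best, "Sharpe ratio:", round(best_sharpe, 4))
```

In practice a numerical optimizer would replace the grid search; the grid is used here only because it makes the minimization of the negative Sharpe ratio easy to follow.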

In this thesis, some questions are raised. Is the mean-variance portfolio optimization framework a good portfolio selection method? Estimating risk by the standard deviation alone might not capture all the regularities that could identify risk patterns. With the recent growth in computing power, Artificial Intelligence methods, especially neural network algorithms, are becoming more and more important in fields such as computer vision because of their capacity to recognize patterns. This leads to the problem of this thesis. Is it possible to apply a neural network algorithm in the portfolio selection process? How does a neural network based portfolio perform compared with the Markowitz portfolio selection framework?

This thesis aims to answer these questions. A designed unsupervised neural network will try to extract features from the existing stock data. The neural network will then generate many return series that have characteristics similar to the original data, and a portfolio will be constructed based on the generated data. The designed portfolio will have minimum risk (measured by standard deviation, value at risk or conditional value at risk).

1.2 Literature review

The aim of academic studies in modelling time series is to find a model that better describes the characteristics of the series. A more complete model of a time series gives a better prediction or estimation, and the prediction is then used for optimization. A model that seeks a better representation of a time series has two parts: structure and parameters. When choosing a model, we want a structure that requires the smallest number of parameters. Among the many time series models proposed for financial applications, the Autoregressive Integrated Moving Average (ARIMA) model [47] offers good predictive power with few parameters. As Occam’s razor [6] states, the simpler model is preferred, which is a standard regularization principle. Apart from this advantage, what makes ARIMA interesting in financial applications is its capability of simulating Brownian motion, one of the most common ways to model the prices of financial assets. When investors identify the parameters of an ARIMA model, they are essentially doing statistical modelling, which is built upon statistical assumptions [11]. If we want a model without statistical assumptions, ARIMA is therefore not the most suitable choice. Hence we want to build a model based on no statistical assumptions, enabling us to find correlations or patterns that are hard to recognize in a normal setting. Compared to the traditional method, such a model is expected to give a more precise estimation of the financial characteristics. To this end, we choose to implement ideas from Artificial Neural Network research, a widely developed field that gives us a new way to model time series (financial time series in this case).

Since the introduction of the neural network, many researchers have tried to implement artificial neural network techniques in financial applications. The article by Cavalcante [7] categorizes machine learning related articles and summarizes their core implications according to their directions. According to the author’s summary, the most common application of machine learning in finance is price prediction. Machine learning techniques can also be applied to other tasks such as feature extraction and outlier detection.

Several articles fall under the category of price prediction. W. Bao [4] proposes a framework to predict stock prices. Index data are fed into a wavelet transform, whose purpose is to denoise the price data. The data then go through a stacked autoencoder, an unsupervised learning method designed to extract deep features from the data. Subsequently, the extracted features are fed into a long short-term memory (LSTM) model in order to acquire a one-step-ahead prediction. According to the author, the proposed framework is capable of predicting price data with a coefficient of determination $R^2$ above 90%. This demonstrates the potential of Artificial Neural Networks in the financial market.

Another approach in predicting stock price involves a commonly implemented way of processing data: technical indicator. Tegner [46] suggests a method to predict the financial asset price in the future. In his suggested framework, the input of the neural network consists of prices and technical indicators like Moving average and Momentum. Then a selection is conducted in order to find the technical indicators that have more importance than others. To acquire the prediction from the neural network, the author chooses to feed the selected data into a Recurrent Neural Network(RNN). According to the article, the most effective method achieves an accuracy of 52%. This is one of the articles that incorporate the idea of technical analysis with the power of Artificial Neural Networks.

One of the latest topics in artificial neural networks is the generative model. One framework, the Variational Autoencoder (VAE) [24], is gaining more attention. It can be applied in many areas such as text generation [50][42] and image related tasks [33]. The Variational Autoencoder can be seen as an extension of the autoencoder; its probabilistic characteristics enable it to generate different outputs with similar characteristics. It belongs to the family of generative autoencoders. This type of autoencoder has the ability to generate new data, making it an interesting topic in financial applications.

The Generative Adversarial Network (GAN) [18] is another generative model that has been applied in many research fields. X. Zhou [52] proposes a framework that applies GAN to high-frequency data to predict future stock data. The framework combines Long Short-Term Memory with GAN to predict the stock price. The performance is measured with two metrics, Root Mean Squared Relative Error (RMSRE) and Direction Prediction Accuracy (DPA). The results indicate that GAN could be a good topic in finance related applications.

Deep Convolutional Generative Adversarial Network (DCGAN) [34] is one common variation of GAN. This technique combines a Deep Convolutional Neural Network with GAN and is commonly applied in image-related applications. Another variation called Wasserstein GAN [3] and its improved version [50] give better results than the normal GAN structure in some applications.

Reinforcement learning can also be applied in finance [10][31]. These articles apply reinforcement learning to trade execution. The results demonstrate that reinforcement learning can be applied to the buy-or-sell trade execution problem.

To better understand the ideas and the applications of some references, we also choose to run some programs to test the different neural network models, which will be reflected in our empirical studies part.

1.3 Outline

This thesis has the following structure. In Chapter 2 we give an introduction to the traditional model for portfolio optimization, to give the reader a better understanding of the subject. In Chapter 3, we discuss the drawbacks of traditional portfolio optimization. In Chapter 4, we give our methods of preprocessing data, including filling and scaling data. Chapter 5 introduces the Artificial Neural Network and the Generative Adversarial Network that is implemented in this thesis. In Chapter 6, the results of the empirical studies and a study of the effect of hyperparameters are presented. In Chapter 7, the advantages and disadvantages of the proposed Generative Neural Network portfolio optimization framework are discussed, based on the results of the empirical studies. Finally, in Chapter 8, some directions for further studies are suggested, especially directions that can improve the proposed framework, and we give the conclusions on our proposed framework.


Chapter 2

Traditional Portfolio Optimization Method

Modern portfolio theory begins with the paper of Harry Markowitz in 1952 [28]. This paper changed how investors view the relationship between risk and return. Before the era of Modern Portfolio Theory, risk was not properly accounted for in the stock selection process; investors focused more on the return of the individual stock. Modern portfolio theory allows investors to make decisions in terms of both risk and return. The weights of the constructed portfolio can be calculated by solving an optimization problem. The following sections give a brief introduction to the two most commonly applied models of modern portfolio theory.

2.1 Mean-Variance Portfolio Optimization

Mean-variance portfolio optimization starts with the assumption that the investor at time $t$ will hold the portfolio for a time period $\Delta t$. The portfolio is judged by its terminal value at time $t + \Delta t$. Under this theory the portfolio selection process is a trade-off between return and risk.

Suppose that investors need to construct a portfolio from a pool of $N$ risky assets. Denote $w$ as the weight vector, which represents the weight of each stock: $w = (w_1, w_2, \cdots, w_N)$. To represent that the investor fully invests his/her money, we introduce the first constraint of portfolio optimization:

$$\sum_{i=1}^{N} w_i = 1 \tag{2.1}$$

This constraint states that the investor invests all the available money in risky assets, so the weights sum to one.

Then the investors need to estimate the expected return of each stock, either from a statistical model or by another method. The asset returns are denoted as $\mu = (\mu_1, \mu_2, \cdots, \mu_N)$. Next, the variance-covariance matrix needs to be calculated. The variance-covariance matrix $\Sigma$ can be written as:

$$\Sigma = \begin{pmatrix} \sigma_{1,1} & \cdots & \sigma_{1,N} \\ \vdots & \ddots & \vdots \\ \sigma_{N,1} & \cdots & \sigma_{N,N} \end{pmatrix} \tag{2.2}$$

where $\sigma_{i,j}$ denotes the covariance between assets $i$ and $j$.

With these assumptions, we have the expected return of the portfolio $\mu_p$:

$$\mu_p = w^T \mu \tag{2.3}$$

and the variance of the portfolio $\sigma_p^2$:

$$\sigma_p^2 = w^T \Sigma w \tag{2.4}$$

Now we can form an optimization problem that minimizes the risk given a target expected return $\mu_0$:

$$\min_{w} \; w^T \Sigma w \quad \text{subject to} \quad \mu_0 = w^T \mu, \quad w^T I = 1, \quad I = [1, 1, \cdots, 1]$$

This optimization can be solved using Lagrange multipliers, and the solution is [15]:

$$w = j + k \mu_0 \tag{2.5}$$

where $j$ and $k$ are given by:

$$j = \frac{1}{l\,n - m^2} \cdot \Sigma^{-1} [n I - m \mu], \qquad k = \frac{1}{l\,n - m^2} \cdot \Sigma^{-1} [l \mu - m I]$$

and

$$l = I^T \Sigma^{-1} I, \qquad m = I^T \Sigma^{-1} \mu, \qquad n = \mu^T \Sigma^{-1} \mu$$

Here $l\,n$ denotes the product of the scalars $l$ and $n$. With different choices of $\mu_0$, the optimization problem can be solved to obtain the weights of the portfolio. The variance of this portfolio can then be calculated using Equation 2.4. Repeating this gives many pairs of expected return and standard deviation, which together trace out the Efficient Frontier.


The efficient frontier starts from the Global Minimum Variance (GMV) portfolio. The optimization problem of this portfolio can be described as:

$$\min_{w} \; w^T \Sigma w \quad \text{subject to} \quad w^T I = 1, \quad I = [1, 1, \cdots, 1]$$

The solution of this optimization problem is [15]:

$$w = \frac{1}{I^T \Sigma^{-1} I} \cdot \Sigma^{-1} I$$

If the investor has a risk aversion denoted as $\lambda$, the optimization problem can be formulated as:

$$\max_{w} \; \left( w^T \mu - \lambda\, w^T \Sigma w \right) \quad \text{subject to} \quad w^T I = 1, \quad I = [1, 1, \cdots, 1]$$
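The closed-form solutions of this section can be checked numerically. The sketch below is only an illustration under stated assumptions: the three-asset `mu` and `cov` are hypothetical numbers, and the helper names `gmv_weights` and `frontier_weights` are introduced here and are not from the thesis.

```python
import numpy as np

def gmv_weights(cov):
    """Global minimum variance weights: Sigma^{-1} I / (I^T Sigma^{-1} I)."""
    ones = np.ones(cov.shape[0])
    inv_ones = np.linalg.solve(cov, ones)      # Sigma^{-1} I
    return inv_ones / (ones @ inv_ones)

def frontier_weights(mu, cov, target):
    """Minimum-variance weights for a target return: w = j + k * mu_0 (Eq. 2.5)."""
    ones = np.ones(len(mu))
    inv_ones = np.linalg.solve(cov, ones)      # Sigma^{-1} I
    inv_mu = np.linalg.solve(cov, mu)          # Sigma^{-1} mu
    l = ones @ inv_ones
    m = ones @ inv_mu
    n = mu @ inv_mu
    d = l * n - m ** 2
    j = (n * inv_ones - m * inv_mu) / d
    k = (l * inv_mu - m * inv_ones) / d
    return j + k * target

# Hypothetical inputs (illustrative numbers, not thesis data).
mu = np.array([0.05, 0.08, 0.12])
cov = np.array([[0.040, 0.006, 0.004],
                [0.006, 0.090, 0.010],
                [0.004, 0.010, 0.160]])

w_gmv = gmv_weights(cov)
w_target = frontier_weights(mu, cov, target=0.09)
print("GMV weights:     ", w_gmv.round(4))
print("Frontier weights:", w_target.round(4))
```

Sweeping `target` over a range of values and recording the pair (target, `sqrt(w @ cov @ w)`) for each would trace out the efficient frontier described above.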

2.2 CAPM

The Capital Asset Pricing Model (CAPM) is an equilibrium asset pricing model. The CAPM is founded on the following assumptions [14]:

1. Investors make decisions based on the expected return and the standard deviation of return.

2. Investors are rational and risk-averse.

3. Investors use Modern Portfolio Theory to diversify their portfolios.

4. Investors invest over the same time period.

5. Investors all have the same expected return and risk evaluation of all assets.

6. Investors can borrow or lend an unlimited amount at the risk-free rate.

7. There are no transaction costs.

To introduce the formula of CAPM, we start with a more general case: the single-index model. The single-index model can be described as a linear regression between index return and stock return. In other words, the return of stock $i$ can be described as:

$$R_i = a_i + \beta_i R_m$$

where $a_i$ is the part of the stock return that is unrelated to the market return, $R_m$ is the return of the market, and $\beta_i$ is a constant that describes the relation between the market return and the return of the stock.

Then rewrite $a_i$ as:

$$a_i = \alpha_i + e_i$$

where $\alpha_i$ is the mean value of $a_i$ and $e_i$ is the random component of $a_i$, with expected value 0.

Now the return of stock $i$ can be written as:

$$R_i = \alpha_i + \beta_i R_m + e_i$$

In this model, the correlation between $e_i$ and $R_m$ is 0.

Now we give:

1. The mean return: $\bar{R}_i = \alpha_i + \beta_i \bar{R}_m$

2. Variance of return: $\sigma_i^2 = \beta_i^2 \sigma_m^2 + \sigma_{e_i}^2$

3. Covariance between the returns of stock $i$ and stock $j$: $\sigma_{ij} = \beta_i \beta_j \sigma_m^2$

The proofs of the above formulas can be found in [13].
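The variance and covariance formulas of the single-index model can be illustrated with simulated data. The following is a sketch under explicit assumptions: all parameters (`alpha`, `beta`, `sigma_m`, `sigma_e`) are hypothetical, returns are drawn from normal distributions purely for convenience, and the residuals of the two stocks are simulated as independent, which is what the covariance formula requires.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 100_000

# Hypothetical single-index parameters for two stocks (not thesis data).
alpha = np.array([0.0002, 0.0001])
beta = np.array([1.2, 0.7])
sigma_m = 0.01                       # market volatility
sigma_e = np.array([0.02, 0.015])    # residual volatilities

r_m = rng.normal(0.0005, sigma_m, n_days)            # market return R_m
e = rng.normal(0.0, 1.0, (n_days, 2)) * sigma_e      # residuals e_i, independent of R_m
r = alpha + np.outer(r_m, beta) + e                  # R_i = alpha_i + beta_i R_m + e_i

# Beta estimated as the regression slope cov(R_i, R_m) / var(R_m).
var_m = np.var(r_m, ddof=1)
beta_hat = np.array([np.cov(r[:, i], r_m)[0, 1] / var_m for i in range(2)])

# Compare model-implied moments with sample moments.
model_var = beta ** 2 * sigma_m ** 2 + sigma_e ** 2
sample_var = r.var(axis=0, ddof=1)
model_cov = beta[0] * beta[1] * sigma_m ** 2
sample_cov = np.cov(r[:, 0], r[:, 1])[0, 1]

print("beta estimate:", beta_hat.round(3))
print("variance, model vs sample:", model_var, sample_var)
print("covariance, model vs sample:", model_cov, sample_cov)
```

With a long simulated sample the estimated betas and the sample moments land close to the model values, which is the content of formulas 2 and 3 above.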

Now the formulation of CAPM is written as:

$$R_i = R_f + \beta_i (R_M - R_f)$$

According to this formula, the expected return on a particular stock can be calculated from the beta of the stock, the risk-free rate on the market and the expected return of the market. This is the standard form of the CAPM, also known as the Sharpe-Lintner-Mossin form. Many other forms exist that try to solve some of the problems of the standard form.

In theory, CAPM is a good estimate of the expected return of stocks. In reality, however, implementation is more complicated. From the CAPM formula, we can see that the variables are expressed in terms of future values. In other words, investors need to estimate the future return of the market and the future beta of stocks. This exposes a problem: large-scale, systematic data on such expectations do not exist, and therefore the accuracy of CAPM cannot be guaranteed.


Chapter 3

Limitation of Traditional Portfolio Optimization

3.1 Drawbacks within Assumptions

Traditional portfolio optimization is useful in many ways, but it has several drawbacks. A Lagrange multiplier is used to solve the optimization problem, and for the solution to be optimal it must fulfil the Karush-Kuhn-Tucker conditions [26]. From the necessary condition, the whole process needs to be stationary; however, this is not necessarily the case, as there is no proof for it. Finally, the optimization process does not account for the stochastic characteristics of the data. This kind of optimization is therefore not robust enough, since it ignores one of the most important characteristics of stock price data. Any potential solution to this type of optimization should take stochastic properties into account; after all, investors want a portfolio that gives them a secure position in most cases.

3.2 Drawbacks within Applications

When investors try to implement Markowitz portfolio optimization in practice, they may face another weakness of Modern Portfolio Theory: the difficulty of estimating the required inputs. We start with the GMV portfolio, whose required input is the covariance matrix. The core idea behind this optimization is that, by combining the covariances and variances of the stocks, the investor can find the combination that minimizes the variance of the designed portfolio. The problem with this idea is that covariance is not a good representation of the relation between two stocks. Specifically, a single number cannot explain how two stocks move together. In Figure 3.1 we present the rolling correlation coefficient between the stocks AAK and ABB.


Figure 3.1: Rolling correlation coefficient between AAK and ABB

As can be seen in the Figure, the correlation between stocks varies a lot, therefore the optimization based on the covariance matrix cannot construct a good portfolio that can give us the minimum portfolio variance in the future.
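A rolling correlation of the kind plotted in Figure 3.1 can be computed as sketched below. The thesis does not state the window length or the code used, so the 60-day window, the helper `rolling_correlation`, and the simulated return series standing in for AAK and ABB are all assumptions made for illustration.

```python
import numpy as np

def rolling_correlation(x, y, window):
    """Pearson correlation of x and y over each trailing window of `window` points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    out = np.full(len(x), np.nan)        # undefined until a full window is available
    for end in range(window, len(x) + 1):
        out[end - 1] = np.corrcoef(x[end - window:end], y[end - window:end])[0, 1]
    return out

# Simulated daily returns sharing a common factor (stand-ins for two real stocks).
rng = np.random.default_rng(1)
common = rng.normal(0.0, 0.01, 500)
ret_a = common + rng.normal(0.0, 0.01, 500)
ret_b = common + rng.normal(0.0, 0.01, 500)

roll = rolling_correlation(ret_a, ret_b, window=60)
print("min / max rolling correlation:", np.nanmin(roll).round(3), np.nanmax(roll).round(3))
```

Even for these simulated series, whose true correlation is constant, the windowed estimate moves around noticeably, which illustrates why a single covariance number is a fragile input.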

Finding an accurate covariance matrix is not the only difficulty. If the investors also care about the return, they need to provide another input: the expected return vector. The expected return is what investors expect from a stock in the future. It is very hard to estimate the future return of a stock, so the investors may have inaccurate estimates. Since the maximum Sharpe ratio portfolio optimization is very sensitive to the expected return [14], this results in a scenario that can be described as "Garbage in, Garbage out": if investors estimate the expected return wrongly, the constructed portfolio cannot perform well.


Chapter 4

Preprocessing

Sometimes the original data contain missing values, which must either be ignored or filled in. In this thesis, missing data are filled in, and the following section proposes an approach for doing so. Note that we carry out distribution studies only to learn how to fill the missing data. Since missing data make up only a small part of the whole dataset, this does not change the premise of the main neural network application: we make no statistical assumption about the distribution of asset returns.

4.1 Distribution of Daily Return

The daily return of a stock is represented by the log return of its daily price. The formula for calculating the log return is [47]:

$$R_i = \ln(P_f / P_i)$$

where $R_i$ is the return at the end of the period, $P_f$ is the price at the end of the period and $P_i$ is the price at the start of the period.

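The reason for choosing the log return is that the return over a period can be calculated by summing the individual log returns. For example, denote $P_n$ as the price at time $n$. The log return over the period from 1 to $n$ can then be calculated as:

$$R_{1-n} = \ln\left(\frac{P_n}{P_1}\right) = \ln\left(\frac{P_2}{P_1} \cdot \frac{P_3}{P_2} \cdots \frac{P_n}{P_{n-1}}\right)$$

And we know that the logarithm of a product is the sum of the logarithms:

$$\ln\left(\frac{P_2}{P_1} \cdot \frac{P_3}{P_2} \cdots \frac{P_n}{P_{n-1}}\right) = \ln\left(\frac{P_2}{P_1}\right) + \ln\left(\frac{P_3}{P_2}\right) + \cdots + \ln\left(\frac{P_n}{P_{n-1}}\right)$$

Consequently, writing $R_i = \ln(P_{i+1}/P_i)$ for the log return of the $i$-th day:

$$R_{1-n} = \sum_{i=1}^{n-1} R_i$$

Also, if we choose the ordinary (simple) return, the return is bounded. The simple return ranges from $-1$ to $+\infty$, because a stock price cannot fall by more than its current value. This creates difficulties in studying or emulating the daily return. The log return, on the other hand, ranges from $-\infty$ to $+\infty$, making it a good choice for calculating daily returns in our studies.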
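The additivity of log returns is easy to verify numerically. The short sketch below uses a hypothetical five-day price series (made-up numbers) and contrasts log returns with simple returns.

```python
import numpy as np

# Hypothetical daily closing prices (illustrative only).
prices = np.array([100.0, 102.0, 99.0, 105.0, 103.0])

log_returns = np.diff(np.log(prices))              # ln(P_{i+1} / P_i) for each day
total_log_return = np.log(prices[-1] / prices[0])  # ln(P_n / P_1)

simple_returns = prices[1:] / prices[:-1] - 1.0    # bounded below by -1
total_simple_return = prices[-1] / prices[0] - 1.0

print("sum of log returns:   ", log_returns.sum())
print("total log return:     ", total_log_return)
print("sum of simple returns:", simple_returns.sum())
print("total simple return:  ", total_simple_return)
```

The daily log returns sum exactly to the log return of the whole period, while the simple returns do not add up this way; they compound multiplicatively instead.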

To identify the statistical properties of the daily stock returns, we first plot their histogram. Figure 4.1 is the histogram of the ABB daily return, which gives a first glimpse of the statistical properties of stock returns.

Figure 4.1: Histogram of ABB

With the histogram of the daily log return, we look for the distribution that best describes the daily returns of a stock.

Several candidate distributions are considered. In the following part, each distribution is introduced, and the Probability Density Function (PDF) of the fitted distribution is compared with the histogram of the stock return.


4.1.1 Normal Distribution

A random variable $X$ follows a normal distribution if its probability density function can be written as [39]:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty \tag{4.1}$$

where $-\infty < \mu < \infty$ and $0 < \sigma^2 < \infty$. We call $\mu$ the mean and $\sigma^2$ the variance. A random variable with mean $\mu$ and variance $\sigma^2$ is written as $X \sim N(\mu, \sigma^2)$.

The normal distribution is not suitable in this case. Figure 4.2 shows the PDF of the normal distribution with mean and variance calculated from the historical daily returns; the PDF does not fit the histogram well.

Figure 4.2: Comparison Between Histogram and Normal Distribution PDF

4.1.2 Student’s t Distribution

To introduce the Student’s t distribution we first introduce the gamma function. The gamma function $\Gamma(z)$ can be written as:

$$\Gamma(z) = \int_0^{\infty} x^{z-1} e^{-x} \, dx$$

Then a random variable $X$ follows the Student’s t distribution with $\nu$ degrees of freedom if its probability density function can be written as [39]:

$$f(x; \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi \nu}\, \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad -\infty < x < \infty$$

The comparison between the histogram and the fitted Student’s t distribution is presented in Figure 4.3.

Figure 4.3: Comparison Between Histogram and Student’s t Distribution PDF (curves shown: Normal Distribution, Student’s t Distribution)

The Student’s t distribution can be viewed as a generalization of the Cauchy distribution and the normal distribution. First, we look at the probability density function of the Cauchy distribution [39]:

$$f(x) = \frac{1}{\pi (1 + x^2)}, \quad -\infty < x < \infty$$

This probability density function is the special case of the Student’s t distribution with degrees of freedom $\nu = 1$. When we let the degrees of freedom $\nu \to \infty$, we have:

$$\lim_{\nu \to \infty} f(x; \nu) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}, \quad -\infty < x < \infty$$

Comparing this with Equation 4.1, we find that the result has the form of the standard normal distribution, that is, Equation 4.1 with $\mu = 0$ and $\sigma = 1$.
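Both limiting cases can be checked numerically with SciPy’s `scipy.stats` distributions. This is a small sanity check rather than part of the thesis; the evaluation grid and the list of degrees of freedom are arbitrary choices.

```python
import numpy as np
from scipy import stats

x = np.linspace(-5.0, 5.0, 201)

# nu = 1: the Student's t density coincides with the Cauchy density.
gap_cauchy = np.max(np.abs(stats.t.pdf(x, df=1) - stats.cauchy.pdf(x)))

# Growing nu: the Student's t density approaches the standard normal density.
dfs = (2, 10, 100, 1000)
gap_normal = {df: np.max(np.abs(stats.t.pdf(x, df=df) - stats.norm.pdf(x)))
              for df in dfs}

print("max |t(1) - Cauchy|:", gap_cauchy)
for df in dfs:
    print(f"max |t({df}) - Normal|:", gap_normal[df])
```

The gap to the normal density shrinks as the degrees of freedom grow, while small values of $\nu$ give the heavier tails that the return histogram calls for.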


4.1.3 Complete Data

Before filling the missing data, we need to select the stocks that we want to fill. Because the missing data are generated from the fitted distribution, filling the inputs of the neural network with too many random values would distort the true information in the real stock prices. Hence we only select stocks whose missing data amount to less than one-fourth of the total length of the data.

In terms of distribution, the Student’s t distribution is chosen because it fits the statistical properties of the daily stock returns well.

To obtain the parameters of the distribution we use the Python package SciPy [48]. We use functions from SciPy to fit the distribution to the return data and to generate daily returns that follow the fitted distribution.
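The filling step can be sketched with `scipy.stats.t.fit` and `scipy.stats.t.rvs`. The thesis does not list the exact calls it uses, so this is an assumed workflow: the return series is simulated, the gap position is arbitrary, and the one-fourth threshold is taken from the selection rule stated above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical daily log returns with a block of missing values (NaN).
returns = stats.t.rvs(df=4, loc=0.0, scale=0.01, size=1000, random_state=rng)
returns[100:150] = np.nan

missing = np.isnan(returns)
filled = returns.copy()

# Only fill series whose missing part is shorter than one-fourth of the data.
if missing.mean() < 0.25:
    # Fit a Student's t distribution to the observed returns: (df, loc, scale).
    df_hat, loc_hat, scale_hat = stats.t.fit(returns[~missing])
    # Draw replacement values from the fitted distribution.
    filled[missing] = stats.t.rvs(df_hat, loc=loc_hat, scale=scale_hat,
                                  size=int(missing.sum()), random_state=rng)
    print("fitted df, loc, scale:", df_hat, loc_hat, scale_hat)
```

Observed values are left untouched; only the gaps receive draws from the fitted distribution.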

4.2 Whether to Include Technical Indicators

In the finance industry, technical indicators are applied in many trading strategies. However, technical analysis is not an exact science [32]. It reflects the market price trend and aims to identify that trend at an early stage. Technical analysis holds that crowd psychology affects the stock price; by studying the market trend, investors decide when to buy and sell with some degree of confidence.

Some studies use technical indicators as an input. Tegner 2018 [46] and Widegren 2017 [49] both combine price data with technical indicators and feed them into an Artificial Neural Network.

In this thesis, however, technical indicators are not an input of the designed neural network. Technical indicators and technical analysis aim to provide predictions of the future asset price, and prediction is not the goal here. In other words, we do not care about the price tomorrow or next month. The focus is to estimate the risk of stocks in any given period, or more exactly the potential loss of the portfolio. Applying technical indicators therefore cannot improve the risk estimation and may introduce noise that affects our estimates.

4.3

Scaling

Before feeding data into the neural network, it is important to scale the data, because the ranges of the individual series may differ from each other. If the data is fed into the neural network or machine learning algorithm without any preprocessing, the result will be severely compromised by the differing data ranges. In the following section, several scaling techniques are introduced.


4.3.1

Standard Score

The formula for the standard score can be expressed as [53]:

$$\hat{X} = \frac{x - \mu}{\sigma}$$

where µ is the mean of the data and σ is the standard deviation of the data.

The standard score is the most common scaling technique and is applied in many machine learning algorithms. However, it is not appropriate for our application, because our designed network requires inputs in the range -1 to 1. The standard score scales data by the standard deviation, and Figure 4.2 shows that stock returns tend to follow a fat-tailed distribution. Scaled values in the tails would therefore fall outside the required range, so this method cannot fulfil the requirement of the proposed framework.

4.3.2

Min-Max Scaler

Another common technique that can be implemented in our application is Min-Max scaler. The formula for the scaling can be expressed as[35]:

$$\hat{X} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

The scaled data lie in the range 0 to 1, since the whole data set is scaled by its maximum and minimum. Although this technique guarantees that the result lies between 0 and 1, for stock data the scaling is dictated by the outliers. The scaled data are therefore overly concentrated in a narrow interval, which leads to inaccurate outputs from the neural network.

4.3.3

Robust Scaler

Now to solve the issue we faced in the Min-Max Scaler, we can implement another type of scaler: Robust Scaler. The formula for robust scaler can be formulated as[35]:

$$\hat{X} = \frac{x - Q_2}{Q_3 - Q_1}$$

where $Q_1$ is the 25th percentile of x, $Q_2$ is the median of x, and $Q_3$ is the 75th percentile of x.

With the Robust Scaler, the scaling is based on the median and the interquartile range, which are barely affected by extreme values, so the scaled data are more evenly distributed. Therefore, in the presence of large outliers, the Robust Scaler lets the data spread over a wider range than the Min-Max Scaler does.

4.3.4

Max Abs Scaler

Max Abs Scaler is a variant of the Min-Max Scaler. Compared to the Min-Max Scaler, the Max Abs Scaler scales the data into the range -1 to 1. The formula for the Max Abs Scaler is [53]:

$$\hat{X} = \frac{x}{\max(|x|)}$$
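The four scalers introduced so far can be compared on a small sample. The sketch below implements the formulas directly with the Python standard library; the sample of returns with a single large outlier is hypothetical, and the quartiles use the "inclusive" interpolation of `statistics.quantiles`, which is one of several possible conventions.

```python
import statistics


def standard_score(x):
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    return [(v - mu) / sigma for v in x]


def min_max(x):
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


def robust(x):
    # q1, q2, q3 are the 25th, 50th and 75th percentiles.
    q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return [(v - q2) / (q3 - q1) for v in x]


def max_abs(x):
    m = max(abs(v) for v in x)
    return [v / m for v in x]


# Hypothetical daily returns with one large outlier (-0.20).
returns = [-0.20, -0.02, -0.01, 0.00, 0.01, 0.01, 0.02, 0.03]
scaled = {f.__name__: f(returns)
          for f in (standard_score, min_max, robust, max_abs)}
```

On this sample the Min-Max result squeezes every non-outlier value into the upper quarter of the interval [0, 1], while the standard score of the outlier lies well below -1.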


4.3.5

Power Transform

Power transforms are techniques that use a power function to make data more normally distributed. There are two major approaches in this category: the Box-Cox Transform [40] and the Yeo-Johnson Transform [51]. In this application we have to use the Yeo-Johnson Transform, because the Box-Cox Transform can only be applied to strictly positive values. Since the Box-Cox Transform is not used, it is not introduced here. The Yeo-Johnson Transform is formulated as:

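$$
y_i^{(\lambda)} =
\begin{cases}
\left[(y_i + 1)^{\lambda} - 1\right] / \lambda & \text{if } \lambda \neq 0,\ y_i \geq 0 \\
\log(y_i + 1) & \text{if } \lambda = 0,\ y_i \geq 0 \\
-\left[(-y_i + 1)^{2 - \lambda} - 1\right] / (2 - \lambda) & \text{if } \lambda \neq 2,\ y_i < 0 \\
-\log(-y_i + 1) & \text{if } \lambda = 2,\ y_i < 0
\end{cases}
$$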

where 0 ≤ λ ≤ 2. The Yeo-Johnson Transform is chosen for this application because it can be applied to negative data. The transformed data are approximately normally distributed. After the power transform, we can apply the Max Abs Scaler to make the data compatible with the input range required by our network. Because the power transform is non-linear, outliers have less impact on the Max Abs Scaler's result than they would without the power transform.
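The two-step pipeline, Yeo-Johnson followed by the Max Abs Scaler, can be sketched as below. The value of λ is passed in by hand here, which is an assumption made for illustration; in practice it is usually estimated from the data, for example by maximum likelihood. The function names are hypothetical.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value y for a fixed lambda."""
    if y >= 0:
        if lam != 0:
            return ((y + 1) ** lam - 1) / lam
        return math.log(y + 1)
    if lam != 2:
        return -((-y + 1) ** (2 - lam) - 1) / (2 - lam)
    return -math.log(-y + 1)


def power_then_max_abs(x, lam):
    """Apply the Yeo-Johnson transform, then scale into [-1, 1]."""
    transformed = [yeo_johnson(v, lam) for v in x]
    m = max(abs(v) for v in transformed)
    return [v / m for v in transformed]


# Hypothetical sample and lambda, for illustration only.
scaled = power_then_max_abs([-0.20, -0.02, 0.00, 0.01, 0.03], lam=0.5)
```

With λ = 1 the transform reduces to the identity, which gives a quick sanity check of the implementation.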


Chapter 5

Artificial Neural Network

5.1

Introduction to Neural Network

5.1.1

Relation between Different Concepts

With the evolution of computer-related technologies, especially the GPU, Artificial Intelligence has become applicable in practice rather than remaining a proposed concept. Terms like machine learning and neural network have become increasingly popular and appear frequently in articles and the media. However, some readers may find it difficult to understand how these concepts relate to each other. In this section, we give a short introduction to them. Figure 5.1 gives a brief description of the three most common concepts in the Artificial Intelligence research field.

Figure 5.1: A brief description of the relation between three different concepts

The concept of Artificial Intelligence emerged in the 1950s. The core idea of AI is to automate intellectual tasks that are normally performed by humans. In the following years, this concept has been continuously developed, and under it a new approach was proposed: Machine Learning.

In the traditional model-based approach, data and rules are fed into the system, and the result is calculated from them. Machine Learning implements a whole new paradigm: a relation is learned from given inputs and results. In other words, a machine learning program aims to replicate the given results from the given inputs. Once a machine learning model is trained, we can supply a new set of data as input and obtain the output from the model.

To explain the concept of Deep Learning, we first elaborate on how a machine learning algorithm works. A machine learning program needs three types of information. The first is input data, which can be numbers, pictures, sound and so on. The second is examples of the known results, such as the tags of pictures in an image recognition task. The third is a measure of the algorithm's performance, which quantifies the distance between the result of the algorithm and the expected output. Adjustments are then made according to this measure, and the output of the machine learning algorithm improves accordingly.

Now we come to the difference between Machine Learning and Deep Learning. Contrary to the first impression, a deep learning algorithm does not necessarily give a deeper interpretation of the data than an ordinary machine learning algorithm. The term Deep Learning refers to algorithms that feed input data through successive layers of increasingly meaningful representations [16]. In this way, the input data is represented at different layers of interpretation. The depth of the model is the number of layers that contribute to the model. Under the definition of Deep Learning, the Artificial Neural Network is one commonly applied technique. In the following section, a detailed explanation of the neural network will be presented.

5.1.2

Definition of Artificial Neural Network

The concept of the Artificial Neural Network (ANN) borrows from how biological neurons work.

An Artificial Neural Network is defined as an interconnected assembly of simple processing elements, called nodes or units, whose functionality is similar to that of a biological neuron. The processing ability of the network is stored in the inter-unit weights, which are obtained through learning [21].

The concept of the Artificial Neural Network originates in the paper by McCulloch and Pitts (1943) [29]. The paper proposes a computational model that imitates the way neurons work when performing complex computations. This is the world's first Artificial Neural Network structure. One of the simplest ANN architectures, the Perceptron, was proposed by Frank Rosenblatt in 1957 [38]. It is a variation of another network, the Linear Threshold Unit (LTU). Figure 5.2 is a description of the LTU.

Figure 5.2: LTU unit (inputs X1, X2, X3 with weights W1, W2, W3, followed by a weighted sum Σ and a step function)

First, the LTU computes the weighted sum of its inputs, $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$, which can be written as $z = w^T x$. The LTU then applies a step function to the weighted sum $z$, so the output is $G(x) = \mathrm{Step}(w^T x)$. With the LTU defined, we can introduce the term Perceptron. A Perceptron consists of one layer of LTUs, with each neuron connected to the input layer [17]. When we stack multiple Perceptrons together, we create a Multi-Layer Perceptron (MLP).
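A single LTU can be written in a few lines. In the sketch below the step function returns 1 when z ≥ 0 and 0 otherwise, which is one common convention and an assumption here; the weights implementing a logical AND, with a constant third input acting as a threshold, are a hypothetical example.

```python
def step(z):
    """Step function: 1 if z >= 0, else 0 (assumed convention)."""
    return 1 if z >= 0 else 0


def ltu(weights, inputs):
    """Linear Threshold Unit: G(x) = Step(w^T x)."""
    z = sum(w * x for w, x in zip(weights, inputs))  # weighted sum z = w^T x
    return step(z)


# Hypothetical weights: the unit fires only when both x1 and x2 are 1.
# The third input is held at 1 so that its weight acts as a threshold.
AND_WEIGHTS = [1.0, 1.0, -1.5]
outputs = {(x1, x2): ltu(AND_WEIGHTS, [x1, x2, 1])
           for x1 in (0, 1) for x2 in (0, 1)}
```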

An MLP has an input layer, which handles the input. In the middle, the designer can place one or more layers of LTUs; such a layer is called a hidden layer. The number of hidden layers is chosen in advance and can be adjusted to the application. The last hidden layer is connected to an output layer. If an ANN has two or more hidden layers, it is called a Deep Neural Network (DNN). Figure 5.3 is an example of a Deep Neural Network. As the figure shows, this network has 2 hidden layers with 5 and 4 LTUs respectively.

Figure 5.3: Neural Network (Input Layer ∈ ℝ⁶, Hidden Layer ∈ ℝ⁵, Hidden Layer ∈ ℝ⁴, Output Layer ∈ ℝ⁶)

For convenience, in the remainder of this thesis the term neural network refers to the artificial neural network.

5.1.3

Differences between ANN and Statistical Method

Traditionally, a common way to complete a task is to use a statistical method. To explain the philosophy of ANN, assume that we want to solve a practical problem: identifying handwritten numbers. To complete this task with statistical techniques, a model has to be proposed that appropriately represents the relationship between inputs and outputs. Denote this model as $y = f(x, \beta)$, where x is the input (picture) and y is the desired output (identified number). The input can be a large amount of data, and the function f is unknown. Consequently, a large number of parameters is required to obtain a relatively accurate model, which means that the model for identifying handwritten numbers is large and complex.

A Neural Network, on the other hand, takes a different approach. A Neural Network model has far more parameters than a statistical model, so there are many possible combinations of parameters. In practice, different combinations of parameters can sometimes give the same output, which makes the parameters inside the neural network hard to interpret. To be more precise, a Neural Network works as a black-box method and does not offer an interpretable result through its parameters [17]. However, in this example we only want a model that can recognize handwritten numbers and do not care about the relationships between pixels. Neural Networks work well in this kind of application. Moreover, the financial market involves hundreds of variables, so finding a robust statistical model is very hard. A Neural Network is a suitable choice in this type of situation.


5.2

Training Neural Network

First, in a linear model, the relation between input and output can be expressed as:

$$f(x) = w^T x + b$$

where x is the input, w the weights and b the bias. This can be written out as:

$$f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$

The output of one layer can then be fed as input to the subsequent layer. However, the previous equation can only represent a linear relationship, so it is important to make the output non-linear. To achieve that, an activation function is applied, and the output can be written as:

$$u = \tau(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)$$

where u is the output and τ is the activation function.
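The layer computation above, and the way one layer's output feeds the next, can be sketched as follows. The weights, biases and the choice of tanh as activation function τ are hypothetical values chosen only to make the example concrete.

```python
import math


def dense_layer(inputs, weights, biases, activation):
    """For every unit j: u_j = activation(sum_i w_ji * x_i + b_j)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


x = [0.5, -1.0, 2.0]

# First layer: 3 inputs -> 2 units.
hidden = dense_layer(x,
                     weights=[[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]],
                     biases=[0.0, 0.1],
                     activation=math.tanh)

# Second layer: the output of the first layer is its input.
output = dense_layer(hidden,
                     weights=[[1.0, -1.0]],
                     biases=[0.0],
                     activation=math.tanh)
```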

Next we demonstrate how the parameters of the model are estimated. The weights are chosen to minimize the difference between the output and the expected output, in other words the error. The error of a model can be expressed in terms of the Mean Squared Error:

$$E = \sum_{l} \sum_{i} \left(\hat{y}_{li} - y_{li}\right)^2$$

Other error measures can substitute for the MSE in the Neural Network. Once the structure and the loss function of the designed network are known, training the neural network becomes a non-linear optimization problem. This type of problem can be solved in many ways; for Neural Networks the usual choice is the Backpropagation (Gradient Descent) algorithm. The weight $w_i$ is altered according to the error, and the simplest way to express this idea is:

$$\Delta w_i = \alpha \frac{dE}{dw_i} \tag{5.1}$$

Denote by $w^{(k)}$ the weights at iteration k. The weights at iteration k+1 then become:

$$w^{(k+1)} = w^{(k)} + \Delta w^{(k)} \tag{5.2}$$

To minimize the error, the step has to be taken in the direction opposite to the gradient. Combining Formula 5.1 and 5.2 with the sign reversed accordingly gives:

$$w^{(k+1)} = w^{(k)} - \alpha \frac{dE}{dw^{(k)}} \tag{5.3}$$

This is the simplest form of Gradient Descent. One drawback of gradient descent is that the gradient has to be summed over all observations, which is computationally heavy. Stochastic Gradient Descent is designed to reduce the workload of the gradient descent algorithm. Instead of calculating the sum of all gradients, it randomly selects observations and calculates the gradient from them.

The total error is a sum over the N observations:

$$E(w) = \sum_{n=1}^{N} E_n(w)$$

and the update for a randomly selected observation n is:

$$w^{(k+1)} = w^{(k)} - \alpha \frac{dE_n}{dw^{(k)}}$$

In practical applications, both gradient descent and stochastic gradient descent require the gradient to be computed. This can be done with the chain rule by treating the output of the network as a function of the weights.
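The difference between the two update rules can be illustrated on a one-dimensional linear model with a squared error. The data set, learning rates and step counts below are hypothetical, and the gradient is written out by hand for this small model rather than obtained through backpropagation.

```python
import random


def gradient(w, b, points):
    """Gradient of E = sum (w*x + b - y)^2 over the given points."""
    dw = sum(2 * (w * x + b - y) * x for x, y in points)
    db = sum(2 * (w * x + b - y) for x, y in points)
    return dw, db


def gradient_descent(points, alpha, steps):
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradient(w, b, points)        # uses every observation
        w, b = w - alpha * dw, b - alpha * db  # update rule (5.3)
    return w, b


def stochastic_gradient_descent(points, alpha, steps, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Gradient of E_n for one randomly selected observation n.
        dw, db = gradient(w, b, [rng.choice(points)])
        w, b = w - alpha * dw, b - alpha * db
    return w, b


# Hypothetical noise-free data generated by y = 2x + 1.
data = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
w_gd, b_gd = gradient_descent(data, alpha=0.01, steps=2000)
w_sgd, b_sgd = stochastic_gradient_descent(data, alpha=0.05, steps=20000)
```

Each stochastic step touches a single observation, so it is much cheaper than a full gradient step, at the price of a noisier path towards the minimum.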

5.2.1

Hyperparameters

Hyperparameters are parameters that are specified before the training process. Unlike the normal parameters of a neural network, hyperparameters cannot be derived or improved by the normal training process, so it is important to select a good set of them. In some cases, hyperparameter optimization techniques can be applied to improve the accuracy. However, owing to the length and complexity of that topic, such techniques are not implemented in this thesis. Instead, the chosen hyperparameters are reported, and readers can improve the network using their own methods. To explain the terminology, we start from gradient descent, which, as introduced before, is an iterative method. Following the previous notation, the parameter α is called the learning rate in practical applications. It is itself a hyperparameter that needs to be optimized in some applications.

If we could feed all the data to the neural network at once, there would be no need for a batch size. In almost all cases, however, this is not possible because the data set is too large for the computer to handle at once. To solve this problem, we divide the data into smaller pieces and update the weights of the neural network once for each piece. In the end we obtain the weights of the trained neural network. The size of one such piece (batch) is the batch size.

One epoch is one pass of the whole data set through the neural network. Readers who are not familiar with this topic may naturally ask why the neural network needs more than one epoch, or in other words why the same data has to be fed in more than once. Gradient descent is an iterative method, so with a limited amount of data one epoch does not give a satisfactory result; the model is underfitted. With a large number of epochs, however, the weights inside the network become too focused on the training data, resulting in an overfitted model.

With the epoch defined, we can define the number of iterations, which is the number of batches required to finish one epoch. This is not a hyperparameter, since it follows from the batch size.
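The relation between data size, batch size, iterations and epochs is simple arithmetic, sketched below. The sample count, batch size and number of epochs are hypothetical values, not the settings used in this thesis.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches (weight updates) needed to pass the data once."""
    return math.ceil(n_samples / batch_size)


def batches(data, batch_size):
    """Yield consecutive slices of data with at most batch_size elements."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


n_samples, batch_size, epochs = 2500, 64, 10   # hypothetical values
per_epoch = iterations_per_epoch(n_samples, batch_size)   # 40
total_updates = per_epoch * epochs                        # 400
```

The last batch of an epoch is smaller than the others whenever the batch size does not divide the number of samples.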


5.3

Activation Function

As illustrated before, a neural network requires an activation function to transform a linear function into a non-linear function. In the subsequent part, we introduce some commonly implemented activation functions [5].

5.3.1

Sigmoid function

The Sigmoid function, also known as the logistic function, is a common activation function in neural networks. The formula for the Sigmoid function is:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The graph of the Sigmoid function is given in Figure 5.4.

Figure 5.4: Graph of Sigmoid function

5.3.2

Hyperbolic Tangent function

The Hyperbolic Tangent function (Tanh) is another commonly used activation function. It is a zero-centered function with limits -1 and 1. Its output can be calculated using the following formula:

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$


Figure 5.5: Graph of Hyperbolic Tangent function

5.3.3

Rectified Linear Unit function

The Rectified Linear Unit function (ReLU) is one of the most commonly used activation functions in deep learning. The ReLU function can be represented as:

$$f(x) = \max(0, x)$$

Because the positive part of the function is linear, the ReLU function is easy to optimize using gradient-descent methods. Figure 5.6 is the graph of this function.


5.3.4

Exponential Linear Unit

The Exponential Linear Unit (ELU) is a variation of ReLU that can converge faster than the regular ReLU. The ELU function is formulated as:

$$f(z) = \begin{cases} z, & z > 0 \\ \alpha (e^{z} - 1), & z \leq 0 \end{cases}$$

The difference between ReLU and ELU lies in the negative part of the function. ELU decreases smoothly towards −α, whereas ReLU cuts off sharply at zero. Figure 5.7 is the graph of this function, where α = 0.7.

Figure 5.7: Graph of Exponential Linear Unit

5.3.5

Leaky ReLU

Leaky ReLU is a variant of ReLU. The formula is:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \leq 0 \end{cases}$$

Figure 5.8 is the graph of Leaky ReLU, where α = 0.1.


Figure 5.8: Graph of Leaky ReLU

5.4

Approaches to Prevent Overfitting

Overfitting is a common problem in machine learning, and neural networks are no exception. Overfitting compromises the performance and accuracy of a neural network in actual applications, so it is necessary to take measures against it. We introduce several such measures below.

5.4.1

Increase Data Size

The most obvious and easiest solution is to increase the size of the data. After all, overfitting is caused by not having enough data to fully train a complicated neural network. However, in many circumstances it is not possible to acquire more data. In our case, there is no more stock return data than what already exists on the market. Therefore other measures need to be applied to prevent overfitting the training data.

5.4.2

Reduce Size of Neural Network

Another approach is to reduce the size of the neural network. To elaborate on this, we have to define the complexity (capacity) of a neural network, which is the number of trainable parameters in the network [23]. A complex neural network has more parameters, which means that it has more capacity to learn and even to represent the training data perfectly. For example, assume that our training data consists of 10000 numbers. A network with 200000 trainable parameters will easily find a perfect fit for the training data set. In this case, we have an overfitted neural network.

An overfitted model will not provide any meaningful prediction for a new set of data, because the network is a perfect representation of the training data rather than a general description of that kind of data. Naturally, to solve this problem, the number of trainable parameters in the network needs to be reduced. On the other hand, a network with 100 trainable parameters will not give any meaningful prediction either, because a network of that size does not have the capacity to represent the training data, so in the actual application phase the model cannot give a meaningful result. Hence finding the right number of trainable parameters is the key to training a well-performing neural network.

5.4.3

L1 Regularization

One technique that can mitigate overfitting is to penalize the size of the parameters, which is achieved by adding a regularization term to the loss function. Denote L as the loss function; the regularized objective can then be expressed as [46]:

$$L(f(w, b), y) + \lambda R(w)$$

where R(w) is the regularization function.

If the regularization function is the L1 norm, the technique is called L1 regularization. It is expressed as:

$$R(w) = \|w\|_1 = \sum_{k=1}^{L} \sum_{i,j} \left| W^{k}_{i,j} \right|$$

5.4.4

L2 Regularization

When we replace the L1 norm with the L2 norm, we obtain L2 regularization. The L2 penalty is formulated as:

$$R(w) = \|w\|_2^2 = \sum_{m} \sum_{k} W_{mk}^{2}$$

The adjusted loss function is then minimized in the neural network, which means that the loss is minimized subject to the weights not becoming too large. The parameter λ controls the amount of regularization. It should be chosen in a balanced way, because a large λ results in an underfitted model.
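Both penalties and the adjusted loss can be computed as in the sketch below. The weight matrices, the unregularized loss value and λ are hypothetical numbers chosen for illustration, and the network weights are represented as a list of nested lists, one matrix per layer.

```python
def l1_penalty(weight_matrices):
    """Sum of the absolute values of all weights in all layers."""
    return sum(abs(w) for W in weight_matrices for row in W for w in row)


def l2_penalty(weight_matrices):
    """Sum of the squared weights in all layers."""
    return sum(w * w for W in weight_matrices for row in W for w in row)


def regularized_loss(data_loss, weight_matrices, lam, penalty):
    """Adjusted loss: data loss plus lambda times the penalty R(w)."""
    return data_loss + lam * penalty(weight_matrices)


# Hypothetical two-layer network and unregularized loss value.
layers = [[[0.5, -1.0], [2.0, 0.0]], [[-0.5, 0.5]]]
loss_l1 = regularized_loss(1.0, layers, lam=0.01, penalty=l1_penalty)
loss_l2 = regularized_loss(1.0, layers, lam=0.01, penalty=l2_penalty)
```

Because the L2 penalty squares each weight, the single large weight 2.0 contributes far more to it than to the L1 penalty.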

5.4.5

Dropout

Another way to prevent overfitting is to apply the dropout technique, proposed by Srivastava et al. in 2014 [44]. First, let $l \in \{1, \ldots, L\}$ index the layers of the network, and denote by $u^{(l)}$ the input to layer l, by $y^{(l)}$ the output from layer l, and by $w^{(l)}$ and $b^{(l)}$ the weights and bias at layer l. A normal feed-forward network is then represented as:

$$u_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}$$

$$y_i^{(l+1)} = \tau\left(u_i^{(l+1)}\right)$$

where τ is the activation function.

After dropout is implemented, the network becomes:

$$r_j^{(l)} \sim \mathrm{Bernoulli}(P)$$

$$\tilde{y}^{(l)} = r^{(l)} * y^{(l)}$$

$$u_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}$$

$$y_i^{(l+1)} = \tau\left(u_i^{(l+1)}\right)$$

Here ∗ denotes the element-wise product, and the entries of the vector $r^{(l)}$ follow a Bernoulli distribution with probability P.
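A single layer with dropout can be sketched as below. This is a plain-Python illustration of the training-time forward pass only; the function name, the weights and the random seed are hypothetical, and P is interpreted as the probability that a component is kept.

```python
import math
import random


def dropout_forward(y_prev, weights, biases, p, activation, rng):
    """One layer with dropout applied to its input y_prev.

    Each component of y_prev is kept with probability p and set to
    zero otherwise. Returns the layer output and the mask r.
    """
    r = [1 if rng.random() < p else 0 for _ in y_prev]    # r_j ~ Bernoulli(p)
    y_tilde = [r_j * y_j for r_j, y_j in zip(r, y_prev)]  # element-wise product
    u = [sum(w * y for w, y in zip(row, y_tilde)) + b
         for row, b in zip(weights, biases)]
    return [activation(u_i) for u_i in u], r


rng = random.Random(0)
output, mask = dropout_forward([1.0, 2.0, 3.0],
                               weights=[[0.1, 0.2, 0.3]],
                               biases=[0.0],
                               p=0.5,
                               activation=math.tanh,
                               rng=rng)
```

With p = 1 nothing is dropped and the layer reduces to the normal feed-forward computation.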

5.5

Supervised Learning and Unsupervised Learning

The traditional application of neural networks is classification, in which a set of data comes with a corresponding label or tag for each item. An artificial neural network is then trained on the input data and the corresponding labels. This kind of task is called Supervised Learning, because the labels or examples are provided by humans. In other words, the algorithm tries to replicate human judgment or decisions. To be more precise, supervised learning requires us to provide the input data as well as the responses or outcomes, and its job is to predict the output for a given input.

However, in the stock or derivative market it is hard to implement supervised learning, because it is difficult to find robust examples or tags for the neural network. For example, suppose we want to train a neural network that can distinguish well-performing stocks from under-performing stocks. We need to give the neural network examples, but recognizing a good stock is not as easy as it sounds. First, there is no universally recognized standard for a well-performing stock, and we cannot guarantee that our examples are correct or will have a good return in the future. This leads to the second difficulty of implementing supervised learning: reinforcing biases. Even if we manage to find tags or examples for a supervised neural network, the network may amplify human judgments and potentially give false answers. After all, we cannot differentiate luck from skill. To put it another way, we cannot tell whether investors earn money just because they are lucky or because they possess the necessary expertise.

To solve this problem, a different type of learning is applied: Unsupervised Learning. Unsupervised learning needs only input data and, instead of making predictions, tries to extract features by itself without any human guidance. Consequently, unsupervised learning tends to find interesting features that humans would not find, which makes it suitable for applications in the stock market and portfolio optimization. Apart from this advantage, unlabelled stock data is easier to acquire, since tagging data takes extra effort.

Under the category of unsupervised learning, one particular application is gaining more attention in research and practice: the Autoencoder. An autoencoder is a network that has the same input and output data. A simple explanation of the autoencoder is illustrated in Figure 5.9:

Figure 5.9: Graphical Explanation of Autoencoder (Data → Encoder → Compressed Data → Decoder → Data)

First, the data is fed into an encoder, whose output has fewer dimensions than the original data. Next, a decoder processes the compressed data and returns data with the same dimension as the input. An autoencoder thus forces the data to be compressed into a lower dimension. In doing so, the algorithm has to extract useful features from the data in order to minimize the difference between the original data and the recovered data.
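The flow of dimensions through an autoencoder can be illustrated with a deliberately simplified sketch. The encoder and decoder below are fixed linear maps with hand-picked, hypothetical weights; a real autoencoder would learn its weights by minimizing the reconstruction error and would normally use non-linear layers.

```python
def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical fixed (untrained) weights: 4 inputs -> 2 codes -> 4 outputs.
ENCODER = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0]]
DECODER = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 0.0],
           [0.0, 1.0]]


def autoencode(x):
    code = matvec(ENCODER, x)          # compressed representation (2 values)
    recovered = matvec(DECODER, code)  # same dimension as the input (4 values)
    error = sum((a - b) ** 2 for a, b in zip(x, recovered))
    return code, recovered, error


# Data with the repeating structure the weights assume is recovered exactly;
# data without that structure is not.
_, _, error_structured = autoencode([1.0, 2.0, 1.0, 2.0])
_, _, error_other = autoencode([1.0, 2.0, 3.0, 4.0])
```

The reconstruction error is zero only for inputs whose structure the two-dimensional code can capture, which is why minimizing it forces the network to extract useful features.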

Apart from feature extraction, one type of autoencoder, the generative autoencoder, can also randomly generate data that is similar to the original data. This kind of generation opens a new direction for Monte-Carlo simulation: more realistic random data gives a better estimate of the portfolio risk, so that portfolio optimization can select a lower-risk portfolio.

5.6

Generative Adversarial Network

The Generative Adversarial Network (GAN) is a new type of generative network that was introduced by I. Goodfellow in 2014 [19]. It consists of two parts: a generator and a discriminator (called critic in some literature). The core idea of GAN is a competition between the generator and the discriminator. The discriminator tries to identify whether a sample is real or not. The discriminator in a generative adversarial network is a supervised network with a binary output. In this setting, the examples for this network are the real stock price data.

The job of the discriminator is to classify data into two groups: real and fake. At the same time, the generator tries to generate sample data that will be classified as real. As a real-world analogy, the generator can be described as a counterfeit artist who tries to create a fake copy of a world-famous painting, and the discriminator as an art specialist who can distinguish fake paintings from real artwork. Under this metaphor, the training process is the counterfeit artist repeatedly sending paintings to the art specialist, who tells whether each painting is real or not. Once the paintings are classified as real artwork, the counterfeit artist can produce an unlimited number of paintings that have the same characteristics as a real painting. A graphical representation of this process is shown in Figure 5.10.

Figure 5.10: Graphical Representation of GAN (diagram elements: Noise, Generator, Generated Data, Real Data, Discriminator; arrows labelled Train, Identify and Output; the output is data with the same characteristics as real data)

To help readers understand GAN further, a more detailed explanation follows.

First, denote the Discriminator and the Generator as functions D and G, with parameters $\theta_D$ and $\theta_G$ respectively. Under this notation, the optimization process of GAN can be described as follows [18]: the Discriminator tries to minimize $C_D(\theta_D, \theta_G)$ while changing only the parameter $\theta_D$, and the Generator tries to minimize $C_G(\theta_D, \theta_G)$ while changing only the parameter $\theta_G$.

In essence this process is an optimization problem whose objective is to find the minimum point (although it may sometimes end in a local minimum).

As introduced before, training is carried out with the Stochastic Gradient Descent method. Denote the input noise to the Generator as $x_G$ and the observed variable (the processed stock returns in this case) as $x_D$. Because the neural network algorithm cannot process the entire data set at once, the data is divided into several batches, which are fed into the network to update the weights. Note that the two updates are carried out simultaneously [18]. At each step, the Discriminator updates the parameter $\theta_D$ to reduce the value of $C_D$, and the Generator updates the parameter $\theta_G$ to reduce the value of $C_G$.

5.6.1

Cost function

As previously explained, training a Neural Network is an optimization problem, so it is important to specify the cost function. The discriminator in GAN can be constructed as a normal deep neural network with the Binary Cross-Entropy loss function, because in this case the tags of the data are real and fake. This type of problem is a binary classification problem.


The cost function of the Discriminator is given as [19]:

$$C_D(\theta_D, \theta_G) = -\frac{1}{2} \mathbb{E}_{x_D \sim p_{data}} \log D(x_D) - \frac{1}{2} \mathbb{E}_{x_G} \log\left(1 - D(G(x_G))\right) \tag{5.4}$$

Readers who are familiar with Neural Networks will recognize this as the standard cost function of a binary classification problem. The Discriminator implements the core idea of binary classification, with the tags marking the actual data and the data generated by the Generator.

As for the cost function of the Generator, the objective is to minimize the Cross-Entropy between the output of the Generator and the actual data. The Cross-Entropy loss function measures the distance between the distribution of the empirical data and the model distribution. Therefore, through repeated training, the generated data will come to have a distribution similar to that of the actual data. The cost function of the Generator is given as [18]:

$$C_G = -\frac{1}{2} \mathbb{E}_{x_G} \log D(G(x_G)) \tag{5.5}$$

This cost function can be interpreted as the Generator trying to maximize the log-probability that the Discriminator classifies the data generated by the Generator as actual data.
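Given the Discriminator's outputs on one batch, the two cost functions (5.4) and (5.5) can be evaluated as in the sketch below. The expectations are replaced by batch averages, which is an assumption of this illustration, and the probability values are hypothetical.

```python
import math


def discriminator_cost(d_real, d_fake):
    """Equation (5.4) with expectations replaced by batch averages.

    d_real: values D(x_D) on real samples.
    d_fake: values D(G(x_G)) on generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return -0.5 * real_term - 0.5 * fake_term


def generator_cost(d_fake):
    """Equation (5.5) with the expectation replaced by a batch average."""
    return -0.5 * sum(math.log(d) for d in d_fake) / len(d_fake)


# A Discriminator that cannot tell real from fake outputs 0.5 everywhere.
c_d_undecided = discriminator_cost([0.5, 0.5], [0.5, 0.5])   # log 2
c_g_undecided = generator_cost([0.5, 0.5])                   # 0.5 * log 2

# A confident, correct Discriminator has a lower cost, while the
# Generator's cost rises because its samples are recognized as fake.
c_d_confident = discriminator_cost([0.9, 0.9], [0.1, 0.1])
c_g_confident = generator_cost([0.1, 0.1])
```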

5.7

Implement Neural Network in Portfolio Optimization

With the necessary background in place, we now briefly describe how this type of Neural Network is applied. Investors can use neural network techniques, specifically generative models, to conduct portfolio optimization. As suggested before, we can bring the idea of Monte-Carlo simulation to our result, since the generated data has similar characteristics to actual stock returns. With this procedure, the Neural Network possesses the capability to simulate stock prices in different scenarios. Hence, the optimized portfolio is expected to be more robust than one obtained with traditional Markowitz portfolio optimization. In other words, the designed portfolio may not be the one that gives the highest return or the smallest risk in one particular case. However, since the Monte-Carlo simulation covers the majority of scenarios, the designed portfolio offers a more secure position across the possible outcomes than a portfolio obtained with the static, conventional optimization method.

5.7.1

How to Optimize Portfolio from Output of Neural Network

Since the goal of portfolio optimization is to minimize the risk of the constructed portfolio (in terms of VaR or CVaR), it is necessary to determine the weight of each stock that gives the least risk in terms of value at risk or conditional value at risk. This idea has been implemented before: Rockafellar 2000 [37] gives a solution for portfolio optimization in terms of conditional value at risk. The article is groundbreaking, and some even call it Markowitz 2.0 to signify its importance. Nevertheless, this method cannot be used in this application, because it makes an assumption on the distribution of returns (a Smooth Multivariate Discrete Distribution, to be exact). As demonstrated before, we do not want to make any distributional assumption for asset returns. To solve this problem, a grid search is implemented to find the optimal weights for the portfolio. This is not the best possible solution, but owing to the time constraint we choose this simple approach. Readers may do more research on this and propose a better solution.
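A grid search over portfolio weights can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this thesis: the weights are long-only and lie on a grid with spacing 0.1, the risk measures are simple empirical estimators, and the scenarios are random numbers standing in for the output of the generative network.

```python
import itertools
import math
import random


def empirical_var(losses, alpha):
    """Smallest sample value c with empirical P(loss <= c) >= alpha."""
    ordered = sorted(losses)
    index = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return ordered[index]


def empirical_cvar(losses, alpha):
    """Average of the losses at or beyond the empirical VaR."""
    var = empirical_var(losses, alpha)
    tail = [loss for loss in losses if loss >= var]
    return sum(tail) / len(tail)


def weight_grid(n_assets, steps):
    """All long-only weight vectors with spacing 1/steps that sum to one."""
    for combo in itertools.product(range(steps + 1), repeat=n_assets):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)


def grid_search(scenarios, alpha=0.95, steps=10, risk=empirical_cvar):
    """Return (weights, risk) minimizing the risk measure over the grid."""
    best = None
    for weights in weight_grid(len(scenarios[0]), steps):
        # Portfolio loss in each scenario is the negative portfolio return.
        losses = [-sum(w * r for w, r in zip(weights, row))
                  for row in scenarios]
        value = risk(losses, alpha)
        if best is None or value < best[1]:
            best = (weights, value)
    return best


# Hypothetical stand-in for generated data: 2000 scenarios, 3 assets.
rng = random.Random(0)
scenarios = [[rng.gauss(0.0005, 0.01),
              rng.gauss(0.0005, 0.02),
              rng.gauss(0.0005, 0.03)] for _ in range(2000)]
best_weights, best_risk = grid_search(scenarios)
```

The number of grid points grows quickly with the number of assets, which is the main limitation of this simple approach.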


Chapter 6

Empirical Study

6.1

Data Software and Hardware

6.1.1

Data and Data Source

The data source of this thesis is Yahoo Finance; we use its API (Application Programming Interface) to download data from the server. We select stocks that are listed on Nasdaq Stockholm, which consists of 378 stocks¹. The data covers the period from 2009-02-09 to 2019-02-08. Prices are quoted daily and consist of the Open, Close, High, Low and Adjusted Close price of each day. The data also contains the daily trading volume of each stock.

6.1.2

Software Choice

In the empirical study, the programming language Python is used, with the necessary scientific packages numpy[22], scipy[48] and pandas[30]. For the neural network part of the empirical study, the framework is Keras[8] with a Tensorflow-GPU[1] backend. This means that the neural networks are run on the GPU of the computer. The operating system is Ubuntu (Ubuntu 18.04²).

6.1.3

Hardware

Throughout this thesis, all code is run on a PC with an Intel i5-8400 processor and 8GB of RAM. The graphics card is an Nvidia GTX 1060 with 6GB of graphics memory.

¹ http://www.nasdaqomxnordic.com/aktier/listed-companies/stockholm
² https://www.ubuntu.com/desktop


6.2

Risk Measurement

6.2.1

Volatility

The most common way to measure the risk of an investment asset is volatility, which is mathematically defined as the standard deviation of the return. The core idea of using the standard deviation as the risk representation is that the mean is the expected outcome of a stock. Therefore, a stock whose returns deviate more from the mean carries more risk for investors. In statistical terms, this deviation is measured using the variance or the standard deviation.
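The definition can be illustrated with a short numpy sketch. The price series is hypothetical and only serves as an example; the annualization with 252 trading days is a common convention and an assumption here, not something stated in this thesis.

```python
import numpy as np

# Hypothetical daily closing prices, for illustration only
prices = np.array([100.0, 101.0, 99.5, 100.5, 102.0, 101.0])

# Simple daily returns: r_t = P_t / P_{t-1} - 1
daily_returns = prices[1:] / prices[:-1] - 1.0

# Volatility = sample standard deviation of the daily returns
daily_vol = daily_returns.std(ddof=1)

# Annualized volatility, assuming 252 trading days and independent daily returns
annual_vol = daily_vol * np.sqrt(252)
```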

6.2.2

Value at Risk

Nonetheless, using the standard deviation to measure the risk of investment assets has its disadvantages, because it only represents the deviation from the mean regardless of the direction of the return. Investors prefer deviations above the mean. Consequently, another representation of risk can be defined, one that measures downside risk specifically. One such measure is Value at Risk (VaR). Denote by X the value of the investment asset and let the parameter α satisfy 0% < α < 100%; then the α-VaR is defined as [12]:

VaR_α(X) = min{c : P(X ≤ c) ≥ α}

VaR can be interpreted as the minimum loss in the (1-α) worst-case scenarios. By applying VaR, investors can better estimate the downside risk of their assets.
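On a finite sample, the definition above can be applied directly by replacing P with the empirical distribution. The sketch below is one possible way to do this; the function name is hypothetical, and treating the sample values as losses (so that large values are bad) is an assumption of the example.

```python
import numpy as np

def empirical_var(x, alpha):
    """Smallest sample value c whose empirical probability P(X <= c) is at least alpha."""
    xs = np.sort(np.asarray(x, dtype=float))
    # Empirical CDF evaluated at each sorted sample point: 1/n, 2/n, ..., 1
    cdf = np.arange(1, len(xs) + 1) / len(xs)
    # First position where the empirical CDF reaches alpha
    idx = int(np.searchsorted(cdf, alpha, side="left"))
    return xs[idx]

# Hypothetical losses, for illustration only
losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
var_80 = empirical_var(losses, 0.80)
```

Note that library quantile functions often interpolate between sample points, so their result can differ slightly from this definition-based value.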

6.2.3

Conditional Value at Risk

Another form of risk representation is Conditional Value at Risk (CVaR). Denote by X the value of the investment asset and let the parameter α satisfy 0% < α < 100%; then the α-CVaR is defined as [37]:

CVaR_α(X) = E[X | X ≥ VaR_α(X)]
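The conditional expectation in this definition can be estimated from a sample as the average of all values at or above the empirical VaR. The following is a minimal sketch under the same assumption as before, namely that the sample values are losses; the function name is hypothetical.

```python
import numpy as np

def empirical_cvar(x, alpha):
    """Average of the sample values that are at or above the empirical alpha-VaR."""
    xs = np.sort(np.asarray(x, dtype=float))
    cdf = np.arange(1, len(xs) + 1) / len(xs)
    # Empirical VaR: smallest sample value whose empirical CDF reaches alpha
    var = xs[int(np.searchsorted(cdf, alpha, side="left"))]
    # Sample counterpart of E[X | X >= VaR]
    return xs[xs >= var].mean()

# Hypothetical losses, for illustration only
losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cvar_80 = empirical_cvar(losses, 0.80)
```

By construction the estimate is never smaller than the VaR at the same level, because it averages only values that are at least as large as the VaR.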

Compared to VaR, CVaR has distinctive advantages that make it preferable. Sarykalin 2008[41] discusses these advantages from several aspects:

1. CVaR has better mathematical properties: the risk represented by CVaR is coherent.
2. CVaR deviation can represent risk; in other words, it is a good substitute for the standard deviation.
3. Risk management based on CVaR is more efficient than risk management based on VaR. To be more precise, CVaR can be optimized with regular optimization methods.
4. CVaR considers the effect of the cases where the loss exceeds a certain level, on the other


6.3

Monte Carlo Simulation

A Monte Carlo simulation is a type of simulation that conducts random sampling repeatedly and then uses statistical methods to analyze the results[36]. Put differently, it simulates alternative scenarios: the results of a Monte Carlo simulation could have been the daily returns of financial assets had the actual outcome not occurred. By repeatedly simulating these potential realities, a more accurate estimate for financial assets can be achieved.

To perform a Monte Carlo simulation, we first identify a statistical distribution. The most common choice is the normal distribution. We then draw random variables from the distribution that represents the daily return of the stock, and calculate the value of the stock at a given time along a specific path.
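The two steps described above can be sketched as follows. This is an illustrative example only: the mean and standard deviation of the daily return are hypothetical placeholders (in practice they would be estimated from historical data), and the number of paths, the horizon of 250 days and the start value of 1 are assumptions as well.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters of the daily return distribution (assumption)
mu, sigma = 0.0005, 0.012
n_paths, n_days = 10_000, 250
start_value = 1.0

# Step 1: draw daily returns from the chosen (normal) distribution
daily_returns = rng.normal(mu, sigma, size=(n_paths, n_days))

# Step 2: compound the returns to get the stock value along each path
paths = start_value * np.cumprod(1.0 + daily_returns, axis=1)

# Statistical analysis of the simulated outcomes at the final day
terminal = paths[:, -1]
mean_terminal = terminal.mean()
p5_terminal = np.quantile(terminal, 0.05)  # value reached or exceeded in about 95% of paths
```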

6.3.1

Simulated Path of Monte Carlo Simulation

Following the introduction in the previous section, we draw random samples from the Gaussian distribution. In this case, we use the daily log return as the random variable, because it makes the return over a given period easier to calculate.
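The convenience of log returns is that they add up over time: the log return of a period is the sum of the daily log returns, whereas simple returns have to be compounded by multiplication. The sketch below illustrates this with Gaussian daily log returns; the parameters and the start value of 1 are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical Gaussian daily log returns over 250 days (assumption)
log_returns = rng.normal(0.0004, 0.01, size=250)

# The log return of the whole period is simply the sum of the daily log returns
period_log_return = log_returns.sum()

# Value path relative to a start value of 1: exponential of the running sum
path = np.exp(np.cumsum(log_returns))

# The same period return from simple returns requires compounding by multiplication
simple_returns = np.exp(log_returns) - 1.0
period_simple_return = np.prod(1.0 + simple_returns) - 1.0
```

Both routes give the same period return, since exp(period_log_return) - 1 equals period_simple_return up to floating-point error.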

To demonstrate path generation, we show one of the paths generated by the Monte Carlo simulation based on the statistical properties of the ABB stock. Figure 6.1 shows this example path.

[Figure 6.1. Plot title: ABB price path. x-axis: Days (0 to 250); y-axis: Value (1.0 to 1.4)]
