
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2017

Prediction of securities’ behavior using a multi-level artificial neural network with extra inputs between layers

ERIC TÖRNQVIST

XING GUAN


Abstract

This paper discusses the possibilities of predicting changes in stock prices at high frequency by applying a multi-level neural network without the use of recurrent neurons or any other form of time series analysis, as suggested by Chen et al. [2017]. It attempts to adapt the model presented by Chen et al. [2017] by making the network deeper, feeding it data of higher resolution and changing the activation functions. While the resulting accuracy is not as high as that of other models, the paper might prove useful to those interested in further developing neural networks for high-resolution data, and to the fintech business as a whole.


Contents

1 Introduction 3

1.1 Background and purpose . . . 3

1.2 Thesis . . . 3

1.3 Business value . . . 4

2 Theory 4

2.1 Financial instruments . . . 5

2.1.1 Stocks . . . 5

2.1.2 Other types of instruments . . . 5

2.2 Markets . . . 5

2.2.1 Order book . . . 5

2.2.2 Price . . . 6

2.3 Machine learning . . . 6

2.3.1 Supervised learning and unsupervised learning . . . 6

2.3.2 Training and test data . . . 7

2.3.3 Validity and relevance . . . 8

2.3.4 Classification and regression . . . 9

2.3.5 Misclassification, cost and bias–variance trade-off . . . 9

2.3.6 Support vector machines . . . 10

2.3.7 Random forest . . . 10

2.3.8 Zero rule . . . 11

2.3.9 Artificial neural networks . . . 11

3 Method 13

3.1 Correlation . . . 13

3.2 Theoretical design idea . . . 13

3.3 Practical programming . . . 14

4 Earlier studies 14

4.1 J. Lindblad . . . 14

4.2 H. Chen et al. . . 14

5 Forecasting using neural networks 15

5.1 Neural network design . . . 15

5.2 Algorithm . . . 16

6 Data handling 17

6.1 Acquiring data . . . 17

6.1.1 Selection of original data . . . 17

6.1.2 Accessing data . . . 17

6.1.3 Second selection process . . . 18

6.1.4 Merging and scrubbing . . . 18

6.1.5 Preprocessing . . . 18

6.2 Using the data for calculations . . . 18


6.2.1 Parameter calculation . . . 19

6.2.2 Data usage . . . 20

7 Experiment and result 20

7.1 Evaluation . . . 21

7.2 Overall performance . . . 21

7.3 Result . . . 22

8 Discussion 23

8.1 Sources of error . . . 23

8.2 Improvements and continued research . . . 24

8.3 Business and market impact . . . 25

9 Conclusion 26


1 Introduction

1.1 Background and purpose

This paper's goal has been to evaluate an algorithm for forecasting financial instruments, mainly stocks. This falls within the field of the financial technology business, hereinafter referred to as ”fintech”, and more specifically the algorithmic trading business, hereinafter referred to as ”algotrading”.

This paper was written independently of any company. The motivation for pursuing this thesis is the large monetary value in the technology of forecasting stock prices, as well as the technological novelty of the field.

Furthermore, this paper examines the potential of artificial neural networks when it comes to forecasting the values of financial instruments, with a focus on stock forecasting. The goal of the forecast is not to predict the explicit future value, but rather whether the financial instrument increases, decreases or stays unchanged in monetary value.

There is a possibility to implement a similar algorithm in a real-time application and to forecast changes in financial instruments as they are happening. This is not within the scope of this paper. The paper could, on the other hand, prove useful as a proof of concept for any fintech company with ambitions to explore the complex world of artificial neural networks.

It would be arrogant to imply that neural networks are not already used by the major companies in equity trading, but this work might prove useful for smaller and less established companies looking to explore a relatively new area of prediction in the finance world.

The network design used in this paper differs from the common model, time series analysis, in that it does not require the data provided to be in a specific interval, or even ordered. This network design relies mainly on the correlation between changes in values of instruments shown by Chen et al. [2017] and Chiang et al. [2009].

There are no ethical uncertainties concerning this paper and the work as a whole, as the design should be seen as a proof of concept and not a self-learning network that actually engages in trading. Any such network would pose ethical questions, for example how to avoid abusing the intent of the market regulations.

1.2 Thesis

The question examined in this paper is whether an artificial neural network can predict the price of financial instruments, in this case stocks. The price prediction is not numerical, but is rather regarded as a classification problem with the classes increase, decrease and unchanged price. The result will be measured as the percentage of predictions that agree with the given data. The thesis is therefore: Can a specially crafted neural network accurately predict whether a financial instrument will increase or decrease in value in the near future, without the use of time series, memory functions or recurrence, in comparison with existing methods and network designs?

To clarify, specially crafted means a network that does not only take input as one matrix of values used in a single layer, but rather takes input as several matrices given to different layers depending on the stability of the parameters, where the more stable parameters are fed later in the network.

The network tested in this paper does not rely in any way on time series analysis, memory functions or recurrence, in order to mimic the method of Chen et al. [2017], relying only on the data $X_{t-1}$ to predict the change $Y_t$.

The goal of this paper is not to find an optimal method of forecasting securities' returns, but rather to test an approach similar to the one suggested by Chen et al. [2017].

As a means of comparison, several other models were used to give context to the prediction rate of the proposed network. The models used for comparison were the following:

• Support Vector Machine

• Random Forest

• Regular Neural Network

• Zero Rule (used as baseline)

1.3 Business value

This paper is mainly relevant to the financial industry, and foremost to the fintech industry. Artificial neural networks have already demonstrated their ability to predict and categorise with high accuracy in other fields, such as image recognition. The idea is to support financial companies in their analysis of securities using neural networks. This has several advantages compared to human investment advisers: first, in terms of raw calculation power, speed and possible quantity, it greatly surpasses any person. Second, if the automated investment indicators are good enough, they can potentially replace the human investment adviser, thus lowering the high fees or premiums that investors pay today and making the network model a less expensive alternative to using human advisers.

With rapid technological advancement, the competition within the industry is very high. Due to the efficiency that technology can bring, most actors within the financial industry are open to making investments and are willing to use new technological products in order to improve their competitive advantage.

2 Theory

This section is meant to clarify the theories and terms needed to understand this paper.


2.1 Financial instruments

Financial instruments are assets that can be traded and they are used as a way of letting capital change hands. In order to do this, several types of assets can be used as instruments, as long as they represent an agreement involving the exchange of something, real or virtual, with monetary value[Fabozzi, 2003, p. 1–2]. The focus of the paper will be the trading done with stocks on a public exchange market, more specifically on Nasdaq Stockholm.

2.1.1 Stocks

A stock is a security signifying ownership of part of a corporation. Stocks are traded differently for each corporation they represent ownership in, both when it comes to value and to frequency. The value at which the stock is traded is the value that will be referred to in this paper when making forecasts. Stocks are normally traded in open markets (see section 2.2), with opening hours similar to those of any ordinary business.

2.1.2 Other types of instruments

There is a wide array of types of financial instruments besides stocks, many not representing anything real, but rather a virtual contract. An example of this is a stock option - an agreement to buy or sell a stock at a fixed price at a future time.

With the number of types of financial instruments being innumerable, ranging from options to exotic derivatives, some instruments are more similar to stocks than others [Fabozzi, 2003, p. 1–7]. The models used in this paper will not focus on being compatible with any of these, but might be modified to be so. As the various instruments might have an effect on the price of a stock, it is not essential to fully understand them, but it is essential to know that they exist.

2.2 Markets

Financial markets can be categorised into two groups, open and closed markets.

This paper will focus only on the open markets, where information concerning prices and orders is publicly available. Most of the open markets follow their own protocol when it comes to order placement and the actual exchange of equity, but the general principle is similar among them. The protocol consists of every order, both to sell and buy, being placed in an order book, which then tries to match the orders for a successful exchange[Lee, 2006, p. 638–642].

2.2.1 Order book

The order book is significant, since it is a representation of the value of the financial instrument it refers to. If the book is heavy on the buy side, it indicates that there are more parties looking to buy said instrument than to sell it, giving investors information about the depth of the orders being placed. Since an order is executed and an exchange takes place every time there is a matching order on both sides of the book, the order book acts not only as a system to keep track of orders, but as a bulletin board for orders.

When matching orders are found on each side and the exchange takes place, both of the orders are removed. Orders can also be partially filled: if the orders cannot be filled in their entirety but an exchange of at least one instrument can be made, the exchanged instruments are subtracted from the orders, which are left in the book in their now reduced quantity [Lee, 2006, p. 638–642].

The orders in the order book are cancellable, meaning that before an exchange has taken place, the party placing the order can at any time remove it, in something known as an order cancel [Refco Private Client Group, 2005, p. 105].

It is common practice that an anonymised version of the order book is kept visible for everyone to see for each and every stock traded.

2.2.2 Price

When matching orders are found on both sides of the order book and an exchange takes place, a new price is established. This is referred to as the price of the financial instrument in question, and it will generally fluctuate over time. The price is not easily predictable, and can in some cases be regarded as a Brownian motion [Osborne, 1959, p. 145]. Predicting the exact next position of a Brownian motion is hard, if not impossible. However, even though a Brownian motion moves based on apparently random variables, there has been success in predicting the general direction of stock prices using neural networks [Chen et al., 2017].

2.3 Machine learning

Machine learning is a subfield of computer science whose purpose is to train a computer with a large amount of data in order to find patterns and make predictions. A widely quoted definition of machine learning by Tom M. Mitchell is:

”A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”[Mitchell, 1997, p. 2].

2.3.1 Supervised learning and unsupervised learning

Generally machine learning can be classified in two main categories: supervised learning and unsupervised learning, not including the hybrids semi-supervised learning and reinforcement learning.

Supervised learning

In supervised learning, the data set and the correct output are known [Mueller and Massaron, 2016, p. 168] and are referred to as training and test data, which is further explained in section 2.3.2. The algorithm is provided input data and the corresponding output data for each data point. The goal of the supervised learning algorithm is then to determine the relationship between input and output, often using a predefined error function. The algorithm can be designed either to calculate a function for determining new output based on new, untested input, known as regression, or to categorise the new input into an existing output category, known as classification [Mueller and Massaron, 2016, p. 168]. Classification and regression are further explained in section 2.3.4.

Unsupervised learning

Unsupervised learning is used to approach problems with little or no idea of what the output could look like[Mueller and Massaron, 2016, p. 169]. Unsupervised learning can be used to organise larger data clusters, without the use of human labelling. Since the algorithm is not provided with output data corresponding with the input data, it must internally decide what type of cost is associated with labelling, rather than relying on measuring it against given correct output. Since the input data is not provided with any corresponding output data, regression in a classical sense becomes impossible. Unsupervised learning is therefore more suited for classification problems than for regression problems.

2.3.2 Training and test data

A critical phase of any machine learning algorithm is the training phase, and providing the algorithm with training data is essential. Data is usually gathered for both training and testing at the same time; the difference lies mostly in how the data is used and partitioned.

One might wonder why not all gathered data is used as training data, with none saved for testing. If no data is saved for testing, this can often lead to something known as overfitting, or overtraining, due to the lack of indicators. Using test data makes it less likely that the model is overfit, as the test phase would detect a much lower precision, thus being a good indicator of overfitting [Mueller and Massaron, 2016, p. 289]. If a model is overfit, it becomes inefficient at making generalisations when fed input data not represented in the training data.

Partitioning for training and testing

In machine learning, training and test data are essential in order to train algorithms and subsequently validate the generated model. A common approach is to partition the raw data into a larger portion, the training data, and a smaller portion, the test data. These partitions are commonly in the range of 70–75% training and 25–30% testing [Mueller and Massaron, 2016, p. 191].

This method is commonly used for larger data sets, where the amount of data is sufficient to avoid having to use cross validation. Both the test data and the training data will commonly only be used once.


Cross validation

Cross validation is an alternative way to train and test models. There are many ways of doing cross validation, ranging from exhaustive to non-exhaustive methods. One commonly used method is to split the data into k parts of equal size, known as k-fold cross validation. One part of the data is used as test data and the rest as training data, and the parts are then rotated so that every part has been used as test data exactly once. This produces k error estimates, and their average is used as the prediction error of the model. This method is usually used when the data set is too small for simple partitioning. It is also used when the algorithm uses some sort of boosting, requiring several iterations with different partitions, something that might also be associated with a smaller data set.
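As a concrete illustration, a minimal sketch of k-fold cross validation using scikit-learn (which this paper also uses for its comparison models) could look as follows; the classifier, data and k = 5 are chosen purely for illustration and are not taken from the experiments.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative data: 1000 samples, 20 features, 3 classes (as in this paper).
X = np.random.rand(1000, 20)
y = np.random.randint(0, 3, size=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in kf.split(X):
    clf = LogisticRegression()
    clf.fit(X[train_idx], y[train_idx])
    # One error estimate per fold; their average is the model's prediction error.
    errors.append(1.0 - accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("cross-validated error: %.3f" % np.mean(errors))
```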

2.3.3 Validity and relevance

A common way of measuring an algorithm's relevance is to calculate the precision and recall of the model produced. This is done using the test data, calculating four sets: true positives, true negatives, false positives and false negatives. These sets are then used to calculate the precision and recall using the following formulae:

$$\mathrm{precision} = \frac{|\{\text{true positives}\}|}{|\{\text{true positives}\} \cup \{\text{false positives}\}|}$$

$$\mathrm{recall} = \frac{|\{\text{true positives}\}|}{|\{\text{true positives}\} \cup \{\text{false negatives}\}|}$$

where $|A|$ is the cardinality of the set $A$.

Using both of these values combined gives an idea of the relevance of the model created.

When a model is being evaluated using the training data, it is common to simply calculate the percentage of wrongly classified data points, the training error. When the model is fully trained, the training error is compared with the test error. These are often calculated using the following formula:

$$\mathrm{error} = \frac{|\{\text{false positives}\} \cup \{\text{false negatives}\}|}{|\{\text{true positives}\} \cup \{\text{true negatives}\} \cup \{\text{false positives}\} \cup \{\text{false negatives}\}|}$$

or, as more commonly expressed in machine learning algorithms:

$$\mathrm{error} = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$$

where $n$ is the total number of data points in either the training data or the test data, $I(y_i \neq \hat{y}_i)$ is a function that is 1 if $y_i \neq \hat{y}_i$ and 0 if $y_i = \hat{y}_i$, $y_i$ is the correct classification and $\hat{y}_i$ is the classification made by the model [James et al., 2013, p. 37].

The results in this paper are presented using accuracy, or more precisely categorical accuracy, something that can be calculated using one of the following formulae:

$$\mathrm{accuracy}_{\text{categorical}} = 1 - \mathrm{error}$$

or

$$\mathrm{accuracy}_{\text{categorical}} = \frac{1}{n}\sum_{i=1}^{n} I(y_i = \hat{y}_i)$$
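These definitions translate directly into code; a minimal NumPy sketch (the label vectors below are invented for illustration, using the class labels 0 = negative, 1 = unchanged, 2 = positive defined later in this paper) is:

```python
import numpy as np

# Hypothetical correct labels y and model predictions y_hat.
y     = np.array([1, 1, 0, 2, 1, 1, 0, 2, 1, 1])
y_hat = np.array([1, 1, 1, 2, 1, 0, 0, 1, 1, 1])

# Categorical accuracy: mean of the indicator I(y_i == y_hat_i).
accuracy = np.mean(y == y_hat)
error = 1.0 - accuracy

# Precision and recall with one class treated as "positive" (here class 2).
tp = np.sum((y_hat == 2) & (y == 2))
fp = np.sum((y_hat == 2) & (y != 2))
fn = np.sum((y_hat != 2) & (y == 2))
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)

print(accuracy, error, precision, recall)
```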

2.3.4 Classification and regression

In classification there are two types of models, binary classification and multiclass classification. These types differ mainly in how many classes are allowed; binary classification allows only two classes, often described as one negative and one positive, whereas multiclass classification allows for multiple classes. All classification models have in common that they make no claim of finding the underlying function that dictates the class of the data, but rather pair the data with the best matching class label. This can be seen as mapping continuous input data to discrete output data.

In regression, many models can be found, ranging from linear regression and simple regression to logistic regression and segmented regression. The purpose of a regression model is to find the underlying function of the data, mapping continuous input data to continuous output data. This often complex function can then be used to predict the output value of any input data.

This can be summarised as classification trying to identify class membership, whereas regression tries to predict a response.

2.3.5 Misclassification, cost and bias–variance trade-off

Before going into some of the models used in machine learning, the cost function and cost of misclassification must be clarified.

The cost function is a function that determines the cost of having a boundary limit in a certain position. This is used in order to avoid early stopping: halting the algorithm as soon as it separates the data sets instead of searching for an optimal solution.

The cost of misclassification, on the other hand, is used to avoid overfitting the data. If the cost of misclassification is set too high, or misclassification simply is forbidden, the algorithm will have a high tendency to overfit, since it will take all the inherent noise in the training data and force the boundary to adapt to it. This lowers the generalisation of the model produced, creating variance. If the cost of misclassification is set too low, the algorithm will instead tend to over-generalise, ignoring not only much of the noise but also some of the valid data, creating bias. This is called the bias–variance trade-off [James et al., 2013, p. 33–42].

Perfectly optimising the trade-off leaves only the irreducible error, which comes as a result of the noise in the data. This optimisation is much of the tuning required when adapting an algorithm to new and unseen data.

2.3.6 Support vector machines

The support vector machine, hereinafter referred to as ”SVM”, is an algorithm used for classification or regression analysis. The algorithm takes the multidimensional input vector, transforms it to a higher-dimensional space and then finds the linear hyperplane which separates the data into two classes with the least error and the largest margin between the respective classified points [Mueller and Massaron, 2016, p. 297–305].

In order to do this efficiently, the SVM uses the so-called kernel trick, allowing for separation by a hyperplane in a higher dimension without actually transforming the data into that higher dimension. The idea is to use a kernel, a function that explicitly maps values of dimension N to a dimension M, where N ≤ M, allowing a linear hyperplane to separate data points that are often not linearly separable in the original space. The explicit mapping is however computationally expensive, which is why the kernel trick was introduced.

The kernel trick takes the data points, which are always used in pairs, and instead of explicitly mapping them and computing a dot product, uses a function that directly produces a value that can be used as if it were the mapped dot product. These functions are used as kernel functions in the SVM, and are an important step in reducing the computational cost when classifying data.

The kernel function used by the SVM in this paper is the radial basis function kernel.
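As an illustration of how such a classifier might be set up with scikit-learn, a minimal sketch follows; the hyperparameters C and gamma and the dummy data are assumptions for illustration, not the values used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Illustrative data: 300 samples, 10 features, 3 classes.
X = np.random.rand(300, 10)
y = np.random.randint(0, 3, size=300)

# SVM with a radial basis function kernel; inputs are standardised first,
# mirroring the preprocessing described in section 6.1.5.
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='auto'))
svm.fit(X, y)
print(svm.predict(X[:5]))
```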

2.3.7 Random forest

Random forest is an ensemble method used for classification, regression and other tasks. An ensemble method can be explained as a group of weak learners combined to form a strong learner. The individual classifier, in this case a decision tree, is a weak learner and performs worse on its own than as part of a group of classifiers.

The decision tree has a structure similar to that of any unbalanced non-binary tree, and works by making decisions based on the information gained by splitting on a specific feature. This is then made random by giving the information calculation a random variable, by scrambling the trees randomly or simply by building them with a certain amount of randomness.

As mentioned above, a group of decision trees that are individually considered weak learners can be grouped to form a strong learner, a random forest. When running a random forest, new input is passed through all the decision trees. The output can then be calculated from all of the terminal nodes, which produce individual results, either by majority vote or by a weighted vote.
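A minimal scikit-learn sketch of such an ensemble is given below; the number of trees and the dummy data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: 300 samples, 10 features, 3 classes.
X = np.random.rand(300, 10)
y = np.random.randint(0, 3, size=300)

# An ensemble of randomised decision trees; the final class is decided by
# aggregating the votes of the individual (weak) trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```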

2.3.8 Zero rule

A common method for evaluating the effectiveness of an algorithm is to compare it with random guessing. This is however not very informative when the data is unbalanced. For this reason, the zero rule, or ZeroR, is often used as a baseline instead of random guessing.

Instead of guessing randomly for each and every class, the algorithm simply looks for the most frequently occurring class in the training set and classifies all data in the test set as that class. This results in a theoretically higher accuracy than random guessing can achieve.

ZeroR is most effective when creating a baseline for highly unbalanced data, since it will effectively attain the same accuracy as the relative size of the largest class, given a representative training set. For data similar to that in this paper, where one class represents about 70% of the data points and there are three classes, ZeroR would outperform random guessing by approximately 37 percentage points.
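ZeroR is simple enough to state in a few lines; a sketch follows (the class proportions in the dummy labels roughly mirror table 1, and scikit-learn's DummyClassifier with the 'most_frequent' strategy would be an equivalent off-the-shelf implementation).

```python
import numpy as np

def zero_rule_fit(y_train):
    """Return the most frequent class in the training labels."""
    classes, counts = np.unique(y_train, return_counts=True)
    return classes[np.argmax(counts)]

def zero_rule_predict(majority_class, n):
    """Classify every test point as the majority class."""
    return np.full(n, majority_class)

# With the class balance in this paper (~70 % 'unchanged'), ZeroR scores
# roughly 70 % accuracy, versus ~33 % for uniform random guessing over
# three classes -- a difference of about 37 percentage points.
y_train = np.random.choice([0, 1, 2], size=1000, p=[0.13, 0.74, 0.13])
majority = zero_rule_fit(y_train)
print(zero_rule_predict(majority, 5))
```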

2.3.9 Artificial neural networks

Artificial neural networks are inspired by and modelled after the brain. The network has connected neurons, where each neuron outputs information while receiving input from other neurons. One famous example is the model by McCulloch and Pitts [McCulloch and Pitts, 1943]. The neuron model used in their experiment has three main parts.

The first part is the input received from other neurons or from the system input. In a multilayer neural network, the input of all the layers except the first comes from the neurons in the preceding layers. In order to efficiently determine whether or not a neuron providing input should be listened to, the neurons are assigned weights. This generates an input signal to the receiving neuron equal to the weight of the sending neuron multiplied by the signal strength.

In the second part, representing the cell membrane, the neuron receives the weighted input and processes it. This process can be thought of as summing up the incoming signals.

The third part is the actual summation of the weighted inputs and the comparison of this value with the threshold determined by the neuron itself. If the value is higher than the threshold, the neuron will fire, sending a signal matching its weight to all the neurons getting their input from the firing neuron. After this step, the cycle begins again with the next neuron in the chain.

Viewing this process as a chain can be misleading, as it is in fact a network.

This allows for all the neurons in each layer to send their weighted signals to all the neurons in the next layer, creating much more of a mesh than a chain. This being said, a chain can be good for the conceptual understanding.


During the flow of signals through the network, learning takes place by adjusting the weights. This is the learning process in a neural network. During the learning phase, the network will iterate and autonomously adjust the weights in order to reduce the error.

At the start, the weights are randomly assigned, and therefore the result will most likely be poor. As the system iterates, the weights adjust to better match the desired output. When this process is complete, the network will be in a local or global minimum. This might seem like an issue, but there are methods of introducing noise in the neuron weights in order to improve the chances that the minimum obtained is the global minimum. After stopping, it is common for some of the neurons to be excluded from the process or to play an insignificant role, while other neurons might prove to be pivotal. This also mimics nature, as not all stimuli affect all reactions.

Activation functions

Activation functions, or threshold functions, are used in artificial neural networks. The purpose of these functions is to determine whether the neuron should fire or not, that is, whether the output should be one or zero, or any value in between. Not every sort of function is suitable for every data type.

One type of function used is a linear function. This linear neuron is one of the simplest and is limited computationally. The output is the sum of all inputs times their respective weights. This is generally not suitable in a neural network, seeing as any whole sequence of neurons in the network can be written as one neuron using linear combinations, thus removing the entire point of a multilayer network.

Another type of neuron is the binary threshold neuron. Its function sums the weighted inputs and if the sum exceeds the threshold, the neuron will send out a one, otherwise a zero.

A rectified linear neuron, hereinafter referred to as ”ReLu” is a combination of a linear neuron and a binary threshold neuron. The ReLu neuron computes a linear weighted sum of its inputs. If the sum, z, is below zero, the output will be zero otherwise the output will be equal to z.

The sigmoid is a commonly used activation type in artificial neural networks [Mueller and Massaron, 2016, p. 281]. One of the advantages of using this type of neuron is that the output is a smooth and bounded function which still represents the total input. It typically uses a logistic function.

Lastly, the softmax is a frequently used activation function used to output data from a neural network. This is practical since it gives the output as a probability distribution.

The networks in this paper use exclusively the ReLu activation function, with one notable exception: the output threshold being softmax. ReLu was chosen because it helps prevent a vanishing gradient when computing gradient descent [Maas et al., 2013].
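For reference, the two activation functions used by these networks can be written in a few lines of NumPy; this is a sketch of the mathematical definitions only, not of the Keras internals.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: 0 for z < 0, z otherwise."""
    return np.maximum(0.0, z)

def softmax(z):
    """Normalised exponentials: the outputs form a probability distribution."""
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / np.sum(e)

print(relu(np.array([-1.5, 0.0, 2.3])))    # [0.  0.  2.3]
print(softmax(np.array([1.0, 2.0, 3.0])))  # sums to 1
```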


Figure 1: ∆t = 1 minute

Figure 2: ∆t = 24 hours

3 Method

When designing a network, it is common to first theorise a layout and later implement it using the selected software. As this is a well-tested process, this paper aims to follow it.

3.1 Correlation

Dependencies among stocks are usually stronger in high-resolution data than in low-resolution data. In this paper, high-resolution data, i.e. one-minute data, is used. The dependency among stocks can be measured by their correlation. Figure 1 shows the minute-data correlation between the stocks of Swedbank, SWED-A, and SEB, SEB-A, two of the larger banks traded on the Nasdaq Stockholm exchange. The relation between stock volume, stock return and price volatility has been further examined by Karpoff [1987] and Chen et al. [2017].

One of the main differences between the model used in this paper and other models is that it requires very little data about previous events, e.g. earlier prices or returns. This is largely due to utilising the correlation, as suggested by Chen et al. [2017], and applying it to data with even higher resolution.

3.2 Theoretical design idea

As the goal of the network is very similar to that of the network designed by Chen et al. [2017], similar layouts were tested. This resulted in a multilayer network containing six hidden layers. Chen et al. [2017] use several techniques to exploit the fact that there are strong correlations between stocks in a selected sector, grouping the input from sectors together and thus greatly reducing the complexity of the network [Chen et al., 2017, p. 4–5]. This complexity reduction is used to reduce the computational cost of the network.

However, this method proved unnecessary for the model in this paper, seeing as the grouping by itself did not contribute to increased accuracy of the model, and the computer running the network was able to deliver predictions well below the one-minute mark.

Another technique used by Chen et al. [2017] is the feeding of input into the hidden layers of the network, and even into the last closing nodes. This type of late feeding is used to speed up the learning process, as it reduces the number of weights affecting the input and thus gives values known to have higher impact more influence on the output. The values used by Chen et al. [2017] were calculated using both price and traded volume. As the selection of input requires testing and cannot easily be theorised, several different parameters were tested, as further examined in section 6.2.1.

3.3 Practical programming

All programming in this paper was done in Python, using the following packages:

• Keras - neural network [Chollet et al., 2015]

• scikit-learn - preprocessing, SVM, random forest [Pedregosa et al., 2011]

• NumPy - matrix algebra [Walt et al., 2011]

• matplotlib - plotting [Hunter, 2007]

4 Earlier studies

4.1 J. Lindblad

Johan Lindblad examined and compared three different machine learning methods that all predict the price of a financial instrument [Lindblad, 2015]. In his paper, each method's input data, output data and implementation are thoroughly explained. Lindblad selects four stocks with corresponding data covering three days, analysing how well the different methods perform.

The paper gave an insight into which implementations and techniques are worth using and how they work, and acts as a very good foundation for understanding the equity trading system.

The paper’s conclusion is that although some results were similar to those associated with the original methods, they could not be reproduced to match the same prediction rate.

4.2 H. Chen et al.

Hao Chen, Keli Xiao, Jinwen Sun and Song Wu created a dual-layer neural network with the goal of predicting the movements of the 100 largest companies in the S&P 500, while also comparing their network with other types of networks [Chen et al., 2017]. This was done on short time intervals, making claims of high-frequency usage. Their article explains how the network was created, along with some key results.


The article is considered of importance to the goal of this thesis paper, since it will in many aspects try to recreate their results, although using another algorithm.

The article's conclusion is that dual-layer neural networks have a good possibility of being of actual use in real-life applications, and that they often outperform single-layer neural networks along with several other machine learning models.

5 Forecasting using neural networks

Forecasting values is one of the things neural networks excel at. When it comes to the choice of either categorising stocks into categories based on their return or trying to establish their specific continuous values, the former seemed the more surmountable task to undertake. Having decided the network's purpose, it is crucial to choose parameters fitting said purpose. The network constructed by Chen et al. [2017] has a similar goal, but instead focuses on the harder task of predicting specific values of return. Seeing as the network in this paper has the goal of performing a simpler but similar task, it seemed effective to use the same parameters as those used by Chen et al. [2017].

The differences between the network in this paper and the one in the paper by Chen et al. [2017] are that this network's design is much deeper and that the resolution of the data, along with the prediction period, is much shorter.

5.1 Neural network design

The network's design has been made significantly deeper than the one used by Chen et al. [2017] in an attempt to better fit the network to the data. The network consists of six layers, not counting the layers used by the Keras package to change the data matrices into valid input or the additional three layers used for regularisation by dropout. The network can be studied in detail in figure 3.

Another significant difference is the use of the ReLu activation function instead of sigmoid or tanh functions. This is to minimise the problem of the vanishing gradient. A further difference is the use of the softmax function at the end node instead of the sigmoid function.

The number of neurons in each hidden layer varies from 112 to 133, where the number of neurons for each layer has been decided using trial and error, along with common rules of thumb for constructing neural networks.

It should be noted in figure 3 that the parameters Q, EMA and ADs are fed into the network at increasingly late stages. The reasoning behind this is further explained in section 6.2.
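Since figure 3 cannot be reproduced here, a hedged sketch of how such a topology could be expressed with the Keras functional API is given below. The layer widths, dropout rates and input dimensions are assumptions chosen for illustration, not the tuned values of the actual network; the point is only how the extra inputs are concatenated into increasingly late hidden layers.

```python
from keras.models import Model
from keras.layers import Input, Dense, Dropout, concatenate

# Illustrative input dimensions (assumed): price/volume X, intra-interval
# proportions Q, exponential moving averages EMA and breadth indicators ADs.
x_in   = Input(shape=(200,), name='X')
q_in   = Input(shape=(300,), name='Q')
ema_in = Input(shape=(100,), name='EMA')
ads_in = Input(shape=(3,),   name='ADs')

# Six hidden ReLu layers; the extra inputs are fed in at increasingly late
# stages (Q at layer 1, EMA at layer 3, ADs at layer 5), as in section 6.2.
h = Dense(128, activation='relu')(concatenate([x_in, q_in]))   # layer 1
h = Dropout(0.2)(h)
h = Dense(128, activation='relu')(h)                           # layer 2
h = Dense(128, activation='relu')(concatenate([h, ema_in]))    # layer 3
h = Dropout(0.2)(h)
h = Dense(128, activation='relu')(h)                           # layer 4
h = Dense(128, activation='relu')(concatenate([h, ads_in]))    # layer 5
h = Dropout(0.2)(h)
h = Dense(128, activation='relu')(h)                           # layer 6
out = Dense(3, activation='softmax')(h)                        # 3 classes

model = Model(inputs=[x_in, q_in, ema_in, ads_in], outputs=out)
model.summary()
```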


Figure 3: Full multi-layer neural network

5.2 Algorithm

The algorithm used for optimising the neural network is gradient descent, or more specifically the Adam algorithm (Adaptive Moment Estimation). Adam is a hybrid between momentum-conserving algorithms and AdaGrad, an algorithm based on the gradient alone [Kingma and Ba, 2014].

The loss function used is categorical cross-entropy, a logarithmic loss function for measuring correctly classified data points. The choice of loss function seems given, since it is one of few widely used loss functions for multi-categorisation problems.

Lastly, since the data is heavily unbalanced due to issues further discussed in section 6.1.4, increased regularisation was needed. This was done in two ways: the first was adding class weights for misclassification costs, decreasing them logarithmically the more overrepresented the class; the second was adding dropout layers, which randomly drop a fraction of their output, forcing the network to adapt to the sudden loss of information.
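Continuing the architecture sketch from section 5.1, compiling and training such a model with Adam, categorical cross-entropy and class weights could look as follows. The class weight values, batch size and dummy input matrices are assumptions for illustration; only the optimiser, loss function and 100 epochs are taken from the paper itself.

```python
import numpy as np
from keras.utils import to_categorical

# 'model' is the multi-input network sketched in section 5.1 above.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

# Illustrative class weights: the overrepresented 'unchanged' class (1) is
# down-weighted roughly with its frequency (the exact values are assumed).
class_weight = {0: 1.0, 1: 0.4, 2: 1.0}

# Dummy stand-ins for the real input matrices described in section 6.2.
n = 256
inputs = [np.random.rand(n, 200), np.random.rand(n, 300),
          np.random.rand(n, 100), np.random.rand(n, 3)]
labels = to_categorical(np.random.randint(0, 3, size=n), num_classes=3)

model.fit(inputs, labels, epochs=100, batch_size=128,
          class_weight=class_weight)
```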

6 Data handling

6.1 Acquiring data

Acquiring data proved to be a cumbersome task. Instead of the original plan to access the order book of a stock market going back several years, the data used was scraped from the Stockholm exchange using Google's finance API. The data resolution is therefore much lower than anticipated. Seeing as Google only saves data going back fifteen open days, the experiment could only be run on data dating back 57 days, i.e. late February 2017 to late May 2017.

6.1.1 Selection of original data

The goal of gathering data was to have enough data to be able to train and run the network. In order to do so, high-dimensional data was required. Seeing as there is no definite limit to what constitutes high dimensionality, an arbitrary goal of 100 stocks was set. In order to achieve this goal, an even higher number of stocks was chosen, so as to still have 100 if any of the data proved unfit for training. The buffer was chosen to be an additional 20 stocks.

Initially, a selection was going to take place to ensure the stocks were chosen in a fair manner, but seeing as Stockholm Large Cap contains the one hundred and twenty largest stocks traded on the Stockholm exchange, it seemed fitting to scrape all stocks in said group.

Another reason to select stocks from the large cap is that there is a substantially lower risk of any company having their stocks removed, and thereby eliminated from the data set. An additional positive feature is that the stocks are traded fairly frequently, thus generating more data.

6.1.2 Accessing data

The data was accessed using Google's finance API, which in turn saves data from the Stockholm exchange. The data was scraped from the API using a program written in golang, and was then stored in separate files for each date scraped. The program doing the scraping was thoroughly tested in order to ensure that no data was systematically skipped, missed or dropped. The files were then sampled and compared with data available from other sources to confirm their validity.


6.1.3 Second selection process

After further analysis of the data from several dates, it appeared that some stocks were not traded every day. This could be for several reasons, ranging from trading halts to lack of interest. Since this rendered the data unusable, all stocks not traded each day were excluded, a total of eight stocks. No further selection of stocks was made.

6.1.4 Merging and scrubbing

The merging and scrubbing of files were also done using programs written in golang. Both of these were extensively tested, and the resulting files were sampled and compared to the originals in order to ensure that no data was corrupted.

The data was first merged into files, where one file covered each stock for the total time the data was gathered. The data in this file was later scrubbed, removing any unwanted information, such as the name of the exchange platform.

Lastly, the data needed to have matching dimensions between files in order to be usable for training the neural network. This was achieved using linear interpolation. Since this effectively interpolates the price between minutes where no trade occurred, and therefore no change in price happened, it should not be seen as a source of error. The traded volume for each of these interpolated data points was set to zero.
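A hedged NumPy sketch of this interpolation step follows; the variable names, the price series and the minute grid are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical raw observations for one stock: minutes since open at which a
# trade occurred, the price at those minutes, and the traded volume.
traded_minute = np.array([0, 1, 4, 5, 9])
traded_price  = np.array([100.0, 100.5, 100.5, 101.0, 100.8])
traded_volume = np.array([1200, 300, 450, 800, 150])

# Common minute grid shared by all stocks (here: ten one-minute steps).
grid = np.arange(10)

# Linear interpolation of the price for minutes without trades; the volume
# for those interpolated minutes is set to zero, as described above.
price  = np.interp(grid, traded_minute, traded_price)
volume = np.zeros_like(grid, dtype=float)
volume[np.isin(grid, traded_minute)] = traded_volume

print(price)
print(volume)
```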

6.1.5 Preprocessing

After gathering, scrubbing and combining all the data files, some preprocessing was needed. For this task the sklearn preprocessing function was used. This function transforms the given data to have zero mean and unit variance for each given time. This was done for all input parameters mentioned in section 6.2.1. This generally improves the rate at which a neural network can improve its accuracy.
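In scikit-learn terms this corresponds to standard scaling; a minimal sketch is shown below (whether the scale function or the StandardScaler class was used is not stated in the paper, and the array is made up).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical input parameters (price, volume) for a handful of time steps.
X = np.array([[100.2, 1200.0],
              [100.5,  300.0],
              [101.0,  800.0]])

# Transform each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~0 and ~1 per column
```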

6.2 Using the data for calculations

The rising computational power of today's processors and the amount of big data available have made neural networks more popular. In recent years, the interest in building financial tools using complex models in order to support decisions on securities has increased dramatically. As of yet, only a small number of articles and papers have been published on the subject of using neural networks to predict high-resolution data at high frequencies.

Inspired by the recent work by Chen et al. [2017], some new data parameters are used and tested in order to achieve higher prediction accuracy. These parameters are as follows:

Intra-interval proportions, Q, a group of indicators which describe the momentum of the market; EMAs, which stand for exponential moving averages; and the AD (Advance/Decline), ADv (Advance/Decline volume) and ADR (Advance/Decline Ratio) indicators, which indicate the breadth of the market. In the remainder of this paper the last three indicators will be denoted ADs. Ranked from low to high variance, the indicators are Q, EMA and ADs. Due to the variance of each indicator they are placed in different layers, where Q is placed in layer 1, EMA in layer 3 and ADs in layer 5, in order to achieve more stable performance.

6.2.1 Parameter calculation

In this section the formulas used are presented along with a short motivation of the chosen parameters.

X - Stock price and volume traded

The input layer includes the stock price each minute along with the volume traded that minute. These parameters are extracted for each minute and for each stock and fed as input to the network.

All parameters are normalised in order to achieve a more stable performance.

Y - Category of stock change

The output of the neural network is denoted Y and predicts the outcome at time t+1, i.e. whether the return of the stock, $X_{t+1} - X_t$, is negative, unchanged or positive. The classes were unevenly distributed, as can be seen in table 1.

It should be noted that the stocks used are all traded at discrete prices, meaning that there is no interval within which the stocks can change price and still be considered unchanged. Any stock changing price, regardless of the magnitude of the change, will not be categorised as unchanged. Y is divided into three categories, defined as follows: 0 = negative, 1 = unchanged, 2 = positive.

Class                     negative   unchanged   positive
Mean in %                 13.08      73.44       13.48
Standard deviation in %   7.84       15.88       8.05

Table 1: Distribution of data

Q - Intra daily data

The following features are used due to their dependency on short-term momentum movement. Q captures the relationship between the open, high, low and close prices. Q is defined by the following formula:

$$Q_{t,i} = \left(\frac{a_{t,i}}{d_{t,i}}, \frac{b_{t,i}}{d_{t,i}}, \frac{c_{t,i}}{d_{t,i}}\right)$$

where $a_{t,i} = \text{High} - \text{Open}$, $b_{t,i} = \text{Open} - \text{Close}$, $c_{t,i} = \text{Close} - \text{Low}$ and $d_{t,i} = \text{High} - \text{Low}$.
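A small sketch of the Q computation for one minute bar follows; the OHLC values are made up, and the handling of flat bars (high equal to low, as can happen for interpolated minutes) is an assumption not specified in the paper.

```python
def intra_interval_q(open_, high, low, close):
    """Intra-interval proportions Q = (a/d, b/d, c/d) for one minute bar."""
    a = high - open_
    b = open_ - close
    c = close - low
    d = high - low
    if d == 0:
        return (0.0, 0.0, 0.0)  # flat bar; this fallback is an assumption
    return (a / d, b / d, c / d)

print(intra_interval_q(open_=100.0, high=100.6, low=99.8, close=100.4))
```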


EMA - Market momentum

Another set of indicators describing the market movement is the exponential moving averages, EMA. The EMA is defined by the following formula:

$$EMA_t(m, S_{t,i}) = \alpha S_{t,i} + (1 - \alpha)\, EMA_{t-1}(m, S_{t,i})$$

where $\alpha = \frac{2}{m+1}$ and $m = 1, 2, \ldots, 10, 20, 30, 40, 50, 100$.

EMAs are used as stabilisers: the EMAs are smoother than the raw prices, which in turn brings more stability to the prediction.
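The recursive EMA definition above translates directly into a short function; a sketch follows (the price series is made up, and seeding the recursion with the first observation is an assumption, as the paper does not state how the series is initialised).

```python
import numpy as np

def ema(prices, m):
    """Exponential moving average with smoothing alpha = 2 / (m + 1)."""
    alpha = 2.0 / (m + 1)
    out = np.empty_like(prices, dtype=float)
    out[0] = prices[0]                      # seed with the first observation
    for t in range(1, len(prices)):
        out[t] = alpha * prices[t] + (1 - alpha) * out[t - 1]
    return out

prices = np.array([100.0, 100.5, 100.4, 101.0, 100.8])
print(ema(prices, m=10))
```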

AD - Advance/Decline quotas

The last set of parameters used is the ADs. They are quotas calculated using all stocks in the available data set, using the following formulas:

$$AD = AS - DS, \qquad ADv = AV - DV, \qquad ADR = \frac{AV}{DV}$$

where AS = number of advancing stocks, DS = number of declining stocks, AV = volume of advancing stocks, and DV = volume of declining stocks.

6.2.2 Data usage

The input parameters X, Q, EMA and ADs are all calculated for a specific time, and therefore the input data can be paired with a corresponding correct return label. After pairing input and correct return label, the data is split for each time t to create time slices, one for each minute, containing both input data and the correct label. Since the chosen method of prediction does not require any specific time order, the slices are shuffled randomly. This is done in order to avoid fitting the network to trends when training it.
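A hedged sketch of this slicing and shuffling step is shown below; the array names, the feature dimensionality and the dummy values are assumptions, while the shuffle and the 70/30 split mirror the procedure described in sections 6.2.2 and 7.

```python
import numpy as np

# Hypothetical per-minute feature matrix and labels, already time-aligned.
n_minutes = 500
features = np.random.rand(n_minutes, 50)   # X, Q, EMA and ADs for each minute
labels = np.random.randint(0, 3, size=n_minutes)

# Shuffle the time slices, since the method does not rely on time order,
# then split 70 % / 30 % for training and testing as in section 7.
idx = np.random.permutation(n_minutes)
split = int(0.7 * n_minutes)
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = features[train_idx], labels[train_idx]
X_test, y_test = features[test_idx], labels[test_idx]
```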

7 Experiment and result

For simplicity, the network with extra inputs between layers will be denoted ”HNN”, hand-crafted neural network. After training the HNN on 70% of the data and testing on the remaining 30% over 100 epochs, the accuracy is 74.72%. All the methods perform with an accuracy slightly above 70%. In comparison with the baseline, ZeroR, the HNN performs 1.21 percentage points better. In comparison with the regular neural network, NN, the performance is 1.01 percentage points better. The best result is produced by the SVM with an accuracy of 74.78%. The random forest performed the worst, placing 2.08 percentage points below the baseline.

Another aspect of the models was the time it took to train them and to obtain a prediction. As seen in table 2, the SVM was much slower than the other models, whereas the random forest was the fastest - the opposite order compared to accuracy. Both neural network models performed fairly similarly, both being significantly faster than the SVM but significantly slower than the random forest.

Further, both precision and recall were calculated for the NN and the HNN, both showing results very close to their categorical accuracy.

Model       NN          HNN         SVM         Random Forest   ZeroR
Accuracy    73.71%      74.72%      74.78%      71.43%          73.51%
Precision   73.78%      74.77%      -           -               -
Recall      73.64%      74.66%      -           -               -
Time        ∼ 110 min   ∼ 180 min   ∼ 380 min   22 min          2 min

Table 2: Results from experiments

7.1 Evaluation

The different models perform very similarly, none achieving an accuracy significantly higher than the baseline. This suggests that predicting stock changes using a neural network, or any of the other models used, with this type of instantaneous data is difficult.

It should however be noted that the baseline achieves such high accuracy because the data is unbalanced, something accounted for in the HNN model. The HNN model does in fact get results in the same range regardless of class, resulting in a more even prediction rate.

The HNN is outperformed by the baseline in roughly 50 percent of the cases, as seen in figure 5, but performs significantly better than the baseline in the other half. The last 10 percent outperform the baseline by 7.5–25 percentage points.

7.2 Overall performance

The network's computational performance is overall good, both learning and classifying data from a chosen stock in around one minute, averaging 69 seconds when run in a 64-bit Ubuntu environment on an Intel i5-6400 CPU at 2.70 GHz capped at 50% processing power. This implies that several networks can be trained in parallel and, given enough cores, be retrained before each new minute, producing newly trained predictions every minute. The most time-consuming task in the training is the reading of data from the ∗.data files containing all the price data for each stock.


7.3 Result

Figure 4: Accuracy of HNN

Figure 5: Difference HNN-ZeroR

Figure 6: Confusion matrix from HNN

The resulting prediction accuracy is near what was expected. In figure 6, it can be seen that the HNN model is heavily biased in accordance with the unbalanced data. However, the model favours the correct forecast for both unchanged and increased value. The model does seem to be too biased towards the unchanged label when it comes to decreases in value.

Seeing as the network more often than not predicts a correct result for the stocks tested, it seems likely that the architecture chosen for the network is valid, although the prediction model as a whole, using correlation only, seems less so.


Another feature is the low computational cost of the predictions, making it possible to predict multiple stocks at once using parallelism. In conclusion, this paper suggests that it is possible to predict whether a financial instrument will increase or decrease within a one-minute interval. Furthermore, this paper also suggests that with further training and testing, along with an optimisation of data reading, an even higher accuracy might be achieved, all within a one-minute interval.

8 Discussion

8.1 Sources of error

There are three things that should be regarded as the main sources of error in this paper. They are the following.

Data fitted to trend

The data was collected from late February until late May. Due to the limited time period, the data only covers part of a complete business cycle, which in turn means the neural network is likely to be trained to be better suited for a bull market, as was the case during this period, and not to cover the whole spectrum of trends. An argument against this is the fact that the neural network is trained for high-frequency trading, which is dominated by short-term momentum rather than long-term trends. Overall, the optimal amount of data is unknown, but can most likely be determined by further testing. This report has seen no indication of a change in accuracy when changing the amount of data anywhere between 10 and 57 days.

Class unbalance

The histogram in figure 4 shows the proportion of predictions achieving a specific accuracy. Most of the tested stocks have an accuracy between 50–70%, a few between 80–90% and then a larger proportion in the interval 90–99%. One possible explanation for this is that the most traded stocks have a more evenly distributed output due to the frequent trading, which is harder for the network to predict compared to stocks whose return stays unchanged. In the latter case most of the outputs are unchanged, which is easier for the network to train on and predict. This would explain why there are stocks with very high prediction accuracy, significantly above the model average.

This can be confirmed by looking at the outliers with accuracy > 90%. They all have in common that they are traded very infrequently, resulting in the HNN defaulting to always guessing unchanged, being in practice a ZeroR-model in those cases.


Neural network as a black box

The trained neural network can be seen as a black box, given that it is hard or impossible to foresee which patterns the model has learned most from. The securities have different characteristics, such as trading volume, trading frequency and elasticity, making it impossible to predict the outcome based on these characteristics. This makes the network difficult to debug or analyse, which in turn makes it challenging to optimise. It is therefore possible to be stuck using highly suboptimal models, as the significance of the parameters is practically impossible to predict.

8.2 Improvements and continued research

This paper has been written with the goal of answering the thesis question stated in section 1.2, and while doing so several other questions and potential implementation problems have emerged. The following are a selection of them.

Change of parameters

Since the network relies heavily on the parameters chosen, e.g. the size of the layers and the number of epochs, it would be interesting to test for an optimal combination of these. This could also incorporate a different choice of optimisation function and activation function, as well as other loss functions and/or data structures.

Change of input

In this paper a regular neural network, not receiving input in different layers, was used as a comparison to the HNN. In theory, these models should perform identically, or the NN model should even perform better, since the network itself should be able to weight the parameters similarly to the HNN. However, this is not the case, as the model trains for a limited number of epochs. An interesting topic is therefore whether both neural networks can achieve similar results with the data being given either all at once or in stages.

Optimisation

The goal of this paper was simply to evaluate the efficiency of an HNN, rather than trying to optimise the computation required in training and prediction. An interesting continuation would therefore be to try to parallelise the calculations and preferably run them on several GPUs instead of a limited number of CPUs. Another approach could be to optimise the memory handling and data structures used in order to store and read the data more effectively while doing the calculations.

Using less unbalanced data

One of the biggest problems with implementing the models in this paper was the highly unbalanced data. This led to the baseline models performing very well, significantly better than random guessing. This is mainly because of the reasons discussed in section 8.1, and could most likely be avoided by using stocks traded more frequently on other stock markets, e.g. the Bombay Stock Exchange or the New York Stock Exchange, thereby decreasing the need for interpolation.

8.3 Business and market impact

In this paper, the model tested performed reasonably well overall, but not nearly well enough to be used as a commercial product delivering data to algotrading bots. However, by making some assumptions, the impact on two different sectors can be discussed. The assumptions are as follows:

• The model used in this paper can very accurately predict future stock prices

• A majority of the algotrading bots would operate on similar data

• The model would be able to change and adapt to any change in trading behaviour

• No regulations or laws would be in place to prevent the above assumptions

Impact on fintech

Making the assumptions above, the model in this paper would have a significant impact on the fintech business - the markets would need to change significantly, as would the system as a whole. One could speculate that the markets would easily shift to a ”win more” situation, where the rich get richer, since return is most often a function of the capital used. It would also change the way in which trading is done, seeing as there would exist an oracle with the ability to predict the outcome of a trade.

Another possible scenario is that the markets would start to experience oscillation, as all models will accurately predict the change, causing the traders to act on this information. This would most definitely lead to great instability in the market, wreaking havoc with the global economy.

Impact on algotrading

Keeping with the assumptions above, the algotrading business would change as well. One possible scenario is, assuming there would still be an economy left to trade in, that the algotrading bots would become more similar to high-frequency trading bots. These bots operate on knowledge known to all operators on the market, utilising small discrepancies in price. Since this is visible to all operators, the one benefiting from it will be the operator with the least system delay. This could be the processing speed of the computers making the trades, the amount of code the computers need to run or simply the delay in the network. The high-frequency trading business is however highly regulated, with several notable incidents where algotrading bots went rogue, causing great market upset and in the end yielding substantial fines for the companies in charge. If algotrading bots were to operate on information given by an accurate algorithm, they too would need much regulation in order to be considered acceptable by the market.

9 Conclusion

In conclusion, the model built and tested does not provide any significant advantage over the other models used. If comparing only on results, the SVM should be the model of choice, since it had a marginally higher accuracy, and if only computational cost is considered, the random forest is by far the fastest model, being approximately 7.3 times faster than the HNN and 17.3 times faster than the SVM. If the goal is to minimise both time and error, the simple NN model must be considered more effective than the HNN, seeing as the difference in accuracy between them is 1.01 percentage points while the time cost for the HNN is nearly double.

There are however many improvements that can be made, and the data used should not be considered representative of all possible types of stock data. Since this is the case, the model still has potential to be used on other stock exchanges, using other time intervals or different network parameters and regularisation functions.

References

H. Chen, K. Xiao, J. Sun, and S. Wu. A double-layer neural network framework for high-frequency forecasting. Association for Computing Machinery Transactions on Management Information Systems, 7(4.11), 2017. doi: 10.1145/3021380.

T. C. Chiang, H.-C. Yu, and M.-C. Wu. Statistical properties, dynamic conditional correlation, scaling analysis of high-frequency intraday stock returns: Evidence from Dow-Jones and Nasdaq indices. Physica A, 388(8):1555–1570, 2009. doi: 10.1016/j.physa.2008.12.042.

F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

F. J. Fabozzi. The Handbook of Financial Instruments. Frank J. Fabozzi Series. John Wiley & Sons, 2003. ISBN 9780471445609.

J. D. Hunter. Matplotlib: A 2D graphics environment. Computing In Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.

G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2013. ISBN 9781461471370.

J. M. Karpoff. The relation between price changes and trading volume: A survey. The Journal of Financial and Quantitative Analysis, 22:109–126, 1987. doi: 10.2307/2330874.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

C. Lee. Encyclopedia of Finance. Springer US, 2006. ISBN 9780387262840.

J. Lindblad. Jämförelse av maskininlärningsmetoder för att förutspå aktieprisrörelser (Swedish) [A comparison of machine learning methods to predict stock price movement]. DD151X Examensarbete i teknik, Datateknik, kommunikation och industriell ekonomi, grundnivå, 2015. URL http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-188990.

A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.

W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943. doi: 10.1007/BF02478259.

T. M. Mitchell. Machine Learning. McGraw-Hill Education, 1997. ISBN 9780070428072.

J. Mueller and L. Massaron. Machine Learning For Dummies. –For Dummies. John Wiley & Sons, 2016. ISBN 9781119245513.

M. F. M. Osborne. Brownian motion in the stock market. Operations Research, 7(2):145–173, 1959. doi: 10.1287/opre.7.2.145.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Refco Private Client Group. The Complete Guide to Futures Trading: What You Need to Know About the Risks and Rewards. John Wiley & Sons, 2005. ISBN 9780471742166.

S. v. d. Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science and Engineering, 13(2):22–30, Mar. 2011. ISSN 1521-9615. doi: 10.1109/MCSE.2011.37. URL http://dx.doi.org/10.1109/MCSE.2011.37.

