DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

A comparative study on artificial neural networks and random forests for stock market prediction

VICTOR ERIKSSON

THUJEEPAN VARATHARAJAH


Degree Project in Computer Science, DD143X
Supervisor: Atsuto Maki

Examiner: Örjan Ekeberg

May 11, 2016


Abstract

This study investigates the predictive performance of two different machine learning (ML) models on the stock market and compares the results. The chosen models are based on artificial neural networks (ANN) and random forests (RF).

The models are trained on two separate data sets and the predictions are made on the next-day closing price. The input vectors of the models consist of six financial indicators based on the closing prices of the past 5, 10 and 20 days.

The performance evaluation is done by analyzing and comparing values such as the root mean squared error (RMSE) and the mean absolute percentage error (MAPE) over the test period. Specific behavior in subsets of the test period is also analyzed to evaluate the consistency of the models.

The results showed that the ANN model performed better than the RF model, as it had lower errors relative to the actual prices throughout the test period and thus made more accurate predictions overall.

Sammanfattning

This study investigates how well two different machine learning (ML) models can predict the stock market and then compares their results. The chosen models are based on artificial neural networks (ANN) and random forests (RF).

The models are trained on two separate data sets and the predictions are made on the following day's closing price. The input data of the models consist of six financial indicators based on the closing prices of the past 5, 10 and 20 days.

The performance is evaluated by analyzing and comparing values such as the root mean squared error (RMSE) and the mean absolute percentage error (MAPE) over the test period. Specific trends in subsets of the test period are also examined to evaluate the consistency of the models.

The results showed that the ANN model performed better than the RF model, as it exhibited smaller errors relative to the actual values over the whole test period and thus made more accurate predictions.

Contents

1 Introduction
1.1 Objective and problem statement
1.2 Scope
1.3 Outline

2 Background
2.1 Stock prediction
2.2 Machine learning
2.3 Artificial neural networks
2.4 Random forests
2.4.1 Decision tree
2.4.2 Regression tree
2.4.3 Random forest

3 Method
3.1 Data selection
3.2 Technical analysis indicators used in the input vector
3.3 ANN model
3.4 RF model
3.5 Performance metrics
3.5.1 Root mean squared error
3.5.2 Mean absolute percentage error

4 Results
4.1 Performance on WMT stock
4.2 Performance on IXIC index

5 Discussion and conclusion
5.1 Discussion of results
5.2 Discussion of method
5.2.1 Amount of data/observations
5.2.2 Partitioning of the data
5.2.3 The choice of technical indicators
5.3 Comments on the reviewed sources
5.4 Final conclusion

Appendix A Biological neural networks

Bibliography

List of Figures

List of Tables

Acronyms & abbreviations


Chapter 1

Introduction

Stock market prediction is an important field of interest among investors. Accurate predictions mean that one can potentially foresee events or trends, making investments more profitable. Making such predictions is difficult due to the complex nature of the market, which is influenced by a wide variety of factors. Several methods exist, and technological methods are one main branch that has become increasingly popular in recent years. These methods utilize algorithmic models in order to make the predictions. The problem can be seen as a time series prediction problem, which is the task of estimating a value based solely on previous values in a time series.

Two examples of technological methods from machine learning (ML) that have been used are artificial neural networks (ANN) and random forests (RF) [1, 2, 3, 4, 5]. An ANN is a model inspired by the human nervous system, and a RF is a collection of decision trees. Both models are suitable for applications like stock prediction as they have the ability to model complex structures such as nonlinear patterns. These methods make predictions by analyzing existing data, building a model to reflect the underlying nature of the data and then using this model to generalize to previously unseen data.

There exists a considerable amount of research on ANNs within the field of stock prediction, which has yielded some interesting results [1]. RFs have also been researched within the area, but have generally not attracted the same amount of attention as ANNs. However, in Caruana and Niculescu-Mizil's [6] large-scale empirical comparison of a set of supervised learning algorithms, the overall performance of the RF model was ranked higher than that of the ANN model. Thus an interesting question emerges regarding how ANN and RF models compare when applied to stock prediction. The goal of this study is to compare the performance between optimized implementations of ANNs and RFs.

1.1 Objective and problem statement

The objective of this thesis is to investigate and compare the predictive properties that artificial neural networks and random forests have on the stock market. The question to be answered is:


How do artificial neural networks and random forests compare to one another when applied as one-day prediction models on the stock market, and which of the two has the better predictive performance?

1.2 Scope

The ANNs and RFs will be implemented for the Walmart stock and the Nasdaq Composite stock index, respectively.

Due to the complexity of the models together with constraints in time and resources, the models will be implemented and optimized using built-in tools and classes in Matlab. To some extent, this also limits the level of optimization and alteration that can be made to the models.

Performance will be measured with the chosen metrics, and the conclusion will thus be based on these as well.

It is important to note that this study will strictly be of a comparative nature. There will be no analysis as to why one model performs better than the other. Instead, the study will draw its conclusion from comparing the performance results of both models on the chosen stocks.

There will not be any guidelines on how one can earn money from the stock market using these models, as this is not the aim of this study.

1.3 Outline

The remainder of this report is outlined in the following fashion. Chapter 2 presents relevant background information on the topics of stock prediction, machine learning, ANNs and RFs. Chapter 3 presents the methodology of the conducted study in terms of data selection, how the input vectors were constructed, and how the models were implemented and optimized. Chapter 3 also presents the metrics used to evaluate the performance of both models. Chapter 4 presents the results of the conducted study in terms of the performance metrics as well as graphs giving an overview of the results. Chapter 5 discusses the results and the methodology, offers source criticism and presents the final conclusion answering the problem statement of section 1.1. Appendix A gives a brief description of biological neural networks.


Chapter 2

Background

In this chapter, some of the key concepts regarding stock prediction, ML, ANNs and RFs are explained.

2.1 Stock prediction

Stocks represent a stake in certain kinds of companies in which one may invest money and be liable only in proportion to the amount invested. These companies are called limited companies and have a finite amount of stock that is partitioned into shares. When buying shares in such a company, one becomes a shareholder and thus an owner of some fraction of the company. The type of stock regulates the terms of ownership, and therefore shareholders may have different influence on the business operation [7].

A limited company can be listed on a stock exchange, which is a marketplace for trading shares, but shares can also be traded privately; the combination of all possible trading places is, in the context of this thesis, referred to as the stock market. The stock price at a moment in time is determined by the most recent price that a buyer and a seller agree upon. Consequently, one can say that the stock price is regulated by supply and demand [8]. The total amount of shares available in a company corresponds to the supply, while many different factors affect the demand, which may in turn have varying predictability. There are often several different prices to consider when analyzing a specific stock, but in this thesis only the closing price is used, i.e. the price of the last transaction just before the stock exchange closes for the day.

There are generally two branches of stock prediction: fundamental analysis and technical analysis. Fundamental analysis focuses on financial factors such as a company's balance sheet, market positions, credit value etc. and uses these to estimate the company's intrinsic value. Technical analysis, on the other hand, bases its estimates purely on historical data such as the stock price and different key indicators [8].


2.2 Machine learning

Machine learning can be seen as a subfield of artificial intelligence (AI). Where AI generally involves any computational model based on intelligence, ML concerns itself with developing algorithms with the ability to learn and improve themselves by being exposed to new data. In this regard ML is a natural part of AI, as AI involves learning as well as other cognitive attributes often related to intelligence, such as reasoning and perception [9].

An ML model can be described as a function representing the nature of the input data. ML can moreover be divided into two main branches: supervised learning and unsupervised learning. In supervised learning, the training set of the algorithm contains data points of inputs as well as their desired outputs. In unsupervised learning, the desired outputs are unknown to the learning algorithm [10]. The branch used throughout this thesis is supervised learning, because the implemented models have historical data points available in the form of dates in combination with stock prices.

ML can be further divided into regression and classification models. Classification models output a class, which is a value from a discrete set, while regression models output a continuous value [10]. This makes regression the appropriate choice for the application of this thesis, as the implementations must be able to output real-valued stock prices.

Ultimately, the goal of an ML model is to analyze a training set, find some underlying nature of the data and then be able to generalize this model to previously unseen data as well [11]. In supervised learning, this is done through the bias-variance tradeoff, which is the problem of simultaneously minimizing both the bias and the variance of a model. The bias is the difference between the predicted value and the actual value, while the variance indicates to what degree the values are spread out [12].

Bias:

$$B = \frac{1}{n} \sum_{k=1}^{n} \left( f_k - \hat{f}_k \right) \quad (2.1)$$

where $f_k$ are the true values, $\hat{f}_k$ are the predictions and $n$ is the number of points in the data set.

Sample variance:

$$s^2 = \frac{1}{n-1} \sum_{k=1}^{n} \left( f_k - \bar{f} \right)^2 \quad (2.2)$$

where $f_k$ is a value, $\bar{f}$ is the mean of all values and $n$ is the number of points in the data set.

High bias may cause underfitting, which means that relevant properties of the data are missed in the learning process. High variance may cause overfitting, which is the reverse of underfitting and means that the model is overly sensitive to certain properties of the training data. Both underfitting and overfitting result in a decreased ability to generalize well to unseen data [10].

The general learning approach of ML can roughly be described in three steps. The first step is to partition the total data set into three parts: the training data, the validation data and the test data. The second and third steps are the hyperparameter optimization and the model parameter optimization, respectively. Basically, the hyperparameters are the parameters that are adjusted before the actual learning, while the model parameters are adjusted during the learning. In this respect the hyperparameter optimization is a so-called meta-optimization task, meaning that it optimizes another optimizer, which in this case is the learning algorithm itself. There are a wide variety of methods for optimizing the hyperparameters, and this study takes the approach of using cross-validation together with loss function optimization. This means that the hyperparameters of an ML model are iteratively changed until the minimum of the chosen loss function, evaluated over the validation set, is found. The third step, the model parameter optimization, can also be done in different ways, as is the case for the two models in this study; see sections 3.3 and 3.4 [10].
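To make the three steps concrete, the following minimal Python sketch runs this workflow end to end on synthetic stand-in data; scikit-learn replaces the Matlab toolboxes actually used in the study, so the snippet illustrates the procedure rather than reproducing the thesis implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: 254 observations of 6 features (mimicking the
# indicator vectors of chapter 3) with an artificial target.
rng = np.random.default_rng(0)
X = rng.normal(size=(254, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=254)

# Step 1: partition chronologically into training, validation and test data.
i, j = int(0.70 * len(y)), int(0.85 * len(y))
X_tr, X_va, X_te = X[:i], X[i:j], X[j:]
y_tr, y_va, y_te = y[:i], y[i:j], y[j:]

# Steps 2 and 3: for each candidate hyperparameter value, the learning
# algorithm fits the model parameters; keep the candidate that minimizes
# the loss function (here RMSE) evaluated over the validation set.
best = min(
    (MLPRegressor(hidden_layer_sizes=(h,), max_iter=5000, random_state=0)
     .fit(X_tr, y_tr) for h in (3, 5, 7)),
    key=lambda m: mean_squared_error(y_va, m.predict(X_va)) ** 0.5,
)

# The test data is used only once, for the final performance evaluation.
print(mean_squared_error(y_te, best.predict(X_te)) ** 0.5)
```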

2.3 Artificial neural networks

ANNs are a collection of models inspired by the biological neural network (BNN) model of the human nervous system (see appendix A for a basic explanation of BNNs). Like BNNs, ANNs generally prove to be well suited for typical cognitive tasks such as pattern recognition and learning [13, 14]. Even though simplified compared to a BNN, the ANN models implement the basic structure of their biological equivalent and can be described as finite directed weighted graphs. Edges directed into vertices are inputs while edges directed out are outputs, and the collection of the m inputs or n outputs can be seen as m- or n-dimensional vectors, respectively.

ANN models have at least an input and an output layer, and the number of vertices in these is determined by the dimensionality of the problem at hand. An ANN can be viewed as a function as described in section 2.2; for the particular regression problem of this thesis, more specifically a function from $\mathbb{R}^6$ to $\mathbb{R}$.

One of the simplest ANN models is the feedforward neural network (FFNN), which can be modeled as an acyclic graph. There are many other kinds of ANN models, such as the FFNN's cyclic counterpart, the recurrent neural network (RNN), but FFNNs are more commonly used than their alternatives [15].

A more mathematically stringent formulation of a vertex in an ANN is that the jth vertex can be seen as a composite function $F(\vec{x}, \vec{w}) = (A \circ T)(\vec{x}, \vec{w})$, where $T$ is the transfer function, $A$ is the activation function, $\vec{x} = (x_1, \ldots, x_n)$ is the input vector and $\vec{w} = (w_1, \ldots, w_n)$ is a vector containing the edge weights for the corresponding inputs in $\vec{x}$, see figure 2.1 [13, 16].

The transfer function $T$ for the jth vertex is most often a summation function defined as the dot product of the input and weight vectors, or equivalently the weighted sum of all the inputs, $T = \vec{x} \cdot \vec{w}$. The output of $T$ is passed on to the activation function $A$ that determines whether the artificial neuron is activated or not. The activation function may or may not use the threshold value ($\theta$ in figure 2.1) to perform the evaluation [13]. Different activation functions are used depending on the model, but generally a distinction can be made between linear and nonlinear activation functions. The training of an ANN is done by the general principles of ML discussed in section 2.2, through exposing the network iteratively to known data and adjusting model parameters such as the edge weights correspondingly.

Figure 2.1: A basic outline of an artificial neuron. Source: https://en.wikibooks.org/wiki/File:ArtificialNeuronModel_english.png

One of the early ANNs, modeled in the 1960's by Frank Rosenblatt, is the single layer perceptron (SLP), a FFNN built on the concept of the perceptron algorithm [13, 17]. This algorithm can be defined as an artificial neuron that uses the Heaviside step function (2.3) as its activation function, which means that the output is a binary value dependent on the value of the transfer function in comparison with the threshold value $\theta$ [13], see figure 2.1.

$$H(\theta) = \begin{cases} 0, & \theta < 0 \\ 1, & \theta \geq 0 \end{cases} \quad (2.3)$$

The learning of a SLP is usually done according to the delta rule, which is an iterative gradient descent algorithm. This means that steps proportional to the negative gradient of the function are taken in order to find a minimum after a finite number of steps [17]. In the case of SLPs, the edge weights are first chosen randomly; then the desired output data is compared to the actual output data in a loss function and the weights are adjusted accordingly. A first limitation of this model is that the input data must be linearly separable, otherwise the training algorithm may not converge towards a solution in a finite number of steps, which means that there is a large number of fundamental examples that SLPs are unable to learn [13]. A second limitation is that the activation function can only perform binary classification, and thus regression applications cannot be modeled using SLPs.
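As an illustration of the composite function $F = (A \circ T)$ with the Heaviside activation (2.3), the following minimal Python sketch evaluates a single artificial neuron; the input, weights and threshold are made-up values.

```python
import numpy as np

def heaviside(v):
    # Eq. (2.3): 0 for v < 0, 1 for v >= 0.
    return np.where(v >= 0.0, 1.0, 0.0)

def neuron(x, w, theta):
    t = np.dot(x, w)             # transfer function T = x . w
    return heaviside(t - theta)  # activation A compares T with the threshold

x = np.array([0.5, -1.0, 2.0])   # illustrative input vector
w = np.array([0.4, 0.3, 0.2])    # illustrative edge weights
print(neuron(x, w, theta=0.1))   # -> 1.0, since x . w = 0.3 >= 0.1
```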

To overcome the limitations of the SLP model, the multilayer perceptron (MLP) was developed in the 1970's, a FFNN model that differs from SLPs in several respects. In addition to the input and output layers, the MLP model contains an arbitrarily large number of hidden layers between them.

MLPs also use more sophisticated learning methods, based mainly on backpropagation (BP), which is a generalization of the delta rule used in SLPs. BP algorithms may, aside from adjusting the edge weights, also shift the activation function by using a so-called bias vertex to allow it to trigger in a different manner [13]. The added layers and the BP algorithm provide the network with an internal structure that gives MLPs the ability to handle problems, such as non-linearly separable ones, that SLPs are unable to model appropriately [17, 13]. MLPs may also use other activation functions than the Heaviside step function, which means that MLP models can be used effectively both for classification and regression problems. Common activation functions used in MLPs are different linear functions, hyperbolic tangent functions and the sigmoid function [16]. The sigmoid function is the primary function in the implementations of this thesis and can be seen below in (2.4); it is continuous with range $y \in [0, 1]$, and such non-constant, continuous activation functions are vital for the functionality of the BP algorithm. Several different activation functions may be used in one single MLP model.

Figure 2.2: A multilayer perceptron (MLP) with one hidden layer. Source: https://upload.wikimedia.org/wikipedia/commons/c/c2/MultiLayerNeuralNetworkBigger_english.png

$$S(t) = \frac{1}{1 + e^{-t}} \quad (2.4)$$

There is a wide range of learning algorithms that implement BP with some additional optimization technique. The Levenberg-Marquardt algorithm (LMA), which is used in this thesis, utilizes the Gauss-Newton algorithm (GNA) for the optimization and MSE as the loss function. LMA is generally considered to be a fast and efficient choice among learning algorithms [18].

For ANNs, the training data is used by the learning algorithm to adjust the edge weights and the bias vertex of the network. The validation data is used during training to evaluate the loss function and determine when satisfactory performance of the network has been reached. It is also used to optimize the hyperparameters as described in sections 2.2 and 3.3. The test data is used purely for performance evaluation.


2.4 Random forests

2.4.1 Decision tree

A decision tree is a model of supervised learning that can be represented by a tree graph. A decision tree is thus a set of questions organized in a hierarchical manner, where, given an input vector, the decision tree estimates an unknown property by asking successive questions about its known properties [19]. Starting at the root, the interior nodes of the tree represent each question and the edges between them are the corresponding answers. As such, the question that is asked next depends on the previous answer. In following such a path, a decision is made once a terminal node, or leaf, has been reached.

Figure 2.3: An illustrative example of a decision tree that is used to figure out whether a photo (the input vector) represents an indoor or an outdoor scene. Source: Criminisi et al. [19] (p.88).

2.4.2 Regression tree

A regression tree is, as the name implies, a decision tree used for regression problems. The general idea is to use recursive partitioning on data that behaves in complicated nonlinear ways, in order to get smaller partitions with lower variance, allowing for better predictions [19].

The process involves partitioning the data space into smaller disjoint regions, and partitioning those sub-regions recursively in turn, so that we get chunks of space that have simple predictive models contained in them (to predict unknown values). This recursive partitioning is what is represented in the regression tree. As such, each interior node represents a split as well as a partition of space following from earlier splits. The interior nodes guide a path to the terminal nodes, each of which represents a cell of a partition and has a predictive model attached to it. It is the predictive model contained in the terminal node that decides the actual form of the prediction, and common models include using the constant mean or linear/polynomial regression [19].

Figure 2.4: Regression tree. (a) An illustrative example of a 2-dimensional regression tree with binary splits. The predictive model in the terminal node takes the mean of all data points contained in that partition. (b) An illustrative example of the partitioned space and the contained data points. Source: https://dzone.com/articles/regression-tree-using-gini%E2%80%99s

The performance of a regression tree mainly comes down to how well the tree is set up and partitioned. A common way to split a regression tree is to use a greedy approach where each split maximizes the variance reduction [19, 20]. This is done by initially choosing the data point and the split that maximize the variance reduction, and setting this as the root node. As there are finitely many data points in the data set, there are only finitely many splits that need to be considered. The same process is then applied recursively for each interior node. For instance, in a data space with a regression tree that does binary splits, the root node would represent the split on the data point that minimizes the sum of the variances of both partitions of the split. The interior nodes, which represent the subregions, would then go through the same process recursively.

The recursive partitioning continues until a stopping criterion has been reached, at which point the growing of the tree stops. A typical stopping criterion is that a partition contains some minimal number of observations (data points). Thus the leaf with the predictive model represents a partition that has met the stopping criterion [19, 21].

An overfitted tree may be a tree that has grown too much. This may occur when the stopping criterion's minimal number of observations is too low, causing the tree to generalize badly. The process of pruning, meaning removing unreliable branches after growing the tree, may improve its ability to generalize [20].
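To make the greedy splitting rule concrete, the following minimal Python sketch finds the single binary split of a toy 1-dimensional data set that minimizes the summed variance of the two partitions; the data is invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 5.1, 4.9, 5.0])   # two flat regimes

def split_cost(y_left, y_right):
    # Sum of the variances of the two partitions (0 for fewer than 2 points).
    var = lambda v: v.var() if len(v) > 1 else 0.0
    return var(y_left) + var(y_right)

best_cost, best_t = min(
    (split_cost(y[x <= t], y[x > t]), t)
    for t in x[:-1]          # finitely many candidate split points
)
print(best_cost, best_t)     # lowest cost at t = 3.0, between the two regimes
```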


2.4.3 Random forest

A RF is an ensemble learning technique that has been shown to have better generalization performance and overfitting avoidance than a single decision tree [19]. The introduction of RFs for classification and regression trees (CART) was first proposed in a paper by Leo Breiman [22]. As this study only uses regression trees, the term random forest will throughout this thesis refer to a random regression forest.

A RF is a collection of randomly trained regression trees that are collectively used to output a prediction. In a RF, there is no need to prune each individual tree after growing it, the main reason being its use of the random subspace method in a collection of trees [22]. The random subspace method involves the following process when growing each individual tree:

1. Randomly choose a subset of the training data for training the individual regression tree.

2. At each node, choose a random subset of features, and only consider splitting on these features.

The prediction of the RF model is the mean of the predictions of all individual trees in the RF. The use of a RF instead of a single regression tree has been shown to considerably decrease the risk of overfitting. As the individual trees are all randomly different from one another, their individual predictions are decorrelated, which makes the prediction of the RF generalize better [19, 22].

Hyperparameters that can affect the performance of a RF include the number of trees, the number of features considered at each node, and the stopping criterion [23].


Chapter 3

Method

To answer the proposed problem statement, this study was initially conducted by doing a literature study on the topics of stock market prediction, ANNs, RFs and ML in general. The literature study also involved reading available material on the subject of ML techniques for stock market prediction and in particular research material that used ANN and RF models as the models for prediction.

Suitable input vectors for one-day stock prediction were also researched and it was decided that the use of technical analysis indicators would be a relevant choice. This decision was made after reading similar studies of ANN and RF based stock prediction using technical analysis indicators as well [2, 4, 5].

Both predictive models and the necessary operations to obtain results were implemented in Matlab R2015a (8.5.0.197613). The steps taken in Matlab involved processing data, initializing the ANN and RF models, training and testing the models for prediction, as well as calculating the performance metrics.

3.1 Data selection

The input data set comes from the Walmart stock price (WMT) and the Nasdaq Composite stock index (IXIC). These were chosen due to differences in properties such as closing price levels and volatility, which gives a good basis for a comparison between different types of data. To gain fair and comparable results, the same amount of data needed to be gathered and used for both models.

When researching available studies of stock prediction using ANNs and RFs, the literature study showed that the amount of gathered data has varied considerably. Paluch and Jackowska-Strumillo [2] used one year of data in their ANN based stock prediction model. In contrast, Ticknor [1] used 734 days to gain optimal results in his study of an ANN based prediction model. Manojlovic and Stajduhar [4] used 1512 days of data in their paper on RF based stock forecasting.

In this study, the training, validation and test data consisted of 254 days for both models, as can be seen in figures 3.1 and 3.2. The data was gathered by initially fetching 274 days of closing price data from Yahoo Finance for WMT and IXIC in the timespan January 5th 2015 - February 4th 2016. The data was then processed into 6-dimensional input vectors with 254 observations representing the timespan February 3rd 2015 - February 4th 2016. The input vector consisted of technical indicators based on the closing prices of the 5, 10 and 20 days prior to the day to be predicted (see section 3.2).

Figure 3.1: The actual prices of the WMT stock in the interval between Feb 3 2015 - Feb 4 2016. Source: Matlab.

The last 15 % of the gathered data was chosen as the subset for testing performance. Thus the initial 85 % of the data was to be separated into training and validation sets. This was done by studying the properties of the gathered data set for both stocks. The proposed method's aim was to include unusual spikes and dips in the training data. For instance, in the WMT stock, when considering the initial 85 % of the 254 observations, there is a sudden dip in the closing price on day 177, and it reaches its lowest price at day 199 before starting to recover. With the proposed method, it was reasonable to have these days in the training data. In this manner, the partitioning for each stock was done as detailed below.

• The WMT stock was partitioned by setting 80 % as training data, 5 % as validation data and the remaining 15 % as test data to evaluate performance.

• The IXIC index was partitioned by setting 70 % as training data, 15 % as validation data and the remaining 15 % as test data to evaluate performance.


Figure 3.2: The actual prices of the IXIC index in the interval between Feb 3 2015 - Feb 4 2016. Source: Matlab.

This partitioning also means that both the ANN and RF models for the respective stocks had the same subset of data for building their structure and the same subset for testing their performance. Table 3.1 presents both partitioning schemes that were used.

      Training days   Validation days   Test days
WMT   1 - 204         205 - 216         217 - 254
IXIC  1 - 178         179 - 216         217 - 254

Table 3.1: Table presenting the partitioning of the data sets for each stock. Each stock had the same partitioning to build and evaluate both its ANN and RF models.
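As a rough illustration only, the raw download and the day ranges of table 3.1 could be expressed in Python as below; the thesis fetched its data from Yahoo Finance via Matlab, so the third-party yfinance package is an assumed stand-in here.

```python
import yfinance as yf

# Fetch raw WMT closing prices for the thesis timespan (yfinance's end
# date is exclusive); "^IXIC" would be the Nasdaq Composite ticker.
raw_close = yf.download("WMT", start="2015-01-05", end="2016-02-05")["Close"]

# After processing into 254 indicator observations (section 3.1),
# table 3.1 gives these 1-indexed, inclusive day ranges:
SPLITS = {
    "WMT":  {"train": (1, 204), "val": (205, 216), "test": (217, 254)},
    "IXIC": {"train": (1, 178), "val": (179, 216), "test": (217, 254)},
}

def take(series, day_range):
    """Slice a 1-indexed, inclusive day range out of an ordered series."""
    lo, hi = day_range
    return series.iloc[lo - 1 : hi]
```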

3.2 Technical analysis indicators used in the input vector

The most commonly used technical indicators are moving averages and oscillators [2, 4]. As such, the approach was to use 5, 10 and 20 day exponential moving averages (EMA) and rates of change (ROC) to construct the input vector.

Exponential moving average (N = 5, 10, 20 days):

$$\mathrm{EMA}_N(k) = \frac{C(k) + aC(k-1) + a^2C(k-2) + \ldots + a^{N-1}C(k-N+1)}{1 + a + a^2 + \ldots + a^{N-1}} \quad (3.1)$$

where $C(k)$ indicates the closing price on day $k$, and $a = 1 - 2/(N+1)$.

Rate of change (N = 5, 10, 20 days):

$$\mathrm{ROC}_N(k) = \frac{C(k) - C(k-N)}{C(k-N)} \quad (3.2)$$

where $C(k)$ indicates the closing price on day $k$.

The ROC measures the rate of price change over a given period. As such, a positive rate indicates an upward trend in the closing price, while a negative rate indicates a downward trend.
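As an illustration, a direct Python transcription of (3.1) and (3.2) could look as follows; `close` is assumed to be a NumPy array of closing prices ordered oldest first, with k >= N.

```python
import numpy as np

def ema(close, k, N):
    """Weighted average of the last N closing prices up to day k, eq. (3.1)."""
    a = 1.0 - 2.0 / (N + 1)
    weights = a ** np.arange(N)              # 1, a, a^2, ..., a^(N-1)
    window = close[k - N + 1 : k + 1][::-1]  # C(k), C(k-1), ..., C(k-N+1)
    return np.dot(weights, window) / weights.sum()

def roc(close, k, N):
    """Relative price change over the past N days, eq. (3.2)."""
    return (close[k] - close[k - N]) / close[k - N]

def input_vector(close, k):
    """The 6-dimensional input for day k: EMA and ROC for N = 5, 10, 20."""
    return np.array([f(close, k, N) for f in (ema, roc) for N in (5, 10, 20)])
```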

3.3 ANN model

As mentioned in the beginning of this chapter, the ANN model implementation as well as the pre- and post-processing of data was done in Matlab, where the implementation utilized the built-in Neural Network Toolbox. The model is a feedforward three-layered MLP. This network configuration is common, as one hidden layer is enough to model nonlinear functions, while additional hidden layers pose an increased risk of overfitting depending on the nature of the data [3]. Furthermore, models with multiple hidden layers are more likely to fall into a local minimum when training the network [24].

As suggested by the dimensionality of the data, the input layer consists of six nodes and the output layer of a single node. Determining the optimal value of the hyperparameter for the number of neurons in the hidden layer is not entirely straightforward, and a variety of recommendations exist on the subject. The following are some commonly used rule-of-thumb methods [25]:

• The size of the hidden layer is somewhere between the sizes of the input and output layers.

• The size of the hidden layer is 2/3 of the size of the input layer plus the size of the output layer.

• The size of the hidden layer is less than twice the size of the input layer.

With these three recommendations as a guideline, the hyperparameter was empirically determined to be optimal at five nodes. This was done by cross-validation and loss function optimization as described in section 2.2, with RMSE as the loss function (see section 3.5.1). In this particular situation, the estimated optimal size coincides exactly with that of the second recommendation above.


The activation functions used were a sigmoid function for the hidden layer and a linear function for the output layer. During the training of the network, the inputs and targets were normalized to the range [-1, 1] with the mapminmax function, and the trained network was configured to output the unnormalized values for better interpretability. The stopping criterion during training was that when the validation RMSE failed to decrease six successive times, validation checks halted and training terminated.

Figure 3.3: A schematic image reflecting the structure of the ANN model. Source: Matlab.

For backpropagation, the Levenberg-Marquardt algorithm was used. It is recommended in the API as a first choice due to its speed compared to some other alternatives [18]. The Bayesian regularization algorithm was tested without any noticeable improvement in performance, so Levenberg-Marquardt was deemed satisfactory. A schematic representation of the model is shown in figure 3.3.
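For readers without Matlab, the following sketch approximates the described network in Python with scikit-learn. It is an analogue rather than a port: MLPRegressor offers no Levenberg-Marquardt training, so L-BFGS stands in, and the data is a synthetic placeholder for the indicator vectors and next-day closing prices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(204, 6)), rng.normal(size=204)

ann = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),   # input normalization, like mapminmax
    MLPRegressor(hidden_layer_sizes=(5,),  # five hidden nodes (this section)
                 activation="logistic",    # sigmoid hidden layer, eq. (2.4)
                 solver="lbfgs",           # stand-in for Levenberg-Marquardt
                 max_iter=5000,
                 random_state=0),
)
ann.fit(X_train, y_train)                  # model parameter optimization
print(ann.predict(X_train[:1]))            # one-day-ahead style prediction
```

Note that MLPRegressor always uses a linear (identity) output layer for regression, matching the output activation described above.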

3.4 RF model

The RF models were implemented using the TreeBagger class in Matlab, and the implementation follows Breiman's RF model. It uses the mean of the observations in its terminal nodes to predict unknown values. The models for WMT and IXIC were individually tuned with the validation data to gain optimal performance. The tuned hyperparameters were NumTrees, NVarToSample and MinLeafSize, explained below.

• NumTrees - The number of regression trees to grow in the RF model.

• MinLeafSize - The minimum number of observations at each leaf, also known as the stopping criterion.

• NVarToSample - The number of features chosen at random for consideration at each split in the trees.

NumTrees was set to 100 trees for both WMT and IXIC. In general, more trees in a RF could lead to better performance, but this performance gain eventually converges to a limiting value [22]. In his original paper, Breiman used 100 trees for his RF model. In this study, testing was done to see if more than 100 trees had any positive effect on the RMSE (see section 3.5.1) of the validation data. As there was little to no gain, 100 trees was used as the hyperparameter value for both RF models.

MinLeafSize was set to 5 observations for both WMT and IXIC. This is the default value of the TreeBagger class. In order to test whether it was a suitable value, the performance was compared between models that used 5, 10 or 20 observations on the training and validation data. As 5 observations gave the lowest RMSE, it was chosen as the hyperparameter value.

NVarToSample was set to 2 features for the WMT stock and 4 features for the IXIC index. The values were chosen by iterating through possible values on each stock and choosing the value that gave the lowest RMSE on the validation set.

      NumTrees   MinLeafSize   NVarToSample
WMT   100        5             2
IXIC  100        5             4

Table 3.2: Table presenting the hyperparameter values set for each RF model for the respective stocks.
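As with the ANN model, a rough Python analogue of the TreeBagger configuration in table 3.2 can be sketched with scikit-learn's RandomForestRegressor; the WMT settings are shown, and the data is again a synthetic placeholder.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X_train, y_train = rng.normal(size=(204, 6)), rng.normal(size=204)

rf_wmt = RandomForestRegressor(
    n_estimators=100,     # NumTrees
    min_samples_leaf=5,   # MinLeafSize, the stopping criterion
    max_features=2,       # NVarToSample: features considered at each split
    random_state=0,
).fit(X_train, y_train)

# The forest's prediction is the mean of its trees' predictions (section 2.4.3).
print(rf_wmt.predict(X_train[:1]))
```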

3.5 Performance metrics

The chosen performance metrics are the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). These are relevant measures of performance as they are based on the bias error, i.e. they measure the errors of the predicted prices relative to the actual prices. The equations together with brief descriptions are presented in the subsections below.

3.5.1 Root mean squared error

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( \hat{f}_k - f_k \right)^2} \quad (3.3)$$

where $\hat{f}_k$ are the predicted values, $f_k$ are the true values and $n$ is the number of data points.

The RMSE is the square root of the average of the squared bias errors.

3.5.2 Mean absolute percentage error

$$\mathrm{MAPE} = \frac{1}{n} \sum_{k=1}^{n} \left| \frac{f_k - \hat{f}_k}{f_k} \right| \quad (3.4)$$

where $\hat{f}_k$ are the predicted values, $f_k$ are the true values and $n$ is the number of data points.


The MAPE measures the size of the bias errors in percentage terms. It is calculated as the average of the unsigned percentage errors.
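A direct Python transcription of (3.3) and (3.4) could look as follows; the two short arrays are made-up values for illustration.

```python
import numpy as np

def rmse(f_true, f_pred):
    # Eq. (3.3): square root of the mean squared bias error.
    return np.sqrt(np.mean((f_pred - f_true) ** 2))

def mape(f_true, f_pred):
    # Eq. (3.4): mean of the unsigned relative errors.
    return np.mean(np.abs((f_true - f_pred) / f_true))

f_true = np.array([60.0, 61.0, 62.0])
f_pred = np.array([59.0, 62.0, 62.0])
print(rmse(f_true, f_pred))  # ~0.8165
print(mape(f_true, f_pred))  # ~0.0110
```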

Chapter 4

Results

In this chapter, the results of the ANN and RF models with respect to both stocks are presented. This involves graphs illustrating the difference between the actual price and the one-day predicted price on the test set, as well as tables containing the performance metrics.

4.1 Performance on WMT stock

In figure 4.1, the graph presents the differences between the actual and the one-day predicted closing prices on the WMT stock using the respective models on the test set.

Figure 4.1: A graph presenting the predictions of the two models on the WMT stock against the actual prices during the 38 day test period ranging between days 217-254.


In table 4.1, the performance metrics for the models on the WMT stock are presented.

      RMSE     MAPE
ANN   1.1177   0.0137
RF    2.1480   0.0288

Table 4.1: Table presenting the RMSE and MAPE metrics for the WMT stock. The figures are given to 4 decimal places.

Both the RMSE and MAPE were better for the ANN model than for the RF model. This indicates that the ANN had better performance on the WMT stock.

4.2 Performance on IXIC index

In figure 4.2, the graph presents the differences between the actual and the one-day predicted closing prices on the IXIC stock index using the respective models on the test set.

Figure 4.2: A graph presenting the predictions of the two models on the IXIC stock index against the actual prices during the 38 day test period ranging between days 217-254.

In table 4.2, the performance metrics for the models on the IXIC stock index are presented.

      RMSE       MAPE
ANN   79.5055    0.0139
RF    135.5593   0.0236

Table 4.2: Table presenting the RMSE and MAPE metrics for the IXIC stock index. The figures are given to 4 decimal places.

Both the RMSE and MAPE were better for the ANN model than for the RF model. This indicates that the ANN had better performance on the IXIC stock index.


Chapter 5

Discussion and conclusion

In this chapter, the results and the methodology of the performed study are discussed. The reviewed sources used in this study are also commented upon in the manner of source criticism. Finally, the proposed problem statement described in section 1.1 is answered in the final conclusion.

5.1 Discussion of results

The results for both WMT and IXIC consistently show better performance for the ANN model compared to the RF model. The MAPE metric shows that the errors of the predictive models are all within a couple of percent, with 2.88 % being the highest, for the RF model's prediction on the WMT stock.

An observation of the WMT stock, seen in figure 3.1, is that there is a drastic price drop from around 67 USD to 57 USD between days 175 - 190. This is incidentally exactly the price range represented in the test data, except that the price development looks completely different there. As there are no other representations of this particular price range in the training data, this could potentially impact the models' abilities to accurately generalize these prices in the test period. The ANN model has better generalization performance in this case compared to the RF model.

When analyzing the results for the IXIC index in figure 4.2, a similar pattern is seen as for the WMT stock. The sudden divergence from day 235 and onwards is consistent with the hypothesis of bad generalization mentioned above, i.e. after day 235 the value drops unusually low compared to the training data in figure 3.2. Just as for the WMT stock, the ANN model here handles the lack of reliable training data better than the RF model, as it consistently shows lower biases for every data point throughout this period.

Further observations of the results might indicate that the quality of the validation data has different impacts on the two models' respective hyperparameter optimization processes. For the WMT stock prediction in figure 4.1, the RF model has a rather large bias consistently throughout the test period in comparison to the ANN model. In the IXIC index prediction in figure 4.2, prior to day 235, the difference in bias between the ANN and RF models is not as extreme. This might suggest that the hyperparameters have been better optimized for the IXIC models than for the WMT models. However, as the ANN model still generalizes better for both stocks, it seems to have been less sensitive to overfitting than the RF model. The chosen partitioning of the data is further discussed in section 5.2.2.

Overall, the ANN model had some noticeable spikes during the test period for both stocks, while the RF model was smooth to a larger extent. The spikes generally synced well with the changes of the actual price, even if they occasionally, in connection with very sudden changes, resulted in a bigger misprediction than usual. The cause of this may be that, when sudden changes with high variance in the price occur, the ANN model tries to follow the recent change but is too slow to estimate these rather extreme day-to-day changes. These kinds of scenarios may be harder to predict than those over a longer time period, because of the increased difficulty of generalizing such unique events from a limited set of training data.

5.2 Discussion of method

The complexity of ANNs and RFs means that there are several different hyperparameters that can be tuned to regulate their performance. When implementing the ANN and RF models, some of the hyperparameters were chosen based on the general literature study as well as our own tests of what could be regarded as optimal. In contrast, some other hyperparameters were set rather arbitrarily, generally due to the time constraint. Studying these factors further could have improved the results.

5.2.1 Amount of data/observations

One of the factors that may affect performance is the total amount of data collected for each stock. The literature study showed different reports using very different amounts of data. Choosing 255 trading days in total (about one year of trading days) was an amount that could have been adjusted by measuring performance using different amounts of data. It is possible that in the instances where the predictive models had difficulties generalizing (see section 5.1), using more historical data for training and validating the models could have increased their performance. As such, when considering the size of the data sets, it might be viable to also study the behaviour of the stocks. For instance, considering a factor such as the stock's volatility over a certain period of time could aid in deciding whether to use more or less data for that specific stock. Further studies of the stock and its behaviour could thus aid in selecting an optimal amount of days/observations.

However, studying different amounts of data and assessing the optimal amount would have required much time and resources, especially considering the need to choose the same amount of data for both the ANN and RF models. It should also be noted that the aim of this study is to compare the performance of ANN and RF models. Using the amount that was chosen in this study can have aided in noticing possible strengths and weaknesses of the respective models.

5.2.2 Partitioning of the data

The partitioning of the data into training, validation and test sets is a factor that may affect performance as well (as briefly discussed in section 5.1). The chosen amount of test days was set rather arbitrarily, and further large-scale studies to determine the optimal amount of test days for both models could improve the results. The quality of the validation data is also of importance when optimizing the models. As mentioned in section 3.1, the training data was somewhat arbitrarily chosen to include sudden spikes and dips in the closing price, and the validation data was the remaining set of days prior to the test days.

For the WMT stock in figure 3.1, it can be noticed that the stock price declines until around day 199, when it starts to recover. The partitioning presented in section 3.1 shows that the training data (days 1 - 204) captures prices of all possible levels, while the validation data (days 205 - 216) reflects a more local behaviour in this case, as these prices do not quite capture the volatility of all 254 days, and of the test data in particular. An implication of this may be that the validation data does not give an entirely fair reflection of the true nature of the stock, and thus the hyperparameters may be further optimizable to some extent.

In contrast, the IXIC index in figure 3.2 shows that in the training data (days 1 - 178), the range of prices is approximately in the interval from 4500 USD to 5200 USD. The validation data (days 179 - 216) consists of prices from 4800 USD to 5150 USD and matches the typical price development of the training data well, and to a large extent the test data as well. As discussed in section 5.1, this might suggest that the hyperparameter optimization process for the IXIC models was more effective than for the WMT models.

Distributing the training and validation set is a complicated task. Increasing the validation data and decreasing the training data might create a superior model in some cases, whilst in some other cases it might be the direct opposite. The complexity increases further due to the need for both the ANN and RF models to use the same partitioning scheme.

Overall, as both the ANN and RF models used the same partitioning scheme for the same stock, it would be viable to state that the chosen methodology was fair for the comparative analysis. The fact that the ANNs were able to perform better for both stocks, regardless of the possible shortcomings discussed above, is an interesting result. It can be used to strengthen the case for the use of ANN models within the field of stock prediction.

5.2.3 The choice of technical indicators

The decision to use technical indicators to process the input vector was made after reading similar studies using the same approach. Thus, there was no deeper knowledge of the subject of technical analysis apart from the general ideas, and the technical indicators chosen were based on what had been read in similar reports. Further research on the subject of technical analysis and existing indicators, and possibly using more indicators, could have improved the results and the methodology.

5.3 Comments on the reviewed sources

The primary sources used in this study include Bishop's [13] textbook on ANNs for pattern recognition, Breiman's [22] paper on RFs, as well as Criminisi, Shotton and Konukoglu's [19] review of RFs and their use cases. These were among the main sources used to gain knowledge on ANNs and RFs and were thus key sources for understanding, implementing and optimizing the models. The legitimacy of these sources can be regarded as strong, as all authors have extensive backgrounds within the subject.

Additional sources used to gain knowledge within these fields can also be regarded as strong, as they were generally published by respectable institutes and/or were quite extensive in their material.

Other key sources include various studies on the subject of implementing ANN and RF based models for stock prediction. The reliability of these sources could be questioned, as most reports summarized how the models were implemented rather than describing them exhaustively. However, as these studies were made within educational institutes, it would be justifiable to regard them as viable sources.

5.4 Final conclusion

The purpose of this study was to compare the predictive performance of ANNs and RFs for one-day predictions on the stock market. ANN and RF based models were implemented and their predictive ability was tested on two different stocks over an interval of 38 consecutive days. The results of the study indicate the following:

• Seen over the whole test period, the ANN model is less erroneous than the RF model for both stocks. As indicated in tables 4.1 and 4.2, the ANN model showed better RMSE and MAPE.

• The nature of the data has a definite impact on the performance of the models. The ANN model was able to generalize better than the RF model on unreliable training data, even if both models deteriorated to some extent on these occasions.

• There are a few isolated occasions in the test data where the RF model outperforms the ANN model.

According to these points, the ANN model showed better performance over the test period as a whole, was less sensitive to unreliable training data and proved to be the more consistent of the two. This makes the ANN model a better choice than the RF model for one-day stock prediction under these circumstances.


Appendix A

Biological neural networks

The intelligence of humans is dependent on the functioning and structure of the nervous system [26]. A neuron is a type of cell of the nervous system that is responsible for transmitting information through electrical and chemical signals, and a set of interconnected neurons is called a biological neural network (BNN). The human brain, which is the center of the nervous system, has been estimated to consist of up to $10^{11}$ neurons, where each neuron has up to $10^5$ connections to other neurons [27]. These connections are established through the dendrites and axon terminals that are interconnected via synapses, which allow signals to travel between neurons, see figure A.1.

Figure A.1: A basic illustrative example of a neuron. Source: https://upload.wikimedia.org/wikipedia/commons/8/86/1206_The_Neuron.jpg

The sum of the input signals determines whether the neuron will send out a signal through its axon, thus starting a potential chain reaction in the BNN. Furthermore, synapses have variable connection strengths, and this strength increases the more signals are passed through a synapse over time. Consequently, the passing signals will also be stronger between these kinds of neurons, as explained by Hebbian theory [28].


With each new learning experience the synaptic connections will adapt to configure the nervous system in order to store this knowledge for future scenarios. This is the central model that characterizes the learning process of humans [29].


Bibliography

[1] J. L. Ticknor, “A Bayesian regularized artificial neural network for stock market forecasting,” Expert Systems with Applications, vol. 40, no. 14, pp. 5501–5506, 2013.

[2] M. Paluch and L. Jackowska-Strumillo, "The influence of using fractal analysis in hybrid MLP model for short-term forecast of close prices on Warsaw stock exchange," in Computer Science and Information Systems (FedCSIS), Federated Conference, pp. 111–118, IEEE, 2014.

[3] L. S. Maciel and R. Ballini, "Design a neural network for time series financial forecasting: Accuracy and robustness analysis," Instituto de Economía, Universidade Estadual de Campinas, Sao Paulo-Brasil, 2008.

[4] T. Manojlovic and I. Stajduhar, "Predicting stock market trends using random forests: A sample of the Zagreb stock exchange," in Information and Communication Technology, Electronics and Microelectronics (MIPRO), 38th International Convention, pp. 1189–1193, IEEE, 2015.

[5] M. Kumar and M. Thenmozhi, “Forecasting stock index movement: A comparison of support vector machines and random forest,” in Indian Institute of Capital Markets 9th Capital Markets Conference Paper, 2006.

[6] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms," in Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168, ACM, 2006.

[7] P. H. Skärvad, Företagsekonomi 100. Liber, 2007.

[8] J. J. Murphy, Technical analysis of the financial markets: A comprehensive guide to trading methods and applications. Penguin, 1999.

[9] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Machine learning: An artificial intelligence approach. Springer Science & Business Media, 2013.

[10] E. Alpaydin, Introduction to machine learning. MIT press, 2014.

[11] P. Domingos, “A few useful things to know about machine learning,” Communica- tions of the ACM, vol. 55, no. 10, pp. 78–87, 2012.


[12] G. James, D. Witten, T. Hastie, and R. Tibshirani, An introduction to statistical learning, vol. 112. Springer, 2013.

[13] C. M. Bishop, Neural networks for pattern recognition. Oxford university press, 1995.

[14] L. Fausett, “Fundamentals of neural networks: architectures, algorithms, and appli- cations,” 1994.

[15] H. Jiang, T. Liu, and M. Wang, "Direct estimation of fault tolerance of feedforward neural networks in pattern recognition," in Neural Information Processing, pp. 124–131, Springer, 2006.

[16] V. Sharma, S. Rai, and A. Dev, "A comprehensive study of artificial neural networks," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 2, no. 10, 2012.

[17] K. Gurney, An introduction to neural networks. CRC press, 1997.

[18] MathWorks, "Levenberg-Marquardt backpropagation." http://se.mathworks.com/help/nnet/ref/trainlm.html. Accessed: 2016-04-11.

[19] A. Criminisi, J. Shotton, and E. Konukoglu, "Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning," Foundations and Trends in Computer Graphics and Vision, vol. 7, no. 2–3, pp. 81–227, 2012.

[20] M. Robnik-Šikonja and I. Kononenko, "Pruning regression trees with MDL," in Proceedings of the 13th European Conference on Artificial Intelligence, John Wiley & Sons, Chichester, England, pp. 455–459, 1998.

[21] B. Kitts, "Regression Trees," tech. rep., http://www.appliedaisystems.com/papers/RegressionTrees.doc, 1997.

[22] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.

[23] The Pennsylvania State University, "From Bagging to Random Forests." https://onlinecourses.science.psu.edu/stat857/node/181. Accessed: 2016-04-11.

[24] S. Karsoliya, "Approximating number of hidden layer neurons in multiple hidden layer BPNN architecture," International Journal of Engineering Trends and Technology, vol. 3, no. 6, pp. 713–717, 2012.

[25] G. Panchal, A. Ganatra, Y. Kosta, and D. Panchal, “Behaviour analysis of multilayer perceptrons with multiple hidden neurons and hidden layers,” International Journal of Computer Theory and Engineering, vol. 3, no. 2, p. 332, 2011.

[26] A. C. Neubauer and A. Fink, "Intelligence and neural efficiency," Neuroscience & Biobehavioral Reviews, vol. 33, no. 7, pp. 1004–1023, 2009.


[27] S. Herculano-Houzel, “The human brain in numbers: a linearly scaled-up primate brain,” Frontiers in human neuroscience, vol. 3, p. 31, 2009.

[28] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the theory of neural computation, vol. 1. Basic Books, 1991.

[29] M. Mayford, S. A. Siegelbaum, and E. R. Kandel, "Synapses and memory storage," Cold Spring Harbor Perspectives in Biology, vol. 4, no. 6, p. a005751, 2012.

List of Figures

2.1 A basic outline of an artificial neuron. Source: https://en.wikibooks.org/wiki/File:ArtificialNeuronModel_english.png
2.2 A multilayer perceptron (MLP) with one hidden layer. Source: https://upload.wikimedia.org/wikipedia/commons/c/c2/MultiLayerNeuralNetworkBigger_english.png
2.3 An illustrative example of a decision tree that is used to figure out whether a photo (the input vector) represents an indoor or an outdoor scene. Source: Criminisi et al. [19] (p.88).
2.4 Regression tree. (a) An illustrative example of a 2-dimensional regression tree with binary splits. (b) An illustrative example of the partitioned space and the contained data points. Source: https://dzone.com/articles/regression-tree-using-gini%E2%80%99s
3.1 The actual prices of the WMT stock in the interval between Feb 3 2015 - Feb 4 2016. Source: Matlab.
3.2 The actual prices of the IXIC index in the interval between Feb 3 2015 - Feb 4 2016. Source: Matlab.
3.3 A schematic image reflecting the structure of the ANN model. Source: Matlab.
4.1 A graph presenting the predictions of the two models on the WMT stock against the actual prices during the 38 day test period ranging between days 217-254.
4.2 A graph presenting the predictions of the two models on the IXIC stock index against the actual prices during the 38 day test period ranging between days 217-254.
A.1 A basic illustrative example of a neuron. Source: https://upload.wikimedia.org/wikipedia/commons/8/86/1206_The_Neuron.jpg

List of Tables

3.1 Table presenting the partitioning of the data sets for each stock. Each stock had the same partitioning to build and evaluate both its ANN and RF models.
3.2 Table presenting the hyperparameter values set for each RF model for the respective stocks.
4.1 Table presenting the RMSE and MAPE metrics for the WMT stock. The figures are given to 4 decimal places.
4.2 Table presenting the RMSE and MAPE metrics for the IXIC stock index. The figures are given to 4 decimal places.

Acronyms & abbreviations

AI    Artificial intelligence
ANN   Artificial neural network
BNN   Biological neural network
EMA   Exponential moving average
FFNN  Feedforward neural network
IXIC  NASDAQ Composite stock index
LMA   Levenberg-Marquardt algorithm
MAPE  Mean absolute percentage error
ML    Machine learning
MLP   Multilayer perceptron
MSE   Mean squared error
RF    Random forest
RMSE  Root mean squared error
ROC   Rate of change
SLP   Single layer perceptron
WMT   Walmart stock
