
Predicting Stock Market Price Direction with Uncertainty Using Quantile Regression Forest


U.U.D.M. Project Report 2020:50

Degree project in mathematics, 15 credits

Supervisor: Robin Eriksson, Department of Information Technology
Examiner: Erik Ekström

November 2020


Minna Castoe


Abstract

The ability to successfully and accurately forecast the trend in stock market price movements is crucial for both traders and investors due to its importance in influencing traders’ future decisions to either buy or sell an underlying asset which could yield significant profit. In recent years, Machine Learning algorithms in general and ensemble learning algorithms in particular have been successfully shown to generate high prediction accuracy of stock price direction.

However, predictions are commonly made in the form of a conditional mean, and since the market can be seen as stochastic, there is an underlying uncertainty that should be accounted for. The literature shows that Random Forest, which generates the mean prediction, has proved effective in stock price forecasting compared with other ensemble learning methods. Hence, we use Random Forest to deal with the stock prediction problem, as well as a generalization of this model, called Quantile Regression Forest, where the output from the predictor is not only the mean but also t-quantiles.

The main contribution of this paper is the study of the Random Forest classifier and Quantile Regression Forest predictors on the direction of the AAPL stock price over the next 30, 60 and 90 days. The stock prediction problem is constructed as a classification problem as well as a regression problem. The forecasting ability of the Random Forest classifier is assessed using the confusion matrix, from which four parameters are computed: accuracy, precision, sensitivity and specificity. The forecasting ability of Quantile Regression Forest, on the other hand, is assessed using standard statistical indicators such as RMSE and MAPE. Using seven technical indicators and all the historical time series data available for AAPL, from the day the company went public on December 12, 1980 to August 1, 2020, experimental results show that both Random Forest and Quantile Regression Forest accurately predict the direction of the stock market price, with accuracy over 90% for Random Forest and a small error, with MAPE between 0.03% and 0.05%, for Quantile Regression Forest.

Keywords— Random Forest Classifier, Quantile Regression Forest, Stock price prediction, Ensemble Learning algorithms, Technical indicators, Prediction intervals.


Acknowledgement

My sincerest gratitude goes to my supervisor, Robin Eriksson, who provided me with continual support, guidance and recommendations regarding the topic of my interest. His great effort helped me to overcome many of the difficulties that I encountered throughout the progression of this thesis.

I would also like to acknowledge with gratitude the love and support of my family and all my friends. This journey would not have been possible without them.


Contents

List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Problem Formulation
1.3 Aim
1.4 Outline
2 Literature Review
3 Data and Methodology
3.1 Experimental Design
3.1.1 Data Description
3.1.2 Exponential Smoothing
3.1.3 Technical Indicators
3.2 Prediction Models
3.2.1 Decision Trees
3.2.2 Random Forest
3.2.3 Quantile Regression Forest
3.3 Data Labelling
3.4 Model Evaluation Criteria
4 Experimental Results
4.1 Random Forest Classifier
4.1.1 Random Forest with α = 0.0095
4.1.2 Random Forest with α = 0.2
4.1.3 Random Forest with α = 0.95
4.2 Quantile Regression Forest
4.2.1 Prediction Intervals
4.3 Comparison between RFC and QRF
5 Discussion and Conclusion
5.1 Conclusion
5.2 Further Research
Bibliography


List of Figures

1 Daily AAPL closing prices from December 1980 to August 2020
2 Target 30 days
3 Target 60 days
4 Target 90 days
5 Prediction Models
6 Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.0095
7 OOB error rate, α = 0.0095
8 Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.2
9 OOB error rate, α = 0.2
10 Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.95
11 OOB error rate, α = 0.95
12 Mean Square Error, α = 0.0095
13 50% Prediction Interval for 30 days
14 95% Prediction Interval for 30 days
15 50% Prediction Interval for 60 days
16 95% Prediction Interval for 60 days
17 50% Prediction Interval for 90 days
18 95% Prediction Interval for 90 days
19 Variable Importance for RFC and QRF


List of Tables

1 Summary of the literature survey
2 Descriptive statistics of the response variable, α = 0.0095
3 Confusion Matrix for α = 0.0095
4 Results of random forest classifier, α = 0.0095
5 Confusion Matrix for α = 0.2
6 Results of random forest classifier, α = 0.2
7 Confusion Matrix for α = 0.95
8 Results of random forest classifier, α = 0.95
9 Results of Quantile Regression Forest
10 Accuracy with 3 different random splits of the AAPL data set
11 Results from Di (2014)
12 Results from Khaidem et al. (2016)
13 Results obtained using RFC with different alpha
14 Results from Vijh et al. (2020)
15 Results obtained using our model


List of Acronyms

EMH Efficient Market Hypothesis
ML Machine Learning
SVM Support Vector Machine
MA Moving Average
SVR Support Vector Regression
ANN Artificial Neural Network
LR Logistic Regression
RF Random Forest
RFC Random Forest Classifier
QRF Quantile Regression Forest
AUC-ROC curve Area under the receiver operating characteristics curve
CART Classification and Regression Trees
MACD Moving Average Convergence Divergence
A/D OSC Accumulation/Distribution Oscillator
RSI Relative Strength Index
ROC Price Rate of Change
OSCP Price Oscillator
CCI Commodity Channel Index
RMSE Root Mean Squared Error
MAPE Mean Absolute Percentage Error
MAE Mean Absolute Error


1 Introduction

This section provides a short background on the topic, discusses the problem that is investigated in this paper, states the aim and also describes the structure of the paper.

1.1 Background

The trends in stock market price refer to the future upward or downward movements of the price series, also called bull and bear trends respectively. Attempts to predict these trends successfully and accurately have spawned a variety of models and methods, two of which have been commonly used: technical and fundamental analysis (Malkiel, 1999).

Fundamental analysis is based on the study of demand and supply. A decrease in demand or an increase in supply tends to reduce the price, while an increase in demand or a decrease in supply will lead to a rise in stock prices (Atiya & Abu-Mostafa, 1996). Technical analysis, however, is performed mainly on a chart, so its basis is the pattern in the data. The past pattern of stock price behaviour is assumed to be rich enough in information to determine the future behaviour of a security (Malkiel, 1999; Fama, 1965). This is the main assumption behind several technical theories, also called chartist theories: the past behaviour of stock market prices tends to repeat itself, and hence the history of the stock market price can be used to predict its future trends.

In contrast, the random walk theory, which supports fundamental analysis, contradicts the chartist theories. It holds that the movements of stock prices are random, i.e. the movements are a series of independent and identically distributed random variables, and thus the past cannot be used to make a meaningful prediction of the future (Fama, 1965; 1995).

In the 1960s, Fama propounded the theory of the efficient market, which later resulted in him being awarded the Nobel Prize in Economic Sciences in 2013. The efficient market hypothesis (EMH) has been one of the most debated investment theories. It is linked with the idea of the random walk and states that stock market prices, at any point in time, fully reflect all available information, so that changes in stock price are random and thus unpredictable (Malkiel, 2003). In other words, asset price movements or fluctuations are unpredictable since all sellers and buyers in the market have the same information available.

Moreover, Fama (1970) provided a review of several theoretical and empirical studies on the EMH, which supported the random walk theory and showed that the financial market is random, and therefore that it is hard to predict future trends in the stock market price.

Neither technical analysis, which uses past stock prices to predict future prices, nor fundamental analysis, which attempts to analyze financial information, would give higher returns than could be obtained from a randomly selected portfolio of individual stocks (Malkiel, 2003). However, while many studies have supported this theory, many others have rejected it, arguing that stock prices do not follow random walks and that the random walk theory is therefore strongly rejected (MacKinlay & Lo, 1988; 2011).

The rejection of the EMH and the random walk theory, together with the lack of consensus on the validity of the EMH, has made the analysis of stock market price movements a challenging and disputed task. This has led to different approaches being used in financial forecasting and stock direction prediction. Shah et al. (2019) classified the approaches commonly used in stock analysis and direction prediction into four categories: statistical, pattern recognition, machine learning (ML) and sentiment analysis.

Ballings et al. (2015), however, grouped the methodologies used to predict the behaviour of the stock price into three categories: ML and data mining, technical analysis, and time series forecasting. Khaidem et al. (2016) classified the most used methodologies into four categories, adding the modelling and prediction of stock volatility using differential equations to the three previously mentioned.

Masoud (2014) also shed light on four approaches that can be used to forecast price trends: fundamental analysis, technical analysis, ML and time series forecasting.

Aside from technical and fundamental analysis, ML and its applications have come to play an integral role in financial analysis. In recent years, ML has had fruitful applications in finance and has become a useful tool for handling stock market direction prediction. ML algorithms started to be used for financial forecasting and stock price analysis in the early 1980s and have since become prominent among the major approaches (Vachhani et al., 2019). ML is a broad subfield of Artificial Intelligence, and besides stock price behaviour prediction and stock analysis, these algorithms have also been used in portfolio optimization, stock betting and credit lending (Vachhani et al., 2019).

ML algorithms can in turn be divided into four broad groups: supervised, unsupervised, semi-supervised and reinforcement learning algorithms. Supervised ML consists of two groups of algorithms, regression and classification, whereas unsupervised ML consists of clustering and association algorithms. Unsupervised ML algorithms, particularly clustering algorithms, have been used in finance mainly for financial risk analysis, covering any form of financial risk, e.g. credit risk, investment risk, business risk and operational risk (Kou et al., 2014).

Unlike unsupervised learning algorithms, which have mostly been used to determine connections in an unconnected or uncorrelated data set, supervised learning algorithms have become useful in providing efficient analysis of stock market prices and trends using historical data (Shah et al., 2019). Supervised ML algorithms have been playing an important role in financial forecasting and have proved effective in predicting future trends of the stock market price. This type of ML algorithm is used to make predictions of many phenomena, ranging from simple to complicated.

1.2 Problem Formulation

The prediction of the direction of stock market prices has become a crucial, highly challenging and controversial task in financial forecasting and analysis. Predicting the trends in stock market prices, i.e. whether a stock price would rise or fall has been an area of interest for many researchers and also for investors due to its importance in influencing the traders’ future decision to either buy or sell an instrument which could yield significant profit.

The fact that the stock market is fundamentally random, dynamic, nonlinear, complicated, nonparametric and chaotic in its nature (MacKinlay & Lo, 1988; Atiya & Abu-Mostafa, 1996; Masoud, 2014; Khaidem et al., 2016) makes the prediction of stock market movements a difficult task for researchers. However, accurate prediction of the trends in stock market prices can help traders and investors adjust their strategies for better trading in the future, thus increasing the opportunity of gaining profits and reducing the chances of losses, i.e. maximizing profit and minimizing loss. On the other hand, the behaviour of stock market prices is assumed to be affected by different factors, such as economic, political and natural factors, movements of other stock markets, market psychology, traders' expectations and choices, and other unexpected events. Hence, all these factors should be taken into account in financial forecasting and analysis.

Furthermore, the problem of stock direction prediction has been studied as a regression problem as well as a classification problem. However, treating it as a classification problem gives more accurate results than treating it as a regression problem, because the outcome or response variable in the classification algorithms is arranged to be binary, i.e. two-class: either upward trends (bull trends) or downward trends (bear trends).
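As a minimal sketch of this binary formulation, the following Python snippet labels each day as an upward or downward trend by comparing the price a fixed number of days ahead with the current price. The function name `label_direction`, the +1/-1 encoding and the toy price list are assumptions made for illustration only; they are not taken from the labelling procedure described later in this paper.

```python
def label_direction(prices, horizon):
    """Label each day +1 (upward) or -1 (downward) by comparing the
    price `horizon` days ahead with the current price.

    Days without a price `horizon` days ahead receive no label.
    The +1/-1 encoding is an illustrative assumption.
    """
    labels = []
    for t in range(len(prices) - horizon):
        labels.append(1 if prices[t + horizon] > prices[t] else -1)
    return labels


# Hypothetical closing prices, used only to show the shape of the output.
closes = [10.0, 10.5, 10.2, 11.0, 10.8, 10.1]
print(label_direction(closes, horizon=2))
```

A regression formulation would instead keep the future price, or the price difference, as a continuous response.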

Classification, which falls under supervised ML, has received considerable attention in recent years as an efficient predictive technique and one of the top ML approaches for predicting the future trends of stock market prices.

Among the major classification algorithms used to predict the direction of the stock market price, the following are worthy of attention: Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression (LR), Decision trees and Random Forest (RF).

1.3 Aim

The paper aims to propose and design an efficient method to predict the trend in stock market price movements. Two ML algorithms are used for the purpose of making a significant prediction, with seven technical indicators serving as input variables. The main focus of this paper is long-term prediction, represented by 30-, 60- and 90-day predictions. As RF has been successfully shown to generate high forecasting accuracy, we first consider the Random Forest Classifier (RFC), which focuses on the conditional mean of the response variable, and then generalize this model to one where the output from the predictor is not only the mean but also t-quantiles. This generalization of the Random Forest model is called Quantile Regression Forest (QRF). In addition, the prediction uncertainty in RF is taken into account via prediction intervals, and the robustness of the two prediction models is evaluated using different measures and parameters.

To the best knowledge of the author, there is no study that deals with the prediction of the stock market price using QRF. RF, however, has been widely used to capture both short- and long-term predictions of stock market price movements.

1.4 Outline

The remaining portion of the paper is organized as follows: section 2 contains a literature review of various classification and regression ML algorithms that are related to time series forecasting and have been used in order to predict the trends in stock market price. Section 3 describes the data set, the pre-processing of the data, computation of the technical indicators which serve as input variables, methods and algorithms employed and also the model evaluation criteria. Thereafter, empirical results from the real data sets are shown in section 4, and finally, the last section, section 5, discusses and compares the results obtained in this paper with results obtained in other papers, followed by some concluding remarks.


2 Literature Review

This section contains a literature survey of various classification and regression ML algorithms and data mining techniques that have been utilized for the purpose of predicting the trend in stock market price movements. This section also emphasizes the importance of technical indicators in machine learning.

Since the predictive model for the empirical study used in this paper is a type of ML algorithm, reviewing existing related work on ML approaches that have proved effective in stock market price forecasting allows us to conclude that stock market prices are to some extent predictable, and that ML is a useful technique for making a significant prediction of the trends of stock market prices, in both the short and the long term. The aim of the literature survey is to consider the most used algorithms that have been applied to predict the trend of stock market prices, and also to justify the selection of the variables used as predictors or input variables for the empirical analysis. The survey also takes into account the metrics and parameters used to evaluate the robustness and accuracy of these models, as well as the time span of the stock data used for direction prediction.

There is considerable evidence showing that supervised algorithms are effective tools in forecasting stock market trends. Several studies have focused on comparing different prediction algorithms in order to determine the superior algorithm. Ballings et al. (2015) used data from 5767 publicly listed European companies in order to forecast long-term stock price movements using several models. Ensemble methods, including RF, AdaBoost and Kernel Factory, as well as single classifier models such as ANN, LR, SVM and K-Nearest Neighbor, were compared, and the results showed that RF was among the top algorithms, followed by SVM, while LR was the weakest.

This study aimed to use different predictor variables in order to forecast stock price movements one year ahead. These predictor variables were selected based on prior studies and include cash flow yield, book-to-market ratio and size, stock price index, price/earnings ratio, inflation rate and money supply, as well as financial indicators including liquidity indicators (current ratio, collection period of receivables), profitability indicators (ROA, ROE, ROCE) and solvency indicators (gearing ratio, solvency ratio). In addition, the area under the receiver operating characteristics curve (AUC-ROC curve) was used as a performance measure for the models.

Kumar & Thenmozhi (2006) attempted to predict the daily movement direction of the S&P CNX NIFTY Market Index of the National Stock Exchange by applying two different supervised learning algorithms, RF and SVM, to a sample of 1360 trading days. The results obtained in this study were compared to other classification models that had been used in prior studies to predict the direction of stock market prices, e.g. ANN, Logit Model and Linear Discriminant Analysis.

This study used 12 different technical indicators to perform the short-term prediction: Relative Strength Index (RSI) and stochastic %K, Momentum, Commodity Channel Index (CCI), Price Oscillator (OSCP), 5- and 10-day disparity, Accumulation/Distribution Oscillator (A/D OSC), Larry Williams' %R (Williams %R), Price rate-of-change (ROC), Moving Average (MA) and MA of Stochastic (%D) and slow stochastic (Slow %D). The hit ratio was used as a measure of the performance of the models. The experimental results showed that SVM, with a hit ratio of 68.44%, outperforms both RF, with a hit ratio of 67.40%, and all the classification models from the other studies.

Another study by Ou & Wang (2009) explored the predictive ability of ten different data mining and ML algorithms in predicting stock price movements of the Hang Seng index of the Hong Kong stock market. Among the ten approaches, which included Linear and Quadratic Discriminant Analysis, K-Nearest Neighbor, Naïve Bayes, Logit model, tree-based classification, Neural Network, SVM, Least Squares SVM (LS-SVM) and Bayesian classification with Gaussian process, the results showed that SVM produces the best predictive performance for in-sample prediction, whereas LS-SVM is the best for out-of-sample prediction, in terms of the hit ratio and error rate criteria that were both used as performance measures. Instead of technical indicators, stock trading data were used as predictors, including high and low price, closing price of the S&P 500 index and the exchange rate between the Hong Kong dollar and the US dollar, with a sample of 1732 trading days.

Similar work has been done by Zhang & Dai (2013), where four ML algorithms were compared in order to determine the most effective algorithm in predicting both short- and long-term stock price trends. Using 16 features including PE ratio, PX Ebitda, current enterprise value, PX volume, 10-day volatility, 2-day net price change, 10- and 50-day MA, alpha overridable, quick ratio, alpha for beta pm, beta row overridable, IS EPS, risk premium and the corresponding S&P 500 index, the results showed that SVM has the highest prediction accuracy (79.3%) in the long-term prediction case (44 days), compared to LR, Quadratic Discriminant Analysis and Gaussian Discriminant Analysis, with a data sample of 1471 trading days for 3M stock. However, short-term prediction, representing a day or a week ahead, showed very low accuracy.

Rodriguez, PN. & Rodriguez, A. (2004) predicted the short-term movement of stock market prices. Seven different classification algorithms were compared and applied to predict the daily movements of three large emerging market stock indices, IPC (Mexico), Bovespa (Brazil) and KLSE Composite (Malaysia), with a sample period from Jan-1990 to Dec-2003. The technical indicators used in this paper are: 1- and 2-day ROC, 4-day Momentum, 14-day RSI and Stochastic, OSCP, and 14- and 21-day Disparity. RF was one of the best classification models for predicting the direction of the stock market among all the models, which included the LR model, Neural Networks model, Gradient Boosting Machine, tree-based model and PolyClass, where the AUC-ROC curve was used to evaluate the performance of the models.

The short-term prediction of stock price trend has also been the main focus for Di (2014). Di focuses only on the stock price trend in the near future, 1 to 10 days, by applying an SVM classifier to three well-known stocks, AAPL, MSFT and AMZN, and two market indexes, NASDAQ and S&P 500. In addition, 12 main technical indicators were used as features and applied to the time series data, which covered a set of 4 years, from January 2010 to December 2014. Using 5-fold cross-validation, the results showed an accuracy over 70% on predicting a 3 to 10-day average price trend and 56% on predicting the next day price trend.


Milosevic (2016) in turn applied ML algorithms to predict the long-term movement of stock market prices. The long-term prediction in this paper represented a one-year ahead prediction for a total of 1739 stocks selected from various indexes such as S&P 1000, FTSE 100 and S&P Europe 350. Some of the stocks were discarded in order to balance the data set, which thus ended up with 1298 data rows, of which 649 were labeled as good stocks and the other 649 as bad stocks. Good stocks are those whose price was 10% higher after the one-year period; the rest are classified as bad stocks.

Using eight different ML algorithms, C4.5 decision trees, SVM with sequential minimal optimization, JRip, LR, Naive Bayes, Bayesian Networks, Random trees and RF, and with 28 different financial indicators, the results showed that the best-performing algorithm is RF, with precision, recall and F-score of 75.1%. The performance of RF increased when the number of features was reduced from 28 to 11 financial indicators, in which case the precision, recall and F-score obtained is 76.5%.

The effectiveness of RF in forecasting the direction of stock market prices has also been confirmed by Khaidem et al. (2016). RF was used as the main predictive model in this paper and was applied to different stocks: AAPL and GE, which are both listed on Nasdaq, and Samsung Electronics Co. Ltd., which is traded on the Korean Stock Exchange.

The robustness of the model was evaluated by considering four parameters: accuracy, precision, recall and specificity, and also by plotting the ROC curve. With a high accuracy, in the range 85%-95% for long-term prediction, the authors presented RF as one of the most effective prediction models that can be used to predict the movement of stock market prices.
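The four parameters named above follow directly from the counts in a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `classification_metrics` and the example counts are hypothetical and do not reproduce any result from the cited studies.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from binary confusion matrix counts.

    tp, fp, tn, fn: true positives, false positives,
    true negatives and false negatives.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        # Recall is also called sensitivity.
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```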

The technical indicators which have been used in this study are: RSI, Stochastic Oscillator %K, Williams %R, Moving Average Convergence Divergence (MACD), ROC and On Balance Volume. In addition, Khaidem et al. (2016) suggested exponential smoothing in the historical stock data to improve the model’s capabilities in producing better results and giving higher accuracy.
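Exponential smoothing of a price series can be sketched as follows, assuming the common simple form in which the first smoothed value equals the first observation and each later value is a weighted average of the current observation and the previous smoothed value, S_t = α·Y_t + (1 − α)·S_{t−1}. The function name and the toy series are illustrative assumptions.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing with smoothing factor 0 < alpha <= 1.

    S_0 = Y_0 and S_t = alpha * Y_t + (1 - alpha) * S_{t-1}.
    A larger alpha gives more weight to recent observations.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


# Toy series, for illustration only.
print(exponential_smoothing([1.0, 2.0, 3.0], alpha=0.5))
```

With α close to 1 the smoothed series stays close to the raw prices, while a small α yields a much smoother curve that reacts slowly to new observations.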

In contrast, Vijh et al. (2020) showed that a comparison between two ML techniques, ANN and RF, indicates that ANN is a better technique than RF for predicting the next day closing price of a stock. This conclusion was drawn based on Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and Mean Bias Error (MBE), which were used to evaluate the robustness of the models.

The two models were applied to five companies from different sectors: Nike, JP Morgan Chase and Co, Johnson & Johnson, Goldman Sachs and Pfizer, with a data set covering ten years, from 4-5-2009 to 4-5-2019. The prediction of the stock closing price was carried out using six different variables, including some technical indicators: stock high minus low price (H-L), stock close minus open price (C-O), 7-, 14- and 21-day MA and past 7-day standard deviation (7-day STD DEV).

In addition to RF, another ensemble learning algorithm that has been commonly used to predict the direction of the stock market price is Xtreme Gradient Boosting. This model has been classified as a non-metric classifier model. Dey et al. (2016) applied this model to the historical stock price data of two stocks, Apple Inc. and Yahoo! Inc., to examine the short- and long-term prediction of the direction of the stock market price.
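For reference, the three error measures can be sketched in Python as below. The sign convention for MBE (predicted minus actual) and the expression of MAPE in percent are assumptions made here, since conventions vary between papers; the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    n = len(actual)
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


def mbe(actual, predicted):
    """Mean bias error, here taken as predicted minus actual (assumed convention)."""
    n = len(actual)
    return sum(p - a for a, p in zip(actual, predicted)) / n


# Hypothetical closing prices and predictions, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(rmse(actual, predicted), mape(actual, predicted), mbe(actual, predicted))
```

Under this convention a positive MBE indicates that the model overestimates on average, while RMSE and MAPE measure the size of the errors regardless of sign.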

Using six technical indicators: RSI, Stochastic Oscillator %K, Williams %R, MACD, ROC and On Balance Volume, results showed that Xtreme Gradient Boosting model gives better results than other forecasting models used in literature to predict the direction of stock market price, with accuracy over 87% for the long-term prediction.

This model has been proved to be much better than the traditional non-ensemble algorithm and also the metric and non-metric classifiers in terms of accuracy. The historical data was also smoothed exponentially, and accuracy, precision, recall and specificity as well as the RMSE and ROC curve have been used to evaluate the robustness of the model.

Concerning the ensemble learning algorithms, two tree-based classifiers have also been compared to non-ensemble models by Basak et al. (2019). In that paper, Xtreme Gradient Boosting and RF were used to predict the direction of the stock market price. The same technical indicators were used in this study, and the results, based on ten different companies (AAPL, AMS, AMZN, FB, MSFT, NKE, SNE, TATA, TWTR and TYO), showed that both models give an effective prediction of the direction of the stock market price, with RF outperforming Xtreme Gradient Boosting with an accuracy over 90% for medium- to long-term prediction. Exponential smoothing was also used here, and the performance of the two models was evaluated using different parameters including F-score, accuracy, precision, recall and specificity.

Prediction of the stock market index has also been studied by Patel et al. (2014a, 2014b) in two papers. The first paper focused on predicting the direction of movement of stocks and stock market indices. Ten years of historical data, from Jan 2003 to Dec 2012, were used for two stocks chosen from the Indian stock markets, Infosys Ltd. and Reliance Industries, and two stock price indices, CNX Nifty and S&P Bombay Stock Exchange (BSE) Sensex. Ten different technical indicators were used as predictors: simple 10-day MA, weighted 10-day MA, momentum, stochastic K%, Stochastic D%, RSI, MACD, Larry Williams' R%, A/D OSC and CCI.

This study used two different approaches for the input variables: the first approach used the technical indicators computed from the stock trading data, and the second approach treated these technical indicators as trend deterministic data. The comparison of four prediction algorithms, ANN, SVM, RF and Naïve-Bayes, under these two approaches showed that RF provided better prediction than the other prediction models in the first approach. The second approach was found to improve on the results of the first approach for all four prediction models. The performance of each of these models was evaluated using accuracy and F-measure.

The second paper used the same ten years of historical data for the same two indices from the Indian stock markets, CNX Nifty and S&P Bombay Stock Exchange (BSE) Sensex, together with the same technical indicators. The prediction models used here were divided into two approaches: a single-stage approach, where each of the three models ANN, RF and Support Vector Regression (SVR) is used on its own, and a two-stage fusion approach, which combines them into SVR-ANN, SVR-RF and SVR-SVR fusion prediction models. The evaluation measures used to assess the performance of these prediction models are MAPE, Mean Absolute Error (MAE), Mean Squared Error (MSE) and relative Root Mean Squared Error (rRMSE). The predictions are made for 1-10, 15 and 30 days ahead, and the results showed that SVR-ANN performs the best overall for both stock market indices.

On the other hand, ANN has received much attention and has been widely used for forecasting the direction of stock market prices. Senol & Ozturan (2009) applied the ANN model to historical data of 27 different stocks from the Istanbul Stock Exchange (ISE), with an average of 2250 trading days, and with five different technical indicators: 14-day Stochastic %K, 14- and 37-day MA, Stochastic MA %D and 14-day RSI. Different configurations of the ANN model were tested, with the technical indicators divided into seven different prediction systems. The prediction model was also compared to the LR model, and the results showed that the ANN model outperforms the LR model. Moreover, the ANN model with three technical indicators, 14-day RSI, 14-day Stochastic and Stochastic MA, gives the best results with the lowest average MSE.

Similar to the study by Senol & Ozturan (2009), which focused on the movements of the Turkish stock market, Masoud (2014) focused on the Libyan stock market. Statistical performance, measured by MAE, MSE, RMSE, R-squared (R2) and MAPE, as well as financial performance, measured by the prediction rate (PR), was estimated in order to evaluate the forecasting ability and the accuracy of the ANN model. Using a sample of 763 trading days and a mixture of 12 technical and fundamental indicators based on previous studies, including A/D OSC, CCI, Larry Williams' %R, MACD, Momentum, ROC, RSI, 10-day simple and weighted MA, Stochastic %K, MA of %K (Stochastic %D) and MA of %D (Stochastic slow %D), the results showed that the ANN model provides a significant prediction of the movements of the stock market price, with an average prediction rate of 91%.

A third study by Qiu & Song (2016) also confirmed that the ANN model is effective in predicting stock price direction. This study optimized the ANN model instead of using genetic algorithms (GA), for better results and higher prediction accuracy. The model was applied to the most widely used Japanese market index (Nikkei 225) in order to predict the direction of the next day's price of the index, using a sample of 1707 trading days.

In addition, two different types of predictor variables were compared. The first type included 13 technical indicators: Momentum, Larry Williams' %R, RSI, A/D OSC, CCI, ROC, Stochastic %K, MA of %K (Stochastic %D), MA of %D (Stochastic slow %D), OSCP, and 5- and 10-day Disparity, whereas the second type included only 9 of these 13 technical indicators. The results, using the hit ratio to evaluate the prediction performance of the model, showed that the second type of technical indicators gives better results than the first type, with a hit ratio of 81.27%. The ANN model was also compared to models from previous studies and was shown to have higher prediction accuracy than the other models.

Table 1 below summarizes the literature survey, listing the work that has been done by each author.


Table 1: Summary of the literature survey

| Author | Prediction Method | Features | Performance Measurements |
|---|---|---|---|
| Rodriguez, P.N. & Rodriguez, A. (2004) | RF, LR, NN, Gradient Boosting Machine, tree-based model and PolyClass | 8 Lagged technical indicators | AUC-ROC curve |
| Kumar & Thenmozhi (2006) | RF and SVM | 12 Technical indicators | Hit Ratio (68.44) |
| Senol & Ozturan (2009) | ANN | 5 Technical indicators | MSE |
| Ou & Wang (2009) | Linear and Quadratic discriminant analysis, K-Nearest Neighbor, Naïve Bayes, LM, tree-based classification, NN, SVM, LS-SVM and Bayesian with Gaussian process | 5 Stock trading data | Hit Ratio and Error Rate Criteria |
| Zhang & Dai (2013) | Gaussian Discriminant Analysis, Quadratic discriminant analysis, LR and SVM | 16 Features | Accuracy (SVM = 79.3%) |
| Masoud (2014) | ANN | 12 Technical and Fundamental Indicators | MAE, MSE, RMSE, R-squared and MAPE |
| Patel et al. (2014a) | ANN, SVM, RF and Naïve Bayes | 10 Technical Indicators | Accuracy and F-measure |
| Patel et al. (2014b) | ANN, RF and SVR | 10 Technical Indicators | MAPE, MAE, MSE and rRMSE |
| Di (2014) | SVM | 12 Technical indicators | 5-fold cross-validation (Accuracy) |
| Ballings et al. (2015) | RF, AdaBoost, KF, NN, LR, SVM, K-Nearest Neighbor | Financial indicators, profitability indicators and solvency indicators | AUC-ROC curve |
| Milosevic (2016) | RF, C4.5, SVM, JRip, LR, Naive Bayes, Bayesian Networks, Random trees | 28 and 11 Financial indicators | Precision, Recall and F-score |
| Dey et al. (2016) | Xtreme Gradient Boosting | 6 Technical Indicators | Accuracy, Precision, Recall, Specificity, RMSE and ROC curve |
| Qiu & Song (2016) | ANN | 13 and 9 Technical indicators | Hit Ratio (81.27) |
| Khaidem et al. (2016) | RF | 6 Technical indicators | Accuracy, Precision, Recall, Specificity and ROC curve |
| Basak et al. (2019) | RF and Xtreme Gradient Boosting | 6 Technical indicators | Accuracy, Precision, Recall, Specificity and F-score |
| Vijh et al. (2020) | ANN and RF | 6 Features | RMSE, MAPE and MBE |


3 Data and methodology

This section describes the data set used, provides the technical indicators with their formu- las which serve as input variables, shows how the data is labeled, and finally discusses the prediction models and their evaluation criteria.

3.1 Experimental design

3.1.1 Data Description

This study is based on the historical data of one company, Apple Inc. (AAPL). All of the data available for this company has been used, starting from the day it went public, with a time span ranging from December 12, 1980 to August 1, 2020. The data sample is obtained from Yahoo Finance and consists of the daily closing prices, with a total of 9,993 trading days. The entire data set was then split into two sets: 80% is used as in-sample data, i.e., the training data set, and the remaining 20% is used as out-of-sample data, i.e., the testing data set. The training data set is used to train the prediction model, whereas the testing data set is used to evaluate the trained model. Moreover, the historical data is first exponentially smoothed, and the technical indicators are then extracted from the daily closing prices.
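As a rough illustration of the 80/20 split described above, the following Python sketch divides an ordered series into a training part and a testing part. It assumes that the split is chronological (the earliest 80% of the days for training, the latest 20% for testing), which the text does not state explicitly. The function name `chronological_split` and the placeholder data are hypothetical.

```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series into a leading training part and a trailing test part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(series) * train_fraction)  # index of the first test observation
    return series[:cut], series[cut:]


# Placeholder values standing in for the 9,993 daily observations (not real prices).
closes = list(range(9993))
train, test = chronological_split(closes)
print(len(train), len(test))  # prints: 7994 1999
```

With 9,993 observations this gives 7,994 training days and 1,999 testing days; a random split would give the same sizes but would mix past and future observations.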

The total data points of the AAPL daily closing price are plotted in Fig 1 below.

Figure 1: Daily AAPL closing prices from December 1980 to August 2020 (horizontal axis: date, January 1980 to January 2020; vertical axis: closing price, 0 to 90)


3.1.2 Exponential Smoothing

The time series data is first exponentially smoothed. Exponential smoothing is a technique for smoothing univariate time series data, i.e., data containing a single variable.

The main purpose of using this technique in this paper is to remove the noise and the random variation from the historical data, and hence to allow the prediction model to determine the price trend in the stock market price behaviour more easily, for both short-term and longer-term prediction.

Unlike averaging methods, e.g. simple averages and moving averages, which apply equal weights to the historical data, exponential smoothing methods apply an unequal set of weights. These weights decrease exponentially from the most recent to the most distant observations, which is why these methods are known as exponential smoothing methods.

The most recent observations are assumed to be more relevant and thus they are given more priority and assigned more weights.

Exponential smoothing includes several different methods, such as Holt's linear method, the Holt-Winters method and Pegels' classification. The simplest of these is single exponential smoothing (SES), which can be computed as soon as two observations are available. This type of exponential smoothing is used in this paper. The smoothed statistic for the next period of a series Y is calculated using the following formula:

$$S_{t+1} = S_t + \alpha (Y_t - S_t),$$

which can also be written as

$$S_{t+1} = \alpha Y_t + (1 - \alpha) S_t,$$

with

$$S_0 = Y_0,$$

where

$t$ is the time period ($t > 0$),
$Y_t$ is the actual observation,
$S_t$ is the smoothed statistic,

$\alpha$ is the smoothing factor, a constant between 0 and 1. The closer $\alpha$ is to zero, the stronger the smoothing and the more slowly the smoothed statistic responds to new observations. A larger $\alpha$ reduces the level of smoothing, and $\alpha = 1$ implies that the smoothed statistic is equal to the actual observation.
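The recursion above can be sketched in a few lines of Python. This is only a minimal illustration of the formula as written, with S_0 = Y_0 and S_{t+1} computed from Y_t and S_t; the function name `exponential_smoothing` and the sample values are hypothetical.

```python
def exponential_smoothing(y, alpha):
    """Return [S_0, ..., S_{n-1}] with S_0 = Y_0 and S_{t+1} = alpha*Y_t + (1 - alpha)*S_t."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if not y:
        return []
    s = [y[0]]  # initial condition S_0 = Y_0
    for t in range(len(y) - 1):
        s.append(alpha * y[t] + (1 - alpha) * s[t])
    return s


prices = [10, 12, 11, 15]  # illustrative values, not real data
print(exponential_smoothing(prices, 0.5))  # prints: [10, 10.0, 11.0, 11.0]
```

With `alpha = 1` the output is the series shifted by one period (each S_{t+1} equals Y_t), and with `alpha = 0` every smoothed value stays at Y_0.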

3.1.3 Technical Indicators

Technical indicators are useful tools that can be used to advance the technical analysis. These indicators help investors to make decisions regarding the buying and selling of the stocks and thus create a better understanding of the price action by determining what stocks to buy and what stocks to sell and more importantly when to do that.

The efficiency of the technical indicators in analyzing future trends has been agreed upon by many investors and financial managers. Technical indicators and their corresponding parameters are exploited by investors to check for bullish and bearish signals, which can further help investors make decisions regarding entry to and exit from the market.

The two main types of technical indicators are lagging and leading indicators. Lagging indicators, also called trend-following indicators, are those that follow the price action and thus move after prices move, whereas leading indicators are those that change before prices change and therefore lead price movements (Larson, 2012). Technical indicators can also be grouped by function into four important types: trend, momentum, volume and volatility indicators.

In the light of prior studies, technical indicators have been used as input variables in the construction of the prediction model to predict the direction of the stock market prices. Thus, the feature selection in this paper is based on the most used technical indicators that produced significant results with RF technique and the other ensemble learning techniques that are used as prediction models in prior studies. Description of the technical indicators used in this paper as well as their formulas are given below.

• Moving Average Convergence Divergence

The moving average convergence divergence (MACD) is a trend-following momentum indicator that helps investors understand whether a bearish or bullish movement in prices is becoming stronger or weaker. This indicator was developed by Gerald Appel. It turns two moving averages of prices into a momentum indicator by subtracting one from the other, and thus shows the relationship between them. It is computed by subtracting the 26-day exponential moving average of a security's prices (the longer moving average) from the 12-day exponential moving average (the shorter one). The line obtained from this calculation is called the MACD line, and the 9-day exponential moving average of the MACD line is called the signal line, which serves as a trigger for buy and sell signals: MACD indicates a buy signal whenever it is above the signal line and a sell signal whenever it is below the signal line.

The formula for calculating MACD is as follows:

$$\mathrm{MACD} = \mathrm{EMA}_{12}(C) - \mathrm{EMA}_{26}(C)$$

$$\mathrm{SL} = \mathrm{EMA}_{9}(\mathrm{MACD})$$

where,

MACD stands for moving average convergence divergence or MACD line, and SL stands for the signal line,

$\mathrm{EMA}_n$ = n-day exponential moving average,
$C$ = closing price
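The two MACD formulas can be sketched as follows. The text does not specify how the exponential moving average itself is computed, so this sketch assumes the common convention of a smoothing factor 2/(n + 1) and an EMA seeded with the first value; both are assumptions, as are the function names `ema` and `macd` and the sample data.

```python
def ema(values, n):
    """n-day exponential moving average (assumed: alpha = 2/(n+1), seeded with the first value)."""
    if not values:
        return []
    alpha = 2.0 / (n + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (MACD line, signal line) as MACD = EMA_12(C) - EMA_26(C), SL = EMA_9(MACD)."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    return macd_line, signal_line


closes = [float(p) for p in range(1, 61)]  # illustrative steadily rising prices
macd_line, signal_line = macd(closes)
print(macd_line[-1] > 0)  # prints: True (the shorter EMA tracks a rising price more closely)
```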

• Relative Strength Index

Relative strength index (RSI) is a popular momentum oscillator that was developed by J. Welles Wilder. It evaluates overbought and oversold conditions in stock prices by measuring the extent of recent price changes. The RSI compares a stock's average gains and losses over a specific period of time, typically 14 trading days. RSI ranges between 0 and 100, and traditionally, RSI above 70 indicates that the stock is overbought, while RSI below 30 indicates that the stock is oversold.

In this paper, we use a 27-day time-frame to calculate the initial value of the RSI. The formula for calculating RSI is:

$$\mathrm{RSI} = 100 - \frac{100}{1 + \mathrm{RS}}$$

$$\mathrm{RS} = \frac{\text{Average gain over past 27 days}}{\text{Average loss over past 27 days}}$$

where,

RSI stands for relative strength index, and RS stands for relative strength.
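A minimal sketch of the RSI calculation is given below. It assumes simple (unweighted) averages of the daily gains and losses over the window, and it returns 100 when there are no losses in the window so that the division is defined; both choices are assumptions not taken from the text, and the function name `rsi` is hypothetical.

```python
def rsi(closes, n=27):
    """RSI of the latest day from the last n daily price changes (simple averages assumed)."""
    if len(closes) < n + 1:
        raise ValueError("need at least n + 1 closing prices")
    start = len(closes) - n
    changes = [closes[i] - closes[i - 1] for i in range(start, len(closes))]
    avg_gain = sum(c for c in changes if c > 0) / n
    avg_loss = sum(-c for c in changes if c < 0) / n
    if avg_loss == 0:
        return 100.0  # assumed convention: no losses in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


# Illustrative series with equal gains and losses, using a short window for readability.
print(rsi([10, 11, 10, 11, 10], n=4))  # prints: 50.0
```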

• Price Rate of Change

The price rate of change (ROC) is another momentum oscillator; it calculates the percent change between the current price and the price n periods ago. In other words, ROC measures the change of the current price relative to the closing price n days earlier. It moves between positive and negative values, fluctuating above and below the zero line. This oscillator can be used to determine overbought and oversold conditions, divergences and zero-line crossovers.

We use a 21-day time-frame to calculate the initial value of the ROC. The formula for calculating ROC is as follows:

$$\mathrm{ROC} = \frac{C_t - C_{t-21}}{C_{t-21}} \times 100$$

where,

ROC stands for price rate of change at time t,
$C_t$ = closing price at time t,
$C_{t-21}$ = closing price 21 periods ago
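As an illustration, the ROC formula can be applied to a list of closing prices as in the sketch below. The function name `roc` and the sample prices are hypothetical; the first value is produced for the first day that has a price 21 periods earlier.

```python
def roc(closes, n=21):
    """Percent change between each closing price and the closing price n periods earlier."""
    return [
        100.0 * (closes[t] - closes[t - n]) / closes[t - n]
        for t in range(n, len(closes))
    ]


# Illustrative prices: flat at 100 for 21 days, then 110 on the 22nd day.
closes = [100.0] * 21 + [110.0]
print(roc(closes))  # a single value of about 10, i.e. a 10% rise over 21 periods
```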

• Stochastic Oscillator

The stochastic oscillator, which is often denoted by the symbol %K, is a momentum oscillator that was developed by George Lane. It identifies the location of the stock's closing price relative to the high-low range of the stock's price over a period of time, typically 14 trading days. The stochastic oscillator varies from 0 to 100; a reading above 80 generally represents overbought conditions, while a reading below 20 represents oversold conditions. We use a 14-day time-frame for %K. The formula for calculating the stochastic oscillator is given below:

$$\%K = \frac{C_t - L_{14}}{H_{14} - L_{14}} \times 100$$

where,

$C_t$ = the current closing price,
$L_{14}$ = lowest low over the past 14 days,
$H_{14}$ = highest high in the last 14 days
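The %K formula can be sketched as below for the most recent day. The sketch assumes that the 14-day window includes the current day and raises an error when the window is completely flat (highest high equal to lowest low), since the ratio is then undefined; the function name `stochastic_k` and the sample data are hypothetical.

```python
def stochastic_k(highs, lows, closes, n=14):
    """%K of the latest day: position of the close within the n-day high-low range."""
    if not (len(highs) == len(lows) == len(closes)):
        raise ValueError("highs, lows and closes must have the same length")
    if len(closes) < n:
        raise ValueError("need at least n observations")
    highest_high = max(highs[-n:])
    lowest_low = min(lows[-n:])
    if highest_high == lowest_low:
        raise ValueError("flat window: %K is undefined")
    return 100.0 * (closes[-1] - lowest_low) / (highest_high - lowest_low)


# Illustrative data: highs 1..14, lows 0..13, closes midway between them.
highs = [float(i + 1) for i in range(14)]
lows = [float(i) for i in range(14)]
closes = [low + 0.5 for low in lows]
print(round(stochastic_k(highs, lows, closes), 2))  # prints: 96.43
```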


• Williams Percentage Range

Williams percentage range, also called Williams %R, is a common indicator developed by Larry Williams. This indicator is often denoted by the symbol %R. It measures overbought and oversold levels and works inversely to %K. Whilst %K ranges between 0 and 100, %R ranges between 0 and -100. A Williams %R below -80 indicates a buy signal, whereas a Williams %R above -20 indicates a sell signal.

We also use a 14-day time frame for %R. The formula used to calculate the Williams %R is:

$$\%R = \frac{H_{14} - C_t}{H_{14} - L_{14}} \times (-100)$$

where,

$C_t$ = the current closing price,
$L_{14}$ = lowest low over the past 14 days,
$H_{14}$ = highest high in the last 14 days
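Since %K and %R are defined over the same 14-day window, the statement that %R works inversely to %K can be made explicit. The short derivation below uses only the two formulas given in this section:

$$\%R = \frac{H_{14} - C_t}{H_{14} - L_{14}} \times (-100) = \frac{(C_t - L_{14}) - (H_{14} - L_{14})}{H_{14} - L_{14}} \times 100 = \%K - 100.$$

For example, with the purely illustrative values $H_{14} = 50$, $L_{14} = 40$ and $C_t = 48$, we get $\%K = 80$ and $\%R = -20$, so the two indicators carry the same information on shifted scales.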

• Commodity Channel Index

The Commodity Channel Index (CCI), developed by Donald Lambert, is an oscillator used to estimate the direction and the strength of the stock price trend. This indicator is also used to determine when stock prices reach overbought or oversold conditions. The CCI is calculated by first determining the difference between the typical (mean) price of a stock and the moving average of the typical prices, and then comparing this difference to the mean deviation over a period of time, typically 20 days. The CCI is often scaled by an inverse factor of 0.015. The formula used to calculate the CCI is:

$$\mathrm{CCI} = \frac{\text{Typical price} - \mathrm{MA}_{20}}{0.015 \times D}$$

where,

Typical price = average of the low, high and close prices, $(H + L + C) \div 3$,
$\mathrm{MA}_{20}$ = simple moving average of the typical price over 20 days, $\frac{1}{20}\sum_{i=1}^{20} (H_i + L_i + C_i) \div 3$,
$D$ = mean deviation of the typical price from $\mathrm{MA}_{20}$ over the same 20 days
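The CCI calculation can be sketched as follows for the most recent day. The sketch assumes that the mean deviation D is the mean absolute deviation of the typical price from its 20-day moving average and that the window includes the current day; the function name `cci` and the sample data are hypothetical.

```python
def cci(highs, lows, closes, n=20):
    """CCI of the latest day: (typical price - MA_n) / (0.015 * mean deviation)."""
    if not (len(highs) == len(lows) == len(closes)):
        raise ValueError("highs, lows and closes must have the same length")
    if len(closes) < n:
        raise ValueError("need at least n observations")
    typical = [(h + l + c) / 3.0 for h, l, c in zip(highs, lows, closes)]
    window = typical[-n:]
    moving_average = sum(window) / n
    mean_deviation = sum(abs(tp - moving_average) for tp in window) / n
    if mean_deviation == 0:
        raise ValueError("flat window: CCI is undefined")
    return (typical[-1] - moving_average) / (0.015 * mean_deviation)


# Illustrative data: typical price rising linearly from 1 to 20.
prices = [float(i) for i in range(1, 21)]
print(round(cci(prices, prices, prices), 2))  # prints: 126.67
```

In this example the moving average is 10.5 and the mean deviation is 5, so the CCI equals 9.5 / 0.075.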

• Disparity Index

The Disparity Index (DIX) is another indicator commonly used in technical analysis. It was developed by Steve Nison and is a momentum indicator that compares the stock's current price with its moving average (MA) over a particular time period. A DIX below 0 indicates that the stock's current price is below the n-day MA, a DIX above 0 indicates that the current price is above the n-day MA, and a DIX equal to 0 indicates that the current price is equal to the n-day MA. A 14-day MA is used in this paper. The formula for calculating the DIX with a 14-day MA is as follows:

$$\mathrm{DIX} = \frac{C_t - \mathrm{MA}_{14}}{\mathrm{MA}_{14}} \times 100$$

where,

$C_t$ = current stock price,
$\mathrm{MA}_{14}$ = moving average over 14 days
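A minimal sketch of the DIX formula for the most recent day is given below. It assumes a simple moving average that includes the current day; the function name `disparity_index` and the sample prices are hypothetical.

```python
def disparity_index(closes, n=14):
    """DIX of the latest day: percent distance of the close from its n-day simple MA."""
    if len(closes) < n:
        raise ValueError("need at least n closing prices")
    moving_average = sum(closes[-n:]) / n
    return 100.0 * (closes[-1] - moving_average) / moving_average


# Illustrative prices: 13 days at 100 followed by one day at 114 (14-day MA = 101).
closes = [100.0] * 13 + [114.0]
print(round(disparity_index(closes), 2))  # prints: 12.87
```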


3.2 Prediction Models

In order to predict the trend in the stock market price movement, we use ensemble learning algorithms. Ensemble learning algorithms, or techniques, combine several machine learning models into one predictive model in order to achieve better predictive performance than could be obtained from any single model. The main goal of using these techniques is to deal with modeling issues related to time series forecasting and hence improve the stability and accuracy of the machine learning algorithm.

Furthermore, the main factors that are assumed to cause an error in machine learning models are: variance, bias and noise, and ensemble learning algorithms can be used to handle the over-fitting issue and thus improve the algorithm used by minimizing all these factors.

Time series forecasting problems can be formulated as either classification problems or regression problems. The difference between the two lies in the output: in a regression problem a quantity is predicted, i.e. the algorithm produces a numerical or continuous output variable, whereas in a classification problem a category is predicted, i.e. the output variable is discrete or categorical.
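To make the distinction concrete, the sketch below builds both kinds of target from the same price series: a continuous target (the price a fixed number of days ahead) for regression, and a categorical target (up or down) for classification. The labeling rule used here, +1 if the future price is higher than the current price and -1 otherwise, is an assumption chosen for illustration and is not taken from this section; the function name `make_targets` is hypothetical.

```python
def make_targets(closes, horizon):
    """Return (regression targets, direction labels) for every day with a price horizon days ahead."""
    if horizon < 1 or horizon >= len(closes):
        raise ValueError("horizon must be at least 1 and smaller than the series length")
    current = closes[:-horizon]
    future = closes[horizon:]
    regression_targets = list(future)  # continuous output: the future price itself
    direction_labels = [1 if f > c else -1 for c, f in zip(current, future)]  # categorical output
    return regression_targets, direction_labels


prices = [1, 2, 3, 2, 1]  # illustrative values
print(make_targets(prices, horizon=2))  # prints: ([3, 2, 1], [1, -1, -1])
```

Note that under this assumed rule an unchanged price is labeled -1; other conventions are possible.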

We focus on long-term rather than short-term prediction, since higher predictive accuracy for long-term prediction has been obtained in prior studies. Two ensemble learning algorithms are used to forecast the long-term direction of the stock market price: the Random Forest classifier (RFC) and Quantile Regression Forest (QRF).

Since QRF is a generalization of RF, and RF in its turn is an ensemble learning algorithm that is constructed from decision trees and used to improve the accuracy of the decision trees, it is worth considering the framework under which both decision trees and RF operate. The prediction models employed in this paper are described in the following subsections.

3.2.1 Decision trees

Classification and regression problems can be represented in the form of a tree structure called a decision tree. The decision tree algorithm (Quinlan, 1986), which is also called classification and regression tree (CART) in computer science, is a supervised machine learning algorithm used mostly for classification problems, although it can also handle regression problems.

The basic idea behind CART is to start with a root node where the entire data set is situated.

The data then is split into two or more mutually exclusive child nodes depending on different classes, each child node is in turn split into grandchild nodes and so on.

The trees that descend from the root node are called sub-trees. Each node in the decision tree acts as a test case for some attribute, and each sub-node descending from that node corresponds to one of the possible responses to the test case. A node that does not split further, and which provides the classification of the output variable, is called a leaf or terminal node, while a node that is split into further sub-nodes is called an internal or decision node. CART works with different parameters called the predictors or input variables. At each node, the data is split depending on the response to a particular question asked about one of these parameters, until the final decision is reached.
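The splitting step at a single node can be sketched as follows. The text does not name the criterion used to choose a split; this sketch assumes Gini impurity, one common choice in CART, and searches for the threshold on a single predictor that gives the purest pair of child nodes. The function names and the sample values (which could be thought of as RSI readings with up/down labels) are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best binary split x <= threshold."""
    best = (None, gini(ys))  # no split: impurity of the parent node
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0  # candidate midway between neighbouring values
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best


xs = [20, 25, 35, 60, 70, 80]  # illustrative predictor values
ys = [-1, -1, -1, 1, 1, 1]     # illustrative class labels
print(best_split(xs, ys))  # prints: (47.5, 0.0)
```

A full tree repeats this search over all predictors and applies it recursively to each child node until the nodes are pure or another stopping rule is met.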

Furthermore, decision tree learning can fit the training data so closely that the performance of the model on new data is negatively affected, i.e. the model over-fits and produces poor results. The more the data is split, the higher the risk of over-fitting, which explains why the accuracy of a single decision tree is often quite low. Hence, RF was introduced as an ensemble learning algorithm built on decision trees to deal with this problem and to give more effective results than those obtained from a single tree.

3.2.2 Random Forest

Decision trees have a high variance, which causes over-fitting, and this motivated the search for a model in which the variance can be reduced. Random forest (Breiman, 2001) falls under the category of ensemble learning algorithms; it builds multiple decision trees, which together are called a forest.

The name Random Forest comes from the fact that this model is a forest of randomly created decision trees, used to overcome the issue of over-fitting that often occurs when using a single tree in the case of decision tree algorithms.

The primary difference between a decision tree and RF is that in decision tree learning the entire data set is used to construct a single tree containing all the predictors, that is, the entire training data set is placed at the root. RF, in contrast, selects subsets of the predictors at random and builds a decision tree for each selected subset, i.e. no single tree in the RF is built on the entire data; each tree is instead built on part of the data, which is recursively split into partitions.

The final outcome in the RF is reached by combining the outcomes of the multiple randomly created decision trees in the forest. For regression, the average of the outcomes of all trees is taken; for classification, each tree votes for a class and the class that receives the most votes is the predicted class. Furthermore, increasing the number of trees in the forest generally gives more accurate and more stable results.
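The two ingredients described above, random selection of predictors and combination of the tree outcomes, can be sketched as below. This is not a full RF implementation: the individual tree predictions are simply given as lists, and the helper names, the random seed and the subset size of three are assumptions made for illustration. The predictor names are the seven indicators used in this paper.

```python
import random
from collections import Counter


def random_feature_subset(features, k, rng):
    """Draw k distinct predictors at random for one tree."""
    return rng.sample(features, k)


def majority_vote(tree_votes):
    """Classification: the class that receives the most votes across the trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def average(tree_predictions):
    """Regression: the mean of the individual tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)


features = ["MACD", "RSI", "ROC", "%K", "%R", "CCI", "DIX"]
subset = random_feature_subset(features, 3, random.Random(0))
print(len(subset))                        # prints: 3
print(majority_vote([1, 1, -1, 1, -1]))   # prints: 1
print(average([101.0, 99.0, 103.0]))      # prints: 101.0
```

With an even number of trees a vote can be tied; in this sketch the class encountered first among the tied classes wins.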

3.2.3 Quantile Regression Forest

Quantiles, in general, refer to dividing a sample or probability distribution into equal-sized subgroups, where each subgroup contains the same fraction of the total population. In other words, quantiles divide the range of a continuous random variable into subgroups of equal probability, and can therefore be considered as the points or values that describe the location of the distribution. The most used quantile is the median, which corresponds to the 50th percentile or the 0.5-quantile. Similarly, the 0.25 and 0.75 quantiles, or 25th and 75th percentiles, are called the first and the third quartiles respectively; the 0.2, 0.4, 0.6 and 0.8-quantiles, which correspond to the 20th, 40th, 60th and 80th percentiles, are called quintiles; and finally, the 0.1, 0.2, . . . , 0.9 quantiles, corresponding to the 10th, 20th, . . . , 90th percentiles, are known as deciles.
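A quantile of a finite sample can be computed as in the sketch below. Several conventions exist for sample quantiles; this sketch assumes the simplest one, namely the smallest sample value at which the empirical distribution function reaches the level q. The function name `empirical_quantile` and the sample values are hypothetical.

```python
import math


def empirical_quantile(sample, q):
    """Smallest sample value y such that at least a fraction q of the sample is <= y."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    if not sample:
        raise ValueError("sample must not be empty")
    ordered = sorted(sample)
    # Smallest k with k/n >= q; the small tolerance guards against floating-point round-off.
    k = max(1, math.ceil(q * len(ordered) - 1e-9))
    return ordered[k - 1]


sample = [7, 2, 5, 1, 8, 3, 6, 4]  # illustrative values
print(empirical_quantile(sample, 0.25))  # prints: 2 (first quartile)
print(empirical_quantile(sample, 0.5))   # prints: 4 (median under this convention)
print(empirical_quantile(sample, 0.75))  # prints: 6 (third quartile)
```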

Quantile regression (Koenker & Bassett, 1978; Koenker & Hallock, 2001) is an extension of the classical least-squares model. It is used whenever simple linear regression cannot be applied to study the effect of the predictor variables on a specific response variable, or in other words, whenever the response variable has a non-linear relationship with
