
Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

21 | LIU-IDA/LITH-EX-A--21/032--SE

Comparison of recommender systems for stock inspiration.

Jämförelse av rekommendationssystem för aktie-inspiration.

Nils Broman

Supervisor: Gazi Salah Uddin
Examiner: Martin Sjölund


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Recommender systems are present in our lives in multiple ways, such as recommending items to purchase when shopping online, recommending movies to watch, and recommending restaurants in your area. This thesis applies the same recommender system techniques to a new area, namely stock recommendations based on a user's current portfolio. The data was collected from Shareville, a social media platform for investments, and contained the portfolios of multiple users. This implicit data was used to train matrix factorization models as well as the state-of-the-art LightGCN model. Experiments with different data splits were also conducted. The results indicate that recommender system techniques can be applied successfully to generate stock recommendations, and that the relative performance of the models on this dataset is in line with previous research. LightGCN greatly outperforms the matrix factorization models on the proposed dataset. The results also show that different data splits greatly impact the results, which is discussed in further detail in this thesis.


Acknowledgments

First off, I would like to thank my supervisor at Nordnet, Anders Blomqvist, as well as the entire Shareville team for letting me be a part of their team and for helping me obtain all the relevant data.

I would also like to thank my supervisor at the university, Gazi Salah Uddin, as well as my examiner, Martin Sjölund, for providing valuable input regarding the thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Background
   1.2 Aim
   1.3 Research questions
   1.4 Contributions
   1.5 Delimitations
2 Theory
   2.1 Content based filtering
   2.2 Collaborative Filtering
   2.3 Data modelling
   2.4 Models
   2.5 Optimization
   2.6 Data splitting
   2.7 Evaluation techniques
3 Method
   3.1 Pre-study
   3.2 Data
   3.3 Implementation
   3.4 Tuning
   3.5 Evaluation
   3.6 Previous Applications
4 Results
   4.1 Pre-study
   4.2 Data
   4.3 Tuning
   4.4 Evaluation
5 Discussion
   5.1 Results
   5.2 Method
   5.3 The work in a wider context
6 Conclusion
Bibliography


List of Figures

2.1 Matrix Factorization
2.2 Interaction graph
2.3 High-order connectivity
2.4 Pointwise data
2.5 Pairwise data
2.6 Temporal global vs temporal user/leave one last
4.1 MF-BCE training recall plot
4.2 MF-BPR training recall plot


List of Tables

2.1 Implicit data model
2.2 Explicit data model using % of total investments
3.1 Data example
4.1 Data filtering statistics
4.2 Statistics of the three data splits on unfiltered data
4.3 Statistics of the three data splits on 4,4 filtered data
4.4 Statistics of the three data splits on 10,10 filtered data
4.5 SVD model tuning results
4.6 MF-BCE model tuning results
4.7 MF-BPR model tuning results
4.8 LightGCN model tuning results
4.9 Item Popularity model evaluation
4.10 MF-BCE model evaluation
4.11 MF-BPR model evaluation
4.12 LightGCN model evaluation
4.13 Item Popularity model tuning results
4.14 Results of best MF-BCE model on 4,4 filtered data with global temporal split
4.15 Results of best MF-BPR model on 4,4 filtered data with global temporal split
4.16 Results of best LightGCN model on 4,4 filtered data with global temporal split
4.17 Example of instruments in the training set, for a single user
4.18 LightGCN recommendations for the example user
4.19 MF-BPR recommendations for the example user
4.20 MF-BCE recommendations for the example user
4.21 Item Popularity recommendations


Abbreviations

CF - Collaborative Filtering
kNN - K-Nearest Neighbours
MF - Matrix Factorization
SVD - Singular Value Decomposition
DL - Deep Learning
NN - Neural Network
RNN - Recurrent Neural Network
GNN - Graph Neural Network
GCN - Graph Convolutional Network
SGD - Stochastic Gradient Descent
RMSE - Root Mean Squared Error
BCE - Binary Cross Entropy
BPR - Bayesian Personalized Ranking
NDCG - Normalized Discounted Cumulative Gain
GMF - Generalized Matrix Factorization
NCF - Neural Collaborative Filtering
CMN - Collaborative Memory Network
NGCF - Neural Graph Collaborative Filtering
LightGCN - Light Graph Convolution Network


1 Introduction

Today, every large internet company has some kind of recommender system making sure that users find the most relevant content. Netflix has a recommender system to recommend movies for you to watch, Amazon has a system to suggest new products for you to buy; the list goes on and on. With recommender systems, each user can get personalized suggestions based on products the user has previously been interested in. This benefits both the company, which might sell more products, and the user, who gets more interesting product suggestions.

Just like Netflix has a recommender system for suggesting movies to watch, there exist recommender systems for stocks. However, these recommender systems tend to differ from the movie recommenders: the current recommender systems for stocks focus on high returns and discard user preferences, meaning that they would recommend the same stocks to every single user. Personalized recommender systems for stock inspiration are an area that is currently quite unexplored.

And while interest in stocks is only growing, especially in the Nordic countries (Nordnet, 2021), the use case for this type of recommender system could be on the rise. However, which techniques are most suitable for implementing such a system is far from obvious, leading us to the following question: what is the best approach for building a recommender system for stock inspiration?

Using AI and machine learning within the field of finance can be very challenging: since the world of finance is such a complex and perplexing environment, applications need to meet the demands of that environment. Solutions based on AI and machine learning have been on the rise within this area (Vinuesa et al., 2020; Rea, 2020) because of their high complexity and ability to learn from previous data. However, applications such as predicting the price movements of certain stocks or finding the golden formula for option pricing can be very hard despite using complex techniques. This thesis instead applies machine learning to a novel area: suggesting stocks based on current holdings. The thesis compares and discusses methods for building a recommender system for stocks. The system will, based on the user's portfolio, suggest a couple of stocks which might be of interest to the user.

1.1 Background

Nordnet provides a digital platform for savings and investments, and they are the leading company in the Nordic markets (Nordnet, 2020). They have customers in Sweden, Denmark, Norway and Finland, and their vision is to disrupt the classic banking industry and democratize investing. Nordnet is publicly listed on the Swedish stock market, has over 1.3 million customers across the Nordic countries and has increased its number of customers by 41% since 2020 (Nordnet, 2021). They provide a digital platform where one can invest in stocks, funds and other financial products, and they also offer various kinds of loans and pension solutions. Further, they run the biggest social platform for investments, Shareville, with over 250 000 users. This means that Nordnet has a large amount of data on user investments, which is suitable for creating a recommender system. With a recommender system, Nordnet users would have an easier time finding investments, which would benefit both the user and Nordnet.

This study is based on relevant and influential research papers within the topic of collaborative filtering models for recommender systems. These papers include, but are not limited to, He, Deng, et al. (2020), Rendle, Freudenthaler, Gantner, et al. (2009), and Lim (2007), as well as the paper by Meng, McCreadie, MacDonald, et al. (2020) regarding different data splitting techniques. Other relevant research is also presented, but these papers are the most central to this thesis.

1.2 Aim

The purpose of this thesis is to evaluate whether applying commonly used recommender system techniques to portfolio data is viable, and whether the results obtained are comparable to previously reported results. The thesis will also compare several recommender system models to determine which ones perform best on this new type of data. The results will show whether a recommender system of any kind is suitable for the task of recommending stocks based on user interest.

1.3 Research questions

Recommender system techniques have previously been applied to stock data in order to recommend stocks with the highest returns (Gottschlich and Hinz, 2014; Paranjape-Voditel and Deshpande, 2013; Nair and Mohandas, 2015). However, the data used there is very different from the data from Nordnet: those recommender systems are applied to price data from multiple stocks, whereas the data from Nordnet contains no price data but instead user portfolios. The data is more similar to the data used in movie-recommendation datasets1, which motivates the following question:

1. Can recommender system techniques be applied to stock data to achieve the same quality of results as when applied to previously reported datasets?

This question concerns whether the results of the system when applied to stock data are up to par with the results when applied to other datasets, where the quality of the results is measured according to three evaluation metrics, presented later. Therefore, the question is relevant for different kinds of models. Examples of other datasets include a location check-in

dataset2, a business and restaurant dataset3, and a book dataset4. The result of this question will determine the generalizability of the recommender system techniques used.

The next research question addresses the up-and-coming use of deep learning in recommender systems: does the introduction of highly complex non-linearities in the systems really result in a performance enhancement?

2. Do state-of-the-art deep learning models outperform factorization models when applied to stock data?

This will be answered based on the evaluation metrics of the models, which are presented later, in the theory chapter. This question will show whether the state-of-the-art models generalize to a different dataset. Also, there is no commonly agreed-upon data splitting technique in recommender systems; different studies use different types of data splitting. Therefore, the following question is highly relevant:

3. Do different data splitting techniques affect the results of the recommender systems?

1.4 Contributions

The main contributions of this thesis include presenting a new type of dataset, a portfolio dataset, and applying known recommender system techniques to it. The results include a comparison of different techniques evaluated on this new dataset based on relevant metrics. Further, results regarding different types of data splits are also presented.

1.5 Delimitations

A delimitation of this work is that the system will not be suitable for day traders, as their trading might not be captured in the data: the data is captured at a single point in time and will therefore not give accurate recommendations to an active trader. The system is better suited for long-term investors with multiple holdings, as another delimitation is that the system cannot recommend instruments to users with no current holdings. This problem is also known as the cold start problem within the area of recommender systems.

2Gowalla: https://snap.stanford.edu/data/loc-gowalla.html
3Yelp2018: https://www.yelp.com/dataset/documentation/main
4Amazon-book dataset introduced by McAuley et al. (2015)

2 Theory

First in this chapter, the data framework and some algorithms in recommender systems are presented, followed by different techniques used to optimize these algorithms. Four widely used data splitting strategies are then introduced, and finally some evaluation metrics for recommender systems.

When building a recommender system there are a few important components of interest: the data, the model and the evaluation. The data is arguably the most crucial: how should the data be structured or preprocessed to best suit the problem? Choosing the model is also of high importance, since the model is the actual predictor in a recommender system. Further, the optimization of the model is significant when fitting the model to the problem of recommending. Finally, it is important to evaluate the model results on relevant metrics, to understand and analyse the results achieved.

2.1 Content based filtering

Content based filtering (Meteren, 2000) is one of two big branches in recommender systems. Content based recommender systems recommend items that are similar to the user's current items; essentially, they try to find the most similar items, the most similar content.

“A content-based filtering system selects items based on the correlation between the content of the items and the user's preferences as opposed to a collaborative filtering system that chooses items based on the correlation between people with similar preferences.” (Meteren, 2000)

Content based filtering requires gathering information about all the different items.

2.2 Collaborative Filtering

Collaborative filtering (Sarwar et al., 2001), on the other hand, finds similar users and then recommends items that those similar users currently have.

Collaborative filtering models, however, suffer from the cold start problem (Lam et al., 2008; Bobadilla et al., 2012): it is hard to recommend items to brand new users, and to recommend brand new items. Content based filtering does not suffer from this.

2.3 Data modelling

The data is modelled as a matrix with users as rows and financial instruments as columns, see Table 2.1. This is referred to as implicit feedback in the field of recommender systems. A one in the cell for user 2 and instrument 1 indicates that user 2 has a position in instrument 1. These financial instruments are all the products that Nordnet offers on their platform: stocks, funds, certificates and options. Further on in the report, these financial instruments will simply be referred to as instruments.

One could also use explicit feedback in this scenario, as shown in Table 2.2, where the value 0.43 in the row for user 2 and the column for instrument 1 indicates that user 2 currently has 43% of their total portfolio's worth invested in instrument 1. The argument is that a user with a higher percentage of their portfolio invested in a certain instrument should be more interested in that instrument. For example, if user X has 60% of their capital invested in instrument A while user Y has 10% of their capital in instrument A, this way of modelling the data assumes that user X is more interested in instrument A than user Y is.

         instrument 1   instrument 2   instrument 3
user 1        0              0              0
user 2        1              0              1
user 3        0              1              0
user 4        0              0              1

Table 2.1: Binary data model. A one indicates that the user owns that instrument.

         instrument 1   instrument 2   instrument 3
user 1        0.5            0              0
user 2        0.43           0              0.57
user 3        0              1              0
user 4        0              0.05           0.95

Table 2.2: Data model using % of total investments. If a user is fully invested, the row should sum up to 1, representing 100% of that user’s investments.
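As a minimal sketch (not from the thesis) of how these two data models can be derived from holdings data with pandas; the column names are illustrative assumptions, not the actual Shareville schema:

```python
import pandas as pd

# Hypothetical holdings table, one row per (user, instrument) position.
holdings = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "inst_id": [101, 100, 102, 101, 102],
    "pct_of_portf": [50.0, 43.0, 57.0, 100.0, 95.0],
})

# Explicit model (Table 2.2): fraction of portfolio value per instrument.
explicit = holdings.pivot_table(
    index="user_id", columns="inst_id", values="pct_of_portf", fill_value=0
) / 100

# Implicit model (Table 2.1): ownership becomes a binary indicator.
implicit = (explicit > 0).astype(int)
print(implicit)
```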

2.4 Models

In recommender system research, models have developed from nearest neighbours and matrix factorization to DL and to finding high-order connections with graph structures. Here, I present the most common and basic models used, as well as some state-of-the-art solutions.

Item popularity

One very basic baseline model is the item popularity model. It simply recommends the most popular items to every user. Despite its simplicity, it has been shown to be a solid baseline for evaluating recommender systems.

k-Nearest Neighbours

K-Nearest Neighbours (Cover and Hart, 1967) is an algorithm that classifies the current entry according to the class of its k nearest neighbours. If k = 1, the current data point is assigned the class of the nearest data point, according to a specified similarity metric such as Euclidean distance or cosine similarity.

The most commonly used similarity measure in recommender systems is the cosine similarity, defined as:

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \tag{2.1}$$

The core idea of kNN is that similar users share interests: if all your friends own a certain instrument, it may be interesting to you as well. In recommender systems, this algorithm can be used in different ways, either by finding the nearest neighbours based on the user's different item interactions, as in ItemKNN, or by user similarity.
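As a small illustration (not from the thesis), the cosine similarity of Equation 2.1 between two binary ownership vectors can be computed directly:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two interaction vectors (Equation 2.1)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two users' binary ownership vectors over four instruments.
user_x = np.array([1.0, 0.0, 1.0, 1.0])
user_y = np.array([1.0, 0.0, 1.0, 0.0])
print(cosine_similarity(user_x, user_y))  # ~0.816: fairly similar portfolios
```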

B. Wang, Q. Liao, and C. Zhang (2013) present an alternative approach to kNN, in which they propose a weight-based similarity algorithm and identify downsides of the Pearson correlation coefficient.

Matrix Factorization

Matrix factorization (Koren, Bell, and Volinsky, 2009) aims to approximate the user-item matrix with a smaller number of dimensions. Let $R' \approx HW$ be the approximation, where $H \in \mathbb{R}^{users \times k}$ and $W \in \mathbb{R}^{k \times items}$, with $users$ = number of users, $items$ = number of items and $k$ = number of latent factors. The idea is that $R'$ approximates the original matrix and thus leads to good future recommendations. To find the optimal $R'$, an optimization problem is formalized as:

$$\operatorname*{argmin}_{H,W} \; \|R - R'\|_F \tag{2.2}$$

where $\|\cdot\|_F$ is the Frobenius norm, defined as:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2} \tag{2.3}$$

Figure 2.1: Visualization of Matrix Factorization.

To avoid overfitting the model, regularization is added. This introduces two hyperparameters, α and β, to be optimized during training:

$$\operatorname*{argmin}_{H,W} \; \|R - R'\|_F + \alpha \|H\|_F + \beta \|W\|_F \tag{2.4}$$

This optimization problem can then be solved using SGD. To further improve this model, one can add a user bias, to account for users with many or few interests, an item bias for popular items, and a global bias.
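The following is a toy sketch of regularized MF trained with SGD on the observed entries, in the spirit of Equations 2.2-2.4; it is illustrative only, not the implementation used in this thesis:

```python
import numpy as np

def mf_sgd(R, k=8, lr=0.01, alpha=0.01, beta=0.01, epochs=20, seed=2021):
    """Toy regularized matrix factorization trained with SGD on the
    observed (nonzero) entries of R."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    H = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    W = rng.normal(scale=0.1, size=(k, n_items))  # item factors
    rows, cols = R.nonzero()
    for _ in range(epochs):
        for u, i in zip(rows, cols):
            err = R[u, i] - H[u] @ W[:, i]        # reconstruction error
            h_u = H[u].copy()
            H[u]    += lr * (err * W[:, i] - alpha * h_u)     # user update
            W[:, i] += lr * (err * h_u     - beta * W[:, i])  # item update
    return H, W

R = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]], dtype=float)
H, W = mf_sgd(R, k=2)
print(H @ W)  # approximation R' of the interaction matrix
```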

One way of utilizing matrix factorization is through singular value decomposition. SVD decomposes the user-item matrix into three parts:

$$M = U \Sigma V^T \tag{2.5}$$

where $M$ is the original user-item matrix, and the decomposition contains the matrices $U$ and $V$ and the diagonal matrix $\Sigma$. Since the user-item matrix usually is very sparse, truncated SVD is often used for recommender systems:

$$M \approx U_f \Sigma_f V_f^T \tag{2.6}$$

where $M$ is the approximated user-item matrix and $f$ is the number of singular values used in the SVD. Since $\Sigma$ contains the singular values of $M$ and can be chosen to be in decreasing order, a low-rank approximation of $M$ is made by picking the $f$ largest singular values in $\Sigma$ and discarding the others. This is usually known as dimensionality reduction or low-rank approximation, and in recommender systems it can be used directly as a predictive model. It is efficient when the rank of $M$ is much greater than $f$.
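Since the thesis's SVD model is later said to be implemented with the aid of SciPy (Section 3.3), a sketch along the following lines seems plausible; the toy matrix and the ranking step are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Sparse binary user-item matrix (users x instruments).
R = csr_matrix(np.array([[1, 0, 1, 0],
                         [1, 1, 0, 0],
                         [0, 1, 0, 1]], dtype=float))

f = 2                                # number of singular values kept
U, sigma, Vt = svds(R, k=f)          # truncated SVD (Equation 2.6)
scores = U @ np.diag(sigma) @ Vt     # low-rank approximation of R

# Rank unseen instruments for user 0 by the approximated scores.
user = 0
unseen = np.flatnonzero(R[user].toarray().ravel() == 0)
print(unseen[np.argsort(-scores[user, unseen])])
```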

Multiple different variants of MF are widely used in recommender systems. Funk (2006) popularized MF in recommender systems by placing third in the Netflix prize (Bennett and Lanning, 2007), and MF is now widely used in these kinds of applications. However, the idea of MF for recommender systems is older than the Netflix prize: Sarwar et al. (2000a) studied the use of SVD for dimensionality reduction in recommender systems well before it. It was, however, after the Netflix competition that the research really got a boost. Koren (2008) presents a combination of MF and kNN, where the two models are combined smoothly into one, more accurate, model. Rendle (2010) proposed a general predictor, a factorization machine model, which combines factorization models with support vector machines for improved parameter estimation in hugely sparse settings.

More recently, a different way of initializing the missing data was proposed (He, H. Zhang, et al., 2016). Instead of handling the missing data as zeros, the data was initialised with the item popularity. This came with performance issues, however, and a faster way of calculating and optimizing MF was also proposed: optimizing element-wise with the alternating least squares method. This not only sped up the training process but also improved the accuracy of the original MF model.

Deep learning

As research in DL has made huge advances in the last couple of years, that has naturally also spread to recommender systems. The argument for introducing complex non-linear models in recommender systems is that the task is simply too complex to be explained by a linear model; more expressive models that can capture complex structures in the data are needed to improve recommendations. And since DL can approximate any nonlinear function, could it not learn to find better recommendations than traditional linear models? In a standard NN with an input layer, a hidden layer and an output layer, the inputs may be vector representations of the user and item, and the output might be a predicted rating for that user-item pair. Using these settings, one can use DL in recommender systems.

He, L. Liao, et al. (2017) incorporated MF in their model with neural nets, arguing that the non-linearity of the neural nets would result in a more expressive model. They introduced a


generalized version of MF, GMF, which is computed as follows:

$$\hat{y}_{ui}^{GMF} = f\left(h^T (e_i \odot e_u)\right) \tag{2.7}$$

where $e_i$ is the latent item vector, $e_u$ is the latent user vector, $\odot$ denotes the element-wise product, $h$ is the weight vector and $f$ denotes the activation function. When $h$ is set to 1 and $f$ is the identity function, MF is recovered. He, L. Liao, et al. (2017) combined GMF with a multi-layer perceptron, defined as:

$$\hat{y}_{ui}^{MLP} = f_L\left(W_L^T\left(f_{L-1}\left(\dots f_1\left(W_1^T E(e_i, e_u) + b_1\right)\dots\right)\right) + b_L\right) \tag{2.8}$$

These two models were then combined to form NCF:

$$\hat{y}_{ui}^{NCF} = \sigma\left(h^T \begin{bmatrix} \hat{y}^{GMF} \\ \hat{y}^{MLP} \end{bmatrix}\right) \tag{2.9}$$

where σ is the sigmoid function, used to force the output between 0 and 1 for implicit recommendations.
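As a hedged PyTorch sketch of the GMF branch in Equation 2.7, with illustrative layer sizes (the full NCF model would concatenate this branch with the MLP of Equation 2.8 before the final sigmoid):

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    """Sketch of generalized matrix factorization (Equation 2.7)."""
    def __init__(self, n_users, n_items, k=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, k)
        self.item_emb = nn.Embedding(n_items, k)
        self.h = nn.Linear(k, 1, bias=False)  # learned weight vector h

    def forward(self, users, items):
        # Element-wise product of latent vectors, then a weighted sum
        # squashed by a sigmoid for implicit (0/1) feedback.
        x = self.user_emb(users) * self.item_emb(items)
        return torch.sigmoid(self.h(x)).squeeze(-1)

model = GMF(n_users=100, n_items=50)
print(model(torch.tensor([0, 1]), torch.tensor([3, 7])))  # two predicted scores
```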

However, it is also argued that the NCF model does not perform as well as a well-tuned MF model, according to Rendle, Krichene, et al. (2020). They point out that the MLP part of NCF has a hard time actually learning the dot product operator and would need a large amount of data to do so. Furthermore, they argue that the MLP is not applicable to real-life large-scale applications, since its computational complexity would be too inefficient compared to a simple dot product. Rendle, Krichene, et al. (2020) also accused He, L. Liao, et al. (2017) of cherry-picking the results by selecting the best iteration based on the test set, instead of correctly selecting the best iteration on the validation set and then presenting the results on the test set.

Graph-based models

A new and upcoming topic in recommender systems is graph-based models. The core idea behind graph-based models is that they can capture high-order connectivity between users in the data.

Figure 2.2: Visualization of the interaction graph. The example data is from the X. Wang et al. (2019) paper.

The idea of X. Wang et al. (2019), Yang et al. (2018) and He, Deng, et al. (2020) is that the high-order connectivity graph is more expressive than the user-item interaction graph, as seen in Figure 2.3. When recommending an item to u1, the connection u1 ← i2 ← u2 suggests that since both u1 and u2 have interacted with i2, i5 might be interesting for u1 as well.

Yang et al. (2018) created a model, HOP-rec, by combining factorization-based methods with graph-based methods. They extracted indirect preferences from the user-item interactions to construct graphs. The method involves random surfing in a graph to extract high-order information among neighboring items for each user.

Figure 2.3: Visualization of high-order connectivity where u1 is the target node. The example data is from the X. Wang et al. (2019) paper.

In X. Wang et al. (2019), they argue that Yang et al. (2018) only exploit high-order connectivity to enrich the training data, and not to enhance the actual prediction, whereas X. Wang et al. (2019) claim to also incorporate these ideas into the prediction. Their proposed model, NGCF, is based on graph neural networks and refines the user and item input embeddings by propagating them through multiple layers, with the same goal as Yang et al. (2018): to capture high-order connectivity in the data.

He, Deng, et al. (2020) later introduced a new model, LightGCN, which only incorporates the most important part of the previous work in the area: neighborhood aggregation. They claimed that other parts of previous works, such as feature transformation and nonlinear activation, contributed little to the performance. Those parts were therefore discarded, and a simpler, easier-to-train and better-performing model was presented. Instead of passing the user and item embeddings through activation functions and feature transformations, the normalized sum of the neighbors is passed on to the next layer, simplifying the NGCF model substantially.
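The core propagation rule can be sketched as follows, assuming a precomputed, symmetrically normalized user-item adjacency matrix; this is a simplified illustration of the idea, not the authors' reference implementation:

```python
import torch

def lightgcn_propagate(adj_norm, emb0, n_layers=3):
    """Sketch of LightGCN propagation: multiply the normalized adjacency
    with the embeddings layer by layer, with no feature transformation
    and no nonlinearity, then average the layer outputs."""
    embs = [emb0]
    e = emb0
    for _ in range(n_layers):
        e = torch.sparse.mm(adj_norm, e)       # normalized neighbor sum
        embs.append(e)
    return torch.stack(embs).mean(dim=0)       # layer combination

# Tiny example: 2 users + 2 items as one bipartite graph of 4 nodes.
indices = torch.tensor([[0, 2], [2, 0]])       # user 0 <-> item 0 (node 2)
adj = torch.sparse_coo_tensor(indices, torch.tensor([1.0, 1.0]), (4, 4))
emb = torch.randn(4, 8)                        # initial node embeddings
print(lightgcn_propagate(adj, emb).shape)      # torch.Size([4, 8])
```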

Session-based models

The main idea behind session-based models is that by not modelling the entire user session when recommending an item, important information is left out. Early work in the area uses Markov Chains to model dependencies between consecutive item interactions (Rendle, Freudenthaler, and Schmidt-Thieme, 2010). Later research incorporates RNNs to model session-based recommender systems (Hidasi et al., 2016). Li et al. (2017) extend the RNN model even further with an attention mechanism that encodes the user's main purpose of the session; their model is considered state-of-the-art in session-based recommender systems. Recently, S. Wu et al. (2019) proposed a session-based model that combines Graph Neural Networks with an attention network to further improve the results obtained by session-based recommender systems.

2.5 Optimization

When optimizing a recommender system, one could use the root mean squared error loss, achieving a better score the closer the prediction is to the truth. However, RMSE is a loss function for prediction, and since the goal of a recommender system is not to predict with the best accuracy but to provide relevant suggestions, this loss can be misleading. When dealing with implicit data, a commonly used loss is binary cross entropy, simply a classification loss that scores whether an item is relevant or irrelevant. Does there exist a better loss, more suited to ranking items rather than just classification? BPR, presented in the following section, is one answer; the sketch below first illustrates the BCE alternative.
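As a minimal illustration (not from the thesis), BCE on implicit data treats each user-item score as a binary classification of owned versus not owned; in practice the zero labels come from sampled negative items:

```python
import torch
import torch.nn.functional as F

# Each (user, item) score is classified as owned (1) or not owned (0);
# the zeros are typically obtained by sampling negative items.
scores = torch.tensor([2.1, -0.3, 0.8])   # model outputs (logits)
labels = torch.tensor([1.0, 0.0, 1.0])    # 1 = owned instrument
print(F.binary_cross_entropy_with_logits(scores, labels))
```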

Bayesian Personalized Ranking

Bayesian Personalized Ranking (Rendle, Freudenthaler, Gantner, et al., 2009) is an optimization criterion that originates from an alternative problem setting. Instead of modelling the data as simply zeros and ones, which can be visualized in Figure 2.4, the data is modelled according to the pairwise relation between two data points, as visualized in Figure 2.5.

Figure 2.4: Pointwise modelling of the data.1

Figure 2.5: Pairwise modelling of the data.1

This results in the alternative update step:

$$\Theta \leftarrow \Theta + \alpha \left( \frac{e^{-\hat{x}_{uij}}}{1 + e^{-\hat{x}_{uij}}} \cdot \frac{\partial \hat{x}_{uij}}{\partial \Theta} + \lambda_\Theta \cdot \Theta \right) \tag{2.10}$$

This optimization technique is used in several papers (He, Deng, et al., 2020; X. Wang et al., 2019; Yang et al., 2018; Hidasi et al., 2016; Ebesu, Shen, and Fang, 2018), and MF with BPR optimization is widely used as a benchmark for new algorithms, such as in He, L. Liao, et al. (2017), X. Wang et al. (2019), Hidasi et al. (2016), Li et al. (2017), Ebesu, Shen, and Fang (2018), and Yang et al. (2018).
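A minimal sketch of the BPR loss over batches of (user, positive item, negative item) scores; minimizing this loss with a gradient optimizer corresponds to the update step in Equation 2.10 (the regularization handling is a simplifying assumption):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores, params, reg=1e-5):
    """BPR loss: push owned (positive) items to be scored above
    sampled negatives, plus L2 regularization on the parameters."""
    x_uij = pos_scores - neg_scores
    loss = -F.logsigmoid(x_uij).mean()
    loss = loss + reg * sum(p.pow(2).sum() for p in params)
    return loss

pos = torch.tensor([1.2, 0.4])            # scores for owned instruments
neg = torch.tensor([0.1, 0.9])            # scores for sampled negatives
emb = torch.randn(5, 4, requires_grad=True)
print(bpr_loss(pos, neg, [emb]))
```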

2.6 Data splitting

The following sections present four frequently used data splitting techniques in recommender systems. These techniques are used to split the data into three sets: the training set, the validation set and the test set. Different splits can affect the outcome of the recommender system in different ways (Meng, McCreadie, MacDonald, et al., 2020), even in scenarios where the dataset and metrics are the same.

Temporal Global

Temporal global data splitting (Meng, McCreadie, MacDonald, et al., 2020) divides the data depending on the time points of the interactions; this assumes that the data includes time points for all interactions. The split is performed by choosing time points that divide the data into certain percentages of training, validation and test data. For example, all interactions before date x may be training data, all data points between x and y validation data, and all data from date y onwards test data. An example is shown in Figure 2.6.

This approach is generally seen as the most realistic setting for data splitting.
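A minimal pandas sketch of a global temporal split, assuming a numeric timestamp column and the 80/10/10 proportions used later in this thesis:

```python
import pandas as pd

def temporal_global_split(df, ts_col="timestamp", val=0.1, test=0.1):
    """Global temporal split: cut all interactions at global time
    quantiles, here 80% train / 10% validation / 10% test."""
    t_val = df[ts_col].quantile(1 - val - test)
    t_test = df[ts_col].quantile(1 - test)
    train = df[df[ts_col] <= t_val]
    valid = df[(df[ts_col] > t_val) & (df[ts_col] <= t_test)]
    test_set = df[df[ts_col] > t_test]
    return train, valid, test_set

df = pd.DataFrame({"user_id": [1, 1, 2, 2, 3],
                   "inst_id": [9, 8, 9, 7, 8],
                   "timestamp": [100, 200, 300, 400, 500]})
train, valid, test_set = temporal_global_split(df)
print(len(train), len(valid), len(test_set))
```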

Temporal User

Similarly to temporal global, the temporal user split takes a time point into account, but this time point is specific to each user: for example, for each user, the last 20% of their interactions are put into the test set and the earlier interactions into the training set. The time point can thus differ between users. This split is also visualized in Figure 2.6.

Leave one out

This is the most commonly used data splitting strategy within recommender systems (Meng, McCreadie, MacDonald, et al., 2020). The split simply leaves one item per user out of the training dataset to use as test data. This approach maximizes the amount of training data that can be used; however, it is a rather unrealistic data split.

Leave out basket

Leave out basket is similar to leave one out, but instead of leaving only one data point out for testing, multiple data points are left out of the training data.

Figure 2.6: Visualization of temporal global split and temporal user split/leave one last.

2.7 Evaluation techniques

In this section, a set of evaluation techniques is presented. These evaluation metrics are very common in evaluating recommender systems; see for example the work of He, H. Zhang, et al. (2016), He, L. Liao, et al. (2017), X. Wang et al. (2019), and Ebesu, Shen, and Fang (2018).

Normalized Discounted Cumulative Gain

Cumulative Gain (CG) at position p:

$$CG_p = \sum_{i=1}^{p} rel_i \tag{2.11}$$

where $rel_i$ is the graded relevance at position i.

Discounted Cumulative Gain (DCG) penalizes highly relevant documents that appear low in the ranking:

$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)} \tag{2.12}$$

The DCG is then simply normalized using the Ideal DCG (IDCG) as follows:

$$nDCG_p = \frac{DCG_p}{IDCG_p} \tag{2.13}$$

where

$$IDCG_p = \sum_{i=1}^{|REL_p|} \frac{2^{rel_i} - 1}{\log_2(i+1)} \tag{2.14}$$

A perfect NDCG score is 1, achieved when $DCG_p = IDCG_p$, and generally, a higher NDCG is better.
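A small NumPy sketch of NDCG@k with binary relevance, which is the setting for the implicit feedback used in this thesis; the item ids are illustrative:

```python
import numpy as np

def ndcg_at_k(ranked_items, relevant, k=20):
    """NDCG@k with binary relevance, a sketch of Equations 2.12-2.14."""
    rel = np.array([1.0 if item in relevant else 0.0
                    for item in ranked_items[:k]])
    discounts = np.log2(np.arange(2, k + 2))       # log2(i + 1) for i = 1..k
    dcg = np.sum((2 ** rel - 1) / discounts[:rel.size])
    n_ideal = min(len(relevant), k)                # ideal list: all hits first
    idcg = np.sum(1.0 / discounts[:n_ideal])       # binary: 2^1 - 1 = 1 per hit
    return dcg / idcg if idcg > 0 else 0.0

# The user owns items {3, 7}; the model ranked item 3 first and item 7 fifth.
print(ndcg_at_k([3, 1, 9, 2, 7], relevant={3, 7}, k=5))  # ~0.85
```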

Precision

Precision, or positive predictive value (ppv), is defined as:

$$\text{Precision} = \frac{|\text{relevant items} \cap \text{retrieved items}|}{|\text{retrieved items}|} \tag{2.15}$$

This metric does not take the actual rank of the item into account, as NDCG does; it only cares about whether the item is recommended.

Recall

Recall, or true positive rate (tpr), is calculated as:

$$\text{Recall} = \frac{|\text{relevant items} \cap \text{retrieved items}|}{|\text{relevant items}|} \tag{2.16}$$

All evaluation metrics (NDCG, precision and recall) are evaluated for a list of length k. This is the usual setting for recommender systems, since the full list of predicted items is not of interest; often, only the top 10 or top 20 items are relevant. Therefore, these metrics are evaluated on the top k predicted items and are then called precision@k, recall@k and NDCG@k, respectively.
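Correspondingly, precision@k and recall@k can be sketched as follows, with the retrieved list fixed to length k:

```python
def precision_recall_at_k(ranked_items, relevant, k=20):
    """Precision@k and recall@k, a sketch of Equations 2.15-2.16 with the
    retrieved list truncated to the top k predictions."""
    hits = len(set(ranked_items[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

p, r = precision_recall_at_k([3, 1, 9, 2, 7], relevant={3, 7}, k=5)
print(p, r)  # 0.4 1.0
```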

3 Method

This chapter first describes how the pre-study was conducted, then presents the data in more detail and the implementation of the models. Finally, the tuning of the models is described, as well as the final evaluation of these models on the data.

3.1 Pre-study

A pre-study was conducted in order to get a grasp of the state of the research within recommender systems. It consisted of reading up on the field and studying several papers on the topic, from highly influential older papers to papers on newer state-of-the-art models, as well as papers on data splitting techniques, relevant evaluation metrics and data filtering. The pre-study was concluded with a literature summary as well as a mapping of important papers.

3.2 Data

The data was fetched from Shareville's database. It was captured at a single point in time1 and contained seven columns in total, as shown in Table 3.1. The 'portf id' column was never used; instead, the user id was used as the identifier, making the recommender system user-specific as opposed to portfolio-specific. Also, all inactive users were filtered out.

To explain the columns in Table 3.1: 'user id' is the identifier of one investor and 'country' is the country of the investor. 'Portf id' is the portfolio id, which could be used to separate one investor's different portfolios, but it was discarded in this project. 'Inst id' is the id of the instrument, identifying which instrument the user owns, and '% of portf' states how much of the investor's portfolio capital is invested in that instrument; the percentage of uninvested money is simply 100 minus the sum of the invested percentages. 'Symbol' is the instrument's ticker2, making the instrument easier to identify. Finally, 'timestamp' is the time the user last acquired that instrument; if a user has bought instrument A multiple times, only the last timestamp is shown in the table.

1The data was captured on March 16, 2021.

user id   country   portf id   inst id   % of portf   timestamp                        symbol
1234      "SE"      4321       9876      14.54        "2020-01-04 14:04:05.541664"     "VOLV"
1234      "SE"      4321       9877      8.22         "2019-01-04 17:04:05.543181"     "INVE B"
1234      "SE"      4322       9878      2.97         "2020-06-24 09:02:05.741361"     "BITCOIN XBT"
1235      "SE"      4320       9879      19.12        "2020-03-10 17:24:05.121449"     "HM B"
1235      "NO"      4319       9880      5.81         "2020-08-03 13:45:20.750138"     "NN INDEXFOND NO"
1236      "DK"      4318       9881      73.44        "2021-01-04 09:31:00.066482"     "VEST"
1237      "FI"      4317       9882      32.90        "2021-01-19 10:59:59.935102"     "SAMP"

Table 3.1: A fictional example of how the data was structured.

As shown in Table 3.1, a user with multiple assets occupies multiple rows in the table, one for every owned instrument. The user with id 1234 has three financial instruments in their portfolio: "VOLV", "INVE B" and "BITCOIN XBT".

The dataset was exported from the database as a CSV file and preprocessed by converting the timestamps to milliseconds. The dataset was then filtered with two different settings. One setting filtered the data such that all users had more than 4 instruments in their portfolio and all instruments occurred in more than 4 portfolios; similarly, the other setting kept users with more than 10 instruments owned and instruments with more than 10 owners. The unfiltered data was also used in the experiments. These three datasets will be named 'no filter', '4,4 filter' and '10,10 filter' respectively.
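A pandas sketch of such a count filter; whether the thesis applied the filter in a single pass or repeated it until both conditions held simultaneously is not stated, so the loop below is an assumption:

```python
import pandas as pd

def filter_min_counts(df, min_user=4, min_item=4):
    """Keep users owning more than `min_user` instruments and instruments
    held by more than `min_item` users. Repeats until stable, since one
    pass can re-violate the other condition."""
    while True:
        user_counts = df.groupby("user_id")["inst_id"].transform("size")
        item_counts = df.groupby("inst_id")["user_id"].transform("size")
        keep = (user_counts > min_user) & (item_counts > min_item)
        if keep.all():
            return df
        df = df[keep]
```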

The three datasets (the unfiltered one, the 4,4 filtered one and the 10,10 filtered one) were then split according to three different splitting techniques, resulting in nine different split datasets. Global temporal, user temporal and leave one out data splits were used.

For the implicit modelling of the data, the % of portfolio column was simply discarded and replaced with ones, a one indicating that the user owns that instrument. Some models also needed the data converted into a large user-item matrix, where the value at position (x, y) indicates the implicit or explicit rating of item y for user x. An example is shown in matrix (3.1) below.

$$\begin{pmatrix}
 & \text{instrument}_1 & \text{instrument}_2 & \dots & \text{instrument}_n \\
\text{user}_1 & 1 & 0 & \dots & 0 \\
\text{user}_2 & 0 & 1 & \dots & 0 \\
\text{user}_3 & 0 & 0 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\text{user}_m & 0 & 0 & \dots & 1
\end{pmatrix} \tag{3.1}$$

3.3 Implementation

The implementation included pre-processing the data, filtering the data, implementing the different models and implementing the evaluation metrics. This was implemented in Python.

For pre-processing and filtering of the data, pandas3 was used. For splitting the data, some functions from beta_recsys4 (Meng, McCreadie, Macdonald, et al., 2020) were used, and for implementing the actual models, the following libraries were used:

• PyTorch5
• NumPy6
• Surprise7
• SciPy8
• Pickle9

The SVD model was implemented with the aid of SciPy, and the two other models, MF and LightGCN, were implemented in PyTorch. MF was implemented using two different loss functions, BCE and BPR. The evaluation metrics were implemented with the help of NumPy.

3.4 Tuning

For tuning the models, the validation set was used to select the best model. The Item Popularity model has no trainable hyperparameters, since it only recommends the most commonly owned instruments. For the other models, the hyperparameter tuning was conducted as follows.

Hyperparameters tuned for the SVD model:
• dimensions, k = [16, 32, 64, 128, 256]

Hyperparameters tuned for the MF model:
• learning rate, lr = [5e-3, 1e-3]
• regularization, λ = [1e-3, 1e-4, 1e-5, 0]
• loss function, BCE or BPR

Hyperparameters tuned for the LightGCN model:
• learning rate, lr = [1e-3, 5e-4]
• regularization, λ = [1e-3, 1e-4, 1e-5]
• number of propagation layers, n_layers = [1, 2, 3, 4]

The models were tuned on the global temporal data split with the 4,4 filter setting; the best hyperparameters were then applied to the models when re-trained on data that had been split or filtered differently. The model with the best evaluation results on the validation set was then used to evaluate the test set. The models were trained with all combinations of hyperparameters, and a set seed of 2021 was used whenever possible10.

3https://pandas.pydata.org/
4https://github.com/beta-team/beta-recsys
5https://pytorch.org/
6https://numpy.org/
7http://surpriselib.com/
8https://www.scipy.org/
9https://docs.python.org/3/library/pickle.html

Both LightGCN and MF were tuned with a fixed batch size of 2048, and both models used Adam as the optimizer. The LightGCN model was trained for 1000 epochs, as suggested by the authors (He, Deng, et al., 2020), and the MF models were trained for 50 epochs, with more frequent evaluations, because of the algorithm's faster convergence (Rendle, Freudenthaler, Gantner, et al., 2009).

3.5 Evaluation

To evaluate the models, the metrics precision@k, recall@k and NDCG@k, where k = [5, 10, 20], were calculated on the test sets. Different types of data splits were also evaluated: the global temporal split, the user temporal split and the leave one out split. In the global temporal split, 10% of the data was used for validation, 10% for testing and the rest for training; the same applied to the user temporal split. The models that were evaluated were chosen based on their performance on the validation set.

3.6 Previous Applications

All the models used in this thesis have previously been applied to different datasets. Sarwar et al. (2000b) applied SVD to a movie dataset, as well as to a dataset from an e-commerce site. Matrix factorization has been applied widely and is one of the most common techniques; it has been applied to datasets that include movies, social media interactions (He, L. Liao, et al., 2017), restaurant reviews, location check-ins and e-commerce (X. Wang et al., 2019). LightGCN has been applied to the three datasets mentioned in the introduction: locations, restaurant reviews and e-commerce. However, to the author's knowledge, none of the models have been applied to a portfolio dataset prior to this thesis. The applications of the models are also mapped out in Table A.

The evaluation metrics used have also been widely used in recommender system research. He, H. Zhang, et al. (2016), He, L. Liao, et al. (2017), X. Wang et al. (2019), and Ebesu, Shen, and Fang (2018) are just some of the papers that use these evaluation metrics for analysing the performance of recommender systems.

4 Results

This chapter will present the findings of the pre-study, some data statistics, the results of the hyperparameter tuning, and the results of the best settings for each model.

4.1 Pre-study

The pre-study resulted in a summary report of the current state of the research regarding recommender systems, as well as a mapping of relevant papers, see Table A. The pre-study also resulted in key decisions for the project.

The decision to model the data as implicit feedback instead of explicit feedback was motivated by the assumption about explicit data in this scenario stated in Section 2.3: a user with a smaller percentage of capital in one instrument might like that instrument more than another user with a higher percentage of their capital in it, which contradicts the assumption. Also, using implicit feedback seems to result in better results1 (Steck, 2010). The chosen models were also a result of the pre-study. Item popularity was chosen because of its simplicity as a baseline model. The SVD model (Equation 2.6) was chosen because of its speed for sparse matrices and its simplicity. The different matrix factorization models (Section 2.4) were chosen because MF is a well-established, popular model in recommender systems for other types of data, with a decent track record. The LightGCN model (Section 2.4) was chosen because of its performance, and because it is the state-of-the-art in collaborative filtering models for implicit feedback.

Two other models were also considered, namely the NCF model (Section 2.4; He, L. Liao, et al., 2017) and the CMN model (Ebesu, Shen, and Fang, 2018). The NCF model was not included since it has not been shown that the nonlinearity it introduces greatly enhances the quality of the recommendations, and Rendle, Krichene, et al. (2020) show that it can be outperformed by a better-tuned MF model. The CMN model is also a state-of-the-art model, but it was discarded because of its similarity to the LightGCN model as well as its relative performance: the CMN model also uses a neighborhood approach, while performing worse than the LightGCN model (He, Deng, et al., 2020).

1Experiment comparing implicit and explicit feedback, reproducing results of (Steck, 2010): https://github.

Other baselines, such as nearest-neighbour approaches, were also discarded due to their poor performance (Koren, Bell, and Volinsky, 2009).

Furthermore, session-based models were also discarded, since they did not seem to fit the task. Session-based models are good at modelling recurrent transactions and the like, which does not fit this project, as it is directed towards buy-and-hold users, as stated in Section 1.5. Further, the Markov Chain assumes a strong dependency between consecutive items (Rendle, Freudenthaler, and Schmidt-Thieme, 2010), which may not be the case here. Finally, there exists no obvious way to model a sell transaction for a session-based model.

4.2 Data

Statistics of the data after the filtering are presented in Table 4.1. An interaction is defined as one holding. The density is calculated as $\text{density}(A) = \frac{\text{number of nonzero elements}(A)}{\text{total number of elements}(A)}$. All results are rounded to four decimals.

Filter # of interactions # of unique users # of unique instruments density

- 1 307 227 190 065 15 501 0.0004%

4,4 1 075 710 90 461 6 674 0.1782%

10,10 680 965 35 530 4 061 0.4720%

Table 4.1: Number of interactions/datapoints in each of the three datasets after different fil-tering.

The number of interactions in each of the three sets (the training set, the validation set and the test set) is presented in Tables 4.2, 4.3 and 4.4. Each table represents a different data filtering setting.

Data split        # train      # validation   # test
Global temporal   1 045 781    130 723        130 723
Temporal user     853 193      208 919        245 115
Leave one out     963 293      153 869        190 065

Table 4.2: Statistics of the three data splits on unfiltered data. The # train, # validation and # test columns represent the number of interactions left in each of the three sets.

Data split        # train      # validation   # test
Global temporal   860 568      107 571        107 571
Temporal user     787 274      144 204        144 232
Leave one out     894 816      90 433         90 461

Table 4.3: Statistics of the three data splits on 4,4 filtered data. The # train, # validation and # test columns represent the number of interactions left in each of the three sets.

Data split        # train      # validation   # test
Global temporal   544 771      68 097         68 097
Temporal user     506 040      87 461         87 464
Leave one out     609 908      35 527         35 530

Table 4.4: Statistics of the three data splits on 10,10 filtered data. The # train, # validation and # test columns represent the number of interactions left in each of the three sets.

4.3 Tuning

The results of the tuning of the SVD model are presented in Table 4.5. Further performance of the SVD model is not reported due to its poor results. MF with BPR as loss function will be denoted simply as MF-BPR, and MF with the BCE loss function as MF-BCE.

Dimensions   recall@20   precision@20   ndcg@20
8            0.0159      0.0041         0.0080
16           0.0144      0.0036         0.0073
32           0.0118      0.0028         0.0058
64           0.0103      0.0027         0.0054
128          0.0093      0.0025         0.0052
256          0.0091      0.0026         0.0050

Table 4.5: SVD tuning performance for different number of dimensions.

The results presented in Tables 4.6, 4.7 and 4.8 describe the evaluation results, achieved on the validation set, of the best tuned models. The data originates from the global temporal split with 4,4 filtering applied, as mentioned in Section 3.4.

MF-BCE      @5       @10      @20
Recall      0.0597   0.1051   0.1723
Precision   0.0276   0.0245   0.0203
NDCG        0.0454   0.0615   0.0823

Table 4.6: Best tuning results of the MF-BCE model evaluated on the validation set. Achieved with lr=0.005 and λ=0.

MF-BPR      @5       @10      @20
Recall      0.0692   0.1161   0.1904
Precision   0.0325   0.0276   0.0228
NDCG        0.0542   0.0711   0.0938

Table 4.7: Best tuning results of the MF-BPR model evaluated on the validation set. Achieved with lr=0.005 and λ=1e-05.

LightGCN    @5       @10      @20
Recall      0.1339   0.2003   0.2855
Precision   0.0624   0.0481   0.0349
NDCG        0.1137   0.1371   0.1635

Table 4.8: Best tuning results of the LightGCN model evaluated on the validation set. Achieved with lr=0.001, λ=0.001 and n_layers=4.

Figures 4.1 and 4.2 show validation results during model training. The figures include recall@5, recall@10 and recall@20 with regard to the current training epoch.

Figure 4.1: Plot over all training epochs for MF-BCE with lr=0.005 and λ=0.001.

Figure 4.2: Plot over all training epochs for MF-BPR with lr=0.005 and λ=1e-05.

4.4 Evaluation

In this section, the evaluation results for each model are presented. In Tables 4.9, 4.10, 4.11 and 4.12, the models are evaluated using three different data splitting techniques and three different filtering techniques. The hyperparameters used were the ones found during tuning, and remained the same when training the models regardless of the data split and filtering. The metrics were applied to the test set.

Finally, the results achieved on the test set are presented in Tables 4.13, 4.14, 4.15 and 4.16, where all metrics are included; these tables show the results when the models were trained on the 4,4 filtered dataset with the global temporal data split.

Filter   Data split        recall@20   precision@20   ndcg@20
-        Global temporal   0.0966      0.0100         0.0427
4,4      Global temporal   0.0977      0.0122         0.0449
10,10    Global temporal   0.0877      0.0136         0.0433
-        User temporal     0.1280      0.0080         0.0510
4,4      User temporal     0.1106      0.0082         0.0475
10,10    User temporal     0.1004      0.0118         0.0481
-        Leave one out     0.1283      0.0064         0.0483
4,4      Leave one out     0.1063      0.0053         0.0416
10,10    Leave one out     0.0879      0.0044         0.0330

Table 4.9: Item Popularity performance for different filtering and different data splits.

Filter   Data split        recall@20   precision@20   ndcg@20
-        Global temporal   0.1419      0.0156         0.0669
4,4      Global temporal   0.1464      0.0186         0.0696
10,10    Global temporal   0.1333      0.0212         0.0673
-        User temporal     0.1598      0.0101         0.0678
4,4      User temporal     0.2072      0.0147         0.0894
10,10    User temporal     0.1608      0.0187         0.0788
-        Leave one out     0.1600      0.0080         0.0645
4,4      Leave one out     0.2152      0.0108         0.0857
10,10    Leave one out     0.1730      0.0086         0.0666

Table 4.10: MF-BCE performance for different filtering and different data splits.

Filter   Data split        recall@20   precision@20   ndcg@20
-        Global temporal   0.1461      0.0161         0.0697
4,4      Global temporal   0.1652      0.0209         0.0819
10,10    Global temporal   0.1512      0.0239         0.0775
-        User temporal     0.1595      0.0103         0.0686
4,4      User temporal     0.2226      0.0160         0.0982
10,10    User temporal     0.1782      0.0210         0.0889
-        Leave one out     0.1635      0.0082         0.0666
4,4      Leave one out     0.2289      0.0114         0.0977
10,10    Leave one out     0.2001      0.0100         0.0779

Table 4.11: MF-BPR performance for different filtering and different data splits.

Filter   Data split        recall@20   precision@20   ndcg@20
-        Global temporal   0.2502      0.0271         0.1334
4,4      Global temporal   0.2846      0.0348         0.1627
10,10    Global temporal   0.2626      0.0399         0.1506
-        User temporal     0.2955      0.0192         0.1484
4,4      User temporal     0.3196      0.0233         0.1621
10,10    User temporal     0.2658      0.0314         0.1483
-        Leave one out     0.2368      0.0118         0.1102
4,4      Leave one out     0.3312      0.0166         0.1556
10,10    Leave one out     0.2991      0.0150         0.1348

Table 4.12: LightGCN performance for different filtering and different data splits.

Table 4.17 shows an example of the training data for a single user, when the global temporal data split has been applied.


Item Popularity   @5       @10      @20
Recall            0.0282   0.0525   0.0977
Precision         0.0130   0.0125   0.0122
NDCG              0.0215   0.0305   0.0449

Table 4.13: Item Popularity model evaluated with the global temporal split with 4,4 filter settings.

MF-BCE      @5       @10      @20
Recall      0.0457   0.0839   0.1464
Precision   0.0227   0.0212   0.0186
NDCG        0.0352   0.0499   0.0696

Table 4.14: Results of best MF-BCE model on 4,4 filtered data with global temporal split.

MF-BPR      @5       @10      @20
Recall      0.0568   0.0978   0.1652
Precision   0.0281   0.0245   0.0209
NDCG        0.0452   0.0607   0.0819

Table 4.15: Results of best MF-BPR model on 4,4 filtered data with global temporal split.

LightGCN    @5       @10      @20
Recall      0.1334   0.1993   0.2846
Precision   0.0622   0.0478   0.0348
NDCG        0.1129   0.1362   0.1627

Table 4.16: Results of best LightGCN model on 4,4 filtered data with global temporal split.

Tables 4.18, 4.19, 4.20 and 4.21 present the predicted recommendations by each model for the example user. In all tables, the first column is the ticker, the second column is the type of instrument and the third column is the country the instrument is listed in.

Ticker Instrument type Country

BITCOIN XBT Certificate -

3KR Stock SWE

TIN NY TEKNIK A Fund SWE

HEC Stock USA

USPY ETF GER

QDVD ETF GER

WTI2 ETF GER

W1TA ETF GER

PLEJD Stock SWE

UI Stock USA

DIS Stock USA

EQQQ ETF GER

NOVO B Stock DEN

MOWI Stock NOR

SKIS B Stock SWE

ESSITY B Stock SWE

KAHOT Stock NOR

NWH.UN Stock CAN

AUAG SILVER BULLET Fund -

Table 4.17: Example of instruments in the training set, for a single user.

Ticker Instrument type Country

EVO Stock SWE

ESP0 ETF GER

XACTHDIV Fund SWE

IQQH ETF GER

EMBRAC B Stock SWE

INVE B Stock SWE

MSFT Stock USA

INVE A Stock SWE

KIND SDB Stock SWE

AAPL Stock USA

BRK.B Stock USA

CIBUS Stock SWE

FNOX Stock SWE

GZUR ETF GER

LATO B Stock SWE

AMZN Stock USA

KAMBI Stock SWE

O Stock USA

SBB B Stock SWE

SWMA B Stock SWE

Table 4.18: LightGCN recommendations for the example user.

Ticker Instrument type Country

SWED A Stock SWE

INVE B Stock SWE

CAST Stock SWE

AXFO Stock SWE

SBB B Stock SWE

HM B Stock SWE

EVO Stock SWE

NAS Stock NOR

INVE A Stock SWE

LATO B Stock SWE

KINV B Stock SWE

SWMA Stock SWE

DANSKE Stock DEN

AZN Stock SWE

TELIA B Stock SWE

SEB A Stock SWE

FING B Stock SWE

EQNR Stock NOR

SALT B Stock SWE

NIO Stock SWE

Table 4.19: MF-BPR recommendations for the example user.

Ticker Instrument type Country

INVE B Stock SWE

HM B Stock SWE

INVE A Stock SWE

AZN Stock SWE

LATO B Stock SWE

CAST Stock SWE

INDU C Stock SWE

SWED A Stock SWE

SWMA Stock SWE

DANSKE Stock DEN

SVOL B Stock SWE

KINV B Stock SWE

KIND SDB Stock SWE

EMBRAC B Stock SWE

NDA SE Stock SWE

EVO Stock SWE

LUND B Stock SWE

AXFO Stock SWE

ERIC B Stock SWE

TELIA Stock SWE

Table 4.20: MF-BCE recommendations for the example user.

Ticker Instrument type Country

FORTUM Stock SWE

SAMPO Stock FIN

NDA FI Stock SWE

INVE B Stock SWE

NOKIA Stock FIN

NOVO B Stock DEN

NAS Stock NOR

DANSKE Stock DEN

KAHOT Stock NOR

AAPL Stock USA

HM B Stock SWE

NEL Stock NOR

KINV B Stock SWE

VWS Stock DEN

EVO Stock SWE

LATO B Stock SWE

SWED A Stock SWE

CAST Stock SWE

ORSTED Stock DEN

KOA Stock NOR

Table 4.21: Item Popularity recommendations.

5 Discussion

This chapter reasons about the results and compares them to current state-of-the-art results on different datasets. Further improvements to the method used are also discussed, as well as the work in a wider context.

5.1 Results

The results of the SVD tuning and testing, presented in Table 4.5, were very poor. This is most likely due to the use of the Frobenius norm (2.3) as well as severe overfitting. Since the SVD model uses no regularization and simply always finds the optimal fit with respect to the Frobenius norm on the training data, the results are not generalizable. This is similar to what Lim (2007) found when investigating the model: SVD simply acts as an overfit MF with no regularization. And as shown in Figure 4.1, MF with BCE overfits after only 10 epochs. When comparing the loss functions in MF, it is clear that BPR is the most suitable loss function for the problem, as Rendle, Freudenthaler, Gantner, et al. (2009) previously acknowledged. When using the BCE loss, the model achieves its best results after only a couple of epochs, as seen in Figure 4.1; thereafter, the model overfits and performs poorly, regardless of the magnitude of regularization used. MF with the BPR loss does not suffer from this problem, see Figure 4.2: even though the recall might not be increasing, it is barely decreasing, showing clearly which loss function is more appropriate for this problem. LightGCN achieves better results than the other models with respect to all evaluation metrics, as shown in Tables 4.16 and 4.12, which is in line with the work of He, Deng, et al. (2020). This means that their model generalizes well to new data, and possibly that it could be applied to several other types of data as well. LightGCN also achieves a lower BPR loss than the MF model, indicating that it is a more expressive model, better suited for the problem.

MF-BPR results on recall and NDCG, presented in Table 4.15, are better than the results achieved on other datasets, such as Gowalla, Yelp and Amazon-book (X. Wang et al., 2019); the precision score is not reported for those datasets. MF-BPR achieves its best results on the Gowalla dataset compared to Yelp and Amazon-book, yet it performs even better on this stock dataset. As for the LightGCN model, the recall and NDCG results, see Table 4.16, are also better than the reported results on Gowalla, Yelp and Amazon-book. This indicates that the models generalize to other datasets as well; they are not limited to a certain type of data. However, the data heavily impacts the exact results obtained, and therefore one should not expect the same numbers when training these models on new data. Still, since LightGCN has outperformed MF-BPR on multiple different datasets, the relative results do seem to generalize to new datasets.

As seen in Tables 4.11, 4.12 and 4.13, the results differ when using different types of data split, which is in line with what Meng, McCreadie, MacDonald, et al. (2020) also showed. For the MF-BPR model, the user temporal data split resulted in better recall and NDCG than the global temporal split. This is interesting, since the global temporal split utilizes more training data. However, the test sets for the two splits are quite different. The user temporal test set contains nearly every user, with fewer data points per user, whereas with the global temporal split there are more data points per user but fewer users in total. This explains the higher recall for the user temporal split, since the number of relevant stocks is simply smaller for this type of split. The same holds for NDCG, since it is harder to rank many data points than few. With this reasoning, the results of the leave-one-out data split are also reasonable: it has more training data than the user temporal split, and its recall and NDCG results are further boosted by the effect of having only one interaction as test data for each user. For precision it is the opposite: when more data points occur in the test set, a higher precision is expected, simply because there are more relevant items while the number of retrieved items is fixed to k, see 2.15.
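The interaction between test-set size and the fixed cut-off k can be illustrated with a small sketch of precision@k and recall@k following their standard definitions (see 2.15); the tickers and helper name below are chosen purely for this example.

def precision_recall_at_k(recommended, relevant, k=20):
    """Precision@k divides the number of hits by k (fixed),
    recall@k divides it by the number of held-out relevant items."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

# A top-20 recommendation list (filler items stand in for the tail).
top20 = ["EVO", "HM B", "CAST", "AZN", "SWED A"] + [f"X{i}" for i in range(15)]

# One held-out instrument per user (leave-one-out style test set):
print(precision_recall_at_k(top20, {"EVO"}))
# -> (0.05, 1.0): low precision, perfect recall

# Several held-out instruments per user (global temporal style test set):
print(precision_recall_at_k(top20, {"EVO", "AZN", "SEB A", "NDA SE"}))
# -> (0.10, 0.5): precision rises, recall falls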

When comparing the different types of data filtering, there seems to be a trade-off between the amount of data and the quality of data. As seen in Table 4.9, the item popularity model always performed better with more data. Removing users with few owned instruments and instruments that occur rarely removes noise from the data, simplifying the learning task for the chosen model. Some indication of this can be seen in, for example, Table 4.12, where the model performed better with the 4,4 filter setting than on the unfiltered data. However, filtering too much results in fewer data points, which makes the generalization of the model worse and therefore worsens the test results. For some models, the 10,10 filter setting removed too much data, which affected the results negatively, as shown in Tables 4.10, 4.11 and 4.12.
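A minimal sketch of such a filter is given below, under the assumption that an n,m setting means removing users owning fewer than n instruments and instruments owned by fewer than m users, applied repeatedly until stable; the pandas code and column names are illustrative, not the project's actual implementation.

import pandas as pd

def filter_interactions(df: pd.DataFrame, min_user: int, min_item: int) -> pd.DataFrame:
    """Repeatedly drop sparse users/instruments until stable, since
    removing one side can push the other below its threshold."""
    while True:
        user_counts = df.groupby("user_id")["ticker"].transform("count")
        item_counts = df.groupby("ticker")["user_id"].transform("count")
        mask = (user_counts >= min_user) & (item_counts >= min_item)
        if mask.all():
            return df
        df = df[mask]

# e.g. the 4,4 setting:
# filtered = filter_interactions(interactions, min_user=4, min_item=4)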

In Table 4.17, the example user's current holdings are presented, including several German exchange-traded funds mixed with some Nordic and American stocks. To this user, the LightGCN model recommends a mix of German-listed ETFs and Swedish stocks and funds, as presented in Table 4.18. The MF-BCE model, in contrast, does not recommend any ETFs listed in Germany; instead, it basically only recommends Swedish stocks for this user, see Table 4.20, except for one Danish stock. MF-BPR diversifies a little more, including two more Norwegian stocks in its list of recommendations. Since the user's current holdings include both stocks from three Nordic countries (Sweden, Norway, Denmark) and ETFs listed in Germany, it can be considered slightly odd that both MF models lean so heavily towards Swedish stocks. This would be reasonable if the most owned instruments were exclusively Swedish stocks, but as can be seen in Table 4.21, the most popular instruments include several stocks from Finland, Denmark and Norway, apart from the Swedish stocks.
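For context, a top-20 list like those in Tables 4.18-4.20 can be produced from learned embeddings roughly as sketched below. This is a generic illustration with made-up dimensions; in particular, whether already-owned instruments are masked out of the list is an assumption of this sketch, not something stated by the method.

import torch

def recommend_top_k(user_emb: torch.Tensor, item_embs: torch.Tensor,
                    owned: torch.Tensor, k: int = 20) -> torch.Tensor:
    # Score every instrument as the dot product between the user
    # embedding and each item embedding (as in MF/LightGCN), mask out
    # instruments the user already owns, and return the top-k indices.
    scores = item_embs @ user_emb          # shape: (num_items,)
    scores[owned] = float("-inf")          # never recommend held items
    return torch.topk(scores, k=k).indices

# Example with random embeddings: 1000 instruments, dimension 64.
user_emb = torch.randn(64)
item_embs = torch.randn(1000, 64)
owned = torch.tensor([3, 42, 97])          # indices of held instruments
print(recommend_top_k(user_emb, item_embs, owned))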

5.2 Method

The method used in this project could be further improved. Obvious improvements include tuning the models with more combinations of hyperparameters to further improve their results. One could also add learning rate decay to potentially improve the models even more. To get a better understanding of the different algorithms, and of exactly what type of algorithms works best on the current dataset, more models could have been implemented and evaluated. However, these things were not included because of the time limitation of the thesis.
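As an example of the suggested learning rate decay, a step-wise scheduler could be attached to the optimizer as sketched below; the embedding model is only a stand-in for MF/LightGCN and the schedule parameters are arbitrary.

import torch

# Placeholder model; only the scheduler wiring is the point here.
model = torch.nn.Embedding(num_embeddings=1000, embedding_dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Halve the learning rate every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(100):
    # ... one training epoch over the interaction data would go here ...
    optimizer.step()       # dummy step so the example runs end to end
    scheduler.step()       # decay the learning rate on schedule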

In this study, the models were trained with a fixed seed, as mentioned in 3.4. An alternative approach is to train the models multiple times with random seeds. It could be that some models were initialized with particularly good or bad initial parameters because of the fixed seed; with random seeds this would not be an issue, although the results would be harder to reproduce. The best approach would be to use 10 or more different fixed seeds and present the mean results for every model. However, since the training process is burdensome with respect to time and computing limitations, this method was not used.
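A sketch of this multi-seed protocol is shown below; train_and_evaluate is a hypothetical callable standing in for the full training and evaluation pipeline.

import random
import statistics
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Fix all relevant sources of randomness for one run.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def mean_over_seeds(train_and_evaluate, seeds=range(10)):
    # Train once per seed and report mean and spread of the test
    # metric, reducing the influence of a lucky or unlucky
    # parameter initialization.
    scores = []
    for seed in seeds:
        set_seed(seed)
        scores.append(train_and_evaluate())
    return statistics.mean(scores), statistics.stdev(scores)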

5.3 The work in a wider context

Recommender systems in general primarily suffer from two biases: observation bias and bias originating from imbalanced data (Farnadi et al., 2018). Observation bias arises when a feedback loop is created, causing the model to only recommend items similar to previously consumed items. This can also be related to issues regarding personal identity, whereby the recommender system limits the content shown to the user and pushes their individual interests towards the items that they are exposed to (Milano, Taddeo, and Floridi, 2020). Bias originating from imbalanced data, on the other hand, is caused by a systematic bias present in the data due to societal or historical biases. A good example of this concerns recommender systems within recruitment:

“For example, in job recommendation, due to social bias, the data contain a lot of evidence indicating that nursing is a successful profession for females. However, this does not mean that female users must be recommended to receive recommendations for nursing positions in order to be successful.” (Farnadi et al., 2018)

Usually, the model is not aware of such biases, which leads to these biases being reproduced in the model's recommendations. Research on fairness in recommender systems aims to address these issues, which are closely related to the social effects of the recommender system.

Other important issues with recommender systems include recommending inappropriate content, for which "ethical filters" have been proposed (Rodriguez and Watkins, 2009). Privacy risks regarding the data are another issue, for example when the data is collected without explicit consent from the user, or when data about the user is stored.


6 Conclusion

The aim of this report, to apply recommender system techniques to a new dataset and to evaluate the quality of the results, has been achieved. Further experiments consisted of evaluating which model is best suited for this problem and comparing different data splits. The research questions of interest are presented below, together with answers.

• Can recommender system techniques be applied to stock-data to achieve results that are comparable to the quality of results of other established recommender systems?

Yes, the same techniques used in other areas of recommender systems can be applied to stock-data to recommend stocks based on a user's portfolio. The results were comparable to, or even better than, the results reported on other datasets. This means that these systems could be applied in other settings as well, for example to recommend stocks for a fund to buy and, inversely, to recommend a fund to a company for a potential investment.

• Do state-of-the-art deep learning models outperform factorization models when applied to stock-data?

The state-of-the-art method, LightGCN, outperformed classic methods such as MF when applied to the stock-dataset, as shown by all evaluation metrics used. This demonstrates the generalizability of the model: it does not only perform well on the datasets for which results have previously been reported.

• Do different data splitting techniques affect the results of the recommender systems?

Yes, the results for the different data splitting techniques differed greatly, not only because of the sheer amount of data used in training, but also because of the different characteristics of the splitting techniques.


Bibliography

[1] Nordnet. (2021). “Nordnet: Månadsstatistik februari,” [Online]. Available: https://nordnetab.com/sv/press_release/nordnet-manadsstatistik-februari-2/ (visited on 03/23/2021).

[2] R. Vinuesa, H. Azizpour, I. Leite, et al., “The role of artificial intelligence in achieving the sustainable development goals,” Nature Communications, vol. 11, 2020.

[3] S. Rea, “A survey of fair and responsible machine learning and artificial intelligence: Implications of consumer financial services,” DecisionSciRN: Judgement Biases in Decision-Making (Sub-Topic), 2020.

[4] Nordnet. (2020). “Detta är Nordnet,” [Online]. Available: http://nordnetab.com/sv/om/oversikt-nordnet/ (visited on 03/23/2021).

[5] X. He, K. Deng, X. Wang, et al., “LightGCN: Simplifying and powering graph convolution network for recommendation,” Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020.

[6] S. Rendle, C. Freudenthaler, Z. Gantner, et al., “BPR: Bayesian personalized ranking from implicit feedback,” UAI ’09: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461, 2009. DOI: 10.5555/1795114.1795167.

[7] Y. J. Lim, “Variational Bayesian approach to movie rating prediction,” 2007.

[8] Z. Meng, R. McCreadie, C. MacDonald, et al., “Exploring data splitting strategies for the evaluation of recommendation models,” Fourteenth ACM Conference on Recommender Systems, pp. 681–686, 2020. DOI: 10.1145/3383313.3418479.

[9] J. Gottschlich and O. Hinz, “A decision support system for stock investment recommendations using collective wisdom,” Decis. Support Syst., vol. 59, pp. 52–62, 2014.

[10] P. Paranjape-Voditel and U. Deshpande, “A stock market portfolio recommender system based on association rule mining,” Appl. Soft Comput., vol. 13, pp. 1055–1063, 2013.

[11] B. Nair and V. Mohandas, “An intelligent recommender system for stock trading,” Intell. Decis. Technol., vol. 9, pp. 243–269, 2015.

[12] J. McAuley, C. Targett, Q. Shi, et al., “Image-based recommendations on styles and substitutes,” Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015.
