Is it possible to beat the bookie: Machine learning and football


Engineering Degree Project


Abstract

Forecasting is one of the toughest tasks in the world, but that doesn’t stop us from trying. In this project, machine learning is used to predict the three-way result of football fixtures, and the predictions are applied to football betting. The sports betting market is one of the biggest in the world right now, and the sports betting companies are making a lot of money. So the question arises: is it possible to beat them? The answer is yes; with machine learning and historical data, it is possible to beat them.

Keywords: random forest, neural network, football, logistic regression, machine learning, betting.


1 Introduction
1.1 Background
1.2 Related work
1.3 Problem formulation
1.4 Motivation
1.5 Objectives
1.6 Scope/Limitation
1.7 Target group
1.8 Outline
2 Method
2.1 Approach
2.1.1 Strategies
2.3 Data set and data processing
2.3.1 League name
2.3.2 Home team formation/Away Team formation
2.3.3 Home defensive players/Home offensive players/Home goalkeeper/Away defensive players/Away offensive players/Away goalkeeper
2.3.4 Highest average of home players/Highest average of away players/Lowest average of home players/Lowest average of away players
2.3.5 Home Team/Away Team
2.3.6 Home win, draw, loss/Away win, draw, loss
2.4 Ethical Considerations
3 Theory
3.1 Machine learning
3.2 Logistic Regression
3.3 Random Forest
3.4 Neural Network
3.5 Hyperparameters
4 Implementation
4.1 Scripts
4.2.1 whoscored/WSScraper.py
4.2.2 whoscored/main.py
4.2.3 whoscored/player_strat/PlayerScraper.py
4.2.4 oddsportal/main.py
4.2.5 get_data_players.py
4.2.6 ml/
5 Results
5.1 Bookies
5.2 Logistic Regression
5.3 Random Forest
5.4 Neural Network
6 Analysis
7 Discussion
8 Conclusion
8.1 Future work
References
A Appendix 1
A.1 fixtures database
A.2 player_fixtures database
A.3 fixture_formations
A.4 Hyperparameter selection for Logistic Regression
A.5 Hyperparameter selection for Random Forest
A.6 Hyperparameter selection for Neural Network
A6.1 Optimizer
A6.2 Learning rate
A6.3 Neurons
A6.4 Hidden layers


1 Introduction

It may be impossible to predict the future, but with the help of machine learning we can get an estimate of something that is going to happen. In the world of football betting, an odds set by the bookies¹ is just that: an estimate of something that is going to happen.

This project will test different machine learning models' ability to predict different football fixtures². The probabilities created by the different models will be converted into odds and compared with the bookies' odds. If the odds given by the model are lower than the odds given by the bookies, the model considers that outcome more likely than the bookies do.

1.1 Background

The gambling market is one of the biggest markets in the world right now. In Sweden, gambling companies spent 738.5 million dollars on commercials in 2018, roughly 73.85 dollars per person [1]. The biggest gambling company operating in Sweden made a profit of 670 million dollars that year [2]. Approximately two-thirds of the Swedish population gambled with money in 2018, and about half of those stated that they gambled at least once every month [3]. A lot of people gamble, and gambling companies earn a lot of money.

But why do people gamble? According to a study in the International Journal of Mental Health and Addiction, which interviewed 131 people from different backgrounds in New Zealand [4], there were five major reasons:

• Economic reasons, such as winning experiences.
• Personal reasons, such as mood.
• Recruitment, such as how gambling is normalised.
• Environmental reasons, such as the availability.
• Social reasons, such as social participation.

Notice that none of the above reasons were “saving” or “investing”.

There are many different ways to gamble. One popular gambling feature is trying to predict the 3-way result of a football fixture, which is either a home win, a draw, or an away win. A gambler then makes a decision based on the given odds and different parameters about the fixture. Since it boils down to a decision problem, we can use machine learning to predict the fixtures. So the question arises: is it possible, with the help of machine learning, to beat the sports betting companies and earn money?

¹ A bookie or bookmaker is someone who sets odds.

² A football match between two teams.


1.2 Related work

Several articles try to predict the outcome of football fixtures with the help of data.

Sumpter has written about a strategy in his book “Soccermatics: Mathematical Adventures in the Beautiful Game”. In short, he used a statistical model to show that it pays off to back the favourites in the Premier League and that there is an even stronger bias against betting on a draw between two evenly matched teams, especially between the big six. He was able to make a 25% return over half a season [5].

Stübinger and Knoll tested different machine learning models' ability to predict “easy wins” in the top five European football leagues by training the models to predict the goal difference between the home team and the away team. When a model predicted that one of the teams would win by more than 1 goal, a bet was placed. A Random Forest had 75.62% accuracy on the 3-way results for the fixtures where the model predicted that a team would win by more than 1 goal [6].

Tax and Joustra trained different machine learning models, such as decision trees, neural networks, and naive Bayes, to predict fixtures from the Dutch Eredivisie. One of the models had a higher accuracy than the bookies on unseen data [7].

1.3 Problem formulation

The odds set by the gambling companies can be seen as probabilities. To convert a probability into odds, you divide 1 by the probability. For example, if the 3-way odds for a fixture between Team A and Team B were 2.00 for Team A to win, 5.00 for a draw, and 3.33 for Team B to win, the probabilities would be 50% for Team A, 20% for a draw, and 30% for Team B. In a perfect world, this is how odds are generated. However, gambling companies do not offer these true odds. To make money, they decrease them, so the odds might instead become 1.90 for Team A, 4.90 for a draw, and 3.23 for Team B.
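As a minimal sketch of the conversion described above, the following Python snippet turns probabilities into decimal odds and back, and measures how far the implied probabilities of the reduced odds exceed 100%. The function names are illustrative assumptions and are not taken from the project's scripts.

```python
def probability_to_odds(probability):
    """Convert a probability (0 < p <= 1) into decimal odds."""
    return 1.0 / probability


def odds_to_probability(odds):
    """Convert decimal odds into the probability they imply."""
    return 1.0 / odds


# Fair odds from the example: 50% home win, 20% draw, 30% away win.
fair_odds = [round(probability_to_odds(p), 2) for p in (0.50, 0.20, 0.30)]

# The reduced odds from the same example.
offered_odds = [1.90, 4.90, 3.23]
implied = [odds_to_probability(o) for o in offered_odds]

# The implied probabilities sum to more than 1; the excess is the bookie's margin.
margin = sum(implied) - 1.0
```

With the example odds, the implied probabilities add up to roughly 104%, so the margin is about 4 percentage points.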

So how well do the odds created by different machine learning models stand up against the odds given by the bookies? This report focuses on finding models that can predict the 3-way result of football fixtures and on finding strategies that can earn money.

At the website oddsportal.com, where you can find odds from different bookies, there are 37 different bookies compared. These bookies compete with each other for customers. To gain customers, you can’t provide odds that


For example, when a big team with a large fanbase like Arsenal plays against a smaller team with a smaller fanbase like Fulham, are the odds based only on historical data? Or do they also depend on how much money is expected to be placed on one of the teams, in order to attract customers? If so, I believe the machine learning models would find odds that are too generous.

1.4 Motivation

Creating machine learning models to predict football fixtures and finding betting strategies can be useful for more than just trying to earn money. The approach taken in this report can also help coaches and teams perform better. For example, when Team A plays against Team B, what happens if we feed our model with features from another player or a different starting formation: does the probability of winning increase? What do the players need to do to increase the probability of winning? Is it passing accuracy, shot accuracy, more shots, or something else?


1.5 Objectives

The first objective (O1) is to scrape whoscored.com for data on different fixtures, both team performance data and individual player data. The dataset will be complete once the odds from different bookies have been scraped from oddsportal.com (O2). O3 will focus on preparing and creating different features before O4 can begin training different machine learning models. When O4 is done and the results are considered acceptable, O5 will try to find strategies that can yield a positive return.

Finally, O6 will look more closely at the odds created by the models and the odds given by the bookies: where do they differ, and why?

1.6 Scope/Limitation

Football is one of the biggest sports in the world. According to FIFA (Fédération Internationale de Football Association), 3.572 billion people watched one game or more of the 2018 World Cup [8]. In 2007, FIFA had 265 million registered football players [9].

Because of this, collecting and processing data on all of football would require more time and computing power than is available within this project's timeframe. Instead, the leagues and seasons that were used in this project are shown in Table 1.1.

O1  Scrape data from whoscored.com
O2  Scrape data from oddsportal.com
O3  Data preprocessing
O4  Train and evaluate the models' ability to predict 3-way results
O5  Create and evaluate strategies
O6  Examine the results

Country League Seasons

England Premier League 2009 - 2020


The sports betting companies also provide a large number of different odds. For example, in 2014, a Swedish teacher placed a bet that the Uruguayan striker Luis Suarez would bite someone during the World Cup. When Uruguay faced Italy, Luis Suarez bit the Italian defender Giorgio Chiellini and the Swedish teacher won 1400 dollars [10].

This report will not cover every type of odds that the bookies offer. Instead, only odds on the 3-way result are used, which is either a home win, a draw, or an away win.

1.7 Target group

The target group of this report is primarily people with an understanding of machine learning and an interest in football.

1.8 Outline

Chapter 2 describes the method used to solve this problem and explains the strategies. Chapter 3 presents the theory behind the project, with explanations of the different machine learning models used. Chapter 4 presents the scripts used with the method described in Chapter 2. Chapter 5 presents the results of the method and Chapter 6 analyses these results. Lastly, Chapter 7 discusses the results and offers some of my own thoughts about this report.


2 Method

The machine learning models that were experimented with were logistic regression, random forest, and neural network. These are the three models I feel most comfortable with, and in previous projects they have always given me the best results. To find the best hyperparameters for each model and to find strategies that outperform the bookies, controlled experiments were conducted on a validation dataset.

2.1 Approach

Before the different machine learning models were trained, data was extracted from the websites whoscored.com and oddsportal.com. Whoscored contains data about the different fixtures and oddsportal contains the corresponding closing odds for the fixtures, which means that the odds were set just before the game started. Because of that, we need data on the line-ups as well; otherwise, the odds from the bookies will not match the probabilities from the models.

The data was divided into three sets: 60% for training, 20% for validation, and 20% for testing. Since football is always evolving, the validation and training data were combined when building the model for the test set, producing a split of 80% training data and 20% testing data.
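The split described above can be sketched as follows. This is a minimal illustration that assumes the fixtures are ordered by date, oldest first, as the timeline in Figure 2.1 suggests; the function name and the stand-in data are hypothetical and not taken from the project's scripts.

```python
def chronological_split(fixtures, train_frac=0.6, val_frac=0.2):
    """Split fixtures (assumed sorted oldest first) into train, validation, test."""
    n = len(fixtures)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return fixtures[:train_end], fixtures[train_end:val_end], fixtures[val_end:]


# Hypothetical stand-in for the scraped fixtures: their indices in date order.
fixtures = list(range(100))
train, validation, test = chronological_split(fixtures)

# For the final model, training and validation data are merged (80% / 20%),
# so the model also sees the most recent fixtures before the test period.
final_train = train + validation
```

Hyperparameters and strategies would be chosen using `train` and `validation` only; `test` stays untouched until the final model has been trained on `final_train`.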

In total, 23328 fixtures were scraped. If the validation and training sets had not been combined, the model would not have been aware of 4665 fixtures. And since those 4665 fixtures are also the latest, maybe they are

Figure 2.1 - Visualisation of dataset split. [Timeline labels: 2009, 2016, 2018 and 2009, 2018, 2020; segment labels: Training, Testing.]


with the odds created from the bookies. If the odds created by the model were lower than the odds from the bookies, the model has found that something is more likely to happen than the bookies think. A range of differences between the bookies' odds and the models' odds was tested, both on the fixtures that had a prediction from the model and on all the fixtures, with or without a prediction from the models.

After the hyperparameter tuning and the strategies were found on the validation set, the validation set was combined with the training set to build a new training set that was going to be used to train a model for the test set. This new model had the same hyperparameters as the first model, and the strategies that worked on the validation set would hopefully work on the test set.

2.1.1 Strategies

As stated before, the probabilities that the model created were converted into odds. To convert a probability into odds, you divide 1 by the probability. There are different ways to express odds; this is the way European bookies do it.

The odds created by a model were then compared to the odds from the bookies to find a specific threshold that could yield a positive return. The different thresholds were inserted into a matrix together with the probabilities from the model, to visualise how the thresholds and the probabilities interact.

$$\text{odds} = \frac{1}{\text{probability}}$$

Difference
0.0    -34    -32    -10    ...      1
0.3     43     32      5    ...    -10
0.6     10     -3      2    ...      0
...    ...    ...    ...    ...    ...
2.7      0     10     23    ...
       0.0    0.1    0.2    ...    0.9
                Probability


Table 2.1.1 shows the matrix used to create strategies. The left axis displays the differences in the odds from the model compared to the bookies and the bottom axis shows the probability from the model.

In Table 2.1.1, when the left axis is 0.3 and the bottom axis is 0.2, the matrix shows 5. This means that all odds created by the model that are at least 0.3 lower than the odds from the bookies, and that have a model probability over 0.2, would together have returned 5 times the stake. In other words, if you had bet 10€ on every fixture where the model's odds were lower than the bookies' odds by 0.3 or more and the model's probability was more than 0.2, you would have made a profit of 50€.

For example, when Manchester United played at home against Manchester City, the 3-way odds from the bookies were 4.50 for a home win, 3.97 for a draw, and 1.77 for an away win. If one of our models had given a probability of 0.4 for Manchester United to win, the odds from the model would have been 2.5. Since the probability from the model is higher than 0.2 and the difference between the model's odds and the bookies' odds (4.50 - 2.50 = 2.00) is more than 0.3, the bet meets the criteria.
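The criteria check in the example above can be written as a small function. This is a sketch under the assumption that a bet qualifies when the difference is at least the threshold and the probability is strictly above its threshold; the function and parameter names are hypothetical.

```python
def meets_criteria(bookie_odds, model_probability, min_difference, min_probability):
    """Check whether a single outcome qualifies for a (difference, probability) cell."""
    model_odds = 1.0 / model_probability
    difference = bookie_odds - model_odds  # positive when the model's odds are lower
    return difference >= min_difference and model_probability > min_probability


# Manchester United at home: bookies' odds 4.50, assumed model probability 0.4.
home_bet_qualifies = meets_criteria(
    4.50, 0.4, min_difference=0.3, min_probability=0.2
)
```

Here the model's odds are 2.5, the difference is 2.0, and the probability 0.4 exceeds 0.2, so `home_bet_qualifies` is `True`.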

Strategies do not have to be one single element from the matrix; a strategy can also be an individual row or column, multiple rows and columns, or the whole matrix. The idea is that the more fixtures used in the strategies, the more accurate the strategies will be.

Team     Bookies               Model                                      P   T
H   A    1      X      2       1             X             2
A   B    1.43   4.47   8.62    1.32 (76%)    5.56 (18%)    16.7 (6%)      1   1
C   D    2.39   3.29   3.14    10.0 (10%)    2.00 (50%)    2.50 (40%)     X   X
E   F    1.30   5.62   10.9    1.28 (78%)    7.69 (13%)    11.1 (9%)      1   X
G   H    4.54   3.50   1.84    5.00 (20%)    3.57 (28%)    1.92 (52%)     2   2


Table 2.1.2 describes what is needed to build a result matrix. The column Bookies gives the 3-way odds from the bookies for the two teams described in the column Team, where H is for Home and A is for Away. 1 means home win, X means draw, and 2 means away win. The column Model describes the output from the machine learning model, with both the odds and the probabilities. Column P gives the predicted 3-way result from the model. The last column, T, gives the true outcome of the fixture.

If we start with the first row, where team A plays at home against team B, the 3-way odds from the bookies are 1.43 for team A to win, 4.47 for a draw, and 8.62 for team B to win. The probabilities produced by the model are 76% for team A to win, 18% for a draw, and 6% for team B to win. Therefore, by dividing 1 by each probability expressed as a decimal, the 3-way odds from the model become 1.32 for team A to win, 5.56 for a draw, and 16.7 for team B to win. The lowest odds, or equivalently the highest probability, becomes the predicted value from the model. In this case it is a home win, as shown in column P, which was also the true outcome of the fixture. This means that if one had placed 10€ on team A to win against team B, the profit would have been 4.3€, which corresponds to a return of 43%. Let's put this into a result matrix as a decimal fraction.

Difference
0.0   0.43   0.43   0.43   0.43   0.43   0.43   0.43   0.43      0      0
0.3      0      0      0      0      0      0      0      0      0      0
0.6      0      0      0      0      0      0      0      0      0      0
0.9      0      0      0      0      0      0      0      0      0      0
1.2      0      0      0      0      0      0      0      0      0      0
1.5      0      0      0      0      0      0      0      0      0      0
1.8      0      0      0      0      0      0      0      0      0      0
2.1      0      0      0      0      0      0      0      0      0      0
2.4      0      0      0      0      0      0      0      0      0      0
2.7      0      0      0      0      0      0      0      0      0      0
       0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
                              Probability

Table 2.1.3 - Result matrix


The difference between the odds from the bookies and the model is 1.43 - 1.32 = 0.11, and the probability from the model is 76%, or 0.76 as a decimal. The return on the bet was 43%, or 0.43 as a decimal. The bet is entered in every cell whose difference threshold is at most 0.11 and whose probability threshold is below 0.76, that is, in row 0.0 and columns 0.0 to 0.7. Therefore, the matrix will now look like the matrix in Table 2.1.3.

The matrix in Table 2.1.3 only covers odds for which the model has a prediction. We could also create a matrix in which every odds from the model that is lower than the bookies' odds is entered. In this case, the odds for a home win are the only odds lower than the bookies', so that matrix would look the same as the matrix in Table 2.1.3.

The next fixture, between team C and team D, resulted in a draw, which was also predicted by the model. The odds for a draw from the bookies are 3.29 and the probability from the model is 50%, which becomes 2.00 in odds from the model. The difference in odds is therefore 3.29 - 2.00 = 1.29 and the return becomes 229%.

Difference
0.0   2.72   2.72   2.72   2.72   2.72   0.43   0.43   0.43      0      0
0.3   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
0.6   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
0.9   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
1.2   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
1.5      0      0      0      0      0      0      0      0      0      0
1.8      0      0      0      0      0      0      0      0      0      0
2.1      0      0      0      0      0      0      0      0      0      0
2.4      0      0      0      0      0      0      0      0      0      0
2.7      0      0      0      0      0      0      0      0      0      0
       0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
                              Probability


The result matrix will now look like the matrix in Table 2.1.4. Notice that the fixture does not land in the 0.5 probability column, because the probability from the model is equal to 0.5, not higher. The bookies' odds for an away win in this fixture are 3.14, while the away win odds from the model are 2.50. These odds do not correspond to the model's prediction, but they are still lower than the odds from the bookies. Therefore, we can update the matrix that covers all the odds from the model, with or without a prediction. The difference in odds between the bookies and the model is 3.14 - 2.50 = 0.64. Since this bet was a loss, the result becomes -1. For example, if one had placed 10€ on team D to win against team C, the result would have been -10€ because the fixture ended in a draw.

Difference
0.0   1.72   1.72   1.72   1.72   2.72   0.43   0.43   0.43      0      0
0.3   1.29   1.29   1.29   1.29   2.29      0      0      0      0      0
0.6   1.29   1.29   1.29   1.29   2.29      0      0      0      0      0
0.9   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
1.2   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
1.5      0      0      0      0      0      0      0      0      0      0
1.8      0      0      0      0      0      0      0      0      0      0
2.1      0      0      0      0      0      0      0      0      0      0
2.4      0      0      0      0      0      0      0      0      0      0
2.7      0      0      0      0      0      0      0      0      0      0
       0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
                              Probability


The result matrix for all odds will now look like the matrix in Table 2.1.5. The next fixture, between team E and team F, resulted in a draw, while the predicted outcome from the model was a home win for team E. This results in a loss of 1 for the model. The difference in odds between the bookies and the model is 1.30 - 1.28 = 0.02, and the probability from the model is 78%. Since the home win odds are the only odds lower than the odds from the bookies, we can now update both of our matrices.

Difference
0.0   1.72   1.72   1.72   1.72   1.72  -0.57  -0.57  -0.57      0      0
0.3   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
0.6   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
0.9   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
1.2   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
1.5      0      0      0      0      0      0      0      0      0      0
1.8      0      0      0      0      0      0      0      0      0      0
2.1      0      0      0      0      0      0      0      0      0      0
2.4      0      0      0      0      0      0      0      0      0      0
2.7      0      0      0      0      0      0      0      0      0      0
       0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
                              Probability


The last fixture from Table 2.1.2 is between team G and team H. Every odds from the model for this fixture is higher than the bookies. Therefore, there is no place for the fixture in the result matrices.
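The walkthrough above can be sketched in Python as follows. This is an illustration, not the project's own code: the function name is hypothetical, the thresholds are read off the axes of the result matrices, and a bet is assumed to enter a cell when its difference is at least the row threshold and its probability is strictly above the column threshold. The sketch uses unrounded model odds (1 divided by the probability), whereas the text rounds them to two decimals.

```python
DIFF_THRESHOLDS = [round(0.3 * i, 1) for i in range(10)]  # 0.0, 0.3, ..., 2.7
PROB_THRESHOLDS = [i / 10 for i in range(10)]             # 0.0, 0.1, ..., 0.9


def build_result_matrix(bets):
    """Accumulate returns per unit stake for each (difference, probability) cell.

    bets: iterable of (bookie_odds, model_probability, won) tuples.
    """
    matrix = [[0.0] * len(PROB_THRESHOLDS) for _ in DIFF_THRESHOLDS]
    for bookie_odds, probability, won in bets:
        difference = bookie_odds - 1.0 / probability
        if difference <= 0:
            continue  # the model's odds are not lower than the bookies' odds
        result = bookie_odds - 1.0 if won else -1.0
        for i, min_diff in enumerate(DIFF_THRESHOLDS):
            for j, min_prob in enumerate(PROB_THRESHOLDS):
                if difference >= min_diff and probability > min_prob:
                    matrix[i][j] += result
    return matrix


# Predicted outcomes from Table 2.1.2: (bookies' odds, model probability, won?)
predicted_bets = [
    (1.43, 0.76, True),   # A-B: home win predicted and correct
    (3.29, 0.50, True),   # C-D: draw predicted and correct
    (1.30, 0.78, False),  # E-F: home win predicted, fixture ended in a draw
    (1.84, 0.52, True),   # G-H: model odds are higher, so nothing is recorded
]
prediction_matrix = build_result_matrix(predicted_bets)

# A strategy can be a single cell, a column, or the whole matrix.
single_cell = prediction_matrix[0][0]
column_04 = sum(row[4] for row in prediction_matrix)
whole_matrix = sum(sum(row) for row in prediction_matrix)
```

Passing all model odds that are lower than the bookies' odds, instead of only the predicted outcomes, would build the second matrix that the text maintains in parallel.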

The result matrices for the fixtures in Table 2.1.2 are now done. We can now see how the model performed by comparing the odds from the model with the odds from the bookies, together with the probabilities. For example, what is the profit when betting 10€ on every odds that is lower than the bookies' and also has a prediction from the model? The answer is 17.2€. What is the profit when betting 10€ on every odds, with or without a prediction, that has a probability higher than 40% and where the odds from the model are lower than the bookies' by 0.3 or more? The answer is 22.9€. We could also build strategies. For example, what happens when placing a bet on every element in the column where the probability is higher than 0.4 in the matrix in Table 2.1.6?

Difference
0.0   0.72   0.72   0.72   0.72   1.72  -0.57  -0.57  -0.57      0      0
0.3   1.29   1.29   1.29   1.29   2.29      0      0      0      0      0
0.6   1.29   1.29   1.29   1.29   2.29      0      0      0      0      0
0.9   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
1.2   2.29   2.29   2.29   2.29   2.29      0      0      0      0      0
1.5      0      0      0      0      0      0      0      0      0      0
1.8      0      0      0      0      0      0      0      0      0      0
2.1      0      0      0      0      0      0      0      0      0      0
2.4      0      0      0      0      0      0      0      0      0      0
2.7      0      0      0      0      0      0      0      0      0      0
       0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
                              Probability


The sum of the column where the probability is higher than 0.4 is 10.88, which means that if one had placed 10€ on every bet, the profit would have been 108.8€. Table 2.1.8 shows how the different bets are placed. The fixture between team C and team D is bet on 5 times because the fixture appears in that column 5 times. We can also take the sum of the whole matrix in Table 2.1.6 and try that.

The sum of the whole matrix shown in Table 2.1.6 is 52.69. This means that if one had placed 10€ on every bet, the profit would have been 526.9€. How the bets are placed is shown in Table 2.1.9.

Fixture         Bets placed     Profit
Home   Away     1    X    2     1       X       2
A      B        1    0    0     0.43    0       0
C      D        0    5    0     0       11.45   0
E      F        1    0    0     -1      0       0
G      H        0    0    0     0       0       0

Table 2.1.8 - Bets placed

Fixture         Bets placed     Profit
Home   Away     1    X    2     1       X       2
A      B        8    0    0     3.44    0       0
C      D        0    25   0     0       57.25   0
E      F        8    0    0     -8      0       0
G      H        0    0    0     0       0       0


2.2 Reliability and Validity

A problem with the reliability of forecasting is that it is difficult (maybe even impossible) to test whether the results will hold in the future. Football is always evolving and the sports betting companies are working hard to increase their profits. The results produced in this report depend on the fixtures presented in Table 1.1 and the features from appendix A1 - A3.

Although the dataset contains many fixtures and features, there are companies like Wyscout and Opta that can provide real-time data including the players' x and y positions. Furthermore, the results in this report are true for the data used, but the question “Can you beat the bookies?” can only be answered for this period. The data was collected from whoscored.com and oddsportal.com. Whoscored provides information about the fixtures and oddsportal provides the corresponding odds. Whoscored gets its data from the company Opta [11], which is the world-leading company in sports data [12]. Oddsportal is a website that provides odds from different bookies. In this report, the odds are based on an average over the bookies listed on oddsportal. The average was used simply because it was easier to scrape from the website.


2.3 Data set and data processing

Data was collected from whoscored.com and oddsportal.com and stored in four different MySQL databases called fixtures, player_fixtures, fixture_formation, and odds, which are described in appendix A1 - A3. These fields were then combined in order to create the dataset used.

In total, the dataset contains 14046 different fixtures and 1639 different features on every fixture. The number of home wins accounts for 45% of the dataset, draws 26%, and away wins 29%.

Features       Type                Data
0              Categorical         League name
1              Categorical         Home team formation
2              Categorical         Away team formation
3 - 39         5 period average    Home defensive players
40 - 76        5 period average    Home offensive players
77 - 113       5 period average    Home goalkeeper
114 - 150      5 period average    Away defensive players
151 - 187      5 period average    Away offensive players
188 - 224      5 period average    Away goalkeeper
225 - 261      5 period average    Highest average of home players
262 - 298      5 period average    Highest average of away players
299 - 335      5 period average    Lowest average of home players
336 - 372      5 period average    Lowest average of away players
373 - 580      20 period average   Home team
581 - 788      20 period average   Away team
789 - 791      20 period average   Home win, draw, loss
792 - 794      20 period average   Away win, draw, loss
795 - 1002     10 period average   Home team
1003 - 1210    10 period average   Away team
1211 - 1213    10 period average   Home win, draw, loss
1214 - 1216    10 period average   Away win, draw, loss


The numerical features are used as averages over the latest games. Table 2.3 shows the structure of how the features are used. The different entries in the column Data are presented and explained below.

2.3.1 League name

A categorical value for the name of the league where the fixture is played.

2.3.2 Home team formation/Away Team formation

A categorical value that describes the line-up formation for the team, for example 4-4-2.

2.3.3 Home defensive players/Home offensive players/Home goalkeeper/Away defensive players/Away offensive players/Away goalkeeper

Position                Type
Goalkeeper              Goalkeeper
Defensive Center        Defensive
Defensive Left          Defensive
Defensive Right         Defensive
Defensive Mid Center    Defensive
Defensive Mid Left      Defensive
Defensive Mid Right     Defensive
Forward                 Offensive
Forward Left            Offensive
Forward Right           Offensive
Mid Center              Offensive
Mid Left                Offensive
Mid Right               Offensive
Attacking Mid Center    Offensive
Attacking Mid Left      Offensive
Attacking Mid Right     Offensive


Table 2.3.3 shows the different positions a player can have and the type each position belongs to. Each player is assigned a type based on their position. The average is then calculated per type for the different features in the table below. For example, team Y is playing team X at home and, for some reason, they decide to play with two defensive centers as their defence and nothing else. Defensive player A has 3 shots on goal over the latest 5 games, in the order 1-0-1-1-0 with the rightmost number as the most recent fixture. Defensive player B has 2 shots on goal over the latest 5 games, in the order 1-0-0-1-0 with the rightmost number as the most recent fixture. Therefore, in Table 2.3, where Data is “Home defensive players” and Type is “5 period average”, the feature “shots on goal” will be 1 ((2 + 3) / 5).
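The calculation in this example can be sketched as below. The function name and the input layout (one list of per-game values per player, most recent game last) are assumptions made for illustration; the sketch also assumes every player has at least 5 recorded games.

```python
def type_average(values_per_player, periods=5):
    """Sum a feature over all players of one type per game, then average over the last games."""
    last_games = [games[-periods:] for games in values_per_player]
    per_game_totals = [sum(values) for values in zip(*last_games)]
    return sum(per_game_totals) / periods


# Shots on goal for the two defensive players in the example, most recent game last.
player_a = [1, 0, 1, 1, 0]
player_b = [1, 0, 0, 1, 0]
home_defensive_shots_on_goal = type_average([player_a, player_b])
```

The per-game totals are 2, 0, 1, 2, 0, which gives the value 1 from the example.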

Feature               Explanation
Total shots           Total number of shots
Woodwork              Total shots that hit the wood of the goal
Shots on target       Shots that hit the goal
Shots off target      Shots that missed the goal
Shots blocked         Shots that were blocked
Touches               The number of touches on the ball
Pass success          Percentage of successful passes
Total passes          Total number of passes
Accurate passes       Number of successful passes
Key passes            Number of passes that lead to a goal chance
Dribbles won          Number of successful dribbles
Dribbles attempted    Number of dribbles attempted
Dribbled past         Number of times a player was dribbled past by an opponent
Offensive Aerials     Number of aerials when attacking
Defensive Aerials     Number of aerials when defending
Successful tackles    Number of successful tackles
Tackles attempted     Number of tackles attempted
Was dribbled          Number of tackles where the player got dribbled instead
Tackle success        Percentage of successful tackles
Clearances            Number of times the ball was kicked away from goal when defending
Interceptions         Number of times the ball was stopped without a tackle
Corners               Number of corners taken
Corners accuracy      Corners that reached a teammate
Dispossessed          Being tackled by an opponent without attempting to dribble past
Errors                Mistakes that lead to a goal or shot conceded
Fouls                 Tackles that lead to a free kick
Offsides              Number of offsides
Total saves           Number of saves
Collected             Save type for goalkeeper
Parried save          Save type for goalkeeper
Parried danger        Save type for goalkeeper
Goals conceded        Number of goals conceded
Goals scored          Number of goals scored


2.3.4 Highest average of home players/Highest average of away players/Lowest average of home players/Lowest average of away players

The lowest and highest from every feature presented in Table 2.3.3.1.

2.3.5 Home Team/Away Team

Consists of features on the team performance. Each feature presented in the table below is used both for the team and against the team. For example, when total shots is used as a 20-period average feature, the average total shots by the team over the 20 latest fixtures is used, and the average total shots conceded over the 20 latest fixtures is also used. The table below only presents the features for the team, but as stated before, every feature is used against the team as well. For example, team X is playing team Y at home. In the 5 latest fixtures team X has played, they have had 10 shots on goal in the order 2-3-1-2-2, with the rightmost number as the most recent fixture. Their opponents in those fixtures have had 20 shots on goal in the order 5-5-2-3-5. Therefore, in Table 2.3, where Data is “Home team” and Type is “5 period average”, the feature “shots on goal” will be 2 (10/5) and the feature “shots on goal conceded” will be 4 (20/5).
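A minimal sketch of the for and against averages in this example follows. The helper name and the dictionary keys are hypothetical; the values are listed with the most recent fixture last, as in the text.

```python
def trailing_average(values, periods):
    """Average of the latest `periods` values, with the most recent value last."""
    window = values[-periods:]
    return sum(window) / len(window)


# Team X's last five fixtures from the example.
shots_on_goal = [2, 3, 1, 2, 2]           # by team X
shots_on_goal_conceded = [5, 5, 2, 3, 5]  # by team X's opponents

home_team_features = {
    "shots_on_goal": trailing_average(shots_on_goal, 5),
    "shots_on_goal_conceded": trailing_average(shots_on_goal_conceded, 5),
}
```

The same helper would apply to the 10-period and 20-period windows by changing `periods`, provided only fixtures played before the one being predicted are passed in.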

Feature | Explanation
Total shots | Total number of shots
Woodwork | Total shots that hit the woodwork of the goal
Shots on target | Shots that hit the goal
Shots off target | Shots that missed the goal
Shots blocked | Shots that were blocked
Touches | The number of touches on the ball
Pass success | Percentage of successful passes
Total passes | Total number of passes
Accurate passes | Number of successful passes
Key passes | Number of passes that lead to a goal chance
Dribbles won | Number of successful dribbles
Dribble success | Percentage of successful dribbles
Aerials won | Number of aerial duels won
Aerials won % | Percentage of aerials won
Offensive Aerials | Number of aerials when attacking
Defensive Aerials | Number of aerials when defending
Successful tackles | Number of successful tackles
Tackles attempted | Number of tackles attempted
Was dribbled | Number of tackles where the player got dribbled past instead
Tackle success | Percentage of successful tackles
Clearances | Number of times the ball is kicked away from goal when defending
Interceptions | Number of times the ball is stopped without a tackle
Corners | Number of corners taken
Corners accuracy | Corners that reach a teammate
Dispossessed | Being tackled by an opponent without attempting to dribble past
Errors | Mistake that leads to a goal or shot conceded
Fouls | Tackles that lead to a free kick
Offsides | Number of offsides
Total saves | Number of saves
Collected | Save type
Parried save | Save type
Parried danger | Save type
Goals conceded | Number of goals conceded
Goals scored | Number of goals scored
6-yard box | Number of shots inside the 6-yard box
Penalty area | Number of shots inside the penalty area
Outside of box | Number of shots outside of the box
Open play | Number of shots created from an open play situation
Fastbreak | Number of shots created from a fast break situation
Set pieces | Number of shots created from a set piece situation
Penalty | Penalties for the team
Right foot | Shots with the right foot
Left foot | Shots with the left foot
Head | Shots with the head
Other body parts | Shots with other body parts
Cross | Passes that were crosses
Freekick | Passes from free kicks
Corners | Passes from corners
Through ball | Passes from through balls
Throw in | Passes from throw-ins
Long | Passes over 25 yards
Short | Passes under 25 yards
Chipped | Passes in the air
Ground | Passes on the ground
Head passes | Passes with the head

Left | Left passes
Right | Right passes
Defensive third | Passes in the defensive third
Mid third | Passes in the mid third
Final third | Passes in the final third
Clearances head | Clearances with head
Clearances feet | Clearances with feet
Blocked crosses | Crosses blocked
Turnover | Loss of possession
Possession | Amount of time a team possesses the ball

The features presented in Table 2.3.5 are the best-known features. Several other features are used as well, for example punches by the goalkeeper and shot positions on the goal. Appendix A1 - A3 show the labels of every feature from the MySQL database.

2.3.6 Home win, draw, loss/Away win, draw, loss

Average win, draw and loss for the team over the latest 20, 10 or 5 games.

2.4 Ethical Considerations

The data used contains only names of famous football players, such as Lionel Messi, Cristiano Ronaldo and Zlatan Ibrahimovic. I believe that there are no ethical considerations to take into account, because no sensitive information about these players has been used.

3 Theory

3.1 Machine learning

The main part of this project will be training machine learning models that try to predict the result of different football fixtures.

Mitchell summarised machine learning with the quote “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”.

In short, we learn from input data, and we expect the result to improve given more input data. In this case, our task (T) is predicting the results of different football fixtures, with data on the fixtures as experience (E). We expect the accuracy (P) on the fixtures to improve with the data (E) [13].

The accuracy in this project is the share of correct predictions on the fixtures. For example, if two fixtures were played, the first ending in a home win and the second in an away win, and our model predicted a home win for both, the accuracy on those fixtures would be 50%.

To visualise the accuracy, this project uses confusion matrices. The left axis shows the true outcome of the fixtures, while the bottom axis shows the predictions from the models.
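As an illustration, the two-fixture example can be written out in plain Python. This is a sketch and not the project's own code; the label names and the helper functions `accuracy` and `confusion_matrix` are chosen here for the example only.

```python
LABELS = ["home", "draw", "away"]


def accuracy(y_true, y_pred):
    """Share of fixtures where the prediction equals the true outcome."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true outcomes, columns are predicted outcomes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# First fixture ended in a home win, the second in an away win,
# and the model predicted a home win for both.
y_true = ["home", "away"]
y_pred = ["home", "home"]

acc = accuracy(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred)
print(acc)
for label, row in zip(LABELS, matrix):
    print(label, row)
```

Each row of the matrix sums to the number of fixtures with that true outcome, and the diagonal holds the correct predictions.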

3.2 Logistic Regression

In machine learning, logistic regression is a classification model used to predict whether something is true or false, despite having the word “regression” in its name. The theory behind logistic regression is similar to that of linear regression.

In logistic regression, the line fitting the data is not a straight line as in linear regression. Instead, it follows the S-shaped curve shown in Figure 3.1, called the sigmoid function.

The output from logistic regression is a value between 0 and 1. The sigmoid function is applied to X, a weighted sum of the input features, and returns a probability value between 0 and 1.

There are multiple ways to train a logistic regression model. One of the most common is to use an optimisation algorithm such as gradient descent to calculate the parameters (weights) of the model.
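The two paragraphs above can be sketched as follows. This is a minimal, pure-Python illustration of the sigmoid function and of batch gradient descent on the log-loss, not the implementation used in this project; the toy data, the learning rate and the number of epochs are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_proba(weights, bias, features):
    """Sigmoid of the weighted sum of the input features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(z)


def train(samples, labels, learning_rate=0.1, epochs=5000):
    """Fit weights and bias with batch gradient descent on the log-loss."""
    n_samples = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Toy data (made up): one feature, label 1 for the larger values.
samples = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
weights, bias = train(samples, labels)
low = predict_proba(weights, bias, [0.0])
high = predict_proba(weights, bias, [3.0])
print(low, high)
```

In practice a library implementation would normally be preferred, since it adds regularisation and a more robust optimiser.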

Since logistic regression is a binary classification model, techniques like one-vs-all or one-vs-one are used when facing a multi-class problem. One-vs-all trains one model for each class, while one-vs-one trains one model for each pair of classes [14].
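To make the difference concrete, the sketch below lists the binary problems each technique would create for the three-way result. The function names and the example scores are hypothetical and only serve as illustration.

```python
from itertools import combinations

CLASSES = ["home win", "draw", "away win"]


def one_vs_all_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, "rest") for c in classes]


def one_vs_one_tasks(classes):
    """One binary problem per pair of classes."""
    return list(combinations(classes, 2))


def one_vs_all_predict(scores):
    """Pick the class whose binary model gives the highest probability.

    `scores` maps each class to the output of its own binary model.
    """
    return max(scores, key=scores.get)


ova = one_vs_all_tasks(CLASSES)
ovo = one_vs_one_tasks(CLASSES)
print(len(ova), len(ovo))

# Made-up model outputs for a single fixture.
prediction = one_vs_all_predict({"home win": 0.48, "draw": 0.27, "away win": 0.35})
print(prediction)
```

With three classes both techniques happen to need three models. For k classes, one-vs-all needs k models and one-vs-one needs k(k-1)/2.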

3.3 Random Forest

Random forest is a machine learning model that consists of multiple random decision trees. A decision tree is a machine learning model that recursively splits the data into a tree of decisions, choosing the best condition at each split.

In a random forest, each tree is built on a random sample, and at each node a subset of features is randomly selected to generate the best split. To generate a split, the tree uses either Gini impurity or entropy as its criterion. Decision trees are commonly known to overfit the data, but by combining many trees in a random forest and not letting each tree grow too big, the problem can be reduced [15][16].
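The two split criteria mentioned above can be written down directly. The sketch below uses the standard definitions of Gini impurity and entropy; the helper `split_impurity` and the example labels are assumptions for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def split_impurity(left, right, criterion=gini):
    """Size-weighted impurity of the two child nodes after a split."""
    n = len(left) + len(right)
    return (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)


mixed = ["home", "away"]
pure = ["home", "home"]
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
print(split_impurity(["home", "home"], ["away", "away"]))
```

A tree prefers the split with the lowest weighted impurity, so a split that separates the outcomes perfectly scores 0.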

3.4 Neural Network

The human brain has been called one of the most complicated things in science and the universe. In artificial intelligence, researchers and scientists took inspiration from this existing system, the human brain.

A neural network (NN) is a set of layers of connected neurons that together build up the model's understanding of the data. Modelled after the human brain, NNs have the goal of letting machines mimic the human brain [17].
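A forward pass through such a network can be sketched in plain Python. The layer sizes, the weights, and the choice of ReLU in the hidden layer and softmax in the output layer are assumptions for this example; the text does not state which activations the project's network uses.

```python
import math


def relu(values):
    """Hidden-layer activation: negative values are set to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One layer: every neuron takes a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Pass the inputs through all layers; the last layer uses softmax."""
    activations = inputs
    for i, (weights, biases) in enumerate(layers):
        z = dense(activations, weights, biases)
        is_output = i == len(layers) - 1
        activations = softmax(z) if is_output else relu(z)
    return activations


# Made-up network: 2 inputs -> 2 hidden neurons -> 3 outputs
# (home win, draw, away win).
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
]
probabilities = forward([1.0, 0.0], layers)
print(probabilities)
```

Training would adjust the weights and biases so that the highest probability lands on the true outcome as often as possible.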

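[Figure: example decision tree for Section 3.3. The root node asks “Is it raining?”, the two child nodes ask “Is it windy?” and “Is it snowy?”, and every node has a yes and a no branch.]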

3.5 Hyperparameters

All of the models above have different types of hyperparameters. A hyperparameter is a parameter that is set before the model starts its training. For example, in the random forest, how many trees should we use?

To decide which hyperparameters to use, the model is trained with different hyperparameters and evaluated against the validation set to determine which combination performs best. This is called hyperparameter tuning. The hyperparameter tuning for each model is shown in appendix A4 - A6.

The number of possible hyperparameter combinations for each model is too large to test exhaustively. Therefore, when the model was built, the hyperparameters were tested one at a time. For example, in the neural network, the best optimizer was selected first, then the best learning rate, and so on. To determine which hyperparameter value performed best, the accuracy on the validation set was used. The value with the highest accuracy was chosen.
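The one-at-a-time procedure described above could look roughly like the sketch below. The function `tune_one_at_a_time`, the search space, and the scoring function `toy_validation_accuracy` are hypothetical; in the project the score would come from training the model and measuring its accuracy on the validation set.

```python
def tune_one_at_a_time(search_space, defaults, evaluate):
    """Tune each hyperparameter in turn, keeping earlier choices fixed.

    `evaluate` takes a dict of hyperparameters and returns the
    validation accuracy of a model trained with them.
    """
    best = dict(defaults)
    for name, candidates in search_space.items():
        scores = {}
        for value in candidates:
            trial = {**best, name: value}
            scores[value] = evaluate(trial)
        # Keep the value with the highest validation accuracy.
        best[name] = max(scores, key=scores.get)
    return best


def toy_validation_accuracy(params):
    """Stand-in for training and validating a model (made-up scores)."""
    score = 0.50
    score += 0.02 if params["optimizer"] == "adam" else 0.0
    score -= abs(params["learning_rate"] - 0.01)
    score -= 0.01 * abs(params["hidden_layers"] - 2)
    return score


search_space = {
    "optimizer": ["sgd", "adam"],
    "learning_rate": [0.1, 0.01, 0.001],
    "hidden_layers": [1, 2, 3],
}
defaults = {"optimizer": "sgd", "learning_rate": 0.1, "hidden_layers": 1}

best_params = tune_one_at_a_time(search_space, defaults, toy_validation_accuracy)
print(best_params)
```

Here only 2 + 3 + 3 = 8 models are evaluated instead of the 2 * 3 * 3 = 18 a full grid would need. The trade-off is that interactions between hyperparameters can be missed.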

4 Implementation

4.1 Scripts

4.2.1 whoscored/WSScraper.py

The script WSScraper.py is used to scrape the website whoscored.com for team data on the fixtures presented in Table 1. Since whoscored is built with JavaScript, fetching the static page is not enough to collect the data. Instead, this script uses a Chrome driver together with the Python package Selenium, which works like a regular browser. The Chrome driver also fakes its user to protect the script from being banned from the website. The scraped data is uploaded to a MySQL database in real time.

[Figure: neural network for Section 3.4, with an input layer, hidden layers and an output layer.]

4.2.3 whoscored/player_strat/PlayerScraper.py

To get individual player stats, the script PlayerScraper.py is used. It works in the same way as WSScraper.py, but instead of scraping team stats it focuses on individual players. This script has a main controller that runs the browsers in parallel using the Python package joblib.

4.2.4 oddsportal/main.py

Scraping the website oddsportal.com for odds is done in the same way as scraping whoscored, with a Chrome driver. The website oddsportal is, however, much easier to scrape than whoscored: all odds are on the same page and the website does not ban anyone. Since the odds are on a different website, the script also searches the database created from whoscored for the correct fixture, so that the corresponding odds are matched against the correct fixture.
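One possible way to pair odds with fixtures is to build a lookup key from the team names and the date. The sketch below is an assumption about how such matching could work, not a description of main.py. The field names `home_name`, `away_name` and `fixture_id` follow Appendix A.1, while `date`, the odds-row fields and the name normalisation are hypothetical.

```python
from datetime import date


def normalise(name):
    """Lower-case a team name and drop dots and extra spaces."""
    return " ".join(name.lower().replace(".", "").split())


def fixture_key(home, away, played_on):
    """Key that should be equal for the same fixture on both websites."""
    return (normalise(home), normalise(away), played_on)


def match_odds(fixtures, odds_rows):
    """Return a dict from fixture_id to the odds found for that fixture."""
    index = {
        fixture_key(f["home_name"], f["away_name"], f["date"]): f["fixture_id"]
        for f in fixtures
    }
    matched = {}
    for row in odds_rows:
        key = fixture_key(row["home"], row["away"], row["date"])
        if key in index:
            matched[index[key]] = row["odds"]
    return matched


# Made-up rows for illustration.
fixtures = [
    {"fixture_id": 1, "home_name": "Sheffield Utd", "away_name": "Preston",
     "date": date(2019, 9, 1)},
]
odds_rows = [
    {"home": "sheffield utd.", "away": "Preston", "date": date(2019, 9, 1),
     "odds": {"home": 2.08, "draw": 3.4, "away": 4.53}},
    {"home": "Stoke", "away": "Swansea", "date": date(2019, 9, 1),
     "odds": {"home": 2.74, "draw": 3.2, "away": 2.99}},
]
matched = match_odds(fixtures, odds_rows)
print(matched)
```

Odds rows without a matching fixture are simply skipped. Real team names differ more between websites than this normalisation handles, so a mapping table would likely be needed as well.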

4.2.5 get_data_players.py

Creating the dataset is done in the script get_data_players.py. The script fetches data from the MySQL databases, pairs the correct fixtures with the correct players and calculates the different features used in the dataset.

4.2.6 ml/

In the folder ml there is one script for each of the models, handling the training, testing and visualisation of the results.

5 Results

5.1 Bookies

Figure 5.1 shows the accuracy for the bookies. The lowest odds are taken as the bookies' prediction, since the lowest odds correspond to the highest probability. The bookies used are 10Bet, 188BET, 1xBet, 888sport, Asianodds, bet-at-home, bet365, Betago, Betclic, Betfair, Betfred, BetJoe, Betsafe, Betsson, BetVictor, Betway, BoyleSports, bwin, ComeOn, Coolbet, Dafabet, Expekt, GGBET, Intertops, Interwetten, Jetbull, Leonbets, Marathonbet, NordicBet, Oddsring, Pinnacle, SBOBET, STS.pl, Unibet, William Hill and youwin. These bookies can differ in odds, but by how much lies beyond the scope of this project.
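The relation between odds and probability can be shown with a short sketch. Decimal odds are assumed, the odds in the example are made up, and rescaling the probabilities so that they sum to 1 (removing the bookie's margin) is an extra step added here for illustration.

```python
def implied_probabilities(odds):
    """Convert decimal odds to probabilities that sum to 1.

    1 / odds gives the raw implied probability. The raw values sum
    to more than 1 because of the bookie's margin, so they are
    rescaled (an assumption made for this example).
    """
    raw = {outcome: 1.0 / price for outcome, price in odds.items()}
    total = sum(raw.values())
    return {outcome: p / total for outcome, p in raw.items()}


def bookie_prediction(odds):
    """The outcome with the lowest odds is taken as the prediction."""
    return min(odds, key=odds.get)


# Made-up decimal odds for one fixture.
odds = {"home": 2.0, "draw": 3.5, "away": 4.0}
probabilities = implied_probabilities(odds)
prediction = bookie_prediction(odds)
print(prediction, probabilities)
```

The outcome with the lowest odds is always the one with the highest implied probability, with or without the rescaling.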

The accuracy for the bookies on the validation-set is 54.25% and the accuracy on the test-set is 50.21%. In total, there are only 3 predictions of draws and none of them is correct.

Figure 5.1 - Confusion matrices on the bookies result. Left matrix: validation-set, Right matrix: test-set

5.2 Logistic Regression

Figure 5.2 shows the predictions of the logistic regression model. The accuracy on the validation-set is 54.00% and the accuracy on the test-set is 49.32%. Unlike the bookies' confusion matrices in Figure 5.1, the model has some correct predictions for draws, but most of the predictions are still for the home team.

Figure 5.2 - Confusion matrices with Logistic Regression. Left matrix: validation-set, Right matrix: test-set

Figure 5.3 and Figure 5.4 show the two different types of result matrices. In the left matrix in Figure 5.3, which is based on the validation data, most of the rows and columns are positive. The more orange the colour of an element is, the higher the profit.

Figure 5.3 - Result matrices with Logistic Regression on every odds lower than the bookies and with a prediction from the model. Left matrix: validation-set, Right matrix: test-set.

Figure 5.4 - Result matrices with Logistic Regression on every odds, with or without prediction. Left matrix: validation-set, Right matrix: test-set.

5.3 Random Forest

Figure 5.5 shows the predictions for the random forest model. The accuracy on the validation-set is 52.58% and the accuracy for the test-set is 49.29%.

Figure 5.5 - Confusion matrices with Random Forest. Left matrix: validation-set, Right matrix: test-set

Figure 5.6 - Result matrices with Random Forest on every odds lower than the bookies and with a prediction from the model. Left matrix: validation-set, Right matrix: test-set.

The result matrices in Figure 5.6 and Figure 5.7 for the random forest model have a lot of negative rows and columns, even on the validation-set. The random forest was also the model that performed worst in terms of accuracy on both the validation-set and the test-set.

5.4 Neural Network

Neural Network was the model that performed best on both the validation-set and the test-set. Figure 5.8 shows the predictions of the neural network. The accuracy on the validation-set is 54.04% and the accuracy on the test-set is 49.89%. Although it has the highest accuracy among the models, it makes no predictions for draws.

Figure 5.8 - Confusion matrices with Neural Network. Left matrix: validation-set, Right matrix: test-set

Figure 5.9 - Result matrices with Neural Network on every odds lower than the bookies and with a prediction from the model. Left matrix: validation-set, Right matrix: test-set.

Figure 5.9 and Figure 5.9.1 show the results for the neural network. In both matrices in Figure 5.9, most of the rows and columns are positive, which means that the model performed well on both the validation-set and the test-set.

Figure 5.9.1 - Result matrices with Neural Network on every odds, with or without prediction. Left matrix: validation-set, Right matrix: test-set.

6 Analysis

Neural Network is the model that performs best on both the validation dataset and the test dataset, since it has the highest accuracy. Although no model has better accuracy than the bookies, the result matrices created by the neural network show that the model's predicted odds yield a positive return on both the validation and test datasets.

When summing up the whole left matrix in Figure 5.9, the result becomes 2320.43. This means that some fixtures are used more than others, as explained in 2.1.1.

As stated before, this is a strategy that we believe would work on our test data. The right matrix in Figure 5.9 shows the same strategy on our test data; when summing up the matrix, the result becomes 518.47. That means that if one had bet 10€ on every bet according to this strategy, the profit would have been 5184.7€.
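The arithmetic behind the 5184.7€ figure can be sketched as follows, assuming that the result matrices hold the profit per unit staked and that odds are decimal. The helper functions and the two example bets are made up for illustration; only the sum 518.47 is taken from the text.

```python
def bet_profit(odds, won, stake=1.0):
    """Profit of a single bet at decimal odds.

    A winning bet returns stake * odds, so the profit is
    stake * (odds - 1). A losing bet loses the stake.
    """
    return stake * (odds - 1.0) if won else -stake


def total_profit(bets, stake=1.0):
    """Sum the profit over (odds, won) pairs with the same stake on each."""
    return sum(bet_profit(odds, won, stake) for odds, won in bets)


# Two made-up bets with a 10€ stake: one won at 2.5, one lost at 3.0.
example = total_profit([(2.5, True), (3.0, False)], stake=10.0)
print(example)

# Sum of the right matrix in Figure 5.9, taken as profit per unit staked.
unit_profit_test_set = 518.47
profit_with_10_euro_stake = unit_profit_test_set * 10
print(round(profit_with_10_euro_stake, 2))
```

Because the profit scales linearly with the stake, the unit profit only has to be multiplied by the stake used on every bet.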

Random forest and logistic regression both fail to perform as well on the test data as on the validation data. In both Figure 5.3 and Figure 5.6, the sum of the validation matrix is positive. But when applying the same strategy to the test data, the result becomes negative.

Figure 6.1 shows the result for a home win and an away win with a neural network.

Figure 6.1 - Result matrices on test dataset with Neural Network on every odds lower than the bookies and with a prediction from the model. Left matrix shows result on home win and right matrix shows result on away win.

Table 6.2 and Table 6.3 show the most common home and away teams from the home win result matrix in Figure 6.1. This can be viewed as the model's way of saying which teams it thinks are underrated when playing at home and overrated when playing away. The most underrated team according to this model is Birmingham when playing at home, while the most overrated team is Preston when playing away.

Home team | Occurrence | Average home odds | Average away odds
Birmingham | 20 | 2.46 | 3.25
Sheffield Utd | 19 | 2.08 | 4.53
Swansea | 19 | 2.24 | 3.58
Millwall | 19 | 2.55 | 3.24
Nice | 17 | 2.2 | 3.99
Napoli | 16 | 1.82 | 6.41

Table 6.2 - Most common home teams from left matrix in Figure 6.1

Away team | Occurrence | Average home odds | Average away odds
Preston | 26 | 2.5 | 3.2
Getafe | 21 | 2.28 | 4.64
Stoke | 19 | 2.74 | 2.99
Middlesbrough | 17 | 2.64 | 3.12
Saint-Etienne | 17 | 2.39 | 3.47
Blackburn | 16 | 2.39 | 3.34

Table 6.3 - Most common away teams from left matrix in Figure 6.1

7 Discussion

The expected result from this project was to find strategies that could beat the sports betting companies. Logistic regression and random forest fail to yield a positive return on both the validation set and the test set, while the neural network returns a positive profit on both. This shows that it is hard to predict football fixtures, even for big companies like the sports betting companies, since their accuracy does not differ much from the models' accuracy. But with a large dataset and a strong neural network, it is possible.

Sumpter mentioned that it pays off to back the favourites in the Premier League. Table 6.2 shows which teams the model thinks are underrated when playing at home. Four out of six teams are English and the average odds are in their favour. Table 6.3 shows which teams the model thinks are overrated when playing away, and that table also has four out of six teams from England. The odds are in this case not in their favour, which matches what Sumpter was mentioning.

8 Conclusion

The approach taken in this report can be applied to any other team sport, not just football. The project is also relevant for football in itself, without the betting. Coaches and football clubs can develop their own models and tweak their input data to get a higher probability of winning the next fixture.

The results in this report are good, considering that the bookies are multimillion companies. But to get better results, the approach should have focused only on the neural network and on collecting more data, since the neural network was the model that performed best and also has the largest number of hyperparameters.

8.1 Future work

The dataset used in this report is free, if you are willing to build a web scraper to collect the data. I would like to investigate the possible results with data from either optasports.com or whyscout.com. If the results are this good with the data available online, how would the results look with the amount of detailed data that these providers can offer?

When I built the web scraper for oddsportal.com, I noticed that there was a big difference between the odds provided by the different bookies. Today, most of the bookies provide live betting with odds that change in real time. I started wondering if it is possible to set up different browsers that monitor these different odds in real time to find arbitrage betting strategies.
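As a starting point for that idea, an arbitrage check on a single fixture could look like the sketch below. It uses the standard condition that the inverse of the best available decimal odds must sum to less than 1. The bookie names and odds are made up, and transaction limits, fees and odds that move before all bets are placed are ignored.

```python
def best_odds(odds_by_bookie):
    """For every outcome, keep the bookie offering the highest odds."""
    best = {}
    for bookie, odds in odds_by_bookie.items():
        for outcome, price in odds.items():
            if outcome not in best or price > best[outcome][1]:
                best[outcome] = (bookie, price)
    return best


def arbitrage_margin(best):
    """Sum of 1 / odds over all outcomes; below 1 means arbitrage exists."""
    return sum(1.0 / price for _, price in best.values())


def arbitrage_stakes(best, bankroll=100.0):
    """Split the bankroll so that every outcome pays out the same amount."""
    margin = arbitrage_margin(best)
    return {outcome: bankroll * (1.0 / price) / margin
            for outcome, (_, price) in best.items()}


# Made-up odds from two hypothetical bookies.
odds_by_bookie = {
    "bookie_a": {"home": 2.2, "draw": 3.2, "away": 3.4},
    "bookie_b": {"home": 1.9, "draw": 3.9, "away": 4.3},
}
best = best_odds(odds_by_bookie)
margin = arbitrage_margin(best)
stakes = arbitrage_stakes(best, bankroll=100.0)
payouts = {outcome: stakes[outcome] * best[outcome][1] for outcome in stakes}
print(margin, stakes, payouts)
```

With these numbers the margin is below 1, so each outcome pays back more than the 100 staked in total, whichever result occurs.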

References

[1] A. Lindberg, “Spelreklam ökar kraftigt efter nya lagen” [Online]. Available: https://www.dn.se/ekonomi/spelreklamen-okar-kraftigt-efter-nya-lagen/

[Accessed: 2020-04-17]

[2] T. Blixt, “Kampen om miljarderna - vi har kartlagt Sveriges nya spelbolag” [Online].

Available: https://www.breakit.se/artikel/17475/kampen-om-miljarderna-vi-har-kartlagt-sveriges-nya-spelbolag

[Accessed: 2020-04-17]

[3] Spelinspektionen, “Två av tre svenskar spelar om pengar” [Online]. Available: https://www.spelinspektionen.se/press/nyhetsarkiv/allmanheten-om-spel-2018-tva-av-tre-svenskar-spelar-om-pengar/

[Accessed: 2020-04-17]

[4] Tse, S., Dyall, L., Clarke, D. et al. Why People Gamble: A Qualitative Study of Four New Zealand Ethnic Groups. Int J Ment Health

Addiction 10, 849–861 (2012). https://doi.org/10.1007/s11469-012-9380-7

[5] D.Sumpter, “If you had followed the betting advice in Soccermatics you would now be very rich” [Online].

Available: https://medium.com/@Soccermatics/if-you-had-followed-the-bet-ting-advice-in-soccermatics-you-would-now-be-very-rich-1f643a4f5a23

[Accessed: 2020-05-10]

[6] Stübinger J., Knoll J. (2018) Beat the Bookmaker – Winning Football Bets with Machine Learning (Best Application Paper). In: Bramer M., Petridis M. (eds) Artificial Intelligence XXXV. SGAI 2018. Lecture Notes in Computer Science, vol 11311. Springer, Cham

[7] N.Tax & Y.Joustra, “Predicting The Dutch Football Competition Using Public Data: A Machine Learning Approach” [Online].

Available: https://www.researchgate.net/profile/Niek_Tax/publication/282026611_Predicting_The_Dutch_Football_Competition_Using_Public_-Data_A_Machine_Learning_Approach/links/5601a25108aeb30ba7355371/Predicting-The-Dutch-Football-Competition-Using-Public-Data-A-Machine-

[8] L.Engholm, “Osannolika succén – halva jordens befolkning tittade på fotbolls-VM” [Online].

Available: https://www.svt.se/sport/fotboll/halva-planeten-tittade-pa-fotbolls-vm

[Accessed: 2020-05-03]

[9] M.Kunz, “256 million playing football” [Online].

Available: https://www.fifa.com/mm/document/fifafacts/bcoffsurv/ema-ga_9384_10704.pdf

[Accessed: 2020-05-03]

[10] B.Mundy, “Premier League fan wins bet after Luis Suarez bite” [Online].

Available: http://www.bbc.co.uk/newsbeat/article/28025917/premier-league-fan-wins-bet-after-luis-suarez-bite

[Accessed: 2020-05-04]

[11] Whoscored, “Who are we?” [Online]. Available: https://www.whoscored.com/AboutUs

[Accessed: 2020-05-11]

[12] Optasports, “World leaders in sports data” [Online]. Available: https://www.optasports.com/

[Accessed: 2020-05-11]

[13] T.Mitchell, “Machine learning” [Online].

Available: http://profsite.um.ac.ir/~monsefi/machine-learning/pdf/Machine-Learning-Tom-Mitchell.pdf

[Accessed: 2020-05-12]

[14] J.Zornoza, “Logistic Regression Explained” [Online].

Available: https://towardsdatascience.com/logistic-regression-explained-9ee73cede081

[15] P.Gupta, “Decision Trees in Machine Learning” [Online].

Available: https://towardsdatascience.com/decision-trees-in-machine-learn-ing-641b9c4e8052

[Accessed: 2020-05-13]

[16] H.Deng, “An Introduction to Random Forest” [Online].

Available: https://towardsdatascience.com/random-forest-3a55c3aca46d

[Accessed: 2020-05-13]

[17] P.Canuma, “Understanding Neural Network” [Online].

Available: https://towardsdatascience.com/understanding-neural-networks-22b29755abd9

A Appendix 1

A.1 fixtures database

fixture_id league_name season home_name away_name home_country away_country home_id away_id date_time home_shots away_shots home_shots_on_target away_shots_on_target home_woodwork away_woodwork home_shots_off_target away_shots_off_target home_shots_blocked away_shots_blocked home_6_yard_box away_6_yard_box home_penalty_area away_penalty_area home_outside_of_box away_outside_of_box home_right_foot away_right_foot home_left_foot away_left_foot home_head away_head home_other_body_parts away_other_body_parts home_ontargetlowcentre away_ontargetlowcentre home_ontargetlowleft away_ontargetlowleft home_ontargetlowright away_ontargetlowright home_ontargethighcentre away_ontargethighcentre home_ontargethighleft away_ontargethighleft home_ontargethighright away_ontargethighright home_misshighcentre away_misshighcentre home_misshighleft away_misshighleft home_misshighright away_misshighright home_postright away_postright home_postleft away_postleft home_postcentre away_postcentre home_missleft away_missleft home_missright away_missright home_aerial_duel_success away_aerial_duel_success home_aerials_won away_aerials_won home_offensive_aerials away_offensive_aerials home_defensive_aerials away_defensive_aerials home_successful_tackles away_successful_tackles home_tackles_attempted away_tackles_attempted home_was_dribbled away_was_dribbled

home_tackle_success away_tackle_success home_interceptions away_interceptions home_clearances away_clearances home_head_clearances away_head_clearances home_feet_clearances away_feet_clearances home_saves away_saves home_claims away_claims home_blocked_shots away_blocked_shots home_blocked_crosses away_blocked_crosses home_pass_success away_pass_success home_total_passes away_total_passes home_crosses away_crosses home_through_balls away_through_balls home_long_passes away_long_passes home_short_passes away_short_passes home_average_pass_streak away_average_pass_streak home_accurate_passes away_accurate_passes home_key_passes away_key_passes home_freekick_passes away_freekick_passes home_throw_in away_ground_passes home_head_passes away_head_passes home_feet_passes away_feet_passes home_forward_passes away_forward_passes home_backward_passes away_backward_passes home_left_passes away_left_passes home_right_passes away_right_passes home_defensive_third_passes away_defensive_third_passes home_mid_third_passes away_mid_third_passes home_final_third_passes away_final_third_passes home_dribbles_won away_dribbles_won home_dribbles_attempted away_dribbles_attempted home_dribbles_past away_dribbles_past home_dribbles_success away_dribbles_success home_unsuccessful_dribbles away_unsuccessful_dribbles home_total_attempts away_total_attempts home_open_play away_open_play home_set_piece away_set_piece home_counter_attack away_counter_attack home_penalty away_penalty

(47)

home_conversion_rate away_conversion_rate home_ontargetlowcentre_goals away_ontargetlowcentre_goals home_ontargetlowleft_goals away_ontargetlowleft_goals home_ontargetlowright_goals away_ontargetlowright_goals home_ontargethighcentre_goals away_ontargethighcentre_goals home_ontargethighleft_goals away_ontargethighleft_goals home_ontargethighright_goals away_ontargethighright_goals home_postcentre_goals away_postcentre_goals home_postleft_goals away_postleft_goals home_postright_goals away_postright_goals home_cards away_cards home_fouls_cards away_fouls_cards home_unprofessional_cards away_unprofessional_cards home_dive_cards away_dive_cards home_other_cards away_other_cards home_possession away_possession home_touches away_touches home_corners away_corners home_corners_accuracy away_corners_accuracy home_dispossessed away_dispossessed home_errors away_errors home_fouls away_fouls home_offsides away_offsides home_loss_of_possession away_loss_of_possession home_turnover away_turnover home_lead_to_shot away_lead_to_shot home_lead_to_goal away_lead_to_goal home_punches away_punches

A.2 player_fixtures database

fixture_id player_id team_id position player_name total_shots woodwork shots_on_target shots_off_target shots_blocked touches pass_success total_passes accurate_passes key_passes dribbles_won dribles_attemped dribbled_past dribble_success aerials_won aerials_success offensive_aerials defensive_aerials successful_tackles tackles_attemped was_dribbled tackle_success clearances interceptions corners corners_accuracy dispossessed errors fouls offsides total_saves collected

A.3 fixture_formations

fixture_id home_name away_name home_id away_id home_formation away_formation

A.6 Hyperparameter selection for Neural Network

A6.1 Optimizer

A6.3 Neurons

A6.4 Hidden layers
