
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Comparison of supervised machine learning models for predicting TV-ratings

Jämförelse av modeller som utnyttjar övervakad maskininlärning för att förutsäga tittarsiffror

SEBASTIAN ELF

CHRISTOPHER ÖQVIST

KTH

SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH


Comparison of supervised machine learning models for predicting TV-ratings

Jämförelse av modeller som utnyttjar övervakad maskininlärning för att förutsäga tittarsiffror

Sebastian Elf

Christopher Öqvist

Degree Project in Computer Engineering
First level, 15 ECTS

Supervisor at KTH: Anders Lindström
Examiner: Ibrahim Orhan

TRITA-CBH-GRU-2020:056

Royal Institute of Technology
School of Engineering Sciences in Chemistry, Biotechnology and Health
131 52 Huddinge, Sweden


Abstract

Manual prediction of TV-ratings for use in program and advertisement placement can be costly if the predictions are wrong, as well as time-consuming. This thesis evaluates different supervised machine learning models to see if the process of predicting TV-ratings can be automated with better accuracy than the manual process. The results show that of the two tested supervised machine learning models, Random Forest and Support Vector Regression, Random Forest was the better model. Random Forest was better on both measurements, mean absolute error and root mean squared error, used to compare the models. The conclusion is that Random Forest, evaluated with the dataset and methods used, is not accurate enough to replace the manual process. Even so, it could still potentially be used as part of the manual process to ease the workload of the employees.

Keywords

Machine learning, supervised learning, TV-rating, Support Vector Regression, Random Forest.


Summary

Manually predicting TV-ratings for program and advertisement placement can be costly and time-consuming if the predictions are wrong. This report evaluates models that use supervised machine learning to see whether the process of predicting TV-ratings can be automated with better accuracy than the manual process.

The results show that of the two tested supervised machine learning models, Random Forest and Support Vector Regression, Random Forest was the better model. Random Forest was better on both of the metrics, mean absolute error and root mean squared error, used to compare the models. The conclusion is that Random Forest, evaluated with the data and methods used, is not accurate enough to replace the manual process. Even so, it could still potentially be used as part of the manual process to ease the employees' workload.

Keywords

Machine learning, supervised learning, TV-ratings, Support Vector Regression, Random Forest.


Acknowledgments

This thesis would not have been possible without the help of several individuals and companies. The authors would therefore like to thank and show their gratitude:

To June AB for the opportunity to work on this bachelor thesis, especially Johan Middlemiss LeMon and Katarina Svensson for their assistance.

To our supervisor Anders Lindström at KTH CBH for his feedback on the thesis and his involvement.

To Mediamätning i Skandinavien AB, for supplying the data, especially Mats Lindkvist for his cooperation.

To TV4 AB for supplying statistics and explanation on how they predict TV-ratings, especially Alexander Ingeby for his cooperation.


Table of Contents

1 Introduction
1.1 Problem statement
1.2 Project goals
1.3 Scope of the project and delimitations
2 Theory and related works
2.1 Machine learning
2.2 Related works
2.3 Models using supervised machine learning
3 Method
3.1 Chosen models
3.2 Data
3.3 Evaluation of the results
3.4 Tools and framework
4 Results
5 Analysis and discussion
5.1 Analysis of results
5.2 Discussion of methods
5.3 Economical and ethical impact
6 Conclusions
6.1 Future work
References
Image sources
Appendix


1 Introduction

This chapter presents the problem statement, goals, and scope of this project.

1.1 Problem statement

At the time of writing, TV-ratings are often predicted manually by a team in a prediction department at a TV-channel. These predictions are the basis for where advertisements and programs should be placed in the daily TV-schedule. No matter how much experience the members of this department have, mistakes in their work can occur. Workers can place advertisements in the wrong time slots when constructing TV-guides, for example because of a miscalculation of the demographic watching or of the TV-ratings of a time slot. This can lead to the wrong price for the advertisements or a breach of contract, which in turn might lead to financial loss and a decrease in viewership. An automated process could potentially help solve these problems.

If the process of deciding when an advertisement should be shown could be automated with better accuracy than the manual process, the problem of the wrong advertisement in the wrong time slot could be avoided. This could potentially increase the financial gain for a TV-channel by reducing personnel costs, setting the right price for an advertisement, and showing more appropriate advertisements to the targeted viewers.

1.2 Project goals

The overall goal of this thesis is to develop a prediction module for TV-ratings, which utilizes supervised machine learning to help automate the process of assigning advertisements to the most appropriate time slot in the daily TV-guide.

More precisely, this project will

• evaluate the prediction accuracy of two or more machine learning models to find the best model for predicting TV-ratings,

• based on the evaluation and prior related works that have been done with the models, choose the best model,

• train and calibrate the best model to reach a prediction accuracy error of around 1.5%, which is TV4's average error in their prediction of TV-ratings at the time of writing.

1.3 Scope of the project and delimitations

This thesis only takes supervised machine learning models into consideration when using and testing the different prediction models to predict TV-ratings. More specifically, only supervised learning models are considered in related works, testing, and implementation. Because of limited time, only a few supervised machine learning models will be tested.


2 Theory and related works

This chapter will cover related works done in the field regarding predictions using machine learning and the theories behind it.

2.1 Machine learning

There are four types of machine learning techniques: supervised machine learning, unsupervised machine learning, semi-supervised machine learning and reinforcement machine learning [1].

Supervised machine learning is a technique that uses the knowledge gained from previous and current data to make its predictions. It therefore requires humans to handle both the input and the output, as well as to give feedback on the accuracy of the predictions.

Unsupervised machine learning is used when the training data is not classified or labeled. Semi-supervised machine learning uses both supervised and unsupervised training. Reinforcement machine learning uses a reward system to train, and the training data should be an initial state rather than labeled data.

2.2 Related works

Following are related works in the field of machine learning solutions for prediction and classification, as well as a summary of these studies.

A study by Indu Kumar et al. [2] compared the accuracy of different supervised machine learning algorithms for predicting stock exchange prices. It showed that some algorithms performed better than others, and that the size of the dataset was a factor to consider. The supervised machine learning algorithms tested were Support Vector Machine (SVM), Random Forest (RF), K-nearest neighbors (KNN), Naive Bayes (NB), and the Softmax algorithm. Of these, RF had the best prediction accuracy when the dataset was large, while Naive Bayes had the best prediction accuracy on a smaller dataset.

In the study, twelve different technical indicators were used to predict the stock price; reducing the number of indicators gave less accurate results in testing.

In another study, by Jiao and Jakubowicz [3], the supervised machine learning models Logistic Regression (LR), RF, Neural Network, and Gradient Boosted Decision Trees (GBDT) were tested on predicting stock prices. The study used time series prediction, and it showed that LR with a lasso constraint was the best model of the tested ones, being slightly more accurate than the others. The other models in the study also performed well compared to LR, with GBDT being the best among them.

Yu-Hsuan Cheng et al. [4] made a study on predicting the audience rating of TV-shows using machine learning, based on information gathered from the social media platform Facebook. The machine learning method used was a Back-propagation Network, trained with supervised learning. The prediction results were measured with the mean absolute error (MAE) and the mean absolute percentage error (MAPE). According to Lewis' MAPE definition, a MAPE ≤ 20% is a good prediction, and all their predicted values accumulated over one week fell under this definition, making the results promising. There were some deviations in specific cases where a premiere or finale of another show aired the same day; the authors recommend taking this into consideration to reduce such errors.

Another study on predictions using machine learning, by Sunakshi Mamgain et al. [5], researched the potential of predicting the popularity of a new car for a car company. The study tested several models for prediction, including KNN, LR, RF, and SVM. The authors defined the problem as a multivariate regression problem and classified it as supervised learning. The results showed that SVM had the best accuracy (by a relatively small margin) of the tested models. For future work in this field, the authors stated that they would use SVM and try to modify it for even better performance.

Chongsheng Zhang et al [6] wrote a study in 2017 comparing different classification algorithms and their accuracy. Eleven different algorithms were examined in this study and all of them were state-of-the-art at the time. The study's conclusion was that GBDT and RF had the best average prediction accuracy and SVM had a good average prediction accuracy. When it came to training time GBDT and RF were slow or average at best. In testing and prediction GBDT was the fastest, but KNN and SVM were the most efficient in overall running time. While some algorithms like SVM and RF did not do very well with some of the datasets that were used during the study, GBDT still had good accuracy on these datasets.

Despite its performance, GBDT is still used less than, for example, SVM and RF. In addition, the study mentions that GBDT is not well covered in large studies and literature.

Ramya Akula et al. [7] conducted a study using machine learning, specifically different regression models, to predict how successful a TV-show would be. The datasets used included, but were not limited to, TV-shows' historical TV-ratings, writers, directors, titles, air dates, characters, and their number of lines. The success of a TV-show was based on its TV-ratings: the higher the TV-ratings, the more successful the show. The study concluded that there was no one-size-fits-all algorithm for predicting which TV-show was the most successful. For example, KNN could predict The Office well, but no algorithm could predict South Park with good enough accuracy to say that the show's success could be predicted. According to the study, the reason could be the number of factors that go into producing a TV-show. The study suggests that future work should include factors such as demography and schedule to improve accuracy.

2.2.1 Summary of literature studies

Many of the related works point to supervised machine learning as the most common approach for prediction. SVM and RF both use this type of learning; they were the most used of the supervised models in the studies and showed potential for good prediction accuracy, being either the best or average in most of the studies.

The number of models to use and test can be narrowed down, since the related works have shown, with relatively high certainty, which models perform well for regression problems that predict a continuous variable, e.g., stock exchange value or TV-rating.

2.3 Models using supervised machine learning

As mentioned in section 2.2.1, SVM and RF stood out as good candidates for solving the problem in this thesis, hence this chapter covers them in more detail.

Hyperparameters, the parameters of the models that tune the learning process, will also be covered.

2.3.1 Support Vector Machine

SVM works by creating one or more hyperplanes in an n-dimensional space, where n can be infinitely large, and uses these for classification, regression, or other tasks. The hyperplane that has the largest distance to the nearest training data points of any class gives a good separation, because in general, the larger the margin, the lower the generalization error of the classifier [8]. This is visualized in figure 2.1 in a 2-dimensional space with the hyperplanes H₁, H₂, and H₃.

One of the disadvantages of SVM is that its performance decreases with a larger dataset due to the long training time. The final model can also be hard to understand due to its complexity, which makes it hard to calibrate [9].


Figure 2.1: Simple example of SVM in 2D space. H₁ does not separate the classes. H₂ separates the classes, but with a small margin. H₃ separates the classes with the maximal margin. By Zack Weinberg.

2.3.1.1 Support Vector Regression

Support Vector Regression (SVR) is a modified version of SVM. SVR uses vectors and the position of a curve to make a prediction, as opposed to SVM, which uses the curve itself as a boundary to classify data or potentially make predictions. SVR is used, as the name suggests, for regression and not classification. Support vectors help make a prediction with the data points and the function they build upon, which is visualized in figure 2.2. With the support vectors, the algorithm tries to find a close match, and this helps the decision making. The data points that do not act as support vectors or that lie outside the boundary line can be removed, because they are probably outliers [10].

Since the basic model for SVR is the same as SVM, these have the same disadvantages.


Figure 2.2: Simple visualization of Support Vector Regression.

Hyperparameters

The hyperparameters used for the SVR model to adjust for a dataset and to get the best possible prediction are [11]

• kernel – Selects the kernel function used by the algorithm,

• γ (gamma) - Tells SVR how much the regression curve should bend when training the dataset,

• C - Decides the weight/cost used for different outcomes,

• ε (epsilon) - How large an error can be tolerated by the algorithm.

SVR has more hyperparameters than these four to adjust the model, but these four are the ones that change the prediction depending on their values.
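As an illustration, a minimal scikit-learn sketch of constructing an SVR model with these four hyperparameters could look as follows; the values are placeholders for illustration, not the tuned values from this thesis.

```python
from sklearn.svm import SVR

# Minimal sketch: an SVR model with the four hyperparameters discussed
# above. The values are illustrative placeholders, not tuned values.
model = SVR(
    kernel="rbf",  # the kernel function of the algorithm
    gamma=0.1,     # how much the regression curve may bend
    C=100,         # the weight/cost of errors
    epsilon=0.1,   # how large an error is tolerated
)
```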

2.3.2 Random Forest

RF works by constructing multiple decision trees, as shown in figure 2.3. A decision tree is built from multiple nodes. Each node has a condition with a binary result, and the result decides the next node in the tree. The last node in the tree represents the output. The output in RF is the class that appears most often, or the mean prediction, across the different decision trees [12]. This reduces the chance of overfitting, a problem singular decision trees have, where a model performs well on the training data but does not generalize well [13].

The disadvantage of RF compared to single decision trees is that constructing multiple decision trees takes more time and/or computational resources. The predictions also take longer [14], and the model becomes harder to understand as the RF grows. As mentioned in section 2.3.1, this could lead to some challenges when training and testing the model, since training and testing could take a long time on the limited hardware.

Figure 2.3: Random Forest simplified. Each decision tree votes on what it thinks the prediction is going to be (orange dots). The votes from the different decision trees are combined in a majority vote, and the result of the majority vote is the final prediction. By Venkata Jagannath.

Hyperparameters

The hyperparameters used for RF to adjust for a dataset and to get the best possible prediction are¹ [15]

• n_estimators - The number of trees in the forest,

• min_samples_split - The minimum number of samples needed to split a node,

• min_samples_leaf - The minimum number of samples needed to be a leaf node,

• max_features - The number of features taken into consideration when searching for the best split,

• max_depth - The maximum depth of the tree,

• bootstrap - Decides whether bootstrap samples or the whole dataset are used for building trees.

¹ Since there is no standard notation for the hyperparameters, the notations in the implementation will be used.
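For illustration, a minimal scikit-learn sketch of constructing an RF regressor with these hyperparameters could look as follows; the values are placeholders, not the tuned values presented in chapter 4.

```python
from sklearn.ensemble import RandomForestRegressor

# Minimal sketch: an RF regressor with the hyperparameters listed
# above. The values are illustrative placeholders, not tuned values.
model = RandomForestRegressor(
    n_estimators=200,      # number of trees in the forest
    min_samples_split=2,   # minimum samples needed to split a node
    min_samples_leaf=1,    # minimum samples needed in a leaf node
    max_features="sqrt",   # features considered per split
    max_depth=None,        # grow each tree until its leaves are pure
    bootstrap=True,        # build each tree on a bootstrap sample
)
```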

2.3.3 Measurement of accuracy

The measurements presented and described in this chapter were used to evaluate the models and compare them against each other to decide which had the best performance. The model with the best performance was then compared with TV4's accuracy in predicting TV-ratings.

Root mean square error (RMSE) measures the standard deviation of the prediction errors [16]. RMSE measures the distance between two vectors: the actual values being predicted and the predicted values. The mathematical formula for RMSE is:

$$\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^{2}} \qquad (1)$$

• m - Number of instances RMSE is being measured on.

• x^(i) - A vector with the feature values of the i-th instance.

• y^(i) - Label of the vector.

• X - A matrix with all the vectors.

• h - The prediction function.

• RMSE(X, h) - The cost function measured on the samples using the function h.

MAE measures the average absolute deviation. MAE, like RMSE, measures the distance between two vectors, but in a different way. The mathematical formula for MAE is:

$$\mathrm{MAE}(X, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)}) - y^{(i)}\right| \qquad (2)$$

• m - Number of instances MAE is being measured on.

• x^(i) - A vector with the feature values of the i-th instance.

• y^(i) - Label of the vector.

• X - A matrix with all the vectors.

• h - The prediction function.

• MAE(X, h) - The cost function measured on the samples using the function h.

RMSE has the advantage over MAE if the deviation follows a Gaussian distribution (a bell-shaped distribution). MAE has the advantage over RMSE when there are many outliers in the distribution.

MAPE is an accuracy measurement that calculates the mean absolute percentage error; the lower the value, the better the prediction. One problem with this metric is that it is undefined where the actual value equals zero, since division by zero is undefined.

$$M = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right| \qquad (3)$$

• n - Number of observations.

• A_t - Actual value.

• F_t - Forecast value.
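For reference, equations (1)-(3) can be written directly in NumPy; the following is a minimal sketch where y_true and y_pred are assumed to be arrays of actual and predicted TV-ratings.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Equation (1): root of the mean squared error.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    # Equation (2): mean of the absolute errors.
    return np.mean(np.abs(y_pred - y_true))

def mape(y_true, y_pred):
    # Equation (3): mean absolute percentage error.
    # Undefined where y_true == 0, as noted above.
    return np.mean(np.abs((y_true - y_pred) / y_true))
```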


3 Method

In this chapter, the methods used in this thesis and the arguments for why they were used will be presented.

To discover and decide what type of machine learning models could be used for the problem, a literature study was done. The study covered previous studies, research, and work on supervised machine learning. Thereafter, the chosen models were evaluated and compared with different methods. Finally, the results from the evaluation and comparison were collected and analyzed.

3.1 Chosen models

The decision to use supervised machine learning models was based on their ability to predict data from what they have previously learned with labeled data [1]. All of the studies examined regarding different prediction models used supervised learning in some form, which was a strong indicator that a supervised model was a better fit for this work than the other types. Furthermore, the type of data used in the studies was similar to the data used for this thesis. Based on the studies reviewed, it was concluded that SVM and RF were the best fit for this thesis, and these models were therefore chosen for further evaluation.

3.1.1 Arguments for SVM and SVR

SVM was, in the related works presented in section 2.2, one of the models that had a top or average score in prediction accuracy among the different tested supervised learning models. Some other models had similar performance to SVM, like RF and GBDT. Even if GBDT has similar performance, it is not as widely used and researched as SVM and RF, which led to some uncertainty regarding GBDT's performance on the problem in this study. The machine learning model LR is better suited to predicting a variable that can have two states; an example of this is that stocks can go up or down. As LR output is binary, it cannot generate the continuous variable needed for this thesis. LR can be modified to generate a continuous variable, but this process was deemed too time-consuming.

SVR is a modified version of SVM with the same basic model that is better suited to regression than classification and clustering, as mentioned in section 2.3.1.1. Since the problem in this thesis is a regression problem and not a classification problem, SVR was the version of SVM chosen.

3.1.2 Arguments for Random Forest

Most of the arguments for SVM also apply to RF. RF was used in some of the studies and is a common prediction model when implementing machine learning. RF got top scores in many of the studies mentioned in section 2.2 and still performed well in the others. RF is also a model that does not need a large dataset for its training: it performs well with both small and large datasets, which makes it a good choice from a data perspective. RF was also chosen because of its ability to prevent overfitting [13]. Overfitting was not going to be a problem for this work due to the size of the dataset used for training the model, but if a complete product were developed in the future, overfitting might become a problem because of an increased dataset size, so it was taken into consideration.

3.2 Data

This chapter will present the data used during the training and evaluation of the models as well as the process of formatting the raw data acquired.

3.2.1 Raw data

The data used for TV-ratings came from Mediamätning i Skandinavien AB (MMS), a company that, among other things, measures TV-ratings in Sweden. The data acquired was historical data from 2019 with the features (columns) shown in tables 3.1 and 3.2. Tables 3.1 and 3.2 are two halves of the same table, divided for illustration purposes.

Table 3.1: Example of data used for the prediction models first half.

Table 3.2: Example of data used for the prediction models second half.


3.2.2 Data features and formatting

In this chapter the features in the dataset acquired from MMS will be explained, along with the formatting of the dataset and the reasons why.

3.2.2.1 Datum

To represent the date a certain program was aired, a method called one-hot encoding was used. One-hot encoding is, as shown in table 3.3, a method to switch out text for a binary format [17]. This method was used since the chosen machine learning models cannot learn from text, and the numerical version of the date does not indicate that two dates seven days apart fall on the same day of the week. This also solves the problem that would occur if the days of the week were enumerated from 1-7: the machine learning model would assume that two nearby values, say 1 and 3 (Monday and Wednesday), are more closely related than two distant values, say 1 and 7 (Monday and Sunday). This is not correct, since the days of the week loop around. A disadvantage of this method is that it generates more features, which can make the learning process slower. Names, dates, and other text formats can still be included in the dataset, but the model cannot use them to learn anything.

Table 3.3: Example of Datum representation with one-hot encoding.
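A minimal pandas sketch of this encoding could look as follows; the helper column name Veckodag and the example dates are made up for illustration.

```python
import pandas as pd

# Minimal sketch: one-hot encode the day of the week of each air date.
df = pd.DataFrame({"Datum": ["2019-09-02", "2019-09-03", "2019-09-09"]})
df["Veckodag"] = pd.to_datetime(df["Datum"]).dt.day_name()
df = pd.get_dummies(df, columns=["Veckodag"])
# One binary column per weekday: 2019-09-02 and 2019-09-09 are both
# Mondays, seven days apart, and get a 1 in the same column.
```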

3.2.2.2 Kanal

Kanal shows on what TV-channel the program was aired. This was represented as binary features with the help of one-hot encoding, as shown in table 3.4, for the same reason as mentioned in section 3.2.2.1. The feature Kanal was used to see whether there was a connection between the channel and the TV-ratings.

Table 3.4: Example of Kanal representation with one-hot encoding.

3.2.2.3 Starttid

Starttid in the datasets is the time a program starts during a day. The time can be between 02:00 and 25:59 because of how a broadcast day looks. A broadcast day, unlike a regular day that spans 00:00 - 24:00, starts at 02:00 one day and ends at 01:59 the following day, yet in broadcast terms this still counts as the same day. Therefore, a broadcast day starts at 02:00 and ends at 25:59, and the date changes when the clock passes 25:59, as shown in table 3.5. The only change made to the data was the removal of ":", since the models would otherwise interpret the value as text and not a numeric value.

Table 3.5: Example of Starttid.
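A minimal sketch of this change, assuming the data is loaded into a pandas DataFrame, could look as follows.

```python
import pandas as pd

# Minimal sketch: strip ":" from the broadcast start times so the
# models can read them as numeric values.
df = pd.DataFrame({"Starttid": ["02:00", "19:30", "25:59"]})
df["Starttid"] = df["Starttid"].str.replace(":", "", regex=False).astype(int)
# "02:00" -> 200, "19:30" -> 1930, "25:59" -> 2559
```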

3.2.2.4 Programnamn

Programnamn is the name of the program being broadcast. The feature Programnamn was in text format when acquired. As mentioned in section 3.2.2.1, the models do not handle text format well, so the decision was made to convert the text to numeric form. The process of formatting consisted of going through the dataset, extracting all unique program names, and assigning each a numeric representation (from one and up). The names of the programs were then replaced by this numeric value, as shown in table 3.6.

A disadvantage of this decision is that it could lead the models to think that the shows with numeric values one and two are more closely related than one and twenty, even though this might not be the case.

Table 3.6: Programnamn represented with numeric values.
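A minimal pandas sketch of this mapping could look as follows; the program names here are made up for illustration.

```python
import pandas as pd

# Minimal sketch: replace each unique program name with a numeric ID
# (from one and up).
df = pd.DataFrame({"Programnamn": ["Nyheterna", "Bonde söker fru", "Nyheterna"]})
codes, uniques = pd.factorize(df["Programnamn"])
df["Programnamn"] = codes + 1  # factorize counts from 0; the thesis starts at 1
```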

3.2.2.5 Demographic data

The demographic data is shown in table 3.7, table 3.1, and table 3.2. These features were used to give the models an understanding of what kind of demographic watches a certain type of program. This behavior and pattern are important for the prediction model to learn so that its predictions become more accurate.


Table 3.7: Demographic data. Rat(1000) A20-34[KO] represents the TV-rating for ages 20 to 34, in thousands.

Kvi A3-99 [KO]%

This feature represents the female demographic of the TV-rating, ages three to ninety-nine, as a percentage of the total number of viewers. In the feature Kvi A3-99 [KO]%, the "," was replaced with ".", since "," is the separator between features in the .csv document.

Målgruppskod

Målgruppskod is the demographic group a program is meant to be watched by. The viewer can belong to one of two groups, adults or children. This was translated into the one-hot encoding scheme as shown in table 3.8.

Table 3.8: Example of Målgruppskod representation with one-hot encoding.

3.2.2.6 PgGenre1

PgGenre1 is the genre of the program being broadcast. This feature was in text format, but since there was a limited number of genres it could be converted to a binary format using one-hot encoding, as shown in table 3.9.

Table 3.9: Example of PgGenre1 representation with one-hot encoding.


3.2.2.7 Data not formatted

The features Längd (length of program) and Vecka (week number) were not changed in any way, since they were already in a format that the models could process.

3.2.2.8 Formatted data

Tables 3.10, 3.11, and 3.12 show how the dataset looked in the final document, after the data acquired from MMS had been processed using the methods described in section 3.2.2. For illustration purposes, the feature names were manually formatted and the table was divided into three parts: table 3.10, table 3.11, and table 3.12.

Table 3.10: Example of formatted data, part one.

Table 3.11: Example of formatted data, part two.

Table 3.12: Example of formatted data, part three.

3.3 Evaluation of the results

The process of evaluating the results will be presented in this chapter. The models were compared against each other to see which had the best performance. The best model was then compared with the experts' accuracy in predicting TV-ratings.


The chapter will also explain how the models were evaluated to get the best prediction from them.

3.3.1 Hyperparameters

To improve the prediction accuracy of the models, hyperparameter tuning was done to find the best hyperparameters for each model with the dataset acquired. The hyperparameters for each model were tuned with the tool RandomizedSearchCV (described in section 3.4.1).

The hyperparameters tested for SVR:

• kernel = radial basis, polynomial, and linear function.

• C = 1, 10, 100, and 1000.

• γ = 0.1, 1, and 10.

• ε = 0.1, 0.2, 0.3, 0.4, and 0.5.

The hyperparameters tested for RF:

• n_estimators: 200, 400, 600, 800…, 2000

• min_samples_split: 2, 5, 10

• min_samples_leaf: 1, 2, 4

• max_features: auto, sqrt

• max_depth: 10, 20, 30, 40…, 200, None

• bootstrap: True, False

Since RandomizedSearchCV chooses which combination of hyperparameters to test pseudorandomly, the hyperparameter tuning was done four times to, as far as possible, counteract the possibility of missing the most optimal hyperparameters.
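A minimal sketch of such a search for RF could look as follows; the stand-in training data, the number of sampled combinations (n_iter), and the number of cross-validation folds (cv) are assumptions for illustration. The value 1.0 for max_features stands in for the "auto" setting of older scikit-learn versions, which used all features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data; in the thesis this would be the formatted MMS dataset.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 10)), rng.random(200)

param_distributions = {
    "n_estimators": list(range(200, 2001, 200)),
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt"],  # 1.0 = all features (old "auto")
    "max_depth": list(range(10, 201, 10)) + [None],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(
    RandomForestRegressor(),
    param_distributions=param_distributions,
    n_iter=20,  # pseudorandom combinations to try
    cv=3,       # cross-validation folds
)
search.fit(X_train, y_train)
print(search.best_params_)  # best combination found by the search
```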

3.3.2 Training process

A training split of 80/20 was used during the training process: 80% of the data in the dataset was used for training and 20% for testing after the models had been trained. The 80/20 split was chosen after it showed the best results among the splits tested, which were 70/30, 75/25, 80/20, 85/15, and 90/10.

Since the splits were generated pseudorandomly, the training was done ten times on each model to see if this made a difference in the outcome of the predictions. The dataset used for the training process covered a twelve-week period between September and December 2019.
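A minimal sketch of generating such a split with scikit-learn is shown below; the stand-in arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays; X would hold the formatted features, y the TV-ratings.
X, y = np.arange(100).reshape(50, 2), np.arange(50)

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Each call draws a new pseudorandom split, which is why the training
# was repeated ten times.
```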

3.3.3 Testing process

The comparison of the models focused on their prediction accuracy. The model with the best prediction accuracy was chosen for further evaluation and was compared against the average prediction error percentage acquired from TV4.

3.3.3.1 Accuracy measurement

RMSE and MAE were used to measure the models' prediction accuracy. The related works presented in section 2.2 indicate that RMSE and MAE are common measurements of prediction accuracy for regression problems. The book Hands-On Machine Learning [16] states that RMSE is, in general, the preferred method for regression problems. The book also mentions that RMSE is more sensitive to outliers in the dataset than MAE, but performs well when outliers are rare [18]. Because of the uncertainty about how many outliers there would be, both RMSE and MAE were chosen to compare the models.

To compare the best model with the prediction accuracy acquired from TV4, a measurement using average percentage error was needed, so MAPE was used. TV4 has had an average prediction accuracy error of around 1.5% (since 2010) according to their lead of estimation. The farther into the future a prediction is made, the larger the error and the more factors of uncertainty there are.

One such uncertain factor is what other channels are going to show, which is data that other channels do not know long in advance; weather and streaming services are other factors.

As mentioned in section 2.3.3, MAPE is undefined where the actual value equals zero. As some of the values in the dataset were zero, these were replaced with the value one when calculating MAPE. This could skew the results, either increasing or decreasing the MAPE depending on the predicted value.
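A minimal sketch of this workaround could look as follows, assuming y_true and y_pred are NumPy arrays.

```python
import numpy as np

def mape_with_zero_fix(y_true, y_pred):
    # Replace actual values equal to zero with one to avoid division
    # by zero, as described above, then compute MAPE as a percentage.
    y_true = np.where(y_true == 0, 1, y_true)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```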

3.4 Tools and framework

The following frameworks and tools were used during the process of testing the different prediction models.

3.4.1 Scikit-learn

Scikit-learn is open-source and provides tools for data analysis, data prediction, and machine learning models [19]. The machine learning models used from scikit-learn in this thesis were SVR and RF. The model parameters could be tuned with a tool from scikit-learn called RandomizedSearchCV, which automatically fine-tunes the hyperparameters to best fit the respective models and datasets [20]. Because of this, the fine-tuning of the models' hyperparameters took less time than it would have done manually. RandomizedSearchCV was used instead of GridSearchCV because it finds parameters faster. The disadvantage of this speed is that there is a possibility the best parameters are not found [21].


3.4.2 pandas

pandas is an open-source software library consisting of tools for data analysis and data manipulation [22]. This library made it possible to manipulate the data so it could be used for learning in scikit-learn's prediction models. The pandas function for translating categorical features into binary features, one-hot encoding, was used as mentioned in section 3.2.2. Without this function, the features that were in text format would not have been included in the training and testing.

3.4.3 NumPy

NumPy is primarily a framework for scientific computing in Python, providing, among other things, an N-dimensional array object [23]. NumPy was used to convert the .csv data into these array objects, which was needed to generate the training and testing datasets for the models.
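A minimal sketch of this conversion could look as follows; the file name and the target column Rat(1000) are assumptions for illustration.

```python
import pandas as pd

# Minimal sketch: load the formatted .csv and convert it to NumPy
# arrays for the models. File and column names are assumed.
df = pd.read_csv("formatted_data.csv")
y = df["Rat(1000)"].to_numpy()                 # target: TV-rating in thousands
X = df.drop(columns=["Rat(1000)"]).to_numpy()  # feature matrix
```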


4 Results

This chapter presents the results of the hyperparameter tuning of the models, the models' performance compared to each other, and the performance of the best model compared to TV4's current process of estimating TV-ratings.

4.1 Hyperparameters

Since RandomizedSearchCV chooses pseudorandomly which combination of hyperparameters to try, the testing process was done four times to make sure the best possible hyperparameters were chosen. All four tests gave the same result for both models, presented below.

The best result of hyperparameter tuning SVR according to RandomizedSearchCV:

• kernel = radial basis function (rbf).

• C = 1000.

• γ = 1.0.

• ε = 0.1.

The best result of hyperparameter tuning RF according to RandomizedSearchCV:

• n_estimators = 2000

• min_samples_split = 2

• min_samples_leaf = 1

• max_features = auto

• max_depth = None

• bootstrap = True

4.2 Test results

The training and testing were both done on pseudorandom subsets of the dataset. To make sure the training and testing were thorough, each model was trained and tested ten times and the models were then compared against each other. The results of this process are presented below. The unit of measurement for MAE and RMSE is TV-rating in thousands.

4.2.1 Support Vector Regression

The following are the results after training the SVR model ten times. MAE had a maximum value of 42.22, a minimum value of 37.67, and a mean value of 39.73. RMSE had a maximum value of 102.73, a minimum value of 77.20, and a mean value of 89.11.

Refer to Appendix for further details.


4.2.2 Random Forest

The following are the results after training the RF model ten times. MAE had a maximum value of 18.79, a minimum value of 17.01, and a mean value of 17.91. RMSE had a maximum value of 48.16, a minimum value of 38.63, and a mean value of 42.73.

Refer to Appendix for further details.

4.2.3 Comparison of the two machine learning models

The two models' mean MAE and RMSE values, with standard deviations, are presented side by side in figure 4.1. As shown, RF performed better on both metrics; hence, RF was chosen for comparison against the current process of estimating TV-ratings.

Figure 4.1: Comparison of MAE and RMSE between Support Vector Regression and Random Forest.

4.3 Chosen model compared to the current process

Figure 4.2 compares the average percentage error of RF with that of TV4's current process of estimating TV-ratings (1.5%). The average percentage error for RF was calculated as the mean MAPE over the ten training rounds and is presented together with the standard deviation. No standard deviation is presented for TV4, since no such data was acquired. As figure 4.2 shows, the chosen model is not more accurate than the current process.


Figure 4.2: Comparison between Random Forest and TV4.


5 Analysis and discussion

This chapter analyzes the results and discusses them from different standpoints, such as ethical and economic.

5.1 Analysis of results

As shown in chapter 4, both RMSE and MAE were lower for RF than for SVR; in other words, RF's predictions were closer to the actual values. Even though RF outperformed SVR, it was not as good as TV4's estimation team.

5.1.1 Model Comparison

The results from comparing the two machine learning models were straightforward in this study. RF performed better than SVR according to all metrics measured. This does not mean that RF is superior to SVR in all ways, just that for this problem, with this dataset, RF is the better model.

With a larger dataset and more (or fewer) features, there is a possibility that SVR could outperform RF. Further, since only two machine learning models were trained, tested, and compared, there may be other models that could outperform both RF and SVR.

5.1.2 Chosen model versus current process

As shown in section 4.3, the performance of the chosen machine learning model is not good enough to replace the current process of estimating TV-ratings when compared to the data received from TV4. A reason for this could be the amount of experience in TV4's estimation team; according to the lead of estimation at TV4, their average estimation error has been as low as 1.5% since 2010. Another reason could be the limited amount of data used in this study. A dataset covering twelve weeks was used, which contained just under 20 000 rows of data. The machine learning model might have performed better if the dataset used for training and testing had covered a full year or more.

Even though the machine learning model trained and tested in this thesis cannot yet replace the current process, it could be incorporated into the process as an easy tool for getting fast predictions.

5.1.3 Hyperparameter tuning

The hyperparameters used were chosen from the results of RandomizedSearchCV, as explained in sections 3.3.1 and 4.1. As mentioned in those sections, the problem with this method is that the combinations of hyperparameters tested are pseudorandomly chosen, which could lead to the actual best hyperparameters never being tested. To counteract this, the method was run multiple times with a decently large variety of values.

The other option for hyperparameter tuning was to try every combination of hyperparameters to find the best one. This method was not chosen since it is very time-consuming, especially considering that this work was done on laptops with limited hardware.

5.2 Discussion of methods

This chapter will discuss the methods used to evaluate the models as well as the methods used to format the data.

5.2.1 Measurement methods

For measuring the performance of the machine learning models and comparing them with each other, MAE, RMSE, and MAPE were used. Both MAE and RMSE calculate the mean error of the prediction, but in different ways.

MAE calculates the mean absolute error, which means it does not take the direction of the errors into consideration. RMSE, on the other hand, squares all the errors and then calculates the square root of the average of the squared errors, which means that RMSE gives more weight to larger errors while MAE does not. This could have been a problem requiring more tests if one of the models had performed better on MAE and the other on RMSE. In this case, RF performed better on both metrics, by a relatively large margin, which makes it clear that it was the better model of the two.

MAPE, as explained in section 2.3.3, is not defined for actual values equal to zero. This was a problem in this study, since the dataset contained values equal to zero. But because a measurement of average percentage error was needed to compare the performance of the machine learning model against the current process, the decision was made to replace all zeros with ones for this calculation. This could skew the results, making them look either better or worse than they are, and might have been the biggest source of error in the thesis.

5.2.2 Data formatting

As explained in section 3.2.2, some formatting had to be done to the dataset to make sure it was in a format the models could learn from. One of the formatting steps was changing the program names into IDs with only numeric values. This could cause some problems, as mentioned in section 3.2.2.4, where the models think two programs with IDs close to each other, say one and two, are more closely related than one and nine, which may not be the case. Further, this could lead to the models making false connections between TV-rating and the numerical representation of the program names (IDs). A different solution would be to use machine learning models able to learn from features in text format, or to drop such features from the dataset.

The other formatting done to features in text format should not have a negative effect on the machine learning models, other than increasing the training time due to the increase in features, since that formatting changed categorical features into binary features, which does not carry the risk of the models falsely treating unrelated features as related.

5.3 Economical and ethical impact

From an economic viewpoint, if TV-ratings could be predicted with an automated process, companies could save money on personnel costs and also earn money by not making mistakes in the prediction of TV-ratings. As the results show, the tested SVR and RF are not accurate enough to replace the employee(s) who do this job at the time of writing. TV-channels will earn more money if they stay with the procedure they are using than if they switch to the tested models and automate the process. This does not mean that the results did not show potential for future work and for companies to use machine learning for economic purposes. RF, the better of the two models tested in this work, showed that with more work it could be used to cut personnel costs as an aid for predicting TV-ratings. The result can be a starting point for further work on finding a replacement for the way TV-ratings are predicted.

The data from MMS comes from individuals who voluntarily shared information about what they watch and about themselves. If data were gathered from individuals who did not agree to this, or if data about those who did agree were used for another purpose than expected, would it be ethical to use it to create a process that makes other people lose their jobs? Even if a person were fully aware of what the data was used for, is it ethical to use other people's data to build a process that automatically predicts what some other person, who has not approved this, is watching, and to target specific advertisements at that person?


6 Conclusions

This study researched the possibility of using supervised machine learning to predict TV-ratings for the placement of advertisements. The two supervised machine learning models evaluated were Support Vector Regression and Random Forest. Of the two models, Random Forest had the best performance. It was concluded that Random Forest, with the dataset and methods used in this study, is not accurate enough to replace the current process of manually predicting TV-ratings. Even though Random Forest cannot replace the current process, it could be included as part of the process to ease the work of the employees currently assigned this task.

The results from this study were promising even though they were not better than the current process of estimating TV-ratings. This shows promise that machine learning models could potentially replace the current process in the future.

6.1 Future work

If future work is done in this field, it is recommended to use a much larger dataset to see if this impacts the performance to a large degree. Further, it is recommended to evaluate more machine learning models, since only two models were trained and tested during this thesis. It would also be interesting to see whether more features, like weather, would have an impact on the performance. Lastly, it is recommended to find a better measurement of performance, since MAPE, as used in this study, has its downsides, as mentioned in section 5.2.1.


References

[1] R. Saravanan and P. Sujatha, "A State of Art Techniques on Machine Learning Algorithms: A Perspective of Supervised Learning Approaches in Data Classification," 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 2018, pp. 945-949. Cited 27-03-2020. ISBN: 978-1-5386-2842-3. Available at: https://ieeexplore-ieee-org.focus.lib.kth.se/document/8663155

[2] I. Kumar, K. Dogra, C. Uterja, P. Yadav, "A Comparative Study of Supervised Machine Learning Algorithms for Stock Market Trend Prediction", 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 2018, pp. 1003-1007. Cited 27-03-2020. ISBN: 978-1-5386-1974-2. Available at: https://ieeexplore-ieee-org.focus.lib.kth.se/document/8473214

[3] Y. Jiao and J. Jakubowicz, "Predicting stock movement direction with machine learning: An extensive study on S&P 500 stocks," 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 2017, pp. 4705-4713. Cited 27-03-2020. ISBN: 978-1-5386-2715-0. Available at: https://ieeexplore-ieee-org.focus.lib.kth.se/document/8258518/

[4] Y. Cheng, C. Wu, T. Ku, G. Chen, "A Predicting Model of TV Audience Rating Based on Facebook", 2013 International Conference on Social Computing, Alexandria, VA, USA, 2013, pp. 1034-1037. Cited 27-03-2020. ISBN: 978-0-7695-5137-1. Available at: https://ieeexplore-ieee-org.focus.lib.kth.se/document/6693464/

[5] S. Mamgain, S. Kumar, K. M. Nayak and S. Vipsita, "Car Popularity Prediction: A Machine Learning Approach", 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 2018, pp. 1-5. Cited 27-03-2020. ISBN: 978-1-5386-5257-2. Available at: https://ieeexplore-ieee-org.focus.lib.kth.se/document/8697832/

[6] C. Zhang, C. Liu, X. Zhang, G. Almpanidis, "An up-to-date comparison of state-of-the-art classification algorithms", Expert Systems with Applications, Vol. 82, 2017, pp. 128-150. Cited 27-03-2020. ISSN: 0957-4174. Available at: https://www.sciencedirect.com/science/article/abs/pii/S0957417417302397

[7] R. Akula, I. Garibay, "Forecasting the Success of Television Series using Machine Learning", IEEE SoutheastCon 2019, Huntsville, AL, USA, 2019. Cited 27-03-2020. Available at: https://www.researchgate.net/publication/332530593_Forecasting_the_Success_of_Television_Series_using_Machine_Learning

[8] Scikit-learn. 1.4. Support Vector Machines. Cited 03-24-2020. Available at: https://scikit-learn.org/stable/modules/svm.html

[9] S. Ray, "A Quick Review of Machine Learning Algorithms," 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 2019, pp. 35-39. Cited 05-05-2020. ISBN: 978-1-7281-0211-5. Available at: https://ieeexplore.ieee.org/document/8862451

[10] A. Géron, "Hands-On Machine Learning with Scikit-Learn & TensorFlow", 1 ed. Sebastopol, CA, USA, O'Reilly Media, Inc. 2017-03-24. pp. 154-155. ISBN: 9781491962299.

[11] Scikit-learn. sklearn.svm.SVR. Cited 05-05-2020. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

[12] A. Géron, "Hands-On Machine Learning with Scikit-Learn & TensorFlow", 1 ed. Sebastopol, CA, USA, O'Reilly Media, Inc. 2017-03-24. pp. 181. ISBN: 9781491962299.

[13] L. Breiman, "Random Forests", Machine Learning, Vol. 45, 2001, pp. 5-32. Cited 27-04-2020. ISSN: 0885-6125 (Print) 1573-0565 (Online). Available at: https://doi.org/10.1023/A:1010933404324

[14] J. Huo, T. Shi and J. Chang, "Comparison of Random Forest and SVM for electrical short-term load forecast with different data sources," 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, 2016, pp. 1077-1080. Cited 29-04-2020. ISBN: 978-1-4673-9904-3. Available at: https://ieeexplore.ieee.org/document/7883252

[15] Scikit-learn. sklearn.ensemble.RandomForestRegressor. Cited 05-05-2020. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

[16] A. Géron, "Hands-On Machine Learning with Scikit-Learn & TensorFlow", 1 ed. Sebastopol, CA, USA, O'Reilly Media, Inc. 2017-03-24. pp. 37-40. ISBN: 9781491962299.

[17] A. Géron, "Hands-On Machine Learning with Scikit-Learn & TensorFlow", 1 ed. Sebastopol, CA, USA, O'Reilly Media, Inc. 2017-03-24. pp. 62-64. ISBN: 9781491962299.

[18] A. Géron, "Hands-On Machine Learning with Scikit-Learn & TensorFlow", 1 ed. Sebastopol, CA, USA, O'Reilly Media, Inc. 2017-03-24. pp. 39-40. ISBN: 9781491962299.

[19] Scikit-learn. Scikit-Learn: Machine Learning in Python. Cited 20-04-2020. Available at: https://scikit-learn.org/stable/

[20] Scikit-learn. sklearn.grid_search.RandomizedSearchCV. Cited 20-04-2020. Available at: https://scikit-learn.org/0.16/modules/generated/sklearn.grid_search.RandomizedSearchCV.html

[21] Scikit-learn. Comparing randomized search and grid search for hyperparameter estimation. Cited 20-04-2020. Available at: https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html

[22] pandas. Cited 20-04-2020. Available at: https://pandas.pydata.org/

[23] NumPy developers. NumPy. Cited 20-04-2020. Available at: https://numpy.org/


Image sources

Figure 2.1: Simple example of SVM in 2D space. Made by Zack Weinberg. License: CC BY-SA 3.0.

Figure 2.3: Random Forest simplified. Made by Venkata Jagannath. License: CC BY-SA 4.0.


Appendix

Table A.1: Details of the ten rounds of training for SVR and RF.

Round      1      2      3      4      5      6      7      8      9      10     Mean   SD
MAE SVR    39.91  39.95  40.26  37.74  38.75  37.64  40.05  41.32  39.46  42.22  39.73  1.36
RMSE SVR   99.56  102.73 84.69  79.44  84.41  77.20  86.32  93.51  88.77  94.45  89.11  7.92
MAE RF     17.77  18.79  18.67  17.01  17.74  18.04  17.33  17.29  18.24  18.18  17.91  0.56
RMSE RF    39.10  46.78  48.16  39.87  41.69  40.79  41.54  38.63  43.30  47.41  42.73  3.35
MAPE RF    25.72  26.77  28.94  27.24  25.57  25.97  25.94  26.48  27.25  25.70  26.56  0.99

