
UPTEC F 16048

Degree project 30 credits, October 2016

The effect of quality metrics

on the user watching behaviour in media content broadcast

Erik Setterquist


Abstract

The effect of quality metrics on the user watching behaviour in media content broadcast

Erik Setterquist

Understanding the effects of quality metrics on user behavior is important for the increasing number of content providers in order to maintain a competitive edge. The two data sets used are gathered from a provider of live streaming and a provider of video on demand streaming. The important quality and non-quality features are determined using both correlation metrics and the relative importance determined by machine learning methods. A model that can predict and simulate the user behavior is developed and tested. A time series model, a machine learning model and a combination of both are compared. Results indicate that both quality features and non-quality features are important in understanding user behavior, and that the importance of the quality features is reduced over time. For short prediction times the model using quality features performs slightly better than the model not using quality features.

ISSN: 1401-5757, UPTEC F 16048. Examiner: Tomas Nyberg. Subject reader: Thomas Schön. Supervisor: Selim Ickin


The effect of quality metrics on the user watching behaviour in media content broadcast

Erik Setterquist Uppsala University

Performed at Ericsson Research, Kista

October 5, 2016


Abstract

Understanding the effects of quality metrics on user behavior is important for the increasing number of content providers in order to maintain a competitive edge. The two data sets used are gathered from a provider of live streaming and a provider of video on demand streaming. The important quality and non-quality features are determined using both correlation metrics and the relative importance determined by machine learning methods.

A model that can predict and simulate the user behavior is developed and tested. A time series model, a machine learning model and a combination of both are compared. Results indicate that both quality features and non-quality features are important in understanding user behavior, and that the importance of the quality features is reduced over time. For short prediction times the model using quality features performs slightly better than the model not using quality features.

Popular science summary (Populärvetenskaplig sammanfattning)

How playback quality affects you

Long loading times, stuttering playback and enormous pixels. None of this is something you want to experience when Germany leads 4-3 in the 93rd minute and Sweden has the ball in the German penalty area. But how much are we really affected by the quality of a video stream? This is important not only for consumers but also for the broadcasters who, in an increasingly competitive industry, try to get you to watch as much of their content as possible.

Within five years, 80 percent of all consumer internet traffic is expected to be related to video playback, and an ever larger share of this traffic will be handled by specialized operators. For the subscription- and advertising-based business model to work, it is important for these operators to understand what affects how much, and how often, a user chooses to consume their content. The aim of this work is to build a model that can find the factors that affect a consumer's behavior, and that can predict how the behavior will change if the playback quality changes. We find that users are affected both by factors that are not related to quality, e.g. the time of day, and by factors that are related to quality, e.g. how often the playback stalls and buffers.

The models also predict that a change in quality can affect a user's behavior to some extent. The requirements on the model are met well by methods that use machine learning.

This is in itself a very interesting and active field, and continued development of the models will give a better understanding of how we consume media. What this work has achieved is not a final answer to how we should understand media consumers, but rather a first step towards showing, with the help of intelligent software, how work in this field can continue.


Acknowledgement

I would like to thank my supervisor Selim Ickin at Ericsson Research, as well as the rest of the research group at Ericsson. I would also like to extend my gratitude to Thomas Schön at Uppsala University, who was my subject reader during the thesis.

Erik Setterquist, September 2016


Abbreviations

ACF Autocorrelation function
AR Autoregressive
ARIMA Autoregressive integrated moving average
ARMA Autoregressive moving average
AUC Area under the curve
AWD Average watch duration
CDN Content delivery network
DoM Day of the month
DoW Day of the week
DT Decision tree
FNR False negative rate
FPR False positive rate
GP Gaussian process
HoD Hour of the day
MA Moving average
ML Machine learning
MOS Mean opinion score
MSE Mean square error
PACF Partial autocorrelation function
PSD Power spectral density
PSNR Peak signal to noise ratio
QoE Quality of experience
QRF Quantile regression forest
RF Random forest
RMSE Root mean square error
ROC Receiver operating characteristic
SARIMA Seasonal autoregressive integrated moving average
TNR True negative rate
TPR True positive rate
VoD Video on demand


Contents

1 Introduction
1.1 Motivation
1.2 Defining and Measuring QoE
1.2.1 Traditional QoE metrics
1.2.2 New metrics, e.g. bitrate
1.3 Identifying confounding factors
1.4 Establishing correlation between quality features and the user engagement
1.5 Predicting user engagement and perceived quality
1.5.1 Building the predictive model
1.5.2 Evaluating the predictive model
1.5.3 Using the predictive model
1.6 Research question
1.7 Outline

2 Theory and methods
2.1 Statistical methods
2.1.1 Pearson correlation
2.1.2 Kendall tau correlation
2.1.3 Correlation plots
2.1.4 Welch spectrogram method
2.2 Machine learning techniques
2.2.1 Decision tree algorithm
2.2.2 Random forest algorithm
2.2.3 Information leakage
2.2.4 Coefficient of determination
2.2.5 MSE and RMSE
2.2.6 Accuracy
2.2.7 TPR, FPR, TNR, FNR
2.2.8 ROC and AUC
2.2.9 Feature importance
2.3 Time series analysis
2.3.1 Weak-sense stationarity
2.3.2 SARIMA
2.3.3 Fitting the SARIMA model
2.3.4 Forecasting using the SARIMA model
2.4 Practical considerations concerning the methods used
2.4.1 Choice of engagement metric
2.4.2 Choice of QoE metrics
2.4.3 Splitting data into training and testing set
2.4.4 Choice of baseline
2.4.5 Developing a time series model
2.4.6 Testing the effect of artificially altering the quality
2.4.7 Quality effect on user return rate

3 Data set
3.1 The raw data
3.2 Cleaning the data
3.3 Extracting features from the data
3.4 Per user data analysis
3.5 Per hour data analysis

4 Results
4.1 Data
4.2 Correlation metrics
4.3 Time series analysis
4.4 ML Models
4.5 Compare AWD models
4.6 Artificially alter the bitrate
4.7 ML - predict watch duration of the next session
4.8 Predicting if a user will have another session
4.8.1 Feature histograms
4.8.2 Time between session distribution
4.8.3 ML results

5 Discussion
5.1 General trends in the data
5.2 Predicting the AWD
5.3 Predicting the duration of the next session
5.4 Artificially altering the quality
5.5 Predicting if a user will have another session
5.5.1 Feature histogram
5.5.2 ML models predicting return probability
5.6 Evaluation of the methods

6 Concluding remarks
6.1 Conclusion
6.2 Future work

A Pearson correlation


Chapter 1 Introduction

This chapter presents the goal of the thesis and previous research in the area of video quality of experience.

1.1 Motivation

With the success of subscription- and ad-based revenue models, it is becoming increasingly important to understand user quality of experience (QoE). As users continuously expect higher video quality and the number of video content providers increases, a good understanding of QoE is necessary to sustain and increase the number of users. According to Cisco, mobile video represented 55% of mobile data traffic in 2015 and this number is predicted to increase to 75% by the year 2020 [1]. By 2019, 80% of all consumer internet traffic is predicted to be video traffic and 72% of this traffic is predicted to go through content delivery networks (CDNs) [2]. It is important to study whether user behavior and engagement, e.g. watched duration or number of views, is influenced by quality degradations such as freezes or low bit rates. New technologies have made the process of collecting, storing, processing and analyzing large sets of data much easier. This allows content providers to store and analyze customer data on a much larger scale than before. The understanding of how users are affected by quality issues is still limited and is therefore an interesting topic to study.

1.2 Defining and Measuring QoE

QoE in standard video broadcast is a well studied field [3]; however, the knowledge is still limited regarding Internet video streaming QoE [4]. The reason for this is that the traditional metrics of quality and user engagement are replaced by more temporal metrics such as rebuffering rates and average bit rate.

1.2.1 Traditional QoE metrics

Traditionally the peak signal to noise ratio (PSNR) has been used as a measure of quality [5]. However, there is only an approximate relation between PSNR and the video quality perceived by a human observer. A further problem with PSNR is that it is a full-reference metric [6]. This means that both the video received by the user and the entire original video are needed to calculate the PSNR. This poses a number of problems. Firstly, it is not scalable to large data sets: neither the size of the data nor the process of collecting the video received by the user will scale. A further complication with PSNR is that the method compares each frame of the video individually. This requires the two videos to be aligned both spatially and temporally, meaning that every pixel of the transmitted video must be compared with the corresponding pixel of the received video. As the authors in [5] argue, the required temporal alignment is an especially strong restriction because of frame drops and repeats.

1.2.2 New metrics, e.g. bitrate

Florin Dobrian et al. [7] perform a comprehensive study of the impact of different quality metrics on viewer engagement. They study the impact of join time, buffering ratio, rate of buffering events, average bitrate and rendering quality; however, they decide not to include the rate of bitrate switching. These metrics are found to have a varying impact on the user engagement. It is found that the quality metrics affect both the viewing duration of each session and the total viewing time and number of views each week for unique users. These metrics rely on data being collected at both the client side and the server side. Other metrics that only rely on the network traffic could eliminate the need for any specific software to be installed on the client side, and this approach has been shown to yield good results [8]. However, the data needed to perform such a study is not available for this project.

1.3 Identifying confounding factors

A confounding factor is a factor that affects the viewing behavior of the user but is not related to the QoE; examples are the type of program and the hour of the day (HoD). Dobrian et al. [7] use a combination of different techniques to find the confounding factors:

• The correlation of features and the user watching behavior is used to find linear dependencies.

• The information gain that can be obtained by splitting on a feature with respect to the user behavior is good at finding factors that are not linearly dependent on the engagement. Given that the user watching behavior is split into classes, a split at a feature value that clearly separates the behavior classes results in a high information gain.

Another method many authors use is to initially produce correlation plots. These are scatter plots of one feature against the user watch behavior and are used to get a first idea of the type of relation between the feature and the user watch behavior.


1.4 Establishing correlation between quality features and the user engagement

The existence of correlation between quality features and user engagement has been shown [7][9]. However, the methodology used, e.g. plotting each feature or calculating the information gain, is cumbersome. Another possible approach to determine correlation between the quality features and the user engagement is to train two different models: the first model uses both confounding features and quality features, and the second model only uses confounding features. If the performance of the model using the quality features is significantly higher, it suggests that QoE is important for understanding user engagement. Some machine learning (ML) methods also rank the features used by the model during training, which gives another measure of the correlation between the quality features and the user engagement. If the quality features are ranked as the most important, it suggests that the quality is correlated with the engagement.

1.5 Predicting user engagement and perceived quality

1.5.1 Building the predictive model

Balachandran et al. [9] develop a 'road map' for building a model that predicts user engagement using a combination of quality metrics and confounding metrics. They present a data-driven approach using ML algorithms to build the predictive model. A decision tree is found to be the best model based on accuracy and the intuitive interpretation of the model. However, they only compare the accuracies of simple ML algorithms, and only before they account for the confounding factors. Using more advanced ML algorithms could potentially increase the accuracy but perhaps make the resulting model harder to understand. They also identify three different ways that confounding factors can affect the user engagement:

1. They can affect the observed engagement directly. They find that video on demand (VoD) users tend to watch a higher percentage of the total video duration.

2. They can affect the observed quality metric, meaning that it will indirectly have an effect on the user engagement. One example they find is that the live users are more tolerant toward join time.

3. The nature and the magnitude of the relationship between quality and user engagement can be affected. One example they find is that users of wireless providers are more tolerant toward buffering rates.

They argue that the model should be accurate, intuitive and actionable. That the model is actionable means that it should be able to guide the design of the video delivery mechanism.

However, for long term quality effects it could be better to have a less intuitive model with higher accuracy; how the model is designed should depend on the purpose of the model. Two different ways of using the confounding factors are suggested by Balachandran et al. [9]. They can be used as attributes when building the predictive models, or the data can be split and a separate model developed for each confounding factor. In the second approach the predictive model is the logical union of the individual models. It is shown that the second approach is more accurate for decision trees.

Shafiq et al. [8] use ML to predict mobile video user engagement. The data set used consists of radio network statistics and TCP/IP headers collected from a cellular network.

They use classification to predict if a user abandons the video before completion or not and they use regression to predict the watched percentage of the video. For both predictions they found that using a decision tree with bootstrap aggregation (bagging) outperformed other commonly used Bayes and linear regression algorithms. They report an accuracy of more than 87%. Both [8] and [9] note that one strength of decision trees is that they do not make any assumptions that the features are independent. Decision trees are also able to handle non-linearities by splitting each feature multiple times.

Other approaches that do not use ML to build the predictive model have also been used.

Casas et al. [10] predict the Mean Opinion Score (MOS) using eq. (1.1),

\mathrm{MOS}_i(n) = a_i \cdot e^{-b_i \cdot n} + c_i, \quad \forall i = 1, 2, 3, 4, 5    (1.1)

where the parameters a_i, b_i and c_i are determined by the ratio between the total buffering duration and the total duration, and the variable n is the number of stalling events. The model states that the higher the stalling ratio is, the less tolerant users are to stalling events. There are a few problems with this method. The first is that it only uses two quality metrics to predict the MOS; as [9] has shown, there are more quality metrics that affect the user QoE. Also, extending eq. (1.1) to incorporate more quality metrics is not trivial: it is known that the quality metrics have complex dependencies on each other and not all quality metrics have an exponential relationship with the user engagement [7]. Further, eq. (1.1) does not consider any of the confounding factors.

1.5.2 Evaluating the predictive model

Different approaches to evaluate the performance of the predictive models have been used.

For ML classification models accuracy is a natural choice; both [8] and [9] use 10-fold cross-validation. However, if the data set is imbalanced in the number of examples of each label, other metrics are more suitable, such as true positives, true negatives, false positives and false negatives. Casas et al. [10] conduct controlled experiments where they measure stalling events and participants report MOS values. This gives clear results but the technique is not scalable, and there are only 37 participants in the study. To validate a model at large scale another approach is needed.

1.5.3 Using the predictive model

Balachandran et al. [9] demonstrate the benefit of using a QoE model when choosing two control parameters (CDN and bit rate). Their simulations show that choosing control parameters using QoE models increases the expected user engagement compared to other techniques. They also discuss that a user engagement model using quality metrics could be useful for most players in the content delivery community.


1.6 Research question

The following research questions are investigated in this thesis work.

• How does the quality affect the viewing duration of users?

– What are the features that affect the user viewing duration?

– What methods are suitable to model the viewing behavior of the users?

– Is there a significant difference in the performance of the models that use the quality features and the models that do not use the quality features?

• How does the quality affect the return behavior of the users?

– What are the features that affect the return behavior of the users?

– Can a ML model predict if a user will return to the streaming service?

– Is there a significant difference in the performance of the models that use the quality features and the models that do not use the quality features?

• Can the developed ML models be used in order to study the viewing behavior of the users?

1.7 Outline

The remainder of the report is structured in the following way. In chapter 2 the theory and the methods used are presented. In chapter 3 the data set that is used for the analysis and the methods used to process the data are presented. In chapter 4 the results of the analysis are presented. In chapter 5 the results are discussed and evaluated. In chapter 6 the conclusions are stated and some potential future work is suggested.


Chapter 2

Theory and methods

In the following chapter the theory underlying the methods that are used in the analysis of the data is presented.

2.1 Statistical methods

These methods rely on traditional statistical approaches. By determining the correlation between features and the user behavior an initial understanding of the relationship between different features can be obtained.

2.1.1 Pearson correlation

The Pearson correlation is used to measure the linear correlation between two data sets. Given two data sets {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n} the sample Pearson correlation coefficient r is calculated using eq. (2.1),

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}    (2.1)

where \bar{x} and \bar{y} are the average values of the two data sets. The coefficient lies in the range r ∈ [−1, 1]. A correlation of ±1 indicates a perfect linear relationship between the two data sets, and a correlation of zero indicates that there is no linear correlation. A positive correlation means that if a value in the first data set is large, the corresponding value of the second data set tends to be large as well; a negative correlation indicates the opposite.
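As a small illustration (not taken from the thesis), the sample Pearson correlation of eq. (2.1) can be computed directly with scipy; the two arrays below are made-up stand-ins for, e.g., hourly bitrate and hourly watch duration:

import numpy as np
from scipy import stats

# Hypothetical example data: hourly average bitrate and hourly average watch duration.
x = np.array([1200.0, 1800.0, 2400.0, 3000.0, 3600.0])
y = np.array([900.0, 1400.0, 2100.0, 2600.0, 3300.0])

# stats.pearsonr returns the coefficient r of eq. (2.1) and a two-sided p-value.
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, two-sided p-value = {p_value:.3g}")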

2.1.2 Kendall tau correlation

The Kendall tau correlation is used to measure the ordinal association between two data sets {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n}. Two observations (x_i, y_i) and (x_j, y_j), where i ≠ j, are said to be concordant if x_i > x_j and y_i > y_j, or x_i < x_j and y_i < y_j, i.e. the ranks of the elements agree. The observations are said to be discordant if x_i > x_j and y_i < y_j, or x_i < x_j and y_i > y_j, i.e. the ranks of the elements do not agree. If y_i = y_j or x_i = x_j the observations are tied and are neither concordant nor discordant. The scipy package in Python implements a function to calculate Tau-b, which is defined by

\tau_B = \frac{n_c - n_d}{\sqrt{(n_c + n_d + n_t)(n_c + n_d + n_u)}}    (2.2)

where n_c is the number of concordant pairs, n_d is the number of discordant pairs, n_t is the number of ties only in x, and n_u is the number of ties only in y. Simultaneous ties in x and y are not counted toward n_t or n_u. Tau-b can be used to test the null hypothesis that the two data sets are independent. A test statistic Z_B that is approximately standard normally distributed can be constructed from the parameters of the Tau-b statistic. To test the hypothesis, the cumulative probability of a standard normal distribution at −|Z_B| is calculated; the two-sided p-value is twice this value. If this two-sided p-value is below a chosen significance level the null hypothesis is rejected. The value of the Kendall tau correlation is interpreted in the same way as the Pearson correlation.
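The function referred to above is scipy.stats.kendalltau. A minimal sketch of how it is typically used, including the two-sided p-value test of independence (the data values here are arbitrary):

import numpy as np
from scipy import stats

x = np.array([12, 2, 1, 12, 2])
y = np.array([1, 4, 7, 1, 0])

# kendalltau computes tau-b (eq. 2.2) and the two-sided p-value described above.
tau_b, p_value = stats.kendalltau(x, y)

alpha = 0.05
if p_value < alpha:
    print(f"tau-b = {tau_b:.3f}: the null hypothesis of independence is rejected")
else:
    print(f"tau-b = {tau_b:.3f}: independence cannot be rejected (p = {p_value:.2f})")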

2.1.3 Correlation plots

To get a first idea of how two data sets {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n} are correlated a scatter plot can be used. The plot can be used to find trends that are not linear. The first thing that is studied for all the tested features is a correlation plot of the feature and the engagement metric. Two drawbacks of this method are that each plot has to be inspected manually and that there is a risk that a human will see correlation where there is none. However, non-linear correlations that the Pearson or Kendall tau correlation are unable to find can be detected by inspecting the plots.

2.1.4 Welch spectrogram method

To find periodicities in a data set a spectrogram can be used. A spectrogram is a plot of the power spectral density (PSD), which is the distribution of power in the signal at each frequency component. There are many different methods that can be used to estimate the spectrogram of a signal [11]. Since spectral estimation is not the main goal of this work, a simple and reasonably robust method is preferred; the Welch spectrogram method using a Hanning window is found to be both easy to use and reasonably robust and is therefore used.
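A hedged sketch of this kind of estimate, using scipy.signal.welch with a Hann window on a synthetic hourly series that has a 24-hour period (the data and parameters are illustrative assumptions, not the thesis data):

import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
hours = np.arange(24 * 60)                        # 60 days of hourly samples
awd = 1800 + 600 * np.sin(2 * np.pi * hours / 24) + 100 * rng.standard_normal(hours.size)

# fs = 1 sample per hour, so the frequency axis is in cycles per hour.
freqs, psd = signal.welch(awd, fs=1.0, window="hann", nperseg=256)

dominant = freqs[np.argmax(psd[1:]) + 1]          # skip the zero-frequency bin
print(f"Dominant period: {1.0 / dominant:.1f} hours")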

2.2 Machine learning techniques

Machine learning (ML) will be used in order to build predictive models of the average watch duration (AWD) and the probability of return of a user. The AWD model will predict what the AWD will be in the next time bin given data up to the current time. The other model will predict the probability that a user will start another session within a specified time period. In both models care will be taken to reduce the risk of information leakage, discussed in section 2.2.3. In general, ML models are good at making accurate predictions but the models can be hard to understand and interpret intuitively. This is especially true for ensemble methods, meaning methods that use the union of several predictors in order to predict the result. Two different types of ML problems are encountered in this work. The first is predicting a continuous variable, which is called a regression problem. The second is predicting a class, e.g. predicting if the user engagement is high, medium or low; this is called a classification problem. The techniques used for the two cases are similar but there are a few differences, the biggest being how the performance of the models is measured.

2.2.1 Decision tree algorithm

A decision tree (DT) is a very common ML model; both classification and regression decision trees are used. A DT model is constructed top to bottom. Starting with only the root of the tree, each node is split recursively according to a predefined rule which depends on the type of tree. Each node is split on one feature. An example of this could be splitting on bitrate, so that each sample with a bitrate higher than 1000 kbit/s is put in one leaf and samples with a bitrate lower than 1000 kbit/s are put in the other leaf. This means that a DT is a binary tree, which is very fast to search and therefore good at classifying a large number of samples. When a classification tree is built the nodes are split in order to minimize the sum of the Gini impurity, defined in eq. (2.3), of the resulting leaves. When building a regression tree the nodes are split to minimize the sum of the MSE of the resulting leaves. Each node is recursively split until each leaf only contains samples belonging to the same class or another predefined condition is met; other conditions include a limit on the depth of the tree and the minimum size of a leaf. The class of a new sample can then be predicted by following the rules of the constructed tree and assigning the new sample the class of the leaf (when performing classification) or the average value of the leaf (when performing regression).

I_G(f) = \sum_{i=1}^{m} f_i (1 - f_i)    (2.3)

In eq. (2.3), f_i is the fraction of samples in the node belonging to class i and m is the number of unique classes in the data set. A pure leaf will have a Gini impurity of zero.
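For concreteness, a stand-alone Python function implementing eq. (2.3) could look as follows (a sketch, not the thesis code):

from collections import Counter

def gini_impurity(labels):
    # Gini impurity of one node: sum over classes of f_i * (1 - f_i), eq. (2.3).
    n = len(labels)
    if n == 0:
        return 0.0
    fractions = [count / n for count in Counter(labels).values()]
    return sum(f * (1.0 - f) for f in fractions)

print(gini_impurity(["high", "high", "low", "low"]))  # 0.5, maximally mixed two-class node
print(gini_impurity(["high", "high", "high"]))        # 0.0, pure leaf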

2.2.2 Random forest algorithm

The random forest (RF) algorithm is an ensemble algorithm that fits many DTs to a data set, where each DT is fit to a subset of the data. The whole data set is split into subsets by randomly selecting a number of samples with replacement from the whole data set; the subsets of the different DTs can overlap. The individual DTs are then fit to their subset of the data in the same way as described previously, except that in each split only a random subset of the features is considered. To increase the randomness of the trees further, the considered split points of the features are random, so each node is split by choosing the best of the random split points. In order to predict the class (or value) of a new sample, the majority vote (when performing classification) or the average value (when performing regression) is the prediction of the model [12]. The theory is that a large number of DTs using randomly selected subsets of the data will on average predict the right answer. Given a sample, most DTs will probably not be able to accurately predict the class given the subset of the data they are using. But unless the DTs are correlated, the guesses of the bad trees should be random and cancel out. The few DTs that are able to accurately predict the sample class given their subsets of the data will all predict the correct answer, and therefore the majority vote will give the correct answer. The idea is explained in the following example. Assume a RF with ten DTs is trained to classify samples as true or false. Given a true sample, say that eight trees are bad and two trees are good. The bad trees are guessing and are therefore expected to give four votes to true and four votes to false; the good trees will both predict correctly and therefore vote true. The majority vote is correct even though eight out of ten DTs guessed the answer. By having the DTs train on different subsets of the data they are expected to be good at predicting different types of samples. The idea is that given any sample there are enough good DTs able to determine the class of the sample. It is important that the different DTs are not correlated; if they are, the guesses of the bad DTs are not random and will not cancel out. This is also the reason that the features should be uncorrelated, since otherwise DTs using different but correlated features will give correlated predictions. To further randomize the trees the split condition is changed: instead of using the feature that gives the lowest Gini impurity in each split, a random feature is used to split the node.
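As an illustration of the kind of random forest regressor used later for the AWD prediction, the following scikit-learn sketch fits a model to synthetic data with hypothetical feature names; it is not the thesis implementation:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.integers(0, 24, n),       # HoD
    rng.integers(0, 7, n),        # DoW
    rng.uniform(500, 4000, n),    # average bitrate [kbit/s]
    rng.poisson(1.0, n),          # number of buffering events
])
# Synthetic AWD in seconds, loosely tied to bitrate and buffering.
y = 1200 + 0.4 * X[:, 2] - 150 * X[:, 3] + 50 * rng.standard_normal(n)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:350], y[:350])                       # first 70% as training data
print("R^2 on the held-out 30%:", model.score(X[350:], y[350:]))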

2.2.3 Information leakage

Care has to be taken when constructing the training and testing data set in order to reduce the risk of information leakage. Information leakage is when the data used to train the model also contains the data that is to be predicted. When the AWD model is constructed the data is split according to time, this is to remove the risk of leaking information from the future into the past. This is also the reason that cross validation is not an option for this model. The second model splits the training and test data according to users, this will reduce the risk of leaking the data from the test set into the training set.

2.2.4 Coefficient of determination

To measure the performance of the regression model the coefficient of determination (R^2) is used. The range of R^2 is (−∞, 1], where a value of one means a perfect reconstruction, zero means that the predictor performs no better than always predicting the mean value of the signal, and any negative number is worse than predicting the mean value. R^2 is calculated as

R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}    (2.4)

where y_i is the value that is to be predicted, \bar{y} is the mean value of the predicted feature and f_i is the predicted value.


2.2.5 MSE and RMSE

The mean square error (MSE) is used to compute the difference between the predicted signal and the true signal. The MSE is defined in eq. (2.5),

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (f_i - y_i)^2    (2.5)

where f_i is the predicted value and y_i is the real value. The root mean square error (RMSE) is also used to determine the goodness of the regression model. It is the square root of the MSE and is calculated according to eq. (2.6),

\mathrm{RMSE} = \sqrt{\mathrm{MSE}}.    (2.6)
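A minimal numpy sketch of eqs. (2.4)-(2.6) with arbitrary example values:

import numpy as np

y = np.array([3.0, 5.0, 2.5, 7.0])   # true values
f = np.array([2.8, 5.3, 3.0, 6.5])   # predicted values

mse = np.mean((f - y) ** 2)                                            # eq. (2.5)
rmse = np.sqrt(mse)                                                    # eq. (2.6)
r2 = 1.0 - np.sum((y - f) ** 2) / np.sum((y - np.mean(y)) ** 2)        # eq. (2.4)

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, R^2 = {r2:.3f}")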

2.2.6 Accuracy

The accuracy of a classification model is the fraction of test samples that the model is able to correctly classify. The accuracy can be used to evaluate a classifying model; however, it is not a very reliable measure if the data is not evenly distributed over the classes. If the data contains more samples of one class, accuracy will likely give an overly optimistic evaluation of the model. As an example, imagine a binary classification problem where 90% of the samples are positive and 10% are negative. By simply predicting that all samples are positive the model will have an accuracy of 90%. One method to reduce this problem is to sample the data set so that the set used for building the model has an approximately equal distribution among all the classes.

2.2.7 TPR, FPR, TNR, FNR

The performance of a binary classifier will be evaluated using four statistical measures. The true positive rate (TPR) is the fraction of positive samples that are accurately classified as positive. The false positive rate (FPR) is the fraction of negative samples that are wrongly classified as positive. The true negative rate (TNR) is the fraction of negative samples that are accurately classified as negative. The false negative rate (FNR) is the fraction of positive samples that are wrongly classified as negative. These measures can be used when the two classes are not evenly distributed in the data set. The model of section 2.2.6 would have a TNR of zero and a FPR of one and would therefore be correctly identified as a bad model.
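A short sketch of how the four rates can be obtained from a confusion matrix with scikit-learn, using the always-positive toy model from section 2.2.6:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])   # imbalanced: 90% positive samples
y_pred = np.ones_like(y_true)                        # model that always predicts positive

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
tpr = tp / (tp + fn)   # true positive rate
fnr = fn / (tp + fn)   # false negative rate
tnr = tn / (tn + fp)   # true negative rate
fpr = fp / (tn + fp)   # false positive rate

# Accuracy is 90%, but TNR = 0 and FPR = 1 reveal the bad model.
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, TNR={tnr:.2f}, FNR={fnr:.2f}")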

2.2.8 ROC and AUC

The accuracy of a classifier performing binary classification (predicting either negative or positive, 0 or 1, etc.) can give very misleading results when the data is heavily biased toward one of the classes. Therefore the receiver operating characteristic (ROC) and the area under the curve (AUC) can be used. The ROC curve gives information on how the classifier is performing based on both the TPR and the FPR. Instead of having a model that predicts positive or negative it is more common to have the model predict the probability that the sample is positive; in a random forest this is simply the number of trees that vote positive divided by the total number of trees. The model thus predicts a number x such that x ∈ [0, 1]. Given a threshold t such that t ∈ [0, 1], a sample is predicted to be positive if x ≥ t. The intuitive way of choosing t is t = 0.5, however this does not always give the best model. The ROC curve is produced by letting t vary between zero and one and plotting the TPR vs the FPR. When t is zero both the TPR and the FPR are one, and when t is one both the TPR and the FPR are zero. The AUC is the area under the ROC curve. A perfect classifier will have an AUC value of 1.0 and a random classifier will have an AUC value of 0.5. An example of a ROC curve is shown in Fig. 2.1.

Figure 2.1: An example ROC curve. An AUC close to 1 indicates a good model; a random guess has an AUC of 0.5.
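A hedged sketch of computing the ROC curve and the AUC from predicted probabilities with scikit-learn (toy labels and scores):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.50, 0.90, 0.65])

# roc_curve sweeps the threshold t and returns the FPR and TPR at each value.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(f"AUC = {auc:.2f}")    # 1.0 = perfect classifier, 0.5 = random guessing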

2.2.9 Feature importance

To determine the most important features of the RF model the Gini importance is used.

To calculate the Gini importance of one feature, the combined reduction of the Gini impurity over all the splits in which the feature is used is computed. Since the Gini impurity decreases in each split, this sum is negative, and its absolute value is used to rank the features: the feature that reduces the Gini impurity the most over all of its splits is given the highest rank. The Gini importance is normalized so that the sum of all the Gini importances is one, and each used feature is thus given a value between zero and one. This is a relative ranking and only shows which of the used features are the most important.
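As an illustration, scikit-learn exposes this normalized Gini importance through the feature_importances_ attribute of a fitted random forest; the snippet below uses synthetic data and hypothetical feature names:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
feature_names = ["HoD", "DoW", "bitrate", "buffering_rate"]
X = rng.normal(size=(400, len(feature_names)))
y = (X[:, 2] - 2.0 * X[:, 3] + 0.3 * rng.standard_normal(400) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The importances are normalized to sum to one; higher means more important.
for name, importance in sorted(zip(feature_names, model.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name:15s} {importance:.3f}")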

2.3 Time series analysis

To further understand the data a time series analysis is performed. These methods are based on traditional signal processing. The R libraries forecast, astsa, hydroGOF, and tseries are used to perform the time series analysis.


2.3.1 Weak-sense stationarity

A random process x_t is said to be weak-sense stationary if the mean value and the autocovariance do not change with respect to time. This means that the mean value is a constant and that the autocovariance only depends on the time difference between the samples. In the following text, when the term stationary is used, weak-sense stationarity is implied.

2.3.2 SARIMA

A multiplicative seasonal autoregressive integrated moving average (SARIMA) model is used to capture the periodicity and seasonality in the data. A general SARIMA model is described by eq. (2.7),

\Phi_P(B^s)\,\phi(B)\,\nabla_s^D \nabla^d x_t = \Theta_Q(B^s)\,\theta(B)\,w_t    (2.7)

where x_t is stationary with mean zero, w_t is a Gaussian white noise series with mean zero and variance \sigma_w^2, and B is the backshift operator defined in eq. (2.8),

B x_t = x_{t-1}, \qquad B^k x_t = x_{t-k}    (2.8)

\nabla^d is the ordinary difference defined in eq. (2.9),

\nabla^d = (1 - B)^d    (2.9)

\nabla_s^D is the seasonal difference defined in eq. (2.10),

\nabla_s^D = (1 - B^s)^D    (2.10)

and \Phi_P(B^s), \phi(B), \Theta_Q(B^s), \theta(B) are the seasonal and ordinary autoregressive and moving average operators defined in eqs. (2.11 - 2.14),

\Phi_P(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \dots - \Phi_P B^{Ps}    (2.11)
\Theta_Q(B^s) = 1 + \Theta_1 B^s + \Theta_2 B^{2s} + \dots + \Theta_Q B^{Qs}    (2.12)
\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p    (2.13)
\theta(B) = 1 + \theta_1 B + \theta_2 B^2 + \dots + \theta_q B^q    (2.14)

where \Phi_1, ..., \Phi_P, \Theta_1, ..., \Theta_Q, \phi_1, ..., \phi_p, \theta_1, ..., \theta_q are constants and \Phi_P, \Theta_Q, \phi_p, \theta_q are all non-zero. The model is denoted ARIMA(p, d, q) × (P, D, Q)_s, where p and q are the orders of the ordinary autoregressive and moving average polynomials, eqs. (2.13) and (2.14), P and Q are the orders of the seasonal autoregressive and moving average polynomials, eqs. (2.11) and (2.12), d and D are the orders of the ordinary and seasonal difference, and s is the season of the model. The model is multiplicative, meaning that all of the components of the model are multiplied algebraically.


2.3.3 Fitting the SARIMA model

The SARIMA model is described by seven model orders that are fitted manually: the orders of eqs. (2.8 - 2.14) and the length of the season s. In order to fit these parameters to the data the sample autocorrelation function (ACF) and the sample partial autocorrelation function (PACF) are used. The ACF is defined by eq. (2.15),

\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}    (2.15)

where \hat{\gamma}(h) is defined as in eq. (2.16),

\hat{\gamma}(h) = n^{-1} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x}).    (2.16)

The PACF, \phi_{hh} for h = 1, 2, ..., of a stationary process x_t is defined by eq. (2.17),

\phi_{11} = \mathrm{corr}(x_{t+1}, x_t) \quad \text{and} \quad \phi_{hh} = \mathrm{corr}(x_{t+h} - \hat{x}_{t+h},\, x_t - \hat{x}_t), \; h \geq 2    (2.17)

where (x_{t+h} - \hat{x}_{t+h}) and (x_t - \hat{x}_t) are uncorrelated with {x_{t+1}, ..., x_{t+h-1}}, \hat{x}_t is the regression of x_t on {x_{t+1}, ..., x_{t+h-1}}, and \hat{x}_{t+h} is the regression of x_{t+h} on {x_{t+h-1}, ..., x_{t+1}}. The properties of the ACF and the PACF of different processes are shown in Tables 2.1 and 2.2 [13].

The general steps of fitting a SARIMA model described by Shumway and Stoffer [13] are the following. First the orders of the ordinary and seasonal difference are determined to obtain a roughly stationary time series. Then a set of simple seasonal and ordinary ARMA processes are fit to the resulting residuals; the seasonal components are determined first. Peaks in the ACF and/or the PACF can be eliminated by fitting a model of the appropriate order according to Table 2.1. When all significant peaks in the ACF and PACF are eliminated the model is determined. The ACF and PACF of white noise are zero at all lags except at lag zero. Therefore, if all peaks of the ACF and the PACF of the residuals are eliminated, the only difference between the model and the real time series is white noise. True white noise is impossible to predict and the model is therefore considered to capture the behavior of the time series.

2.3.4 Forecasting using the SARIMA model

When the SARIMA model has been fitted to the data it can be used to forecast future values of the time series. The white noise process w_t of the model is not known and therefore the errors of previous predictions are used as the driving noise for the MA part of the process.

The one-step ahead prediction can be used as a feature in ML algorithms.
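The thesis performs the fitting and forecasting in R (forecast, astsa, hydroGOF, tseries). As a hedged Python sketch of the same idea, statsmodels' SARIMAX class can fit the ARIMA(1, 1, 4) × (0, 1, 1)_24 order reported in section 4.3 to a synthetic hourly series and produce the one-step-ahead forecast used as an ML feature:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
hours = np.arange(24 * 30)                       # 30 days of hourly samples
awd = 1800 + 600 * np.sin(2 * np.pi * hours / 24) + 100 * rng.standard_normal(hours.size)

# order = (p, d, q), seasonal_order = (P, D, Q, s)
model = sm.tsa.SARIMAX(awd, order=(1, 1, 4), seasonal_order=(0, 1, 1, 24))
fit = model.fit(disp=False)

one_step_ahead = fit.forecast(steps=1)[0]        # prediction for the next hour
print(f"One-step-ahead AWD forecast: {one_step_ahead:.0f} s")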


Table 2.1: Properties of the ACF for AR(p), MA(q), ARMA(p, q), AR(P)_s, MA(Q)_s, and ARMA(P, Q)_s processes

Process        Behavior of the ACF
AR(p)          Long tail that decreases exponentially
MA(q)          Cuts off after q lags
ARMA(p, q)     Long tail that decreases exponentially
AR(P)_s        Tail at lags ks, k = 1, 2, ..., that decreases exponentially
MA(Q)_s        Cuts off after Qs lags
ARMA(P, Q)_s   Tail at lags ks, k = 1, 2, ..., that decreases exponentially

Table 2.2: Properties of the PACF for AR(p), MA(q), ARMA(p, q), AR(P)_s, MA(Q)_s, and ARMA(P, Q)_s processes

Process        Behavior of the PACF
AR(p)          Cuts off after p lags
MA(q)          Long tail that decreases exponentially
ARMA(p, q)     Long tail that decreases exponentially
AR(P)_s        Cuts off after Ps lags
MA(Q)_s        Tail at lags ks, k = 1, 2, ..., that decreases exponentially
ARMA(P, Q)_s   Tail at lags ks, k = 1, 2, ..., that decreases exponentially

2.4 Practical considerations concerning the methods used

Different methods are used to investigate the relationship between the quality of the video stream and the user engagement.

2.4.1 Choice of engagement metric

Other studies have used the watched percentage of the video as the engagement metric; this data is however not available for this study, as the nominal video duration is missing in the data. Also, for live stream data the watched percentage is not necessarily a good metric: users can enter a video at any time, so the percentage watched can give misleading results. The engagement metrics used are the total watch duration and the return probability of the user.

2.4.2 Choice of QoE metrics

The choice of QoE metrics is based on their correlation with the engagement feature. The correlation plots of the QoE metrics and the engagement metric are used to determine if a QoE metric is suitable to use. The correlation between the QoE metrics themselves is also used: if two QoE metrics are highly correlated only one of them is used when building the model. This is important when using the random forest algorithm, since using both features could cause trees to make correlated predictions, resulting in a lower accuracy of the model.


2.4.3 Splitting data into training and testing set

When constructing the AWD model the data is split according to date. The first 70% of the time series is used to build the model and the last 30% is used to test the model; the choice of the split point is also varied in order to test the effect on the accuracy of the model. When predicting if a user will have another session within a period of time from the last session the data is instead split on users: 70% of the users are used to train the model and 30% are used to test the model.
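A hedged pandas sketch of the two splitting strategies, with hypothetical column names ('hour' and 'user_id'):

import numpy as np
import pandas as pd

def split_by_time(df, time_col="hour", train_fraction=0.7):
    # Chronological split for the AWD model: no future data leaks into the training set.
    df = df.sort_values(time_col)
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]

def split_by_user(df, user_col="user_id", train_fraction=0.7, seed=0):
    # User-based split for the return model: each user appears in only one of the sets.
    users = df[user_col].unique()
    rng = np.random.default_rng(seed)
    train_users = set(rng.choice(users, size=int(len(users) * train_fraction), replace=False))
    in_train = df[user_col].isin(train_users)
    return df[in_train], df[~in_train]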

2.4.4 Choice of baseline

The ML models will be compared with a baseline. A baseline is needed in order to determine the relative goodness of an ML model: even if the ML model seems good according to the other measures, it might just be because the problem is easy. The choice of the baseline is important; if the baseline model is very smart it will be hard to improve on it using ML. The simplest baseline is a model that predicts a random outcome, and a model should be significantly better than the random baseline in order to have any value. When predicting the AWD the baseline will be an ML model that does not use quality features. Two models are constructed. The first model only uses features that are not related to the quality, such as hour of the day (HoD), day of the week (DoW), day of the month (DoM) and the watch duration of the previous time step. This model is used as a baseline to compare with a model that also uses the QoE features. The accuracy of the different models can be used to understand the effect of quality metrics on the user behavior. If no significant difference in accuracy can be found between the two models it is unlikely that quality metrics have an effect on user engagement; if a significant difference in accuracy is found, the quality metrics are important for determining the user engagement. The difference in accuracy is also used to find important features. One advantage of using an ML model for this test is that it is easier than using the more cumbersome methods of correlation matrices and correlation plots. The baseline for the model that predicts if a user returns will also be an ML model that does not use the quality features.
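To make the comparison concrete, the following sketch (synthetic data and hypothetical column names, not the thesis implementation) trains the baseline and the quality-aware model and compares their RMSE on a held-out set:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "HoD": rng.integers(0, 24, 1000),
    "prev_awd": rng.uniform(0, 3600, 1000),
    "bitrate": rng.uniform(500, 4000, 1000),
    "buffering_rate": rng.exponential(0.01, 1000),
})
df["awd"] = 0.8 * df["prev_awd"] + 0.1 * df["bitrate"] - 5000 * df["buffering_rate"]

train, test = df.iloc[:700], df.iloc[700:]        # chronological 70/30 split assumed

def rmse_of(features):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train[features], train["awd"])
    pred = model.predict(test[features])
    return np.sqrt(mean_squared_error(test["awd"], pred))

print("baseline RMSE (non-quality only):", rmse_of(["HoD", "prev_awd"]))
print("RMSE with QoE features:          ", rmse_of(["HoD", "prev_awd", "bitrate", "buffering_rate"]))

A clearly lower RMSE for the quality-aware model would point to the quality features carrying information about the engagement.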

2.4.5 Developing a time series model

A time series model is developed in order to capture the periodic behavior of the data sets. The method relies on manually fitting the model orders by studying the autocorrelation and partial autocorrelation of the data sets; the exact procedure for choosing the model orders is described in section 2.3.3. The time series model will be used both as a feature in the ML model and as a baseline, in order to compare the ML models with the more traditional time series analysis.

2.4.6 Testing the effect of artificially altering the quality

The effect of quality issues on the engagement is tested by using the designed predictive models. This is simulated by predicting the user engagement of the next time step using the data of the current time step. Given the user engagement, quality metrics and non-quality metrics of the current day, the user engagement of tomorrow is predicted. This prediction can be compared with predictions using the same model but with higher or lower quality metrics. If the model is accurate, the difference between the three predictions can be used to study the effect of quality issues on the user engagement.
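A minimal sketch of such a simulation, assuming a fitted regressor and a known feature ordering (both hypothetical here), is to scale the bitrate entry of today's feature vector before re-predicting:

import numpy as np

def predict_with_scaled_bitrate(model, today_features, feature_names, factor):
    # Re-predict the next time step after scaling the bitrate feature by `factor`.
    modified = np.array(today_features, dtype=float)
    modified[feature_names.index("bitrate")] *= factor
    return model.predict(modified.reshape(1, -1))[0]

# Assuming `awd_model` is the fitted regressor, `x_today` today's feature vector and
# `names` the corresponding feature names:
# base   = predict_with_scaled_bitrate(awd_model, x_today, names, 1.0)
# worse  = predict_with_scaled_bitrate(awd_model, x_today, names, 0.5)
# better = predict_with_scaled_bitrate(awd_model, x_today, names, 1.5)
# Comparing the three predictions indicates the modelled sensitivity to quality.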

2.4.7 Quality effect on user return rate

The effect of quality on the rate at which users start new sessions is investigated by again training two different models. The models are trained to predict if a user will have another session within a specified time gap: one model uses both the quality features and the non-quality features, and the other model only uses the non-quality features. The difference between the two models is used to understand the importance of the quality features; if the model using the quality features is significantly better, it indicates that the quality is important. The effect of the length of the chosen time gap is also tested, since it is possible that the effect of the quality features is not constant over time. By comparing the difference between the two models, and the overall performance of the two models, when the time gap is changed, this effect is studied. The relative importance of each feature is used to determine which features are the most important when predicting if a user will have another session within a specified time gap.


Chapter 3 Data set

The data is collected from two media broadcast providers: one is a provider of live content and the other is a provider of VoD content. The live content data is collected over four months, January to April 2016, and the VoD data is collected over one month, March 2016. The data sets are analyzed separately. Live content is the same as watching normal television; the user is only able to choose which channel to watch. VoD streaming means that the users are able to watch any program and pause and resume at any time; this is currently the most popular way of delivering media. One of the main reasons to separate these two user groups is that they are served by different content providers and the behavior of the two user groups is believed to be different. The data is collected from Apple TVs, desktops and iPads. The live channel analysis is focused on the data from the iPad users, the reason being that iPads are believed to be more subject to streaming problems as they are connected via wireless networks, while Apple TVs and desktops are often connected via fixed networks such as fiber. There is therefore a higher chance that iPad users experience a broader range of quality issues. However, only data from desktop users is available for the VoD analysis and is therefore used. The initial data collection and processing is done using Apache Spark.

3.1 The raw data

The data is reported in an event based fashion. When a user starts to watch a video a session is created, and this session is given a session ID. While a user is watching a video there are a number of events that can be triggered, for example a bitrate switch event. The exact information contained in an event depends on the type of event, but all events have a time stamp, the session ID, the user ID and the type of event. In total an event has 30 data fields containing information, although most of these are not used for this analysis. The data is stored in semicolon separated files. An anonymised example of a started event from the live channel data set is shown below. A complete list of the events that are of interest and the information they contain is shown in Table 3.1. A typical live-streaming session is shown in Fig. 3.1.

Figure 3.1: A typical live-streaming session (created, started, buffering started/stopped, bit rate switched and stopped events over time). The yellow arrow represents the session duration, the sum of the green arrows is the watch duration, the blue arrow is the buffering duration and the red arrow is the initial buffering duration, also called join time.

Table 3.1: Events used for the analysis and the information contained in the events

Event type             Contained information
Bitrate switched       Bitrate value and time of the bitrate switch
Created                Time when the session was created
Started                Time when the first video frame is displayed
Buffering started      Time when buffering started
Buffering stopped      Time when buffering stopped
Paused                 Time when the video was paused
Play                   Time when the video was resumed
Stopped                Time when the session stopped
Connectivity changed   Time of a connectivity change

u'analytic;2015-10-01T00:01:33.445+0000;XXX;XXX;2015-09-30T23:45:21.950+0000;started;;612491;037E2028-0241-4614-A38D-0A9DE19DE613;2219145__EDRM;XXX;9048EEB4-B67C-44D1-B71F-BB0E1D4D8A1E;0;;;;0;;IMC_MODE_LIVE;iPad3GSM;Apple;iOS;9.0.1;5279344671775457280;104.45.16.199;;;;;

3.2 Cleaning the data

The raw data contains many different types of noise and errors. There are sessions that are not completely reported, timestamps that are wrong, shared user IDs, etc. The data is cleaned so that it can be used in the analysis. The first step is to group all events by their session IDs. It is found that some of the sessions contain duplicate events, meaning that the same event has been reported more than once; these duplicates are removed. A session must contain enough information to be used in the analysis. The bare minimum is a created event, a started event and a stopped event. Only live channel sessions that contain exactly one of each of these after duplicates have been removed are considered for analysis. The VoD sessions have almost the same requirements, except that they must have at least one started event. The difference is that the VoD sessions are reported slightly differently compared to the live channel events; the started event has more than one meaning. Sessions that fulfill these requirements have at least been created and stopped correctly and have started to play. A further requirement put on a session is that it is at least five seconds long. This requirement removes sessions that are so short that they are unlikely to be interesting for this analysis.

Table 3.2: Non quality features and quality features used in the analysis

Non quality features: HoD; DoW; DoM; number of scrubbing events; expected HoD watch duration; previous AWD; deviation from expected HoD watch duration.

QoE features: bitrate; buffering rate; number of buffering events; number of connectivity changed events; initial buffering time (join time); number of bit rate switches; number of up bit rate switches; number of down bit rate switches.

3.3 Extracting features from the data

Since the sessions are reported in an event based fashion, the QoE features that are needed to perform the analysis are engineered from the data. A list of the non-quality and quality features that are used is shown in Table 3.2. The session duration is given by the difference between the created and the stopped event, and the watch duration is defined as the time during which the video is playing. The watch duration is obtained by subtracting the time the video was buffering and the time the video was paused from the session duration. The buffering time is the total time that the session was buffering. The average bitrate is defined as the bitrate over the watched duration; time spent buffering or paused does not affect the average bitrate. The initial buffering time is defined as the time from when the session is created until the video starts to play, i.e. the difference between the created time stamp and the started time stamp. A scrubbing event is when the user jumps (scrubs) from one part of the video to another; naturally this is very rare in the live channel data set. A connectivity change means that the player has changed the type of connection, e.g. from 3G to wireless, which may cause some initial problems before the bitrate is switched according to the new connection. One thing to note is that a session can be paused and buffering at the same time, which means that the sum of the buffering duration, paused duration, initial buffering duration and watched duration is not necessarily equal to the session duration. The features are also divided by the session duration to compensate for the effect that long sessions are more likely to have a higher number of quality problems.
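As a rough illustration of this feature engineering, the following pandas sketch derives a few of the listed features from the events of a single session; the column names are hypothetical, and paused time and overlapping intervals are ignored for brevity:

import pandas as pd

def session_features(events: pd.DataFrame) -> dict:
    # `events` holds one session with columns 'event_type' and 'timestamp' (datetime64),
    # containing exactly one created/started/stopped event (as enforced by the cleaning step)
    # and matching pairs of buffering_started/buffering_stopped events.
    ts = events.set_index("event_type")["timestamp"]
    session_duration = (ts["stopped"] - ts["created"]).total_seconds()
    join_time = (ts["started"] - ts["created"]).total_seconds()
    starts = events.loc[events["event_type"] == "buffering_started", "timestamp"].to_numpy()
    stops = events.loc[events["event_type"] == "buffering_stopped", "timestamp"].to_numpy()
    buffering_duration = pd.to_timedelta(stops - starts).total_seconds().sum()
    return {
        "session_duration": session_duration,
        "join_time": join_time,
        "buffering_duration": buffering_duration,
        "watch_duration": session_duration - join_time - buffering_duration,
    }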


3.4 Per user data analysis

To perform the per user analysis all the sessions are grouped per user. To ensure that the live channel users are regular users, each user must have at least 10 sessions in total over a period of at least one month. The reason for choosing one month is that there is a free trial that ends after one month; this requirement ensures that the user is a paying customer. However, only one month of the VoD data set has been analyzed and therefore the one-month requirement is dropped and the minimum number of sessions per user is set to five.

3.5 Per hour data analysis

The data is binned into hours, where each hour contains the average value of the non-quality features and the quality features for all the sessions that were active during that hour. This data is used to predict the average behavior of all users. First the data is binned into hours for each user. When this is done, unreasonable users are removed. Three unreasonable behaviors are defined. The first is when a user has an accumulated session duration during one hour that is longer than one hour; this could be caused by several users having the same user ID (there are a few IDs that are probably test IDs, such as the ID 'foo'). The second is when a user has a longer watch duration than session duration during one hour, which could be caused by errors in the timestamps reported by the events. The third is when the initial buffering is longer than the accumulated session duration, which again could be caused by errors in the time stamps. If a user is found to exhibit any unreasonable behavior, the user is removed from the data set.
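A hedged pandas sketch of the hourly binning and the removal of unreasonable users (hypothetical column names; sessions are assigned to the hour in which they started, which simplifies the procedure described above):

import pandas as pd

def hourly_bins(sessions: pd.DataFrame) -> pd.DataFrame:
    # `sessions` has columns 'user_id', 'start_time' (datetime64) and, in seconds,
    # 'session_duration', 'watch_duration' and 'join_time'.
    sessions = sessions.assign(hour=sessions["start_time"].dt.floor("h"))
    per_hour = (sessions
                .groupby(["user_id", "hour"])[["session_duration", "watch_duration", "join_time"]]
                .sum()
                .reset_index())
    reasonable = (
        (per_hour["session_duration"] <= 3600)                              # rule 1
        & (per_hour["watch_duration"] <= per_hour["session_duration"])      # rule 2
        & (per_hour["join_time"] <= per_hour["session_duration"])           # rule 3
    )
    bad_users = per_hour.loc[~reasonable, "user_id"].unique()
    return per_hour[~per_hour["user_id"].isin(bad_users)]                   # drop those users entirely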


Chapter 4 Results

The results of the different tests are presented in this chapter.

4.1 Data

Table 4.1 shows the basic information about the used data sets. The live channel data is taken from a four month period between January and April 2016, and the VoD data is taken from March 2016. The live channel data is taken from iPad users and the VoD data is taken from desktop users.

Table 4.1: Basic statistics about the used data sets. The live channel data spans four months and the VoD data spans one month. The live channel data is taken from iPad users and the VoD data is taken from all available devices.

Data set       Number of users   Number of sessions   Number of events   Number of days
Live channel   21 065            827 015              38 677 391         121
VoD            11 732            274 480              19 494 157         31

The AWD for each HoD for the live channel data and the VoD data is shown in Figure 4.1. Live channel users have an overall higher AWD. Also, the AWD of the live channel users varies over the day while the VoD users have an almost constant AWD. The PSD of the AWD per user per hour for both the live channel data and the VoD data is shown in Figure 4.2. The live channel data has clear peaks; this is not the case for the VoD data. The most prominent peaks in the live channel data correspond to periods of approximately one week, 24 h, 12 h, 8 h and 6 h. The distribution of the AWD per user per hour for both live channel users and VoD users is shown in Figure 4.3. Both data sets show a peak at the bin centered at 3600 s, which means that these users watched the entire hour without any buffering or pause events.

(a) The AWD per user per hour for each HoD, with one std, of the live channel users. An increased AWD is seen around 07:00 and 20:00.

(b) The AWD per user per hour for each HoD, with one std, of the VoD users. The AWD is higher at night but with an increased variance.

Figure 4.1: The AWD per user per hour for each HoD of the two data sets.

(a) The PSD of the AWD per user per hour time series of the live channel data. Clear peaks can be seen at specific frequencies.

(b) The PSD of the AWD per user per hour time series of the VoD data. No clear peaks can be seen.

Figure 4.2: The PSD of the two time series.

(a) The watch duration per user per hour of the live channel users.

(b) The watch duration per user per hour of the VoD users.

Figure 4.3: The watch duration per user per hour of the two data sets.

4.2 Correlation metrics

The Pearson correlation and Kendall tau correlation of the features and the AWD of the live channel users and the VoD users are shown in Table 4.2. The p-value is the likelihood that two uncorrelated data sets would give the same Kendall tau value. The Tau-B and the Pearson correlation both indicate the same sign of the correlation for most features; however, the Pearson correlation indicates a stronger correlation. Most features are found to be correlated with the AWD by both methods. The Pearson correlation values of the features of the live channel users and the VoD users are shown in Table A.1 and Table A.2 in appendix A. Only the correlations of the features that are used in the final version of the ML models are shown in the table. Other features were tested but removed because they have a high correlation with other features.

Table 4.2: Pearson correlation coefficient (P. corr), Kendall Tau-B and the corresponding p-value (P-val) of the average watched duration and the features, for the Linear TV users and the VoD users.

Feature                              | P. corr (Linear TV) | Tau-B (Linear TV) | P-val (Linear TV) | P. corr (VoD) | Tau-B (VoD) | P-val (VoD)
HoD                                  | 0.39  | 0.27  | 0.00 | -0.15 | -0.01 | 0.59
DoW                                  | 0.16  | 0.14  | 0.00 | -0.01 | -0.01 | 0.61
DoM                                  | -0.10 | -0.06 | 0.00 | -0.03 | -0.03 | 0.23
Join time [s]                        | 0.49  | 0.38  | 0.00 | -0.16 | -0.32 | 0.00
Bitrate [kbit/s]                     | 0.83  | 0.65  | 0.00 | 0.23  | 0.18  | 0.00
Buffering events [count]             | 0.49  | 0.47  | 0.00 | -0.17 | -0.31 | 0.00
Buffering rate                       | -0.24 | -0.09 | 0.00 | -0.12 | -0.27 | 0.00
Decreased bitrate switches [count]   | 0.63  | 0.49  | 0.00 | 0.06  | 0.11  | 0.00
Expected HoD watch [s]               | 0.60  | 0.46  | 0.00 | 0.40  | 0.21  | 0.00
Prev. watch duration [s]             | 0.82  | 0.65  | 0.00 | 0.84  | 0.71  | 0.00
SARIMA                               | 0.89  | 0.73  | 0.00 | 0.86  | 0.71  | 0.00
Deviation of previous hour           | 0.58  | 0.34  | 0.00 | 0.56  | 0.52  | 0.00
Past avg                             | 0.67  | 0.48  | 0.00 | 0.74  | 0.60  | 0.00
2h                                   | 0.60  | 0.44  | 0.00 | 0.68  | 0.56  | 0.00
13h                                  | 0.29  | 0.18  | 0.00 | 0.07  | 0.07  | 0.00
24h                                  | 0.67  | 0.51  | 0.00 | 0.17  | 0.08  | 0.00
Fitted slope                         | 0.37  | 0.24  | 0.00 | 0.30  | 0.17  | 0.00


4.3 Time series analysis

The ACF and PACF of first difference and first seasonal difference (s = 24) of the Linear TV AWD and the VoD AWD is shown in Figure 4.4a & 4.4b. The ACF of the linear TV time series have one significant peak at lag 24. It also has a few significant peaks at the first lags. This indicates a seasonal MA of order one and a low order MA model. The PACF of the linear TV time series has decaying peaks at lag 24, 48, 72 etc. this is consistent with an seasonal MA of order one. The peaks of the PACF is also decaying in the first lags, this is consistent with a low order MA model. The ACF of the first difference and first seasonal difference (s = 24) of the VoD time series have one significant peak at lag 24. And a few barely significant peaks at the initial few lags. Again this indicates a seasonal MA of order one and a low order MA model. The PACF show the same behavior as the PACF of the linear TV PACF, this is again consistent with a seasonal MA and a low order MA. Figure 4.4c & 4.4d show the residuals of the fitted SARIM A(1, 1, 4) × (0, 1, 1)24 for the linear TV and VoD time series. In both cases there are almost no significant peaks in the ACF and PACF. This indicates that the only difference between the model and the real time series is noise. The resulting RMSE of the two models are 88.0s and 80.1s. Other combinations of model orders is tried but no significant improvement is made.
