
UNIFORM INTERVAL NORMALIZATION:

DATA REPRESENTATION OF SPARSE AND NOISY

DATA SETS FOR MACHINE LEARNING

Simon Sävhammar


Acknowledgments

I would like to extend sincere thanks to the researchers at Sahlgrenska University Hospital for their work and for giving me access to the MoodMapper data set:

Steinn Steingrimsson, Docent and Specialist in Psychiatry, Medicine doctor, Psykiatri Affektiva, Sahlgrenska University Hospital

Ulla Karilampi, Researcher PhD/Psychologist MSc, Kunskapsnätverket Digitalisering och Innovation, Sahlgrenska University Hospital.

I am also grateful to Stefan Andersson (www.safebase.se) and Verklighetslabbet via Susanna Bjälevik (www.verklighetslabbet.se), for their work and for giving me access to the Safebase data set.


This Master Thesis constitutes a part of the project: BaltSe@nioR 2.0: Innovative solutions to support BSR in providing more senior - friendly public spaces due to increased capacity of BSR companies and


Abstract

The uniform interval normalization technique is proposed as an approach to handle sparse data and noise in the data. The technique is evaluated by transforming and normalizing the MoodMapper and Safebase data sets, and the predictive capabilities are compared by forecasting the data sets with a LSTM model. The results are compared to both the commonly used MinMax normalization technique and MinMax normalization with a time2vec layer. It was found that uniform interval normalization performed better on both the sparse MoodMapper data set and the denser Safebase data set. Future work consists of studying the performance of uniform interval normalization on other data sets and with other machine learning models.

keywords: Multivariate time series, forecasting, machine learning, LSTM, data representation,


Table of Contents

1. Introduction
2. Background
3. Problem Definition
4. Method
4.1 Scientific Method
4.2 Planning
4.3 Validity
4.4 MoodMapper
4.5 Safebase
4.6 Uniform interval normalization
4.7 LSTM
4.7.1 LSTM 1
4.8 LSTM+t2v
4.9 Experimental setup
4.9.1 Metric
5. Results
5.1 MoodMapper
5.2 Safebase
5.3 Analysis
6. Discussion
6.1 Effects of the application of uniform interval normalization
6.2 Generalization
6.3 Future work
6.4 Validity
7. Ethic


1. Introduction

Bipolar disorder affects about two percent of the world's population (Merikangas et al., 2017). A person suffering from bipolar disorder experiences mood phases ranging from depression to manic episodes. The recurring manic phases or phases of depression have a major impact on a person's quality of life, but early pharmacological intervention can reduce the negative effect of the relapses. However, patients themselves are often unaware of their need for treatment, and monitoring through routine checkups is often not frequent enough (Antosik-Wójcińska et al., 2020). Antosik-Wójcińska et al. (2020) review the usage of smartphones as a monitoring tool, with machine learning techniques as a predictive tool, and found it to be an area with potential. Sahlgrenska University Hospital in Gothenburg collected data from individuals suffering from bipolar disorder through a smartphone application called MoodMapper, and this study explores how well the data can be used to generate models with high predictability. The data in the MoodMapper data set consists of several multivariate time series, one multivariate time series per individual who participated in the data collection.

The MoodMapper data set is sparse, meaning there are often long periods of time between two events, and the data set contains noise caused by errors in the sensors. For example, the accelerometer measuring the steps a participant takes shuts off when the battery level is low, leading to no data being registered by the server. Sparse and noisy data sets are fairly common in the field of psychiatry (Cearns, Hahn & Baune, 2019); they increase the difficulty for deep learning models to identify patterns, which decreases the models' performance and reduces the reproducibility of research in the field.

The purpose of this study is to examine if reducing the sparsity and noise by fuzzifying the data could improve the performance of a machine learning model, with respect to predictability, when the model is used to forecast a time series. A technique called uniform interval normalization is used to fuzzify the MoodMapper and Safebase data sets. The machine learning model selected is a Long Short Term Memory (LSTM) neural network. To improve the model's ability to detect seasonal patterns, a time2vec layer is applied to the MinMax normalized data and compared to the other two approaches, uniform interval normalization and MinMax without time2vec. The uniform interval normalization technique is also applied to the Safebase data set, which contains data from bed scales installed on the beds of patients with dementia. The data in that data set is dense with a low amount of noise, and is likely to contain patterns. The Safebase data set is included to examine how uniform interval normalization performs on a dense data set.

The results show that the fuzzified data is easier to predict for both data sets, but the effect is greater on the MoodMapper data set than on the Safebase data set.

In the following chapter, the general techniques included in the study are explained. In the third chapter the research problem is defined, the fourth chapter contains the details of the implementations, the fifth chapter presents the results, the sixth chapter contains the discussion and future work, the seventh chapter discusses ethical questions and the eighth chapter contains the conclusions of the study.

2. Background


of life for persons with bipolar disorder, but if medication can be given in time, it decreases the severity of the symptoms and reduces the risk that an episode converts into full-blown illness (Antosik-Wójcińska, Dominiak, Chojnacka, Kaczmarek-Majer, Opara, Radziszewska, Olwert & Święcicki, 2020). The MoodMapper data set was created by collecting data with a smartphone application from volunteers with bipolar disorder. The data set consists of several multivariate time series with data on, for example, phone calls, text messages and the number of steps taken. If it is possible to forecast the activity of the participants, it could be possible to identify patterns in the behavior of a participant that signal the beginning of a relapse, and to distribute mood stabilizing drugs in order to prevent or reduce the effects of the relapse.

2.1 Statistics and machine learning

A time series is a sequence of observations recorded at different times and ordered by the time they occurred. In a univariate time series each observation is a single data point, while in a multivariate time series each observation is a vector of data points with known or assumed interrelation. Univariate and multivariate time series are commonly found in various areas, e.g., economics, health care analysis and real estate. Time series are studied for several purposes, for example to compare trends and examine the increase or decrease of a variable over time, to identify cyclic or seasonal patterns, and to do time series forecasting.

Time series forecasting is a well studied area within data science and statistics. To forecast is, according to the Merriam-Webster dictionary (https://www.merriam-webster.com/dictionary/forecast), “to calculate or predict (some future event or condition) usually as a result of study and analysis of available pertinent data” and to indicate as likely to occur. Forecasting of time series data is then to predict future data based on the available historic and current data. Forecasting of time series is, according to Faloutsos, Flunkert, Gasthaus, Januschowski and Wang (2019), one of the most sought-after and difficult data mining tasks, since it has high importance in industrial, social and scientific applications. Example applications of forecasting are inventory management by forecasting product supply and demand, managing the distribution of electricity by forecasting the production of wind turbine power (Zhang, Yan, Infield, Liu & Lian, 2019) and predicting which patients will develop Alzheimer's disease during a time period (Hong, Lin, Yang, Cai & Clawson, 2020). Common approaches to time series forecasting have been statistical methods such as linear regression and Autoregressive Integrated Moving Average (ARIMA), but in the last three decades machine learning models such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) have successfully been used to forecast time series and have been found to be competitive with the classical statistical models (Bontempi, Ben Taieb & Le Borgne, 2013).


LSTM is a type of Recurrent Neural Network (RNN). A RNN uses a hidden state to keep information from previous states while going forward over the input sequence. The current state in a RNN takes the hidden state from the previous step and the current data from the input sequence as input to update and to output the next hidden state.

Figure 1 shows a diagram of a RNN cell, which demonstrates how a RNN functions. The input data X is passed into the cell, where it is processed by the activation function tanh. The resulting output is passed back to the cell and used together with the next input. This is the recursion of the RNN: data from the previous step is reused in the current step. The output of the RNN, Y, is obtained by passing data from the RNN cell to an output layer. Figure 2 demonstrates how the above RNN cell is rolled out over several time steps.

Figure 1: Diagram of a RNN cell. X is the input to the cell, ftanh is the activation function (tanh is commonly used in RNNs), Y is the output of the cell, and the arrow from and to the cell represents the data in the output that is used in the input.


Figure 2 illustrates an example of how a RNN computes the output Y_t from the input sequence X. The input sequence is a sequence of data points ordered sequentially. The hidden state is an internal representation used by the RNN to remember data from the previous time step. At the first step in the input sequence, the input and an initial hidden state are fed into the RNN cell to calculate the next hidden state h_1. At the next time step, the previous hidden state h_1 and the input x_2 compute the next hidden state. The process repeats until all inputs in the sequence have been processed, and the last hidden state h_t is the output of the RNN. The last hidden state h_t is passed to the output layer in order to output Y_t.

For example, consider a RNN used to predict the last word of a sentence based on the previous words in the sentence. The input sequence is the words of the sentence in sequential order: the input at the first step is the first word in the sentence and the input at the last step is the second to last word. The hidden state is a vector representation of the output at the previous step. To predict the last word, the hidden state at the last step is passed to the output layer.

Another approach is to output the hidden state at every step instead of only the last step, for example to predict the next word in a sentence based on the previous word.
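To make the rollout described above concrete, the following is a minimal sketch of a vanilla RNN forward pass in NumPy: the hidden state is updated with tanh at every step and the output is computed from the last hidden state. The function and weight names are illustrative and not taken from the thesis implementation.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Roll a vanilla RNN cell out over an input sequence (many-to-one).

    x_seq has shape (T, input_dim), one input vector per time step.
    """
    h = np.zeros(W_hh.shape[0])            # initial hidden state h0
    for x_t in x_seq:                      # reuse the previous hidden state at every step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return W_hy @ h + b_y                  # output layer applied to the last hidden state

# Tiny usage example with random weights and illustrative dimensions.
rng = np.random.default_rng(0)
T, n_in, n_hidden, n_out = 5, 3, 4, 1
y = rnn_forward(rng.normal(size=(T, n_in)),
                rng.normal(size=(n_hidden, n_in)),
                rng.normal(size=(n_hidden, n_hidden)),
                rng.normal(size=(n_out, n_hidden)),
                np.zeros(n_hidden), np.zeros(n_out))
```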

RNNs use the gradient to carry information about the error between the target value and the output value from the output layer back to the input. The information is then used to update the weights in the network. However, a common problem in RNNs is the exploding or vanishing gradient problem, which occurs when the gradient either grows exponentially or shrinks exponentially towards zero (Bengio, Simard & Frasconi, 1994; Pascanu, Mikolov & Bengio, 2013).

Figure 2: Illustration of a RNN cell adapted from Li, Johnson and Yeung (2017). The RNN cell is rolled out over t time steps, with an output at the last time step. The hidden states h are represented as rectangles, the recursive cells f as squares, input x as circles and output y as a circle.


To solve the problem of the exploding or vanishing gradient and to improve the long term memory of RNNs, LSTM was proposed by Hochreiter and Schmidhuber (1997).

A LSTM network consists of a series of LSTM cells (figure 3). The difference between a RNN and a LSTM network is that in the LSTM network each cell has a cell state, which stores data over several time steps, and gates which control the data being written to and read from the cell state. The gates are called the input gate, the forget gate and the output gate.

Figure 3 shows a diagram of the LSTM cell. The input x_t and the previous hidden state h_{t−1} are passed to each of the four gates, where f_t is the forget gate, regulating what is removed from the cell state, i_t is the input gate, which decides what to write to the cell state, g_t is the cell activation function and o_t is the output gate. The output gate determines what data from the input and previous hidden state should be multiplied with the cell state to generate the next hidden state.

f_t = σ(W_f x_t + U_f h_{t−1} + b_f) (1)

i_t = σ(W_i x_t + U_i h_{t−1} + b_i) (2)


o_t = σ(W_o x_t + U_o h_{t−1} + b_o) (3)

The output of the forget gate f_t is calculated by equation 1, where W and U are weight matrices, b is the bias, x_t is the input at step t, h_{t−1} is the hidden state from the previous step and σ is the sigmoid function. The results from the gates are vectors with the same dimensionality as the cell state. The vectors consist of values between zero and one, where values close to one indicate the position should be kept while values close to zero indicate it should be deleted.

Equations 2 and 3 show the calculation of the input gate and the output gate. The input gate decides how much of the input should be added to the cell state, and the output gate how much of the cell state should be used as output from the LSTM cell.

g_t = tanh(W_g x_t + U_g h_{t−1} + b_g) (4)

Equation 4 is the cell activation function, tanh, which is the same as in a RNN.

c_t = f_t ∘ c_{t−1} + i_t ∘ g_t (5)

Equation 5 defines the cell state as the elementwise multiplication of the previous cell state c_{t−1} and the forget gate f_t, added to the elementwise multiplication of the input gate i_t and the cell activation g_t.

h_t = o_t ∘ tanh(c_t) (6)

The hidden state and output of the LSTM cell are calculated by elementwise multiplication of the output gate o_t and tanh of the cell state c_t (equation 6).
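As a minimal sketch, equations 1-6 can be written as a single LSTM cell step in NumPy. The weight names mirror the equations (one W, U and b per gate); this is illustrative and not the Keras implementation used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step following equations 1-6.

    W, U and b are dicts with keys 'f', 'i', 'o' and 'g' holding the weight
    matrices and bias vectors of the forget, input and output gates and the
    cell activation.
    """
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate, eq. 1
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate, eq. 2
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate, eq. 3
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # cell activation, eq. 4
    c_t = f_t * c_prev + i_t * g_t                           # new cell state, eq. 5
    h_t = o_t * np.tanh(c_t)                                 # new hidden state, eq. 6
    return h_t, c_t
```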

LSTM networks have been used in several different fields. For example, Hosseini and Sarrafzadeh (2019) used LSTM auto-encoders in order to detect early signs of negative health events. The data set was from the Preterm Infant Cardio-Respiratory Signals Database (PICS), and the data was collected from ten infants with slower than normal heart rate (bradycardia). Since the data was sparsely annotated, an unsupervised learning model was implemented. The result of the study showed the LSTM model to outperform the anomaly detection models used for the comparison.

The LSTM cell handles an input sequence in a similar way as a RNN, and can be rolled out over time in the same way as the RNN in figure 2.

2.2 Sparsity, noise and fuzzification

During data collection with smartphones, events can occur that introduce uncertainties in the data. Yang, Hanneghan, Qi, Deng, Dong and Fan (2015) discuss uncertainty in the context of life-logging of physical activity data and define two categories of uncertainties, Irregular Uncertainty (IU) and Regular Uncertainty (RU). IU is random, can be accidental or caused by misuse of the application used for data collection, and considerably affects the efficiency and accuracy of the data analysis. RU is frequent, persistent and cannot be completely removed, e.g., errors in the device sensors used for measurement or when the person using the application measures activity in a new environment.


during those time points. Noise could be caused by infrequent step counts. For example, if the steps taken are measured on a weekly basis and the user goes on a week-long hiking vacation, the week of the vacation will be different from the normal behavior of the user. Another cause of noise could be if a user shakes the phone in order to inflate the step count. In both cases noisy data points are introduced into the data, which are likely difficult to predict.

Fuzzification is a technique that transforms data and that can reduce the effect of sparsity and noise (Ohlbach, 2004).

Figure 4 displays an example of fuzzification. The figure contains two series of observations of the same data, where one variable (e.g., time) is placed on the x-axis, the value of the measured variable is placed on the y-axis and the dots are the observed values at a given time. The top series is without fuzzification, and contains both sparsity and noise: the sparsity is time points without any observed values and the noise is irregular measurements. In the bottom time series the data is fuzzified by dividing it into three intervals and measuring the total number of dots within each interval. Each interval can be represented either as a single data point, or as the same value repeated over the full interval. The fuzzified time series can handle the irregular distribution of data points in the last two intervals and identify the same level of activity in both.


Figure 5 shows two time series measuring the steps taken by an individual during 24 hours, where time is placed on the x-axis, starting from midnight and ending at 23.00, and the number of steps taken on the y-axis. The top time series is the steps taken during Monday and the bottom time series the number of steps taken during Tuesday.

During Monday the individual follows the normal behavior: there are no steps taken during the night hours, and the first steps are registered at 06.00. At 10.00 there is a peak of activity, followed by varying activity during the rest of the day. However, during Tuesday the individual has some activity that is different from the normal behavior. The individual wakes up one hour later than usual, and the peak shifts from ten to eleven. During the afternoon the individual does not register any steps taken from 16.00 to 17.00; either the individual did not carry the phone or did not take any steps.

Fuzzification can be used to smooth out the data points and reduce the effect of sparsity and noise in the data.


Figure 6 shows the same time series as in figure 5, but fuzzified. Since the data points have been smoothed over a longer interval of time, the difference between the two days is less clear. The effect of the individual waking up one hour later on Tuesday than on Monday is not as noticeable in the fuzzified time series; the peak happens in the interval before lunch. During the afternoon, the interval from 15.00 to 17.00 is noticeably lower during Tuesday than Monday, since the missing data points mean that less activity was observed. However, the problem of sparsity is removed, by smoothing the neighboring data points over the missing time points.

3. Problem Definition

The sparsity and noisiness of the MoodMapper data set increase the difficulty for deep learning techniques to identify patterns and to perform accurate forecasting of the multivariate time series. The aim of this study is therefore to alleviate the problem of sparsity and noisiness by examining how transforming the MoodMapper data set with the uniform interval normalization technique affects the predictability of the data set.

The sparsity of the MoodMapper data set mainly has two causes. The first is the infrequency of the events for some of the features: when the time series is sampled, the features with infrequent observations will have a high number of empty (zero valued) data points. The second cause of sparsity is when the application is shut off in order to conserve battery. During the time the application is off, no data is recorded and instead empty data points are added to the data set. The data set also contains noise which is difficult to discern from the normal data points.

The causes of the noise are sensors that have incorrectly recorded data and irregularities in the behavior of a participant. Examples of sensor errors are when the tilt of the smartphone affects the accelerometer which counts the steps taken (Kannan, 2015), and when moving along the border between two cell towers leads to a ping pong effect (Iovan et al., 2013) which inflates the count of cell tower connections. Examples of irregularities in the behavior of a participant are when a participant goes on a hiking trip, which inflates the number of steps compared to other days, or if a participant shakes the phone in order to increase the number of steps.

In order to accurately forecast based on the MoodMapper data set, it is necessary to handle the sparsity and noisiness of the data set. Uniform interval normalization smooths data over time periods, and thereby handles the sparsity and noise by fuzzifying the data set. The fuzzification of the data could lead to a more effective deep learning model with regards to predictability, since the model does not have to pinpoint the exact times events occur, but instead predicts the level of activity over a time interval.

To evaluate the effect uniform interval normalization has on the predictability of the data set, several different configurations of the uniform interval normalization's interval lengths and overlap lengths are applied to the data set. A LSTM neural network is used to forecast the transformed time series, with different configurations for the prediction interval. The results of the forecasting with the uniform interval normalization time series will be compared with forecasting of the data set both when normalized with MinMax scaling and when transformed to a set of sine functions with a time2vec layer (Kazemi et al., 2019). The hypothesis is that fuzzifying the MoodMapper data set, by applying the uniform interval normalization technique, would increase the performance of the used LSTM model with respect to predictability, compared to both when the data is normalized with MinMax normalization and when the data is normalized with MinMax normalization and a time2vec layer is added. The predictability is measured by the RMSE metric and the Mean Arctangent Absolute Percentage Error (MAAPE).
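For reference, the two error measures can be computed as below. The MAAPE formula is the mean of the arctangent of the absolute percentage error, which is an assumed definition since the thesis does not restate it; names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def maape(y_true, y_pred):
    """Mean arctangent absolute percentage error (assumed definition)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    with np.errstate(divide='ignore', invalid='ignore'):
        ape = np.abs((y_true - y_pred) / y_true)   # arctan(inf) -> pi/2 when y_true is zero
    return float(np.mean(np.arctan(ape)))
```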

The studied research questions are:

(R1) What is the effect with respect to predictability of activity level, when the sparsity and noisiness in an activities of daily life data set is handled by fuzzification?

(R2) Is there a noticeable effect with respect to predictability of activity level, when a dense data set with a low amount of noisiness is fuzzified?

(R3) How does the number of hours in each fuzzified interval affect the predictability of the activity level?

(R4) Does the predictability of the activity level increase when the intervals are overlapped?

The objectives of the study are to:

(O1) handle validity threats

(O2) evaluate interval configurations, by varying the uniform interval normalization's interval length parameter

(O3) evaluate step size configurations, by varying the uniform interval normalization's step size parameter

(O4) compare the effect of applying uniform interval normalization, with MinMax


(O5) examine the effect of uniform interval normalization on the Safebase data set with respect to predictability

4. Method

4.1 Scientific Method

To determine whether the hypothesis, that applying the uniform interval normalization technique to the MoodMapper data set increases the predictability, is true or not, experiments must be performed and the results compared to both a default normalization technique (i.e., MinMax normalization) and time2vec (Kazemi et al., 2019). Time2vec has been shown to be an effective technique for handling periodicity in data. The correctness of the hypothesis cannot be decided by other scientific methods, for example conducting a survey, since the uniform interval normalization technique has not been used previously and the MoodMapper data set has not been analyzed with regards to predictability. The experiment was planned according to the guidelines suggested by Wohlin et al. (2012), with the goal to analyze how applying uniform interval normalization affects the predictability of the MoodMapper data set. The main study object of the experiment is the uniform interval normalization technique, with MinMax normalization and time2vec as the baseline.

The purpose is to evaluate the effect the application of uniform interval normalization has on the predictability of the MoodMapper data set, compared to the other two techniques, MinMax normalization and time2vec. The perspective of the experiments is that of a data analyst and researcher, who examines if there is a significant difference in the predictability when uniform interval normalization is applied. The context of the experiment is an offline environment and the MoodMapper and Safebase data sets, and the techniques are implemented in Python and Keras (Chollet, 2015) running on Tensorflow (Abadi et al., 2016).

4.2 Planning

The experiment is planned to examine the effect of applying uniform interval normalization on both the MoodMapper data set, which has a higher degree of sparsity and noisiness, and the Safebase data set, which has a lower degree of sparsity and noisiness. The null hypothesis is that applying uniform interval normalization on the MoodMapper data set will not increase the performance of the LSTM model with regards to predictability compared to the other two techniques, MinMax normalization and time2vec. The alternative hypothesis is that applying uniform interval normalization will increase the predictability of the MoodMapper data set compared to the other two techniques. The smoothing of data points over longer time intervals introduces a level of fuzziness to the data which could reduce the difficulty of the forecasting task.

The independent variables are the choice of data set, the normalization techniques and the hyper parameters tuned in the LSTM model. The sample size is an independent variable, because it varies both per participant and per data set, and affects the performance of the LSTM model. The dependent variable is the predictability, measured by calculating the RMSE of the approaches. The experimental design is one factor with three treatments with a randomized blocked design; the factor is the randomization in order of application. The treatments are balanced in the sense that an equal amount of data points is available for each technique and configuration.

The LSTM models and time2vec layer are implemented with Python 3.7 and the Keras API (Chollet, 2015) running on Tensorflow (Abadi, et. al. 2016), and the normalization methods are implemented in Python.

The objectives during the experiment are to clean the data of invalid numbers (nan) and obvious errors for example, negative number of steps, implement the necessary software, tune the neural networks, do forecasting with LSTM for each of the three techniques, uniform interval normalization, MinMax normalization and time2vec, calculate the RMSE and finally to compare and visualize the results.

4.3 Validity

Threats to the conclusion validity of the study are low statistical power, fishing and the error rate, and the reliability of measures. Low statistical power affects this study since the regularization techniques and the initialization of the weights in a neural network use random number generators, which means that the results of the same configuration can vary over several runs.

Fishing and the error rate has two separate parts. The first part is when searching or fishing for a result, where the researcher influences the result by, for example, selecting settings that lead to a specific outcome. The second part is about the significance level and the need to adjust it when multiple analyses are conducted. Fishing and the error rate is a threat to the validity, since the hypothesis assumes uniform interval normalization will increase the predictability of the data set compared to the other two techniques, MinMax normalization and time2vec. This could lead to results being interpreted in favor of the desired outcome or parameters being tuned in favor of uniform interval normalization.

Threats to the construct validity include mono-operation bias, mono-method bias and restricted generalization across constructs. The threat of mono-operation bias is that if only the MoodMapper data set is used, it is not possible to generalize the effect on predictability to other data sets. A similar threat is mono-method bias: if the effect on the predictability is measured by one measure, it might bias the results towards a certain technique. The last identified threat to the construct validity is restricted generalization across constructs. Since uniform interval normalization reduces the number of data points by grouping them into one data point, it could reduce the possibility to generalize the technique to other data sets even if the effect on the predictability is higher than for the other techniques.


Table 1: Summary of validity threat and how they were handled. First column name of the validity threat, second column how they were managed.

Validity threat Handled by

Low statistical power Each configuration run over ten trials.

Fishing and the error rate Information of LSTM, layers and tuning of hyper parameters.

Mono-operation bias Including two data sets in the experiment.

Mono-method bias Usage of two measures.

Restricted generalization across constructs Partially a threat (reduces sample size).

Interaction of setting and treatment Brief survey of LSTM and time series forecasting.

Table 1 summarizes the validity threats (column one) and how they were managed (column two). The threat of low statistical power was handled by running each configuration ten times. The variance for most of the configurations was low, and an identical experimental setup should yield similar results. However, since the seed was not set manually, it is not possible to exactly reproduce the results. The validity threat of fishing and the error rate was handled by providing information about the LSTM models used and the configurations of the hyper parameters in chapter four. The information can be used to implement an identical model and tune it with the same hyper parameters, in order to analyze a data set. However, since permission is required to access the MoodMapper and Safebase data sets, it might not be possible to use the same data.

The validity threat of the reliability of measures was handled by including the mean of the measures in tables, to avoid a visual comparison of the box plots. In the cases where one configuration is not clearly better than another and a statistical test would be needed, they are assumed to be tied. The threat of inadequate preoperational explication of constructs was managed by defining predictability as how well a data point can be predicted based on a number of previous data points, where the performance is measured by RMSE and MAAPE. Two data sets are used in the experiment, one sparser and with more noise (MoodMapper) and one with denser data and less noise (Safebase), in order both to examine how uniform interval normalization performs on data sets with different characteristics and to manage the mono-operation bias threat. To handle the mono-method bias threat two measures were used, RMSE and MAAPE. The threat of restricted generalization across constructs is partially a threat, since uniform interval normalization reduces the number of data points. Even though the technique performs better than MinMax and time2vec, the reduction of data points excludes some of the participants. However, the number of data points is generally a limiting factor for machine learning models; if the sample size is not large enough the model cannot learn patterns in the data. The interaction of setting and treatment was handled by a brief survey: several articles were found where LSTM had successfully been used to forecast time series, and in several cases it was the best performing technique.

4.4 MoodMapper


between them and ranges between four and 22 months. During the period the participants used the application, it recorded and sent data about the number of steps taken, the duration of calls measured in seconds, the direction of the calls (incoming or outgoing), the number of characters in incoming and outgoing text messages, which cell tower the phone is currently connected to, and when the screen turns on and off. The time an event occurred is registered as a timestamp with the year, month, day, hour, minute and second.

The MoodMapper data set is loaded from several files and transformed to one multivariate time series per participant. During the transformation process the data for the different features is sampled with the granularity of one hour and aggregated into a multivariate time series. This introduces several empty records (figure 7).


Table 2: Binning method per feature.

Feature Bin method

Call duration (incoming) Sum
Call duration (outgoing) Sum
Text characters (incoming) Sum
Text characters (outgoing) Sum
Cell tower Count
Screen (on) Count
Screen (off) Count
Step count Sum
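A minimal sketch of this hourly binning with pandas is shown below. The raw-event format and column names are assumptions for illustration, not the actual MoodMapper schema.

```python
import pandas as pd

# Assumed raw-event format: one row per event with a timestamp,
# a feature name and a value (e.g. step count or call duration).
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-01 08:12", "2021-03-01 08:47",
                                 "2021-03-01 09:05", "2021-03-01 11:30"]),
    "feature":   ["step_count", "step_count", "call_duration_in", "screen_on"],
    "value":     [120, 300, 45, 1],
})

# Features binned by summing; the remaining features are binned by counting (table 2).
SUM_FEATURES = {"step_count", "call_duration_in", "call_duration_out",
                "text_chars_in", "text_chars_out"}

def bin_hourly(df):
    """Aggregate raw events into one value per feature and hour."""
    df = df.set_index("timestamp")
    columns = []
    for name, group in df.groupby("feature"):
        how = "sum" if name in SUM_FEATURES else "count"
        columns.append(group["value"].resample("1H").agg(how).rename(name))
    # Hours without events become empty (zero valued) records, as in figure 7.
    return pd.concat(columns, axis=1).fillna(0)

hourly = bin_hourly(events)
```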


Table 3: Summary of the MoodMapper data set after preprocessing. The first column is the participant id, second column the number of months the participant used the application, the third column the number of streams in the time series (features) and the last column the number of data points (rows).

User id Months Feature count Data points (count)

QPJ42J2B9MPB (sb001) 6* 7 4 080
JJ4HF7RHBC8N (sb002) 11 7 8 568
RQDPZADXJ8CP (sb003) 6 6 4 560
ATRDDP78WJH7 (sb004) 4 7 3 408
RYY7EQWJ7WDB (sb005) 4* 6 4 393
W8Z3K4B2DEWE (sb006) 22 7 16 224
DTAQMMDQWMCF (sb007) 4 8 2 784
CCEZZCCXKRAK (sb008) 8 8 6 528

* The participants had a few data points at a date several months after the last continuous date. The data at the unconnected months were removed and is not included in the count.

Table 3 summarizes the preprocessed data. The first column contains the user identity assigned to each participant; the identity does not hold any personal information that could be used to identify the person. The second column contains the number of months the participant used the application to record data. The third column is the number of features for which data was recorded and the fourth column is the number of rows in the data set after preprocessing.

Data from four of the eight participants was selected from the MoodMapper data set; the criterion for selection is the amount of available data points. Since the time period the participants used the application and the frequency of observations vary between them, the number of recorded data points varies between the participants. Neural networks suffer from overfitting issues when the sample size is low (Chowdhury, Dong & Li, 2019), so if participants have too few data points it might not be possible for the LSTM model to make any reasonable predictions, whereas a higher number of data points improves the training of LSTM neural networks and increases the possibility that the LSTM model will be able to accurately predict the patterns of the participants. Therefore, participants with fewer than one thousand data points after the data set has been preprocessed and normalized with uniform interval normalization were excluded from the study. The selected participants to be included in the experiment are JJ4HF7RHBC8N, RQDPZADXJ8CP, W8Z3K4B2DEWE and CCEZZCCXKRAK.

4.5 Safebase


Four scales are installed on each bed and the scale weights are measured with the granularity of a second.

Table 4: Summary of the Safebase data set. First column the identification number of the bed, second column the number of features (scales) and third column the number of data points.

Identification Number of features Number of data points

141123 4 13 892 320
341767 4 15 404 828
406322 4 14 425 489
465792 4 15 423 100
466438 4 14 750 442
466440 4 14 934 652
466497 4 15 367 091
479449 4 15 463 076
513067 4 15 441 978

Table 4 shows the summary of the Safebase data set. The first column contains the identification number, that identifies each bed in the data set. The identification number cannot be used to identify a patient, and is only used to map data to the bed it was collected from. The second column displays the number of features (i.e., scales) of each patient, which is four since each bed has one scale installed on every leg. The third column shows the number of data points for each patient.

One patient is selected from the data set; the selection is done arbitrarily, by picking an identification number. From the chosen patient a period of two weeks is extracted, by finding a connected sequence with a low amount of missing data, in order to evaluate the effect the application of uniform interval normalization has on the predictability of a dense data set with a low amount of noise. The data is then sampled at the granularity of a minute, by averaging all values during each minute. The sampling is done in order to reduce the number of data points and thereby shorten the time it takes to tune the parameters and the training time of the LSTM model.
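The minute-level down-sampling can be expressed with a pandas resample, as in the sketch below; the column names and date range are placeholders, not the actual Safebase data.

```python
import pandas as pd

def to_minute_means(scale_df):
    """Down-sample second-level scale readings to one mean value per minute.

    scale_df is assumed to have a DatetimeIndex and one column per scale,
    e.g. ['scale_1', 'scale_2', 'scale_3', 'scale_4'].
    """
    return scale_df.resample("1min").mean()

# Usage sketch: extract an assumed two-week window with few gaps, then resample.
# two_weeks = scale_df.loc["2021-02-01":"2021-02-14"]
# minute_data = to_minute_means(two_weeks)
```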

4.6 Uniform interval normalization

Uniform interval normalization handles the sparsity and noise in the multivariate time series by smoothing events over longer time intervals and transforming the data to activity per time period. This reduces the effect that daily events happening at slightly different times have on the predictability, and fills in the gaps between events that occur less frequently than other events. The values for each feature in an interval are summed up and then divided by a normalization factor, resulting in one value per feature for the time period.


Uniform interval normalization divides a time series into uniform intervals, and then applies a low-pass filter to the intervals. Figure 8 shows a diagram of the intervals of uniform interval normalization, where L is the length of an interval and S is the step size, which determines the distance from the start of one interval to the start of the next interval. The length of intervals is measured in time steps, and decides the time a data point will be smoothed out over. Increasing the length of intervals increases the effect on the sparsity of the data set, since the number of timestamps without events is reduced. The step size determines how many time steps there are between the starts of two intervals; higher values for the step size parameter create less overlap of the intervals but reduce the number of data points in the normalized data set more than a lower step size.

A low-pass filter allows lower frequencies to pass through, but filters out higher frequencies (Smith, 2008). Figure 9 shows a diagram of a low-pass filter, where the frequency is on the x-axis, the gain on the y-axis, fc is the cutoff point and the box represents the passband. If the incoming signal has a frequency that is higher than the cutoff point, it will be multiplied with a gain of zero, which effectively cancels it out. When the frequency is lower than the cutoff point it will be multiplied with one, which means it is allowed to pass through the filter.

Figure 9: Diagram of a low-pass filter, adapted from Smith (2008).

Figure 8: Diagram of the intervals of uniform interval normalization, L is the length of the interval, and S is the distance from the start of one interval to the start of the next interval.


y(t) = x(t) + x(t−1) (7)

According to Smith (2008) the simplest low-pass filter is given by equation 7, where y(t) is the output amplitude at time t and x(t) is the input amplitude at time t. The output amplitude of the filter is the current input amplitude added to the previous input amplitude.

y(t) = (x(t) + x(t−1) + x(t−2) + … + x(t−n)) / norm (8)

Uniform interval normalization filters values according to equation 8, where n is the length of an interval and norm is the normalization factor.

Uniform interval normalization is applied to the MoodMapper and Safebase data sets, which are multivariate time series, and the normalization factor is calculated on a per-feature basis, where the highest value for the feature over all intervals is used.

V^n(t) = ( Σ_{t=l}^{u} v_t^n ) / V_max^n, for all t ∈ T (9)

Equation 9 shows how the feature vector V^n with n features is calculated for a participant, where l is the time point at the start of an interval, u is the last timestamp included in the interval, v_t^n is the feature vector at time step t, V_max^n is a feature vector with the maximum value for each feature in the population of participants and T is all time points in the time series.
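The following is a minimal sketch of uniform interval normalization following equation 9 and the interval and step size description in this section: values inside each interval of length L are summed per feature, intervals start every S time steps, and each feature is divided by its maximum interval sum. Names are illustrative, and unlike equation 9 the maximum is here taken over a single series rather than over the whole population of participants.

```python
import numpy as np

def uniform_interval_normalization(series, interval_length, step_size):
    """Transform a (T x n_features) multivariate time series into
    overlapping interval sums, normalized per feature.

    interval_length (L): number of time steps summed into one data point.
    step_size (S): distance between the starts of consecutive intervals.
    """
    series = np.asarray(series, dtype=float)
    starts = range(0, len(series) - interval_length + 1, step_size)
    sums = np.array([series[s:s + interval_length].sum(axis=0) for s in starts])
    norm = sums.max(axis=0)    # per-feature normalization factor (per series here,
    norm[norm == 0] = 1.0      # per population in equation 9); avoid division by zero
    return sums / norm

# Usage sketch: 12-hour intervals starting every 6 hours on hourly data.
# normalized = uniform_interval_normalization(hourly_values, interval_length=12, step_size=6)
```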

Periodicity in the data was represented by encoding the position of the intervals during the day. In other words, the first interval starting at midnight is encoded as the first interval by adding a float number, the next interval as the second by increasing the float with a small value, and so on for all intervals; for the last interval of the day the float has the highest value but stays below one.
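A small sketch of the described encoding, assuming the interval's start position within the day is mapped to a float in [0, 1):

```python
def encode_position_in_day(interval_start_hour, hours_per_day=24):
    """Encode the position of an interval during the day as a float in [0, 1).

    The first interval of the day (starting at midnight) maps to 0.0 and every
    later interval adds a small step; the last interval of the day gets the
    highest value but stays below one.
    """
    return (interval_start_hour % hours_per_day) / hours_per_day

# Intervals starting at 00.00, 06.00, 12.00 and 18.00 map to 0.0, 0.25, 0.5 and 0.75.
```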

4.7 MinMax normalization

y_i = (x_i − min(X)) / (max(X) − min(X)) (10)

Equation 10 is the formula for MinMax normalization, where y_i is the normalized value at position i, x_i is the actual value at position i, min(X) is the minimum value of the data set and max(X) is the maximum value of the data set. During the experiments, the scikit-learn software library (Pedregosa et al., 2011) was used to normalize the data.
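A minimal usage sketch of equation 10 with scikit-learn's MinMaxScaler (the array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# data: (n_samples, n_features) multivariate time series, one column per feature.
data = np.array([[0.0, 10.0],
                 [5.0, 20.0],
                 [10.0, 40.0]])

scaler = MinMaxScaler()                      # scales each feature to [0, 1] via equation 10
data_scaled = scaler.fit_transform(data)     # fit min/max on the training portion
# test_scaled = scaler.transform(test_data)  # reuse the same min/max on unseen data
```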


to zero.

4.8 LSTM

Two different LSTM models were implemented in Keras and tuned for the experiment: LSTM 1 and LSTM with an implementation of a time2vec layer (LSTM+t2v). LSTM 1 was used for many-to-one forecasts. The LSTM+t2v model uses a custom time2vec layer that transforms the input into a sum of sine functions, which are then used as input to the LSTM model. LSTM+t2v is an interpretation of the work of Godfrey and Gashler (2017) as well as Kazemi et al. (2019).

The LSTM models used for forecasting takes a sequence of data points as input, learns patterns in the sequence and then predicts the data points that would come after the sequence. The output sequence could either be a single value or several values.

The following is an example with a single value in the output sequence. Suppose a data set consists of a series of integers like {1,2,3,1,2,3,1,2,3}, and a LSTM model is to be used to forecast the series. The series is split into an input sequence and an output sequence: the input sequence could be {1,2,3,1,2,3,1,2} and the output sequence, or target value, is then the single value {3}. The model can then learn the input sequence and predict a single integer that would be the next in the series.

Another example, with several data points in the output sequence, uses the same series of integers but splits it into the input sequence {1,2,3,1,2,3} and the output sequence {1,2,3}. The model can then learn to recognize and predict longer portions of a series, and in the case of time series, forecast further into the future.

The first approach is usually called many-to-one prediction, since several data points in the input sequence are used to forecast a single value in the output sequence. The second approach is called many-to-many, since several values in the input sequence are used to forecast several values in the output sequence. The many-to-many could be more desirable, since it looks further into the future but is also more complex. As stated by Bontempi, Ben Taieb and Le Borgne (2013):

“If the one-step forecasting of a time series is already a challenging task, performing multi-step forecasting is more difficult because of additional complications, like accumulation of errors, reduced accuracy, and increased uncertainty.” .

The LSTM models had poor performance on the many-to-one predictions of the MoodMapper data set, especially for the sparser features. Because of the poor performance and due to time limitations, a many-to-many model was not implemented.

4.8.1 LSTM 1


Figure 10 displays the layers in the LSTM 1 model, where each of the LSTM layers consists of a single LSTM unit (figure 3). The dense layer transforms the output sequence into one data point, which is the predicted value for the input sequence.

In order to execute the experiment, the hyper parameters of the LSTM model have to be tuned for each technique, since the characteristics of the data are different depending on which of the three techniques is applied; for example, the sample size and the level of sparsity vary between the techniques. The sample size has an impact on how complex patterns a machine learning model can learn and on the effect dropout has during the training of the model (Srivastava et al., 2014).

In order to find the optimal number of layers for the MoodMapper and Safebase data sets, several different models with varying numbers of layers were tested. The first model contained a single LSTM layer, the number of layers was increased by one for the next model and the third model had three layers. Two models with more than three layers were tested, one with four and one with five layers. However, it was found that the models with more than three layers overfitted on the training data and performed worse on the test data. For each model, the three different techniques, uniform interval normalization, MinMax normalization and time2vec, were applied over ten runs.

Figure 10: Diagram of the LSTM 1 model. The model has three LSTM layers each consisting of a single LSTM unit, the final dense layer transforms the sequence into the output data point.

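A sketch of the LSTM 1 architecture in Keras, based on figure 10 and the settings later listed in table 8: three stacked LSTM layers followed by a dense output layer for a many-to-one forecast. The exact layer arguments, the dropout placement and the per-feature output are assumptions; the thesis does not list the full Keras code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm1(seq_len, n_features, units=75, dropout=0.6, learning_rate=0.001):
    """Three stacked LSTM layers followed by a dense layer that outputs
    one predicted data point (many-to-one forecasting)."""
    model = keras.Sequential([
        keras.Input(shape=(seq_len, n_features)),
        layers.LSTM(units, return_sequences=True),
        layers.Dropout(dropout),
        layers.LSTM(units, return_sequences=True),
        layers.Dropout(dropout),
        layers.LSTM(units),              # last LSTM layer returns only the final state
        layers.Dense(n_features),        # one predicted value per feature
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate), loss="mse")
    return model

# Usage sketch with the (assumed) uniform interval normalization settings of table 8.
# model = build_lstm1(seq_len=56, n_features=7)
# model.fit(X_train, y_train, epochs=500, batch_size=32, validation_split=0.25)
```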


Table 5: Summary of the range of tested values for the hyper parameters of the LSTM model. First column the type of parameter and second column the range of tested values.

Parameter Range of tested values

Epoch 100 - 1000
Batch size 16 - 512
Learning rate 0.0001 - 0.01
Dropout 0 - 0.6
Units 25 - 125
t2v K 20 - 175

Table 5 shows the hyper parameters that were tuned and the range of values that were tested. The first column is the name of the parameter, where epoch is the number of times the training data passes through the model, batch size is how many samples pass through the model before the weights are updated, learning rate determines by how much the weights in the model are updated, dropout is used to remove a random number of elements in the input sequence, units is the length of the output sequence after each LSTM layer and also determines the dimensionality of the gates in the model, and t2v K is the number of nodes in the t2v layer. The second column contains the range of the tested values. The same tuning process was followed for each of the three techniques, uniform interval normalization, MinMax normalization and time2vec. However, due to the limited time available to train the models, the time necessary to complete one training run for a configuration and the lack of preexisting knowledge of the optimal hyper parameter settings for the MoodMapper and Safebase data sets, it was not feasible to find the optimal settings for each of the techniques. Instead the objectives during the tuning process were to remove or minimize overfitting issues, spend a similar amount of time on each configuration and find the settings that lead to the lowest Mean Squared Error (MSE) on the test set. The starting point for each technique was the default values as defined in the Keras LSTM API. Each of the hyper parameters was varied one at a time, until the value which led to the lowest MSE was found. The order in which the parameters were tuned corresponds to the order of table 5.

4.9 LSTM+t2v

The LSTM+t2v model uses a custom layer, time2vec, to detect and model periodicity in the MoodMapper data set. The layer was implemented in Keras and is based on time2vec (Kazemi et al., 2019) and Neural Decomposition (ND), which was proposed by Godfrey and Gashler (2017). The time2vec layer is an interpretation and extension of time2vec and ND that can handle more than one input vector, in order to process and transform multivariate time series.

time2vec(X)[i] = w_{n,L} x_n + b_{n,L} if i = 0, and a_{n,i} sin(w_{n,i} x_n + b_{n,i}) otherwise (11)

Equation 11 is adapted from Godfrey and Gashler (2017) but extends the equation to multivariate inputs. X is the input matrix consisting of one or more input vectors, x_n is an input vector in X, w_{n,L} and b_{n,L} are the weight and bias matrices used to calculate the linear term for input vector x_n, w_{n,i} and b_{n,i} are the weight and bias matrices for the sinusoids and correspond to the frequency and phase shift of a sinusoid, and a_{n,i} is the weight matrix for the amplitude.

Figure 11 shows a diagram of the time2vec implementation. The layer is implemented as a feed forward neural network with a single hidden layer, where each neuron in the hidden layer is connected to one input neuron. The input x to the time2vec layer is one vector per stream in a multivariate time series; the input is multiplied with the frequency w and the phase shift is added as a bias b, the sine function is used as activation function in the non-linear neurons, and no activation function is applied in the linear neurons. The non-linear output is multiplied with an amplitude before being inputted into the LSTM model. The LSTM model used in LSTM+t2v is the same as LSTM 1 (figure 10).

There are two possible approaches to connect the inputs to the hidden layer: to connect each input node to all nodes in the hidden layer (fully connected) or to connect the inputs to separate nodes in the hidden layer. Both approaches were tested by classifying a synthetic data set, the first data set described in Kazemi et al. (2019), and by forecasting a data set of the water level in Venice available at https://www.kaggle.com/lbronchal/venezia. The difference between the two approaches was not statistically significant for either of the two data sets.

Godfrey and Gashler (2017) include an amplitude variable in their equation, which is removed by Kazemi et al. (2019). During the implementation, both including and not including the amplitude variable were tested, by classifying the synthetic data set and forecasting the water level of Venice. The variant with the amplitude performed better on the water level data set if the data was not normalized between zero and one. However, if the data set was normalized between zero and one, there was not a statistically significant difference between the two approaches.

The implemented time2vec layer connects the inputs to separate nodes in the hidden layer, in order to detect periodic patterns in separate features and not in the combined features. The version with amplitude was selected, since it performed better in the case when the data was not normalized, even though in the experiment all data is normalized between zero and one.
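The sketch below shows one possible Keras implementation of such a time2vec-style layer: per feature, k sinusoidal nodes a·sin(w·x + b) plus one linear node, with the inputs connected to separate hidden nodes. It is an interpretation of Kazemi et al. (2019) and Godfrey and Gashler (2017), not the thesis code; all names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

class Time2Vec(layers.Layer):
    """Per-feature sinusoids with amplitude plus a linear term (sketch)."""

    def __init__(self, k, **kwargs):
        super().__init__(**kwargs)
        self.k = k                                   # sinusoidal nodes per input feature

    def build(self, input_shape):
        n = int(input_shape[-1])                     # number of input features
        self.w = self.add_weight(name="w", shape=(n, self.k))      # frequencies
        self.b = self.add_weight(name="b", shape=(n, self.k))      # phase shifts
        self.a = self.add_weight(name="a", shape=(n, self.k))      # amplitudes
        self.w_lin = self.add_weight(name="w_lin", shape=(n,))     # linear weights
        self.b_lin = self.add_weight(name="b_lin", shape=(n,))     # linear biases

    def call(self, x):
        # x has shape (batch, time, n); each feature feeds only its own hidden nodes.
        periodic = self.a * tf.sin(self.w * x[..., None] + self.b)   # (batch, time, n, k)
        linear = self.w_lin * x + self.b_lin                         # (batch, time, n)
        flat_shape = tf.concat([tf.shape(x)[:-1], [-1]], axis=0)
        return tf.concat([linear, tf.reshape(periodic, flat_shape)], axis=-1)

# Usage sketch: Time2Vec(k=170) applied to MinMax normalized input, then the LSTM 1 stack.
```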

4.10 Experimental setup

This section covers how the experiments were set up, the naming of the configurations and the hyper parameter settings.


Figure 12 is a diagram of the experimental setup for the MoodMapper and Safebase data sets. Three different techniques are compared to each other: uniform interval normalization, MinMax normalization and time2vec. The bottom three squares are the normalization techniques applied to the data sets. The time2vec square above the rightmost MinMax normalization square is meant to demonstrate that after the data has been normalized, it is fed into the time2vec layer before being inputted into the LSTM model. The ellipses visualize that the parameter settings are different for each of the techniques. A diagram of the LSTM model used during the experiments is displayed in figure 10.

4.10.1 Preliminary: MoodMapper

The data set was split into a training and a test set. For each participant the first 80 percent of the time series was selected for training and the last 20 percent for the test set. For validation, 25 percent of the training data was used.
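A sketch of the chronological split described above (80/20 train/test, with the last 25 percent of the training portion held out for validation); the index arithmetic is illustrative, and in Keras the validation part can equivalently be taken with validation_split=0.25 in model.fit.

```python
def split_series(data, train_frac=0.8, val_frac_of_train=0.25):
    """Chronological split of a time series into train, validation and test parts."""
    n_train = int(len(data) * train_frac)        # first 80 percent for training
    n_val = int(n_train * val_frac_of_train)     # last 25 percent of training for validation
    train = data[:n_train - n_val]
    val = data[n_train - n_val:n_train]
    test = data[n_train:]                        # last 20 percent for testing
    return train, val, test
```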

Figure 12: Diagram of the experimental setup. The squares represent the techniques, uniform interval normalization, MinMax normalization and time2vec. The ellipses are the hyper parameter settings of the LSTM model and the rectangle is the LSTM model as shown in figure 10.

Table 6: Summary of the labeling of the uniform interval normalization configurations for the MoodMapper data set. First column the assigned label used for identification, second column the number of hours used for the interval length, third column the number of hours between the start of each interval.

Label Interval length (hours) Step size (hours)

6h_s3h 6 3

6h_s6h 6 6

12h_s3h 12 3

12h_s6h 12 6

Table 6 displays the labels for the configurations of uniform interval normalization used for the trials with the MoodMapper data set. The first column is the label used as reference in the report, the second column is the length of the interval in hours and the third column is the number of hours between the starts of two intervals.

Table 7: Sample sizes for each participant and normalization technique. The first column contains the identification of the participants, the second column the sample size for a three hour step size, the third column the sample size for a six hour step size and the fourth column the sample size when MinMax normalization is applied instead of uniform interval normalization.

Participant ID Sample size (step size three hours) Sample size (step size six hours) Sample size (MinMax & time2vec)

sb002 2 856 1 556 8 568

sb003 2 576 760 4 560

sb006 5 416 2 708 16 224

sb008 2 184 1 092 6 528

Table 7 shows the sample sizes for each participant and each step size configuration, where the sample size is the number of data points in the data set. The first column contains the identification of the participants, the second column the sample size when a three hour step size is used, the third column the sample size when a six hour step size is used and the fourth column the sample size when MinMax normalization is applied instead of uniform interval normalization. The sample size for time2vec is the same as for MinMax normalization.

A lower step size means more intervals are created than when a higher step size is used, since the number of hours between the starts of the intervals is smaller. The number of samples has an effect on the LSTM model, and a higher number of samples can increase the performance of a LSTM model.

4.10.2 MoodMapper: LSTM configuration


Table 8: Hyper parameter settings for LSTM 1. The first column is the corresponding name of the parameter in the Keras LSTM API, the second column is the settings for uniform interval normalization, the third column the settings for MinMax normalization and the fourth column the settings for time2vec.

Parameter Setting: uniform interval normalization Setting: MinMax Setting: time2vec

Epoch 500 500 300
Batch size 32 256 200
Learning rate 0.001 0.001 0.0001
Dropout 0.6 0.2 0
Units 75 75 75
t2v K - - 170

Table 8 shows the hyper parameter settings used for uniform interval normalization, MinMax normalization and time2vec. The first column is the name of the parameter, the second column contains the settings for uniform interval normalization, the third column the settings for MinMax normalization and the fourth column the settings for time2vec.

The number of epochs is 500 for uniform interval normalization and MinMax normalization, but 300 for time2vec. The tuning process showed that in all cases the models started having overfitting issues if the number of epochs was higher than 500: the loss function continued to slightly improve during training, but worsened on the validation data.

The batch size determines how often the weights of the model are updated. It was found that for uniform interval normalization a lower batch size gave better performance, while for MinMax and time2vec a higher batch size had to be used to avoid overfitting issues. Since uniform interval normalization has a lower sample size than both MinMax and time2vec and processes fewer samples each epoch, it was expected that a lower batch size for uniform interval normalization and a higher batch size for MinMax and time2vec had to be used.

The learning rate was set to 0.001 for both uniform interval normalization and MinMax normalization, which is the default setting for the Adam optimizer used in the experiment. For time2vec a lower learning rate was set, since the model was prone to overfitting. For LSTM 1, a lower learning rate led to the model not learning enough during training to forecast the test data, while a higher setting led to the model overfitting on the training data, which gave poor performance on the unseen test data. During the tuning process it was observed that all models had tendencies to overfit. To manage the overfitting issue in the LSTM 1 model, dropout layers were added and dropout rates of 0, 0.2, 0.4 and 0.6 were tested. For uniform interval normalization a high dropout rate was required to reduce the overfitting on the training data, while for MinMax a lower dropout rate gave better performance. Since MinMax normalization has a larger sample size than uniform interval normalization, the lower dropout rate supports the findings of Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov (2014), who found that the effect of dropout increases with the sample size up to a certain point, after which the effect declines.
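As a minimal sketch of how the settings in Table 8 could be put together with the Keras API, the snippet below builds a single LSTM layer followed by a dropout layer and a dense output; this layer layout and the helper name build_lstm1 are assumptions for illustration, since the actual architecture of LSTM 1 is defined in section 4.7.1.

    # Minimal sketch, not the exact LSTM 1 architecture: one LSTM layer,
    # a dropout layer and a dense output, compiled with the Adam optimizer.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_lstm1(input_len, n_features, units=75, dropout_rate=0.6, learning_rate=0.001):
        model = models.Sequential([
            layers.Input(shape=(input_len, n_features)),
            layers.LSTM(units),
            layers.Dropout(dropout_rate),   # dropout rate tuned per normalization technique
            layers.Dense(n_features),       # one forecast step for every feature
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                      loss="mse")
        return model

    # Settings from Table 8 (uniform interval normalization column), with a
    # placeholder feature count:
    # model = build_lstm1(input_len=56, n_features=..., units=75, dropout_rate=0.6)
    # model.fit(X_train, y_train, epochs=500, batch_size=32, validation_data=(X_val, y_val))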


4.10.3 MoodMapper: Input and output sequence

Table 9: Summary of the input and output sequence configurations for the MoodMapper data set. The first column contains the label of the normalization technique, the second column the number of data points in the input sequence and the third column the number of data points in the output sequence.

Label       Input sequence (data points)    Output sequence (data points)
6h_s3h      56                              1
6h_s6h      56                              1
12h_s3h     56                              1
12h_s6h     56                              1
MinMax      168                             1
Time2vec    168                             1

Table 9 displays the number of data points in the input and output sequences. The first column has the label of the normalization technique, where the top four rows are the configurations of uniform interval normalization. The second column is the number of data points in the input sequence, which is the number of data points the LSTM model uses as input. The third column contains the number of data points in the output sequence, which is the number of data points forecast by the LSTM model.

For uniform interval normalization the input sequence of 56 data points covers one week for the three hour step size configurations and two weeks for the six hour step size configurations. The output sequence is one data point, which corresponds to one interval. For MinMax and time2vec the input sequence covers one week, and the output sequence is one data point, equivalent to one hour. In the case of MinMax and time2vec, it was assumed that the longer input sequence, counted in data points, is needed to cover enough days for the model to learn daily behavior.
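A small sketch of how such input and output sequences can be cut from a normalized series is shown below; the stride of one data point between consecutive windows and the helper name make_windows are assumptions, as the thesis does not prescribe the windowing code.

    import numpy as np

    def make_windows(series: np.ndarray, input_len: int, output_len: int = 1):
        """Slide over a (timesteps, features) array and return inputs of shape
        (samples, input_len, features) with targets of shape (samples, output_len, features)."""
        X, y = [], []
        for start in range(len(series) - input_len - output_len + 1):
            X.append(series[start:start + input_len])
            y.append(series[start + input_len:start + input_len + output_len])
        return np.array(X), np.array(y)

    # Table 9: 56 data points in and 1 out for the uniform interval configurations,
    # 168 in and 1 out for MinMax and time2vec.
    # X, y = make_windows(normalized_series, input_len=56, output_len=1)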

4.10.4 Preliminary: Safebase

The data set was split into a training set and a test set. For each configuration the first 80 percent of the time series was selected for training and the last 20 percent for the test set. For validation, 25 percent of the training data was used.
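A minimal sketch of this chronological split, assuming the validation part is taken as the last 25 percent of the training portion, could look as follows.

    import numpy as np

    def chronological_split(series: np.ndarray):
        """First 80 % for training, last 20 % for testing; the last quarter of the
        training portion is held out for validation (illustrative helper)."""
        train_end = int(len(series) * 0.8)
        train, test = series[:train_end], series[train_end:]
        val_start = int(len(train) * 0.75)
        return train[:val_start], train[val_start:], test  # train, validation, test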

Table 10: Summary of the labeling of the configurations of uniform interval normalization for the Safebase data set. The first column is the label used for identification, the second column the number of minutes in each interval and the third column the number of minutes between the start of two intervals.

Label Interval length (minutes) Step size (minutes)

15m_s3 15 3

30m_s15 30 15

60m_s20 60 20


Table 10 displays the labels for the configurations of uniform interval normalization used for the trials with the Safebase data set. The first column is the label used as a reference in the report, the second column is the length of the interval in minutes and the third column is the number of minutes between the start of two intervals.

Table 11: Summary of the sample size. The first column contains the label of the configuration and the second column contains the sample size.

Label                Sample size
15m_s3               4 033
30m_s15              1 345
60m_s20              1 009
MinMax & time2vec    20 165

Table 11 summarizes the sample size per configuration. The first column contains the label of the configuration, where the first three rows are the uniform interval normalization configurations and the final row is MinMax normalization and time2vec. The second column displays the number of data points in the sample.

4.10.5 Safebase: LSTM configurations

Uniform interval normalization and MinMax normalization are run on the LSTM 1 model, and time2vec on the LSTM+t2v model.

Table 12: Settings of the hyper parameters for the LSTM models. First column name of the parameter, second column the settings for uniform interval normalization, third column settings for MinMax normalization and fourth column settings for time2vec.

Parameter        Setting: uniform interval normalization    Setting: MinMax    Setting: time2vec
Epoch            400                                        600                400
Batch size       128                                        128                128
Learning rate    0.0001                                     0.0001             0.0001
Dropout          0                                          0.2                0
Units            125                                        75                 75
t2v K            -                                          -                  128


The model trained on MinMax normalized data was prone to overfitting. However, by introducing a level of dropout the overfitting issues were solved. A higher number of epochs was then tested on the model with dropout, which led to a lower value of the loss function on the test set.

The batch size and learning rate were set to the same values for all models and configurations. Uniform interval normalization and time2vec did not experience overfitting issues.
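For reference, the t2v K parameter in Tables 8 and 12 sets the number of periodic components of the time2vec layer. The sketch below is an illustrative Keras layer with one linear component and k sinusoidal components; the layer actually used in LSTM+t2v is described in section 4.8, so the weight shapes and initializers here are assumptions.

    # Illustrative time2vec layer: one linear (non-periodic) component and k
    # sinusoidal components, where k corresponds to the "t2v K" hyper parameter.
    import tensorflow as tf

    class Time2Vec(tf.keras.layers.Layer):
        def __init__(self, k, **kwargs):
            super().__init__(**kwargs)
            self.k = k

        def build(self, input_shape):
            d = input_shape[-1]
            self.w0 = self.add_weight(name="w0", shape=(d, 1), initializer="glorot_uniform")
            self.b0 = self.add_weight(name="b0", shape=(1,), initializer="zeros")
            self.w = self.add_weight(name="w", shape=(d, self.k), initializer="glorot_uniform")
            self.b = self.add_weight(name="b", shape=(self.k,), initializer="zeros")

        def call(self, x):
            linear = tf.tensordot(x, self.w0, axes=1) + self.b0          # non-periodic part
            periodic = tf.sin(tf.tensordot(x, self.w, axes=1) + self.b)  # k periodic parts
            return tf.concat([linear, periodic], axis=-1)                # (..., k + 1) features

    # e.g. Time2Vec(k=128) for the Safebase configuration in Table 12.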

4.10.6 Safebase: Input and output sequence

Table 13: Summary of the input and output sequence configurations. The first column contains the label of the normalization technique, the second column the number of data points in the input sequence and the third column number of data points in the output sequence.

Label Input sequence (data points) output sequence (data points)

15m_s3 20 1

30m_s15 20 1

60m_s20 20 1

MinMax 24 1

Time2vec 24 1

Table 13 displays the number of data points in the input and output sequences for the different configurations. The first three rows are the uniform interval normalization configurations, for which twenty data points are used as input. For MinMax normalization and time2vec, 24 data points are used.

4.11 Metric

Since RMSE is one of the most commonly used metrics to evaluate machine learning techniques, it is selected as one of the metrics for the experiments.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (12)

RMSE is defined by equation 12, where n is the number of samples, y_i is the target value and \hat{y}_i is the predicted value. Larger errors have a disproportionately large effect on RMSE compared to smaller errors, so one large error is punished more than several smaller errors. This is a desired effect in the context of the experiment, since several forecasts that only miss by a few units are not as bad as forecasting one large value instead of a small value, or a small value instead of a large value.


MAAPE = \frac{1}{n} \sum_{i=1}^{n} \arctan\left( \left| \frac{a_i - y_i}{a_i} \right| \right)    (13)

Equation 13 shows the definition of MAAPE, where a_i is the actual value and y_i is the predicted value. The value of MAAPE ranges from 0 to π/2, and a smaller value is better than a larger one.
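A small numpy sketch of the two metrics, following equations 12 and 13 and assuming all actual values are non-zero so the MAAPE ratio is defined, is given below.

    import numpy as np

    def rmse(y_true, y_pred):
        """Root mean square error, equation 12."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

    def maape(y_true, y_pred):
        """Mean arctangent absolute percentage error, equation 13 (each term bounded by pi/2)."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return float(np.mean(np.arctan(np.abs((y_true - y_pred) / y_true))))

    # One large miss dominates RMSE more than several small ones:
    # rmse([1, 10], [2, 2])   -> about 5.70
    # maape([1, 10], [2, 2])  -> about 0.73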

5. Results

This section summarizes the results of the experiment. Each configuration was executed ten times in order to reduce the effect that good or bad starting weights would have on the result.

5.1 MoodMapper

The results are displayed with box plots for the selected participants in the MoodMapper data set. The errors are calculated over all features in the forecast of the multivariate time series.

The box plots in this chapter are structured in the following manner. The configurations are placed on the x-axis, where 12h_s3 is an interval with a length of twelve hours and three hours between each interval, 12h_s6 is an interval with a length of twelve hours and six hours between each interval, 6h_s3 is an interval with a length of six hours and three hours between each interval, 6h_s6 is an interval with a length of six hours and six hours between each interval, MM is the MinMax normalization and t2v is time2vec. The metric is placed on the y-axis; for both RMSE and MAAPE a lower value is better than a higher.

Figure 13 shows the result for the MAAPE metric for all participants. The uniform interval normalization configurations performed better than MinMax normalization and time2vec for all participants, with the twelve hour interval length and three hour step size as the best setting.

Figure 14 shows the result for the RMSE metric for all participants. The uniform interval normalization configurations achieve a lower RMSE than MinMax normalization and time2vec for all participants. The best setting for uniform interval normalization was the twelve hour interval length with a three hour step size.

Tables 14-17 summarize the ranks of the techniques. The first column is the name of the technique, the second column the rank according to RMSE, the third column the rank according to MAAPE, the fourth column the mean RMSE over ten runs and the fifth column the mean MAAPE over ten runs.


Table 14: Summary of rank based on RMSE and MAAPE for participant sb002. Type of data representation in the first column, rank based on RMSE in the second column, rank according to MAAPE in the third column, mean RMSE over ten trials in the fourth column and mean MAAPE over ten trials in the fifth column.

Technique                        Rank RMSE    Rank MAAPE    RMSE (mean)    MAAPE (mean)
MinMax (MM)                      5            5             32.07          1.31
Time2vec (t2v)                   6            6             32.2           1.32
interval 6h step 3h (6h_s3)      4            3             13.68          1.06
interval 6h step 6h (6h_s6)      3            4             12.45          1.07
interval 12h step 3h (12h_s3)    1            1             8.41           0.9
interval 12h step 6h (12h_s6)    2            2             9.14           0.92

Table 14 summarizes the rank assigned based on the box plots for participant sb002. The best configuration for participant sb002 was the twelve hour interval with a step size of three hours, and the other interval configurations performed better than MinMax and time2vec.

Table 15: Summary of rank based on RMSE and MAAPE for participant sb003. Type of data representation in first column, rank based on RMSE in second column, rank according to MAAPE in third column, mean RMSE over ten trials in fourth column and mean of MAAPE over ten trials in fifth column.

Technique                        Rank RMSE    Rank MAAPE    RMSE (mean)    MAAPE (mean)
MinMax (MM)                      4            4             274.26         1.15
Time2vec (t2v)                   4            4             276.27         1.56
interval 6h step 3h (6h_s3)      3            3             121.36         0.92
interval 6h step 6h (6h_s6)      3            3             122.9          0.91
interval 12h step 3h (12h_s3)    1            1             82.91          0.73
interval 12h step 6h (12h_s6)    2            2             102.73         0.85

Table 15 summarizes the rank assigned based on the box plots for participant sb003. The best configuration for participant sb003 was the twelve hour interval with a step size of three hours, and the other interval configurations performed better than MinMax and time2vec.

Table 16: Summary of rank based on RMSE and MAAPE for participant sb006. Type of data representation in first column, rank based on RMSE in second column, rank according to MAAPE in third column, mean RMSE over ten trials in fourth column and mean of MAAPE over ten trials in fifth column.

Technique                        Rank RMSE    Rank MAAPE    RMSE (mean)    MAAPE (mean)
MinMax (MM)                      5            4             133.93         1.24
Time2vec (t2v)                   5            4             136.76         1.25
interval 6h step 3h (6h_s3)      3            3             54.79          0.94
interval 6h step 6h (6h_s6)      4            3             88.79          0.95
interval 12h step 3h (12h_s3)    1            1             31.46          0.73
interval 12h step 6h (12h_s6)    2            2             46.3           0.77

Table 16 summarizes the rank assigned based on the box plots for participant sb006. The best configuration for participant sb006 was the twelve hour interval with a step size of three hours, and the other interval configurations performed better than MinMax and time2vec.

Table 17: Summary of rank based on RMSE and MAAPE for participant sb008. Type of data representation in first column, rank based on RMSE in second column, rank according to MAAPE in third column, mean RMSE over ten trials in fourth column and mean of MAAPE over ten trials in fifth column.

Technique                        Rank RMSE    Rank MAAPE    RMSE (mean)    MAAPE (mean)
MinMax (MM)                      5            5             171.63         1.29
Time2vec (t2v)                   4            5             160.7          1.28
interval 6h step 3h (6h_s3)      3            3             89.61          1
interval 6h step 6h (6h_s6)      3            4             89.1           0.97
interval 12h step 3h (12h_s3)    1            1             47.15          0.77
interval 12h step 6h (12h_s6)    2            2             82             0

Table 17 summarizes the rank assigned based on the box plots for participant sb008. The best configuration for participant sb008 was the twelve hour interval with a step size of three hours, and the other interval configurations performed better than MinMax and time2vec.

5.2 Safebase


Figure 15 shows the result for the MAAPE score for the Safebase data set, where a lower score is better than a higher score, and figure 16 displays a zoomed in view to visualize the differences between the techniques. The fifteen minute interval with a step size of three minutes has the best MAAPE score, but overlaps with the sixty minute interval. However, since the distribution (except one outlier) of the MAAPE score for the fifteen minute interval is below the mean of the MAAPE score for the sixty minute interval (twenty minute step size), the difference is significant. Time2vec and MinMax normalization have the third best MAAPE score; the difference between these two techniques is likely not significant. The thirty minute interval length has the worst MAAPE score.

References
