
Anomaly Detection in Signaling Data Streams


Abstract

This work has aimed to develop a method which can be used to detect anomalies in signaling data streams at a telecommunication company. This has been done by modeling and evaluating three prediction models and two detection methods. The prediction models which have been implemented are Autoregressive Integrated Moving Average (ARIMA), Holt-Winters and a Benchmark model; furthermore, two detection methods have been tested: Method 1 (M1), which is based on a robust evaluation of previous prediction errors, and Method 2 (M2), which is based on the standard deviation of previous data. From the evaluation of the work, we could conclude that the best performing combination of prediction and detection methods was achieved with a modified Benchmark model and M1 detection.

Sammanfattning

This work has aimed to evaluate methods for detecting anomalies in signaling data at a telecommunications company. This has been done by modeling and evaluating three prediction methods and two detection methods. The chosen prediction methods are ARIMA, Holt-Winters and a Benchmark model. The detection methods which have been tested are Method one (M1), which is based on a robust evaluation of previous prediction errors, and Method two (M2), which is based on the standard deviation of previous data. After the evaluation of the work, we can conclude that, given


Acknowledgements

It has been a pleasure to write our master thesis at Tele2 IoT and it has been truly inspiring to see how passionate people work closely together to provide smart IoT solutions and new technology. We would like to thank Richard Gustavsson for giving us this opportunity and for all his help, support and rewarding input during discussions. We would also like to thank Isac Jinton for providing us with data and for his tremendous help and technical support. Finally, we would like to express our gratitude to our supervisor Professor Jun Yu for his considerable contribution to the overall improvement of this study. Thank you.


Contents

1 Introduction
  1.1 Background
    1.1.1 Anomaly Detection
    1.1.2 Related Work and Contribution
    1.1.3 GSM System Overview
    1.1.4 Data Set
  1.2 Motivation
  1.3 Objective
  1.4 Scope and Limitations
  1.5 Outline
2 Theory
  2.1 Stationarity
  2.2 Correlation
    2.2.1 Autocorrelation Function
    2.2.2 Partial Autocorrelation Function
  2.3 Time Series Decomposition
  2.4 Holt-Winters
  2.5 Autoregressive Integrated Moving Average
  2.6 State Space Model
    2.6.1 ARMA and ARIMA models
  2.7 Kalman Filtering
  2.8 Maximum Likelihood and Conditional Sum of Squares
  2.9 L-BFGS-B
3 Method
  3.1 Data Inference
  3.2 Prediction
    3.2.1 Data Processing
    3.2.2 Cross Validation
    3.2.3 Existing Solution
    3.2.4 ARIMA
    3.2.5 Holt-Winters
  3.3 Detection
    3.3.1 Method One
    3.3.2 Method Two
    3.3.3 Validation
4 Results
  4.1 Data Inference
  4.2 Prediction
    4.2.1 ARIMA
    4.2.2 Holt-Winters
    4.2.3 Benchmark Model
    4.2.4 Model Comparison
  4.3 Detection
5 Discussion and Conclusion
  5.1 Evaluation
    5.1.1 Prediction
    5.1.2 Detection Methods
  5.2 Conclusion


1 Introduction

1.1 Background

Tele2 Internet of Things (IoT) is a business unit within Tele2 focusing on exploring the enterprise market for IoT. It designs, builds and operates IoT connectivity and connectivity-close services. To be competitive, the connectivity service system is monitored to assure that services are stable, transparent and cost efficient. To further increase customer satisfaction, it is desirable to detect more quickly when the system is not working properly, so that problems can be resolved sooner and customers informed earlier. One way of doing so is to use anomaly detection techniques to model the system's normal state and consider noticeable deviations from this profile as anomalies. When such a model is implemented, the system is able to automatically detect anomalies and alarm Tele2 when there is a problem in the network. While an anomaly detection model can provide the means to detect the problems, the evaluation and interpretation of the anomaly remains with the human expert. A detector that sends too many alarms therefore has little practical use; hence the sensitivity of the algorithm must be tuned in order to control the rate of false positives.

1.1.1 Anomaly Detection

Anomaly detection refers to the problem of finding sets in the data which do not follow the expected pattern. Today there is a large number of different methods and techniques used to find anomalies within different domains, such as fraud detection, health care and network traffic. Anomalies can arise, for instance, due to instrumental error, changes in system behaviour, human error or natural deviations in populations. Depending on the characteristics of the data, anomalies can be divided into subgroups.

Point anomalies are instances of data which, in relation to the rest of the data, can be considered anomalous. Collective anomalies are such that a collection of related data is anomalous with respect to the entire data set; the individual data points in that collection do not have to be anomalous by themselves, although their occurrence as a collection is. Finally, contextual anomalies are instances of data which are considered anomalous conditional on a specific context, such as time in sequential data or position in spatial data. Within sequential data, time-series data and event data have been widely used in anomaly detection. [Chandola et al., 2009]

1.1.2 Related Work and Contribution

Anomaly detection in time-series data is a well-studied field. Here we present a short summary of some of the previous work that has been done in similar domains.

One approach to finding anomalies in time series data is to identify the states collected from a device, and then use clustering algorithms in order to cluster the states. An approach like this is introduced by Salvador et al. (2004), who use Gecko algorithms to identify the different states of the device. A RIPPLE algorithm [Salvador et al., 2004] is then used to create rules for each state. The approach is used to learn the device's "normal" behavior, and any deviation from this is considered an anomaly.


Another approach, proposed by Zhu et al. (2011), detects anomalies with dynamic ARIMA models at the stage of parameter estimation. The aim is to identify sudden changes in the parameters, which is an indication of abnormal behaviour in the system. The detection is made possible by transitioning the ARIMA model to a state space model, in which Kalman filtering provides model parameter estimation and anomaly detection. The General Likelihood Ratio (GLR) rule maximizes the log likelihood ratio over a window of inputs and decides when a threshold is overstepped, which suggests an anomaly.

An alternative approach, used when detecting anomalies in network data, is the Holt-Winters method (HW). In [Szmit and Szmit, 2012], the traditional and a modified HW method are used in order to account for the double seasonal behavioral characteristics of network volume data. The modified algorithm applied three types of thresholds: one for deviations from the expectation according to the first seasonality, one for deviations from the second seasonality, and one for deviations from both seasonalities. The research concluded that the modified algorithm showed a slight decrease in the rate of false positives.

Rather than observing the amplitude in the time domain, one can instead analyze the spectral content of the time series. The Fourier transform, which represents a function of time as a combination of frequencies, is a popular anomaly detection approach used on network traffic data. However, it is only suitable for stationary signals, which sometimes makes it difficult to apply. Another popular time-frequency transformation, also widely used in anomaly detection, is the wavelet transform. One advantage, among others, of this method is that it can be applied to non-stationary signals. [Huang et al., 2006, Dainotti et al., 2006]

1.1.3 GSM System Overview


Figure 1: A GSM subset describing how the nodes, which are involved when collecting MAP-protocol data, are connected. The cloud represents the probe where the data is collected.

In Figure 1, a very simplified schematic diagram of a subset of the elements in a GSM system can be observed. The GSM nodes in the figure can briefly be described as:

• UE - User equipment (IoT device). Each device is equipped with a SIM card which stores the International Mobile Subscriber Identity (IMSI) and the Mobile Subscriber Integrated Services Digital Network Number (MSISDN). IMSI is an internationally unique identifier for the subscriber and MSISDN is the phone number of the user.

• BTS - Base transceiver station, enables the subscriber to establish a wireless connection. In theory, one BTS can cover a radius of 35 km, although this is rare in practice due to limited capacity for serving simultaneous subscribers.

• MSC - Mobile switching center, the central element of the network. It is responsible for registration of device subscribers, call establishment, routing, etc.

• SGSN - Serving GPRS support node, responsible for access point name (APN) routing.

• HLR - Home location register, a database that contains a record for each subscriber, where information about the roaming profile and individually available services is stored. For example, a device may only be allowed to roam in northern Europe and restricted to use data services but not voice or SMS.

• VLR - Visitor location register, each MSC has an associated VLR that contains a copy of the HLR records of the subscribers that are currently served by the MSC. [Sauter, 2010]

1.1.4 Data Set

MAP is the protocol for the signals that are sent between the device and the HLR (see Figure 1). When a device is switched on, an attachment procedure starts in order for the device to connect to the network. This procedure is initiated when a Send Authentication Information (SAI) request is sent from the device to the HLR, where the device IMSI is compared to the subscriber record to check if the device is authorized in the network and allowed subscriber services. If the request is verified, the MSC sends an Update Location (UL) message to the HLR to verify that the subscriber has roamed into a new area. If that is successful, the SGSN sends an Update GPRS Location (UL_GPRS) message to the HLR. When the HLR receives a UL message it initiates a Cancel Location (CL) procedure, so that the subscriber is removed from the network to which it was previously attached.


• SAI - Send Authentication Information
• UL - Update Location
• UL_GPRS - Update GPRS Location
• CL - Cancel Location

In this work, we aggregate the number of SAI, UL, UL_GPRS and CL requests in every 5 minute time period and analyze how these count metrics vary with time. An abnormally large or small number means that there might be a bottleneck somewhere in the GSM system, which should trigger an alarm that notifies the experts at Tele2.

1.2 Motivation

This work aims to explore statistical methods which can be implemented in a real-time traffic monitoring system for detecting anomalies. The platform will eventually be used to predict device behaviour in order to have better control over the networks. Focus will be put on statistical models which are easy to interpret and modify, rather than black box solutions. An existing, though not yet implemented, solution will be used as a performance benchmark.

1.3 Objective

The aim of this work is to build a suitable statistical model in order to predict device signaling and to give guidance on how to improve the existing solution for detecting network issues.

1.4 Scope and Limitations

Two statistical methods, used to predict signaling behaviour, will be tested and evaluated. The methods will be used to classify and evaluate signaling, such that abnormal behaviour can be identified. The achieved performance, when implementing the two methods, will be evaluated on the criteria cost efficiency, latency and how well the model adapts to new behaviors in the network and customer dimensions. The performance will also be compared to the benchmark model.

Due to external restrictions, there are some limitations in this thesis. Firstly, due to time limitations, the solution will not be implemented in the real-time system; hence the solution will only be validated through simulations. Furthermore, fine-granularity data has only been collected for a limited period of time, which prevents our models from accounting for yearly trend and seasonality.

1.5 Outline

The rest of this thesis is structured in the following manner. In Chapter 2 the reader is introduced to the theoretical concepts that are needed in order to understand the models used in this work. Chapter 3 covers how the theory has been applied in practice in order to build the models. In Chapter 4 the results are presented following the structure of the theory, hence the order in which everything was done. Discussion of the results and the final conclusions are presented in Chapter 5.

2 Theory

2.1 Stationarity

In order to build most existing statistical time series forecasting models, a strong assumption is that the time series can be rendered approximately stationary, which can be achieved by the use of mathematical transformations. A strictly stationary time series is one for which the probabilistic behavior of every collection of values $\{y_{t_1}, y_{t_2}, ..., y_{t_k}\}$ is identical to that of the time-shifted set $\{y_{t_1+h}, y_{t_2+h}, ..., y_{t_k+h}\}$.

Definition 1: We define a strictly stationary process as

$$P\{y_{t_1} \le c_1, ..., y_{t_k} \le c_k\} = P\{y_{t_1+h} \le c_1, ..., y_{t_k+h} \le c_k\} \quad (1)$$

for all $k = 1, 2, ...$, all time points $t_1, t_2, ..., t_k$, all numbers $c_1, c_2, ..., c_k$, and all time shifts $h$.

Definition 1 is often too strict to apply to data sets and therefore we define a less strict notion, the weakly stationary time series. A continuous-time random process $y_t$ which is weakly stationary satisfies the following restrictions:

• The mean value function is independent of time, $E[y_t] = \mu_t = \mu_{t+\tau} = \mu$, $\forall \tau \in \mathbb{R}$; it is constant and does not depend on time $t$.

• The autocovariance function, $\gamma_y(s, t) = \mathrm{cov}(y_s, y_t) = E[(y_s - \mu)(y_t - \mu)]$, depends on the time points $s$ and $t$ only through their difference $|s - t|$.

Hence, since the mean function $E(y_t) = \mu_t$ of a stationary time series is independent of time $t$, it can be written as $\mu_t = \mu$. [Shumway and Stoffer, 2011]

2.2 Correlation

Two important measures of correlation in time-series analysis are the autocorrelation function (ACF) and the partial autocorrelation function (PACF).

2.2.1 Autocorrelation Function

In statistics, the ACF is the correlation between values of the process at different time lags. Hence, it measures the linear predictability of a time series at time $t$, say $y_t$, using only the value of $y_s$, given that $t > s$. If $y_t$ can be perfectly predicted from $y_s$ through a linear relationship, $y_t = \beta_0 + \beta_1 y_s$, the correlation will be $\pm 1$. The ACF is defined as [Shumway and Stoffer, 2011]

$$\rho(s, t) = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}} \quad (2)$$

2.2.2 Partial Autocorrelation Function

The partial autocorrelation function is used to calculate the correlation between $y_t$ and $y_{t-h}$, for $h \ge 2$, without taking into account the linear effect of the set in between, $\{y_{t-1}, y_{t-2}, ..., y_{t-h+1}\}$. Hence the partial autocorrelation function breaks the chain of dependence by removing the effect of the mutual set of variables between the two values. Formally the partial autocorrelation for a mean-zero stationary time series is defined as:

$$\phi_{hh} = \mathrm{corr}(y_{t+h} - \hat{y}_{t+h},\; y_t - \hat{y}_t), \quad h \ge 2 \quad (3)$$

where $\hat{y}_{t+h}$, for $h \ge 2$, denotes the regression of $y_{t+h}$ on $\{y_{t+h-1}, y_{t+h-2}, ..., y_{t+1}\}$ and $\hat{y}_t$ denotes the regression of $y_t$ on the same set:

$$\hat{y}_{t+h} = \beta_1 y_{t+h-1} + \beta_2 y_{t+h-2} + ... + \beta_{h-1} y_{t+1}$$
$$\hat{y}_t = \beta_1 y_{t+1} + \beta_2 y_{t+2} + ... + \beta_{h-1} y_{t+h-1} \quad (4)$$

Because of stationarity, the coefficients $\beta_1, ..., \beta_{h-1}$ are the same in both equations in Equation (4). This gives that $(y_{t+h} - \hat{y}_{t+h})$ and $(y_t - \hat{y}_t)$ in Equation (3) are uncorrelated with $\{y_{t+1}, y_{t+2}, ..., y_{t+h-1}\}$, and the PACF $\phi_{hh}$ is the correlation between $y_{t+h}$ and $y_t$ with the linear dependency on $\{y_{t+1}, y_{t+2}, ..., y_{t+h-1}\}$ removed. [Shumway and Stoffer, 2011]
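As a rough illustration of how these two functions are inspected in practice, the following sketch computes sample ACF and PACF values with Python's statsmodels (the thesis itself uses R); the series and its daily period are hypothetical.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# Hypothetical 5-minute SAI counts: noise around a daily (288-period) cycle.
rng = np.random.default_rng(0)
t = np.arange(288 * 14)
y = 1000 + 200 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 50, t.size)

# Sample ACF and PACF; a slowly decaying ACF and a spike near lag 288 would
# point towards differencing and a seasonal term.
acf_vals = acf(y, nlags=300)
pacf_vals = pacf(y, nlags=50)
print(acf_vals[:5], pacf_vals[:5])
```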

2.3 Time Series Decomposition

A model of a time series that captures the characteristics of the series may be formulated in many ways. One way is to decompose the series into several components and express a model as a combination of the individual components. A classical time-series decomposition divides the data into three components, which can be described as:

• Trend (T): The long-term progression of the time series.
• Seasonal (S): A repeating pattern with known periodicity.
• Remainder (E): All other unexplained variations of the series.

A purely additive model of the time series $y$ at time $t$ can be expressed as:

$$y_t = T_t + S_t + E_t.$$

In such a model, the components can be extracted or removed by subtraction. Similarly, a purely multiplicative model is expressed as the product of the individual components, and components can be removed by division. Further, it is also possible for a model to have a mix of components with multiplicative and additive attributes.

If the amplitude of the seasonal fluctuations does not vary with time, the model should contain an additive seasonal component. Otherwise, the seasonal component should be multiplicative. Equivalently, if the variation around the trend is constant with time the trend component should be additive, and otherwise multiplicative. [Hyndman and Athanasopoulos, 2014]
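The following is a minimal sketch of a classical additive decomposition, assuming Python with statsmodels rather than the R tooling used in the thesis; the series and the 288-period daily season are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical series with a daily season (288 five-minute periods) and a weak trend.
rng = np.random.default_rng(1)
t = np.arange(288 * 21)
y = pd.Series(0.05 * t + 200 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 30, t.size))

# Classical additive decomposition: y_t = T_t + S_t + E_t
result = seasonal_decompose(y, model="additive", period=288)
trend, seasonal, remainder = result.trend, result.seasonal, result.resid
```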

2.4 Holt-Winters

In general, exponential smoothing methods are based on the idea that forecasts are weighted combinations of past observations. The weights are "smoothed" in such a way that they decrease exponentially as the observations get older, making recent observations weighted relatively more than older observations. [Hyndman et al., 2008]

The simplest exponential smoothing method, Single Exponential Smoothing (SES), can be expressed as:

$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t, \quad (5)$$

where the forecast value $\hat{y}_{t+1}$ is a combination of the most recent observation $y_t$, weighted by $\alpha$, and the most recent forecast $\hat{y}_t$, weighted by $(1 - \alpha)$. Since the method is applied recursively on each successive observation in the series, $\hat{y}_t$ and each previous forecast is computed equivalently as the weighted average of the corresponding previous observation and the previous forecast. This implies that $\hat{y}_{t+1}$ represents a weighted moving average of all past observations with the weights decreasing exponentially. Making these substitutions, Equation (5) may equivalently be expressed as:

$$\hat{y}_{t+1} = \alpha y_t + \alpha(1 - \alpha)y_{t-1} + \alpha(1 - \alpha)^2 y_{t-2} + \cdots$$

An assumption of the SES model is that the data fluctuate around a stable mean. However, if the data exhibit trend or seasonality this is not the case, so using SES would be insufficient to fit the data. The Holt-Winters (HW) method extends the idea behind SES by introducing additional smoothing parameters to account for trend and seasonality in the data. Hence the HW method comprises one forecasting equation and three smoothing equations: one for the level $l_t$, one for the trend $b_t$ and one for the seasonal $s_t$ component at time $t$.

The level component can be interpreted as the local mean of the data and is mathematically the weighted average between the seasonally adjusted observation and the non-seasonal forecast. Depending on the nature of the data, an additive seasonal or multiplicative seasonal variation of HW can be applied. If the amplitude of the season is constant over time, the additive variation of the model is more appropriate. When the amplitude is not constant the multiplicative model should be used. The equations for a purely additive HW model are expressed as:

Level: $l_t = \alpha(y_t - s_{t-m}) + (1 - \alpha)(l_{t-1} + b_{t-1})$
Trend: $b_t = \beta(l_t - l_{t-1}) + (1 - \beta)b_{t-1}$
Seasonal: $s_t = \gamma(y_t - l_{t-1} - b_{t-1}) + (1 - \gamma)s_{t-m}$
Forecast: $\hat{y}_{t+h|t} = l_t + b_t h + s_{t-m+h_m^+}$,

where $(\alpha, \beta, \gamma)$ are smoothing parameters for level, trend and season respectively, $m$ is the length of the season, $\hat{y}_{t+h|t}$ is the forecast for $h$ periods ahead and $h_m^+ = [(h - 1) \bmod m] + 1$. The smoothing parameters are usually restricted to the interval $[0, 1]$ and need to be estimated in order to initialize the model. [Hyndman and Athanasopoulos, 2014]
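To make the recursions concrete, below is a minimal Python sketch of the additive Holt-Winters updates and the h-step forecast. The crude initialization and the function name are our own illustrative choices, not the initialization procedure described later in Section 3.2.5.

```python
import numpy as np

def holt_winters_additive(y, m, alpha, beta, gamma, h=1):
    """One pass of additive Holt-Winters; a minimal sketch, not the thesis implementation.

    y : observed series (at least two full seasons for the crude initialization below)
    m : season length (e.g. 288 five-minute periods per day)
    alpha, beta, gamma : smoothing parameters in [0, 1]
    h : forecast horizon
    """
    y = np.asarray(y, dtype=float)
    # Crude initialization: level = mean of first season, trend = average change per
    # period between the first two seasons, seasonal components = deviations of the
    # first season from its mean.
    level = y[:m].mean()
    trend = (y[m:2 * m].mean() - y[:m].mean()) / m
    season = list(y[:m] - y[:m].mean())

    for t in range(m, len(y)):
        s_prev = season[t - m]
        new_level = alpha * (y[t] - s_prev) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        # Seasonal update uses the previous level and trend, as in the equations above.
        season.append(gamma * (y[t] - level - trend) + (1 - gamma) * s_prev)
        level, trend = new_level, new_trend

    # h-step-ahead forecast: level + h*trend + matching seasonal component
    return level + h * trend + season[len(y) - m + ((h - 1) % m)]
```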

2.5 Autoregressive Integrated Moving Average

The focus of classical regression models is on the relationship between a dependent variable and the independent variables, also called predictors. For time series it is desirable to allow the dependent variable to be influenced by its own past values. The ARIMA model consists of an Autoregressive model and a Moving average model with an integrated differencing part to adjust for non-stationarity. [Shumway and Stoffer, 2011]

Backshift Operator The backshift operator, denoted by $B$, operates on elements of a time series to produce the previous elements in the series. The operator can also be extended to higher powers:

$$By_t = y_{t-1}$$
$$B^2 y_t = B(By_t) = By_{t-1} = y_{t-2}$$
$$B^k y_t = y_{t-k} \quad (6)$$

The first difference $\nabla y_t = y_t - y_{t-1}$ can then be rewritten as follows:

$$\nabla y_t = (1 - B)y_t \quad (7)$$

which works similarly for the $d$-th difference:

$$\nabla^d y_t = (1 - B)^d y_t \quad (8)$$
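In code, ordinary and seasonal differencing reduce to simple shifts; the following Python/NumPy snippet is just an illustration on a toy series.

```python
import numpy as np

y = np.array([10., 12., 15., 14., 18., 21.])  # toy series

# First difference: (1 - B) y_t = y_t - y_{t-1}
d1 = np.diff(y, n=1)

# Second difference: (1 - B)^2 y_t, i.e. the difference of the first difference
d2 = np.diff(y, n=2)

# Seasonal difference with period m: (1 - B^m) y_t = y_t - y_{t-m}
m = 2
ds = y[m:] - y[:-m]
```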


Autoregressive Models are based on the assumption that current values of a stationary time series, $y_t$, can be expressed as a function of the $p$ past values of the same time series. A model which uses the last window of length $p$ to forecast the current value of $y_t$ is denoted AR($p$) and defined as

$$y_t = \sum_{i=1}^{p} \phi_i y_{t-i} + \alpha + \epsilon_t$$

The coefficients $\phi_1, ..., \phi_p$ and $\alpha$ have to be learned from the training data. In an over-determined system of equations the coefficients can be approximated with least squares regression in order to minimize the squared error. Also, $\epsilon_t$ is assumed to be white noise, $\epsilon_t \in WN(0, \sigma^2)$. [Aggarwal, 2013]

The backshift operator in Equation (6) gives a useful form of writing the AR($p$) model with mean equal to zero, as

$$(1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)y_t = \phi(B)y_t = \epsilon_t$$

where the autoregressive operator is defined as

$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - ... - \phi_p B^p$$

Iterating backwards $k$ times for the first-order model AR(1) with mean equal to zero, defined as $y_t = \phi y_{t-1} + \epsilon_t$, where $|\phi| < 1$ and $y_t$ is stationary, gives

$$y_t = \phi y_{t-1} + \epsilon_t = \phi(\phi y_{t-2} + \epsilon_{t-1}) + \epsilon_t = \phi^2 y_{t-2} + \phi\epsilon_{t-1} + \epsilon_t = \cdots = \phi^k y_{t-k} + \sum_{j=0}^{k-1} \phi^j \epsilon_{t-j}$$

Hence, by continuing to iterate backwards, the AR(1) model can be represented as a linear process given by

$$y_t = \sum_{j=0}^{\infty} \phi^j \epsilon_{t-j}$$

where the expected value is

$$E(y_t) = \sum_{j=0}^{\infty} \phi^j E(\epsilon_{t-j}) = 0$$

and the autocovariance function is

$$\gamma(h) = \mathrm{cov}(y_{t+h}, y_t) = \frac{\sigma_\epsilon^2\,\phi^h}{1 - \phi^2}, \quad h \ge 0$$

Moving Average Models predict subsequent deviations based on the past deviations of the stationary time series. The model relates the current value to the mean and the past deviations of the time series, rather than to the history of the values themselves. The model of order $q$ is denoted MA($q$) and defined as

$$y_t = \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \mu + \epsilon_t$$

where there are $q$ lags in the moving average, $\theta_1, ..., \theta_q$ are the coefficients that need to be learned from the training data and $\mu$ is the mean of the series. The values $\epsilon_i$ are assumed to be white noise error terms. Therefore, the model's implied set of relationships is non-linear, and typically iterative non-linear fitting procedures are used to fit the model. [Aggarwal, 2013]

By using the backshift operator $B$, an MA($q$) model with mean equal to zero can be written as

$$y_t = \theta(B)\epsilon_t$$

For an MA(1) model $y_t = \epsilon_t + \theta\epsilon_{t-1}$, the expected value is $E(y_t) = 0$,

$$\gamma(h) = \begin{cases} (1 + \theta^2)\sigma_\epsilon^2 & h = 0, \\ \theta\sigma_\epsilon^2 & h = 1, \\ 0 & h > 1, \end{cases}$$

and the ACF is

$$\rho(h) = \begin{cases} \dfrac{\theta}{(1 + \theta^2)} & h = 1, \\ 0 & h > 1. \end{cases}$$

Note that $|\rho(1)| \le 1/2$ for all values of $\theta$. Also, $y_t$ is correlated with $y_{t-1}$, but not with $y_{t-2}, y_{t-3}, ...$; contrast this with the AR(1) model, in which the correlation between $y_t$ and $y_{t-k}$ is never zero. An MA model is said to be invertible if it is algebraically equivalent to a converging infinite-order AR model. [Shumway and Stoffer, 2011]

Autoregressive Moving Average is a model considering both previous values and the history of deviations when calculating the expected values. The ARMA($p$,$q$) model is defined as

$$y_t = \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \alpha + \epsilon_t \quad (9)$$

with $\phi_p \neq 0$, $\theta_q \neq 0$, $\sigma_\epsilon^2 > 0$ and $\epsilon_t \in WN(0, \sigma^2)$. The number of autoregressive and moving average terms included in the model are denoted by $p$ and $q$ respectively, and the constant $\alpha$ has to be learned from the training data. [Aggarwal, 2013]

The ARMA($p$,$q$) model is said to be causal if the time series $\{y_t;\ t = 0, \pm 1, \pm 2, ...\}$ can be written as a one-sided linear process, such as

$$y_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j} = \psi(B)\epsilon_t,$$

where $\psi(B) = \sum_{j=0}^{\infty} \psi_j B^j$ and $\sum_{j=0}^{\infty} |\psi_j| < \infty$; we set $\psi_0 = 1$. This comes from the definition that $y_t$ is causal, or a causal function of $\{z_t\}$, if its values can be expressed in terms of current or past values of $z$: $y_t = f(z_s, s \le t)$. [Shumway and Stoffer, 2011]

Furthermore, the ARMA($p$,$q$) model is said to be invertible if the time series $\{y_t;\ t = 0, \pm 1, \pm 2, ...\}$ can be written as

$$\pi(B)y_t = \sum_{j=0}^{\infty} \pi_j y_{t-j} = \epsilon_t$$

where $\pi(B) = \sum_{j=0}^{\infty} \pi_j B^j$ and $\sum_{j=0}^{\infty} |\pi_j| < \infty$; we set $\pi_0 = 1$. [Shumway and Stoffer, 2011]

Integrated ARMA, or the ARIMA model, is a class of ARMA model which includes differencing to produce stationarity. The first difference, in Equation (7), eliminates a linear trend. The second difference, i.e. the difference of the first difference, can eliminate a quadratic trend, and so on. From Equation (8) we can say that a process is said to be ARIMA($p$,$d$,$q$) if

$$\nabla^d y_t = (1 - B)^d y_t$$

is causal ARMA($p$,$q$). The general model is written as

$$\phi(B)(1 - B)^d y_t = \theta(B)\epsilon_t$$

[Shumway and Stoffer, 2011]

Seasonal ARIMA can be used to account for seasonal and nonstationary behavior, where the dependence on past values is strongest at some underlying lag $m$. Hence, autoregressive and moving average polynomials that identify with the seasonal lag are introduced. The seasonal autoregressive moving average model, ARMA($P$,$Q$)$_m$, is defined as

$$\Phi_P(B^m)y_t = \Theta_Q(B^m)\epsilon_t$$

where the seasonal autoregressive operator and moving average operator of orders $P$ and $Q$ with period $m$ are defined as:

$$\Phi_P(B^m) = 1 - \Phi_1 B^m - \Phi_2 B^{2m} - ... - \Phi_P B^{Pm} \quad (10)$$

and

$$\Theta_Q(B^m) = 1 + \Theta_1 B^m + \Theta_2 B^{2m} + ... + \Theta_Q B^{Qm} \quad (11)$$

The seasonal and non-seasonal operators can be combined into a seasonal autoregressive moving average model, denoted by

$$\Phi_P(B^m)\phi(B)y_t = \Theta_Q(B^m)\theta(B)\epsilon_t$$

The model can be further expanded to a multiplicative seasonal autoregressive integrated moving average model, denoted sARIMA($p$,$d$,$q$)($P$,$D$,$Q$)$_m$ and defined as

$$\Phi_P(B^m)\phi(B)\nabla_m^D\nabla^d y_t = \delta + \Theta_Q(B^m)\theta(B)\epsilon_t$$

where the ordinary autoregressive and moving average components are represented by the polynomials $\phi(B)$ and $\theta(B)$ of orders $p$ and $q$. The ordinary difference component is denoted by $\nabla^d = (1 - B)^d$. The seasonal autoregressive and moving average components are defined by $\Phi_P(B^m)$ and $\Theta_Q(B^m)$ of orders $P$ and $Q$, and the seasonal difference component is defined by $\nabla_m^D = (1 - B^m)^D$.
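As a hedged illustration of fitting such a model, the sketch below uses Python's statsmodels ARIMA class with a seasonal order; the data, the short season m = 24 (chosen to keep the example cheap, whereas the thesis uses m = 288) and the chosen orders are hypothetical, and this is not the R/CSS pipeline described in Section 3.2.4.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical series with a short season to keep the example fast.
rng = np.random.default_rng(2)
t = np.arange(24 * 30)
y = 500 + 100 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 20, t.size)

# sARIMA(1,1,1)(1,1,1)_24, fitted with the library's default state space
# maximum likelihood estimation, followed by a one-step-ahead forecast.
model = ARIMA(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
fit = model.fit()
print(fit.forecast(steps=1))
```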


2.6 State Space Model

Given a basic time series model with the set of observations $y_1, ..., y_n$,

$$y_t = \mu_t + \zeta_t, \quad t = 1, ..., n$$

where $\mu_t$ represents the trend and $\zeta_t$ the error, we can develop suitable models for $\mu_t$ by using the concept of a random walk, represented by $\alpha_t$. The random walk is determined by the relationship $\alpha_{t+1} = \alpha_t + \eta_t$, where the $\eta_t$ are iid with zero mean and variance $\sigma_\eta^2$, and $\eta_t$ and $\zeta_s$ are independent for all $t$ and $s$. Hence we can consider the model in which $\mu_t = \alpha_t$, assuming $\zeta_t$ has constant variance $\sigma_\zeta^2$, as follows:

$$y_t = \alpha_t + \zeta_t, \quad \zeta_t \sim N(0, \sigma_\zeta^2),$$
$$\alpha_{t+1} = \alpha_t + \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2). \quad (12)$$

The model exhibits the characteristic structure of a state space model, in which there is a series of unobserved values $\alpha_1, ..., \alpha_n$, called states. In the model we also initially assume that $\alpha_1 \sim N(a_1, P_1)$, where $a_1$ and $P_1$ are known, and that $\sigma_\zeta^2$ and $\sigma_\eta^2$ are known. The development of the system over time is represented by the states together with a set of observations $y_1, ..., y_n$, related to the $\alpha_t$'s by the state space model (12). The objective is to infer relevant properties of the $\alpha_t$'s from knowledge of the observations $y_1, ..., y_n$. Since random walks are non-stationary, the model is also non-stationary, where non-stationarity in this case implies that the distributions of the random variables $y_t$ and $\alpha_t$ depend on time $t$. [Durbin and Koopman, 2012]

Similar to the local level model in Equation (12), the linear Gaussian state space model is defined as

$$y_t = Z_t\alpha_t + \zeta_t, \quad \zeta_t \sim N(0, H_t) \quad (13)$$
$$\alpha_{t+1} = T_t\alpha_t + R_t\eta_t, \quad \eta_t \sim N(0, Q_t), \quad t = 1, ..., n \quad (14)$$

where, in the univariate case, $y_t$ is the observation at time $t$ and $\alpha_t$ is the unobserved state vector. The development of the system over time is determined by $\alpha_t$ according to Equation (14), called the state equation. However, $\alpha_t$ cannot be observed directly, hence the analysis is based on the observation $y_t$ from Equation (13), called the observation equation. $Z_t$, $T_t$, $R_t$, $H_t$ and $Q_t$ are initially assumed to be known, and the error terms $\zeta_t$ and $\eta_t$ are assumed to be serially independent and independent of each other at all time points. The initial state vector $\alpha_1$ is assumed to be $N(a_1, P_1)$, independently of $\zeta_1, ..., \zeta_n$ and $\eta_1, ..., \eta_n$, where $a_1$ and $P_1$ are first assumed known. Equation (13) of the model constitutes a linear regression model varying over time with coefficient $\alpha_t$, and the second equation, Equation (14), represents an AR(1) model. [Durbin and Koopman, 2012]

2.6.1 ARMA and ARIMA models

Given the definition of an ARMA process in Section 2.5, we can rewrite a zero-mean ARMA model as

$$y_t = \sum_{j=1}^{r} \phi_j y_{t-j} + \epsilon_t + \sum_{j=1}^{r-1} \theta_j \epsilon_{t-j}, \quad t = 1, ..., n \quad (15)$$

where $r = \max(p, q + 1)$ and some of the coefficients are equal to zero. The model can then be put into the state space form of Equations (13) and (14) with $\zeta_t = 0$ (and with $y_t^*$ in place of $y_t$ when the series has been differenced):

$$y_t = Z_t\alpha_t, \qquad \alpha_{t+1} = T_t\alpha_t + R_t\eta_t$$

$$Z_t = (1 \;\; 0 \;\; 0 \;\; \cdots \;\; 0), \qquad
\alpha_t = \begin{pmatrix} y_t \\ \phi_2 y_{t-1} + \cdots + \phi_r y_{t-r+1} + \theta_1\epsilon_t + \cdots + \theta_{r-1}\epsilon_{t-r+2} \\ \phi_3 y_{t-1} + \cdots + \phi_r y_{t-r+2} + \theta_2\epsilon_t + \cdots + \theta_{r-1}\epsilon_{t-r+3} \\ \vdots \\ \phi_r y_{t-1} + \theta_{r-1}\epsilon_t \end{pmatrix}$$

$$T_t = T = \begin{pmatrix} \phi_1 & 1 & & \\ \vdots & & \ddots & \\ \phi_{r-1} & 0 & & 1 \\ \phi_r & 0 & \cdots & 0 \end{pmatrix}, \qquad
R_t = R = \begin{pmatrix} 1 \\ \theta_1 \\ \vdots \\ \theta_{r-1} \end{pmatrix}, \qquad \eta_t = \epsilon_{t+1} \quad (16)$$

Given that $\zeta_t = 0$, it follows that $H_t = 0$. For seasonal series, both trend and season are eliminated by using the differencing operator, $y_t^* = \nabla^d\nabla_s^D y_t$, described in Section 2.5, prior to modelling $y_t^*$ by a stationary ARMA model. [Durbin and Koopman, 2012]

2.7 Kalman Filtering

The Kalman filter is a recursive algorithm used to update the knowledge of the system each time a new observation $y_t$ is observed. This is done by predicting the future state from information about the current state. When applied to time series data it can be used to evaluate the likelihood of an ARIMA model by calculating the one-step ahead prediction errors and their variance.

Given the definition of the local level model in Section 2.6, where $\zeta_t$ and $\eta_t$ are assumed normal, it is also assumed that the conditional joint distribution of one set of observations given another set is normal. The objective is to calculate $a_{t|t}$, $P_{t|t}$, $a_{t+1}$ and $P_{t+1}$ when $y_t$ is observed, where $a_{t|t}$ is the filtered estimator of the state $\alpha_t$, $a_{t+1}$ is the one-step ahead predictor of $\alpha_{t+1}$ and $P_{t|t}$, $P_{t+1}$ are their respective associated variances.

Let $Y_{t-1}$ be the vector of observations $(y_1, ..., y_{t-1})$ for $t = 2, 3, ...$ and let the conditional distribution of $\alpha_t$, given $Y_{t-1}$, be $N(a_t, P_t)$, where $a_t$ and $P_t$ are known. Also let the conditional distribution of $\alpha_t$, given $Y_t$, be $N(a_{t|t}, P_{t|t})$. The distribution of $\alpha_{t+1}$ given $Y_t$ is then $N(a_{t+1}, P_{t+1})$. Furthermore, let $v_t$ be the one-step ahead prediction error of $y_t$, which gives $v_t = y_t - a_t$ for $t = 1, .., n$ and

$$E(v_t|Y_{t-1}) = E(\alpha_t + \zeta_t - a_t|Y_{t-1}) = a_t - a_t = 0$$
$$Var(v_t|Y_{t-1}) = Var(\alpha_t + \zeta_t - a_t|Y_{t-1}) = P_t + \sigma_\zeta^2$$
$$E(v_t|\alpha_t, Y_{t-1}) = E(\alpha_t + \zeta_t - a_t|\alpha_t, Y_{t-1}) = \alpha_t - a_t$$
$$Var(v_t|\alpha_t, Y_{t-1}) = Var(\alpha_t + \zeta_t - a_t|\alpha_t, Y_{t-1}) = \sigma_\zeta^2$$

for $t = 2, ..., n$. When $Y_t$ is fixed, $Y_{t-1}$ and $y_t$ are fixed, so $Y_{t-1}$ and $v_t$ are fixed, and vice versa. It follows that

$$p(\alpha_t|Y_{t-1}, v_t) = \frac{p(\alpha_t, v_t|Y_{t-1})}{p(v_t|Y_{t-1})} = \frac{p(\alpha_t|Y_{t-1})\,p(v_t|\alpha_t, Y_{t-1})}{p(v_t|Y_{t-1})} = \text{constant} \times \exp\left(-\tfrac{1}{2}Q\right)$$

where

$$Q = \frac{(\alpha_t - a_t)^2}{P_t} + \frac{(v_t - \alpha_t + a_t)^2}{\sigma_\zeta^2} - \frac{v_t^2}{P_t + \sigma_\zeta^2} = \left(\frac{1}{P_t} + \frac{1}{\sigma_\zeta^2}\right)(\alpha_t - a_t)^2 - 2(\alpha_t - a_t)\frac{v_t}{\sigma_\zeta^2} + \left(\frac{1}{\sigma_\zeta^2} - \frac{1}{P_t + \sigma_\zeta^2}\right)v_t^2 = \frac{P_t + \sigma_\zeta^2}{P_t\sigma_\zeta^2}\left(\alpha_t - a_t - \frac{P_t v_t}{P_t + \sigma_\zeta^2}\right)^2$$

Thus

$$p(\alpha_t|Y_t) = N\left(a_t + \frac{P_t}{P_t + \sigma_\zeta^2}v_t,\; \frac{P_t\sigma_\zeta^2}{P_t + \sigma_\zeta^2}\right)$$

From $p(\alpha_t|Y_t) = N(a_{t|t}, P_{t|t})$ it follows that

$$a_{t|t} = a_t + \frac{P_t}{P_t + \sigma_\zeta^2}v_t, \qquad P_{t|t} = \frac{P_t\sigma_\zeta^2}{P_t + \sigma_\zeta^2}$$

Given that $a_{t+1} = E(\alpha_{t+1}|Y_t) = E(\alpha_t + \eta_t|Y_t)$ and $P_{t+1} = Var(\alpha_{t+1}|Y_t) = Var(\alpha_t + \eta_t|Y_t)$, we have that

$$a_{t+1} = E(\alpha_t|Y_t) = a_{t|t}, \qquad P_{t+1} = Var(\alpha_t|Y_t) + \sigma_\eta^2 = P_{t|t} + \sigma_\eta^2$$

which gives

$$a_{t+1} = a_t + \frac{P_t}{P_t + \sigma_\zeta^2}v_t, \qquad P_{t+1} = \frac{P_t\sigma_\zeta^2}{P_t + \sigma_\zeta^2} + \sigma_\eta^2$$

for $t = 2, .., n$. For $t = 1$ we delete the symbol $Y_{t-1}$ above and find that the results presented hold for $t = 1$ as well as for $t = 2, .., n$.

The result can be made consistent with the general linear state space model in Section 2.6 by introducing the new notations

$$F_t = Var(v_t|Y_{t-1}) = P_t + \sigma_\zeta^2, \qquad K_t = \frac{P_t}{F_t}$$

where $F_t$ is referred to as the variance of the prediction error $v_t$ and $K_t$ is known as the Kalman gain. The full set of relations for updating from time $t$ to time $t + 1$ can then be written as

$$v_t = y_t - a_t, \qquad F_t = P_t + \sigma_\zeta^2$$
$$a_{t|t} = a_t + K_t v_t, \qquad P_{t|t} = P_t(1 - K_t)$$
$$a_{t+1} = a_t + K_t v_t, \qquad P_{t+1} = P_t(1 - K_t) + \sigma_\eta^2$$

for $t = 1, .., n$. [Durbin and Koopman, 2012]
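The recursions above translate directly into code; the following is a minimal Python sketch for the local level model, with variable names of our own choosing.

```python
import numpy as np

def local_level_kalman(y, sigma2_zeta, sigma2_eta, a1, P1):
    """Kalman filter for the local level model (Eq. 12); a minimal sketch.

    Returns one-step state predictions a_t, their variances P_t, prediction errors
    v_t and error variances F_t, following the recursions in Section 2.7.
    """
    n = len(y)
    a = np.empty(n + 1)   # a[t] holds the prediction of the state at time t+1
    P = np.empty(n + 1)
    v = np.empty(n)
    F = np.empty(n)
    a[0], P[0] = a1, P1
    for t in range(n):
        v[t] = y[t] - a[t]               # one-step prediction error
        F[t] = P[t] + sigma2_zeta        # its variance
        K = P[t] / F[t]                  # Kalman gain
        a[t + 1] = a[t] + K * v[t]       # state prediction for t+1
        P[t + 1] = P[t] * (1 - K) + sigma2_eta
    return a[1:], P[1:], v, F
```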

2.8 Maximum Likelihood and Conditional Sum of Squares

Maximum likelihood and conditional sum of squares are two methods which can be used to estimate the parameters when fitting a model. To estimate the parameters using maximum likelihood, the likelihood of the observations $y_1, y_2, ..., y_n$ is computed in terms of the unknown parameter values $\Theta$. The likelihood function is then maximized over the unknown parameters. For an ARMA($p$,$q$) model the parameters would be $\beta = \{\mu, \theta_1, ..., \theta_q, \phi_1, ..., \phi_p\}$ and the likelihood with respect to $\beta$ and $\sigma_\epsilon^2$ can be written as

$$L(\beta, \sigma_\epsilon^2) = \prod_{t=1}^{n} f(y_t|y_{t-1}, ..., y_1)$$

The conditional distribution of $y_t$ given $y_{t-1}, ..., y_1$ is assumed to be Gaussian with mean $y_t^{t-1}$ and variance $P_t^{t-1}$, where $y_t^{t-1} = \mu + \phi(y_{t-1} - \mu)$ and $P_t^{t-1} = \gamma(0)\prod_{j=1}^{t-1}(1 - \phi_{jj}^2)$ are given by solving the prediction equations for an ARMA model using the Durbin-Levinson algorithm [Shumway and Stoffer, 2011], and $\phi$ are the parameters used to produce the best prediction. For ARMA models the autocovariance function is given by $\gamma(h) = \mathrm{cov}(y_{t+h}, y_t) = \sigma_\epsilon^2\sum_{j=0}^{\infty}\psi_j\psi_{j+h}$, $h \ge 0$, where the $\psi$ are the weights. This gives that

$$P_t^{t-1} = \sigma_\epsilon^2\left\{\left[\sum_{j=0}^{\infty}\psi_j^2\right]\left[\prod_{j=1}^{t-1}(1 - \phi_{jj}^2)\right]\right\} \overset{\mathrm{def}}{=} \sigma_\epsilon^2 r_t$$

where the $r_t$ terms are functions only of the regression parameters and may be computed recursively as $r_{t+1} = (1 - \phi_{tt}^2)r_t$ with initial condition $r_1 = \sum_{j=0}^{\infty}\psi_j^2$. The likelihood of the data can be written as

$$L(\beta, \sigma_\epsilon^2) = (2\pi\sigma_\epsilon^2)^{-n/2}\,[r_1(\beta)r_2(\beta)\cdots r_n(\beta)]^{-1/2}\exp\left[-\frac{S(\beta)}{2\sigma_\epsilon^2}\right] \quad (17)$$

where

$$S(\beta) = \sum_{t=1}^{n}\left[\frac{(y_t - y_t^{t-1}(\beta))^2}{r_t(\beta)}\right]$$

The maximum likelihood estimates are then given by maximizing Equation (17) with respect to $\beta$ and $\sigma_\epsilon^2$. [Shumway and Stoffer, 2011]

Conditional Sum of Squares is another method used when fitting the parameters of a model. The method is based on fixing the parameters and then computing the sum of squares; the aim is to find the parameters which minimize the sum of squares, which is accomplished by using optimization algorithms.

An ARMA($p$,$q$) model, with $\mu = 0$ and $\beta = \{\theta_1, ..., \theta_q, \phi_1, ..., \phi_p\}$, can be written in terms of its errors as

$$\epsilon_t(\beta) = y_t - \sum_{i=1}^{p}\phi_i y_{t-i} - \sum_{i=1}^{q}\theta_i\epsilon_{t-i} \quad (18)$$

For conditional least squares, the residual sum of squares is approximated by conditioning on the observations $y_1, .., y_p$ if $p > 0$, and on $\epsilon_p = \epsilon_{p-1} = ... = \epsilon_{1-q} = 0$ if $q > 0$. In that case, given $\beta$, Equation (18) is evaluated for $t = p + 1, p + 2, ..., n$. This conditional argument gives the conditional error sum of squares

$$S_c(\beta) = \sum_{t=p+1}^{n}\epsilon_t^2(\beta)$$

Minimizing $S_c$ with respect to $\beta$ yields the conditional least squares estimates. The problem becomes linear regression for $q = 0$ and nonlinear regression for $q > 0$. To solve nonlinear regression problems, numerical optimization techniques have to be applied. [Shumway and Stoffer, 2011]
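A minimal sketch of the CSS idea for a zero-mean ARMA(1,1), minimized numerically with L-BFGS-B via SciPy, is shown below; the data and the parameter bounds are hypothetical and this is not the thesis's R implementation.

```python
import numpy as np
from scipy.optimize import minimize

def css(params, y):
    """Conditional sum of squares for a zero-mean ARMA(1,1); a minimal sketch.

    Conditions on y_1 and epsilon_1 = 0, then accumulates squared innovations.
    """
    phi, theta = params
    eps = np.zeros(len(y))
    for t in range(1, len(y)):
        eps[t] = y[t] - phi * y[t - 1] - theta * eps[t - 1]
    return np.sum(eps[1:] ** 2)

# Hypothetical data; in practice y would be the (differenced) signaling series.
rng = np.random.default_rng(3)
y = rng.normal(0, 1, 500)
res = minimize(css, x0=[0.1, 0.1], args=(y,), method="L-BFGS-B",
               bounds=[(-0.99, 0.99), (-0.99, 0.99)])
print(res.x)  # CSS estimates of (phi, theta)
```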

2.9 L-BFGS-B

Newton's method in optimization attempts to find the stationary points of a nonlinear function $f$ by iteratively evaluating the gradient and Hessian of that function. When the Hessian of a function $f$ is large or impractical to compute, quasi-Newton methods can be used as an alternative. One such technique is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which iteratively approximates the Hessian $B_k$ by using information about the curvature of the function that is given by the gradient $g_k$ and the current iterate $x_k$. The BFGS update is computed as:

$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} \quad (19)$$

where $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$. [Kelley, 1999]

L-BFGS is a limited memory modification of the BFGS algorithm, so that when updating the BFGS matrix (Equation (19)), only the $m$ most recent correction pairs $\{s_i, y_i\}$ are stored in every iteration (for details see [Byrd et al., 1994]). This reduces the $n \times n$ BFGS matrix to a more compact $m \times n$ matrix, where $m$ is a small number (usually $\in [3, 20]$). In turn, L-BFGS-B is a modification of L-BFGS that allows for box constraints on the variables, so that each variable $x_i$ can be given a lower bound $l_i$ and/or an upper bound $u_i$. In [Byrd et al., 1995] the algorithm is explained in more detail. The problem can be written as:

$$\min f(x) \quad \text{where } l \le x \le u$$

$f: \mathbb{R}^n \to \mathbb{R}$ is a nonlinear function with $n$ variables and gradient $g$. The algorithm works in an iterative fashion, where the positive definite limited memory matrix $B_k$, together with the current iterate $x_k$ and the gradient $g_k$, forms the quadratic model of $f$ at $x_k$,

$$m_k(x) = f(x_k) + g_k^T(x - x_k) + \frac{1}{2}(x - x_k)^T B_k(x - x_k). \quad (20)$$

The algorithm uses the gradient projection method to find the set of active variables. While the active variables are held fixed, Equation (20) is approximately minimized with respect to the free variables. If $\hat{x}_{k+1}$ is this minimizer, then the new search direction is $d_k = \hat{x}_{k+1} - x_k$, or in other words, the vector leading from the current iterate to the approximate minimizer. A line search that ensures that the iterates remain in the feasible region is performed along the search direction in order to find a new iterate $x_{k+1}$, which satisfies the sufficient decrease condition [Zhu et al., 1994]

$$f(x_{k+1}) \le f(x_k) + \alpha\lambda_k g_k^T d_k$$

where $\lambda_k$ is the steplength and $\alpha$ is a parameter. It also attempts to enforce the curvature condition $|g_{k+1}^T d_k| \le \beta\,|g_k^T d_k|$, where $\beta$ is a parameter.
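In practice the algorithm is available through standard optimization libraries; the sketch below uses SciPy's L-BFGS-B on a toy box-constrained objective, loosely mirroring how the Holt-Winters smoothing parameters are bounded to [0, 1] in Section 3.2.5.

```python
import numpy as np
from scipy.optimize import minimize

# Box-constrained minimization with L-BFGS-B; a minimal sketch on a toy objective.
def objective(x):
    return (x[0] - 0.7) ** 2 + (x[1] - 1.5) ** 2 + (x[2] + 0.3) ** 2

x0 = np.array([0.5, 0.5, 0.5])
bounds = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]   # lower/upper bounds per variable
res = minimize(objective, x0, method="L-BFGS-B", bounds=bounds)
print(res.x)  # components whose unconstrained optimum lies outside the box end on its boundary
```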

3 Method

This section is divided into three subsections: (i) Data inference, the methods used to extract information from the raw data; (ii) Prediction, how the data is prepared and which models are used to predict future signaling volumes; (iii) Detection, how anomalies are classified.

3.1 Data Inference

When selecting the features in the data, on which further analysis is made, we use the domain knowledge within Tele2 IoT. Employees who visually monitor the system have discovered that certain metrics seem to be highly correlated with network issues. Based on this knowledge, we have chosen to study the accumulated number of SAI requests in every 5 min period.

The data is first investigated from a time series point of view to evaluate whether time series methods would be applicable. Autocorrelation was investigated by analyzing the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. Further, the data was differenced to make inference about trend and seasonality. Finally, different weekdays were compared to one another to see if the signaling differed depending on which day was studied.

3.2 Prediction

3.2.1 Data Processing

In order to ensure that the data set is representative of the system's normal behavior, some data processing is required. The problems of missing data and outliers are addressed in two steps:

• Step 1. Missing value replacement:
  • If the missing value belongs to the first week of the training set, it is replaced by the mean of the values from the same time period of the two following weeks.
  • If the value is not in the first or last week of the training set, it is replaced by the mean of the previous and next weeks' values for the same time period.
  • If the missing value belongs to the last week of the training set, it is replaced by the mean of the two previous weeks' values for the same time period.

• Step 2. Outlier replacement using interpolation:

The outer fences, $L^{(2)} = (L_L^{(2)}, L_U^{(2)})$, are defined as the points at a distance of $3 \cdot IQR$ below the lower quartile and above the upper quartile of the histogram of residual errors.

The inner fences, $L^{(1)} = (L_L^{(1)}, L_U^{(1)})$, are defined equivalently as the points at $1.5 \cdot IQR$ distance from the upper and lower quartiles. [Seo, 2006] Any residual error $\epsilon$ that falls outside the outer fences of the error histogram, $\epsilon < L_L^{(2)}$ or $\epsilon > L_U^{(2)}$, is considered an outlier and is replaced with an interpolated value.
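A minimal sketch of Step 2, assuming the outer-fence constant 3.0 and linear interpolation via pandas, is given below; the function name and its interface are illustrative only.

```python
import numpy as np
import pandas as pd

def replace_outliers(errors, values, c_outer=3.0):
    """Tukey-fence outlier replacement; a minimal sketch of Step 2 under assumed constants.

    errors : residual prediction errors, one per observation
    values : the observed series; flagged outliers are replaced by linear interpolation
    """
    errors = np.asarray(errors, dtype=float)
    q1, q3 = np.percentile(errors, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - c_outer * iqr, q3 + c_outer * iqr    # outer fences L(2)
    cleaned = pd.Series(values, dtype=float)
    cleaned[(errors < lower) | (errors > upper)] = np.nan    # flag outliers
    return cleaned.interpolate(limit_direction="both")       # fill by interpolation
```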

3.2.2 Cross Validation

When selecting an appropriate model, the goal is to select the model with the best predictive ability on average [Hyndman et al., 2008]. To ensure that this is the case, we have used cross-validation (CV) to evaluate the forecasting error $\epsilon_t = y_t - \hat{y}_t$ that is generated by each model for different parameters. In CV, the data is recursively partitioned into two complementary subsets, where one part of the available data set is used for fitting a model ($Y_{mod}$) and the rest is used for validation ($Y_{pred}$). Multiple rounds of partitioning, where a different part of the data is used for validation each time, are performed and the results from the rounds are averaged. With traditional CV techniques the data is assumed to be independent. However, since the data observations in this work are time dependent, a modified roll-forward CV technique is used so that forecasts are only generated based on observations prior to the validation set. That is, for one-step forecasts, the training and validation sets are shifted one step forward in time for every round of partitioning. This CV technique follows the method proposed in [Hyndman and Athanasopoulos, 2014]. If $k$ observations are required to produce a reliable forecast, then one-step predictions may be conducted as follows, for $i = 1, 2, \ldots, T - k$, where $T$ is the total number of observations:

1. Select the $(k + i)$th observation for the test set $Y_{pred}$ and use the prior observations for periods $i, i + 1, i + 2, \ldots, k + i - 1$ as the training set $Y_{mod}$ to fit a model.
2. Compute the forecast error $\epsilon_{k+i}$.

The forecast accuracy in this work was evaluated by calculating the mean absolute error (MAE), defined as:

$$MAE = \frac{1}{n}\sum_{j=1}^{n}|\epsilon_j|,$$

where $n$ is the number of obtained prediction errors. Additionally, the standard deviation of the errors was calculated, as well as the equivalent but more robust measures median absolute error and interquartile range (IQR). The proposed models were tested for forecasting horizons of $h = 1$, 12 and 288 five-minute time periods.
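The procedure can be sketched as follows in Python; the naive last-value forecaster and the data are placeholders for the actual prediction models.

```python
import numpy as np

def roll_forward_cv(y, k, fit_predict):
    """Roll-forward one-step cross-validation; a minimal sketch.

    y           : full series
    k           : number of observations needed before the first forecast
    fit_predict : callable taking the training window and returning a one-step forecast
    """
    errors = []
    for i in range(len(y) - k):
        y_mod = y[i:k + i]                  # training window, shifted forward each round
        forecast = fit_predict(y_mod)       # one-step-ahead prediction
        errors.append(y[k + i] - forecast)  # forecast error for observation k+i
    errors = np.asarray(errors)
    mae = np.mean(np.abs(errors))
    med_ae = np.median(np.abs(errors))
    sd = np.std(errors)
    iqr = np.subtract(*np.percentile(errors, [75, 25]))
    return mae, med_ae, sd, iqr

# Example with a naive "last value" forecaster on hypothetical data.
rng = np.random.default_rng(4)
series = rng.normal(1000, 50, 600)
print(roll_forward_cv(series, k=288, fit_predict=lambda w: w[-1]))
```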

3.2.3 Existing Solution

The existing solution is based on an average seasonal approach where the prediction equals the average of the last observed values for the same season; this model is referred to as the Benchmark model (BM). In the model the seasonality is weekly and the one-step ahead prediction equals the mean value over the last six seasons. Given that the data is aggregated on five-minute intervals, there are 288 observations per day and 2016 observations per week. This gives the one-step ahead prediction

$$\hat{y}_{t+1} = \frac{1}{n}(y_{t+1-m} + y_{t+1-2m} + ... + y_{t+1-nm}), \quad \text{where } n = 6 \text{ and } m = 2016$$

A second algorithm which is based on the existing solution is also implemented, namely the median Benchmark model (Median BM). Its one-step ahead prediction is instead based on the median of the same values from the last $n = 6$ seasons.
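A minimal sketch of the (median) Benchmark one-step prediction is shown below; the function name and arguments are illustrative.

```python
import numpy as np

def benchmark_forecast(y, m=2016, n=6, use_median=True):
    """Seasonal benchmark one-step forecast; a minimal sketch of the (median) BM model.

    y : history up to and including time t (needs at least n*m observations)
    m : season length (2016 five-minute periods = one week)
    n : number of past seasons to aggregate over
    """
    # Values at the same weekday/time in the last n weeks: y_{t+1-m}, ..., y_{t+1-nm}
    past = np.array([y[-m * j] for j in range(1, n + 1)])
    return np.median(past) if use_median else np.mean(past)
```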


3.2.4 ARIMA

In the process of finding the appropriate ARIMA models the R-statistics software is used, mainly with the packages forecast and stats.

In order to find the appropriate model to fit to the data, we first investigate whether a seasonal or non-seasonal model is suitable, and also whether there is any trend in the data. This is done by differencing the data until stationarity is achieved. Stationarity is validated by applying the Augmented Dickey-Fuller test. The number of differences needed to achieve stationarity indicates which integrated terms to use in the model. To find the appropriate number of autoregressive (AR) and moving average (MA) terms, both seasonal and non-seasonal, the autocorrelation function (ACF) and partial autocorrelation function (PACF) are evaluated. By evaluating how the ACF and PACF are tailing off or cutting off, one can select the appropriate model and number of AR and MA terms, according to the method described by Shumway and Stoffer [Shumway and Stoffer, 2011]. From this analysis initial terms are selected, which in the next step are tuned using a grid search approach. This approach aims to find the best suited terms for the model by testing different combinations of terms within a certain interval around the initially selected terms. The method used is similar to the one described by Yaacob et al. [Yaacob et al., 2010], with the difference that prediction errors are evaluated instead of minimizing the Akaike information criterion (AIC). This is done because the software used does not provide the AIC when using conditional sum of squares (CSS) in the fitting process.

After selecting appropriate terms, the model is cast into state space form in order to possibly gain further knowledge from an unobserved process in the data. Furthermore, the state space form makes it possible to easily construct the exact likelihood using Kalman filtering when fitting the model. When fitting the model, in order to estimate the parameters $\theta$, $\Theta$, $\phi$ and $\Phi$, the CSS is computed. Since the model is both stationary and invertible, the exact Maximum Likelihood (ML) and CSS estimators have the same asymptotic distribution [Harvey, 1993]. The CSS method was chosen because of its lower computational cost compared to ML. Furthermore, the ML method has been problematic, failing to converge for certain data sets. The estimation procedure is based on the computation of innovations, where the CSS is minimized. In this process the L-BFGS-B algorithm is used. The innovations and their variance are found by the use of a Kalman filter, where the initialization is made using the method described by Gardner et al. [Gardner et al., 1980]. The Kalman filter is also used in order to produce forecasts with the model.
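A compact sketch of the order-selection idea, using Python's statsmodels instead of the R packages named above, is given below; for brevity it compares candidates on a single one-step error, whereas the thesis evaluates errors over the full roll-forward cross-validation of Section 3.2.2, and the grids and data are hypothetical.

```python
import numpy as np
from itertools import product
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

def select_order(train, next_obs, p_grid=(0, 1, 2), q_grid=(0, 1, 2), d=1):
    """Grid search over (p, q) candidates, scored by the absolute one-step error."""
    best_order, best_err = None, np.inf
    for p, q in product(p_grid, q_grid):
        fit = ARIMA(train, order=(p, d, q)).fit()
        err = abs(next_obs - fit.forecast(steps=1)[0])
        if err < best_err:
            best_order, best_err = (p, d, q), err
    return best_order, best_err

rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(0, 1, 400))                     # hypothetical non-stationary series
print("ADF p-value of first difference:", adfuller(np.diff(y))[1])
print(select_order(y[:-1], y[-1]))
```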

3.2.5 Holt-Winters

Using the HoltWinters function from the forecast package, an additive trend and seasonal Holt-Winters model was implemented in R-statistics software.

The initiation of the model follows the method suggested in [Hyndman et al., 2008]. First, the first two seasons of the series are decomposed and de-trended by subtracting an estimate $\hat{T}$ of the trend component. This estimate is constructed with a moving average smoother of order $m = 2k + 1$, where $2k = 288$ is the seasonality. The smoother is centralized so that the de-trended values are the average of the observations $y$ that are within $k$ periods of the observation at time $t$:

$$\hat{T}_t = \frac{1}{2m}y_{t-k} + \frac{1}{m}\left(\sum_{j=-k+1}^{k-1} y_{t+j}\right) + \frac{1}{2m}y_{t+k}$$

Each period of the season is equally weighted with $\frac{1}{m}$, and since $y_{t-k}$ and $y_{t+k}$ correspond to the same period of the season, they each receive half of that weight.

When $\hat{T}$ has been removed from the series, the seasonal component ($\hat{S} = s_{t_1}, s_{t_2}, \ldots, s_{t_{288}}$) can be inferred by averaging the observations corresponding to the same time periods in the de-trended series. These estimates are used as start values for the 288 seasonal components in the model.

Initial values for the level and trend components are set by seasonally adjusting the original series (by subtracting $\hat{S}$) and then fitting a linear regression line to this data. The intercept of the line is the starting value for the level $l_{t_1}$ and the slope is the start value for the trend $b_{t_1}$.

The smoothing parameters $\alpha$, $\beta$ and $\gamma$ were optimized by minimizing the squared one-step in-sample prediction errors. This is done using the L-BFGS-B algorithm, where each parameter is bounded between 0 and 1.
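The centered moving average trend estimate used in the initialization can be sketched as follows; this is an illustrative Python version of the formula above, not the R implementation.

```python
import numpy as np

def centered_ma_trend(y, k):
    """Centered moving average trend estimate of order m = 2k + 1 with half-weighted
    endpoints; a minimal sketch of the initialization smoother described above."""
    y = np.asarray(y, dtype=float)
    m = 2 * k + 1
    trend = np.full(y.shape, np.nan)      # undefined near the series ends
    for t in range(k, len(y) - k):
        trend[t] = (y[t - k] / (2 * m)
                    + y[t - k + 1:t + k].sum() / m   # observations within k periods of t
                    + y[t + k] / (2 * m))
    return trend
```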

3.3 Detection

We want to achieve detection in near real-time. That is, as new observations are recorded, the detection model should quickly be able to tell whether an observation is anomalous or not. Two methods for detecting anomalies have been tested, based on predictions from the Holt-Winters and Median Benchmark models.

3.3.1 Method One

The first method (M1) compares the most recent prediction error to a set of previously generated prediction errors, using the interquartile range (IQR) of the prediction errors. The observation at time $t + 1$ is classified as an anomaly if the observed prediction error is more extreme than the limits $L_L^{(2)} = Q_{0.25} - IQR \times c$ or $L_U^{(2)} = Q_{0.75} + IQR \times c$, where $Q$ represents the quartile of the distribution of the prediction errors and $c$ is a constant used to adjust the limits. The method is implemented by (i) initializing by creating a data set of prediction errors $\epsilon$, whose histogram, together with the detection limits $L^{(2)}$, can be represented as in Figure 2; (ii) predicting $t + 1$ with the chosen prediction model and calculating the prediction error once the observed value is available; (iii) classifying the observation by determining whether the prediction error is more extreme than the detection limits $L^{(2)}$.
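The classification step (iii) can be sketched as follows; the function name and arguments are illustrative.

```python
import numpy as np

def m1_is_anomaly(new_error, past_errors, c):
    """Method 1 classification step; a minimal sketch.

    Flags the newest prediction error as anomalous if it falls outside the
    IQR-based detection limits computed from previously observed errors.
    """
    q1, q3 = np.percentile(past_errors, [25, 75])
    iqr = q3 - q1
    lower = q1 - c * iqr   # L(2)_L
    upper = q3 + c * iqr   # L(2)_U
    return new_error < lower or new_error > upper
```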

Benchmark Model In order to implement the detection method using the Benchmark model as the prediction method, the initialization aims to find an appropriate constant $c$ and to extract previous prediction errors. This is done using the following algorithm:

1. Select a data set $Y = y_1, y_2, \ldots, y_{t-1}$. Create two subsets, $Y_{mod} = y_1, y_2, \ldots, y_{t-n}$ used for modeling and $Y_{pred} = y_{t-n+1}, y_{t-n+2}, \ldots, y_{t-1}$ used to predict on.

2. Remove anomalies from $Y$, using the method described in Section 3.2.1, Step two, which gives the new data set $Yc$. Create two subsets of $Yc$, to get $Yc_{mod} = y_1, y_2, \ldots, y_{t-n}$ used for modeling and $Yc_{pred} = y_{t-n+1}, y_{t-n+2}, \ldots, y_{t-1}$ used to predict on.

3. Define $A_{True}$ as the set difference of $Y_{pred}$ and $Yc_{pred}$, hence $A_{True} = Y_{pred} \setminus Yc_{pred}$, which defines the anomalies in the data set $Y_{pred}$, referred to as target anomalies.

4. Make $n$ one-step ahead predictions using the median benchmark model, described in Section 3.2.3. Create a set of $n$ prediction errors $\epsilon$, calculated from the $n$ predictions and the observations $Yc_{pred}$.

5. Set a start value, step length and max value for the constant $c$.

6. Set $A_{Pred} = \{\}$. Use $Y_{mod}$ to make $n$ one-step ahead predictions using the median benchmark model. Let $i$ be the $i = 1, 2, \ldots, n$th prediction.
   • If $L_L^{(2)} < \epsilon_i < L_U^{(2)}$, add $\epsilon_i$ to $\epsilon$. Add the $i$th value in $Y_{pred}$ to $Y_{mod}$ and drop the oldest value in $Y_{mod}$.
   • If $\epsilon_i > L_U^{(2)}$ or $\epsilon_i < L_L^{(2)}$, classify the observed value as an anomaly by adding it to $A_{Pred}$. Add the $i$th predicted value to $Y_{mod}$ and drop the oldest value in $Y_{mod}$.

7. Set $R = A_{True} \cap A_{Pred}$ and $F = A_{Pred} \setminus A_{True}$. Calculate $x_1 = \frac{\#R}{\#A_{True}}$ and $x_2 = \frac{\#F}{\#A_{Pred}}$. If $\sqrt{(1 - x_1)^2 + x_2^2} < y$, set $y = \sqrt{(1 - x_1)^2 + x_2^2}$ and $c_{op} = c$.

8. If $c < c_{max}$, update $c$ and go to step 6.

The output of the algorithm is the set of errors $\epsilon$ and the optimal constant $c_{op}$, such that the limits $L_L^{(2)}, L_U^{(2)}$ are adjusted to detect as many anomalies as possible while at the same time generating as few false positives as possible. In our case, two seasons' worth of data is used, setting $n = 576$ and $c = \{2, 2.5, \ldots, 4.5\}$. Once the initialization is done, anomalies can be detected by repeating step 6 in the algorithm above. That is, $n$ new values are predicted with the Benchmark model and, once the true observations become available, prediction errors are calculated. The errors are then compared to $L_L^{(2)}$ and $L_U^{(2)}$ in order to determine whether any observation can be classified as anomalous. $L_L^{(2)}$ and $L_U^{(2)}$ are defined similarly to the outer extreme fences suggested by Tukey (see Step two, Section 3.2.1), with the exception of using a dynamic constant $c_{op}$ that is optimized for every data set instead of always setting $c = 3$. There is no statistical motivation for why Tukey chose this value of $c$ [Seo, 2006], which is why we attempt to find the optimal constant for each data set.

Holt-Winters To implement the detection method with a HW prediction model, two additional limits are introduced, $L_L^{(1)} = Q_{0.25} - IQR \times m$ and $L_U^{(1)} = Q_{0.75} + IQR \times m$, which are illustrated in Figure 2. The purpose of the introduced limits is to determine whether the observed value deviates more than normal from the expected value, although not so much that the observation would be classified as an anomaly. The aim of this approach is to be able to determine whether it is more suitable to add the observed or the predicted value to the training set, which is then used to make the next prediction. To initialize the model, the set of prediction errors $\epsilon$ is calculated, as well as the constants $m$ and $c$. This is done by the following algorithm:

1. Select a data set $Y = y_1, y_2, \ldots, y_{t-1}$. Create two subsets, $Y_{mod} = y_1, y_2, \ldots, y_{t-n}$ used for modeling and $Y_{pred} = y_{t-n+1}, y_{t-n+2}, \ldots, y_{t-1}$ used to predict on.

2. Remove anomalies from $Y$, using the method described in Section 3.2.1, Step two, which gives the new data set $Yc$. Create two subsets of $Yc$, to get $Yc_{mod} = y_1, y_2, \ldots, y_{t-n}$ used for modeling and $Yc_{pred} = y_{t-n+1}, y_{t-n+2}, \ldots, y_{t-1}$ used to predict on.

3. Define $A_{True}$ as the set difference of $Y_{pred}$ and $Yc_{pred}$, hence $A_{True} = Y_{pred} \setminus Yc_{pred}$, which defines the anomalies in the data set $Y_{pred}$, referred to as target anomalies.

4. Make $n$ one-step ahead predictions using the Holt-Winters method, described in Section 2.4. Create a set of $n$ prediction errors $\epsilon$, calculated from the $n$ predictions and the observations $Yc_{pred}$.

5. Set start values, step lengths and max values for $m$ and $c$. Set $m_{op} = m$ and $c_{op} = c$.

6. Remove anomalies from $Y_{mod}$, using the method described in Section 3.2.1, Step two, which gives $Ya_{mod}$. Set $i = 1$ and $A_{Pred} = \{\}$. Set $y = 1$.

7. Predict the $i$th observation in $Y_{pred}$ using the Holt-Winters model and the training data $Ya_{mod}$. Calculate the prediction error $\epsilon_i$.
   • If $L_L^{(1)} < \epsilon_i < L_U^{(1)}$, add $\epsilon_i$ to $\epsilon$, add the observed value to $Ya_{mod}$ and drop the oldest value in $Ya_{mod}$.
   • If $L_U^{(1)} < \epsilon_i < L_U^{(2)}$, add $\epsilon_i$ to $\epsilon$, add the predicted value to $Ya_{mod}$ and drop the oldest value in $Ya_{mod}$.
   • If $L_L^{(2)} < \epsilon_i < L_L^{(1)}$, add $\epsilon_i$ to $\epsilon$, add the predicted value to $Ya_{mod}$ and drop the oldest value in $Ya_{mod}$.
   • If $\epsilon_i > L_U^{(2)}$ or $\epsilon_i < L_L^{(2)}$, add the predicted value to $Ya_{mod}$ and drop the oldest value in $Ya_{mod}$. Classify the observed value as an anomaly by adding $i$ to $A_{Pred}$.

8. If $i = n$, go to step 9; else set $i = i + 1$ and go to step 7.

9. Set $R = A_{True} \cap A_{Pred}$ and $F = A_{Pred} \setminus A_{True}$. Calculate $x_1 = \frac{\#R}{\#A_{True}}$ and $x_2 = \frac{\#F}{\#A_{Pred}}$. If $\sqrt{(1 - x_1)^2 + x_2^2} < y$, set $y = \sqrt{(1 - x_1)^2 + x_2^2}$, $m_{op} = m$ and $c_{op} = c$.

10. If $m < m_{max}$ and $c < c_{max}$, update $m$ and $c$ and go to step 6.

The output of the algorithm is the set of errors $\epsilon$ and the optimal constants $(m, c)$, namely $m_{op}$ and $c_{op}$. The constants are selected such that the limits $L_L^{(1)}, L_U^{(1)}, L_L^{(2)}, L_U^{(2)}$ are adjusted to detect as many anomalies as possible while at the same time generating as few false positives as possible. This task is solved in step 9 in Algorithm 3.3.1, where $x_1$ represents the proportion of target anomalies which have been detected and $x_2$ represents the proportion of false positives out of all detected anomalies. The constants $m_{op}$ and $c_{op}$ are set to the $m$ and $c$ values which correspond to the $x_1$ and $x_2$ values that minimize the Euclidean distance to the optimal position $\{x_1, x_2\} = \{1, 0\}$, i.e. when all anomalies are found and no false positives are detected. In this process the two parameters $x_1$ and $x_2$ are given equal weight. In our study the parameters are set to $n = 576$, $c = \{2, 2.5, \ldots, 4.5\}$ and $m = \{1, 1.2, \ldots, 2\}$.

Given the output from the initialization, $\{\epsilon, m_{op}, c_{op}, Ya_{mod}\}$, repeating step 7 in Algorithm 3.3.1 makes it possible to determine whether future observations are anomalies. Hence, the difference between real-time detection and the initialization in step 7 of Algorithm 3.3.1 is that one is predicting $t + 1$ into the future instead of predicting the $i$th value of $Y_{pred}$.
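The per-observation classification and adaptation logic of step 7 can be sketched as follows; names and the handling of exact boundary values are our own choices.

```python
def m1_hw_step(pred_error, observed, predicted, limits):
    """One classification/adaptation step of M1 with Holt-Winters predictions; a sketch.

    limits = (L1_low, L1_up, L2_low, L2_up). Returns (is_anomaly, value_for_training):
    inside the inner limits the observed value is trusted, between inner and outer
    limits the predicted value replaces it, outside the outer limits it is an anomaly.
    """
    l1_low, l1_up, l2_low, l2_up = limits
    if l1_low < pred_error < l1_up:
        return False, observed       # normal: keep the observation in the training set
    if l2_low < pred_error <= l1_low or l1_up <= pred_error < l2_up:
        return False, predicted      # suspicious: adapt with the prediction instead
    return True, predicted           # anomaly: flag it and keep it out of the training set
```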

Figure 2: An example of a histogram over prediction errors. The picture explains the graphical intuition behind the first and third quartiles (Q0.25 and Q0.75), the IQR, and the limits for adaptation (L(1)) and detection (L(2)).

3.3.2 Method Two

The second method (M2) classifies an observation based on the standard deviation of previous data, using the following steps:

1. Predict the next value $\hat{y}_{t+1}$ using the chosen prediction method.

2. Calculate the standard deviation $\sigma_Y$ for the given data set $Y = y_{t+1-n}, y_{t+1-2n}, ..., y_{t+1-6n}$, where $n$ is the seasonality.

3. Calculate the detection limits as $\hat{y}_{t+1} \pm 2.6 \times \sigma_Y$.

4. Once $y_{t+1}$ becomes available, the observation is classified as an anomaly if $y_{t+1} > \hat{y}_{t+1} + 2.6 \times \sigma_Y$ or $y_{t+1} < \hat{y}_{t+1} - 2.6 \times \sigma_Y$.
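A minimal sketch of M2 is shown below; the function name and arguments are illustrative, with the 2.6 multiplier and six seasons taken from the steps above.

```python
import numpy as np

def m2_is_anomaly(y_hist, y_next_pred, y_next_obs, n, k=2.6, seasons=6):
    """Method 2 classification; a minimal sketch.

    y_hist      : observations up to time t
    y_next_pred : prediction for time t+1 from the chosen model
    y_next_obs  : the observation at t+1 once it arrives
    n           : season length; k : limit multiplier (2.6 in the thesis)
    """
    # Values at the same seasonal position in the last `seasons` seasons:
    # y_{t+1-n}, y_{t+1-2n}, ..., y_{t+1-seasons*n}
    past = np.array([y_hist[-j * n] for j in range(1, seasons + 1)])
    sigma = past.std()
    return abs(y_next_obs - y_next_pred) > k * sigma
```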

3.3.3 Validation

Validation of M1 and M2 is accomplished by continuously generating predictions and comparing these to the observed data. The observations are classified by comparing the prediction errors to the limits ($L^{(1)}$, $L^{(2)}$) set from the previously calculated prediction errors, given from the initialization phase. Different errors and limits are used, as described in Sections 3.3.1 and 3.3.2, depending on which prediction method is being used. In order to evaluate how well the methods perform, anomalies are predefined in the observed data set and are considered target anomalies. The method used to predefine the anomalies is given in Section 3.2.1, Step two. The evaluation is done by comparing the classified anomalies to the target anomalies by calculating $x_1$ and $x_2$, defined in Algorithm 3.3.1, Step 7. Hence $x_1$ represents the share of target anomalies detected with the method and $x_2$ represents the share of false positives out of all classified anomalies. The two ratios $x_1$ and $x_2$ are calculated based on generated data over one week for each data set. Furthermore, the evaluation of M1 and M2 is done by testing the methods in combination with HW and the median BM on a number of different data sets, where the data sets are chosen based on signaling volume such that different subgroups of customers and networks are represented in the validation. Large customers (LC*) generate an average signaling volume of more than 3000 SAI/5min, medium customers (MC*) generate an average signaling volume of more than 500 SAI/5min, small customers (SC*) generate an average signaling volume of less than 500 SAI/5min and networks (OP*) generate a varied signaling volume.

4 Results

In this section the results of the master thesis are presented, in a similar order to that of Chapter 3. The data which has been used to obtain the results is a subset of Tele2 IoT's customer data. In Sections 4.1 and 4.2 the data set has been selected randomly and consists of signaling data from a medium-sized network. The size of the network is determined by looking at the accumulated signaling frequency over a five-minute period. The data sets used to obtain the results in Section 4.3 are presented in Section 3.3.3.

4.1 Data Inference

In Figure 3 the average number of SAI requests for every 5 min time period is plotted for each day of the week so that the general patterns can be approximated. From the plot it is clear that weekends are easily distinguishable from weekdays and that they are characterised by lower device activity. On weekdays the highest activity seems to be during office hours, with a slight decline around lunchtime. From this we choose to build one model for each day of the week. This approach is also preferable as it results in a lower seasonality for each model, which lowers the time complexity when building the models.


Figure 3: 10 week seasonal average SAI-requests for different weekdays.

Figure 4: Partial autocorrelation plot of raw data.


4.2 Prediction

The following section presents results on how accurately the different models, described in Section 3.2, predict.

4.2.1 ARIMA

The observed data sets are considered stationary when taking the first difference of the seasonally differenced data, presented in Figure 6. The seasonality is considered daily which, given the granularity of the data, results in a season of 288 observations. This gives that the integrated part of the model is assumed to be one, for both the seasonal and non-seasonal term. Furthermore, the ACF and PACF provide information on how many non-seasonal and seasonal autoregressive and moving average terms to include in the model, presented in Figures 7 and 8. The one-step ahead prediction errors for the ARIMA models are presented in Table 1.

Figure 6: First difference of the seasonally differenced data.


Table 1: ARIMA models with six weeks of training data. MAE is the mean absolute error; Median AE is the median absolute error; Error SD is the standard deviation of the errors; IQR is the interquartile range of the errors.

Step forecast  Day  Model               MAE    Median AE  Error SD  IQR
1              Mon  (1,1,4)(1,1,1)_288  54.4   45.1       71.3      91
1              Tue  (1,1,6)(1,1,1)_288  57.8   44.9       77.4      90.4
1              Wed  (1,1,7)(1,1,1)_288  53.2   44.2       68.7      88.2
1              Thu  (1,1,4)(1,1,1)_288  61.8   49.6       81.6      97.8
1              Fri  (1,1,8)(1,1,1)_288  49.7   40.3       62.6      84.7
1              Sat  (1,1,2)(1,1,1)_288  40.3   30.9       52.3      62.4
1              Sun  (1,1,8)(1,1,1)_288  38.5   30.2       48.4      62.6
12             Mon  (1,1,4)(1,1,1)_288  90.7   68.6       120.1     130.8
12             Tue  (1,1,6)(1,1,1)_288  87.5   60.3       120.0     114.5
12             Wed  (1,1,7)(1,1,1)_288  107.3  82.5       138.1     149.5
12             Thu  (1,1,7)(1,1,1)_288  103.1  84.1       132.8     157.6
12             Fri  (1,1,8)(1,1,1)_288  68.4   53.2       89.3      108.3
12             Sat  (1,1,2)(1,1,1)_288  59.5   40.7       84.22     81.6
12             Sun  (1,1,8)(1,1,1)_288  47.2   40.9       59.2      82.5
288            Mon  (1,1,4)(1,1,1)_288  168.8  128.9      153.3     219.9
288            Tue  (1,1,6)(1,1,1)_288  194.9  125.2      218.2     259.7
288            Wed  (1,1,7)(1,1,1)_288  146.0  94.2       166.5     200.7
288            Thu  (1,1,7)(1,1,1)_288  135.3  115.4      143.3     157.6
288            Fri  (1,1,8)(1,1,1)_288  112.9  105.7      109.6     115.5
288            Sat  (1,1,2)(1,1,1)_288  159.4  157.5      91.8      91.3
288            Sun  (1,1,8)(1,1,1)_288  172.9  174.1      63.3      82.6

Figure 8: PACF plot of the differenced data.

4.2.2 Holt-Winters
