
Multi-Source Learning in a 3G Network

YLVA ERSVIK

Master’s Thesis at CSC, KTH
Supervisor: Jens Lagergren

Examiner: Anders Lansner


Abstract

By 2020 the world is expected to generate 50 times the amount of data it did in 2011, and much of this increased information will be carried over a mobile network. Understanding the data in the network can assist in mitigating threats to network performance such as congestion, and help in network management and the allocation of resources. This master’s thesis aims to investigate to what extent the data carried through the mobile network can be understood in its real-world context, and whether anomalous patterns in the network data profile can be explained using external data sources. We constructed topic models using LDA for a Twitter stream in London and modeled how the topics’ relative importance changed over time. We examined three anomalous points in the network data profile and studied their correlation with the topic proportions and current weather information. The topic model for Twitter performed poorly due to the difficulty in processing the multifaceted Twitter corpus. We acknowledge the need to refine the LDA model, to include additional textual data sources, and to understand the different types of anomalies present in the network together with their causes. Such an understanding would allow for a more targeted analysis of anomalous patterns in the network and their relation to the real-world context.


Referat

Machine Learning with Multiple Sources in a 3G Network

By 2020 the world will generate 50 times the amount of data it did in 2011, and much of this increase will be carried over a mobile network. Understanding the data in the network can help us safeguard network performance and support network management and resource allocation. This degree project aims to investigate to what extent we can understand how data in a 3G network relates to its real-world context, and whether anomalous behavior in the network can be explained using external data sources. We built an LDA model for Twitter data for London and modeled how the topics of the content changed over time. We examined the relationship between three anomalous points in the network, the topic proportions, and weather information. The LDA model turned out to perform poorly due to the difficulty of handling the multifaceted Twitter content. We see a need to refine this LDA model, to include additional textual sources, and to understand the different types of anomalous behavior that occur in a network together with their causes. Such an understanding would allow a more precise analysis of these behaviors and their relation to their real-world context.


Preface

This master’s thesis is written as a part of a degree project at the School of Computer Science and Communication at the Royal Institute of Technology (KTH) in collaboration with Ericsson.

Many thanks to my supervisor Jens Lagergren at KTH who guided me through the process, kept me focused on the task at hand and challenged me to achieve results. Many thanks to Martin Svensson and Richard Cöster at Ericsson for their support, never-ending enthusiasm and belief in the project idea.


Contents

1 Introduction
  1.1 What is Big Data?
  1.2 The Mobile Network and Society
  1.3 Purpose
  1.4 Goal
  1.5 Definitions
  1.6 Delimitations
2 Anomaly Detection
  2.1 Detection Techniques
  2.2 Anomaly Detection in Network Traffic
3 Learning from Text
  3.1 Topic Models
    3.1.1 Latent Dirichlet Allocation (LDA)
    3.1.2 Dynamic Topic Models
    3.1.3 Topic Models for Microblogs
  3.2 Text for Prediction
  3.3 Beyond the Text of Social Media
4 Learning in the Temporal Domain
  4.1 Time Series
    4.1.1 Similarity Measure
  4.2 Clustering
    4.2.1 Clustering of Time Series
    4.2.2 Agglomerative Hierarchical Clustering
  4.3 Prediction
  4.4 Spatiotemporal Learning
5 Data Acquisition
  5.1 Risks in the Data Collection Process
6 Method and Implementation
7 Data Overview
8 Results
9 Discussion
  9.1 Significant Correlations
  9.2 Topic Modeling for Twitter
  9.3 Future Work
    9.3.1 Model
    9.3.2 Anomalous Events
    9.3.3 Choice of Resolution


Chapter 1

Introduction

"If we have data, let’s look at data. If all we have are opinions, let’s go with mine."

– Jim Barksdale, former Netscape CEO

1.1 What is Big Data?

The amount and availability of data provided openly, not least on the internet, is ever increasing; we are generating massive amounts of data through an uncountable number of devices that make measurements or that allow the user to share information. By 2020 the world is expected to generate 50 times the amount of data it did in 2011 [Matti and Kvernvik, 2012].

Data is often made publicly available in real time, and the opportunities for data miners seem endless. Ongoing research is extensive, and we can expect knowledge in the field to increase as research continues, not least with advances in technology, since the machine learning methods used to explore the data pose extreme capacity and memory requirements on our systems.

Increasingly, data is given a geo-spatial component, with users via their mobile devices sending information not only about themselves, their sentiments, thoughts, and opinions, but also about their exact geographical location. This not only permits us to perform socio-geographical analytics using the thoughts, experiences and sentiments expressed by people; we can also exploit mass crowd behaviors in trying to understand what is going on in our societies.


What is less frequently mentioned in the context of big data are questions of information security, ethics around the collection and storage of data, and the integrity of and right of an individual to her own data.

1.2 The Mobile Network and Society

Much of the increased information that we and our devices generate will be carried over a mobile network. When data volumes in the network increase, they can reduce performance, cause congestion in the network, and degrade the user experience [Matti and Kvernvik, 2012].

Understanding the data in the network assists in mitigating these threats through better network management, allocation of resources, and insight generation through global comparisons. It may also serve as a generator of ideas for the utilization of unused resources, such as recognizing additional revenue potential, ideas for new subscription plans, and optimization of roaming opportunities. The real-time information we can extract from data in the network can also be used for urban planning, such as efficient transportation and the smart distribution of electricity and water, by forecasting demand and meeting it with minimal waste [Matti and Kvernvik, 2012].

Finally, understanding data is crucial for business innovation and the development of new business models, with historical examples from India and Ghana including leasing contracts for capacity in the network during peak hours [Moritz, 2012]. In short, it allows us to be proactive rather than reactive to change in the social patterns of society – something that must come hand in hand with innovation in mobile communications.

1.3 Purpose

The aim of this thesis project is to investigate how the abundance of data generated by us and our objects and devices, and carried through the mobile network, relates to the physical, real world where it is created. Working under the assumption that the network data and other data are different descriptions of the same reality, can we correlate the network profile with other variables that provide a description of our everyday behavior, social patterns, needs, events and circumstances? To what extent can network data patterns be explained by other data sources that are descriptive of the physical world? Subsequently, can other sources of data help us understand behavior in the network that would otherwise be considered anomalous or unexpected?


1.4 Goal

The goal is to develop a model that shows us if and how aggregate traffic statistics¹ can be understood by looking at data generated outside of the network. The model will operate on historical traffic statistics, Twitter streams and weather reports for the same time period. More specifically, the goal is to investigate whether we can find any correlation or dependency between variables at time points that seem abnormal or particularly interesting to us.

1.5 Definitions

A cellular network or mobile network is a wireless network distributed over land areas called cells.

A cell is a land area served by one or more radio base stations. Joined together the cells provide radio coverage over a wide geographic area.

3G is the third generation of mobile telecommunications technology. Via 3G, users get access to telephony, mobile internet, mobile TV, and more.

1.6 Delimitations

We look at 3G data from a single operator, per network cell.

We do not aim to be comprehensive in the choice of data sources added to the model, but rather to get an idea of the potential value of adding additional sources for explaining abnormal events in the network.

Pre-processed data from an additional source should be addable to the model afterwards; hence the model should be generic.

¹ Traffic volume and the number of active users within the network cell.


Chapter 2

Anomaly Detection

"To study the abnormal is the best way of understanding the normal."

– William James

Anomaly or outlier detection refers to the problem of finding patterns in data that do not conform to expected normal behavior [Chandola et al., 2009]. Such detection has become an important problem in many industrial and financial applications [Pokrajac et al., 2007]. Anomaly detection has been researched in various research areas and for diverse application domains. Many techniques for anomaly detection are generic, while others have been developed for specific application domains.

The exact notion of an anomaly differs between application domains, and hence a technique developed in one domain cannot be applied in another without modification. Anomalies may be present in data for a variety of reasons, including media events, malicious activity, intrusion or terrorist activity, break-down of a system, natural disaster or crisis event. In this chapter, we will provide a survey of anomaly detection techniques as well as an overview of recent research on anomaly detection in network traffic.

2.1 Detection Techniques

Chandola et al. [2009] distinguish between classification-based, nearest-neighbor-based, clustering-based, statistical, information theoretic, and spectral anomaly detection techniques.


Classification-based anomaly detection techniques operate with machine learning techniques for classification, such as neural networks, Bayesian networks, support vector machines, and rule-based techniques. They operate under the assumption that the training data consists of one or multiple normal classes, and any test instance that does not fall into one of those is considered anomalous.

Nearest neighbor-based anomaly detection techniques tend to assign to each data instance an anomaly score, computed as the distance to the k-th nearest neighbor or as the relative density of the instance. Distance-based variants usually apply a threshold value on the anomaly score to determine whether an instance is anomalous or not. Density-based techniques instead use the density of the neighborhood of a data instance to determine whether it is to be declared anomalous or normal. The key advantage of nearest-neighbor techniques is that they are unsupervised. A key disadvantage is that density-based techniques perform poorly if the data has regions of varying densities; also, unsupervised techniques miss anomalies that have close neighbors and misclassify normal instances that do not have enough close neighbors.
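As a concrete illustration of the distance-based variant, the following minimal Python sketch scores each instance by its distance to its k-th nearest neighbor and flags the largest scores. The choice of k and the percentile threshold are assumptions made for illustration, not prescriptions from Chandola et al. [2009].

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(0, 1, size=(200, 2))               # normal instances
    X = np.vstack([X, [[6.0, 6.0], [-5.0, 7.0]]])     # two planted outliers

    k = 5
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own 0-th neighbor
    dist, _ = nn.kneighbors(X)
    scores = dist[:, k]                               # anomaly score: distance to the k-th nearest neighbor
    threshold = np.percentile(scores, 99)             # simple threshold rule on the score
    print(np.where(scores > threshold)[0])            # indices declared anomalous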

Clustering-based techniques apply clustering algorithms to find clusters in the data, and operate under the assumption that normal data instances lie close to their closest cluster centroid, while anomalies lie far away. The key advantage of clustering-based techniques is that they are unsupervised. The disadvantage of such techniques is that they are designed to find clusters, not anomalies, and any anomalies that form clusters of their own will be missed.

Statistical anomaly detection techniques assume a generative stochastic model for the data. Such techniques fit a statistical model to the data, and statistical inference determines whether a particular data instance belongs to the model or not. Statistical anomaly detection techniques can be parametric or non-parametric. The key advantage of statistical techniques is that they are associated with a confidence interval; the key disadvantage is that we need to make an assumption that the data is generated from a particular distribution.

Information theoretic techniques make use of the information content of the data, such as entropy and Kolmogorov complexity. The advantage is that we do not need to make any assumption about the generative distribution; the disadvantage is that we are highly dependent upon our choice of information theoretic measure.

Spectral techniques try to identify subspaces where anomalous instances are more easily identified. Such techniques are suitable for high dimensional data sets; the disadvantage is that they are useful only if there exist subspaces where normal and anomalous instances are separable.


2.2 Anomaly Detection in Network Traffic

Real network traffic has intrinsic characteristics that must be taken into account in any attempt to model normal traffic. Given these properties, an anomaly detection system must learn normal behavior and hence take into account the periodicity, nonstationarity, and seasonality observed in aggregate traffic variables [Coluccia et al., 2013].

Coluccia et al. [2013] illustrate how a reference set S for normal traffic can be constructed, taking into account the nonstationarity and the daily and weekly seasonality of real traffic behavior. They further detect traffic anomalies using two different distribution-based detection approaches on an operational 3G network, where a distribution is the network-wide distribution of a network variable across individual users. To identify the reference distribution, the authors use a sliding window approach. Letting k be the current time bin, they consider the N_w past time bins k−1, k−2, ..., k−N_w and assume that the most correlated information is contained in the most recent observations. This simple method does not take into account that certain behavior may be repeated systematically over time, or behavior that adheres to certain times of the day. The simple sliding window approach can be extended to a dual-window approach, which also takes into account the time bins corresponding to the same times on past days. Further, they consider two ways of developing a detector from the reference set: a heuristic approach that involves computation of internal and external divergence metrics, and one based on a generalized likelihood ratio test (GLRT). The detector based on divergence metrics compares the internal dispersion (the set of divergences between all pairs of distributions in the reference set) with the external dispersion (the divergences between the current distribution and those in the reference set).

Similarly, D’Alconzo et al. [2010] present a statistical-based method for anomaly detection in 3G networks, designed to take into account the nonstationary nature of network traffic. They conclude by experiment that the assumption that the most recent samples have maximum correlation with the current sample does not hold.

Instead, they identify a reference set from which they exclude the most recent samples in the observation window. Since traffic distributions at the same hour of different days tend to be similar, they include n previous days in the observation window, letting the reference set identification algorithm search for same-hour samples. The reference set I_0(t) consists of selected past distributions observed in the current observation window W(t). The observation window is a set of time bins W(t) = {t_j : a(t) ≤ t_j ≤ b(t)}, where a(t) and b(t) are the lower and upper bounds defining the window for the distribution X(t) at time t. The distributions in I_0(t) are selected based on their similarity to the current distribution. The detector is again developed as a comparison of internal and external dispersion metrics. D’Alconzo et al. [2010]

further find that detected anomalies tend to be present across variables and time scales in the network traffic, that human supervision seems unavoidable, and that the effectiveness of fully automated anomaly detection systems is questionable.

Contrary to Coluccia et al. [2013] and D’Alconzo et al. [2010], Siris and Papagalou [2006] use a scalar measure of traffic flow, the number of TCP SYN packets, for detecting SYN flooding¹ anomalies. Siris and Papagalou [2006] evaluate two adaptive algorithms: the adaptive threshold algorithm and the cumulative sum algorithm. The adaptive threshold algorithm is a simple algorithm that signals an alarm when k consecutive measurements exceed a threshold (α + 1)µ, where µ is the measured average and α determines the sensitivity of the detector. The cumulative sum algorithm signals an alarm when the accumulated volume of measurements that are above a threshold (α + 1)µ exceeds some threshold h.
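A minimal Python sketch of the adaptive threshold algorithm is given below. The update rule for the running average µ (here an exponentially weighted moving average) and all parameter values are assumptions made for illustration; Siris and Papagalou [2006] leave room for different estimators of µ.

    import numpy as np

    def adaptive_threshold_alarms(x, alpha=0.5, k=3, decay=0.9):
        # Alarm when k consecutive measurements exceed (alpha + 1) * mu.
        mu = x[0]
        consecutive, alarms = 0, []
        for t, value in enumerate(x[1:], start=1):
            if value > (alpha + 1.0) * mu:
                consecutive += 1
                if consecutive >= k:
                    alarms.append(t)
            else:
                consecutive = 0
            mu = decay * mu + (1.0 - decay) * value   # exponentially weighted running average
        return alarms

    rng = np.random.default_rng(0)
    traffic = rng.poisson(100, 500).astype(float)
    traffic[200:220] += 400                           # injected SYN-flood-like burst
    print(adaptive_threshold_alarms(traffic))         # alarms fall inside the burst window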

Other recent articles include Mackrell et al. [2013], who look at the frequency spectra of internet traffic for a selection of crisis events. They use discrete Fourier transforms to transform the time series of traffic variables into the frequency domain. They use an event model and a control model and compare their respective correlation coefficients with a test set. A binary indicator that identifies cases where the correlation coefficient of the event model exceeds that of the control model serves as an indicator of an anomaly.

¹ A SYN flood is a series of connection requests sent in an attempt to consume enough resources to make a system unavailable to its intended users.


Chapter 3

Learning from Text

Computational linguistics, or natural language processing, is a broad scientific research field that aims to find sophisticated methods to analyze the large and ever-growing corpora of natural language – such as words and text in English, Norwegian and Portuguese, or any other natural language – that are stored digitally, sometimes on the web. Though natural language processing covers many more aspects than fit into this section, we will here give a brief introduction to natural language processing for topic modeling and prediction, with a focus on microblogs and news.

3.1 Topic Models

Topic models are a suite of algorithms targeted to uncover the hidden thematic structure in a collection of documents. The models can help us understand not only what the themes are but also how they are connected and how they evolve over time [Blei, 2012].

3.1.1 Latent Dirichlet Allocation (LDA)

Latent Dirichlet allocation (LDA) is a highly successful topic modeling method first introduced by Blei et al. [2003]. Blei [2012] presents the intuitive idea behind LDA: documents are collections of words x_{1:D} that have arisen from one or more topics. Blei gives as an example a scientific article that blends evolutionary biology, genetics and data analysis. Latent Dirichlet allocation turns this intuition into a generative probabilistic process for a collection of D documents, a corpus. We assume that each of the D documents is a mixture of K topics, each topic being a distribution over a fixed vocabulary of W words. Each document shares the same K topics, though exhibiting them in different proportions.

LDA treats the observed words and documents as being generated by a hidden topic structure [Blei et al., 2003]. The hidden variables are the mixing proportions θ_{1:K} of topics per document d, the topics β_{1:K}, and the per-word topic assignments z_{1:D,1:W}. The topic proportions θ are distributions over topic indices 1, ..., K and the topics β are distributions over word indices 1, ..., W; both are Dirichlet random variables. Now let α be a positive K-vector and η a scalar; the LDA generative process can then be written

    topic proportions    θ_d ∼ Dir(α)                  (3.1)
    topics               β_k ∼ Dir(η)                  (3.2)
    topic assignments    z_{d,w} ∼ Mult(θ_d)           (3.3)
    words                x_{d,w} ∼ Mult(β_{z_{d,w}})   (3.4)

Each α_k is the prior weight of topic k in a document, and η is the prior weight of each word in a topic. Inferring the hidden topic structure corresponds to computing the posterior distribution of the hidden variables θ_{1:K}, β_{1:K} and z_{1:D,1:W} given the observed documents. Computationally, the posterior distribution is intractable, and therefore a variety of techniques have been developed for approximate inference, including variational inference [Blei et al., 2003] and Gibbs sampling [Steyvers and Griffiths, 2006], each with advantages and disadvantages subject to trade-offs between speed, complexity, accuracy and simplicity [Blei and Lafferty, 2009].

LDA assumes that words are exchangeable within documents, meaning that the order of words does not affect the probability of their topic assignments; the content of each document is treated as a bag of words, that is, as a vector of word counts. LDA further assumes that documents are exchangeable within the corpus, meaning that their order does not affect the probability of their being generated by the model. This assumption is a simplification; it is obvious that many sets of documents, such as scientific journals, newspapers and emails, have content that evolves over time [Blei and Lafferty, 2009, Blei, 2007].
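To make the generative process in equations (3.1)–(3.4) concrete, the following minimal numpy sketch samples a toy corpus from it. The corpus size, vocabulary size and hyperparameter values are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(42)
    D, K, W, N = 5, 3, 20, 30                       # documents, topics, vocabulary size, words per document
    alpha, eta = np.full(K, 0.5), 0.05

    beta = rng.dirichlet(np.full(W, eta), size=K)   # topics: distributions over words, eq. (3.2)
    corpus = []
    for d in range(D):
        theta_d = rng.dirichlet(alpha)              # topic proportions for document d, eq. (3.1)
        z = rng.choice(K, size=N, p=theta_d)        # per-word topic assignments, eq. (3.3)
        words = [rng.choice(W, p=beta[k]) for k in z]   # words drawn from the assigned topics, eq. (3.4)
        corpus.append(words)
    print(corpus[0])                                # word indices of the first document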

3.1.2 Dynamic Topic Models

The dynamic topic model [Blei and Lafferty, 2009] can be used to model how topics, or trends within a topic, evolve over time. In the dynamic topic model, documents in the corpus are sequentially organized. Here we assume that documents are exchangeable within each sequential slice. In this way, the model allows topic distributions to evolve from slice to slice, where each time slice is a separate LDA model [Blei, 2007].


3.1.3 Topic Models for Microblogs

Hong and Davison [2010] address the problem of topic modeling in the short-text context by aggregating tweets by the same author into documents, as an extension of the standard LDA model. They find that the effectiveness of topic models can increase by aggregating short messages, and that different aggregation methods for the short tweets yield different topics.

Zhao et al. [2011] compare the content of a Twitter corpus with that of the New York Times as a source of news articles, to understand topical differences and the different nature of the two sources as information providers. Zhao et al. [2011] use a standard LDA for the New York Times data set. For Twitter, where tweets are short in comparison to a news article, they propose a Twitter-LDA model that recognizes the fact that each tweet is usually about a single topic, with each user choosing a topic for her tweet based on her topic distribution. Their approach differs from that of Hong and Davison [2010], where tweets from a single user are treated as one document, which comes with the assumption that each document must have a single author. They find that the Twitter-LDA model outperforms standard LDA for discovering topics from Twitter. They also find that Twitter and news cover similar topics but that the distributions differ: family, life and arts are given more attention than world events in the Twitter feed, while news tends to have more of a balance between world, arts and business events.

3.2 Text for Prediction

Twitter as a data source has been extensively explored to predict a number of real-world phenomena such as flu trends [Achrekar et al., 2011], criminal incidents [Wang et al., 2012], stock market trends [Bollen et al., 2011] and state-level polling in the US [Beauchamp, 2013]. Related studies have been performed with other social media content, for example stock market prediction using financial news articles [Schumaker and Chen, 2009].

Achrekar et al. [2011] present a framework for predicting the emergence and spread of influenza based on messages posted on Twitter. They use a data set consisting of a real-time stream of tweets containing textual indicators of flu, and an autoregressive model that takes the number of unique Twitter users with flu as input.

Wang et al. [2012] study prediction of criminal incidents using historical patterns of criminal incidents in combination with data from Twitter. They show that a model that incorporates information from Twitter increases prediction accuracy for future incidents, compared to a model that relies solely on historical information on criminal incident patterns. They first extract current events from the Twitter feed of day d that they hypothesize correlate with future criminal incidents. Semantic role labeling (SRL) is used for event extraction, after which they extract topics from the events using LDA. The distribution of topics from the LDA is then used in a linear regression model that determines the probability of an incident happening on day d + 1. The parameters of the regression model are estimated using historical incident data.

Beauchamp [2013] uses a linear regression model, refitted each day, to study the relationship between polls and the text in Twitter feeds, more specifically how word frequencies are associated with changes in pro-Obama and pro-Romney vote intentions.

3.3 Beyond the Text of Social Media

Although Twitter has been used extensively for content analysis, it has also been explored for more generic attempts to use the geo-spatial and textual information provided.

Twitter has been used to build a real-time earthquake reporting system in Japan [Sakaki et al., 2010]. The authors detect target events, which they define as large-scale events with a spatial and temporal location that affect people’s daily life. They first search among tweets that mention the target event and classify them into a positive or negative class. In this way, Sakaki et al. [2010] consider the tweets as sensors with a time and location. They develop a temporal model in which they calculate the probability of an event occurring given the sensor signals, and a spatial model in which they estimate the earthquake center, and the trajectory of a typhoon, using Kalman and particle filters.

In an attempt to determine the occurrence of local events, Lee and Sumiya [2010] study crowd behavior patterns using geo-tagged microblog data. They attempt to model the regular behavior of local crowds as a geographical regularity that dictates usual crowd patterns. This geographical regularity defines the normal status of the region, and on the basis of these local characteristics it is used to detect unusual crowd activities. If a region is unusually crowded, specific messages can be examined within the region; however, the event itself is defined by its location and geographical crowdedness rather than by keywords.


Chapter 4

Learning in the Temporal Domain

"The only reason for time is so that everything doesn’t happen at once."

– Albert Einstein

In the previous chapter we explored how Twitter, microblogs and other textual data sources have been exploited for learning. Could we generalize this knowledge to data sources in other domains? Acknowledging that our textual data sources, as well as the data in the mobile network, are indexed by time, this chapter aims to provide an overview of methodologies for knowledge discovery from time series and other temporal data.

Temporal data mining involves slightly different objectives and constraints and is concerned with the mining of large sets of data ordered with respect to some index. Such data sets could be text, time series, moves in a game, or any other sequence of data where the ordering of records is crucial. Temporal data mining tasks include prediction, classification, clustering, search, and pattern discovery [Laxman and Sastry, 2006], as well as anomaly detection [Esling and Agon, 2012]. Clustering and prediction will be briefly introduced here.

4.1 Time Series

Time series analysis has a long history, in which weather forecasting and stock market prediction are among the oldest and most studied applications. When speech recognition research emerged, matching and classification of time series started to receive attention. With that came also a raised interest in machine learning techniques such as Hidden Markov Models and Neural Networks for time series analysis [Laxman and Sastry, 2006, Esling and Agon, 2012].

4.1.1 Similarity Measure

Comparing time series and their movement patterns is central in many fields, ranging from data mining tasks such as clustering, classification and rule discovery [Kvernvik, 2013] to applications in pattern and speech recognition, surveillance, and computer animation [Soatto, 2007]. Clustering of time series, or any other type of analysis that requires a comparison between two or more series, involves a consideration of how to measure similarity between the series. This choice of similarity measure is crucial since it affects the final classification or clustering outcome [Soatto, 2007, Lhermitte et al., 2011].

Data on the same event, object or feature can show a large degree of variability, and depending on the nature of the time series (discrete-valued or real-valued, of equal or unequal length, univariate or multivariate [Liao, 2005], non-stationary or stationary [Soatto, 2007]), one measure may be more applicable than another. The type of application may also pose additional requirements on the type of distance desired [Liao, 2005, Soatto, 2007, Esling and Agon, 2012].

Let Q = q_1, q_2, ..., q_i, ..., q_n and R = r_1, r_2, ..., r_j, ..., r_m be two time series. The Euclidean distance between Q and R is defined as

    d_E(Q, R) = √( Σ_{k=1}^{n} (q_k − r_k)² )    (4.1)

where Q and R in this case are time series of equal length n = m. Computing the Euclidean distance involves aligning sequences one-to-one; q_i is necessarily aligned with r_i [Kvernvik, 2013]. Similarly, we can compute the Minkowski distance d_M, which generalizes the Euclidean distance to

    d_M = ( Σ_{k=1}^{n} (q_k − r_k)^q )^{1/q}    (4.2)

for q > 0. Its main advantage is the ease of its calculation and its interpretability, while limitations include a stationarity requirement for the time series and zero cross-correlation between the data sets. A solution to this is the use of the Mahalanobis distance [Lhermitte et al., 2011]. These distance measures can be distinguished from correlation-based measures, of which the best known is Pearson’s correlation coefficient

    d_CC = Σ_{k=1}^{n} (q_k − q̄)(r_{k−s} − r̄) / ( √(Σ_{k=1}^{n} (q_k − q̄)²) · √(Σ_{k=1}^{n} (r_{k−s} − r̄)²) )    (4.3)

which describes the linear relationship between Q and R, s being the lag between the two series [Lhermitte et al., 2011].
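A minimal numpy sketch of the measures in (4.1)–(4.3) is given below; the test series and the lag of eight samples are constructed purely for illustration.

    import numpy as np

    def euclidean(q, r):
        return np.sqrt(np.sum((q - r) ** 2))              # eq. (4.1)

    def minkowski(q, r, p):
        return np.sum(np.abs(q - r) ** p) ** (1.0 / p)    # eq. (4.2); abs keeps odd p well-defined

    def lagged_pearson(q, r, s=0):
        if s > 0:                                         # pair q_k with r_{k-s}, eq. (4.3)
            q, r = q[s:], r[:-s]
        return np.corrcoef(q, r)[0, 1]

    t = np.linspace(0, 4 * np.pi, 200)
    a = np.sin(t)
    b = 2 * np.sin(t + 0.5) + 1       # amplified, shifted copy that leads a by roughly 8 samples
    print(euclidean(a, b))            # large: distances are sensitive to amplitude and shift
    print(lagged_pearson(a, b, s=8))  # close to 1: correlation ignores amplitude once lag-aligned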

Lhermitte et al. [2011] evaluate the performance of similarity measures between an original time series and a simulated series into which effects on amplitude, scaling, noise and time translation have been introduced. Esling and Agon [2012] contribute with a study of similarity measures’ robustness to scale, time warps, noise and outliers. The Euclidean distance and other distance-based measures are sensitive to distortions and unable to achieve a level of abstraction that can account for noise, outliers, amplitude and scaling effects [Esling and Agon, 2012].

Correlation-based measures evaluate the relationship between the series and do not account for distance; they thus provide a better picture of how two time series move together without being affected by amplitude shifts. Correlation-based measures are nevertheless sensitive to noise and to time scaling and translation effects that create a lag between the original and simulated time series. In particular, Lhermitte et al. [2011] show that an increase in the lag between series results in a decrease in the correlation. Similarly, they show that noise results in decreased correlation.

The sensitivity to time lags can be addressed by Dynamic Time Warping (DTW) [Berndt and Clifford, 1994], which allows for time shifting and for comparison of time series of unequal lengths [Liao, 2005]. The DTW algorithm finds the optimal alignment by computing a warping path that minimizes the distance between the two time series. The first step involves computing a distance matrix with n × m elements, each representing the Euclidean distance between the points q_i and r_j. Each possible warping between the time series is a path through the matrix; the path W = w_1, w_2, ..., w_k, ..., w_K that minimizes the distance between the two series is called the warping path and can be defined as

    d_DTW = min( Σ_{k=1}^{K} w_k / K )    (4.4)

for max(m, n) ≤ K ≤ m + n − 1 [Liao, 2005, Al-Naymat et al., 2009]. The problem with DTW is that it still suffers from sensitivity to amplitude effects due to its dependence on the Euclidean distance, as well as a lack of robustness to noise and outliers [Esling and Agon, 2012].

4.2 Clustering

The purpose of clustering is to find structure in data by organizing it into natural groups. In clustering of temporal data, this involves grouping time series or sequences based on their similarity. Formally, the grouping should maximize the intercluster variance while minimizing the intracluster variance; the objective is to find homogeneous clusters that are as distinct from each other as possible [Esling and Agon, 2012]. In short, the time series clustering task usually involves three components: an algorithm to perform the clustering, a measure of similarity or distance between the time series, and a criterion for evaluation [Liao, 2005].

4.2.1 Clustering of Time Series

Esling and Agon [2012] divide the time series clustering task into whole-series clustering and subsequence clustering, where whole-series clustering refers to grouping entire time series into clusters, and subsequence clustering refers to grouping subsequences of single or multiple longer time series. Subsequence clustering involves slicing time series into non-overlapping windows whose width is chosen by investigating the periodicity of the time series, e.g. by an autocorrelation study. The limitation of this approach is that in the absence of a strong periodicity, the slicing may miss structures in the series. Several methods have been proposed to overcome this problem, including clustering algorithms that are not forced to use all available slices and algorithms that let subsequences overlap [Esling and Agon, 2012].

After choosing a suitable distance measure, almost any generic clustering algorithm can be adapted to fit the task at hand. Esling and Agon [2012] mention whole-series clustering methods using Self-Organizing Maps, Hidden Markov Models and Support Vector Machines. Liao [2005] provides a survey of clustering algorithms for static data and distinguishes between partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

Widiputra et al. [2011] propose a clustering method for time series that models whole-series and subsequence relationships and predicts the values of the time series simultaneously. To do so, they perform whole-series clustering using Pearson’s correlation coefficient as the similarity measure, and cluster recurring trends of a time series using kernel regression.

4.2.2 Agglomerative Hierarchical Clustering

Hierarchical clustering is a method that groups time series or other data objects into a tree of clusters. The agglomerative clustering algorithm works by placing each object in its own cluster at the bottom of the tree and ends with a single cluster at the top; it is therefore possible to follow the merging process. The criterion for the fusion of two clusters is a so-called linkage function, usually one of the following: single, complete, average, or Ward’s linkage. The single and complete linkage algorithms measure the similarity between the closest and farthest pairs of data objects respectively; average linkage uses the average over all pairs, and Ward’s linkage merges clusters based on the increase in the sum-of-squares variance [Liao, 2005]. Agglomerative hierarchical clustering can be used with any choice of similarity measure and for time series of unequal lengths.
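As an illustration, the following sketch clusters ten toy series with average linkage via scipy. The Euclidean metric is used for brevity; any precomputed distance (DTW-based, correlation-based) could be supplied instead.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(1)
    t = np.linspace(0, 2 * np.pi, 100)
    series = np.vstack([np.sin(t) + 0.1 * rng.standard_normal(100) for _ in range(5)] +
                       [np.cos(t) + 0.1 * rng.standard_normal(100) for _ in range(5)])

    d = pdist(series, metric="euclidean")          # condensed pairwise distance matrix
    Z = linkage(d, method="average")               # 'single', 'complete' and 'ward' are the other options
    print(fcluster(Z, t=2, criterion="maxclust"))  # two clusters: sine-like vs cosine-like series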

4.3 Prediction

We have already seen in Section 3.2 how prediction can be achieved by building a predictive model of textual data. Prediction of time series, which aims to model dependencies between subsequent values, is a major research area in several fields. It has received so much attention that there are numerous surveys focused solely on specific applications or on a specific family of methods [Esling and Agon, 2012]. In classical time series analysis, the family of autoregressive (AR) models uses a linear combination of earlier values to predict a future value. Here, the ARMA model assumes linear stationarity of the series, and the ARIMA model targets processes where the difference between successive terms can be assumed to be stationary [Laxman and Sastry, 2006].
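A minimal sketch of fitting an autoregressive model to a synthetic series with statsmodels follows; the AR(2) coefficients and the model order are arbitrary illustrative choices.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(2)
    x = np.zeros(300)
    for t in range(2, 300):                       # synthetic AR(2) process
        x[t] = 0.6 * x[t - 1] - 0.2 * x[t - 2] + rng.standard_normal()

    model = ARIMA(x, order=(2, 0, 0)).fit()       # order=(p, d, q); d=1 would difference the series first
    print(model.params)                           # estimated AR coefficients and noise variance
    print(model.forecast(steps=4))                # predict the next four values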

Also popular for the prediction of time series are machine learning methods, in- cluding neural networks, support vector machines, self-organizing maps and cluster function approximation [Esling and Agon, 2012].

The model proposed by Widiputra et al. [2011] for simultaneous prediction of multiple time series recognizes that different clusters of whole time series or subsequences depict different phenomena, and shows that local regressions developed for each cluster provide better prediction accuracy than global models such as linear regressions over multiple time series and the multi-layer perceptron (a neural network).

4.4 Spatiotemporal Learning

We have now seen how we can explore various types of data in the temporal domain. In the mobile network, we are also concerned with the geo-spatial dimension of our problem. How can we relate to this? There exists little work on the analysis of geo-spatial time series; the methods that exist tend to be heavily adapted to a specific problem or application, or to relationships between only a subset of available locations.

Chandra and Al-Deek [2008] use cross-correlation analysis to study the dependencies of the traffic speed at a location on the traffic speeds at upstream and downstream locations. They find that past values at upstream and downstream locations influence the value at a location, and that a vector autoregressive model works better than a traditional ARIMA model for prediction of traffic speeds at this location. This is an example of what Rinzivillo et al. [2008] describe as an approach where spatial relationships are made explicit before modeling; the authors have a clear idea that they want to study the effect of data at two locations on a third. This approach to dealing with the geo-spatial domain is advantageous in that we can afterwards apply any standard data mining technique, in this case autoregressive models.

Rinzivillo et al. [2008] describe a second approach to geo-spatial data mining where the spatial domain can be explored during the data mining process, but where the data mining techniques have to be reinvented to suit a specific purpose.

We saw in Section 3.3 how Sakaki et al. [2010] propose temporal and spatial models for detecting earthquake events. These authors’ temporal and spatial models are largely separate. The temporal model detects tweets written about the target event (an earthquake) and classifies them as positive or negative; the spatial model uses Kalman and particle filtering to locate the centers and trajectories of the targeted events.


Chapter 5

Data Acquisition

"With data collection, ’The sooner the better’ is always the best answer."

– Marissa Mayer, President and CEO of Yahoo!

Since the provision of historical data is limited, we collected data in real time for our regions of interest at suitable aggregation levels. This proved to be quite a complex process. The process consisted of writing Python scripts that collected Twitter posts in real time as they were written, and weather reports every half hour, as weather updates were made at this frequency. We collected data for the period between 1 am on 21 November 2013 and 12 am on 6 December 2013. Aggregate network statistics were extracted for this period at a resolution of 15 minutes. This frequency was chosen out of a desire for high-frequency data to better capture events in the network; since 15 minutes was the highest frequency available for the aggregate network statistics, it was a natural choice.

All geo-tagged Twitter posts sent in Greater London were acquired. In this region the availability of English tweets was assumed to be high, which should aid natural language processing and the interpretation of the results at a later stage. This region of interest can be specified by the bounding box¹ shown in Table 5.1.

The aggregate network statistics collected included total traffic volume and the number of active users, per cell and for the network as a whole, for one telecommunications operator in the region.

The weather conditions collected included temperature, perceived temperature, air pressure, precipitation, and wind speed, collected in real time every 30 minutes² from Weather Underground’s API [Underground, 2013]. The API requests were sent over HTTP using Python and responses were returned in JSON format. Data were then saved to files on the local drive.

¹ A bounding box describes a land area on Earth by a bounding rectangle, defined by its latitude and longitude coordinates.

Table 5.1. Bounding box for London.

Location   Top left (lat, lon)   Bottom right (lat, lon)
London     51.671, -0.423        51.334, 0.159
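A minimal sketch of the kind of polling script described above is shown below. The endpoint URL and output format are hypothetical placeholders, not Weather Underground’s actual API; only the overall fetch-and-append loop reflects the process used.

    import json
    import time
    import urllib.request

    API_URL = "https://api.example.com/london/conditions.json"   # hypothetical endpoint

    def poll_weather(out_path="weather.jsonl", interval_s=30 * 60):
        # Fetch a JSON weather report every 30 minutes and append it to a local file.
        while True:
            try:
                with urllib.request.urlopen(API_URL, timeout=30) as resp:
                    report = json.load(resp)
                with open(out_path, "a") as f:
                    f.write(json.dumps(report) + "\n")
            except Exception as err:              # keep collecting despite transient failures
                print("request failed:", err)
            time.sleep(interval_s)

    # poll_weather()                              # runs until interrupted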

Twitter posts (tweets) were streamed from Twitter’s Streaming API [Twitter, 2013b] as they were written, using a Python wrapper extending code written by Geduldig [2013]. Connecting to the streaming API requires keeping an HTTP connection open that returns data from the Twitter API incrementally in the form of JSON-encoded Tweet objects. Twitter users can provide information about their location on their profile page but can also make the exact location of their mobile devices public. Tweets from users who have enabled geo-location have the geo encoding attached to the message, with geographic coordinates specified as precise longitude and latitude coordinates. Data were then stored in JSON files on the local drive.

5.1 Risks in the Data Collection Process

The data collection process involved considerations that required particular attention. The primary risks were that the scripts collecting the data would crash due to bugs, power outages, or computer failures, or that the Twitter or Wunderground servers would deny access or interrupt an existing connection; such errors would damage the data and create missing values. The Python scripts used for data acquisition had to be developed accordingly.

² Sending requests more often would violate the API’s restrictions on the number of allowed requests per day.


Chapter 6

Method and Implementation

As we know from Section 3.2, numerous attempts have been made to use Twitter and other textual data sources for the prediction of seemingly unrelated variables or events. Often the prediction is targeted at a well-defined variable or event. In some cases the prediction is binary [Wang et al., 2012] or targeted at previously defined keywords in the Twitter stream [Achrekar et al., 2011, Sakaki et al., 2010]. Other attempts, described in Section 3.3, try to use the geographical information provided in tweets. More rarely have these approaches been combined; the probabilistic spatiotemporal model by Sakaki et al. [2010] is a unique attempt to make use of the whole spectrum of information in the Twitter stream for knowledge discovery. Even more rarely have attempts been made to make predictions using more than one additional source of information.

While most research in the field has focused on combining two data sets as an intersection of social media and the physical world, this thesis aims to explore the possibility of using multiple data sets to generate knowledge. We work under the assumption that the data sources are diverse and that we know little about the nature of the data. Therefore we refrain from making assumptions of stationarity, autocorrelation, or a particular generative distribution, and the other assumptions underlying many of the similarity measures for comparing time series and for modeling the process. As mentioned in Chapter 1, we seek a model that can work for a variety of data sets.

In order to compare our data sets, we need to process them and represent them with objects that are comparable. In this case, we represent each variable with a time series, enabling us to extend the comparison to an unlimited number of variables. To achieve this, we first need to process the Twitter corpus to represent it as a time series. Following Blei [2012], we implement a probabilistic topic model with LDA. Since one time point in our case corresponds to one document, we do not need to implement the dynamic topic model [Blei and Lafferty, 2009] but can implement standard LDA using the lda package for R [Chang, 2012]. As mentioned in Section 3.1.3, Twitter posts (tweets) are normally considered too short to form documents on their own. Therefore, and in order to reduce the dimensionality of the data, we aggregate all tweets written in a 15-minute interval into one document. Following Blei and Lafferty [2009], we preprocess the corpus by tokenization, stemming, and the removal of stop words and of words that appear fewer than five times. In addition, we only consider tweets written in English. We further remove all hashtags and words starting with @ (references to other Twitter users). A number of tokenization implementations and stemmers were tried; the main challenge was the lemmatization of words, since most implementations are made for clean text, while misspellings, abbreviations, and website addresses are plentiful in the Twitter corpus. We ended up with a corpus of 5,697 unique tokens.

After preprocessing, we create a corpus of documents using the gensim package [Řehuřek, 2013] in Python, which serves as input to the LDA model in R.
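A minimal sketch of these preprocessing and corpus-building steps is shown below, assuming tweets have already been grouped into 15-minute buckets. The stop-word list and whitespace tokenizer are simplified stand-ins for the implementations actually tried.

    from gensim import corpora

    STOPWORDS = {"the", "a", "an", "and", "in", "is", "to", "of"}   # stand-in stop-word list

    def preprocess(raw_docs):
        docs = []
        for text in raw_docs:                     # one document per 15-minute interval
            tokens = [w for w in text.lower().split()
                      if w not in STOPWORDS
                      and not w.startswith("#")   # drop hashtags
                      and not w.startswith("@")]  # drop references to other users
            docs.append(tokens)
        return docs

    docs = preprocess(["Morning train to Victoria delayed again",
                       "Lovely morning walk by the river"])
    dictionary = corpora.Dictionary(docs)
    # dictionary.filter_extremes(no_below=5)      # drop words seen fewer than 5 times (needs a real corpus)
    bow_corpus = [dictionary.doc2bow(d) for d in docs]
    print(bow_corpus)                             # bag-of-words vectors, one per document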

Setting the parameters for LDA involves a few considerations. First, the LDA model requires the desired number of topics as input. Given that we know from Zhao et al. [2011] that topics in microblogs are quite homogeneous, covering mainly topics such as arts, business, family and life, we assign a low number of topics to the model and settle for K = 10. Second, we need to provide values for α and η, the prior weights of the per-document topics and of the per-topic words respectively. Usually α is set to a number less than 1 to achieve sparse distributions of topics per document, and η is set to a number much less than 1 to prefer sparse distributions with only a few words per topic. We note that Griffiths and Steyvers [2004] suggest parameters of α = 50/K and η = 0.1. We settled for α = 1 and η = 0.1 and ran LDA for 1,000 iterations.

The aggregate network statistics are already naturally in the form of time series with 15-minute intervals. The weather information, collected every half hour, was duplicated to create a time series with 15-minute intervals between data points.

In order to study dependencies between variables, we study the linear correlations between traffic volume and the external variables at three anomalous and three normal time points, for one network cell in a localized region of Greater London. Due to the limited amount of data, we choose to pick the anomalies manually. The size of the complete data set for Greater London proved to put extreme requirements on our computer system, and the lack of anomalies in the aggregated data set also reduces the relevance of such computations. Furthermore, studying all network cells and their interdependencies at the same time would result in an enormous model that would not allow for the necessary interpretation. We therefore choose to study one centrally located cell; we assume that it provides more variety in the data than the aggregate network as a whole, and a centrally located cell should still provide us with a reasonable though limited amount of data.


Chapter 7

Data Overview

Let us look at the time series of the traffic volume and the number of active users collected from the 3G network, shown in Figure 7.1 and Figure 7.2 respectively. The data concern one cell of the network for Greater London, located in the proximity of Piccadilly Circus and Green Park. As expected, there is notable periodicity in the seasonal component, but also temporal variation in the random component. No trend can be distinguished for either variable.
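The seasonal, trend and random splits shown in Figures 7.1–7.4 are classical time series decompositions. The thesis does not state which implementation was used; a sketch of how such a split can be produced in Python, assuming a daily cycle at 15-minute resolution (period 96), follows.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    rng = np.random.default_rng(4)
    idx = pd.date_range("2013-11-21", periods=16 * 96, freq="15min")
    daily = 1e6 * (1 + np.sin(2 * np.pi * np.arange(len(idx)) / 96))   # synthetic daily cycle
    traffic = pd.Series(daily + rng.normal(0, 1e5, len(idx)), index=idx)

    result = seasonal_decompose(traffic, period=96)   # additive decomposition
    print(result.seasonal.head())
    print(result.trend.dropna().head())               # trend is undefined at the series edges
    print(result.resid.dropna().head())               # the 'random' component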

The weather data for London turned out to be of poor quality due to infrequent value updates from the weather station, and further contained extreme and missing values. Precipitation seems to have been missed completely in the weather updates, and we therefore remove this variable from the data set. Extreme values were removed; missing values were replaced by the average of the neighboring values. The temperature in degrees Celsius temp_c, the perceived temperature in degrees Celsius feelslike_c, the wind speed in miles per hour wind_mph and the air pressure in inches of mercury pressure_in are shown in Figure 7.3.

Only tweets geo-encoded with coordinates within the bounding box in Table 5.1 were extracted from the API. Nevertheless, when filtering on this location, the streaming API also returned tweets with no exact geo-location coordinates but with a place [Twitter, 2013a] attached that lies within the filtered region. Tweets with places are not necessarily written at that location but could also be written about the location. We therefore remove such tweets manually from the data set.

In order to obtain an understanding of the Twitter stream, let us study the number of geo-tagged tweets in the target network cell. The total number of tweets, as well as the seasonal, trend and random components of the time series, are shown in Figure 7.4. Note that there is a clear seasonal component, no distinguishable trend for the time period, and random noise that shows no evidence of incidents or interesting events.


Figure 7.1. Traffic volume at the target location, with the seasonal, trend and random components.

Figure 7.2. Active users at the target location, with the seasonal, trend and random components.


Figure 7.3. The weather in London depicted by the temperature in degrees Celsius temp_c, the perceived temperature in degrees Celsius feelslike_c, the wind speed in miles per hour wind_mph and the air pressure in inches of mercury pressure_in.

Figure 7.4. The total number of tweets and the seasonal, trend and random components of the number of geo-tagged tweets sent near the target location.


Chapter 8

Results

The results consist of the inferred topics and the linear regression results. We study the results of regressing the traffic volume variable on the number of active users, the number of tweets, the topic proportions and the weather variables.

We first study the results of the regression run on the whole time series of 21 November through 6 December, and then the results of running the regression for three days where we noted an anomalous or different time series profile: 22 November, 25 November and 3 December. These results are compared with those for three normal days: 26 November, 28 November and 2 December.

Table 8.1 presents the resulting topics of the LDA iterations for the Twitter corpus. Figure 8.1 visualizes how the proportions of each topic changed over time. These results demonstrate the sensitivity of the LDA process: although the top words of the topics present certain distinguishable characteristics, the resulting topic dynamics provide neither evidence of changing dynamics nor a difference between topics.

Table 8.2 presents the regression results of running the traffic volume variable on the number of active users, the number of tweets, the topic proportions and the weather variables for the whole time series profile. Table 8.3 presents the regression results of running the random component of traffic volume on the random component of the number of active users, with the rest of the variables unchanged, for the same time period. Table 8.4 presents the result of the regression run on 22 November, where we noted an anomalous profile. Likewise, Table 8.5 and Table 8.6 present the regression results for 25 November and 3 December respectively. These can be compared with the results of the same regression run for the three seemingly normal days of the network, presented in Tables 8.7–8.9.


Table 8.1. Top words for each topic X1-X10.

X1         X2         X3          X4         X5
lol        today      world       london     tonight
don        good       time        morning    night
im         happy      god         train      amazing
fucking    birthday   mandela     hate       coming
gonna      morning    christmas   station    coys
make       back       man         work       good
girls      work       nelson      end        dinner
watching   give       great       bus        tomorrow
lool       day        night       stop       evening
ah         hope       true        ve         weekend
fuck       enjoy      day         early      hours
sleep      st         people      victoria   watch
shit       working    rip         talk       glad
feel       home       things      means      bed
loool      coffee     found       people     follow
stop       album      np          trains     win

X6         X7         X8           X9         X10
amp        christmas  london       big        amp
great      don        britain      ff         lol
year       time       house        ll         life
news       love       greater      ve         great
nice       show       party        thing      good
young      nice       uk           back       ve
event      boy        art          half       time
england    sounds     ll           woman      tweet
today      won        pic          hear       people
health     bit        didn         cold       love
day        live       chelsea      girl       rt
people     yeah       city         music      day
change     song       westminster  fun        miss
channel    hour       mi           eating     lunch
women      made       photo        fuck       xxx
support    thought    museum       isn        mum


[Figure: ten panels, one per topic X1-X10, each plotting topic proportion (0.00–1.00) against time from 22 November to 6 December 2013.]

Figure 8.1. The proportions of each topic X1-X10 visualized over time.
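
A faceted plot such as Figure 8.1 can be drawn as in the following sketch, assuming ggplot2 and a long-format data frame of hourly topic proportions (all names and values here are stand-ins):

    # One panel per topic, proportion against time, as in Figure 8.1.
    library(ggplot2)
    set.seed(1)
    props_long <- expand.grid(
      time  = seq(as.POSIXct("2013-11-21", tz = "UTC"),
                  by = "hour", length.out = 16 * 24),
      topic = factor(paste0("X", 1:10), levels = paste0("X", 1:10)))
    props_long$proportion <- runif(nrow(props_long))  # stand-in proportions

    ggplot(props_long, aes(time, proportion)) +
      geom_line() +
      facet_wrap(~ topic, nrow = 2)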


Table 8.2. Linear regression results for traffic volume: November 21 – December 6.

Explanatory Variable   Estimate     Std. Err.   t statistic  Pr(>|t|)
active_users            6.404e+01   2.396e+00   26.725       <2e-16    ***
no_tweets               3.266e+03   4.061e+02    8.042       1.88e-15  ***
x1_proportion           3.699e+04   6.127e+04    0.604       0.5462
x2_proportion           4.976e+04   6.140e+04    0.810       0.4179
x3_proportion           5.266e+04   6.101e+04    0.863       0.3882
x4_proportion           1.064e+05   6.328e+04    1.681       0.0930    .
x5_proportion           1.023e+05   6.345e+04    1.613       0.1071
x6_proportion           7.858e+04   6.138e+04    1.280       0.2007
x7_proportion           3.648e+04   6.286e+04    0.580       0.5618
x8_proportion           5.011e+04   6.149e+04    0.815       0.4153
x9_proportion           6.873e+04   6.436e+04    1.068       0.2857
x10_proportion          NA          NA           NA          NA
temp_c                  2.134e+04   1.054e+04    2.024       0.0432    *
feelslike_c            -1.720e+04   7.495e+03   -2.294       0.0219    *
wind_mph                1.714e+03   2.101e+03    0.816       0.4147
pressure_in             6.491e+04   2.807e+04    2.312       0.0209    *
(Intercept)            -2.081e+06   8.613e+05   -2.416       0.0158    *

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 8.3. Linear regression results for the random component of traffic volume: November 21 – December 6.

Explanatory Variable   Estimate     Std. Err.   t statistic  Pr(>|t|)
active_users$random     3.291e+01   4.899e+00    6.717       2.72e-11  ***
no_tweets               4.030e+02   3.623e+02    1.112       0.266
x1_proportion           3.318e+04   5.790e+04    0.573       0.567
x2_proportion           6.326e+04   5.800e+04    1.091       0.2769
x3_proportion           6.326e+04   5.800e+04    1.091       0.276
x4_proportion           9.362e+04   5.973e+04    1.567       0.117
x5_proportion           9.006e+04   5.997e+04    1.502       0.133
x6_proportion           8.220e+04   5.803e+04    1.417       0.157
x7_proportion           6.693e+04   5.972e+04    1.121       0.263
x8_proportion           3.007e+04   5.903e+04    0.509       0.611
x9_proportion           6.843e+04   6.062e+04    1.129       0.259
x10_proportion          NA          NA           NA          NA
temp_c                 -2.827e+03   1.019e+04   -0.278       0.781
feelslike_c            -3.768e+02   7.253e+03   -0.052       0.959
wind_mph                2.627e+03   1.921e+03    1.367       0.172
pressure_in             4.545e+04   3.375e+04    1.347       0.178
(Intercept)            -1.448e+06   1.037e+06   -1.396       0.163

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Table 8.4. Linear regression results for the random component of traffic volume on November 22: anomalous day.

Explanatory Variable   Estimate     Std. Err.   t statistic  Pr(>|t|)
active_users$random    -2.631e+02   9.358e+01   -2.812       0.00623   **
no_tweets               3.629e+02   4.354e+03    0.083       0.93379
x1_proportion           3.314e+05   3.731e+05    0.888       0.37710
x2_proportion           1.256e+05   4.657e+05    0.270       0.78814
x3_proportion          -3.760e+03   4.316e+05   -0.009       0.99307
x4_proportion           5.627e+05   3.999e+05    1.407       0.16340
x5_proportion           1.801e+05   4.105e+05    0.439       0.66208
x6_proportion           1.818e+05   3.672e+05    0.495       0.62197
x7_proportion           2.884e+05   4.138e+05    0.697       0.48785
x8_proportion          -1.225e+05   3.477e+05   -0.352       0.72556
x9_proportion           6.904e+05   4.331e+05    1.594       0.11497
x10_proportion          NA          NA           NA          NA
temp_c                  6.178e+03   1.477e+05    0.042       0.96674
feelslike_c            -8.406e+04   1.203e+05   -0.699       0.48693
wind_mph                3.912e+04   3.942e+04    0.992       0.32416
pressure_in             2.015e+06   7.343e+05    2.744       0.00753   **
(Intercept)            -6.086e+07   2.203e+07   -2.763       0.00715   **

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 8.5. Linear regression results for the random component of traffic volume on November 25: anomalous day.

Explanatory Variable   Estimate     Std. Err.   t statistic  Pr(>|t|)
active_users$random    -1.104e+02   7.392e+01   -1.494       0.1393
no_tweets               7.256e+03   4.012e+03    1.809       0.0743    .
x1_proportion          -2.833e+05   3.935e+05   -0.720       0.4737
x2_proportion          -6.566e+05   3.151e+05   -2.084       0.0405    *
x3_proportion          -1.486e+05   3.628e+05   -0.410       0.6832
x4_proportion          -4.555e+05   2.548e+05   -1.788       0.0777    .
x5_proportion           3.578e+05   3.559e+05    1.005       0.3178
x6_proportion          -1.655e+05   3.432e+05   -0.482       0.6309
x7_proportion           2.519e+05   4.378e+05    0.575       0.5667
x8_proportion          -1.812e+05   3.801e+05   -0.477       0.6349
x9_proportion          -2.683e+05   4.654e+05   -0.576       0.5660
x10_proportion         -4.808e+05   3.788e+05   -1.269       0.2082
temp_c                  1.285e+05   1.731e+05    0.742       0.4601
feelslike_c            -1.834e+05   1.424e+05   -1.288       0.2017
wind_mph               -8.261e+04   6.808e+04   -1.214       0.2286
pressure_in            -3.724e+06   1.688e+06   -2.207       0.0303    *
(Intercept)             1.147e+08   5.156e+07    2.224       0.0290    *

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Table 8.6. Linear regression results for the random component of traffic volume on December 3: anomalous day.

Explanatory Variable   Estimate     Std. Err.   t statistic  Pr(>|t|)
active_users$random    -1.020e+02   4.423e+01   -2.305       0.0238    *
no_tweets               3.866e+03   2.552e+03    1.515       0.1337
x1_proportion          -1.548e+05   3.032e+05   -0.511       0.6110
x2_proportion           3.444e+05   2.882e+05    1.195       0.2356
x3_proportion          -9.362e+04   2.521e+05   -0.371       0.7113
x4_proportion          -3.351e+04   2.535e+05   -0.132       0.8952
x5_proportion           1.676e+05   3.068e+05    0.546       0.5865
x6_proportion           1.813e+05   2.672e+05    0.678       0.4995
x7_proportion           1.354e+05   2.528e+05    0.535       0.5939
x8_proportion           1.476e+06   2.817e+05    5.241       1.3e-06   ***
x9_proportion           2.397e+05   2.313e+05    1.036       0.3033
x10_proportion          NA          NA           NA          NA
temp_c                 -1.430e+04   1.415e+05   -0.101       0.9198
feelslike_c             9.461e+04   1.012e+05    0.935       0.3525
wind_mph               -1.776e+04   4.589e+04   -0.387       0.6997
pressure_in            -3.620e+06   1.432e+06   -2.528       0.0135    *
(Intercept)             1.095e+08   4.356e+07    2.514       0.0140    *

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 8.7. Linear regression results for the random component of traffic volume on November 26: normal day.

Explanatory Variable   Estimate     Std. Err.   t statistic  Pr(>|t|)
active_users$random    -5.345e+01   4.902e+01   -1.090       0.2807
no_tweets               5.021e+03   2.084e+03    2.409       0.0197    *
x1_proportion          -7.692e+04   1.554e+05   -0.495       0.6227
x2_proportion           1.029e+05   2.076e+05    0.496       0.6223
x3_proportion          -4.262e+05   2.029e+05   -2.100       0.0408    *
x4_proportion           2.126e+05   1.527e+05    1.392       0.1700
x5_proportion          -7.169e+04   1.607e+05   -0.446       0.6575
x6_proportion           8.472e+03   1.593e+05    0.053       0.9578
x7_proportion          -9.491e+04   1.617e+05   -0.587       0.5599
x8_proportion           3.840e+04   1.705e+05    0.225       0.8227
x9_proportion           2.493e+04   2.076e+05    0.120       0.9049
x10_proportion          NA          NA           NA          NA
temp_c                  9.198e+04   8.965e+04    1.026       0.3098
feelslike_c            -8.512e+04   7.895e+04   -1.078       0.2861
wind_mph               -7.228e+04   4.647e+04   -1.555       0.1262
pressure_in            -4.711e+06   2.048e+06   -2.300       0.0257    *
(Intercept)             1.447e+08   6.287e+07    2.302       0.0255    *

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
