
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Preprocessing unbounded data for use in real time visualization

Building a visualization data cube of unbounded data

ISABELLE HALLMAN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Sammanfattning

This degree project evaluates the viability of a data cube as a basis for visualization of unbounded data. A cube designed for use with visualization of static data was adapted to allow point-by-point insertions. The new cube was evaluated by measuring the time it took to insert different numbers of data points. The results indicated that the cube can handle data streams with a velocity of up to 100 000 points per second. The conclusion is that the cube is useful if the velocity of the data stream is within this bound, and if the coarseness of the represented dimensions is sufficiently high.


Preprocessing unbounded data for use in real time visualization

Building a visualization data cube of unbounded data

Isabelle Hallman - ihal@kth.se
June 27, 2019

Abstract

This thesis evaluates the viability of a data cube as a basis for visualization of unbounded data. A cube designed for use with visualization of static data was adapted to allow for point-by-point insertions. The new cube was evaluated by measuring the time it took to insert different numbers of data points. The results indicate that the cube can keep up with data streams with a velocity of up to approximately 100 000 data points per second. The conclusion is that the cube is useful if the velocity of the data stream is within this bound, and if the granularity of the represented dimensions is sufficiently low.

Keywords: unbounded data; visualization; data cube

1 Introduction

Any service which collects data on user events will receive a large number of incoming data events. While this data can be used after the fact to get an understanding of users' behavior, in some cases it is interesting to view it in real time, enabling new trends or problems to be spotted as they arise. This kind of ever-increasing data is called unbounded data - i.e., data that is, in theory, infinite in size [23][2]. Examples of such data include the continuous collection of user events in apps or on websites, or network traffic messages and logs.

Unbounded data provides some interesting challenges in comparison to its bounded counterpart. A system dealing with unbounded data cannot know beforehand how many data points it is going to see within some interval of time. The data points can also arrive out of order. This difference between the data point's creation - its event time - and its observation in the system - its processing time - is referred to as the event-time skew. However, tackling these challenges and working with unbounded data allows for getting results with low latency, and achieving an even, continuous workload [2]. Since many companies collect data continuously in streams of incoming data, it makes sense to handle such data with tools that were designed specifically for unbounded data.

Regardless of the type of data, its value is based on the insight it provides. For small sets of data, it might be possible for a human to gain such insight directly from the raw data. However, as the size and complexity of the data grow, some tools and methods are required to allow a human to process it. A common way to do this is through visualization, which allows a human to make sense of large amounts of data. When visualizing large data sets, some preprocessing is applied to extract the information of interest and to reduce the memory requirements of the application. Examples of such preprocessing include filtering, aggregation and clustering.

Real-time data observation - i.e., observation which is close in time to the event time - is interesting in several use cases. One such example is service monitoring. In this case, it is desirable to be able to react to any problems in the service as soon as possible.

It is also beneficial to be able to see the effects of any applied remedies as soon as possible. The user should also be able to compare the current state of the system with that from some time ago. This means that any system which enables real-time observation should not just show the current events without any context, but maintain a snapshot of some appropriate length. This approach to maintaining a time window containing the most recent data is commonly used when visualizing unbounded data [22].

Often it is desirable to allow the user of a visualization to make various queries to the data, allowing them to explore it as they wish. That is, it is up to the user which, for example, filtering, aggregation or clustering they are currently interested in. In such a case, keeping the latency low is important for the user experience. Increased latency dulls motivation and has been linked to users changing their behavior [17]. Card et al. identify 100 ms as the limit for perceptual fusion of events such as tracking animations, and one second as the limit for perceived interactivity for events such as clicking links [6][7].

There are different approaches to achieving this low query latency. The query can be performed in parallel, such as on OLAP systems. However, these approaches involve some communication overhead, which often exceeds the required latency [19]. Another approach, then, is to exploit the fact that it is often enough to give approximate visualizations. Thus we may utilize approximate processing to reduce the latency.

A third alternative is to pre-compute some aggregates of the data and thus avoid having to do it on demand. This method is the basic idea behind what is often called a data cube or an index. Such methods include Nanocubes [16], Hashedcubes [20], imMens [18], ConcaveCubes [15] and Falcon [19]. These cubes were developed specifically with visualization in mind, utilizing the knowledge of the dimensionality of visualizations in their design. By doing so, they achieve very low latencies for visualization queries. However, none of these methods are adapted for unbounded data, but rather are calculated as static entities. Should the user be interested in any other constellation of data, the entire cube needs to be rebuilt.

In contrast, Stream Cube [12] was designed specifically for use with unbounded data. It uses a tilted time frame model, which maintains a higher resolution at more recent data points. This is based on the assumption that a fine granularity is only interesting close to event time. However, Stream Cube is not designed with visualization in mind, unlike the cubes mentioned above.

The main drawback of the data cubes for visualization, with the exception of Falcon, is the time it takes to build them. Falcon solves this problem by only building a sparse index based solely on the active view, backed by an efficient database solution [19]. Falcon's solution is thus based on the full data being readily available, and as such it is not useful in the unbounded case.

1.1 Thesis objective and scope

This thesis project investigates methods for visualizing unbounded data. The main area of research is "How to preprocess unbounded data for use in effective real-time delivery monitoring geographical visualisation". Here, "effective" refers solely to time-effectiveness, as the main interest is in keeping the latency between user interaction and the system response as low as possible. A solution to this question in the bounded case has been suggested in the form of the data cubes presented above. As such, this thesis looks into the viability of applying a similar solution in the unbounded case. The question which should be answered is then "Is a data cube a viable data structure for use in real-time visualization of unbounded data?". The basic idea is to combine the design of the data cubes developed specifically for visualization with methods for dealing with unbounded data.

The data in this case specifically concerns content delivery network (CDN) monitoring, as it depends on the different networks and locations of points of presence. The main aim of this thesis is thus to provide a proof of concept of the usage of a data cube in this context.

While most cubes allow for different types of dimensions, the cube used in this thesis was limited to temporal and categorical dimensions. There are two reasons for this choice. First, most of the data cubes mentioned above were designed for use with geospatial heat maps, which is not the visualization that is needed in this use case. Second, the exact point of origin of the data in this use case is less interesting than the logical grouping of the data by categories such as internet service provider and country of origin. The origin can thus be described as categories rather than exact locations.

Because unbounded data can only be viewed once, unless the full data is stored somewhere, the cube cannot be recomputed. As such, the aim is to maintain a data cube and update it as new data comes in, and let the data cube be the sole representation of the data snapshot.

1.2 Prerequisites

The thesis was performed at a company which owns an app that is used worldwide. As such, the solution should fit into the existing back end system used by this company, and work with the data generated by the app to monitor the CDN delivery. The back end system was based on the Google Cloud products [10].

2 Background

2.1 Terminology

This section introduces the terminology used in the thesis. The following terms are all from Tyler Akidau et al.’s book on Streaming systems [2]:

• Unbounded data - Data which is infinite in size


• Bounded data - Data which is finite in size

• Stream - ”An element-by-element view of the evolution of a dataset over time”

• Table - ”A holistic view of a dataset at a specific point in time”

• Event time - The time at which a data point is created

• Processing time - The time at which a data point is processed by the system

• True streaming engine - An engine which processes data points as they arrive

• Micro-batch engine - An engine which processes streams in small batches

2.2 Preprocessing data for visualization

Pixel-aware methods reduce the amount of data sent to the front end by taking into account how many pixels should be rendered. The idea is based on the fact that if there is only a set number of pixels available for rendering a visualization, it is unnecessary to send more data points than this. For example, if several points in a line chart will be mapped onto the same pixel, only one of them will actually be visible.

This idea is used as a basis for the VDDA methods [14] and in the M4 algorithm [13]. These methods reduce the amount of data sent to the visualizer on the query level while preserving the appearance of the visualizations. The idea of pixel-awareness is also used for some data cubes [16][18], and has also been used for solutions which decouple the visualization from the computation of data streams [26].
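To make the pixel-aware idea concrete, the sketch below (a simplified stand-in with illustrative types, not the actual M4 or VDDA algorithms) bins a time-sorted series by pixel column and keeps only the first, last, minimum and maximum point per column, which preserves the rendered shape of a line chart while discarding the rest.

#include <algorithm>
#include <vector>

struct Point { double t, v; };  // time, value

// Reduce 'data' (sorted by t) to at most 4 points per pixel column.
// 'width' is the number of horizontal pixels available for the chart.
std::vector<Point> pixelAwareReduce(const std::vector<Point>& data, int width) {
    if (data.empty() || width <= 0) return {};
    const double t0 = data.front().t, t1 = data.back().t;
    const double span = std::max(t1 - t0, 1e-9);
    std::vector<std::vector<Point>> columns(width);
    for (const Point& p : data) {
        int col = std::min(width - 1, (int)((p.t - t0) / span * width));
        columns[col].push_back(p);
    }
    std::vector<Point> out;
    for (const auto& col : columns) {
        if (col.empty()) continue;
        Point first = col.front(), last = col.back(), lo = col.front(), hi = col.front();
        for (const Point& p : col) {
            if (p.v < lo.v) lo = p;
            if (p.v > hi.v) hi = p;
        }
        // First, last, min and max are the points the rendered line can touch
        // in this column; everything else would land on the same pixels.
        out.push_back(first); out.push_back(lo); out.push_back(hi); out.push_back(last);
    }
    return out;
}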

General approximate methods can also be used whenever accuracy can be sacrificed in favor of speed. Such methods include sketches, sampling and buckets [9]. The approximation can be done by doing approximate queries on the complete data, or by only storing or processing a sample or approximation of the data.

2.3 Data cubes

A data cube is a data structure which represents data along some measure of interest. Pahins et al. describe them as follows: "A data cube can be seen as a hierarchical aggregation of all data dimensions in an n-dimensional lattice" [20]. As mentioned in section 1, there are several data cubes developed specifically for use with visualization.

imMens bases its solution on pixel-aware methods, scaling according to the chosen resolution rather than the number of records. It creates a data cube with a bin size corresponding to the lowest level of resolution in the visualization and combines this with multivariate data tiles and parallel query processing to achieve a low query latency [18]. However, it lacks support for usage with any number of dimensions.

Nanocubes, on the other hand, can handle a large number of dimensions, while using less space than previous cubes [16]. It too is based on the idea that the data representation should group data points that will end up on the same pixel of the screen. It supports querying of different levels of detail in independent dimensions by defining a label which associates a value to a set of objects. These labels can be sequenced to represent different levels of granularity.

ConcaveCubes provides a cube for cluster-based visualization. Instead of indexing based on location or a given category, it clusters the points according to some features and builds its index based on this. While such capabilities are useful, ConcaveCubes has a significantly higher construction time than Nanocubes and Hashedcubes. While it achieves a smaller memory footprint than both of these cubes, the high construction time indicates that the method might not be suitable for real-time updates at high velocity [15].

Hashedcubes is a data structure similar to Nanocubes, but which achieves a lower memory footprint and faster building times [20]. Its comparatively simple data structure is also straightforward to adjust to handling dynamic data. For these reasons, this structure was chosen as the base design for this thesis. Hashedcubes was designed for moderately large datasets, which can be kept in the main memory of one machine. Section 2.4 describes Hashedcubes in detail.

2.4 Hashedcubes

Hashedcubes is a fairly simple data structure which represents its data in buckets, sorted according to some order. As the authors put it: "Hashedcubes uses a partial ordering scheme combined with the notion of pivots to allow fast queries and a simple data structure layout." [20]. Its main advantage over Nanocubes is that it supports compound brushing in "any number of dimensions", and it has lower memory requirements than Nanocubes and imMens. Its partial ordering scheme allows it to store its data in a simple array and allows for on-the-fly aggregation computation. As such, it lends itself well to being expanded to the unbounded case, as pointed out by the authors themselves. Its lower memory requirements compared to some other cubes mean that more data can be stored in the structure, which in the case of this thesis translates into a longer, more detailed snapshot of the data stream.

Figure 1: Hashedcubes data ordering in 3 dimensions, B, C1 and C2. Here, the superscript indicates which bucket in these dimensions the data belongs to.

To understand Hashedcubes, imagine we have an array of n elements containing all our data. If we sort this data with respect to some value, such as timestamp, we observe that we can find all the data points in some time interval between some pair of indices (b, e). This is called a pivot, in this case in the temporal dimension. The pivot thus indicates where we may find a subset of the data in the entire dataset.

Instead of iterating over the entire dataset, it is thus sufficient to iterate only over the pivots to find the data we are looking for. The pivots can be annotated with information about the data they contain, such as the range of data, range of pivots, or key metrics.

Say our data can also be divided into categories, i.e., it also has a categorical dimension. Observe that the internal ordering of a pivot does not affect its boundaries; the elements within a pivot may therefore be reordered so that nested pivots can describe which data points belong to which category. We can continue in this manner in any number of dimensions. Figure 1 illustrates how Hashedcubes orders its data.

The ordering of the dimensions determines what information may be found most quickly. Data points belonging to a category in the first dimension will be stored in a single pivot, and contiguously in the original, sorted data. However, if the information of interest is instead located in the second dimension, it will be spread across several pivots, one within each pivot of the first dimension.
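As a minimal sketch of the pivot idea (illustrative types and names, not the original Hashedcubes code), the following computes the pivots of a single dimension from an array that has already been sorted on that dimension's bucket values:

#include <cstddef>
#include <vector>

struct Pivot {
    std::size_t begin;  // index of the first record in the range
    std::size_t end;    // one past the last record in the range
    int value;          // the bucket or category this range represents
};

// 'keys' holds one dimension's bucket value per record, already sorted so
// that equal values are contiguous. Each run of equal values becomes a pivot.
std::vector<Pivot> buildPivots(const std::vector<int>& keys) {
    std::vector<Pivot> pivots;
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= keys.size(); ++i) {
        if (i == keys.size() || keys[i] != keys[begin]) {
            pivots.push_back({begin, i, keys[begin]});
            begin = i;
        }
    }
    return pivots;
}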

Hashedcubes assumes that the fundamental unit which should be visualized is a count of events per location and time. It can handle three different types of dimensions: categorical, temporal and spatial. The spatial locations are sorted in quadtrees, corresponding to some region of pixels on the screen. As such, its main use is that of heat maps.

Queries over the Hashedcubes will return an approximate result. This is because each data point is assigned to an interval rather than to its actual value. While it theoretically could store a representation of the exact data, the granularity of each dimension will affect the number of pivots in each dimension, and thus the size of the data structure. Since an exact result is not necessarily of interest in a visualization [4], it chooses to trade off accuracy for space. However, if needed, its pivots can be used to supplement the result with information from the original data, provided that the original data is maintained in sorted order, in accordance with the pivots' indices.

2.4.1 Terminology

The notion of pivots was introduced in section 2.4. Further, Hashedcubes implements nodes, which contain a list of pivots. All pivots within each node will have the same parent pivot. In other words, each pivot in a higher dimension will spawn a node in the next dimension. The node, in turn, owns all pivots which fit into the interval indicated by the parent pivot's start and end indices. The node's start and end indices are then the start index of its first pivot and the end index of its last pivot. This will exactly correspond to the start and end indices of its parent pivot.
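This terminology can be summarized with a small set of illustrative types (again a sketch, not the thesis implementation): each pivot of one dimension spawns a node in the next dimension, and that node owns the pivots which subdivide the parent pivot's index range.

#include <cstddef>
#include <vector>

struct Pivot { std::size_t begin, end; int value; };  // as in the sketch above

// A node groups the pivots in dimension d+1 that share one parent pivot in
// dimension d; its range is exactly the parent pivot's [begin, end) range.
struct Node {
    std::size_t parentBegin, parentEnd;
    std::vector<Pivot> pivots;  // contiguous sub-ranges of the parent range
};

// One dimension of the cube is the list of its nodes, in the same order as
// the pivots of the previous dimension.
struct Dimension {
    std::vector<Node> nodes;
};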

2.4.2 Number of pivots

The time and space complexity of the Hashedcubes is dependent on its number of pivots. In the first dimension, this will be bounded by the number of buckets, or possible values. The next dimension will then have as many nodes as there are pivots in the first dimension, each containing up to as many pivots as there are buckets, or possible values. As such, the bound for the total number of pivots in the second dimension is the product of the number of pivots in the first dimension and the number of possible values in this dimension. The same calculation can be applied to all the following dimensions.
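For illustration, with hypothetical sizes of b = 180 buckets in the first dimension, c1 = 6 possible values in the second and c2 = 100 in the third, the bounds are 180 pivots in the first dimension, 180 · 6 = 1 080 in the second, and 180 · 6 · 100 = 108 000 in the third, or roughly 109 000 pivots in total.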

3 Method

3.1 Method of investigation

We note that the main difference between the original Hashedcubes and a cube which can be used for maintaining a snapshot of streaming data is its ability to ingest data which may come in random order and discard old data. This could be done in small batches, or in a true streaming fashion, where each incoming data point is inserted into the cube.


We further note that this is the same as building the structure once, by reading one data point at a time from some data source and inserting it into the data structure. Then, the maximum velocity the structure can handle is the number of data points ingested divided by the time taken to build the entire cube. The insertion time may also be measured directly.

As such, the viability of the data cube for use in a streaming context may be investigated by creating a data structure that can ingest any number of data points and that discards old data once it reaches some maximum size, and then measuring the time taken to build the cube or to insert a data point into an existing cube.
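A minimal sketch of such a measurement is shown below (the cube interface here is hypothetical; the actual implementation extends the Hashedcubes C++ code): a point-by-point build is timed and the maximum stream velocity the structure could keep up with is derived from it.

#include <chrono>
#include <cstdio>
#include <vector>

struct DataPoint { long long eventTime; int category; double latency; };

// Hypothetical minimal interface of the streaming cube.
struct StreamingCube {
    void insert(const DataPoint& p) {
        // ... hash the point into the cube, evicting expired pivots ...
        (void)p;
    }
};

// Feed 'points' into the cube one by one and report points per second.
double measureInsertionRate(StreamingCube& cube, const std::vector<DataPoint>& points) {
    auto start = std::chrono::steady_clock::now();
    for (const DataPoint& p : points) cube.insert(p);
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    double rate = points.size() / elapsed.count();
    std::printf("inserted %zu points in %.3f s (%.0f points/s)\n",
                points.size(), elapsed.count(), rate);
    return rate;
}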

The original Hashedcubes idea was expanded so as to allow for point-by-point insertions into the cube. After the first implementation was done, its building time was evaluated and the results used as a basis for further improvements. This yielded two versions of the cube.

Based on evaluation of these two versions run on a local machine with a simulated data stream, the most efficient one for the use-case was chosen and tested with data generated by users of an app.

Because the difference from the original Hashedcubes lies mainly in how the cube is built and how new points are inserted, the focus of the validation experiments is put on the insertion and building times. The memory requirements are largely ignored, mainly because even the largest cube tested by the original Hashedcubes consisted of 203 million pivots and used up 9.4 GB of memory, which would still fit into the working memory of modern computers. The cubes tested in this thesis do not approach this number of pivots, instead being closer in pivot count to the cubes which take up ca 400 MB in the original Hashedcubes [20].

3.2 Scope of investigation

The original Hashedcubes implementation is designed to be used with static data. The data is read once, and each dimension is created in turn. After the cube has been built, no further updates are made.

To make Hashedcubes work with dynamic data, the following points need to be addressed:

• Efficiently inserting a data point into an arbitrary position in the cube

– Finding the new data point’s pivot

– Maintaining the pivots as their indices update upon insertions

• Handling expiration of old data

• Maintaining the original data set

• Managing queries of the cube as it continually updates

This thesis focuses on the first two points, ignoring the maintenance of the original data as well as the management of queries. The original data may be discarded because the use case does not require an exact data representation. The management of queries is important if the writing to the cube needs to be done in parallel with queries to the cube, which in turn would be the case if it was important to see each data point as soon as it enters the cube. However, single data points are of little interest to delivery monitoring, which focuses on averages and percentiles to see patterns in the data. As such, a micro-batch approach may be used, which in turn solves the problem of querying by creating space in between each update, in which the cube may be queried.

3.3 Hashedcubes for streaming

The original Hashedcubes implementation was adapted so as to allow point-by-point insertions rather than building each dimension one at a time based on the entire dataset. For the full data cube, this would mean maintaining ordered pivots with correct indices into a sorted array of data points. In the use case for this thesis, though, the exact data was not needed, and thus only the pivots need to be maintained. Section 5 discusses how the structure may be expanded so as to maintain the full data.

The first dimension was chosen to be time, as this was the dimension that needed to be switched out at every time step. Each time a data point comes in, it is hashed into the cube. If the point is too old, it is ignored. The cube holds a time interval which represents the events from some set window of time in the past up to the current time. The data cube can thus be seen as a sliding-window snapshot of the data stream.
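A sketch of the admission check at the entry point of the cube (hypothetical helper; timestamps assumed to be in seconds) drops points older than the configured window before any hashing work is done.

#include <cstdint>

// Sliding-window admission check. 'windowSeconds' is the configured snapshot
// length, e.g. 900 or 1800 seconds in the experiments in this thesis.
bool insideWindow(std::int64_t eventTime, std::int64_t currentTime,
                  std::int64_t windowSeconds) {
    return eventTime > currentTime - windowSeconds && eventTime <= currentTime;
}

A point passing this check is hashed into the cube; if the cube is full, the oldest temporal pivot and its child nodes are discarded afterwards, as described below.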

Hashedcubes' main use case is geospatial heat maps. A geospatial heat map shows the density of some type of data at some location. As such, geospatial heat maps are not particularly interesting in the case of delivery monitoring, as the information of interest there is some measured value, rather than the density of events. This, in turn, means that the spatial dimension as used in Hashedcubes is not very well suited for the use case. Instead, the categorical dimension may be used to represent the spatial dimension if the expected number of locations is small enough. This is the case for the use case in this thesis, and as such, the spatial dimension was ignored.


We may observe that the temporal pivot structure does in fact store a value. As such, a similar pivot structure may be used to store other types of values. For this thesis, the measure of interest is latency - the time taken to receive a response. As such, a dimension of the temporal type may be used to store the latency values.

Upon insertion of new data points, there are two main actions which need to be taken: first, the correct node and pivot for the new point need to be found, and second, the indices of all affected pivots need to be updated. The correct node in each dimension is found in O(log N) time, while the new pivot may be found in O(log B) time, where N is the number of nodes in the dimension - corresponding exactly to the number of pivots in the parent dimension - and B is the number of buckets in the dimension, i.e., the number of pivots within the node. This needs to be repeated for each dimension.
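The two lookups per dimension can be sketched as follows (illustrative types consistent with the earlier sketches; the real implementation differs in detail): a binary search over the nodes of the dimension locates the node whose parent range contains the insertion position, and a second binary search inside that node locates the pivot for the point's bucket value.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Pivot { std::size_t begin, end; int value; };
struct Node  { std::size_t parentBegin, parentEnd; std::vector<Pivot> pivots; };

// O(log N): find the node whose [parentBegin, parentEnd) range contains 'pos'.
// Nodes are stored in order of their disjoint, contiguous ranges; the result
// equals nodes.size() if 'pos' lies past the last node.
std::size_t findNode(const std::vector<Node>& nodes, std::size_t pos) {
    auto it = std::upper_bound(nodes.begin(), nodes.end(), pos,
        [](std::size_t p, const Node& n) { return p < n.parentEnd; });
    return static_cast<std::size_t>(it - nodes.begin());
}

// O(log B): find the pivot for 'value' inside one node, or where it should go.
std::size_t findPivot(const Node& node, int value) {
    auto it = std::lower_bound(node.pivots.begin(), node.pivots.end(), value,
        [](const Pivot& piv, int v) { return piv.value < v; });
    return static_cast<std::size_t>(it - node.pivots.begin());
}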

For the temporal dimension, data points may expire if the cube has discarded the pivots corresponding to that time stamp. Such data points are filtered out and do not enter the cube. If the cube is full, the oldest pivot and its child nodes are removed after the new pivot has finished its insertion.

3.3.1 Version I

When a data point comes in, we first check if it fits into any existing pivot. If it does, that pivot's end index is incremented by one, and all the following pivots' indices within the node are updated in the same manner. If it does not fit into any existing pivot and the cube is not full, we add a new pivot of length 1 and insert it into its sorted position, moving and updating all pivots which come after.

Each time an update is made to a dimension, the update is marked as a new pivot, an update of an existing pivot or an exchange of an old pivot for a new pivot. This marker is then passed down to the next dimension, indicating which behavior the child dimension should take. If the parent pivot is new, the next dimension may directly create a new node and insert it into its appropriate place. If the parent pivot was updated, the node already exists and can be updated directly.

To avoid having to do updates to all pivots within the cube for each update, each node stores the start index of its parent pivot. The pivots within that node may then store their indices relative to their node. Upon each insertion into a node, the following nodes' start indices then need to be updated.
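A sketch of this Version I bookkeeping (illustrative code, not the thesis source): when a point lands in an existing pivot, that pivot grows by one and the pivots after it within the node are shifted, while the nodes that follow in the same dimension only shift the parent start index they store.

#include <cstddef>
#include <vector>

struct Pivot { std::size_t begin, end; int value; };  // indices relative to the node
struct Node  {
    std::size_t parentBegin;       // absolute start index of the parent pivot
    std::vector<Pivot> pivots;
};

// Grow pivot 'idx' of 'node' by one element and shift the pivots after it.
void growPivot(Node& node, std::size_t idx) {
    node.pivots[idx].end += 1;
    for (std::size_t i = idx + 1; i < node.pivots.size(); ++i) {
        node.pivots[i].begin += 1;
        node.pivots[i].end += 1;
    }
}

// Shift the stored parent start of every node that follows 'nodeIdx' in the
// dimension; this linear pass is the cost that motivates Version II.
void shiftFollowingNodes(std::vector<Node>& dimension, std::size_t nodeIdx) {
    for (std::size_t i = nodeIdx + 1; i < dimension.size(); ++i)
        dimension[i].parentBegin += 1;
}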

Figure 2: Pivots that are updated upon insertion of a data point with values {B¹, C1¹, C2¹} into Version I. Orange indicates the pivots which update their indices to make space for the new item. Red indicates pivots or nodes which update their indices as a side effect.

Figure 3: Pivots that are updated upon insertion of a data point with values {B¹, C1¹, C2¹} into Version II. Orange indicates the pivots which update their indices to make space for the new item. Red indicates pivots which update their indices as a side effect.

3.3.2 Version II

Early results indicated that the building time - and thus the insertion time - for Version I was not sufficiently fast to keep up with the velocity of the stream.

Each time a data point is inserted into the Version I structure, all following pivots' or nodes' indices need to be updated. Consider the case where the cube consists of three dimensions: temporal - categorical - categorical. If the temporal dimension consists of b buckets, the first categorical dimension consists of c1 categories and the second categorical dimension consists of c2 categories, the total number of pivots will be O(b + bc1 + bc1c2). While the maintenance of the start index in each node avoids having to update all of these, the last dimension will still have as many nodes as there are pivots in the preceding dimension, O(bc1) in this case. These pivots' and nodes' indices need to be maintained, meaning that the bound for the number of elements which need to be updated on each insert in Version I is O(c2 + bc1) in this example. Figure 2 illustrates this.

Observing that a lower dimension's nodes will share their start and end indices with their parent pivot, the pivots can instead contain a pointer to their parent pivot, maintaining only their index relative to their parent pivot's start index. This is similar to the idea in Version I, but avoids having to do updates to any nodes. This reduces the number of element updates needed for each insertion to O(b + c1 + c2). Figure 3 illustrates this.

The queries into the cube will then need to find the indices of each pivot recursively. However, the number of recursions is exactly the number of dimensions, which will be very small in comparison to the number of pivots. The extra pointer does however affect the memory requirements of the cube, with an additional O(b + bc1 + bc1c2) space.

To maintain a valid pointer to the parent pivot, the structure maintaining the pivot lists for each dimension was changed into a doubly-linked list. This changes the insertion time complexity for new pivots to O(1) rather than O(n). However, searches across a linked list have a time complexity of O(n) rather than O(log n).
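The resulting layout can be sketched as follows (illustrative types): a Version II pivot stores a count and an offset relative to its parent pivot plus a pointer to that parent, and absolute indices are resolved on demand by walking the parent chain, one step per dimension.

#include <cstddef>

// Version II pivot: offsets are relative to the parent pivot's absolute start.
struct PivotV2 {
    std::size_t beginOffset = 0;      // offset of this range within the parent range
    std::size_t count = 0;            // number of data points in this range
    const PivotV2* parent = nullptr;  // nullptr in the first dimension
};

// Resolve the absolute start index of a pivot by following parent pointers.
// The recursion depth equals the number of dimensions, which is small.
std::size_t absoluteBegin(const PivotV2& p) {
    return p.parent ? absoluteBegin(*p.parent) + p.beginOffset : p.beginOffset;
}

Growing a pivot then only shifts the pivots that follow it under the same parent, which is what gives the O(b + c1 + c2) update bound above.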

3.4 Time and space complexity

Table 1 describes the time complexity of the different versions of the cube for a cube with one temporal and two categorical dimensions, where n is the number of data points, c1 and c2 are the numbers of categories in the two categorical dimensions, and b is the number of buckets in the temporal dimension, assuming c2 > c1 and c2 > b.

The original Hashedcubes is mainly affected by its sorting of the data sets, which is done for all the data points for each node. Version I is mainly bounded by its updates to pivot and node indices, which grow with the size of the cube. Version II is mainly bounded by the searches across nodes and across the pivots within those nodes, as its pivot updates are bounded by the number of buckets or categories in each dimension.

3.5 System design

Hashedcubes may be run locally in a simple server-client implementation when building the cube from local data. Some changes were made to this system for the thesis project, which should also be able to read its data from a stream.

The data is ingested from a Google Pub/Sub topic into a Dataflow job [1], which performs time-windowed aggregations and filtering of invalid data. The job then passes the aggregated data into a Pub/Sub topic [11]. To make sure the messages don't become too large, they are grouped by category, corresponding to the second dimension in the cube. Each of these groups is sent as a different message, which is ingested by a Java back end service. This service is responsible for the communication between the Dataflow job, the Hashedcubes service, and the front end: it reads events from Pub/Sub and forwards the data in these messages to the Hashedcubes, and it receives HTTP requests from the front end, forwards them to the Hashedcubes, and returns the result. The cube itself runs in its own service.

The back end service and cube may be run locally on the same machine that is running the front end, or be deployed onto some back end.

In the use case for this visualization, some latency, in the order of minutes, is tolerable between when an item enters the pipeline and when it is visible in the front end. As such, the cube does not need to operate in a true streaming fashion, but can rather ingest data in micro-batches. In between each batch, the cube is available to calculate the results of queries. This is achieved by the Dataflow job throttling the updates and passing them in small batches.
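The micro-batch pattern can be sketched as follows (generic code; the actual Pub/Sub and Dataflow integration is not shown): batches and queries are applied from a single loop, so every query observes a consistent cube state in the gap between two updates.

#include <functional>
#include <queue>
#include <vector>

struct Event { long long eventTime; int category; double latency; };

// Hypothetical cube interface.
struct Cube {
    void insert(const Event& e) { (void)e; /* hash the event into the cube */ }
};

// Apply one pending batch (if any), then answer all pending queries.
// Because updates and queries never interleave, no locking inside the cube
// is needed; this mirrors the throttled micro-batch approach described above.
void serviceLoopStep(Cube& cube,
                     std::queue<std::vector<Event>>& pendingBatches,
                     std::queue<std::function<void(const Cube&)>>& pendingQueries) {
    if (!pendingBatches.empty()) {
        for (const Event& e : pendingBatches.front()) cube.insert(e);
        pendingBatches.pop();
    }
    while (!pendingQueries.empty()) {
        pendingQueries.front()(cube);  // run the query against the stable cube
        pendingQueries.pop();
    }
}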

3.6 Implementation

The new versions of the cube were implemented in C++, building upon the code used for the original Hashedcubes. This was done so as to enable a fair comparison of building times between the original Hashedcubes and the new versions.

The Dataflow job was implemented in Scio [24].

The back end service was implemented in Java to fit into an existing ecosystem of services.

An example application was developed in Data Driven Documents (D3.js) [5] to allow a user to interact with the data. The user could filter which data they wanted to see, zoom, and pan.

3.7 Test Data

The Brightkite dataset [8] - a data set of users' check-in locations to a social network - was used to build the new versions of the cube. This dataset was also used as test data for both Nanocubes and the original Hashedcubes. It has a temporal dimension describing the time of a check-in, ranging between April 2008 and October 2010, a spatial dimension describing the location of the check-in, a categorical dimension describing the day of the week, and a categorical dimension describing the hour of day. The bucket size in the temporal dimension was 604 800 seconds, as this bucket size was used in the Hashedcubes article [20].

The cube was also tested on use-case specific data, namely based on user events in an app. Each data point contains the approximate location of its origin, its event time stamp, a category, and a measure of latency for getting a response from the app. This data was used to create two types of static data samples: one where the location was described by the country, and one where the location was described as the latitude and longitude of the closest city.

Table 2 summarizes information about the sample data sets used and the different schemes used to build them. The schemes are described as letters and numbers, in order from first to last dimension. Here, T refers to a temporal dimension, and C to a categorical dimension. The number next to the T refers to the bucket size in that dimension, in seconds.

Cube          Inserting n data points                   Single data point insertion
Hashedcubes   O(n log n)                                -
Version I     O(n(log(bc1) + log(c2) + bc1 + c2))       O(log(bc1) + log(c2) + bc1 + c2)
Version II    O(n(log(bc1) + log(c2) + c2))             O(log(bc1) + log(c2) + c2)

Table 1: Time complexity of a T-C1-C2 cube for n data points.

For the tests with the real data stream, the locations were described by country. To reduce the velocity of the stream, a filter was applied which filtered out certain events. This resulted in a velocity of approximately 10 000 events per second.

3.8 Estimating data loss

Since the data structure is approximate, some information is lost in the visualization. Specifically, the temporal dimensions are approximate as they represent the data as intervals. In this thesis, the bucket size was chosen as 5 and 10 seconds for the dimension representing the event time, and 0.5 seconds for the dimension representing latency. The max error will then be ±2.5 seconds when the bucket size is 5 seconds, ±5 seconds when the bucket size is 10 seconds, and ±0.25 seconds for the latency.

Secondly, some data might reach the streaming engine too long after its event time, in which case it is filtered out.

3.9 Evaluation

The cube was tested with a simulated data stream based on static data, as well as with data batches from a real data stream. The machine used was a MacBook Pro 13" (2015), 3.1 GHz Intel Core i7, 16 GB DDR3.

First, the time it took to build a cube fully representing a set of static data was measured. This directly compares to the building times of the original Hashedcubes, indicating how much effectiveness was lost in exchange for handling dynamic data.

To measure the insertion times, the cube was first built with different numbers of data points, of different granularity. Then, a smaller sample of the same data set on which the cube had been built was inserted into the cube, to test the insertion times when the number of pivots in the cube was constant. Next, data points with newer time stamps were inserted, expanding the window of time the cube represents. This was to test the insertion times for data which the cube has not seen before, given that the cube is not full.

The insertion of new data points resulted in a larger number of pivots in the cube. For cubes LLT5, LLT10, CT5/CT5L and CT10/CT10L, the new data added an additional 108 705, 104 595, 45 231 and 32 953 pivots respectively.

To understand how the pivot count affects the insertion times for the two versions, the first 1 920 000 data points from the CT5L data set were inserted into the empty cube in batches of 10 000 points. The time taken per batch was measured for the two versions of the cube.

To test the cube in a real-life scenario, the cube was initialized without any data and allowed to fill up with data from a stream. Two window sizes were tested: 30 and 15 minutes, and the cube was run for 30 minutes after it had filled up. During this time, the time it took to insert new data batches was measured.

As the real time between a query and its visualization includes the network latency, the latency for queries was calculated only as the time from when the service running the cube received the query to when it sent its reply. Similarly, the insertion times were calculated from the time that the incoming data had been decoded until it had finished hashing into the cube.

4 Results

4.1 Building time

Table 3 shows the building times for the new versions as compared to the original Hashedcubes implementation. It is clear that both Version I and Version II are significantly slower than the original Hashedcubes. Similarly, Version I is clearly slower than Version II.

The relative difference in building times between the different location representations is greater for Version II than for Version I. The ratio between the average building times x and y was calculated, with σx and σy denoting the standard errors of the means and the 95% error range given by √((2·σx/x)² + (2·σy/y)²) [25].


Dataset      Dimensions     Number of points   Size per dimension              Number of pivots
Brightkite   T-C-C          4 491 035          - / 7 / 24                      23 126
Brightkite   S-C-C-T        4 491 035          - / 7 / 24 / -                  7 279 004
LLT5         T5-C-T0.5-C    1 000 000          4 682 s / 6 / 310 ms / 56 934   1 003 911
LLT10        T10-C-T0.5-C   1 000 000          4 682 s / 6 / 310 ms / 56 934   956 621
CT5          T5-C-T0.5-C    1 000 000          1 111 s / 6 / 1 200 ms / 181    163 719
CT10         T10-C-T0.5-C   1 000 000          1 111 s / 6 / 1 200 ms / 181    113 408
CT5L         T5-C-T0.5-C    11 539 096         2 123 s / 6 / 1 394 ms / 205    711 914
CT10L        T10-C-T0.5-C   11 539 096         2 123 s / 6 / 1 394 ms / 205    479 084

Table 2: Datasets

Dataset      Dimensions     Hashedcubes            Version I              Version II
                            Mean (s)   STDEV (s)   Mean (s)   STDEV (s)   Mean (s)   STDEV (s)
Brightkite   T-C-C          5.64       0.911       60.09      1.65        26.16      3.44
Brightkite   S-C-C-T        33.29      0.180       -          -           -          -
LLT5         T5-C-T0.5-C    -          -           212.30     2.40        37.24      5.30
LLT10        T10-C-T0.5-C   -          -           129.70     3.27        37.06      0.62
CT5          T5-C-T0.5-C    -          -           167.70     20.31       20.45      3.25
CT10         T10-C-T0.5-C   -          -           99.91      2.80        11.74      0.092
CT5L         T5-C-T0.5-C    -          -           -          -           266.33     3.42
CT10L        T10-C-T0.5-C   -          -           -          -           209.55     3.44

Table 3: Building times for Hashedcubes, Version I and Version II.

LLT5 takes 82 ± 13% longer to build than CT5 using Version II, but the increase is 27 ± 8% for Version I. Similarly, LLT10 takes 30 ± 2% longer than CT10 for Version I, but 216 ± 1% longer for Version II.

Aside from the lower asymptotic bound of the original Hashedcubes implementation as compared to the cube presented in this thesis, the building times are also dependent on cache performance. The original Hashedcubes does sequential iterations over lists, building the cube in sequential order. In contrast, the modified cube inserts points in a column-like manner, which is not cache optimal. While this effect is obvious when both cubes are built from static data or lists of data, a data stream which is inserted in a true streaming fashion would not have the same cache effects.

4.2 Insertion time

Tables 4, 5 and 6 show the insertion times for a fully built cube. Tables 4 and 6 show the results of 10 insertions of the same set of 10 000 data points and 100 000 data points respectively, all of which were already represented in the cube. Table 5 shows the results of 10 insertions of different sets of 10 000 data points, none of which had been inserted into the cube before. Due to the long time Version I took to insert data into its cubes, it was not tested with the CT5L or CT10L data sets.

The results show that there is a significant difference in insertion times depending on whether the data is already represented in the cube or not. A one-tailed T-test of the ten data points used for calculating the averages in tables 4 and 5 showed a significant difference at the 0.01 level between the insertion times for the new points and the old points, for all of the cases except for the CT5 cube built with Version II. In most cases, it was the new data points which were inserted faster than the old ones. However, the reverse was true in the case of LLT5 and LLT10 built with Version I.

Table 7 shows the average insertion times for batches of data for 30 minutes of insertions into a full cube. The approximate velocity v, in points per second, of these insertions is calculated as v = n/t, where n is the average number of points in each insertion and t is the average insertion time in seconds. With the standard error of the mean σt = σ/√N, where σ is the standard deviation and N is the number of measurements, the error of this estimation with 95% certainty is √((2·σt/t)² + (2·σn/n)²) [25]. Given the large number of measurements, this error is sufficiently small to ignore.


Dataset   Version I                    Version II
          Average (ms)   STDEV (ms)    Average (ms)   STDEV (ms)
LLT5      6 414          519           649            37
LLT10     3 496          381           779            19
CT5       2 940          204           179            10
CT10      1 778          158           153            10
CT5L      -              -             357            38
CT10L     -              -             265            28

Table 4: Insertion times into Version I and Version II, average of 10 insertions of the same 10 000 data points.

Dataset   Version I                    Version II
          Average (ms)   STDEV (ms)    Average (ms)   STDEV (ms)
LLT5      17 356         1 623         664            189
LLT10     7 415          544           394            114
CT5       530            245           165            19
CT10      384            169           114            15
CT5L      -              -             206            38
CT10L     -              -             134            26

Table 5: Insertion times into Version I and Version II, average of 10 insertions of new 10 000 data points.

Using these calculations, the resulting velocities for the schemas used in table 7 range from approximately 65 000 points per second to 100 000 points per second. The slowest schema was the one using 1800-second windows with a bucket size of 5 seconds in the first dimension, and the fastest was the one with 900-second windows and a bucket size of 10 seconds in the first dimension.
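As a worked example using the averages in table 7: the 900-second window with 10-second buckets inserted on average 10 629 points in 105 ms, giving v = 10 629 / 0.105 s ≈ 101 000 points per second, while the 1800-second window with 5-second buckets gives v = 11 025 / 0.168 s ≈ 65 600 points per second.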

Figure 4 shows the insertion times for Version I and Version II as they build the cube from the first 1 920 000 data points from the CT5L data set. They both insert data in batches of 10 000 points, in the same order. The time taken to insert each batch was measured. Here, the effect of the linear time complexity of the node updates in Version I is obvious.

The fact that the insertions into the LLT5 and LLT10 cubes built with Version I were slower for new data than for old data might be due to the increase in the number of pivots. As figure 4 shows, Version I is more sensitive to increases in the number of pivots than Version II is.

4.3 Queries

The same queries used to test Hashedcubes [20] were used to test Versions I and II. These queries are based on real actions by users when using Nanocubes [16], and thus represent a sample of common interactions with the data. This was to evaluate how the querying time was affected by the change from an array to a linked list for storing the pivots in the categorical and temporal dimensions. The results can be seen in table 8.

The querying times are, as expected, slower for Version II than Version I. Version I makes few changes to the end result of the cube as compared to the original Hashedcubes. Version II's main difference is that the pivots are stored as doubly-linked lists rather than arrays. The authors of Hashedcubes point out the array-based storage as one of its strengths: "Unlike tree-based data structures, scans happen along arrays. Such approach tend to offer appealing performance, since the CPU cache automatically optimizes burst memory operations" [20]. Version II moves away from this optimization. However, the results indicate that the additional querying times are negligible.

4.4 Visualization

Figure 5 shows an example of how the data may be visualized when the locations are described by their latitude and longitude. In a visualization, the data points will normally be scaled according to the domain of the data, such that there is a direct, invertible mapping between the metric and its representation. Because the data here is unbounded, the exact domain of the data is unknown. Changing the domain during runtime according to the domain of the data which is visualized at that moment, or available within the cube, would result in a confusing visualization.


Figure 4: Insertion times for batches of 10 000 data points into a growing cube.

Figure 5: The visualization with some test data with high granularity in the locations.


Dataset   Version I                    Version II
          Average (ms)   STDEV (ms)    Average (ms)   STDEV (ms)
LLT5      57 636         1 926         6 453          507
LLT10     30 666         1 423         9 201          176
CT5       27 545         1 219         1 712          176
CT10      15 440         787           1 133          128
CT5L      -              -             1 867          177
CT10L     -              -             1 344          139

Table 6: Insertion times into Version I and Version II, average of 10 insertions of the same 100 000 data points.

Dimensions      Window size          #Batches   #Points in batch      Insertion time (ms)   Pivots
                Seconds   Buckets               Average    STDEV      Average    STDEV      Average    STDEV
T5-C-T0.5-C     900       180        1762       10 445     2 657      118        42         368 727    1 042
T10-C-T0.5-C    900       90         4153       10 629     2 701      105        32         208 146    27 854
T5-C-T0.5-C     1800      360        2864       11 025     3 044      168        96         601 863    43 753
T10-C-T0.5-C    1800      180        2166       10 894     2 802      126        44         465 656    1 532

Table 7: Insertion times for Version II used with a data stream.

To avoid this, a step-like function such as a sigmoid can be used when mapping the values from the cube into visual representations. The sigmoid function returns an approximately flat result for any value above a certain threshold.
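A sketch of such a mapping (an illustrative choice of scaling, written in C++ like the cube code rather than as part of the D3 front end): raw values are squashed into the range (0, 1) so that everything far above a chosen threshold maps to nearly the same visual value, and the true domain of the unbounded data never needs to be known.

#include <cmath>

// Map a raw metric, e.g. a latency in milliseconds, to (0, 1) for visual
// encoding. 'threshold' (> 0) is the value around which the scale saturates;
// the steepness factor 0.25 is an arbitrary, illustrative choice.
double visualScale(double value, double threshold) {
    return 1.0 / (1.0 + std::exp(-(value - threshold) / (0.25 * threshold)));
}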

5 Discussion

Overall, the cubes presented in this thesis seem to be viable data structures for solving the original problem - that is, for use in real-time, geographical delivery-monitoring visualization of unbounded data. However, there are two caveats to this claim. One, the "geographical" part of this question was simplified to be represented as countries rather than exact locations. While this is a reasonable simplification in this specific use case, the same might not be true for other use cases. On the contrary, the results indicate that a use case which requires a high granularity in the spatial dimension would not be helped by this data structure, unless the velocity of the incoming data is sufficiently low.

This brings the discussion to point two: the velocity at which the cube can insert new data points is approximately 100 000 data points per second in the best-case scenario tested by this thesis. This was when the window was 15 minutes long, and the data was sorted in 10-second buckets. While this was enough when certain types of events were filtered out by the Dataflow job, the full stream would have overwhelmed the cube.

A positive result was the fact that insertions of new data were mostly faster than the re-insertion of already inserted data. This makes the cubes better fitted to a real data stream than to building them from static data, since a data stream would continually introduce previously unseen data.

A key takeaway is that the cube approach is a viable solution if the number of buckets in each dimension is sufficiently small. However, in the case of heat maps or when exact locations are necessary, the number of possible pivots will balloon, and with it the insertion times and space requirements. The building times of the original cube also indicate this - when the spatial dimension is added, the time taken to build the cube from the same set of data grows sixfold. The authors point out the leaf size of the quadtree representing the spatial dimension as a crucial factor for the memory usage and performance of the cube [20].

This thesis has sidestepped this problem by representing the spatial dimension as a small number of categories, as this is enough information for users to understand how the CDNs are performing. The number of buckets could be limited to the number of countries in which the app was used. However, the results indicate that the method would not work if a higher granularity of the spatial dimension was needed.


Dataset      Dimensions   #Queries   Version I          Version II
                                     Mean     STD       Mean     STD
Brightkite   S-C-C-T      1430       95       117       114      146

Table 8: Querying times (ms) for the cube with vector-based (Version I) and list-based (Version II) implementations.

5.1 Improving insertion speeds

The insertion time for the cube is bounded by the number of possible pivots, which in turn is bounded by the number of buckets in each dimension. This number affects two things: one, the number of pivots which need to have their indices updated, and two, the number of nodes which will need to be searched to find the node which should be updated. Version II was designed to solve the first of these problems, but does not deal with problem number two. Problem number two becomes central when the granularity of the earlier dimensions is high, affecting the number of nodes in later dimensions.

A solution to this was considered, in which each pivot would contain a pointer to its child node so as to avoid searching across all nodes in the next dimension. However, since the base design idea for Hashedcubes is to store the pivots and nodes in order, new nodes would still need to be added in order. While this could be solved by doing a normal search and inserting a new node if the parent pivot is new, there is no guarantee that all the following dimensions will already have existing nodes for that particular value.

Then, since the dimension structure owns the list of nodes, the parent pivot must contain a dependency to the next dimension to allow for insertion into that list.

Another solution would be to have a tight coupling between the different dimensions, where the parent pivots point to their child nodes in a position-aware manner. Since both of these options move away from the cube, creating a graph instead, this approach was abandoned.

5.2 Method evaluation

Aside from the difference in location representation between data sets LLT5/LLT10 and CT5/CT10, the ranges of values in the temporal dimensions are different. Thus, the difference in the number of pivots in the resulting cubes is not only due to the extra number of possible locations, but also due to the extra pivots in the temporal dimension. While this could have been avoided by having the only difference between the data sets be that of the representation of the locations, the data sets still illustrate the main point: the effects of increased granularity on the building and insertion times of the cube.

The implementation for running the system on the back end is, in this thesis, more complicated than it needs to be. This is mainly due to time constraints. The implementation was designed mainly to utilize as many standard solutions as possible for HTTP communication, Pub/Sub ingestion, and deployment that were already used in the back end. Had this not been the case, the C++ code running the cube could also include code to manage the ingestion of data from Pub/Sub and HTTP requests directly from the front end.

Grouping the data into Pub/Sub messages according to which category they will hash into in the second dimension might affect the time it takes for the cube to hash the message. Since all of them will hash into the same pivot in the second dimension, provided that they fall into the same pivot in the first dimension, this will affect the number of pivots introduced by each new batch of data. However, there are only six possible values in the second dimension. Because this number is so small, the effects of the pre-grouping are likely negligible.

Running the cube on a back end cloud system introduces some additional overhead for network communication when users interact with the visualization. However, such a decoupling might still be the best option for large data streams. For one, decoding such streams normally takes several machines working in parallel. Secondly, the time taken to build a snapshot of some length of time from a cold start will be at least as long as the length of the snapshot that should be represented. As such, if a user is interested in seeing a 30-minute snapshot of the data, they need to wait 30 minutes to be able to see it if everything is run locally. This problem can be avoided by maintaining the cube in the back end.

Section 3.1 leaves out some important points which are relevant in any real-life scenario. Most importantly, the implementation used here does not run on a true streaming engine. Doing so would require adjusting the cube so that it may be run on multiple machines in parallel, and allowing for concurrent queries on the cube.

A solution to this could be to maintain separate cubes of separate data on different machines, utilizing a dispatcher to aggregate results from queries to them. This idea is based on the observation that the structure within a node could be described as its own cube. If maintenance of the original data is of interest here, the corresponding data could be kept locally on the machine which owns the cube related to that specific data.

Another issue regarding the updating times is when a time bucket expires. At this time, its pivot and child nodes are removed, causing a reallocation of the node arrays at linear complexity. A solution to this was considered, in which the oldest pivot would be switched out for a new pivot in place. While this would work in the case of two dimensions, as the dimensions increase, the number of nodes which descend from the original pivot increases. For example, if the number of child pivots in the second dimension is two, there will be two nodes in dimension three related to the original pivot, only one of which may be used by the new incoming data point. Removing these excess nodes would result in the reallocation which we wanted to avoid. The nodes may instead be kept, but remain empty. However, this adds some complexity to searches or queries into the structure, since it is not clear how these inactive nodes' indices should be maintained.

However, given the bucket size used in this project, pivots should only expire once every 5 seconds. Even though a linear reallocation of all nodes might be less than optimal, it is rare enough to safely ignore.

One can note that by not maintaining the full data as a complement to the cube itself, the main point of maintaining indices - to be able to supplement the information gathered from the cube with that of the full data - is lost. As such, it is not strictly necessary to describe the pivots by their indices, instead allowing them to maintain only their index relative to the other pivots while representing the data as a number. This would reduce the data dependency between pivots, allowing for parallelization.

This thesis did not go in that direction for three reasons. One, to keep the possibility of maintaining the full data open. Two, because Version II largely solves the problem of maintaining the indices, provided the number of buckets in each dimension is sufficiently low. Three, to avoid refactoring the querying structure, mainly due to the time constraints of the project.

The results in this thesis are based on a cube built for handling data in a true streaming fashion, inserting points one by one. However, because of technical constraints of the methods, it was never tested on this specific use case. Instead, micro-batches of data were used.

If all the data points in each batch share their time stamp, some opportunities for a more effective solution open up. Knowing that all the data points will fit into their own pivot which will be appended to the end of the cube, they can be pre-hashed as their own small cube and simply appended to the existing cube. In this case, the original Hashedcubes implementation could be used for the hashing itself.

The data could be forced into sharing a time stamp by using the processing time for hashing in the first dimension, rather than the event time. This would of course result in inaccuracies in the data. However, assuming that the event time is not vital to describing the data, this would be a viable solution.

Given the performance of the cubes presented here, this would likely be a more efficient solution.
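A minimal sketch of this micro-batch variant is given below, assuming every point in a batch is stamped with the batch's processing time. buildMiniCube stands in for the original Hashedcubes construction over one batch, and the country bucketing is a placeholder rather than the mapping used in the thesis.

  object MicroBatchAppend {
    final case class Point(lat: Double, lon: Double)
    final case class MiniCube(processingTime: Long, countryCounts: Map[String, Long])

    // Placeholder spatial bucketing; the thesis buckets points by country.
    def country(p: Point): String = if (p.lon < 0) "west" else "east"

    // Build a small, self-contained cube over one micro-batch, keyed by the
    // batch's processing time rather than per-point event times.
    def buildMiniCube(batch: Seq[Point], processingTime: Long): MiniCube =
      MiniCube(processingTime, batch.groupBy(country).map { case (c, ps) => c -> ps.size.toLong })

    // Appending the mini cube adds one new time bucket at the end of the
    // existing structure; no earlier buckets need to be touched.
    def append(cube: Vector[MiniCube], batch: Seq[Point], now: Long): Vector[MiniCube] =
      cube :+ buildMiniCube(batch, now)
  }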

5.3 On sustainability and ethics

When using data, the main ethical questions often relate to privacy and to how the data is used. For this thesis, the approximate nature of the cube might be positive from a privacy perspective when the data sources are related to human actions, since the cube groups data sources together. This is especially true for the representation of locations used here, where all data points within the same country were grouped together. Of course, if the cube is supplemented with the full data, that data needs to be sufficiently anonymized.

From a sustainability perspective, the fewer resources used, the better. In this case, this translates into fewer machines or fewer operations. The method used in this thesis ran the cube on a single machine. However, the number of operations is larger than for the original Hashedcubes, as indicated by the time complexity described in section 3.4. The conclusion is that the cube used in this thesis is less resource-efficient than Hashedcubes.

5.4 Future work

This thesis did not investigate methods for maintaining the original data, as it was not useful for the use case. This leaves room for further investigation, possibly utilizing the PMA [3], as suggested by the authors of Hashedcubes [20].

Another interesting area would be to apply parallelization to the implementation, either locally on a single machine or distributed across several.

To achieve a longer data snapshot, a tilted time frame model, as in Stream Cube [12], could be applied.


To handle larger data streams, sampling could be applied to reduce the amount of data the cube needs to ingest. For example, StreamApprox [21], which combines reservoir sampling and stratified sampling, could be used as a layer in front of the cube.
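As an illustration of such a sampling layer, the sketch below implements plain reservoir sampling (Algorithm R); StreamApprox itself additionally stratifies the stream, which is not reproduced here. The Reservoir class is only a hypothetical front end that would feed its sample to the cube.

  import scala.util.Random
  import scala.collection.mutable.ArrayBuffer

  // Maintains a uniform sample of at most `capacity` items from an unbounded stream.
  final class Reservoir[T](capacity: Int, rng: Random = new Random()) {
    private val buf  = ArrayBuffer.empty[T]
    private var seen = 0L

    def offer(item: T): Unit = {
      seen += 1
      if (buf.size < capacity) buf += item
      else {
        // Keep the new item with probability capacity / seen.
        val j = (rng.nextDouble() * seen).toLong
        if (j < capacity) buf(j.toInt) = item
      }
    }

    // Snapshot of the current sample; this is what would be inserted into the cube.
    def sample: Seq[T] = buf.toSeq
  }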

6 Conclusion

The results indicate that the data structure may be useful for maintaining a sliding-window, approximate snapshot of data on a single machine. In the case of delivery monitoring, this solution might be useful.

However, the main strength of the cube design used as a basis for this thesis was to provide a simple and fast data structure for spatiotemporal visualizations. The results in this thesis indicate that a switch to streamed data loses that strength, as the spatial dimension needs to be significantly simplified to keep up with a high-velocity stream. As such, more standard solutions might be of greater interest to a user wishing to design a system for a use case with high granularity in the spatial dimension.

References

[1] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, et al. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8(12):1792–1803, 2015.

[2] Tyler Akidau, Slava Chernyak, and Reuven Lax. Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing. O'Reilly Media, Inc., 2018.

[3] Michael A Bender and Haodong Hu. An adaptive packed-memory array. ACM Transactions on Database Systems (TODS), 32(4):26, 2007.

[4] Nikos Bikakis. Big data visualization tools. arXiv preprint arXiv:1801.08336, 2018.

[5] Mike Bostock. D3.js. https://github.com/d3/d3 (accessed March 29, 2019), 2019.

[6] Stuart K Card. The psychology of human-computer interaction. CRC Press, 2017.

[7] Stuart K Card, Thomas P Moran, and Allen Newell. The psychology of human-computer interaction, 1983.

[8] Eunjoon Cho, Seth A Myers, and Jure Leskovec. Friendship and mobility: user movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1082–1090. ACM, 2011.

[9] Danyel Fisher. Big data exploration requires collaboration between visualization and data infrastructures. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, page 16. ACM, 2016.

[10] Google. Cloud computing services. https://cloud.google.com (accessed May 30, 2019), 2019.

[11] Google. Cloud Pub/Sub. https://cloud.google.com/pubsub/docs/overview (accessed May 30, 2019), 2019.

[12] Jiawei Han, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W Wah, Jianyong Wang, and Y Dora Cai. Stream cube: An architecture for multi-dimensional analysis of data streams. Distributed and Parallel Databases, 18(2):173–197, 2005.

[13] Uwe Jugel, Zbigniew Jerzak, Gregor Hackenbroich, and Volker Markl. M4: a visualization-oriented time series data aggregation. Proceedings of the VLDB Endowment, 7(10):797–808, 2014.

[14] Uwe Jugel, Zbigniew Jerzak, Gregor Hackenbroich, and Volker Markl. VDDA: automatic visualization-driven data aggregation in relational databases. The VLDB Journal, 25(1):53–77, 2016.

[15] Mingzhao Li, Farhana Choudhury, Zhifeng Bao, Hanan Samet, and Timos Sellis. ConcaveCubes: Supporting cluster-based geographical visualization in large data scale. In Computer Graphics Forum, volume 37, pages 217–228. Wiley Online Library, 2018.

[16] Lauro Lins, James T Klosowski, and Carlos Scheidegger. Nanocubes for real-time exploration of spatiotemporal datasets. IEEE Transactions on Visualization and Computer Graphics, 19(12):2456–2465, 2013.

[17] Zhicheng Liu and Jeffrey Heer. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics, (1):1–1, 2014.

[18] Zhicheng Liu, Biye Jiang, and Jeffrey Heer. imMens: Real-time visual querying of big data. In Computer Graphics Forum, volume 32, pages 421–430. Wiley Online Library, 2013.

[19] Dominik Moritz, Bill Howe, and Jeffrey Heer. Falcon: Balancing interactive latency and resolution sensitivity for scalable linked visualizations. 2019.

[20] Cícero AL Pahins, Sean A Stephens, Carlos Scheidegger, and João LD Comba. Hashedcubes: Simple, low memory, real-time visual exploration of big data. IEEE Transactions on Visualization and Computer Graphics, 23(1):671–680, 2017.

[21] Do Le Quoc, Ruichuan Chen, Pramod Bhatotia, Christof Fetzer, Volker Hilt, and Thorsten Strufe. Approximate stream analytics in Apache Flink and Apache Spark Streaming. arXiv preprint arXiv:1709.02946, 2017.

[22] Christian Rohrdantz, Daniela Oelke, Milos Krstajic, and Fabian Fischer. Real-time visualization of streaming text data: Tasks and challenges. In VIS-Week, 2011.

[23] Jonathan A Silva, Elaine R Faria, Rodrigo C Barros, Eduardo R Hruschka, André CPLF De Carvalho, and João Gama. Data stream clustering: A survey. ACM Computing Surveys (CSUR), 46(1):13, 2013.

[24] Spotify. Scio. https://github.com/spotify/scio (accessed May 30, 2019), 2016.

[25] John R Taylor. An introduction to error analysis: the study of uncertainties in physical measurements. 2nd edition, University Science Books, 1997, page 61.

[26] Jonas Traub, Nikolaas Steenbergen, Philipp Grulich, Tilmann Rabl, and Volker Markl. I2: Interactive real-time visualization for streaming data. In EDBT, pages 526–529, 2017.


