• No results found

EVALUATING VIEWS FOR PRODUCING DERIVED DATA ELEMENTS ON TIME SERIES DATA

N/A
N/A
Protected

Academic year: 2021

Share "EVALUATING VIEWS FOR PRODUCING DERIVED DATA ELEMENTS ON TIME SERIES DATA"

Copied!
20
0
0

Loading.... (view fulltext now)

Full text

(1)

EVALUATING VIEWS FOR

PRODUCING DERIVED DATA

ELEMENTS ON TIME SERIES DATA

Vicky Westermark

Bachelor Thesis, 15 hp/credits

Bachelor’s Programme in Computing Science

(2)

Abstract

(3)

Acknowledgements

I’d like to thank my supervisor Michael Minock for giving me the opportunity to write this thesis, and whose well grounded advice and gumption kept me well on track throughout my efforts.

I’d also like to thank the department och Computing Science, with special acknowledgements to Kai-Florian, whose candid feedback was invaluable to this paper; Marie Nordström, whose plans and suggestions illuminated the oft dark and twisty halls of academia, and Lars Karlsson and Jonny Pettersson whose passion for education was an inspiration throughout my studies. A heartfelt thanks to all those lecturers and educators who, with erudition and conscientious-ness took charge of their courses, online and off, and created a coherent and constructive learning environment. Without your efforts the contents of this program would be nought but spaghetti code.

Lastly, my gratitude to my family and friends who have supported me up to and throughout my education, without whom none of this would have happened.

(4)
(5)

Contents

1 Introduction 1

1.1 Purpose and Research Questions 1

2 Theoretical Background 2

2.1 The C-Phrase 3

2.2 The Covid-19 Time Series Dataset 3

3 Method 5

3.1 Derived Data Elements 5

3.2 Common Experimental Arrangements 5

3.3 Experiment 1 6 3.4 Experiment 2 7 4 Results 9 5 Discussion 10 5.1 Interpretation of Results 10 5.2 Limitations 10

5.3 Conclusion and Recommendations 11

(6)
(7)

1 Introduction

As a database administrator (DBA), the faster and easier it is for users to understand a database, the better. However when dealing with regularly updated time series data and complex de-rived data elements, both speed and simplicity are easily hampered. Among the tools available to tackle these issues are database views [1]. By utilizing views to perform part of the com-putation necessary for a query ’behind the scenes’, a DBA can offload part of the difficulty of writing queries on the database and the time it would take to execute them [2]. This can become especially powerful if the DBA has information on the expected query load on the database system. In that case they can tune the database as to make it especially simple and fast to use for more common use cases.

In the case of time series data, users might be interested in more in depth analytical derivative metrics based on the data in the time series. Constructing rolling averages of, or employing differentiation to acquire the rate of change for a metric might be two examples of such de-rived metrics of interest to a user. If the DBA knows that requests for this information will be common, it can be prudent to find out what view implementations on these derived data elements are the most efficient in terms of execution time. However, some view implemen-tations come with additional costs, such as those of constructing and maintaining the view [2].

This paper is dedicated to the exploration of a part of this issue. It does this by means of a naturalistic experimental study (that is a study on a real dataset containing information on a real phenomena) into the efficiency of a set of view implementations. These views all produce derived data elements intended to be analytical of a time series dataset. The intended value of this paper is to provide DBA:s with hard quantitative evaluations of realistic examples that might weigh into the practical decision making of how they are to tune their databases, especially when concerned with time series data and derived data elements.

Chapter 2 presents a theoretical background to views and Persistent Derived Information (PDI), along with an introduction to the tools and dataset utilized in this paper. Chapter 3 describes the experimental method used to evaluate a set of views containing various PDI:s, the results of which is presented in Chapter 4. Lastly, these results are analysed in Chapter 5, along with a discussion about the study, it limitations and potential avenues for further research.

1.1 Purpose and Research Questions

The purpose of this paper is to explore the relationships between the usage of virtual, material and dynamic views and efficiency expressed in total execution time, regarding constructing Persistent Derived Information (PDI) from time series data. In so doing, it is meant to shed light on the trade-offs between the costs of constructing, running and maintaining a database that allows access to quantitative analyses of regularly updated chronological information. As such, this paper is mainly concerned with using practical experiments as feedback for database tuning. This paper explores the total execution time of a number of derived data attribute synthetization methods. The two main research questions can be summarized as follows.

• When first and second order rates of change of a number of attributes in a time series dataset are synthesised, how many queries must be made on a materialized view before the total execution time of queries and view materialization is lower than performing

(8)

those same number and kinds of queries on a virtual view or a dynamic view with the same contents?

• When rolling n-day averages of rates of change of time series data are synthesised, how do the methods of virtual, dynamic and materialized views compare in terms of total execution time, when including the time required to construct the views?

2 Theoretical Background

Databases are organized collections of data [1]. Specifically, an SQL database is a set of tables where every row represents one entity with a certain set of attributes, represented by the columns of the table. The different tables often have specified relations and semantic meaning associated with them.

However, it is rarely useful for a user to be presented with the entire database. Most often, a user is only interested in a subset of the data in the database, where the data conforms to certain relevant criteria. To address this, SQL employs queries to allow the user to ask specific questions about the dataset [1].

In addition to filtering the data in a database, queries can be used to synthesise derived data elements [2] from attributes already existing in the table. For example, Country_Time might be a date attribute derived from the current UTC time, adjusted by the time zone of the country in question. In the long run, the synthesis of derived data elements can often become quite costly in terms of execution time, and tedious to perform when constructing queries. In this context, the derived data element is known as a derivative and the attribute that the derivative is derived from is called the derivee.

Luckily there are often certain types of queries that one can expect to be more commonly run on a certain database [2]. The process of database tuning is that, with knowledge of what this work load looks like, a database administrator (DBA) can tune a the database to perform better under those circumstances. Views are one of the tools the DBA can use to accomplish that goal.

In essence, views are queries already existing in the database that a user can do additional queries on [1]. Views behave like tables, but are the result of queries done on tables in the database. Because of this, views can also serve to manage the tedium and complexity of writing queries on the database, since they can do a lot of the filtering and synthesis behind the scenes. As such they act as a kind of interface, abstracting some of the databases internal details. However, despite the simplicity of their use, there are different kinds of views that vary in how they work on the inside.

If the view is merely virtual, it means that the view acts as a subquery, that is, a query inside a query. In this case, the query that specifies the view will be run every time the view is queried. Here, there is no data structure corresponding to the view in the database. Nothing of the view contents is stored persistently. Instead, it can be seen as an instruction for how to construct the view, and these instructions are stored in the database. Oftentimes the specific DBMS used will optimize this query to not construct the entire view, instead opting to create the part necessary for the subquery. Virtual views rarely speed up queries, since they don’t do any work before they are queried and everything they do can be accomplished by writing a subquery instead.

A materialized view, on the other hand, is what you get when you execute a query

(9)

ing the view and insert the resulting tuples into a table in your database. This table will contain redundant data, derived data (also known as PDI for Persistent Derived Information) or both. This information is stored persistently and so it is available for use by multiple queries. This precomputation can greatly enhance the efficiency of queries done on its contents compared to if it was virtual [3]. With this boon comes costs, two of which are that of its maintenance and that of its construction. Building a materialized view is not free, since the query and construction of the table take processing time. To make matters worse, the materialized view must be kept up to date with the tables it is derived from. If any of the derivees are updated, the materialized view must be updated as well, ‘lest it succumb to inconsistency.

A third method is using dynamic views. Dynamic views do not correspond to any object in the database. Instead they are prepared as queries by the client program. This allows the dynamic views to contain variables that are set just before the query is sent to the database, which allows the user "on the fly" specification of the filtering and synthesis inside of the view, without having to know about it’s internal details. In all other respects, the dynamic views utilized in the experiments of this paper function similarly to virtual views.

When designing a database, the trade-offs between these different view solutions must be weighed against each other, in order to provide both efficiency and consistency.

2.1 The C-Phrase

C-Phrase [4] is a database user interface created by Michael Minock. It uses natural language processing to interpret user provided strings called utterances as queries. Utterances can include specification of the table to do a query on, attributes of interest and other restrictions specified in a configuration file. The interpretation of these utterances can be clearly defined with no ambiguity.

In its documentation it is explained that The C-Phrase Administrative Tool, which constitutes the administrator side of the application, also supports the construction of dynamic views that employ virtual attributes that are defined at run time. These virtual attributes function simi-larly to that of parameters in functions in ordinary programming languages. Virtual attributes do not correspond to any attributes in the database, but are user defined and are static during query execution.

C-Phrase allows for the construction of multiple views on the same tables. Each view has a corresponding lexical phrase (lex-phrase) that the user includes in their utterance to indi-cate that a query should be run on a specified view. When a user provides C-Phrase with a utterance, the total execution time of the corresponding query is available via the C-Phrase Administrative Tool.

An example of an utterance could be ’employees sales salary under 50000’, indicating that the table corresponding to the lex-phrase ’employees’ is the one to perform a query on. Addition-ally, the query should filter for employees working in sales and with a yearly salary under $50’000. What these substrings mean is entirely up to the administrator and are defined via the C-Phrase Administrative Tool.

2.2 The Covid-19 Time Series Dataset

Time Series Data is a kind of dataset comprised of discrete data points ordered chronologically. Each entry represents a system state at a specific time. From this kind of data, several kinds of analyses can be made to better understand the development of a dynamic system. For

(10)

example, one might look at changes in data over time, such as the difference between values in the time series, or averages over a number of data points.

An example of a set of time series datasets are those tracking the epidemiological development of the covid-19 pandemic. One such dataset is used as a basis for the experiments in this paper. It is provided by the user ’Devakumar kp’ on the website Kaggle [5], an online community for data scientists and machine learning practitioners. This dataset, henceforth referred to as the covid-19 dataset, is updated daily and the version used for the basis of this article was downloaded 2020-05-04 and has entries from 2020-01-24 until then.

The covid-19 database contains only one table called ’covid’. The schema of ’covid’ can be found in table 1 and its functional dependencies in table 2.

Each tuple in the table represents a report from a certain place and time. Country_Region usu-ally denotes a reporting country. However, due to differing opinions regarding the sovereignty or independence of various region, this occasionally denotes somewhat smaller subsections of countries. Province_State denotes even finer demarcations between parts of a country, for example the various states of the US. Lat and Long constitute the coordinates for the Province_State or Country_Region if no Province_State is specified, that is to say NULL. Date is the date of the report. The Confirmed, Deaths and Recovered attributes all denote the total accumulation of confirmed cases of, deaths due to and successful recoveries from the covid-19 virus from the first observed case to the date of the report. The three attributes Confirmed, Deaths and Recovered will henceforth be referred to as outcomes, as this paper uses them in similar ways to synthesise derived data elements.

Table 1 The schema of the table ’covid’.

covid

Attribute Name Attribute Type Province_State text Country_Region text Lat real Long real Date date Confirmed INTEGER Deaths INTEGER Recovered INTEGER

Table 2 The functional dependencies of the attributes in the table ’covid’. covid

Province_State, Country_Region, Date → Confirmed, Deaths, Recovered Province_State, Country_Region → Lat, Long

(11)

3 Method

In order to get a good view on what happens in the following experiments, this chapter will begin with presenting the derived data elements in Subsection 3.1. It moves on to the gen-eral experimental conditions present in both experiments in Subsection 3.2, along with an overview of the experimental method. Subsections 3.3 and 3.4 both lay out the proceedings of their respective experiments.

3.1 Derived Data Elements

In the following experiments two kinds of PDI are synthesised. They will be called n:th Order Rate of Change and Rolling n-day Averages respectively.

The First Order Rate of Change (1-RC) on a certain day is defined as the difference between an outcome on that day and the day before. The Second Order Rate of Change (2-RC) a certain day is the difference between the first order rate of change on that day and the day before. An example of this is provided in table 3. In the case of the covid-19 dataset, the 1-RC corresponds to the new instances of an outcome, e.g. new confirmed cases, on a certain day. 2-RC is the difference in new instances from the day before, that is how the rate of new instances are increasing or decreasing day to day. N:th Order Rate of Change is the same as the mathematical concept of the n:th derivative, but to avoid confusion with the concept of derived data elements, this nomenclature is avoided.

Table 3 An example on first and second order rate of change of the Confirmed attribute. Day Confirmed 1-RC 2-RC 0 2 N/A N/A 1 3 1 N/A 2 7 4 3 3 13 6 2 4 17 4 -2 5 14 -3 -7

More formally, one can define First Order Rate of Change of a value 𝑣 on day 𝑑 as the difference between 𝑣 on day 𝑑 and 𝑣 on day 𝑑 − 1. The Second Order Rate of Change a value 𝑣 on day 𝑑 is defined as the difference between the First Order Rate of Change of 𝑣 on day d and the first First Order Rate of Change of 𝑣 of day 𝑑 − 1.

Rolling n-day Averages (RAVG) is when you, for every day, calculate the average of a value during that day and the 𝑛 − 1 preceding days. If 𝑛 = 3, that means that you, for every day, are interested in the average of a value of that day and the two preceding days. An example of this is provided in table 4.

More formally, rolling n-day averages of a value 𝑣 on a day 𝑑 is can be defined as the average of 𝑣 during 𝑛 days preceding and including 𝑑.

3.2 Common Experimental Arrangements

A local agile database using sqlite3 was implemented using the C-Phrase Administrative Tool and was used for all experiments. It ran on a desktop computer running Ubuntu 18.04.4 LTS, with a Intel i5-8400 CPU @ 2.80GHz, 8gb ram, an ATA Samsung SSD and an Asus GeForce GTX 1060 Dual OC 2xHDMI 2xDP 6GB GPU.

(12)

Table 4 An example of rolling 3-day averages of the confirmed attribute rounded to two decimals.

Day Confirmed RAVG

0 2 2 1 3 2.5 2 7 4 3 13 7.67 4 17 12.33 5 14 14.67

A total of 15 views were constructed. The time required to materialize the table with the corresponding lex-phrase ’material all 2rc’ was recorded. This was the only table that was materialized. In all views, all tuples with the same ’Country_Region’ and ’date’ attributes are grouped, thus eliminating the ’Province_State’ attribute. This was done in order to standard-ize the disparate ways in which various countries used Province_State in their reporting. The SQL-statements for each view can be found in Appendix A.

3.3 Experiment 1

Experiment 1 tested three different view implementations that synthesised first and second order rates of change for all outcomes. The lex-phrases corresponding to these views can be found in table 5. All views were tested by running four queries each, each a unique combina-tion of no time filtering or filtering for tuples whose ’_Date’ attribute had a value greater than ’2020-04-01’, and filtering for tuples whose ’Country’ attribute had a value of either ’US’ or ’Italy’. The countries were chosen in the interest of seeing if the separate reporting of the US states would impact performance. All views in this experiment had the same schema, which can be found in table 6. For further details, see Appendix A.

Table 5 The lex-phrases corresponding to the views tested in Experiment 2, categorized by implementation type.

virtual dynamic material

virtual all 2rc virtual dyn-c all 2rc material all 2rc

Table 6 The schema of the views in experiment 1. The column ’Comment’ indicates what the attribute is.

Attribute Name Attribute Type Comment

Country text Country_Region from covid _Date date Date from covid

Confirmed_Total INTEGER Confirmed from covid Confirmed_Today INTEGER 1-RC of Confirmed Confirmed_Diff INTEGER 2-RC of Confirmed Deaths_Total INTEGER Deaths from covid Deaths_Today INTEGER 1-RC of Deaths Deaths_Diff INTEGER 2-RC of Deaths Recovered_Total INTEGER Recovered from covid Recovered_Today INTEGER 1-RC of Recovered Recovered_Diff INTEGER 2-RC of Recovered

(13)

All queries were run five times and their average execution times as reported by the C-Phrase Administrative Tool were recorded. The query execution times for each query differed merely a few milliseconds between runs, so five runs were deemed to be enough to get a decent average. The c-phrase utterances and the views they were interpreted as are available in Appendix B.

3.4 Experiment 2

Experiment 2 tested twelve different view implementations concerned with synthesising rolling day averages of a number of rates of change of a number of outcomes. For all rolling n-day averages 𝑛 = 3 since preliminary testing suggested that the value of 𝑛 did not affect the execution time. The lex-phrases corresponding to each view can be found in table 7.

Each view is a unique combination of an implementation type (virtual, dynamic or material utilizing), an outcome domain (only Confirmed or All outcomes) and orders of rate of change (only first order or first and second order). These qualities of the constructed PDI correspond directly to the schemas of the views. Further details are available in tables 8 and 9. In this experiment, no view was materialized. Instead, the material utilizing views derive rolling averages from the precomputed values available in the materialized view corresponding to the lex-phrase ’material all 2rc’, the same one used in Experiment 1.

The views were tested by running a query on them that projects all attributes of the table. The queries all filter tuples from their corresponding views for those where the attribute ’Country’ has the value ’Italy’. Five tests were run for every query, except for those taking over 100 seconds, in which case only one test was performed. The execution times of each query as reported from c-phrase were recorded. The c-phrase utterances and the queries they were interpreted as are available in Appendix B.

Table 7 The lex-phrases corresponding to the views tested in Experiment 2 categorized by implementation type and qualities of the constructed PDI.

virtual dynamic material utilizing Conf 1-RC virt conf 1rc avg virt dyn-c conf 1rc avg material conf 1rc avg

2-RC virt conf 2rc avg virt dyn-c conf 2rc avg material conf 2rc avg All 1-RC virt all 1rc avg virt dyn-c all 1rc avg material all 1rc avg

2-RC virt all 2rc avg virt dyn-c all 2rc avg material all 2rc avg

(14)

Table 8 A table over what kinds of views have what attributes in their schemas. An ’x’ in a cell indicates that the attribute on that row is included in the views with the qualities of that column.

Attribute Name Conf All 1-RC 2-RC 1-RC 2-RC Country x x x x _Date x x x x Confirmed_Total x x x x Confirmed_Today x x x x Confirmed_Today_Average x x x x Confirmed_Diff x x Confirmed_Diff_Average x x Deaths_Total x x Deaths_Today x x Deaths_Today_Average x x Deaths_Diff x Deaths_Diff_Average x Recovered_Total x x Recovered_Today x x Recovered_Today_Average x x Recovered_Diff x Recovered_Diff_Average x

Table 9 The schema of the views tested in experiment 2. The column ’Comment’ indicates what the attribute is.

Attribute Name Attribute Type Comment

Country text Country_Region from covid _Date date Date from covid

Confirmed_Total INTEGER Confirmed from covid Confirmed_Today INTEGER 1-RC of Confirmed

Confirmed_Today_Average real RAVG of Confirmed_Today Confirmed_Diff INTEGER 2-RC of Confirmed

Confirmed_Diff_Average real RAVG of Confirmed_Diff Deaths_Total INTEGER Deaths from covid Deaths_Today INTEGER 1-RC of Deaths

Deaths_Today_Average real RAVG of Deaths_Today Deaths_Diff INTEGER 2-RC of Deaths

Deaths_Diff_Average real RAVG of Deaths_Diff Recovered_Total INTEGER Recovered from covid Recovered_Today INTEGER 1-RC of Recovered

Recovered_Today_Average real RAVG of Recovered_Today Recovered_Diff INTEGER 2-RC of Recovered

Recovered_Diff_Average real RAVG of Recovered_Diff

(15)

4 Results

The total time required to create the materialized view with the lex-phrase ’material all 2rc’ was 1.53 seconds.

The results of Experiment 1 can be found in table 10. Each cell corresponds to a unique query definition. Between 4.94 and 5.67 queries were required for the total execution time of using the materialized view to be less then that of the virtual one, depending on what criteria the query filtered for. The dynamic and material view implementations were very similar in query execution times.

Table 10 Average execution times in seconds of queries in Experiment 1 on the virtual, dy-namic and material utilizing views, along with the average number of queries required to be run on the material utilizing view before the total execution time of using the virtual view is greater that the material utilizing view, including materialization time. The column named ’Filter’ indicates the filtering criteria of the query.

Filter virtual dynamic material queries required Country = ’italy’ 0.64 0.34 0.34 5.1

Country = ’italy’ & _Date ≤ ’2020-04-01’

0.62 0.32 0.35 5.67 Country = ’us’ 0.65 0.34 0.34 4.94 Country = ’us’ & _Date ≤

’2020-04-01’

0.62 0.32 0.35 5.67

The results of Experiment 2 can be found in the table 11. Each cell under the columns ’Virtual’, ’Dynamic’ and ’Material’ represents a unique view and contains the execution time of the query that was run on that view in seconds. The columns named ’Domain’ and ’Orders of RC’ indicate what PDI the view contains. Under Domain, ’Conf’ indicates that the view only contains PDI derived from covid.Confirmed, whereas ’All’ indicates the view contains PDI from all outcomes. Under Orders of RC, ’1’ indicates that the view only contains first order rates of change, whereas ’2’ indicates it contains both first and second order rates of change. All views contain rolling n-day averages of all its rates of change.

The number of queries required for the total execution time of using the material utilizing views to be less then that of the dynamic views were between 1.65 and 0.12, depending on what PDI were a part of the view schemas. Both the dynamic and the material views were consistently over 100 times faster than the purely virtual view implementations.

Table 11 Execution times in seconds of queries in Experiment 2 on the virtual, dynamic and material views, along with the average number of queries required to be run on the materialized view before the total execution time of using the dynamic view is greater that the material view, including materialization time.

Domain Orders of RC Virtual Dynamic Material Queries Required Conf 1 223.85 1.49 0.56 1.65

All 1 675.82 3.89 1.15 0.56 Conf 2 866.87 4.85 0.89 0.37 All 2 2568.63 14.2 1.83 0.12

(16)

5 Discussion

5.1 Interpretation of Results

Regarding Experiment 1, it seems as though the material view of first and second order rates of change was far more efficient in terms of total execution time, compared to the similar virtual view. However, the dynamic view did just as well as the material view. Since the dynamic view requires no precomputation, and the dynamic and material views performed so similarly, we can not conclude that the materialized view will ever make up for its materalization time. It is difficult to conclude the cause of the very tight race between the dynamic and material views. To speculate, one might conjecture that the main factor in all queries performed in Experiment 1 includes a similar calculation that requires constant time. Another possibility is that the very similar execution times indicate that the number of nested computational loops becomes identical in this case. It might be that filtering for country as in the case of the dynamic view reduces the problem to something very similar to performing a joinless query on the material view.

As for Experiment 2, that on the rolling averages on rates of change, the execution time of the queries on the virtual view grew rather quickly in a way unfit as a solution. On the other hand, the relationships between the dynamic and the material-utilizing view implementations looks rather interesting. Here, materialization pays of very quickly, often within the first query. Neither does it look like the execution time of the queries on the dynamic views can be expressed as a linear function of that of the material-utilizing views.

It seems likely that the great differences in apparent time complexity of querying the views is due to the nested and repeated joins. The initial filtering in the dynamic view means the view does not need to scan through the entire covid-table during the joins in subqueries concerned with calculating the rolling averages. The material-utilizing views instead avoid having to construct the derivatives, allowing it to avoid the utilization of an oft repeated computational loop, as the construction involves a lot of joins.

5.2 Limitations

Preliminary testing also indicated that the number of averaged days hade no impact on per-formance when rolling n-day averages was synthesized. Values for 𝑛 between 2 and 10 were tested and no discernible impact on execution time could be established. These preliminary tests were done on virtual, dynamic and material-utilizing views. However, it is possible that higher values for 𝑛 could have had an effect.

Sqlite3 was chosen mainly due to it integrating well with the agile option for c-phrase. Sqlite3 offers a rather limited set of features and another DBMS might well have provided options for streamlining many of the queries. Amongst other things, sqlite3 lacks support for common table expressions, the lack of which has severely impacted the redundant code in several sql queries utilized in the experiments of this paper. Due to these limitations, writing flexible dynamic views became unworkable. Had this not been the case, it might have been possible to write views that could perform an arbitrary number or orders of rates of change, allowing the production of more tests.

C-Phrase itself was selected due to its support of dynamic views. However, the measurements for all queries were taken directly from the C-Phrase Administrative Tool. This might have had an impact on the validity of the measurements. One such impact was detected, where the number of displayed tuples in the C-Phrase graphical interface had a drastic impact on

(17)

performance. This issue was resolved by allowing the interface to display all tuples returned by a query.

The queries used to test the synthetisation of derived data elements might well be suboptimal in several ways. One possible improvement to the dynamic views that synthesise n-day aver-ages could be to add a WHERE case that compares the country with a dynamically provided specification.

The covid-19 dataset came with the complication that some countries initially reported sep-arate values for their respective States_Provinces, but later changed this scheme to reporting only on the Country_Region level. To manage this inconsistency, all views constructed for this paper group entries on Country_Region and Date, and so make State_Province irrelevant. This might have an additional impact on the performance of certain queries.

5.3 Conclusion and Recommendations

A cursory glance at the experimental results could be taken to indicate that the more nested joins a view definition contains, the more useful materialization becomes. However, it might not always be necessary to materialize an entire view, but utilize a sort of stratified hierarchy of view materialisations, in a way suggested by the material-utilizing views in Experiment 2. Other times materialisation might be entirely unnecessary, as shown by Experiment 1, where dynamic filtering might suffice to boost the performance of a view to the standards of a material one.

Under the precise circumstances present during the experiments outlined in this paper, mate-rializing the rates of change and utilizing that view to synthesise rolling averages is advisable.

5.4 Future Work

For additional research, a more general study of the relationships between the number of derivees, the number of derivations and view implementations on time series data could be made to further inform the decision making process for DBAs dealing with database tuning. This paper studies naturalistic data in order to get a well grounded basis for its conclusion, but this dataset is limiting in the number of relevant derivees (there are only three) which limits the quantitative analysis that can be done on the results. A greater number of test results could allow for interesting statistical modeling work. An artificial dataset following the pattern of time series data could provide the basis for such a study.

(18)

References

[1] Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2008. Database Systems: The Complete Book (2nd. ed.). Prentice Hall Press, USA.

[2] D. Botzer and O. Etzion, "Optimization of materialization strategies for derived data el-ements," in IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 2, pp. 260-272, April 1996, doi: 10.1109/69.494165.

[3] Halevy, A. Answering queries using views: A survey. The VLDB Journal 10, 270–294 (2001). Available: https://doi.org/10.1007/s007780100054

[4] Michael Minoc. 2020. "Phrase | Natural Language Interfaces to SQL Databases." C-Phrase Technologies AB, 20 May 2020. Available: https://c-phrase.com/

[5] K. P. Devakumar, "COVID-19 Dataset | Kaggle". April 24 2020. [Online]. Available: https://www.kaggle.com/imdevskp/corona-virus-report/data

(19)
(20)

References

Related documents

The churn prediction model will predict if a cus- tomer will churn within six months from the most recent month in the data- point, while the churn time prediction model predicts

The SumTime weather project goal was to generate weather forecast that contained information about wind speed and direction from time series of data for offshore rigs in the North

The reason for comparing two periods which are not only one year (year of introduction of the wind shield) apart from each other is to apply a test which would

(FDI) t is a measurement of the FDI inflows to the host country set in FDI inflow per year, Time is the time period variable in this case from 1980 to 2000, CCRIS is a dummy

By using the ED instead of the combined distance as defined in Equation 3.2, we might find a closest cluster which can be a miss match because the distance exceeds the tau value.

Studiens slutsatser redovisas, vilka utmaningar och möjligheter lärare uppfattar med läsplattan som redskap och hur lärare kan använda den som stöd för elever som de uppfattar

Music and the ideological body: The aesthetic impact of affect in listening.. Nordic Journal of Aesthetics,

The inclusion criteria were: (1) study participants of any age diagnosed with type 2 diabetes or pre-diabetes, (2) studies evaluating the modulation of the gut microbiota, either