
Database schema migration

in highly available services

JAVIER BENITEZ FERNANDEZ

KTH ROYAL INSTITUTE OF TECHNOLOGY

Database schema migration in highly available services

BENITEZ FERNANDEZ, JAVIER

ICT Innovation Master’s Program
Date: January 20, 2021

Supervisor: Ying Liu
Examiner: Benoit Baudry

School of Electrical Engineering and Computer Science
Host company: Eficode Oy

Swedish title: Databasschema migreringar i mycket tillgängliga tjänster


Abstract

Since the rise of DevOps, companies have been trying to provide rapid and agile cycles for the development of new products, reducing the time and cost to go from definition to production. However, whenever a MySQL system with a zero-downtime requirement is involved, the database deployment cycle is still manual, leaving it less evolved than other parts of the development cycle, especially when applying schema updates without affecting the availability of the MySQL system (online schema migrations) in Continuous Integration and Continuous Delivery pipelines. As a consequence, it delays the process. There is a strong belief that the different affected technologies are ready to automate online schema migrations in production. In this thesis, we verify the viability and study the reliability of automating online schema migrations with a zero-downtime requirement. We achieved this by testing the most common errors in a setup without error detection mechanisms and then testing our proposed design. We conclude that it is viable and reliable to apply these techniques and strategies to provide Continuous Delivery. Our design prevents the most common errors when updating SQL schemas online, with some inherent implications and flaws in MySQL systems.


Sammanfattning

Sedan DevOps uppkomst har företag försökt ge snabba och smidiga cykler för utveckling av nya produkter, vilket reducerar tid och kostnader för att gå från definition till produktion. Men när ett noll-stilleståndskrav hittas i ett MySQL-system, är databasdistributionscykeln fortfarande rudimentär och manuell, vilket gör att den har mindre utvecklats än andra delar av utvecklingscykeln, särskilt migrationsdatabas online-schema för kontinuerlig integration och kontinuerliga leveransrörledningar. Som en konsekvens försenar det processen. Även om det inte finns någon tidigare litteratur om det, finns det en stark övertygelse om att de olika berörda teknikerna är redo att automatisera online-schema-migrationer i produktionen. I den här avhandlingen verifierar vi livskraften och studerar tillförlitligheten för att automatisera Online Schema migreringar med ett noll-stilleståndskrav. Vi gjorde det genom att testa de vanligaste felen i en normal installation och när vi använder vår föreslagna design. Vi drar slutsatsen att det är genomförbart och pålitligt att tillämpa dessa tekniker och strategier för att tillhandahålla kontinuerlig leverans. Vår design förhindrar vanliga fel vid uppdatering av SQL-scheman online med några inneboende implikationer och brister.


Contents

1 Introduction
1.1 Problem Statement
1.2 Research Question
1.3 Structure of the Document
1.4 Thesis Contribution
1.5 Ethics and Sustainability

2 Background
2.1 DevOps, Continuous Integration and Continuous Delivery
2.2 MySQL Systems
2.3 Online Schema Migrations
2.4 Summary

3 Related Work
3.1 Methods
3.2 Related Works
3.3 Summaries and Discussions

4 Methodology

5 Solution Design
5.1 Environment Design
5.2 Requirements
5.3 Expected Outputs
5.4 Pipeline Structure
5.5 Setup
5.6 Implementation
5.7 Summaries and Discussions

6 Experiment
6.1 Research Questions
6.2 Metrics
6.3 Case Study and Scenarios
6.4 RQ1
6.5 RQ2
6.6 RQ3

7 Discussion
7.1 Results of the Conducted Experiment
7.2 General Considerations
7.3 Impact of the Pipeline
7.4 Threats to Validity

8 Conclusions and Future Work
8.1 Conclusions
8.2 Future Work

List of Figures

1.1 DBA migration flow
2.1 DevOps Workflow [23]
2.2 MySQL cluster master-slaves
2.3 MySQL cluster master-master
2.4 AWS Aurora clustering master-slaves
5.1 Pipeline stages
5.2 Pipeline stages
5.3 Continuous delivery flows
5.4 Infrastructure architecture
6.1 New column runtime error
6.2 New column runtime latency
6.3 Alter column backward compatibility error
6.4 Alter column backward compatibility traffic
6.5 Alter column runtime error
6.6 Alter column runtime traffic
6.7 Alter column runtime database CPU
6.8 Drop column database CPU
6.9 Drop column latency


1 Introduction

Since the rise of DevOps, many companies have been trying to improve their development cycles and achieve more agile cycles [1][2]. The state of the art has been focusing on improving the agility of software development in general; however, some areas are still immature, with limited research work. One of them is the need for predictable, automated updates in the system life-cycle of high availability systems. Jim Gray et al. describe a high availability system as a system with an unavailable period as low as 5 minutes per year [3].

Due to technical challenges explained further in this document, DevOps practices, especially Continuous Delivery (CD) and Continuous Deployment (CDL) pipelines, are not fully exploited in scenarios involving updates to high availability systems. CD and CDL are two complementary components. CD pipelines are usually custom pipelines that define, for each project, the different jobs for building and testing software automatically. CDL pipelines aim to automatically release the software previously processed by a CD pipeline.

V. Holt et al. investigated the adoption of best practices for database management in the database community [4]. They found a high use of agile development methodologies. However, half of the respondents reported the lack of a defined database development life cycle, including fault recovery time objectives. This makes it difficult to ensure the stability of state conservation during database evolution without producing downtime, outages, or any other issue that compromises the availability of the system.

Traditionally, a Database Administrator (DBA) maintains and manages all the changes related to a MySQL system, both the changes related to the code development life-cycle and the changes related to MySQL maintenance [5], as shown in Figure 1.1. Both types of changes are manually managed by DBAs. The flow described in Figure 1.1 is split into 7 steps. A developer updates a SQL schema in step 1, which is then assessed by a DBA; this assessment consists of analyzing the validity of the SQL statements and the impact that they may have on the system, to prevent errors such as compatibility breaks and data integrity breaks. Following that, a DBA tests the changes locally as manual integration tests before approving the migration. Once the migration is approved in step 4, a DBA prepares the MySQL system to be migrated, prepares a backup, schedules a maintenance window, and performs other project-specific tasks. Once the MySQL system is prepared for the migration, in step 6 a DBA proceeds with the online schema migration approach and supervises the operations while the data is migrated to the new schema. Then, a DBA checks that the migration has been successfully applied in step 7.

Steps 5 and 6 can take from hours to days in a production database with required supervision. This presents a clear bottleneck and obstacle to automating a system's continuous delivery cycle. Even if, in some cases, there are already tools that can be effectively used in some stages of a CD pipeline or a CDL pipeline, such as SQL linters for step 2 and online schema migration tools for step 5, they do not cover steps 5, 6 and 7 without compromising the system integrity. SQL linters perform sanity checks, mainly checking the syntax; however, they lack breaking-change checks or data integrity break checks. Online schema migration tools manage part of the migration: they apply the changes to the schema without downtime, but they lack database state and performance monitoring, which implies that a DBA needs to supervise the process. CD and CDL pipelines, as automatic processes, have to ensure their own repeatability and the system integrity [6]. Currently, it is not possible to perform these tasks in CD and CDL pipelines without compromising the robustness and stability of the system with the current tools for MySQL systems.
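For illustration, consider the kind of change a DBA assessment in step 2 aims to catch. The table and column names below are hypothetical, not taken from the thesis; this is a sketch only:

-- Risky change that step 2 should flag: renaming a column that the running
-- application still reads breaks compatibility for every deployed version.
ALTER TABLE users CHANGE COLUMN fullname display_name VARCHAR(128) NOT NULL;

-- Backward-compatible alternative: add a new nullable column that old
-- application versions can simply ignore.
ALTER TABLE users ADD COLUMN display_name VARCHAR(128) NULL;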

Consequently, most companies keep using manual schema migration methods for database schema changes. A DBA manually applies and manages the changes, generating a delay in the development cycle. In some use cases, due to the complexity of the process and the required manual time, some tools reduce the manual interaction in some of the steps, such as SQL linters in step 2 and online schema migration tools in step 6. However, these steps are only partially automated.


Figure 1.1: DBA migration flow

1.1 Problem Statement

There is already research around the different areas that may allow automating these steps. Schema transformation and rollbacks are a widely studied topic [7][8], with a special focus on the design of a Database Evolution Language (DEL) to improve the process. On the other hand, relational database migrations are also widely studied [9][10][11][12][8]. The previous studies are particularly focused on analyzing and designing strategies and techniques to face the resulting challenges [13][14]. Also, automation, especially CD, has been extensively studied [15][16][17]. Nevertheless, as shown in the next chapter, there is a lack of research around the methodologies and strategies that can be applied to automate database schema evolution in systems with a zero-downtime requirement, and around the possibility of providing a reliable and robust development cycle for high availability systems [18]. A zero-downtime requirement describes the need for a system to run without service interruption, normally motivated by the high economic losses incurred when the system is down.

The previous research describes strategies to migrate schemas and perform rollbacks in a CI/CD pipeline. What the previous studies do not consider are the common necessities of zero-downtime systems, such as multiple service versions connected to the same database or transforming schemas securely. Because these cases are not considered, the existing work does not offer reliable techniques or strategies to manage database evolution on zero-downtime systems in a CI/CD pipeline.

The purpose of this thesis is to analyze the possibility of bringing highly automated CD and CDL pipelines to MySQL schema development, reducing repetitive work for DBAs and increasing the velocity of the Development Value Stream (the length of time between a task being defined and it being deployed to production) by partially automating the flow presented in Figure 1.1. This thesis aims to achieve this purpose by automating steps 2, 3, and 6, with a special focus on detecting common breaking compatibility and data corruption errors in order to automate these tasks robustly and stably.

1.2 Research Question

• Is it possible to provide reliable automatic schema evolution on high availability MySQL systems by using the current online schema migration tools together with strict testing and SQL linters?

There are two relevant metrics: the percentage of total errors produced by the server and the latency of the HTTP requests. The first metric allows verifying that no downtime is produced, and the second one ensures that the service response time is not drastically affected by the migration. This Research Question is important to ensure that the changes do not affect the current behavior of the system during or after the migration, to achieve the thesis purpose stated in the previous section.

• Is it possible to prevent downtime on automatic schema evolution for MySQL systems with a zero-downtime requirement by detecting commonly faced schema bugs?

The metric is the percentage of total errors produced by the server while processing the client calls. This metric allows measuring what resources are affected by the downtime. This Research Question is important to ensure that the changes produce no downtime, to achieve the thesis purpose stated in the previous section.

• Is it possible to prevent data corruption on automatic schema evolution for MySQL systems with a zero-downtime requirement by detecting commonly faced schema bugs?

There are two relevant metrics: the database CPU utilization and the latency per request to the server. Both metrics allow identifying the error as data loss through a drastic reduction of the resources required for processing the queries. This Research Question is important to ensure that the changes are corruption-free, to achieve the thesis purpose stated in the previous section.

1.3 Structure of the Document

This document is structured as follows. The first chapter goes through a general introduction of the document and the research. The second chapter goes over the general background of the research. The third chapter goes over the literature review, introducing related works and discussing the differences from the work presented in this document. The fourth chapter introduces the research methods used. The fifth chapter describes the proposed tentative pipeline architecture for automatic online database schema updates, the implementation, and the setup used for the experiments. The sixth chapter presents the experiments. The seventh chapter shows the results of the experiments and discusses what these results mean. Finally, the final chapter presents overall conclusions to the research, discusses problems related to this thesis, and introduces topics for future research.

1.4 Thesis Contribution

The main contributions of this thesis project are:

• We provide a case study on how CI/CD pipelines could leverage different techniques and algorithms to offer CD for MySQL schema evolution with zero downtime, offering a partially automated DBA migration flow, managing the risk of assessing and testing the migration, and migrating the changes without downtime. We also offer an overview of the problem, which could help researchers working on similar problems for other database systems.

• Our approach offers a reliable way to detect and prevent common bugs, in this way preventing wrong schema migrations that would produce breaking compatibility and data corruption errors. This feature ensures a more stable and robust CD pipeline for MySQL schema evolution with zero downtime.

1.5 Ethics and Sustainability

As explained previously, this thesis tackles CD in MySQL systems with a zero-downtime requirement. It does not present any known ethical dilemma or illegality that may affect our society.


2 Background

In this chapter, we introduce the main fields that this thesis relates to, which are DevOps, highly available MySQL systems, and online schema migrations. Section 2.1 introduces DevOps and its application in this thesis. Section 2.2 introduces MySQL systems and different clustering approaches to achieve high availability. Section 2.3 introduces online schema migrations.

2.1 DevOps, Continuous Integration and Continuous Delivery

DevOps has become an important cross-disciplinary component in many large enterprises [19]. DevOps is a set of practices that aim to reduce the gap between different disciplines such as development, operations, and Quality Assurance (QA), and to speed up software delivery cycles without risking the quality of the product [20]. At the beginning of the 2000s, IT evolved multiple traditional methodologies, such as waterfall methods, to take advantage of methodologies developed in other disciplines such as Lean manufacturing, which was a well-established set of practices for manufacturing by then. However, the concept "DevOps" was proposed in 2009 at the first conference held in Belgium for eliminating the boundary between developer teams and operation teams [21].

DevOps involves an iterative process composed of multiple stages, as shown in Figure 2.1. In 2012, Gene Kim et al. introduced "The Three Ways": flow, feedback, and continual learning [22]. The "First Way", or flow, has the main purpose of moving the product as quickly as possible from planning to production. Therefore, optimizations in the system are required, especially when moving to a stage for which a different disciplinary team is responsible, like moving from the development team to the operations team. The "Second Way", or feedback, has the purpose of shortening and amplifying feedback loops and of understanding and meeting customer needs to learn how to improve the current product. The "Third Way", or continual learning, has the purpose of enabling a culture of experimentation and learning from failure to continuously learn in the development and maintenance of the product.

Figure 2.1: DevOps Workflow [23]

As a consequence of the "First Way", optimizations of the processes that involve multiple disciplines and teams are required to speed up software delivery cycles. One of the most popular ones is the use of CI and CD. Many large companies have adopted CI and CD in their software delivery cycles in recent years [6].

About CI, Virmani et al. (2015) stated: "Continuous integration refers to integrate early, don’t keep changes localized to your workspace for long, instead share your changes with team and validate how code behaves continuously. Not only share within component teams but integrate beyond component boundaries, at the product integration level. Further, this stage of process optimization refers to achieving automation such that as soon as the developer delivers the change the build systems detect that (may even be a scheduled event at end of the day) and triggers a build, carries out sanity tests, and posts the build to a repository. This has to be a repeatable continuous process all across the development cycle." [6]

CD refers to optimizing software delivery to achieve short cycles, with software reliably released every time. The releases are repeatable, and CD advocates offering changes rapidly and efficiently [15]. Most importantly, it offers the ability to release software at any moment, allowing almost any production delivery frequency, such as daily or weekly. All of this is possible when the process is finely tuned and governed by automation, so as to offer rapid cycles with safety mechanisms such as rollbacks or monitoring systems [24].

2.2 MySQL Systems

MySQL is a relational database management system (RDBMS) useful for a wide range of use cases. It is a high-performance and flexible database system. It was originally released in 1995 by David Axmark and Michael Widenius [25], and it is still one of the most popular databases nowadays [26]. This can be explained by many attractive qualities such as speed, query language support, and capability [27]. InnoDB, the default storage engine, is a high-performance engine that offers fast service. The query language, SQL, is a common standard used for decades by database systems; MySQL understands the SQL standard and extends it. Also, MySQL is a multi-threaded database system with interfaces for many different programming languages.

However, a MySQL system with a single instance has two main disadvantages. Firstly, if the server crashes, the database goes down, losing availability. Secondly, vertical scaling (improving hardware specifications) is the only option to increase the system's traffic capacity. For that reason, clustering is a popular approach. Clustering increases the number of instances that are synchronized to offer a database. Clustering offers failure resilience and redundancy: an instance can have a complete failure without any downtime or data loss for the users. Clustering also offers horizontal scaling: it is possible to increase and decrease the system's serving capacity by increasing or decreasing the number of instances.

Traditionally, there were two types of MySQL clustering: shared-disk and shared-nothing clustering [28]. Shared-disk clustering means having multiple MySQL instances reading from the same disk. Shared-nothing clustering means that each instance uses its own disk resources and other mechanisms are used to keep consistency between instances. Shared-disk is obsolete due to the lack of isolation and its scalability limitations. If the shared disk fails, the whole system has downtime, breaking the availability of the system. Also, sharing a disk limits the horizontal scalability of the service.

Figure 2.2: MySQL cluster master-slaves

Nevertheless, MySQL clustering, as a distributed database system, confronts the main problem of data consistency. There are two approaches: master-slave and multi-master. A master is an instance that can write and read data from the database, and a slave is an instance intended as read-only, using MySQL replication. MySQL replication is a process that enables data to be automatically copied to other instances. The master-slave approach defines a cluster with a single master that receives every new addition or deletion of data in the database, and one or multiple slaves that, thanks to MySQL replication, synchronize data copied from the master, allowing consistent reads of the database. This approach scales up the MySQL system by distributing the reads and centralizing the writes to the database, as shown in Figure 2.2. The multi-master approach defines a cluster with multiple masters and multiple slaves, as shown in Figure 2.3. Masters need to keep synchronized with each other to keep data consistency.


Figure 2.3: MySQL cluster master-master

2.2.1 AWS Aurora

AWS RDS MySQL, in particular AWS Aurora, a fully managed MySQL system, offers a high-performance and reliable database. This thesis uses AWS Aurora as the main MySQL system. However, it differs from normal MySQL systems and normal MySQL clusters on multiple points. First, by default MySQL systems have a strict mode for SQL querying. Strict mode controls how MySQL handles invalid or missing values in SQL statements involving data changes [29]. Strict mode is important because it ensures that only correct SQL statements are accepted. On the other hand, Aurora master-slave clusters in particular, as shown in Figure 2.4, do not use MySQL replication to keep the data synchronized between instances; instead, the master writes directly into the slave replicas' volumes [30]. This method offers consistency and reduces the replication lag. Nevertheless, this last feature may pose a problem during schema migrations in some scenarios. As the master writes directly into slave volumes, slaves cannot update schemas individually, because they share the same underlying storage volume as the master instance.
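As an illustration of why strict mode matters, the following sketch shows the behavior difference on a hypothetical items table (the table and column names are only illustrative, not part of the thesis setup):

-- Hypothetical table used only to illustrate strict vs. permissive mode.
CREATE TABLE items (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(64) NOT NULL
);

-- With a permissive sql_mode, an INSERT that omits the mandatory column is
-- accepted and an implicit default ('' for VARCHAR) is silently stored.
SET SESSION sql_mode = '';
INSERT INTO items () VALUES ();   -- succeeds, name = ''

-- With strict mode enabled, the same statement is rejected, so only
-- complete, valid rows reach the table.
SET SESSION sql_mode = 'STRICT_TRANS_TABLES';
INSERT INTO items () VALUES ();   -- ERROR 1364: Field 'name' doesn't have a default value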


Figure 2.4: AWS Aurora clustering master-slaves

2.3 Online Schema Migrations

A relational database is a database based on a relational model. The data is structured and stored in a component called a table, where the data can be accessed or modified. The tables are defined and described by a formal language. A schema is the logical representation of the data structure and the relationships between the tables. Schema migration, then, is the process of applying or reverting changes to relational database schemas.

Traditionally, updating a schema in a MySQL system was an operation that required downtime. The main reason was keeping consistency when executing Data Definition Language (DDL) operations, i.e., operations that add, delete, or modify database schemas. DDL operations are required to lock the table to apply the operation. As a result, different approaches were created to allow zero-downtime schema migrations. There are two common options: migration on a replica and the ghost table. Migration on a replica consists of taking a replica, isolating it, and applying the migration to it. When the replica is updated, it is promoted to master, and the process is repeated with the other replicas until every replica is updated. By contrast, the ghost table approach is different. It consists of creating a table (the ghost table) and migrating the schema into this new table; once it is migrated, the data is copied from the table with the old schema to the ghost table. When the data is copied, the old table is replaced with the ghost table without causing downtime.
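In SQL terms, the ghost table approach corresponds roughly to the following simplified sketch (the items table and the added column are hypothetical; real tools copy rows in small batches and keep propagating ongoing writes until the cut-over):

-- 1. Create an empty copy of the target table and apply the DDL to it.
CREATE TABLE _items_ghost LIKE items;
ALTER TABLE _items_ghost ADD COLUMN price DECIMAL(10,2) NULL;

-- 2. Copy the existing rows into the ghost table (done gradually in practice,
--    while ongoing inserts/updates/deletes are propagated as well).
INSERT INTO _items_ghost (id, name)
    SELECT id, name FROM items;

-- 3. Atomically swap the tables; clients now see the new schema.
RENAME TABLE items TO _items_old, _items_ghost TO items;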

Migration on a replica is normally a manual process. In contrast, there are different tools for the ghost table approach. Some of the most common ones are Percona’s online schema change tool [31] and Facebook’s OSC [32]. Also, GitHub released its online schema migration tool Gh-ost [33] in 2016, with a modern approach to the same problem.

2.4 Summary

In this thesis, we explore how to automate part of the DBA schema migration flow explained in chapter 1. As shown in Section 2.1, DevOps is an iterative process to reduce the gap between different disciplines, using CI and CD to speed up software delivery cycles. CI and CD offer automated processes in a repeatable manner. However, applying DDL changes without downtime is a problem that requires a specific solution for each MySQL setup. As shown in Section 2.3, there are multiple tools, such as Facebook’s OSC or Gh-ost, that offer online schema migration using the ghost table approach. As shown in Section 2.2, MySQL systems are high-performance and flexible RDBMSs. MySQL offers different ways of clustering, and for this thesis we use an AWS Aurora MySQL master-slave cluster. The particularity of this cluster is the direct writes into the volumes, instead of MySQL replication, for keeping the data synchronized between instances.


3 Related Work

This master's thesis is conducted as a case study to assess the possibility of utilizing DevOps techniques for MySQL online schema migrations. It defines how to put together the advantages of the DevOps paradigm with the delivery of relational database changes on highly available systems. To achieve this goal, the approach proposed in this thesis utilizes technologies and methodologies developed in other areas. The DevOps paradigm aims to integrate the system development and system operations life cycles. It reduces time frames by providing continuous delivery with high quality.

To achieve the goals of this new approach, there are four different areas by which this approach is shaped: Automation, Database Schema Migrations, Quality Assurance, and Logical Representation of Migrations. All of them are a component or stage of this new approach.

Automation is the way to provide continuous delivery, by automating the operational processes that were previously manual and providing repeatable processes. As a requirement of high availability systems, the database schema migrations have to be done without downtime. In addition, it is a requirement to ensure high quality in the continuous delivery of the different versions. Quality Assurance is an implicit requirement: checking that the changes are backward compatible and that no breaking change is introduced.

3.1 Methods

The literature review presented in this research met the following criteria to select the sources and areas.


• Context. As one of the main goals of this research work is to introduce a new approach, it is necessary to give some context on the areas that it covers, even if the content of the used sources is only partially addressed.

• Industry relevance. Some of the areas covered here have been mainly developed in the industry, with a lack of intervention research and a higher relevance of descriptive research.

3.2 Related Works

3.2.1 Automation

Automation is a topic that has been extensively developed in the industry and, accordingly, it has been covered by academia. During the last years, a continuously growing number of organizations have been adopting new practices in automation, mainly CD. For this reason, research has been focusing on analyzing the adoption of this paradigm, from its structure to common challenges faced by organizations.

Mojtaba Shahin et al. [16] analyzed the main challenges that organizations are facing towards Continuous Delivery and Continuous Deployment. They found four main challenges: lack of fully automated acceptance tests, manual quality checks, lack of efficient rollback mechanisms, and bureaucracy or business involvement in deployment decisions.

Lianping Chen in [15] defines a standard pipeline structure with six stages: code commit, build, acceptance test, performance test, manual test, and production, continuing with an analysis of the implications and challenges [17]. On the other hand, Lianping Chen describes the architectural changes that were necessary for the adoption of CD at Paddy Power [34].

3.2.2 Database Schema Migrations

Different approaches have been utilized and studied for schema migrations without service downtime. Nevertheless, the approach that is widely utilized these days is online migration. MySQL offers a built-in approach for online schema migrations [35], with some limitations. Some of the operations still require locking the table, which produces downtime, and all the online DDL operations require a short lock at the beginning and the end of each operation. However, the most important limitation is that it restricts the operation of a MySQL cluster during its execution. The replication to the slaves is paused during the migration, generating lag on the slaves. Therefore, it drastically reduces the cluster's serving capability during the migration process.
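For reference, MySQL's built-in online DDL is requested through the ALGORITHM and LOCK clauses of ALTER TABLE; the statement fails if the requested combination is not supported for that operation. A minimal sketch, with a hypothetical table and column:

-- Request an in-place ALTER that permits concurrent reads and writes.
-- MySQL rejects the statement with an error if ADD COLUMN cannot be
-- performed with the requested algorithm/lock level.
ALTER TABLE items
    ADD COLUMN price DECIMAL(10,2) NULL,
    ALGORITHM=INPLACE, LOCK=NONE;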

Due to the limitations of the built-in solution, many solutions have been developed to meet the demands of the market. Most of these solutions use the same approach, which consists of creating a replica table, migrating that table, slowly and gradually copying data from the target table to the replica, and meanwhile propagating ongoing DML changes (any insert, delete or update operation) to the new table. When the process ends, the target table is replaced with the replica table. This generates a small downtime at the end of the operation.

The state of the art is far ahead of the research work in this area. Currently, multiple stable tools provide reliable approaches, but there is little documented research about them. Some of the most common ones are Percona’s online schema change tool [31] and Facebook’s OSC [32]. One of the main features that they share is the use of triggers for propagating the changes to the replica/ghost table. Due to this feature, they have multiple limitations, such as the impossibility of migrating foreign keys, and they do not provide a fully reliable process.

Michael de Jong et al. propose a different approach to tackle the limitations encountered by the tools mentioned above [36]. They introduced QuantumDB, an online schema migration tool that follows a new approach. The approach creates a new table for each affected table, applies the schema updates, and listens to the data from the previous table, applying a transformation. Also, a special proxy routes the queries to one of these two versions. Finally, when the migration is done, the proxy routes all the requests to the new table. The solution claims to solve the problems of the previously presented solutions, such as the foreign key issues. However, this approach has a clear drawback. Whenever a DDL change is applied, all the affected tables are replicated and triggers are set to keep the consistency of the state between the current version and the new version. As a consequence, the state is shared by two different sets of tables. Nevertheless, this approach keeps most of the issues that the other tools have due to the utilization of triggers.

Gh-ost [33], GitHub's online schema migration tool, avoids the utilization of triggers by reading the binary log and asynchronously copying the data to the replica table with the changes; whenever all the data is on the new table, the table is replaced. In addition to asynchronous migration, it improves the control over the migration. One of the most notable added features is the possibility to pause the migration during the data copying stage. Also, it allows delaying the last stage, replacing the old table with the new table, allowing DBAs to apply the migrations during maintenance hours. It solves the problems associated with triggers and allows high flexibility. However, it comes with other challenges. The main issue is the lack of support for foreign keys.

3.2.3 Quality Assurance

Ensuring the stability and robustness of the database system is a common requirement when working on business systems. There are multiple conventional methods, such as unit testing or query generation. However, these traditional methods require a high effort to implement and maintain, while normally offering low test coverage for big projects.

Jiaqi Yan et al. [37] introduce a new test method, Snowtrail, an internal system that uses customer production queries with some level of obfuscation for automatic database testing. Oracle’s SQL Performance Analyzer [38] also follows a similar approach. It is used for testing the impact of new changes such as upgrades or schema changes.

3.2.4 Logical Representation of Migrations

Migrating relational databases is a common pitfall that developers confront. It presents multiple challenges due to the design of SQL, such as the impossibility of directly reverting schema changes. One of the most challenging ones is the lack of bidirectional changes, which prevents having a transparent forward and backward relation between the changes. The research community has been working for a long time on designing solutions for this challenge. Schema Modification Operations (SMOs) are the most popular approach, creating a custom Database Evolution Language (DEL) to manage changes. It abstracts the SQL schema changes, allowing to track and keep a relationship between the changes and thus to revert schema changes without manual intervention in the queries. PRISM/PRISM++ is a traditional approach [39] that creates its own DEL. Another one is Inverda, which aims to create versioning for the database state, making migrations transparent in real time [40].

On the other hand, in the industry, SMOs are commonly integrated with a specific programming language framework or with SQL itself. There are multiple known tools, such as Skeema [41]. Nevertheless, SMOs provide no automatic rollback functionality, which is one of the main aims of DEL systems.

3.3 Summaries and Discussions

As shown above, the automation processes are mature enough to afford continuous delivery, and sometimes continuous deployment, for the code lifecycle. The unanswered question is whether the presented tools are ready for CD pipelines for the database lifecycle on systems with a zero-downtime requirement. There are two implementation strategy differences between the stages of a CI/CD database pipeline and a code pipeline: testing and delivery. For testing, multiple traditional techniques can be applied to normal cases, but also new flexible techniques that allow testing the database with queries commonly used in production, such as the Snowtrail and Oracle SQL Performance Analyzer systems [37][38]. However, these technologies have a special focus on database system development testing, leaving aside app and database integration testing. For delivery, there are multiple approaches for migrating schema changes with zero downtime, as shown by Michael de Jong et al. [1]. The most popular one is online schema migration. There are multiple implementations with different methods; each of them has advantages and challenges, requiring a different action path for each specific use case. However, that being the case, the state-of-the-art approaches do not allow CD for most of the scenarios.

Research has been focusing on the different areas covered in this chapter. Nonetheless, no relevant research work was found trying to utilize DevOps techniques for MySQL online schema migrations as this thesis proposes. The different areas have shown a variety of maturity. For example, database online updates are mature, but the research work on the logical representation of migrations is still immature. However, the area where this thesis is developed, the automation of online schema migrations for highly available MySQL systems, is still uncharted for researchers. This thesis provides a case study that shows how CI/CD pipelines could leverage online schema migration techniques and database testing to offer CD for MySQL systems with a zero-downtime requirement. We consider that the technologies are mature and stable, opening the possibility to combine them. This thesis then contributes to academia by studying the feasibility of this statement. Also, as a consequence of this statement, the different areas presented, combined, permit robust and stable continuous delivery for MySQL databases with zero downtime, without giving up service availability and data consistency.


4 Methodology

This chapter describes the research approach chosen for this thesis. It presents the steps and methods utilized to pursue the goals of this study. For that purpose, the Engineering Design Process is adopted with some modifications.

The Engineering Design Process is a problem-solving paradigm that aims to ideate hypotheses and iteratively test them. It is a highly iterative process, requiring each step to be repeated multiple times. It is also considered a decision-making process where mathematics and engineering sciences are utilized to optimally meet a previously defined goal.

The Engineering Design Process was chosen as the research methodology of this thesis since the thesis tries to design a solution for the problems derived from database schema migration automation without previous literature or experience with a similar scenario. It requires constant iteration over the steps based on the observations. This thesis iterated 4 times over the process before shaping the last version.

According to this definition, it is an iterative process with multiple stages. We can differentiate 4 of them:

A. Research the problem. This phase includes a study of the background of the topic and related work. It helps to understand the core of the field and the most commonly faced challenges. Thanks to it, we were able to gain a broader understanding and knowledge about CI/CD pipelines, the main challenges of keeping the state of a system, and the complexity of maintaining highly available relational databases.


B. Develop possible solutions and pick one. Thanks to the understanding and knowledge acquired during the previous phase, it was possible to design the standard pipeline and an experiment. This step was conducted to acquire a higher understanding of the challenges and to test whether the proposed solutions and requirements could be used.

C. Build a prototype and test it. This phase focuses on creating the implementation and conducting a set of experiments over it.

D. Improve: redesign as needed. In this phase, we analyze our experimental observations and make improvements to the design. Our focus was to design pipelines that allow a robust and secure method to do online database schema migration using an Agile development methodology.

Accordingly, we have defined the problem in chapter 1, the introduction, followed by the background of the thesis in chapter 2. Next comes a study of the related works, describing the advantages of this approach with respect to those previously studied, in chapter 3. Next, in chapter 5, we define and list the expected requirements for the prototype and the proposed design. In chapter 6, based on this design, we follow the design of the experiment and the implementation of the prototype. In chapter 7, based on the results of the first iteration of the experiment, we analyze and evaluate the feasibility and advantages with respect to alternative methods, and modify the pipeline and scenarios to adapt to the results. Finally, we present the discussion and conclusions of the research work, answering the research questions, in chapter 8.


5 Solution Design

As described in the previous chapters, previous research describes techniques to apply online schema migration in MySQL systems, but there is a lack of solutions that enable CI/CD pipelines for online schema migration in MySQL systems allowing an automatic migration process. This chapter focuses on designing a solution that enables automatic online schema migrations in highly available MySQL systems by using CI/CD pipelines.

This chapter describes a system and a pipeline. The system is a simple web application composed of a client and a service that utilizes MySQL as a datastore. This system is used as the base for the solution proposed in this chapter: a CI/CD pipeline that provides an automatic flow to apply developer schema changes to a MySQL system without downtime and that analyzes the changes and their effect on the current system to prevent derived incidents.

The purpose of this pipeline is to state standardized processes and mechanisms and to show some practical examples. The pipeline design has been created by combining the requirements derived from iterating this thesis's experiment, which tests this pipeline, with the requirements inherited from the zero-downtime system requirement. This chapter describes the setup for the experiment and the design requirements, and proposes a solution pipeline design for confronting the different challenges.

5.1 Environment Design

For the experiment, this environment is divided into two blocks: infrastructure and flows. Infrastructure defines the infrastructure components that are needed for the development life-cycle. On the other hand, the flows connect the different components and are part of the development life-cycle.

The infrastructure of the environment is composed of four essential components:

• Version Control System. The app code is stored and version controlled in a code repository using a version control system.

• Infrastructure. The place where the app is deployed and served to end users.

• CI/CD server. The CI/CD pipeline system that executes tasks intended to prepare the app to be deployed to the infrastructure. Specifically, it processes the pipeline designed for the project.

• Monitoring system. A system that tracks, analyses, and monitors the state and behavior of the app. When the code is deployed, the monitoring system watches the state of the app and the infrastructure.

On the other hand, these components are connected by two main flows:

• Code life cycle. Most commonly known as git flow. As explained by V. Driessen [42], it offers a standard process to implement new code and maintain the codebase. It defines the flows for a piece of code to go from development to ready for production.

• Deployment. It defines the flow to deploy new code. It starts with the developer pushing new code to the git repository and creating a Pull Request for the development branch. Once the code is in the git repository, it triggers the CI/CD server. The pipeline just builds the project and executes unit tests at this point. Once the pipeline finishes and a reviewer approves the changes, the new code is deployed to the infrastructure after going through the CI/CD server again. How it is deployed depends on the stage and the pipeline system.

5.2 Requirements

The requirements listed below have been gathered by considering the requirements derived from iterating the experiment along with the requirements inherited from the zero-downtime system requirement.


The systems targeted by this thesis are systems that can be updated without causing any downtime or data corruption for the end users, in particular when updating a schema on a MySQL system by applying DDL operations. Data corruption is understood as changes to the database that prevent the consumption of some data. The database migrations should have no impact on the availability of the system. For this reason, testing is considered an important element to achieve quality in the development life cycle. It helps to prevent defects and failures in the basic expected behavior of a system. As errors are a common pain point during MySQL database schema migrations, they can produce multiple effects that alter the normal behavior of the database, such as downtime or data loss. To prevent data loss, database schema migrations have been preceded by mechanisms to ensure the safety of the process. Commonly, the database stops offering service during maintenance, followed by a backup of the data. After the data is safe, the DBAs can proceed with the migration. If something goes wrong during the process, it is possible to revert the changes to the previous version. A strong advantage of making a backup before the migration is the possibility of partially testing the changes in production without affecting the users during the process. It allows a safer method to apply updates.

Data schema changes on systems with a zero-downtime requirement can be problematic and unsafe even after the changes are successfully applied. One of the reasons is the coexistence of multiple code versions of the app at the same time. They require the database schema to be compatible with both versions, also known as forward and backward compatibility. Otherwise, different attempts to execute create, read, update, or delete (CRUD) operations would fail and potentially corrupt the data state. For this reason, there are different approaches to this problem. The most accepted pattern is the expand and contract pattern, also known as Parallel Changes [43]. The expand and contract pattern provides a way to introduce breaking changes in a safe way, keeping compatibility with previous versions by splitting the change into smaller changes that keep the compatibility with the previous version. The pattern is well suited to contexts such as the one presented in this research work, where maintenance downtime is far from desired. Furthermore, the expand and contract pattern offers multiple advantages for data schema updates that preserve the state. It opens the possibility to isolate changes, facilitating the possibility to revert the changes by splitting them into small compatible ones. Therefore, the updates should be compatible with the immediate previous version and with the next one.
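As an illustration, the hypothetical column rename from the sketch in chapter 1 can be split into compatible steps with the expand and contract pattern (again, table and column names are illustrative only):

-- Expand: add the new column alongside the old one and backfill it.
-- Old application versions keep using fullname; the new version writes both.
ALTER TABLE users ADD COLUMN display_name VARCHAR(128) NULL;
UPDATE users SET display_name = fullname WHERE display_name IS NULL;

-- Contract: only once no deployed version reads or writes the old column,
-- remove it in a later, separate migration.
ALTER TABLE users DROP COLUMN fullname;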


Database schema changes are always risky and the migrations can go wrong, even more so when the updates are applied without downtime. This reduces the effectiveness of safety measures such as backups or scheduled maintenance times. In the case of backups, a backup is a snapshot of the database state at a specific moment. The problem with databases with a zero-downtime requirement is that the state is constantly changing, making every backup automatically outdated. This does not mean that the backup is no longer valid, but it is no longer possible to recover the full state; the state would only be partially recovered. On the other hand, in the case of scheduled maintenance times, a system with a zero-downtime requirement normally got this requirement due to the negative impact that downtime of the system produces. In that case, any maintenance period potentially produces a negative impact when errors are produced, although there are some time ranges where the impact is lower. Therefore, the updates to the database schema should be as minimal as possible. This reduces the impact of an error and gives more opportunities to correct it, since the state keeps being updated on both the old and the new schema.

When updating a database schema, it should be possible to restore a previous version of the schema after some erroneous behavior is detected. However, considering the limitations of using backups in online schema migrations due to the complex flows, the strategy would be specific to each use case.

In software development, especially when implementing DevOps principles, multiple stages become important to maintain an effective workflow. This allows having multiple levels of testing and cooperation between different teams. Each level tests different things, trying to cover a wide part of the app workflows and improve the app quality. By splitting the testing of different components across different environments, more teams can work simultaneously, offering a faster and more reliable workflow. Also, testing a wide part of the app in different ways offers more opportunities to catch bugs and errors, especially the ones that affect the data integrity of our databases.

The system should be able to cope with errors during the migration process without affecting the end user. When a bug is detected at this stage, it should be possible to control the errors and prevent them from propagating to the rest of the system and affecting the end user.

Finally, the design should offer an automated CI and CD flow; we consider that CI includes packing and testing, while CD includes deploying code and applying the migrations. Therefore, reducing the DBAs' manual intervention allows rapid and agile life cycles. Also, the design should allow following the progress of database schema migrations and their impact.

5.3 Expected Outputs

The main goal of this design is to show how to achieve continuous delivery of database schema changes in MySQL systems with a high availability requirement. It puts into practice the First Way of DevOps introduced by Gene Kim et al., by simplifying delivery processes and reducing update batch sizes [22]. Two outputs are then expected: reduced delivery time, and standardized review and delivery processes that reduce errors. The delivery time would be reduced due to the automated processes and the increased reliability of the changes.

5.4 Pipeline Structure

Figure 5.1: Pipeline stages

The migration flow is shown in Figure 5.1. It describes the flow that a schema migration change takes before the production environment is updated. This pipeline defines a generic structure that can be extended and modified to bring continuous delivery to highly available systems. The purpose of this pipeline is to define a structure that fulfills the previously stated requirements.


This pipeline considers three environments: New Feature, Staging, and Production. New Feature is a specific environment for the new changes, Staging is an environment that hosts a stable version of development, and Production is the production environment. This pipeline may be divided into two sub-flows: the pipeline in Figure 5.2 and the CD flow in Figure 5.3.

Figure 5.2: Pipeline stages

As shown in Figure 5.2, the pipeline has four stages: Build, Unit Tests, Deploy and DB Unit Tests. It follows the traditional pipeline stages commonly utilized for code integration and delivery. One of the differences is found in the cloning of those stages for the SQL code. In the first stage, Build, the goal is to ensure the correctness of the new code, so it does a syntax check. This stage can also include more checks to search for inconsistent or unsafe operations with a SQL linter. In the second stage, Unit Tests, the goal is to execute unit tests over the code. In the third stage, Deploy, the goal is to deploy the changes to the infrastructure environment. Finally, once the schema changes are applied to the database, it comes to the last stage, DB Unit Tests. In this stage, the goal is to execute database unit tests of the old version and the new version to ensure version compatibility and consistency of the data. The main reason why this stage comes after deploying the code, as shown in Figure 5.2, is to test forward and backward compatibility. All the pipeline stages are executed when a new feature is introduced, as shown in Figure 5.3. In other scenarios, only the Deploy stage will be present.
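As an illustration of what a check in the DB Unit Tests stage could look like, the following sketch reuses the hypothetical users schema from the expand and contract example in Section 5.2: after the migration, statements issued by both the old and the new application version must still succeed.

-- Old-version write: does not know about the new display_name column.
INSERT INTO users (fullname) VALUES ('old client');

-- New-version write: uses both the old and the new column.
INSERT INTO users (fullname, display_name) VALUES ('new client', 'new client');

-- Both versions must be able to read their data back from the migrated schema.
SELECT fullname     FROM users WHERE fullname = 'old client';
SELECT display_name FROM users WHERE display_name = 'new client';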

The pipeline involves three different components: the Code Repository, the System infrastructure, and the Pipeline system. Each of them has a different purpose, as described in previous sections.

Figure 5.3: Continuous delivery flows

As shown in Figure 5.3, the flow for updating production goes through multiple stages. New code would follow git flow [42] or a derived methodology. Each branch is linked to a different infrastructure environment and, as a consequence, to a different pipeline. When a developer pushes new code, a new branch is created, and after a merge request is created, a pipeline is triggered. If the pipeline succeeds, the updates are deployed to a new environment and the merge request is reviewed by the development team and the database administration team. The Deploy job for SQL applies the updates without considering the zero-downtime requirement. This requirement is considered for the Staging and Production environments, where the Deploy SQL job applies the updates online.

Once the changes in the merge request are accepted and merged into the stage branch, the changes are applied to Staging. SQL changes are applied to the database considering the zero-downtime requirement, acting as a test of the SQL delivery process. Finally, when the changes are ready for production, the pipeline executes the pipeline stages and the new version becomes available in Production.

5.5 Setup

To demonstrate the previously described design, this section lists the technologies and components used for the implementation of the experiment. Table 5.1 lists the main technologies used for the implementation.

Table 5.1: Main technologies used for the implementation

• AWS RDS MySQL (v5.7.19): Managed MySQL instance and cluster solution. It is used as the database server for this implementation.
• Gitlab (13.3.0): Git server where the code is hosted.
• Gitlab CI (13.3.0): CI/CD pipeline tool built into GitLab and integrated with the Git repositories by default.
• Kubernetes (v1.14): Orchestration system used to host the infrastructure environments of this implementation.
• Skeema (v1.4.2): SQL schema version manager.
• Golang (v1.14): Programming language used to implement the application.
• Ent (v0.2.1): Golang object-relational mapping (ORM) framework. It permits the management of the data stored in MySQL using objects. This framework is utilized by the application to simplify the SQL data management.
• Prometheus (v2.16.0): Systems monitoring and alerting toolkit with a time series database, commonly used for recording and monitoring real-time metrics. It is used to monitor the Kubernetes cluster and the service in real time in this implementation.
• Grafana (v6.7.3): A data visualization server that connects with real-time metric services. It is used to visualize the metrics captured by Prometheus and other services in this implementation.
• Gh-ost (v1.0.49): Online schema migration tool for MySQL. It provides online schema migration for the app with some powerful features, such as flexibility for testing the changes and controlling/pausing the progress of the changes.
• ProxySQL (v2.0.12): SQL proxy for high-performance interaction with SQL clusters. It provides an abstraction level between the app and a MySQL cluster, providing a unique endpoint to the MySQL cluster, distributing the load and caching data.


5.6 Implementation

In order to demonstrate the previously described design and conduct the experiments, this section explains the environment, the implementation, and its limitations.

The implementation covers the flows around the CI and CD pipelines to deploy new schema migrations to production. The migration validation process is shown in Algorithm 1.

Data: schema migration
Result: schema migration tested and validated

use staging;
if !syntax check(sql) then
    error;
end
if !code unit tests then
    error;
end
gh-ost deploy sql;
if !db unit tests then
    error;
end

Algorithm 1: Schema migration validation

This implementation is a proof of concept; it implements the design with some limitations derived from the chosen tooling.

The utilized technologies are listed above in Table 5.1. We can differentiate five different components:

• Application. The application, Items, is formed by a server and a client to offer items and associated users. The server is a basic concurrent CRUD server with a representational state transfer (REST) interface, commonly used for web services. It provides basic logging and usage metrics for monitoring and tracking the app behavior. Also, the client is a basic random data generator that reads and creates rows in the users and items tables.

• Database. For the sake of the experiment, this implementation is based on MySQL. This is due to multiple reasons. First, MySQL is one of the most popular relational database management systems (RDBMS); therefore, a huge community supports tools and language libraries that integrate with it. Secondly, online schema migrations are relatively new and their evolution and maturation are limited; consequently, most of the innovations have been mainly focused on providing compatibility with the MySQL and PostgreSQL RDBMSs. Besides, Gh-ost, the online migration tool selected for its verifiability, pausability, and flexibility, is exclusively available for MySQL.

Figure 5.4: Infrastructure architecture

For this component, there are multiple technologies in use:

1. Managed Database. To simplify the management of the cluster during the implementation, AWS Aurora (an AWS RDS cluster) for MySQL is used. However, there are some requirements. First, it requires the use of strict mode. By default, the AWS Aurora settings are quite permissive; most importantly, they add defaults to every column. As a consequence, even if an insert query is missing mandatory attributes, the query is accepted and default values are added to the missing columns. For the sake of the experiments, we test the worst-case scenario and avoid the permissive modes. Second, the binary log is required. Binary logs contain all the data or table changes, stated as events, and are required by Gh-ost to asynchronously copy the data of the table onto a ghost table.

2. Online Migration Tool. Due to the flexibility, verifiability, and pausability that Gh-ost offers, it is the online migration tool used for this implementation. Gh-ost was developed by GitHub and publicly released in 2016 [33]. It has many features that make it the best tool for this use case; in particular, three strong features allow accomplishing the requirements of this research work: it is testable, trustable, and pausable. Testable, since it permits testing the changes on a replica without affecting the master instance; trustable, since it allows throttling the changes and pausing them to avoid errors overnight or on weekends; and pausable, since the migration is executed asynchronously and does not affect the service availability.

For this implementation, every change is tested first on a replica before it is applied directly on the master node. To do this, some special settings are required for the target database cluster, AWS Aurora RDS (mainly the replication mode and other basic settings), as well as specific settings for limiting the speed of the migration, allowing it to adapt to the database capacity and load. In this implementation, these settings are permissive. A sketch of such a Gh-ost invocation is given after this list.

3. Schema Version Control. Most stateful production systems using a relational database require modifying database schemas over time. For the same reason, when the changes become regular, schema versioning also becomes necessary. For this implementation, Skeema was selected as the tool for schema version control due to its simplicity and stability. Instead of accumulating the schema history in different files (ending up with hundreds of them) and in a table on the database, it is based on individual files for each defined schema, leaving the versioning history to git. Then, it generates the update query by comparing the desired schema with the one found in the database.

The configuration is mainly the same across the different environments. However, Gh-ost is only used in Production and only for tables with a size above 1 GB; under 1 GB, the lock time is low enough for most operations.


• Repository. The implementation and the definition of the environment are hosted in a Git repository on GitLab.

• CI/CD. The CI/CD pipeline system is responsible for the continuous integration and delivery of the code from development to production. This implementation uses GitLab CI. The main reason is that it is integrated with the GitLab repositories, easily providing the necessary infrastructure for it. Also, the simplicity of its integration with Terraform is a strong advantage: with Terraform, we easily configure a Kubernetes cluster and environment variables in our GitLab repository.

• Infrastructure. For this implementation, the infrastructure is a Kubernetes cluster hosted in AWS, in addition to the databases, as shown in Figure 5.4. The databases are hosted in AWS RDS: a simple instance for the Test Stage and an Aurora cluster with a master and a replica for the Production environment.

The application is deployed in different ways for each of the stages. In the Test Stage, the entry point is an Nginx proxy connected to the items service, which is in turn connected to the AWS RDS database instance. In the Production Stage, we introduce a new component, a ProxySQL instance. It manages the connections to the instances of the Aurora cluster and splits the read/write operations between the instances, also providing a cache for common queries.

• Monitoring. For monitoring the whole system, two services are in use, Prometheus and Grafana. Prometheus is used to collect metrics from the different services, the Kubernetes system, and the test server instances in real time and to monitor these services. On the other hand, Grafana permits visualizing this monitoring through graphs and numbers. Also, due to some limitations in the monitoring of the MySQL cluster, Amazon CloudWatch Metrics is used for monitoring the AWS RDS cluster and is integrated with Grafana, as shown in Figure 5.4.
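Returning to the Application component, the following is a minimal sketch of how the users and items tables could be declared as Ent schemas in Go. The field names and the entgo.io import path are illustrative assumptions (the v0.2.1 release used an older module path), not the exact definitions of the implementation. Note that the item-to-user association is kept as a plain column rather than an Ent edge, since edges generate foreign keys and FKs are not supported in this setup (see Section 5.6.1).

```go
package schema

import (
	"entgo.io/ent"
	"entgo.io/ent/schema/field"
)

// User declares the schema of the users table managed by Ent.
type User struct {
	ent.Schema
}

// Fields lists the columns of the users table.
func (User) Fields() []ent.Field {
	return []ent.Field{
		field.String("name"),
	}
}

// Item declares the schema of the items table managed by Ent.
type Item struct {
	ent.Schema
}

// Fields lists the columns of the items table. The owning user is stored as a
// plain integer column instead of an Ent edge, because an edge would generate
// a foreign key, which this setup does not support.
func (Item) Fields() []ent.Field {
	return []ent.Field{
		field.String("name"),
		field.Int("owner_id"),
		field.Time("created_at"),
	}
}
```

Likewise, for the Online Migration Tool component, the sketch below shows how the pipeline could shell out to Gh-ost for a single ALTER on the items table. The host, credentials, environment variable names, table name, ALTER clause, and load thresholds are illustrative assumptions; the flags shown (--allow-on-master, --assume-rbr, --max-load, --chunk-size, --execute) are standard Gh-ost options, and the actual implementation tunes them to the capacity of the Aurora cluster.

```go
package main

import (
	"os"
	"os/exec"
)

// runGhost invokes Gh-ost for one schema change. The throttling flags limit
// the migration speed so it adapts to the database capacity and load.
func runGhost(alter string) error {
	cmd := exec.Command("gh-ost",
		"--host", os.Getenv("DB_HOST"),
		"--user", os.Getenv("DB_USER"),
		"--password", os.Getenv("DB_PASSWORD"),
		"--database", "items",
		"--table", "items",
		"--alter", alter,
		"--allow-on-master", // changes run directly on the Aurora writer in this setup
		"--assume-rbr",      // the binary log is already in row-based format
		"--max-load", "Threads_running=25", // throttle row copying when the server is loaded
		"--chunk-size", "1000",
		"--execute",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical change: the ALTER clause is passed without the ALTER TABLE prefix.
	if err := runGhost("ADD COLUMN description VARCHAR(255) NOT NULL DEFAULT ''"); err != nil {
		os.Exit(1)
	}
}
```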

5.6.1 Limitations

The experiment environment developed for the experiments of this research work has some known limitations.


Foreign keys (FKs) are not supported, and their use is problematic and risky in the presented setup. The main reasons can be split into two categories: performance at scale and online schema migration strategies. Regarding performance at scale, a common approach for scaling out a SQL database is sharding database tables: each table is partitioned, with the partitions on different servers. FKs are a common way for applications to maintain integrity instead of doing it on their own, but when using the sharding approach, FKs are not supported. Also, FKs add overhead on insert/delete. According to Schwartz et al. [44]:

Foreign keys aren’t free. They typically require the server to do a lookup in another table every time you change some data. Although InnoDB requires an index to make this operation faster, this doesn’t eliminate the impact of these checks. It can even result in a very large index with virtually zero selectivity. For example, suppose you have a status column in a huge table and you want to constrain the status to valid values, but there are only three such values. The extra index required can add significantly to the table’s total size—even if the column itself is small, and especially if the primary key is large—and is useless for anything but the foreign key checks.

Another reason is the online schema migration strategy. These strategies create a risk when foreign keys are present in a parent-child relationship [45]. The main problem is that foreign keys often rely on an auto-incremented id created for each row of the "parent" table. When a table is migrated with Gh-ost, the process is asynchronous and it is not possible to ensure the insert order in the ghost table. In the case of migrating the "parent" table, the ids would be randomly assigned and the data integrity would break. Migrating the "child" table is still possible without breaking the integrity, but tools such as Gh-ost block it because they do not support foreign keys.

Testing and migrating schema changes on the replica first is not possible using this setup. In AWS Aurora replication, the reader and writer are separated although the cluster shares the underlying storage layer. Gh-ost would always detect the instance as a master. For this reason, in this experiment, the changes are tested on the master instance.

Database unit tests do not test business logic. The main goal of these tests is to avoid incompatibilities and schema errors associated with the integrity of the data. However, every error introduced by the logic should be checked and tested in the code unit tests. Therefore, this kind of test is unreliable on its own and requires constant maintenance.

5.7 Summaries and Discussions

As was shown above, this thesis proposes a CI/CD pipeline design for automatic online schema migrations in MySQL systems with zero downtime. This solution analyzes the SQL changes, seeking common schema bugs that may produce downtime or data corruption. In addition, it tests the migration queries before the ghost table replaces the old one. However, the current implementation of this solution presents multiple limitations. Firstly, due to underlying tools such as Gh-ost (the online schema migration tool), the solution is only compatible with MySQL as the RDBMS. Secondly, FKs are not supported by this implementation; the main underlying reason is the previously explained risk produced when migrating a table with a parent-child relationship, given the chosen migration strategy. Thirdly, Gh-ost enables testing the SQL schema changes on a MySQL replica node before applying the changes to the master; unfortunately, this is not possible in this setup due to the use of AWS Aurora for managing the MySQL cluster and the replication model explained in Section 2.2.1. This solution offers CD and CDL for online schema migrations, impacting the DBA migration flow and, as a consequence, drastically reducing the required supervision time. This solution puts the stress on schema migration scenarios with a zero-downtime requirement, in contrast to previous research discussed in this thesis.


Experiment

In this chapter, we define the experiment conducted for this thesis. The purpose of this experiment is to provide data about the downtime and the data corruption produced by commonly faced schema bugs, in order to answer the Research Questions (RQs) presented in this thesis. The experiment has been iterated multiple times, adapting the design to cover the errors that were discovered during the process.

6.1 Research Questions

As presented in the first chapter, this thesis addresses three RQs.

RQ1: Is it possible to provide reliable automatic schema evolution on high availability MySQL systems by using the current online schema migration tools, strict testing, and SQL linters? This RQ refers to testing the strategy proposed in this thesis, which uses current tooling and strict testing to provide automatic schema evolution on high availability MySQL systems.

RQ2: Is it possible to prevent downtime during automatic schema evolution for MySQL systems with a zero-downtime requirement by detecting commonly faced schema bugs? This RQ explores the possibility of preventing downtime by detecting commonly faced schema bugs, thereby achieving automatic schema evolution on high availability MySQL systems without downtime.

RQ3: Is it possible to prevent data corruption during automatic schema evolution for MySQL systems with a zero-downtime requirement by detecting commonly faced schema bugs? This RQ explores the possibility of preventing data corruption by detecting commonly faced schema bugs, thereby achieving automatic schema evolution on high availability MySQL systems without corrupting data.

6.2 Metrics

For monitoring and analyzing the results of the experiment, we use multiple measures. These measures have the goal of answering the RQs. They are collected from right before the migration is applied until the migration is correctly reverted. The measures utilized are the following:

• Latency. The latency of the app REST API client requests per handler. In this thesis, latency is considered the time difference between when the web application service receives a new HTTP request and when this request is replied to. Processing an HTTP request includes the code execution in the service but, more importantly, connecting to the MySQL cluster. A minimal instrumentation sketch for this metric and the error metric is given after this list.

• Error. Error percentage per REST API server. It measures the error percentage over the total HTTP requests processed by the web application service. All the requests processed by our service are counted, and the ones with an HTTP status code in the 500 to 599 range are considered errors. As a REST-compliant service, when an error is produced while trying to process a valid request, a response with a standard code in the 500 to 599 range is returned.

• Traffic. Traffic load per REST API handler. It measures the total HTTP requests processed by the web app server.

• Database. It measures the percentage of CPU utilization in the database writer instance. This metric considers the total CPU capacity that is being used over the total available CPU capacity.
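To make the latency and error metrics concrete, the following is a minimal sketch of how the Go service could expose them with the Prometheus client library. The metric names, the handler name, and the listening port are illustrative assumptions, not necessarily those of the actual implementation.

```go
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Request latency per REST handler, in seconds.
	requestLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "items_request_duration_seconds",
			Help: "Latency of HTTP requests per handler.",
		},
		[]string{"handler"},
	)
	// Total requests per handler and HTTP status code; 5xx codes count as errors.
	requestTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "items_requests_total",
			Help: "Processed HTTP requests per handler and status code.",
		},
		[]string{"handler", "code"},
	)
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler and records its latency and status code.
func instrument(name string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)
		requestLatency.WithLabelValues(name).Observe(time.Since(start).Seconds())
		requestTotal.WithLabelValues(name, strconv.Itoa(rec.status)).Inc()
	})
}

func main() {
	prometheus.MustRegister(requestLatency, requestTotal)

	itemsHandler := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		// The real handler would query MySQL through the ORM here.
		w.Write([]byte("ok"))
	})

	http.Handle("/items", instrument("items", itemsHandler))
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	http.ListenAndServe(":8080", nil)
}
```

Prometheus then scrapes the /metrics endpoint, and the error percentage can be derived in Grafana as the share of requests whose code label falls in the 500 to 599 range.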

The main goal of these metrics is to answer the RQs. For answering RQ1, we require two different metrics: error and latency. The error metric allows us to measure the downtime, while the latency metric allows us to track the effectiveness of the migration while it is applied. As the presence or absence of downtime responds to RQ2, the error metric allows us to measure the downtime produced by the MySQL system. For answering RQ3, we require two different metrics: database and latency. The DDL changes presented in the scenarios
