
IT 13 016

Degree project 30 credits, March 2013

Optimizing Total Migration Time in Virtual Machine Live Migration

Erik Gustafsson

Institutionen för informationsteknologi



Abstract

Optimizing Total Migration Time in Virtual Machine Live Migration

Erik Gustafsson

The ability to migrate a virtual machine (VM) from one physical host to another is important in a number of cases, such as power management, on-line maintenance, and load balancing. The amount of memory used in VMs has been steadily increasing, up to several gigabytes. Consequently, the time to migrate machines, the total migration time, has been increasing. The aim of this thesis is to reduce the total migration time.

Previous work aimed at reducing the amount of time and disk space required for saving checkpoint images of virtual machines by excluding data from the memory that is duplicated on the disk of the VM. Other work aimed at reducing the time to restore a VM from a checkpoint by only loading a subset of data before resuming the VM and marking the other memory as invalid. These techniques have been adapted and applied to virtual machine live migration to reduce the total migration time. The implemented technique excludes sending duplicate data that exists on disk and resumes the VM before all memory has been loaded.

The proposed technique has been implemented for fully virtualized guests in Xen 4.1.

Results from a number of benchmarks demonstrate an average 44% reduction of the total migration time.

Examiner: Ivan Christoff. Subject reviewer: Philipp Rümmer. Supervisor: Bernhard Egger


Acknowledgements

I would sincerely like to thank Professor Bernhard Egger at Seoul National University for supervising me during this thesis.

I would also like to thank Professor Philipp Ruemmer at Uppsala University for reviewing my thesis.

I would also like to dedicate this thesis to my loving special someone, Aleksandra Oletic.


Contents

Acknowledgements
Contents
List of Figures
Acronyms

1 Introduction
   1.1 Overview
   1.2 Live Migration
   1.3 Motivation and Problem Definition
   1.4 Contributions
   1.5 Outline

2 Virtual Machine Monitors, Xen, and Memory Management
   2.1 Overview
   2.2 Paravirtualization
   2.3 Hardware Assisted Virtualization
   2.4 Memory Management and Page Tables
   2.5 The Page Cache

3 Related Work
   3.1 Introduction
   3.2 Performance Measurements
   3.3 Live Migration Methods and Techniques
      3.3.1 Iterative Pre-Copy
      3.3.2 Memory Compression of Pre-Copy
      3.3.3 Post-Copy
      3.3.4 System Trace and Replay
      3.3.5 SonicMigration with Paravirtualized Guests
      3.3.6 Discussion of Live Migration Techniques
   3.4 Checkpoint Methods and Techniques
      3.4.1 Efficiently Checkpointing a Virtual Machine
      3.4.2 Fast Restore of Checkpointed Memory using Working Set Estimation

4 Proposed Solution to Optimize Total Migration Time
   4.1 Page Cache Elimination
   4.2 Page Cache Data Loaded at the Destination Host
   4.3 Maintaining Consistency

5 Implementation Details
   5.1 Live Migration
   5.2 Pre-fetch Restore
      5.2.1 The Page Fault Handler
      5.2.2 Intercepting I/O to Maintain Consistency
      5.2.3 Optimizations

6 Results
   6.1 Experimental Setup
   6.2 Results
   6.3 Comparison with Other Methods
      6.3.1 Memory Compression
      6.3.2 Post-Copy
      6.3.3 Trace and Replay
      6.3.4 SonicMigration with Paravirtualized Guests

7 Conclusion and Future Work
   7.1 Conclusions
   7.2 Future Work

Bibliography


List of Figures

2.1 Structure of Xen
4.1 Network Topology
4.2 Time-line Overview
4.3 Simplified Architecture
4.4 Violation cases
5.1 Execution Trace over the Sector Accessed
6.1 Total migration time normalized to unmodified Xen
6.2 Downtime normalized to unmodified Xen
6.3 Total data transferred
6.4 Data sent over the network
6.5 Performance degradation normalized to unmodified Xen


Acronyms

dom0 domain 0.

domU user domain.

EPT Extended Page Tables.

HVM hardware virtual machine.

MFN machine frame number.

MMU memory management unit.

NAS Network Attached Storage.

NPT Nested Page Tables.

PFN page frame number.

PTE page table entry.

SPT shadow page table.

SSD solid-state drive.

VCPU virtual CPU.

VM virtual machine.


1 Introduction

1.1 Overview

In recent years, server virtualization has seen a steady increase of attention and popularity due to a multitude of factors. Virtualization [7] is the principle of providing a virtual interface for hardware. A virtual interface can create independence from the physical hardware by providing a layer of abstraction that accesses the hardware. A virtual machine [17] is a machine that runs in a virtualized environment as opposed to directly on hardware.

Decoupling the OS from the physical machine by virtualization has made possible techniques such as power management, on-line maintenance, and load balancing, all of which rely on the ability to move a virtual machine from one physical host to another.

These capabilities can be achieved because the state of the virtual machine, that is, the virtual CPUs (VCPUs), the memory, and any attached devices, can be recorded. The state can then be transferred to another physical host that provides the same virtual interface to the hardware, where the virtual machine can be resumed. The process of moving the state from one physical host to another is called migration.

Running several virtual machines on one physical server allows resources to be pooled to provide better power management [14]. Moving a virtual machine from one physical server to another enables cluster environments to perform on-line maintenance [13]. Load balancing can be achieved by dynamically moving and allocating virtual machines across a cluster of physical hosts [22].

Each of these techniques requires the ability to efficiently move a virtual machine between physical hosts; this process is called virtual machine migration.

1.2 Live Migration

Live migration builds upon the idea of migration and takes it a step further. The "live" in live migration pertains to the fact that the migration should be transparent to the users. As a consequence, the guest OS should be running during the migration. The downtime is the time from when the source host suspends execution of the virtual machine (VM) until the destination host resumes it. The VM should not be stopped for a considerable amount of time in order for it to be usable during the migration and thereby transparent to the users, who ideally are not aware of the migration occurring.

1.3 Motivation and Problem Definition

The aim of this thesis is to further reduce the total migration time of virtual machine live migration. The total migration time is the time from when the migration is initiated for a VM on the source host until the VM is resumed on the destination host. In order to be able to provide the capability of live migration, a low downtime is required. Downtime is the time during which the virtual machine is not responsive. Suspending the state of the VM includes pausing the VCPUs as well as other connected devices.

Extensive work has previously been done to reduce the downtime, but less work has focused on reducing the total migration time.

Total migration time can be very important in data centers since a reduction in total migration time improves load balancing, proactive fault tolerance, and power management capabilities.

1.4 Contributions

The contributions of this thesis are:

• A technique is presented that reduces the total migration time by sending only a critical subset of data through the network. We identify and correctly handle all scenarios that could lead to a corrupt memory image or disk of the VM during and after the migration.

• The proposed technique has been implemented in Xen 4.1 and various benchmarks have been conducted with fully-virtualized Linux guests.

• The results have been analyzed and compared to the performance of the original Xen implementation. The total migration time is reduced by 40.48 seconds on average, which corresponds to a 44% relative reduction.

1.5 Outline

The thesis is structured as follows. Chapter 1 introduces the thesis topic. Chapter 2 continues with an introduction to virtualization concepts and the tools used. Chapter 3 presents a discussion of related and previous work. Thereafter, Chapter 4 proposes a solution to the problems discussed. Chapter 5 presents a discussion of implementation details. Chapter 6 provides an in-depth analysis of the solution. Chapter 7 concludes the thesis and discusses future work.


2 Virtual Machine Monitors, Xen, and Memory Management

The following sections in this chapter discuss important background information about virtualization and the tools used, and give a brief overview of relevant computer architecture. A basic knowledge of computer architecture is assumed; see [17] for an introduction.

2.1 Overview

Xen is an open source bare metal hypervisor [1]. A hypervisor is a virtualization layer on top of the hardware that runs virtual machines. Bare metal pertains to the fact that the hypervisor runs directly on hardware and not on top of an existing OS, as for example VirtualBox or VMware Workstation do. Each of the virtual machines resides in a domain of its own, called a user domain (domU). The only domain allowed to access the hardware directly and start new guest machines is domain 0 (dom0), a special management domain in Xen. The hypervisor ensures the separation of the domains so that they cannot affect each other, for reasons of security and correctness. The structure can be seen in Figure 2.1. Xen provides two types of virtualization: hardware virtual machine (HVM) guests and paravirtualized guests.

Figure 2.1: Structure of Xen


2.2 Paravirtualization

A paravirtualized guest is aware that it is running virtually, as opposed to an HVM guest [4, 17]. In a virtualized environment the hypervisor must execute in ring 0 to ensure that the domains are completely separated, so that the different domains cannot affect each other. A ring is a protection mode that is allowed to execute with a certain privilege, where ring 0 has the highest privilege. The paravirtualized kernel of the guest is thus evicted from ring 0, where it usually resides, to ring 1. Thereby, the kernel cannot execute privileged instructions. A privileged instruction is an instruction that can only be performed in ring 0 and traps otherwise, according to Goldberg [16]. In order for the kernel to cope with the eviction, it is modified to call the hypervisor via hypercalls, which are similar to the privileged instructions. The changes are confined to the kernel, and thus the user applications do not need to be modified [5].

2.3 Hardware Assisted Virtualization

An HVM guest has no knowledge that it is running virtually as opposed to directly on hardware, and thus has a different set of requirements compared to a paravirtualized guest. There are methods to virtualize unmodified guests without special hardware, using techniques such as binary translation [18]. However, this approach was not considered in Xen due to the performance degradation it incurs [4]. Hardware extensions such as Intel VT [20] or AMD SVM [2] are instead required, since there are instructions in the x86 architecture that change the configuration of the system, by changing global registers or communicating with devices, without being privileged instructions.

Since operations which can have global effects existed prior to these hardware extensions, certain instructions affecting global state could be executed without being trapped by the hypervisor. Support for non-paravirtualized guests could therefore not be offered, because it would violate the isolation of domains. Due to the rising popularity of virtualization, hardware manufacturers introduced extensions to mitigate these problems. "Conceptually, they can be thought of as adding a 'ring -1' above ring 0, allowing the OS to stay where it expects to be and catching attempts to access the hardware directly" [5, p. 12]. The hypervisor executes in this 'ring -1' above ring 0, enabling the use of virtualized guests without kernel modifications in the guest itself.

There are performance differences between HVM and paravirtualized guests. Hardware is emulated for an HVM guest, which usually causes it to be slower than a paravirtualized guest. A hybrid approach can combine the hardware extensions with the modifications of the OS to reduce the performance degradation.


2.4 Memory Management and Page Tables

Memory management differs between HVM and paravirtualized guests. A paravirtualized guest interacts with the hypervisor directly, since the domU is not allowed to access hardware. It issues a hypercall to the hypervisor, which ensures that the guest's page frame numbers (PFNs) are correctly mapped to machine frame numbers (MFNs). An MFN is a real hardware address in memory, whereas a PFN is a virtual address which the guest can use as a real address.

For HVM guests, shadow page tables (SPTs) are page tables containing mappings from what the guest believes are MFNs, which are in fact PFNs, to the real hardware MFNs used by the hypervisor. These shadow page tables are used by the hypervisor and hidden from the HVM guest. This is achieved by the hypervisor marking the page tables of the HVM guest as read-only. Therefore, a page fault is generated whenever the guest tries to update its own page table entries. The hypervisor intercepts these page faults and keeps a mapping between the PFNs and MFNs. These operations can be quite expensive. Modern memory management units (MMUs) such as Intel's Extended Page Tables (EPT) and AMD's Nested Page Tables (NPT) mitigate this problem by allowing the guests to modify their own page tables and handle their own page faults. This is achieved by keeping a separate set of page tables, such as the EPT, translating the guest's pseudo-physical addresses to real hardware addresses. Shadow page tables are also used during live migration for the purpose of tracking which pages are modified [4, 5, p. 82].
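To make the shadow-table idea concrete, here is a minimal, self-contained C sketch; this is not Xen code, and all names and the toy PFN-to-MFN translation are invented for illustration. The guest's page-table writes are modeled as calls into a handler that stands in for the write-protection page fault, which mirrors each update into a hypervisor-only shadow table:

```c
/*
 * Minimal sketch, not Xen code: the guest edits what it believes is its
 * page table; because that table is write-protected, each update traps to
 * the hypervisor (modeled here as a plain function call), which mirrors
 * it into a shadow table holding the real translations.
 */
#include <stdio.h>

#define ENTRIES 4

static unsigned long guest_pt[ENTRIES];   /* guest-visible, holds PFNs   */
static unsigned long shadow_pt[ENTRIES];  /* hypervisor-only, holds MFNs */

/* Hypervisor-maintained PFN -> MFN translation (toy version). */
static unsigned long pfn_to_mfn(unsigned long pfn)
{
    return pfn + 100;
}

/* Stands in for the write-protection page fault handler. */
static void emulate_pt_write(int idx, unsigned long pfn)
{
    guest_pt[idx]  = pfn;              /* what the guest wanted to write */
    shadow_pt[idx] = pfn_to_mfn(pfn);  /* what the MMU will actually use */
}

int main(void)
{
    emulate_pt_write(0, 7);
    emulate_pt_write(1, 3);
    for (int i = 0; i < 2; i++)
        printf("slot %d: guest sees PFN %lu, hardware uses MFN %lu\n",
               i, guest_pt[i], shadow_pt[i]);
    return 0;
}
```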

2.5 The Page Cache

The page cache [17] is a cache kept by the kernel of the OS in main memory to buffer data that goes to and from the disk. It is used to facilitate quicker access to the disk, since accessing memory is orders of magnitude faster than accessing the disk. Write requests to disk as well as read requests from disk are cached by the OS in the page cache. Modern operating systems use a majority of the unused memory to buffer I/O requests to or from the disk. The page cache can also aid in speeding up write requests by caching the write request in memory and eventually flushing the page to disk. This speeds up the write because the application considers the write finished and can continue, even though the data may only be cached and written to disk later.
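This write-back behavior can be observed with standard POSIX calls. In the following minimal sketch (the file name is illustrative), write() completes as soon as the data is in the page cache, while fsync() blocks until the data actually reaches stable storage:

```c
/* Minimal POSIX sketch of page-cache write-back. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/pagecache-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *msg = "buffered in the page cache first\n";
    /* Returns as soon as the data is copied into the page cache. */
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

    /* Blocks until the cached pages are actually on stable storage. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```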


3 Related Work

3.1 Introduction

Live migration is the migration of a VM without stopping the execution of it for a considerable amount of time. Three phases can be identified in live migration [6].

• Push - In the push phase, the source virtual machine sends its data to the destination VM.

• Stop and copy - In the stop and copy phase, the source VM stops executing and transfers control to the destination VM.

• Pull - In the pull phase, the destination VM retrieves data from the source VM.

These three phases can be mapped to different methods of achieving live migration. Two commonly used techniques are called pre- and post-copy, respectively.

The naive approach to migrating a virtual machine is to stop the execution at the source, record its state, then move that state to another machine, and finally continue executing, as described above for the stop and copy phase. The problem that arises is the availability of the virtual machine. For a virtual machine that has 512 MB of RAM, this can result in an 8 second or 32 second downtime with bandwidths of 512 Mbit/s and 128 Mbit/s, respectively [6]. In the present day, VM memory sizes of 4 GB are not uncommon, which clearly exacerbates the problem of long migration times. Recording the state of a virtual machine includes recording the state of the memory, the VCPUs, and any attached devices. The majority of the time to migrate a VM is spent recording and transmitting the memory. The naive stop-and-copy approach quickly becomes infeasible, since ever larger amounts of RAM are used today, which increases the downtime and total migration time even more.
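As a quick sanity check of these figures, assuming a fully utilized link and ignoring protocol overhead: 512 MB corresponds to 4096 Mbit, and 4096 Mbit / 512 Mbit/s = 8 s, while 4096 Mbit / 128 Mbit/s = 32 s.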

3.2 Performance Measurements

There are two useful performance measures in terms of time: total migration time and downtime. Downtime refers to the time when the virtual machine is unresponsive. This is when the actual transfer of control from the source host to the destination host occurs, the time during which the VM is not responsive on either the source or the destination host. Total migration time is the time from when the process of migration begins on the source machine until the destination VM has control and the source can be discarded. In live migration these two times usually differ greatly. However, in cold migration, the naive version with only the stop and copy, they are equal. Cold migration does not involve any work before or after the transfer of control from one host to another; it does not involve the push or pull phase.

3.3 Live Migration Methods and Techniques

3.3.1 Iterative Pre-Copy

Pre-copy is the predominant algorithm used today in live migration [6]. It works by copying the memory of the virtual machine from the source and transferring it to the destination in a number of iterations, without stopping the VM on the source host. In the first iteration, the whole memory is copied from the source to the destination while the VM is still executing. In the subsequent iterations, the memory pages that have been modified during the previous iteration are resent. This process continues until the last iteration, when the source halts execution of the VM and the last modified memory pages are transferred to the destination machine, where the virtual machine is thereafter resumed. The pre-copy technique can thus be seen as a number of successive iterations of the push phase, ending with the stop and copy.

The last iteration occurs when one of three conditions is met: the maximum number of iterations has passed, the maximum amount of data has been sent over the network, or the amount of modified memory is sufficiently low. In contrast to the naive stop-and-copy approach, the pre-copy technique can reduce the downtime by a factor of four even with a single push iteration (sending the whole memory once and then doing a stop-and-copy) for a SPECweb benchmark. Pre-copy normally performs several iterations, reducing the downtime further compared to the naive approach. In an interactive environment such as a Quake 3 server, a large downtime can be detrimental to the players; with the pre-copy technique, the downtime can be as low as 60 ms with 6 players [6].

The amount of data transferred in the first pre-copy iteration is proportional to the amount of memory allocated to the guest OS. The amount in subsequent iterations is proportional to how fast the memory pages are modified [3].
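The iteration structure can be summarized with a small, self-contained C simulation. The halving of the dirty set, the thresholds, and all names below are invented stand-ins for Xen's real dirty-page log and termination conditions:

```c
/* Toy simulation of the iterative pre-copy loop (not the Xen code). */
#include <stdio.h>

#define PAGE_SIZE 4096
#define MAX_ITERATIONS 30
#define DIRTY_THRESHOLD 50   /* pages: small enough for stop-and-copy */

/* Toy model: each round finds fewer dirty pages (here simply halved),
 * standing in for the real dirty-page log. */
static unsigned long pages_dirtied_last_round(unsigned long prev)
{
    return prev / 2;
}

int main(void)
{
    unsigned long total_pages = 262144;   /* 1 GB of 4 KB pages  */
    unsigned long dirty = total_pages;    /* round 1: everything */
    unsigned long sent_bytes = 0;
    int iter = 0;

    while (iter < MAX_ITERATIONS && dirty > DIRTY_THRESHOLD) {
        sent_bytes += dirty * PAGE_SIZE;  /* push this round's pages */
        dirty = pages_dirtied_last_round(dirty);
        iter++;
    }

    /* Stop-and-copy: suspend, send the final dirty set, resume remotely. */
    sent_bytes += dirty * PAGE_SIZE;
    printf("iterations: %d, bytes sent: %lu\n", iter + 1, sent_bytes);
    return 0;
}
```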

3.3.2 Memory Compression of Pre-Copy

The pre-copy algorithm has been improved upon since its conception. Hai Jin et al. [10] developed an adaptive memory compression technique where the memory sent is first compressed at the source and decompressed at the destination machine. A 27% reduction in downtime, a 32% reduction in total migration time, and a 68% reduction in data transferred were achieved with this approach.

3.3.3 Post-Copy

The post-copy approach begins with the stop and copy phase and continues with the pull phase. In order for the destination machine to obtain all the data it requires, it continually retrieves it from the source. Several techniques can be used to retrieve the memory from the source, such as demand paging, active pushing, and adaptive pre-paging.

Demand paging is a technique in which a page fault occurs at the destination machine, which then requests the faulted page from the source. Active pushing is when the source machine actively pushes data to the destination without waiting for a page fault. These two techniques can be used simultaneously. Adaptive pre-paging is a concept borrowed from operating systems and works on the assumption of spatial locality: when a page fault occurs, the likelihood of a page fault in the same vicinity is increased, and thereby pages close to the faulted page are sent to the destination along with it [9]. Adaptive pre-paging continues to send pages in the nearby proximity until another page fault occurs.
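As a rough illustration of how demand paging and active pushing interleave, consider the following toy C model; all names and the fault sequence are invented, and a real implementation would pull pages over the network rather than flip flags:

```c
/* Toy model of post-copy residual-memory transfer: the destination pulls
 * faulted pages on demand while the source actively pushes the rest. */
#include <stdbool.h>
#include <stdio.h>

#define PAGES 16

static bool present[PAGES];   /* pages that have arrived at the destination */

static void pull_page(int pfn)            /* demand paging on a fault */
{
    if (!present[pfn]) {
        printf("page fault -> pulled page %d from source\n", pfn);
        present[pfn] = true;
    }
}

static void push_next_page(void)          /* source-driven active push */
{
    for (int i = 0; i < PAGES; i++) {
        if (!present[i]) {
            present[i] = true;
            printf("actively pushed page %d\n", i);
            return;
        }
    }
}

int main(void)
{
    int faults[] = { 5, 6, 2 };           /* pages the resumed VM touches */
    for (int i = 0; i < 3; i++) {
        pull_page(faults[i]);             /* demand-driven */
        push_next_page();                 /* background, runs concurrently */
    }
    return 0;
}
```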

Hines et al. [9] also propose a technique called dynamic self-ballooning, where a driver runs in the guest VM, continuously reclaiming free and unused memory and giving it back to the hypervisor. This requires a paravirtualized machine due to the communication between the guest driver and the hypervisor.

Using this technique during migration reduces the amount of memory to be sent over the network, since the reclaimed, unused memory is not transferred.

The migration time of post-copy is mostly bounded by the amount of memory allocated to the VM, since the memory is the bottleneck of saving and transferring the state [9]. As opposed to pre-copy, post-copy transfers each memory page only once.

3.3.4 System Trace and Replay

Haikun Liu et al. [12] studied a more unorthodox method based upon tracing and replaying events instead of transferring data. This method, called system trace and replay, starts by taking a checkpoint of the source VM. A checkpoint is a recorded state of the VM, usually saved to disk to be resumed later. System trace and replay instead transfers the checkpoint to the destination machine. Simultaneously, the source machine starts to record, or trace, non-deterministic events such as user input and time variables. These events are recorded in a log and subsequently sent to the destination. The log transfer happens in a number of iterations similar to that of pre-copy. Thereafter, the destination executes, or replays, from the checkpoint, and any non-deterministic event is read from the log. The replay mechanism is able to perform faster than the original trace of events; this is required for the destination to catch up with the source. Deterministic events do not need to be recorded, because the destination machine starts from a checkpoint and thus produces the same deterministic events afterwards. The cycle of tracing and replaying goes on until the destination machine has a sufficiently small log; the source then stops and copies the last of the log to the destination, where it is replayed, at which point the migration is complete.

A problem may afflict this technique when several virtual machines are running on the same physical host. If the source host has several VMs running which frequently require the CPU(s), then this method may in fact increase the total migration time. This happens because there is an overhead on the source host for recording the events; if other CPU-bound VMs are running simultaneously, the migration can be slower than with original Xen.

3.3.5 SonicMigration with Paravirtualized Guests

Koto et al. [11] presented a technique similar to the one in this thesis, called SonicMigration, where the page cache and the unused memory in a guest are not transferred to the destination. Instead, they invalidate the page cache and skip sending those pages. This only works for paravirtualized guests, since a fully virtualized guest has no knowledge that it is running in a virtualized environment and the hypervisor must be able to communicate with the guest kernel directly.

3.3.6 Discussion of Live Migration Techniques

The techniques presented here can all be evaluated by the two metrics of downtime and total migration time as discussed above. The system trace and replay method achieved an average 72.4% reduction in downtime compared to the pre-copy technique, which is currently the most frequently used one [12]. System trace and replay is one of the most promising techniques in terms of reducing downtime, although not total migration time. The technique has an 8% application performance overhead because of the tracing enabled during live migration at the source, which can increase the migration time; it reduces the total migration time by about 30%. The memory compression technique reduces the downtime and total migration time by 27% and 32%, respectively, by reducing the bandwidth required to transfer the data. The post-copy technique increases the downtime, due to implementation difficulties which can be improved upon, and reduces the total migration time by over 50% in certain cases. SonicMigration is able to reduce the total migration time by 68%.

(20)

In contrast to the pre-copy approach, the post-copy approach suffers from reduced reliability, since the destination host has only one part of the whole state. If the destination machine dies, the source is left with an inconsistent state, since the destination had the most recent state of the VM. If the source machine dies, then the destination cannot continue executing the VM with only a subset of the whole state. In the same fashion, the migration cannot be cancelled mid-way like pre-copy can. Another problem with post-copy is that it performs worse than pre-copy on read-intensive programs, since numerous page faults will occur. On the other hand, for write-intensive programs which modify the memory pages rapidly, post-copy performs better than pre-copy, because post-copy does not have to fetch much data from the source machine, whereas pre-copy keeps sending all the modified pages until a termination requirement has been met. A further problem with post-copy is that the current implementation increases the downtime compared to pre-copy [9].

3.4 Checkpoint Methods and Techniques

3.4.1 Efficiently Checkpointing a Virtual Machine

Checkpointing a virtual machine, or taking a snapshot as it is sometimes called, is to record the state of the running VM and save it to non-volatile storage such as disk. The checkpoint can later be used to resume the VM with the state it had before being checkpointed. The state of the VM includes, but is not limited to, the memory of the machine. Previous work has reduced the amount of memory saved by up to 81%, along with a 74% reduction in the time spent taking the checkpoint [15].

The performance enhancements are achieved by realizing that the page cache, a buffer in which the OS keeps data from disk cached in memory to facilitate quicker access to it, will contain the same information in memory as on disk. At the time of checkpointing a machine, the memory of the VM is saved to disk, and thus there will be replicated data on disk: the data in the memory snapshot as well as the original data on disk.

The unmodified pages in the page cache, the ones that are the same in memory as on disk, are discarded from the checkpoint by Park et al. [15]. When the VM is later resumed, it reads those pages from the original disk sectors, not from the checkpoint image.

To know which pages have duplicate data on disk, Park et al. track and intercept I/O requests from the VM to the disk. The PFN involved in a disk request is recorded together with the accessed sector on disk and stored in a map, pfn_to_sec. The PFNs included in pfn_to_sec are marked as read-only in their page table entries (PTEs). If the VM modifies such a page, by writing to it, a page fault is triggered. The page fault is handled by the hypervisor, which deletes the corresponding mapping in the pfn_to_sec map, re-grants the PFN write access, and resumes the VM.
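The tracking mechanism can be sketched as follows; this is an illustrative C model, not the code of Park et al., with the write-protection page fault reduced to a plain function call:

```c
/* Illustrative sketch of duplicate-page tracking via a pfn_to_sec map. */
#include <stdio.h>

#define FRAMES 8
#define NO_SECTOR (-1L)

static long pfn_to_sec[FRAMES];          /* PFN -> disk sector, or NO_SECTOR */

/* Called when the VM writes page `pfn` to disk sector `sec`: the page and
 * the sector now hold identical data, so the page can be skipped later.
 * The real system would also mark the page read-only here. */
static void on_disk_write(int pfn, long sec)
{
    pfn_to_sec[pfn] = sec;
}

/* Stands in for the write-protection page fault: the page no longer
 * matches the disk, so drop it from the map and restore write access. */
static void on_page_modified(int pfn)
{
    pfn_to_sec[pfn] = NO_SECTOR;
}

int main(void)
{
    for (int i = 0; i < FRAMES; i++) pfn_to_sec[i] = NO_SECTOR;

    on_disk_write(3, 1024);              /* page 3 flushed to sector 1024 */
    on_page_modified(3);                 /* guest dirties page 3 again    */

    printf("page 3 duplicated on disk: %s\n",
           pfn_to_sec[3] == NO_SECTOR ? "no" : "yes");
    return 0;
}
```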

A drawback to this approach of efficient checkpointing is the restoration time. Since the checkpoint image does not contain all the required data, the rest must be loaded from disk. This makes the restoration slower, because the data is scattered over the disk. Reading a contiguous checkpoint image file is faster than reading scattered data, because the disk incurs large seek overhead in the latter case.

3.4.2 Fast Restore of Checkpointed Memory using Working Set Estimation

Zhang et al. [21] worked on improving the restoration time of virtual machines. Restoring a virtual machine includes restoring the state of the VCPUs as well as the memory and any devices. Zhang et al. load only a subset of all memory before resuming the VM, because loading the memory is the major part of the restoration time. The subset consists of the working set of memory pages used before the memory was saved to a checkpoint image on disk. Restoring without the working set performs poorly, since it incurs numerous page faults; the working set is therefore used to minimize the subsequent degradation. The working set is measured by constantly reading and resetting the access bit in the guest's page table entries. The result is saved in a bitmap, which is later used to restore only the most recently used pages. This approach achieves an 89% reduction in restoration time for certain workloads.
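A minimal sketch of this estimation loop, with the PTE accessed bits modeled as a plain boolean array (illustrative names, not the authors' code):

```c
/* Toy working-set estimation via periodically scanned access bits. */
#include <stdbool.h>
#include <stdio.h>

#define PAGES 16

static bool access_bit[PAGES];     /* stands in for the PTE accessed bit */
static bool working_set[PAGES];    /* bitmap saved with the checkpoint   */

static void scan_and_reset(void)
{
    for (int i = 0; i < PAGES; i++) {
        if (access_bit[i]) {
            working_set[i] = true; /* page was touched since last scan */
            access_bit[i] = false; /* reset so the next scan is fresh  */
        }
    }
}

int main(void)
{
    access_bit[1] = access_bit[7] = true;   /* simulated guest activity */
    scan_and_reset();
    access_bit[7] = true;                   /* page 7 touched again     */
    scan_and_reset();

    printf("load before resume:");
    for (int i = 0; i < PAGES; i++)
        if (working_set[i]) printf(" %d", i);
    printf("\n");
    return 0;
}
```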


4 Proposed Solution to Optimize Total Migration Time

The solution in this paper is briefly described in this paragraph; a longer, more detailed version follows in the next sections. The overall design of the solution is to not send the data in memory which is duplicated on disk. Instead, this data is read at the destination directly from disk after the migration has finished. The proposed solution relies upon the fact that the source and the destination machines are attached to the same secondary storage in the form of Network Attached Storage (NAS); this does not pose a problem, since some type of network attached storage has to be used to do a live migration in Xen anyway [19]. Data is not read from disk during the migration, to ensure correctness. The data yet to be loaded into memory is marked invalid to preserve the correctness of the VM once it has resumed on the destination. This is a combination of the pre-copy algorithm used in Xen and modified versions of the techniques for efficiently restoring virtual machines from checkpoints described in Section 3.4.1 and Section 3.4.2, respectively.

4.1 Page Cache Elimination

The proposed method to handle the increasing migration times with larger memory sizes involves several of the techniques previously mentioned in Section 3.4 and exploits the page cache. The page cache is a cache provided by the OS to facilitate fast access to the disk: the OS caches recently written pages in memory, in addition to data read from the disk, in order to reduce the latency of disk accesses.

The proposed method of reducing the total migration time is to use the efficient checkpointing described in Section 3.4.1 and apply it to live migration. The unmodified pages are read from disk at the destination; the modified pages are sent over the network. The topology of the design can be seen in Figure 4.1.

Figure 4.1: Network Topology

This entails a solution where the source machine records the unmodified page cache pages with their corresponding disk sectors in a data structure called the pfn_to_sec map, in order for the restoration process on the destination side to know which PFNs have their corresponding data on disk.

If a PFN is in the pfn_to_sec map during migration of a VM, that page is not included among the pages sent over the network. Instead, the pfn_to_sec map itself is sent. The pfn_to_sec map stores a mapping between a PFN and a disk sector; hence it is orders of magnitude smaller than the memory.

Sending the pfn_to_sec map in the last iteration might increase the downtime due to the additional data sent. Therefore, the pfn_to_sec map is sent iteratively to the destination, in the same fashion as pre-copy does with the memory, in order to keep the downtime as low as possible. In the first iteration, the whole pfn_to_sec map is sent. In the subsequent iterations, only the changes to the pfn_to_sec map are sent.
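The iterative map transfer can be sketched as a simple full-then-delta protocol. In this illustrative C model (all names and sizes invented), the first round sends every entry and later rounds send only entries that differ from what the destination already holds:

```c
/* Toy full-then-delta transfer of a pfn_to_sec map. */
#include <stdio.h>
#include <string.h>

#define FRAMES 8

static long current_map[FRAMES];
static long sent_map[FRAMES];      /* what the destination already has */

static void send_entry(int pfn, long sec)
{
    printf("send pfn %d -> sector %ld\n", pfn, sec);
}

static void send_map_iteration(int first)
{
    for (int i = 0; i < FRAMES; i++) {
        if (first || current_map[i] != sent_map[i])
            send_entry(i, current_map[i]);
    }
    memcpy(sent_map, current_map, sizeof(sent_map));
}

int main(void)
{
    for (int i = 0; i < FRAMES; i++) current_map[i] = 100 + i;

    send_map_iteration(1);     /* iteration 1: full map                     */
    current_map[2] = -1;       /* page 2 dirtied: its mapping is dropped    */
    send_map_iteration(0);     /* iteration 2: only the changed entry goes  */
    return 0;
}
```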

One possible design is to load pages from disk during the network transfer. This introduces a problem when the page cache is sufficiently large that loading the data from disk takes longer than the network transfer. In that case the VM has to be paused at the destination until all data has been read from disk. This might drastically increase the downtime, which is a substantial problem for interactive services. Park et al. measured that the page cache can make up over 90% of the memory used in a computer in extreme cases [15], which may in turn result in unsatisfactory downtimes. To solve this potential problem, the guest should be started while pages are loaded from disk in the background, a technique here called pre-fetch.

Cache Consistency During Live Migration

There are several problems that occur during live migration, for the proposed technique, due to asynchronous I/O and caches. The source may delay and cache write requests to disk, since writing to disk can be asynchronous. The destination machine may also cache reads from the NAS. These problems can cause race conditions. If the source writes from PFN p to sector s and records that in the pfn_to_sec map, and the write request is delayed, then the destination may read invalid data from sector s; this is more of an implementation issue, but an issue nonetheless. Additionally, if the destination machine has previously read and cached sector s, which is then written to by the source, the destination will read the wrong data from its cache instead of reading directly from the disk, which is more of an inherent problem of this approach.

Caching of reads from the NAS at the destination host should thus be turned off once the migration is started, and only for the affected background process; any other process or domain should not suffer degraded performance. Direct I/O, which bypasses any client cache, is not used, since not all kernels support direct I/O (the O_DIRECT flag) in conjunction with the distributed file system used.

Figure 4.2: Time-line Overview

Due to the complicated nature of this problem, instead of turning off caching in real time on a per-domain basis at any level, another solution is better suited, namely avoiding simultaneous reads and writes to the disk. This solution implies that the destination machine does not start reading data from disk until the VM at the source has shut down. All outstanding writes are flushed to disk when the VM is suspended and moved from the source. After this, the destination may start reading data from disk, so that no pages are read during the network transfer.

There are other methods that could be used, such as cache-coherent distributed file systems or direct I/O to the network attached storage. Cache-coherent distributed file systems are not used due to the overhead of connecting different machines and the difficulty of measuring the performance degradation.

4.2 Page Cache Data Loaded at the Destination Host

The aforementioned problem of potentially increased downtime when the page cache is too large is remedied by resuming the VM once the network transfer is completed, in a similar fashion to what Zhang et al. accomplished with efficient restoration of checkpoints [21]. Pages are loaded in the background while the machine is running. To keep the VM in a consistent state, the pages that have not been loaded are marked as invalid in their corresponding PTEs. A background fetch process, called pre-fetch, is initialized with all pages to load, namely the pages in the pfn_to_sec map. The running VM may try to access a page yet to be loaded; in this case, a page fault is raised. The page fault handler in the hypervisor intercepts it and forwards a request to the background fetch process, which immediately loads the data from disk and responds to the hypervisor. The hypervisor itself cannot load the page, since it is required to service other guests on the same system and cannot be blocked by an I/O request. Figure 4.2 illustrates the design of the proposed solution; the total migration time and the downtime are also depicted. The architecture of the resumed VM is depicted in Figure 4.3. To maintain consistency of the disk and memory of the VM, all disk I/O needs to be intercepted, as discussed in the next section and illustrated in Figure 4.3.

Figure 4.3: Simplified Architecture

4.3 Maintaining Consistency

The following cases may occur depending on the way the hypervisor virtualizes the disk. Disk requests do not go directly to the disk in Xen, since that would violate the isolation of domains; they are instead handled by an emulated disk. This emulated disk runs with the privileges of a super user on the host machine and does not use the same virtualized page tables as the guest. Since the guest can simply pass a reference to the memory address from where the I/O request originates, the virtualized disk may access data which is marked as invalid in the guest. Thus, a page fault is not triggered by the emulated disk, and because of these interactions the following problems may arise. To detect whether this may occur, all I/O requests from the guest are intercepted and compared with the background process' queue, which contains an up-to-date mapping of which pages are still to be loaded. The pages that have not been loaded are the ones that can cause corruption of the memory or disk of the VM, as discussed in detail in this section.

Read/write disk race. This situation occurs when the VM writes data to a disk sector which has yet to be read by the background process. If the write request is processed and written to disk before the background process reads the sector, the background process will load the modified data from disk into the memory of the VM; the data on disk has been changed behind the process' back. The solution to this problem is to delay the write request from the guest until the original data on disk has been loaded by the background process. Case a depicts this scenario in Figure 4.4.

Read disk race. The read disk race occurs when the VM reads data into a page which has not been loaded by the background process. If the background process later loads data into such a page, it will overwrite the newer data with old data, corrupting the memory. This may occur when memory inside the VM is used for several read requests from disk, or when newly allocated memory is not accessed before a read request; old page cache data may, for instance, be overwritten by newer I/O requests. Case b depicts this scenario in Figure 4.4.

In case the read request is the size of a whole memory page, the page can simply be dropped from the background process queue. If, on the other hand, the read request only spans a subset of the page, the read request has to be delayed until the original data is loaded by the background process. This is depicted as case b in Figure 4.4.

Writing data to disk from a page that has not yet been loaded. In case the VM issues a write request from a page that has not been loaded, it will write invalid data to disk, thereby corrupting the disk. If such a write is about to occur, it has to be intercepted and deferred until the correct data is loaded into memory. Case c depicts this scenario in Figure 4.4.

Figure 4.4: Violation cases


5 Implementation Details

The proposed technique has been implemented in the Xen hypervisor version 4.1.2 for HVM guests. This chapter discusses the details of the implementation.

5.1 Live Migration

When a user requests Xen to migrate a guest, a migration process is started. In the original Xen implementation, all pages are sent over the network; in this implementation, only the non-duplicated pages are sent. In the first iteration of the pre-copy algorithm, the pfn_to_sec map is transmitted over the network. In the subsequent iterations, only the changes in the pfn_to_sec map are sent.

Once the VM resumes at the destination, the pages in the pfn_to_sec map have all their access permissions (read, write, and execute) removed from their corresponding page table entries (PTEs) in the guest's virtualized page table, the Extended Page Tables (EPT) or Nested Page Tables (NPT).

5.2 Pre-fetch Restore

Once the VM is running, the pre-fetch restoration process sets up a communication channel with the hypervisor. An asynchronous ring buffer is used for the communication from the hypervisor to the pre-fetch restoration process. The hypervisor uses asynchronous communication since it has to be available to other guest domains. However, the restoration process performs hypercalls to get direct access to the hypervisor. This communication channel is set up to deliver page fault requests to the background restoration process.

5.2.1 The Page Fault Handler

The Xen hypervisor receives page faults that have been triggered by a guest violating the access permissions of a memory page. The page faults caused by the guest accessing a page which has not yet been loaded are easily distinguished from other types of page faults, by checking whether the access permissions were removed by the pre-fetch restore process. These page faults cause the hypervisor to pause the VCPU of the affected domain and forward the request to the background restoration process. The restoration process uses two array-based queues, one sorted by PFN and one sorted by disk sector, to facilitate quick access by both keys.

The page faults are indexed by PFN, whereas the background process needs a queue sorted by sector number in ascending order to read sequential sectors. The queue indexed by PFN is updated with a flag when a page is loaded from disk, to check off which sectors have already been loaded.

The background restoration process continually loads batches of pages from disk to reduce the number of page faults. After each of these load operations, the ring is checked to see whether a page fault has occurred. Page fault requests are processed immediately upon reception; a subsequent hypercall notifies the hypervisor that the page is loaded and resumes the VM. It is possible that the background restoration process receives page fault requests for pages recently loaded. On such occasions, the hypervisor is notified immediately, and it resumes the VM.
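The restore loop just described can be sketched as follows; this is an illustrative, self-contained C model in which the fault ring is reduced to a polling function and disk reads to array updates:

```c
/* Toy model of the pre-fetch restore loop: batched sequential loads with
 * fault servicing between batches. Not the thesis implementation. */
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_LEN 12
#define BATCH 4

static long queue[QUEUE_LEN];      /* sectors to load, sorted ascending */
static bool loaded[QUEUE_LEN];

static void load_sector(int idx)
{
    if (!loaded[idx]) {
        loaded[idx] = true;
        printf("loaded sector %ld\n", queue[idx]);
    }
}

/* Stands in for checking the ring buffer shared with the hypervisor;
 * returns the queue index of a faulted page, or -1 if the ring is empty. */
static int poll_fault_ring(int step)
{
    return step == 1 ? 9 : -1;     /* simulate one fault mid-restore */
}

int main(void)
{
    for (int i = 0; i < QUEUE_LEN; i++) queue[i] = 1000 + 8 * i;

    for (int pos = 0; pos < QUEUE_LEN; pos += BATCH) {
        for (int i = pos; i < pos + BATCH && i < QUEUE_LEN; i++)
            load_sector(i);                 /* sequential batch */

        int fault = poll_fault_ring(pos / BATCH);
        if (fault >= 0) {
            load_sector(fault);             /* service fault immediately */
            printf("notified hypervisor, VCPU resumed\n");
        }
    }
    return 0;
}
```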

5.2.2 Intercepting I/O to Maintain Consistency

In order to maintain consistency, all I/O requests from the VM have to be intercepted, as previously discussed in Section 4.3. All I/O requests issued from an HVM guest pass through ioemu, a modified version of qemu, which is used in Xen to emulate hardware for HVM guests. The emulated disk located in ioemu has been modified to detect the situations that may lead to a corruption of the memory or disk of the VM and to properly deal with the cases which would violate consistency. Ioemu needs access to the queue maintained by the restoration process to ensure consistency; a shared memory segment between ioemu and the restoration process facilitates quick lookups in the queues.

Read/write disk race. To detect a read/write disk race from a page p to a disk sector s, ioemu checks the restoration process queue for sector s. If such a sector is found in an entry, ioemu sets a flag in the shared memory segment requesting a load of that sector. After loading the data into memory, the background process sends a message back to ioemu, which then executes the write request.

Read disk race. To detect a read disk race from a sector s into a PFN p, ioemu searches the pre-fetch queue of the restoration process for an entry with the same PFN p. If such an entry exists, ioemu deletes the entry from the queue if the data read from disk is as large as the memory page. If, on the other hand, the read from disk only spans a subset of a page in memory, then ioemu signals the restoration process to load the page and, once it is loaded, continues with the request.


Modern operating systems perform the majority of disk I/O at page granularity. I/O requests at sub-page granularity are not tracked by Park et al., since tracking them adds complexity while providing only a modest reduction in the checkpoint image size; the same approach is followed in this paper. Thus, reads from disk that span only a subset of a page are a special case, since a subsequent load by the restoration process would overwrite the data which ioemu loads into the memory page. The restoration process is notified about such requests and loads the data immediately, and the requests are delayed until the data is loaded.

Writing data to disk from a page that has not yet been loaded. In this case, the VM requests to write data from PFN p to disk sector s. Ioemu searches the restoration process' queue for an element with PFN p; if it finds such an element, it notifies the background process, which immediately loads the page from disk. However, ioemu has already copied the data from the memory of the VM into its own I/O buffers, which now contain invalid data. The data of the requested page is therefore sent along with the confirmation message back to ioemu, which copies the data into the I/O buffers before executing the write request.
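The checks for cases b and c can be condensed into the following illustrative C sketch (not the thesis code; the shared queue is modeled as a toy in-memory table, and case a's sector-side check is noted in a comment):

```c
/* Illustrative consistency checks for intercepted disk I/O. */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define FRAMES 8

/* Toy state standing in for the shared pre-fetch queue. */
static bool pfn_unloaded[FRAMES] = { [2] = true, [5] = true };

static bool pfn_pending(long pfn) { return pfn_unloaded[pfn]; }

static void load_pfn_now(long pfn)         /* signal the background process */
{
    pfn_unloaded[pfn] = false;
    printf("background process loaded page %ld\n", pfn);
}

static void drop_from_queue(long pfn)      /* whole page will be replaced */
{
    pfn_unloaded[pfn] = false;
    printf("dropped page %ld from the pre-fetch queue\n", pfn);
}

/* VM writes from page `pfn` to disk: case c. (The read/write race, case a,
 * would additionally check whether the target *sector* is still unread.) */
static void intercept_write(long pfn)
{
    if (pfn_pending(pfn))
        load_pfn_now(pfn);   /* fill the page before its contents hit disk */
    printf("write from page %ld executed\n", pfn);
}

/* VM reads `len` bytes from disk into page `pfn`: case b. */
static void intercept_read(long pfn, int len)
{
    if (pfn_pending(pfn)) {
        if (len == PAGE_SIZE)
            drop_from_queue(pfn);  /* the whole page is overwritten anyway */
        else
            load_pfn_now(pfn);     /* partial read: original data first */
    }
    printf("read into page %ld executed\n", pfn);
}

int main(void)
{
    intercept_write(2);               /* case c */
    intercept_read(5, PAGE_SIZE);     /* case b, full page */
    intercept_read(5, 512);           /* already loaded: passes through */
    return 0;
}
```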

5.2.3 Optimizations

Early tests showed that the number of seek operations can drastically affect the performance of the disk. Data is thus loaded in ascending order on disk to minimize the number of seeks. To achieve good performance during the pre-copy iterations as well as after restarting the VM on the destination, it is important to minimize the number of page faults as well as the number of disk I/O operations caused by the background process. Tests have shown that the performance of the VM is substantially better when the restoration process' queue is sorted by disk block number. This may seem counter-intuitive, since the principle of spatial locality is usually applied to memory rather than to the disk, but there are two reasons why it performs better. Firstly, it reduces the number of seek operations on the disk as well as the number of system calls to the kernel. Secondly, consecutive entries on disk may be loaded with one I/O operation, as opposed to as many I/O operations as the number of pages, thereby increasing the throughput of the disk. In certain cases there are in fact so many consecutive entries in the queue that a limit has to be placed on the maximum number to load at once. Loading too many would impose a large latency on page faults, and the disk requests originating from the VM itself would also be prolonged. The increased I/O latency in the VM would only be noticeable to the users of the guest, which is difficult to measure, so a trade-off has to be made between user satisfaction and the restoration time, the time it takes for all pages to be loaded. Loading too few would issue too many operations to the disk, which performs poorly because of the seek latency imposed by physically moving the disk head.

To further improve performance, an adaptive pre-paging algorithm which aims at reducing the number of page faults has been implemented. The adaptive pre-paging works as follows: when a page fault occurs and is received by the background process, data on the disk near the faulted sector is loaded together with it. A maximal number of consecutive blocks, max, is read if there is any unloaded data around the faulting sector s. The interval looked within is (s − max, s + max), where the optimal one is (s − max/2, s + max/2). After the page fault has been serviced, the pre-paging algorithm continues to read from the last read disk sector, to minimize the seek overhead and the number of page faults by exploiting spatial locality. Based upon early tests, including visual inspection, usability tests, and minimizing the number of page faults, the max variable was chosen to be 8. Hence, a maximum of 8 consecutive pages are loaded from disk in one I/O request.

Figure 5.1: Execution Trace over the Sector Accessed

Figure 5.1 depicts how the pre-paging algorithm works by showing an execution trace of a guest that is compressing a large file. The x-axis depicts the clock cycles until all pages have been loaded, and the y-axis shows the sectors loaded from disk. The blue crosses show pages loaded by pre-fetching, and the red ones show pages loaded due to a page fault. The background process starts slightly before the machine has booted, and therefore no page faults (red crosses) appear immediately. The figure also illustrates that page faults often occur in close proximity to each other, although this also depends on the benchmark.
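A minimal sketch of the windowed batch read, under the assumption of the max = 8 setting above (illustrative names; the real implementation reads the contiguous sectors in a single I/O request):

```c
/* Toy adaptive pre-paging: a fault on sector s triggers one batched read
 * of up to MAX_BATCH consecutive blocks in a window centred near s. */
#include <stdbool.h>
#include <stdio.h>

#define SECTORS 64
#define MAX_BATCH 8            /* value chosen in the thesis */

static bool loaded[SECTORS];

/* Service a fault on sector s: read the window around s, skipping
 * sectors that are already loaded. */
static void service_fault(int s)
{
    int lo = s - MAX_BATCH / 2;
    if (lo < 0) lo = 0;
    int hi = lo + MAX_BATCH - 1;
    if (hi >= SECTORS) hi = SECTORS - 1;

    printf("fault at sector %d -> batched read of %d..%d:", s, lo, hi);
    for (int i = lo; i <= hi; i++) {
        if (!loaded[i]) {
            loaded[i] = true;  /* one contiguous I/O in the real system */
            printf(" %d", i);
        }
    }
    printf("\n");
}

int main(void)
{
    loaded[30] = true;          /* some data already pre-fetched */
    service_fault(32);          /* loads 28..35 except 30        */
    service_fault(33);          /* mostly loaded already         */
    return 0;
}
```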

The disk optimizations used in this paper are aimed toward hard disks, since those are still the most commonly found non-volatile storage medium in computers today. However, solid-state drives (SSDs) may require different optimizations, since SSDs have no seek overhead but provide constant-time random access. Different strategies may be explored here.


6 Results

6.1 Experimental Setup

The proposed technique has been implemented in the Xen hypervisor version 4.1.2 on machines with an Intel(R) Core(TM) i5-2500 CPU @ 3.30 GHz and 16 GiB RAM, and a QNAP NAS server. The two hosts, the source and the destination, are connected via Gigabit Ethernet. dom0 runs Ubuntu Server 12.04.1 LTS.

The guests have been configured to run Ubuntu 10.04 with 1 GB of RAM and 1 VCPU.

A series of experiments has been conducted with varied usage patterns. Each experiment has been run with the default adaptive rate-limiting algorithm used in Xen: the transfer rate starts at 100 Mbit/s and increases by 50 Mbit/s after each iteration, up to a maximum limit of 500 Mbit/s (that is, the rate in iteration i is min(100 + 50i, 500) Mbit/s). This algorithm is used to minimize the congestion of the network between the hosts. The results are compared to unmodified Xen. After each experiment, the page cache was dropped on each of the hosts so that nothing remained cached from a previous run. There are a number of interesting measures for evaluating the performance of the proposed technique:

• The downtime of the VM

• The total migration time of the VM

• The degradation incurred for the VM once it has started.

Table 6.1 lists the benchmarks used to evaluate this work. The user session benchmarks include the desktop and movie benchmarks, which lack a benchmark time since there is no point at which they finish. The desktop benchmark represents a long-running user session. The movie benchmark plays a 700 MB movie until almost the end, at which point the VM is migrated, in order to have a relatively large number of pages in the page cache. The movie benchmark is similar to a desktop session, although the movie keeps an active working set while playing, whereas a normal user session is more or less idle. Gzip compresses a large file of about 500 MB; this benchmark is both I/O- and CPU-intensive. The Make benchmark compiles the Linux 2.6.32.60 kernel and represents a CPU-intensive benchmark. Copying a file and running Postmark are I/O-intensive loads.


Benchmark   Description
Movie       Played a movie of about 700 MB.
Desktop     Four Libre Office Writer documents, each with a file size of
            about 10 MB; Firefox opened with 4 tabs; music streamed.
            Essentially a long-running VM user session.
Gzip        Gzip a file of 512 MB of random data.
Make        Linux 2.6.32.60 kernel compile with few modules and drivers.
Copy        Copied 1 GB of arbitrary data.
Postmark    Simulates an email server which performs numerous file
            operations. The base number of files was 500, with 5,000,000
            transactions. Files ranged in size between 500 bytes and
            9.77 kilobytes.

Table 6.1: Benchmark Scenarios

The copy benchmark copies a large file, and Postmark is configured to create and use 500 files ranging from 500 bytes to 9.77 kilobytes, with 5,000,000 transactions performed with 512 bytes read or written. The benchmarks have been run five times, and the results of these runs have been averaged to reduce statistical anomalies.

6.2 Results

Migration Results

Figure 6.1 shows the total migration time in comparison to unmodified Xen; all results are normalized to unmodified Xen. The proposed technique clearly outperforms unmodified Xen in all benchmarks: the total migration time is reduced by 44% on average over these benchmarks. In general, there is also a modest increase in downtime, which is easily explained by the extra synchronization step performed at the destination. The destination machine has to remove access to all pages that should be loaded from disk, which requires a hypercall to the hypervisor. Additionally, a background process has to be started to load the pages.

The user session benchmarks, desktop and movie, reduce the total migration time by over 40% while keeping the downtime low, as can be seen in Figure 6.1 and Figure 6.2, respectively.

Figure 6.1: Total migration time normalized to unmodified Xen

Figure 6.2: Downtime normalized to unmodified Xen

The comparison of the Copy 1024 benchmark to Copy 2048 illustrates an interesting phenomenon: Copy 2048 seems to perform better than Copy 1024. The Copy 1024 benchmark takes 100.45 seconds on average, with a total migration time of 100.67 seconds. This occurs because the benchmark finishes before the migration is done: at the moment the benchmark completes, the working set is small enough to be immediately transferred to the destination. Copy 2048 demonstrates a bad case for unmodified Xen, because it is a write-intensive benchmark that does not finish during migration.

Write-intensive applications like this show that pre-copy might never finish, because in each iteration enough pages are modified that the working set does not decrease; it finishes only after a pre-defined number of iterations has passed or enough data has been sent over the network. Pre-copy then simply suspends the VCPU and transfers all pages, which shows up as the increased downtime compared to Copy 1024. The proposed solution, however, reduces the working set by not sending these pages over the network and quickly transfers control to the destination, as shown by the decreased migration time. The downtime for the Copy 1024 benchmark increases for the proposed solution because it reduces the working set to a small enough size: when the working set is small enough, the pre-copy algorithm issues a call to suspend the VM, and during the time until the VM is actually suspended, the VM modifies pages which are not included in the pfn_to_sec map. Thus, the number of pages that have to be sent over the network increases compared to original Xen. This does not occur for original Xen, because the benchmark has already finished and the VM is idle.

For Gzip, the total migration time is reduced by about 60%, at the cost of a somewhat larger increase in downtime. This is explained by the same reasoning as the large downtime increase of Copy 1024 compared to original Xen.

Postmark illustrates a benchmark where the proposed migration technique performs poorly because there are few pages in the pfn_to_sec map. The map stays small because I/O is currently tracked only at page granularity. The Postmark benchmark simulates an email server which performs all its I/O operations on small files with sub-page granularity. I/O requests smaller than the memory page size are currently not tracked, which illustrates a weakness of this approach: even though a large page cache may exist in the VM, the hypervisor does not track I/O operations at this level.
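The following sketch shows why sub-page I/O escapes the pfn_to_sec map: only requests that are page-aligned and cover whole pages yield a usable PFN-to-sector mapping. The helper record_pfn_to_sec() is hypothetical.

#include <stdint.h>

#define PAGE_SIZE   4096u
#define SECTOR_SIZE  512u

extern void record_pfn_to_sec(uint64_t pfn, uint64_t sector);

void track_read(uint64_t guest_paddr, uint64_t sector, uint32_t len)
{
    /* Sub-page or unaligned request (e.g. Postmark's small files):
     * it cannot be mapped to whole memory pages, so it is not tracked. */
    if ((guest_paddr % PAGE_SIZE) != 0 || (len % PAGE_SIZE) != 0)
        return;

    /* Whole pages: record one (PFN, sector) pair per page covered. */
    for (uint32_t off = 0; off < len; off += PAGE_SIZE)
        record_pfn_to_sec((guest_paddr + off) / PAGE_SIZE,
                          sector + off / SECTOR_SIZE);
}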

Make has few pages in the page cache when the migration is started, since the migration begins as soon as the kernel compilation starts. It is also notable that the downtime decreases compared to original Xen. As with the Copy 2048 benchmark, this is due to the decreased working set that has to be transferred over the network. On the other hand, this decreased downtime merely shifts into performance degradation once the machine has resumed.

It is interesting to note from Figure 6.3 that the proposed solution transmits less data than original Xen in certain cases. Original Xen is denoted Org in the diagram, followed by the benchmark name; similarly, the proposed solution is denoted PS. For the proposed solution, the red portion is the amount of data read from disk and the blue portion the number of pages sent over the network.


Figure 6.3: Total data transferred

The discrepancy between the original version and the proposed solution is explained by the fact that reading from disk starts only once the migration process has ended, and by the shorter migration time. Pages on disk that would otherwise have been sent several times are instead read from disk only once, after the VM has resumed executing on the destination. The shorter total migration time also reduces the amount of data sent across the network. This can be beneficial in data centers, since network congestion decreases and less duplicate data is transferred. The reduction in data transferred does not translate linearly into an equal reduction in total migration time because of the adaptive rate limiting: the first iterations send at a slower pace than the later ones.
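The adaptive rate limiting can be sketched as follows; in spirit, each iteration transmits at roughly the rate at which the guest dirtied pages during the previous iteration plus some headroom, so early rounds are deliberately slow. All constants are illustrative, not Xen's exact defaults.

#include <stdint.h>

#define MIN_RATE_MBIT   100u   /* floor for the first, slow iterations */
#define MAX_RATE_MBIT  1000u   /* cap at the available link speed      */

uint32_t next_rate_mbit(uint32_t prev_dirty_rate_mbit)
{
    uint32_t rate = prev_dirty_rate_mbit + 50u;  /* headroom (illustrative) */
    if (rate < MIN_RATE_MBIT) rate = MIN_RATE_MBIT;
    if (rate > MAX_RATE_MBIT) rate = MAX_RATE_MBIT;
    return rate;
}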

Figure 6.4 illustrates a test run of Copy 2048, depicting the number of pages sent across the network. The original version clearly continues far beyond the proposed solution and sends more pages, as can also be seen in Figure 6.3. It can be inferred that one iteration is faster for the proposed solution, since less data is sent in every iteration. Furthermore, since the proposed solution finishes each iteration faster, there are fewer changes to be sent in subsequent iterations compared to original Xen. Finally, the changes become small enough to migrate the machine, which never happens in the case of original Xen: original Xen simply stops the VM and transfers its state to the destination once enough iterations have passed.


Figure 6.4: Data sent over the network

Performance Degradation

The degradation incurred by handling page faults and fetching data from disk is hard to measure simply by summing over all such operations.

The VM and the background loading process compete for the same resources, namely disk and CPU time. The disk competition may even introduce seek overhead, since the VM and the background process read from or write to disk and thereby change the head position for each other. This can affect the benchmark time when both the VM and the background process are disk intensive, in which case they may move the disk head for each other after every I/O request.

There is also the case where a page fault occurs and the hypervisor is switched in to handle it; the CPU caches will invariably be modified, which may also affect the performance of the VM. For these reasons, the degradation was measured as the time it took for a benchmark to finish while the VM was being migrated.

The measured performance degradation, shown in Figure 6.5, is rather low in all cases. Postmark and Make have few pages in the page cache and therefore do not produce a large performance degradation. Any degradation in the Make benchmark is hardly noticeable because of the long time required to compile the Linux kernel; it is relatively small in comparison to the long benchmark time.


Figure 6.5: Performance degradation normalized to unmodified Xen

The very small degradation for Copy 2048 may also partly be explained by the reduced migration time. To track which pages are modified, the hypervisor uses shadow page tables, as discussed in chapter 2.4; the shadow page tables are activated as soon as the migration process is started.

These shadow page tables incur an overhead, since every time the VM modifies a page the hypervisor is context-switched in. Since the total migration time is reduced, the overhead of the shadow page tables is also smaller. The number of pages modified per iteration in the Copy benchmarks can exceed 10,000, so the shadow page tables may incur an overhead large enough to slightly reduce the benchmark time for the proposed technique, as can be seen in Figure 6.5.
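Conceptually, the per-write cost arises from the log-dirty write path sketched below: the first write to a page in an iteration traps into the hypervisor (the context switch referred to above), is marked dirty, and only then proceeds. The helper name is an assumption, not an actual Xen symbol.

#include <stdint.h>

#define GUEST_PAGES (1u << 18)          /* 262,144 pages for a 1 GB guest */

static uint8_t dirty_bitmap[GUEST_PAGES / 8];

extern void shadow_make_writable(uint64_t gfn);   /* assumed helper */

void on_guest_write_fault(uint64_t gfn)
{
    /* Mark the page dirty: it must be resent in the next iteration. */
    dirty_bitmap[gfn / 8] |= (uint8_t)(1u << (gfn % 8));

    /* Re-enable writes so the page faults again only after the bitmap
     * has been read out and cleared for the next iteration.           */
    shadow_make_writable(gfn);
}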

Gzip may perform poorly for the reasons discussed in the previous paragraph. Moreover, since the Gzip benchmark is comparatively short, the incurred degradation has a larger relative impact on the benchmark time.

Table 6.3 shows additional data for the proposed solution. The first column is self-explanatory; the second gives the number of pages in the pfn_to_sec map, and the third shows the time until all of the duplicate data located on disk has been loaded into the memory of the VM. The "Page faults" column shows how many page faults occurred on average.


Benchmark    Migration time (s)   Downtime (s)   Benchmark time (s)
             (PS / Org)           (PS / Org)     (PS / Org)

Desktop      15.51 / 28.92        0.81 / 0.72    -
Movie        40.82 / 74.33        1.65 / 1.48    -
Gzip         23.71 / 61.62        1.20 / 0.85    63.67 / 54.09
Copy 1024    39.63 / 100.67       2.58 / 0.78    108.29 / 100.45
Copy 2048    34.09 / 171.00       2.03 / 11.44   204.19 / 200.02
Postmark     38.13 / 41.23        1.74 / 1.55    108.25 / 107.74
Make         43.24 / 47.66        2.06 / 2.31    549.76 / 542.29

Table 6.2: Benchmark Data (proposed solution / original Xen)

Benchmark    Page cache   Time until all   Page     Read        Write       Execution    Read   Write
             pages        loaded (s)       faults   violations  violations  violations   races  races

Desktop      147074       19.95            288      67          18          204          0      0
Movie        176884       26.21            433      79          110         228          15     0
Gzip         137804       26.57            335      14          213         108          802    1
Copy 1024    152159       38.93            1287     101         1024        163          3844   1
Copy 2048    139195       38.31            1251     39          1085        128          3798   5
Postmark     52545        6.59             110      15          0           95           0      0
Make         60485        13.1             639      329         6           305          0      4

Table 6.3: Additional Benchmark Data for the Proposed Solution

"Read violations", "Write violations", and "Execution violations" show which type of page fault occurred, while "Read race" and "Write race" count occurrences of the read-race and write-race consistency requirements. Writes to unloaded pages are not included, since none occurred during any of these benchmarks. The total number of pages in memory is a constant 262,144, since the benchmarks were run with 1 GB guests (1 GB divided into 4 KB pages). Desktop demonstrates that even an idle machine produces numerous execution violations; this occurs because code loaded from disk needs to execute, even in a more or less idle VM. Movie is similar to Desktop, with more page faults due to its more active workload. Gzip and the two Copy benchmarks exhibit numerous read races and a few write races, all of which are properly handled. The two Copy benchmarks show similar numbers, since in both cases the migration takes place about 40 seconds into the run, as shown in Table 6.2. Postmark and Make show a short time to load all pages and a small degradation, as can be seen in Figure 6.5.
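The three violation types can be told apart from the hardware page-fault error code. The following sketch shows the classification on x86, where bit 1 of the error code indicates a write access and bit 4 an instruction fetch; the enum and function names are illustrative, not taken from the implementation.

#include <stdint.h>

typedef enum { READ_VIOLATION, WRITE_VIOLATION, EXEC_VIOLATION } violation_t;

/* Classify a fault on a not-yet-loaded page, mirroring the three
 * violation counts reported in Table 6.3. */
violation_t classify_fault(uint32_t error_code)
{
    if (error_code & (1u << 4))   /* instruction fetch */
        return EXEC_VIOLATION;
    if (error_code & (1u << 1))   /* write access */
        return WRITE_VIOLATION;
    return READ_VIOLATION;        /* plain read */
}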


6.3 Comparison with Other Methods

Lacking a standard benchmark suite for live migration, comparing results to related work is difficult, especially since the migration time achieved in this thesis depends on the number of pages in the page cache. Based on the results other authors have reported, this work is compared with the overall results of each of the following techniques.

6.3.1 Memory Compression

Jin et al. describe a technique which reduces both downtime and total migration time by using an adaptive compression algorithm to decrease the amount of data sent over the network. They achieved an average reduction of 27% in downtime and 32% in migration time [10]. In terms of downtime, this technique outperforms the page cache reduction technique described in this thesis, but not in terms of total migration time. The two techniques are orthogonal, meaning that they can be combined to achieve even larger decreases in migration time; the slight increase in downtime of the proposed technique could also be mitigated by this compression method.

6.3.2 Post-Copy

Post-copy and the proposed technique share several characteristics. Both techniques invalidate a portion of the memory on the destination host when the guest starts, and both use an adaptive pre-paging algorithm to try to minimize the number of page faults: post-copy applies its algorithm to memory, whereas the page cache reduction technique applies a similar algorithm to the disk. The post-copy technique achieves a 50% reduction in migration time in some cases, which is similar to the technique proposed and implemented in this thesis. As in the case of the compression technique, post-copy and the page cache reduction technique can be combined to achieve larger decreases in migration time.

6.3.3 Trace and Replay

Trace and replay, introduced in [12], reduces the total migration time by up to 32% and the downtime by up to 72%. This technique clearly outperforms the page cache reduction technique when it comes to downtime, but the reverse is true for total migration time. Unlike the other methods, it introduces an overhead of 8% on the source machine during the migration in order to log all events, and it might increase the total migration time in certain cases, as discussed in section 3.3.4. The overhead incurred is similar to that of the page cache reduction technique, but on the source machine instead of the destination. Trace and replay can also be combined with the technique developed in this thesis: the trace and replay algorithm begins by taking a checkpoint of the VM on the source host, which is transferred to the destination host, and only afterwards begins recording and replaying. This checkpoint can be combined with the page cache reduction technique to further reduce the total migration time.

6.3.4 SonicMigration with Paravirtualized Guests

Koto et al. [11] managed to reduce the total migration time by up to 68%. The larger reduction in migration time compared to this thesis is explained by the ability of the paravirtualized kernel to communicate with the hypervisor, indicating which memory pages are unused or free in the guest VM. This information is not available in a fully virtualized environment, since allocations and deallocations of memory within the guest cannot be observed by the hypervisor. A degradation is mentioned but not reported in any quantifiable way; however, since the page cache is invalidated, it is safe to assume that the degradation is larger than for the page cache reduction technique presented here, as their algorithm has to read all of the page cache data from disk again.


7 Conclusion and Future Work

7.1 Conclusions

In this work, a solution has been proposed and implemented to efficiently migrate virtual machines. The method applies techniques from work on optimized checkpointing and efficient restoration of virtual machines to live migration. A short total migration time is achieved by sending over the network only the critical set of memory pages that are not present on disk. The discarded pages are loaded in the background while the VM resumes on the new host. Accesses to pages that are not yet loaded are intercepted by the hypervisor through a page fault mechanism. The technique presented decreases the total migration time by 44% on average. Some issues remain to be solved with this approach; they are discussed in the next section. The reduction in total migration time can help data centers make more efficient use of the power management, proactive fault tolerance, and load balancing capabilities of virtual machine live migration.

7.2 Future Work

Future work includes exploring various strategies to improve upon this work. Reading from disk at the destination while the migration is still in progress could be implemented to reduce the degradation once the guest resumes. This requires that write-after-read hazards be handled properly. Currently, the pfn_to_sec map records a mapping between a PFN and a disk sector before the write request has been committed to disk, which may create write-after-read race conditions during live migration. Consider the following scenario: the source host receives a request to write to sector x, which is recorded in the pfn_to_sec map; the map is transmitted to the destination, which reads sector x from disk before the source has written the data, causing the destination to read invalid data. To solve this, the mapping in the pfn_to_sec map should only be recorded after the write has been committed to disk. Asynchronous I/O presents a problem with caching at any level on the source or the destination; an appropriate solution would be to flush all outstanding writes when the migration begins and then track only read requests, to prevent the destination from reading inconsistent data.
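A minimal sketch of the proposed ordering fix follows, assuming the block layer provides a completion callback; all names are hypothetical.

#include <stdint.h>

typedef struct { uint64_t pfn; uint64_t sector; } write_req_t;

extern void record_pfn_to_sec(uint64_t pfn, uint64_t sector);
extern void submit_to_block_layer(const write_req_t *req,
                                  void (*on_done)(const write_req_t *));

/* Record the mapping only once the data is known to be on disk, so the
 * destination can never read a sector before it has been written.      */
static void on_write_completed(const write_req_t *req)
{
    record_pfn_to_sec(req->pfn, req->sector);
}

void handle_guest_write(const write_req_t *req)
{
    /* Do NOT record the mapping here (the current, unsafe behaviour);
     * defer it to the completion callback instead.                     */
    submit_to_block_layer(req, on_write_completed);
}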

References
