
092014

Performance of Disk I/O operations during the Live Migration of a Virtual Machine over WAN

Revanth Vemulapalli Ravi Kumar Mada

Department of Communication Systems
Blekinge Institute of Technology

SE-371 79 Karlskrona

Sweden


Engineering. The thesis is equivalent to 40 weeks of full time studies.

Contact Information:

Author(s):

Revanth Vemulapalli, Ravi Kumar Mada.

E-mail:

revanth.vemulapalli@gmail.com, m.ravikumar1989@yahoo.com.

University advisor(s):

Dr. Dragos Ilie,

Department of Communication Systems.

University examiner(s):

Prof. Kurt Tutschku,

Department of Communication Systems.

School of Computing

Blekinge Institute of Technology
SE-371 79 Karlskrona
Sweden

Internet: www.bth.se/com
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


Abstract

Virtualization is a technique that allows several virtual machines (VMs) to run on a single physical machine (PM) by adding a virtualization layer above the physical host's hardware. Many virtualization products allow a VM to be migrated from one PM to another without interrupting the services running on the VM. This is called live migration and offers many potential advantages such as server consolidation, reduced energy consumption, disaster recovery, reliability, and efficient workflows such as "Follow-the-Sun". At present, the advantages of VM live migration are limited to Local Area Networks (LANs), as migrations over Wide Area Networks (WANs) offer lower performance due to IP address changes in the migrating VMs and the large network latency.

For scenarios which require migration, shared storage solutions like iSCSI (block storage) and NFS (file storage) are used to store the VM's disk, avoiding the high latencies associated with disk state migration when private storage is used. When using iSCSI or NFS, all disk I/O operations generated by the VM are encapsulated and carried to the shared storage over the IP network. The underlying latency of the WAN will affect the performance of applications requesting disk I/O from the VM.

In this thesis our objective was to determine the performance of shared and private storage when VMs are live migrated in networks with high latency, with WANs as the typical case. To achieve this objective, we used Iometer, a disk benchmarking tool, to investigate the I/O performance of iSCSI and NFS when used as shared storage for live migrating Xen VMs over emulated WANs. In addition, we configured the Distributed Replicated Block Device (DRBD) system to provide private storage for our VMs through incremental disk replication. Then, we studied the I/O performance of the private storage solution in the context of live disk migration and compared it to the performance of shared storage based on iSCSI and NFS. The results from our testbed indicate that the DRBD-based solution should be preferred over the considered shared storage solutions because DRBD consumed less network bandwidth and had a lower maximum I/O response time.

Keywords: Virtualization, XEN, DRBD, iSCSI, NFS, Iometer, Distributed Replicated Block Device, file storage, block storage, virtual machine, VM and I/O performance.


Acknowledgments

I would like to express gratitude to my supervisor Dr. Dragos Ilie, who introduced us to the concept of virtualization, for his encouragement and for his support through valuable suggestions whenever required. Furthermore, I would like to thank Prof. Dr. Kurt Tutschku for his motivation and useful suggestions throughout the learning process of this master thesis. I would like to thank Dr. Patrik Arlos for lending us hardware for our experiments.

I am fortunate to have loving and caring parents who supported my stay and education against all odds. Without their support I couldn't have studied in Sweden. In my daily work I have been blessed with a friendly and cheerful group of fellow students (Chaitu, Gautam, Venu, Tushal, etc.). I will forever be grateful for all your love and help.

–Revanth Vemulapalli



Acknowledgments

I would like to express my gratefulness towards my supervisor Dr. Dragos Ilie, without whom this research would not have been possible. I am grateful to Prof. Kurt Tutschku for his consistent support and suggestions.

I would like to thank Dr. Patrik Arlos for providing the required equipment and environment. I appreciate the cooperation of my partner Revanth.V and his consistent commitment to this research work.

I would like to thank my friends, who have been a great support. Last but not least, I would also like to thank my family for their endless support.

–Ravi Kumar M



Contents

Abstract
Acknowledgments
Acknowledgments
List of Contents
List of Figures
List of Equations
List of Tables

1 INTRODUCTION
1.1 Advantages of virtualization
1.2 Components of Virtualization
1.3 Types of Virtualization
1.3.1 Full Virtualization
1.3.2 Para Virtualization
1.3.3 Hardware Assisted Virtualization
1.4 Virtual Machine Migration
1.5 Problem Statement
1.6 Advantages of Live Migration over WAN

2 Aim, Objectives, Research Questions & Related Work
2.1 Aim and Objectives
2.2 Research Questions
2.3 Related work

3 XEN Hypervisor and Storage Solutions
3.1 Xen Hypervisor
3.2 Distributed Replicated Block Device (DRBD)
3.2.1 DRBD disk replication algorithm
3.2.2 DRBD Replication Modes
3.2.3 Multiple Replication Transports
3.2.4 Three Way Replication
3.2.5 Efficient Synchronization
3.2.6 XEN with DRBD
3.3 iSCSI Block Storage
3.3.1 iSCSI Architecture
3.4 Network File Storage (NFS)

4 Research Methodology and Test-bed
4.1 Research Methodology
4.1.1 Stage 1
4.1.2 Stage 2
4.1.3 Stage 3
4.2 Experiment Test-beds
4.3 Hardware configuration
4.4 Benchmarking and other tools
4.4.1 Iometer
4.4.2 NetEm
4.4.3 Wireshark
4.5 Experiment Scenarios
4.5.1 Scenario 1
4.5.2 Scenario 2
4.5.3 Scenario 3
4.6 Performance Metrics
4.6.1 IOps performance
4.6.2 I/O Throughput
4.6.3 Maximum I/O Response Time
4.6.4 Network Load
4.7 Delay Characteristics of WAN

5 Experiment Results
5.1 Input Output per Second Performance
5.1.1 No delay (0.02 ms)
5.1.2 Inter-Europe (30ms Delay)
5.1.3 Trans-Atlantic (90ms Delay)
5.1.4 Trans-Pacific (160ms Delay)
5.2 I/O Throughput during Migration
5.2.1 Campus Network (0.02 ms)
5.2.2 Inter-Europe (30ms Delay)
5.2.3 Trans-Atlantic (90ms Delay)
5.2.4 Trans-Pacific (160ms Delay)
5.3 Maximum Response Time (MRT)
5.3.1 No delay (0.02 ms)
5.3.2 Inter-Europe (30ms Delay)
5.3.3 Trans-Atlantic (90ms Delay)
5.3.4 Trans-Pacific (160ms Delay)
5.4 Network Load

6 Conclusion and Future work
6.1 Conclusion
6.2 Future Work

Appendix


List of Figures

1.1 Virtualization
1.2 Type-1 and Type-2 Hypervisor
1.3 Types of Virtualization
1.4 Xen Live Migration [9]
2.1 Three-Phase whole system live migration
2.2 Incremental Migration Algorithm
3.1 Xen Architecture
3.2 DRBD Position in I/O stack
3.3 Block diagram of DRBD algorithm
3.4 Asynchronous protocol
3.5 Memory or semi Synchronous protocol
3.6 Synchronization protocol
3.7 DRBD Three Way Replication
3.8 DRBD disk status during VM live migration
3.9 Communication between initiator and target
3.10 iSCSI Block-Access Protocol
3.11 NFS File-access protocol
4.1 Experiment setup
4.2 Scenario 1 and 2
4.3 Scenario 3
5.1 Graph: IOps vs Network Delay
5.2 Graph: MBps vs Network Delay
5.3 Graph: MRT vs Network Delay
5.4 Graph: Network Bandwidth vs Storage Solutions


List of Equations

3.1 Synchronization time
5.1 Confidence Interval
5.2 Relative error


List of Tables

3.1 Comparison of storage solutions
5.1 IOps performance of different solutions on Campus network
5.2 IOps performance of different solutions on Inter-Europe network
5.3 IOps performance of different solutions on Trans-Atlantic network
5.4 IOps performance of different solutions on Trans-Pacific network
5.5 I/O throughput for different solutions on Campus network
5.6 I/O throughput for different solutions on Inter-Europe network
5.7 I/O throughput for different solutions on Trans-Atlantic network
5.8 I/O throughput for different solutions on Trans-Pacific network
5.9 MRT of different solutions on Campus network
5.10 MRT of different solutions on Inter-Europe network
5.11 MRT of different solutions on Trans-Atlantic network
5.12 MRT of different solutions on Trans-Pacific network
1 NFS address details
2 iSCSI address details


List of Abbreviations

ABSS    Activity Based Sector Synchronization
AVG     Average page dirty rate
CBR     Content Based Redundancy
CHAP    Challenge-Handshake Authentication Protocol
CPU     Central Processing Unit
DBT     Dirty Block Tracking
DDNS    Dynamic Domain Name Service
DNS     Domain Name Service
DRBD    Distributed Replicated Block Device
HIST    History based page dirty rate
IM      Incremental Migration
IO      Input Output
IOPS    Input Outputs per Second
IPSec   Internet Protocol Security
IPv4    Internet Protocol version 4
IPv6    Internet Protocol version 6
iSCSI   internet Small Computer System Interface
KVM     Kernel-based Virtual Machine
LAN     Local Area Network
LMA     Local Mobility Anchor
MAG     Mobile Access Gateway
MBPS    Megabytes per Second
MIPv6   Mobile Internet Protocol version 6
MRT     Maximum Response Time
NAS     Network Attached Storage
NFS     Network File Storage
OS      Operating System
PBA     Proxy Binding Acknowledgement
PBU     Proxy Binding Update
PMIPv6  Proxy Mobile Internet Protocol version 6
RDMA    Remote Direct Memory Access
RPC     Remote Procedure Call
RPD     Rapid Page Dirtying
SAN     Storage Area Network
SFHA    Super-Fast Hash Algorithm
TCP     Transmission Control Protocol
TOE     TCP Offload Engine
VM      Virtual Machine
VMM     Virtual Machine Monitor
VMTC    Virtual Machine Traffic Controller
VPN     Virtual Private Network
WAN     Wide Area Network


INTRODUCTION

Hardware virtualization is one of the underlying techniques that makes cloud computing possible [7]. Virtualization is a technique that allows a physical host to run two or more operating systems simultaneously. This technique adds a virtualization layer between the physical host's operating system and hardware, allowing virtualized hosts to utilize the physical host's processor, memory and I/O devices [1]. The physical host may be a desktop that contains a limited number of virtual hosts or a Data Centre (DC) containing several virtual hosts. Virtualization has many advantages like server consolidation, providing virtual desktops, reducing energy consumption, hardware cost reduction, etc. A simple diagram of virtualization, where a single physical machine hosts three virtual machines, is shown in figure 1.1.

Figure 1.1: Virtualization

This report is divided into six chapters. The concepts of virtualization and the advantages of VM migration over WAN are discussed in chapter one. Chapter two describes our aim, research questions and related work. We introduce the Xen hypervisor, the disk replication scheme DRBD and shared storage approaches like iSCSI and NFS in chapter three. We describe our research methodology and experiment setups, with a detailed description of the hardware used in our experiments along with the delay characteristics of WANs, in chapter four. The analysis of the results observed during the experiments is presented in chapter five. We conclude the thesis and discuss possible future work in chapter six.

1.1 Advantages of virtualization

Some of the advantages of virtualization are hardware abstraction and server consolidation, where several small physical server machines can be efficiently replaced by a single physical machine hosting many virtual server machines. Virtualization is cost efficient: it eliminates the need for several individual physical server machines, thereby reducing physical space, energy utilization, etc. Below is a list of advantages [3] [51] associated with virtualization:

• Server consolidation and hardware abstraction,

• Proper resource utilization,

• Cost reduction,

• Low power consumption,

• Reduction of physical space,

• Reduction of corporate greenhouse gas emissions,

• Flexibility,

• High Availability, etc.

1.2 Components of Virtualization

Virtualization has two important components, namely the hypervisor, or Virtual Machine Monitor (VMM), and the guest. The hypervisor is responsible for managing the virtualization layer. There are two kinds of hypervisors, Type-1 and Type-2, shown in figure 1.2.

A Type-1 hypervisor, also known as a bare-metal hypervisor, runs directly on the host's hardware and has direct access to the hardware resources. This design provides greater flexibility and better performance [9]. Xen, Microsoft Hyper-V and VMware ESX are examples of Type-1 hypervisors.

A Type-2 hypervisor runs as an application on top of the host machine's operating system. Each virtual machine is run and controlled by its own virtual machine monitor, so there is one virtual platform for each and every virtual machine. VirtualBox, VMware Player and VMware Workstation are examples of Type-2 hypervisors.

The guest is the virtual host that runs above the virtualization layer. The virtual host has its own operating system (OS) and applications. This guest operating system can be migrated from one physical host to another.

Figure 1.2: Type-1 and Type-2 Hypervisor

1.3 Types of Virtualization

Virtualization can be classified into three types: full virtualization, para-virtualization and hardware assisted virtualization [1]. These three types are shown in figure 1.3.

1.3.1 Full Virtualization

In full virtualization, binary translation is performed on the virtual machine's privileged instructions before sending them to the physical host's CPU. Full virtualization is the combination of binary translation and direct execution. The virtual operating system believes that it owns the hardware itself. The virtual operating system's privileged instructions are translated by the hypervisor before being sent to the CPU, whereas user-level instructions are executed directly.

1.3.2 Para Virtualization

Para-virtualization was introduced by the Xen project team. It is a virtualization technique in which the kernel of the guest operating system is modified to make it aware of the hypervisor. In this technique the privileged instructions are replaced by hypercalls that communicate directly with the hypervisor. This technique is efficient and easy to implement [20].

1.3.3 Hardware Assisted Virtualization

Hardware assisted virtualization is also called accelerated virtualization [4]. This technique allows unmodified operating systems to be used by exploiting special features provided by the computer hardware. Both Intel and AMD have supported hardware virtualization (VT-x/AMD-V) since 2006. In this technique, the virtual machine monitor runs at a root-mode privilege level below ring 0 [53]. That is, hardware virtualization creates an additional privilege level below ring 0 which contains the hypervisor and leaves ring 0 for the unmodified guest operating system [52]. The protection ring structure of hardware assisted virtualization is shown in figure 1.3. CPUs manufactured before 2006 cannot take advantage of this feature.

Figure 1.3: Types of Virtualization

1.4 Virtual Machine Migration

Virtual machine migration is the process of moving a working/running virtual machine from one physical host to another without interrupting the services provided by it. During migration the memory, CPU, network and disk states are moved to the destination host. End users using the services provided by the virtual machine should not notice any significant changes. There are three types of migration: cold migration, hot migration and live migration.

In cold migration, the virtual machine is first powered off at the source node before migrating it to the destination node. The CPU state, memory state and existing network connections in the guest OS are lost during cold migration. In hot migration, the virtual machine is suspended at the source node before migrating it to the destination node, where it is resumed. Most of the OS state can be preserved during this migration.

Figure 1.4: Xen Live Migration [9]

In live migration [9] the hosted operating system is migrated along with its CPU, memory and disk state from source to destination while the hosted OS is still running, without losing active network connectivity. Disk state migration is not necessary when shared storage like Network Attached Storage (NAS) or a Storage Area Network (SAN) is used. Among the three migration techniques, live migration is best suited to reduce the noticeable downtime of services running on the virtual machine [9]. By using live migration, the load on physical machines hosting several virtual machines can be decreased to a large extent [22]. Live migration may, however, increase the total migration time of virtual machines running server applications.

In order to perform live migration, the hypervisor needs to move the memory state of the virtual machine from the source to the destination physical machine. The crucial method used for migrating memory state is known as pre-copy, and is explained clearly in [8] and [9]. In the pre-copy phase the memory state of the virtual machine is copied to the destination host in iterations. Unmodified or unused pages are moved in the first round and modified pages are moved in the subsequent rounds. The hypervisor maintains a dirty bitmap to track modified pages. The pre-copy phase is terminated [4], and the stop-and-copy phase initiated, when a predetermined bandwidth limit is reached or when the remaining data to be copied is below roughly 256 kB. In the stop-and-copy phase the virtual machine is paused on the source host and the remaining modified pages are copied to the destination host. The virtual host is then resumed at the destination host and continues working as usual [3]. The Xen live migration timeline is shown in figure 1.4 [9].
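The iterative structure described above can be summarized in a short sketch. The following Python pseudocode is only an illustration of the general pre-copy/stop-and-copy flow described in [8] and [9]; the helper functions, the page size, the 256 kB threshold handling and the iteration limit are simplified assumptions, not Xen's actual implementation.

```python
# Illustrative sketch of iterative pre-copy live migration (not Xen's real code).
# copy_pages(), dirty_pages_since_last_round(), pause_vm(), resume_vm() and
# transfer_cpu_state() are hypothetical helpers standing in for hypervisor internals.

PAGE_SIZE = 4096                     # bytes per memory page (typical)
STOP_COPY_THRESHOLD = 256 * 1024     # stop pre-copy when <= ~256 kB remains dirty
MAX_ROUNDS = 30                      # safety limit on pre-copy iterations

def live_migrate(vm, destination):
    # Round 1: push all memory pages while the VM keeps running.
    copy_pages(vm.all_pages(), destination)

    # Rounds 2..n: re-send only the pages dirtied during the previous round.
    for _ in range(MAX_ROUNDS):
        dirty = dirty_pages_since_last_round(vm)          # from the dirty bitmap
        if len(dirty) * PAGE_SIZE <= STOP_COPY_THRESHOLD:
            break                                         # small enough to stop
        copy_pages(dirty, destination)

    # Stop-and-copy: pause the VM, send the remaining dirty pages and CPU state.
    pause_vm(vm)
    copy_pages(dirty_pages_since_last_round(vm), destination)
    transfer_cpu_state(vm, destination)
    resume_vm(destination)            # the VM continues running at the target host
```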

Various disk state migration and network state migration schemes proposed by researchers are discussed in the related work section of chapter 2.

1.5 Problem Statement

Xen [8], Hyper-V and VMware are some of the virtualization suites that support live migration over a LAN. Some researchers are working on live migration of virtual machines over a WAN. This is quite challenging because the WAN typically has limited bandwidth and high latency compared to the LAN. Migration over a WAN also requires network reconfiguration because the IP address pool¹ changes at the destination host after migration. If a Virtual Private Network (VPN) or some other form of tunnel is not used, the source host and destination host reside in different sub-networks, so VMs migrating over the WAN must change their IP address to the destination host's subnet to keep uninterrupted network connectivity. For example, consider a virtual machine hosting a web server at site 1. Due to issues like hardware maintenance or overload, that virtual machine must be migrated to another location, site 2, at a different geographical location. After migration, the virtual machine at site 2, where a different network prefix is used, must acquire a new IP address. Additionally, it may also have to reconfigure its default gateway. This situation leads to loss of connection with clients using the services of the web server running on the virtual machine. We conducted a literature study on the different live migration schemes proposed by various researchers to support live migration of virtual machines over a WAN without losing network connectivity. VM migration over WAN has many advantages like cloud bursting, load balancing, enabling the "Follow-the-Sun" IT strategy [11], consolidation, maintenance, scalability, reliability, recovery, disaster management, etc. [3].

Shared storage and private storage are the two data storage techniques used to store the disks of virtual machines. Private data storage, also called direct-attached storage, is storage allotted to a particular host; it cannot share its unused space with other hosts. Shared storage, in contrast, is space allocated commonly to two or more hosts. If a virtual machine uses private storage, its virtual disk must be migrated along with the memory, CPU and network states. This increases the downtime and total migration time of the virtual machine, and frequent disk migrations also lead to high network overhead due to their size. Therefore, shared storage is utilized to eliminate the need for disk state migration when a virtual machine is migrated, reducing the total migration time and downtime.

¹ A set of available IP addresses.

However, when migrating over a WAN, the performance of the virtual machine may still be affected by the high latency of the network, because the virtual machine uses the network to access a disk located at another geographical location. To the best of our knowledge, the disk I/O performance of a virtual machine while migrating over a WAN has not been fully explored before. Our thesis work may look similar to [2]; the differences between the two works are discussed in section 2.3.

The primary purpose of our thesis is to investigate disk performance during the live migration of a virtual machine over the WAN using different storage techniques. We analyse the performance of the iSCSI, NFS and DRBD storage techniques and recommend the best technique for specific scenarios.

1.6 Advantages of Live Migration over WAN

Almost all the advantages of VM live migration are currently limited to LANs, as migrating over a WAN affects performance due to high latency and network changes. The main goal of our thesis is to analyse the performance of the various disk solutions available during the live migration of a VM over WAN. When a virtual machine using shared storage is live migrated to a different physical host, end users interacting with a server running on the migrating virtual machine should not sense notable changes in the performance of the server. Live migration is supported by various popular virtualization tools like VMware and Xen. The following advantages of live migration over WAN motivate our thesis work in this area.

1. Maintenance: During scheduled maintenance, all the virtual machines running on a physical host can be migrated to another physical host so that the maintenance work does not interrupt the services provided by the virtual machines.

2. Scaling and Cloud Bursting: Load balancing and consolidation can make the best use of virtual machine migration over WAN. If a physical host gets overloaded beyond the capacity of its hardware resources, the performance of the virtual machines it hosts suffers. The virtual machines can then be migrated (cloud bursted) to physical hosts at other geographical locations to attain load balancing.

3. Power Consumption: Virtual machines running on lightly populated hosts can be migrated to moderately loaded physical hosts at a different location. This allows the initial host to be shut down to reduce unnecessary power wastage.

4. Disaster Recovery and Reliability: In times of disaster, the virtual machine running on a physical host can be saved by migrating it to another physical host over the WAN. When a physical host is corrupted or destroyed, the virtual machine can be recreated or booted at a mirror location by using the VM's shared disk and configuration file (in the case of Xen), reducing service downtime.

5. Follow-the-Sun: This is a new IT strategy where a VM can be migrated between different time zones in a timely manner. It was designed for teams working on a project around the clock. Team A works on the project during their working hours, then the data is migrated to another location where team B takes care of the work during their working hours and later migrates the data back to team A.


Aim, Objectives, Research Questions & Related Work

In this chapter we briefly describe our aim and objectives along with the research questions, and discuss the related work conducted by various researchers on network and disk state migration.

2.1 Aim and Objectives

In this thesis we analyzed disk I/O performance during the live migration of a virtual machine over an emulated WAN. We used Xen [8], an open source hypervisor developed at the University of Cambridge Computer Laboratory, which supports both hardware assisted virtualization and paravirtualization and also supports live migration of virtual machines. To overcome the limitations of disk migration over WAN, which we reported in chapter 1, we used DRBD to replicate the disk at both cloud locations through periodic updates. We identified the performance metrics relevant to our approach. We also repeated the experiment with shared storage solutions, namely NFS and iSCSI. Finally, we analyzed the performance of all the disk storage solutions and compared them.

The primary purpose of our thesis is to analyze the performance of the virtual machine's disk during the live migration of the virtual machine over WAN. We analyzed the performance of network-based and distributed storage solutions and recommend the best solution.

We laid out the following objectives to achieve the above aim.

• Conduct a literature review of the various virtual machine migration schemes over WAN.

• Conduct a literature review of the various disk migration algorithms/methods proposed by researchers.

• Conduct a study on Distributed Replicated Block Device (DRBD) tech- nique.

• Perform an experiment on a laboratory test bed.


• Identify the performance metrics to evaluate the above experiment.

• Repeat the experiment with different network storage systems like NFS and iSCSI.

• Analyze the results and draft the observations.

2.2 Research Questions

The following are the research questions we are going to answer in our thesis.

RQ1: What are the strategies proposed by various researchers to support live virtual machine migration over WAN/Internet?

RQ2: What are the various disk migration schemes proposed by various researchers?

RQ3: How efficient is the Distributed Replicated Block Device (DRBD) at mirroring disk data in different locations?

RQ4: What is the I/O performance when migrating virtual machines over a WAN using DRBD?

RQ5: What is the I/O performance when migrating virtual machines with shared storage solutions over a WAN?

RQ6: Among distributed and shared storage, which solution performs better while migrating over WAN?

2.3 Related work

Luo et al. [12] discussed a live migration scheme which migrates the local disk, CPU and memory state of a virtual node. They proposed a three-phase migration algorithm that minimizes the downtime caused by large disk migrations. The three-phase migration algorithm is shown in figure 2.1. This migration scheme has three phases named pre-copy, freeze-and-copy and post-copy, which is similar to the Xen hypervisor's memory migration process. In the first phase, disk data is copied to the destination in n iterations; a block-bitmap is used to track the disk blocks dirtied during this stage, and the dirtied disk data is copied during the next iteration. This phase is limited to a particular number of iterations and is proactively stopped when the dirtying rate is faster than the transfer rate. In the freeze-and-copy phase, the virtual machine running on the source machine is suspended and the dirtied memory pages are migrated to the destination along with the CPU state. The third phase is a combination of push and pull: the virtual machine is resumed and, according to the details in the dirty bitmap, the source pushes the dirty blocks while the destination machine pulls them. This process reduces the total migration time of the virtual machine.

Figure 2.1: Three-Phase whole system live migration

In [12] the authors also developed a mechanism called the Incremental Migration (IM) algorithm, which reduces the total migration time when migrating the virtual machine back to the source machine. When the virtual machine is sent back to the physical machine from which it was migrated earlier, this algorithm checks the block-bitmap to find the dirty blocks modified in the virtual disk after the earlier migration. Only these dirtied blocks are migrated to the target physical location. The algorithm is shown in figure 2.2.

Figure 2.2: Incremental Migration Algorithm

Kleij et al. [2] conducted an experiment on a laboratory test-bed to evaluate the possibility of using the Distributed Replicated Block Device (DRBD) [6] for migrating disk state over WAN; DRBD creates two copies of the data disks by mirroring. DRBD supports asynchronous, semi-synchronous and fully synchronous replication modes, all of which are discussed in chapter 3.2.2. They expected that using DRBD in asynchronous mode would give better performance. They constructed a laboratory test bed with 50 ms RTT delay to emulate a WAN. They compared the performance of DRBD with high-latency internet Small Computer System Interface (iSCSI) and concluded that the DRBD test was 2.5 times faster than the remote iSCSI test. The downtime observed during the virtual machine's migration was only 1.9 seconds. Regarding read statistics, DRBD's performance was better compared to remote iSCSI, but there was no significant difference between the performance of DRBD and local iSCSI. However, the authors admit that there are some inconsistencies in their results, which they cannot account for. Our study may look similar to this work, but in our thesis we include NFS as a shared storage solution and we also measure the performance for different latencies. In [2] the authors analyzed the virtual machine's disk performance based on migration time, HTTP mean request time and I/O read statistics, and they used DRBD's asynchronous replication protocol to replicate data between nodes. We analyzed the disk I/O performance using the different performance metrics discussed in section 4.6, conducting both read and write I/O tests, and we used the DRBD synchronous replication protocol, which provides stronger data consistency guarantees than the asynchronous protocol used in [2]. Our experiment results confirm that DRBD outperformed iSCSI and NFS while the virtual machine is migrated over WAN.

Wood et al. [11] discussed solutions for the limitations faced by virtual machines migrating over WAN. They experimented with an architecture called CloudNet that is distributed across datacenters in the United States separated by 1200 km. Their architecture reduced network bandwidth utilization by 50%, and memory migration and pause time by 30 to 70%. The authors used the DRBD disk replication system to replicate the disk at both source and destination. First, DRBD brings the remote disk to a consistent state using its synchronization daemon. When both disks are synchronized and stable, DRBD is switched to its synchronous replication protocol and the modified data is placed in the TCP buffer for transmission to the destination disk; a write completes once it is confirmed at the destination. The memory and CPU states are migrated after the disk migration stage. The Content Based Redundancy (CBR) [11] block technique is used to save bandwidth. CBR splits disk blocks and memory pages into fixed-size finite blocks and uses the Super-Fast Hash Algorithm (SFHA) to generate hashes based on their content; these hashes are used to compare against previously sent blocks. This solution saved 20 GB of bandwidth compared to migrating the full disk. It also reduced the memory transfer time by 65%. They designed a scheme to retain network state using a layer-2 VPN solution. In this experiment the authors analyzed the network bandwidth utilized for virtual machine migration along with the memory migration and pause times. In our experiment, on the other hand, we focused on the disk I/O performance of DRBD, iSCSI and NFS when the VM is migrating over WAN, along with the network utilization. From our experiments we can say that iSCSI and NFS consumed more than 75% more network bandwidth than DRBD.

Akoush et al. [19] discussed the parameters which affect the live migration of virtual machines. Total migration time and downtime are the two important performance parameters used to analyze a virtual machine's performance during migration. Total migration time is the time required to move a virtual machine from one physical host to another. Downtime is the period of time during which the virtual machine does not run. Page dirtying rate and network bandwidth are the factors which affect the total migration time and downtime. The authors implemented two simulation models, namely average page dirty rate (AVG) and history based page dirty rate (HIST), which are accurate up to 90 percent of actual results. They also proposed a model called Activity Based Sector Synchronization (ABSS) [22] which migrates virtual machines efficiently in a timely manner. The ABSS algorithm predicts the sectors that are likely to be altered, which helps in minimizing the total migration time and the network bandwidth.

Robert et al. [23] worked on both disk-related and network-related issues. They migrated local persistent state by combining a block-level solution with pre-copying. When the virtual machine needs to migrate, the disk state is first pre-copied to the destination host; the virtual machine still runs during this stage. After some time this mechanism starts Xen's migration of the memory and CPU states of the virtual machine. All the changes made to the disk on the source side during the migration process are recorded in the form of deltas (units which contain the written data, its location and size). These deltas are sent to the destination and applied to the disk image. This mechanism has a downtime of 3 seconds in a LAN and 68 seconds in a WAN. They combined Dynamic DNS (DDNS) with IP tunneling to retain old network connections after live migration. When the virtual machine on the source host enters the paused state, a tunnel is created using Linux iproute2 between the virtual machine's old IP address and the new IP address at the destination. This mechanism drops the packets arriving on the source host's side during the final stage of migration. Packets belonging to old connections are forwarded from the source host to the IP address of the virtual machine at the destination via the tunnel. The virtual machine is configured with a new dynamic IP address to prevent it from using the old tunnel for new connections. Thus, the virtual machine will now have two IP addresses. The disadvantage of this mechanism is that the tunnel between source and destination will not be closed until the old connections end, which requires the source host to keep running until the old connections are closed.

Kazushi et al. [13] used a data duplication technique to propose fast virtual machine storage migration. This technique reduces the volume and time of data transfer. Suppose there is a situation where a virtual machine must migrate frequently between site A and site B. Migrating a large disk between sites A and B will waste a lot of network resources and take a long time. These frequent disk migrations can be avoided by using duplication. When the VM is migrated for the first time from site A to site B, the full disk is copied to site B. When the VM migrates back to site A, only the changes made to the disk at site B are replicated to the disk at site A. A new diff image structure, which is a new virtual machine image format, and Dirty Block Tracking (DBT) were developed to track the changes. This technique successfully reduced the migration time from 10 minutes to 10 seconds.

Travostino et al. [5] discussed the advantages and requirements of long-haul live migration. Their experiment proved that virtual machines can migrate over a WAN across geographically distant locations instead of being limited to small local datacenters. Their scheme has a dedicated agent called the Virtual Machine Traffic Controller (VMTC) that creates dynamic tunnels between clients and the virtual hosts. The VMTC is responsible for the migration of the virtual machine and it maintains connectivity with the destination host to which the virtual machine should be migrated. The VMTC communicates with an Authentication, Authorization and Accounting (AAA) module to get authentication in the form of a token to set up an on-demand, end-to-end path. It is the responsibility of the VMTC to migrate the virtual machine and reconfigure the IP tunnels so that it can seamlessly communicate with its external clients. When a virtual host is migrated from one host to another, the tunnel is broken and a new tunnel is created between the client and the virtual host. They migrated their virtual machine between Amsterdam and San Diego using a 1 Gbps non-switched link with a Round Trip Time (RTT) of 198 ms. The application downtime observed during the migration was in the range of 0.8 to 1.6 seconds. Their experiment results show that, compared to LAN migration, the downtime of migration over WAN is just 5-10 times higher, even though the round trip time is more than 1000 times larger. This may be due to the dynamically established link (light path) of 1 Gbps using layer-1 and layer-2 circuits without routing protocols; layer-3 protocols are nevertheless used for communication between the VM and its clients.

Eric et al. [14] used Mobile IPv6 (MIPv6), a protocol for managing the mobility of mobile nodes across wireless networks, to support live migration of virtual machines over WAN. The virtual machines behave as mobile nodes by enabling MIPv6 [24] in the TCP/IP protocol stack. This scheme enables a virtual machine to retain its IP address after migrating from the source host. It eliminates the need for DNS updates [23] for the virtual machine. The disadvantage of this scheme is that the virtual machine's kernel must be modified to support mobility (unless that was already preconfigured).

Solomon et al. presented a live migration approach using the Proxy Mobile IPv6 (PMIPv6) protocol [15]. This approach is similar to MIPv6 [14] but it does not require installation of mobility software on the virtual machine [25]. However, it requires specific infrastructure deployed in the network. In their experiment the source host and destination host act as Mobile Access Gateways (MAGs) and are connected to the Local Mobility Anchor (LMA). This choice was made just for convenience, to keep the experiment testbed simple; the LMA and MAGs are independent infrastructure elements. The LMA and MAGs are the equivalents of the home agent and foreign agent in Mobile IP. The LMA arranges for packets destined to the mobile node to be forwarded to the MAG where the node is currently located. The MAG is responsible for emulating the mobile node's home network and for forwarding packets sent by the mobile node over the tunnel to the LMA. The LMA acts as a central management node tracking the virtual machines. When a virtual machine is booted on the source machine, say MAG1, it registers with the LMA via Proxy Binding Update (PBU) and Proxy Binding Acknowledgement (PBA) messages. Upon successful registration a tunnel is created between MAG1 and the LMA, and the virtual machine can then reach the outside world via the LMA. When the VM is migrated to another location, for example to MAG2, the tunnel between MAG1 and the LMA is torn down and the VM is deregistered via PBU and PBA messages. A tunnel is then created between MAG2 and the LMA using the same process that took place for MAG1. In this solution the VM retains its IP address and network connections after the migration. Their experiment results showed that this approach migrated the virtual machine with minimal downtime compared to the MIPv6 approach. In this experiment the authors used iSCSI shared storage to store the VM's disk to avoid disk state migration.

Forsman et al. [16] worked on automatic, seamless live migration of virtual machines in a data center. They developed an algorithm that migrates virtual machines based on CPU workload, using two strategies called push and pull. When a physical host's workload crosses the higher threshold level, a hotspot is detected and the push phase initiates virtual machine migration to another, underutilized physical host to balance the load. When a physical host is underutilized, below the lower threshold level, the pull phase is initiated to balance the system by requesting virtual machines from other physical nodes. They also used a concept called variable threshold, which varies the threshold level of the load periodically and results in a more balanced system. They successfully simulated their technique using the OMNeT++ simulation software.


XEN Hypervisor and Storage Solutions

In this chapter we describe the Xen hypervisor and the various storage solutions we used to analyse disk performance during the live migration of a virtual machine. Xen, the Kernel-based Virtual Machine (KVM), VMware ESXi and Hyper-V are some of the widely used type-1 hypervisors. Among these hypervisors, VMware ESXi and Hyper-V are expensive, proprietary software with restrictive licence schemes, so we restricted ourselves to open source hypervisors like Xen and KVM. KVM requires the physical host to support hardware assisted virtualization (HVM), but unfortunately our testbed does not have hardware supporting HVM. We therefore used the Xen hypervisor and created para-virtualized virtual machines. We briefly describe the Xen hypervisor along with its features in section 3.1.

We used the internet Small Computer System Interface (iSCSI), Network File Storage (NFS) and Distributed Replicated Block Device (DRBD) storage solutions to store the disks of the virtual machines. These are three widely used storage solutions supported by various hypervisors to store the disk state of virtual machines. Each storage solution has its own advantages and disadvantages. A short comparison of these three storage solutions is shown in table 3.1.

Among these three solutions, DRBD and iSCSI are block storage solutions and NFS is a file storage solution. In this thesis we used the Xen hypervisor to migrate a virtual machine while conducting disk I/O tests. Our virtual host has its disk on shared or replicated storage, so Xen migrates only the memory and CPU states, leaving the disk state in place. The performance of the VM's disk I/O operations depends on the underlying storage solution, so we assume that our results are valid for other available hypervisors.

3.1 Xen Hypervisor

Xen [8] is an open source hypervisor developed at the University of Cambridge Computer Laboratory which supports both hardware assisted virtualization and paravirtualization. It also supports live migration of virtual machines. Xen is one of the five¹ type-1, or bare-metal, hypervisors that are available as open source [42]. Xen allows multiple operating systems to run in parallel with the host operating system. Xen is used for different open source and commercial applications such as desktop virtualization, server virtualization, Infrastructure as a Service (IaaS), security, and hardware and embedded appliances. Today the Xen hypervisor is powering large clouds in production [20]. The Xen hypervisor is responsible for handling interrupts, scheduling the CPU and managing memory for the virtual machines.

Figure 3.1: Xen Architecture

Dom0, or Domain-0, is the domain in which Xen starts during boot. From the Xen architecture shown in figure 3.1 we can see that Dom0 is the privileged control domain which has direct access to the underlying hardware. Dom0 contains the toolstack, which is a user management interface to the Xen hypervisor. The Xen toolstack can create, manage and destroy virtual machines, or domUs, which are unprivileged domains [20]. Xen supports hardware virtualization and paravirtualization. In hardware virtualization, unmodified operating systems can be used for the virtual machines, whereas paravirtualization requires modifications to the operating system kernel running inside the virtual machines; doing so increases the performance of paravirtualized hosts. The host operating system must be Xen paravirtualization enabled to create such virtual machines. Linux kernels before version 2.6.37 are not paravirtualization enabled and must be recompiled to enable paravirtualization; all Linux kernels released after version 2.6.37 are Xen paravirtualization enabled by default.

¹ The other open source hypervisors are KVM, OpenVZ, VirtualBox and Lguest.


Xen allows virtual machines to migrate between hosts while the guest operating system is running. This feature is called live migration. In Xen, the daemons running in the Dom0 of the source and destination hosts take responsibility for the migration. The memory and CPU states of the virtual machine are migrated from the source machine to the destination machine by the control domain. The Xen hypervisor copies memory pages in a series of rounds using the Dynamic Rate-Limiting and Rapid Page Dirtying [8] techniques to reduce the service downtime. The Dynamic Rate-Limiting algorithm adapts the bandwidth limit for each pre-copy round and is used to decide when the pre-copy stage should end and the stop-and-copy phase should start. The Rapid Page Dirtying algorithm is used to detect rapidly dirtied² pages and skip them during the pre-copy stage. Xen uses a microkernel design with a small footprint and interface, around 1 MB in size, making it more secure and robust than the other available hypervisors. The Xen hypervisor is capable of running the main device driver for a system inside a virtual machine; the virtual machine that contains the main device driver can be rebooted, leaving the rest of the system unaffected [20]. This feature of Xen is called driver isolation. Driver isolation provides a safe execution environment which protects the virtual machines from buggy drivers [55]. Xen is operating system agnostic, which means different operating systems like NetBSD and OpenSolaris can be hosted.

² Rapidly modified memory pages, which are generally copied in the stop-and-copy phase.

Basically, Xen supports two types of virtual block devices, named 'phy' and 'file'. Phy refers to a physical block device available in the host environment, whereas file refers to a disk image available in the form of a file on the host computer. A loop block device is created from the image file and the resulting block device is handed to the domU. Shared storage solutions like iSCSI use phy, while NFS uses file.
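To make the two device types concrete, the fragment below sketches how the disk line of a domU configuration file (the Python-like files under /etc/xen/) might reference a phy device and a file image. The device paths, image path and domain name are made-up examples for illustration, not values from the thesis testbed.

```python
# Hypothetical /etc/xen/vm1.cfg fragment (example names and paths, not the thesis setup).
name = "vm1"

# 'phy': a physical block device on the host, e.g. an iSCSI-backed logical volume.
disk = ['phy:/dev/vg0/vm1-disk,xvda,w']

# 'file': a disk image stored as a regular file, e.g. on an NFS mount;
# Xen attaches it through a loop block device before handing it to the domU.
# disk = ['file:/mnt/nfs/vm1.img,xvda,w']
```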

3.2 Distributed Replicated Block Device (DRBD)

DRBD [43] stands for Distributed Replicated Block Device. It is a software-based tool which replicates data between block devices on different servers and provides a virtual/cluster storage solution. The block devices may be hard disks, logical volumes or disk partitions. In DRBD the data is replicated as soon as the application writes to or modifies the data on the disk; the top-level applications are not aware of the replication. DRBD uses synchronous and asynchronous techniques to mirror the block device. DRBD is controlled by cluster software called "heartbeat"; if an active node crashes, heartbeat initiates the failover process. Each DRBD peer acts in either the primary or the secondary role. User space applications on the secondary node do not have write access to the DRBD disk resource, but they can read data. Write access to the data is granted only in primary mode. DRBD's cluster management software is generally responsible for the role assignment.

Figure 3.2: DRBD Position in I/O stack

DRBD's functions are implemented purely as a Linux kernel module. Features like flexibility and versatility make DRBD suitable as a duplication solution for many applications. The DRBD package has two main components, namely the driver code and the user tools. The driver code runs in kernel space, and the user tools are used to control the driver code and cluster management programs from user space. Drbdadm, drbdsetup and drbdmeta are some of the user tools used to communicate with the kernel module for managing and configuring DRBD resources. DRBD remained an out-of-tree kernel module for many years [43] but was integrated into the Linux kernel from the version 2.6.33 release. We used DRBD v8.4.3 in our experiments. DRBD installation and configuration details are described in the appendix.

DRBD has two operation modes, namely single primary mode and dual primary mode. In single primary mode only one of the two disks in the cluster acts as the primary disk. All the changes made to the data on this primary disk are replicated to the peer disk, which is at a different physical location. The secondary disk can be used as a backup in case of data loss on the primary disk. Dual primary mode was first introduced in DRBD v8.0 and is disabled by default. In dual primary mode both disks in the cluster act as primary disks. Dual primary mode can be integrated with the Xen hypervisor to eliminate disk migration during live migration [2]. The changes we made in the /etc/drbd.conf configuration file enabled dual primary mode and integrated DRBD with Xen. These configuration changes are described in the appendix.

Figure 3.3: Block diagram of DRBD algorithm

3.2.1 DRBD disk replication algorithm

In this section we explain the DRBD disk replication algorithm [6] using the block diagram shown in figure 3.3. DRBD uses five steps to replicate a write event on both disks.

1. When a write request is issued by an application running on the primary node, the write request is submitted to the local I/O system and is also placed in the TCP buffer to be sent to the peer node.

2. The write event is sent to the peer node in the form of a DRBD data packet over TCP and is written to the disk of the peer node.

3. When the write request has been written on the peer node, an acknowledgement (ACK) is created and sent to the source node.

4. The ACK packet reaches the source node.

5. The local write confirmation is received by the source node.

The order of events 1 to 4 is always the same, but the write event on the local disk can take place at any time, independently of events 2 to 4.
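A compact way to view these five steps is as two concurrent paths: a local write and a network round trip to the peer. The sketch below is a simplified Python illustration of that ordering, not DRBD's kernel implementation; the helper names are invented for clarity.

```python
# Simplified sketch of DRBD's write replication path (illustrative only).
# submit_local_write(), send_to_peer(), wait_for_peer_ack() and
# wait_for_local_completion() are hypothetical stand-ins for kernel internals.

def replicate_write(block, primary, peer):
    # Step 1: the application's write is submitted locally and queued for the peer.
    local_io = submit_local_write(primary, block)   # asynchronous local disk write
    send_to_peer(peer, block)                       # DRBD data packet over TCP

    # Steps 2-4: the peer writes the block and returns an ACK to the primary.
    wait_for_peer_ack(peer)

    # Step 5: the local disk confirms its write; this can complete at any point
    # relative to steps 2-4, only the 1 -> 2 -> 3 -> 4 order is fixed.
    wait_for_local_completion(local_io)
```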


3.2.2 DRBD Replication Modes

DRBD supports three replication modes, namely Protocol A, Protocol B and Protocol C [43], also called the asynchronous, memory synchronous and synchronous modes.

3.2.2.1 Protocol A

Figure 3.4: Asynchronous protocol

Protocol A is also called the asynchronous replication protocol. In this protocol, write operations on the primary node are considered complete as soon as the data has been written to the local disk and the data packets have been placed in the local TCP buffer. If the primary node crashes, a failover occurs, but the data on the secondary disk remains consistent. The data modified just before the crash may be lost, but the secondary disk stays in a consistent state. This secondary disk can be used in place of the crashed primary disk, or its data can be synchronized to a new primary disk.

3.2.2.2 Protocol B

This protocol is also called the memory synchronous or semi-synchronous protocol. In this protocol, write operations on the primary node are considered complete when the data has been written to the local disk and the data packets have reached the secondary node. With Protocol B, written data is generally not lost during a forced failover, but data may be lost if failures occur at both locations at the same time.


Figure 3.5: Memory or semi Synchronous protocol

Figure 3.6: Synchronization protocol

3.2.2.3 Protocol C

This protocol is also called the synchronous replication protocol. Write operations on the primary node are considered complete only when the writes have been confirmed by both the local and the remote disk. This ensures that no data is lost if a single node fails. However, data may still be lost if both nodes are destroyed irreversibly at the same time, for example due to a power failure at both locations. Protocol C is the most commonly used protocol in DRBD setups [43].

3.2.3 Multiple Replication Transports

Multiple low-level transports such as TCP over IPv4, TCP over IPv6 and SuperSockets [54] are supported by DRBD. This has been available since DRBD v8.2.7. TCP over IPv4 is DRBD's default and canonical transport; any system that is IPv4 enabled can use it. DRBD can also use TCP over IPv6 as its transport protocol, and there should be no difference in performance; the only difference between TCP over IPv4 and TCP over IPv6 is the address scheme. With SuperSockets, the TCP/IP portion of the stack is replaced by an RDMA-capable, single, monolithic and highly efficient socket implementation. The disadvantage of this transport is that it is limited to hardware produced by a single vendor, Dolphin Interconnect Solutions.

3.2.4 Three Way Replication

Figure 3.7: DRBD Three Way Replication

In three-way replication, an additional disk at a third node is added to DRBD, on top of the existing primary and secondary node design. The data is replicated to this third disk, which can be used for immediate backup and disaster recovery purposes. Three-way replication is shown in figure 3.7. Different DRBD protocols can be used within a three-way replication setup; for example, protocol A can be used for the backup while, at the same time, protocol C is used for replicating production data from the primary to the secondary disk.

3.2.5 Efficient Synchronization

To make DRBD operational, the disks at both nodes must first be synchronized so that the data is consistent. Typically the data from the primary node is synchronized to the secondary node. There is a difference between disk synchronization/re-synchronization and disk replication. In disk replication, the write event issued by an application on the primary node to change data on the disk is replicated to the secondary node. Disk synchronization is decoupled from the writes and affects the entire disk. In DRBD, if disk replication is interrupted due to a node failure (primary or secondary) or a network failure, one of the peer disks becomes inconsistent. The data on an inconsistent disk cannot be used, so disk synchronization must be employed to make both disks consistent again. Synchronization is effective because DRBD does not synchronize the modified blocks in the original written order, but rather in linear order: several successive write operations on the consistent disk are synchronized in blocks with the peer disk. Synchronization is also fast because the inconsistent disk is only partly outdated and only the new data blocks need to be synchronized.

When either the primary or the secondary node is affected by a failure, that disk becomes inconsistent. The inconsistent disk cannot be used by DRBD until it is synchronized from the peer node's disk. If the secondary node fails during operation it enters an inconsistent state, but the application running on the primary node can continue without disturbance. When the secondary node is restarted, DRBD initiates disk synchronization so that the changes made on the primary disk are synchronized to the secondary disk. If the primary node fails, applications cannot use it and it must be synchronized from the secondary disk.

In DRBD the initial disk synchronization should be performed at deployment time. If it is done immediately before or while migrating a virtual machine, the synchronization time T_sync and the bandwidth utilized for it will affect the performance of the VM. When a synchronized peer node is removed and a new node is added, the initial synchronization should again be performed during the deployment of the new node. The initial synchronization consumes more network bandwidth up front, but improves the VM's performance during migration.

T_sync is the expected synchronization time. It depends on the volume of data modified by the user application while the secondary node was down and on the rate of synchronization. D is the volume of data to be synchronized, over which we have no influence (it is the amount of data changed by the user application while the replication link was broken). R is the synchronization rate, which is configurable by the administrator and depends on the type of network used and the available network bandwidth.

T_sync = D / R    (3.1)

Formula 3.1: Synchronization time
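As a hypothetical example, if D = 6 GB (6144 MB) of data was modified by the application while the replication link was down and the administrator has limited the synchronization rate to R = 20 MB/s, Formula 3.1 gives T_sync = 6144 MB / 20 MB/s ≈ 307 seconds, i.e., roughly five minutes before the inconsistent disk is consistent again.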

3.2.6 XEN with DRBD

When DRBD is integrated with the Xen hypervisor, it acts as a virtual block storage device. Xen uses the DRBD block device to make the entire content of a domU's virtual block disks available at two different locations by replicating them. When the virtual machine is migrated to the other location, it uses the disks at that location for its read and write I/O operations. In this way DRBD provides redundancy for virtualized operating systems managed by the Xen hypervisor [43]. DRBD supports both x64 and x86 operating systems.


Dual-primary mode must be activated at both locations to enable the DRBD disks to accept write events from the migrated virtual machine. The DRBD configuration file, located at /etc/drbd.conf, should be modified to enable dual-primary mode at both locations, and the Xen virtual disk details in the virtual machine configuration file under /etc/xen/ should be modified to use DRBD. All modifications made to the Xen and DRBD configuration files are described in the appendix. Suppose there is a virtual machine that migrates between locations A and B. Before the virtual machine boots, DRBD scans the disks at both locations and synchronizes them if they are not up to date; the most recently updated disk is used to synchronize the other. When the virtual machine boots at location A, it uses the local disk to store and access its data (i.e., this disk becomes the target for read and write I/O events). The DRBD daemon replicates the changes made on the local disk to the disk at location B using one of DRBD's replication protocols (for example protocol C). During this stage, the disk at location A acts as the primary disk and the disk at location B acts as the secondary disk.
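A minimal sketch of the configuration this requires is shown below; the resource name, host names, backing devices and IP addresses are placeholders (the exact files used in our testbed are listed in the appendix). The net section enables dual-primary mode, and the Xen domU configuration refers to the DRBD resource through the drbd: disk prefix so that Xen's block-drbd helper script can promote the resource on whichever host runs the VM:

    # /etc/drbd.conf (excerpt)
    resource vmdisk {
        # synchronous replication (protocol C)
        protocol C;
        net {
            # allow both nodes to be primary, required for live migration
            allow-two-primaries;
        }
        on hostA {
            device    /dev/drbd1;
            disk      /dev/vg0/vmdisk;
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on hostB {
            device    /dev/drbd1;
            disk      /dev/vg0/vmdisk;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }

    # /etc/xen/vm1.cfg (excerpt): the DRBD resource is used as the VM's virtual disk
    disk = [ 'drbd:vmdisk,xvda,w' ]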

Figure 3.8: DRBD disk status during VM live migration.

When a situation such as maintenance or load balancing requires the virtual machine to be migrated from location A to location B, the DRBD disks enter dual-primary mode for the duration of the migration. In the initial stage of the migration the virtual machine enters the pre-copy phase, where memory pages are copied to the destination in iterations. During this time, the changes made on the disks at location A are still replicated to the disks at location B. Eventually the virtual machine is suspended at location A and resumed at location B. As soon as the virtual machine has resumed at location B and all its resources have been pulled over, the disks at location A change to secondary status and the disks at location B act as primary disks. From then on, all changes made on the disks at location B are replicated to the disks at location A.


                             iSCSI             NFS               DRBD
Category                     Block-storage     File-storage      Block-storage
Storage type                 Network based     Network based     Replication based
Virtualization compatibility Compatible        Compatible        Compatible
Configuration complexity     Medium            Easy              Difficult
Maximum hosts                256               256               3
Maximum volume               64 TB             64 TB             1 PB (1024 TB)
Volume elasticity            Yes               Yes               Yes

Table 3.1: Comparison of storage solutions

3.3 iSCSI Block Storage

Internet Small Computer System Interface (iSCSI) is the most widely used network-based block-access storage technique that relies on IP technology [44]. iSCSI [46] carries SCSI, a popular block transport command protocol, over the traditional IP network to transmit high-performance block-based storage data across Local Area Networks (LANs) and Wide Area Networks (WANs). The technology is well understood, affordable, easy to implement, not limited by distance, scalable and secure. iSCSI enables applications such as remote mirroring and remote backup, and incorporating IP networking into a storage network can remove barriers to implementing and managing networked storage.

3.3.1 iSCSI Architecture

iSCSI has two important components: the iSCSI target and the iSCSI initiator. The iSCSI target is the endpoint where the storage disk is located; it serves I/O data to the iSCSI initiator on request. The iSCSI initiator is the endpoint that acts as a client by starting iSCSI sessions. Applications on iSCSI initiators can thus perform traditional block-level transfers over a common network. iSCSI is built on two well-known protocols, SCSI and Ethernet, which are the dominant standards for storage and networking. Initiators use IP to transmit data to and receive data from the storage disk located at the iSCSI target. The iSCSI initiator uses the SCSI protocol to encapsulate the I/O data generated by the application; these packets are then encapsulated in TCP/IP and sent to the iSCSI target over the network [47]. The iSCSI target decapsulates the I/O data and writes it to the disk. IPsec and the Challenge-Handshake Authentication Protocol (CHAP) can be used to provide authentication between initiator and target.

When a connection is established between the initiator and the target, the disk block is exported from the target and appears as a local disk to the operating system of the iSCSI initiator. The applications running on the initiator are not aware of the iSCSI target. The initiator runs the local file system, which reads data blocks from and writes them to the iSCSI target when required. This is shown in Figure 3.9.

Figure 3.9: Communication between initiator and target

In Xen virtualization, iSCSI is used to store the disks of a virtual machine so that the disks remain available after migrating the virtual machine to another physical host. This requires both physical hosts to connect to the iSCSI target. When the virtual machine is running on physical host A, it accesses the disk at the other location (the iSCSI target) using the iSCSI initiator drivers running on host A. When it is migrated to physical host B (which is also configured to connect to the iSCSI target), it resumes using the disk from the iSCSI target. The process of configuring iSCSI to enable live migration with the Xen hypervisor is described in the appendix.
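For instance, with the open-iscsi initiator the target can be discovered and logged into with commands along the following lines (the portal address and IQN below are hypothetical); once logged in, the exported volume appears as a local block device which can then be referenced from the Xen domain configuration:

    # discover the targets offered by the storage server
    iscsiadm -m discovery -t sendtargets -p 10.0.0.50:3260

    # log in to the discovered target
    iscsiadm -m node -T iqn.2013-01.se.bth:storage.vmdisk -p 10.0.0.50:3260 --login

    # the exported LUN now shows up as a local SCSI disk, e.g. /dev/sdb,
    # and can be used in the Xen domU configuration:
    #   disk = [ 'phy:/dev/sdb,xvda,w' ]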

The physical node hosting a virtual machine needs an iSCSI initiator connected to the network in order to access storage from the target. An iSCSI driver can act as initiator with a standard network card or with a TCP offload engine (TOE) network card, which reduces CPU utilization. The storage device must also implement the iSCSI protocol stack on the target side. Unlike NFS, which is a file storage solution, block storage solutions such as iSCSI do not provide file locking.

An iSCSI target cannot be connected to multiple initiators at the same time unless the exported volume uses a cluster-aware file system such as GFS or OCFS2; in addition, the multi-host feature, which is disabled by default, must be enabled. Each iSCSI initiator has a unique iSCSI Qualified Name (IQN) that it uses to connect to the target, and the iSCSI target generally grants access based on this IQN. A virtual machine cannot be created from different locations using the same disk from the iSCSI target at the same time.
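As an illustration of IQN-based access control, a target managed with the Linux tgt framework could restrict access to the two physical hosts in /etc/tgt/targets.conf roughly as follows (the IQNs and the backing device are invented; depending on the tgt version, initiator-address may have to be used instead of initiator-name):

    <target iqn.2013-01.se.bth:storage.vmdisk>
        # block device exported as a LUN
        backing-store /dev/vg0/vmdisk
        # only these initiator IQNs (physical hosts A and B) may log in
        initiator-name iqn.2013-01.se.bth:hostA
        initiator-name iqn.2013-01.se.bth:hostB
    </target>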

Figure 3.10: iSCSI Block-Access Protocol

3.4 Network File Storage (NFS)

The Network File System (NFS) [48] was developed by Sun Microsystems in 1984. It is a distributed file system that allows clients to access user data over an IP network as if the data were on a local disk. Unlike many other remote file systems implemented under UNIX, NFS is designed to be easily portable to other operating systems and machine architectures. NFS is fast, mature and can handle heterogeneous systems and operating systems.

NFS is a file-access protocol in which part of the local namespace of the NFS server machine is shared with NFS clients [44]. In other words, NFS exports files from the server to the clients; these files can be, for example, disk images, music or documents. Clients use a Remote Procedure Call (RPC) based protocol to access files and metadata on the server. In the NFS storage solution the file system is located on the server and the I/O operations are transmitted over an IP network. The structure of the NFS protocol is shown in Figure 3.11. Four versions of NFS exist: NFSv1, NFSv2, NFSv3 and NFSv4. In the earlier two versions, NFS clients used RPCs over UDP to communicate with NFS servers; the later two versions added features such as TCP support and asynchronous writes, thereby increasing reliability and performance.
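To give a concrete, hypothetical sketch of how NFS is typically used as shared VM storage, the server exports a directory via /etc/exports and each Xen host mounts that export, after which the VM's disk image is referenced as a file-backed disk (all paths and addresses below are placeholders):

    # on the NFS server: /etc/exports
    /srv/vm-images  10.0.0.0/24(rw,sync,no_root_squash)

    # on each Xen host: mount the export
    mount -t nfs 10.0.0.50:/srv/vm-images /var/lib/xen/images

    # in the Xen domU configuration the disk image is then referenced as a file:
    #   disk = [ 'file:/var/lib/xen/images/vm1.img,xvda,w' ]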

References
