State as a Service: Towards Stateful Cloud Services


Master of Science Thesis

Stockholm, Sweden 2012

TRITA-ICT-EX-2012:31

AHMADULLAH ALNOOR


KTH Information and Communication Technology


Supervisor: Lars Hammer, Principal Software Architect, MDCC/Microsoft

Examiner: Vladimir Vlassov, Associate Professor, KTH


Abstract

Cloud ERP, or Enterprise Resource Planning (ERP) as a Cloud Service, delivers value by reducing initial and long-term operating costs, since infrastructure, platform and (certain) application management tasks are delegated to a specialist provider. Questions at the intersection of the ERP challenge landscape and the Cloud Computing opportunity horizon include the characterization of Cloud-friendly ERP modules and the adaptation of stateful (on-premises ERP) components to a stateless platform.

Contributions of this thesis include the R.A.I.N. Cloud fitness criteria, which encompass the Responsiveness, Availability, I/O and Native support aspects of Cloud Services. More importantly, the State abstraction, a reliable and elastic state management framework employing Autonomic Computing and Redo Recovery constructs, is introduced. The abstraction properties, namely affinity-aware state preservation and recovery, are constructed with regard to the Cloud strengths of scaling out and reliability as well as the peculiarities of the Cloud billing model. A proof-of-concept implementation of State as a Service is detailed and evaluated, advocating infrastructure layer support of this kind and associated tooling.


Dedication

To teachers.


Acknowledgment

nani gigantum humeris insidentes (dwarfs perched on the shoulders of giants)

- Bernard de Chartres

This independent work carries enabling contributions from individuals and organizations alike, to whom appreciation is extended.

Gratitude is duly expressed towards Mr. Lars Hammer and Prof. Vladimir Vlassov for their guidance, patience and confidence. Assistance from K.T.H. and I.A.E.S.T.E. with logistics of performing this degree project is also highly valued. Many thanks are in order to Mr. David Worthington, my manager, and the larger group at MDCC (Microsoft Development Center Copenhagen) for the time and resources we shared.

Further debit has been incurred and credit thus offered to my parents and siblings for sharing my dreams and bearing my absence.

Ahmadullah Alnoor 12. February 2012 Virum, Denmark


Contents

4.3.2 Elasticity

4.3.3 Fault Tolerance

4.4 Algorithms

4.4.1 State Preservation

4.4.2 Load Measurement

4.4.3 Load Balancing

4.4.4 Elasticity

4.4.5 Actuator

4.4.6 Session Recovery

5 Implementation

5.1 Design

5.1.1 Cloud Infrastructure

5.1.2 Computation

5.1.3 Persistence

5.1.4 Elasticity

5.1.5 Recovery

5.1.6 Fault Tolerance

5.2 Construction

5.2.1 OrderService

5.2.2 ServiceWrapper

5.2.3 Store

5.2.4 Client Interface (Proxy)

5.2.5 Storage Interface (Proxy)

5.2.6 Actuator

5.2.7 Monitor

5.3 Additions & Refactoring

5.4 Tools & Technologies

5.4.1 Windows Azure

5.4.2 Azure SDK

5.4.3 Microsoft .NET Framework 4.0

5.4.4 Windows Azure Tools for Microsoft Visual Studio

5.4.5 Windows Azure Platform Management Portal

5.5 Code Metrics Analysis

6 Evaluation

6.1 Cost

6.2 Performance

6.3 Reliability

6.3.1 Tenant Service Fails

6.3.2 Monitor Fails

6.3.3 Actuator Fails

6.3.4 Client Interface Fails

6.3.5 Service Recovery

6.4 Scalability

6.4.1 Elasticity

7 Directions\Future & Related Work

7.1 Cloud Integration

7.2 Tooling

7.3 Log Management

7.4 Idempotence

7.5 Further Tests

7.6 R.A.I.N-fall

7.7 Related Work

8 Revision

8.1 Requirements Revisited

8.2 Solution Brief

8.3 Measurement Observations

8.4 Conclusion

A Windows Azure Billing Model

List of Acronyms

B2B Business to Business

B2C Business to Consumer

CO Control Objective

ERP Enterprise Resource Planning

IaaS Infrastructure as a Service

OGSI Open Grid Services Infrastructure

PaaS Platform as a Service

ROI Return On Investment

SaaS Software as a Service

SLA Service Level Agreement

SLO Service Level Objective

SOA Service Oriented Architecture

VM Virtual Machine

WSRF Web Services Resource Framework


List of Figures

1.1 Cloud ERP Offering

1.2 Cloud ERP Adoption Scenarios

4.1 State as a Service Fault Tolerant Architecture for Elastic Stateful Services

4.2 The State Service for Stateful Services

4.3 Resource Elasticity

4.4 Fault Tolerance

4.5 Sample execution of Algorithm 4

5.1 Order Service

5.2 Service Wrapper - Structure

5.3 Service Wrapper - Flow

5.4 State Store

5.5 Client Interface - Structure

5.6 Client Interface - Method Flow

5.7 Storage Proxy

5.8 Actuator - Structure

5.9 Monitor - Structure

5.10 Monitor - Flow

6.1 Average Response times for population sizes

6.2 Connections refused for population sizes

6.3 Response time variation

6.4 Recovery cost distribution

6.5 Requests / second

6.6 % of Processor Time Alloted

6.7 Arc Elasticity

6.8 Elastic Execution


List of Algorithms

1 Log Session Interactions

2 Calculate Performance Counters

3 Rank Service Instances

4 Provision Resources

5 Actuate Elasticity

6 Recover Client Session


List of Tables

5.1 Implementation Code Metrics Analysis

5.2 Test Code Metrics Analysis

6.1 Service Response Time

6.2 Service Response Time Breakdown

6.3 Recovery Cost Distribution


Chapter 1

Vision

Cloud Computing has come of age and has attracted widespread though cautious interest. The Enterprise Resource Planning (ERP) industry relies upon stable contemporary architectures and technologies to offer value to its customers. This chapter explores the intersection of ERP challenges and Cloud opportunities.

1.1 The ERP Problem

Enterprises are typically structured into operational entities woven together by workflow processes. The structure and processes are, to varying degrees, hidden from the clients and partners of the enterprise. Business to Consumer (B2C) and Business to Business (B2B) interactions occur through known and well-defined service points. Organizations strive to avoid propagation of internal and external changes. Cooperation and value addition demand organizing enterprises, often hierarchically, of various sizes and market sectors.

Computerization of enterprises and business has traditionally remained aligned to problem domain structure and dynamics. Contemporary software packages offer various sets of add-on features that build upon common denominator capabilities. Enterprise information and processes are managed internally and selectively exposed via different interfaces. Effort is made to isolate and localize modifications to internal processes and external contracts.

Modeling the enterprise with ERP products confronts organizations with significant challenges. Deployment and upgrade expenses have only increased, and even minor patches come no cheaper. Enterprises continuously spend to provide for and maintain sufficient infrastructure. Ensuring reliability and recovery imposes a further tax. More interestingly, the on-premises installation model hampers on-demand scaling of the service delivered by the ERP product.


Figure 1.1: Cloud ERP Offering

1.2 The Cloud Incentive

The forecast is overcast. The mist of Cloud Computing carries within it the concepts of Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). Here, Infrastructure refers to computing, communication and storage resources, whereas Platform encapsulates enabling resources including operating systems and application development as well as deployment services. Finally, SaaS extends Service Oriented Architecture (SOA) from fine-grained operations to richer applications. The common trait among these Cloud layers is that of utility computing, whereby resources are made available and scaled on demand, allowing for a pay-per-use billing model. Utilities at each Cloud crust are provisioned and reclaimed in an elastic fashion with swift sensitivity to demand[27].

Cloud ERP, the notion of ERP as a Cloud Service, carries exciting opportunities and tough challenges. As a service offering, Cloud ERP delivers value by reducing initial and long-term operating costs: infrastructure, platform and application management are delegated to a specialist organization, allowing the enterprise to focus solely on utilizing the ERP service for increased productivity. The Service Oriented Architecture (SOA) of Cloud ERP facilitates continuous development and deployment, which allows the ERP vendor to add timely enhancements and fixes. Most importantly, Cloud ERP benefits from the Elasticity attributes of the host Cloud Platform, which translate to reliable and cost-effective service delivery.


1.3 Scenarios

Modern day ERP solutions are available in a variety of architectural flavors ranging from monolithic to N-tier configurations. Although ERP solutions are usually deployed within organizational boundaries, a number of interaction scenarios are becoming ever more common. For instance, a 2-tier on-premises ERP might utilize as well as expose XML Web Services. As Cloud Computing gains acceptance, on-premises ERP systems will become intentional or unintentional clients to Cloud Services. These interactions assist in anticipating a typical Cloud ERP offering and its various associated stories as captured in Figure 1.1.

Discourse centered on adoption of Cloud ERP, introduced above, can benefit from inclusion of the Persona concept. “A persona is a description of a fictional person representing a user segment of the software you are developing”[21]. The following personas apply to this discussion.

1. Christine - IT Manager : Employed with ACME Nordic, a small but growing Apparel manufacturer, Christine is responsible for the organization wide IT strategy. Christine strives to wisely spend her budget allocation to ensure that necessary and appropriate technologies are utilized.

2. Julia - Systems Consultant : With years of industry experience in ERP design, development and deployment, Julia maintains the ERP solution, TERP, adopted at ACME Nordic. Christine relies on Julia’s skills and opinion regarding changes and improvements to TERP.

3. Karina - Sales Support : Dealing with Sales Representatives and Customers, taking and recording Orders are good examples of Karina’s daily tasks. Karina is a frequent TERP user and finds it impossibly difficult to complete her duties when TERP is overloaded or offline.

Christine views Cloud ERP as a step forward towards the state of the art in ERP that would reduce operational costs. She, however, has legal and security concerns that require putting an exit strategy in place as well. Christine therefore consults Julia and commissions a preliminary technical investigation. Julia has already conducted basic research of the technology space and, alongside Christine’s interest, is aware of Karina’s hardship during TERP outages and peak hours. Julia shares her initial findings on various Cloud Adoption scenarios with Christine that exhibit interesting analogies to the Water Cycle, as captured by Figure 1.2.

Satisfied with possibilities of reverting to an on-premise installation (i.e. precipitation) or a Cloud/on-premise hybrid setup, Christine decides in favor of investing in Cloud deployment of TERP (i.e. evaporation) instead of licensing an existing Cloud ERP (i.e. sublimation). Julia accordingly begins work on identifying technical requirements for C-TERP - TERP on Cloud.

Early in her work, Julia recognizes that the scalability mechanism employed within Cloud is one of scaling out, whereby multiple instances of a service, each running within its own Virtual Machine (VM), process client requests. In some (connectionless) scenarios, affinity between a specific client and a specific server instance for the duration of the session (i.e. session affinity) is not guaranteed, as stateless services are favored over stateful services. Julia is alarmed since TERP, a stateful service, cannot cope with the absence of session affinity for its web interface: TERP does not replicate client session state/information across servers. Moreover, even if introduced, a basic client-server affinity provision would, in the case of server instance failure, undermine the reliability attribute of the Cloud platform. The latter concern is equally applicable to TERP’s connection-oriented interface for rich (desktop) clients.
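The affinity concern can be illustrated with a minimal Python sketch. All names below (StatefulInstance, RoundRobinBalancer, StickyBalancer, place_order) are hypothetical and correspond neither to TERP nor to any Cloud platform API; both balancers are simplified stand-ins assumed purely for illustration.

```python
import itertools


class StatefulInstance:
    """A service instance that keeps session state in its own memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session id -> list of order lines

    def add_order_line(self, session_id, item):
        self.sessions.setdefault(session_id, []).append(item)
        return list(self.sessions[session_id])


class RoundRobinBalancer:
    """Hypothetical balancer with no session affinity."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def route(self, session_id):
        return next(self._cycle)


class StickyBalancer:
    """Hypothetical balancer that pins each session to one instance."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._pinned = {}

    def route(self, session_id):
        if session_id not in self._pinned:
            index = len(self._pinned) % len(self._instances)
            self._pinned[session_id] = self._instances[index]
        return self._pinned[session_id]


def place_order(balancer, session_id, items):
    """Send one request per item and return the last response."""
    view = []
    for item in items:
        view = balancer.route(session_id).add_order_line(session_id, item)
    return view


items = ["shirt", "jacket", "scarf"]
without_affinity = place_order(
    RoundRobinBalancer([StatefulInstance("a"), StatefulInstance("b")]),
    "karina", items)
with_affinity = place_order(
    StickyBalancer([StatefulInstance("a"), StatefulInstance("b")]),
    "karina", items)
```

Under round-robin routing the final response contains only the order lines that happened to reach the same instance, whereas pinning preserves the whole order. Pinning alone offers nothing once the pinned instance fails, which is the second concern raised above.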

Julia appreciates the fact that Karina’s life would be simpler if her session, spanning valuable time, would never again be lost to an overloaded or failed server. Christine’s interest in capitalizing on Cloud’s elasticity feature also compels Julia to consider the notion of forced migration of a user session, applicable when allocated resources (service instances) are scaled down to avoid underutilization.

Julia, thus, searches for Cloud based solutions which would ensure that a service instance can pick up/resume a user session from the point of (planned or unplanned) departure of the previously serving service instance. Addressing the above requirement by means of modification of TERP is inefficient for the following reasons:

1. Refactoring a large, complex and layered existing code base that is open to customization is most likely to prove an uphill task.

2. Additional modules could increase complexity and add to regression testing cost, as the complementary functionality will not be used in an on-premises setting.

3. The widespread need and significant utility of the identified feature advocates a platform and application independent solution.

The challenges facing the IT Staff and End Users at ACME Nordic provide the motivation for the investigation detailed in this report. The thesis work addresses, in sufficient detail, the properties and design of a service that would abstract away concerns of reliable and scalable state management for stateful services utilizing a stateless platform. Introduction of such a State abstraction will allow higher level services including session management and transaction processing to function with no or minimal modifications.
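One conceivable way to let a fresh instance resume a session is to record each state-changing request outside the instance and to replay that record on takeover. The following Python sketch works under that assumption and should not be read as the design developed in later chapters. SessionLog, OrderInstance and LoggingWrapper are hypothetical names, and the in-memory log merely stands in for durable Cloud storage.

```python
class SessionLog:
    """Hypothetical durable store: an append-only log per session."""

    def __init__(self):
        self._entries = {}

    def append(self, session_id, operation, argument):
        self._entries.setdefault(session_id, []).append((operation, argument))

    def read(self, session_id):
        return list(self._entries.get(session_id, []))


class OrderInstance:
    """Stateful service instance holding order lines in memory."""

    def __init__(self):
        self.orders = {}

    def add_line(self, session_id, item):
        self.orders.setdefault(session_id, []).append(item)

    def remove_line(self, session_id, item):
        self.orders[session_id].remove(item)

    def view(self, session_id):
        return list(self.orders.get(session_id, []))


class LoggingWrapper:
    """Records each state-changing call before forwarding it."""

    def __init__(self, instance, log):
        self.instance = instance
        self.log = log

    def call(self, session_id, operation, argument):
        self.log.append(session_id, operation, argument)
        getattr(self.instance, operation)(session_id, argument)

    def recover(self, session_id):
        """Rebuild a session on this instance by replaying its log."""
        self.instance.orders.pop(session_id, None)
        for operation, argument in self.log.read(session_id):
            getattr(self.instance, operation)(session_id, argument)


log = SessionLog()
first = LoggingWrapper(OrderInstance(), log)
first.call("karina", "add_line", "shirt")
first.call("karina", "add_line", "jacket")
first.call("karina", "remove_line", "shirt")

# The first instance departs; a fresh one takes over the session.
second = LoggingWrapper(OrderInstance(), log)
second.recover("karina")
recovered = second.instance.view("karina")
```

The wrapped service itself is unaware of the log, which mirrors the stated goal of supporting stateful services with no or minimal modifications.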


Chapter 2

Background

Cloud Computing is more often heard about than understood, as it attempts a synergy of Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). ERP, an overloaded term itself, stands for a software architecture as well as application software. This chapter aims to provide a succinct yet sufficient overview of both and thus establishes the necessary context for a comprehensive exploration of the problem space.

2.1 Cloud Computing

2.1.1 Rationale

Computing, storage and communication resources are often underutilized and expensive to maintain which adds to the total cost of ownership. Conversely, demand surge for an under provisioned service may increase response times or, in the worst case, cause service unavailability. Maintenance of the software platform (i.e. configuration and upgrade of system software) has an associated cost as well. Furthermore, application services need to interact among themselves as well as with a variety of clients. These issues have already been investigated in areas of IaaS, PaaS and SaaS. Cloud Computing addresses these challenges by capturing the dependencies among their solutions.

2.1.2 Flavors & Features

Cloud resources can be utilized and managed at various abstraction granularities. The current highest abstraction, termed SaaS, captures the scenario where end users interact with hosted applications over a delivery network. SaaS offerings are supported by lower abstractions referred to as PaaS and IaaS. PaaS encompasses programming and management interfaces to Cloud specific computation, storage and communication resources. IaaS concerns a bare-bones access to the Cloud Grid i.e. machine clusters and their internal network. Management frameworks exist for all three mentioned Cloud abstractions differentiated by their architecture and interfaces exposed.

Being an intermediary abstraction, PaaS attracts aspiring SaaS providers and existing infrastructure proprietors. The majority of commercial Cloud offerings fall into this category. PaaS attributes are detailed here insofar as they apply to this text. Resources provided by a Cloud Platform include compute instances, storage structures and communication channels for external and possibly internal message exchange. Notably, configurable elasticity services are made available.

An assortment of compute instances is often available with increasing CPU strength and memory, storage and I/O capacity. Cloud storage primitives include keyed storage, both structured (table) and unstructured (blob). Certain PaaS offerings provide queue-based storage, primarily aimed at message exchange among compute instances. PaaS providers invest heavily in their delivery network for communication across Cloud boundaries to deliver on the promise of reliable and satisfactorily speedy access to Cloud resources. PaaS citizens (e.g. compute instances) may synchronize via the data center’s internal high-speed network.

Choosing the right Cloud platform (vendor) for a given application has become a problem of plenty. The CloudCmp [25] tool aims to ease the decision making process by highlighting relative strengths (or weaknesses) of a set of Cloud vendors. Selected options are measured from a customer perspective with focus on efficacy of compute, storage, communication and content distribution facilities. Supported comparisons attempt comprehensive coverage of functions common to the study set. Evaluation indicates the absence of a clear winner, with different vendors performing better on different fronts. Customers are thus required to investigate which vendor best resolves their application bottlenecks.

The business model behind Cloud Computing translates into public and shared provisioning of Cloud resources, hence the term Public Clouds. Concerns over security and administrative control of Public Clouds are being addressed with Private Clouds, i.e. overlaying private data centers with Open Source or Proprietary IaaS and/or PaaS solutions.

Disparity among products from various PaaS vendors has motivated research into interface consistency across PaaS offerings. Conducted under the banner of Meta Cloud[48], this work aims to avoid Cloud vendor lock in and ease migration between Cloud platforms. Increased interoperability among IaaS and PaaS solutions would support the notion of Hybrid Clouds - Cloud Platforms spanning Public and Private Clouds.


2.1.3 Adoption

Motivation and hurdles down the road leading to Cloud adoption vary across enterprise and consumer segments. Heavyweight enterprises eye reduced maintenance cost of applications, data and infrastructure. However, concerns over security guarantees and compliance with the Service Level Agreement (SLA) exist (where SLA refers to a commercial contract between service provider and consumer regarding quantifiable service characteristics). Small and medium businesses are lured by early Return On Investment (ROI). Still, complexities of Billing Models, Cloud migration costs and lack of Cross-Cloud interoperability/integration are slowing down adoption in small to mid-sized market segments.

Enterprises are investing in Private and Community Clouds to mitigate the security and SLA violation risks. Customers from the consumer sector prefer specialized and hybrid Cloud services over Cloud-only offerings.

Cloud Adoption is predicted to gain pace as challenges of data and application security and compliance with SLAs and Government regulations are addressed. Maturity of the relevant technologies and a Cloud Ecosystem (with demonstrated interoperability) will rightly accelerate prevalence of Cloud services. Efforts aiming for Cloud interoperability include WebSphere Cast Iron[20] from IBM and the Open Cloud Computing Interface[46] working group.

2.2 Enterprise Resource Planning

Organizations of all sizes in public and private sector rely heavily on a number of computational resources to execute processes of varying complexities. Domain requirements have motivated research and development in business software and hardware technology. Innovation in computing industry has also seen acceptance across user groups and thus been refined for individual domain segments.

2.2.1 Early Days

The 1960s saw the initial significant computerization of certain business processes such as accounts and inventory management. Clarity of processing rules and accuracy of expected results appear to be the implicit selection criteria for suitable candidate processes. This first generation was accordingly termed MRP1 for Materials Requirement Planning. With the house or rather back office in order, focus shifted to automating processes that cross organizational boundaries. MRP2 (Manufacturing Resources Planning) rolled out support for procurement and product assembly processes. Advancement in personal computing and network technologies during 1980-1990 facilitated development of enterprise-wide solutions. Heavy duty software packages, rightly called Enterprise Resource Planning (ERP), integrated disparate departments and streamlined distributed processes. Business functions exclusively catered for by ERP suites included Supply Chain Management (SCM), Customer Relationship Management (CRM) and Human Resources Management (HRM) to name a few.

2.2.2 Contemporary Solutions

Categorization of the plethora of commercial ERP offerings requires specification of the aspect of interest. Example classification dimensions include customization/extension mechanisms, feature set and deployment architecture among others. All major ERP suites provide extension interfaces and tools to allow for customer tailored solutions. Certain ERP packages provide exceptional support for a subset of ERP functions. Deployment architecture options include legacy monolithic, modularized, tiered and hosted solutions.

The choice of deployment architectures combined with the feature strength and extensibility (stitching and customization) allows definition of rich installation options. A few scenarios of industry interest are outlined below.

• Tailored Modularized installation where select business functions receive custom support.

• Tailored Single Vendor installation where a customized feature-rich suite is deployed organization wide.

• Hosted installation where business functions, possibly a select few, utilize a generic software delivered as an online service by a particular vendor or intermediary (partner).

• Tailored Multi-Vendor installation where products from different vendors are adopted across the organization. The product installation at each department could be one of the three options listed above. Adapters may have been created to integrate this heterogeneous environment.

2.2.3 Research Challenges

Problems and/or opportunities are aplenty owing to the sheer depth and breadth of the domain. Issues of relevance to this text are covered in the introductory [Chapter 1] and concluding [Chapter 8] chapters. This section lists peripheral threads of research work.

• Interoperability

Advancement in distributed computing technologies has simplified application to application interaction. Agreement among ERP suites on semantic representation of business processes and information is yet to be achieved. ERP Interoperability aims to allow definition and execution of business processes over a variety of ERP suites.

• Agility

ERP adoption and deployment projects are expensive and risky. Software Vendors and ERP users both desire more agile deployment, migration and upgrade tools and processes.

• ERP 2 - Business Intelligence

An ERP package of any scale archives an ever growing mass of data. ERP-2 leverages these records beyond reference purposes to deliver “decision support” by utilizing concepts and technologies from business intelligence research.

2.2.4 Industry Offerings

Close alignment between the size and variety of ERP customers and vendors explains the abundance of ERP software packages on shelves today. The richness of the current ERP install base has already been discussed in section 2.2.2. The following is a categorized selection of noteworthy options at hand.

• Proprietary Small-Medium Business

SAP Business One, Infor 10 ERP Business, Microsoft Dynamics NAV

• Proprietary Large Enterprise

PeopleSoft, SAP Business Suite, Microsoft Dynamics AX

• Open Source

Compiere, OpenPro, OpenERP

Industry heavyweights and startups alike are adding to the momentum towards Cloud ERP with SaaS products designed and often delivered with SOA. Customers from various market segments, especially small and medium enterprises, are buying into the benefits of reduced ownership costs and on demand customization and provisioning of services. Visible alternatives include SAP Business ByDesign, Salesforce.com and Microsoft Dynamics CRM Online.


Chapter 3

Analysis

Cloud ERP deployment, partial or absolute, necessitates Cloud profiling of candidate ERP services. Relevant guidance would aid with adaptation of existing ERP services for the Cloud as well as adoption of available and upcoming Cloud based ERP services.

Also, existing on-premises ERP application components demand robust state management from a stateless platform, including broad allowance for storage, retrieval, preservation and recovery of application-wide as well as client-specific state data. Generalizing the problem, this chapter introduces the State abstraction or State as a Service and specifies associated reliability, scalability and load balancing requirements.

3.1 Cloud Service Characteristics

Presenting concise criteria to spot candidate Cloud services is complicated by the variety of technologies and usage scenarios involved. Nonetheless, the question must be addressed to provide initial guidance when migrating existing, designing new and maintaining deployed Cloud services.

The following sections describe R.A.I.N. [Responsive, Available, I/O, Native], a Cloud fitness assessment guide that captures strengths as well as constraints of contemporary Cloud offerings. Inspiration and justification for the guidance presented here have been gathered from surveys of current academic and commercial publications referenced below.

3.1.1 Responsive

Services required to promptly respond to changes in usage patterns and functionality expectations can benefit from elasticity [3] and continuous deployment facilities [34] of Cloud infrastructure. This combination of facilities is unique to Cloud Computing, enabling Cloud citizens (i.e. Cloud based Services) to react to consumer demands and expectations. The Cloud vendor managed maintenance model shortens the duration to apply updates and patches.

3.1.2 Available

Cloud compute, storage and communication facilities exist for services with high availability requirements to capitalize on. Computation instances (virtual machines) are monitored to ensure the required number of resources is always served. Various semantically versatile persistence mechanisms, including queues, blobs and relational and non-relational table storage, exist with efficient access and reliability ensured through redundancy. Services hosted within Cloud can interact within and across Cloud boundaries by exposing internal and external end points for various protocols including TCP, HTTP, SOAP and REST. In addition to simple request-response interactions, Microsoft Azure’s App Fabric Service Bus[31] infrastructure supports multicast and publish-subscribe architectures as well as service naming and advertisement.

3.1.3 I/O

The cost of reading and writing data is highest among all Cloud facilities [8]. Compute-intensive applications incur less overhead to perform their function. Data-intensive applications[16], however, can add to the service invoice and must minimize the movement of data between computation nodes and storage as well as across storage locations. Not all storage options carry the same price tag, and different options address different problems. Care must therefore be taken to choose the appropriate storage structure, media and location for the problem at hand.

3.1.4 Native

Providing a utility infrastructure such as Cloud means scoping the degree of access to platform services. For instance, Google AppEngine requires applications to be single threaded and execute for a known period in a sand-boxed environment [25], whereas the Microsoft Windows Azure platform defines a service life cycle to which services need to adjust[41]. Architecture of Cloud based applications must consider these limitations and strive to ground candidate designs in elements native to Cloud.

3.2 The Nature of State

Program state is arguably the most enabling advancement following the “Stored program” concept. Previous work has broadly categorized “state” based on its scope, location and persistence. Overview of existing relevant background material appears below with references.


3.2.1 Application State

This class of state denotes the data maintained by the application for the application. The subject set of data covers configuration settings, policies and the like. The key characteristic of application state is its disassociation from all entities (including application resources and users) and a lone binding to the application itself. Note that the state of a web application is considered private to application instances that co-inhabit a web server/host[23].

The obvious choice for application state placement is within secure proximity of the application. Designers may choose to store application state on disk (database or files) or memory[52].

3.2.2 Session State

The state of a particular Client Server Interaction (Session) may refer to all or one of the following:

• the state of the service i.e. state of server objects

• the state of the client i.e. state of client objects

Session state can be persisted as current values or as a history of modifications to the relevant objects. Session state persistence options include memory or serialized objects, local files or database records. State can be stored either at the client, at the server or distributed among the two. The persistence options and location give rise to issues of development effort, access speed, bandwidth needs, isolation, failure handling and session migration or session affinity [24].
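The difference between persisting current values and persisting a history of modifications can be sketched as follows. This is an illustrative Python sketch only; the class names, the (field, value) change format and the JSON encoding are assumptions made for the example.

```python
import json


def apply_change(state, change):
    """Apply one (field, value) modification to a session state dict."""
    field, value = change
    updated = dict(state)
    updated[field] = value
    return updated


class SnapshotPersistence:
    """Stores only the current values, serialized as JSON."""

    def __init__(self):
        self.blob = json.dumps({})

    def record(self, change):
        state = apply_change(json.loads(self.blob), change)
        self.blob = json.dumps(state)

    def restore(self):
        return json.loads(self.blob)


class HistoryPersistence:
    """Stores every modification; state is rebuilt by replay."""

    def __init__(self):
        self.changes = []

    def record(self, change):
        self.changes.append(json.dumps(change))

    def restore(self):
        state = {}
        for raw in self.changes:
            state = apply_change(state, tuple(json.loads(raw)))
        return state


changes = [("customer", "ACME Nordic"), ("quantity", 10), ("quantity", 12)]
snapshot, history = SnapshotPersistence(), HistoryPersistence()
for change in changes:
    snapshot.record(change)
    history.record(change)
```

Both variants restore the same state. The snapshot rewrites the whole serialized state on every change, while the history appends one small record per change and pays for it at restore time, which is one facet of the access speed and failure handling trade-offs mentioned above.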

3.3 Stateless Cloud - Stateful Service

Operator/configuration errors and failure of front-end software have been identified as the most significant causes of service failure such as service unavailability or malfunction[47]. Functional correctness and availability improve with production tests, failure monitoring and redundancy; all these measures are supported by Cloud offerings with staged deployments, diagnostics services and scalability options.

Alongside the aforementioned services, Cloud citizens continue to enjoy the standard means to secure application state detailed previously, yet face similar shortcomings in preserving session state. Decisions on location and persistence balanced against maintainability, performance and fault tolerance must still be made for client session data. More specifically, the Cloud's inherent notion of elasticity (scaling out and down) necessitates tailored treatment of session affinity and migration issues. Session affinity provision needs to ensure correct client-server pairing as new server instances appear, with appropriate session migration ensured when they depart (un)expectedly. Concerns surrounding session and application state preservation and recovery have previously been addressed and provide guidance for a Cloud friendly solution.

3.3.1 Server Side State

Retaining session state at Server remains attractive since locality of business logic and parameters (state) is ensured. An associated multi-tiered architecture for WWW deployment of stateful applications that interact with persistent storage (Databases) is presented in [18]. The proposed system supports applications that utilize socket based communication and are capable of producing HTML output. Session state is preserved with a session manager process that ensures sticky sessions utilizing the Cookie mechanism. There is no recovery support provided to handle application (service) failures nor is servicing of client request aided with statistics on application load to deal with demand peaks and slumps.

Server-side management of application state is more of a necessity than convenience. Scenarios where multiple instances of an application/service execute in parallel demand externalization of application state to ensure availability and consistency. The work presented in [52] lists and compares three techniques for application state preservation for the particular case of Web Services. As proposed, application state could be retained in-memory by a state server or written to a database on disk. Also, a proxy may be introduced to forward requests to internal processes for the actual computation thus eliminating state management concerns for the Web Service.

The state server approach has also been treated as an extension of Web Services Resource Framework (WSRF)[54] and compared against alternatives of basic Web Services deployment, Open Grid Services Infrastructure (OGSI) and WSRF itself. The study reports benefits of state externalization and persistence similar to WSRF with the additional capability to specify the location of the state repository which in turn resolves state privacy and security concerns.

Evaluation presented in [52] shows persistent storage of application state to perform similarly to a dedicated state server, with the former being capable of tolerating service failures. The choice can be made easier if the overhead imposed by transaction processing inherent to most RDBMS can be alleviated.

Delegation of state management has also been investigated in the context of session state. A detailed study [26] found session state to be short lived, client (session) specific and requiring serial access only. The mentioned work suggests a session state store with a basic read and write interface exposed by stub components with an underlying implementation composed of bricks (where a brick is a simple assembly of compute, storage and network components). The state store exhibits self-tuning, self-protection and self-healing properties by employing techniques of timeouts, admission control and read, write sets.

3.3.2 Client Side State

Alternatively, session state can be maintained at client side and forwarded to a designated server (possibly from a server pool) with each request. Related work reported in [14] proposes selective client-server state exchange of immutable/mutable and private/public nature with appropriate frequency to reduce performance overhead. Security concerns such as replay attacks and byzantine clients are addressed with validity ranges for state values and sequence numbers for requests supported by basic encryption and digital signatures. The proposal falls short of coping with parallel session creation (forking) and session recreation (reversal) on a sister server. Maintaining session state on the Client introduces the risk of Client (Agent) software and/or hardware failure in the absence of a backup mechanism similar to the one enjoyed by a Cloud based Server.

3.3.3 Virtual Machine Cloning

Virtualization of resources is a key mechanism for Utility Computing and generally utilized by Cloud Computing Platforms. On Demand Cloning of a Virtual Machine (VM) may potentially serve the requirements of scalable robust state management. The SnowFlock [22] system presents VM forking as a Cloud abstraction that derives inspiration from UNIX style process forking. An API exists to spawn and delegate tasks to stateful child VMs as well as to coordinate parent-child interaction. Major impediments of state transfer from parent to child have been addressed with delayed and selective propagation approaches that employ unicast as well as multicast communication. The system can be controlled from within applications and with scripts through C++ and Python language bindings.

Despite its richness, the system described deals primarily with issues surrounding creation and maintenance of VM resources and caters less to the needs of a user facing service except for applications from a certain class (parallel processing, load balancers). Benefiting from the framework presented would require introduction of additional logic to existing services and yet would not benefit from suggested improvements in VM creation and maintenance handled by the Cloud platform implicitly. Finally, the proposal of treating VM as expendable and short-lived similar to UNIX process does not hold in public Clouds where (long lasting) VM resources are billed hourly.


3.3.4 Redo Recovery

Redo Recovery serves the needs of both session and application state while providing facilities of state externalization and fault tolerance. Previous efforts have put in place the notion of interaction contracts between components of persistent and transactional nature as well as with external components. These contracts constrain inter-component message passing to ensure exactly once execution semantics with reduced logging cost and recovery independence. The Phoenix/App [10] system provides a framework that implements these contracts using .NET Runtime Services. Applications/Services (components), both stateful and stateless, benefit from a logging and monitoring mechanism that ensures failed components are automatically recovered via message replay. Furthermore, the message log can be used as an activity trace for debugging purposes.

The Phoenix/App system does not capitalize on Cloud facilities of elastic compute and storage resources, and does not address load balancing concerns. Furthermore, applications require modification to benefit from the proposed framework. Still, the approach as well as the results presented in [10] are attractive and provide pointers to a candidate solution of the broader problem at hand.

3.4 State Abstraction - State as a Service

Surveys of the nature of state and existing state management approaches as well as desired characteristics of a Cloud service provide ample background and support for defining the State abstraction with following characteristic guarantees.

1. Application and session state can be stored and retrieved using standard and Cloud specific primitives.

2. Session state management (i.e. creation, maintenance and disposal) scales out and down in a load balanced and affinity aware manner.

3. Session state preservation and recovery is ensured via message logs and replay.

3.5 Autonomicity

Reliable, scalable and load balanced delivery of the State abstraction under investigation can benefit from concepts and methods developed in the field of Autonomic Computing where the primary focus is placed on issues related to self-managing systems with self-{configuring, healing, optimizing and protecting} capabilities. In short, such systems utilize a control loop designed to reach and effect a verdict, based on measurements, that meets defined objectives regarding the state of a (resource) component. Correct execution of the control loop is aided by data sensors and filters that update a system model of the managed system (resource), which is in turn consumed by an estimator to produce predictions for planning and actuation purposes[51]. A breadth of related knowledge and techniques exists for selection and application to our particular problem.

3.5.1 Goals & Means

The purpose served by an autonomous system can be captured with an SLA between the provider and its clients regarding certain system aspects e.g. availability, performance etc. The survey reported in [55] reviews contemporary solutions, in particular control theoretic approaches, to the problem of specifying and honoring an SLA. The choice(s) of Control Objectives and Adaptation Mechanisms have been determined as key solution characteristics.

Regulation of certain resource characteristics (i.e. processor usage and memory availability) is selected as the only Control Objective (CO) and resource (de)allocation as the sole adaptation mechanism to ensure proportionate coverage of identified requirements for the State abstraction. The choice of the mentioned CO finds motivation in the correlation of state management (both in-memory/serialized session state and application state) with CPU and memory consumption. Additional support is on offer in the simplicity with which the CO can be measured and desirably influenced with the adaptation mechanism preferred above. The identified control objective and adaptation mechanism provide the terms used to define the applicable set of Service Level Objectives (SLOs) that would constitute the governing SLA.

Alternative COs considered include service response time and concurrent connection count. Variance in measurements for these COs could be attributed to external factors including persistent storage interaction delay and server connection pool size, among others. The difficulty involved in accurately associating these COs with state management and deterministically adapting to their variations with resource (de)allocation makes them less attractive choices; they are hence not employed.

3.5.2 Healing & Optimization

Systems built on a Cloud platform benefit from the inherent configuration and security apparatus, thus allowing focus on concerns of robustness and elasticity. The self-healing aspect of the State abstraction carries multiple interpretations; the autonomous system itself (i.e. the State abstraction) benefits from remedial features of the Cloud platform whereas consumers of the State abstraction are tended to as detailed in section 3.3.4. Optimal resource utilization i.e. avoidance of both over and underutilization is complicated by the difficulty involved in modeling system and resource state, correctness of measurements and consequent plans as well as timing of enactment.


Mentioned challenges can be overcome by considering a simple maintainable system model that supports frequent updates allowing generation of timely and sound predictions. Timeliness of planned actions can be improved with a hybrid approach that utilizes patterns observed in recent history with current measurements for a pro-active response instead of pure reaction.

3.6 Summary

Earlier account has provided succinct selection criteria for potential Cloud citizens. Utility of provided guidance could now be determined by applying the same when attempting to answer the larger question of supporting Stateful Services in a Stateless Cloud setting. Lessons gathered from existing approaches towards state management, redo recovery and schemes for organizing autonomous systems may now be applied to define and describe a candidate solution. Know-how acquired on Cloud Service design and development would inform solution architecture with technical opportunities and limitations.


Chapter 4

Solution

Analysis results allow for specifying solution properties that provide the necessary reference for State as a Service architecture. The control mechanisms within and across architecture components are also presented in this chapter.

4.1 Properties

Study of the problem domain for requirements and existing solutions surfaced the desired solution attributes. The combination of these high level solution properties is unique to the proposed solution.

1. State Preservation

Support for managing both service/application and session state is required. The solution needs to provide interfaces for active preservation of application state. Session state must also be passively maintained with session affinity. Interaction between service and storage has to be managed as well. Timely cleanup of preserved state information must be performed.

2. Fault Tolerance

Failure detection and recovery should aim for masking all service failures from clients. Failure detection may inform of false-negatives but must never notify of false-positives. Failure recovery must never interfere with existing healthy sessions. Non-idempotent operations must never be repeated.

3. Elasticity

All solution aspects must scale to demand. This requirement applies to state preservation, fault tolerance as well as service usage. The scalability notion is not limited to scaling-out but also covers scaling-down.

4. Cloudy

Alongside elasticity, other Cloud services should be leveraged whenever feasible. Candidate services include Performance Counters and Messaging facilities.


4.2 Architecture

Consideration of desired solution properties translates into the architecture presented in Figure 4.1. Functional description of individual components follows.

4.2.1 Components

1. Service

Scenarios of End-user interest are realized with individual or a congress of components that embody and serve the necessary capabilities, hence the term “Service”. Services need present a known contract (interface) to publicize supported operations and associated data structures. The Service component acts as a consumer in the architecture illustrated, utilizing functions offered by surrounding components.

2. Client

End-users typically use a graphical, commandline or programmatic interface to interact with a remote Service. The Client component represents one of several such User Agents (e.g. Web Browser or a Graphical User Interface based application). The client component is a direct though oblivious beneficiary of the architecture described since all functions except the desired Service are kept transparent.

3. Client Proxy

The requirements on the Client Proxy include load balancing as well as logging Client-Service interaction for recovery purposes. Moreover, our architecture should support services that use stateful (TCP) and stateless protocols (HTTP). Design goals applicable can be summarized as follows:

a) Client-Service interactions for all services should respect session affinity

b) An incoming session should be created on the most suited (least busy) service instance.

c) Client proxy should not become a performance bottleneck

4. Storage Proxy

The function of the storage proxy is to ensure exactly once execution for SQL operations. Achievement of this requirement serves the following design goals:

a) Session recovery does not write/modify persistent state

b) Same persistent state is read during execution and recovery

c) Recovery is not coupled with service or storage


5. Monitor

Timely elasticity, load distribution and fault tolerance are realized with the Monitor component which maintains a global view on the state (i.e. condition) of sister components.

The SLO involving regulation of resource characteristics (described in section 3.5.1) is realized with a resource consumption model for the computational resources involving variables of CPU Utilization and Memory Availability. Monitor performs measurement queries to update the resource consumption model, calculates proactive resource provisioning estimates and sends an appropriate (i.e. SLO compliant) scaling signal to the Actuator. Estimates are not reacted upon while the Actuator performs scaling to allow Actuator actions to gain effect. Service instance failures are tolerated with an orchestrated effort that includes the Client Interface, Storage Proxy and the underlying Cloud infrastructure.

6. Actuator

This component exposes a simple interface with methods for acquiring and releasing Cloud resources and in effect works towards ensuring resource (de)allocation SLO. Acquisition is preempted to ensure safety property of SLO compliance. Release is delayed to meet the liveness property of cost effectiveness.

7. State Service

Interface to reliable Cloud storage is exposed via the State Service. Supported operations include reading and writing session and application state as structured and/or non-structured data.
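The Storage Proxy design goals listed above (recovery neither writes nor re-reads persistent state) can be illustrated with a minimal sketch. It is not the thesis implementation: the class name `StorageProxy`, its methods and the `execute` callback are hypothetical, and results are assumed to be recorded per client in arrival order.

```python
from typing import Any, Callable, Dict, List


class StorageProxy:
    """Hypothetical proxy: records SQL results per client, replays them in recovery."""

    def __init__(self, execute: Callable[[str], Any]) -> None:
        self._execute = execute                 # callback that really runs the SQL
        self._log: Dict[str, List[Any]] = {}    # client id -> recorded results
        self._cursor: Dict[str, int] = {}       # client id -> replay position

    def begin_recovery(self, client: str) -> None:
        """Put the client's session into recovery mode (replay from the start)."""
        self._cursor[client] = 0

    def run(self, client: str, sql: str) -> Any:
        results = self._log.setdefault(client, [])
        position = self._cursor.get(client)
        if position is not None and position < len(results):
            # Recovery mode: return the saved result without touching storage.
            self._cursor[client] = position + 1
            return results[position]
        # Normal mode, or replay log exhausted: execute once and record the result.
        self._cursor.pop(client, None)
        result = self._execute(sql)
        results.append(result)
        return result
```

During replay the service receives the same results it saw before the failure, so non-idempotent statements are never issued to storage a second time.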

4.2.2 Completeness

Architecture components cover defined solution properties as described below.

1. State Preservation

Client Proxy load balances client requests across available resources such that session affinity is preserved. Storage Proxy records the interaction between service and persistent storage to ensure only once execution of non-idempotent actions.

2. Fault Tolerance

State Service stores soft (session) state. Service state can be reconstructed via message log based replay. Services may actively save and restore their state from the state store.

3. Elasticity

Monitor tracks service instance usage and computes resource needs that meet SLOs. Resources are acquired and released by the Actuator.


Figure 4.1: State as a Service


4. Cloudy

State Service uses Cloud storage where reliability and scalability are ensured by load balanced redundancy of data objects. The cost of the state service is minimized with proximity data placement and batch read and write operations.

4.3 Use Cases

Functional compliance of the proposed service is shown utilizing candidate use cases that touch upon reliability and scalability scenarios. Architecture components and relations that do not take part in execution of the subject Use Case are filled white in the associated figures.

4.3.1 The State Service for Stateful Services

For each Client, Client Proxy queries Monitor for the suited service instance to ensure load balancing. Subsequent requests from the client are forwarded to the same service instance so that session affinity is preserved. Client Proxy logs client messages for playback. Service itself may also store session data in session objects. Storage Proxy intercepts and records the interaction between Service instance and Storage. Both Client and Storage Proxies periodically write session message logs to the State Service. Service instance may also persist in-memory session state with the State Service. Figure 4.2 captures this scenario.

4.3.2 Elasticity

The architecture makes use of two resource types, Service Instances (a compute resource) and State Service Capacity (a storage resource). Client Proxy detects session terminations and frees the space occupied by the message logs and SQL results for the service instance. Monitor periodically calculates service instance usage and sends a scaling signal to the Actuator to start/shut down instances so that SLOs are met and the SLA is not violated. Service instances are assumed to take the responsibility of freeing up space taken by their session objects when feasible. This Use Case is depicted in Figure 4.3.

4.3.3 Fault Tolerance

Service instance departure or failure triggers recovery of orphaned Client sessions. The mechanism employed is that of redo recovery which resurrects selected Client sessions on a healthy Service instance. This approach is different from traditional session migration which requires setting up session state externalization and Service instance fail-over schemes. Interestingly, conventional session migration could be supported as well with the Client Proxy and Monitor ensuring fail-over without redo-recovery and State Service providing the necessary persistence primitives.

Service failure can be detected by the Monitor during its periodic health checks or by the Client Proxy when attempting to forward a client request. Upon failure detection at Client Proxy, Monitor is queried for a suited service instance and in turn notifies Storage Proxy of the recovery process at the selected healthy service instance. In Recovery mode, Client Proxy plays back logged messages whereas Storage Proxy returns saved SQL results to bring the service to the state before failure, at which point the next client message is sent to the service instance. If the failure is detected by the Monitor, a recovery signal is sent to the Client and Storage Proxy to execute recovery at a particular Service Instance for a Client. Race conditions where the failure is detected simultaneously by the Client Proxy and Monitor are handled at the Client Proxy to avoid unnecessary recovery measures. A graphical rendition of failure detection and recovery is presented in Figure 4.4.

Figure 4.2: The State Service for Stateful Services

Figure 4.4: Fault Tolerance

4.4 Algorithms

4.4.1 State Preservation

Session state is preserved (for recovery purposes) as message logs of the request response interaction between the Client and Service instance involved. The interception mechanism employed by the Client Proxy also embeds a load balanced session affinity facility as shown in Algorithm 1. A similar flow is employed by the Storage Proxy to log the Service to Storage interaction associated with the driving Client session.

Algorithm 1 Log Session Interactions

loop
    Request ← Read
    Server = ø
    Client ← GetClientIdentifier(Request)
    Server ← PreserveAffinity(Client)
    if Server = ø then
        Server ← EstablishAffinity(Client)
    end if
    LogRequest(Request, Client)
    Response ← RelayRequest(Request, Server)
    LogResponse(Response, Client)
end loop
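The flow of Algorithm 1 can be sketched in Python as follows. This is an illustration only: the `ClientProxy` class, the `pick_instance` callback (standing in for a query to the Monitor) and the `relay` callback are hypothetical names, and the log is kept in memory rather than written to the State Service.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ClientProxy:
    """Hypothetical in-memory version of Algorithm 1 (Log Session Interactions)."""

    def __init__(self, pick_instance: Callable[[], str],
                 relay: Callable[[str, str], str]) -> None:
        self._pick_instance = pick_instance   # assumed to return the least busy instance
        self._relay = relay                   # forwards a request to a given server
        self.affinity: Dict[str, str] = {}    # client id -> server
        self.log: Dict[str, List[Tuple[str, str]]] = {}

    def preserve_affinity(self, client: str) -> Optional[str]:
        # Returns the server already bound to this client, if any.
        return self.affinity.get(client)

    def establish_affinity(self, client: str) -> str:
        # Binds a new client to the instance chosen by the load balancer.
        server = self._pick_instance()
        self.affinity[client] = server
        return server

    def handle(self, client: str, request: str) -> str:
        server = self.preserve_affinity(client)
        if server is None:
            server = self.establish_affinity(client)
        entries = self.log.setdefault(client, [])
        entries.append(("request", request))      # LogRequest
        response = self._relay(request, server)   # RelayRequest
        entries.append(("response", response))    # LogResponse
        return response
```

Each call to `handle` corresponds to one iteration of the loop in Algorithm 1, with the client identifier assumed to have been extracted from the request beforehand.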

4.4.2 Load Measurement

The Monitor component sets up a table PerformanceCounters, with the below structure, to record the performance counters of interest for instances of service and proxy components.

PerformanceCounters : {Component, Instance, CounterType, CurrentValue, OldValue, Rank}

The table stores the current as well as the previous value for each performance counter. In accordance with the SLO on regulation of resource characteristics, outlined and motivated in section 3.5.1, the selected performance counters include CPU and Memory usage. Periodic updates to this table are required; these are realized with either custom code or an existing platform service. At interval MeasureInterval, a load based ranking of all instances is computed and written (with the Update procedure) to PerformanceCounters as described by Algorithm 2.
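As a rough illustration of the ranking step referred to as Algorithm 2, the sketch below ranks table rows per counter type. The row layout follows the PerformanceCounters structure; the function name `rank_counters` is hypothetical, and the ordering is an assumption: since both counters measure spare capacity (idle processor time, available memory), the largest current value is given rank 1.

```python
from typing import Dict, List

COUNTER_TYPES = ("IdleProcessorTime", "AvailableMemory")


def rank_counters(table: List[Dict]) -> None:
    """Write a per-counter-type rank into each row; rank 1 is the least loaded."""
    for counter_type in COUNTER_TYPES:
        rows = [r for r in table if r["CounterType"] == counter_type]
        # Assumption: a larger idle time or available memory means less load.
        rows.sort(key=lambda r: r["CurrentValue"], reverse=True)
        for rank, row in enumerate(rows, start=1):
            row["Rank"] = rank   # corresponds to the Update procedure
```

Rows with equal values receive consecutive ranks in their original order; a real deployment might prefer shared ranks for ties.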

4.4.3 Load Balancing

Algorithm 2 Calculate Performance Counters

CounterTypes = {IdleProcessorTime, AvailableMemory}
for all counterType ∈ CounterTypes do
    CounterValues = ø
    for all counter ∈ PerformanceCounters do
        if counter[CounterType] = counterType then
            CounterValues = CounterValues ∪ counter
        end if
    end for
    RankOnCurrentValue(CounterValues)
    for all c ∈ CounterValues do
        Update(PerformanceCounters, c)
    end for
end for

The Client Proxy component queries PerformanceCounters to determine the most suitable service instance for the next client session. For our case, the ideal candidate instance will have the least CPU usage and the most available memory as shown in Algorithm 3. A SQL (Structured Query Language) like syntax is used for the sake of clarity; equivalent iterative algorithms exist. The listed query returns the component of type ParamComponentType with the lowest maximum of rank values over all performance counter types. Compared to other instances, this high ranking instance has smaller rank values for Performance Counters.

Algorithm 3 Rank Service Instances

SELECT TOP 1 Instance, MAX(RANK) AS Ranking
FROM PerformanceCounters
WHERE Component = ParamComponentType
GROUP BY Instance
ORDER BY Ranking ASC
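One possible iterative equivalent of the query in Algorithm 3 is sketched below. The function name `best_instance` and the dictionary row layout are assumptions made for illustration; ties are resolved in favour of the instance encountered first, which the query itself leaves unspecified.

```python
from typing import Dict, List, Optional


def best_instance(table: List[Dict], component: str) -> Optional[str]:
    """Return the instance whose worst (largest) rank is the smallest."""
    worst_rank: Dict[str, int] = {}
    for row in table:
        if row["Component"] != component:     # WHERE Component = ...
            continue
        instance = row["Instance"]            # GROUP BY Instance
        worst_rank[instance] = max(worst_rank.get(instance, 0), row["Rank"])
    if not worst_rank:
        return None                           # no instance of this component type
    # ORDER BY Ranking ASC, TOP 1
    return min(worst_rank, key=lambda name: worst_rank[name])
```

Taking the maximum rank per instance means an instance is only preferred if it does well on every counter type, not just one.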

4.4.4 Elasticity

As listed, Algorithm 4 aims to achieve timely elasticity. The core of this scheme is a rate based calculation. For each counter type, the current values and the changes since the previous measurement (current minus old value) are summed over all instances. The sum of these two totals, averaged over the number of instances, is set as the demand forecast. Resource adjustment is asked of the Actuator if the forecast violates the SLO for the counter type. A resource (i.e. performance counter) specific SLO is defined as a value range, with known upper and lower bounds, whose width is defined and set by the applicable SLA. The nature of the elasticity signal sent is determined by the bound (upper or lower) violated. Correctness of the adopted elasticity scheme is demonstrated by Figure 4.5. The illustration plots two sample executions of Algorithm 4 for the “Available Memory” counter type. The lower and upper bounds are set at 20 and 80 units respectively within minimum and maximum values of 0 and 100. Calculations made over time for the current value of the performance counter against the previous value and in comparison with SLO bounds ensure that the necessary elasticity signals are sent.


Figure 4.5: Sample execution of Algorithm 4

For instance, violation of the Upper Bound (set at 80 units) for Available Memory results in a scale-down signal with sufficient strength to meet the SLO Upper Bound.

4.4.5 Actuator

The Actuator component polls for signals from the Monitor as described in Algorithm 5. Either procedure Acquire or Release is executed as indicated by the received signal. Both procedures are accumulative; resources are acquired or released only after sufficient invocations, that constitute the smallest possible instance, have occurred. Important differences however exist; resource acquisition is preempted and enacted for sufficient demands for any resource type (i.e. processor or memory) whereas resource release is delayed and actuated only when the necessary scale down signals have been accumulated for all resource types. This approach ensures prompt scaling up and eventual scaling down in a gradual fashion (i.e. an instance at a time).


Algorithm 4 Provision Resources

CounterTypes = {IdleProcessorTime, AvailableMemory}
for all counterType ∈ CounterTypes do
    SumForCounter ← 0
    TotalChangeInCounter ← 0
    PredictionForCounter ← 0
    NumberOfInstances ← 0
    for all counter ∈ PerformanceCounters do
        if counter[CounterType] = counterType then
            SumForCounter = SumForCounter + counter[CurrentValue]
            TotalChangeInCounter = TotalChangeInCounter + (counter[CurrentValue] − counter[OldValue])
            NumberOfInstances = NumberOfInstances + 1
        end if
    end for
    PredictionForCounter = (SumForCounter + TotalChangeInCounter) / NumberOfInstances
    Signal : {CounterType, Scale, Strength}
    if PredictionForCounter > UpperBoundSLO[counterType] then
        Signal ← {counterType, Down, PredictionForCounter − UpperBoundSLO[counterType]}
    end if
    if PredictionForCounter < LowerBoundSLO[counterType] then
        Signal ← {counterType, Up, LowerBoundSLO[counterType] − PredictionForCounter}
    end if
    Send(Signal)
end for
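The forecast and signal computation of Algorithm 4 may be sketched as below. The names `Signal`, `forecast` and `provision` are illustrative. Two deliberate deviations from the listing are assumptions of this sketch: a signal is only produced when a bound is violated, and counter types without any measurements are skipped to avoid division by zero.

```python
from typing import Dict, List, NamedTuple, Optional


class Signal(NamedTuple):
    counter_type: str
    scale: str        # "Up" or "Down"
    strength: float   # distance of the forecast from the violated bound


def forecast(rows: List[Dict], counter_type: str) -> Optional[float]:
    """Average of (current value + change since last measurement) over instances."""
    matching = [r for r in rows if r["CounterType"] == counter_type]
    if not matching:
        return None
    total = sum(r["CurrentValue"] for r in matching)
    change = sum(r["CurrentValue"] - r["OldValue"] for r in matching)
    return (total + change) / len(matching)


def provision(rows: List[Dict], lower: Dict[str, float],
              upper: Dict[str, float]) -> List[Signal]:
    """Compare each forecast with its SLO range and emit scaling signals."""
    signals = []
    for counter_type in lower:
        prediction = forecast(rows, counter_type)
        if prediction is None:
            continue
        if prediction > upper[counter_type]:
            # Too much spare capacity is forecast: scale down.
            signals.append(Signal(counter_type, "Down", prediction - upper[counter_type]))
        elif prediction < lower[counter_type]:
            # Too little spare capacity is forecast: scale up.
            signals.append(Signal(counter_type, "Up", lower[counter_type] - prediction))
    return signals
```

With bounds of 20 and 80 for Available Memory, two instances reporting 90 and 80 units (previously 80 and 70) give a forecast of (170 + 20) / 2 = 95 and hence a scale-down signal of strength 15.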

An alternate scheme would assign a one shot behavior to Release such that all existing instances are inspected for client sessions and released if appropriate. The appropriateness can be modeled with two approaches; one, where an instance is released only if it does not interrupt an existing session; on the other hand, high ranking instances with > 0 existing sessions could be recycled and the sessions restored on other instances. This approach is not practical since the elasticity interface for existing Cloud offerings is not always instance specific when scaling down.

The incremental elasticity method employed above stems from consideration of typical load patterns and Cloud infrastructure limitations. The suggested scheme should cope well with linear change (increase and decrease) in resource consumption as well as fluctuations between linear and cubic demand patterns. Exponential growth in service requests (i.e. arrival or departure of swarms), however, will be addressed eventually. Elasticity is constrained to a single instance to avoid over and underutilization of resources by supporting resource (de)allocation with current load measurements. Sensitivity expected for this case is constrained by promptness and correctness of the Monitor's forecast as well as the pace at which the Cloud infrastructure can spawn and destroy instances.

Algorithm 5 Actuate Elasticity

Signal : {CounterType, Scale, Strength}
loop
    Signal ← Read
    if Signal[Scale] = Up then
        Acquire(Signal[Strength])
    end if
    if Signal[Scale] = Down then
        Release(Signal[Strength])
    end if
end loop
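The accumulative Acquire and Release behavior described for the Actuator could look roughly like the sketch below. It rests on several assumptions not stated in the text: signal strengths are accumulated per counter type, `instance_size` is a hypothetical threshold standing in for "the smallest possible instance", a scale-up request cancels pending scale-down demand for that counter, and `start_instance` / `stop_instance` are placeholder callbacks for the Cloud platform's scaling interface.

```python
from typing import Callable, Dict, Iterable


class Actuator:
    """Hypothetical accumulative actuator: prompt scale-up, delayed scale-down."""

    def __init__(self, counter_types: Iterable[str], instance_size: float,
                 start_instance: Callable[[], None],
                 stop_instance: Callable[[], None]) -> None:
        self._size = instance_size
        self._start = start_instance
        self._stop = stop_instance
        self._up: Dict[str, float] = {c: 0.0 for c in counter_types}
        self._down: Dict[str, float] = {c: 0.0 for c in self._up}

    def acquire(self, counter_type: str, strength: float) -> None:
        self._up[counter_type] += strength
        self._down[counter_type] = 0.0   # assumption: new demand cancels pending release
        # Enough demand on ANY single resource type triggers one new instance.
        if self._up[counter_type] >= self._size:
            self._start()
            self._reset()

    def release(self, counter_type: str, strength: float) -> None:
        self._down[counter_type] += strength
        # Release only when ALL resource types have accumulated enough slack.
        if all(value >= self._size for value in self._down.values()):
            self._stop()
            self._reset()

    def _reset(self) -> None:
        # Start accumulating afresh after every scaling action.
        for counter_type in self._up:
            self._up[counter_type] = 0.0
            self._down[counter_type] = 0.0
```

The asymmetry between `acquire` (any counter) and `release` (all counters) is what makes scaling up prompt and scaling down gradual, one instance at a time.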

4.4.6 Session Recovery

Detection of Service failure at one of the two points (Client Proxy or Monitor) would initiate the flow outlined in Algorithm 6 that partially covers the steps involved. The associated Storage Proxy behavior has been omitted for simplicity and brevity since it is already covered in section 4.3.3.

Algorithm 6 Recover Client Session

loop
    FailedServiceInstance ← Read
    OrphanClients ← RetrieveAffinity(FailedServiceInstance)
    for all Client ∈ OrphanClients do
        Requests = RetrieveSessionLogInTimeOrder(Client)
        SignalRecovery(StorageProxy, Client)
        HealthyServiceInstance ← GetBestServiceInstance(Monitor)
        EstablishAffinity(Client, HealthyServiceInstance)
        RemoveAffinity(Client, FailedServiceInstance)
        for all Request ∈ Requests do
            RelayRequest(Request, HealthyServiceInstance)
        end for
    end for
end loop
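The body of the recovery loop in Algorithm 6 can be sketched as a single function. All parameter names are hypothetical stand-ins: `affinity` maps clients to instances, `session_log` holds each client's requests in time order, and the three callbacks represent the Storage Proxy notification, the Monitor query and request forwarding.

```python
from typing import Callable, Dict, List


def recover_sessions(failed: str,
                     affinity: Dict[str, str],
                     session_log: Dict[str, List[str]],
                     signal_recovery: Callable[[str], None],
                     best_instance: Callable[[], str],
                     relay: Callable[[str, str], object]) -> None:
    """Replay the logged requests of every client orphaned by a failed instance."""
    orphans = [client for client, server in affinity.items() if server == failed]
    for client in orphans:
        requests = list(session_log.get(client, []))   # RetrieveSessionLogInTimeOrder
        signal_recovery(client)                        # put the Storage Proxy in recovery mode
        healthy = best_instance()                      # GetBestServiceInstance(Monitor)
        # Overwriting the entry establishes the new affinity and removes the old one.
        affinity[client] = healthy
        for request in requests:
            relay(request, healthy)                    # redo: rebuild state by replay
```

Responses produced during replay are discarded in this sketch; only the first request after recovery would be answered to the client.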


Chapter 5

Implementation

Candidate solution architectural components, detailed previously, are adapted for a chosen Cloud infrastructure and realized using appropriate technologies. Major issues addressed during development are noted as well throughout the chapter.

5.1 Design

Decisions and choices were made concerning system environment, data structures and control flow as detailed in this section.

5.1.1 Cloud Infrastructure

An array of Cloud offerings has surfaced in different flavors with characteristic features; examples include Amazon EC2[5], Google AppEngine[17] and Microsoft Windows Azure[13]. Cloud vendors range from technology leaders to startups. Offerings are aimed at commercial as well as academic audiences. The choice of technologies and tools to employ for a proof of concept of the proposed framework is directed by a number of factors. “The Windows Azure platform is an Internet-scale Cloud services platform hosted through Microsoft data centers. The platform includes the Windows Azure operating system and a set of rich developer services.”[30] The subject platform attracts attention among available options with its rich feature set[12] and simplified development experience supported by companion state-of-the-art tools such as Microsoft Visual Studio 2010 IDE[28] (Integrated Development Environment) and resources such as MSDN (Microsoft Developer Network)[44].

5.1.2 Computation

Most framework components require an execution environment with processing (CPU), memory (RAM) and communication (Intra/Internet) facilities. Windows Azure terms the coupling of a hosted service with its required resources a Role[39]. Each role specifies how many of its copies (i.e. instances) should execute in parallel through the ServiceConfiguration.cscfg file. The configuration may also specify other settings including associated constants (e.g. database connection strings) and security certificate information. The configuration schema is defined in the paired ServiceDefinition.csdef file. In addition, the definition file describes the exposed communication endpoints and the available local storage resources. Once the service is deployed, the configuration may be changed during execution and the changes take effect without re-deployment. Changes to the definition, however, require service re-deployment.
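The role configuration described above can be illustrated with a minimal sketch that reads instance counts and settings from a service configuration document. The role names, instance counts and the setting shown are hypothetical and are not taken from the actual deployment; the XML layout follows the commonly documented ServiceConfiguration schema and should be treated as an assumption here.

```python
import xml.etree.ElementTree as ET

# XML namespace of Windows Azure service configuration (.cscfg) files (assumed).
CSCFG_NS = "http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"

# Hypothetical configuration: role names, counts and the setting are illustrative only.
SAMPLE_CSCFG = """<ServiceConfiguration serviceName="StateAsAService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="ServiceWrapper">
    <Instances count="3" />
    <ConfigurationSettings>
      <Setting name="StorageConnectionString" value="UseDevelopmentStorage=true" />
    </ConfigurationSettings>
  </Role>
  <Role name="Monitor">
    <Instances count="1" />
  </Role>
</ServiceConfiguration>
"""


def read_roles(cscfg_text):
    """Return {role name: {"instances": int, "settings": dict}} for each role."""
    ns = {"sc": CSCFG_NS}
    root = ET.fromstring(cscfg_text)
    roles = {}
    for role in root.findall("sc:Role", ns):
        instances = role.find("sc:Instances", ns)
        settings = {
            setting.get("name"): setting.get("value")
            for setting in role.findall("sc:ConfigurationSettings/sc:Setting", ns)
        }
        roles[role.get("name")] = {
            "instances": int(instances.get("count")),
            "settings": settings,
        }
    return roles


if __name__ == "__main__":
    for name, info in read_roles(SAMPLE_CSCFG).items():
        print(name, info["instances"], info["settings"])
```

Because the instance count lives in the configuration rather than the definition, changing it is the kind of adjustment that can be applied to a running service.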

Roles, in particular “Worker” roles, are suited to hosting framework components individually. For instance, the tenant/managed service is hosted within a ServiceWrapper role. Worker roles are suited to long-running background processes and have access to the resources necessary for the component’s function. Hosting each component in a separate role lends robustness by avoiding a single point of failure.

5.1.3 Persistence

Windows Azure storage mechanisms cover a range of requirements with a set of primitives and technologies. Options include Blob, Table and Queue[13], with extension options leading to Windows Azure Drive[11] and SQL Azure[42]. The following text describes how the persistence needs of the framework were considered and met.

Table

Storing structured data at scale is realized by the “Table Service”[43]. An Azure Table can group an unlimited number of entities; an entity in turn comprises named, typed properties that hold values. Traditional relational features, including a fixed schema and support for SQL, have been stripped away in favor of a data structure that is simpler to manage and scale. Alleviating DBMS concerns, the Table Service supports LINQ[36] and REST[43] access to the disk structures. With no limits on table count and size, and with redundant storage spread across fault domains, both scalability and reliability are ensured.

The characteristics detailed above simplify the choice of the Table Service for three key solution data structures. The Azure Diagnostics Service rightly chooses to write select Performance Counters to the (infrastructure managed) WADPerformanceCounters table. The correctness of the values stored here is vital for the correct function of the Monitor component. Imposing a structural and logical separation of rankings from Performance Counters resulted in splitting the earlier described PerformanceCounters table in two. Periodically calculated rankings are thus written to the RoleInstanceRanking table instead and are considered when routing service requests.
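The separation of raw counters from derived rankings can be sketched as follows. This is an illustration under stated assumptions, not the ranking scheme used by the Monitor component: the record fields, the use of a per-instance average as the score, and the rule that the lowest score wins are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One raw row, standing in for an entry of the performance counter table."""
    instance: str
    counter: str
    value: float


@dataclass(frozen=True)
class Ranking:
    """One derived row, standing in for an entry of the ranking table."""
    instance: str
    score: float


def compute_rankings(samples):
    """Derive one score per instance as the mean of its samples (assumed metric)."""
    totals = {}
    for sample in samples:
        total, count = totals.get(sample.instance, (0.0, 0))
        totals[sample.instance] = (total + sample.value, count + 1)
    return [Ranking(name, total / count) for name, (total, count) in sorted(totals.items())]


def pick_instance(rankings):
    """Route to the instance with the lowest score (assumed to mean least loaded)."""
    return min(rankings, key=lambda ranking: ranking.score).instance


if __name__ == "__main__":
    samples = [
        CounterSample("wrapper_0", "cpu_percent", 80.0),
        CounterSample("wrapper_0", "cpu_percent", 60.0),
        CounterSample("wrapper_1", "cpu_percent", 20.0),
        CounterSample("wrapper_1", "cpu_percent", 40.0),
    ]
    print(pick_instance(compute_rankings(samples)))
```

Keeping the derived rows in their own structure means the routing path reads a small, precomputed table rather than scanning raw counter data on every request.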

Most importantly, the key element of playback recovery, i.e. the Client session message logs, is recorded in the StateStorage table. These logs trace session activity and are critical for session recovery. The Store component described in section 5.2.3 provides wrappers that parallelize write operations and batch read operations; both are necessary for sharing the structure among competing client sessions and for speeding up log retrieval. Service instances may invoke the operations exposed by the Store component to persist in-memory session state for later retrieval by the same or another service instance.
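The idea of recording session messages and replaying them to rebuild state can be sketched with an in-memory stand-in for the log table. The class and function names below are hypothetical and do not reflect the actual Store component interface; a lock substitutes for whatever concurrency control the real storage layer provides.

```python
import threading


class SessionLogStore:
    """In-memory stand-in for a per-session message log (hypothetical interface)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._logs = {}

    def append(self, session_id, message):
        """Record a message and return its sequence number within the session."""
        with self._lock:
            log = self._logs.setdefault(session_id, [])
            sequence = len(log)
            log.append((sequence, message))
            return sequence

    def read_log(self, session_id):
        """Return all messages of a session in sequence order (one batched read)."""
        with self._lock:
            entries = list(self._logs.get(session_id, []))
        entries.sort(key=lambda entry: entry[0])
        return [message for _, message in entries]


def replay(messages, apply_message, initial_state):
    """Rebuild session state by applying logged messages in order."""
    state = initial_state
    for message in messages:
        state = apply_message(state, message)
    return state


if __name__ == "__main__":
    store = SessionLogStore()
    for amount in (5, 7, -2):
        store.append("session-1", amount)
    # Any instance holding the log can recover the state of the session.
    print(replay(store.read_log("session-1"), lambda total, amount: total + amount, 0))
```

Replay only reproduces the original state if messages are applied in their recorded order, which is why the sketch keeps an explicit sequence number with each entry.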

Queue

Communication among Cloud compute instances is the primary purpose of the Queue Service[40]. Its asynchronous, “at least once” processing semantics provide an alternative to message passing over internal endpoints. As with tables, parallels should not be drawn between the Queue Service and conventional message queuing architectures such as Microsoft Message Queuing (MSMQ), since Queues provide neither ordered delivery nor exactly-once processing.

The queue structure is central to the elasticity function of the framework. Scaling signals are pushed to a scaling queue which is periodically polled by the Actuator component. The decoupling introduced by asynchronous insertion and processing of scaling signals lends tolerance against Actuator failures. Robustness against signal loss and repeated processing is assisted by the fine granularity of the signal strengths: skipping a signal or processing it more than once does not alter the scaling forecast significantly. Failure detection for Service Wrappers also benefits from the Queue Service. Failed instances insert a failure token which is consumed by the Monitor component. The ability to post messages during the instance startup and shutdown phases gives queues an edge over communication via endpoints.
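The tolerance to lost or duplicated scaling signals can be sketched with a local queue standing in for the Queue Service. The signal unit (one tenth of an instance per signal), the rounding rule and the function names are assumptions made for illustration; the actual signal format of the framework is not reproduced here.

```python
from collections import deque

# Assumed granularity: each signal is worth one tenth of a role instance.
SIGNALS_PER_INSTANCE = 10


def drain(queue):
    """Poll once: consume every pending signal and return the net strength."""
    net = 0
    while queue:
        net += queue.popleft()
    return net


def target_instances(current, net_signal, minimum=1):
    """Translate the net signal strength into a target instance count."""
    delta = round(net_signal / SIGNALS_PER_INSTANCE)
    return max(minimum, current + delta)


if __name__ == "__main__":
    expected = deque([1] * 12)       # twelve scale-out signals
    one_lost = deque([1] * 11)       # one signal never delivered
    one_repeated = deque([1] * 13)   # one signal delivered twice
    for signals in (expected, one_lost, one_repeated):
        print(target_instances(4, drain(signals)))
```

With fine-grained signals, a single missing or repeated message shifts the net strength by one small step, so the three cases above lead to the same target instance count.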

Blob

The Binary Large Object or Blob[32] service is the simplest and most generic storage service in Windows Azure. Objects of any type (e.g. video, audio, text), up to 200 gigabytes in size for Block Blobs and up to one terabyte for Page Blobs, may be stored and retrieved block-wise or page-wise. Blob contents can be secured with private containers that require signed read as well as write requests. Page Blobs also double as conventional drives since mounting virtual hard drives is supported.

Blobs did not qualify for storing session message logs since retrieval is comparatively expensive and supports neither filtering nor ordering. Still, Blobs are useful for storing application state and/or serialized session state. The interface to the Store component has therefore been extended with methods to store and retrieve blobs.
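Storing serialized session state as a single opaque object can be sketched as below. The container class is an in-memory stand-in, and the naming scheme and the choice of JSON as serialization format are assumptions; none of them reflect the actual Store component methods.

```python
import json


class BlobContainer:
    """In-memory stand-in for a blob container: named, opaque byte sequences."""

    def __init__(self):
        self._blobs = {}

    def put(self, name, data):
        if not isinstance(data, bytes):
            raise TypeError("blob content must be bytes")
        self._blobs[name] = data

    def get(self, name):
        # The whole object is returned; there is no filtering or ordering.
        return self._blobs[name]


def blob_name(session_id):
    """Hypothetical naming scheme for session snapshots."""
    return "sessions/{}.json".format(session_id)


def save_session_state(container, session_id, state):
    """Serialize the state and store it as one blob."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    container.put(blob_name(session_id), payload)


def load_session_state(container, session_id):
    """Fetch the blob and deserialize it back into a state object."""
    return json.loads(container.get(blob_name(session_id)).decode("utf-8"))


if __name__ == "__main__":
    container = BlobContainer()
    save_session_state(container, "session-1", {"cart": ["A-100", "B-200"], "step": 3})
    print(load_session_state(container, "session-1"))
```

A snapshot of this kind is read and written as a whole, which suits occasional state capture better than the fine-grained, ordered access that message logs require.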

5.1.4 Elasticity

Instantiation and shut down of role instances is not instantaneous and imposes certain restrictions when implementing elasticity. Execution of a resource acquisition activity, thus, opens two time windows, one for each type of elasticity action.
