Analysis of enterprise IT service availability

Enterprise architecture modeling for assessment, prediction, and decision-making

ULRIK FRANKE

Doctoral Thesis
Stockholm, Sweden 2012


TRITA EE 2012:032
ISBN 978-91-7501-443-2
ISSN 1653-5146
ISRN KTH/ICS/R--12/02--SE

Industrial Information and Control Systems
KTH Royal Institute of Technology
Stockholm, Sweden

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

© Ulrik Franke, August 2012. Copyrighted articles are reprinted with kind permission from IEEE, SCS, Springer, and Taylor & Francis.

Set in LaTeX by the author

Printed by Universitetsservice US AB


Abstract

Information technology has become increasingly important to individuals and organizations alike. Not only does IT allow us to do what we always did faster and more effectively, but it also allows us to do new things, organize ourselves differently, and work in ways previously unimaginable. However, these advantages come at a cost: as we become increasingly dependent upon IT services, we also demand that they are continuously and uninterruptedly available for use. Despite advances in reliability engineering, the complexity of today's increasingly integrated systems offers a non-trivial challenge in this respect.

How can high availability of enterprise IT services be maintained in the face of constant additions and upgrades, decade-long life-cycles, dependencies upon third parties, and the ever-present business-imposed requirement of flexible and agile IT services?

The contribution of this thesis includes (i) an enterprise architecture framework that offers a unique and action-guiding way to analyze service availability, (ii) identification of causal factors that affect the availability of enterprise IT services, (iii) a study of the use of fault trees for enterprise architecture availability analysis, and (iv) principles for how to think about availability management.

This thesis is a composite thesis of five papers. Paper 1 offers a framework for thinking about enterprise IT service availability management, highlighting the importance of the variance of outage costs. Paper 2 shows how enterprise architecture (EA) frameworks for dependency analysis can be extended with Fault Tree Analysis (FTA) and Bayesian network (BN) techniques. FTA and BN are proven formal methods for reliability and availability modeling.

Paper 3 describes a Bayesian prediction model for systems availability, based on expert elicitation from 50 experts. Paper 4 combines FTA and constructs from the ArchiMate EA language into a method for availability analysis on the enterprise level. The method is validated by five case studies, where annual downtime estimates were always within eight hours of the actual values. Paper 5 extends the Bayesian prediction model from paper 3 and the modeling method from paper 4 into a full-blown enterprise architecture framework, expressed in a probabilistic version of the Object Constraint Language. The resulting modeling framework is tested in nine case studies of enterprise information systems.

Keywords: Service Level Agreement, outage costs, Enterprise Architecture, enterprise IT service availability, decision-making, metamodeling, Enterprise Architecture analysis, Bayesian networks, fault trees, Predictive Probabilistic Architecture Modeling Framework


Sammanfattning (Summary in Swedish)

Information technology is becoming ever more important, to individuals and organizations alike. IT not only lets us work faster and more efficiently at what we already do, but also lets us do entirely new things, organize ourselves differently, and work in new ways. Unfortunately, these advantages come at a price: as we become ever more dependent on IT services, our demands also grow that they be continuously available, without interruption. Despite advances in reliability engineering, today's increasingly interconnected systems pose a difficult challenge in this respect.

How can high availability be ensured for IT services that are constantly extended and upgraded, that have life-cycles of decades, that depend on third-party suppliers, and that moreover must live up to business requirements of flexibility and agility?

This thesis contains (i) an architecture framework that can analyze the availability of IT services in a unique way and derive recommended actions, (ii) a set of identified causal factors that affect the availability of IT services, (iii) a study of how fault trees can be used for architecture analysis of availability, and (iv) a set of principles for decision-making about availability.

The thesis is a composite thesis of five papers. Paper 1 contains a conceptual framework for decision-making about IT service availability, emphasizing the importance of the variance of downtime costs. Paper 2 shows how frameworks for organization-wide architecture (enterprise architecture, EA) can be extended with fault tree analysis (FTA) and Bayesian networks (BN) for analyzing dependencies between components. FTA and BN are both established methods for reliability and availability modeling. Paper 3 describes a Bayesian prediction model for systems availability, based on the judgments of 50 experts. Paper 4 combines FTA with modeling elements from the EA framework ArchiMate into a method for availability analysis at the enterprise level. The method has been validated in five case studies, where the estimated annual downtimes were always within eight hours of the actual values. Paper 5 extends the Bayesian prediction model from paper 3 and the modeling method from paper 4 into a complete EA framework, expressed in a probabilistic version of the Object Constraint Language (OCL). The resulting modeling framework has been tested in nine case studies of enterprise information systems.

Keywords: Service Level Agreement, downtime costs, Enterprise Architecture, IT service availability, decision-making, metamodeling, EA analysis, Bayesian networks, fault trees, Predictive Probabilistic Architecture Modeling Framework

Acknowledgments

…rather than recipients of private research funding. (David Schmidtz wonderfully coined the term transitive reciprocity for this phenomenon.)

Thank you all!

Stockholm, August 2012 Ulrik Franke


Papers

Papers included in the thesis

[1] U. Franke, "Optimal IT Service Availability: Shorter Outages, or Fewer?" IEEE Transactions on Network and Service Management, vol. 9, no. 1, pp. 22–33, Mar. 2012, DOI: 10.1109/TNSM.2011.110811.110122.

[2] U. Franke, W. Rocha Flores, and P. Johnson, "Enterprise architecture dependency analysis using fault trees and Bayesian networks," in Proc. 42nd Annual Simulation Symposium (ANSS), San Diego, California, March 23-25, 2009, pp. 209–216.

[3] U. Franke, P. Johnson, J. König, and L. Marcks von Würtemberg, “Availability of enterprise IT systems: an expert-based Bayesian framework,” Software Quality Journal, vol. 20, pp. 369–394, 2012, DOI: 10.1007/s11219-011-9141-z.

[4] P. Närman, U. Franke, J. König, M. Buschle, and M. Ekstedt, "Enterprise architecture availability analysis using fault trees and stakeholder interviews," Enterprise Information Systems, 2012, DOI: 10.1080/17517575.2011.647092.

[5] U. Franke, P. Johnson, and J. König, “An architecture framework for enterprise IT service availability analysis,” 2012, submitted manuscript.

Author contributions

[1] was fully authored by Franke.

In [2], the general research concept is due to Franke and Johnson, whereas the article was mostly authored by Franke and Rocha Flores, and presented by Franke at the conference.

In [3], the general research concept is due to Franke and Johnson, the construction of the survey was mostly done by Franke and König, and the respondents were found by Franke, König and Marcks von Würtemberg, who also shared most of the authoring.

In [4], the general research concept is due to Närman, Ekstedt, and Franke, the metamodel was constructed by Närman, Franke, Ekstedt, König, and Buschle, the empirical data collection was done by Närman and Franke, and the authoring was mostly done by Närman, Franke, König, and Buschle.

In [5], the general research concept is due to Johnson and Franke, the metamodel construction and P2AMF implementation was done by Franke and Johnson, and the ITIL operationalization and empirical data collection was done by Franke and König, who also shared most of the authoring.


Related papers not included in the thesis

[6] A. Fazlollahi, U. Franke, and J. Ullberg, “Benefits of Enterprise Integration: Review, Classification, and Suggestions for Future Research,” in International IFIP Working Conference on Enterprise Interoperability (IWEI 2012), Sep. 2012.

[7] H. Holm, T. Sommestad, U. Franke, and M. Ekstedt, "Success rate of remote code execution attacks – expert assessments and observations," Journal of Universal Computer Science, vol. 18, no. 6, pp. 732–749, Mar. 2012.

[8] C. Sandels, U. Franke, and L. Nordström, "Vehicle to grid system reference architectures and Monte Carlo simulations," International Journal of Vehicle Autonomous Systems, 2011, to appear.

[9] ——, "Vehicle to grid communication Monte Carlo simulations based on Automated Meter Reading reliability," in Power Systems Computation Conference 2011, Aug. 2011.

[10] L. Marcks von Würtemberg, U. Franke, R. Lagerström, E. Ericsson, and J. Lilliesköld, "IT project success factors – an experience report," in Portland International Conference on Management of Engineering and Technology (PICMET), Jul. 2011.

[11] H. Holm, T. Sommestad, U. Franke, and M. Ekstedt, "Expert assessment on the probability of successful remote code execution attacks," in 8th International Workshop on Security in Information Systems – WOSIS 2011, Jun. 2011.

[12] J. König, P. Närman, U. Franke, and L. Nordström, “An extended framework for reliability analysis of ICT for power systems,” in Proceedings of IEEE Power Tech 2011, Jun. 2011.

[13] J. Saat, R. Winter, U. Franke, R. Lagerström, and M. Ekstedt, "Analysis of IT/business alignment situations as a precondition for the design and engineering of situated IT/business alignment solutions," in Proceedings of the Hawaii International Conference on System Sciences (HICSS-44), Jan. 2011.

[14] U. Franke, R. Lagerström, M. Ekstedt, J. Saat, and R. Winter, "Trends in enterprise architecture practice – a survey," in Proc. 5th Trends in Enterprise Architecture Research (TEAR2010) workshop, Nov. 2010.

[15] M. Buschle, J. Ullberg, U. Franke, R. Lagerström, and T. Sommestad, "A tool for enterprise architecture analysis using the PRM formalism," in CAiSE2010 Forum Post-Proceedings, Oct. 2010.

[16] U. Franke, O. Holschke, M. Buschle, P. Närman, and J. Rake-Revelant, "IT consolidation – an optimization approach," in 14th Enterprise Distributed Object Computing Conference Workshops (EDOCW 2010), Oct. 2010.

[17] J. Saat, U. Franke, R. Lagerström, and M. Ekstedt, “Enterprise architecture meta models for IT/business alignment situations,” in Proc. 14th IEEE International EDOC Conference (EDOC 2010), Oct. 2010.

[18] C. Sandels, U. Franke, N. Ingvar, L. Nordström, and R. Hamrén, “Vehicle to grid – Monte Carlo simulations for optimal aggregator strategies,” in Proc. 2010 International Conference on Power System Technology (PowerCon 2010), Oct. 2010.


[19] M. Jensen, C. Sel, U. Franke, H. Holm, and L. Nordström, "Availability of a SCADA/OMS/DMS system – a case study," in Proc. IEEE PES Conference on Innovative Smart Grid Technologies Europe, Oct. 2010.

[20] C. Sandels, U. Franke, N. Ingvar, L. Nordström, and R. Hamrén, "Vehicle to grid – reference architectures for the control markets in Sweden and Germany," in Proc. IEEE PES Conference on Innovative Smart Grid Technologies Europe, Oct. 2010.

[21] U. Franke, P. Närman, D. Höök, and J. Lilliesköld, "Factors affecting successful project management of technology-intense projects," in Proc. Portland International Conference on Management of Engineering & Technology (PICMET) 2010, Bangkok, Jul. 2010.

[22] J. König, U. Franke, and L. Nordström, "Probabilistic availability analysis of control and automation systems for active distribution networks," in Proceedings of IEEE PES Transmission and Distribution Conference and Exposition, Jun. 2010.

[23] M. Buschle, J. Ullberg, U. Franke, R. Lagerström, and T. Sommestad, “A tool for enterprise architecture analysis using the PRM formalism,” in Proc. CAiSE Forum 2010, vol. 592, Jun. 2010, pp. 68–75, ISSN 1613-0073.

[24] P. Närman, U. Franke, and M. Stolz, "The service-orientation of MODAF – conceptual framework and analysis," in Proc. 7th bi-annual European Systems Engineering Conference (EuSEC 2010), May 2010.

[25] J. Ullberg, U. Franke, M. Buschle, and P. Johnson, "A tool for interoperability analysis of enterprise architecture models using pi-OCL," in Proceedings of the International Conference on Interoperability for Enterprise Software and Applications (I-ESA), Apr. 2010.

[26] U. Franke, P. Johnson, J. König, and L. M. von Würtemberg, “Availability of enterprise IT systems – an expert-based Bayesian model,” in Proc. Fourth International Workshop on Software Quality and Maintainability (SQM 2010), Madrid, Mar. 2010.

[27] R. Lagerström, U. Franke, P. Johnson, and J. Ullberg, "A method for creating enterprise architecture metamodels – applied to systems modifiability analysis," International Journal of Computer Science & Applications, vol. VI, pp. 89–120, Dec. 2009.

[28] U. Franke, J. Ullberg, T. Sommestad, R. Lagerström, and P. Johnson, "Decision support oriented enterprise architecture metamodel management using classification trees," in 13th Enterprise Distributed Object Computing Conference Workshops (EDOCW 2009), Sep. 2009.

[29] U. Franke and P. Johnson, "An enterprise architecture framework for application consolidation in the Swedish armed forces," in 13th Enterprise Distributed Object Computing Conference Workshops (EDOCW 2009), Sep. 2009.

[30] S. Aier, S. Buckl, U. Franke, B. Gleichauf, P. Johnson, P. Närman, C. M. Schweda, and J. Ullberg, "A survival analysis of application life spans based on enterprise architecture models," in Lecture Notes in Informatics, vol. P-152, Sep. 2009, pp. 141–154, Proc. 3rd International Workshop on Enterprise Modelling and Information Systems Architectures (EMISA 2009).


[31] P. Gustafsson, D. Höök, U. Franke, and P. Johnson, “Modeling the IT impact on organizational structure,” in Proc. 13th IEEE International EDOC Conference (EDOC 2009), Sep. 2009.

[32] S. Buckl, U. Franke, O. Holschke, F. Matthes, C. M. Schweda, T. Sommestad, and J. Ullberg, "A pattern-based approach to quantitative enterprise architecture analysis," in Proc. 15th Americas Conference on Information Systems (AMCIS), San Francisco, USA, no. Paper 318, Aug. 2009.

[33] R. Lagerström, J. Saat, U. Franke, S. Aier, and M. Ekstedt, "Enterprise meta modeling methods – combining a stakeholder-oriented and a causality-based approach," in Enterprise, Business-Process and Information Systems Modeling, Lecture Notes in Business Information Processing, vol. 29, Springer Berlin Heidelberg, Jun. 2009, pp. 381–393, ISSN 1865-1348.

[34] P. Närman, U. Franke, and L. Nordström, "Assessing the quality of service of Powel," in Proc. 20th International Conference on Electricity Distribution, Jun. 2009.

[35] U. Franke, P. Johnson, E. Ericsson, W. R. Flores, and K. Zhu, "Enterprise architecture analysis using fault trees and MODAF," in Proc. CAiSE Forum 2009, vol. 453, Jun. 2009, pp. 61–66, ISSN 1613-0073.

[36] U. Franke, D. Höök, J. König, R. Lagerström, P. Närman, J. Ullberg, P. Gustafsson, and M. Ekstedt, "EAF2 – a framework for categorizing enterprise architecture frameworks," in Proc. 10th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, May 2009, pp. 327–332.

[37] U. Franke, P. Johnson, R. Lagerström, J. Ullberg, D. Höök, M. Ekstedt, and J. König, "A formal method for cost and accuracy trade-off analysis in software assessment measures," in Proc. 3rd International Conference on Research Challenges in Information Science (RCIS), Fès, Morocco, Apr. 2009.

[38] P. Närman, P. Johnson, R. Lagerström, U. Franke, and M. Ekstedt, "Data collection prioritization for system quality analysis," in Electronic Notes in Theoretical Computer Science, vol. 233, Mar. 2009, pp. 29–42.

[39] U. Franke, P. Johnson, R. Lagerström, J. Ullberg, D. Höök, M. Ekstedt, and J. König, "A method for choosing software assessment measures using Bayesian networks and diagnosis," in Proc. 13th European Conference on Software Maintenance and Reengineering, Mar. 2009.

[40] M. Ekstedt, U. Franke, P. Johnson, R. Lagerström, T. Sommestad, J. Ullberg, and M. Buschle, "A tool for enterprise architecture analysis of maintainability," in Proc. 13th European Conference on Software Maintenance and Reengineering, Mar. 2009.

[41] U. Franke, T. Sommestad, M. Ekstedt, and P. Johnson, "Defense graphs and enterprise architecture for information assurance analysis," in Proceedings of the 26th Army Science Conference, Dec. 2008.

[42] P. Gustafsson, U. Franke, D. Höök, and P. Johnson, "Quantifying IT impacts on organizational structure and business value with extended influence diagrams," in Springer Lecture Notes in Business Information Processing, vol. 15, Nov. 2008, pp. 138–152.


[43] P. Närman, U. Franke, and L. Nordström, “Assessing the Quality of Service of Powel’s Netbas at a Nordic Utility,” in Nordic Distribution and Asset Management Conference (NORDAC 2008), Sep. 2008.

[44] R. Lagerström, M. Chenine, P. Johnson, and U. Franke, "Probabilistic metamodel merging," in Proceedings of the Forum at the 20th International Conference on Advanced Information Systems, vol. 344, Jun. 2008, pp. 25–28.

[45] P. Gustafsson, U. Franke, P. Johnson, and J. Lilliesköld, "Identifying IT impacts on organizational structure and business value," in Proceedings of the Third International Workshop on Business/IT Alignment and Interoperability, vol. 344, Jun. 2008, pp. 44–57.


Table of contents

I Introduction

1 Introduction
1.1 Outline of the thesis
1.2 Background
1.3 Enterprise IT service availability
1.4 Enterprise architecture
1.5 Purpose of the thesis
1.6 Related work
1.7 Results
1.8 Thesis contribution
1.9 Research design

Bibliography

II Papers 1 to 5

1 Optimal IT service availability: Shorter outages, or fewer?
2 Enterprise architecture dependency analysis using fault trees and Bayesian networks
3 Availability of enterprise IT systems: an expert-based Bayesian framework
4 Enterprise architecture availability analysis using fault trees and stakeholder interviews
5 An architecture framework for enterprise IT service availability analysis


Part I

Introduction


Chapter 1

Introduction

1.1 Outline of the thesis

This thesis consists of two parts. The first part is an introduction to the subject, research question, methods and results presented in greater detail in the second part, which is the locus of the main contribution. The second part consists of five papers: papers 1, 3, and 4 have been published in peer-reviewed academic journals, paper 5 has been submitted to one, and paper 2 was presented at, and published in the proceedings of, a peer-reviewed academic conference.

1.2 Background

In the modern world, it is becoming increasingly difficult to imagine life without IT systems. Not only are computers becoming ever smaller, faster and more pervasive in a physical sense, but they also increasingly affect how organizations are set up and how activities are organized. While this has brought about many advantages, such as the automation of manual labor, economies of scale in information management, and decreasing transaction costs for businesses and consumers, it has also added new challenges, such as the complexity of increasingly integrated systems, the life-cycle management of long-lived solutions, and contingency planning for outages. The last challenge is importantly connected to this thesis: with technology dependence comes the need for continuous and uninterrupted IT service availability.

As businesses increasingly come to rely on IT services, the requirements on availability levels continue to increase [136]. Nevertheless, the understanding of the cost-benefit relation between a desired availability level and its associated cost is often wanting [34]. This is partly due to a lack of maturity on the part of businesses, but also to a lack of proper investigations into the costs of downtime. One often-cited source is an IBM study from the nineties reporting that American companies lost $4.54 billion in 1996 due to system outages [65]. While there is a consensus that the costs of downtime are generally on the rise, it is also important to understand that this does not justify any amount of spending on mitigation [135]. What is required is an enlightened trade-off between the costs and benefits of availability improvements. This is a key topic of this thesis.
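To see what such a trade-off can look like in numbers, here is a minimal sketch in Python. All figures (availability levels, cost per hour of downtime, cost of the improvement) are hypothetical, chosen only to illustrate the comparison; they are not taken from the sources cited above.

```python
# Hypothetical cost-benefit comparison of an availability investment.
# All figures are illustrative assumptions, not data from the cited studies.

HOURS_PER_YEAR = 8766  # average year length, including leap years

def expected_annual_outage_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly downtime cost at a given steady-state availability."""
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    return downtime_hours * cost_per_hour

baseline = expected_annual_outage_cost(0.995, cost_per_hour=10_000)  # ~43.8 h down/year
improved = expected_annual_outage_cost(0.999, cost_per_hour=10_000)  # ~8.8 h down/year
mitigation_cost = 250_000  # assumed annual cost of the improvement

print(f"Baseline outage cost: {baseline:,.0f}")
print(f"Improved outage cost: {improved:,.0f}")
print(f"Net annual benefit:   {baseline - improved - mitigation_cost:,.0f}")
```

Whether the investment pays off depends entirely on the assumed downtime cost, which is why a proper understanding of outage costs (and, as paper 1 argues, their variance) matters.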


1.3 Enterprise IT service availability

A suitable point of departure for a study of enterprise IT service availability is to survey the meaning of the term.

IT services

Starting off with the notion of an IT service (or just service, for short), Taylor et al., in the ITIL series of publications, give the following definition of an IT service [145]:

”A Service provided to one or more Customers by an IT Service Provider. An IT Service is based on the use of Information Technology and supports the Customer’s Business Processes. An IT Service is made up from a combination of people, Processes, and technology and should be defined in a Service Level Agreement.”

Though the ITIL framework and its definitions are widely adopted among practitioners, this definition is slightly enigmatic. First of all, it immediately calls for further definitions of customers and service providers, respectively. The corresponding ITIL definition of customer reads as follows [145]:

"Someone who buys goods or Services. The Customer of an IT Service Provider is the person or group that defines and agrees the Service Level Targets. The term Customer is also sometimes informally used to mean Users, for example 'this is a Customer-focused Organization'."

For service provider, we have [145]:

"An Organization supplying Services to one or more Internal Customers or External Customers. Service Provider is often used as an abbreviation for IT Service Provider."

Taken together, it is clear that the ITIL definitions of customers and service providers are quite empty – they rely heavily on a previous conception of the service concept to be properly (or at all) understood. (ITIL alone should not be faulted for its circular definitions, though. The International Organization for Standardization offers a similarly unenlightening definition in the ISO/IEC 20000 (Information technology – Service management) standard, where "service provider" is defined as "the organization aiming to achieve ISO/IEC 20000".) The kind of informal background understanding of IT services required by the ITIL definition can be articulated as follows (from Johnson and Ekstedt [78]):

"Although system users might sometimes feel that the systems fail to deliver, the information systems in a company are there to provide value to the business. Even when successful, however, the information systems themselves need support to continue delivering services to its users. As briefly mentioned in the previous chapter there is thus a causal flow from the IT organization through the information systems to the business [...]"

This kind of description makes it easier to understand what an IT service is. Importantly, it is not only about technology, but about technology in an organizational setting, where it delivers some kind of value to whatever that organization is doing. Indeed, it could be argued that the main point of the ITIL definition is its trichotomy of people, processes and technology – highlighting that a service is more than the technology upon which it is built.


The causal flow of Johnson and Ekstedt is useful to understand how an information system (a piece of technology) relates to the customer’s business process: even if it is a necessary precondition for a successful service, it is not a sufficient one. This causal flow is important for the purpose of this thesis, as we consider IT – and its availability – in its business operations context. Unavailability that does not impact a customer’s business process is of no consequence or concern in this setting. This is also importantly connected to the next term to be scrutinized, viz. enterprise IT. The IT service concept will also be revisited in section 1.6 (related work).

Enterprise IT

The focus of this thesis is enterprise IT services, not IT services in general. So how do enterprise services, enterprise software, and enterprise computing differ from their non-enterprise counterparts? Turning to industry practice first, the renowned consultancy firm Gartner defines enterprise applications as follows [46]:

”Software products designed to integrate computer systems that run all phases of an enterprise’s operations to facilitate cooperation and coordination of work across the enterprise. The intent is to integrate core business processes (e.g., sales, accounting, finance, human resources, inventory and manufacturing). The ideal enterprise system could control all major business processes in real time via a single software architecture on a client/server platform. Enterprise software is expanding its scope to link the enterprise with suppliers, business partners and customers.”

Fowler offers a good ostensive definition (i.e. definition "by pointing") of enterprise applications [41]:

”Enterprise applications include payroll, patient records, shipping tracking, cost analysis, credit scoring, insurance, supply chain, accounting, customer service, and foreign exchange trading. Enterprise applications don’t include automobile fuel injection, word processors, elevator controllers, chemical plant controllers, telephone switches, operating systems, compilers, and games.”

However, the simplicity of Fowler's definition is a mixed blessing. His exclusion list indeed includes systems that arguably can be thought of as enterprise systems. A chemical plant controller in the sense of a Supervisory Control And Data Acquisition (SCADA) system is, arguably, an enterprise application, whereas a chemical plant controller that merely regulates the flow in a pump certainly is not. Similarly, a telephone switch that is interconnected to an office network, integrated with an enterprise telephone directory and the calendars of the office clerks is, arguably, an enterprise application, whereas software that merely routes calls on a switchboard is not. To accommodate such distinctions, an ostensive definition is not enough. Johnson defines enterprise software systems in the following manner [77]:

”An enterprise software system is the interconnected set of systems that is owned and managed by organizations whose primary interest is to use rather than develop the systems. Typical components in enterprise software systems are thus considered as proper systems in most other cases. They bear names such as process control systems, billing systems, customer information systems, and geographical information systems.” (Emphasis in original.)


This wording accurately describes the definition adhered to in this thesis. In particular, the use-not-develop concept is useful – in the enterprise context, IT is a tool to be used, and the less it has to be developed the better. Similarly to Gartner’s definition, Johnson goes on to highlight the increasing importance of integration, and its consequences for enterprise systems (or services) management:

"In the early days, these components were separated from each other physically, logically, and managerially. During the last decades, however, an ever-increasing integration frenzy has gripped the enterprises of the computerized world. Today's enterprise software system is thus multi-vendor based and enterprise-wide, characterized by heterogeneous and large-grained components."

This integration aspect resolves the chemical plant controller and telephone switch ambiguities in an elegant manner.

Furthermore, this heterogeneity and growing complexity are key driving forces behind the advent of enterprise architecture (cf. the next section, where the enterprise concept is revisited), a discipline aiming to holistically address the management problems of today's complex enterprise computing environments. Looking at the scope of one of the leading academic conferences in the area – the IEEE Enterprise Distributed Object Computing Conference (EDOC) – the same lines of thought can be seen [21]:

"The IEEE EDOC Conference emphasizes a holistic view on enterprise applications engineering and management, fostering integrated approaches that can address and relate processes, people and technology. The themes of openness and distributed computing, based on services, components and objects, provide a useful and unifying conceptual framework."

Note that the processes, people and technology trichotomy familiar from the ITIL service definition reappears here as well. Enterprise Information Systems is an academic journal published by Taylor & Francis. In its aims and scope, a similar concern with integration can be seen [35]:

"[...] the Journal focuses on both the technical and application aspects of enterprise information systems technology, and the complex and cross-disciplinary problems of enterprise integration that arise in integrating extended enterprises in a contemporary global supply chain environment. Techniques developed in mathematical science, computer science, manufacturing engineering, operations management used in the design or operation of enterprise information systems will also be considered."

This mission statement also hints at the multitude of methodologies that can be deemed relevant in the field of enterprise IT.

Availability

The literature offers several relevant definitions of availability. Schmidt makes the following definition [134]:

”Availability is the measure of how often or how long a service or a system is available for use.”

This gives an informal feeling for the term, but as a definition it is vague in that it leaves the "how often or how long" question open. The International Telecommunication Union is a bit more precise, defining availability as [72]:


”the ability of an item to be in a state to perform a required function at a given instant of time, or at any instant of time within a given time interval, assuming that the external resources, if required, are provided.”

Here, availability is specified for a given time interval, and some preconditions are stated. Indeed, the proviso "assuming that the external resources, if required, are provided" offers an important insight into why availability work can be a source of conflict in an organization: under this definition, determining whether a service is available requires a common picture both of what is required and of what has been provided in any given case. (See Barki and Hartwick [6] for an investigation of conflicts in information system development.) Rausand and Høyland, following British Standard BS 4778, make the following definition [127]:

”The ability of an item (under combined aspects of its reliability, maintainability and maintenance support) to perform its required function at a stated instant of time or over a stated period of time.”

This definition offers a choice: whether to measure availability at an instant or over a period of time. This distinction will be revisited in a moment. The International Organization for Standardization makes the following definition [68] in the ISO/IEC 9126 standard (Software engineering – Product quality), part 1:

”Availability is the capability of the software product to be in a state to perform a required function at a given point in time, under stated conditions of use.”

In 2011, ISO/IEC 9126 was superseded by ISO/IEC 25010 (Systems and software engineering – Systems and software Quality Requirements and Evaluation (SQuaRE) – System and software quality models), which defines availability as the [71]:

”degree to which a system, product or component is operational and accessible when required for use”

Yet another useful ISO/IEC definition is found in ISO/IEC 20000 (Information technology – Service management), where availability is defined as the [70]:

”ability of a component or service to perform its required function at a stated instant or over a stated period of time”

The ISO/IEC definitions are all very similar. Perhaps their most important difference resides in the definition of the relevant time window: again, the given point (of ISO/IEC 9126) is contrasted with the instant/period options (of ISO/IEC 20000). The less formal but quite demanding "when required for use" is added by ISO/IEC 25010. To resolve the issue of the measurement window, we follow Milanovic and distinguish steady state availability from instantaneous availability, the latter defined as "the probability that the system is operational (delivers the satisfactory service) at a given time instant" [108]. The steady state availability is defined mathematically in the following fashion:

$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} \tag{1.1}$$

MTTF denotes "Mean Time To Failure" and MTTR "Mean Time To Repair" or "Mean Time To Restore". The term MTBF, "Mean Time Between Failures", is sometimes used for repairable systems (i.e. when there can be several failures). In this thesis MTTF and MTBF are used interchangeably, as only repairable systems (enterprise IT services) are considered. It should be emphasized that since mean times are used, Eq. 1.1 measures the long-term performance of a system, i.e. the steady state system availability. In this thesis, availability refers to steady state availability, unless explicitly stated otherwise.

Indeed, this is precisely the operationalization prescribed by ISO/IEC 25010, revealing the meaning of ”when required for use” [71]:

"Externally, availability can be assessed by the proportion of total time during which the system, product or component is in an up state. Availability is therefore a combination of maturity (which governs the frequency of failure), fault tolerance and recoverability (which governs the length of down time following each failure)."

The second sentence is a more verbose explanation of the simple mathematical observation that to improve availability, as defined by Eq. 1.1, there are two strategies: increase MTTF or decrease MTTR. While increasing MTTF is the typical traditional strategy of reliability engineering, some have argued that decreasing MTTR is actually a more cost-effective way of providing the desired availability [4].
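A minimal sketch of Eq. 1.1 in Python, with hypothetical MTTF/MTTR values, illustrates that the two strategies can be substitutes as far as steady state availability is concerned:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability per Eq. 1.1: A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

base = availability(mttf_hours=1000, mttr_hours=4)          # ~0.99602
doubled_mttf = availability(mttf_hours=2000, mttr_hours=4)  # ~0.99800
halved_mttr = availability(mttf_hours=1000, mttr_hours=2)   # ~0.99800

# Doubling MTTF and halving MTTR yield the same steady-state availability;
# which strategy to pursue is then a question of cost, cf. Section 1.2.
print(f"{base:.5f} {doubled_mttf:.5f} {halved_mttr:.5f}")
```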

However, one more conceptual aspect needs to be mentioned. The point can be illustrated by the ITIL definition of availability [149]:

”Ability of a Configuration Item or IT Service to perform its agreed Function when required.”

It comes along with an operationalization [145]:

”It is Best Practice to calculate Availability using measurements of the Business output of the IT Service.”

In the enterprise IT service context, this operationalization is important. Remembering the causal flow of Johnson and Ekstedt, it is not enough that particular databases, network connections or servers can be shown to be available – what matters is the availability of the service as a whole, the so-called end-to-end availability. As pointed out by Gartner, achieving any given end-to-end service availability level requires a substantially higher average availability level of the constituent components [136]. Of course, this is not to say that measuring component availability levels is useless. However, such measurements need to be aggregated to be appropriate for the enterprise IT service context, e.g. as done in paper 4.
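Gartner's point follows from the product rule for serially dependent components. A minimal sketch, assuming independent failures and a purely serial dependency chain (both idealizations):

```python
from math import prod

def end_to_end_availability(component_availabilities: list[float]) -> float:
    """A serial chain of independently failing components is up only
    when every component is up, so availabilities multiply."""
    return prod(component_availabilities)

# Five components at 99.9% availability each give only ~99.5% end to end...
print(f"{end_to_end_availability([0.999] * 5):.4f}")

# ...so a 99.9% end-to-end target requires ~99.98% per component:
print(f"{0.999 ** (1 / 5):.5f}")
```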

To summarize, the availability addressed in this thesis is the steady state availability of enterprise IT services, as measured by their business output.

However, before moving on, it is useful to contrast availability with the closely related, yet different, notion of reliability. Rausand and Høyland (following ISO/IEC 8402) define it as follows [127]:

"The ability of an item to perform a required function, under given environmental and operational conditions and for a stated period of time."

ISO/IEC 25010 defines reliability as the [71]:

”degree to which a system, product or component performs specified functions under specified conditions for a specified period of time”

These definitions are confusingly similar to some of the definitions of availability given above. Milanovic explains the difference pedagogically: reliability is about failure-free operations; about the probability that no failure occurs during a time interval. Availability, on the other hand, allows for systems to fail and then be repaired again. Only for non-repairable systems are the two equivalent [108]. Gray makes the point very succinctly [57]: "Reliability and availability are different: Availability is doing the right thing within the specified response time. Reliability is not doing the wrong thing." In the excellent wording of Goyal et al., written 25 years ago [55]:

"System availability is becoming an increasingly important factor in evaluating the behavior of commercial computer systems. This is due to the increased dependence of enterprises on continuously operating computer systems and to the emphasis on fault-tolerant designs. Thus, we expect availability modeling to be of increasing interest to computer system analysts and for performance models and availability models to be used to evaluate combined performance/availability (performability) measures. Since commercial computer systems are repairable, availability measures are of greater interest than reliability measures."

It should be noted that some of these confusing reliability definitions have been criticized. Immonen and Niemelä explain the apparent lack of studies on availability [67]:

"One reason is the confusing definitions of the ISO/IEC 9126-1 quality model [66] that defines reliability as the capability of a software system to maintain a specified level of performance when used under the specified conditions. According to the quality model, reliability is mixed with performance, and availability is a sub-characteristic of reliability."

Regrettably, the unfortunate characterization of availability as a sub-characteristic of reliability is retained in ISO/IEC 25010, which superseded ISO/IEC 9126 in 2011 [71].

1.4 Enterprise architecture

At the intersection of information technology and business operations, new planning and evaluation tools are needed to keep track of the bigger picture. In this area, the long-established notions of software architecture and systems architecture are now being accompanied by the new notion of enterprise architecture (EA). EA does not replace, but rather complements, previous descriptions of information systems. It assumes a new level of abstraction, much like a city plan does not replace, but rather complements, the technical drawing of the individual house. At this level of abstraction, abstract properties of information systems such as availability, modifiability, security and interoperability become increasingly important to understand, in order to be able to effectively manage the landscape of IT and business activities. Recall the distinction made by Johnson about the components in enterprise software systems being considered full-scale systems in other contexts [77]. This is helpful for our understanding of the proper abstraction level of enterprise architecture: its components certainly can be opened up and considered in greater detail – but that would entail losing the bigger picture.

The science and practice of enterprise architecture have evolved in concert over the past two decades, trying to find an appropriate granularity for describing the technology, organization and business processes of enterprises in a single coherent fashion. A typical feature of enterprise architecture is the use of models, made up of entities, related by relationships, and equipped with attributes describing their properties [78, 91, 17]. Looking at a few definitions of enterprise architecture, some similarities and differences can be found:

"The formal description of the structure and function of the components of an enterprise, their inter-relationships, and the principles and guidelines governing their design and evolution over time. (Note: 'Components of the enterprise' can be any elements that go to make up the enterprise and can include people, processes and physical structures as well as engineering and information systems.)" (Ministry of Defence, [109])

(Note how this definition echoes the service definition from above, i.e. the trichotomy of people, processes, and technology – broadly construed.)

"A strategic information asset base, which defines the business, the information necessary to operate the business, the technologies necessary to support the business operations, and the transitional processes necessary for implementing new technologies in response to the changing business needs. It is a representation or blueprint" (CIO Council, [24])

”A rigorous description of the structure of an enterprise, its decomposition into subsystems, the relationships between the subsystems, the relationships with the external environment, the terminology to use, and the guiding principles for the design and evolution of an enterprise” (Giachetti, [49])

"fundamental organization of a system, embodied in its components, their relationships to each other and the environment, and the principles governing its design and evolution" (Hilliard, [62])

The TOGAF framework embraces the IEEE (Hilliard) definition, but argues that "architecture" has two different, context-dependent, meanings:

”1. A formal description of a system, or a detailed plan of the system at component level to guide its implementation

2. The structure of components, their inter-relationships, and the principles and guidelines governing their design and evolution over time”

(The Open Group, [150])

A more action-oriented definition is offered by the Gartner consultancy [46]:

"Enterprise architecture (EA) is the process of translating business vision and strategy into effective enterprise change by creating, communicating and improving the key requirements, principles and models that describe the enterprise's future state and enable its evolution.

The scope of the enterprise architecture includes the people, processes, information and technology of the enterprise, and their relationships to one another and to the external environment. Enterprise architects compose holistic solutions that address the business challenges of the enterprise and support the governance needed to implement them."

The thesis is part of an ongoing enterprise architecture research program at the department of Industrial Information and Control Systems, where non-functional qualities of information systems are analyzed using enterprise architecture methods. A key feature of this program is the role of quantitative analysis. This quantitative strain of enterprise architecture aims for its models to allow predictions about the behavior of future architectures. Thus, rather than using trial and error to govern and modify enterprise information systems, decision-makers can get model-based decision-support beforehand [80].


[Figure 1.1: From documentation and communication to analysis and explanation – an analogy: 1.1c is to 1.1a as 1.1d is to 1.1b. Panels: (a) traditional architecture (Église Saint-Sylvestre de Jailly, from Wikimedia Commons); (b) traditional EA model (DoDAF OV-5 example, from Wikimedia Commons); (c) computer aided engineering (finite element model, from Wikimedia Commons); (d) EA analysis model (the KTH EA2T tool).]

Enterprise architecture models serve several purposes, e.g. (i) documentation and communication, (ii) analysis and explanation, and (iii) design [88]. While the quantitative EA analysis program that this thesis is a part of emphasizes the second purpose, mainstream EA is still more focused on the first. Documentation and communication is certainly important and should not be denigrated. However, the aim of EA analysis is to take the additional leap of processing and visualizing the information inherent in an architectural description in such a way that non-trivial properties, which cannot be assessed at first sight, emerge.

Conceptually, this is equivalent to the leap taken by traditional architecture from drawings indicating how many columns support a roof and where they are placed to computer aided design (CAD) and computer aided engineering (CAE) models that offer automatic calculations not only of column loads and buckling, but also of acoustics and lighting.

The analogy extends to the underlying theoretical foundations: creating a sophisticated CAE tool is not only a matter of software engineering, but requires scientific understanding of solid mechanics, acoustics and optics. Similarly, EA analysis requires not only the application of modeling principles, but also domain-specific theory. Thus, forecasting the performance of services and databases might require models from queuing theory, and forecasting their availability might require a fault tree-like dependency structure. An EA analysis model of enterprise IT service availability thus requires both finding adequate theoretical underpinnings and integrating them into a suitable modeling formalism.
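As an illustration of what such a formalism can look like, here is a minimal sketch in the spirit of the fault tree approach used in papers 2 and 4. The architecture, component names and figures are hypothetical, failures are assumed independent, and the sketch is far simpler than the metamodels actually developed in the papers:

```python
from math import prod

def series(*availabilities: float) -> float:
    """All parts needed: an OR gate over failure events in fault-tree terms."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """Redundant parts, at least one needed: an AND gate over failure events."""
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical service: an application, its database, and a network
# path realized by two redundant links.
network = parallel(0.99, 0.99)            # two redundant links -> 0.9999
service = series(0.999, 0.9995, network)  # application, database, network
print(f"Predicted service availability: {service:.5f}")
```

The point of the sketch is the composition: once component availabilities and the dependency structure are captured in an architecture model, the service-level prediction falls out mechanically.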

The EA analysis program was first outlined by Johnson and Ekstedt in 2007 [78]. Two PhD theses have been published so far: by Lagerström on modifiability [90] and by Närman on multiple non-functional system qualities [114]. Forthcoming theses from the program include Sommestad on cyber security [142, 143], Ullberg on interoperability [152], and König on reliability [85].

1.5 Purpose of the thesis

The purpose of this thesis is to offer support for decision-making related to the availability of enterprise IT services. As described above, such services span not only traditional software architecture, but rather a combination of technology with processes and people. Therefore, it is suitable to make enterprise architecture models – which are designed to span at least the technology and process parts – a key part of the approach. Such enterprise architecture models should support analysis of future scenarios and offer estimates of how different courses of action would impact the availability of the services within the decision-maker's domain. More specifically, therefore, the main purpose of the thesis is:

• To develop a method for enterprise IT service availability analysis using enterprise architecture models.

To achieve this, the method needs to appropriately reflect modern availability management practices (including outsourcing of services that are controlled only through Service Level Agreements), it needs to be based on an adequate understanding of the factors affecting availability, it needs to be properly formalized, and it needs to be empirically investigated and validated. Thus, three important subgoals can be discerned:

1. To investigate the causal factors contributing to availability levels.

2. To create an enterprise architecture metamodel for availability analysis and prediction.

3. To empirically test, investigate and validate the approach.

Delimitations

According to the ITIL definition cited above, an (enterprise) IT service is made up from a combination of people, processes, and technology. While most previous work on enterprise IT service availability tends to focus on technology, this thesis also includes the process perspective. However, the people component has been excluded from the constituent papers, in order not to over-extend the research scope.

Furthermore, this thesis is delimited to steady state availability, rather than instantaneous availability. It is reasonable to assume that when studying enterprise IT service availability from an EA perspective (looking at the "bigger picture" of structure and processes), long-term average changes can be tracked, explained and predicted in terms of process maturity, whereas instantaneous outages need to be predicted by other means (if they can be predicted at all).


1.6 Related work

Surveying the scientific literature on availability analysis methods in 2008, Immonen and Niemelä find it surprisingly scarce [67]:

"The literature that addresses the availability analysis methods is very scarce; except two methods [57,58]. The availability analysis has not been studied, or at least, we could not find any evidence."

As noted above, they attribute this lack of literature partly to the confusing definition of the ISO/IEC 9126 standard. Still, they note the importance of such analysis and call for more research [67]:

”However, future software systems are service oriented and the availability of services is a critical quality factor for systems end-users. Therefore, availability is an issue that requires specific consideration, extensive research, and development of the appropriate analysis methods, techniques, and tools.”

As explained in Section 1.5, availability analysis is at the heart of this thesis. However, even if there is precious little literature on availability analysis, there is still a lot of related work on reliability and availability. In the following review, a few of the most important lines of thought are described and contrasted with the approach of this thesis.

From hardware to software reliability

Functioning hardware is the most fundamental precondition for working enterprise IT services. In the early days of computing, hardware failure was a major source of outages. However, the impact of hardware failure on the availability of enterprise IT systems and services has been steadily decreasing since then [47]. Already in 1985, Gray showed that IT administration and software are the major contributors to failure [57], and in a later work Gray shows that the sources of failures have changed from hardware to software over the 1985-1990 time period [58]. Similarly, Pertet and Narasimhan investigate the causes and prevalence of failure in web applications, concluding that software failures and human error account for about 80% of failures [124]. A similar (but not identical) claim is made by Malik and Scott, who assert that people and process failures cause approximately 80% of "mission-critical" downtime [99].

The increasing relative importance of software over hardware spawned the well-researched field of software reliability. Software reliability differs importantly from hardware reliability. As it is succinctly put in ISO/IEC 25010 [71]:

”Wear does not occur in software. Limitations in reliability are due to faults in requirements, design and implementation, or due to contextual changes.”

Software reliability models play a key role in this field. Probabilistic models include failure rate models that describe how failure rates change as a function of the remaining faults in a program. Well-known failure rate models include the Jelinski-Moranda model [74] from 1972, the Schick-Wolverton model [133] from 1978, the Jelinski-Moranda geometric model [111] from 1981, the Moranda geometric Poisson model [110] from 1975, the negative-binomial model, the modified Schick-Wolverton model [144] from 1977, and the Goel-Okumoto imperfect debugging model [51] from 1979. Reliability growth models such as Coutinho's [28] from 1973 or that of Wall and Ferguson [157] from 1977 explain the reliability improvement throughout testing and debugging. Non-homogeneous Poisson process (NHPP) models describe reliability by letting an NHPP fit the number of failures experienced up until a certain time t. Models include the Musa exponential model [113] from 1987, S-shaped growth [163] from 1983 and generalized NHPP [126] from 1997. There are also deterministic models, the most famous of which are probably the Halstead [59] (1975) and McCabe [102] (1976) measures of complexity. (The taxonomy of software reliability models partly re-used in this section is due to Pham, who also offers brief model descriptions [125].)
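As an illustration of the failure rate family, here is a minimal sketch of the Jelinski-Moranda model under its textbook assumptions: a program starts with N faults, each contributing an equal amount phi to the failure intensity, and every detected fault is fixed perfectly. Parameter values are hypothetical.

```python
import random

def jm_failure_intensity(n_initial_faults: int, phi: float, faults_found: int) -> float:
    """Jelinski-Moranda: the failure intensity is proportional to the
    number of faults still remaining in the program."""
    return phi * (n_initial_faults - faults_found)

def simulate_failure_times(n_initial_faults: int, phi: float, seed: int = 0) -> list[float]:
    """Successive failure times: inter-failure times are exponentially
    distributed, with a rate that drops after each (perfect) fix."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for found in range(n_initial_faults):
        t += rng.expovariate(jm_failure_intensity(n_initial_faults, phi, found))
        times.append(t)
    return times

print(simulate_failure_times(n_initial_faults=5, phi=0.01))
```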

New technologies also push for new analysis methods. The advent of peer-to-peer file sharing systems prompted Bhagwan et al. to model users periodically leaving and joining the system as intermittently available components [13], and others to similarly analyze availability in terms of end-user behavior [132, 23, 31].

Another strand of software reliability deals with causes of failure. This is a diverse strand, ranging from general considerations on what causes systems to go down [156], through well-delimited particular subjects such as the reliability of Windows 2000 [112], to high-impact events such as large-scale Internet failures [120]. There is also methodological diversity. For example, Zhang and Pham present an effort to identify factors affecting software reliability through surveys directed to stakeholders in software development or research [170].

Over the decades, numerous review articles have been written on software reliability methods and models, e.g. Shanthikumar [139], Goel [50] and Cai et al. [19]. However, despite prolific theory development, practical applicability remains limited. Thus Koziolek et al. (2010) summarize their research [87]:

"Although researchers have proposed more than 20 methods in this area, empirical case studies applying these methods on large-scale industrial systems are rare. The costs and benefits of these methods remain unknown. [...] We found that architecture-based software reliability analysis is still difficult to apply and that more effective data collection techniques are required."

Gokhale (2007) identifies related limitations in architecture-based software reliability analysis, since existing techniques [53]:

”1. cannot consider many characteristics and aspects that are commonly present in modern software applications (for example, existing techniques cannot consider concurrent execution of components),

2. provide limited analysis capabilities,

3. cannot be used to analyze the reliability of real software applications due to the lack of parameter estimation techniques, and

4. are not validated experimentally.”

On a similarly skeptical note, Milanovic (2010) concludes [108]:

"Software reliability measures can at present be achieved with pretty good accuracy if programming team has a substantial track data and lots of reliability data to support it, which is rarely the case. As no standard or widely accepted reliability model exists, curve fitting seems to be most popular in practice."

Of course, the difficulty of obtaining accurate data for reliability models has been addressed [54]. However, there is a more profound reason why software reliability is not at the forefront of this thesis, related to its scope of investigation. To see why this is so, consider the work by Gokhale and Mullen, a thorough investigation of software repair rates based on more than 10 000 software defect repair times collected from Cisco systems [52]. Recalling the definitions given in Section 1.3, it would seem that joining the work of Gokhale and Mullen on repair times with the large existing body of knowledge on software failure times would yield good availability models. However, the usefulness of such an availability measure is highly limited. Gokhale and Mullen investigate finding and fixing bugs in software. Finding and fixing such bugs can take days, weeks, or months, but businesses restoring their IT services on those time scales go out of business. The discrepancy is easy to understand: to get a service back on-line, it is not necessary to repair the bug in a software engineering sense. Indeed, probably the most common service restore is a rollback – simply going back to the last working configuration. Any bugs in the flawed version can then be found and fixed off-line. This is the reason why applying Eq. 1.1 to software failure and repair rates, as they are mostly investigated in the software reliability literature, does not yield a very relevant availability measure for any business with high availability requirements.

Service availability

The limitations of software reliability research, together with the advent of the service-oriented paradigm in software engineering [37], brought about a new way to look at availability. In fora such as the International Service Availability Symposium mentioned above, availability started to be analyzed not from the perspective of software failure, but rather from that of the consequences of service failure. The program chair’s message at the first symposium (2005) sets the tone [97]:

”No matter whether we call the computing services of the future ”autonomic,” ”trustworthy” or simply ”reliable/available” the fact of the matter is that they will have to be there seven days a week, 24 hours a day, independent of the environment, location and the mode of use or the education level of the user. This is an ambitious challenge which will have to be met. Service availability cannot be compromised; it will have to be delivered. The economic impact of unreliable, incorrect services is simply unpredictable.”

The research challenge thus outlined is quite close to the scope of this thesis.

Within this paradigm, some important strands can be identified. As technologies important to enterprise IT services, the availability of databases [32, 38, 161, 165] and of middleware [128, 117, 122, 105] has received a lot of attention. While these subjects were also studied within the software reliability paradigm, failing external services [115] and service composition [140, 33, 107] – particularly quality-of-service-aware service composition [168, 167, 104] – are topics more native to the perspective of service availability. So are the interest in user-perceived availability [162, 82, 158, 151, 121] and the SLA management perspective on availability [137, 11, 5, 160].

On the method side, the service-oriented paradigm makes the architecture a useful level of analysis [131, 86, 89, 67, 39, 103], reflecting the fact that architectures composed from constituent services are subject to change. This is closely related to the area of component-based reliability assessment [30, 60, 56] and, more recently, component-based availability modeling [20, 106].
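As a minimal sketch of the component-based view – with hypothetical components, availability figures, and an independence assumption, not data from any of the cited studies – steady-state availabilities compose multiplicatively in series, while redundant components multiply their unavailabilities:

# Minimal sketch: composing steady-state availabilities of independent
# components. All names and figures are hypothetical illustrations.

def series(*availabilities):
    """All components must be up: availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """At least one replica must be up: unavailabilities multiply."""
    down = 1.0
    for a in availabilities:
        down *= (1.0 - a)
    return 1.0 - down

# A service depending on two redundant web servers, an application
# server, and a database:
web_tier = parallel(0.995, 0.995)        # ~0.999975
print(series(web_tier, 0.999, 0.998))    # ~0.99698

Such closed-form composition is what makes the architecture a natural level of analysis: changing the architecture model changes the formula, and the estimate can be recomputed automatically.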

Milanovic offers what is probably the most comprehensive survey to date of tools for availability assessment in the service-oriented paradigm [108], comparing more than 60 commercial, public-domain and academic tools. The result is worth quoting, as it is closely linked to the difference between traditional software systems and modern service-oriented enterprise systems:


”Based on this part of the study, the following conclusion was drawn: existing methods and tools are powerful, but difficult to apply directly for modeling service availability. Historical development of availability models and tools has associated them with mission-critical systems that rarely change during their lifetime, such as real-time systems, avionics, or telecommunication circuits. The availability assessment procedures they offer are unable to adapt to fast-paced changes in modern SOA [service-oriented architecture] systems. Each time the IT infrastructure or business process changes, it is required to manually intervene and update, verify and evaluate availability model.”

This is the raison d’être for architecture-based availability analysis in the service-oriented paradigm. Architectures need to be quickly and accurately (i.e. automatically) created and then used to compute the availability of the design alternative described. Indeed, this is closely related to the model-driven paradigm in software engineering, and the literature abounds with model-oriented assessment formalisms for availability and reliability. A model for predicting the reliability of future systems based on component reliability, represented in UML, is proposed in a series of papers by Singh et al. [141, 27, 26]. Along related lines, Bocciarelli and D’Ambrogio use web services described in the business process execution language (BPEL) to create UML models annotated with reliability data, so that the reliability characteristics of composite services can be computed [16]. Other work in the same area includes Leangsuksun et al., Rodrigues, Bernardi and Merseguer, and Majzik et al., who all offer methods that integrate system design aspects as expressed in UML models with computational formalisms that enable the prediction of the reliability of software systems still in the design stage [92, 130, 12, 96]. Unlike this thesis, these approaches do not account for the governance aspects of service availability, e.g. the impact of IT service management process maturities. The same holds for Immonen, who uses a design-stage simulation to detect relations between system components that impact the reliability of the execution paths of the final architecture [66].
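A common formalism in this literature – though not the method of this thesis – is Cheung-style architecture-based reliability prediction, where control transfer between components is modeled as an absorbing Markov chain. A minimal sketch, with hypothetical components, reliabilities and transfer probabilities:

# Sketch of Cheung-style architecture-based reliability prediction.
# Component i succeeds with probability R[i] and then transfers control
# with probabilities P[i][j]; system reliability is the probability of
# reaching the absorbing "success" state. All figures are hypothetical.

R = {"ui": 0.999, "logic": 0.995, "db": 0.998}
P = {
    "ui":    {"logic": 1.0},
    "logic": {"db": 0.6, "success": 0.4},
    "db":    {"success": 1.0},
}

def reach_success(state):
    """Probability of reaching 'success' from state (acyclic chain)."""
    if state == "success":
        return 1.0
    return R[state] * sum(p * reach_success(nxt) for nxt, p in P[state].items())

print(reach_success("ui"))   # ~0.9928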

Milanovic’s work is also similar. Following his survey of available tools and formalisms, he offers an architecture-based availability assessment tool, where an availability model (fault tree or Markov model) is generated from an architecture description of the underlying ICT infrastructure, populated with data from an infrastructure data repository, and solved with respect to a business process modeled in BPMN or a similar language [108]. This is state-of-the-art architecture-based availability analysis, and quite similar to this thesis. The main difference is that whereas Milanovic limits his model to assessing availability in terms of the underlying ICT infrastructure, our aim is also to account for how IT service management processes affect service availability. These factors cannot be localized to any particular place in the service architecture, but they nevertheless impact its availability. In this sense, this thesis is also somewhat similar to design guides for high availability, such as those offered by Vargas [154, 153], Holub [64] and Kalinsky [83], or to general availability risk management models, such as that offered by Zambon et al. [166].
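As an illustration of the fault tree style of evaluation – a sketch with hypothetical gates and unavailability figures, assuming independent basic events – an AND gate requires all inputs to fail, an OR gate any one of them:

# Minimal fault tree evaluation, assuming independent basic events.
# The tree structure and the unavailability figures are hypothetical.

def and_gate(*q):
    """Output event occurs only if all inputs occur: product of q_i."""
    result = 1.0
    for qi in q:
        result *= qi
    return result

def or_gate(*q):
    """Output event occurs if any input occurs: 1 - product of (1 - q_i)."""
    up = 1.0
    for qi in q:
        up *= (1.0 - qi)
    return 1.0 - up

# Redundant servers bring the service down only if both fail (AND);
# the service is down if the database, the application, or the server
# pair is down (OR):
q_service = or_gate(0.002, 0.005, and_gate(0.01, 0.01))
print(q_service)                 # ~0.007089
print(q_service * 8766, "h/yr")  # ~62 hours of downtime per year

Generating such a tree automatically from an architecture model, rather than building it by hand, is precisely what makes the approach viable in a fast-changing SOA environment.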

Availability in ever-changing enterprise environments

The service availability research paradigm properly accounts for the dynamics introduced by the ad hoc service composition envisioned by SOA advocates: if a service is dynamically reconfigured to include other atomic services, a new architecture for availability evaluation can be automatically generated within a framework such as that proposed by Milanovic.

However, there is another kind of dynamics that is unaccounted for. Malek, in a tutorial delivered at the 2008 International Service Availability Symposium, describes this challenge [98]:

”Also, due to dynamicity (frequent configurations, reconfigurations, updates, upgrades and patches) an already tested system can transform into a new untested system and additional faults may be inserted.”

Indeed, this is not news. As early as 1985, Gray noted that system administration was the main source of failures, accounting for 42% of reported system failures in his study [57].

This is an important reason why methods that consider only a static environment, where humans do not intervene, will be unable to capture many important causes of unavailability.

The service availability paradigm accounts for dynamic reconfiguration of services – but not for the propensity of human error. This thesis attempts to include such outage causes as well, and to reconcile them with the traditional, architectural approach. The Bayesian model used (presented in Paper 3) has some similarities to the work of Zhang et al. [169], but whereas they base their model on system logs, we base ours on expert judgments.
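The elicited model itself is presented in Paper 3; purely to illustrate the general idea of turning expert-elicited causal factors into an outage probability, consider a leaky noisy-OR sketch (all factor names and probabilities below are hypothetical, not elicited values):

# Illustration of combining expert-elicited causal factors with a leaky
# noisy-OR. Factor names and probabilities are hypothetical, not the
# elicited values of Paper 3.

def noisy_or(active_factors, p_cause, leak=0.001):
    """P(outage) given independently acting causes and a background leak."""
    p_ok = 1.0 - leak
    for factor in active_factors:
        p_ok *= (1.0 - p_cause[factor])
    return 1.0 - p_ok

p_cause = {                       # P(outage | factor present alone)
    "unmonitored_change": 0.05,
    "no_rollback_plan": 0.03,
    "immature_incident_process": 0.02,
}

print(noisy_or(["unmonitored_change", "no_rollback_plan"], p_cause))
# 1 - 0.999 * 0.95 * 0.97  ->  ~0.0794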

The impact of system administration puts additional emphasis on the human part of the technology-processes-humans trichotomy. Naturally, the literature in this field is less technology-centered, focusing instead on, for instance, the importance of communication with and support from senior executives [61, 118, 73, 2] and the use of proper key performance indicators (KPIs) [8, 94]. Paper 1 addresses this issue to some extent, pointing out that the single-figure notion of availability is seriously inadequate. However, the human part of the technology-processes-humans trichotomy is mostly excluded from the thesis work.

1.7 Results

This section summarizes the contribution of each of the papers constituting this composite thesis and relates them to the goals identified in Section 1.5 above.

Papers 2, 3, 4, and 5 form the main pillar of the thesis. Paper 2 (2009) introduces an architectural approach to availability modeling, addressing the second subgoal identified in Section 1.5 above. Paper 3 (2011) introduces a causal factor approach to availability modeling, addressing the first subgoal identified in Section 1.5 above. Paper 4 (2012) extends the architectural approach of paper 2, and its five case studies verify that the combination of fault trees and architectural descriptions yields useful and accurate results, addressing the second and third subgoals. Paper 5 (2012) integrates the fault tree and causal factors approaches into a single framework and tests it empirically, addressing the second and third subgoals. This line of work constitutes the long-term direction of the thesis project, aligned with the overall department research program on enterprise architecture.

Paper 1 (2012) was spawned by the realization – during the work on the main pillar – that the single-figure notion of availability (e.g. 99.8%) is seriously inadequate.
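To see why, consider a Monte Carlo sketch in the spirit of Paper 1, with entirely hypothetical numbers: two services both report 99.8% availability (about 17.5 hours of annual downtime), but one takes it as a single long outage and the other as a thousand short ones. If the revenue lost per outage hour depends on when the outage strikes, the two profiles have the same expected cost but very different cost variance:

# Hypothetical sketch: equal availability, unequal outage-cost variance.
# Revenue figures and the day/night cost profile are invented for
# illustration; the cost rate is sampled at the outage start time.
import random

HOURS_PER_YEAR = 8766
DOWNTIME = 0.002 * HOURS_PER_YEAR            # ~17.5 h/year in both profiles

def cost_per_hour(hour):
    """Invented revenue loss: ten times higher during business hours."""
    return 10_000 if 8 <= hour % 24 < 18 else 1_000

def simulate(n_outages, runs=2_000):
    """Total annual outage cost when downtime is split into n outages."""
    duration = DOWNTIME / n_outages
    costs = []
    for _ in range(runs):
        total = sum(cost_per_hour(random.uniform(0, HOURS_PER_YEAR)) * duration
                    for _ in range(n_outages))
        costs.append(total)
    mean = sum(costs) / runs
    std = (sum((c - mean) ** 2 for c in costs) / runs) ** 0.5
    return mean, std

for n in (1, 1_000):                         # one long outage vs many short
    print(n, simulate(n))                    # same mean, very different std

A single availability figure cannot distinguish these cases, although a risk-averse business clearly should.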

The following short summaries are based on the abstracts of each of the constituent papers of the thesis.

Paper 1: Optimal IT Service Availability: Shorter Outages, or Fewer?

Even though many companies are beginning to grasp the economic importance of availability management, the implications for management of Service Level Agreements (SLAs) and availability risk management still need more thought. This paper offers a framework for thinking about availability management, highlighting the importance of variance of outage costs. Using simulations on previously existing data sets of revenue data, variance is shown to be an important aspect of outage costs.
