http://www.diva-portal.org
Postprint
This is the accepted version of a paper presented at The 4th ACM SIGOPS/SIGACT Workshop on Large Scale Distributed Systems and Middleware.
Citation for the original published paper:
Canini, M., Novakovic, D., Jovanovic, V., Kostic, D. (2010) Fault Prediction in Distributed Systems Gone Wild.
In: Proceedings of The 4th ACM SIGOPS/SIGACT Workshop on Large Scale Distributed Systems and Middleware
N.B. When citing this work, cite the original published paper.
Permanent link to this version:
http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-147099
Fault Prediction in Distributed Systems Gone Wild
Marco Canini, Dejan Novaković, Vojin Jovanović, and Dejan Kostić
Networked Systems Laboratory
School of Computer and Communication Sciences, EPFL, Switzerland
firstname.lastname@epfl.ch
ABSTRACT
We consider the problem of predicting faults in deployed, large-scale distributed systems that are heterogeneous and federated.
Motivated by the importance of ensuring reliability of the services these systems provide, we argue that the key step in making these systems reliable is to predict faults automatically. For example, doing so is vital for avoiding Internet-wide outages that occur due to programming errors or misconfigurations.
Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Systems; H.4.3 [Information Systems Applications]: Communications Applications
General Terms
Reliability
Keywords
Fault prediction, Federated systems, Heterogeneous systems, Shadow snapshot, Spatial and temporal awareness, BGP
1. INTRODUCTION
Large-scale distributed systems are already at the foundation of today’s Internet services and continue to grow in popularity and importance. However, making large-scale distributed systems reliable is a notorious challenge. Moreover, many successful systems become heterogeneous due to the creation of multiple implementations and evolve into multi-provider distributed systems as a result of deployment in the wide-area network with several federated administrative domains.
Recent research efforts have focused on finding bugs in distributed systems by applying model checking [15, 22, 23] or symbolic execution [20] to explore a large number of potential states.
We argue that making heterogeneous, federated distributed systems reliable is even more challenging because (i) the source code of every node may not be readily available for testing and (ii) competitive concerns are likely to induce individual providers to keep private much of their current state and configuration.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
LADIS’10 Zürich, Switzerland
Copyright 2010 ACM 978-1-4503-0406-1 ...$10.00.
Examples of systems of such nature encompass inter-domain routing, electronic mail, peer-to-peer content distribution [11], and content and resource peering [5, 12]. We believe that the rapid evolution of cloud computing will further foster the emergence of new such systems.¹
Motivated by the importance of ensuring dependability of long-running systems that are federated and heterogeneous, we argue for the need to predict faults and assess their impact as this is the crucial step in being able to guard against important classes of faults.
A key insight in doing so is that nodes need to become spatially and temporally aware of the consequences of local actions on their neighborhood. We propose to achieve spatial awareness by creating a consistent, shadow snapshot, i.e., a distributed snapshot of the system taken from the current live state in which nodes are allowed to communicate with each other in isolation from the running environment, while respecting the node trust boundaries. Then, we achieve temporal awareness by subjecting the system snapshot to a large number of possible scenarios created by systematically exploring the potential behavior of a node and judging the wider impact of its actions. Finally, we predict the aggregate future behavior across multiple nodes by checking the status of safety properties and system invariants in the shadow snapshots. This approach is illustrated in Figure 1.
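The snapshot, explore, and check steps described above can be illustrated with a deliberately small, single-process sketch. It is not the authors' prototype: the toy routing model, the loop-freedom invariant, and every name in it (`take_shadow_snapshot`, `has_forwarding_loop`, `predict_faults`) are assumptions made for illustration, and it ignores distribution, snapshot consistency, and trust boundaries entirely.

```python
import copy

def take_shadow_snapshot(live_state):
    """Copy the live state so exploration cannot affect the running system.
    (Assumption: a deep copy stands in for a consistent distributed snapshot.)"""
    return copy.deepcopy(live_state)

def has_forwarding_loop(next_hop, dest):
    """Hypothetical safety property: following next hops from any node
    must never revisit a node before reaching dest."""
    for start in next_hop:
        seen = set()
        node = start
        while node != dest and node in next_hop:
            if node in seen:
                return True
            seen.add(node)
            node = next_hop[node]
    return False

def predict_faults(live_next_hop, neighbors, node, dest):
    """Explore each candidate action of `node` (switching its next hop to one
    of its neighbors) on a fresh shadow snapshot, and return the candidates
    whose resulting state violates the invariant."""
    risky = []
    for candidate in neighbors[node]:
        shadow = take_shadow_snapshot(live_next_hop)
        shadow[node] = candidate  # the action is applied to the shadow copy only
        if has_forwarding_loop(shadow, dest):
            risky.append(candidate)
    return risky

# Toy topology (assumed): A -> B -> C -> D, where D is the destination.
live = {"A": "B", "B": "C", "C": "D"}
neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["B", "D"]}

print(predict_faults(live, neighbors, "C", "D"))  # prints ['B']
```

Here node C learns, without touching the live state, that pointing its next hop at B would create a forwarding loop. The sketch explores only one action deep; the approach in the text subjects the snapshot to a large number of scenarios and checks properties across multiple nodes.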
In this context, by fault prediction we mean detecting possible sequences of actions that reach states presenting inconsistencies that can lead to failures. As these might never actually happen, our prediction is loose with respect to time, in that it is not associated with a time window during which the failures could occur. For a thorough discussion of failure prediction, we refer the reader to [19].
According to the literature [6], faults are the adjudged or hypothesized cause of an error. A failure refers to misbehavior that can be observed by the user. Given that we are exploring possible actions, these include faults. Therefore, because we want to avoid failures, i.e., prevent faults that have a visible erroneous state, we refer to our approach as fault prediction.
Based on this approach, we are building a prototype which is already successful at predicting certain operator mistakes in BGP router configurations.
With this paper we solicit discussions around the key insights of making distributed systems reliable.
The rest of the paper is organized as follows. In Section 2 we motivate the importance of predicting faults. Section 3 presents a num-
¹ For instance, Google has opened to third parties the Google Wave Federation Protocol promoting the creation of interoperable wave providers. http://www.waveprotocol.org
[Figure 1 (diagram). Recovered labels: Time; Live system; Explore system behavior; Explore shadow snapshot 1; Explore shadow snapshot n.]