
Institutionen för datavetenskap

Department of Computer and Information Science

Final thesis

Automatic fault detection and localization in IP networks

Active probing from a single node perspective

by

Christopher Pettersson

LIU-IDA/LITH-EX-A--15/022--SE

2015-06-20

Linköpings universitet

SE-581 83 Linköping, Sweden



Final Thesis

Automatic fault detection and localization in IP networks

Active probing from a single node perspective

by

Christopher Pettersson

LIU-IDA/LITH-EX-A--15/022--SE

2015-06-20

Supervisor:

Johannes Schmidt, LiU IDA

Nicklas Weman, Ericsson

Examiner:

Ola Leifler, LiU IDA


Abstract

Fault management is a continuously demanded function in any kind of network management. Commonly it is carried out by a centralized entity on the network, which correlates collected information into likely diagnoses of the current system states. We survey the use of active on-demand measurements, often called active probes, together with passive readings, from the perspective of one single node. The solution is confined to the node and is isolated from the surrounding environment. The utility of this approach to fault diagnosis was found to depend on the environment in which the specific node is located: the less knowledge about the environment is available, the more useful the solution becomes. Consequently, this approach to fault diagnosis offers limited opportunities in the test environment, whereas greater prospects were found for this approach in a heterogeneous customer environment.


Acknowledgements

I would like to thank my supervisor at Ericsson, Nicklas Weman, for swiftly and positively making sure I got answers to all my questions. I also want to thank Mats Ljungberg, Michael Lundkvist and everyone else at Ericsson who gladly provided me with information, support and encouragement throughout my thesis work.

Additionally, I want to thank my supervisor Johannes Schmidt and examiner Ola Leifler for their advice, guidance and precise suggestions and feedback throughout the thesis work.

Last but not least I want to thank my family and friends for their great support in every regard imaginable.

Thank you!

Christopher Pettersson Linköping, Sweden


Table of Contents

Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Thesis purpose
1.4 Research questions
1.5 Scope
1.6 Disposition
1.7 Notation

Chapter 2 Theoretical Background
2.1 Fault detection and localization
2.2 Data collection and probing
2.2.1 Similar tools
2.2.2 Active Probing
2.3 Bayesian decision networks
2.4 Binary decision trees
2.5 IP Networking
2.5.1 The IP stack
2.5.2 IP in telecommunications
2.5.3 Auto Negotiation
2.6 Node unit
2.7 Test environment
2.8 Customer environment

Chapter 3 Method
3.1 Pre study
3.2 Prototype implementation
3.3 Main implementation
3.4 Evaluation
3.4.1 Prototype implementation
3.4.2 Main implementation

Chapter 4 Results
4.1 Pre study
4.2 Prototype implementation
4.3 Main implementation

Chapter 5 Discussion
5.1 Results
5.2 Method
5.3 Test environment
5.4 Customer environment
5.5 Source criticism
5.6 The work in a wider context

Chapter 6 Conclusions

References

Appendix A Scenarios
A.1 Scenario 1 - Missing connectivity to remote client

Appendix B Produced code Prototype implementation
B.1 Connectivity engine
B.2 Demo runner
B.3 Matching engine

Appendix C Main implementation evaluation tests
C.1 Step 1 – Unit test
C.2 Step 2 – Verification test
C.3 Step 3 – Evaluation test

Appendix D Results main implementation
D.1 Step 1 – Unit test
D.2 Step 2 – Verification test


List of Figures

Figure 1 Basic Bayesian decision network, T being probes and X states.
Figure 2 Decision tree, T being probes and X states.
Figure 3 Scenario 1.
Figure 4 Structure of the software built in the prototype implementation, called Demo runner.
Figure 5 Illustration of the sync issue.
Figure 6 Illustration of the auto negotiation issue.
Figure 7 Binary decision tree from dependency matrix.
Figure 8 Bayesian network for the thesis taken from the dependency matrix.

List of Tables

Table 1 The layers described by OSI and Internet model.
Table 2 Dependency matrix.
Table 3 Result for step 1 testing.
Table 4 Result for step 2 testing.
Table 5 Result for step 3 testing, selected test cases.


Chapter 1

Introduction

Managing faults represents one of the certainties in any kind of network, regardless of size: whether it is a small network used for simulation in a lab, or a big network providing internet access and telephone coverage to an entire country. Having a strategy to detect and locate faults as early as possible is vital to maintain the performance and stability of a network.

For a considerable time, centralized software has been utilized by telecommunication systems to monitor, maintain and manage faults in the networks. However, much less effort has been spent on the nodes closest to the consumers, at the radio stations. Would it not be useful to catch the faults as close to the source as possible? Can the node be empowered to actively detect and locate faults from its own perspective?

By giving the nodes themselves the capability to perform the first line of troubleshooting, faults can be detected and located both earlier and with greater precision than with a centralized solution.

Additionally, as the node itself is doing the troubleshooting, improvement can begin already in the simulation environment where the node software is being developed and tested. The node is now authorized to reason about its own connectivity, a view previously available only to manual troubleshooters on rare occasions, and can thereby provide all interested parties with a unique and novel perspective.

1.1 Background

In the simulation environment, just as in a real network, the node is connected to a set of other units. When developing and testing new software for the node, one only wants to be bothered by faults and issues directly related to the software under test. Historically, the share of faults not directly related to the software under test has been high, upwards of 60 %1.

In comparison to the network in the customer environment, the simulation environment does not have any centralized fault management in place for detecting and localizing connectivity related faults. In the simulation environment the need for manual troubleshooting of environmental faults is vast; yet, due to time constraints, it is currently not possible1. As the node is also utilized in a customer environment, the use of such a solution there will also be included in the thesis.

Together with experienced troubleshooters, a subset of fault causes was identified as focal points for this thesis. Expressed in a compact manner, possible causes for the previously mentioned questions are:

Misconfiguration concerning link operation mode, with exploration of possible duplex mismatch situations.

Lost connection to an explicitly configured destination vital for the operation of particular features and services.


Currently, the information available from an isolated, single-node perspective is minimal. A tool located appropriately could change this fact and provide previously inaccessible information from isolated nodes in both the test and customer environments.

1.2 Motivation

Currently there is no practical way to analyze the connectivity of a transportation node. It is a task that has to be performed manually by operators, using a multitude of tools to gather information on present activity in the nearest reachable area of the node.

The questions requiring answers in this thesis are: Can this process be performed in an automated fashion? Can we build software which is capable of gathering all this information and making sense of it, and thus save time and resources for operators? Can this process gather information in a timely, reliable and structured manner?

Questions that are currently posed but cannot be answered at a reasonable cost include1:

What connections are working and which ones are not?

Why can I not ping to this destination?

How come I only get half the bandwidth on this particular interface?

I am experiencing operational failure during this time period; can any inconsistencies in the connectivity be found?

1.3 Thesis purpose

The purpose of the thesis is to create a proof of concept software which is able to perform automated fault detection and localization while located on an isolated node unit.

In addition to detecting and localizing possible culprits, connectivity information is gathered in an effort to shed some light on the connectivity situation at the node. Previously hard-to-get information can be obtained by triggering the tool on an interval, or when other detectable events occur on the node. We want to harness the information available at the node and utilize it in the most efficient manner possible to reveal the cause of faults.

This thesis aims to outline the path towards a more autonomous approach to fault management by developing a proof of concept based on the Active Probing approach (Rish et al. 2005) together with the experience of seasoned troubleshooters.

Basically, this work intends to put the capabilities of a troubleshooter doing manual probing into lines of code executing on the node itself, with the potential to be used on all nodes in a network.

1.4 Research questions

Can the Active Probing approach for fault detection and localization be applied on a single node in an IP network?

How useful is the information from the Active Probing approach on a single node in an IP network?

How would a potential solution be applied within a customer network?

1.5 Scope

This thesis will put emphasis upon the process of how to automate the detection and localization of IP connectivity related faults. No attention will be spent on how to localize multiple simultaneous faults.


Due to limited access to hardware, we will not be able to evaluate the thesis software practically in a customer environment. Instead, a test environment will be utilized. Issues concerning software bugs will not be handled, and neither will issues that are exclusive to the simulation environment.

The intended use of the software tool is to replace the manual component in troubleshooting connectivity faults. Only existing tools and information at the disposal of an experienced troubleshooter will be considered as input to the thesis software. The software will not attempt to correct localized faults; the objective is to research and report information with significant impact on normal operation.

1.6 Disposition

The rest of the report is divided into five chapters: 2 Theoretical Background, 3 Method, 4 Results, 5 Discussion and 6 Conclusions.

Chapter 2, Theoretical Background, introduces the field of research and gives the reader knowledge about concepts used throughout the thesis.

Chapter 3, Method, describes the work performed in order to answer the research questions. Chapter 4, Results, describes the findings from the evaluation performed upon the software created. Chapter 5, Discussion, looks back at the results and, with the aid of both the background and the research questions, discusses the findings.

Chapter 6, Conclusions, presents answers to the research questions, founded in the knowledge gathered through the thesis as a whole.

1.7 Notation

In order to aid the reader in grasping important concepts used throughout the thesis, these main concepts are further explained below.

Fault detection is the process of discovering if there is at least one faulty component in a system (Rish et al. 2005). When performing fault detection in a network, interest lies in whether a fault exists or not, rather than in locating it.

Fault localization is the process of analyzing a collection of observed or gathered fault indications, with the intent to find an explanation for this collected knowledge. In literature this is also referred to as fault isolation, root cause analysis or event correlation (Steinder & Sethi 2004).

The node or node unit is an IP router that is running custom software developed by Ericsson.

A probe is a method for obtaining information about the system and its surrounding environment. In this thesis, as well as in the article about Active Probing (Rish et al. 2005), a probe may return either a 0 for success or a 1 for failure.

State is a way to describe the characteristics of the system at a certain time. In this thesis a state is normally defined as a collection of probe results representing a hypothesis regarding the connectivity of a destination/interface combination.

Simulation environment, also referred to as test environment, is the environment Ericsson uses to verify the stability of the software. A multitude of tools help build this infrastructure, which runs on authentic physical hardware.

Customer environment refers to an environment serving normal end users. This environment can be rather large and is generally also heterogeneous.

ComCli is the shell used to communicate with the Ericsson-developed routing software; it is a resource used to get configuration information for the node unit.

A connectivity related fault is a fault directly, or closely, related to the function of IP network connections. A ping response is directly related, while the functional state of a feature like clock sync is closely related, because the feature will cease to perform its intended function in the event of a lost connection to the synchronization server.


An interface in this thesis is usually a network interface that has an IP address and is used for IP communication.


Chapter 2

Theoretical Background

This chapter provides an insight into the theory regarded as the basis for the work carried out in this thesis.

2.1 Fault detection and localization

The need to handle problems in IP networks has been around since the early days of networking. As the networks have evolved and grown, so has the research area for how to manage them in the best possible way. One primary part of the general field of network management is fault management.

There is a strong bond between the areas of fault detection and fault localization, mainly because both require similar knowledge: knowledge regarding the symptoms that can be associated with a fault, and vice versa.

Steinder and Sethi (2004) define fault detection, fault localization and testing to be the process of fault diagnosis, where testing is the process of verifying whether the result of the fault localization is true to fact or not.

When comparing the work involved in fault detection and fault localization, localization has been suggested to be the harder of the two; for example in the case where active probing is used by Rish et al. (2005). This point of view is more easily understood by considering a system under scrutiny with only two states, where a fault is either present or not. Then only one symptom has to be found to detect a faulty state, rather than the full or qualifying set of symptoms that is needed in order to locate a specific fault.

Fault detection is built around the principle of looking for information that deviates in a specific way from what would be expected. The more precise the information that exists about a fault and how it manifests in the system, the more precise fault detection and localization can be achieved.
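The asymmetry between detection and localization described above can be made concrete with a small sketch: detection only needs a single deviating symptom, while localization must match the full symptom pattern against known fault signatures. All fault names and signature rows below are invented for illustration; the thesis derives its real patterns from a dependency matrix built with experienced troubleshooters.

```python
# Each fault is described by the exact symptom pattern it produces
# (a "signature"); an observed symptom is True when the deviation is seen.
SIGNATURES = {
    "link_down":       {"ping_fails": True,  "crc_errors": False},
    "duplex_mismatch": {"ping_fails": False, "crc_errors": True},
}

def detect(observed):
    """Detection: any single deviating symptom is enough."""
    return any(observed.values())

def localize(observed):
    """Localization: the full symptom pattern must match a signature."""
    return [fault for fault, sig in SIGNATURES.items() if sig == observed]

observed = {"ping_fails": False, "crc_errors": True}
print(detect(observed))    # True: some fault is present
print(localize(observed))  # ['duplex_mismatch']
```

Note how a single `True` symptom suffices for `detect`, while `localize` needs the whole pattern to single out one fault.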

Through the years, many approaches have been constructed to provide automated detection and localization of faults; the most relevant ones follow below.

A rule-based approach assumes that precise information regarding what to look for is present. It relies on rules to detect faults, but can also be adapted to localize faults. The information in a rule-based system is generally not updated; attempts to remedy that characteristic have been made in the construction of fuzzy rule-based approaches (Thottan & Ji 2003). The drawbacks of a rule-based approach are that it is generally too slow for real-time fault detection, and that the amount of knowledge required to make it effective can be a problem.

Finite state machines depend on sequences of events, and can in addition to detecting faults also aid in the process of localizing them. The information generally has to be derived from extensive sets of training data for the state machine to be able to detect patterns leading to a fault. One of the main drawbacks of a finite state machine solution for fault detection lies in its assumption that all alarms are true. Another drawback is the restriction on the length of the state chains; this approach does not scale well, especially not in a dynamic environment (Thottan & Ji 2003).


Pattern matching is an approach that describes the deviations from a normal behavior (Feather et al. 1993). The normal behavior may be learned and continuously updated, so that minimal external information is needed. A drawback is that it may take time to build up a perception of what is normal. By adding the capability to create signatures of anomalous events, this approach can be extended to localize faults; this is realized by Feather et al. (1993) by using a fault feature vector.

Statistical analysis is an approach presented by Thottan & Ji (2003) as an alternative not requiring continuous retraining. It works by correlating abrupt changes in a number of parameters and thereby detecting anomalies.

As we have seen above, the most common way to localize faults is through different kinds of correlation, event correlation in the case of Rish et al. (2005), and the majority of the previously mentioned approaches use it in some way or another.

2.2 Data collection and probing

The basis for all kinds of detection and localization is data, which is acquired in mainly two ways, passively and actively (Tang et al. 2005). Passive data collection is mainly concerned with collecting data that other sources make available in different ways. It can be configurations, logs, alarms and any kind of data that is available to fetch somewhere. In a network setting, Muller et al. (2011) show that a basic Linux system allows for a fairly comprehensive passive data collection.

Active data collection concerns data that does not explicitly exist but is retrievable by using tools such as ping (Muuss 1983) and traceroute (Jacobson 1987). In order to allow the retrieval of active data, specific questions have to be posed, for example: “Do I get a response from a destination IP?”. This is called probing, and regardless of outcome, the probe result aids in limiting the number of possible causes.
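In the convention used by Rish et al. (2005) and throughout this thesis, a probe returns 0 on success and 1 on failure. A minimal sketch of that idea, wrapping any boolean check as a 0/1 probe; the TCP-reachability check is one possible realization, and the host and port in the comment are placeholder values, not destinations used in the thesis:

```python
import socket

def as_probe(check):
    """Turn a boolean check into a probe returning 0 (ok) or 1 (fail)."""
    def probe(*args):
        try:
            return 0 if check(*args) else 1
        except OSError:
            return 1  # network errors count as a failed probe
    return probe

def tcp_reachable(host, port, timeout=1.0):
    """One possible active check: can we open a TCP connection?"""
    with socket.create_connection((host, port), timeout=timeout):
        return True

tcp_probe = as_probe(tcp_reachable)
# e.g. tcp_probe("192.0.2.1", 22) -> 0 if reachable, 1 otherwise
```

Any other check (an ICMP ping wrapper, a DNS lookup) fits the same 0/1 interface, which is what lets the later correlation steps stay agnostic of how each probe is implemented.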

Research in the area proposes mainly two ways of doing probing: pre-planned and active probing (Rish et al. 2005), also referred to as offline and online probing (Yu et al. 2010) respectively. Active probing (Rish et al. 2005) is a continuation of the research performed on pre-planned probing, where not only static sets of probes are used, but where the results of earlier probes are also used to choose the next set of probes, to get extra value from the context in which the probes are used.

One of the primary considerations in a solution that includes any kind of active data collection is resource usage. This is one of the main reasons that research is moving towards active probing and away from pre-planned probing: to decrease the amount of network traffic caused by the probes themselves, without compromising the value of the data provided by the probes. This becomes even more important in a setting where active data collection is used to monitor a network, and not just in cases of fault localization, as extensive probing may be directly counterproductive to the original intent of improving the stability and performance of the network (Mohamed & Basir 2010).

Due to the applicability and need within the field, methods have been developed to combine passive and active data collection into more optimized hybrid systems, in an effort to mitigate the drawbacks of the individual methods by themselves (Tang et al. 2005). Active Integrated fault Reasoning (AIR) (Tang et al. 2005) makes use of the passive data for reasoning and, if that is not considered enough to explain the cause of the fault, probes are selected and executed to provide only the data missing for the cause to be explained.

Another approach utilizes a combination of pre-planned and active probing to reach a similar goal (Yu et al. 2010): active probing is used to narrow down the scope of possible causes of a fault and then, upon reaching a predefined threshold, the method switches to pre-planned probing. In such a case computational work as well as time can be saved in the process of finding the cause of a fault. Depending on the implementation of the event correlation from probe results to localized fault, the motivation for this kind of hybrid approach may vary; as is the case with the use of a decision tree, where the computational overhead is minimal.
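The hybrid strategy above can be sketched as follows: pick probes interactively while many candidate causes remain, then fall back to a fixed pre-planned batch once the candidate set is small. This is only an illustrative reading of Yu et al. (2010), not their algorithm; the probe names, the rule-out table and the threshold are all invented for the example.

```python
# Which candidate causes each probe exonerates when it SUCCEEDS (0).
RULES_OUT = {
    "ping_gateway": {"gateway_down", "link_down"},
    "ping_dns":     {"dns_down"},
    "ping_remote":  {"remote_down"},
}
PREPLANNED = ["ping_dns", "ping_remote"]  # fixed final batch

def diagnose(run_probe, candidates, threshold=2):
    candidates = set(candidates)
    available = dict(RULES_OUT)
    # Active phase: greedily pick the probe that can rule out the most
    # remaining candidates, until few enough candidates are left.
    while len(candidates) > threshold and available:
        probe = max(available, key=lambda p: len(available[p] & candidates))
        ruled_out = available.pop(probe)
        if run_probe(probe) == 0:      # success exonerates these causes
            candidates -= ruled_out
    # Pre-planned phase: run the remaining fixed batch.
    for probe in PREPLANNED:
        if probe in available and run_probe(probe) == 0:
            candidates -= available[probe]
    return candidates
```

With simulated results where only the DNS probe fails, the candidate set collapses to the single remaining explanation, `{"dns_down"}`.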


2.2.1 Similar tools

There are a multitude of tools aiming to aid in the process of localizing a fault. One type of tool approaches the problem by learning the normal operation of the node; to that type belong Cohen et al. (2005) and Huy et al. (2008). Cohen et al. (2005) use an algorithm to create signatures that highlight the special characteristics of every captured system state, and also argue for the alleged ineffectiveness of raw values.

Huy et al. (2008), with the tool FlowMon, on the contrary show that as long as the variables, in this case counters, are chosen with care, they provide good points of reference for analysis of different states of the system. Both of these rely on the aspect of time, which puts great emphasis on the continuity of the system where they are deployed.

The information that is collectable on an IP routing node has been examined by Muller et al. (2011). A tool is also presented, along with the locations where most network-related information is stored on a generic Linux machine.

The idea to utilize the capabilities of probes to detect, and possibly localize, faults in a network is explained by Rish et al. (2005), one of the research groups that consistently work to improve active probing as a monitoring and fault localization technique. From the perspective of using probes in an intentional and qualitative manner that minimizes the negative performance effects they cause on the network, this is valuable and relevant research. The main thing to note about the approach taken by Rish et al. (2005) is that it involves an evaluation step where the value of the information probes may generate is of integral importance to the efficiency of the approach. This demand for previous knowledge is something that the approaches by Cohen et al. (2005) and Huy et al. (2008) do not require.

An approach for detection of faults on an application level is presented by Kandula et al. (2009), who also implement the suggested solution in the software NetMedic. The approach does not rely extensively on knowledge; instead it focuses on the exploration of, and the dependencies between, components, not totally unlike the approaches by Cohen et al. (2005) and Huy et al. (2008).

Yet another approach, quite different from the rest, is NetPrint by Aggarwal et al. (2009). It basically does what Muller et al. (2011) propose and sends that information to a central server that indexes it based on a binary parameter: whether it is a working or a non-working configuration. This way it learns centrally and is able to match similar network compositions in order to suggest configuration changes. The focus of this solution is on home routers and home networks.

2.2.2 Active Probing

Active Probing has been mentioned multiple times in previous sections. Due to the fit of the approach to the thesis, its details are covered in greater depth than those of the alternative approaches.

The solutions of Active Probing are researched by Rish et al. (2004), (2005) and in several more publications by the same research team. Active probing, as described, uses Bayesian inference in two ways: first, deciding upon what probes to send next, given the results of the already sent ones; and secondly, finding out what faults are most likely, given the results of the already sent probes.

In their article from 2005, Rish et al. also explore the capabilities of their approach for detecting multiple simultaneous faults, by using a dynamic Bayesian network that adds the dimension of time into the inference.

Active probing is primarily concerned with detecting and localizing faults in networks, but is also shown to be capable of providing reasoning regarding services affected by connectivity faults.

Refer to the article (Rish et al. 2005) for further information regarding the algorithm and other details about the active probing approach. Brodie et al. (2008) also hold a patent on the Active Probing approach for fault detection and localization in networks.
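The first of the two uses of inference named above, choosing the next probe, can be illustrated with a deliberately simplified sketch: pick the probe whose outcome is expected to reduce the uncertainty (entropy) over the fault hypotheses the most. This is not the algorithm of Rish et al. (2005); it assumes exactly one fault, noise-free probes, and a dependency matrix whose entries are invented for the example.

```python
import math

# probe -> set of faults that make the probe FAIL (toy dependency matrix)
DEPENDS = {
    "p1": {"f1", "f2"},
    "p2": {"f1"},
    "p3": {"f1", "f2", "f3"},
}

def entropy(weights):
    """Shannon entropy of a (renormalized) weight vector, in bits."""
    total = sum(weights)
    return -sum(w / total * math.log2(w / total) for w in weights if w)

def expected_entropy(probe, belief):
    """Expected posterior entropy after observing the probe's outcome."""
    fail = [p for f, p in belief.items() if f in DEPENDS[probe]]
    ok = [p for f, p in belief.items() if f not in DEPENDS[probe]]
    h = 0.0
    for outcome in (fail, ok):
        mass = sum(outcome)
        if mass:
            h += mass * entropy(outcome)
    return h

def best_probe(belief, probes):
    return min(probes, key=lambda p: expected_entropy(p, belief))

belief = {"f1": 0.25, "f2": 0.25, "f3": 0.25, "f4": 0.25}  # uniform prior
print(best_probe(belief, list(DEPENDS)))  # 'p1': the most even split
```

Under a uniform prior, `p1` splits the hypotheses 2/2 and is therefore preferred over `p2` and `p3`, whose outcomes are lopsided and thus less informative.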


2.3 Bayesian decision networks

Bayesian decision networks are built around the same theories as any Bayesian network2. Given that the initial relational probabilities between the nodes are known beforehand, we are able to recalculate the posterior probabilities as new knowledge regarding the values of nodes is acquired. What we get is a network where we can simply read out the conditional probabilities at each node, adjusted for the knowledge we have acquired up until that point. The result we are looking for, in most cases, is the maximum conditional probability for a node given the knowledge we have provided the model with (Rish et al. 2005). By iteratively choosing the most probable node until an instance-specific threshold has been reached, each node can be evaluated separately and momentarily. Due to this kind of situation analysis, out of a momentary perspective, the locally best path is taken at each step.

The advantage of this approach is that all knowledge gathered is taken into account at every step of the process.

The drawback of the inference process is that it is computationally demanding and does not scale very well as the number of nodes in the network grows; exact Bayesian inference has been shown to be NP-hard (Cooper 1990). Also, the construction of the initial network requires extensive knowledge and tuning to operate efficiently. An easier way would be to make sure historical data exists to derive prior probability distributions from, thereby decreasing the number of assumptions that have to be made.
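The posterior computation described above can be sketched by brute-force enumeration over fault combinations, which also makes the NP-hardness tangible: the loop is exponential in the number of faults. The priors, the probe-to-fault dependencies and the observation below are all invented for illustration; a realistic model would also need noise terms rather than deterministic probe outcomes.

```python
from itertools import product

PRIOR = {"link_down": 0.1, "dns_down": 0.05}          # P(fault present)
CAUSES_FAIL = {"ping_gw": {"link_down"},               # probe fails iff any
               "ping_dns": {"link_down", "dns_down"}}  # listed cause present
observed = {"ping_gw": 0, "ping_dns": 1}               # 0 = ok, 1 = fail

faults = list(PRIOR)
posterior = {f: 0.0 for f in faults}
evidence = 0.0
for states in product([0, 1], repeat=len(faults)):     # exponential blow-up
    present = {f for f, s in zip(faults, states) if s}
    # Prior probability of this combination of fault states.
    p = 1.0
    for f, s in zip(faults, states):
        p *= PRIOR[f] if s else 1 - PRIOR[f]
    # Likelihood: deterministic, a probe fails iff a causing fault is present.
    for probe, result in observed.items():
        expected = 1 if CAUSES_FAIL[probe] & present else 0
        if expected != result:
            p = 0.0
            break
    evidence += p
    for f in present:
        posterior[f] += p

posterior = {f: p / evidence for f, p in posterior.items()}
print(posterior)  # dns_down is the only consistent explanation here
```

The gateway probe succeeding rules out `link_down`, and the failing DNS probe then leaves `dns_down` as the only hypothesis with non-zero posterior, which is exactly the "most likely fault given the probe results" reading of the inference.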

2.4 Binary decision trees

A binary decision tree is a binary tree that contains a question in each node. The answer to the question decides which path to continue down the tree. The process continues until a leaf has been reached. The leaf nodes of the tree do not contain questions; a leaf describes a result, one that is true only when all the questions on the path from the root node to that particular leaf have been answered in a particular way. A binary decision tree can also be seen as a flowchart, describing the flow to a result based on questions (Beygelzimer et al. 2005).

2 http://en.wikipedia.org/wiki/Bayesian_network

The advantage of binary decision trees is that they are easy to follow and implement. They do not require high computational performance for diagnostics, and they scale in the same way a binary search tree would with regard to time complexity when performing search operations3.

The downside of the decision tree approach in general is the difficulty of maintaining and updating it. Also, decision trees offer no possibility to ask the questions in a different order depending on which results are more or less likely. This means that the execution time is fairly constant for a decision tree (Beygelzimer et al. 2005). In some regards, this can be seen as pre-planned probing, despite the interactive operation.
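The flowchart behaviour just described can be shown with a tiny sketch: each inner node holds a probe (a question), each leaf a diagnosis, and the 0/1 probe results steer the walk. The probe names and diagnoses are invented; this is not the tree the thesis builds from its dependency matrix.

```python
# (probe, subtree-if-ok, subtree-if-fail); strings are leaves (diagnoses).
TREE = ("ping_gateway",
        ("ping_remote",                         # gateway ok: check remote
         "all connectivity ok",                 #   remote ok
         "remote destination unreachable"),     #   remote fails
        "local link or gateway fault")          # gateway fails

def diagnose(tree, run_probe):
    """Walk the tree, asking one probe per inner node; 0 = success."""
    while isinstance(tree, tuple):
        probe, on_ok, on_fail = tree
        tree = on_ok if run_probe(probe) == 0 else on_fail
    return tree  # a leaf: the diagnosis

# Simulated probe results for demonstration:
results = {"ping_gateway": 0, "ping_remote": 1}
print(diagnose(TREE, results.get))  # remote destination unreachable
```

Note that the question order is fixed by the tree shape, which is exactly the pre-planned flavour, and the constant execution time, mentioned above.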

2.5 IP Networking

To better understand the faults that are in focus for this thesis, a short explanation of concepts relating to IP networks in general follows below.

2.5.1 The IP stack

The internet, as we commonly call it in our everyday life, is essentially a collection of standardized conventions, called protocols, on how computers are allowed to communicate with each other. The structure of these protocols is defined in the Internet protocol suite. The structure is normally described in layers, where each protocol is generally associated with one particular layer. There are multiple models describing these layers; two of the primary ones are the Internet model (Internet Engineering Task Force 1989), consisting of 4 layers, and the Open Systems Interconnection (OSI) model (International Organization for Standardization 2000), which uses 7 layers.

3 https://en.wikipedia.org/wiki/Binary_search_tree


Table 1 The layers described by OSI and Internet model

The layers of concern to us are; the link, internet, transport, and application layer from the Internet Model. The Internet layer uses the Internet Protocol (IP) to define names of entities and to facilitate the routing of packets, as IP is a packed oriented protocol. Above the internet layer is the transportation layer that traditionally uses either TCP which is a connection oriented protocol or UDP which is a connectionless protocol. Based on the requirement on the transmission TCP, UDP or other protocols may be the most suitable option. On top of the transportation is the application layer where more usage adapted protocols are defined like HTTP, FTP and DHCP.

2.5.2 IP in telecommunications

Today an increasing amount of communication is transported over networks that uses the internet protocol for routing. This has led to users and subsequently operators finding the need to be able to handle IP traffic in the mobile networks. This has resulted in operators using IP on top of the hardware to get the most out of current and future infrastructure. A routing unit is placed on site to handle the IP communication and this unit is communicating with the internet using the internet protocol. The placement of this routing unit is at the site of the antennas that receive mobile calls over GPRS/3G/LTE which means that they are generally at locations with very limited physical access to the hardware as a result of wireless coverage being the primary factor when choosing the location. The responsibility of this unit is to terminate traffic from all hardware on the site and, using the internet protocol, transmit everything as IP traffic onwards in the operator network4.

2.5.3 Auto Negotiation

Auto negotiation is a way for two ends of an Ethernet link to agree upon the same duplex mode and data rate for communication. It resides in the physical layer of the OSI model (International Organization for Standardization 2000) and is required for 1000BASE-T gigabit Ethernet and compatible, but not mandatory, for 10BASE-T, normal Ethernet. Full duplex means that the signals are modeled so that there is one data stream in each direction, resulting in greater throughput for the link. Half duplex means that the data can only go one way at a time, require the two sides to take turns in transmitting data.

A fault related to the auto negotiation standard is duplex mismatch. A duplex mismatch may occur when only one, or neither, of the two ends of a link supports auto negotiation. The fallback setting in the auto negotiation standard is to use half duplex if the negotiation turns up empty.

Problems occur when one end of the link assumes full duplex and the other end does not, resulting in a situation where one end sends traffic without waiting for the other. This creates collisions on the link when the traffic intensity increases; yet it is barely noticeable when there are low amounts of traffic on the link, due to the correspondingly low amount of collisions. This makes duplex mismatch a troublesome fault to locate, as ping probes generally will show no indication of an issue (Shalunov & Carlson 2005).

4 Internal documentation

Layer   OSI model from ISO   Internet model from RFC1122
7       Application          Application
6       Presentation
5       Session
4       Transport            Transport
3       Network              Internet
2       Data link            Link

There are two primary ways to identify a duplex mismatch. One is to check the amount of reported cyclic redundancy check (CRC) errors. A CRC error indicates a packet that has not arrived intact, and may indicate the occurrence of duplex mismatch, since the amount of collisions increases dramatically together with the bandwidth utilization (Shalunov & Carlson 2005). The other way is to look at the amount of runt frames, which are packets smaller than the minimum packet size allowed by the Ethernet standard (IEEE 802.3 Ethernet working group 2015). The reason this is an indicator follows the same deduction as for the CRC errors: more collisions result in more damaged frames.
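As a sketch, the CRC-error and runt-frame indicators above can be combined into a simple heuristic. The counter names and the 1% threshold are illustrative assumptions, not values from the thesis:

```python
def duplex_mismatch_suspect(stats, threshold=0.01):
    # Flag a possible duplex mismatch when CRC errors plus runt frames
    # make up a suspicious share of the total packet count. The 1%
    # threshold is an assumption for illustration only.
    total = stats["packets"]
    if total == 0:
        return False  # no traffic, no evidence either way
    bad = stats["crc_errors"] + stats["runt_frames"]
    return bad / float(total) >= threshold

# Heavily loaded link with many damaged frames -> suspect.
print(duplex_mismatch_suspect(
    {"packets": 10000, "crc_errors": 180, "runt_frames": 40}))  # True
# Healthy link -> not suspect.
print(duplex_mismatch_suspect(
    {"packets": 10000, "crc_errors": 3, "runt_frames": 0}))     # False
```

Note that, matching the observations above, the heuristic only becomes reliable once there is enough traffic on the link for collisions to occur.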

2.6 Node unit

The node unit is a router/switch that offers IP transport network connectivity to the radio base station. It primarily uses the IP protocol and may perform advanced tasks such as shaping traffic and supporting advanced Quality of Service (QoS) features. The node operates on layers 1, 2, and 3 of the OSI stack. The node configuration is performed via custom shells called COMCli and EMCli that can be accessed via the SSH protocol (Network Working Group et al. 2006). An object model is used to construct relations between different configured entities containing parameters and references to other objects.

In the process of manual troubleshooting on the node, several tools are used, among these ping, traceroute, arping, and tcpdump.

There are two features in the software that may be of use for performing continuous measurements of connectivity, namely “Ping measurement”5 and TWAMP (Network Working Group et al. 2008). Both are limited in the functionality they provide and can only monitor one connection at a time.

2.7 Test environment

The test environment at Ericsson consists of a collection of tools that, through an elaborate process, secure the quality and stability of the produced code. The tests are written in Java and use the internally developed framework Java Common Auto Tester (JCAT)5 in conjunction with Maven (The Apache Software Foundation 2015) and TestNG (Beust 2014).

There are hundreds of tests that run continuously in a great number of suites to verify new code. The full chain is contained within this system, from compilation and installation of the software under test to performance measurement with generated traffic. A short summary of the different components follows.

Maven is a tool to handle the building, reporting, and documentation of software in a uniform and centralized way. Employing a project object model (POM), it is well suited to facilitate the tasks required in a large software development organization.

TestNG is an open source project that facilitates unit, integration, and functional testing of Java software. It is inspired by JUnit (Saff et al. 2014) but is substantially more data driven. It utilizes a multitude of annotations to provide flexibility. This flexibility is partly achieved through XML files that define the interactions and executions of different classes and groups of tests, which in turn are defined in annotated test cases written in Java.

Java Common Auto Tester (JCAT) is an internally developed framework that facilitates the automated testing of software and products. It aids in the execution of all kinds of tests, from unit to network testing, in a practical and economic fashion.


2.8 Customer environment

The customer environment is in this case the telecommunication network. Its primary task is to provide stable digital coverage over large areas. The task of the node unit is to route and forward IP traffic. A customer network can contain thousands of node units and is managed by a centralized system called Operations Support System (OSS)6.

No fault management is performed on the node unit itself, except transmitting predefined alarms when certain hardcoded conditions are met.

The different tasks facilitated by OSS are configuration, performance, accounting, security, and fault management. From the perspective of fault detection and localization, the Fault Management Expert (FMX) software is of most interest. FMX is a rule based system that hooks into OSS to improve network surveillance and aid the operators. By translating human knowledge into rules, FMX is capable of correlating large amounts of data, gathered for the purpose of localizing faults, and thereby helps to form a better picture of the managed network7.

FMX is built around graphically configured rules and makes use of the information gathered within the network, from both OSS and other connected units. Further, FMX performs root cause analysis, service impact analysis, and geographical and functional visualizations. It also attempts to remedy faults on the node, given node connectivity.

6 http://en.wikipedia.org/wiki/Operations_support_system


Chapter 3

Method

3.1 Pre study

Since the majority of the practical knowledge is maintained by Ericsson personnel in their daily work with related topics, gathering information through informal interviews was considered important. A group of Ericsson employees was chosen for their firsthand experience with the systems central to the thesis. Their associations with the system ranged from system developers to customer-focused troubleshooters. As the knowledge of this particular scenario in this environment was quite scattered, no single person possessed a coherent picture; my task was to acquire enough knowledge to obtain that picture myself.

As the previous work in the area was extensive, many approaches were discarded in accordance with the known constraints of the targeted system.

The summaries by Steinder & Sethi (2004) and Thottan & Ji (2003) provided a foundation for grasping the previous work performed in this area.

A collection of implementation approaches was considered; the two most feasible were the active probing approach by Rish et al. (2005) and the decision tree approach found in Beygelzimer et al. (2005). Both approaches are fully capable of using probes for the goal of detecting and localizing faults.

3.2 Prototype implementation

The prototype implementation set out to confirm, or dismiss, the hypothesis that it is possible to create a tool that can perform automated detection and localization on the node itself. Since no previous tool existed, and no successful attempts at automated troubleshooting on the node had been found, this first prototype was to unveil the outlook for continued work towards such a tool.

Due to the constraints discovered in, and enforced by, the environment, all progress was valuable and needed in order to build a tool that would collect and process previously underutilized information; a tool that would also be able to use other existing tools available on the node to gather more information based on need.

To guide the work on the prototype implementation, a scenario was constructed. This scenario, Scenario 1, can be found in its entirety in appendix A.1. In addition, the scenario gave the requirements, setting, and evaluation framework prior to the prototype implementation, adding rigidity to the work and removing unnecessary ambiguity.


The actions defined in the scenario should not be regarded as all the steps towards answering the greater thesis questions. The intent was merely to give enough insight into obstacles and possible solutions for the main implementation to proceed in a smoother, more predictable fashion.

One of the obstacles was how to acquire the information required to enable the localization of the faults. This included information from two primary sources. The first kind included the results from ping (Muuss 1983) probes, as well as data from the ARP cache8 and the interface status. This information was gathered within the system where the software was running. The second desired source was information from the configuration managing the setup of the node. This class of data was to be gathered by interfacing with the COMCli, used to both view and modify the configuration of the Ericsson software on the node.

In order to provide the required setting for Scenario 1, the simulation environment was used. This was the same environment used for the evaluation of the main implementation. The simulation environment thereby formed the development, test, and validation environment for both the prototype and the final software. Hence, grasping the context set by the simulation environment was paramount to understanding the environment within which the work for this thesis took place. Therefore, it was carefully investigated.

The origin of the scenario was the requirement to use active troubleshooting measures in the process of detecting and localizing connectivity-related faults in the simulation environment. The approach taken in Scenario 1 was that the piece of code should prove itself capable of detecting the fault created in accordance with the scenario. The tool shall, in this instance, work according to the same procedures followed by an experienced troubleshooter who is familiar with the node unit and the type of environment.

The scenario concerns troubleshooting the connectivity to a destination for which the destination IP address was known. The environment was configured, and traffic was sent in the network, to simulate an active network situation. The first action performed by the software was to check the connectivity of the known destinations. This information was available by parsing the ARP cache8 and by interfacing with the transportation software running on the unit via the COMCli9, which compiled a list of all known destination IP addresses. The list was then used by the software to survey the connectivity to these destinations. After the initial execution of the software, one connection was severed so that a probe, in this case a ping probe, would fail. The software was then run once more; the list would still be the same, and the software would at this point report a connection-related fault, since the ping probe failed.
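The step of compiling the list of destinations to survey can be sketched as follows. The dump format mimics a Linux /proc/net/arp table, and the sample rows are made up for illustration:

```python
# Sketch: derive the list of (ip, interface) pairs to survey from an
# ARP-cache dump (illustrative sample, not real node output).
ARP_DUMP = """\
IP address       HW type     Flags       HW address            Mask     Device
10.0.0.1         0x1         0x2         52:54:00:12:34:56     *        eth0
10.0.0.2         0x1         0x2         52:54:00:ab:cd:ef     *        eth1
"""

def known_destinations(dump):
    pairs = []
    for line in dump.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if fields:
            pairs.append((fields[0], fields[-1]))  # (ip, interface)
    return pairs

print(known_destinations(ARP_DUMP))
# [('10.0.0.1', 'eth0'), ('10.0.0.2', 'eth1')]
```

Each resulting pair is then handed to the probes as one unit, matching the bundling of destination and interface described later.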

Once the ping probe had been found to fail, the continuation was to survey the path towards the destination to conclude whether the error was located at the destination or on the path leading to it. For the prototype implementation this was done with traceroute (Jacobson 1987).

The expected result of the prototype implementation was a report on what changes had been observed in relation to the connectivity of the system.

8 http://man.cx/arp 9 Internal documentation


At this stage, the objective was to make sure that all components required for a tool running on the unit were available. By constructing software that could fulfil the requirements of the scenario, the foundation for future development could be guaranteed.

Due to the foundational nature of the prototype implementation, the decision was made not to put any greater emphasis on the actual implementation algorithm at this point. Creating a well-thought-out implementation was left for the main implementation, at a time when the existence of all required resources had been verified. The purpose of the prototype implementation was therefore to verify that all required resources worked in the node environment relevant for this thesis; for example, to prove programmatic access to node configuration information, and to demonstrate the use of ping (Muuss 1983) to survey the connectivity to relevant destinations.

The chosen language was Python, version 2.7 (Python Software Foundation 2014). The reason behind this was that Python is a very portable language, which was also true for a substantial amount of the available libraries. As a result, no additional software, except the Python 2.7 standard library, was required for the software to run. This was an advantage for use within the desired systems.

A modular solution architecture was chosen in order to facilitate easier reuse of code in the future main implementation. An example was the construction of probes as independent methods, which aids in the reuse of code in future implementations.

As we were focusing on connectivity, the destinations and the interfaces were of integral importance in the actions performed by the software. The connectivity of a destination could be considered an isolated and independent unit in the troubleshooting being performed. The interfaces, on the other hand, may be used by multiple destination IP addresses. The approach used in this respect was to bundle the interface along with the destination IP address and process them as one unit. The implementation has the option to optimize in this respect, if desired, but this was not done within the scope of the prototype implementation.

The probe functionality was encapsulated in methods of their own. Due to the nature of active probing, the order in which the probes were executed varied from one run of the software to the next. The best way to make sure that the execution of the individual probes was not affected by the order in which they were executed was to make sure they all conformed to a universal interface. Since we earlier concluded that an IP destination address and an interface could be considered as one unit with regard to probing, they appeared to be the natural arguments that each probe should be required to accept. As a result, every probe takes an IP address and an interface name as arguments.

For the return values, a similar convention had to be defined. The chosen approach was that a probe should return either 0, if successful, or 1, if unsuccessful. These return values were inspired by the way most command line tools report exit status, and by how probe results are represented in the article by Rish et al. (2005). For example, ping returns 0 if everything went fine, and 1 if the destination did not respond to the ping. The same principle was chosen for the probes.


The probes were designed in such a way that a binary result suffices for evaluation of the system state. This also makes the process faster, as the limited amount of possible outcomes limits the possible “state space”.

The probes were basic functions that gave the software access to the tools commonly used during manual troubleshooting of the faults looked for. A summary of the probes and their most important properties follows below.

Ping-probe: surveys the connectivity to the destination IP address. It uses the normal terminal and the exit code to convey the result returned.

ifStatus-probe: returns the status of the given interface, using the information from the underlying system.

ifConfigured-probe: returns whether the interface was configured or not, using the information from the underlying system.

hasNTP-probe: confirms whether the destination in question was used for synchronization purposes or not. This was valuable knowledge in order to locate which services were affected by a potential connectivity fault affecting the destination, or interface, from which it originated.

hasAutoNeg-probe: returns whether the interface has auto negotiation configured or not.

collisionOk-probe: looks at the amount of collisions and CRC errors on the given interface, in relation to the total amount of packets. It returns, as the name suggests, whether the interface may be communicating with a link peer that is not employing proper auto negotiation standards.

In addition to the probes, some information gathering functions were also used. The ArpCache10 was one of the commands used to gather information on which destinations, and which interfaces, to survey for connectivity-related faults. Ifconfig was used to gather information about the status and connectivity of interfaces. As all destinations were associated with an interface, this information was important in order to localize a fault concerning the interface.
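A minimal sketch of the uniform probe interface described above, with every probe taking (ip, interface) and returning 0 or 1. The fake system state and probe bodies are illustrative only, not the thesis code; the real ping probe, for instance, shells out to the ping tool:

```python
# Every probe takes (ip, interface) and returns 0 (success) or 1
# (failure), mirroring command-line exit codes.

# Fake node state standing in for the real ARP cache / ifconfig readings.
SYSTEM = {
    "interfaces": {"eth0": {"up": True, "collisions": 2, "packets": 10000}},
    "reachable": {"10.0.0.1"},
    "ntp_servers": {"10.0.0.1"},
}

def ping_probe(ip, interface):
    # Active probe: in the prototype this executes ping; here we
    # consult the fake state instead.
    return 0 if ip in SYSTEM["reachable"] else 1

def if_status_probe(ip, interface):
    # Passive probe: is the interface up?
    iface = SYSTEM["interfaces"].get(interface)
    return 0 if iface and iface["up"] else 1

def has_ntp_probe(ip, interface):
    # Passive probe: is this destination used for time synchronization?
    return 0 if ip in SYSTEM["ntp_servers"] else 1

def collision_ok_probe(ip, interface):
    # Passive probe: flag a suspicious collision ratio (possible duplex
    # mismatch); the 1% threshold is an assumption for illustration.
    iface = SYSTEM["interfaces"][interface]
    ratio = iface["collisions"] / float(iface["packets"])
    return 0 if ratio < 0.01 else 1

PROBES = [ping_probe, if_status_probe, has_ntp_probe, collision_ok_probe]
outcomes = [p("10.0.0.1", "eth0") for p in PROBES]
print(outcomes)  # all succeed against the fake state -> [0, 0, 0, 0]
```

Because all probes share the same signature and binary result, they can be executed in any order and composed freely by the probe-selection logic.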

The probes could be divided into two groups: passive diagnosing probes and active probes. This partitioning idea originates in the work on the Active Integrated fault Reasoning (AIR) technique by Tang et al. (2005). The passive diagnosis probes use the knowledge already existing on the unit; for example, information regarding which destinations were configured for certain services, like sync, and which destinations had been communicated with recently. The probe “hasNTP” performed passive diagnosis by utilizing information already existing on the unit. Active probes were actions employed in order to acquire information concerning the connectivity. Ping was the most obvious of these examples: in order to know the result of the ping, the active probe had to be executed, and the result was new information not previously present on the node unit.

The other required functions were constructed as separate modules, or engines. Each engine interacted with other modules in a predefined way to perform the desired task.

The demo runner was the main engine, utilizing the others in order to perform the designated task of surveying the connectivity of the node, arriving at a result that would show if any of the supposedly working destinations did not work according to belief. Over time this result could also illustrate the characteristics of the connectivity situation on the node and highlight points of particular interest.
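The demo-runner idea can be sketched as a loop over the known (destination, interface) pairs, reporting any pair whose ping probe now fails. The names and the stand-in probe are illustrative assumptions:

```python
# Sketch of the demo-runner idea: survey every known (ip, interface)
# pair and report those whose ping probe fails.
def survey(destinations, ping_probe):
    report = {}
    for ip, interface in destinations:
        ok = ping_probe(ip, interface) == 0
        report[(ip, interface)] = "OK" if ok else "FAULT"
    return report

# Stand-in probe: only 10.0.0.1 answers, simulating a severed connection.
reachable = {"10.0.0.1"}
probe = lambda ip, iface: 0 if ip in reachable else 1

print(survey([("10.0.0.1", "eth0"), ("10.0.0.2", "eth1")], probe))
# {('10.0.0.1', 'eth0'): 'OK', ('10.0.0.2', 'eth1'): 'FAULT'}
```

Repeated runs of such a survey are what allow changes in the connectivity situation to be observed over time.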

To conclude the prototype implementation, and allow the findings to guide the main implementation, an evaluation was carried out at the end of the prototype implementation. These findings are presented in the prototype results, section 4.2.


3.3 Main implementation

The main implementation outlined the process of creating software that took advantage of the knowledge from the prototype implementation and merged it with the fault-specific knowledge acquired up to this point, into an implementation capable of using probes to return a likely verdict of the current state of each connection on the node.

Through the prototype implementation it had been proven that an implementation fulfilling the desired software operation was possible in the environment on the node. This, together with the literature study, made it convincing that an adaptive probing approach (Natu & Sethi 2007) would be a relevant path to follow.

One such approach was active probing, as described by Rish et al. (2005); another was decision trees (Beygelzimer et al. 2005). Both take advantage of the possibilities given by active and mindful utilization of probes, with the aim of both detecting and localizing faults, which was one of the cornerstones of this thesis.

One of the primary differences between the active probing implementation described by Rish et al. (2005) and the one required for the thesis was the article's focus on detecting and localizing faulty nodes and localizing the services associated with these nodes. The output of that approach was the localization of a node within a fully mapped network. This means that the knowledge base guiding the probe selection and the localization of the most likely fault was aware of the environment.

Within the thesis scope, the desired use was to detect faults in association with a single destination IP address and its associated interface. In the thesis-specific scenario, previous knowledge of the characteristics of the environmental surroundings was low. So instead of troubleshooting the correct operation of nodes within a fully known network, the desired use was to detect and locate possible culprits on the connections that were of relevance for the node in question.

An example of where the approach in active probing differed from the thesis context was the property of probe length used in the article by Rish et al. (2005). The length of a probe was used to evaluate the information gain of a particular probe; its value was the number of nodes covered by the probe. This metric was not easily applied to our situation, where we were surveying the likeliness of different kinds of faults instead of the amount of nodes covered by one probe. The length of probes was a way to improve the choices made by the algorithm, but in the thesis context there was no clear use for it. Another example of work intended to improve the efficiency of the active probing algorithm was the active selection of the probing station: the node within the network that would execute the probe and then report back with an outcome. In this thesis only one node was considered, making the choice of probing station obsolete for the thesis situation.

The intent of the main implementation was to create a software tool that followed the ideas established in the theories about active probing, not necessarily their direct implementation. A direct use of the software described in the article by Rish et al. (2005) was not the goal, nor was it feasible, due to the differences in constraints posed by the environments, the lack of concrete code available, and the previously highlighted differences.

One of the issues regarding the active probing algorithm presented in the article by Rish et al. (2005) was the lack of code for the underlying logic. The technique was patented by the authors (Brodie et al. 2008), and no open source implementation was found during an intensive search. Only the pseudo code for the active probing algorithm was given. This meant that the task of the main implementation was not only to apply and adapt the tool described by Rish et al. (2005), but also to provide nearly all of the code required to create the functionality outlined in the active probing approach. The decision tree implementation was not given either, but as the structure of such an implementation would be far less complex than the Bayesian network one, this was not seen as an issue.

Within network communication one must pay regard to noise. In this thesis the assumption was made that the noise levels were low enough to overlook their effect on the performance of the fault detection and localization. According to Tang et al. (2005), the use of active probing may in itself be a way to mitigate the effect of noise. As for passive collection of information, the limitation of only looking at troubleshooting from a single node perspective means that we were not at the mercy of uncertainty caused by network instability; all passively available information required could be gathered on the node itself.

The Bayesian network and the decision tree implementations also differ in how they handle noise; in short, the decision tree is generally more affected by noise (Steinder & Sethi 2004).

As mentioned, the amount of implementation detail was limited for the given approaches, but a prototype software was required in order to satisfy the evaluation.

The work began with what was known: the pseudo code for the active probing algorithm. Each step in the algorithm was then analyzed and broken down in order to extract the requirements of the independent parts, enabling the development of each function separately. When the task of each of the included functions was understood, the next step could be initiated.

In their work, Rish et al. (2005) used Bayesian inference to provide the implementation of all the included functions. While searching back through the publication history of the researchers in an attempt to learn more about their implementation of the Bayesian inference, a parallel idea emerged, built on the same intent to detect and localize faults, also using probes in an iterative process. The idea came from a previous work of the same research team who wrote the active probing article (Rish et al. 2005). The article was titled “Test-based diagnosis: tree and matrix representations” (Beygelzimer et al. 2005) and described the use of a decision tree implementation for detecting and localizing faults.

The implementation itself was clearly different from active probing, but from the perspective of the overall thesis question it was well within scope. At this point, the decision was taken to create two implementations with the same basic approach of using active probes. Alongside the Bayesian inference approach there would also be a second implementation conforming to the same input, output, and knowledge.

The reason behind this was to get a more dynamic result, and discussion, regarding the different implementations. This also gave the opportunity to better see the pros and cons of each implementation. The development of two different implementations fit well into the setting constructed by the thesis. The same set of probes was used, as were the outcomes and the knowledge guiding the implementations in detecting and localizing different faults. The only difference between them, apart from the actual code, was the work of creating the implementation-specific knowledge representations: the Bayesian inference approach requires a Bayesian network, while the decision tree requires a tree representation of the knowledge. This, during the construction phase, in itself raised questions regarding the reasons for doing it one way over the other.

In order to create the foundation for a relevant evaluation, the detection and localization of faults had to be performed on context-relevant faults. In this thesis, that knowledge originally came from the informal interviews11. Two faults were chosen, which are described below.

11 According to the informal interviews performed internally on Ericsson.


Synchronization is a critical feature on any communication node and must therefore be monitored carefully. As the controls for the synchronization were based on the connection to a synchronization server, it was of interest as a connectivity-related fault. If the connectivity to the synchronization server, in this case an NTP server, exhibits irregularities, the synchronization can be affected, with disturbances as a result. Naturally, the same approach could be applied to any of the other services relied upon on the node, but for simplicity only one, NTP, was chosen.

Secondly, for a connection to work properly, the two terminations of the link must use the same speed and duplex configuration. In general, this is handled by the auto negotiation standard (IEEE 802.3 Ethernet working group 2015). In cases where this did not work properly, it was not always clear why. For example, the link may work fine for low amounts of traffic, only to display serious congestion as the amount of traffic increases. This could in some cases be detected by looking closely at the configuration and statistics of the interfaces, hence constituting a good choice for our intentions, while also being clearly different from the affected-services fault presented earlier (the synchronization fault).

This knowledge about the example cases was translated into a dependency matrix. The dependency matrix as a knowledge representation system was utilized by both the Bayesian network (Rish et al. 2005) and the binary decision trees (Beygelzimer et al. 2005), hence presenting the perfect starting point for the construction of the knowledge representations required in the implementations.

For the remaining part of the thesis, an example matrix was constructed in order to assist in the understanding of future knowledge representations. The same knowledge is presented in all upcoming knowledge representations, regardless of type, thereby giving the reader a possibility to grasp how the same data is presented in different ways for different purposes.

Dependency matrices were, as pointed out by Beygelzimer et al. (2005), commonly used within the area of fault management in general, for their capability of representing the data required for fault correlation in a human-accessible way.

As previously mentioned, the active probing approach consisted of an iterative hypothesis testing where a number of important states were defined as combinations of probe outcomes; this could be illustrated with a dependency matrix.


The dependency matrix was constructed with the knowledge and faults gathered from the informal interviews12 in mind. The probes were denoted PX, where X is a number, in this example ranging from 1 to 6. The states representing the results of the troubleshooting were denoted in a similar way, but prefixed with an S instead. A description of the states was also available.

The dependency matrix proved to be a starting point for the construction of several different implementation-specific data representations.

Within the matrix, an unsuccessful probe was denoted with a 1, while a successful one was left undefined. This worked since there were only two possible results of a probe, success or not, 0 or 1; hence, we were only required to explicitly state one of them.

12 According to the informal interviews performed internally on Ericsson.

Figure 7: Binary decision tree from dependency matrix.

Table 2: Dependency matrix.


In this example (Table 2), which will be used throughout the report, the unsuccessful state was chosen as the “default” state. The fact that some zeros have been explicitly stated will be explained later in the report, when the requirements for the construction of the Bayesian network are described.
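The way a dependency matrix supports fault localization can be sketched as state elimination: keep only the states whose expected probe outcomes match what was observed. The matrix entries below are made up for illustration and do not reproduce Table 2:

```python
# Illustrative dependency matrix: rows are states, columns are probes;
# 1 means "this probe fails in this state", 0 (or absent) means success.
MATRIX = {
    "S1_dest_down":      {"P1_ping": 1, "P2_ifStatus": 0},
    "S2_interface_down": {"P1_ping": 1, "P2_ifStatus": 1},
    "S3_all_ok":         {"P1_ping": 0, "P2_ifStatus": 0},
}

def consistent_states(observations):
    # Keep the states whose expected probe outcomes match the observed
    # results; unspecified entries default to 0 (success).
    return [state for state, expected in MATRIX.items()
            if all(expected.get(p, 0) == r for p, r in observations.items())]

# Ping failed but the interface is up -> the destination itself is suspect.
print(consistent_states({"P1_ping": 1, "P2_ifStatus": 0}))
# ['S1_dest_down']
```

Each observed probe result narrows the set of consistent states, which is the same narrowing the decision tree performs step by step.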

From Figure 8 we can see that the outcome of the ping probe divides the state space in two, as does the outcome of the ifStatus probe. Iteratively, this approach narrows down the state space for every probe executed until only one state remains. The final state is the one most likely to represent the actual state of the system.

All nodes, except the leaves, were represented by probes capable of returning a binary result. This is how the binary decision tree works in an iterative way. By being aware of probe results already acquired, no question was asked more than once, and the implementation could simply walk down the tree until a leaf was found. The leaf represents a fault, and thereby a result, in the process of fault localization.
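The iterative tree walk described above can be sketched as follows. The tree shape and state names are illustrative only, not the thesis's actual decision tree:

```python
# Internal nodes are (probe, subtree-on-failure, subtree-on-success)
# tuples; leaves are diagnoses.
TREE = ("P1_ping",
        ("P2_ifStatus", "S2_interface_down", "S1_dest_down"),  # ping failed
        "S3_all_ok")                                           # ping ok

def walk(tree, run_probe):
    cache = {}  # probe results already acquired: no probe runs twice
    while isinstance(tree, tuple):
        probe, on_fail, on_success = tree
        if probe not in cache:
            cache[probe] = run_probe(probe)  # 0 = success, 1 = failure
        tree = on_fail if cache[probe] == 1 else on_success
    return tree  # a leaf: the localized fault (or the all-ok state)

# Ping failed, interface up -> the destination itself is the suspect.
results = {"P1_ping": 1, "P2_ifStatus": 0}
print(walk(TREE, results.get))  # 'S1_dest_down'
```

If the first probe succeeds, the walk stops after a single probe, which is exactly the probe-saving property that motivates the active, iterative approach.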

This was a simple and straightforward knowledge representation for expressing the relation between probe outcomes and states. Its connection to the dependency matrix was discussed by Beygelzimer et al. (2005). In the article, the dependency matrix representation was attributed as best suited to handle changes in knowledge, as well as being most manageable. Beygelzimer et al. (2005) also correlate the dependency matrix with the binary decision tree, called a flowchart within the article.

For the implementation of the Bayesian inference, a Python library was used (eBay Software Foundation 2013). This enabled the software to be implemented in a way that conformed closely to the Bayesian network implementation of the active probing algorithm (Rish et al. 2005). As for the construction of the knowledge representation, limited guidance was to be found for the particular area of decision networks.

Figure 10 shows the data model expressed as a Bayesian network, containing all probes and states. Defining a Bayesian network requires a probability for every connection in the network. This information did not exist: there was no historical data, nor was there any good perception of the general frequency of the faults.
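Since the library API is not reproduced here, the inference step itself can be illustrated with a plain enumeration over states, assuming conditionally independent probe outcomes given the state. The priors and conditional probabilities below are invented for illustration only:

```python
# Hypothetical priors P(state) and conditionals P(probe succeeds | state);
# the actual thesis implementation used the eBay Bayesian-network library.
PRIOR = {"server_down": 0.25, "link_degraded": 0.25,
         "service_fault": 0.25, "all_ok": 0.25}
P_SUCCESS = {
    "server_down":   {"ping": 0.01, "ifStatus": 0.01},
    "link_degraded": {"ping": 0.95, "ifStatus": 0.05},
    "service_fault": {"ping": 0.95, "ifStatus": 0.95},
    "all_ok":        {"ping": 0.99, "ifStatus": 0.99},
}

def posterior(observations):
    """P(state | observed probe outcomes), probes assumed independent."""
    scores = {}
    for state, prior in PRIOR.items():
        p = prior
        for probe, ok in observations.items():
            p_ok = P_SUCCESS[state][probe]
            p *= p_ok if ok else (1.0 - p_ok)
        scores[state] = p
    total = sum(scores.values())
    return {s: p / total for s, p in scores.items()}

post = posterior({"ping": 1, "ifStatus": 0})
print(max(post, key=post.get))  # most likely state given the observations
```

This also makes the need for a complete set of probabilities concrete: every state-probe pair must have a value before the posterior can be computed.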

The explicit zeros found in the dependency matrix were required for the Bayesian network to be complete. Unless the network was complete, the inference could not be performed.

The starting point for the construction of this data model was the assumption that all states were equally likely. Further, it was assumed that the number of probes supporting a state indicated the degree of support for that particular state. For example; a first state (1), was defined by 3 probes that failed, a

References
