Monitoring Water Distribution Network using Machine Learning


DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2017


GAGAN GUPTA

KTH ROYAL INSTITUTE OF TECHNOLOGY


TRITA-EE 2017:163, ISSN 1653-5146


EP242X, Degree Project in Communication Networks


Abstract

Water is an important natural resource. It is supplied to our homes by water distribution networks that are owned and maintained by water utility companies. Around one third of water utilities across the globe report a loss of 40% of clean water due to leakage. Rising pumping, treatment, and operational costs are pushing water utilities to combat water loss by developing methods to detect, locate, and fix leaks. However, traditional pipeline leakage detection methods require periodic inspection with human involvement, which makes them too slow and inefficient for timely leakage detection. An alternative is online, continuous, real-time monitoring of the network, which facilitates early detection and localization of leaks. This thesis aims to find such an alternative using various machine learning techniques.

For a water distribution network, a novel algorithm is proposed based on the concept of dominating sets from graph theory. The algorithm finds the number of sensors needed and their corresponding locations in the network. The network is then subdivided into several leakage zones, which serve as a basis for leak localization in the network. Thereafter, leakages are simulated virtually in the network using hydraulic simulation software. The resulting time-series pressure data from the sensor nodes is pre-processed using one-dimensional wavelet series decomposition with the Daubechies wavelet to extract features from the data. It is proposed to run this feature extraction procedure locally at every sensor node, which reduces the data transmitted to the central hub over the cloud and thereby reduces the energy consumption of the IoT sensors in the real world.

For water leakage detection and localization, a procedure for obtaining training data is proposed, which serves as a basis for recognizing patterns and regularities in the data using supervised machine learning techniques such as Logistic Regression, Support Vector Machine, and Artificial Neural Network. Furthermore, an ensemble of these trained models is used to build a better model for leakage detection and localization. In addition, a Random Forest algorithm is trained and its performance is compared with that of the ensemble of the earlier models. Leak size estimation is also performed using the Support Vector Regression algorithm.

It is observed that sensor node placement using the proposed algorithm provides a better leakage localization resolution than random deployment of sensors. Furthermore, it is found that leak size estimation using the Support Vector Regression algorithm provides reasonable accuracy. It is also noticed that the Random Forest algorithm performs better than the ensemble model except in the low-leakage scenario. Thus, it is concluded that the leak size should be estimated first; based on this estimate, the ensemble model can be applied for small leaks, while Random Forest alone can be used for large leaks.


Referat

Water is an important natural resource. It is delivered to our homes through the water distribution network, which is owned and maintained by water utilities. Around one third of water utilities worldwide report a loss of 40% of clean water due to leakage. Rising pumping, treatment, and operating costs are driving water utilities to combat water losses by developing methods to detect, locate, and fix leaks. However, traditional pipeline leak detection methods require periodic inspection with large-scale human involvement, which makes them slow and inefficient for timely leak detection. An alternative is online, continuous, real-time monitoring of the network, which facilitates early detection and localization of these leaks. This thesis aims to find such an alternative using various machine learning techniques.

For a water distribution network, a new algorithm is proposed based on the concept of dominating nodes from graph theory. The algorithm determines how many sensors are needed and their corresponding locations in the network. The network is then divided into several leakage zones, which form the basis for leak localization in the network. Thereafter, leaks are simulated virtually in the network using hydraulic simulation software. The resulting time-series pressure data from the sensor nodes is pre-processed using one-dimensional wavelet series decomposition with the Daubechies wavelet to extract features from the data. It is proposed to apply this feature extraction procedure locally at every sensor node, which reduces the data transmitted to the central hub over the cloud and thereby reduces the energy consumption of the IoT sensor in the real world.

For the detection and localization of water leaks, a procedure for obtaining training data is proposed, which forms the basis for recognizing patterns and regularities in the data using supervised machine learning techniques such as logistic regression, support vector machines, and artificial neural networks. Furthermore, an ensemble of these trained models is used to build a better model for leak detection and localization. In addition, the Random Forest algorithm is trained and its performance is compared with the obtained ensemble of the earlier models. Leak size estimation is also performed using the Support Vector Regression algorithm.

It is observed that sensor node placement using the proposed algorithm gives a better leak localization resolution than random deployment of sensors. Furthermore, it is found that leak size estimation using the Support Vector Regression algorithm gives reasonable accuracy. It is also noted that the Random Forest algorithm performs better than the ensemble model except in the low-leakage scenario. In conclusion, the leak size should be estimated first. Based on this estimate, ensemble models can be applied for small leaks, while for large leaks Random Forest alone can be used.


Acknowledgements

I would like to express my deepest gratitude to Associate Professor Carlo Fischione for giving me the opportunity to pursue my Master's thesis in the Department of Networks and Systems Engineering. I am very thankful to Rong Du for guiding me throughout my thesis work. I would also like to thank my friends from my program and the faculty members of the Department of Information Science and Engineering, who have helped me broaden my knowledge of wireless systems. Additionally, I am very thankful for the input from my friends at the School of Computer Science and Communication at KTH Royal Institute of Technology. Above all, I cannot thank my family members enough, in particular my wife Shilpa Gupta, for supporting me throughout my endeavor for higher studies at KTH.


List of Tables

3.1 Notations used in Algorithm 2

4.1 Comparison between Algorithm 1 and random deployment of sensors in terms of average distance and total hop count

4.2 Optimized regularization parameter for Logistic Regression

4.3 Optimized hyperparameter value for Support Vector Machine

4.4 Number of bags versus out-of-bag error for Random Forest classification


List of Figures

1.1 Water Distribution Network

1.2 WSN Ecosystem

1.3 Linear Regression

1.4 Binary Classification

2.1 EPANET Layout

2.2 Dominating Set

2.3 Wavelet Series Decomposition

2.4 Support Vector Machines

2.5 Support Vector Regression

2.6 Neuron Function

2.7 Neural Network

2.9 Bias-Variance Trade-off

2.10 Confusion Matrix

2.11 ROC Curve

2.12 One versus the rest

3.1 An example of Adjacent Vertex/Neighbor

3.2 An example of nearest neighbor

3.3 Sensor Node Placement I

3.4 Sensor Node Placement II

4.1 Distribution of leak zone size for random placement of sensors and proposed algorithm for sensor placement

4.2 Comparison of time-series and wavelet pre-processing in terms of leakage detection

4.3 The performance of leakage zone identification achieved by different wavelets (pre-processing)

4.4 ROC plot for Class 9 classification on LR model

4.5 The result of LR classification

4.6 The result of SVM classification

4.7 The result of ANN classification

4.8 The leak detection and localization accuracy achieved by different ML algorithms

4.9 The result of RF classification

4.10 Comparison of RF algorithm performance with Ensemble of LR, SVM and ANN for leakage detection and localization

4.11 The result of leakage size estimation using SVR


Acronyms

ANN Artificial Neural Network.

AUC Area under the curve.

DWT Discrete Wavelet Transform.

EC Emitter Coefficient.

EPANET Environmental Protection Agency Network.

FPR False positive rate.

IoT Internet of Things.

ITA Inverse Transient Analysis.

LR Logistic Regression.

MDS Minimum Dominating Set.

ML Machine Learning.

OOB Out-of-bag.

PCA Principal Component Analysis.

RF Random Forest.

SVM Support Vector Machines.

SVR Support Vector Regression.

TPR True positive rate.

WDN Water Distribution Network.

WSN Wireless Sensor Network.


Chapter 1

Background and Motivation

Water is supplied to our homes by a water distribution network (WDN), which is owned and maintained by water utility companies. These companies face increasing costs for installing new pipelines to serve the growing population as well as for maintaining and replacing the aging system. The increase in pumping, treatment, and operational costs is pushing water utilities to combat water loss by developing methods to detect, locate, and fix leaks. Today, leaks and bursts in WDNs account for significant water loss worldwide. Around one third of water utilities around the world report a loss of 40% of clean water due to leakage [1]. By reducing water leakage in the WDN, water utilities can save unnecessary expense on a) producing or b) purchasing additional water, c) its treatment, and d) its transportation to the end users.

A typical water distribution system comprises a complex network of pipelines that are buried underground and relatively inaccessible. This makes the detection of leakage in the water supply challenging. Traditional pipeline leakage detection methods depend on periodic inspection of massive urban-scale pipeline networks by maintenance personnel. This, however, requires intensive human involvement. Moreover, periodic inspection hardly provides real-time monitoring of the distribution system. Consequently, a leak may not be detected in time and may cause much larger economic loss and environmental pollution. An alternative would be online, continuous, real-time monitoring of the entire network, which would facilitate early detection and localization of these leakages. This thesis works toward such a solution using data analysis.

This chapter sets out the framework of the thesis by introducing water distribution networks, wireless sensor networks, and Machine Learning. Furthermore, previous work related to this problem is discussed, and finally the problem formulation for the thesis is presented.

1.1 Water Distribution Network

A Water Distribution Network (WDN) [2] is a hydraulic infrastructure consisting of several components, such as pipes, pumps, valves, reservoirs, storage tanks, meters, and other constituents, that connect water treatment plants to consumer taps. These distribution networks are designed to meet peak demands. The purpose of the system of pipes connected in a network is to supply water at adequate pressure and flow.

The water in the supply network is maintained at positive pressure to ensure water reaches all parts of the network. The water is typically pressurized by pumps that pump water into storage tanks, which are constructed at the highest point in the network. A single network may have several such reservoirs.

One such network is shown in Figure 1.1 [3]. The figure shows junction valves/nodes, a reservoir, and pipes. The junction nodes are connected by pipes, and the network has a single reservoir. There is a growing trend toward obtaining physical parameters of the network, such as pressure and flow rate, by means of wireless sensors.

Figure 1.1: The water distribution system layout of District 4 of ’Capitanta’ (Italy) [3]. The junction valves are shown as small circles, color-coded by the elevation of their location. Each water pipe connects two junction nodes, and the pipe lengths in the figure are proportional to the actual lengths. The network has a single reservoir, shown as the cyan box.

With the availability of state-of-the-art sensors and the growth of IoT (Internet of Things) enabled services, it is feasible to build an autonomous system that continuously collects field data remotely through sensor nodes deployed in the WDN. These nodes serve as the building blocks of a network that connects wirelessly to a central monitoring unit. Such networks are termed wireless sensor networks (WSNs) and are described in the next section.

1.2 Wireless Sensor Network

A wireless sensor network is a network of a large number of sensor nodes that cooperatively sense and monitor parameters of physical or environmental conditions. It is an infrastructure [4] that combines sensing, computing, and communication elements to give an administrator the ability to instrument, observe, and react to events and phenomena in a given environment.

WSNs have made it possible to obtain data about physical phenomena that were difficult or impossible to observe in more conventional ways. With recent advances in fabrication technology and the availability of inexpensive, low-power miniature components that are often integrated on a single chip, sensors have become small, rugged, reliable, low-power, and inexpensive. Coupled with this, the availability of low-power communication technology has fueled the growth of WSNs as an important tool for managing resources efficiently in a variety of fields, including predictive maintenance [5], environmental monitoring [6], industrial machine monitoring [7], and smart city applications [8]. One area where WSNs have gained prominence is water distribution networks.

A typical WSN is shown in Figure 1.2. The sensor nodes are distributed over the monitored field to measure the phenomena of interest. These sensor nodes are connected by wireless communication links, through which the obtained data is transmitted to the sink node on request in a multi-hop manner. The sink node is connected to an Internet gateway, from where the data is further transmitted to a data center or a cloud platform. At the data center, the received data is stored, processed, and analyzed using various data analytics tools. For a water monitoring application, this can provide valuable information such as the detection and localization of leakage in the network. This is possible through the recognition of patterns and regularities in the data obtained by the data center, using various Machine Learning (ML) techniques. In the next section, a brief background on Machine Learning techniques is provided.

Figure 1.2: A typical WSN ecosystem. A network of sensor nodes is connected wirelessly through multi-hop links to a sink node and then to a gateway, from which data is transferred to a data center or cloud, where it is stored, processed, and analyzed and made accessible to users.

1.3 Machine Learning

Machine Learning is a class of data analysis methods that learn patterns and hidden insights in data without being explicitly programmed to do so. Thanks to better and more powerful computing devices, ML is increasingly being applied in areas such as equipment failure detection [9], pattern and image recognition [10], email spam filtering [11], and fraud detection [12].

ML algorithms perform predictive analysis. When the output variable takes continuous values, the task is termed Regression, whereas when it takes class labels it is called Classification. Regression refers to estimating a response, whereas Classification refers to identifying group membership.

For example, if we have past data containing house prices and the corresponding house sizes for a particular location in a city, and we want to predict the selling price of any house from its size, this can be solved using Regression, as shown in Figure 1.3 [13]. The output of the ML algorithm in this case is a curve that best fits the past data.
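The house price example can be sketched in a few lines of Python. This is a minimal illustration that assumes a straight-line model fitted by ordinary least squares; the sizes and prices are made-up values chosen for the example, not data from the thesis or from [13], and `fit_line` is a hypothetical helper name.

```python
# Hypothetical past sales: size in square meters and price in arbitrary
# currency units. The numbers are invented for illustration only.
sizes = [50.0, 70.0, 90.0, 110.0, 130.0]
prices = [2100.0, 2900.0, 3700.0, 4500.0, 5300.0]


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Spread of x around its mean, and joint spread of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)

# Predict the price of an unseen house of 100 square meters.
predicted_price = slope * 100.0 + intercept
```

The fitted line plays the role of the best-fit curve in the figure: once the slope and intercept are learned from past data, any new size maps to a continuous price estimate.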

An example of a binary Classification problem is shown in Figure 1.4. There are two features, x₁ and x₂, in this case. The positive and negative classes are separated by a best-fit decision boundary. Any point that lies within the decision boundary is labeled positive, and any point outside it is labeled negative.
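A circular decision boundary of this kind can be illustrated with a small sketch. It assumes, purely for illustration, that the boundary is a circle centered at the origin and that the two classes are separable by radius; the training points and the helper names `fit_radius` and `predict` are hypothetical and do not come from the thesis.

```python
import math

# Hypothetical labeled points (x1, x2): +1 is the positive class, -1 the negative.
points = [(0.1, 0.2), (-0.3, 0.4), (1.0, 1.0), (-1.5, 0.0)]
labels = [1, 1, -1, -1]


def fit_radius(points, labels):
    """Pick a radius midway between the farthest positive and nearest negative point.

    Assumes an origin-centered circle and classes that are separable by radius.
    """
    farthest_positive = max(math.hypot(*p) for p, y in zip(points, labels) if y == 1)
    nearest_negative = min(math.hypot(*p) for p, y in zip(points, labels) if y == -1)
    return (farthest_positive + nearest_negative) / 2.0


def predict(x1, x2, radius):
    """Label a point +1 if it lies inside the circular boundary, otherwise -1."""
    return 1 if math.hypot(x1, x2) <= radius else -1


radius = fit_radius(points, labels)
inside_label = predict(0.5, 0.5, radius)
outside_label = predict(1.2, 0.9, radius)
```

The output here is a class label rather than a continuous value, which is the distinction between Classification and Regression drawn above.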

The two examples above are cases of Supervised ML, where past labeled data is available for learning. However, this is not always the case. When ML techniques are applied to unlabeled data, it is called Unsupervised Learning. Unsupervised learning is mostly used for finding structure in the data. It can group the data based on the structure present in it, but it cannot determine what those groups correspond to. Thus, a Supervised learning approach is most suitable for finding leakage in a water distribution network. The next section reviews previous work on leakage detection and is followed by the problem formulation for this thesis.


Figure 1.3: An example of linear regression of house price on house size [13]. The green line represents the best curve fit to the past data of house price versus area.

1.4 Related Works

Leakage in a distribution system can have several causes. It could be due to faulty connections, pipe corrosion, high system pressure, damage caused by ground movement in winter, excavation, or poor workmanship. Many practically deployed water leakage detection systems measure hydraulic and acoustic parameters such as flow, pressure, and water quality parameters. Of these, the pressure change due to leakage is the most noticeable and hence is of primary interest for localizing leaks in the network. The leakage flow rate from a pipe increases with the system pressure in the pipe. Since the pipes in a WDN are pressurized, events such as pipe bursts and valve openings or closures result in a sudden change in the flow through the pipe, causing a pressure transient that propagates along the pipelines. The pattern of the pressure changes can be analyzed computationally to detect the location and size of a leak in real time. In this section, a short literature review of the available leakage detection methods is provided.

Hardware-based methods using acoustic or non-acoustic equipment have been studied in [14]. However, such methods are expensive, time-consuming, and labor-intensive. There are also transient-based methods, such as detecting negative pressure waves caused by pipe bursts, as studied in [15], [16], and [17]. A review of transient-based leak detection methods is provided in [18]. However, these methods fail to locate a leak correctly when the pipe bursts near the boundary of a pipe joint, and they also suffer from false alarms caused by normal pipeline operation. Alternatively, there is a methodology called Inverse Transient Analysis (ITA) [19]. In this technique, the system state is known but the system parameters are not, and the inverse problem is solved for parameters such as leak size and position. Leakage detection has also been performed by analyzing the subsequent behavior of burst-induced transient signals, as studied in [20] and [21]. However, burst-induced pressure signals may be masked by background noise or other events in the network.

Model-based leak detection [22] is another way in which researchers have approached this problem. With this approach, the problem can be formulated as a classical least-squares estimation problem. Nonetheless, because relatively few pressure measurements are available, it is challenging to solve the complex non-linear models of a city's WDN. Another technique that has gained traction in recent times is the use of pattern recognition in the data to find leakage in the distribution system.

Figure 1.4: Binary classification. A circular (non-linear) decision boundary separates the regions belonging to the two classes. Any future data point lying within the marked circle would be classified as positive, and any point outside it as negative.

Several data-driven approaches to leakage detection in WDNs have been published. A comprehensive review of such approaches for burst detection is given in [23]. That article classifies the data-driven approaches into three categories: a) Classification methods, b) Prediction-Classification methods, and c) Statistical methods. Methods that apply classification techniques to the physical parameters are placed in the first category. In Prediction-Classification methods, classification is carried out after a prediction stage. In Statistical methods, burst detection relies entirely on statistical theory such as Statistical Process Control. In this thesis, Classification methods are used to detect and localize leakage in the system, and the corresponding literature is reviewed in the following paragraph.

In [24], leak detection is performed by applying support vector machines to monitored pressure values. However, that work places the monitoring nodes in the network randomly and gives no guidance on the placement of sensor nodes. In [25], novelty detection is performed on WDN time-series data of pressure and flow values. A novelty can be any of a variety of events, such as pipe bursts, hydrant flushing, and sensor failure. This approach detects leakage but does not provide information on its location. In [26], a data-driven novelty detection system based on Multi-Regional Principal Component Analysis (PCA) was used for leakage detection. However, a major limitation comes from the linear nature of PCA, which resulted in limited sensitivity or a high false alarm rate. From an IoT perspective, it is desirable to transmit only limited data to the central hub. In this regard, the study in [27] extracts features from the pressure data in the water distribution system. However, how to use this information further within the WDN framework has not been examined. In the next section, the problem formulation for the thesis is provided.

1.5 Problem Formulation

The main goal of this thesis is to develop an appropriate solution for automatic detection and localization of pipeline leakages, and for estimation of the size of leakages, in water distribution networks.

The thesis objective can be further subdivided into two distinct problems. First, a judicious rule is required for placing the sensor nodes in the water distribution network. Second, the best way must be found to apply various ML techniques to the obtained data in order to extract leakage information for the network. Both problems are discussed in detail in the following subsections.


1.5.1 Placement of Sensor nodes

Given any water distribution network, a mechanism is needed to determine automatically how many sensor nodes are required and where to place them in the network. The placement of sensor nodes has a direct impact on how effectively leakage can be located in the network. Apart from finding the right number of sensors and their placement in the WDN, it is also desirable to define a zone around every sensor node location, known as a leak zone. Every sensor node represents a zone consisting of several nearby junction nodes. This helps in identifying the zone in which the leakage observed in the current data has occurred.

In an ideal world, one could place as many sensors as there are junction nodes and monitor all nodes in real time, to pinpoint exactly where leakage occurs in the WDN. In other words, the leakage zone size would be 1 in all cases. However, in our resource-limited world, this is infeasible. Placing the sensor nodes in the network therefore becomes an interesting engineering problem. Before proceeding to the solution, it should be noted that the problem at hand differs from the traditional node placement problem. There are some special requirements that need to be considered, as described below.

The leakage zone sizes should be distributed equitably: zones should be neither too small nor too large. A skewed distribution of leak zone sizes has a negative effect on the resolution of leak localization. On the one hand, sensors that represent only themselves in the network are a redundant use of resources; on the other hand, a sensor node that represents a large number of nodes results in poor leak localization performance. A solution to this is developed in the thesis in the form of an algorithm based on graph theory, which is discussed in detail in Section 3.1.
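To make the notion of leak zone size concrete, the sketch below assigns every junction to its nearest sensor by hop count and compares the resulting zone sizes for two placements. It is only an illustration under stated assumptions: the seven-junction chain network, the sensor locations, and the function name `leak_zones` are hypothetical, and the nearest-sensor rule is a stand-in, not the placement algorithm of Section 3.1.

```python
from collections import deque


def leak_zones(adjacency, sensors):
    """Assign each junction to its nearest sensor by hop count (multi-source BFS).

    Ties are resolved in favor of the sensor whose search front arrives first.
    """
    owner = {sensor: sensor for sensor in sensors}
    queue = deque(sensors)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in owner:
                owner[neighbor] = owner[node]
                queue.append(neighbor)
    zones = {sensor: [] for sensor in sensors}
    for node, sensor in owner.items():
        zones[sensor].append(node)
    return zones


# Hypothetical network: seven junctions connected in a chain by pipes.
adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5, 7], 7: [6]}

balanced = leak_zones(adjacency, sensors=[2, 6])
skewed = leak_zones(adjacency, sensors=[1, 2])

balanced_sizes = sorted(len(zone) for zone in balanced.values())
skewed_sizes = sorted(len(zone) for zone in skewed.values())
```

With sensors at junctions 2 and 6 the zones have sizes 3 and 4, whereas sensors at junctions 1 and 2 give sizes 1 and 6: one sensor represents only itself while the other must localize a leak among six junctions, which is the skew the requirement above is meant to avoid.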

1.5.2 Data Analysis for leakage information

Once the number of sensor nodes, their locations, and the corresponding leakage zones are identified, the network can be used to generate data and train various machine learning models. However, this requires several considerations at different levels, which are discussed below.

The natural next step after sensor placement is to generate the training data required for the ML models. How the data is generated determines the time taken to detect the occurrence of leakage in the system. Furthermore, from an IoT perspective, it is desirable to minimize the energy consumption of the sensors. This calls for a data reduction or compression technique that minimizes the data each sensor sends over the cloud to the central hub.

To address this, various pre-processing approaches are tested in the thesis, and a suitable way of extracting relevant parameters is used that reduces the data to be transmitted by the sensor. Various ML algorithms are then trained and tested on the pre-processed data. In this thesis, Logistic Regression, Support Vector Machines, Artificial Neural Network, and Random Forest are used. Logistic Regression, being the simplest of these algorithms, is used to find leakage in the system as well as to compare the performance of pre-processed data with that of raw data. Support Vector Machines and Artificial Neural Networks are the most widely used algorithms in the leakage detection literature and are therefore included in the discussion. Finally, an ensemble of the learned models is formed to obtain a better-performing model for leakage detection and localization. The performance of this ensemble model is then compared with that of an ensemble learning method, the Random Forest algorithm. For leak size estimation, only Support Vector Regression is used in the thesis, as it provided reasonably accurate results for practical purposes.

This thesis is laid out as follows. Following the current chapter, relevant background information is presented in Chapter 2. This is followed in Chapter 3 by a description of the solution approach used for data collection and analysis. Chapter 4 provides the numerical results obtained with the different ML algorithms. Finally, the conclusion is stated in Chapter 5, and the report ends with a short note on the future scope and challenges involved in the work.


Chapter 2

Preliminaries

This chapter summarizes the theory and concepts needed to understand the solution approach used in the thesis. First, EPANET [28] is introduced; this is the software used for simulating water networks, and it forms the basis of data generation for this thesis. Next, the concept of a Dominating Set and the theory of Wavelet Series Decomposition are described, which are used for sensor placement and for pre-processing of the data, respectively. Furthermore, the ML techniques used in the thesis, namely Logistic Regression, Support Vector Machine, Artificial Neural Network, and Random Forest, are described in brief. Thereafter, ways to address the problem of bias and variance in ML algorithms are explained. Figures of merit for binary classification are discussed in Section 2.10. Finally, the chapter concludes with a short introduction to multi-class classification and ensemble learning.

2.1 EPANET

A supervised ML algorithm for automatic water leakage detection requires training data. The data should contain hydraulic details, such as the pressure at different locations in the WDN, for leaks that occurred in the past. However, for security reasons, water distribution system data, which includes the geographical layout of pipes, tanks, and demands, is kept confidential by the water utility companies and is not available in the public domain.

To circumvent this problem, a water network modeling software called EPANET is used for data generation. EPANET models the hydraulic and water quality behavior of water distribution piping systems. It is public domain software, freely available for use, and is widely used for university education and research. An EPANET simulation provides time-series data of various parameters in the network. Figure 2.1 depicts a typical layout of this software. The figure portrays a WDN consisting of junctions in different locations, connected by pipes. A user can edit various parameters of the network elements and run a simulation to observe the effect on the overall system.

Although the software has no direct tool for inducing leakage in the system, leakage can still be simulated using the emitter property of junction nodes. This property, which is designed to model fire hydrants and sprinklers, can be exploited to model leaks. As per Torricelli's equation, the flow rate is

$$Q = C \times A \times P^{P_{exp}}, \tag{2.1}$$

where C is a coefficient, A is the orifice aperture area, P is the fluid pressure, and P_exp is the pressure exponent. The pressure exponent is typically taken as 0.5 for circular apertures. Based on the above equation, EPANET computes the emitter coefficient (EC) as

$$EC = \frac{Q}{P^{P_{exp}}}, \tag{2.2}$$

where EC is the emitter coefficient with unit (l s⁻¹ m⁻¹) and Q is the flow rate.

Figure 2.1: EPANET layout. A WDN is shown with the locations of various junctions and tanks. The output of the hydraulic simulation is time-series data, shown as a table on the left.

As mentioned in [29], hydraulic simulations in EPANET are affected by very low values of the emitter coefficient; for better convergence of the numerical results, the system accuracy parameter should therefore be set to 0.00075. The study in [30] on the effects of leakage in water distribution systems has shown that the pressure exponent depends on the geometry of the orifice. Furthermore, the work in [31] has shown that corroded metal pipes correspond to the highest values of the exponent (≤ 2.3) and that in plastic pipes the exponent can take values from 0.41 to 1.85.

In the absence of experimental data, we have kept the pressure exponent at 0.5 throughout the simulation, which is the default value for a circular orifice. This emitter parameter of the EPANET model is used extensively in the generation of leakage data, which is presented in detail in the method section.
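As a minimal illustration of Equations (2.1) and (2.2), the following Python sketch computes an emitter coefficient from a flow rate and a pressure, and then the leak flow at another pressure. The function names and numeric values are hypothetical; they are not taken from the thesis or from the EPANET programming interface.

```python
def emitter_coefficient(flow_rate, pressure, pressure_exponent=0.5):
    """EC = Q / P**P_exp, Equation (2.2)."""
    if pressure <= 0:
        raise ValueError("pressure must be positive")
    return flow_rate / pressure ** pressure_exponent


def leak_flow(ec, pressure, pressure_exponent=0.5):
    """Q = EC * P**P_exp, i.e. Equation (2.2) solved for Q."""
    return ec * pressure ** pressure_exponent


# Hypothetical values: a 2 l/s leak observed at a pressure of 25 m.
ec = emitter_coefficient(2.0, 25.0)
# Flow through the same emitter if the pressure rises to 36 m.
q = leak_flow(ec, 36.0)
```

With the default exponent of 0.5, the leak flow grows with the square root of the pressure, so a higher pressure at the junction produces a larger simulated leak for the same emitter coefficient.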

This thesis is carried out on a virtual WDN obtained through EPANET. One of the examples shipped with the software, called 'Net2', was chosen for the development of all the algorithms. This example was chosen because it is neither too simple nor too complex in terms of the number of components in the model, which allows for faster, yet relevant, development of algorithms.

2.2 Dominating Set

A WDN can be viewed as a connected graph [32] G(V, E), where V is the set of junction nodes in the water distribution system and E is the set of edges representing pipes between the vertices. For nodes v_i, v_j ∈ V, we say v_i is a neighbor of v_j if and only if there is an edge ⟨v_i, v_j⟩ ∈ E. In this case, we also say v_i is adjacent to v_j. We use N(v_i) to denote the set of vertices adjacent to v_i. Based on this, the definition of a dominating set is given as follows:

Definition 1 (Dominating set) Given a graph G(V, E), its dominating set D ⊆ V is a subset of the vertex set V such that for any vertex v_i ∈ V\D there exists v_j ∈ D that is adjacent to v_i, i.e., v_j ∈ N(v_i).

Figure 2.2: Dominating Set : In this connected graph the vertices belonging to the dominating set are marked red. The remaining nodes in the graph are connected to either node 2 or node 5.

This is illustrated in Figure 2.2, where a dominating set of the shown graph is {2, 5}. However, {1, 5}, {7, 5}, etc. are equally valid dominating sets of the given connected graph. The Minimum Dominating Set (MDS) problem is of particular interest in many fields, for example for routing in WSNs. Finding an MDS is NP-hard [33] in general, but efficient sub-optimal algorithms exist.
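The definition can be checked mechanically. The sketch below, written in Python for illustration, tests whether a candidate set dominates a graph and builds a dominating set with a simple greedy heuristic. The graph is a hypothetical 7-vertex example and is not claimed to be the graph of Figure 2.2; the greedy rule is one common sub-optimal heuristic, not the algorithm proposed later in this thesis.

```python
def is_dominating_set(adjacency, candidate):
    """True if every vertex is in `candidate` or adjacent to a member of it."""
    candidate = set(candidate)
    return all(v in candidate or adjacency[v] & candidate for v in adjacency)


def greedy_dominating_set(adjacency):
    """Repeatedly pick the vertex that dominates the most uncovered vertices."""
    uncovered = set(adjacency)
    chosen = set()
    while uncovered:
        # Ties are broken by the smallest vertex label (sorted order).
        best = max(
            sorted(adjacency),
            key=lambda v: len(({v} | adjacency[v]) & uncovered),
        )
        chosen.add(best)
        uncovered -= {best} | adjacency[best]
    return chosen


# Hypothetical graph given as vertex -> set of adjacent vertices.
graph = {
    1: {2}, 2: {1, 3, 5}, 3: {2}, 4: {5},
    5: {2, 4, 6, 7}, 6: {5}, 7: {5},
}
sensors = greedy_dominating_set(graph)
```

The greedy heuristic gives no guarantee of finding a minimum dominating set, but its output always satisfies Definition 1, which `is_dominating_set` confirms.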

2.3 Wavelet Series Decomposition

Given a signal f(t), its wavelet transform decomposes the signal into a set of basis functions called wavelets. The wavelet transform is given by

$\gamma(s, \tau) = \int f(t)\, \Psi^{*}_{s,\tau}(t)\, dt$, (2.3)

where ∗ denotes complex conjugation. The variables s and τ are the scale and translation of the wavelet transform. The wavelet transform provides time-frequency localization and multi-scale resolution, which makes it useful in many fields of signal processing, specifically signal compression, image enhancement and noise reduction. The wavelets are generated from a single basis wavelet Ψ(t) by scaling and translation as shown below

$\Psi_{s,\tau}(t) = \dfrac{1}{\sqrt{s}}\, \Psi\!\left(\dfrac{t - \tau}{s}\right)$. (2.4)

However, the above way of calculating a wavelet transform is not employed in practice. Instead, discrete wavelets are used: these are not continuously scalable and translatable, but are scaled and translated in discrete steps only. Such a transform is called the Discrete Wavelet Transform (DWT). It decomposes the signal into mutually orthogonal sets of wavelets.

In discrete wavelet decomposition, the original time series signal is divided into two new signals: approximation coefficients and detail coefficients. The approximation coefficients correspond to the lower frequency content of the signal, whereas the detail coefficients correspond to the higher frequency content. This decomposition is continued from one level to the next, dividing the approximation component into two new components until the selected level of decomposition is reached.

The procedure is illustrated in Figure 2.3 [34], where an original signal S(n) is decomposed into three levels. The approximation coefficients at level 3 resemble the filtered signal, while noise is captured in the detail coefficients.

One can further calculate the energy in each component as a percentage of the total energy in the original signal. Wavelet decomposition of time-series data can thus be applied as a tool for feature extraction, and can help us to reduce a large, variable-size vector to a constant and much smaller feature vector.

Figure 2.3: Wavelet Series Decomposition [34]: Signal S is decomposed into approximation coefficients cA_i and detail coefficients cD_i. Level 3 decomposition provides cA_3 as the approximation coefficient and three detail coefficients, cD_1, cD_2 and cD_3.
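The energy-based feature extraction described above can be sketched in a few lines of Python. The code below implements a level-3 decomposition with the Daubechies 2 (db2) filters and returns the fraction of signal energy in cA_3, cD_1, cD_2 and cD_3. It is a sketch under stated assumptions: periodic boundary extension is used for simplicity, the signal length is assumed to be divisible by 2^level, and the pressure trace is synthetic. Wavelet toolboxes may use a different boundary extension by default, so their numbers can differ slightly.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Daubechies-2 (db2) low-pass filter and the matching high-pass filter.
LOW = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def dwt_step(signal):
    """One DWT level with periodic extension; returns (approximation, detail)."""
    n = len(signal)
    approx, detail = [], []
    for start in range(0, n, 2):
        window = [signal[(start + k) % n] for k in range(4)]
        approx.append(sum(h * x for h, x in zip(LOW, window)))
        detail.append(sum(g * x for g, x in zip(HIGH, window)))
    return approx, detail


def energy_features(signal, level=3):
    """Energy fractions ordered as [cA_level, cD_1, ..., cD_level]."""
    energies = []
    approx = list(signal)
    for _ in range(level):
        approx, detail = dwt_step(approx)
        energies.append(sum(d * d for d in detail))
    energies.insert(0, sum(a * a for a in approx))
    total = sum(energies)
    return [e / total for e in energies]


# Synthetic pressure trace: slow pattern plus a small fast ripple (assumed data).
pressure = [50 + 5 * math.sin(2 * math.pi * t / 32) + 0.2 * (-1) ** t for t in range(32)]
features = energy_features(pressure)
```

Whatever the length of the input time series, the output has level + 1 entries, which is what makes the energy ratios convenient as a fixed-size feature vector.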

2.4 Logistic Regression

Logistic Regression (LR) [35] is one of the ML techniques for binary classification. The hypothesis function used in this method is given as

h_θ(x) = g(θ^Tx), (2.5)

where x is the input data, θ is the parameter vector and g is the sigmoid function given by

$g(z) = \dfrac{1}{1 + e^{-z}}$. (2.6)

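The logistic regression cost function is given as follows

$J(\theta) = -\dfrac{1}{m} \left[ \sum_{i=1}^{m} \left( y^{(i)} \log h_\theta(X^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(X^{(i)})\right) \right) \right] + \dfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$, (2.7)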

where X is the training matrix with m rows (training examples) and n columns (features), X^(i) is its i-th row, y is the label vector and λ is the regularization parameter.

To fit the parameters θ, we minimize this cost function using an optimization algorithm such as gradient descent [36], which solves for the optimum value of θ iteratively using the update

$\theta_j := \theta_j - \alpha \dfrac{\partial J(\theta)}{\partial \theta_j}$, (2.8)

where α is the learning rate. A more detailed mathematical description of logistic regression can be found in [37].
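Equations (2.5) to (2.8) translate almost directly into code. The following Python sketch trains a logistic regression model with batch gradient descent on a small, made-up data set. The data, the learning rate, the number of iterations and the choice to leave the bias parameter θ_0 unregularized are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)), Equation (2.6)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """h_theta(x) = g(theta^T x), Equation (2.5); x starts with 1 for the bias."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y, lam):
    """Regularized cost of Equation (2.7); theta[0] is not regularized here."""
    m = len(X)
    data_term = -sum(
        yi * math.log(predict(theta, xi)) + (1 - yi) * math.log(1 - predict(theta, xi))
        for xi, yi in zip(X, y)
    ) / m
    return data_term + lam / (2 * m) * sum(t * t for t in theta[1:])


def gradient_step(theta, X, y, lam, alpha):
    """One simultaneous update of all theta_j, Equation (2.8)."""
    m = len(X)
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        if j > 0:
            grad += lam / m * t
        new_theta.append(t - alpha * grad)
    return new_theta


# Made-up data with one feature; the first column is the constant bias input.
X = [[1, 0.5], [1, 1.0], [1, 1.5], [1, 3.0], [1, 3.5], [1, 4.0]]
y = [0, 0, 0, 1, 1, 1]
theta = [0.0, 0.0]
initial_cost = cost(theta, X, y, lam=0.1)
for _ in range(2000):
    theta = gradient_step(theta, X, y, lam=0.1, alpha=0.1)
final_cost = cost(theta, X, y, lam=0.1)
```

After training, `predict(theta, x)` returns a value between 0 and 1 that is thresholded at 0.5 to obtain the class label.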

Figure 2.4: Support Vector Machines [38] : Diagrammatic representation of SVM classification. The two classes in the data set are separated by a hyperplane.

2.5 Support Vector Machines

The Support Vector Machines (SVM) [39] is a supervised ML algorithm that can be used both for the purpose of regression and classification. It was originally designed for the binary classification problem.

The core idea of SVM is to map the data’s nonlinear characteristics to a high-dimensional feature space to separate the two classes. It defines an optimal hyperplane which maximizes the margin between the two classes.

Given a training data set of n points (x_1, y_1), ..., (x_n, y_n), where y_i ∈ {−1, 1} and each x_i is a p-dimensional real vector, SVM provides the maximum margin hyperplane that divides the two groups of points. A general hyperplane can be written as

w · x + b = 0, (2.9)

where w is the normal vector to the hyperplane and the parameter $b / \|w\|$ defines the offset of the hyperplane from the origin along the normal vector w. Two hyperplanes can be defined such that

w · x + b = 1 (2.10)

and

w · x + b = −1. (2.11)

Geometrically, the distance between these two hyperplanes is $2 / \|w\|$. To obtain the maximum margin hyperplane this distance needs to be maximized, subject to the conditions

w · x + b ≥ 1, ∀x with label y = 1 (2.12)

and

w · x + b ≤ −1, ∀x with label y = −1. (2.13)

On further manipulation, the problem can be stated as finding b and w by solving the following optimization problem

$\min_{b, w} \; \dfrac{1}{2} \|w\|^2$ (2.14a)

s.t. $y_i (w \cdot x_i + b) \geq 1 \;\; \forall x_i$. (2.14b)

Perfect separation of the two classes may not always be possible. In that case, the objective function is modified as

$\min_{b, w} \; \dfrac{1}{2} \|w\|^2 + C \sum_i \zeta_i$ (2.15a)

s.t. $y_i (w \cdot x_i + b) \geq 1 - \zeta_i \;\; \forall x_i$ (2.15b)

$\zeta_i \geq 0$, (2.15c)

where C is the error penalty factor, which trades off the margin width against misclassification, and ζ_i is the training error of sample i. The objective function can also be written in dual form by substituting $w = \sum_{i=1}^{l} \alpha_i y_i x_i$:

maximize $L_D = \sum_{i=1}^{l} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j)$ (2.16a)

s.t. $\sum_{i=1}^{l} y_i \alpha_i = 0$ (2.16b)

$0 \leq \alpha_i \leq C$. (2.16c)

The above method works well as a linear classifier. However, in some situations the data is better separated by a non-linear decision region. For this, SVM uses the kernel trick [40]. The resulting algorithm is similar, except that the dot product is replaced by a kernel function. The kernel function implicitly transforms the data into a higher dimensional feature space in which linear separation becomes possible. The objective function is modified as follows.

maximize $L_D = \sum_{i=1}^{l} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j)$ (2.17a)

s.t. $\sum_{i=1}^{l} y_i \alpha_i = 0$ (2.17b)

$0 \leq \alpha_i \leq C$. (2.17c)

There are various kernel functions available, but the most widely used ones are the polynomial kernel and the Gaussian kernel:

Polynomial: $K(x_i, x_j) = (x_i \cdot x_j)^d$ (2.18a)

Gaussian: $K(x_i, x_j) = \exp\left(-\dfrac{|x_i - x_j|^2}{2\sigma^2}\right)$ (2.18b)

A pictorial representation of SVM classification is shown in Figure 2.4. Applying SVM to the set of labeled data displayed in the figure results in a decision boundary which separates the two classes.
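To make the two kernels of Equations (2.18a) and (2.18b) concrete, the Python sketch below evaluates them on a few made-up points and assembles the matrix of pairwise kernel values that takes the place of the dot products in Equation (2.17a). The function names, the degree d = 2 and the width σ = 1 are illustrative assumptions.

```python
import math


def polynomial_kernel(x, z, degree=2):
    """K(x, z) = (x . z)^d, Equation (2.18a)."""
    return sum(a * b for a, b in zip(x, z)) ** degree


def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-|x - z|^2 / (2 sigma^2)), Equation (2.18b)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def gram_matrix(points, kernel):
    """Pairwise kernel values K(x_i, x_j) used in the dual objective."""
    return [[kernel(p, q) for q in points] for p in points]


# Made-up two-dimensional samples.
points = [(1.0, 2.0), (2.0, 0.0), (0.0, 1.0)]
K_poly = gram_matrix(points, polynomial_kernel)
K_rbf = gram_matrix(points, gaussian_kernel)
```

The Gaussian kernel equals 1 for identical points and decays towards 0 as points move apart, with σ controlling how quickly the similarity falls off.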

2.6 Support Vector Regression

SVM applied to a regression problem is termed Support Vector Regression (SVR). It uses the same principles as SVM for classification, with a few minor changes. The output of an SVR is a real number and can thus take infinitely many values. Consider a set of training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where each x_i ∈ R^p is a p-dimensional real vector and y_i ∈ R. The aim is to find a function f(x) such that

f (x) = w · x + b, (2.19)

Figure 2.5: Support Vector Regression [41] : The figure represents the SVR regression problem. Red dots are the actual values and the blue line is the predicted function. The difference between predicted and actual values should lie in the range (−ε, ε), and the outliers are assigned a corresponding slack value.

where w is the normal vector to the hyperplane and the parameter $b / \|w\|$ defines the offset of the hyperplane from the origin along the normal vector w. The function f(x) approximates the y value. Similar to SVM, the objective function for SVR is written as

$\min_{b, w} \; \dfrac{1}{2} \|w\|^2 + C \sum_i (\xi_i + \xi_i^{*})$ (2.20a)

s.t. $y_i - (w \cdot x_i + b) \leq \epsilon + \xi_i$ (2.20b)

$y_i - (w \cdot x_i + b) \geq -\epsilon - \xi_i^{*}$ (2.20c)

$\xi_i, \xi_i^{*} \geq 0$, (2.20d)

where C is the error penalty factor and ε defines a margin of tolerance within which no penalty is assigned to errors. Slack variables ξ_i and ξ_i^* are introduced to accommodate a soft margin SVR [42]. Figure 2.5 represents a typical SVR formulation. The difference between y_i and the fitted function should be smaller than ε and larger than −ε. In other words, all points y_i should lie in the ε-box shown in the figure.

A slack variable (penalty) ξ is assigned to each of the outliers. A linear regression yields a straight line, as depicted in the figure.

For cases where the relationship between input and output is not linear, the kernel trick is used as in SVM classification. The kernel function transforms a non-linear regression problem into a linear one by projecting the original feature space into a higher dimensional feature space. More about SVR can be found in [42].
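The role of ε and of the slack variables can be illustrated numerically. In the Python sketch below, the slack of each sample is set to the smallest value that satisfies constraints (2.20b) to (2.20d) for a given w and b, which is max(0, |y_i − f(x_i)| − ε). The data set and parameter values are made up for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Smallest feasible slack: zero inside the epsilon margin, linear outside."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_objective(w, b, X, y, C, epsilon):
    """Value of objective (2.20a) for fixed w and b with minimal slacks."""
    predictions = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]
    slack = sum(
        epsilon_insensitive_loss(target, prediction, epsilon)
        for target, prediction in zip(y, predictions)
    )
    return 0.5 * sum(wi * wi for wi in w) + C * slack


# Made-up one-dimensional data; the last point is an outlier.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 0.9, 2.05, 4.0]
value = svr_objective(w=[1.0], b=0.0, X=X, y=y, C=10.0, epsilon=0.2)
```

The first three points lie within ε of the line f(x) = x and contribute nothing; only the outlier, which misses the margin by 0.8, is penalized through C.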

2.7 Artificial Neural Network

Artificial Neural Network (ANN) [43] is an ML algorithm inspired by the biological neural network of the human brain. It consists of multiple nodes which imitate the neurons of the brain. In simple words, an ANN is a network of artificial neurons. A single artificial neuron can be seen as a mathematical function which computes a weighted sum of one or more inputs and passes it through a non-linear function, known as the activation function. A neuron is illustrated in Figure 2.6.

Figure 2.6: Neuron : A neuron in ANN takes input and performs weighted sum of the input and then the sum is passed through an activation function to give an output.

Figure 2.7: Neural Network : The figure illustrates a three layer Neural Network with four nodes in input layer, five nodes in hidden layer and two output nodes.

Any ANN has an input layer, one or more hidden layers and an output layer. For example, Figure 2.7 shows a three layer neural network in which the input layer has four input nodes, each connected to all five neurons in the hidden layer. Each node in the hidden layer computes its output and passes it on to all the nodes in the output layer. Finally, the nodes in the output layer perform their computation on the inputs from the hidden layer nodes to give two outputs.

For a typical ANN, let {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))} be the m labeled training examples. Let y ∈ R^K, i.e., the labeled data have K classes. Let L be the total number of layers in the network and s_l the number of units in layer l. If the hypothesis function is represented by h_θ(x), then

$h_\theta(x) \in \mathbb{R}^K$. (2.21)

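The cost function is then given by

$J(\theta) = -\dfrac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} \left( y_k^{(i)} \log (h_\theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log\left(1 - (h_\theta(x^{(i)}))_k\right) \right) \right] + \dfrac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\theta_{ji}^{(l)}\right)^2$, (2.22)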

where θ^(l) is the weight matrix that connects layer l to layer l + 1. This cost function needs to be minimized over θ^(l) for all l using gradient descent optimization. The gradient of the cost function is calculated with the back-propagation algorithm, which is discussed in detail in [43].
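The forward computation of such a network can be sketched as follows in Python. The example mirrors the 4-5-2 layout of Figure 2.7 but uses arbitrary constant weights, a sigmoid activation and an explicit bias per neuron, all of which are assumptions for illustration. It also evaluates the unregularized part of the cost in Equation (2.22) for a single training example; training by back-propagation is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, biases, inputs):
    """Each neuron: weighted sum of the inputs plus a bias, then the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def network_forward(layers, inputs):
    """Propagate the inputs through each (weights, biases) pair in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(weights, biases, activations)
    return activations


def sample_cost(outputs, targets):
    """Cross-entropy term of Equation (2.22) for one example, no regularization."""
    return -sum(
        t * math.log(o) + (1 - t) * math.log(1 - o) for o, t in zip(outputs, targets)
    )


# Arbitrary weights for a network with 4 inputs, 5 hidden units and 2 outputs.
hidden = ([[0.1, -0.2, 0.3, 0.05]] * 5, [0.0] * 5)
output = ([[0.2] * 5, [-0.2] * 5], [0.1, -0.1])
prediction = network_forward([hidden, output], [1.0, 0.5, -1.0, 2.0])
loss = sample_cost(prediction, [1, 0])
```

Each entry of `prediction` lies between 0 and 1 and plays the role of one component of h_θ(x) in Equation (2.21).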

Figure 2.8: Random Forest: The figure [45] illustrates steps involved in the Random Forest algorithm.

2.8 Random Forest

Random Forest (RF) [44] is a tree-based supervised learning algorithm. It is an ensemble of decision trees whose individual answers are aggregated, which addresses the drawbacks of a single decision tree. Randomness is introduced by two factors [45]: a) Selection of a random training set for every base learner: a bootstrap set is chosen for each base learner, where each base learner is a decision tree. b) Random feature selection: each node of a decision tree splits on a feature chosen from a random subset of the features of the training set, and the tree grows to the next level. The RF algorithm is summarized in Figure 2.8.

For classification with RF, each of the decision trees classifies the item, and the class with the majority of votes becomes the final output of the Random Forest algorithm. For regression, the average of all the tree outputs is taken as the final prediction of the RF.
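The two sources of randomness and the two aggregation rules can be sketched independently of any particular tree implementation. The Python helpers below are hypothetical and only illustrate the mechanics: the decision trees themselves are not implemented, and the tree outputs passed to the aggregation functions are made up.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the training set of one tree."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, subset_size, rng):
    """Indices of the candidate features considered for one split."""
    return sorted(rng.sample(range(n_features), subset_size))


def majority_vote(tree_predictions):
    """Classification: the class predicted by most trees wins."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Regression: the forest output is the mean of the tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)
# Made-up rows of the form (feature_0, feature_1, label).
rows = [(0.1, 0.7, 0), (0.4, 0.2, 1), (0.9, 0.5, 1), (0.3, 0.8, 0)]
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=2, subset_size=1, rng=rng)
label = majority_vote(["leak", "no leak", "leak"])
estimate = average_prediction([1.0, 2.0, 3.0])
```

Because every tree sees a different bootstrap sample and different candidate features, the trees make partly independent errors, which the vote or the average then smooths out.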

2.9 Bias-Variance Trade-off

The prediction error of any machine learning algorithm can be divided into three parts [46]:

1. Bias Error

2. Variance Error

3. Irreducible Error

Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. Low bias is desirable, as high bias generally corresponds to underfitting and, in turn, to lower predictive performance on the data.

Variance is the amount by which the estimate of the target function will change if different training data are used. It corresponds to the stability of the model in response to new training examples. The target function is estimated from the training data by a machine learning algorithm, so it is expected that the algorithm will have some variance. Ideally, if the algorithm is able to capture the underlying mapping between the input and output variables, then the learned function would not vary much from one training dataset to another.

Figure 2.9: Bias Variance [47] : The figure represents a graphical illustration of bias and variance. The red circle corresponds to the least error. Low bias and low variance is desired for best performance.

Graphical illustration of bias and variance is depicted in Figure 2.9. It is always desirable to have low bias and low variance in the model. However, there is usually a trade-off between the two. A learning algorithm should be flexible (low bias) to fit the data well. However, if it becomes too flexible such that it fits every training set differently, then it is said to be suffering from high variance.

The irreducible error, as the name suggests, cannot be reduced regardless of what algorithm is used. It is the error introduced by the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.

2.10 Classification Accuracy

Accuracy of a model is measured as the number of correct predictions made by the model out of all predictions. However, for a binary classification problem, measuring only accuracy in this manner is not sufficient. When the data is unbalanced, meaning that many more elements belong to one class than to the other, classification accuracy is of little use and is often misleading.

For a binary classification, the occurrence of the phenomenon of interest is labeled as the positive class and its absence as the negative class. A clean and unambiguous way of presenting the result of a binary classification is the confusion matrix, shown in Figure 2.10. Each row of the matrix represents instances in a predicted class, while each column represents instances in an actual class.

Figure 2.10: Confusion Matrix : The figure represents a confusion matrix. Columns represent the actual class whereas rows represent the predicted class.

Based on the confusion matrix, two more parameters known as Precision and Recall are defined, which are frequently used in the literature. Precision is the fraction of cases labeled positive by the classifier that are actually positive, whereas Recall is the fraction of actual positives that the classifier predicts as positive. Recall is also known as True Positive Rate (TPR) or Sensitivity, depending on the context. With TP, FP and FN denoting the number of true positives, false positives and false negatives, this is summarized in the following equations.

$\text{Precision} = \dfrac{TP}{TP + FP}$ (2.23a)

$\text{Recall} = \dfrac{TP}{TP + FN}$. (2.23b)

For an ideal classification model, both false positives and false negatives should be zero. In other words, precision and recall should both equal 1. In practice, however, there is usually a trade-off between the two: improving precision tends to reduce recall and vice versa.

For a scenario where either precision or recall is of prime importance, it is convenient to use that one as the metric for model comparison. However, in scenarios where precision and recall are equally important, a different measure is needed which combines both into one single number. The F-score, discussed next, serves as one such numerical measure for binary classification.

2.10.1 F-score

To test the accuracy of the classification algorithm numerically, the F1 score or F-score is used. It is the harmonic mean of precision and recall, as given in Equation 2.24. The factor 2 scales the score to 1 when both precision and recall are 1. For a fixed sum of precision and recall, the F-score is largest when the two are equal.

$F_1 = 2 \cdot \dfrac{1}{\dfrac{1}{\text{recall}} + \dfrac{1}{\text{precision}}} = 2 \cdot \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$ (2.24)
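A short Python sketch shows how these measures are obtained from predictions and why accuracy alone can mislead on unbalanced data. The labels below are made up: 2 leak cases among 10 samples, with 1 denoting the positive class.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Equations (2.23a), (2.23b) and (2.24), returning 0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up unbalanced data: 2 positives among 10 samples.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / len(actual)
```

In this example the accuracy is 0.8 although only one of the two leaks is found and one of the two alarms is false, which the F-score of 0.5 reflects.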

2.10.2 ROC Curve

The receiver operating characteristic (ROC) curve is the most common way of visualizing the performance of a binary classification algorithm. It plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. As described in Section 2.10, TPR is another name for Recall. The false positive rate (FPR), or false alarm rate, is the proportion of all negatives that yield positive test outcomes.

The ROC curve is used to determine the efficacy of the binary classification [48]. The area under the ROC curve (AUC) is used for comparing different models: the best classifier has the highest AUC. Figure 2.11 depicts an example of a ROC curve. A random guess results in a diagonal line with AUC equal to 0.5. The higher the curve, the better the classification model, with the best fit occurring when the ROC curve passes through or close to the point (0, 1), i.e., when the AUC approaches 1.
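The construction of the curve and of the AUC can be sketched as follows in Python. Each distinct score is used once as the decision threshold, and the area is computed with the trapezoidal rule. The labels and scores are made up, and the sketch assumes that both classes are present in the data.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold one score at a time."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for label, s in zip(labels, scores) if s >= threshold and label == 1)
        fp = sum(1 for label, s in zip(labels, scores) if s >= threshold and label == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the piecewise linear ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up classifier scores for four negatives (0) and four positives (1).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
auc = area_under_curve(roc_points(labels, scores))
```

The resulting AUC can be read as the fraction of positive-negative pairs in which the positive sample receives the higher score.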

Figure 2.11: ROC Curve [49] : The figure represents a typical ROC curve. The red line represents a random guess and the best performance model is when the ROC curve passes through or close to (0, 1).

Figure 2.12: One versus rest [51]: The multi-class classification problem is split into three individual binary classification problems.

2.11 Multi-class Classification

One way to address a multi-class classification problem is the One-vs-all (One-versus-the-rest) [50] technique. A binary classification technique is applied to each class separately: a single classifier is trained per class, treating the samples of that class as positive and all other samples as negative. This requires the base classifier to produce a real-valued confidence score for its decision, instead of just a class label. Concretely, a binary classifier is first trained to give a hypothesis h_θ^(i)(x) for each class i, which predicts the probability that y = i. To make a prediction on a new input x, the class i that maximizes h_θ^(i)(x) is chosen.

As illustrated in Figure 2.12, a three-class classification problem is addressed by splitting it into three binary classification problems and training them separately to obtain three hypothesis functions.
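The prediction step of the one-versus-rest scheme reduces to an argmax over the per-class confidences. The Python sketch below assumes three already-trained logistic classifiers; the parameter vectors and class names are invented for illustration and are not results of this thesis.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def one_vs_rest_predict(classifiers, x):
    """Return the class whose binary classifier reports the highest confidence."""
    confidences = {
        label: sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for label, theta in classifiers.items()
    }
    return max(confidences, key=confidences.get), confidences


# Invented parameters, one vector per class, for inputs [1, feature_1, feature_2].
classifiers = {
    "zone A": [0.0, 2.0, -1.0],
    "zone B": [0.0, -1.0, 2.0],
    "zone C": [1.0, -1.0, -1.0],
}
label, confidences = one_vs_rest_predict(classifiers, [1.0, 0.2, 1.5])
```

Note that the three confidences come from independently trained classifiers and therefore need not sum to 1; only their ranking is used.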

2.12 Ensemble Learning

Ensemble learning is a method in which various ML models are combined intelligently to solve a computational intelligence problem. The combined result is typically more accurate than each of the individual results. Various ensembling techniques are available in the literature, such as a) Bagging [52], b) Boosting [53], c) Stacking [54], and d) Plurality Voting [55]. In this thesis, the following procedure is applied to combine the results from different algorithms.

This combination technique is applied in the leakage detection part of the thesis. In this setup, the pre-processed time series data is fed to each of the algorithms. Each learned algorithm returns its result along with a probability, and the result with the highest probability is declared the final outcome of the multi-class classification problem.
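The selection rule described above amounts to taking the most confident model output. A minimal Python sketch, with invented model names, labels and probabilities, is:

```python
def combine_by_confidence(model_outputs):
    """Pick the (model, label, probability) triple with the highest probability."""
    return max(model_outputs, key=lambda output: output[2])


# Invented outputs of three trained models for one pre-processed sample.
outputs = [
    ("logistic regression", "leak in zone 3", 0.62),
    ("support vector machine", "leak in zone 3", 0.71),
    ("random forest", "no leak", 0.55),
]
model, label, probability = combine_by_confidence(outputs)
```

This rule presumes that the probabilities reported by the different models are comparable with each other, which is an assumption of the sketch rather than something it verifies.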

Chapter 3

Solution Approach

The primary focus of this chapter is to present the steps involved in the solution of the problem of automatic detection, localization and estimation of leakage size in water distribution network using data analytics. At first, in Section 3.1, sensor placement and leak zone identification problem is addressed.

Subsequently, training data generation methodology is mentioned in detail. Following this, different ML algorithms applied on training data are discussed. Finally, Section 3.7 describes the accuracy calculation and comparison of the ML methods.

3.1 Leak Zone Identification

This section deals with the sensor node placement and leak zone identification problem. As a reminder from Section 2.2, a WDN can be viewed as a connected graph [32] G(V, E), where V is the set of junction nodes in the water distribution system and E is the set of edges representing pipes between the junctions. To understand the method completely, it is necessary to define a few terms which are used often in the discussion to follow.

Definition 2 (Adjacent vertex (Neighbor)) Given a graph G(V, E), for any vertex v_i ∈ V, its adjacent vertex/neighbor is a vertex v_j ∈ V such that there exists an edge e_i ∈ E that connects node v_i to node v_j in the graph. The set of neighbors of a node is represented by N(v_i).

Figure 3.1: An example of Adjacent Vertex/Neighbor : In this undirected connected graph, each of the red colored vertex, i.e, node 1, 3, 5 & 7 are adjacent vertex of node 2.

This is demonstrated in Figure 3.1. Here, vertex 2 is connected to vertices 1, 3, 5 and 7 through an edge. Hence each of the nodes 1, 3, 5 and 7 is an adjacent vertex of node 2.

Definition 3 (Nearest Neighbor) Consider an undirected weighted graph G(V, E, W), where V is the set of vertices, E is the set of edges connecting the vertices and w_ij ∈ W is the length of the edge ⟨v_i, v_j⟩. For any vertex v_i ∈ V with neighbor set N(v_i), a vertex v_j ∈ N(v_i) is the nearest neighbor of v_i if and only if w_ij ≤ w_ik ∀ v_k ∈ N(v_i).

Figure 3.2: An example of nearest neighbor : In this weighted graph, for Node 2, Node 1 is the nearest neighbor with least weight of 120.

For example, an undirected connected graph is shown in Figure 3.2. In this graph, the set of adjacent vertices of node 2 is {1, 3, 5, 7}, and the corresponding set of edges between node 2 and its adjacent vertices is {2-1, 2-3, 2-5, 2-7}, with weights {120, 300, 450, 300}. Clearly, the nearest neighbor of node 2 is node 1, because edge 2-1 carries the minimum weight among these edges.
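Definition 3 corresponds to a minimum over the edge weights of a node. The Python sketch below reproduces the example for node 2 using the weights listed above; the dictionary layout and the tie-breaking by smallest node label are assumptions of the sketch.

```python
def nearest_neighbor(weights, node):
    """Neighbor joined to `node` by the shortest edge (Definition 3)."""
    return min(sorted(weights[node]), key=lambda neighbor: weights[node][neighbor])


# Edge lengths around node 2 as listed for Figure 3.2; other edges are omitted.
weights = {2: {1: 120, 3: 300, 5: 450, 7: 300}}
closest = nearest_neighbor(weights, 2)
```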

The main intuition behind defining leak zones in this thesis is to place sensors in the network in such a way that every non-sensor node lies at one hop distance from a sensor node. In other words, from the perspective of graph theory, the sensors are placed such that their locations form a dominating set of the graph.

The following steps make up the algorithm, which takes a WDN as input and provides the number of sensor nodes needed, their locations and the leakage zone of each sensor node. The steps are general in nature and apply to any WDN. However, for better understanding, they are explained for the Net2 example of EPANET.

1. Modeling of the network as an undirected weighted graph G = (V, E, W), where V is the set of junction nodes in the water distribution system, E is the set of edges representing connecting pipes between the vertices and w_ij ∈ W is the length of edge ⟨v_i, v_j⟩.

2. Calculation of the nearest neighbor of every node (steps 1 to 4 in the pseudo-code of Algorithm 1).

3. Identification of nodes which are the 'nearest neighbor' of more than one node; these are included in the initial list of sensor nodes (step 5 of the pseudo-code). Figure 3.3b shows the initial list of sensor nodes for the EPANET example, denoted by the vertices in red.

4. Identification and addition of left-out nodes (nodes which are not neighbors of any of the sensor nodes found in step 3) to the list. For our running example, these are shown in Figure 3.3b by the vertices in yellow. This corresponds to steps 6 to 8 of the pseudo-code.

Algorithm 1 Sensor Node and Leak Zone algorithm
Input: Graph G = (V, E, W)
Output: Sensor node set S and mapping information of non-sensor nodes M.
1: for l = 1 to |V| do
2:   Choose element v_l ∈ V
3:   Find its nearest neighbor and append it to a vector n.
4: end for
5: Count the frequency of occurrence of each element of vector n; store every element with count ≥ 2 in set A.
6: Find the neighbors of each element of A and store them in set P.
7: Construct the set of left-out nodes L, such that L = V ∩ P^c.
8: Construct a new set S, such that S = L ∪ A.
9: for i = 1 to |S| do
10:   For an element s_i ∈ S, find the neighbors of the elements in S − {s_i} and include them in a set R.
11:   if s_i ∈ R then
12:     if all neighbors of s_i ∈ R then
13:       Remove s_i from set S
14:     end if
15:   else
16:     Increment i
17:   end if
18: end for
19: Map the elements of set V ∩ S^c to elements of set S and store the mapping in set M.
20: return S, M

5. Identification and removal of redundant sensor nodes from the list: a sensor node is termed redundant when it and all its neighbors appear in the neighbor lists of other sensor nodes. These redundant nodes are removed from the list to give the final list of sensor nodes. Steps 9 to 18 of the pseudo-code perform this action. Figure 3.4a shows the redundant sensor nodes marked explicitly in our running example.

6. Mapping of each non-sensor node to the nearest connected sensor node, which defines the leak zone of that sensor node (step 19 of the pseudo-code).

Figure 3.4b shows the final output of the algorithm. It results in a total of 14 sensor nodes in our chosen example. The demarcated area around each sensor node in the figure is its corresponding leak zone. The next section uses this information to obtain hydraulic data for different parameter settings; the obtained data is then pre-processed into a format suitable for applying ML algorithms.
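One possible reading of the six steps above is sketched below in Python. It is not the implementation used in this thesis, and it rests on several assumptions: the graph is connected and given as a dictionary of edge lengths, ties are broken by the smallest node label, the initial and left-out nodes are combined by set union, and redundant sensors are examined one at a time in sorted order. The six-node network is invented and is not Net2.

```python
from collections import Counter


def place_sensors(weights):
    """weights[u][v] is the pipe length between adjacent junctions u and v."""
    nodes = sorted(weights)
    neighbors = {u: set(weights[u]) for u in nodes}

    # Nodes that are the nearest neighbor of at least two nodes.
    nearest = [min(sorted(weights[u]), key=weights[u].get) for u in nodes]
    initial = {v for v, count in Counter(nearest).items() if count >= 2}

    # Add the left-out nodes that no initial sensor is adjacent to.
    covered = set().union(*(neighbors[a] for a in initial))
    sensors = initial | (set(nodes) - covered)

    # Drop a sensor if it and all its neighbors are adjacent to other sensors.
    for s in sorted(sensors):
        others = sensors - {s}
        reach = set().union(*(neighbors[o] for o in others))
        if s in reach and neighbors[s] <= reach:
            sensors = others

    # Map every non-sensor node to its closest adjacent sensor (its leak zone).
    mapping = {
        u: min(sorted(neighbors[u] & sensors), key=weights[u].get)
        for u in nodes if u not in sensors
    }
    return sensors, mapping


# Invented network: node -> {adjacent node: pipe length}.
network = {
    1: {2: 100},
    2: {1: 100, 3: 200, 5: 400},
    3: {2: 200, 4: 150},
    4: {3: 150, 5: 100},
    5: {4: 100, 6: 300, 2: 400},
    6: {5: 300},
}
sensors, zones = place_sensors(network)
```

In this sketch every non-sensor node ends up adjacent to at least one sensor, so the returned sensor set is a dominating set of the graph and each entry of `zones` is well defined.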

(a) Net2 Layout

(b) Initial nodes & Left-out nodes

Figure 3.3: Intermediate result of node placement : a) The EPANET Net2 example, chosen in this thesis to illustrate the steps of the method. b) The red circles are initial nodes and the yellow circles are left-out nodes for the chosen example.

(a) The encircled nodes are redundant nodes

(b) Identified leakage zone with sensor placement

Figure 3.4: Sensor Node Placement : a) The redundant nodes are identified and encircled in the figure. b) The final output of the sensor placement algorithm.

3.2 Training Data

This section describes the steps involved in generating a large amount of training data for supervised ML algorithms. Generating training data involves two major components.

At first raw data is obtained from the junction nodes where the sensor nodes are placed. In the second step, the obtained data is preprocessed before it is fed to the data analytics tool.

A substantial amount of training data is required to attain a satisfactory level of accuracy from the data analysis algorithms. This training data is obtained using the EPANET toolkit [56] for MATLAB, an open-source software that provides a programming interface for EPANET within the MATLAB framework. It provides easy-to-use wrappers/commands for viewing, modifying, simulating and plotting the results produced by the EPANET libraries. In our formulation, a very low level of leakage is considered to be the case of no leakage in the system. The leakage is introduced in the network one node at a time.

The following steps are involved in generating training data. The meaning of the notations used in the Algorithm 2 is provided in the Table 3.1.

Algorithm 2 Generating Training data
Input: WDN, S, M, n_j, n_s, ec_initial, ec_step, ec_final, ecn_initial, ecn_step, ecn_final, l, sa, st, and sd.
Output: Training data matrix X for supervised ML algorithms.
1: Calculate n_nl = n_j (ecn_final − ecn_initial)/ecn_step.
2: Calculate n_l = n_j (ec_final − ec_initial)/ec_step. Thus, t_c = n_l + n_nl.
3: Calculate the size of feature vector f obtained for a single leak case as p = (sd/st) + 1.
4: Calculate the size of training matrix X. Number of rows = t_c. Number of columns = n_s × (l + 1).
5: Initialize training matrix X with all zeros.
6: for v = 1 to n_j do
7:   i = ecn_initial
8:   while i ≤ ecn_final do
9:     Perform hydraulic simulation using EPANET and store the time-series pressure data from the sensor nodes into a pressure matrix P of size n_s × p.
10:     Initialize vector r of size 1 × (n_s(l + 1)) with all zeros.
11:     for k = 1 to n_s do
12:       Perform the wavelet series decomposition of level 3 with wavelet Daubechies 2 (db2) on row k of matrix P.
13:       Calculate the energy of the time-series data distributed in the approximation and detail coefficients, using Equations (3.1) and (3.2) respectively.
14:       Calculate the ratio of the energy in the approximation and detail coefficients with respect to the total energy.
15:       Update the vector r.
16:     end for
17:     Update the training matrix X with the elements of vector r.
18:     i = i + ecn_step.
19:   end while
20:   Choose the next node v.
21: end for
22: Repeat steps 6 to 21 for the leakage case.
23: Prepare the y label.
24: Update the y label information as the last column of training matrix X.
25: return X

1. For our selected example, as mentioned in step 1 of pseudo-code in Algorithm 2, number of junction
