
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datavetenskap

2019 | LIU-IDA/LITH-EX-A--2019/096--SE

General-purpose maintenance planning using deep reinforcement learning and Monte Carlo tree search

Generell underhållsplanering genom Deep Reinforcement Learning och Monte Carlo Tree Search

Viktor Holmgren

Supervisor: Olov Andersson
Examiner: Patrick Doherty


Upphovsrätt (Copyright)

This document is made available on the Internet - or its future replacement - for a period of 25 years from the date of publication, provided that no exceptional circumstances arise.

Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Subsequent transfers of copyright cannot revoke this permission. All other use of the document requires the consent of the copyright holder. To guarantee authenticity, security, and accessibility, technical and administrative measures are in place.

The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or integrity.

For additional information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet ‐ or its possible replacement ‐ for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Maintenance planning and execution is increasingly important for the modern industrial sector. Maintenance costs can amount to a major part of industrial spending. However, it is not as simple as just reducing maintenance budgets. A balance must be struck between risking unplanned downtime and the costs of maintenance efforts, in order to keep the profit margins needed to compete in the global markets of today. One approach to improving the effectiveness of industries is to apply intelligent maintenance planners. In this thesis, a general-purpose maintenance planner based on Monte-Carlo tree search and deep reinforcement learning is presented. This planner was evaluated and compared against two different periodic planners as well as the oracle lower bound on four different maintenance scenarios. These four scenarios are all based on servicing wind turbines. All scenarios include imperfect maintenance actions, as well as uncertainty in terms of the outcomes of maintenance actions. Furthermore, the four scenarios include both single- and multi-component variants. The evaluation showed that the proposed method outperforms both periodic planners in three of the four scenarios, with the fourth being inconclusive. These results indicate that the maintenance planner introduced in this thesis is a viable method, at least for these types of maintenance problems. However, further research is needed on the topic of maintenance planning under uncertainty. More specifically, the viability of the proposed method on a more diverse set of maintenance problems needs to be studied to draw any clear general conclusions. Finally, the possible improvements to the training process that are discussed in this thesis should be investigated.


Acknowledgments

First I would like to thank my thesis supervisor Olov Andersson at the division of Artificial Intelligence and Integrated Computer Systems (AIICS) at Linköping University. He has been of tremendous help in guiding this thesis work, specifically regarding which modifications and adaptations were needed to make the method presented in this thesis work. I would also like to thank Prof. Doherty at the same division for examining this thesis, as well as my supervisor at Attentec, Tommy Ellqvist.

Finally, I must express my gratitude to my parents and to my girlfriend for providing me with support and encouragement throughout my years of study and the process of researching and writing this thesis. This would not have been possible without them. Thank you.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Aim
  1.2 Research questions
  1.3 Delimitations
2 Theory
  2.1 Maintenance planning
  2.2 Markov decision processes
  2.3 Reinforcement learning
  2.4 Monte-Carlo Tree Search
  2.5 Deep Reinforcement Learning
3 Method
  3.1 Simulation Models
  3.2 Monte-Carlo Tree Search on Stochastic Problems
  3.3 Comparative planning models
  3.4 Implementation details
  3.5 Evaluation
4 Results
  4.1 Single-component scenario
  4.2 Multi-component scenario
5 Discussion
  5.1 Results
  5.2 Method
  5.3 The work in a wider context
6 Conclusion


List of Figures

2.1 Time-to-failure distributions
2.2 An outline of a single MCTS iteration for a small toy example. Each node stores two values in the form q/n, where q is the number of 'wins' and n is the visit count.
3.1 Failure-condition distribution for different wind-turbine components (rotor, bearing, gearbox and generator)
3.2 Tree structure for multiple outcomes. Shows that the value of a given action is calculated by a weighted average of the known outcomes. Each node contains a state value Q and a visit count N, illustrated as (Q/N)
3.3 Simplified MCTS tree (no discount or multiple outcomes) showing how node values are assigned using max-backup
3.4 Self-play training process for incrementally improving DNN performance
4.1 Cost density plot for single-component scenarios for different planning models. Estimation of the empirical probability density over costs for different planning models. The density estimation is achieved by use of a Gaussian kernel estimate, more specifically the default density function in R.
4.2 Cost density plot for single-component scenarios for different planning models. Estimation of the empirical probability density over costs for different planning models. The density estimation is achieved by use of a Gaussian kernel estimate, more specifically the default density function in R.
4.3 Performance testing during training (every tenth episode); mean cost per episode with standard deviation bands for single-component MCTS-DNN
4.4 Mean absolute training and validation error for single-component DNN learning
4.5 Cost density plot for multi-component scenarios for different planning models. Estimation of the empirical probability density over costs for different planning models. The density estimation is achieved by use of a Gaussian kernel estimate, more specifically the default density function in R.
4.6 Cost density plot for multi-component scenarios for different planning models. Estimation of the empirical probability density over costs for different planning models. The density estimation is achieved by use of a Gaussian kernel estimate, more specifically the default density function in R.
4.7 Performance testing during training (every tenth episode); mean cost per episode with standard deviation bands for multi-component MCTS-DNN
4.8 Mean absolute training and validation error for multi-component DNN learning


List of Tables

3.1 Failure distributions with mean time to failure estimates (MTTF), and cost parameters (in $k)
3.2 Periodicities for different components and periodic strategies
4.1 Simulation results for single-component scenario
4.2 Simulation results for single-component scenario with turbine age
4.3 Simulation results for different selections of planning horizon parameters with MCTS. Results are for the single-component scenario
4.4 Simulation results for different selections of planning horizon parameters with MCTS. Results are for the single-component scenario with turbine age
4.5 Simulation results for multi-component scenario
4.6 Simulation results for multi-component scenario with turbine age
4.7 Simulation results for different selections of planning horizon parameters with MCTS. Results are for the multi-component scenario
4.8 Simulation results for different selections of planning horizon parameters with MCTS. Results are for the multi-component scenario with turbine age


1 Introduction

Ever since the industrial revolution, society has become increasingly reliant on machines. The area in which this is perhaps most true is the industrial sector itself. In Sweden, as in most OECD¹ countries, labor costs are very high. To stay competitive in the global markets of today, Swedish industry needs to rely heavily on machines and technology rather than human labor. It is therefore paramount that these machines are kept running with as little downtime as possible. However, it is not as simple as increasing maintenance budgets, since it is already estimated that between 15 and 40 percent of production costs are due to maintenance [14]. Furthermore, there are several examples of excessive levels of maintenance; for instance, the petroleum industry spends about $1 trillion replacing good equipment each year [24]. It therefore seems plausible that profits could be increased without sacrificing availability, by using more intelligent maintenance efforts.

Industrie 4.0², which is sometimes referred to as the fourth industrial revolution, is an effort by the German government to computerize its industries. It involves increasing automation, large-scale data collection and analysis, as well as machine-to-machine communication [15]. This move towards big-data-enabled industries is not limited to Germany; for instance, the European Union is also creating similar initiatives³. There is a major opportunity to leverage the data that is being collected to, for instance, learn predictive failure models: models which, given some component or machine status information, can estimate the probability of failure and be used to improve maintenance planning. This approach of integrating predictive models as part of maintenance planning is one major focus in the condition-based maintenance (CBM) research community. However, most current CBM models are not very general in the types of stochastic problems they can be applied to, but are instead custom-made solutions to specific problem domains [2]. This tailor-made approach not only results in a high initial adoption cost but also makes them rather inflexible to new demands or changes to the underlying maintenance problem. Moreover, many existing CBM models cannot handle very common and important properties of real-world systems, such as multi-component systems and imperfect maintenance actions. Ideally, there would exist a more general maintenance planner which leverages predictive failure models, without the need for any specific domain knowledge.

¹ The Organisation for Economic Co-operation and Development is a collection of well-developed nations: http://www.oecd.org/about/members-and-partners/

² The German government's high-tech strategy for the future of manufacturing, see https://www.bmbf.de/de/zukunftsprojekt-industrie-4-0-848.html

³ H2020 CREMA is an EU project that provides Cloud-based Rapid Elastic Manufacturing for digitizing

In this thesis, we will attempt to construct and evaluate a general artificial intelligence method for maintenance planning under uncertainty. More specifically, a method based on deep reinforcement learning (DeepRL) and tree search methods that can handle multi-component systems and imperfect maintenance actions. This combination is very interesting in light of its recent successes. Most notably Silver, Huang, Maddison, et al. [32] used this combination of methods to create their AlphaGo Zero agent which achieved superhuman performance in the game of Go. Maintenance planning under uncertainty is quite obviously a different problem than playing Go and other board games. However, there are enough similarities to warrant exploring the possibility of adapting their method for maintenance planning.

1.1 Aim

This thesis work is part of a greater research project called SIMON⁴, which aims to research artificial intelligence (AI) methods for increasing the capacity, competitiveness and performance of Swedish industrial companies. More specifically, this work is part of Attentec's contribution to this collaborative project. The focus of this work is on intelligently planning maintenance actions such that the associated costs are minimized. This could, for instance, include downtime costs, transport costs for maintenance personnel, and replacement parts.

Attentec is an IT-consulting firm in Mjärdevi, Linköping, specializing in streaming media, embedded systems and Internet of Things (IoT). AI maintenance planning is of interest for Attentec mainly in the IoT field, since there will be many different types of devices and problems being monitored, and thus domain-specific approaches are of little interest.

1.2 Research questions

• How does a Monte-Carlo Tree Search approach compare in terms of performance to a periodic maintenance schedule, and how close can it get to an ideal oracle lower bound when planning preventative maintenance actions?

• Can a deep neural network for value approximation be used to improve the performance of Monte-Carlo Tree Search for planning preventative maintenance actions?

1.3 Delimitations

Since the main contribution of this thesis is developing a general AI approximation to maintenance planning, and most related work makes use of problem-tailored solutions, we will limit the performance comparisons to simpler, similarly general approaches. This decision was made to limit the scope of the thesis work. Also, this thesis only covers performance comparisons on relatively simple scenarios; it should be noted, however, that the method could easily be adapted for more complex problems.

Finally, we do not consider the problem of constructing the failure probability models for the different components and scenarios. The assumption is that there exists a probabilistic predictive model (for failure probability) that is estimated from data in some way.

⁴ SIMON (New Application of AI for Services in Maintenance towards a Circular Economy) is a Vinnova-funded research project.

2 Theory

In this chapter the underlying theory and related work used in this thesis are detailed. More specifically, the major topics covered are maintenance planning and its different approaches, Markov decision processes, Monte-Carlo tree search, reinforcement learning and deep reinforcement learning.

2.1 Maintenance planning

Maintenance can be said to be the act of inspecting, repairing or replacing machines or their components to continue operations. There are many different approaches to maintenance; however, they can generally be categorized into one of two major categories: corrective maintenance (CM) and preventive maintenance (PM) [13]. According to the US Department of Defence [36], CM is defined as "all the actions performed as a result of failure to restore an item to a specified condition" and PM as "all actions performed in an attempt to retain an item in a specified condition by providing systematic inspection, detection, and prevention of incipient failures". It can therefore be argued that CM is the most straightforward approach to maintenance since it only considers performing maintenance actions on equipment after failure or malfunction has occurred. For some applications this can be sufficient, but in general this strategy incurs very high levels of unforeseen downtime as well as maintenance costs due to failures [35]. Instead of reacting to component or machine failure, PM attempts to correct small issues before they develop into larger defects, by performing proactive maintenance activities [22]. PM is a very extensive category in itself, with many different types of strategies. There are, for instance, strategies centered around human experience and original equipment manufacturer (OEM) recommendations [1]. However, for this thesis work we are mainly interested in data-driven approaches, more specifically time-based maintenance (TBM) and condition-based maintenance. A more in-depth discussion on these two maintenance topics is found later in this section.

In order to construct performant maintenance planners, and to analyze the reliability of machines and components in general, a degradation model, or time-to-failure model, can prove very useful [23]. These time-to-failure models typically take the form of a single probability distribution or a combination of several distributions. Moubray [27] discusses several different kinds of time-to-failure models and their history. According to the author, historically all models had a so-called wear-out shape, i.e. that the probability of failure increases non-linearly with the age of the component. Since then, more models have been created, partially due to the realization that excessive maintenance efforts either achieve nothing or are even counter-productive in terms of extending component life-time. Many newer models therefore take this into consideration by incorporating additional failure probability at the beginning of the life-span to account for installation errors and other similar issues. The five most popular models according to the author are bathtub, infant mortality, wear-out, random (uniform), and fatigue, all of which can be seen in Figure 2.1. All these models can be constructed using one or more Weibull distributions. Originally presented by Weibull [40] in 1951, the Weibull distribution is very important in the maintenance research community. It uses two parameters, θ and β, which are referred to as the scale and shape parameters. The scale parameter is rather straightforward: it simply stretches the distribution, i.e. scales it. In the context of maintenance, the shape parameter controls the aging process of the component/machine.

β < 1 results in a decreasing failure probability (infant mortality)
β = 1 results in a constant failure probability (random)
β > 1 results in an increasing failure probability (wear-out)

For the more 'complex' models, i.e. bathtub and fatigue, a combination of several Weibull distributions can be used to construct them. Using a Weibull distribution, the mean time to failure (MTTF) is given by θΓ(1 + 1/β), where Γ(x) is the gamma function.
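As a small illustration of the MTTF formula above, the sketch below computes θΓ(1 + 1/β) and draws a single failure time from the corresponding Weibull distribution. The helper name is ours, and the example parameters are taken from the bearing row of Table 3.1:

from math import gamma
import random

def weibull_mttf(theta: float, beta: float) -> float:
    # Mean time to failure of a Weibull(theta, beta) component: theta * Gamma(1 + 1/beta).
    return theta * gamma(1.0 + 1.0 / beta)

theta, beta = 10.5, 2.0                      # bearing parameters from Table 3.1
print(weibull_mttf(theta, beta))             # ~9.3
print(random.weibullvariate(theta, beta))    # one sampled time to failure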

As mentioned in the delimitations, the problem of constructing these failure probability models is not the focus of this thesis work. Suffice it to say that such a model can be created by modern machine learning methods using data gathered on similar components and machines over time. The main focus in this thesis is on intelligently planning maintenance actions such that the overall costs are minimized.

Figure 2.1: Time-to-failure distributions (density p(t) over time for the bathtub, infant mortality, wear-out, random and fatigue models)

Time-based maintenance

Time-based maintenance (TBM) aims to solve maintenance problems by constructing a schedule of preventative maintenance actions beforehand. The main issue in TBM is finding a policy which produces a schedule requiring minimal resource usage, in terms of maintenance personnel, replacement parts, downtime, etc., while still keeping the risk of failure low enough. Wang [39] reviews the state of PM methods, including two main TBM methods: periodic and sequential. Using a periodic maintenance policy, a fixed time interval cT is used to perform preventive maintenance, where T is the interval length and c ∈ Z⁺ counts the maintenance occasions. If a failure occurs between maintenance slots, the necessary corrective maintenance is performed and the schedule is shifted accordingly. The sequential policy is very similar to the periodic policy but does not use fixed time intervals. Instead, the time between maintenance efforts gets shorter and shorter as the machine ages. This policy is based on the assumption that maintenance frequency must increase as the unit ages. Several modifications can be applied to these TBM policies as well, for instance if imperfect maintenance actions should be used.
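As a rough illustration of the periodic TBM policy described above (this is not the exact periodic planner evaluated later in the thesis; all names are illustrative), the following sketch generates a fixed-interval PM schedule and shifts it when a corrective repair occurs:

def periodic_schedule(T: int, horizon: int, start: int = 0):
    # Times c*T (c = 1, 2, ...) at which preventive maintenance is performed.
    return [t for t in range(start + T, horizon + 1, T)]

def shift_after_failure(failure_time: int, T: int, horizon: int):
    # After a failure, corrective maintenance is done immediately and the
    # periodic schedule restarts relative to that repair.
    return periodic_schedule(T, horizon, start=failure_time)

print(periodic_schedule(T=5, horizon=30))                     # [5, 10, 15, 20, 25, 30]
print(shift_after_failure(failure_time=7, T=5, horizon=30))   # [12, 17, 22, 27]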

Condition-based maintenance

Condition-based maintenance (CBM) is an approach to planning preventive maintenance actions in which the condition of the system is monitored and used as a basis for decision making [2]. This differs from more conventional TBM strategies where the aim is to find an optimal maintenance schedule, disregarding the current system status [18]. According to Jardine, Lin, and Banjevic [18], CBM consists of three main phases:

• Data acquisition: collecting system data,

• Data processing: process the collected data by selecting and transforming it to improve interpretability,

• Maintenance decision making: construct an efficient preventative maintenance effort policy based on the data.

The authors divide CBM models into two categories: diagnostics and prognostics. In diagnostic CBM, the main task for the maintenance system is to detect whether the system that is being monitored is behaving abnormally, while prognostic CBM deals with predicting system failure or abnormality before it occurs. In some sense, these two types complement each other, since prognostics can be used to formulate a plan, with diagnostics being used to detect if any mistakes have been made. In the case of only using diagnostics or prognostics, the authors claim that prognostic CBM systems are much more efficient, since they are proactive, catching faults before they occur rather than reacting to them. This is very important since unplanned downtime is very expensive in many applications.

In regards to prognostics, the authors categorize models into two main types depending on whether remaining useful life (RUL) estimates or predictive failure probability models are used. A RUL estimate is typically defined as a conditional random variable:

T - t \mid T > t, Z_1, Z_2, \dots, Z_t

where T represents the random variable of time to failure, t is the current time, and Z_i is the condition of the system at time point i. According to the authors, RUL models need to know the fault mechanism and the fault propagation process. Typically, a trending or forecasting model is used for the propagation process, and the faults are assumed to occur once the condition of the component/machine reaches a predetermined level. Conversely, predictive failure probability models try to estimate the probability that the system operates without fault until the next interval. The authors argue firstly that the area of prognostics is much smaller compared to diagnostic CBM, and secondly that most prognostics papers only involve RUL estimation. This leaves the area of predictive failure probability CBM very much unexplored, and suggests that the approach used for this thesis, i.e. relying on a predictive failure model, is a rather unexplored possibility in terms of planning maintenance.

System properties in CBM

Alaswad and Xiang [2] review the current state of CBM research and the types of system properties that existing models can handle. According to the authors, most available methods only regard single-component systems, i.e. systems considered to consist of a single part, with actions reflecting that. With a multi-component system there are multiple components, which can be maintained independently. This results in a much more complex problem, since a decision needs to be taken for each component. Hence, the CBM models need to either produce a RUL estimate for each component or use a failure probability model which manages multiple components. Moreover, most multi-component CBM models currently only handle multiple identical independent components, which according to the authors can be a major oversimplification. They argue that more research needs to be done for dependent multi-component systems with varying component types.

Another property which is seldom handled by existing methods, according to the authors, is imperfect maintenance actions. Do, Voisin, Levrat, et al. [11] define perfect maintenance actions as actions which restore the system to an 'as good as new' state. Imperfect maintenance actions, on the other hand, are actions which do not fully restore the system but are typically cheaper. In their work they also assume that imperfect maintenance actions in practice accelerate system deterioration.

Both of these properties, multi-component systems and imperfect maintenance actions, are therefore a focus for the performance evaluation of our approximative maintenance planner. For more details on the scenarios used for performance evaluation, see section 3.1; in short, the scenarios will incorporate multiple components of varying kinds, albeit independent, as well as imperfect maintenance actions.

2.2 Markov decision processes

A Markov Decision Process (MDP) is a mathematical model developed in the early 1960s that can be used for modeling decision making in discrete time and stochastic outcome problems [5, 16]. MDPs have proven very useful for many applications, not least in reinforcement learning. An MDP contains the following components:

• States: A set of individual states s ∈ S, where each state can be a vector of several variables.

• Actions: A set of actions that can be performed; the actions available are limited by the current state, a ∈ A(s). A state with no available actions is defined as a terminal state.

• Transition probabilities: p(s′ | s, a) is the probability of reaching state s′ from s using action a.

• Rewards: r(s, a, s′) is the reward received by taking action a from state s and reaching s′.

• Reward discount rate: γ ∈ [0, 1], which discounts future rewards. This factor essentially controls the short- vs. far-sightedness trade-off, i.e. a low γ makes greedy actions more rewarding whereas a high γ puts a higher importance on future rewards.

From the perspective of a reinforcement learning agent, which will be detailed further in section 2.3, solving an MDP problem is finding the optimal policy π(a | s), which defines the probability of selecting action a in state s, such that the cumulative discounted reward in Equation 2.1 is maximized.

G_t = \sum_{h=0}^{\infty} \gamma^h R_{t+h+1} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots    (2.1)
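To read Equation 2.1 concretely: given a (finite prefix of a) reward sequence R_{t+1}, R_{t+2}, ..., the return is simply the γ-weighted sum of the rewards. A minimal sketch, with illustrative numbers:

def discounted_return(rewards, gamma):
    # G_t = sum_h gamma**h * R_{t+h+1} for a finite reward sequence (Equation 2.1).
    return sum((gamma ** h) * r for h, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71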


2.3 Reinforcement learning

Sutton, Barto, et al. [34] laid the foundation for what we now call reinforcement learning (RL), which studies the use of agents learning through experience. By observing the state of the environment, and the rewards received by performing different actions, the agent can form a policy for which sequence of actions leads to the highest cumulative reward. The environment is commonly described using an MDP, but unlike dynamic programming approaches, which can also be used to solve MDPs, RL is more focused on using approximations than finding a guaranteed optimal solution. Moreover, many RL methods do not assume exact knowledge of the underlying MDP, they are so-called model-free. These properties can make it much more feasible for large and complex MDPs.

Since the agent is unaware of what states are reachable through which actions, and of the associated rewards, it needs to explore the state-space essentially by trial-and-error. Since transitions can be stochastic, it needs a large body of experience to estimate the value of different states. This problem of learning through experience is complicated further since rewards can be delayed, i.e. the reward at time-step t might be highly dependent on the actions taken at t − 1, t − 2, .... Moreover, exploring the state-space comes at a cost. Either the agent spends time taking lesser-known actions, or it takes actions which previously have yielded a high reward. Not exploring lesser-known actions reduces the possibility of improvement, but not exploiting actions that are known to provide high rewards will severely reduce the cumulative reward for the agent. This balancing issue is commonly referred to as the exploration-exploitation dilemma, and is very important in RL as a whole.

There are several different RL techniques, both in regards to how to explore the search space and how to control the learning of the optimal policy [38]. One important characteristic of RL methods is whether they use on-policy or off-policy exploration. On-policy RL methods estimate and improve the same policy. Conversely, in an off-policy method, the agent is exploring using one policy but may evaluate an entirely different policy. These two policies are referred to as the behavior and target policies.

2.4 Monte-Carlo Tree Search

One of the most promising planning methods put forward in recent years is Monte-Carlo tree search (MCTS). Coined by Coulom [8], MCTS is a heuristic tree search algorithm which uses a sampling-based approach to exploring the search space. It has been applied successfully to several different problems; one of the most popular applications is playing different types of board games. Silver, Hubert, Schrittwieser, et al. [33] used a method based on MCTS to achieve super-human performance in Chess, Shogi (Japanese chess), as well as the very complex game of Go. It has also been used for games with imperfect information and non-determinism; for instance, Rubin and Watson [29] applied an MCTS-based method for playing poker. MCTS shares many aspects with RL methods but is not clearly recognized as part of the RL family of algorithms, although the connection has been suggested [31]. The algorithm is grounded in two main assumptions regarding the problem that is being solved: first, that the true value of any action can be estimated using a sampling-based approach, and second, that this estimated value can be used to construct an efficient policy [7]. The second assumption is also required by many RL algorithms.

MCTS works by incrementally building a search tree of nodes, containing state or state-action estimations. This search tree is built by selecting state-actions, according to an exploratory action-selection policy, and observing their outcomes using a form of simulation model, for instance, an MDP. This distinguishes MCTS from most RL algorithms since it does not solve the modeling problem while constructing its policy, instead relying on an external model. Constructing this external simulation model is, as previously stated, out of scope for this thesis work.


Browne, Powley, Whitehouse, et al. [7] argue that the reason for the popularity of MCTS is its aheuristic, anytime and asymmetric properties. One of the most important characteristics of MCTS is that it requires little to no domain knowledge. Formally, this property is known as aheuristic, and results in an algorithm that can be applied to almost any domain. Nevertheless, heuristics can be incorporated into MCTS to improve its performance. The anytime property refers to the fact that MCTS progressively improves its solution with every iteration performed; at any point it contains an up-to-date solution and can thus be aborted at any time and still provide a valid policy. As a consequence, MCTS can, in theory, produce plans from a given state which are arbitrarily close to the optimal given enough compute time. In contrast, typical RL methods try to find decision policies for all possible states, which relies on additional approximations, see section 2.5, unless all states can be enumerated in a table. Finally, the search trees produced by MCTS are typically asymmetric. Since the search algorithm samples promising moves more frequently, nodes are added to the tree in an unbalanced way, i.e. towards more interesting regions of the state space. By not forcing any type of balanced search, MCTS can be very efficient.

Each iteration in MCTS typically consists of four phases; for an example of a single MCTS iteration, see Figure 2.2:

1. Selection: Starting at the root node, a selection policy, sometimes also referred to as a tree policy, is used to recursively select promising child nodes, until an expandable or leaf node is reached. This selection policy needs to strike a balance between exploring the search space and exploiting existing moves to improve the state/state-action estimate, i.e. the exploration vs. exploitation dilemma. Perhaps the most successful and popular policy to use is called UCB1. The combination of UCB1 inside MCTS produces an algorithm called Upper Confidence Bounds for Trees (UCT).

2. Expansion: Once a node has been selected, one or more child nodes are expanded from this node and added to the tree. Each of these new nodes corresponds to some action that is available at the selected node.

3. Simulation: From the newly expanded node(s), a simulation is performed, also referred to as a rollout or playout, until a terminal node is reached. The actions taken during this simulation step are decided by a default policy. There are two main types of simulation policies: light, in which random actions are taken, and heavy, in which actions are selected according to some heuristic.

4. Backpropagation: The result of the simulation is backpropagated through the current search tree to update the node values.

Figure 2.2: An outline of a single MCTS iteration (selection, expansion, simulation, backpropagation) for a small toy example. Each node stores two values in the form q/n, where q is the number of 'wins' and n is the visit count.

It should be noted that MCTS can be constructed to be either an off-policy or an on-policy algorithm depending on how the backpropagation phase works [38]. Using an on-policy approach, each node in the search tree stores the average outcome, and the exact feedback values are used during the backpropagation phase. In contrast, in an off-policy MCTS algorithm, an entirely different value is backed up, for instance the maximum value of all children.

Selecting an optimal action to take in a given state s_0 using MCTS can be seen in Algorithm 1, where N is the number of iterations that should be used, o_k is the newly expanded node, s_k is the state in that node, δ is the feedback from the simulation phase and, finally, a_max(o_0) represents the optimal action given the root node of the search tree. Selecting the optimal action based on the search tree can be done in numerous different ways, for instance [30]:

1. max: select the root action with the highest reward,
2. robust: select the most visited action,
3. secure: select the action which maximizes the lower confidence bound.

Algorithm 1 Monte-Carlo tree search

1: procedure MCTS(s_0, N)            ▷ Run MCTS with starting state s_0 using N iterations
2:   o_0 ← create_root(s_0)
3:   while i < N do
4:     o_k ← tree_policy(o_0)        ▷ Run the tree policy starting from the root
5:     δ ← default_policy(s_k)
6:     propagate(o_k, δ)
7:   end while
8:   return a_max(o_0)               ▷ The optimal action from the root node in the tree
9: end procedure

Upper Confidence Bounds for Trees

Kocsis and Szepesvári [20] were the first to apply UCB1 to MCTS, thus producing what is known as Upper Confidence Bounds for Trees; however, the definition of UCT has arguably changed since then [38]. UCB1 was first put forward by Auer, Cesa-Bianchi, and Fischer [4] as one approach to solve multi-armed bandit problems. The optimal child node to select according to UCB1 can be seen in Equation 2.2, where q_i is the value of child i, n is the visit count for the parent node, n_i is the visit count for child i and c is an exploration constant. It is assumed that when n_i = 0, UCB1 should result in ∞ for that child. This ensures that all child nodes are expanded before any grandchildren, forming a type of iterative local search. The first term is essentially the exploitation value, i.e. how promising this child is, whereas the second term corresponds to an exploration reward since it shrinks as that child gets visited more. Apart from the selection policy, UCB1, UCT is very similar to the standard MCTS algorithm presented previously. It uses the same four phases of selection, expansion, simulation, and backpropagation; there are however some important details which should be described [38]. Firstly, in the UCT search tree each node stores two major types of information: n, the number of times this node has been visited, and q, which is the average future reward going forward from this node. Second, the backpropagation returns the sum of all rewards in the simulation phase, which makes it an on-policy MCTS algorithm. Finally, it does not consider the discounting of rewards which is commonly used in MDP problems.

\arg\max_i \; q_i + 2c\sqrt{\frac{2\ln n}{n_i}}    (2.2)
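To make the UCB1 rule in Equation 2.2 concrete, the sketch below scores each child and returns the best one; the child statistics and the exploration constant c are illustrative, and the values q_i are assumed to already be normalized to [0, 1]:

import math

def ucb1_select(children, n_parent, c=1.0):
    # Pick the child maximizing q_i + 2c*sqrt(2*ln(n)/n_i); unvisited children come first.
    def score(child):
        q_i, n_i = child["q"], child["n"]
        if n_i == 0:
            return float("inf")   # expand every child once before going deeper
        return q_i + 2.0 * c * math.sqrt(2.0 * math.log(n_parent) / n_i)
    return max(children, key=score)

children = [{"q": 0.5, "n": 4}, {"q": 0.66, "n": 3}, {"q": 0.0, "n": 0}]
print(ucb1_select(children, n_parent=7))   # the unvisited child is selected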


Non-episodic

The MCTS and UCT algorithms above assume that the game or problem being solved has some terminal node that can be reached. This is, however, not always the case; for instance, most planning problems are non-episodic problems where rewards are potentially given at any node, not just terminal nodes. Without any terminal nodes, the search tree could grow to infinite depth and the simulation phase would simply never terminate. Limiting the search tree growth can be achieved by assigning some planning horizon after which all new nodes are considered terminal. This ensures that the simulation phase in each iteration will reach some artificial terminal node and therefore stop in finite time. Another approach to mitigate this issue is to use some other type of future reward estimator. For the observant reader, this does, however, reintroduce the very issue which MCTS was meant to solve in the first place, i.e. the issue of constructing an adequate estimator. A successful solution to adding an estimator to MCTS has been to learn that estimator from data, for example by using the search trees produced by MCTS to train a deep neural network (DNN); this approach results in a form of deep reinforcement learning [32, 26, 25].

A non-episodic problem also makes using UCT non-trivial, since traversing a given path down the tree will most likely result in multiple non-zero rewards. Since UCB1 assumes that each node value is q ∈ [0, 1], some normalization is needed. In general, it can be said that node values can either be normalized globally, i.e. normalized with regard to the maximum and minimum node values in the entire search tree, or locally, i.e. normalization is only done between children during the UCB1 selection phase.

Non-determinism

Having originated in traditional board games, standard MCTS was not developed for games or problems with any kind of non-determinism regarding the rewards received, or the states reached, by taking a particular action. However, in more complex games, such as video games, and in the planning problems we are interested in, there is often stochastic behavior.

Browne, Powley, Whitehouse, et al. [7] describe several different approaches to adapting MCTS for non-deterministic problems. One approach is to use determinization, which involves fixing all outcomes of the problem using different random seeds. This results in a collection of deterministic problems which can be solved using standard MCTS or UCT; the solutions are then combined to evaluate decisions in the actual stochastic problem [4]. This approach of combining several solutions into one better solution shares many similarities with ensemble learning in supervised learning, which uses several classifiers in combination to produce better results [9].

The other general approach is to incorporate the stochastic outcomes directly into the MCTS search tree. Bjarnason, Fern, and Tadepalli [6] discuss several different variants of MCTS for heavily stochastic problems, more specifically Klondike Solitaire. According to the authors, Klondike Solitaire is a long-horizon (minimum of 52 time-steps), high-branching-factor and stochastic game. Firstly, they discuss a combination of UCT and hindsight optimization (HOP) which is very similar to determinization as described above. By solving several perfect-information deterministic instances of Klondike with UCT, the authors can use HOP to obtain an upper bound on the expected reward which can then be used for selecting between moves. Second, the authors describe an alternative procedure called Sparse-UCT in which a single tree node can have several child nodes from the same action. Moves are selected in the same way as with UCT; however, the traversal during the selection policy is stochastic, as is the expansion of new nodes. Finally, they discuss an ensemble version of Sparse-UCT in which the optimal action is selected based on the average expected reward over several independent trees. Sparse-UCT also utilizes a sample width, w, to limit how many children a given tree node can have. Once a node has w children it is considered fully expanded, thus putting an upper limit on the branching factor, although some potential actions will never be explored. This ensures that each tree can be constructed much more quickly since fewer nodes need to be expanded, which of course comes at a performance cost.

2.5 Deep Reinforcement Learning

Deep reinforcement learning (DeepRL) is a relatively new field of machine learning; Li [21] defines it as the combination of traditional RL techniques and deep learning. Unlike normal RL, DeepRL uses a DNN to approximate one or more parts of the RL process, typically the state-value function, the state-action-value function or the policy. The motivation for using a DNN in place of, for instance, tabular or tree structures for learning in traditional RL is that the DNN improves scalability and generalization, such that previously unsolvable problems can be tackled [3]. It has, for instance, been a very important part of tackling the curse of dimensionality, a major issue for traditional RL in large state-space problems [21].

One of the first DeepRL methods was Deep Q-Networks (DQN), which uses a DNN to approximate the state-action-value function; as its name suggests, it is a deep learning variant of traditional Q-learning [21]. This algorithm has been successful in multiple applications, perhaps most notably by Mnih, Kavukcuoglu, Silver, et al. [26], who used it to achieve professional human-level performance in several Atari 2600 games. In their approach, they only used the raw pixels on the screen as well as the game score to train their agent, indicating that DeepRL can achieve great performance with no domain knowledge. Since DQN was first presented in 2015, several adaptations and extensions have been suggested; Van Hasselt, Guez, and Silver [37], for instance, suggested Double DQN, which attempts to deal with the tendency of DQN, and Q-learning, to over-estimate the state-action values.

AlphaGo Zero is another DeepRL method, presented by Silver, Huang, Maddison, et al. [32], which uses a combination of MCTS and a convolutional neural network (CNN) to achieve super-human performance in the game of Go. Their method works by iteratively running batches of self-play games of Go to produce training data for the CNN, improving its performance over time. Their CNN is a combined value and policy network which uses the raw board positions to produce move probabilities and the win probability from this point. According to the authors, this combined network approach reduces the over-fitting of the model and thus yields better performance. In each turn during a self-play game, they perform 1600 iterations of MCTS, or more specifically of a modified version of UCT. The main differences are twofold. Firstly, they do not use any rollouts/simulations after expanding a new node; instead they use an approximation from the CNN to initialize the node value. Second, they use a variant of the UCT algorithm, called PUCT, as the selection policy, seen in Equation 2.3, where P(s, a) is the move probabilities estimated using the CNN. PUCT is very similar to UCB1, although in their case they are also using the move probabilities as a form of prior for selecting more promising moves. Once this search tree has been built, the most promising move is selected at the root node. This process is repeated until a terminal state is reached and a final game reward is observed. For every move made in this self-play game a training data point on the form (s_t, π_t, z_t) is created (the reward z = ±1 depending on which player won). The CNN parameters θ_i are retrained to firstly minimize the error between the observed game winner z and the predicted value v, and secondly to maximize the similarity between the predicted move probabilities P(s, a) and the search probabilities π. Finally, they evaluate each checkpoint during the training process against the network which currently has the best performance. This evaluation is done by running many competitive games between the new network f_{θ_i} and the currently best performing network. If the new network outperforms the previous best, then it will be used for the next round of self-play games.

\arg\max_a \; Q(s, a) + c\, P(s, a)\, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}    (2.3)
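A small sketch of a PUCT-style score as in Equation 2.3, where the network prior P(s, a) biases exploration towards moves the policy network already considers promising; the statistics and the constant c are illustrative:

import math

def puct_select(actions, c=1.0):
    # argmax_a Q(s,a) + c * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))
    total_visits = sum(a["N"] for a in actions)
    def score(a):
        return a["Q"] + c * a["P"] * math.sqrt(total_visits) / (1.0 + a["N"])
    return max(actions, key=score)

actions = [{"Q": 0.2, "P": 0.6, "N": 10}, {"Q": 0.3, "P": 0.1, "N": 2}]
print(puct_select(actions))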


3 Method

In this chapter, the method used to answer the research questions is presented. This includes the simulation model concept, the modifications made to MCTS, and the evaluation framework used to gather the results.

3.1 Simulation Models

A simulation model will be used to abstract away from the specifics of the problem at hand, and instead provide an environment in which different agents, maintenance planners in this case, can act and be evaluated. At any point in time, the simulation model will report the current state, the reward for being in that state and the valid actions that can be taken. Taking a specific action forwards the simulation model, updating the state, reward, and available actions. Note that in this thesis work this transition is stochastic, i.e. different states are reachable from the same initial state and action. In this thesis, the focus will be on maintenance of a single machine consisting of one or more components. The abstract state for this machine is represented by a vector of continuous variables, although it could easily be generalized to other data types. Moreover, the simulation model contains a mechanism describing how a given component degrades over time, and when it should fail. This could, for instance, be as simple as a hard threshold on component age, or a very complex stochastic failure probability model dependent on multiple variables and previous actions. In practice, this degradation process can, for instance, be represented by a RUL estimate, or a predictive failure model that predicts the probability of failure. Of course, simulation models can also use condition variables as part of the state, but it is not required. For more information on RUL estimates and predictive failure models, see section 2.1. Finally, the different types of actions that will be considered are listed below, followed by a small sketch of the interface:

Pass: do nothing; typically does not impart any cost. Only allowed for a working component.

Preventative Maintenance (PM): perform maintenance on a working component, improving its condition.

Corrective Maintenance (CM): perform maintenance on a non-working component; this typically results in a higher cost than preventative maintenance.
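A minimal sketch of this simulation-model interface (the class and method names are our own, not the thesis implementation); a planner only ever observes the state, the reward and the currently valid actions:

from abc import ABC, abstractmethod

PASS, PM, CM = "pass", "preventive", "corrective"

class SimulationModel(ABC):
    # Abstract environment: hides the degradation/failure internals from the planner.

    @abstractmethod
    def state(self) -> list:
        ...   # vector of continuous condition variables

    @abstractmethod
    def valid_actions(self) -> list:
        ...   # e.g. [PASS, PM] for a working component, [CM] after a failure

    @abstractmethod
    def step(self, action) -> tuple:
        ...   # stochastic transition; returns (next_state, reward)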


Scenarios

In this thesis the case of wind turbine maintenance will be used; more specifically, four scenarios (simulation models) will be considered, based on two variables: single/multi-component, and turbine age. This particular choice of maintenance case was primarily motivated by the reasonably large body of available and recent research. It should however be noted that the proposed method is not in any way limited to planning maintenance for wind turbines. The simulation models are based on wind turbine characteristics, in terms of costs, deterioration, and types of components. More specifically, they are based on work by Ding and Tian [10]. Wu and Zhao [41] argue that the gearbox is the most critical part of the wind turbine. It therefore seems like a reasonable component to select in a single-component simplification of the original multi-component scenario. Both single-component scenarios are identical to their respective multi-component variants, except that only the gearbox is considered.

The multi-component scenarios consider four components: rotor, bearing, gearbox and generator. Each component k is associated with a continuous condition variable c_k, and a failure-condition threshold c_{k-max} which is sampled from an increasing Weibull distribution (β > 1), i.e. a wear-out model, which was discussed in section 2.1. It is important to note that this failure-condition threshold is a hidden variable, only known by the simulation model, as is the distribution used. The scenarios with turbine age included will, in addition to the per-component condition variable, contain a turbine-age variable, c_turbine.

At each time-step, all variables, including c_turbine, get incremented, and in the turbine-age variants c_{k-max} gets scaled by a factor of T_{horizon}^{0.5}. After all variables are updated, the planner needs to take a binary maintenance decision (Pass, PM) per component. If a given component has failed, i.e. c_k > c_{k-max}, then the planner is forced to perform CM for that component. Performing PM on a component reduces its condition according to c_k = c_k(1 − q), where 0 < q ≤ 1 is the ratio of repair. Furthermore, the failure-condition threshold is partially re-sampled according to c_{k-max} = q c_{k-d} + (1 − q) c_{k-max}, where c_{k-d} is a random sample from the failure-condition distribution for component k. The ratio of repair q is a user-selected parameter; in this thesis a value of 0.6 will be used, i.e. the condition reduction through PM is 60%. Performing CM on a component implies that the component is replaced, which involves resetting c_k = 0 and re-sampling c_{k-max} entirely from its associated failure-condition distribution. If the scenario in question is of the turbine-age variant, all values drawn from the failure-condition distribution are scaled by 1 − 0.5 · c_turbine / T_{horizon} before being used to update c_k. This turbine-age modification progressively increases the failure rate as the simulation progresses, ending with a scaling factor of 0.5.

The cost of the actions performed in a single time-step consists of a static and a dynamic part. The static cost represents the cost of sending a maintenance team and accessing the wind turbine, and is therefore not dependent on the number of components maintained (given that at least one PM or CM is performed). The dynamic costs associated with performing the different actions are listed below, followed by a small cost-computation sketch:

• Pass: 0

• PM: q² · C_pv + C_pf, where C_pv is the variable preventive replacement cost and C_pf is the fixed preventive maintenance cost

• CM: C_f + C_pf, where C_f is the failure replacement cost
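A minimal sketch of the per-time-step cost under the structure above (names are illustrative; the example uses the gearbox costs and the static cost from Table 3.1):

def step_cost(actions_per_component, costs, q=0.6, c_static=57.0):
    # Total cost of one time-step: static call-out cost plus per-action dynamic costs (in $k).
    total = 0.0
    any_maintenance = False
    for component, action in actions_per_component.items():
        c_f, c_pv, c_pf = costs[component]   # failure, variable preventive, fixed preventive
        if action == "PM":
            total += q ** 2 * c_pv + c_pf
            any_maintenance = True
        elif action == "CM":
            total += c_f + c_pf
            any_maintenance = True
    if any_maintenance:
        total += c_static                    # one maintenance-team visit per time-step
    return total

costs = {"gearbox": (152.0, 38.0, 40.0)}
print(step_cost({"gearbox": "PM"}, costs))   # 0.36*38 + 40 + 57 = 110.68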

The parameters for the failure-condition distributions and maintenance costs are available in Table 3.1, and a density plot of the failure-condition distributions is shown in Figure 3.1.

Table 3.1: Failure distributions with mean time to failure estimates (MTTF), and cost parameters (in $k)

Component | α    | β | MTTF | Cf  | Cpv | Cpf | Cstatic
Rotor     | 8.5  | 3 | 7.37 | 112 | 28  | 40  | 57
Bearing   | 10.5 | 2 | 9.30 | 60  | 15  | 40  |
Gearbox   | 6.5  | 3 | 6.03 | 152 | 38  | 40  |
Generator | 9.25 | 2 | 8.20 | 100 | 25  | 40  |

Figure 3.1: Failure-condition distribution for different wind-turbine components (rotor, bearing, gearbox and generator)

3.2 Monte-Carlo Tree Search on Stochastic Problems

The main work of this thesis is the adaptation of traditional MCTS to handle non-episodic and stochastic problems, more specifically maintenance planning under uncertainty. This modified MCTS still works in very much the same way as traditional MCTS in terms of its four phases: selection, expansion, simulation, and backpropagation. However, all phases have been modified to handle maintenance planning for stochastic problems.

Before we delve into the details of the modified algorithm, two main modifications to the tree structure need to be noted. Due to the stochastic outcomes of actions, the search tree built by MCTS is different than in standard MCTS, or even UCT for that matter. For most applications to which MCTS has been applied, such as Go, Chess, etc., the outcome of a given action is deterministic; for instance, moving a chess piece always produces the same outcome. Since this is not the case for planning under uncertainty, where there is a probability of failure at each point, each action from a given node in the tree now maps to one or more outcomes (states), which are not known beforehand. In addition, rewards are not only given at terminal nodes or limited to [0, 1]. As a result, normalization is needed to use most selection policies, for instance UCB1. However, storing the future cumulative reward and applying global normalization, i.e. the straightforward approach, is very problematic since the cumulative reward is depth-dependent for the problems of interest in this thesis. This means that a longer path in the search tree generally has a larger cumulative reward than a shorter path, since the cost of maintenance is monotonically increasing with time. Using global normalization on this type of tree results in extremely small values for nodes at a large depth, which has a profound effect on the exploration vs. exploitation mechanism. To combat this issue, the entire path values are stored at each node contained in the path; see Figure 3.3 for a simplified example. More specifically, each node stores the maximum cumulative discounted reward of any path passing through that node. Consequently, the node values become independent of depth, and global normalization can be used without issue.

The pseudo-code for this modified MCTS can be seen in Algorithm 2. Every node o, also referred to as an outcome, is associated with three values: the associated state s_o, the state value q_o, and the visit count n_o. Each iteration starts with executing the selection policy, which aims to find a promising new node to add to the search tree. This involves recursively selecting the most promising action at each step, using the provided simulation model to simulate the outcome of the action, until a previously unseen outcome is observed. It is important to note that this can happen even for actions which have previously been used, since the outcomes are stochastic. Since an action maps to one or more outcomes (stochastically), the normal selection policies discussed, i.e. UCB1 and PUCT, cannot be used directly. All outcomes for a given action must be considered when evaluating how promising it is. Therefore, the selection process is based on the expected action value ∑_i q_i p(s_i | s, a). To calculate the expected action value, the transition probability p(s_i | s, a) must be known for all reachable states s_i from s in order to produce the expectation. But since the underlying probability model is not known, our proposed MCTS variant uses the visit count of each outcome to approximate the transition probability. The resulting selection policy can be seen in Equation 3.1, where n_o is the visit count for outcome o, q_o is the state value of the given outcome, n is the visit count of the current state, and (b_l, b_u) are the lower and upper bounds of all node state values in the tree, which are used for normalization. An example of how this is evaluated can be seen in Figure 3.2, where it can be observed that the PASS action is evaluated as a weighted average between its two observed outcomes.

After a new outcome has been observed, a new node is expanded and added to the search tree, with visit count $n_o = 0$ and the associated state $s_o$ being the outcome state. The state value $q_o$ is initialized by simulation as in normal MCTS, either through rollouts or through value approximation using a DNN. Rollouts are performed in the same way as in traditional aheuristic MCTS, i.e. by taking random actions until some terminal node is reached. Since there are no real terminal nodes in the cases of interest in this thesis, rollouts are instead performed until the planning horizon is reached. At each step, a random action is taken and the associated discounted reward is recorded. The newly expanded node's state value is then set according to $q_o = \sum_{i=d}^{H} \gamma^i r_{s_i}$, where $d$ is the depth of the expanded node, $H$ is the planning horizon and $r_{s_i}$ is the reward observed at step $i$. Conversely, when the value-approximating network $f_\theta$ is used, the state value is initialized according to $q_o = f_\theta(s_o)$.
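As an illustration of the rollout-based initialization, the following sketch accumulates the discounted return from the expanded node's depth up to the planning horizon. The SimModel trait and all names are assumptions made for this example, not the thesis code.

    // Illustrative rollout estimate of q_o = sum_{i=d}^{H} gamma^i * r_{s_i}:
    // uniformly random actions are taken from the expanded node's state until
    // the planning horizon H is reached.
    trait SimModel {
        type State;
        type Action;

        // Sample a uniformly random action available in state `s`.
        fn random_action(&self, s: &Self::State) -> Self::Action;

        // Sample a (possibly stochastic) outcome state and its immediate reward.
        fn step(&self, s: &Self::State, a: &Self::Action) -> (Self::State, f64);
    }

    fn rollout_value<M: SimModel>(
        model: &M,
        mut state: M::State,
        depth: usize,   // d: depth of the newly expanded node
        horizon: usize, // H: planning horizon
        gamma: f64,     // discount factor
    ) -> f64 {
        let mut q = 0.0;
        for i in depth..=horizon {
            let action = model.random_action(&state);
            let (next_state, reward) = model.step(&state, &action);
            q += gamma.powi(i as i32) * reward; // gamma^i * r_{s_i}
            state = next_state;
        }
        q
    }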

Finally, after the new node has been initialized, the backpropagation phase starts. This involves updating the visit counts of all nodes passed through during the selection phase, as well as updating their values. Each node's state value $q_o$ is set to the maximum cumulative discounted reward passing through that node, i.e. a form of max-backup. Lastly, the lower and upper bounds of the node state values, $b_l$ and $b_u$, are updated so that normalization can be performed in the next iteration.

\[
\arg\max_{a \in A(s)} \; \frac{\sum_{o} n_o \,(q_o - b_l)/(b_u - b_l)}{\sum_{o} n_o} \;+\; 2c\,\frac{\sqrt{n}}{1 + \sum_{o} n_o} \tag{3.1}
\]
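A sketch of how the score in Equation 3.1 can be computed for a single action is given below, assuming each action keeps the list of outcomes observed so far; the struct and all names are illustrative only.

    // Illustrative evaluation of Equation 3.1 for one action: the visit-weighted,
    // normalized expected outcome value plus an exploration bonus. The visit
    // counts act as an approximation of the unknown transition probabilities
    // p(s_i | s, a). All names are assumptions for this sketch.
    struct Outcome {
        q: f64, // outcome state value q_o
        n: f64, // outcome visit count n_o
    }

    fn action_score(
        outcomes: &[Outcome], // outcomes observed for this action
        n_parent: f64,        // n: visit count of the current state
        c: f64,               // exploration constant
        b_l: f64,             // global lower bound on node values
        b_u: f64,             // global upper bound on node values
    ) -> f64 {
        let total_n: f64 = outcomes.iter().map(|o| o.n).sum();
        // First term: normalized expected action value, weighted by visit counts.
        let expected: f64 = outcomes
            .iter()
            .map(|o| o.n * (o.q - b_l) / (b_u - b_l))
            .sum::<f64>()
            / total_n.max(1.0);
        // Second term: exploration bonus that shrinks as the action is tried more.
        let exploration = 2.0 * c * n_parent.sqrt() / (1.0 + total_n);
        expected + exploration
    }

    // The selection step then takes the arg max of action_score over all a in A(s).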

Value approximation

As previously mentioned, MCTS with rollouts can be adapted to stochastic problems; however, the rollouts introduce a major exploration problem due to the variance in their outcomes. In the deterministic case, a rollout produces an upper bound on the cost in that sub-tree,


Algorithm 2 Monte-Carlo tree search for stochastic maintenance planning

procedure MCTS(s0, N)                          ▷ Run MCTS from starting state s0 using N iterations
    o0 ← create_root(s0)                       ▷ q ← 0 and n ← 0
    bl ← ∞, bu ← −∞                            ▷ Create normalization bounds
    while i < N do
        ok ← selection_policy(o0, bl, bu)
        δ ← default_policy(sk)                 ▷ Rollout (or DNN evaluation) of the new node's state
        propagate(ok, δ)
        bl ← min(q0, q1, . . .)                ▷ Update normalization bounds over all tree nodes
        bu ← max(q0, q1, . . .)
    end while
    return amax(o0)                            ▷ The optimal action from the root node of the tree
end procedure

procedure selection_policy(o, bl, bu)
    while o is non-terminal do
        if o not fully expanded then
            return expand(o)
        else
            a ← select(o, bl, bu)
            s′ ← state(so, a)                  ▷ Get resulting state using action a in node state so
            if s′ is unseen from o then
                add new child o′ to o with n ← 1 and s ← s′
                return o′
            else
                o ← o′                         ▷ Continue from the existing child o′ with s′ as its state
            end if
        end if
    end while
end procedure

procedure expand(o)
    a ← sample uniformly from the untried actions of o
    add new child o′ to o with n ← 1 and s ← state(so, a)
    return o′
end procedure

procedure select(o, bl, bu)                    ▷ Selection according to expected outcome value (Equation 3.1)
    return arg max over a ∈ A(so) of
        Σo′ no′ (qo′ − bl)/(bu − bl) / Σo′ no′  +  2c · √n / (1 + Σo′ no′)
                                               ▷ o′ ranges over the observed outcomes of action a
end procedure

procedure default_policy(s)
    rtotal ← 0
    while s is non-terminal do                 ▷ Terminal states are decided by the simulation model
        a ← sample uniform action from s
        s′ ← state(s, a)
        rtotal ← rtotal + reward(s, a, s′)
        s ← s′
    end while
    return rtotal
end procedure

procedure propagate(o, δ)
    o′ ← parent(o)
    rpath ← δ
    while o′ not null do
        rpath ← reward(o′) + γ · rpath
        o′ ← parent(o′)
    end while
    qo ← rpath
    o ← parent(o)
    while o not null do                        ▷ Update all nodes along the path using max-backup
        no ← no + 1
        qo ← max over a ∈ A(so) of ( Σo′ no′ qo′ / Σo′ no′ )
        o ← parent(o)
    end while
end procedure


[Figure 3.2 illustration: actions PM (Q = −10, N = 1) and PASS (Q = −5, N = 5) with their outcome nodes, shown as (Q/N): −20/1, 0/4 and −10/1]

Figure 3.2: Tree structure for multiple outcomes. Shows that the value of a given action is calculated as a weighted average over the known outcomes. Each node contains a state value Q and a visit count N, illustrated as (Q/N)


Figure 3.3: Simplified MCTS tree (no discount or multiple outcomes) showing how node values are assigned using max-backup

but not necessarily in the stochastic case. The series of actions performed in a rollout could produce both very optimistic and very pessimistic outcomes, resulting in very unstable estimates. This high variance has a very problematic effect on the exploration of the search tree: a series of particularly bad simulations on what may be the optimal path will guide the selection policy away from that sub-tree for a very long time. By replacing the rollouts with a value-approximating deep neural network (DNN), this variance can hopefully be reduced, given that the network is trained properly. This network is responsible for predicting the future cumulative discounted reward, t time-steps forward from a given state s. A properly trained network would converge towards the expected future cumulative discounted reward, which would be a great improvement in terms of guiding the exploration of the search tree. In AlphaGo, policy approximation is used in addition to value approximation; however, since the branching factor here is quite limited, and due to the scope of this thesis, policy approximation was omitted. The network used in this thesis is a very simple regression model: three densely connected layers of 50 neurons each, and one linear output layer producing a single value. The activation function is the sigmoid, the equation for which can be seen in Equation 3.2.

\[
f(x) = \frac{1}{1 + e^{-x}} \tag{3.2}
\]
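For concreteness, the sketch below hand-rolls the forward pass of such a network (three sigmoid-activated dense layers of 50 units and one linear scalar output). In the thesis the network itself is built and trained with TensorFlow, so this is only an illustration of the architecture; all names and shapes are assumptions.

    // Illustration only: the forward pass of the value network described above
    // (three dense layers of 50 sigmoid units, one linear scalar output). The
    // actual model in the thesis is built with TensorFlow; this plain-Rust
    // version just makes the architecture explicit. All names are assumptions.
    struct Dense {
        weights: Vec<Vec<f64>>, // [output][input]
        biases: Vec<f64>,       // [output]
    }

    // Equation 3.2: the sigmoid activation function.
    fn sigmoid(x: f64) -> f64 {
        1.0 / (1.0 + (-x).exp())
    }

    impl Dense {
        fn forward(&self, input: &[f64], activate: bool) -> Vec<f64> {
            self.weights
                .iter()
                .zip(&self.biases)
                .map(|(row, b)| {
                    let z: f64 = row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>() + b;
                    if activate { sigmoid(z) } else { z }
                })
                .collect()
        }
    }

    // Value approximation f_theta(s): three hidden sigmoid layers, linear output.
    fn predict(hidden: &[Dense; 3], output: &Dense, state: &[f64]) -> f64 {
        let mut h = state.to_vec();
        for layer in hidden {
            h = layer.forward(&h, true);
        }
        output.forward(&h, false)[0]
    }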

The training of this DNN is inspired by the process used for AlphaGo Zero [32], i.e. a self-play reinforcement learning algorithm that uses MCTS as a policy improvement operator; an outline of the training process can be seen in Figure 3.4. First, the DNN weights $\theta_0$


[Figure 3.4 diagram: Initialize network weights → Generate MCTS search trees → Generate supervised learning samples from tree nodes → Update the network weights, repeated in a loop]

Figure 3.4: Self-play training process for incrementally improving DNN performance

are initialized randomly. Then a fixed number of episodes are performed to improve the performance of the MCTS with the DNN. For each episode $i \geq 1$, several MCTS trees are generated, guided by the previous iteration of the deep neural network $f_{\theta_{i-1}}$. In the first iteration, rollouts are used rather than the randomly initialized DNN, in order to bootstrap performance. Note that this 'self-play' is different from what is used in AlphaGo Zero, where full games of Go are played from start to finish; here, search trees are only generated once. It should also be noted that the state of the root node is sampled randomly in order to obtain varied trees. Each node is then transformed into a tuple of the form $(s, n, v_{future})$, where $s$ is the state variables, $n$ is the visit count, and $v_{future}$ is the discounted future cumulative reward. This data is then filtered based on the visit count $n$, such that every tuple where $n < n_{min}$ is removed, where $n_{min}$ is the sample minimum visit count, a user-selected parameter. By filtering the data in this way, $v_{future}$ estimates that are very uncertain, i.e. have a low visit count, are avoided. Finally, new DNN parameters $\theta_i$ are initialized to $\theta_{i-1}$ and then trained on the data generated in this episode. The weights $\theta_i$ are adjusted to minimize the error between the predicted value and the discounted future cumulative reward $v_{future}$. More specifically, $\theta_i$ is adjusted by stochastic gradient descent on the mean-square error, using a learning rate of 0.01 and batches of 32, with a random 80/20 split of training and validation data. Unlike in the AlphaGo training process, the adjusted network is then used as is, i.e. no competitive evaluation is performed to decide which network should be used for the next episode of self-play.
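The sample-preparation part of this loop might look roughly like the sketch below, with assumed types and names; the actual gradient updates are performed in TensorFlow, so only the filtering by the minimum visit count and the 80/20 split are shown.

    // Sketch (assumed names) of turning MCTS tree nodes into supervised training
    // data: keep only tuples (s, n, v_future) with n >= n_min, then split the
    // remainder 80/20 into training and validation sets. The SGD updates on the
    // mean-square error (learning rate 0.01, batch size 32) are done in TensorFlow.
    struct Sample {
        state: Vec<f64>, // s: state variables
        visits: u32,     // n: node visit count
        v_future: f64,   // discounted future cumulative reward (regression target)
    }

    fn prepare_samples(
        mut samples: Vec<Sample>,
        n_min: u32,
        shuffle: impl FnOnce(&mut Vec<Sample>),
    ) -> (Vec<Sample>, Vec<Sample>) {
        // Drop targets that are too uncertain, i.e. nodes with few visits.
        samples.retain(|s| s.visits >= n_min);
        shuffle(&mut samples);
        // Random 80/20 training/validation split.
        let split = (samples.len() as f64 * 0.8) as usize;
        let validation = samples.split_off(split);
        (samples, validation)
    }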

3.3 Comparative planning models

The main focus of this thesis is to develop a general maintenance planner, applicable to different problems of varying difficulty. In order to draw any relevant performance-related conclusions, there needs to be one or more baseline methods to compare against.

Oracle maintenance planning

Oracle maintenance planning represents an unrealistic lower bound on maintenance costs. It works by previewing exactly when the component will fail, i.e. by viewing the failure-condition threshold $c_{k\text{-}max}$ and comparing it to the current condition $c_k$: if $c_k + 1 > c_{k\text{-}max}$, it immediately performs PM. It is important to note that the failure-condition threshold is not visible to any of the other maintenance planners, nor would it be in real-world applications. This


Table 3.2: Periodicities for different components and periodic strategies

Component    Defensive (z = 0.4)    Aggressive (z = 0.8)
Rotor        3                      6
Bearing      4                      7
Gearbox      2                      5
Generator    3                      7

therefore represents an unrealistic lower bound, as previously stated. Finally, the oracle method is not implemented for the multi-component scenarios: firstly, the implementation is no longer trivial, and secondly, it does not necessarily produce a lower bound. Due to the separate fixed maintenance cost, it is not guaranteed that independently scheduling PM at the last possible moment for each component is the optimal strategy.
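For a single component, the oracle rule above amounts to the following sketch; the names are illustrative, and the threshold is only visible to the oracle.

    // Oracle rule sketch (illustrative names): with privileged access to the
    // hidden failure-condition threshold c_k_max, preventive maintenance is
    // scheduled exactly when the next step would cross it, i.e. when c_k + 1 > c_k_max.
    #[derive(Debug, PartialEq)]
    enum Action {
        Pass,
        PreventiveMaintenance,
    }

    fn oracle_action(c_k: u32, c_k_max: u32) -> Action {
        if c_k + 1 > c_k_max {
            Action::PreventiveMaintenance
        } else {
            Action::Pass
        }
    }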

Periodic maintenance planning

Periodic maintenance planning represents a reasonable, albeit simple, data-driven approach and should be considered the main alternative method. It works by periodically scheduling preventative maintenance actions given some periodicity $p$. Due to the stochastic failure models, components might fail before the scheduled maintenance, in which case a corrective maintenance action is performed and the schedule is shifted accordingly. For the multi-component scenario discussed in Section 3.1, schedules are constructed independently, i.e. one schedule per component. The per-component periodicity $p_k$ is selected by sampling the underlying failure probability distribution to estimate the expected failure-condition threshold $E[c_{k\text{-}max}]$ and multiplying by some constant factor $z$, giving $p_k = z \cdot E[c_{k\text{-}max}]$. Since the performance of any periodic planner is highly dependent on the periodicity $p$, two different values of $z$ are tried in order to reduce the risk that a particular periodicity is especially ill-suited, or especially well-suited for that matter. For this thesis, these two periodicities are chosen to cover two different maintenance strategies: one defensive, in the sense of being very proactive, and one aggressive, which delays maintenance for much longer. The exact values of $z$ are loosely inspired by the work of Wu and Zhao [41], which, as previously discussed, is also the inspiration for the scenarios used in this thesis. Among their strategies, they used values of $z$ ranging from 0.4 to 1.2. In this thesis, the exact values of $z$ are 0.4 and 0.8 for the defensive and aggressive policies, respectively. Since maintenance is performed in discrete time-slots, the periodicities $p_k$ are rounded to the nearest integer. The final periodicities used per component can be seen in Table 3.2.
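The periodicity selection described above can be summarized by the following sketch (assumed names), where the expectation E[c_k-max] is estimated from samples of the component's failure-condition distribution.

    // Sketch (assumed names) of choosing the periodic planner's per-component
    // periodicity: p_k = round(z * E[c_k_max]), with E[c_k_max] estimated from
    // samples of the component's failure-condition distribution.
    fn periodicity(sampled_thresholds: &[f64], z: f64) -> u32 {
        let expected_threshold: f64 =
            sampled_thresholds.iter().sum::<f64>() / sampled_thresholds.len() as f64;
        (z * expected_threshold).round() as u32
    }

    // With z = 0.4 (defensive) and z = 0.8 (aggressive), this yields the
    // per-component periodicities listed in Table 3.2.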

3.4 Implementation details

In this thesis, the Rust programming language will be used as the primary implementation language. Rust is a compiled, systems-level language focused on performance, reliability and productivity¹. It aims to solve similar problems as C++ but takes a markedly different approach, especially to memory safety. Apart from Rust, TensorFlow², Google's open-source machine learning library, will be used to build, train and apply the neural network in the developed algorithm.

3.5 Evaluation

This thesis aims to develop and performance-test a general maintenance planner based on MCTS, in comparison to other planning models. Since the scenarios of interest are stochastic,

¹ Rust language website: https://www.rust-lang.org/
