
Linköpings universitet

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Computer science

2020 | LIU-IDA/LITH-EX-A--20/059--SE

Extracting Behaviour Trees from Deep Q-Networks

Using learning from demonstration to transfer knowledge between models.

Extraktion av beteendeträd från djupa Q-nätverk

Zacharias Nordström

Supervisor: Johan Källström
Examiner: Fredrik Heintz


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

In recent years, advances in machine learning have solved more and more complex problems, but these techniques are still not commonly used in industry. One problem is that many of the techniques are black boxes: it is hard to analyse them to make sure that their behaviour is safe. This property makes them unsuitable for safety critical systems. The goal of this thesis is to examine whether the deep learning technique Deep Q-Network could be used to create a behaviour tree that can solve the same problem. A behaviour tree is a tree representation of a flow structure that is used for representing behaviours, often in video games or robotics. To solve the problem two simulators are used: one models a cart that has to balance a pole (cart pole), the other is a static world that has to be navigated (grid world). Inspiration is taken from the learning from demonstration field: the Deep Q-Network is used as a teacher and a decision tree is then created from it. During the creation of the decision tree, two attributes are used for pruning: the tree's accuracy or its performance. The thesis then compares three techniques, called Naive, BT Espresso, and BT Espresso Simplified, which are used to transform the extracted decision tree into a behaviour tree. When it comes to performance, the created behaviour trees all manage to complete the simulator scenarios in the same, or close to the same, capacity as the trained Deep Q-Network. The trees created from the performance-pruned decision tree are generally smaller and less complex, but they have worse accuracy. For cart pole, the trees created from the accuracy-pruned tree have around 10 000 nodes, while the performance-pruned trees have around 10-20 nodes. The difference in grid world is smaller, going from 35-45 nodes to 40-50 nodes. To get the smallest tree with the best performance, the performance-pruned tree should be used with the BT Espresso Simplified algorithm. This thesis has shown that it is possible to use knowledge from a trained Deep Q-Network model to create a behaviour tree that can complete the same task.


Acknowledgments

A thesis might be written alone, but there are a number of people that I have to thank for supporting me during the thesis and my studies. First, a thank you to LiU for having a fun and interesting program.

Then a huge thanks to Christopher Bergdahl, Pernilla Eilert, and the team at Saab AB for taking an interest in my thesis and for making me a part of your team for its duration. Thanks to both Johan Källström and Fredrik Heintz for being my supervisor and examiner at LiU, and for asking interesting questions and answering the questions I have had.

Most importantly, a big thank you to my family for always being there and supporting me. This would not have been possible without your support.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Glossary
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research Questions
  1.4 Delimitations
2 Background
  2.1 Decision Tree
  2.2 Finite State Machine
  2.3 Behaviour Tree
  2.4 Artificial Neural Network
  2.5 Reinforcement Learning
  2.6 Deep Q-Networks
3 Related works
  3.1 Rule Extraction Research
  3.2 Rules Extraction from Artificial Neural Network
  3.3 Decision Tree Extraction from Artificial Neural Network
  3.4 Using Q-Learning with Behaviour Trees
  3.5 Learning Behaviour from a Teacher
  3.6 Summary
4 Method
  4.1 Evaluation
  4.2 Extraction Algorithms
  4.3 Simulators
5 Results
  5.1 Cart Pole
  5.2 Grid world
6 Discussion
  6.1 Related Works
  6.2 Results
  6.3 Method
  6.4 The work in a wider context
7 Conclusion
  7.1 Future work
Bibliography
A Example run of the algorithms on a DT
  A.1 Decision Tree
  A.2 Behaviour tree
B Expanded result
  B.1 Cartpole expanded result


List of Figures

2.1 An example of a decision tree.
2.2 Example tree of a BT.
2.3 Structure of a 2 layer fully connected artificial neural network.
3.1 Outline of the ANN-DT algorithm.
4.1 Visualised cart pole scenario in Matlab. The beige rectangle is the pole, the blue square is the cart, the green dashed lines are the fail margins for the cart and pole. The vertical lines are fail margins for moving the cart and the angled lines that are attached to the cart are for the pole.
4.2 Critic network structure for the DQN agent.
4.3 A human designed BT for the cart pole simulator. It is made with the same nodes and rules the extraction algorithms have to conform to.
4.4 Visualised grid world scenario in Matlab with a trail that shows the agent's path. The agent starts at the lighter red ring, at [2,1], and moves to the full red ring, at [5,5]. Each ring has an arrow that points in the direction that the agent moves from that tile. The black tiles are obstacles and the teal state is the goal state. There is a jump from state [2,4] to [4,4], which is why it can jump the obstacle in this case.
4.5 A human designed BT for the grid world simulator. It is made with the same nodes and rules the extraction algorithms have to conform to.
5.1 A BT extracted with the naive solution from the DQN agent for a cart pole simulator. The base DT was pruned on performance.
5.2 A BT extracted with the BT Espresso algorithm from the DQN agent for a cart pole simulator. The base DT was pruned on performance.
5.3 A BT extracted with the BT Espresso Simplified algorithm from the DQN agent for a cart pole simulator. The base DT was pruned on performance.
5.4 A BT extracted with the naive solution from the DQN agent for a cart pole simulator. The base DT was pruned on accuracy.
5.5 A BT extracted with the BT Espresso Simplified algorithm from the DQN agent for a cart pole simulator.
5.6 A BT extracted with the naive algorithm from the DQN agent for the grid world simulator. Extracted with performance as choice for pruning level.
5.7 A BT extracted with the BT Espresso algorithm from the DQN agent for the grid world simulator. Extracted with performance as choice for pruning level.
5.8 A BT extracted with the BT Espresso Simplified algorithm from the DQN agent for the grid world simulator. Extracted with performance as choice for pruning level.
5.9 A BT extracted with the naive algorithm from the DQN agent for the grid world simulator. Extracted with accuracy as choice for pruning level.
5.10 A BT extracted with the BT Espresso algorithm from the DQN agent for the grid world simulator. Extracted with accuracy as choice for pruning level.
5.11 A BT extracted with the BT Espresso Simplified algorithm from the DQN agent for the grid world simulator. Extracted with accuracy as choice for pruning level.
A.1 A DT created from Matlab's ionosphere data, pruned to level 6 of 8.
A.2 Naive solution run on a DT created from Matlab's ionosphere data and pruned to level 6 of 8.
A.3 BT Espresso algorithm run on a DT created from Matlab's ionosphere data and pruned to level 6 of 8.
A.4 BT Espresso Simplified run on a DT created from Matlab's ionosphere data and pruned to level 6 of 8.


List of Tables

4.1 Explanation of the variables in an observation from the cart pole environment which is passed to the model.
4.2 Layer breakdown of the critic network for the DQN agent used in the cart pole simulator.
4.3 Parameter selection for the DQN agent used in Cart Pole.
4.4 Explanation of the variables in an observation from the grid world environment which is passed to the model.
4.5 Layer breakdown of the critic network for the DQN agent used in the grid world simulator.
4.6 Parameter selection for the DQN agent used in grid world.
5.1 The table shows the results of the algorithms on the cart pole simulator. The DT used was extracted with performance as measurement. R stands for the normalised performance and A is the accuracy.
5.2 The table shows the results of the algorithms on the cart pole simulator. The DT used was extracted with accuracy as measurement. R stands for the normalised performance and A is the accuracy.
5.3 The table shows the results of the algorithms on the grid world simulator. The DT used was extracted with performance as measurement. R stands for the normalised performance and A is the accuracy.
5.4 The table shows the results of the algorithms on the grid world simulator. The DT used was extracted with accuracy as measurement. R stands for the normalised performance and A is the accuracy.
B.1 The table shows the results of the algorithms on the cartpole simulator. The DT used was extracted with accuracy as measurement. R stands for the normalised performance and A is the accuracy.
B.2 The table shows the results of the algorithms on the cartpole simulator. The DT used was extracted with accuracy as measurement. R stands for the normalised performance and A is the accuracy.
B.3 The table shows the results of the algorithms on the grid world simulator. The DT used was extracted with performance as measurement. R stands for the normalised performance and A is the accuracy.
B.4 The table shows the results of the algorithms on the grid world simulator. The DT used was extracted with accuracy as measurement. R stands for the normalised performance and A is the accuracy.


Glossary

AI Artificial Intelligence
ANN Artificial Neural Network
API Application Programming Interface
BT Behaviour Tree
CART Classification and Regression Tree
DNF Disjunctive Normal Form
DNN Deep Neural Network
DQN Deep Q-Network
FCN Fully Connected Network
LORE LOcal Rule Extraction
LSTM Long Short-Term Memory
LfD Learning from Demonstration
QBN Quantized Bottleneck Network
RL Reinforcement Learning
RNN Recurrent Neural Network
SL Supervised Learning


1 Introduction

For safety critical systems, such as autonomous vehicles and decision support systems, it is important to know what will happen in each situation. Therefore, it is important to know what a learned model will do when used in such systems. The systems need to guarantee that the solutions they rely on do not behave in strange, unpredicted or undesired ways.

A behaviour tree (BT) is a way of modelling when to perform an action in a scenario. BTs can be used in decision support, game artificial intelligence (AI) or for making decisions in autonomous vehicles. The design of BTs is often left to an expert, but in complex scenarios this is a complicated process that takes a lot of time. Therefore, research has been conducted on generating these trees automatically [2, 9].

In the last years, advancements in reinforcement learning (RL) have made computers learn to play complex games on par with humans. One big breakthrough was the use of deep RL to beat Atari 2600 games from pixel input [21].

Since new algorithms in RL have good performance, it is preferable to use these when training agents. A problem with using these techniques for safety critical systems is that they can use deep neural networks (DNNs), which are close to black-box systems. One cannot tell the customer of a safety critical system that the system will probably behave correctly. Therefore, a way of expressing the learnt behaviour of the deep RL algorithm in an explainable way is needed.

Being able to tell what a system will do, or what caused a system to make a decision, is important. One reason is that the European Union has introduced the right to know why an algorithm made a decision [15].

There has been research on explaining DNNs. Some areas in which algorithms have been created are linear proxy models, decision trees (DTs), automatic rule extraction and salience mapping [13]:

• Linear proxy models probe a black-box system in order to construct a local linear model [13]. The local linear model is then used as a proxy for the large model.

• Decision tree extraction is when a DT is created from an artificial neural network (ANN) [13].

• Automatic rule extraction can be done on a decompositional or a pedagogical level [13]. Decompositional means that the network structure is taken into account and used for the extraction of rules. In a pedagogical approach the network structure is not taken into account; the network is seen as a black box and is estimated by testing inputs and recording outputs. The information gained is then used to create rules that estimate the network.

• Salience mapping is when part of the input data is obscured and this is then used to show what parts of the input data the network uses in its predictions [13].

This thesis will focus on the extraction of DTs, as a BT can model a DT [4]. If these parts can be put together, a bridge from deep RL to the creation of a BT might be achievable. Creating such a bridge is the goal of this thesis.

1.1 Motivation

This thesis is done in cooperation with the company Saab AB. One of the areas that Saab AB works with is flying vehicles and decision support for the pilots of these vehicles. One technique that can be used to model the behaviour of the vehicles is BTs. The main method of creating BTs is by hand, by a domain expert. This can be a time-consuming process, and it can be prone to errors as the scenario can be complex and require a complex behaviour. Therefore, if the process could be automated it would save time. Previous thesis work has been done on trying to generate BTs using evolutionary learning algorithms [9]. A problem with that technique is that a BT that is able to complete the scenario has to be present in the starting population for the evolution to go anywhere.

Instead of improving the learning process for BTs by evolutionary processes, one could try to use state-of-the-art methods for learning behaviour. The state of the art for learning behaviour is deep RL. The question then moves from how to learn a BT for a scenario, to how to extract a learnt behaviour from an RL model for a scenario.

1.2 Aim

The aim of the thesis is to study whether it is possible to take a trained Deep Q-Network (DQN) and use the learnt information to create a behaviour tree (BT) that can still complete the scenario and has the same behaviour. The hope is that by transforming the network into a behaviour tree, a lot of understandability can be gained while only paying a small price in terms of performance.

1.3 Research Questions

The aim of the thesis is to find out if it is possible to extract a BT from a DQN model. This is the main research question: Is it possible to create a BT from a DQN model? If this is achieved, there are some follow-up questions:

1. Does the extracted model accurately predict the DQN agent's actions?
2. Is it possible to analyse what the extracted BT will do?
3. How does the performance change when modelling the DQN as a BT?

1.4 Delimitations

To make the project doable within the scope of a master's thesis, some delimitations have to be made. The extraction will be made from a DQN model into a BT. Furthermore, the simulators used are small and not complex, to make the training easier. Techniques for extraction of behaviour will be based on existing methods; the extracted representation might not be a BT directly but will then be transformed into a BT by other methods. There might be better formats for representing the extracted behaviour from a network, but this thesis will work towards representing the behaviour as BTs. The computational demands of the techniques have to be limited, as the thesis does not have unlimited computational resources. Therefore, more complex simulators, scenarios or techniques are not used.


2 Background

To help with understanding the contents of the thesis, a number of concepts and terms are explained in this chapter. The first part of the chapter explains DTs and BTs, and the last part explains RL and DQNs.

2.1 Decision Tree

A decision tree (DT) is a tree structure that is used to make predictions. A DT has internal nodes, each of which has two children. Predictions are made by taking some input and traversing the splits in the tree. When a leaf node is reached, the label at the node is used as the output prediction. An internal node is a split on one of the input variables: the left child is traversed if the value is less than the split value and the right child if it is larger. Leaf nodes are the other type of node in the tree. Leaf nodes do not have any children or contain any split; they instead contain a prediction label that is used for predictions. [26]

An example of a DT can be found in Figure 2.1. The figure shows a small tree where the first split looks at variable x5: if the value is lower than 0.23154 the sample is classified as label b. If it is larger and variable x27 is smaller than 0.999945 it is classified as label g, otherwise as label b.
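To make the traversal concrete, the following is a minimal Python sketch (illustrative only, not code from the thesis) of binary DT prediction. The example tree at the end mirrors the splits described for Figure 2.1, with the variables x5 and x27 read as positions in the input vector.

    class Leaf:
        """Leaf node: holds the prediction label."""
        def __init__(self, label):
            self.label = label

    class Split:
        """Internal node: splits on one input variable at a threshold."""
        def __init__(self, variable, threshold, left, right):
            self.variable = variable    # index of the input variable
            self.threshold = threshold
            self.left = left            # taken when the value is smaller than the split
            self.right = right          # taken when the value is larger

    def predict(node, x):
        """Traverse the tree until a leaf is reached and return its label."""
        while isinstance(node, Split):
            node = node.left if x[node.variable] < node.threshold else node.right
        return node.label

    # The tree of Figure 2.1 (x5 and x27 taken as vector positions):
    figure_2_1_tree = Split(5, 0.23154,
                            Leaf("b"),
                            Split(27, 0.999945, Leaf("g"), Leaf("b")))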

ID3

ID3 is an algorithm developed by Quinlan for the creation of DTs from training data [10]. To create a DT, a split point over the remaining attributes has to be calculated. This is done by calculating the information gain of each attribute and then selecting the attribute that has the largest gain.

Some disadvantages of the ID3 algorithm are that it easily overfits if the training data is small [10], and that continuous data has to be discretized, which can create large computational costs if the discretization is too fine-grained.

The advantage of ID3 that makes it popular is that it tries to avoid large trees [10]. As ID3 looks at all the attributes, the resulting tree is able to classify unseen observations once created. The tree produced is easier to prune because of ID3's ability to easily find leaf nodes.
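As a concrete illustration of the split-selection step, the sketch below (assumed discrete attribute values; not code from the thesis) computes the entropy and information gain that ID3 uses to pick the attribute to split on.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        total = len(labels)
        return -sum((count / total) * math.log2(count / total)
                    for count in Counter(labels).values())

    def information_gain(rows, labels, attribute):
        """Entropy reduction from splitting on a discrete attribute.

        rows: list of dicts mapping attribute name -> value.
        """
        by_value = {}
        for row, label in zip(rows, labels):
            by_value.setdefault(row[attribute], []).append(label)
        remainder = sum(len(subset) / len(labels) * entropy(subset)
                        for subset in by_value.values())
        return entropy(labels) - remainder

    # ID3 chooses the attribute with the largest gain at each node:
    # best = max(attributes, key=lambda a: information_gain(rows, labels, a))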


Figure 2.1: An example of a decision tree. It classifies an input based on a number of x values into class b or g. Each split looks at one x value: if the value is smaller than the split the left branch is chosen, if it is larger the right branch is chosen. This is continued until a leaf node is reached, and the sample is then classified according to the leaf node's class.

C4.5

C4.5 is an extension of the ID3 algorithm by Quinlan that addresses some of the shortcomings of ID3 [10]. The main difference from ID3 is that C4.5 uses the gain ratio instead of information gain. C4.5 also has the advantage of handling missing data in the input. Statistical pruning is used to reduce the tree size, and C4.5 allows the use of continuous values.

C4.5 starts the same way as ID3 but then performs a pruning step. The pruning step looks at branches that increase the rate of classification error in the tree [10].

Classification and Regression Tree

Classification and Regression Tree (CART) is another algorithm for creating DTs. The idea behind CART is to perform three steps at each node in the tree [7]: look at all splits that can be made at the node, then select the best of the splits according to some specified criterion, and, if the node fulfils a predefined stopping criterion, do not continue to split on that node.

The trees generated by CART are often large, so a second part of the algorithm prunes the generated tree to a smaller size. The most common way of doing the pruning is by using the minimal cost-complexity procedure [7].

2.2 Finite State Machine

A finite state machine (FSM) is a mathematical model for modelling computation [4]. An FSM has states, conditions and transitions. Each FSM has a start state, one or multiple goal states, transitions between states and conditions that trigger the transitions. The FSM receives an input and checks which condition corresponds to the input; it then executes the transition associated with that condition from the current state. If the new state after the transition is a goal state, the FSM halts.


Figure 2.2: Example tree of a BT. Fall is a fallback node, Seq is a sequence node, Con is a condition node, Act is an action node. Root is the root node.

2.3 Behaviour Tree

A behaviour tree (BT) is a tree representation of a flow structure. The tree is a directed tree and is written with the root node at the top. Internal nodes in the tree are called control flow nodes and the leaf nodes are named execution nodes [4]. Each node in the tree returns Success, Failure or Running when invoked. The execution of a node is called a tick. The root node is the one that the tick originates from.

There are different versions of the control flow nodes that modify the tick or the return value of a child. The sequence control flow node sends the tick to its leftmost child and, if that child returns Success, it sends the tick to the next child. If all children return Success the sequence node returns Success; otherwise, if a child returns Failure the sequence node returns Failure. Another type of control flow node is the fallback node. It does the same as the sequence node, but instead of passing the tick to the next child when a child returns Success, it passes it on when a child returns Failure. If a child returns Success the fallback node returns Success. The parallel node sends its tick to all its children at the same time. If a subset of the children return Failure the parallel node returns Failure; how many of the children need to fail has to be specified in the node. Two execution node implementations are the action node and the condition node. The action node performs a command each time it receives a tick. If the command was successful it returns Success, if the command is still running it returns Running, and if the command failed it returns Failure. The condition node checks if some condition is met when it receives a tick: if it is, it returns Success, otherwise it returns Failure. An example of a BT can be seen in Figure 2.2.
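To make the tick semantics concrete, here is a minimal Python sketch (illustrative only, not code from the thesis) of the sequence, fallback, condition and action nodes described above; for simplicity the action node is assumed to finish its command within a single tick.

    from enum import Enum

    class Status(Enum):
        SUCCESS = 1
        FAILURE = 2
        RUNNING = 3

    class Sequence:
        """Ticks children left to right; stops at the first non-Success child."""
        def __init__(self, children):
            self.children = children
        def tick(self):
            for child in self.children:
                status = child.tick()
                if status != Status.SUCCESS:
                    return status          # Failure or Running stops the sequence
            return Status.SUCCESS

    class Fallback:
        """Ticks children left to right; stops at the first non-Failure child."""
        def __init__(self, children):
            self.children = children
        def tick(self):
            for child in self.children:
                status = child.tick()
                if status != Status.FAILURE:
                    return status          # Success or Running stops the fallback
            return Status.FAILURE

    class Condition:
        """Leaf node wrapping a boolean check."""
        def __init__(self, predicate):
            self.predicate = predicate
        def tick(self):
            return Status.SUCCESS if self.predicate() else Status.FAILURE

    class Action:
        """Leaf node wrapping a command assumed to finish within one tick."""
        def __init__(self, command):
            self.command = command
        def tick(self):
            return Status.SUCCESS if self.command() else Status.FAILURE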

2.4 Artificial Neural Network

The idea behind the artificial neural network (ANN) is to model a biological neural network; an example of a biological neural network is the human brain. ANNs all have the same structure of nodes and layers [29]. Each layer of the ANN consists of one or more nodes. The input of each node comes from nodes in the previous layer, or from the input observation to the network. The output is sent to nodes in the next layer, or to the output. If all nodes in a network are connected to all nodes in the previous and next layer, the network is called a fully connected network (FCN). Each connection in a layer has a weight associated with it. See Figure 2.3 for the overall structure.

Figure 2.3: Structure of a 2 layer fully connected artificial neural network [29].

Each node in the network calculates its output, y, according to Equation 2.1, where f is the transfer function, w_i is the weight of connection i and T is a threshold value for the node [29]. To allow the network to approximate non-linear problems the transfer function has to be non-linear.

y = f\Big(\sum_{i=0}^{n} w_i x_i - T\Big)    (2.1)
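A small numpy sketch of Equation 2.1 and of one fully connected layer, assuming tanh as the non-linear transfer function (purely illustrative, not the networks used in the thesis):

    import numpy as np

    def node_output(x, w, T, f=np.tanh):
        """Output of one node as in Equation 2.1: y = f(sum_i w_i * x_i - T)."""
        return f(np.dot(w, x) - T)

    def fully_connected_layer(x, W, T, f=np.tanh):
        """One FCN layer: every node is connected to the whole input vector.

        W has shape (n_nodes, n_inputs) and T has shape (n_nodes,).
        """
        return f(W @ x - T)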

The way the network learns is to update the weights of its connections, thereby strengthening or weakening the signals from nodes. There are two major learning categories for an ANN: supervised and unsupervised learning [29]. In supervised learning the network gets an input, and a measure of the error between the network's output and the true output is calculated. This measured error is then minimised during training when updating the weights. In unsupervised learning the goal is not to minimise the difference between the guessed output and the true value, but to find the underlying structure of the input data; the true labels of the data are often not known.

In supervised learning, a technique called back propagation is used when the error is minimised. Back propagation sends the gradient of the error backwards from the output through the network [29], and based on the back-propagated signal the weights of the network are updated.

2.5 Reinforcement Learning

Reinforcement Learning (RL) is one of three main learning fields in machine learning; the others are supervised learning (SL) and unsupervised learning (UL). The main idea behind RL is to let an agent explore the world on its own and give positive or negative feedback when the agent does something good or bad [18]. One of the main problems that agents are faced with in RL is the choice between exploration and exploitation: should the agent use the knowledge that it has collected to decide which action to take in order to maximise its reward, or should the agent explore the world and learn more about it? This dilemma is called the exploitation versus exploration problem.

In RL the agent gets observations from the environment. For a given observation of the environment the agent chooses an action to perform in the same environment [11]. The agent receives a reward and a new observation from the environment. From the reward the agent can learn how good the action it performed was. It also gains transition knowledge about moving from one observation to another. This loop is repeated until the agent completes the scenario; then the world is reset and the agent can continue exploring it. When learning, the agent can be set to learn for a number of resets, until it reaches a certain average score over a period, or according to some other defined metric.

In RL there are three components that are used: a value function, a policy and a model [11]. The value function is used to predict how good a state or state/action pair is. The policy represents which actions the RL agent should take in a state. The model is an estimation of the transition and reward functions of the environment and is used together with a planning algorithm. When only a value function and a policy are used, the algorithm falls into model-free RL; when all three are used, it is called model-based RL.

The Markov Property

If a process is Markovian it means that it has the Markov property [11], which fulfils Equation 2.2, where w_t is the observation at time t, a_t is the action taken at time t and r_t is the reward achieved at time t; t is a timestep after initialisation at t = 0. Another property of a Markovian process is that it is a discrete-time stochastic control process. What the Markov property states is that the current observation can be used as a starting point for the future of the process, without taking into account the history of the process.

P(w_{t+1} \mid w_t, a_t) = P(w_{t+1} \mid w_t, a_t, \ldots, w_0, a_0)
P(r_t \mid w_t, a_t) = P(r_t \mid w_t, a_t, \ldots, w_0, a_0)    (2.2)

Q-Learning

A model-free reinforcement learning technique called Q-learning was introduced by Watkins [27]. Q-learning is a technique for learning which action to take in a controlled Markovian domain. Action policies are learned by trying different actions and recording what reward was received. A table over actions and states is used to record how good each combination is; this table is called the Q-table. When the agent is to take an action, it looks at which action will give the highest reward. Watkins proves that the agent's behaviour will converge with probability 1 to an optimal policy given enough iterations. The update of the Q-table is done according to Equation 2.3, where x is the current state, a is the action taken, y is the next state and r is the reward received by performing action a in state x. This is all done for a sequence of steps, called episodes, where n is the n:th episode and n = 0 is the start of the learning for the agent. α is called the learning factor and describes how much the agent learns from the current reward with regard to previous knowledge. γ is the discount factor and is a parameter for how much the agent shall go for short-term versus long-term rewards.

Q_n(x, a) = \begin{cases} (1 - \alpha_n) Q_{n-1}(x, a) + \alpha_n \big[ r_n + \gamma \max_b \{ Q_{n-1}(y_n, b) \} \big] & \text{if } x = x_n \text{ and } a = a_n, \\ Q_{n-1}(x, a) & \text{otherwise} \end{cases}    (2.3)
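A minimal tabular sketch of the update in Equation 2.3 (illustrative only; the epsilon-greedy action choice and the parameter values are assumptions, not taken from the thesis):

    import random
    from collections import defaultdict

    class QLearner:
        """Tabular Q-learning with the update rule of Equation 2.3."""

        def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
            self.q = defaultdict(float)   # Q-table: (state, action) -> value
            self.actions = actions
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def choose_action(self, state):
            # Exploration versus exploitation: random action with probability epsilon.
            if random.random() < self.epsilon:
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state):
            # Q_n(x,a) = (1 - alpha) * Q_{n-1}(x,a) + alpha * [r + gamma * max_b Q_{n-1}(y,b)]
            best_next = max(self.q[(next_state, b)] for b in self.actions)
            self.q[(state, action)] = ((1 - self.alpha) * self.q[(state, action)]
                                       + self.alpha * (reward + self.gamma * best_next))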

Deep Neural Network

A deep neural network (DNN) is an ANN with many hidden layers; where the transition from shallow to deep neural networks lies is hard to say, and no clear definition exists. One of the problems when creating a DNN is that when back propagation is done, the gradient either goes to zero (vanishes) or goes towards infinity (explodes). This is due to the number of hidden layers in the network. If the weights are small, a number of multiplications with values less than 1 will be done, which makes the value go towards zero. There can also be multiplications with zero, which give zero. If multiplications are performed with values larger than 1, the value grows exponentially towards infinity. With more and more weights this risk increases [24].

One of the reasons for the success of DNNs is that the multilayer structure allows the network to take advantage of the compositional hierarchies that exist in many natural signals [20]. In the case of images, objects are made up of parts and parts are made up of edges. This means that a DNN that classifies images can learn to identify edges in its first layer; in the second layer the edges can be put together into parts, and in the third layer the parts can be classified into objects.

Some other reasons for the increased popularity of DNNs are the advancements in hardware: networks that would have taken weeks to train can now be trained in a day [20]. The use of another activation function called ReLU (Rectified Linear Unit) and of the regularization technique called drop-out gave a needed performance boost to the networks. These are some of the reasons for the huge success of DNNs on ImageNet [6].

ReLU helped solve some of the problems with vanishing and exploding gradients [14]. ReLU is a linear function that is cut off at 0, i.e. f(x) = max(0, x). The advantage of ReLU is that the gradient is always 0 or 1. Note that ReLU is undefined at 0; there the derivative is defined to be 0, 1/2 or 1.

2.6 Deep Q-Networks

Mnih et al. [21] introduced a new way of doing RL that uses a DNN; they called the new technique Deep Q-Network (DQN). The main difference between this method and Q-learning is that DQN uses a DNN to approximate the Q-function. In the paper, the structure of the network is altered from taking a state and an action and giving the reward, to instead taking a state and approximating the reward for each action. This means that the best action can be selected by choosing the action with the highest reward.

A problem with using a DNN to represent the Q-function is that RL has a known issue of becoming unstable when using a non-linear function approximator [21], and an ANN is a non-linear function approximator. To deal with the issue they used two different mechanisms. The first is an idea called experience replay. The idea behind experience replay is to store the experience gained from interacting with the environment, and then sample from the buffer pool of past experience during training. By doing this, changes in the data distribution can be smoothed over and correlations in the observation space can be removed. The second improvement is in the way the action-values are updated. They use an incremental approach that updates the values towards the target values, and this update is done periodically. This is achieved by having a second network estimate the value of an action in the Q-learning step; this network is updated less frequently than the network used for choosing the policy, which reduces the correlations with the target.
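The two mechanisms can be sketched as follows (illustrative only; q_net, target_net and train_fn are hypothetical placeholders for a Q-network that maps a state to a list of action values and a function that fits it towards given targets):

    import random
    from collections import deque

    class ReplayBuffer:
        """Experience replay: store transitions and sample decorrelated minibatches."""

        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)

    def dqn_training_step(q_net, target_net, buffer, batch_size, gamma, train_fn):
        """One DQN update: targets are bootstrapped from the target network."""
        if len(buffer) < batch_size:
            return
        states, targets = [], []
        for state, action, reward, next_state, done in buffer.sample(batch_size):
            target = list(q_net(state))                  # current action-value estimates
            bootstrap = 0 if done else gamma * max(target_net(next_state))
            target[action] = reward + bootstrap
            states.append(state)
            targets.append(target)
        train_fn(states, targets)                        # gradient step on q_net

    # Periodically (every C steps) the target network is synchronised with q_net,
    # e.g. target_net = copy.deepcopy(q_net).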


3 Related works

This chapter summarises previous research in areas that are close or related to the topic of this thesis. Research in similar areas is presented under each section. As can be seen in the chapter, much of the existing research focuses more on creating a DT than a BT, but since a BT can model a DT, a DT could be used as an intermediate step on the way.

There are different areas of extraction: some approaches look into breaking down the network structure to be able to extract information from the network, while others treat it as a black box and only look at what classification is done by the network.

Of the research that looks at creating BTs, most focuses on creating them with help from genetic algorithms, or on modifying an existing BT to make it smaller or more complete.

3.1 Rule Extraction Research

Research discusses the difference between interpretability and accuracy [16, 13]. The definition used for interpretability is the ability to understand, in a human understandable way, what output a model will give [16]. Accuracy is how closely the model can express the original system [16]. This means that if a model has 100% accuracy it will take exactly the same actions as the original system.

One way to measure the interpretability of a set of rules is to look at its complexity. The goal is to have as low complexity as possible, which would then increase the interpretability. Some approaches are to measure the number of labels per rule, the number of variables or the number of rules [16]. Another attempt at quantifying the interpretability of a set of rules is to look at its semantics, that is, the semantics associated with the elements in the rules. There is a difference between a model having interpretability and a model having explainability [13]. Interpretability is comprehending what a model does or should have done in a situation, while explainability is being able to give a reason for the behaviour or produce an insight into the cause of the decision.

3.2 Rules Extraction from Artificial Neural Network

LOcal Rule Extraction (LORE) is a method proposed by Chorowski and Zurada [3] that extracts rules from an ANN. The rules are then transformed into a decision diagram. The main idea behind the algorithm is to take a training sample, insert it into the network and record how the network classifies it. Then a rule that makes the same classification is created and added to the pool of rules. A generalization step is performed, which makes the rules no longer completely true to the network. Last, a pruning step can be run to simplify the rule space. One limitation of the algorithm is that the input space has to be discrete.

Another technique for creating more understandable decision-making models was introduced by Koul et al. [19]. They showed that a recurrent neural network (RNN) could be transformed into a finite state machine (FSM) called a Moore Machine (MM). For the extraction, a Quantized Bottleneck Network (QBN) is inserted into the RNN. A QBN is an auto-encoder that quantizes the latent representation. This insertion does not affect the RNN if the QBN perfectly estimates the memory, which is not always true, but it allows the network to be viewed as a Moore Machine Network (MMN). The MMN could then be transformed into an MM.

Koul et al. [19] showed that the produced MM has performance close to or the same as the RNN; however, in some cases additional tuning was required. They compare the performance of a DQN and a Long Short-Term Memory (LSTM) solution on six Atari games. Both perform well on the games, and the extracted FSM has the same, or almost the same, performance as the trained RNN. The produced MM could then be analysed to gain insight into which strategies the RNN had learnt during training.

3.3 Decision Tree Extraction from Artificial Neural Network

ANN-DT is an algorithm introduced to be able to handle ANNs that output continuous values [25]. Another advantage of the technique is that it does not make any assumptions about the network structure or about how the network was trained. The main idea behind the algorithm is to train the network, then interpolate data and feed it to the trained network. From the sampling of the ANN with the interpolated data a DT is created. An outline of the steps can be seen in Figure 3.1.

Figure 3.1: Outline of the ANN-DT algorithm [25].

The created DT has the same or better performance than a DT created directly with CART [25]. A theory for why the extracted DT can have better performance than the directly generated DT is that the ANN does not overtrain on outliers. The ANN is assumed to detect trends between points on different branches, which might also be the reason for the performance of the extracted DT.

As most research has been done for shallow ANNs, an algorithm for extracting DTs from DNNs was introduced by Zilke et al. [28]. The technique is based on the shallow ANN extraction algorithm CRED. CRED uses the C4.5 algorithm to transform the output layer of the ANN into a DT. Then the hidden layer is processed: for each weight from the hidden layer that was used as one of the inputs to the output DT, a new DT is created in the same manner. The next step is to perform the same step from the input to the hidden layer. As a final step, the DTs are merged together into one. This is done by inserting the DTs in place of the labels produced in the DTs of the earlier layers. The final DT then connects the input and output of the network. Zilke et al. extend the algorithm by creating more DTs as intermediate steps for the hidden layers, which allows the algorithm to be run on networks with more hidden layers.

3.4 Using Q-Learning with Behaviour Trees

There is a way of using Q-learning in conjunction with BTs, called RL-BT. RL-BT works by having a fixed BT where some of the nodes are learning nodes [4]. In the learning nodes, Q-learning is used to learn the correct behaviour. The problem with this method is that it needs a BT as input with nodes in which it can learn what to do. If the root node is the learning node, then the problem is back to normal Q-learning. The advantage of using RL-BT over just Q-learning is that the problem is broken down into smaller parts that the algorithm has to learn, which makes it easier to learn.

QL-BT is a technique created by Dey and Child [8]. The idea behind the technique is to take an existing BT and find the lowest level sequence nodes. The actions found under the sequence nodes are then used as actions in Q-learning. The generated Q-table is then divided into a number of sub-tables, where each sub-table is chosen so that it represents one action. In the original BT all the condition nodes are replaced with a new node called the Q-Condition node. The Q-Condition node is a node created from a sub-table of the Q-table. As a last step of the algorithm, the BT is sorted in such a way that the nodes that have the highest Q-value are put first.

3.5 Learning Behaviour from a Teacher

Sagredo-Olivenza et al. [23] looked at how allowing a designer to control a non-player character (NPC) could help them more easily create a BT for that NPC. The designer is allowed to create a minimal BT to start off with. When they want to insert a behaviour that is more complex, they can choose the training node and then perform training to teach the node what behaviour it should have. This is done by controlling the NPC and moving it in the desired way. When the designer wants the NPC to perform some type of action during training, they can pause the game and select one of the already implemented action nodes. From the data created during the training episode, a BT describing the behaviour is created and inserted in place of the training node. To create the BT they start by generating a DT with C4.5. The DT is then converted into rules by a depth-first search. Redundant rules are removed and the rules are condensed. The last step is to create a BT by having a parallel node at the top which is connected to each task. Each task is protected by a condition that corresponds to the rule for that task being fulfilled. The study found that the presented technique was perceived to be easier to use when creating a BT than a conventional BT editor.

Robertson and Watson [22] looked at the game StarCraft and created an algorithm for creating BTs from replays of human expert-level matches. The way they create the BT is to first create a BT that is over-fitted to the data. This is done by creating an action node for each action in an example of the data. Each action for that example is then attached to a sequence node. All the sequence nodes are then attached to a selector node. The selector node chooses which sequence node shall run by random chance or by matching the sequence node to the example closest to the current observation. This creates a BT that is over-fitted to match each example exactly; they call this type of BT a maximally-specific BT. From the maximally-specific BT, trends and common patterns are detected. The common patterns are then merged together and added as new sequences. When the algorithm cannot find any more patterns it stops and the new BT is returned.

The trees that are produced by the algorithm are very large, even after the reducing step. The starting tree has over 200 000 nodes, and when the algorithm is done it is reduced down to around 50 000 nodes [22].

Learning from Demonstration (LfD) allows a robot to be taught behaviour in the same way we teach each other: by showing and doing. French et al. [12] introduce a technique that uses the CART algorithm to create a DT from human demonstrations. The created DT is then transformed into a BT by their own proposed technique called BT-Espresso. The created BT then models the demonstrated behaviour and can be used to run it. The technique is used to teach a robot to pick up a duster, move the duster to a designated area and dust the area with the duster. Pseudo-code for BT-Espresso can be found in Algorithm 1, reproduced from [12].

Algorithm 1 BT-Espresso algorithm for converting a DT to a BT, reproduced from [12].

function BT_ESPRESSO(dt)
    rules ← DT_TO_RULES(dt)
    rule_dnfs ← LOGIC_MINIMIZER(rules)
    root ← Parallel()
    for dnf in rule_dnfs do
        seq ← Sequence node
        act ← dnf.action
        or ← Fallback node
        seq.add_child(or)
        seq.add_child(act)
        for minterm in dnf do
            and ← Sequence node
            for predicate in minterm do
                cond ← Condition(predicate) node
                and.add_child(cond)
            or.add_child(and)
        root.add_child(seq)
    return root

Crick et al. [5] created an environment that allows humans to help a robot navigate through a maze. This was done by allowing humans to connect to the robot through a website, where they could guide the robot through the maze given a sensor feed. The two feeds that the participants could operate with were a raw camera feed or a processed camera feed. The processed feed showed the location of a tag in view; the tag feed is the same input sent to the robot when it has control. By recording the human-controlled runs and then using them for learning DTs, the researchers were able to train the robot to navigate the maze. The interesting result of the experiment was that the data generated from the runs with the tag system gave a more robust and faster-converging behaviour.

Janssen [17] performed an experiment where he used genetic algorithms to try to estimate the learnt behaviour of an RL policy. First, Q-learning was run on a simulator with a UAV that had to collect trash and avoid crashing. The agent (the UAV) got a positive reward for collecting trash, a large negative reward for crashing and a small negative reward for turning. When the agent collected a piece of trash, a new one was randomly spawned. The locations of the walls were fixed during both training and evaluation. After the agent was done training, its behaviour was analysed and summarised. When the RL policy had been learnt, the genetic algorithm was run to try to recreate it with a BT. The fitness function of the algorithm looked at how many of the actions taken by the genetic agent were the same as the ones taken by the RL agent in the same state. In the last part of the evaluation, the fitness function was changed to promote BTs with fewer nodes. The result of the genetic algorithm was a BT that missed some of the behaviours that the RL algorithm had developed; the best generated BT took the same action as the RL policy 66% of the time. The advantage found was that most of the missing behaviour could be inserted into the tree by hand, which made the tree and the policy select the same action in 86% of the states. The advantage of the generated BT, and the human-extended BT, was their size: the generated BT had 6 nodes and the human-extended one had 20 nodes. Due to the small number of nodes, the trees are easier to examine for bad behaviour. This can be compared to the Q-table, which had over 3 000 entries and is much harder to verify to be free from ill behaviour.

3.6 Summary

Some related research was presented in this chapter. From the summarised reports it can be seen that there exist multiple ways of extracting information from a neural network. There is more research on how to extract information from shallow ANNs than from DNNs. Most of the research has presented successful ways of extracting the information from the network.

When it comes to research on creating BTs, there are multiple different approaches. Some use RL to enhance the tree and make more flexible nodes, giving a more structured way of looking at the RL solution. Others have looked at generating BTs; some require a BT to enhance and some have used genetic algorithms to generate the trees. But when BTs have been generated with genetic algorithms, some problems with the final tree or the method have emerged. Janssen [17] had the problem that the extracted BT did not capture the whole behaviour of the Q-learning agent; therefore it performed worse than the agent on the tested scenario. Given the set-up of the world used in Janssen's report, it is a hard problem for RL. This is due to the agent having a limited view of the world in conjunction with the positive-feedback locations being randomly placed in the world. This means that the agent has to come up with some technique for traversing a world it cannot fully sense in order to get to locations that are randomly spawned and receive positive feedback. It also means that the right action for the agent to take could be different for two observations that are the same.

The algorithm presented by Robertson and Watson [22] gave very large BTs. This comes from the fact that they mapped every action taken in a game of StarCraft into the tree and then repeated this process for multiple games. Then they tried to make abstractions from the actions taken to minimise the tree. As this is a technique that starts off with 100% accuracy for a game, they get a large tree to be able to handle the complexity. The problem for them is extracting useful information from each game and making abstractions such that they get one generic tree that can play a game of StarCraft, instead of a tree that is over-fitted to many different games and has to select which of the games the current one is closest to. This could be a useful way of creating a BT if each set-up of a scenario requires a different solution with some common variants. If the abstraction step could be done in a good way, then big parts of the tree could be generalised into sub-trees, which could save on the implementation side, as the sub-trees could be reused.

Then there is a field that has looked at generating behaviour trees or decision trees by having a teacher, often a human one. These approaches allow a teacher to generate correct classifications for a problem that the tree can then be fitted to. This makes it possible to generate more data for the problem and therefore get more training data to use.


Even though there exists much research on how to extract information from shallow ANNs and how to learn BTs, no research was found that looked into using deep RL to train an agent and then extract the behaviour into a BT.

Many of the existing techniques today are used to generate DTs that describe an ANN or a DNN. One of the major problems with using many of these techniques is that the behaviour this thesis wants to extract is not only hidden in the network of the agent, but also relies on the selection of the best action from the information provided by the network. A guess is that it is not as useful to extract the Q-function and model it as a BT; the goal is that the whole behaviour can be contained in the BT. Therefore, techniques that use a decompositional approach might need some tweaking before they can be used. The mechanism of action selection is known, and therefore mapping from the state-and-action-to-reward function to a state-to-action mapping might be feasible. This gives an advantage to the pedagogical methods, which do not need to rely on the structure of the network and which make a wrapper implementation easier. This can then hopefully create a more interpretable behaviour. Research has shown that LfD can be a valid method for creating DTs and BTs that capture the demonstrated behaviour. If a human can teach a computer, a computer might also be able to teach another computer. If this is possible, then a behaviour learnt with one technique might be transferable to a DT and a BT. This creates an opportunity that is not dependent on the base technique used for learning, and opens up more options.


4 Method

From the related research presented in chapter 3, a method is devised to try to solve the problem presented by the research questions. The first step, before creating the algorithms, is to define how they should be evaluated once they are created. After the evaluation is defined, the extraction algorithms can be created. As seen in the Related works chapter, many have used a teacher to create a DT, and since BTs are more complex they can model a DT; this can then be the first step in the extraction. To be able to test the extraction algorithms, a DQN agent trained on a simulator is needed, which is defined last in this chapter.

4.1 Evaluation

The first step of the experiment was to take the created DQN agent and use it to generate a set of actions and states. In a simulation the agent visits a number of states and performs an action in each, and these pairs were recorded. The aim was to get over 100 000 state and action combinations for each simulator. For cart pole this was done by running 1 000 simulations, and for grid world 30 000 simulations were run. The difference between the simulators is needed because they involve different numbers of steps when an agent completes the scenario. In cart pole the agent always completes the scenario, which is run for 500 steps and is only shorter if the agent does not complete it. For grid world the reverse is true: the scenario is cut short if the agent completes it. As the grid world's size is 5x5, it can be completed in a small number of steps. After the data was collected for each simulator it could be used as training data by the algorithms. Each algorithm also had access to the simulator and the DQN agent. This means that they could run experiments on the simulator and see what the right thing to do for a simulator run should be.
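The collection loop might look like the following sketch (illustrative only; a Gym-style reset/step interface and a dqn_action helper are assumptions, whereas the thesis runs its simulators in Matlab):

    def collect_demonstrations(env, dqn_action, n_episodes):
        """Record (state, action) pairs from the trained DQN agent.

        env: simulator with a Gym-style reset()/step() interface (assumption).
        dqn_action: maps an observation to the agent's greedy action (assumption).
        """
        states, actions = [], []
        for _ in range(n_episodes):
            obs, done = env.reset(), False
            while not done:
                action = dqn_action(obs)
                states.append(obs)
                actions.append(action)
                obs, reward, done, info = env.step(action)
        return states, actions

    # e.g. 1 000 episodes for cart pole and 30 000 for grid world, giving
    # over 100 000 state-action pairs per simulator.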

Evaluation metrics

To help answer the research questions, some measurements had to be made. The metrics chosen to evaluate the models are Stability, Complexity, Accuracy and Performance. These were chosen as follows: Accuracy can be used to help answer Research Question 1, Complexity can be used to help answer Research Question 2, and Performance can be used to help answer Research Question 3. Stability was chosen because it can help answer how dependent the models are on randomness, and whether the same result can be expected if a model is run with another seed. A stable model shows that the result of the model is not based on a lucky creation of the model and that the results are reproducible. So even though Stability might not directly help any of the research questions, it is deemed important and therefore used.

Stability

The evaluation step was performed 30 times. 30 was chosen because the Central Limit Theorem states that, with many samples, the distribution of the sample mean can be approximated by a normal distribution [1], and a good number to aim for is 30 or more. The samples must also be independent random variables that have the same expected value and the same standard deviation. Because the samples are drawn, or simulated, from the same function (simulator), they can be assumed to fulfil these criteria. This means that the estimated mean value and standard deviation are part of a normal distribution. Furthermore, to check how stable a model is between runs, the tests were run five times with different random seeds. The extracted trees can then be compared to see if they are the same, and the mean and standard deviation can be compared to see if they are similar. If the trees are the same and the values are close, the tree is stable. When a DT was used by a model, it was the same DT used by the other models; the DTs were generated and then saved so that they could be used by all the models that need a DT. As each model is run five times, five normal distributions are obtained; to summarise the distributions easily, a mean was calculated for each value.

Complexity

To measure the complexity of a model, the number of nodes in the tree was counted. More nodes mean more things to check and understand when examining the model, so the node count can be used as an estimate of the interpretability: fewer nodes could make it easier to analyse and understand what the BT will do. This is true to a certain extent. The effort needed to understand two trees that both have hundreds of nodes can be considered to be in the same range, but compared to the effort of understanding a tree with tens of nodes, it would be in another range. So if two trees are compared and one has 10 nodes and the other 15, it might be hard to say which is easier to analyse, depending on the values and the structure of the trees, even though the argument can be made that 10 nodes is probably easier to analyse and understand since each node can be given more time than with 15 nodes. However, if a tree with 100 nodes and a tree with 10 nodes are to be analysed and understood, it is clearer that the smaller tree is easier to analyse and understand. So it could be hard to strictly order the trees based on the number of nodes, but it should be possible to order them into different ranges of effort needed to understand them.

Accuracy

To evaluate the accuracy of the tree its actions were compared to the actions the DQN agent took in the same scenario. This was done by giving the BT a set of observations from a scenario and letting it choose an action for each observation. The chosen actions were then compared to the actions that the DQN agent performed for the same observations. If an action was the same it was counted as a one, otherwise it was marked as a zero. To calculate the accuracy the vector of comparisons was summed and divided by the total number of observations. This gives the accuracy of the tree compared to the original model.
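In other words, accuracy is the fraction of observations where the two models agree. A minimal sketch, assuming both models expose an act(observation) method (an illustrative name, not the thesis interface):

    import numpy as np

    def accuracy(tree, agent, observations):
        matches = [1 if tree.act(obs) == agent.act(obs) else 0 for obs in observations]
        return np.sum(matches) / len(observations)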

Performance

To see if the extracted tree has the ability to complete the simulator scenario it was run on the simulator. The score of the tree was recorded, compared with the score of the DQN agent, and normalised by dividing the two scores. Note that, as the DQN agent is used as a baseline, it is possible for the model to get a score higher than one if it performs better than the DQN agent. The average score could then be compared to see if the tree lost performance relative to the DQN agent. To allow a fair comparison a random seed was selected and used for both the tree and the agent in each run. This makes sure that both the tree and the DQN agent start in the same state, as the reset function is random for both simulators.
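A minimal sketch of this normalised score is given below; run_episode and the env.seed interface are assumptions standing in for the thesis implementation, and values above 1 mean the tree scored higher than the DQN baseline.

    def normalised_performance(env, tree, agent, seed, run_episode):
        env.seed(seed)
        tree_score = run_episode(env, tree)
        env.seed(seed)                      # identical start state for the baseline run
        agent_score = run_episode(env, agent)
        return tree_score / agent_score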

4.2 Extraction Algorithms

As seen in chapter 3 there exist many different techniques for extracting and learning behaviours. This means that there are multiple ways to reach the goal of extracting the behaviour from the DQN. One way is to use the techniques for extraction of rules and DTs. Most of these techniques follow the decompositional approach. This can be a problem for DQNs, which often have tens or even hundreds of layers, where each layer might consist of tens or hundreds of nodes. A large number of nodes then needs to be processed, which takes time. This makes the decompositional approaches less appealing in comparison to the pedagogical approaches. A research field that was mentioned in chapter 3 is LfD. Much of the research in LfD uses a human as the teacher. However, the only requirement on the teacher is that it can show the right action for a state. As the DQN agent is trained on the problem it can be used as the teacher in the algorithms. Using an already trained agent as teacher often defeats the purpose of training an agent: if there already exists an agent that knows the problem, why is another agent needed? But in this case there exists an agent that has learnt a behaviour that is needed in another kind of model. Therefore techniques from the LfD field can be tested. The techniques in LfD usually fall under the pedagogical approach, as the teacher's model is often not known.

One problem with choosing pedagogical approaches instead of decompositional approaches is that information that exists in the DQN agent is discarded. This may lead to longer training times or worse performance. The advantage of the pedagogical approaches is that the structure of the model does not matter. By ignoring the structure of the model, the approach becomes more general and can be used on other techniques and network structures. In contrast, a decompositional approach would have to be adjusted to work with another network or model. Therefore a focus on pedagogical methods was chosen.

Extracting a Decision Tree

In [12, 5, 23] a DT is fitted, as a mapping from states to actions, to model a behaviour. The fitted DT can then be used for further refinement of the model, which is done in [12], while in [5, 23] the DT itself is used to model the behaviour. One simple way of fitting the DT is to treat the problem as a supervised learning (SL) problem. The states can then be seen as input and the actions as labels, which means that the problem can be solved with C4.5 or CART.

To get the base set of actions and states the DQN agent could act as a teacher. This is done by running the DQN agent on the simulator a number of times, each time saving all the states that are encountered and the action the agent performed in each state. This transforms the problem into an SL problem, which means that the standard algorithms for creating DTs can be used. To create a DT the Matlab classification tree, fitctree, was used. fitctree uses the CART algorithm to create a DT; for more information about the CART algorithm see section 2.1 in the theory chapter. The trees were created with the actions as true labels and the observations as input. After a tree was created it was pruned to a smaller tree.
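The thesis does this with Matlab's fitctree; purely as an illustration, the same fitting step with another CART implementation (scikit-learn's DecisionTreeClassifier, a stand-in rather than the thesis code) would look roughly like this:

    from sklearn.tree import DecisionTreeClassifier

    def fit_decision_tree(states, actions):
        # observations as input, the DQN agent's actions as the true labels
        dt = DecisionTreeClassifier(criterion="gini")   # CART-style binary splits
        dt.fit(states, actions)
        return dt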

The pruning step is an important part of creating the DT, as CART returns a large, often over-fitted, tree. fitctree returns a number of pruning levels, where each level cuts off some part of the tree. The levels go from zero, the full tree with no nodes cut, to a maximum pruning level where only the root node is left. The maximum pruning level is specific to each tree, depending on its size. One option for selecting the best pruning level is to use cross validation to evaluate each level and select the one with the highest accuracy that gives the smallest tree, i.e. the highest such pruning level. Often a drop in accuracy can be accepted to get a smaller tree and avoid over-fitting; the level just before the accuracy threshold is passed is then chosen.

Two main pruning attributes, accuracy and performance, were tested to see how they compare and how they change the extracted tree. Because an oracle and the simulator were available, each pruning level of the tree could be tested with new data. This reduces the need to use cross validation to select the best pruning level; cross validation is more useful when there is only a fixed amount of training data and not enough can be spared for validation. The new data can be called validation data: the tree is validated at each level with data that has not been used in training. The reason for using new data instead of the training data is to avoid over-fitting to the data. If the training data were used again, the tree would be verified on knowledge it already knew.

To select which level to prune the tree at, each level was evaluated with either the accuracy of the model or the reward of the model. The accuracy value was calculated by running the DQN agent on the scenario and recording each action taken in each state. The tree then got to predict the action for the same states. The actions of the two models were compared and an accuracy was calculated by dividing the number of identical actions by the total number of actions. To get the smallest tree, a search was performed starting at the highest pruning level and descending towards zero. The search was stopped when zero was reached or when a level could get a perfect score for 30 runs. This gives two methods for creating a DT that can be used by the algorithms.
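A minimal sketch of this descending search is given below; prune_at_level and evaluate_pruned are assumed helpers standing in for the Matlab prune/evaluation calls, and evaluate_pruned is assumed to return the chosen pruning attribute (accuracy or normalised performance) for one run.

    def select_prune_level(tree, max_level, prune_at_level, evaluate_pruned, n_runs=30):
        for level in range(max_level, -1, -1):           # highest level = smallest tree
            pruned = prune_at_level(tree, level)
            scores = [evaluate_pruned(pruned) for _ in range(n_runs)]
            if all(score >= 1.0 for score in scores):    # perfect score on all 30 runs
                return level, pruned
        return 0, prune_at_level(tree, 0)                # level zero: keep the full tree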

Naive First Solution

The first tested solution was a naive solution from [12], which can be seen in Algorithm 2. This solution essentially maps a DT into a BT without making any changes. A split in the DT is transformed into a fallback node that has two sequence nodes as children. Each sequence node represents whether the split is fulfilled or not; these are called the true sequence and false sequence node. In both cases the sequence node has the condition of the split as its first child, where the false sequence has the negated condition of the split. The second child of each sequence node is the conversion of the next split on the corresponding side of the DT. If a leaf node is reached in the DT an action node is inserted into the BT. As the algorithm just maps the DT into a BT there is no real change in the structure of the tree. The algorithm was chosen as it is easy to implement and it shows that as long as a DT can be created to solve the problem, a BT can be created as well. Even though there is no real advantage in modelling the DT as a BT, it shows that it can be done, and once a base tree is generated other techniques for improving or changing the tree can be used.

An example of the algorithm run on a DT (Appendix Figure A.1) can be found in Appendix Figure A.2.
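A runnable sketch of the naive mapping is given below, assuming a simple DT node with is_leaf(), condition, true_child and false_child fields, and leaves carrying an action; the BT node classes are reduced to plain containers for illustration and are not the thesis implementation.

    class Fallback:
        def __init__(self, children=None):
            self.children = children or []

    class Sequence:
        def __init__(self, children=None):
            self.children = children or []

    class Condition:
        def __init__(self, predicate, negated=False):
            self.predicate = predicate
            self.negated = negated
            self.children = []

    class Action:
        def __init__(self, action):
            self.action = action
            self.children = []

    def naive_dt_to_bt(node):
        """Map a binary DT directly onto a BT without changing its structure."""
        if node.is_leaf():
            return Action(node.action)
        true_seq = Sequence([Condition(node.condition),
                             naive_dt_to_bt(node.true_child)])
        false_seq = Sequence([Condition(node.condition, negated=True),
                              naive_dt_to_bt(node.false_child)])
        return Fallback([true_seq, false_seq])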

BT Espresso

The second tested solution was the expanded algorithm from [12] called BT Espresso. The pseudo-code of the algorithm can be seen in Algorithm 1, which can be found in chapter 3.5; however, some changes to the algorithm were made. Instead of using a parallel node as the base node, a fallback node was chosen. This was done as the implementation of the simulator only allows one action to be chosen, and the way the condition nodes are set up means that only one action will be activated at each tick. Even though the change should not affect the performance of the current implementation, it was made for ease of implementation: using a parallel node would have meant extra time spent implementing it instead of evaluating the current algorithms or implementing new ones. The other change is that the sequence node called seq is added as a child to the root node instead of to the fallback node called or. These changes were made as the original algorithm never adds the sequence


Algorithm 2 Naive implementation of a DT to BT conversion, from [12].

function NAIVE_DT_TO_BT(node)
    if IS_LEAF_NODE(node) then
        return node.action
    else
        root ← Fallback node
        true_seq ← Sequence node
        false_seq ← Sequence node
        true_cond ← Condition(node.condition) node
        false_cond ← Condition(¬node.condition) node
        true_action ← NAIVE_DT_TO_BT(node.true_child)
        false_action ← NAIVE_DT_TO_BT(node.false_child)
        true_seq.add_children(true_cond, true_action)
        false_seq.add_children(false_cond, false_action)
        root.add_children(true_seq, false_seq)
        return root

node to anything. The fallback node called or is also already added as a child to the sequence node seq. From the description and the example tree given in the report, a sequence node is the parent of the action node and of the converted rule that protects it. The last change made was that the or and act nodes are added as children to the seq node after the loop instead of before. This is because the or node should have all of its children added before being assigned to the seq node. The new algorithm can be found in Algorithm 3.

As there is, as far as the author can tell, no native way in Matlab to transform a set of rules into DNF, the Python package Sympy and its function to_dnf were used (https://docs.sympy.org/latest/modules/logic.html). This was achieved by using Matlab's functionality to call out-of-process Python code.
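As a minimal example of this DNF step, using Sympy directly in Python (the rule below is illustrative, not one extracted in the thesis):

    from sympy import symbols
    from sympy.logic.boolalg import to_dnf

    a, b, c = symbols("a b c")
    rule = (a | b) & (a | c)               # a rule from the DT expressed as a formula
    print(to_dnf(rule, simplify=True))     # minimised DNF, e.g. a | (b & c)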

An example of the algorithm run on a DT (Appendix Figure A.1) can be found in Appendix Figure A.3.

BT Espresso Simplified

Some further simplification of the tree generated by the BT Espresso algorithm was made. This was possible for two main reasons: the parallel node was changed into a fallback node, and the action nodes have no chance to fail. The fallback node, in combination with the fact that an action has to be chosen at each tick, means that if no earlier child succeeds, the last child sub-tree must be fulfilled and its action taken. This can be reasoned from the DT, which always chooses an action and which is transformed into the BT without discarding any information: the DT is first transformed into rules that describe the same action space, these are then transformed into DNF, which still spans the same action space, and lastly the DNF rules are transformed into a BT with the BT Espresso algorithm, which again spans the same action space. As the sub-tree that protects the last action will always return Success when it is reached, it can be removed. The advantage of this is that the final tree has fewer nodes than the one from the full algorithm. The changes to the algorithm can be seen in Algorithm 4.

An example of the algorithm run on a DT (Appendix Figure A.1) can be found in Appendix Figure A.4.



Algorithm 3 Changed BT-Espresso algorithm for converting a DT to a BT, main ideas from [12].

function BT_ESPRESSO(dt)
    rules ← DT_TO_RULES(dt)
    rule_dnfs ← LOGIC_MINIMIZER(rules)
    root ← Fallback node
    for dnf in rule_dnfs do
        seq ← Sequence node
        act ← dnf.action
        or ← Fallback node
        for minterm in dnf do
            and ← Sequence node
            for predicate in minterm do
                cond ← Condition(predicate) node
                and.add_child(cond)
            or.add_child(and)
        seq.add_child(or)
        seq.add_child(act)
        root.add_child(seq)
    return root
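A runnable counterpart of Algorithm 3 is sketched below, reusing the plain BT node classes from the naive sketch above; rule_dnfs is assumed to be a list of (action, minterms) pairs, where each minterm is a list of predicates already minimised to DNF.

    def bt_espresso(rule_dnfs):
        root = Fallback()
        for action, minterms in rule_dnfs:
            seq = Sequence()
            or_node = Fallback()
            for minterm in minterms:
                # each minterm becomes a Sequence of its condition nodes
                and_node = Sequence([Condition(p) for p in minterm])
                or_node.children.append(and_node)
            seq.children.append(or_node)           # DNF guard first
            seq.children.append(Action(action))    # then the protected action
            root.children.append(seq)
        return root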

Algorithm 4 Simplified version of the changed BT-Espresso algorithm for converting a DT to a BT, main ideas from [12].

function BT_ESPRESSO_SIMPLIFIED(dt)
    rules ← DT_TO_RULES(dt)
    rule_dnfs ← LOGIC_MINIMIZER(rules)
    root ← Fallback node
    for dnf in rule_dnfs do
        seq ← Sequence node
        act ← dnf.action
        seq.add_child(act)
        if dnf is not last rule then
            or ← Fallback node
            for minterm in dnf do
                and ← Sequence node
                for predicate in minterm do
                    cond ← Condition(predicate) node
                    and.add_child(cond)
                or.add_child(and)
            seq.add_child(or)
        root.add_child(seq)
    return root

References
