
Master’s thesis, 30 ECTS Engineering Physics

Autumn term 2018

Comminution control using reinforcement learning

Comparing control strategies for size reduction in mineral processing

Mattias Hallén


Comminution control using reinforcement learning

Comparing control strategies for size reduction in mineral processing

Mattias Hallén

Department of Physics
Umeå University
Umeå, Sweden 2018


Comminution control using reinforcement learning

Comparing control strategies for size reduction in mineral processing

Mattias Hallén

© Mattias Hallén, 2018.

Supervisor: Max Åstrand, ABB Corporate Research
Examiner: Martin Servin, Department of Physics

Master's Thesis 2018
Department of Physics
Umeå University
SE-901 87 Umeå, Sweden
Phone: +46 90-786 50 00


Abstract

In mineral processing, the grinding process is a central part since it is often the bottleneck of the concentrating process; small improvements may therefore lead to large savings. By implementing a Reinforcement Learning controller, this thesis aims to investigate whether it is possible to control the grinding process more efficiently compared to traditional control strategies. Based on a calibrated plant simulation, we compare existing control strategies with Proximal Policy Optimization and show a possible increase in profitability under certain conditions.

Keywords: Reinforcement Learning, Mineral processing, Process control, Comminution


Sammanfattning

In mineral processing the grinding process is a central part, since it is often the bottleneck of the concentrating process. Small improvements can therefore lead to large savings. By using Reinforcement Learning, this thesis aims to investigate whether it is possible to control the grinding process more efficiently compared to traditional control. Based on a calibrated simulation model of the grinding process, we compare the existing control strategy with Proximal Policy Optimization and show an increase in profitability for certain operating cases.


Artificial intelligence (AI) concerns how a computer can be instructed to learn something. Even though the history of the field stretches back decades, increased available computational capacity and new methods have given it a major boost in recent years. This work focuses on Reinforcement Learning (RL), a subfield of AI centred on training a computer agent to maximize a reward: given a set of possible decisions, the agent should make the decisions that maximize this reward in the long run. By designing a reward signal that corresponds to the target behaviour, the implementer intends for the agent to find its own way of reaching the goal, and thereby learn behaviour patterns that may be hard to program by hand.

In this work, an agent is implemented to control a grinding process in a concentrator plant for mineral extraction. The grinding process is an important part of the concentration process, since efficient grinding is a prerequisite for a high yield of extracted mineral from the mined rock.

The grinding process in focus here is today controlled with traditional control theory designed by people with extensive domain knowledge. The whole process, including the existing control, is available as a simulation model, which makes it possible to train the implemented agent on the simulated model instead of on the real plant. This gives great freedom, since risks such as mechanical damage and production losses are absent. The goal of this work is to see whether this technique is suitable for controlling the grinding process, and whether the trained agent can find new ways of controlling the mill process that yield higher profitability or efficiency compared to the existing control.

The results in this report confirm that the technique works for finding more profitable control strategies on the simulated model under certain operating conditions. The method does, however, require further development before it can be applied to a real grinding process.


Acknowledgements

First, I would like to express my gratitude to Max Åstrand at ABB Corporate Research for his continuous support, spot-on remarks, and never-ending engagement during this thesis.

I would also like to thank Johannes Sikström at Boliden for always being available to answer my questions and Anders Hedlund at BI Nordic for his early findings on this concept.

In helping me understand the working principle of mill circuits I would like to direct my gratitude to Patrik Westerlund at Boliden, and to Erik Klintenäs at ABB for his support on mill circuit control.

I am grateful for the valuable help received from Kateryna Mishchenko at ABB Corporate Research. My thanks also go to Optimation and Magnus Aråker for lending me their simulation model, which made this thesis possible.

Finally, I must express my gratitude to Martin Servin for his work as examiner.

Mattias Hallén, Umeå, October 2018

Contents

1 Introduction
  1.1 Motivation
  1.2 Approach
  1.3 Limitations
2 Theory
  2.1 Mineral Processing
  2.2 Grinding process
  2.3 Reinforcement learning
    2.3.1 Components of a Reinforcement learning agent
    2.3.2 Challenges in RL
    2.3.3 Q-Learning
    2.3.4 PPO
3 Method
  3.1 Environment
    3.1.1 Model
    3.1.2 Architecture
  3.2 Agents
    3.2.1 Q-Learning
    3.2.2 PPO
4 Results
  4.1 Architecture performance
  4.2 Q-Learning Inverted FMU pendulum
  4.3 Q-Learning Primary mill
  4.4 PPO Mill Circuit
    4.4.1 Maximize throughput
    4.4.2 Maximize profit
5 Discussion
  5.1 FMU Architecture
  5.2 Implementation
  5.3 Parameter tuning
  5.4 RL at a higher level
  5.5 Other mining subprocesses
6 Conclusions
  6.1 Future work
7 References
A Appendix
  A.1 Profit function
  A.2 PPO training with randomized ore properties
  A.3 PPO sample episode maximize profit

1 Introduction

This thesis focuses on Reinforcement Learning (RL), which is a field in machine learning. The main framework for RL is a decision-making process in which the goal is to maximize reward by reinforcing actions of an agent that historically gave high rewards. Thus, by designing rewards, the implementer may define an end goal for behaviour without needing to specify how to reach that goal.

Reinforcement learning has a history of successes in various games, beating the strongest human players. Notable examples are Backgammon, Checkers, Go and various Atari games. Due to increased computational capacity, Deep Reinforcement Learning, in which RL techniques are combined with deep artificial neural networks, has become increasingly popular in recent years. These methods enable solving more complex problems, such as beating Atari games by feeding the algorithm only pixel images.

However, RL has few documented cases in process control, and this thesis aims to help cover this gap. The process in question is the ore grinding (comminution) process within a mineral processing plant. A simulated model of the grinding process is acquired from a previous project, and the approach is to construct an RL agent which controls the simulated environment and then compare the performance of this agent with the existing control strategy involving PID controllers.

1.1 Motivation

Grinding mills account for a large part of the cost of running a mineral processing plant. A sample breakdown of costs for a typical copper concentrator estimates the cost of the grinding circuit alone to be 47% of the total cost per ton of concentrate for the entire process, including extra costs such as water and laboratories [1]. This cost is mostly due to the energy required to drive the grinding mills.

Another factor is that the ground particle size effectively determines the maximum performance of the concentrating process. The particles must be ground to a correct size, since too large or too small particles hinder the subsequent concentration process.

Due to these two factors, the grinding process is an interesting optimization problem in which small improvements lead to large savings, and the goal is to see if Reinforcement Learning may be suitable for controlling mill circuits and possibly other process sections.

One interesting aspect of an RL control problem is that the formulation is essentially reversed. When designing a PID controller, the method is to control some quantity and thereby achieve some kind of goal, e.g. more effective grinding by keeping the quantity at a set point. In the case of an RL controller, the goal and the method coincide: you receive reward for grinding more effectively, and how you do it is irrelevant. The hope is that this would reduce engineering time and make it possible to find previously unknown operating regimes that are unexpectedly beneficial.

Another benefit is that a reinforcement learning controller could be implemented to adapt to changing operating conditions continuously. Wear and tear of machinery and differing ore types would ideally be handled by the RL controller which never ceases to learn from new experience.

1.2 Approach

Reinforcement learning is a field in machine learning focused on learning by interaction, a process similar to how humans and animals learn. An example of learning by interaction is a child touching a hot stove: touching the hot stove produces a negative feeling, and the child is less inclined to touch the stove again. More formally, the agent (the child) is in a state close to the stove, takes an action in the environment by moving its hand and touching the stove, and thereby receives a new state from the environment, in which the hand is touching the stove, together with a negative reward for touching a hot surface. A schematic view of this process can be seen in Figure 1.


Figure 1 – Overview of a Reinforcement Learning system. Image credit1

However, RL does not focus on how humans learn but rather on how to construct algorithms capable of learning from an environment. Explicitly, the goal is to maximize the accumulated reward, and by selecting a reward signal corresponding to the desired behaviour, the agent will try to act optimally in order to maximize this reward. An example from process control could be an environment consisting of a tank being filled with fluid at an unknown rate, an agent with actions to control a pump draining the tank, and a reward signal for keeping the tank level at a desired set point. To maximize the reward, the agent will thus try to keep the level at the desired set point.
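To make the tank example concrete, the following is a minimal sketch of such an environment written against the Gym-style reset/step interface used later in this thesis. The dynamics, signal names and numbers are illustrative assumptions, not part of the thesis work.

import random

class TankEnv:
    """Toy tank-level environment: inflow is unknown, the agent sets the pump outflow."""
    def __init__(self, setpoint=0.5, dt=1.0):
        self.setpoint = setpoint   # desired tank level (fraction of a full tank)
        self.dt = dt
        self.level = 0.5

    def reset(self):
        self.level = random.uniform(0.2, 0.8)
        return self.level

    def step(self, pump_outflow):
        inflow = random.uniform(0.0, 0.05)           # unknown inflow disturbance
        self.level += (inflow - pump_outflow) * self.dt
        self.level = min(max(self.level, 0.0), 1.0)  # the tank cannot over- or underflow
        # Reward: 1 at the set point, decreasing linearly with the deviation.
        reward = 1.0 - abs(self.level - self.setpoint)
        return self.level, reward, False, {}

An agent would call reset() once per episode and then step() repeatedly with its chosen pump outflow, receiving the new level and the reward at each step.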

In this thesis, reinforcement learning is applied to control a simulated grinding mill circuit in a mineral processing plant. First, a suitable architecture is developed which integrates computer simulations, standardized using the Functional Mock-up Interface (FMI), with reinforcement learning algorithms. This architecture is then used as the framework for applying the various reinforcement learning algorithms which control the mill circuit.

1https://commons.wikimedia.org/wiki/File:Reinforcement_learning_diagram.svg


1.3 Limitations

The RL controller will only be implemented on a simulated model, which is considered to be correct and calibrated, and not on the real application. For performance comparison, the data from the existing PID control strategy will also be acquired from simulation. The comparison will only include the existing PID control and will not include more advanced control methods such as linear-quadratic regulators or model predictive control.


2 Theory

This section is divided into three parts. The first part covers the general concepts behind mineral processing, the second part covers the theory underlying grinding circuits in mineral processing, and the third part covers reinforcement learning theory.

The reinforcement learning part covers the two algorithms used in this thesis, Q-Learning and PPO, starting with an introduction to the general decision-making framework of Markov Decision Processes, followed by an introduction to the different components of a reinforcement learning algorithm, before delving into the theory of the two selected algorithms.

2.1 Mineral Processing

Mineral processing is a term describing the various methods used to extract minerals from their respective ores. This can be seen as a concentration process, as the minerals are typically surrounded by or mixed with unwanted material known as the gangue, and the ultimate goal is to increase the concentration of the specific mineral.

The operation of a mineral processing plant can be divided into four different unit operations listed in the order of operation from raw material to finished product. [2]

1. Size reduction incorporates crushing and grinding of mineral ores, a process known as comminution. It has a three-fold goal: liberate valuable minerals from their gangue, increase the surface area of the minerals for increased reactivity, and facilitate the transport of ore particles.

2. Size separation often follows size reduction, since the particles typically need to be handled differently depending on particle size. It is accomplished using various classifiers such as screw classifiers, trommel screens and cyclones.

3. Concentration utilizes the different physiochemical properties of minerals and gangue to increase the concentration of the desired mineral. This can be done in various ways, such as gravity concentration, magnetic concentration or froth flotation. For example, the operating principle of froth flotation is to render the wanted material hydrophobic by adding chemicals and then aerate a water mixture of this material to produce bubbles, which the hydrophobic material sticks to, enabling collection of the material from the rising bubbles.

4. Dewatering reduces the water content of the mineral and water mixture; it is accomplished using various thickeners and filters.

In assessing the efficiency of a mineral processing plant, besides looking at profit, an interesting factor is the metallurgical efficiency, measured by recovery and grade. Recovery describes the percentage of valuable material extracted from the input ore feed, where 100% recovery would mean that all minerals are extracted from the input feed. Grade is the final measure of concentration of the concentrated product.

Although tempting, the best approach is not always to maximize the recovery rate. Maximizing recovery without obtaining an acceptable grade leads to an unsellable product [2], and therefore a balance between grade and recovery always has to be struck.

2.2 Grinding process

The main objective of a grinding process coupled with froth flotation is liberation of the valuable minerals from their associated gangue. Specifically, the goal is to maximize the exposed mineral surface in order to make froth flotation efficient [1]. Figure 2 shows a typical mineral locked into gangue and a sample breakdown.


Figure 2 – A typical ore particle with gangue in white, valuable mineral in black and sample breakage lines shown as dotted grey lines. Also shown are typical smaller particles after this larger particle has undergone size reduction through grinding. In the ideal case these smaller particles would consist only of gangue or only of valuable mineral.

Although it may be tempting to grind the ore as finely as possible, there exists a lower limit beyond which it becomes infeasible to grind further; this comes down to two factors. Firstly, the increase in surface area from splitting a particle decreases as the particle gets smaller, so it costs more energy to grind the particles finer. Secondly, when froth flotation is used as the concentrating step, grinding the material too fine leads to small particles sticking to the air bubbles due to their lightness, mixing gangue with the desired minerals, which may ruin the grade of the recovered product.

To break down the ore particles, large mills are used. The design is basically a large rotating drum that grinds the ore through a combination of three effects: compression, chipping, and abrasion, as can be seen in Figure 3.


Figure 3 – Overview of the operating principle of a grinding mill in a cross section.

The angular velocity of the grinding mill is a determining factor in achieving efficient grinding. If the mill rotates too fast, the centrifugal force makes the ore particles stick to the inside walls, resulting in approximately zero grinding. If the speed is too low, the only grinding occurs through abrasion, which is not as efficient as compression and chipping. A balance in speed must be found at which the hurled particles hit the particles at the bottom, maximizing the compression and chipping effects.

In the implementation of a mill circuit a common design is two staged grinding which operates two grinding mills. The reasoning behind a two-staged grinding process is that grinding finer materials with coarser material is more efficient than grinding comparably sized materials. This is utilized by allowing coarse materials from a primary mill to be selectively added to a secondary mill via a controllable flap gate.

A typical design of a two-staged process consists of two grinding mills, the primary mill (PM) and the secondary mill (SM), and two size separators, a trommel screen and a screw classifier, which determine the material path depending on particle size. Finally, there is a controllable flap gate which alters the material flow, and a mill pump used for transporting the ore/water mixture.


Figure 4 – Mill circuit focused on in this thesis, showing the operating principle of the two-staged grinding circuit. Not depicted in this figure are the water additions at the inputs of the primary mill and the secondary mill and at the mill pump.

The material flow of the mill circuit depicted in Figure 4 starts at the input to the primary mill in which ore is transported via a conveyor belt into the mill where it is mixed with water. This ore is ground and then classified into three different particle sizes in the trommel screen. The finest particles enter the secondary mill, the middle-sized particles enter recirculation to be ground again by the PM, and finally the coarsest particles enter the flap gate. This flap gate can be controlled to add these coarse particles to SM or they can be directed to recirculation to be ground again by PM.

The fine particles, and optionally the coarse particles, added into the secondary mill undergo the same process as in the PM, with further water added. The particle/water mixture exiting the SM is dropped directly into the mill pump tank, from where it is pumped to the final screw classifier. If the particles are fine enough, they are sent on to froth flotation, while the coarse material is sent back into recirculation to be ground again by the PM.


2.3 Reinforcement learning

Reinforcement learning problems are often modelled as Markov Decision Processes (MDPs), a mathematical model for decision making. Given a state, a limited number of actions can be taken; each action has a probability of taking the decision maker to a certain state, and each transition from one state to another yields the decision maker a reward. This can be modelled graphically as shown in Figure 5. Note that we here assume that the states and actions are finite, which means that we are referring to a special case of MDP called a Finite Markov Decision Process.

Figure 5 – A Markov Decision Process consisting of three states in green, each with two actions in red with corresponding transition probabilities. The reward is also shown for the state-action pairs (S1, a0) and (S2, a1), given that the profitable transition occurs. Image credit2


One important aspect is the Markov property, which requires each state to include all information necessary for taking the optimal action without regard to the history of states; more formally, the future is independent of the past given the present. This requirement is important since it enables the decision maker to make decisions based only on the current state. An example is position control of an object by applying an instantaneous force. In order to make a correct decision, a state consisting only of position would not be sufficient, since the object could have a velocity from previous actions. If velocity is included in the state representation, the relevant history of states is captured and the Markov property is fulfilled.

The MDP is central since RL problems are often modelled as MDPs. However, there is a difference between traditional MDP problems and RL: in RL, the model is often completely unknown. If the dynamics of a problem are completely known and sufficient computational power is available, then the optimal policy could be computed using methods from optimal control.

However, since a complete model of the environment is seldom available and computational resources are limited, RL techniques are used. Many RL algorithms are still closely related to dynamic programming. The main difference is that while dynamic programming tries to directly calculate optimal policies on known models, RL techniques try to iteratively approximate policies on unknown models.

2.3.1 Components of a Reinforcement learning agent

To construct an RL agent some main components exist, but depending on the implementation the combination of components differs. A value-based agent consists of a value function and an implicit policy, typically selected greedily from the values of the value function. A policy-based agent directly contains a policy function without a value function. An agent can also optionally contain a model of the environment.

Policy determines the behaviour of the agent. It consists of a mapping from state to action:

π(s) : S → A   (1)

where s is the current state and π(s) is the action proposed by the policy.

Value function represents the value of a state. While a reward is received directly for transitioning to a specific state, the value function represents the expected reward that subsequent states provide to the agent. Thus, it is a more foresighted way of looking at rewards. The value of a state depends on future rewards and is therefore dependent on a policy. The expected value of a state s starting at time t is given by:

V_π(s) = E[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]   (2)

where π is the policy, γ is the discount factor, commonly used to set how much the agent should value immediate rewards compared to future rewards, and R is the reward at each time step. A value function can also be defined as a function of state-action pairs. The expected value of taking action a in state s at time t is given by:

q_π(s, a) = E[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]   (3)

Model of the environment is an optional component. A model is a broad term describing anything that an agent can use to predict how the environment will respond to its actions [3]. It is used for planning, which refers to taking a model as input and producing or improving a policy. This is in contrast to the policy optimization used in value- and policy-based algorithms, which takes a state as input and searches for good actions. Depending on whether an algorithm contains a model, it is referred to as model-based or model-free.

The combination of these different components can be summarized as a Venn-Diagram.



Figure 6 – Classification of RL algorithms depending on their utilized components represented as a Venn-diagram. Figure recreated based on [4]

The diagram in Figure 6 shows the different types of RL agents. This thesis covers two algorithms: Q-Learning, which is a value-based, model-free algorithm, and Proximal Policy Optimization (PPO), implemented by combining a value function and a policy, classifying it as an actor-critic, model-free algorithm.

2.3.2 Challenges in RL

One of the main challenges in RL is the trade-off between exploration and exploitation. To maximize reward, the agent must exploit the knowledge it has of the environment, but in order to gain this knowledge the agent has to explore the environment. The problem is that exclusively pursuing one of these goals is bound to fail; a balance between exploration and exploitation must be achieved. A common method is selecting actions ε-greedily, in which the greedy action is taken with probability 1 − ε and a random action is taken with probability ε to ensure exploration.
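As an illustration of ε-greedy selection over a tabular Q-function, here is a minimal sketch assuming a NumPy array Q of shape (n_states, n_actions); the names are placeholders, not code from the thesis.

import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=None):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit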

Another challenge is how to handle continuous states and actions. In process control this is the most common case, since measurements and control signals are often analogue. With continuous states and actions there is effectively an infinite number of states, and an infinite number of actions for each state, which poses a problem.

The naive way of solving this is to discretize the continuous signals into discrete bins, i.e. one continuous state/action is divided into n discrete intervals, thus providing n discrete states/actions.

For dealing with continuous states, a more refined way is to implement some kind of function approximation, such as a neural network, to enable learning on definite states and generalization to the states in between. Continuous actions are dealt with primarily by implementing policy gradient algorithms.
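A minimal sketch of the naive binning approach described above, using NumPy; the value range and number of bins are illustrative assumptions.

import numpy as np

def make_discretizer(low, high, n_bins):
    """Return a function mapping a continuous value to a bin index in [0, n_bins - 1]."""
    edges = np.linspace(low, high, n_bins + 1)[1:-1]  # interior bin edges
    return lambda x: int(np.digitize(x, edges))

# Example: discretize a measurement in the range 0-300 into 10 bins.
to_bin = make_discretizer(0.0, 300.0, 10)
print(to_bin(175.0))  # -> a bin index between 0 and 9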

2.3.3 Q-Learning

Q-Learning is considered one of the early breakthroughs in RL. It is based on the Bellman equation from dynamic programming and is a suitable algorithm for problems containing few discrete inputs and outputs. The basic idea can be summarized as

Q_updated ← (1 − α) Q_so far + α Q_latest estimate   (4)

where Q is an estimate of the state-action pair value presented in Section 2.3.1. The working principle is to take a weighted average between the value estimated so far and the latest estimate, where the learning rate α decides how much we trust the latest estimate compared to the value estimated so far. More formally, substituting Q_updated and Q_so far by the estimated value Q, and Q_latest estimate by the Bellman equation for Q functions, yields the Q-Learning update

Q(S_t, A_t) ← (1 − α) Q(S_t, A_t) + α[ R_{t+1} + γ max_a Q(S_{t+1}, a) ]   (5)


The max operator selects the highest-valued state-action pair in the next state. This state-action pair is not necessarily the action subsequently selected by the agent, which is why this is referred to as an off-policy update. The following pseudocode shows a complete implementation of Q-Learning.

Algorithm 1: Q-Learning algorithm.

Parameters: learning rate α ∈ (0, 1], exploration parameter ε > 0
Initialize: Q(s, a) arbitrarily for all s and a, except Q(terminal, ·) = 0
while learning is not stopped do
    Choose A from S using a policy derived from Q (e.g., ε-greedy)
    Take action A, observe R, S′
    Q(S, A) ← (1 − α) Q(S, A) + α[ R + γ max_a Q(S′, a) ]
    S ← S′
end

This method updates its state-action values after each step. However, it has been shown that taking several steps before updating can improve performance and convergence rates [3].
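To connect Algorithm 1 to code, the following is a minimal tabular Q-Learning sketch in Python. The environment is assumed to expose a Gym-style reset()/step() interface with discrete states and actions, and all names and default values are illustrative rather than the thesis implementation.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.3):
    """Tabular Q-Learning loop following Algorithm 1 (sketch)."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # off-policy update towards the greedy value of the next state
            target = reward + gamma * np.max(Q[next_state])
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
            state = next_state
    return Q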

2.3.4 PPO

Proximal Policy Optimization (PPO) was proposed by John Schulman et al. [5] in 2017.

In essence it is a policy-based algorithm which is capable of handling continuous inputs and outputs. It is simple to implement and strikes a favorable balance between sample complexity, simplicity and wall-clock time, while performing comparably to or better than state-of-the-art approaches despite being simpler to implement and tune [5].

In this section a value function will also be introduced, which effectively classifies the implementation as an actor-critic algorithm, although the policy optimization part is still purely policy-based PPO.

For continuous control, a common way to represent a policy is to create a neural network representing a Gaussian distribution, outputting a mean and a standard deviation for each possible action. Sampling this distribution for actions enables exploration when the estimate is uncertain (high standard deviation) and exploitation when the network is confident (low standard deviation). Formally, this can be described as a neural network policy πθ(a|s), where θ corresponds to the weights in the network. An example of such a multi-layer perceptron (MLP) Gaussian policy is shown in Figure 7.


Figure 7 – Example policy network for a Gaussian policy distribution with 4 inputs in the input layer, one hidden layer with 5 units and three outputs represented by mean parameters and standard deviation parameters for each output. Note that the standard deviation does not depend on the inputs; an alternative would be to have two separate networks, one for the means and one for the standard deviations, but this design is commonly used. Image credit [6]
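The following is a minimal NumPy sketch of how actions can be sampled, and their log-probabilities computed, from a diagonal Gaussian policy of the kind shown in Figure 7. The tiny two-layer network and all sizes are illustrative stand-ins for the MLP used in the thesis.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative MLP: 4 observations -> 5 hidden units -> 3 action means.
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)
log_std = np.zeros(3)  # state-independent log standard deviations, as in Figure 7

def gaussian_policy(obs):
    """Return a sampled action and its log-probability under the current policy."""
    hidden = np.tanh(obs @ W1 + b1)
    mean = hidden @ W2 + b2
    std = np.exp(log_std)
    action = mean + std * rng.normal(size=mean.shape)   # sample a ~ N(mean, std^2)
    logp = -0.5 * np.sum(((action - mean) / std) ** 2
                         + 2 * log_std + np.log(2 * np.pi))
    return action, logp

action, logp = gaussian_policy(np.ones(4))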

Following [7], general policy gradient methods consider a reward function R(τ) = Σ_{t=0}^H R(s_t, a_t) for a trajectory τ = (s_0, a_0, ..., s_H, a_H), where the trajectory is determined by the action-selecting policy π_θ(a|s), parametrized by θ, and the environment dynamics P(s_{t+1}|s_t, a_t). The expected reward for a trajectory τ under this definition is

U(θ) = E[ Σ_{t=0}^H R(s_t, a_t); π_θ ] = Σ_τ P(τ; θ) R(τ)   (6)

Our goal is to maximize the received reward. Policy gradient methods do this by performing some kind of optimization over the policy parameters, e.g. gradient ascent θ ← θ + α ∇_θ U(θ). Thus we need the gradient ∇_θ U(θ) of this expected reward.

∇_θ U(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)                           (definition of expectation)
         = Σ_τ ∇_θ P(τ; θ) R(τ)                            (swap sum and gradient)
         = Σ_τ P(τ; θ) [ ∇_θ P(τ; θ) / P(τ; θ) ] R(τ)       (multiply and divide by P(τ; θ))
         = Σ_τ P(τ; θ) ∇_θ log P(τ; θ) R(τ)                 (since ∇_θ log P(τ; θ) = ∇_θ P(τ; θ) / P(τ; θ))
(7)

We can approximate this gradient by a gradient estimator ĝ that samples over trajectories:

∇_θ U(θ) ≈ ĝ = (1/m) Σ_{i=1}^m ∇_θ log P(τ^(i); θ) R(τ^(i))   (8)

One problem still exists: our gradient estimator involves P(τ; θ), in which trajectories are determined both by the policy, which we know, and by the environment dynamics model, which we do not know.

∇_θ log P(τ^(i); θ) = ∇_θ log Π_{t=0}^H [ P(s_{t+1}^(i) | s_t^(i), a_t^(i)) · π_θ(a_t^(i) | s_t^(i)) ]
                    = ∇_θ [ Σ_{t=0}^H log P(s_{t+1}^(i) | s_t^(i), a_t^(i)) + Σ_{t=0}^H log π_θ(a_t^(i) | s_t^(i)) ]
                    = ∇_θ Σ_{t=0}^H log π_θ(a_t^(i) | s_t^(i))
                    = Σ_{t=0}^H ∇_θ log π_θ(a_t^(i) | s_t^(i))
(9)

Interestingly enough, we do not need the environment dynamics model. Substituting this back into our gradient estimator ĝ yields

ĝ = (1/m) Σ_{i=1}^m [ Σ_{t=0}^{H−1} ∇_θ log π_θ(a_t^(i) | s_t^(i)) ] [ Σ_{t=0}^{H−1} R(s_t^(i), a_t^(i)) ]   (10)

The expected value of this estimator can be approximated by sampling over several time steps:

ĝ = Ê[ ∇_θ log π_θ(a_t | s_t) R(t) ]   (11)


This is a common gradient estimator form used in policy gradient algorithms. In practice, automatic differentiation software is commonly used and the gradient estimator is defined as the gradient of an objective function. This objective function is called the policy gradient (PG) loss

L^PG(θ) = Ê_t[ log π_θ(a_t|s_t) R(t) ]   (12)

Optimizing this loss function with gradient-based techniques can intuitively be seen as trying to increase the probability of actions that yielded positive rewards. This formulation is problematic since, in the case where all rewards are positive, the probabilities of all actions will be increased. Instead of using R_t, an alternative is the advantage estimator Â = Q(s, a) − V(s), which tells you which actions are better or worse than expected. The advantage function is commonly implemented according to generalized advantage estimation (GAE) [8]. GAE replaces the Q value in the standard advantage formulation with the λ-return [9], which is defined as an exponentially weighted average of n-step returns with a decay parametrized by λ

R_t(λ) = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^(n)   (13)

where R_t^(n) is the n-step return and the hyperparameter λ ∈ [0, 1] sets the exponential rate of decay of future rewards. Substituting the Q value with the λ-return R_t(λ) in the advantage estimator yields:

Â_t = R_t(λ) − V(s_t)   (14)

Using this advantage formulation instead of the reward gives us

L^Advantage(θ) = Ê_t[ log π_θ(a_t|s_t) Â_t ]   (15)

which will increase the likelihood of actions that received more reward (advantage) than expected from the state-value approximation. This type of policy gradient algorithm is commonly referred to as a "vanilla" policy gradient. One problem with this method is that performing multiple optimization steps on the same trajectory can lead to destructively large policy updates [5].

This is what PPO tries to solve: by enabling multiple optimization steps over the same trajectory it increases the sample efficiency dramatically. The loss function utilizes the importance sampling ratio r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) between the updated policy π_θ and the policy which was used to gather the experience, π_θold:

L^CPI(θ) = Ê_t[ r_t(θ) Â_t ]   (16)

Finally, what John Schulman et al. [5] proposed for PPO was a loss function which simply clips the probability ratio with respect to a hyperparameter ε³ and takes the minimum of this clipped term and the original term:

L^CLIP(θ) = Ê[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]   (17)

The motivation behind this is to limit how much the policy can be updated, regardless of how much more or less probable the action would become, in order to counteract too large policy updates. The L^CLIP loss function can be explained intuitively by Figure 8.

Figure 8 – Two plots of a single timestep of the loss function LCLIP as a function of probability ratio r for positive and negative advantage. The red circle shows the starting point of optimization. Image credit [5].

The intuitive explanation of how this clipping loss function works is that it effectively sets the gradient to zero if we try to update too far. In the case of a positive advantage, Â > 0, i.e. a good action, it clips the gradient if you try to make this profitable action too likely, but it does not limit you if you try to make this action less likely. This makes sense since the trajectory is only a local approximation of our policy and not representative of the whole policy. These modifications enable PPO to run optimization several times over trajectories sampled from its old policy π_θold, since learning will effectively stop when the new policy has changed too much according to the clip function, which clips the gradient and thus prevents destructively large policy updates.

³ Note that this ε is not the ε from ε-greedy exploration.
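A minimal NumPy sketch of the clipped surrogate objective in Equation 17, evaluated for a batch of timesteps; an actual implementation would express this in an automatic differentiation framework such as Tensorflow, and the names and numbers below are illustrative.

import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective L^CLIP averaged over a batch (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Example with dummy data for a batch of 4 timesteps.
logp_old = np.array([-1.2, -0.8, -2.0, -1.5])
logp_new = np.array([-1.0, -0.9, -1.7, -1.6])
adv = np.array([0.5, -0.3, 1.2, 0.1])
print(ppo_clip_objective(logp_new, logp_old, adv))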

In the loss function for the policy, an entropy bonus S is also introduced to ensure sufficient exploration [5]:

L_t^CLIP+S(θ) = Ê_t[ L_t^CLIP(θ) + S[π_θ](s_t) ]   (18)

The state values V(s_t) appearing in the advantage estimate (Equation 14) can be approximated using a neural network V(s_t, w) parametrized by w. A common method for updating this state-value approximation is a squared-error loss against the observed return V_t^targ for a state, using the loss function L_t^VF(w) = Ê_t[ (V_w(s_t) − V_t^targ)^2 ]. Summarizing all of this, the PPO algorithm is given by the following pseudocode:

Algorithm 2: PPO, Actor-Critic style [5]

for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy π_θold in the environment for T timesteps
        Compute advantage estimates Â_1, ..., Â_T
    end
    Optimize L w.r.t. θ, with K epochs and minibatch size M ≤ NT
    θ_old ← θ
end

Note from the above algorithm that the experiences are collected using the old policy π_θold and then optimized over in several minibatches before experience collection starts again, which leverages the main advantage of PPO: taking several optimization steps over the same trajectory.


3 Method

This section covers how the simulated model was adapted to suit the needs of a reinforcement learning problem, followed by a description of how the two RL algorithms from the theory section were implemented to act as agents for this problem.

3.1 Environment

The environment is an important part of a reinforcement learning problem and differs from a typical continuous simulation task. The following sections cover how a continuous simulated model was adapted to fit an RL problem.

3.1.1 Model

For simulating the mill circuit, a model had already been implemented in a previous project using the modelling and simulation software Dymola. This model was calibrated and verified against the real application, and we therefore consider it a correct representation of the real plant. Figure 9 shows a schematic drawing of the modelled mill circuit.

Figure 9 – Mill circuit with inputs depicted in green and outputs corresponding to Tables 1 & 2


In this model the existing control strategy, based on several PID control loops, was included. This made it possible to remove single control loops and replace them with the RL controller while keeping the rest of the original control. Although they vary between experiments, the following tables show the possible observations and actions for the RL controller.

Table 1 – Observations (outputs from the model)

Name | Description | Unit | Accepted range
x1 | Primary mill mass | ton | High & low limit
x2 | Primary mill torque | % | High limit
x3 | Primary mill power | MW | High limit
x4 | Secondary mill mass | ton | High limit
x5 | Secondary mill torque | % | High limit
x6 | Secondary mill power | MW | No constraint
x7 | Mass flow return | ton/h | No constraint
x8 | Mass flow return classifier | ton/h | No constraint
x9 | Mass flow out | ton/h | No constraint
x10 | Particle size K80 | µm | No constraint

The motivation for the above observations is to make performance observable, which for a mill circuit mainly means the mass flow out of the circuit and the particle size of this mass flow. Secondly, the main constraining properties were measured to ensure feasible operation, and finally the two return flows were added to ensure that this environment approximately fulfilled the Markov property.

Table 2 – Actions (inputs to the model)

Name | Description | Unit | Range
u1 | Primary mill speed | RPM | Continuous
u2 | Secondary mill speed | RPM | Continuous
u3 | Flap gate | - | Discrete
u4 | Feed flow | ton/h | Continuous


The actions in the above table include all controllable elements in the mill circuit except the water additions for the mills and the mill pump. The water addition control was kept according to the existing control strategy as a simplification. The mill pump control was removed from the model due to the added simulation time of a sampling PID controller in the model. This is an approximation, since at steady state the outward flow from the secondary mill is equal to the outward flow of the mill pump.

3.1.2 Architecture

To utilize already implemented RL algorithms, OpenAI Gym [11] was chosen as a framework. Although it would be possible to implement a custom interface, the Gym framework is becoming an industry standard with many implemented algorithms that work out of the box for Gym environments. Besides having many implemented environments and algorithms, OpenAI Gym provides a standardized interface for how actions are applied to an environment and how observations are returned.

The entire model from Section 3.1.1 was packaged as a Functional Mock-up Unit (FMU), which can communicate via the Functional Mock-up Interface (FMI) to allow simulation outside of the Dymola environment. The FMU was exported as a Co-Simulation model bundled with the Dymola solver Cerk23, a third-order single-step Runge-Kutta integrator. For communicating with the FMU, the Python programming language was chosen due to its wide support for machine learning, and for communication via the FMI standard the Python library pyFMI was used. This FMU was integrated into OpenAI Gym, and a schematic view of how the different components interact can be seen in Figure 10.



Figure 10 – Schematic view of the implemented environment. The arrows correspond to the information flow between the various components.

This architecture provides a suitable abstraction layer for RL problems, in which the RL algorithm is completely separated from the implementation-specific FMU/FMI communication. The most important parts are the methods for taking an action and receiving an observation and a reward. The discrete action step was implemented to run for 60 seconds, giving the implemented algorithms a sample time of 60 seconds, which roughly corresponds to how quickly the simulated mill circuit model reacts.
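The following is a minimal sketch of what the custom environment in Figure 10 could look like: a gym.Env that loads a co-simulation FMU with pyFMI and advances it 60 seconds per action. The FMU file name, signal names and reward are placeholders, and the exact pyFMI call names may differ between library versions; this is an illustration of the architecture rather than the thesis code.

import numpy as np
import gym
from gym import spaces
from pyfmi import load_fmu

class MillCircuitEnv(gym.Env):
    """Sketch of a Gym environment wrapping a co-simulation FMU via pyFMI."""

    def __init__(self, fmu_path="mill_circuit.fmu", sample_time=60.0, episode_length=24 * 3600):
        self.fmu_path = fmu_path
        self.sample_time = sample_time
        self.episode_length = episode_length
        self.model = None
        self.time = 0.0
        # Illustrative signal names; the real FMU exposes the variables in Tables 1 and 2.
        self.obs_names = ["pm_mass", "pm_torque", "sm_mass", "mass_flow_out"]
        self.act_names = ["pm_speed", "sm_speed", "feed_flow"]
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(len(self.act_names),))
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(len(self.obs_names),))

    def reset(self):
        self.model = load_fmu(self.fmu_path)        # reload to reset the FMU state
        self.model.setup_experiment(start_time=0.0)
        self.model.initialize()
        self.time = 0.0
        return self._observe()

    def step(self, action):
        for name, value in zip(self.act_names, action):
            self.model.set(name, float(value))
        self.model.do_step(self.time, self.sample_time)   # advance one 60 s sample
        self.time += self.sample_time
        obs = self._observe()
        reward = self._reward(obs)
        done = self.time >= self.episode_length
        return obs, reward, done, {}

    def _observe(self):
        return np.array([float(self.model.get(n)) for n in self.obs_names])

    def _reward(self, obs):
        return 0.0  # replaced by the experiment-specific reward functions in Section 4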

3.2 Agents

3.2.1 Q-Learning


Since Q-Learning is a tabular method, the continuous observations were discretized by binning and the outputs were discretized into fixed values.

3.2.2 PPO

To overcome the input/output limitations of Q-Learning, the PPO algorithm was implemented. The implementation was based on the algorithm PPO2 contained in OpenAI Baselines [12]. It is built around the machine learning framework Tensorflow [13] and has support for parallelization by running parallel environments using the Message Passing Interface (MPI). This was utilized on a computer with two Xeon X5650 CPUs, which amounted to 23 virtual cores during the training process and thus 23 parallel environments. This implementation optimizes the loss function using the Adam optimizer built into Tensorflow, and the policy network is implemented as an MLP Gaussian policy distribution.
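A sketch of how training could be launched with PPO2 from OpenAI Baselines, using the hyperparameters of Table 4 and the environment sketch from Section 3.1.2. It assumes the post-2018 Baselines API and a single non-MPI environment for brevity (the thesis used 23 parallel environments), and it uses a constant learning rate instead of the linear decay; argument names may differ between Baselines versions.

# Assumes MillCircuitEnv from the sketch in Section 3.1.2 is importable.
from baselines.ppo2 import ppo2
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: MillCircuitEnv()])   # single env here; 23 parallel envs in the thesis

model = ppo2.learn(
    network="mlp",
    env=venv,
    total_timesteps=200_000,
    nsteps=115,
    nminibatches=5,
    noptepochs=4,
    ent_coef=0.0,
    lr=3e-4,            # constant here; the thesis used a linear decay
    vf_coef=0.5,
    max_grad_norm=0.5,
    gamma=0.8,
    lam=0.95,
    cliprange=0.3,
    num_layers=2,       # policy network: 2 hidden layers
    num_hidden=32,      # 32 units per layer
)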


4 Results

4.1 Architecture performance

When interfacing the FMU described in Section 3.1.1, three different approaches were evaluated. The approaches differ in how the simulation command is sent to the FMU. The first design was based on sending initial values to the model as inputs and then simulating the model for a time corresponding to the sample time, repeating this process throughout each simulation episode. This provided a clean and intuitive call to the FMU, but it proved quite slow because the environment was re-initialized for each sample. This was called the Start/Stop architecture and is further described by Figure 11.


Figure 11 – Start/Stop architecture utilizing the model.simulate() method in pyFMI library.

In an effort to decrease simulation time, a new design was implemented which initialized the model once for the whole simulation episode. To interface this simulation, a callback function was implemented which interrupted the simulation at fixed intervals, at which new actions were provided and observations acquired. This provided a boost in performance, but it made the implementation more complex by introducing thread synchronization issues. This method was named the Callback architecture and is further described by Figure 12.


Figure 12 – Callback architecture utilizing the model.simulate() method in pyFMI library but with a callback function as an input function to halt the simulation for each sample until the RL agent has provided a new action.

To avoid the added complexity of thread synchronization, a final method was implemented. The main principle was dissecting the simulation method from the Start/Stop architecture to enable more control over the simulation. This resulted in a method where the model is manually initialized once for the whole simulation episode, after which the integrator is called manually to solve the system until one sample time is reached, and then waits until new actions are provided. This method was named the Step architecture and is schematically drawn in Figure 13.


Figure 13 – Step architecture utilizing the model.do_step() method in the pyFMI library with manual initialization of the model.
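The essence of the Step architecture is the loop sketched below, which is also the mechanism behind the step() method in the environment sketch of Section 3.1.2: initialize the FMU once per episode and then call the integrator one sample at a time. File and signal names are placeholders, and pyFMI call names may differ slightly between versions.

from pyfmi import load_fmu

SAMPLE_TIME = 60.0          # one RL step corresponds to 60 s of simulated time
EPISODE_LENGTH = 24 * 3600  # e.g. a 24 h episode

model = load_fmu("mill_circuit.fmu")     # co-simulation FMU, placeholder file name
model.setup_experiment(start_time=0.0)   # manual initialization once per episode
model.initialize()

t = 0.0
while t < EPISODE_LENGTH:
    # Actions from the RL agent are written to the FMU inputs here, e.g.:
    # model.set("pm_speed", action[0])
    model.do_step(t, SAMPLE_TIME)        # integrate one sample forward
    t += SAMPLE_TIME
    # Observations are then read back, e.g.: model.get("pm_mass")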

In order to choose between the three architectures a sample simulation was performed, and the results are summarized in Figure 14.


Figure 14 – Solving time for simulating mill model 0-300s with 30s/step using integrator Cerk23 with different FMU architectures.

As can be seen, the Step architecture is the fastest of the three, and it strikes a good balance between implementation complexity and speed, which made it a natural choice.

4.2 Q-Learning Inverted FMU pendulum

To verify that the FMU architecture was working as intended, a model simulating an inverted pendulum was developed in Dymola and exported as an FMU. This was chosen because it could be verified against the OpenAI Gym environment CartPole, depicted in Figure 15, which has the same dynamics although it is solved inside OpenAI Gym.

Figure 15 – Render of the CartPole environment from OpenAI Gym. The frictionless cart can be pushed to the left or right to keep the pole balanced.


The task is to balance the pole upright by applying an instantaneous force on the cart to the right or left, in the perspective of this image. A Q-Learning algorithm was implemented which worked on the OpenAI Gym CartPole environment, and this algorithm was then applied to the FMU pendulum. In Figure 16 we can see the training progress for Q-Learning applied to the FMU pendulum. In this training the agent receives a reward of 1 for each timestep that the pole is upright. If the pole deviates too much from the centre, the episode is terminated and the agent thus receives a lower reward for the episode. The goal is to reach 200 timesteps without termination, and the problem can be considered solved if this point is reached. The Q-Learning algorithm worked equally well on the FMU pendulum as on the CartPole environment, and after 150 episodes the problem could be considered solved.

Figure 16 – Training progress for Q-Learning on FMU Pendulum showing episodic reward.

The FMU architecture worked as intended and after this small example the next logical step was to switch the CartPole FMU to the mill circuit FMU.


4.3 Q-Learning Primary mill

To test the FMU mill model, one of the first experiments was implementing Q-Learning to control the total mass of material in the primary mill. This case was selected since it should be relatively simple to find a working mill speed with constant feed that would provide the correct mass. If the mill circuit did not contain any recirculation this would be a trivial problem, but since some recirculation does exist the problem is still interesting. The same Q-Learning algorithm from the FMU pendulum problem was implemented, with slight adjustments to the inputs/outputs of this problem.

The input to the agent was the mass of material in the primary mill, and the action was to control the rotational velocity of the primary mill. Since Q-Learning is a tabular method, these values were discretized by binning. Table 3 lists all hyperparameters for this algorithm. The feed to the mill was kept constant and the existing PID control strategy was kept for the secondary mill. The reward was designed to give +1 if the mass was kept within a small interval around the desired set point, and to decrease linearly as the mass deviated from this value.

Table 3 – Hyperparameters for Q-Learning controlling the mass of material in the primary mill

Parameter | Value
Exploration rate (ε) | LinearDecay[0.3, 0]
Learning rate (α) | 0.1
Discount factor (γ) | 0.9
Input bins | 10
Output bins | 10
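A minimal sketch of the reward shaping described above, giving +1 inside a small band around the set point and a linearly decreasing reward outside it; the set point, band width and slope are illustrative assumptions since the exact values are not stated.

def mill_mass_reward(mass, setpoint=80.0, band=2.0, scale=40.0):
    """+1 inside the band around the set point, decreasing linearly with the deviation."""
    deviation = abs(mass - setpoint)
    if deviation <= band:
        return 1.0
    return max(1.0 - (deviation - band) / scale, -1.0)  # bound the penalty

print(mill_mass_reward(75.0))  # example call with an illustrative mass value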

This Q-Learning implementation was applied to the FMU architecture using the mill circuit model, and the training progress can be seen in Figure 17.


Figure 17 – Training progress for Q-Learning on controlling primary mill mass by varying primary mill speed with constant feed showing episodic reward.

The problem can be considered solved and the algorithm achieves approximately maximum possible reward after 200 episodes. In Figure 18 a sample episode from this training process can be seen.


Figure 18 – Episode 230 from the training process, showing the input mill mass and the output mill speed from the Q-Learning algorithm.

In analysing the performance of episode 230 in Figure 18, the episode can roughly be divided into two stages. The first stage is roughly before time step 30, when the mass of the mill and the mill circulation is building up, and the second stage is after time step 30, when the mill mass has reached the desired set point. In the first stage the algorithm keeps the mill speed low until the mass has roughly reached the set point, and in the second stage the algorithm starts trying to keep the mill mass around this set point.

One interesting observation is that the emergent control strategy is bang-bang in nature, switching between a high and a low output signal. This is reasonable in this case, since the algorithm is only rewarded for keeping the mass of material in the primary mill within a fixed interval.


4.4 PPO Mill Circuit

Although the results from applying Q-Learning in Section 4.3 were promising, the limitation to discrete inputs and outputs made it infeasible to extend the control problem to include more inputs and outputs. To handle this, PPO was implemented, which handles continuous inputs and outputs.

4.4.1 Maximize throughput

This experiment aimed at maximizing the throughput of the mill circuit, disregarding particle size completely. To accomplish this, the reward function was designed to give +1 for each time step that the mass flow out of the circuit was at its theoretical maximum, decaying linearly below that. Since a Gaussian policy was used, which has no action bounds, all actions were clipped in the environment according to the action bounds. Furthermore, to discourage actions outside of the bounds, a reward penalty was added to the reward function. The size of this penalty was controlled by a parameter named the penalty factor.
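A minimal sketch of this reward design, combining the normalized throughput term with a penalty for raw actions outside their bounds scaled by the penalty factor; the maximum flow value and the exact penalty form are illustrative assumptions.

import numpy as np

def throughput_reward(mass_flow_out, raw_action, action_low, action_high,
                      max_flow=400.0, penalty_factor=0.2):
    """+1 at the theoretical maximum flow, linearly decaying below it,
    minus a penalty proportional to how far the raw action lies outside its bounds."""
    reward = min(mass_flow_out / max_flow, 1.0)
    overshoot = (np.maximum(raw_action - action_high, 0.0)
                 + np.maximum(action_low - raw_action, 0.0))
    return reward - penalty_factor * float(np.sum(overshoot))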

The inputs and outputs of the algorithm were the full set described in Section 3.1.1, which meant the algorithm had full control over the circuit and every possible property was measured. The hyperparameters of the algorithm were tuned over several episodes; Table 4 shows the values used in the final experiment, and Figure 19 shows the training progress.


Table 4 – Hyperparameters for PPO maximizing throughput of the mill circuit

Parameter | Value
nsteps | 115
nminibatches | 5
ent_coef | 0
learning rate | LinearDecay[1e-3, 3e-4]
vf_coef | 0.5
max_grad_norm | 0.5
gamma | 0.8
lam | 0.95
noptepochs | 4
cliprange | 0.3
policy num_layers | 2
policy num_hidden | 32
value network | copy of policy network
penalty factor | 0.2


It is evident that the PPO implementation does not manage to beat the PID implementation in maximizing throughput. It does, however, reach approximately 92% of the throughput achieved by PID. One factor to keep in mind is that when maximizing throughput, one of the most important actions is to keep the feed signal high. This is problematic since the PPO algorithm receives a penalized reward when selecting actions outside of its bounds and thus becomes cautious about applying actions that are near the bounds.

4.4.2 Maximize profit

In this experiment another goal was formulated: to maximize profit. The profit for a mill circuit has many factors. A simplified model was provided by Boliden which included the predicted recovery rate, depending on the particle size of the ground ore, and the throughput. The predicted recovery rate was a second-order function of particle size, and this was used in conjunction with how much valuable mineral was extracted to provide a profit function. The model is left out due to confidentiality, although a rough idea of it in terms of throughput and predicted recovery rate is given in Appendix A.1.

This experiment originally had trouble converging to a good action policy. This was solved when normalization of all inputs was implemented by rescaling all inputs to [0, 1]. The parameters of the algorithm were kept the same as for the maximize-throughput experiment, and the training progress can be seen in Figure 20.
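A minimal sketch of the input normalization mentioned above, rescaling each observation to [0, 1] with fixed ranges; the observation ranges are illustrative assumptions.

import numpy as np

# Illustrative per-observation ranges (e.g. mill mass in ton, torque in %, flow in ton/h).
obs_low = np.array([0.0, 0.0, 0.0, 0.0])
obs_high = np.array([300.0, 100.0, 100.0, 500.0])

def normalize_obs(obs):
    """Rescale each observation to [0, 1] using fixed, known ranges."""
    return np.clip((obs - obs_low) / (obs_high - obs_low), 0.0, 1.0)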


Figure 20 – Training progress for PPO with reward for maximizing profit using constant ore properties. Also shown is the reward achieved for the existing PID control strategy. Note the time scale of the two figures.

First and foremost, the most interesting thing to note is that around the 5-hour training mark the PPO algorithm manages to beat the existing control strategy with regard to profit. However, let us walk through the training progress from the beginning.

As can be seen, at the beginning of training the algorithm manages to receive a reasonably high reward of around 650. This is due to the initial value of the policy, which was set at 50% of the maximum for all actions, all of which are reasonable values. As training continues the reward suddenly drops, probably because of episodes terminating early due to constraint violations. After this drop the algorithm slowly improves until it becomes better than the PID control strategy. But after a while the training becomes destructive and the agent gets worse, never to recover. This problem could potentially be solved by annealing the learning rate even more than was done in this experiment, in order not to force the algorithm to keep learning when it has "learnt" enough. This does, however, add another hyperparameter that has to be tuned once the learning process is known.

One issue with this problem formulation was that the algorithm settled on constant action values as training progressed, since the environment dynamics were the same in every episode. To counteract this, training was also performed with randomized ore properties; the training progress for that agent can be seen in Appendix A.2. This made the outputs more dynamic with respect to the inputs, and a sample episode from this agent can be seen in Appendix A.3. This agent manages to achieve a reward of around 797 when tested on nominal ore properties, which is roughly the same as the agent trained on nominal ore properties.

An important feature of this control is highlighted in Figure 21, which shows the mass of the primary mill under PID and PPO control. The PID tries to keep the mass of the primary mill at a given set point; the PPO algorithm, however, does not try to hold any particular set point. It only tries to maximize the reward signal, and hence when the ore properties change the values may change as well; the important factor is that the reward is still higher than for PID.

Figure 21 – Primary mill weight and reward for PPO and PID control for one episode. At t=0 the ore is of nominal properties and at t=400 the ore changes properties.

The final action policy in Figure 22 also shows that the flap gate is not used by either the PPO or the PID control. The PID does not use the flap gate since the mass of the secondary mill is too high to add any more coarse material, although changing operating conditions should make the use of the flap gate a necessity.


Figure 22 – Action from PPO and PID control for one episode. At t=0 the ore is of nominal properties and at t=400 the ore changes properties.

In assessing what PPO does differently from the existing PID control, Figure 22 gives a clear clue when the operating conditions change. While the PID in this case still drives most actions towards their maximum to hold the set points of each PID control loop, the PPO control uses lower action values and clearly adapts to the changing operating conditions.


5 Discussion

5.1 FMU Architecture

The FMU architecture developed in this thesis proved to be a good way of applying an RL algorithm to a simulated model. After solving the initial architectural problems, implementing a completely different simulation model for a different problem would be an easy task, given that the model can be exported as an FMU. The only cause for concern may be the simulation time of a new model. When the mill circuit model was exported as an FMU using default settings, the result was an extremely long simulation time. By tweaking the choice of integrator and the internal sampling time of components in the model, the simulation time was reduced drastically. When pairing this FMU architecture with algorithms that parallelize into 23 different instances, it is easy to forget how much is actually going on, due to the relative ease of setting it up. When things do go wrong, debugging these kinds of applications can get quite tedious, since the architecture has so many different layers and getting to the cause of an error is not always trivial.

5.2 Implementation

Getting an RL algorithm to work on this type of control problem is quite forgiving in the sense that the terminating conditions of the mill circuit require the agent to output incorrect actions for several consecutive time steps, in contrast to other problems such as the CartPole environment, where a few incorrect actions lead to termination.

From the experience in this thesis, the algorithms often do work, but without proper hyperparameter tuning the learning process becomes unreasonably long. One problem was never solved, namely the flap gate that controls the flow of coarser material to the secondary mill. This was a discrete output, in contrast to the other, continuous actions. Some effort was made to apply a threshold that activated this discrete output when the corresponding action was above a certain value, but the algorithm never learnt to use the flap gate. The flap gate also had limited use in this simulation model; controlling it might become more important if the simulation model were tuned differently.

In implementing a state-of-the-art RL algorithm such as PPO, the OpenAI Baselines library is a time saver. Implementing these algorithms from scratch and verifying that they perform on par with recent research results would have been difficult within this time frame. Some problems did exist, though: the built-in function for normalizing inputs did not work properly when saving the parameters for later use, and obtaining deterministic actions by selecting the mean value output of the PPO policy network required code modifications. Baselines does have an excellent logging implementation using TensorBoard, which is part of TensorFlow. It made it possible to save relevant data to monitor the training progress, which was very useful while supervising training.
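The idea behind that modification is simply to act on the mean of the Gaussian policy at evaluation time instead of sampling from it; the snippet below is a generic illustration of the principle, not the actual Baselines code.

```python
import numpy as np


class DiagGaussianPolicy:
    """Toy diagonal-Gaussian policy head: a network outputs one mean per
    action dimension, with a state-independent log standard deviation."""

    def __init__(self, mean_fn, logstd):
        self.mean_fn = mean_fn          # maps observation -> action means
        self.logstd = np.asarray(logstd, dtype=float)

    def act(self, obs, deterministic=False):
        mean = np.asarray(self.mean_fn(obs), dtype=float)
        if deterministic:
            # Evaluation: drop the exploration noise and act on the mean.
            return mean
        # Training: sample so the agent keeps exploring.
        return mean + np.exp(self.logstd) * np.random.randn(*mean.shape)
```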

5.3 Parameter tuning

An important part of implementing RL algorithms is hyperparameter tuning. In the first experiments using Q-Learning there were relatively few hyperparameters and the selected ranges were forgiving. With algorithms such as PPO the number of tunable parameters increases drastically, and combined with setting up varied simulation conditions to ensure sufficient training, the implementer has a large number of parameters at their disposal. With training runs that take around 24 hours, tuning becomes quite tedious.

5.4 RL at a higher level

The use case of replacing the direct control of processes with RL, as was done in this thesis, has attracted research with promising results, such as the work by S.P.K. Spielberg et al. [14]. This use case nevertheless has some difficulties that other implementations may circumvent. When implementing RL for process control, an alternative to replacing the existing control is to use RL as a higher-order layer. It could be used for setting control parameters or set points of existing PID controllers, which would solve many of the problems that occur when trying to control a process from scratch. Some work has been done in this area, such as the work by Jay H. Lee et al. [15], which discusses the potential of RL for process control.

5.5 Other mining subprocesses

The main motivation for implementing an RL algorithm should be that the process has many different knobs to turn and that the effect of turning them is not trivial to deduce, especially when the control goal is abstract, such as maximizing profit. With that in mind, the following processes could be suitable, provided that an accurate simulation model is available:

1. Crushing, used for size reduction, could be suitable, with an RL agent trying to keep the particle size within a given range. Learning how the cone height and speed of a gyratory crusher affect the particle size could be a possible task for an RL agent.

2. Froth flotation, used for concentration, can sometimes be hard to control due to the varied operating conditions determined by the ore size and properties. An RL agent could be given the task of reaching a desired recovery/grade by controlling the addition of chemicals.

3. Ventilation in underground mines: these control systems often consist of multiple fans, and the correct control of the fans can be hard to deduce due to the different ways the air flows between the mine shafts. A specific goal here could be to minimize power consumption while maintaining sufficient airflow.

6 Conclusions

In this work we have developed a software architecture that enables simulation models in the form of Functional Mock-up Units to be used for reinforcement learning.

This architecture is used to train a reinforcement learning algorithm that controls a mill circuit within a mineral processing plant and that, in simulation, achieves higher profitability than the existing PID control strategy under certain conditions. The emergent control strategy of the reinforcement learning algorithm is to lower the ore feed so that the mills can grind the ore more efficiently, thus providing higher profitability. The trained algorithm is capable of controlling the mill circuit under the same constraints as the existing control strategy, although further research is required before implementing it on a live plant.

6.1 Future work

To implement the PPO algorithm on the live plant, thorough constraints on the selected actions would be needed. The freedom to select any possible action at any given time would not work on a live plant, since it could lead to mechanical damage; instead, the output should ideally be limited to small increments or decrements per time step. As a fail-safe, the PID control strategy should also be kept as a backup so that it can take over whenever the PPO algorithm does not output sensible actions.
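One way to express such a constraint is to rate-limit the agent's output so that each actuator can only move by a small increment per control interval, falling back to the PID output when the proposal is not usable; below is a hedged sketch where the limits and fallback condition are purely illustrative.

```python
import numpy as np


def safe_action(proposed, previous, pid_action, low, high, max_delta):
    """Clamp an RL action to a small step around the previous action and
    fall back to the PID output if the proposal is unusable."""
    proposed = np.asarray(proposed, dtype=float)
    if not np.all(np.isfinite(proposed)):
        # Fail-safe: hand control back to the PID layer.
        return np.asarray(pid_action, dtype=float)
    # Allow only a small increment/decrement per time step.
    limited = np.clip(proposed, previous - max_delta, previous + max_delta)
    # Keep the result inside the physical actuator range.
    return np.clip(limited, low, high)
```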

The discrete control problem of the flap gate would also have to be resolved. One potential solution would be to change the discrete action into a continuous one, for example by letting a continuous signal correspond to the time the flap gate is held open within a time step.
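A minimal sketch of that transformation, where a continuous action in [0, 1] is interpreted as the fraction of the control interval during which the gate is held open (the 10-second interval is an assumed example, not the model's actual sampling time):

```python
def flap_gate_open_time(action, step_length=10.0):
    """Map a continuous action in [0, 1] to the number of seconds the
    discrete flap gate is held open during one control interval."""
    action = min(max(float(action), 0.0), 1.0)
    return action * step_length
```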

Another possible improvement is to use a different stochastic policy instead of a Gaussian policy, whose main drawback is that it places no constraints on the size of the actions. This would reduce the need to design reward functions that penalize out-of-bounds actions, making implementation easier by removing yet another parameter to tune. Some interesting work has been done in this area.
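One commonly studied alternative is a Beta-distribution policy, whose support is the bounded interval [0, 1], so sampled actions can never leave the actuator range; the sketch below is a minimal illustration and is not tied to any particular PPO implementation.

```python
import numpy as np


class BetaPolicy:
    """Toy Beta-distribution policy head: actions are sampled in [0, 1]
    and rescaled to the actuator range, so no out-of-bounds penalty is needed."""

    def __init__(self, alpha_fn, beta_fn, low, high):
        self.alpha_fn = alpha_fn    # maps observation -> alpha > 0 per action
        self.beta_fn = beta_fn      # maps observation -> beta > 0 per action
        self.low = np.asarray(low, dtype=float)
        self.high = np.asarray(high, dtype=float)

    def act(self, obs, deterministic=False):
        alpha = np.asarray(self.alpha_fn(obs), dtype=float)
        beta = np.asarray(self.beta_fn(obs), dtype=float)
        if deterministic:
            x = alpha / (alpha + beta)          # distribution mean
        else:
            x = np.random.beta(alpha, beta)     # bounded sample in [0, 1]
        return self.low + x * (self.high - self.low)
```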

References
