Hybrid model-based hierarchical reinforcement learning for contact rich manipulation task

Master in Systems, Control and Robotics

Date: October 7, 2020

Supervisor: Shahbaz Khader

Examiner: Dr. Christian Smith

School of Electrical Engineering and Computer Science

Swedish title: Hybridmodellbaserad hierarkisk förstärkningsinlärning för kontaktrik manipulationsuppgift




Contact-rich manipulation tasks form a crucial application in industrial, medical and household settings, requiring strong interaction with a complex environment. In order to engage efficiently in such tasks with human-like agility, it is crucial to find a method that can effectively handle such contact-rich scenarios. In this work, contact-rich tasks are approached from the perspective of a hybrid dynamical system. A novel hierarchical reinforcement learning method is developed: model-based option-critic, which makes extensive use of the structure of the hybrid dynamical model of the contact-rich tasks. The proposed method outperforms the state-of-the-art method PPO and also the previous hierarchical reinforcement learning method, option-critic, in terms of the ability to adapt to uncertainty and changes in the contact-rich tasks.




Contact-rich manipulation tasks constitute a crucial application in industrial, medical and household settings, which requires strong interaction with a complex environment. To take part in such tasks effectively and with human-like agility, it is important to seek a method that can handle such contact-rich scenarios effectively. In this work, contact-rich tasks are approached from the perspective of a hybrid dynamical system. A new hierarchical reinforcement learning method is developed: model-based option-critic, which makes extensive use of the structure of the hybrid dynamical model of the contact-rich tasks. The proposed method outperforms the modern method PPO and also the earlier hierarchical reinforcement learning work, option-critic, in terms of the ability to adapt to uncertainty and changes in the contact-rich tasks.



1 Introduction

1.1 Background

1.2 Research Question

1.3 Hypothesis

1.4 Related Work

2 Preliminaries

2.1 Markov Decision Process

2.2 Reinforcement Learning

2.3 Model-based Reinforcement Learning

2.3.1 Hybrid Model Learning

2.4 Hierarchical Reinforcement Learning

2.4.1 Options framework

3 Problem Formulation

3.1 Skill learning for contact-rich Manipulation Tasks

3.2 Policy Learning for Contact-Rich Manipulation Tasks

3.2.1 Base policy

3.2.2 Hierarchical policy

4 Policy Learning Methods

4.1 Base policy - Actor Critic

4.1.1 Actor Critic with PPO & GAE

4.2 Hierarchical Policy: Option-Critic

4.3 Model-based Hierarchical Policy

4.3.1 Semi Model-based Reinforcement Learning

4.3.2 Options using Hybrid Model

4.3.3 Hybrid Model Learning

4.3.4 Model-based Option-Critic

5 Experimental Evaluation & Results

5.1 Description of the Experimental Scenarios

5.1.1 Simulated 2D Block-Insertion

5.1.2 Simulated 2D Block-Cornering

5.2 Implementation

5.3 Results

5.3.1 Hybrid Model Learning

5.3.2 Base & Hierarchical Policy Learning

5.3.3 Adaption to changes in environment

6 Discussion & Conclusions

7 Ethics & Sustainability

Bibliography


Chapter 1

Introduction

1.1 Background

Many manipulation tasks require physical interaction between the manipulator and the environment. Tasks such as pick and place, arrangement of objects, surface finishing operations, insertion of tight-fitting parts and mating of gears require strong interaction with the environment and involve complex dynamics. However, the traditional manipulation framework was developed assuming minimal interaction with the environment. This limits the application of manipulator arms to constrained and well-defined environments with precise kinematic motions. In contrast, human manipulation has no such limitations and efficiently engages in tasks that involve physical interaction with the environment. This serves as the motivation for developing manipulation methods that actively interact with the environment by explicitly considering interaction phenomena such as stiffness, surface contacts and impacts. Often, the physical interaction is too complicated to model or, if approximated, encompasses significant uncertainty, which makes it unsuitable for integration with the traditional manipulation control framework.

Over the last few decades, there has been tremendous progress in utilizing reinforcement learning (RL) in robotic manipulation, especially in areas such as object grasping and in-hand manipulation. A direct application of RL is motivated by the following limitations of traditional approaches. In traditional methods, planners such as RRT and A* compute a reference Cartesian trajectory of the end-effector, which is translated into a reference joint trajectory for a low-level feedback servo controller to follow. This approach has two limitations: (1) the efficiency (time taken to follow the trajectory) depends on the response of the low-level controller; (2) the planner requires sub-millimetre accuracy of the environment in order to generate a reference trajectory, so any uncertainty results in failure of the task. Instead, RL policies may directly provide the torque needed to move the manipulator and achieve the task based only on the rewards received, thereby completely circumventing any notion of trajectory tracking.

However, learning manipulation skills in contact-rich manipulation tasks poses further challenges, such as the complexity of the environment in terms of contact dynamics and uncertainty in task conditions. A contact-rich manipulation task involves switching dynamics that can be represented as a hybrid dynamical model [1]. The dynamical equations governing the process switch from one regime to another depending on the nature of the contact. This thesis approaches the problem of learning manipulation skills using a model-based reinforcement learning framework with the following research theme.

1.2 Research Question

In this work, a hierarchical reinforcement learning based method for the contact-rich manipulation problem is sought that utilises the dynamics of the contact-rich task. In this regard, it is crucial to evaluate the following research questions:

• How can the switching dynamic model of a contact-rich task be exploited in learning a model-based hierarchical policy?

• Does a model-based hierarchical policy have better learning capability and ability to adapt to changes in task conditions than a model-free non-hierarchical policy?

1.3 Hypothesis

The hypotheses of this work are:

1. Primary Hypothesis - A hybrid dynamical model can be used to model contact-rich tasks, and this model can be utilised effectively in learning a hierarchical policy.

2. Secondary Hypothesis - A model-based hierarchical policy is expected to be beneficial in terms of learning efficiency and ability to adapt to changes in the task conditions.



1.4 Related Work

Robotic manipulation that leverages contact information from the environment has been studied for quite some time. Owing to the widespread application of contact-rich tasks, especially in industrial settings, researchers have attempted to utilise information on the interaction of the manipulator with its environment, for example from visual depth sensors and force-torque measurements. Conventional methods relied on approximate analysis of the contact model and its decomposition into different phases [2]. Compliant control schemes were then utilised to obtain optimal performance in each phase [3], where feedback from force-torque measurements is used extensively to guide the insertion of the peg into the hole. A few techniques also utilise a visual/sensory servoing approach to handle the uncertainty in the contact model and develop a more robust control method [4][5]. However, these treatments were quite specific to a particular task, such as peg-in-tube, and cannot be extended to a general framework for contact-rich tasks.

Contact-rich tasks can be naturally modelled as a hybrid dynamical system. A hybrid dynamical system encompasses the interaction of continuous and discrete-time dynamics, leading to rich dynamical behaviour [6]. Consequently, numerous challenges arise in the modelling and control of such hybrid systems owing to changes in the governing dynamic mode. Johnson et al. [7] demonstrated physical modelling of a manipulation task but included various physical approximations of the contact models, which limited its general application. The authors of [8] proposed hybrid dynamical modelling of contact-rich manipulation and presented how it can be utilised in planning modes and trajectories, but did not demonstrate any major application of their method. In the majority of the literature, the classical control scheme for a hybrid system is based on linearizing each dynamical mode and implementing a switching linear control law [9]. This limits the control authority and efficiency of such methods. Additionally, a few nonlinear control techniques have been developed, such as the control Lyapunov technique [10], which assumes complete knowledge of the dynamical equations. A more recent approach, model predictive control extended to linear hybrid systems, is also applicable [11]. However, a straightforward implementation of model predictive control for manipulation is infeasible owing to the high dimensionality, complex nonlinear behaviour and stochastic nature of the contact-rich manipulation problem [12]. This motivates the need for a learning-based approach.



Recently, learning-based approaches have been utilised to solve contact-rich manipulation tasks [13], [14], [15]. In a broad sense, robot learning can be categorised into two major domains: learning from demonstration [16] and reinforcement learning [17]. In learning from demonstration, human demonstrations of a task are captured and modelled by a parameterized action primitive [18], which is used to create a policy that repeats the human's actions. However, these methods lack generality and often lack robustness to the stochastic nature of contact-rich tasks. In [19], the authors tried to incorporate robustness through additional exploration after the human demonstration, dividing the task into various phases and developing a heuristic-based approach that drives the manipulator to a nearby phase when a particular phase of the task fails. In contrast, reinforcement learning relies on learning from experience and learns an optimal policy without extensive information about the task, thus yielding a more general control scheme than learning from demonstration. In this area, a few developments have been made using both model-based and model-free learning.

In the model-free approach, the model of the system is not known and the policy is learned directly by interacting with the environment, which often leads to sample inefficiency and long training times. The authors of [13] tried to overcome the data inefficiency using guided policy search, which utilises local linear-Gaussian controllers to train a neural network policy. Several successful contact-rich manipulation works have used this framework [20].

Another approach to overcoming the data inefficiency was proposed in [21], where a higher-level learning framework drives a lower-level position/force control method to achieve the peg-in-hole task. Such methods offer reliability, but they rely on efficient force control schemes and their applicability to general contact-rich tasks is limited.

Model-based learning can be substantially more sample efficient but has the additional requirement of learning the dynamics model from a small number of interactions. This can be achieved by various techniques, as mentioned in [22], [23]. The learned model can then be utilised to discover a policy by using it as a synthetic sample generator in a process called long-term prediction [1].

The performance of model-based reinforcement learning methods depends strongly on the accuracy of the learned dynamics model. As contact-rich tasks involve discontinuous dynamics, learning an accurate model is not straightforward, and limited research has been done in this area. In [14], the authors instead learn various coarse dynamic models of different contact tasks and adapt them online to a given task, with an iterative linear quadratic regulator serving as the policy. Recently, Gaussian processes have been utilised for learning dynamical models [24]. This was further extended to hybrid dynamical systems in [1] for contact-rich tasks, demonstrating the strength of the framework in terms of accurate long-term prediction from the learned model.

To utilise the learned hybrid dynamics model of the contact-rich task, a hybrid learning framework can be useful. The work of [25] demonstrated the potential benefits of introducing hierarchy in the learning of multi-phase manipulation tasks by learning a sub-policy in the form of a motion primitive for each phase and then learning a higher-level policy that sequences these motion primitives to generalise to new tasks. However, it relied on human demonstration to learn the motion primitives initially. Building on this, [15] proposed a hierarchical contextual policy search; however, the framework was demonstrated on a continuous dynamical system. Related to these, the more recent development of hierarchical reinforcement learning, which adds a hierarchy of policies with different temporal abstractions, seems natural to adapt in this work. The authors of [26] proposed the options framework, which introduces hierarchy into the standard Markov Decision Process. This was utilised in [27] by chaining together several options (sub-policies) backwards from the goal to solve the task. Based on this idea, several promising works have been developed [28], [29], [30], [31] for general reinforcement learning problems, and these have the potential to be extended to contact-rich manipulation tasks.

To summarize, traditional control-based methods are unsuitable for general-purpose contact-rich manipulation problems because of the complex dynamics involved. Reinforcement learning methods have shown considerable potential for such problems; however, they often lack satisfactory learning efficiency and the ability to adapt to changes in such contact-rich tasks. This necessitates the search for a learning-based method that has good learning efficiency and demonstrates robustness in contact-rich tasks. Based on an extensive literature review, no existing method utilises the learned hybrid dynamics model of a contact-rich task in the context of hierarchical reinforcement learning, which strengthens the contribution of this work.


Chapter 2

Preliminaries


In this chapter, the theoretical background is built to support the methods used in this thesis work. First, the mathematical framework for modelling sequential decision-making problems, the Markov Decision Process (MDP), is presented. Then reinforcement learning (RL), along with a specific variant of it, model-based reinforcement learning (MBRL), is introduced as a solution to an MDP problem. Additionally, the necessary formalism for hybrid automata, which will be used to model contact-rich dynamics, and the learning of such hybrid models are presented. Finally, a recent concept in reinforcement learning, hierarchical reinforcement learning (HRL), is introduced.

2.1 Markov Decision Process

A decision-making problem that satisfies the Markov property is called a Markov Decision Process (MDP). According to the Markov property, future states are independent of the past given the current state, i.e. at any given time the state retains all the relevant information about the problem. Formally, an MDP is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T} : \mathcal{S} \times \mathcal{A} \to (\mathcal{S} \to [0, 1])$ is the one-step transition dynamics of the environment and $\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function. At every time step $t$, the agent or controller, which is external to the MDP, knows the state $s_t \in \mathcal{S}$ of the environment and, based on it, takes an action $a_t \in \mathcal{A}$. This action, when applied to the environment, results in a new state $s_{t+1}$ according to the environment's probabilistic dynamics, modelled by the transition function $\mathcal{T}$:

\[ p(s_{t+1} \mid s_t, a_t) = \mathcal{T}(s_t, a_t) \tag{2.1} \]

and is accompanied by a reward signal:

\[ r(s_t, a_t) \in \mathbb{R} \tag{2.2} \]

which signifies how good the action was in terms of achieving some task in the environment. The solution to an MDP is a sequence of actions that maximizes the reward signal accumulated over time. MDPs serve as the theoretical framework in which the learning problem for reinforcement learning is expressed.
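As a minimal sketch of the tuple described above, the following Python snippet encodes a small tabular MDP and samples one-step transitions from it. The two states, two actions, probabilities and rewards are hypothetical values chosen only for illustration; they do not come from the thesis.

```python
import random

# Hypothetical two-state, two-action MDP used only for illustration.
STATES = ["free", "contact"]
ACTIONS = ["push", "hold"]

# T[(s, a)] maps each next state to its probability, as in Eq. 2.1.
T = {
    ("free", "push"): {"free": 0.2, "contact": 0.8},
    ("free", "hold"): {"free": 1.0, "contact": 0.0},
    ("contact", "push"): {"free": 0.0, "contact": 1.0},
    ("contact", "hold"): {"free": 0.1, "contact": 0.9},
}

# R[(s, a)] is the scalar reward of Eq. 2.2.
R = {
    ("free", "push"): -1.0,
    ("free", "hold"): -2.0,
    ("contact", "push"): 0.0,
    ("contact", "hold"): -0.5,
}


def step(state, action, rng):
    """Sample s_{t+1} from T(s_t, a_t) and return it together with r(s_t, a_t)."""
    dist = T[(state, action)]
    next_states = list(dist.keys())
    weights = list(dist.values())
    next_state = rng.choices(next_states, weights=weights, k=1)[0]
    return next_state, R[(state, action)]


# Roll out a fixed action for a few steps and accumulate the reward.
rng = random.Random(0)
s = "free"
total = 0.0
for _ in range(5):
    s, r = step(s, "push", rng)
    total += r
```

The agent only ever calls `step`; it never reads `T` directly, which mirrors the separation between agent and environment in the MDP formulation.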

2.2 Reinforcement Learning

Reinforcement learning (RL) solves the learning problem expressed as an MDP by acting as the agent that takes actions. In RL, the reward function $\mathcal{R}$ is known, whereas the transition dynamics $\mathcal{T}$ is unknown, and the agent learns a mapping from states to actions. This mapping, termed the policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, determines how the agent selects an action $a_t$ given the current state $s_t$. The distribution of trajectories obtained by following the policy in the finite-horizon episodic setting from an initial state $s_0$ is given by:

\[ p(s_0, a_0, \ldots, s_T, a_T) = p(\tau) = p(s_0) \prod_{t=0}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \tag{2.3} \]

The main aim of reinforcement learning is to optimize the expected reward over time, i.e. to take actions that yield the best reward not only at the current time but over future time instants as well. The expected reward is expressed as:

\[ J = \mathbb{E}_{p(\tau)}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right] \tag{2.4} \]

where the expectation $\mathbb{E}$ is taken over the trajectory distribution given in Eq. 2.3. Alternatively, for infinite-horizon problems, future rewards are discounted to make the summation tractable. The solution to the reinforcement learning problem can be obtained primarily by two alternative families of methods [17].

The first approach falls under the category of value function methods. Here, an estimate of the future return from a state is computed and denoted as the value of the state, $V^{\pi}(s_t)$:

\begin{align}
V^{\pi}(s_t) &= \mathbb{E}_{p(\tau)}\left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \right] \notag \\
&= \mathbb{E}_{p(s_t, a_t)}\left[ r(s_t, a_t) + V^{\pi}(s_{t+1}) \right] \notag \\
&= \mathbb{E}_{p(a_t \mid s_t)}\left[ r(s_t, a_t) + \mathbb{E}_{p(s_{t+1} \mid s_t, a_t)}\left[ V^{\pi}(s_{t+1}) \right] \right] \tag{2.5}
\end{align}

The value function approach estimates an optimal state value, which is used to derive the optimal action in each state; this in turn defines the (greedy) policy:

\[ V^{\pi}(s_t) = \max_{a_t} \left[ r(s_t, a_t) + \mathbb{E}_{p(s_{t+1} \mid s_t, a_t)}\left[ V^{\pi}(s_{t+1}) \right] \right] \tag{2.6} \]

The optimal value function is computed using approaches such as Monte Carlo Tree Search and Dynamic Programming. In the second approach, an optimal policy is learned directly via policy search. Here, policies are represented by a wide variety of parameterized functions $\pi_{\theta}(a_t \mid s_t)$, and the optimal parameters are computed by optimizing the total expected reward from the initial state:

\[ \theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{p_{\theta}(\tau)}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right] \tag{2.7} \]

Different methods are available [17] to optimize the parameters, such as gradient descent and expectation-maximization. Additionally, the way the policy is learnt can be further classified as model-free or model-based.

In the model-free approach, the transition dynamics of the underlying MDP is not explicitly learnt. Instead, samples are collected by running the policy in the simulated or real environment and are used to compute the policy directly from the rewards received. In contrast, the model-based approach relies on learning the transition dynamics explicitly as an intermediate step.
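The Bellman backup of Eq. 2.6 can be illustrated with dynamic programming on a small MDP whose transition model is known. The sketch below is not the method used in this thesis: the three-state chain, the success probability of 0.9, the reward and the horizon are all hypothetical choices made so that the backup can be followed by hand.

```python
# Hypothetical 3-state chain: state 2 is the goal; actions move left or right.
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
HORIZON = 4


def transition(s, a):
    """Return {next_state: probability}; a move succeeds with probability 0.9."""
    target = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    if target == s:
        return {s: 1.0}
    return {target: 0.9, s: 0.1}


def reward(s, a):
    """Negative distance to the goal state, so states nearer the goal pay more."""
    return -float(abs(2 - s))


def backward_induction():
    """Apply the backup of Eq. 2.6 HORIZON times, starting from zero value."""
    value = {s: 0.0 for s in STATES}  # value after the final time step
    policy = {}
    for _ in range(HORIZON):
        new_value = {}
        for s in STATES:
            q = {
                a: reward(s, a)
                + sum(p * value[s2] for s2, p in transition(s, a).items())
                for a in ACTIONS
            }
            best = max(q, key=q.get)
            policy[s] = best          # greedy action for this backup
            new_value[s] = q[best]
        value = new_value
    # After the loop, policy holds the greedy actions for the first time step.
    return value, policy


value, policy = backward_induction()
```

With these assumed dynamics the greedy policy moves right in every state, and the value increases monotonically towards the goal state.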

2.3 Model-based Reinforcement Learning

In model-based reinforcement learning (MBRL), a model of the transition dynamics is learnt along with the policy, and the learnt model is then used to improve the policy. In general settings, the model is learnt alongside the policy within the DYNA framework [32], as shown in Fig. 2.1, and the learned model is used as a black-box synthetic sample generator standing in for the actual environment. This reduces the number of samples that must be collected from the environment.

Figure 2.1: Model-based reinforcement learning as the DYNA framework

Problems such as manipulation involve intricate dynamics, and thus analytical models or even gray-box system identification methods tend to fail to model the dynamics. Therefore, deterministic or stochastic models need to be learnt from samples collected by interacting with the environment. In deterministic model learning, parameterized function approximators $\hat{f}(\cdot\,; \phi)$ are generally used, which predict the next state given the current state and action. The deterministic model can be viewed as generating the most probable state under the inherent probabilistic transition model of the Markov Decision Process (MDP):

\[ s_{t+1} = \hat{f}(s_t, a_t; \phi) \tag{2.8} \]

After collecting $N$ sample tuples of the form $(s^{i}_{t+1}, s^{i}_{t}, a^{i}_{t})$, the parameters of these function approximators are optimized by minimizing a prediction error, typically of the form:

\[ \phi^{*} = \arg\min_{\phi}\; \frac{1}{N} \sum_{i=1}^{N} \left\lVert s^{i}_{t+1} - \hat{f}(s^{i}_{t}, a^{i}_{t}; \phi) \right\rVert^{2} \tag{2.9} \]

However, deterministic models fail to capture the uncertainty in the prediction, and stochastic models offer the significant advantage of being uncertainty-aware. The most popular stochastic models fit a Gaussian distribution of the form:

\[ p(s_{t+1} \mid s_t, a_t) = \mathcal{N}\big(\mu(s_t, a_t), \Sigma(s_t, a_t)\big) \tag{2.10} \]

where $\mu(\cdot)$ and $\Sigma(\cdot)$ are the mean and covariance of the dynamics model. The method presented in [24] utilises Gaussian Processes (GP) to learn a model of the form of Eq. 2.10. However, contact-rich tasks involve discontinuities, and their piecewise continuous nature cannot be handled innately by GPs. Therefore, the recent work of [1], which learns a stochastic hybrid dynamical model, is utilised.
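To make the deterministic model fit of Eqs. 2.8 and 2.9 concrete, the sketch below fits a linear one-dimensional model by least squares. It assumes a squared prediction error, a hypothetical scalar system `true_dynamics` that plays the role of the unknown environment, and a linear approximator with two parameters; none of these are taken from the thesis, which uses Gaussian Processes rather than a linear model.

```python
import random


def true_dynamics(s, a):
    """Hypothetical ground-truth system; the learner never sees this formula."""
    return 0.9 * s + 0.5 * a


def collect_samples(n, rng):
    """Collect tuples (s_next, s, a) by applying random actions."""
    data = []
    s = 0.0
    for _ in range(n):
        a = rng.uniform(-1.0, 1.0)
        s_next = true_dynamics(s, a) + rng.gauss(0.0, 0.01)  # small noise
        data.append((s_next, s, a))
        s = s_next
    return data


def fit_linear_model(data):
    """Minimise the mean squared prediction error for
    f_hat(s, a; phi) = phi[0] * s + phi[1] * a using the normal equations."""
    sss = sum(s * s for _, s, _ in data)
    saa = sum(a * a for _, _, a in data)
    ssa = sum(s * a for _, s, a in data)
    sys_ = sum(y * s for y, s, _ in data)
    sya = sum(y * a for y, _, a in data)
    det = sss * saa - ssa * ssa
    phi0 = (sys_ * saa - sya * ssa) / det
    phi1 = (sya * sss - sys_ * ssa) / det
    return phi0, phi1


def f_hat(s, a, phi):
    """Learned one-step prediction in the form of Eq. 2.8."""
    return phi[0] * s + phi[1] * a


phi = fit_linear_model(collect_samples(200, random.Random(0)))
```

Once fitted, `f_hat` can be called repeatedly on its own output to generate synthetic rollouts, which is the role the learned model plays in the DYNA framework.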

2.3.1 Hybrid Model Learning

In this work, contact-rich tasks are approached from the perspective of a hybrid dynamical system. Contact-rich tasks comprise different phases or modes, such as free motion, sliding or collision, depending on environmental constraints. Over the entire process, different dynamical equations $f(\cdot)$ govern the various phases of the task. Therefore, it is essential to introduce the formulation of a hybrid dynamical system. This characteristic of the process can be effectively modelled as a general class of hybrid systems called switched systems [9], given by:

\[ s_{t+1} = f_q(s_t, a_t), \qquad s \in \mathcal{S}_q \subseteq \mathbb{R}^{n},\; a \in \mathcal{A}_q \subseteq \mathbb{R}^{p} \tag{2.11} \]

where $s$ denotes the state and $a$ denotes the control action. $\mathcal{S}_q$ and $\mathcal{A}_q$ denote the partitions of the state and action spaces respectively, with $q = 1, \ldots, Q$ being the switching mode that determines which dynamical equation is active at time $t$. The switching mode can be state or time dependent, and can be deterministic or stochastic. Additionally, the state may exhibit a discontinuity when the system switches from one mode to another, increasing the complexity of the system. Unlike the general form of hybrid systems, where the dynamics of both discrete and continuous state variables are modelled, a switched system focuses only on the dynamics of the continuous variables.
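A minimal sketch of a switched system in the sense of Eq. 2.11 is given below, assuming a hypothetical one-dimensional point mass that moves freely until it reaches a rigid wall. The wall position, time step and both mode dynamics are illustrative assumptions and not the scenarios evaluated in this thesis.

```python
WALL = 1.0  # hypothetical wall position
DT = 0.1    # hypothetical time step


def mode_of(state):
    """Switching mode q: 0 for free motion, 1 for contact with the wall."""
    x, _ = state
    return 1 if x >= WALL else 0


def f_free(state, action):
    """Mode 0 dynamics: a unit point mass driven by the applied force."""
    x, v = state
    return (x + DT * v, v + DT * action)


def f_contact(state, action):
    """Mode 1 dynamics: the wall blocks motion in the positive direction.
    The position is placed on the wall surface and positive velocity is
    removed, which produces the state discontinuity at the mode switch."""
    _, v = state
    v_next = min(v + DT * action, 0.0)
    return (WALL + DT * v_next, v_next)


def step(state, action):
    """Switched dynamics s_{t+1} = f_q(s_t, a_t), with q chosen by the state."""
    if mode_of(state) == 1:
        return f_contact(state, action)
    return f_free(state, action)


def simulate(state, action, steps):
    """Roll out a constant action and record (mode, state) at every step."""
    history = []
    for _ in range(steps):
        history.append((mode_of(state), state))
        state = step(state, action)
    history.append((mode_of(state), state))
    return history


history = simulate((0.0, 0.0), 1.0, 30)
modes = [q for q, _ in history]
```

Pushing with a constant positive force keeps the mass in free motion until it crosses the wall position, after which it stays at rest on the wall; a negative force applied in contact returns it to the free-motion mode.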

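As hybrid automata [9] provide a formal representation of hybrid systems on which control methodology can be built, it is reasonable to formulate the switched system as a hybrid automaton $H$:

\begin{align}
f_q &: \mathcal{S}_q \times \mathcal{A}_q \to \mathcal{S}_q && \text{Mode dynamics} \tag{2.12a} \\
\mathrm{Init} &\subseteq q \times \mathcal{S}_q && \text{Initial state} \tag{2.12b} \\
E &\subseteq Q \times (Q - 1) && \text{Transition relations} \tag{2.12c} \\
G_{q,q'} &\subset \mathcal{S}_q \quad \forall (q, q') \in E && \text{Guard relations} \tag{2.12d} \\
R_{q,q'} &: \mathcal{S}_q \times \mathcal{A}_q \to \mathcal{S}_{q'} \quad \forall (q, q') \in E && \text{Reset map} \tag{2.12e}
\end{align}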



Learning a hybrid dynamical model of the form given in Eq. 2.11 has not been widely studied. The model learning method based on [1] covers the complete hybrid dynamical model presented in Eq. 2.12. Additionally, the method is a stochastic model learning method, which makes uncertainty-aware predictions.

For the sake of completeness, a brief description of the hybrid model learning approach is presented, touching on the elements of Eq. 2.12. A training dataset $D$ for the hybrid model, in the form of sequential tuples $(s_{t+1}, s_t, a_t)$, $t = 0, \ldots, T$, is collected by running the policy on the environment such that it contains sufficient states corresponding to each expected mode of a given contact-rich task.

1. Mode discovery through clustering - Owing to the rigid nature of the environment and the manipulator, an assumption is introduced which states that the subspace regions corresponding to the modes do not intersect: $\mathcal{S}_q \cap \mathcal{S}_{q'} = \emptyset \;\; \forall q \neq q' \in Q$. Therefore, in order to train the hybrid model efficiently, an initial clustering using the Dirichlet Process Gaussian Mixture Model (DPGMM) [1] is performed on the collected dataset $D$. DPGMM [33] automatically infers the number of clusters (modes) $Q$, generalizing the Gaussian Mixture Modelling (GMM) based unsupervised clustering method, where the number of mixture components has to be specified.

2. Dynamic model for each mode - In order to learn the dynamics model $f_q \;\forall q \in Q$ corresponding to each mode, stochastic model learning with Gaussian Process Regression (GPR) is utilised. A Gaussian Process is a non-parametric approach to fitting a function to data by defining a prior probability over functions [34]. The mode dynamics is modelled in the form:

\begin{align}
\Delta s_t &= s_{t+1} - s_t \tag{2.13a} \\
p(\Delta s_t \mid q_t, s_t, a_t) &\sim \mathcal{GP}(m, k) \tag{2.13b}
\end{align}

where $m$ and $k$ are the mean and covariance functions of the Gaussian Process.

3. Transition relations - As the sequential order of $D$ is preserved, the transition relations $E$ are obtained directly from the labelled result of the clustering. All observed mode transitions are recorded in a table.

4. Guard functions - The guard region determines whether a state-action pair can result in a mode transition. It can also be interpreted as a deterministic function that predicts the next mode given a state-action pair:

\[ q_{t+1} = g(s_t, a_t) \tag{2.14} \]

If $q_{t+1} \neq q_t$, this denotes a transition, which should also agree with the transition relation table $E$. A deterministic guard function is learnt using the Support Vector Machine (SVM) multi-class classification technique [1].

5. Reset Maps - Reset maps describe the discontinuous jumps in state space that occur when a transition from one mode to another takes place. They form an essential part of model learning, as without them the state evolution after transitions would be erroneous. The reset maps $R_{q,q'}$ are modelled as stochastic models of the form:

\[ p(s_{t+1} \mid q_t, q_{t+1}, s_t, a_t) \sim \mathcal{GP}(m', k'), \qquad q_t \neq q_{t+1} \tag{2.15} \]

and, as for the mode dynamics, Gaussian Process Regression (GPR) is utilised to learn the model.

Combining the learnt reset maps and the guard functions, the Init set for each mode q can also be estimated, thus completing all the elements of the hybrid automaton presented in Eq.2.12.
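Two of the steps above operate purely on the time-ordered cluster labels and can be sketched without any learning library: reading the transition relations $E$ off the label sequence, and splitting the dataset into within-mode samples (for the mode dynamics of Eq. 2.13) and cross-mode samples (for the reset maps of Eq. 2.15). The function names, the label sequence and the scalar states below are hypothetical and are not taken from [1].

```python
def transition_relations(mode_labels):
    """Collect the set E of observed mode changes (q, q') from a
    time-ordered sequence of cluster labels."""
    edges = set()
    for q, q_next in zip(mode_labels, mode_labels[1:]):
        if q != q_next:
            edges.add((q, q_next))
    return edges


def split_dataset(samples, mode_labels):
    """Split samples[t] = (s_next, s, a) by whether the mode changes.
    mode_labels[t] is the mode of s_t and mode_labels[t + 1] that of s_next,
    so mode_labels has one more entry than samples."""
    per_mode = {}  # q -> samples for the mode dynamics f_q
    per_edge = {}  # (q, q') -> samples for the reset map R_{q,q'}
    for t, sample in enumerate(samples):
        q, q_next = mode_labels[t], mode_labels[t + 1]
        if q == q_next:
            per_mode.setdefault(q, []).append(sample)
        else:
            per_edge.setdefault((q, q_next), []).append(sample)
    return per_mode, per_edge


# Hypothetical clustered trajectory with three modes and scalar states.
labels = [0, 0, 1, 1, 2, 2, 1]
states = [0.0, 0.1, 0.5, 0.6, 1.0, 1.1, 0.7]
samples = [(states[t + 1], states[t], 0.0) for t in range(len(states) - 1)]

edges = transition_relations(labels)
per_mode, per_edge = split_dataset(samples, labels)
```

The keys of `per_edge` coincide with the transition relation table, which gives a simple consistency check between the two steps.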

As discussed in Section 1.4, traditional control methods for nonlinear hybrid systems of the form expressed in Eq. 2.12 are limited; therefore, in this work a learning-based framework is utilised to control hybrid models.

2.4 Hierarchical Reinforcement Learning

To make full use of the rich structure of the learned hybrid model, an additional mechanism must be added to the common reinforcement learning setting. More recently, the concept of hierarchical reinforcement learning (HRL) has been developed, which learns different layers of policies as shown in Fig. 2.2a. In essence, this means dividing a particular task into sub-tasks with sub-goals and finding alternative pathways to achieve them, as shown in Fig. 2.2b. HRL has shown significant results in tasks based on long-term planning, such as finding the fastest path out of maze-like environments.



Figure 2.2: Visualization of HRL in solving tasks, panels (a) and (b)

2.4.1 Options framework

A well-known example of hierarchical reinforcement learning is the options framework introduced in [26].

Figure 2.3: Visualization of options in MDP

Options allow representing actions that take place at different time scales, and learning and planning with those actions coherently. Each option comprises three elements, $\{\pi_\omega, \beta_\omega, I_\omega\}$, where $\pi_\omega : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the sub-policy, $\beta_\omega : \mathcal{S} \to [0, 1]$ is the termination function and $I_\omega \subseteq \mathcal{S}$ is the initiation set. An option $\omega$ is available in a state $s_t$ if $s_t \in I_\omega$. In addition, a policy over options $\pi : \mathcal{S} \times \Omega \to [0, 1]$ is responsible for selecting a feasible option $\omega \in \Omega$ at the higher level. Typically, Markovian options are executed as follows. Given a state $s_t$, an option $\omega$ with its $\{\pi_\omega, \beta_\omega, I_\omega\}$ is selected. The action is selected according to the sub-policy $\pi_\omega(s_t)$. The environment makes a transition to $s_{t+1}$, where the current option either terminates with probability $\beta_\omega(s_{t+1})$ or continues, in which case the current sub-policy selects the next action $a_{t+1}$. On termination, the policy over options $\pi(s_{t+1})$ selects the next feasible option $\omega' \in \Omega$.

This completes the representation of options in the Markov Decision Process (MDP) with single temporal abstraction. Additional hierarchy and sub-options can also be introduced, but for this work, a single level hierarchy is sufficient.
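The call-and-return execution of Markovian options described above can be sketched as follows. The integer-line environment, the two options named `approach` and `insert`, their deterministic sub-policies and the rule that picks the first available option are hypothetical simplifications chosen for illustration; in the options framework the sub-policies and the policy over options are in general stochastic and learned.

```python
import random


class Option:
    """An option omega = {pi_omega, beta_omega, I_omega}."""

    def __init__(self, name, sub_policy, termination, initiation):
        self.name = name
        self.sub_policy = sub_policy    # s -> a (deterministic here)
        self.termination = termination  # s -> probability of terminating
        self.initiation = initiation    # s -> True if s lies in I_omega


GOAL = 10  # hypothetical goal position on an integer line


def env_step(s, a):
    """Hypothetical environment: move along the line, clipped to [0, GOAL]."""
    return max(0, min(GOAL, s + a))


OPTIONS = [
    Option("approach", lambda s: 1, lambda s: 1.0 if s >= 5 else 0.0, lambda s: s < 5),
    Option("insert", lambda s: 1, lambda s: 1.0 if s >= GOAL else 0.0, lambda s: s >= 5),
]


def policy_over_options(s):
    """Pick the first option whose initiation set contains s."""
    for option in OPTIONS:
        if option.initiation(s):
            return option
    raise ValueError("no option available in state %r" % (s,))


def run(s, max_steps, rng):
    """Execute options in call-and-return fashion and record the trace."""
    trace = []
    option = policy_over_options(s)
    for _ in range(max_steps):
        if s == GOAL:
            break
        a = option.sub_policy(s)
        trace.append((s, option.name, a))
        s = env_step(s, a)
        # Termination is evaluated in the new state s_{t+1}.
        if rng.random() < option.termination(s):
            option = policy_over_options(s)
    return s, trace


final_state, trace = run(0, 20, random.Random(0))
option_names = [name for _, name, _ in trace]
```

In this toy run the `approach` option controls the first five steps and hands over to `insert` once its termination condition fires, which is the sequencing behaviour a policy over options is meant to provide.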

In the next chapter, the idea of learning options from a hybrid dynamical model of contact-rich task and its potential benefits are presented.


Chapter 3

Problem Formulation

In this chapter, the problem of contact-rich manipulation tasks is formally described as a learning problem in the reinforcement learning (RL) setting, and the hierarchical options learning problem utilising the learned hybrid model is presented.

3.1 Skill learning for contact-rich Manipulation Tasks

A manipulation task can be formulated as an MDP by assuming the environment to be rigid and stationary. This allows methods such as reinforcement learning to be applied to tasks such as contact-rich manipulation. Expressing contact-rich manipulation as an MDP $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$: $\mathcal{S}$ is the space of joint kinematics of the manipulator, and $\mathcal{A}$ is the space of allowable joint torques that can be applied to the rotary/linear actuators of the manipulator (forces in the case of linear actuators).

The model $\mathcal{T}$ is hybrid in nature, especially for contact-rich tasks, and takes the form described in Eq. 2.11. Each piecewise continuous dynamics is defined on a certain subspace $\mathcal{S}_q \subset \mathcal{S}$ and governs the motion in that subspace. The interaction dynamics and environmental constraints are not explicitly modelled in the MDP, but are included indirectly when the transition dynamics of each mode is learnt to achieve a task. Along with this, the environment's goal state is modelled in $\mathcal{R}$. Defining the reward function is crucial in the RL setting and determines the efficiency of policy learning. Typically, the reward for contact tasks such as insertion could be defined as

\[ r(s_t, a_t) = -r_s \lVert s_t - G \rVert_2 \tag{3.1} \]

where $s_t$ is the state of the object to be manipulated, $G$ is the desired goal state and $r_s$ is the reward scale. This form ensures that the closer the state of the object to be manipulated is to the goal, the higher the reward obtained. Additionally, the reward may also include the action taken by the policy. The reward can be further shaped to account for obstacles in the environment via the use of potential functions.
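A minimal sketch of the reward of Eq. 3.1 is given below, reading the norm as the Euclidean distance to the goal, which is an assumption. The optional `action_weight` term is a hypothetical way of including the action in the reward, as the text permits; its quadratic form and default value are not taken from the thesis.

```python
import math


def reward(state, goal, action=None, reward_scale=1.0, action_weight=0.0):
    """Distance-to-goal reward: r = -r_s * ||s - G||, with an optional
    (hypothetical) quadratic penalty on the action."""
    distance = math.sqrt(sum((s - g) ** 2 for s, g in zip(state, goal)))
    r = -reward_scale * distance
    if action is not None:
        r -= action_weight * sum(a * a for a in action)
    return r


# The reward increases towards zero as the state approaches the goal.
far = reward((0.0, 0.0), (3.0, 4.0))
near = reward((3.0, 3.0), (3.0, 4.0))
at_goal = reward((3.0, 4.0), (3.0, 4.0))
```

Because the reward is never positive and reaches zero only at the goal, maximizing the return of Eq. 2.4 drives the manipulated object towards $G$.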

3.2 Policy Learning for Contact-Rich Manipulation Tasks

Both the primary and the secondary hypotheses of this thesis are restated here for clarity: a hierarchical policy that is capable of selecting appropriate sub-policies for the different dynamic modes of a hybrid dynamical system is able to outperform a non-hierarchical (base) policy in contact-rich manipulation tasks, either through faster learning or through faster adaptation to changes in the task. The hierarchical policy utilises the learnt hybrid model, whereas the base policy is completely model-free. Therefore, in order to test the hypotheses, two policies have to be formalized: a base policy and a hierarchical policy.

3.2.1 Base policy

In the base policy, a single policy $\pi(a_t \mid s_t)$ is obtained without explicitly considering the hybrid dynamical nature of the contact-rich tasks. This falls under the common reinforcement learning setting and, as discussed in Section 2.2, value-based or policy search methods can be employed. The method selected for learning the base policy is discussed in detail in the method section.

3.2.2 Hierarchical policy

The motivation behind selecting a hierarchical policy is to exploit the structure of the dynamics in order to divide the contact-rich task by introducing hierarchy. This could potentially result in a more robust and efficient policy.

The core idea is to have a hierarchy such that the lower level comprises multiple sub-policies that mirror the hybrid structure of the dynamics, while the higher level could take the form of a planner that finds the best sequence of dynamic modes. The options framework presented in Section 2.4 fits the desired method framework: a higher-level policy over options $\pi_\Omega$ selects an option, and subsequently a separate lower-level sub-policy is initiated, in call-and-return fashion, to provide the control actions. This not only has the potential to achieve the overall task faster but may also be extended to transfer learning and re-use of such sub-policies.

Model-based hierarchical policy

In its fundamental form, the options framework is model-free in nature. The major challenge is therefore how to represent such options from the learned hybrid model. The following points had to be considered:

• Representation: Should options mirror the different modes of the hybrid dynamics or the transitions from one mode to another?

• Learning components of options: How to relate the components of the options {βω, Iω} to the elements of the hybrid dynamical model such that they can be obtained without involved computation?

• Behaviour: If options mirror the transitions, how to ensure that the intended behaviour of the options is retained while learning?

These were the principal questions which were considered before searching for the method of model-based options learning as depicted in Fig.3.1. In the next chapter, the complete methodology along with necessary formulation for base policy, hierarchical policy and model-based hierarchical policy is presented.



Figure 3.1: Problem of options using hybrid dynamical model


Chapter 4

Policy Learning Methods

In this chapter, the methods used for learning the policies for the contact-rich manipulation task are presented. An extensive literature review found that existing hierarchical policy learning methods are model-free in nature. However, this work requires a model-based version. Therefore, two versions of the hierarchical policy are introduced: a model-free version, which does not utilise the hybrid model, and a model-based version, which builds on the model-free version but explicitly utilises and is represented by the hybrid model. The proposed model-based hierarchical policy learning method aims to answer the primary hypothesis of the work, that a hybrid dynamical model of the contact-rich task can be utilised effectively in learning a hierarchical policy.

Contact-rich tasks are episodic in nature. It is essential to define the terms related to the episodic reinforcement learning (RL) framework to explain the formulation. A rollout is one complete episode in the environment consisting of $T$ time steps, where $T$ is the episode length. The goal of the learning framework is to optimally complete the task within $T$ steps. In each iteration of the RL framework, $M$ rollouts are sampled from the environment using the current policy. Each such rollout comprises $T$ sample tuples: $\{s_t, a_t, r_t\}$ for the base policy and $\{s_t, a_t, r_t, \omega\}$ for the hierarchical policy. It is important to highlight that since options are selected on top of the MDP, they are not selected at every time step and are therefore not indexed with time. In each iteration, the base policy is optimised under model-free RL, whereas both the model and the hierarchical policy are optimised under model-based RL.




4.1 Base policy - Actor Critic

The Actor Critic method of policy learning lies at the intersection of the two prominent RL approaches: value-based and policy search methods. Here, the value of each state is computed but utilised only indirectly in the policy optimization. The policy is represented directly with its own parameters, as in policy search methods. The method extends the policy gradient approach, where the policy is the actor and the gradient of the RL objective given in Eq. 4.1 with respect to the policy parameters is computed:

\[ J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(s_t, a_t) \right] \tag{4.1} \]

In order to find the optimal parameters, the gradient of the objective in Eq. 4.1 with respect to the parameters is computed [35]:

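\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_t, a_t) \right] \tag{4.2} \]

Optimizing over the trajectory distribution can be reformulated as optimizing at each step: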

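\[ \nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(s_t, a_t)} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_t, a_t) \right] \tag{4.3} \]

The variance of the computed gradients is reduced by estimating the advantage function $A^\pi(s_t, a_t)$, given by:

\[ Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{p_\theta(s_{t'}, a_{t'})} \left[ r(s_{t'}, a_{t'}) \right] \tag{4.4a} \]
\[ V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)} \left[ Q^\pi(s_t, a_t) \right] \tag{4.4b} \]
\[ A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) \tag{4.4c} \]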

where $Q^\pi(s_t, a_t)$ is the Q-function, i.e. the expected reward to go given $(s_t, a_t)$, and $V^\pi(s_t)$ is the value of the state, i.e. the average expected reward from $s_t$, making $A^\pi(s_t, a_t)$ an estimate of how much better an action $a_t$ is compared to the average. This forms the critic part, which computes $A^\pi(s_t, a_t)$ and provides it to the gradient update of the actor. The critic can also be regarded as the policy evaluation step. In continuous state and action spaces, like the problem at hand, the critic (generally $V^\pi(\cdot)$) is also realised as a parameterized function approximator. Replacing the reward in Eq. 4.3 with the advantage gives:



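\[ \nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(s_t, a_t)} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t) \right] \tag{4.5} \]

The gradient expressed in Eq. 4.5 is then utilised to improve the parameters of the policy,

\[ \theta = \theta + \alpha_\theta \nabla_\theta J(\theta) \tag{4.6} \]

where $\alpha_\theta$ is the learning rate.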

4.1.1 Actor Critic with PPO & GAE

Due to the complexity of contact-rich tasks, a ‘vanilla’ stochastic gradient update based on Eq. 4.5 results in slow convergence and an oscillating policy.

In order to avoid this, a state-of-the-art policy optimization step, Proximal Policy Optimization (PPO) [36], is employed. In PPO, the objective is clipped so that excessively large updates are not propagated through the policy, which avoids the oscillating behaviour. The surrogate objective is defined as:

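\[ J^{SURR}(\theta) = \mathbb{E}_{p_{\theta_{old}}(s_t, a_t)} \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A^{\pi_{\theta_{old}}}(s_t, a_t) \right] \tag{4.7a} \]
\[ \phantom{J^{SURR}(\theta)} = \mathbb{E}_{p_{\theta_{old}}(s_t, a_t)} \left[ r_t(\theta)\, A^{\pi_{\theta_{old}}}(s_t, a_t) \right] \tag{4.7b} \]

where $r_t(\theta)$ denotes the probability ratio between the new and the old policy. The surrogate objective defined in Eq. 4.7 is clipped such that the ratio $r_t(\theta)$ is bounded: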

\[ J^{CLIP}(\theta) = \mathbb{E}_{p_{\theta_{old}}(s_t, a_t)} \left[ \min\left( r_t(\theta)\, A^{\pi_{\theta_{old}}}(s_t, a_t),\ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\, A^{\pi_{\theta_{old}}}(s_t, a_t) \right) \right] \tag{4.8} \]

This enables multiple iterations of batch gradient updates of the parameters $\theta$ with samples collected from $\pi_{\theta_{old}}$. Additionally, in order to further reduce the variance while keeping the bias of the advantage estimate under control, Generalised Advantage Estimation (GAE) [37] is used, expressed as:

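\[ \delta_{t'} = r(s_{t'}, a_{t'}) + \gamma V^\pi(s_{t'+1}) - V^\pi(s_{t'}) \tag{4.9a} \]
\[ A^{GAE}_\pi(s_t, a_t) = \sum_{t'=t}^{T} (\gamma\lambda)^{t'-t}\, \delta_{t'} \tag{4.9b} \]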



where $\lambda$ and $\gamma$ are the bias-variance and discount parameters respectively. In this work, $V^\pi$ in Eq. 4.9a (the critic) is approximated by $\hat{V}_\psi(s_t)$ with parameters $\psi$ and is trained separately, alongside the actor. The parameters of the critic are optimised by supervised regression, defining the temporal difference (TD) loss [35] as:

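\[ TD(\psi) = \mathbb{E}_{p_\theta(s_t, a_t)} \left[ \left\lVert \sum_{t'=t}^{T} \gamma^{t'} r(s_{t'}, a_{t'}) - \hat{V}_\psi(s_t) \right\rVert^2 \right] \tag{4.10a} \]
\[ \psi = \psi - \alpha_\psi \nabla_\psi TD(\psi) \tag{4.10b} \]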

where $\alpha_\psi$ is the learning rate. Thus, the resulting objective after introducing PPO and GAE becomes:

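\[ J^{CLIP}(\theta) = \mathbb{E}_{p_{\theta_{old}}(s_t, a_t)} \left[ \min\left( r_t(\theta)\, A^{GAE}_{\pi_{\theta_{old}}}(s_t, a_t),\ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\, A^{GAE}_{\pi_{\theta_{old}}}(s_t, a_t) \right) \right] \tag{4.11} \]

The gradient is computed on the objective defined in Eq. 4.11.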
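To make the clipping in Eq. 4.11 concrete, the following sketch evaluates the clipped objective on a batch of precomputed probability ratios and advantage estimates. It is an illustration only: the function names and the default `epsilon = 0.2` are assumptions and are not taken from the thesis, and no gradient computation is shown.

```python
from typing import Sequence


def clip(value: float, low: float, high: float) -> float:
    """Bound value to the interval [low, high]."""
    return max(low, min(value, high))


def ppo_clip_objective(ratios: Sequence[float],
                       advantages: Sequence[float],
                       epsilon: float = 0.2) -> float:
    """Sample average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal in length")
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        unclipped = ratio * advantage
        clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
        # Taking the minimum makes the objective a pessimistic bound,
        # so moving the ratio far from 1 brings no extra gain.
        total += min(unclipped, clipped)
    return total / len(ratios)
```

For a positive advantage the objective stops growing once the ratio exceeds $1 + \epsilon$, and for a negative advantage once it drops below $1 - \epsilon$, which is what permits several batch updates on samples from the old policy.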

Learning Base Actor Critic

For the base policy, the stochastic actor policy $\pi_\theta(a_t \mid s_t)$ is required to take the form of a multivariate Gaussian distribution $\mathcal{N}(\mu_{a_t}, \sigma_{a_t})$. This was implemented by a parametric neural network whose output was $\mu_{a_t}$, together with a separate trainable variable $\sigma_{a_t}$ shared across all dimensions of the action space. Similarly, the deterministic critic, which provided $V_\psi(s_t)$, was also represented using a neural network with a single output. The architectures of the actor and critic are shown in Fig. 4.1 and Fig. 4.2.
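The Gaussian actor described above can be sketched as follows. The mean vector is taken as an input here, standing in for the output of the actor network, and a single scalar `sigma` is shared across all action dimensions; both function names are hypothetical and the network itself is not modelled.

```python
import math
import random
from typing import List, Sequence


def sample_action(mean: Sequence[float], sigma: float,
                  rng: random.Random) -> List[float]:
    """Draw one action from an isotropic Gaussian centred on the actor output."""
    return [rng.gauss(m, sigma) for m in mean]


def log_prob(action: Sequence[float], mean: Sequence[float],
             sigma: float) -> float:
    """Log-density of an action under the isotropic Gaussian policy."""
    n = len(mean)
    squared_error = sum((a - m) ** 2 for a, m in zip(action, mean))
    return (-0.5 * squared_error / sigma ** 2
            - n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi))
```

With these two pieces, the probability ratio used by PPO can be obtained as `math.exp(log_prob_new - log_prob_old)` for the same action under the new and old parameters.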



Figure 4.1: Neural network architecture for actor of base policy with m as the dimension of the state and n as the dimension of action

Figure 4.2: Neural network architecture for critic of base policy



Algorithm 1 outlines the complete base policy learning.

Algorithm 1 Base Actor-Critic

procedure SampleTrajectory(T, M)
    repeat
        s_0 ← reset environment
        for t = 0 to T − 1 do
            a_t ∼ π_θ(s_t)
            s_{t+1}, r_t ← step(a_t)
            M ← {s_t, a_t, r_t}                                ▷ Sample tuple
    until M rollouts

procedure BaseActorCritic(λ, γ, oT, oB, T, M)
    1. Initialization
    π_θ(·) ← NeuralNet(s → Dist(a))                            ▷ policy
    V_ψ(·) ← NeuralNet(s → 1)                                  ▷ state-value function
    repeat
        M ← SampleTrajectory(T, M)
        2. Evaluation Step:
        for rollout in M do
            for t = T − 1 to 0 do
                {s_t, a_t, r_t} ← M[rollout][t]
                δ ← r_t + γ V_ψ(s_{t+1}) − V_ψ(s_t)
                A(s_t, a_t) ← δ + γλ A(s_{t+1}, a_{t+1})       ▷ GAE
                R ← R + γ^t r_t
                TD(s_t, a_t) ← |R − V_ψ(s_t)|^2
        3. Policy Optimization Step:
        for iter = 1 to oT do                                  ▷ batch optimization
            for oB in M do
                Loss_1 ← PPO( π_θ(oB_a | oB_s) / π_θold(oB_a | oB_s) · A(oB_s, oB_a) )
                θ ← θ + α_θ ∂Loss_1/∂θ
                Loss_2 ← TD(oB_s, oB_a)
                ψ ← ψ − α_ψ ∂Loss_2/∂ψ
    until converged
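The evaluation step of Algorithm 1 can be sketched for a single rollout as below. This is a hedged illustration rather than the implementation used in this work: it uses the backward recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$, which reproduces the sum in Eq. 4.9b for a finite episode with zero advantage after the last step, and it discounts the reward to go relative to step $t$, a common convention that may differ from the indexing printed in Eq. 4.10a. The function name and the convention that `values` holds one extra bootstrap entry are assumptions.

```python
from typing import List, Sequence, Tuple


def gae_and_returns(rewards: Sequence[float],
                    values: Sequence[float],
                    gamma: float,
                    lam: float) -> Tuple[List[float], List[float]]:
    """Backward pass over one rollout.

    rewards: r_0 ... r_{T-1}
    values:  V(s_0) ... V(s_T), i.e. one bootstrap value more than rewards
    """
    horizon = len(rewards)
    if len(values) != horizon + 1:
        raise ValueError("values must contain len(rewards) + 1 entries")
    advantages = [0.0] * horizon
    returns = [0.0] * horizon
    next_advantage = 0.0
    next_return = 0.0
    for t in reversed(range(horizon)):
        # One-step TD residual, as in Eq. 4.9a.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals, as in Eq. 4.9b.
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        # Discounted reward to go, used as the regression target of the critic.
        next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return advantages, returns
```

Setting `lam = 0` reduces the advantage to the one-step TD residual, while `lam = 1` gives the full discounted reward to go minus the value baseline.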



4.2 Hierarchical Policy: Option-Critic

In the literature, there are a few methods which can learn a hierarchical policy under the options framework, as in [28], [29], [30], [31]. After careful evaluation, the option-critic algorithm of [31], which is the current state of the art building on [29], was selected. The primary advantage is that the architecture is based on the well-established Actor-Critic method; secondly, it also provides the flexibility to integrate seamlessly with the elements of the learned hybrid model. Additionally, it does not suffer from the increasing complexity of providing sub-goals to sub-policies as in [28]. As the option-critic (OC) method is fundamentally model-free in nature and the proposed model-based hierarchical policy is based on the work of [29], it is presented after the base Actor-Critic (BAC) method. The model-based version utilised in this work is presented in the subsequent section.

Figure 4.3: Option Critic Architecture [29]

The architecture of the option actor critic extends the actor critic presented for the base policy, and thus follows a similar policy structure. The top-level policy over options $\pi_\Omega$ solves the selection and sequencing of options, while at the lower level the sub-policies are actors that provide the control actions for each selected option. A global critic evaluates the policy at both levels. The high-level policy for option selection is $\epsilon$-greedy in nature, while multiple actors form the sub-policies of the form $\pi_{\omega,\theta}(a_t \mid s_t)$, parameterized by $\theta$. The overall architecture is depicted in Fig. 4.3.

The objective of the hierarchical learning algorithm remains the same: optimize the expected reward over time as in Eq. 4.1. In order to set up the learning framework for such options, an additional state-option value function $Q(s_t, \omega)$ and an augmented state-option-action value function $Q^\pi(s_t, \omega, a_t)$ have to be defined,

\[ Q(s_t, \omega) = \mathbb{E}_{\pi_{\omega,\theta}(a_t \mid s_t)} \left[ Q^\pi(s_t, \omega, a_t) \right] \tag{4.12a} \]
\[ Q^\pi(s_t, \omega, a_t) = r(s_t, a_t) + \mathbb{E}_{p_\theta(s_{t+1} \mid s_t, \omega, a_t)} \left[ U(\omega, s_{t+1}) \right] \tag{4.12b} \]

Additionally, due to the stochastic nature of the termination condition of an option, the option-value function upon arrival $U(\omega, s_{t+1})$ on entering a state $s_{t+1}$ is defined as,

\[ U(\omega, s_{t+1}) = (1 - \beta_\omega(s_{t+1}))\, Q(s_{t+1}, \omega) + \beta_\omega(s_{t+1})\, V(s_{t+1}) \tag{4.13} \]

where $V(s_t)$ is the average reward obtained by taking any feasible option $\omega$ and action sequence from state $s_t$. The advantage $A^\pi(s_t, \omega, a_t)$ is computed using generalised advantage estimation as follows:

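\[ \delta_{t'} = r(s_{t'}, a_{t'}) + \gamma U(\omega, s_{t'+1}) - U(\omega, s_{t'}) \tag{4.14a} \]
\[ A^{GAE}_{\pi_\omega}(s_t, a_t) = \sum_{t'=t}^{T} (\gamma\lambda)^{t'-t}\, \delta_{t'} \tag{4.14b} \]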
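The two quantities that distinguish the option-level advantage from the base one can be sketched as small helper functions. They follow Eq. 4.13 and Eq. 4.14a directly; the function names are hypothetical, and the scalar inputs stand in for the critic and termination outputs at a given state.

```python
def value_upon_arrival(beta: float, q_option: float, v_state: float) -> float:
    """U(omega, s): blend of continuing the option and re-selecting (Eq. 4.13)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is a termination probability and must lie in [0, 1]")
    # With probability (1 - beta) the option continues and is worth Q(s, omega);
    # with probability beta it terminates and the state is worth V(s).
    return (1.0 - beta) * q_option + beta * v_state


def option_td_residual(reward: float, gamma: float,
                       u_next: float, u_current: float) -> float:
    """One-step residual of Eq. 4.14a, with U taking the place of V."""
    return reward + gamma * u_next - u_current
```

The residuals can then be accumulated with the same $(\gamma\lambda)$-weighted sum as in the base policy, which is why the option sub-policies can reuse the PPO and GAE machinery.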

Following the implementation of [31], $Q(s_t, \omega)$ is approximated by $\hat{Q}_\psi(s_t, \omega)$ with parameters $\psi$; it forms the global critic and is trained using the rewards to go obtained by taking $\omega$ in $s_t$. In the original work, the authors learned the termination function $\beta_\omega(s_t)$ as well as an interest function $\hat{I}_\omega(s_t)$ representing the initiation set of each option, which shaped the policy over options $\pi_\Omega(s_t)$ as:

\[ \pi_{I_\omega}(s_t) \propto \hat{I}_\omega(s_t)\, \pi_\Omega(s_t) \tag{4.15} \]



The gradient-descent-based learning of these functions is extremely involved [31]. However, in this work this challenge is eliminated, as they are obtained directly from the learned hybrid model. The final objective of the sub-policy has a structure similar to Eq. 4.11.

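\[ J^{SURR}_\omega(\theta) = \mathbb{E}_{p_{\theta_{old}}(s_t, a_t)} \left[ \frac{\pi_{\omega,\theta}(a_t \mid s_t)}{\pi_{\omega,\theta_{old}}(a_t \mid s_t)}\, A^{GAE}_{\pi_{\omega,\theta_{old}}}(s_t, a_t) \right] \tag{4.16} \]

In the subsequent section, the novel representation of options using the hybrid dynamical model is presented.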

4.3 Model-based Hierarchical Policy

The proposed model-based hierarchical policy for contact-rich manipulation tasks is presented in this section. The method is built upon the option-critic described in the previous section.

4.3.1 Semi Model-based Reinforcement Learning

From a standard point of view, model-based reinforcement learning (MBRL) should be approached as presented in Section 2.3 and represented in Fig. 2.1. However, the hybrid model learning presented in Section 2.3.1 imposes certain practical limitations. MBRL requires the model learning to be iterative in nature, and learning Gaussian Processes from incremental data is computationally expensive, which makes it difficult to further train and test the policy learning framework. Therefore, in this work, only a part of the hybrid model is learnt: mode discovery, transition relations and guard functions. The dynamics model part is left out, and the method thus relies on samples from the actual environment. For this reason, the proposed method is referred to as semi model-based reinforcement learning. A comparison of the present framework with the DYNA framework is shown in Fig. 4.4.



Figure 4.4: Semi model-based Reinforcement Learning

This approach is justified because it does not undermine the main hypothesis of this work, as the hybrid structure of the model is still learned and utilized for the hierarchical policy synthesis. Any future extension to a full model-based setting does not require any theoretical extension but only requires the implementation of learning for the individual dynamics modes. In the following sections, further details are presented on how the hybrid model is utilised in representing and learning a hierarchical policy.

4.3.2 Options using Hybrid Model

The options framework clearly fits the problem of learning hierarchical policy for such hybrid models where selecting the sequence of dynamical modes can be regarded as a sequence of options with ω : {πω, βω, Iω} ∈ Ω.

Representation of options as transitions

The ideal choice is to represent each transition of the hybrid dynamical system as a unique option. The transition relations obtained while learning the hybrid model can be used to initialise a set of options transitioning from one mode to another.



(a) Transition relation expressed as graph in hybrid model

(b) Options used as transitioning from one mode to other

Figure 4.5: Representation Options using Hybrid Model

As an example, let us consider a representative transition relation for a hybrid model expressed as a graph in Fig. 4.5a. Three modes, [0, 1, 2], are discovered, with five directed edges [0 → 1, 0 → 2, 1 → 0, 1 → 2, 2 → 0] representing the mode transitions. An additional node ∅ is introduced, as shown in Fig. 4.5b, denoting non-transition to any subsequent mode. The idea is to obtain a sequence of options which can solve a given MDP, here the contact-rich task. To continue the example, possible sequences of options could be [0 → ∅], [0 → 1, 1 → ∅], [0 → 1, 1 → 2, 2 → ∅] etc. Thus, each edge in Fig. 4.5b forms a sub-policy: an option ω.

The higher-level policy over options $\pi_\Omega$ is responsible for selecting the optimal sequence of ω's which solves the MDP. To formalize, for transition relations E ⊆ Q × (Q − 1) with modes q = 1, ..., Q discovered during model learning, the options that can be initialised are ω : q → q′ ⊆ Q × Q, including the additional node ∅.
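The construction of options from the transition relation can be illustrated with the example graph of Fig. 4.5. In this sketch an option is a pair `(q, q_next)`, with `None` standing for the additional node ∅; the function names are hypothetical, and the rule that a valid sequence must end with a ∅ option is an assumption read off the example sequences in the text.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

# An option q -> q', where q' = None encodes the extra node (no further transition).
Option = Tuple[int, Optional[int]]


def build_options(modes: Iterable[int],
                  transitions: Iterable[Tuple[int, int]]) -> List[Option]:
    """One option per learned transition, plus one 'stay' option per mode."""
    edges = sorted(transitions)
    options: List[Option] = []
    for q in sorted(modes):
        options.extend((src, dst) for src, dst in edges if src == q)
        options.append((q, None))
    return options


def is_valid_sequence(sequence: Sequence[Option],
                      options: Sequence[Option]) -> bool:
    """Check that options chain mode to mode and end with a 'stay' option."""
    allowed = set(options)
    if not sequence:
        return False
    for index, (src, dst) in enumerate(sequence):
        if (src, dst) not in allowed:
            return False
        is_last = index == len(sequence) - 1
        if is_last:
            if dst is not None:
                return False
        elif dst is None or sequence[index + 1][0] != dst:
            return False
    return True
```

For the three modes and five edges of the example, this yields eight options, and a sequence such as `[(0, 1), (1, 2), (2, None)]` corresponds to [0 → 1, 1 → 2, 2 → ∅].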

Option components from hybrid model

With options representing the transitions from one mode to another, the termination function $\beta_\omega(s_t)$ and the initiation set $I_\omega$ are obtained seamlessly from the learned model. An option ω should terminate if the current mode changes in the next state. Thus, the termination function $\beta_\omega(s_t)$ can be obtained by drawing samples from the distribution $p(s_t, \pi_\omega(a_t \mid s_t))$ and computing the relative frequencies of the predicted modes using the learned guard functions of the hybrid model. Considering ω : q → q′,

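\[ p(q_{t+1}) \leftarrow MC\left(g(\cdot),\ p(s_t, \pi_{\omega,\theta}(a_t \mid s_t))\right) \tag{4.17a} \]
\[ \beta_\omega(s_t) \leftarrow p(q_{t+1} \neq q) \tag{4.17b} \]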

To elaborate, the probability of termination of the current option ω : q → q′ is the probability that the next mode is not the mode in which the option was initiated, i.e. $q_{t+1} \neq q$. As the guard function $g(s_t, a_t)$ is deterministic in nature, this probability is calculated using a Monte Carlo (MC) sampling approach.
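The Monte Carlo estimate of the termination probability can be sketched as follows. The sketch treats the state as given and samples only the action from the sub-policy, which is a simplifying assumption; `sample_action` and `guard` are hypothetical callables standing in for the sub-policy and the learned guard function, and the sample count and seed are illustrative.

```python
import random
from typing import Any, Callable


def termination_probability(state: Any,
                            sample_action: Callable[[Any, random.Random], Any],
                            guard: Callable[[Any, Any], int],
                            current_mode: int,
                            n_samples: int = 1000,
                            seed: int = 0) -> float:
    """Estimate beta(s) as the fraction of sampled actions that leave the mode."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rng = random.Random(seed)
    mode_changes = 0
    for _ in range(n_samples):
        action = sample_action(state, rng)
        # The guard deterministically predicts the next mode for (state, action).
        if guard(state, action) != current_mode:
            mode_changes += 1
    return mode_changes / n_samples
```

Because the guard is deterministic, all randomness in the estimate comes from the stochastic sub-policy, so the relative frequency converges to the probability in Eq. 4.17b as the number of samples grows.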

Similarly, the initiation set $I_\omega(s_t)$ takes the value 1 if the option ω : q → q′ can be initiated, i.e. if $q_t$ is equal to $q$, and takes the value 0 otherwise. The current mode is obtained from the value of the deterministic guard function for the previous state-action pair $(s_{t-1}, a_{t-1})$. Formally, the initiation set is $I_\omega = S_q \subset S$, with $S_q$ being the subspace of each mode as defined in Section 2.3.1. This completes the representation of the options framework using the hybrid model. In the upcoming section, the hybrid model learning method is described, keeping in mind its utilization in the hierarchical options learning.
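The way the initiation set restricts the policy over options can be illustrated with a small sketch. Options are again pairs `(q, q_next)`, the option values are supplied in a dictionary standing in for the global critic, and the selection is ε-greedy over the admissible options only. The function names and the dictionary-based critic are assumptions made for illustration.

```python
import random
from typing import Dict, Optional, Sequence, Tuple

Option = Tuple[int, Optional[int]]


def initiation(option: Option, current_mode: int) -> int:
    """I_omega: 1 if the option starts in the current mode, 0 otherwise."""
    return 1 if option[0] == current_mode else 0


def select_option(options: Sequence[Option],
                  q_values: Dict[Option, float],
                  current_mode: int,
                  epsilon: float,
                  rng: random.Random) -> Option:
    """Epsilon-greedy choice among the options that can be initiated."""
    admissible = [o for o in options if initiation(o, current_mode) == 1]
    if not admissible:
        raise ValueError("no option can be initiated in the current mode")
    if rng.random() < epsilon:
        # Explore: pick uniformly among the admissible options.
        return rng.choice(admissible)
    # Exploit: pick the admissible option with the highest value estimate.
    return max(admissible, key=lambda o: q_values[o])
```

Masking by the initiation set means that an option with a high value estimate but a different starting mode is never selected, which keeps the option sequence consistent with the learned transition graph.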

4.3.3 Hybrid Model Learning

A hybrid model of the environment is learnt, partially encapsulating the elements of Section 2.3.1 and building on the method presented in [1]. The model learning methodology has been developed keeping in mind its utilization in option learning.

Mode discovery

Mode discovery plays a crucial role not only in the efficiency of the hybrid model learning method but in that of the hierarchical policy as well. The previous method, based on the Dirichlet Process Gaussian Mixture Model (DPGMM) [1], was further improved using an intuitive clustering algorithm that relies on segmentation followed by clustering based on an estimate of the dynamics of each segment.

The method is based on transition point clustering [38], where the idea is to segment and cluster the rollout data of states and actions based on the governing hybrid dynamics. Following this approach, an initial segmentation step determines the transition points tp. This can be viewed as finding the states where a change in dynamics occurs. When applied to a single rollout (trajectory), this step divides it into segments [tp(i), tp(i + 1)]. Next, it was necessary to cluster together all segments exhibiting similar local dynamics. In the clustering step, the states (position and contact force) of each segment were modelled as a multivariate Gaussian distribution, approximating the linear dynamics of the segment by computing the mean and covariance of the states.

In the next step, density-based spatial clustering of applications with noise (DBSCAN) [39] was used to cluster together all segments via their fitted Gaussians, using the Bhattacharyya distance ($D_B$) as the distance metric between Gaussians. Although this approach was developed independently, a similar idea is presented in [16], which supports the approach. As DBSCAN provides only cluster labels, for an iterative model update step it was necessary to associate the labels with unique dynamic modes. For this, a mode identification step is introduced in which the Gaussians of segments belonging to the same cluster label are merged [1]. Two Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ are merged by:

\[ \mu = w_1 \mu_1 + w_2 \mu_2 \tag{4.18a} \]
\[ \Sigma = w_1 \Sigma_1 + w_2 \Sigma_2 + w_1 \mu_1 \mu_1^T + w_2 \mu_2 \mu_2^T - \mu \mu^T \tag{4.18b} \]

where $w_1$ and $w_2$ are weights. The merged Gaussian of each cluster is denoted $\mathcal{N}_c$. Finally, an array of mode Gaussians $\mathcal{N}_g$ is maintained, and each cluster Gaussian is compared with the existing mode Gaussians based on the Bhattacharyya distance. Depending on the distance, the cluster is assigned either to an existing mode or to a new mode. Algorithm 2 outlines the implementation of the mode discovery.
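The merge and mode assignment steps can be sketched for the scalar case, where each Gaussian is a `(mean, variance)` pair. This is a deliberate simplification of the multivariate setting used in this work. The merge uses the standard moment-matching form with normalised weights, in which the last term is the square (outer product) of the merged mean, and the distance is the closed-form Bhattacharyya distance between two univariate Gaussians. The function names and the `threshold` parameter are hypothetical.

```python
import math
from typing import List, Tuple

Gaussian = Tuple[float, float]  # (mean, variance), scalar simplification


def merge_gaussians(g1: Gaussian, w1: float, g2: Gaussian, w2: float) -> Gaussian:
    """Moment-matched merge of two weighted Gaussians (weights are normalised)."""
    total = w1 + w2
    w1, w2 = w1 / total, w2 / total
    mean = w1 * g1[0] + w2 * g2[0]
    # Weighted second moments minus the squared merged mean.
    variance = w1 * (g1[1] + g1[0] ** 2) + w2 * (g2[1] + g2[0] ** 2) - mean ** 2
    return mean, variance


def bhattacharyya_distance(g1: Gaussian, g2: Gaussian) -> float:
    """Bhattacharyya distance between two univariate Gaussians."""
    average_variance = 0.5 * (g1[1] + g2[1])
    mean_term = 0.125 * (g1[0] - g2[0]) ** 2 / average_variance
    shape_term = 0.5 * math.log(average_variance / math.sqrt(g1[1] * g2[1]))
    return mean_term + shape_term


def assign_mode(cluster: Gaussian, modes: List[Gaussian], threshold: float) -> int:
    """Return the index of the closest mode, adding a new mode if none is close."""
    best_index, best_distance = -1, float("inf")
    for index, mode in enumerate(modes):
        distance = bhattacharyya_distance(cluster, mode)
        if distance < best_distance:
            best_index, best_distance = index, distance
    if best_index >= 0 and best_distance <= threshold:
        return best_index
    modes.append(cluster)
    return len(modes) - 1
```

Merging two identical Gaussians returns the same Gaussian, and merging two Gaussians with different means inflates the variance by the spread between the means, which is the behaviour expected when segments of one cluster are pooled into a single mode.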

Predefined mode region

The mode discovery method presented in the above section has some fundamental limitations. Due to the well-known local minima problem of the DPGMM and the clustering algorithm, the performance of the mode discovery deteriorates when the trajectories are not well-conditioned. This in turn affects the hierarchical policy learning. Therefore, in order to remove this uncertain behaviour, the mode region for each simulation was manually allocated based on physical intuition, providing accurate segmentation and mode label for each


