Project Report

Framework for Classical Conditioning in a Mobile Robot: Development of Pavlovian Model and Development of Reinforcement Learning Algorithm to Avoid and Predict Noxious Events

Quentin Delahaye

Technology

Studies from the Department of Technology at Örebro University, Örebro 2014

Studies from the Department of Technology at Örebro University

Quentin Delahaye

Framework for Classical Conditioning in a Mobile Robot: Development of Pavlovian Model and Development of Reinforcement Learning Algorithm to Avoid and Predict Noxious Events

Supervisors: Dr. Andrey Kiselev, Dr. Amy Loutfi
Examiner: Prof. Franziska Klügl

© Quentin Delahaye, 2014

Title: Framework for Classical Conditioning in a Mobile Robot: Development of Pavlovian Model and Development of Reinforcement Learning Algorithm to Avoid and Predict Noxious Events


Abstract

Nowadays, robots carry more and more sensors, and current technology allows using them with fewer constraints than before. Sensors are essential for learning about the environment, but they can also be used for classical conditioning, creating new behaviors for the robot. One of the behaviors developed in this thesis is avoiding and predicting obstacles.

The goal of this thesis is to propose a model for developing a specific behavior that avoids noxious events, namely obstacles.

Contents

1 Introduction
2 Background and Related Works
  2.1 Ultimate Scenario and Tools
  2.2 Type of Obstacles
  2.3 Methods to Detect Obstacles
3 Method
  3.1 Reinforcement Learning and Model of Classical Conditioning
  3.2 Different Models to Compute the Associative Strength
  3.3 Rescorla-Wagner Model of a Pavlovian Model and Reinforcement Learning
4 Implementation
  4.1 Global Architecture
  4.2 Storing the V-Values on the Map of the Environment
  4.3 Estimation of the Value of the Constants
  4.4 Rescorla-Wagner Implementation
  4.5 Code Implementation
    4.5.1 Loop Algorithm
  4.6 Compute the Position of Obstacles
    4.6.1 Implementation of Algorithm to Compute the Position of Events
5 Evaluations
  5.1 Observation Results of Computing Position of Events
  5.2 Observation Results
    5.2.1 Experiment 1
    5.2.2 Experiment 2
6 Discussion and Future Works


List of Figures

2.1 TurtleBot [2]
3.1 Representation of Pavlovian conditioning
4.1 Global Architecture
4.2 Different modules implemented and used for the Conditioning Unit
4.3 Picture of the matrix drawn with the value of V in each cell
4.4 Associative strength value after 30 trials
4.5 Associative strength value after 30 trials
4.6 Algorithm of the loop
4.7 Position of obstacles according to the event
5.1 Matrix of the environment after each forward bumper hit an obstacle (cell size: 40x40 cm)
5.2 Picture of experiment 1
5.3 Schema explaining the different paths taken by the robot
5.4 V-value displayed on the matrix with the position of the robot in green
5.5 Picture of the measurement with a box
5.6 Plan of the movement of the TurtleBot to measure the position of the box in the first hour
5.7 Evolution of the V-value of each cell from step 1 to 8 (red and blue lines correspond to walls)
5.8 Evolution of the V-value of each cell from step 1 to 11 (red and blue lines correspond to walls)
5.9 Evolution of the V-value (from step 3 to 11) of the cell which corresponds to the hit with the box at step 3

List of Algorithms

1 Pseudo code to increase the associative strength V-value in the matrix
2 Pseudo code to decrease the associative strength V-value in the matrix


Chapter 1

Introduction

The notion of reflex was introduced by Thomas Willis in the 17th century [11]. A reflex is an involuntary response to a stimulus; for example, we quickly withdraw the hand when it touches something scorching. Ivan P. Pavlov presented two different types of reflexes: the unconditioned reflex (UR) and the conditioned reflex (CR), which is acquired individually [11]. The UR is a reaction to an unconditioned stimulus (US) and the CR is a reaction to a conditioned stimulus (CS). This physiologist demonstrated that after a few trials in which the CS and US occurred simultaneously, the CS alone was enough to elicit the reflex response. Later, Skinner showed that the response to a CS can be reinforced by its consequences [11] and modifies behavior; this is operant conditioning.

In neuroscience, many models have been developed which allow a mathematical approach to classical conditioning [5]. In this thesis I am inspired by the work of Robert A. Rescorla and Allan R. Wagner [15]. A few other algorithms inspired by Rescorla-Wagner have also been developed, such as Temporal-Difference (TD) learning [13] or Q-learning, which "is a method for solving reinforcement learning problems" [7]. Each method has advantages and drawbacks; we will see later in the thesis which one corresponds to our goal.

In our case, the time and the "recognition of place" by the robot have an important impact on the development of conditioned reflexes. Time can be a significant factor in developing conditioned reflexes. Indeed, in the animal world, the effect of a US is reinforced when it occurs with temporal regularity; time then acts as a CS. This is the case in a study on the effect of drugs on rats, which demonstrated that the effect of the drug depended on the periodicity and the hour at which the rats received the injection of amphetamine [3]: the more regular the injection, the bigger the effect. This notion of periodicity could be used to implement reinforcement learning in a robot: the robot avoids obstacles or people in a crowded public area at a specific time because it already met them the day before at the same hour.


The goal of this thesis is to propose a model for developing a specific behavior that avoids noxious events, namely obstacles.

This thesis reports work done in cooperation with two other students, Kaushik Raghavan and Rohith Rao Chennamaneni. Their works are respectively about "Integration of Various Kinds of Sensors and Giving Reliable Information About the State of the Environment" and "Behavior and Path Planning". This thesis focuses on the implementation of an algorithm to develop conditioned reflexes on obstacles with reinforcement learning.

The report is organized as follows. Chapter 2 defines the goal. Chapter 3 describes our proposed method in detail. The implementation is given in Chapter 4. The evaluation of the experiments is given in Chapter 5 and a discussion of the thesis in Chapter 6.


Chapter 2

Background and Related Works

2.1 Ultimate Scenario and Tools

Nowadays it is easy for a robot to store the map of a building and to move while determining its current position.

In the project we use a TurtleBot [2] (Fig 2.1), a differential-drive robot which has multiple sensors and runs ROS [1].

In this project, we consider the ultimate scenario of a guide robot for blind people, capable of offering the best possible route for a person to reach a destination. The small size of this robot may be convenient for guiding blind people through an unknown building where indications are displayed on tall signs. To guide people, however, the robot has to elaborate a strategy and analyze the most comfortable path for the person. When assisting a blind person, it has to generate a comfortable route by already having knowledge of the path and being able to predict the position of eventual obstacles. To match this example, the best approach is a robot with a behavior similar to a dog, with some basic conditioned and unconditioned reflexes and a memory to recognize places. We were inspired by this type of situation to develop useful tools using the sensors, motor driver and other components of the robot to analyze the environment, avoid collisions, and improve reliability. We use the ROS middleware to communicate with the robot hardware and build our application so it can easily be ported to other robots. The different tools we used are:

• RGB image from the Kinect with a field of view of ±45° in diagonal
• Depth cloud from the Kinect with a field of view of ±45° in diagonal
• Three forward bumpers
• The velocity
• The wheel drop
• Motor driver
• Goal trajectory (local planning between two close points)
• TF odometry
• 2 degrees of freedom of the TurtleBot
• Battery info

Figure 2.1: TurtleBot [2]

Sensor data is pre-processed, combined and interpreted to provide input events for developing conditioned reflexes.

2.2 Type of Obstacles

In dynamic environments there are two types of obstacles: static and dynamic obstacles [4]. Another type of obstacle, however, is the crowd flow. It can have the same position every day at the same time; this is the case for traffic in buildings during busy hours. Most traffic areas are crowded at particular hours (such as arrival, lunch and departure), and there might be fixed patterns of crowd flow at some places at the same time. Most of the time this crowd is congested in corners or before a turn [9].

2.3 Methods to Detect Obstacles

SLAM (Simultaneous Localization and Mapping) is implemented in many robots. It consists of methods to build a map while estimating the position of the robot. Each detected obstacle is directly placed in the map [10]. According to article [4], SLAM detects static and dynamic obstacles, but only the dynamic obstacles are analyzed and filtered from the data generated by SLAM.

According to Amund Skavhaug and Adam L. Kleppe there are three ways to store a description of an obstacle in a map: "the vector approach and the grid approach" [10], and particle filters.

The vector approach is used in GraphSLAM [20]. In this case the obstacle is represented as a vector which contains different parameters describing it. According to Skavhaug and Kleppe, however, it is hard to find the parameters that describe an obstacle.

The grid map represents the world as a two-dimensional map split into cells of equal size. According to article [6], it is used for indoor applications. Each cell corresponds to an area and contains a value which estimates the probability of the cell being occupied or empty. However, this representation does not store any other information about obstacles.

The last method is a mix of the two previous ones. Particle filters are floating points randomly drawn in the map [19]. When an obstacle comes in contact with the particles, a point is marked on the SLAM map to represent this obstacle, and the positions of these marked points are stored. "The advantage of this method is that it is multimodal" [10].

There are different methods to detect dynamic obstacles. In the article [6] about "Real-time Detection of Dynamic Obstacle Using Laser Radar", the authors use spatial and temporal differences on a grid map to detect dynamic obstacles. They determine dynamic obstacles by comparing in real time the cost of each cell across three different grid maps taken at three different times.

In our case we propose a solution close to the grid map, and we use the classical conditioning model to develop a cost for each cell.


Chapter 3

Method

3.1 Reinforcement Learning and Model of Classical Conditioning

"Reinforcement learning is learning by interacting with an environment" [21]. It is one branch of Operant conditioning. It is subdivided in two other branches: Positive or Negative reward [18]. In our case we use a negative reinforcement learning in order to avoid noxious rewards. For example, the robot can avoid hitting a person which means for the robot to avoid receiving data from the bumper (the bumper is consider as a noxious reward).

To allow the robot to learn the position and the time of crowded areas, we propose to create associations between one conditioned stimulus and a reward, and between one unconditioned stimulus and a reward. The result is that the noxious reward event can be predicted by using the CS alone.

In this thesis, we create a connection between Position/Time (CS) and the reward, and another connection between the events interpreted from the sensors (US) and the reward (see Figure 3.1). After the CS has occurred a few times, the associative strength between it and the reward becomes stronger and stronger.

Figure 3.1: Representation of Pavlovian conditioning [8]

3.2 Different Models to Compute the Associative Strength

There are different methods to compute the strength value of the association (between Reward and CS):

• Rescorla-Wagner method (RW)
• TD: Temporal-Difference learning
• Q-learning

The particularity of TD is that it uses the time between CS and US [13, 5]; it is called a real-time model [16]. In this thesis we will not use this method because we directly link the CS and the reward; we do not compute the time between the two agents (CS and US).

Q-learning computes the delay of the prediction of a reward ("immediate reward", "delayed reward", and "pure-delayed reward" [7]). In our situation we use only one type of reward, which is periodic and not delayed. Indeed, a delayed reward is linked to the time between an event and a reward and not to the periodicity of the event.

The Rescorla-Wagner model has the advantage of being simple to implement, and it allows developing another aspect which is close to animals: inhibition of the conditioning stimulus [17]. To simplify the implementation we develop Pavlovian conditioning by using the Rescorla-Wagner formula [14] and we propose a model of reinforcement learning adapted to our subject.

3.3 Rescorla-Wagner Model of a Pavlovian Model and Reinforcement Learning

According to Rescorla and Wagner [15], equations 3.1 and 3.2 give "the associative strength of a given stimulus" [17]. The variables used in the equations are:

• λ "is the maximum conditioning US" [15]: 100 if an US occurred and 0 otherwise

• α is "rate parameters dependent" [15] on the CS • β is "rate parameters dependent" [15] on the US

• V: Strength of the association between CS and the reward


ΔV_n = αβ(λ − Σ V_n)    (3.1)

V_{n+1} = V_n + αβ(λ − Σ V_n)    (3.2)

As shown in the thesis about the effect of amphetamine on rats [3], the time factor amplifies the effect and thus the link between CS and CR. We tried to represent this aspect by taking inspiration from the movie "Groundhog Day" starring Bill Murray. It tells the story of a man who lives the same day and the same events every day. As the events are the same every day, he tries to avoid or predict them. We can interpret this movie for the robot as follows: if there is an obstacle at the same position at the same hour as yesterday, the associative strength (V-value) increases. The robot works every day from Monday to Friday; as the robot is off on weekends, we do not include them.

If the robot is at the same place and someone touches it within an interval of one second, it considers that a new conditioning situation. So we compute the RW equation again with the previous value of the associative strength between the position/time and the CR.


Chapter 4

Implementation

4.1 Global Architecture

Fig 4.1 shows the three modules we developed: the Input Unit, the Conditioning Unit and the Behavior Unit.

The World Model contains data about the environment and the internal states of the robot, and is used for information sharing with the other modules. The data contained in the World Model are:

• Clock: simulation of a clock which contains the hour and the day
• Result of the Conditioning Unit stored in a matrix (Section 4.2)
• Current position of the robot
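As a rough illustration of how this shared data could be organized, the following minimal C++ sketch uses hypothetical names (SimClock, WorldModel) that are not taken from the actual implementation:

// Illustrative sketch of the World Model data (hypothetical names).
#include <array>
#include <vector>

struct SimClock {
    int day;    // 0 = Monday ... 4 = Friday (weekends are excluded)
    int hour;   // 0..23
};

struct WorldModel {
    SimClock clock;                                  // simulated clock
    std::array<std::vector<double>, 24> vMatrices;   // one V-value matrix per hour, stored row-major
    double robotX, robotY;                           // current position of the robot in meters
};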

The Interpreter Unit analyses the data from the different sensors (bumpers, RGB Kinect, ...) of the robot and interprets the sensor data according to the situation. It defines the type of each event that occurred by using basic sentences like "I hit something on the left", "I hit something on the right" or "I approach something", etc. These sentences correspond to the unconditioned stimulus. The Interpreter Unit sends all interpreted events to the Conditioning Unit and the Behavior Unit.

The Conditioning Unit computes the value of the associative strength (V) between the position and the time and the reward (which is here a punishment). For instance, if a dynamic obstacle occurs a few times at the same hour and position, conditioning develops such that the robot predicts and avoids it the next time. This module reads the type of the events (sent by the Interpreter Unit) and updates the V-value (Fig 4.2). It also reads the time and the position of the robot (Figure 4.2).

The Behavior Unit generates the best path for the robot. It is composed of two parts: the Behavior planner and the Path planner. The Behavior planner allows the robot to develop basic action reflexes by reading the data sent by the Interpreter Unit; for example, the robot moves back when someone hits a bumper.


Figure 4.1: Global Architecture

The Path planner computes the best path. It reads data from the Conditioning Unit and generates the path while comparing the V-value of each cell.

4.2 Storing the V-Values on the Map of the Environment

To store the data we use matrices, a model similar to the grid map. The matrices store the resulting V-values and the types of events sent by the Interpreter Unit. Several matrices are used, one matrix for each hour.

We do not need to store the V-values in a large number of cells. To analyze crowd flows it is preferable to use a cell size of 40 cm, because it matches human anatomy [12] and it is larger than the size of the TurtleBot, which is 36 cm. Events are also stored in the matrix, linked to the cell and the hour they correspond to. The same event cannot be saved more than twice in one cell at a specific hour. To save the data of each matrix, a text file is generated.

Fig 4.3 is a picture of the matrix, coded in Java, showing the cost of each cell (from 0 to 100) and the position of the robot in green.

We compute the position of each type of event as described in Section 4.6. However, to update the V-value of a cell we need to know the number of the cell which contains the event (or obstacle). That is why we have a function which converts an X/Y position to a cell number.
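A minimal sketch of such a conversion, assuming 40 x 40 cm cells, a map origin at (0, 0) and row-major cell numbering (the function name is illustrative, not the thesis code):

// Convert a world position (meters) to a cell number in the matrix.
int positionToCell(double x, double y, int matrixWidth, double cellSize = 0.40) {
    int col = static_cast<int>(x / cellSize);
    int row = static_cast<int>(y / cellSize);
    return row * matrixWidth + col;  // row-major cell number
}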


Figure 4.2: Different modules implemented and used for the Conditioning Unit


Figure 4.4: Associative strength value after 30 trials [14] with α = 1 and λ = 100

Figure 4.5: Associative strength value after 30 trials [14] with α = 1 and λ = 0 after the 6th trial


4.3 Estimation of the Value of the Constants

To determine the value of β, which depends on the event, Fig 4.4 shows the evolution of the associative strength (V) over 30 trials using the RW algorithm. As we can see, with β = 0.3 the associative strength V-value is above 40 after two trials. This value means that the association is strong enough to create a direct link between the CS and the reward.

Figure 4.5 shows that when no event is met after conditioning, with β = 0.3 the curve drops below 40 after one trial. So the link between the CS and the reward becomes weak after only two failures.
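As a quick numerical check of these curves, assuming a single CS so that the sum in equation 3.2 reduces to the current V, with α = 1, β = 0.3 and V_0 = 0:

V_1 = 0 + 0.3 × (100 − 0) = 30
V_2 = 30 + 0.3 × (100 − 30) = 51 > 40

and, when no US occurs (λ = 0), starting from V = 51:

V_3 = 51 + 0.3 × (0 − 51) = 35.7 < 40

which is consistent with the two-trial and one-trial thresholds discussed above.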


4.4 Rescorla-Wagner Implementation

We simplified the Rescorla-Wagner equation seen in Section 3.3 with α = 1:

V_now = V_previous + β(λ − Σ V_previous)    (4.1)

The value of β depends on the importance of the type of event sent by the Interpreter Unit. Indeed, some events like "I am in danger" are more important and have more impact than "something approaches me". To show this difference we reduced the number of trials needed before getting a V-value above 40, as seen in Section 4.3, by changing β.

• β = 0.3 (2 trials before V > 40): "I hit on the left", "I hit on the right", "I hit in front of me", "I approach something", "something approaches me"
• β = 0.4 (1 trial before V > 40): "I am in danger"
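A minimal C++ sketch of this mapping from event type to β (the event strings follow the list above; the function name is illustrative):

#include <string>

// Return the learning-rate parameter beta for a given event type.
double betaForEvent(const std::string& event) {
    if (event == "I am in danger")
        return 0.4;   // strong event: V > 40 after one trial
    return 0.3;       // bumper hits and approach events: V > 40 after two trials
}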

4.5 Code Implementation

4.5.1 Loop Algorithm

For the experiments, the main algorithm (Fig 4.6) is computed every second. The V-value increases if new events are met. First we read the buffer which contains all events met during the last second. After that, for each event we compute (see details in Algorithm 1):

• The position of the event in the matrix
• The V-value read from the matrix for the cell concerned
• The new V-value with λ = 100

The V-value decreases if no events are met. First, we read the position of the robot in the matrix. After that, we test whether the position of the robot differs from the previous one or whether the hour has changed. If so, for each event met in that cell we compute the new V-value with λ = 0 (Algorithm 2).

Algorithms 1 and 2 are coded in C++. Here we outline the main steps of the implementation of these two algorithms, which increase and decrease the V-value:

Figure 4.6: Algorithm of the loop

Algorithm 1 Pseudo code to increase the associative strength V-value in the matrix

Require: Type of Event.
Require: Current Cell.
Require: Hour.
Ensure: V.
1: for all types of Event met during the last second do
2:   Get the cell number of the event
3:   Get V from the cell of the Matrix
4:   Compute the β value according to the type of Event
5:   Compute V with λ = 100
6: end for
7: if the Type of Event is new for this cell number and hour then
8:   Store the Type of Event in the Matrix Data
9: end if

Algorithm 2 Pseudo code to decrease the associative strength V-value in the matrix

Require: Current V.
Require: Current Cell.
Require: Hour.
Ensure: V.
1: for all types of Event STORED in the matrix at a specific hour and cell number do
2:   Compute the β value according to the type of Event
3:   Compute V with λ = 0
4: end for
5: if V < 10 then
6:   Erase all events met and stored at this specific hour and cell number
7: end if
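As a rough, simplified C++ sketch of these two updates (not the exact thesis code; the matrix is reduced to a flat vector of V-values, one per cell, and β is chosen as in Section 4.4):

#include <string>
#include <utility>
#include <vector>

// Rescorla-Wagner update with alpha = 1 and a single CS (equation 4.1): V <- V + beta * (lambda - V)
double rescorlaWagner(double v, double beta, double lambda) {
    return v + beta * (lambda - v);
}

// beta depends on the type of event (Section 4.4).
double betaFor(const std::string& type) {
    return (type == "I am in danger") ? 0.4 : 0.3;
}

// Algorithm 1: increase V for every (cell, event type) met during the last second.
void increaseV(std::vector<double>& vMatrix,
               const std::vector<std::pair<int, std::string>>& events) {
    for (const auto& [cell, type] : events)
        vMatrix[cell] = rescorlaWagner(vMatrix[cell], betaFor(type), 100.0);  // lambda = 100: US occurred
}

// Algorithm 2: decrease V for the events stored in a cell when no US occurred.
void decreaseV(std::vector<double>& vMatrix, int cell,
               const std::vector<std::string>& storedEvents) {
    for (const auto& type : storedEvents)
        vMatrix[cell] = rescorlaWagner(vMatrix[cell], betaFor(type), 0.0);    // lambda = 0: no US
    // If V drops below 10, the stored events for this cell and hour are erased (omitted here).
}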

4.6 Compute the Position of Obstacles

The TurtleBot has to compute the position of obstacles according to the type of event. Indeed, the TurtleBot has three forward bumpers:

• Bumper left
• Bumper front
• Bumper right


Figure 4.7: Position of obstacles according to the event

The position of the obstacle must be computed according to the bumper activated. In Figure 4.7 we can see two obstacles which are in different cells (cell 4 and cell 6).

We compute the position of obstacles by using trigonometry:

X_obstacle = X_robot + cos(θ_robot + θ_obstacle) × d    (4.2a)
Y_obstacle = Y_robot + sin(θ_robot + θ_obstacle) × d    (4.2b)

where the variables in equations 4.2 are:

• θ_robot: angle of the robot with respect to the origin
• θ_obstacle: angle of the obstacle, defined according to the type of event
• X_robot and Y_robot: position of the robot with respect to the origin
• d: distance between the center of the robot and the extremity of the sensor

4.6.1 Implementation of Algorithm to Compute the Position of Events

To compute the position of the event, we first compute the orientation of the TurtleBot with respect to the origin axis. The TF package from ROS allows getting this angle easily. It returns a value from 0 to 3, which corresponds to 0° to 180° from the forward direction towards the left of the robot, and vice versa towards the right using negative values (0 to -3 corresponds to 0° to -180°).


We define for each type of event the theoretical position that the robot deduces from its sensors. In Table 4.1 we give the angle for the following events:

Table 4.1: Angle of the obstacle according to the type of event

Type of event          Angle
obstacle on the left   45°
obstacle on the right  -45°
obstacle in front      0°
someone approaches     0°
wheel drop             0°
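A minimal C++ sketch of equations 4.2 combined with this table (illustrative only; angles are handled in radians and the event strings follow the Interpreter Unit sentences):

#include <cmath>
#include <string>

// Angle of the obstacle relative to the robot heading, per type of event (Table 4.1).
double obstacleAngle(const std::string& event) {
    const double kPi = 3.14159265358979323846;
    if (event == "I hit on the left")  return  kPi / 4.0;   //  45 degrees
    if (event == "I hit on the right") return -kPi / 4.0;   // -45 degrees
    return 0.0;                                              // front hit, approach, wheel drop
}

// Equations 4.2a and 4.2b: position of the obstacle from the robot pose.
// thetaRobot is the yaw from TF in radians; d is the distance from the robot
// center to the sensor (25 cm in the experiments of Section 5.1).
void obstaclePosition(double xRobot, double yRobot, double thetaRobot,
                      const std::string& event, double d,
                      double& xObstacle, double& yObstacle) {
    const double theta = thetaRobot + obstacleAngle(event);
    xObstacle = xRobot + std::cos(theta) * d;
    yObstacle = yRobot + std::sin(theta) * d;
}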


Chapter 5

Evaluations

5.1 Observation Results of Computing Position of Events

We tested the computation of the position of the obstacles each time the robot met an obstacle. Fig 5.1 shows the results we got for the three forward bumpers (the red point is the center point of the base of the TurtleBot, given in Table 5.1):

Table 5.1 shows three positions given by the program containing the algorithm that computes the position of events. We used a distance of 25 cm between the center position of the robot and the event.

With a cell size of 40 cm, some obstacles detected by the bumper sensors can fall into the same cell. This is the case for the events detected by the right and front bumpers in Fig 5.1.

Figure 5.1: Matrix of the environment after each forward bumper hit an obstacle (cell size: 40x40 cm)


Table 5.1: Position of object and event computed by the algorithm and the Matrix function

Object/Event    Position X, Y, θ        Center of the cell concerned
Robot           0.91, 0.79, 85.56°      1.00, 0.60
Bumper LEFT     0.75, 0.98              0.60, 1.00
Bumper FRONT    0.93, 1.04              1.00, 1.00
Bumper RIGHT    1.10, 0.95              1.00, 1.00

Figure 5.2: Picture of the experiment 1

5.2 Observation Results

5.2.1 Experiment 1

The following experiment tests all the implemented modules together on the TurtleBot. The robot hits an obstacle during its course: the Input Interpreter, the Conditioning Unit and the Behavior planner have to communicate to develop a new behavior after this event. Fig 5.3 displays the different paths taken by the TurtleBot.

• Trial 1: The TurtleBot goes directly to the goal; there are no obstacles on its way
• Trial 2: The TurtleBot hits an obstacle on its path and generates another path to avoid the obstacle
• Trial 3: The TurtleBot generates a new path which does not include the cell where the obstacle was in trial 2

Figure 5.3: Schema explaining the different paths taken by the robot

Fig 5.4 shows the result at the end of the experiment, at trial 3. The TurtleBot hit an obstacle (here a foot, Fig 5.2) which generated events from its sensors. Some V-values increased according to the position of the obstacle and were stored at a specific hour. From the results of experiment 1, we can see that the Path planner developed a new behavior which consists in avoiding some cells and taking another path.

5.2.2 Experiment 2

As discussed in Section 2.3, crowd flow areas are very concentrated in corners. We decided to create a similar situation by using a box of size 48 × 32 cm. We placed this box at the entrance of a small corridor 132 cm wide. The box has the advantage of being close to the cell size used by the Matrix program (Section 4.2). Fig 5.5 shows the entrance with the box, which takes up only a fourth of the entrance.

The box here represents a person who suddenly appeared in the corner next to the wall. Only the bumper of the TurtleBot detected this person, after collision, because the field of view of the Kinect sensor did not allow seeing the person behind the wall.

Fig 5.6 shows the path of the TurtleBot for the measurement of the V-value. At each return to the start point we change the position of the box, in order to do three trials with three different positions of the box.


Figure 5.4: V-value displayed on the matrix with the position of the robot in green

Figure 5.5: Picture of the measurement with a box


Figure 5.6: Plan of the movement of the TurtleBot to measure the position of the box in the first hour

The position of the box changes after each return to the start point; thanks to that we obtain the result of the movement of a crowd flow in two directions. The numbers in Fig 5.6 represent the different steps of the robot. The first goal is situated on the left (step 2), the second on the right (step 6) and the last goal on the left (step 10). To control the TurtleBot, a program allowing manual keyboard teleoperation was executed. We decided that the TurtleBot takes the shortest path to go to a point, that is to say it has to go close to the corners. But if the cells situated in the corners already have a V-value, the TurtleBot moves to the cells where the V-value is the smallest. The goal of this scenario is to demonstrate:

• The increase of the V-value in different cells
• The decrease of the V-value in one precise cell when the robot moves over this cell a few times (blue path between steps 9 and 10)
• The detection of events with the three forward bumpers

Fig 5.7 displays the results from step 1 to 8: the cost of different cells increased in each corner. After the scenario is over, Fig 5.8 displays the results from step 1 to 11. The TurtleBot hits an obstacle in front of it; at this moment the best way to go to step 11 is to turn right. As there is no obstacle in the cell corresponding to the old position of the box at trial 1, the V-value decreases to 24.99 after one return to the start point. We can observe the evolution of the V-value in Fig 5.9 according to the number of trials.


Figure 5.7: Evolution of the V-value of each cell from step 1 to 8 (red and blue lines correspond to walls)


Every second the TurtleBot computes the associative strength V-value according to the events met. The V-value is then placed in a cell according to the current position of the robot and the type of event. Table 5.2 shows that the different positions of the box in the real environment agree with the corresponding cells where the V-value is different from 0. In Fig 5.8 we can observe three areas, and according to Table 5.2 these areas are linked to the different positions of the box in the different trials at the same hour.


Figure 5.8: Evolution of the V-value of each cell from step 1 to 11 (red and blue lines correspond to walls)

Table 5.2: Comparison between box positions and cells where the V-value increased

Box            Center position X, Y of the    Center position of the cells at the        Bumper used
               box in the real environment    different steps of the measurement
BOX trial 1    1.45, 2.29                     Step 1: 1.40, 1.80; Step 3: 1.80, 2.20     Bumper Left, Bumper Right (twice)
BOX trial 2    2.47, 2.29                     Step 5: 2.60, 1.80; Step 7: 2.60, 2.20     Bumper Right, Bumper Left
BOX trial 3    1.95, 2.29                     Step 9: 2.20, 1.80                         Bumper Front

Figure 5.9: Evolution of the V-value (from step 3 to 11) of the cell which corresponds to the hit with the box at step 3

Chapter 6

Discussion and Future Works

In this thesis, we proposed a model to develop a specific behavior which allows avoiding noxious events like obstacles.

On the classical conditioning part, the Rescorla-Wagner model has been used to increase or decrease the associative strength V-value and has been tested (Fig 5.9). The reinforcement learning has been implemented by using a matrix to store every event met, with its position and the hour. Fig 4.6 shows the solution we implemented in the robot to predict events.

The position of obstacles is computed using the three forward bumpers, with the method seen in Section 4.6; the event is then assigned to the corresponding cell of the matrix. The results displayed in Table 5.2 indicate that this method is correct.

Experiment 1 shows a situation of reinforcement learning through the development of a new behavior (avoiding the cell which contained the obstacle in the previous trial). The result in Fig 5.8 illustrates the use of this model. This scenario is limited by the fact that the robot is teleoperated instead of using the path planner, and that it does not take the hour into account. Experiment 2, however, demonstrates the evolution of the V-value according to the movement of the crowd and the position of the event.

To return to the ultimate scenario of a guide robot for blind people, it would be advantageous to integrate the hour and not only the position. Even though this has not been achieved, basic tools have been developed to pursue this goal (reinforcement learning and the Rescorla-Wagner model).

To store the V-values, we used a model based on the grid map. Even though this method has the advantage of saving resources, its precision is too low to detect the shape of a static obstacle. An alternative would be to use the particle filter method.

The β-values were chosen arbitrarily in this thesis so that only two trials are needed before creating a direct link between the CS and the reward. The events are stacked and published every second, so some events are published with a delay; this is one of the limitations of this system.


Moreover, the V-value can be increased once or twice for the same obstacle, because we do not compute the time between events.

According to the experimental results, the following improvements could be applied to upgrade the proposed model:

• Compute the time between events occurring in the same cell
• Add other sensors (Kinect, sound, ...)


References

[1] ROS. http://www.ros.org/. Accessed: 2014-05-13.

[2] TurtleBot. http://www.turtlebot.com/. Accessed: 2014-05-13.

[3] A. Arvanitogiannis, J. Sullivan, and S. Amir. Time Acts as a Conditioned Stimulus to Control Behavioral Sensitization to Amphetamine in Rats. PhD thesis, Concordia University, Montreal, Quebec, 2000.

[4] Baifan Chen, Lijue Liu, Zhirong Zou, and Xiyang Xu. A hybrid data association approach for SLAM in dynamic environments. pages 1–7, 2012.

[5] Christian Balkenius. Computational models of classical conditioning: a comparative study. (Kamin 1968):1, 1998.

[6] Baifan Chen, Zixing Cai, Zheng Xiao, Jinxia Yu, and Limei Liu. Real-time detection of dynamic obstacle using laser radar. In Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for, pages 1728–1732, Nov 2008.

[7] Chris Gaskett. Q-learning for robot control. 1:21–27, 2002.

[8] P. Gaussier. Sciences cognitives et robotique : le défi de l'apprentissage autonome. pages 35–39.

[9] K. Katabira, T. Suzuki, H. Zhao, Y. Nakagawa, and R. Shibasaki. An analysis of crowds flow characteristics by using laser range scanners. page 955.

[10] Adam Leon Kleppe and Amund Skavhaug. Obstacle detection and mapping in low-cost, low-power multi-robot systems using an inverted particle filter. pages 1–15, 2013.

[11] Dominique Lecourt. Dictionnaire d'histoire et philosophie des sciences. Presses universitaires de France, 2003 edition.

[12] Masakuni Muramatsu, Tunemasa Irie, and Takashi Nagatani. Jamming transition in pedestrian counter flow. Physica A: Statistical Mechanics and its Applications, 267:487–498, 1999.

[13] Yael Niv. Reinforcement learning in the brain. pages 1–38, 1997.

[14] Michael J. Renner. Learning the Rescorla-Wagner model of Pavlovian conditioning: An interactive simulation. 2004.

[15] R. Rescorla. Rescorla-Wagner model. 3(3):2237, 2008. Revision #91711.

[16] Richard S. Sutton and Andrew G. Barto. A temporal-difference model of classical conditioning.

[17] Jean-Marc Salotti and Florent Lepretre. Classical and operant conditioning as roots of interaction for robots. 2013.

[18] J. E. R. Staddon and Y. Niv. Operant conditioning. 3(9):2318, 2008. Revision #91609.

[19] S. Thrun. Particle filters in robotics. In Proceedings of the 17th Annual Conference on Uncertainty in AI (UAI), 2002.

[20] S. Thrun and M. Montemerlo. The GraphSLAM algorithm with applications to large-scale mapping of urban structures. International Journal on Robotics Research, 25(5/6):403–430, 2005.

[21] F. Woergoetter and B. Porr. Reinforcement learning. 3(3):1448, 2008. Revision #91704.
