
ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA

Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1983

Machine Behavior Development and Analysis using Reinforcement Learning

YUAN GAO

ISSN 1651-6214 ISBN 978-91-513-1053-4


Dissertation presented at Uppsala University to be publicly examined in Häggsalen, Ångströmlaboratoriet, Lägerhyddsvägen 1, Uppsala, Friday, 11 December 2020 at 10:00 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Elin Anna Topp (Lund University).

Abstract

Gao, Y. 2020. Machine Behavior Development and Analysis using Reinforcement Learning.

Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1983. 43 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-1053-4.

We are approaching a future where robots and humans will co-exist and co-adapt. To understand how a robot can co-adapt with humans, we need to understand and develop efficient algorithms suitable for our interactive purposes. Not only can this help us advance the field of robotics, it can also help us understand ourselves. Machine Behavior, a subject proposed by Iyad Rahwan in a recent Science article, studies algorithms and the social environments in which algorithms operate. The view of that paper is that, when we want to study any artificial robot we create, a two-step method based on logical positivism should be applied, as in the natural sciences: on one hand, we need to propose a theory based on logical deduction, and on the other hand, we need to set up and conduct empirical experiments.

Reinforcement learning (RL) is a computational model that helps us build a theory to explain the interactive process. Integrated with neural networks and statistics, current RL is able to obtain a reliable learning representation and adapt over interactive processes. It might be one of the first times that we are able to use a theoretical framework to capture uncertainty and adapt automatically during interactions between humans and robots. Though some limitations have been observed in different studies, many positive aspects have also been revealed. Additionally, considering the potential these methods have shown in related fields, e.g. image recognition, physical human-robot interaction and manipulation, we hope this framework will bring more insights to the field of robotics. The main challenge in applying deep RL to the field of social robotics is the volume of data. In traditional robotics problems such as body control, simultaneous localization and mapping, and grasping, deep reinforcement learning often takes place only in a non-human environment. In such an environment, the robot can interact with the environment without limit to optimize its strategies. However, applications in social robotics tend to be set in a complex environment of human-robot interaction. Social robots require human involvement every time they learn in such an environment, which leads to very expensive data collection.

In this thesis, we will discuss several ways to deal with this challenge, mainly in terms of two aspects, namely, evaluation of learning algorithms and the development of learning methods for human-robot co-adaptation.

Keywords: reinforcement learning, robotics, human robot interaction

Yuan Gao, Department of Information Technology, Division of Visual Information and Interaction, Box 337, Uppsala University, SE-751 05 Uppsala, Sweden.

© Yuan Gao 2020 ISSN 1651-6214 ISBN 978-91-513-1053-4

urn:nbn:se:uu:diva-423434 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-423434)


To my family


Sammanfattning på svenska (Summary in Swedish)

We are approaching a future where robots and humans will co-exist and co-adapt to each other. To understand how a robot can adapt to humans, we must understand and develop efficient algorithms suited to our interactive purposes. This can advance our understanding not only of robotics, but also of ourselves. The subject Machine Behavior, proposed by Iyad Rahwan in a recent scientific article, studies algorithms and the social environments in which algorithms operate. What that paper tells us is that when we want to study an artificial robot we have created, a two-step method based on logical positivism should be applied, as in the natural sciences. That is, on one hand we must propose a theory based on logical deduction, and on the other hand test it through empirical experiments.

Reinforcement learning (RL) is a computational model that helps us build a theory to explain interactive processes. Using neural networks and statistics, reinforcement learning can now create a reliable learning representation and adapt over interactive processes. It may be one of the first times that we have been able to use a theoretical framework to capture uncertainty and adapt automatically during interactions between humans and robots. Although some limitations have been observed in different studies, many positive aspects have also been revealed. Given the potential of these methods observed in related fields such as image recognition and physical human-robot interaction and manipulation, we hope that this framework will bring insight to the field of robotics. The biggest challenge in using deep reinforcement learning (Deep RL) in the field of social robotics is the amount of data.

In traditional robotics problems such as body control, grasping, and simultaneous localization and mapping, deep reinforcement learning often takes place only in a non-human environment. In such an environment, the robot can learn in the environment without limit in order to optimize its strategies. However, applications in social robotics tend to be situated in an environment with complex interaction between human and robot. Social robots require human involvement every time they learn in such an environment, which leads to very costly data collection. In this thesis, we discuss several ways of handling this challenge, mainly in terms of the evaluation of learning algorithms and the development of learning methods for human-robot co-adaptation.


Contents

1 Introduction
1.1 Outline of thesis
1.2 List of papers
2 The Dilemma in Social HRI
2.1 Motivation
3 Reinforcement Learning in Social HRI
3.1 Overview
3.2 Reinforcement learning in robotics
3.3 Reinforcement learning in social robotics
3.4 Reinforcement learning for social human-robot co-adaptation
3.5 Conclusion
4 Methodology
4.1 Machine Behaviour
4.2 Reinforcement Learning
4.3 Markov Decision Process
4.3.1 Partially Observable Markov Decision Process
4.3.2 Markov Decision Process with Continuous States
4.3.3 Value Functions
4.3.4 Policy Gradient Methods
4.3.5 Classification of the Regarded RL Problems
4.4 General Methods to Tackle Data Inefficiency in HRI
4.4.1 Simulation
4.4.2 Meta Learning
4.5 Machine Behaviour Development
4.6 Machine Behaviour Analysis
5 An Overview of Included Publications
6 Conclusions and future work
6.1 Future Work
References


1. Introduction

More and more robots are now being used to support humans in new social roles, such as providing assistance in elderly care at home [10], serving as personalized tutors [44], acting as therapeutic tools for children with autism [8], or as game companions for entertainment purposes [11]. Nonetheless, social robots remain far from human skill levels, especially in areas like human-robot co-adaptation. In order for social robots to be highly interactive, they need to have the ability to understand and adapt to another agent’s needs, preferences, interests and emotions. Simulating the tremendous social adaptation abilities that characterize human-human interactions requires the establishment of bidirectional processes in which humans and robots synchronize and adapt to each other in real time by means of an exchange of verbal and non-verbal behaviors (e.g. facial expressions, gestures and speech) in order to achieve mutual co-adaptation.

Over a long period, the social robotics community has tried to establish the aforementioned technological infrastructure, but to date it still faces challenges. On one hand, the social robotics community has focused on understanding psychological phenomena in social human-robot interaction (HRI); on the other hand, the community lacks cross-disciplinary synergies with computational domains such as machine learning (ML), especially in the area of data-driven methods, which have been widely used in engineering. Beyond the field of study itself, the scientific and technical issue is primarily that data-driven methods have also not been mature enough to be applied to interaction modeling.

In recent years, technical advances in ML methods [43] have opened a door to new ways of building social robotic systems. ML, as the technological base for state-of-the-art automatic speech recognition and image analysis, has contributed to the development of different areas of social robotics. However, modeling the interactive process using ML is an area of research that is still in its early stages. In addition to efforts made in the robot Learning from Demonstration domain [18, 65, 42], reinforcement learning (RL)-based frameworks have been used before in socially assistive scenarios to select motivational strategies [23] or supportive behaviors [40] personalized to each participant, but they normally rely on a small amount of data and relatively simple algorithms. My Ph.D. studies focused primarily on the development of adaptation techniques in social HRI and on how these adaptation techniques, based on RL, could be applied with a limited amount of data using advanced ML techniques.


Over the years of development, ML algorithms have demonstrated their superb abilities to tackle different problems in traditional fields of computer science, e.g. computer vision and natural language processing [49].

The continuing trend of ML breaking new ground in traditional fields of computer science demonstrates its ability to be applied to different problems as a general model. Additionally, a particular branch called deep learning, based on artificial neural networks with representation learning, is the main factor behind the improvement in the ML field. The main advantage of this branch is that it can predict or classify input data with high accuracy across different tasks. When combined with the RL sub-field, it performs even better in interactive scenarios. The RL field concerns how an agent can take actions in an environment to maximize a predefined notion of cumulative reward.

In fact, the intersection of these two sub-fields is called deep reinforcement learning (DRL).

Figure 1.1. A typical scenario in social robotics research. To successfully interact with humans in such an environment, the social robot needs to learn and adapt quickly from all the kinds of data it can access.

While DRL algorithms have demonstrated their ability in general robotics, especially in automatic robot perception and control, such as grasping and locomotion [43], applying DRL in social robotics has not been widely studied. The question of how to make robots learn appropriate social behaviors under these modern frameworks remains underexplored. As a consequence, the interaction scenarios studied with RL algorithms in previous research have been limited to simplified cases, and the algorithms studied to relatively simple ones [19]. The main challenge in applying DRL to the field of social robotics is the volume of data. In traditional robotics problems such as body control, simultaneous localization and mapping (SLAM) and grasping, DRL often takes place only in a non-human environment. In such an environment, the robot can interact with the environment without limit to optimize its strategies. However, applications in social robotics tend to be set in a complex environment of HRI (see Figure 1.1). Social robots require human involvement every time they learn in such an environment, which leads to very expensive data collection. In this thesis, we will discuss several ways to deal with this challenge. We will discuss how to approach this challenge in terms of two aspects, namely:

• Evaluation of learning algorithms: This direction primarily considers how experiments can be designed with relatively simple algorithms that can test participants’ perception of socially adaptive robots. This aspect will help us to further sort out what impact different adaptation algorithms have on social robotics applications and thus guide us on what to look for when developing new algorithms. (Paper A and Paper C)

• Development of learning methods for human-robot co-adaptation: This aspect focuses more on how advanced ML algorithms for socially adaptive robots could be developed and tested in a social robotics scenario, based on the understanding we gathered in the user studies. This aspect is also the main one addressing the "lack of data" problem in social robotics. (Paper B and Paper D)

Although these two aspects are sometimes considered to be quite different lines of research, they go hand in hand. Together they complete a circle of methodological evaluation for the development of social robotics applications based on RL (see Figure 1.2). In the list of included papers (Section 1.2), we specifically discuss each article’s contribution to the field in these two areas. The main contribution of my thesis to the field is threefold. Firstly, we advocated establishing a link between deep RL and social robotics, which provides a previously missing link to increase cross-disciplinary synergies. Secondly, we proposed several frameworks to incorporate social signals into the RL framework for different social tasks, providing initial solutions for modeling social human-robot interactive activities using RL frameworks. Last but not least, we evaluated the algorithms in real situations using different psychological metrics, showing that models developed in simulated social scenarios could be transferred to the real world. However, there is still a gap between the real and simulated environments, indicating that this should be one of the main focuses for future researchers in this field.

Figure 1.2. An illustration of the complete developmental circle of a social robotics application under the framework of reinforcement-learning-based social robotics.


1.1 Outline of thesis

In this thesis, I discuss different ways to build an adaptive system for social HRI using RL frameworks. Chapter 1 is presented as a general introduction to my Ph.D. study, followed by a motivation introduced in Chapter 2. A summary of previous literature addressing RL in social HRI is then given in Chapter 3. Chapter 4 contains a high-level description of the techniques used in this thesis. Chapter 5 summarizes the research questions and results of the papers that form the basis for this thesis. Chapter 6 presents the conclusions and future work of this thesis.

1.2 List of papers

This thesis is based on these articles:

• Paper A: Fast Adaptation with Meta-Reinforcement Learning for Trust Modelling in Human-Robot Interaction. Gao, Y., Sibirtseva, E., Castellano, G., Kragic, D. (2019). The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), 2019.

• Paper B: Learning Socially Appropriate Robot Approaching Behavior Toward Groups using Deep Reinforcement Learning. Gao, Y., Yang, F., Frisk, M., Daniel H., Peters, C., Castellano, G. (2019). The 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2019), 2019.

• Paper C: When robot personalisation does not help: Insights from a robot-supported learning study. Gao, Y., Barendregt, W., Obaid, M., Castellano, G. (2018). IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2018).

• Paper D: Investigating Deep Learning Approaches for Human-Robot Proxemics. Gao, Y., Wallkötter, S., Obaid, M., Castellano, G. (2018). IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2018), 2018.

Additionally, during the course of this doctoral project, the following arti- cles were published but are not included in this thesis:

• Effects of posture and embodiment on social distance in human-agent interaction in mixed reality. Chengjie Li, Theofronia Androulakaki, Alex Yuan Gao, Fangkai Yang, Himangshu Saikia, Christopher Peters, Gabriel Skantze. Proceedings of the 18th International Conference on Intelligent Virtual Agents (2017).


• Exploring Users’ Reactions Towards Tangible Implicit Probes for Measuring Human-Robot Engagement. Mohammad Obaid, Yuan Gao, Wolmet Barendregt and Ginevra Castellano. International Conference on Social Robotics, 2017.

• Personalised Human-robot Co-adaptation in Instructional Settings using Reinforcement Learning. Yuan Gao, Wolmet Barendregt, Mohammad Obaid, Ginevra Castellano. IVA Workshop on Persuasive Embodied Agents for Behavior Change: PEACH, 2017.

• Efficient Learning of Socially Aware Robot Approaching Behavior Toward Groups via Meta-Reinforcement Learning. Chengxi Li, Ginevra Castellano, Yuan Gao. IROS 2020 Workshop on Social AI for Human-Robot Interaction of Human-care Service Robots, 2020.

• CongreG8: A Motion Capture Dataset of Human and Robot Approach Behaviors into Small Group Formations. Fangkai Yang, Yuan Gao, Ginevra Castellano and Christopher Peters. IROS 2019 Workshop on Benchmark and Dataset for Probabilistic Prediction of Interactive Human Behavior, 2019.


2. The Dilemma in Social HRI

Tempora mutantur et nos mutamur in illis.
Times change, and we change with them.
— Latin phrase

2.1 Motivation

A society that co-exists with human-like robots, one of the typical futures described by science fiction authors, has long been imagined and praised by futurists. Ethical debates aside, endowing robots with social abilities requires a lot of technological effort. The field of social robotics was originally formed by scientists and engineers pursuing this vision. The scientific community of robotics wants to develop robots that can interact with other socially intelligent agents in a way that fits the robots’ social roles. This includes a wide range of service robots, e.g. tutor robots, chef robots and companion robots. A core set of skills these robots should acquire is social skills, which enable the robots to understand, analyze and react in a socially appropriate manner.

Over the years, researchers have tried to understand the interactive process from different angles, including experience design, sociology, psychology, computer science and even philosophy. Although some robots have attracted public attention, like Nao, Pepper, Furhat and Jibo (see Figure 2.1 for an illustration), social robots are still not very commercially successful due to various issues. As argued by some researchers, the main issue is that social robots are not generally intelligent enough to carry out the required tasks smoothly. A social robot could fail at understanding or completing a task, but what prevents these robots from being widely applied is their limited ability to interact. We, as humans, are mostly able to conduct social interactions in various forms, ranging from animal-like actions to regulated interactions (such as interactions regulated by laws), according to Piotr Sztompka. For robots, it is not as easy. Current technology still struggles to generate basic animal-like behavior. As a consequence, we need to find a better framework to generate and analyze a social robot’s behavior.

Figure 2.1. An illustration of different social robots: from left to right, Nao, Pepper, Furhat and Jibo. The heights of the robots are relative.

Reinforcement learning is an area of ML concerned with how an agent can maximize the reward it receives from an environment. It involves two main components. The first is an environment that can provide reward signals, and the second is an agent that can learn from the environment. As a consequence, research in this area involves problems that belong to these two components. On one hand, one needs to find ways to have as many robots as possible carrying out experiments (e.g. through a robots-in-public-spaces project), and on the other hand, the algorithms implemented on the robots need to be more advanced. Also, when the previous two main problems are solved, we then need to consider ways to collect data. That is to say, we need to find a way to learn a model with a limited amount of data. Whether we can solve these issues efficiently or not gives us a forecast of whether this methodology will be successful or not.


3. Reinforcement Learning in Social HRI

3.1 Overview

Reinforcement learning, which emerged from the ML field, has become an increasingly important tool for robotics tasks, given its flexibility in modeling interactive processes. To grasp its significance in robotics, one should first understand its ability to automatically collect rewards in an environment. In the following text, we provide an overview of how RL is applied in different robotics tasks, one step at a time. We start by introducing general robotics tasks that have RL as a base. Then we move on to social robotic tasks implemented based on RL, including a summary of the main difficulties researchers face while conducting this research. Finally, we end with a short conclusion regarding the issues the field still faces.

3.2 Reinforcement learning in robotics

Since the RL concept emerged from early works in cybernetics, psychology and statistics, it has also played an important role in the artificial intelligence community. Through years of development, a few important books and surveys were contributed to the community [69, 32] in the mid-90s. The development continued while being influenced by other fields of ML. From 2000 to 2016, the ML field experienced another paradigm shift, from statistical ML methods [6] to deep learning methods [22]. This means that researchers started to utilize more neural-network-based function approximation methods in learning. This transformation then influenced the field of RL as well. The first landmark was set by DeepMind [51], who added a convolutional neural network to Q-learning and proposed a deep-learning-based RL algorithm. Soon after that, several other deep-learning-based algorithms were proposed, including Deep Deterministic Policy Gradient (DDPG) [45], Asynchronous Advantage Actor-Critic (A3C) [50] and an all-together version of value optimization methods called Rainbow [27].

After observing RL’s great success in controlling virtual agents in world-renowned examples, like beating humans at Go [67, 68] and StarCraft [75], robotics researchers also accelerated their agenda of testing RL algorithms for different robotics tasks. This is mainly due to the fact that, like control theory, RL provides a framework and set of tools for robotics to design complex and difficult behaviors. In turn, the challenges of robotics problems provide inspiration, influence, and validation for the development of RL. The relationship between the disciplines is as promising as the relationship between physics and mathematics. Although research on using RL for robotics dates back to when RL was first proposed [37], one of the most recent milestones is the end-to-end deep visual policy [43]. Since then, several state-of-the-art continuous control algorithms have been proposed [63, 24]. Their combinations with other areas of ML have also been explored, for example with transfer learning [15] and adversarial training [16].

3.3 Reinforcement learning in social robotics

Social robotic systems are viewed as versatile in that each robot agent in the system can act either with other robots or with humans for different tasks. Though this characteristic dramatically extends the feasibility of these systems to adapt to different environments for different tasks, the search space for automating interactive patterns with humans to complete these tasks is large.

For social robots, this problem is twofold, mainly because social robots first need to establish basic functionalities before they can cooperate with humans to complete complicated tasks. Of these two parts of the problem, the lower part consists of a representation learning issue. That is, to form a robot agent using different types of models, the number of possible model integrations grows exponentially in the number of model types and the ways they can be integrated. Searching such an exponentially large space is computationally intractable at scale. For the top part, when social robot agents are capable of these basic functionalities (e.g. face recognition, emotion recognition and dialog processing), a sequential decision-making process is then required for each robot agent to interact with humans and other robot agents to complete different tasks. Sequential decision-making is not a trivial problem, due to the necessity of evaluating different situations, and it is especially hard in social situations where data is limited. It requires frameworks spanning optimization theory, dynamic programming, game theory, and decentralized control. Despite some successful studies solving both parts of the problem using traditional methods, i.e. learning by demonstration, planning and optimization algorithms [58, 66], the challenge remains, mainly for two reasons. Firstly, traditional methods are inefficient, not only data-inefficient but also computationally inefficient. Secondly, the solutions that traditional methods can find are normally suboptimal. This is because they are not able to make abstractions in increasingly complicated task environments. In ML terms, this is called the lack of the ability of representation learning [5]. Though the suboptimality problem can also be observed in fusing different modalities at scale, it is more apparent when it comes to social interaction decision-making. For example, AI systems constructed using traditional methods across multiple disciplines have proven unable to meet human-level intelligence [9, 67, 76]. A reasonable deduction is that if we continue the path of exploration using traditional methods for social robotics, it is also likely that we cannot reach human-level social intelligence. In spite of multiple other existing frameworks, DRL serves as a multi-potent solution to both parts of this problem. Regarding multi-model integration, recent studies have shown that DRL can be used for model selection and integration in social robots [46] and for single control policy generation for multiple robots [15]. On the top layer, abundant work has also been conducted to show that similar techniques can be used for social robotic systems [46, 54] and robot-like social systems, including living architecture systems [48] and autonomous cars [17].

If we take a more focused look at the RL algorithms implemented on typical social robots like Furhat (https://furhatrobotics.com/) or iCat [73], we notice that researchers focused more on the social scenarios. Over the past two decades, several researchers have sought to implement different algorithms for social scenarios. Surveys such as the one conducted by Akalin et al. [3] showed that RL algorithms have used different social feedback signals as reward functions. Application-wise, social scenarios like rehabilitation [70], eldercare [60, 59, 77], entertainment [34, 1], education [64, 62, 26] and therapy [12, 13, 14] have been considered.

3.4 Reinforcement learning for social human-robot co-adaptation

Reinforcement learning for social human-robot co-adaptation has more issues to consider than the application of reinforcement learning in social robotics in general. One of the main reasons is the change in the human’s state during the interaction: the human is also an intelligent agent that can learn during the interaction. However, just as in the general application of reinforcement learning in social robotics, the robot still needs to learn the preferences of the human, but in a much more sophisticated way.

Few researchers have addressed the problem of applying RL specifically to social human-robot co-adaptation over the last decades. For example, different RL algorithms have been applied to implement adaptive behavior selection in different fields, such as education. For instance, [23] considered a Q-learning-based effective model to determine the verbal and non-verbal behaviors of social robots in an educational game to facilitate effective personalization. In addition, contextual bandit algorithms were applied in [57] to adaptively control the pace of interaction based on user performance and effective feedback. [56] also uses RL to personalize the robot according to each individual’s learning difficulty level. [20] proposes an RL framework to enable robots to select supportive behaviors in game-based learning scenarios in order to maximize task performance. Furthermore, we observe a growing trend of applying different learning-based mechanisms in the HRI domain [30, 31]. Simultaneously, various RL models have been applied to inform robot tutors [61]. To sum up, the results of these studies influenced individuals’ positive attitudes and contributed to improved job performance or learning ability.

While discussing RL for social human-robot co-adaptation, another area of research is worth noting due to its similarity. This domain is co-adaptation through building a cognitive model, which describes how humans create memories or how they regulate emotions in different situations, and later applying it to the behavior generation of social robots [28]. For example, [4] modeled adaptive strategies for sustainable long-term social interactions based on theories from cognitive science. [71] proposed a cognitive architecture called ACT-R/E that enables robots to predict the user’s behavior in a given scenario by knowing the user’s prior knowledge. [40] designed an empathic model for the iCat robot to be able to play chess with children. We found limited research on social robots [39] that partner with students in a social environment during long-term interactions. Furthermore, memory-based models in the HRI literature have not been evaluated to examine the impact of memory on user perceptions during long-term interactions [41, 28, 33]. In addition, as noted in [29], there is still a need for models that are customizable and can be integrated into real-time social environments. Finally, the impact of models that allow robots to generate memory-based behaviors on educational robots has not been fully investigated. In general, collecting data in social human-robot co-adaptation is notoriously hard, for several reasons. The first is that in social scenarios the interaction between human and robot is normally very long; each session normally takes 15 to 60 minutes to obtain meaningful results. Secondly, it is hard to find participants, and even more difficult to find demographically well-balanced participants. Thirdly, due to noise during the interaction sessions, some data may not be usable. Lastly, the collections may not be done using the same standards, preventing datasets from being merged or reused.

3.5 Conclusion

Inspired by general robotics, integrating RL into social robotics has become an emerging research field within robotics. While researchers have studied different reinforcement learning algorithms in various scenarios, there are still many challenges. One of the main issues is the development of the algorithms: current algorithms are not specifically designed to suit the purposes of social robotics applications. Even when people would like to develop and test such algorithms, it is hard to conduct experiments to collect data, which brings out another problem in applying RL to social robotics, namely the lack-of-data problem.


4. Methodology

This chapter provides a background for the works covered in this thesis.

4.1 Machine Behaviour

Machine behavior is an area of study first mentioned in [55]. Quoting the authors, this study is "concerned with the scientific study of intelligent machines, not as engineering artifacts, but as a class of actors with particular behavioral patterns and ecology." Today’s AI-driven agents and robots are becoming more and more prevalent in our society. To study how these agents affect humans in different social scenarios, in this thesis we develop and analyze the behavior of robots systematically. To enable the functionalities of robots in different scenarios, we utilize RL frameworks to develop robot behaviors and analyze the results using psychological experiments.

4.2 Reinforcement Learning

Reinforcement learning is an area of ML concerned with how agents can take actions in different environments to acquire a higher accumulated reward. A machine may acquire behaviors through its own experience. For instance, an RL agent trained to maximize long-term profit can learn peculiar short-term trading strategies based on its past actions and the concomitant feedback from the market. Similarly, product recommendation algorithms make recommendations based on an endless stream of choices made by customers and update their recommendations accordingly.

As one of the popular models for robot control and decision making, RL has been used since the early days of research in social HRI. One of the early works was conducted by Bozinovski et al. [7], who considered the concept of emotion in their learning and behavioral scheme. Later, several researchers in HRI investigated the effect of RL algorithms like Exp3 [40, 21, 2] or Q-learning [47, 72]. With the development of deep learning [38], several methods were proposed to understand different modalities widely used in computer science, for example ResNet [25] for image processing and Transformer-based [74] solutions for text processing. As these modalities are also considered in HRI, the application of these methods is widely observed. One of the pioneering works was conducted by Qureshi in 2017 [54], where a Deep Q-Network [52] was used to learn a mapping from visual input to one of several predefined actions for greeting people.

One way to study machine behavior is to implement and analyze it using RL frameworks. In our works, to achieve fast adaptation, we also utilize human feedback within the framework of RL. If we ask what might be the most common way of learning, learning by interacting in our daily lives is a natural idea. When we humans, as a species, were born into this world, we had no teachers around us, yet we learned to fear, to communicate with others and to write papers. These skills emerged from our daily activities. As a consequence, it is very natural to think of our environment as a great source of information. While playing around in the environment, we learn by taking actions and receiving rewards from it. Now, when we cook or exercise, we are fully aware of what the responses of the environment will be. RL is an area that studies this mechanism in a computational way. Generally speaking, the goal of RL is to find a way of mapping different states to different actions so that we maximize the reward signals. The main mathematical framework we consider in the area of RL is the Markov Decision Process (MDP). In the following sections of this chapter, the robot is considered as the agent in all descriptions of related techniques.

4.3 Markov Decision Process

Markov Decision Process (MDP) is a discrete-time stochastic control process. To understand the process, we may consider a robot in a state $s$ of a discrete state space $S$. The robot can take an action $a$ from the set of all possible actions $A$, resulting in a state $s'$. We can denote this process by a transition function $P_a(s, s')$, meaning the probability of moving from state $s$ to state $s'$ through action $a$. After the robot executes action $a$, resulting in $s'$, it then receives a reward $r$ according to a reward function denoted $R_a(s, s')$.

The goal of RL is to optimize the cumulative reward of the whole process. Despite many potential uses, the problems of MDPs are clear to researchers, since MDPs are based on mathematical formalizations. On one hand, MDP-based methods together with optimization methods such as Gradient Partially Observable Markov Decision Process (GPOMDP) methods, projection methods, or natural gradient methods are state of the art in robot trajectory learning. On the other hand, as data collected from a robot differs from other types of data, it has been pointed out that MDP-based methods suffer from several so-called "curses" [36], including:

• Curse of Dimensionality

• Curse of Real-World Samples

• Curse of Under-Modeling and Model Uncertainty

• Curse of Goal Specification

These four curses will be explained in detail in later sections.
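Before turning to those curses, the notation above can be made concrete with a minimal sketch. The following Python snippet is purely illustrative (a hypothetical three-state, two-action MDP, not an example from the thesis): it represents the transition function $P_a(s, s')$ and reward function $R_a(s, s')$ as tables and simulates one episode under a random policy, accumulating reward.

```python
import random

# Hypothetical 3-state, 2-action MDP: P[a][s][s'] and R[a][s][s'] are the
# transition probabilities and rewards described in the text above.
S = [0, 1, 2]
A = [0, 1]
P = {
    0: {0: {0: 0.8, 1: 0.2, 2: 0.0}, 1: {0: 0.0, 1: 0.9, 2: 0.1}, 2: {0: 0.0, 1: 0.0, 2: 1.0}},
    1: {0: {0: 0.0, 1: 1.0, 2: 0.0}, 1: {0: 0.0, 1: 0.0, 2: 1.0}, 2: {0: 0.0, 1: 0.0, 2: 1.0}},
}
# Reward 1 whenever the robot lands in state 2, otherwise 0.
R = {a: {s: {s2: (1.0 if s2 == 2 else 0.0) for s2 in S} for s in S} for a in A}

def step(s, a):
    """Sample s' ~ P_a(s, .) and return (s', R_a(s, s'))."""
    probs = P[a][s]
    s_next = random.choices(S, weights=[probs[s2] for s2 in S])[0]
    return s_next, R[a][s][s_next]

# One episode under a uniformly random policy, accumulating reward.
s, total = 0, 0.0
for t in range(10):
    a = random.choice(A)
    s, r = step(s, a)
    total += r
print("cumulative reward:", total)
```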


4.3.1 Partially Observable Markov Decision Process

A partially observable Markov decision process (POMDP) is a generalization of an MDP. A POMDP models an agent decision process in which it is assumed that, though the system dynamics are determined by an MDP, the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations, observation probabilities and the underlying MDP. The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications using POMDPs include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the operations research community and was later taken over by the artificial intelligence and automated planning communities.

An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.

More precisely, the POMDP builds on a discrete-time stochastic control process. To understand it mathematically, we may consider a robot in a state $s_t$ of a discrete space $S$, where $t$ is the iteration number. The robot can take an action $a$ from the set of all possible actions $A$, resulting in a state $s_{t+1}$. We can denote this process by a transition function $T_a(s_t, s_{t+1})$, meaning the probability of moving from state $s_t$ to state $s_{t+1}$ through action $a$. After the robot executes action $a$ and arrives in $s_{t+1}$, it receives a reward according to a reward function denoted $R_a(s_t, s_{t+1})$. Formally, we define the underlying MDP as a 4-tuple $(S, A, T_\cdot(\cdot,\cdot), R_\cdot(\cdot,\cdot))$. In this tuple,

• $S$ is a finite set of states. It describes all possible states of the space.

• $A$ is a finite set of actions. It describes all possible actions that the robot can take.

• $T_\cdot(\cdot,\cdot)$ is a function that takes three arguments, e.g. $T_{a_t}(s_t, s_{t+1})$ means the probability of transitioning from $s_t$ to $s_{t+1}$ with action $a_t$.

• $R_\cdot(\cdot,\cdot)$ is also a function that takes three arguments, e.g. $R_{a_t}(s_t, s_{t+1})$ means the reward of transitioning from $s_t$ to $s_{t+1}$ with action $a_t$.

The main problem of RL is to find a policy function $\pi(s): s \to a$ that maps every state $s$ to an action $a$ so that the cumulative reward $R = \sum_{t=0}^{\infty} \gamma^t r_t$ is maximized, where $0 \le \gamma \le 1$ is a discount factor. As the discount factor covers both the discounted and the undiscounted situation, it broadens the theory.
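The belief maintenance described at the start of this subsection amounts to a Bayes-filter update. The sketch below is a minimal illustration under the assumption of a finite POMDP with a known transition matrix $T_a(s, s')$ and observation probabilities $O(o \mid s')$; all the numerical values are made up for illustration and are not taken from the thesis.

```python
import numpy as np

def belief_update(belief, a, o, T, O):
    """Bayes-filter update of the belief state described above.

    belief: current distribution over states, shape (|S|,)
    T[a]:   transition matrix with T[a][s, s'] = P(s' | s, a)
    O:      observation matrix with O[s', o] = P(o | s')
    """
    predicted = belief @ T[a]          # predict: sum_s b(s) T_a(s, s')
    updated = predicted * O[:, o]      # correct with the observation likelihood
    return updated / updated.sum()     # normalise to a probability distribution

# Illustrative two-state example with one action and two possible observations.
T = {0: np.array([[0.7, 0.3],
                  [0.2, 0.8]])}
O = np.array([[0.9, 0.1],
              [0.4, 0.6]])             # rows: states, columns: observations
b = np.array([0.5, 0.5])               # uniform initial belief
b = belief_update(b, a=0, o=1, T=T, O=O)
print(b)
```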

4.3.2 Markov Decision Process with Continuous States

Now, if we consider the environment of robots, we need some modifications to the original POMDP. First, we still assume the process to be discrete in time, but we consider a continuous state space $S \subseteq \mathbb{R}^n$ and a continuous action set $A \subseteq \mathbb{R}^m$, where $n$ is the dimension of the states and $m$ is the dimension of the actions. For the initial state $s_0$, we assign a distribution $p(s_0)$, where $s_0 \in S$. At any state $s_t \in S$, we have a continuous policy $\pi(a_t|s_t) = p(a_t|s_t, \theta)$ parametrized by $\theta$. The transition function now also becomes continuous; it corresponds to a probability distribution $T_{a_t}(s_t, s_{t+1}) = p(s_{t+1}|s_t, a_t)$. After this step is completed, the process generates a reward according to a reward function $R_{a_t}(s_t, s_{t+1})$, defined as $R_\cdot(\cdot,\cdot): S \times S \times A \to [0, \infty)$. With these modifications, we can now formalize the continuous MDP. A continuous Markov Decision Process with infinite states is a modified version of the ordinary POMDP. Mathematically, it is defined as a 5-tuple $(P_{init}, S, A, T_\cdot(\cdot,\cdot), R_\cdot(\cdot,\cdot))$ where

• $P_{init}$ is an initial distribution of the states.

• $S \subseteq \mathbb{R}^n$ is an infinite set of states. It describes all possible states of the space.

• $A \subseteq \mathbb{R}^m$ is an infinite set of actions. It describes all possible actions of the agent.

• $T_\cdot(\cdot,\cdot)$ is a function that takes three arguments, e.g. $T_{a_t}(s_t, s_{t+1})$ means the probability of transitioning from $s_t$ to $s_{t+1}$ with action $a_t$.

• $R_\cdot(\cdot,\cdot)$ is a function that takes three arguments, e.g. $R_{a_t}(s_t, s_{t+1})$ means the reward of transitioning from $s_t$ to $s_{t+1}$ with action $a_t$.

With this continuous setting, we have an objective function defined as follows:

$$J(\theta) = E_\tau\Big\{(1-\gamma)\sum_{t=0}^{\infty}\gamma^t R_t \,\Big|\, \theta\Big\} = \int_S \int_A d^\pi(s)\, \pi(a|s)\, R(a, s)\, da\, ds \qquad (4.1)$$

where $d^\pi(s)$ is defined as $(1-\gamma)\sum_{t=0}^{\infty}\gamma^t p(s = s_t)$, $0 \le \gamma \le 1$ refers to the discount factor, and $\pi(a|s)$ is parametrized by $\theta$.

4.3.3 Value Functions

We define two functions to further describe this process. The first is the state-value function, denoted $V^\pi(s)$. This is the expected return of an agent that follows a policy $\pi$ from an initial state $s$; it characterizes the reward of following the policy $\pi$. Mathematically, it is defined as:

$$V^\pi(s) = E_\tau\Big\{\sum_{t=0}^{\infty}\gamma^t r_t \,\Big|\, s_0 = s\Big\} \qquad (4.2)$$

where $\tau$ stands for a sampled trajectory of the agent. The other value function we need to define is the state-action value function, called the Q function $Q^\pi(s, a)$, which is defined as:

$$Q^\pi(s, a) = E_\tau\Big\{\sum_{t=0}^{\infty}\gamma^t r_t \,\Big|\, s_0 = s, a_0 = a\Big\} \qquad (4.3)$$

The state-value function only depends on the first state of the agent. After that, the system is governed by the policy $\pi$.
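To connect the definition of $V^\pi$ in Equation (4.2) to something executable, the sketch below estimates it by simple Monte Carlo rollouts: sample trajectories under $\pi$, accumulate discounted rewards, and average. It is an illustrative helper only, assuming access to an environment step function such as the toy MDP sketched earlier in this chapter; it is not code from the included papers.

```python
import random

def mc_state_value(env_step, policy, s0, gamma=0.95, horizon=50, episodes=500):
    """Monte Carlo estimate of V^pi(s0) = E[ sum_t gamma^t r_t | s_0 = s0 ].

    env_step(s, a) -> (s_next, r) samples the MDP dynamics;
    policy(s) -> a is the (possibly stochastic) policy pi.
    """
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = env_step(s, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / episodes

# Example usage with the toy MDP from Section 4.3 (uniformly random policy):
# value = mc_state_value(step, lambda s: random.choice(A), s0=0)
```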

4.3.4 Policy Gradient Methods

The previously described value functions evaluate a policy through a state value for each state. However, there are also episodic algorithms that evaluate a policy directly in terms of its parameters. In the following text, we introduce the policy gradient method for robotics. RL is probably the most general framework in which such robot learning problems can be phrased. Despite the fact that many RL algorithms fail to scale to robots with many degrees of freedom, policy gradient methods are one of the exceptions. There are several advantages to policy gradient algorithms. According to Jan Peters' paper on policy gradient methods [53], firstly, the policy representation can be chosen to be meaningful. Secondly, the parameters can incorporate previous domain knowledge. Thirdly, policy gradient algorithms have a rather strong theoretical underpinning, and additionally, they can be used in a model-free fashion. All these advantages ensure that, with only a few parameters, robots can learn a decent policy for a certain task. Mathematically, the policy gradient algorithm tries to optimize the policy parameters $\theta \in \mathbb{R}^n$ so that the expected return

$$J(\theta) = E\Big\{\sum_{k=0}^{H} a_k r_k\Big\} \qquad (4.4)$$

is maximized, where $a_k$ is a weighting factor. It can be set to $\gamma^k$ in the discounted case or $\frac{1}{H}$ in the average-reward case, and $r_k$ is the reward received at step $k$. Gradient ascent is normally chosen as the optimization method, since for each iteration we would like to make only small changes to the robot system:

$$\theta_{h+1} = \theta_h + \alpha_h \nabla_\theta J\big|_{\theta=\theta_h} \qquad (4.5)$$

where $\alpha_h$ is the learning rate for the current update step $h$.
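As a concrete, deliberately simplified instance of the update in Equation (4.5), the sketch below performs REINFORCE-style gradient ascent for a softmax policy on a small multi-armed bandit. The reward means, learning rate and random seed are hypothetical, and this is not the algorithm used in the included papers; it only illustrates the mechanics of estimating the policy gradient from sampled rewards.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical expected reward per action
theta = np.zeros(3)                      # policy parameters theta
alpha = 0.1                              # learning rate alpha_h in Eq. (4.5)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for h in range(2000):
    pi = softmax(theta)                  # pi(a | theta)
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 0.1)   # sampled reward r_k
    grad_log = -pi                       # d/d theta of log pi(a | theta)
    grad_log[a] += 1.0
    theta += alpha * r * grad_log        # theta_{h+1} = theta_h + alpha * grad J estimate

print("learned action probabilities:", softmax(theta))
```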

4.3.5 Classification of the Regarded RL Problems

Much of the literature has pointed out the complexity of the RL problem in the area of robotics [36, 53]. As previously stated, especially in Kober's paper [36], four different aspects are mentioned. He considers these four aspects as four curses of applying RL to robotics: the curse of high dimensionality, the curse of real-world samples, the curse of under-modeling and model uncertainty, and the curse of goal specification.

High dimensionality is the first characteristic considered in the area of robotics. Robot systems normally have many degrees of freedom (DOF), especially modern anthropomorphic robots. For example, the Baxter robot has two arms, and each arm has 7 DOF, including three pitch degrees and four roll degrees. This continuity makes traditional RL fail, as many traditional methods are based on the discretization of each DOF. If we discretize $n$ DOFs and discretize each DOF into $m$ states, the total number of states for the system is $m^n$, which is inapplicable in most cases.

The need for real samples is another curse of applying RL to robotics. Robot systems inherently interact with physical systems in the real world. During a test in the environment, the robot hardware may experience wear and tear; as a consequence, in many cases this is an expensive process. Not only is failure costly, but the testing process also requires some kind of supervision from a human. For example, in one of the studies conducted by [78], Baxter was used to reach a certain position as fast as possible. In the optimization process, it got stuck at a position that is hard to set. If the optimization process happens in a more complex dynamic environment (e.g. a helicopter robot), a supervision process involving the cooperation of several humans is needed. For all these reasons, generating real-world samples requires various resources and is an expensive process.

Under-modeling and model uncertainty is the next problem for robot systems. To reduce the cost of real-world samples, people build accurate simulators to accelerate the learning process. Unfortunately, building such models involves a lot of engineering work, which is also expensive. For a small robot system, a simulator can improve the learning process to some extent. However, if we use a simulator to simulate a complex system, small perturbations can cause the learned system to diverge from the real system. Moreover, it is much harder if we want to simulate an HRI scenario.

Last but not least, goal specification means specifying the reward function for the robot system. In an RL algorithm, the policy optimization process depends on observing different rewards from two different policies. If the same reward is always received, there is no way of telling which policy is better. In practice, it is surprisingly difficult to specify the reward function of the system for a certain task.

These four curses are notorious when people try to apply RL algorithms to robotics. Here we only discuss the basic ideas behind these problems; researchers studying the application of RL in robotics have explored them more thoroughly than discussed here. Interested readers are referred to the survey "Reinforcement learning in robotics: A survey" [36].


4.4 General Methods to Tackle Data Inefficiency in HRI

In the field of HRI, the problem of data efficiency is obvious. Unlike in general robotics problems, collecting data is much more difficult in the area of HRI. As a consequence, the curse of dimensionality is more easily observed. For example, in the field of HRI, researchers normally choose to recruit no more than 100 participants when evaluating an interaction. Since an interactive process is really complicated in its nature, whether it is a within-group or between-group study, the explanation is inevitably bounded by the sampling randomness.

Also, even if the number of participants is enough for testing different hypotheses, the information researchers can obtain is descriptive (e.g. whether a hypothesis is supported by the data or not). In this case, the study may help with understanding some phenomena during an interaction, but it is not enough to direct the development of algorithms. To implement a more natural interaction process between robots and humans, in fact, much more data is needed.

Furthermore, even if the number of participants is large enough, a hypothesis test can only determine the ranking of two algorithms' performance in a controlled social environment. In other words, it only determines which algorithm is better, and it is in fact not very helpful in ensuring the generation of a smooth interactive process. Moreover, the algorithms change over time, and social scenarios differ a lot in the wild. The information provided by studies in controlled scenarios does not tell us much about how to make the interaction smoother in the wild.

To solve the problems mentioned above, the methods available are limited. In the most ideal case, one could research approaches that provide more interactive data, meaning establishing infrastructures that enable massive iterative testing. This idea is similar to how smartphone designs are tested. Instead of analyzing theoretically what might be the best interactive patterns, the first step is to make the interactions available to the general public. After being able to collect data from the general public, the interactive patterns could be developed and learned over time.

Besides the ideas mentioned in the previous paragraph, there are methods one can consider from an algorithmic point of view. Two specific methods are studied in this thesis: one is learning from a simulated environment, and the other is using a meta-learning model. In the following sections, we introduce the basic ideas of these two methods.

4.4.1 Simulation

Using simulation in developing robotics applications has become standard practice. Many works in general robotics have shown that it is a viable way to generate a robot's behavior. However, using simulation in social HRI is a different story, mainly for two reasons. The first reason is that it is hard to find a human model to provide a reward function. That is to say, to make a robot generate appropriate behavior using human feedback, we first need a human model to provide feedback signals. This is not a trivial task, as, on one hand, we lack the related research results in computational psychology, and, on the other hand, we do not have enough data to learn such a model either. Secondly, even if we have such a model to provide feedback, whether a robot can generate appropriate behavior is questionable. The dynamics of generating behavior also depend on the dynamics of the RL algorithms. Currently, we do not know enough about how certain behaviors could be generated even if we know what feedback can be provided to the system.

4.4.2 Meta Learning

Meta-learning is an exciting research trend in ML that addresses the question of learning how to learn. The traditional model of ML research is to acquire a large dataset for a particular task and then use that dataset to train a model from scratch. Obviously, this is far worse than how humans use past experience to learn quickly from just a small number of samples.

The interest in meta-learning first arose from the difficulties that the ML community encounters with the few-shot learning problem. We know that neural networks perform quite well on most tasks, such as computer vision and natural language processing, and one of the most important factors is that these domains have easy access to large amounts of data; well-presented data is key to driving complex models like neural networks to extract patterns from the data. However, in some tasks, some classes have only a fairly small amount of data (few shots), so the problem of few-shot learning can be formulated as how to train a model to classify a sample of a class well after having seen only a very small number of samples of that class. To improve the performance of few-shot learning, we need to make the model learn the main features that distinguish the class, rather than class-independent features that are not useful for classifying the sample.

When it comes to learning key features, we naturally think of extracting them using some sort of compressed representation. One of the more common approaches in few-shot learning is to learn an embedding, which maps the original input space into a low-dimensional space. When new samples are given, we can calculate the Euclidean distance between the samples and the centers of the class clusters in the embedding space to determine which category the samples belong to. The main goal of model learning is to identify those features that best distinguish the samples of a category, and to learn how to compress and transform the input into such features.
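A minimal sketch of the embedding-plus-Euclidean-distance idea described above is given below (in the spirit of prototypical networks). The embedding function here is a fixed linear stand-in; in practice it would be a learned neural network, and neither the function nor the support data corresponds to anything used in the thesis.

```python
import numpy as np

def embed(x):
    # Stand-in embedding into a 2-dimensional space; in practice this
    # would be a learned neural network.
    return np.asarray(x) @ np.array([[0.5, 0.1], [0.1, 0.5], [0.2, 0.2]])

def classify(query, support_sets):
    """Assign the query to the class whose prototype (cluster centre in the
    embedding space) is closest in Euclidean distance."""
    prototypes = {c: embed(xs).mean(axis=0) for c, xs in support_sets.items()}
    q = embed(query)
    return min(prototypes, key=lambda c: np.linalg.norm(q - prototypes[c]))

# Few-shot example: two classes with three support samples each.
support = {
    "A": [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [1.1, 0.0, 0.1]],
    "B": [[0.0, 1.0, 0.8], [0.1, 0.9, 0.9], [0.0, 1.1, 0.7]],
}
print(classify([0.95, 0.05, 0.2], support))   # -> "A"
```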

A central idea that we can summarize from this methodology is that, instead of focusing on learning the distribution of a single category over the sample space and finally trying to classify it, we should focus on the distribution of each category over the entire task space. This theoretically solves a number of problems: the problem of few-shot learning without a complete sample distribution, and the problem of how quickly the model adapts to a new task when the task itself changes. The latter is the main concern of meta-learning.

4.5 Machine Behaviour Development

Designing and generating machine behaviour is a very complex problem, and in this Ph.D. thesis we mainly consider how to generate social behavior using RL. Modern RL theory divides RL into two main categories, namely RL in discrete and in continuous observation spaces. In social robotics, we choose which RL algorithm to use based on the social robotics task. In fact, both types of RL algorithms are used in different social robotics tasks for different purposes.

A discrete RL algorithm is generally used when the robot needs to make higher-level decisions, whereas a continuous RL algorithm is used when the robot needs to make low-level decisions. For example, in Paper A we used a discrete RL algorithm, while in Paper B we used a continuous RL algorithm. In Paper A, the problem is abstracted into a Multi-Armed Bandit (MAB) problem, where each action has an explicit concept. In Paper B, however, what needs to be learned are the underlying motor skills, for which we do not have an explicit concept.

For a discrete RL algorithm, we generally have several options depending on the task model. If we model the task as a MAB problem, then we have the option of using a statistical algorithm like Exp3 or UCB. If we model the problem as a classical RL problem, then we can consider using Q-learning, SARSA, or their deep learning versions such as DQN. Continuous RL algorithms, however, can solve more general problems, and this thesis contains a paper that utilizes continuous-space RL algorithms such as PPO. Many other continuous RL algorithms can also be used, such as SAC or DDPG. For a discrete RL algorithm, we consider using discrete numerical values as the feedback signals, whereas for a continuous RL algorithm we use a potential field equation as the reward function. This is caused by the differences between the algorithms.
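Since the exact potential-field reward used in Paper B is not reproduced here, the snippet below is only an illustrative sketch of such a reward for group approaching: it is highest when the robot is at a (hypothetical) socially appropriate distance from the group centre and falls off smoothly on either side. The target distance and scale are made-up values, not those used in the thesis.

```python
import numpy as np

def potential_field_reward(robot_xy, group_centre_xy, target_dist=1.2, scale=0.5):
    """Illustrative potential-field-style reward: peaks at a hypothetical
    socially appropriate distance from the group centre and decreases
    smoothly as the robot deviates from it."""
    d = np.linalg.norm(np.asarray(robot_xy) - np.asarray(group_centre_xy))
    return float(np.exp(-((d - target_dist) ** 2) / (2 * scale ** 2)))

print(potential_field_reward([0.0, 0.0], [1.0, 0.5]))
```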

The development of machine behaviour based on the RL framework follows the process described above. First, we must understand the task requirements of a social scenario; then we clarify which RL algorithm to use according to those requirements; and finally, we must choose a reasonable reward function according to the algorithm and the task conditions.


4.6 Machine Behaviour Analysis

Analysis of the generated behavior under the framework of RL received a lot of inspirations from other areas. On one hand, when it comes to measuring human states, methods used in machine behavior analysis are similar to some other experimental sciences and are generally based on inferential statistics.

On the other hand, since the behaviors are generated using RL, computational metrics are also widely used.

When it comes to computational evaluations, the metrics normally come from the area of ML. For example, they could be the probability distribution of the output in the case of a prediction, or the accumulated reward when we want to know how much reward the robot receives. However, to evaluate the algorithms with human data, since the data we can collect is limited, what we can do is conduct tests similar to those used in experimental psychology. For example, a t-test is used to check for differences between two normally distributed random variables. ANOVA provides a method for statistically testing whether two or more population means are equal, thus extending the t-test beyond two means. Rank-based tests are used when the assumptions of these tests are not met.
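A minimal sketch of these tests using SciPy is given below; the score arrays are placeholder data, not results from the included studies.

```python
# Sketch of the statistical tests mentioned above, using SciPy. The arrays are
# placeholder measurements (e.g. questionnaire scores per experimental condition).
from scipy import stats

condition_a = [4.1, 3.8, 4.5, 4.0, 3.9]
condition_b = [3.2, 3.5, 3.0, 3.6, 3.3]
condition_c = [4.4, 4.6, 4.2, 4.8, 4.5]

# Two normally distributed groups: independent-samples t-test
t, p = stats.ttest_ind(condition_a, condition_b)

# More than two groups: one-way ANOVA
f, p_anova = stats.f_oneway(condition_a, condition_b, condition_c)

# Assumptions not met (e.g. non-normal data): rank-based alternatives
u, p_rank = stats.mannwhitneyu(condition_a, condition_b)          # two groups
h, p_kw = stats.kruskal(condition_a, condition_b, condition_c)    # three or more groups

print(p, p_anova, p_rank, p_kw)
```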


5. An Overview of Included Publications

• Paper A: Fast Adaptation with Meta-Reinforcement Learning for Trust Modelling in Human-Robot Interaction.

Gao, Y., Sibirtseva, E., Castellano, G. and Kragic, D.

The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), 2019.

Research Questions: 1. Could meta-learning be used to model trust in human-robot interaction? 2. Can meta-learning be used to learn how to adapt a robot's behaviour to its users in real-time situated interaction with a limited amount of data?

Summary of contribution:

In socially assistive robotics, an important research area is the development of adaptation techniques and their effect on human-robot interaction. We present a meta-learning based policy gradient method for addressing the problem of adaptation in human-robot interaction and also investigate its role as a mechanism for trust modelling.

We built an escape room scenario to conduct a between-subjects study, in which participants interacted with a Pepper robot. The escape room was created in augmented reality and participants were required to wear a mixed reality headset (HoloLens) to see the walls of the virtual maze, triggers, keys, and the exit door.

To compare the effects of the algorithms, two groups of participants were asked to complete the scenario. For the control group, a statistical MAB algorithm, Exp3, was used. The experimental group interacted under our proposed model, a policy gradient based solution to the MAB problem combined with meta-learning. In contrast to the control group, an auxiliary environment was used in the experimental group as a pre-training method, where the human feedback for each MAB action is modelled as a Gaussian distribution. This helps the algorithm to acquire prior knowledge before the interaction and makes the real-time interaction data-efficient.
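The following is a rough sketch of the idea behind the experimental condition, not the implementation used in Paper A: a softmax policy over the bandit arms is updated with a REINFORCE-style rule and pre-trained in an auxiliary environment where the feedback for each arm is drawn from an assumed Gaussian distribution.

```python
# Illustrative sketch of a policy-gradient bandit with Gaussian pre-training
# (the number of arms, feedback means and learning rates are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(theta, feedback, lr=0.05):
    probs = softmax(theta)
    arm = rng.choice(len(theta), p=probs)
    r = feedback(arm)
    grad = -probs
    grad[arm] += 1.0                 # d log pi(arm) / d theta for a softmax policy
    return theta + lr * r * grad, arm, r

n_arms = 4
theta = np.zeros(n_arms)

# Pre-training: simulated human feedback, one Gaussian per arm (assumed parameters)
sim_means = np.array([0.2, 0.6, 0.4, 0.8])
for _ in range(2000):
    theta, _, _ = reinforce_step(theta, lambda a: rng.normal(sim_means[a], 0.3))

# The real interaction would then continue from this prior, e.g.:
# theta, arm, r = reinforce_step(theta, collect_human_feedback)  # hypothetical function
print(softmax(theta))
```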


By comparing the effects of the two conditions, we aimed to test whether the overall perceived bi-directional trust is affected by different adaptation algorithms. We proposed to define bi-directional trust as how trustworthy the participant perceives the robot to be and how much, in their opinion, the robot trusts them in return. Moreover, we hypothesized that the dynamics of how the bi-directional trust changes throughout the interaction sessions vary between the two conditions.

Our results showed that the algorithm not only adopted a higher learning rate after the meta-learning process, but also increased the participants' perception of how trustworthy the robot considered them to be.

• Paper B: Learning Socially Appropriate Robot Approaching Behavior Toward Groups using Deep Reinforcement Learning.

Gao, Y., Yang, F., Frisk, M., Hernandez, D., Peters, C. and Castellano, G.

The 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2019), 2019.

Research Questions:

1. Can a reinforcement learning pipeline be developed in order to generate socially appropriate robot approaching behavior towards groups of people? 2. Can a learning scheme that acquires a prior model of robot approaching behaviour in simulation be applied to real-world interaction?

Summary of contribution:

Deep reinforcement learning has recently been widely applied in robotics to study tasks such as locomotion and grasping, but its application to social human-robot interaction (HRI) remains a challenge. In this paper, we present a deep learning scheme that acquires a prior model of robot approaching behavior in simulation and applies it to real-world interaction with a physical robot approaching groups of humans. The scheme, which we refer to as Staged Social Behavior Learning (SSBL), considers different stages of learning in social scenarios.

To demonstrate our idea, we implemented a robot approaching behavior task based on this scheme. We designed a reward function combining concepts from Hall's proxemics theory to enable the robot agent to learn a dynamical model which takes social norms into account. Specifically, we used a deep reinforcement learning algorithm called Proximal Policy Optimization (PPO) to learn an appropriate behavior for the robot agent.
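To give a flavour of such a reward, the sketch below combines Hall's proxemic zone boundaries with a simple potential-field-like attraction towards an assumed target distance; the exact reward used in Paper B is not reproduced here, and the penalty weights and target distance are illustrative assumptions.

```python
# Illustrative proxemics-inspired reward shaping (assumed values, not Paper B's reward).
import numpy as np

INTIMATE, SOCIAL = 0.45, 3.6   # boundaries of Hall's intimate and social zones, in metres
TARGET = 1.0                   # assumed desired approach distance

def approach_reward(robot_pos, group_centre):
    d = float(np.linalg.norm(np.asarray(robot_pos) - np.asarray(group_centre)))
    if d < INTIMATE:
        return -10.0           # strongly penalise intruding on the group's intimate space
    if d > SOCIAL:
        return -2.0            # too far away to engage the group at all
    return -abs(d - TARGET)    # potential-field-like attraction towards the target distance

print(approach_reward([0.0, 0.0], [0.0, 2.5]))   # within range, not yet at target -> -1.5
print(approach_reward([0.0, 0.0], [0.0, 0.3]))   # too close                        -> -10.0
```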


We conducted computational experiments to evaluate different configurations of our model. In order to test whether our model performs in a socially appropriate manner, we also compared the robot approaching behavior learned by our model with the one generated by a social force model [35]. Additionally, two perceptual studies were conducted using videos and real robots.

We found that models with simple state inputs outperform models with video input. When the input is a video, a spatial auto-encoder variant outperforms the vanilla convolutional auto-encoder on this task. Moreover, results from the perceptual studies show that our model can generate more socially appropriate approaching behavior than the social force model.
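One plausible reading of such a spatial auto-encoder variant is a spatial softmax bottleneck, which converts each feature map into the expected image coordinates of its activation; the sketch below illustrates this operation only, and its details are assumptions rather than the architecture used in the paper.

```python
# Sketch of a spatial softmax layer (assumed reading of the "spatial auto-encoder
# variant"): each feature map becomes the expected (x, y) position of its activation,
# giving a compact set of feature points instead of a flat dense code.
import torch
import torch.nn.functional as F

def spatial_softmax(features):
    # features: (batch, channels, height, width)
    b, c, h, w = features.shape
    attn = F.softmax(features.view(b, c, h * w), dim=-1).view(b, c, h, w)
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w)
    expected_x = (attn * xs).sum(dim=(2, 3))
    expected_y = (attn * ys).sum(dim=(2, 3))
    return torch.stack([expected_x, expected_y], dim=-1)   # (batch, channels, 2)

points = spatial_softmax(torch.rand(2, 16, 32, 32))
print(points.shape)   # -> torch.Size([2, 16, 2])
```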

• Paper C: When robot personalisation does not help: Insights from a robot-supported learning study.

Gao, Y., Barendregt, W., Obaid, M. and Castellano, G.

The 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2018), 2018.

Research Questions: 1. Does a robot's behaviour personalisation improve users' overall task performance in a tutoring scenario? 2. Is a personalised robot more effective than a robot that adapts to its user without personalising?

Summary of contribution:

In the domain of robotic tutors, personalised tutoring has started to receive scientists' attention, but is still relatively underexplored. Previous work using reinforcement learning (RL) has addressed personalised tutoring from the perspective of affective policy learning. However, little is known about the effects of robot behaviour personalisation on users' task performance. Moreover, it is also unclear if and when personalisation may be more beneficial than a robot that adapts to its users and the context of the interaction without personalising its behaviour.

We build on previous work on affective policy learning that has used RL to learn which supportive behaviours of a robot are preferred by users in an educational scenario. In this work we take a step forward: we develop an RL framework for personalisation that allows a robot to select verbal supportive behaviours to maximize users' task performance and positive reactions to the robot's interventions in a learning scenario where a Pepper robot acting as a tutor helps people learn how to solve grid-based logic puzzles.
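A highly simplified sketch of this kind of personalisation loop is given below; the behaviour labels, reward signal and update rule are assumptions for illustration and do not reproduce the framework used in Paper C.

```python
# Illustrative personalisation loop: the robot keeps a value estimate per verbal
# supportive behaviour and updates it from the user's measured reaction.
import random

behaviours = ["encourage", "hint", "praise", "empathise"]   # illustrative labels
values = {b: 0.0 for b in behaviours}
counts = {b: 0 for b in behaviours}

def select(epsilon=0.2):
    # epsilon-greedy selection over supportive behaviours
    if random.random() < epsilon:
        return random.choice(behaviours)
    return max(values, key=values.get)

def update(behaviour, reward):
    # incremental running-mean update of the behaviour's value estimate
    counts[behaviour] += 1
    values[behaviour] += (reward - values[behaviour]) / counts[behaviour]

# One interaction step: the reward would come from task progress / user reaction
b = select()
reward = 1.0          # placeholder for a measured positive reaction
update(b, reward)
print(values)
```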


A between-subjects user study was conducted to test how the adaptation algorithm affects the process of interaction. In the experiment, each group performed three sessions, namely a pre-interaction session, a human-robot interaction session and a post-interaction session. In both the pre-interaction and post-interaction sessions, participants were asked to solve three nonogram puzzles of similar difficulty on their own. In the human-robot interaction session, the participants were asked to solve three nonogram puzzles with the assistance of a robot.

We found that people are more efficient at solving logic puzzles with, and prefer, a robot that exhibits more varied behaviours compared with a robot that personalises its behaviour by converging on a specific one over time. Our interpretation is that (1) the robot does learn which behaviours maximise users' positive reactions throughout the interaction, but (2) this does not necessarily mean that, after having experienced the whole interaction, users prefer a robot that personalises in this manner.

• Paper D: Investigating Deep Learning Approaches for Human-Robot Proxemics.

Gao, Y., Wallkötter, S., Obaid, M. and Castellano, G.

The 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2018), 2018.

Research Questions: 1. Can deep learning models be used to model comfortable human-robot proxemics?

Summary of contribution:

Human-human interpersonal distance (proxemics) is one of the nonverbal behaviours that shape the communication spaces between individuals. Vast research has been done to understand how humans position themselves within a given communication context. In this paper, we investigate the applicability of deep learning methods to adapt and predict comfortable human-robot proxemics.

To this aim, we developed a system that can adapt to and predict comfortable human-robot proxemics. The system first estimates cumulative distribution functions (CDFs) of discomfort based on the users' hand height. It then uses a neural network architecture to learn the correspondence between the users' discomfort and the distances that the robot has travelled towards the user from three angles. Thereafter, the system generates probabilities for unseen angles by interpolation and forms an area where the robot should stop.

Our exploratory experiment shows that among the three possible core layers of the neural network architecture, i.e. LSTM, GRU and feed-forward (FF) layers, the configuration with an LSTM layer is the best at modelling HRI proxemics data. The final result shows that we are able to produce a distribution estimate from only three angles. We also argue that, because this model is neural network based and end-to-end trainable, it can learn from more data without any further modification. Future work is directed towards improving the quality of the generated CDFs.
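The sketch below illustrates the kind of recurrent model compared here, using an LSTM core followed by a feed-forward head; the framework (PyTorch), input shapes and output interpretation are assumptions, not details from the paper.

```python
# Minimal sketch of an LSTM-core model for proxemics data (assumed shapes and framework):
# a sequence of robot-user distances for one approach angle is mapped to a
# discomfort probability.
import torch
import torch.nn as nn

class ProxemicsLSTM(nn.Module):
    def __init__(self, input_dim=1, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, x):                 # x: (batch, sequence_length, input_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # probability of discomfort at the last step

model = ProxemicsLSTM()
distances = torch.rand(8, 20, 1)          # 8 toy approach sequences of 20 distance readings
print(model(distances).shape)             # -> torch.Size([8, 1])
```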

References
