
APPLIED MACHINE LEARNING FOR MULTI-SENSORY ROBOT PERCEPTION

by Ziling Zhang


© Copyright by Ziling Zhang, 2019
All Rights Reserved


A thesis submitted to the Faculty and the Board of Trustees of the Colorado School of Mines in partial fulfillment of the requirements for the degree of Master of Science (Computer Science).

Golden, Colorado
Date

Signed: Ziling Zhang

Signed: Dr. Hao Zhang
Thesis Advisor

Golden, Colorado
Date

Signed: Dr. Tracy Camp
Professor and Head
Department of Computer Science


ABSTRACT

In recent years, advances in autonomous robotics have begun to transform how we work and live. Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) are helping us deliver goods, survey construction sites, and perform search and rescue alongside first responders. However, designing robots with this level of autonomy is often challenging due to the complexity of real-world environments.

Multi-sensory perception is a critical component in addressing this challenge and developing robust autonomous robotic systems. By combining inputs from multiple sensors, a system can eliminate the single point of failure caused by sensor degradation and generate new insights, making better decisions by integrating information from different sensor modalities.

Recent breakthroughs in machine learning, especially deep learning perception pipelines based on Deep Neural Networks (DNNs), have proven effective in a number of robot perception tasks. However, the significant computational cost of Deep Neural Networks prohibits their deployment on robot systems with limited power budgets and real-time performance requirements. It is important to bridge this gap through optimization so that state-of-the-art machine learning models can be deployed on real-world robot systems.

This work investigates the viability of developing robust multi-sensory robot perception systems enhanced by machine learning models in three chapters. First, I explore the effectiveness of DNN perception pipelines in object detection and semantic segmentation tasks, then experiment with various model optimization techniques to enhance the efficiency of these perception models, achieving real-time performance on a robot system with a limited power budget. Then I elucidate the design and implementation of a thermal sensing robot system that performs sensor fusion of a thermal camera and an RGB-Depth camera to automatically track occupants in a building and measure their forehead temperature, providing fine-grain information for better decision making in an intelligent Air Conditioning (AC) system.


Finally, I explore camera pose estimation using rectangular-to-spherical image matching, enabling a robot to quickly capture a scene with a spherical camera and allowing other robots to localize themselves within the scene by matching rectangular sensor images against the spherical image.


TABLE OF CONTENTS

ABSTRACT

LIST OF FIGURES

LIST OF TABLES

LIST OF ABBREVIATIONS

ACKNOWLEDGMENTS

CHAPTER 1 INTRODUCTION
  1.1 Background
    1.1.1 Sensor Fusion
    1.1.2 Machine Learning Perception Models
    1.1.3 Visual Pose Estimation
  1.2 Challenges
  1.3 Objectives
  1.4 Motivation
  1.5 Main Contributions
  1.6 Guide To Thesis

CHAPTER 2 OPTIMIZING DEEP NEURAL NETWORK MODEL FOR ROBOT PERCEPTION
  2.1 Introduction
  2.2 Related Work
  2.3 Applying Object Detection DNN to Robot Collaborative Perception
  2.4 Applying Semantic Segmentation DNN for Robot Industrial Inspection
    2.4.1 The Semantic Segmentation DNN Approach
    2.4.2 Transfer Learning
    2.4.3 DeepLab v3+ Transfer Learning Experiment
    2.4.4 Accuracy Results
  2.5 DNN Deployment Optimizations
    2.5.1 Parallel Computing Software Frameworks
    2.5.2 Model Trimming with TensorFlow graph transforms toolkit
    2.5.3 Asynchronous Input Pipeline
    2.5.4 Application-Specific Integrated Circuits
    2.5.5 Model Optimization Experiment
  2.6 Conclusion

CHAPTER 3 SENSOR FUSION OF THERMAL CAMERA AND RGB-D CAMERA FOR AUTOMATIC HUMAN FOREHEAD TEMPERATURE MEASUREMENT ON A ROBOT AGENT
  3.1 Introduction
  3.2 Related Work
  3.3 Hardware Structure
    3.3.1 Turtlebot 2 platform
    3.3.2 RGB-D Camera Sensor
    3.3.3 Thermal Camera Sensor
    3.3.4 Temperature, Humidity, Airspeed, and Noise Sensor Suite
  3.4 Software Structure
    3.4.1 Robot Operating System (ROS)
    3.4.2 Thermal Camera Node
    3.4.3 RGB-D Camera and Kinect Skeleton Detection Node
    3.4.4 Sensor Fusion Node
  3.5 Experimental Results
  3.6 Privacy Concerns and Ethical Implications
  3.7 Summary

CHAPTER 4 POSE ESTIMATION USING RECTANGULAR TO SPHERICAL IMAGE MATCHING
  4.1 Introduction
  4.2 Related Works
  4.3 Approach
  4.4 Spherical Image Pre-processing
  4.5 Feature Extractors
    4.5.1 HOG Feature
    4.5.2 LBP Feature
    4.5.3 Feature Concatenation
  4.6 Experiment
  4.7 Conclusion

CHAPTER 5 CONCLUSION
  5.1 Key Contributions

REFERENCES CITED


LIST OF FIGURES

Figure 2.1  YOLOv3+ object detection running on a recorded video in Arvada, Colorado. The model is running real-time (22.4 fps) on an Nvidia GTX 1080 GPU, and detected trucks, cars, and a stop sign from the video stream recorded from a front-mounted camera on a vehicle

Figure 2.2  YOLOv3+ object detection running in an indoor environment, detecting furniture and people. The camera is mounted on a Jackal UGV from Clearpath Robotics

Figure 2.3  Qualitative result of transfer learning on the validation set of the magnetic tiles defect dataset

Figure 2.4  The compute graph of the original DeepLab V3+

Figure 2.5  The optimized compute graph of DeepLab V3+. Operations are grouped together into TRTEngineOps, constants with similar values are grouped together. The node count in the compute graph reduced to 26.

Figure 3.1  A traditional thermostat can only measure the temperature at a single point in a room

Figure 3.2  An IoRT system equipped with multiple sensors to measure the fine-grain data of the building and its occupants

Figure 3.3  The Microsoft Kinect 1st generation RGB-D Camera and the Lepton FLIR 3.5 Thermal Camera Assembly

Figure 3.4  Software Structure

Figure 3.5  Openni Skeleton Tracker

Figure 3.6  Sensor fusion between Lepton 3.5 and Kinect

Figure 3.7  Tracking user's forehead and measuring forehead temperature

Figure 4.1  A Ricoh theta 360 camera

Figure 4.3  Singularity issue of the spherical image

Figure 4.4  A spherical image re-projected into soccer ball patches

Figure 4.5  Hexagons (red) and pentagons (green) FOV drawn on a 4:3 image with 60-degree horizontal FOV

Figure 4.6  HOG feature extracted from a downsampled 160x120 soccer ball patch, the feature vector length is 9576 floats

Figure 4.7  Left: a 640x480 query image from an iPad camera. Right: the matched spherical image patch from the C++ program. The match was successful even with glare and illumination changes.


LIST OF TABLES

Table 2.1  Dataset Splitting

Table 2.2  Measured Speed ups


LIST OF ABBREVIATIONS

Advanced Vector Extensions (AVX)
Air Conditioning (AC)
Amazon Web Services (AWS)
Application Programming Interface (API)
Application-Specific Integrated Circuit (ASIC)
Arithmetic Logic Unit (ALU)
Building Automation System (BAS)
Compute Unified Device Architecture (CUDA)
CUDA Deep Neural Network Library (CuDNN)
Convolutional Neural Network (CNN)
Deep Learning Accelerator (DLA)
Deep Neural Network (DNN)
Field Programmable Gate Array (FPGA)
Field of View (FOV)
Frames per second (fps)
Generative Adversarial Network (GAN)
Graphics Processing Unit (GPU)
Heating, Ventilation, and Air Conditioning (HVAC)
Histogram of Oriented Gradients (HOG)
Inertial Measurement Unit (IMU)
Intel Math Kernel Library (MKL)
Internet of Robotic Things (IoRT)
Light Detection and Ranging (LIDAR)
Local Binary Patterns (LBP)
Neural Compute Stick (NCS)
Open Computer Vision (OpenCV)
Open Computing Language (OpenCL)
Open Neural Network Exchange (ONNX)
Robot Operating System (ROS)
Simultaneous Localization and Mapping (SLAM)
Single Instruction Multiple Data (SIMD)
Structure from Motion (SfM)
Tensor Processing Unit (TPU)
Threading Building Blocks (TBB)
Unified Robot Description Format (URDF)
Unmanned Aerial Vehicle (UAV)
Unmanned Ground Vehicle (UGV)
Video Random Access Memory (VRAM)
mean Intersection over Union (mIoU)

(14)

ACKNOWLEDGMENTS

I would like to thank everyone who has helped me in completing my master’s thesis. First and foremost, I want to thank my advisor, Dr. Hao Zhang, for accepting me into his Human-Centered Robotics Lab and giving me the freedom to explore and play with all the robots in the lab. It has been a fun journey studying under your guidance.

I would also like to thank all fellow students in the Human-Centered Robotics Lab. It is a great family and I really appreciated the opportunity to work and study with you.

And of course, thank you to my wife, Yi Sun, and our cats, Spruce and Miss Evergreen, for your unending support through stressful times.


CHAPTER 1 INTRODUCTION

This Master's thesis focuses on multi-sensory robot perception problems, leveraging recent advances in machine learning algorithms that excel at visual perception accuracy and optimizing them to reach real-time performance on robot systems.

1.1 Background

1.1.1 Sensor Fusion

Modern robot systems, such as UAVs and UGVs, are often equipped with multiple sensors to perform different tasks or to combine sensor inputs and make better decisions. For example, an autonomous vehicle might be equipped with a camera array, Light Detection and Ranging devices (LIDARs), and radars to detect obstacles, roads, and pedestrians. A simultaneous localization and mapping solution on a robot might combine input from an inertial measurement unit (IMU), LIDAR, RGB cameras, and wheel odometry to accurately measure ego-motion and build accurate maps. Multi-sensory perception can improve the robustness of a perception system when a single sensor degrades: for example, in dark places an RGB camera cannot adequately obtain information from the environment while a LIDAR can still function. Multi-sensory perception can also generate new insights where a single sensor cannot: for example, facial recognition sensors leverage structured light to recover depth information and obtain a mesh of the face while the RGB camera recovers the texture of the face, so a more robust recognition pipeline can be developed. Therefore, combining different sensor inputs on a robot is a hot research topic and could lead to innovative solutions that enable more intelligent robots that understand the environment, are aware of human cooperators in human-robot teaming scenarios, and are more helpful to us.


1.1.2 Machine Learning Perception Models

The rise of deep learning in recent years has produced breakthroughs in numerous computer vision and machine perception tasks. In certain tasks, such as image classification, a well-trained Deep Neural Network (DNN) model can reach or even surpass human-level performance. Deploying these robust models on a robot can drastically improve its perception capability, which was deemed impossible just a few years ago. However, these DNNs require significant computational resources, and it is challenging to deploy them on a robot with limited computing and storage resources. On the other hand, robot perception tasks require real-time performance, so a reasonable perception pipeline must have low latency. Therefore, optimizing DNN perception models and deploying them efficiently has tremendous application value and has been researched extensively. Innovations in hardware and software optimization solutions for DNN deployment are developing rapidly.

1.1.3 Visual Pose Estimation

Visual Pose Estimation is a computer vision task that uses visual image data to recover the current camera's pose in the world by referring to historical data. The typical approach is to generate a descriptor, or feature vector, from the query image and attempt to find a match for that feature vector among previous images with known poses. The camera pose of the current image can then be estimated in the world reference frame.
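To make that matching step concrete, below is a minimal numpy sketch of the nearest-neighbour lookup just described. The descriptor function, database arrays, and pose representation are illustrative assumptions rather than part of any specific system.

```python
# Minimal sketch, assuming descriptors are fixed-length float vectors and the
# database stores one descriptor and one known camera pose per reference image.
import numpy as np

def estimate_pose(query_descriptor, db_descriptors, db_poses):
    """db_descriptors: (N, D) array; db_poses: list of N known camera poses."""
    # Euclidean distance between the query and every stored descriptor
    dists = np.linalg.norm(db_descriptors - query_descriptor, axis=1)
    best = int(np.argmin(dists))
    # The query image is assumed to share the pose of its closest reference image
    return db_poses[best], float(dists[best])

# Usage (hypothetical): pose, score = estimate_pose(describe(query_img), descriptors, poses)
```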

1.2 Challenges

• It is difficult to synchronize and calibrate the information from different sensors and generate meaningful insights for the robot.

• DNN models require significant computing resources

• Robot perception tasks require real-time performance

• To deploy DNN models on robots, it is important to make the models run fast and consume minimal resources


1.3 Objectives

The research objectives of this Master's thesis are to explore the design of robot systems capable of multi-sensory perception, to optimize machine learning perception models for real-time performance, and to deploy them on robot systems with limited computing and storage resources. Robotics is a cross-disciplinary field that incorporates the latest research in computer vision, mechanical engineering, and machine learning to build robust systems that address and solve real-world problems. This thesis aims to explore ways of deploying advanced research models to build robust robot multi-sensory perception systems.

1.4 Motivation

The primary driving force behind developing robots has always been automation. DNNs now provide a way to retain the knowledge of human experts with trained eyes. Developing robot solutions with advanced perception capabilities can free us from repetitive human labor, reduce human error, increase productivity, and make our lives easier.

1.5 Main Contributions

• Optimization of DNN perception models

Performed transfer learning to re-purpose a semantic segmentation DNN model for use in industrial inspection robots. Optimized the inference speed of the model by performing lossless and lossy DNN model compression, developing an efficient input data pipeline, and leveraging Application-Specific Integrated Circuits for DNN inference. An optimized object detector model is used in a published co-authored paper on connected vehicles [1].

• Thermal Robot

Designed and implemented a thermal robot that performs sensor fusion between an RGB-D camera and a thermal camera to automatically track and measure a human user's forehead temperature. Developed and open-sourced a Python Robot Operating System [2] wrapper for the FLIR Lepton 3.5 thermal camera sensor [3]. Developers aiming to utilize the Lepton 3.5 camera may use the open-source package [4].

• Visual Pose Estimation with Spherical Camera

Developed a novel visual pose estimation solution for collaborative robot perception using a spherical camera. Implemented an efficient C++ pipeline to search for a match and recover the pose of the rectangular camera, which can be used for robot collaborative perception.

1.6 Guide To Thesis

This Master's thesis is structured as follows. Chapter 1 introduces the background, challenges, and main contributions of the thesis. Chapters 2 to 4 describe the technical details and main findings of the completed work in depth. Chapter 2 explores the effectiveness of Deep Neural Network perception pipelines and techniques to optimize these models to real-time performance on a robot with limited computing resources. Chapter 3 presents the design and implementation of a robot system that performs sensor fusion between a thermal camera and an RGB-Depth camera to automatically obtain building occupants' forehead temperature, enabling fine-grained measurement of occupants' thermal comfort level and informing the intelligent Air Conditioning (AC) system to make better adjustments, keeping occupants comfortable while saving energy. Chapter 4 investigates rectangular-to-spherical image matching for robot visual pose estimation. Chapter 5 concludes the thesis.


CHAPTER 2

OPTIMIZING DEEP NEURAL NETWORK MODEL FOR ROBOT PERCEPTION

2.1 Introduction

Enabled by recent advances in parallel computing hardware, Deep Neural Networks (DNNs) have achieved breakthrough accuracy in numerous computer vision tasks [5]. With a large amount of labeled training data and training using back-propagation, a DNN can learn the rules to effectively extract low-level to high-level features from image data, surpassing human-level performance in image classification [6]. Expanding from the image classifier Deep Neural Network, several classes of perception DNNs were proposed in recent years. The object detector DNN can draw rectangular bounding boxes around objects of interest in an input image. The semantic segmentation DNN can produce a class prediction for every pixel of an input image, generating pixel-accurate masks on objects of interest. With accuracy comparable to a human, these perception DNN pipelines are instrumental in developing autonomous robot systems. The objects detected in the surroundings of the robot provide valuable information for robot path planning and decision making, enabling the robot to interact with the world more intelligently. For example, a search and rescue robot armed with a DNN can segment hazardous conditions in the environment, navigate safely to complete its mission, and detect the location of people that must be rescued. An industrial inspection robot can be equipped with a trained DNN to look for defects in oil pipes, providing early warning at locations that are not easily accessible by human operators.

However, high accuracy comes with high computational cost. It is difficult to deploy these DNNs on a robot agent with a limited power budget, limited in both computing resources and memory capacity. On the other hand, perception tasks often require real-time performance: an autonomous vehicle cruising at high speed must detect pedestrians and obstacles in real time to avoid a crash. With computing resource constraints and latency requirements, efficiently deploying DNNs for robot perception is a challenging engineering problem and a hot research topic.

This work aims first to apply DNN perception models in the development of collaborative robots and industrial inspection robots. After validating that the models work in these applications, I then explore various techniques for optimizing DNN inference on a robot agent, leveraging Application-Specific Integrated Circuits (ASICs) in recent computing hardware to achieve real-time performance within a limited power budget.

The following sections first present the application of two state-of-the-art DNN perception models, YOLOv3 for object detection [7] and DeepLab v3+ [8] for semantic segmentation, then discuss the transfer learning method of re-purposing DNN models to detect and segment different classes specific to the robot industrial inspection domain. Subsequently, a number of optimization techniques, including model trimming, CPU and GPU deep learning acceleration libraries, and Application-Specific Integrated Circuits for DNNs, are discussed in detail. Finally, an experiment on an industrial inspection dataset was performed, and model accuracy and speed were measured.

2.2 Related Work

• Object Detection in collaborative robot perception

To model objects in the vicinity of a robot, in [9] low-level features were computed in a scene to describe the objects' identity, location, and scale. In recent years, a number of DNN models have been proposed to advance the state-of-the-art performance of object detection, including hybrid structures like region proposals [10][11], feature pyramids [12], and Single Shot Detection (SSD) [13]. These models usually aim to achieve higher accuracy by making DNNs deeper. On the other hand, some researchers try to minimize model size for quick inference on edge devices [14][15].


A number of early attempts have been made to use traditional computer vision algorithms in industrial inspection [16][17], where template matching, clustering, and hand-crafted feature extractors are used. Different types of robots, including UGVs [18][19] and UAVs [20], have been designed for industrial inspection.

• Optimizing DNN for mobile deployment

Mobile devices and robots are both considered edge computing devices; they share some computing resource constraints and optimization techniques. Experiments were conducted to apply quantization to each DNN perception model layer and measure how much quantization affects the DNN model accuracy [21]. Different schemes of weight quantization are discussed in [22]. Distillation was proposed to distill the knowledge from a trained large DNN and produce smaller models [23]. Leveraging cloud computing can offload some computing tasks from the edge [24], but due to latency, network stability, security, and compliance requirements, other computing tasks must be performed locally on the edge devices [25][26].

2.3 Applying Object Detection DNN to Robot Collaborative Perception

Object detectors have a lot of potential in robot applications. In autonomous driving, an autonomous vehicle equipped with a real-time object detector can be aware of other vehicles, pedestrians, and traffic signs, and plan its path accordingly; see Figure 2.1. For an indoor delivery robot, being able to detect people and furniture can help with path planning; see Figure 2.2. When a robot needs to perform Simultaneous Localization and Mapping (SLAM) [28], knowing which reference points from the environment are static is important. Because only static references, like buildings or walls, can act as valid data points for mapping, points on highly dynamic objects, like vehicles, are noise that could potentially lead to a distorted map. A DNN object detector can act as a pre-processing filter to remove that noise. It is also helpful for detecting landmarks that help an autonomous robot perform loop closure during SLAM.


Figure 2.1: YOLOv3+ object detection running on a recorded video in Arvada, Colorado. The model is running real-time (22.4 fps) on an Nvidia GTX 1080 GPU, and detected trucks, cars, and a stop sign from the video stream recorded from a front-mounted camera on a vehicle

Figure 2.2: YOLOv3+ object detection running in an indoor environment, detecting furniture and people. The camera is mounted on a Jackal UGV [27] from Clearpath Robotics


A popular object detection DNN is YOLO, which stands for You Only Look Once and was proposed by [7]. The main innovation of this DNN model is replacing the redundant sliding-window approach of previous models with a mechanism consisting of grids, anchor boxes, and non-max suppression. The YOLOv3 DNN can process the full image at once, instead of having to generate region proposals and then scan over the proposals with a sliding window, eliminating redundant computation and achieving a good balance between accuracy and speed. A robust object detector like YOLOv3 is valuable in numerous robot perception tasks.
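To illustrate the last of those three mechanisms, below is a minimal numpy sketch of greedy non-max suppression over candidate boxes and confidence scores. It shows the general idea only and is not YOLOv3's exact implementation.

```python
# Minimal sketch of greedy non-max suppression: keep the highest-scoring box,
# then drop every remaining box that overlaps it too much, and repeat.
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    order = np.argsort(scores)[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # Intersection of the best box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter)
        # Keep only boxes whose overlap with the best box is below the threshold
        order = order[1:][iou < iou_threshold]
    return keep
```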

However, to achieve reasonable performance, the YOLOv3 model still must run on a recent GPU with at least 8 GB of VRAM. On an Nvidia GTX 1080 GPU with 2560 CUDA cores, 8 GB of VRAM, and a power budget of 180 W, YOLOv3 can perform object detection at 22 to 24 frames per second (fps) on a video stream from a web camera. On a computing unit aimed at robot development, the Nvidia AGX Xavier Jetson computing board, which has a 512-CUDA-core GPU and a 30 W power budget, YOLOv3 compiled with the CUDA and CuDNN libraries can only run at 7 fps. This speed is insufficient for real-time robot perception tasks.

In the following TensorRT optimization section, I show that half-precision inference and Application-Specific Integrated Circuits can efficiently deploy the DNN model, cutting the memory requirement in half and providing the speedup needed to reach real-time performance. The optimized YOLOv3 model was used as the pre-processing pipeline of an accepted co-authored paper on connected vehicles [1]. The paper developed a hyper-graph matching method to match and calibrate the representations of objects detected by each vehicle, enabling both vehicles to share their Field of View (FOV) over the air and allowing a vehicle to be aware of objects occluded in its own FOV. I worked on the object detection pre-processing part of this paper. The connected vehicular technology developed could potentially make driving safer.


2.4 Applying Semantic Segmentation DNN for Robot Industrial Inspection

Industrial inspection robots equipped with a segmentation DNN can be trained to detect defects with specific visual patterns. Such robots can reach dangerous places, detect defects, and provide early warnings for maintainers. Developing efficient DNN perception models for this task is instrumental in designing these robot systems. In this project, I applied transfer learning to an advanced semantic segmentation model to re-purpose it for defect detection.

2.4.1 The Semantic Segmentation DNN Approach

The object detector draws rectangular bounding boxes around objects of interest, but some robot applications, like industrial inspection, need finer-grained detection. For example, a UGV operating in an outdoor environment needs fine-grained information, like the exact shape and area of pavement, water bodies, and obstacles, to navigate safely. Robot perception tasks like this call for a different approach: the semantic segmentation DNN.

Semantic segmentation DNNs are a class of DNN built on top of an image classifier DNN. A typical image classifier DNN performs multiple layers of convolution, max-pooling, and activation on the input image, producing a feature vector that describes the input image; a fully connected layer is then appended at the end to generate class predictions for that picture from the feature vector. In the semantic segmentation literature, an image classifier DNN that generates a feature vector from an image is usually called an encoder, or the perception backbone. The semantic segmentation network appends a decoder structure after the feature vector generated by the image classifier, then looks back into each max-pooling layer in the encoder network and traces back to the individual pixels that activate a specific class detection. It then groups those pixels together as a mask for that specific class. The DeepLab v3+ DNN from Google is a versatile semantic segmentation model that achieved 89.0% mean Intersection over Union (mIoU) on the PASCAL VOC 2012 dataset and 82.1% mIoU on the Cityscapes dataset [8]. The latest variant of the DeepLab model achieved an even higher 84.2% mIoU on Cityscapes [29]. This network combines the advantages of two methods: spatial pyramid pooling and an encoder-decoder structure. The former excels at encoding multiscale contextual information, effectively handling objects at different scales, while the latter produces sharper object boundaries through a coarse-to-fine recovery of spatial information.

DeepLab v3+ can use different perception backbones with different deployment platforms in mind. The MobileNet V2 backbone is a lightweight model with about 10 megabytes of weights, intended for mobile devices. The Xception backbone is a powerful model with about 240 megabytes of weights, suitable for deployment in the cloud.

2.4.2 Transfer Learning

Training a deep neural network from scratch for semantic segmentation typically requires a large-scale dataset with more than 10,000 annotated images. For example, the leading semantic segmentation models trained on the open benchmarking datasets PASCAL VOC [30], Cityscapes [31], and ADE20K [32] achieved more than 80% mIoU [8]. These benchmarking datasets consist of 16,408, 3,475, and 22,210 annotated images, respectively.

However, in robotics, high-quality training datasets for object detection and semantic segmentation are difficult to come by. Obtaining such datasets often requires costly human expert labeling. For example, in the Amazon Web Services (AWS) Machine Learning section, there is a crowdsourcing service called SageMaker Ground Truth that employs freelance workers over the internet to help label training images for object detection and semantic segmentation. The cost of labeling one semantic segmentation image is more than $0.80, so it is costly for researchers to obtain a reasonable number of training images. Additionally, some sensitive data cannot be crowdsourced due to security or compliance restrictions. In some cases, the general public does not have sufficient training to label the data; for example, plant species classification or defect detection requires labeling from domain experts, further increasing the cost of procuring a high-quality labeled dataset.


A common approach to address insufficient data in deep neural network model development is transfer learning. In transfer learning, researchers start from a network pre-trained on large-scale open datasets like ImageNet, remove the last few layers (the output layer, the soft-max layer, and the logit layer), and retrain these layers on a new domain with a smaller annotated dataset. The intuition is that after training on a generic large-scale dataset, the deep neural network learns to effectively extract salient features; these low-level to high-level feature extraction filters can then be adapted to analyze images from a new domain. In recent years, successes have been reported in Deep Neural Network transfer learning on medical imaging, where annotated data must be labeled by experts and is limited in size [33].
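As a concrete illustration of this recipe, the sketch below freezes an ImageNet-pre-trained MobileNetV2 encoder in tf.keras and attaches a small, randomly initialized segmentation head for the new defect classes. It is a simplified stand-in for the actual DeepLab v3+ training code; the 512x512 crop size and the six-class head (background plus five defects) are assumptions for illustration.

```python
# Minimal transfer-learning sketch, not the DeepLab v3+ decoder used in this work.
import tensorflow as tf

NUM_CLASSES = 6  # assumed: background + five defect classes

# Start from an encoder pre-trained on ImageNet and drop its classification head.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(512, 512, 3), include_top=False, weights="imagenet")
backbone.trainable = False  # freeze the pre-trained feature extractor at first

# Append a small, randomly initialized head for the new domain.
x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(backbone.output)
logits = tf.keras.layers.Conv2D(NUM_CLASSES, 1)(x)
# Upsample the coarse logits back to input resolution for per-pixel predictions.
outputs = tf.keras.layers.UpSampling2D(size=(32, 32), interpolation="bilinear")(logits)

model = tf.keras.Model(backbone.input, outputs)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(train_dataset, epochs=...)  # fine-tune only the new head on the small dataset
```

After the new head converges, the backbone can optionally be unfrozen and fine-tuned with a small learning rate, which is the usual second stage of transfer learning.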

2.4.3 DeepLab v3+ Transfer Learning Experiment

A magnetic tile defect dataset[34] was selected to test the efficacy of using DeepLab v3+ and transfer learning to segment the defects on the images.

• The magnetic tile surface defect dataset

The magnetic tile surface defect dataset consists of greyscale images of defective tiles and pixel-level mask annotations of the defects. The annotated masks are PNG images, where the intensity of each pixel denotes its semantic segmentation class; for example, intensity 0 represents the background and intensity 1 represents the blowhole class. There are five defect classes in this dataset: blowhole, break, crack, fray, and uneven. Each class has 50 to 120 sample images [34]. To construct a dataset for training DeepLab v3+, the magnetic tile surface defect dataset [34] is split into a training and a validation set as shown in Table 2.1.

Table 2.1: Dataset Splitting

Classes     Training Set    Validation Set
Blowhole    99              16
Break       73              12
Crack       53              8
Fray        50              7
Uneven      89              14

• DeepLab v3+ hyperparameters

This work selected the MobileNetV2 model as the backbone since it is lightweight and suitable for deployment on a robot. The original fully connected layer was removed and replaced with a newly initialized fully connected layer representing the five defect classes. In this defect dataset, the blowhole, break, and crack defects can be small, so the contribution from successfully detected defect pixels might be overwhelmed by the background pixels; left unchecked, a network predicting a blank background could still obtain a good loss value. In light of this, the relative weights of background and foreground pixels were set to 1:10 in the loss function. The dilated convolution strides are 6, 12, and 18, accounting for the size and shape of the defects. The decoder output stride is set to 4. The output stride is set to 16, so for 513x513 input images the feature map size is 32x32, a 16:1 ratio that keeps a good balance between speed and accuracy.
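A sketch of that 1:10 pixel weighting is shown below as a weighted softmax cross-entropy written directly in TensorFlow 1.x. The exact loss wiring inside the DeepLab v3+ codebase differs, and the six-class count (background plus five defects) is an assumption carried over from the sketch above.

```python
# Minimal sketch of a background/foreground weighted segmentation loss.
import tensorflow as tf

def weighted_segmentation_loss(logits, labels, num_classes=6):
    """logits: [batch, H, W, num_classes]; labels: [batch, H, W] integer class ids."""
    # Weight 1.0 for background pixels (class 0), 10.0 for any defect pixel.
    weights = tf.where(tf.equal(labels, 0),
                       tf.ones_like(labels, dtype=tf.float32),
                       10.0 * tf.ones_like(labels, dtype=tf.float32))
    one_hot = tf.one_hot(labels, depth=num_classes)
    per_pixel = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)
    # Normalize by the total weight so the loss scale stays comparable across batches.
    return tf.reduce_sum(per_pixel * weights) / tf.reduce_sum(weights)
```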

2.4.4 Accuracy Results

Quantitative: the model achieved 54.1% mIoU on the magnetic tile dataset.

Qualitative: see Figure 2.3. The model could not differentiate between the blowhole, crack, and break classes, which could explain the low mIoU score. However, the model is good at detecting fray (in blue) and uneven (in purple) defects.

Figure 2.3: Qualitative result of transfer learning on the validation set of the magnetic tiles defect dataset

2.5 DNN Deployment Optimizations

While applying the two DNN models in the previous sections, I found them running slowly on the computing devices available for robots. For example, YOLOv3 and DeepLab v3+ both require more than 6 GB of VRAM on a GPU just to run. The DeepLab v3+ model can use backbones ranging from lightweight to powerful, with different sizes and computation requirements; the multiply-adds of DeepLab v3+ models range from 2.75 billion to 54 billion. Neither model can run faster than 10 fps on the Nvidia Xavier computing board for robot development. So it is important to find ways to optimize these DNN models to run fast and consume minimal computing resources before they can be deployed on robots. In this section, a number of DNN deployment techniques are discussed: from software solutions that parallelize matrix operations, to lossless model compression and trimming techniques, to specialized hardware that fuses DNN convolution, max-pooling, and activation into a single instruction, to lossy model compression techniques such as half-precision (16-bit) and integer quantization of model weights [35].

These techniques are then applied to a DeepLab v3+ model to make it run faster and consume fewer resources. An experiment was conducted to measure the performance gain after each level of optimization.

2.5.1 Parallel Computing Software Frameworks


• Intel MKL and TBB Libraries

The Intel Math Kernel Library (MKL) is able to detect the SIMD instruction sets available on recent Intel processors and leverage the Advanced Vector Extensions (AVX) instruction set to speed up matrix multiplication on multi-core processors. The Intel Threading Building Blocks (TBB) library provides an easy-to-use multi-threading framework that takes advantage of data parallelism in DNN workloads. The TensorFlow framework compiled with MKL and TBB narrows the gap between CPU and GPU inference and is more desirable if the model is small and the workload is sensitive to the data transfer overhead of accelerators.

• Nvidia CUDA and CuDNN Library

The Nvidia CUDA library provides an acceleration framework to utilize the CUDA cores on Nvidia GPUs, and the CuDNN library provides primitives for DNN data types and operations. If such a GPU is available, the tensorflow-gpu package compiled with CUDA and CuDNN performs significantly faster than the CPU implementation. The CUDA and CuDNN libraries are also available on the Nvidia AGX Xavier computing board to leverage its 512-CUDA-core GPU and two Deep Learning Accelerators (DLAs).

2.5.2 Model Trimming with TensorFlow graph transforms toolkit

To deploy a research DNN model, the following techniques can be applied to remove redundancies in the model, reducing its size and increasing inference speed.

• Remove Placeholders

In DNN model development, placeholders are scattered across the model description for flexibility and code maintainability. It is convenient to have these helper functions during initial training and tuning because they make it easy to explore the effect of different hyper-parameters. However, at the deployment stage, these unnecessary structures can be safely removed to speed up inference.


• Freeze Variables

During training, network weights are variables that can be modified during the back-propagation step. During inference, weights can be stored as static constants, which can be placed in faster GPU texture memory for a speedup.

• Fold Constants

Constants with the same value can be saved in a single copy, instead of multiple copies scattered in different layers of the model. By folding constants with the same values to a single copy, model size can be reduced.

The TensorFlow graph transforms toolkit [36] provides a convenient way to scan a model for these optimization opportunities and apply them before deployment.
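A minimal sketch of driving the toolkit from Python on a frozen graph is shown below; the file names and the input/output node names are assumptions for illustration, not the exact names used in this work's export.

```python
# Minimal sketch of trimming a frozen TensorFlow 1.x graph with graph_transforms.
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

with tf.gfile.GFile("frozen_deeplab.pb", "rb") as f:      # hypothetical file name
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

transforms = [
    "strip_unused_nodes",                  # drop placeholders and ops unused at inference
    "remove_nodes(op=Identity)",           # remove pass-through nodes
    "fold_constants(ignore_errors=true)",  # collapse constant subgraphs into single constants
    "fold_batch_norms",                    # fuse batch-norm scaling into preceding convolutions
]
optimized = TransformGraph(graph_def,
                           inputs=["ImageTensor"],            # assumed input node name
                           outputs=["SemanticPredictions"],   # assumed output node name
                           transforms=transforms)

with tf.gfile.GFile("trimmed_deeplab.pb", "wb") as f:
    f.write(optimized.SerializeToString())
```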

2.5.3 Asynchronous Input Pipeline

Accelerator hardware, be it a GPU or a Field Programmable Gate Array (FPGA), introduces additional overhead for transferring data from main memory to the accelerator. A typical way to hide this overhead is to develop an asynchronous input pipeline. In a naive implementation, the current batch of images is loaded into the accelerator, processed, and transferred back to main memory before the next batch is loaded. This creates a pipeline bubble where the accelerator must remain idle waiting for a new batch of images. In an asynchronous input pipeline, while the accelerator is processing the current batch, the next batch is pre-fetched into accelerator memory, so when the accelerator finishes the current batch the next one is readily available and its computation can start right away. In this way the accelerator runs at maximum utilization without having to wait for slow main memory transfers. An asynchronous GPU data prefetch pipeline can be implemented using the TensorFlow data API.

To further reduce the latency of main-memory-to-accelerator-memory transfers, the batch size of the DNN model can be adjusted so that multiple images are processed at once. For example, in a naive implementation, image tensors of size 512x512x3 are transferred from main memory to accelerator memory one by one, and the memory copy instruction must be invoked once per image. In an optimized implementation, a batch of 30 images of size 30x512x512x3 can be transferred from main memory to accelerator memory in one pinned memory transfer, invoking the memory copy only once for 30 images and reducing the average overhead per image at the expense of higher accelerator memory consumption. The pinned-memory batch image transfer can be implemented by adjusting the batch dimension in the compute graph of the model, together with a matching input pipeline that groups images into a single tensor.

A combination of these two optimizations produces an efficient data input pipeline that maximizes accelerator utilization and optimizes the inference speed of the final robot perception pipeline.
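The sketch below combines both ideas with the tf.data API in the TensorFlow 1.x style used in this work's experiments; the file pattern, image size, and batch size of 30 are illustrative assumptions rather than the exact pipeline.

```python
# Minimal sketch of a batched, asynchronous input pipeline with tf.data (TF 1.x style).
import tensorflow as tf

def load_image(path):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize_images(image, [513, 513])
    return image

files = tf.data.Dataset.list_files("images/*.jpg")   # hypothetical input directory
dataset = (files
           .map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(30)        # one transfer moves 30 images instead of 1
           .prefetch(1))     # prepare the next batch while the GPU works on this one

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()   # tensor of shape [batch, 513, 513, 3]

with tf.Session() as sess:
    while True:
        try:
            batch = sess.run(next_batch)   # feed `batch` into the inference graph here
        except tf.errors.OutOfRangeError:
            break
```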

2.5.4 Application-Specific Integrated Circuits

The rise of deep learning creates a high demand for Application-Specific Integrated Circuits (ASICs) to speed up DNN inference workloads. Major chip vendors have been incorporating deep learning acceleration circuits in their cutting-edge silicon. For example, Intel introduced the Neural Compute Stick (NCS), which can be plugged into a Raspberry Pi ARM computer to accelerate neural network workloads on the edge. In the second-generation Scalable Xeon processors, Intel introduced the DL Boost feature, which leverages 512-bit-wide AVX Single Instruction Multiple Data (SIMD) to speed up matrix multiply-add operations, adds support for 8-bit neural network weight inference, and fuses convolution, max-pooling, and activation into one instruction. Xilinx develops deep neural network accelerator cards based on its Field Programmable Gate Array (FPGA) technology. Google developed the Tensor Processing Unit (TPU), scalable from the cloud to the edge, and implemented a TPU on the embedded Coral edge development board suitable for robot development. Nvidia, the leading GPU manufacturer, introduced Tensor Core deep neural network accelerators in its Turing architecture GPUs and the Xavier computing platform for autonomous driving and robot development.

In this work, recognizing the 8 GB memory constraint of YOLOv3 and DeepLab v3+, the Nvidia AGX Xavier Jetson computing board and the Nvidia TensorRT software framework were chosen for experimentation. Although the optimization code is specific to Nvidia hardware and software, the optimization ideas are also applicable to chips from other vendors. A number of DNN-specific hardware architecture optimization ideas are described as follows:

• Single Instruction Multiple Data (SIMD)

Specialized chips with vectorized instruction sets can take 512-bit data in a single instruction, saving compute cycles. Most matrix operations in DNN models can be optimized with these Single Instruction Multiple Data (SIMD) instruction sets. Available since 2011, Advanced Vector Extensions (AVX) is an extension to the x86 architecture proposed by Intel and Advanced Micro Devices (AMD) that can speed up matrix multiplication; the latest AVX-512 iteration can take 512-bit data, drastically improving matrix operation performance, which is very beneficial for DNN computation. OpenCL and CUDA libraries can distribute workloads to massively parallel GPU stream processors. In the Nvidia Turing architecture, specialized Tensor Cores are implemented to perform computation on 4x4x4 matrices in a single instruction. The TensorFlow DNN framework from Google, if compiled with the Intel MKL and TBB libraries or accelerated by the CUDA GPU computing framework on Nvidia GPUs, is significantly faster than the plain CPU version.

• Fused Convolution, Max-Pooling, Activation

Apart from matrix operations, DNNs also have structures that repeat: for example, the convolution, max-pooling, and activation operations are often computed sequentially. Specialized SIMD instructions can therefore be implemented to fuse these three operations together, saving compute cycles. Data can also be kept in higher-level cache, instead of being written back to and read from main memory between each layer operation.

• Half-precision (16bit)

By reducing the weights of the DNN from 32-bit float to 16-bit float, the data bandwidth of DNN inference is cut in half, and the model size is also halved. Chips with silicon area dedicated to 16-bit Arithmetic Logic Units (ALUs) can then speed up 16-bit matrix multiplication. Typically, a reduction in precision degrades the accuracy of a DNN, but research shows that DNNs are robust to this operation [37], and the resulting speed improvement outweighs the accuracy degradation.

• Quantization

A more aggressive optimization than 16-bit half-precision is int8 quantization. By storing the weights of a DNN as 8-bit integers with a [−128, 127] dynamic range, the model storage and data bandwidth requirements are reduced by another half. Quantization can be applied after a model is trained: the optimizer inspects the model, records the dynamic range of the weights of certain layers, linearly projects that dynamic range onto [−128, 127], and stores the multiplier and offset needed to recover the original dynamic range. Quantization can also be applied during model training: after a model is sufficiently trained, an additional short training phase can be administered with quantized weights to adapt to the reduced dynamic range. This approach is usually called quantization-aware training in the literature.
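The numpy sketch below illustrates the per-tensor symmetric variant of this idea: record the dynamic range, derive a single scale, round to int8, and keep the scale to recover approximate float weights. Real toolchains such as TensorRT calibrate per layer or per channel and also quantize activations.

```python
# Minimal sketch of post-training symmetric int8 quantization for one weight tensor.
import numpy as np

def quantize_int8(weights):
    """Map float32 weights into int8 and return the scale needed to recover them."""
    max_abs = np.max(np.abs(weights))          # dynamic range of this tensor
    scale = max_abs / 127.0                    # one float32 multiplier per tensor
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate the original weights from the int8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(3, 3, 64, 64).astype(np.float32)   # e.g. a convolution kernel
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))      # bounded by roughly scale / 2
```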

Nvidia TensorRT is the software framework that leverages the ASICs on the AGX Xavier Jetson computing board, so it was selected in this work to optimize the DNNs. Other vendors provide similar functionality, such as the Intel MKL and TBB libraries that can perform fused convolution, max-pooling, and activation and 8-bit quantized inference on Cascade Lake architecture CPUs and on edge processors like the Neural Compute Stick (NCS).
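As a concrete example, the sketch below compiles a frozen graph with the TF-TRT converter from the TensorFlow 1.13 contrib API at FP16 precision; the file names and output node name are illustrative assumptions.

```python
# Minimal sketch of TF-TRT FP16 compilation with the TF 1.13 contrib API.
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

with tf.gfile.GFile("trimmed_deeplab.pb", "rb") as f:   # hypothetical file name
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

trt_graph = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=["SemanticPredictions"],   # assumed output node name
    max_batch_size=30,                 # match the batched input pipeline
    max_workspace_size_bytes=1 << 30,  # 1 GB scratch space for engine building
    precision_mode="FP16")             # half-precision engines on Tensor Cores / DLA

with tf.gfile.GFile("trt_fp16_deeplab.pb", "wb") as f:
    f.write(trt_graph.SerializeToString())
# Unsupported ops (e.g. SpaceToBatch) remain TensorFlow nodes and fall back at runtime.
```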


2.5.5 Model Optimization Experiment

Figure 2.4: The compute graph of the original DeepLab V3+

The optimization techniques mentioned earlier are applied to the DeepLab v3+ model in the following order: (a) model compression and graph trimming (freeze variables, fold constants); (b) batched input tensor; (c) asynchronous input data pipeline that prefetches the next batch; (d) TensorRT engine compilation.

The original compute graph of the DeepLab v3+ model (Figure 2.4) and the final optimized graph (Figure 2.5) differ significantly in compute node count.

The relative speedup from each technique is shown in Table 2.2. All pipelines were run on an Nvidia Quadro RTX 4000 GPU with tensorflow-gpu 1.13. Each run performs inference on 1200 513x513 images, and I did 5 runs after applying each optimization technique. As shown in Table 2.2, the final model performs at 151.041 images per second, a 15.6x improvement over the DeepLab v3+ visualization script used during training, which performs at 9.669 images per second.

Figure 2.5: The optimized compute graph of DeepLab V3+. Operations are grouped together into TRTEngineOps, constants with similar values are grouped together. The node count in the compute graph reduced to 26.

Table 2.2: Measured Speed ups

                      Vanilla        Trim Graph,       Batch      Asynchronous    TensorRT
                      DeepLab v3+    Freeze Graph,     input      input           16-bit
                      vis.py         Fold Constants    tensor     pipeline        compiled
Run 1 (s)             123.064        61.937            27.367     16.060          7.824
Run 2 (s)             124.949        63.404            26.805     16.046          7.983
Run 3 (s)             122.179        61.701            27.095     16.359          8.162
Run 4 (s)             128.111        62.545            26.816     16.146          7.851
Run 5 (s)             122.232        63.620            26.809     16.317          7.905
Average (s)           124.107        62.641            26.978     16.186          7.945
Standard deviation    2.239          0.765             0.223      0.130           0.121
Time per image (s)    0.103          0.052             0.022      0.013           0.007
Images per second     9.669          19.157            44.480     74.139          151.041

2.6 Conclusion

The experimental results show that transfer learning is effective at re-purposing a semantic segmentation DNN model to a new domain, and the model could be used as the pre-processing layer of robot perception tasks. The speedup measured while optimizing the DNN model shows there is significant room for performance gains in a research DNN model trained with a popular DNN development framework. The final optimized model is almost 15.6 times faster than the original implementation by applying three main ideas: (a) trim graph, freeze variables, fold constants; (b) efficient data input pipeline; (c) ASIC half-precision model compression. This work shows that, if executed correctly, a powerful DNN model can run in real time, serve as a pre-processing filter in robot perception, and provide valuable information about the environment to a robot agent.

There are still many technical difficulties in deploying DNN models. On the accuracy front, even with transfer learning, the DNN does not perform well with limited training data consisting of about 500 images, while a human is able to learn to pinpoint the class and shape of defects given a few examples. On the optimization front, the effort required to deploy a DNN efficiently is still non-trivial; it takes a thorough understanding of the computing capability of the targeted deployment platform. As shown in Figure 2.5, some of the layer operations supported by the TensorFlow framework, namely SpaceToBatch, BatchToSpace, and ResizeBilinear, are not supported by the TensorRT 5.1 inference engine and must fall back to the TensorFlow implementation, leaving a few compute nodes in the graph running at a slower speed. To leverage the Tensor Cores effectively, the tensor size must be a multiple of 8 or the data will be padded; this calls for more design consideration during the model research stage. The Open Neural Network Exchange (ONNX) initiative [38] could smooth the deployment process by building a standard description of DNN models. It is improving constantly, but at its current stage, compatibility between different training frameworks (TensorFlow, PyTorch, MXNet, etc.) and deployment frameworks (TensorFlow Extended, DirectML, TensorRT, Xilinx FPGA, etc.) is still very rocky.


CHAPTER 3

SENSOR FUSION OF THERMAL CAMERA AND RGB-D CAMERA FOR AUTOMATIC HUMAN FOREHEAD TEMPERATURE MEASUREMENT ON A ROBOT AGENT

Heating, ventilation, and air conditioning (HVAC) systems in modern buildings are designed to provide occupants with a comfortable and healthy environment. HVAC systems consume most of the energy of an office building and dominate electric peak demand in summer. In this chapter, we aim to develop an innovative Internet of Robotic Things (IoRT) system for building information collection. The proposed system automatically collects detailed, real-time information on occupant presence and forehead skin temperature, along with fine-grained air and surface temperature measurements at different locations within the building. The proposed IoRT includes a mobile robot equipped with multiple sensors, including an RGB-D camera, a thermal camera, mean radiant and air temperature sensors, and humidity sensors. The system serves as a prototype of a final product that will process this data to determine optimal building operation and provide seamless integration with an adaptive building automation system (BAS) that balances occupant comfort and energy use by controlling the HVAC system.

3.1 Introduction

Traditional thermostats (Figure 3.1) rely on single-point measurement to adjust the heating, ventilation, and air conditioning (HVAC) system. They do not measure fine-grain data about the building's occupants, such as skin temperature, to determine thermal comfort level. They also do not measure fine-grain environmental data about the building, such as the room distribution of humidity, air velocity, radiant temperature, and air temperature. This work proposes the design of an Internet of Robotic Things (IoRT) system (Figure 3.2) to record these fine-grain data for computing the optimal HVAC parameter adjustment policy.


Figure 3.1: A traditional thermostat can only measure the temperature at a single point in a room

3.2 Related Work

• RGB-D Sensor Applications

The advent of the Microsoft Kinect RGB-D camera created a novel way to develop interactive video games, but the implications of such a sensor reach far beyond the gaming world [39]. A robot system armed with the Kinect RGB-D sensor can use the depth information to map the environment, detect and locate human users, and finally understand our behavior and intentions [40].

• Human Skeleton Tracking Algorithms

Human skeleton tracking, sometimes called keypoint detection, is a subtask of human pose estimation [41]. The algorithms try to detect people and localize points of interest, like the head, hands, elbows, and torso, on the human users. A number of learning approaches to detect and track keypoints have been proposed. General-purpose object detectors [42] have been re-purposed [43] to detect keypoints on a human body. The Microsoft Common Objects in Context (COCO) organization has been holding challenges on keypoint detection since 2016 [44] to advance research in this perception task. The 2016 winner, OpenPose, uses a DNN to first recover the keypoints [45], then uses a non-parametric representation to learn which body part belongs to which person in an image [46][47], and achieved real-time performance [48]. The COCO 2019 keypoint detection winner uses multi-stage networks to aggregate features and performs coarse-to-fine supervision [49].

Figure 3.2: An IoRT system equipped with multiple sensors to measure the fine-grain data of the building and its occupants

• Sensor Fusion of Thermal Camera and RGB-D Camera Efforts

Mobile thermal cameras have many applications. In [50], a FLIR Lepton camera was used to extend visual sensors' perception capability in mountainous areas. Thermal sensors have been used to improve the human detection capability of the Kinect [51]. In [52], a depth sensor was used to improve the 3D reconstruction of thermal images. An automated office thermal comfort adjustment system was implemented in [53] with a human thermal comfort voting model, but that system is statically mounted and can only interact with users in close proximity.

• Human Comfort Models

The Berkeley Comfort Model was established by [54] to model the effects of a number of environmental parameters, such as temperature, convection, and humidity, on human comfort level. Human comfort models for a myriad of environments, such as automobiles [55], indoor buildings [56], and outdoors [57], have been assessed and researched extensively.

3.3 Hardware Structure

The robot platform is built on top of the turtlebot 2, carrying a ThinkPad X240 laptop. The software solution, including the navigation stack and sensor drivers, runs on the laptop. The thermal camera is powered by the laptop's USB 3.0 port and transfers data through the same port. The Kinect 1.0 RGB-D camera is powered from the 12 V rail of the turtlebot 2 base and connected to the laptop through a USB 3.0 port. The mean radiant, air temperature, and humidity sensors are connected to a LabJack analog-to-digital I/O module; the I/O module is powered by a DC/DC converter drawing power from the turtlebot 2 battery and connected to the laptop via a USB 3.0 port.

3.3.1 Turtlebot 2 platform

The turtlebot 2 platform [58] is an open-source mobile robot platform developed by Clearpath Robotics. The platform comes with a ROS autonomous navigation stack, which consists of a rudimentary Simultaneous Localization and Mapping (SLAM) solution using the Kinect RGB-D camera. The platform can broadcast its location and world map, as well as the wheel odometry of the base, through the Robot Operating System middleware. It provides an interface to control the speed of the 3-wheel skid-steer roller base.

3.3.2 RGB-D Camera Sensor

The Microsoft Kinect for Xbox 360 (2010, 1st generation) was used as the RGB-D camera. The color (RGB) sensor captures 8-bit 640x480 VGA-resolution images, and the depth sensor outputs 11-bit depth with 2048 levels of sensitivity. When working with the Kinect skeleton tracking ROS packages, the sensor works best at 1.2-3.5 m from the user and can maintain user tracking in an extended range of 0.7-6 m. To ensure its field of view is optimal for detecting occupants in an office building, the camera is mounted 84.0 cm above the ground.

3.3.3 Thermal Camera Sensor

The FLIR Lepton 3.5 was chosen as the thermal camera sensor. It is a coin-sized thermal camera powered by a micro USB interface. The sensor draws less than 5 W and is suitable for robotics development. It has a resolution of 160x120 pixels and captures images at 9 Hz with a horizontal Field of View (FOV) of 57°. The Lepton 3.5 obtains a 16-bit Kelvin temperature reading in each pixel, ranging from 0.01 K to 655.35 K, with a thermal sensitivity of 50 mK. To ensure maximum overlap of the effective fields of view of the thermal sensor and the RGB-D sensor, they are mounted so that their optical centers are close to each other; see Figure 3.3.

Figure 3.3: The Microsoft Kinect 1st generation RGB-D Camera and the Lepton FLIR 3.5 Thermal Camera Assembly

3.3.4 Temperature, Humidity, Airspeed, and Noise Sensor Suite

A sensor suite of mean radiant and air temperature, humidity, airspeed, and noise sensors is mounted in a box behind the Kinect camera; see Figure 3.2. The sensors draw power from the turtlebot 2 battery through a DC/DC converter. The sensor signals are collected by a LabJack analog-to-digital I/O module inside the box, and the collected data is transferred to the laptop through a USB 3.0 port.

3.4 Software Structure

The overall structure of the software solution, which automatically detects human users, tracks and measures their forehead temperature, and logs the data, is depicted in Figure 3.4.


Figure 3.4: Software Structure

Temperature and humidity sensors, wheel odometry, the Kinect RGB-D sensor, and the FLIR Lepton thermal sensor are each handled by their own ROS sensor nodes. The sensor fusion node integrates the coordinate transform information from the turtlebot 2 SLAM node, the skeleton joint spatial coordinates detected by the OpenNI human skeleton tracker, and the thermal image; it then computes the detected user's location in the room and the user's forehead temperature and records the data. The data collection node takes input from the temperature, humidity, airspeed, and noise sensor suite and the world coordinates of the robot base from the turtlebot 2 SLAM node, then records a sensor reading map of the room.

3.4.1 Robot Operating System (ROS)

Robot Operating System is a popular open-source robotics middleware[2]. It runs on Linux on both x86 and ARM architectures, providing a unified service platform that implements low-level sensor communication nodes, synchronization between nodes in different processes, robot models, and package management. A number of commonly used functions, such as sensor and hardware drivers, robot physical models, path planning, and perception pipelines, are implemented by the open-source community. A node in ROS can be a driver node that reads data from a sensor, a perception node that detects objects from raw sensor images, a planning node that integrates sensor input and outputs control signals to the motor, or a motor node that receives control signals and actuates the motor. Each node can broadcast messages to a specific topic and subscribe to a specific topic to receive and process a continuous flow of data. ROS also defines an XML-based Unified Robot Description Format (URDF) standard that describes the robot model, providing a convenient way to define the static and dynamic joints and the physical dimensions of robot body parts and sensors.
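As a minimal illustration of this publish/subscribe model (not a node from this project; all topic and node names here are made up), a rospy node that subscribes to one topic and republishes a processed value looks like the following:

#!/usr/bin/env python
# Minimal ROS pub/sub sketch; the topic and node names are illustrative
# only and do not correspond to any node in this project.
import rospy
from std_msgs.msg import Float64

class ExampleNode(object):
    def __init__(self):
        rospy.init_node('example_node')
        # Publisher broadcasts processed values on a topic.
        self.pub = rospy.Publisher('/processed_value', Float64, queue_size=10)
        # Subscriber receives a continuous flow of raw sensor values.
        rospy.Subscriber('/raw_value', Float64, self.callback)

    def callback(self, msg):
        # Process the incoming message and republish the result.
        self.pub.publish(Float64(msg.data * 0.5))

if __name__ == '__main__':
    ExampleNode()
    rospy.spin()  # hand control to ROS until the node is shut down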

3.4.2 Thermal Camera Node

The FLIR Lepton 3.5 camera breakout board provider GroupGets GetLab provides a Python Application Programming Interface (API) to read temperature array data from the Lepton 3.5[59]. However, no Lepton ROS wrapper was available in the open-source community, so I developed a Python ROS node to interface with the Lepton 3.5 camera.

The Lepton Python API uses a modified version of the Linux webcam driver library libuvc to communicate with the camera. The stream format of the Lepton 3.5 is 16Y (16-bit grayscale), which is not a standard video format in the default libuvc library, so libuvc was modified to process the raw sensor data as a 16Y video stream.

ROS Package: Lepton_radiometry
ROS Node: uvc-radiometry.py
ROS Command: rosrun Lepton_radiometry uvc-radiometry.py
Publish topics:

• \Lepton_gray
[160x120 8-bit grayscale image array]
The grayscale thermal image, clamped between the highest and lowest temperature in the frame for visualization

• \Lepton_raw
[160x120 16-bit unsigned int array]
An unsigned int 16-bit array of raw thermal sensor data
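A condensed sketch of the publishing side of this node is given below. The actual frame-grabbing call into the modified libuvc driver is abstracted behind a placeholder read_raw_frame() helper, so the details of the capture API are not represented here.

#!/usr/bin/env python
# Sketch of the thermal camera node's publishing loop.
import rospy
import numpy as np
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

def read_raw_frame():
    # Placeholder: in the real node this wraps the modified libuvc
    # capture call; here it just returns a dummy 120x160 uint16 frame.
    return np.zeros((120, 160), dtype=np.uint16)

def main():
    rospy.init_node('uvc_radiometry')
    bridge = CvBridge()
    pub_raw = rospy.Publisher('/Lepton_raw', Image, queue_size=1)
    pub_gray = rospy.Publisher('/Lepton_gray', Image, queue_size=1)
    rate = rospy.Rate(9)  # the Lepton 3.5 outputs frames at 9 Hz
    while not rospy.is_shutdown():
        raw = read_raw_frame()
        # Publish the raw 16-bit radiometric data unchanged.
        pub_raw.publish(bridge.cv2_to_imgmsg(raw, encoding='mono16'))
        # Clamp and rescale to 8 bits for visualization.
        lo, hi = raw.min(), raw.max()
        gray = ((raw - lo) * (255.0 / max(hi - lo, 1))).astype(np.uint8)
        pub_gray.publish(bridge.cv2_to_imgmsg(gray, encoding='mono8'))
        rate.sleep()

if __name__ == '__main__':
    main()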

3.4.3 RGB-D Camera and Kinect Skeleton Detection Node

The skeleton tracking package for the Kinect 1st generation is the OpenNI tracker [60] (http://wiki.ros.org/openni_tracker), see Figure 3.5.

ROS Command: rosrun openni_tracker openni_tracker
ROS Dependencies:

• NITE SDK[61]
• ros-kinetic-openni*

Publish topics:

• \tf
The tf transformation tree of detected skeleton joints. The sensor origin frame in \tf is set to \camera_depth_frame.
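To illustrate how a downstream node consumes these skeleton transforms, the sketch below looks up the head joint of the first tracked user relative to the depth frame through the tf API; the joint frame name 'head_1' follows the openni_tracker convention but is an assumption that should be checked against the running tracker.

#!/usr/bin/env python
# Sketch: look up the head joint of tracked user 1 relative to the
# Kinect depth frame using tf. Frame names are assumptions based on
# the openni_tracker convention used in this project.
import rospy
import tf

def main():
    rospy.init_node('head_listener')
    listener = tf.TransformListener()
    rate = rospy.Rate(10)
    while not rospy.is_shutdown():
        try:
            # Translation is the head position (x, y, z) in metres,
            # expressed in the camera_depth_frame.
            (trans, rot) = listener.lookupTransform(
                '/camera_depth_frame', '/head_1', rospy.Time(0))
            rospy.loginfo('Head position: %s', trans)
        except (tf.LookupException, tf.ConnectivityException,
                tf.ExtrapolationException):
            pass  # user not currently tracked
        rate.sleep()

if __name__ == '__main__':
    main()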

3.4.4 Sensor Fusion Node

Figure 3.5: Openni Skeleton Tracker

To finally measure the forehead temperature of building occupants, the Lepton thermal camera images and the skeleton joints from the Kinect must be aligned and represented in the same reference frame. Specifically, we need to calibrate the two sensors and find a mapping function from the forehead spatial coordinates in the camera reference frame P = (x, y, z) to the pixel coordinate P′ = (x′, y′) on the thermal image. Figure 3.6 shows the URDF model of the turtlebot and Kinect camera in ROS rviz.

The forehead pixel coordinate P′ is computed as follows:

\[ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = W \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \tag{3.1} \]

where W is the camera projection matrix of the thermal camera.
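A minimal numpy sketch of this projection, including the perspective divide implied by the homogeneous form, is shown below; the 3x4 matrix W used here is a placeholder, not the calibrated matrix from the experiments.

import numpy as np

# Placeholder thermal camera projection matrix (3x4); the real W comes
# from the thermal/RGB-D calibration procedure.
W = np.array([[200.0, 0.0, 80.0, 0.0],
              [0.0, 200.0, 60.0, 0.0],
              [0.0,   0.0,  1.0, 0.0]])

def project_to_thermal(P):
    """Map a forehead point P = (x, y, z) in the camera frame to a
    pixel coordinate (x', y') on the 160x120 thermal image."""
    x, y, z = P
    hom = W.dot(np.array([x, y, z, 1.0]))    # homogeneous pixel coords
    return hom[0] / hom[2], hom[1] / hom[2]  # divide by the scale term

print(project_to_thermal((0.1, -0.05, 2.0)))  # -> (90.0, 55.0)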

To accurately measure the forehead temperature of a user, an adaptive sampling area around the pixel coordinate of the forehead is implemented, accounting for the changing distance from the user to the thermal camera. The sampling area is smaller when the user is far from the camera and larger when the user is close to it.


The radius of the sampling area, r, is computed as follows:

r = α/d′ (3.2)

where d′ is the depth distance from the user's location to the camera optical center, and α is a scaling factor determined during calibration.
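The following sketch illustrates the adaptive radius together with the subsequent temperature read-out. The scaling factor, the square sampling window, and the centi-Kelvin conversion (raw counts divided by 100, consistent with the 0.01 K resolution quoted earlier) are assumptions for illustration rather than the exact calibrated implementation.

import numpy as np

ALPHA = 6.0  # scaling factor from calibration (placeholder value)

def sampling_radius(depth_m):
    """Adaptive radius in pixels: larger when the user is close,
    smaller when the user is far from the thermal camera."""
    return max(1, int(round(ALPHA / depth_m)))

def forehead_temp_c(raw_frame, px, py, depth_m):
    """Average the raw counts in a square window around the forehead
    pixel and convert to Celsius (assuming raw counts are centi-Kelvin)."""
    r = sampling_radius(depth_m)
    h, w = raw_frame.shape
    window = raw_frame[max(0, py - r):min(h, py + r + 1),
                       max(0, px - r):min(w, px + r + 1)]
    mean_centi_kelvin = float(window.mean())
    return mean_centi_kelvin / 100.0 - 273.15

# Example on a synthetic frame filled with 310.15 K (37 C) readings:
frame = np.full((120, 160), 31015, dtype=np.uint16)
print(forehead_temp_c(frame, px=90, py=55, depth_m=2.0))  # ~37.0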

The user's world coordinates L in the room are computed as follows:

L = T1 T2 T3 P (3.3)

where P is the forehead spatial coordinate in the camera reference frame, T3 is the transformation matrix from the Kinect camera optical center to the camera mounting reference frame, T2 is the transformation matrix from the camera mounting reference frame to the robot base reference frame, obtained from reading the robot model, and T1 is the transformation matrix from the robot base reference frame to the room reference frame, obtained from the turtlebot 2 SLAM solution. In ROS, the transformation tree is handled by the tf package.
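In practice the chain T1 T2 T3 need not be composed by hand: the tf package can resolve the full transform from the camera frame to the room frame in a single call. Below is a minimal sketch, assuming '/map' as the room frame published by the SLAM node and '/camera_depth_frame' as the Kinect optical frame; the '/map' name in particular is an assumption about this setup.

#!/usr/bin/env python
# Sketch: resolve the full camera-to-room transform with tf instead of
# multiplying T1, T2, and T3 manually; tf composes the chain internally.
import rospy
import tf
from geometry_msgs.msg import PointStamped

def forehead_world_coords(listener, forehead_cam_xyz):
    """Transform a forehead point from the camera frame to the room
    (map) frame using the tf transformation tree."""
    p = PointStamped()
    p.header.frame_id = '/camera_depth_frame'
    p.header.stamp = rospy.Time(0)  # use the latest available transform
    p.point.x, p.point.y, p.point.z = forehead_cam_xyz
    return listener.transformPoint('/map', p)

if __name__ == '__main__':
    rospy.init_node('forehead_world_demo')
    listener = tf.TransformListener()
    listener.waitForTransform('/map', '/camera_depth_frame',
                              rospy.Time(0), rospy.Duration(5.0))
    print(forehead_world_coords(listener, (0.1, -0.05, 2.0)))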

Figure 3.6: Sensor fusion between Lepton 3.5 and Kinect


ROS Node: get-forehead-temp.py
ROS Command: rosrun Lepton_radiometry get-forehead-temp.py
Subscribe topics:

• \tf
The tf transformation tree of detected skeleton joints

• \Lepton_raw
An unsigned int 16-bit array of raw thermal sensor data

Publish topics:

• \Forehead_Temp
float64 forehead_temp
uint8 user_id
float64 x
float64 y

The forehead temperature, user id, and user location (x, y) in the room. See Figure 3.7 for the visualization in ROS rviz.

3.5 Experimental Results

In an office room experiment, the system successfully detected user skeletons within the vicinity of the Turtlebot platform, tracked the forehead of each detected user, and published the forehead temperature and user location within the room on the ROS topic \Forehead_Temp.

Due to the resolution limitation of the Kinect 1st generation, the system can only track two users simultaneously. Skeleton tracking accuracy degrades significantly when the user is too close to the Turtlebot (<0.5 m).


Figure 3.7: Tracking user’s forehead and measuring forehead temperature

Due to the resolution limitation of the Lepton Thermal Camera, when the user is too far away from the Turtlebot (>4m), the forehead temperature reading is not reliable.

The ROS nodes are written in Python and use Python bindings to leverage the native C++ performance of the OpenCV library for image processing. The average processing time for a thermal image is 40-50 ms, well within the 111 ms frame period of the thermal sensor's 9 Hz sampling rate.

3.6 Privacy Concerns and Ethical Implications

An autonomous robot system that tracks building occupants' statistics could raise privacy concerns and have ethical implications.

• Potential Misuses

The technology of tracking humans' forehead temperature could be misused to build military robots that serve as Lethal Autonomous Weapons (LAWs) or Lethal Autonomous Robots (LARs). The body temperature of human workers within a building constitutes the vital signs or medical records of those workers. The data could be misused by employers to monitor workers' health conditions, and could lead to discrimination against workers with special health conditions in the workplace, or to denial of employment or medical insurance. The body temperature data, if leaked, could constitute a violation of the Health Insurance Portability and Accountability Act (HIPAA).

• Design Considerations to Mitigate

A human user must assume the psi pose to give explicit consent for the robot to track and measure them. Users' data is kept locally and is wiped periodically. Such a system, if deployed in a real building, must comply with local or national regulations on encryption standards to prevent leaks of health information. All data related to human workers must be properly encrypted to prevent misuse by other parties. The interface to the thermostat should only provide adjustment recommendations and should not expose user data.

3.7 Summary

This work demonstrates the design and implementation of a robot platform that performs multi-sensory fusion to generate meaningful insights about the environment. Three ROS nodes performing three separate tasks in synchronization were implemented: a thermal camera wrapper node, a skeleton tracking node for the RGB-D camera (Kinect), and a sensor fusion node that records user forehead temperature. A power delivery subsystem that powers all the sensors on this IoRT platform was also designed and implemented.

However, the dated Kinect skeleton tracker code has degraded in tracking robustness and is a hassle to maintain and set up. The resolution of the Kinect 1st generation and of the low-cost Lepton 3.5 thermal camera limits the perception capability of the system. The current system can only track two users simultaneously, and users must be within a reasonable range (0.5 m to 4.0 m) to be detected and measured. This behavior is consistent with studies on the performance of the Kinect sensor[62][63].

Fully autonomous operation in a highly dynamic environment, such as an office building with personnel and equipment moving around, remains an open research problem. It was not possible to build a fully autonomous human comfort measuring robot within the given time and budget. However, because the ROS code in this project is structured as independent functional nodes, it can be easily expanded and adapted to other applications, making it possible to deploy the features developed here on a fully autonomous robot in the future.


CHAPTER 4

POSE ESTIMATION USING RECTANGULAR TO SPHERICAL IMAGE MATCHING

4.1 Introduction

Visual pose estimation is the capability of a robot to estimate its current camera pose from visual sensor data such as camera images. It is an important component for loop closure in any visual Simultaneous Localization and Mapping (SLAM) algorithm. By recognizing a previously visited scene, the robot can localize its current position on the map generated by SLAM and then perform bundle adjustment to correct the error accumulated since the last visit to that scene.

Visual pose estimation is a challenging problem in long-term autonomy. If the environment that the robot operates in experiences long-term change, such as illumination change between day and night, sunny, cloudy, rainy, and snowy weather, or vegetation change between seasons, it is difficult to develop a robust descriptor of a scene that filters out the noise of changing conditions and captures the invariant features in a particular image.

4.2 Related Works

Common approaches to visual pose estimation include: (a) 2D-to-2D matching, where feature vectors, or descriptors, are extracted from 2D images and compared with the feature vector of the query image to find a match; (b) 2D-to-3D model matching, where structure from motion (SfM) algorithms construct a detailed 3D model of the environment from a sequence of images and local feature descriptors are stored on the vertices of the 3D model, so that, given a query image, its local feature descriptors can be extracted and used to search the 3D model for a match; and (c) 2D sequence-to-sequence matching, which extends single-image matching by matching one sequence of images to another; it is reported[64] to be more accurate than single-image matching but requires a consistent camera pose between sequences.


Some researchers have used hand-crafted feature extractors to compute statistics of the input image, calculating gradients, color distributions, and pixel luminosity patterns, and using the computed histograms as feature vectors. Machine learning approaches, such as Shared Representation Appearance Learning, add an offline training process that performs regularization over different feature modalities and tries to learn a long-term invariant representation that is robust to seasonal, weather, and luminosity changes[65]. In recent years, approaches using a trained DNN as the feature extractor have seen marked improvements in localization accuracy. HF-Net[66] achieved state-of-the-art results on visual localization benchmark datasets. It utilizes multi-task learning by training a DNN to simultaneously extract global and local descriptors of a scene, and also builds a 3D model of the environment with SfM algorithms and embeds local feature descriptors on the model. It then uses a coarse-to-fine approach: the global descriptor finds a coarse match, and the local descriptors fine-tune the 6 degree-of-freedom pose estimation. In [67], a Generative Adversarial Network (GAN) was used to generate daytime images from nighttime images before performing visual localization, so that appearance change between day and night is handled.

4.3 Approach

This work uses a 360-degree spherical camera to perform visual pose estimation. The benefits of deploying a 360-degree spherical camera on a robot agent are threefold: (a) recent advances in imaging technology and image processing chips have drastically reduced the price and size of spherical camera sensors, making a spherical camera more cost-effective than an array of cameras; (b) the spherical camera captures all the visual information available in a particular scene, avoiding the consistent camera pose requirement of 2D-to-2D image matching; and (c) it avoids the memory cost of constructing a 3D model with SfM.

The plan is to explore the following robotic collaborative perception scenario: one robot uses a spherical camera to quickly model a scene and produce an efficient descriptor; the descriptor can then be deployed on new robots equipped with a conventional 2D camera, and those robots can match 2D images from their cameras against the descriptor to perform visual camera pose estimation in the environment.

Figure 4.1: A Ricoh theta 360 camera

4.4 Spherical Image Pre-processing

This work uses the RICOH Theta 360 camera, see Figure 4.1. A spherical image from this camera is stored in a 2D equirectangular format; see Figure 4.2 for a sample image.

Naively applying feature extractors to the equirectangular image gives too much weight to the pixels near the north and south poles. Additionally, pixels near the poles are stretched severely, as shown in Figure 4.3; applying feature extraction to these patches yields feature vectors very different from those of a conventional 2D camera.

To address the singularities at the north and south poles, soccer-ball patches were used to re-project parts of the spherical image back to 2D images. Specifically, a soccer ball surface is a truncated icosahedron consisting of 12 regular pentagonal faces and 20 regular hexagonal faces.
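As one concrete way to obtain an undistorted 2D view from the equirectangular image, the sketch below performs a gnomonic (rectilinear) reprojection of a patch centered at a chosen longitude and latitude; it is a generic reprojection sketch with illustrative file names, not the exact soccer-ball tessellation code used in this chapter.

import numpy as np
import cv2

def equirect_patch(equi, lon0, lat0, fov_deg=60.0, size=256):
    """Reproject a rectilinear (gnomonic) patch with the given field of
    view, centred at longitude lon0 / latitude lat0 (radians), out of an
    equirectangular image. Assumes the top image row is the north pole."""
    h, w = equi.shape[:2]
    f = (size / 2.0) / np.tan(np.radians(fov_deg) / 2.0)
    u, v = np.meshgrid(np.arange(size) - size / 2.0,
                       np.arange(size) - size / 2.0)
    # Unit ray direction for each patch pixel in the tangent-plane camera.
    x, y, z = u, -v, np.full_like(u, f)
    norm = np.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    # Rotate the rays so the patch centre points at (lon0, lat0).
    lat = np.arcsin(np.clip(y * np.cos(lat0) + z * np.sin(lat0), -1.0, 1.0))
    lon = lon0 + np.arctan2(x, z * np.cos(lat0) - y * np.sin(lat0))
    # Convert spherical coordinates back to equirectangular pixel positions.
    map_x = ((((lon / (2.0 * np.pi)) + 0.5) * w) % w).astype(np.float32)
    map_y = ((0.5 - lat / np.pi) * h).astype(np.float32)
    return cv2.remap(equi, map_x, map_y, cv2.INTER_LINEAR)

# Example usage on a hypothetical equirectangular capture:
equi = cv2.imread('theta_equirect.jpg')
if equi is not None:
    patch = equirect_patch(equi, lon0=np.pi / 2.0, lat0=0.0)
    cv2.imwrite('patch.jpg', patch)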


Figure 4.2: A spherical image captured in Colorado School of Mines
