
Improving Performance of a Mixed Reality Application on the Edge with Hardware Acceleration


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-G--20/073--SE

Improving Performance of a Mixed Reality Application on the Edge with Hardware Acceleration

Jesper Eriksson

Christoffer Akouri

Supervisor: Klervie Toczé
Examiner: Simin Nadjm-Tehrani



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Jesper Eriksson and Christoffer Akouri


Abstract

Using specialized hardware to accelerate workloads has the potential to bring great performance lifts in various applications. Speeding up the slowest executing component in an application makes the whole application execute faster, since it cannot be faster than its slowest part. This work investigates two modifications that use additional hardware support to improve an existing mixed reality application. The existing application uses a server computer to handle virtual object rendering; the rendered frames are later sent back to the mobile phone, which is the end device. In this project the server part of the application, where the Simultaneous Localization And Mapping (SLAM) library runs, was modified to use a Compute Unified Device Architecture (CUDA) accelerated variant. The software encoder and decoder used for the video streaming were modified to use specialized hardware. Small changes were made to the client-side application to allow the latency measurement to work when changing the server-side encoder.

Accelerating SLAM with CUDA showed an increase in the number of processed frames each second and a reduction in frame processing time, at the cost of higher latency between the end and edge device. Using the hardware encoder and decoder resulted in no improvement in terms of latency or processed frames; in fact, the hardware encoder and decoder performed worse than the baseline configuration. The reduced frame processing time indicates that the CUDA platform is beneficial, provided that the additional latency introduced by the implementation can be reduced or removed.


Acknowledgments

We want to thank our supervisor, Klervie Toczé, for all the help throughout this work. We also want to thank Johan Lindqvist, who made this work possible to begin with and who assisted us with technical details. A big thank you to Markus Larsson and David Ångström, who acted as opponents for this thesis. Finally, we want to thank our examiner, professor Simin Nadjm-Tehrani.


Contents

Abstract

Acknowledgments

Contents

List of Figures

List of Tables

1 Introduction
 1.1 Motivation
 1.2 MR-Leo
 1.3 Problem definition
 1.4 Approach
 1.5 Achieved results
 1.6 Thesis structure
2 Technology background
 2.1 Parallel computing
 2.2 Point cloud
 2.3 ORB-SLAM2-CUDA
 2.4 Nvidia Jetson TX2
 2.5 GStreamer
 2.6 Related works
3 Hardware implementation and testing methodology
 3.1 Installation
 3.2 Hardware implementation
 3.3 Testing methodology
 3.4 Test environment
 3.5 Edge application configuration
4 Results
 4.1 T2VE
 4.2 Processed frames
 4.3 FRTT
 4.4 Mixed reality processing time
5 Discussion
 5.1 Impact of CUDA accelerated ORB-SLAM2
 5.2 Impact of HW accelerated encoding/decoding
 5.3 Jetson TX2 as an edge device
 5.4 Possible scalability with CUDA
 5.5 Edge device processing time
 5.6 The work in a broader context
6 Conclusion
 6.1 Performance
 6.2 Future work
Bibliography
A Implementation Details
 A.1 MR-Leo arguments
 A.2 Test environment hardware
 A.3 OpenCV installation
 A.4 Pangolin installation


List of Figures

1.1 An overview of the MR-Leo server and client. Image from [3] adapted to show our work.
2.1 The GPU has more ALUs than a CPU [19].
2.2 A point cloud of Monmouth castle.
2.3 Block diagram of the Nvidia Jetson TX2 hardware. Image from [18] adapted to our work.
2.4 Example GStreamer pipeline.
3.1 Picture of the testbench.
3.2 Reference video used for testing.
3.3 Screenshot from the MR-Leo client on the end device.
4.1 CDF of the T2VE.
4.2 CDF of the number of received frames each second.
4.3 CDF of the number of lost frames each second.
4.4 CDF of the FRTT.
4.5 The average time it took for the edge device to process the incoming frames, with the mixed reality processing time shown as a part of the total time.


List of Tables

4.1 Number of T2VE datapoints from each scenario.
4.2 Number of FRTT datapoints from each scenario.


1 Introduction

Mixed reality has been a hot topic during the second decade of the 2000s. Mixed reality applications take input from the real world as we perceive it and combine it with virtual elements to create something in between. But not every device is powerful enough to run these applications. Quality of service may see a significant setback: phones run warmer in the end user's hand and discharge faster, leading to a shorter day-to-day lifespan of the device.

Edge computing is a way to offload computational workloads to a server near the client, which puts lower strain on the network and improves response times. The server computer that performs the computations, whose results are later sent back, is called an edge device.

This work builds upon the previous work of Toczé et al. [1], where they introduce MR-Leo. MR-Leo is a project that consists of a server application and an Android application. The Android application uses either a previously recorded video or the direct camera feed from the end user device, onto which virtual elements are applied. The virtual elements make the video a mixed reality, since there are elements from the real world and virtually generated elements. The virtual element generation is done in the server application on the edge device. MR-Leo makes it possible to offload some of the mixed reality processing from the end user's device to the edge device. This work focuses on using the hardware present on the edge device to run parts of the server code in MR-Leo in a parallel manner.

1.1 Motivation

The issue with offloading workloads to an edge device is that the transport between the client and server creates overhead, which could cancel out the reduced processing time on the edge device. This was seen in the work by Toczé et al. [1], where the average processing time of each image in the image stream was higher with the high-end edge device than when running in the client only, even though the time spent rendering was reduced. This means that the reduction in the average image processing time needs to be greater than the transport overhead created by offloading to the edge device. In their study they proposed that further improvement might be made by implementing hardware encoding/decoding for the video stream. Lindqvist [3] also proposed that Nvidia Compute Unified Device Architecture (CUDA) could be used to speed up the tasks related to mixed reality.

The delay between moving the camera and seeing the finished processed mixed reality image should be small enough that it does not degrade the immersion too much.


Figure 1.1: An overview of the MR-Leo server and client. Image from [3] adapted to show our work.

It has been argued that this delay should be below 20 ms and optimally below 7 ms [2]. The delay is going to be referred to as the Frame Round Trip Time (FRTT).

If the time the frame spends on the edge device is reduced, we could see higher performance of the mixed reality stream, which in this case would mean fewer or no dropped frames. The increased processing capability could enable increasing the quality of the video or enable the use of more complex virtual objects, providing a better mixed reality experience.

1.2 MR-Leo

In this thesis we will be using an edge offloading prototype that goes by the name Mixed-Reality Linköping Edge Offloading, or MR-Leo. MR-Leo was developed as a prototype for offloading mixed reality workloads from a device to an edge server. The prototype is split into two separate applications, a client and a server. Figure 1.1 shows an overview of how the client (end device) and server (edge device) are connected to each other. The client records and sends a video stream to the edge device, which constructs a point cloud from the incoming images and adds the virtual elements to the images. The mixed reality video stream is then sent back to the end device to be displayed. The grey box with dashed borders in the figure shows where we modify the encoding and decoding of the video stream, while the box with the solid grey border is where we modify ORB-SLAM2 for the mixed reality.


1.3 Problem definition

We want to see if it is possible to use the Jetson TX2 as the edge device for the MR-Leo application. Specifically, we want to see whether there are any performance benefits to using the parallelism featured in CUDA to lower the rendering time. By lowering the rendering time, the quality of service may increase through a reduction in the number of frames that are dropped from the image stream.

MR-Leo uses the Simultaneous Localization And Mapping (SLAM) method to generate point clouds, specifically the ORB-SLAM2 [11] library. Better performance in ORB-SLAM2 would mean faster point cloud generation, which would lead to a lower frame round trip time and better quality of service.

There are a couple of research questions that we want to answer by modifying the MR-Leo application:

• What effect will CUDA support for ORB-SLAM2 in the MR-Leo server code have on the performance?

We research the possible performance benefit of using Nvidia's CUDA platform to handle the real-time SLAM library ORB-SLAM2. By using parallel computing we hope to reduce the time each frame spends on the edge device.

• What effect will hardware accelerated encoding/decoding in the MR-Leo server code have on the performance?

In previous work the H.264 encoding and decoding has been handled by software, which could be replaced with a hardware solution. We research the possible performance benefit of using the Jetson TX2's integrated H.264 encoder and decoder.

1.4 Approach

The above research questions were addressed by adopting the following approach:

1. Modifying the original ORB-SLAM2 component in MR-Leo to use a CUDA accelerated variant of ORB-SLAM2.

2. Modifying the part of the code that handles video streaming, changing the software solution to a hardware solution.

3. Running tests with a sample video included with MR-Leo and gathering statistics from each test run.


1.5 Achieved results

The achieved results were not the ones we were hoping for when we set out to do this work. Even though the performance became worse in most respects, improvements could be seen in a few places. We learned a lot along the way, which we think could be of use when researching this further.

Following the above approach, we saw that the time the edge device spent working on each frame was reduced by 22 ms, or 28.9%, when using CUDA. The overall end-to-end latency worsened, however, so much that this improvement was negated and the overall performance was instead reduced. The benefits of a more complete video stream and reduced time spent working on each frame came with the drawback of increased latency. The latency from when a frame is sent from the end device until it has been processed on the edge device and finally sent back increased by 1544 ms, or 54%, compared to the baseline. The cause of this increase is not easily identified because of the complexity of the technologies used and of MR-Leo itself.

1.6 Thesis structure

This thesis consists of six chapters, as well as an appendix.

• Chapter 2 describes a theoretical framework and related works.

• Chapter 3 describes the acceleration methods and how they were implemented but also how the testing was conducted.

• Chapter 4 presents the results obtained with the methodology described in the previous chapter and how they were analyzed.

• Chapter 5 discusses the results and the impacts of using the Nvidia Jetson TX2 as an edge device.

• Chapter 6 concludes the thesis and answers the questions outlined in the first chapter.


2 Technology background

This chapter reviews the technology background for future reference in the thesis. It starts with a description of parallel computing, followed by descriptions of the technologies used in the mixed reality application and of the edge device used in this thesis, the Nvidia Jetson TX2.

2.1 Parallel computing

The Central Processing Unit (CPU) and the Graphics Processing Unit (GPU) have different strengths and weaknesses. The CPU is better at performing fewer sequential operations quickly, e.g. workloads like handling the operating system where low latency is preferred. The GPU, however, is better at computing large amounts of data at the same time, for example computing the square root of all numbers in a region of memory. A common way of describing the GPU and CPU is to compare them with vehicles: the CPU is a motorcycle, which takes one or two people to their destination fast, while the GPU is a bus, which can carry a lot more people but will arrive at the destination later.

With this in mind we may offload heavier workloads to the GPU, which means that the CPU is free to continue executing the code sequentially while the GPU takes care of the heavier parts of the code. In some cases the CPU will still need to wait for the GPU to finish the offloaded workload, but since the GPU has the capacity to compute much larger workloads, the runtime will decrease compared to only using the CPU. The reason the GPU can handle larger workloads is its larger number of threads and arithmetic logic units (ALUs), as seen in Figure 2.1. The ALU performs arithmetic and bitwise operations on integer binary numbers, and more ALUs means that the GPU can execute more operations at the same time.

CUDA is a parallel computing platform that was made to simplify the process of utilizing the GPU. Currently the CUDA platform supports multiple programming languages, such as C, C++ and Fortran [14].
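
To make the offloading model concrete, the following is a minimal CUDA C++ sketch (our own illustration, not code from MR-Leo) of the square-root example above: each GPU thread handles one array element, while the CPU only launches the kernel and copies data back.

    #include <cmath>
    #include <vector>
    #include <cuda_runtime.h>

    // Each GPU thread computes the square root of one array element.
    __global__ void sqrtKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = sqrtf(data[i]);
    }

    int main() {
        const int n = 1 << 20;                      // one million elements
        std::vector<float> host(n, 2.0f);
        float *dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        sqrtKernel<<<blocks, threads>>>(dev, n);    // runs in parallel on the GPU

        // Copying the result back implicitly waits for the kernel to finish.
        cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        return 0;
    }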


Figure 2.1: The GPU has more ALUs than a CPU [19].


Figure 2.2: A point cloud of Monmouth castle.

2.2 Point cloud

A point cloud, as its name suggests, is a collection of data points in a three-dimensional space. This is used in 3D scanning, for example: items that are scanned will have millions of data points produced that together form a shape that represents the physical object, but now as a point cloud in the computer. An example of this can be seen in Figure 2.2, which shows a point cloud of a castle. The same is done in various Simultaneous Localization And Mapping (SLAM) methods, where point clouds are generated according to the surroundings of whatever sensor is used in a given area. This gives a 3D representation of the environment as we humans would normally perceive it. Similar to the way items are 3D scanned and shown digitally on a screen, SLAM scans the environment to depict it digitally through point clouds. In MR-Leo, the point clouds representing the shapes seen by the phone's camera, which acts as the sensor, are generated at the edge device. Point clouds are handled and created with ORB-SLAM2, which is included in MR-Leo.
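
As a minimal illustration of the data structure (ours, not the representation used internally by ORB-SLAM2), a point cloud can be seen as nothing more than a large list of 3D coordinates, optionally with a colour per point:

    #include <vector>

    // One data point in three-dimensional space.
    struct Point3D {
        float x, y, z;
        unsigned char r, g, b;   // optional per-point colour
    };

    // A point cloud is simply a (possibly very large) collection of such points.
    using PointCloud = std::vector<Point3D>;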

2.3 ORB-SLAM2-CUDA

ORB-SLAM2 is the successor of ORB-SLAM, and ORB-SLAM2-CUDA is a fork of ORB-SLAM2 which intends to use CUDA to accelerate some of the heavy processes in the original. In the ORB-SLAM2 framework there are three main threads: Tracking, Local Mapping and Loop Closing; the same goes for the CUDA variant. The tracking thread is responsible for localizing the sensor within the map that is generated with SLAM methods. Because the map is continuously updated, the sensor's location within the generated map must also be updated. The local mapping thread is responsible for maintaining the generated map. The responsibility of the loop closing thread is to decide whether the sensor, after travelling for some distance, has returned to a previously visited location. The tracking thread in ORB-SLAM2 is defined such that it does not allow new frames to be processed until its execution is finished. This introduces a bottleneck in the execution of the software, since more hardware resources will not help unless that particular thread can execute faster. Accelerating the tracking thread would potentially lead to higher frame rates, since the waiting time for the tracking thread could be shorter. Exactly this is implemented in ORB-SLAM2-CUDA, which is the library we chose to replace ORB-SLAM2 with [12].


Figure 2.3: Block diagram of the Nvidia Jetson TX2 hardware. Image from [18] adapted to our work.

2.4 Nvidia Jetson TX2

The Nvidia Jetson TX2 was used, which is a small computer in Nvidia's Jetson product line, focused on high-performance, low-power devices used for artificial intelligence and edge computing. The board contains a GPU module with support for CUDA accelerated programming, making it powerful despite only having a width and depth of 17 cm. Containing a GPU that supports CUDA is important in this work since it makes it possible to handle workloads in parallel. The TX2 comes with an Nvidia Denver2 dual-core processor in addition to the quad-core Cortex-A57.

The Nvidia Jetson TX2 has dedicated hardware to perform H.264 and H.265 encoding and decoding, as can be seen in the green group on the left side of Figure 2.3. This is also an important aspect, since moving to hardware codecs instead of software ones is one of the aims of this work. The figure depicts a high-level description of the hardware the Nvidia Jetson TX2 is composed of. The CUDA cores, seen in the green circled group to the left, are part of the GPU, whereas the hardware decoder and encoder are not. CUDA cores can do a variety of things; their biggest characteristic is that they can execute code in parallel, but it is up to the programmer to decide what should be executed on the cores. The hardware encoder and decoder are not as flexible: they are hardwired to do one specific task, to either encode or decode. The benefit of narrowing down a hardware unit's feature set to one specific task is the speed at which it can accomplish that task. By eliminating the overhead of encoding and decoding in software on larger and more general hardware, a very specific hardware unit with a specific task can be much smaller and much faster. Nurvitadhi et al. [5] explore the idea of improving the execution efficiency of binarized neural networks through hardware acceleration. Their results showed that an Application-Specific Integrated Circuit (ASIC) performed significantly better than a CPU or a GPU running optimized software. Not only was the performance better, but the performance per watt was also very good. An ASIC is costly, since it is designed to do only one task, so a lot of consideration needs to go into whether it will be beneficial to create the solution in hardware. There is a fork of the ORB-SLAM2 project called ORB-SLAM2-CUDA that offloads some of the most intense computational workloads to the CUDA cores.


Figure 2.4: Example GStreamer pipeline.

2.5 GStreamer

GStreamer [6] is a pipeline-based open-source multimedia framework used for processing audio and video. The pipeline model makes it easy to add or change functionality in GStreamer. As seen in Figure 2.4, the video source is decoded with the omxh264dec decoder and sent to the nveglglessink to be displayed. Sinks are pipeline elements that receive media data and then either play the media back or archive it by writing it to a file. It is possible to extend the pipeline by adding more sinks, allowing it to take one video source and display it in two target locations. One example where this could be useful is if you want to display the image on the client device and also send it to another device for processing. GStreamer contains a variety of encoders and decoders from different libraries, like Open Media Acceleration (OMX).
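
As a sketch of how such a pipeline can be built programmatically (an illustration that assumes the Jetson-specific elements omxh264enc, omxh264dec and nveglglessink are available and that a display is attached; this is not a pipeline taken from MR-Leo), gst_parse_launch can parse a textual pipeline description, here including a tee that feeds the same decoded video to two sinks:

    #include <gst/gst.h>

    int main(int argc, char *argv[]) {
        gst_init(&argc, &argv);

        // Encode a test source with the hardware encoder, decode it again with the
        // hardware decoder, and split the result with a tee: one branch is shown on
        // screen, the other branch is a placeholder for further processing.
        GError *error = NULL;
        GstElement *pipeline = gst_parse_launch(
            "videotestsrc num-buffers=300 "
            "! video/x-raw,format=I420,width=640,height=480,framerate=30/1 "
            "! omxh264enc ! h264parse ! omxh264dec "
            "! tee name=t "
            "t. ! queue ! nveglglessink "
            "t. ! queue ! fakesink",
            &error);
        if (pipeline == NULL) {
            g_printerr("Failed to create pipeline: %s\n", error->message);
            g_error_free(error);
            return 1;
        }

        gst_element_set_state(pipeline, GST_STATE_PLAYING);

        // Block until the stream ends or an error occurs.
        GstBus *bus = gst_element_get_bus(pipeline);
        GstMessage *msg = gst_bus_timed_pop_filtered(
            bus, GST_CLOCK_TIME_NONE,
            (GstMessageType)(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));
        if (msg != NULL)
            gst_message_unref(msg);
        gst_object_unref(bus);

        gst_element_set_state(pipeline, GST_STATE_NULL);
        gst_object_unref(pipeline);
        return 0;
    }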

2.6 Related works

This thesis is based on previous work by Toczé et al. [1], in which the point cloud creation and graphics rendering for a mixed reality video stream were offloaded to the edge. The paper goes on to evaluate the latency and throughput of a number of different configurations. One of the findings in the report was that using a high-end CPU had negligible impact on the time spent on graphics, and they proposed that a graphics accelerator would be needed to decrease it. In the paper they show a measured frame round trip time of 392 ms or lower in 90 percent of the measurements. In the work by Lindqvist [3], ORB-SLAM2 is used for point cloud generation, but that work lacked hardware acceleration in the graphics generation. Lindqvist recommended further investigation into implementing hardware acceleration for ORB-SLAM2, encoding, and decoding. In the work by Aldegheri et al. [15], real-time point cloud computation with the ORB-SLAM2 library was achieved on the Jetson TX2. In their work they optimized ORB-SLAM2 to work better in real-time scenarios, which gave us some indication that the recommendation by Lindqvist could be implemented, since MR-Leo is also a real-time application.

Bourque [4] showed some promising results when it came to hardware accelerating the ORB-SLAM framework. By leveraging CUDA on the Jetson TX2 they saw, on average, an increase of approximately 33% in frames per second. To hardware accelerate the ORB-SLAM framework, the author used a CUDA enabled variant of ORB-SLAM2 developed by the GitHub user yunchih¹. Since we wanted to hardware accelerate ORB-SLAM2, we considered using this CUDA enabled version. We did, however, find a more recent implementation by GitHub user thien94², which was based on the previous work by yunchih. As seen in the figures published by the developers of that fork³, the SLAM framework's performance increased between 50% and 200% when using the Nvidia Jetson TX1. Even though this work is not peer reviewed, we expected the performance increase to be similar or better. Additionally, in the work by Kaldestad et al. [16] they found that, by using a GPU with CUDA, the time to generate a point cloud was smaller than when using a CPU, which further indicates that MR-Leo might benefit from using CUDA to generate point clouds.

¹ https://github.com/yunchih/ORB-SLAM2-GPU2016-final
² https://github.com/thien94/ORB_SLAM2_CUDA
³ https://yunchih.github.io/ORB-SLAM2-GPU2016-final/


3 Hardware implementation and testing methodology

This chapter covers the methods that were used to hardware accelerate MR-Leo. As Section 1.4 brought up, the ORB-SLAM2 component and the video streaming component were targeted for acceleration. The chapter also shows how the components were removed and replaced, along with other adjustments that were made to get the updated components working in MR-Leo. It further describes the testing methodology that was used to benchmark the different versions of MR-Leo and the specific metrics that showcase the performance of different parts of MR-Leo. The test environment is also explained in this chapter.

3.1 Installation

Because of mistakes made when installing libraries, and because we ran out of storage, factory resetting the Nvidia Jetson TX2 was the easiest solution at the time. The Nvidia Jetson TX2 runs the Linux distribution Ubuntu, specifically version 18.04, as its operating system (OS). To factory reset the Jetson and reinstall everything back to a base state, it was not possible to install Ubuntu 18.04 from a USB drive as one would normally do; when we tried to reinstall Ubuntu this way, the Jetson would not start anymore. We believe that the Ubuntu 18.04 version the Jetson runs is slightly modified for the Jetson. The Jetson's OS must instead be installed with the Nvidia SDK Manager tool provided by Nvidia. This requires an additional computer to run the tool on; the additional computer is connected to the Jetson with a USB cable, which is used as the medium to install the OS on the Jetson.

3.1.1 Server

A comprehensive and easy to follow guide on how to compile and install the server application is provided by the developer of MR-Leo. The first hurdle that needed to be dealt with was that the version of OpenCV [13] that the package manager on the Jetson installs is the wrong version. This could be because the Jetson runs on an ARM architecture; we did try following the instructions on an x86 platform, and there the instructions worked just fine and everything installed. The only differing variables are the processor and, to some extent, the OS: it is the same OS but built for a different architecture, and the package manager's sources may vary between these different versions of the OS. We decided to download and install the correct version of what is needed straight from OpenCV instead of relying on the package manager, since the package manager will most likely install the latest version of whatever is requested from it.

To install the correct version of OpenCV, we downloaded the specific version along with the contrib repository, which needs to be of the same version. The contrib repository houses extra modules for OpenCV. These usually do not have a stable API or are not that well tested, which is why they are not included in the standard version. Errors were noticed when we tried to compile MR-Leo without the extra modules; these errors described undefined references to OpenCV symbols that were not present in the standard version. Installing OpenCV with the extra modules from the contrib repository resolved these issues and MR-Leo could be built.

The second problem we encountered was that there was no picture at all. This was fixed by installing Pangolin [17] and activating Pangolin in the installation of MR-Leo-Server. Pangolin is a library for managing OpenGL displays. Because Pangolin is lightweight and portable, it is quick to go from idea to prototype with it. MR-Leo-Server already supports Pangolin, but it was not used as the default way of rendering images. The developer of MR-Leo has since changed this so that Pangolin is used by default, so there should be no problem in the future.

3.1.2 End Device

The installation on the end device was done by following the guide provided in the MR-Leo git repository. Since we are focusing on improving the edge device and not the end device, we only need to make sure that the application is able to run as expected. The identifycolor argument was used in the MR-Leo application to measure the FRTT. This causes the application to probe the incoming images for the value #F0F, which is a colour value that was added to the top of the video. The probe works by measuring the pixel data of the incoming images multiple times, with a set offset between the measurements, until it reaches a preset maximum offset value. If the colour probe at any time in a given image does not find the correct colour value, the measurement is cancelled. Changing the encoder caused the colour probe to fail every time; this is likely caused by the different way the codecs encode the images, which made the probe measure outside the target area. By shortening the maximum offset value, the FRTT could be measured on the OMX encoded images.
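
The probing idea can be sketched as follows (a simplified illustration with our own function name, thresholds and default values, not the actual MR-Leo client code): pixels are sampled downwards from the top of the frame with a fixed step, and a single miss of the marker colour cancels the measurement. Shortening the maximum offset corresponds to the fix that made the probe work with the OMX encoded images.

    #include <opencv2/core.hpp>

    // Simplified sketch of the identifycolor probe: sample pixels in one column,
    // stepping downwards from the top of the frame, and report whether the marker
    // colour #F0F (magenta) is present within maxOffset rows.
    bool probeForMarker(const cv::Mat &frameBgr, int step = 4, int maxOffset = 64) {
        const int col = frameBgr.cols / 2;                    // probe the middle column
        for (int row = 0; row < maxOffset && row < frameBgr.rows; row += step) {
            cv::Vec3b px = frameBgr.at<cv::Vec3b>(row, col);  // B, G, R
            bool magenta = px[2] > 200 && px[0] > 200 && px[1] < 60;
            if (!magenta)
                return false;   // a single miss cancels the measurement
        }
        return true;            // all sampled pixels matched the marker colour
    }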

3.2 Hardware implementation

This section explains how we modified the base MR-Leo server to run parts of the code on the hardware.

3.2.1 ORB-SLAM2-CUDA

In the MR-Leo server source folder there are two folders: one called externals and one called source. The externals folder accommodates technologies that were not developed as part of MR-Leo, such as ORB-SLAM2 and Pangolin. To make ORB-SLAM2-CUDA work with MR-Leo's codebase, we first had to swap out ORB-SLAM2 for ORB-SLAM2-CUDA. It was not as simple as trying to build everything straight away, since we got missing library errors and, in some places, code was missing. We looked at how the ORB-SLAM2 code in the original work was used and made it work with ORB-SLAM2-CUDA. The changes in the code were mostly related to using the GPU to perform calculations instead; some functions were moved entirely to a separate cuda folder where they were rewritten using CUDA, allowing the calculations to run in parallel. For the missing libraries, we needed to change the multiplatform project file to incorporate the CUDA libraries, so that when ORB-SLAM2 builds alongside the whole project it can find the necessary CUDA libraries it needs to compile.


3.2.2 Hardware encoding/decoding

The MR-Leo server code uses GStreamer as the multimedia framework to stream the video between the end user and the server. Since GStreamer is a pipeline-based framework, we could look at the pipelines the developer of MR-Leo created: one for transmission and one for receiving. For decoding, libav's H.264 decoder [8] was used. For encoding in the transmission, x264enc [7] was used, which is part of a free and open-source software library by VideoLAN. These codecs have one thing in common: they do the encoding and decoding in software. Because H.264 encoding and decoding is still a heavy task for modern day computers, people have sought to do this in hardware instead. The Jetson has specific hardware to perform H.264 decoding and encoding. By consulting Nvidia's user guide for accelerated GStreamer, the elements omxh264enc and omxh264dec were found [10]. Omxh264enc can replace the x264enc part of the pipeline to perform hardware accelerated encoding. The same goes for the decoder, where avdec_h264 can be replaced with omxh264dec to perform hardware accelerated decoding.
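
To make the substitution concrete, the following sketch shows two self-contained pipeline descriptions of the kind that can be passed to gst_parse_launch as in Section 2.5 (these strings are illustrative only; the real MR-Leo pipelines contain additional elements for networking and application integration):

    // Software encoding/decoding as in the Base version.
    const char *software_pipeline =
        "videotestsrc num-buffers=300 "
        "! video/x-raw,format=I420,width=640,height=480 "
        "! x264enc ! h264parse ! avdec_h264 ! fakesink";

    // Hardware encoding/decoding as in the HW version: the Jetson OMX elements
    // drop in at the same positions in the pipeline.
    const char *hardware_pipeline =
        "videotestsrc num-buffers=300 "
        "! video/x-raw,format=I420,width=640,height=480 "
        "! omxh264enc ! h264parse ! omxh264dec ! fakesink";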

3.3 Testing methodology

Three modified versions of the MR-Leo server code were created and kept alongside the unmodified version from the developer of MR-Leo.

• Base

Base is the original unmodified MR-Leo. We decided that a base was needed since the test results are going to be highly dependent on the hardware, and this makes it so that we have something to compare to.

• HW

HW is a hardware version in which the regular encoder and decoder were swapped out in favour of the OMX library. This is to isolate the test results we get from changing the encoder and decoder from the other changes.

• CUDA

CUDA is a version in which the ORB-SLAM2 was replaced with a CUDA accelerated variant that was modified to work with MR-Leo. Same as in HW we wanted to isolate the test results we get from the CUDA accelerated variant.

• CUDA-HW

CUDA-HW is a version in which the changes in both HW and CUDA have been implemented. We added this to see how combining all changes would impact the performance.

Each version was run 30 times to gather sample data. It took approximately 40-60 minutes to do 30 tests; in total 120 tests were sampled across all versions. All test results were collected and sorted into folders. They were then analysed and used as the building blocks for graphs. The graphs for these metrics are portrayed as cumulative distribution function (CDF) graphs, which are presented in Chapter 4. More details on running the MR-Leo application can be found in appendix entry A.1.

3.3.1 Metrics

We used metrics from the work by Lindqvist [3] when measuring the efficiency of our method. We believe these metrics are adequate when looking at the overall performance of the edge device and can point out which parts of the edge device benefited from hardware acceleration and which parts did not. MR-Leo outputs the statistics for these metrics after each run, which is helpful later on when we analyse the results.


3.3.1.1 Frame Round Trip Time (FRTT)

FRTT is a measurement of the time, in milliseconds, from the moment a frame is sent from the end device to the moment the frame, processed by the edge device, returns to the end device.

3.3.1.2 Processed frames

Received frames are measured in frames per second (FPS): the number of frames retrieved by the end device each second without getting dropped or lost. A dropped frame means that the frame was stale and removed from the queue. A lost frame means that the frame somehow disappeared, which will most likely happen when using UDP since it does not ensure that the sent data is received reliably.

3.3.1.3 Time to virtual element (T2VE)

T2VE is the time taken from the moment a virtual element is added by the end user on their device to the display of that element in the augmented video stream that is created at the edge device and then sent back to the end device. It is measured in milliseconds.

3.3.1.4 Mixed reality processing time (MR-Time)

Mixed reality processing time, which we will refer to as MR-Time, is the time it takes for a frame to be processed by the edge device. This metric was added to measure the time spent rendering, while not considering the transmission delay between the end and edge devices. It was only measured a couple of times, for Base and CUDA; since the hardware encoding and decoding tests do not modify ORB-SLAM2, those versions should not differ noticeably from Base in this respect. The reason this test was only run a couple of times, instead of 30 times as for the previous metrics, is that this metric was only printed a long time after the benchmark had ended. Adding this additional time to each test would significantly increase the time spent on testing.

3.4 Test environment

The testbench, seen in Figure 3.1, on which the tests were made consists of an Asus RT-AC51U router, a Jetson TX2, and a Huawei P9 smartphone. More details of the hardware can be found in appendix entry A.2.

The software was set up by installing the MR-Leo server on the intended edge device, the Jetson TX2, and connecting it through a router to a smartphone with the MR-Leo client installed. To minimize the amount of disturbance in the testing environment, the router was disconnected from the internet and no devices other than the end device were connected to the router. The Huawei P9 was reset to factory settings and we tried to either disable or uninstall all applications in the application settings. We had to multiply the OMX encoder bitrate by 1000, since the original encoder, x264enc, measures bitrate in kbit/s while omxh264enc measures it in bit/s. This was discovered when we used the same bitrate setting, which resulted in noticeably reduced image quality. The benchmarks were made on a 60 second long video² with 640x480 resolution and a frame rate of 30 frames per second. A screenshot from the video can be seen in Figure 3.2. The MR-Leo client application tries to insert a virtual element every ten seconds, with the first one inserted five seconds into the video, resulting in six samples in total.

² https://gitlab.liu.se/ida-rtslab/public-code/2019
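
The unit difference can be made explicit in code. The following sketch (ours; the property units are taken from the respective GStreamer element documentation, and the element handles are created here only for illustration) sets the same 2000 kbit/s target on both encoders:

    #include <gst/gst.h>

    int main(int argc, char *argv[]) {
        gst_init(&argc, &argv);

        // x264enc interprets its "bitrate" property in kbit/s, while omxh264enc
        // interprets it in bit/s, so the same target needs a factor of 1000.
        GstElement *sw_enc = gst_element_factory_make("x264enc", "sw-enc");
        GstElement *hw_enc = gst_element_factory_make("omxh264enc", "hw-enc");  // Jetson only

        if (sw_enc != NULL)
            g_object_set(sw_enc, "bitrate", 2000, NULL);      // 2000 kbit/s
        if (hw_enc != NULL)
            g_object_set(hw_enc, "bitrate", 2000000, NULL);   // 2 000 000 bit/s

        if (sw_enc != NULL) gst_object_unref(sw_enc);
        if (hw_enc != NULL) gst_object_unref(hw_enc);
        return 0;
    }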


Figure 3.1: Picture of the testbench.

Figure 3.2: Reference video used for testing.


3.5 Edge application configuration

To conduct the tests we used the same configuration for all tests to ensure consistency. The transmission was made using H.264 software encoding over TCP from the end device, and H.264 over TCP from the edge device. The bitrate was 2000 kbit/s both ways. The configuration screen can be seen in Figure 3.3.


4 Results

This chapter presents the results acquired after running the different versions described in Chapter 3. CDFs were created for three of the metrics in Section 3.3.1, and an additional graph for the mixed reality processing time was added to examine the point cloud creation time. We chose to display the results as CDFs and to read them at the 90th percentile, which limits the influence of the worst 10% of the results (the outliers) and lets us present the performance that can be expected in at least 90% of the cases.
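
Concretely, reading a value off a CDF at the 90 percent mark is the same as taking the 90th percentile of the sample set. A minimal sketch of that computation (our own helper, not part of MR-Leo) is:

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Reads off the value at the given percentile of a sample set, i.e. the point
    // on the x-axis where the CDF reaches pct/100. Assumes a non-empty sample set.
    double percentile(std::vector<double> samples, double pct) {
        assert(!samples.empty());
        std::sort(samples.begin(), samples.end());
        std::size_t idx =
            static_cast<std::size_t>((pct / 100.0) * (samples.size() - 1));
        return samples[idx];
    }

    // Example: percentile(frttSamples, 90.0) gives the FRTT that 90% of the
    // measurements do not exceed.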

4.1 T2VE

The CDF of the T2VE was constructed from the available measurements from the benchmarks. As mentioned previously, a virtual element should be added every 10 seconds, resulting in 6 samples from each test, which would give 180 samples for each scenario. The sixth element does not appear, since the benchmark runs for less than 60 seconds when accounting for latency. Another reason why a virtual element cannot always be added is that dropped frames make the point cloud unstable, which in turn does not allow the virtual element to be placed. The number of datapoints from each scenario therefore differs, as seen in Table 4.1. Base had the lowest number of samples with 105, and HW had the most with 140. The reason for this difference is that the number of virtual elements that could be added during each benchmark differs.

                    Base   CUDA   HW   CUDA-HW
Number of samples    105    110   140       127

Table 4.1: Number of T2VE datapoints from each scenario.

As seen in Figure 4.1, at the 90th percentile the Base configuration had the lowest latency with 2707 ms, and CUDA had the highest with 6842 ms. Comparing this result to the T2VE in the previous work by Toczé et al. [1], the 90th percentile in their work is 1122 ms, while our unmodified Base test is at 2707 ms.


Figure 4.1: CDF of the T2VE.

4.2 Processed frames

The resulting data from the benchmark was limited to seconds 0-59, since no frames were received from 60 seconds and onwards.

Figure 4.2 shows that when measuring the number of received frames each second, the 90th percentile is larger in CUDA and CUDA-HW than Base and HW. The 90th percentile for HW and Base is 14 FPS, while it is 25 and 26 FPS in CUDA and CUDA-HW respectively. A perfect result in this case would mean that the line would be flat at the bottom of the figure, and then straight to 1 at 30 FPS, meaning that the FPS would always be 30 as in the source video.

Figure 4.3 is the same as Figure 4.2, but it is inverted to show the number of dropped frames each second.

Figure 4.2: CDF of the number of received frames each second.

The frame rate at the 90th percentile for HW and Base is 14 FPS, while CUDA and CUDA-HW reach 25 and 26 FPS respectively.

4.3 FRTT

The CDF for the FRTT was created from the available data from the benchmark. As for the T2VE, the number of samples in each scenario differs, as seen in Table 4.2. Base had the lowest number of samples with 142 and CUDA-HW had the most with 174.


Figure 4.3: CDF of the number of lost frames each second.

The reason for this difference between the configurations is frames dropped during the colour probe. Since the video is coloured for the first time 5 seconds into the video and thereafter every 10 seconds, the number of samples should be 180 for each scenario.

                    Base   CUDA   HW   CUDA-HW
Number of samples    142    163   172       174

Table 4.2: Number of FRTT datapoints from each scenario.

Figure 4.4: CDF of the FRTT.

As can be seen in Figure 4.4, the 90th percentile for Base was 2849 ms, which was the best result. The worst result was CUDA-HW with 6583 ms at the 90th percentile. CUDA and HW had values of 4393 ms and 5729 ms respectively at the 90th percentile.

4.4 Mixed reality processing time

The average processing time for the rendered frames is displayed together with the MR-Time in Figure 4.5. This shows how much of the total time is spent on the mixed reality, and as seen in the figure, CUDA performs better in both MR-Time and total time. The improvement in MR-Time from Base to CUDA was on average 22 ms, or a 28.9% decrease in time. The improvement in total time was 23 ms, which means that there was a 1 ms improvement unrelated to MR-Time.


Figure 4.5: The average time it took for the edge device to process the incoming frames, with the mixed reality processing time shown as a part of the total time. (Average values: Base 76 ms MR-Time, 94 ms total; CUDA 54 ms MR-Time, 71 ms total.)


5 Discussion

This chapter discusses the results gathered in the previous chapter and the thesis work itself, but also some choices that were made and limitations that were discovered along the way.

5.1 Impact of CUDA accelerated ORB-SLAM2

CUDA accelerated ORB-SLAM2 alone does not seem to be enough to improve the latency; indeed, it increases the frame round trip time considerably. We can see in Figure 4.2 that it drops fewer frames, and as seen in Figure 4.5, the rendering time on the edge device is reduced. The CUDA acceleration of the point cloud meant that fewer frames were dropped, and even fewer when the CUDA acceleration was combined with hardware encoding/decoding. In Figure 4.1 we can see that the 90th percentile T2VE was higher in all cases where CUDA was used for ORB-SLAM2, which is likely connected to the increase in FRTT.

5.2 Impact of HW accelerated encoding/decoding

Unfortunately, we could not find a way to use the Jetson's GPU for the hardware encoding or decoding. The hardware encoder and decoder that we used is a separate hardware block on the Jetson, as can be seen in Section 2.4. It would have been interesting to see if the GPU could have accelerated the encoding and decoding even faster than this dedicated hardware block.

Hardware accelerating the encoding and decoding of the video stream did not yield a favourable result. In Figure 4.4, for example, we see that the configurations that used hardware codecs gave the worst FRTT results. When compared to the software variants, the received image falls far behind the camera feed, which Figure 4.4 shows. Something we noticed during the testing was that whenever we ran the hardware variants, the image was black for a long period of time on the edge device, almost as if there were some sort of initialization period for the hardware codecs. This is speculation, but we feel it could be of interest.


5.3 Jetson TX2 as an edge device

The Jetson TX2, as previously stated in this work, uses an ARM processor. ARM processors are known for using a reduced instruction set and are typically found in mobile devices. Take one of our results as an example, the T2VE: in Section 4.1 we can see that our result was 2707 ms at the 90th percentile, whereas Toczé et al. [1] had a result of 1122 ms. By running the same application with no modifications we are already more than twice as slow, and the difference here comes down to two things: the architecture and the hardware. The processor in the Jetson TX2 is designed by ARM and uses a reduced instruction set computer architecture, whereas the processor Toczé et al. [1] used was designed by Intel and uses a complex instruction set computer architecture. We do not think the trends in our testing would change by changing the edge device to a computer with a faster processor or even a different architecture, but it could still be interesting to establish that this is the case by using a different edge device.

5.4 Possible scalability with CUDA

We speculate that the application would benefit more from CUDA with streams of frames at higher resolutions, as the current test video has a resolution of 640x480 pixels. It could be that this resolution is low enough that the CPU's superior speed outperforms the GPU's ability to calculate large workloads faster. There is a potential that the GPU can handle higher quality video streams, e.g. 1280x720 pixels (720p), better than the CPU, since it might not be utilizing all available CUDA cores, meaning that there could be some headroom for larger workloads. This could mean that an increased video resolution would affect the CPU's performance more negatively than the GPU's because of the increased size of the workloads. The cost of initializing the workloads might be what makes the FRTT in the CUDA scenarios increase, since the benefit of using the GPU might not be large enough in the tested scenarios compared to the cost of initializing the workload.

5.5 Edge device processing time

As the results showed, the changes made to the code worsened the FRTT. To investigate further, we ran a couple of additional tests with the same video, but this time looked at the mixed reality time metric in the CUDA and Base versions to see if the deteriorated FRTT was due to the processing time.

We discovered that the edge device on average spends less time processing each frame, but even with this improvement, the resulting FRTT was still worse than in the base test. This means that somewhere among the configuration changes the FRTT was degraded, even though the mixed reality time was shorter. In Figure 4.2 we can see that the CUDA-HW version of MR-Leo had the highest frame rate of them all at the 90th percentile. More frames typically sounds like a good thing, but maybe the increase in frames per second caused the FRTT to increase by creating a queue of frames that needed to be handled serially. It could be that MR-Leo is not optimized for a higher throughput, which in this case would mean more frames per second. Further investigations would be needed to isolate the parameter or parameters that caused the increasing FRTT, but this was left out due to the time frame of the project.

5.6 The work in a broader context

We believe that the work done in this thesis will help get the next person improving MR-Leo up to speed on the process of preparing the Jetson to run MR-Leo. This work shows a few obstacles that take time to troubleshoot if one is not well-versed with Ubuntu or similar operating systems running Linux: getting ORB-SLAM2-CUDA to work in MR-Leo, and downloading, building and installing the correct OpenCV version, to name a few. We also believe that our work provides some insight into how some improvements to the MR-Leo server application can be achieved and the potential issues that emerge.


6 Conclusion

This chapter concludes the thesis work, answers the questions that were laid out in Section 1.3 and lists items that can be researched or tried to further the work.

6.1 Performance

Looking back at the research question "What effect will CUDA support for ORB-SLAM2 in the MR-Leo server code have on the performance?", we can see that the current implementation of CUDA accelerated ORB-SLAM2 resulted in worse performance when looking at the FRTT. The FRTT of CUDA accelerated ORB-SLAM2 was 54% larger than the Base value when comparing their respective 90th percentiles. Performance gains could be seen in the number of processed frames when using CUDA, where the 90th percentile frame rate was 79% larger with CUDA compared to Base. The MR-Time shows an average decrease of 28.9% in time spent rendering.

Looking at the second question in Section 1.3, i.e. "What effect will hardware accelerated encoding/decoding in the MR-Leo server code have on the performance?", we found that the hardware accelerated encoding and decoding did not provide any meaningful difference when looking at the processed frames. The hardware accelerated encoding and decoding did, however, impact the T2VE and FRTT even more negatively than the CUDA accelerated ORB-SLAM2 did, which was surprising. The 90th percentile T2VE was 105% higher compared to Base, and the 90th percentile FRTT was 102% higher compared to Base. The only improvement was the number of processed frames, which was 9% larger at the 90th percentile. Combined with CUDA, the hardware accelerated encoding and decoding improved the number of processed frames each second, but made the timing performance worse than CUDA alone or the CPU-only case with software encoding/decoding.

As a conclusion, the CUDA accelerated ORB-SLAM2, did improve the performance when considering the number of processed frames, and could potentially be a better way to offload mixed reality workloads to the edge if the FRTT had also been reduced. The MR-Time shows that the CUDA implementation improved the processing time but that more studies are needed to understand the impact on FRTT.

As for the hardware accelerated encoding and decoding, the OMX encoder and decoder only showed a 9% greater number of processed frames at the 90th percentile, which makes the OMX library hard to argue in favour of, considering that it more than doubled the T2VE and FRTT. There might be improvements in the implementation that we missed, since we had no previous experience with GStreamer.

6.2 Future work

This section includes recommendations for furthering this work. As our results showed, we did not get the results we were hoping for. These are areas we believe are worth looking into to improve the understanding of the problem and finally shorten the end-to-end latency.

6.2.1 Hardware encoding and decoding on the GPU

In the future there might be a possibility that Nvidia adds hardware encoding and decoding support using the GPU in GStreamer. We think this could be interesting since Nurvitadhi et al. [5] showed that an ASIC performed better than a GPU or a CPU in their specific case, and it would be interesting to see if that would be the case in this work as well. If not, it could be that the ASIC onboard the Jetson is not that powerful, but more testing and research needs to be done to establish that. At the time of writing this thesis, no such GPU-based support was provided by Nvidia for the Jetson TX2.

Our work was affected by something that was discussed in Section 5.2: there could have been an issue with the initialization of the hardware encoder and decoder. This could be worthwhile to look into, to see if perhaps the GStreamer pipelines can be improved somehow or if it boils down to the actual hardware.

6.2.2 Further acceleration with more CUDA

Because of time constraints, we used a fork of ORB-SLAM2 called ORB-SLAM2-CUDA which had accelerated ORB-SLAM2 to some degree with CUDA. We believe that it could be accelerated even further: for almost any OpenCV function we came across, we found a GPU counterpart. This means that most of the code could be sent over to the GPU to be executed there instead. This must be carefully considered, though; otherwise there is a risk of utilizing the GPU poorly and making ORB-SLAM2 slower.

Further improvement might come from using CUDA to draw the 3D objects or to create the Canny filter. The Canny filter is used for the point cloud creation and is applied to every frame, which might benefit from using CUDA. Drawing 3D objects could also benefit from CUDA, since it is a typical GPU workload.
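
As an example of such a GPU counterpart (a sketch assuming OpenCV is built with the CUDA modules as in Appendix A.3; this is not code currently in MR-Leo, and the thresholds are illustrative), the Canny filter has a cv::cuda variant that could run on every incoming frame:

    #include <opencv2/core.hpp>
    #include <opencv2/core/cuda.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/cudaimgproc.hpp>

    // CPU version: cv::Canny on a grayscale frame.
    cv::Mat cannyCpu(const cv::Mat &gray) {
        cv::Mat edges;
        cv::Canny(gray, edges, 50.0, 150.0);
        return edges;
    }

    // GPU counterpart: upload the frame, run the CUDA Canny detector,
    // and download the result back to host memory.
    cv::Mat cannyGpu(const cv::Mat &gray) {
        cv::cuda::GpuMat d_gray(gray), d_edges;
        cv::Ptr<cv::cuda::CannyEdgeDetector> canny =
            cv::cuda::createCannyEdgeDetector(50.0, 150.0);
        canny->detect(d_gray, d_edges);
        cv::Mat edges;
        d_edges.download(edges);
        return edges;
    }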

6.2.3 Improving FRTT

Despite some improvements to the frame rate, there were complications with the FRTT. The FRTT was not improved, nor did it stay the same; it actually became worse after our changes to the MR-Leo project. Further investigation and research could go into understanding the cause of the FRTT increase despite the acceleration.

When comparing even the baseline test with the previous work, the FRTT in this work is considerably larger which does not rule out the possibility that our testing methodology may have been flawed.


Bibliography

[1] Klervie Toczé, Johan Lindqvist, and Simin Nadjm-Tehrani. "Performance Study of Mixed Reality for Edge Computing". In: Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing. UCC'19. Auckland, New Zealand: Association for Computing Machinery, 2019, pp. 285-294. isbn: 9781450368940. doi: 10.1145/3344341.3368816. url: https://doi.org/10.1145/3344341.3368816.

[2] D. Chatzopoulos, C. Bermejo, Z. Huang, and P. Hui. "Mobile Augmented Reality Survey: From Where We Are to Where We Go". In: IEEE Access 5 (2017), pp. 6917-6950.

[3] Johan Lindqvist. "Edge Computing for Mixed Reality". MA thesis. Sweden: Linköping University, 2019.

[4] Donald Bourque. "CUDA-Accelerated ORB-SLAM for UAVs". MA thesis. USA: Worcester Polytechnic Institute, 2017.

[5] E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G. Venkatesh, and D. Marr. "Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC". In: 2016 International Conference on Field-Programmable Technology (FPT). 2016, pp. 77-84.

[6] GStreamer: open source multimedia framework. url: https://gstreamer.freedesktop.org/.

[7] VideoLAN: open-source portable cross-platform media player software and streaming media server. url: https://www.videolan.org/index.html.

[8] libav: free software project that produces libraries and programs for handling multimedia data. url: https://libav.org/.

[9] NVIDIA Jetson Linux Developer Guide: Multimedia | NVIDIA Docs. url: https://docs.nvidia.com/jetson/l4t/index.html. Accessed: 2020-05-31.

[10] OpenMAX: royalty-free, cross-platform API that provides comprehensive streaming media codec and application portability. url: https://www.khronos.org/openmax/.

[11] R. Mur-Artal and J. D. Tardós. "ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras". In: IEEE Transactions on Robotics 33.5 (2017), pp. 1255-1262.

[12] ORB-SLAM2-CUDA: modified version of ORB-SLAM2 with GPU enhancement. url: https://github.com/thien94/ORB_SLAM2_CUDA.

[13] OpenCV: a library of programming functions mainly aimed at real-time computer vision. url: https://opencv.org/.

[14] Programming Guide :: CUDA Toolkit Documentation. url: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-general-purpose-parallel-computing-architecture.

[15] Stefano Aldegheri, Nicola Bombieri, Domenico D. Bloisi, and Alessandro Farinelli. "Data Flow ORB-SLAM for Real-time Performance on Embedded GPU Boards". In: IEEE International Conference on Intelligent Robots and Systems (2019), pp. 5370-5375. issn: 21530866. doi: 10.1109/IROS40897.2019.8967814.

[16] K. B. Kaldestad, G. Hovland, and D. A. Anisi. "3D Sensor-based obstacle detection comparing octrees and point clouds using CUDA". In: Modeling, Identification and Control 33.4 (2012), pp. 123-130. issn: 03327353. doi: 10.4173/mic.2012.4.1.

[17] Pangolin: a lightweight portable rapid development library for managing OpenGL display / interaction and abstracting video input. url: https://github.com/stevenlovegrove/Pangolin.

[18] NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge. url: https://developer.nvidia.com/blog/jetson-tx2-delivers-twice-intelligence-edge/.

[19] CUDA C Programming Guide (v10.2). url: https://docs.nvidia.com/cuda/archive/10.2/cuda-c-programming-guide/index.html.


A Implementation Details

A.1 MR-Leo arguments

To benchmark the MR-Leo application we used the arguments -lib.

• l – logtime
Log the time and print results after 60 seconds.

• i – identifycolor
Look for pictures filled with the data colour #F0F and act on it. Used for benchmarking.

• b – benchmarking
Fill the screen with a solid colour when the point cloud or MR objects are visible.

A.2 Test environment hardware

The router used was the Asus RT-AC51U¹, which features the IEEE 802.11ac standard. The end device used was a Huawei P9² smartphone running Android Nougat (7.0).

A.3 OpenCV installation

The OpenCV version that was used in this thesis to get the MR-Leo application to work was version 3.4.10. This version can be downloaded from OpenCV's website, but we downloaded it from GitHub since it was very easy to browse between different versions there. The same goes for the contrib repository; it must also be version 3.4.10.

The commands to install OpenCV are listed below.

¹ https://www.asus.com/se/Networking/RTAC51U/specifications/
² https://consumer.huawei.com/se/support/phones/p9/


cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=/usr \
-DBUILD_PNG=OFF \
-DBUILD_TIFF=OFF \
-DBUILD_TBB=OFF \
-DBUILD_JPEG=OFF \
-DBUILD_JASPER=OFF \
-DBUILD_ZLIB=OFF \
-DBUILD_EXAMPLES=ON \
-DBUILD_JAVA=OFF \
-DBUILD_opencv_python2=ON \
-DBUILD_opencv_python3=OFF \
-DENABLE_PRECOMPILED_HEADERS=OFF \
-DWITH_OPENCL=OFF \
-DWITH_OPENMP=OFF \
-DWITH_FFMPEG=ON \
-DWITH_GSTREAMER=OFF \
-DWITH_GSTREAMER_0_10=OFF \
-DWITH_CUDA=ON \
-DWITH_GTK=ON \
-DWITH_VTK=OFF \
-DWITH_TBB=ON \
-DWITH_1394=OFF \
-DWITH_OPENEXR=OFF \
-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.0 \
-DCUDA_ARCH_BIN=6.2 \
-DCUDA_ARCH_PTX="" \
-DINSTALL_C_EXAMPLES=ON \
-DINSTALL_TESTS=OFF \
-DOPENCV_TEST_DATA_PATH=../opencv_extra/testdata \
-DOPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-3.4.10/modules \
..

make -j4
sudo make install


A.4 Pangolin installation

Installing Pangolin is fairly simple. Download the latest version from their website and install the dependencies that they list on their guide.

To build Pangolin, we used the commands that they list on their page, and to install it we used make. We also used ldconfig to make sure the library files are synced.

cd Pangolin
mkdir build
cd build
cmake ..
cmake --build .
sudo make install
sudo ldconfig
