
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Information Technology

2020 | LIU-IDA/LITH-EX-G--20/063--SE

Speeding up a mixed reality application: A study of two encoding algorithms

Att snabba upp en applikation med blandad verklighet: En undersökning av två kodekalgoritmer

Jesper Elgh

Ludvig Thor

Supervisor: Simin Nadjm-Tehrani
Examiner: Marcus Bendtsen



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Students in the five-year Information Technology program complete a semester-long software development project during their sixth semester (third year). The project is carried out in mid-sized groups, and the students implement a mobile application intended to be used in a multi-actor setting, currently a search and rescue scenario. In parallel they study several topics relevant to the technical and ethical considerations in the project. The project culminates in the demonstration of a working product and a written report documenting the results of the practical development process, including requirements elicitation. During the final stage of the semester, students form small groups and specialise in one topic, resulting in a bachelor thesis. The current report presents the results obtained during this specialisation work. Hence, the thesis should be viewed as part of a larger body of work that is required to pass the semester and that fulfils the conditions and requirements for a bachelor thesis.


Abstract

Mixed reality (MR) or augmented reality (AR) is a way of combining the digital world with the real world. It is utilized in a number of different applications. The mobile devices of today may not always provide sufficient processing power to handle these kinds of applications.

One way of increasing the amount of available processing power is to offload the processing to a computer that sits on the edge of the network, for example residing in part of an access network (5G, or a dedicated unit connected to a router today). The upside is that a more powerful computer provides far more processing power than the mobile device itself. The downside is that it increases latency, which can deteriorate the user experience.

To be able to send video to and from the edge computer, the video frames have to be encoded using an encoding algorithm. In this study we implemented two different encoding algorithms, VP8 and H.265, into an existing open source mixed reality application and studied how they affect the responsiveness of the application compared to the already implemented encoders, as well as how they perform when encoding at different bitrates.

It is shown that the encoding algorithm affects the responsiveness to a certain degree, where the VP8 codec was the overall best performer in terms of responsiveness combined with visual quality. One major reason is that the VP8 codec had much faster decoding compared to the other codecs.

The bitrate affects the encoding and decoding speed: higher bitrates resulted in longer encoding times. Higher bitrates also made the application less responsive, which shows a correlation between bitrate magnitude and responsiveness.


Acknowledgments

We would like to thank the MR-Leo application author Johan Lindqvist for his help with our questions during our time writing this thesis. We would also like to thank our supervisor Simin Nadjm-Tehrani for helping us move forward with the project. Also thanks to Klervie Toczé for her help with the application and input on the thesis.

Thanks as well to the GStreamer community for answering our questions in times of need, on both Reddit and their mailing lists.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Aim
  1.2 Research questions
  1.3 Related work
  1.4 Delimitations

2 Background
  2.1 Mixed reality
  2.2 MR-Leo
  2.3 Video encoding and decoding
  2.4 GStreamer

3 Encoding and measurement of latency
  3.1 Implementation of encoders
  3.2 Experiment setup
  3.3 Measurement method

4 Baseline evaluation
  4.1 Time to virtual element
  4.2 Encoding and decoding speed for different bitrates

5 Measurements of VP8 and H.265
  5.1 Time to virtual element
  5.2 Encoding and decoding speed for different bitrates
  5.3 Validity test

6 Discussion
  6.1 The scope of the experiments
  6.2 Method

7 Conclusion
  7.1 Future work


List of Figures

2.1 The app when selecting settings.
2.2 The app when running mixed reality and drawing houses.
3.1 The setup with the laptop as edge device, phone as client and router to connect them. The cable connecting the phone to the laptop makes it possible to compile and run the app directly from Android Studio on the laptop.
4.1 T2VE using H.264 over TCP both to and from the server.
4.2 T2VE using H.264 over UDP both to and from the server.
4.3 T2VE using H.264 over TCP to the server and MJPEG over TCP from the server.
4.4 T2VE using H.264 over TCP to the server and MJPEG over UDP from the server.
4.5 T2VE using H.264 over UDP to the server and MJPEG over TCP from the server.
4.6 T2VE using H.264 over UDP to the server and MJPEG over UDP from the server with 2000 kbit/s bitrate.
4.7 T2VE using H.264 over UDP to the server and MJPEG over UDP from the server with 4000 kbit/s bitrate.
4.8 T2VE using H.264 at 10000 kbit/s bitrate.
4.9 T2VE using H.264 at 20000 kbit/s bitrate.
4.10 Encoding time for H.264 with 2000 kbit/s bitrate.
4.11 Encoding time for H.264 with 10000 kbit/s bitrate.
4.12 Encoding time for H.264 with 20000 kbit/s bitrate.
4.13 Decoding time for H.264 with 2000 kbit/s bitrate.
4.14 Decoding time for H.264 with 10000 kbit/s bitrate.
4.15 Decoding time for H.264 with 20000 kbit/s bitrate.
5.1 T2VE using VP8 over TCP both to and from the server at 2000 kbit/s bitrate.
5.2 T2VE using H.264 over TCP to the server and H.265 over TCP from the server at 2000 kbit/s bitrate.
5.3 T2VE using VP8 at 8500 kbit/s bitrate.
5.4 T2VE using H.265 at 10000 kbit/s bitrate.
5.5 T2VE using H.265 at 20000 kbit/s bitrate.
5.6 Encoding time for VP8 with 2000 kbit/s bitrate.
5.7 Encoding time for VP8 with 8500 kbit/s bitrate.
5.8 Decoding time for VP8 with 2000 kbit/s bitrate.
5.9 Decoding time for VP8 with 8500 kbit/s bitrate.
5.10 Encoding time for H.265 with 2000 kbit/s bitrate.
5.11 Encoding time for H.265 with 10000 kbit/s bitrate.
5.12 Encoding time for H.265 with 20000 kbit/s bitrate.
5.13 Decoding time for H.265 with 2000 kbit/s bitrate.
5.14 Decoding time for H.265 with 10000 kbit/s bitrate.
5.15 Decoding time for H.265 with 20000 kbit/s bitrate.


List of Tables

4.1 Summary of baseline measurements regarding Time To Virtual Element.
4.2 Summary of baseline measurements regarding Time To Virtual Element compared to earlier work.
4.3 Summary of baseline measurements regarding Time To Virtual Element for H.264 over TCP with different bitrates.
4.4 Summary of baseline measurements regarding encoding and decoding speed for H.264 with different bitrates.
5.1 Summary of measurements regarding Time To Virtual Element.
5.2 Summary of measurements regarding Time To Virtual Element for VP8 and H.265 at different bitrates.
5.3 Summary of measurements regarding encoding and decoding speed for VP8 and H.265 with different bitrates.

1 Introduction

Mixed reality is a way of merging the physical world with virtual worlds by adding virtual three-dimensional objects to the user interface of an application, objects that the user can interact with as if they were actually present in the physical world. This kind of technology uses a lot of resources on the mobile devices where these applications run [3]. This creates a need for improving the performance of these applications if they are to be used for longer periods, because the computations are rather CPU-intensive and can cause the device to heat up and drain the battery.

One way of reducing the amount of computation on the mobile device is computational offloading to an edge device. The traditional way of offloading computations too heavy for the mobile device to handle is to offload them to the cloud, meaning a server often far away from the mobile device. When the data has to travel to the cloud and back, there is bound to be some level of delay [15]. The further away the server is, the more delay the end user will experience. By placing the computational power closer to the mobile device, which is what using an edge device does, the latency will be much lower compared to a cloud-based solution, which in turn results in a much more responsive application [5]. A prototype for studying this problem has already been developed and is what is used and modified in this project [8].

The prototype, MR-Leo, uses an edge device as a server which does all the heavy computation, such as creating a point cloud and drawing graphics. However, the time from the point that a user presses the button to add a virtual object until the user sees the object on the screen is long enough that the experience does not feel smooth, which makes it unpleasant and less like being in a real environment. Most of this time is not spent on computation; rather, it is the time it takes to process and transmit the video that the mobile device captures to the server and then back again [8], [14]. One aspect that affects the transmission time is the video encoding algorithm that is used [14].

1.1 Aim

The aim of this project is to study whether the responsiveness of a mixed reality application using edge computing can be improved by changing the algorithm that is used for encoding and decoding the video.


The application has a few different parts that each add to the overall latency:

• video encoding on the client side
• transmission of video from the client to the server
• video decoding on the server
• image processing, such as creating a point cloud and drawing MR objects
• video encoding on the server
• transmission of video from the server to the client
• video decoding on the client side

The aim is that the new codecs will reduce the overall latency.

1.2 Research questions

In this thesis the following questions will be answered:

• How much does the encoding bitrate affect the encoding/decoding speed and visual quality of the video streams?

• To what extent does the encoding/decoding speed affect the responsiveness of the application?

These questions are investigated and answered by implementing a VP8 encoder/decoder as well as an H.265 encoder/decoder in an open source mixed reality application, running experiments using these new codecs instead of the already implemented codecs, and comparing the results to a baseline setup.

1.3 Related work

Improving performance for a mixed reality application has been studied before. Toczé et al. [14] studied the use of Motion Joint Photographic Experts Group (MJPEG) for encoding and found that it achieves acceptable latency if used together with the User Datagram Protocol (UDP). That work is similar to this thesis, which however mainly studies the use of VP8, and also H.265.

Liu et al. [9] studied ways to improve latency for object detection in an AR device using computational offloading to the edge. In order to reduce the time it took to transmit images between the AR device and the edge they encoded the image with dynamic Region of Interest (RoI) encoding which encodes regions of the image that are interesting with high quality, and the rest of the image with worse quality. They found that this encoding largely reduced the size of the files which in turn lowered the time to transmit images.

Feller et al. [6] looked at the (at the time) newly released VP8 codec and compared it to the already existing H.264 codec. They concluded that VP8 is a competitive alternative to the H.264 codec. They found that VP8 was able to generate bit streams at a comparable quality level when stacked against H.264. When they compared VP8 to a highly optimized implementation of H.264, x264, VP8 displayed only minor drawbacks in compression ratio. Looking at encoding speed, they found that x264, the optimized implementation of H.264, was up to 350% faster than VP8. However, they also concluded that since VP8 was far behind x264 in terms of optimization, the discrepancy should shrink in the future.

Sharrab et al. [12] concluded, among other things, that H.264 performed better at resolutions up to 720p, while VP8 performed better at 1080p as well as 2160p. They also found that H.264 was faster at encoding compared to VP8. However, similar to the previous authors, they concluded that since VP8 was a new codec at the time, it would improve as it was optimized more and more and thereby close the gap.

The H.265 encoder is relatively new and is the successor to the widely used H.264 encoder. Several studies have compared H.265, or High-Efficiency Video Coding (HEVC), to H.264, or Advanced Video Coding (AVC). Koumaras et al. [7] looked at the performance of HEVC and AVC and examined whether the main objective of the new HEVC codec, to double the compression efficiency while retaining the same video quality, is achieved. They came to the conclusion that on average it takes four times longer for HEVC to complete the encoding compared to AVC when encoding the same signal. They also showed that HEVC doubles the compression efficiency while retaining the same quality.

Correa et al. [4] did a complexity assessment of the HEVC encoder. They came to the conclusion that when more tools, such as bitrate and performance handlers, were implemented in the codec, the computational complexity increased monotonically with every tool that was added, but the efficiency stopped improving after some point. They also found that, by carefully selecting coding tools, low-complexity setups could match the coding efficiency of high-complexity setups.

1.4 Delimitations

There are multiple methods that could be used to improve the responsiveness of the mixed reality application, but in this report only one is examined: how the encoding/decoding algorithm affects the responsiveness, and to what extent switching to an algorithm that is more efficient in terms of encoding and decoding speed can reduce latency.


2 Background

In this chapter important terms and concepts will be explained.

2.1 Mixed reality

Mixed reality is a blend of the physical reality that you are in and a virtual reality. Virtual reality means another world that is completely separate from your physical reality; wearing an HTC Vive or Oculus Rift headset, for example, takes you into a virtual reality. Mixed reality is often used interchangeably with the term augmented reality, which usually means an enhanced reality. An example of augmented reality is the (discontinued) Google Glass, which altered your field of view to show, for example, interesting facts or directions, while still letting you stay within the realm of your physical reality. Mixed reality is very similar to augmented reality, but the focus is on adding objects, altering the surroundings, and allowing the user to interact with these objects as if they were real, rather than just adding information on top of what is already present.

2.2 MR-Leo

MR-Leo [8] is the name of the mixed reality prototype application that is used in this thesis. The application captures video on the mobile device and sends it to the server, which then calculates the point clouds and, if the user has requested it, also draws an MR object in each video frame. The server then sends the video back to the mobile device, which displays it to the user. Before the video is sent either to or from the mobile device it gets encoded, and it is then decoded once it arrives at its destination. Figure 2.1 shows a screenshot of the application in the main menu where all the settings are set, and Figure 2.2 shows a screenshot of the application when a virtual house has been added to the view. Three different buttons can also be seen there. The green button is pressed to add a new virtual object to the screen. In practice, when the user selects "add", the end device sends a notification to the edge device that an object is to be added. Given a stable point cloud (with a clear mapping of the surroundings), the edge device then draws the object onto the images, sends them back to the end device, and the result is displayed to the user. The red button is for removing objects, and the blue button is for exiting the application.


Figure 2.1: The app when selecting settings.
Figure 2.2: The app when running mixed reality and drawing houses.

2.3 Video encoding and decoding

Sending video over a network requires a lot of bandwidth because video files are usually quite large [13]. In order to reduce the file size so that it requires less bandwidth to send, the video is encoded. The size of the file after encoding is determined by the bitrate the video is encoded at. Bitrate means how much data there is per time unit of video, i.e. how much information each frame of the video contains. Generally, a lower bitrate means a smaller video file compared to a higher bitrate. However, a higher bitrate is more demanding on the CPU and will in theory take longer to encode [2]. There are many video encoding standards, such as H.264, MJPEG, VP8 and H.265. In this thesis VP8 and H.265 are implemented and compared to H.264 and MJPEG, which are already implemented.
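As a rough worked illustration of the bitrate-size relationship (ignoring container overhead), a 60-second video at the application's default bitrate of 2000 kbit/s amounts to

$$ \frac{2000\ \text{kbit/s} \times 60\ \text{s}}{8\ \text{bit/byte}} = 15\,000\ \text{kB} \approx 15\ \text{MB}, $$

while the same video encoded at 20000 kbit/s would be ten times larger, around 150 MB.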

H.265

H.265, or High Efficiency Video Coding (HEVC) as it is also known, is the successor of H.264. Both H.264 and H.265 work by dividing an image into smaller macroblocks, which in the case of H.264 are squares of 16x16 pixels. In the case of H.265 they can be up to 64x64 pixels and are referred to as coding tree units. The encoders then make predictions; for example, if two frames look similar, some computation is not necessary. Besides the change in macroblock size, H.264 and H.265 work in about the same way, but H.265 has more optimizations and tweaks. This makes H.265 require about 50% of the bandwidth that H.264 requires for the same quality [10].
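To put the claim from [10] into concrete numbers (the figures are hypothetical, chosen only for the arithmetic): a stream that needs about 4000 kbit/s with H.264 should need only about 0.5 × 4000 = 2000 kbit/s with H.265 at the same visual quality.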

VP8

VP8 is an open source video codec developed by Google. It replaced its predecessor VP7 and was released in 2010 along with Google's WebM project. It is a competitor to the H.264 codec [6].

The VP8 encoding algorithm is similar to H.265 and H.264. On a high level the algorithm works by dividing a frame into smaller macroblocks which are used to make predictions based on key-frames. Typically three earlier encoded frames are used for prediction [1].

Sharrab et al. [12] concluded that the decoding speed of VP8 was better than that of H.264. The encoding speed was, however, slower, but not by much, for lower resolutions. This means that there could be an improvement in responsiveness when using VP8, owing to the decoding speed.

2.4 GStreamer

The application MR-Leo utilizes an open source framework called GStreamer to stream data between the end device and the edge device. GStreamer is a library that handles different kinds of plugins, such as plugins for encoding and decoding video as well as plugins that construct the travel path of the data stream. Examples of plugins that were used during our work are vp8enc and vp8dec, to encode and decode the video stream, and x265enc together with a corresponding H.265 decoder element.

One important aspect of GStreamer is pipelines. A pipeline is a way of telling GStreamer how to handle incoming and outgoing data streams. A pipeline typically begins with a source, where the video comes from; this can for example be a TCP connection. Next comes an encoder that encodes the video stream, then some kind of multiplexer (software that takes several inputs and directs them into one output), and finally a sink that the video stream exits into, which could also be, for example, a TCP connection. On the decoder side it is mostly the same but in reverse.

This is a simplified description of a pipeline, but it conveys how one works in principle.
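As a minimal sketch of the structure just described (standard GStreamer elements only, run with the stock gst-launch-1.0 command-line tool; the host address and port are placeholder assumptions, not values from MR-Leo), a sender and a receiver could look like:

gst-launch-1.0 videotestsrc ! x264enc ! mpegtsmux ! tcpclientsink host=192.168.0.10 port=5000

gst-launch-1.0 tcpserversrc host=0.0.0.0 port=5000 ! tsdemux ! h264parse ! avdec_h264 ! videoconvert ! autovideosink

The first pipeline runs source, encoder, multiplexer and sink in order; the second mirrors it in reverse with a demuxer, a decoder and a display sink.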


3 Encoding and measurement of latency

In this chapter the process of implementing the encoder and also how the experiments for measuring the time to virtual element and encoding/decoding speed were conducted is ex-plained.

3.1 Implementation of encoders

The encoding in the application is done through GStreamer, where encoding pipelines can be constructed. The pipelines consist of different elements; there are elements for both audio and video. Listing 3.1 shows an example of how a pipeline in GStreamer can look. This pipeline uses videotestsrc, a test video source that can be used to try out pipelines; x264enc, which encodes video into H.264 compressed data; mpegtsmux, which multiplexes the data into an MPEG transport stream; and at the end filesink, which saves the encoded video in a specified location on the device that is running the pipeline.

Listing 3.1: Example encoding pipeline

videotestsrc ! x264enc ! mpegtsmux ! filesink location=myvideo.mp4

New pipelines had to be constructed in order to implement the new encodings in the application. Elements in a pipeline cannot be combined arbitrarily, because they have to be compatible with each other. Elements have source and sink pads, which specify what type of data the element accepts and what type of data it can send to the next element in the pipeline.

Elements also have properties that can be set in the pipeline. For example the element x264enc has a property called bitrate which sets the bitrate that the video should be encoded at.
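An element's pads and properties can be listed with the standard gst-inspect-1.0 tool before elements are combined; for example:

gst-inspect-1.0 x264enc

prints the element's sink and source pad capabilities (the data types it accepts and produces) together with all of its settable properties, including bitrate.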

The resulting pipeline for the VP8 encoding over TCP is shown in Listing 3.2. This pipeline takes a video that is saved on the phone and displays that video in a small preview window in the application. It then encodes the video with VP8 and multiplexes it into a Matroska file. A Matroska file is similar to an MP4 or AVI file in that it contains video and audio, but Matroska is completely open source. The stream is then sent to the server over TCP. Instead of using a pre-recorded video saved on the phone, there is the option of using the camera on the phone. In that case the pipeline looks the same, except that instead of filesrc and decodebin at the start, the pipeline would have ahcsrc and a capsfilter.

Listing 3.2: Pipeline for VP8 encoding over TCP on client side

filesrc ! decodebin ! tee t. ! queue ! glimagesink t. ! vp8enc target-bitrate=2000000 ! matroskamux ! tcpclientsink

The pipeline when using UDP is almost the same as for TCP, as seen in Listing 3.3, but rtpvp8pay, an element that puts VP8 video in RTP packets, is added after the encoder; matroskamux is removed; and tcpclientsink is changed to udpsink.

These pipelines also exist on the server with a few modifications, since the video source is the server's output instead of a file or the camera. A problem that arose when the VP8 pipelines were implemented on the server was that the encoding was very slow: so slow that only about three frames were received by the client during the full 60-second video. To fix this problem, some more properties of the vp8enc element were explored. The properties that were added are cpu-used, threads and deadline. The higher the value of cpu-used, the faster the encoder will encode, but at a loss of quality; this property was set to 16, the highest value, meaning that the encoder will use as much CPU as it needs. The threads property tells the encoder how many threads it can use for the encoding and was set to 8, which also speeds up the process. The last property, deadline, was set to realtime, which is the recommended setting for live encoding since it is faster. Deadline was the property that had the greatest impact on the speed of the encoder.

Listing 3.3: Pipeline for VP8 encoding over UDP on client side

filesrc ! decodebin ! tee t. ! queue ! glimagesink t. ! vp8enc target-bitrate=2000000 ! rtpvp8pay ! udpsink
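Combining the three tuning properties described above with the encoder element from Listing 3.2 gives a fragment along these lines (a sketch based on the description; all four are real vp8enc properties, and deadline=1 is the value that selects realtime mode):

vp8enc target-bitrate=2000000 cpu-used=16 threads=8 deadline=1 ! matroskamux ! ...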

When the video has been sent it has to be decoded by the receiver. The decoding is also done by constructing pipelines. These are constructed in the same way as the encoding pipelines but with different kinds of elements, meaning that instead of an element that encodes there is an element that decodes, and instead of a multiplexer there is an element that demuxes the video stream. In Listing 3.4 the decoding pipeline for VP8 over TCP on the server is shown. For decoding, vp8dec is used. There is a similar pipeline for VP8 encoded video that has been sent over UDP.

Listing 3.4: Pipeline for VP8 decoding over TCP on server

tcpserversrc ! matroskademux ! vp8dec ! videoconvert ! video/x-raw,format=(string)RGB ! videoconvert ! appsink emit-signals=true max-buffers=1 drop=true
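The corresponding UDP decoding pipeline is not shown in the report; a plausible sketch, assuming the RTP payloading from Listing 3.3 (the caps values given here are the standard ones for RTP-packetized VP8, and the payload number is an assumption), would be:

udpsrc ! application/x-rtp,media=video,clock-rate=90000,encoding-name=VP8,payload=96 ! rtpvp8depay ! vp8dec ! videoconvert ! appsink emit-signals=true max-buffers=1 drop=true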

Limitations

The elements come in different plugin packages in GStreamer, and there are special binaries for Android development. The Android binaries do not always include everything that they should according to the GStreamer documentation. This was a problem when H.265 was going to be implemented on the client side, because the element x265enc was not included in the Android binaries; the maintainers had not added the x265 plugin. In order to implement it for Android, the plugin would have to be added manually, which requires adding a recipe for how to build the plugin. The possibility of doing this was explored, but without success. Therefore H.265 encoding was only implemented on the server and not on the client.

The pipelines for encoding and decoding H.265 look very similar to the previously shown pipelines and will therefore not be shown.
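For illustration only, a server-side H.265 encoding pipeline mirroring the structure of the VP8 one might look as follows (a sketch, not the actual MR-Leo pipeline; appsrc stands in for the server's frame source, x265enc takes its bitrate in kbit/s, and h265parse is assumed here to prepare the stream for muxing):

appsrc ! videoconvert ! x265enc bitrate=2000 ! h265parse ! matroskamux ! tcpserversink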


Figure 3.1: The setup with the laptop as edge device, phone as client and router to connect them. The cable connecting the phone to the laptop makes it possible to compile and run the app directly from Android Studio on the laptop.

For VP8 over UDP and H.265 over UDP no measurements were performed, due to video artifacts that left the video quality not on par with the other configurations.

3.2 Experiment setup

To perform the measurements, three different pieces of hardware were needed: one dedicated router not connected to the internet, one laptop running the Linux-based operating system Ubuntu 18.04.4 LTS, and one smartphone running Android version 9 (Pie). The router is a D-LINK DIR 809 made in 2015. The laptop is an HP EliteBook 830 G6 with an Intel Core i5-8265U processor running at 1.6 GHz and 16 gigabytes of RAM. The Android phone is a Samsung Galaxy A40 with an octa-core 1.6 GHz processor and 4 gigabytes of RAM.

The mixed reality system works by turning the laptop into a server (edge device) connected to the router via 802.11 Wi-Fi. The phone is connected to the same router via 802.11 Wi-Fi as well, making it possible for the laptop to communicate with the phone and turning the phone into an end device. The setup is shown in Figure 3.1.

3.3 Measurement method

To get an idea of how the different encoders affect the latency, and in turn the user experience, a benchmarking method was needed, i.e. some way of measuring the difference in latency when using different encoders. Two measurements were used to evaluate the performance of the encoders. The first one is referred to as Time To Virtual Element (T2VE): the time it takes for the virtual element to be added to the screen after the user has selected the "add" option in the application. The second one is the encoding and decoding speed of the codecs, i.e. how long it takes to either encode or decode a video frame.
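Stated as a formula (a plain restatement of the definition above, with symbol names chosen here for clarity):

$$ \text{T2VE} = t_{\text{object visible}} - t_{\text{add pressed}} $$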


The video file that was used to perform the benchmarking was the same video that was used in earlier work [14]. Because a single pre-recorded video file was used for every measurement, the results are more repeatable compared to a live video recording. The video was filmed at Linköping University and displays several items, such as a table and a laptop, that the mixed reality algorithm has to map. It has a resolution of 640x480 pixels.

Time To Virtual Element

In order to measure T2VE the application was run in benchmarking mode. To run the server in benchmarking mode and to get the needed data, the program was executed from the command line with "-b" added after the program execution instructions. This printed out an exact measurement of the T2VE in milliseconds, allowing the data to be plotted as cumulative distribution functions.
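Based on that description and the server binary shown later in Listing 3.5, the invocation would presumably look like:

./MR-Leo-server -b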

For each encoder and setup, 30 measurements of T2VE were recorded. In reality more measurements were taken but not included, due to the point cloud being unstable; this could produce T2VE values of 9000 ms when the average was around 100-200 ms. Those kinds of extreme values were disregarded. Generally, the threshold for disregarding a measurement was a value about five to six times higher than the average.
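As a concrete example of that rule of thumb (the numbers are chosen for illustration): with an average around 150 ms, values beyond roughly 5 × 150 ms = 750 ms would be discarded, so an occasional 9000 ms outlier caused by an unstable point cloud does not skew the distribution.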

Encoding and decoding speed for different bitrates

When a video frame gets encoded, some time passes from the moment the frame arrives at the GStreamer pipeline source until it is done and comes out at the sink. The time from the moment the frame arrives at the source until it arrives at the sink is what is called encoding/decoding speed in this thesis.

Different encoders use different algorithms to encode the video, and that should give different results regarding encoding speed at different bitrates.

In order to compare the encoders against each other in terms of encoding speed at different bitrates, a way to perform the measurements was needed. A command was added when running the server which made it possible to measure the encoding speed for the different encoders. The server was started with the command shown in Listing 3.5, which outputs the time for each frame to go through the pipeline in nanoseconds. The data was then exported from the terminal and plotted as cumulative distribution functions, with the times converted to milliseconds.

Listing 3.5: Command to run server and measure pipeline latency

GST_DEBUG="GST_TRACER:7" GST_TRACERS="latency(flags=pipeline)" ./MR-Leo-server

This gave a clear idea of how differences in bitrate magnitude affected each encoder’s ability to encode the video data. The different bitrates that were tested were 2000 kbit/s, 10000 kbit/s and 20000 kbit/s for H.264 and H.265, and 2000 kbit/s and 8500 kbit/s for VP8. The reason for choosing 8500 kbit/s for VP8 is explained in the next paragraph. The standard setting for the application was 2000 kbit/s. For comparison, a YouTube video at 1080p at 30 frames per second is encoded at a bitrate of around 8000 kbit/s, and a video at 1080p at 60 frames per second is encoded at about 12000 kbit/s. So 10000 kbit/s is right between those two values. 20000 kbit/s can be compared to a YouTube video at 1440p.

When running the VP8 encoder at 10000 kbit/s and 20000 kbit/s, the video quality worsened drastically, whereas running the other encoders at higher bitrates only improved the video quality. This might have to do with some setting for the encoder that caps the bitrate at some value. For this reason, VP8 performance at 10000 kbit/s and 20000 kbit/s is not included. Instead, measurements for VP8 were performed with the bitrate set to 8500 kbit/s. This value is fairly close to 10000 kbit/s and should provide similar insight. It was also decided that the important part of performing measurements at higher bitrates was not to compare the different encoders at exactly the same bitrate, but rather to see how the different encoders performed at different bitrates. If the study were performed again, the same bitrates would be measured for all codecs; the reason this was not done during this thesis was time constraints. The measurements were performed once during the 60-second video for each setup and bitrate. This resulted in about 1500 measurements each for encoding and decoding, which were then used to plot the graphs.

No measurements regarding bitrate were performed for MJPEG, due to time constraints. All measurements regarding encoding speed and bitrate were performed only on the server.


4 Baseline evaluation

In the application a number of encoders were already implemented. From the client to the server the user can select H.264 with either hardware or software encoding. From the server to the client the user can select either H.264 or MJPEG. The user also has to select whether to send over TCP or UDP for some of the encodings. This means that there are 12 different combinations that the user can choose from. The user can also change the bitrate that the video should be encoded at, which makes the number of possible configurations practically endless. To get as good a baseline as possible, and to have plenty to compare against, multiple different configurations were used.

Measurements for some of these setups have already been performed in earlier work [8], [14], but with different hardware; therefore the measurements were redone with the hardware used in this thesis, so that the new measurements can be compared to results from other encoders on the same hardware.

4.1 Time to virtual element

Here the results regarding the Time To Virtual Element (T2VE) are presented for the baseline, i.e. the time from the moment the user presses the add button until the user sees the virtual object on the screen.

All of the following measurements were done while using H.264 to send video from the client to the server. The difference is whether TCP or UDP is used for the connection between the client and the server. When TCP is used for H.264 the bitrate is set to 2000 kbit/s, and when UDP is used the bitrate is set to 4000 kbit/s. This is in order to achieve a stable number of feature points in the point cloud [14].

In Figure 4.1 the results regarding T2VE when using H.264 over TCP, both when the client sends video to the server as well as when the server sends video to the client, are shown. It shows that the value at the 90th percentile is 729 ms. It also shows that the lowest value is 439 ms, and the highest was 986 ms. Toczé et al. [14] also performed measurements with this setup and found that their value at the 90th percentile is 654 ms which is a little lower than in this thesis.

Similarly, in Figure 4.2 the results when using H.264 over UDP both to and from the server are shown. It shows that the value at the 90th percentile is 377 ms. Here the lowest value is 214 ms while the highest is 618 ms. The highest value for this setup is lower than the lowest value when using H.264 over TCP both to and from the client. When Toczé et al. [14] performed this measurement they found the value at the 90th percentile to be 177 ms. This is again lower than the measurement in this thesis.

Figure 4.1: T2VE using H.264 over TCP both to and from the server.
Figure 4.2: T2VE using H.264 over UDP both to and from the server.
Figure 4.3: T2VE using H.264 over TCP to the server and MJPEG over TCP from the server.
Figure 4.4: T2VE using H.264 over TCP to the server and MJPEG over UDP from the server.

The following graphs show the results from when using MJPEG from the server to client. In Figure 4.3 H.264 over TCP was used from the client to the server, and MJPEG over TCP was used from the server to the client. In this case 125 ms is the value at the 90th percentile. Toczé et al. [14] found this value to be 98 ms which again is lower than in this thesis. It can also be seen that the lowest value is 24 ms and the highest is 261 ms. In Figure 4.4 H.264 over TCP is used again from the client to the server together with MJPEG over UDP from the server to the client. The value at the 90th percentile for this setup is 57 ms. The highest value is 90 ms and the lowest 19 ms.

When using H.264 over UDP from the client to the server and MJPEG over TCP from the server to the client as seen in Figure 4.5 the value at the 90th percentile is 71 ms. Here the lowest value is 17 ms, and the highest 75 ms.

The two following results are from using H.264 over UDP from the client to the server and MJPEG over UDP from the server to the client. Figure 4.6 shows the results when the bitrate for video sent from the server is set to 2000 kbit/s, and Figure 4.7 when it is set to 4000 kbit/s. When the bitrate is set to 2000 kbit/s the value at the 90th percentile is 53 ms, with a lowest value of 15 ms and a highest value of 88 ms. For 4000 kbit/s the 90th percentile value is 57 ms; the lowest value is 18 ms and the highest is 92 ms.


Figure 4.5: T2VE using H.264 over UDP to the server and MJPEG over TCP from the server.

Figure 4.6: T2VE using H.264 over UDP to the server and MJPEG over UDP from the server with 2000 kbit/s bitrate.

Figure 4.7: T2VE using H.264 over UDP to the server and MJPEG over UDP from the server with 4000 kbit/s bitrate.

The results from the baseline measurements regarding Time To Virtual Element are summarized in Table 4.1. It shows that H.264 over UDP and MJPEG over UDP at 2000 kbit/s bitrate performed best in terms of the value at the 90th percentile. H.264 over TCP and MJPEG over UDP is tied for second place with H.264 over UDP and MJPEG over UDP at 4000 kbit/s bitrate. The highest value for T2VE was recorded while using H.264 over TCP both to and from the server.

SETUP                                          90% (ms)   MIN (ms)   MAX (ms)
H.264 (TCP) and H.264 (TCP)                       729        439        986
H.264 (UDP) and H.264 (UDP)                       377        214        618
H.264 (TCP) and MJPEG (TCP)                       125         24        261
H.264 (TCP) and MJPEG (UDP)                        57         19         90
H.264 (UDP) and MJPEG (TCP)                        71         17         75
H.264 (UDP) and MJPEG (UDP) at 2000 kbit/s         53         15         88
H.264 (UDP) and MJPEG (UDP) at 4000 kbit/s         57         18         92

Table 4.1: Summary of baseline measurements regarding Time To Virtual Element.

Figure 4.8: T2VE using H.264 at 10000 kbit/s bitrate.
Figure 4.9: T2VE using H.264 at 20000 kbit/s bitrate.

Table 4.2 shows the values at the 90th percentile for some of the setups from both this thesis and from earlier work. All the values are lower for Toczé et al. [14] compared to the values measured in this thesis. This is most likely due to the difference in hardware that was used.

SETUP                             This thesis (ms)   Toczé et al. (ms)
H.264 (TCP) and H.264 (TCP)              729                654
H.264 (UDP) and H.264 (UDP)              377                177
H.264 (TCP) and MJPEG (TCP)              125                 98

Table 4.2: Summary of baseline measurements regarding Time To Virtual Element compared to earlier work.

T2VE for higher bitrates

T2VE was also measured while using higher bitrates to see how this affects the responsiveness. The following measurements are from using H.264 over TCP both to and from the server. The reason that H.264 over TCP is used, even though it is clearly the slowest of the setups in Section 4.1, is that the absolute T2VE is not the important part when measuring with higher bitrates; the important part is to see whether the bitrate affects the T2VE. Therefore it does not matter which setup is used.

Figure 4.8 shows the results when the bitrate is set to 10000 kbit/s. There it is shown that the value at the 90th percentile is 1046 ms, which is higher than for H.264 over TCP at 2000 kbit/s. The lowest value is 516 ms and the highest is 2312 ms. The results from using 20000 kbit/s are shown in Figure 4.9, where the value at the 90th percentile is 1449 ms, which is even higher than for 10000 kbit/s. The lowest value is 501 ms and the highest is 3196 ms. These measurements show that there can be a big difference between the smallest and largest value. This is because some frames contain more or less information depending on what they are displaying. When the bitrate is lower, the values are more concentrated around a certain point.

The results from measuring T2VE while using different bitrates for H.264 are summarized in Table 4.3. It clearly shows that a higher bitrate results in a longer response time, as expected. The difference between the smallest and largest value also grows as the bitrate is increased.


BITRATE        90th percentile (ms)   MIN (ms)   MAX (ms)
2000 kbit/s            729               439        986
10000 kbit/s          1046               516       2312
20000 kbit/s          1449               501       3196

Table 4.3: Summary of baseline measurements regarding Time To Virtual Element for H.264 over TCP with different bitrates.

Figure 4.10: Encoding time for H.264 with 2000 kbit/s bitrate.
Figure 4.11: Encoding time for H.264 with 10000 kbit/s bitrate.
Figure 4.12: Encoding time for H.264 with 20000 kbit/s bitrate.
Figure 4.13: Decoding time for H.264 with 2000 kbit/s bitrate.
Figure 4.14: Decoding time for H.264 with 10000 kbit/s bitrate.
Figure 4.15: Decoding time for H.264 with 20000 kbit/s bitrate.

4.2 Encoding and decoding speed for different bitrates

Here we will present the results regarding the encoding speed for different bitrates for the baseline. As stated in Chapter 3, these measurements are only performed for H.264 for the baseline.

When encoding with H.264 at a bitrate of 2000 kbit/s as seen in Figure 4.10 the value at the 90th percentile is 3.77 ms with a fastest measured time of 1.99 ms, and a highest of 7.19 ms. For 10000 kbit/s which is shown in Figure 4.11 the value at the 90th percentile is 3.89 ms. 2.01 ms and 6.82 ms are the lowest and highest values. Lastly for 20000 kbit/s as shown in Figure 4.12 the value at the 90th percentile is 4.08 ms while the lowest value is 2.06 ms and the highest value is 8.72 ms. It can be noted that it takes more time to encode at a higher bitrate.

For decoding H.264 video encoded at 2000 kbit/s, Figure 4.13 shows that the time at the 90th percentile is 268 ms, while the lowest value is 34 ms and the highest is 433 ms. Figure 4.14 shows that when decoding a video encoded at 10000 kbit/s, the value at the 90th percentile is 370 ms. Here the lowest value is 38 ms and the highest is 589 ms. For 20000 kbit/s, as seen in Figure 4.15, the value at the 90th percentile is 390 ms, the lowest value is 24 ms, and the highest is 618 ms. Here it can also be noted that the time to decode increases as the bitrate increases. Also, the time to decode a video frame is many times longer than the time to encode.


The results are summarized in Table 4.4, where it is clearly shown that both the encoding time and the decoding time depend on the selected bitrate. It can also be seen that the decoding time is roughly 70-95 times longer than the encoding time.
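These ratios follow directly from the 90th-percentile columns of Table 4.4:

$$ \frac{268}{3.77} \approx 71, \qquad \frac{370}{3.89} \approx 95, \qquad \frac{390}{4.08} \approx 96. $$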

BITRATE        90th percentile encoding (ms)   90th percentile decoding (ms)
2000 kbit/s                3.77                            268
10000 kbit/s               3.89                            370
20000 kbit/s               4.08                            390

Table 4.4: Summary of baseline measurements regarding encoding and decoding speed for H.264 with different bitrates.


5 Measurements of VP8 and H.265

In this chapter we present the latencies obtained from the experiments when running MR-Leo with VP8, and also with H.265 from the server to the client. As mentioned in Chapter 3, the H.265 encoder was only implemented on the server side, due to it not being included in the Android binaries.

5.1 Time to virtual element

Here the results regarding Time To Virtual Element (T2VE) are presented for the implementations of VP8 and H.265.

In Figure 5.1 we see the results for when using VP8 over TCP both to and from the server at 2000 kbit/s bitrate. It shows that the value at the 90th percentile is 192 ms while the lowest value is 90 ms and the highest is 391 ms. If compared to the baseline measurements in Chapter 4 it is almost as fast as H.264 over TCP to the server and MJPEG over TCP from the server.

Figure 5.2 shows the results from when using H.264 over TCP from the client and H.265 over TCP from the server. The value at the 90th percentile for these measurements is 775 ms, the lowest value is 440 ms and the highest is 999 ms.

The results from the measurements are summarized in Table 5.1. It shows that VP8 over TCP both to and from the server is faster than using H.264 from the client and H.265 from the server over TCP. The highest value while using VP8 is lower than the lowest value while using the other setup.

SETUP                            90th percentile (ms)   MIN (ms)   MAX (ms)
VP8 (TCP) and VP8 (TCP)                  192                90        391
H.264 (TCP) and H.265 (TCP)              775               440        999

Table 5.1: Summary of measurements regarding Time To Virtual Element.

T2VE for higher bitrates

The T2VE was also measured while using different bitrates for both of the setups. For VP8 a measurement with the bitrate set to 8500 kbit/s was performed, and for H.265 over TCP from the server measurements were performed with the bitrate set to 10000 kbit/s and 20000 kbit/s.

Figure 5.1: T2VE using VP8 over TCP both to and from the server at 2000 kbit/s bitrate.
Figure 5.2: T2VE using H.264 over TCP to the server and H.265 over TCP from the server at 2000 kbit/s bitrate.
Figure 5.3: T2VE using VP8 at 8500 kbit/s bitrate.
Figure 5.4: T2VE using H.265 at 10000 kbit/s bitrate.
Figure 5.5: T2VE using H.265 at 20000 kbit/s bitrate.

Figure 5.3 shows the results from the measurements while using VP8 at 8500 kbit/s, see Chapter 3 for the reasoning. Here the value at the 90th percentile is 310 ms. The lowest value is 100 ms, and the highest is 326 ms. This is higher than the T2VE for VP8 at 2000 kbit/s.

For H.265 the results while at 10000 kbit/s are shown in Figure 5.4 where the value at the 90th percentile is 1376 ms. For this setup the lowest value is 582 ms, and the highest is 1884 ms. At 20000 kbit/s as seen in Figure 5.5 the value at the 90th percentile is 1392 ms. The lowest value is 894 ms, and the highest is 2229 ms. This also shows that the T2VE increases as the bitrate increases. However the difference between 10000 kbit/s and 20000 kbit/s is not that great.

Table 5.2 summarizes the results for T2VE for the different bitrates. Interesting to note is that VP8 is faster than H.265. Even VP8 at 8500 kbit/s is faster than H.265 at 2000 kbit/s.

SETUP & BITRATE                                 90% (ms)   MIN (ms)   MAX (ms)
VP8 (TCP) at 2000 kbit/s                           192         90        391
VP8 (TCP) at 8500 kbit/s                           310        100        326
H.264 (TCP) and H.265 (TCP) at 2000 kbit/s         775        440        999
H.264 (TCP) and H.265 (TCP) at 10000 kbit/s       1376        582       1884
H.264 (TCP) and H.265 (TCP) at 20000 kbit/s       1392        894       2229

Table 5.2: Summary of measurements regarding Time To Virtual Element for VP8 and H.265 at different bitrates.


Figure 5.6: Encoding time for VP8 with 2000 kbit/s bitrate.

Figure 5.7: Encoding time for VP8 with 8500 kbit/s bitrate.

Figure 5.8: Decoding time for VP8 with 2000 kbit/s bitrate.

Figure 5.9: Decoding time for VP8 with 8500 kbit/s bitrate.

5.2 Encoding and decoding speed for different bitrates

Here we present the results regarding encoding and decoding speed at different bitrates for the implementations of VP8 and H.265.

For VP8 encoding at 2000 kbit/s bitrate the results are shown in Figure 5.6. There it is shown that the value at the 90th percentile is 4.63 ms, the lowest value is 2.10 ms and the highest value is 4.90 ms. Figure 5.7 shows the results for encoding speed when the bitrate is set to 8500 kbit/s. The value at the 90th percentile is 13.7 ms, the lowest value is 3.51 ms and the highest is 42.9 ms. The time to encode increases when the bitrate increases.

When decoding VP8 video at 2000 kbit/s, as seen in Figure 5.8, the value at the 90th percentile is 5.57 ms, the lowest value is 1.94 ms and the highest value is 12.7 ms. In Figure 5.9 the results are shown for decoding video that has been encoded at 8500 kbit/s. There the value at the 90th percentile is 5.49 ms. The lowest value is 2.05 ms, and the highest is 25.9 ms. Encoding and decoding take about the same amount of time for VP8.

For H.265 at 2000 kbit/s, which can be seen in Figure 5.10, the value at the 90th percentile is 10.8 ms, the lowest value is 3.14 ms and the highest value is 18.9 ms. At 10000 kbit/s, shown in Figure 5.11, the value at the 90th percentile is 16.2 ms, the lowest value is 3.11 ms and the highest value is 28.6 ms. At 20000 kbit/s, shown in Figure 5.12, the value at the 90th percentile is 24.5 ms, the lowest value is 8.28 ms and the highest value is 53 ms. When the bitrate increases, the time it takes to encode the video increases as well. The value at the 90th percentile increases steadily each time the bitrate is increased, as do the highest values. The lowest value for 10000 kbit/s is lower than the corresponding value for 2000 kbit/s, which is most likely circumstantial.


Figure 5.10: Encoding time for H.265 with 2000 kbit/s bitrate.
Figure 5.11: Encoding time for H.265 with 10000 kbit/s bitrate.
Figure 5.12: Encoding time for H.265 with 20000 kbit/s bitrate.
Figure 5.13: Decoding time for H.265 with 2000 kbit/s bitrate.
Figure 5.14: Decoding time for H.265 with 10000 kbit/s bitrate.
Figure 5.15: Decoding time for H.265 with 20000 kbit/s bitrate.

Moving on to the H.265 decoding numbers, the behaviour is very similar to the encoding side. At 2000 kbit/s, shown in Figure 5.13, the value at the 90th percentile is 269 ms, the lowest value is 34 ms and the highest value is 455 ms. At 10000 kbit/s, shown in Figure 5.14, the value at the 90th percentile is 394 ms, the lowest value is 32.4 ms and the highest value is 792 ms. At 20000 kbit/s, shown in Figure 5.15, the value at the 90th percentile is 430 ms, the lowest value is 24.6 ms and the highest value is 591 ms.

As can be seen in these results, the H.265 decoding times increase as the bitrate increases. The decoding times are much longer overall compared to the encoding times.

The results are summarized in Table 5.3. It is clear that both the encoding and decoding times increase when the bitrate increases. It can also be seen that the decoding time for VP8 is significantly lower than for H.265.

SETUP & BITRATE          90th percentile encoding (ms)   90th percentile decoding (ms)
VP8 at 2000 kbit/s                   4.63                            5.57
VP8 at 8500 kbit/s                  13.7                             5.49
H.265 at 2000 kbit/s                10.8                           269
H.265 at 10000 kbit/s               16.2                           394
H.265 at 20000 kbit/s               24.5                           430

Table 5.3: Summary of measurements regarding encoding and decoding speed for VP8 and H.265 with different bitrates.

5.3 Validity test

In order to make sure that the measurements are not closely tied to the reference video used for all the other measurements, another video was used to validate the results. The video is called object.mp4 and is meant to make it simple for the server to create a point cloud, so a lower T2VE is expected with this video. Figure 5.16 shows the results for T2VE while using VP8 over TCP at 2000 kbit/s with the new video. There the value at the 90th percentile is 172 ms, which is 20 ms lower than the value for the same setup with the baseline video. The lowest and highest values are 76 ms and 303 ms. This is also fairly close to the expected reduction of the values measured earlier. With these results it can be concluded that the previously presented measurements are representative of the transmission times (irrespective of video).

Figure 5.16: VP8 over TCP at 2000 kbit/s for object.mp4.


6 Discussion

In this chapter the results presented in the previous chapters and the method described in Chapter 3 are discussed.

6.1 The scope of the experiments

In Chapter 5 there were two different kinds of results, one covering the T2VE and the other covering the encoding and decoding speeds at different bitrates.

One interesting finding that might have affected the results is that the T2VE results differ each time the application runs. One benchmarking round, meaning 30 measurements, might give steady values at around 100 ms while another round gives steady values at 150 ms, and so on. The reason for this is unknown. Since the experiments were performed in different locations, it could be due to router placement or some other hardware issue. It could also be due to some kind of bug in the software. The values shown in the results chapters do, however, accurately represent the findings.

Something that could have been improved upon is the number of measurements that were made. The number of measurements for each benchmark is only 30, which is rather low. The reason for choosing this number was due both to the large number of configurations that needed to be tested and to time constraints.

It is worth revisiting a few of the previously mentioned delimitations and discussing how they may have affected the results. One issue was using the UDP protocol in combination with the VP8 codec. This resulted in video artifacts which severely degraded the quality, to the point where the original video was unrecognizable. For this reason, VP8 over UDP was not measured. If VP8 over UDP had been functional, how fast would it have been? What about the video quality? Going by the other measurements and comparisons, for example of H.264 over UDP and TCP, the T2VE would likely have improved at the cost of visual quality. This is, however, not certain; measurements would have to be made to know.

Another issue was the fact that the H.265 codec was not implemented on the client side. One major difference between H.265 and, for example, H.264 is that H.265 is much more computationally intensive: encoding a similarly sized video at the same bitrate takes longer on the same computer. This could be an issue for the mobile device, since it both depends on a battery that a user expects to last all day and has a slower processor than the laptop. The mobile device used in this thesis has a rather slow processor compared to higher-end phones, so a faster phone might be more suitable for H.265. Still, there is the issue of battery life. However, all of this is speculation since the codec was not implemented on the client.

Comparison of T2VE outcomes

T2VE is a way of measuring how long it takes for a virtual element to appear on the screen after the user has pressed the "add" button in the application. The greater this value is, the more delayed the response will be, and the application will as a result feel more sluggish to use. According to [11], 100 ms is the limit at which the user perceives the response as immediate.
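The metric can be made concrete with a toy timing sketch (ours, not the MR-Leo client code): record a timestamp at the button press and another when the element is first visible; the difference is the T2VE.

    # Toy illustration of the T2VE metric.
    import time

    t_pressed = time.monotonic()   # user presses the "add" button
    # ... frame is sent to the edge server, processed, and streamed back ...
    t_visible = time.monotonic()   # element first appears on screen

    t2ve_ms = (t_visible - t_pressed) * 1000.0
    print(f"T2VE: {t2ve_ms:.1f} ms")  # values near 100 ms feel immediate [11]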

The baseline measurements showed that H.264 over TCP was the slowest of all the setups, with a 90th-percentile time of 729 ms. This time increased as the bitrate increased: at 10000 kbit/s it had a T2VE of 1046 ms, and at 20000 kbit/s the T2VE was 1449 ms. H.264 over UDP was faster, with a T2VE of 377 ms. The setups that involved MJPEG were the fastest; the fastest overall was H.264 over UDP from the client to the server and MJPEG over UDP back, which had a 90th-percentile T2VE of 53 ms, fast enough to feel immediate [11].

The new implementation of VP8, when used over TCP both to and from the server, had a 90th-percentile T2VE of 192 ms. It also increased when the bitrate was increased; at 8500 kbit/s the T2VE was 310 ms.

The VP8 setups were faster in terms of T2VE than both the H.264 and H.265 setups, with T2VE times up to 2-3 times lower. Using H.264 on the client side and H.265 on the server side, the performance was identical to using H.264 over TCP on both sides. However, none of these come close to the performance of H.264 from the client and MJPEG from the server, which was significantly faster than the aforementioned encoders. This was a surprising outcome, and the reason is probably that the MJPEG encoder or decoder is faster than the others; this would have to be investigated more closely before a conclusion can be drawn. The difference between VP8 over TCP and the combination of H.264 over TCP to the server with MJPEG over TCP back to the client is 67 ms, which means that VP8 was only 1.5 times slower. A reason for this could be that VP8 over TCP had a higher maximum value than H.264 and MJPEG, which increases the 90th-percentile value, as seen in Figures 5.1 and 4.3. If these high values were disregarded, the VP8 codec would be even closer to the performance of H.264 over TCP with MJPEG over TCP.

Another aspect of this is visual quality. Visual quality is an important part of quality of service and should not be neglected in favor of simply looking at the numbers. The visual quality of VP8 was superior to both H.264 and MJPEG and on a similar level to H.265. This has, however, not been measured in any way other than by subjective observation while performing the tests.

When testing different bitrates, the results show that the T2VE gets longer for higher bitrates, which was not surprising. For VP8 the difference is about 100 ms between 2000 kbit/s and 8500 kbit/s, which is a noticeable difference when the numbers are otherwise quite low. H.265 also showed an increasing T2VE for higher bitrates. Its times were quite high and not responsive enough for a mixed reality application running on the low-performance platforms used in the experiments.
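For concreteness, the sketch below shows one way such bitrates could be configured in GStreamer's Python bindings. The property names reflect our best understanding of the upstream elements and should be verified with gst-inspect-1.0: vp8enc's "target-bitrate" is given in bits per second, while x265enc's "bitrate" is given in kbit/s.

    # Sketch of configuring encoder bitrates (verify property names with
    # `gst-inspect-1.0 vp8enc` and `gst-inspect-1.0 x265enc`).
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    vp8 = Gst.ElementFactory.make("vp8enc", "vp8-encoder")
    vp8.set_property("target-bitrate", 2000 * 1000)  # 2000 kbit/s, in bits/s

    h265 = Gst.ElementFactory.make("x265enc", "h265-encoder")
    h265.set_property("bitrate", 2000)               # 2000 kbit/s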

Encoding and decoding speed for different bitrates

When looking at the encoding speeds for the different encoders, the time to encode increased with higher bitrates, as expected. The encoding time does not seem to be proportional to the bitrate: increasing the bitrate by a factor of 4.25, as was done for VP8, only increased the encoding time by a factor of 2.95. The same can be seen for H.265, where a fivefold increase in bitrate only increased the encoding time by a factor of 1.6.

The increase in bitrate also increased the T2VE, as stated earlier. For VP8 this increase is a factor of 1.6 at the 90th percentile. So even though the encoding time increased by a factor of 2.95, the T2VE only increased by a factor of 1.6.

One of the most interesting findings was that VP8 encoding and decoding are almost equally fast. This is something that would generally have been expected [2], but looking at the decoding performance of the other codecs, it is not the case there: the other codecs took much longer to decode than to encode. Decoding VP8 is almost 50 times faster than decoding H.264. It is highly likely that this is the factor that gives VP8 its better T2VE measurements and makes it more responsive than H.264. The short time to decode a VP8 video, combined with a relatively fast encoding speed, means that more time is left over for the other computations a mixed reality application needs to perform. Together with a perceived video quality better than MJPEG, this is why we believe that the VP8 codec is the best alternative for a mixed reality application.

6.2 Method

There are a few aspects of our methods that are important to discuss in light of our findings. One is hardware and its relation to encoder performance. The H.265 encoder is quite resource intensive and significantly slower than H.264. This raises the question of what would have happened if a faster and generally more capable computer had acted as the edge device instead of the laptop that was used. It would possibly have decreased the encoding time for all encoders, but would that be enough to close the gap between H.265 and the other encoders in terms of processing speed?

The mobile phone is another relevant part of the setup. The phone is a Samsung Galaxy A40, which is marketed as a budget phone, meaning the hardware is far from as powerful as the flagship devices the Android phone manufacturers are making. This was noticeable when using the phone even for simply going through menus; it did not provide a very smooth and seamless experience. The question here is whether a more powerful mobile device would have improved the performance at all. Testing on a new device should not be a problem as long as the same version of Android is used. It would at the very least make an implementation of H.265 on the mobile device more feasible, should future researchers try that.

Another aspect is the measurement of the encoding and decoding speeds. For this, a built-in function in GStreamer was used. Here it would have been beneficial to perform the measurements multiple times to get a larger set of data. Doing only one run of the video is a weakness, since there is no way of telling whether something went wrong, even though 1500 measurements were collected.
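One candidate for such a built-in facility is GStreamer's "latency" tracer; whether this is the exact function used in the thesis is an assumption on our part, but the sketch below illustrates how per-element timing records can be collected (the flags parameter requires a reasonably recent GStreamer release, and the log path is a placeholder).

    # Sketch: enabling GStreamer's latency tracer via environment
    # variables before GStreamer is initialised.
    import os

    os.environ["GST_TRACERS"] = "latency(flags=element)"  # per-element latency
    os.environ["GST_DEBUG"] = "GST_TRACER:7"              # emit tracer records
    os.environ["GST_DEBUG_FILE"] = "/tmp/latency.log"     # write them to a file

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)  # the tracer reads the variables at init time
    # ... build and run the pipeline, then parse /tmp/latency.log ...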


7 Conclusion

The aim of this thesis was to look into how the encoding bitrate affects the encoding/decoding speed and visual quality of the video streams, and to what extent the encoding/decoding speed affects the responsiveness of the application.

MR-Leo is a complex application with many components; more than 25000 lines of code make the mixed reality application work. In order to make any contribution to a project of that size, a lot of time goes towards reading and understanding the code and how everything fits together. The implementation of the new codecs in this thesis added about 100 lines of code. To understand where and how that code had to be written, a lot of time was spent learning how GStreamer works and how it is integrated into MR-Leo.

The research questions formulated in Chapter 1 were:

• How does the encoding bitrate affect the encoding/decoding speed and visual quality of the video streams?

• To what extent does the encoding/decoding speed affect the responsiveness of the application?

As has been shown in this thesis, the application could be made 2-3 times more responsive by using a different encoding and decoding algorithm, namely VP8, compared to the baseline implementation of H.264. The bitrate does affect the speed at which the video can be encoded and decoded, but it does not seem to play a very big role in the responsiveness. Increasing the bitrate does, however, improve the visual quality of the video. The biggest improvement in terms of responsiveness seems to come from the decoding speed, which was fastest for VP8 by a large margin. So, to answer the research questions: the bitrate affects the encoding and decoding speed, but not significantly, and the encoding and decoding algorithm has a big effect on the responsiveness of the application, with the decoder playing the biggest part.

7.1 Future work

In the future, more testing would have to be performed, especially on the client side, to gain a better understanding of the biggest factor behind the responsiveness. The measurements on the client side could probably be performed in a similar way as on the server, but instead of adding a command when running the server, some code would have to be added on the client.

More measurements with MJPEG would also be interesting, so that it can be compared to VP8; this thesis only made T2VE measurements with MJPEG and nothing involving bitrate. To do that, some way of measuring MJPEG encoding/decoding speed would need to be found and implemented, since this codec is not implemented through GStreamer as the other codecs are.
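One way such a measurement could be prototyped outside GStreamer is to time per-frame JPEG encode/decode directly. The sketch below uses Pillow as a stand-in codec (an illustration under that assumption, not the pipeline's actual MJPEG implementation; the frame and quality setting are placeholders).

    # Sketch: timing one MJPEG-style frame encode/decode with Pillow.
    import io, time
    from PIL import Image

    frame = Image.new("RGB", (1280, 720), "gray")  # placeholder frame

    t0 = time.perf_counter()
    buf = io.BytesIO()
    frame.save(buf, format="JPEG", quality=85)     # encode one frame as JPEG
    t1 = time.perf_counter()
    decoded = Image.open(io.BytesIO(buf.getvalue()))
    decoded.load()                                 # force the actual decode
    t2 = time.perf_counter()

    print(f"encode: {(t1 - t0) * 1000:.2f} ms, decode: {(t2 - t1) * 1000:.2f} ms")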

VP9, which is the successor of VP8, could also be implemented to see if there are any improvements. It would also be interesting to see how H.265 performs on the client, and especially on a more powerful phone as mentioned in Section 6.2.


Bibliography

[1] Pankaj Kumar Bansal, Vijay Bansal, Mahesh Narain Shukla, and Ajit Singh Motra. "VP8 Encoder — Cost effective implementation". In: 20th International Conference on Software, Telecommunications and Computer Networks. 2012, pp. 1–6. ISBN: 978-953-290-036-1.

[2] Nabajeet Barman and Maria Martini. "H.264/MPEG-AVC, H.265/MPEG-HEVC and VP9 codec comparison for live gaming video streaming". In: Ninth International Conference on Quality of Multimedia Experience (QoMEX). 2017, pp. 1–6. DOI: 10.1109/QoMEX.2017.7965686.

[3] Tristan Braud, Farshid Hassani Bijarbooneh, Dimitris Chatzopoulos, and Pan Hui. "Future Networking Challenges: The Case of Mobile Augmented Reality". In: IEEE 37th International Conference on Distributed Computing Systems (ICDCS). 2017, pp. 1796–1807. DOI: 10.1109/ICDCS.2017.48.

[4] Guilherme Correa, Pedro Assunção, Luciano Agostini, and Luis da Silva Cruz. "Performance and Computational Complexity Assessment of High Efficiency Video Encoders". In: IEEE Transactions on Circuits and Systems for Video Technology. Vol. 22. 2012, pp. 1899–1909. DOI: 10.1109/TCSVT.2012.2223411.

[5] Koustabh Dolui and Soumya Kanti Datta. "Comparison of edge computing implementations: Fog computing, cloudlet and mobile edge computing". In: Global Internet of Things Summit (GIoTS). 2017, pp. 1–6. DOI: 10.1109/GIOTS.2017.8016213.

[6] Christian Feller, Juergen Wuenschmann, Thorsten Roll, and Albrecht Rothermel. "The VP8 Video Codec - Overview and Comparison to H.264/AVC". In: IEEE International Conference on Consumer Electronics (ICCE). 2011. DOI: 10.1109/ICCE-Berlin.2011.6031852.

[7] Harilaos Koumaras, Michail Kourtis, and Drakoulis Martakos. "Benchmarking the encoding efficiency of H.265/HEVC and H.264/AVC". In: Jan. 2012, pp. 1–7. ISBN: 978-1-4673-0320-0.

[8] Johan Lindqvist. "Edge Computing for Mixed Reality". Master's thesis. Linköping University, 2019.

[9] Luyang Liu, Hongyu Li, and Marco Gruteser. "Edge Assisted Real-Time Object Detection for Mobile Augmented Reality". In: The 25th Annual International Conference on Mobile Computing and Networking. 2019. DOI: 10.1145/3300061.3300116.

[10] Monika Malhotra, Ajay Vikram Singh, and Rakesh Matam. "Comparative Performance Issues with H.264 vs H.265". In: International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon). 2019, pp. 283–288. DOI: 10.1109/COMITCon.2019.8862207.

[11] Robert B. Miller. "Response Time in Man-Computer Conversational Transactions". In: Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I. 1968, pp. 267–277. DOI: 10.1145/1476589.1476628.

[12] Yousef O. Sharrab and Nabil J. Sarhan. "Detailed Comparative Analysis of VP8 and H.264". In: IEEE International Symposium on Multimedia. 2012. DOI: 10.1109/ISM.2012.33.

[13] Gretchen Siegchrist. What Is Video Compression? 2019. URL: https://www.lifewire.com/what-is-video-compression-1082036 (visited on 03/28/2020).

[14] Klervie Toczé, Johan Lindqvist, and Simin Nadjm-Tehrani. "Performance Study of Mixed Reality for Edge Computing". In: Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing. 2019, pp. 285–294. DOI: 10.1145/3344341.3368816.

[15] Xiumin Wang, Jin Wang, Xin Wang, and Xiaoming Chen. "Energy and Delay Trade-off for Application Offloading in Mobile Cloud Computing". In: IEEE Systems Journal (2017), pp. 858–867.
