Department of Science and Technology
Linköping University
LiU-ITN-TEK-A--20/034--SE

Techniques for Selecting Spatially Variable Video Encoder Quantization for Remote Operation

Master's thesis in Computer Engineering carried out at the Institute of Technology, Linköping University

Daniel Olsson

Supervisor: Sasan Gooran
Examiner: Daniel Nyström


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/


Abstract

In teleoperation, a lot of data has to be sent to the operator. The data sent to the operator differs depending on the nature of the operation, but it is often a video stream from the vehicle or machine. Transmitting a video feed takes up a lot of bandwidth, and the larger the amount of data that needs to be sent, the longer it takes before it is received by the operator. However, some teleoperations have strict low-latency requirements to be considered safe, since the operator must be able to react in time to prevent an accident. One option to keep latency down is to send less data, but sending a smaller amount of data means that the operator gets less information on which to base the decision of which action to take.

This thesis project analyzes whether compressing a video stream differently, depending on the importance of certain features, would decrease the amount of data that needs to be sent without decreasing the information perceived by the operator. This is done by using both static and dynamic selection of features that are compressed more or less depending on how important they are to the operator. The results show that the bandwidth used is reduced and that most information is kept, but because the hardware used for teleoperation is limited, the usability of this method is also limited.


Acknowledgments

I would like to thank Voysys AB for coming up with the idea for this master thesis project and for all the help I have received during the project. I would like to give special thanks to my supervisor at Voysys, Jonathan Nilsson, and to Niclas Olmenius for all their help throughout the project. I would also like to thank my examiner Daniel Nyström and my supervisor Sasan Gooran for answering questions about both the project and the structure of the master thesis work.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Theory
  2.1 Teleoperation
  2.2 Codec
  2.3 Color Spaces
  2.4 Image quality assessment
  2.5 Neural network
  2.6 SIMD
3 Method
  3.1 Software and frameworks
  3.2 Hardware
  3.3 Voysys teleoperation hardware
  3.4 Data
  3.5 Implementation
  3.6 Region of interest or Region of non interest
  3.7 User test
4 Results
  4.1 ROI vs RONI
  4.2 Road scenarios
  4.3 Performance
5 Discussion
  5.1 Method
  5.2 Results
  5.3 Further development
  5.4 Source Criticism
  5.5 Work in a wider context
6 Conclusion
Bibliography
A Summary of user test
  A.1 Information about the users
  A.2 QP map with ROI and RONI in dome setup
  A.3 QP map with ROI and RONI in dome setup
  A.4 Same bit rate with regular bandwidth usage
  A.5 Same bit rate with low bandwidth usage

List of Figures

2.1 The overall structure of a codec.
2.2 The prediction model's workflow.
2.3 NV12 image format structure.
2.4 Bit rate for 11 different QP values, showing that the bit rate grows on a logarithmic scale.
2.5 An example of a fully connected neural network with three input neurons, two hidden layers with four neurons each, and an output layer with two neurons.
3.1 The dome setup at Voysys.
3.2 The three screen setup at Voysys.
3.3 A shape consisting of four points that is not a rectangle.
3.4 To calculate if a point lies inside an arbitrary 2D shape, the cross product between the shape's edges and a vector from each edge's vertex to the point is calculated. If the results from the cross products are the same for every edge, the point lies inside the shape. For P1 all vectors from the cross product will point out of the plane, while for P2 one vector will point into the plane and therefore the point does not lie inside the shape.
3.5 Data flow for Oden streamer when running the program, where red indicates stages that utilize Voysys software, green stages are implemented solely for this thesis project, and yellow is where modification to Voysys software has been necessary.
3.6 Data flow when running the program when calculating the image quality metrics, where red indicates stages that utilize Voysys software, green stages are implemented solely for this thesis project, and yellow is where modification to Voysys software has been necessary.
3.7 The frame that is used for calculating the image quality metrics for the city scenario.
3.8 The frame that is used for calculating the image quality metrics for comparing the two methods with the same bit rate.
3.9 The frame that is used for calculating the image quality metrics for the test with the same bit rate but low quality.
4.1 The frame used for comparing image quality metrics when comparing ROI (top) with RONI (bottom).
4.2 The bandwidth used for the video stream when comparing RONI and ROI compression.
4.3 The S-CIELAB image for ROI and RONI compression, where the ROI method is the top image and the RONI method is the bottom image.
4.4 Comparison of bit rate between the different settings for the city scenario.
4.5 The SSIM image for the city area, where regular compression is the top image, QP map is the middle image, and QP map with focus area is the bottom image.
4.6 The S-CIELAB image for the city area, where regular compression is the top image, QP map is the middle image, and QP map with focus area is the bottom image.
4.7 Answers from people in the user test when asked how similar the different compression methods were on a scale 1–5, where 5 is identical, after watching the three clips in the dome setup. Left is regular and QP map, the middle is regular and QP map with focus area, the right is QP map and QP map with focus area.
4.8 Answers from people in the user test when asked how similar the different compression methods were on a scale 1–5, where 5 is identical, after watching the three clips in the three-screen setup. The left is regular and QP map, the middle is regular and QP map with focus area, the right is QP map and QP map with focus area.
4.9 The bit rate for regular compression and compression with QP map, where the bit rates are almost identical.
4.10 The SSIM image for two images where both use approximately the same bit rate, but one uses regular compression (top) and one uses a QP map to decrease quality on a certain feature (bottom).
4.11 The S-CIELAB image for two images where both use approximately the same bit rate, but one uses regular compression (top) and one uses a QP map to decrease quality on a certain feature (bottom).
4.12 The users' answers on which of the two clips had even compression; almost half picked the wrong one.
4.13 The users' perceived quality differences between regular compression and emphasis. The chart to the left is the perceived difference for those who answered "Regular" in figure 4.12, and those who chose "QP map" are in the right chart.
4.14 The bit rate for three different compression methods when the quality is very low.
4.15 The SSIM image for the scenario with low quality.
4.16 The S-CIELAB image for the scenario with low quality.
4.17 Which of the compression methods the users would rather use.
4.18 Time for six users before they can identify a traffic sign with five different QP levels ranging from 31–51 in the dome setup.
4.19 Time for six users before they can identify a traffic sign with five different QP levels ranging from 31–51 in the three-screen setup.

List of Tables

2.1 Values used for the variables w_i and σ_i in the S-CIELAB metric. The variable w_i is the weight of the plane and σ_i is the spread of the visual angle in degrees.
2.2 Baseline for comparing image quality metrics.
3.1 Settings for the ROI and RONI methods.
3.2 QP map settings for the test "Teleoperation - QP map with RONI".
3.3 QP map settings for the test "Teleoperation - Same bit rate with low bandwidth usage".
4.1 The image quality metrics for the ROI compression and the RONI compression.
4.2 Image quality metrics for the city scenario.
4.3 The image quality assessment for two video streams that use approximately the same bandwidth, but one of them is compressed with a QP map.
4.4 Image quality metrics for the scenario with low quality.
A.1 Which users saw a difference in quality between the three clips in the dome setup.
A.2 Which method the user thought had the best quality in the dome setup.
A.3 What each user selected as the quality similarity between the methods in the dome setup.
A.4 The users' comments about the features that made them see a difference in quality in the dome setup.
A.5 How much the users think the bandwidth differs between the three methods, with the best one using 100%.
A.6 Which users saw a difference in quality between the three clips in the three screen setup.
A.7 Which method the user thought had the best quality in the three screen setup.
A.8 What each user selected as the quality similarity between the methods in the three screen setup.
A.9 The users' comments about the features that made them see a difference in quality in the three screen setup.
A.10 How much the users think the bandwidth differs between the three methods, with the best one using 100%.
A.11 Which users saw a difference in quality between the two methods with the same bit rate.
A.12 The method the users thought had even compression after viewing two different methods.
A.13 What the users saw as a difference between the two different methods.
A.14 The method that the user would preferably use.
A.15 What each user selected as the similarity between the methods.
A.17 The time in seconds it took for the users to see the traffic sign in the three screen setup.

1 Introduction

In recent years, one of the most researched areas has been autonomous vehicles. The research has come far, and both companies and universities are currently testing self-driving cars on roads throughout the globe. One example is Linköping University, which has launched a self-driving bus as a test at campus Valla that students can use to get around the campus [1]. However, an area that has not received much attention in the media is teleoperation.

Teleoperation is remotely controlling a machine or vehicle, which could be either autonomous or stationary. It is often performed when the task is too hard to do automatically or when the environment is too harsh for an operator to be on-site [2]. Teleoperation is for the most part used when machines need manual instructions or when multiple operations need to be supervised at the same time. Teleoperation works by having different sensors on a machine that record the current state and transmit the data to the operator through some medium such as the internet or radio. The data is then visualized to the operator, who in turn decides which action will be performed. The action is then transmitted back to the machine, which executes it. The sensors used on the machine can vary depending on the type of operation and can be everything from cameras to IR sensors. Depending on the sensors used, the transmitted data requires more or less bandwidth. One of the dangers with teleoperation is that if the data transmitted from the machine to the operator is delayed, it could cause a dangerous situation because of the increased reaction time. For some operations, the acceptable latency could be just tens of milliseconds before it starts to become dangerous; therefore it is important to keep latency down. There are different ways to keep latency down; the most common one is to compress the data sent from the machine to the operator, but the compression will affect the quality of the data and possibly the information perceived by the operator.

As more and more autonomous vehicles will traffic the roads, there will be a need for a tool to supervise and remotely control these vehicles, and this is where teleoperation comes in. A case where teleoperation would be of interest is when an autonomous car has stopped and a human operator needs to take over control. If an operator needs to physically be in the same place as the vehicle, it could take a long time before the operator arrives. However, if the vehicle could be controlled remotely, one operator could supervise and control multiple vehicles. An area where remote operation has already been deployed is air traffic control, where an air traffic control operator can remotely supervise multiple smaller airports with low traffic flow to reduce costs; this has been approved by the Swedish air navigation service [3].

1.1 Motivation

This master thesis project will explore the possibility of reducing the bandwidth used during teleoperation. The proposed method is to identify features in a video stream and use different compression rates on them depending on the importance of the specific feature. Which features are identified depends on the specific use case, but they should be selected based on their importance to the operator. Using different compression rates could violate a bandwidth constraint, which in most cases is not acceptable, since some teleoperations depend on low latency to be considered safe. To be able to compare different methods for selecting features in a video stream, a general method for evaluating image quality must be analyzed and compared against real users.

The master thesis project will be done together with Voysys AB. Voysys is a startup in Norrköping that creates software for immersive video streaming for remote operations. Their product is used by different companies both in Sweden and in the rest of the world. The project will be implemented in their software and tested against their use cases.

1.2 Aim

The purpose of this project is to analyze whether compression that depends on different features in teleoperation can either increase the perceived information or reduce the bandwidth used. The technique that will be evaluated is to extract relevant features from the scenery and compress areas differently depending on the importance of the features; more data can therefore be used to visualize important regions.

The method needs to be evaluated in terms of image quality and perceived information. Since the image has been compressed with different compression rates depending on the importance of a certain feature, an image quality metric has to be evaluated that could represent the perceived image quality for the users.

1.3 Research questions

These are the four research questions that will be answered in this thesis:

1. How will a segmentation of different compression rates in a video stream affect the bandwidth, and how can it be ensured that it does not violate a bandwidth limitation?
2. What image quality metric could be used to determine the quality of the video stream that corresponds to what the user perceives in the case of different compression rates?
3. How will region of interest compression affect the performance of the current hardware used for teleoperation, and would a limitation of the hardware affect the decision of the operator negatively?
4. Would region of interest or region of non-interest compression be the most suitable for use in teleoperation with regard to safety?

1.4 Delimitations

Teleoperation is a wide concept and can be applied to many different areas, and all applications will not have the same relevant features. Therefore, the implementation and evaluation will be for vehicles that travel on roads, such as cars, trucks, and buses.

As the impact of a compressed image is greater when the screen is larger, the project will focus on the cases where a dome or multiple monitors are used as the receiving display. Smaller monitors will be used as a development tool, but all tests will focus on larger screens.

There are a lot of different features that could be extracted and interpreted by the encoder for a specific scenario, but because of the time limitation, only a limited number of features will be extracted. The features selected are those that occur most often or have the largest relevance.

2 Theory

This chapter will present the theory and background of this thesis project. The areas that will be covered are codecs, teleoperation, machine learning, color spaces, and image quality assessment.

2.1 Teleoperation

Teleoperation is the term for controlling a vehicle from a distance. It is used for a variety of reasons, such as when there is a risk of injuries to the operator or when costs could be reduced by having one operator controlling multiple vehicles at different locations, for example by having one air traffic controller operating several small airfields from one location [3]. Teleoperation can be divided into two different types, manual and supervised. Manual is where all instructions to the vehicle are determined by the operator, and supervised is when the operator monitors the machine and gives more general directions. Supervised teleoperation can be used for an autonomous vehicle where the operator gives a high-level directive to the vehicle, which then transforms the directive into smaller instructions that it executes [2, 4].

Depending on the specific work, teleoperation can use a variety of visualization tools. Most teleoperations use screens to display either a video stream or other vital information to the operator. Depending on the situation, different displays could be used, such as Virtual Reality (VR) goggles, domes, or monitors. Some operations are more sensitive to the feedback from the vehicle, especially if the operation is controlled in real-time, and therefore careful regard to the latency is important to ensure safe operation.

2.2 Codec

A codec is a computer program that contains both an encoder and a decoder. An overview of a codec's pipeline is shown in figure 2.1. First, an input source is sent to the encoder, which outputs a compressed file; the file can then either be transmitted or stored before it is read by the decoder and shown through some medium. The main purpose of a codec is to compress the data that is transferred or stored to make the file size smaller. Most codecs work by removing redundant data from the stream, using different compression methods depending on the codec used.

Figure 2.1: The overall structure of a codec.

Overall, for most codecs the encoder has a much heavier workload than the decoder, since the encoder must decide how to encode the stream while the decoder gets that information from the compressed file [5]. The two codecs that are of interest in this thesis project are H.264 and HEVC, since these are the only available codecs in the software. These two codecs will be discussed in this section.

2.2.1 H.264

H.264, which is also called Advanced Video Coding, uses a lossy compression method, which means that some of the information is lost in the compression. It was in the beginning an extension to the MPEG-4 format before it became its own standard, and its goal is to produce high-quality video compression with the lowest bit rate possible. However, this comes with some downsides: the encoding and decoding process is very complex and computationally heavy. To reduce the computation time, the H.264 codec has been developed to support both multithreaded software computations and hardware acceleration [5]. However, the performance gain has some limitations, since the codec was not designed with multithreading in mind; multithreading was a new technology during its development. The H.264 codec consists of three different parts, a prediction model, a spatial model, and an entropy model, which are present in both the encoder and the decoder [6].

The prediction model takes the current uncompressed data and the previously encoded data as inputs and creates a frame of residuals as its output, as seen in figure 2.2. From the previously encoded data, the encoder tries to predict what the data will look like, and this can be done with either intra or inter prediction. Intra prediction uses a spatial model and works by dividing the data into macroblocks (MB); the blocks are predicted from the previously coded blocks in the same frame. Inter prediction uses a temporal model that takes either a past or a future frame as its reference frame. From the reference frame the prediction is made; the method for this varies depending on how complex and precise the encoder implementation is. The difference between the raw uncompressed frame and the prediction is calculated to create the residuals, which are the input to the spatial model [6].

The spatial model gets the residuals from the prediction model, and its purpose is to reduce spatial redundancy. First, the residuals are transformed into another domain, where they are then quantized. The residuals are quantized to remove values that are insignificant and to create a more compact representation [6].

The last part of the encoder is the entropy model. The entropy model compresses all information that is relevant for the frame, such as the model for how the prediction was made and its corresponding values, and the quantized residual values from the spatial model. The compression in the entropy model removes statistical redundancy and gives the output as a bit stream [6].

To get back the information from the compressed file, the decoder does the opposite of the encoder. First, the entropy model takes the bitstream from the compressed file and decodes it. The information that is received is the prediction model information and the quantized residual values. The spatial model transforms back the residual values, and this is the step where information has been lost in the encoding. The decoder then predicts the frame and adds the residual values to the frame to give the result [6, 7].

Figure 2.2: The prediction model's workflow.

To simplify the H.264 format, it could be said that the codec uses three different types of frames. There are I-frames, P-frames, and B-frames, which correspond to different parts of the process that was previously described. The I-frame is the first stage, is either an intra or inter frame, and is the one that is used to calculate the other inter frames. The P-frame is the prediction and contains the difference calculated either between frames or the spatial difference. This frame is then used to reconstruct the images combined with the I-frame. The B-frame is a newer type of frame that is the same as the P-frame except that it could be calculated from either the next or the previous frame, which adds some benefits to the compression [8].

The value that controls the compression rate is the Quantization Parameter (QP), and it is used in the spatial model step. The QP value can be seen as a step size, where a larger QP value gives a more compressed image. The QP value ranges between 1–51, where 1 gives a lossless image compression. The QP value is applied to the residuals to decrease varying values, which in turn makes the entropy model more efficient. There are many different ways to apply the QP value to the residuals, and equation 2.1 shows how a scalar quantization is calculated. The residual value X is divided by the QP value and then rounded off to give the quantized value C_I. In simpler terms, this means that a higher QP makes the residual values fluctuate less [6].

\[ C_I = \operatorname{round}\!\left(\frac{X}{QP}\right) \tag{2.1} \]

The decoder then takes the compressed residual and multiplies it by the QP value, as in equation 2.2, to scale back the residual values. Since the residual value has been re-scaled, the image will not be the same after the compression.

\[ Y = C_I \cdot QP \tag{2.2} \]
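To make the effect of the quantization step concrete, the following minimal C++ sketch applies the scalar quantization from equations 2.1 and 2.2 to a small block of residual values. The function names and the example residuals are chosen for this illustration only and are not taken from any particular codec implementation; note how small residuals collapse to zero at high QP values.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Quantize a residual value with a given QP (equation 2.1).
    int quantize(double residual, int qp) {
        return static_cast<int>(std::lround(residual / qp));
    }

    // Rescale the quantized value back (equation 2.2). The rounding in
    // quantize() is where information is lost.
    double dequantize(int level, int qp) {
        return static_cast<double>(level) * qp;
    }

    int main() {
        const std::vector<double> residuals = {23.0, -7.0, 4.0, 1.0};
        for (int qp : {6, 26, 46}) {
            std::printf("QP = %d:", qp);
            for (double r : residuals) {
                const double reconstructed = dequantize(quantize(r, qp), qp);
                std::printf("  %.0f -> %.0f", r, reconstructed);
            }
            std::printf("\n");
        }
        return 0;
    }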

2.2.2 HEVC

HEVC stands for High-Efficiency Video Coding and is also known as H.265, since it is the successor to the H.264 format. The goal of the HEVC format was to support all functionality that H.264 has while focusing on its two main weaknesses, which are the maximum resolution and the limited support for parallelism. The HEVC format succeeds in addressing both of these disadvantages: it has support for up to 8K resolution and a better frame rate for processors with a higher core count. The encoding process has also been improved, so a frame with image quality similar to H.264 takes almost 50% less disk space [7].

One of the changes made between H.264 and HEVC is that a larger MB size is supported. HEVC divides the image into coding tree units (CTU), which each have a size of L x L where L can be 16, 32, or 64 pixels. In each CTU there is a luma coding tree block (CTB) and chroma CTBs, which both contain coding blocks (CB) of size L/2 x L/2. A CB can be divided further when the data in the image requires it. When a CB is split further, it is stored in the CTB as a tree structure, and the minimum size of a CB is 8x8 pixels [7]. This means that areas with similar structures can be represented by larger blocks, while smaller blocks are simultaneously supported for areas that would benefit from them.

With the increase in performance from the HEVC format, the complexity of the encoding process has increased. How to decrease this complexity without any loss in performance is an active research area. This is, however, not a great problem for the user, since the format uses a high-level syntax for its function calls.

2.3 Color Spaces

Color spaces are used to standardize different ways of representing a color geometrically. A color space consists of a coordinate system and a subsystem within the coordinate system, so that a color can be represented by a geometrical point in the system. Some color spaces are more oriented toward hardware, such as RGB and CMYK, while others are oriented toward applications, such as CIEXYZ [9]. There are different advantages and disadvantages for each color space, and it can therefore be useful to switch between different color spaces depending on the application. The color spaces that will be discussed in this thesis are RGB, CIEXYZ, CIELAB, and YUV.

2.3.1 RGB

The RGB color space is based on the Cartesian coordinate system and contains three channels corresponding to red, green, and blue. The RGB color space is often represented by a cube where the corners on each axis correspond to red, green, and blue. The color at the origin is black and the corner furthest away from the origin is white [9]. The advantage of the RGB color space is that it is very intuitive and easy to use. As mentioned in chapter 2.3, it is hardware-specific, and most LCD and LED screens use this color space to display color. The disadvantage of the RGB color space is that it is device-dependent, which means that the color displayed depends on the characteristics of the hardware used to display the image [10]. To get around this problem, the RGB color space is often transformed into a device-independent color space such as CIEXYZ.

2.3.2 CIEXYZ

The CIEXYZ color space comes from the fact that the human eye has three types of cones that sense light at different wavelengths. The three cones correspond to the three tristimulus values X, Y, and Z, which in turn correspond to three primary colors. However, these values do not correspond to real colors but are conceptual. To calculate CIEXYZ, equations 2.3, 2.4 and 2.5 are used [11].

\[ X = k \int_\lambda R(\lambda)\, l(\lambda)\, \bar{x}(\lambda)\, d\lambda \tag{2.3} \]
\[ Y = k \int_\lambda R(\lambda)\, l(\lambda)\, \bar{y}(\lambda)\, d\lambda \tag{2.4} \]
\[ Z = k \int_\lambda R(\lambda)\, l(\lambda)\, \bar{z}(\lambda)\, d\lambda \tag{2.5} \]
\[ k = \frac{100}{\int_\lambda l(\lambda)\, \bar{y}(\lambda)\, d\lambda} \tag{2.6} \]

Here R(λ) is the reflectance, l(λ) is the incoming light, and k is a normalizing factor. The value k, calculated in equation 2.6, normalizes the colors so that a white surface has the value Y = 100. The CIEXYZ color space is very hard to interpret, since there is no correlation between the perceived difference in colors and their positions in the coordinate system. To get around this problem, the CIELAB color space is used instead.

2.3.3 CIELAB

CIELAB is derived from CIEXYZ, and its goal is to provide a coordinate system that corresponds to how our eyes perceive differences in color. To calculate L*, a*, and b* in the CIELAB color space, the following are needed: a color sample in the CIEXYZ color space and the white point for the current illumination. The values for the CIELAB color space can be calculated with equations 2.7, 2.8, 2.9 and 2.10 [9, 11].

\[ L^* = 116\, h\!\left(\frac{Y}{Y_W}\right) - 16 \tag{2.7} \]
\[ a^* = 500\left[ h\!\left(\frac{X}{X_W}\right) - h\!\left(\frac{Y}{Y_W}\right) \right] \tag{2.8} \]
\[ b^* = 200\left[ h\!\left(\frac{Y}{Y_W}\right) - h\!\left(\frac{Z}{Z_W}\right) \right] \tag{2.9} \]
\[ h(q) = \begin{cases} \sqrt[3]{q} & q > 0.008856 \\ 7.787\,q + \dfrac{16}{116} & q \le 0.008856 \end{cases} \tag{2.10} \]

Here X, Y, and Z are the values for the color in the CIEXYZ color space, and the corresponding variables with index W are the color values for the white point. In the CIELAB color space, the perceived difference between two colors corresponds to the Euclidean distance between the two in the coordinate system [12, 13]. The use of the CIELAB color space is discussed in section 2.4.5.

2.3.4 YUV

The conversion from the YUV color space to RGB is shown in equation 2.11.

\[ \begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1.13983 \\ 1 & -0.39465 & -0.58060 \\ 1 & 2.03211 & 0 \end{bmatrix} \begin{bmatrix} Y \\ U \\ V \end{bmatrix} \tag{2.11} \]

The YUV color space takes advantage of the fact that the human eye has a lower sensitivity to chrominance than to luminance components. Therefore it is more important to have accurate luminance components. This is exploited in the NV12 image format, also known as 4:2:0, where the luminance component Y is stored in one plane and the chrominance components U and V are stored together in another plane. Each Y component corresponds to one pixel, while each U and V component corresponds to four pixels, as seen in figure 2.3. This means that storing an image in the YUV color space with NV12 image formatting takes half the disk space compared to storing it as RGB [6].


Figure 2.3: NV12 image format structure.
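To illustrate the NV12 layout described above, the sketch below computes the size of an NV12 buffer and samples the Y, U, and V values for a given pixel. The helper names are invented for this example, and the code assumes a tightly packed buffer with no row padding.

    #include <cstdint>
    #include <cstddef>

    // Total number of bytes for a tightly packed NV12 image:
    // a full-resolution Y plane followed by an interleaved UV plane
    // at quarter resolution (one U and one V byte per 2x2 pixel block).
    std::size_t nv12_size(int width, int height) {
        return static_cast<std::size_t>(width) * height * 3 / 2;
    }

    struct Yuv { std::uint8_t y, u, v; };

    // Sample the YUV values for pixel (x, y). Every pixel has its own Y
    // sample, while each 2x2 block of pixels shares one U and one V sample.
    Yuv nv12_sample(const std::uint8_t* data, int width, int x, int y) {
        const std::uint8_t* y_plane = data;
        const std::uint8_t* uv_plane = data + static_cast<std::size_t>(width) * width;
        // Interleaved UV plane starts after the Y plane; note: the offset above
        // assumes a square image for brevity, see nv12_size() for the general case.
        const std::size_t uv_index =
            static_cast<std::size_t>(y / 2) * width + (x / 2) * 2;
        return Yuv{y_plane[static_cast<std::size_t>(y) * width + x],
                   uv_plane[uv_index],        // U
                   uv_plane[uv_index + 1]};   // V
    }

    int main() {
        // A 1920x1080 NV12 frame needs 1920*1080 luma bytes plus 1920*540
        // interleaved chroma bytes, i.e. 3110400 bytes in total.
        return nv12_size(1920, 1080) == 3110400 ? 0 : 1;
    }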

2.4 Image quality assessment

To be able to evaluate how compression depending on features affects the image quality, different metrics will be evaluated to see which one best fits the use case. There are plenty of different ways to perform image quality assessment (QA). The two major methods for performing QA are subjective assessment and objective assessment.

Subjective assessment uses human observers to give each image a rank depending on the perceived quality of the image. The advantages of subjective assessment are that the ranking corresponds directly to how humans perceive quality and that it is very easy to execute. However, the downsides are that it requires a large number of people to get a good estimate, it is not cost-effective, and there can be inconsistencies among the people participating in the test [14].

Objective quality metrics do not use any human observers but rely on an algorithm to calculate a metric that corresponds to a certain level of quality. This is often used in the industry because it is faster, less expensive, and more reliable [12]. However, it is hard to find a metric that exactly describes how humans perceive quality. There are many different quality metrics, both simple and advanced, but tests have shown that there is no statistical difference between them [15].

How to calculate the QA with an objective assessment depends on what information is available, and there are three different cases. The first one is no reference, which is when only the reproduced image is available and not the original; reduced reference is when there is some information about the original, such as a histogram of pixel values; and full reference is when the original and the reproduced image are both available to use for the QA [12].

When talking about QA there are three terms often used: image fidelity, image difference, and image similarity. These terms can all describe QA and have similar but not identical meanings. Image fidelity is how well the image is reproduced and how visible the errors from the reproduction are, image difference is how large the difference between the original and the reproduction is, and image similarity is how similar the two images are to each other [12].

2.4.1 Mean Square Error

Mean Square Error (MSE) is a very simple method to calculate the image fidelity between a reproduction and its original. It is the mean of the squared differences between the original's and the reproduction's pixel values. The equation can be seen in equation 2.12, where O(x, y) is the reference image, R(x, y) is the reproduced image, in this case the compressed image, and M and N are the dimensions of the image [13].

\[ MSE = \frac{1}{NM} \sum_{y=0}^{M-1} \sum_{x=0}^{N-1} \left( O(x, y) - R(x, y) \right)^2 \tag{2.12} \]

MSE is very easy to implement and efficient. However, MSE has a reputation for performing badly in most image analysis situations [16].

2.4.2 Signal to noise ratio

Signal to Noise Ratio (SNR) is defined as the ratio between the original signal and the noise that affects the image. It is calculated by dividing the sum of the squared signal values by the sum of the noise, where the noise is defined as the squared difference between the original and the compressed image. The full equation is seen in equation 2.13, where a higher SNR value means that the image is more similar to the original [13].

\[ SNR = 10 \cdot \log_{10}\!\left( \frac{\sum_{i=0}^{M} \sum_{j=0}^{N} O(x, y)^2}{\sum_{i=0}^{M} \sum_{j=0}^{N} \left( O(x, y) - R(x, y) \right)^2} \right) \tag{2.13} \]

In equation 2.13, O(x, y) is a pixel in the reference image, R(x, y) is a pixel from the reproduction, and N and M are the dimensions of the images.

2.4.3 Peak signal to noise ratio

Peak signal to noise ratio (PSNR) is a measurement of how similar a compressed image is to the original, obtained by dividing the maximal error by the MSE. The maximal error is the largest difference between two pixels in the images. The formula can be seen in equation 2.14.

\[ PSNR = 20 \cdot \log_{10}\!\left( \frac{MSE_{max}}{MSE} \right) \tag{2.14} \]

A larger PSNR value means that the image is more similar to the original [6, 13].
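The following C++ sketch computes MSE and the PSNR variant given in equations 2.12 and 2.14 for two grayscale images stored as flat byte arrays. The function names and the tiny example images are assumptions made for this illustration, and the maximal error is taken as the largest absolute pixel difference, following the definition above; it is assumed that the two images are not identical.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // MSE between a reference image and its reproduction (equation 2.12),
    // both stored as flat 8-bit grayscale buffers of the same size.
    double mse(const std::vector<std::uint8_t>& ref,
               const std::vector<std::uint8_t>& rep) {
        double sum = 0.0;
        for (std::size_t i = 0; i < ref.size(); ++i) {
            const double diff = static_cast<double>(ref[i]) - rep[i];
            sum += diff * diff;
        }
        return sum / static_cast<double>(ref.size());
    }

    // PSNR as defined in equation 2.14, with the maximal error taken as the
    // largest absolute pixel difference between the two images.
    double psnr(const std::vector<std::uint8_t>& ref,
                const std::vector<std::uint8_t>& rep) {
        double max_err = 0.0;
        for (std::size_t i = 0; i < ref.size(); ++i) {
            max_err = std::max(max_err,
                               std::fabs(static_cast<double>(ref[i]) - rep[i]));
        }
        return 20.0 * std::log10(max_err / mse(ref, rep));
    }

    int main() {
        const std::vector<std::uint8_t> original   = {10, 20, 30, 40, 50, 60, 70, 80};
        const std::vector<std::uint8_t> compressed = {12, 18, 30, 41, 47, 60, 72, 80};
        std::printf("MSE = %.3f, PSNR = %.3f dB\n", mse(original, compressed),
                    psnr(original, compressed));
        return 0;
    }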

2.4.4 Structural Similarity

Structural similarity (SSIM) is a measurement that shows the visible difference between a reference image and the compressed image. The algorithm works by comparing structural information in the image regardless of its illumination or contrast. The algorithm compares pixels in a smaller area, which is usually 8x8 pixels [12]. The algorithm for SSIM is shown in equation 2.15.

\[ SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{2.15} \]

In equation 2.15, x and y are the positions of the smaller area in the two images, µ is the mean intensity of the area, σ is the standard deviation of the area, σ_xy is the covariance between the two areas, and C_1 and C_2 are constants. To calculate the constants C_1 and C_2, the dynamic range L of the image is used together with a variable K, where K_i ≪ 1. The calculation is shown in equation 2.16.

\[ C_i = (K_i L)^2 \tag{2.16} \]

If a single value is of interest for the whole image, equation 2.17 is used.

\[ MSSIM(X, Y) = \frac{1}{W} \sum_{j=1}^{W} SSIM(x_j, y_j) \tag{2.17} \]

SSIM can be used both as a single value to compare the difference between two images and as an image where areas with greater distortion are more highlighted.
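As an illustration of equations 2.15 and 2.16, the sketch below computes SSIM for a single pair of corresponding windows. The constants K1 = 0.01 and K2 = 0.03, the dynamic range L = 255, and the function names are conventional choices assumed for this example rather than values taken from the thesis; a full MSSIM would average this over all windows as in equation 2.17.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // SSIM for one pair of corresponding windows (e.g. 8x8 pixels),
    // following equations 2.15 and 2.16.
    double ssim_window(const std::vector<std::uint8_t>& a,
                       const std::vector<std::uint8_t>& b) {
        const double n = static_cast<double>(a.size());
        double mean_a = 0.0, mean_b = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            mean_a += a[i];
            mean_b += b[i];
        }
        mean_a /= n;
        mean_b /= n;

        double var_a = 0.0, var_b = 0.0, cov = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            const double da = a[i] - mean_a;
            const double db = b[i] - mean_b;
            var_a += da * da;
            var_b += db * db;
            cov += da * db;
        }
        var_a /= n;
        var_b /= n;
        cov /= n;

        const double c1 = (0.01 * 255.0) * (0.01 * 255.0);  // C1 = (K1 * L)^2
        const double c2 = (0.03 * 255.0) * (0.03 * 255.0);  // C2 = (K2 * L)^2
        return ((2.0 * mean_a * mean_b + c1) * (2.0 * cov + c2)) /
               ((mean_a * mean_a + mean_b * mean_b + c1) * (var_a + var_b + c2));
    }

    int main() {
        const std::vector<std::uint8_t> original   = {52, 55, 61, 59, 79, 61, 76, 61};
        const std::vector<std::uint8_t> compressed = {54, 55, 60, 58, 75, 63, 74, 63};
        std::printf("SSIM = %.4f\n", ssim_window(original, compressed));
        return 0;
    }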

2.4.5 ∆E

One of the most used metrics for image quality is ∆E, which is simply the Euclidean distance between two colors in the CIELAB color space. Since the CIELAB color space is made to imitate how our eyes perceive differences in color, the ∆E value works fairly well. The distance between the two colors can be calculated with equation 2.18 [12].

\[ \Delta E = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2} \tag{2.18} \]

∆E is often used because it is a good enough measurement and is easy and fast to implement.
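Given a CIELAB color type like the one sketched in section 2.3.3, ∆E from equation 2.18 reduces to a few lines of code; the struct and function names below are illustrative only and do not come from Oden.

    #include <cmath>

    struct Lab { double l, a, b; };

    // Color difference between two CIELAB colors (equation 2.18).
    double delta_e(const Lab& p, const Lab& q) {
        const double dl = p.l - q.l;
        const double da = p.a - q.a;
        const double db = p.b - q.b;
        return std::sqrt(dl * dl + da * da + db * db);
    }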

2.4.6 S-CIELAB

S-CIELAB extends ∆E with spatial filtering that imitates the human visual system. First, the image is transformed from RGB to the device-independent color space CIEXYZ, and then it is transformed into the opponent color space as seen in equation 2.19.

\[ \begin{aligned} O_1 &= 0.279X + 0.72Y - 0.107Z, \\ O_2 &= -0.449X + 0.29Y - 0.077Z, \\ O_3 &= 0.086X - 0.59Y + 0.501Z \end{aligned} \tag{2.19} \]

Each of the three channels contains different information about the image. Luminance is in the O_1 channel, red-green information is in the O_2 channel, and blue-yellow information is in the O_3 channel. Each channel is then filtered by a two-dimensional kernel, seen in equation 2.20.

\[ f = k \sum_i w_i E_i \tag{2.20} \]

The variable E_i is calculated by equation 2.21.

\[ E_i = k_i\, e^{-(x^2 + y^2)/\sigma_i^2} \tag{2.21} \]

The variable k_i normalizes the filter so that it sums to one. The parameters w_i and σ_i depend on the color plane and can be seen in table 2.1.

Table 2.1: Values used for the variables w_i and σ_i in the S-CIELAB metric. The variable w_i is the weight of the plane and σ_i is the spread of the visual angle in degrees.

  Plane          w_i      σ_i
  Luminance      0.921    0.0283
                 0.105    0.133
                -0.108    4.336
  Red-Green      0.531    0.0392
                 0.330    0.494
  Blue-Yellow    0.488    0.0536
                 0.371    0.386

After the image has been filtered, it is transformed back into CIEXYZ and then into CIELAB, where ∆E can be calculated as in equation 2.18. This gives a value for each pixel that can be displayed, or the average can be calculated to give a single value for the QA [12].
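A minimal sketch of the opponent-space transform in equation 2.19 is shown below; the type and function names are assumptions for this example, and the subsequent Gaussian filtering and back-transform described above are omitted for brevity.

    #include <array>

    struct Xyz { double x, y, z; };

    // Opponent color channels O1 (luminance), O2 (red-green) and
    // O3 (blue-yellow) from equation 2.19.
    std::array<double, 3> xyz_to_opponent(const Xyz& c) {
        return {0.279 * c.x + 0.72 * c.y - 0.107 * c.z,
                -0.449 * c.x + 0.29 * c.y - 0.077 * c.z,
                0.086 * c.x - 0.59 * c.y + 0.501 * c.z};
    }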

2.4.7 Interpreting the quality metrics

To be able to evaluate the result when comparing two different QA values against each other, a guideline for what is considered a large difference in quality needs to be set. There are no definitive numbers for how much each image metric can change before a noticeable difference is shown, because different image quality metrics behave differently depending on how the image has been distorted. To get a baseline to draw conclusions from, a video sequence is compressed using different QP values, and the resulting metrics serve as a baseline for what is considered good or bad. Table 2.2 shows the different image quality metrics for ten different QP values, and it can be seen that the metrics do not scale linearly between the QP values. These values will be used as a base for how much the metrics can change.

Table 2.2: Baseline for comparing image quality metrics.

  QP Value   MSE        SNR        PSNR       SSIM      ∆E         S-CIELAB
  6          0.7294     43.5052    49.5011    0.9955    0.52554    0.0121
  11         1.29136    41.3744    47.3704    0.9931    0.759213   0.0201
  16         1.70032    39.8295    45.8255    0.9909    0.933156   0.0295
  21         2.93675    37.4562    43.4521    0.9858    1.26061    0.0444
  26         4.99537    35.1492    41.1451    0.9787    1.61212    0.0643
  31         7.55905    33.3502    39.3461    0.9719    1.83931    0.0856
  36         12.2761    31.2442    37.2402    0.9606    2.1634     0.1109
  41         20.3482    29.0496    35.0455    0.9454    2.50025    0.1402
  46         37.3848    26.4079    32.4038    0.9177    3.1417     0.1954
  51         70.9119    23.6277    29.6236    0.8702    4.34835    0.3065

As seen in table 2.2, the quality metrics change very slowly for low QP values while changing faster for higher values. The inverse is seen in the bit rate, shown in figure 2.4, where the bit rate follows a logarithmic curve when the QP value decreases.

Figure 2.4: Bit rate for 11 different QP values, showing that the bit rate grows on a logarithmic scale.

2.5 Neural network

A neural network is a technique that tries to mimic how the human brain works. This is done by evolving and learning from examples with predetermined correct answers. The network is built up of different layers of neurons that can be activated. The activation in one layer determines which neurons in the next layer should be activated, and this continues until the last layer, which corresponds to the output of the network [17]. A neural network contains one input layer, an arbitrary number of hidden layers, and one output layer, as seen in figure 2.5. The number of neurons in each layer depends on the application, where each neuron in the input layer corresponds to one input value; for example, a network that analyzes an image would have a neuron in the input layer for each pixel in the image. The number of neurons in the hidden layers can vary, and it has to be tested which number gives the best result. The neurons in the output layer correspond to the classification that the application should perform [18].

Figure 2.5: An example of a fully connected neural network with three input neurons, two hidden layers with four neurons each, and an output layer with two neurons.

The activation of the neurons in each layer depends on the previous layer. The input layer gets its activation from some input data and passes its activation on to the next layer with some weight w attached to it. Each neuron in the next layer sums up all the activations in the previous layer multiplied by their weights to determine its own activation, as seen in equation 2.22, where x_i is the output of the node and w_ij is the weight for the node x_j in the previous layer.

\[ x_i = \sum_{j=1}^{n} w_{ij} \cdot x_j \tag{2.22} \]
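A minimal sketch of the weighted sum in equation 2.22 for one fully connected layer is given below. The layer sizes, weights, and function name are arbitrary values chosen for this illustration; a real network would typically also add a bias term and a non-linear activation function.

    #include <cstdio>
    #include <vector>

    // Forward pass through one fully connected layer: each output neuron i is
    // the weighted sum of all inputs x_j (equation 2.22).
    std::vector<double> forward_layer(
        const std::vector<double>& input,
        const std::vector<std::vector<double>>& weights) {
        std::vector<double> output(weights.size(), 0.0);
        for (std::size_t i = 0; i < weights.size(); ++i) {
            for (std::size_t j = 0; j < input.size(); ++j) {
                output[i] += weights[i][j] * input[j];
            }
        }
        return output;
    }

    int main() {
        // Three input neurons feeding a hidden layer with two neurons.
        const std::vector<double> input = {0.5, -1.0, 2.0};
        const std::vector<std::vector<double>> weights = {{0.1, 0.4, -0.2},
                                                          {-0.3, 0.8, 0.5}};
        const std::vector<double> hidden = forward_layer(input, weights);
        std::printf("hidden = [%.2f, %.2f]\n", hidden[0], hidden[1]);
        return 0;
    }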

In the end, all layers should correspond to a certain output, and this is where the training phase takes place. As the network goes through the training process, it checks the result and changes the weights so that the result will be closer to the correct answer. Doing this multiple times will hopefully give a good result. However, the result can vary because of different factors, such as the size of the network and the accuracy of the training data.

There are some downsides to using a neural network, especially that the training of the network can take hours to days depending on the processing power of the hardware and the complexity of the network. This means that small changes to the network can require a lot of time before they can be analyzed. The second downside is that there is no theoretical method to choose the number of neurons in the hidden layers, which must be determined by the creator of the network. This, in combination with the previous downside, makes it a tedious process to train a network. However, for applications for which many networks have previously been trained, there are often guidelines for how the network should look [19]. There are also other techniques to set the number of layers and neurons in the hidden layers; one is to combine the neural network with a genetic algorithm that evolves the network. The genetic algorithm adds and removes parts of the hidden layers until a certain output is reached [20].

2.6 SIMD

SIMD is one of the classes in Flynn's taxonomy, which classifies different architectures depending on whether there are single or multiple instruction and data streams [21]. SIMD stands for single instruction, multiple data, and this is typically the architecture that a GPU uses, where it performs the same instruction on a huge data stream, such as each pixel in an image. But the SIMD architecture can also be used on the CPU, either using multiple threads, often called SIMT (single instruction, multiple threads), or using the CPU's vectorization.

Vectorization is one of the techniques that CPU manufacturers use to counter Moore's law [22]. It works by having the CPU apply an instruction to a vector of data instead of just one data entry. The instruction is applied to the whole vector simultaneously with a minimal overhead cost by using a separate pipeline in the CPU dedicated to this. CPUs have different sizes for the registers containing the vector, which means that the speedup gained by vectorization varies depending on the CPU used. The speedup also depends on which data the instruction is applied to, since a floating-point variable uses more space in the vector than a char [23].
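To make the idea of vectorization concrete, the sketch below adds two float arrays eight elements at a time using AVX intrinsics. This is a generic illustration assuming an x86 CPU with AVX support and is not taken from the Oden code base.

    #include <immintrin.h>
    #include <cstdio>

    // Add two float arrays element-wise. The vectorized loop processes eight
    // floats per instruction; the scalar tail handles any remaining elements.
    void add_arrays(const float* a, const float* b, float* out, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            const __m256 va = _mm256_loadu_ps(a + i);
            const __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; ++i) {
            out[i] = a[i] + b[i];
        }
    }

    int main() {
        float a[10], b[10], out[10];
        for (int i = 0; i < 10; ++i) {
            a[i] = static_cast<float>(i);
            b[i] = 10.0f - static_cast<float>(i);
        }
        add_arrays(a, b, out, 10);
        std::printf("out[0] = %.1f, out[9] = %.1f\n", out[0], out[9]);
        return 0;
    }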

3 Method

This chapter contains the method for this thesis project. To answer the four research questions, a program needs to be implemented that supports teleoperation. Voysys' software Oden already has support for teleoperation, and the project will be implemented in their software. The project will be implemented as a plugin to Oden, which still gives access to all necessary functionality without interrupting the core code. However, some code must still be implemented in Oden's core, such as changes to the encoder.

3.1 Software and frameworks

In this thesis project, multiple different software packages and frameworks will be used to add functionality to the project, and these are presented in this section.

3.1.1 Oden

The implementation will be done in Voysys' proprietary software for immersive remote operations. The software is called Oden, and it is built as several smaller modules; the modules that are relevant for this thesis work are Oden streamer and Oden VR. Oden streamer is the module that takes an input and preprocesses it before streaming it to the receiver. Oden streamer can accept data from a variety of sources such as files, cameras, and 3D environments. This is where the encoder is implemented, and therefore most work that affects Oden's core will be implemented here. Oden VR is the module that decodes the transmitted data and displays it. It has a 3D render engine that has support for displaying video streams on, for example, a single monitor, multiple projectors in a dome, or VR goggles.

Voysys' software is built with C, C++, and Rust with OpenGL as the rendering API. The build environment used is Qt Creator, and the version-control software is git together with GitHub. The software uses some external tools that will be interacted with, such as Dear ImGui for GUI rendering. The project will be implemented as a plugin to Oden so that it is a stand-alone feature.

3.1.2 OpenVino

OpenVino (https://docs.openvinotoolkit.org/) is a toolkit from Intel that has the functionality to optimize and run deep learning models on different hardware, but mainly on Intel hardware. The toolkit contains two different parts: a model optimizer that optimizes the network for the specific hardware, and an engine that can execute the optimized model. The OpenVino engine is implemented in Oden and can be used with some smaller configurations to it.

3.1.3 NVENC

The encoder API used is NvEnc (https://developer.nvidia.com/) from Nvidia, which can encode in both the H.264 and HEVC standards; it is accessible through a high-level API that lets the encoder be integrated into other source code. Which codec will be used depends on several factors, such as speed, compression rate, and functionality. H.264 is faster to encode than HEVC since it is less complex. However, HEVC gives a result that reduces the bandwidth used by up to 50% with the techniques mentioned in section 2.2.2. The difference in functionality is that HEVC does not support all the QP map modes that H.264 does, and therefore H.264 must be used if a particular map mode is chosen that HEVC does not support. The choice will be to focus on HEVC since this is the newer standard, and Voysys has a desire to switch all their operations over to HEVC once they have evaluated the eventual performance loss.

3.1.3.1 Rate control

Rate control is the part of the encoder that sets the QP value for the current frame. The rate control takes a lot of different factors into account to decide which QP value will be used, but its main focus is to keep the bit rate at a similar level according to the user's settings.

3.1.3.2 Settings

Many different settings can be used to achieve different results with the encoder. Some are not relevant for this thesis project, so they will not be discussed.

Constant bit rate. Constant bit rate is a setting that sets an average bit rate that the rate control tries to match by changing the QP. Constant bit rate is a good fit when the application has a bandwidth constraint that should not be exceeded. This is the mode that Voysys currently uses, but since their operations do not always have a similar bit rate, they have made a small modification to it. They have created their own module that controls the average bit rate that the rate control uses, depending on the network conditions.

Variable bit rate. Variable bit rate sets two variables, the average bit rate and the maximum bit rate. Rate control then tries, over a longer time, to keep the bit rate around the average without going over the maximum. This mode is better when there is some room for the bit rate to fluctuate and a more steady QP value is desired.

Constant QP. Constant QP uses the same QP for all frames without any regard to the bit rate. This mode gives the most fluctuating bit rate, since some frames need more bandwidth than others. The advantage of this mode is that the image quality is at almost the same level throughout the whole video stream. This is the setting that will be used, since it makes a comparison between different compressions of the same video sequence more accurate.

Intra-refresh. The intra-refresh setting will, if enabled, update the I-frame to a newer frame. Having intra-refresh enabled makes the encoder adapt to changes faster. Intra-refresh will be enabled in this project since it allows changes to the compression to be applied faster.

Adaptive quantization. To get the encoder to work with the QP map, the spatial and temporal adaptive quantization must be disabled.

3.1.3.3 QP map and Emphasis Map

Three different methods can be used as a QP map. The first one is the emphasis map, which is applied after the quality has been determined by the encoder. It describes areas that should be increased in quality and is only supported for the H.264 codec. The values in the emphasis map range from 0–5, where 0 gives the quality determined by the rate control and 5 gives the greatest increase in quality. The second method is the QP delta map, which works similarly to the emphasis map with the difference that the quality value can be changed more freely. The third method is to use a QP map that controls the exact QP value. The biggest difference between them is that the QP map overwrites the rate control, while the QP delta map and the emphasis map are added on top of the rate control's value.

The QP map can be used to freely set the QP values for each macroblock (MB) without any contribution from the rate control. This method gives the greatest freedom to set the exact desired quality in each area of the video stream. This requires a more exact determination of the QP values, because the video could easily become unwatchable. However, the QP map is not supported in Voysys' version of the encoder, so this could not be implemented.

The QP delta map, as previously stated, is added on top of the value determined by the rate control. The values in the QP delta map can range between -51 and 51, where 0 gives no difference in quality. Negative values will increase the quality and positive values will decrease it. Since the values in the QP delta map are added on top of the QP value determined by the rate control, the result could go outside of the range 0–51. However, the encoder clamps such values. The emphasis map works similarly to the QP map, with the change that there are only six different quality levels. This makes the emphasis map simpler to use, but with less freedom in the quality settings. The different levels go from 0–5, where 0 is no change to the quality and 5 is the largest improvement. The emphasis level together with the rate control determines the QP value that will be used in that particular MB.

The size of the different QP maps depends on the resolution of the video and the size of the MB in the encoder. The H.264 standard can only have an MB size of 16x16 pixels, while HEVC can also have sizes of 32x32 and 64x64 pixels. The size of the QP map is calculated by equation 3.1 [24].

\[ \text{QP map size} = \frac{\text{Resolution width}}{\text{MB width}} \times \frac{\text{Resolution height}}{\text{MB height}} \tag{3.1} \]

The QP delta map is the one that will be used, since it gives more freedom to change the quality settings. It also gives the possibility to both decrease and increase the image quality, which is not possible with an emphasis map. The only benefits of the emphasis map are that it would be easier to implement and that it would keep the bandwidth at a more similar level. Since QP maps are available in both HEVC and H.264, the newer and more powerful HEVC will be used, since Voysys have expressed a desire to start using it. When using the HEVC format, an MB size of 32x32 pixels will be used [24].
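As a sketch of how a QP delta map of the size given by equation 3.1 could be filled, the example below lowers the QP (i.e. raises the quality) inside a rectangular region of interest while slightly lowering the quality everywhere else. The structs, the delta values, and the signed 8-bit storage are assumptions for this illustration and not the actual Oden plugin code.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Rect { int x, y, width, height; };  // region of interest in pixels

    // Build a QP delta map with one value per macroblock (equation 3.1),
    // rounding the map dimensions up when the resolution is not divisible
    // by the MB size. Blocks overlapping the region of interest get a
    // negative delta (higher quality); everything else gets a positive
    // delta (lower quality).
    std::vector<std::int8_t> build_qp_delta_map(int frame_width, int frame_height,
                                                int mb_size, const Rect& roi,
                                                std::int8_t roi_delta,
                                                std::int8_t background_delta) {
        const int map_width = (frame_width + mb_size - 1) / mb_size;
        const int map_height = (frame_height + mb_size - 1) / mb_size;
        std::vector<std::int8_t> map(
            static_cast<std::size_t>(map_width) * map_height, background_delta);
        for (int by = 0; by < map_height; ++by) {
            for (int bx = 0; bx < map_width; ++bx) {
                const int px = bx * mb_size;
                const int py = by * mb_size;
                const bool inside = px < roi.x + roi.width && px + mb_size > roi.x &&
                                    py < roi.y + roi.height && py + mb_size > roi.y;
                if (inside) {
                    map[static_cast<std::size_t>(by) * map_width + bx] = roi_delta;
                }
            }
        }
        return map;
    }

    int main() {
        // 1920x1080 frame with 32x32 macroblocks gives a 60x34 (2040 entry) map.
        const Rect roi{640, 300, 640, 480};
        const std::vector<std::int8_t> map =
            build_qp_delta_map(1920, 1080, 32, roi, -10, 8);
        std::printf("QP delta map entries: %zu\n", map.size());
        return 0;
    }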

3.1.4 FFmpeg

Oden can both encode and decode different codecs through the NvEnc API. However, the decoder is integrated into the software in such a way that it is not possible to decode the files without introducing possible error sources, such as image file compression. To decode the files, FFmpeg is used. FFmpeg is a program containing a variety of encoders and decoders, and it can either be used from the terminal or be integrated into existing source code. If used from the terminal, the output would be an image for each frame, which would take up massive storage space, and data could be compressed when saved in certain image formats. FFmpeg is instead integrated into the code, and the output is a pointer to packed RGB data on which calculations can be done directly.
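One possible way to get packed RGB data out of a decoded frame is FFmpeg's libswscale, sketched below for a single NV12 frame. This is a generic example of the library call, not the integration used in Oden, and error handling is reduced to a bare minimum.

    #include <cstdint>

    extern "C" {
    #include <libavutil/frame.h>
    #include <libavutil/imgutils.h>
    #include <libswscale/swscale.h>
    }

    // Convert one decoded NV12 frame to packed RGB24. Returns a buffer
    // allocated with av_image_alloc(); the caller frees it with av_freep().
    uint8_t* nv12_frame_to_rgb(const AVFrame* frame, int* out_linesize) {
        SwsContext* ctx = sws_getContext(frame->width, frame->height, AV_PIX_FMT_NV12,
                                         frame->width, frame->height, AV_PIX_FMT_RGB24,
                                         SWS_BILINEAR, nullptr, nullptr, nullptr);
        if (ctx == nullptr) {
            return nullptr;
        }

        uint8_t* dst_data[4] = {nullptr};
        int dst_linesize[4] = {0};
        if (av_image_alloc(dst_data, dst_linesize, frame->width, frame->height,
                           AV_PIX_FMT_RGB24, 1) < 0) {
            sws_freeContext(ctx);
            return nullptr;
        }

        // Perform the actual pixel format conversion.
        sws_scale(ctx, frame->data, frame->linesize, 0, frame->height,
                  dst_data, dst_linesize);
        sws_freeContext(ctx);

        *out_linesize = dst_linesize[0];
        return dst_data[0];  // packed RGB24, one row every *out_linesize bytes
    }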

3.2

Hardware

Oden can run on both Linux and Windows platforms and does not require powerful hard-ware to run on. However, if higher resolution and faster frame rates are required there could be some restrictions on which hardware to use.

3.2.1 Developer hardware

The implementation is developed on a computer with an AMD Ryzen 9 3900X 12-core processor and an Nvidia GTX 1660 Ti graphics card. This computer is also used to stream the video during data gathering, instead of the weaker hardware that Voysys normally uses in their system. This is so that the computer does not become a bottleneck for the application, since the implementation is more of a proof of concept.

3.3 Voysys teleoperation hardware

When deploying teleoperation in everyday operation, a smaller system is often required. Voysys currently uses an Nvidia Jetson Nano, which is a small computer specialized for running neural networks and video encoding. The Jetson Nano has a quad-core ARM A57 processor @ 1.43 GHz, a 128-core Maxwell GPU, and 4 GB RAM. However, Voysys has started to switch over to the Jetson AGX Xavier, a slightly larger but more powerful computer with an 8-core ARM v8.2 64-bit CPU, a 512-core Volta GPU with Tensor Cores, and 32 GB RAM. This is the computer that the implementation will be tested on, to see if the thesis project could be used in real teleoperation.

3.3.1 Receiver hardware

Voysys has a testing room with different equipment used to both test and demonstrate its products. The equipment used in this thesis project is two of their driving stations: a dome solution with a viewing angle of 210 degrees, and a three-screen setup. The dome setup is shown in figure 3.1 and uses three projectors that can each display 1920x1080 resolution; the computer used to stitch and display the video has an AMD Ryzen 5 3600X 6-core processor, 16 GB RAM, and an Nvidia GTX 1070 GPU. The three-screen setup, seen in figure 3.2, uses three AOC 27" Curved CQ27G2U/BK screens, and the computer driving the setup has an AMD Ryzen 5 3600X 6-core processor, 16 GB RAM, and an Nvidia Quadro RTX 4000.

3.4 Data

The video data used in Voysys' everyday operations comes from cameras mounted on the vehicle that feed the data directly into Oden's render engine. For this thesis project, however, a camera providing a live feed would not be feasible, since there is no possibility to compare two different live streams in terms of image quality. Therefore, the data will be captured and stored as files.


Figure 3.1: The dome setup at Voysys.


3.4.1 Equipment and video format

The data was captured using a Panasonic DMC-GH4 at 3840x2160 with 30 frames per second and was saved in MP4 format. The camera could not save the data in a raw format, so the settings used were the highest the camera could handle. There was the possibility to record raw images using a USB camera, but that would require a setup with a computer as driver and storage, which would make the video capturing more difficult, so it was dismissed.

3.4.2 Data information

The data contains different video clips simulating a car traveling on a road, and all clips were recorded in a city or urban environment. The scenery contains features such as pedestrians, cars, trucks, vegetation, buildings, and traffic signs. Multiple clips were recorded so that different scenarios could be used.

3.4.3 Post processing

The recorded videos were very shaky, since they were captured without a steadicam, and could have caused nausea during the user tests. The videos therefore had to be stabilized before being shown to the participants, which was done using Adobe After Effects' warp stabilizer. The warp stabilizer distorts the image, but since the distorted image is used as the original when comparing image quality, this will not affect the result.

3.5 Implementation

To be able to answer the research questions stated in section 1.3, some functionality needed to be implemented in Voysys' software Oden, which is described in this chapter.

3.5.1 Plugin

Oden has support for developing plugins for the engine, and this is how most of this thesis project will be implemented. The plugin API contains a set of functions that expose some functionality from Oden's core engine, such as accessing the displayed image, rendering objects to the scene, and setting GUI elements. Oden requires that each plugin has four specific functions to work: an initialize function, an update function, an update GUI function, and a shutdown function. These four functions are connected to Oden's plugin API and are used to set up and run the plugin. Both update functions are executed each frame, and this is where most of the code relevant to the QP delta map is implemented.

To be able to control and set the QP delta map from the plugin, access to either the encoder or the data structure holding the QP delta map is needed. There is no functionality in the plugin API to do this, since it has not previously been a desired functionality. It therefore had to be implemented in Oden's core engine, and the decision was to implement functionality for accessing the data structure containing the QP delta map. The data that is accessible from the plugin is the following (a sketch of how it could be used is shown after the list):

• A pointer to char holding the address of the QP map.
• An integer containing the width of the QP map.
• An integer containing the height of the QP map.
• An integer containing the total size of the QP map.
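The sketch below shows one way this exposed data could be wrapped and written to from the plugin. Oden's real plugin API is not public, so the struct, the function names, and the signed-byte interpretation of the char pointer are assumptions made for illustration; the clamp to -51..51 follows the valid QP delta range described earlier.

#include <algorithm>
#include <cstdint>

// Hypothetical view of the QP delta map data that Oden exposes to the plugin
// (pointer, width, height, total size). Names and layout are assumptions.
struct QpDeltaMapView {
    char* data;    // one signed QP delta value per macroblock
    int   width;   // map width in macroblocks
    int   height;  // map height in macroblocks
    int   size;    // total number of entries (width * height)
};

// Write a QP delta for macroblock (x, y), clamped to the valid -51..51 range.
inline void setQpDelta(QpDeltaMapView& map, int x, int y, int delta)
{
    if (x < 0 || y < 0 || x >= map.width || y >= map.height)
        return;                                        // ignore out-of-bounds writes
    map.data[y * map.width + x] =
        static_cast<char>(std::clamp(delta, -51, 51)); // negative = higher quality
}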

The values in the QP delta map should correspond to the features present in the video stream, and therefore access to the currently displayed image is vital. There is no functionality to directly access the displayed image from a plugin, but there is a function called Virtual Camera implemented in Oden. A virtual camera in Oden is a way to render out a part of the viewport as an image: it takes the viewport and transforms it into an image stored as packed RGB data. The virtual camera can have different sizes and fields of view, so it can be customized depending on the use case. The downside of using a virtual camera is that it takes around four frames for the image to be captured, transferred, and accessed by the plugin, which in turn adds some latency to the operation.

3.5.2 Selecting areas

To be able to fill the QP delta map with relevant values, important areas in the video have to be selected. Both a static and a dynamic method of selecting areas will be implemented. The static method lets the user select an area in the video that will always be set to the same QP value. The dynamic method uses a neural network to select the features that should affect the QP value.

3.5.2.1 Static selection

Selecting a static area in the video that will always be compressed with the same QP value could be used if the user, for example, always wants the area straight ahead to be in higher quality. Furthermore, this functionality is also very useful when testing the encoder. There are two approaches to this problem: either selecting an area of the viewport, or an area in world coordinates. The advantage of the first approach is that, when using a VR headset, the area is always present in the viewport; however, the project will not be tested against VR. The other approach is better in cases where certain regions in the video are of interest, such as the area straight ahead of the vehicle. This is the method that will be implemented.

View plane to World Coordinates To select an area in world coordinates that will be enhanced by the QP delta map, the user selects four points on the screen, which then form a shape. Since the screen is a 2D plane and the coordinates should be in 3D, there has to be a transformation from 2D screen coordinates to 3D world coordinates. The steps to calculate this are the following:

• Convert pixel coordinates into normalized values between -1 and 1.
• Multiply with the inverse projection matrix.
• Multiply with the inverse view matrix.

These steps give a ray that is used to find the intersection point with the object that is displayed in the video stream.
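The following sketch shows these three steps, written here with the GLM math library. The thesis does not state which math library Oden uses, so the library choice and the function name are assumptions made for illustration.

#include <glm/glm.hpp>

// Turn a pixel coordinate into a ray direction in world space, following the
// three steps above. The pixel origin is assumed to be in the top-left corner.
glm::vec3 screenPointToRay(float px, float py,        // pixel coordinates
                           float width, float height, // viewport size in pixels
                           const glm::mat4& proj,
                           const glm::mat4& view)
{
    // 1. Convert pixel coordinates into normalized device coordinates (-1..1).
    float x = 2.0f * px / width - 1.0f;
    float y = 1.0f - 2.0f * py / height;               // flip y axis

    // 2. Multiply with the inverse projection matrix (clip space -> eye space).
    glm::vec4 eye = glm::inverse(proj) * glm::vec4(x, y, -1.0f, 1.0f);
    eye = glm::vec4(eye.x, eye.y, -1.0f, 0.0f);        // treat as a direction

    // 3. Multiply with the inverse view matrix (eye space -> world space).
    glm::vec3 world = glm::vec3(glm::inverse(view) * eye);
    return glm::normalize(world);
}

The resulting direction, together with the camera position, defines the ray that is intersected with the displayed object.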

To calculate whether a pixel lies inside a shape, the coordinates of the viewport edges have to be determined. A ray from each corner of the viewport is sent out to find the intersection point with the displayed object. Since the camera shows a 2D view of the scene, the z component can be discarded. The normalized position of the shape is calculated with the formula in equation 3.2.

\[
x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{3.2}
\]

With the normalized position of the shape, it can be compared against normalized pixel positions to build the mask that is used to update the QP delta map.


Figure 3.3: A shape consisting of four points that is not a rectangle.

Build pixel mask Allowing the shape to only be a rectangle would be a simple implementation: to calculate if a point P lies inside a rectangle it is enough to check that $x_{\min} < P_x < x_{\max}$ and $y_{\min} < P_y < y_{\max}$, but this would limit the user in selecting areas. If the shape instead consists of four arbitrary points, as in figure 3.3, another way to calculate whether a point lies inside the shape is needed. Since the camera turns the 3D scene into a 2D image, the coordinates will always lie in a 2D plane. If the coordinates are extended back into 3D space, the cross product can be used to determine if a point lies inside the shape. All edges of the shape are treated as vectors, as seen in figure 3.4, and a vector from each vertex to the currently analyzed point is created. The cross product between the two vectors gives a vector that points either out of or into the 2D plane. If all vectors from the cross products have the same direction, then the point lies inside the shape; otherwise it does not.

Figure 3.4: To calculate if a point lies inside an arbitrary 2D shape, the cross product between each of the shape's edges and a vector from that edge's vertex to the point is calculated. If all results from the cross products point in the same direction, then the point lies inside the shape. For P1 all vectors from the cross products will point out of the plane, while for P2 one vector will point in the opposite direction.
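A minimal sketch of this test is given below, using plain structs with illustrative names rather than Oden's actual types. Only the z component of each cross product is needed, since the points lie in a 2D plane, and the shape is assumed to be convex (the same-sign test only holds for convex shapes).

#include <array>

struct Vec2 { float x, y; };

// z component of the 3D cross product of two vectors lying in the xy plane.
static float crossZ(const Vec2& a, const Vec2& b)
{
    return a.x * b.y - a.y * b.x;
}

// Returns true if point p lies inside the four-point shape. For each edge,
// the cross product between the edge vector and the vector from the edge's
// start vertex to p is taken; the point is inside if all results point the
// same way out of the plane.
bool pointInShape(const std::array<Vec2, 4>& shape, const Vec2& p)
{
    bool positive = false;
    bool negative = false;
    for (int i = 0; i < 4; ++i) {
        const Vec2& v0 = shape[i];
        const Vec2& v1 = shape[(i + 1) % 4];
        Vec2 edge   { v1.x - v0.x, v1.y - v0.y };  // edge vector
        Vec2 toPoint{ p.x - v0.x,  p.y - v0.y  };  // vertex -> analyzed point
        float z = crossZ(edge, toPoint);
        if (z > 0.0f) positive = true;
        if (z < 0.0f) negative = true;
    }
    return !(positive && negative);                // all same sign -> inside
}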


3.5.2.2 Neural network

To select areas and features dynamically in the video stream, a neural network is integrated into the plugin. The network used is a pre-trained model called semantic-segmentation-adas-0001, which comes from OpenVino's model zoo3. It is a segmentation network, which means that each pixel that is sent into the network will be classified into one of the twenty different classes that the network supports. However, some classes are bundled together to make the user interface of the application simpler. The classes that the network can detect are shown below in their respective bundles:

• road
• sidewalk
• building, wall, fence and pole
• traffic light
• traffic sign
• vegetation and terrain
• sky
• person and bicycle
• rider
• car, truck, bus, motorcycle and eco-vehicle
• train

The network input is a three-channel image with the resolution 2048x1024 and its output is a one-channel image with the resolution 2048x1024. Each pixel in the output image is represented by a character corresponding to one of the classes shown above.
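As an illustration of how this output could be consumed, the sketch below collapses the per-pixel class IDs into the bundles listed above. The class-to-bundle mapping is passed in as a table rather than hard-coded, since the exact class ordering comes from the model description and is not reproduced here; all names are illustrative.

#include <cstdint>
#include <vector>

// Network output resolution as described above.
constexpr int kSegWidth  = 2048;
constexpr int kSegHeight = 1024;

// classMap: one byte per pixel, holding the raw class ID from the network.
// classToBundle: per-class bundle ID (the mapping itself is not shown here).
std::vector<uint8_t> toBundles(const uint8_t* classMap,
                               const std::vector<uint8_t>& classToBundle)
{
    std::vector<uint8_t> bundles(static_cast<size_t>(kSegWidth) * kSegHeight, 0);
    for (size_t i = 0; i < bundles.size(); ++i) {
        uint8_t classId = classMap[i];
        if (classId < classToBundle.size())
            bundles[i] = classToBundle[classId];   // collapse class into its bundle
    }
    return bundles;
}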

To run the inference of the network, OpenVino is used, which both optimizes and executes the network. Voysys had already implemented OpenVino as a plugin to Oden. However, it was not set up for the network that was going to be used and contained some bugs, which meant that major changes to the code were needed. Since OpenVino is implemented as a stand-alone plugin, the data needs to be sent to the other plugin, which is done by passing a pointer to the allocated memory.

3.5.3 Updating QP map

The QP map will be updated each frame to keep it as relevant as possible. First, the plugin checks if a new inference has been done; if not, the QP map is kept as it is. If a new inference has been done, the program loops over all MBs and, for each MB, creates a histogram over how many pixels in the MB belong to each class. Each class in the histogram gets multiplied by its importance, which is explained in section 3.5.3.2, and the class with the highest value is selected for that MB. If the spread option is enabled, which is explained in section 3.5.3.1, the surrounding MBs are also stored. After each MB has been assigned to a class, the program loops over the MBs stored by the spread and changes them. The new QP map is compared against the old QP map, and if the new value is higher than the old one, it gets overwritten by the old value with deterioration added to it. The deterioration is explained in section 3.5.3.3 and makes the quality decrease slowly. This is to prevent flickering in quality if the neural network has problems detecting a feature in some frames.
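The two central steps of this update, choosing a class per MB from an importance-weighted histogram and merging the new QP delta with the old one using the deterioration, are sketched below. The function names, the bundle count, and the deterioration step are illustrative assumptions rather than Oden's actual code, and the segmentation map is assumed to have dimensions that are multiples of the MB size.

#include <algorithm>
#include <cstdint>

constexpr int kNumBundles = 12;   // number of class bundles (assumed)

// Pick the dominant bundle for one macroblock: build a pixel histogram over
// the bundles and weight each count by the user-set importance of that bundle.
int selectBundleForMb(const uint8_t* bundleMap, int mapWidth,
                      int mbX, int mbY, int mbSize,
                      const float importance[kNumBundles])
{
    float histogram[kNumBundles] = {0.0f};
    for (int y = mbY * mbSize; y < (mbY + 1) * mbSize; ++y)
        for (int x = mbX * mbSize; x < (mbX + 1) * mbSize; ++x)
            histogram[bundleMap[y * mapWidth + x]] += 1.0f;

    int best = 0;
    for (int b = 1; b < kNumBundles; ++b)
        if (histogram[b] * importance[b] > histogram[best] * importance[best])
            best = b;
    return best;
}

// Merge a newly computed QP delta with last frame's value: a lower (better)
// value is applied immediately, while a higher (worse) value only creeps in
// one deterioration step per update, to avoid flickering.
int mergeQpDelta(int oldDelta, int newDelta, int deteriorationStep)
{
    if (newDelta <= oldDelta)
        return newDelta;
    return std::min(newDelta, oldDelta + deteriorationStep);
}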

3https://docs.openvinotoolkit.org/2019_R1/_semantic_segmentation_adas_0001_

