
Estimating lighting from unconstrained RGB images using Deep Learning in real-time for superimposed objects in an augmented reality application


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2021 | LIU-IDA/LITH-EX-A--2021/042--SE

Estimating lighting from unconstrained RGB images using Deep Learning in real-time for superimposed objects in an augmented reality application

Skattning av ljuset från RGB-bilder med Deep Learning i realtid

för virtuella objekt i en applikation inom förstärkt verklighet

Felix Nodelijk

Arun Uppugunduri

Supervisor: George Osipov
Examiner: Cyrille Berger


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Modern deep learning enables many new possibilities for automation. Within augmented reality, deep learning can be used to infer the lighting needed to render superimposed objects with correct lighting so that they mix seamlessly with the environment.

This study aims to find a method of light estimation from RGB images by investigating Spherical Harmonic coefficients and how said coefficients could be inferred for use in an AR application in real-time. The pre-existing method employed by the application estimates the light by comparing two points cheek-to-cheek on a face. This fails to accurately represent the lighting in many situations, causing users to stop using the application. This study investigates a deep learning model that shows significant improvements with regard to the lighting estimation while also achieving a fast inference time. The model results were presented to respondents in a survey, and the model was found to be the better of the two methods in terms of light estimation. The final model achieved an inference time of 19 ms and an RMS error of 0.10.


Acknowledgments

We want to give our thanks to all the employees of Kapanu. From the very start of our thesis, we have been given a warm welcome and have felt like a part of Kapanu.

A special thanks should, however, be given to our company supervisor Marcel Lancelle for all support, engagement and assistance in our thesis work. An additional thanks should be given to Valentin Vasiliu for his assistance in deep learning specific questions. Last but not least, we would like to thank Alexandros Mourtziapis for his rapid assistance related to the rendering of light spheres when presenting our results. We would also like to extend our gratitude to the rest of the employees of Kapanu for their support and devoted interest in our project.

Moreover, we also want to thank our examiner Cyrille Berger and supervisor George Osipov at Linköping University. The assistance related to the forming of our research questions, thesis structure, and follow-up on schedule and milestones has truly been a major key to our success. Lastly, we would also like to thank all of the participants who took part in the study, giving us valuable results and points of discussion.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Kapanu
   1.2 Motivation
   1.3 Research questions
   1.4 Limitations
   1.5 Thesis Outline
2 Theory
   2.1 Definition of Quality
   2.2 Background
   2.3 Related Work
3 Method
   3.1 Dataset
   3.2 Network architecture
   3.3 Training process
   3.4 Experimentation
   3.5 Hyper-parameters
   3.6 The inference model
   3.7 Hardware, resources and environment
4 Results
   4.1 Quality Survey
   4.2 Trained network results
5 Discussion
   5.1 Results
   5.2 Method
   5.3 Source criticism
   5.4 The work in a wider context
6 Conclusion
   6.1 Future work
A Appendix
   A.1 Sphere generation
   A.2 Quality survey


List of Figures

1.1 Visualisation of dental makeover using IvoSmile application. Used with permission from Kapanu
1.2 Example of unrealistic rendering of teeth. Used with permission from Kapanu AG
2.1 Low-level feature map
2.2 Convolutional layer with input size 9x9, filter size 3x3
2.3 Resulting input and output when applying max pooling with filter size 2x2 and stride 2
3.1 Original and relit image applying target SH coefficients from table 3.1
3.2 Hourglass network architecture
3.3 Inference network architecture
3.4 Example question for the quality survey
3.5 SH coefficients projected onto a sphere
4.1 Spread of votes per image. The integer indicates the number of votes for that category.
4.2 Percentage in favour
4.3 Image 2 - Quality Survey - Left sphere by Kapanu, right sphere by us
4.4 Image 7 - Quality Survey - Left sphere by Kapanu, right sphere by us
4.5 Image 9 - Quality Survey - Left sphere by Kapanu, right sphere by us
4.6 Image 1 - Quality Survey - Left sphere by Kapanu, right sphere by us
4.7 Image 5 - Quality Survey - Left sphere by Kapanu, right sphere by us
4.8 Image 11 - Quality Survey - Left sphere by Kapanu, right sphere by us
4.9 Validation loss and training loss plotted against the epoch. The annotations over the epochs indicate when a skip-connection was added.
4.10 Validation loss and training loss plotted against the epoch for the inference network
4.11 RMSE value for different models (lower is better)


List of Tables

3.1 Table of SH coefficients
3.2 Table of inference network layers
4.1 Inference times (CPU & GPU) and Memory usage


1 Introduction

Visualising the result of a dental procedure, be it whitening teeth or orthodontics, can aid both the patient and the dentist in making informed decisions. Usually, orthodontists create mock-ups with the purpose of presenting the patient with the intended aesthetic outcome before deciding to continue with a treatment plan [11]. The creation of a mock-up, however, involves a variety of steps such as consultation, dental impressions, diagnostic wax-up, diagnostic mock-up, and treatment plan. This can be both time-consuming and cost-intensive while still not resulting in any guarantee of a procedure.

An alternative way to achieve this effect is by using Augmented Reality (AR). AR is an emerging technology that enhances reality with computer-generated objects using sensory inputs, for instance, camera and LiDAR [9]. Visualising the result through AR enables the patient and the orthodontist to interactively investigate the end-result, which can assist in the process of a treatment plan.

A necessity within augmented reality applications is accurate illumination of objects, as this determines whether a virtual object fits within a realistic environment among real-world objects. This is exemplified by visual effects in movie production, where the estimation is required for coherent illumination across scenes and objects [10].

Deep learning is a field within Machine learning dedicated to learning from a vast amount of data using deep, or multiple layered, Neural Networks [13]. Deep learning allows computers to learn complex relationships between input data and some, often difficult-to-describe, concepts. A classic example is that of object recognition in images. While it is a task that a human would excel at, it is difficult to describe why exactly an object is recognised as such. Using convolution, deep and shallow features can be extracted from images to recognise objects in convolutional neural networks (CNN).

Nowadays, there exist methods for estimating illumination using probes. One such technique is Single-shot light probe by Debevec et al. [8], which photographs real-world hollow chrome spheres to estimate illumination. Consequently, this technique requires a physical object to perform the illumination estimation, which is not easily accessible for consumers. Moreover, this introduces the need for an object to be visible within the scene. Another technique performing illumination estimation is DeepLight by Kán et al. [19]. This technique uses a large dataset of images with colour and depth information, along with a known light source, to train a residual neural network (ResNet). While the results were deemed promising, this method was only trained on synthetic data, possibly preventing it from generalising to complex real-world scenes.

Deep Illumination by Manu et al. [41] introduced a novel probeless illumination estimator using Generative Adversarial Networks (GANs). Similarly to DeepLight, this implementation was applied to evaluate computer-generated simulations and not a real-world scenario.

Deep Spherical Harmonics by Marques et al. [32] trained a ResNet model on RGB images to output nine Spherical Harmonic coefficients. Spherical Harmonics are special functions that can be used in order to calculate the light conditions on the surface of a sphere as light functions. Marques et al. demonstrated how the Spherical Harmonics could be used in order to estimate the illumination without any prior knowledge of the scene.

1.1 Kapanu

Kapanu AG is a company based in Switzerland that develops an innovative product aiming to aid the dental industry by applying techniques from the fields of Visual Computing and Machine Learning, serving dentists and patients alike. Their application is currently readily available for consumers to download from the iOS App Store and is used to visualise cosmetic dental makeovers. It uses computer-generated models of teeth and superimposes said models in augmented reality for applications within orthodontics and/or teeth whitening. Kapanu AG is a spin-off from ETH that was founded in 2015. In 2017, at the International Dental Show, they demonstrated the world's first real-time dental Augmented Reality app. Shortly after this demonstration, Kapanu AG was acquired by Ivoclar Vivadent, who are currently using the AR technology in their apps IvoSmile and IvoSmile Orthodontics.

IvoSmile is a patient-centred application that shows patients a preview of an aesthetic smile makeover in real-time. The virtual image or video enables an easier understanding of the result of treatment while not demanding any cost-intensive or time-consuming planning of a cosmetic treatment. Figure 1.1 demonstrates the IvoSmile application in use, where generated 3D models have been superimposed and are displayed in the AR application.

Figure 1.1: Visualisation of dental makeover using IvoSmile application. Used with permission from Kapanu

1.2 Motivation

The motivation behind this paper is to extend IvoSmile [17] to superimpose models of teeth onto users in poorly lit environments such that the superimposition appears to merge realistically with the surrounding environment. A current pain point for the application is that it underperforms in poorly lit environments. More specifically, it results in uneven and unrealistic lighting and shadowing of the superimposed teeth, which consequently results in a fake appearance. This has the unfortunate side-effect of increasing the rate at which users discontinue the use of the application.

Figure 1.2 illustrates how the lack of adequate light estimation results in an unrealistic rendering of the teeth. The main things to note in this image are that the cheek teeth are too bright and that the reflection of the teeth is incorrect, making them appear cartoonesque or plastic in nature.

Figure 1.2: Example of unrealistic rendering of teeth. Used with permission from Kapanu AG

The app currently implements a basic illumination estimation technique that utilises the cheeks of a user to determine the lighting situation. By applying the illumination corresponding to each cheek to the matching part of the superimposed models, a basic estimation of illumination is achieved. Furthermore, a user is notified of a sub-optimal lighting situation. However, it has been noted that people fail to realise how they can adjust themselves such that the lighting is optimal. Therefore, a system that automatically adjusts the lighting according to the situation without requiring user interaction would not only provide a more accurate representation of the lighting for the models but would also reduce the amount of work that a user must perform in order for the lighting to be considered optimal.

1.2.1 Aim

The aim of the project is to investigate and evaluate a Deep Learning solution for improving real-time lighting estimation, for the purpose of enhancing the lighting of 3D generated models of teeth in AR.

1.3 Research questions

The research questions that will be answered in this thesis project are the following:

1. How can Spherical Harmonics Coefficients be estimated for the purpose of real-time light estimation on mobile devices using Deep Learning?

Spherical Harmonic coefficients describe a complex relationship that can be estimated by a variety of analytical and machine learning approaches. The necessary calculations, however, can take a long time and might then not be suitable for real-time applications.

2. What is the trade-off between faster inference and quality?

As the illumination estimation is supposed to run on a smartphone, it is imperative that the feature can run in real-time on phones. With this constraint, a trade-off in quality could manifest itself. This question aims to investigate whether there is a trade-off between the quality of the estimation and the inference time of our model.


1.4 Limitations

The model should be able to run on recent phones independently of a server. Only RGB images will be available as input (i.e. depth cannot be considered since it is not available on the majority of mobile devices). As the application performs various other computational tasks, the implementation should co-exist with them such that it does not impede performance or prevent other tasks from being executed in a timely manner. Furthermore, there is a limitation related to the size of the app, since users tend to avoid large apps. Moreover, since the application runs in real-time, this adds the requirement that the model, too, performs inference in real-time. Lastly, we limit our model training to the DPR dataset, as this is the only dataset available to us that fits our purpose.

1.5 Thesis Outline

The following chapter gives information about the background theory necessary to understand the related work and our methodology. Chapter 3 covers the method of our work and is followed by the results in Chapter 4. This is followed by the discussion in Chapter 5, where the method and results are discussed. Finally, the conclusions are presented in Chapter 6.


2 Theory

It is assumed that the reader of this thesis has prior knowledge in the field of machine learning with a focus on deep learning. This knowledge is assumed to be at least at a level corresponding to a master's degree in computer science. Due to this assumption, basic concepts like, but not limited to, loss functions will be omitted or described at a very high level.

2.1 Definition of Quality

The definition of quality for the SH coefficients is two-fold. The first definition is specifically designed for the quality survey and measures how realistically the sphere captures the lighting. Henceforth, we call this definition Qs (for quality survey), which is a qualitative measure described by the share of users showing a subjective preference for our model results. The second definition pertains to the error between the ground-truth SH and the estimated SH of an image. This definition is henceforth referred to as Qe (for quality error) and is a quantitative measure described by the RMS error.

2.2 Background

This section presents the necessary theory (except for the knowledge of theory that the reader is presumed to possess) to understand the related work section as well as the study as a whole.

2.2.1 Spherical Harmonics

The term Spherical Harmonic was first introduced in Thomson and Tait's Treatise on Natural Philosophy (1867) [42]. The term refers to the Spherical Harmonic basis functions and their ability to represent 2D functions that satisfy certain conditions over the surface of a sphere. The Spherical Harmonics are homogeneous polynomial solutions $f:\mathbb{R}^3 \to \mathbb{R}$ of Laplace's equation

$$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2} = 0.$$

These functions are pieces of a signal, and by scaling and summing the functions, this signal can be approximated by simply integrating against each basis function. In order to approximate using basis functions, the first step is to determine a scalar value $C_i$ representing the similarity between each basis function $\beta_i(x)$ and the original signal $f(x)$. This is achieved by integrating the product $f(x)\beta_i(x)$ over the full domain of $f$.

$$\int f(x)\,\beta_i(x)\,dx = C_i$$

This process of determining the weight of each function is called projection. Projecting over all basis functions results in a vector of approximated coefficients, and by scaling each coefficient by the corresponding basis function and summing, the result is an approximation of the original function.

$$f(x)_{\mathrm{approx.}} = \sum_i C_i\,\beta_i(x)$$
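As a concrete illustration of projection and reconstruction, the sketch below approximates a 1D signal with Legendre polynomials, which form an orthogonal basis on [-1, 1]. This is a simplified analogue of the spherical case rather than the SH implementation used in the thesis, and the example signal is arbitrary.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

# Simplified 1D analogue of SH projection: approximate a signal f(x) on [-1, 1]
# by projecting it onto the orthogonal Legendre polynomial basis.

def project(f, order, n_samples=10001):
    """Return C_i = integral of f(x) * beta_i(x) dx, normalised per basis function."""
    x = np.linspace(-1.0, 1.0, n_samples)
    coeffs = []
    for i in range(order + 1):
        basis_i = Legendre.basis(i)(x)
        # Legendre polynomials are orthogonal but not orthonormal:
        # integral of P_i(x)^2 over [-1, 1] equals 2 / (2i + 1).
        c_i = np.trapz(f(x) * basis_i, x) / (2.0 / (2 * i + 1))
        coeffs.append(c_i)
    return np.array(coeffs)

def reconstruct(coeffs, x):
    """f(x) ~= sum_i C_i * beta_i(x)."""
    return sum(c * Legendre.basis(i)(x) for i, c in enumerate(coeffs))

f = lambda x: np.exp(-x) * np.sin(3 * x)      # arbitrary smooth example signal
x = np.linspace(-1, 1, 200)
for order in (2, 8):
    approx = reconstruct(project(f, order), x)
    print(f"order {order}: max error {np.abs(approx - f(x)).max():.4f}")
```

Running the sketch shows the behaviour described next: a higher-order basis captures higher-frequency content and yields a smaller approximation error.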

Spherical Harmonics are defined in the frequency domain, where the order specifies the number of polynomial terms that the harmonic functions contain. Smooth functions can be obtained by using a lower order. However, to capture higher frequencies, higher-order Spherical Harmonics are necessary, resulting in a more precise approximation.

Spherical Harmonics Lighting was first introduced in 2002 by Sloan et al. [37], where it was presented as a technique for calculating the light conditions as light functions, represented by a set of coefficients. The paper demonstrated how Spherical Harmonics Lighting (SH Lighting) could achieve ultra-realistic lighting of models. By directly using only the coefficients as representations for the light function, the rendering process could be sped up, allowing for the capture and relighting of images in real-time.

2.2.2 Augmented Reality

Augmented reality (AR) is a field that has grown explosively in recent years. In essence, AR is simply reality augmented with digital data, which, for instance, can take the form of 3D objects or videos. In order to be efficient, the AR system needs to be able to, in some sense, understand the concept of reality in order to reconstruct its digital twin. This then enables the user, in the AR system, to interact with the augmented reality consisting of both real and digital data [33].

In AR, everything is centred around the camera, whose real-time data is analysed, detecting shapes and movements and consequently forming the augmented world. This output is what constitutes the physical layer of the augmented world, which consists of the following components:

1. User Interaction
2. Point Cloud and Spatial Mapping
3. Pattern Matching
4. The Inertial Measurement Unit
5. Tracking, Anchors, Persistence

Above the physical layer, the augmented world also consists of the semantic layer, which is what gives meaning to what the users see. The semantic layer involves training neural networks with a large amount of data in order for the machine to be able to interpret the world as humans do [18].

Lighting cues and concepts in AR

Humans automatically perceive subtle cues in relation to how our visual fields and the objects in them are lit. As an example, if a virtual object has missing shadows or consists of a shiny material without displaying any reflection, the user will feel that the object does not fit in the environment. For the purpose of creating a more immersive and realistic user experience in AR, it is therefore of crucial importance to render objects to match the lighting in a scene [18]. Through light estimation of the real-world scene, the model can be provided with detailed data allowing it to mimic a variety of lighting cues to render virtual objects realistically. The following are among the most common lighting cues:

1. Ambient Lighting
2. Reflection
3. Specular highlights
4. Shading
5. Shadows

Ambient light can be described as the overall diffuse light, lighting everything while coming in from the environment. Depending on the reflection of a surface, the light will bounce off differently. Diffuse surfaces are non-reflective, while specular surfaces are highly reflective. Shading can be described as the light intensity of different areas in the same scene, depending on the viewer's angle. Finally, shadows tend to be the result of directional lights, which help the user determine the light source direction [33].

A surface is said to be Lambertian if the radiance — the reflected light — of the surface is independent of the viewing angle [38]. For instance, a paper that absorbs water well (i.e. a diffuse surface) is a prime example of a Lambertian surface, whereas a mirror is not. The face of a human is often assumed to be a near-Lambertian surface, which facilitates illumination estimation [46, 48].

2.2.3 Neural Network

Deep learning is a field within machine learning which aims to solve problems that are hard to describe but for which examples can be provided [13]. Specifically, the term 'Deep' refers to the depth of a network, i.e. how many layers of neurons are stacked in sequence. A Neural Network (NN) model learns by mapping some input data to some output feature space such that necessary, descriptive features can be discovered to solve a particular task; this is opposed to hand-crafted feature methods like the Scale Invariant Feature Transform (SIFT), which retrieves scale-invariant image features. SIFT can provide 2000 features from a single 500x500 image. This has positive implications for object recognition in computer vision, as it allows for object detection in images with a lot of background clutter [29]. Therefore, in the absence of large datasets, which are important for the accuracy of Deep Neural Networks [30], this method still allows object recognition tasks to be performed. Lin et al. [26] found that with a limited training sample size, the performance of a convolutional neural network (CNN) is equivalent to that of hand-crafted features, which in this case was texture analysis in the form of grey-level co-occurrence matrix (GLCM) features. However, the combination of the two was found to surpass the performance of both the CNN-only features and the hand-crafted-only features.

A NN aims to map some input to some higher-dimensional output feature space such that the input is linearly separable [13]. This follows Cover's theorem, which states that as the input is non-linearly mapped to a higher dimension, the probability that the data is linearly separable increases [7]. A NN is called such because it emulates, at least to some extent, the brain in terms of neurons and synapses. Just as a synapse passes a signal — or information — to a connecting neuron, which in turn sends a signal to another neuron, a neuron in a NN sends a signal to another neuron to finally output some result based on information from previously connected neurons. All a neuron in a neural network does is take a weighted linear combination of its input signals and pass it through a non-linear activation function, whose output is then passed on to other neurons. Typically, a NN contains multiple layers of multiple neurons.
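As a minimal illustration (not taken from the thesis), a single neuron can be written as a weighted sum plus a bias, passed through a non-linear activation; the input, weight and bias values below are made up.

```python
import numpy as np

# A single artificial neuron: weighted linear combination of the inputs plus a
# bias, passed through a non-linear activation function (ReLU in this sketch).
def neuron(inputs, weights, bias):
    return np.maximum(0.0, np.dot(weights, inputs) + bias)

print(neuron(np.array([0.5, -1.0, 2.0]),
             np.array([0.1, 0.4, -0.2]),
             bias=0.05))
```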

Backpropagation is a way to update the weights of the network by calculating the gradients of composite functions using the chain rule. It works in conjunction with an optimisation algorithm such as Stochastic Gradient Descent (SGD); together these comprise what is known as gradient descent-based learning and are used to update the parameters — analogous to weights — of the model. Another optimisation algorithm is Adam, which is more memory efficient and more suited towards higher-dimensional feature spaces, something which SGD is unsuited for [21]. Adam combines features from other optimisation techniques such as AdaGrad and RMSProp.

Convolutional Neural Network

CNNs are networks specialised for processing data with a known grid-like topology, such as time-series and image data [13]. The major building blocks of these networks are the convolutional layers that perform the convolutions and result in the ability to automatically learn many filters under the constraint of a specific predictive modelling task, such as image classification. Whereas a regular NN takes each pixel as input, resulting in heavy calculations, the convolutional neurons are instead responsible for convolving a spatial region of the image. Through the application of relevant filters, the network can successfully capture the spatial and temporal dependencies of an image. Moreover, the reduction of parameters due to the reusability of weights results in a better fitting of an image dataset.

The architecture of a convolutional neural network can take many forms, but the standard usually consists of three parts: a convolutional layer, a pooling layer and a fully connected layer [13].

The convolution operator allows for averaging of the weighted readings to reduce the noise of some signal [13]. The convolution operator is denoted as $*$. Images can, henceforth, be thought of as a 2D signal. Thus, in 2D convolution, the input — a 2D signal — is convolved with a weight function, referred to as a kernel, to obtain a feature map. An example of a feature map would be contours, which are retrieved early in the network. In computer vision, the early features of a convolutional neural network (CNN) are called lower-level features, whereas features from later layers are called higher-level features, which are composed of lower-level features [23].

The maths for 2D convolution is as follows:

$$S(i, j) = (K * I)(i, j) = \sum_m \sum_n I(i - m, j - n)\,K(m, n), \tag{2.1}$$

where $K$ is a rotated kernel matrix which is convolved over the image $I$. Figure 2.1 illustrates an example of a low-level feature map from the output features.

Cross-correlation is calculated when the kernel matrix is not rotated. The conceptual difference is that a rotated kernel provides the commutative property. However, this property is not really essential for computational reasons, and thus most CNN libraries implement cross-correlation instead. Therefore, it is a misnomer to call the implemented convolution layers convolution. A typical convolution layer is depicted in Figure 2.2.
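A minimal numerical illustration of this distinction is sketched below using SciPy (not part of the thesis implementation): cross-correlating with a 180-degree-rotated kernel gives the same result as true convolution, which is why libraries can implement either. The image and kernel values are made up.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

# Eq. 2.1 with a flipped kernel is true convolution; most deep learning
# libraries actually implement cross-correlation (no kernel flip).
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]])   # Sobel-like edge filter

conv = convolve2d(image, kernel, mode='valid')      # kernel is flipped internally
xcorr = correlate2d(image, kernel, mode='valid')    # kernel is used as-is

# Cross-correlation with a 180-degree-rotated kernel equals convolution.
assert np.allclose(conv, correlate2d(image, kernel[::-1, ::-1], mode='valid'))
print(conv)
print(xcorr)
```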

The activation function must be non-linear as this is what allows the model to create complex non-linear boundaries, thus allowing for complex classification and regression of input data. Furthermore, this non-linear function must also be differentiable to allow partial derivatives to be calculated during backpropagation. Examples of non-linear activation functions are the Hyperbolic Tangent function (TanH) and the Rectified Linear Unit (ReLU). Finally, the output is then passed to a pooling layer which performs downsampling.

1Convolutional layer by Cecbur, licensed under CC BY-SA 4.0 https://creativecommons.org/licenses/


Figure 2.1: Low-level feature map

Figure 2.2: Convolutional layer with input size 9x9, filter size 3x3 1


The role of the pooling layer is to reduce the dimensionality of the feature mappings while still maintaining knowledge about important features [35]. By applying pooling, the statistical efficiency of the model is increased, as it becomes more robust while at the same time reducing the computational cost. There exists a variety of different pooling techniques, of which the most commonly employed are average pooling and max pooling. Average pooling is a pooling operation that calculates the average value of each patch in each feature map. Max pooling, quite intuitively, computes the max value for each patch in each feature map. A pooling layer is characterised by a stride and a filter size. The stride determines how many pixels the filter moves at each step, and the filter size determines how many of the original pixels each pooled pixel covers. Figure 2.3 illustrates an example of a max-pooling operation on a feature map.

Figure 2.3: Resulting input and output when applying max pooling with filter size 2x2 and stride 2
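The operation in Figure 2.3 can be expressed in a few lines of NumPy; the sketch below is illustrative only, and the example feature map values are made up.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Max pooling with filter size 2x2 and stride 2 (as in Figure 2.3)."""
    h, w = feature_map.shape
    # Reshape into non-overlapping 2x2 patches and take the max of each patch.
    patches = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return patches.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]])
print(max_pool_2x2(fm))
# [[6 4]
#  [8 9]]
```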

Depthwise Separable Convolution

Regular 2D convolution performs computation on the spatial dimensions (width and height) and the channel dimension simultaneously to learn a complex 3D filter. Depthwise separable convolution (DSC), instead, works by performing the spatial convolution separately from the pointwise convolution. It is dissimilar to separable convolution in image processing [6]. Sifre [36] proposed DSC as he found some parameters redundant in existing state-of-the-art Deep neural networks. DSC was used in AlexNet and provided not only a reduction in the number of parameters but also performance gains and a reduction of the computational cost. MobileNet was developed by Google and implemented DSC efficiently for use in phones [16]. The following example is adapted from Wang [43]. To put DSC and regular convolution into an example (to show why DSC is faster), imagine an image of shape (12,12,3), where the first item of the shape refers to the width, the second to the height, and the third to the channels — RGB channels. In regular convolution, imagine convolving a kernel of shape (5,5,3) over this image. This will output a feature map of shape (8,8,1). For each feature map, we must perform 5 × 5 × 3 multiplications. If we wish to create 128 feature maps, we would have to do 5 × 5 × 3 × 128 multiplications. In DSC, depthwise and pointwise convolution are performed separately. Depthwise convolution is performed first, i.e. a kernel of shape (5,5,1) is applied to each of the RGB channels separately, and the results are stacked up to achieve a feature map of shape (8,8,3). Following this, pointwise convolution is performed with a kernel of shape (1,1,3) — 3 channels deep. We perform this pointwise convolution 128 times to achieve a final output of shape (8,8,128). During convolution, however, the kernel moves over the entire image. In the case of regular 2D convolution, this means that the total number of multiplications required is 3 × 5 × 5 × 8 × 8 × 128 = 614400. For depthwise convolution it is instead 3 × 5 × 5 × 8 × 8 = 4800, and for pointwise convolution it is 1 × 1 × 3 × 8 × 8 × 128 = 24576. This means that DSC amounts to 4800 + 24576 = 29376 multiplications. It should then be evident that it is orders of magnitude computationally cheaper to perform DSC than regular 2D convolution.
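The multiplication counts above can be reproduced with a short calculation; the script below simply re-derives the numbers for the assumed 12x12x3 input, 5x5 kernel, 8x8 output and 128 filters.

```python
# Multiplication counts for the (12,12,3) image example above, assuming a 5x5
# kernel with 'valid' padding (output spatial size 8x8) and 128 output filters.
h_out, w_out = 8, 8
k, c_in, c_out = 5, 3, 128

regular = c_in * k * k * h_out * w_out * c_out          # 614,400
depthwise = c_in * k * k * h_out * w_out                # 4,800
pointwise = c_in * 1 * 1 * h_out * w_out * c_out        # 24,576
separable = depthwise + pointwise                       # 29,376

print(f"regular: {regular}, separable: {separable}, "
      f"ratio: {regular / separable:.1f}x fewer multiplications")
```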


Extremely Separated Convolution (XSepConv), proposed by Chen et al. [5], expands upon this concept, reducing the computational cost even further while maintaining or, indeed, increasing the performance of a CNN. It expands on the concept by utilising separable convolution, in the sense of image processing, on the spatial dimension alongside DSC, creating spatially separated depthwise convolution (SSDC). However, as spatially separable convolution divides the kernel into separate width and height kernels, these only obtain information about the width and height dimensions respectively, which means that information is lost compared to regular spatial convolution. XSepConv deals with this by introducing a 2x2 depthwise convolution with symmetric padding.

2.2.4 CNN Encoder-Decoder Network

Deep neural networks have been shown to be powerful models for various learning tasks. However, while these networks achieve good results whenever labelled training data is available, they cannot by themselves be used to learn a mapping from sequence to sequence [1]. The encoder-decoder model, also called the sequence-to-sequence model, is built up of two distinct neural networks. In total, it consists of three different parts:

1. Encoder
2. Bottleneck
3. Decoder

The encoder is a neural network whose purpose is to encode the input sequence. The encoder takes an input signal X and propagates it forward while mapping it to a feature space Z. This mapped output vector from the encoder network aims to encapsulate the meaning of the input sequence. With an image of 32x32 as an example, the encoder takes the image as input and squeezes it to a lower-dimensional representation.

The bottleneck is a lower-dimensional hidden layer and is where the actual encoding takes place. The number of nodes in the bottleneck layer dictates the dimension of the output vector. If, for example, we send an image of size n × m × c (where n, m, c, x are width, height, channels and output channels respectively) through a bottleneck layer of fully connected convolutional layers, it will result in x feature maps that contribute the most information about the image. These low-dimensional encoded features are called latent features, and these belong to the latent vector output. The set of different values that these features can take is what constitutes the latent space.

The decoder seeks to decode the encoded input into a target sequence. The decoder is the second neural network and takes the produced feature map as input, processes it and produces an output Y aiming to recreate the input [44]. The goal of this network is to learn how to reconstruct the data from the encoded representation in order to be as close as possible to the original input.
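To make the three parts concrete, the following Keras sketch builds a small convolutional encoder-bottleneck-decoder for 32x32 inputs; the layer sizes are arbitrary assumptions, and this is not the thesis network.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 1))

# Encoder: downsample while extracting features.
x = layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D(2)(x)

# Bottleneck: low-dimensional latent representation.
latent = layers.Conv2D(8, 3, padding='same', activation='relu')(x)

# Decoder: upsample back to the input shape and reconstruct the input.
x = layers.UpSampling2D(2)(latent)
x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
x = layers.UpSampling2D(2)(x)
outputs = layers.Conv2D(1, 3, padding='same')(x)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.summary()
```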

2.2.5 Residual Neural Network and Skip-Connections

Deep learning is thought of as learning a hierarchical set of representations, such as low-, mid- and high-level features. In image recognition, this corresponds to learning edges, shapes and then objects. Theoretically speaking, stacking more layers would enrich the levels of the features as the additional layers would progressively learn more complex features. However, as He et al. [15] showed, there is a maximum depth threshold when using a traditional CNN model. Several theories have tried to explain why very deep neural networks fail to perform better, the most common being exploding/vanishing gradients. These problems of training very deep networks are what led to the introduction of a new neural network layer: the Residual Block. ResNet by He et al. addresses the degradation problem by fitting each layer to a residual mapping.


Chul Ye et al. [44] demonstrated how the introduction of skip-connections helps smooth out the optimisation landscape, consequently making it easier to optimise the residual mapping rather than the original. This can then be realised by a feed-forward neural network with skip-connections. This is a form of identity mapping whose function is to bypass an encoder layer output to its decoder layer. In order to match the different dimensions resulting from the convolutional operation, the skip-connection is multiplied by a linear projection W, expanding the channels to match the residual. Consequently, this allows the input x to be combined as input to the next decoder layer. Since this skip-connection adds neither extra parameters nor extra computational complexity, the network can still be trained end-to-end by SGD with backpropagation. The introduction of skip-connections controls the degradation problem and gives the ability to train much deeper networks.
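A basic residual block with such a skip-connection can be sketched in Keras as follows; the kernel sizes and the placement of batch normalization are illustrative assumptions rather than the exact blocks used by He et al.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Residual block sketch: the skip-connection adds the input back onto the
    convolved output; a 1x1 projection W matches channel dimensions if needed."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:
        # Linear projection of the identity to match the residual's channels.
        shortcut = layers.Conv2D(filters, 1, padding='same')(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```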

2.2.6 Root Mean Square Error

Root Mean Square Error (RMSE) is a statistical performance metric that can be used to train NNs, CNNs and more. With fewer samples (n < 100), RMSE does not give an accurate representation of the error [4]. As the sample size gets larger (n > 100), it has been shown to accurately represent the error. It is thus a reliable metric for larger sample sizes. RMSE is defined analytically as:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2}{n}},$$

where $\hat{y}_i$ is the prediction, $y_i$ is the ground truth and $n$ is the number of samples.
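In code, the metric is a one-liner; the NumPy sketch below uses made-up coefficient values purely as an example.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root Mean Square Error over n samples."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Example: error between a predicted and a ground-truth SH coefficient vector.
print(rmse([0.74, -0.09, -0.06], [0.75, -0.10, -0.05]))
```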

2.2.7 Tensorflow Lite

Tensorflow Lite is a framework built upon TensorFlow, which has been specifically designed and optimised to run TensorFlow models on embedded devices and mobile platforms such as iOS and Android [40]. Tensorflow Lite contains a tool that converts a trained Tensorflow model — such as one built in Keras with Tensorflow as backend — to an optimised Tensorflow Lite model. This converted file may then be initialised in languages other than the one in which the model was written, such as C++, Javascript and Java. Tensorflow Lite only supports a limited number of Tensorflow operations and is therefore not suitable for all models. Furthermore, Tensorflow Lite contains an interpreter that allows developers to run inference on any Tensorflow Lite model with minimal effort. Tensorflow Lite supports inference on mobile GPUs, enabling quicker inference than on mobile CPUs [39].
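A typical conversion and inference flow looks roughly like the sketch below, where `model` is assumed to be an already trained tf.keras.Model, the file name is arbitrary, and the input shape is read from the converted model.

```python
import numpy as np
import tensorflow as tf

# Convert a trained Keras model to Tensorflow Lite and run inference with the
# Tensorflow Lite interpreter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

dummy_input = np.zeros(input_details[0]['shape'], dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], dummy_input)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]['index'])
```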

2.3 Related Work

Lighting estimation is a problem that is not only applicable to Augmented Reality (AR). On the contrary, the light settings of a scene play a huge role in a variety of tasks such as scene editing, scene reconstruction and environment design, to name a few [27].

Estimation of illumination is an imperative tool in order to realistically illuminate virtual objects in AR. Estimating the Spherical Harmonic coefficients is a technique that enables approximation of a light environment and has been used in video games to light objects and scenes [14]. Analytical estimation of Spherical Harmonic (SH) coefficients requires an infeasible amount of computation, which prevents real-time estimation [34]. Furthermore, it has been shown that the analytical approach to estimating the SH coefficients tends to include a fair amount of error. Consequently, in order to compute reasonable values, it is necessary to compute iteratively [2]. Another issue with analytical estimation of SH is that the true function f (the function to be approximated) can only be reconstructed by summing the infinite series of all SH coefficients [14].


Estimation of the illumination is a pattern recognition problem and, therefore, machine learning can be applied to solve for the SH coefficients. Deep Spherical Harmonics by Marques et al. [32] demonstrated how Spherical Harmonics can be used to represent the lighting situation for all directions and how it is an adequate technique for the light estimation of 3D objects. The method takes an RGB image as input and outputs a light probe for the light scene represented by 9 SH coefficients. Marques et al. illustrate how the estimated light probe can be used to create a composite image consisting of both virtual and real elements while achieving consistent lighting. The proposed method was validated through synthetic tests where it achieved an RMS error of 0.0573. The inference time on the CPU was 0.53 s, while on the GPU it was only 13 ms.

Deep single image relighting by Zhou et al. [46] is another technique that performs illumination estimation as a part of their system to generate a relit image from some input lighting. The method estimates the illumination encoded as SH coefficients at the bottleneck layer. The network achieves an accurate and high-quality relit image. To achieve high-quality relit images, the estimated lighting must be disentangled accurately at the bottleneck layer, essentially creating a de-lit image. The target lighting is then applied to the image to create a relit version. The entire system trains on and generates high-resolution images, up to 1024x1024. The model is not fit for phones when it comes to inference time and memory usage, mostly because the model takes large-resolution images and is so deep and complex. The model also utilises regular convolution, which is more computationally expensive than DSC and is, therefore, slower.

In contrast to other lighting estimation methods, the method by Zhou et al. does not require any special devices such as depth cameras or fish-eye lenses, making it more convenient to use. Furthermore, the method does not require any previous knowledge of scene geometry, making it suitable for both indoor and outdoor environments. The network applied follows the Residual Network architecture to counteract the degradation problem and takes images of size 1024 × 1024 as input [15].

There exist other illumination estimation techniques, such as light source estimation. These, however, require prior knowledge about the environment and depth information [19]. Depth information in photos is only available in the most recent phones. As such, it is not a suitable technique for estimating lighting from photos taken with older phones. The results, i.e. the shadowing and lighting of virtual objects, that Kaufmann et al. achieved were found to be consistent with indoor lighting of real objects.

DeepLight by LeGendre et al. [24] proposed a deep learning method to retrieve HDR illumination from a single LDR RGB image. The model takes the form of an encoder-decoder architecture. Notably, the encoder network contains the first 17 layers of the MobileNetV2 architecture, which uses DSC [16]. To create the ground-truth training data, three balls (specular, matte and diffuse) were placed on a stick with a camera pointed towards them such that the balls occupied the lower 1/5th of the screen real estate. However, this method only focused on illumination inference from images containing only background, whereas our data will consist of a wide range of diverse sets of faces. It is further noted that the method does not work particularly well with scenes containing mostly similarly coloured material. The method achieved between 12-20 fps on a mobile CPU during inference.

Another illumination technique by LeGendre et al. [25] estimated the HDR lighting from a single LDR RGB image from a phone using a deep learning model. Their lighting estimation technique achieved an astounding 94 FPS on a Google Pixel 4 during inference. The method utilised a convolution-based hourglass network architecture, where the final output was HDR lighting encoded as a mirror sphere. The training dataset was created with a method called "One-Light-at-a-Time" (OLAT). This OLAT technique requires an advanced setup of cameras and LEDs that captures a sequence of photos of a person from different angles with different lighting applied. In the absence of such an OLAT dataset, we reached out to the creators, though, unfortunately, due to GDPR, they could not provide us with the dataset. While showing great promise, this method is too difficult to reproduce with limited time and resources.


3 Method

Our model is based on the hourglass network structure of Zhou et al. [46], but with the distinction that the final model (henceforth named the inference network) only contains the encoder part and the bottleneck part of the full network. The bottleneck layer is where the SH coefficients are estimated. We base our network architecture on the model from this paper for a variety of reasons. It achieved decent results in terms of quality (see the definition of quality in section 2.1). Another reason is that SH coefficients can easily be applied in the rendering of 3D objects in augmented reality, hence it is suitable for our objective. The final reason for our choice is that the paper by Zhou et al. [46] includes a large dataset of faces annotated with ground-truth light information in terms of 9 second-order SH coefficients.

In order to relight a face, the network by Zhou et al. is fed an LDR RGB portrait image of resolution 512 × 512 and a target light (SH coefficients) as input to generate a relit image under the target lighting condition. The encoder performs down-sampling on the images while extracting face and lighting features. From the lighting features, the SH coefficients are estimated. The decoder then performs up-sampling to generate a relit portrait image used to calculate the loss during the training phase. Note that the entire network is only used during the training phase as the network's weights are helped by the additional information provided by the complete linear combination of losses. Early experiments also highlighted this fact. Furthermore, we replace the regular Convolution layers with DSC layers.

Unlike the architecture by Zhou et al. [46], our final model only uses the encoder part and the bottleneck of the network during inference. This is because we are only interested in the estimation of the SH coefficients in the bottleneck layer, whereas the decoder is only relevant for data generation and training. The removal of the decoding part of the network further reduces the number of parameters and therefore also speeds up the inference. Our network is trained by a linear combination of 3 losses which are defined in section 3.3.1. In addition, our model also applies DSC instead of regular convolution, resulting in less computational complexity, a smaller network size and fewer trainable parameters. This further speeds up the inference time, making it more suitable for mobile devices with less CPU and GPU power.

The rest of the method is structured as follows: we will begin with the dataset, within which we talk about what it consists of; subsequently followed by the base network architecture, the parameters used for this particular network, the loss functions, and motivation thereof; followed by the experiments, in which we talk about the replacement and the implementation of Convolution, Depthwise separable convolution, and how these affect the inference time and the quality of the network; finally, what testing environment and computational hardware are used to achieve the results.

3.1 Dataset

The dataset utilised in our endeavours is the DPR dataset provided by Zhou et al. [46]. It is based on the celebA-HQ dataset [20], which consists of a large set of diverse RGB images of celebrity faces with resolutions of up to 1024x1024. Furthermore, the images consist of varied backgrounds and face poses. These features make the dataset suitable for our endeavour, as the input to the network during inference will consist of face images with varied poses and, most likely, backgrounds. The DPR dataset is a pre-processed version of the celebA-HQ dataset and includes ~138K relit images, roughly equating to ~280 GB in size on disk. The term 'relit' refers to celebA dataset images that have gone through a pre-processing step to generate new images with generated lighting based on computationally expensive techniques. Therefore this dataset is considered to be synthetic. For more information about the pre-processing of the dataset, please refer to Zhou et al. [46]. The entire dataset is available publicly for non-commercial use. This means that our trained model may not be used in the released product and can, therefore, only be considered a proof of concept.

Each entry in the DPR dataset consists of 4 relit images with 9 accompanying SH coefficients for each relit image. The SH coefficients are supplied in the form of text files. The normal of the image, the original SH coefficients, and a shading and a relighting mask are also included for each entry. During training, the relit image, a target image and the ground-truth SH are used to calculate the loss. In addition, it is worth noting that the accompanying SH coefficients are not "true" ground-truth as they have been estimated. More information about a training sample is found in Section 3.3. An example of a relit image and its accompanying SH coefficients can be seen in Figure 3.1 and Table 3.1. No further pre-processing is required, and the DPR dataset can be used as-is. The dataset is split into three folds: train, validation and test with a split of 80%, 10% and 10% respectively.

Table 3.1: Table of SH coefficients

SH0   7.499087367642115476e-01
SH1  -8.986789953924927132e-02
SH2  -6.194947178426009338e-02
SH3  -2.276852084772853224e-01
SH4   1.836999449970601883e-01
SH5  -1.299893046410884878e-01
SH6  -7.912139089975360473e-02
SH7  -1.029840631806318574e-01
SH8   3.563010731166914996e-02


Figure 3.1: Original and relit image applying target SH coefficients from table 3.1

3.2 Network architecture

The entire network is based on the architecture by Zhou et al. [46] and is displayed in Figure 3.2. The network is split into three parts: the encoder network, the bottleneck and the decoder network. It is also structured in this order. The inference network is defined to be the encoding part and the bottleneck layer of the entire network. It is the only part of the network that is of interest to us, as it is here that the SH coefficients of the input image are inferred. The input to the network is a 512 × 512 RGB image that is converted to the L*a*b space, from which the luminance channel is the actual input.
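The input pre-processing described above could be sketched with OpenCV as follows; the exact resizing and normalisation used in the thesis pipeline are assumptions here, and only the general idea (extracting the L*a*b luminance channel) follows the text.

```python
import cv2
import numpy as np

def luminance_input(path):
    """Read an image, resize to 512x512, convert to L*a*b and keep the L channel."""
    bgr = cv2.imread(path)
    bgr = cv2.resize(bgr, (512, 512))
    lab = cv2.cvtColor(bgr.astype(np.float32) / 255.0, cv2.COLOR_BGR2LAB)
    luminance = lab[..., 0] / 100.0                 # L channel, scaled to [0, 1]
    return luminance[np.newaxis, ..., np.newaxis]   # shape (1, 512, 512, 1)
```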

Figure 3.2: Hourglass network architecture

The network is translated from PyTorch into Keras via Tensorflow, enabling easy conversion, once the model is trained, to Tensorflow Lite, which is then used for inference within the phone application. Our inference network follows the architecture of the encoder part of an encoder-decoder network, where the encoder-decoder network is often referred to as an hourglass network architecture. Given a 512x512 input image, the network encodes the image and infers the resulting SH coefficients at the bottleneck layer. Our inference network architecture is displayed in Figure 3.3. The full network consists of 4 blocks which are referred to as hourglass blocks. Each hourglass block is structured as a 2-part convolution block where each part consists of a convolution layer, a batch normalization layer and a ReLU activation function. Moreover, each hourglass block is followed by a downsampling layer in the form of MaxPooling [47].

3.2.1 Network details

The layers of the inference network, starting from the left and going all the way to the bottleneck, are displayed in Table 3.2, where the number of input channels, the number of output channels and the feature map resolution of each layer are shown. cx denotes a convolution block (a convolution layer followed by a batch normalization layer and a ReLU activation function, repeated twice), n denotes the feature map resolution n × n, dy denotes a downsampling layer, and l1, l2 denote convolution layers at the bottleneck. pre denotes a pre-processing step including a convolution layer followed by batch normalization and a ReLU activation function; unlike cx, this is not repeated twice.

Table 3.2: Table of inference network layers

Layer    pre  d1   c2   d2   c3   d4  c4  d5  c5   l1   l2
chin     16   16   16   16   16   32  32  64  64   27   128
chout    16   16   16   16   32   32  64  64  155  128  9
n        512  256  256  128  128  64  64  32  32   32   32

Encoder

As the name 'encode' suggests, this part of the network encodes the input images by passing them through multiple layers to create high-level feature maps from which the bottleneck layer can infer the SH coefficients. The encoder network contains the aforementioned convolution blocks, and the details of the input and output channels are defined in Table 3.2, where all columns but l1 and l2 refer to the encoder network.

Bottleneck

At the point of the bottleneck layer, we have 155 channels, of which the first 27 channels correspond to the lighting features (9 SH coefficients for each of the three channels R, G and B). The other channels refer to the face feature maps. At the bottleneck layer, we wish to estimate the input lighting from these 27 channels. This is done by averaging over these 27 features and passing them through two fully connected convolution layers (l1, l2 in Table 3.2) with 128 and 9 channels respectively, resulting in the output of 9 SH coefficients [47]. At this stage, we also replace the 27 channels corresponding to the lighting features with our target lighting. Note that this is only necessary for training the network, as it is what allows the network to produce a relit image; the relit image is of no particular interest for our final model and is therefore stripped away from the network for the final product. Consequently, this increases the overall speed of the network and reduces its overall size. In essence, for the final network, everything to the right of the bottleneck layer is removed. The encoder block and the bottleneck block (i.e. the inference network) are depicted in Figure 3.3.
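The bottleneck step can be sketched with Keras layers as below; the 1x1 kernel sizes for l1 and l2 and the input shape of the encoder output are assumptions, and only the inference path (no target-light replacement) is shown.

```python
import tensorflow as tf
from tensorflow.keras import layers

def sh_from_bottleneck(features):
    """Sketch of the SH estimation at the bottleneck.
    'features' is assumed to be the (batch, 32, 32, 155) encoder output."""
    lighting = layers.Lambda(lambda t: t[..., :27])(features)  # lighting feature maps
    pooled = layers.GlobalAveragePooling2D()(lighting)          # average over space -> (batch, 27)
    pooled = layers.Reshape((1, 1, 27))(pooled)
    x = layers.Conv2D(128, 1, activation='relu')(pooled)        # l1: 128 channels
    sh = layers.Conv2D(9, 1)(x)                                 # l2: 9 SH coefficients
    return layers.Reshape((9,))(sh)
```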


Figure 3.3: Inference network architecture

Decoder

The decoding network is only of interest during the training phase. The decoding network consists of Convolution layers, Upsampling layers and batch normalization layers. The decoding network is essentially a mirrored version of the encoding network, intending to upsample to the same shape as the input, thus producing a relit version of the input image. The decoder is not used in the inference network. The decoder is also based on the work by Zhou et al., and further information about the details of the decoder network is laid out in the compendium [47].

3.3 Training process

As previously mentioned in Section 3.1, the dataset is split into three sets pertaining to training, validation and testing. Each set is accompanied by a text file containing the source and the target image of each sample. A training sample is defined to be a ground-truth image accompanied by its SH coefficients and the target image with its accompanying SH coefficients. The network is trained with the loss functions defined in Section 3.3.1. The model is trained on a V100 32GB VRAM GPU. More information regarding the environment and system can be found in Section 3.7. The training is performed in batches of 16, which is the maximum due to the limited amount of video memory available on the GPU (32 GB). A full epoch takes eight hours utilising the hardware described in Section 3.7. For the experiments, we train the full network for nine epochs for a total of 9 × 8 hours, and the inference network for 8 × 8 hours. Both networks are trained using the Adam optimiser with default parameters. More information about the hyper-parameters can be read in Section 3.5. The input samples to the model are sampled in batches of 16 in order, according to a train text file containing the samples pertaining to the training set. The validation set is sampled the same way but from a different validation text file pertaining to the validation samples. The images of a training sample are pre-processed by transforming the RGB images to the L*a*b space and then sliced to only contain the luminance channel (as this is the channel from which we wish to infer the lighting).

After each epoch, the model is evaluated on a batch of validation samples. If the loss from this evaluation is better than the previous best, the model is saved as a Keras model and promptly converted to a TensorFlow Lite model (which allows us to run inference on the model directly). The weights are also saved, which allows transferring the weights from the full network to the inference network.
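The save-and-convert step can be expressed as a Keras callback along the following lines. This is a sketch under the assumption that the validation loss is reported as val_loss; the file names are hypothetical and the actual training script may structure this differently.

import tensorflow as tf

class SaveBestAndConvert(tf.keras.callbacks.Callback):
    # Save the Keras model, its weights and a TensorFlow Lite conversion
    # whenever the validation loss improves on the previous best.
    def __init__(self, keras_path="best_model", tflite_path="best_model.tflite"):
        super().__init__()
        self.best = float("inf")
        self.keras_path = keras_path
        self.tflite_path = tflite_path

    def on_epoch_end(self, epoch, logs=None):
        val_loss = (logs or {}).get("val_loss")
        if val_loss is not None and val_loss < self.best:
            self.best = val_loss
            self.model.save(self.keras_path)
            self.model.save_weights(self.keras_path + "_weights.h5")
            converter = tf.lite.TFLiteConverter.from_keras_model(self.model)
            with open(self.tflite_path, "wb") as f:
                f.write(converter.convert())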

Zhou et al. [46] provide a detailed explanation of the training strategy. The gist of the strategy is to train the network for five epochs without skip-connections and then add the skip-connections one by one, one after each epoch. At epoch 10, a face loss is added to help the bottleneck layer transfer the face features and refine the output image. We adopt this strategy with some changes. We do not add a face loss at epoch 10, since this mainly aims to increase the accuracy of the relit image, and the slight refinement of the generated output image is of no interest to us. We stop training when the loss is increasing following the addition of the last skip-connection. The skip-connections are visualised in Figure 3.2 as S1, S2, S3 and S4 and are added one by one, starting with S1.

3.3.1 Loss functions

For the full network, we utilise a linear combination of three losses based on the estimated SH coefficients and the relit image. Our loss function is similar to the loss proposed by Zhou et al. [46], but excludes the face loss and the GAN loss. We chose to stay close to their formulation, since the authors have extensive knowledge of their own network and loss function. The loss function $L_l$ consists of the L1 loss between the generated image and the target image, the L1 loss between the gradients of the generated image and the target image, and the L2 loss of the estimated SH coefficients. The loss function is defined in Equation 3.1. For the inference network, the first two terms are excluded, as the inference network does not generate a relit image, meaning that the full loss function cannot be calculated.

$$L_l = \frac{1}{N}\left(\lVert I_t - \hat{I}\rVert_1 + \lVert \nabla I_t - \nabla \hat{I}\rVert_1\right) + \left(L_{gt} - \hat{L}\right)^2, \qquad (3.1)$$

$L_{gt}$ and $\hat{L}$ correspond to the ground-truth and predicted SH coefficients of the source image. $I_t$ and $\hat{I}$ correspond to the target image and the generated relit image, respectively. $N$ is the number of pixels in an image, i.e. the image resolution $512 \times 512$.
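The loss can be implemented directly from Equation 3.1. The sketch below assumes image tensors of shape (batch, 512, 512, 1) and SH tensors of shape (batch, 9); it illustrates the equation and may differ in detail from our training code.

import tensorflow as tf

def full_network_loss(i_target, i_pred, sh_gt, sh_pred, n_pixels=512 * 512):
    # L1 loss between target and generated image.
    l1_image = tf.reduce_sum(tf.abs(i_target - i_pred), axis=[1, 2, 3])
    # L1 loss between the image gradients of target and generated image.
    dy_t, dx_t = tf.image.image_gradients(i_target)
    dy_p, dx_p = tf.image.image_gradients(i_pred)
    l1_grad = tf.reduce_sum(tf.abs(dy_t - dy_p) + tf.abs(dx_t - dx_p), axis=[1, 2, 3])
    # Squared L2 loss on the 9 estimated SH coefficients.
    l2_sh = tf.reduce_sum(tf.square(sh_gt - sh_pred), axis=-1)
    return (l1_image + l1_grad) / n_pixels + l2_sh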

3.4 Experimentation

The experimentation involves:

1. replacing the convolution layers with DSC layers,
2. removing the decoding part of the network, creating the inference network,
3. retrieving the RMSE value for each step of the model during training,
4. benchmarking the models on a phone, using both the GPU and the CPU, to retrieve the memory usage and the inference time,
5. training the full architecture according to the training strategy, and
6. training the inference network.

These experiments are done in order to achieve greater speed and/or better quality SH coefficients as defined by $Q_e$, and to compare how the focus on $Q_e$ affects the speed.

3.4.1 Evaluating inference time

The inference time is evaluated using TensorFlow's benchmarking tool for TensorFlow Lite on a OnePlus 3 A3003 Android phone running OxygenOS 4.1.6 with Android version 7.1.1. The phone's chipset is a 14 nm 64-bit Qualcomm Snapdragon 820 MSM8996 with 6 GB of LPDDR4 RAM and a Qualcomm Adreno 530 GPU. A TensorFlow Lite model with the full architecture and a TensorFlow Lite model with the inference architecture are run through the benchmarking tool on the phone, which provides the inference time for the model in a steady state (i.e. loaded into RAM) via the Android Debug Bridge (ADB). The benchmarking tool is run with the GPU both enabled and disabled.
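For reference, a comparable measurement can be obtained with the TensorFlow Lite Python interpreter as sketched below. This is not the benchmarking tool used in the thesis, and the model file name is hypothetical; it merely illustrates the warm-up-then-measure procedure the tool performs.

import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="inference_net.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.random.rand(*inp["shape"]).astype(np.float32)

# Warm-up runs so the measurement reflects a steady state.
for _ in range(10):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

runs = 50
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print("Mean inference time: %.1f ms" % elapsed_ms)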

3.4.2 Evaluating quality against the inference time

To evaluate $Q_e$ we randomly draw 505 samples from the test set and estimate $\hat{L}$ for each sample by passing it through the network. This number of samples is motivated by the theory in Section 2.2.6, which states that with fewer than 100 samples the error cannot be properly represented. Beyond satisfying this criterion, the number 505 is arbitrary; we are by no means limited to it. The RMSE defined in Section 2.2.6 is then used to calculate the error between $L_{gt}$ and $\hat{L}$. In order to evaluate the quality against the inference time, the network is reduced to the inference network, which intuitively reduces the number of parameters. Both the full architecture and the inference network are trained on the same dataset. The goal is to see whether reducing the network to the inference net, thus changing the loss function and removing several layers, affects the estimated SH coefficients and the inference time. Inference on the same image is run on each of the saved network models (inference net, full without skip-connections, full with S1, full with S1/S2, full with S1/S2/S3, full with S1/S2/S3/S4), and $Q_e$ is calculated to show the quality of the estimated SH coefficients.
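The RMSE over the sampled SH coefficients can be computed as in the following sketch, assuming both arrays have shape (505, 9); the function name is our own.

import numpy as np

def sh_rmse(sh_gt, sh_pred):
    # Root-mean-square error between ground-truth and estimated SH coefficients.
    return float(np.sqrt(np.mean((np.asarray(sh_gt) - np.asarray(sh_pred)) ** 2)))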

DSC, as described in Section 2.2.3, reduces the number of parameters, consequently improving the inference time and possibly the efficiency of larger networks. TensorFlow, and thus Keras, have built-in DSC layers that perform depthwise and pointwise convolution separately. In order to conduct this experiment, we simply replace every convolutional layer of the entire network with DSC layers and re-train it with the same loss function described in Section 3.3.1, using the same dataset.
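In Keras this replacement amounts to swapping the layer type, as in the sketch below; the filter count and kernel size are illustrative rather than taken from our architecture.

from tensorflow.keras import layers

# A standard convolution block ...
conv = layers.Conv2D(64, kernel_size=3, padding="same", activation="relu")

# ... and its depthwise separable counterpart. SeparableConv2D performs the
# depthwise convolution followed by the pointwise (1x1) convolution.
dsc = layers.SeparableConv2D(64, kernel_size=3, padding="same", activation="relu")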

3.4.3 Quality survey

In order to evaluate the results and gather subjective opinions of the model, we present a survey to individuals. The survey comprises a short introduction explaining our research and the purpose of the survey, followed by a total of 11 questions, where each question consists of an image with two spheres. One sphere pertains to our method and one to Kapanu's pre-existing method. For each question, the sphere by Kapanu is placed to the left while ours is placed to the right; the respondent is, of course, unaware of which sphere belongs to which model. It would be desirable to extend the number of questions and cases further, but this would also place a greater burden on the respondents. An example question from the quality survey can be seen in Figure 3.4.

For each question, the respondent is asked to select their preferred alternative. The selection then acts as an indicator of which model better captures the light situation of the presented image. The respondent marks their preference on a scale of 1 to 5, where 1 indicates an undisputed preference for the model by Kapanu and 5 indicates an undisputed preference for our model. A 3 indicates that the respondent finds that both models perform equally well or poorly. The survey is conducted through Google Survey.

Apart from collecting information about the user preferences in the questions, we also ask for auxiliary information such as Sex (Male/Female/Prefer not to say), Age, Monitor/Screen used, and any additional feedback on the survey. We include this information in the hope of drawing further conclusions from the survey results.


Figure 3.4: Example question for the quality survey

Visualising the SH coefficients

A necessary pre-processing step for the survey is to visualise $\hat{L}$. The listing in Appendix A.1 showcases the code used to project the SH coefficients onto a sphere; this code is adapted from Zhou et al. [46]. A generated light sphere is depicted in Figure 3.5.
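The idea behind that code can be sketched as follows: evaluate the first nine SH basis functions at the surface normals of a sphere and take the dot product with the coefficients. The normalisation constants of the basis are omitted here for brevity, so this is an illustration rather than the listing in Appendix A.1.

import numpy as np

def render_sh_sphere(sh, size=256):
    # Shade a sphere with 9 SH coefficients (illustrative basis, constants omitted).
    xs = np.linspace(-1, 1, size)
    x, y = np.meshgrid(xs, -xs)            # flip y so 'up' is up in the image
    mask = x**2 + y**2 <= 1.0
    z = np.sqrt(np.clip(1.0 - x**2 - y**2, 0, 1))
    basis = np.stack([
        np.ones_like(x), y, z, x,
        x * y, y * z, 3 * z**2 - 1, x * z, x**2 - y**2,
    ], axis=-1)
    shading = basis @ np.asarray(sh)        # per-pixel dot product with the coefficients
    shading[~mask] = 0.0
    return shading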

Figure 3.5: SH coefficients projected onto a sphere

Evaluation of survey results

Given the results from the quality survey, our model is evaluated by comparing the number and share of respondents that show a preference for our model; for more information, see $Q_s$ described in Section 2.1. With the results, we can determine the average preference by summing all of the votes and dividing by the total number of votes (NrQuestions × Respondents):

$$\text{Average preference} = \frac{\text{Sum of votes}}{\text{NrQuestions} \times \text{Respondents}}$$


Given a value for the total average preference over all images and users, a value larger than three indicates that the users are in favour of our model. Furthermore, we determine a percentage-in-favour per question in order to evaluate what the majority of the users prefer in different scenarios. For the percentage-in-favour, we choose to exclude the votes of 3, since we aim to actually outperform the results by Kapanu and these neutral votes are therefore of no direct interest.

The metrics used for the evaluation are:

1. Average preference over all images and users
2. Average preference per user
3. Average preference per image
4. Spread of votes per question
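These metrics and the percentage-in-favour can be computed along the following lines, assuming the votes are collected in an array of shape (respondents, questions) with values 1-5; the function name is our own.

import numpy as np

def survey_metrics(votes):
    # votes: array of shape (respondents, questions) with values 1-5.
    votes = np.asarray(votes)
    average_preference = votes.sum() / votes.size      # over all images and users
    per_user = votes.mean(axis=1)                      # average preference per user
    per_image = votes.mean(axis=0)                     # average preference per image
    # Percentage-in-favour per image, excluding the neutral 3-votes.
    decided = votes != 3
    in_favour = (votes > 3).sum(axis=0) / np.maximum(decided.sum(axis=0), 1)
    return average_preference, per_user, per_image, in_favour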

Test images for survey

In order to adequately evaluate and compare the rendered spheres, the test images are chosen so that they cover different light directions (left, right, above, below, front), a variety of skin tones and different photo angles. The test images used for the survey were supplied by the company from their database of images of human faces.

3.5 Hyper-parameters

The hyper-parameters of the network refer to the weight initialisation, the number of epochs and the learning rate. The number of epochs is discussed in detail in Section 3.3. While the weight initialisation and the learning rate are left at their defaults, it is still worth noting the default values for those who intend to run these experiments on a platform other than TensorFlow. TensorFlow uses GlorotUniform (Xavier uniform) for its weight initialisation. The learning rate of Adam is set to 0.001. The batch size is set to 16. Different configurations of hyper-parameters are not investigated due to time constraints; such an investigation is infeasible when an epoch takes about 8 hours to complete.
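For reproducibility, these defaults translate to the following Keras settings. This is a sketch; the initialiser is Keras's default and does not need to be set explicitly.

import tensorflow as tf

BATCH_SIZE = 16
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)   # Adam with default settings
initializer = tf.keras.initializers.GlorotUniform()         # Xavier/Glorot uniform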

3.6 The inference model

The inference model refers to the encoding block and the bottleneck block of the network. It is referred to as such because it is the part of the network that is of main interest for the final product, i.e. the part that estimates the SH coefficients. Reducing the network to this part also lowers the overall complexity and the inference time of the final model, such that it can run on a phone.

3.6.1 Tensorflow Lite conversion

In order to run the model on a mobile device, it is converted from a Keras model, which is built upon TensorFlow, to a TensorFlow Lite model. The TensorFlow Lite model can be used from a multitude of languages such as, but not limited to, C++, Java, JavaScript and Python. This is important as it bridges the gap between the model (which is written in Python) and the phone application (written in C++).
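The conversion itself takes only a few lines with the TensorFlow Lite converter; the file paths below are hypothetical.

import tensorflow as tf

model = tf.keras.models.load_model("inference_net")          # trained Keras model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("inference_net.tflite", "wb") as f:
    f.write(tflite_model)

The resulting .tflite file can then be loaded through the TensorFlow Lite runtime of the target language, for example the C++ API used by the phone application.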


3.7 Hardware, resources and environment

In order to train the entire network, we utilise the National Supercomputer Centre (NSC) at Linköping University, with access to the GPU cluster Sigma. Sigma contains at least 4 Nvidia Tesla V100 GPUs with 32 GB of accessible VRAM each, 2x Intel Xeon Gold 6130 CPUs with 384 GiB of RAM, a 14 TB NVMe disk and 36 CPU cores per node. However, during training we were limited to a single V100 GPU and 2 CPU cores, as we did not implement multi-GPU support for our model and found that no more than 2 CPU cores were needed to move the samples of the dataset to and from the GPU. Due to the absence of multi-GPU support, the batch size during training is bottlenecked by the amount of VRAM available.

The environment in which the model is created and trained consists of Python 3.8.8, Tensorflow 2.3.0, Tensorflow-GPU 2.2.0, Tensorflow-addons 0.12.1, NumPy 1.18.5 and various other packages not noteworthy enough to mention. The model was built with Keras using TensorFlow as a backend, which allowed it to be easily converted to TensorFlow Lite and thereby facilitated the integration of the model into the phone application.


4 Results

This chapter presents the results from the experiments performed and the network training. The quality survey is presented first, followed by the loss graphs and RMSE results. Finally, we showcase the inference time and memory usage.

4.1 Quality Survey

In the following section, we present a selection of interesting results from the quality survey. Out of the total 11 images, 8 were in favour of our model. In the results presented below, six examples are highlighted: three where our model outperforms and three where it underperforms relative to the model by Kapanu. These are highlighted because they are the most interesting in terms of the outcome of the survey and the light captured. The full survey can be found in Appendix A.2.

The results from the quality survey indicate that the respondents, on average, tend to prefer the results of our model. Over all questions, the end result is an average rating of 3.42. As described in Section 3.4.3, this means that the users overall prefer our model, acting as an indicator that our model performs better than Kapanu's model.

A summary of the spread of votes for the survey can be seen in Figure 4.1. As described in Section 3.4.3, the votes indicate the user preference between the models, where 1-2 (red) and 4-5 (blue) represent a preference for the model by Kapanu and for ours, respectively. Vote 3 (yellow) represents the votes where the users found that the models performed equally well or poorly.


Figure 4.1: Spread of votes per image. The integer indicates the number of votes for that category.

Given that the purpose of our work is to outperform the results by Kapanu, we also present an overview where the 3-votes have been excluded in order to determine the percentage-in-favour of the model results. The results from this calculation can be viewed in Figure 4.2.


Figure 4.2: Percentage in favour

Observing Figure 4.2, which displays the percentage-in-favour, we see that there are certain cases where our model results receive a substantially higher share of the votes. From Figure 4.3 it can be seen that our model result receives 90% of the votes whereas Kapanu receives 10%. Studying the output spheres, we see that the main difference is that the left sphere by Kapanu displays a harder shadow on the right side, indicating that the light is coming more from the left.

Figure 4.3: Image 2 - Quality Survey - left sphere by Kapanu, right sphere by ours

Also, in Figure 4.4 our model results clearly outperform the model by Kapanu, with a percentage-in-favour of 80% to 20%. On the output spheres, it can be observed that both models are able to capture the light direction coming from the left. The difference between the two is that our sphere displays a smoother fading of the shadows, whereas the shading by Kapanu is rougher and straighter.
