Reconstruction and recommendation of realistic 3D models using cGANs

Degree project in Computer Science and Engineering, second cycle, 30 credits

Stockholm, Sweden 2018

MÓNICA VILLANUEVA AYLAGAS

KTH ROYAL INSTITUTE OF TECHNOLOGY

Master in Machine Learning

Date: June 15, 2018

Supervisors: Hedvig Kjellström and Mario Romero Vega

Examiner: Danica Kragic Jensfelt

Swedish title: Rekonstruktion och rekommendation av realistiska 3D-modeller som använder cGANs

Abstract

Sammanfattning

2 Background and related work
  2.1 Background
    2.1.1 Generative models
    2.1.2 3D models
  2.2 Related work
    2.2.1 2D
    2.2.2 3D
    2.2.3 User studies
3 Method
  3.1 Data
    3.1.1 Format
    3.1.2 Noise functions
  3.2 GANs
    3.2.1 Network architectures
    3.2.2 Objective function
  3.3 Distance functions
  3.4 Recommendation system
  3.5 Evaluation
    3.5.1 Quantitative: Distance measure
    3.5.2 Qualitative: User study
  3.6 Hardware description
4 Experiments and results
  4.1 Distance functions
  4.2 Noise generalization
  4.3 Discriminator strength
  4.4 Recommendation system
    4.4.1 Realistic system
    4.4.2 Balanced system
    4.4.3 Similar system
    4.4.4 System comparative
  4.5 User models
  4.6 Qualitative evaluation: User study
    4.6.1 Population statistics
    4.6.2 Data preprocessing and analysis
    4.6.3 Realism experiment
    4.6.4 Similarity experiment
    4.6.5 Preference experiment
5 Discussion and conclusions
  5.1 Achievements
  5.2 Future work
Bibliography
A Complete list of noise functions
  A.1 Unstructured noise
  A.2 Structured noise

Chapter 1 Introduction

Three-dimensional modeling is the process of creating a representation of a surface or object in three dimensions using specialized software in which the modeler can create and edit the representation. Another way of creating surfaces is to scan real-world objects into a point cloud. There are multiple 3D computer graphics packages for creating 3D models, each with its own characteristics, tools and render engines. Figure 1.1 shows the User Interface (UI) of two modeling packages: Blender, as an example of open-source software (GNU GPLv2+ licence), and Autodesk Maya, as a commercial one.

(a) Blender interface (b) Maya interface

Figure 1.1: User interfaces of different 3D modeling software

Currently, 3D modeling is difficult to master. Not everyone can successfully reproduce what they see, let alone what they imagine. This can be the result of personal limitations, the complexity of working with multiple dimensions, or the intricacy of the 3D modeling software.

Movies, video games and even virtual and mixed reality apps surround us with an increasing need for realistic models. Experienced modelers can benefit from a tool that helps them speed up content creation. Furthermore, with 3D printers becoming commonplace, even beginners would be able to create their own natural-looking models.

The field of Computer Graphics is not the only one benefiting from advances in the generation of 3D models. Many robotic applications use 3D models to solve problems like interacting with objects. The area of medical imaging also employs 3D models for segmentation of cancer or injuries.

The increase in computational power is boosting research in Deep Learning which, in synergy with generative models, is increasing the amount and quality of 3D Computer-Aided Designs (CADs). This data enhancement is, in turn, improving the learning processes and helping achieve better models, adding value to the Machine Learning pipeline.

The aim of this Master Thesis is the design and development of an end-to-end recommendation system for 3D models using GANs to generate novel reconstructions from a user input. To the best of the author’s knowledge, no recommendation systems are included in 3D modeling software, nor has the idea been researched so far.

The decision to use GANs to solve this problem is supported by the preference to reconstruct novel objects and the fact that this method is the state of the art in generation as revealed by the literature study in Section 2. The whole motivation behind this work is outlined in Section 1.2.

1.1 Research question

This work addresses the following research question:

What are the benefits and limitations of using conditional Generative Adversarial Nets to reconstruct unpolished voxelized models and recommend plausible alternatives?

The reconstructions are assessed using Intersection over Union as a quantitative measure. The users’ perception is evaluated through a user study that measures, by forced pair-wise comparison, both the level of realism and the similarity to the model entered by the user.

1.2 Motivation

The reconstructions are guided by three different similarity measures, which make the output follow the distributions of natural 3D models, look like the user models or share features from both natural 3D CADs and the sketch created by the user. This makes it possible to build up a system that uses the reconstructions as recommendations for unfinished or crude models, comparable to predictive text in mobile phones. No previous work has been found that uses end-to-end generation of 3D models to build a recommendation system.

The main difference with the closest related work [19] is the lack of an additional network aside from the GAN. As explained in Section 2.2.2, Liu, Yu, and Funkhouser [19] project the 3D model into the manifold to obtain a latent vector that is later used as input for the GAN. In this Master Thesis, the 3D models are fed directly to the Generative Adversarial Network.

Moreover, the system of Liu, Yu, and Funkhouser [19] is not meant as a recommendation system, but as a tool that improves the current 3D model through iterative interaction with the user. The recommendation system developed for this work instead presents the user with three outputs based on different perceptual attributes.

Finally, a quantitative measure of the accuracy of the system may not be the most appropriate in this particular case where, ultimately, the resulting models will be judged by humans. Therefore, in addition to evaluating the systems quantitatively, a study is carried out to assess how natural the results are and how much they resemble the model created by the user. It also measures which of these qualities is more useful for a 3D recommendation system.

Given this information, it is possible to analyze the validity of the mathematical assumptions as well as to estimate the most valuable weighting for these features according to the users.

1.3 Delimitations

The project focuses on the technical difficulties of training GANs for the reconstruction of 3D models, not on developing a graphical interface for editing 3D models and visualizing recommendations.

The recommendation system can be extended by training additional GANs using different sets of parameters for the weights that modify the distance functions. However, only three sets of these parameters are actually trained, in order to test the theoretical assumptions about the features of the reconstructions.

employed after the modification of certain hyperparameters.

1.4 Societal, ethical and sustainability aspects

The use of Artificial Intelligence and Machine Learning applications has multiple repercussions in society that need to be addressed by scientists.

For this particular work, the most serious issue could be job loss due to the automation of a process. The recommendation system, however, does not aim to replace human beings, but to provide a tool that eases the task of 3D modelers, in what has been termed Artificial Intelligence Augmentation (AIA), or IA for short.

At the same time, it is possible to argue that improving the productivity of a worker can reduce the need to hire more staff. This problem is real, but it is not new. For instance, with the advances of the Industrial Revolution, threshers and seeders were introduced in the fields to help humans. Nowadays, the number of people working in agriculture is not comparable.

However, this leads to a deeper discussion about long-term shifts in job creation. It is said that "Artificial Intelligence will create more jobs than it eliminates" [23]. It is necessary to get ready for the change in terms of education, preparing new generations for the new labor market, and in terms of economic repercussions, such as fewer companies or individuals holding most of the power and wealth. These are topics on which this work does not dwell, but they are important to keep in mind.

Additionally, this work aspires to help inexperienced users create more realistic 3D models, bringing them closer to the technology. With the increasing popularization of 3D printers, closing this gap could mean attracting people who are not initially interested in modeling but are drawn to the idea of making small changes to customize premade models, engaging more girls in STEM [24] thanks to the creative nature of 3D modeling, or allowing different kinds of designers to see their creations come true.

1.5 Outline of the Master Thesis

Chapter 2 Background and related work

For the problem at hand, the reconstruction or improvement of 3D models, the background study must cover areas related to generative models and 3D handling, including representations and similarity metrics.

The related work, on the other hand, should include solutions to different problems using the selected Machine Learning technique, Generative Adversarial Nets (GANs), as well as other approaches for the same problem. Furthermore, it needs to address how these works design their user studies in order to measure their goals as perceived by human subjects.

2.1 Background

The addressed research question intends to produce models that are as similar as possible to the input created by a user without diverging from the realistic space of 3D models. This makes it clear that the selected method should be part of the family of generative models since the goal is to create improved 3D objects.

In order to condition the generative model on the input, it is essential to choose a representation for the 3D model. This representation will also affect the way to measure the accuracy of the reconstruction.

These issues are discussed in the following sections.

2.1.1 Generative models

The goal of generative models is to learn the true underlying distribution of the data and allow sampling from the joint distribution of the observation and the class.

There are several such models, ranging from traditional methods, like Gaussian Mixture Models [32] and Hidden Markov Models [29], to Deep Learning approaches, like Generative Adversarial Networks [10].

A possible approach to the problem stated in this Thesis could be finding the closest sample to the input using k-NN on a database of realistic models. However, the goal is to create novel reconstructions. Given the complexity of the underlying distribution and the amount of data necessary to approximate it, the most common methods to achieve this are currently Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs) and Autoregressive models. Their advantages and disadvantages have already been studied by Karpathy et al. [16].

VAEs provide a Bayesian framework in which to learn complex latent variable spaces, and they scale well to large datasets. The ability to choose a prior distribution is useful when there is domain knowledge. However, their main drawback is that the sampled data is blurry. This is caused by the use of variational inference, which tends to model the mean or mode of the data.

GANs, on the other hand, are designed for generative tasks where the distribution is not as important as the perceptual result. Generated samples are sharper because the network is encouraged to imitate the training data. Although GANs could seem like the solution, they are known for their unstable training process. Their two networks are trained by competing against each other until they reach a Nash equilibrium, which can be difficult to achieve using gradient descent.

Meanwhile, Autoregressive models have a simpler and more stable training that returns more realistic results. Nevertheless, sampling from these models is highly inefficient due to sequential generation.

The selected method, GANs, has been a hot topic in the field since it was first published by Goodfellow et al. [10] in 2014. Generative Adversarial Networks are composed of two networks that learn from each other by minimizing opposing losses until convergence: a generative model that tries to learn the data distribution and a discriminative model that attempts to infer whether a sample comes from the original distribution or from the generator. The method is described in more detail in Section 3.

In its first publication, these two networks were implemented as a multilayer perceptron and trained using gradient descent, but several improvements have been published over the years, regarding architecture and learning tricks to make the training more stable, like conditioning the data generation on class labels or some part of the data, cGANs [26].

easier to reverse and let the network learn its own positional information. It also avoids fully connected layers to increase stability. The generator applies ReLU as the non-linear activation function except on the output layer, where using a bounded function like Tanh speeds up the learning. The discriminator, on the other hand, employs Leaky ReLU to avoid dead units, since the discriminator is the only source of information from which the generator learns. Batch normalization is applied in all layers except the output of the generator and the input of the discriminator, and Adam is used as the gradient descent optimization algorithm. In its conditional version, cDCGAN, removing the scale and bias from the batch normalization achieved better results.

In order to improve the learning process, researchers have developed several practices, like training the discriminator only if its accuracy on the last batch drops below 80% [39], using dropout after Leaky ReLU, or sampling from a normal distribution instead of a uniform one [19]. However, the most important progress in the field is the Wasserstein [1] training objective used in WGAN and its improved version with gradient penalty [11] in IWGAN. Both methods avoid the problem of balancing the generator and discriminator training because the Earth-Mover (EM) distance or Wasserstein-1 cost function, unlike the Jensen-Shannon (JS) divergence, allows the discriminator to be trained to optimality. Moreover, it appears to work well even without a careful design of the architecture. The improved version proposes an alternative to weight clipping in order to compute this cost function. Penalizing the norm of the gradient of the discriminator with respect to its input removes the need to tune one more hyperparameter and yields a more stable and precise training, since clipping biases the discriminator toward learning less complex distributions.

2.1.2 3D models

The same way that images can be represented in different formats depending on the problem, such as gray level, RGB (Red, Green, Blue) or HSL (Hue, Saturation, Luminance), 3D models can also be represented in several ways. Some of the most important 3D object representations are point cloud, voxel and polygon mesh [8], shown in Figure 2.1.

A point cloud is the raw data produced by a 3D laser scan. It is formed by a set of data points in a virtual three-dimensional space that represents the surface of the object being scanned. This format can be converted into any other higher-level representation. For example, it is possible to perform surface reconstruction using Delaunay triangulation, resulting in a triangle mesh.

(a) Point cloud (b) Mesh (c) Voxel

Figure 2.1: The Stanford bunny modeled in point cloud, mesh and voxel representation

advantage is that current GPUs are designed to optimize polygons.

Voxels represent a value on a three-dimensional grid. The situation of a voxel is inferred based on its position relative to other voxels, which can efficiently represent heterogeneous filled spaces, in contrast to polygons.

In Machine Learning, most approaches use voxel representation either as input or output [39, 19, 37, 40], though there is no consensus [14, 27, 43] and researchers are still trying to find an adequate representation [18].

Measuring 3D shape similarity is an unsolved problem [40] and the 3D representation directly affects the available methods.

Focusing on voxels, the most widespread technique in research seems to be Intersection over Union (IoU). It is especially common in the field of medicine, where the use of voxel representation is as long-standing as it is common [5], but it is also extensively used in other fields [33]. Other error measurements for different representations are addressed in [3, 6], to name just a few examples.
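As an illustration of the measure, a minimal Python sketch of IoU for two binary voxel grids could look as follows. The function name voxel_iou and the convention of returning 1.0 for two empty grids are assumptions made for this example, not part of the original work.

```python
import numpy as np


def voxel_iou(a, b):
    """Intersection over Union between two binary voxel grids of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("voxel grids must have the same shape")
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumed convention: two empty grids are treated as identical.
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


# Two 2x2x2 cubes in a 4x4x4 grid, shifted by one voxel along the first axis.
a = np.zeros((4, 4, 4), dtype=bool)
a[0:2, 0:2, 0:2] = True
b = np.zeros((4, 4, 4), dtype=bool)
b[1:3, 0:2, 0:2] = True
print(voxel_iou(a, b))  # 4 shared voxels out of 12 in the union
```

A value of 1 means the two grids occupy exactly the same voxels, while 0 means they do not overlap at all.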

Nevertheless, these traditional geometric metrics are not the only ones. Recently, a method has been developed that relies on the representation learned by the different filters of a Convolutional Neural Network. This method, perceptual similarity, is based on the idea that if two images or objects are the same, they produce the same filter response [7, 19].

2.2 Related work

from the wider 2D research and experience. In most cases the ultimate goal of these systems is to please the final user. Therefore, not only quantitative measures are used, but also user studies that judge the developed algorithm against previous works or ground-truth samples.

2.2.1 2D

Examples of work on problems related to the task at hand include Liu et al. [20] and Choi et al. [4].

The work of Liu et al. [20], called Auto-painter, is a good example of a cDCGAN that takes a partial input, in this case a black and white sketch, and tries to reconstruct a missing feature, color, based on what was learned from a database of color images.

Choi et al. [4], in contrast, perform image-to-image translation in their StarGAN system using only one GAN for multiple domains. The learning process is conditioned both on the input image to be modified and on the class label of the target domain.

Isola et al. [13] show how GANs can be used as a general-purpose solution for multiple problems, removing the need to engineer specific loss functions. The goal of pix2pix is to prove that it is possible to learn a mapping for two pairs of representations using the same framework.

The discriminator architecture of these implementations is based on the PatchGAN discriminator [13], with StarGAN adding an extra output layer for classification. In addition, Auto-painter and pix2pix use a U-net architecture for the generator [34].

The closest investigation in 2D to the research question under study is the paper by Zhu et al. [42], in which the authors design manipulation operations so that edits on the image give rise to realistic results.

First, they approximate the manifold using a GAN and then project the input image to find the closest latent vector using a hybrid approach of feedforward prediction and optimization. Next, they manipulate the latent vector using gradient descent, with constraints to match the user’s intent and stay close to the input in the manifold. Finally, the edit is transferred back to the original picture by applying traditional optical flow methods.

2.2.2 3D

before, several generative approaches have been used in order to solve these problems [27, 43].

Since the focus of this project is on GANs, as motivated in Section 2.1.1, the detailed study is centered on these particular solutions.

The first major work on 3D generation using GANs was proposed by Wu et al. [39]. In this paper, the advances in GANs are combined with those in volumetric convolutional networks to perform classification, generate random 3D models and reconstruct these models from a 2D image by building a Variational Autoencoder on top of the GAN. In addition, Smith and Meger [37] extend this work by implementing the advances of Gulrajani et al. [11].

However, neither of these studies addresses the problem of controlling the output or conditioning it to an input 3D object.

The closest source available is the work of Liu, Yu, and Funkhouser [19], inspired by the previously presented work of Zhu et al. [42]. Despite this fact, their goal is not to optimize edits but to project original or modified 3D objects into the learned manifold.

The method trains three neural networks to achieve its objective. The first two are the generator and discriminator networks that learn the latent space of the manifold. The generator accepts as input a latent vector representing the user model. The last network, called the projection operator, maps the input 3D model to its latent vector so that it can be used in the generator. This is the most important difference when compared with the approach proposed in this Master Thesis, where the generator is fed directly with the 3D model instead of training an additional network.

The GAN architecture used is a revised version of 3DGAN [39] that preserves stability during training. The projection operator, in contrast, is the novel contribution of the paper. It adjusts the importance of the plausibility of the generated model as well as its similarity to the original input. In their implementation, they use a feedforward network to optimize the similarity and use its output as an initial guess to find a local minimum using gradient descent on the whole objective loss. In this case, computational time is important due to the interactive nature of the tool.

2.2.3 User studies

practice and, more importantly, that a p-value cannot tell whether a hypothesis is correct, since it is the probability of the data given the hypothesis. Basic and Applied Social Psychology was the first journal to ban p-values [38] in 2015.

The work by Mantiuk, Tomaszewska, and Mantiuk [22] compares four subjective methods for quality assessment, namely single and double stimulus, forced choice and similarity judgments. Single stimulus shows a sample for a certain amount of time and asks the user to rate it, while double stimulus shows two in a sequence before requesting to grade both. The task in forced choice is to select the best sample given two or more options. The difference from similarity judgments is that it uses a relative scale instead of absolute terms. The conclusions of this work are that forced choice pairwise comparison is the most accurate method, and also the fastest when using a sorting algorithm for the adaptive approach.

Accordingly, the methods most frequently used in other papers are variants of this forced choice pairwise comparison. For measuring realism, the studies compare the output of a developed system against that of a previous work and/or a ground-truth sample in pair-wise trials. Users have to choose the more plausible example or state their lack of preference. These studies come in two versions: comparing original models with the result of an algorithm [13], and comparing algorithms with each other [15].

However, some researchers still prefer the other methods, like the single stimulus in the case of Zhu et al. [42]. Liu et al. [20] show four synthetic examples and ask users to pick the best one and the worst one according to subjective appeal. The results are ranked by a "popularity index"

Chapter 3 Method

The solution proposed in this Master Thesis depends entirely on the data used for training, the decisions made to design the GAN, including the distance functions tested during training and the way the results are evaluated.

All of these particulars are explained in the next sections, along with a description of the equipment used to implement, train and execute the project.

3.1 Data

There are three types of data that are needed to train the system. The most important is a ground-truth dataset from which to learn the natural distribution of 3D objects. The quality and variability of this dataset establish the limitations of the problem. The second type of data is the models designed by users that are to be improved at test time. Finally, the third class of models is produced by introducing noise into the ground-truth dataset so that it is possible to measure the distance between the reference model and its reconstruction.

3.1.1 Format

Following the lead of similar papers [18, 19, 37, 39, 41, 43], the selected dataset is ModelNet 10 [17]. This dataset includes 10 classes of objects (chair, sofa, bed, monitor, table, toilet, desk, dresser, bathtub and nightstand) oriented into 12 views of size 32 × 32 × 32 and split into training and test subsets.

For the purpose of this study, the experiments are performed for the class with the most samples: chairs. The chair class contains 10 657 training samples and 1 200 test samples, making a total of 11 857 models.

The voxelized version of the dataset, available in the project 3D ShapeNet [41], uses a Matlab format to store the models. Since the system works with Python, these files are preprocessed into 3D matrices in NumPy format. In a similar way, user models can be preprocessed by converting .stl files into .npy files.
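The conversion from the Matlab format to NumPy could be sketched as follows, using scipy.io.loadmat. The function name mat_to_npy and the variable name "instance" used as the default key are assumptions for illustration; the actual variable name should be checked against the files at hand.

```python
import numpy as np
from scipy.io import loadmat


def mat_to_npy(mat_path, npy_path, key="instance"):
    """Convert one voxel grid stored in a MATLAB file to a boolean .npy array.

    `key` is the name of the MATLAB variable holding the grid. The default
    "instance" is an assumption and may differ in the actual dataset files.
    """
    data = loadmat(mat_path)
    if key not in data:
        available = [k for k in data if not k.startswith("__")]
        raise KeyError(f"variable {key!r} not found; available: {available}")
    voxels = np.asarray(data[key]) > 0  # binarize the occupancy grid
    np.save(npy_path, voxels)
    return voxels


# Hypothetical usage (file names are placeholders):
# voxels = mat_to_npy("chair_example.mat", "chair_example.npy")
```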

The output of the system has the same resolution as the input and is not subjected to any condition such as connectivity requirements or thresholding on the existence or absence of voxels. However, applying these constraints brings the output matrices into the same format as the input and also cleans up and improves the visual appeal of the final results, which calls for a custom post-processing step.
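A post-processing step of this kind could be sketched as follows. This is only an illustration under stated assumptions: the threshold of 0.5, the face-sharing (6-neighbour) connectivity and the function name postprocess are choices made for this example, not values reported in the text.

```python
import numpy as np
from scipy import ndimage


def postprocess(probabilities, threshold=0.5):
    """Binarize a grid of voxel probabilities and keep the largest component.

    Both the threshold and the connectivity are assumptions of this sketch.
    """
    occupied = np.asarray(probabilities) >= threshold
    # Connectivity 1 in 3D: voxels are neighbours only if they share a face.
    structure = ndimage.generate_binary_structure(3, 1)
    labels, n_components = ndimage.label(occupied, structure=structure)
    if n_components == 0:
        return occupied  # nothing above the threshold
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # label 0 is the background
    return labels == sizes.argmax()


probs = np.zeros((5, 5, 5))
probs[0, 0, 0:3] = 0.9   # a connected run of three voxels
probs[4, 4, 4] = 0.8     # an isolated voxel, dropped by the connectivity rule
probs[2, 2, 2] = 0.3     # below the threshold
clean = postprocess(probs)
```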

3.1.2 Noise functions

The synthetic data aims to recreate user-generated models through the introduction of noise in the database models. This synthetic data is used during training so that it is possible to measure the distance with respect to a ground truth. Training with CADs modeled by real users would forfeit this possibility.

The synthetic data must be as similar as possible to the typical modeling errors introduced by modelers. Designing a set of plausible noise functions can be a challenging topic that this work does not aim to solve. Therefore, some basic distortions and crude user-like functions are combined to create this synthetic data during the training stage. Examples are shown in Figure 3.1.

(a) Reference 3D model

(b) Remove voxels (c) Add voxels (d) Dilation

(e) Remove part (f) Move part (g) Bump

• Unstructured noise

– Remove voxels: Each voxel has a 50% probability of being removed. It creates clouds of voxels with the shape of the original object that can make the training more robust. Note that this kind of noise is never used for the similar and balanced systems (Section 3.4), since users would rarely produce this sort of model.

– Add voxels: Each void space has a 2% probability of becoming a voxel. After modifying the model, connectivity is enforced and only the largest connected component is returned.

– Dilation: Basic mathematical morphology operation using the minimum 3-dimensional structure with connectivity = 1.

• Structured noise

– Remove part: Removes a box with a half side length of 3 around a center selected at random among all possible voxels.

– Move part: Selects a random center and axis among all possible voxels and shifts the whole structure 2 units toward the closest boundary, dragging the structure so that the model remains connected.

– Create a bump: Copies the structure from a randomly selected center among all possible voxels toward the closest boundary in each direction, within a radius of 10 units.

Some of the combined functions are the addition of voxels with dilation, bump creation with part movement, and dilation with voxel removal. Appendix A contains the complete list of noise functions with illustrative figures.
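Four of the noise functions listed above could be sketched with NumPy and SciPy as follows. This is a minimal illustration, not the implementation used in this work: the function names are hypothetical, inputs are assumed to be boolean arrays, and "all possible voxels" is interpreted here as the occupied voxels of the model.

```python
import numpy as np
from scipy import ndimage

# Minimum 3D structuring element: face neighbours only (connectivity = 1).
STRUCTURE = ndimage.generate_binary_structure(3, 1)


def largest_component(voxels):
    """Keep only the largest connected component of a boolean grid."""
    labels, n = ndimage.label(voxels, structure=STRUCTURE)
    if n == 0:
        return voxels
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # ignore the background label
    return labels == sizes.argmax()


def remove_voxels(voxels, rng, p=0.5):
    """Drop each occupied voxel independently with probability p."""
    keep = rng.random(voxels.shape) >= p
    return voxels & keep


def add_voxels(voxels, rng, p=0.02):
    """Fill each empty cell with probability p, then enforce connectivity."""
    extra = (rng.random(voxels.shape) < p) & ~voxels
    return largest_component(voxels | extra)


def dilate(voxels):
    """Morphological dilation with the minimum 3D structure."""
    return ndimage.binary_dilation(voxels, structure=STRUCTURE)


def remove_part(voxels, rng, half_side=3):
    """Clear a box of the given half side length around a random occupied voxel."""
    occupied = np.argwhere(voxels)
    if len(occupied) == 0:
        return voxels.copy()
    centre = occupied[rng.integers(len(occupied))]
    lo = np.maximum(centre - half_side, 0)
    hi = centre + half_side + 1
    out = voxels.copy()
    out[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = False
    return out


# Combined noise on a placeholder shape: addition of voxels followed by dilation.
rng = np.random.default_rng(0)
model = np.zeros((32, 32, 32), dtype=bool)
model[8:24, 8:24, 8:24] = True
noisy = dilate(add_voxels(model, rng))
```

Keeping the random generator as an explicit argument makes each distortion reproducible, which helps when comparing reconstructions of the same noisy input.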

3.2 GANs

Generative Adversarial Networks are composed of two architectures, a Discriminator (D) and a Generator (G), that compete against each other as described by Goodfellow et al. [10] and depicted in Figure 3.2.

In this Master Thesis, the generator tries to produce 3D models that look like the samples in the dataset, minimizing Equation 3.5c, while the task of the discriminator is to classify the inputs as real (x) or generated (r = G(x)), minimizing Equation 3.5b. The system is trained when the losses of the two networks ($L_{gen}$ and $L_{disc}$) reach an equilibrium point in which, optimally, the generator has

Figure 3.2: GAN framework described by Goodfellow et al. [10]. Figure credit: Skymind [36]

3.2.1 Network architectures

The system is built using Convolutional Neural Networks (CNNs). This architecture takes advantage of the spatial information in the input, which is why it is so widespread in Computer Vision. Unlike networks fed with images, networks fed with 3D models need 3D convolutions to take depth information into account.

The framework includes a conditional generative model that takes into account the input model created by the user. The generator (Figure 3.3) is designed to use a simple encoder-decoder architecture [4] in which the user’s (conditional) model is encoded into a low dimensional representation that keeps the important information of the input. The decoder component learns how to reconstruct the original voxel representation from the low dimensional representation following the restrictions imposed by the loss function.

Figure 3.4: Discriminator module based on the PatchGAN architecture [13]

The discriminator architecture, depicted in Figure 3.4, is based on the PatchGAN described by Isola et al. [13], which classifies parts of the input as real or generated instead of the whole sample. The resulting classification is the average over all the patches. This design penalizes implausible fine-grained patches of size 2 × 2 × 2, improving the detail of the reconstruction and mitigating a known problem of generative models: realistic but generic outputs with scarce detail.

The architectural components of the networks are not very different from those originally described by Radford, Metz, and Chintala [30]: the CNN architecture is all convolutional, with no fully-connected or pooling layers.

Batch Normalization is replaced in this case by Instance Normalization layers, following Choi et al. [4]. They differ in that Batch Normalization normalizes activations using the statistics of the whole batch, whereas Instance Normalization normalizes each sample independently, which has proved useful in stylization tasks. Normalization is applied in the generator, except for the output layer, and so is the ReLU activation. Leaky ReLU is applied in the discriminator to avoid dead units and improve the gradient that the generator uses to learn, as previously stated in Section 2.1.1.
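The difference between the two normalization schemes can be illustrated with a minimal NumPy sketch. It assumes feature maps of shape (batch, channels, depth, height, width) and omits the learned scale and shift parameters; the function names are chosen for this example only.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize each channel with statistics pooled over batch and space."""
    mean = x.mean(axis=(0, 2, 3, 4), keepdims=True)
    var = x.var(axis=(0, 2, 3, 4), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def instance_norm(x, eps=1e-5):
    """Normalize each channel of each sample with its own spatial statistics."""
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


# Four samples whose activations sit at very different levels.
rng = np.random.default_rng(0)
offsets = 10.0 * np.arange(4).reshape(4, 1, 1, 1, 1)
x = rng.normal(size=(4, 2, 8, 8, 8)) + offsets

per_sample_bn = batch_norm(x).mean(axis=(1, 2, 3, 4))
per_sample_in = instance_norm(x).mean(axis=(1, 2, 3, 4))
print(per_sample_bn)  # still differs between samples
print(per_sample_in)  # close to zero for every sample
```

With batch statistics, the offset between samples survives normalization, whereas instance statistics remove it for every sample separately.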

The Adam optimizer is used in training with parameters set to η = 0.0002 and β1 = 0.5, and dropout is used both in training and at test time to introduce some randomness into the generator [13].

For the specific problem at hand, the best suited activation function for the output layer of the generator is a sigmoid, whose output is bounded to the interval (0, 1). The bounded nature of the function helps accelerate the learning process, and its output can be read as the probability that a voxel exists, transforming the problem into a classification where the positive class signifies voxel existence.
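A minimal sketch of this reading of the output layer follows. The function names and the toy pre-activation values are hypothetical; the thresholds are the ones explored later in the experiments.

```python
import math


def sigmoid(z):
    """Map a real-valued pre-activation to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def to_voxels(logits, threshold=0.5):
    """Mark a voxel as existing when its probability reaches the threshold."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


logits = [-3.0, -0.5, 0.0, 2.0]          # hypothetical generator pre-activations
probs = [sigmoid(z) for z in logits]
occupied_at_05 = to_voxels(logits, threshold=0.5)
occupied_at_03 = to_voxels(logits, threshold=0.3)
```

Lowering the threshold from 0.5 to 0.3 turns the second voxel (probability of about 0.38) into an existing one, which is why the threshold affects how complete a reconstruction looks.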


3.2.2 Objective function

The improved Wasserstein loss function is used instead of the traditional GAN loss in order to make the training more stable [11]. This loss is combined with a distance loss function (see Section 3.3) that helps to guide the system into learning the true distribution of the dataset while staying close to the user’s input.

The loss functions that constitute the objective function are explained next:

• Improved conditional GAN loss: The adversarial loss encourages 3D models to look as close as possible to the true distribution, balancing the loss of the discriminator and the generator until they reach Nash equilibrium. The improved version includes a gradient penalty that enforces unit gradient norm along straight lines between the dataset distribution and the generator distribution.

$$L_{adv} = \mathbb{E}_{x}[D(x)] - \mathbb{E}_{c}[D(G(c))] - \lambda_{gp}\,\mathbb{E}_{\hat{x}}\left[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\right] \tag{3.1}$$

where x comes from the dataset distribution, c represents the conditional input modeled by the user and x̂ is sampled uniformly along the straight line between the two distributions.

• Reconstruction loss: Content loss that imposes the generated model to look close to the sample in the dataset.

$$L_{rec} = \mathbb{E}_{x,c}[\mathrm{dist}(x, G(c))] \tag{3.2}$$

• Similarity loss: Content loss that describes the error between the generated model and the user input.

$$L_{sim} = \mathbb{E}_{c}[\mathrm{dist}(c, G(c))] \tag{3.3}$$
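The gradient penalty term of the adversarial loss can be illustrated with a toy critic. The sketch below assumes a linear critic D(x) = w · x, chosen only because its gradient is known in closed form (it equals w everywhere); all names are hypothetical. In actual training the gradient would be obtained by automatic differentiation.

```python
import math
import random


def interpolate(x, g, eps):
    """x_hat = eps * x + (1 - eps) * g, a point on the segment between x and g."""
    return [eps * xi + (1.0 - eps) * gi for xi, gi in zip(x, g)]


def linear_critic_gradient(w, x_hat):
    """Gradient of D(x) = w . x with respect to x; constant and equal to w."""
    return list(w)


def gradient_penalty(w, x, g, lambda_gp=10.0, rng=random):
    """lambda_gp * (||grad D(x_hat)||_2 - 1)^2 for one real/generated pair."""
    x_hat = interpolate(x, g, rng.random())   # eps sampled uniformly in [0, 1)
    grad = linear_critic_gradient(w, x_hat)
    norm = math.sqrt(sum(gi * gi for gi in grad))
    return lambda_gp * (norm - 1.0) ** 2


real, generated = [1.0, 0.0], [0.0, 1.0]
penalty_unit = gradient_penalty([0.6, 0.8], real, generated)    # ||w|| = 1
penalty_steep = gradient_penalty([3.0, 4.0], real, generated)   # ||w|| = 5
```

A critic with unit gradient norm is not penalized, while a steeper critic pays a cost that grows quadratically with the deviation from 1.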

The combination of these three losses forms the objective loss, i.e., the complete loss function to minimize. The weights λrec and λsim combine the previously described partial losses, giving different relevance to following the dataset distribution, reconstructing the synthetic model or staying close to the input model, respectively.

$$L = L_{adv} + \lambda_{rec} L_{rec} + \lambda_{sim} L_{sim} \tag{3.4}$$

Since the training of the two networks is performed separately, this alternative notation is more useful:

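$$L = L_{disc} + L_{gen} \tag{3.5a}$$

$$L_{disc} = -\mathbb{E}_{x}[D(x)] + \mathbb{E}_{c}[D(G(c))] + \lambda_{gp}\,\mathbb{E}_{\hat{x}}\left[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\right] \tag{3.5b}$$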

λgp is set to 10 following the guidelines of Gulrajani et al. [11], while λrec and λsim take different values depending on which of the three components of the recommendation system is being trained (for more details see Section 3.4).

3.3 Distance functions

A distance function, dist, is used in the loss functions described in Equations 3.2 and 3.3. One measures the difference with respect to the sample from the dataset being reconstructed, while the other measures it with respect to the synthetic version (see Section 3.1.2) of the same sample that simulates the user input.

Since the presence/absence of voxels in the array is unbalanced, it is inadvisable to use Intersection over Union (IoU). With this kind of metric and a naïve algorithm that selects all positions as void, it is possible to reach an accuracy as high as the unbalance between classes.
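The effect of the class imbalance on a naïve predictor can be seen on a toy grid. The occupancy ratio below (5 occupied voxels out of 100) is a hypothetical value picked for illustration, not a statistic of the dataset.

```python
def accuracy(pred, truth):
    """Fraction of voxels whose predicted state matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Hypothetical flattened grid: 5 occupied voxels out of 100.
truth = [1] * 5 + [0] * 95
all_void = [0] * 100                 # naive prediction: every position is empty
naive_accuracy = accuracy(all_void, truth)
```

The empty prediction scores as high as the share of empty voxels, even though it reconstructs nothing.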

Three different measures used in the literature are tested on the problem under study: optimized IoU (sometimes called soft IoU) [31], optimized Dice (sometimes called soft Dice) [25] and balanced binary cross-entropy, based on re-weighting, similar to the loss used by Salehi, Erdogmus, and Gholipour [35].

In the following descriptions y represents the ground-truth model in the comparisons. In reality, the ground-truth can be either a dataset sample x or a synthetic sample c depending on the loss function being computed. To simplify the notation, let us call the reconstructed model r = G(c). V is the set of all possible voxels in the defined space.

• Soft IoU: measures the similarity between the predicted model and the ground-truth. To avoid setting a threshold that converts the probabilities produced by the sigmoid layer into existent voxels, soft IoU approximates the IoU score according to:

$$\frac{|r \cap y|}{|r \cup y|} = \frac{\sum_{v \in V} r_v\, y_v}{\sum_{v \in V} (r_v + y_v - r_v\, y_v)}$$

IoU, also known as Jaccard index or similarity coefficient, returns 1 when the models are identical. In order to be used as a loss function, it is necessary to convert it to a distance by subtracting the score from 1 as follows:

$$\mathrm{dist}_{IoU} = 1 - \frac{\sum_{v \in V} r_v\, y_v}{\sum_{v \in V} (r_v + y_v - r_v\, y_v)} \tag{3.6}$$

• Soft Dice: averages precision and recall, thus weighting false positives and false negatives equally. Like IoU, the soft version uses the probabilities produced by the network:

$$\frac{2\,|r \cap y|}{|r| + |y|} = \frac{2 \sum_{v \in V} r_v\, y_v}{\sum_{v \in V} r_v + \sum_{v \in V} y_v}$$

$$\mathrm{dist}_{Dice} = 1 - \frac{2 \sum_{v \in V} r_v\, y_v}{\sum_{v \in V} r_v + \sum_{v \in V} y_v} \tag{3.7}$$

• Balanced binary cross-entropy: in its standard version, it is one of the most used losses for classification. The modification included for this work weights voxel existence inversely proportional to the probability of the class.

$$\mathrm{dist}_{BCE} = -W_v\left[y_v \log r_v + (1 - y_v) \log(1 - r_v)\right] \tag{3.8}$$
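The three distances above can be sketched on flattened voxel grids as follows. This is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the ground truth y is assumed to be binary and to contain both occupied and empty voxels, and the class weights W_v are assumed to be the inverse class frequencies measured on y, with the result averaged over the voxels.

```python
import math


def dist_iou(r, y):
    """Equation 3.6 on flattened grids of probabilities r and labels y."""
    inter = sum(rv * yv for rv, yv in zip(r, y))
    union = sum(rv + yv - rv * yv for rv, yv in zip(r, y))
    return 1.0 - inter / union


def dist_dice(r, y):
    """Equation 3.7 on flattened grids of probabilities r and labels y."""
    inter = sum(rv * yv for rv, yv in zip(r, y))
    return 1.0 - 2.0 * inter / (sum(r) + sum(y))


def dist_bce(r, y, eps=1e-7):
    """Equation 3.8 with assumed weights W_v = 1 / frequency of the class of y_v."""
    occupied = sum(y) / len(y)
    w_pos, w_neg = 1.0 / occupied, 1.0 / (1.0 - occupied)
    total = 0.0
    for rv, yv in zip(r, y):
        rv = min(max(rv, eps), 1.0 - eps)       # keep log() finite
        w = w_pos if yv == 1 else w_neg
        total -= w * (yv * math.log(rv) + (1 - yv) * math.log(1.0 - rv))
    return total / len(y)


y = [1, 0, 0, 0]                 # hypothetical ground truth, one occupied voxel
r = [0.5, 0.5, 0.0, 0.0]         # hypothetical generator probabilities
d_iou, d_dice, d_bce = dist_iou(r, y), dist_dice(r, y), dist_bce(r, y)
```

All three distances are close to 0 for a perfect reconstruction and grow as the predicted probabilities move away from the ground truth, but they penalize the same error differently, which is what the experiments in Section 4.1 compare.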

3.4 Recommendation system

It would be convenient for the user to be able to decide how close the generated model should be to the true distribution or the input model. Since the final objective function (see Equation 3.4) includes parameters that weight each of the losses that minimize these distances, and these weights are fixed during training, it is not possible to modify the behavior at test time. However, it is possible to discretize the range of possibilities by training different networks and building a recommendation system that proposes the models generated by each set of parameters.

For the purpose of testing the design assumptions and the user’s preferences, three systems are trained to deliver the two available extreme results and a balance between them.

• Realistic system: This system reconstructs the input modeled by the user according to the distribution learned by the generator. It is possible to think about it as reconstructing the closest model in the learned manifold.

For the experiments in this work the parameters are set to λrec = 5 and λsim = 0.

• Balanced system: The parameters that weight the loss functions are counterbalanced so that the generated model should depict a homogeneous mix between the input and the true distribution, even if the result is not completely natural.

For the experiments in this work the parameters are set to λrec = 2.5 and λsim = 2.5.

• Similar system: Still takes into account the distribution of the dataset, since the final goal of a potential user would be to improve the input. Nevertheless, the emphasis is on the similarity reconstruction, in the hope of learning up to what point the verisimilitude of the model is preserved and how users react to it.

For the experiments in this work the parameters are set to λrec = 0 and


3.5 Evaluation

The success of the reconstruction is measured both quantitatively and qualitatively.

A sound empirical approach is to define an objective goodness metric and compare results of different algorithms among themselves or against a baseline. However, in this case, getting a good numerical performance is not adequate since the goal is to create 3D models that are visually appealing to humans.

It is therefore necessary to design user studies to measure the model quality, as well as the human predilection for diverse reconstructions. The first desired measurement is the level of realism of the reconstruction, i.e. how similar it is to a dataset sample. The objective of the user study is to compare the empirical perception of the users to the mathematical definition of this difference, the Intersection over Union of the voxels in the two models being compared.

The second experiment measures similarity, or how closely the generated model resembles the user input. The same empirical-theoretical contrast is performed.

3.5.1 Quantitative: Distance measure

The approach selected for this work is slightly different from those in the literature. Due to the scarcity of previous work in the area and the unavailability of code and time to reproduce experiments, it is impossible to compare the proposed algorithm to others in the literature. Nonetheless, the learning algorithm is designed so that it is possible to compare a reconstruction against the original sample in the database and the input model (synthetic data).

The metric selected to compute the differences between models is Intersection over Union. This statistic resembles how humans perceive similarity, that is, whether a voxel that exists in the reference model also exists in the reconstruction, and likewise for non-existent voxels. This metric requires that the values of the matrix representing the 3D model are either 0 or 1. For that purpose, the output of the generator is thresholded and connectivity is ensured in order to produce more plausible results.
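The thresholding and scoring steps can be sketched as below. The standard set-based definition of IoU over occupied voxels is assumed, the function names and toy values are hypothetical, and the connectivity step is omitted.

```python
def binarize(probs, threshold):
    """Turn voxel probabilities into 0/1 occupancy values."""
    return [1 if p >= threshold else 0 for p in probs]


def iou(a, b):
    """Intersection over Union of two binary, flattened voxel grids."""
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union if union else 1.0   # two empty grids count as identical


reference = [1, 1, 0, 0]                  # hypothetical reference model
probs = [0.9, 0.4, 0.2, 0.05]             # hypothetical generator output
scores = {t: iou(binarize(probs, t), reference) for t in (0.5, 0.3, 0.1)}
```

On this toy example the score is highest at threshold 0.3: a threshold of 0.5 drops a voxel that exists in the reference, while 0.1 adds one that does not.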


it is the only way to measure quantitative plausibility.

3.5.2 Qualitative: User study

The goal of the study is to empirically measure the success of the system focusing on the results using each of the three versions of the generator trained to minimize different objective functions. These results are assessed in terms of plausibility and similarity to the model entered by a user, but also users’ preference for practical reasons.

The subjects are 24 3D modelers or users that have close experience with 3D models. The only restrictions are that the subjects should be over 18 and legally-sighted. The study is conducted on the KTH premises using the author’s computer. Each experiment is designed to take 5 minutes so that the whole study takes no more than 30 minutes per participant. This design is meant to avoid fatigue [22].

Subjects are recruited from Facebook groups, mainly related with KTH, where the probability of finding people meeting the study requirement is higher. Due to the lack of funding for this work, no compensation is offered to participants.

In a first stage, the subject is informed about the objective and structure of the study. If the subject agrees to sign the consent form and proceed with the study, a test is performed showing how the visualization works in order to ease the subsequent interaction. Examples of the User Interface can be found in Appendices C.1, C.2 and C.3.

The realism test presents two models side by side. These models are randomly sampled from the following distributions: the dataset, or the models generated by the realistic, balanced or similar systems. The task of the user is to decide which one looks more realistic. The comparison groups are realistic (A) vs balanced (B), realistic (A) vs similar (C), realistic (A) vs dataset (D), balanced (B) vs similar (C), balanced (B) vs dataset (D), similar (C) vs dataset (D) following the probabilities in Table 3.1. The experiment is designed so that there is a greater number of comparisons between more similar conditions, namely (A), (B) and (C). In particular, the estimated weights produce approximately twice the number of comparisons between similar conditions than between the dataset and the different reconstructions. The permutations represent the position in which the models are presented. The pilot study showed that the selection pace is highly dependent on the subject, which is why it was finally decided to set a fixed amount of time per experiment (5 minutes) instead of a fixed number of comparisons. This reduces the possibility of concentration loss for those participants for whom the study would take longer.


to the reference. The reference models are selected from the noisy version of the dataset so that the differences with a natural-looking model are more apparent. The testing groups are realistic (A) vs balanced (B), realistic (A) vs similar (C) and balanced (B) vs similar (C). In this case, the estimated weights produce roughly the same number of comparisons between conditions as shown in Table 3.2.

        A        B        C        D
A       -        11.11    11.11    5.55
B       11.11    -        11.11    5.55
C       11.11    11.11    -        5.55
D       5.55     5.55     5.55     -

Table 3.1: Weighted probabilities for each comparison in the realism experiment.
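Drawing comparisons with the weights of Table 3.1 can be sketched as follows. The weights are copied from the table; the function names and the use of Python's `random.choices` are assumptions made for illustration, not a description of the software used in the study.

```python
import random

CONDITIONS = ["A", "B", "C", "D"]    # realistic, balanced, similar, dataset


def pair_weight(left, right):
    """Weight from Table 3.1: pairs that involve the dataset (D) are rarer."""
    return 5.55 if "D" in (left, right) else 11.11


# Ordered pairs, since the position in which each model is shown also matters.
ordered_pairs = [(l, r) for l in CONDITIONS for r in CONDITIONS if l != r]
weights = [pair_weight(l, r) for l, r in ordered_pairs]


def sample_comparison(rng=random):
    """Draw one (left, right) comparison according to the table weights."""
    return rng.choices(ordered_pairs, weights=weights, k=1)[0]


comparison = sample_comparison(random.Random(0))
```

Summing both presentation orders, a pair of reconstructions such as A vs B has weight 22.22 and a pair such as A vs D has weight 11.10, which gives the factor of roughly two described in the text.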

        A        B        C
A       -        16.66    16.66
B       16.66    -        16.66
C       16.66    16.66    -

Table 3.2: Uniform probabilities for each comparison in the similarity experiment.

The last experiment measures the recommendation preference of the participants by showing the three possible reconstructions along with the unfinished model being reconstructed. These models are submitted by the same subjects and processed before the study. Due to the difficulty of finding subjects willing to participate in the study with experience in 3D modeling and time constraints, the number of user models amounts to 22 examples. In this manner, subjects are presented with triplets of reconstructions in random order until the time is over or they run out of examples.

Finally, some questions are asked (Appendix C.4) in order to better understand the answers of the subjects and the potential adoption of the system as a modeling tool.

3.6 Hardware description

The implementation¹ of the system uses PyTorch as the main library for building and training the networks. The system runs in a high-performance node of the PDC supercomputing system [28], using an NVIDIA Tesla K80 GPU.

¹ The complete code developed for this work will be publicly available in a GitHub repository


Chapter 4 Experiments and results

This work is mainly a research project to find the benefits and limitations of using conditional Generative Adversarial Networks to build a recommendation system that reconstructs 3D models. Due to the novel approach of the architecture and the loss function, it is necessary to perform experiments that clarify how to obtain the best performance in each of the systems that compose the recommender.

In the following sections, the reader can find experiments on the base model, the realistic system, regarding several distance functions, generalization to different types of noise and diverse strengths in the discriminator.

Once these elements are optimized, various learning approaches are used to train each of the systems separately. The best ones are selected to assemble the final recommendation system that is evaluated in the user study.

Section 4.5 describes the limitations of the method when the input models are too different from the training data.

4.1 Distance functions

It is important to find a distance function (dist) suitable for the problem in order to guide the system into reconstructions that resemble the dataset (Equation 3.2) or the models introduced by the user (Equation 3.3). Theoretically, the three functions described in Section 3.3 are appropriate. Still, the functions are tested over a small number of epochs to compare empirical results.

The realistic system is taken as the baseline to test different parameters. The distance with the best results is used for the following experiments.

Initially, several values for λrec in Equation 3.4/3.5c are searched for all distance functions (distIoU, distDice, distBCE), modifying the voxel existence threshold during visualization to perceive the changes in confidence. The final values for the different systems are reported in Section 3.4.

Results during training (Figure 4.1) show how the distance function affects the learning while test results (Figures 4.2 and 4.3) prove generalization to unseen examples.

Figure 4.1: Training results (columns: Original, Epoch 5, Epoch 10, Epoch 20, Epoch 40; rows: Soft IoU, Soft Dice, Balanced BCE). The first row shows the learning results over the epochs 5, 10, 20 and 40 with the voxel existence threshold set to 0.5 using soft IoU as the distance function. The second row shows the same results for soft Dice with threshold 0.5, while the last row shows the results for balanced cross-entropy with threshold 0.3.

Figure 4.1 displays the reconstructions during training for the three different distances. The probability assigned to the existence of each voxel by the network is guided by the distance function, which makes the whole algorithm learn differently.

The soft Dice differs from the other two distances in that it creates a general shape and epoch after epoch it carves out the model, though it also corrects and extends parts that are not completely developed.


BCE as the distance function are thresholded at a probability of 0.3 while soft IoU uses 0.5. However, since this threshold is just a postprocessing tool and does not alter training, the learning speed may be a more important factor. Regarding this matter, it is possible to see that the development of legs is more advanced using balanced BCE (epoch 20) and the results more distinct (epoch 40) than its counterpart using soft IoU.

Despite the usefulness of appreciating the way the algorithm is learning, the real value is in the reconstruction of the unseen samples in the test set (Figure 4.2).

Figure 4.2: Test results after 5 epochs (columns: Original, Threshold 0.5, Threshold 0.3, Threshold 0.1). The first row shows the test results after 5 epochs for voxel existence threshold set to 0.5, 0.3 and 0.1 using soft IoU as the distance function. The second row shows the same results for soft Dice, while the last row shows the results for balanced cross-entropy.

It is interesting to notice that there is no clear correlation between the IoU metric and the visual similarity of the models being compared. In Table 4.1 the best average value for the test set is the one achieved by the balanced BCE distance using threshold 0.5. This may actually result in empty models for some samples (see the result in Figure 4.2 for the mentioned parameters).

Test after 5 epochs: IoU

Threshold       0.5         0.3         0.1
Soft IoU        0.955083    0.939283    0.805049
Soft Dice       0.936344    0.875890    0.166469
Balanced BCE    0.965469    0.963394    0.800237

Table 4.1: IoU metric measuring the average similarity between the test dataset and its reconstructions after 5 epochs using different distances and thresholds.

and the disassociation between the visual perception and the similarity metric used. Additionally, this difference proves the importance of performing studies with users.

It is obvious, but worth pointing out, that the voxel confidence decreases in test, which implies that in order to obtain results as good as those in Figure 4.1 the threshold needs to be lowered.

Once more, it is possible to appreciate that the best value in Table 4.2, which corresponds to the soft Dice distance and the threshold 0.5, represents a chair without rear legs and an irregular hole in the rest. Visually, the best model could be identified as soft IoU with threshold 0.1 if a larger emphasis is set on the body or balanced BCE with threshold 0.1 if the importance is shifted to the legs.

Given the difficulty that the generators seem to have with reconstructing legs, the distance used for the subsequent experiments is selected as the balanced BCE. All the same, it would be interesting to test further experiments with the other distances. Due to lack of time, this is left for future work.

Test after 40 epochs: IoU

Threshold       0.5         0.3         0.1
Soft IoU        0.985563    0.985095    0.983689
Soft Dice       0.985879    0.985829    0.985217
Balanced BCE    0.980784    0.984049    0.977174

Figure 4.3: The first row shows the test results after 40 epochs for the threshold set to 0.5, 0.3 and 0.1 using soft IoU as the distance. The second row shows the same results for soft Dice, while the last row shows the results for balanced BCE.

4.2 Noise generalization

A perfectly understandable question to ask is if the system is learning just to reconstruct certain types of noise, those with which it is trained.

If the result of applying noise functions were close to typical user errors this behavior would not be disturbing. However, since the noise functions described in Section 3.1.2 are basic transformations, not necessarily faithful to user mistakes, it is interesting to check if a system trained with certain noise functions generalizes well in test with another set of noise functions.

For this experiment three systems are trained using the balanced BCE distance: one with unstructured noise, another with structured noise and the last one without noise, that is, the model introduced is the same one that the generator is supposed to reconstruct. These systems are tested on all of the types of noise for several thresholds. The results are displayed in Table 4.3.

                                IoU - Threshold
Trained on      Tested on       0.5         0.3         0.1
Unstructured    Unstructured    0.980784    0.984049    0.977174
                Structured      0.979197    0.982159    0.976545
                -               0.980968    0.984446    0.978921
                Average         0.980316    0.983551    0.977547
Structured      Unstructured    0.977274    0.978339    0.971616
                Structured      0.980821    0.985431    0.982204
                -               0.981352    0.986181    0.983033
                Average         0.979816    0.983317    0.978951
-               Unstructured    0.977131    0.978605    0.971630
                Structured      0.979386    0.984444    0.981136
                -               0.981887    0.987377    0.983586
                Average         0.979468    0.983475    0.978784

Table 4.3: Average IoU metric over the test set for systems trained and tested for all possible combinations of unstructured noise, structured noise and no noise. Results are reported for thresholds 0.5, 0.3 and 0.1.

same noise. Regardless of the type of noise used during training, the best reconstructions are achieved when testing without noise and using a 0.3 threshold. In fact, this particular threshold accomplishes better results than the others in all cases.

On the one hand, this outcome makes sense because the generator does not need to create new information (add or remove voxels), but on the other hand, it is trained to do so in the cases where unstructured and structured noise is used during training. Indeed, this is visible in the IoU values noted in Table 4.3, where the highest value corresponds to the model trained without noise, followed by structured noise and finally unstructured.

The reason why the unstructured noise is more difficult to reconstruct is precisely because it is more different from the target model.

4.3 Discriminator strength

When using GANs there is an additional parameter to tune: the balance between the training of the generator and the discriminator. It can also be described, more simply, as the number of update steps of the discriminator for every step of the generator. If the discriminator is too powerful, there is no gradient left for the generator to follow. On the other hand, if the discriminator is weak, the generator can learn meaningless features or exploit a particular weakness, producing similar outputs in what is known as mode collapse [9].

Using the Wasserstein GAN loss [1] allows more freedom in balancing the two networks without destabilizing the learning process. That is why it is possible to change this parameter and show the results in the following analysis.

For this experiment, the results of two different parameters balancing the generator and the discriminator are shown. Both systems are trained with all the possible noise functions in Section 3.1.2 and the hyperparameters described in Section 3.2.1. Their distinction is that the weaker discriminator version trains the generator once per 5 updates of the discriminator, while the second one updates the discriminator 10 times for every update in the generator.

To make the results comparable, the stronger discriminator system is trained for twice the epochs so that the number of updates in the generator is the same in both systems. In particular, the weaker version is trained for 100 epochs and the stronger for 200.
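The update budget can be checked with simple arithmetic. The sketch assumes that the discriminator is updated on every batch and the generator on every n-th batch; the number of batches per epoch is a hypothetical placeholder, since it cancels out in the comparison.

```python
def generator_updates(epochs, batches_per_epoch, disc_steps_per_gen_step):
    """Generator updates when it is trained once every n discriminator updates."""
    return epochs * batches_per_epoch // disc_steps_per_gen_step


BATCHES_PER_EPOCH = 50   # hypothetical value; any value gives the same ratio

weaker = generator_updates(100, BATCHES_PER_EPOCH, 5)     # 5 disc. steps per gen. step
stronger = generator_updates(200, BATCHES_PER_EPOCH, 10)  # 10 disc. steps per gen. step
```

Doubling both the number of epochs and the number of discriminator steps per generator step leaves the number of generator updates unchanged, so differences between the two systems can be attributed to the discriminator strength.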

Test after 100/200 epochs: IoU

Threshold         0.5         0.3         0.1
Weaker disc.      0.984643    0.984398    0.983384
Stronger disc.    0.981642    0.981492    0.980914

Table 4.4: Average IoU in test similarity with respect to dataset for different systems and thresholds.

Disc. accuracy

Samples           Dataset     Generated
Weaker disc.      1.0         0.035

Figure 4.4: Test results after 100 epochs for the weaker discriminator (first row) and 200 for the stronger (second row). The voxel existence threshold is set to 0.5, 0.3 and 0.1 and balanced cross-entropy is used as the distance function.

Disc. accuracy

Samples           Dataset     Generated
Weaker disc.      1.0         0.039
Stronger disc.    0.77583     0.0

Table 4.6: Discriminator accuracy for database and reconstruction samples in test with threshold=0.1.

The reconstructions in Figure 4.4 depict a higher probability of voxel existence in the legs for the system with the stronger discriminator. Even so, the produced models are less defined, which makes the reconstructed model of the weaker discriminator using threshold 0.1 look more natural. The accuracy in the discriminator (Table 4.6) does not change as could be expected, however.

4.4 Recommendation system

The unstructured noise, particularly the function that randomly removes voxels, was initially included in an attempt to make the system more robust. Nevertheless, since both the balanced and the similar systems keep some of the features of the input model and given that this kind of noise is highly unlikely for a real user, it was omitted when training the final frameworks.


As appreciated in the previous experiments, the IoU metric does not always correlate with what humans regard as similarity. This poses a problem at the time of deciding which three systems should be selected as the best ones.

For each of the systems, several learning approaches are tested and inspected in order to decide on a final version. These approaches include different numbers of iterations and strengths in the discriminator, decaying strategies, and best model selection with early stopping using a 10% validation set extracted from the training set.

For the early stopping approach, the similarity metric on the validation set is computed every epoch. The model is saved only if the target IoU improves (against dataset samples for the realistic system, user input for the similar one and both for the balanced). If the metric has not improved after 10 epochs, the execution halts.
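The stopping rule can be sketched over a list of per-epoch validation scores. The function name and the toy scores are hypothetical, and the patience of 10 epochs is interpreted as 10 consecutive epochs without improvement over the best score so far.

```python
def early_stopping(scores, patience=10):
    """Return (best_epoch, last_epoch_run) for per-epoch validation IoU scores."""
    best_score = float("-inf")
    best_epoch = None
    for epoch, score in enumerate(scores):
        if score > best_score:
            # Improvement in the target IoU: this is where the model is saved.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch          # no improvement for `patience` epochs
    return best_epoch, len(scores) - 1


# Hypothetical validation IoU: improves until epoch 1, then stays flat.
history = [0.90, 0.95] + [0.94] * 20
best_epoch, stop_epoch = early_stopping(history)
```

Only the checkpoint from the best epoch is kept, so the returned model is the one with the highest validation score rather than the last one trained.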

The decaying strategy is based on linear decay from the smallest number of epochs with good results to twice that. For example, the realistic system achieves good results with a minimum number of 100 epochs. The learning rate decays linearly to zero over the next 100 epochs.
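The schedule for the realistic system example can be written as a small function. It assumes that the learning rate is kept constant at η = 0.0002 for the first 100 epochs before decaying; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.0002, constant_epochs=100, decay_epochs=100):
    """Constant learning rate, then linear decay to zero over `decay_epochs`."""
    if epoch < constant_epochs:
        return base_lr
    progress = (epoch - constant_epochs) / decay_epochs
    return base_lr * max(0.0, 1.0 - progress)


schedule = [learning_rate(e) for e in (0, 99, 100, 150, 200)]
```

Halfway through the decay phase (epoch 150) the learning rate is half of its initial value, and it reaches zero at epoch 200, twice the initial number of epochs.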

Some of the best results are shown in the following sections as a comparison to motivate the selection of the final systems. A comparison of the best system achieved for every set of λ parameters is presented in Section 4.4.4.

4.4.1 Realistic system

All the previous experiments are completed using the realistic system, which gives the reader a deeper understanding of it.

The results displayed below show the final research performed to fine-tune the selection of this specific system. The other frameworks are adjusted independently because another parameter is considered, the similarity to the input, which may render the learning process completely different.

After appraising the results of the different learning approaches, and particularly comparing the reconstructions with the results in the following Sections 4.4.2 and 4.4.3, the realistic system is also trained without the random removal of voxels noise, resulting in improved reconstructions.

The goal of the realistic system is to generate samples that look as close as possible to the dataset, so that the target IoU is the score of comparing the reconstruction to the dataset. The IoU metrics are reported for both sets of noise functions (Tables 4.7 and 4.8), but only the best results are reported in images (Figure 4.5).

Test after 100/200 epochs with all noise functions: IoU

                   Dataset     Input       Average
Weaker 100 ep      0.984398    0.968770    0.976584
Stronger 200 ep    0.981492    0.966923    0.974208
Weaker 200 ep      0.983989    0.968331    0.97616
Decay              0.984105    0.968467    0.976286
Validation         0.981978    0.967733    0.974856

Table 4.7: IoU metric for the realistic system trained with all noise functions measuring the mean similarity with respect to both the test dataset and the user input thresholding at 0.3. The last column indicates the average of both similarity values.

Test after 100/200 epochs without random removal of voxels: IoU

                   Dataset     Input       Average
Weaker 100 ep      0.985214    0.974970    0.980092
Weaker 200 ep      0.984793    0.974208    0.979501
Decay              0.249668    0.257863    0.253766
Validation         0.982110    0.974729    0.978420

Table 4.8: IoU metric for the realistic system trained without random removal of voxels measuring the mean similarity with respect to both the test dataset and the user input thresholding at 0.3. The last column indicates the average of both similarity values.

stay as close as possible to the sample in the dataset from which the synthetic data comes. This setup provides two features to take into account: similarity with respect to the original in Figure 4.5, and realism.

Figure 4.5: Test results for the realistic system trained without the random removal of voxels noise and thresholded to 0.3 for three different dataset samples using several training approaches (columns: Original, 100 epochs, 200 epochs, Validation).

4.4.2 Balanced system

The balanced system is meant to be a combination of the qualities found in the realistic and the similar systems, meaning that the reconstructions should be close to the input model, but still following the distribution of the dataset.

In consequence, it is particularly difficult to assess the success of the results due to the hybrid nature of the training and the output.

Test: IoU

                   Dataset     Input       Average
Weaker 60 ep       0.979190    0.983888    0.981539
Stronger 120 ep    0.979548    0.982328    0.980938
Weaker 120 ep      0.980906    0.985178    0.983042
Decay              0.982523    0.985416    0.983970
Validation         0.975414    0.975393    0.975404

Figure 4.6: Test results for the balanced system thresholded to 0.3 for three different dataset samples using several training approaches (columns: Input, Weaker 60ep, Stronger 120ep, Weaker 120ep, Decay).

For this particular system, the average between the dataset and the user models is the target IoU, given that it is a quantity measuring both similarities.

The results obtained with validation are comparatively worse, so reconstruction images are not included in Figure 4.6.

If we discard alternatives that remove structural elements of the original, the systems trained with a weaker discriminator during 60 epochs and with a stronger one during 120 would be avoided. These systems erase/modify the arms in examples 1 and 3. Coincidentally, these are also the systems with lower IoU with the dataset.

The model examples shown for similarity with the user model may not be the best ones. But if the IoU value is to be trusted like in the previous case, then the best system is the one trained with linear decay. In fact, it achieves the best metric for all similarity measures.


4.4.3 Similar system

With the similar system, the goal is to reconstruct models as close as possible to the user model while retaining realism to a certain point. That is why in this section the objective IoU is that with respect to the input.

In the same way as the balanced system, the validation approach is dismissed due to its low performance.

Test: IoU

                   Dataset     Input       Average
Weaker 60 ep       0.975507    0.983702    0.978823
Stronger 120 ep    0.974902    0.982819    0.978861
Weaker 120 ep      0.977981    0.986298    0.982140
Decay              0.979298    0.987392    0.983345
Validation         0.965963    0.971143    0.968553

Table 4.10: IoU metrics for the similar system at test time thresholding at 0.3.

[Figure: test results for the similar system; columns: Input, Weaker 60ep, Stronger 120ep, Weaker 120ep, Decay]

With the focus set on the differences introduced by the noise functions it is possible to notice that, for the first example in Figure 4.7, the system that best retains the user features is the one trained with the weaker discriminator for 60 epochs. However, this reconstruction lacks structural elements like the arms and part of the legs.

Compared with the rest of the systems, which do reconstruct these morphological components, both the system trained with the weaker discriminator for 60 epochs and the one trained with the stronger discriminator for 120 epochs must be relegated.

The decayed approach reaches the highest IoU in all comparisons, most notably for the input, where it closely competes with the system trained with the weaker discriminator over 120 epochs. In the second example in Figure 4.7, the descriptive feature of the user input, the hole in the base of the rest, is more clearly kept, and the arms in the first example are maintained.

Supported by all these arguments, the decay approach is selected for the definitive recommendation system.

4.4.4 System comparative

The results reported in this section summarize the best systems achieved in the previous sections: for the realistic system, a weaker discriminator trained for 100 epochs without random removal of voxels; for the balanced and the similar systems, a weaker discriminator trained for 60 epochs with linear decay for another 60.
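This section does not restate which quantity is decayed. As a minimal sketch, assuming it is a learning rate that is held constant for the first 60 epochs and then reduced linearly to zero over the following 60 (a common schedule in cGAN training), the multiplier applied to the base rate could be computed as below. The function name `linear_decay_factor`, the zero end value and the 0-indexed epochs are assumptions for illustration, not details taken from this work.

```python
def linear_decay_factor(epoch, constant_epochs=60, decay_epochs=60):
    """Multiplier for a base rate at a given 0-indexed epoch.

    Assumption: the rate is constant for `constant_epochs` and then
    decays linearly to zero over `decay_epochs`.
    """
    if epoch < constant_epochs:
        return 1.0
    progress = (epoch - constant_epochs) / decay_epochs
    # Clamp so that epochs beyond the schedule keep a zero factor.
    return max(0.0, 1.0 - progress)


# Example: factors at the start, the midpoint of the decay, and the end.
for epoch in (0, 60, 90, 120):
    print(epoch, linear_decay_factor(epoch))
```

The effective rate at each epoch would then be the base rate times this factor, so the two phases together span 120 epochs.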

The metrics and reconstructions are placed together to simplify the comparison and get a better overview of the conclusions.

The IoU scores in Table 4.11 indicate that each system is not only the best among all the tested alternatives with the same set of parameters (Tables 4.8, 4.9 and 4.10) but also the best in its target IoU when compared with the other systems. That is, the realistic system achieves a higher IoU with the dataset than the balanced and the similar systems. The same holds for the balanced system with the averaged IoU metric and for the similar system with the IoU with the input.


Test IoU

              Dataset     Input       Average
Realistic     0.985214    0.974970    0.980092
Balanced      0.982523    0.985416    0.983970
Similar       0.979298    0.987392    0.983345

Table 4.11: Average IoU metric comparative for the best trained systems at test time thresholding at 0.3.
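The claim that each system wins in its own target column can be checked directly against the values in Table 4.11. The short script below is only an illustration: the dictionary layout and the helper `best_system` are hypothetical, while the numbers and the target metric of each system are taken from the text.

```python
# IoU values copied from Table 4.11 (test time, threshold 0.3).
TABLE_4_11 = {
    "Realistic": {"Dataset": 0.985214, "Input": 0.974970, "Average": 0.980092},
    "Balanced": {"Dataset": 0.982523, "Input": 0.985416, "Average": 0.983970},
    "Similar": {"Dataset": 0.979298, "Input": 0.987392, "Average": 0.983345},
}

# Target metric of each system, as described in Sections 4.4.1 to 4.4.3.
TARGET = {"Realistic": "Dataset", "Balanced": "Average", "Similar": "Input"}


def best_system(metric, table=TABLE_4_11):
    """Return the name of the system with the highest IoU in a column."""
    return max(table, key=lambda name: table[name][metric])


for system, metric in TARGET.items():
    print(f"{metric}: best is {best_system(metric)}, target of {system}")
```

The same table also shows that the Average column is the arithmetic mean of the Dataset and Input columns, up to rounding in the sixth decimal.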

[Figure 4.8: image not reproduced. Columns: Dataset, Input, Realistic, Balanced, Similar.]

Figure 4.8: Test results for the complete recommendation system thresholded to 0.3 for three different samples. The first two columns represent the dataset model and the user input while the following three depict reconstructions performed by the realistic, balanced and similar systems respectively.

4.5 User models

One important point has not been addressed so far: the similarity between the dataset and the user input.


... samples in the dataset, while the other components of the objective function guide the reconstruction into looking closer to the input or the dataset. If the model being reconstructed at test time is too different from the dataset or the synthetic data, the reconstructions are flawed. However, if the model is similar enough, the reconstruction results are as expected. Examples of these two cases can be found in Figure 4.9.

[Figure 4.9: image not reproduced. User models; columns: Input, Realistic, Balanced, Similar.]

Figure 4.9: Reconstruction of 3D models designed by humans using the three developed systems. The results in the first row correspond to reconstructions of a model that follows the dataset distribution while the one in the second row does not.

The main difference between the input models in the first and the second row of Figure 4.9 is the position of the seat. In the dataset, most samples have the seat at mid-height, and even most of the stool samples have some kind of backrest that ends at a higher position. These elements are never modified by the noise functions when creating the synthetic data for training.

The reconstruction generated by the realistic system shows how the network tries to push the seat down to a position where it is natural in the dataset. This effect lessens when the generator starts keeping the input features, which is why the balanced and the similar reconstructions look better.

This is always a problem when a system is trained on idealized and processed data, in this case the Princeton ModelNet [17], and then used on real, imperfect data.
