Graph neural networks for prediction of formation energies of crystals

(1)

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Physics, Chemistry and Biology

Master’s thesis, 30 ECTS | Applied Physics and Electrical Engineering - international

2020 | LITH-IFM-A-EX–20/3822–SE

Graph neural networks for

pre-diction of formation energies of

crystals

Graf-neuronnät för prediktion av kristallers formationsenergier

Filip Ekström

Supervisor : Fredrik Lindsten Examiner : Rickard Armiento

(2)

Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publicer-ingsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka ko-pior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervis-ning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säker-heten och tillgängligsäker-heten ﬁnns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsman-nens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to down-load, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

(3)

Abstract

Predicting formation energies of crystals is a common but computationally expensive task. In this work, it is therefore investigated how a neural network can be used as a tool for predicting formation energies with less computational cost compared to conventional methods. The investigated model shows promising results in predicting formation ener-gies, reaching below a mean absolute error of 0.05 eV/atom with less than 4000 training datapoints. The model also shows great transferability, being able to reach below an MAE of 0.1 eV/atom with less than 100 training points when transferring from a pre-trained model.

A drawback of the model is however that it is relying on descriptions of the crystal structures that include interatomic distances. Since these are not always accurately known, it is investigated how inaccurate structure descriptions affect the performance of the model. The results show that the quality of the descriptions definitely worsen the accuracy. The less accurate descriptions can however be used to reduce the search space in the creation of phase diagrams, and the proposed workflow which combines conventional density func-tional theory and machine learning shows a reduction in time consumption of more than 50 % compared to only using density functional theory for creating a ternary phase diagram.

(4)

(5)

Acknowledgments

First of all, I would like to thank my supervisor Fredrik Lindsten for proposing this thesis topic and for all the help and feedback. I would also like to thank my examiner, Rickard Armiento for his expertise and help in the field of material science, and for his feedback. Finally, I would like to thank Joel Oskarsson for great company and help with computer related issues.

(6)

(7)

List of Figures

2.1 Example phase diagram . . . 6

3.1 A fully connected neural network . . . 8

3.2 Examples of activation functions . . . 9

3.3 Illustration of convolutional graph neural networks . . . 12

4.1 Distributions of formation energies . . . 16

4.2 Distributions of number of atoms per structure . . . 16

5.1 Comparison of loss functions . . . 22

5.2 Comparison of KRR and CGCNN . . . 22

5.3 Comparison of relaxation levels . . . 23

5.4 Time consumption for combined ML/DFT workflow . . . 25

5.5 Number of relaxed structures when for ML/DFT workflow . . . 26

5.6 Validation set MAE not reaching minimum . . . 28

5.7 Transfer learning . . . 29

(10)

(11)

List of Tables

4.1 Time consumption on average for the different steps . . . 19

5.1 TAATA-Hf ML/DFT workflow results . . . 24

5.2 TAATA-Ti ML/DFT workflow results . . . 24

5.3 TAATA-Zr ML/DFT workflow results . . . 24

5.4 Structures from TAATA-Hf close to the convex hull included with DFT/ML work-flow . . . 27

5.5 Structures from TAATA-Ti close to the convex hull included with DFT/ML work-flow . . . 27

5.6 Structures from TAATA-Zr close to the convex hull included with DFT/ML work-flow . . . 27

B.1 Optimized hyperparameters . . . 43

(12)

(13)

1 Introduction

This thesis work has been conducted at the Division of Statistics and Machine Learning (STIMA) at the Department of Computer and Information Science at Linköping University. The topic of the thesis is however not limited to only machine learning; it is rather an appli-cation of machine learning to the field of materials science. To be more precise, it is related to the field of material discovery. Discovering materials with new or better properties is essen-tial for technological advancement. New materials can be used to for example improve the efficiency of solar panels or to enable new inventions like touch screens.

This chapter introduces the topic of material discovery and some previous work where machine learning has been applied to this area. From this, the problem to be investigated in this thesis is presented.

1.1 Material stability and formation energies

To have a new and usable material, the material in question needs to be stable. Focus for determining the stability of a material is the knowledge about the formation energy of the material. The formation energy is the change in energy in the creation of a certain structure from the pure elements. A negative formation energy means that energy is gained for this structure compared to the pure elements (and a positive formation energy means that energy has to be added to form the structure). Therefore, to have a stable material, the formation energy of the proposed material first of all has to be negative (or it will not be favorable compared to the pure elements), and secondly it has to be relatively low. To be more precise, it has to be lower than the formation energies of competing structures, that is, structures with the same composition but different arrangement of the atoms, or a combination of structures with different compositions. If it is higher than any of these alternatives, it is an unfavorable structure and it will decompose into the more favorable structure or structures. The stability of a material can be found by looking in a phase diagram, which shows the stability of all the materials with the same elements.

1.2 High-throughput search for new materials

There are many ways to arrange atoms to form materials, but only a small fraction of these theoretical materials have properties desired for a certain application or are actually stable

(14)

1. INTRODUCTION

at all. To find new useful materials, high-throughput search is the conventional method. A large number of potential materials are screened with computer simulations and the results are then used to evaluate the material, for example by constructing a phase diagram and determining the stability of each material. Many databases with the results from these high-throughput searches are accessible online and for free. In these databases, anyone can look for a material with a certain property.

One conventional method for these high-throughput searches is Density Functional The-ory (DFT). The foundation of DFT was laid out in 1964 by the construction of the two Hohenberg-Kohn theorems [6]. Work by Kohn and Sham [10] further improved the method and it is now the standard simulation method in many applications. However, it comes with the downside that it is computationally heavy: it typically scales asO(N3

e)where Ne is the

number of electrons, limiting the total system size to some thousand atoms. Work has been done to improve the scaling [13], but it is still a time consuming process.

Even if DFT is a well investigated and widely used method, it is not an exact method and comes with some error. When comparing DFT against experimental values, Kirklin et al. [9] determined that DFT has a mean absolute error of 0.082-0.136 eV/atom when predicting formation energies.

More recently, boosted by the success in other fields, machine learning has received a lot of attention as a cheaper alternative for these high-throughput searches. By using data from DFT simulations, a machine learning model can be used as an approximation of calculations made by DFT, but requiring less computation time. Also, as noted by Xie and Grossman [21], machine learning can give chemical insights that can reduce the total search space of materials that is of interest. The materials screened (with any method) can then be chosen in a better way in the sense that they are more likely to have the desired property, or be stable at all.

Database search for candidate structures

Working with theoretical materials means that their exact structures are unknown. An initial guess of a material that can be found by searching through a database for candidate structures and then substituting the atoms with atoms of interest. Before predicting any property of this material, the initial guess has to be relaxed by letting the atoms move so that they can find positions that are energetically favorable with the new atoms. This can be done in several steps with DFT, letting the material gradually reach an energy minimum.

1.3 Previous work

Some previous work has focused on finding suitable representations of materials for machine learning frameworks. One example of this is the work by Faber et al. [3], which investigates a number of different representations to be used together with a kernel ridge regression model. This work is then implemented and put to the test in a previous thesis [2] where the formation energies of crystals are predicted. More specifically, it is the formation energies of different structures from the TM-Zn-N (TM=Hf, Zr, Ti) phase diagrams that are predicted. The thesis is inspired by Tholander et al. [18], which have calculated the formation energies for these structures using DFT. The dataset produced in the article (from now on called TAATA after its creators) is used by Bratu [2] for learning and testing a kernel ridge regression model, using the representations by Faber et al. [3]. In the thesis, other representations are also compared, like the Voronoi tessellation representation by Ward et al. [19]. That representation is however used together with random forest instead of kernel ridge regression. In the thesis, the desired error is a mean absolute error (MAE) of 0.1 eV/atom, since this is roughly the error of DFT compared to experimental values.

Kernel ridge regression and random forests are just two machine learning models, and there are many more. In a variety of other disciplines, neural networks have shown great

(15)

1.4. Areas lacking investigation

success. An area where they are commonly used is image processing, but also areas like drug discovery and natural language processing. In the area of computer vision, machine learning methods were for long relying on hand made feature representations before the use of convolutional neural networks made a breakthrough [15]. The neural networks have the advantage that they rely less on manual feature engineering compared to for example kernel ridge regression.

However, it exists a number of different types of neural networks; an example of a neural network that has found promising results for materials science applications is the "Crystal Graph Convolutional Neural Network" (CGCNN) which is a type of graph neural network [21]. In this model, the crystal is represented as an undirected (multi-)graph where the nodes represent atoms and the edges represent bonds between atoms. This model has found success in predicting a number of properties, including the formation energies of materials.

1.4 Areas lacking investigation

Even if the application of machine learning to materials science has been studied before, it is far from well studied compared to applications to other fields. Many proven powerful techniques developed in these more well studied areas have therefore not yet been deployed to a large extent in materials science. One example is transfer learning, which have been re-searched and found to be useful when combined with neural networks in for example com-puter vision [23]. The goal with transfer learning is to be able to train a model on one domain and then transfer the model to another domain. This could then be leveraged to improve the performance of a model when the availability of training data is limited in the second do-main, for example if it is expensive to label data. Since this is the case for material discovery when the labelling is done with DFT, a method to reduce the need for DFT runs is desirable, and transfer learning could be such a method.

A potential drawback of the CGCNN model is that it relies on the knowledge of the struc-ture of the material, including interatomic distances. These distances are however usually not known for theoretical materials, and this would limit the usefulness of the model. It would therefore be interesting to see how sensitive the model is to different qualities of estimates of these distances.

1.5 Aim

The aim of this thesis is to further investigate if neural networks can be used to predict forma-tion energies of crystals and with that accelerate the discovery of new materials. The first step is to compare the performance of the CGCNN model with the models mentioned in section 1.3, using the TAATA dataset. This will further evaluate the generalization of the CGCNN stated in the original article, but also sets a baseline of its performance to be improved in com-bination with other techniques. One part of this is to evaluate how the performance changes with increasing availability of training data. If the model reaches an MAE of 0.1 eV/atom, it can be determined how much training data is needed for the model to reach this level.

To evaluate how sensitive the CGCNN model is to accurate descriptions of the structure, the model will be trained on structures from the intermediate steps in the relaxation process to see how the performance changes. It will thereafter be investigated if the CGCNN model could be used to accelerate the creation of phase diagrams. This will be done by letting the model decide for which structures to perform the relaxation process. Finally, the transferabil-ity of the CGCNN model between different phase diagrams will be evaluated to determine if transfer learning can be used when availability of training data is low.

(16)

1. INTRODUCTION

1.6 Research questions

To fulfill this aim, the following questions should be investigated and answered:

1. How does a graph neural network compare to the models tested by Bratu [2] when predicting formation energies for the TAATA dataset, and can it reach below a mean absolute error of 0.1 eV/atom?

2. How is the performance of the CGCNN model affected when trained on structures that are described with less accuracy?

3. Can CGCNN be used to produce a correct phase diagram with less computation time than relying only on DFT?

4. Can transfer learning be utilized to transfer a model between materials with different atom types?

1.7 Delimitations

The graph neural network to be used for predictions in this thesis is the Crystal Graph Con-volutional Neural Network [21]. Even if it has shown to be able to predict multiple physical properties, the property investigated here is the formation energy.

The data available is the TAATA dataset [18], and no extra DFT simulations have been performed.

(17)

2 Phase Diagrams and Crystals

This short chapter introduces some background about phase diagrams and crystals.

2.1 Phase Diagrams

To determine the stability of a material, a so called phase diagram could be created. This term is not uniquely defined, but here it refers to a diagram relating the formation energies of different structures with their composition. To create this, the formation energies for many different structures of materials of certain atom types are needed. All structures are plotted according to their composition and formation energy, like in the example in figure 2.1 which displays the Ti-N phase diagram. When all structures have been plotted, the so called convex hull of all structures is created. Structures are divided as either being on the convex hull or above the convex hull. The structures on the hull are predicted as stable, while the ones that are above the hull are predicted unstable. This is since that being above the hull indicates that the formation energy is higher than the formation energy of a combination of structures on the hull. If a structure is above the hull, it is therefore thermodynamically unfavorable and it will decompose into this more favorable combination of structures.

Being above the convex hull and thermodynamically unfavorable is however not a guar-antee that the material will not exist in nature. A typical example of this is diamond. It certainly exists in nature, but is not the most stable structure of pure carbon. It however con-verts to graphite (which is the most stable structure) slowly, making it a metastable structure. It is not the most favorable structure, but in some sense stable enough to not decompose immediately and can therefore exist in nature.

To measure how close a phase is to the hull, the term energy above hull is used. As the name maybe suggests, it is the energy between the structure and the convex hull. An energy above hull of 0 eV/atom means that the structure is on the hull and an energy above hull ą 0 eV/atom means that the structure is not on the hull.

Since material stability is a relative property, depending on the formation energy of the material compared to the formation energies of other materials, an extensive (in principle exhaustive) search is needed to create a correct phase diagram. Missing one structure can change the convex hull and predict the incorrect structures as stable.

(18)

2. PHASEDIAGRAMS ANDCRYSTALS

0.0

0.2

0.4

0.6

0.8

1.0 Fraction

2.0

1.5

1.0

0.5

0.0 Formation energy (eV/fu)

Ti

N

2 Ti

2 N

TiN

Ti

3

N

2

Ti

3

N

4

Figure 2.1: An example of a phase diagram: the Ti-N phase diagram. The black line indicates the convex hull, and the green dots indicate structures on the hull, while red squares indicates structures above hull. The x-axis is the ratio of N in the phase, and y-axis the formation energy in eV/atom. Data used is from the work by Tholander et al. [18]

2.2 Crystals and Unit Cells

A crystal is a solid material with a highly ordered structure of its atoms which is repeating in all directions. The unit cell is the smallest arrangement of atoms that can be repeated to form the material.

(19)

3 Machine Learning

Machine learning is a part of artificial intelligence concerned with learning from data. For example, a generic function with some parameters w= (w1, w2, ..., wk)tcan be used to create

the relationship between the two properties x and y as

y= f(x, w) (3.1)

Here, w is the vector of tweakable parameters used in the generic function. If the exact rela-tionship between x and y is unknown, w can be learned from data with the intent of making

f as good of an approximation of the real relationship as possible.

This chapter introduces the concept of a group of generic functions, or models, called neural networks. Also, neural networks for graphs are introduced and the Crystal Graph Convolutional Neural Network mentioned in the introduction is presented. Finally, the con-cept of transfer learning is introduced in the context of neural networks.

3.1 Neural Networks

In this work, neural networks refer to feed forward neural networks if not stated otherwise. Neural networks are machine learning models that have found big success in many fields. Maybe apparent from their name, they are inspired by the neurons in the brain. The more recent work has focused on so called deep neural networks. An example of a neural network is displayed in figure 3.1. The nodes in the network correspond to neurons, and the edges are connections between the neurons. The neurons form layers, which gives some intuition to the term multilayer perceptron, or MLP, which is another widely used term for a feed forward neural network. The different layers in the MLP represent functions. Each layer is connected to another layer in the sense that it passes on its result to the next layer, which uses that result as input to its own function. If the full network is representing the function f :Rm ÝÑRn, then f can be decomposed as a chain of functions so that f(x) = fl˝fl´1˝... ˝ f1(x)where

x P Rm and fi :Rmi ÝÑ Rni corresponds to the i:th layer. Calling the network deep stems

from adding more layers, or functions, making the network "deeper".

This also gives some more intuition to the term feed forward neural network: the result of each function is fed as input to the next and eventually outputted as a result y, meaning x is transformed by the network in a single (forward) direction. This stands in contrast to a recurrent neural network, where the output result is fed back into the network.

(20)

3. MACHINELEARNING x0 x1 x2 h0,1 h0,2 h1,1 h1,2 y h0,0 h0,3 h1,0 h1,3

Figure 3.1: An illustration of a fully connected neural network with two hidden states.

The network in figure 3.1 is a so called fully connected neural network, which means that each neuron in a layer has a connection with every neuron in the next layer. The input to the network is a vector x= (x1, x2, ..., xm)t, and each layer produces a hidden state hi= fi(hi´1)P

Rni_.

Each element of the hidden state vector is called a hidden unit. For a fully connected network, a hidden unit is given by equation (3.2):

hi,j=σ(w(i)_j hi´1+b(i)_j ) (3.2)

where w(i)_j PRmi_{is a vector of scalars and σ}₍_¨₎_{is an activation function. The activation function}

is also inspired by the the human brain. The neurons in the brain have a binary activation function: either it is active or not. This however requires more neurons than what is currently feasible and makes it very difficult to train the network. Work has therefore instead been done to find non-binary activation functions. Examples of these are the sigmoid (equation (3.3)), rectifier (equation (3.4)) and softplus (equation (3.5)) activation functions. Their behavior can be seen in figure 3.2. A hidden unit using the rectifier activation function is often called rectified linear unit, or ReLU.

sigmoid(x) = 1

1+e´x (3.3)

rectifier(x) =maxt0, xu (3.4)

softplus(x) =ln(1+ex) (3.5)

If w(i)_j is the j:th row in a matrix W(i), the full i:th hidden state can be described with more compact notation as equation (3.6):

hi =σ(W(i)hi´1+b(i)) (3.6)

with W(i) P Rniˆmi _{and b}(i) _P Rni_{. Theses matrices and vectors are the parameters that the}

model learns from data, and the training procedure is described in section 3.2.

Representation learning and Convolutional Neural Networks

A machine learning model can be noted as y = f(x, w), where x are called features and w parameters. A problem that arises is how to choose which features that should be included,

(21)

3.2. Training of Neural Networks 8 6 4 2 0 2 4 6 8 x 0 1 2 3 4 5 6 7 (x )

Sigmoid, recifier and softplus activation functions

rectifier sigmoid softplus

Figure 3.2: Examples of activation functions

that is, which are the properties x that y depends on? For example, if trying to determine whether or not a bank customer should be given a loan could depend on savings, supermar-ket expenses, gambling expenses, previous loans and so on. This process of hand crafting features for a model is called feature engineering.

Another example is image classification. Historically in image classification, methods were based on some type of feature extraction preprocessing techniques where predeter-mined features were extracted and then combined with for example support vector machines [15]. With the emergence of convolutional neural networks, focus shifted to feature learning or representation learning. This means that the feature engineering or feature extraction is built into the learning of the model, and thus the work of finding the features is replaced with find-ing a suitable representation of the data and a model which can learn the feature extraction process and its model parameters from this representation. The input x is then the raw rep-resentation instead of the explicit features. For the case of images, instead of first extracting feature vectors to be fed into support vector machines, a neural network can be fed a matrix where each element corresponds to a pixel value in the image. Then, the network learns from this raw representation of the image instead of preprocessed features.

As noted before, this shift to representation learning started with the success of convolu-tional neural networks (CNNs) and specifically, the success of Krizhevsky et al. [11] in the 2012 ImageNet Large Scale Visual Recognition Challenge [15].

A convolutional neural network has one or more convolutional layers. A convolutional layer consists of a filter in the form of a matrix (or tensor when the input is not two-dimensional, e.g. an image with multiple color channels) and letting this filter convolve with the input by sliding over it and computing dot products. Each dot product then make up a new element in the hidden state which, depending on the size of the filter, can have different dimensions than the original matrix. The elements in the filter are the learnable parameters.

Another common operation in a convolutional neural network is pooling layers. The pooling layer typically comes after the convolved result has been run through a non-linear activation function. Commonly, this is done by letting a patch slide over the hidden state and performing an operation like finding the max or taking the mean over that patch [5]. Thus, this operation does not have any learnable parameters.

3.2 Training of Neural Networks

Training a neural network consists in finding weight parameters w with the help of data. As stated by Bishop [1], the common approach is to choose some initial values of these

(22)

parame-3. MACHINELEARNING

ters, w(0), and then updating them iteratively as

w(l+1)=w(l)+∆w(l) (3.7)

where l indicates step in the iteration process. These iterations continue until some require-ment is fulfilled, but the question is then how to choose the update∆w(l) _{and how to}

deter-mine when to stop updating.

Gradient Descent

Using gradient information for the update refers to the gradient of the loss function. This loss function calculates some kind of measure of loss between the actual label y available in the training data and the label predicted by the network ˆy. The loss will depend on the chosen parameters w since these change the prediction made by the network, and the loss is therefore denoted as J(w). The goal is to minimize this loss which is done by moving in the direction where J(w)decreases as much as possible. This direction is the negative gradient of the loss, and the update term∆w(l)becomes

∆w(l)₌

´η∇J(w(l)) (3.8)

where η is called learning rate and regulates how big step to be taken in the direction decided by the gradient. The new weights w(l+1)are therefore calculated as

w(l+1)=w(l)´ η∇J(w(l)) (3.9)

A common example of a loss function is the mean squared error (MSE) loss, which looks like J(w) = 1 N N ÿ i=1 (yi´ˆyi)2 (3.10)

with yibeing the actual label of datapoint i, ˆyithe prediction by the network and N being the

number of training datapoints. Computing the gradient∇J(w)and updating the weights can be done after going through the whole training set. An alternative to this is using stochastic gradient descent (SGD) where the gradient is computed and the weights updated based on a single randomly selected training point, or a randomly selected subset of the training set [1]. There are alternatives and variants of stochastic gradient descent. For example, stochastic gradient descent with momentum [14] where∆w(l) from equation (3.8) has an added term which includes∆w(l´1)_as

∆w(l)₌_´η_∇_J₍_w(l)_{) +}

α∆w(l´1) (3.11)

where α is a fixed number between 0 and 1.

Another alternative is Adam, which stands for adaptive moment estimation [8]. This method uses an adaptive learning rate for different parameters, but instead has a parameter called step size.

Overfitting, early stopping and regularization

Overfitting is when a model becomes too good at approximating the training data so that it performs poorly on new, unseen data. This is a common problem for any machine learning model. To counter this when training a neural network, the data can be split up into three parts: training, validation and test data. The model is trained by going through the training set and updating the weights. When it has gone through the whole training set once, it is said that it has been trained for one epoch. After each epoch, the model predicts the labels of the validation data. When the error of the predictions of the validation set starts to increase,

(23)

3.3. Graph Neural Networks

it indicates overfitting and the training is stopped. This method is called early stopping. To determine how well the final model performs, the model predicts the labels of the test data. The error when predicting the test data is called generalization error, and is the error that can be expected when predicting on unseen data.

Another way of preventing overfitting is by regularization. It is performed by adding a penalty term to the loss function that penalizes making the weights too large. A typical type of regularization is L2-regularization. When using mean square loss (equation (3.10)) with

this type of regularization, the loss function becomes

J(w) = 1 N N ÿ i=1 (yi´ˆyi)2+λkwk2 (3.12)

with λ being a fixed regularization parameter. The higher λ, the more regularization since it will put more emphasis on keeping the weights small.

3.3 Graph Neural Networks

Some data is best represented as graphs, e.g. citation or social networks. This representa-tion of the data of course impacts the possibility to use certain machine learning models. For example, convolutional neural networks have widely been applied to image data, but com-pared to images, graphs have a very different appearance. Images are easy to represent on a 2D-grid with each pixel having the same number of neighbors, whereas a graph G is defined by(V, E), where V is the set of nodes and E is the set of connections between these nodes. This means that there are no fixed node positions in space, and each node has its own number of neighbors. Theses differences have forced special neural networks to be constructed to be compatible with graph data. This is a work in progress, and a recent survey [20] proposed to divide the current work with graph neural networks into four different categories:

• Recurrent Graph Neural Networks • Convolutional Graph Neural Networks • Graph Autoencoders

• Spatial-temporal Graph Neural Networks

Graph Autoencoders are unsupervised learning methods for either learning graph represen-tations and embeddings, or to be used for generating graphs. The spatial-temporal graph neural networks consider both spatial and temporal adjacency [20]. These two examples are however not relevant to this thesis and will therefore not be described in more detail.

Even if recurrent and convolutional graph neural networks are divided in the cited survey, their goals and principles are very similar: both try to learn node representations by mapping an initial node representation to a hidden state representation and then iteratively update this hidden state by aggregating the hidden states of its neighbors for some number of iterations. The recurrent case does it by feeding back the hidden state to the same function, while the convolutional case can apply different functions in each iteration. Gilmer et al. [4] introduce a framework that can describe both recurrent and convolutional graph neural networks: for an undirected graph G with node features xv(v P V) and edge features evw((v, w)PV), and letting hivdenote the hidden state of node v P V at iteration i, the updated hidden state hi+1v

is calculated as:

hi+1v =Ui(hiv, mi+1v ) (3.13)

where the message mi+1v is computed by:

mi+1v =

ÿ

wPN(v)

(24)

3. MACHINELEARNING

with N(v)being the set of neighbor nodes of node v.

The difference between recurrent and convolutional neural networks is then that the two functions Miand Ui do not change between iterations for the recurrent case since the result

is always fed back into the same function, whereas in the convolutional case each iteration is a new layer with its own parameters and the functions Uiand Mitherefore depend on i.

The type of convolutional operation described in section 3.1 was a filter of fixed size slid-ing over a tensor producslid-ing a new tensor. In the context of convolutional graph neural net-works, the filter consists of the two functions Uiand Mi which produce a new hidden state,

and the filter slides over the graph by centering over each node and covering the node neigh-borhood. This means that the node neighborhood determines the size of the filter, which becomes variable since each node can have a different number of neighbors. Compared to CNNs for images, this technique for convolutional GNNs keeps the initial graph structure and just updates the node representations, as illustrated in figure 3.3.

(a) Initial structure (b) After i convolutions

Figure 3.3: Illustration of how a convolutional graph neural network keeps its graph structure during the convolutions. The nodes are just updated by aggregating the neighbors, meaning they after some convolutions contain information about its initial state but also about the neighbors. In practice, the node representations are high dimensional vectors and the colors are just for illustrating how information from the neighborhood is aggregated to each node.

For producing an output, Gilmer et al. [4] also describe a readout function that produces a feature vector representing the full graph:

ˆy=R(thTv PGu) (3.15)

with T being the last iteration step.

A maximally powerful graph neural network

As noted by Xu et al. [22], there exists little theoretical knowledge about GNNs. In their work, they have therefore intended to contribute to the theory of GNNs by for example proving that a graph neural network can at most be as powerful as the Weisfeiler-Lehman graph isomorphism test in distinguishing graph structures1. They also prove that for a GNN to be that powerful, it has to satisfy certain conditions, namely that, for a GNNN:

a. N updates its node features as hiv=φ(hi´1v , f(thi´1w : w P N(v)u))where the functions

f and φ are injective and f is a function operating on multisets2.

b. A pooling function operating on the multiset of node features from all iterations and producing a graph level representation hG, is injective.

1_{For more information about the test and proof of their statement, see [22]}

2_{As described by Xu et al. [22], a multiset is a generalization of sets, where elements are allowed to be repeated.}

In the context of graphs, the node feature representations in a node neighborhood form a multiset (since multiple nodes potentially can have the same feature representation).

(25)

3.4. Crystal Graph Convolutional Neural Network

To illustrate how a maximally powerful GNN would look like, Xu et al. also propose a GNN which they call Graph Isomorphism Network (GIN) and which fulfills the above two condi-tions. This model also fits into the framework described by Gilmer et al. [4], and for GIN, the functions Uiand Mifrom equations (3.13) and (3.14) become:

Mi(hiv, hiw, evw) =hiw (3.16)

Ui(hiv, mi+1v ) =MLPi+1((1+ei+1)hiv+mi+1v ) (3.17)

with MLPi+1being a multilayer perceptron and ei+1‰0 being a scalar that can either be fixed or learnable with the purpose of distinguishing the node itself from its neighbors. Worth noting is that the edge features evw are not considered (more than implicitly determining

N(v)). To produce a graph level representation hG, GIN uses

hG= ( ÿ vPG h0v)‘( ÿ vPG h1v)‘... ‘( ÿ vPG hTv) (3.18)

with ‘ meaning concatenation of vectors. In other words, GIN first sums the hidden states at each iteration, and then concatenates the results.

3.4 Crystal Graph Convolutional Neural Network

As mentioned in section 1.3, Xie and Grossman [21] have constructed a GNN for predicting formation energies of crystals that they call Crystal Graph Convolutional Neural Network (CGCNN). As opposed to GIN, they consider multigraphs, that is, graphs with multiple edges between the same two nodes. With a small correction to consider the multigraph case, the CGCNN can also be described by the notation from Gilmer et al. [4]: the function Uiis a

mere sum of the hidden state and the message (equation (3.20)), while the function Mi(now

considering multiple edges between the same two nodes) is given by equation (3.19): Mi(hiv, hiw, e(vw)_k) =σ(W(i)f zi(vw)k+b (i) f )dg(W (i) s zi_(vw)_k+b(i)s ) (3.19) Ui(hiv, mi+1v ) =hiv+mi+1v (3.20)

Here, the extra index k is to distinguish between edges between the same nodes. The message mi+1v therefore consists of two sums:

mi+1v = ÿ wPN(v) ÿ k Mt(hiv, hiw, e(vw)k) (3.21) The variable zi_(vw)

k in equation (3.19) is the concatenation of the hidden state vector of the

node itself, the hidden state vector of the neighbor and the edge feature vector as zi_(vw)

k =

hiv‘hiw‘e(vw)k. Also in equation (3.19), σ is the sigmoid activation function (equation (3.3))

and d element-wise multiplication. Since sigmoid produces a number between 0 and 1, this is supposed to work as an importance weight matrix, giving different importance to different neighbors. In their implementation, g(¨)is the softplus function (equation (3.5)).

The initial hidden state h0vof node v is created via an initial embedding layer, embedding

xvas h0v=W(emb)xv+b(emb).

For the CGCNN model, the readout function is a normalized sum of all the hidden states from the last iteration, followed by a fully connected layer with softplus activation function and an output layer that produces a single scalar as the formation energy.

Summary of the Crystal Graph Convolutional Neural Network

(26)

3. MACHINELEARNING

1. Embed each initial atom representation xvas their initial hidden state: h0v=W(emb)xv+

b(emb)

2. Update the hidden states with T convolutional layers according to equations (3.19), (3.20) and (3.21)

3. Pool the final hidden states as a normalized sum and apply element-wise softplus non-linearity, producing one crystal vector hG =softplus(_|V]1

ř

vPVhTv)

4. Send the crystal vector through a multilayer perceptron with softplus activation func-tions resulting in hMLP

G

5. Produce an output ˆy=w( f )h_GMLP+b( f )

According to the conditions for a GNN to be maximally powerful stated by Xu et al. [22], the CGCNN does not have maximal discriminative power. This can easiest be verified by con-cluding that the normalized sum in the pooling is not an injective function (since for example h₁T = (1, 1)t and hT₂ = (0, 0)t have the same normalized sum as for example h₁T = (1, 0)t

and hT₂ = (0, 1)t). Since the pooling function has to be injective for the GNN to be maxi-mally powerful and this is not fulfilled for CGCNN, it does not posess maximaxi-mally powerful discriminative power.

From the investigations by Xu et al. [22], it can however be concluded that maximal dis-criminative power does not automatically lead to the best results. Even if GIN in many exper-iments is the best performing model, it is sometimes outperformed by other GNNs that are concluded as not maximally powerful. The CGCNN model might not have been compared head-to-head to other GNN models, but results from experiments show that it is a promising model for the task at hand, despite theoretical sub-maximal discriminative power.

3.5 Transfer Learning

For us humans, using knowledge from one domain and apply it to new domains does not seem unreasonable. It has been shown that this is also the case for neural networks [23]. In the cited article, they investigated how a neural network could train on a dataset containing images with certain classes, and then being able to transfer this knowledge to classify images of other classes. This was done by training first on one dataset containing images with la-bels from a certain set of classes, then keep the weights in the first n layers and reinitialize the other layers when training on a second dataset with images picturing other classes. The first n layers were either frozen and kept constant or fine-tuned by letting the training continue also on these layers. It was shown that fine tuning some layers improves the performance com-pared to training from a completely random initialization, meaning that the model gained knowledge from seeing other classes. The takeaway is therefore that there exist layers that are general and can be transferred to other image classes.

If the layers in the CGCNN model are general and can be transferred to new structures with different characteristics using only a small extra amount of data, this would be very beneficial since using DFT for producing training data is a slow process.

(27)

4 Method

4.1 Data and construction of graph

The data used in this work is taken from the article by Tholander et al. [18] and was also used by Bratu [2]. However, it was discovered that the data used by Bratu was not complete and contained errors, stemming from being erroneously parsed from the raw DFT-output. Therefore, the data has been re-parsed and this section describes the re-parsed dataset.

The full dataset, called TAATA-Full consists of structures from the TM-Zn-N (TM=Hf, Ti, Zr) phase diagrams. Since the data consist of structures from three different phase diagrams, TAATA-Full can be divided into three subsets:

• TAATA-Hf: A total of 4842 structures from the Hf-Zn-N phase diagram. Between 2 and 99 atoms in the unit cell and formation energies between -1.937 eV/atom and 3.048 eV/atom

• TAATA-Ti: A total of 4329 structures from the Ti-Zn-N phase diagram. Between 2 and 92 atoms in the unit cell and formation energies between -1.903 eV/atom and 2.578 eV/atom

• TAATA-Zr: A total of 3650 structures from the Zr-Zn-N phase diagram. Between 2 and 46 atoms in the unit cell and formation energies between -1.943 eV/atom and 1.969 eV/atom

The DFT simulations have been performed with the Vienna Ab initio Simulation Package, VASP. Each datapoint in its raw format consists of two parts: (1) description of the structure of the unit cell, outputted from VASP and (2) absolute energies, calculated by VASP. The distributions of formation energies and number of atoms per structure can be seen in figure 4.1 and 4.2, respectively. It seems like for larger structures, it is more common with an even number of atoms.

The CGCNN model can be described as ˆy= f(X, w), where ˆy is the formation energy and X = txv : v P Vu Y tevw : (v, w) P Eu. Since the raw data only consists of descriptions of

the unit cells and their absolute energies, the raw data has to be preprocessed. The formation energy is obtained from the absolute energy by subtracting the absolute energies of the pure elements for each atom in the structure, see appendix A for details. The neighbors for each node are obtained by, for each atom, scan for the up to twelve closest neighboring atoms

(28)

4. METHOD

2 1 0 1 2 3

Formation energy [eV/atom]

0 25 50 75 100 125 150 175

Distribution over formation energies, TAATA-Hf

(a) TAATA-Hf

2 1 0 1 2

0 25 50 75 100 125 150 175

Distribution over formation energies, TAATA-Ti

(b) TAATA-Ti

2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0

0 20 40 60 80 100 120

140 Distribution over formation energies, TAATA-Zr

(c) TAATA-Zr Figure 4.1: Distributions of formation energies for the three different subsets.

0 20 40 60 80 100 Number of atoms 0 50 100 150 200 250

Distribution over number of atoms per structure, TAATA-Hf

(a) TAATA-Hf 0 20 40 60 80 Number of atoms 0 50 100 150 200 250

Distribution over number of atoms per structure, TAATA-Ti

(b) TAATA-Ti 0 10 20 30 40 Number of atoms 0 50 100 150 200 250

Distribution over number of atoms per structure, TAATA-Zr

(c) TAATA-Zr

Figure 4.2: Distributions of number of atoms per structure for the three different subsets.

within a radius of 8 Å while considering the periodicity of the material, that is, considering that the unit cell is repeated in all directions when forming the material. The vectors xvand

evw are vector representations of the nodes (atoms) and the edges, respectively. The initial

node representations xv are concatenated one hot encodings of ten different properties of

each atom type like period and group in the periodic system, electronegativity etc (see the supplementary material of the original work for full details), resulting in a vector xvPR92.

4.2 Choice of hyperparameters

The choice of hyperparameters can be crucial for good performance of a neural network. Since the model used is the CGCNN by Xie and Grossman [21], their method of deciding hyperparameters have been used and is described here. The only hyperparameters that have been tuned are the learning rate, the number of convolutional layers and the penalty term for L2 regularization. For all the other hyperparameters, the default values from the code

provided by the original authors were used, with the exception of the optimizer: after ex-perimenting with both stochastic gradient descent with momentum and with the Adam op-timizer, the Adam optimizer seemed to slightly outperform SGD. It was also the optimizer used in the original article. Therefore, the Adam optimizer was used.

TAATA-Full was used, and just like when training the model, the dataset was divided into training, validation and test sets with the training set being 60 % of the original set and the validation and test set 20 % each. A grid search was used to go through the possible combinations of hyperparameters and the hyperparameters for the model with the lowest validation MAE was chosen.

(29)

4.3. Choice of loss function

4.3 Choice of loss function

Since the the mean absolute error (MAE) is used as a measure of error in many articles where machine learning is used as an approximation of DFT (for example [2, 21]), it would make sense to also use this as the loss function for the neural network to minimize. In the im-plementation of the CGCNN [21] provided by the authors, the mean square error (MSE) is however used as loss function. Intuitively, it makes sense to have the network minimize the function used for comparing results, assuming that this is the most important metric in the end. After experimenting, it was also seen that MAE in general gave slightly better perfor-mance and it was decided to use MAE as loss function.

4.4 Ensemble

In this work, an ensemble of neural networks have been used for predictions since these have shown to improve predictive performance [12]. This means that a number of neural networks, in this work five, are initialized differently and trained with different data. When used for prediction, all five ensemble members predict the same datapoint. The final prediction is obtained by averaging the five predictions in the ensemble when using MSE loss, and finding the median of the five predictions when MAE loss was used. This is motivated by the fact that minimizing MSE and MAE implies that the data is distributed according to normal and Laplace distributions respectively. The two distributions both have parameters µ as modes and the maximum likelihood estimator of these are the mean for the normal distribution and the median for the Laplace distribution [16].

To avoid overfitting of the model, early stopping was used with the dataset divided into three sets: training, validation and test. The test set size was 20 % of the full dataset and the test set was kept constant so that all models, independent of training set size, were tested on the same data.

To simulate a real situation where the available data is some limited set, the data available for training and validation was the same for the different ensemble members. However, to introduce more randomness into the model, it was randomized which data that was used as training and which was used as validation, where 3/4 were used for training and 1/4 for validation.

To evaluate how much data that is necessary for a certain performance, the total size of the training + validation dataset was from 60 datapoints up to 80 % of the total set (meaning all data except for the data used for test).

4.5 Inaccurate structure descriptions

To evaluate the performance when using inaccurate structure descriptions, the descriptions of the materials after the intermediate steps from the relaxations process have been used. The formation energy is still the final, most accurate energy since that is what in the end is interesting, but the descriptions of the structures that the model uses for training and testing are the not yet fully relaxed structures.

4.6 Phase diagram workflow

The aim of this thesis is to see how graph neural networks can be used to accelerate the discovery of new materials. Finding the stability of materials however requires that many materials are screened to form a phase diagram. A workflow with the goal of reducing the total search space needed for producing a phase diagram is therefore proposed. The final phase diagram would still be produced with DFT, but reducing the search space would mean that less DFT runs are needed and the creation of the phase diagrams would therefore be

(30)

4. METHOD

accelerated. This would be done by leveraging the quick prediction of the machine learn-ing model and use these predictions to remove candidate structures that are not of interest (i.e., not likely to be on the convex hull). This can be done in multiple steps since there is a relaxation process with steps that produce more and more accurate structures. Here, this proposed workflow is presented.

Let k = 1, 2, ..., K denote the relaxation step, with k = 1 being the first (least accurate) step and k=K the final step. The workflow for creating a phase diagram using the CGCNN model and DFT would then be the following:

1. Create the full search space via database search (not part of this work, already done by Tholander et al. [18]), lettingSdenote the search space and i PS be the i:th structure. 2. Pick out a set of structures to use for trainingStrainĂS(randomly, but making sure to

include the pure elements to be able to obtain formation energies). Complete the full relaxation process for these structures, and letDk

train=tXik, yKiuiPStraindenote the

train-ing set with relaxed structures up until relaxation step k and their formation energies at the final relaxation step.

3. For the phases not included in the training set, complete the first relaxation step and include these structures inD1

eval =tX1iuiRStrain

4. For k=1, ..., K ´ 1 do:

a) Train an ensemble modelMk_{, using}_Dk

train. Save the mean validation MAE of the

ensemble members as σk.

b) UseMk_{to predict formation energies for the structures in}_Dk eval.

c) Using the actual formation energies for the structures inDk

train and the predicted

formation energies for the structures inDk

eval, create the phase diagram.

d) Let Ehull

i denote the energy above hull for Xki. FormDevalk+1by first finding the set of

structures tXk_i PD_evalk : E_ihull ď3σkuand then continue the relaxation process one

step for these structures. 5. For the structures inDK

eval, obtain also their formation energies with DFT and create

the final phase diagram using the structures fromDK

trainandDevalK and their formation

energies calculated by DFT.

Estimation of time consumption

To estimate the total time consumption when the full relaxation process is not performed, the average time consumption for each step was used. The TAATA dataset consisted of K=

3 steps, with the two first steps called preprerelax and prerelax respectively. In table 4.1, the time consumption as a percentage of the full relaxation process is shown for the first two steps. For example, a structure from TAATA-Hf that was removed after the preprerelax process adds 0.014 to the total time, while a structure that had to complete the full process adds 1.

(31)

4.7. Transfer learning

Dataset Preprerelax (% of time con-sumption for the full relaxation process)

Preprerelax + prerelax (% of time consumption for the full re-laxation process)

TAATA-Hf 1.4 11.1

TAATA-Ti 1.4 10.0

TAATA-Zr 1.1 9.1

Table 4.1: Time consumption on average for the different steps

4.7 Transfer learning

For testing the transferability of the model, ensembles for each phase diagram were trained from scratch, using 80 % of the data as training and validation (the other 20 % were reserved for testing). Then, the models were transferred by continuing the training, using data from another phase diagram. This was done in two different ways:

• Full: all layers were fine tuned.

• Embedding: only the embedding layer (see section 3.4) was fine tuned, the other layers were frozen and not trained.

The fine tuning was done by lowering the learning rate with a factor of 10 when fine tuning all layers, but keeping the learning rate when only training the embedding layer. The low-ering of the learning rate was intended to indicate that it was indeed a fine tuning and not a completely new training of the network. Not lowering the learning rate when training only the embedding layer is motivated by some trial and error: it seemed like the validation error never reached a minimum and that the model continued to improve, even when training for 15 times longer compared to fine tuning the full network. This is further illustrated in section 5.5.

The motivation to transfer only the embedding is that the roll of the embedding layer is to go from the initial representation of an atom to the initial hidden state representation. Since changing from one subset to another means introducing a new atom and therefore a new representation that had not been shown to the model before, it was hypothesized that the embedding layer would have problems since it would have to embed a completely new atom type. Another motivation is to avoid overfitting when the training data availability is low by keeping most of the network fixed.

The approach has some differences compared to Yosinski et al. [23]. Firstly, when freez-ing layers, Yosinski et al. froze the first layers and trained only the last layers, whereas for the CGCNN model the last layers will be frozen and the first (embedding) layer trained. Secondly, Yosinski et al. reinitialize the last layers, which is not done (for any layer) for the CGCNN model. This is since the problems differ in the sense that CGCNN predicts the same output (formation energies) using fairly similar data (structures with Zn, N and a third atom type), whereas Yosinski et al. predicts completely new classes not seen before by the model. It is therefore reasonable to think that all layers could benefit from being pre-trained (since two out of three atom types are the same), and reinitializing layers would work against this.

(32)

(33)

5 Results

This chapter presents the results from the investigations performed in the present project.

5.1 Comparing loss functions

As discussed in section 4.3, two different loss functions, mean squared error (MSE) and mean absolute error (MAE), were tested. The results from these test are shown in figure 5.1. MAE shows to perform slightly better most of the time, but the two options perform in general very similar.

5.2 Comparison with previous results

Since it had been discovered that some datapoints used in previous work were erroneously parsed, it was decided to test the KRR model on the new re-parsed data to see how the performance changed. Displayed in figure 5.2 is the test MAE for the kernel ridge regression model with sinusoidal representation, using the new re-parsed dataset. The new results have been obtained by using the same code and hyperparameters as in the original work. In the figure, it is also compared with the CGCNN model. Comparing with the numbers given in the old work [2], the new dataset improved the performance of the KRR model. For example, predicting on TAATA-Hf, the MAE was 0.52 eV/atom when using 2000 training points of the old data, whereas for the new data, only 240 data points are needed to reach below this error. Similarly, the test MAE for 2000 training points of the old dataset was 0.48 eV/atom for the old TAATA-Ti set, and for the old TAATA-Zr the error was 0.55 eV/atom. With the new dataset, again only 240 training points are needed to reach below this. Using 80 % of the new subsets for training, the test MAE is 0.34, 0.32 and 0.33 eV/atom for TAATA-Hf, TAATA -Ti and TAATA-Zr respectively.

The test MAE for the CGCNN model is lower. Using the smallest training set size, 60 data points (45 for training and 15 for validation), the test MAEs are 0.28 eV/atom for TAATA-Hf, 0.26 eV/atom for TAATA-Ti and 0.25 eV/atom for TAATA-Zr. For the biggest possible training set size, 80 % of the dataset, the test MAEs reach 0.045 eV/atom (TAATA-Hf), 0.042 eV/atom (TAATA-Ti) and 0.047 eV/atom (TAATA-Zr).

(34)

5. RESULTS 102 ₁₀3 Training size 0.05 0.10 0.15 0.20 0.25 MAE [eV/atom]

Test error vs training size, TAATA-Hf

MSE loss L1 (MAE) loss (a) TAATA-Hf 102 ₁₀3 Training size 0.05 0.10 0.15 0.20 0.25 MAE [eV/atom]

Test error vs training size, TAATA-Ti

MSE loss L1 (MAE) loss (b) TAATA-Ti 102 ₁₀3 Training size 0.05 0.10 0.15 0.20 0.25 MAE [eV/atom]

Test error vs training size, TAATA-Zr

MSE loss L1 (MAE) loss

(c) TAATA-Zr

Figure 5.1: Comparing loss functions for the three subsets

102 ₁₀3 Train size 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 MAE [eV/atom]

MAE vs train set size

TAATA-Hf, CGCNN TAATA-Ti, CGCNN TAATA-Zr, CGCNN TAATA-Hf, KRR TAATA-Ti, KRR TAATA-Zr, KRR

Figure 5.2: Comparison between kernel ridge regression and CGCNN. The training set size for CGCNN is the size of training and validation sets combined.

5.3 Using different relaxation levels

To test how different accuracies in the structures affected the performance, different steps from the relaxation process were used for training and testing the model. The results from

(35)

5.4. Phase diagram workflow

these can be found in figure 5.3. It can be seen that the model is able to improve with more accurate structures and more training data. For the biggest training set size (using 60 % of the total dataset for training and 20 % for validation), the preprerelax datasets obtain test MAEs of 0.21 eV/atom (TAATA-preprerelax-Hf and TAATA-preprerelax-Ti) and 0.22 eV/atom (TAATA-preprerelax-Zr). For the prerelax datasets, the test MAE is 0.11 for all datasets. The final relaxation step is the one used in previous experiments. The error for this level is between 0.042 and 0.047 eV/atom and the exact errors can be found in section 5.2. 102 ₁₀3 Training size 0.05 0.10 0.15 0.20 0.25 0.30 0.35 MAE [eV/atom]

Test error vs training size, Hf-Zn-N phase diagram

Final relax Prerelax Preprerelax (a) Hf-Zn-N 102 ₁₀3 Training size 0.05 0.10 0.15 0.20 0.25 0.30 0.35 MAE [eV/atom]

Test error vs training size, Ti-Zn-N phase diagram

Final relax Prerelax Preprerelax (b) Ti-Zn-N 102 ₁₀3 Training size 0.05 0.10 0.15 0.20 0.25 0.30 0.35 MAE [eV/atom]

Test error vs training size, Zr-Zn-N phase diagram

Final relax Prerelax Preprerelax

(c) Zr-Zn-N

Figure 5.3: Comparing the results when using different relaxation levels to train and predict on different phase diagrams

5.4 Phase diagram workflow

Here, the results when using the proposed workflow from section 4.6 are presented. In ta-bles 5.1 to 5.3, the different uncertainty measures σ1and σ2(average validation MAE of the

ensemble members using data from relaxation step preprerelax and prerelax respectivly) can be seen, together with the time consumption compared to only using DFT. The time con-sumption is also visualized in figure 5.4, and is based on the average time for each step (see section 4.6). The numbers show that there is a constant time gain. The total consumption is between 93 % and 46 % of the time required when using only DFT. The numbers only con-sider the CPU time for running the DFT and not for example training the model or using it for predicting. Training two ensembles with the largest training set size takes roughly five hours

(36)

5. RESULTS

in total and prediction is a matter of minutes. As comparison, the full relaxation process of one structure from TAATA-Hf takes slightly more than 9 hours on average.

Train + validation set size σ1 (eV/atom) σ2 (eV/atom) Time consumption (% of only DFT) 60 0.42 0.27 84 120 0.42 0.36 93 240 0.33 0.25 64 440 0.29 0.19 50 880 0.27 0.18 50 1600 0.25 0.15 53 3873 0.22 0.13 84

Table 5.1: TAATA-Hf ML/DFT workflow results

Table 5.2: TAATA-Ti ML/DFT workflow results

Table 5.3: TAATA-Zr ML/DFT workflow results

Figure 5.5 shows how many of the structures that were either removed after a certain relaxation step, or needed the full relaxation. The training data needs the full relaxation process but since this is a fixed number of structures and not dependent of the workflow, it is shown separately.

For all cases except for one, the correct convex hull is found. When using 60 data points for training, the actual stable TiZnN2structure is removed and not included in the final set of

points to be evaluated. Instead, another TiZnN2structure with a formation energy of 0.019

eV/atom higher is included in the final set and with a formation energy low enough to be on the (incorrect) convex hull.

(37)

60 120 240 440 880 1600 3873

Training + validation size

0.0 0.2 0.4 0.6 0.8

Total DFT time compared to only using DFT, TAATA-Hf

(a) TAATA-Hf

60 120 240 440 880 1600 3462

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

Total DFT time compared to only using DFT, TAATA-Ti

(b) TAATA-Ti

60 120 240 440 880 1600 2920

0.0 0.2 0.4 0.6 0.8

Total DFT time compared to only using DFT, TAATA-Zr

(c) TAATA-Zr

Figure 5.4: Visualization of the time consumption from tables 5.1 to 5.3 when using combined ML/DFT workflow compared to only DFT. Time usage of 1 means that it took as long as only using only DFT.

(38)

5. RESULTS

60 120 240 440 880 1600 3873 Training + validation size

0 1000 2000 3000 4000

5000 Number of structures for each relaxation step, TAATA-Hf

Training + validation Preprerelax Prerelax Final relax (a) TAATA-Hf 60 120 240 440 880 1600 3462 Training + validation size

0 1000 2000 3000 4000

Number of structures for each relaxation step, TAATA-Ti

Training + validation Preprerelax Prerelax Final relax (b) TAATA-Ti 60 120 240 440 880 1600 2920 Training + validation size

0 500 1000 1500 2000 2500 3000 3500

Number of structures for each relaxation step, TAATA-Zr

Training + validation Preprerelax Prerelax Final relax

(c) TAATA-Zr

Figure 5.5: The usage of the relaxation process. The diagram shows the number of struc-tures that were relaxed up until a certain relaxation step when using the combined ML/DFT workflow. Training and validation data has been relaxed up until the final step but is shown separately.

(39)

Since metastable structures can exist in nature and since DFT is not completely accurate, structures not on the hull can also be of interest. It has not been the aim of this workflow to include these, but since they could be of interest it would be interesting to see how many are lost. In tables 5.4 to 5.6, the number of structures with an energy above hull ď 0.2 eV/atom are shown.

Train + validation set size # of structures with energy above hull ď 0.2 eV/atom

% of total number 60 223 87 120 252 98 240 244 95 440 245 96 880 249 97 1600 251 98 3462 254 99

Table 5.4: Structures from the TAATA-Hf that are kept for the final evaluation in the DFT/ML workflow and have an energy above hull ď 0.2 eV/atom. The maximal possible number which is compared with in the last column is 256.

% of total number 60 231 94 120 237 96 240 233 95 440 234 95 880 231 94 1600 234 95 3462 241 98

Table 5.5: Structures from the TAATA-Ti that are kept for the final evaluation in the DFT/ML workflow and have an energy above hull ď 0.2 eV/atom. The maximal possible number which is compared with in the last column is 246.

% of total number 60 232 98 120 207 88 240 209 89 440 225 95 880 220 93 1600 225 95 2920 222 94

Table 5.6: Structures from the TAATA-Zr that are kept for the final evaluation in the DFT/ML workflow and have an energy above hull ď 0.2 eV/atom. The maximal possible number which is compared with in the last column is 236.

Graph neural networks for prediction of formation energies of crystals

Linköping University | Department of Physics, Chemistry and Biology

Master’s thesis, 30 ECTS | Applied Physics and Electrical Engineering - international

2020 | LITH-IFM-A-EX–20/3822–SE

Graph neural networks for

pre-diction of formation energies of

crystals

Graf-neuronnät för prediktion av kristallers formationsenergier

Filip Ekström

Upphovsrätt

Copyright

Acknowledgments

Contents

List of Figures

List of Tables

1

Introduction

1.1

Material stability and formation energies

1.2

High-throughput search for new materials

Database search for candidate structures

1.3

Previous work

1.4

Areas lacking investigation

1.5

Aim

1.6

Research questions

1.7

Delimitations

2

Phase Diagrams and Crystals

2.1

Phase Diagrams

0.0

0.2

0.4

0.6

0.8

1.0

Fraction

2.0

1.5

1.0

0.5

0.0

Formation energy (eV/fu)

Ti

N

2

Ti

2

N

TiN

Ti

N

Ti

N

2.2

Crystals and Unit Cells

3

Machine Learning

3.1

Neural Networks

Representation learning and Convolutional Neural Networks

3.2

Training of Neural Networks

Gradient Descent

Overfitting, early stopping and regularization

3.3

Graph Neural Networks

A maximally powerful graph neural network

3.4

Crystal Graph Convolutional Neural Network

Summary of the Crystal Graph Convolutional Neural Network

3.5

Transfer Learning

4