
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Computer Science

2021 | LIU-IDA/LITH-EX-A--21/002--SE

Exploring Transformer-Based Contextual Knowledge Graph Embeddings

How the Design of the Attention Mask and the Input Structure Affect Learning in Transformer Models

Oskar Holmström

Supervisor: Jenny Kunz
Examiner: Marco Kuhlmann


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

The availability and use of knowledge graphs have become commonplace as a compact storage of information and for lookup of facts. However, the discrete representation makes the knowledge graph unavailable for tasks that need a continuous representation, such as predicting relationships between entities, where the most probable relationship needs to be found. The need for a continuous representation has spurred the development of knowledge graph embeddings. The idea is to position the entities of the graph relative to each other in a continuous low-dimensional vector space, so that their relationships are preserved, and ideally leading to clusters of entities with similar characteristics. Several methods to produce knowledge graph embeddings have been created, from simple models that minimize the distance between related entities to complex neural models. Almost all of these embedding methods attempt to create an accurate static representation of each entity and relation. However, as with words in natural language, both entities and relations in a knowledge graph hold different meanings in different local contexts.

With the recent development of Transformer models, and their success in creating contextual representations of natural language, work has been done to apply them to graphs. Initial results show great promise, but there are significant differences in architecture design across papers. There is no clear direction on how Transformer models can best be applied to create contextual knowledge graph embeddings. Two of the main differences in previous work are how the attention mask is applied in the model and what input graph structures the model is trained on.

This report explores how different attention masking methods and graph inputs affect a Transformer model (in this report, BERT) on a link prediction task for triples. Models are trained with five different attention masking methods, which to varying degrees restrict attention, and on three different input graph structures (triples, paths, and interconnected triples).

The results indicate that a Transformer model trained with a masked language model objective has the strongest performance on the link prediction task when there are no restrictions on how attention is directed, and when it is trained on graph structures that are sequential. This is similar to how models like BERT learn sentence structure after being exposed to a large number of training samples. For more complex graph structures it is beneficial to encode information about the graph structure through how the attention mask is applied. There are also some indications that the input graph structure affects the model's ability to learn underlying characteristics of the knowledge graph it is trained on.


Acronyms

BERT - Bidirectional Encoder Representations from Transformers
CoKE - Contextualized Knowledge Graph Embedding
CoLAKE - Contextualized Language and Knowledge Embedding
FFN - Feed-Forward Neural Network
GAT - Graph Attention Network
GAAT - Graph Attenuated Attention Network
KG - Knowledge Graph
KGE - Knowledge Graph Embedding
LSTM - Long short-term memory
MLM - Masked language model
MRR - Mean reciprocal rank
NLP - Natural Language Processing
NSP - Next sentence prediction
PE - Positional encoding
RoBERTa - Robustly optimized BERT approach
t-SNE - t-Distributed Stochastic Neighbor Embedding


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Background
  2.1 Knowledge Graphs
  2.2 Knowledge Graph Embeddings
  2.3 BERT
  2.4 Fine-tuning
3 Related Work
  3.1 Contextual knowledge graph embeddings
  3.2 Attention masking methods
  3.3 Input graph structures
4 Method
  4.1 Ensuring replicability
  4.2 Model architecture
  4.3 Model training
  4.4 Model evaluation
  4.5 Datasets
  4.6 Training details
  4.7 Experiments
5 Results
  5.1 Comparison of attention masking methods
  5.2 Comparison of paths
  5.3 Comparison of connected triples
  5.4 Comparison of input graph structures
6 Discussion
  6.1 Results discussion
  6.2 Method discussion
7 Conclusion
  7.1 Future work


List of Figures

2.1 Graph example from the Freebase knowledge graph.
2.2 Illustration of the BERT model architecture.
2.3 Visualization of self-attention for one attention head in the first attention layer.
2.4 Visualization of multi-head attention for the first Transformer layer.
2.5 Input representation for a sequence of tokens.
2.6 Visualization of the wave functions, created for different dimensions, that are used for the positional encoding.
2.7 A BERT model with a masked input and classification layers for masked language model and next sentence prediction.
2.8 BERT with additional token classification layers, fine-tuned for part-of-speech tagging.
3.1 Attention masking method for GAT. Entities are allowed to attend to adjacent entities.
3.2 Attention masking method for BERT-MK. Nodes are allowed to attend to adjacent nodes, where relations are treated equally to entities in the input.
3.3 Attention masking method for CoLAKE. Nodes are allowed to attend to adjacent nodes in either direction of the graph, and relations are treated equally to entities in the input.
3.4 An illustration of the input graph for BERT-MK.

3.5 An illustration of the input graph for CoLAKE and the soft position index applied to the input.

4.1 An illustration of a triple, consisting of a head, relation and tail node.
4.2 An illustration of a path with length k = 1.
4.3 An illustration of a connected triple that consists of two interconnected triples.
4.4 The positional encoding for an input graph from the triple class.
4.5 The positional encoding for an input graph from the path class.
4.6 The positional encoding for an input graph from the connected triples class.
4.7 An attention mask that allows for full attention, applied to a path.
4.8 An attention mask that allows for attention in the one-hop neighborhood in the direction of the graph.
4.9 An attention mask that allows for attention in the one-hop neighborhood with no regard for the direction of the graph.
4.10 An attention mask that allows for attention in the two-hop neighborhood in the direction of the graph.
4.11 An attention mask that allows for attention in the two-hop neighborhood with no regard for the direction of the graph.


List of Tables

4.1 Statistics on entities, relations, and the number of triples in train, validation and test partitions.
5.1 Results for models where every node can attend to every other node in the input.
5.2 Results for models where nodes can only attend to adjacent nodes in the direction of the graph.
5.3 Results for models where nodes can attend to adjacent nodes, disregarding the direction of the graph.
5.4 Results for models where entities can attend to all nodes in their two-hop neighborhood in the direction of the graph.
5.5 Results for models where entities can attend to all nodes in their two-hop neighborhood, disregarding the direction of the graph.
5.6 The results for models trained on paths of different lengths.
5.7 The results when n paths have been created per triple.
5.8 Comparison of connected triples, of varying size and samples per node. Models were trained on FB15k and with an attention mask that allows for full attention between nodes.
5.9 Comparison of connected triples, of varying size and samples per node. Models were trained on FB15k and with an attention mask that only allows for attention to adjacent nodes.
5.10 The results for models trained on different input graph structures from FB15k.
5.11 The results for models trained on different input graph structures from WN18.
5.12 The results for models trained on different input graph structures from FB15k-237.
5.13 The results for models trained on different input graph structures from WN18RR.
5.14 The number of training samples for each input graph structure when created from


1 Introduction

1.1 Motivation

In the past ten years, deep learning developments have made astounding improvements to a wide range of areas, including natural language processing (NLP). In the NLP domain, breakthroughs in the past few years have been brought about with the introduction of Transformer model architectures. It began with the seminal paper “Attention is all you need” [17], where Vaswani et al. proved the effectiveness of self-attention for machine translation. The attention mechanism allows the model to select what features are important for the specific task at hand, much like a human that can attend to different visual features. Transformer-based language models that learn contextual word representations have become widespread and provide state-of-the-art performance on several NLP tasks [14, 4, 25, 11].

Over the past decade, there has also been an increased interest in knowledge graphs (KG). Facts represented in text are bound to uphold sentence structure to convey information about entities and their relations. Knowledge graphs instead consist of entities and the relationships between them; it is a more explicit and densely structured representation of knowledge. This makes the graph useful as compact storage of information, and its discrete structure makes it efficient for lookup of facts. However, the discrete representation makes the knowledge graph unavailable for tasks that need a continuous representation, such as predicting relationships between entities, where the most probable relationship needs to be found. One method used to create a continuous representation is to embed the discrete information into a vector space. When applied to a knowledge graph, a knowledge graph embedding (KGE) is created. The idea is to position the discrete entities relative to each other in a continuous low-dimensional vector space so that their relationships are preserved. This would ideally lead to clusters with entities of similar characteristics.

Several methods to produce KGEs have been created, from simple models that minimize the distance between related entities to complex neural models. Almost all of these embedding methods attempt to create an accurate static representation of each entity and relation. However, as with words in natural language, both entities and relations in a knowledge graph hold different meanings in different local contexts. Some previous work exists on applying Transformer models to create contextual knowledge graph embeddings [19], recreating what has been achieved for natural language. Initial results show promise, but there are significant differences in architecture design across papers, and there is no clear direction on how Transformer models can best be applied to create contextual knowledge graph embeddings.


Two of the main differences between previous works are how attention in the model is applied to the input and which graph structures the model is trained on.

One of the core architectural elements of a Transformer model is the attention mechanism. The idea is that each token in an input sequence can be described by itself and other tokens in the sequence, and attention regulates how much information from each token is used in that description. For each input token that passes through an attention layer, an output is created that combines different tokens in the input sequence. When training, the Transformer model learns what it should focus its attention on to create output representations useful for some training objective. For raw text, the model is commonly allowed to attend to the whole input sequence, i.e., it is possible to use all other tokens to describe a specific input token, but it is also possible to allow a token to only attend to other specific tokens in the sequence by applying an attention mask. When it comes to applying attention to graphs, the most common method has been to apply an attention mask that reflects the graph structure, the idea being that information about the graph structure is then preserved in the model. Previous work uses attention masks in different ways: Graph Attention Networks (GAT), where entities can only attend to entities they have a relation to [18]; Graph Attenuated Attention Networks (GAAT), where entities show more attention to neighboring entities that are closer than several relations away [20]; and Transformer models that incorporate graphs into text sequences and apply attention (to the graph parts of the sequence) to entities that there are both in- and outgoing relations to [15, 10]. Meanwhile, Transformer models without an explicit attention mask have been applied to graph structures consisting of paths and have shown strong performance on link prediction tasks [19]. Therefore, it is of interest to explore whether explicit attention masks are needed or if the model, through training, can better learn how to direct attention to embed graph information. It is also of interest to explore different attention masking methods for different input graph structures.

Clarifying which architecture designs are useful is essential to understand whether Transformer models are a viable method to create knowledge graph embeddings, and to inspire future research on how to improve them to create even better embeddings.

1.2 Aim

This report explores how different design decisions related to the input graph structure and the attention mask affect the performance of a Transformer model (in this report, BERT) on a triple link prediction task.

1.3 Research questions

1. How does the attention masking method affect model performance on link prediction for triples?

Attention in a Transformer model is applied to each token in the input. The idea is that by directing attention to all tokens in the input sequence the model can describe a token by its context. The attention masking method decides what tokens in the input the attention is allowed to be directed to.

2. How does the training input graph structure affect model performance on link prediction for triples?

When a Transformer model is applied to a graph, the graph that is trained on can take many forms. Practically, this means that the nodes in the graph have different numbers of relations to other nodes. Examples of graphs are cycles (where you can travel the edges in the graph and return to the starting node), trees (graphs with no cycles), and paths (every node has a degree of at most two).


3. How does the training input graph structure affect model performance when trained on datasets with different characteristics?

The models in this report are trained on knowledge graphs. These graphs can have different characteristics, such as different average in/out degrees and subgraph structures that occur more or less frequently. This means that it could be possible for different input graph structures to capture these underlying characteristics in different ways.

1.4 Delimitations

To create a contextual representation, the model created for this report is based on BERT. Several different Transformer model architectures exist today, some of which even outperform BERT in several respects, but the model was chosen due to comparable implementations in previous work. The work in this report is also limited by a lack of computational resources (limits in time to use the necessary hardware). As a consequence, there is an upper limit on the number of possible models that can be trained and, therefore, no hyperparameter tuning of the model or optimizer is done to improve model performance.


2 Background

2.1 Knowledge Graphs

A knowledge graph (KG) is a structured collection of entities and relations, representing entities as vertices and relations as directed edges. The graph can be described as a set of interconnected triples, as can be seen in Figure 2.1. The figure presents a selected subset of the triples that the entity 2001:_A_Space_Odyssey belongs to in the Freebase dataset [1]. One such directed triple in the graph is Stanley_Kubrick → Director → 2001:_A_Space_Odyssey. The triple can be defined as consisting of a head (h), relation (r), and tail (t): h → r → t. In this example, Stanley_Kubrick is the head, 2001:_A_Space_Odyssey is the tail, and Director is the relation from the head to the tail. As can be observed in the figure, there is also an inverse relation, Directed_By, in the opposite direction.

Figure 2.1: Graph example from the Freebase knowledge graph.

In practice, a knowledge graph is an evolving and concentrated collection of knowledge, often used for a specific organization or knowledge domain. It can be populated with facts manually or automatically, from, for example, text documents. Today, several large-scale public knowledge graphs exist and cover several knowledge domains, e.g., DBpedia [8], which contains millions of entities extracted from Wikipedia, and WordNet [13], which contains information on the relationships between words. Compared to sentences, these structures are denser in their representation of information. For example, in FB15k [2], a partition of the Freebase knowledge graph, a node is on average related to 39.6 other entities. When taking into account those entities' neighbors, the amount of information grows exponentially.
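To make the triple representation concrete, the following minimal sketch (illustrative only, not from the thesis code; the Genre triple is invented for the example) stores a small knowledge graph as a set of (head, relation, tail) tuples and computes an entity's out-degree:

```python
# A tiny knowledge graph stored as (head, relation, tail) triples,
# mirroring the Freebase example around 2001:_A_Space_Odyssey.
triples = {
    ("Stanley_Kubrick", "Director", "2001:_A_Space_Odyssey"),
    ("2001:_A_Space_Odyssey", "Directed_By", "Stanley_Kubrick"),
    ("2001:_A_Space_Odyssey", "Genre", "Science_Fiction"),   # invented for illustration
}

def out_degree(entity):
    """Number of triples in which the entity appears as the head."""
    return sum(1 for h, _, _ in triples if h == entity)

print(out_degree("2001:_A_Space_Odyssey"))  # -> 2
```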


2.2 Knowledge Graph Embeddings

A knowledge graph is inherently discrete: there is a clear distinction between entities and the relationships that bind them together. This representation is useful for some tasks, such as lookup of facts, storage, and updating the graph. However, such a representation prohibits the knowledge graph from being used directly for tasks that require a continuous representation, for example, to find the most likely candidate when predicting an entity or to integrate the knowledge in the graph into a modern language model. To create a continuous representation, it is necessary to embed the entities and relations in a continuous vector space, creating a knowledge graph embedding. The embedding aims to position the entities and relations relative to each other so that characteristics in the graph are preserved. One such characteristic could be that countries are positioned closer to each other than to car manufacturers (although this is just an illustrative example).

Several different classes of methods have been developed to create knowledge graph embeddings. One of the main differentiators between methods is how they separate entities and relations in vector space and what underlying characteristics they try to capture. Most models create a static representation for each entity, which means that it will always have the same representation. Words in a language, and entities and relations in a knowledge graph, can have different meanings depending on the context they exist in. This is the characteristic that a contextual representation tries to capture.

2.3 BERT

BERT (Bidirectional Encoder Representations from Transformers), which closely resembles the Transformer architecture presented in [17], brought on a paradigm shift in the field of natural language processing. The model’s viability proved itself with state-of-the-art performance on a wide range of NLP tasks, with a relatively small amount of task-specific training needed, compared to training from scratch. This means that the model can be trained on a large dataset of text to learn general language understanding. The model can then be fine-tuned on specific tasks where its language understanding makes it possible to adapt to these new tasks efficiently.

BERT introduced two main innovations beyond the original transformer architecture. It discards the decoder layers for additional encoder layers to create an even deeper architecture. The model then trains in a self-supervised fashion on two pre-training objectives designed to enable the model to capture the bi-directional context for words. Due to the training being self-supervised the model can train on hundreds of gigabytes of text data.

Following is a top-down description of the BERT model architecture, based on its original paper [4] and the Transformer architecture from [17]. For a more detailed understanding of the Transformer architecture and its implementation details, see The Annotated Transformer.

Transformer layer

The main component of BERT is the transformer layer. The base version of the BERT architecture consists of a stack of 12 transformer layers, as can be seen in Figure 2.2. Each layer acts as an encoder. The layer transforms an input representation into an output representation, which will then be passed into the next Transformer layer, and so on, until a final output representation is passed from the model. The idea is that in each layer, the information in the input sequence is encoded, combined in different ways, so that the final output is useful for the training objective. [4]


Each transformer layer has the same internal architecture, as can be seen in Figure 2.2, where the input is passed through a multi-head attention layer followed by a standard 3-layer feed-forward neural network (FFN). The purpose of the multi-head attention layer is to create, for every input token, an encoding that is informed by the other tokens in the sequence. How much information should be encoded from a specific token is decided by the amount of attention given to that token, which is a numerical value. Through training the model learns to better direct its attention. [4]

Figure 2.2: Illustration of the BERT model architecture.

Multi-head Attention

Multi-head attention is an expansion of self-attention. Therefore, to understand multi-head attention, it is essential to first comprehend self-attention.

The self-attention layer takes a sequence of inputs, e.g., token embeddings of words in a sentence. The purpose of self-attention is to inform the encoding of an input with the other inputs in the same sequence. However, the inputs are of different importance and need to be shown attention to varying degrees. The self-attention mechanism, therefore, learns what inputs should inform the encoding. [17]

The image in Figure 2.3 visualizes how attention is given to different words in the input sequence Sound of one hand clapping for the token clapping, in the first layer of the model.

Figure 2.3: Visualization of self-attention for one attention head in the first attention layer.


Following is a description of the self-attention mechanism in two steps. The first is to project the input vectors into a Query (Q), Key (K), and Value (V) vector space by multiplying the input vectors with a weight matrix corresponding to each vector space:

Q = X \cdot W^Q \quad (2.1)

K = X \cdot W^K \quad (2.2)

V = X \cdot W^V \quad (2.3)

The second step consists of operations with the Query, Key, and Value matrices:

1. For each input, an attention score is calculated: Q \cdot K^T. The score indicates how much an input should be considered in creating the encoding.

2. The scores are divided by the square root of the Key vectors’ dimension to create more stable gradients.

3. The score is normalized with softmax, so that words with low scores that do not inform the encoding have values close to zero.

4. The encoded representation, z, is calculated for a specific token by multiplying the normalized scores with the Value vector.

The complete calculation, as it is presented by Vaswani et al. in [17], can be written as:

Z = \mathrm{softmax}\left(\frac{Q \cdot K^T}{\sqrt{\dim_K}}\right) \cdot V \quad (2.4)
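The two steps and equation 2.4 can be condensed into a few lines of code. The following NumPy sketch is illustrative (it is not BERT's implementation) and already includes the optional attention mask that becomes important later in this report:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V, mask=None):
    """Scaled dot-product self-attention (equation 2.4).
    X: (seq_len, d_in); W_Q, W_K, W_V: (d_in, d_k);
    mask: optional (seq_len, seq_len) matrix of 0/1, where 0 means "may not attend to"."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # projections, equations 2.1-2.3
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # attention scores, scaled by sqrt(dim_K)
    if mask is not None:
        scores = np.where(mask == 1, scores, -1e9) # masked-out positions get ~zero weight after softmax
    return softmax(scores) @ V                     # weighted sum of the Value vectors
```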

Multi-head attention and self-attention differ in one significant respect. Self-attention has a single representation space for each projection into Query, Key, and Value vector space by having one weight matrix for each. Multi-head attention creates multiple representation subspaces by assigning a set of such projection matrices to each attention head. By having multiple heads, the learning of self-attention is stabilized. Each attention head has a smaller version of the three weight matrices. The total number of weights in the projection matrices remains the same, only divided amongst the different attention heads. Each attention head then applies self-attention, as described above, to the input. [17]

How attention is applied over the different heads is shown in Figure 2.4, where a different color represents each head.

Figure 2.4: Visualization of multi-head attention for the first Transformer layer.


As each attention head outputs a unique representation, every input will have several representations. These are concatenated into a single vector to provide one input representation for the FFN. Before passing the output, the concatenated vectors are multiplied by a weight matrix. In this way, the model learns how to combine the attention heads' outputs to provide the best combined representation. [17]
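Building on the self-attention sketch above, multi-head attention can be illustrated as follows; this is again a simplified sketch in which each head owns a smaller set of projection matrices and an assumed output matrix W_O mixes the concatenated head outputs:

```python
import numpy as np

def multi_head_attention(X, heads, W_O, mask=None):
    """heads: list of (W_Q, W_K, W_V) tuples, one per attention head.
    W_O: (num_heads * d_head, d_model) matrix that learns how to combine
    the heads' outputs into one representation per input token.
    Uses self_attention() from the sketch after equation 2.4."""
    outputs = [self_attention(X, W_Q, W_K, W_V, mask) for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=-1) @ W_O   # concatenate heads, then project
```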

Input representation

The input to BERT is a sequence of tokens belonging to one or more segments of text. Also, there are two unique tokens:

• [CLS], the classification token. It is the first token of the input sequence and is used for classification purposes. The [CLS] token’s output representation is commonly used in a downstream task as the input for a classification layer.

• [SEP], the separator token. It indicates a separation between segments that are part of the input sequence, e.g., a question and an answer segment.

Each token in the input sequences is then encoded as the sum of three different embeddings: a word embedding, segment embedding, and positional encoding, as shown in Figure 2.5.

Figure 2.5: Input representation for a sequence of tokens.

The number of words that can possibly exist is vast, and it is infeasible for a vocabulary to handle all rare words and variations of words. To handle this, WordPiece, a word embedding, splits out-of-vocabulary words into smaller pieces [23]. The process repeats until all parts of the words have an assigned embedding. BERT uses a WordPiece embedding with a 30,000 token vocabulary, with 60–80% being whole words and the remainder word pieces [4].

The segment embedding and positional encoding (PE) are required because there is no architectural property of BERT that understands the order of the input [4]. This is in contrast to sequential models such as LSTMs. The order of the input is often crucial. For example, in English and most other languages, the order of the words conveys a sentence’s meaning. Therefore, it is necessary to encode the sequence order in the input. This is done on both the token and sentence level. The segment embedding provides information on which segment a token belongs to, e.g., if a token is part of a question or the following sentence with an answer [4].

The positional encoding provides information on the token's absolute position in the sequence and its relative distance to the other tokens in the input sequence. It is calculated with a sine and cosine function, depending on whether the current dimension index is odd or even, as seen in equations 2.5 and 2.6. What happens is that for each dimension, a unique wave function is created. It is unique in regard to its offset and frequency, as can be seen in Figure 2.6. [4]


PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right) \quad (2.5)

PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right) \quad (2.6)

pos = the position of the word in the sequence
i = the index of the dimension being calculated
d_{model} = the dimension of the embedding

Figure 2.6: Visualization of the wave functions, created for different dimensions, that are used for the positional encoding.
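Equations 2.5 and 2.6 translate directly into code. The sketch below (illustrative, assuming an even embedding dimension) builds a matrix with one positional encoding vector per position:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sine, odd dimensions cosine
    (equations 2.5 and 2.6). Returns an array of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                  # token positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]               # index of each sine/cosine pair
    angles = pos / np.power(10000, 2 * i / d_model)    # one wave function per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # equation 2.5
    pe[:, 1::2] = np.cos(angles)                       # equation 2.6
    return pe
```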

Pre-training

BERT obtains its general language capabilities during the pre-training phase, where it trains on a vast amount of raw textual data. The two datasets used are BookCorpus (800M words) and the text passages from English Wikipedia (2,500M words). As expected, the amount of computational resources and time to train the model is significant: the original BERT model trained on 4 Cloud TPUs for four days. [4]

To utilize the extensive unlabeled training data, BERT is trained in a self-supervised fashion. The training consists of two novel pre-training objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP). These two tasks nudge the model to learn a contextual representation from tokens in both directions in the sequence. In Figure 2.7 the input sequence is masked, with one token replaced with a mask token, one replaced by a random word, and one kept intact, and the output representations are passed through classification layers for either MLM or NSP. [4]

Figure 2.7: A BERT model with a masked input and classification layers for masked language model and next sentence prediction.


MLM is a variation of a cloze task, where a masked word is predicted. Masking a word means that it is replaced by a specific token that hides any information about the original token. To predict a masked word, the model has to use attention to consider the left and right context. Note that predicting the correct word out of all possibilities is hard, even for a human. Therefore, the purpose of pre-training is not to achieve perfect predictions. Instead, it is to improve the model's language understanding capabilities. [4]

A problem with masking is that the mask token is only seen during pre-training, which might confuse the model when applied to other tasks. A token replacement scheme, where a token is replaced by a random token or left intact, is used to make the pre-training more similar to fine-tuning conditions. [4]

BERT’s masking procedure is as follows:

• 12% of the tokens in a given sequence are replaced by a [MASK] token.
• 1.5% of the tokens are replaced by a random token.
• 1.5% of the tokens are not changed but still predicted.
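These percentages correspond to the common description of BERT's scheme: 15% of the tokens are selected for prediction, and of those, 80% are masked, 10% replaced by a random token, and 10% left intact (0.15 × 0.8 = 12%, 0.15 × 0.1 = 1.5%). A rough illustrative sketch of that scheme, not BERT's actual implementation:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab):
    """Return (masked_tokens, labels); labels is None at positions that are not predicted."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < 0.15:            # select 15% of the tokens for prediction
            labels[i] = token
            r = random.random()
            if r < 0.8:                       # 80% of the selection: replace with [MASK] (12% overall)
                masked[i] = MASK_TOKEN
            elif r < 0.9:                     # 10% of the selection: random token (1.5% overall)
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the original token but still predict it (1.5% overall)
    return masked, labels
```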

NSP is the task of predicting if a pair of segments follow each other. Even though the task refers to sentences, the segments are any two spans of text. The second segment remains the same or is replaced randomly (p= 50%) with a segment from the training data. For the NSP task, the output embedding for the [CLS] token is used as input for a classifier. Because of this, information about the whole sequence is pooled in the [CLS] output embedding, making it useful for downstream tasks. The NSP and MLM pre-training objectives train concurrently. Therefore, during pre-training, the training loss is the sum of the mean MLM likelihood and the NSP likelihood. [4]

2.4 Fine-tuning

A pre-trained BERT model is only capable of MLM and NSP out of the box. With more training, the model can be fine-tuned to excel in specialized tasks. For fine-tuning the model, a classification layer is appended to the model’s final transformer layer that uses the output representations. The computational resources and time needed for fine-tuning are marginal compared to training the full model, making it feasible to apply a pre-trained model to various tasks. In Figure 2.8 an example is shown for the task of classifying what part-of-speech a token belongs to. It only uses the output representations for the word tokens and not the [CLS] token as it is a token classification task. [4]

Figure 2.8: BERT with additional token classification layers, fine-tuned for part-of-speech tagging.
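With the HuggingFace Transformers library, attaching such a token classification head to a pre-trained BERT takes only a few lines; the checkpoint name and the number of labels below are illustrative placeholders, not values from the thesis:

```python
from transformers import BertTokenizerFast, BertForTokenClassification

# "bert-base-uncased" and num_labels=17 (e.g. a universal POS tag set) are placeholder choices.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=17)

inputs = tokenizer("Sound of one hand clapping", return_tensors="pt")
logits = model(**inputs).logits   # one row of tag logits per word piece; fine-tuned on a labeled dataset
```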


3 Related Work

3.1 Contextual knowledge graph embeddings

Language can be contextual in different ways. Words in a language can have different meanings depending on the context, e.g., store can be a place to shop or the act of stowing something. It is also the case that real-world entities, described in language, can exist in different contexts, i.e., Joe Biden is both a president and a father. As Wang et al. noted in [19], there has been relatively little formal discussion about the contextual nature of KGs. However, it can be assumed that a knowledge graph's entities and relationships should be no less contextual than text. It seems reasonable that entities in a KG can exist in different graph contexts in the same way as in textual contexts. It has been shown that relationships can have different meanings depending on context [24]. For example, the relationship HasPart can represent both a composition (Table, HasPart, Leg) and a location (Atlantic, HasPart, NewYorkBay) [24]. A contextual knowledge graph embedding aims to separate the different contextual meanings in the embedding vector space.

CoKE is one of the first models to create contextual KGEs for both entities and relationships [19]. Inspired by BERT, CoKE uses the Transformer layers to encode an input sequence that is a graph. The input consists of directed paths from the graph. Both entities and relationships are represented as vertices in the path and are given a positional encoding. The model is then trained on a task that resembles the masked language model (MLM), where either the head or tail entity of the path is masked and to be predicted. The closest work to contextualized KGEs, up to that point, consisted of creating different relationship representations for an entity [21, 9, 6]. However, this does not create a contextual KGE for both entities and relationships.

CoKE achieves close to state-of-the-art performance on several link prediction tasks. The contextuality of the output representations is shown by applying t-SNE [12] as a dimensionality reduction and visualizing the dimension-reduced embeddings. The visualizations clearly show how both entities and relations can have different representations depending on context and that there is an overlap between similar contexts.


3.2 Attention masking methods

The models that apply attention to graphs tend to differ in how attention is applied, and arguments have been made for using attention masks to incorporate information about the input graph structure into the model. The idea is to only allow an entity in a graph to attend to other specific entities in the graph input. The first model to introduce this idea was the Graph Attention Network (GAT) [18]. Previous neural knowledge graph embedding methods, such as Graph Convolutional Networks [7], created an embedding for an entity by training on the entities adjacent to it, and gave equal weight to all adjacent entities. GAT instead uses the attention mechanism in a process analogous to adding weights to the edges between nodes, reflecting how important the related entity is for creating the embedding. The attention masking applied in GAT is illustrated in Figure 3.1, where each node in the input can only attend to nodes in the direction of their relation. The dotted arrows indicate which nodes the attention can be directed to. This masking forces the model to mimic the graph structure exactly in how attention is applied. Information beyond the adjacent nodes is not available directly in the encoding of an entity. This exact replication of graph structure in the attention mask can be viewed as too prohibitive: there might be information beyond the adjacent entities that is useful.

Figure 3.1: Attention masking method for GAT. Entities are allowed to attend to adjacent entities.

The attention masking technique in GATs, and the idea that the attention mask should reflect graph structure, has become the prevalent choice of attention masking when applying Transformer models such as BERT [5] and RoBERTa [15] to graphs. BERT-MK directly uses the same masking method as GAT while also treating entities and relations equally in the input [5]. The main difference to GAT can be viewed in Figure 3.2, where each entity can only attend to its outgoing relations, and the relations can only attend to the entities they lead to. This seems to severely limit the amount of information that a node can incorporate in its encoding. Information about entities can only be passed to another entity after being encoded into an intermediate relation's output representation. Therefore, the layers will act more like a message-passing system, where information is passed to adjacent nodes for each layer.


Figure 3.2: Attention masking method for BERT-MK. Nodes are allowed to attend to adjacent nodes, where relations are treated equally to entities in the input.

CoLAKE [15] differs from the previous models as it incorporates graphs into sentences. While there is a difference in the input, in the sense that there is both textual and graph data, the input can still be viewed as a graph where the tokens in the sentence are all adjacent to each other (further described in Section 3.3). The model applies an attention mechanism that is similar to the one in BERT-MK. The main difference is that attention is allowed to relations that are part of an incoming edge. The attention masking method is shown in Figure 3.3. The consequence of this is that information from entities can only inform the encoding through a procedure of message passing, similar to BERT-MK. The main difference is that the node is now exposed to more of the graph, and all nodes can attend to at least one other node. In the implementation of GAT, and consequently BERT-MK, an entity with no outgoing relations does not have any adjacent node to attend to.

Figure 3.3: Attention masking method for CoLAKE. Nodes are allowed to attend to adjacent nodes in either direction of the graph, and relations are treated equally to entities in the input.


CoKE uses the same attention masking method as BERT, where attention is allowed between all tokens. The model shows strong performance on link prediction tasks, but it should also be noted that the model only trains on graph structures that can be viewed as a sequence. However, it raises the question of whether the model, during training, can learn how to direct its attention. As the GAT attention masking method, and variations upon it, have become a common inspiration for models, it is of interest to explore if it is valid for Transformer models such as BERT.


3.3 Input graph structures

In conjunction with the different attention masking methods, models are also trained on different input graph structures. Therefore, it is not clear if specific attention masking methods are more suitable for certain graph structures.

CoKE, which allows for attention between nodes, only trains on paths. Each node in a path has at most one outgoing edge and one incoming edge in a directed graph. The shortest type of path that the model trains on is the triple (h → r → t), and the longest is a path consisting of at most seven nodes. These sequential graphs are similar to the inputs that the original BERT models train on, i.e., sequences of tokens. This means that the positional encoding can be applied to paths in the same manner as it is applied to the words in an input sentence.

BERT-MK is claimed to be agnostic to the input graph structure and can as such train on graph structures other than paths. It is trained on a graph consisting of four triples, where one node is the head in two of the triples and the tail in the other two. The input graph is presented in Figure 3.4.

Figure 3.4: An illustration of the input graph for BERT-MK.

CoLAKE, as mentioned previously, combines a sentence input with entities and relations from a graph. It detects entities in a sentence and extracts triples that the entity belongs to from a knowledge graph. An entity that belongs both to the sentence and the triples is called an anchor node. It is at these anchor nodes that the sentence and knowledge graph triples are joined. The new graph is a combination of the triples and the sentence, where the sentence part can be viewed as a fully connected graph. An example of the input graph is shown in Figure 3.5, where the anchor nodes are shaded. To further inform the model of the graph structure, a soft positional encoding is applied. This means that the input graph tokens that belong to a triple have position indices that follow the anchor node's position index, as shown in Figure 3.5.

Figure 3.5: An illustration of the input graph for CoLAKE and the soft position index applied to the input.


4 Method

4.1 Ensuring replicability

A fixed random seed (seed = 42) was used in all experiments to ensure that processes that rely on random number generation, such as weight initialization, can be reproduced. In order to enable exact reproductions of the experiments, and to promote transparency of the method, the code has been made publicly available.
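As a rough sketch of what fixing the seed typically involves in a PyTorch and HuggingFace setup (the exact calls in the thesis code may differ), the relevant random number generators can be seeded as follows:

```python
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)                 # Python's built-in RNG, e.g. for data sampling
np.random.seed(SEED)              # NumPy
torch.manual_seed(SEED)           # PyTorch CPU RNG, used for weight initialization
torch.cuda.manual_seed_all(SEED)  # PyTorch GPU RNGs
```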

4.2 Model architecture

The method for creating a contextual KGE takes inspiration from the work on contextual word embeddings with BERT [4]. Instead of training on a word sequence, the contextual KGE was trained on an input sequence of nodes and edges from a graph. The nodes (entities) and edges (relations) were selected from a more extensive knowledge graph. A general overview of the model can be described as follows:

1. A subgraph of entities connected by relations was selected from a knowledge graph.

2. An input representation of that subgraph was created. This was done by combining an embedding for the entities and relations with a positional embedding.

3. The input representation was passed through a stack of Transformer layers to produce entity and relation embeddings, and the final output was then passed through a final classification layer for a masked language model training objective.


Construction of input graphs

Three different classes of graph structures were created for the input sequence: triples, paths, and connected triples. This was to explore how the input structure affects the model performance. It should be noted that a triple is a path that consists of three nodes. The choice to separate them into two classes was based on two reasons: 1) to be transparent about the design decisions needed to create the paths and 2) to more clearly see the effects of attention masking methods on the two structures.

Triples

Triples consist of three nodes: a head entity, a relation, and a tail entity, as shown in Figure 4.1. They were created by selecting two adjacent nodes from the knowledge graph, where nodes represent entities and their edge the relationship.

Figure 4.1: An illustration of a triple, consisting of a head, relation and tail node.

Paths

The path input is an extension of the triple, where an additional relation and entity, related to the tail entity, are appended to the tail entity. The path can be viewed as two triples, where the tail entity in one triple is also the head entity of the other triple. Paths can be lengthened by appending additional relation and entity nodes to the end of the path. The path consisting of 5 nodes, the extension of the triple, is in this report called a path of length k = 1. In Figure 4.2 a path of length k = 1 is shown.

The number of possible paths that could be created by extending a triple or a path equaled the number of outgoing edges of the tail entity in the original knowledge graph. The average degree for a node in the knowledge graph was more than one, which meant that several paths could be selected. Therefore, the extended path was selected randomly.


Connected triples

Connected triples were created to train the model on graph structures where a head entity could be part of several triples, as shown in Figure 4.3. This meant that there were several nodes that had the same distance to the head entity, a difference from sequences such as triples and paths.

A connected triples graph was created for each node in the dataset by randomly selecting k triples, where the node was the head entity for all of the triples. This creates a connected triple of order k.

Figure 4.3: An illustration of a connected triple that consists of two interconnected triples.
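As an illustrative sketch (under the assumptions described above, not the thesis implementation), the three input graph classes can be sampled from a list of (head, relation, tail) triples roughly as follows:

```python
import random
from collections import defaultdict

def build_index(triples):
    """Map each head entity to its outgoing (relation, tail) pairs."""
    index = defaultdict(list)
    for h, r, t in triples:
        index[h].append((r, t))
    return index

def sample_path(triple, index, k=1):
    """Extend a triple with k randomly chosen hops from its tail entity.
    k=0 gives a plain triple [h, r, t]; k=1 gives a 5-node path."""
    h, r, t = triple
    path = [h, r, t]
    for _ in range(k):
        if not index[path[-1]]:
            break                                   # tail has no outgoing edge to extend with
        r_next, t_next = random.choice(index[path[-1]])
        path += [r_next, t_next]
    return path

def sample_connected_triples(head, index, k=2):
    """k triples sharing the same head entity, flattened to [h, r1, t1, r2, t2, ...]."""
    chosen = random.sample(index[head], min(k, len(index[head])))
    graph = [head]
    for r, t in chosen:
        graph += [r, t]
    return graph
```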

Input representation

Each token in the input sequence was converted into an input representation. The input representation was the sum of a positional encoding and an element embedding used for both the entities and relations. The positional encoding for the input representations for each class of input graph is shown in Figure 4.4, 4.5, and 4.6. The positional embedding method used was the same as in BERT, where unique wave functions are created for each dimension. However, the embedding application differed as a soft position encoding was applied [10], allowing repeated indices. This was needed for the connected triples graph, as the distance to the head entity of the triples was the same for several nodes.

Figure 4.4: The positional encoding for an input graph from the triple class.


Figure 4.6: The positional encoding for an input graph from the connected triples class.
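Reading Figures 4.4–4.6 as assigning each node its distance from the head entity, the soft position indices can be sketched as below; this is an assumed interpretation for illustration, not the exact scheme from the thesis code:

```python
def soft_positions(input_graph, structure):
    """Position index per node: sequential for triples and paths,
    distance from the head entity (0, 1, 2, 1, 2, ...) for connected triples."""
    if structure in ("triple", "path"):
        return list(range(len(input_graph)))
    positions = [0]                    # the shared head entity
    for _ in range(1, len(input_graph), 2):
        positions += [1, 2]            # each attached (relation, tail) pair
    return positions

print(soft_positions(["h", "r1", "t1", "r2", "t2"], "connected_triples"))  # [0, 1, 2, 1, 2]
```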

Transformer model

BERT was chosen as the Transformer model to learn contextual representations. An implementation of BERT for masked language modeling from the HuggingFace Transformer Library [22] was used. Compared to the original BERT, the number of model parameters was reduced to enable faster pre-training. The model was implemented with six Transformer layers, four attention heads, and a reduction in the model's hidden size from 768 to 256 and intermediate size from 3072 to 512. These modifications were considered justified because this report's purpose was not to find the upper limits of performance for the models, but to compare models with different attention masking methods and trained on different input graph structures in a controlled setup.
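In the HuggingFace library, the reduced architecture corresponds roughly to the configuration below; the vocabulary size is a placeholder that must cover all entity and relation tokens plus the special tokens, and the exact thesis configuration may differ in such details:

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=16_300,        # placeholder: e.g. FB15k's 14,951 entities + 1,345 relations + special tokens
    hidden_size=256,          # reduced from 768
    num_hidden_layers=6,      # reduced from 12
    num_attention_heads=4,    # reduced from 12
    intermediate_size=512,    # reduced from 3072
)
model = BertForMaskedLM(config)
```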

Attention masking procedures

Five different attention masking methods were implemented to explore how the attention masking method affects model performance. The choice of masking methods was primarily inspired by previous work on applying attention to graphs. As the exploration of all possible variations was not feasible, the selection of the following methods aims to reveal whether the directionality of attention, and attention to distant nodes, impact model performance.

Full attention

The first attention masking method was the same as in BERT, where the mask does not reflect the input graph’s structure. Instead, it could be described as having full attention, as every node in the input could attend to every other node. The masking method is illustrated in Figure 4.7. The lines indicate which nodes a node in the lower row can attend to. The full attention masking method was selected to explore if the model could effectively learn from the graph structures it was trained on, even though it was not informed of the graph structure.

Figure 4.7: An attention mask that allows for full attention, applied to a path.


One-hop directional attention

The second attention masking method was the same as the method applied in BERT-MK, where every node can only attend to itself and adjacent nodes in the graph’s direction. As shown in Figure 4.8, the graph structure was heavily imposed on the applied attention mask, and nodes without outgoing edges could only be encoded with information of themselves. This masking method could be viewed as one end of informing the model of the graph structure through the attention mask.

Figure 4.8: An attention mask that allows for attention in the one-hop neighborhood in the direction of the graph.

One-hop bidirectional attention

The third attention masking method is a more lenient extension of the second method. Instead of only attending in the graph's direction, it was possible for nodes to attend to any adjacent node, as can be seen in Figure 4.9. This allows information to flow in both directions of the graph. Nodes with no outgoing edges can then be encoded with information from other nodes. After n attention layers, a node could theoretically be informed by nodes n hops away.

Figure 4.9: An attention mask that allows for attention in the one-hop neighborhood with no regard for the direction of the graph.


Two-hop directional attention

Most knowledge graph embedding methods give more importance to the entities when creating an embedding. The second and third attention masking methods could have been too restrictive, as attention for an entity is only directed to an adjacent relation. Therefore, the attention for an entity was allowed to extend to entities in the two-hop neighborhood in the graph's direction. This meant that the head entity of a triple could be encoded with information from both the relation and the tail entity. Figure 4.10 illustrates how the attention masking is applied for a path of length k = 1.

Figure 4.10: An attention mask that allows for attention in the two-hop neighborhood in the direction of the graph.

Two-hop bidirectional attention

With the same reasoning as for one-hop bidirectional attention, the attention is allowed to be bidirectional to allow for an open flow of information through the nodes in the graph. This could be especially useful for longer paths, as intermediary entity nodes could encode information from the triple where they were the tail entity and from the triple where they were the head entity. This is shown in Figure 4.11, where the second entity node can attend to all path nodes.

Figure 4.11: An attention mask that allows for attention in the two-hop neighborhood with no regard for the direction of the graph.
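The five masking methods can be expressed as boolean matrices derived from the input graph's adjacency structure. The sketch below is one possible reading of Figures 4.7–4.11 (a value of 1 means the row node may attend to the column node); it is illustrative rather than the thesis's exact code:

```python
import numpy as np

def attention_masks(adj):
    """adj[i, j] = 1 if node i has a directed edge to node j."""
    n = adj.shape[0]
    eye = np.eye(n, dtype=int)
    full = np.ones((n, n), dtype=int)                           # full attention
    one_hop_dir = np.clip(adj + eye, 0, 1)                      # self + outgoing neighbors
    one_hop_bidir = np.clip(adj + adj.T + eye, 0, 1)            # self + neighbors in both directions
    two_hop_dir = np.clip(one_hop_dir @ one_hop_dir, 0, 1)      # reachable within two directed hops
    two_hop_bidir = np.clip(one_hop_bidir @ one_hop_bidir, 0, 1)
    return full, one_hop_dir, one_hop_bidir, two_hop_dir, two_hop_bidir

# Path of length k = 1: h -> r1 -> e -> r2 -> t (nodes 0..4).
adj = np.zeros((5, 5), dtype=int)
for i in range(4):
    adj[i, i + 1] = 1
masks = attention_masks(adj)
```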


4.3 Model training

The model was pre-trained with a masked language model (MLM) objective, the same as in the original BERT paper: a mask token replaced a node, and the model was trained to predict the original token. By removing the information from the node that was masked, the model had to attend to surrounding tokens, its context, in order to predict the masked token. The masking strategy differed from the original BERT implementation, as the input graphs are short compared to possible text sequences. It was deemed inefficient to apply BERT’s masking strategy, where only 15% of the input is masked and tokens could be left intact or randomly replaced. Instead, samples were duplicated for each head and tail entity that existed in the input graph. One training sample had the head entity masked, and in the others, the tail entity was masked.

In order to observe whether a model trained on paths of a specific length k learned from the additional information, all paths of length ≤ k (including triples) were used as input graphs during training.
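The masking strategy can be sketched as duplicating each input graph once per entity to be predicted; the token ids and the [MASK] id below are placeholders for illustration:

```python
MASK_ID = 3   # placeholder id of the [MASK] token in the element vocabulary

def masked_samples(token_ids, entity_positions):
    """One copy of the input graph per head/tail entity, with that entity masked.
    Returns (masked_ids, masked_position, original_id) tuples used as MLM training samples."""
    samples = []
    for pos in entity_positions:
        ids = list(token_ids)
        original = ids[pos]
        ids[pos] = MASK_ID
        samples.append((ids, pos, original))
    return samples

# A triple h -> r -> t (placeholder ids) yields two samples: head masked and tail masked.
print(len(masked_samples([120, 57, 893], entity_positions=[0, 2])))  # -> 2
```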

4.4 Model evaluation

Link Prediction

The model was evaluated on link prediction, a common task for evaluating knowledge graph embeddings. The task of link prediction is to predict part of a triple (h, r, t), where h and t are entities, and r the relation between them [2]. The prediction task is to predict either the head or the tail entity, given the other entity and the relation. For the model implemented in this report, this was analogous to predicting a masked token, where the mask token was either the head or the tail of a triple. The evaluation was done in two passes for each triple, masking and predicting the head first and then the tail. The predicted entities were ranked according to probability in descending order. When predicting either the head or tail entity of a test triple, there could be candidate entities that do not correctly complete the triple that has been masked but still create a triple that exists in the knowledge graph. In a sense, this would still be a good prediction by the model. To achieve a more reasonable evaluation of the model, all predicted entities that create a triple that exists in the knowledge graph, but do not correctly complete the masked test triple, were removed from the ranked predictions. This method is called a filtered setting and was introduced by Bordes et al. in [2]. The correctly predicted entity's rank was used for calculating the evaluation metrics. The mean reciprocal rank (MRR) calculates the average of the multiplicative inverse of the rank and was used as it gives a fair indication of how the model improves its predictions:

MRR = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\mathrm{rank}_i} \quad (4.1)

Q = the number of test samples.

The rank was also used to calculate the number of instances where the correct triple was in the top n predictions, where n = 1, 3, 10. The metric is called Hits@n and is the proportion of correct entities ranked at or above n. Hits@n is closely related to MRR, as it is a metric for the rank of the prediction. The difference is that Hits@n gives a better overview of how many of the predictions have, for example, rank 1.
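The filtered ranking and the two metrics can be sketched as follows; `scores` is the model's probability for every candidate entity at one masked position, and `known_ids` is the set of other entities that also form triples present in the knowledge graph (an illustrative implementation, not the thesis code):

```python
import numpy as np

def filtered_rank(scores, correct_id, known_ids):
    """Rank of the correct entity after removing competing entities that also form
    known triples (the filtered setting of Bordes et al. [2]). Ranks start at 1."""
    order = np.argsort(-scores)                                     # candidate ids, best first
    filtered = [c for c in order if c == correct_id or c not in known_ids]
    return filtered.index(correct_id) + 1

def mrr(ranks):
    return float(np.mean([1.0 / r for r in ranks]))                 # equation 4.1

def hits_at_n(ranks, n):
    return float(np.mean([r <= n for r in ranks]))                  # proportion ranked at or above n
```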


4.5 Datasets

The evaluation was conducted on four public datasets (FB15k, WN18, FB15k-237, WN18RR) that are samples from Freebase [1] and WordNet [13]. Freebase was a large-scale knowledge base containing entity and relationship facts and has now been replaced by Wikidata, a similar but larger knowledge graph. Even though Freebase is not being updated with new facts, it is still useful as a benchmark for knowledge graph embedding methods. WordNet is a large-scale lexical knowledge base that stores words and their relationships, e.g., if a word is a hypernym or hyponym of another word. The datasets FB15k and WN18 were introduced by Bordes et al. in [2] and were consistently used as benchmarks for knowledge graph embedding methods. It has been shown that the vast occurrence of inverse relations can be viewed as data leakage from the test set, and a model can be created to exploit this [3]. The issue with inverse relations is that if the relation in one direction is known, then the relation in the opposite direction is known, too. Subsamples that exclude inverse relations were introduced for the two datasets with FB15k-237 [16] and WN18RR [3]. The existence of inverse relations does not mean that the dataset is a bad benchmark, but a model that only performs well on it might not generalize to other graphs. Therefore, knowledge graph embeddings need to be evaluated on several datasets that have different characteristics. A description of the datasets can be seen in Table 4.1. The main differences can be seen between the Freebase and WordNet datasets, as the average in/out degree per node is much higher for the Freebase datasets than for the WordNet datasets.

                     FB15k     WN18      FB15k-237   WN18RR
Entities             14,951    40,943    14,951      40,943
Relations             1,345        18        237         11
Avg. in/out degree      39.6       3.7       21.3        2.3
Triples              592,213   151,442   310,116     93,003
 - Train             483,142   141,442   272,115     86,835
 - Valid              50,000     5,000    17,535      3,034
 - Test               59,071     5,000    20,466      3,134

Table 4.1: Statistics on entities, relations, and the number of triples in the train, validation, and test partitions.
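How the statistics in Table 4.1 are computed is not spelled out in the text; the sketch below shows one plausible way to derive them from a list of (head, relation, tail) triples, where the average in/out degree is taken as the number of triples divided by the number of entities. The function name is hypothetical.

```python
def dataset_statistics(triples):
    """Entity, relation, and degree statistics for a list of (h, r, t) triples."""
    entities, relations = set(), set()
    for h, r, t in triples:
        entities.update((h, t))
        relations.add(r)
    return {
        "entities": len(entities),
        "relations": len(relations),
        # Each triple adds one outgoing and one incoming edge, so the average
        # in/out degree per node equals the number of triples per entity.
        "avg_in_out_degree": round(len(triples) / len(entities), 1),
        "triples": len(triples),
    }
```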

4.6 Training details

Some general training settings were applied to all experiments. All models were trained on a 16 GB NVIDIA V100 GPU for 20 epochs with a batch size of 2048. The batch size and number of epochs were chosen to minimize the training time needed per model while still showing improvements in performance that make the models comparable to each other. Training with a smaller batch size or for more epochs would probably improve model performance, but as the report aims to compare different design choices, it was deemed more important to trade off some performance in order to run more experiments.

All models were trained with the AdamW optimizer, as implemented in the HuggingFace Transformers library [22]. The default settings were used, with β = (0.9, 0.999) and a learning rate of 1e-3 with linear decay.
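As a minimal sketch of this training setup (torch.optim.AdamW is used here in place of the library's own AdamW implementation, the step count is illustrative, and the absence of warmup is an assumption, as warmup is not mentioned in the text):

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in module for the Transformer model described earlier in this chapter.
model = torch.nn.Linear(256, 256)

num_epochs = 20
steps_per_epoch = 236   # illustrative: roughly 483k FB15k training triples / batch size 2048

# AdamW with default betas (0.9, 0.999), learning rate 1e-3, and linear decay
# of the learning rate over all training steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_epochs * steps_per_epoch
)

for epoch in range(num_epochs):
    for step in range(steps_per_epoch):
        # ... forward pass on a batch of 2048 inputs, compute the loss, loss.backward() ...
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```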


4.7 Experiments

Comparison of attention masking methods

Each attention masking method was applied in a model with the architecture described in this chapter. These models were then trained on the three input graph structures separately; e.g., for one-hop directional attention, three models were created: one trained on triples, one on paths, and one on connected triples. The purpose of the experiment was to observe whether the attention masking method affects model performance when trained on different graph structures. All models were trained on the FB15k dataset and evaluated on the link prediction task described in Section 4.4.

During training, the paths used as input were of length k = 1, with one random path selected from a given triple in the knowledge graph. The connected triples consisted of four randomly selected triples, and five such connected triples were created per entity. The reason for this was that the number of nodes in FB15k is small compared to the number of triples in the dataset. It was assumed that more training samples would let the model converge further, making differences between masking methods easier to identify.
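As an illustration of how an attention mask can be derived from the structure of the input graph, the sketch below builds a one-hop bidirectional mask from pairs of adjacent token positions. The exact token layout is defined earlier in the thesis; the helper name and example positions are hypothetical.

```python
import torch

def one_hop_bidirectional_mask(num_tokens, edges):
    """Boolean attention mask where True means attention is allowed.

    edges: (i, j) pairs of token positions that are adjacent in the input graph,
           e.g. (head, relation) and (relation, tail) for each triple.
    """
    mask = torch.eye(num_tokens, dtype=torch.bool)   # every token attends to itself
    for i, j in edges:
        mask[i, j] = True
        mask[j, i] = True                            # bidirectional: ignore edge direction
    return mask

# A single triple (h, r, t) tokenized as positions [0, 1, 2]:
print(one_hop_bidirectional_mask(3, [(0, 1), (1, 2)]))
```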

Comparison of paths

Paths can be of different lengths, and it was therefore of interest to explore how path length affects model performance. Paths of lengths k ≤ 4 were created from FB15k and used as input to train a model with full attention, to determine whether adding nodes to the input graph also increases the model's performance on the link prediction task. Additionally, as paths were randomly selected from a triple, the number of possible paths when extending a path or triple equals the degree of the tail entity in the knowledge graph. For FB15k, with an average in/out degree of 39.6, a large amount of the available training data was therefore not being used. To observe the effect of additional data, more paths were randomly selected when training a model on paths of length k = 1: for each triple in the dataset, up to n random paths were created, with n = 1, 2, 4, 6.
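One plausible reading of this sampling procedure is sketched below: a path is grown from a triple by repeatedly following a random outgoing edge from the current tail entity, so the number of choices at each step equals the tail's out-degree. The exact definition of the path length k is given earlier in the thesis; here k is taken as the number of random hops appended to the starting triple, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def build_out_edges(kg_triples):
    """Index the knowledge graph by head entity: h -> list of (r, t)."""
    out_edges = defaultdict(list)
    for h, r, t in kg_triples:
        out_edges[h].append((r, t))
    return out_edges

def sample_path(start_triple, k, out_edges):
    """Extend the triple (h, r, t) by up to k random hops from the tail entity."""
    h, r, t = start_triple
    path = [h, r, t]
    for _ in range(k):
        if not out_edges[t]:
            break                      # dead end: no outgoing edges from the tail
        r, t = random.choice(out_edges[t])
        path += [r, t]
    return path
```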

Comparison of connected triples

Connected triples can be created with varying sizes. It could be the case that a connected triples graph with more nodes creates a richer encoding for each triple, but it could also be the case that more nodes make it harder for the model to identify which context is relevant. Connected triples of order k, for k = 2, 4, 6, were created from FB15k and passed as input both to a model trained with full attention and to a model trained with one-hop bidirectional attention. These models were chosen to compare how graph size affects performance under an attention masking method that informs the model of the graph structure and one that does not. The models were then evaluated on the link prediction task.

As the triples were randomly selected to create the graph, a lot of data was not being used during training. It was therefore of interest to explore the effect of increasing the number of training samples. For each entity in FB15k, a random connected triples graph of order k was created up to n times, for n = 1, 3, 5, 7. This was applied to graphs of orders k = 2, 4, and 6.
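One plausible way to sample such a connected triples graph around an entity is sketched below; the precise construction of connected triples is defined earlier in the thesis, and the function names and the choice to require a shared entity are assumptions.

```python
import random
from collections import defaultdict

def build_incidence(kg_triples):
    """Index the knowledge graph by entity: e -> list of triples incident to e."""
    incident = defaultdict(list)
    for h, r, t in kg_triples:
        incident[h].append((h, r, t))
        incident[t].append((h, r, t))
    return incident

def sample_connected_triples(entity, k, incident):
    """Randomly select up to k triples that share the given entity."""
    candidates = incident[entity]
    return random.sample(candidates, min(k, len(candidates)))
```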


Comparison of input graph structures

Each class of input graph structures was trained on the FB15k, WN18, FB15k-237, and WN18RR datasets. The purpose was to observe whether different characteristics of the underlying knowledge graph, from which the input graphs were created, affect model performance. The attention masking method applied to each model was the one that performed best on FB15k in the first experiment: the models trained on triples and paths used an attention mask that allowed full attention, and the model trained on connected triples used an attention mask that allowed one-hop bidirectional attention. All models were evaluated on the link prediction task.

As in the experiment comparing attention masking methods, the paths were of length k = 1, and each connected triples graph consisted of four randomly selected triples, with five such connected triples created per entity.


5 Results

5.1 Comparison of attention masking methods

The results from the experiments on different attention masking methods varied with both the method and the input graph structure the model was trained on; they can be viewed in Tables 5.1, 5.2, 5.3, 5.4, and 5.5.

When the results are viewed with regard to graph structure only, the ordering of the models is the same for almost all methods: paths provide the best performance, triples are slightly behind, and connected triples are significantly behind.

When viewed from the perspective of the attention masking methods, different patterns emerge depending on the input graph structure. For the models trained on triples and paths, allowing full attention was better than the other methods, although the difference compared to the methods that apply bidirectional attention is small. The same pattern does not exist for the models trained on connected triples, which is a more complex graph structure. For connected triples, there is a significant increase in performance when applying bidirectional attention compared to full attention.

The directionality of attention has a significant effect on model performance. The methods that allow bidirectional attention perform significantly better than their directional counterparts. For triples and paths the performance is almost doubled, and for connected triples an even larger increase can be observed. When applying directional attention, connected triples underperform the other input graph structures by a large margin, yielding scores of only a few percent.

Allowing attention beyond adjacent nodes, from the one-hop to the two-hop neighborhood, does not affect the results as dramatically as the other settings. For triples and paths, a small increase can be observed when adding the information from the two-hop neighbors. Interestingly, the opposite is true for connected triples, where it is beneficial to restrict attention to adjacent neighbors.


Full attention (FB15k)

                    MRR    Hits@1  Hits@3  Hits@10
Triples             .625   .526    .686    .810
Paths               .735   .660    .787    .866
Connected triples   .138   .091    .146    .229

Table 5.1: Results for models where every node can attend to every other node in the input.

One-hop directional attention (FB15k)

                    MRR    Hits@1  Hits@3  Hits@10
Triples             .326   .247    .357    .475
Paths               .335   .257    .367    .481
Connected triples   .016   .007    .016    .032

Table 5.2: Results for models where nodes can only attend to adjacent nodes in the direction of the graph.

One-hop bidirectional attention (FB15k)

                    MRR    Hits@1  Hits@3  Hits@10
Triples             .601   .497    .663    .792
Paths               .707   .623    .763    .857
Connected triples   .220   .153    .240    .351

Table 5.3: Results for models where nodes can attend to adjacent nodes, disregarding the direction of the graph.

Two-hop directional attention (FB15k)

                    MRR    Hits@1  Hits@3  Hits@10
Triples             .331   .252    .363    .479
Paths               .337   .259    .369    .484
Connected triples   .014   .006    .013    .026

Table 5.4: Results for models where entities can attend to all nodes in their two-hop neighborhood, in the direction of the graph.


Two-hop bidirectional attention (FB15k)

                    MRR    Hits@1  Hits@3  Hits@10
Triples             .616   .515    .676    .802
Paths               .731   .655    .783    .865
Connected triples   .181   .123    .195    .293

Table 5.5: Results for models where entities can attend to all nodes in their two-hop neighborhood, disregarding the direction of the graph.

5.2 Comparison of paths

Increasing the path length improves model performance on the link prediction task, as shown in Table 5.6. The largest increase in MRR can be observed when the path length is extended from k = 1 to k = 2, and the rate of improvement decreases with each further extension of the path length.

A similar pattern is observed when the number of paths created per triple is increased, as shown in Table 5.7. The largest improvement is seen when going from one to two paths per triple, with an increase of approximately 5% in MRR. The rate of improvement decreases as more training samples per triple are added.

Different path lengths (FB15k)

               MRR    Hits@1  Hits@3  Hits@10
Paths (k ≤ 1)  .735   .659    .787    .866
Paths (k ≤ 2)  .770   .705    .813    .870
Paths (k ≤ 3)  .793   .739    .829    .884
Paths (k ≤ 4)  .798   .746    .833    .886

Table 5.6: The results for models trained on paths of different lengths.

Training samples per triple (FB15k)

               MRR    Hits@1  Hits@3  Hits@10
Paths (n = 1)  .735   .659    .787    .866
Paths (n ≤ 2)  .772   .710    .815    .880
Paths (n ≤ 4)  .809   .763    .842    .890
Paths (n ≤ 6)  .812   .766    .845    .891

Table 5.7: The results when up to n paths have been created per triple.


5.3 Comparison of connected triples

The results for connected triples, with varying numbers of triples and samples per node, can be seen in Tables 5.8 and 5.9. A consistent pattern can be observed: model performance increases with the number of samples created. However, the effect of the graph size differs between attention masking methods. When full attention is allowed, performance drops sharply as triples are added. When information about the graph structure is embedded in the attention masking, in this case with one-hop bidirectional attention, performance is more stable as additional triples are added.

Connected triples (FB15k) - Full attention

                      MRR    Hits@1  Hits@3  Hits@10
No. samples ≤ 1:
  Con. triples (k=2)  .117   .080    .122    .187
  Con. triples (k=4)  .097   .066    .101    .156
  Con. triples (k=6)  .081   .055    .085    .128
No. samples ≤ 3:
  Con. triples (k=2)  .156   .102    .167    .259
  Con. triples (k=4)  .115   .075    .121    .193
  Con. triples (k=6)  .092   .060    .097    .150
No. samples ≤ 5:
  Con. triples (k=2)  .206   .140    .222    .334
  Con. triples (k=4)  .138   .091    .146    .229
  Con. triples (k=6)  .098   .063    .104    .164
No. samples ≤ 7:
  Con. triples (k=2)  .243   .170    .265    .387
  Con. triples (k=4)  .158   .104    .169    .261
  Con. triples (k=6)  .117   .074    .124    .201

Table 5.8: Comparison of connected triples of varying size and number of samples per node. Models were trained on FB15k with an attention mask that allows full attention between nodes.


Connected triples (FB15k) - One-hop bidirectional attention

                      MRR    Hits@1  Hits@3  Hits@10
No. samples ≤ 1:
  Con. triples (k=2)  .134   .094    .143    .209
  Con. triples (k=4)  .125   .085    .133    .199
  Con. triples (k=6)  .119   .080    .126    .192
No. samples ≤ 3:
  Con. triples (k=2)  .155   .102    .164    .257
  Con. triples (k=4)  .173   .117    .186    .284
  Con. triples (k=6)  .175   .118    .189    .285
No. samples ≤ 5:
  Con. triples (k=2)  .205   .141    .221    .330
  Con. triples (k=4)  .220   .153    .240    .351
  Con. triples (k=6)  .212   .146    .233    .342
No. samples ≤ 7:
  Con. triples (k=2)  .250   .176    .272    .394
  Con. triples (k=4)  .250   .175    .270    .386
  Con. triples (k=6)  .225   .157    .246    .358

Table 5.9: Comparison of connected triples of varying size and number of samples per node. Models were trained on FB15k with an attention mask that only allows attention to adjacent nodes.

References
