
Deterministic and Flexible Parallel Latent Feature Models Learning Framework for Probabilistic Knowledge Graph



Master's thesis

Two years

Datateknik

Computer Engineering

Deterministic and Flexible Parallel Latent Feature Models Learning Framework for Probabilistic Knowledge Graph

Xiao Guan


MID SWEDEN UNIVERSITY

Department of Information and Communication Systems

Examiner: Tingting Zhang, tingting.zhang@miun.se
Supervisor: Stefan Forsström, stefan.forsstrom@miun.se
Author: Xiao Guan, xigu1500@student.miun.se
Degree programme: International Master of Computer Science, 120 credits
Main field of study: Computer Engineering
Semester, year: Autumn, 2018


Abstract

Knowledge Graph is a rising topic in the field of Artificial Intelligence.

As the current trend in knowledge representation, Knowledge Graph research makes use of the large knowledge bases freely available on the internet. A Knowledge Graph also allows inspection, analysis, and reasoning over all the knowledge it contains. To enable the ambitious idea of modeling the knowledge of the world, different theories and implementations have emerged.

Nowadays, we have the opportunity to use freely available information from Wikipedia and Wikidata. The thesis investigates and formulates a theory about learning from Knowledge Graphs. It studies probabilistic knowledge graphs and focuses only on one branch, latent feature models, for learning over probabilistic knowledge graphs. These models aim to predict possible relationships between connected entities and relations.

There are many models for this task. The metrics and the training process are described in detail and improved in the thesis work. The resulting efficiency and correctness enable us to build more complex models with confidence. The thesis also covers problems found along the way and proposes future work.

Keywords: Knowledge Graph, Latent Feature Models, Knowledge Representation


Table of Contents

Abstract
Terminology
1 Introduction
1.1 Background and Problem Motivation
1.2 Overall Aim and Research Questions
1.3 Concrete and Verifiable Goal
1.4 Scope
1.5 Outline
2 Theory
2.1 Expert System
2.2 Descriptive Logic
2.3 First-order Logic
2.4 Ontology
2.4.1 World Wide Web
2.4.2 Linked Data
2.4.3 OWL
2.4.4 RDF
2.5 Machine Learning
2.5.1 Dataset
2.6 Artificial Neural Network
2.6.1 Universal Approximation Theorem
2.6.2 Activation Functions
2.6.3 Types
2.6.4 Word Embeddings
2.7 Natural Language Processing
2.8 Scipy and Numpy
3 Knowledge Graph
3.1 Definition
3.2 Construction of Knowledge Graph
3.3 Statistical Learning over Knowledge Graph
3.4 Knowledge Graph Application
3.5 Related Work
4 Methodology
4.1 Field Investigation
4.2 Theory Formulation
4.3 Proposing and Implementing
4.4 Evaluation
5 Knowledge Representation
5.1 Properties of Statistical Knowledge Graph
5.2 Types of Statistical Learning Model
5.3 Latent Feature Model
5.3.1 RESCAL
5.3.2 Semantic energy model
5.3.3 TransE: Translating Embeddings for Modeling Multi-relational Data
5.3.4 PTransE: Modeling Relation Paths for Representation Learning of Knowledge Bases
5.3.5 ConvE: Convolutional 2D Knowledge Graph Embeddings
5.3.6 ConvKB: A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network
5.3.7 ProjE
5.4 Data Problem
5.5 Evaluation Metric
5.6 Worldview
6 Implementation
6.1 Environment
6.2 Knowledge Graph Embeddings Data Processing, Sampling and Ranking Toolkit
6.2.1 Programming Language Barrier
6.2.2 Design Principles
6.2.3 Realization of Package
6.2.4 Convenient Data Processing Utilities
6.2.5 Inverse Triples Removal
6.2.6 Data Extraction from YAGO3
6.2.7 Relation Statistical Utilities
6.3 Knowledge Graph Learning
6.3.1 Training Algorithm
6.3.2 Mini-batch Generation
6.3.3 Negative Sampling
6.3.4 Bernoulli Trick, a Biased Corruptor
6.3.5 Link Prediction Metrics
6.3.6 Checkpoints and Reporting
6.3.7 Models
7 Results
7.1 Dataset
7.2 Training Results
7.3 Theoretical Performance Analysis
7.4 Performance Evaluation
7.5 Features
8 Conclusions
8.1 Future Work
8.2 Ethical Considerations
References


Terminology

AI Artificial Intelligence
ANN Artificial Neural Network
CLI Command Line Interface
CUDA An Nvidia-designed parallel computing platform
CV Computer Vision
DAG Directed Acyclic Graph
DL Deep Learning
FFI Foreign Function Interface
GIL Global Interpreter Lock
GPU Graphics Processing Unit
IO Input and Output
KB Knowledge Base
KG Knowledge Graph
KGC Knowledge Graph Completion
ML Machine Learning
MRR Mean Reciprocal Rank
NLP Natural Language Processing
OWL The Web Ontology Language
RDF Resource Description Framework
W3C The World Wide Web Consortium


1 Introduction

This is my master's thesis, conducted at Mid Sweden University (MIUN). I have personally been involved in the Wikimedia movement, where I developed an interest in the knowledge base Wikidata during my master's studies.

Massive data and stronger computational power have emerged in the current world, and Artificial Intelligence is shining with these new opportunities. During this rapid progression, Computer Vision has improved dramatically: computers can now recognize objects using deep neural networks. Similar work is happening to improve computers' understanding of text. The progress is mainly driven by the enormous corpus from Wikipedia and by structured datasets. However, understanding text and extracting it from the internet remain strenuous, and utilizing structured data to power applications is still a challenge. Various systems have been developed and many research fields have been established. A recent approach for managing and utilizing structured text is the Knowledge Graph (KG).

Over the years, computer scientists have built many intelligent systems, from the expert system to the World Wide Web. As we all know, recent web technology has changed the world entirely. The power of the World Wide Web (WWW) comes from the ability to link other documents (web pages).

From the WWW's success, it is no wonder that we are interested in applying the same principle to a more intelligent system. For example, a search engine is interested in improving search results by extending pages with structured information. As the Google engineer Amit Singhal put it, entities connect the world as in "things, not strings" [1].

1.1 Background and Problem Motivation

Knowledge Graphs have attracted much attention in many tasks across machine learning, data mining, and artificial intelligence applications, including question answering, entity disambiguation, named entity linking, fact checking and link prediction, to name a few. The KG evolves from the Semantic Web, which is formulated by RDF-like triples

(entity, relation, entity)

A triple builds a connection between two different entities. For example, the fact that Douglas Adams is a writer can be represented as

(Douglas Adams, is a, Writer)

Immense amounts of triples form a structured, connected triple set, which we call a Knowledge Graph. Such a graph can be used to reason about connections not yet seen or to identify problems. However, many challenges remain. Building a knowledge graph is tough, and it is hard to know its completeness and correctness. We are also not sure how to use it with existing models. Moreover, storage and querying for a large KG is a hard engineering problem. For example, Wikidata has 50 million entities, and its query interface often fails to return results.

1.2 Overall Aim and Research Questions

Computers have difficulty answering the questions we ask. In the beginning, only expert systems could answer a few predefined questions.

This technology requires extensive code to answer a rather limited set of questions. Extensive work followed. Scientists developed Semantic Networks (Frame Networks), Descriptive Logic, and First-order Logic to reason about things for answering questions. Now we use Knowledge Graphs for this line of work [2]. Nevertheless, the incompleteness of KGs still needs to be emphasized. For example, in a well-known research KG, Freebase, over 70% of people have no place of birth and 99% have no ethnicity information [2]. Hence it is crucial to complete the KG.

This task is called knowledge graph completion. Many models have been developed, but a general framework is missing. Therefore, I want to investigate how to improve knowledge acquisition, how to improve the correctness of knowledge graph completion, how to improve the performance of knowledge graph completion, and how much we can benefit from incorporating multilingual labels for knowledge graph completion.

1.3 Concrete and Verifiable Goal

More specifically, the concrete goals of the project are:

1. Read 30 papers in the associated fields and narrow down the scope.

2. Formulate theory, study 7 models and related work.

3. Propose improvements to knowledge base completion.

(a) Find relevant problems in previous research and implementations.

(b) Design a framework to make an improvement.

(c) Implement the framework based on the design.

4. Evaluate the framework's performance and explain the results.

1.4 Scope

The study investigates the field of knowledge graphs and knowledge bases. It surveys the connected fields but is limited to one specific task: learning over probabilistic knowledge graphs. The study does not address any knowledge representation system other than the probabilistic knowledge graph model. The thesis focuses on improving a framework for learning latent feature models over probabilistic knowledge graphs. It does not encompass other complex Artificial Intelligence systems. Although many implementations exist in this field, I will only compare against one similar framework and one somewhat different implementation. The evaluation is simple: it compares performance quantitatively and explains the reasons for the improvement.

1.5 Outline

Chapter 1 introduces the research background, motivation, and research questions. Chapter 2 describes the grounding theory of the field. Chapter 3 explains the Knowledge Graph in a narrower context. Chapter 4 states the general scientific methodology. Chapter 5 covers knowledge graph learning models in detail. Chapter 6 records how and why the framework was designed and implemented. Chapters 7 and 8 present the results and the conclusions, respectively.


2 Theory

KG builds upon a large body of theory, and it has its own development history. Researchers also connect other fields to KG, such as Computer Vision (CV), and KG can in turn be used by other fields, especially Natural Language Processing (NLP).

2.1 Expert System

In the early days, expert systems were built to emulate human experts' decision making [3]. They were developed in the 1970s and used heavily in the 1980s [4]. Expert systems try to solve complex problems by reasoning over a knowledge base, which at that time was created from if-then rules. The other component of an expert system is an inference engine, which applies the rules. When the PC became mainstream, the expert system became less attractive. Some consider it a failed artificial intelligence system; others think it has become part of standard business applications.

2.2 Descriptive Logic

Description logics are a family of formal knowledge representation languages. Formal here refers to formalism, a theory in the foundations of mathematics.

2.3 First-order Logic

First-order logic is a more formal and more expressive system than description logic for reasoning in mathematics. First-order logic uses quantified variables to describe real objects instead of plain strings.

2.4 Ontology

An ontology describes names, categories, properties, and relations among concepts, data and entities in one or more domains. Some domains choose and design a specific ontology language to reduce complexity.

2.4.1 World Wide Web

The World Wide Web is an invention known all over the world. With the immense size of the WWW, we can use it to extract structured data. For example, Oren Etzioni leads Open Information Extraction (OpenIE), and Tom Mitchell leads Never-Ending Language Learning (NELL). As of now, TextRunner has extracted 500 million entities from 100 million web pages, but the noise in such data is high [5].

The underlying motivation is that natural language is difficult for computers to understand and utilize. Dating back to 1989, the World Wide Web was invented by Tim Berners-Lee at CERN. Hyperlinks create a vast network of pages, which makes it easy to build up immense amounts of information online. Numerous companies have built an ecosystem on top of the WWW, including the huge search engines Google and Yahoo. People also try to ask questions through a search engine. However, there are two significant challenges for the computer: it is not easy to understand questions in various languages, and it is also hard to find the relevant information. Current search engines index almost all web pages and build an efficient querying system that finds relevant pages based on words, considering the relevance and popularity of the pages, and they send out web spiders to look for new pages. The search engine seems able to find relevant information; however, it sometimes does not give the right answer the user is looking for, and it does not understand the knowledge presented in the web documents.

Tim Berners-Lee further introduced Linked Data, which is based on the Semantic Web and builds links between elements such as words and images. Unfortunately, the hope of creating rich, connected vocabularies does not seem to have been embraced by the general public.

2.4.2 Linked Data

The W3C has worked on Linked Open Data (LOD) since 2007, which intends to expand the Web of documents into a Web of data. It uses the Resource Description Framework (RDF) to describe knowledge. RDF represents a connection between two entities by a

(entity1, connection, entity2)

triple. As the project grows, billions of RDF triples are shared across organizations, though they face tremendous challenges with redundant data [5].

2.4.3 OWL

The Web Ontology Language (OWL) is a knowledge representation language for describing ontologies. It is overly complex.

2.4.4 RDF

The Resource Description Framework is a W3C-designed format to store and query graphs. It is used together with OWL 2. RDF is also a complex system. RDF data is often queried with SPARQL, a semantic query language for databases, which is likewise highly complex.

2.5 Machine Learning

Machine learning is the field of learning algorithms and statistical models that help computers improve their performance on a specific task based on data.


2.5.1 Dataset

In ML, we often train models on a particular dataset, which we divide into training, validation and test sets. The validation and test sets are rather small compared to the training set. During training, the validation set is used to benchmark the training progress. The test set is excluded from the training process as a standalone set and is used to benchmark the final model.
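As an illustration only (not the thesis code), the following sketch shows how a triple set could be cut into the three sets; the 90/5/5 ratio, the seed and the helper name are assumptions.

import random

def split_triples(triples, valid_ratio=0.05, test_ratio=0.05, seed=42):
    """Shuffle the triples deterministically and cut them into three sets."""
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    n_valid = int(len(triples) * valid_ratio)
    n_test = int(len(triples) * test_ratio)
    valid = triples[:n_valid]
    test = triples[n_valid:n_valid + n_test]
    train = triples[n_valid + n_test:]   # the bulk of the data stays for training
    return train, valid, test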

2.6 Artificial Neural Network

Artificial neural networks are computing systems inspired by the biological neural networks within animal brains. They are composed of interconnected nodes and activation functions. They have two main operations: forward propagation and backward propagation. During forward propagation, values flow through the defined nodes and the weights of the connected edges. During backward propagation, nodes and weights are updated from the derivatives of the final loss value.
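A minimal PyTorch sketch of these two operations follows; the layer sizes, random data and learning rate are arbitrary assumptions used only to show one forward and one backward pass.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 4)              # a mini-batch of 16 examples
target = torch.randn(16, 1)

prediction = model(x)               # forward propagation through nodes and weights
loss = nn.functional.mse_loss(prediction, target)
loss.backward()                     # backward propagation: gradients of the loss
optimizer.step()                    # update the weights from the gradients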

2.6.1 Universal Approximation Theorem

The universal approximation theorem states that a single hidden layer with a finite number of perceptrons can approximate any continuous function on Euclidean space (R^n). In 1989, George Cybenko proved this for the sigmoid activation function [6].

2.6.2 Activation Functions

Activation functions are a core part of an ANN. They map results to a desired range, such as 0 to 1 or -1 to 1. Depending on their properties, they make the ANN either linear, similar to other ML models, or non-linear. The majority of new ANN models are non-linear systems.

2.6.3 Types

There are various kinds of ANN. The most famous examples are the Recurrent Neural Network, the Long Short-Term Memory unit, the Transformer, the Convolutional Neural Network and more.

2.6.4 Word Embeddings

We use Word Embeddings and Embeddings interchangeably in the thesis.

Word embedding is a technique for language modeling and feature modeling in NLP, in which words and phrases are mapped into vectors of real numbers.

2.7 Natural Language Processing

Language can be expressed in different representations, such as words and characters. Its vast ambiguity is hard for the computer to understand. Modeling language is among the hardest problems in Artificial Intelligence.


NLP is generally needed for dialogue, translation, speech recognition and text analysis.

2.8 Scipy and Numpy

Scipy is a set of scientific libraries implemented in Python, comparable to Matlab. Numpy is the core part of Scipy: a matrix library for Python, the majority of which is implemented in C for performance reasons. It integrates closely with PyTorch.
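A small example of that integration, assuming a CPU tensor (the arrays themselves are arbitrary):

import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)      # the tensor shares memory with the Numpy array
b = t.mul(2).numpy()         # convert a result back into a Numpy array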


3 Knowledge Graph

The term Knowledge Graph came into common use when Google introduced the idea. Generally, we define it as a graph-connected knowledge base (KB). It is deeply related to the knowledge base and is researched from various angles. KG is a fast-evolving field, and its construction and utilization remain open topics. The general trend for KGs is that the data structure becomes simpler and more flexible. Open efforts include DBpedia and Wikidata, as well as the now obsolete Freebase. In our context, we use the KG in its purest form: a KG can be represented as a set of triples. A triple is an ordered, three-element tuple in which a relation connects two unique entities. For example, "Till is a human being" can be translated into (Till, is a, human), in which "Till" and "human" are entities and "is a" is a relationship between them. This representation allows state-of-the-art AI models to operate on the graph. There are also relational databases and graph databases intended to store KGs.

Knowledge representation and knowledge graphs have been used in some high-profile commercial products. One of the leading examples is Google's Knowledge Graph panel, which shows brief information and relevant connections for a search keyword. Some Wikipedia articles also have cards on the top right side of the page, called infoboxes, which evolved into the semantic database project Wikidata. Wikidata also stores the most relevant facts for an entity.

3.1 Definition

More precisely, a KG describes existing entities and concepts in reality. In a KG, entities and concepts are identified by a globally unique ID. A set of property-value pairs describes the essence of an entity. Relations describe the connections between entities.

The concept of a KG is very similar to a Frame Network. Formally and concretely, in the scope of this thesis, a Knowledge Graph can be represented as a set of triples, and triples are composed of entities and relations.

Specifically,

G = (E, R, T) (3.1)

where E, R and T are the entity set, the relation set and the triple set [7]. However, a real knowledge graph also contains type constraints for relations, non-binary relations, images, text descriptions and numerical literals [8].
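As a small illustration of the G = (E, R, T) view (the facts below are made-up examples, not a real dataset):

# Build the entity set E, relation set R and triple set T from example facts.
triples = [
    ("Till", "is_a", "human"),
    ("Douglas_Adams", "is_a", "Writer"),
    ("Avicii", "bornIn", "Stockholm"),
]
E = {h for h, _, _ in triples} | {t for _, _, t in triples}
R = {r for _, r, _ in triples}
G = (E, R, set(triples))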

The symbols used in the thesis are shown in Table 3.1.


Table 3.1: Mathematic symbols used in the thesis

a A numeric value.

v A column vector.

Y A tensor or probabilistic KG.

xijk A triple in a KG tensor.

yijk A random variable based on a certain triple in a KG.

G Knowledge Graph.

E Entity set.

R Relation set.

ei The i-th entity in dataset.

ri The i-th relation in dataset.

Ne The number of entities in golden set.

Nr The number of relations in golden set.

Nd The number of training triples.

Θ Graph feature models parameters or latent feature models parameters

D The observed triple set.

D+ The positive observed triple set.

D− The negative observed triple set.

T The triple set.

T The training triple set.

He The number of latent features for entities.

Hr The number of latent features for relations.

3.2 Construction of Knowledge Graph

KGs can be categorized into two major types. One is the schema approach, in which each entity and relation has a unique identifier; this is the approach followed by almost all famous knowledge bases, including Freebase and Wikidata. The other is the schema-free approach, where the data comes from free text and is created by a technique called open information extraction [9].

The usefulness of a KG almost always depends on its data quality; completeness and accuracy are important parameters. There are four main groups of KGs (as defined by [10]):

1. curated approaches, triples are created manually by a closed group of experts;

2. collaborative approaches, triples are created manually by an open group of volunteers;

3. automated semistructured approaches, triples are extracted auto- matically from semistructured text (e.g., infoboxes in Wikipedia) via hand-crafted rules, learned rules, or regular expressions;

4. automated unstructured approaches, triples are extracted automatically from unstructured text via machine learning and natural language processing techniques.

Wikidata uses the collaborative approach. The quality of the data depends on the topics of interest. A large-scale project on Wikidata aims to ingest all publications so that people can reference Wikidata items on Wikipedia; it is estimated that more than 50 million entities need to be imported, which exceeds the current number of items. Other knowledge bases also hold much information. For example, Freebase has 3.9 million entities and 1.8 billion connections. DBpedia extracts its information from Wikipedia and has 10 million entities and 1.4 billion connections. YAGO3 is constructed using the automated semistructured approach: YAGO extracts its information from Wikipedia and WordNet and has 10 million entities and 120 million connections [5]. There are, of course, more such databases in fields such as biology.

In particular, we refer to Knowledge Vault [11] for knowledge fusion. It is a research project by Google which combines text, DOM trees, HTML tables and RDF, contributing to the credibility of knowledge as well as to better extraction. Knowledge fusion includes entity merging, connection merging, and instance merging.

3.3 Statistical Learning over Knowledge Graph

The idea of a probabilistic KG comes from modeling each possible triple.

In a KG, we have an entity set E = {e_1, . . . , e_Ne} and a relation set R = {r_1, . . . , r_Nr}. We define each possible triple x_ijk = (e_i, r_k, e_j) over the entity set and the relation set as a binary random variable y_ijk ∈ {0, 1}, where 1 represents existence and 0 otherwise. Thus all possible triples in E × R × E can be arranged in a third-order tensor Y ∈ {0, 1}^(Ne × Ne × Nr), whose entries are set according to

y_ijk = { 1 if the triple (e_i, r_k, e_j) exists;  0 otherwise }

Such a tensor is illustrated in Figure 3.1. Each possible realization of Y is a possible world. We need to estimate the joint distribution P(Y) from a subset of observed triples (D+ ⊆ E × R × E × {0, 1}). The process builds a model estimating a probability distribution over possible worlds, which allows us to predict triples beyond the observed ones. A 1 in Y expresses the existence of a triple, or a fact. The interpretation of 0 depends on the worldview we choose (see Section 5.6).

Figure 3.1: Tensor representation of binary relational data.

The tensor Y can be enormously large. For example, Freebase has over 40 million entities and 35,000 relations, so the number of possible triples |E × R × E| exceeds 10^19 elements. There are some ways to reduce this size. In such an ample space, it is important and efficient for the model to deal with the sparsity of relationships.
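Because the full tensor cannot be materialized, an implementation typically stores only the indices of observed triples. The following sketch (toy sizes and illustrative indices, not the thesis data) shows the idea with a sparse PyTorch tensor:

import torch

Ne, Nr = 5, 2                                 # toy entity and relation counts
observed = [(0, 1, 0), (1, 2, 0), (3, 4, 1)]  # observed (i, j, k) index triples

indices = torch.tensor(observed, dtype=torch.long).t()
values = torch.ones(len(observed))
# y_ijk = 1 for observed triples; all other entries are 0 (or unknown,
# depending on the worldview in Section 5.6) and are never stored.
Y = torch.sparse_coo_tensor(indices, values, size=(Ne, Ne, Nr))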

3.4 Knowledge Graph Application

A KG can have a particular theme. The graph structure makes it natural to explore relationships. In fact, graph mining (network analysis) and pattern mining (association rules) are essential topics related to KGs. KGs are also going to benefit search results [5], enabling semantic search and query understanding from natural language.

3.5 Related Work

TransE introduces the idea of low-dimensional embeddings and translation as prediction [12]. It enables fast prediction in sparse probabilistic KGs. Other research addresses label extraction [10, 13]. Knowledge representation also serves as an important topic in AI [14].

OpenKE presents a training framework with an offset sampler [15]. It is a unified framework which can train many well-known models. However, its training is based on threads, which implies a lock, and extending it for our needs is hard because of its code structure. It also does not have tools for processing data.

ProjE does not only propose a different model but also implements a training framework for its own usage [16]. The implementation realizes a queue system for training and validation. However, it occupies much memory and is also slow because of tensor copying.


4 Methodology

This is quantitative research. First, archival research will be conducted, since the field has evolved considerably and I will need time to learn from prior art. Moreover, several important papers need to be addressed and verified.

Deduction will be used as the main approach while reading the literature. After reading the material, I will conduct experiments and work out how the models behave.

Finally, I will implement my improvement and propose future work.

4.1 Field Investigation

To achieve Goal 1, I plan to find survey papers about relational machine learning for knowledge graphs. The papers come from Google Scholar and from the references of peer papers; from a survey paper I can find more relevant papers in its reference list. There is a large body of research in each field, so I will skim through important surveys and find relevant papers.

I will also go to different Wikimedia conferences to ask relevant people who work on Wikidata, an extensive knowledge base. During the reading, I will limit the scope of the study. There are also existing open source projects and research prototypes related to the study; I am going to investigate and make use of them.

4.2 Theory Formulation

From the reading, I will gather more information and narrow down the problem definition. I will generalize related theory inductively and study 7 models to understand the problem. I will then look for a survey paper to gain a high-level understanding of the problem. These are my steps to complete Goal 2. For example, models may be improvements over a previous state of the art, and papers will mention other models and related theory. From the most cited papers, I can gain a better understanding of the problem and the related areas of research. When studying a model, I plan to investigate how its implementation works so I can continue with my research.

4.3 Proposing and Implementing

The problem definition needs to be broken down further for Goal 3. I plan to implement some models from the papers I read. I will also try existing implementations and observe how they perform. During these trials, I will find out the problems with existing implementations to accomplish Goal 3a. I will then implement a model and use the same metrics to compare with previous work. The implementation will go through several cycles to accomplish Goals 3b and 3c. I will set up the high-level goals for the design, then implement from the trivial details of the learning process up to a minimal runnable implementation. In the end, I will refactor and add more features.

4.4 Evaluation

In the end, I am going to evaluate the framework against the previous state-of-the-art models that I studied and implemented. I will use the typical metrics to evaluate the models, compare performance across the different implementations and analyze the reasons. I will also compile a feature list explaining why the framework is more advanced. Then I will conclude on this new implementation to accomplish Goal 4.


5 Knowledge Representation

We are interested in statistical learning models of Knowledge Graphs. Such a model can be seen as an approach to knowledge representation. I will first introduce the statistical properties and the types of statistical models, and then focus on latent feature models.

5.1 Properties of Statistical Knowledge Graph

Knowledge graphs come with some deterministic rules, such as type constraints and transitivity. For example, if Avicii was born in Stockholm, and Stockholm is located in Sweden, then we can conclude that Avicii was born in Sweden.

Nevertheless, KGs also have softer statistical patterns or regularities that can be useful, even though they do not hold at all times.

There are two such patterns. The first is homophily: entities are similar to other entities with related characteristics. For example, actors born in the US are more likely to star in US movies.

For multi-relational data (in a KG, an entity may have many relations), homophily is also referred to as autocorrelation [17]. The other statistical pattern is block structure. Some properties can divide entities into different groups (blocks), where the members of one group have similar relations to another group. For example, we can group actors like Alec Guinness into a science fiction actor block and science fiction movies like Star Trek into a movie block; we then expect many connections between these two groups.

A KG can also have global and long-range statistical dependencies, that is, dependencies that transit over a chain of triples through different relations. For example, the citizenship of Avicii (Sweden) depends on his birthplace (Stockholm); this statistical dependency follows a path over the entities

(Avicii, Stockholm, Sweden)

and the relations

(bornIn, locatedIn, citizenOf).

It is remarkable if a model can capture this kind of statistical pattern.

There is another important factor to consider: a KG is usually highly incomplete. When we use statistical models to estimate a KG, we should know that the distribution of facts can be skewed. For example, YAGO3 is extracted from Wikidata and Wikipedia, so it is skewed towards the distribution of facts in Wikipedia itself.


5.2 Types of Statistical Learning Model

Depending on the assumptions about correlations between the random variables y_ijk (see Section 3.3 for symbols), we can define different categories of models [10].

1. Latent feature models: assume all y_ijk are conditionally independent given latent features associated with the subject, object and relation type, and additional parameters.

2. Graph feature models: assume all y_ijk are conditionally independent given observed graph features and additional parameters.

3. Markov random fields: assume all y_ijk have local interactions.

In this thesis, we only focus on the first category, latent feature models.

Both graph feature models and latent feature models predict the existence of a triple x_ijk by a score function f(x_ijk; Θ). The score is the model's confidence in the existence of the triple given its parameters Θ. The conditional independence assumption of graph feature models and latent feature models can be written as Equation 5.1,

P(Y | D, Θ) = ∏_{i=1}^{Ne} ∏_{j=1}^{Ne} ∏_{k=1}^{Nr} Ber(y_ijk | σ(f(x_ijk; Θ)))    (5.1)

where σ(u) = 1/(1 + e^(−u)) is the sigmoid function and

Ber(y | p) = { p if y = 1;  1 − p if y = 0 }    (5.2)

Equation 5.2 is the Bernoulli distribution.
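As a sketch of Equations 5.1 and 5.2 for a single triple, the probability of y_ijk is a Bernoulli whose parameter is the squashed score; the concrete score value below is a placeholder assumption, standing in for any latent feature model.

import torch

def triple_probability(score, y):
    """Ber(y | sigma(score)) for an observed label y in {0, 1}."""
    p = torch.sigmoid(score)          # sigma(f(x_ijk; Theta))
    return p if y == 1 else 1.0 - p

score = torch.tensor(2.3)             # f(x_ijk; Theta) from some latent feature model
print(triple_probability(score, 1))   # close to 1 for a strongly positive score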

In the following sections, I am going to describe how Equation 5.1 is optimized.

5.3 Latent Feature Model

Since a KG is connected through semantic relations, it is possible to model them by using the connections. For example, "John lives in Sundsvall" also implies that he stays in Sweden. A latent feature model needs to infer those connections from probabilistic information instead of explicit data (text). For example, an entity e_i is represented as a vector e_i ∈ R^He (He is the number of latent features in the model), whose entries can be interpreted as latent features.

For instance, Leonardo DiCaprio is a good actor and the Academy Award is a prestigious award. This can be modelled as two latent features:


e_Leonardo = [0.9, 0.3]^T,    e_AcademyAward = [0.2, 0.9]^T

where the feature e_i1 means "good actor" and e_i2 means "a prestigious award".

This example is easy, but in reality latent features are hard to interpret.

The interactions of latent features are complicated, and the ways to design these interactions are manifold. Moreover, there are various ways to predict possible relations.

There have been many proposed models and new findings in this field in recent years. I suggest that readers focus on Section 5.3.3 and Section 5.3.7.

5.3.1 RESCAL

RESCAL models the pairwise interactions of latent features to explain triples [18–20]. For a triple (e_i, r_k, e_j), the score function of RESCAL is defined as:

f_ijk^RESCAL := e_i^T W_k e_j = ∑_{a=1}^{H} ∑_{b=1}^{H} w_abk e_ia e_jb

where W_k ∈ R^(H×H). RESCAL is a bilinear model because it uses multiplicative terms to capture the interactions between entities. In detail, the weight w_abk determines how much two latent features should interact. Block structure can be represented by the magnitude of the entries, and homophily patterns by the magnitude of the diagonal entries. Negative entries can capture anticorrelations.

A generalization called the Neural Tensor Network (NTN) can also be seen in [12]. It does not show much improvement and overfits on the standard datasets.
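A minimal sketch of the RESCAL score in PyTorch follows; the embedding sizes and the random initialization are assumptions, not a tuned configuration.

import torch

H, Ne, Nr = 10, 100, 7
E = torch.randn(Ne, H)           # one H-dimensional latent vector per entity
W = torch.randn(Nr, H, H)        # one H x H interaction matrix per relation

def rescal_score(i, k, j):
    return E[i] @ W[k] @ E[j]    # bilinear form e_i^T W_k e_j

print(rescal_score(0, 2, 5))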

5.3.2 Semantic energy model

The semantic energy model developed a new way to model multi-relational data [21]. It sets up a neural network approach to learn embeddings over large multi-relational data with a score function. This approach differs from a three-way tensor model and lays the foundation of distributed representation. There are three major ideas behind it. Firstly, it puts entries, including entities and relations, into a randomly initialized embedding space, so each entry is assigned a vector E_i ∈ R^d; this also applies to other types of data such as images. Secondly, an energy value is assigned to each individual triple by a parameterized function E, and the semantic matching comes from a scoring criterion that considers both sides of a triple. Lastly, the model is optimized to assign a lower energy score to a true triple than to unknown triples.


Figure 5.1: Semantic energy model

Figure 5.1 shows the general framework of this model, which can be described in a few steps. lhs and rhs are entities and rel is a relation; they are represented by vectors E_lhs, E_rhs and E_rel (E_lhs, E_rhs, E_rel ∈ R^d). Two combination functions are defined to combine an entity with a relation, E_lhs(rel) = g_left(E_lhs, E_rel) and E_rhs(rel) = g_right(E_rhs, E_rel). These functions are trainable. E_lhs(rel) and E_rhs(rel) can have a different dimensionality, but they can also be as low-dimensional as the original embedding size. In the end, the energy is E((lhs, rel, rhs)) = h(E_lhs(rel), E_rhs(rel)).

A further extension is called Structured Embeddings, but it was surpassed by other models in no time [22].

5.3.3 TransE: Translating Embeddings for Modeling Multi-relational Data

Embeddings are used in many bodies of work in NLP [23, 24]. Such techniques turn out to be useful in latent knowledge representation as well.

TransE builds upon the semantic energy model, and it is still an essential baseline [12]. It models the interaction of entries as translations in the embedding space: given a triple (h, ℓ, t), h + ℓ ≈ t should hold. The model works well and only relies on two embedding spaces. It is based on a close look at the data, from which Bordes decided to devise a more generic approach built on the hypothesis that all heterogeneous relationships affect the locality of relational data at the same time.

The training minimizes a margin-based ranking criterion over the training set:


L = ∑_{(h, ℓ, t) ∈ S} ∑_{(h', ℓ, t') ∈ S'(h, ℓ, t)} [γ + d(h + ℓ, t) − d(h' + ℓ, t')]_+
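A minimal sketch of the TransE distance and the margin term for one positive triple and one corrupted triple; the margin value, the L2 distance and the toy dimensions are assumptions taken from the common setup, not from this thesis' configuration.

import torch

def distance(h, l, t):
    return torch.norm(h + l - t, p=2)     # d(h + l, t) with the L2 norm

k = 20
h, l, t = torch.randn(k), torch.randn(k), torch.randn(k)   # positive triple
h_c, t_c = torch.randn(k), torch.randn(k)                  # corrupted entities
gamma = 1.0                                                # margin

loss = torch.clamp(gamma + distance(h, l, t) - distance(h_c, l, t_c), min=0.0)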

5.3.4 PTransE: Modeling Relation Paths for Representation Learning of Knowledge Bases

PTransE extends TransE with multi-hop inference by adding possible paths to the system [25]. It defines a path triple (h, p, t) and computes the amount of resource flowing from h to t given the path p = (r_1, ..., r_l). The resource flows through S_0 →(r_1) S_1 →(r_2) ... →(r_l) S_l, where S_0 = {h} and t ∈ S_l. Given an entity m ∈ S_i, the set of its direct predecessors along relation r_i in S_{i−1} is denoted S_{i−1}(·, m). The resource flowing to m is then defined as

R_p(m) = ∑_{n ∈ S_{i−1}(·, m)} R_p(n) / |S_i(n, ·)|

where S_i(n, ·) is the set of direct successors of n ∈ S_{i−1} following relation r_i. Initially R_p(h) = 1, and the reliability of the path is R(p | h, t) = R_p(t).

The path information is added to the scoring function as S(h, r, t) = G(h, r, t) + G(t, r^(−1), h), with

G(h, r, t) = ||h + r − t|| + (1/Z) ∑_{p ∈ P(h, t)} Pr(r | p) R(p | h, t) ||p − r||

The quantified inference of a path, Pr(r | p) = Pr(r, p)/Pr(p), is obtained from the training data given a threshold. The problem with this approach is that the test results become biased by every pre-calculated path, and it is also tough to update the model dynamically because of this pre-calculation.

5.3.5 ConvE: Convolutional 2D Knowledge Graph Embeddings

ConvE uses a 2D convolution operation to capture long-range statistical information [26]. Its process is described in Figure 5.2.

Figure 5.2: ConvE model process


The model can be described as

ψ_r(e_s, e_o) = f(vec(f([ē_s; r̄_r] ∗ ω)) W) e_o

where r_r ∈ R^k is a relation parameter depending on r, and ē_s and r̄_r denote 2D reshapings of e_s and r_r.

ConvE looks overly complicated because of the newly introduced 2D convolution operation. It follows a closed world assumption and uses every other entity as negative samples. However, it only achieves average results.
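A rough sketch of a ConvE-style score follows: the subject and relation embeddings are reshaped to 2D, stacked, convolved, projected back to the embedding space and matched against the object embedding. The sizes, the number of filters and the omission of dropout and batch normalization are simplifying assumptions, not the published configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

k, h, w = 200, 10, 20           # embedding size and its 2D reshaping (h * w = k)
conv = nn.Conv2d(1, 32, kernel_size=3)
fc = nn.Linear(32 * (2 * h - 2) * (w - 2), k)

def conve_score(e_s, r_r, e_o):
    x = torch.cat([e_s.view(1, 1, h, w), r_r.view(1, 1, h, w)], dim=2)
    x = F.relu(conv(x))                 # feature maps over the stacked "image"
    x = F.relu(fc(x.view(1, -1)))       # project back to the embedding space
    return (x * e_o).sum()              # dot product with the object embedding

e_s, r_r, e_o = torch.randn(k), torch.randn(k), torch.randn(k)
print(conve_score(e_s, r_r, e_o))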

5.3.6 ConvKB: A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network

ConvKB follows ConvE's idea of using a convolution operator [27]. Its process is described in Figure 5.3.

Figure 5.3: ConvKB model process

Its score function is described as

f(h, r, t) = concat(g([v_h, v_r, v_t] ∗ Ω)) · w

where Ω is the set of filters and w is the weight vector; both are shared parameters independent of the triples. ∗ represents a convolution operator and concat denotes a concatenation operator.

Following the model process, τ = |Ω| is the number of filters. The τ feature maps are concatenated into a vector v ∈ R^(τk×1), which is then combined with a weight vector w ∈ R^(τk×1) by a dot product to give the score.

With certain hyperparameter choices, the model reduces to a simpler model such as TransE. Although ConvKB looks good on paper, it cannot be bootstrapped from scratch; it must be initialized from trained TransE embeddings.

5.3.7 ProjE

ProjE [16] takes a different approach by using a neural network and embeddings for KG completion. It continues the work of NTN. More specifically, ProjE contributes in four main ways:

• An entity and a relation are combined into a target vector.

• A combination operator is used to combine entity and relation embeddings.

• Collective ranking is used instead of margin-based pairwise ranking.

• It trains without pre-trained embeddings.

The training input of ProjE is m triples in one batch, represented by embeddings (of a chosen size) initialized from a uniform distribution.

Each triple draws negative samples from 25% to 50% of all candidates. For comparison, TransE takes one negative sample per triple, while ComplEx may use more than ten negative samples. The shapes of the head and relation matrices are (m, k) and (m, k) respectively, where k is the embedding dimension. The code implements two combination modes: "simple", which is described in the paper, and "complex", intended for better performance. In practice, the complex combination does not achieve outstanding performance.

For the simple combination mode, we have parameters W_h : (k, 1), W_r : (k, 1), b_c : (k, 1). Firstly, the target vector is combined as e ⊕ r = D_e · e + D_r · r + b_c; however, as the code shows, it is actually e ⊕ r = e ∘ W_h + r ∘ W_r + b_c (here b_c is broadcast over all m triples). At the end of this step, a result matrix of shape (m, n_e) is constructed, where n_e is the number of entities. Secondly, scores are produced by applying tanh to the combined elements and comparing against every sampled entity. This step includes a so-called candidate sampling process: for each training triple, a vector of length n_e is built, so the "label" matrix has a shape of (m, n_e).


Each entry of the vector is labeled as in Equation 5.3:

y_i = { 1 (existing);  0 (not sampled);  −1 (negatively sampled) }    (5.3)

In the code, negative samples are generated from the golden sets (including training, validation, and test), which biases the evaluation towards the test set.

After that, the operations are applied to the vector. Basically it is res = f(e ⊕ r) · W_e^T, which means every entity is compared against the combined element. Lastly, a loss value is calculated by

L = − ∑ log("activation") ∘ max(0, "label")

where log("activation") and max(0, "label") both have a shape of (m, n_e). So only the correct labels are considered, and a regularizer is used. Note that the input triples can be either (h, r) or (t, r).

The complex combination mode shares almost the same procedure as the simple mode, so I only describe the differences. The relevant parameters are W_er : (2k, k), b_c : (k, 1), e : (m, k), r : (m, k). In the paper, the combination is e ⊕ r = D_e · e + D_r · r + b_c; in the code, it is e ⊕ r = [e; r] · W_er + b_c. In particular, a softmax is used in the second step to improve the result:

max_value = reduce_max(result ∘ |label|, 1)
rescaled = result − max_value

where |label| excludes the candidates we do not sample.

It also turns out that ProjE reports a wrong ranking score according to its code: the minimal rank is 1, but it is treated as 0 in ProjE, so the reported results are slightly biased towards being better.

In terms of design, the interactions between entity and relation can be significant. Moreover, the simple mode works better than the complex mode; at least it is more stable during gradient descent. ProjE's target vector is the opposite of using corrupted entities and scores, which is an interesting idea, but the paper does not explain why the target vector and scoring work.
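A sketch of the simple combination and candidate scoring as described above (e ⊕ r = e ∘ W_h + r ∘ W_r + b_c followed by tanh and a dot product with every candidate entity embedding); the shapes follow the text, while the random initialization and the sizes are assumptions.

import torch

m, k, ne = 4, 50, 1000                   # batch size, embedding dim, #entities
W_h, W_r, b_c = torch.randn(k), torch.randn(k), torch.randn(k)
W_e = torch.randn(ne, k)                 # candidate entity embeddings

e = torch.randn(m, k)                    # head (or tail) embeddings of the batch
r = torch.randn(m, k)                    # relation embeddings of the batch

combined = e * W_h + r * W_r + b_c       # (m, k); b_c is broadcast over the batch
scores = torch.tanh(combined) @ W_e.t()  # (m, ne): one score per candidate entity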

5.4 Data Problem

[28] discovered test leakage in WN18 and FB15k, where 94% and 81% of the test triples, respectively, can be inferred from the training data. For example, (h1, r1, t1) exists in the training dataset while (t1, r1, h1) exists in the test dataset. [26] built a simple model that only predicts such inverse triples, which shows that these datasets overstate the inference power of models on an actual knowledge graph.


Nevertheless, a knowledge graph storing such links might be useful for people to understand the different relations of entities. However, it is not desirable for statistical learning.

I did some experiments, and it turns out that some connected relations can be very informative. For example, in FB15k, the relation

/award/award_category/nominees.

/award/award_nomination/nominated_for

has a high out-degree. An award is notable, so a significant number of movies are connected to it. By grouping them, a neural network can learn general concepts around the relation and its entities.

An example of such a relation:

Dracula (1992 film)
/award/award_category/nominees./award/award_nomination/nominated_for
Academy Award for Best Sound Editing

5.5 Evaluation Metric

I use the common evaluation protocol. The most common task is link prediction, a KG completion task where we take a corrupted triple and query the model for the missing entry. For example, when predicting the head entity of a triple (e_i, r_k, e_j), the corrupted triple (·, r_k, e_j) is fed into the model. The model then scores all candidate triples, and we sort these scores to obtain a ranking.

From this we derive two metrics. One is the mean rank. The rank is the position at which the correct triple appears in the sorted predictions, and the mean rank is the average rank over all test triples. We are also interested in the mean reciprocal rank, which is more robust. The other metric is hits.

Hits is defined across all test triples: hits@1 is the proportion of triples predicted at rank 1, and hits@10 is the proportion of triples ranked within the top 10.

For all metrics, we have two settings: raw and filtered. The filtered setting gives more accurate information about link prediction. When predicting for a particular corrupted triple (h, r, ·), it is very likely that there is more than one correct triple. We therefore filter out the other correct triples, except the triple of interest, and calculate this filtered rank instead.
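A small sketch of how the metrics can be computed from a list of (raw or filtered) ranks, where rank 1 means the correct entity came first; the example ranks are made up.

def ranking_metrics(ranks, k=10):
    n = len(ranks)
    mean_rank = sum(ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n            # mean reciprocal rank
    hits_at_k = sum(1 for r in ranks if r <= k) / n  # hits@k
    return mean_rank, mrr, hits_at_k

print(ranking_metrics([1, 3, 12, 2, 40]))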


5.6 Worldview

A KG is highly incomplete, and we almost always hold a partially complete triple set. To deal with this, there are different ways to look at non-existent triples [10].

The assumptions are categorized by how they view non-existent triples.

• Closed world assumption (CWA) claims they are false;

• Open world assumption (OWA) claims they are unknown which means they can be true or false;

• Local closed world assumption (LCWA) differs from CWA in the conditions under which triples are claimed false.

To explain LCWA, we define O(h, r) ⊆ E (or symmetrically O(t, r)) as the set of existing tail (or head) entities in a KG given h and r (or t and r). The set can be a singleton, such as a place of birth, or it can have multiple values, such as the actors starring in a movie.

Given a triple (h, r, t), if t ∈ O(h, r), the triple is considered correct. If t ∉ O(h, r) but |O(h, r)| > 0, the triple is considered incorrect, because we assume the KG is locally complete for this (h, r) pair. If O(h, r) is empty, we do not label the triple; it is excluded from the training and test sets.

RDF and the Semantic Web use the OWA, while relational models often use the LCWA.
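A small sketch of LCWA labeling under these definitions; the facts and names are illustrative only.

from collections import defaultdict

observed = [("Avicii", "bornIn", "Stockholm"), ("Avicii", "citizenOf", "Sweden")]
O = defaultdict(set)                  # O(h, r): observed tails for an (h, r) pair
for h, r, t in observed:
    O[(h, r)].add(t)

def lcwa_label(h, r, t):
    tails = O[(h, r)]
    if t in tails:
        return 1        # positive: the triple is observed
    if tails:
        return 0        # negative: (h, r) is assumed locally complete
    return None         # unlabeled: excluded from training and test sets

print(lcwa_label("Avicii", "bornIn", "Malmo"))    # 0
print(lcwa_label("Avicii", "diedIn", "Oslo"))     # None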


6 Implementation

The implementation is comprised of a training framework, a data processing library, extracted datasets, and a running environment. The pipeline of the training process is straightforward, as shown in Figure 6.1 below. Once a model is trained, we can take triples from the gold datasets and predict with the trained model.

Figure 6.1: Model training process, showing the environment, data storage, gold datasets, models and the trained model, together with the kgexpr and kgekit packages (indexer, sampler, ranker, statistical tools, processing scripts, IO and data structures).

The implementation of the framework focuses on correctness, performance, flexibility and deterministic behavior. The programming paradigm follows data-oriented and object-oriented design.

6.1 Environment

I used a workstation at the university with remote access. The workstation has two Nvidia Titan X GPUs and 64 GB of memory. Docker is used to set up an isolated environment for running the learning models. In the individual packages, pipenv is used to manage the Python environment as well as the dependencies.

I use PyTorch and Scipy for all implementations because of their flexibility and the ability to build prototypes efficiently. PyTorch implements a dynamic computation graph system which allows easier debugging and testing cycles.

6.2 Knowledge Graph Embeddings Data Processing, Sampling and Ranking Toolkit

In this library (kgekit), I implement the data processing and sampling tools for the learning tool. To meet the design goals, I divided the data processing and sampling parts into a separate package. Figure 6.2 shows an overview of the components.

Figure 6.2: kgekit components

I will explain how the different components work. There are several reasons to have this package. Firstly, data extraction from RDF documents and data processing are not the core part of the learning model. Secondly, sampling and ranking are the expensive parts of the computation, so we are better off if they can be isolated and parallelized; that is why they are implemented in C++ for full control over performance, since Python is simply slow because of its boxed objects and the GIL. Thirdly, data processing and sampling are atomic and can be modularized. Fourthly, the IO operations of the learning tool are extracted because we might face very different types of input, so it is cleaner to keep the IO tools in a separate package and provide upstream code with structured data. Lastly, it is distributed as a Python package, so the language barrier between Python and C++ must be as transparent as possible. In order to distribute the library (kgekit) through a package manager, I need to package the C++ dependency so that it is opaque to its upstream usage.

6.2.1 Programming Language Barrier

Python is an excellent choice for data processing. Data is easy to manipulate with the dynamic list, dict and set containers, and the dynamic return values and data types are also contributing factors. For testing, Python allows mocking and interactive debugging. Moreover, the Scipy tools are implemented in Python, so we can use numpy, pandas, and matplotlib as needed.

As long as we are performing simple data transformations and IO operations, processing should not become the bottleneck of execution. This is also the reason why this package does not use any multiprocessing instrument.

The implementation also requires thread safety. However, the performance design depends heavily on the implementation of the Python interpreter. In statistical learning, we often employ many processes which share resources for performance, because the GPU is faster than our IO pipeline. In the case of multi-relational statistical learning, the bottleneck is the sampling process of each mini-batch and the ranking in the validation step.

In the context of this thesis work, we use CPython as the Python implementation. Thus Python faces a significant performance obstacle because of the GIL. The GIL does not allow multi-threaded execution of a Python program: the interpreter lock prevents concurrent access to shared resources within the same process. Empirically, one thread can never saturate our GPUs, which means there is no point in optimizing parallelization at the Python end.

Python uses a reference counting GC algorithm for resource management and provides dynamic list, dict and set containers. However, our data does not change much during the training, validation and testing phases; we only need to initialize the data once for the negative sampling process and the ranking. This gives us the opportunity to implement these parts in a compiled language without devoting too much effort to memory management. C++ is a good choice since it offers a higher level of abstraction, seamless integration with C and many third-party libraries.

We also want to avoid copying data. Memory allocation would be slow if we had to allocate often; during a training epoch, we may allocate thousands of objects, so it is advisable not to copy objects during the data collation phase. CPython provides a C API and an FFI interface for C/C++ programs to operate on its objects, memory, and state.

On the one hand, for example, CPython allows us to construct a tuple containing two long values and a Unicode string using its C API, as shown in Code 6.1.

PyObject *t;

t = PyTuple_New(3);

PyTuple_SetItem(t, 0, PyLong_FromLong(1L));

PyTuple_SetItem(t, 1, PyLong_FromLong(2L));

PyTuple_SetItem(t, 2, PyUnicode_FromString("three"));

Code 6.1: Python C API example

Code 6.2 is an example of calling a Python function implemented in C/C++.


import spam

status = spam.system("ls -l")

Code 6.2: Python function implemented by C

The corresponding code in C is listed in Code 6.3.

static PyObject *
spam_system(PyObject *self, PyObject *args) {
    const char *command;
    int sts;

    if (!PyArg_ParseTuple(args, "s", &command))
        return NULL;
    sts = system(command);
    return PyLong_FromLong(sts);
}

Code 6.3: Python function implemented in C

On the other hand, the FFI interface allows users to declare C function interfaces in Python; see Code 6.4.

from cffi import FFI

ffi = FFI()
ffi.cdef("""
typedef struct {
    unsigned char r, g, b;
} pixel_t;
""")

image = ffi.new("pixel_t[]", 800*600)

f = open('data', 'rb')   # binary mode -- important
f.readinto(ffi.buffer(image))
f.close()

image[100].r = 255
image[100].g = 192
image[100].b = 128

f = open('data', 'wb')
f.write(ffi.buffer(image))
f.close()

Code 6.4: Declaring and using a C struct from Python with CFFI

As we can see, the Python C API is dramatically different from the (Python-side) FFI. The C API requires a lot of error checking, but it is closer to the CPython implementation than FFI; CPython is implemented in C, where we are allowed to use any internal data structure and other functions. Numpy and PyTorch use the C API extensively for performance. FFI is more straightforward because we are still writing Python, but it does not solve the complications of distributing the code. Moreover, it requires us to remember how Python is implemented without stating it explicitly in the code.

6.2.2 Design Principles

I prefer to write elegant code that I own. Thus, I chose the library pybind11 for help. pybind11 is a C++ template library which allows the interchange of Python and C++ objects. It is a very advanced library requiring solid C++ knowledge.

#include <pybind11/pybind11.h>

int add(int i, int j) {
    return i + j;
}

PYBIND11_MODULE(example, m) {
    m.doc() = "pybind11 example plugin";  // optional module docstring
    m.def("add", &add, "A function which adds two numbers");
}

Code 6.5: pybind11 binding in C++

import example
example.add(1, 2)

Code 6.6: Invoking a Python function bound by pybind11

Code 6.5 and Code 6.6 show a function binding with pybind11. pybind11 encapsulates GIL management and implicit type conversion. Moreover, it supports class bindings, casts, exceptions, memory management, and STL type conversions. I also use its Python object bindings and its embedded interpreter.


The biggest catch in designing the library is that I want to avoid copying data. A Python object wrapped by pybind11 is a managed pointer which can be unwrapped and used easily, and C++ objects are managed by the wrapping facility provided by pybind11. Conversion of STL objects, however, is complicated because they are copied.

Considering the need for training, I need:

1. Loading data from Python.

2. Ability to interchange data in C++ with Python.

3. A multi-process sampler.

I set up some principles to design the interface and abstraction.

1. Prefer passing references and Python objects (including containers);

2. Prefer stateless functions over stateful classes;

3. Use a class in C++ only if encapsulating internal state is needed;

4. Avoid copying as much as possible;

5. Define data structures in C++;

6. IO functions are written in Python because they are IO-bound;

7. Thread-safe functions and classes.

Evidently, it is faster not to allocate memory, which means less copying and object creation; the first four rules address the memory part. Moreover, writing the data structures in C++ allows us to save memory. When they need to be used in Python, pybind11 builds a thin wrapper for the object. Thread safety is required because we will use the objects in different processes.

6.2.3 Realization of Package

The implementation of the kgekit package has Python and C++ parts. To glue the project together, I chose CMake and the Hunter package manager.

py::class_<kgekit::Triple>(m, "Triple")
    .def(py::init<>())
    .def(py::init<const std::string&, const std::string&,
                  const std::string&>())
    .def("__repr__", &kgekit::Triple::repr)
    .def("__eq__", &kgekit::Triple::operator==)
    .def("serialize", &kgekit::Triple::serialize)
    .def_readwrite("head", &kgekit::Triple::head)
    .def_readwrite("relation", &kgekit::Triple::relation)
    .def_readwrite("tail", &kgekit::Triple::tail);

Code 6.7: The triple representation in kgekit

Code 6.7 declares the triple representation and accessors.
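As a rough illustration, the bound class could be used from Python as sketched below. The attribute and method names follow Code 6.7, but the import path and the example strings are assumptions, not the actual kgekit API.

import kgekit  # assumption: the compiled extension is importable as kgekit

# Construct a triple from three strings (second constructor in Code 6.7).
t = kgekit.Triple("Berlin", "isLocatedIn", "Germany")

# head, relation and tail are exposed as read/write attributes.
print(t.head, t.relation, t.tail)
t.tail = "Deutschland"

# __repr__ and __eq__ are forwarded to the C++ implementation.
print(t)
print(t == kgekit.Triple("Berlin", "isLocatedIn", "Deutschland"))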


m.def("expand_triple_batch", &kgekit::expand_triple_batch,

"expands triple batch",

,→

py::arg("batch"), py::arg("num_entity"), py::arg("num_relation"), py::arg("expand_entity"), py::arg("expand_relation"));

Code 6.8: A function binding in kgekit

py::class_<kgekit::EntityNumberIndexer>(m, "EntityNumberIndexer",
    "index from triples and translation between index and lists")
    .def(py::init<const py::list&, const std::string&>())
    .def("entityIdMap", &kgekit::EntityNumberIndexer::entityIdMap)
    .def("relationIdMap", &kgekit::EntityNumberIndexer::relationIdMap)
    .def("indexes", &kgekit::EntityNumberIndexer::indexes,
         "gets indexes from triples")
    .def("entities", &kgekit::EntityNumberIndexer::entities,
         "gets entity list")
    .def("relations", &kgekit::EntityNumberIndexer::relations,
         "gets relation list")
    .def("getEntityFromId",
         &kgekit::EntityNumberIndexer::getEntityFromId,
         "gets the entity name from id")
    .def("getRelationFromId",
         &kgekit::EntityNumberIndexer::getRelationFromId,
         "gets the relation name from id")
    .def("getIdFromEntity",
         &kgekit::EntityNumberIndexer::getIdFromEntity,
         "gets the entity id from name")
    .def("getIdFromRelation",
         &kgekit::EntityNumberIndexer::getIdFromRelation,
         "gets the relation id from name");

Code 6.9: A class binding in kgekit

Code 6.8 and Code 6.9 show the bindings for a function and a class, respectively.

Noticeably, data manipulation functions are naturally defined in C++ since the data structures are defined in C++. The indexer is in charge of transforming literals into sequential numbers and providing the mapping information. The indexer is essential because statistical learning models need entity and relation IDs numbered from 0. I also built a corruptor based on the Bernoulli distribution (see Section 6.3.4). Combined with the corruptor, I built a negative sampler (see Section 6.3.3). Moreover, a ranker is implemented in C++ for performance (see Section 6.3.5). In the Python world, kgekit supports triple IO, label IO, and translation IO.
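To illustrate how the indexer binding would be used from Python, here is a minimal, hypothetical sketch. The method names come from Code 6.9, but the import path, the example strings, and the second constructor argument (assumed here to be a triple-order string) are assumptions rather than the actual kgekit API.

import kgekit  # assumption: the compiled extension is importable as kgekit

triples = [
    kgekit.Triple("Berlin", "isLocatedIn", "Germany"),
    kgekit.Triple("Germany", "hasCapital", "Berlin"),
]

# The constructor in Code 6.9 takes a Python list and a string;
# the "hrt" order string below is an assumption.
indexer = kgekit.EntityNumberIndexer(triples, "hrt")

# Entities and relations are mapped to consecutive IDs starting from 0.
print(indexer.entityIdMap())             # e.g. {'Berlin': 0, 'Germany': 1}
print(indexer.getIdFromEntity("Berlin"))
print(indexer.getEntityFromId(0))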

kgekit is published as a PyPI package.

6.2.4 Convenient Data Processing Utilities

The data processing part is not the core of kgekit, but it is useful for data extraction. There is a Java CLI program which is used for extracting literals from a Turtle document. A Turtle document is a simple form of RDF document. However, we only need facts in the simplest triple form

(head, relation, tail)

For labels, we only need a structure of (entity, label). Thus, I chose Java because of RDF4J, one of the few maintained packages that can process a large number of triples. There is no efficient Python package that can read the YAGO3 dump, and the C/C++ RDF libraries are no longer updated, so I do not trust them.

However, RDF4J is not friendly with malformed input; it does not process percent-escaped symbols. Filtering triples that fulfill our requirements with its built-in mechanisms is problematic, and designing a custom filtering algorithm on top of it is not easy either. Moreover, the indexer is implemented in Python, so the majority of the data processing code ends up written in Python. A set of Python scripts supports data segregation, encoding, inverse triple removal (see Section 6.2.5), and label extraction.

6.2.5 Inverse Triples Removal

The inverse triples problem is a pitfall in KG research, as described in Section 5.4. Algorithm 1 is used to filter out the inverse triples.

The algorithm is implemented in Python as a processing script.

6.2.6 Data Extraction from YAGO3

There are several datasets available from previous research. For example, FB15k is a subset of Freebase extracted for the TransE publication [12], and FB15k-237 is a subset of FB15k obtained by removing inverse triples [28].

I also employ a relation threshold algorithm to remove triples whose entities are not involved in enough relations. The algorithm is shown as Algorithm 2.

The process of extraction is described in Figure 6.3. I used the YAGO3 database because it is multilingual and large.


Algorithm 1 Inverse Triples Removal Algorithm

Require: triples, the list of triples in head, relation, tail form.

1: procedure RemoveInverseTriples(triples)

2: c ← { }

3: for all t ∈ triples do

4: c[< t.head, t.tail >] ← c[< t.head, t.tail >] ∪ {t}

5: end for

6: r ← [ ]

7: for all t ∈ c do

8: r ← r ∪ sample(t)

9: end for

10: return triples − r

11: end procedure
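A minimal Python sketch of one reading of Algorithm 1 follows. It assumes triples carry head, relation, and tail fields, treats the entity pair as unordered so that a triple and its inverse fall into the same group, and interprets sample as choosing which member of each group to keep; the actual kgekit processing script may differ.

import random
from collections import defaultdict, namedtuple

Triple = namedtuple("Triple", ["head", "relation", "tail"])

def remove_inverse_triples(triples):
    # Group triples by their unordered entity pair so that (h, r, t)
    # and an inverse (t, r', h) end up in the same bucket.
    groups = defaultdict(list)
    for t in triples:
        groups[frozenset((t.head, t.tail))].append(t)

    # In every bucket with more than one triple, randomly keep one
    # triple and mark the rest for removal.
    removed = set()
    for bucket in groups.values():
        if len(bucket) > 1:
            keep = random.choice(bucket)
            removed.update(x for x in bucket if x != keep)

    return [t for t in triples if t not in removed]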

Algorithm 2 Relation Threshold Algorithm

Require: triples, the list of triples in head, relation, tail form; threshold, the minimum number of triples an entity must participate in.

1: procedure RemoveDeficitRelation(triples)

2: c ← { }

3: for all t ∈ triples do

4: c[t.head] ← c[t.head] ∪ {t}

5: c[t.tail] ← c[t.tail] ∪ {t}

6: end for

7: d ← {}

8: for all ent, val ∈ c do

9: if len(val) < threshold then

10: d ← d ∪ {ent}

11: end if

12: end for

13: r ← {}

14: for all t ∈ triples do

15: if t.head ∈ d or t.tail ∈ d then

16: r ← r ∪ {t}

17: end if

18: end for

19: return triples − r

20: end procedure
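Under the same assumptions about the triple structure as in the previous sketch, Algorithm 2 can be sketched in Python as follows; this is illustrative only and not the actual kgekit script.

from collections import defaultdict

def remove_deficit_relation(triples, threshold):
    # Collect the triples attached to every entity, as head or as tail.
    per_entity = defaultdict(list)
    for t in triples:
        per_entity[t.head].append(t)
        per_entity[t.tail].append(t)

    # Entities involved in fewer triples than the threshold are dropped.
    deficient = {e for e, ts in per_entity.items() if len(ts) < threshold}

    # Remove every triple that touches a deficient entity.
    return [t for t in triples
            if t.head not in deficient and t.tail not in deficient]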

References
