Summary

The website dataportalen.se aims to increase the availability of Swedish open data. This is done by collecting metadata about the open datasets provided by Swedish organizations. At the time of writing, the portal contains metadata from more than two thousand datasets, and this number will grow. As the number of datasets grows, browsing for relevant information becomes increasingly difficult and time-consuming. Currently, it is possible to browse the datasets by searching with text and then filtering by theme, organization, file format or license.

We believe there is further potential to connect the datasets, which would make it easier to find datasets of interest. The idea is to find common denominators in the metadata of the datasets. Since no user data exists, we investigate to what extent this idea can be realized. The datasets are annotated with metadata such as title, description, keywords, and theme. By comparing the metadata of different datasets, a measure of similarity can be computed. This measure can then be used to find the most relevant datasets for a specific dataset.

The result of the metadata analysis is that similar datasets can indeed be found. By exploring different methods, we found that the text data contains useful information that can be used to find relations between datasets. Using a related work as a benchmark, we found that our results are as good, if not better.

The results show that related datasets can be found using only text data, and we believe that the identified method is general enough to have the potential to be used for similar problems where text data is available.
Contents

1 Introduction
2 Background
  2.1 Research question
3 Related work
  3.1 Recommendations Without User Preferences: A Natural Language Processing Approach
  3.2 Eurovoc
4 Theory
  4.1 Data
    4.1.1 Data Quality
    4.1.2 Data Preprocessing
    4.1.3 Text data preprocessing
  4.2 Feature Extraction
    4.2.1 Latent Semantic Analysis
    4.2.2 Latent Dirichlet Allocation
    4.2.3 Word2Vec
    4.2.4 Doc2Vec
  4.3 Measures of Similarity and Dissimilarity
5 Materials
  5.1 dataportalen.se dataset
  5.2 Swedish Wikipedia dataset
6 Method
  6.1 From Linked Data to Flat Data
  6.2 Data Cleaning
  6.3 Data Pre-processing, Text Pre-Processing and Feature Extraction
  6.4 Comparing similarity of datasets
  6.5 Representing Recommended Datasets
7 Inferring Dataset Relations
  7.1 Baseline
  7.2 Algorithm 1: Structured information matching
  7.3 Algorithm 2: LDA
  7.4 Algorithm 3: LSA
  7.5 Algorithm 4: LDA trained on Wikipedia articles
  7.6 Algorithm 5: LSA trained on Wikipedia articles
  7.7 Algorithm 6: Doc2Vec trained on Wikipedia articles
  7.8 Algorithm 7: Weighted Ensemble Recommender
  7.9 Gold Standard
8 Evaluation method
9 Results
  9.1 Recommendations
    9.1.1 Baseline
    9.1.2 Algorithm 1: Structured information matching
    9.1.3 Algorithm 2: LDA
    9.1.4 Algorithm 3: LSA
    9.1.5 Algorithm 4: LDA trained on Wikipedia articles
    9.1.6 Algorithm 5: LSA trained on Wikipedia articles
    9.1.7 Algorithm 6: Doc2Vec trained on Wikipedia articles
    9.1.8 Algorithm 7: Weighted Ensemble Recommender
    9.1.9 Gold Standard
  9.2 Representing recommendations
10 Discussion
11 Conclusions
12 Future work
A Target Datasets used in Evaluation Method
1 Introduction
The website oppnadata.se aims to increase the availability of Swedish open data. At the time of writing, the next version of oppnadata.se, which resides on the domain dataportalen.se, is being developed. This new data portal is being developed by this project's stakeholder Metasolutions on behalf of the Agency for Digital Government (DIGG).
At the beginning of this project, roughly two thousand datasets resided in the portal.
Over the course of the project, this number increased to six thousand. Furthermore, due to laws and regulations [Inf10], it is safe to assume that this number will increase further. As the number of datasets increases, browsing for relevant information becomes increasingly difficult and time-consuming. Currently, it is possible to browse the datasets by searching using text and then filtering the results by theme, organization, file format or license. We believe that there is more potential to connect the datasets using their associated metadata, thus making it easier to find relevant data.
This paper presents an approach to creating recommendations for similar datasets. The assumption made was that datasets with similar metadata also had similar contents.
A national specification defined around 70 mandatory, recommended or optional metadata fields, including title, description, themes, format, and keywords. By comparing metadata from different datasets, a measure of similarity was computed. Based on this measure, recommendations for datasets with the most similar metadata were created.
Since no user data existed on dataportalen.se, the recommendations were based solely
on the contents of the metadata. A challenge was therefore to maximize the potential of
the metadata. We did so by exploring different approaches to text pre-processing and
feature extraction, some of which utilized transfer learning. These approaches were then
compared to related work in order to evaluate their performance.
2 Background
Figure 1 Venn diagram of open data ("öppna data"), public data ("offentliga data", PSI-data) and their intersection, open public data ("öppna offentliga data"). By Peter Krantz - own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=12686318
All the data generated by the Swedish authorities is referred to as Public Sector Information (PSI), and the subset of this data that can be published to the public is referred to as Open Data. PSI is a European Union (EU) directive [Eur03] which has also been implemented as a Swedish law [Inf10]. The Swedish authorities are therefore required to publish a PSI list describing what data they possess [Pal19]. In addition to PSI-data, there is also open data, which has a broader definition as digital information that is freely available. In the intersection of PSI-data and open data lies open public data, as displayed in Figure 1.
However, publishing the data as open data is just the first step according to the director
of the World Wide Web Consortium (W3C), Tim Berners-Lee. The W3C has created
recommendations on how organizations should publish their data [Hau12]. The purpose
of these recommendations is to maximize the simplicity of finding and re-using open
data. The 5 star plan in Figure 2 shows the recommendations on publishing data on the
Web. According to this plan, open data is not the final step, but merely corresponds to
one star. The ultimate goal is to not only make it available as open data, but as linked
open data (LOD). The steps in Figure 2 can be explained as:
Figure 2 The 5 star open data plan
1. make your stuff available on the Web (in any format) under an open license
2. make it available as structured data (e.g., Excel instead of an image scan of a table)
3. make it available in a non-proprietary open format (e.g., CSV instead of Excel)
4. use Uniform Resource Identifiers (URI) to denote things, so that people can point at your stuff
5. link your data to other data to provide context
Similarly to how documents are linked on the Web, the idea is to link data together in
what is sometimes referred to as the Semantic Web. One can think of linked open data
as a global database on the Internet. Figure 3 displays some of the linked open datasets
that are published on the web.
Figure 3 The LOD cloud currently contains 1,239 datasets with 16,147 links (as of March 2019)
The scale of the linked open data cloud is enormous. It is worth noting that the LOD cloud does not cover all the existing linked open datasets on the Web. The Web was initially created for publishing and linking documents together. Google created a web search engine, where one could search these documents using text search. With the emergence of linked data on the web, Google has once again created a search engine, this time for linked data. In 2019, Noy et al. presented a search engine in a paper [BBN19] that had the potential to provide search capabilities over all datasets published on the Web. The approach taken was to create an open ecosystem where owners of the datasets mark up their data with semantically enhanced metadata, in the form of either Schema.org or W3C Data Catalog Vocabulary (DCAT). After some necessary augmentation, such as reducing multiple alterations of the same information into a single representation, the system links the metadata with related sources and builds an index [BBN19].
This approach means that data owners can publish their datasets on their own portals and, as long as they mark up the datasets, Google can successfully index the metadata.
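The snippet below is a minimal, hypothetical illustration of such markup, expressed as a Python dictionary serialized to JSON-LD using the Schema.org Dataset type; the property values are invented examples and are not taken from dataportalen.se.

```python
import json

# Hypothetical Schema.org "Dataset" markup that a data owner could embed in a web page
# (as a <script type="application/ld+json"> block) so that search engines can index it.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Air quality measurements",  # invented example values
    "description": "Hourly air quality measurements for a Swedish municipality.",
    "keywords": ["air quality", "environment"],
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}

print(json.dumps(dataset_markup, indent=2))
```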
This project's initiator, Metasolutions, is a company which offers information management solutions to organizations that want to make their data available as linked open data. By generating metadata about the data using linked data technologies for the metadata representation, they have a solution that handles the coordination of data well. This metadata follows the Swedish version of the European standard DCAT-AP and contains information such as title, description, themes, and keywords [Pal19]. Using this tool, organizations can publish their data in a well-coordinated manner as LOD.
Coordination of the data is not enough, however, as it is also important to be able to find the data. The National Portal for Open Data and PSI [Age19], found on dataportalen.se, gathers and presents metadata about datasets sourced from Swedish organizations, thus making the data findable and available for re-use. Metasolutions is the technology provider and maintainer of the portal on behalf of the Agency for Digital Government (DIGG). By collecting and coordinating this information in one place, the portal strives to increase transparency, innovation, and growth in our society [Age19].
Exploration of the linked data is essential, according to Tim Berners-Lee [BL06], and important aspects of exploration are findability and reusability. Currently, the data is coordinated and made available on the portal. Furthermore, the number of datasets on the portal is set to increase as a result of the aforementioned EU regulation and Swedish law. As the number of datasets increases, finding relevant information becomes increasingly time-consuming and difficult. At the time of writing, dataportalen.se and datasetsearch.research.google.com allow users to find datasets using text search and filtering by theme or topic. This project aims to provide users with additional ways of finding relevant datasets. The approach taken was to utilize recommender systems, with the assumption that once a user has found a dataset of interest, the user would most likely want to find more data regarding the same topic.
Recommender systems are widely used and adopted in similar situations when browsing a large set of items, and they are useful since they help users find items they might not have found otherwise. These systems are often used in domains such as music and shopping, and companies that provide these services invest heavily in their recommender systems. The history of recommender systems reaches back to Jussi Karlgren's paper An algebra for recommendations [Kar90], published in 1990. In his paper, Karlgren makes the case that related books in a bookshelf usually end up close to each other. He claims that this is caused either by the bookshelf being sorted by subject or by the activity of the readers, who place books they have read close to other books they think are similar, thus placing related books close to each other. This user activity was not taken into consideration in the information retrieval systems at the time. Therefore he presented his solution as a sort of digital bookshelf, in which he formalized document recommendations by readers and attempted to define a measure of closeness based on those recommendations.
Today, there are different types of recommender systems. The recommender system that Karlgren proposed is now known as a collaborative filtering system, and the first commercial one, Tapestry [GNOT92], was released in 1992, not long after Karlgren's paper. The book Hands-on recommendation system [Ban18] identifies four types of recommender systems:
• Collaborative filtering
• Content-based systems
• Knowledge-based systems
• Hybrid recommenders
Content-based systems are based on content such as descriptions or metadata, knowledge-based systems are based on explicit knowledge of user preferences, and hybrid recommenders are combinations of the aforementioned systems. Most companies today use hybrid approaches, for example the winners of the Netflix Prize in 2007, whose solution was a hybrid of multiple methods [BKV07]. By blending different approaches, the idea is to capture the different, complementary aspects of the data. What is important to note is that, as in Karlgren's work, user activity is still at the heart of recommender systems, and it is something that the best recommender systems rely on.

Collecting user data and relying on this information for making recommendations can be troublesome, since collecting user data raises concerns about data privacy. In the aforementioned Netflix Prize competition, privacy issues arose from the anonymized dataset offered by Netflix. By matching the film ratings from the Netflix dataset with film ratings from the Internet Movie Database (IMDB), two researchers were able to identify individual users [Sin09]. Ultimately, this led to the cancellation of the Netflix Prize competition. Therefore, there are reasons to be careful about gathering and using user data. The dataset provided for this project does not include any data on user activity. Creating a recommender system for this dataset is then analogous to, using Karlgren's bookshelf example, letting a computer read all the books in the bookshelf and then arranging them accordingly.
2.1 Research question
The task of this project was to infer relations between datasets using the metadata about each dataset. The inference was based on methods used in recommender systems, specifically content-based recommenders. Without any user information, we explored how well related datasets could be found using only metadata. These inferred relations, or recommendations, were then explicitly stated back into the metadata, i.e. links were created between a dataset and the datasets recommended for it. These links could then be used on the data portal dataportalen.se to find similar datasets, so that the information of interest in a given situation was easy to find. The project could therefore be summarized by the following question:
Is it possible, using a content-based recommender system, to infer relations between datasets, using the metadata provided by the DCAT-AP-SE specification?
3 Related work
This section presents two related projects. Section 3.1 presents a project which aimed at creating recommendations for a given movie from the Internet Movie Database (IMDB). Without available user data, the authors utilize natural language processing (NLP) in their model. The results are then compared to human-created recommendations as well as to a recommendation system that fully utilizes user data. Section 3.2 presents two projects related to the EuroVoc thesaurus. The first of the two relates to describing documents using the terms in the thesaurus. The second project attempts to assess the similarity of documents based on the terms in the thesaurus. These two projects formed a benchmark against which the results of this project could be compared.
3.1 Recommendations Without User Preferences: A Natural Language Processing Approach
In their paper, Fleischman and Hovy present an approach to creating a recommendation system without access to user preferences [FH03]. As stated by the authors, the performance of automatic recommendation systems suffers greatly when little to no information about users' preferences is available [FH03]. Their approach was, in part, to make use of NLP to provide decent content-based recommendations. The data used in the project was taken from the Internet Movie Database (IMDB), and the results of the project were compared to the results produced by IMDB's own recommendation system. Coming up with recommendations for a given movie was treated as a similarity measurement task, meaning the best recommendation was the movie most similar to the given movie [FH03]. The data used contained structured elements (director, cast, etc.) as well as textual elements (plot description). Separate similarity measures were calculated for each type of information, i.e., one measure for director, one for cast, one for plot description, etc. The final similarity score was then calculated by taking the weighted average of all the separate calculations. A normalized overlap score was used for the structured information. For the plot summaries, two algorithms were tested. The first algorithm involved transforming the summary into a vector with one feature for each word. The feature value then signifies whether the word appeared in the summary or not.
The similarity of two summaries is then given by the cosine similarity of the two vectors [FH03]. In the second algorithm, each genre is represented as a vector where each component is a word, and each component value indicates how indicative that word is for the genre. Hierarchical clustering is then used to merge similar genres, for which new vector representations are created. Next, vector representations for each summary are created by calculating the cosine similarity between the feature vector representing the summary and each of the newly created (merged) genres. In other words, each component in the representation is the cosine similarity of the summary and one of the new genres. Finally, the similarity of two film summaries is calculated by taking the cosine similarity between the new vector representations of the summaries [FH03].
The approach was evaluated by first picking ten random movies from a subset of movies whose rating was based on at least one thousand votes. For each of these randomly selected movies, ten recommendations were generated by each of the created algorithms, ten recommendations were selected at random, five recommendations were generated by humans, and ten recommendations were generated by IMDB's own algorithm. The results were then presented to four humans who rated the recommendations on a scale from one to four. The randomly selected and the human-generated recommendations achieved the worst and the best scores, respectively. Among the created algorithms, the second algorithm outperformed the first one. However, they were both outperformed by IMDB's own algorithm. As stated by the authors, although their NLP model failed to outperform the commercial model, it is much less labor-intensive [FH03]. A possible extension mentioned by the authors is using Latent Semantic Indexing (also known as Latent Semantic Analysis) to discover higher-order similarities between the summaries [FH03].
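To make the weighted-average idea concrete, the following is a minimal sketch of combining per-field similarities into one score; the field names, weights and helper functions are hypothetical illustrations, not the weights or implementation used by Fleischman and Hovy.

```python
import numpy as np

def overlap(a, b):
    # Normalized overlap between two sets of structured values (e.g. cast lists)
    a, b = set(a), set(b)
    return len(a & b) / max(len(a | b), 1)

def cosine(u, v):
    # Cosine similarity between two term vectors (e.g. binary plot-summary vectors)
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_similarity(field_scores, weights):
    # Weighted average of the per-field similarity scores
    return float(np.average(field_scores, weights=weights))

# Hypothetical per-field scores: director overlap, cast overlap, plot cosine
scores = [overlap({"director a"}, {"director b"}),
          overlap({"actor 1", "actor 2"}, {"actor 2", "actor 3"}),
          cosine([1, 0, 1, 1], [1, 1, 0, 1])]
print(combined_similarity(scores, weights=[1.0, 1.0, 2.0]))
```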
3.2 Eurovoc
EuroVoc is a multilingual and multidisciplinary thesaurus that contains hierarchically organized subject domains. The subjects are used by the Member States of the EU for the classification of official documents. The thesaurus is managed by the EU Publications Office. This section presents two projects related to EuroVoc: a tool for automatically assigning thesaurus terms in order to classify documents, and a project which attempts to calculate the similarity of documents using solely EuroVoc descriptors.

Initially, documents had to be manually labeled with the thesaurus terms, referred to as descriptors. The European Commission's Joint Research Centre (JRC) developed an indexer that automatically assigns EuroVoc descriptors to a given body of text [SET13].
In the model, each descriptor has a profile which consists of a ranked list of features that are common for that particular descriptor. Features may take different forms, for example, basic words, n-grams, part-of-speech combinations etc. Furthermore, the model requires that the document that is to be labeled is represented as a vector containing the frequency of the features in the document. Cosine similarity between the descriptor profiles and the feature vector representation of the document is then used to find the most suitable (the closest) descriptors for the document.
A paper by R. Steinberger, B. Pouliquen, and J. Hagman [SPH02] presents an approach to calculating the similarity of labeled documents written in the same or different languages. The approach was to first represent the documents in a language-independent way, more specifically using the EuroVoc thesaurus terms, referred to as descriptors. As in the previous paragraph, each descriptor has an associated list of words that are common to that particular descriptor. The lists of all the descriptors are then compared to the words in the document that is to be labeled. The assumption made was that if the words in a document were very similar to the associated words of a descriptor, then that descriptor was more likely to accurately describe the document. Because of this assumption, the most suitable descriptors for the document can be found by ranking the descriptors based on similarity. The list of descriptors assigned to the document can then be seen as a representation of the contents of that document [SPH02]. By calculating the cosine similarity between the descriptor representations of documents, it is then possible to get an idea of which documents may be related. The least distant documents should then be the ones that are also the most similar.
These two papers together form an alternative approach to representing and finding similar documents. First, the JRC indexer (JEX) is used to label the data in this project, after which the similarity between documents can be computed solely from the automatically assigned descriptors (as done in [SPH02]). The performance of this approach could then be compared to the models implemented in this project, described in Section 8.
4 Theory
Theory regarding data is first presented, followed by feature extraction methods and finally how to measure similarity of data.
4.1 Data
There are different types of datasets. They can be grouped into three categories: record data, graph-based data and ordered data [TSK19].
Record data is when the dataset is a collection of records (data objects), each of which has data fields (attributes). In this type of dataset, there is no explicit relationship between data objects or attributes. An example of a record-type dataset is the data matrix, where the data objects have a fixed number of numeric attributes [TSK19]. A collection of records can be stored either in a relational database or as a flat file.
Graph-based data is a type of dataset that captures relationships among data objects. In this type, objects are represented as nodes and the relationships are represented as links between the nodes. An example of this type is how the World Wide Web is organized, where web pages are represented as nodes and relationships between web pages are links between the nodes. A model for graph-based data is the Resource Description Framework (RDF). In the RDF data model, data is represented as triples. A triple is a statement consisting of a subject, a predicate and an object. An example of a triple is "Bob is 35", where "Bob" is the subject, "is" is the predicate and "35" is the object. Subjects, predicates and objects can have Uniform Resource Identifiers (URI) as values. By giving "Bob" a URI, such as "http://example.org/Bob", it can be reused in other statements, and the statements thus become linked together. This example is visualized in Figure 4.
Figure 4 An example of an RDF triple.
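As an illustration, a minimal sketch of expressing the "Bob is 35" triple with the rdflib Python library is shown below; the example.org URI follows the text above, while the use of the FOAF vocabulary for the predicate is an assumption made only for the sake of the example.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, XSD

g = Graph()
bob = URIRef("http://example.org/Bob")                      # the subject, identified by a URI
g.add((bob, FOAF.age, Literal(35, datatype=XSD.integer)))   # predicate and object

# Serializing the graph shows the triple in Turtle syntax
print(g.serialize(format="turtle"))
```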
Ordered data is the type of data where attributes have relationships that involve time or
space [TSK19]. An example of this data type is time-series data, where measurements are
collected over a certain period of time. The attributes will have temporal relationships
in this example.
4.1.1 Data Quality
In contrast to statistics, where experiments or surveys are carried out in order to achieve a pre-specified level of data quality, data mining is often applied to data that was not collected for the purpose of data mining. Consequently, data quality problems arise because of the nature of the raw data, and in order to deal with them it is necessary to detect and correct these problems; this process is often denoted data cleaning [TSK19]. By investigating how the data was generated, one can learn more about the data quality. Humans, measuring devices or a data collection process can be generators of data. In each of these, there is always a risk of errors, such as human errors, limitations of measuring devices or flaws in the data collection process [TSK19].

Missing values are a common problem and occur when a data object is missing one or more attributes. There are a number of ways to deal with this issue: the first is to eliminate data objects or attributes, the second is to estimate the missing values, and the third is to ignore the missing value during the analysis. Depending on the situation, one of these strategies can be applied [TSK19].

Duplicate data is when a dataset contains data objects that are duplicates or almost duplicates. Deduplication is the process of solving this issue. In deduplication, it is important to keep two aspects in mind [TSK19]. The first is that even if two data objects represent a single object, there might still be attributes that differ between the data objects. The second is that two similar data objects might not be duplicates at all, just very similar [TSK19].
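A minimal pandas sketch of the strategies above is given here; the column names and the placeholder value are invented for illustration and do not correspond to the actual metadata fields used in this project.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Air quality", "Air quality", "Road network", None],
    "theme": ["Environment", "Environment", "Transport", "Transport"],
})

deduplicated = df.drop_duplicates()                    # remove exact duplicate records
eliminated = deduplicated.dropna(subset=["title"])     # eliminate objects with missing values
estimated = deduplicated.fillna({"title": "unknown"})  # or estimate/fill in a placeholder value
print(estimated)
```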
The application perspective of data quality is another important viewpoint. That is, given a specific use case, how relevant or suitable is the data? The answer to that question determines the quality of the data for that application [TSK19]. Relevance and knowledge are two aspects to consider when looking at the data from the application perspective. Relevance is concerned with whether the available data contains the information that is necessary for the application. Not only are the attributes important, but also making sure that the data objects are relevant. Sampling bias occurs when the data objects in the dataset do not reflect the actual occurrence in the population. For example, this can be the case when conducting a survey, where the survey data only describes those who answered the survey [TSK19]. Since the data analysis only reflects the survey data, it might not generalize to a wider population. Knowledge about the data, on the other hand, is knowledge that helps to describe different aspects of the data. This can be in the form of documentation. The precision of the data, the scale of measurement and the origin of the data are also important characteristics that can benefit from documented knowledge.
4.1.2 Data Preprocessing
Data preprocessing is a fundamental step in making the data ready for data mining. Various techniques can be applied in this process, but they all fall into two categories: the first is selecting data objects and attributes for the analysis, while the second is creating or changing the attributes [TSK19].
Aggregation is a technique in which two or more data objects are combined. There are many ways to combine data objects; what it boils down to is how to handle the attributes when combining the data objects. What can be lost in aggregation is the detail of the attributes of the original data objects, and therefore a sense of what abstraction level is appropriate is something that needs to be considered. Another important motivation is that aggregating data objects results in a smaller dataset, which can be of interest for memory and processing reasons [TSK19].
Dimensionality reduction is a technique for handling data objects which potentially have a large number of attributes. As an example, take a text document as the data object, and the occurrence of a word in that text document as an attribute. In this example, the number of attributes will be the size of the vocabulary of the text document, which in some cases can be in the thousands. The curse of dimensionality is a term that refers to the issues that are related to a high number of attributes in data.

Even if the number of attributes increases, a given data object might only have values for a few of the attributes, which is referred to as sparse data. This phenomenon can make data analysis difficult. A way to tackle this issue is to use techniques that create new attributes by combining the old attributes. One technique for doing so is Singular Value Decomposition (SVD), which is further explained in Section 4.2.1 [TSK19].
Attribute subset selection is another way to tackle the problem of a high number of attributes: identifying a smaller number of attributes that are a subset of the old attributes. It might seem like information will be lost in this process, but by identifying redundant and irrelevant attributes, this can be avoided. Redundant attributes duplicate much of the information of other attributes, and irrelevant attributes are attributes which are deemed to contain no useful information for the analysis task [TSK19].
Common sense or domain knowledge can be used to judge attributes; however, this approach is somewhat subjective. Instead, a more systematic approach can be taken to obtain a more objective judgement. The gist of such a systematic approach is to try all possible subsets of attributes, use them in the algorithm at hand, and take the subset of attributes which produces the best results. As the number of possible subsets of n attributes is 2 to the nth power, there are three alternative approaches: embedded, filter and wrapper. The embedded approach is where the attribute selection is part of the data mining algorithm, the filter approach is where the attributes are selected by some independent algorithm before the data analysis, and finally the wrapper approach is one in which the data mining algorithm is treated as a black box, similar to trying all possible subsets but without actually testing all of them [TSK19].
Another approach to subset selection is feature weighting. Instead of the binary approach of either deleting or keeping a feature, some attributes are deemed more important than others. The importance of a feature can be judged by domain knowledge [TSK19], or as part of an iterative approach using the data mining algorithm as a black box and inspecting the result, similar to the wrapper approach described in the previous paragraph.
Besides the aforementioned ways to reduce the number of attributes, one can also attempt to create new attributes which more effectively capture the information stored in the old attributes, using feature creation methods. One methodology for doing so is feature extraction, in which the raw data is processed to obtain higher-level attributes. Taking the case of a text document as an example, there are algorithms which can find patterns in the data objects of a dataset, thus generating subject topics and mapping these topics to each text document. The new number of attributes is then the number of generated topics. Feature extraction is, as in this example, highly domain-specific, and techniques from one field generally have limited practical use in other domains. In conclusion, when opting for feature creation it is necessary to delve into the domain of the attributes to find the right tools [TSK19].
4.1.3 Text data preprocessing
As computers read bodies of text as string objects, it is necessary to separate larger bodies of text into smaller chunks so that computers can evaluate individual words [Bey18]. This can be achieved through tokenization of the natural text, which is the process of segmenting the text into smaller chunks, or tokens. Segmentation can be done on the actual tokens or on separators between tokens, such as whitespace or punctuation. This way of representing a text document in terms of tokens is often referred to as a Bag-Of-Words (BOW) model. Adding all these tokens together results in a BOW vector which disregards the order of tokens but retains token multiplicity; therefore, the vector is also sometimes referred to as a term frequency vector. Figure 5 shows a BOW vector representation of a document.
Figure 5 The BOW representation of a document containing the phrase "To be or not to be"
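The following is a minimal sketch of producing such a term frequency vector in Python; the tokenization is a simple regular-expression split and is only meant to illustrate the idea in Figure 5.

```python
from collections import Counter
import re

text = "To be or not to be"

# Tokenize on word characters and lowercase so that "To" and "to" become the same token
tokens = re.findall(r"\w+", text.lower())

# The BOW (term frequency) vector: token multiplicity is kept, token order is discarded
bow = Counter(tokens)
print(bow)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```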
However, not all tokens may be desirable. For instance, punctuation appears all throughout natural languages and makes reading the text easier for humans, but punctuation adds little in the way of value in a BOW vector. There also exist words in all languages that occur very frequently but that carry little information about the meaning of a sentence [HHL19]. These words are often referred to as stop words; two examples are the and is.
Filtering out these undesirable tokens reduces the overall size of the vocabulary while largely maintaining the information contained. One approach is to utilize regular expressions, which are sequences of characters that define a pattern. These patterns can then be used to find the undesired tokens and replace or remove them. This approach relies on a predefined list of stop words and regular expressions against which tokens can be matched.
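As a small illustration of this filtering step, the sketch below strips punctuation with a regular expression and removes tokens found in a stop-word list; the three-word stop-word list is purely illustrative, not a real list such as the ones shipped with NLP libraries.

```python
import re

STOP_WORDS = {"the", "is", "or"}  # illustrative only; real stop-word lists are much longer

def clean_tokens(text):
    # Keep only letter sequences (dropping punctuation), lowercase, then drop stop words
    tokens = re.findall(r"[a-zåäö]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tokens("To be, or not to be: that is the question."))
# ['to', 'be', 'not', 'to', 'be', 'that', 'question']
```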
There may exist tokens which, in essence, provide the same information but which for grammatical reasons appear in different forms in a single sentence or text. The process of combining these terms into a single normalized form is known as normalizing the vocabulary, which further reduces the size of the vocabulary. One instance of normalization is to consolidate multiple words, where the only difference is capitalization, into a single word. Further normalization can be achieved by breaking words down into their semantic roots - their lemmas. Lemmatization is the process of removing the inflectional endings of a word and returning the root, or lemma, of that word. For instance, troubling, troubles and troubled can all be grouped to trouble, as shown in Figure 6. Dictionaries are utilized to allow the algorithm to perform a morphological analysis of terms.
Figure 6 Consolidating multiple alterations of a word by removing inflectional endings.
Stemming is an alternative approach to removing inflectional endings. However, instead of relying on dictionaries, stemming takes a heuristic, rule-based approach. A stemming algorithm was presented by Martin Porter in 1980. The approach was to remove suffixes from words based on a set of rules [P+80]. The algorithm utilizes rules in several stages to reduce a word. A term is matched against a list of rules, and the longest match found is the rule that will be chosen for a stage. Examples of rules from Porter's algorithm can be seen in Figure 7.
Figure 7 Example of removing inflectional endings using stemming rules.
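A minimal sketch of rule-based stemming using the NLTK library is shown below; note that NLTK's Porter implementation is for English, while a Snowball stemmer would be used for Swedish text such as the dataset metadata. The exact stems produced depend on the stemmer's rules.

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
print([porter.stem(w) for w in ["troubling", "troubles", "troubled"]])
# all three typically reduce to the same stem, 'troubl'

# For Swedish metadata, a Swedish Snowball stemmer can be used instead
swedish = SnowballStemmer("swedish")
print(swedish.stem("datamängderna"))  # removes the definite plural ending
```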
Combining the BOW vectors representing the documents results in a sparse matrix that represents the entire collection of documents, also known as the corpus. Each row represents a document and the columns of the matrix are all the words in the corpus. The column values indicate how many times a particular word appears in a given document.
However, this numeric value of occurrences provides little insight into how indicative the term is for a given document, i.e., how effectively the term can be used to tell a document apart from the rest of the collection.
An alternative approach to the BOW model is Term Frequency-Inverse Document Frequency (TF-IDF). Like the BOW model, vectors are used to represent documents. However, instead of tokens being valued by the number of times they appear throughout a document, a TF-IDF value is calculated. TF-IDF is calculated from two values: the term frequency and the inverse document frequency. The document frequency df_t is the number of documents in the collection that contain a term t. The inverse document frequency idf_t of a term t in a collection consisting of N documents is given by Equation 1, as presented by [MRS08].

$$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right) \qquad (1)$$
The IDF value provides insight into how relevant a term is for distinguishing a given document in a collection. Terms that appear frequently throughout the collection will receive a low value while rare terms receive a high value.
Similar to the BOW model, the final matrix representation of the corpus contains a row for every document and a column for every term. A document in this representation is a vector with a component for every term in the corpus. The component is zero for terms that do not occur in the document. For terms that do appear, the TF-IDF value for term t in document d is given by equation 2 as presented by [MRS08].
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \qquad (2)$$
Terms that occur frequently within a small number of documents will receive a large
value. This signifies that the terms are indicative of the small number of documents,
i.e., the terms are useful when trying to tell the small number of documents apart from
the rest of the collection. Terms that occur in a large number of documents or a limited
number of times in a document will receive a lower value. Finally, terms that occur in
pretty much all documents will receive a very low value.
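A minimal sketch of computing such a TF-IDF matrix with scikit-learn follows; the three toy documents are invented, and note that scikit-learn's TfidfVectorizer uses a slightly different idf formula than Equation 1 (it adds one to the idf and L2-normalizes each document vector by default), so the exact values differ from a by-hand calculation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "air quality measurements for stockholm",
    "air quality measurements for gothenburg",
    "public library opening hours",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # one row per document, one column per term

print(vectorizer.get_feature_names_out())      # the terms of the corpus vocabulary
print(tfidf_matrix.toarray().round(2))         # rare terms get higher weights than shared ones
```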
4.2 Feature Extraction
Sections 4.2.1 and 4.2.2 present two different approaches to topic modelling. Topic modelling rests on the assumption that there exist a number of abstract themes or topics in a collection of documents, and that every document in the collection can be represented by some mixture of these topics. In the case of text documents, topics are, in essence, similar words grouped or clustered together. This means that instead of being represented by word vectors, text documents can be described by a mixture of word topics. These mixtures are often referred to as topic vectors. By looking at the documents as represented by topic vectors, it is also possible to get a sense of which documents may be related.

Section 4.2.3 presents an approach to representing words as feature vectors. Section 4.2.4 takes this concept a step further and presents an approach to representing entire paragraphs as feature vectors.
4.2.1 Latent Semantic Analysis
Latent Semantic Analysis (LSA) is an approach that attempts to capture relationships between documents and the terms used in the documents. It achieves this by using Singular Value Decomposition (SVD), a mathematical algorithm for factorizing a matrix into three matrices. These three new matrices have the property that, when multiplied together, they form the original matrix. Each of the three matrices has some mathematical properties that can be exploited for dimension reduction or to perform LSA.
In the case of applying SVD to the TF-IDF or BOW matrices described in the text preprocessing section 4.1.3, SVD will, in essence, find which terms/words are correlated. Correlation in this case means that the terms frequently occur side by side in the same document and that they also vary together over the collection of documents. Additionally, SVD also calculates the correlation between documents based on the terms that they use. These two calculations enable SVD to determine which linear combinations of terms present the largest variation in the collection. These linear combinations form the (latent) topics that can then be used to describe each document and how documents may be related.
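The sketch below applies truncated SVD to a TF-IDF matrix with scikit-learn, which is a common way to perform LSA; the toy documents and the choice of two latent topics are assumptions made only for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "air quality measurements for stockholm",
    "air quality measurements for gothenburg",
    "public library opening hours",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Factorize the term space into 2 latent topics; each document becomes a 2-dimensional topic vector
svd = TruncatedSVD(n_components=2, random_state=0)
topic_vectors = svd.fit_transform(tfidf)
print(topic_vectors.round(2))  # similar documents receive similar topic vectors
```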
4.2.2 Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) is a generative model for text and other discrete data presented by David M. Blei, Andrew Y. Ng and Michael I. Jordan [BNJ02]. LDA assumes that there exist a number of topics that are distributions over the words in the corpus vocabulary, and that all the documents in the corpus can be represented as distributions over these topics.
Figure 8 Plate notation showing the variables involved in LDA.

Using the plate notation shown in Figure 8, we can show the variables involved in the process, as well as how these variables are related. The boxes represent repeated sections. The outer, larger box represents the M documents, and the inner box represents the word positions. A corpus D consists of M documents, where document i has N_i words.
α and β are the parameters of the Dirichlet priors for the per-document topic and per-topic word distributions, respectively. A Dirichlet distribution is a multivariate generalization of the Beta distribution. The output of a Dirichlet distribution is a vector of probabilities that sums to one. Dirichlet distributions take a hyperparameter; a higher value of the hyperparameter pushes the distribution toward the center, while a lower value pushes the distribution towards the corners. The effect of different values of the hyperparameter is shown in Figure 9.
For LDA, this means that a higher α results in documents being distributed over many topics, while a lower α results in fewer topics. Similar to α, a higher value of β means a topic is distributed over many words, while a lower β means fewer words are used. For a given document, denoted i, the topic distribution of that document is θ_i. w_ij denotes the word at position j in document i, while z_ij denotes the topic for the same word. Lastly, the word distribution for some topic k is denoted ϕ_k.
Figure 9 The effect the alpha hyperparameter has on the Dirichlet distribution. Higher values push the distribution toward the center while a lower value pushes the distribution towards the corners/edges.
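To illustrate the effect of the hyperparameter, the following sketch draws from symmetric Dirichlet distributions with a low and a high concentration value using numpy; the specific values 0.1 and 10 are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each draw is a probability vector over 3 topics that sums to one
print(rng.dirichlet([0.1, 0.1, 0.1]))     # low alpha: mass concentrated on few components
print(rng.dirichlet([10.0, 10.0, 10.0]))  # high alpha: mass spread more evenly
```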
As described by [BNJ03], for a corpus D consisting of M documents, each of length N_i, LDA assumes the following generative process for each document:

1. Choose θ_i ∼ Dir(α), where i ∈ {1, ..., M}
2. Choose ϕ_k ∼ Dir(β), where k ∈ {1, ..., K}
3. For each word position (i, j), where i ∈ {1, ..., M} and j ∈ {1, ..., N_i}:
   (a) Choose a topic z_ij ∼ Multinomial(θ_i)
   (b) Choose a word w_ij ∼ Multinomial(ϕ_{z_ij})
4.2.3 Word2Vec
While the BOW model of representing texts is relatively straightforward to implement, it has several drawbacks. The model discards the order in which words appear and also fails to capture the semantics of words, i.e., the model provides no natural notion of similarity between words. For instance, the word apple should, in a sense, be more similar to orange than to car; however, this is not captured in this representation. It would be preferable to have similar words occupy positions close to one another, or in other words, the vector representations of the words should reflect their similarities.
Neural word embeddings are an alternative representation that captures similarities. In this representation, words are represented by feature vectors where each component of the feature vector captures some aspect of the words in the vocabulary. To get some intuition, imagine that one vector component symbolizes royalty. The word king would then have a high value for that component, while the word apple would not. In practice, however, components are more complex than single adjectives. Words that have similar features will therefore occupy positions close to one another, since their feature vectors will have similar values. Furthermore, these new representations have some interesting properties besides grouping similar words. They also capture analogies; a classic example is that king is to man what queen is to woman. To elaborate, the vector difference between the points representing man and woman is very similar to the vector difference between the points representing king and queen [MYZ13]. This is shown in Figure 10.
Figure 10 A crude representation of how analogies are captured by the feature representation. The same vector difference describes both the relation between man and king and the relation between woman and queen.
The Word2Vec model, created by Mikolov et al. [MCCD13] is a neural network for
creating neural word embeddings. The network consists of input, projection and output
layers. It utilizes one of two methods: the continuous bag-of-words (CBOW) model or
the continuous skip-gram model. The CBOW model attempts to predict a word based
on the context of the word while the continuous skip-gram model attempts to predict
the context of a given word. An overview of CBOW and skip-gram models can be seen
in Figure 11.
Figure 11 A side-by-side comparison of the architecture of CBOW and skip-gram, respectively [MCCD13].
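A minimal sketch of training such a model with the gensim library is shown below; the toy sentences are invented and far too small for meaningful vectors, but they illustrate how the sg parameter switches between the CBOW and skip-gram architectures.

```python
from gensim.models import Word2Vec

sentences = [["open", "data", "portal"],
             ["swedish", "open", "data"],
             ["air", "quality", "data"]]

# sg=0 selects CBOW, sg=1 selects skip-gram; vector_size is the feature-vector length N
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0, epochs=50, seed=0)

print(model.wv["data"])                        # the learned feature vector for "data"
print(model.wv.most_similar("data", topn=2))   # nearest words in the embedding space
```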
For the CBOW model (shown in Figure 12), the input to the network is C words from the context of a focus word. These context words are represented as one-hot vectors, i.e., one component in each vector x_1, x_2, ..., x_C is 1 and the rest are 0. The weights between the input and hidden layers are contained in a matrix W with dimensions V × N, where V is the size of the vocabulary and N is the length of the feature vectors. Each row in W is the feature vector representation of a word in the vocabulary. More specifically, every word in the input layer has a feature vector representation v_w given by its associated row in W. W is shared for all context words, i.e., the feature vector for apple is the same for all instances of the word.
Figure 12 The network architecture of the CBOW model [Ron14].
The output of the hidden layer is calculated by taking the product of the transposed weight matrix W and the average of the context vectors, as shown in Equation 3.

$$h = \frac{1}{C} W^{T} (x_1 + x_2 + \cdots + x_C) \qquad (3)$$

$$\;\, = \frac{1}{C} (v_{w_1} + v_{w_2} + \cdots + v_{w_C}) \qquad (4)$$

A matrix W' with dimensions N × V contains all the weights between the hidden and output layers. Using the weights contained in W', a score for each word in the vocabulary can be calculated. Equation 5 is used for calculating the score of a word, where v'_{w_j} is the j-th column in W'.

$$u_j = {v'_{w_j}}^{T} h \qquad (5)$$
The softmax function is then used to obtain the posterior distribution of words. Given the focus word w_I, the posterior is a multinomial distribution given by Equation 6, where the j-th unit in the output layer is given by y_j.

$$p(w_j \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (6)$$
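As a concrete illustration of Equations 3-6, the following is a toy numpy sketch of a single CBOW forward pass under the notation above; the weight matrices are random rather than trained, and the dimensions are arbitrarily small.

```python
import numpy as np

V, N, C = 5, 3, 2                      # vocabulary size, feature-vector length, context size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))            # input->hidden weights, one row v_w per word
W_prime = rng.normal(size=(N, V))      # hidden->output weights

x1, x2 = np.eye(V)[1], np.eye(V)[3]    # one-hot context vectors

h = W.T @ (x1 + x2) / C                # Equation 3: average of the context word vectors
u = W_prime.T @ h                      # Equation 5: score u_j for every word j
y = np.exp(u) / np.exp(u).sum()        # Equation 6: softmax posterior over the vocabulary
print(y, y.sum())                      # a probability distribution summing to 1
```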
Figure 13 The network architecture for the skip-gram model [Ron14].
In the skip-gram model (shown in Figure 13), the input to the network is simply the focus word, given by a one-hot vector x_k. This means that the output of the hidden layer h is a (transposed) row of the weight matrix W, i.e., the row in W that corresponds to the focus word. Equation 7 shows how h is calculated.

$$h = W^{T}_{(k,\cdot)} = v_{w_I}^{T} \qquad (7)$$
Whereas the CBOW model attempts to predict the focus word given C context words, skip-gram attempts to predict C context words given the focus word. This means the output of the network will be C multinomial distributions. The same weight matrix W' located between the hidden and the output layers is used for each output.

Equation 8 is used for calculating the score of the c-th word (also referred to as a panel [Ron14]), where v'_{w_j} is the output vector of the j-th word in the vocabulary and is given as a column in W'.

$$u_{c,j} = u_j = {v'_{w_j}}^{T} h, \quad \text{for } c = 1, 2, \ldots, C \qquad (8)$$

u_{c,j} denotes the net input of the j-th output unit on the c-th panel of the output layer.
The multinomial distribution given the focus word w_I is given by Equation 9.

$$p(w_{c,j} = w_{O,c} \mid w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (9)$$