Summary

The website dataportalen.se aims to increase the availability of Swedish open data. This is done by collecting metadata about the open datasets provided by Swedish organizations. At the time of writing, the portal contains metadata from more than two thousand datasets, and this number will grow. As the number of datasets grows, browsing for relevant information becomes increasingly difficult and time-consuming. Currently, it is possible to browse the datasets by searching with text and then filtering by theme, organization, file format or license.

We believe there is further potential to connect the datasets, which would make it easier to find datasets of interest. The idea is to find common denominators in the metadata of the datasets. Since no user data exists, we investigate to what extent this idea can be realized. The datasets are annotated with metadata such as title, description, keywords, and theme. By comparing the metadata of different datasets, a measure of similarity can be computed. This measure can then be used to find the most relevant datasets for a specific dataset.

The result of the metadata analysis is that similar datasets can indeed be found. By exploring different methods, we found that the text data contains useful information that can be used to find relations between datasets. Using a related work as a benchmark, we found that our results are as good, if not better.

The results show that related datasets can be found using only text data, and we believe that the identified method is general enough to have the potential to be used for similar problems where text data is available.
Contents

1 Introduction
2 Background
  2.1 Research question
3 Related work
  3.1 Recommendations Without User Preferences: A Natural Language Processing Approach
  3.2 Eurovoc
4 Theory
  4.1 Data
    4.1.1 Data Quality
    4.1.2 Data Preprocessing
    4.1.3 Text data preprocessing
  4.2 Feature Extraction
    4.2.1 Latent Semantic Analysis
    4.2.2 Latent Dirichlet Allocation
    4.2.3 Word2Vec
    4.2.4 Doc2Vec
  4.3 Measures of Similarity and Dissimilarity
5 Materials
  5.1 dataportalen.se dataset
  5.2 Swedish Wikipedia dataset
6 Method
  6.1 From Linked Data to Flat Data
  6.2 Data Cleaning
  6.3 Data Pre-processing, Text Pre-Processing and Feature Extraction
  6.4 Comparing similarity of datasets
  6.5 Representing Recommended Datasets
7 Inferring Dataset Relations
  7.1 Baseline
  7.2 Algorithm 1: Structured information matching
  7.3 Algorithm 2: LDA
  7.4 Algorithm 3: LSA
  7.5 Algorithm 4: LDA trained on Wikipedia articles
  7.6 Algorithm 5: LSA trained on Wikipedia articles
  7.7 Algorithm 6: Doc2Vec trained on Wikipedia articles
  7.8 Algorithm 7: Weighted Ensemble Recommender
  7.9 Gold Standard
8 Evaluation method
9 Results
  9.1 Recommendations
    9.1.1 Baseline
    9.1.2 Algorithm 1: Structured information matching
    9.1.3 Algorithm 2: LDA
    9.1.4 Algorithm 3: LSA
    9.1.5 Algorithm 4: LDA trained on Wikipedia articles
    9.1.6 Algorithm 5: LSA trained on Wikipedia articles
    9.1.7 Algorithm 6: Doc2Vec trained on Wikipedia articles
    9.1.8 Algorithm 7: Weighted Ensemble Recommender
    9.1.9 Gold Standard
  9.2 Representing recommendations
10 Discussion
11 Conclusions
12 Future work
A Target Datasets used in Evaluation Method
1 Introduction
The website oppnadata.se aims to increase the availability of Swedish open data. At the time of writing, the next version of oppnadata.se, which resides on the domain dataportalen.se, is being developed. This new data portal is being developed by this project's stakeholder Metasolutions on behalf of the Agency for Digital Government (DIGG).
At the beginning of this project, roughly two thousand datasets resided in the portal.
Over the course of the project, this number increased to six thousand. Furthermore, due to laws and regulations [Inf10], it is safe to assume that this number will increase further. As the number of datasets increases, browsing for relevant information becomes increasingly difficult and time-consuming. Currently, it is possible to browse the datasets by searching using text and then filtering the results by theme, organization, file format or license. We believe that there is more potential to connect the datasets using their associated metadata, thus making it easier to find relevant data.
This paper presents an approach to creating recommendations for similar datasets. The assumption made was that datasets with similar metadata also had similar contents.
A national specification defined around 70 mandatory, recommended or optional metadata fields, including title, description, themes, format, and keywords. By comparing metadata from different datasets, a measure of similarity was computed. Based on this measure, recommendations for datasets with the most similar metadata were created.
Since no user data existed on dataportalen.se, the recommendations were based solely
on the contents of the metadata. A challenge was therefore to maximize the potential of
the metadata. We did so by exploring different approaches to text pre-processing and
feature extraction, some of which utilized transfer learning. These approaches were then
compared to related work in order to evaluate their performance.
2 Background
Figure 1 Venn diagram of open data ("öppna data"), public data ("offentliga data", PSI-data) and their intersection, open public data ("öppna offentliga data"). By Peter Krantz - own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=12686318
All the data generated by the Swedish authorities is referred to as Public Sector Information (PSI), and the subset of this data that can be published to the public is referred to as Open Data. PSI is a European Union (EU) directive [Eur03] which has also been implemented as a Swedish law [Inf10]. The Swedish authorities are therefore required to publish a PSI list describing what data they possess [Pal19]. In addition to PSI-data, there is also open data, which has a broader definition as digital information that is freely available. In the intersection of PSI-data and open data lies open public data, as displayed in Figure 1.
However, publishing the data as open data is just the first step according to the director
of the World Wide Web Consortium (W3C), Tim Berners-Lee. The W3C has created
recommendations on how organizations should publish their data [Hau12]. The purpose
of these recommendations is to maximize the simplicity of finding and re-using open
data. The 5 star plan in Figure 2 shows the recommendations on publishing data on the
Web. According to this plan, open data is not the final step, but merely corresponds to
one star. The ultimate goal is to not only make it available as open data, but as linked
open data (LOD). The steps in Figure 2 can be explained as:
Figure 2 The 5 star open data plan
1. make your stuff available on the Web (in any format) under an open license
2. make it available as structured data (e.g., Excel instead of an image scan of a table)
3. make it available in a non-proprietary open format (e.g., CSV instead of Excel)
4. use Uniform Resource Identifiers (URI) to denote things, so that people can point at your stuff
5. link your data to other data to provide context
Similarly to how documents are linked on the Web, the idea is to link data together in
what is sometimes referred to as the Semantic Web. One can think of linked open data
as a global database on the Internet. Figure 3 displays some of the linked open datasets
that are published on the web.
Figure 3 The LOD cloud currently contains 1,239 datasets with 16,147 links (as of March 2019)
The scale of the linked open data cloud is enormous. It is worth noting that the LOD cloud does not cover all the existing linked open datasets on the Web. The Web was initially created for publishing and linking documents together. Google created a web search engine, where one could search these documents using text search. With the emergence of linked data on the web, Google has once again created a search engine, this time for linked data. In 2019, Noy et al. presented a search engine in a paper [BBN19] that had the potential to provide search capabilities over all datasets published on the Web. The approach taken was to create an open ecosystem where owners of the datasets mark up their data with semantically enhanced metadata, in the form of either Schema.org or W3C Data Catalog Vocabulary (DCAT). After some necessary augmentation, such as reducing multiple alterations of the same information into a single representation, the system links the metadata with related sources and builds an index [BBN19].
This approach means that data owners can publish their datasets on their own portals and, as long as they mark up the datasets, Google can successfully index the metadata.
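The snippet below is a minimal, hypothetical illustration of such markup, expressed as a Python dictionary serialized to JSON-LD using the Schema.org Dataset type; the property values are invented examples and are not taken from dataportalen.se.

```python
import json

# Hypothetical Schema.org "Dataset" markup that a data owner could embed in a web page
# (as a <script type="application/ld+json"> block) so that search engines can index it.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Air quality measurements",  # invented example values
    "description": "Hourly air quality measurements for a Swedish municipality.",
    "keywords": ["air quality", "environment"],
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}

print(json.dumps(dataset_markup, indent=2))
```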
This project's initiator, Metasolutions, is a company which offers information management solutions to organizations that want to make their data available as linked open data. By generating metadata about the data using linked data technologies for the metadata representation, they have a solution that handles the coordination of data well. This metadata follows the Swedish version of the European standard DCAT-AP and contains information such as title, description, themes, and keywords [Pal19]. Using this tool, organizations can publish their data in a well-coordinated manner as LOD.
Coordination of the data is not enough, however, as it is also important to be able to find the data. The National Portal for Open Data and PSI [Age19], found on dataportalen.se, gathers and presents metadata about datasets sourced from Swedish organizations, thus making the data findable and available for re-use. Metasolutions is the technology provider and maintainer of the portal on behalf of the Agency for Digital Government (DIGG). By collecting and coordinating this information in one place, the portal strives to increase transparency, innovation, and growth in our society [Age19].
Exploration of the linked data is essential, according to Tim Berners-Lee [BL06], and important aspects of exploration are findability and reusability. Currently, the data is coordinated and made available on the portal. Furthermore, the number of datasets on the portal is set to increase as a result of the aforementioned EU regulation and Swedish law. As the number of datasets increases, finding relevant information becomes increasingly time-consuming and difficult. At the time of writing, dataportalen.se and datasetsearch.research.google.com allow users to find datasets using text search and filtering by theme or topic. This project aims to provide users with additional ways of finding relevant datasets. The approach taken was to utilize recommender systems, with the assumption that once a user has found a dataset of interest, the user would most likely want to find more data regarding the same topic.
Recommender systems are widely used and adopted in similar situations when browsing a large set of items, and they are useful since they help users find items they might not have found otherwise. These systems are often used in domains such as music and shopping, and companies that provide these services invest heavily in their recommender systems. The history of recommender systems reaches back to Jussi Karlgren's paper An algebra for recommendations [Kar90], published in 1990. In his paper, Karlgren makes the case that related books in a bookshelf usually end up close to each other. He claims that this is caused either by the bookshelf being sorted by subject or by the activity of the readers, who place books they have read close to other books they think are similar, thus placing related books close to each other. This user activity was not taken into consideration in the information retrieval systems at the time. Therefore he presented his solution as a sort of digital bookshelf, in which he formalized document recommendations by readers and attempted to define a measure of closeness based on those recommendations.
Today, there are different types of recommender systems. The recommender system that Karlgren proposed is now known as a collaborative filtering system, and the first commercial one, Tapestry [GNOT92], was released in 1992, not long after Karlgren's paper. The book Hands-on recommendation system [Ban18] identifies four types of recommender systems:
• Collaborative filtering
• Content-based systems
• Knowledge-based systems
• Hybrid recommenders
Content-based systems are based on content such as descriptions or metadata, knowledge-based systems are based on explicit knowledge of user preferences, and hybrid recommenders are combinations of the aforementioned systems. Most companies today use hybrid approaches, for example the winners of the Netflix Prize in 2007, whose solution was a hybrid of multiple methods [BKV07]. By blending different approaches, the idea is to capture the different, complementary aspects of the data. What is important to note is that, as in Karlgren's work, user activity is still at the heart of recommender systems, and it is something that the best recommender systems rely on.

Collecting user data and relying on this information for making recommendations can be troublesome, since collecting user data raises concerns about data privacy. In the aforementioned Netflix Prize competition, privacy issues arose from the anonymized dataset offered by Netflix. By matching the film ratings from the Netflix dataset with film ratings from the Internet Movie Database (IMDB), two researchers were able to identify individual users [Sin09]. Ultimately, this led to the cancellation of the Netflix Prize competition. Therefore, there are reasons to be careful about gathering and using user data. The dataset provided for this project does not include any data on user activity. Creating a recommender system for this dataset is then analogous to, using Karlgren's bookshelf example, letting a computer read all the books in the bookshelf and then arranging them accordingly.
2.1 Research question
The task of this project was to infer relations between datasets using the metadata about each dataset. The inference was based on methods used in recommender systems, specifically content-based recommenders. Without any user information, we explored how well related datasets could be found using only metadata. These inferred relations, or recommendations, were then explicitly stated back into the metadata, i.e. links were created between a dataset and the datasets recommended for it. These links could then be used on the data portal dataportalen.se to find similar datasets, so that the information of interest in a given situation was easy to find. The project could therefore be summarized by the following question:
Is it possible, using a content-based recommender system, to infer relations between datasets, using the metadata provided by the DCAT-AP-SE specification?
3 Related work
This section presents two related projects. Section 3.1 presents a project which aimed at creating recommendations for a given movie from the Internet Movie Database (IMDB). Without available user data, the authors utilize natural language processing (NLP) in their model. The results are then compared to human-created recommendations as well as to a recommendation system that fully utilizes user data. Section 3.2 presents two projects related to the EuroVoc thesaurus. The first of the two relates to describing documents using the terms in the thesaurus. The second project attempts to assess the similarity of documents based on the terms in the thesaurus. These two projects formed a benchmark against which the results of this project could be compared.
3.1 Recommendations Without User Preferences: A Natural Language Processing Approach
In their paper, Fleischman and Hovy present an approach to creating a recommendation system without access to user preferences [FH03]. As stated by the authors, the performance of automatic recommendation systems suffers greatly when little to no information about users' preferences is available [FH03]. Their approach was, in part, to make use of NLP to provide decent content-based recommendations. The data used in the project was taken from the Internet Movie Database (IMDB), and the results of the project were compared to the results produced by IMDB's own recommendation system. Coming up with recommendations for a given movie was treated as a similarity measurement task, meaning the best recommendation was the movie most similar to the given movie [FH03]. The data used contained structured elements (director, cast, etc.) as well as textual elements (plot description). Separate similarity measures were calculated for each type of information, i.e., one measure for director, one for cast, one for plot description, etc. The final similarity score was then calculated by taking the weighted average of all the separate calculations. A normalized overlap score was used for the structured information. For the plot summaries, two algorithms were tested. The first algorithm involved transforming the summary into a vector with one feature for each word. The feature value then signifies whether the word appeared in the summary or not.
The similarity of two summaries is then given by the cosine similarity of the two vectors [FH03]. In the second algorithm, each genre is represented as a vector where each component is a word, and each component value indicates how indicative that word is for the genre. Hierarchical clustering is then used to merge similar genres, for which new vector representations are created. Next, vector representations for each summary are created by calculating the cosine similarity between the feature vector representing the summary and each of the newly created (merged) genres. In other words, each component in the representation is the cosine similarity of the summary and one of the new genres. Finally, the similarity of two film summaries is calculated by taking the cosine similarity between the new vector representations of the summaries [FH03].
The approach was evaluated by first picking ten random movies from a subset of movies whose rating was based on at least one thousand votes. For each of these randomly selected movies, ten recommendations were generated by each of the created algorithms, ten recommendations were selected at random, five recommendations were generated by humans, and ten recommendations were generated by IMDB's own algorithm. The results were then presented to four humans who rated the recommendations on a scale from one to four. The randomly selected and the human-generated recommendations achieved the worst and the best scores, respectively. Among the created algorithms, the second algorithm outperformed the first one. However, they were both outperformed by IMDB's own algorithm. As stated by the authors, although their NLP model failed to outperform the commercial model, it is much less labor-intensive [FH03]. A possible extension mentioned by the authors is using Latent Semantic Indexing (also known as Latent Semantic Analysis) to discover higher-order similarities between the summaries [FH03].
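To make the weighted-average idea concrete, the following is a minimal sketch of combining per-field similarities into one score; the field names, weights and helper functions are hypothetical illustrations, not the weights or implementation used by Fleischman and Hovy.

```python
import numpy as np

def overlap(a, b):
    # Normalized overlap between two sets of structured values (e.g. cast lists)
    a, b = set(a), set(b)
    return len(a & b) / max(len(a | b), 1)

def cosine(u, v):
    # Cosine similarity between two term vectors (e.g. binary plot-summary vectors)
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_similarity(field_scores, weights):
    # Weighted average of the per-field similarity scores
    return float(np.average(field_scores, weights=weights))

# Hypothetical per-field scores: director overlap, cast overlap, plot cosine
scores = [overlap({"director a"}, {"director b"}),
          overlap({"actor 1", "actor 2"}, {"actor 2", "actor 3"}),
          cosine([1, 0, 1, 1], [1, 1, 0, 1])]
print(combined_similarity(scores, weights=[1.0, 1.0, 2.0]))
```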
3.2 Eurovoc
EuroVoc is a multilingual and multidisciplinary thesaurus that contains hierarchically organized subject domains. The subjects are used by the Member States of the EU for the classification of official documents. The thesaurus is managed by the EU Publications Office. This section presents two projects related to EuroVoc: a tool for automatically assigning thesaurus terms in order to classify documents, and a project which attempts to calculate the similarity of documents using solely EuroVoc descriptors.

Initially, documents had to be manually labeled with the thesaurus terms, referred to as descriptors. The European Commission's Joint Research Centre (JRC) developed an indexer that automatically assigns EuroVoc descriptors to a given body of text [SET13].
In the model, each descriptor has a profile which consists of a ranked list of features that are common for that particular descriptor. Features may take different forms, for example, basic words, n-grams, part-of-speech combinations etc. Furthermore, the model requires that the document that is to be labeled is represented as a vector containing the frequency of the features in the document. Cosine similarity between the descriptor profiles and the feature vector representation of the document is then used to find the most suitable (the closest) descriptors for the document.
A paper by R. Steinberger, B. Pouliquen, and J. Hagman [SPH02] presents an approach to calculating the similarity of labeled documents written in the same or different languages. The approach was to first represent the documents in a language-independent way, more specifically using the EuroVoc thesaurus terms, referred to as descriptors. As in the previous paragraph, each descriptor has an associated list of words that are common to that particular descriptor. The lists of all the descriptors are then compared to the words in the document that is to be labeled. The assumption made was that if the words in a document were very similar to the associated words of a descriptor, then that descriptor was more likely to accurately describe the document. Because of this assumption, the most suitable descriptors for the document can be found by ranking the descriptors based on similarity. The list of descriptors assigned to the document can then be seen as a representation of the contents of that document [SPH02]. By calculating the cosine similarity between the descriptor representations of documents, it is then possible to get an idea of which documents may be related. The least distant documents should then be the ones that are also the most similar.
These two papers together form an alternative approach to representing and finding similar documents. First, the JRC indexer (JEX) is used to label the data in this project, after which the similarity between documents can be computed solely from the automatically assigned descriptors (as done in [SPH02]). The performance of this approach could then be compared to the models implemented in this project, described in Section 8.
4 Theory
Theory regarding data is first presented, followed by feature extraction methods and finally how to measure similarity of data.
4.1 Data
There are different types of datasets. They can be grouped into three categories: record data, graph-based data and ordered data [TSK19].
Record data is when the dataset is a collection of records (data objects), each of which has data fields (attributes). In this type of dataset, there is no explicit relationship between data objects or attributes. An example of a record-type dataset is the data matrix, where the data objects have a fixed number of numeric attributes [TSK19]. A collection of records can be stored either in a relational database or as a flat file.
Graph-based data is a type of dataset that captures relationships among data objects. In this type, objects are represented as nodes and the relationships are represented as links between the nodes. An example of this type is how the World Wide Web is organized, where web pages are represented as nodes and relationships between web pages are links between the nodes. A model for graph-based data is the Resource Description Framework (RDF). In the RDF data model, data is represented as triples. A triple is a statement consisting of a subject, a predicate and an object. An example of a triple is "Bob is 35", where "Bob" is the subject, "is" is the predicate and "35" is the object. Subjects, predicates and objects can have Uniform Resource Identifiers (URI) as values. By giving "Bob" a URI, such as "http://example.org/Bob", it can be reused in other statements, and the statements thus become linked together. This example is visualized in Figure 4.
Figure 4 An example of an RDF triple.
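As an illustration, a minimal sketch of expressing the "Bob is 35" triple with the rdflib Python library is shown below; the example.org URI follows the text above, while the use of the FOAF vocabulary for the predicate is an assumption made only for the sake of the example.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, XSD

g = Graph()
bob = URIRef("http://example.org/Bob")                      # the subject, identified by a URI
g.add((bob, FOAF.age, Literal(35, datatype=XSD.integer)))   # predicate and object

# Serializing the graph shows the triple in Turtle syntax
print(g.serialize(format="turtle"))
```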
Ordered data is the type of data where attributes have relationships that involve time or
space [TSK19]. An example of this data type is time-series data, where measurements are
collected over a certain period of time. The attributes will have temporal relationships
in this example.
4.1.1 Data Quality
In contrast to statistics, where experiments or surveys are carried out in order to achieve a pre-specified level of data quality, data mining is often applied to data that was not collected for the purpose of data mining. Consequently, data quality problems arise because of the nature of the raw data, and in order to deal with them it is necessary to detect and correct these problems; this process is often denoted data cleaning [TSK19]. By investigating how the data was generated, one can learn more about the data quality. Humans, measuring devices or a data collection process can be generators of data. In each of these, there is always a risk of errors, such as human errors, limitations of measuring devices or flaws in the data collection process [TSK19].

Missing values are a common problem and occur when a data object is missing one or more attributes. There are a number of ways to deal with this issue: the first is to eliminate data objects or attributes, the second is to estimate the missing values, and the third is to ignore the missing value during the analysis. Depending on the situation, one of these strategies can be applied [TSK19].

Duplicate data is when a dataset contains data objects that are duplicates or almost duplicates. Deduplication is the process of solving this issue. In deduplication, it is important to keep two aspects in mind [TSK19]. The first is that even if two data objects represent a single object, there might still be attributes that differ between the data objects. The second is that two similar data objects might not be duplicates at all, just very similar [TSK19].
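A minimal pandas sketch of the strategies above is given here; the column names and the placeholder value are invented for illustration and do not correspond to the actual metadata fields used in this project.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Air quality", "Air quality", "Road network", None],
    "theme": ["Environment", "Environment", "Transport", "Transport"],
})

deduplicated = df.drop_duplicates()                    # remove exact duplicate records
eliminated = deduplicated.dropna(subset=["title"])     # eliminate objects with missing values
estimated = deduplicated.fillna({"title": "unknown"})  # or estimate/fill in a placeholder value
print(estimated)
```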
The application perspective of data quality is another important viewpoint. That is, given a specific use case, how relevant or suitable is the data? The answer to that question determines the quality of the data for that application [TSK19]. Relevance and knowledge are two aspects to consider when looking at the data from the application perspective. Relevance is concerned with whether the available data contains the information that is necessary for the application. Not only are the attributes important, but also making sure that the data objects are relevant. Sampling bias occurs when the data objects in the dataset do not reflect the actual occurrence in the population. For example, this can be the case when conducting a survey, where the survey data only describes those who answered the survey [TSK19]. Since the data analysis only reflects the survey data, it might not generalize to a wider population. Knowledge about the data, on the other hand, is knowledge that helps to describe different aspects of the data. This can be in the form of documentation. The precision of the data, the scale of measurement and the origin of the data are also important characteristics that can benefit from documented knowledge.
4.1.2 Data Preprocessing
Data preprocessing is a fundamental step in making the data ready for data mining. Various techniques can be applied in this process, but they all fall into two categories: the first is selecting data objects and attributes for the analysis, while the second is creating or changing the attributes [TSK19].
Aggregation is a technique in which two or more data objects are combined. There are many ways to combine data objects; what it boils down to is how to handle the attributes when combining the data objects. What can be lost in aggregation is the detail of the attributes of the original data objects, and therefore a sense of what abstraction level is appropriate is something that needs to be considered. Another important motivation is that aggregating data objects results in a smaller dataset, which can be of interest for memory and processing reasons [TSK19].
Dimensionality reduction is a technique for handling data objects which potentially have a large number of attributes. As an example, take a text document as the data object, and the occurrence of a word in that text document as an attribute. In this example, the number of attributes will be the size of the vocabulary of the text document, which in some cases can be in the thousands. The curse of dimensionality is a term that refers to the issues that are related to a high number of attributes in data.

Even if the number of attributes increases, a given data object might only have values for a few of the attributes, which is referred to as sparse data. This phenomenon can make data analysis difficult. A way to tackle this issue is to use techniques that create new attributes by combining the old attributes. One technique for doing so is Singular Value Decomposition (SVD), which is further explained in Section 4.2.1 [TSK19].
Attribute subset selection is another way to tackle the problem of a high number of attributes: identifying a smaller number of attributes that are a subset of the old attributes. It might seem like information will be lost in this process, but by identifying redundant and irrelevant attributes, this can be avoided. Redundant attributes duplicate much of the information of other attributes, and irrelevant attributes are attributes which are deemed to contain no useful information for the analysis task [TSK19].
Common sense or domain knowledge can be used to judge attributes; however, this approach is somewhat subjective. Instead, a more systematic approach can be taken to obtain a more objective judgement. The gist of such a systematic approach is to try all possible subsets of attributes, use them in the algorithm at hand, and take the subset of attributes which produces the best results. As the number of possible subsets of n attributes is 2 to the nth power, there are three alternative approaches: embedded, filter and wrapper. The embedded approach is where the attribute selection is part of the data mining algorithm, the filter approach is where the attributes are selected by some independent algorithm before the data analysis, and finally the wrapper approach is one in which the data mining algorithm is treated as a black box, similar to trying all possible subsets but without actually testing all of them [TSK19].
Another approach to subset selection is feature weighting. Instead of the binary approach of either deleting or keeping a feature, some attributes are deemed more important than others. The importance of a feature can be judged by domain knowledge [TSK19], or as part of an iterative approach using the data mining algorithm as a black box and inspecting the result, similar to the wrapper approach described in the previous paragraph.
Besides the aforementioned ways to reduce the number of attributes, one can also attempt to create new attributes which more effectively capture the information stored in the old attributes, using feature creation methods. One methodology for doing so is feature extraction, in which the raw data is processed to obtain higher-level attributes. Taking the case of a text document as an example, there are algorithms which can find patterns in the data objects of a dataset, thus generating subject topics and mapping these topics to each text document. The new number of attributes is then the number of generated topics. Feature extraction is, as in this example, highly domain-specific, and techniques from one field generally have limited practical use in other domains. In conclusion, when opting for feature creation it is necessary to delve into the domain of the attributes to find the right tools [TSK19].
4.1.3 Text data preprocessing
As computers read bodies of text as string objects, it is necessary to separate larger bodies of text into smaller chunks so that computers can evaluate individual words [Bey18]. This can be achieved through tokenization of the natural text, which is the process of segmenting the text into smaller chunks, or tokens. Segmentation can be done on the actual tokens or on separators between tokens, such as whitespace or punctuation. This way of representing a text document in terms of tokens is often referred to as a Bag-Of-Words (BOW) model. Adding all these tokens together results in a BOW vector which disregards the order of tokens but retains token multiplicity; therefore, the vector is also sometimes referred to as a term frequency vector. Figure 5 shows a BOW vector representation of a document.
Figure 5 The BOW representation of a document containing the phrase "To be or not to be"
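The following is a minimal sketch of producing such a term frequency vector in Python; the tokenization is a simple regular-expression split and is only meant to illustrate the idea in Figure 5.

```python
from collections import Counter
import re

text = "To be or not to be"

# Tokenize on word characters and lowercase so that "To" and "to" become the same token
tokens = re.findall(r"\w+", text.lower())

# The BOW (term frequency) vector: token multiplicity is kept, token order is discarded
bow = Counter(tokens)
print(bow)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```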
However, not all tokens may be desirable. For instance, punctuation appears all throughout natural languages and makes reading the text easier for humans, but punctuation adds little in the way of value in a BOW vector. There also exist words in all languages that occur very frequently but that carry little information about the meaning of a sentence [HHL19]. These words are often referred to as stop words; two examples are the and is.
Filtering out these undesirable tokens reduces the overall size of the vocabulary while largely maintaining the information contained. One approach is to utilize regular expressions, which are sequences of characters that define a pattern. These patterns can then be used to find the undesired tokens and replace or remove them. This approach relies on a predefined list of stop words and regular expressions against which tokens can be matched.
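As a small illustration of this filtering step, the sketch below strips punctuation with a regular expression and removes tokens found in a stop-word list; the three-word stop-word list is purely illustrative, not a real list such as the ones shipped with NLP libraries.

```python
import re

STOP_WORDS = {"the", "is", "or"}  # illustrative only; real stop-word lists are much longer

def clean_tokens(text):
    # Keep only letter sequences (dropping punctuation), lowercase, then drop stop words
    tokens = re.findall(r"[a-zåäö]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tokens("To be, or not to be: that is the question."))
# ['to', 'be', 'not', 'to', 'be', 'that', 'question']
```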
There may exist tokens which, in essence, provide the same information but which for grammatical reasons appear in different forms in a single sentence or text. The process of combining these terms into a single normalized form is known as normalizing the vocabulary, which further reduces the size of the vocabulary. One instance of normalization is to consolidate multiple words, where the only difference is capitalization, into a single word. Further normalization can be achieved by breaking words down into their semantic roots - their lemmas. Lemmatization is the process of removing the inflectional endings of a word and returning the root, or lemma, of that word. For instance, troubling, troubles and troubled can all be grouped to trouble, as shown in Figure 6. Dictionaries are utilized to allow the algorithm to perform a morphological analysis of terms.
Figure 6 Consolidating multiple alterations of a word by removing inflectional endings.
Stemming is an alternative approach to removing inflectional endings. However, instead of relying on dictionaries, stemming takes a heuristic, rule-based approach. A stemming algorithm was presented by Martin Porter in 1980. The approach was to remove suffixes from words based on a set of rules [P+80]. The algorithm utilizes rules in several stages to reduce a word. A term is matched against a list of rules, and the longest match found is the rule that will be chosen for a stage. Examples of rules from Porter's algorithm can be seen in Figure 7.
Figure 7 Example of removing inflectional endings using stemming rules.
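A minimal sketch of rule-based stemming using the NLTK library is shown below; note that NLTK's Porter implementation is for English, while a Snowball stemmer would be used for Swedish text such as the dataset metadata. The exact stems produced depend on the stemmer's rules.

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
print([porter.stem(w) for w in ["troubling", "troubles", "troubled"]])
# all three typically reduce to the same stem, 'troubl'

# For Swedish metadata, a Swedish Snowball stemmer can be used instead
swedish = SnowballStemmer("swedish")
print(swedish.stem("datamängderna"))  # removes the definite plural ending
```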
Combining the BOW vectors representing the documents results in a sparse matrix that represents the entire collection of documents, also known as the corpus. Each row represents a document and the columns of the matrix are all the words in the corpus. The column values indicate how many times a particular word appears in a given document.
However, this numeric value of occurrences provides little insight into how indicative the term is for a given document, i.e., how effectively the term can be used to tell a document apart from the rest of the collection.
An alternative approach to the BOW model is Term Frequency-Inverse Document Frequency (TF-IDF). Like the BOW model, vectors are used to represent documents. However, instead of tokens being valued by the number of times they appear throughout a document, a TF-IDF value is calculated. TF-IDF is calculated from two values: the term frequency and the inverse document frequency. The document frequency df_t is the number of documents in the collection that contain a term t. The inverse document frequency idf_t of a term t in a collection consisting of N documents is given by Equation 1, as presented by [MRS08].

$$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t}\right) \qquad (1)$$
The IDF value provides insight into how relevant a term is for distinguishing a given document in a collection. Terms that appear frequently throughout the collection will receive a low value while rare terms receive a high value.
Similar to the BOW model, the final matrix representation of the corpus contains a row for every document and a column for every term. A document in this representation is a vector with a component for every term in the corpus. The component is zero for terms that do not occur in the document. For terms that do appear, the TF-IDF value for term t in document d is given by equation 2 as presented by [MRS08].
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \qquad (2)$$
Terms that occur frequently within a small number of documents will receive a large
value. This signifies that the terms are indicative of the small number of documents,
i.e., the terms are useful when trying to tell the small number of documents apart from
the rest of the collection. Terms that occur in a large number of documents or a limited
number of times in a document will receive a lower value. Finally, terms that occur in
pretty much all documents will receive a very low value.
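A minimal sketch of computing such a TF-IDF matrix with scikit-learn follows; the three toy documents are invented, and note that scikit-learn's TfidfVectorizer uses a slightly different idf formula than Equation 1 (it adds one to the idf and L2-normalizes each document vector by default), so the exact values differ from a by-hand calculation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "air quality measurements for stockholm",
    "air quality measurements for gothenburg",
    "public library opening hours",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # one row per document, one column per term

print(vectorizer.get_feature_names_out())      # the terms of the corpus vocabulary
print(tfidf_matrix.toarray().round(2))         # rare terms get higher weights than shared ones
```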
4.2 Feature Extraction
Sections 4.2.1 and 4.2.2 present two different approaches to topic modelling. Topic modelling rests on the assumption that there exist a number of abstract themes or topics in a collection of documents, and that every document in the collection can be represented by some mixture of these topics. In the case of text documents, topics are, in essence, similar words grouped or clustered together. This means that instead of being represented by word vectors, text documents can be described by a mixture of word topics. These mixtures are often referred to as topic vectors. By looking at the documents as represented by topic vectors, it is also possible to get a sense of which documents may be related.

Section 4.2.3 presents an approach to representing words as feature vectors. Section 4.2.4 takes this concept a step further and presents an approach to representing entire paragraphs as feature vectors.
4.2.1 Latent Semantic Analysis
Latent Semantic Analysis (LSA) is an approach that attempts to capture relationships between documents and the terms used in the documents. It achieves this by using Singular Value Decomposition (SVD), a mathematical algorithm for factorizing a matrix into three matrices. These three new matrices have the property that, when multiplied together, they form the original matrix. Each of the three matrices has some mathematical properties that can be exploited for dimension reduction or to perform LSA.
In the case of applying SVD to the TF-IDF or BOW matrices described in the text preprocessing section 4.1.3, SVD will, in essence, find which terms/words are correlated. Correlation in this case means that the terms frequently occur side by side in the same document and that they also vary together over the collection of documents. Additionally, SVD also calculates the correlation between documents based on the terms that they use. These two calculations enable SVD to determine which linear combinations of terms present the largest variation in the collection. These linear combinations form the (latent) topics that can then be used to describe each document and how documents may be related.
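The sketch below applies truncated SVD to a TF-IDF matrix with scikit-learn, which is a common way to perform LSA; the toy documents and the choice of two latent topics are assumptions made only for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "air quality measurements for stockholm",
    "air quality measurements for gothenburg",
    "public library opening hours",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Factorize the term space into 2 latent topics; each document becomes a 2-dimensional topic vector
svd = TruncatedSVD(n_components=2, random_state=0)
topic_vectors = svd.fit_transform(tfidf)
print(topic_vectors.round(2))  # similar documents receive similar topic vectors
```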
4.2.2 Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) is a generative model for text and other discrete data presented by David M. Blei, Andrew Y. Ng and Michael I. Jordan [BNJ02]. LDA assumes that there exist a number of topics that are distributions over the words in the corpus vocabulary, and that all the documents in the corpus can be represented as distributions over these topics.
Figure 8 Plate notation showing the variables involved in LDA.

Using the plate notation shown in Figure 8, we can show the variables involved in the process, as well as how these variables are related. The boxes represent repeated sections. The outer, larger box represents the M documents, and the inner box represents the word positions. A corpus D consists of M documents, where document i has N_i words.
α and β are the parameters of the Dirichlet priors for the per-document topic and per-topic word distributions, respectively. A Dirichlet distribution is a multivariate generalization of the Beta distribution. The output of a Dirichlet distribution is a vector of probabilities that sums to one. Dirichlet distributions take a hyperparameter; a higher value of the hyperparameter pushes the distribution toward the center, while a lower value pushes the distribution towards the corners. The effect of different values of the hyperparameter is shown in Figure 9.
For LDA, this means that a higher α results in documents being distributed over many topics, while a lower α results in fewer topics. Similar to α, a higher value of β means a topic is distributed over many words, while a lower β means fewer words are used. For a given document, denoted i, the topic distribution of that document is θ_i. w_ij denotes the word at position j in document i, while z_ij denotes the topic for the same word. Lastly, the word distribution for some topic k is denoted ϕ_k.
Figure 9 The effect the alpha hyperparameter has on the Dirichlet distribution. Higher values push the distribution toward the center while a lower value pushes the distribution towards the corners/edges.
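To illustrate the effect of the hyperparameter, the following sketch draws from symmetric Dirichlet distributions with a low and a high concentration value using numpy; the specific values 0.1 and 10 are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each draw is a probability vector over 3 topics that sums to one
print(rng.dirichlet([0.1, 0.1, 0.1]))     # low alpha: mass concentrated on few components
print(rng.dirichlet([10.0, 10.0, 10.0]))  # high alpha: mass spread more evenly
```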
As described by [BNJ03], for a corpus D consisting of M documents, each of length N_i, LDA assumes the following generative process for each document:

1. Choose θ_i ∼ Dir(α), where i ∈ {1, ..., M}
2. Choose ϕ_k ∼ Dir(β), where k ∈ {1, ..., K}
3. For each word position (i, j), where i ∈ {1, ..., M} and j ∈ {1, ..., N_i}:
   (a) Choose a topic z_ij ∼ Multinomial(θ_i)
   (b) Choose a word w_ij ∼ Multinomial(ϕ_{z_ij})
4.2.3 Word2Vec
While the BOW model of representing texts is relatively straightforward to implement, it has several drawbacks. The model discards the order in which words appear and also fails to capture the semantics of words, i.e., the model provides no natural notion of similarity between words. For instance, the word apple should, in a sense, be more similar to orange than to car; however, this is not captured in this representation. It would be preferable to have similar words occupy positions close to one another, or in other words, the vector representations of the words should reflect their similarities.
Neural word embeddings are an alternative representation that captures similarities. In this representation, words are represented by feature vectors where each component of the feature vector captures some aspect of the words in the vocabulary. To get some intuition, imagine that one vector component symbolizes royalty. The word king would then have a high value for that component, while the word apple would not. In practice, however, components are more complex than single adjectives. Words that have similar features will therefore occupy positions close to one another, since their feature vectors will have similar values. Furthermore, these new representations have some interesting properties besides grouping similar words. They also capture analogies; a classic example is that king is to man what queen is to woman. To elaborate, the vector difference between the points representing man and woman is very similar to the vector difference between the points representing king and queen [MYZ13]. This is shown in Figure 10.
Figure 10 A crude representation of how analogies are captured by the feature representation. The same vector difference describes both the relation between man and king and the relation between woman and queen.
The Word2Vec model, created by Mikolov et al. [MCCD13] is a neural network for
creating neural word embeddings. The network consists of input, projection and output
layers. It utilizes one of two methods: the continuous bag-of-words (CBOW) model or
the continuous skip-gram model. The CBOW model attempts to predict a word based
on the context of the word while the continuous skip-gram model attempts to predict
the context of a given word. An overview of CBOW and skip-gram models can be seen
in Figure 11.
Figure 11 A side-by-side comparison of the architecture of CBOW and skip-gram, respectively [MCCD13].
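A minimal sketch of training such a model with the gensim library is shown below; the toy sentences are invented and far too small for meaningful vectors, but they illustrate how the sg parameter switches between the CBOW and skip-gram architectures.

```python
from gensim.models import Word2Vec

sentences = [["open", "data", "portal"],
             ["swedish", "open", "data"],
             ["air", "quality", "data"]]

# sg=0 selects CBOW, sg=1 selects skip-gram; vector_size is the feature-vector length N
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0, epochs=50, seed=0)

print(model.wv["data"])                        # the learned feature vector for "data"
print(model.wv.most_similar("data", topn=2))   # nearest words in the embedding space
```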
For the CBOW model (shown in Figure 12), the input to the network is C words from the context of a focus word. These context words are represented as one-hot vectors, i.e., one component in each vector x_1, x_2, ..., x_C is 1 and the rest are 0. The weights between the input and hidden layers are contained in a matrix W with dimensions V × N, where V is the size of the vocabulary and N is the length of the feature vectors. Each row in W is the feature vector representation of a word in the vocabulary. More specifically, every word in the input layer has a feature vector representation v_w given by its associated row in W. W is shared for all context words, i.e., the feature vector for apple is the same for all instances of the word.
Figure 12 The network architecture of the CBOW model [Ron14].
The output of the hidden layer is calculated by taking the product of the transposed weight matrix W and the average of the context vectors, as shown in Equation 3.

$$h = \frac{1}{C} W^{T} (x_1 + x_2 + \cdots + x_C) \qquad (3)$$

$$\;\, = \frac{1}{C} (v_{w_1} + v_{w_2} + \cdots + v_{w_C}) \qquad (4)$$

A matrix W' with dimensions N × V contains all the weights between the hidden and output layers. Using the weights contained in W', a score for each word in the vocabulary can be calculated. Equation 5 is used for calculating the score of a word, where v'_{w_j} is the j-th column in W'.

$$u_j = {v'_{w_j}}^{T} h \qquad (5)$$
The softmax function is then used to obtain the posterior distribution of words. Given the focus word w_I, the posterior is a multinomial distribution given by Equation 6, where the j-th unit in the output layer is given by y_j.

$$p(w_j \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (6)$$
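As a concrete illustration of Equations 3-6, the following is a toy numpy sketch of a single CBOW forward pass under the notation above; the weight matrices are random rather than trained, and the dimensions are arbitrarily small.

```python
import numpy as np

V, N, C = 5, 3, 2                      # vocabulary size, feature-vector length, context size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))            # input->hidden weights, one row v_w per word
W_prime = rng.normal(size=(N, V))      # hidden->output weights

x1, x2 = np.eye(V)[1], np.eye(V)[3]    # one-hot context vectors

h = W.T @ (x1 + x2) / C                # Equation 3: average of the context word vectors
u = W_prime.T @ h                      # Equation 5: score u_j for every word j
y = np.exp(u) / np.exp(u).sum()        # Equation 6: softmax posterior over the vocabulary
print(y, y.sum())                      # a probability distribution summing to 1
```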
Figure 13 The network architecture for the skip-gram model [Ron14].
In the skip-gram model (shown in Figure 13), the input to the network is simply the focus word, given by a one-hot vector x_k. This means that the output of the hidden layer h is a (transposed) row of the weight matrix W, i.e., the row in W that corresponds to the focus word. Equation 7 shows how h is calculated.

$$h = W^{T}_{(k,\cdot)} = v_{w_I}^{T} \qquad (7)$$
Whereas the CBOW model attempts to predict the focus word given C context words, skip-gram attempts to predict C context words given the focus word. This means the output of the network will be C multinomial distributions. The same weight matrix W' located between the hidden and the output layers is used for each output.

Equation 8 is used for calculating the score of the c-th word (also referred to as a panel [Ron14]), where v'_{w_j} is the output vector of the j-th word in the vocabulary and is given as a column in W'.

$$u_{c,j} = u_j = {v'_{w_j}}^{T} h, \quad \text{for } c = 1, 2, \ldots, C \qquad (8)$$

u_{c,j} denotes the net input of the j-th output unit on the c-th panel of the output layer.
The multinomial distribution given the focus word w_I is given by Equation 9.

$$p(w_{c,j} = w_{O,c} \mid w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (9)$$