Lexeme Extraction for Wikidata

A proof of concept study for Swedish lexeme extraction
Simon Samzelius

Final Project – Report

Main field of study: Computer Engineering
Credits: 300 credits
Semester/Year: 6th semester, 3rd year
Supervisor: Patrik Österberg

Examiner: Stefan Forsström

Course code/registration number: DT099G

Degree programme: Master of Science in Engineering – Computer Engineering


Sammanfattning

Wikipedia has problems with organizing and managing data and references. As a solution, Wikidata was created to make it possible for machines to interpret these data with the help of lexemes. A lexeme is an abstract lexical unit consisting of a word's lemmas and its word class.

The aim of this work is to present one possible way to provide Swedish lexeme data to Wikidata. This was implemented in two phases: the first phase was to identify lemmas and their word classes; the second phase was to process these words to create coherent lexemes.

The developed model was able to process large amounts of words from the data source but barely succeeded in generating coherent lexemes.

Although lexemes were meant to give machines an efficient way of understanding data, the obtained results lead to the conclusion that the developed model did not achieve the anticipated results. This is due to the number of words found in relation to the number of words processed. A way needs to be found to import lexeme data into Wikidata from another data source.

Keywords: Lexeme, Lemma, Wikidata


Abstract

Wikipedia has a problem with organizing and managing data as well as references. As a solution, they created Wikidata to make it possible for machines to interpret these data with the help of lexemes. A lexeme is an abstract lexical unit which consists of a word's lemmas and its word class. The objective of this paper is to present one possible way to provide Swedish lexeme data to Wikidata. This was implemented in two phases: the first phase was to identify the lemmas and their word classes; the second phase was to process these words to create coherent lexemes. The developed model was able to process large amounts of words from the data source but barely succeeded in generating coherent lexemes.

Although lexemes were supposed to provide an efficient way for machines to understand data, the obtained results lead to the conclusion that the developed model did not achieve the anticipated results. This is due to the number of words found relative to the number of words processed.

A way needs to be found to import lexeme data into Wikidata from another data source.

Keywords: Lexeme, Lemma, Wikidata


Acknowledgements

I would like to thank my supervisor Patrik Österberg for his help and understanding throughout this project, my mentor Annelie Huczkowsky, as well as my friends Rasmus Schyberg and Ayad Shaif for their support.


Contents

Sammanfattning
Abstract
Acknowledgements
Contents
Terminology
1 Introduction
1.1 Background and problem motivation
1.2 Overall aim
1.3 Concrete and verifiable goals
1.4 Scope
1.5 Outline
2 Theory
2.1 Natural language processing
2.2 Lexeme
2.3 Wikidata
2.4 Språkbanken
2.5 XML
2.6 Python
2.6.1 Requests
2.6.2 re
2.6.3 xml.etree.ElementTree
2.6.4 Codecs
2.6.5 Bz2
2.6.6 os
2.6.7 shutil
2.6.8 _thread
2.6.9 Beautiful Soup
2.6.10 tkinter
2.7 Related Work
2.7.1 Danish in Wikidata lexemes
2.7.2 The lexeme hypotheses: Their use to generate highly grammatical and completely computerized medical records
2.7.3 Introducing Lexical Masks: a New Representation of Lexical Entries for Better Evaluation and Exchange of Lexicons
3 Methodology
3.1 Refresh Files
3.2 Evaluate Files
3.3 Analyze Content
3.4 Evaluation of the results
4 Design
4.1 Refresh Files
4.2 Evaluate Files
4.3 Analyze Content
5 Results
5.1 Refresh files
5.2 Evaluate files
5.3 Analyze content
5.4 Analysis of results
6 Conclusions
6.1 Evaluation of methods
6.2 Evaluation of the results
6.3 Ethical and social aspects
6.4 Future work
References
Appendix A: Source Code


Terminology

Acronyms

ao      Academic Wordlist (swe: Akademisk Ordlista)
API     Application Programming Interface
GUI     Graphical User Interface
HTML    Hypertext Markup Language
HTTP    Hypertext Transfer Protocol
NLP     Natural Language Processing
re      Regular Expression
URL     Uniform Resource Locator
XML     Extensible Markup Language


1 Introduction

The Wikimedia Foundation has created and operates the largest free knowledge database with the help of volunteers who contribute knowledge and help millions of people around the globe by sharing it. In total there are now thirteen projects, generally known as the Wikimedia sister projects: Wikipedia, Wikibooks, Wiktionary, Wikiquote, Wikimedia Commons, Wikisource, Wikiversity, Wikispecies, Wikidata, MediaWiki, Wikivoyage, Wikinews and Meta-Wiki. [1] [2]

The Wikidata sister project collects structured data and acts as the central storage to support the rest of the Wikimedia sister projects. It is possible for anyone to copy, modify and distribute the data without permission, because it is published under the Creative Commons Public Domain Dedication 1.0. It is therefore possible for anyone to contribute to Wikidata. The data entered can be read by both humans and machines and can be in any language. [3] [4]

1.1 Background and problem motivation

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource and others. Wikidata also provides support to many other sites and services beyond just Wikimedia projects.

As a result, new problems have appeared that made it difficult for Wikimedia to manage its huge amount of data. Therefore, using a new datatype known as lexemes was considered a solution for Wikimedia. [4]

A lexeme is an element of a language which is used in the form of a word, phrase, or prefix, combined with its word class. [5]

Presently, Wikidata lacks Swedish lexemes, and therefore a suitable implementation model should be developed.


1.2 Overall aim

The overall aim of this project is to optimize Wikidata's classification method for Swedish lexemes. Therefore, this research will investigate and propose an optimized model for lexeme classification that can be used by Wikidata.

1.3 Concrete and verifiable goals

The main goal with this study is to create a program that can extract lexemes from a data source and create a list of Swedish lexemes. As a proof of concept, the following goals should be fulfilled:

1. Develop a model for fetching relevant data from Språkbanken.

2. Produce a list of unique lemmas from the fetched data.

3. Classify the unique lemmas and evaluate the success rate of the developed model.

1.4 Scope

This research will only deal with Swedish lexemes gathered from Svenska Språkbanken, due to the time limit of this project. The lexemes will be processed using specific methods: checking the language tag in the XML documents and extracting the Swedish words. The data gathered will not be uploaded to Wikidata.

1.5 Outline

This report is organized as follows. Chapter 2 describes the theory and necessary information. Chapter 3 describes the chosen methodology used to carry out this project. Chapter 4 presents the step-by-step design and implementation of the project. Chapter 5 lists and presents the obtained results. Finally, chapter 6 discusses the chosen implementation, the obtained results, social and ethical aspects, as well as improvements.


2 Theory

This section presents all relevant theory and concepts concerning this research as well as related works within this field.

2.1 Natural language processing

Natural Language Processing (NLP) is the study of processing words in different natural languages such as English or Swedish. Areas such as language structuring and representation are studied and processed so that they could be interpreted by computers. [6]

Although NLP seems to be a promising field in terms of improving the interaction between computers and human languages, there are some obstacles to consider. These obstacles can take many different forms, such as challenges in the collaboration between language experts and computer scientists, where seemingly simple linguistic principles can turn out to be difficult tasks. [7]

Figure 1: Illustration of why language processing can be difficult. [8]

One other challenge with processing natural languages is the rapid change in the language where new terminologies and phrases are continuously introduced by the language speakers. The evolution of the language is demonstrated in figure 1.

2.2 Lexeme

A lexeme [5] is an element of a language which is used in the form of a word, phrase, or prefix. These are entities in the sense of the Wikibase model. A Lexeme is described using the following elements:

- ID
- Lemma
- Language
- Lexical Category
- List of Lexeme statements
- List of Forms
- Senses

For example, the words run, runs, ran and running are forms of the same lexeme, which is represented by the lemma run as well as its word class.

A lemma is the base form of a word.
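
To make the idea concrete, the sketch below represents the lexeme from the example above as a plain Python dictionary. The field names simply mirror the element list above; they are illustrative and do not follow Wikidata's exact data model.

# Illustrative sketch only: a lexeme for the lemma "run", with field names
# mirroring the element list above rather than Wikidata's exact schema.
run_lexeme = {
    "lemma": "run",
    "language": "English",
    "lexical_category": "verb",
    "forms": ["run", "runs", "ran", "running"],
    "senses": ["to move at a speed faster than a walk"],
}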

2.3 Wikidata

“Wikidata is a free, collaborative, multilingual, secondary database, collecting structured data to provide support for Wikipedia, Wikimedia Commons, the other wikis of the Wikimedia movement, and to anyone in the world” [3]. Its repository consists mainly of items and statements.

An item has a label, a description, and possibly several aliases. An item is uniquely identified by the letter Q followed by a number. A statement describes a characteristic of an item and consists of a property and a value.

2.4 Språkbanken

Språkbanken is a research unit built around linguistic data. It produces text resources that allow analytical and sophisticated searches, which can be used to further develop things such as artificial intelligence [9].

2.5 XML

Extensible Markup Language (XML) is made from entities containing either parsed or unparsed data. The parsed data is made from either character data or markup, encoding a description of the document [10].

2.6 Python

Python is a high-level, general-purpose programming language suitable as a scripting language; it is widely used for machine learning applications as well as web development [11]. In Python there are modules and packages. The difference between the two lies in the hierarchical structure: a module may be one or a few files imported in one instance, whereas a package consists of several modules and needs to be imported more specifically. [12]

2.6.1 Requests

Requests is a Python module which allows the user to send Hypertext Transfer Protocol (HTTP) requests without manually having to add query strings to the Uniform Resource Locator (URL) [13].

2.6.2 re

re, or Regular Expression, is a Python module that provides regular expression operations for matching patterns in strings. Examples of regular expression operators are "." for any single character, "*" for zero or more occurrences of the preceding token, and ".*" for any sequence of characters [14].
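
As a minimal illustration of these operators (the pattern and test strings below are made up for the example):

import re

# "." matches any single character, "*" means zero or more of the preceding
# token, and ".*?" is the non-greedy variant that matches the shortest run.
print(re.match(r"st.die", "studie") is not None)                   # True
print(re.findall(r'lemma=".*?"', '<w lemma="|till|">Till</w>'))    # ['lemma="|till|"']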

2.6.3 xml.etree.ElementTree

xml.etree.ElementTree (ET) is an Application Programming Interface (API) data model in Python that makes it possible to parse and create XML data. The API data model uses a tree structure because XML is normally presented hierarchically [15].
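
A minimal example of how ET turns XML text into elements whose tags and attributes can be read (the fragment mirrors the corpus format shown later in figure 10):

import xml.etree.ElementTree as ET

# Parse a small XML fragment and read its tag, "lemma" attribute, and text.
elem = ET.fromstring('<w lemma="|till|">Till</w>')
print(elem.tag, elem.get("lemma"), elem.text)   # w |till| Till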

2.6.4 Codecs

The codecs Python module makes it possible to transcode given data. This module is mostly used when dealing with Unicode text [16].

2.6.5 Bz2

This module allows compressing and decompressing strings of data with the bzip2 methods [17].

2.6.6 os

os is a Python module that allows reading existing files and writing to files. The os module also allows different operating system functionalities such as manipulation of paths [18].

2.6.7 shutil

shutil is a Python module which allows high-level operations such as copying and removing files and directories, reducing the number of steps needed compared to lower-level file handling [19].

2.6.8 _thread

_thread is a module which allows execution of programs with multiple threads. The module manages the shared computer resources with the help of methods like synchronization and simple locks [20].

2.6.9 Beautiful Soup

Beautiful Soup is a Python package which pulls data out of HTML and XML files by using a parser which allows operations such as navigation, exploration, and modification of the produced parse tree [21].
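
As a small illustration (using Python's built-in "html.parser" backend; the HTML snippet is made up):

from bs4 import BeautifulSoup

# Pull all link targets out of a small HTML snippet.
html = '<a href="a.xml.bz2">A</a> <a href="b.xml.bz2">B</a>'
soup = BeautifulSoup(html, "html.parser")
print([a["href"] for a in soup.find_all("a")])   # ['a.xml.bz2', 'b.xml.bz2']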

2.6.10 tkinter

tkinter is a standard Python package used to develop graphical user interfaces. Examples of objects that can be created using tkinter are buttons on a window that execute certain functions within the program [22].

2.7 Related Work

Among related research, there have been several scientific studies regarding NLP and lexicographic analysis. In this section, three different studies are presented.

2.7.1 Danish in Wikidata lexemes

Nielsen F. describes in [23] how Danish lexemes could be constructed in Wikidata. However, the paper does not propose any model for managing Danish lexemes for Wikidata.

2.7.2 The lexeme hypotheses: Their use to generate highly grammatical and completely computerized medical records

Macfarlane D. proposed a cost-efficient model in [24] for handling medical records. The model consists of breaking medical records up into lexical fragments; those fragments were then processed as lexeme queries and responses. As a result, the model was able to produce well-structured sentences but needs further testing.

2.7.3 Introducing Lexical Masks: a New Representation of Lexical Entries for Better Evaluation and Exchange of Lexicons

In [25], Cartoni B. et al. propose an efficient masking model for handling lexical entries. Since lexical entries still lack precise and interoperable requirements of what a language should look like, the research presents a mechanism to share specifications of lexical entries known as lexical masks. These masks are used to evaluate and exchange lexicon databases in many languages, with the help of the created ShEx files in Wikidata.


3 Methodology

This study is divided into several assignments in order to make it easier to work with. The assignments need to be able to download or refresh files from a source, go through each file to find the necessary lemmas, and lastly analyze the content found to determine each lemma's word class.

Python was the programming language of choice since it has many of the necessary modules and packages that can ease the project flow, for instance the ET package (see section 2.6.3) that provides a structure compatible with XML files.

Since there are three main assignments, they will be labeled “Refresh Files”, “Evaluate Files” and “Analyze Content”.

3.1 Refresh Files

This method must be able to download files from some source and, depending on the source, be able to prepare them for the next assignment. If a downloaded file is compressed, it is important to decompress it.

The data source chosen in this study is Språkbanken, which compresses its XML files in the .bz2 format. Firstly, it is imperative to communicate with the source; the requests Python module can do that. Secondly, it is necessary to divide the HTML given into an array to search for the files.

With the help of Beautiful Soup, this is made possible. Once the URLs to the files are found, they must be downloaded, which can be achieved with the requests module. The files are compressed but can be decompressed using the bz2 module. It is important to save the files somewhere easily accessible so that the next assignment can work with them.
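
A minimal sketch of this step is shown below. It assumes the Korp sentence-set page lists direct links to .bz2 archives and that whole archives fit in memory; the index URL and target directory are illustrative, and the actual program (appendix A) streams and threads this work instead.

import bz2
import requests
from bs4 import BeautifulSoup

# Illustrative index page; the real program uses the Korp sentence-set page.
INDEX_URL = "https://spraakbanken.gu.se/eng/research/infrastructure/korp/sentencesets"

def refresh_files(target_dir="Downloaded/Unzipped"):
    page = requests.get(INDEX_URL)
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if not href.endswith(".bz2"):
            continue                                  # only corpus archives
        name = href.rsplit("/", 1)[-1]
        archive = requests.get(href).content          # download compressed corpus
        xml_bytes = bz2.decompress(archive)           # .bz2 -> XML
        with open(f"{target_dir}/{name[:-4]}", "wb") as out:
            out.write(xml_bytes)

For the large corpus files used in the project, a streaming decompressor (bz2.BZ2Decompressor or bz2.BZ2File) would be needed instead of decompressing whole archives in memory.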


3.2 Evaluate Files

Here the files need to be searched through to find lemmas that are not duplicates. Once all lemmas have been found, they should be saved somewhere.

With the help of the os module, traversing the paths of the computer is made possible. This can be utilized to analyze the XML files and save the results somewhere else. To analyze the XML files, a tool like ET can be used, as it creates arrays with the tags from the document. With this, it is possible to find lemmas in the XML files. Once all lemmas are sorted so that there is no instance of duplicates, they must be stored somewhere for the next assignment to be able to work with them.
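
A minimal sketch of this step, assuming each token in the corpus files is a <w> element carrying a pipe-delimited lemma attribute (as in the example shown later in figure 10); paths are illustrative. Unlike the actual program, which checks membership in a list, this sketch uses a set for deduplication, which is equivalent but faster.

import os
import xml.etree.ElementTree as ET

def collect_unique_lemmas(xml_dir="Downloaded/Unzipped"):
    unique = set()                          # set membership avoids duplicates
    for name in os.listdir(xml_dir):
        tree = ET.parse(os.path.join(xml_dir, name))
        for w in tree.getroot().iter("w"):
            for lemma in (w.get("lemma") or "").split("|"):
                if lemma:
                    unique.add(lemma)
    return unique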

3.3 Analyze Content

In the last assignment, each lemma that has been found needs to be checked to see whether it is correct or not, and its word class needs to be determined.

To do any form of comparison, it is necessary to find a trustworthy source.

<Lemma>
  <FormRepresentation>
    <feat att="writtenForm" val="studie" />
    <feat att="lemgram" val="studie..nn.1" />
    <feat att="partOfSpeech" val="nn" />
    <feat att="nativePartOfSpeech" val="substantiv" />
    <feat att="rank" val="2" />
  </FormRepresentation>
</Lemma>

Figure 2: Data representation of ao.xml.

Figure 2 represents the structure of ao.xml [26], obtained from Språkbanken. The structure contains Swedish words as well as their word classes. Unfortunately, the file only contains 655 words, but for this project it will have to do. All that remains is to compare the lemmas found with the words from ao.xml and, for each successful hit, create a lexeme from the lemma and the word class.
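
A minimal sketch of this comparison, reading (writtenForm, nativePartOfSpeech) pairs from ao.xml as structured in figure 2 and keeping every lemma that occurs there; the file path and function names are illustrative.

import xml.etree.ElementTree as ET

def load_ao_words(path="ao.xml"):
    """Map each writtenForm in ao.xml to its native part of speech."""
    words = {}
    for form in ET.parse(path).getroot().iter("FormRepresentation"):
        feats = {f.get("att"): f.get("val") for f in form.iter("feat")}
        if "writtenForm" in feats and "nativePartOfSpeech" in feats:
            words[feats["writtenForm"]] = feats["nativePartOfSpeech"]
    return words

def build_lexemes(lemmas, ao_words):
    """Create (lemma, word class) pairs for lemmas confirmed by ao.xml."""
    return [(lemma, ao_words[lemma]) for lemma in lemmas if lemma in ao_words]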


3.4 Evaluation of the results

To be able to evaluate the research result, it is important to ensure the lexemes are correctly extracted. This can be done by comparing the list of lemmas with known Swedish words from a lexicon, i.e. Språkbanken's ao.xml. The success rate of the model will be calculated as the ratio between the number of lexemes found by "Analyze Content" and the number of lemmas found by "Evaluate Files", as sketched below.
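
Expressed as a small helper (the function name is illustrative):

def success_rate(num_lexemes_found, num_unique_lemmas):
    """Ratio of lexemes from "Analyze Content" to lemmas from "Evaluate Files"."""
    return num_lexemes_found / num_unique_lemmas * 100   # percentage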


4 Design

This section describes the experimental setup by demonstrating the steps taken for each assignment, namely Refresh Files, Evaluate Files and Analyze Content.

To be able to process the lemmas, the data was gathered from Svenska Språkbanken in the form of XML files that partially contain Swedish language tags.

Figure 3: The GUI that was implemented

A simple Graphical User Interface (GUI) was implemented, see figure 3, to choose which part to work with: refresh files, evaluate files, or analyze evaluation. This was done solely because each part ran in a sequential manner.


Figure 4: Flowchart of the program as a whole

Figure 4 demonstrates the structure of the developed program. The program consists of three main assignments: "Refresh Files", which is a script for downloading available resources; "Evaluate Files", which is a script for evaluating the resources; and "Analyze Evaluation", which is a script for checking the compiled resources against a good reference.

4.1 Refresh Files

Språkbanken was chosen as the language resource because it is the Swedish national language bank; more specifically, the Korp sentence sets [27] were used.

The main goal of this script is to download and extract each file from Korp. This was done to prepare the files for the evaluation of collected lexemes.


With the help of requests and Beautiful Soup, the script was able to download all compressed files from Korp. The compressed files were then extracted into XML-files with the help of bz2.

The requests module was used to get the HTML from the Korp sentence-set page; the Beautiful Soup module then used it to find all downloadable links, and the requests module downloaded each link and wrote it to a file on the computer.

These files were compressed in the .bz2 format and thus needed to be decompressed using the bz2 module.

Figure 5: Flowchart of the Refresh Files part

Figure 5 demonstrates the developed algorithm for the execution of this script. The first event was to find the sentence sets and push them into an array, which was done with the help of requests and Beautiful Soup. After that, the second event was to download and decompress the collected files with the help of requests and bz2.


Figure 6: How Refresh Files gets its data from Språkbanken.

The setup makes use of HTTP to request the relevant files from Språkbanken, which responds with the requested page in the form of Hypertext Markup Language (HTML), as demonstrated in figure 6.

When the requested page is in place, the program searches for all available URLs that lead to .bz2 files. The program then once again sends a request to Språkbanken for each individual file and downloads it.

While the downloading takes place, each downloaded file gets decompressed and saved to another location. These files are then accessible to the Evaluate Files assignment.

4.2 Evaluate Files

The main goal of this script is to create a file with unique words. This was done to optimize the analysis assignment in this project. The Python module os as well as the ET package were used to achieve this goal.

The os module was used to create the paths and files containing the manipulated directories, for easier access to the files, whereas ET was used on every file in the iteration to arrange the tags of the produced files into a hierarchical structure.


Figure 7: Flowchart of the Evaluate Files part

Figure 7 represents how the algorithm functions, for instance by initially pointing to the "w" tag, which results in selecting the whole word. From this, the ET package is used to find the desired element, which can be represented as a lemma. The element is then appended to an array if it does not already exist there. Once the file is completely iterated, the array is copied into a new file. The procedure then continues with the next file, and so on. When all files have been gone through, the algorithm extracts the unique lemmas from all the created files and copies them to a new file.


This process makes use of arrays, reading from files, and nested loops. An empty array is created, and the name of each file is placed in another array. For each element in this array, the file is opened and all its contents are put into a further array, which is then iterated in turn. In this iteration, the program checks whether the current word already exists in the first array; if it does not, the word is added, and if it does, the word is skipped. Once all iterations are complete, the words found are written to a file so that the third assignment can access them.

4.3 Analyze Content

The main goal of this assignment was to discover and identify the collected lexemes. This was achieved by essentially creating three different arrays, namely arrays A, B, and C.


Figure 8: Flowchart of the Analyze Evaluation part

Figure 8 demonstrates the mechanism that allows the algorithm to analyze the collected lemmas from the evaluation assignment. The output from the evaluation assignment was placed into array A, array B contains the downloaded words from Språkbanken's ao.xml, and array C was filled with the lexemes created from the lemmas that occurred in both arrays A and B. The mechanism was also used for quality checking, as well as for gathering the correct word class.

In this process, each word that has been found by Evaluate Files is iterated over and matched with the words found in ao.xml. This is to ensure that the words exist in the Swedish language, as well as to find each word's word class.


class Lexeme():
    """Uses lemma and word class"""

    def __init__(self, lemma, wordclass):
        self.lemma = lemma
        self.wordclass = wordclass

    def printLexeme(self):
        print(self.lemma + ', ' + self.wordclass)

Figure 9: Python code of the Lexeme class.

A lexeme is created using the word class and lemma for each word that matched, as can be seen in figure 9.


5 Results

In this section, the main results from each assignment are presented, i.e. the results from refresh files, evaluate files, and finally analyze content.

5.1 Refresh files

Due to the huge amount of data and the limited computer resources, only 500 GB of XML files were downloaded from Språkbanken. The total number of downloaded files was 167, where each file contained several tags in XML format. The following example demonstrates the structure found in one XML file, named ekeblad.xml.

<corpus id="ekeblad">
  <text title="Johan Ekeblad BREVEN TILL CLAES" date="1639–1655" datefrom="1639" dateto="1655">
    <paragraph date="Juli 1639">
      <sentence id="d649dda-dea400d">
        <w pos="PP" posset="|PP|" lemma="|till|" lex="|dalinm--till..pp.1|" ref="003">Till</w>
      </sentence>
    </paragraph>
  </text>
</corpus>

Figure 10: Example of XML tags in ekeblad.xml.

As shown in figure 10, there are several tags. However, the tag of interest in this study was the w-tag, as it contains the lemma of the word.
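
As a small illustration of how the lemma can be read out of such a w-tag (the fragment is taken from figure 10; the pipe-delimited lemma attribute is split and empty parts are discarded):

import xml.etree.ElementTree as ET

w = ET.fromstring('<w pos="PP" posset="|PP|" lemma="|till|" '
                  'lex="|dalinm--till..pp.1|" ref="003">Till</w>')
lemmas = [part for part in w.get("lemma").split("|") if part]
print(w.text, lemmas)   # Till ['till']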

5.2 Evaluate files

The total number of evaluated files was 115 where each file contains different amounts of unique lemmas.

juli
till
min, mina

Figure 11: Example of how the lemmas were presented.

The total number of unique lemmas was 4107533; examples of how they were presented are shown in figure 11. Many of the lemmas that were found had problems, e.g. a lexeme like "sprida ut" would get two different lemmas: "sprida, sprida ut" and "ut, sprida ut".

5.3 Analyze content

A total of 4107533 unique words were found.

ålder substantiv
berättelse substantiv
tidpunkt substantiv
sällan adverb
nordisk adjektiv

Figure 12: Example of correct lexemes found with their lemmas and word classes.

Figure 12 represents a subset of the correct lexemes found, where the lemma is the first word and its word class is the second.

The following table lists the variables from the code that kept track of the correct, failed, and unique words, as well as the words from the ao.xml file.

Table 1: The obtained results from ao.xml

No.  Name         Definition                                                           Count
1    wordArray    Number of words from the downloaded ao.xml.                          655
2    uniqueArray  Number of unique lemmas found throughout all the downloaded texts.   4107533
3    lexemeArray  Number of correctly identified lexemes.                              558
4    failed       Number of incorrectly identified lexemes.                            4106975

As demonstrated, 99.9% of the unique words found were a failure. This will be discussed in the next chapter.


5.4 Analysis of results

Wikipedia's problem with organizing and managing data and references was solved by creating Wikidata, making it possible for machines to interpret this data with the help of lexemes.


6 Conclusions

This section discusses the chosen method for developing the model as well as the obtained results from the model in terms of the fulfillment of the verifiable goals.

6.1 Evaluation of methods

From the refresh files assignment, about 500 GB of XML files were downloaded; however, this is far from all the files. This is due to RAM issues appearing when one or several files became too large, causing the execution of the program to stop. This may not have affected the result of this study, but a future improvement would be to make sure all desired files are downloaded, since the more files that are downloaded, the higher the quality would be. Therefore, the first verifiable goal is considered successfully accomplished.

For the second verifiable goal, words like "sprida ut" give two lemmas, "sprida, sprida ut" and "ut, sprida ut", which is a problem as more permutations and wrong information are obtained. This is poor management of resources and should be considered more carefully in future experiments.

Out of the 167 files downloaded, only 115 were evaluated. The files that got skipped were either too large or contained other languages.

The model at this stage read the entire XML file into RAM, which meant that if a file was too large for the RAM, the execution of the program would crash. The second verifiable goal is completely accomplished; however, future work should consider resource management in terms of memory capacity.

The third verifiable goal is partially fulfilled. This is due to the small number of words in the wordArray (655), as the Swedish language contains far more than 655 words. This heavily affects the results negatively, as it is impossible in this study to find more than 655 lexemes.

The reason 99.9% of the unique words found were a failure has to do with ao.xml only containing 655 words. With a complete lexicon, more words should be able to be found. However, a lot of the unique words would probably still be wrong.


The refresh files method could be improved by checking whether the texts are in Swedish or not before downloading them. This could be achieved either by requesting a different URL from Språkbanken, or by creating a language filter.

6.2 Evaluation of the results

The overall aim was to create and evaluate a model for generating Swedish lexemes from some data source, to show that it works for some cases. This has been achieved by creating a Python script which downloads, analyzes, and compares all the text. On the other hand, the model failed to produce a satisfactory number of lexemes, which affected its quality negatively.

The results are very disheartening, as 99.9% of them were wrong, but looking at the ao.xml file this is to be expected: the ratio of ao.xml words to unique words is 655:4107533, which limits the maximum possible success rate to about 0.016%.

The refresh files assignment was almost able to manage the downloaded files successfully in a resource-cheap manner. It divided each file into chunks during execution to save computing resources. On the other hand, the work was done in multiple threads and there was no thread management, which eventually resulted in running out of RAM.

In addition, there was no way to guarantee that all files downloaded and checked were in Swedish. Many of the files were not Swedish, and some lacked a language tag in the XML file. This made it harder to autonomously check and accept a file. If a check were made in the script, such as whether the first sentence contains a certain number of Swedish words, it would be faster to go through all the files; a sketch of such a check is shown below.
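
A hedged sketch of that idea, assuming a set of known Swedish words (for example, the words loaded from ao.xml) is available; the function name and threshold are illustrative and not part of the implemented program.

def looks_swedish(first_sentence_words, swedish_words, threshold=3):
    """Accept a file if its first sentence contains enough known Swedish words."""
    hits = sum(1 for word in first_sentence_words if word.lower() in swedish_words)
    return hits >= threshold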

The execution of Evaluate Files took a long time, as everything was done in a single thread: the whole document was loaded into RAM line by line and each line was kept if it was unique, which made it infeasible to open new threads. This method could be improved upon; an alternative might take more time but save resources. In the end, though, the Evaluate Files part did find lemmas, which was the most important task, and kept them in a file to ease the resource management for the next assignment.

The ao.xml does not contain all Swedish words but can still be used to verify what has been found. While the extraction will miss words due to ao.xml not containing all words, it is better than nothing and can easily be replicated by finding a better source of words.

6.3 Ethical and social aspects

The Wikidata database model is an efficient way to manage words and languages, provided that the users make sure the model is correct; otherwise it may harm the capabilities of education, machine learning, or the use of a formal language. If someone were to make a bad model that goes uncorrected, people in the future might believe that model to be true and, as such, it could ruin that language's structure or grammar. Language always changes, however, so the model always needs to be updated to adapt to the evolution of the language.

6.4 Future work

One improvement for Refresh Files is to check whether the XML files are Swedish texts or not before downloading them. Names cannot be lexemes and should therefore be sorted out; ao.xml does not seem to contain any names. Uppercase and lowercase variants of a word currently count as different words. Finding a better source than ao.xml to compare words with would be a great improvement, as ao.xml only contains 655 words. This could possibly be achieved by formulating requirements for good resources.


References

[1] Wikimedia Foundation, "Wikimedia Projects", https://wikimediafoundation.org/ Retrieved 2020-08-10.

[2] Wikipedia, "Wikimedia sister projects", https://en.wikipedia.org/wiki/Wikipedia:Wikimedia_sister_projects Retrieved 2020-08-10.

[3] Wikidata, "Wikidata:Introduction", https://www.wikidata.org/wiki/Wikidata:Introduction Retrieved 2020-08-10.

[4] Wikidata, "Main Page", https://www.wikidata.org/wiki/Wikidata:Main_Page Retrieved 2020-08-10.

[5] Wikidata, "Lexicographical data/Documentation", https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation Retrieved 2020-08-10.

[6] H. Lane, C. Howard and H. Hapke, Natural Language Processing in Action, p. 4.

[7] Bates, M. (1995). Models of natural language understanding. Proceedings of the National Academy of Sciences, 92(22), 9977-9982.

[8] Bill Watterson, Homicidal Psycho Jungle Cat: A Calvin and Hobbes Collection (1994), p. 53.

[9] Språkbanken, "Om oss", https://spraakbanken.gu.se/om Retrieved 2020-08-10.

[10] W3C, "Extensible Markup Language", https://www.w3.org/TR/REC-xml/ Retrieved 2020-08-10.

[11] GeeksforGeeks, "Python Programming Language", https://www.geeksforgeeks.org/python-programming-language/ Retrieved 2020-08-10.

[12] Real Python, "Python Modules and Packages – An Introduction", https://realpython.com/python-modules-packages/ Retrieved 2020-08-10.

[13] Read the Docs, "Requests: HTTP for Humans™", https://requests.readthedocs.io/en/master/ Retrieved 2020-08-10.

[14] W3Schools, "Python RegEx", https://www.w3schools.com/python/python_regex.asp Retrieved 2020-08-10.

[15] Python, "xml.etree.ElementTree – The ElementTree XML API", https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree Retrieved 2020-08-10.

[16] Python, "codecs – Codec registry and base classes", https://docs.python.org/3/library/codecs.html Retrieved 2020-08-10.

[17] Python, "bz2 – Support for bzip2 compression", https://docs.python.org/3/library/bz2.html Retrieved 2020-08-10.

[18] Python, "os – Miscellaneous operating system interfaces", https://docs.python.org/3/library/os.html Retrieved 2020-08-10.

[19] Python, "shutil – High-level file operations", https://docs.python.org/3/library/shutil.html Retrieved 2020-08-10.

[20] Python, "_thread – Low-level threading API", https://docs.python.org/3/library/_thread.html Retrieved 2020-08-10.

[21] Crummy, "Beautiful Soup Documentation", https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Retrieved 2020-08-10.

[22] Python, "Graphical User Interfaces with Tk", https://docs.python.org/3/library/tk.html Retrieved 2020-08-10.

[23] Nielsen, F. Å. (2019, July). Danish in Wikidata lexemes. In Wordnet Conference (p. 33).

[24] Macfarlane, D. (2016). The lexeme hypotheses: Their use to generate highly grammatical and completely computerized medical records. Medical Hypotheses, 92, 75-79.

[25] Cartoni, B., Aros, D. C., Vrandecic, D., & Lertpradit, S. (2020, May). Introducing Lexical Masks: a New Representation of Lexical Entries for Better Evaluation and Exchange of Lexicons. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 3046-3052).

[26] Språkbanken, "ao.xml", https://svn.spraakdata.gu.se/sb-arkiv/pub/lmf/ao/ao.xml Retrieved 2020-08-10.

[27] Språkbanken, "Språkbankentext", https://spraakbanken.gu.se/eng/research/infrastructure/korp/sentencesets Retrieved 2020-08-10.


Appendix A: Source Code

# -*- coding: utf-8 -*-
#import urllib.request
import requests
import re
import xml.etree.ElementTree as ET
import codecs
import bz2
import os
import shutil
import _thread
import sys

from PIL import ImageTk, Image
from bs4 import BeautifulSoup
from tkinter import *

#-Computer-or-laptop---
CoL = "E:"

#-Classes---

#Class that keeps informations we need
class Lexeme():
    """Uses lemma and word class"""

    def __init__(self, lemma, wordclass):
        self.lemma = lemma
        self.wordclass = wordclass

    def printLexeme(self):
        print(self.lemma + ', ' + self.wordclass)

#-Functions---

#call this function to move a file from one directory to the other
def moveFile(oldLocation, newLocation):
    #shutil.move(oldLocation, newLocation)
    i = 0
    with open(newLocation, 'wb') as new_file, open(oldLocation, 'rb') as file:
        for data in iter(lambda: file.read(65536 * 1024), b''):
            print("Moving " + oldLocation + " to " + newLocation + " " + str(i))
            new_file.write(data)
            i = i + 1
    os.remove(oldLocation)
    print("Done moving " + oldLocation + " to " + newLocation + "\n")

#use to decompress .bz2 files
def decompressFileBz2(fileLocation):
    print("Decompressing: " + fileLocation)
    newFileName = "Downloaded/Unzipped/" + fileLocation
    newFileName = newFileName.replace(".bz2", "")
    decomp = bz2.BZ2Decompressor()
    i = 0
    with open(newFileName, 'wb') as new_file, bz2.BZ2File("Downloaded/" + fileLocation, 'rb') as file:
        for data in iter(lambda: file.read(65536 * 2048), b''):
            print("Decompressing " + str(i))
            new_file.write(data)
            i = i + 1
    os.remove("Downloaded/" + fileLocation)
    fileLocation = fileLocation.replace(".bz2", "")
    print("Finished decompressing " + fileLocation + "\n")
    _thread.start_new_thread(moveFile, (newFileName, CoL + "/Skola/Exjobb/XML-filer/" + fileLocation))

#Use to download files from spraakbanken, they come in a bz2-compressed format though...
def downloadFile(urlStart, urlEnding, decompressBool):
    print("Downloading: " + urlEnding)
    r = requests.get(urlStart + urlEnding)
    print("Writing file: " + urlEnding)
    print("Finished downloading " + urlEnding + "\n")
    fileLocation = "Downloaded/" + urlEnding
    with open(fileLocation, "wb") as f:
        for chunk in r.iter_content(chunk_size=128):
            f.write(chunk)
    if decompressBool == "true":
        decompressFileBz2(urlEnding)

#Use to re-download all sentenceset xml-files from spraakbanken
def refreshSentenceSets():
    req = requests.get("https://spraakbanken.gu.se/eng/research/infrastructure/korp/sentencesets")
    req.encoding = 'utf-8'
    soup = BeautifulSoup(req.text, "lxml")
    pages = soup.findAll("a", href=re.compile(".*spraakbanken.gu.se/lb/resurser/meningsmangder/*"))
    baseUrl = "http://spraakbanken.gu.se/lb/resurser/meningsmangder/"
    for page in pages:
        endingUrl = page.get("href").replace(baseUrl, "")
        downloadFile(baseUrl, endingUrl, "true")
    print("Done!")

#Use to extract unique lines from downloaded XML-files
def searchUniqueWords(fileName, fileLocation, newLocation):
    fileLocation += fileName
    newLocation += fileName + "_unique"
    lemmaArray = []

    with open(fileLocation, errors="ignore") as file:
        for xmlString in file:
            try:
                if xmlString[:2] == "<w":  #we only want the w tag
                    wtag = ET.fromstring(xmlString)
                    lem = wtag.get("lemma")
                    if lem != "|":
                        t1 = []
                        t2 = ""
                        for letter in lem:
                            if letter == "|" and t2 != "":
                                t1.append(t2)
                                t2 = ""
                            elif letter != "|":
                                t2 += letter
                        if t2 != "":
                            t1.append(t2)
                        #for word in t1:
                        #    print(word)
                        if t1 not in lemmaArray:
                            lemmaArray.append(t1)
            except:
                print("skipping some stuff")
                continue
    with open(newLocation, "w+") as f:
        for words in lemmaArray:
            line = ""
            for word in words:
                line += word + ", "
            line = line[:-1]
            line = line[:-1]
            f.write(line)
            f.write("\n")
    #print("Done")

#Use to extract unique files from a single file, for all files in that directory
def createSeveralUnique(fileLocation):
    files = os.listdir(fileLocation + "XML-filer/")
    for file in files:
        #Problematic Files:
        #Majority of aspac files, those that arent swedish
        #bloggmix2005.xml, 'charmap' codec can't encode characters
        #have the file be less than a certain size too
        if file[:5] != "aspac":
            print("Doing file: " + file + " " + str(os.path.getsize(fileLocation + "XML-filer/" + file)) + " bytes")
            searchUniqueWords(file, fileLocation + "XML-filer/", fileLocation + "Unique/")

#Use to extract unique words from the unique files
def createUniqueWordList():
    fileLocation = CoL + "/Skola/Exjobb/"
    createSeveralUnique(fileLocation)
    #Check unique files and make a new unique file out of those
    newFilesLocation = fileLocation + "Unique/"
    newFiles = os.listdir(newFilesLocation)
    uniqueArray = []  #Used to keep unique words from all the texts
    for file in newFiles:
        with open(newFilesLocation + file, "r") as f:
            print("Doing " + file)
            for word in f:
                if word not in uniqueArray:
                    uniqueArray.append(word)

    with open(fileLocation + "uniqueFiles", "w+") as f:
        for word in uniqueArray:
            f.write(word)
    print("Done")

def evaluateQuality():
    fileLocation = CoL + "/Skola/Exjobb/"
    uniqueArray = []
    with open(fileLocation + "uniqueFiles", "r") as uniques:
        for word in uniques:
            uniqueArray.append(word)
    aoTree = ET.parse(fileLocation + 'ao.xml')
    aoRoot = aoTree.getroot()
    wordArray = []
    for fr in aoRoot.iter('FormRepresentation'):
        for feat in fr.iter('feat'):
            if feat.get('att') == 'writtenForm':
                wf = feat.get('val')
            if feat.get('att') == 'nativePartOfSpeech':
                npos = feat.get('val')
                wordArray.append(Lexeme(wf, npos))

    korrekt = 0
    failed = 0
    lexemeArray = []

    for i in uniqueArray:
        failedIterate = True
        i = i.replace('\n', '')
        for j in wordArray:
            if j.lemma == i:
                lexemeArray.append(j)
                korrekt = korrekt + 1
                failedIterate = False
                break
        if failedIterate:
            failed = failed + 1

    print('wordArray = ' + str(len(wordArray)))
    print('uniqueArray = ' + str(len(uniqueArray)))
    print('lexemeArray = ' + str(len(lexemeArray)))
    print('korrekt = ' + str(korrekt))
    print('failed = ' + str(failed))

    with open(fileLocation + "lexemesFound", "w+") as f:
        for lexeme in lexemeArray:
            f.write(lexeme.lemma + " " + lexeme.wordclass + "\n")

    print("Done")


#GUI to let user choose to refresh sentencesets or evaluate stuff
class ButtonChoice(Frame):
    def __init__(self, parent=None):
        Frame.__init__(self, parent)
        self.pack()
        self.make_widgets()

    def make_widgets(self):
        text = Text(self)
        text.insert(INSERT, "Welcome to my ex-jobb. Press the refresh button to download XML-files, evaluate to evaluate (obviously)")
        text.pack()
        widget1 = Button(self, text='Evaluate Files', command=createUniqueWordList)  #evaluates all the files
        widget1.pack(side=LEFT)
        widget2 = Button(self, text='Refresh Files', command=refreshSentenceSets)  #redownloads all data
        widget2.pack(side=RIGHT)
        widget3 = Button(self, text='Analyze Evaluation', command=evaluateQuality)  #Evaluates the quality
        widget3.pack(side=BOTTOM)


if __name__ == '__main__':
    ButtonChoice().mainloop()
