
Definition Extraction From Swedish Technical Documentation

Bridging the gap between industry and academy approaches

Benjamin Helmersson

Supervisor:

Marco Kuhlmann

External Supervisor:

Magnus Merkel

Examiner:

Mattias Arvola

Bachelor’s Thesis - Cognitive Science

ISRN: LIU-IDA/KOGVET-G--16/024--SE

Department of Computer and Information Science, IDA

Linköping University, Sweden


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Definition Extraction From Swedish Technical Documentation: Bridging the gap between industry and academy approaches

Benjamin Helmersson

Abstract

Terminology is concerned with the creation and maintenance of concept systems, terms and definitions. Automatic term and definition extraction is used to simplify this otherwise manual and sometimes tedious process. This thesis presents an integrated approach of pattern matching and machine learning, utilising feature vectors in which each feature is a Boolean function of a regular expression. The integrated approach is compared with the two more classic approaches, showing a significant increase in recall while maintaining a comparable precision score. Less promising is the negative correlation between the performance of the integrated approach and training size. Further research is suggested.

Keywords. Definition Extraction; Machine Learning; Pattern Matching; Naive Bayes; Regular Expressions; REV


Many have contributed to this work, either intellectually or socially. I have decided to name a few.

Thanks to Marco Kuhlmann for your supervision. Many hours of coding later I was asked questions that steered me towards the questions I ask in this thesis. Meetings with you were all golden.

Thanks to Magnus Merkel, Lars Edholm, Christian Smith and Mikael Lundahl for intellectual support and for answering crazy questions about terminology I would not dare to ask anyone else.

Thanks to Erika Anderskär for supporting me during every thesis-writing marathon-session.

Last but not least, thanks to Fodina Language Technology for providing me with the data needed. Many interesting findings would probably never have been made without terminology expertise involved. No data means no tests.


Contents

1 Introduction
  1.1 Purpose of Study
  1.2 Research Questions
  1.3 Delimitations
    1.3.1 Swedish Technical Documentation
    1.3.2 Upper Features Limit
    1.3.3 One Machine Learning Algorithm
    1.3.4 Data Quantity
2 Terminology
  2.1 Theory
  2.2 Practice
  2.3 Information & Definition Extraction
3 Data
  3.1 Preprocessing
  3.2 Data Sets
    3.2.1 The Development Set
    3.2.2 The Training Set
    3.2.3 The Test Set
4 Tools & Methods
  4.1 Metrics & Measures
    4.1.1 Confusion Matrix
    4.1.2 Accuracy (ACC)
    4.1.3 Precision & Recall (P & R)
    4.1.4 The F-Measure (F1)
  4.2 Data Types & Features
    4.2.1 Lemmas & Part of Speech (POS)
    4.2.2 Bag-of-words (BoW)
    4.2.3 N-grams
    4.2.4 The Regular Expressions Vector (REV)
  4.3 Classifiers
    4.3.1 Naive Bayes Classifier (NB)
    4.3.2 Regular Expressions As Classifiers (REC)
5 Experiments
  5.1 Experiment 1: Two Approaches Compared
    5.1.1 Evaluation of Approach: Pattern Matching
    5.1.2 Evaluation of Approach: Machine Learning
    5.1.3 Comparison
  5.2 Experiment 2: An Integrated Approach
    5.2.1 Evaluation & Comparison
    5.2.2 Scalability
6 Analysis & Discussion
  6.1 Methodology
  6.2 Experiment 1: Definition Extraction Models Comparison
  6.3 Experiment 2: Performance & Training Set Size
  6.4 Future Studies
    6.4.1 Implications & Applications
7 Conclusion
References

1 Introduction

There is a great gap between the academy and the industry when it comes to the field of information extraction. Machine learning approaches are commonly used by researchers, while rule-based approaches dominate the industry. This gap is largely rooted in how the two differ in their view of costs and benefits: the academy isolates documents to measure intrinsic value, recall and precision, which in a business context matters less than extrinsic value, that is overall performance and interpretability (Chiticariu, Li, & Reiss, 2013). Rewriting rules to accommodate a new domain, instead of creating a brand new data set, seems to be the more efficient alternative for businesses. Researchers, on the other hand, view the rule- and pattern-based approach as “tedious and time-consuming” (Yakushiji, Miyao, Ohta, Tateisi, & Tsujii, 2006, p. 284), and it is also considered to produce varying results (Borg, Rosner, & Pace, 2009, p. 26). Instead, researchers turn to automatic and semi-automatic solutions involving large amounts of data and probabilistic machine learning. That manually crafted regular expressions are still used in business settings may have to do with tradition: they are easily adapted, and as soon as a new pattern turns out to be useful it can simply be added to the pool of effective patterns.

Why should one bother with information extraction? The answer could simply be “to be able to answer questions”, but information extraction also includes tasks such as text summarisation and terminology creation and maintenance. This thesis is focused on the latter, even though most of these tasks could be seen as interconnected. Information about certain domain-specific objects and events has to be extracted from large amounts of text in order to create a terminology, and to understand how terms are used in a company or government, one has to base this extraction on actual documents from within that organisation. The task of definitional context extraction can be considered a fact retrieval task, at least once regarded as one of the main problems of natural language processing (Choueka, 2014). In the intersection between natural language and data science, this cognitive science thesis combines linguistics with artificial intelligence to process human language with computers.

1.1 Purpose of Study

This study aims to explore how to solve the problem of concept extraction in Swedish technical documentation, starting not with the extraction of terms but rather with the extraction of definitional contexts. In a broader perspective, this thesis is mainly concerned with one of the many tasks that terminology practice is concerned with, namely to:

Collect and record terms, definitions and other relevant information from the source documentation. Consult subject field specialists. (Suonuuti, 2001, pp. 22–23, entry 16)

More specifically, it is concerned with the collection of “other relevant information”, among which definitions are counted. The other 37 tasks mentioned by Suonuuti are mostly focused on concept systems, which in turn affect term and definition production. If a computer were able to discriminate between two types of sentences, definitions on the one hand and non-definitions on the other, a great deal of time could be saved. Solving the issue of definition extraction in Swedish technical documentation would be useful in applications such as automated question answering systems and ontology or terminology development and maintenance.

1.2 Research Questions

The central questions of this study are:

1. How well do pattern matching and machine learning approaches perform in the task of definition extraction from Swedish texts?

2. Is it possible to successfully integrate the pattern matching and machine learning approaches and does an integrated approach scale well with data quantity?

The first question asks for an evaluation of pattern matching and machine learning approaches in the task of definition extraction from Swedish texts. The second question asks for an integration of the two approaches, bridging the gap between industry and academy. Answering these questions will result in a better understanding of definition extraction from Swedish technical documentation.


1.3 Delimitations

Some delimitations were set in order to narrow the scope of this study. They are explained here. External limitations, due to the inability to control natural language at the production level, are not covered.

1.3.1 Swedish Technical Documentation

It would be interesting to study the extraction of definitions from different types of texts ranging from blogs on the web to novels written in English, Dutch and Japanese. However, this study is delimited to Swedish technical documentation. The reason for this is simple. Swedish is my first language and definitions are more commonly found in technical documentation than in other text types. In addition, technical documentation is a common text type used in terminology work due to its domain specificity.

1.3.2 Upper Features Limit

Limiting the number of features included in the computations of this study has been essential to further narrow its scope. Additional variations could have been added but were not. This is true for all data types included in this study. The N-grams were limited to N sizes ranging from 1 to 4 and the regular expressions vectors were limited to 10 features. See section 4.2 Data Types & Features for further information.

1.3.3 One Machine Learning Algorithm

To narrow the scope of this study, no more than one machine learning algorithm was used, namely the Naive Bayes classifier. It is commonly used as a baseline for machine learning in academic work and serves as an algorithm able to test this study’s hypotheses. More on this in section 4.3 Classifiers.

1.3.4 Data Quantity

For data to be useful in the training or test sets used by the classifiers of this study, sentences had to be manually tagged as either definitions or non-definitions. The test data comprises no more than 100 sentences and the training data no more than 298 sentences. A common rule of thumb is to have a training set 8 times larger than the test set, which is not the case in this study. More on this in section 3 Data. The performance of the machine learning classifier is probably slightly lowered due to the small training set size.


2 Terminology

Terminology is concerned with analysing and defining concept systems. Terminologists collect terms and term descriptions from already existing text and spoken discourse, deciding which concepts relate to each other and which concepts do not, in order to assign them terms and accurate definitions. This section explains the theory and practice of terminology. The theory section explains concepts, objects, terms and definitions while the practice section explains terminology work and its connections to information extraction.

2.1 Theory

There is a widely known polysemy attached to the word terminology (Cabré, 1996, p. 22). This is a bit ironic, as the declared goal of terminology is to avoid unclear terms and concepts. It can refer to a discipline, a practice and a product generated by that practice. Cabré explains that "as a discipline terminology is a subject which is concerned with specialized terms; as a practice it is the set of principles oriented toward term compilation; finally, as a product, it is the set of terms from a given subject field" (p. 16). The discipline combines theories from different fields such as linguistics, lexicology and lexical semantics and deals with concepts and their representations, an interdisciplinary field which in theory also involves philosophy and computer science. This can easily be related to central questions of cognitive science, such as the role of language in thinking and that of mental representations.

As if this were not confusing enough, there are some distinctions between what different disciplines consider a "term" to be. In philosophy the term serves as a cognitive unit representing specialised knowledge, and in linguistics it serves as a word, not distinguishable from other words in a lexicon. In terminology, which is the application field of this thesis, the term is an expression of specialised knowledge.

The semiotic triangle (Figure 1), also called the Triangle of Reference (Ogden & Richards, 1923) shows how references, symbols and referents are interconnected. The referent is the actual thing in the world, the symbol the term or word used to refer to it and the reference the set of features that bind the two together. In the field of terminology, terms are symbols, objects are referents and concepts are references. The defining explanation of the concept, the definition, is not included in the triangle.

The terminology pyramid (Figure 2), originally explained by Suonuuti (2001), is a revised version of the semiotic triangle, adding definitions to the equation. What connects the four parts of the pyramid is the concept, that is the mental picture or thought of an object and its characteristics. The concept of a bird holds among other characteristics “have wings” while it excludes characteristics such as “washes the dishes every morning”. I found this explanation of concepts by Saeed exceptionally well written:

The most usual modification of the image theory is to hypothesize that the sense of some words, while mental, is not visual but a more abstract element: a concept. This has the advantage that we can accept that a concept might be able to contain the non-visual features which make a dog a dog, democracy democracy, etc. [...] Some concepts might be simple and related to perceptual stimuli – like SUN, WATER, etc. Others will be complex concepts like MARRIAGE or RETIREMENT which involve whole theories or cultural complexes. (Saeed, 2009, p. 33)

Many concepts are similar to each other, but small differences still make different concepts different. The relation between two concepts is usually one of three types: generic (logical), partitive (ontological) or associative (Suonuuti, 2001; Ingo, 2007; ISO 704:2009, 2009). Generic relations are “is-a”-relations. They tell of a concept’s group membership in relation to other concepts with similar characteristics. A table is a kind of furniture, a shirt is a garment, etc. Partitive relations are “is-part-of”-relations. A book consists of all of its pages, a handle is a part of a door, etc. Associative relations are those that are based on usage or function. For example, a computer mouse connects to the computer and makes it possible for the user to interact with it. Together these relations between concepts form concept systems, the very core of terminology.


Figure 1. The semiotic triangle, adapted from what Ogden and Richards (1923) originally called the “Triangle of Reference”.

Figure 2. The Terminology Pyramid, adapted from Suonuuti (2001)

A term is a symbol indirectly referring to an object by directly referring to its concept. One special thing about terminography, when compared to lexicography, is that it is dedicated to units of language in specialised domains. Terms are not just any words, since they have to be specialised and they have to be connected to an actual object (which of course could be a more abstract object, as explained above). In addition, a term can consist of a multitude of words (so-called multi-word terms). Felber (1984) defines a term as “any conventional symbol representing a concept defined in a subject field” (p. 1). Unfortunately this excludes the term’s relation to the object. Let us look at how Cabré defines words and terms instead:

A word is a unit described by a set of systematic linguistic features, having the property of referring to an element in reality. A term is a unit described by a similar set of linguistic features, this unit being used in a specialized domain. (Cabré, 1996, p. 22)

Though Cabré’s definition includes the relation between terms and objects, it excludes the relation between terms and their concepts. A combination of Felber’s and Cabré’s definitions results in a definition of terms applicable in this study:

A term is a unit comprised of a set of systematic linguistic features, referring to a concept defined in a specialised domain in which a referent object exists.

The object is what the concept represents. It has a set of properties that usually matches the concept’s characteristics. A one-legged human is still a human even though the supposed characteristic "has two legs" is violated. The properties of an object and the characteristics of that object’s concept do not necessarily have to be identical, but they usually are.

The definition is the explanation of a concept. It states what separates concepts from each other in a specialised domain. A concept sharing characteristics with another will be related to it in one way or another and definitions are the precise descriptions of these possibly subtle differences. The terminology pyramid does not include any other concept description than the definition even though less standardised but yet useful descriptions could be found in real text. To define a concept we have to understand its characteristics. One way to identify these characteristics is to look for them in already-written texts where the concepts have been explained.

Properly written definitions seldom occur in naturally produced texts. In technical documentation they are far more common than in for example blogs, newspapers and books. When found, definitions are commonly in close conjunction with their respective terms and their concepts are often not properly defined. Together they form definition-term pairs, so-called definitional contexts (Alarcón, Sierra, & Bach, 2009). The terms and definitions are connected by typographic or syntactic patterns such as verbs (syntactic) and punctuation (typographic). One way of finding definitional contexts is to look for certain patterns that commonly occur in such settings.

2.2 Practice

The need for terminology is mainly based on the object classification problem, often aimed at products or new designs (Ingo, 2007, p. 229). Concept systems comprised of several concepts and their relations are created and maintained by a group of terminologists. These concepts’ characteristics are determined by the properties of the objects that they refer to. Objects and properties are part of the “real world” while concepts and characteristics are abstractions of the former two. Terminologists ensure that concepts are clearly defined. The main activities of terminology management, some of them nowadays automated, include:

• Identifying concepts and concept relations;

• Analysing and modelling concept systems on the basis of identified concepts and concept relations;

• Establishing representations of concept systems through concept diagrams;

• Defining concepts;

• Attributing designations (predominantly terms) to each concept in one or more languages;

• Recording and presenting terminological data, principally in print and electronic media (terminography)

(ISO 704:2009, 2009, p. V)

Identifying concepts and their relations is a central part of the work. Suonuuti (2001, p. 34) states that the concept and term collection at this point should be less restrictive, while at a later point the number of concepts dealt with by a terminology group has to be limited to at most 200. Concept information is excerpted from the source documentation, usually as term candidates but sometimes also as whole sentences or term-definition pairs. This is where this study comes into the picture. Commonly used terms in large industry domains generate many sentences of which only a few contain information that is per se defining, and some rarely used terms are not defined at all. A computer program that automatically finds definitions and rejects non-informational sentences would make the activity of concept analysis significantly easier. This study is not concerned with the creation of definitions but rather with the search for useful information in texts that have already been produced, which brings us to the field of information extraction.

2.3 Information & Definition Extraction

Information extraction (IE) is a subfield of natural language processing in which texts are automatically or semi-automatically analysed in order to uncover information about semantic and temporal relations, entities and events (Jurafsky & Martin, 2009, p. 740). The methods of IE are commonly based around machine learning and bag-of-words. IE is concerned with tasks such as question answering, text summarisation, lexicon/terminology management and web search. It is possible to get a close-enough picture of what information a concept holds by looking for terms and definitions in specialised texts, so-called definitional or definitory contexts (Sierra & Alarcón, 2002). The solutions are traditionally based on pattern matching (Bertin, Atanassova, & Desclés, 2009, p. 23). The varying results of pattern matching approaches are considered a problem, since “definitional patterns are used not only in definitional sentences but also in a wider range of sentences” (Alarcón et al., 2009, p. 9), resulting in inaccurate output.

Many other solutions have been used to solve the problem of definitional context extraction. Some start with extracting terms (Zheng, Zhao, & Yang, 2009; Lossio-Ventura, Jonquet, Roche, & Teisseire, 2014), others with the task of definition extraction (Espinosa-Anke, Ronzano, & Saggion, 2015), and some extract both in parallel as term-definition pairs. A common method other than pattern matching is to use machine learning algorithms to classify sentences as either relevant or irrelevant. The precision score (that is, the proportion of sentences classified as relevant that truly are relevant) of most such machine learning systems is relatively low (Westerhout, 2009, p. 62). A possible reason for this is that a high recall score (that is, the proportion of relevant sentences classified as relevant) is valued more in semi-automatic systems in which human intervention is embedded. A human observer can easily reject irrelevant sentences even if they were classified as relevant. Precision is generally of greater importance in fully automated systems where no human intervention is possible. The cost of inaccurate definitions is less than that of some definitions not being found.

Del Gaudio & Branco (2009) explored the performance of five different machine learning algorithms in the task of definition extraction and concluded that a fully ML-based approach, at least to some extent, can be made language independent (p. 38). Borg et al. (2009) have also shown that evolutionary algorithms can be successfully used to extract definitions, with an F-measure (the harmonic mean of precision and recall, used in this study as a measure of overall performance) of up to .62 (p. 29).

Though term extraction has been researched both in monolingual Swedish and bilingual Swedish-other language contexts, for example by Foo (2012), no research seems to be purely concerned with definition extraction in Swedish. Definition extraction in other languages has been more thoroughly researched (Sierra & Alarcón, 2002; Trigui, Belguith, & Rosso, 2010; Trigui, 2011; Del Gaudio & Branco, 2007). Closest to definition extraction research in Swedish is probably the work of Dannélls (2005), which is concerned with recognising Swedish acronyms and their definitions in biomedical texts. What she counts as a definition in such a setting seems different from what is considered a definition here, though. Dannélls’ definitions seem to include acronym expansions, which in this study would rather be treated as term variations since the semiotic function of terms and their short forms is symbolic (Wright, 1997, p. 16). In other words, Central Standard Time would in this study not count as defining to CST but instead as a form variation of the same term.


3 Data

The source technical documentation of this study was provided by Fodina Language Technology. It consisted of Swedish technical documentation in raw text format. A significant amount of the data was not used, since sentences had to be manually annotated before they could be included in the data sets. This section is divided into two parts. The first part explains the preprocessing step, in which raw data was transformed into part of speech-tagged and lemmatised versions of the original text. The second part explains the three data sets that were manually tagged and used in the experiments.

3.1 Preprocessing

The goal of the preprocessing step was to normalise the text in preparation for the experiments. Upper-case letters were converted to lower-case and punctuation marks were separated into tokens of their own. Sentences shorter than 4 tokens or longer than 50 tokens after these procedures were excluded. Each sentence was normalised into two versions, one in which tokens were lemmatised to be represented in lemma form (including punctuation marks) and one in which tokens were represented as their respective part of speech tags. Figure 3 illustrates this. Common abbreviations that included punctuation were edited to not include punctuation, because of an error where the part of speech tagger treated each dot as a full stop, critically affecting the part of speech tag quality. Sentences were segmented and perfect duplicates were removed.
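To make the normalisation step concrete, here is a minimal sketch in Python of the procedure described above (lower-casing, punctuation separation, the 4–50 token length filter and duplicate removal). The lemmatisation and POS tagging performed by Stagger are outside this sketch, and the function names are hypothetical.

```python
import re

def normalise(sentence: str):
    """Lower-case, split punctuation into separate tokens and filter by length.

    Returns the token list, or None if the sentence falls outside the
    4-50 token range used in this study.
    """
    lowered = sentence.lower()
    # Separate punctuation marks so they form tokens of their own.
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    if len(tokens) < 4 or len(tokens) > 50:
        return None
    return tokens

def deduplicate(token_lists):
    """Remove perfect duplicates while preserving order."""
    seen = set()
    for tokens in token_lists:
        if tokens is None:
            continue
        key = tuple(tokens)
        if key not in seen:
            seen.add(key)
            yield tokens

# Example: normalise("Detta är spam.") -> ["detta", "är", "spam", "."]
```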

Figure 3. Normalisation example of “Detta är spam” (“This is spam”), processing raw text into tokens from which lemmatised or part of speech-tagged versions of sentences were generated.

The part of speech tags were assigned by the open-source Stagger by Östling (2013), a part of speech tagger for Swedish with seemingly unmatched accuracy, based on the averaged perceptron. According to Östling it reaches an estimated per-token accuracy (that is, the proportion of correct predictions) of 96.6% on SUC 3.0 (the Stockholm-Umeå Corpus). In the technical documentation domain of this study, however, there is a risk that Stagger does not perform as well as estimated, due to unknown terms and unusual sentence structures.

3.2 Data Sets

In order to perform the experiments and develop the program in which these experiments were run, three data set partitions were created. One of the data sets would be used to develop the experiment setting and to validate that experiments could be run in the testing environment. Another one was used as a training set in the machine learning process. The last data set was used to evaluate the definition extraction results.


Table 1

Data set statistics.

Data set      Size   Definitions   Non-definitions   Mean sentence length
Development   875    101           774               18.6 tokens/sentence
Training      298    46            252               14.5 tokens/sentence
Test          100    22            78                14.7 tokens/sentence

The data were sampled from the normalised text using a computer-based randomness function. An exception to this was the sampling for the development set to which a handful of definitions were added. The data set sizes were initially limited to 875, 300 and 100 sentences. Each sentence in each data set was tagged as either definition (2), informative non-definition (1) or non-definition (0). Later on these tags were changed to definitions (2) and non-definitions (0-1) in order to make the task one of binary classification. Table 1 presents an overview of the data sets and their class distributions.

3.2.1 The Development Set

The data used as a development set was manually tagged by the author of this thesis. An over-representation of incorrectly tagged sentences in this set, compared to the other two, is therefore possible. The development set consisted of 875 sentences, with a mean length of 18.6 tokens per sentence. 572 sentences (65%) were tagged as non-definitions, 202 (23%) as informing non-definitions and 101 (12%) as definitions.

3.2.2 The Training Set

The data used to teach the machine learning algorithm consisted of 298 sentences, each manually tagged by one of two experts from Fodina Language Technology as either definition or non-definition. The experts’ identities are unknown to the author of this thesis. The intention was to have a training set of 300 sentences, although two were lost in the annotation process. 177 of the sentences in the training set (59%) were tagged as non-definitions, 75 (25%) as informing non-definitions and 46 (15%) as definitions. The mean length of the training set sentences was 14.5 tokens.

3.2.3 The Test Set

The data used as a test and evaluation set required more attention than the other sets since every experiment and every measure in the results section of the thesis would rely on it as a gold standard. 100 sentences were tagged separately by both experts (mentioned in the explanation of the training set above) to achieve a higher level of certainty of the tags’ correctness.

Cohen’s Weighted Kappa (κ) measures the rate of agreement between two classifiers, and is defined as

$$\kappa = \frac{p_o - p_c}{1 - p_c}$$

where $p_o$ is the proportion of predictions in which the classifiers agree and $p_c$ the proportion of predictions for which agreement is expected by chance (Cohen, 1960). It is “directly interpretable as the proportion of joint judgments in which there is agreement, after chance agreement is excluded” (p. 46). Its value ranges from −1 to 1, where 1 is a perfect match between the classifiers’ predictions and −1 is a perfect mismatch. A value of 0 indicates that the agreement probably is due to chance. Table 2 shows the initial agreement rate of the experts’ tags, using the κ score.
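As an illustration, a small sketch of the κ computation above for the unweighted case; the function name and the example label lists are hypothetical, not the study's actual annotations.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement p_o.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement p_c from the annotators' marginal distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = set(labels_a) | set(labels_b)
    p_c = sum((freq_a[c] / n) * (freq_b[c] / n) for c in classes)
    return (p_o - p_c) / (1 - p_c)

# Hypothetical tags (0 = non-definition, 2 = definition):
print(cohen_kappa([0, 2, 0, 0, 2], [0, 2, 0, 2, 2]))  # about 0.62
```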

Table 2

Initial κ score when chunking two of the three groups together.

Grouping   κ
0 vs 1-2   .52
0-1 vs 2   .9
0-2 vs 1   .3


Looking at Table 2, it becomes rather obvious that it is best to chunk non-definitions and informing non-definitions together into one single group, called non-definitions, if we are to maximise the expert agreement rate. Two sentences were re-sent to the experts in an attempt for them to reach an agreement. By this method, the two experts’ classification of non-definitions and definitions finally reached a κ score of 1, a perfect match. Other researchers have encountered problems with subjective annotation (Przepiorkowski, Marcinczuk, & Degórski, 2008, p. 170). This data creation technique is supposed to take care of such a problem.

The test set mean sentence length was 14.7 tokens. 22 of the 100 sentences (22%) were tagged as definitions by both experts. One expert tagged 71 sentences (71%) as non-definitions and 7 sentences (7%) as informing non-definitions, while the other tagged 49 sentences (49%) as non-definitions and 29 sentences (29%) as informing non-definitions.


4 Tools & Methods

Many tools and methods have been adapted from the work of others in an attempt not to reinvent the wheel: mostly evaluation measures and statistical metrics, but also classifiers and features. All are explained in this section.

4.1 Metrics & Measures

Metrics and measures used in this study are presented here.

4.1.1 Confusion Matrix

In language technology, a confusion matrix is a table in which each cell represents the intersection of a task’s actual values and a classifier’s predictions. In a binary task, where the classifier is to decide whether a document is part of a certain class or not, there are four cells – one for each of the following:

True positive (TP), correctly predicted to belong to the class;

False positive (FP), incorrectly predicted;

False negative (FN), incorrectly rejected;

True negative (TN), correctly rejected.

Figure 4. A confusion matrix visualisation.

All of the metrics explained in this subsection depend on different combinations of these values.

4.1.2 Accuracy (ACC)

Accuracy is calculated as

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

and is simply a measurement of the percentage of labels correctly assigned. If a classifier assigns the correct label to 40 documents out of 100, the accuracy of that classifier is 40%. Accuracy is simple and effective in explaining a system’s performance on balanced sets in which positives and negatives are evenly distributed, but it can also be a rather misleading metric, since target classes usually turn out to be underrepresented in search tasks like those concerned with information extraction (Bird, Klein, & Loper, 2009); true negatives and true positives are not equally important in this task. Accuracy ironically turns out to be a less accurate metric for this study. Thus it is not used.

4.1.3 Precision & Recall (P & R)

Precision and recall are the most common metrics when it comes to evaluation of search and classification tasks. Together they give a picture of how well a classifier performs by taking into account both false positives and false negatives.


Figure 5. Accuracy score of 40%, dark boxes are true predictions and light boxes are false predictions.

Figure 6. Confusion matrix illustrating a Precision score of 75%. The dark boxed lines indicate which parts of the matrix are used to calculate Precision.

Precision is calculated as

$$P = \frac{TP}{TP + FP}$$

and measures the percentage of documents assigned a label actually being a part of that group. If a classifier labels 20 documents as positives but no more than 15 of them are actual members of the class (Figure 6), the precision score would be 15/20 = 75%.

Recall is calculated as

$$R = \frac{TP}{TP + FN}$$

and measures the percentage of relevant documents identified, the retrieval rate of the classifier. If a classifier identifies 15 of the positives but rejects 15, its recall score is 15/30 = 50%.

Figure 7. Confusion matrix illustrating a Recall score of 50%. The dark boxed lines indicate which parts of the matrix are used to calculate Recall.


4.1.4 The F-Measure (F1)

The F-measure (Van Rijsbergen, 1974) is calculated as

$$F_\beta = \frac{(1 + \beta^2)PR}{\beta^2 P + R}$$

and is the harmonic mean of precision and recall. It serves as a single score of how well a classifier performs, applying a relative importance β depending on whether it is to favour precision or recall. A β of 0.5 favours precision and a β of 2 favours recall. The equally balanced

$$F_1 = \frac{2PR}{P + R}$$

is used in this study. Whether recall or precision is to be favoured is discussed later in the thesis. The F-measure is in this study used as an overall performance score.
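To make the metrics concrete, a small sketch computing precision, recall and the F-measure from raw confusion-matrix counts, reproducing the worked examples above; the function names are hypothetical.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# The worked examples from the text: P = 15/20 = 0.75 and R = 15/30 = 0.5.
p, r = precision(tp=15, fp=5), recall(tp=15, fn=15)
print(p, r, f_measure(p, r))  # 0.75 0.5 0.6
```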

4.2 Data Types & Features

Natural language is commonly quantified in one way or another in order to make it computer readable or to enable statistical testing that would have been impossible on untouched text. The key is to uncover syntactic, semantic and pragmatic information. Algorithms or whole systems are then run to structure said information into knowledge databases or feature weights for classification. The data types and features used in this study are explained here.

4.2.1 Lemmas & Part of Speech (POS)

The syntactic features of the study are closely tied to the Swedish language. N-grams and bag-of-words also depend on syntactic information, but there is a special kind of feature that motivates this section – the syntactic features embedded into or converted into regular expressions. All syntactic features used are based upon lemmatised versions of common definitory contexts, since all words in the corpus were lemmatised in the preprocessing step. The syntactic features and their translations are found in Table 3. Note that \b matches the empty string at the beginning or end of a word, so the expression “\bhi\bthere” would match neither “hi there” nor “hithere”. Note also that \w matches every alphanumeric character plus underscore, so the expression “\w+” would match every word and digit, variables like “var01” included.

Table 3

Regular expressions based on lemma, English translation included.

Swedish regular expression                                                               English translation
"\bvara( \w+){1,15} som\b"                                                               "\bis( \w+){1,15} that\b"
"\bsom\b"                                                                                "\bthat\b"
"\bvara\b"                                                                               "\bbe\b"
"\b(om|när|ifall|vid)\b"                                                                 "\b(if|when|if|at)\b"
"\b(bestå( \w+){0,3} av|(uppfatta|benämna)( \w+){0,3} som|referera( \w+){0,3} till)\b"   "\b(consist( \w+){0,3} of|(perceive|denominate)( \w+){0,3} that|refer( \w+){0,3} to)\b"
"\b(s(å)?( )?k(alla)?\b"                                                                 "\b(s(o)?( )?c(all)?\b"
"\b(tex|till exempel|exempelvis|t ex)\b"                                                 "\b(eg|for example|as an example|e g)\b"

Some of the patterns were adapted from patterns in other languages published in the proceedings of the International Workshop On Definition Extraction, Borovets, Bulgaria (Aguilar & Sierra, 2009; Alarcón et al., 2009; Valero & Alcina, 2009; Westerhout, 2009) and some were developed using the development set as reference.

The three first patterns were developed with generic relations in mind. The word “vara” is the lemma form of Swedish “är” (“is”), which is the most common word found in definitional contexts containing generic relations. The word “som” (“that”) was found to be common among definitions in the development set and was included in two possible forms, one in which it collocated with “vara” in the sentence (with some words between them) and one in which its sole occurrence was of importance. The long pattern is a concatenation of three defining structures, practically three patterns in one. Its first part (“bestå”) matches partitive relations while the second and third (“uppfatta|benämna” and “referera”) match generic definitions.

Conditionality and prepositional arguments in the form of “om”, “när”, “ifall” and “vid” were also found in the development set among definitions – an interesting find that I chose to explore further by including a pattern for it. This was also true for the pattern “till exempel” (“for example”), even though it turned out to match no sentence in the test set. The “så kalla” pattern was developed in order to match both the abbreviation and the lemma form of the “to call as” verbal pattern (Alarcón et al., 2009, p. 9).

Part of speech tag search was made possible in an attempt to generalise patterns. By looking for patterns in POS-tagged versions of the texts, a somewhat smarter kind of regular expression could be utilised. The features are found in Table 4. For a reference of the POS tags, see Table 5.

Table 4

Regular expressions utilising part of speech data used in this study.

Regular expression
"\b(UO|PM|NN) VB( DT)?( JJ)? NN\b"
"\b(UO|PM|NN) VB DT( JJ)? NN\b"
"\bPAD ((DT )?(JJ )?(NN|UO|PM) )+PAD\b"

Table 5

Part of speech tags and their meaning.

POS tag   Meaning
UO        Foreign word
PM        Proper name
NN        Noun
VB        Verb
DT        Article
JJ        Adjective
PAD       Paired punctuation

There are two types of patterns utilising part of speech data, both designed by the author of this thesis by analysing sentences in the development set. The first and second are both designed to match functional (ex: “bird catches the worm”) and generic information (ex: “birds are flying animals”). What separates them is whether an article following the verb is optional (first pattern) or not (second pattern). The third pattern is developed to match term synonyms placed within parentheses, as in “the couch (the cosy sofa342)” or “this car (UF3452GL21)”.

4.2.2 Bag-of-words (BoW)

One way to quantify a text, making it easier to analyse, is to count word occurrences. This is called Bag of Words (BoW). The sentence “the early bird catches the worm” would then be represented as the dictionary {the:2, early:1, bird:1, catches:1, worm:1}. Note that "the" is counted twice. An alternative to BoW is to create a Bag of POS, where each word in the dictionary is replaced by its part of speech tag. The example sentence above would then be {article:2, adjective:1, noun:2, verb:1}, shrunk in comparison to the standard BoW since worm and bird are grouped together under "noun". Morphological alternatives to words can be ignored by lemmatising the sentence. Such a representation would look like {the:2, early:1, bird:1, catch:1, worm:1} and would be a way of counting both "early" and "earlier" as the same token.
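A minimal sketch of the bag-of-words quantification described above, using Python's Counter; the POS tag sequence in the second example is illustrative, not Stagger output.

```python
from collections import Counter

sentence = "the early bird catches the worm".split()

bow = Counter(sentence)         # {'the': 2, 'early': 1, 'bird': 1, 'catches': 1, 'worm': 1}

# A bag of POS is built the same way from the POS-tagged version of the sentence.
pos_tags = ["article", "adjective", "noun", "verb", "article", "noun"]
bag_of_pos = Counter(pos_tags)  # {'article': 2, 'noun': 2, 'adjective': 1, 'verb': 1}
```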

4.2.3 N-grams

There are some questions that cannot be answered using only BoW. By combining words it is possible to learn how they co-occur. Combined words in a BoW-like manner are called N-grams, N being the number of tokens chunked together. For a 2-gram, a so-called bigram, it would look like {the early:1, early bird:1, bird catches:1, catches the:1, the worm:1}. In some cases you also add one beginning of sentence (BOS) marker and one end of sentence (EOS) marker, resulting in 2 additional entries in the dictionary, {BOS the:1, worm EOS:1}.

The N-grams used in this thesis are built upon lemma and POS data.
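A short sketch of N-gram counting as described above, with optional BOS/EOS markers; the helper name is hypothetical.

```python
from collections import Counter

def ngrams(tokens, n, mark_boundaries=False):
    """Return a frequency dictionary of n-grams over a token list."""
    if mark_boundaries:
        tokens = ["BOS"] + list(tokens) + ["EOS"]
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(ngrams("the early bird catches the worm".split(), 2))
# {'the early': 1, 'early bird': 1, 'bird catches': 1, 'catches the': 1, 'the worm': 1}
```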

4.2.4 The Regular Expressions Vector (REV)

In order to combine the strengths of regular expressions and the strengths of machine learning, a vector type using regular expressions as Boolean features was created. We denominate this a Regular Expressions Vector (REV). The REVs included in this study consist of the seven regular expressions listed in Table 3 and the three listed in Table 4 added together.
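A minimal sketch of how such a vector can be built: each feature is the Boolean output of one regular expression applied to the lemmatised or POS-tagged version of the sentence. Only three of the study's ten patterns are shown, and the function name and example sentences are hypothetical.

```python
import re

# Two of the lemma-based patterns (Table 3) and one POS-based pattern (Table 4).
LEMMA_PATTERNS = [r"\bvara( \w+){1,15} som\b", r"\bsom\b"]
POS_PATTERNS = [r"\b(UO|PM|NN) VB( DT)?( JJ)? NN\b"]

def rev(lemma_sentence: str, pos_sentence: str):
    """Regular Expressions Vector: one Boolean feature per pattern."""
    features = [bool(re.search(p, lemma_sentence)) for p in LEMMA_PATTERNS]
    features += [bool(re.search(p, pos_sentence)) for p in POS_PATTERNS]
    return features

print(rev("detta vara spam som ...", "DT VB NN ..."))  # [True, True, False]
```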

4.3 Classifiers

A classifier is an algorithm, function or a computer program that is used to assign class labels to examples (Zhang, 2004). By feeding the algorithms with labelled training examples they are able to calculate probabilistic theories and hypotheses from data experience (Russell & Norvig, 2009, p. 802). There are many template-based and machine learning-based variants but only two classifiers were used in this study. They are explained here.

4.3.1 Naive Bayes Classifier (NB)

Naive Bayes is a classifier based on Bayes’ theorem:

$$P(a \mid b) = \frac{P(b \mid a)P(a)}{P(b)}$$

Looking at the probability P of event a given event b is a powerful tool when determining causes and effects. We could ask ourselves questions that can be answered with a probability. Say that we want to know whether an e-mail m we received is spam s or not. In this particular example there is a 10% risk of getting spam and a 40% chance of receiving an e-mail. Based on this information we have prior probabilities for both classes:

$$P(s) = 0.1, \quad P(m) = 0.4$$

We also know that 90% of all spam is sent by e-mail, giving us the conditional probability

$$P(m \mid s) = 0.9$$

Based on these probabilities we now want to know the likelihood of this e-mail received being spam or not.

$$P(s \mid m) = \frac{P(m \mid s)P(s)}{P(m)} = \frac{0.9 \times 0.1}{0.4} = 0.225$$

Based on this information, the risk of the e-mail being spam is no greater than 22.5%. Feeling content, we open the e-mail just to be confronted by the less wanted but ever suspected

FOR YOUR SAFETY UPDATE YOUR CREDIT CARD INFORMATION

By just looking at the chance of a general e-mail being spam, many factors are potentially missed. Instead of calculating the probability of an e-mail being spam, we may calculate the probability of this particular e-mail being spam based on its contents. This is where feature vectors like BoWs, N-grams and REVs enter the picture. The e-mail is quantified into a set of content features $C = \{x_1, \ldots, x_n\}$ used to predict whether it is spam or not:

$$P(s \mid C) = \frac{P(C \mid s)P(s)}{P(C)}$$


All features are used, but in Naive Bayes classifiers there is an assumption of attribute independence, that is the assumption that no attribute causes changes in other sampled attributes. This assumption is violated in many settings, but the NB classifier is still considered particularly effective. The Naive Bayes classifier function, adapted from Zhang (2004, p. 1), is:

$$f_{nb}(C) = \frac{P(s = \text{True})}{P(s = \text{False})} \prod_{i=1}^{n} \frac{P(x_i \mid s = \text{True})}{P(x_i \mid s = \text{False})}$$

The product of every feature’s probability given that it is part of the class is compared with the product of its probability given that it is not. The most likely answer is the prediction. In the context of definition extraction, the target class is definitions. Del Gaudio & Branco write:

Naïve Bayes is a simple probabilistic classifier that is very popular in natural language application. In spite of its simplicity, it permit [sic] to obtain results similar to the results obtained with more complex algorithms. (Del Gaudio & Branco, 2009, p. 33)

The NB classifier used in the two experiments of this study is multinomial, counting not only occurrences but also the frequency of each occurrence. In a BoW setting this means that the classifier takes into account not only that a word is present in the text but also how many times it occurs. The NB classifier also utilises so-called Laplace smoothing, as explained by Jurafsky & Martin (2009, pp. 100–103), a simple smoothing technique that adds 1 to the frequency of all words or N-grams, taking into account the possibility of words in the test set that do not occur in the training set.
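As an illustration of the classifier described above, a hand-rolled sketch of a multinomial Naive Bayes with Laplace (add-one) smoothing, applying the same decision rule as the f_nb ratio above but computed in log space. The class name, training interface and the lemmatised toy sentences are hypothetical; a library implementation could of course be used instead.

```python
from collections import Counter, defaultdict
import math

class NaiveBayes:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, documents, labels):
        # documents: list of frequency dictionaries (e.g. BoW, N-gram or REV counts).
        self.priors = Counter(labels)
        self.feature_counts = defaultdict(Counter)   # class -> feature -> count
        self.vocabulary = set()
        for doc, label in zip(documents, labels):
            for feature, count in doc.items():
                self.feature_counts[label][feature] += count
                self.vocabulary.add(feature)
        self.totals = {c: sum(fc.values()) for c, fc in self.feature_counts.items()}
        return self

    def _log_score(self, doc, c):
        # log P(c) + sum_i count_i * log P(x_i | c), with add-one smoothing.
        v = len(self.vocabulary)
        score = math.log(self.priors[c] / sum(self.priors.values()))
        for feature, count in doc.items():
            p = (self.feature_counts[c][feature] + 1) / (self.totals[c] + v)
            score += count * math.log(p)
        return score

    def predict(self, doc):
        # Same decision as comparing the f_nb products, done in log space.
        return max(self.priors, key=lambda c: self._log_score(doc, c))

# Hypothetical toy usage with lemmatised Swedish sentences (1 = definition, 0 = non-definition).
train_docs = [Counter("en term vara en enhet som".split()),
              Counter("tryck på knappen".split())]
nb = NaiveBayes().fit(train_docs, [1, 0])
print(nb.predict(Counter("en enhet vara ett ting som".split())))  # -> 1
```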

4.3.2 Regular Expressions As Classifiers (REC)

As classifiers, regular expressions are rather simple. Either the input is a 100 percent match or a non-match. One has to be familiar with the data source to successfully use regular expressions this way, writing one expression for each type of match wanted. The output of the classifier depends on the quality of the expression alone. A general expression will probably capture most of the intended sentences at the cost of tons of noise while a specialised expression captures just a few of the intended sentences with no noise.

The patterns used are those found in Tables 3 and 4. They are referred to using the item numbers in this list:

1. "\bvara( \w+){1,15} som\b"
2. "\b(UO|PM|NN) VB( DT)?( JJ)? NN\b"
3. "\bsom\b"
4. "\b(UO|PM|NN) VB DT( JJ)? NN\b"
5. "\bvara\b"
6. "\b(om|när|ifall|vid)\b"
7. "\bPAD ((DT )?(JJ )?(NN|UO|PM) )+PAD\b"
8. "\b(bestå( \w+){0,3} av|(uppfatta|benämna)( \w+){0,3} som|referera( \w+){0,3} till)\b"
9. "\b(s(å)?( )?k(alla)?\b"
10. "\b(tex|till exempel|exempelvis|t ex)\b"
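Used directly as a classifier, a pattern reduces to a single match test per sentence, as sketched below for pattern 3 in the list above; the function name and the example sentences are hypothetical.

```python
import re

def rec_classify(lemma_sentence: str, pattern: str) -> bool:
    """Classify a sentence as a definition iff the pattern matches."""
    return re.search(pattern, lemma_sentence) is not None

pattern_3 = r"\bsom\b"  # pattern 3 in the list above
print(rec_classify("en term vara en enhet som beskriva ett begrepp", pattern_3))  # True
print(rec_classify("tryck på knappen", pattern_3))                                # False
```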


5 Experiments

This section presents the statistical results and findings of this study, divided into two experiments. For an explanation of metrics and experiment methodology, refer to the Tools and Methods section. For insights into the data and features used, refer to the Data section.

5.1 Experiment 1: Two Approaches Compared

The first experiment aims to answer the first research question of this thesis, namely:

How well do pattern matching and machine learning approaches perform in the task of definition extraction from Swedish texts?

5.1.1 Evaluation of Approach: Pattern Matching

The first step was to evaluate the performance of a pattern matching approach by using regular expression classifiers. The RECs’ performances were measured in terms of recall, precision and F1. The results are shown in Figure 8. The best RECs, 1 and 2, reached F1 scores of .65 and the worst REC scored no more than F1 = 0.

Figure 8. Definition extraction performance of regular expression classifiers on test set.

5.1.2 Evaluation of Approach: Machine Learning

The NB classifier was run on lemma N-grams and POS N-grams, both ranging from 1 to 4. The precision of the NB classifier seemed to scale with increased N-gram size, as seen in Figure 9, while the recall score remained on a rather constant level.

To see if there were any significant performance differences between the NB using lemma N-grams and the NB using POS N-grams, an independent t-test was conducted with data type as the grouping variable.
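The independent t-tests reported in this section can be reproduced with SciPy's ttest_ind; the recall lists below are hypothetical placeholders rather than the study's actual per-model scores.

```python
from scipy import stats

# Hypothetical recall scores for the two data types, one value per N-gram size (N = 1..4).
recall_pos = [0.30, 0.32, 0.34, 0.36]     # POS N-grams
recall_lemma = [0.05, 0.09, 0.12, 0.14]   # lemma N-grams

t, p = stats.ttest_ind(recall_pos, recall_lemma)   # equal variances assumed
print(t, p)
```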


Figure 9. Performance of NB classifier based on N-gram size, mean score of POS and lemma data.

The test indicated that the recall score of POS N-grams (M=.33, SE=.02) exceeded that of the lemma N-grams (M=.1, SE=.03) at a significant level, t(6)=−6.462, p=.001.

The precision score of POS N-grams (M=.47, SE=.05) was lower than the lemma N-grams precision score (M=.71, SE=.1), though not at a significant level, t(6)=2.064, p>.05.

Figures 10 and 11 show the performance of NB classifiers given POS and lemma N-grams respectively.

5.1.3 Comparison

An independent t-test was conducted to see if the two approaches, pattern matching and machine learning, differed in performance scores. Test results indicated that the NB model’s precision score (M=.59, SE=.07) on average was higher than the precision score of the REC model (M=.53, SE=.11), although not at a significant level, t(14.78)=.494, p>.05 (equal variances not assumed).

The NB model’s recall score (M=.22, SE=.13) was on average lower than the recall score of the REC model (M=.37, SE=.32). The difference was not significant, t(16)=−1.262, p>.05. Figure 12 shows the models in comparison.


Figure 10. Performance of NB classifier based on N-gram size, lemma data.


5.2 Experiment 2: An Integrated Approach

The second research question of this thesis reads as follows:

Is it possible to successfully integrate the pattern matching and machine learning approaches and does an integrated approach scale well with data quantity?

5.2.1 Evaluation & Comparison

The very implementation of REVs shows that an integration is possible, but a successful integration requires an increase in performance. Four one-sample t-tests were conducted to test the hypothesis that there is a difference in precision and recall between the integrated approach and the two other approaches.

Figure 13. The three models compared, showing similar precision scores. The REV-based model shows better recall scores.

The precision P=.57 of the REV-based NB was lower than the average precision of the N-grams-based NB (M=.59, SE=.07) and just slightly higher than the REC precision (M=.53, SE=.11), though not at a significant level (both ps>.05).

However, the recall R=.77 of the REV-based NB was significantly higher than the average recall of the N-grams-based NB (M=.22, SE=.05), t(7)=−12.2, p<.001. Its recall was also significantly higher than the average recall of the REC (M=.37, SE=.1), t(9)=−4.03, p<.01.

Additional t-tests were conducted in order to see how the insignificant differences in precision and the significant differences in recall together affected the overall performance of the REV-based NB. Its F1 = .65, compared with that of the N-grams-based NB (M=.28, SE=.05), was significantly higher, t(7)=−7.97, p<.001. Compared with the F1 of the REC (M=.34, SE=.08), it was also significantly higher.

Figure 13 shows the performance scores of the integrated approach (REV) in comparison with the two other approaches.

5.2.2 Scalability

Six correlation tests using Pearson’s r were conducted in an attempt to answer the question of how well the integrated approach scales with data quantity. The development set was used in the data sampling as a means to increase the testable training data size. Dependent variables were the recall, precision and F1 scores. Training set size was the independent variable, ranging from 298 (using the original training set exclusively) to 1148 (training set + development set). Tests were conducted at each increase of 50. A POS 4-gram-based NB classifier was used as a baseline.
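The correlation tests can be reproduced with SciPy's pearsonr, as sketched below; the score list is a hypothetical placeholder rather than the study's actual measurements.

```python
from scipy import stats

# Training set sizes from 298 to 1148 in steps of 50, as described above (18 points).
sizes = list(range(298, 1149, 50))
# Hypothetical recall scores of the REV-based classifier at each size.
recall_scores = [0.77, 0.76, 0.77, 0.75, 0.74, 0.75, 0.73, 0.72, 0.70,
                 0.68, 0.66, 0.63, 0.60, 0.58, 0.55, 0.52, 0.50, 0.48]

r, p = stats.pearsonr(sizes, recall_scores)
print(r, p)
```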

The REV-based classifier’s performance was significantly related to the size of the training set; recall at r=−.51, precision at r=−.55 and F1 at r=−.49 (all ps<.05).

Figure 14. Correlation between training set size and recall score of a NB classifier run on either sentences quantified into POS 4-grams (light green) or into REVs (dark green).

The baseline’s performance was significantly related to the size of the training set; recall at r=.87, precision at r=.66 and F1 at r=.91 (all ps<.01).

The training set size affects the baseline positively (r>0) while it affects the REV-based classifier’s performance negatively (r<0). Figures 14, 15 and 16 visualise the correlations for recall, precision and F1 scores respectively. The integrated approach does not scale well with data quantity.


Figure 15. Correlation between training set size and precision score of a NB classifier run on either sentences quantified into POS 4-grams (light green) or into REVs (dark green).

Figure 16. Correlation between training set size and F1 score of a NB classifier run on either sentences quantified into POS 4-grams (light green) or into REVs (dark green).

6 Analysis & Discussion

This section contains analyses of how to interpret the results and how they answer the research questions of this study. Included are also comments on applications and ideas for future research as well as some reflections about evaluation.

6.1 Methodology

Del Gaudio & Branco (2007, p. 667) problematise the evaluation methodology of definition extraction systems since the definition of definitions is vague. One person may claim that a sentence includes a definition while another person claims that it does not. This is why two experts were asked to annotate the training and test sets. The quality of the tags is validated by a Kappa score of 1 (total agreement).

Performance was measured as the F1 score in the task of classifying sentences as either definitions or non-definitions. I am fully aware that there are other ways to evaluate performance. One alternative to the evaluation method used would have been to measure performance at different levels of a sorted list, as is briefly explained by Russell & Norvig (2009, p. 869). Instead of classifying documents as relevant or irrelevant (which in this setting has meant definitions or non-definitions), output would be given as a ranked list, with higher ranking for the documents that are more probable to be relevant. The performance would then be measured as the F1 score at different levels of the list, starting for example with the top 10, further evaluating the top 100, and then the top 1,000 documents. If the algorithm effectively ranks relevant documents as relevant, the score will diminish as the level size increases.

The data set sizes of the first experiment could be considered small. The small test set size mainly affects the RECs, since some regular expressions are designed to catch rare but accurate patterns (for example patterns 8 and 9 in Figure 8). The generic relation type is by far the more common one in natural language definitional contexts. The regular expressions written for the extraction of partitive and associative relations might lack valid matches because of this skewness. This is one reason why no more regular expressions than those included were used. Luckily, the RECs do not make use of probabilistic hypothesis testing. They are therefore not affected at all by the training set.

6.2 Experiment 1: Definition Extraction Models Comparison

The first experiment explored the two most common approaches in industry and academy information extraction settings, that is pattern matching and machine learning. The findings indicate only small differences between the approaches, and these differences were not statistically significant. The performance of certain regular expressions was better than that of others, further emphasising the importance of well-written patterns when approaching definition extraction with pattern matching. Well-written regular expressions will perform very well while the opposite will yield unpromising results.

Significant differences were found within the N-gram NBs depending on which data type the N-grams were built upon: lemmas or POS tags. This could have to do with the limited training set size. NB is known to scale well with training set size (Russell & Norvig, 2009, p. 804), but performance remains poor when there is only that much data on which to base the class probabilities. The POS tags, however, help the classifier by setting an upper limit on the number of possible token combinations: each tag combination shares one feature in the N-gram vector, and there are far fewer possible POS tags than possible words.
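As a minimal sketch of this kind of quantification, the snippet below turns sentences that have already been converted into strings of SUC-style POS tags into POS 4-gram count vectors and trains a Multinomial Naive Bayes classifier on them. The tag sequences and labels are invented toy data and do not reproduce the pipeline of this study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: each sentence has already been converted to a string of
# SUC-style POS tags; 1 = definition, 0 = non-definition.
pos_tagged_sentences = [
    "NN VB DT NN PP DT NN",
    "PN VB PP DT NN AB",
    "NN VB DT JJ NN PP NN",
    "VB DT NN PP DT NN AB",
]
labels = [1, 0, 1, 0]

# Quantify each tag sequence into counts of POS 4-grams.
vectorizer = CountVectorizer(analyzer="word", ngram_range=(4, 4), lowercase=False)
X = vectorizer.fit_transform(pos_tagged_sentences)

classifier = MultinomialNB().fit(X, labels)

new_sentence = ["NN VB DT NN PP DT NN"]
print(classifier.predict(vectorizer.transform(new_sentence)))  # -> [1]
```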

When compared, the two approaches turn out to be rather similar; no significant differences were found. There would of course be differences between the worst and the best of the two, but all in all the differences were not large. The F1 score of the best regular expression was .65 while the highest-scoring N-gram-based NB scored F1 = .46. This could be an argument for the industry to


Which approach is the best one? There were no significant differences, making it harder to answer this question. The performance of the machine learning approach scales well with data quantity, while well-written patterns perform exceptionally well right from the start. If data quantity is limited or considered a problem, I would recommend the pattern matching approach which, at least in this experiment, got higher F1 scores. However, if we have an almost unlimited resource of


Figure 17. The performance of a Naive Bayes classifier using the regular expressions vector over several iterations with increasing training set size.

6.3 Experiment 2: Performance & Training Set Size

The second experiment sought a way to bridge the gap between the industry and the academy approach using an NB classifier on a feature vector where each feature was the Boolean output of a regular expression. The findings indicate that the recall score of this integrated approach is significantly higher than that of the two other approaches, at no significant cost of precision. This increase in recall was enough to raise the overall performance, measured in F1, to a level significantly higher than the F1 of the other approaches. These are promising results for an integrated approach.
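A minimal sketch of the REV idea is given below. The three Swedish patterns are invented placeholders (not the ten patterns actually used in this study): each feature is the Boolean output of one regular expression applied to the sentence, and the resulting vectors are fed to a Bernoulli Naive Bayes classifier.

```python
import re
from sklearn.naive_bayes import BernoulliNB

# Hypothetical definition patterns for Swedish; NOT the actual patterns of this study.
patterns = [
    re.compile(r"\bär en\b"),           # "... is a ..."
    re.compile(r"\bdefinieras som\b"),  # "... is defined as ..."
    re.compile(r"\bkallas\b"),          # "... is called ..."
]

def to_rev(sentence):
    """Quantify a sentence into a regular expressions vector (REV):
    one Boolean feature per pattern."""
    return [int(bool(p.search(sentence))) for p in patterns]

# Invented toy sentences; 1 = definition, 0 = non-definition.
sentences = [
    "En växelriktare är en enhet som omvandlar likström till växelström.",
    "Momentet definieras som kraft gånger hävarm.",
    "Skruva fast locket med de fyra skruvarna.",
    "Tryck på startknappen för att starta maskinen.",
]
labels = [1, 1, 0, 0]

X = [to_rev(s) for s in sentences]
classifier = BernoulliNB().fit(X, labels)

candidate = "Ett gångjärn är en rörlig led mellan två delar."
print(to_rev(candidate))                        # -> [1, 0, 0]
print(classifier.predict([to_rev(candidate)]))  # -> [1]
```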

The performance of the REV-based NB classifier did, however, not scale well with training size. Negative correlations were discovered in all metrics, as opposed to the all-positive correlations of the baseline using the same classifier but with POS 4-grams as data. A possible explanation lies in the varying quality of the annotations used as a gold standard in this part of the experiment. Annotation quality was thoroughly controlled in the first experiment and in the first part of the second experiment, but not in its second part. The training set was increased with fractions from the development set, mainly to give an indication of what behaviours would appear with greater training set sizes. It can by no means be taken for granted that the annotation quality of the development set equals the verified quality of the training and test sets. The integrated approach may be sensitive to this while the baseline is not.

Figure 17 was created to explore this further, showing a rather constant performance until the training set size reaches 800. This decrease in performance could be caused by under-representation of expert tags. Remember that the training set annotated by experts consisted of about 300 sentences. At 600 sentences the distribution of expert-tagged and novice-tagged sentences is balanced, and the percentage of expert-tagged sentences shrinks further with each step; when the training set size reaches 800, only 37.5% of the sentences were tagged by an expert. It seems that expertise in judging whether a sentence contains a definitional context is of great importance in an integrated approach. This may serve as an indication of the importance of expert knowledge when creating an annotated text corpus.


It is possible to integrate the two approaches. One way of doing this is to quantify sentences into REVs, an approach that within the scope of this study was shown to outperform the other approaches. The negative correlation between training size and the performance of the REV-based NB is striking, however. More research is needed before conclusions can be drawn about the usefulness of the integrated approach in different settings.

6.4 Future Studies

One possible research route could be to explore the performance of REVs with different classifiers. Tests with Naive Bayes alone may be deemed insufficient, since many other machine learning algorithms have proven effective in information extraction and, unlike Naive Bayes, can capture dependencies between features. How would a support vector machine or a linear regression classifier perform when given this kind of input?
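A sketch of such a comparison is given below, using scikit-learn and invented REV data; logistic regression stands in here as a common linear classifier for this binary task.

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Invented REVs (Boolean regex outputs) and labels; not the data of this study.
X = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0],
     [0, 0, 1], [1, 0, 1], [0, 0, 0], [0, 1, 1]]
y = [1, 1, 1, 0, 0, 1, 0, 1]

# Compare classifiers on the same REV input using cross-validated F1 scores.
for name, model in [("Bernoulli NB", BernoulliNB()),
                    ("Linear SVM", LinearSVC()),
                    ("Logistic regression", LogisticRegression())]:
    scores = cross_val_score(model, X, y, cv=2, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```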

The data types used in N-grams are usually either word form, lemma or POS. Some words are common while others are rare. We would hypothetically get more information by quantifying some words into their POS tags while leaving others at the word form or lemma level. Words that could be interesting to “mix out” are nouns, adjectives and verbs. "The early bird catches the worm" would in such a mixed 2-gram representation look like {the adj:1, adj noun:1, noun verb:1, verb the:1, the noun:1}. This could possibly be a better way of combining lemma and POS data than including both in a twice as large N-gram matrix.
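The sketch below illustrates this mixed representation for the example sentence. The part-of-speech tags are hard-coded here for clarity; a real pipeline would obtain them from a POS tagger.

```python
from collections import Counter

# Hard-coded (token, POS) pairs for the example sentence; a real pipeline would use a POS tagger.
tagged = [("the", "det"), ("early", "adj"), ("bird", "noun"),
          ("catches", "verb"), ("the", "det"), ("worm", "noun")]

CONTENT_TAGS = {"noun", "adj", "verb"}

# Replace content words by their POS tag, keep function words as word forms.
mixed = [pos if pos in CONTENT_TAGS else word for word, pos in tagged]

# Count the mixed 2-grams.
bigrams = Counter(zip(mixed, mixed[1:]))
print(dict(bigrams))
# {('the', 'adj'): 1, ('adj', 'noun'): 1, ('noun', 'verb'): 1,
#  ('verb', 'the'): 1, ('the', 'noun'): 1}
```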

A more thorough analysis of how the training set affects the different models would be highly valuable. What is more important, and in which settings: data quality or data quantity? Also, what if the class distribution were balanced? One regular expression did not match anything because of data sparsity; with more data available it would probably yield different results.

A more extensive comparison of different regular expressions would serve as a perfect complement to this study, analysing which patterns give the best results when extracting definitions from Swedish texts.

Technical documentation is not generalisable to text in general: applying the integrated approach in a blog setting or on e-mail texts would most probably alter its performance. This would be a viable research course.

Muresan & Klavans (2013, p. 736) qualitatively evaluated their definition extraction system, as a complement to the otherwise exclusively quantitative evaluation used when measuring information extraction systems. This was done in terms of readability and completeness, of which at least completeness would have been worthwhile to measure in order to understand the difficulty of the definition extraction task. It could indicate the level of vagueness of naturally occurring definitional contexts, which in turn could be a linguistic research topic in its own right.

6.4.1 Implications & Applications

What do the results of this study imply? First of all, they tell us that definition extraction from Swedish technical documentation is fully possible using either method: pattern matching, machine learning, or the two integrated the way they have been in this study. Linguistic problems such as those in terminology management and definition extraction can be at least partially solved using computational methods. This is no revolutionary finding but rather an argument for business and academy to integrate standard practice and explore unexplored ground. If the integrated approach I suggest is successful in both academy and business, we could with some parameter tuning be closer to an effective definition extractor, not just specialised in Swedish technical documentation but applicable to any language and any text type. QA systems could then be tuned to understand the concept of definitions (what is considered a definition) in order to give more accurate answers to questions that involve concepts. The question is: how large is the proportion of questions that involve concepts, as opposed to those that do not?


Worth mentioning is that the feature vector method has been applied in an industry setting, in an attempt by the author of this thesis to apply the integrated approach in practice. Instead of REVs consisting of 10 features, the Automatic Term Definition Extractor (ATDEX) quantifies each sentence into a vector of 40 features. Based on the ensemble decision of a set of classifiers, each sentence is assigned a probability of containing a definitional context.
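The actual features and classifiers of ATDEX are not detailed here. Purely as an illustration of the ensemble idea, the sketch below averages the predicted probabilities of a few standard classifiers over invented Boolean feature vectors.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

# Invented 5-feature Boolean vectors standing in for ATDEX's 40 features.
X = [[1, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0],
     [1, 1, 0, 0, 1], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
y = [1, 1, 0, 1, 0, 0]  # 1 = definition, 0 = non-definition

ensemble = VotingClassifier(
    estimators=[("nb", BernoulliNB()),
                ("lr", LogisticRegression()),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X, y)

candidate = [[1, 0, 0, 1, 0]]
print(ensemble.predict_proba(candidate))  # -> [[P(non-definition), P(definition)]]
```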

ATDEX has two modes: one in which the user looks for definitional contexts connected to a certain term, and one in which no term input is given. The first mode is intended for terminologists and linguists building terminologies and ontologies, helping them find descriptions of concepts which in turn can be built upon to define certain hard-to-define terms. It could also be used when a change in a current terminology is applied, for example when a term goes from being preferred to being deprecated: by searching for definitional contexts of the once preferred term, the terminologist can easily find examples of how to use it and how not to. The second mode is mainly intended for computational linguists building a terminology from nothing but a raw text corpus. A few well-defined terms will probably be found with a dry search for definitions, shaping a solid terminological foundation.

Another possible application is in real-time content mining. A company’s crawler could be used to find definitional contexts in newly written documents that differ from or violate definitions of terms used by said company.

Then again, what makes this thesis one of cognitive science and not one of data science? In such an interdisciplinary field, is there ever something that could be solely cognitive science and not any of its parts? Is it possible to claim this work as a product of cognitive science when it excludes psychology, neuroscience and anthropology? My answer lies in the underlying philosophy, in the theoretical and practical framework that produces an AI solution to a linguistic problem. The philosophy of concept systems has affected how I look at naturally produced definitions, while data science has provided the tools to model and test solutions that otherwise (calculated by hand) would have taken years. The integration of these disciplines is in itself an argument for the uniqueness of this work. This thesis does not only integrate the business and academy information extraction approaches; it also serves as another successfully applied cognitive science study exploring a terminology management problem.


7 Conclusion

This thesis has aimed to explore the task of concept extraction, more specifically definition extraction from Swedish technical documentation. Regular expressions as classifiers and N-gram-based Naive Bayes classifiers were evaluated and compared, showing no between-group differences; they were thus deemed equal in performance, answering the first research question of this thesis. In addition, a NB classifier using what this study calls regular expressions vectors was compared with the other classifiers, indicating a significant increase in recall with no significant change in precision compared to the other methods. This tells us that it is possible to successfully integrate pattern matching and machine learning approaches in a terminology setting, answering the second research question.

The integrated approach was explored further, revealing significant correlations between training size and all aspects of performance. Increasing the training size raises the performance of the baseline but lowers the performance of the integrated approach. One reason could be that the approaches differ in their resilience to variations in data quality, which challenges the integrated approach and suggests a weak resilience to data of lesser quality.

Concluding this thesis, there are still many questions left unanswered. The results herein are to be treated as one of many steps in a hopefully enlightening direction towards successful definition extraction in Swedish. In a broader perspective, this thesis can be used as an argument for putting more resources into bridging the gap between industry and academy approaches.

References
