
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer Science

Master's thesis, 30 ECTS | Cognitive Science (Kognitionsvetenskap)

2017 | LIU-IDA/KOGVET-A--17/002–SE

Thoughts Don’t Have Colour,

do they?

Finding Semantic Categories of Nouns and Adjectives in

Text Through Automatic Language Processing

Generering av semantiska kategorier av substantiv och adjektiv genom automatisk textbearbetning

Per Fallgren

Supervisor: Lars Ahrenberg
Examiner: Arne Jönsson

(2)

Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


(3)

Abstract

Not all combinations of nouns and adjectives are possible and some are clearly more frequent than others. With this in mind, this study aims to construct semantic representations of the two parts-of-speech based on how they occur with each other. By investigating these ideas via automatic natural language processing paradigms, the study aims to find evidence for a semantic mutuality between nouns and adjectives; this notion suggests that the semantics of a noun can be captured by its corresponding adjectives, and vice versa. Furthermore, a set of proposed categories of adjectives and nouns, based on the ideas of Gärdenfors (2014), is presented that hypothetically falls in line with the produced representations. Four evaluation methods were used to analyze the results, ranging from subjective discussion of nearest neighbours in vector space to accuracy generated from manual annotation. The results provided some evidence for the hypothesis, which suggests that further research is of value.

(4)

Acknowledgments

I sincerely thank my main supervisor Lars Ahrenberg for showing an interest in the study and for providing me with valuable feedback on every aspect of the process. I would also like to express my thanks to Peter Gärdenfors for helping out with the categories, annotation and analysis, and of course for formulating the ideas on which the study is built. I also want to thank Carita Paradis for providing me with relevant literature and well-needed constructive criticism, and Sverker Sikström for giving me feedback on the method.

(5)

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Aim
  1.2 Research Questions
  1.3 Delimitations
  1.4 Proposed Categories
2 Background
  2.1 Concepts and Categories
  2.2 Distributional Semantics
  2.3 Natural Language Processing Paradigms and Algorithms
3 Method
  3.1 Data
  3.2 Experiments
  3.3 Methods for Evaluation
4 Results
  4.1 Subjective Analysis of Nearest Neighbours
  4.2 Subjective Analysis of Clusters
  4.3 Categorical Analysis on Nearest Neighbours of Prototypical Words
  4.4 Categorical Analysis of Words that Co-occur with Prototypical Words
5 Discussion
  5.1 Results
  5.2 Method
  5.3 The Work in a Wider Context
6 Conclusion


List of Figures

2.1 Visualisation of the top three levels of the categorisation of nouns in WordNet.
2.2 Sentence Represented with Constituency Structure.
2.3 Sentence Represented with Dependency Grammar.
3.1 ANP-extraction example.
3.2 Visualization of the different vector sets on a 2D-plane. It is evident that it is a…


List of Tables

1.1 Table of categories.
2.1 Examples of words within each noun group in WordNet.
2.2 Visualisation of how a co-occurrence matrix (HAL) could look. The rows and columns w correspond to each unique word.
3.1 Detailed specifications of dataset.
3.2 Clarification on how the vector sets were analyzed. An "X" states that the set was analyzed by the particular method, and a "-" states that it was not.
3.3 Adjective and Noun Prototypes.
3.4 An excerpt of the training data used in the study; the first column corresponds to a target noun or adjective, and the column to the right of it contains co-occurring words.
4.1 Nearest neighbours of adjectives.
4.2 Nearest neighbours of nouns.
4.3 Adjective clusters.
4.4 Noun clusters.
4.5 Categorical nearest neighbour analysis of adjectives. CB-vectors.
4.6 Categorical nearest neighbour analysis of adjectives. Word2vec.
4.7 Categorical nearest neighbour analysis of nouns. CB-vectors.
4.8 Categorical nearest neighbour analysis of nouns. Word2vec.
4.9 Distribution of categories of the occurring nouns generated from prototypical adjectives. Columns add up to 100.
4.10 Distribution of categories of the occurring adjectives generated from prototypical nouns. Rows add up to 100.
6.1 Annotation of Adjectives.


1 Introduction

Thoughts don’t have colour. From a philosophical standpoint one could argue that this is not the case, and let us refrain from bringing synesthesia into the discussion. The point, however, is that abstract nouns such as thoughts, moments and troubles are seldom associated with colours and shapes. Is this a coincidence, or is there more to it? Can a noun be defined by its corresponding adjectives? Not all combinations of nouns and adjectives are possible, and some combinations are clearly more frequent than others. In this thesis these ideas are investigated with the help of automatic language processing. The assumed dependency between nouns and adjectives will be called semantic mutuality. Furthermore, the notion of distributional semantics is adopted, an approach that derives meaning from how lexical items occur in language.

The central thesis that serves as motivation for the study is that the meaning of an adjective can be expressed as a region of a single domain (Gärdenfors 2014). A domain is a dimension or a set of dimensions that represent particular features of objects. Common examples of domains for objects are location, shape, size, colour, weight, material and temperature. This means that for a noun that represents a class of objects in a category, certain adjective domains will be relevant for the meaning of the noun and others will not. Based on this idea, Gärdenfors (personal communication) has presented a first attempt to provide a semantic foundation for noun classes. From this one can also extract a set of adjective classes. The idea is summarized in Table 1.1.

Ideas relating to the experiments of this paper have been investigated previously, such as the study by Hatzivassiloglou and McKeown (1993), which focuses on automatically clustering adjectives according to their occurrence with nouns. Furthermore, with the aim of providing a framework of lexical meaning, Paradis (2005) discusses different categories of nouns and adjectives based on similar notions. Additionally, Paradis et al. (2015) investigate how antonyms occur in discourse to contribute to the ongoing research on how antonyms are represented cognitively. There are, in other words, several published articles on related work; the main difference in this study (apart from the novel set of proposed categories) is, however, that we adopt a standpoint of distributional semantics in combination with an attempt to scale up the process with natural language processing tools.

In this chapter the aim, research questions, and delimitations of the study are presented. Additionally, the proposed categories are described in the last section.

1.1 Aim

The purpose of the study is to further explore latent semantic aspects of language, specifically focused on the distributional aspects of nouns and adjectives, in order to increase our understanding of the relations between them.

1.2 Research Questions

1. Is there a semantic mutuality between classes of nouns and adjectives that can be captured by natural language processing tools?

2. Can this semantic mutuality provide empirical ground for a set of manually constructed semantic categories of nouns and adjectives? (See Table 1.1)


1.3 Delimitations

Although the proposed categories have a basis in human cognition, the study will be treated as a natural language processing problem, only briefly discussing how the notions are grounded in thought. Furthermore, the conducted experiments are exclusively in English; the procedure is, however, easily applicable to other languages. Finally, although different kinds of corpora could produce different results, as discussed in Section 5, only one corpus was used in this study.

1.4 Proposed Categories

In Table 1.1, seen below, the left section contains a proposal for a classification of some of the most frequent forms of nouns (one could define additional classes). It should be noted that some words are ambiguous in the sense that they fit in more than one class. PLACE is for nouns such as beach, forest, Stockholm, etc. CONCRETE is for material objects such as table, tree, house, etc. MASS is for mass nouns such as water, gold, porridge, etc. ABSTRACT is for nouns that have no location in physical space, e.g. moment, thought. AGENT is for humans, animals, machines, etc., that are capable of performing actions.

The columns present a number of domains, that correspond to adjectives. As with nouns, some adjectives may fit in more than one class. Also, metaphorical uses exist for each class. The leftmost is SPATIAL, corresponding to spatial adjectives, e.g. tall, short. The second column, DESCRIPTIVE, contains domains that relate to physical objects such as weight (heavy), age (old) and colour (green). MATERIAL can be viewed as a subgroup of DESCRIPTIVE, containing words such as shiny and matte. The FORCEFUL domain relates to the semantics of actions, e.g. strong, gentle (see chapter 8 in (Gärdenfors 2014)). The EMOTIONAL domain is typically used about humans and animals, e.g. happy, depressed. These are the domains this study is restricted to; as with nouns, however, one could include further domains. In addition, most domains are divisible into narrower domains.

In the table, the rows indicate which domains combine naturally with a particular class of nouns and which combinations are impossible. '++' denotes that a certain domain is characteristic for a certain noun class. For example, MATERIAL adjectives are typical for MASS nouns (shiny is a typical attribute for gold). '+' indicates that a certain domain is included in the meaning of the relevant noun class. '?' indicates that the domain may occur although it is not necessary. For instance a CONCRETE noun such as steel beam can be referred to as strong, i.e. a FORCEFUL adjective, while another CONCRETE noun such as cookie cannot. Finally, but most importantly, '-' denotes that a particular domain cannot be part of the meaning of a noun from the class marked in a row, unless the domain is used in a metaphorical way. For example, entities expressed by MASS nouns have no shape and entities expressed by CONCRETE nouns have no emotion. The table, which is a first approximation, can now be used to make predictions concerning the adjective-noun combinations that are studied. The most important prediction is that the frequency of combinations marked by a '-' should be small, and that the instances that do occur should be metaphorical.


Noun \ Adjective   Spatial        Descriptive   Material       Forceful       E.g.motional      E.g.
Place              ++             ?             ?              -              -              Street, Forest
Concrete           +              ++            ++             ?              -              Hammer, Tree
Mass               +              +             ++             -              -              Water, Wood
Abstract           -              -             -              ?              ?              Speed, Death
Agent              ?              +             +              ++             ?              Woman, Lion
E.g.               Tall, Narrow   Blue, Round   Shiny, Clear   Strong, Weak   Happy, Angry

Table 1.1: Table of categories.
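The predictions encoded in the table can be read mechanically: a '-' cell predicts that a combination should be rare and, when it does occur, metaphorical. A minimal sketch of such a lookup (the dictionary below merely transcribes Table 1.1; the function names are our own, chosen for illustration):

```python
# Compatibility of noun classes (rows) and adjective domains (columns),
# transcribed from Table 1.1: '++' characteristic, '+' included in the
# meaning, '?' possible, '-' impossible unless metaphorical.
DOMAINS = ["spatial", "descriptive", "material", "forceful", "emotional"]

TABLE = {
    "place":    ["++", "?",  "?",  "-",  "-"],
    "concrete": ["+",  "++", "++", "?",  "-"],
    "mass":     ["+",  "+",  "++", "-",  "-"],
    "abstract": ["-",  "-",  "-",  "?",  "?"],
    "agent":    ["?",  "+",  "+",  "++", "?"],
}

def compatibility(noun_class: str, adj_domain: str) -> str:
    """Return the table mark for a noun class / adjective domain pair."""
    return TABLE[noun_class][DOMAINS.index(adj_domain)]

def predicted_literal(noun_class: str, adj_domain: str) -> bool:
    """A '-' cell predicts the combination is not literal (metaphor at best)."""
    return compatibility(noun_class, adj_domain) != "-"
```

For instance, the lookup predicts that a MATERIAL adjective is characteristic for a MASS noun (shiny gold), while a SPATIAL adjective applied to an ABSTRACT noun (a tall thought) is predicted to be non-literal.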


2 Background

This chapter serves as a collection of the relevant background for the study, ranging from semantics in language to algorithms in machine learning. The order is as follows: concepts and categories, distributional semantics, and natural language processing paradigms.

2.1 Concepts and Categories

What are concepts and categories and how do they differ? What are some existing approaches for categorizing nouns and adjectives? These questions, which lay the ground for the experiments of this paper, are discussed in this section.

Murphy’s Take on Concepts

Try to recall the last time you ate an apple. Had you eaten that exact apple any time before? Most likely you would say no; however, you would probably also argue that even though you had never experienced tasting that particular fruit before, you still knew exactly what the sensation of eating it would be like. Why is this? It seems likely that you have a predefined idea of how you usually perceive the notion of eating an apple. In other words, you have a concept made up of a certain combination of colour, smell, feel, taste and even sound. These ideas, specifically how one would go about defining and explaining concepts, are discussed in Murphy (2004). Several approaches are considered, from classical notions based on the work of Aristotle to contemporary ideas. Murphy concludes, however, perhaps unfortunately, by saying that the area is quite messy, without clear definitions, and in need of future research. In fact, in one of the last lines of the final section of the book, labeled A Note to Students, it is stated: We will need to draw on anthropology, linguistics, and computer modeling in order to understand the rich amalgam that makes up our conceptual abilities (Murphy 2004). This can be viewed as motivation for the experiments done in this paper.

Categorizing Concepts

While there certainly are a number of ways to differentiate between categories and concepts, categories are usually defined in a broader sense, as encapsulators or groupings of concepts. In contrast, concepts refer to mentally manifested notions of things (concrete or abstract). The notion of semantic categories, in this sense, is discussed in chapters 6 and 7 of Gärdenfors (2014), specifically focusing on the semantics of nouns and adjectives.

Today, word classes are introduced to children based on semantic aspects such as reflections of perceptual experiences: nouns correspond to things, verbs to actions and so forth. This approach ends quite early, however, with syntax and grammar soon taking over. It seems probable that the field would advance by instead grounding the categorical principles of language in cognitive processes and semantics, which laid the ground for the categorization in the first place. Gärdenfors (2014) tries to bring light to this idea and argues that the semantic notions of language should not depend on grammatical categories.

Categories in English Grammar

Although the approach might not fall in line with the ideas raised in the previous paragraph, it seems appropriate to include what the basic English grammar states regarding groups of


nouns and adjectives, as this is what is taught in school¹. Apart from proper nouns (names) and pronouns (which really are a separate part-of-speech), common nouns represent the major chunk of nouns. There are variations in how subclasses of common nouns are distinguished, but the following are usually included: Abstract (things you cannot see or touch, e.g. joy, friendship), Concrete (things you can perceive, e.g. chair, marble), Collective (words that describe groups, e.g. choir, band), Countable (mile, friend) and Non-countable (food, music). Additional groups are sometimes included, such as Compound nouns (concatenated words, e.g. pickpocket, soda can), Gender-specific (e.g. actress, uncle), Verbal (nouns derived from verbs, e.g. building, runner) and Gerunds (nouns that represent actions, e.g. running, guessing).

There are also some variations regarding the classification of adjectives; however, the following are generally accepted: Descriptive (an adjective that describes a noun, e.g. tired, beautiful), Numbers or Quantity (e.g. twenty-two, more), Demonstrative (an adjective that points out a noun or pronoun, e.g. this, that), Possessive (e.g. my, his) and Interrogative (an adjective that points to a question, e.g. what, which). One can also include Indefinite (an adjective that does not indicate specific items, e.g. many, few) and Articles (e.g. a, the).

Categories in WordNet

WordNet (Miller 1995) is a large lexical database of English where nouns, verbs, adjectives and adverbs are ordered into groups². Specifically, these groups are called synsets, each a collection of synonyms representing a distinct concept. WordNet also provides a short definition of a given word. In essence, the database can be viewed as a combined lexicon and thesaurus, as it provides lexical knowledge of concepts as well as relations between words.

Nouns in WordNet are categorized based on their position in a hierarchical tree (see Figure 2.1) where the root node is labeled ENTITY. The second level of the tree consists of two self-explanatory groups, PHYSICAL and ABSTRACTION. The third level consists of 10 groups: OBJECT, THING, SUBSTANCE, PROCESS, COMMUNICATION, GROUP, PSYCHOLOGICAL, ATTRIBUTE, MEASURE and RELATION. They are presented in Table 2.1 along with examples of words that are included in the groups. The tree consists of a number of additional levels, each with higher specificity, until reaching the leaf nodes, which do not represent further definitions.

ENTITY
├── PHYSICAL: OBJECT, THING, SUBSTANCE, PROCESS
└── ABSTRACTION: COMMUNICATION, GROUP, PSYCHOLOGICAL, ATTRIBUTE, MEASURE, RELATION

Figure 2.1: Visualisation of the top three levels of the categorisation of nouns in WordNet.

For adjectives WordNet has a different hierarchy; they are instead organized in terms of antonymity, such as dark-light and rough-smooth. Further, the words are also linked to semantically similar words, such as dark –> black, dusky. They are also related to nouns, such as dark –> darkness. While it would be possible to cluster the adjectives into categories according to WordNet's structure, no explicit grouping was done in this study.

2.2 Distributional Semantics

Distributional semantics, a notion that is central to this study, is an area of research that has shown promising results in natural language processing, arguably because of the array of unsupervised methods that have gained traction over the last two decades. In this section the field is presented. Additionally there is some mention of alternative approaches such as logical semantics and hybrid approaches. Note that Section 2.3 provides further information on how one can apply these ideas to create word vectors that can be used in many applications.

¹The following sources were used as reference: http://grammar.yourdictionary.com/, http://www.grammar-monster.com/ and http://partofspeech.org/

OBJECT        THING   SUBSTANCE   PROCESS   COMMUNICATION
Refrigerator  Finger  Lemon       Fire      Stoplight
Snake         River   Crystal     Fog       Writing
Wine          Pond    Snow        Cloud     Signal

GROUP       PSYCHOLOGICAL   ATTRIBUTE   MEASURE   RELATION
Gathering   Baseball        Rectangle   Mug       Award
Vegetation  Activity        Shadow      Teacup    Prize
Crowd       Photography     Circle      Ocean     Brand

Table 2.1: Examples of words within each noun group in WordNet.

The Distributional Hypothesis

The notion of distributional semantics is grounded in the so-called distributional hypothesis, which essentially states that the semantics of a word is defined by its distribution in text. The idea is raised in Harris (1954) and formulated by Firth (1957) as "You shall know a word by the company it keeps". For example, consider the word turtle in the following sentence: The green turtle is underwater. What words are interchangeable with the animal? Perhaps fish and frog, possibly algae and rock, but less so computer and physicist, even though they still construct syntactically correct sentences. If one were to come up with 500 additional sentences containing the word turtle, chances are it will be even harder to find interchangeable words that fit every sentence. This can be seen as a convergence, where each sentence brings us closer to the narrow definition of the word turtle. Possibly frog would fit in 400 of the 500 sentences; according to the distributional hypothesis this means that the two words share semantic aspects, which arguably corresponds well to most people's interpretations of the animals. It is also worth mentioning that this is not just an untested idea; there is a large amount of empirical evidence for the distributional hypothesis, as first shown by Rubenstein and Goodenough (1965). Finally, one could exploit these ideas to construct representations of words based on their context, which is exactly what most of the distributional word vector models of today are doing.
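The convergence idea can be illustrated with a toy corpus (the sentences below are invented for illustration): words that occur in many of the same contexts receive a high overlap score, a crude stand-in for distributional similarity.

```python
from collections import defaultdict

def context_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    ctx = defaultdict(set)
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            ctx[w].update(words[lo:i] + words[i + 1:hi])
    return ctx

def jaccard(a, b):
    """Overlap of two context sets, between 0 (disjoint) and 1 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

sentences = [
    "the green turtle is underwater",
    "the green frog is underwater",
    "the frog sits on a rock",
    "the turtle sits on a rock",
    "the physicist writes on a whiteboard",
]
ctx = context_sets(sentences)
# turtle and frog occur in near-identical contexts; physicist does not
sim_turtle_frog = jaccard(ctx["turtle"], ctx["frog"])
sim_turtle_phys = jaccard(ctx["turtle"], ctx["physicist"])
```

In this tiny corpus turtle and frog share all their contexts, while turtle and physicist share only function words, mirroring the 400-of-500-sentences intuition above.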

Syntagmatic and Paradigmatic Relations

The field of distributional semantics has more to it; as stated by Saussure (1916), there exist paradigmatic and syntagmatic relations between words. For instance, two synonyms (or antonyms) have a paradigmatic relation. In other words, the two words are interchangeable in text, much like the previously mentioned turtle and frog. A syntagmatic relation, however, is expressed by topical aspects. For example, the two words competition and winning are not interchangeable in text, but they are topically related, i.e. they have a syntagmatic relation. Further, Sahlgren (2006) showed that one can automatically extract semantic representations of words from corpora following the two types of relations by simply adjusting the number of neighbours of a target word that should be considered. This will be further elaborated on in Section 2.3.

An Alternative Approach: Formal Logical Semantics

Another approach towards unraveling the semantics of language is using logic. Formal, or logical, semantics strives to understand meaning in language based on how spoken expressions relate to the external world. Humans are able to communicate because they have similar inner workings regarding linking lexical terms of language with real-world concepts. Furthermore, speakers share the formulas for calculating the semantics of a syntactically convoluted phrase, which usually leads to a common ground when one is trying to convey meaning. Formal semantics tries to model this phenomenon via mathematical architectures. Consider the following conversation between two students (Aronoff and Rees-Miller 2017):

Student1: Mary got an A on the exam.

Student2: No one got an A on the exam.

For the second student to be able to understand and respond to the first statement, he or she must understand the constituents of the sentence, i.e. map the syntactic terms to the real world. For instance, who is Mary? What does the verb phrase got an A stand for? And what is meant by the exam? Additionally, one would need to merge these constituents to form a complete understanding. Given that the second student has performed these calculations properly, he or she can respond with the second statement, which implies that the first student does not speak truthfully. The process is trivial for an everyday speaker, as one already possesses the ability; modeling it is, however, a challenging endeavour. WordNet (Miller 1995) is one attempt to build a resource that maps the relations between words in a taxonomic manner. One can then let a model use this framework to, for instance, draw conclusions regarding the truthfulness of phrases, such as Dogs are mammals or Dogs are reptiles.
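Such a taxonomic truthfulness check can be sketched with a toy hypernym ("is-a") table. The entries below are invented for illustration and only loosely follow the style of WordNet's noun hierarchy; a real system would read the links from WordNet itself.

```python
# Toy hypernym links: each word points to its (single) parent category.
HYPERNYM = {
    "dog": "canine",
    "canine": "mammal",
    "mammal": "vertebrate",
    "lizard": "reptile",
    "reptile": "vertebrate",
}

def is_a(word: str, category: str) -> bool:
    """Judge statements like 'dogs are mammals' by walking the hypernym chain."""
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word == category:
            return True
    return False
```

Under this table, is_a("dog", "mammal") holds while is_a("dog", "reptile") does not, which is exactly the kind of inference described above.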

Hybrid Approaches

Contrary to distributional methods, which usually can rely on unsupervised models, logical approaches often require manually produced resources. Naturally, however, there are strengths and weaknesses on both sides, as picked up by Lewis and Steedman (2013). They propose a hybrid approach that combines the benefits of logical and distributional semantics. They argue that distributional semantics has developed in isolation from logical semantics, and while distributional methods perform well in capturing the semantics of content words, such as nouns and verbs, they do not capture the latent meaning of function words (e.g. determiners, negations, conjunctions) as well. Conversely, logical approaches struggle to model content words but perform well in capturing the essence of function words. Similar ideas are raised and explored in Boleda and Herbelot (2017) and both studies show promising results. Although the focus in this study is specifically on distributional semantics, these ideas might be something to have in mind in future studies.

2.3 Natural Language Processing Paradigms and Algorithms

In this study several tools of natural language processing are used; this section aims to present the relevant paradigms on a basic but thorough level.

Word Vectors

Creating word embeddings (or word vectors) is the process of assigning a vector in a d-dimensional space to a word and then adjusting the vector such that its geometrical properties correspond well to the word's distributional properties. The field has seen great progress in recent years, which in turn facilitates the long list of different NLP tasks where adequate representations of words are essential. Word vectors are interesting on their own, as they elegantly capture relational aspects of language; one can for instance draw the conclusion that a banana has more in common with an apple than with a t-shirt, simply by comparing the vectors of each word. Similarly, word vectors have been shown to perform well in analogy tasks, such as the famous example by Mikolov et al. (2013b) where they show that V_Queen ≈ V_King − V_Man + V_Woman. Arguably, though, the main purpose of word vectors is to represent words adequately in typical NLP tasks, e.g. POS tagging and dependency parsing.

     w1  w2  w3  w4  w5  ...  wn
w1    0   2   0   0   1  ...   0
w2    2   0   0   1   1  ...   0
w3    0   0   0   0   0  ...   1
w4    0   1   0   0   0  ...   2
w5    1   1   0   0   0  ...   0
...  ..  ..  ..  ..  ..  ...  ..
wn    0   0   1   2   0  ...   0

Table 2.2: Visualisation of how a co-occurrence matrix (HAL) could look. The rows and columns w correspond to each unique word.

There is an array of different approaches to creating word vectors; one of the most well-known methods is arguably Latent Semantic Analysis (LSA), first introduced by Dumais et al. (1988). The word vectors are generated within an occurrence-based word-by-document matrix where each cell corresponds to how many times a given word w occurs in a document d. If done correctly, after training, two similar words such as photosynthesis and chlorophyll will have similar vectors, as they both appear in documents regarding plant biology.
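The LSA idea can be sketched with numpy alone (toy counts and an arbitrary truncation rank, for illustration only): build the word-by-document count matrix, truncate its singular value decomposition, and read off one dense vector per word.

```python
import numpy as np

# Toy word-by-document counts: rows are words, columns are documents.
words = ["photosynthesis", "chlorophyll", "guitar"]
counts = np.array([
    [3, 2, 0, 0],   # photosynthesis: occurs in the plant-biology documents
    [2, 3, 0, 0],   # chlorophyll: occurs in the same documents
    [0, 0, 4, 1],   # guitar: occurs in the music documents
], dtype=float)

# Truncated SVD: keep only the top-k singular directions ("latent" dimensions).
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]   # one k-dimensional vector per word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_bio = cosine(word_vectors[0], word_vectors[1])     # shared documents
sim_cross = cosine(word_vectors[0], word_vectors[2])   # disjoint documents
```

Because photosynthesis and chlorophyll occur in the same documents their latent vectors point in the same direction, while guitar ends up orthogonal to both.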

Hyperspace Analogue to Language (HAL) (Lund and Burgess 1996) is a similar technique that instead generates word vectors within a word by word matrix. A cell is updated when a word with index i occurs within the context of a word with index j. To capture this co-occurrence one needs to define the size of a context window, which essentially tells the system how many neighbours of the target word it should process during each iteration. After the training process a symmetric matrix is generated which can be used to extract word vectors.

LSA and HAL have both been shown to produce valid results; however, neither produces vectors of fixed dimensionality, and the vectors can get very large, which quickly raises time and memory complexity (with dimensionality-reducing techniques they produce decent results, however, as shown in Pennington et al. (2014)). To compensate for this, most modern methods build vectors of a fixed dimensionality; an example of this is the models of word2vec (Mikolov et al. 2013a), which have been shown to produce great results. Word2vec has two types of algorithms, continuous bag-of-words (CBOW) and skip-gram (SG); both make use of a neural network with an objective that tries to predict a word based on its context (CBOW) or predict the context of a target word (SG).
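The CBOW objective can be sketched in a few lines of numpy. This is a deliberately tiny model, not the actual word2vec implementation (which adds negative sampling or hierarchical softmax, subsampling, and large-scale training): average the context vectors, score every vocabulary word with a softmax, and nudge both weight matrices toward the true centre word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "green", "turtle", "is", "underwater"]
idx = {w: i for i, w in enumerate(vocab)}
V, d, lr = len(vocab), 8, 0.5

W_in = rng.normal(scale=0.1, size=(V, d))    # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # output (prediction) embeddings

# A single CBOW training pair: context words -> centre word.
context, target = ["the", "green", "is", "underwater"], "turtle"
ctx_ids, t = [idx[w] for w in context], idx[target]

losses = []
for _ in range(50):
    h = W_in[ctx_ids].mean(axis=0)           # averaged context vector, shape (d,)
    scores = W_out @ h                       # one score per vocabulary word
    p = np.exp(scores - scores.max())
    p /= p.sum()                             # softmax over the vocabulary
    losses.append(float(-np.log(p[t])))      # cross-entropy against the centre word
    g = p.copy()
    g[t] -= 1.0                              # gradient of the loss w.r.t. the scores
    grad_h = W_out.T @ g                     # backpropagated to the averaged context
    W_out -= lr * np.outer(g, h)
    W_in[ctx_ids] -= lr * grad_h / len(ctx_ids)
```

After repeated updates on this single pair the loss shrinks, i.e. the network learns to predict the centre word from its context, which is the mechanism the CBOW objective relies on at corpus scale.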

Co-occurrence Matrices

In many fields of technology, matrices are excellent at representing information. In natural language processing the idea of representing co-occurrence between words is often used. This is for instance central in some word vector methods, such as the previously mentioned LSA and HAL, which represent words based on how they occur in documents or with each other. Table 2.2 is an example of how a co-occurrence matrix could look; this particular matrix is a word-by-word matrix, much like HAL. The value of a cell corresponds to the number of times a word w_i has occurred with a word w_j. One can then view a row vector (or column vector, as this particular matrix is symmetrical) as a representation of a target word. Similar approaches can be used in the experiments of this thesis, as the central idea is built upon the notion of co-occurrence between nouns and adjectives.
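A word-by-word matrix in the spirit of Table 2.2 can be built in a few lines (toy corpus; a plain symmetric window, without HAL's distance weighting):

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Symmetric word-by-word co-occurrence counts within a context window."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for s in sentences:
        words = s.lower().split()
        for i, w in enumerate(words):
            # count each pair (w, left neighbour) once, in both directions
            for j in range(max(0, i - window), i):
                M[idx[w], idx[words[j]]] += 1
                M[idx[words[j]], idx[w]] += 1
    return vocab, M

vocab, M = cooccurrence_matrix([
    "the green turtle is underwater",
    "the green frog is underwater",
])
# each row (or column) of M is a distributional vector for one word
```

The matrix is symmetric by construction, so rows and columns are interchangeable as word representations, just as described for Table 2.2.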

Dependency Parsing

Although the notion of dependency grammar can be traced back many centuries, the syntactic theory provided by Tesnière (1959) is often referred to as the theory from which dependency parsing emerged. Dependency parsing is a form of syntactic analysis that aims to extract the relations that exist between the words in a sentence, usually called dependencies. The standard alternative is to represent a sentence based on its constituent structure; the main difference between the two is that dependency grammar isn't based on phrases. This difference can be seen in Figures 2.2 and 2.3, an example presented by Nivre (2005) where the two approaches are visualized with the sentence Economic news had little effect on financial markets. from the Penn Treebank³ (Marcus et al. 1994). Additionally, dependency parsing labels the relations between words, a piece of information that can be used in many applications.

Figure 2.2: Sentence Represented with Constituency Structure.

Economic/JJ news/NN had/VBD little/JJ effect/NN on/IN financial/JJ markets/NNS
(dependency labels: NMOD, SBJ, OBJ, NMOD, NMOD, PMOD, NMOD)

Figure 2.3: Sentence Represented with Dependency Grammar.

Syntactic analysis has many real-world applications, and it is quite easy to see its value with a simple example. Consider the two sentences She ate the noodles with chopsticks and She ate the noodles with tofu. The two sentences seem identical in terms of structure and parts-of-speech; however, it is quite obvious that she does not enjoy eating chopsticks, or using tofu as cutlery for that matter. In the first example, the latter noun is related to the verb ate, and in the second example tofu is related to the noun noodles. These kinds of latent structures can be captured by dependency parsing, which greatly facilitates many NLP tasks, such as machine translation, named entity recognition and relation extraction, where an understanding of the latent structures in language is necessary. Dependency parsing has also been shown to produce interesting results in the construction of word embeddings (Levy and Goldberg 2014); specifically, by using a target word's dependencies as context, rather than its linear neighbours, the representations seem to emphasize topical aspects of the corpora. For the purposes of this study, dependency parsing was used to extract pairs of adjectives and nouns.


Dimensionality Reducing Algorithms

When working with matrices one often encounters issues regarding memory or time complexity because of their large size. There are however several approaches one can use to deal with this problem. One can for instance keep the dimensions that are most important, and/or project vectors in high-dimensional space onto a hyperplane of lower dimensionality. Dimensionality reduction techniques are not only used to deal with complexity; they are often used to compress sparse data, in feature extraction, and to visualize high-dimensional data (Bishop 2006). Many fields make use of these techniques, e.g. protein matching in biomedical applications, fingerprint recognition, meteorological predictions and satellite image repositories, to name a few (Lee and Verleysen 2007).

Principal component analysis (Pearson 1901; Hotelling 1933), or PCA, is a popular method that reduces dimensionality by selecting the dimensions that maximize variance. In mathematical terms PCA can be defined as 'the orthogonal projection of the data onto a lower dimensional linear space' (Bishop 2006). t-Distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton 2008) is another technique, which focuses on local groups of similar points rather than taking the global aspects of the vector space into consideration. The method can be used in many applications, not least for visualizing word embeddings (Lecun et al. 2015); the plots in Figure 3.2 in Chapter 3 are generated with t-SNE. Then there are approaches such as self-organizing maps (Kohonen 1990), in essence a special kind of unsupervised neural network that positions high-dimensional vectors on a two- or three-dimensional map. For this study, however, the Singular Value Decomposition (SVD) algorithm is used, as it has previously been shown to produce good results in similar studies (Pennington et al. 2014). The main idea behind SVD is that any M×N matrix A, where M ≥ N, can be factorized into an M×N orthogonal matrix U, an N×N diagonal matrix W and the transpose of an N×N orthogonal matrix V, i.e. A = UWV^T (Press et al. 1988). To reduce the dimensionality of A from N to N − x one can simply set the x lowest positive values in the diagonal matrix W to zero and then treat the produced matrix Â as a dimensionality-reduced variant of A. By doing this, SVD finds the axes where the sum of squares of the projection errors is as small as possible, and projects the high-dimensional vectors onto these axes.
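The truncation step can be sketched with numpy's SVD. Keeping only the k largest singular values and projecting the rows onto the corresponding axes is equivalent to zeroing the lowest values of W as described above; the matrix and k below are purely illustrative.

```python
import numpy as np

def reduce_svd(A, k):
    """Keep the k largest singular values of A and return the rows
    of A projected onto the corresponding singular axes."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # numpy returns singular values in descending order, so the
    # first k columns correspond to the k largest values of W
    return U[:, :k] * s[:k]  # one k-dimensional vector per row of A

rng = np.random.default_rng(0)
A = rng.random((50, 20))    # e.g. 50 nouns described by 20 adjectives
reduced = reduce_svd(A, 5)  # 50 vectors of dimensionality 5
```

With k equal to the full rank, the projection is a pure rotation, so row norms (and pairwise distances) are preserved exactly; smaller k trades some of that fidelity for compactness.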

K-Means Clustering

Having its roots in the middle of the 20th century, K-Means (Macqueen 1967) is still one of the most famous clustering methods around, despite having thousands of challengers. The method is simple and efficient, and is used in all fields where one needs to organize data into sensible groupings (Jain 2010). K-Means places N vectors into K clusters. The method receives two inputs: a set of data points, i.e. vectors, and a fixed value K. It then places K centroids in space, typically randomly distributed based on the positions of the set of data points. During training, each vector is assigned to its closest centroid, calculated with Euclidean distance; the centroids' positions are then adjusted based on the average of their closest vectors. This process is repeated until convergence. Finally, one assigns each vector to its corresponding centroid to form K clusters of the vectors (Hastie et al. 2001).
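A minimal numpy implementation of the procedure described above (Lloyd's algorithm); the two-blob data and K = 2 are invented for illustration.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid update until the loop limit is reached."""
    rng = np.random.default_rng(seed)
    # initialize centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each non-empty centroid to the mean of its assigned vectors
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# two well-separated blobs of 10 points each
X = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 5])
labels, _ = kmeans(X, 2)
```

On this toy data the two blobs end up in two different clusters; in the study the same idea is applied to the word vectors with a much larger K.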


3 Method

In this section, details regarding the method of the study are presented in the following order: Data, Experiments and Methods for Evaluation.

3.1 Data

The dataset used in this study was the Children's Book Test (CB) of the Facebook bAbi project (Weston et al. 2015). Several different datasets were tested, such as news articles, movie reviews and Wikipedia articles. However, a dataset based on the simple language of children's books is suitable for this kind of study, partly because of its large size but mostly because it consists of simple language reflecting common knowledge. This greatly facilitates the challenge of encapsulating literal knowledge in the representations of nouns and adjectives. The notion of using text reflecting common knowledge to build semantic representations has been explored before, e.g. by De Deyne et al. (2016), who argue that semantically strong pairs such as yellow banana occur relatively seldom even though the words have an obvious link. It seems likely that semantic pairs of this type occur more frequently in a set of children's books than in, say, a set of news articles.

For details regarding the dataset, see Table 3.1. A corpus of ~360M tokens led to a total set of adjective noun pairs (ANP) of almost 13M. The frequency threshold used to narrow down the number of adjectives and nouns was set at 50, i.e. the words had to occur at least this many times to be included. A second frequency threshold of 1250 was also used, after the generation of the vectors, to extract the most frequently occurring nouns and adjectives, which led to an adjective count of ~1000 and a noun count of ~1500. This may seem low; however, it should be noted that the purpose of this study is not to generate a complete vector set of adjectives and nouns, but rather to produce a set of semantic representations large enough for analysis without sacrificing quality.

3.2 Experiments

The main pipeline of this study consists of three components, namely extraction of ANPs, generation of the co-occurrence matrix, and dimensionality reduction with clustering. To extract the pairs we made use of dependency parsing, using the parser from SpaCy1 presented by

1https://spacy.io/

Label                                Details
Name                                 bAbi: The Children's Book Test
Size                                 1.7GB
Tokens                               362 661 222
Extracted Pairs                      12 667 419
Frequency Thresholds                 50 & 1250
Adjectives Post Frequency Threshold  993
Nouns Post Frequency Threshold       1 479

Table 3.1: Detailed specifications of the dataset.


Honnibal and Johnson (2015). The process was as follows: when an adjective was found in a given sentence, the part of speech of its head was checked; if it was a noun, a pair was extracted. The procedure then checked the head of the head two more times, potentially extracting three pairs for one adjective. The approach can thereby capture two pairs from sentences like A red and green apple; see Figure 3.1 for the parsed sentence. In this example the noun apple is the head of the adjective red, which means that the pair red apple would be extracted. Additionally, the head of the second adjective green is red, which is not a noun; the script would then look further and find that the next head in line is once again the noun apple, hence green apple would also be extracted. The parser extracted some incorrect adjectives, such as possessive pronouns and determiners; these were however filtered out by a list of stop words. It is possible that the approach favours precision over recall, as no elaborate evaluation of the method was conducted. This should however not affect the result, as a sufficient amount of pairs was extracted.
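The head-walking procedure described above can be sketched as follows. The token structure is a simplified stand-in for a dependency parser's output (not SpaCy's actual API): each token is a (text, part-of-speech, head index) triple, with the root pointing at itself.

```python
def extract_pairs(tokens, max_hops=3):
    """For each adjective, walk up the head chain up to max_hops
    steps and extract an (adjective, noun) pair at the first noun."""
    pairs = []
    for text, pos, head in tokens:
        if pos != "ADJ":
            continue
        j = head
        for _ in range(max_hops):            # head, head-of-head, ...
            h_text, h_pos, h_head = tokens[j]
            if h_pos == "NOUN":
                pairs.append((text, h_text))
                break
            if h_head == j:                  # reached the root, give up
                break
            j = h_head
    return pairs

# "A red and green apple": green's head is red, red's head is apple
sent = [("A", "DET", 4), ("red", "ADJ", 4), ("and", "CCONJ", 1),
        ("green", "ADJ", 1), ("apple", "NOUN", 4)]
```

Running `extract_pairs(sent)` yields both red apple and green apple, matching the worked example in the text.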

DT  JJ   CONJ  JJ     NN
A   red  and   green  apple

Figure 3.1: ANP-extraction example.

The frequencies of the pairs are then inserted into a matrix where the rows correspond to the nouns and the columns correspond to the adjectives. For instance, if the ANP little pot occurred 100 times, the cell corresponding to little and pot will hold this number. Ideally, a finished matrix consists of both noun and adjective vectors that have captured semantic aspects of the corpus; see Section 2 for additional information about co-occurrence matrices.
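A minimal sketch of building such a noun-by-adjective frequency matrix; the pair list below is invented for illustration, not taken from the thesis data.

```python
import numpy as np

# extracted (adjective, noun) pairs, possibly with repetitions
pairs = [("little", "pot"), ("little", "pot"), ("old", "man"),
         ("little", "man"), ("old", "pot")]

nouns = sorted({n for _, n in pairs})   # one row per noun
adjs = sorted({a for a, _ in pairs})    # one column per adjective

M = np.zeros((len(nouns), len(adjs)), dtype=int)
for a, n in pairs:
    M[nouns.index(n), adjs.index(a)] += 1
```

Each row of `M` is then a raw count vector for a noun, and each column a count vector for an adjective, exactly the two views that are later reduced with SVD.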

Finally it is necessary to reduce the dimensionality of the matrix, as it is very large. After scaling2, singular value decomposition was used to reduce the dimensionality to 100. It is essential to do this not only on the matrix M but also on the transpose MT, as one of them represents the nouns and the other the adjectives. From the reduced matrices one can extract the CB-vectors for each word, which can be used for analysis. Clustering was then applied.

For comparison, a vector set constructed with the CBOW algorithm of Word2vec on the same data was produced. This was based on the idea that this method has been validated before and has been shown to capture some semantic aspects of language. Furthermore, by comparing the differences and similarities between the two sets of vectors one can possibly find interesting results. With the exception of a dimensionality of 100, a minimum frequency of 10 and a window size of 5, default settings were used. As these vectors are generated from the entire data, and not only nouns and adjectives, the intersection between this vector set and the CB-vectors was extracted. This led to two sets with identical vocabularies that could be used for comparison in the evaluation methods.

3.3 Methods for Evaluation

Four approaches were used to analyze the generated vectors, see Table 3.2. Two of them are based on a set of semantic and syntactic concepts, specifically synonymy and antonymy3

2Scaled data has a mean of zero and unit variance.

3A synonym is defined as ’a word having the same or nearly the same meaning as another in the language’.


along with paradigmatic relations between words, i.e. interchangeability, and finally syntagmatic aspects. The remaining evaluation methods generate scores of accuracy, precision and recall. This subsection presents the process of generating the results from the methods; the results are then presented in Section 4 and further discussed in Section 5.

Method                                                               CB  Word2Vec
Subjective Analysis of Nearest Neighbours                            X   X
Subjective Analysis of Clusters                                      X   -
Categorical Analysis on Nearest Neighbours of Prototypical Words     X   X
Categorical Analysis of Words that Co-occur with Prototypical Words  X   -

Table 3.2: Clarification of how the vector sets were analyzed. An "X" states that the set was analyzed by the particular method, and a "-" states that it was not.

Subjective Analysis of Nearest Neighbours

A set of target nouns and adjectives was chosen from the dataset. The words were subjectively chosen based on the idea of having a topically varied mix of words representing different concepts; a frequency threshold of 10 000 was also used. For each target word, five nearest neighbours are generated, calculated with cosine similarity. The approach is inspired by Levy and Goldberg (2014) and Trask et al. (2015), who perform a similar analysis. Neighbours are generated from the CB-vectors as well as from the vectors generated with Word2vec for comparison; should the neighbours of the two methods overlap to some extent, there is evidence for the validity of the novel method, as Word2vec has performed well in these tasks. Additionally, one can analyze the words based on the previously mentioned semantic concepts.
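The cosine-based neighbour lookup can be sketched as follows, with toy vectors standing in for the real CB or Word2vec vectors:

```python
import numpy as np

def nearest_neighbours(vecs, vocab, word, k=5):
    """Rank the vocabulary by cosine similarity to a target word."""
    # normalizing the rows makes the dot product equal cosine similarity
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    ranked = sims.argsort()[::-1]            # most similar first
    return [(vocab[i], float(sims[i])) for i in ranked
            if vocab[i] != word][:k]

# toy vectors, purely illustrative
vocab = ["dark", "bright", "apple"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

Here `nearest_neighbours(vecs, vocab, "dark", k=2)` ranks bright ahead of apple, since its vector points in almost the same direction as dark's.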

Subjective Analysis of Clusters

Although it would be preferable to use a low number of clusters, to see if the vectors manage to distribute themselves in such a way that they form ~5 isolated clusters, this approach does not seem to work. Looking at Figure 3.2, which is a projection of the vectors onto a 2D-plane, it is quite obvious that the words distribute themselves to form a sheet of words, rather than x isolated clusters. This is not only the case for the novel method of this study, but also for the vectors generated with Word2vec. With that said, some testing showed that by increasing the number of clusters one will find smaller subsets of the words that seem to capture semantic aspects. This idea is the basis for this particular evaluation approach. Specifically, N/5 clusters were generated, which led to a cluster count of 200 for adjectives and 300 for nouns. Twelve clusters were then chosen at random4 from each cluster set for further inspection.

4Although the clusters were chosen at random, thresholds on the minimum and maximum number of words were set at > 2 and < 20. The words were distributed in such a way that several clusters consisted of single words and a few had 50 or more words.


(a) CB-nouns (b) Word2vec nouns

(c) CB-adjectives (d) Word2vec adjectives

Figure 3.2: Visualization of the different vector sets on a 2D-plane. It is evident that it is a complex task to point out a small number of isolated clusters.

Categorical Analysis on Nearest Neighbours of Prototypical Words

The remaining evaluation approaches are directed at the set of constructed categories mentioned previously in the paper. In this approach, five prototypical words for each category and part-of-speech are selected, as seen in Table 3.3. The selection has some subjectivity to it, which is inevitable, as one must start somewhere when defining the categories. With that said, frequency and ambiguity were taken into consideration, which narrowed the candidates to some extent. The prototypes had to occur a certain number of times to make sure that their representations are valid; all prototypes were extracted from the top ~300 words, all occurring at least 10 000 times. Ambiguous words were also avoided, e.g. bank and stone (the type of material vs a memorial stone), as were words that are heavily used metaphorically, e.g. thought and year (one can use spatial adjectives to describe these even though they are abstract, e.g. deep thought, long year).

For each prototype, the ten nearest neighbours were generated and labeled5. Furthermore, a confusion matrix was built based on which categories the method managed to capture properly and which it did not. From this matrix, accuracy along with precision and recall for each category were calculated. The process was conducted for the CB-vectors and for the vectors constructed with Word2vec for comparison.
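The metrics derived from such a confusion matrix can be computed as below; the two-category matrix and its values are toy numbers, not the thesis results.

```python
import numpy as np

# confusion matrix: rows = true category, columns = predicted category
C = np.array([[8, 2],
              [3, 7]])

accuracy = np.trace(C) / C.sum()          # correct cells over all cells
precision = np.diag(C) / C.sum(axis=0)    # per predicted column
recall = np.diag(C) / C.sum(axis=1)       # per true row
```

A perfect matrix would be purely diagonal, giving accuracy, precision and recall of 1.0 for every category; scattered off-diagonal counts pull all three down.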

Categorical Analysis of Words that Co-occur with Prototypical Words

Finally, an analysis on the occurrence level was conducted. For this method, the prototypes used in the previous section were used again, this time to extract the words that they occurred with and to calculate a categorical percentage. For instance, the adjective warm occurred with the nouns milk, water, weather and sunshine, see Table 3.4. The first two co-occurring words follow the category MASS and the last two correspond to the category ABSTRACT; from this example alone we then get a percentage of 50% for each of the two categories. By doing this on 20 co-occurring words, selected based on frequency, for each prototype, one gets a matrix that is directly comparable to the set of suggested categories, which is presented in Table 1.1.

5The labeling was done by me with help from one of my supervisors, if one is interested in this information the


            SPATIAL   OBJECT    MATERIAL  FORCE    EMOTION
Adjectives  big       old       hard      strong   happy
            small     dead      soft      weak     afraid
            large     blue      clear     gentle   glad
            short     green     slippery  fast     tired
            tall      thick     feathery  quick    angry

            PLACE     CONCRETE  MASS      ABSTRACT AGENT
Nouns       sea       house     water     moment   man
            forest    tree      air       kind     king
            river     door      fire      love     mother
            shore     horse     bed       money    horse
            village   palace    gold      fact     sister

Table 3.3: Adjective and Noun Prototypes.

Adjective  OccNN1  OccNN2  OccNN3   OccNNx  OccNNn
warm       milk    water   weather  ...     sunshine
golden     apple   hair    light    ...     crown
blue       eye     sky     coat     ...     ribbon
...        ...     ...     ...      ...     ...
happy      day     face    child    ...     home

Noun       OccJJ1  OccJJ2  OccJJ3   OccJJy  OccJJn
dog        little  old     black    ...     white
king       old     young   great    ...     wicked
tree       old     hollow  tall     ...     big
...        ...     ...     ...      ...     ...
water      cold    hot     little   ...     deep

Table 3.4: An excerpt of the training data used in the study; the first column corresponds to a target noun or adjective, and the columns to its right are co-occurring words.


4 Results

This section presents the results of the four different evaluation methods. The information is presented as objectively as possible; further discussion and analysis regarding how to interpret the results is given in Section 5.

4.1 Subjective Analysis of Nearest Neighbours

The following section presents the results of the nearest neighbour analysis, as described in Section 3.3.

Adjectives

Looking first at Table 4.1, which presents the results of the first evaluation method, one finds both similarities and differences between the two vector sets, with semantic and syntactic aspects captured by both. The word beautiful generates similar neighbours in both vector sets, and apart from enchanted and splendid they are more or less interchangeable with the given word. The second word, dark, once again generates paradigmatically related neighbours for Word2vec, while the CB-vectors emphasize the semantics of the word and its metaphorical aspects rather than its syntactic aspects. Words such as darkest and evil are, for instance, semantically related to the given word, but not as interchangeable as bright, gray and green. Moving on to eldest, there are once again striking similarities between the two sets of neighbours; while fisher does not really fit in among the CB-vectors, the rest follow the general semantics of the given word, i.e. age. For the following words, green and golden, there are once again evident paradigmatic relations captured by Word2vec, restricting the neighbours to the specific domain of colour. The words generated by the CB-vectors also stand in paradigmatic relations, but rather emphasize adjectives related to the naturalistic aspects of green, such as grassy and flowery. Similarly, the CB-vectors bring forth semantically related words such as fiery and gilded (something covered with gold). Furthermore, the given word good generates a set of neighbours that are more or less interchangeable for both vector sets. The antonym bad appears in both neighbour sets, as do synonyms along with a few black sheep such as generous for the CB-vectors and clever/lucky for Word2vec. For happy, the antonym sad is generated in both neighbour sets, along with some emotional words for the CB-vectors such as grateful and happiest. Word2vec also generates the word grateful, along with positive words such as pleasant and comfortable. There is no intersection between the neighbour sets produced by poor, but both capture attributes related to the adjective. For instance, for the CB-vectors, hungry, unhappy and possibly hearted can be associated with a poor individual; wretched and miserable, given by Word2vec, reflect similar ideas. As a common trend, a few less obviously related words are also generated, such as fellow and faced, along with wicked, young and dear. For the given word pure, the CB-vectors generate the three synonymous words true, natural and innocent along with two feminine adjectives, specifically maternal and womanly. The neighbours generated by Word2vec are not as clearly related to the given word; while sweet and fresh share some semantic value, full, mingled and human do not. Finally, terrible generates a proper set of synonyms for Word2vec. Similar results can be seen for the CB-vectors, along with some vaguer words such as winged and confused. Furthermore, there is a lower cosine similarity for the CB-vectors on average compared to the vectors of Word2vec (~0.42 versus ~0.58).


Adjective  CB-Vectors   Sim   Word2Vec     Sim
beautiful  lovely       0.41  lovely       0.84
           enchanted    0.33  wonderful    0.69
           magnificent  0.30  charming     0.67
           charming     0.27  handsome     0.65
           splendid     0.27  splendid     0.62
dark       darkest      0.58  dim          0.62
           deeper       0.39  bright       0.52
           lonesome     0.39  gray         0.52
           evil         0.34  green        0.51
           dim          0.33  thick        0.51
eldest     youngest     0.89  youngest     0.72
           elder        0.83  elder        0.51
           fisher       0.65  oldest       0.45
           younger      0.63  second       0.42
           older        0.51  younger      0.40
green      grassy       0.50  yellow       0.63
           flowery      0.46  golden       0.62
           sunny        0.41  white        0.61
           shady        0.33  purple       0.60
           drooping     0.29  blue         0.58
golden     poisonous    0.45  yellow       0.67
           silver       0.42  red          0.65
           fiery        0.38  green        0.62
           gilded       0.36  silver       0.61
           colored      0.33  crimson      0.59
good       excellent    0.58  nice         0.62
           better       0.31  bad          0.60
           bad          0.28  fine         0.56
           generous     0.25  clever       0.53
           best         0.24  lucky        0.50
happy      sad          0.46  pleasant     0.59
           brave        0.33  comfortable  0.56
           agreeable    0.32  sad          0.55
           grateful     0.30  busy         0.55
           happiest     0.29  grateful     0.55
poor       hungry       0.40  wicked       0.61
           fellow       0.39  wretched     0.52
           faced        0.38  dear         0.47
           hearted      0.37  young        0.47
           unhappy      0.34  miserable    0.46
pure       true         0.60  sweet        0.48
           maternal     0.52  full         0.47
           womanly      0.51  mingled      0.44
           natural      0.46  fresh        0.42
           innocent     0.43  human        0.40
terrible   winged       0.52  dreadful     0.80
           fierce       0.49  fearful      0.62
           confused     0.47  strange      0.58
           fearful      0.46  frightful    0.56
           awful        0.45  horrible     0.56

Table 4.1: Nearest neighbours of adjectives.

Nouns

Switching focus to Table 4.2, which presents the nearest neighbours of the nouns, one finds the word hair with its corresponding sets of neighbours. The CB-vectors have managed to capture a set of neighbours that are topically related to the given word; curl, lock and braid are, for instance, clearly semantically related. Similarly, Word2vec captured braid, beard and forehead, along with a few questionable words such as coat and cloak. For the second word, house, Word2vec generated words that are very much alike the given word: hut, castle, cottage, palace and even room to some extent correspond to the same fundamental notions. The CB-vectors also capture some interchangeable words, such as cottage; however, syntagmatically related words are also evident, e.g. bed and garden. For horse, the CB-vectors do not generate any neighbours with obvious connections to the given word; concrete words such as mirror and ware are generated, but abstract terms such as bargain are also included. The vectors of Word2vec manage to capture pony, colt, donkey and steed, all more or less subcategories of horse. Also, saddle is included, which is often associated with horses. The word king generates exclusively agentive neighbours, with the exception of bush, for the CB-vectors. Queen and lord are two paradigmatically related words to king; worm (based on the definition regarding a human) and fairy are to some extent interchangeable, but not at the same level. Word2vec generates similar results, with prince, princess and queen being obvious paradigmatic neighbours and doctor/giant being slightly less semantically related. The CB-vectors generate some questionable neighbours for the word life, without any clear semantic neighbours. The neighbours are however almost exclusively abstract, which is central to the given word. Word2vec manages to generate some promising results, such as soul, but much like the CB-vectors the rest are not clearly related to the given word, although they also share the abstract aspects mentioned. Moving on to the second abstract word, moment, one finds additional 'timely' words such as minute and week in both neighbour sets. Word2vec manages to generate a couple of additional words sharing similar aspects, specifically instant, glance and time. As many times before, the CB-vectors capture words of slightly higher topical value for the word rock, such as ground and hill, compared to Word2vec that includes specific objects like tree and wall. Both vector sets manage to capture mountain however, which is clearly related to the given word. For the word water, the CB-vectors once again generate topically related words, e.g. pool, wave and possibly weather. Additionally, however, some uncountable nouns are also picked up, specifically blood and milk. The neighbours of Word2vec are restricted to the topicality of water, generating naturalistic words like sea, river and mud. Further, the timely word week generates similar results for both vector sets, where all neighbours are more or less interchangeable with the given word. Finally, the word spirit generates a mix of words for the CB-vectors, where ghost is of synonymous value but the rest are not as clearly related. The neighbours of Word2vec emphasize the abstract aspects of the given word with words like soul, heart and pride. As with the previous section, there is a lower cosine similarity for the CB-vectors on average, compared to the vectors of Word2vec (~0.51 versus ~0.64).

Noun    CB-Vectors  Sim   Word2Vec   Sim
hair    curl        0.75  coat       0.64
        lock        0.69  forehead   0.64
        braid       0.65  braid      0.61
        neck        0.64  cloak      0.61
        thread      0.55  beard      0.61
house   cottage     0.65  hut        0.76
        bed         0.51  room       0.75
        garden      0.49  castle     0.74
        hole        0.44  cottage    0.73
        roll        0.41  palace     0.73
horse   mirror      0.57  pony       0.66
        bargain     0.55  saddle     0.63
        ware        0.50  colt       0.61
        wing        0.46  donkey     0.60
        linen       0.38  steed      0.57
king    worm        0.61  prince     0.81
        bush        0.58  princess   0.72
        fairy       0.56  doctor     0.68
        queen       0.44  queen      0.65
        lord        0.43  giant      0.65
life    court       0.46  soul       0.55
        atmosphere  0.43  happiness  0.53
        doing       0.40  fortune    0.51
        sail        0.37  heart      0.50
        duty        0.36  absence    0.49
moment  minute      0.59  minute     0.84
        band        0.46  instant    0.63
        week        0.45  time       0.58
        angel       0.43  glance     0.53
        adventure   0.35  week       0.52
rock    ground      0.68  tree       0.67
        mountain    0.64  wall       0.66
        hill        0.51  stone      0.64
        gold        0.51  mountain   0.60
        honor       0.47  bank       0.60
water   pool        0.74  ground     0.65
        wave        0.65  sea        0.64
        weather     0.55  river      0.60
        blood       0.55  mud        0.60
        milk        0.51  stream     0.59
week    year        0.67  month      0.87
        month       0.62  year       0.86
        evening     0.57  minute     0.66
        minute      0.54  night      0.64
        afternoon   0.49  day        0.63
spirit  fancy       0.43  soul       0.64
        ghost       0.37  heart      0.57
        virtue      0.34  beauty     0.55
        aspect      0.32  nature     0.51
        bridge      0.32  pride      0.51

Table 4.2: Nearest neighbours of nouns.

4.2 Subjective Analysis of Clusters

The following section presents the results of the cluster analysis, as described in Section 3.3.

Adjectives

Table 4.3 represents the generated sample of adjective clusters. The first cluster has captured the words charming and magnificent; while they are not synonyms they can be used similarly in language. The second cluster has managed to encapsulate two typical antonyms, namely small and large. Cluster 3 also shows semantic value, as merry and gay are often used as synonyms. Cluster 4 has captured words that are usually used when referring to abstract concepts, although following and asleep are quite plain in terms of semantics; the words absurd, crazy and distinct reflect similar notions. Moving on to Cluster 5, the words majestic and accustomed are clustered together. They can both refer to either concrete or abstract concepts but do not share any greater semantic value. Cluster 6 contains the words sweetest and unusual, which do not share any particular semantic or syntactic aspects. The words unknown, far and possibly safe could potentially be used paradigmatically depending on the context; it is however hard to say without further inspection. The eighth cluster managed to capture the antonyms left and right. The words vast and sunny could possibly be used similarly in some contexts (vast/sunny plains), but are not interchangeable. The ninth cluster captured an evident pair of antonyms, specifically former and present. Two words of Cluster 10, fantastic and costly, can sometimes refer to similar concepts, but are not synonyms. Cluster 11 contains words that can be associated with agents; they are, in other words, paradigmatically related to some extent, but do not otherwise share any clear semantic aspects. The twelfth and final cluster consists of 17 words; there is no clear notion that reflects all of them, but there is a common denominator between some words reflecting negative attributes, such as unhappy, wretched, unlucky, helpless, sore, forlorn, unfortunate and miserable. These are paradigmatically related and interchangeable to some extent.

Cluster 1:  charming, magnificent
Cluster 2:  small, large
Cluster 3:  merry, gay
Cluster 4:  following, asleep, absurd, distinct, crazy
Cluster 5:  majestic, accustomed
Cluster 6:  sweetest, unusual
Cluster 7:  unknown, far, jolly, opposite, safe
Cluster 8:  vast, sunny, right, left
Cluster 9:  former, present
Cluster 10: fantastic, costly
Cluster 11: busy, brave, hearted
Cluster 12: devoted, german, unhappy, sore, bold, eyed, active, forlorn, wretched, fellow, unlucky, unfortunate, helpless, ambitious, faced, miserable, sober

Table 4.3: Adjective clusters.

Nouns

As shown in Table 4.4, the first cluster contains the words horse and wing; the latter is an ambiguous word referring to "a feathered limb" as well as "a part of a big building". Either way, the words are not synonymous and not topically related, and while they have a weak paradigmatic relation in terms of what attributes can be associated with them, there is no clear semantic relation between them. Similar notions are reflected by Clusters 2 and 3, where linen and wear of Cluster 3 share some topical aspects. Cluster 4 reflects timely nouns: minute, week and month are all clearly related. The fifth cluster has a mix of abstract and concrete nouns that do not have any clear concept in common. Cluster 6 has managed to capture arm and leg, two body parts that clearly represent similar concepts. The two words tail and sword of Cluster 7 are not synonyms but reflect similar notions in terms of shape; as they can be associated with similar attributes to some extent, they do have a paradigmatic relation. Cluster 8 contains the words story and song, which in some contexts are synonyms; either way they have a paradigmatic relation. In Cluster 9, coin and crown can both be associated with attributes like shiny; there is however no obvious connection between the words, nor to hero. The tenth cluster consists of the words reason, lot and trifle, none having a clear connection to another. For the eleventh and biggest cluster, there are topical aspects to consider. There are for instance several words related to clothing and tailoring, e.g. bag, hood, petticoat, apron, frock, sheet and collar, and two words reflecting features of one's face, chin and brow. Finally, Cluster 12 captures the words blossom and violet, two paradigmatically related types of flowers.


Cluster 1: horse, wing
Cluster 2: gossip, shop, talent
Cluster 3: sweetness, mirror, wear, linen, bargain
Cluster 4: minute, week, month
Cluster 5: cliff, haste, hue, bower, seat, bridge
Cluster 6: arm, leg
Cluster 7: tail, sword
Cluster 8: story, song
Cluster 9: coin, hero, crown
Cluster 10: reason, lot, trifle
Cluster 11: bag, slab, paw, hood, petticoat, apron, frock, sheet, sand, chin, collar, brow
Cluster 12: blossom, violet

Table 4.4: Noun clusters.
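The general idea behind grouping word vectors into clusters like these can be illustrated with a minimal k-means sketch in pure Python. The vectors below are made-up two-dimensional toys standing in for the real trained embeddings, and the deterministic initialisation is a simplification for illustration only.

```python
def kmeans(vectors, k, iters=20):
    """Plain k-means: assign each vector to its nearest centroid,
    then move each centroid to the mean of its members."""
    centroids = [list(v) for v in vectors[:k]]  # simple deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # squared Euclidean distance to each centroid
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])),
            )
            clusters[nearest].append(v)
        # recompute centroids as per-dimension means (keep old one if empty)
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Toy 2-dimensional "embeddings" forming two obvious groups
vecs = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0]]
groups = kmeans(vecs, k=2)
```

With these toy vectors the two nearby pairs end up in separate clusters; real runs would of course cluster the high-dimensional noun vectors instead.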

4.3 Categorical Analysis on Nearest Neighbours of Prototypical Words

The following section presents the results of the categorical analysis of nearest neighbours, as described in Section 3.1.
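Nearest neighbours here means the words whose vectors are most similar to a prototype's vector. A minimal sketch of such retrieval with cosine similarity, using made-up toy vectors in place of the trained CB and Word2vec vectors:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def nearest_neighbours(word, vectors, n=2):
    """Rank all other words by cosine similarity to `word`."""
    ranked = sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
        reverse=True,
    )
    return ranked[:n]

# Toy vectors (hypothetical; real runs use the trained embedding spaces)
toy = {
    "arm":  [0.9, 0.1, 0.0],
    "leg":  [0.8, 0.2, 0.1],
    "week": [0.0, 0.9, 0.8],
}
```

Here `nearest_neighbours("arm", toy, n=1)` returns `["leg"]`, since those two toy vectors point in almost the same direction.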

Adjectives

During annotation, a large subset of the adjectives did not correspond to the proposed categories. Although this was expected to some extent, as the categories are not necessarily exclusive (see Section 1.4), the amount was so high that a new category, labeled ABSTRACTIVE, was introduced. This complicates the interpretation of the results; it was handled by adding an additional column for the novel category. The averages of precision and recall were calculated only on the original categories (as the ABSTRACTIVE category has no scores of its own), whereas accuracy is calculated over every cell. It should be noted that the expected accuracy of a random model is 16.67%.

Table 4.5 shows the results for the CB-vectors: a total accuracy of 28% was observed, which is above the random baseline. There are, however, several errors, as can be seen from the scattered numbers (a perfect matrix would have a clear diagonal with zeros around it). The trade-off between precision and recall is somewhat even, but the model clearly favours recall over precision, the only exception being the DESCRIPTIVE category. The categories SPATIAL and DESCRIPTIVE show the best results. With some exceptions, SPATIAL adjectives are only confused with DESCRIPTIVE adjectives, and vice versa, although MATERIAL and FORCEFUL adjectives are also sometimes mixed up with DESCRIPTIVE. Furthermore, MATERIAL generates high recall but very low precision; the model is, in other words, quite generous, and not very picky, when it comes to this category. Similar notions are reflected by FORCEFUL and EMOTIONAL, which is ultimately reflected in the higher average recall of 47.6%, compared to the average precision of 28%. The ABSTRACTIVE category is often confused with FORCEFUL and EMOTIONAL words.


Pred \ Gold    Spatial  Descriptive  Material  Forceful  Emotional  Abstractive
Spatial             20           18         0         1          1           10
Descriptive         12           29         1         0          3            5
Material             6           16         4         1         10           13
Forceful             5           16         0         6          4           19
Emotional            2           10         0         5         11           21
Total               45           89         5        13         29           68
Precision          40%          58%        8%       12%        22%            –
Recall             44%          33%       80%       43%        38%            –

Average Precision: 28%  Average Recall: 47.6%  Accuracy: 28%

Table 4.5: Categorical nearest neighbour analysis of adjectives. CB-vectors.
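These scores follow directly from the matrix: precision divides each diagonal cell by its row total, recall divides it by its column total, and accuracy is the diagonal sum over all cells. A minimal sketch using the counts of Table 4.5 (rows are predictions; the last column is the gold ABSTRACTIVE category, which has no prediction row and therefore no precision or recall of its own):

```python
labels = ["Spatial", "Descriptive", "Material", "Forceful", "Emotional"]
matrix = [
    [20, 18, 0, 1,  1, 10],
    [12, 29, 1, 0,  3,  5],
    [ 6, 16, 4, 1, 10, 13],
    [ 5, 16, 0, 6,  4, 19],
    [ 2, 10, 0, 5, 11, 21],
]

def precision(m, i):   # correct predictions of class i / all predictions of class i
    return m[i][i] / sum(m[i])

def recall(m, i):      # correct predictions of class i / all gold items of class i
    return m[i][i] / sum(row[i] for row in m)

def accuracy(m):       # all correct predictions / every annotated item
    return sum(m[i][i] for i in range(len(m))) / sum(sum(row) for row in m)
```

For example, `precision(matrix, 0)` gives 0.40 for SPATIAL and `recall(matrix, 2)` gives 0.80 for MATERIAL, matching the table.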

The vectors of Word2vec generally performed better than the CB-vectors, as seen in close to every data point in Table 4.6. Relatively speaking the two are quite similar, but with a consistent difference in score. As before, the vectors favour recall over precision, showing an average recall of 61% compared to an average precision of 43.4%. The diagonal is somewhat clearer, however, which is reflected in the increased accuracy of 43.2%. The Word2vec vectors also seem to be better at handling, or excluding, ABSTRACTIVE words, mixing up a total of only 30 words, compared to 68 for the CB-vectors.

Pred \ Gold    Spatial  Descriptive  Material  Forceful  Emotional  Abstractive
Spatial             26           18         0         1          0            5
Descriptive          5           40         1         0          3            1
Material             7           30         5         1          1            6
Forceful             4           22         1         7         10            6
Emotional            0            7         0         1         30           12
Total               42          117         7        10         44           30
Precision          52%          80%       10%       14%        60%            –
Recall             62%          34%       71%       70%        68%            –

Average Precision: 43.4%  Average Recall: 61%  Accuracy: 43.2%

Table 4.6: Categorical nearest neighbour analysis of adjectives. Word2vec.

Nouns

As seen in Table 4.7, which shows the scores of the CB-vectors, the noun categories generate higher scores, with a total accuracy of 53.2%. It should be noted, however, that a random model would score 20%, because of the number of categories (5, compared to the 6 adjective categories). The trade-off between precision and recall is even, as seen from the average precision of 53.2% compared to the average recall of 55%. This also holds category-wise; the only category with a large difference is MASS, which clearly favours recall over precision. Additionally, the ABSTRACT category shows a higher precision than recall. The model is best at predicting the AGENT category; ABSTRACT, PLACE and CONCRETE show average scores of around 50%, and MASS seems to be the hardest category for the model.


Pred \ Gold    Place  Concrete  Mass  Abstract  Agent
Place             28         5     2        13      2
Concrete          17        24     3         3      3
Mass               7        16     8        18      1
Abstract           1         3     0        37      9
Agent              0         8     1         5     36
Total             53        56    14        76     51
Precision        56%       48%   16%       74%    72%
Recall           53%       43%   57%       49%    71%

Average Precision: 53.2%  Average Recall: 55%  Accuracy: 53.2%

Table 4.7: Categorical nearest neighbour analysis of nouns. CB-vectors.

Finally, Table 4.8 shows the scores of the Word2vec vectors. As for the adjectives, there is an increase in score, with an accuracy of 70.4%. The trade-off between precision and recall is again quite even, with a slightly higher precision. As before, the AGENT category generates the best results, closing in on a perfect score. In general, the remaining categories also reflect high scores, none going below 50%. See Section 5 for an interpretation of the presented results.

Pred \ Gold    Place  Concrete  Mass  Abstract  Agent
Place             42         5     2         1      0
Concrete          21        25     1         2      1
Mass               8        13    15        14      0
Abstract           0         0     2        47      1
Agent              0         2     0         1     47
Total             71        45    20        65     49
Precision        84%       50%   50%       94%    94%
Recall           59%       56%   75%       72%    96%

Average Precision: 74.4%  Average Recall: 71.6%  Accuracy: 70.4%

Table 4.8: Categorical nearest neighbour analysis of nouns. Word2vec.

4.4 Categorical Analysis of Words that Co-occur with Prototypical Words

The following section presents the results of the categorical analysis of words that co-occur with prototypical words, as described in Section 3.1. The tables should be interpreted similarly to the previous ones. Table 4.9 shows to what extent certain categories of nouns occurred with prototypical adjectives; for this reason, the columns add up to 100. For instance, the nouns occurring with EMOTIONAL prototypes were ~20% CONCRETE, ~50% ABSTRACT and ~30% AGENT. Table 4.10 is similar but is, in contrast, built upon how the adjective categories occurred with the prototypical nouns, which is why the ABSTRACTIVE column is included once again. In this table, the rows instead add up to 100. For instance, the adjectives occurring with CONCRETE prototypes were ~20% SPATIAL, ~55% DESCRIPTIVE, ~5% EMOTIONAL and ~20% ABSTRACTIVE.
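The two readings amount to normalising the same co-occurrence counts along different axes: column-wise so that each prototype column sums to 100%, or row-wise so that each prototype row does. A small sketch with invented counts (a 2×2 toy table, not the thesis data):

```python
# Toy co-occurrence counts: rows = one category set, columns = prototypes
counts = [
    [10, 30],
    [40, 20],
]

def normalise_columns(m):
    """Each column sums to 100 (as in Table 4.9)."""
    totals = [sum(row[j] for row in m) for j in range(len(m[0]))]
    return [[100 * row[j] / totals[j] for j in range(len(row))] for row in m]

def normalise_rows(m):
    """Each row sums to 100 (as in Table 4.10)."""
    return [[100 * x / sum(row) for x in row] for row in m]
```

For the toy counts, `normalise_columns` turns the first column (10, 40) into (20.0, 80.0), while `normalise_rows` turns the first row (10, 30) into (25.0, 75.0).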
