GRAMMAR USING STATISTICAL PARSING AND THESAURI
by
Björn Dagerman
A thesis submitted in partial fulfillment of the requirements for the degree
of BACHELOR OF SCIENCE in Computer Science

Examiner: Baran Çürüklü
Supervisor: Batu Akan

MÄLARDALEN UNIVERSITY
School of Innovation, Design and Engineering
ABSTRACT
Services that rely on the semantic computations of users’ natural linguistic inputs are
becoming more frequent. Computing semantic relatedness between texts is problematic due
to the inherent ambiguity of natural language. The purpose of this thesis was to show how
a sentence could be compared to a predefined semantic Definite Clause Grammar (DCG).
Furthermore, it should show how a DCG-based system could benefit from such capabilities.
Our approach combines openly available specialized NLP frameworks for statistical
parsing, part-of-speech tagging and word-sense disambiguation. We compute the
semantic relatedness using a large lexical and conceptual-semantic thesaurus. Also, we extend
an existing programming language for multimodal interfaces, which uses static predefined
DCGs: COactive Language Definition (COLD). That is, every word that should be
acceptable by COLD needs to be explicitly defined. By applying our solution, we show how our
approach can remove dependencies on word definitions and improve grammar definitions in
DCG-based systems.
SAMMANFATTNING
Services that depend on semantic computations of users' natural speech are becoming
increasingly common. Computing the semantic similarity between texts is problematic due to
the inherent ambiguity of natural language. The purpose of this thesis is to show how a sentence
can be compared with a predefined semantic Definite Clause Grammar (DCG). Furthermore,
the work should show how a DCG-based system can benefit from that capability.
Our approach combines openly available specialized NLP frameworks for
statistical parsing, part-of-speech tagging and word-sense disambiguation. We compute the
semantic similarity with the help of a large lexical and conceptual-semantic
thesaurus. Furthermore, we extend an existing programming language for multimodal interfaces
that uses statically predefined DCGs: COactive Language Definition (COLD). That is,
every word that should be acceptable by COLD must be explicitly defined. By applying
our solution, we show how our method can reduce dependencies on word definitions and improve
grammar definitions in DCG-based systems.
CONTENTS
Page
ABSTRACT . . . . i
SAMMANFATTNING . . . . ii
LIST OF FIGURES . . . . iv
CHAPTER
1 INTRODUCTION . . . . 1
1.1 Thesis Statement and Contributions . . . 2
2 BACKGROUND . . . . 4
2.1 Statistical Parsing . . . 4
2.2 Semantic Knowledge Base . . . 5
2.3 Word Sense Disambiguation . . . 6
2.4 COactive Language Definition (COLD) . . . 6
3 COMPUTING SEMANTIC RELATEDNESS . . . . 8
3.1 Semantic Comparison of Words . . . 8
3.2 Dynamic Rule Declarations . . . 9
3.3 Comparing Natural Language and Definite Clause Grammar . . . 13
4 RESULTS . . . 15
4.1 Semantic Analysis of Natural Language and DCGs . . . 15
4.2 A Framework for Computing Semantic Relatedness . . . 18
5 CONCLUSIONS . . . . 20
LIST OF FIGURES
Figure Page
2.1 COLD source code sample . . . 7
3.1 Graph walk in WordNet . . . 9
3.2 POS dictionary sample . . . 10
3.3 Defining almost identical COLD rules . . . 11
3.4 Nested rule declarations . . . 12
3.5 Simple dictionary sample . . . 12
3.6 Using dictionary-defined parameters . . . 12
3.7 Using multiple dictionary-defined parameters in one rule . . . 13
4.1 COLD semantic DCG sample . . . 15
4.2 A dictionary sample . . . 16
4.3 A natural sentence to be used as an input . . . 16
CHAPTER 1
INTRODUCTION
Applications for computing semantic similarity are becoming more prevalent [1, 5]. A
semantic similarity is a comparison of how similar in meaning two concepts are. These
concepts are associated with some ambiguity, normally related to natural language.
Comparisons could be made between two words, two natural sentences, or between a sentence and a
predefined statement (a rule). Such a rule would have some grammatical structure, e.g., its
grammar could be described as a set of definite clauses in first-order logic. Such a
representation is denoted a Definite Clause Grammar (DCG). A problem with DCGs is that they
are static as to what they can express. Every combination of possible words needs to be
explicitly defined in order to form acceptable rules. This is a complex problem for systems
comparing a user’s natural language with DCGs, because a user cannot be expected to
construct sentences exactly matching rules. As such, rules are defined with close consideration
of natural language, as can be seen in [3].
Conventional approaches for comparing texts fail to deliver human-level (common
sense) results [24]. This is understandable due to the many different ways semantically
identical sentences can be expressed using natural language. Computed semantic similarity
measurements are used in a wide range of services which rely on natural language.
Therefore, increasing the confidence with which texts can be compared is desirable in many natural
language processing applications, including: machine translation [20], conversational agents
[23], web-page retrieval (e.g., by search engines) and image retrieval [21, 19].
This thesis is part of a larger project within the field of human-robot interaction, aiming
to reduce the obstacles to robot deployment for small and medium enterprises (SMEs).
The investment cost of deploying robots for SMEs is partly related to hardware purchases,
but also to programming. A language framework is proposed in [4] which aims to remove the
dependency on robot programmers, allowing such tasks, and (re)purposing, to be performed by
manufacturing engineers.
COLD, or COactive Language Definition, is a high level programming language for the
rapid development of multimodal interfaces [3]. In its current version, possible multimodal
sentences are described as context free definite clause grammar. That is, every word that
is desired to be acceptable by COLD needs to be explicitly written as a rule in a COLD
grammar file. This limits the capabilities of the language and makes programming in it
cumbersome. For instance, the sentences ”go to your house” and ”go home” are similar both
semantically and lexically. However, two different rules must be defined to be able to handle
both cases.
1.1 Thesis Statement and Contributions
This thesis contributes to the field of natural language processing. Specifically, it
discusses the benefits of extending static definite clause grammar (DCG) systems, which
match users' inputs with predefined rules, with tools for semantic analysis. Also, it supplies
a modular framework for DCG–text and text–text comparisons. Although said framework is
not the purpose of this thesis, it can serve as a basis for further development and verification.
The purpose of this thesis is to:
1. present an approach for computing semantic confidence measurements between natural
language phrases and DCGs, and
2. show how an existing static DCG-based system can be extended with said
functionality.
Our algorithm (Section 4.1) combines common natural language processing tasks such as:
statistical parsing, tokenization, parts-of-speech tagging and word-sense disambiguation.
We apply these techniques in the context of the combined word-senses of the input and a
thesaurus. Doing so, our algorithm successfully matches linguistic phrases with (somewhat)
ambiguous predefined grammars. We show how our approach can benefit DCG-based systems.
This includes:
• Greater freedom in the definitions of the grammar rules.
• Not requiring all usable words to be predefined.
• Parts of parsing can be shifted to the statistical parser, allowing for early termination of parses where further traversal of the parse tree would not be beneficial.
Furthermore, this could enable rule definitions to be performed without the explicit
consideration of users' natural language, but rather in a way that better conveys the semantic
goal of the rule. Although our approach focuses on DCG-based systems, it is still applicable
in any system where the semantic relatedness of two texts is desired, because, in essence, a
DCG rule can itself be treated as a text.
CHAPTER 2
BACKGROUND
Before a meaningful semantic comparison can be performed on two sentences of natural
text, they first need to be parsed. This process involves chunking the sentences and tagging
the individual words of a given phrase with its corresponding part-of-speech (POS). In
addition, some semantic relation (known through a knowledge base) must exist between
the words. This chapter presents a broad overview of related work. Furthermore, common
specialized frameworks for natural language processing tasks are presented.
Agent/dialogue systems use Semantic-based Conversational Agents (SCAF) to interact
with users through dialogue, often with scripted responses [23]. Without understanding the
semantics of a user's input, no intelligible response is possible. Untranslated words are a
major problem in machine translation, where a common cause is the statistical training of
systems on corpora of which only a small percentage is understood [20]. Web-page
retrieval often involves comparing header titles of pages. This is discussed in [21] among
algorithms for web and text classification. The related subject of image retrieval often relies
on semantic hierarchies to classify related images [19]. Social bookmarking enables users to
annotate and share bookmarks of web-pages. Comparing tags to find related sites requires
semantic knowledge of the tag concepts [22]. Tagging of images on social networks is a recent
application, which combines these techniques.
2.1 Statistical Parsing
A statistical parser is a type of natural language parser which, given a text, determines
its structure: which words are related, which subjects and objects belong to which
verb, and which words are verbs/nouns/adjectives. The parser tries to produce the most
likely parse of the given text.
There are two actions of specific interest provided by the statistical parser for use in
semantic similarity comparison: tokenization and part-of-speech tagging:
Tokenization: splitting the given text into a set of tokens, where a token represents the
smallest part of speech, i.e., the sentence ”its color is red” would yield the tokens: ”its”, ”color”, ”is”, ”red”.
Part-of-speech (POS) tagging: every token is tagged with its corresponding POS, e.g.,
red, which is an adjective, is tagged with adjective. Every noun is tagged with noun, etc.
OpenNLP is a machine learning library for processing natural language that can perform
the above desired tasks [6]. Other features include: sentence segmentation, named entity
extraction, chunking, parsing, and coreference resolution. These tasks are usually required
to build more advanced text processing services.
2.2 Semantic Knowledge Base
WordNet is the outcome of a research project at Princeton University regarding the
creation of a large database of English words [2]. It is available both as an online and offline
resource. Any given word in WordNet is tagged as being part of one, or more, parts of
speech (POS): verbs, adjectives, nouns or adverbs. Words of the same POS with similar
(conceptual) meaning are grouped together in synonym sets, denoted as synsets.
A synset contains a short text description of its conceptual meaning in addition to
its list of words (synonyms). More interestingly, the synsets also contain semantic and
lexical references to other synsets. That is, the various synsets reference each other, forming
complex networks. The references are of a super-subordinate nature, meaning that general
sets link to more specific ones (and vice versa), and more specific ones link to very specific ones, e.g.,
the set containing the word { red } would link to the set with { crimson }, as crimson is a
special kind of red. Furthermore, it would also link to { color }, as red is a color.
Relations are inherited from their superordinates; continuing the previous example, crimson
is also a color, even though the two synsets are not
directly connected. The same conclusions cannot be drawn from the subordinate's point of
view, e.g., a color could be crimson, but it certainly doesn’t have to be. The same is true
for verbs where hypernym and troponym are related, e.g., { walk } is linked with the synset
{ travel, go, move }, since walking is a more precise way of moving.
For the most part there are no connections between different POS. Some adjectives
have a special relation denoted attribute. An attribute describes the given adjective's
connection to a noun, linking two POS together (e.g., linking the adjective happy with the
noun happiness).
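The super-subordinate structure and the inheritance of relations can be illustrated with a small sketch. The miniature graph below is a hand-made excerpt for illustration, not data obtained from WordNet programmatically:

```python
# Toy hypernym graph: each synset maps to its more general synset.
# Hand-made excerpt; real WordNet graphs are far larger and denser.
HYPERNYM = {
    "crimson": "red",
    "red": "chromatic color",
    "chromatic color": "color",
    "walk": "travel",
}

def hypernym_chain(concept):
    """Follow super-subordinate links up to the most general known concept."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

# Inherited relations: crimson is a red, hence also a chromatic color
# and, ultimately, a color.
print(hypernym_chain("crimson"))  # ['crimson', 'red', 'chromatic color', 'color']
```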
2.3 Word Sense Disambiguation
Word sense disambiguation is the process of determining which sense of a word is used
in a text. For instance, consider the sentence ”I’m having an old friend for dinner”. Does
the word having refer to what is being served, or does it refer to the company?
Techniques involve supervised, semi-supervised, unsupervised and cross-lingual
evidence methods; they are surveyed in [13]. Other methods involve the use of
dictionaries and knowledge bases. One such method is the Lesk algorithm. Lesk assumes that
the words in a given phrase are likely related to each other and that the relations can be
observed through the words' definitions [8]. Words can thus be compared, through their
definitions, finding the pairs that are most closely related.
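A simplified variant of Lesk picks the sense whose dictionary gloss shares the most words with the surrounding context. The glosses below are invented for illustration, not real dictionary entries:

```python
# Simplified Lesk: choose the sense whose gloss overlaps the context the most.
# The senses and glosses are invented examples for the "dinner" ambiguity.
GLOSSES = {
    ("have", "consume"): "serve or consume food or drink at a meal",
    ("have", "host"):    "receive a person as a guest or as company",
}

def simplified_lesk(word, context):
    """Return the sense of `word` whose gloss overlaps the context the most."""
    ctx = set(context.lower().split())
    def overlap(sense):
        return len(ctx & set(GLOSSES[sense].split()))
    return max((s for s in GLOSSES if s[0] == word), key=overlap)

sense = simplified_lesk("have", "an old friend as company for the evening")
# picks the "host" sense, since "as" and "company" appear in its gloss
```

Real implementations compare the glosses of all words in the phrase pairwise, but the overlap principle is the same.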
2.4 COactive Language Definition (COLD)
The main motivation behind this thesis is the inclusion of statistical parsing and
semantic analysis in an existing high-level framework for natural language in human-robot
interaction. The framework uses the programming language COactive Language Definition
(COLD), for semantic Definite Clause Grammar (DCG) rule definitions. COLD allows for
incremental multimodal grammar rules. Text, and speech (which can also be represented
as text), are of most interest for our approach. The other modalities are explained in [3].
100 command(Retval) --> set the color to red { Retval = setColor(red); }.
101 command(Retval) --> set the color to blue { Retval = setColor(blue); }.
102
103 command --> pickup <Object(obj)> { pickup(obj); }
104
105 Object(Retval) --> the box { Retval = property(theBox); }.
106 Object(Retval) --> the sphere { Retval = property(theSphere); }.
Figure 2.1: COLD source code sample
There are two categories of syntax: grammar definitions and semantic representation.
Both are illustrated in Figure 2.1. The semantic representation is any valid C# code and
can be declared in between two ”{ }” clauses.
Note the Prolog-like syntax in Figure 2.1 with the → arrow. The text to the right hand
side of the arrow represents the speakable grammar rule. That is, to call the command on
line 100 the words: ”set the color to red”, should be the input. Calls can be nested, as
shown on line 103. It declares a command with the property ”<Object(obj)>”, which is
defined on lines 105-106, implying that the valid inputs for rule 103 would be: ”pickup the
box” and ”pickup the sphere”.
The COLD syntax is declared in a script file and an integrated development
environment (COLD IDE) parses these files. It also provides XML files to be used together with
Microsoft's Speech API (for the speech recognition). The IDE compiles a multimodal
grammar graph, which is used in COLD programs to reference the grammar. See [3] and [14] for
further details.
CHAPTER 3
COMPUTING SEMANTIC RELATEDNESS
In its most general form, an input and definite clause grammar system depends on (a)
a set of predefined rules, e.g., ”Set the color to red”, ”Pick up the biggest box”, etc., and
(b) some input (i.e., a text). Through analyzing the supplied input together with the rules,
a semantic similarity value sim ∈ [0, 1] can be computed. A value of sim = 1 indicates a
perfect, unambiguous relation, whereas sim = 0 implies no connection whatsoever. For
every rule the general approach consists of:
1. tokenization and part-of-speech tagging,
2. identifying the words from the input with the closest semantic meaning to those of
the given rule (word-sense disambiguation),
3. comparing pairs of words to find the overall highest scoring combination.
3.1 Semantic Comparison of Words
Popular techniques for evaluating word similarity are compared in [7]. Wu-Palmer
similarity is one such technique originally developed to allow for inexact matches to achieve
correct lexical selection in machine translation [10]. Wu-Palmer exploits the fact that within
a conceptual domain, the relation between two concepts will be directly related to how
closely they are located in the word hierarchy. This is the same hypothesis on which the
knowledge base WordNet, described in Section 2.2, is built. Within a graph, consider the
two concepts c1 and c2, which are connected at the least common super-concept c3. Denote
the distances between the nodes as: d1 = dist(c1, c3), d2 = dist(c2, c3) and d3 = dist(c3, root).
The similarity sim between c1 and c2 can then be evaluated through:
sim(c1, c2) = (2 · d3) / (d1 + d2 + 2 · d3)
The possibility of multiple paths between c1 and c2 should be noted. Naturally, the distances
should be the shortest path. Figure 3.1 shows an illustration of a graph search and the
synonym sets along the shortest path.
Figure 3.1: Graph walk in WordNet from the hyponyms tangerine and red to their common hypernyms chromatic- and spectral color. The numbers represent the length of the walk.
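Given the three distances, the formula can be computed directly. The sketch below uses illustrative distance values, not values taken from WordNet:

```python
def wu_palmer(d1, d2, d3):
    """Wu-Palmer similarity from the distances defined above:
    d1, d2 -- distances from c1 and c2 to their least common super-concept c3,
    d3     -- distance from c3 to the root of the hierarchy."""
    return (2 * d3) / (d1 + d2 + 2 * d3)

# Illustrative values: identical concepts (d1 = d2 = 0) score 1.0,
# and moving the concepts away from their common ancestor lowers the score.
assert wu_palmer(0, 0, 5) == 1.0
print(wu_palmer(1, 2, 5))  # 10/13 ≈ 0.769
```

Note how a deep common ancestor (large d3) raises the score: two concepts far from the root but close to each other are considered more similar than two equally spaced concepts near the root.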
3.2 Dynamic Rule Declarations
Declaring completely static rules is partly a tedious task for the system engineers, but
more importantly, it constrains the users as to how they can interact with the system.
Relying on a user's ability to construct lexically identical phrases to exactly match predefined
grammar rules is not reasonable [17]. It might be admissible for a classical simple telephone
redirecting service, which states a few rules, e.g., ”Press 1 to ...”, ”Press 2 to ...”, where
the user is presented with the rules and given instructions on how to supply an exact and
unambiguous input (a key press). However, even such systems suffer, and often structure
the instructions such that the last rule provides information on how to repeat the previous
rules.
In recent years, dialogue-based systems have become more popular [1]. Such systems
100 JJ: red, green, blue, orange, black, small, large, huge,
101     active, inactive
102 VB: add, take, change, attach, set, take
103 NN: house, home, building, box, mountain, sun
104 RB: quickly, slowly
105
106 ACTION: pick_up, put_down
107 OBJECT: box, sphere
Figure 3.2: Excerpt from a dictionary file showing the predefined POS categories JJ
(adjectives), VB (verbs), NN (nouns), and RB (adverbs) on lines 100-104. The POS abbreviations
are the same as those obtained through the statistical parsing, see Section 2.1. Lines 106-107 contain user-defined categories.
face a problem when declaring the rules: they should be formulated in such a manner as to
best match the user's expected natural language. To help alleviate this problem, a local
dictionary for the rule system is proposed: a simple text file containing categories with
comma-separated entries (words), as shown in Figure 3.2. The annotations follow the Penn
Treebank [18] word-level tags, as they are commonly used and also present in the statistical
parser.
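Reading such a dictionary file is straightforward. The sketch below assumes the plain `CATEGORY: word, word, ...` format of Figure 3.2, with continuation lines belonging to the preceding category:

```python
def parse_dictionary(text):
    """Parse 'CATEGORY: word, word, ...' lines into {category: [words]}.
    A line without a colon continues the previous category's word list."""
    categories, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ":" in line:
            current, _, words = line.partition(":")
            current = current.strip()
            categories[current] = []
        else:
            words = line  # continuation of the previous category
        categories[current] += [w.strip() for w in words.split(",") if w.strip()]
    return categories

sample = """JJ: red, green, blue, orange, black, small, large, huge,
active, inactive
NN: house, home, building, box, mountain, sun
OBJECT: box, sphere"""
d = parse_dictionary(sample)
# d["OBJECT"] == ['box', 'sphere']; d["JJ"] ends with 'active', 'inactive'
```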
Using the dictionary from Figure 3.2, a rule such as ”Perform <ACTION> on <NN>”
would be able to match ”Perform pick_up on box” (as ”box” is a defined noun). However,
it would equally match ”Perform pick_up on sun”. In cases where it is desirable to only
have specific words matchable to a given rule/rules, optional user categories can be added.
For instance, defining the category OBJECT on line 107, allows the rule to be changed to:
”Perform <ACTION> on <OBJECT>”.
Some words belong to several categories, for instance, the word ”word” can be a noun
or a verb, depending on which context it’s used in. If such a word is desired to be usable
in more than one POS it should either be defined multiple times (once per POS-category),
or declared in a user-defined category.
The rule definitions are further improved through allowing the statistical parser and
the semantic analyzer to assist. For instance, consider the rule ”Set the color to <JJ>”, using the above dictionary. If given an input
such as: ”Set the color to crimson”; a match would not be possible in a static environment
(as crimson is unknown). However, the semantic analyzer could perform a word-sense
disambiguation, realizing that crimson is a color; a special type of red, allowing for the rule
”Set the color to red” to be matched. Furthermore, this allows rules to be defined
without the consideration of an agent's natural language, but rather in a way that best
explains the semantic goal of the rule.
3.2.1 Extending COLD Grammar
The COLD language syntax is explained in section 2.4. The main limitation of the
grammar defined rules is that they are completely static. For instance, say we want to
define a rule to change the color property of an object. In COLD this is possible through
two solutions, as shown in Figures 3.3 and 3.4.
100 command(Retval) --> set the color to red
101 { Retval = setColor("red"); }.
102
103 command(Retval) --> set the color to blue
104 { Retval = setColor("blue"); }.
105
106 command(Retval) --> set the color to green
107 { Retval = setColor("green"); }.
Figure 3.3: Defining almost identical new rules
Arguably, the second alternative is more dynamic, but both require predefinition of
every single color to be used (i.e., in this example only the colors red, green and blue
would be valid inputs). If we want to add another color then we need to write more rules.
To improve upon this solution we apply the concepts presented above regarding the local
dictionary, as shown in Figure 3.2. Firstly, rules are added to the COLD compiler so that, for each word used as a parameter,
100 command(Retval) --> set the color to <Color(clr)>
101 { Retval = setColor(clr); }.
102
103 Color(Retval) --> red { Retval = "red"; }.
104 Color(Retval) --> blue { Retval = "blue"; }.
105 Color(Retval) --> green { Retval = "green"; }.
Figure 3.4: Nested calls through defining another rule ”Color”
it (a) checks if it is defined in the script, or (b) checks if it is defined in the dictionary. If
either is the case, the word is valid. In the case of (a) it also appends these definitions to the
dictionary. Failing both (a) and (b) results in a compile error. This allows for a
dictionary such as the one shown in Figure 3.5 to be used together with the code snippet
in Figure 3.6 to define all the rules from Figure 3.4.
100 JJ: red, green, blue
Figure 3.5: A simple dictionary containing only three color adjectives
100 command(Retval) --> set the color to <JJ> { Retval = setColor(JJ); }
Figure 3.6: A rule using a dictionary-defined parameter, JJ
A common bracket syntax is proposed to allow for different words within the same
dictionary category to be distinguishable within the semantic representation. See Figure
3.7 for a rule that would be able to match an input such as ”change the color from red to
blue”.
Through these additions, COLD rules can be defined in a more natural way. The
100 command(Retval) --> change the color from <JJ> to <JJ> {
101     if (getColor == JJ[0])
102         setColor(JJ[1]);
103 }.
Figure 3.7: A rule using two of the same dictionary-defined parameters. They do not have to be represented by the same word. The two parameters are distinguished in the semantic representation (within the ”{ }”) through indexing with ”[ ]” brackets.
dictionary could be extended so that synonyms are defined, allowing for even more general inputs. However, this still requires
the structure of the input sentence to be the same as that of the rule. Instead of relying on
defined synonyms in the dictionary this information could be obtained through a semantic
knowledge base. The statistical parser, semantic analyzer and the semantic knowledge base
are all contained within the same .Net library, making the integration into COLD simple.
The extra overhead associated with the added NLP methods should be considered.
COLD relies on multimodal inputs, where speech is the most interesting for our purpose.
We suggest that, because these computations can be performed while an agent is speaking,
the calculations will have sufficient time to conclude before the speech is complete (as
it is incrementally parsed). As such, no additional overhead would be present in this
environment. An asymptotic time complexity analysis of this algorithm was outside the
scope of this thesis. Obviously, such an analysis should be carried out in order to support
the above claim. See Chapter 5 Conclusions.
3.3 Comparing Natural Language and Definite Clause Grammar
Comparing any two natural language texts is a similar problem to that of comparing
one text with definite clause grammar (DCG). The different DCG rules together provide a
context in which the natural language text can be compared. Such a context is not present
in the former problem.
Firstly, the word-sense of the input (a natural language text) is disambiguated using an
algorithm which operates on the hypothesis that neighboring words in a text have similar meaning. This
idea is further extended through using the context of the different rules, together with a
dictionary, as suggested in Section 3.2 Dynamic Rule Declarations. Through this process
the input is modified by replacing words with synonyms or hypernyms from the knowledge
base, as to better match the context of the DCG.
Secondly, all individual pairs of words obtainable through selecting one word from the
input and one word from the rule are compared to find the semantic similarity between
each of them, as explained in section 3.1 Semantic Comparison of Words. Pairs of words
are selected as to find the overall highest scoring combination. However, in the case where
multiple dictionary tags are defined in the DCG, i.e., the two JJ ’s in ”Change the color from
<JJ> to <JJ>”, different matches are required for each tag. Consider the input ”Turn the
red to blue”; the word red can only be matched with one of the JJ tags, not both. This is
exclusive to matching dictionary tags and has no impact on how red is matched with other
words, i.e., color.
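Finding the overall highest-scoring combination under this exclusivity constraint can be sketched as a brute-force search over permutations. The similarity table below is a hypothetical stand-in for Wu-Palmer scores:

```python
from itertools import permutations

# Hypothetical similarity scores between input words and the rule's
# dictionary-tag slots (here indexed 0 and 1 for the two <JJ> tags).
SIM = {("red", 0): 1.0, ("red", 1): 1.0, ("blue", 0): 0.9, ("blue", 1): 1.0}

def best_assignment(words, tags):
    """Assign each tag a distinct input word, maximizing total similarity."""
    best, best_score = None, -1.0
    for perm in permutations(words, len(tags)):
        score = sum(SIM.get((w, t), 0.0) for w, t in zip(perm, tags))
        if score > best_score:
            best, best_score = perm, score
    return best, best_score

# "red" can fill only one of the two <JJ> slots, never both.
match, score = best_assignment(["red", "blue"], [0, 1])
```

Brute force is fine here because a rule rarely contains more than a handful of tags; for larger assignment problems a Hungarian-style algorithm would be the usual choice.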
Through this process the input is scored against all rules, yielding a comparison confidence
value together with the suggested dictionary matches for each rule. Provided with
this data, selecting which rule/rules to fire is a simple design decision (e.g., firing all rules with a confidence above some threshold).
CHAPTER 4
RESULTS
In this chapter, detailed annotated results of a semantic analysis, using the discussed
methods, are presented in Section 4.1. In addition, a general framework for semantic
relatedness is presented in Section 4.2.
4.1 Semantic Analysis of Natural Language and DCGs
This section details the semantic analysis algorithm, using the methods discussed in
Chapter 3, Computing Semantic Relatedness. First, consider the set of DCGs shown in
Figure 4.1 together with the lexical dictionary shown in Figure 4.2. Secondly, consider the
input from Figure 4.3.
100 command(Retval) --> Set the color to <JJ>
101 { Retval = setColor(JJ); }.
102
103 command(Retval) --> Change the color from <JJ> to <JJ>
104 {
105     var color = getColor();
106     if (color == JJ[0]) Retval = setColor(JJ[1]);
107     else Retval = color;
108 }
109
110 command(Retval) --> Change the size to <JJ>
111 { Retval = setSize(JJ); }.
Figure 4.1: Snippet from a COLD script showing some semantic DCGs. The speakable components are to the right of the → arrows
100 JJ: yellow, crimson, green, ..., big, small, large
101 NN: ...
102 VB: ...
103 RB: ...
104 ...
Figure 4.2: A dictionary file detailing some of its color- and size-adjectives
The rules in Figure 4.1 correspond to the speakable sentences ”Set the color to <JJ>”, ”Change the color
from <JJ> to <JJ>” and ”Change the size to <JJ>”, where <JJ> is any word defined in
the dictionary entry represented by line 100 in Figure 4.2.
100 "Turn it red"
Figure 4.3: A natural sentence to be used as an input
The semantic analysis through these criteria is performed as follows:
1. The input is tokenized (split into the smallest word-unit)
ex. ”Turn it red” → ”[Turn] [it] [red]”
2. The tokenized input is tagged with its part-of-speech (POS)
”[Turn] [it] [red]” → ”[VP Turn/VB ] [NP it/PRP ] red/JJ”
3. For every token, get the synonym set (synset) for that word using its corresponding
POS from the thesaurus
4. For every rule:
(a) Get the synset for every dictionary entry of each containing category (<JJ>,
<NN>, ...)
(b) Find the best matching tags (using Wu-Palmer similarity) with confidence value
c ∈ [0, 1], such that no word is matched twice (i.e., two <JJ> in a rule cannot match
the same word),
e.g., rule 100: sim(red, crimson) = 1.0 ⇒ <JJ>[0] = crimson,
and rule 110: sim(red, crimson) = 1.0 ⇒ <JJ>[0] = crimson
(c) Discard any rule that failed to match all tags
i.e., rule 103: sim(red, crimson) = 1.0, sim( N/A , X) = NaN
⇒ <JJ>[0] = crimson, <JJ>[1] = N/A, therefore discard rule 103
5. For the remaining rule(s) create the sentence s using the best matches from (4b) and
remove all determiners:
s100: ”Set color to crimson”
s110: ”Change size to crimson”
6. For every word in input and every sentence from (5), find the best matched sentence,
through comparing the similarity between pairs of words:
input: ”Turn it red”, [VP Turn/VB ] [NP it/PRP ] red/JJ
s100: ”Set color to crimson”, [VP set/VBN ] [NP color/NN ] [PP to/TO ] crimson/JJ
s110: ”Change size to crimson”, [VP change/VB ] [NP size/NN ] [ PP to/TO ] crim-son/JJ
s100: sim(turn, set) = 0.666, sim(red, crimson) = 1.0, sim(red, color) = 0.428 ⇒ s100 = 0.698
s110: sim(turn, change) = 0.571, sim(red, crimson) = 1.0, sim(red, size) = 0.0 ⇒ s110 = 0.523
7. Select rule r corresponding to the best matched sentence from (6) with the POS
matches from (4b):
r : "Set the color to crimson"
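The pair similarities in step (6) combine into a per-rule score. The sketch below takes the score to be the mean of the selected word-pair similarities, an assumption that is consistent with the numbers above but not a definitive description of the implementation:

```python
# Sketch of the final scoring step, assuming the rule score is the mean of
# the selected word-pair similarities (consistent with the values in step 6).
def rule_score(pair_sims):
    return sum(pair_sims) / len(pair_sims)

s100 = rule_score([0.666, 1.0, 0.428])  # ~0.698
s110 = rule_score([0.571, 1.0, 0.0])    # ~0.523

# The selected rule is simply the one with the highest score.
best = max([("s100", s100), ("s110", s110)], key=lambda t: t[1])
```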
Note how the selected rule is the closest matching, even though the input contained none
of the actual words. The input: ”Turn it red”, contains the color red, which is not defined
in the lexical dictionary of accepted words (Figure 4.2). However, crimson is. Due to the
hyponym relation between red and crimson, a match could still be made. The rules ”Set
the color to <JJ>” and ”Change the size to <JJ>” are almost identical, except for the
words color and size. Both use <JJ> tags, which do not restrict which words
from that category can be matched. I.e., ”Change the size to crimson” and ”Set the color
to crimson” are both equally good matches as far as the DCGs and lexical dictionary are
concerned. Because of the word-sense of the input and rules, the algorithm could determine
that in this context, the word crimson is more closely related to color than it is to size. As
such, the desired rule was matched.
The example shows how the approach handles ambiguities. It could be noted that
the ambiguity could instead have been avoided altogether by defining the DCGs better.
For instance, this specific problem could have been circumvented by using user-defined
categories instead of general <JJ> tags, i.e., defining the categories colors and sizes.
4.2 A Framework for Computing Semantic Relatedness
Computing semantic relatedness of texts combines many different NLP techniques.
Though these methods complement each other for our purpose, researchers mostly focus
on improving just one or a few of them. For instance, recent research has
discussed the use of WordNet and how Wikipedia could replace it as a knowledge-base [15,
16].
We have created a general modular framework for the computation of semantic relatedness, where each module encapsulates one NLP technique. All modules share a common interface that governs how they interact to perform the semantic analysis. As such, individual methods can be replaced as more prominent solutions become available, or to better fit the needs of a specific system. For example, adding support for Wikipedia in conjunction with WordNet would not affect how statistical parsing or part-of-speech tagging is performed.
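A minimal sketch of such a common interface (the class and method names are illustrative; the framework's actual API may differ): the framework depends only on the interface, so a backend can be swapped without touching the other modules.

```python
from abc import ABC, abstractmethod

class RelatednessModule(ABC):
    """Common interface shared by all pluggable relatedness backends:
    each backend maps a pair of words to a similarity in [0, 1]."""

    @abstractmethod
    def similarity(self, word_a: str, word_b: str) -> float:
        ...

class WordNetBackend(RelatednessModule):
    """Placeholder thesaurus-based backend; a real implementation
    would query WordNet, e.g. via Wu-Palmer similarity."""
    def similarity(self, word_a, word_b):
        return 1.0 if word_a == word_b else 0.0  # stub

def relatedness(words_a, words_b, backend: RelatednessModule):
    """Framework code programs against the interface only, so the
    backend could be replaced (e.g. by a Wikipedia-based one)
    without affecting parsing or POS tagging."""
    return max(backend.similarity(a, b) for a in words_a for b in words_b)

print(relatedness(["red"], ["red", "size"], WordNetBackend()))  # 1.0
```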
4.2.1 Test Application
A simple test application is provided to test the functionality of the different NLP techniques. It is included in the delivered modular framework for semantic analysis in COLD.
Figure 4.4: Screenshot of the test application. The input ”increase it to large” from the upper input box is interpreted and the results are printed in the box below.
Any sentence can be written as input. The sentence is parsed, tokenized and POS-tagged. The gloss and synonyms for each individual word, taken from the thesaurus, are shown. The input is matched against a set of DCGs; the confidence values for these matches are presented, together with the most closely matched rule. The application reads DCGs from a script file and passes them as text strings to a data structure. Alternatively, sentences could be passed directly to the data structure, in which case the input would instead be compared to any of those sentences.
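As a sketch of that loading step (the rule syntax and file format here are illustrative, not COLD's actual notation): rule strings are read as text and stored in a structure the matcher can iterate over, with <JJ>-style tokens marked as tag slots.

```python
import re

def load_rules(lines):
    """Parse rule strings such as 'Set the color to <JJ>' into a
    simple structure: one token list per rule, where each token is
    either a literal word or a POS-tag slot."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens = []
        for tok in line.split():
            m = re.fullmatch(r"<(\w+)>", tok)
            # A <JJ>-style token is a tag slot; anything else is a literal.
            tokens.append(("tag", m.group(1)) if m else ("lit", tok))
        rules.append(tokens)
    return rules

script = ["# example grammar script",
          "Set the color to <JJ>",
          "Change the size to <JJ>"]

rules = load_rules(script)
print(rules[0])
# [('lit', 'Set'), ('lit', 'the'), ('lit', 'color'), ('lit', 'to'), ('tag', 'JJ')]
```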
CHAPTER 5
CONCLUSIONS
In this thesis we have proposed a general natural language processing framework for computing the semantic relatedness of texts. We have also shown how our approach can be used to solve the more specific problem of comparing a text with a definite clause grammar (DCG) rule. We did this by introducing a lexical dictionary to the DCG and computing the word-sense using the context of the DCGs and a thesaurus. Although the approach is explained through a DCG-based system, it is applicable in any system where the semantic relatedness of two texts is desired. This is because a semantic DCG includes a set of words: expanding a grammar to enumerate its acceptable words forms sentences. Although these sentences are likely fragmented, conceptually they are no different from a natural language text.
We have shown how an existing incremental multimodal framework (COLD) can benefit from our approach, compared to its current static grammar definitions. These benefits include: not requiring a list of all possible words, early termination of rule parsing, and the ability to make use of word types. The DCG rules can therefore be made much more dynamic. Furthermore, when defining the rules, no consideration of the users' natural language is required. As such, rules can be defined so as to best express their semantic meaning, minimizing the ambiguities of natural language. Before applying our solution to COLD, to match a rule defined as ”set the color to crimson” a user would have to input exactly that. As can be seen in 4.1, using our method the input ”turn it red” could convincingly be matched to the rule ”set the color to crimson”, even though none of the words ”turn”, ”it” or ”red” were explicitly defined. In this case it was sufficient for the words to be known from the thesaurus, in conjunction with a rule that conveyed its semantic meaning.
It should be noted that the goal of this thesis was not to find "the best" method for
comparing DCGs and texts, but rather any method that would be applicable to COLD
(see 2.4). Therefore, extensive testing of the suggested method was outside the scope of
this thesis. Also, the effectiveness of the proposed algorithm for the comparison of natural
language and DCGs (3.3 & 4.1) has not been thoroughly analyzed. Naturally, further testing
is required to determine its effectiveness.
Our approach uses the thesaurus WordNet. Although a respected source, newer research suggests Wikipedia can provide better results [15]. More details on this can be found in [16], which also discusses the benefits of combining WordNet, Wikipedia and Google. Although an increase in the confidence values of the semantic relatedness could be achieved, the complete implications of the (presumed) increase in asymptotic time and space complexity are unknown and need to be further analyzed.
Our approach has made the rule definitions of COLD more dynamic. Part of this is due to the introduction of the lexical dictionary and the dictionary tags. In the current solution it is possible to define grammar using <JJ> (adjective), <VB> (verb), <NN> (noun) and <RB> (adverb) tags, in addition to user-defined categories. We would like to extend this to also use phrases, i.e., verb-phrases and noun-phrases. This would further improve upon the expressiveness of the rule definitions.
REFERENCES
[1] Y. Li et al., “Sentence similarity based on semantic nets and corpus statistics,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138–1150, August 2006.
[2] G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[3] A. Ameri, “A general framework for incremental processing of multimodal inputs,” in Proceedings of the 13th International Conference on Multimodal Interaction, 2011, pp. 225–228.
[4] B. Akan, “Human robot interaction solutions for intuitive industrial robot programming,” Licentiate Thesis No. 149, Mälardalen University, 2012.
[5] D. Michie, “Return of the imitation game,” 2001.
[6] (2010) Apache opennlp. [Online]. Available: http://opennlp.apache.org
[7] E. Blanchard, M. Harzallah, H. Briand, and P. Kuntz, “A typology of ontology-based
semantic measures.” in EMOI-INTEROP, 2005.
[8] C. Fellbaum, WordNet: An electronic lexical database. Wiley Online Library, 1998.
[9] M. Lesk, “Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone,” in Proceedings of the 5th annual international conference on Systems documentation. ACM, 1986, pp. 24–26.
[10] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proceedings of the
32nd annual meeting on Association for Computational Linguistics. Association for
Computational Linguistics, 1994, pp. 133–138.
[11] S. Banerjee and T. Pedersen, “An adapted lesk algorithm for word sense disambiguation
using wordnet,” in Computational linguistics and intelligent text processing. Springer,
2002, pp. 136–145.
[12] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in Proceedings of the
41st Annual Meeting on Association for Computational Linguistics-Volume 1.
Association for Computational Linguistics, 2003, pp. 423–430.
[13] C. D. Manning and H. Schütze, Foundations of statistical natural language processing.
MIT press, 1999.
[14] M. Johnston, “Unification-based multimodal parsing,” in Proceedings of the 36th
Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1. Association for Computational
Linguistics, 1998, pp. 624–630.
[15] E. Gabrilovich and S. Markovitch, “Computing semantic relatedness of words and texts
in wikipedia-derived semantic space,” in IJCAI’07: Proceedings of the 20th International
Joint Conference on Artificial Intelligence, 2007, pp. 1606–1611.
[16] M. Strube and S. P. Ponzetto, “Wikirelate! computing semantic relatedness using
wikipedia,” in AAAI, vol. 6, 2006, pp. 1419–1424.
[17] B. Galitsky, “Building a repository of background knowledge using semantic skeletons,”
in AAAI Spring symposium 2006 - Formalizing and Compiling Background Knowledge
and Its Applications to Knowledge Representation and Question Answering, 2006, pp. 22–27.
[18] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a large annotated
corpus of english: The penn treebank,” Computational linguistics, vol. 19, no. 2, pp. 313–330, 1993.
[19] J. Deng, A. C. Berg, and L. Fei-Fei, “Hierarchical semantic indexing for large scale
image retrieval,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on. IEEE, 2011, pp. 785–792.
[20] Y. Marton, C. Callison-Burch, and P. Resnik, “Improved statistical machine translation
using monolingually-derived paraphrases,” in Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association
for Computational Linguistics, 2009, pp. 381–390.
[21] X. Qi and B. D. Davison, “Web page classification: Features and algorithms,” ACM
Computing Surveys (CSUR), vol. 41, no. 2, p. 12, 2009.
[22] B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme, “Evaluating
similarity measures for emergent semantics of social tagging,” in Proceedings of the 18th
international conference on World wide web. ACM, 2009, pp. 641–650.
[23] K. O’Shea, “An approach to conversational agent design using semantic sentence similarity,” Applied Intelligence, vol. 37, no. 4, pp. 558–568, 2012.
[24] Y. Li, Z. A. Bandar, and D. McLean, “An approach for measuring semantic similarity
between words using multiple information sources,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871–882, 2003.