
Johan Hall

MaltParser – An Architecture for Inductive Labeled Dependency Parsing

Licentiate Thesis

Växjö University


Växjö University
School of Mathematics and Systems Engineering
Växjö, Sweden
http://www.vxu.se/msi

© by Johan Hall. All rights reserved.
Reports from MSI

To Gert and Karin

Abstract

This licentiate thesis presents a software architecture for inductive labeled dependency parsing of unrestricted natural language text, which achieves a strict modularization of parsing algorithm, feature model and learning method such that these parameters can be varied independently. The architecture is based on the theoretical framework of inductive dependency parsing by Nivre (2006) and has been realized in MaltParser, a system that supports several parsing algorithms and learning methods, for which complex feature models can be defined in a special description language. Special attention is given in this thesis to learning methods based on support vector machines (SVM).

The implementation is validated in three sets of experiments using data from three languages (Chinese, English and Swedish). First, we check if the implementation realizes the underlying architecture. The experiments show that the MaltParser system outperforms the baseline and satisfies the basic constraints of well-formedness. Furthermore, the experiments show that it is possible to vary parsing algorithm, feature model and learning method independently. Secondly, we focus on the special properties of the SVM interface. It is possible to reduce the learning and parsing time without sacrificing accuracy by dividing the training data into smaller sets, according to the part-of-speech of the next token in the current parser configuration. Thirdly, the last set of experiments presents a broad empirical study that compares SVM to memory-based learning (MBL) with five different feature models, where all combinations have gone through parameter optimization for both learning methods. The study shows that SVM outperforms MBL for more complex and lexicalized feature models with respect to parsing accuracy. There are also indications that SVM, with a splitting strategy, can achieve faster parsing than MBL. The parsing accuracy achieved is the highest reported for the Swedish data set and very close to the state of the art for Chinese and English.

Key words: Dependency Parsing, Support Vector Machines, Machine Learning.

Sammanfattning

This licentiate thesis presents a software architecture for data-driven dependency parsing, that is, for automatically creating a syntactic analysis in the form of dependency graphs for sentences in natural language text. The architecture builds on the idea that it should be possible to vary parsing algorithm, feature model and learning method independently of each other. As the basis for this architecture we have used the theoretical framework for inductive dependency parsing presented by Nivre (2006). The architecture has been realized in the MaltParser software, in which it is possible to define complex feature models in a special description language. In this thesis we place particular emphasis on describing how we have integrated the learning method support vector machines (SVM).

MaltParser is validated with three series of experiments, using data from three languages (Chinese, English and Swedish). In the first series of experiments we check whether the implementation realizes the underlying architecture. The experiments show that MaltParser clearly outperforms a trivial method for dependency parsing (a baseline) and that the basic requirements on well-formed dependency graphs are fulfilled. Moreover, the experiments show that it is possible to vary parsing algorithm, feature model and learning method independently. The second series of experiments focuses on the special properties of the SVM interface. The experiments show that it is possible to reduce learning and parsing time without losing parsing accuracy, by splitting the training data according to the part-of-speech tag of the next word in the current parser configuration. The third and final series of experiments presents an empirical study that compares SVM with memory-based learning (MBL). The study uses five feature models, where all combinations of language, learning method and feature model have undergone extensive parameter optimization. The experiments show that SVM outperforms MBL for more complex and lexicalized feature models with respect to parsing accuracy. There are also some indications that SVM, with a splitting strategy, can parse a text faster than MBL. For Swedish we can report the highest parsing accuracy so far, and for Chinese and English the results are close to the best that have been reported.

Acknowledgments

Unfortunately, my parents Gert and Karin cannot witness the completion of my licentiate thesis, but I know that they would be very proud of me. They always believed in me and supported me in whatever I wanted to do. I want to thank my supervisor Joakim Nivre for all fruitful discussions, advice and fun times when we developed MaltParser, and I am looking forward to the development of the next version. I especially want to thank Jens Nilsson for the conversion of all the data used in this thesis into dependency structures and for the MaltEval tool, which made it easier to validate the MaltParser system. For the conversion of the Chinese data we used the head rules made by Yuan Ding at the University of Pennsylvania. I also want to thank all my colleagues in computer science at Växjö University for making it fun to go to work every day. I especially want to thank Morgan Ericsson for many ideas and extra computer power.

Finally, I want to thank my love Kristina for all the support while I wrote this thesis.

Contents

Abstract
Sammanfattning
Acknowledgments
1 Introduction
  1.1 Research Problem and Aims
  1.2 Outline of the Thesis
  1.3 Division of Labor
2 Background
  2.1 Requirements on Text Parsing
  2.2 Dependency Graphs
  2.3 Inductive Dependency Parsing
    2.3.1 Deterministic Dependency Parsing
    2.3.2 History-Based Models
    2.3.3 Discriminative Learning Methods
  2.4 Related Work
3 MaltParser
  3.1 Architecture
    3.1.1 Parser
    3.1.2 Guide
  3.2 Implementation
    3.2.1 Input and Output
    3.2.2 Parser Kernel
    3.2.3 Parser
    3.2.4 Guide
4 Experiments
  4.1 Data Sets
    4.1.1 Swedish
    4.1.2 English
    4.1.3 Chinese
  4.2 Evaluation Metrics
  4.3 Feature Models
  4.4 Experiment I: Validation
    4.4.1 Experimental Setup
    4.4.2 Results and Discussion
  4.5 Experiment II: LIBSVM Interface
    4.5.1 Experimental Setup
    4.5.2 Results and Discussion
  4.6 Experiment III: Comparison of MBL and SVM
    4.6.1 Experimental Setup
    4.6.2 Results and Discussion
5 Conclusion
  5.1 Main Results
  5.2 Future Work
Bibliography

1 Introduction

Syntactic parsing is an important component of many applications of natural language processing. In this thesis, we regard parsing as the process of mapping sentences in unrestricted natural language text to their syntactic representations. Furthermore, the program which performs this process is called a syntactic parser, or simply parser. The syntactic structure is formalized with a syntactic representation such as phrase structure or dependency structure. Parsing a sentence with a phrase structure grammar or context-free grammar recursively decomposes it into constituents or phrases, and in that way a phrase structure tree is created with relationships between words and phrases. By contrast, with dependency structure representations, the goal of parsing a sentence is to create a dependency graph consisting of lexical nodes linked by binary relations called dependencies. A dependency relation connects words, with one word acting as head and the other as dependent. In this thesis, we will concentrate on parsing with dependency representations.

Data-driven methods in natural language processing have been used for many tasks in the past decade, and syntactic parsing is one of them. Statistical parsing is usually based on nondeterministic parsing techniques in combination with generative probabilistic models that provide an n-best ranking of the set of candidate analyses derived by the parser (Collins 1997; Collins 1999; Charniak 2000). Discriminative models can be used to enhance these parsers by reranking the analyses output by the parser (Johnson et al. 1999; Collins and Duffy 2005; Charniak and Johnson 2005).

Nondeterministic parsing has been the mainstream approach, but it has also been shown that deterministic parsing can be performed with fairly high accuracy, especially in dependency-based parsing (Kudo and Matsumoto 2000a; Yamada and Matsumoto 2003; Nivre et al. 2004; Isozaki et al. 2004; Cheng et al. 2005a), but also in constituent-based parsing (Sagae and Lavie 2005). The main idea is to guide the parser with a classifier trained on treebank data, using a greedy parsing algorithm that approximates a globally optimal solution by making a series of locally optimal choices. A deterministic parser usually uses a form of history-based feature model (Black et al. 1992; Magerman 1995; Ratnaparkhi 1997) to create a representation that a classifier can use to predict the next parser state. This is also the approach assumed in this thesis.

The availability of large syntactically annotated corpora, also known as treebanks, is essential when constructing data-driven parsers, and one of the potential advantages of such parsers is that they can easily be ported to new languages. A problem is that many data-driven parsers are overfitted to a particular language, usually English. For example, Corazza et al. (2004) report increased error rates of 15–18% when using two statistical parsers developed for English to parse Italian. We suggest that a data-driven parser needs to be designed for flexible reconfiguration to increase its portability to other languages. A user should be able to experiment with several parsing algorithms, feature models and learning methods.

1.1 Research Problem and Aims

The main research problem for the doctoral thesis is to study the influence of different factors on the accuracy and efficiency of data-driven dependency parsing. This study requires a broad evaluation of the parsing system, where we perform extensive feature selection and parameter tuning to optimize the feature models and classifiers for many languages and several parsing algorithms.

For the licentiate thesis we will restrict the research problem to the design, implementation and validation of an architecture for data-driven dependency parsing of unrestricted natural language text. The validation can be seen as a pilot evaluation that will determine future directions. However, we will also obtain experimental results that have a direct bearing on the long-term research problems.

We present a software architecture that should be able to handle different parsing algorithms, feature models and learning methods, for both learning and parsing. When using the implementation of this architecture, the user should be able to vary these parameters independently in a convenient way.

The choice of parsing algorithm influences how the syntactic structure will be built. In the learning phase this will affect how the training data is generated, and in the parsing phase which structures are permissible. It should be easy to add new parsing algorithms to the architecture, provided that they fulfill certain well-defined requirements.

Linguistic knowledge of the language is important when defining the structure of the feature model, in other words which linguistic features should be used to predict parsing actions. It should be easy to define a new feature model without reprogramming the system. A feature model should be defined in an appropriate feature specification language so that it can be loaded when it is required.

Given a set of training instances, where each instance is a fingerprint of the current state of the parser, as specified by the feature model, together with the transition to the next parser state, the task of the learner is to induce a model at learning time. At parsing time, this model is then used for predicting the next parser state. This task can easily be formulated as a classification task, for which discriminative learning methods are well suited.

The architecture is based on the theoretical framework of inductive dependency parsing by Nivre (2006) and has been realized in a system called MaltParser (Nivre et al. 2006), which in the current version supports two parsing algorithms, in several versions, and two learning methods (MBL and SVM), for which complex feature models can be defined in a special description language. The implementation of the MaltParser system has been joint work together with Joakim Nivre. MaltParser was first equipped with an interface to a memory-based learner called TiMBL (Daelemans and Van den Bosch 2005), and Nivre (2006) contains an extensive evaluation of memory-based dependency parsing using the parsing algorithm defined in Nivre (2003). In order to validate the generality and flexibility of the architecture, we therefore have to extend the parser with an interface to another learner and implement an additional deterministic parsing algorithm. We have chosen to use SVM as the learning method, because it has been proven to give good results for similar tasks (Kudo and Matsumoto 2000b; Yamada and Matsumoto 2003; Sagae and Lavie 2005). For the parsing algorithm, we have chosen the incremental algorithm described in Covington (2001).

Using this new implementation, we have performed three sets of experiments, designed to answer three essential questions:

1. Validation of the implementation: Does MaltParser realize the underlying architecture, so that it is possible to vary parsing algorithm, feature model and learning method independently?

2. Investigation of the SVM interface: How do the special properties of the SVM interface affect parsing accuracy and time efficiency? How can learning and parsing efficiency be improved without sacrificing accuracy?

3. Comparison of MBL and SVM: Which of the learning methods is best suited for the task of inductive labeled dependency parsing, taking both parsing accuracy and time efficiency into account?

Apart from answering these questions, we will try to identify future directions that hopefully will be useful for the long-term research problems.

1.2 Outline of the Thesis

In this introductory chapter, we have tried to outline the long-term research problem and the specific aims of the licentiate thesis. The structure of the remaining chapters is as follows.

Chapter 2, Inductive Dependency Parsing

Chapter 2 reviews the background material for this thesis. We define the problem of parsing unrestricted natural language text and discuss different algorithms for dependency parsing. Furthermore, data-driven parsing and especially history-based models are discussed. The chapter continues with a description of the two machine learning methods used in the rest of the thesis: SVM and MBL. Finally, the chapter ends with a section which briefly presents related work.

Chapter 3, MaltParser

Chapter 3 presents an architecture for parsing unrestricted natural language text with dependency structures. The architecture is described in detail, with focus on the two main modules, Parser and Guide. The MaltParser system is an implementation of the architecture, and the chapter ends with a description of this system.

Chapter 4, Experiments

Chapter 4 starts with a presentation of the treebank data used for the experiments and an explanation of the evaluation criteria used to validate the implementation of the proposed architecture. An investigation of the three questions explained above is presented, based on extensive experiments.

Chapter 5, Conclusion

Chapter 5 contains the main conclusions and a summary of the main results of the thesis. The chapter ends with a discussion of directions for future research.

1.3 Division of Labor

As already stated, the design and implementation of MaltParser is joint work with Joakim Nivre. More specifically, the work has been divided as follows:

- The design of the architecture is joint work.
- The implementation of parsing algorithms, generic feature model handling and the memory-based learner is mainly the work of Joakim Nivre.
- The implementation of all other parts of the system, including the SVM learner, is mainly the work of Johan Hall.
2 Background

Syntactic parsing is used in many applications such as machine translation, information extraction and question answering. Applications dealing with unrestricted text need to handle all kinds of text, including grammatically correct text, ungrammatical text and foreign expressions. It is desirable that such an application produces some kind of analysis. Of course, if the input is garbage, it is most likely that the system will fail to create an interesting analysis, but the system should nevertheless do its best to produce an analysis. If these applications need a syntactic parser, it also needs to be able to handle unrestricted text, although we need to restrict the text to a certain natural language to be able to derive a meaningful syntactic representation. Nivre (2006) introduces the notion of text parsing to characterize this open-ended problem, which can only be evaluated with respect to empirical samples of a text language.¹

¹ The term text language does not exclude spoken language, but emphasizes that it is a language that occurs in real texts. In principle, the notion applies also to utterances in spoken dialogue.

Our approach to text parsing is dependency-based and data-driven. The goal of dependency-based text parsing is to construct a dependency graph for each sentence in a text. Figure 2.1 shows an example of a dependency graph, connecting the words in a Swedish sentence by binary relations labeled with dependency types (grammatical functions).

Data-driven methods comply well with the fact that text parsing uses empirical samples of a text language. A realistic approach is then to use some kind of supervised learning method that makes use of a treebank, which consists of syntactically annotated sentences. A problem with this approach is that it restricts us to languages that have at least one treebank. In addition, these treebanks are often annotated with constituency-based representations and therefore need to be converted to dependency-based representations.

Given that we have a treebank for a specific language, our approach is to induce a parser model at learning time and use this parser model to parse sentences. However, since it is problematic to use the dependency graph directly to construct such a model, we instead use a deterministic parsing algorithm to map a dependency graph to a transition sequence such that this transition sequence uniquely determines the dependency graph. An individual transition can be, for example, shifting a token onto a stack or adding an arc between two tokens. The transition system in itself is normally nondeterministic, and we therefore need a mechanism that resolves this nondeterminism. We use a discriminative learning method, such as SVM or MBL, to construct a classifier. Moreover, we use history-based feature models to extract vectors of feature-value pairs from the current parser state as training material for the classifier.

[Figure 2.1: Dependency graph for a Swedish sentence, converted from Talbanken. The sentence is "Cykelreglerna gäller också för mopedister ." ("Biking rules are valid also for moped riders ."); the tokens 1–6 are tagged nn.nom, vb.fin, ab, pp, nn.nom, mad, and the arcs are labeled SUB, ROOT, ADV, ADV, PR, IP.]

In this chapter, we review the necessary background for the design and implementation of MaltParser, focusing on the framework of inductive dependency parsing proposed by Nivre (2006). Most of the notation used by Nivre (2006) is also used here, but in some cases the notation has to be extended. The rest of the chapter is structured as follows. Section 2.1 describes the basic requirements on text parsing. Section 2.2 presents the necessary definitions of dependency graphs. Section 2.3 presents the parsing framework, including the deterministic parsing algorithms, the history-based feature models and the discriminative learning methods. Related work is discussed in Section 2.4.

2.1 Requirements on Text Parsing

We begin by defining a text as a sequence T = (x_1, ..., x_n) of sentences, where each sentence x_i = (w_1, ..., w_m) is a sequence of tokens and a token w_j is a sequence of characters, usually a word form. Given a text T, the task of text parsing is to derive the correct analysis y_i for every sentence x_i ∈ T. We assume that the text T contains sentences of a text language L that in our case is a natural language. This assumption entails that the text language is not a formal language and that parsing does not entail recognition. Instead, we see text parsing as an empirical approximation problem. Therefore, this approach is not well suited for grammar checking in a word-processing application, because it will try to find an analysis also for an ungrammatical sentence.

Given these definitions, we can define four basic requirements on a text parser (Nivre 2006):

Definition 2.1. A parser P should map a text T = (x_1, ..., x_n) in language L to well-formed syntactic representations (y_1, ..., y_n) in a way that satisfies the following requirements:

1. Robustness: P assigns at least one analysis y_i to every sentence x_i ∈ T.

2. Disambiguation: P assigns at most one analysis y_i to every sentence x_i ∈ T.

3. Accuracy: P assigns the correct analysis y_i to every sentence x_i ∈ T.

4. Efficiency: P processes every sentence x_i ∈ T in time and space that is polynomial in the length of x_i.

We want to create a parser that uses a parsing strategy that assigns at least one analysis to each sentence (Robustness) and at most one analysis (Disambiguation). The third requirement (Accuracy) is unrealistic in practice, but we will use it as an evaluation criterion in Chapter 4. In order to satisfy the fourth requirement, we will use deterministic parsing algorithms with at most quadratic time complexity and linear space complexity. We will use two parsing algorithms that have linear complexity (Nivre's arc-eager and arc-standard algorithms) and one that has quadratic complexity (Covington's algorithm). In the experiments, the Efficiency requirement will be an evaluation criterion that measures the time it takes to parse a text.

2.2 Dependency Graphs

Dependency parsing is based on syntactic representations built from binary relations between tokens (or words), labeled with syntactic functions or dependency types. We define such representations as dependency graphs:

Definition 2.2. Given a sentence x = (w_1, ..., w_n) and a set R = {r_0, r_1, ..., r_m} of dependency types, a dependency graph for a sentence x is a labeled directed graph G = (V, E, L), where:

1. V = Z_{n+1} = {0, 1, 2, ..., n}

2. E ⊆ V × V

3. L : E → R

A dependency graph consists of a set V of nodes, where a node is a non-negative integer (including n). Every positive node has a corresponding token in the sentence x, and we will use the term token node for these nodes (i.e., the token w_i corresponds to the token node i). In addition, there is a special root node 0, which is the root of the dependency graph and has no corresponding token in the sentence x. Furthermore, the set V⁺ denotes the set of token nodes, i.e., V⁺ = V − {0}. There is a practical advantage in using position indices instead of word forms to represent tokens (Maruyama 1990), which allows the use of the arithmetic relation < to order the nodes and ensures that every token has a unique node in the graph.

An arc (i, j) ∈ E connects two nodes i and j in the graph and represents a dependency relation where i is the head and j is the dependent. The notation i → j will be used for the pair (i, j) ∈ E, and i →* j for the reflexive and transitive closure, i.e., i →* j if and only if there is a path of zero or more arcs connecting i to j. Finally, the function L labels every arc i → j with a dependency type r ∈ R, and an arc with label r will be denoted i →_r j.

To be able to construct a dependency graph using a parsing algorithm, we usually have to define some basic constraints that a graph must satisfy.

Definition 2.3. A dependency graph G is well-formed if and only if the following constraints hold:

1. Root: The node 0 is a root, i.e., there is no node i such that i → 0.

2. Connectedness: G is weakly connected, i.e., for every node i there is some node j such that i → j or j → i.

3. Single-Head: Each node has at most one head, i.e., if i → j then there is no node k such that k ≠ i and k → j.

4. Acyclicity: G is acyclic, i.e., if i → j then not j →* i.

5. Projectivity: G is projective, i.e., if i → j then i →* k for every node k such that i < k < j or j < k < i.

A special root node makes it easier to comply with the second constraint, Connectedness, since it is always possible to hook up any node to the root, and in that way the graph will always be connected. Furthermore, with a root node we always know the entrance to the graph. The third constraint, Single-Head (sometimes called uniqueness), is commonly assumed in dependency grammar, although Hudson (1984) allows multiple heads to capture certain transformational phenomena, where a single token is connected to more than one position in the sentence. The fourth constraint, Acyclicity, together with the first three constraints, entails that the graph is a rooted tree. These assumptions make it simpler to construct parsing algorithms that build dependency trees automatically.

The last constraint, Projectivity, is more controversial, and most dependency grammars allow non-projective graphs, because non-projective representations are able to capture non-local dependencies. There exist several treebanks that contain non-projective structures, such as the Prague Dependency Treebank of Czech (Hajič et al. 2001) and the Danish Dependency Treebank (Kromann 2003). We will assume the constraint Projectivity here, because the parsing algorithms used in this thesis are limited to projective structures and the treebanks used only contain projective structures. Moreover, when dealing with non-projective data, it is possible to projectivize the training data and recover non-projective dependencies by applying an inverse transformation after parsing in a post-processing step (Nivre and Nilsson 2005).

Figure 2.1 shows a labeled projective dependency graph for a Swedish sentence, where each word of the sentence is tagged with its part-of-speech and each arc is labeled with a dependency type.²

² The dependency types used in Figure 2.1 are described in section 4.1.1.
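The constraints of Definition 2.3 are straightforward to check operationally. Below is a minimal Python sketch (an illustration added here, not code from the thesis), for a graph stored as a head function, i.e., a mapping head[j] = i for every token node j; Root and Single-Head then hold by construction, and Connectedness follows from Acyclicity, since every chain of heads must end at the root.

def is_acyclic(head):
    # Acyclicity: following the chain of heads from any token node must
    # reach the root node 0 without revisiting a node.
    for j in head:
        seen, k = {j}, head[j]
        while k != 0:
            if k in seen:
                return False
            seen.add(k)
            k = head[k]
    return True

def is_projective(head):
    # Projectivity: if i -> j, then i ->* k for every k between i and j.
    # Assumes the graph has already been checked for acyclicity.
    def dominates(i, k):
        while k not in (i, 0):
            k = head[k]
        return k == i
    return all(dominates(i, k)
               for j, i in head.items()
               for k in range(min(i, j) + 1, max(i, j)))

# The graph of Figure 2.1: head[1] = 2 (SUB), head[2] = 0 (ROOT), etc.
heads = {1: 2, 2: 0, 3: 2, 4: 2, 5: 4, 6: 2}
print(is_acyclic(heads) and is_projective(heads))  # True: well-formed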

2.3 Inductive Dependency Parsing

The framework of inductive dependency parsing, as characterized by Nivre (2006), is based on three essential elements:

1. Deterministic parsing algorithms for building dependency graphs (Kudo and Matsumoto 2002; Yamada and Matsumoto 2003; Nivre 2003)

2. History-based feature models for predicting the next transition from one parser configuration to another (Black et al. 1992; Magerman 1995; Ratnaparkhi 1997; Collins 1999)

3. Discriminative learning methods to map histories to transitions (Veenstra and Daelemans 2000; Kudo and Matsumoto 2002; Yamada and Matsumoto 2003; Nivre et al. 2004)

In this section we will discuss these three elements. Section 2.3.1 presents two deterministic dependency-based parsing algorithms. Section 2.3.2 describes how history-based models can be used for predicting the next transition from one parser configuration to another. Finally, Section 2.3.3 explains how we can use discriminative learning methods for inducing a classifier that maps parser configurations to transitions, including a brief description of the two learning methods used in the experiments: SVM and MBL.

2.3.1 Deterministic Dependency Parsing

Mainstream approaches to data-driven text parsing are based on nondeterministic parsing techniques, but the disambiguation can be performed deterministically, using a greedy parsing algorithm that approximates a globally optimal solution by making a sequence of locally optimal choices (see section 2.4 for more details on related work in this area). The experiments in Chapter 4 will use two parsing algorithms, called Nivre's algorithm and Covington's algorithm, and both algorithms come in two versions. We begin by defining parser configurations that can be used by both algorithms, following Nivre (2006):

Definition 2.4. Given a set R = {r_0, r_1, ..., r_m} of dependency types and a sentence x = (w_1, ..., w_n), a parser configuration for x is a quintuple c = (σ, τ, υ, h, d), where:

1. σ is a stack of partially processed token nodes i (1 ≤ i ≤ j for some j ≤ n).

2. τ is a list of remaining input token nodes i (k ≤ i ≤ n for some k > j).

3. υ is a stack of token nodes i occurring between the token on top of the stack σ_j and the next input token τ_k (j < i < k).

4. h : V_x⁺ → V_x is a head function from token nodes to nodes.

5. d : V_x⁺ → R is a label function from token nodes to dependency types.

6. For every token node i ∈ V_x⁺, d(i) = r_0 only if h(i) = 0.

The definition of a parser configuration introduces three data structures: a stack σ, a list τ and a stack υ. The first two data structures (the stack σ and the list τ) are included in the definition of Nivre (2006). Here the definition is extended with a stack υ, which we call the context stack and which is used by Covington's algorithm. In order to define the parsing algorithms later in this section, we will represent all three data structures as lists. To be able to use individual components in these lists, we will use j|τ to represent a list of input tokens with head j and tail τ, while σ|i and υ|i represent stacks with top i and tails σ and υ. An empty stack/list is represented by ε.

The symbols V_x⁺ and V_x are used to indicate that V⁺ and V are the nodes for the sentence x. The head function h defines the partially built dependency graph. For every token node i there is a syntactic head h(i) = j. If the token node i is not yet attached to a head, the special root node h(i) = 0 is used. Finally, the label function d labels the partially built dependency structure, where every token node i is assigned a dependency type r_j using the label function d(i) = r_j (d(i) = r_0 is used for token nodes that are not yet attached). We establish a connection between parser configurations and dependency graphs in the following way (Nivre 2006):

Definition 2.5. A parser configuration c = (σ, τ, υ, h, d) for x defines the dependency graph G_c = (V_x, E_c, L_c), where:

1. E_c = {(i, j) | h(j) = i}

2. L_c = {((i, j), r) | h(j) = i, d(j) = r}

For the functions h and d, we will use the notation f[x ↦ y]: if f(x) = y′, then f[x ↦ y] = (f − {(x, y′)}) ∪ {(x, y)}.
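In programming terms, f[x ↦ y] is a functional update of a finite map that leaves the map unchanged everywhere else; a one-line Python analogue (an illustration, not notation from the thesis):

def updated(f, x, y):
    # f[x -> y]: agrees with f everywhere except that x now maps to y
    return {**f, x: y}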

Definition 2.6. A parser configuration c for the sentence x = (w_1, ..., w_n) is initial if and only if it has the form c = (ε, (1, ..., n), ε, h_0, d_0), where:

1. h_0(i) = 0 for every i ∈ V_x⁺.

2. d_0(i) = r_0 for every i ∈ V_x⁺.

When the parser begins to parse a sentence, the two stacks σ and υ are empty and all the token nodes of the sentence are in the list τ. In the beginning, all token nodes are dependents of the special root node 0 and labeled with the special label r_0. The parser terminates the parsing of a sentence when the following condition is met:

Definition 2.7. A parser configuration c for the sentence x = (w_1, ..., w_n) is terminal if and only if it has the form c = (σ, ε, υ, h, d) (for arbitrary σ, υ, h and d).

The parser processes the input from left to right and terminates whenever the list of input tokens is empty. The set C will denote all possible configurations, and C_n the set of non-terminal configurations, i.e., any configuration c = (σ, τ, υ, h, d) where τ ≠ ε. A transition from a non-terminal configuration to a new configuration is a partial function t : C_n → C.

We will define a transition system for each version of the algorithms, and these transition systems are nondeterministic. Hence, there will be more than one transition applicable to a given configuration. An oracle o : C_n → (C_n → C) is used to overcome this nondeterminism (Kay 2000). For each nondeterministic choice point, the parsing algorithm will ask the oracle to predict the next transition. In this section we will consider the oracle as a black box which always knows the correct transition. In section 2.3.2, we will see that we can approximate this oracle by inducing a classifier.

Nivre's algorithm. This parsing algorithm was first proposed for unlabeled dependency parsing by Nivre (2003) and was extended to labeled dependency parsing by Nivre et al. (2004). A sentence x = (w_1, ..., w_n) is parsed by the algorithm Parse-Nivre in the following way:

Parse-Nivre(x = (w_1, ..., w_n))
   c ← (ε, (1, ..., n), ε, h_0, d_0)
   while c = (σ, τ, υ, h, d) is not terminal
      if σ = ε
         c ← Shift(c)
      else
         c ← [o(c)](c)
   G ← (V_x, E_c, L_c)
   return G

The algorithm will perform the Shift transition if the stack is empty, and otherwise let the oracle o predict the next transition o(c), as long as the parser remains in a non-terminal configuration c ∈ C_n. The Shift transition pushes the next input token i onto the stack σ. When the terminal configuration is reached, the dependency graph is returned.
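The control structure of Parse-Nivre is easy to state in code. The following is a minimal Python sketch (an illustration under the representation of Definition 2.4, not the MaltParser implementation); the oracle is assumed to be a function supplied from outside that applies one transition to the configuration:

def parse_nivre(n, oracle):
    # Initial configuration (Definition 2.6): empty stack, all token nodes
    # in tau, every token attached to the root 0 with the special label r0.
    stack, tau = [], list(range(1, n + 1))
    h = {i: 0 for i in tau}
    d = {i: "r0" for i in tau}
    while tau:                        # terminal when tau is empty (Def. 2.7)
        if not stack:
            stack.append(tau.pop(0))  # deterministic Shift on an empty stack
        else:
            oracle(stack, tau, h, d)  # oracle-predicted transition
    return h, d                       # h and d define G_c (Definition 2.5)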

The algorithm comes in two versions with two transition systems: an arc-eager and an arc-standard version. The arc-eager version uses four transitions, two of which are parameterized by a dependency type r ∈ R. The transition system updates the parser configuration as follows (Nivre 2006):

Definition 2.8. For every r ∈ R, the following transitions are possible:

1. Shift: (σ, i|τ, ε, h, d) → (σ|i, τ, ε, h, d)

2. Reduce: (σ|i, τ, ε, h, d) → (σ, τ, ε, h, d), if h(i) ≠ 0

3. Right-Arc(r): (σ|i, j|τ, ε, h, d) → (σ|i|j, τ, ε, h[j ↦ i], d[j ↦ r]), if h(j) = 0

4. Left-Arc(r): (σ|i, j|τ, ε, h, d) → (σ, j|τ, ε, h[i ↦ j], d[i ↦ r]), if h(i) = 0

The transition Shift (SH) shifts (pushes) the next input token i onto the stack σ. This is the correct action when the head of the next word is positioned to the right of the next word, or when the next word is a root. The transition Reduce (RE) reduces (pops) the token i on top of the stack σ. It is important to ensure that the parser does not pop the top token if it has not been assigned a head, since it will otherwise be left unattached.

The Right-Arc transition (RA) adds an arc from the token i on top of the stack σ to the next input token j, i.e., i →_r j, and involves pushing j onto the stack. Finally, the transition Left-Arc (LA) adds an arc from the next input token j to the token i on top of the stack σ, i.e., j →_r i, and involves popping i from the stack. This transition is only allowed when the top token i on the stack still has the special root node 0 as its head. Here we make use of the assumption of projectivity: because we know that the top token i cannot have any more left or right dependents, it can be popped.

Nivre's arc-eager algorithm is guaranteed to terminate after at most 2n transitions, given a sentence of length n (Nivre 2003). Furthermore, it always produces a dependency graph that is acyclic and projective. The correct transition sequence for the Swedish sentence shown in Figure 2.1, using Nivre's arc-eager algorithm, is as follows:

               (ε, (1, ..., 6), ε, h_0, d_0)
   D  SH      ⇒ ((1), (2, ..., 6), ε, h_0, d_0)
   N  LA(SUB) ⇒ (ε, (2, ..., 6), ε, h_1 = h_0[1↦2], d_1 = d_0[1↦SUB])
   D  SH      ⇒ ((2), (3, ..., 6), ε, h_1, d_1)
   N  RA(ADV) ⇒ ((2, 3), (4, 5, 6), ε, h_2 = h_1[3↦2], d_2 = d_1[3↦ADV])
   N  RE      ⇒ ((2), (4, ..., 6), ε, h_2, d_2)
   N  RA(ADV) ⇒ ((2, 4), (5, 6), ε, h_3 = h_2[4↦2], d_3 = d_2[4↦ADV])
   N  RA(PR)  ⇒ ((2, 4, 5), (6), ε, h_4 = h_3[5↦4], d_4 = d_3[5↦PR])
   N  RE      ⇒ ((2, 4), (6), ε, h_4, d_4)
   N  RE      ⇒ ((2), (6), ε, h_4, d_4)
   N  RA(IP)  ⇒ ((2, 6), ε, ε, h_5 = h_4[6↦2], d_5 = d_4[6↦IP])

The first row presents the initial parser configuration with an empty stack and h_0(i) = 0 and d_0(i) = r_0 for every node i ∈ V. The second row shows the parser configuration after the Shift transition has been executed. The left column tells us whether the transition is deterministic (D) or nondeterministic (N), in other words whether the oracle o is used or not. For example, the second row can only be a Shift transition because the stack is empty (D), and the third row is a nondeterministic transition (N).
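The four arc-eager transitions of Definition 2.8 translate directly into operations on the stack and the input list. A minimal Python sketch in the same style as the parse_nivre sketch above (an illustration, not the MaltParser implementation; h[i] == 0 encodes that token i is still attached to the root):

def shift(stack, tau, h, d):
    stack.append(tau.pop(0))      # push the next input token onto the stack

def reduce(stack, tau, h, d):
    assert h[stack[-1]] != 0      # only pop a token that already has a head
    stack.pop()

def right_arc(stack, tau, h, d, r):
    i, j = stack[-1], tau[0]
    assert h[j] == 0
    h[j], d[j] = i, r             # add the arc i ->_r j ...
    stack.append(tau.pop(0))      # ... and push j onto the stack

def left_arc(stack, tau, h, d, r):
    i, j = stack[-1], tau[0]
    assert h[i] == 0
    h[i], d[i] = j, r             # add the arc j ->_r i ...
    stack.pop()                   # ... and pop i from the stack

An oracle for parse_nivre then simply chooses which of these functions to call, and with which dependency type r, in each non-terminal configuration.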

The arc-standard version uses a strict bottom-up processing, as in traditional shift-reduce parsing. The algorithms of Kudo and Matsumoto (2002), Yamada and Matsumoto (2003) and Cheng et al. (2005a) use the arc-standard strategy, but also allow multiple passes over the input.

The arc-standard version uses a transition system similar to the arc-eager version, but has only three transitions: Shift, Left-Arc and Right-Arc (no Reduce). The first two transitions, Shift and Left-Arc, are applied in exactly the same way as for the arc-eager version. The transition system is defined as follows:

1. Shift: (σ, i|τ, ε, h, d) → (σ|i, τ, ε, h, d)

2. Right-Arc(r): (σ|i, j|τ, ε, h, d) → (σ, i|τ, ε, h[j ↦ i], d[j ↦ r]), if h(j) = 0

3. Left-Arc(r): (σ|i, j|τ, ε, h, d) → (σ, j|τ, ε, h[i ↦ j], d[i ↦ r]), if h(i) = 0

Instead of pushing the next token j onto the stack σ, Right-Arc moves the topmost token i on the stack back to the list of remaining input tokens τ, where it replaces the token j as the next token. The transition sequence for the same sentence using Nivre's arc-standard algorithm is as follows:

               (ε, (1, ..., 6), ε, h_0, d_0)
   D  SH      ⇒ ((1), (2, ..., 6), ε, h_0, d_0)
   N  LA(SUB) ⇒ (ε, (2, ..., 6), ε, h_1 = h_0[1↦2], d_1 = d_0[1↦SUB])
   D  SH      ⇒ ((2), (3, ..., 6), ε, h_1, d_1)
   N  RA(ADV) ⇒ (ε, (2, 4, ..., 6), ε, h_2 = h_1[3↦2], d_2 = d_1[3↦ADV])
   D  SH      ⇒ ((2), (4, ..., 6), ε, h_2, d_2)
   N  SH      ⇒ ((2, 4), (5, 6), ε, h_2, d_2)
   N  RA(PR)  ⇒ ((2), (4, 6), ε, h_3 = h_2[5↦4], d_3 = d_2[5↦PR])
   N  RA(ADV) ⇒ (ε, (2, 6), ε, h_4 = h_3[4↦2], d_4 = d_3[4↦ADV])
   D  SH      ⇒ ((2), (6), ε, h_4, d_4)
   N  RA(IP)  ⇒ (ε, (2), ε, h_5 = h_4[6↦2], d_5 = d_4[6↦IP])
   D  SH      ⇒ ((2), ε, ε, h_5, d_5)

We can see that the transitions are performed in a different order; for instance, Right-Arc(PR) is executed before Right-Arc(ADV), compared to the arc-eager version.
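The different order follows directly from the arc-standard Right-Arc, which in the same sketch style looks as follows (again an illustration, not the MaltParser code):

def right_arc_standard(stack, tau, h, d, r):
    i, j = stack.pop(), tau.pop(0)
    assert h[j] == 0
    h[j], d[j] = i, r    # add the arc i ->_r j ...
    tau.insert(0, i)     # ... and move i back as the next input token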

Covington's algorithm. Covington (2001) proposes several incremental parsing algorithms for dependency parsing. Two of the algorithms are the projective algorithm and the exhaustive left-to-right search algorithm. The first algorithm uses a head list with words that do not yet have heads and a word list with all words encountered so far. We will not use these two data structures; instead we will describe these two algorithms using the data structures defined by the parser configuration: the stacks σ and υ, and the list τ. In fact, we will regard these two algorithms as one algorithm with two transition systems, or as two versions of the same algorithm. We will call the second version the unrestricted version, because it allows dependency graphs that are non-projective and cyclic. Both versions have quadratic complexity, since they proceed by trying to link each new token to each preceding token. It is also possible to define other versions, for example a version that conforms to the Acyclicity requirement but allows non-projective graphs, but this will not be done in this thesis. The adapted version of Covington's algorithm is described as follows:

Parse-Covington(x = (w_1, ..., w_n))
   c ← (ε, (1, ..., n), ε, h_0, d_0)
   while c = (σ, τ, υ, h, d) is not terminal
      Done ← false
      while σ ≠ ε and ¬Done
         c ← [o(c)](c)
      while υ ≠ ε
         Push(Pop(υ), σ)
      Push(First(τ), σ)
   G ← (V_x, E_c, L_c)
   return G

The algorithm begins by initializing the configuration with two empty stacks and all token nodes in the list τ, in the same way as Nivre's algorithm. As long as the parser remains in a non-terminal configuration, it will first iterate as long as the stack σ is not empty and the flag Done is false. The Done flag is only used by the projective version, to indicate that it can proceed to the next token without an empty stack. Before it can proceed with the next input token, the algorithm must move all unattached tokens in the context stack υ back to the stack σ. The Push function pushes a token onto a stack and the Pop function pops a token from a stack. Finally, the next input token is pushed onto the stack σ, using the function First to retrieve the first token in a list.

The unrestricted version uses three transitions, and these are defined in the following way:

1. Reduce: (σ|i, τ, υ, h, d) → (σ, τ, υ|i, h, d)

2. Right-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[j ↦ i], d[j ↦ r])

3. Left-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[i ↦ j], d[i ↦ r])

All three transitions move the top token of the stack σ to the stack υ. The Right-Arc and Left-Arc transitions in addition add an arc i →_r j or an arc j →_r i, respectively.

The projective version makes use of the fact that it should build a projective graph, which allows the algorithm to continue with the next input token without exploring all combinations that could make the graph non-projective. The transition system is redefined as follows:

1. Reduce: (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h, d); Done ← true

2. Right-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[j ↦ i], d[j ↦ r]); Done ← true, if h(j) = 0

3. Left-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ, h[i ↦ j], d[i ↦ r]), if h(i) = 0

The Reduce transition is exactly the same as for the unrestricted version, except that it sets the Done flag to true, in order to indicate that the remaining tokens in stack σ cannot be linked to the token j, since this would produce a non-projective graph. The Right-Arc transition makes use of the fact that the arc i →_r j covers the tokens between the top token and the next token; to prevent the graph from becoming non-projective, the top token i of stack σ is popped and then pushed onto the context stack υ, and the flag Done is assigned the value true, for the same reason as in the Reduce transition. The Left-Arc transition adds an arc j →_r i; because i cannot be linked to another token, it is popped from the stack σ.
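Both versions share the control structure of Parse-Covington, which in the same Python sketch style (an illustration, not the MaltParser implementation) can be written with the oracle applying one transition and reporting whether it set the Done flag:

def parse_covington(n, oracle):
    stack, tau, v = [], list(range(1, n + 1)), []   # v is the context stack
    h = {i: 0 for i in tau}
    d = {i: "r0" for i in tau}
    while tau:
        done = False
        while stack and not done:
            done = oracle(stack, tau, v, h, d)  # Reduce / Right-Arc / Left-Arc
        while v:
            stack.append(v.pop())               # move context tokens back
        stack.append(tau.pop(0))                # shift the next input token
    return h, d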

2.3.2 History-Based Models

In section 2.3.1 we defined a set C of possible parser configurations, and for each version of the parsing algorithms we defined a transition system that is nondeterministic. Furthermore, we introduced an oracle o : C_n → (C_n → C), which the parsing algorithm uses to get the correct transition. If it is possible to derive the correct transitions from syntactically annotated sentences, we can use these as training data to approximate such an oracle through inductive learning. In other words, we define a one-to-one mapping from an input string x and a dependency graph G to a sequence of transitions S = (t_1, ..., t_m) such that S uniquely determines G. A transition t_i is dependent on all previously made transitions (t_1, ..., t_{i−1}) and all available information about these transitions, called the history. The history H_i = (t_1, ..., t_{i−1}) corresponds to some partially built structure, and we also include static properties that are kept constant during the parsing of a sentence, such as the word form and part-of-speech of a token.

The basic idea is thus to train a classifier that approximates an oracle, given that a treebank is available. We will call the approximated oracle a guide (Boullier 2003), because the guide does not guarantee that the transition is correct. The history H_i = (t_1, ..., t_{i−1}) contains complete information about all previous transitions. All this information is intractable for training a classifier. Instead we can use history-based feature models for predicting the next transition. History-based feature models were first introduced by Black et al. (1992) and have been used extensively in data-driven parsing (Magerman 1995; Ratnaparkhi 1997; Collins 1999). To make it tractable, the history H_i is replaced by a feature vector defined by a feature model Φ = (φ_1, ..., φ_p), where each feature φ_i is a function that identifies some significant property of the history H_i and/or the input string x. To simplify notation, we will write Φ(H_i, x) to denote the application of the feature vector (φ_1, ..., φ_p) to H_i and x, i.e., Φ(H_i, x) = (φ_1(H_i, x), ..., φ_p(H_i, x)).

At learning time the parser derives the correct transition by using an oracle function o applied to a gold standard treebank. For each transition it provides the learner with a training instance (Φ(H_i, x), t_i), where Φ(H_i, x) is the current vector of feature values and t_i is the correct transition. A set of training instances I is then used by the learner to induce a parser model, using a supervised learning method.

At parsing time the parser uses the parser model, as a guide, to predict the next transition; now the vector of feature values Φ(H_i, x) is the input and the transition t_i is the output of the guide. Section 2.3.3 describes how we can train a classifier that makes this prediction.
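As a concrete illustration, a very small feature model over the current configuration might use the part-of-speech and word form of the token on top of the stack and of the next input token. The sketch below is a hypothetical model for illustration only; the actual feature models are defined in the specification language described in Chapter 3:

def phi(stack, tau, pos, word):
    # Phi(H_i, x): one feature function per position, with "NIL" for
    # positions that are empty in the current configuration.
    top = stack[-1] if stack else None
    nxt = tau[0] if tau else None
    return (pos.get(top, "NIL"), word.get(top, "NIL"),
            pos.get(nxt, "NIL"), word.get(nxt, "NIL"))

At learning time, each pair (phi(...), t_i) is added to the training set I; at parsing time, phi(...) is the input from which the guide predicts the next transition.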

2.3.3 Discriminative Learning Methods

The learning problem is to induce a classifier from a set of training instances I relative to a specific feature model Φ by using a learning algorithm. In this section, we will describe two discriminative learning methods, SVM and MBL, that can be used for this classification task.

In general, classification is the task of predicting the class y given a variable x, which can be accomplished by probabilistic methods, and it is common to divide these methods into two classes: generative and discriminative. Generative methods use Bayes' rule to obtain P(y | x) by estimating the joint distribution P(x, y). By contrast, discriminative methods make no attempt to model underlying distributions and instead estimate P(y | x) directly. We will use two discriminative methods for the learning task: SVM and MBL.

Support Vector Machines. In the last decade, there has been a growing interest in support vector machines (SVM), which were proposed by Vladimir Vapnik at the end of the seventies (Vapnik 1979). SVM is based on the idea that two linearly separable classes, the positive and negative samples in the training data, can be separated by a hyperplane with the largest margin. It has been shown that SVMs give good generalization performance in various research areas, such as face detection (Osuna et al. 1997) and pedestrian detection (Oren et al. 1997). Within natural language processing they have been used extensively in, for example, text categorization (Joachims 1998), chunking (Kudo and Matsumoto 2001) and syntactic parsing (Yamada and Matsumoto 2003).

Given a data set of ℓ instance-label pairs I = {(x_i, y_i)}, i = 1, ..., ℓ, where x_i ∈ Rⁿ and y_i ∈ {−1, 1}, x_i is the feature vector of the i-th sample, represented by an n-dimensional vector x_i = (f_1, ..., f_n), and y_i is the class label of the i-th sample, which belongs to either the positive (+1) or the negative (−1) class. The feature vector x_i will in our case be the feature vector defined by Φ(H_i, x), and the class label y_i will be the transition t_i, but we need a method that handles multiple class labels (more about that later in this section). The idea is to estimate a vector w and a scalar b which maximize the distance of any data point from the hyperplane defined by w · x + b = 0. The goal of the SVM is to find the solution of the following optimization problem (Kudo and Matsumoto 2000a; Burges 1998):

   Minimize:    L(w) = (1/2)‖w‖²
   Subject to:  y_i(w · x_i + b) ≥ 1, ∀i = 1, ..., ℓ        (2.1)

[Figure 2.2: A linear support vector machine]

In other words, the SVM method tries to find the hyperplane that separates the training data into two classes with the largest margin. Figure 2.2 illustrates two possible hyperplanes which correctly separate the training data into two classes; the left hyperplane has the largest margin between the two classes.

The data in Figure 2.2 are easy to separate into two classes, but in practice the data may be noisy and therefore not linearly separable. One solution is to allow some misclassifications by introducing a penalty parameter C, which defines the trade-off between the training error and the magnitude of the margin.

SVM can be extended to solve problems that are not linearly separable. The feature vector x_i is mapped to a higher-dimensional space by a function φ, which makes it possible to carry out non-linear classification. The optimization problem can be rewritten in a dual form, which is done with a so-called kernel function K(x_i, x_j) ≡ φ(x_i)ᵀφ(x_j) (Kudo and Matsumoto 2001; Vapnik 1998). There are many kernel functions, but the most common are:

   polynomial:                   K(x_i, x_j) = (γ x_iᵀ x_j + r)^d, γ > 0
   radial basis function (RBF):  K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²), γ > 0
   sigmoid:                      K(x_i, x_j) = tanh(γ x_iᵀ x_j + r)

where γ, r and d denote different kernel parameters (Hsu et al. 2004).

SVM is in its basic form a binary classifier, but many learning problems have to deal with more than two classes. To make SVM handle multi-class classification, several binary classifiers are combined. For multi-class classification, we can choose between the methods one-against-all and all-against-all. Given that we have n classes, the one-against-all method trains n classifiers to separate each class from the rest, while the all-against-all method trains n(n − 1)/2 classifiers, one for each pair of classes (Vural and Dy 2004). A voting mechanism or some other measure is used to discriminate across all these classifiers when classifying a new instance.
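To illustrate how such a multi-class SVM can serve as a guide, the sketch below trains an RBF-kernel classifier over symbolic feature vectors with scikit-learn, whose SVC class wraps LIBSVM (the SVM library used in the experiments of Chapter 4); LIBSVM handles multiple classes with the all-against-all strategy. The feature values and transition labels are toy examples:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

X = [{"pos_top": "nn.nom", "pos_next": "vb.fin"},
     {"pos_top": "vb.fin", "pos_next": "ab"}]
y = ["LA(SUB)", "RA(ADV)"]                  # transitions as class labels

vec = DictVectorizer()                      # symbolic features -> binary vectors
clf = SVC(kernel="rbf", C=1.0, gamma=0.2)   # penalty C and kernel parameter gamma
clf.fit(vec.fit_transform(X), y)
print(clf.predict(vec.transform([{"pos_top": "nn.nom", "pos_next": "vb.fin"}])))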

Memory-Based Learning. Memory-based learning (MBL) and classification is based on the assumption that a cognitive learning task depends to a high degree on direct experience and memory, rather than on the extraction of an abstract representation. MBL has been used for many language learning tasks, such as part-of-speech tagging (Cardie 1993; Daelemans et al. 1996), semantic role labeling (Van den Bosch et al. 2004; Kouchnir 2004) and syntactic parsing (Nivre et al. 2004).

MBL is a lazy learning method and is based on two fundamental principles: learning is storing experiences in memory, and solving a new problem is achieved by reusing solutions from previously solved problems that are similar to the new problem. The idea during training for MBL is to collect the values of different features from the training data together with the correct class (Daelemans and Van den Bosch 2005). MBL generalizes by applying a similarity metric, without abstracting away or eliminating low-frequency events. This similarity metric can be seen as an implicit smoothing mechanism for rare events. Daelemans and colleagues have shown that it may be harmful to eliminate rare events from the training data for language learning tasks (Daelemans et al. 2002), because it is very difficult to discriminate noise from valid exceptions.

The n feature values are mapped into an n-dimensional space, where each feature vector from the training data with its corresponding class is a point in this space. The task at decision time is to find the nearest neighbor(s) in this n-dimensional space and return a category based on the k nearest neighbor(s). The way this search is performed can be varied in many different ways.

The Overlap metric is one of the most basic metrics and uses the distance Δ(X, Y) between two patterns X and Y, which are represented as n features:

   Δ(X, Y) = Σ_{i=1}^{n} w_i δ(x_i, y_i)        (2.2)

where w_i is a weight for feature i, and the function δ(x_i, y_i) is the distance per feature, which is 0 if x_i = y_i and 1 otherwise. The weight w_i can be calculated by a variety of methods, e.g. Information Gain (IG), which measures each feature's contribution to our knowledge with respect to the target class.
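A minimal Python sketch of the Overlap metric of equation (2.2) with a k-nearest-neighbor decision rule (a bare illustration; TiMBL's actual implementation is considerably more elaborate):

from collections import Counter

def overlap_distance(xs, ys, weights):
    # Delta(X, Y) = sum_i w_i * delta(x_i, y_i), with delta = 0 if the
    # values match and 1 otherwise
    return sum(w for x, y, w in zip(xs, ys, weights) if x != y)

def classify(instance, memory, weights, k=1):
    # memory holds (feature_tuple, class_label) pairs stored at learning time
    nearest = sorted(memory,
                     key=lambda m: overlap_distance(instance, m[0], weights))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]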

A variation of the Overlap metric is the more sophisticated Modified Value Difference Metric (MVDM), introduced by Cost and Salzberg (1993), which estimates the distance between two values of a feature by considering their co-occurrence with the target classes. However, this metric is more sensitive to sparse data.

2.4 Related Work

During the last decades, there has been a great interest in data-driven methods for various natural language processing tasks. Data-driven approaches to syntactic parsing were first developed during the 90s for constituency-based representations. The standard approaches are based on nondeterministic parsing techniques, usually involving some kind of dynamic programming, in combination with generative probabilistic models that provide an n-best ranking of the set of candidate analyses derived by the parser. The most well-known parsers based on these techniques are the parsers of Collins (1997, 1999) and Charniak (2000). Discriminative learning methods have been used to enhance these parsers by reranking the analyses output by the parser.
