Johan Hall
MaltParser – An Architecture for Inductive Labeled
Dependency Parsing
Licentiate Thesis
Växjö University
Computer Science
School of Mathematics and Systems Engineering
Växjö, Sweden
http://www.vxu.se/msi

by Johan Hall. All rights reserved. Reports from MSI.

To Gert and Karin
Abstract

This licentiate thesis presents a software architecture for inductive labeled dependency parsing of unrestricted natural language text, which achieves a strict modularization of parsing algorithm, feature model and learning method such that these parameters can be varied independently. The architecture is based on the theoretical framework of inductive dependency parsing by Nivre (2006) and has been realized in MaltParser, a system that supports several parsing algorithms and learning methods, for which complex feature models can be defined in a special description language. Special attention is given in this thesis to learning methods based on support vector machines (SVM).

The implementation is validated in three sets of experiments using data from three languages (Chinese, English and Swedish). First, we check if the implementation realizes the underlying architecture. The experiments show that the MaltParser system outperforms the baseline and satisfies the basic constraints of well-formedness. Furthermore, the experiments show that it is possible to vary parsing algorithm, feature model and learning method independently. Secondly, we focus on the special properties of the SVM interface. It is possible to reduce the learning and parsing time without sacrificing accuracy by dividing the training data into smaller sets, according to the part-of-speech of the next token in the current parser configuration. Thirdly, the last set of experiments presents a broad empirical study that compares SVM to memory-based learning (MBL) with five different feature models, where all combinations have gone through parameter optimization for both learning methods. The study shows that SVM outperforms MBL for more complex and lexicalized feature models with respect to parsing accuracy. There are also indications that SVM, with a splitting strategy, can achieve faster parsing than MBL. The parsing accuracy achieved is the highest reported for the Swedish data set and very close to the state of the art for Chinese and English.

Key-words: Dependency Parsing, Support Vector Machines, Machine Learning.
Sammanfattning

This licentiate thesis presents a software architecture for data-driven dependency parsing, i.e., for automatically creating a syntactic analysis in the form of dependency graphs for sentences in natural language text. The architecture is based on the idea that it should be possible to vary parsing algorithm, feature model and learning method independently of each other. As the foundation of this architecture we have used the theoretical framework for inductive dependency parsing presented by Nivre (2006). The architecture has been realized in the software MaltParser, where it is possible to define complex feature models in a special description language. In this thesis we place extra emphasis on describing how we have integrated the learning method support vector machines (SVM).

MaltParser is validated with three series of experiments, using data from three languages (Chinese, English and Swedish). In the first series of experiments we check whether the implementation realizes the underlying architecture. The experiments show that MaltParser clearly outperforms a trivial method for dependency parsing (a baseline) and that the basic requirements on well-formed dependency graphs are satisfied. Moreover, the experiments show that it is possible to vary parsing algorithm, feature model and learning method independently of each other. The second series of experiments focuses on the special properties of the SVM interface. The experiments show that it is possible to reduce learning and parsing time without losing parsing accuracy by splitting the training data according to the part-of-speech tag of the next word in the current parser configuration. The third and final series of experiments presents an empirical study comparing SVM with memory-based learning (MBL). The study uses five feature models, where all combinations of language, learning method and feature model have gone through extensive parameter optimization. The experiments show that SVM outperforms MBL for more complex and lexicalized feature models with respect to parsing accuracy. There are also some indications that SVM, with a splitting strategy, can parse a text faster than MBL. For Swedish we can report the highest parsing accuracy so far, and for Chinese and English the results are close to the best that have been reported.
Acknowledgments

Unfortunately, my parents Gert and Karin cannot witness the completion of my licentiate thesis, but I know that they would be very proud of me. They always believed in me and supported me in whatever I wanted to do. I want to thank my supervisor Joakim Nivre for all fruitful discussions, advice and fun times when we developed MaltParser, and I am looking forward to the development of the next version. I especially want to thank Jens Nilsson for the conversion of all the data used in this thesis into dependency structures and for the MaltEval tool, which made it easier to validate the MaltParser system. For the conversion of the Chinese data we used the head rules made by Yuan Ding at the University of Pennsylvania. I also want to thank all my colleagues in computer science at Växjö University for making it fun to go to work every day. I especially want to thank Morgan Ericsson for many ideas and extra computer power.

Finally, I want to thank my love Kristina for all support when I wrote this thesis.
Contents

Abstract
Sammanfattning
Acknowledgments
1 Introduction
  1.1 Research Problem and Aims
  1.2 Outline of the Thesis
  1.3 Division of Labor
2 Background
  2.1 Requirements on Text Parsing
  2.2 Dependency Graphs
  2.3 Inductive Dependency Parsing
    2.3.1 Deterministic Dependency Parsing
    2.3.2 History-Based Models
    2.3.3 Discriminative Learning Methods
  2.4 Related Work
3 MaltParser
  3.1 Architecture
    3.1.1 Parser
    3.1.2 Guide
  3.2 Implementation
    3.2.1 Input and Output
    3.2.2 Parser Kernel
    3.2.3 Parser
    3.2.4 Guide
4 Experiments
  4.1 Data Sets
    4.1.1 Swedish
    4.1.2 English
    4.1.3 Chinese
  4.2 Evaluation Metrics
  4.3 Feature Models
  4.4 Experiment I: Validation
    4.4.1 Experimental Setup
    4.4.2 Results and Discussion
  4.5 Experiment II: LIBSVM Interface
    4.5.1 Experimental Setup
    4.5.2 Results and Discussion
  4.6 Experiment III: Comparison of MBL and SVM
    4.6.1 Experimental Setup
    4.6.2 Results and Discussion
5 Conclusion
  5.1 Main Results
  5.2 Future Work
Bibliography
1 Introduction

Syntactic parsing is an important component for many applications of natural language processing. In this thesis, we regard parsing as the process of mapping sentences in unrestricted natural language text to their syntactic representations. Furthermore, the program which performs this process is called a syntactic parser, or simply parser. The syntactic structure is formalized with a syntactic representation such as phrase structure or dependency structure. Parsing a sentence with phrase structure grammar or context-free grammar recursively decomposes it into constituents or phrases, and in that way a phrase structure tree is created with relationships between words and phrases. By contrast, with dependency structure representations, the goal of parsing a sentence is to create a dependency graph consisting of lexical nodes linked by binary relations called dependencies. A dependency relation connects words, with one word acting as head and the other as dependent. In this thesis, we will concentrate on parsing with dependency representations.
Data-driven methods in natural language processing have been used in many tasks in the past decade, and syntactic parsing is one of them. Statistical parsing is usually based on nondeterministic parsing techniques in combination with generative probabilistic models that provide an n-best ranking of the set of candidate analyses derived by the parser (Collins 1997; Collins 1999; Charniak 2000). Discriminative models can be used to enhance these parsers by reranking the analyses output by the parser (Johnson et al. 1999; Collins and Duffy 2005; Charniak and Johnson 2005).

Nondeterministic parsing has been the mainstream approach, but it has also been shown that deterministic parsing can be performed with fairly high accuracy, especially in dependency-based parsing (Kudo and Matsumoto 2000a; Yamada and Matsumoto 2003; Nivre et al. 2004; Isozaki et al. 2004; Cheng et al. 2005a), but also in constituent-based parsing (Sagae and Lavie 2005). The main idea is to guide the parser with a classifier trained on treebank data, using a greedy parsing algorithm that approximates a globally optimal solution by making a series of locally optimal choices. A deterministic parser usually uses a form of history-based feature model (Black et al. 1992; Magerman 1995; Ratnaparkhi 1997) to create a representation that a classifier can use to predict the next parser state. This is also the approach assumed in this thesis.
Availability of large syntactically annotated corpora, also known as treebanks, is essential when constructing data-driven parsers, but one of the potential advantages of such parsers is that they can easily be ported to new languages. A problem is that many data-driven parsers are overfitted to a particular language, usually English. For example, Corazza et al. (2004) report increased error rates of 15-18% when using two statistical parsers developed for English to parse Italian. We suggest that a data-driven parser needs to be designed for flexible reconfiguration to increase its portability to other languages. A user should be able to experiment with several parsing algorithms, feature models and learning methods.
1.1 Research Problem and Aims

The main research problem for the doctoral thesis is to study the influence of different factors on the accuracy and efficiency of data-driven dependency parsing. This study requires a broad evaluation of the parsing system, where we perform extensive feature selection and parameter tuning to optimize the feature models and classifiers for many languages and several parsing algorithms.

For the licentiate thesis we will restrict the research problem to the design, implementation and validation of an architecture for data-driven dependency parsing of unrestricted natural language text. The validation can be seen as a pilot evaluation that will determine future directions. However, we will also obtain experimental results that have a direct bearing on the long-term research problems.
We present a software architecture that should be able to handle different parsing algorithms, feature models and learning methods, for both learning and parsing. When using the implementation of this architecture, the user should be able to vary these parameters independently in a convenient way.

The choice of parsing algorithm influences how the syntactic structure will be built. In the learning phase this will affect how the training data is generated, and in the parsing phase which structures are permissible. It should be easy to add new parsing algorithms to the architecture, provided that they fulfill certain well-defined requirements.

The linguistic knowledge of the language is important when defining the structure of the feature model, in other words which linguistic features should be used to predict parsing actions. It should be easy to define a new feature model without reprogramming the system. A feature model should be defined in an appropriate feature specification language so that it can be loaded when it is required.
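As an illustration of what such a feature model amounts to, a history-based model can be seen as a list of (data structure, offset, attribute) triples that pick out values from the current parser state. The encoding below is invented for this sketch and is not MaltParser's actual feature specification language; the configuration object and tag values are likewise illustrative:

```python
from types import SimpleNamespace

# A feature model as (structure, offset, attribute) triples -- an invented
# encoding for illustration, not MaltParser's specification language.
FEATURE_MODEL = [
    ("stack", 0, "pos"),   # part-of-speech of the token on top of the stack
    ("stack", 1, "pos"),
    ("input", 0, "pos"),   # part-of-speech of the next input token
    ("input", 1, "pos"),
    ("stack", 0, "word"),  # word form (a lexical feature)
    ("input", 0, "word"),
]

def extract(model, config, tokens):
    """Map a parser configuration to a feature vector under `model`.
    `config` has .sigma (stack, top = last) and .tau (remaining input);
    `tokens` maps node -> {"word": ..., "pos": ...}."""
    values = []
    for structure, offset, attribute in model:
        nodes = config.sigma[::-1] if structure == "stack" else config.tau
        if offset < len(nodes):
            values.append(tokens[nodes[offset]][attribute])
        else:
            values.append(None)   # position falls outside the configuration
    return tuple(values)

# Example state: token 2 on the stack, tokens 3..6 remaining
cfg = SimpleNamespace(sigma=[2], tau=[3, 4, 5, 6])
toks = {
    2: {"word": "gäller", "pos": "vb.fin"},
    3: {"word": "också", "pos": "ab"},
    4: {"word": "för", "pos": "pp"},
    5: {"word": "mopedister", "pos": "nn.nom"},
    6: {"word": ".", "pos": "mad"},
}
features = extract(FEATURE_MODEL, cfg, toks)
# features == ("vb.fin", None, "ab", "pp", "gäller", "också")
```

Defining the model as data rather than code is what makes it possible to load a new feature model without reprogramming the system.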
Given a set of training instances, where each instance is a fingerprint of the current state of the parser, as specified by the feature model, together with the transition to the next parser state, the task of the learner is to induce a model at learning time. At parsing time, this model is then used for predicting the next parser state. This task can easily be formulated as a classification task, for which discriminative learning methods are well suited.
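A hedged sketch of this formulation: training instances pair a feature vector with the transition that was taken, and a classifier predicts transitions for unseen states. The feature values and transition labels below are invented for the example, and a toy 1-nearest-neighbor classifier stands in for the actual learners (SVM, MBL):

```python
# Each instance: (feature vector of the parser state, transition taken)
def train(instances):
    # Memory-based learning in its simplest form: store all instances
    return list(instances)

def predict(model, features):
    # 1-nearest neighbor under overlap similarity: count matching values
    def overlap(stored):
        return sum(a == b for a, b in zip(stored[0], features))
    return max(model, key=overlap)[1]

model = train([
    (("nn", "vb"), "LA(SUB)"),    # noun before verb: attach to the left
    (("vb", "ab"), "RA(ADV)"),    # adverb after verb: attach to the right
    (("vb", "mad"), "RA(IP)"),    # final punctuation
])
prediction = predict(model, ("nn", "vb"))
# prediction == "LA(SUB)"
```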
The architecture is based on the theoretical framework of inductive dependency parsing by Nivre (2006) and has been realized in a system called MaltParser (Nivre et al. 2006), which in the current version supports two parsing algorithms, in several versions, and two learning methods (MBL and SVM), for which complex feature models can be defined in a special description language. The implementation of the MaltParser system has been joint work together with Joakim Nivre. MaltParser was first equipped with an interface to a memory-based learner called TiMBL (Daelemans and Van den Bosch 2005), and Nivre (2006) contains an extensive evaluation of memory-based dependency parsing using the parsing algorithm defined in Nivre (2003). In order to validate the generality and flexibility of the architecture, we therefore have to extend the parser with an interface to another learner and implement an additional deterministic parsing algorithm. We have chosen to use SVM as the learning method, because it has been proven to give good results for similar tasks (Kudo and Matsumoto 2000b; Yamada and Matsumoto 2003; Sagae and Lavie 2005). For the parsing algorithm, we have chosen the incremental algorithm described in Covington (2001).
Using this new implementation, we have performed three sets of experiments, designed to answer three essential questions:

1. Validation of the implementation: Does MaltParser realize the underlying architecture, so that it is possible to vary parsing algorithm, feature model and learning method independently?

2. Investigation of the SVM interface: How do the special properties of the SVM interface affect parsing accuracy and time efficiency? How can learning and parsing efficiency be improved without sacrificing accuracy?

3. Comparison of MBL and SVM: Which of the learning methods is best suited for the task of inductive labeled dependency parsing, taking both parsing accuracy and time efficiency into account?

Apart from answering these questions, we will try to identify future directions that hopefully will be useful for the long-term research problems.
1.2 Outline of the Thesis

In this introductory chapter, we have tried to outline the long-term research problem and the specific aims of the licentiate thesis. The structure of the remaining chapters is as follows.

Chapter 2, Inductive Dependency Parsing
Chapter 2 reviews the background material for this thesis. We define the problem of parsing unrestricted natural language text and discuss different algorithms for dependency parsing. Furthermore, data-driven parsing and especially history-based models are discussed. The chapter continues with a description of the two machine learning methods used in the rest of the thesis: SVM and MBL. Finally, the chapter ends with a section which briefly presents related work.

Chapter 3, MaltParser
Chapter 3 presents an architecture for parsing unrestricted natural language text with dependency structures. The architecture is described in detail with focus on the two main modules, Parser and Guide. The MaltParser system is an implementation of the architecture, and the chapter ends with a description of this system.

Chapter 4, Experiments
Chapter 4 starts with a presentation of the treebank data used for the experiments and an explanation of the evaluation criteria used to validate the implementation of the proposed architecture. An investigation of the three questions explained above is presented, based on extensive experiments.

Chapter 5, Conclusion
Chapter 5 contains the main conclusions and a summary of the main results of the thesis. The chapter ends with a discussion of directions for future research.
1.3 Division of Labor

As already stated, the design and implementation of MaltParser is joint work with Joakim Nivre. More specifically, the work has been divided as follows:

• The design of the architecture is joint work.

• The implementation of parsing algorithms, generic feature model handling and the memory-based learner is mainly the work of Joakim Nivre.

• The implementation of all other parts of the system, including the SVM learner, is mainly the work of Johan Hall.

2 Background
Syntactic parsing is used in many applications such as machine translation, information extraction and question answering. Applications dealing with unrestricted text need to handle all kinds of text, including grammatically correct text, ungrammatical text and foreign expressions. It is desirable that such an application produces some kind of analysis. Of course, if the input is garbage, it is most likely that the system will fail to create an interesting analysis, but the system should nevertheless do its best to produce an analysis. If these applications need a syntactic parser, it also needs to be able to handle unrestricted text, although we need to restrict the text to a certain natural language to be able to derive a meaningful syntactic representation.
Nivre (2006) introduces the notion of text parsing to characterize this open-ended problem, which can only be evaluated with respect to empirical samples of a text language. (The term text language does not exclude spoken language, but emphasizes that it is a language that occurs in real texts. In principle, the notion applies also to utterances in spoken dialogue.)

Our approach to text parsing is dependency-based and data-driven. The goal of dependency-based text parsing is to construct a dependency graph for each sentence in a text. Figure 2.1 shows an example of a dependency graph, connecting the words in a Swedish sentence by binary relations labeled with dependency types (grammatical functions).

Figure 2.1: Dependency graph for a Swedish sentence ("Cykelreglerna gäller också för mopedister.", in English "Biking rules are valid also for moped riders."), converted from Talbanken. Each token is listed with its position, part-of-speech tag, dependency type and head:

1 Cykelreglerna (nn.nom): SUB, head 2
2 gäller (vb.fin): ROOT, head 0
3 också (ab): ADV, head 2
4 för (pp): ADV, head 2
5 mopedister (nn.nom): PR, head 4
6 . (mad): IP, head 2

Data-driven methods comply well with the fact that text parsing uses empirical samples of a text language. A realistic approach is then to use some kind of supervised learning method that makes use of a treebank, which consists of syntactically annotated sentences. A problem with this approach is that it restricts us to languages that have at least one treebank. In addition, these treebanks are often annotated with constituency-based representations and therefore need to be converted to dependency-based representations.

Given that we have a treebank for a specific language, our approach is to induce a parser model at learning time and use this parser model to parse sentences. However, since it is problematic to use the dependency graph directly to construct such a model, we instead use a deterministic parsing algorithm to map a dependency graph to a transition sequence such that this transition sequence uniquely determines the dependency graph. An individual transition can be, for example, shifting a token onto a stack or adding an arc between two tokens. The transition system in itself is normally nondeterministic, and we therefore need a mechanism that resolves this nondeterminism. We use a discriminative learning method, such as SVM or MBL, to construct a classifier. Moreover, we use history-based feature models to extract vectors of feature-value pairs from the current parser state as training material for the classifier.
In this chapter, we review the necessary background for the design and implementation of MaltParser, focusing on the framework of inductive dependency parsing proposed by Nivre (2006). Most of the notation used by Nivre (2006) is also used here, but in some cases the notation has to be extended. The rest of the chapter is structured as follows. Section 2.1 describes the basic requirements on text parsing. Section 2.2 presents the necessary definitions of dependency graphs. Section 2.3 presents the parsing framework, including the deterministic parsing algorithms, the history-based feature models and discriminative learning methods. Related work is discussed in Section 2.4.
2.1 Requirements on Text Parsing

We begin by defining a text as a sequence T = (x₁, …, xₙ) of sentences, where each sentence xᵢ = (w₁, …, wₘ) is a sequence of tokens, and a token wⱼ is a sequence of characters, usually a word form. Given a text T, the task of text parsing is to derive the correct analysis yᵢ for every sentence xᵢ ∈ T. We assume that the text T contains sentences of a text language L, which in our case is a natural language. This assumption entails that the text language is not a formal language and that parsing does not entail recognition. Instead, we see text parsing as an empirical approximation problem. Therefore, this approach is not well suited for grammar checking in a word-processing application, because it will try to find an analysis also for an ungrammatical sentence.

Given these definitions we can define four basic requirements on a text parser (Nivre 2006):

Definition 2.1. A parser P should map a text T = (x₁, …, xₙ) in language L to well-formed syntactic representations (y₁, …, yₙ) in a way that satisfies the following requirements:

1. Robustness: P assigns at least one analysis yᵢ to every sentence xᵢ ∈ T.
2. Disambiguation: P assigns at most one analysis yᵢ to every sentence xᵢ ∈ T.
3. Accuracy: P assigns the correct analysis yᵢ to every sentence xᵢ ∈ T.
4. Efficiency: P processes every sentence xᵢ ∈ T in time and space that is polynomial in the length of xᵢ.

We want to create a parser that uses a parsing strategy that assigns at least one analysis to each sentence (Robustness) and at most one analysis (Disambiguation). The third requirement (Accuracy) is unrealistic in practice, but we will use it as an evaluation criterion in Chapter 4. In order to satisfy the fourth requirement, we will use deterministic parsing algorithms with at most quadratic time complexity and linear space complexity. We will use two parsing algorithms that have linear complexity (Nivre's arc-eager and arc-standard algorithms) and one that has quadratic complexity (Covington's algorithm). In the experiments, the Efficiency requirement will be an evaluation criterion that measures the time it takes to parse a text.
2.2 Dependency Graphs

Dependency parsing is based on syntactic representations built from binary relations between tokens (or words), labeled with syntactic functions or dependency types. We define such representations as dependency graphs:

Definition 2.2. Given a sentence x = (w₁, …, wₙ) and a set R = {r₀, r₁, …, rₘ} of dependency types, a dependency graph for a sentence x is a labeled directed graph G = (V, E, L), where:

1. V = Zₙ₊₁ = {0, 1, 2, …, n}
2. E ⊆ V × V
3. L : E → R

A dependency graph consists of a set V of nodes, where a node is a non-negative integer (including n). Every positive node has a corresponding token in the sentence x, and we will use the term token node for these nodes (i.e., the token wᵢ corresponds to the token node i). In addition, there is a special root node 0, which is the root of the dependency graph and has no corresponding token in the sentence x. Furthermore, the set V⁺ denotes the set of token nodes, i.e., V⁺ = V − {0}. There is a practical advantage in using position indices instead of word forms to represent tokens (Maruyama 1990), which allows the use of the arithmetic relation < to order the nodes, and ensures that every token has a unique node in the graph.

An arc (i, j) ∈ E connects two nodes i and j in the graph and represents a dependency relation where i is the head and j is the dependent. The notation i → j will be used for the pair (i, j) ∈ E, and i →* j for the reflexive and transitive closure, i.e., i →* j if and only if there is a path of zero or more arcs connecting i to j. Finally, the function L labels every arc i → j with a dependency type r ∈ R, and an arc with a label r will be denoted i →ᵣ j.

To be able to construct a dependency graph using a parsing algorithm, we usually have to define some basic constraints that a graph must satisfy.
Definition 2.3. A dependency graph G is well-formed if and only if the following constraints hold:

1. Root: The node 0 is a root, i.e., there is no node i such that i → 0.
2. Connectedness: G is weakly connected, i.e., for every node i there is some node j such that i → j or j → i.
3. Single-Head: Each node has at most one head, i.e., if i → j then there is no node k such that k ≠ i and k → j.
4. Acyclicity: G is acyclic, i.e., if i → j then not j →* i.
5. Projectivity: G is projective, i.e., if i → j then i →* k, for every node k such that i < k < j or j < k < i.

A special root node makes it easier to comply with the second constraint, Connectedness, since it is always possible to hook up any node to the root, and in that way the graph will always be connected. Furthermore, with a root node we always know the entrance to the graph. The third constraint, Single-Head (sometimes called uniqueness), is commonly assumed in dependency grammar, although Hudson (1984) allows multiple heads to capture certain transformational phenomena, where a single token is connected to more than one position in the sentence. The fourth constraint, Acyclicity, together with the first three constraints entails that the graph is a rooted tree. These assumptions make it simpler to construct parsing algorithms that build dependency trees automatically.

The last constraint, Projectivity, is more controversial, and most dependency grammars allow non-projective graphs, because non-projective representations are able to capture non-local dependencies. There exist several treebanks that contain non-projective structures, such as the Prague Dependency Treebank of Czech (Hajič et al. 2001) and the Danish Dependency Treebank (Kromann 2003). We will assume the constraint Projectivity here, because the parsing algorithms used in this thesis are limited to projective structures and the treebanks used only contain projective structures. Moreover, when dealing with non-projective data, it is possible to projectivize the training data and recover non-projective dependencies by applying an inverse transformation after parsing in a post-processing step (Nivre and Nilsson 2005).
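The constraints of Definition 2.3 can be checked mechanically. The following Python sketch is illustrative; it encodes a graph by its head function (as a dictionary mapping each token node to its head), so Root and Single-Head hold by construction and only Connectedness, Acyclicity and Projectivity need to be verified:

```python
def is_well_formed(n, head):
    """head maps each token node 1..n to its head node (0 = root).
    Node 0 never appears as a dependent, so Root holds by construction;
    a mapping gives each node at most one head, so Single-Head holds too."""
    def reaches_root(i):
        seen = set()
        while i != 0:
            if i in seen:        # a cycle: the chain never reaches node 0
                return False
            seen.add(i)
            i = head[i]
        return True
    # Acyclicity and Connectedness: every head chain must end at node 0
    if not all(reaches_root(j) for j in range(1, n + 1)):
        return False
    # Projectivity: for every arc i -> j, each node k strictly between
    # i and j must be dominated by i (i ->* k)
    def dominates(i, k):
        while k != 0:
            if k == i:
                return True
            k = head[k]
        return i == 0
    for j in range(1, n + 1):
        i = head[j]
        lo, hi = min(i, j), max(i, j)
        if not all(dominates(i, k) for k in range(lo + 1, hi)):
            return False
    return True

# The graph of Figure 2.1 (heads: 1->2, 2->0, 3->2, 4->2, 5->4, 6->2)
ok = is_well_formed(6, {1: 2, 2: 0, 3: 2, 4: 2, 5: 4, 6: 2})
# ok == True
```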
Figure 2.1 shows a labeled projective dependency graph for a Swedish sentence, where each word of the sentence is tagged with its part-of-speech and each arc is labeled with a dependency type.
(The dependency types used in Figure 2.1 are described in Section 4.1.1.)

2.3 Inductive Dependency Parsing

The framework of inductive dependency parsing, as characterized by Nivre (2006), is based on three essential elements:

1. Deterministic parsing algorithms for building dependency graphs (Kudo and Matsumoto 2002; Yamada and Matsumoto 2003; Nivre 2003)

2. History-based feature models for predicting the next transition from one parser configuration to another (Black et al. 1992; Magerman 1995; Ratnaparkhi 1997; Collins 1999)

3. Discriminative learning methods to map histories to transitions (Veenstra and Daelemans 2000; Kudo and Matsumoto 2002; Yamada and Matsumoto 2003; Nivre et al. 2004)
In this section we will discuss these three elements. Section 2.3.1 presents two deterministic dependency-based parsing algorithms. Section 2.3.2 describes how history-based models can be used for predicting the next transition from one parser configuration to another. Finally, Section 2.3.3 explains how we can use discriminative learning methods for inducing a classifier that maps parser configurations to transitions, including a brief description of the two learning methods used in the experiments: SVM and MBL.
2.3.1 Deterministic Dependency Parsing

Mainstream approaches to data-driven text parsing are based on nondeterministic parsing techniques, but the disambiguation can be performed deterministically, using a greedy parsing algorithm that approximates a globally optimal solution by making a sequence of locally optimal choices (see Section 2.4 for more details of related work in this area). The experiments in Chapter 4 will use two parsing algorithms, called Nivre's algorithm and Covington's algorithm, and both algorithms come in two versions. We begin by defining parser configurations that can be used by both algorithms, following Nivre (2006):
Definition 2.4. Given a set R = {r₀, r₁, …, rₘ} of dependency types and a sentence x = (w₁, …, wₙ), a parser configuration for x is a quintuple c = (σ, τ, υ, h, d), where:

1. σ is a stack of partially processed token nodes i (1 ≤ i ≤ j for some j ≤ n).
2. τ is a list of remaining input token nodes i (k ≤ i ≤ n for some k > j).
3. υ is a stack of token nodes i occurring between the token j on top of the stack σ and the next input token k in the list τ (j < i < k).
4. h : Vₓ⁺ → Vₓ is a head function from token nodes to nodes.
5. d : Vₓ⁺ → R is a label function from token nodes to dependency types.
6. For every token node i ∈ Vₓ⁺, d(i) = r₀ only if h(i) = 0.

The definition of a parser configuration introduces three data structures: a stack σ, a list τ and a stack υ. The first two data structures (the stack σ and the list τ) are included in the definition of Nivre (2006). Here the definition is extended with a stack υ, which we call the context stack and which is used by Covington's algorithm. In order to define the parsing algorithms later in this section, we will represent all three data structures as lists. To be able to use individual components in these lists, we will use j|τ to represent a list of input tokens with head j and tail τ, while σ|i and υ|i represent stacks with the top i and tails σ and υ. An empty stack/list is represented by ε.

The symbols Vₓ⁺ and Vₓ are used to indicate that V⁺ and V are the nodes for the sentence x. The head function h defines the partially built dependency graph. For every token node i there is a syntactic head h(i) = j. If the token node i is not yet attached to a head, the special root node h(i) = 0 is used. Finally, the label function d labels the partially built dependency structure, where every token node i is assigned a dependency type rⱼ using the label function d(i) = rⱼ (d(i) = r₀ is used for token nodes that are not yet attached). We establish a connection between parser configurations and dependency graphs in the following way (Nivre 2006):
Definition 2.5. A parser configuration c = (σ, τ, υ, h, d) for x defines the dependency graph G_c = (Vₓ, E_c, L_c), where:

1. E_c = {(i, j) | h(j) = i}
2. L_c = {((i, j), r) | h(j) = i, d(j) = r}

For the functions h and d, we will use the notation f[x ↦ y]; if f(x) = y′, then f[x ↦ y] = f − {(x, y′)} ∪ {(x, y)}.

Definition 2.6. A parser configuration c for the sentence x = (w₁, …, wₙ) is initial if and only if it has the form c = (ε, (1, …, n), ε, h₀, d₀), where:

1. h₀(i) = 0 for every i ∈ Vₓ⁺.
2. d₀(i) = r₀ for every i ∈ Vₓ⁺.

When the parser begins to parse a sentence, the two stacks σ and υ are empty and all the token nodes of the sentence are in the list τ. In the beginning, all token nodes are dependents of the special root node 0 and labeled with the special label r₀. The parser terminates the parsing of a sentence when the following condition is met:

Definition 2.7. A parser configuration c for the sentence x = (w₁, …, wₙ) is terminal if and only if it has the form c = (σ, ε, υ, h, d) (for arbitrary σ, υ, h and d).

The parser processes the input left-to-right and terminates whenever the list of input tokens is empty. The set C will denote all possible configurations and Cₙ the set of non-terminal configurations, i.e., any configuration c = (σ, τ, υ, h, d) where τ ≠ ε. A transition from a non-terminal configuration to a new configuration is a partial function t : Cₙ → C.

We will define a transition system for each version of the algorithms, which is nondeterministic. Hence, there will be more than one transition applicable to a given configuration. An oracle o : Cₙ → (Cₙ → C) is used to overcome this nondeterminism (Kay 2000). For each nondeterministic choice point the parsing algorithm will ask the oracle to predict the next transition. In this section we will consider the oracle as a black box, which always knows the correct transition. In Section 2.3.2, we will see that we can approximate this oracle by inducing a classifier.
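The configurations of Definitions 2.4, 2.6 and 2.7 are concrete enough to sketch as a small data structure. The following is an illustrative Python sketch, not MaltParser's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Configuration:
    # c = (sigma, tau, upsilon, h, d) as in Definition 2.4
    sigma: list     # stack of partially processed token nodes (top = last)
    tau: list       # list of remaining input token nodes
    upsilon: list   # context stack (used only by Covington's algorithm)
    h: dict         # head function: token node -> node (0 = root/unattached)
    d: dict         # label function: token node -> dependency type

    @classmethod
    def initial(cls, n):
        # Definition 2.6: empty stacks, all token nodes in tau,
        # h0(i) = 0 and d0(i) = r0 for every token node i
        return cls([], list(range(1, n + 1)), [],
                   {i: 0 for i in range(1, n + 1)},
                   {i: "r0" for i in range(1, n + 1)})

    def is_terminal(self):
        # Definition 2.7: terminal iff the input list tau is empty
        return not self.tau

c = Configuration.initial(6)
# c.is_terminal() is False until tau has been consumed
```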
Nivre's algorithm. This parsing algorithm was first proposed for unlabeled dependency parsing by Nivre (2003) and was extended to labeled dependency parsing by Nivre et al. (2004). A sentence x = (w₁, …, wₙ) is parsed by the algorithm Parse-Nivre in the following way:

Parse-Nivre(x = (w₁, …, wₙ))
1  c ← (ε, (1, …, n), ε, h₀, d₀)
2  while c = (σ, τ, υ, h, d) is not terminal
3      if σ = ε
4          c ← Shift(c)
5      else
6          c ← [o(c)](c)
7  G ← (Vₓ, E_c, L_c)
8  return G

The algorithm will perform the Shift transition if the stack is empty, and otherwise let the oracle o predict the next transition o(c), as long as the parser remains in a non-terminal configuration c ∈ Cₙ. The Shift transition pushes the next input token i onto the stack σ. When the terminal configuration is reached, the dependency graph is returned.

The algorithm comes in two versions with two transition systems: an arc-eager and an arc-standard version. The arc-eager version uses four transitions, two of which are parameterized by a dependency type r ∈ R. The transition system updates the parser configuration as follows (Nivre 2006):
Denition2.8. Forevery
r ∈ R
,thefollowingtransitions arepossible:1. Shift:
(σ, i|τ, ǫ, h, d) → (σ|i, τ, ǫ, h, d)
2. Redu e:
(σ|i, τ, ǫ, h, d) → (σ, τ, ǫ, h, d)
if
h(i) 6= 0
3. Right-Ar (
r
):(σ|i, j|τ, ǫ, h, d) → (σ|i|j, τ, ǫ, h[j 7→ i], d[j 7→ r])
if
h(j) = 0
4. Left-Ar (
r
):(σ|i, j|τ, ǫ, h, d) → (σ, j|τ, ǫ, h[i 7→ j], d[i 7→ r])
if
h(i) = 0
The transition Shift (SH) shifts (pushes) the next input token i onto the stack σ. This is the correct action when the head of the next word is positioned to the right of the next word, or when the next word is a root. The transition Reduce (RE) reduces (pops) the token i on top of the stack σ. It is important to ensure that the parser does not pop the top token if it has not been assigned a head, since it would otherwise be left unattached.

The Right-Arc transition (RA) adds an arc from the token i on top of the stack σ to the next input token j, i.e., i →_r j, and involves pushing j onto the stack. Finally, the transition Left-Arc (LA) adds an arc from the next input token j to the token i on top of the stack σ, i.e., j →_r i, and involves popping i from the stack. This transition is only allowed when the top token i on the stack is still attached to the special root node 0, i.e., h(i) = 0. We make use of the assumption of projectivity: because we know that the top token i cannot have any more left or right dependents, it can therefore be popped.
Nivre's arc-eager algorithm is guaranteed to terminate after at most 2n transitions, given a sentence of length n (Nivre 2003). Furthermore, it always produces a dependency graph that is acyclic and projective. The correct transition sequence for the Swedish sentence shown in Figure 2.1, using Nivre's arc-eager algorithm, is as follows:
(ε, (1, ..., 6), ε, h_0, d_0)   D SH
→ ((1), (2, ..., 6), ε, h_0, d_0)   N LA(SUB)
→ (ε, (2, ..., 6), ε, h_1 = h_0[1 ↦ 2], d_1 = d_0[1 ↦ SUB])   D SH
→ ((2), (3, ..., 6), ε, h_1, d_1)   N RA(ADV)
→ ((2, 3), (4, 5, 6), ε, h_2 = h_1[3 ↦ 2], d_2 = d_1[3 ↦ ADV])   N RE
→ ((2), (4, ..., 6), ε, h_2, d_2)   N RA(ADV)
→ ((2, 4), (5, 6), ε, h_3 = h_2[4 ↦ 2], d_3 = d_2[4 ↦ ADV])   N RA(PR)
→ ((2, 4, 5), (6), ε, h_4 = h_3[5 ↦ 4], d_4 = d_3[5 ↦ PR])   N RE
→ ((2, 4), (6), ε, h_4, d_4)   N RE
→ ((2), (6), ε, h_4, d_4)   N RA(IP)
→ ((2, 6), ε, ε, h_5 = h_4[6 ↦ 2], d_5 = d_4[6 ↦ IP])

The first row presents the initial parser configuration, with an empty stack and h_0(i) = 0 and d_0(i) = r_0 for every node i ∈ V. The second row shows the parser configuration after the Shift transition has been executed. The annotation next to each configuration tells us whether the transition taken from it is deterministic (D) or nondeterministic (N), in other words whether the oracle o is used or not. For example, the transition out of the first configuration can only be a Shift, because the stack is empty (D), whereas the transition out of the second configuration is nondeterministic (N).
The arc-standard version uses strict bottom-up processing, as in traditional shift-reduce parsing. The algorithms of Kudo and Matsumoto (2002), Yamada and Matsumoto (2003) and Cheng et al. (2005a) use the arc-standard strategy, but also allow multiple passes over the input.

The arc-standard version uses a transition system similar to the arc-eager version, but it has only three transitions: Shift, Left-Arc and Right-Arc (no Reduce). The first two transitions, Shift and Left-Arc, are applied in exactly the same way as in the arc-eager version. The transition system is defined as follows:
1. Shift: (σ, i|τ, ε, h, d) → (σ|i, τ, ε, h, d)

2. Right-Arc(r): (σ|i, j|τ, ε, h, d) → (σ, i|τ, ε, h[j ↦ i], d[j ↦ r]) if h(j) = 0

3. Left-Arc(r): (σ|i, j|τ, ε, h, d) → (σ, j|τ, ε, h[i ↦ j], d[i ↦ r]) if h(i) = 0
Instead of pushing the next token j onto the stack σ, Right-Arc moves the topmost token i on the stack back to the list of remaining input tokens τ, where it replaces the token j as the next token. The transition sequence for the same sentence using Nivre's arc-standard algorithm is:
(ε, (1, ..., 6), ε, h_0, d_0)   D SH
→ ((1), (2, ..., 6), ε, h_0, d_0)   N LA(SUB)
→ (ε, (2, ..., 6), ε, h_1 = h_0[1 ↦ 2], d_1 = d_0[1 ↦ SUB])   D SH
→ ((2), (3, ..., 6), ε, h_1, d_1)   N RA(ADV)
→ (ε, (2, 4, ..., 6), ε, h_2 = h_1[3 ↦ 2], d_2 = d_1[3 ↦ ADV])   D SH
→ ((2), (4, ..., 6), ε, h_2, d_2)   N SH
→ ((2, 4), (5, 6), ε, h_2, d_2)   N RA(PR)
→ ((2), (4, 6), ε, h_3 = h_2[5 ↦ 4], d_3 = d_2[5 ↦ PR])   N RA(ADV)
→ (ε, (2, 6), ε, h_4 = h_3[4 ↦ 2], d_4 = d_3[4 ↦ ADV])   D SH
→ ((2), (6), ε, h_4, d_4)   N RA(IP)
→ (ε, (2), ε, h_5 = h_4[6 ↦ 2], d_5 = d_4[6 ↦ IP])   D SH
→ ((2), ε, ε, h_5, d_5)

We can see that the transitions are performed in a different order compared to the arc-eager version; for instance, Right-Arc(PR) is executed before Right-Arc(ADV).
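Operationally, the arc-standard Right-Arc attaches j to i and then moves i back to the front of the input list, rather than pushing j. This single transition can be sketched as follows (an illustrative representation with h and d as dictionaries, not the MaltParser implementation):

```python
# Sketch of the arc-standard Right-Arc(r): attach j to i, then move the
# attached top token i back to the front of the input list tau.

def right_arc_standard(c, r):
    sigma, tau, h, d = c
    i, j = sigma[-1], tau[0]
    assert h[j] == 0                  # precondition: j has no head yet
    h2, d2 = dict(h), dict(d)
    h2[j], d2[j] = i, r               # add arc i -> j with label r
    # pop i and put it back as the next input token, replacing j
    return (sigma[:-1], [i] + tau[1:], h2, d2)
```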
Covington's algorithm. Covington (2001) proposes several incremental parsing algorithms for dependency parsing. Two of the algorithms are the projective algorithm and the exhaustive left-to-right search algorithm. The first algorithm uses a head list with words that do not yet have heads, and a word list with all words encountered so far. We will not use these two data structures; instead we will describe these two algorithms using the data structures defined by the parser configuration: the stacks σ and υ, and the list τ. In fact, we will regard these two algorithms as one algorithm with two transition systems, or as two versions of the same algorithm. We will call the second version the unrestricted version, because it allows dependency graphs that are non-projective and cyclic. Both versions have quadratic complexity, since they proceed by trying to link each new token to each preceding token. It is also possible to define other versions, for example a version that conforms to the Acyclicity requirement but allows non-projective graphs, but this will not be done in this thesis. The adapted version of Covington's algorithm is described as follows:
Parse-Covington(x = (w_1, ..., w_n))
1   c ← (ε, (1, ..., n), ε, h_0, d_0)
2   while c = (σ, τ, υ, h, d) is not terminal
3       Done ← false
4       while σ ≠ ε and ¬Done
5           c ← [o(c)](c)
6       while υ ≠ ε
7           Push(Pop(υ), σ)
8       Push(First(τ), σ)
9   G ← (V_x, E_c, L_c)
10  return G
The algorithm begins by initializing the configuration with two empty stacks and all token nodes in the list τ, in the same way as Nivre's algorithm. As long as the parser remains in a non-terminal configuration, it will first iterate as long as the stack σ is not empty and the flag Done is false. The Done flag is only used by the projective version, to indicate that it can proceed to the next token without an empty stack. Before it can proceed with the next input token, the algorithm must move all unattached tokens in the context stack υ back to the stack σ. The Push function pushes a token onto a stack and the Pop function pops a token from a stack. Finally, the next input token is pushed onto the stack σ, using the function First to retrieve the first token in a list.
The unrestricted version uses three transitions, which are defined in the following way:

1. Reduce: (σ|i, τ, υ, h, d) → (σ, τ, υ|i, h, d)

2. Right-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[j ↦ i], d[j ↦ r])

3. Left-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[i ↦ j], d[i ↦ r])

All three transitions move the top token of the stack σ to the stack υ. The Right-Arc and Left-Arc transitions in addition add an arc i →_r j or an arc j →_r i, respectively.

The projective version makes use of the fact that it should build a projective graph, which allows the algorithm to continue with the next input token without exploring all combinations that could make the graph non-projective. The transition system is redefined as follows:

1. Reduce: (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h, d); Done ← true

2. Right-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[j ↦ i], d[j ↦ r]); Done ← true, if h(j) = 0

3. Left-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ, h[i ↦ j], d[i ↦ r]) if h(i) = 0

The Reduce transition is exactly the same as in the unrestricted version, except that it sets the Done flag to true, in order to indicate that the remaining tokens in the stack σ cannot be linked to the token j, since this would produce a non-projective graph. The Right-Arc transition makes use of the fact that the arc i →_r j covers the tokens between the top token and the next token; to prevent the graph from becoming non-projective, the top token i of the stack σ is popped and then pushed onto the context stack υ, and the flag Done is assigned the value true, for the same reason as in the Reduce transition. The Left-Arc transition adds an arc j →_r i; because i cannot be linked to another token, it is popped from the stack σ.
2.3.2 History-Based Models

In section 2.3.1 we defined a set C of possible parser configurations, and for each version of the parsing algorithm we defined a transition system that is nondeterministic. Furthermore, we introduced an oracle o : C_n → (C_n → C), which the parsing algorithm uses to get the correct transition. If it is possible to derive the correct transitions from syntactically annotated sentences, we can use these as training data to approximate such an oracle through inductive learning. In other words, we define a one-to-one mapping from an input string x and a dependency graph G to a sequence of transitions S = (t_1, ..., t_m), such that S uniquely determines G. A transition t_i is dependent on all previously made transitions (t_1, ..., t_{i-1}) and all available information about these transitions, called the history. The history H_i = (t_1, ..., t_{i-1}) corresponds to some partially built structure, and we also include static properties that are kept constant during the parsing of a sentence, such as the word form and part of speech of a token.
The basic idea is thus to train a classifier that approximates an oracle, given that a treebank is available. We will call the approximated oracle a guide (Boullier 2003), because the guide does not guarantee that the transition is correct. The history H_i = (t_1, ..., t_{i-1}) contains complete information about all previous transitions. All this information is intractable for training a classifier. Instead we can use history-based feature models for predicting the next transition. History-based feature models were first introduced by Black et al. (1992) and have been used extensively in data-driven parsing (Magerman 1995; Ratnaparkhi 1997; Collins 1999). To make it tractable, the history H_i is replaced by a feature vector defined by a feature model Φ = (φ_1, ..., φ_p), where each feature φ_i is a function that identifies some significant property of the history H_i and/or the input string x. To simplify notation, we will write Φ(H_i, x) to denote the application of the feature vector (φ_1, ..., φ_p) to H_i and x, i.e., Φ(H_i, x) = (φ_1(H_i, x), ..., φ_p(H_i, x)).
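As a concrete (hypothetical) illustration, a feature model Φ could consist of two feature functions extracting the part of speech of the token on top of the stack and of the next input token; all names here are illustrative, not MaltParser's feature description language:

```python
# Hypothetical feature model: each feature function maps a parser state
# (here just the stack and input list) plus the input tokens x to a value.

def pos_of_stack_top(state, x):
    stack, inp = state
    return x[stack[-1]]["pos"] if stack else "NIL"

def pos_of_next_input(state, x):
    stack, inp = state
    return x[inp[0]]["pos"] if inp else "NIL"

PHI = (pos_of_stack_top, pos_of_next_input)   # the model Phi = (phi_1, phi_2)

def apply_model(phi, state, x):
    # Phi(H_i, x) = (phi_1(H_i, x), ..., phi_p(H_i, x))
    return tuple(f(state, x) for f in phi)
```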
At learning time, the parser derives the correct transition by using an oracle function o applied to a gold standard treebank. For each transition, it provides the learner with a training instance (Φ(H_i, x), t_i), where Φ(H_i, x) is the current vector of feature values and t_i is the correct transition. A set of training instances I is then used by the learner to induce a parser model, using a supervised learning method.

At parsing time, the parser uses the parser model as a guide to predict the next transition; now the vector of feature values Φ(H_i, x) is the input and the transition t_i is the output of the guide. Section 2.3.3 describes how we can train a classifier that makes this prediction.
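The learning-time and parsing-time use of the feature model can be summarized in a sketch (all names are illustrative; the parser model here is a simple lookup table, which is a simplifying assumption rather than an actual learning method):

```python
# Learning time: pair the feature vector Phi(H_i, x) of each visited
# configuration with the oracle's correct transition t_i.

def collect_instances(gold_derivation, feature_model, x):
    # gold_derivation: list of (configuration, correct_transition) pairs
    return [(feature_model(state, x), t) for state, t in gold_derivation]

def guide(parser_model, feature_model, state, x):
    # Parsing time: the induced model maps Phi(H_i, x) to a transition.
    return parser_model[feature_model(state, x)]
```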
2.3.3 Discriminative Learning Methods

The learning problem is to induce a classifier from a set of training instances I, relative to a specific feature model Φ, by using a learning algorithm. In this section, we will describe two discriminative learning methods, SVM and MBL, that can be used for this classification task.

In general, classification is the task of predicting the class y given a variable x, which can be accomplished by probabilistic methods; it is common to divide these methods into two classes: generative and discriminative. For generative methods, we use Bayes' rule to obtain P(y | x) by estimating the joint distribution P(x, y). By contrast, discriminative methods make no attempt to model underlying distributions and instead estimate P(y | x) directly. We will use two discriminative methods for the learning task: SVM and MBL.
Support Vector Machines. In the last decade, there has been a growing interest in Support Vector Machines (SVM), which were proposed by Vladimir Vapnik at the end of the seventies (Vapnik 1979). SVM is based on the idea that two linearly separable classes, the positive and negative samples in the training data, can be separated by a hyperplane with the largest margin. It has been shown that SVMs give good generalization performance in various research areas, such as face detection (Osuna et al. 1997) and pedestrian detection (Oren et al. 1997). Within natural language processing, they have been used extensively in, for example, text categorization (Joachims 1998), chunking (Kudo and Matsumoto 2001) and syntactic parsing (Yamada and Matsumoto 2003).
Given a data set of ℓ instance–label pairs I = {(x_i, y_i)}, i = 1, ..., ℓ, where x_i ∈ R^n and y_i ∈ {−1, 1}, x_i is the feature vector of the i-th sample, represented by an n-dimensional vector x_i = (f_1, ..., f_n), and y_i is the class label of the i-th sample, which belongs to either the positive (+1) or the negative (−1) class. The feature vector x_i will in our case be the feature vector defined by Φ(H_i, x), and the class label y_i will be the transition t_i, but we need a method that handles multiple class labels (more about that later in this section). The idea is to estimate a vector w and a scalar b which maximize the distance of any data point from the hyperplane defined by w · x + b. The goal of the SVM is to find the solution of the following optimization problem (Kudo and Matsumoto 2000a; Burges 1998):

Minimize:   L(w) = (1/2) ‖w‖²
Subject to: y_i (w · x_i + b) ≥ 1, ∀i = 1, ..., ℓ          (2.1)

Figure 2.2: A linear Support Vector Machine
In other words, the SVM method tries to find the hyperplane that separates the training data into two classes with the largest margin. Figure 2.2 illustrates two possible hyperplanes which correctly separate the training data into two classes; the left hyperplane has the largest margin between the two classes.

The data in Figure 2.2 are easy to separate into two classes, but in practice the data may be noisy and therefore not linearly separable. One solution is to allow some misclassifications by introducing a penalty parameter C, which defines the trade-off between the training error and the magnitude of the margin.
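For a toy linearly separable data set, the constraint of (2.1) can be checked directly; the geometric margin between the two supporting hyperplanes of a feasible w is 2/‖w‖ (a numerical sketch for illustration, not an SVM solver; the data set is invented):

```python
import math

def satisfies_constraints(w, b, data):
    # Check the constraint of (2.1): y_i * (w . x_i + b) >= 1 for all i.
    return all(y * (sum(wj * xj for wj, xj in zip(w, x)) + b) >= 1
               for x, y in data)

def margin(w):
    # Geometric margin between the two supporting hyperplanes: 2 / ||w||.
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Toy one-dimensional data: positives at x >= 2, negatives at x <= -2.
data = [((2.0,), 1), ((3.0,), 1), ((-2.0,), -1), ((-4.0,), -1)]
```

Scaling w down increases the margin 2/‖w‖ but eventually violates the constraint, which is exactly the trade-off the optimization in (2.1) resolves.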
SVM can be extended to solve problems that are not linearly separable. The feature vector x_i is mapped to a higher-dimensional space by a function φ, which makes it possible to carry out non-linear classification. The optimization problem can be rewritten in a dual form, which is done with a so-called kernel function K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) (Kudo and Matsumoto 2001; Vapnik 1998). There are many kernel functions, but the most common are:

• polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0

• radial basis function (RBF): K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²), γ > 0

• sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r)

where γ, r and d denote different kernel parameters (Hsu et al. 2004).

SVM is in its basic form a binary classifier, but many learning problems have to deal with more than two classes. To make SVM handle multi-class classification, many binary classifiers are used. For multi-class classification, we can choose between the methods one-against-all and all-against-all. Given that we have n classes, the one-against-all method trains n classifiers to separate each class from the rest, while the all-against-all method trains n(n − 1)/2 classifiers, one for each pair of classes (Vural and Dy 2004). A voting mechanism, or some other measure that discriminates across all these classifiers, is used to classify a new instance.
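The three kernel functions, and the number of binary classifiers required by the two multi-class strategies, can be computed directly (a sketch with illustrative names):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def polynomial(xi, xj, gamma, r, d):
    # K(x_i, x_j) = (gamma * x_i^T x_j + r)^d
    return (gamma * dot(xi, xj) + r) ** d

def rbf(xi, xj, gamma):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def sigmoid(xi, xj, gamma, r):
    # K(x_i, x_j) = tanh(gamma * x_i^T x_j + r)
    return math.tanh(gamma * dot(xi, xj) + r)

def n_classifiers(n, strategy):
    # one-against-all trains n classifiers, all-against-all n(n-1)/2.
    return n if strategy == "one-against-all" else n * (n - 1) // 2
```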
Memory-Based Learning. Memory-based learning (MBL) and classification is based on the assumption that a cognitive learning task to a high degree depends on direct experience and memory, rather than on the extraction of an abstract representation. MBL has been used for many language learning tasks, such as part-of-speech tagging (Cardie 1993; Daelemans et al. 1996), semantic role labeling (Van den Bosch et al. 2004; Kouchnir 2004) and syntactic parsing (Nivre et al. 2004).

MBL is a lazy learning method based on two fundamental principles: learning is storing experiences in memory, and solving a new problem is achieved by reusing solutions from previously solved problems that are similar to the new problem. During training, the idea of MBL is to collect the values of different features from the training data, together with the correct class (Daelemans and Van den Bosch 2005). MBL generalizes by applying a similarity metric, without abstracting away or eliminating low-frequency events. This similarity metric can be seen as an implicit smoothing mechanism for rare events. Daelemans and colleagues have shown that it may be harmful to eliminate rare events in the training data for language learning tasks (Daelemans et al. 2002), because it is very difficult to discriminate noise from valid exceptions.
The n feature values are mapped into an n-dimensional space, where each feature vector from the training data, with its corresponding class, is a point in this space. The task at decision time is to find the nearest neighbor(s) in this n-dimensional space and return a category based on the k nearest neighbor(s). The way this search is performed can be varied in many different ways.

The Overlap metric is one of the most basic metrics and uses the distance Δ(X, Y) between two patterns X and Y, which are represented as n features:

Δ(X, Y) = Σ_{i=1}^{n} w_i δ(x_i, y_i)          (2.2)

where w_i is a weight for feature i, and the function δ(x_i, y_i) is the distance per feature, which is 0 if x_i = y_i and 1 otherwise. The weight w_i can be calculated by a variety of methods, e.g., Information Gain (IG), which measures each feature's contribution to our knowledge with respect to the target class.
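The Overlap metric in (2.2), and k-nearest-neighbor classification over instances stored in memory, can be sketched as follows (illustrative names, not an actual MBL implementation):

```python
def overlap_distance(X, Y, weights):
    # Delta(X, Y) = sum_i w_i * delta(x_i, y_i); delta = 0 if equal, else 1.
    return sum(w * (0 if x == y else 1) for w, x, y in zip(weights, X, Y))

def classify_knn(instance, memory, weights, k=1):
    # `memory` holds (feature_vector, class) pairs stored during learning;
    # predict the majority class among the k nearest neighbors.
    ranked = sorted(memory,
                    key=lambda m: overlap_distance(instance, m[0], weights))
    votes = {}
    for _, cls in ranked[:k]:
        votes[cls] = votes.get(cls, 0) + 1
    return max(votes, key=votes.get)
```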
A variation of the Overlap metric is the more sophisticated Modified Value Difference Metric (MVDM), introduced by Cost and Salzberg (1993), which estimates the distance between two values of a feature by considering their co-occurrence with the target classes. However, this metric is more sensitive to sparse data.
2.4 Related Work

During the last decades, there has been great interest in data-driven methods for various natural language processing tasks. Data-driven approaches to syntactic parsing were first developed during the 1990s for constituency-based representations. The standard approaches are based on nondeterministic parsing techniques, usually involving some kind of dynamic programming, in combination with generative probabilistic models that provide an n-best ranking of the set of candidate analyses derived by the parser. The most well-known parsers based on these techniques are the parser of Collins (1997, 1999) and the parser of Charniak (2000). Discriminative learning methods have been used to enhance these parsers by reranking the analyses output