Johan Hall
MaltParser – An Architecture for Inductive Labeled
Dependency Parsing
Licentiate Thesis
Växjö University
Computer Science
School of Mathematics and Systems Engineering
Växjö, Sweden
http://www.vxu.se/msi

by Johan Hall. All rights reserved. Reports from MSI.

To Gert and Karin
Abstract

This licentiate thesis presents a software architecture for inductive labeled dependency parsing of unrestricted natural language text, which achieves a strict modularization of parsing algorithm, feature model and learning method such that these parameters can be varied independently. The architecture is based on the theoretical framework of inductive dependency parsing by Nivre (2006) and has been realized in MaltParser, a system that supports several parsing algorithms and learning methods, for which complex feature models can be defined in a special description language. Special attention is given in this thesis to learning methods based on support vector machines (SVM).

The implementation is validated in three sets of experiments using data from three languages (Chinese, English and Swedish). First, we check if the implementation realizes the underlying architecture. The experiments show that the MaltParser system outperforms the baseline and satisfies the basic constraints of well-formedness. Furthermore, the experiments show that it is possible to vary parsing algorithm, feature model and learning method independently. Secondly, we focus on the special properties of the SVM interface. It is possible to reduce the learning and parsing time without sacrificing accuracy by dividing the training data into smaller sets, according to the part-of-speech of the next token in the current parser configuration. Thirdly, the last set of experiments presents a broad empirical study that compares SVM to memory-based learning (MBL) with five different feature models, where all combinations have gone through parameter optimization for both learning methods. The study shows that SVM outperforms MBL for more complex and lexicalized feature models with respect to parsing accuracy. There are also indications that SVM, with a splitting strategy, can achieve faster parsing than MBL. The parsing accuracy achieved is the highest reported for the Swedish data set and very close to the state of the art for Chinese and English.

Key-words: Dependency Parsing, Support Vector Machines, Machine Learning.
Sammanfattning

This licentiate thesis presents a software architecture for data-driven dependency parsing, i.e., for automatically creating a syntactic analysis in the form of dependency graphs for sentences in natural language text. The architecture is based on the idea that it should be possible to vary parsing algorithm, feature model and learning method independently of each other. As the foundation of this architecture we have used the theoretical framework for inductive dependency parsing presented by Nivre (2006). The architecture has been realized in the software MaltParser, where it is possible to define complex feature models in a special description language. In this thesis we place extra emphasis on describing how we have integrated the learning method support vector machines (SVM).

MaltParser is validated with three series of experiments, using data from three languages (Chinese, English and Swedish). In the first series of experiments we check whether the implementation realizes the underlying architecture. The experiments show that MaltParser clearly outperforms a trivial method for dependency parsing (a baseline) and that the basic requirements on well-formed dependency graphs are satisfied. Moreover, the experiments show that it is possible to vary parsing algorithm, feature model and learning method independently of each other. The second series of experiments focuses on the special properties of the SVM interface. The experiments show that it is possible to reduce learning and parsing time without losing parsing accuracy by splitting the training data according to the part-of-speech tag of the next word in the current parser configuration. The third and final series of experiments presents an empirical study comparing SVM with memory-based learning (MBL). The study uses five feature models, where all combinations of language, learning method and feature model have gone through extensive parameter optimization. The experiments show that SVM outperforms MBL for more complex and lexicalized feature models with respect to parsing accuracy. There are also some indications that SVM, with a splitting strategy, can parse a text faster than MBL. For Swedish we can report the highest parsing accuracy so far, and for Chinese and English the results are close to the best that have been reported.
Acknowledgments

Unfortunately, my parents Gert and Karin cannot witness the completion of my licentiate thesis, but I know that they would be very proud of me. They always believed in me and supported me in whatever I wanted to do. I want to thank my supervisor Joakim Nivre for all fruitful discussions, advice and fun times when we developed MaltParser, and I am looking forward to the development of the next version. I especially want to thank Jens Nilsson for the conversion of all the data used in this thesis into dependency structures and for the MaltEval tool, which made it easier to validate the MaltParser system. For the conversion of the Chinese data we used the head rules made by Yuan Ding at the University of Pennsylvania. I also want to thank all my colleagues in computer science at Växjö University for making it fun to go to work every day. I especially want to thank Morgan Ericsson for many ideas and extra computer power.

Finally, I want to thank my love Kristina for all support when I wrote this thesis.
Contents

Abstract
Sammanfattning
Acknowledgments
1 Introduction
  1.1 Research Problem and Aims
  1.2 Outline of the Thesis
  1.3 Division of Labor
2 Background
  2.1 Requirements on Text Parsing
  2.2 Dependency Graphs
  2.3 Inductive Dependency Parsing
    2.3.1 Deterministic Dependency Parsing
    2.3.2 History-Based Models
    2.3.3 Discriminative Learning Methods
  2.4 Related Work
3 MaltParser
  3.1 Architecture
    3.1.1 Parser
    3.1.2 Guide
  3.2 Implementation
    3.2.1 Input and Output
    3.2.2 Parser Kernel
    3.2.3 Parser
    3.2.4 Guide
4 Experiments
  4.1 Data Sets
    4.1.1 Swedish
    4.1.2 English
    4.1.3 Chinese
  4.2 Evaluation Metrics
  4.3 Feature Models
  4.4 Experiment I: Validation
    4.4.1 Experimental Setup
    4.4.2 Results and Discussion
  4.5 Experiment II: LIBSVM Interface
    4.5.1 Experimental Setup
    4.5.2 Results and Discussion
  4.6 Experiment III: Comparison of MBL and SVM
    4.6.1 Experimental Setup
    4.6.2 Results and Discussion
5 Conclusion
  5.1 Main Results
  5.2 Future Work
Bibliography
1 Introduction

Syntactic parsing is an important component for many applications of natural language processing. In this thesis, we regard parsing as the process of mapping sentences in unrestricted natural language text to their syntactic representations. Furthermore, the program which performs this process is called a syntactic parser, or simply parser. The syntactic structure is formalized with a syntactic representation such as phrase structure or dependency structure. Parsing a sentence with phrase structure grammar or context-free grammar recursively decomposes it into constituents or phrases, and in that way a phrase structure tree is created with relationships between words and phrases. By contrast, with dependency structure representations, the goal of parsing a sentence is to create a dependency graph consisting of lexical nodes linked by binary relations called dependencies. A dependency relation connects words, with one word acting as head and the other as dependent. In this thesis, we will concentrate on parsing with dependency representations.
Data-driven methods in natural language processing have been used in many tasks in the past decade, and syntactic parsing is one of them. Statistical parsing is usually based on nondeterministic parsing techniques in combination with generative probabilistic models that provide an n-best ranking of the set of candidate analyses derived by the parser (Collins 1997; Collins 1999; Charniak 2000). Discriminative models can be used to enhance these parsers by reranking the analyses output by the parser (Johnson et al. 1999; Collins and Duffy 2005; Charniak and Johnson 2005).

Nondeterministic parsing has been the mainstream approach, but it has also been shown that deterministic parsing can be performed with fairly high accuracy, especially in dependency-based parsing (Kudo and Matsumoto 2000a; Yamada and Matsumoto 2003; Nivre et al. 2004; Isozaki et al. 2004; Cheng et al. 2005a), but also in constituent-based parsing (Sagae and Lavie 2005). The main idea is to guide the parser with a classifier trained on treebank data, using a greedy parsing algorithm that approximates a globally optimal solution by making a series of locally optimal choices. A deterministic parser usually uses a form of history-based feature model (Black et al. 1992; Magerman 1995; Ratnaparkhi 1997) to create a representation that a classifier can use to predict the next parser state. This is also the approach assumed in this thesis.
Availability of large syntactically annotated corpora, also known as treebanks, is essential when constructing data-driven parsers, but one of the potential advantages of such parsers is that they can easily be ported to new languages. A problem is that many data-driven parsers are overfitted to a particular language, usually English. For example, Corazza et al. (2004) report increased error rates of 15-18% when using two statistical parsers developed for English to parse Italian. We suggest that a data-driven parser needs to be designed for flexible reconfiguration to increase its portability to other languages. A user should be able to experiment with several parsing algorithms, feature models and learning methods.
1.1 Research Problem and Aims

The main research problem for the doctoral thesis is to study the influence of different factors on the accuracy and efficiency of data-driven dependency parsing. This study requires a broad evaluation of the parsing system, where we perform extensive feature selection and parameter tuning to optimize the feature models and classifiers for many languages and several parsing algorithms.

For the licentiate thesis we will restrict the research problem to the design, implementation and validation of an architecture for data-driven dependency parsing of unrestricted natural language text. The validation can be seen as a pilot evaluation that will determine future directions. However, we will also obtain experimental results that have a direct bearing on the long-term research problems.
We present a software architecture that should be able to handle different parsing algorithms, feature models and learning methods, for both learning and parsing. When using the implementation of this architecture, the user should be able to vary these parameters independently in a convenient way.

The choice of parsing algorithm influences how the syntactic structure will be built. In the learning phase this will affect how the training data is generated, and in the parsing phase which structures are permissible. It should be easy to add new parsing algorithms to the architecture, provided that they fulfill certain well-defined requirements.

The linguistic knowledge of the language is important when defining the structure of the feature model, in other words which linguistic features should be used to predict parsing actions. It should be easy to define a new feature model without reprogramming the system. A feature model should be defined in an appropriate feature specification language so that it can be loaded when it is required.
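As an illustration of what such a feature model amounts to, a history-based model can be seen as a list of (data structure, offset, attribute) triples that pick out values from the current parser state. The encoding below is invented for this sketch and is not MaltParser's actual feature specification language; the configuration object and tag values are likewise illustrative:

```python
from types import SimpleNamespace

# A feature model as (structure, offset, attribute) triples -- an invented
# encoding for illustration, not MaltParser's specification language.
FEATURE_MODEL = [
    ("stack", 0, "pos"),   # part-of-speech of the token on top of the stack
    ("stack", 1, "pos"),
    ("input", 0, "pos"),   # part-of-speech of the next input token
    ("input", 1, "pos"),
    ("stack", 0, "word"),  # word form (a lexical feature)
    ("input", 0, "word"),
]

def extract(model, config, tokens):
    """Map a parser configuration to a feature vector under `model`.
    `config` has .sigma (stack, top = last) and .tau (remaining input);
    `tokens` maps node -> {"word": ..., "pos": ...}."""
    values = []
    for structure, offset, attribute in model:
        nodes = config.sigma[::-1] if structure == "stack" else config.tau
        if offset < len(nodes):
            values.append(tokens[nodes[offset]][attribute])
        else:
            values.append(None)   # position falls outside the configuration
    return tuple(values)

# Example state: token 2 on the stack, tokens 3..6 remaining
cfg = SimpleNamespace(sigma=[2], tau=[3, 4, 5, 6])
toks = {
    2: {"word": "gäller", "pos": "vb.fin"},
    3: {"word": "också", "pos": "ab"},
    4: {"word": "för", "pos": "pp"},
    5: {"word": "mopedister", "pos": "nn.nom"},
    6: {"word": ".", "pos": "mad"},
}
features = extract(FEATURE_MODEL, cfg, toks)
# features == ("vb.fin", None, "ab", "pp", "gäller", "också")
```

Defining the model as data rather than code is what makes it possible to load a new feature model without reprogramming the system.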
Given a set of training instances, where each instance is a fingerprint of the current state of the parser, as specified by the feature model, together with the transition to the next parser state, the task of the learner is to induce a model at learning time. At parsing time, this model is then used for predicting the next parser state. This task can easily be formulated as a classification task, for which discriminative learning methods are well suited.
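A hedged sketch of this formulation: training instances pair a feature vector with the transition that was taken, and a classifier predicts transitions for unseen states. The feature values and transition labels below are invented for the example, and a toy 1-nearest-neighbor classifier stands in for the actual learners (SVM, MBL):

```python
# Each instance: (feature vector of the parser state, transition taken)
def train(instances):
    # Memory-based learning in its simplest form: store all instances
    return list(instances)

def predict(model, features):
    # 1-nearest neighbor under overlap similarity: count matching values
    def overlap(stored):
        return sum(a == b for a, b in zip(stored[0], features))
    return max(model, key=overlap)[1]

model = train([
    (("nn", "vb"), "LA(SUB)"),    # noun before verb: attach to the left
    (("vb", "ab"), "RA(ADV)"),    # adverb after verb: attach to the right
    (("vb", "mad"), "RA(IP)"),    # final punctuation
])
prediction = predict(model, ("nn", "vb"))
# prediction == "LA(SUB)"
```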
The architecture is based on the theoretical framework of inductive dependency parsing by Nivre (2006) and has been realized in a system called MaltParser (Nivre et al. 2006), which in the current version supports two parsing algorithms, in several versions, and two learning methods (MBL and SVM), for which complex feature models can be defined in a special description language. The implementation of the MaltParser system has been joint work together with Joakim Nivre. MaltParser was first equipped with an interface to a memory-based learner called TiMBL (Daelemans and Van den Bosch 2005), and Nivre (2006) contains an extensive evaluation of memory-based dependency parsing using the parsing algorithm defined in Nivre (2003). In order to validate the generality and flexibility of the architecture, we therefore have to extend the parser with an interface to another learner and implement an additional deterministic parsing algorithm. We have chosen to use SVM as the learning method, because it has been proven to give good results for similar tasks (Kudo and Matsumoto 2000b; Yamada and Matsumoto 2003; Sagae and Lavie 2005). For the parsing algorithm, we have chosen the incremental algorithm described in Covington (2001).
Using this new implementation, we have performed three sets of experiments, designed to answer three essential questions:

1. Validation of the implementation: Does MaltParser realize the underlying architecture, so that it is possible to vary parsing algorithm, feature model and learning method independently?

2. Investigation of the SVM interface: How do the special properties of the SVM interface affect parsing accuracy and time efficiency? How can learning and parsing efficiency be improved without sacrificing accuracy?

3. Comparison of MBL and SVM: Which of the learning methods is best suited for the task of inductive labeled dependency parsing, taking both parsing accuracy and time efficiency into account?

Apart from answering these questions, we will try to identify future directions that hopefully will be useful for the long-term research problems.
1.2 Outline of the Thesis

In this introductory chapter, we have tried to outline the long-term research problem and the specific aims of the licentiate thesis. The structure of the remaining chapters is as follows.

Chapter 2, Inductive Dependency Parsing
Chapter 2 reviews the background material for this thesis. We define the problem of parsing unrestricted natural language text and discuss different algorithms for dependency parsing. Furthermore, data-driven parsing and especially history-based models are discussed. The chapter continues with a description of the two machine learning methods used in the rest of the thesis: SVM and MBL. Finally, the chapter ends with a section which briefly presents related work.

Chapter 3, MaltParser
Chapter 3 presents an architecture for parsing unrestricted natural language text with dependency structures. The architecture is described in detail with focus on the two main modules, Parser and Guide. The MaltParser system is an implementation of the architecture, and the chapter ends with a description of this system.

Chapter 4, Experiments
Chapter 4 starts with a presentation of the treebank data used for the experiments and an explanation of the evaluation criteria used to validate the implementation of the proposed architecture. An investigation of the three questions explained above is presented, based on extensive experiments.

Chapter 5, Conclusion
Chapter 5 contains the main conclusions and a summary of the main results of the thesis. The chapter ends with a discussion of directions for future research.
1.3 Division of Labor

As already stated, the design and implementation of MaltParser is joint work with Joakim Nivre. More specifically, the work has been divided as follows:

• The design of the architecture is joint work.

• The implementation of parsing algorithms, generic feature model handling and the memory-based learner is mainly the work of Joakim Nivre.

• The implementation of all other parts of the system, including the SVM learner, is mainly the work of Johan Hall.

2 Background
Syntactic parsing is used in many applications such as machine translation, information extraction and question answering. Applications dealing with unrestricted text need to handle all kinds of text, including grammatically correct text, ungrammatical text and foreign expressions. It is desirable that such an application produces some kind of analysis. Of course, if the input is garbage, it is most likely that the system will fail to create an interesting analysis, but the system should nevertheless do its best to produce an analysis. If these applications need a syntactic parser, it also needs to be able to handle unrestricted text, although we need to restrict the text to a certain natural language to be able to derive a meaningful syntactic representation.
Nivre (2006) introduces the notion of text parsing to characterize this open-ended problem, which can only be evaluated with respect to empirical samples of a text language. (The term text language does not exclude spoken language, but emphasizes that it is a language that occurs in real texts. In principle, the notion applies also to utterances in spoken dialogue.)

Our approach to text parsing is dependency-based and data-driven. The goal of dependency-based text parsing is to construct a dependency graph for each sentence in a text. Figure 2.1 shows an example of a dependency graph, connecting the words in a Swedish sentence by binary relations labeled with dependency types (grammatical functions).

Figure 2.1: Dependency graph for a Swedish sentence ("Cykelreglerna gäller också för mopedister.", in English "Biking rules are valid also for moped riders."), converted from Talbanken. Each token is listed with its position, part-of-speech tag, dependency type and head:

1 Cykelreglerna (nn.nom): SUB, head 2
2 gäller (vb.fin): ROOT, head 0
3 också (ab): ADV, head 2
4 för (pp): ADV, head 2
5 mopedister (nn.nom): PR, head 4
6 . (mad): IP, head 2

Data-driven methods comply well with the fact that text parsing uses empirical samples of a text language. A realistic approach is then to use some kind of supervised learning method that makes use of a treebank, which consists of syntactically annotated sentences. A problem with this approach is that it restricts us to languages that have at least one treebank. In addition, these treebanks are often annotated with constituency-based representations and therefore need to be converted to dependency-based representations.

Given that we have a treebank for a specific language, our approach is to induce a parser model at learning time and use this parser model to parse sentences. However, since it is problematic to use the dependency graph directly to construct such a model, we instead use a deterministic parsing algorithm to map a dependency graph to a transition sequence such that this transition sequence uniquely determines the dependency graph. An individual transition can be, for example, shifting a token onto a stack or adding an arc between two tokens. The transition system in itself is normally nondeterministic, and we therefore need a mechanism that resolves this nondeterminism. We use a discriminative learning method, such as SVM or MBL, to construct a classifier. Moreover, we use history-based feature models to extract vectors of feature-value pairs from the current parser state as training material for the classifier.
In this chapter, we review the necessary background for the design and implementation of MaltParser, focusing on the framework of inductive dependency parsing proposed by Nivre (2006). Most of the notation used by Nivre (2006) is also used here, but in some cases the notation has to be extended. The rest of the chapter is structured as follows. Section 2.1 describes the basic requirements on text parsing. Section 2.2 presents the necessary definitions of dependency graphs. Section 2.3 presents the parsing framework, including the deterministic parsing algorithms, the history-based feature models and discriminative learning methods. Related work is discussed in Section 2.4.
2.1 Requirements on Text Parsing

We begin by defining a text as a sequence T = (x₁, …, xₙ) of sentences, where each sentence xᵢ = (w₁, …, wₘ) is a sequence of tokens, and a token wⱼ is a sequence of characters, usually a word form. Given a text T, the task of text parsing is to derive the correct analysis yᵢ for every sentence xᵢ ∈ T. We assume that the text T contains sentences of a text language L, which in our case is a natural language. This assumption entails that the text language is not a formal language and that parsing does not entail recognition. Instead, we see text parsing as an empirical approximation problem. Therefore, this approach is not well suited for grammar checking in a word-processing application, because it will try to find an analysis also for an ungrammatical sentence.

Given these definitions we can define four basic requirements on a text parser (Nivre 2006):

Definition 2.1. A parser P should map a text T = (x₁, …, xₙ) in language L to well-formed syntactic representations (y₁, …, yₙ) in a way that satisfies the following requirements:

1. Robustness: P assigns at least one analysis yᵢ to every sentence xᵢ ∈ T.
2. Disambiguation: P assigns at most one analysis yᵢ to every sentence xᵢ ∈ T.
3. Accuracy: P assigns the correct analysis yᵢ to every sentence xᵢ ∈ T.
4. Efficiency: P processes every sentence xᵢ ∈ T in time and space that is polynomial in the length of xᵢ.

We want to create a parser that uses a parsing strategy that assigns at least one analysis to each sentence (Robustness) and at most one analysis (Disambiguation). The third requirement (Accuracy) is unrealistic in practice, but we will use it as an evaluation criterion in Chapter 4. In order to satisfy the fourth requirement, we will use deterministic parsing algorithms with at most quadratic time complexity and linear space complexity. We will use two parsing algorithms that have linear complexity (Nivre's arc-eager and arc-standard algorithms) and one that has quadratic complexity (Covington's algorithm). In the experiments, the Efficiency requirement will be an evaluation criterion that measures the time it takes to parse a text.
2.2 Dependency Graphs

Dependency parsing is based on syntactic representations built from binary relations between tokens (or words), labeled with syntactic functions or dependency types. We define such representations as dependency graphs:

Definition 2.2. Given a sentence x = (w₁, …, wₙ) and a set R = {r₀, r₁, …, rₘ} of dependency types, a dependency graph for a sentence x is a labeled directed graph G = (V, E, L), where:

1. V = Zₙ₊₁ = {0, 1, 2, …, n}
2. E ⊆ V × V
3. L : E → R

A dependency graph consists of a set V of nodes, where a node is a non-negative integer (including n). Every positive node has a corresponding token in the sentence x, and we will use the term token node for these nodes (i.e., the token wᵢ corresponds to the token node i). In addition, there is a special root node 0, which is the root of the dependency graph and has no corresponding token in the sentence x. Furthermore, the set V⁺ denotes the set of token nodes, i.e., V⁺ = V − {0}. There is a practical advantage in using position indices instead of word forms to represent tokens (Maruyama 1990), which allows the use of the arithmetic relation < to order the nodes, and ensures that every token has a unique node in the graph.

An arc (i, j) ∈ E connects two nodes i and j in the graph and represents a dependency relation where i is the head and j is the dependent. The notation i → j will be used for the pair (i, j) ∈ E, and i →* j for the reflexive and transitive closure, i.e., i →* j if and only if there is a path of zero or more arcs connecting i to j. Finally, the function L labels every arc i → j with a dependency type r ∈ R, and an arc with a label r will be denoted i →ᵣ j.

To be able to construct a dependency graph using a parsing algorithm, we usually have to define some basic constraints that a graph must satisfy.
Definition 2.3. A dependency graph G is well-formed if and only if the following constraints hold:

1. Root: The node 0 is a root, i.e., there is no node i such that i → 0.
2. Connectedness: G is weakly connected, i.e., for every node i there is some node j such that i → j or j → i.
3. Single-Head: Each node has at most one head, i.e., if i → j then there is no node k such that k ≠ i and k → j.
4. Acyclicity: G is acyclic, i.e., if i → j then not j →* i.
5. Projectivity: G is projective, i.e., if i → j then i →* k, for every node k such that i < k < j or j < k < i.

A special root node makes it easier to comply with the second constraint, Connectedness, since it is always possible to hook up any node to the root, and in that way the graph will always be connected. Furthermore, with a root node we always know the entrance to the graph. The third constraint, Single-Head (sometimes called uniqueness), is commonly assumed in dependency grammar, although Hudson (1984) allows multiple heads to capture certain transformational phenomena, where a single token is connected to more than one position in the sentence. The fourth constraint, Acyclicity, together with the first three constraints entails that the graph is a rooted tree. These assumptions make it simpler to construct parsing algorithms that build dependency trees automatically.

The last constraint, Projectivity, is more controversial, and most dependency grammars allow non-projective graphs, because non-projective representations are able to capture non-local dependencies. There exist several treebanks that contain non-projective structures, such as the Prague Dependency Treebank of Czech (Hajič et al. 2001) and the Danish Dependency Treebank (Kromann 2003). We will assume the constraint Projectivity here, because the parsing algorithms used in this thesis are limited to projective structures and the treebanks used only contain projective structures. Moreover, when dealing with non-projective data, it is possible to projectivize the training data and recover non-projective dependencies by applying an inverse transformation after parsing in a post-processing step (Nivre and Nilsson 2005).
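The constraints of Definition 2.3 can be checked mechanically. The following Python sketch is illustrative; it encodes a graph by its head function (as a dictionary mapping each token node to its head), so Root and Single-Head hold by construction and only Connectedness, Acyclicity and Projectivity need to be verified:

```python
def is_well_formed(n, head):
    """head maps each token node 1..n to its head node (0 = root).
    Node 0 never appears as a dependent, so Root holds by construction;
    a mapping gives each node at most one head, so Single-Head holds too."""
    def reaches_root(i):
        seen = set()
        while i != 0:
            if i in seen:        # a cycle: the chain never reaches node 0
                return False
            seen.add(i)
            i = head[i]
        return True
    # Acyclicity and Connectedness: every head chain must end at node 0
    if not all(reaches_root(j) for j in range(1, n + 1)):
        return False
    # Projectivity: for every arc i -> j, each node k strictly between
    # i and j must be dominated by i (i ->* k)
    def dominates(i, k):
        while k != 0:
            if k == i:
                return True
            k = head[k]
        return i == 0
    for j in range(1, n + 1):
        i = head[j]
        lo, hi = min(i, j), max(i, j)
        if not all(dominates(i, k) for k in range(lo + 1, hi)):
            return False
    return True

# The graph of Figure 2.1 (heads: 1->2, 2->0, 3->2, 4->2, 5->4, 6->2)
ok = is_well_formed(6, {1: 2, 2: 0, 3: 2, 4: 2, 5: 4, 6: 2})
# ok == True
```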
Figure 2.1 shows a labeled projective dependency graph for a Swedish sentence, where each word of the sentence is tagged with its part-of-speech and each arc is labeled with a dependency type.
(The dependency types used in Figure 2.1 are described in Section 4.1.1.)

2.3 Inductive Dependency Parsing

The framework of inductive dependency parsing, as characterized by Nivre (2006), is based on three essential elements:

1. Deterministic parsing algorithms for building dependency graphs (Kudo and Matsumoto 2002; Yamada and Matsumoto 2003; Nivre 2003)

2. History-based feature models for predicting the next transition from one parser configuration to another (Black et al. 1992; Magerman 1995; Ratnaparkhi 1997; Collins 1999)

3. Discriminative learning methods to map histories to transitions (Veenstra and Daelemans 2000; Kudo and Matsumoto 2002; Yamada and Matsumoto 2003; Nivre et al. 2004)
In this section we will discuss these three elements. Section 2.3.1 presents two deterministic dependency-based parsing algorithms. Section 2.3.2 describes how history-based models can be used for predicting the next transition from one parser configuration to another. Finally, Section 2.3.3 explains how we can use discriminative learning methods for inducing a classifier that maps parser configurations to transitions, including a brief description of the two learning methods used in the experiments: SVM and MBL.
2.3.1 Deterministic Dependency Parsing

Mainstream approaches to data-driven text parsing are based on nondeterministic parsing techniques, but the disambiguation can be performed deterministically, using a greedy parsing algorithm that approximates a globally optimal solution by making a sequence of locally optimal choices (see Section 2.4 for more details of related work in this area). The experiments in Chapter 4 will use two parsing algorithms, called Nivre's algorithm and Covington's algorithm, and both algorithms come in two versions. We begin by defining parser configurations that can be used by both algorithms, following Nivre (2006):
Definition 2.4. Given a set R = {r₀, r₁, …, rₘ} of dependency types and a sentence x = (w₁, …, wₙ), a parser configuration for x is a quintuple c = (σ, τ, υ, h, d), where:

1. σ is a stack of partially processed token nodes i (1 ≤ i ≤ j for some j ≤ n).
2. τ is a list of remaining input token nodes i (k ≤ i ≤ n for some k > j).
3. υ is a stack of token nodes i occurring between the token j on top of the stack σ and the next input token k in the list τ (j < i < k).
4. h : Vₓ⁺ → Vₓ is a head function from token nodes to nodes.
5. d : Vₓ⁺ → R is a label function from token nodes to dependency types.
6. For every token node i ∈ Vₓ⁺, d(i) = r₀ only if h(i) = 0.

The definition of a parser configuration introduces three data structures: a stack σ, a list τ and a stack υ. The first two data structures (the stack σ and the list τ) are included in the definition of Nivre (2006). Here the definition is extended with a stack υ, which we call the context stack and which is used by Covington's algorithm. In order to define the parsing algorithms later in this section, we will represent all three data structures as lists. To be able to use individual components in these lists, we will use j|τ to represent a list of input tokens with head j and tail τ, while σ|i and υ|i represent stacks with the top i and tails σ and υ. An empty stack/list is represented by ε.

The symbols Vₓ⁺ and Vₓ are used to indicate that V⁺ and V are the nodes for the sentence x. The head function h defines the partially built dependency graph. For every token node i there is a syntactic head h(i) = j. If the token node i is not yet attached to a head, the special root node h(i) = 0 is used. Finally, the label function d labels the partially built dependency structure, where every token node i is assigned a dependency type rⱼ using the label function d(i) = rⱼ (d(i) = r₀ is used for token nodes that are not yet attached). We establish a connection between parser configurations and dependency graphs in the following way (Nivre 2006):
Definition 2.5. A parser configuration c = (σ, τ, υ, h, d) for x defines the dependency graph G_c = (Vₓ, E_c, L_c), where:

1. E_c = {(i, j) | h(j) = i}
2. L_c = {((i, j), r) | h(j) = i, d(j) = r}

For the functions h and d, we will use the notation f[x ↦ y]; if f(x) = y′, then f[x ↦ y] = f − {(x, y′)} ∪ {(x, y)}.

Definition 2.6. A parser configuration c for the sentence x = (w₁, …, wₙ) is initial if and only if it has the form c = (ε, (1, …, n), ε, h₀, d₀), where:

1. h₀(i) = 0 for every i ∈ Vₓ⁺.
2. d₀(i) = r₀ for every i ∈ Vₓ⁺.

When the parser begins to parse a sentence, the two stacks σ and υ are empty and all the token nodes of the sentence are in the list τ. In the beginning, all token nodes are dependents of the special root node 0 and labeled with the special label r₀. The parser terminates the parsing of a sentence when the following condition is met:

Definition 2.7. A parser configuration c for the sentence x = (w₁, …, wₙ) is terminal if and only if it has the form c = (σ, ε, υ, h, d) (for arbitrary σ, υ, h and d).

The parser processes the input left-to-right and terminates whenever the list of input tokens is empty. The set C will denote all possible configurations and Cₙ the set of non-terminal configurations, i.e., any configuration c = (σ, τ, υ, h, d) where τ ≠ ε. A transition from a non-terminal configuration to a new configuration is a partial function t : Cₙ → C.

We will define a transition system for each version of the algorithms, which is nondeterministic. Hence, there will be more than one transition applicable to a given configuration. An oracle o : Cₙ → (Cₙ → C) is used to overcome this nondeterminism (Kay 2000). For each nondeterministic choice point the parsing algorithm will ask the oracle to predict the next transition. In this section we will consider the oracle as a black box, which always knows the correct transition. In Section 2.3.2, we will see that we can approximate this oracle by inducing a classifier.
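The configurations of Definitions 2.4, 2.6 and 2.7 are concrete enough to sketch as a small data structure. The following is an illustrative Python sketch, not MaltParser's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Configuration:
    # c = (sigma, tau, upsilon, h, d) as in Definition 2.4
    sigma: list     # stack of partially processed token nodes (top = last)
    tau: list       # list of remaining input token nodes
    upsilon: list   # context stack (used only by Covington's algorithm)
    h: dict         # head function: token node -> node (0 = root/unattached)
    d: dict         # label function: token node -> dependency type

    @classmethod
    def initial(cls, n):
        # Definition 2.6: empty stacks, all token nodes in tau,
        # h0(i) = 0 and d0(i) = r0 for every token node i
        return cls([], list(range(1, n + 1)), [],
                   {i: 0 for i in range(1, n + 1)},
                   {i: "r0" for i in range(1, n + 1)})

    def is_terminal(self):
        # Definition 2.7: terminal iff the input list tau is empty
        return not self.tau

c = Configuration.initial(6)
# c.is_terminal() is False until tau has been consumed
```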
Nivre's algorithm. This parsing algorithm was first proposed for unlabeled dependency parsing by Nivre (2003) and was extended to labeled dependency parsing by Nivre et al. (2004). A sentence x = (w₁, …, wₙ) is parsed by the algorithm Parse-Nivre in the following way:

Parse-Nivre(x = (w₁, …, wₙ))
1  c ← (ε, (1, …, n), ε, h₀, d₀)
2  while c = (σ, τ, υ, h, d) is not terminal
3      if σ = ε
4          c ← Shift(c)
5      else
6          c ← [o(c)](c)
7  G ← (Vₓ, E_c, L_c)
8  return G

The algorithm will perform the Shift transition if the stack is empty, and otherwise let the oracle o predict the next transition o(c), as long as the parser remains in a non-terminal configuration c ∈ Cₙ. The Shift transition pushes the next input token i onto the stack σ. When the terminal configuration is reached, the dependency graph is returned.

The algorithm comes in two versions with two transition systems: an arc-eager and an arc-standard version. The arc-eager version uses four transitions, two of which are parameterized by a dependency type r ∈ R. The transition system updates the parser configuration as follows (Nivre 2006):
Denition2.8. Forevery
r ∈ R
,thefollowingtransitions arepossible:1. Shift:
(σ, i|τ, ǫ, h, d) → (σ|i, τ, ǫ, h, d)
2. Redu e:
(σ|i, τ, ǫ, h, d) → (σ, τ, ǫ, h, d)
if
h(i) 6= 0
3. Right-Ar (
r
):(σ|i, j|τ, ǫ, h, d) → (σ|i|j, τ, ǫ, h[j 7→ i], d[j 7→ r])
if
h(j) = 0
4. Left-Ar (
r
):(σ|i, j|τ, ǫ, h, d) → (σ, j|τ, ǫ, h[i 7→ j], d[i 7→ r])
if
h(i) = 0
The transition Shift (SH) shifts (pushes) the next input token i onto the stack σ. This is the correct action when the head of the next word is positioned to the right of the next word, or when the next word is a root. The transition Reduce (RE) reduces (pops) the token i on top of the stack σ. It is important to ensure that the parser does not pop the top token if it has not been assigned a head, since it would otherwise be left unattached.

The Right-Arc transition (RA) adds an arc from the token i on top of the stack σ to the next input token j, i.e., i →_r j, and involves pushing j onto the stack. Finally, the transition Left-Arc (LA) adds an arc from the next input token j to the token i on top of the stack σ, i.e., j →_r i, and involves popping i from the stack. This transition is only allowed when the top token i on the stack is still attached to the special root node 0, i.e., h(i) = 0. We make use of the assumption of projectivity: because we know that the top token i cannot have any more left or right dependents, it can therefore be popped.
Nivre's arc-eager algorithm is guaranteed to terminate after at most 2n transitions, given a sentence of length n (Nivre 2003). Furthermore, it always produces a dependency graph that is acyclic and projective. The correct transition sequence for the Swedish sentence shown in Figure 2.1, using Nivre's arc-eager algorithm, is as follows:
(ε, (1, ..., 6), ε, h_0, d_0)   D SH
→ ((1), (2, ..., 6), ε, h_0, d_0)   N LA(SUB)
→ (ε, (2, ..., 6), ε, h_1 = h_0[1 ↦ 2], d_1 = d_0[1 ↦ SUB])   D SH
→ ((2), (3, ..., 6), ε, h_1, d_1)   N RA(ADV)
→ ((2, 3), (4, 5, 6), ε, h_2 = h_1[3 ↦ 2], d_2 = d_1[3 ↦ ADV])   N RE
→ ((2), (4, ..., 6), ε, h_2, d_2)   N RA(ADV)
→ ((2, 4), (5, 6), ε, h_3 = h_2[4 ↦ 2], d_3 = d_2[4 ↦ ADV])   N RA(PR)
→ ((2, 4, 5), (6), ε, h_4 = h_3[5 ↦ 4], d_4 = d_3[5 ↦ PR])   N RE
→ ((2, 4), (6), ε, h_4, d_4)   N RE
→ ((2), (6), ε, h_4, d_4)   N RA(IP)
→ ((2, 6), ε, ε, h_5 = h_4[6 ↦ 2], d_5 = d_4[6 ↦ IP])

The first row presents the initial parser configuration, with an empty stack and h_0(i) = 0 and d_0(i) = r_0 for every node i ∈ V. The second row shows the parser configuration after the Shift transition has been executed. The annotation next to each configuration tells us whether the transition taken from it is deterministic (D) or nondeterministic (N), in other words whether the oracle o is used or not. For example, the transition out of the first configuration can only be a Shift, because the stack is empty (D), whereas the transition out of the second configuration is nondeterministic (N).
The arc-standard version uses strict bottom-up processing, as in traditional shift-reduce parsing. The algorithms of Kudo and Matsumoto (2002), Yamada and Matsumoto (2003) and Cheng et al. (2005a) use the arc-standard strategy, but also allow multiple passes over the input.

The arc-standard version uses a transition system similar to the arc-eager version, but it has only three transitions: Shift, Left-Arc and Right-Arc (no Reduce). The first two transitions, Shift and Left-Arc, are applied in exactly the same way as in the arc-eager version. The transition system is defined as follows:
1. Shift: (σ, i|τ, ε, h, d) → (σ|i, τ, ε, h, d)

2. Right-Arc(r): (σ|i, j|τ, ε, h, d) → (σ, i|τ, ε, h[j ↦ i], d[j ↦ r]) if h(j) = 0

3. Left-Arc(r): (σ|i, j|τ, ε, h, d) → (σ, j|τ, ε, h[i ↦ j], d[i ↦ r]) if h(i) = 0
Instead of pushing the next token j onto the stack σ, Right-Arc moves the topmost token i on the stack back to the list of remaining input tokens τ, where it replaces the token j as the next token. The transition sequence for the same sentence using Nivre's arc-standard algorithm is:
(ε, (1, ..., 6), ε, h_0, d_0)   D SH
→ ((1), (2, ..., 6), ε, h_0, d_0)   N LA(SUB)
→ (ε, (2, ..., 6), ε, h_1 = h_0[1 ↦ 2], d_1 = d_0[1 ↦ SUB])   D SH
→ ((2), (3, ..., 6), ε, h_1, d_1)   N RA(ADV)
→ (ε, (2, 4, ..., 6), ε, h_2 = h_1[3 ↦ 2], d_2 = d_1[3 ↦ ADV])   D SH
→ ((2), (4, ..., 6), ε, h_2, d_2)   N SH
→ ((2, 4), (5, 6), ε, h_2, d_2)   N RA(PR)
→ ((2), (4, 6), ε, h_3 = h_2[5 ↦ 4], d_3 = d_2[5 ↦ PR])   N RA(ADV)
→ (ε, (2, 6), ε, h_4 = h_3[4 ↦ 2], d_4 = d_3[4 ↦ ADV])   D SH
→ ((2), (6), ε, h_4, d_4)   N RA(IP)
→ (ε, (2), ε, h_5 = h_4[6 ↦ 2], d_5 = d_4[6 ↦ IP])   D SH
→ ((2), ε, ε, h_5, d_5)

We can see that the transitions are performed in a different order compared to the arc-eager version; for instance, Right-Arc(PR) is executed before Right-Arc(ADV).
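Operationally, the arc-standard Right-Arc attaches j to i and then moves i back to the front of the input list, rather than pushing j. This single transition can be sketched as follows (an illustrative representation with h and d as dictionaries, not the MaltParser implementation):

```python
# Sketch of the arc-standard Right-Arc(r): attach j to i, then move the
# attached top token i back to the front of the input list tau.

def right_arc_standard(c, r):
    sigma, tau, h, d = c
    i, j = sigma[-1], tau[0]
    assert h[j] == 0                  # precondition: j has no head yet
    h2, d2 = dict(h), dict(d)
    h2[j], d2[j] = i, r               # add arc i -> j with label r
    # pop i and put it back as the next input token, replacing j
    return (sigma[:-1], [i] + tau[1:], h2, d2)
```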
Covington's algorithm. Covington (2001) proposes several incremental parsing algorithms for dependency parsing. Two of the algorithms are the projective algorithm and the exhaustive left-to-right search algorithm. The first algorithm uses a head list with words that do not yet have heads, and a word list with all words encountered so far. We will not use these two data structures; instead we will describe these two algorithms using the data structures defined by the parser configuration: the stacks σ and υ, and the list τ. In fact, we will regard these two algorithms as one algorithm with two transition systems, or as two versions of the same algorithm. We will call the second version the unrestricted version, because it allows dependency graphs that are non-projective and cyclic. Both versions have quadratic complexity, since they proceed by trying to link each new token to each preceding token. It is also possible to define other versions, for example a version that conforms to the Acyclicity requirement but allows non-projective graphs, but this will not be done in this thesis. The adapted version of Covington's algorithm is described as follows:
Parse-Covington(x = (w_1, ..., w_n))
1   c ← (ε, (1, ..., n), ε, h_0, d_0)
2   while c = (σ, τ, υ, h, d) is not terminal
3       Done ← false
4       while σ ≠ ε and ¬Done
5           c ← [o(c)](c)
6       while υ ≠ ε
7           Push(Pop(υ), σ)
8       Push(First(τ), σ)
9   G ← (V_x, E_c, L_c)
10  return G
The algorithm begins by initializing the configuration with two empty stacks and all token nodes in the list τ, in the same way as Nivre's algorithm. As long as the parser remains in a non-terminal configuration, it will first iterate as long as the stack σ is not empty and the flag Done is false. The Done flag is only used by the projective version, to indicate that it can proceed to the next token without an empty stack. Before it can proceed with the next input token, the algorithm must move all unattached tokens in the context stack υ back to the stack σ. The Push function pushes a token onto a stack and the Pop function pops a token from a stack. Finally, the next input token is pushed onto the stack σ, using the function First to retrieve the first token in a list.
The unrestricted version uses three transitions, which are defined in the following way:

1. Reduce: (σ|i, τ, υ, h, d) → (σ, τ, υ|i, h, d)

2. Right-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[j ↦ i], d[j ↦ r])

3. Left-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[i ↦ j], d[i ↦ r])

All three transitions move the top token of the stack σ to the stack υ. The Right-Arc and Left-Arc transitions in addition add an arc i →_r j or an arc j →_r i, respectively.

The projective version makes use of the fact that it should build a projective graph, which allows the algorithm to continue with the next input token without exploring all combinations that could make the graph non-projective. The transition system is redefined as follows:

1. Reduce: (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h, d); Done ← true

2. Right-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ|i, h[j ↦ i], d[j ↦ r]); Done ← true, if h(j) = 0

3. Left-Arc(r): (σ|i, j|τ, υ, h, d) → (σ, j|τ, υ, h[i ↦ j], d[i ↦ r]) if h(i) = 0

The Reduce transition is exactly the same as in the unrestricted version, except that it sets the Done flag to true, in order to indicate that the remaining tokens in the stack σ cannot be linked to the token j, since this would produce a non-projective graph. The Right-Arc transition makes use of the fact that the arc i →_r j covers the tokens between the top token and the next token; to prevent the graph from becoming non-projective, the top token i of the stack σ is popped and then pushed onto the context stack υ, and the flag Done is assigned the value true, for the same reason as in the Reduce transition. The Left-Arc transition adds an arc j →_r i; because i cannot be linked to another token, it is popped from the stack σ.
2.3.2 History-Based Models

In section 2.3.1 we defined a set C of possible parser configurations, and for each version of the parsing algorithm we defined a transition system that is nondeterministic. Furthermore, we introduced an oracle o : C_n → (C_n → C), which the parsing algorithm uses to get the correct transition. If it is possible to derive the correct transitions from syntactically annotated sentences, we can use these as training data to approximate such an oracle through inductive learning. In other words, we define a one-to-one mapping from an input string x and a dependency graph G to a sequence of transitions S = (t_1, ..., t_m), such that S uniquely determines G. A transition t_i is dependent on all previously made transitions (t_1, ..., t_{i-1}) and all available information about these transitions, called the history. The history H_i = (t_1, ..., t_{i-1}) corresponds to some partially built structure, and we also include static properties that are kept constant during the parsing of a sentence, such as the word form and part of speech of a token.
The basic idea is thus to train a classifier that approximates an oracle, given that a treebank is available. We will call the approximated oracle a guide (Boullier 2003), because the guide does not guarantee that the transition is correct. The history H_i = (t_1, ..., t_{i-1}) contains complete information about all previous transitions. All this information is intractable for training a classifier. Instead we can use history-based feature models for predicting the next transition. History-based feature models were first introduced by Black et al. (1992) and have been used extensively in data-driven parsing (Magerman 1995; Ratnaparkhi 1997; Collins 1999). To make it tractable, the history H_i is replaced by a feature vector defined by a feature model Φ = (φ_1, ..., φ_p), where each feature φ_i is a function that identifies some significant property of the history H_i and/or the input string x. To simplify notation, we will write Φ(H_i, x) to denote the application of the feature vector (φ_1, ..., φ_p) to H_i and x, i.e., Φ(H_i, x) = (φ_1(H_i, x), ..., φ_p(H_i, x)).
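As a concrete (hypothetical) illustration, a feature model Φ could consist of two feature functions extracting the part of speech of the token on top of the stack and of the next input token; all names here are illustrative, not MaltParser's feature description language:

```python
# Hypothetical feature model: each feature function maps a parser state
# (here just the stack and input list) plus the input tokens x to a value.

def pos_of_stack_top(state, x):
    stack, inp = state
    return x[stack[-1]]["pos"] if stack else "NIL"

def pos_of_next_input(state, x):
    stack, inp = state
    return x[inp[0]]["pos"] if inp else "NIL"

PHI = (pos_of_stack_top, pos_of_next_input)   # the model Phi = (phi_1, phi_2)

def apply_model(phi, state, x):
    # Phi(H_i, x) = (phi_1(H_i, x), ..., phi_p(H_i, x))
    return tuple(f(state, x) for f in phi)
```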
At learning time, the parser derives the correct transition by using an oracle function o applied to a gold standard treebank. For each transition, it provides the learner with a training instance (Φ(H_i, x), t_i), where Φ(H_i, x) is the current vector of feature values and t_i is the correct transition. A set of training instances I is then used by the learner to induce a parser model, using a supervised learning method.

At parsing time, the parser uses the parser model as a guide to predict the next transition; now the vector of feature values Φ(H_i, x) is the input and the transition t_i is the output of the guide. Section 2.3.3 describes how we can train a classifier that makes this prediction.
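The learning-time and parsing-time use of the feature model can be summarized in a sketch (all names are illustrative; the parser model here is a simple lookup table, which is a simplifying assumption rather than an actual learning method):

```python
# Learning time: pair the feature vector Phi(H_i, x) of each visited
# configuration with the oracle's correct transition t_i.

def collect_instances(gold_derivation, feature_model, x):
    # gold_derivation: list of (configuration, correct_transition) pairs
    return [(feature_model(state, x), t) for state, t in gold_derivation]

def guide(parser_model, feature_model, state, x):
    # Parsing time: the induced model maps Phi(H_i, x) to a transition.
    return parser_model[feature_model(state, x)]
```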
2.3.3 Discriminative Learning Methods

The learning problem is to induce a classifier from a set of training instances I, relative to a specific feature model Φ, by using a learning algorithm. In this section, we will describe two discriminative learning methods, SVM and MBL, that can be used for this classification task.

In general, classification is the task of predicting the class y given a variable x, which can be accomplished by probabilistic methods; it is common to divide these methods into two classes: generative and discriminative. For generative methods, we use Bayes' rule to obtain P(y | x) by estimating the joint distribution P(x, y). By contrast, discriminative methods make no attempt to model underlying distributions and instead estimate P(y | x) directly. We will use two discriminative methods for the learning task: SVM and MBL.
Support Vector Machines. In the last decade, there has been a growing interest in Support Vector Machines (SVM), which were proposed by Vladimir Vapnik at the end of the seventies (Vapnik 1979). SVM is based on the idea that two linearly separable classes, the positive and negative samples in the training data, can be separated by a hyperplane with the largest margin. It has been shown that SVMs give good generalization performance in various research areas, such as face detection (Osuna et al. 1997) and pedestrian detection (Oren et al. 1997). Within natural language processing, they have been used extensively in, for example, text categorization (Joachims 1998), chunking (Kudo and Matsumoto 2001) and syntactic parsing (Yamada and Matsumoto 2003).
Given a data set of ℓ instance–label pairs I = {(x_i, y_i)}, i = 1, ..., ℓ, where x_i ∈ R^n and y_i ∈ {−1, 1}, x_i is the feature vector of the i-th sample, represented by an n-dimensional vector x_i = (f_1, ..., f_n), and y_i is the class label of the i-th sample, which belongs to either the positive (+1) or the negative (−1) class. The feature vector x_i will in our case be the feature vector defined by Φ(H_i, x), and the class label y_i will be the transition t_i, but we need a method that handles multiple class labels (more about that later in this section). The idea is to estimate a vector w and a scalar b which maximize the distance of any data point from the hyperplane defined by w · x + b. The goal of the SVM is to find the solution of the following optimization problem (Kudo and Matsumoto 2000a; Burges 1998):

Minimize:   L(w) = (1/2) ‖w‖²
Subject to: y_i (w · x_i + b) ≥ 1, ∀i = 1, ..., ℓ          (2.1)

Figure 2.2: A linear Support Vector Machine
In other words, the SVM method tries to find the hyperplane that separates the training data into two classes with the largest margin. Figure 2.2 illustrates two possible hyperplanes which correctly separate the training data into two classes; the left hyperplane has the largest margin between the two classes.

The data in Figure 2.2 are easy to separate into two classes, but in practice the data may be noisy and therefore not linearly separable. One solution is to allow some misclassifications by introducing a penalty parameter C, which defines the trade-off between the training error and the magnitude of the margin.
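For a toy linearly separable data set, the constraint of (2.1) can be checked directly; the geometric margin between the two supporting hyperplanes of a feasible w is 2/‖w‖ (a numerical sketch for illustration, not an SVM solver; the data set is invented):

```python
import math

def satisfies_constraints(w, b, data):
    # Check the constraint of (2.1): y_i * (w . x_i + b) >= 1 for all i.
    return all(y * (sum(wj * xj for wj, xj in zip(w, x)) + b) >= 1
               for x, y in data)

def margin(w):
    # Geometric margin between the two supporting hyperplanes: 2 / ||w||.
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Toy one-dimensional data: positives at x >= 2, negatives at x <= -2.
data = [((2.0,), 1), ((3.0,), 1), ((-2.0,), -1), ((-4.0,), -1)]
```

Scaling w down increases the margin 2/‖w‖ but eventually violates the constraint, which is exactly the trade-off the optimization in (2.1) resolves.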
SVM can be extended to solve problems that are not linearly separable. The feature vector x_i is mapped to a higher-dimensional space by a function φ, which makes it possible to carry out non-linear classification. The optimization problem can be rewritten in a dual form, which is done with a so-called kernel function K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) (Kudo and Matsumoto 2001; Vapnik 1998). There are many kernel functions, but the most common are:

• polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0

• radial basis function (RBF): K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²), γ > 0

• sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r)

where γ, r and d denote different kernel parameters (Hsu et al. 2004).

SVM is in its basic form a binary classifier, but many learning problems have to deal with more than two classes. To make SVM handle multi-class classification, many binary classifiers are used. For multi-class classification, we can choose between the methods one-against-all and all-against-all. Given that we have n classes, the one-against-all method trains n classifiers to separate each class from the rest, while the all-against-all method trains n(n − 1)/2 classifiers, one for each pair of classes (Vural and Dy 2004). A voting mechanism, or some other measure that discriminates across all these classifiers, is used to classify a new instance.
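The three kernel functions, and the number of binary classifiers required by the two multi-class strategies, can be computed directly (a sketch with illustrative names):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def polynomial(xi, xj, gamma, r, d):
    # K(x_i, x_j) = (gamma * x_i^T x_j + r)^d
    return (gamma * dot(xi, xj) + r) ** d

def rbf(xi, xj, gamma):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def sigmoid(xi, xj, gamma, r):
    # K(x_i, x_j) = tanh(gamma * x_i^T x_j + r)
    return math.tanh(gamma * dot(xi, xj) + r)

def n_classifiers(n, strategy):
    # one-against-all trains n classifiers, all-against-all n(n-1)/2.
    return n if strategy == "one-against-all" else n * (n - 1) // 2
```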
Memory-Based Learning. Memory-based learning (MBL) and classification is based on the assumption that a cognitive learning task to a high degree depends on direct experience and memory, rather than on the extraction of an abstract representation. MBL has been used for many language learning tasks, such as part-of-speech tagging (Cardie 1993; Daelemans et al. 1996), semantic role labeling (Van den Bosch et al. 2004; Kouchnir 2004) and syntactic parsing (Nivre et al. 2004).

MBL is a lazy learning method based on two fundamental principles: learning is storing experiences in memory, and solving a new problem is achieved by reusing solutions from previously solved problems that are similar to the new problem. During training, the idea of MBL is to collect the values of different features from the training data, together with the correct class (Daelemans and Van den Bosch 2005). MBL generalizes by applying a similarity metric, without abstracting away or eliminating low-frequency events. This similarity metric can be seen as an implicit smoothing mechanism for rare events. Daelemans and colleagues have shown that it may be harmful to eliminate rare events in the training data for language learning tasks (Daelemans et al. 2002), because it is very difficult to discriminate noise from valid exceptions.
The n feature values are mapped into an n-dimensional space, where each feature vector from the training data, with its corresponding class, is a point in this space. The task at decision time is to find the nearest neighbor(s) in this n-dimensional space and return a category based on the k nearest neighbor(s). The way this search is performed can be varied in many different ways.

The Overlap metric is one of the most basic metrics and uses the distance Δ(X, Y) between two patterns X and Y, which are represented as n features:

Δ(X, Y) = Σ_{i=1}^{n} w_i δ(x_i, y_i)          (2.2)

where w_i is a weight for feature i, and the function δ(x_i, y_i) is the distance per feature, which is 0 if x_i = y_i and 1 otherwise. The weight w_i can be calculated by a variety of methods, e.g., Information Gain (IG), which measures each feature's contribution to our knowledge with respect to the target class.
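The Overlap metric in (2.2), and k-nearest-neighbor classification over instances stored in memory, can be sketched as follows (illustrative names, not an actual MBL implementation):

```python
def overlap_distance(X, Y, weights):
    # Delta(X, Y) = sum_i w_i * delta(x_i, y_i); delta = 0 if equal, else 1.
    return sum(w * (0 if x == y else 1) for w, x, y in zip(weights, X, Y))

def classify_knn(instance, memory, weights, k=1):
    # `memory` holds (feature_vector, class) pairs stored during learning;
    # predict the majority class among the k nearest neighbors.
    ranked = sorted(memory,
                    key=lambda m: overlap_distance(instance, m[0], weights))
    votes = {}
    for _, cls in ranked[:k]:
        votes[cls] = votes.get(cls, 0) + 1
    return max(votes, key=votes.get)
```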
A variation of the Overlap metric is the more sophisticated Modified Value Difference Metric (MVDM), introduced by Cost and Salzberg (1993), which estimates the distance between two values of a feature by considering their co-occurrence with the target classes. However, this metric is more sensitive to sparse data.
2.4 Related Work

During the last decades, there has been great interest in data-driven methods for various natural language processing tasks. Data-driven approaches to syntactic parsing were first developed during the 1990s for constituency-based representations. The standard approaches are based on nondeterministic parsing techniques, usually involving some kind of dynamic programming, in combination with generative probabilistic models that provide an n-best ranking of the set of candidate analyses derived by the parser. The most well-known parsers based on these techniques are the parser of Collins (1997, 1999) and the parser of Charniak (2000). Discriminative learning methods have been used to enhance these parsers by reranking the analyses output