Linköping Studies in Science and Technology
Dissertation No. 1045
Generalized Hebbian Algorithm for Dimensionality
Reduction in Natural Language Processing
by
Genevieve Gorrell
Department of Computer and Information Science
Linköpings universitet
SE-581 83 Linköping, Sweden
Linköping 2006
The current surge of interest in search and comparison tasks in natural language processing has brought with it a focus on vector space approaches and vector space dimensionality reduction techniques. Presenting data as points in hyperspace provides opportunities to use a variety of well-developed tools pertinent to this representation. Dimensionality reduction allows data to be compressed and generalised. Eigen decomposition and related algorithms are one category of approaches to dimensionality reduction, providing a principled way to reduce data dimensionality that has time and again shown itself capable of enabling access to powerful generalisations in the data. Issues with the approach, however, include computational complexity and limitations on the size of dataset that can reasonably be processed in this way. Large datasets are a persistent feature of natural language processing tasks.

This thesis focuses on two main questions. Firstly, in what ways can eigen decomposition and related techniques be extended to larger datasets? Secondly, this having been achieved, of what value is the resulting approach to information retrieval and to statistical language modelling at the n-gram level? The applicability of eigen decomposition is shown to be extendable through the use of an extant algorithm, the Generalized Hebbian Algorithm (GHA), and the novel extension of this algorithm to paired data, the Asymmetric Generalized Hebbian Algorithm (AGHA). Several original extensions to these algorithms are also presented, improving their applicability in various domains. The applicability of GHA to Latent Semantic Analysis-style tasks is investigated. Finally, AGHA is used to investigate the value of singular value decomposition, an eigen decomposition variant, to n-gram language modelling. A sizeable perplexity reduction is demonstrated.
Parts of this doctoral thesis appear in other publications:

Gorrell, G., 2006. Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. In the Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento.

Gorrell, G. and Webb, B., 2005. Generalized Hebbian Algorithm for Latent Semantic Analysis. In the Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech 2005), Lisbon.

Also by this author:

Gorrell, G. 2004. Language Modelling and Error Handling in Spoken Dialogue Systems. Licentiate thesis, Linköping University, 2004.

Rayner, M., Boye, J., Lewin, I. and Gorrell, G. 2003. Plug and Play Spoken Dialogue Processing. In Current and New Directions in Discourse and Dialogue. Eds. Jan van Kuppevelt and Ronnie W. Smith. Kluwer Academic Publishers.

Gorrell, G. 2003. Recognition Error Handling in Spoken Dialogue Systems. Proceedings of the 2nd International Conference on Mobile and Ubiquitous Multimedia, Norrköping 2003.

Gorrell, G. 2003. Using Statistical Language Modelling to Identify New Vocabulary in a Grammar-Based Speech Recognition System. Proceedings of Eurospeech 2003.

Gorrell, G., Lewin, I. and Rayner, M. 2002. Adding Intelligent Help to Mixed Initiative Spoken Dialogue Systems. Proceedings of ICSLP 2002.

Knight, S., Gorrell, G., Rayner, M., Milward, D., Koeling, R. and Lewin, I. 2001. Comparing Grammar-Based and Robust Approaches to Speech Understanding: A Case Study. Proceedings of Eurospeech 2001.

Rayner, M., Lewin, I., Gorrell, G. and Boye, J. 2001. Plug and Play Speech Understanding. Proceedings of SIGDial 2001.

Rayner, M., Gorrell, G., Hockey, B. A., Dowding, J. and Boye, J. 2001. Do CFG-Based Language Models Need Agreement Constraints? Proceedings of NAACL 2001.

Korhonen, A., Gorrell, G. and McCarthy, D., 2000. Statistical Filtering and Subcategorisation Frame Acquisition. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora 2000.
Arne Jönsson, my supervisor, who has been a great supporter and an outstanding role model for me; Robin Cooper, second supervisor, for keeping me sane, on which all else depended; Joakim Nivre, third supervisor, for stepping in with additional support later on; Manny Rayner, Ian Lewin and Brandyn Webb, major professional and intellectual influences; Robert Andersson, the world's best systems administrator; everyone in GSLT, NLPLAB, KEDRI and Lingvistiken (GU) for providing such rich working environments; and finally, my
1 Introduction
  1.1 Eigen Decomposition
  1.2 Applications of Eigen Decomposition in NLP
  1.3 Generalized Hebbian Algorithm
  1.4 Research Issues

2 Matrix Decomposition Techniques and Applications
  2.1 The Vector Space Model
  2.2 Eigen Decomposition
  2.3 Singular Value Decomposition
  2.4 Latent Semantic Analysis

3 The Generalized Hebbian Algorithm
  3.1 Hebbian Learning for Incremental Eigen Decomposition
  3.2 GHA and Incremental Approaches to SVD
  3.3 GHA Convergence
  3.4 Summary

4 Algorithmic Variations
  4.1 GHA for Latent Semantic Analysis
    4.1.1 Inclusion of Global Normalisation
    4.1.2 Epoch Size and Implications for Application
  4.2 Random Indexing
  4.3 GHA and Singular Value Decomposition
  4.4 Sparse Implementation
  4.5 Convergence
    4.5.1 Staged Training
    4.5.2 Convergence Criteria
    4.5.3 Learning Rates
    4.5.4 Asymmetric GHA Convergence Evaluation

5 GHA for Information Retrieval
  5.1 GHA for LSA
    5.1.1 Method
    5.1.2 Results
  5.2 LSA and Large Training Sets
    5.2.1 Memory Usage
    5.2.2 Time
    5.2.3 Scaling LSA
  5.3 Summary

6 SVD and Language Modelling
  6.1 Modelling Letter and Word Co-Occurrence
    6.1.1 Word Bigram Task
    6.1.2 Letter Bigram Task
  6.2 SVD-Based Language Modelling
    6.2.1 Large Corpus N-Gram Language Modelling using Sparse AGHA
    6.2.2 Small Corpus N-Gram Language Modelling using LAS2
    6.3.2 Small Bigram Corpus
    6.3.3 Medium-Sized Trigram Corpus
    6.3.4 Improving Tractability
    6.3.5 Large Trigram Corpus
  6.4 Summary

7 Conclusion
  7.1 Detailed Overview
  7.2 Summary of Contributions
  7.3 Discussion
    7.3.1 Eigen Decomposition: A Panacea?
    7.3.2 Generalized Hebbian Algorithm: Overhyped or Underrated?
    7.3.3 Wider Perspectives
Introduction

In computational linguistics, as in many artificial intelligence-related fields, the concept of intelligent behaviour is central. Human-level natural language processing requires human-like intelligence, imbued as it is with our very human existence. Furthermore, language processing is highly complex, and so we might hope, broadly speaking, that a more intelligent system, however you define it, would be better able to handle language processing tasks. Machine intelligence has been gauged and defined in a number of ways. The Turing Test (50) approached intelligence as the ability to pass for human. Other definitions are based on the adaptivity of the system. Adaptivity is a powerful concept of intelligence given that arguably the point of a plastic central nervous system is to enable adaptation within an organism's lifetime (34). Work such as (32) develops such notions into a formal measure. But what intelligence can be said to be embodied in a simple static system? What definition might we apply to the intelligence of, for example, a system that does not adapt at all, but is nonetheless able to handle a predetermined set of circumstances in useful ways? Does such a system embody any intelligence at all, as the term is commonly used? Surely it can be said to have more or less intelligent behaviours, if not intelligence per se? Since the systems considered in this work fall into this category, some measure of their power might be useful.
In order to produce appropriate behaviours in response to input, a system needs first of all to be able to distinguish between different kinds of input in task-relevant ways. This may be simply achieved. For example, a touch-tone telephone system need only distinguish between a small number of simple and distinct tones in response to a question. This step may also be more challenging. For example, if you give a command via a spoken natural language interface, the system needs to be able to distinguish between different commands. It may not need to distinguish between different speakers, or the same speaker in different moods. These skills are within the capabilities of most human listeners, but are irrelevant to our system. So the task of distinguishing between commands is a complex modelling task involving identifying the features relevant to determining the users' wishes within the domain of the task being performed. Even getting as far as identifying the words the user most likely spoke is a non-trivial task, requiring many layers of abstraction and reasoning, and so a sophisticated model is required. Having formed a model by which to identify relevant input, the ability to generate appropriate responses follows, and the requirements at this stage depend on the nature of the system.
Human developmental studies have demonstrated the great significance of modelling within the mind, both on a high and a low level. Even very young babies show more interest in stimuli that challenge their world model, suggesting that right from the start, learning is a process of tuning the world model. For example, the distinction between animate and inanimate appears very quickly in the child's model of the world. Babies will look for longer at inanimate objects moving of their own volition in the manner of complex objects, and conversely, at animate objects such as other people following Newton's Laws as simple objects do (54). The world model provides a way to sift through the input and give attention to phenomena that most require it. On a low level, adaptation describes a nerve's ceasing to fire in response to unchanging input. You stare at the same colour for a time, and when you look away your visual field is marked by an absence of that colour. Direction and speed of motion are also compensated for. You are able to tune out constant noises. Adaptation is a very simple form of modelling, but other, more complex varieties also appear at a low level in the brain (25).
Approaches to creating an appropriately powerful model in an artificial system fall loosely into two categories. The model can be designed by a human, possibly from human intuitive perceptions of the structure of the input or complex human-generated theories such as linguistic theories of grammar, or the model can be acquired automatically from data. Advantages to the former include that humans are very powerful creators of models, and could we only encode our models well, they would surely surpass any automatically acquired model available today. Advantages to the latter include that they are a lot less work to create (for the humans) and potentially less prone to error, since they can be set up to make fewer assumptions, and choose a solution based on optimising the result. Computer chess can illustrate the difference well, with programs based on searching all possible paths to a depth beyond human abilities competing well, but by no means always prevailing, against human players using more advanced strategies but less raw computational power. Creating a system capable of acquiring a sufficiently powerful model automatically from the data requires that the basic learning framework is sufficiently sophisticated. For example, a system that learns semantic concepts through word co-occurrence patterns is never going to produce a theory of grammar, since it has no access to word order information. The input is insufficiently rich. A system that models the world based on the assumption that observations comprise the additive sum of relevant factors (for example, if it is sunny, I am happy; if it is Sunday, I am happy; therefore if it is a sunny Sunday, I am very happy) will fail to accurately model circumstances where cause and effect have a more involved relationship.
An adequate format by which to encode input is a critical first step. In this work I focus on vector space models. In the vector space model, data items are represented as points in a space in which there is one dimension for each feature. Similarity between the datapoints can then be thought of in terms of the distance between the points in space, and so the framework is ideal for problems in which similarity relationships between data need to be determined. The range of problems which can be characterised in these terms is very large indeed, and there are many ways in which different kinds of similarity can be targeted. Within the vector space representation there are a variety of ways in which the variation in the position of these points can then be processed and the relevant information sifted from the irrelevant and brought to the fore. I focus on one in particular, eigen decomposition, the properties of which will be discussed in more detail in the next section.
1.1 Eigen Decomposition

Eigen decomposition is a much-used technique within natural language processing as well as many other fields. Suppose you wish to model a complex dataset. The relations between the features are not clear to you. You focus on including as much information as possible. Your hyperspace therefore has a high dimensionality. As it turns out, two of the features depend entirely on each other (for example, it is night, and it is not day), and therefore one of your dimensions is superfluous, because if you know the value of one of the features, you know the value of the other. Within the plane formed by these two dimensions, points lie in a straight line. Some other features have more complex interdependencies. The value of a feature follows with little variation from the combined values of several other features (for example, temperature might relate to the number of daylight hours in the day, the amount of cloud cover and the time of day). Lines, planes and hyperplanes are formed by the data within subspaces of the vector space. Collapsing these dependencies into superfeatures can be thought of in terms of rotating the data, such that each dimension captures as much of the variance in the data as possible. This is what eigen decomposition does.

In addition, we can take a further step. By discarding the dimensions with the least variance we can further reduce the dimensionality of the data. This time, the reduction will produce an imperfect approximation of the data, but the approximation will be the best possible approximation of the data for that number of dimensions. The approximation might be valuable as a compression of the data. It might also be valuable as a generalisation of the data, in the case that the details are overfitting/noise.
The most important superfeatures in a dataset say something significant and important about that data. Between them they cover much of the variance of the dataset. It can be interesting in itself to learn the single most important thing about a dataset. For example, as we learn later in using a related technique to learn word bigrams, given one and only one feature, the single most important thing you can say about word bigrams in the English language is what words precede "the". With this one feature, you explain as much as you possibly can about English bigrams using only one feature. Each datum, which previously contained values for each of the features in your original hyperspace, now contains values positioning it in the new space. Its values relate it to the new superfeatures rather than the original feature set we started out with. An unseen datum should be able to be approximated well in terms of a projection on each dimension in the new space. In other words, assuming an appropriate training corpus, unseen data can be described as a quantity of each of the superfeatures.
Eigenfaces provide an appealing visual illustration of the general idea of eigen decomposition. Eigen decomposition is widely used in computer vision, and one task to which it has been usefully applied is face recognition. Eigen decomposition can be applied to corpora of images of faces such that superfeatures can be extracted. Figures 1.1 (36) and 1.2 (unknown source) show some examples of these eigenfaces found on the web. (The differences in these eigenfaces are attributable to the training data on which they were prepared.) Note that the first eigenface in 1.1 is indeed a very plausible basic male face.

Figure 1.2: More Eigenfaces

Figure 1.4: More Eigenface Convergence

Later eigenfaces only make sense in the context of being combined with other eigenfaces to produce an additive effect. Figures 1.3 (27) and 1.4 (51) show images converging on a target as more and more eigenfaces are included. (In these particular examples however the target image formed part of the training set.)
Note at this point that eigen decomposition only removes linear dependency in the original feature set. A linear dependency between a feature and one or more others is a dependency in which the value of a feature takes the form of a weighted sum of the other features. For example, if there is a good film showing then Jane is more likely to go to the cinema. If John is going to the cinema, then Jane is more likely to go. Therefore if there is a good film showing and John is going to the cinema, then Jane is even more likely to go. A non-linear dependency might occur, for example, if Jane prefers to see good films alone so she can concentrate. So if there is a good film showing, then she is more likely to go to the cinema. If John is going to the cinema, then she is more likely to go. If, however, there is a good film showing and John is going to the cinema, then Jane is less likely to go. Examples of this in language abound, and so the applicability of eigen decomposition often depends on the extent of the linearity in the data. For example, words appearing often with "blue moon" are very different to words appearing often with "blue" or "moon".
1.2 Applications of Eigen Decomposition in NLP

Dimensionality reduction techniques such as eigen decomposition are of great relevance within the field of natural language processing. A persistent problem within language processing is the over-specificity of language given the task, and the sparsity of data. Corpus-based techniques depend on a sufficiency of examples in order to model human language use, but the very nature of language means that this approach has diminishing returns with corpus size. In short, there are a large number of ways to say the same thing, and no matter how large your corpus is, you will never cover all the things that might reasonably be said. You will always see something new at run-time in a task of any complexity.

Furthermore, language can be too rich for the task. The number of underlying semantic concepts necessary to model the target domain is often far smaller than the number of ways in which these concepts might be described, which makes it difficult to, in a search problem for example, establish that two documents, practically speaking, are discussing the same thing. Any approach to automatic natural language processing will encounter these problems on several levels.

Consider the task of locating relevant documents given a search string. Problems arise in that there are often several ways of referring to the same concept. How do we know, for example, that cat and feline are the same thing? There are plenty of documents relevant to felines that feature the word feline not once. This is the kind of problem that Latent Semantic Analysis aims to solve, and in doing so, provides natural language processing with its best-known application of eigen decomposition. (Strictly speaking, LSA is usually described in terms of singular value decomposition, a close relative of eigen decomposition. It allows paired data to be processed, such as, in this case, document/wordbag pairs.) Documents containing the word elevator, for example, do not typically contain the word lift, even though documents about lifts are semantically relevant to searches about elevators. The feature vectors for documents about elevators therefore contain no value in the dimension for the word lift. However, in describing the variance in the words in a set of documents, there is much redundancy between documents containing elevator and documents containing lift. Both co-occur with many similar documents. So when eigen decomposition is performed on the dataset, the two are automatically combined into a superfeature, and the differences remaining between them are captured instead in another superfeature, which is probably being reused to explain a number of other phenomena too. For example, one feature might cover several aspects of the differences between UK and US English. Later eigenvectors capture the details of the differences between them, such that given the complete set of eigenvectors, the two words once again become exclusive. However, by discarding some of the less important features we can stop that from happening.
1.3 Generalized Hebbian Algorithm

Having decided to pursue eigen decomposition, a further challenge awaits. Calculating the eigen decomposition is no trivial feat, and the best of the algorithms available are nonetheless computationally demanding. Current research about and using the technique in natural language processing often focuses on adapting the data to the constraints of the algorithm, and adapting the algorithm to the constraints of the data (49)(9). This work is no exception. The Generalized Hebbian Algorithm (GHA) is an algorithm that grew from a different paradigm to the bulk of the work on eigen decomposition, though not an unfamiliar one to many computational linguists and artificial intelligence researchers. Originating as an artificial neural network learning algorithm, it brings with it many of the advantages of that paradigm. Learning updates are cheap and localised and input is assumed to be a stream of independent observations. Learning behaviour also has some interesting properties. It is, in short, very different from other more standard approaches to calculating eigen decompositions, and is therefore appropriate in a different range of circumstances.
1.4 Research Issues

This work aims to investigate the applicability of eigen decomposition within natural language processing (NLP), and to extend it, both in NLP and in computer science in general. Large corpora have traditionally been problematic for eigen decomposition. Standard algorithms place limitations on the size of dataset that can be processed. In this work the Generalized Hebbian Algorithm is considered as an alternative, potentially allowing larger datasets to be processed. This thesis presents original, published extensions to GHA, via which the algorithm is made more appropriate to relevant tasks within and beyond natural language processing. The algorithm is adapted to paired datasets (singular value decomposition), which is required for the language modelling task as well as many others in computer science in general, and is adapted to sparse data, which is vital for its efficiency in the natural language domain and beyond. Other original algorithmic and implementational variations are also presented and discussed.

Eigen decomposition has already proved valuable in some areas of NLP and many beyond it. In this work, a further area is considered. This thesis presents original work in using eigen decomposition for n-gram language modelling for the first time. It will be shown that eigen decomposition can be used to improve single-order n-gram language models and potentially to improve backoff n-gram models. The approach is demonstrated on training corpora of various sizes.

The research issues addressed can be summarised as follows:

• What is the utility of eigen decomposition and related techniques to natural language processing? More specifically:
  – Can eigen decomposition and related techniques be used to improve language models at the n-gram level?
  – What are the implications of this result and other applications of eigen decomposition in natural language processing for its overall utility in this domain?

• What is the value of the Generalized Hebbian Algorithm and its variants in performing eigen decomposition and related techniques in the natural language processing domain? More specifically:
  – What is the value of the Generalized Hebbian Algorithm in performing Latent Semantic Analysis?
  – Can the Generalized Hebbian Algorithm be used to perform singular value decomposition on n-gram data? Is the technique valuable for performing this task?
  – In what ways can the Generalized Hebbian Algorithm be extended to increase its utility in this domain?
The thesis is primarily technical and implementation-focused. Chapter 2 gives mathematical background on eigen decomposition and its variants. Chapter 3 gives mathematical background on the Generalized Hebbian Algorithm. Chapter 4 describes original extensions to the Generalized Hebbian Algorithm at an algorithmic level. Extensions relevant to Latent Semantic Analysis are presented. A sparse variant is presented which allows computational efficiency to be much improved on the sparse datasets typical to the language processing domain among others. GHA is extended to asymmetric datasets and evaluated. Other developments of the practical applicability of Asymmetric GHA are also discussed. Chapter 5 discusses the application of eigen decomposition and the Generalized Hebbian Algorithm in information retrieval. GHA is presented as a valuable alternative for performing LSA on large datasets. Chapter 6 discusses the application of singular value decomposition (SVD) and the Generalized Hebbian Algorithm to language modelling at the n-gram level. It is demonstrated that SVD can be used to produce a substantial decrease in perplexity in comparison to a baseline trigram model. Application of the approach to backoff language models is discussed as a focus for future work. AGHA is shown to be a valuable alternative for performing singular value decomposition on the large datasets typical to n-gram language modelling. Other alternatives for performing large-scale singular value decompositions in this context are also discussed.
Matrix Decomposition Techniques and Applications

This chapter aims to give the reader unacquainted with eigen decomposition and related techniques an understanding sufficient to enable them to follow the remainder of the work. A familiarity with basic matrix and vector mathematics is assumed in places. For readers unfamiliar with matrix/vector operations, there are a number of excellent text books available. In Anton and Rorres' Elementary Linear Algebra (2), for example, the reader will find definitions of the following concepts and operations: square matrices, symmetrical matrices, matrix transposition, dot product of vectors, outer product of vectors, the multiplying together of matrices, the multiplying of matrices by vectors, and normalisation and orthogonalisation of vectors. The reader satisfied with a more surface understanding should hopefully find that this chapter provides them with an intuitive grasp of the relevant concepts, and so is encouraged to read on. The reader already familiar with eigen decomposition, singular value decomposition and Latent Semantic Analysis is directed to the next chapter, since nothing mentioned here will be new to them.
2.1 The Vector Space Model

As mentioned in the introduction, the vector space model is a powerful approach to describing and interacting with a dataset. Data takes the form of feature value sets in vector form. Each feature takes a dimension in the vector space in which the data positions itself. The theory is that the relationships between the positions of the datapoints in the space tell us something about their similarity. A dataset takes the form of a set of vectors, which can be presented as a matrix, and all the power of the matrix format becomes available to us. Here, for example, is a dataset:

The man walked the dog
The man took the dog to the park
The dog went to the park

These data can be used to prepare a set of vectors, each of which describes a passage in the form of a wordbag. The wordbag, in which the vector describes the counts of each word appearing in the document in a fashion that does not preserve word order, is popular in a variety of approaches to natural language processing, particularly information retrieval. The wordbag vectors are presented as a matrix below:

          Passage 1   Passage 2   Passage 3
the           2           3           2
man           1           1           0
walked        1           0           0
dog           1           1           1
took          0           1           0
to            0           1           1
park          0           1           1
went          0           0           1

Data in this form can be interacted with in a variety of ways. Similarity can be measured using the distance between the points, or using the cosine of the angle between them, as measured using the dot product. Transformations can be performed on the dataset in order to enhance certain aspects. For example, the dataset could be multiplied by another matrix such as to skew and/or rotate it. Non-linear transformations can also be introduced. The matrix presented above can be multiplied by its own transpose to produce a matrix that describes the occurrence of a word with every other word across the dataset. Such a matrix would be square and symmetrical.
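To make the representation concrete, the following minimal sketch (Python with numpy, which is a choice made here rather than anything assumed by the thesis) builds the word-by-passage count matrix above and compares the passages by the cosine of the angle between their column vectors; the word-by-word co-occurrence matrix mentioned in the last paragraph is simply this matrix multiplied by its own transpose.

```python
import numpy as np

# Word-by-passage count matrix from the example above
# (rows: the, man, walked, dog, took, to, park, went).
counts = np.array([
    [2, 3, 2],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])

def cosine(a, b):
    """Cosine of the angle between two vectors (dot product of the unit vectors)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare the three passages (columns) pairwise.
for i in range(3):
    for j in range(i + 1, 3):
        print(f"passage {i + 1} vs passage {j + 1}: {cosine(counts[:, i], counts[:, j]):.3f}")

# The square, symmetrical word-by-word co-occurrence matrix described in the text:
cooccurrence = counts @ counts.T
```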
2.2 Eigen Decomposition

Eigen decomposition allows us to rotate our dataset in a manner that collapses linear dependencies into a single dimension or component, and provides a measure of the importance of the components within the dataset. On a conceptual level, it models the dataset in terms of additive influences: this observation is produced by so much of this influence, so much of that influence, etc. It aims to tell us what the influences are that most efficiently describe this dataset. The benefits to such a representation are many. Such a representation is efficient, and provides a principled way to perform dimensionality reduction on the data, such as to compress and generalise it.

Let us introduce a formal definition. In the following, M is the matrix we are performing eigen decomposition on, v is an eigenvector of M and λ is the eigenvalue corresponding to v.

λv = Mv   (2.1)

As mentioned earlier, an eigenvector of a transform/matrix is one which is unchanged by it in direction. The eigenvector may be scaled, and the scaling factor is the eigenvalue. The above simply says that v multiplied by λ equals M, our original matrix, multiplied by v: multiplying v by M has the same effect as scaling it by λ.

For any square symmetrical matrix there will be a set of such eigenvectors. There will be no more eigenvectors than the matrix has rows (or columns), though there may be fewer (recall that linear dependencies are collapsed into single components thus reducing dimensionality). The number of eigenvectors is called the rank of the matrix. The eigenvectors are normalised and orthogonal to each other, and effectively define a new space in terms of the old one. The resulting matrix of eigenvectors can be used to rotate the data into the new space. Formally,

MV^T = M'   (2.2)

where V is the column matrix of eigenvectors, M is the original data matrix and M' is the rotated and possibly approximated data matrix. This step is included in a worked example of Latent Semantic Analysis later in this chapter. In the new space, each axis captures as much of the variance in the dataset as possible. Superfluous dimensions fall away. Dimensions contributing least to the variance of the data can be discarded to produce a least mean squared error approximation to the original dataset. k is commonly used to refer to the number of dimensions remaining, and is used in this way throughout this work.
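As a concrete illustration of the mechanics, the sketch below (numpy again; the eigh routine, the tolerance and the choice of k = 2 are assumptions made for the example, not taken from the text) decomposes the square, symmetrical word co-occurrence matrix built from the earlier counts, orders the eigenvectors by eigenvalue and rotates the passages into the reduced space.

```python
import numpy as np

# Word-by-passage counts from the running example, and the square,
# symmetrical word co-occurrence matrix derived from them.
counts = np.array([
    [2, 3, 2], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 1, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1],
])
M = counts @ counts.T                        # 8 x 8, symmetric

# eigh is numpy's eigensolver for symmetric matrices; eigenvalues come back
# in ascending order, so reverse to put the strongest component first.
eigenvalues, eigenvectors = np.linalg.eigh(M)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]        # columns are eigenvectors

# The rank is the number of non-negligible eigenvalues.
rank = int(np.sum(eigenvalues > 1e-10))
print("non-zero eigenvalues:", np.round(eigenvalues[:rank], 2))

# Keep only the k strongest components and rotate the data into the new
# space: each passage becomes a k-dimensional vector.
k = 2
reduced = eigenvectors[:, :k].T @ counts     # k x 3
print(np.round(reduced, 2))
```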
2.3 Singular Value Decomposition

Eigen decomposition is specific to symmetrical data. An example of symmetrical data is word co-occurrence. Word a appears with word b exactly as often as word b occurs with word a. The set of wordbags given as an example at the beginning of this chapter is an example of asymmetric data. The matrix pairs documents with their wordbags. Each cell count gives us the number of appearances of a particular word in a particular document. If we were to multiply such a matrix by its own transpose, what we would get would be a symmetrical dataset describing the occurrence of every word with every other word across the entire document set. Were we to multiply the transpose of the matrix by the matrix, we would get a symmetrical dataset describing the number of shared words between each document and every other. A symmetrical dataset is always described by a square, diagonally symmetrical matrix. Eigen decomposition can be extended to allow a similar transform to be performed on paired data, i.e. rectangular matrices. The process is called singular value decomposition, and can be formalised as follows:

M = UΣV^T   (2.3)

In the above, M is our rectangular matrix, U is the column matrix of left singular vectors, and parallels the eigenvectors, V^T is the transpose of the column matrix of right singular vectors, which again parallel the eigenvectors, and Σ is the diagonal matrix containing the singular values (which take the place of eigenvalues) in the appropriate order. Where eigen decomposition creates a new space within the original space, singular value decomposition creates a pair of new spaces, one in left vector space (for example, word space, in our wordbag example) and one in right vector space (for example, document space). The two spaces are paired in a very tangible sense. They create a forum in which it is valid to compare datapoints of either type directly with each other (for example word vectors with document vectors) (3).

The relationship between eigen decomposition and singular value decomposition is not a complicated one. We define:

A = MM^T   (2.4)

B = M^T M   (2.5)

That is to say, we form two square symmetrical matrices from our original rectangular data matrix, each correlating an aspect of the dataset (i.e. words or documents, in our example) with itself. Then:

A = UΛ_a U^T   (2.6)

B = VΛ_b V^T   (2.7)

Λ_a = Λ_b = Σ^2   (2.8)

Or in other words, the eigenvectors of the matrix A are the left singular vectors of the matrix M. The eigenvectors of the matrix B are the right singular vectors of the matrix M. The eigenvalues of the matrix A are the eigenvalues of the matrix B and are the squares of the singular values of M.
2.4 Latent Semantic Analysis

As the single best known usage of singular value decomposition (SVD) in natural language processing, Latent Semantic Analysis provides an excellent practical illustration of its application. In addition, LSA makes an appearance later in this work, so an acquaintance with the principle will prove beneficial. Search tasks, in which documents relevant to a search string are to be retrieved, run into difficulties due to the prevalence of word synonyms in natural language. When the user searches on, to cite an earlier example, elevator, they would also like documents containing the word lift to be returned, although documents rarely use both terms. An automatic approach to finding these synonyms is potentially of benefit in information retrieval as well as a variety of other tasks. Latent Semantic Analysis (14) approaches this problem with the aid of singular value decomposition.

The approach aims to utilise the fact that whilst, for example, lift and elevator might not appear together, they will each appear with a similar set of words (for example, floor). This information can be tapped through SVD and principled dimensionality reduction. If there is some superfeature that captures the basic "liftness" of the document, that is later refined such as to specify whether lift or elevator is used, then by isolating the principles and discarding the refinements we might be able to access this information. Dimensionality reduction maps data to a continuous space, such that we can now compare words that previously we couldn't. The remainder of this chapter provides a worked example of performing Latent Semantic Analysis on a small document set. A brief summary follows.

Returning to our earlier example, we have the following document set:

The man walked the dog
The man took the dog to the park
The dog went to the park

This document set is transformed into a matrix of wordbags like so:

          Passage 1   Passage 2   Passage 3
the           2           3           2
man           1           1           0
walked        1           0           0
dog           1           1           1
took          0           1           0
to            0           1           1
park          0           1           1
went          0           0           1
To be very clear, the vectors this matrix embodies are the following paired data (documents and wordbags):

(1, 0, 0)  paired with  (2, 1, 1, 1, 0, 0, 0, 0)
(0, 1, 0)  paired with  (3, 1, 0, 1, 1, 1, 1, 0)
(0, 0, 1)  paired with  (2, 0, 0, 1, 0, 1, 1, 1)

If the left (document) data matrix D is the row matrix formed from the document vectors,

D =
1 0 0
0 1 0
0 0 1
(2.9)

and the right (wordbag) data matrix W is the row matrix formed from the wordbag vectors,

W =
2 1 1 1 0 0 0 0
3 1 0 1 1 1 1 0
2 0 0 1 0 1 1 1
(2.10)

then

W^T D = M   (2.11)

where M is our data matrix shown above. We decompose this matrix using SVD, to obtain two sets of normalised singular vectors and a set of singular values. This operation can be performed by any of a variety of readily available mathematics packages. Algorithms for performing singular value decomposition are discussed in more detail later on in this work. Recall,

M = UΣV^T   (2.12)

Then,

U^T =
 0.46   0.77  −0.45
−0.73  −0.04   0.68
−0.51  −0.64  −0.58
(2.13)

Σ =
5.03  0     0
0     1.57  0
0     0     1.09
(2.14)

V^T =
−0.82  −0.24  −0.09  −0.34  −0.14  −0.25  −0.25  −0.10
 0.10   0.47   0.49   0.06  −0.02  −0.43  −0.43  −0.40
 0.01   0.22  −0.41  −0.31   0.63   0.10   0.10  −0.53
(2.15)
Transposes are presented here for the convenience of presenting row data alongside our singular value decomposition. The difference between our original data and the singular value decomposition is that our singular value decomposition comprises orthogonal vectors, thereby removing redundancy. In the worst case there will be as many singular vector pairs as there were data pairs, but usually there is some linear redundancy, and therefore there are fewer singular vector pairs. A corpus of three documents can produce no more than three singular triplets (pairs of vectors with associated singular value). However, a more realistic corpus might produce numbers of singular triplets in quadruple figures for a vocabulary/document set size of hundreds of thousands. Dimensionality reduction is then performed by discarding all but the top few hundred singular triplets. The precise number of singular triplets retained is chosen on an ad hoc basis. Around two to three hundred is often found to be optimal for LSA. This stage will be simulated here by discarding the last singular triplet to produce a two-dimensional semantic space, which has the advantage of being readily visualisable. Here are the remaining singular triplets:

U'^T =
 0.46   0.77  −0.45
−0.73  −0.04   0.68
(2.16)

Σ' =
5.03  0
0     1.57
(2.17)

V'^T =
−0.82  −0.24  −0.09  −0.34  −0.14  −0.25  −0.25  −0.10
 0.10   0.47   0.49   0.06  −0.02  −0.43  −0.43  −0.40
(2.18)

Figure 2.1 depicts the documents represented as points in semantic space. Documents in reduced space are row vectors of D':

D' = DU'   (2.19)
Figure 2.1: Three Example Documents Depicted in a Two-Dimensional Semantic Space (Doc 1 at [0.46, −0.73], Doc 2 at [0.77, −0.04], Doc 3 at [−0.45, 0.68])
(In this case, the above is rather trivial since D happens to be the identity matrix. DU' therefore equals U'.) By multiplying the document vectors by the reduced left singular vector set (or the wordbag vectors by the right singular vector set) we can move them into the new space. Figure 2.1 illustrates this. We can then compare the semantic similarity of the documents using, for example, their dot products with each other in this new space, which will typically constitute an improvement.

A typical use of LSA is returning the best match among a set of documents given a string such as a user query. This can be illustrated in the context of the above example. Suppose the query is "the dog walked". This string is used to form a wordbag vector in the same manner as the documents were. It becomes a pseudodocument. The pseudodocument would therefore be,

P = (1 0 1 1 0 0 0 0)   (2.20)

We can move this pseudodocument into semantic space by multiplying it by the matrix V' as shown:

PV' = P'   (2.21)

This produces the two-dimensional semantic-space vector,

P' = (−1.25  0.65)   (2.22)
[Figure: the pseudodocument shown with the three example documents in the two-dimensional semantic space — Pseudodoc at [−1.25, 0.65], Doc 1 at [0.46, −0.73], Doc 2 at [0.77, −0.04], Doc 3 at [−0.45, 0.68]]
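The whole worked example can be reproduced in a few lines of numpy. The sketch below is illustrative only: numpy's SVD routine may flip the signs of the singular vectors relative to the numbers quoted above, and the document coordinates printed here are scaled by the singular values, whereas the figure above plots the unscaled rows of U'.

```python
import numpy as np

words = ["the", "man", "walked", "dog", "took", "to", "park", "went"]

# Document-by-word count matrix for the three example passages.
M = np.array([
    [2, 1, 1, 1, 0, 0, 0, 0],   # The man walked the dog
    [3, 1, 0, 1, 1, 1, 1, 0],   # The man took the dog to the park
    [2, 0, 0, 1, 0, 1, 1, 1],   # The dog went to the park
], dtype=float)

# Full SVD, then keep the top k = 2 singular triplets.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
Vk = Vt[:k, :].T                       # words x k (reduced right singular vectors)
print(np.round(s, 2))                  # singular values, largest first (compare 5.03, 1.57, 1.09)

# Documents and the query "the dog walked" projected into the reduced space.
docs_2d = M @ Vk
query = np.array([1, 0, 1, 1, 0, 0, 0, 0], dtype=float)
q_2d = query @ Vk

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for i, d in enumerate(docs_2d, start=1):
    print(f"query vs passage {i}: {cosine(q_2d, d):.3f}")
```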
Additionally, a number of techniques are available that allow the data to be preprocessed in such a way as to further increase the effectiveness of the technique. Words contribute to varying extents to the semantic profile of a passage. For example, the word "the" has little impact on the meaning of passages in which it appears. A word which distributes itself evenly among the documents in a collection is of little value in distinguishing between them. LSA can therefore be made more effective by reducing the impact of such words on the word count matrix and increasing the impact of less evenly distributed words. Dumais (16) outlines several methods of achieving this. The one used in this thesis is the most sophisticated and effective of these. It will now be presented. The reader interested in learning about the others is directed to the original source.
c_ij is the cell at column i, row j of the corpus matrix. The entropy normalisation step most commonly used, and used throughout this thesis, involves modifying this value as follows,

p_ij = tf_ij / gf_i   (2.23)

gw_i = 1 + Σ_j ( p_ij log(p_ij) / log(n) )   (2.24)

c_ij = gw_i log(c_ij + 1)   (2.25)

where gw_i is the global weighting of the word at i, n is the number of documents in the collection, tf is the term frequency, i.e. the original cell count, and gf is the global frequency, i.e. the total count for that word across all documents.

Let us look at the effect of this step on our example dataset. Here is the original matrix:

          Passage 1   Passage 2   Passage 3
the           2           3           2
man           1           1           0
walked        1           0           0
dog           1           1           1
took          0           1           0
to            0           1           1
park          0           1           1
went          0           0           1

and here is the same matrix following the preprocessing step described above:

          Passage 1   Passage 2   Passage 3
the         0.019       0.024       0.019
man         0.255       0.255       0.0
walked      0.693       0.0         0.0
dog         0.0         0.0         0.0
took        0.0         0.693       0.0
to          0.0         0.255       0.255
park        0.0         0.255       0.255
went        0.0         0.0         0.693
Values for "the" are much reduced: this word appears fairly indiscriminately across all the documents. "Dog" disappears completely, being perfectly uniform in its occurrence. "Took" and "went" remain high, being good discriminators. "To" and "man" find themselves somewhere in between. It is easy to see that we can carry out the LSA technique equally well on this second matrix, and that we might expect superior results.
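The weighting is straightforward to reproduce. The sketch below assumes, based on the values in the table (0.693 = ln 2), that natural logarithms are used and that the local weight is log(count + 1); both are inferences from the worked example rather than statements taken from the text, and the function name is illustrative.

```python
import numpy as np

# Word-by-document raw counts (rows: the, man, walked, dog, took, to, park, went).
counts = np.array([
    [2, 3, 2], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 1, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1],
], dtype=float)

def entropy_weight(counts):
    """Log-entropy weighting in the style of equations 2.23-2.25 (natural logs)."""
    n_docs = counts.shape[1]
    gf = counts.sum(axis=1, keepdims=True)                      # global frequency per word
    p = np.divide(counts, gf, out=np.zeros_like(counts), where=gf > 0)
    plogp = np.where(p > 0, p * np.log(p), 0.0)                 # treat 0 log 0 as 0
    gw = 1.0 + plogp.sum(axis=1) / np.log(n_docs)               # global weight per word
    return gw[:, None] * np.log(counts + 1.0)                   # weighted cell values

print(np.round(entropy_weight(counts), 3))
# Matches the weighted table above up to rounding: "dog" -> 0, "walked" -> 0.693, "man" ~ 0.256, ...
```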
Latent Semantic Analysis has been applied in an impressive diversity of domains (53)(18)(41), although it is best known in the information retrieval context. Impressive results have also been demonstrated in using LSA to incorporate long-span semantic dependencies in language modelling (55)(3)(12). Language modelling, including LSA-based language modelling, is discussed in more detail later in this work.
2.5 Summary
This chapter has provided the reader with background necessary to place the thesis in context with regards to matrix decomposition techniques. The value of eigen and singular value decomposition has been discussed in terms of their allowing data to be smoothed, simplified and compressed in a principled fashion. Latent Semantic Analysis has been presented as a well-known application of singular value decomposition within natural language processing. Work within the LSA domain will follow later in the thesis. The next chapter introduces the Generalized Hebbian Algorithm, which is the basis of the work presented in this thesis.
The Generalized Hebbian Algorithm

The previous chapter introduced eigen decomposition and singular value decomposition. These techniques have been widely applied within information science and computational linguistics. However, their applicability varies according to the constraints introduced by specific problems.

Much research has been done on optimising eigen decomposition algorithms, and the extent to which they can be optimised depends on the area of application. Most natural language problems involve sparse matrices, since there are many words in a natural language and the great majority do not appear in, for example, any one document. Domains in which matrices are less sparse lend themselves to such techniques as Golub-Kahan-Reinsch (19) and Jacobi-like approaches, which can be very efficient. They are inappropriate to sparse matrices however, because they work by rotating the matrix, which has the effect of desparsifying it, inflating it in size. Techniques such as those described in Berry's 1992 article (6) are more appropriate in the sparse case; Berry et al's SVDPACK (4) is used later on in this work.
Optimisation work is of particular importance because the decomposition techniques are expensive, and there are strong constraints on the size of matrices that can be processed in this way. This is of particular relevance within natural language processing, where corpora are often very large, and the success of many data-driven techniques depends on the use of a large corpus.

Optimisation is an important way to increase the applicability of eigen and singular value decomposition. Designing algorithms that accommodate different requirements is another. For example, another drawback to Jacobi-like approaches is that they calculate all the singular triplets (singular vector pairs with associated values) simultaneously, which may not be the most practical in a situation where only the top few are required. Consider also that the methods mentioned so far assume that the entire matrix is available from the start. There are many situations in which data may continue to become available over time.

There are many areas of application in which efficient incrementality is of importance. Since it is computationally expensive to calculate a matrix decomposition, it may not be feasible to recalculate when new data becomes available. Effective incrementality would remove the ceiling on matrix size that current techniques impose. The data need not be processed all at once. Systems that learn in real time need to be able to update data structures quickly. Various ways of updating an eigen or singular value decomposition given new data items have been proposed. This chapter presents the Generalized Hebbian Algorithm and contrasts it with other approaches currently available.
Figure 3.1: Hebbian Learning (inputs a_1 ... a_n, weights w_1 ... w_n, output o = Σ_i a_i w_i, with the output fed back as updates to the weights)
3.1 Hebbian Learning for Incremental Eigen Decomposition

The Generalised Hebbian Algorithm was first presented by Oja and Karhunen in 1985 (38), who demonstrated that Hebbian learning could be used to derive the first eigenvector of a dataset given serially-presented observations (vectors). Sanger (46) later extended their architecture to allow further eigenvectors to be discovered within the same basic framework.

Figure 3.1 illustrates the two steps involved: first, the calculation of an output, and second, the update step which leads to the system's learning the strongest eigenvector. The figure shows how data is received in the form of activations to the input nodes. The activations are altered according to the strength of the weighting on the connection between them and the output node. The activation at the output node is then fed back in the form of updates to the weights. The weights, which can be considered a vector of numbers, converge on the strongest eigenvector.
Equation 3.1 describes the algorithm by which Hebbian learning can be made to discover the strongest eigenvector, and is simply another way of stating the procedure described by figure 3.1.

u(t + 1) = u + λ(u^T · a)a   (3.1)

In the above, u is the eigenvector, a is the input vector (data observation) and λ is the learning rate (not to be confused with the λ used in the previous chapter to represent the eigenvalue). (t + 1) describes the fact that u is updated to take on a new value in the next timestep. Intuitively, the eigenvector is updated with the input vector scaled proportionally to the extent to which it already resembles it, as established by the dot product operation. In this way, the strongest direction in the input comes to dominate.

To relate the above to the formalisations introduced in the previous chapter, our data observations, which might for example take the form of wordbag vectors (this time not paired with document vectors, since we are using eigen decomposition and therefore require symmetrical data), are the vectors a. Together they form the column matrix A. Our eigenvectors u, produced by the Generalized Hebbian Algorithm, are therefore eigenvectors of the following matrix:

M = AA^T   (3.2)
This foundation is extended by Sanger to discover multiple eigenvectors. The only modification to equation 3.1 required to uncover further eigenvectors is that the update needs to be made orthogonal to previous eigenvectors: since the basic procedure finds the strongest of the eigenvectors, in order to prevent that from happening and find later eigenvectors, the previous eigenvectors are removed from the training update in order to take them out of the picture. The current eigenvector is also included in the orthogonalisation.

u_n(t + 1) = u_n + λ(u_n^T · a)(a − Σ_{i≤n} (u_i^T · a) u_i)   (3.3)

Here, u_n is the nth eigenvector. This is equivalent to Sanger's final formulation, in the original notation (46),

c_ij(t + 1) = c_ij(t) + γ(t)( y_i(t) x_j(t) − y_i(t) Σ_{k≤i} c_kj(t) y_k(t) )   (3.4)

where c_ij is an individual element in the i'th eigenvector, t is the time step, x_j is the input vector and y_i is the activation (that is to say, the dot product of the input vector with the i'th eigenvector). γ is the learning rate.
To summarise from an implementation perspective, the formula updates the current eigenvector by adding to it the input vector multiplied by the activation, minus the projection of the input vector on all the eigenvectors so far, including the current eigenvector, multiplied by the activation. Including the current eigenvector in the projection subtraction step has the effect of keeping the eigenvectors normalised.

Note that Sanger includes an explicit learning rate, γ. A potential variation, utilised in this work, involves excluding the current eigenvector from the projection subtraction step, so that the eigenvectors are no longer held at unit length. This has the effect of introducing an implicit learning rate, since the vector only begins to grow long when it settles in the right direction, such that the data reinforces it, and since further learning has less impact once the vector has become long. Weng et al. (52) demonstrate the efficacy of this approach.
In terms of an actual algorithm, this amounts to storing a set of N word-space eigenvectors and updating them with the above delta computed from each incoming document as it is presented. This means that the full data matrix need never be held in memory all at once, and in fact the only persistent storage requirement is the N developing eigenvectors themselves.
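As a concrete illustration, the following minimal sketch (Python/numpy; the toy data, learning rate and pass count are assumptions made for the example) implements Sanger's update of equations 3.3/3.4 with an explicit learning rate and checks the result against a batch eigen decomposition. The thesis's own variant described above instead drops the explicit rate in favour of the implicit one.

```python
import numpy as np

def gha_train(observations, n_vectors, learning_rate=0.005, n_passes=30):
    """Sanger's Generalized Hebbian Algorithm (equations 3.3/3.4): learn the
    leading eigenvectors of A A^T from serially presented observations,
    without ever holding the full data matrix in memory."""
    dim = observations.shape[1]
    rng = np.random.default_rng(0)
    U = rng.normal(scale=1e-3, size=(n_vectors, dim))    # rows = developing eigenvectors

    for _ in range(n_passes):
        for a in observations:
            y = U @ a                                     # activations y_i = u_i . a
            # Sanger update: delta u_n = lr * y_n * (a - sum_{i<=n} y_i u_i)
            U += learning_rate * (np.outer(y, a) - np.tril(np.outer(y, y)) @ U)
    return U

# Toy usage: 500 observations in 8 dimensions with a few dominant directions.
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 8)) * np.array([3.0, 2.0, 1.5, 1, 1, 1, 1, 1])
U = gha_train(data, n_vectors=3)

# Sanity check against a batch eigen decomposition of the same data.
_, evecs = np.linalg.eigh(data.T @ data)
for n in range(3):
    ref = evecs[:, -(n + 1)]                              # nth strongest batch eigenvector
    print(f"|cos| with batch eigenvector {n + 1}: "
          f"{abs(U[n] @ ref) / np.linalg.norm(U[n]):.3f}")
```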
3.2 GHA and Incremental Approaches to SVD
GHA calculates the eigen decomposition of a matrix based on single observations presented serially. It allows eigenvectors to be learnt using no more memory than is required to store the eigenvectors themselves. It is therefore relevant in situations where the size of the dataset makes conventional batch approaches infeasible. It is also of interest in the context of adaptivity, since it has the potential to adapt to changing input. The learning update operation is very cheap computationally. (Complexity analysis is presented later in this work in the context of applying specific implementations of GHA to LSA-style tasks.) The algorithm produces eigenvectors starting with the most significant, since it is the greater eigenvectors that converge most quickly, which means that useful data immediately begins to become available. Since it is a learning technique, however, it differs from what would normally be considered an incremental technique, in that the algorithm converges on the eigen decomposition of the dataset, rather than at any one point having the best solution possible for the data it has seen so far. The method is potentially
A key reason for using GHA to produce eigen decompositions is therefore its incremental nature, both for purposes of adaptivity and because it makes the approach amenable to very large data dimensionalities. Natural language research has generated work in the area of incrementality in singular value decomposition, because natural language processing is a key example of a field of research in which large corpora are used, and standard approaches to matrix decomposition are pushed to their limits. As discussed in the previous chapter, SVD and eigen decomposition are closely related and in some contexts even interchangeable, so although, strictly speaking, GHA is a method for performing eigen decomposition, SVD and eigen decomposition in this section are treated interchangeably.

Extant incremental approaches to singular value decomposition typically fall into three categories. The first essentially involves adding the new data to the dataset previously decomposed and then recomputing the decomposition. To call such an approach incremental is therefore somewhat of a misnomer, though depending on the context, some aspects of the process might be considered incremental. For example, Ozawa et al (39) take this approach in the context of Principal Component Analysis for face recognition. Principal Component Analysis (PCA) is a near relative of SVD. Berry et al (5) also discuss recomputing as an option in the case where a database of documents for LSA is extended.

In the second category of approaches to incrementality we find approximations. The decomposition of a dataset, having been increased with new data, can be approximated without recomputing completely. Folding in, as described by Berry et al (5), is an example of this approach. It works on the assumption that new data is typical of the data on which the original decomposition was performed. Pseudodocuments are formed in the manner described in the previous chapter, and these are then treated as part of the original document set. As larger quantities of data are added and the assumption of representativity starts to break down, the accuracy of the approximation decreases. However, it can be a useful option in the case that the quantity of new data is relatively small. The situation is akin to creating a model based on a training set and then using it to process an unseen test set. The principle is well known, but it is clear that the model is not updated with the test set.
In the third category, an existing decomposition is updated with new data such that the resulting decomposition is a perfect result for the dataset. O'Brien (37) presents an example of this, as does Brand (7). Brand describes an approach to SVD updating in the context of which missing or noisy data is also discussed. These approaches are appropriate in the case that a new batch of data needs to be added to an existing decomposition offline. The step is more expensive than folding in (though cheaper than recomputing) and as such is applicable in different circumstances. Brand (7) also provides a review of earlier work in SVD incrementality.

GHA differs from each of these categories in some key ways. The above approaches are incremental inasmuch as they provide ways to add new data to an existing decomposition. None of them are designed to accommodate the situation in which data is streamed. The update operations are typically expensive. All of them assume an existing decomposition into which the new data will be added. GHA is different in that its incrementality is far more intrinsic. It assumes no existing decomposition (though might potentially benefit from being seeded with an existing decomposition). It converges on the strongest eigenvectors first, thereby producing useful information quickly. As a learning algorithm, however, it does need to be used appropriately: unlike other approaches which at any stage have the perfect decomposition for the data they have seen so far, GHA needs to be allowed to converge. Reruns through smaller datasets will most likely be required.
3.3 GHA Convergence

The usability of the GHA algorithm is in no small way connected to its convergence behaviour: how quickly and how reliably each trained vector settles on the correct direction, that is to say, the eigenvector. If the aim is to have a complete decomposition in which every GHA-trained vector differs minimally in direction from the actual eigenvector, and eigenvalues are appropriate, then the time taken to achieve this end, and the reliability with which a tolerable accuracy level is reached, is critical.

Although convergence of GHA is proven (46), previous authors have noted the absence of large-scale evaluation of convergence behaviour in the literature despite widespread interest in the algorithm (15). Some attempt has been made here to remedy this on a practical level with a plot of convergence against the number of training steps required, using a decomposition done with the better-known LAS2 algorithm (6) as reference. A subsection of the 20 Newsgroups corpus (11), specifically the atheism section, was preprocessed into a sparse trigram matrix with a dimensionality of around 15,000 by 90,000. Words formed columns and two-word histories, rows. Matrix cells contained trigram counts. This matrix was then decomposed using LAS2. The resulting left-side vector set was then used to plot dot product with the GHA eigenvectors as they converged. This plot is presented in Figure 3.2. The implicit learning rate described earlier in the chapter was used here. The impact of this design choice on the data presented here needs to be considered. Other approaches to learning rates are discussed in the next chapter, in the context of asymmetric convergence. The convergence criterion used was based on the distance between the end of the vector being trained, normalised, compared with the vector in its earlier position, in this case 50,000 training steps previously (and so will inevitably decrease as the vector grows long, and new data has less impact on direction). The graph shows the dot product of the GHA eigenvector currently being trained with the LAS2 target, and so when the GHA vector reaches convergence the graph shows a jump as we move on to the next vector. The dot product we are aiming at is 1; the two vectors should point in the same direction. The graph shows convergence of eleven eigenvectors (the first three being difficult to make out because they converged almost immediately). Around 2.5 × 10^7 training presentations were required to achieve this many eigenvectors.

Figure 3.2: Dot Product of GHA Eigenvector with Reference Set Against Number of Training Steps

As can be seen from the graph, convergence shows a tendency to level off, in some cases well before a high precision is achieved, which suggests that the implicit learning rate approach leaves something to be desired. The implicit learning rate used here is contrasted with Sanger's original explicit learning rate in a comparison of convergence behaviour on asymmetric data later in this work, where convergence is discussed in more detail in the context of evaluating an original algorithm, the Asymmetric Generalized Hebbian Algorithm.
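For reference, the two measurements used in this evaluation are simple to state in code. The sketch below (Python/numpy; the function names are illustrative, and the reference vector is assumed to be available as an array, e.g. from a LAS2 decomposition) shows the movement-based convergence criterion and the dot-product comparison with a reference eigenvector.

```python
import numpy as np

def direction_change(u_now, u_before):
    """Convergence criterion sketched above: how far the normalised vector has
    moved since an earlier snapshot, e.g. 50,000 training steps previously."""
    a = u_now / np.linalg.norm(u_now)
    b = u_before / np.linalg.norm(u_before)
    return np.linalg.norm(a - b)

def alignment_with_reference(u, reference):
    """Absolute dot product with a reference eigenvector; 1.0 means the two
    point in the same direction."""
    return abs(u @ reference) / (np.linalg.norm(u) * np.linalg.norm(reference))

# Tiny usage example with arbitrary vectors.
rng = np.random.default_rng(0)
u_prev, u_now, ref = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
print(direction_change(u_now, u_prev), alignment_with_reference(u_now, ref))
```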
Other ways of improving convergence through the choice of an appropriate convergence criterion, and issues around selecting such a criterion, are discussed in the next chapter.
3.4 Summary

This chapter introduced the Generalized Hebbian Algorithm, and explained how it can be used to learn the eigen decomposition of a matrix based on single observations presented serially. Advantages to such an approach have been discussed, both in terms of allowing matrix sizes too large for conventional approaches to be decomposed, and in terms of implementing learning behaviour such as adaptation to new input patterns. The next chapter discusses work in the area of adapting GHA to varying contexts, with natural language processing applications particularly in mind.
Algorithmic Variations

This section describes a number of algorithmic developments to the Generalized Hebbian Algorithm. It begins by discussing developments of the basic GHA formulation. The technique is applied to Latent Semantic Analysis, and is modified to accommodate the preprocessing steps commonly included in LSA implementations. Random Indexing (28) is introduced here as a supplement to GHA providing a means of fixing and reducing vector length.

An extension of GHA to paired data (singular value decomposition) is then presented. Sparse variants of the algorithms are described. Sparse variants are contrasted with the Random Indexing approach introduced in the context of GHA. Approaches to setting learning rates and determining convergence of the algorithms are discussed.¹

¹ In this section, the work on including LSA entropy normalisation in GHA is joint work done with Brandyn Webb. The basic design of the Asymmetric Generalized Hebbian Algorithm is Brandyn's; the derivation is my own.
4.1 GHA for Latent Semantic Analysis

Latent Semantic Analysis has been used to great effect in the field of information retrieval and beyond. Limitations on corpus size are however a documented problem (49). Since only the first few hundred eigenvectors are required in LSA, GHA is a potential candidate for an alternative algorithm. GHA provides an alternative with a low memory footprint, but takes progressively longer to produce each eigenvector, ultimately meaning that time is an issue. Since eigenvectors are produced in order starting with the greatest, however, requiring only a small number of eigenvectors mitigates this. GHA is quick to converge on the greatest of the eigenvectors. Additionally, GHA is of interest from the point of view of the creation of a learning system that can develop a potentially very large LSA-style semantic model over a period of time from continuous streamed input. The learning behaviour of such a system would be of interest, and furthermore, there is potential for interesting performance features in a very large mature LSA-style model. The work presented in this section is previously published (24).

At a glance, GHA may not be an obvious candidate for LSA, since LSA is traditionally performed using singular value decomposition of paired data, to produce two sets of singular vectors. One set can be used to rotate wordbag vectors into the shared space created by the SVD process, and the other, document vectors. Since a document vector can be just as well represented as a wordbag vector, however, this is a little redundant. In fact, the primary task of LSA is to establish word interrelationships, and this is a task to which eigen decomposition is very well suited. In practical terms, using eigen decomposition for LSA simply involves using a word correlation matrix prepared over a set of training documents, which is square and symmetrical, to create an eigen decomposition, then using this decomposition to rotate the test set of documents, presented as wordbags, into the reduced dimensionality space, where they can be compared.
The test set may be the same as the training set; this would in fact reflect standard LSA practice. Eigen decomposition is performed on the matrix A = MM^T, where M is the word-by-document training matrix and A therefore describes word correlations, to produce the column matrix of eigenvectors U. U is reduced in dimensionality by discarding later columns to produce U'. Test documents in the form of wordbags are presented as the row matrix D. U' is used to reduce the dimensionality of D as follows:

D' = DU'   (4.1)

Row document vectors in D' can then be compared to each other.
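The procedure can be sketched directly in numpy (an illustration under the assumptions of this section, with the small example matrix standing in for a real training corpus and k = 2 chosen only for readability): build the word correlation matrix, keep the strongest eigenvectors, and rotate word-bag test documents into the reduced space.

```python
import numpy as np

# M: word-by-document training matrix (words as rows), as in the worked example.
M = np.array([
    [2, 3, 2], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 1, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1],
], dtype=float)

A = M @ M.T                                  # word correlation matrix, square and symmetrical
evals, U = np.linalg.eigh(A)
U = U[:, np.argsort(evals)[::-1]]            # columns ordered strongest first

k = 2
U_k = U[:, :k]                               # discard later columns to produce U'

# Test documents presented as a row matrix of word bags; here the query
# "the dog walked" and one of the training passages.
D = np.array([
    [1, 0, 1, 1, 0, 0, 0, 0],
    [2, 1, 1, 1, 0, 0, 0, 0],
], dtype=float)
D_reduced = D @ U_k                          # equation 4.1: D' = D U'

# Row vectors of D_reduced can now be compared, e.g. by cosine.
cos = (D_reduced[0] @ D_reduced[1]) / (np.linalg.norm(D_reduced[0]) * np.linalg.norm(D_reduced[1]))
print(round(float(cos), 3))
```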
4.1.1 Inclusion of Global Normalisation

LSA often includes an entropy-normalisation step (16), discussed in the previous chapter, in which word frequencies of the original data matrix are modified to reflect their usefulness as distinguishing features. Since this step has significant benefits, and has indeed become a part of the standard, no suggestion for an approach to performing LSA can be complete without its inclusion. This preprocessing requires that the entire corpus be available up-front, such that probabilities etc. can be calculated across the entire corpus, and therefore does not fit well with GHA, one of the main selling points of which is its incrementality. As outlined in the previous chapter, the word count is modified by setting the cell value c_ij as follows:

p_ij = tf_ij / gf_i   (4.2)

gw_i = 1 + Σ_j ( p_ij log(p_ij) / log(n) )   (4.3)

c_ij = gw_i log(c_ij + 1)   (4.4)

where n is the number of documents in the collection, tf is the term frequency, i.e. the original cell count, and gf is the global frequency, i.e. the total count for that word across all documents.

By modifying the word count in this way, words that are of little value in distinguishing between documents, for example, words such as "the", that are very frequent, are down-weighted. Observe that the calculation of the entropy depends on the total document count and on the total count of a given word across all the documents, as well as the individual cell count. For an incremental method, this means that it must be calculated over the documents seen so far, and that word and document counts must be accumulated on an ongoing basis. A little algebra produces:
gw_i = 1 + ( Σ_j tf_ij log(tf_ij) − gf_i log(gf_i) ) / ( gf_i log(n) )   (4.5)

This arrangement has the convenient property of isolating the summation over a quantity that can be accumulated, i.e. tf_ij log(tf_ij), whereas the previous arrangement would have required the individual term frequencies to be stored separately for an accurate calculation to be made. This is problematic where the number of such frequencies tends to infinity and the storage requirement increases as the tractability of the calculation decreases.
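A minimal sketch of how these quantities might be accumulated as documents stream past is given below (Python; the class and method names are illustrative, not from the thesis). Only gf_i, the running sum of tf_ij log(tf_ij), and the document count need to be stored, and the global weight of equation 4.5 can be read off at any point.

```python
import numpy as np
from collections import defaultdict

class IncrementalEntropyWeight:
    """Accumulates the quantities needed for equation 4.5 as documents arrive:
    gf_i and sum_j tf_ij * log(tf_ij) per word, plus the document count n."""

    def __init__(self):
        self.n_docs = 0
        self.gf = defaultdict(float)          # total count of each word so far
        self.tf_logtf = defaultdict(float)    # running sum of tf * log(tf)

    def add_document(self, word_counts):
        """word_counts: dict mapping word -> count in this document."""
        self.n_docs += 1
        for word, tf in word_counts.items():
            self.gf[word] += tf
            self.tf_logtf[word] += tf * np.log(tf)

    def global_weight(self, word):
        """gw_i = 1 + (sum_j tf_ij log tf_ij - gf_i log gf_i) / (gf_i log n)."""
        gf = self.gf[word]
        if gf == 0 or self.n_docs < 2:
            return 1.0
        return 1.0 + (self.tf_logtf[word] - gf * np.log(gf)) / (gf * np.log(self.n_docs))

# Usage on the running example: weights can be read off after any number of documents.
weights = IncrementalEntropyWeight()
weights.add_document({"the": 2, "man": 1, "walked": 1, "dog": 1})
weights.add_document({"the": 3, "man": 1, "took": 1, "dog": 1, "to": 1, "park": 1})
weights.add_document({"the": 2, "dog": 1, "went": 1, "to": 1, "park": 1})
print(round(weights.global_weight("dog"), 3), round(weights.global_weight("walked"), 3))
```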
4.1.2 Epoch Size and Implications for Application

The entropy-normalised cell count becomes less useful over very large numbers of training items, such as one might use with an incremental algorithm. Consider that it is the nature of language that most words are extremely infrequent. As the number of seen items tends to infinity, the weighting of words that occur with midrange frequencies will