Linköping Studies in Science and Technology
Dissertation No. 1045
Generalized Hebbian Algorithm for Dimensionality
Reduction in Natural Language Processing
by
Genevieve Gorrell
Department of Computer and Information Science
Linköpings universitet
SE-581 83 Linköping, Sweden
Linköping 2006
The current surge of interest in search and comparison tasks in natural language processing has brought with it a focus on vector space approaches and vector space dimensionality reduction techniques. Presenting data as points in hyperspace provides opportunities to use a variety of well-developed tools pertinent to this representation. Dimensionality reduction allows data to be compressed and generalised. Eigen decomposition and related algorithms are one category of approaches to dimensionality reduction, providing a principled way to reduce data dimensionality that has time and again shown itself capable of enabling access to powerful generalisations in the data. Issues with the approach, however, include computational complexity and limitations on the size of dataset that can reasonably be processed in this way. Large datasets are a persistent feature of natural language processing tasks.

This thesis focuses on two main questions. Firstly, in what ways can eigen decomposition and related techniques be extended to larger datasets? Secondly, this having been achieved, of what value is the resulting approach to information retrieval and to statistical language modelling at the n-gram level? The applicability of eigen decomposition is shown to be extendable through the use of an extant algorithm, the Generalized Hebbian Algorithm (GHA), and the novel extension of this algorithm to paired data, the Asymmetric Generalized Hebbian Algorithm (AGHA). Several original extensions to these algorithms are also presented, improving their applicability in various domains. The applicability of GHA to Latent Semantic Analysis-style tasks is investigated. Finally, AGHA is used to investigate the value of singular value decomposition, an eigen decomposition variant, to n-gram language modelling. A sizeable perplexity reduction is demonstrated.
Parts of this doctoral thesis appear in other publications:

Gorrell, G., 2006. Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. In the Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento.

Gorrell, G. and Webb, B., 2005. Generalized Hebbian Algorithm for Latent Semantic Analysis. In the Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech 2005), Lisbon.

Also by this author:

Gorrell, G. 2004. Language Modelling and Error Handling in Spoken Dialogue Systems. Licentiate thesis, Linköping University, 2004.

Rayner, M., Boye, J., Lewin, I. and Gorrell, G. 2003. Plug and Play Spoken Dialogue Processing. In Current and New Directions in Discourse and Dialogue. Eds. Jan van Kuppevelt and Ronnie W. Smith. Kluwer Academic Publishers.

Gorrell, G. 2003. Recognition Error Handling in Spoken Dialogue Systems. Proceedings of the 2nd International Conference on Mobile and Ubiquitous Multimedia, Norrköping 2003.

Gorrell, G. 2003. Using Statistical Language Modelling to Identify New Vocabulary in a Grammar-Based Speech Recognition System. Proceedings of Eurospeech 2003.

Gorrell, G., Lewin, I. and Rayner, M. 2002. Adding Intelligent Help to Mixed Initiative Spoken Dialogue Systems. Proceedings of ICSLP 2002.

Knight, S., Gorrell, G., Rayner, M., Milward, D., Koeling, R. and Lewin, I. 2001. Comparing Grammar-Based and Robust Approaches to Speech Understanding: A Case Study. Proceedings of Eurospeech 2001.

Rayner, M., Lewin, I., Gorrell, G. and Boye, J. 2001. Plug and Play Speech Understanding. Proceedings of SIGDial 2001.

Rayner, M., Gorrell, G., Hockey, B. A., Dowding, J. and Boye, J. 2001. Do CFG-Based Language Models Need Agreement Constraints? Proceedings of NAACL 2001.

Korhonen, A., Gorrell, G. and McCarthy, D., 2000. Statistical Filtering and Subcategorisation Frame Acquisition. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora 2000.
Arne Jönsson, my supervisor, who has been a great supporter and an outstanding role model for me; Robin Cooper, second supervisor, for keeping me sane, on which all else depended; Joakim Nivre, third supervisor, for stepping in with additional support later on; Manny Rayner, Ian Lewin and Brandyn Webb, major professional and intellectual influences; Robert Andersson, the world's best systems administrator; everyone in GSLT, NLPLAB, KEDRI and Lingvistiken (GU) for providing such rich working environments; and finally, my
1 Introduction
  1.1 Eigen Decomposition
  1.2 Applications of Eigen Decomposition in NLP
  1.3 Generalized Hebbian Algorithm
  1.4 Research Issues

2 Matrix Decomposition Techniques and Applications
  2.1 The Vector Space Model
  2.2 Eigen Decomposition
  2.3 Singular Value Decomposition
  2.4 Latent Semantic Analysis

3 The Generalized Hebbian Algorithm
  3.1 Hebbian Learning for Incremental Eigen Decomposition
  3.2 GHA and Incremental Approaches to SVD
  3.3 GHA Convergence
  3.4 Summary

4 Algorithmic Variations
  4.1 GHA for Latent Semantic Analysis
    4.1.1 Inclusion of Global Normalisation
    4.1.2 Epoch Size and Implications for Application
  4.2 Random Indexing
  4.3 GHA and Singular Value Decomposition
  4.4 Sparse Implementation
  4.5 Convergence
    4.5.1 Staged Training
    4.5.2 Convergence Criteria
    4.5.3 Learning Rates
    4.5.4 Asymmetric GHA Convergence Evaluation

5 GHA for Information Retrieval
  5.1 GHA for LSA
    5.1.1 Method
    5.1.2 Results
  5.2 LSA and Large Training Sets
    5.2.1 Memory Usage
    5.2.2 Time
    5.2.3 Scaling LSA
  5.3 Summary

6 SVD and Language Modelling
  6.1 Modelling Letter and Word Co-Occurrence
    6.1.1 Word Bigram Task
    6.1.2 Letter Bigram Task
  6.2 SVD-Based Language Modelling
    6.2.1 Large Corpus N-Gram Language Modelling using Sparse AGHA
    6.2.2 Small Corpus N-Gram Language Modelling using LAS2
    6.3.2 Small Bigram Corpus
    6.3.3 Medium-Sized Trigram Corpus
    6.3.4 Improving Tractability
    6.3.5 Large Trigram Corpus
  6.4 Summary

7 Conclusion
  7.1 Detailed Overview
  7.2 Summary of Contributions
  7.3 Discussion
    7.3.1 Eigen Decomposition: A Panacea?
    7.3.2 Generalized Hebbian Algorithm: Overhyped or Underrated?
    7.3.3 Wider Perspectives
Introduction

In computational linguistics, as in many artificial intelligence-related fields, the concept of intelligent behaviour is central. Human-level natural language processing requires human-like intelligence, imbued as it is with our very human existence. Furthermore, language processing is highly complex, and so we might hope, broadly speaking, that a more intelligent system, however you define it, would be better able to handle language processing tasks. Machine intelligence has been gauged and defined in a number of ways. The Turing Test (50) approached intelligence as the ability to pass for human. Other definitions are based on the adaptivity of the system. Adaptivity is a powerful concept of intelligence given that arguably the point of a plastic central nervous system is to enable adaptation within an organism's lifetime (34). Work such as (32) develops such notions into a formal measure. But what intelligence can be said to be embodied in a simple static system? What definition might we apply to the intelligence of, for example, a system that does not adapt at all, but is nonetheless able to handle a predetermined set of circumstances in useful ways? Does such a system embody any intelligence at all, as the term is commonly used? Surely it can be said to have more or less intelligent behaviours, if not intelligence per se? Since the systems considered in this work fall into this category, some measure of their power might be useful.
In order to produce appropriate behaviours in response to input, a system needs first of all to be able to distinguish between different kinds of input in task-relevant ways. This may be simply achieved. For example, a touch-tone telephone system need only distinguish between a small number of simple and distinct tones in response to a question. This step may also be more challenging. For example, if you give a command via a spoken natural language interface, the system needs to be able to distinguish between different commands. It may not need to distinguish between different speakers, or the same speaker in different moods. These skills are within the capabilities of most human listeners, but are irrelevant to our system. So the task of distinguishing between commands is a complex modelling task involving identifying the features relevant to determining the users' wishes within the domain of the task being performed. Even getting as far as identifying the words the user most likely spoke is a non-trivial task, requiring many layers of abstraction and reasoning, and so a sophisticated model is required. Having formed a model by which to identify relevant input, the ability to generate appropriate responses follows, and the requirements at this stage depend on the nature of the system.
Human developmental studies have demonstrated the great significance of modelling within the mind, both on a high and a low level. Even very young babies show more interest in stimuli that challenge their world model, suggesting that right from the start, learning is a process of tuning the world model. For example, the distinction between animate and inanimate appears very quickly in the child's model of the world. Babies will look for longer at inanimate objects moving of their own volition in the manner of complex objects, and conversely, at animate objects such as other people following Newton's Laws as simple objects do (54). The world model provides a way to sift through the input and give attention to phenomena that most require it. On a low level, adaptation describes a nerve's ceasing to fire in response to unchanging input. You stare at the same colour for a time, and when you look away your visual field is marked by an absence of that colour. Direction and speed of motion are also compensated for. You are able to tune out constant noises. Adaptation is a very simple form of modelling, but other, more complex varieties also appear at a low level in the brain (25).
Approaches to creating an appropriately powerful model in an artificial system fall loosely into two categories. The model can be designed by a human, possibly from human intuitive perceptions of the structure of the input or complex human-generated theories such as linguistic theories of grammar, or the model can be acquired automatically from data. Advantages to the former include that humans are very powerful creators of models, and could we only encode our models well, they would surely surpass any automatically acquired model available today. Advantages to the latter include that they are a lot less work to create (for the humans) and potentially less prone to error, since they can be set up to make fewer assumptions, and choose a solution based on optimising the result. Computer chess can illustrate the difference well, with programs based on searching all possible paths to a depth beyond human abilities competing well, but by no means always prevailing, against human players using more advanced strategies but less raw computational power. Creating a system capable of acquiring a sufficiently powerful model automatically from the data requires that the basic learning framework is sufficiently sophisticated. For example, a system that learns semantic concepts through word co-occurrence patterns is never going to produce a theory of grammar, since it has no access to word order information. The input is insufficiently rich. A system that models the world based on the assumption that observations comprise the additive sum of relevant factors (for example, if it is sunny, I am happy; if it is Sunday, I am happy; therefore if it is a sunny Sunday, I am very happy) will fail to accurately model circumstances where cause and effect have a more involved relationship.
An adequate format by which to encode input is a critical first step. In this work I focus on vector space models. In the vector space model, data items are represented as points in a space in which there is one dimension for each feature. Similarity between the datapoints can then be thought of in terms of the distance between the points in space, and so the framework is ideal for problems in which similarity relationships between data need to be determined. The range of problems which can be characterised in these terms is very large indeed, and there are many ways in which different kinds of similarity can be targeted. Within the vector space representation there are a variety of ways in which the variation in the position of these points can then be processed and the relevant information sifted from the irrelevant and brought to the fore. I focus on one in particular, eigen decomposition, the properties of which will be discussed in more detail in the next section.
1.1 Eigen Decomposition

Eigen decomposition is a much-used technique within natural language processing as well as many other fields. Suppose you wish to model a complex dataset. The relations between the features are not clear to you. You focus on including as much information as possible. Your hyperspace therefore has a high dimensionality. As it turns out, two of the features depend entirely on each other (for example, it is night, and it is not day), and therefore one of your dimensions is superfluous, because if you know the value of one of the features, you know the value of the other. Within the plane formed by these two dimensions, points lie in a straight line. Some other features have more complex interdependencies. The value of a feature follows with little variation from the combined values of several other features (for example, temperature might relate to the number of daylight hours in the day, the amount of cloud cover and the time of day). Lines, planes and hyperplanes are formed by the data within subspaces of the vector space. Collapsing these dependencies into superfeatures can be thought of in terms of rotating the data, such that each dimension captures as much of the variance in the data as possible. This is what eigen decomposition does.

In addition, we can take a further step. By discarding the dimensions with the least variance we can further reduce the dimensionality of the data. This time, the reduction will produce an imperfect approximation of the data, but the approximation will be the best possible approximation of the data for that number of dimensions. The approximation might be valuable as a compression of the data. It might also be valuable as a generalisation of the data, in the case that the details are overfitting/noise.
The most important superfeatures in a dataset say something significant and important about that data. Between them they cover much of the variance of the dataset. It can be interesting in itself to learn the single most important thing about a dataset. For example, as we learn later in using a related technique to learn word bigrams, given one and only one feature, the single most important thing you can say about word bigrams in the English language is what words precede "the". With this one feature, you explain as much as you possibly can about English bigrams using only one feature. Each datum, which previously contained values for each of the features in your original hyperspace, now contains values positioning it in the new space. Its values relate it to the new superfeatures rather than the original feature set we started out with. An unseen datum should be able to be approximated well in terms of a projection on each dimension in the new space. In other words, assuming an appropriate training corpus, unseen data can be described as a quantity of each of the superfeatures.
Eigenfaces provide an appealing visual illustration of the general idea of eigen decomposition. Eigen decomposition is widely used in computer vision, and one task to which it has been usefully applied is face recognition. Eigen decomposition can be applied to corpora of images of faces such that superfeatures can be extracted. Figures 1.1 (36) and 1.2 (unknown source) show some examples of these eigenfaces found on the web. (The differences in these eigenfaces are attributable to the training data on which they were prepared.) Note that the first eigenface in 1.1 is indeed a very plausible basic male face.

Figure 1.2: More Eigenfaces

Figure 1.4: More Eigenface Convergence

Later eigenfaces only make sense in the context of being combined with other eigenfaces to produce an additive effect. Figures 1.3 (27) and 1.4 (51) show images converging on a target as more and more eigenfaces are included. (In these particular examples however the target image formed part of the training set.)
Note at this point that eigen decomposition only removes linear dependency in the original feature set. A linear dependency between a feature and one or more others is a dependency in which the value of a feature takes the form of a weighted sum of the other features. For example, if there is a good film showing then Jane is more likely to go to the cinema. If John is going to the cinema, then Jane is more likely to go. Therefore if there is a good film showing and John is going to the cinema, then Jane is even more likely to go. A non-linear dependency might occur, for example, if Jane prefers to see good films alone so she can concentrate. So if there is a good film showing, then she is more likely to go to the cinema. If John is going to the cinema, then she is more likely to go. If, however, there is a good film showing and John is going to the cinema, then Jane is less likely to go. Examples of this in language abound, and so the applicability of eigen decomposition often depends on the extent of the linearity in the data. For example, words appearing often with "blue moon" are very different to words appearing often with "blue" or "moon".
1.2 Applications of Eigen Decomposition in NLP

Dimensionality reduction techniques such as eigen decomposition are of great relevance within the field of natural language processing. A persistent problem within language processing is the over-specificity of language given the task, and the sparsity of data. Corpus-based techniques depend on a sufficiency of examples in order to model human language use, but the very nature of language means that this approach has diminishing returns with corpus size. In short, there are a large number of ways to say the same thing, and no matter how large your corpus is, you will never cover all the things that might reasonably be said. You will always see something new at run-time in a task of any complexity.

Furthermore, language can be too rich for the task. The number of underlying semantic concepts necessary to model the target domain is often far smaller than the number of ways in which these concepts might be described, which makes it difficult to, in a search problem for example, establish that two documents, practically speaking, are discussing the same thing. Any approach to automatic natural language processing will encounter these problems on several levels.

Consider the task of locating relevant documents given a search string. Problems arise in that there are often several ways of referring to the same concept. How do we know, for example, that cat and feline are the same thing? There are plenty of documents relevant to felines that feature the word feline not once. This is the kind of problem that Latent Semantic Analysis aims to solve, and in doing so, provides natural language processing with its best-known application of eigen decomposition. (Strictly speaking, LSA is usually described in terms of singular value decomposition, a close relative of eigen decomposition. It allows paired data to be processed, such as, in this case, document/wordbag pairs.) Documents containing the word elevator, for example, do not typically contain the word lift, even though documents about lifts are semantically relevant to searches about elevators. The feature vectors for documents about elevators therefore contain no value in the dimension for the word lift. However, in describing the variance in the words in a set of documents, there is much redundancy between documents containing elevator and documents containing lift. Both co-occur with many similar documents. So when eigen decomposition is performed on the dataset, the two are automatically combined into a superfeature, and the differences remaining between them are captured instead in another superfeature, which is probably being reused to explain a number of other phenomena too. For example, one feature might cover several aspects of the differences between UK and US English. Later eigenvectors capture the details of the differences between them, such that given the complete set of eigenvectors, the two words once again become exclusive. However, by discarding some of the less important features we can stop that from happening.
1.3 Generalized Hebbian Algorithm

Having decided to pursue eigen decomposition, a further challenge awaits. Calculating the eigen decomposition is no trivial feat, and the best of the algorithms available are nonetheless computationally demanding. Current research about and using the technique in natural language processing often focuses on adapting the data to the constraints of the algorithm, and adapting the algorithm to the constraints of the data (49)(9). This work is no exception. The Generalized Hebbian Algorithm (GHA) is an algorithm that grew from a different paradigm to the bulk of the work on eigen decomposition, though not an unfamiliar one to many computational linguists and artificial intelligence researchers. Originating as an artificial neural network learning algorithm, it brings with it many of the advantages of that paradigm. Learning updates are cheap and localised and input is assumed to be a stream of independent observations. Learning behaviour also has some interesting properties. It is, in short, very different from other more standard approaches to calculating eigen decompositions, and is therefore appropriate in a different range of circumstances.
1.4 Research Issues

This work aims to investigate the applicability of eigen decomposition within natural language processing (NLP), and to extend it, both in NLP and in computer science in general. Large corpora have traditionally been problematic for eigen decomposition. Standard algorithms place limitations on the size of dataset that can be processed. In this work the Generalized Hebbian Algorithm is considered as an alternative, potentially allowing larger datasets to be processed. This thesis presents original, published extensions to GHA, via which the algorithm is made more appropriate to relevant tasks within and beyond natural language processing. The algorithm is adapted to paired datasets (singular value decomposition), which is required for the language modelling task as well as many others in computer science in general, and is adapted to sparse data, which is vital for its efficiency in the natural language domain and beyond. Other original algorithmic and implementational variations are also presented and discussed.

Eigen decomposition has already proved valuable in some areas of NLP and many beyond it. In this work, a further area is considered. This thesis presents original work in using eigen decomposition for n-gram language modelling for the first time. It will be shown that eigen decomposition can be used to improve single-order n-gram language models and potentially to improve backoff n-gram models. The approach is demonstrated on training corpora of various sizes.

The research issues addressed can be summarised as follows:

• What is the utility of eigen decomposition and related techniques to natural language processing? More specifically:
  – Can eigen decomposition and related techniques be used to improve language models at the n-gram level?
  – What are the implications of this result and other applications of eigen decomposition in natural language processing for its overall utility in this domain?

• What is the value of the Generalized Hebbian Algorithm and its variants in performing eigen decomposition and related techniques in the natural language processing domain? More specifically:
  – What is the value of the Generalized Hebbian Algorithm in performing Latent Semantic Analysis?
  – Can the Generalized Hebbian Algorithm be used to perform singular value decomposition on n-gram data? Is the technique valuable for performing this task?
  – In what ways can the Generalized Hebbian Algorithm be extended to increase its utility in this domain?
The thesis is primarily technical and implementation-focused. Chapter 2 gives mathematical background on eigen decomposition and its variants. Chapter 3 gives mathematical background on the Generalized Hebbian Algorithm. Chapter 4 describes original extensions to the Generalized Hebbian Algorithm at an algorithmic level. Extensions relevant to Latent Semantic Analysis are presented. A sparse variant is presented which allows computational efficiency to be much improved on the sparse datasets typical to the language processing domain among others. GHA is extended to asymmetric datasets and evaluated. Other developments of the practical applicability of Asymmetric GHA are also discussed. Chapter 5 discusses the application of eigen decomposition and the Generalized Hebbian Algorithm in information retrieval. GHA is presented as a valuable alternative for performing LSA on large datasets. Chapter 6 discusses the application of singular value decomposition (SVD) and the Generalized Hebbian Algorithm to language modelling at the n-gram level. It is demonstrated that SVD can be used to produce a substantial decrease in perplexity in comparison to a baseline trigram model. Application of the approach to backoff language models is discussed as a focus for future work. AGHA is shown to be a valuable alternative for performing singular value decomposition on the large datasets typical to n-gram language modelling. Other alternatives for performing large-scale singular value decompositions in this context are also discussed.
Matrix Decomposition Techniques and Applications

This chapter aims to give the reader unacquainted with eigen decomposition and related techniques an understanding sufficient to enable them to follow the remainder of the work. A familiarity with basic matrix and vector mathematics is assumed in places. For readers unfamiliar with matrix/vector operations, there are a number of excellent text books available. In Anton and Rorres' Elementary Linear Algebra (2), for example, the reader will find definitions of the following concepts and operations: square matrices, symmetrical matrices, matrix transposition, dot product of vectors, outer product of vectors, the multiplying together of matrices, the multiplying of matrices by vectors, and normalisation and orthogonalisation of vectors. The reader satisfied with a more surface understanding should hopefully find that this chapter provides them with an intuitive grasp of the relevant concepts, and so is encouraged to read on. The reader already familiar with eigen decomposition, singular value decomposition and Latent Semantic Analysis is directed to the next chapter, since nothing mentioned here will be new to them.
2.1 The Vector Space Model

As mentioned in the introduction, the vector space model is a powerful approach to describing and interacting with a dataset. Data takes the form of feature value sets in vector form. Each feature takes a dimension in the vector space in which the data positions itself. The theory is that the relationships between the positions of the datapoints in the space tell us something about their similarity. A dataset takes the form of a set of vectors, which can be presented as a matrix, and all the power of the matrix format becomes available to us. Here, for example, is a dataset:

The man walked the dog
The man took the dog to the park
The dog went to the park

These data can be used to prepare a set of vectors, each of which describes a passage in the form of a wordbag. The wordbag, in which the vector describes the counts of each word appearing in the document in a fashion that does not preserve word order, is popular in a variety of approaches to natural language processing, particularly information retrieval. The wordbag vectors are presented as a matrix below:

          Passage 1   Passage 2   Passage 3
the           2           3           2
man           1           1           0
walked        1           0           0
dog           1           1           1
took          0           1           0
to            0           1           1
park          0           1           1
went          0           0           1

Data in this form can be interacted with in a variety of ways. Similarity can be measured using the distance between the points, or using the cosine of the angle between them, as measured using the dot product. Transformations can be performed on the dataset in order to enhance certain aspects. For example, the dataset could be multiplied by another matrix such as to skew and/or rotate it. Non-linear transformations can also be introduced. The matrix presented above can be multiplied by its own transpose to produce a matrix that describes the occurrence of a word with every other word across the dataset. Such a matrix would be square and symmetrical.
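To make the representation concrete, the following minimal sketch (Python with numpy, which is a choice made here rather than anything assumed by the thesis) builds the word-by-passage count matrix above and compares the passages by the cosine of the angle between their column vectors; the word-by-word co-occurrence matrix mentioned in the last paragraph is simply this matrix multiplied by its own transpose.

```python
import numpy as np

# Word-by-passage count matrix from the example above
# (rows: the, man, walked, dog, took, to, park, went).
counts = np.array([
    [2, 3, 2],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])

def cosine(a, b):
    """Cosine of the angle between two vectors (dot product of the unit vectors)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare the three passages (columns) pairwise.
for i in range(3):
    for j in range(i + 1, 3):
        print(f"passage {i + 1} vs passage {j + 1}: {cosine(counts[:, i], counts[:, j]):.3f}")

# The square, symmetrical word-by-word co-occurrence matrix described in the text:
cooccurrence = counts @ counts.T
```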
2.2 Eigen Decomposition

Eigen decomposition allows us to rotate our dataset in a manner that collapses linear dependencies into a single dimension or component, and provides a measure of the importance of the components within the dataset. On a conceptual level, it models the dataset in terms of additive influences: this observation is produced by so much of this influence, so much of that influence, etc. It aims to tell us what the influences are that most efficiently describe this dataset. The benefits to such a representation are many. Such a representation is efficient, and provides a principled way to perform dimensionality reduction on the data, such as to compress and generalise it.

Let us introduce a formal definition. In the following, M is the matrix we are performing eigen decomposition on, v is an eigenvector of M and λ is the eigenvalue corresponding to v.

λv = Mv   (2.1)

As mentioned earlier, an eigenvector of a transform/matrix is one which is unchanged by it in direction. The eigenvector may be scaled, and the scaling factor is the eigenvalue. The above simply says that v multiplied by λ equals M, our original matrix, multiplied by v: multiplying v by M has the same effect as scaling it by λ.

For any square symmetrical matrix there will be a set of such eigenvectors. There will be no more eigenvectors than the matrix has rows (or columns), though there may be fewer (recall that linear dependencies are collapsed into single components thus reducing dimensionality). The number of eigenvectors is called the rank of the matrix. The eigenvectors are normalised and orthogonal to each other, and effectively define a new space in terms of the old one. The resulting matrix of eigenvectors can be used to rotate the data into the new space. Formally,

MV^T = M'   (2.2)

where V is the column matrix of eigenvectors, M is the original data matrix and M' is the rotated and possibly approximated data matrix. This step is included in a worked example of Latent Semantic Analysis later in this chapter. In the new space, each axis captures as much of the variance in the dataset as possible. Superfluous dimensions fall away. Dimensions contributing least to the variance of the data can be discarded to produce a least mean squared error approximation to the original dataset. k is commonly used to refer to the number of dimensions remaining, and is used in this way throughout this work.
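As a concrete illustration of the mechanics, the sketch below (numpy again; the eigh routine, the tolerance and the choice of k = 2 are assumptions made for the example, not taken from the text) decomposes the square, symmetrical word co-occurrence matrix built from the earlier counts, orders the eigenvectors by eigenvalue and rotates the passages into the reduced space.

```python
import numpy as np

# Word-by-passage counts from the running example, and the square,
# symmetrical word co-occurrence matrix derived from them.
counts = np.array([
    [2, 3, 2], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 1, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1],
])
M = counts @ counts.T                        # 8 x 8, symmetric

# eigh is numpy's eigensolver for symmetric matrices; eigenvalues come back
# in ascending order, so reverse to put the strongest component first.
eigenvalues, eigenvectors = np.linalg.eigh(M)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]        # columns are eigenvectors

# The rank is the number of non-negligible eigenvalues.
rank = int(np.sum(eigenvalues > 1e-10))
print("non-zero eigenvalues:", np.round(eigenvalues[:rank], 2))

# Keep only the k strongest components and rotate the data into the new
# space: each passage becomes a k-dimensional vector.
k = 2
reduced = eigenvectors[:, :k].T @ counts     # k x 3
print(np.round(reduced, 2))
```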
2.3 Singular Value Decomposition

Eigen decomposition is specific to symmetrical data. An example of symmetrical data is word co-occurrence. Word a appears with word b exactly as often as word b occurs with word a. The set of wordbags given as an example at the beginning of this chapter is an example of asymmetric data. The matrix pairs documents with their wordbags. Each cell count gives us the number of appearances of a particular word in a particular document. If we were to multiply such a matrix by its own transpose, what we would get would be a symmetrical dataset describing the occurrence of every word with every other word across the entire document set. Were we to multiply the transpose of the matrix by the matrix, we would get a symmetrical dataset describing the number of shared words between each document and every other. A symmetrical dataset is always described by a square, diagonally symmetrical matrix. Eigen decomposition can be extended to allow a similar transform to be performed on paired data, i.e. rectangular matrices. The process is called singular value decomposition, and can be formalised as follows:

M = UΣV^T   (2.3)

In the above, M is our rectangular matrix, U is the column matrix of left singular vectors, and parallels the eigenvectors, V^T is the transpose of the column matrix of right singular vectors, which again parallel the eigenvectors, and Σ is the diagonal matrix containing the singular values (which take the place of eigenvalues) in the appropriate order. Where eigen decomposition creates a new space within the original space, singular value decomposition creates a pair of new spaces, one in left vector space (for example, word space, in our wordbag example) and one in right vector space (for example, document space). The two spaces are paired in a very tangible sense. They create a forum in which it is valid to compare datapoints of either type directly with each other (for example word vectors with document vectors) (3).

The relationship between eigen decomposition and singular value decomposition is not a complicated one. We define:

A = MM^T   (2.4)

B = M^T M   (2.5)

That is to say, we form two square symmetrical matrices from our original rectangular data matrix, each correlating an aspect of the dataset (i.e. words or documents, in our example) with itself. Then:

A = UΛ_a U^T   (2.6)

B = VΛ_b V^T   (2.7)

Λ_a = Λ_b = Σ^2   (2.8)

Or in other words, the eigenvectors of the matrix A are the left singular vectors of the matrix M. The eigenvectors of the matrix B are the right singular vectors of the matrix M. The eigenvalues of the matrix A are the eigenvalues of the matrix B and are the squares of the singular values of M.
2.4 Latent Semantic Analysis

As the single best known usage of singular value decomposition (SVD) in natural language processing, Latent Semantic Analysis provides an excellent practical illustration of its application. In addition, LSA makes an appearance later in this work, so an acquaintance with the principle will prove beneficial. Search tasks, in which documents relevant to a search string are to be retrieved, run into difficulties due to the prevalence of word synonyms in natural language. When the user searches on, to cite an earlier example, elevator, they would also like documents containing the word lift to be returned, although documents rarely use both terms. An automatic approach to finding these synonyms is potentially of benefit in information retrieval as well as a variety of other tasks. Latent Semantic Analysis (14) approaches this problem with the aid of singular value decomposition.

The approach aims to utilise the fact that whilst, for example, lift and elevator might not appear together, they will each appear with a similar set of words (for example, floor). This information can be tapped through SVD and principled dimensionality reduction. If there is some superfeature that captures the basic "liftness" of the document, that is later refined such as to specify whether lift or elevator is used, then by isolating the principles and discarding the refinements we might be able to access this information. Dimensionality reduction maps data to a continuous space, such that we can now compare words that previously we couldn't. The remainder of this chapter provides a worked example of performing Latent Semantic Analysis on a small document set. A brief summary follows.

Returning to our earlier example, we have the following document set:

The man walked the dog
The man took the dog to the park
The dog went to the park

This document set is transformed into a matrix of wordbags like so:

          Passage 1   Passage 2   Passage 3
the           2           3           2
man           1           1           0
walked        1           0           0
dog           1           1           1
took          0           1           0
to            0           1           1
park          0           1           1
went          0           0           1
To be very clear, the vectors this matrix embodies are the following paired data (documents and wordbags):

(1, 0, 0)  paired with  (2, 1, 1, 1, 0, 0, 0, 0)
(0, 1, 0)  paired with  (3, 1, 0, 1, 1, 1, 1, 0)
(0, 0, 1)  paired with  (2, 0, 0, 1, 0, 1, 1, 1)

If the left (document) data matrix D is the row matrix formed from the document vectors,

D =
1 0 0
0 1 0
0 0 1
(2.9)

and the right (wordbag) data matrix W is the row matrix formed from the wordbag vectors,

W =
2 1 1 1 0 0 0 0
3 1 0 1 1 1 1 0
2 0 0 1 0 1 1 1
(2.10)

then

W^T D = M   (2.11)

where M is our data matrix shown above. We decompose this matrix using SVD, to obtain two sets of normalised singular vectors and a set of singular values. This operation can be performed by any of a variety of readily available mathematics packages. Algorithms for performing singular value decomposition are discussed in more detail later on in this work. Recall,

M = UΣV^T   (2.12)

Then,

U^T =
 0.46   0.77  −0.45
−0.73  −0.04   0.68
−0.51  −0.64  −0.58
(2.13)

Σ =
5.03  0     0
0     1.57  0
0     0     1.09
(2.14)

V^T =
−0.82  −0.24  −0.09  −0.34  −0.14  −0.25  −0.25  −0.10
 0.10   0.47   0.49   0.06  −0.02  −0.43  −0.43  −0.40
 0.01   0.22  −0.41  −0.31   0.63   0.10   0.10  −0.53
(2.15)
Transposes are presented here for the convenience of presenting row data alongside our singular value decomposition. The difference between our original data and the singular value decomposition is that our singular value decomposition comprises orthogonal vectors, thereby removing redundancy. In the worst case there will be as many singular vector pairs as there were data pairs, but usually there is some linear redundancy, and therefore there are fewer singular vector pairs. A corpus of three documents can produce no more than three singular triplets (pairs of vectors with associated singular value). However, a more realistic corpus might produce numbers of singular triplets in quadruple figures for a vocabulary/document set size of hundreds of thousands. Dimensionality reduction is then performed by discarding all but the top few hundred singular triplets. The precise number of singular triplets retained is chosen on an ad hoc basis. Around two to three hundred is often found to be optimal for LSA. This stage will be simulated here by discarding the last singular triplet to produce a two-dimensional semantic space, which has the advantage of being readily visualisable. Here are the remaining singular triplets:

U'^T =
 0.46   0.77  −0.45
−0.73  −0.04   0.68
(2.16)

Σ' =
5.03  0
0     1.57
(2.17)

V'^T =
−0.82  −0.24  −0.09  −0.34  −0.14  −0.25  −0.25  −0.10
 0.10   0.47   0.49   0.06  −0.02  −0.43  −0.43  −0.40
(2.18)

Figure 2.1 depicts the documents represented as points in semantic space. Documents in reduced space are row vectors of D':

D' = DU'   (2.19)
Figure 2.1: Three Example Documents Depicted in a Two-Dimensional Semantic Space (Doc 1 at [0.46, −0.73], Doc 2 at [0.77, −0.04], Doc 3 at [−0.45, 0.68])
(In this case, the above is rather trivial since D happens to be the identity matrix. DU' therefore equals U'.) By multiplying the document vectors by the reduced left singular vector set (or the wordbag vectors by the right singular vector set) we can move them into the new space. Figure 2.1 illustrates this. We can then compare the semantic similarity of the documents using, for example, their dot products with each other in this new space, which will typically constitute an improvement.

A typical use of LSA is returning the best match among a set of documents given a string such as a user query. This can be illustrated in the context of the above example. Suppose the query is "the dog walked". This string is used to form a wordbag vector in the same manner as the documents were. It becomes a pseudodocument. The pseudodocument would therefore be,

P = (1 0 1 1 0 0 0 0)   (2.20)

We can move this pseudodocument into semantic space by multiplying it by the matrix V' as shown:

PV' = P'   (2.21)

This produces the two-dimensional semantic-space vector,

P' = (−1.25  0.65)   (2.22)
[Figure: the pseudodocument shown with the three example documents in the two-dimensional semantic space — Pseudodoc at [−1.25, 0.65], Doc 1 at [0.46, −0.73], Doc 2 at [0.77, −0.04], Doc 3 at [−0.45, 0.68]]
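The whole worked example can be reproduced in a few lines of numpy. The sketch below is illustrative only: numpy's SVD routine may flip the signs of the singular vectors relative to the numbers quoted above, and the document coordinates printed here are scaled by the singular values, whereas the figure above plots the unscaled rows of U'.

```python
import numpy as np

words = ["the", "man", "walked", "dog", "took", "to", "park", "went"]

# Document-by-word count matrix for the three example passages.
M = np.array([
    [2, 1, 1, 1, 0, 0, 0, 0],   # The man walked the dog
    [3, 1, 0, 1, 1, 1, 1, 0],   # The man took the dog to the park
    [2, 0, 0, 1, 0, 1, 1, 1],   # The dog went to the park
], dtype=float)

# Full SVD, then keep the top k = 2 singular triplets.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
Vk = Vt[:k, :].T                       # words x k (reduced right singular vectors)
print(np.round(s, 2))                  # singular values, largest first (compare 5.03, 1.57, 1.09)

# Documents and the query "the dog walked" projected into the reduced space.
docs_2d = M @ Vk
query = np.array([1, 0, 1, 1, 0, 0, 0, 0], dtype=float)
q_2d = query @ Vk

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for i, d in enumerate(docs_2d, start=1):
    print(f"query vs passage {i}: {cosine(q_2d, d):.3f}")
```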
Additionally, a number of techniques are available that allow the data to be preprocessed in such a way as to further increase the effectiveness of the technique. Words contribute to varying extents to the semantic profile of a passage. For example, the word "the" has little impact on the meaning of passages in which it appears. A word which distributes itself evenly among the documents in a collection is of little value in distinguishing between them. LSA can therefore be made more effective by reducing the impact of such words on the word count matrix and increasing the impact of less evenly distributed words. Dumais (16) outlines several methods of achieving this. The one used in this thesis is the most sophisticated and effective of these. It will now be presented. The reader interested in learning about the others is directed to the original source.
c_ij is the cell at column i, row j of the corpus matrix. The entropy normalisation step most commonly used, and used throughout this thesis, involves modifying this value as follows,

p_ij = tf_ij / gf_i   (2.23)

gw_i = 1 + Σ_j ( p_ij log(p_ij) / log(n) )   (2.24)

c_ij = gw_i log(c_ij + 1)   (2.25)

where gw_i is the global weighting of the word at i, n is the number of documents in the collection, tf is the term frequency, i.e. the original cell count, and gf is the global frequency, i.e. the total count for that word across all documents.

Let us look at the effect of this step on our example dataset. Here is the original matrix:

          Passage 1   Passage 2   Passage 3
the           2           3           2
man           1           1           0
walked        1           0           0
dog           1           1           1
took          0           1           0
to            0           1           1
park          0           1           1
went          0           0           1

and here is the same matrix following the preprocessing step described above:

          Passage 1   Passage 2   Passage 3
the         0.019       0.024       0.019
man         0.255       0.255       0.0
walked      0.693       0.0         0.0
dog         0.0         0.0         0.0
took        0.0         0.693       0.0
to          0.0         0.255       0.255
park        0.0         0.255       0.255
went        0.0         0.0         0.693
Values for "the" are much reduced: this word appears fairly indiscriminately across all the documents. "Dog" disappears completely, being perfectly uniform in its occurrence. "Took" and "went" remain high, being good discriminators. "To" and "man" find themselves somewhere in between. It is easy to see that we can carry out the LSA technique equally well on this second matrix, and that we might expect superior results.
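The weighting is straightforward to reproduce. The sketch below assumes, based on the values in the table (0.693 = ln 2), that natural logarithms are used and that the local weight is log(count + 1); both are inferences from the worked example rather than statements taken from the text, and the function name is illustrative.

```python
import numpy as np

# Word-by-document raw counts (rows: the, man, walked, dog, took, to, park, went).
counts = np.array([
    [2, 3, 2], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 1, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1],
], dtype=float)

def entropy_weight(counts):
    """Log-entropy weighting in the style of equations 2.23-2.25 (natural logs)."""
    n_docs = counts.shape[1]
    gf = counts.sum(axis=1, keepdims=True)                      # global frequency per word
    p = np.divide(counts, gf, out=np.zeros_like(counts), where=gf > 0)
    plogp = np.where(p > 0, p * np.log(p), 0.0)                 # treat 0 log 0 as 0
    gw = 1.0 + plogp.sum(axis=1) / np.log(n_docs)               # global weight per word
    return gw[:, None] * np.log(counts + 1.0)                   # weighted cell values

print(np.round(entropy_weight(counts), 3))
# Matches the weighted table above up to rounding: "dog" -> 0, "walked" -> 0.693, "man" ~ 0.256, ...
```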
Latent Semantic Analysis has been applied in an impressive diversity of domains (53)(18)(41), although it is best known in the information retrieval context. Impressive results have also been demonstrated in using LSA to incorporate long-span semantic dependencies in language modelling (55)(3)(12). Language modelling, including LSA-based language modelling, is discussed in more detail later in this work.
2.5 Summary
This chapter has provided the reader with background necessary to place the thesis in context with regards to matrix decomposition techniques. The value of eigen and singular value decomposition has been discussed in terms of their allowing data to be smoothed, simplified and compressed in a principled fashion. Latent Semantic Analysis has been presented as a well-known application of singular value decomposition within natural language processing. Work within the LSA domain will follow later in the thesis. The next chapter introduces the Generalized Hebbian Algorithm, which is the basis of the work presented in this thesis.
The Generalized Hebbian Algorithm

The previous chapter introduced eigen decomposition and singular value decomposition. These techniques have been widely applied within information science and computational linguistics. However, their applicability varies according to the constraints introduced by specific problems.

Much research has been done on optimising eigen decomposition algorithms, and the extent to which they can be optimised depends on the area of application. Most natural language problems involve sparse matrices, since there are many words in a natural language and the great majority do not appear in, for example, any one document. Domains in which matrices are less sparse lend themselves to such techniques as Golub-Kahan-Reinsch (19) and Jacobi-like approaches, which can be very efficient. They are inappropriate to sparse matrices however, because they work by rotating the matrix, which has the effect of desparsifying it, inflating it in size. Techniques such as those described in Berry's 1992 article (6) are more appropriate in the sparse case; Berry et al's SVDPACK (4) is used later on in this work.
Optimisation work is of particular importance because the decomposition techniques are expensive, and there are strong constraints on the size of matrices that can be processed in this way. This is of particular relevance within natural language processing, where corpora are often very large, and the success of many data-driven techniques depends on the use of a large corpus.

Optimisation is an important way to increase the applicability of eigen and singular value decomposition. Designing algorithms that accommodate different requirements is another. For example, another drawback to Jacobi-like approaches is that they calculate all the singular triplets (singular vector pairs with associated values) simultaneously, which may not be the most practical in a situation where only the top few are required. Consider also that the methods mentioned so far assume that the entire matrix is available from the start. There are many situations in which data may continue to become available over time.

There are many areas of application in which efficient incrementality is of importance. Since it is computationally expensive to calculate a matrix decomposition, it may not be feasible to recalculate when new data becomes available. Effective incrementality would remove the ceiling on matrix size that current techniques impose. The data need not be processed all at once. Systems that learn in real time need to be able to update data structures quickly. Various ways of updating an eigen or singular value decomposition given new data items have been proposed. This chapter presents the Generalized Hebbian Algorithm and contrasts it with other approaches currently available.
Figure 3.1: Hebbian Learning (inputs a_1 ... a_n, weights w_1 ... w_n, output o = Σ_i a_i w_i, with the output fed back as updates to the weights)
3.1 Hebbian Learning for Incremental Eigen Decomposition

The Generalised Hebbian Algorithm was first presented by Oja and Karhunen in 1985 (38), who demonstrated that Hebbian learning could be used to derive the first eigenvector of a dataset given serially-presented observations (vectors). Sanger (46) later extended their architecture to allow further eigenvectors to be discovered within the same basic framework.

Figure 3.1 illustrates the two steps involved: first, the calculation of an output, and second, the update step which leads to the system's learning the strongest eigenvector. The figure shows how data is received in the form of activations to the input nodes. The activations are altered according to the strength of the weighting on the connection between them and the output node. The activation at the output node is then fed back in the form of updates to the weights. The weights, which can be considered a vector of numbers, converge on the strongest eigenvector.
Equation 3.1 describes the algorithm by which Hebbian learning can be made to discover the strongest eigenvector, and is simply another way of stating the procedure described by figure 3.1.

u(t + 1) = u + λ(u^T · a)a   (3.1)

In the above, u is the eigenvector, a is the input vector (data observation) and λ is the learning rate (not to be confused with the λ used in the previous chapter to represent the eigenvalue). (t + 1) describes the fact that u is updated to take on a new value in the next timestep. Intuitively, the eigenvector is updated with the input vector scaled proportionally to the extent to which it already resembles it, as established by the dot product operation. In this way, the strongest direction in the input comes to dominate.

To relate the above to the formalisations introduced in the previous chapter, our data observations, which might for example take the form of wordbag vectors (this time not paired with document vectors, since we are using eigen decomposition and therefore require symmetrical data), are the vectors a. Together they form the column matrix A. Our eigenvectors u, produced by the Generalized Hebbian Algorithm, are therefore eigenvectors of the following matrix:

M = AA^T   (3.2)
This foundation is extended by Sanger to discover multiple eigenvectors. The only modification to equation 3.1 required to uncover further eigenvectors is that the update needs to be made orthogonal to previous eigenvectors: since the basic procedure finds the strongest of the eigenvectors, in order to prevent that from happening and find later eigenvectors, the previous eigenvectors are removed from the training update in order to take them out of the picture. The current eigenvector is also included in the orthogonalisation.

u_n(t + 1) = u_n + λ(u_n^T · a)(a − Σ_{i≤n} (u_i^T · a) u_i)   (3.3)

Here, u_n is the nth eigenvector. This is equivalent to Sanger's final formulation, in the original notation (46),

c_ij(t + 1) = c_ij(t) + γ(t)( y_i(t) x_j(t) − y_i(t) Σ_{k≤i} c_kj(t) y_k(t) )   (3.4)

where c_ij is an individual element in the i'th eigenvector, t is the time step, x_j is the input vector and y_i is the activation (that is to say, the dot product of the input vector with the i'th eigenvector). γ is the learning rate.
To summarise from an implementation perspective, the formula updates the current eigenvector by adding to it the input vector multiplied by the activation, minus the projection of the input vector on all the eigenvectors so far, including the current eigenvector, multiplied by the activation. Including the current eigenvector in the projection subtraction step has the effect of keeping the eigenvectors normalised.

Note that Sanger includes an explicit learning rate, γ. A potential variation, utilised in this work, involves excluding the current eigenvector from the projection subtraction step, so that the eigenvectors are no longer held at unit length. This has the effect of introducing an implicit learning rate, since the vector only begins to grow long when it settles in the right direction, such that the data reinforces it, and since further learning has less impact once the vector has become long. Weng et al. (52) demonstrate the efficacy of this approach.
In terms of an actual algorithm, this amounts to storing a set of N word-space eigenvectors and updating them with the above delta computed from each incoming document as it is presented. This means that the full data matrix need never be held in memory all at once, and in fact the only persistent storage requirement is the N developing eigenvectors themselves.
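As a concrete illustration, the following minimal sketch (Python/numpy; the toy data, learning rate and pass count are assumptions made for the example) implements Sanger's update of equations 3.3/3.4 with an explicit learning rate and checks the result against a batch eigen decomposition. The thesis's own variant described above instead drops the explicit rate in favour of the implicit one.

```python
import numpy as np

def gha_train(observations, n_vectors, learning_rate=0.005, n_passes=30):
    """Sanger's Generalized Hebbian Algorithm (equations 3.3/3.4): learn the
    leading eigenvectors of A A^T from serially presented observations,
    without ever holding the full data matrix in memory."""
    dim = observations.shape[1]
    rng = np.random.default_rng(0)
    U = rng.normal(scale=1e-3, size=(n_vectors, dim))    # rows = developing eigenvectors

    for _ in range(n_passes):
        for a in observations:
            y = U @ a                                     # activations y_i = u_i . a
            # Sanger update: delta u_n = lr * y_n * (a - sum_{i<=n} y_i u_i)
            U += learning_rate * (np.outer(y, a) - np.tril(np.outer(y, y)) @ U)
    return U

# Toy usage: 500 observations in 8 dimensions with a few dominant directions.
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 8)) * np.array([3.0, 2.0, 1.5, 1, 1, 1, 1, 1])
U = gha_train(data, n_vectors=3)

# Sanity check against a batch eigen decomposition of the same data.
_, evecs = np.linalg.eigh(data.T @ data)
for n in range(3):
    ref = evecs[:, -(n + 1)]                              # nth strongest batch eigenvector
    print(f"|cos| with batch eigenvector {n + 1}: "
          f"{abs(U[n] @ ref) / np.linalg.norm(U[n]):.3f}")
```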
3.2 GHA and Incremental Approaches to SVD
GHA calculates the eigen decomposition of a matrix based on single observations presented serially. It allows eigenvectors to be learnt using no more memory than is required to store the eigenvectors themselves. It is therefore relevant in situations where the size of the dataset makes conventional batch approaches infeasible. It is also of interest in the context of adaptivity, since it has the potential to adapt to changing input. The learning update operation is very cheap computationally. (Complexity analysis is presented later in this work in the context of applying specific implementations of GHA to LSA-style tasks.) The algorithm produces eigenvectors starting with the most significant, since it is the greater eigenvectors that converge most quickly, which means that useful data immediately begins to become available. Since it is a learning technique, however, it differs from what would normally be considered an incremental technique, in that the algorithm converges on the eigen decomposition of the dataset, rather than at any one point having the best solution possible for the data it has seen so far. The method is potentially
A key reason for using GHA to produce eigen decompositions is therefore its incremental nature, both for purposes of adaptivity and because it makes the approach amenable to very large data dimensionalities. Natural language research has generated work in the area of incrementality in singular value decomposition, because natural language processing is a key example of a field of research in which large corpora are used, and standard approaches to matrix decomposition are pushed to their limits. As discussed in the previous chapter, SVD and eigen decomposition are closely related and in some contexts even interchangeable, so although, strictly speaking, GHA is a method for performing eigen decomposition, SVD and eigen decomposition in this section are treated interchangeably.

Extant incremental approaches to singular value decomposition typically fall into three categories. The first essentially involves adding the new data to the dataset previously decomposed and then recomputing the decomposition. To call such an approach incremental is therefore somewhat of a misnomer, though depending on the context, some aspects of the process might be considered incremental. For example, Ozawa et al (39) take this approach in the context of Principal Component Analysis for face recognition. Principal Component Analysis (PCA) is a near relative of SVD. Berry et al (5) also discuss recomputing as an option in the case where a database of documents for LSA is extended.

In the second category of approaches to incrementality we find approximations. The decomposition of a dataset, having been increased with new data, can be approximated without recomputing completely. Folding in, as described by Berry et al (5), is an example of this approach. It works on the assumption that new data is typical of the data on which the original decomposition was performed. Pseudodocuments are formed in the manner described in the previous chapter, and these are then treated as part of the original document set. As larger quantities of data are added and the assumption of representativity starts to break down, the accuracy of the approximation decreases. However, it can be a useful option in the case that the quantity of new data is relatively small. The situation is akin to creating a model based on a training set and then using it to process an unseen test set. The principle is well known, but it is clear that the model is not updated with the test set.
In the third category, an existing decomposition is updated with new data such that the resulting decomposition is a perfect result for the dataset. O'Brien (37) presents an example of this, as does Brand (7). Brand describes an approach to SVD updating in the context of which missing or noisy data is also discussed. These approaches are appropriate in the case that a new batch of data needs to be added to an existing decomposition offline. The step is more expensive than folding in (though cheaper than recomputing) and as such is applicable in different circumstances. Brand (7) also provides a review of earlier work in SVD incrementality.

GHA differs from each of these categories in some key ways. The above approaches are incremental inasmuch as they provide ways to add new data to an existing decomposition. None of them are designed to accommodate the situation in which data is streamed. The update operations are typically expensive. All of them assume an existing decomposition into which the new data will be added. GHA is different in that its incrementality is far more intrinsic. It assumes no existing decomposition (though might potentially benefit from being seeded with an existing decomposition). It converges on the strongest eigenvectors first, thereby producing useful information quickly. As a learning algorithm, however, it does need to be used appropriately: unlike other approaches which at any stage have the perfect decomposition for the data they have seen so far, GHA needs to be allowed to converge. Reruns through smaller datasets will most likely be required.
3.3 GHA Convergence

The usability of the GHA algorithm is in no small way connected to its convergence behaviour: how quickly and how reliably each trained vector settles on the correct direction, that is to say, the eigenvector. If the aim is to have a complete decomposition in which every GHA-trained vector differs minimally in direction from the actual eigenvector, and eigenvalues are appropriate, then the time taken to achieve this end, and the reliability with which a tolerable accuracy level is reached, is critical.

Although convergence of GHA is proven (46), previous authors have noted the absence of large-scale evaluation of convergence behaviour in the literature despite widespread interest in the algorithm (15). Some attempt has been made here to remedy this on a practical level with a plot of convergence against the number of training steps required, using a decomposition done with the better-known LAS2 algorithm (6) as reference. A subsection of the 20 Newsgroups corpus (11), specifically the atheism section, was preprocessed into a sparse trigram matrix with a dimensionality of around 15,000 by 90,000. Words formed columns and two-word histories, rows. Matrix cells contained trigram counts. This matrix was then decomposed using LAS2. The resulting left-side vector set was then used to plot dot product with the GHA eigenvectors as they converged. This plot is presented in Figure 3.2. The implicit learning rate described earlier in the chapter was used here. The impact of this design choice on the data presented here needs to be considered. Other approaches to learning rates are discussed in the next chapter, in the context of asymmetric convergence. The convergence criterion used was based on the distance between the end of the vector being trained, normalised, compared with the vector in its earlier position, in this case 50,000 training steps previously (and so will inevitably decrease as the vector grows long, and new data has less impact on direction). The graph shows the dot product of the GHA eigenvector currently being trained with the LAS2 target, and so when the GHA vector reaches convergence the graph shows a jump as we move on to the next vector. The dot product we are aiming at is 1; the two vectors should point in the same direction. The graph shows convergence of eleven eigenvectors (the first three being difficult to make out because they converged almost immediately). Around 2.5 × 10^7 training presentations were required to achieve this many eigenvectors.

Figure 3.2: Dot Product of GHA Eigenvector with Reference Set Against Number of Training Steps

As can be seen from the graph, convergence shows a tendency to level off, in some cases well before a high precision is achieved, which suggests that the implicit learning rate approach leaves something to be desired. The implicit learning rate used here is contrasted with Sanger's original explicit learning rate in a comparison of convergence behaviour on asymmetric data later in this work, where convergence is discussed in more detail in the context of evaluating an original algorithm, the Asymmetric Generalized Hebbian Algorithm.
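For reference, the two measurements used in this evaluation are simple to state in code. The sketch below (Python/numpy; the function names are illustrative, and the reference vector is assumed to be available as an array, e.g. from a LAS2 decomposition) shows the movement-based convergence criterion and the dot-product comparison with a reference eigenvector.

```python
import numpy as np

def direction_change(u_now, u_before):
    """Convergence criterion sketched above: how far the normalised vector has
    moved since an earlier snapshot, e.g. 50,000 training steps previously."""
    a = u_now / np.linalg.norm(u_now)
    b = u_before / np.linalg.norm(u_before)
    return np.linalg.norm(a - b)

def alignment_with_reference(u, reference):
    """Absolute dot product with a reference eigenvector; 1.0 means the two
    point in the same direction."""
    return abs(u @ reference) / (np.linalg.norm(u) * np.linalg.norm(reference))

# Tiny usage example with arbitrary vectors.
rng = np.random.default_rng(0)
u_prev, u_now, ref = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
print(direction_change(u_now, u_prev), alignment_with_reference(u_now, ref))
```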
Other ways of improving convergence through the choice of an appropriate convergence criterion, and issues around selecting such a criterion, are discussed in the next chapter.
3.4 Summary

This chapter introduced the Generalized Hebbian Algorithm, and explained how it can be used to learn the eigen decomposition of a matrix based on single observations presented serially. Advantages to such an approach have been discussed, both in terms of allowing matrix sizes too large for conventional approaches to be decomposed, and in terms of implementing learning behaviour such as adaptation to new input patterns. The next chapter discusses work in the area of adapting GHA to varying contexts, with natural language processing applications particularly in mind.
Algorithmic Variations

This section describes a number of algorithmic developments to the Generalized Hebbian Algorithm. It begins by discussing developments of the basic GHA formulation. The technique is applied to Latent Semantic Analysis, and is modified to accommodate the preprocessing steps commonly included in LSA implementations. Random Indexing (28) is introduced here as a supplement to GHA providing a means of fixing and reducing vector length.

An extension of GHA to paired data (singular value decomposition) is then presented. Sparse variants of the algorithms are described. Sparse variants are contrasted with the Random Indexing approach introduced in the context of GHA. Approaches to setting learning rates and determining convergence of the algorithms are discussed.¹

¹ In this section, the work on including LSA entropy normalisation in GHA is joint work done with Brandyn Webb. The basic design of the Asymmetric Generalized Hebbian Algorithm is Brandyn's; the derivation is my own.
4.1 GHA for Latent Semantic Analysis

Latent Semantic Analysis has been used to great effect in the field of information retrieval and beyond. Limitations on corpus size are however a documented problem (49). Since only the first few hundred eigenvectors are required in LSA, GHA is a potential candidate for an alternative algorithm. GHA provides an alternative with a low memory footprint, but takes progressively longer to produce each eigenvector, ultimately meaning that time is an issue. Since eigenvectors are produced in order starting with the greatest, however, requiring only a small number of eigenvectors mitigates this. GHA is quick to converge on the greatest of the eigenvectors. Additionally, GHA is of interest from the point of view of the creation of a learning system that can develop a potentially very large LSA-style semantic model over a period of time from continuous streamed input. The learning behaviour of such a system would be of interest, and furthermore, there is potential for interesting performance features in a very large mature LSA-style model. The work presented in this section is previously published (24).

At a glance, GHA may not be an obvious candidate for LSA, since LSA is traditionally performed using singular value decomposition of paired data, to produce two sets of singular vectors. One set can be used to rotate wordbag vectors into the shared space created by the SVD process, and the other, document vectors. Since a document vector can be just as well represented as a wordbag vector, however, this is a little redundant. In fact, the primary task of LSA is to establish word interrelationships, and this is a task to which eigen decomposition is very well suited. In practical terms, using eigen decomposition for LSA simply involves using a word correlation matrix prepared over a set of training documents, which is square and symmetrical, to create an eigen decomposition, then using this decomposition to rotate the test set of documents, presented as wordbags, into the reduced dimensionality space, where they can be compared.
The test set may be the same as the training set; this would in fact reflect standard LSA practice. Eigen decomposition is performed on the matrix A = MM^T, where M is the word-by-document training matrix and A therefore describes word correlations, to produce the column matrix of eigenvectors U. U is reduced in dimensionality by discarding later columns to produce U'. Test documents in the form of wordbags are presented as the row matrix D. U' is used to reduce the dimensionality of D as follows:

D' = DU'   (4.1)

Row document vectors in D' can then be compared to each other.
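The procedure can be sketched directly in numpy (an illustration under the assumptions of this section, with the small example matrix standing in for a real training corpus and k = 2 chosen only for readability): build the word correlation matrix, keep the strongest eigenvectors, and rotate word-bag test documents into the reduced space.

```python
import numpy as np

# M: word-by-document training matrix (words as rows), as in the worked example.
M = np.array([
    [2, 3, 2], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 1, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1],
], dtype=float)

A = M @ M.T                                  # word correlation matrix, square and symmetrical
evals, U = np.linalg.eigh(A)
U = U[:, np.argsort(evals)[::-1]]            # columns ordered strongest first

k = 2
U_k = U[:, :k]                               # discard later columns to produce U'

# Test documents presented as a row matrix of word bags; here the query
# "the dog walked" and one of the training passages.
D = np.array([
    [1, 0, 1, 1, 0, 0, 0, 0],
    [2, 1, 1, 1, 0, 0, 0, 0],
], dtype=float)
D_reduced = D @ U_k                          # equation 4.1: D' = D U'

# Row vectors of D_reduced can now be compared, e.g. by cosine.
cos = (D_reduced[0] @ D_reduced[1]) / (np.linalg.norm(D_reduced[0]) * np.linalg.norm(D_reduced[1]))
print(round(float(cos), 3))
```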
4.1.1 Inclusion of Global Normalisation

LSA often includes an entropy-normalisation step (16), discussed in the previous chapter, in which word frequencies of the original data matrix are modified to reflect their usefulness as distinguishing features. Since this step has significant benefits, and has indeed become a part of the standard, no suggestion for an approach to performing LSA can be complete without its inclusion. This preprocessing requires that the entire corpus be available up-front, such that probabilities etc. can be calculated across the entire corpus, and therefore does not fit well with GHA, one of the main selling points of which is its incrementality. As outlined in the previous chapter, the word count is modified by setting the cell value c_ij as follows:

p_ij = tf_ij / gf_i   (4.2)

gw_i = 1 + Σ_j ( p_ij log(p_ij) / log(n) )   (4.3)

c_ij = gw_i log(c_ij + 1)   (4.4)

where n is the number of documents in the collection, tf is the term frequency, i.e. the original cell count, and gf is the global frequency, i.e. the total count for that word across all documents.

By modifying the word count in this way, words that are of little value in distinguishing between documents, for example, words such as "the", that are very frequent, are down-weighted. Observe that the calculation of the entropy depends on the total document count and on the total count of a given word across all the documents, as well as the individual cell count. For an incremental method, this means that it must be calculated over the documents seen so far, and that word and document counts must be accumulated on an ongoing basis. A little algebra produces:
gw_i = 1 + ( Σ_j tf_ij log(tf_ij) − gf_i log(gf_i) ) / ( gf_i log(n) )   (4.5)

This arrangement has the convenient property of isolating the summation over a quantity that can be accumulated, i.e. tf_ij log(tf_ij), whereas the previous arrangement would have required the individual term frequencies to be stored separately for an accurate calculation to be made. This is problematic where the number of such frequencies tends to infinity and the storage requirement increases as the tractability of the calculation decreases.
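A minimal sketch of how these quantities might be accumulated as documents stream past is given below (Python; the class and method names are illustrative, not from the thesis). Only gf_i, the running sum of tf_ij log(tf_ij), and the document count need to be stored, and the global weight of equation 4.5 can be read off at any point.

```python
import numpy as np
from collections import defaultdict

class IncrementalEntropyWeight:
    """Accumulates the quantities needed for equation 4.5 as documents arrive:
    gf_i and sum_j tf_ij * log(tf_ij) per word, plus the document count n."""

    def __init__(self):
        self.n_docs = 0
        self.gf = defaultdict(float)          # total count of each word so far
        self.tf_logtf = defaultdict(float)    # running sum of tf * log(tf)

    def add_document(self, word_counts):
        """word_counts: dict mapping word -> count in this document."""
        self.n_docs += 1
        for word, tf in word_counts.items():
            self.gf[word] += tf
            self.tf_logtf[word] += tf * np.log(tf)

    def global_weight(self, word):
        """gw_i = 1 + (sum_j tf_ij log tf_ij - gf_i log gf_i) / (gf_i log n)."""
        gf = self.gf[word]
        if gf == 0 or self.n_docs < 2:
            return 1.0
        return 1.0 + (self.tf_logtf[word] - gf * np.log(gf)) / (gf * np.log(self.n_docs))

# Usage on the running example: weights can be read off after any number of documents.
weights = IncrementalEntropyWeight()
weights.add_document({"the": 2, "man": 1, "walked": 1, "dog": 1})
weights.add_document({"the": 3, "man": 1, "took": 1, "dog": 1, "to": 1, "park": 1})
weights.add_document({"the": 2, "dog": 1, "went": 1, "to": 1, "park": 1})
print(round(weights.global_weight("dog"), 3), round(weights.global_weight("walked"), 3))
```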
4.1.2 Epoch Size and Implications for Application

The entropy-normalised cell count becomes less useful over very large numbers of training items, such as one might use with an incremental algorithm. Consider that it is the nature of language that most words are extremely infrequent. As the number of seen items tends to infinity, the weighting of words that occur with midrange frequencies will