Self Organising Maps for Value Estimation to
Solve Reinforcement Learning Tasks
Alexander Kleiner, Bernadette Sharp and Oliver Bittel
Post Print
N.B.: When citing this work, cite the original article.
Original Publication:
Alexander Kleiner, Bernadette Sharp and Oliver Bittel, Self Organising Maps for Value
Estimation to Solve Reinforcement Learning Tasks, 2000, Proc. of the 2nd International
Conference on Enterprise Information Systems (ICEIS 2000), 74-83.
Postprint available at: Linköping University Electronic Press
Self organising maps for value estimation to solve reinforcement learning tasks
A. Kleiner, B. Sharp, O. Bittel
Staffordshire University
May 11, 2000
Abstract
Reinforcement learning has recently been applied more and more to the optimisation of agent behaviours. This approach became popular due to its adaptive and unsupervised learning process. One of the key ideas of this approach is to estimate the value of agent states. For huge state spaces, however, it is difficult to implement this approach. As a result, various models were proposed which make use of function approximators, such as neural networks, to solve this problem. This paper focuses on an implementation of value estimation with a particular class of neural networks, known as self organising maps. Experiments with an agent moving in a "gridworld" and with the autonomous robot Khepera have been carried out to show the benefit of our approach. The results clearly show that the conventional approach, done by an implementation of a look-up table to represent the value function, can be outperformed in terms of memory usage and convergence speed.
1 Introduction
In this paper we discuss the credit assignment problem and the reinforcement learning issue associated with rewarding an agent upon successful execution of a set of actions. Figure 1 illustrates the interaction between an agent and its environment. For every action the agent performs in a state s_t, it receives an immediate reinforcement r_t and the percepts of the successor state s_{t+1}. This immediate reinforcement depends on the performed action and on the new state taken as well. For example, an agent searching for an exit in a maze might be rewarded only if this exit is reached. If this state is found, it is obvious that all former states which contributed to this success have to be rewarded as well.
Reinforcement learning is one solution to the credit assignment problem. The idea of reinforcement learning grew up within two different branches. One branch focused on learning by trial and error, whereas the other branch focused on the problem of optimal control. In the late 1950s Richard Bellman introduced his approach of a value function or "optimal return function" to solve the problem of optimal control (Bellman 1957). Methods to solve this equation are nowadays known as dynamic programming. This paper focuses on a generalization of these methods, known as temporal difference methods, which was introduced in 1988 by Richard Sutton (Sutton 1988). These methods assign, during an iterative procedure, a credit to every state in the state space, based on a calculated difference between these states. Roughly speaking this implies that if a future state is desirable, the present state is as well. Sutton introduced the parameter λ to define how far in the future states have to be taken into account; thus this generalisation is named TD(λ).
Figure 1: The agent-environment interaction in reinforcement learning (the agent takes action a_t in state s_t and receives reward r_t)
In this paper the simpler case TD(0) is used, which only considers one successor state during a temporal update.
Current methods for the "optimal return function" suffer, however, from what Bellman called "the curse of dimensionality", since states from real world problems usually consist of many elements in their vectors. Therefore it makes sense to use function approximators, such as neural networks, to learn the "optimal return function".
Successful applications of reinforcement learning with neural networks have been reported by many researchers. Barto and Crites (Barto & Crites 1996) describe a neural reinforcement learning approach for an elevator scheduling task. Thrun (Thrun 1996) reports the successful learning of basic control procedures of an autonomous robot. This robot learned with a neural Q-learning implementation, supported by a neural network. Another successful implementation was done by Tesauro at IBM (Tesauro 1992). He combined a feed-forward network, trained by backpropagation, with TD(λ) for the popular backgammon game. This architecture was able to find strategies with little guidance and has even defeated champions during an international competition.
Besides these successful examples, which are all based on neural networks using backpropagation, there is more and more evidence that architectures based on backpropagation converge slowly or not at all. Examples of such problematic tasks are given by (Boyan & Moore 1995) and (Gordon 1995). These difficulties arise due to the fact that backpropagation networks store information implicitly. For the training this means that every new update affects formerly stored information as well. Convergence cannot be guaranteed anymore, since the original approach of reinforcement learning is supposed to be used with an explicit look-up table. Therefore our approach makes use of a neural network architecture with explicit knowledge representation, known as self organising maps.
This paper will discuss the problems associated with the use of self organising maps (SOMs) to learn the value function and describe our modified approach to SOMs applied to two problems.
2 Self organizing maps (SOM)
Self organizing maps were first introduced by Teuvo Kohonen in 1982 (Kohonen 1982). This kind of neural network is a typical representative of unsupervised learning algorithms. During the learning process particular neurons are trained to represent clusters of the input data. The achieved arrangement of these clusters is such that similar clusters, in terms of their Euclidean distance, are near to each other and different clusters are far from each other. Hence, the network builds up a topology depending on the data given to it from the input space. This topology reflects the statistics of the input space: areas of the input space which are supported by more samples in the data are represented in more detail than areas supported by fewer samples.
SOM architecture
A SOM usually consists of a two-dimensional grid of neurons. Every neuron is connected via its weights to the input vector, where one weight is spent for every element of this vector. Before the training process, the values of these weights are set arbitrarily. During the training phase, however, the weights of each neuron are modified to represent clusters of the input space.
Mapping of patterns
After a network has been trained, the cluster for an input vector can be identified easily. To find the neuron representing this cluster, the Euclidean distance between this vector and the weight sets of all neurons on the SOM has to be calculated. The neuron with the shortest distance represents this vector most precisely and is thus named the "winner" neuron. The Euclidean distance is calculated with the following equation:

d_i = \sum_{k=1}^{n} (w_{ik} - x_k)^2    (1)

where w_{ik} denotes the i-th neuron's k-th weight and x_k the k-th element of the input vector.
Learning of clusters
The learning process takes place as so-called offline learning. During a fixed number of repetitions, called epochs, all patterns of the training data are propagated through the network. At the beginning of the learning process, the values of the weights are arbitrary. Therefore, for every input vector x_i a neuron u_i is chosen as its representative at random as well. As the structure of the map manifests itself, the weights are adapted incrementally and the assignment of input vectors becomes more stable, since the Euclidean distance of each winner neuron decreases.
To build a topological map, it is important to adjust the weights of the neighbours around the winner neuron as well. Therefore a special neighbourhood function has to be applied. This function should return a value of 1 for the winner neuron and, for neurons with increasing distance to it, a value decreasing down to zero. Usually the "sombrero hat function" or the Gaussian function is used for this. With the Gaussian function, the neighbourhood function is:

h_i = e^{-\frac{|n - n_i|^2}{2\sigma^2}}    (2)

where n denotes the winner neuron, n_i any neuron on the Kohonen layer, and the standard deviation σ denotes the neighbourhood radius.

For every input vector the following update rule is applied to every neuron on the SOM:

\Delta w_{ik} = \eta \, h_i (x_k - w_{ik})    (3)

where η denotes the step size.

By this update rule, weights are updated in discrete steps, defined by the step size η. The nearer neurons are to the chosen winner neuron, the more they are affected by the update. Thereby neighbouring neurons represent similar clusters, which leads to a topological map.
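To make equations (1)-(3) concrete, the following minimal sketch (Python with NumPy; class and parameter names are illustrative, not taken from the original implementation) selects the winner according to equation (1), computes the Gaussian neighbourhood of equation (2) and applies the update rule of equation (3):

```python
import numpy as np

class SOM:
    """Minimal self organising map: a grid of neurons, each holding one weight vector."""

    def __init__(self, rows, cols, dim, step_size=0.5, sigma=1.5, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.uniform(0.0, 1.0, size=(rows * cols, dim))  # arbitrary initial weights
        # grid coordinate of every neuron, used for the distance |n - n_i| in equation (2)
        self.coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
        self.step_size = step_size   # step size in update rule (3)
        self.sigma = sigma           # neighbourhood radius in equation (2)

    def winner(self, x):
        """Equation (1): the neuron with the smallest (squared) Euclidean distance to x."""
        return int(np.argmin(np.sum((self.weights - np.asarray(x)) ** 2, axis=1)))

    def train_pattern(self, x):
        """Equations (2) and (3): pull every neuron towards x, scaled by the neighbourhood."""
        x = np.asarray(x, dtype=float)
        w = self.winner(x)
        grid_dist2 = np.sum((self.coords - self.coords[w]) ** 2, axis=1)
        h = np.exp(-grid_dist2 / (2.0 * self.sigma ** 2))              # Gaussian neighbourhood
        self.weights += self.step_size * h[:, None] * (x - self.weights)
        return w

    def train(self, patterns, epochs):
        """Offline learning: propagate all training patterns through the network for a fixed number of epochs."""
        for _ in range(epochs):
            for x in patterns:
                self.train_pattern(x)
```

A map trained this way can then be queried with winner(x) to obtain the cluster of an unseen input vector.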
The advantage of SOMs is that they are able to classify samples of an input space unsupervised. During the learning process, the map adapts its structure to the input data. Depending on the data, the SOM will build clusters and order them in an appropriate manner. One disadvantage of SOMs, however, is the need for a representative sample of the input space and for training over many epochs. After the SOM is trained, it is only possible to add a new cluster to the representation by repeating the learning process with the old training set and the new pattern.
3 Reinforcement Learning
Classical approaches for neural networks tend to make use of specific knowledge about states and their corresponding output. This given knowledge is used for a training set, and after the training it is expected that knowledge about unknown situations is gained by generalization. However, for many problems in the real world an appropriate training set can't be generated, since the "teacher" doesn't know the specific mapping. Nevertheless, it seems to be easy for the teacher to assess this mapping for every state. When learning to drive a car, for example, one is not told how to operate the car controls appropriately; the teacher, however, bridges the gap in learning by using appropriate feedback, which improves the learning process and finally leads to the desired mapping between states and actions.
The Reinforcement problem
The task of reinforcement learning is to use rewards to train an agent to perform successful functions. Figure 1 illustrates the typical interaction between agent and environment. The agent performs actions in its environment and receives a new state vector, caused by this action. Furthermore, the agent gets feedback on whether the action was adequate. This feedback is expressed by immediate rewards, which also depend on the new state taken by the agent. A chess playing agent, for example, would receive a maximum immediate reward if it reaches a state where the opponent cannot escape checkmate. This example also illustrates the credit assignment problem. The reward achieved in the last board position is achieved after a long chain of actions. Thus all actions done in the past are responsible for the final success and therefore also have to be rewarded. For this problem several approaches have been proposed; a good introduction to these is found in the book by Barto and Sutton (Barto & Sutton 1998). This paper, however, focuses on one of these approaches, which is the value iteration method, also known as TD(0).
Rewards
In reinforcement learning, the only hints given towards the successful task are immediate reinforcement signals (rewards here also include negative values, which are equal to punishments). These signals usually come directly from the environment or can be generated artificially by an assessment of the situation. If they are generated for a problem, they should be chosen economically. Instead of rewarding many sub-solutions of a problem, only the main goal should be rewarded. For example, for a chess playing agent it would not necessarily make sense to reward the taking of the opponent's pieces. The agent might find a strategy which optimises the collection of the opponent's pieces, but forgets about the importance of the king. Reinforcement learning aims to maximise the achieved reinforcement signals over a long period of time.

In some problems no terminal state can be expected, as in the case of a robot driving through a world of obstacles and learning not to collide with them. An accumulation of rewards would lead to an infinite sum. For the case where no terminal state is defined, we have to make use of a discount factor γ to ensure that the learning process will converge. This factor discounts rewards which might be expected in the future, and thus the accumulated reward can be computed as follows:

R_T = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{T} \gamma^k r_{t+k+1}    (4)

where R_T denotes the rewards achieved during many steps, γ the discount factor and r_t the reward at time t. For T = ∞ it has to be ensured that γ < 1.
The direct goal of reinforcement learning methods is to maximise R_T. To achieve this goal, however, a prediction of the expectation of future rewards is necessary. Therefore we need a mapping from states to their corresponding maximum expectation. As known from utility theory, this mapping is defined by the value function.
The value function V*(s)
In order to maximise rewards over time, it has to be known for every state what future rewards might be expected. The optimal value function V*(s) provides this knowledge with a value for every state. This return value is equal to the accumulation of maximum rewards over all successor states. Generally this function can be represented by a look-up table, where an entry is necessary for every state. This function is usually unknown and has to be learned by a reinforcement learning algorithm. One algorithm which updates this function successively is value iteration.
Value iteration
In contrast to other available methods, this method updates the value function after every seen state and thus is known as value iteration. This update can be imagined as an agent performing actions and using the received rewards, caused by these actions, to update the values of the former states. Since the optimal value function returns for every state the accumulation of future rewards, the update of a visited state s_t has to include the value of the successor state s_{t+1} as well. Thus the value function is learned with the following iterative equation:
V_{k+1}(s_t) := r(s_t, a_t) + \gamma V_k(s_{t+1})    (5)

where V_k and V_{k+1} denote the value function before and after the update, and r(s_t, a_t) refers to the immediate reinforcement achieved for the transition from state s_t to state s_{t+1} by the chosen action a_t. While applying this method, the value function is approximated more and more closely until it reaches its optimum. That means that predictions of future rewards become successively more precise and actions can be chosen with maximum future rewards.
There is an underlying assumption that the agent's actions are chosen in an optimal manner. In value iteration, the optimal choice of an action can be made with the greedy policy. This policy is, simply as its name implies, to choose the actions which lead to maximum rewards. For an agent this means choosing, from all possible actions a ∈ A, the one which returns the maximum expectation according to equation (5). However, we can see that for equation (5) the successor state s_{t+1}, caused by action a_t, must be known. Thus a model of the environment is necessary, which provides for state s_t and action a_t the successor state s_{t+1}:

s_{t+1} = f(s_t, a_t)    (6)

Exploration
If all actions are chosen according to the greedy policy, it might happen that the learning process results in a sub-optimal solution. This is because decisions are based only on the knowledge gathered so far. This knowledge, however, can lead to a locally optimal solution in the search space, from which globally optimal solutions can never be found. Therefore it makes sense to choose actions arbitrarily with a defined likelihood. The policy of choosing an action arbitrarily with a probability of ε is called the ε-greedy policy. Certainly there is a trade-off between exploration and exploitation of existing knowledge, and the optimal adjustment of this parameter depends on the problem domain.
Implementation of Value Iteration
So far, the algorithm can be summarised in the following steps:

- select the most promising action a_t according to the ε-greedy policy:
  a_t = \arg\max_{a \in A(s_t)} \left( r(s_t, a) + \gamma V_k(f(s_t, a)) \right)

- apply a_t in the environment: s_t ⟹ s_{t+1}

- adapt the value function for state s_t:
  V_{k+1}(s_t) := r(s_t, a_t) + \gamma V_k(s_{t+1})
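These steps can be written down directly for a tabular value function. The following sketch (Python; the environment model f, the reward function r, the action set and the state representation are placeholders supplied by the caller) chooses actions ε-greedily and applies update rule (5) online:

```python
import random

def value_iteration_episode(actions, f, r, V, start, is_terminal,
                            gamma=1.0, epsilon=0.1, max_steps=1000):
    """One episode of online value iteration with an epsilon-greedy policy.

    actions(s) -> list of admissible actions in state s
    f(s, a)    -> successor state (the environment model, equation (6))
    r(s, a)    -> immediate reinforcement for taking a in s
    V          -> dict mapping states to their current value estimate (the look-up table)
    """
    s = start
    for _ in range(max_steps):
        if is_terminal(s):
            break
        if random.random() < epsilon:                      # explore with probability epsilon
            a = random.choice(actions(s))
        else:                                              # otherwise act greedily
            a = max(actions(s), key=lambda a: r(s, a) + gamma * V.get(f(s, a), 0.0))
        s_next = f(s, a)
        V[s] = r(s, a) + gamma * V.get(s_next, 0.0)        # update rule (5)
        s = s_next
    return V
```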
In theory, this algorithm will definitely evaluate an optimal solution for problems such as those defined at the beginning of this section. A problem for reinforcement learning, however, is its application to real world situations, because real world situations usually involve huge state spaces. The value function should provide every state with an appropriate value, but most real world problems come with a multi-dimensional state vector. The state of a robot whose task is to avoid obstacles, for example, is defined by its sensor values. If every sensor had a possible return value of 10 bits and the robot itself owned eight of these sensors, the state space would consist of 1.2 × 10^24 different states, emphasizing the problem of tractability in inferencing.

On the other hand, it might happen that during a real experiment with limited time, all states can never be visited. Thus it is likely that even after a long training time, still unknown states are visited. But unfortunately the value function can't provide a prediction for them.
4 Modified SOM to learn the value function
The two problems previously identified for reinforcement learning can be solved using function approximators. Neural networks, in particular, provide the benefit of compressing the input space, and furthermore the learned knowledge can be generalised. For the value function this means that similar states will be evaluated by one neuron. Hence unknown states can also be generalized and evaluated by the policy. For this purpose the previously introduced model of self organising maps has been taken and modified.
Modification to the architecture
Usually SOMs are used for classification of input spaces, for which no output vector is necessary. To make use of SOMs as function approximators, it is necessary to extend the model by an output value. Such modifications were first introduced by Ritter and Schulten in connection with reflex maps for complex robot movements (Ritter & Schulten 1987). The modification used here is that every neuron of the Kohonen layer is expanded by one weight, which connects it to the scalar output. This is done to obtain a generalisation over similar situations. To achieve this, the output weights have to be trained with a neighbourhood function as well. Therefore the output weights are adapted with the following rule:

\Delta w_i = \eta_2 \, h_i (y - w_i)    (7)

where η_2 is a second step size parameter, h_i the same neighbourhood function as used for the input weights, and y the desired output of the network.
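Building on the SOM sketch given in Section 2 (illustrative Python, not the authors' code), the extension can be expressed by giving every neuron one additional output weight, read out via the winner neuron and adapted with rule (7):

```python
import numpy as np

class ValueSOM(SOM):
    """SOM extended by one scalar output weight per neuron, used here as a value estimate."""

    def __init__(self, rows, cols, dim, out_step_size=0.3, **kwargs):
        super().__init__(rows, cols, dim, **kwargs)
        self.out_weights = np.zeros(rows * cols)   # one additional output weight per neuron
        self.out_step_size = out_step_size         # the second step size parameter in rule (7)

    def value(self, x):
        """Read out the network: the output weight of the winner neuron for input x."""
        return float(self.out_weights[self.winner(x)])

    def learn_value(self, x, y):
        """Rule (7): pull the output weights towards the target y, weighted by the neighbourhood."""
        w = self.winner(x)
        grid_dist2 = np.sum((self.coords - self.coords[w]) ** 2, axis=1)
        h = np.exp(-grid_dist2 / (2.0 * self.sigma ** 2))
        self.out_weights += self.out_step_size * h * (y - self.out_weights)
```

Because the neighbourhood function also moves the output weights of neighbouring neurons, similar situations receive similar value estimates.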
Modification to the algorithm
As remarked previously, the learning algorithm for SOMs is supposed to be applied "offline" with a specific training set. The application of value iteration, however, is an "online" process, where the knowledge increases iteratively. To solve this contradiction, the learning process of the SOM has been divided into two steps:

- First step: pre-classification of the environment

- Second step: execution of reinforcement learning with improvement of the classification for visited states

For the first step a representative sample of the whole state space is necessary to build an appropriate map of the environment. This sample is trained until the structure of the SOM is adequate to classify states of the problem's state space. During the execution of the second step the reinforcement learning algorithm updates states with their appropriate values. These states are classified by the SOM, where one neuron is chosen as winner. The corresponding output weight of this neuron is changed towards the value calculated by the reinforcement learning algorithm; the output weights of neighbouring neurons are modified as well to achieve the effect of generalisation.

Usually the states necessary to solve the problem are a subset of the whole state space. Thus the SOM has to classify only this subset, using a pre-classification. During the application of reinforcement learning, this classification will improve, since for every state visited, its representation is strengthened. States which are visited more frequently, and thus are more important for the solution of the problem, will achieve a better representation than those unimportant states which are visited less.
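Putting the two steps together, and reusing the ValueSOM and ε-greedy sketches from above, the modified training procedure might look roughly as follows (the arguments f, r, actions and the state sampling are placeholders, not part of the original implementation):

```python
import random

def train_modified_som(som, sample_states, pre_epochs, episodes, actions, f, r,
                       start, is_terminal, gamma=1.0, epsilon=0.01, max_steps=50):
    """Two-step training of the modified SOM as value function approximator."""
    # Step 1: pre-classification of the environment on a representative sample
    som.train(sample_states, epochs=pre_epochs)

    # Step 2: value iteration; the SOM replaces the look-up table
    for _ in range(episodes):
        s = start()
        for _ in range(max_steps):
            if is_terminal(s):
                break
            if random.random() < epsilon:
                a = random.choice(actions(s))
            else:
                a = max(actions(s), key=lambda a: r(s, a) + gamma * som.value(f(s, a)))
            s_next = f(s, a)
            target = r(s, a) + gamma * som.value(s_next)   # update rule (5), with the SOM as V
            som.learn_value(s, target)                     # rule (7) adapts the output weights
            som.train_pattern(s)                           # visited states strengthen their own classification
            s = s_next
    return som
```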
5 Experiments and results
5.1 The path-planning problem
This section describes the application of our modified SOM with reinforcement learning for solving the path planning problem. The problem is to find the shortest path through a maze or simply a path on a map. For the experiment described here, a computer simulation of a "gridworld" has been used (see Figure 2). The gridworld is represented by a two-dimensional arrangement of positions. Wall pieces or obstacles can occupy these positions, and the agent therefore can't cross them. Other positions, however, are free for its discovery. For the experiment, the upper left corner is defined as start position and the lower right corner as end position. The agent's task is to find the shortest path between these two positions, while avoiding obstacles on its way.
Due to the fact that the agent is supposed to learn the "cheapest" path, it is punished with -1 for every move and rewarded with 0 if it reaches the goal. Beside these reinforcement signals, the agent gets no other information about which direction should be preferred. If it faces an obstacle, the possible actions are reduced to those actions which lead to the free positions around it.

Figure 2: The gridworld experiment

Figure 3: Achieved rewards during learning of a behaviour for the gridworld experiment (convergence of the different implementations: SOM 10x10, SOM 8x8 and look-up table; reinforcement per epoch)
Two implementations of a modified SOM, with 8x8 neurons and with 10x10 neurons, have been used. For comparison, the experiment has also been carried out with a look-up table, where every entry represents a state. This look-up table consists of 289 entries, one for every possible position of the gridworld.
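As an illustration of this setup (a hypothetical sketch; the exact maze layout of Figure 2 is not reproduced), the gridworld can be expressed as an environment model f, a reward function r and a set of admissible actions for the value iteration sketch above:

```python
def make_gridworld(width, height, obstacles, goal):
    """Environment model f, reward r and admissible actions for the gridworld."""
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def free(pos):
        x, y = pos
        return 0 <= x < width and 0 <= y < height and pos not in obstacles

    def actions(pos):
        # only moves that lead to free positions around the agent are allowed
        return [a for a, (dx, dy) in moves.items() if free((pos[0] + dx, pos[1] + dy))]

    def f(pos, a):
        dx, dy = moves[a]
        return (pos[0] + dx, pos[1] + dy)

    def r(pos, a):
        # every move is punished with -1; reaching the goal is rewarded with 0
        return 0.0 if f(pos, a) == goal else -1.0

    return actions, f, r

# Example wiring with the value iteration sketch above (sizes and layout are hypothetical):
# actions, f, r = make_gridworld(10, 10, obstacles={(3, 3), (3, 4)}, goal=(9, 9))
# V = value_iteration_episode(actions, f, r, V={}, start=(0, 0), is_terminal=lambda s: s == (9, 9))
```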
Results
The result of this experiment is shown in Figure 3. In this graph the achieved rewards of each implementation after every episode can be seen. The optimal path is found if the accumulated reinforcement during one episode is -53, since the agent needs at least 53 steps to reach its goal. In the graph it can be seen that the implementation of the modified SOM with 10x10 neurons leads to a faster result than the look-up table. After 30 episodes the agent equipped with the modified SOM found the cheapest path.
5.2 Learning obstacle avoidance with a robot
A common problem in robotics is the autonomous drive of a robot. For such a drive there are various processes. One process might bring the robot to a far destination, led by a path finding algorithm. For simple movement, however, a process is necessary to avoid obstacles. In this problem, it is very difficult to define appropriate actions for particular situations. On the other hand, we can easily assess the resulting actions. Therefore this problem seems to be appropriate for the reinforcement learning approach.

In this experiment the autonomous miniature robot Khepera, which was developed at the EPFL in Lausanne, has been used (see Figure 4). This robot, only 5 cm in size, is equipped with eight proximity sensors, where two are mounted at the front, two at the back, two at the sides and two at 45° to the front. These sensors give a return value between 0 and 1024, which corresponds to a range of about 5 cm. The robot's drive consists of two servo motors, which can turn the two wheels with 2 m per second in negative and positive directions. By this configuration, the robot is able to do 360° rotations on the spot.

Figure 4: Autonomous robot Khepera

Hence the robot is very manoeuvrable and should be able to deal with most situations. Furthermore, the robot is equipped with two rechargeable batteries, which enable it to drive autonomously for about 20 minutes. For the execution of programs, there also exists a CPU from Motorola and a RAM area of 512 KB on the robot.
Experiment
Due to the fact that for value iteration a model of the environment is required, the robot was first trained using a computer simulation. Afterwards the experiment continued on a normal office desk, where obstacles and walls were built up with wooden blocks.

In the reinforcement learning algorithm, the state of the robot was represented by the eight sensor values. The allowed actions were reduced to the three actions left turn, right turn and straight forward. Also the reinforcement signals were kept simple: for every collision the robot gets a punishment of -1, otherwise a reward of 0. The experiment has been carried out over multiple episodes. One episode has been limited to 50 steps; therefore the discount factor γ has been set to 1.0. For exploration purposes the factor ε has been adjusted to 0.01, which is equal to the probability that actions are chosen arbitrarily. Corresponding to the state vector, the input vector of the SOM consists of eight elements as well. For the Kohonen layer an arrangement of 30x30 neurons has been chosen.

Before the application of the reinforcement learning algorithm, the SOM had to be pre-classified. Therefore a training set of typical situations from an obstacle world has been trained over 90 epochs. With the help of visualisation tools it could be ensured that the situations are adequately classified, as illustrated in Figure 5.

Figure 5: A learned classification of the sensor space

During the episodes of the value iteration method, identified situations were relearned with a small neighbourhood of σ = 0.1 and also a small learning step rate of η = 0.3.
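In terms of the sketches given earlier, this experiment corresponds roughly to the following configuration (illustrative code; the simulated sensor model, the collision reward and the sample of typical situations are stand-ins for the actual Khepera simulation, and sensor readings are assumed normalised to [0, 1]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the simulation; a real model of the Khepera and its eight proximity sensors
# would replace these stubs.
ACTIONS = ["left", "right", "forward"]

def simulate_step(s, a):
    return tuple(np.clip(np.asarray(s) + rng.normal(0.0, 0.05, 8), 0.0, 1.0))

def collision_penalty(s, a):
    return -1.0 if max(s) > 0.95 else 0.0    # punish situations with an obstacle very close

def random_start():
    return tuple(rng.uniform(0.0, 0.3, 8))

typical_situations = [rng.uniform(0.0, 1.0, 8) for _ in range(500)]

# Step 1: pre-classification of the sensor space (30x30 Kohonen layer, 90 epochs)
som = ValueSOM(rows=30, cols=30, dim=8, step_size=0.5, sigma=3.0, out_step_size=0.3)
som.train(typical_situations, epochs=90)

# Step 2: value iteration with gamma = 1.0, epsilon = 0.01 and episodes of at most 50 steps;
# visited situations are relearned with a small neighbourhood and a small learning step rate
som.sigma, som.step_size = 0.1, 0.3
train_modified_som(som, typical_situations, pre_epochs=0,   # pre-classification already done above
                   episodes=100, actions=lambda s: ACTIONS,
                   f=simulate_step, r=collision_penalty,
                   start=random_start, is_terminal=lambda s: False,
                   gamma=1.0, epsilon=0.01, max_steps=50)
```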
Results

Figure 6: Collisions during the autonomous learning of an obstacle avoidance strategy (accumulated reinforcement per episode)

The result of this experiment can be seen in Figure 6. In this graph the accumulated rewards for every episode are shown. Since for every collision the robot has been punished with -1, the reinforcement for every episode is equal to the number of caused collisions. After 45 episodes the number of collisions became significantly lower. During the early episodes, the value of the achieved reinforcement signals sways strongly. This results from the fact that after the robot overcame one situation, it encountered a new situation again, where another behaviour had to be learned as well. As we see in the graph, the robot learned to manage most situations after a sufficient time of proceeding. After the training, the learned control has been tested on the real robot. Although the achieved behaviour was not elegant, it proved that the robot obviously learned the ability to avoid obstacles.
6 Conclusion
The problem of huge state spaces in real world tasks, and the problem of unknown states that can be encountered, have been tackled by use of a modified SOM. The SOM's abilities to compress the input space and to generalize from known situations to unknown ones made it possible to achieve a reasonable solution. However, it was necessary to split the standard algorithm for SOMs into two parts: first, learning by a pre-classification, and second, learning during value iteration. With this modification, the algorithm can be applied to an online learning process, which is given by the value iteration method.
The applied modifications to the standard SOM have been evaluated on the gridworld example and on the problem of obstacle avoidance for an autonomous robot. These experiments showed two main advantages of the modified SOM over a standard implementation with a look-up table. First, the example of obstacle avoidance proved that even for enormous state spaces a strategy can be learned. Second, the path finding example showed that the use of a modified SOM can lead to faster results, since the agent is able to generalize over situations instead of learning a value for each of them.
For the value iteration algorithm applied to the experiments described here, a model of the environment is necessary. For real world problems, such as the problem of obstacle avoidance, however, an appropriate model can hardly be provided. Sensor signals are normally noisy, or it might even be that a sensor is damaged or doesn't work properly. Thus it is recommendable to use another reinforcement learning implementation which no longer requires a model of the environment. One commonly used variant of reinforcement learning is Q-learning. In this algorithm values are assigned to tuples of state and action, thus a model of the environment is not required. We believe that a SOM, as proposed in this paper, would yield better results there as well.
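For illustration, the standard tabular Q-learning update (not part of the implementation described in this paper) shows why no model is needed: the update uses only the observed successor state, not the model f(s, a).

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s, a) is moved towards reward + gamma * max_a' Q(s_next, a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q
```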
One of the big disadvantages encountered, however, is that the modification of the SOM during the second step changes the generalisation behaviour of the network. If states are relearned frequently with a small neighbourhood function, the learned knowledge becomes too specific and generalisation is reduced. To overcome this problem, it would be necessary to relearn the complete structure of the SOM with an offline algorithm. Unfortunately, experience in terms of training examples is lost after the online process. A possible solution, and probably the subject of further work, is to store "critical" patterns temporarily during the online process by dynamic neurons. By "critical" patterns we mean those which come with a far Euclidean distance to all existing neurons in the network and whose classification by these neurons would thus not be appropriate. Given the set of these dynamically allocated neurons and the set of neurons on the SOM, a new arrangement with a better topological representation can be trained by an offline algorithm. The execution of this offline algorithm can be done during a phase of no input to the learner and is motivated by the human post-processing of information known as the REM phase.
References

Barto, A. & Crites, R. (1996), Improving elevator performance using reinforcement learning, in M. C. Hasselmo, M. C. Mozer & D. S. Touretzky, eds, 'Advances in Neural Information Processing Systems', Vol. 8.

Barto, A. & Sutton, R. (1998), Reinforcement Learning - An Introduction, MIT Press, Cambridge.

Bellman, R. E. (1957), Dynamic Programming, Princeton University Press, Princeton.

Boyan, A. J. & Moore, A. W. (1995), Generalization in Reinforcement Learning: Safely Approximating the Value Function, in T. K. Leen, G. Tesauro & D. S. Touretzky, eds, 'Advances in Neural Information Processing Systems', Vol. 7, MIT Press, Cambridge MA.

Gordon, G. (1995), Stable function approximation in dynamic programming, in 'Proceedings of the 12th International Conference on Machine Learning', Morgan Kaufmann, San Francisco, Calif., pp. 261-268.

Kohonen, T. (1982), Self-Organized Formation of Topologically Correct Feature Maps, in 'Biol. Cybernetics', Vol. 43, pp. 59-69.

Ritter, H. & Schulten, K. (1987), Extending Kohonen's Self-Organizing Mapping Algorithm to Learn Ballistic Movements, in 'Neural Computers', Springer Verlag, Heidelberg, pp. 393-406.

Sutton, R. (1988), Learning to predict by the methods of temporal differences, in 'Machine Learning', Vol. 3, pp. 9-44.

Tesauro, G. (1992), Practical issues in temporal difference learning, in 'Machine Learning', Vol. 8, pp. 257-277.

Thrun, S. (1996), Explanation-based neural network learning: A lifelong learning approach, Kluwer Academic Publishers.