Self Organising Maps for Value Estimation to
Solve Reinforcement Learning Tasks
Alexander Kleiner, Bernadette Sharp and Oliver Bittel
Post Print
N.B.: When citing this work, cite the original article.
Original Publication:
Alexander Kleiner, Bernadette Sharp and Oliver Bittel, Self Organising Maps for Value
Estimation to Solve Reinforcement Learning Tasks, 2000, Proc. of the 2nd International
Conference on Enterprise Information Systems (ICEIS 2000), 74-83.
Postprint available at: Linköping University Electronic Press
Self organising maps for value estimation to solve reinforcement learning tasks
A. Kleiner, B. Sharp, O. Bittel
Staffordshire University
May 11, 2000
Abstract
Reinforcement learning has recently been applied more and more to the optimisation of agent behaviours. This approach became popular due to its adaptive and unsupervised learning process. One of the key ideas of this approach is to estimate the value of agent states. For huge state spaces, however, it is difficult to implement this approach. As a result, various models were proposed which make use of function approximators, such as neural networks, to solve this problem. This paper focuses on an implementation of value estimation with a particular class of neural networks, known as self organising maps. Experiments with an agent moving in a "gridworld" and with the autonomous robot Khepera have been carried out to show the benefit of our approach. The results clearly show that the conventional approach, done by an implementation of a look-up table to represent the value function, can be outperformed in terms of memory usage and convergence speed.
1 Introduction
In this paper we discuss the credit assignment problem and the reinforcement learning issue associated with rewarding an agent upon successful execution of a set of actions. Figure 1 illustrates the interaction between an agent and its environment. For every action the agent performs in a state s_t, it receives an immediate reinforcement r_t and the percepts of the successor state s_{t+1}. This immediate reinforcement depends on the performed action and on the new state taken as well. For example, an agent searching for an exit in a maze might be rewarded only if this exit is reached. If this state is found, it is obvious that all former states which contributed to this success have to be rewarded as well.
Reinforcement learning is one solution to the credit assignment problem. The idea of reinforcement learning grew up within two different branches. One branch focused on learning by trial and error, whereas the other branch focused on the problem of optimal control. In the late 1950s Richard Bellman introduced his approach of a value function or "optimal return function" to solve the problem of optimal control (Bellman 1957). Methods to solve this equation are nowadays known as dynamic programming. This paper focuses on a generalization of these methods, known as temporal difference methods, which was introduced in 1988 by Richard Sutton (Sutton 1988). These methods assign, during an iterative procedure, a credit to every state in the state space, based on a calculated difference between these states. Roughly speaking this implies that if a future state is desirable, the present state is as well. Sutton introduced the parameter λ to define how far in the future states have to be taken into account; thus this generalisation is named TD(λ).
Figure 1: The agent-environment interaction in reinforcement learning (the agent takes action a_t in state s_t and receives reward r_t)
In this paper the simpler case TD(0) is used, which only considers one successor state during a temporal update.
Current methods for the "optimal return function" suffer, however, from what Bellman called "the curse of dimensionality", since states from real world problems usually consist of many elements in their vectors. Therefore it makes sense to use function approximators, such as neural networks, to learn the "optimal return function".
Successful applications of reinforcement learning with neural networks have been reported by many researchers. Barto and Crites (Barto & Crites 1996) describe a neural reinforcement learning approach for an elevator scheduling task. Thrun (Thrun 1996) reports the successful learning of basic control procedures of an autonomous robot. This robot learned with a neural Q-learning implementation, supported by a neural network. Another successful implementation was done by Tesauro at IBM (Tesauro 1992). He combined a feed-forward network, trained by backpropagation, with TD(λ) for the popular backgammon game. This architecture was able to find strategies with little guidance and has even defeated champions during an international competition.
Besides these successful examples, which are all based on neural networks using backpropagation, there is more and more evidence that architectures based on backpropagation converge slowly or not at all. Examples of such problematic tasks are given by (Boyan & Moore 1995) and (Gordon 1995). These difficulties arise due to the fact that backpropagation networks store information implicitly. For the training this means that every new update affects formerly stored information as well. Convergence cannot be guaranteed anymore, since the original approach of reinforcement learning is supposed to be used with an explicit look-up table. Therefore our approach makes use of a neural network architecture with explicit knowledge representation, known as self organising maps.
This paper will discuss the problems associated with the use of self organising maps (SOMs) to learn the value function and describe our modified approach to SOMs applied to two problems.
2 Self organizing maps (SOM)
Self organizing maps were first introduced by Teuvo Kohonen in 1982 (Kohonen 1982). This kind of neural network is a typical representative of unsupervised learning algorithms. During the learning process particular neurons are trained to represent clusters of the input data. The achieved arrangement of these clusters is such that similar clusters, in terms of their Euclidean distance, are near to each other and different clusters are far from each other. Hence, the network builds up a topology depending on the data given to it from the input space. This topology reflects the statistics of the input space: areas of the input space which are supported by more samples in the data are represented in more detail than areas supported by fewer samples.
SOM architecture
A SOM usually consists of a two-dimensional grid of neurons. Every neuron is connected via its weights to the input vector, where one weight is spent for every element of this vector. Before the training process, the values of these weights are set arbitrarily. During the training phase, however, the weights of each neuron are modified to represent clusters of the input space.
Mapping of patterns
After a network has been trained, the cluster for an input vector can be identified easily. To find the neuron representing this cluster, the Euclidean distance between this vector and the weight sets of all neurons on the SOM has to be calculated. The neuron with the shortest distance represents this vector most precisely and is thus named the "winner" neuron. The Euclidean distance is calculated with the following equation:

d_i = \sum_{k=1}^{n} (w_{ik} - x_k)^2    (1)

where w_{ik} denotes the i-th neuron's k-th weight and x_k the k-th element of the input vector.
Learning of clusters
The learning process takes place as so-called offline learning. During a fixed number of repetitions, called epochs, all patterns of the training data are propagated through the network. At the beginning of the learning process, the values of the weights are arbitrary. Therefore, for every input vector x_i a neuron u_i is chosen as its representative at random as well. As the structure of the map manifests itself, the weights are adapted incrementally and the assignment of input vectors becomes more stable, since the Euclidean distance of each winner neuron decreases.
To build a topological map, it is important to adjust the weights of the neighbours around the winner neuron as well. Therefore a special neighbourhood function has to be applied. This function should return a value of 1 for the winner neuron and, for neurons with increasing distance to it, a value decreasing down to zero. Usually the "sombrero hat function" or the Gaussian function is used for this. With the Gaussian function, the neighbourhood function is:

h_i = e^{-\frac{|n - n_i|^2}{2\sigma^2}}    (2)

where n denotes the winner neuron, n_i any neuron on the Kohonen layer, and the standard deviation σ denotes the neighbourhood radius.

For every input vector the following update rule is applied to every neuron on the SOM:

\Delta w_{ik} = \eta \, h_i (x_k - w_{ik})    (3)

where η denotes the step size.

By this update rule, weights are updated in discrete steps, defined by the step size η. The nearer neurons are to the chosen winner neuron, the more they are affected by the update. Thereby neighbouring neurons represent similar clusters, which leads to a topological map.
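To make equations (1)-(3) concrete, the following minimal sketch (Python with NumPy; class and parameter names are illustrative, not taken from the original implementation) selects the winner according to equation (1), computes the Gaussian neighbourhood of equation (2) and applies the update rule of equation (3):

```python
import numpy as np

class SOM:
    """Minimal self organising map: a grid of neurons, each holding one weight vector."""

    def __init__(self, rows, cols, dim, step_size=0.5, sigma=1.5, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.uniform(0.0, 1.0, size=(rows * cols, dim))  # arbitrary initial weights
        # grid coordinate of every neuron, used for the distance |n - n_i| in equation (2)
        self.coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
        self.step_size = step_size   # step size in update rule (3)
        self.sigma = sigma           # neighbourhood radius in equation (2)

    def winner(self, x):
        """Equation (1): the neuron with the smallest (squared) Euclidean distance to x."""
        return int(np.argmin(np.sum((self.weights - np.asarray(x)) ** 2, axis=1)))

    def train_pattern(self, x):
        """Equations (2) and (3): pull every neuron towards x, scaled by the neighbourhood."""
        x = np.asarray(x, dtype=float)
        w = self.winner(x)
        grid_dist2 = np.sum((self.coords - self.coords[w]) ** 2, axis=1)
        h = np.exp(-grid_dist2 / (2.0 * self.sigma ** 2))              # Gaussian neighbourhood
        self.weights += self.step_size * h[:, None] * (x - self.weights)
        return w

    def train(self, patterns, epochs):
        """Offline learning: propagate all training patterns through the network for a fixed number of epochs."""
        for _ in range(epochs):
            for x in patterns:
                self.train_pattern(x)
```

A map trained this way can then be queried with winner(x) to obtain the cluster of an unseen input vector.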
The advantage of SOMs is that they are able to classify samples of an input space unsupervised. During the learning process, the map adapts its structure to the input data. Depending on the data, the SOM will build clusters and order them in an appropriate manner. One disadvantage of SOMs, however, is the need for a representative sample of the input space and for training over many epochs. After the SOM is trained, it is only possible to add a new cluster to the representation by repeating the learning process with the old training set and the new pattern.
3 Reinforcement Learning
Classical approaches for neural networks tend to make use of specific knowledge about states and their corresponding output. This given knowledge is used for a training set, and after the training it is expected that knowledge about unknown situations is gained by generalization. However, for many problems in the real world an appropriate training set can't be generated, since the "teacher" doesn't know the specific mapping. Nevertheless, it seems to be easy for the teacher to assess this mapping for every state. When learning to drive a car, for example, one is not told how to operate the car controls appropriately; the teacher, however, bridges the gap in learning by using appropriate feedback, which improves the learning process and finally leads to the desired mapping between states and actions.
The Reinforcement problem
The task of reinforcement learning is to use rewards to train an agent to perform successful functions. Figure 1 illustrates the typical interaction between agent and environment. The agent performs actions in its environment and receives a new state vector, caused by this action. Furthermore, the agent gets feedback on whether the action was adequate. This feedback is expressed by immediate rewards, which also depend on the new state taken by the agent. A chess playing agent, for example, would receive a maximum immediate reward if it reaches a state where the opponent cannot escape checkmate. This example also illustrates the credit assignment problem. The reward achieved in the last board position is achieved after a long chain of actions. Thus all actions done in the past are responsible for the final success and therefore also have to be rewarded. For this problem several approaches have been proposed; a good introduction to these is found in the book by Barto and Sutton (Barto & Sutton 1998). This paper, however, focuses on one of these approaches, which is the value iteration method, also known as TD(0).
Rewards
In reinforcement learning, the only hints given towards the successful task are immediate reinforcement signals (rewards here also include negative values, which are equal to punishments). These signals usually come directly from the environment or can be generated artificially by an assessment of the situation. If they are generated for a problem, they should be chosen economically. Instead of rewarding many sub-solutions of a problem, only the main goal should be rewarded. For example, for a chess playing agent it would not necessarily make sense to reward the taking of the opponent's pieces. The agent might find a strategy which optimises the collection of the opponent's pieces, but forgets about the importance of the king. Reinforcement learning aims to maximise the achieved reinforcement signals over a long period of time.

In some problems no terminal state can be expected, as in the case of a robot driving through a world of obstacles and learning not to collide with them. An accumulation of rewards would lead to an infinite sum. For the case where no terminal state is defined, we have to make use of a discount factor γ to ensure that the learning process will converge. This factor discounts rewards which might be expected in the future, and thus the accumulated reward can be computed as follows:

R_T = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{T} \gamma^k r_{t+k+1}    (4)

where R_T denotes the rewards achieved during many steps, γ the discount factor and r_t the reward at time t. For T = ∞ it has to be ensured that γ < 1.
The direct goal of reinforcement learning methods is to maximise R_T. To achieve this goal, however, a prediction of the expectation of future rewards is necessary. Therefore we need a mapping from states to their corresponding maximum expectation. As known from utility theory, this mapping is defined by the value function.
The value function V*(s)
In order to maximise rewards over time, it has to be known for every state what future rewards might be expected. The optimal value function V*(s) provides this knowledge with a value for every state. This return value is equal to the accumulation of maximum rewards over all successor states. Generally this function can be represented by a look-up table, where an entry is necessary for every state. This function is usually unknown and has to be learned by a reinforcement learning algorithm. One algorithm which updates this function successively is value iteration.
Value iteration
In contrast to other available methods, this method updates the value function after every seen state and thus is known as value iteration. This update can be imagined as an agent performing actions and using the received rewards, caused by these actions, to update the values of the former states. Since the optimal value function returns for every state the accumulation of future rewards, the update of a visited state s_t has to include the value of the successor state s_{t+1} as well. Thus the value function is learned with the following iterative equation:
V_{k+1}(s_t) := r(s_t, a_t) + \gamma V_k(s_{t+1})    (5)

where V_k and V_{k+1} denote the value function before and after the update, and r(s_t, a_t) refers to the immediate reinforcement achieved for the transition from state s_t to state s_{t+1} by the chosen action a_t. While applying this method, the value function is approximated more and more closely until it reaches its optimum. That means that predictions of future rewards become successively more precise and actions can be chosen with maximum future rewards.
There is an underlying assumption that the agent's actions are chosen in an optimal manner. In value iteration, the optimal choice of an action can be made with the greedy policy. This policy is, simply as its name implies, to choose the actions which lead to maximum rewards. For an agent this means choosing, from all possible actions a ∈ A, the one which returns the maximum expectation according to equation (5). However, we can see that for equation (5) the successor state s_{t+1}, caused by action a_t, must be known. Thus a model of the environment is necessary, which provides for state s_t and action a_t the successor state s_{t+1}:

s_{t+1} = f(s_t, a_t)    (6)

Exploration
If all actions are chosen according to the greedy policy, it might happen that the learning process results in a sub-optimal solution. This is because decisions are based only on the knowledge gathered so far. This knowledge, however, can lead to a locally optimal solution in the search space, from which globally optimal solutions can never be found. Therefore it makes sense to choose actions arbitrarily with a defined likelihood. The policy of choosing an action arbitrarily with a probability of ε is called the ε-greedy policy. Certainly there is a trade-off between exploration and exploitation of existing knowledge, and the optimal adjustment of this parameter depends on the problem domain.
Implementation of Value Iteration
So far, the algorithm can be summarised in the following steps:

- select the most promising action a_t according to the ε-greedy policy:
  a_t = \arg\max_{a \in A(s_t)} \left( r(s_t, a) + \gamma V_k(f(s_t, a)) \right)

- apply a_t in the environment: s_t ⟹ s_{t+1}

- adapt the value function for state s_t:
  V_{k+1}(s_t) := r(s_t, a_t) + \gamma V_k(s_{t+1})
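These steps can be written down directly for a tabular value function. The following sketch (Python; the environment model f, the reward function r, the action set and the state representation are placeholders supplied by the caller) chooses actions ε-greedily and applies update rule (5) online:

```python
import random

def value_iteration_episode(actions, f, r, V, start, is_terminal,
                            gamma=1.0, epsilon=0.1, max_steps=1000):
    """One episode of online value iteration with an epsilon-greedy policy.

    actions(s) -> list of admissible actions in state s
    f(s, a)    -> successor state (the environment model, equation (6))
    r(s, a)    -> immediate reinforcement for taking a in s
    V          -> dict mapping states to their current value estimate (the look-up table)
    """
    s = start
    for _ in range(max_steps):
        if is_terminal(s):
            break
        if random.random() < epsilon:                      # explore with probability epsilon
            a = random.choice(actions(s))
        else:                                              # otherwise act greedily
            a = max(actions(s), key=lambda a: r(s, a) + gamma * V.get(f(s, a), 0.0))
        s_next = f(s, a)
        V[s] = r(s, a) + gamma * V.get(s_next, 0.0)        # update rule (5)
        s = s_next
    return V
```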
In theory, this algorithm will definitely evaluate an optimal solution for problems such as those defined at the beginning of this section. A problem for reinforcement learning, however, is its application to real world situations, because real world situations usually involve huge state spaces. The value function should provide every state with an appropriate value, but most real world problems come with a multi-dimensional state vector. The state of a robot whose task is to avoid obstacles, for example, is defined by its sensor values. If every sensor had a possible return value of 10 bits and the robot itself owned eight of these sensors, the state space would consist of 1.2 × 10^24 different states, emphasizing the problem of tractability in inferencing.

On the other hand, it might happen that during a real experiment with limited time, all states can never be visited. Thus it is likely that even after a long training time, still unknown states are visited. But unfortunately the value function can't provide a prediction for them.
4 Modified SOM to learn the value function
The two problems previously identified for reinforcement learning can be solved using function approximators. Neural networks, in particular, provide the benefit of compressing the input space, and furthermore the learned knowledge can be generalised. For the value function this means that similar states will be evaluated by one neuron. Hence unknown states can also be generalized and evaluated by the policy. For this purpose the previously introduced model of self organising maps has been taken and modified.
Modification to the architecture
Usually SOMs are used for classification of input spaces, for which no output vector is necessary. To make use of SOMs as function approximators, it is necessary to extend the model by an output value. Such modifications were first introduced by Ritter and Schulten in connection with reflex maps for complex robot movements (Ritter & Schulten 1987). The modification used here is that every neuron of the Kohonen layer is expanded by one weight, which connects it to the scalar output. This is done to obtain a generalisation over similar situations. To achieve this, the output weights have to be trained with a neighbourhood function as well. Therefore the output weights are adapted with the following rule:

\Delta w_i = \eta_2 \, h_i (y - w_i)    (7)

where η_2 is a second step size parameter, h_i the same neighbourhood function as used for the input weights, and y the desired output of the network.
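Building on the SOM sketch given in Section 2 (illustrative Python, not the authors' code), the extension can be expressed by giving every neuron one additional output weight, read out via the winner neuron and adapted with rule (7):

```python
import numpy as np

class ValueSOM(SOM):
    """SOM extended by one scalar output weight per neuron, used here as a value estimate."""

    def __init__(self, rows, cols, dim, out_step_size=0.3, **kwargs):
        super().__init__(rows, cols, dim, **kwargs)
        self.out_weights = np.zeros(rows * cols)   # one additional output weight per neuron
        self.out_step_size = out_step_size         # the second step size parameter in rule (7)

    def value(self, x):
        """Read out the network: the output weight of the winner neuron for input x."""
        return float(self.out_weights[self.winner(x)])

    def learn_value(self, x, y):
        """Rule (7): pull the output weights towards the target y, weighted by the neighbourhood."""
        w = self.winner(x)
        grid_dist2 = np.sum((self.coords - self.coords[w]) ** 2, axis=1)
        h = np.exp(-grid_dist2 / (2.0 * self.sigma ** 2))
        self.out_weights += self.out_step_size * h * (y - self.out_weights)
```

Because the neighbourhood function also moves the output weights of neighbouring neurons, similar situations receive similar value estimates.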
Modification to the algorithm
As remarked previously, the learning algorithm for SOMs is supposed to be applied "offline" with a specific training set. The application of value iteration, however, is an "online" process, where the knowledge increases iteratively. To solve this contradiction, the learning process of the SOM has been divided into two steps:

- First step: pre-classification of the environment

- Second step: execution of reinforcement learning with improvement of the classification for visited states

For the first step a representative sample of the whole state space is necessary to build an appropriate map of the environment. This sample is trained until the structure of the SOM is adequate to classify states of the problem's state space. During the execution of the second step the reinforcement learning algorithm updates states with their appropriate values. These states are classified by the SOM, where one neuron is chosen as winner. The corresponding output weight of this neuron is changed towards the value calculated by the reinforcement learning algorithm; the output weights of neighbouring neurons are modified as well to achieve the effect of generalisation.

Usually the states necessary to solve the problem are a subset of the whole state space. Thus the SOM has to classify only this subset, using a pre-classification. During the application of reinforcement learning, this classification will improve, since for every state visited, its representation is strengthened. States which are visited more frequently, and thus are more important for the solution of the problem, will achieve a better representation than those unimportant states which are visited less.
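Putting the two steps together, and reusing the ValueSOM and ε-greedy sketches from above, the modified training procedure might look roughly as follows (the arguments f, r, actions and the state sampling are placeholders, not part of the original implementation):

```python
import random

def train_modified_som(som, sample_states, pre_epochs, episodes, actions, f, r,
                       start, is_terminal, gamma=1.0, epsilon=0.01, max_steps=50):
    """Two-step training of the modified SOM as value function approximator."""
    # Step 1: pre-classification of the environment on a representative sample
    som.train(sample_states, epochs=pre_epochs)

    # Step 2: value iteration; the SOM replaces the look-up table
    for _ in range(episodes):
        s = start()
        for _ in range(max_steps):
            if is_terminal(s):
                break
            if random.random() < epsilon:
                a = random.choice(actions(s))
            else:
                a = max(actions(s), key=lambda a: r(s, a) + gamma * som.value(f(s, a)))
            s_next = f(s, a)
            target = r(s, a) + gamma * som.value(s_next)   # update rule (5), with the SOM as V
            som.learn_value(s, target)                     # rule (7) adapts the output weights
            som.train_pattern(s)                           # visited states strengthen their own classification
            s = s_next
    return som
```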
5 Experiments and results
5.1 The path-planning problem
This section describes the application of our modified SOM with reinforcement learning for solving the path planning problem. The problem is to find the shortest path through a maze or simply a path on a map. For the experiment described here, a computer simulation of a "gridworld" has been used (see Figure 2). The gridworld is represented by a two-dimensional arrangement of positions. Wall pieces or obstacles can occupy these positions, and the agent therefore can't cross them. Other positions, however, are free for its discovery. For the experiment, the upper left corner is defined as start position and the lower right corner as end position. The agent's task is to find the shortest path between these two positions, while avoiding obstacles on its way.
Due to the fact that the agent is supposed to learn the "cheapest" path, it is punished with -1 for every move and rewarded with 0 if it reaches the goal. Beside these reinforcement signals, the agent gets no other information about which direction should be preferred. If it faces an obstacle, the possible actions are reduced to those actions which lead to the free positions around it.

Figure 2: The gridworld experiment

Figure 3: Achieved rewards during learning of a behaviour for the gridworld experiment (convergence of the different implementations: SOM 10x10, SOM 8x8 and look-up table; reinforcement per epoch)
Two implementations of a modified SOM, with 8x8 neurons and with 10x10 neurons, have been used. For comparison, the experiment has also been carried out with a look-up table, where every entry represents a state. This look-up table consists of 289 entries, one for every possible position of the gridworld.
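As an illustration of this setup (a hypothetical sketch; the exact maze layout of Figure 2 is not reproduced), the gridworld can be expressed as an environment model f, a reward function r and a set of admissible actions for the value iteration sketch above:

```python
def make_gridworld(width, height, obstacles, goal):
    """Environment model f, reward r and admissible actions for the gridworld."""
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def free(pos):
        x, y = pos
        return 0 <= x < width and 0 <= y < height and pos not in obstacles

    def actions(pos):
        # only moves that lead to free positions around the agent are allowed
        return [a for a, (dx, dy) in moves.items() if free((pos[0] + dx, pos[1] + dy))]

    def f(pos, a):
        dx, dy = moves[a]
        return (pos[0] + dx, pos[1] + dy)

    def r(pos, a):
        # every move is punished with -1; reaching the goal is rewarded with 0
        return 0.0 if f(pos, a) == goal else -1.0

    return actions, f, r

# Example wiring with the value iteration sketch above (sizes and layout are hypothetical):
# actions, f, r = make_gridworld(10, 10, obstacles={(3, 3), (3, 4)}, goal=(9, 9))
# V = value_iteration_episode(actions, f, r, V={}, start=(0, 0), is_terminal=lambda s: s == (9, 9))
```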
Results
The result of this experiment is shown in Figure 3. In this graph the achieved rewards of each implementation after every episode can be seen. The optimal path is found if the accumulated reinforcement during one episode is -53, since the agent needs at least 53 steps to reach its goal. In the graph it can be seen that the implementation of the modified SOM with 10x10 neurons leads to a faster result than the look-up table. After 30 episodes the agent equipped with the modified SOM found the cheapest path.
5.2 Learning obstacle avoidance with a robot
A common problem in robotics is the autonomous drive of a robot. For such a drive there are various processes. One process might bring the robot to a far destination, led by a path finding algorithm. For simple movement, however, a process is necessary to avoid obstacles. In this problem, it is very difficult to define appropriate actions for particular situations. On the other hand, we can easily assess the resulting actions. Therefore this problem seems to be appropriate for the reinforcement learning approach.

In this experiment the autonomous miniature robot Khepera, which was developed at the EPFL in Lausanne, has been used (see Figure 4). This robot, only 5 cm in size, is equipped with eight proximity sensors, where two are mounted at the front, two at the back, two at the sides and two at 45° to the front. These sensors give a return value between 0 and 1024, which corresponds to a range of about 5 cm. The robot's drive consists of two servo motors, which can turn the two wheels with 2 m per second in negative and positive directions. By this configuration, the robot is able to do 360° rotations on the spot.

Figure 4: Autonomous robot Khepera

Hence the robot is very manoeuvrable and should be able to deal with most situations. Furthermore, the robot is equipped with two rechargeable batteries, which enable it to drive autonomously for about 20 minutes. For the execution of programs, there also exists a CPU from Motorola and a RAM area of 512 KB on the robot.
Experiment
Due to the fact that for value iteration a model of the environment is required, the robot was first trained using a computer simulation. Afterwards the experiment continued on a normal office desk, where obstacles and walls were built up with wooden blocks.

In the reinforcement learning algorithm, the state of the robot was represented by the eight sensor values. The allowed actions were reduced to the three actions left turn, right turn and straight forward. Also the reinforcement signals were kept simple: for every collision the robot gets a punishment of -1, otherwise a reward of 0. The experiment has been carried out over multiple episodes. One episode has been limited to 50 steps; therefore the discount factor γ has been set to 1.0. For exploration purposes the factor ε has been adjusted to 0.01, which is equal to the probability that actions are chosen arbitrarily. Corresponding to the state vector, the input vector of the SOM consists of eight elements as well. For the Kohonen layer an arrangement of 30x30 neurons has been chosen.

Before the application of the reinforcement learning algorithm, the SOM had to be pre-classified. Therefore a training set of typical situations from an obstacle world has been trained over 90 epochs. With the help of visualisation tools it could be ensured that the situations are adequately classified, as illustrated in Figure 5.

Figure 5: A learned classification of the sensor space

During the episodes of the value iteration method, identified situations were relearned with a small neighbourhood of σ = 0.1 and also a small learning step rate of η = 0.3.
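In terms of the sketches given earlier, this experiment corresponds roughly to the following configuration (illustrative code; the simulated sensor model, the collision reward and the sample of typical situations are stand-ins for the actual Khepera simulation, and sensor readings are assumed normalised to [0, 1]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the simulation; a real model of the Khepera and its eight proximity sensors
# would replace these stubs.
ACTIONS = ["left", "right", "forward"]

def simulate_step(s, a):
    return tuple(np.clip(np.asarray(s) + rng.normal(0.0, 0.05, 8), 0.0, 1.0))

def collision_penalty(s, a):
    return -1.0 if max(s) > 0.95 else 0.0    # punish situations with an obstacle very close

def random_start():
    return tuple(rng.uniform(0.0, 0.3, 8))

typical_situations = [rng.uniform(0.0, 1.0, 8) for _ in range(500)]

# Step 1: pre-classification of the sensor space (30x30 Kohonen layer, 90 epochs)
som = ValueSOM(rows=30, cols=30, dim=8, step_size=0.5, sigma=3.0, out_step_size=0.3)
som.train(typical_situations, epochs=90)

# Step 2: value iteration with gamma = 1.0, epsilon = 0.01 and episodes of at most 50 steps;
# visited situations are relearned with a small neighbourhood and a small learning step rate
som.sigma, som.step_size = 0.1, 0.3
train_modified_som(som, typical_situations, pre_epochs=0,   # pre-classification already done above
                   episodes=100, actions=lambda s: ACTIONS,
                   f=simulate_step, r=collision_penalty,
                   start=random_start, is_terminal=lambda s: False,
                   gamma=1.0, epsilon=0.01, max_steps=50)
```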
Results

Figure 6: Collisions during the autonomous learning of an obstacle avoidance strategy (accumulated reinforcement per episode)

The result of this experiment can be seen in Figure 6. In this graph the accumulated rewards for every episode are shown. Since for every collision the robot has been punished with -1, the reinforcement for every episode is equal to the number of caused collisions. After 45 episodes the number of collisions became significantly lower. During the early episodes, the value of the achieved reinforcement signals sways strongly. This results from the fact that after the robot overcame one situation, it encountered a new situation again, where another behaviour had to be learned as well. As we see in the graph, the robot learned to manage most situations after a sufficient time of proceeding. After the training, the learned control has been tested on the real robot. Although the achieved behaviour was not elegant, it proved that the robot obviously learned the ability to avoid obstacles.
6 Conclusion
The problem of huge state spaces in real world tasks, and the problem of unknown states that can be encountered, have been tackled by use of a modified SOM. The SOM's abilities to compress the input space and to generalize from known situations to unknown ones made it possible to achieve a reasonable solution. However, it was necessary to split the standard algorithm for SOMs into two parts: first, learning by a pre-classification, and second, learning during value iteration. With this modification, the algorithm can be applied to an online learning process, which is given by the value iteration method.
The applied modifications to the standard SOM have been evaluated on the gridworld example and on the problem of obstacle avoidance for an autonomous robot. These experiments showed two main advantages of the modified SOM over a standard implementation with a look-up table. First, the example of obstacle avoidance proved that even for enormous state spaces a strategy can be learned. Second, the path finding example showed that the use of a modified SOM can lead to faster results, since the agent is able to generalize over situations instead of learning a value for each of them.
For the value iteration algorithm applied to the experiments described here, a model of the environment is necessary. For real world problems, such as the problem of obstacle avoidance, however, an appropriate model can hardly be provided. Sensor signals are normally noisy, or it might even be that a sensor is damaged or doesn't work properly. Thus it is recommendable to use another reinforcement learning implementation which no longer requires a model of the environment. One commonly used variant of reinforcement learning is Q-learning. In this algorithm values are assigned to tuples of state and action, thus a model of the environment is not required. We believe that a SOM, as proposed in this paper, would yield better results there as well.
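For illustration, the standard tabular Q-learning update (not part of the implementation described in this paper) shows why no model is needed: the update uses only the observed successor state, not the model f(s, a).

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s, a) is moved towards reward + gamma * max_a' Q(s_next, a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q
```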
One of the big disadvantages encountered, however, is that the modification of the SOM during the second step changes the generalisation behaviour of the network. If states are relearned frequently with a small neighbourhood function, the learned knowledge becomes too specific and generalisation is reduced. To overcome this problem, it would be necessary to relearn the complete structure of the SOM with an offline algorithm. Unfortunately, experience in terms of training examples is lost after the online process. A possible solution, and probably the subject of further work, is to store "critical" patterns temporarily during the online process by dynamic neurons. By "critical" patterns we mean those which come with a far Euclidean distance to all existing neurons in the network and whose classification by these neurons would thus not be appropriate. Given the set of these dynamically allocated neurons and the set of neurons on the SOM, a new arrangement with a better topological representation can be trained by an offline algorithm. The execution of this offline algorithm can be done during a phase of no input to the learner and is motivated by the human post-processing of information known as the REM phase.
References

Barto, A. & Crites, R. (1996), Improving elevator performance using reinforcement learning, in M. C. Hasselmo, M. C. Mozer & D. S. Touretzky, eds, 'Advances in Neural Information Processing Systems', Vol. 8.

Barto, A. & Sutton, R. (1998), Reinforcement Learning - An Introduction, MIT Press, Cambridge.

Bellman, R. E. (1957), Dynamic Programming, Princeton University Press, Princeton.

Boyan, A. J. & Moore, A. W. (1995), Generalization in Reinforcement Learning: Safely Approximating the Value Function, in T. K. Leen, G. Tesauro & D. S. Touretzky, eds, 'Advances in Neural Information Processing Systems', Vol. 7, MIT Press, Cambridge MA.

Gordon, G. (1995), Stable function approximation in dynamic programming, in 'Proceedings of the 12th International Conference on Machine Learning', Morgan Kaufmann, San Francisco, Calif., pp. 261-268.

Kohonen, T. (1982), Self-Organized Formation of Topologically Correct Feature Maps, in 'Biol. Cybernetics', Vol. 43, pp. 59-69.

Ritter, H. & Schulten, K. (1987), Extending Kohonen's Self-Organizing Mapping Algorithm to Learn Ballistic Movements, in 'Neural Computers', Springer Verlag, Heidelberg, pp. 393-406.

Sutton, R. (1988), Learning to predict by the methods of temporal differences, in 'Machine Learning', Vol. 3, pp. 9-44.

Tesauro, G. (1992), Practical issues in temporal difference learning, in 'Machine Learning', Vol. 8, pp. 257-277.

Thrun, S. (1996), Explanation-based neural network learning: A lifelong learning approach, Kluwer Academic Publishers.