
Complex Systems 1 (1987) 995-1019

A Mean Field Theory Learning Algorithm for Neural Networks

Carsten Peterson and James R. Anderson

Microelectronics and Computer Technology Corporation, 3500 West Balcones Center Drive, Austin, TX 78759-6509, USA

Abstract. Based on the Boltzmann Machine concept, we derive a learning algorithm in which time-consuming stochastic measurements of correlations are replaced by solutions to deterministic mean field theory equations. The method is applied to the XOR (exclusive-or), encoder, and line symmetry problems with substantial success. We observe speedup factors ranging from 10 to 30 for these applications and a significantly better learning performance in general.

1. Motivation and results

1.1 Background

Neural Network models are presently subject to intense studies [1,2,7,10]. Most attention is being paid to pattern completion problems. Network architectures and learning algorithms are here the dominating themes. Common ingredients of all models are a set of binary valued neurons $S_i = \pm 1$ which are interconnected with synaptic strengths $T_{ij}$, where $T_{ij}$ represents the strength of the connection between the output of the $i$th neuron and the input of the $j$th neuron and $T_{ii} = 0$. In typical applications, a subset of the neurons are designated as inputs and the remainder are used to indicate the output.

By clamping the neurons to certain patterns, $S_i = S_i^a$, the synaptic strengths adapt according to different learning algorithms. For patterns with first-order internal constraints, one has the Hebb rule [4], where for each pattern $a$ the synapses are modified according to

$$\delta T_{ij} \propto \langle S_i S_j \rangle \qquad (1.1)$$

where $\langle \rangle$ denotes a time average.

In the case in which one has higher-order constraints, as in parity patterns, the situation is more complicated. Extra, so-called hidden units are then needed to capture or to build an internal representation of the pattern. In this case, equation (1.1) is not adequate; for the different patterns,


the hidden units have no particular values. For this reason, more elaborate learning algorithms have been developed. Most popular and powerful are the Back-propagation Scheme [10] and the Boltzmann Machine (BM) [1]. The latter determines $T_{ij}$ for a given set of patterns by a global search over a large solution space. With its simulated annealing [8] relaxation technique, BM is particularly well suited for avoiding local minima. This feature, on the other hand, makes BM time consuming; not only does the stochastic annealing algorithm involve measurements at many successive temperatures, but the measurements themselves require many sweeps. Developments or approximations that would speed up this algorithm are in demand.

1.2 Objectives

In this work, we define and apply a mean field theory (MFT) approximation to the statistical mechanics system that is defined by the BM algorithm. The nondeterministic nature of the latter is then replaced by a set of deterministic equations. At each temperature, the solutions of these equations represent the average values of corresponding quantities computed from extensive (and expensive) sampling in the BM. It is obvious that if this approximation turns out to be a good one, substantial CPU time savings are possible. Also, these mean field theory equations are inherently parallel. Thus, simulations can take immediate and full advantage of a parallel processor.

1.3 Results

We develop and apply the MFT approximation for the Boltzmann Machine.

This approximation is only strictly valid in the limit of infinite numbers of degrees of freedom. The systematic errors that occur when applying it to finite system sizes can be controlled and essentially canceled out in our applications. We find, when applying the method to the XOR [2], encoder [1], and line symmetry [2] problems, that we gain a factor 10-30 in computing time with respect to the original Boltzmann Machine. This means that for these problems, the learning times are of the same order of magnitude as in the Back-propagation approach. In contrast to the latter, it also allows for a more general network architecture and it naturally parallelizes. Furthermore, it in general gives rise to a higher learning quality than the Boltzmann Machine. This feature arises because the latter requires an unrealistically large number of samples for a reliable performance.

This paper is organized as follows. In section 2, we review the basics of the Boltzmann Machine. A derivation and evaluation of the mean field theory approximation can be found in section 3, and its applications to the problems mentioned above are covered in section 4. Finally, section 5 contains a very brief summary and outlook.

2. The Boltzmann Machine revisited

The Boltzmann Machine is based on the Hopfield energy function [6]

$$E(\vec{S}) = -\frac{1}{2} \sum_{i,j=1}^{N} T_{ij} S_i S_j + \sum_i I_i S_i \qquad (2.1)$$

where the $I_i$ are the neuron thresholds.¹ The last term in equation (2.1) can be eliminated by introducing an extra neuron $S_0$ which is permanently in a $+1$ state with $T_{0i} = T_{i0} = -I_i$. The energy then takes the simpler form

$$E(\vec{S}) = -\frac{1}{2} \sum_{i,j=0}^{N} T_{ij} S_i S_j. \qquad (2.2)$$

In a Hopfield network, learning takes place with equation (1.1), which corresponds to differentiating equation (2.2) with respect to $T_{ij}$. With a given set of $T_{ij}$ and a particular starting configuration $\vec{S}^a$, the system relaxes to a local energy minimum with the step function updating rule

$$S_i = \begin{cases} +1 & \text{if } \sum_j T_{ij} S_j > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (2.3)$$

which follows from differentiating equation (2.2) with respect to $S_i$ along with the fact that $T_{ij} = T_{ji}$ and $T_{ii} = 0$.

As mentioned in the introduction, equation (1.1) is not appropriate when hidden units are included, since their values for different patterns are unknown. In the Boltzmann Machine, the strategy is to determine the hidden unit values for a given set of patterns by looking for a global minimum to

$$E(\vec{S}) = -\frac{1}{2} \sum_{i,j=0}^{N+h} T_{ij} S_i S_j \qquad (2.4)$$

where $h$ is the number of hidden units. The simulated annealing technique [8] is used to avoid local minima.

The major steps in the BM are the following:

1. Clamping Phase. The input and output units are clamped to the corresponding values of the pattern to be learned, and for a sequence of decreasing temperatures $T_n, T_{n-1}, \ldots, T_0$, the network of equation (2.4) is allowed to relax according to the Boltzmann distribution

$$P(\vec{S}) \propto e^{-E(\vec{S})/T} \qquad (2.5)$$

where $P(\vec{S})$ denotes the probability that the state $\vec{S}$ will occur given the temperature $T$. Typically, the initial state $\vec{S}$ of the network is chosen at random. At each temperature, the network relaxes for an amount of time² determined by an annealing schedule. At $T = T_0$, statistics are collected for the correlations

$$p_{ij} = \langle S_i S_j \rangle. \qquad (2.6)$$

¹Throughout this paper, the notation $\vec{S} = (S_1, \ldots, S_i, \ldots, S_N)$ is used to describe a state of the network.

²We define time in terms of sweeps of the network. A sweep consists of allowing each unclamped unit to update its value once.

The relaxation at each temperature is performed by updating unclamped units according to the heat bath algorithm [1]

$$P(S_i = +1) = \frac{1}{1 + e^{-2 \sum_j T_{ij} S_j / T}}. \qquad (2.7)$$

2. Free-running phase. The same procedure as in step 1, but this time only the input units are clamped. Again, correlations

$$p'_{ij} = \langle S_i S_j \rangle \qquad (2.8)$$

are measured at $T = T_0$.

3. Updating. After each pattern has been processed through steps 1 and 2, the weights are updated according to

$$\Delta T_{ij} = \eta \left( p_{ij} - p'_{ij} \right) \qquad (2.9)$$

where $\eta$ is a learning parameter.

Steps 1, 2, and 3 are then repeated until no more changes in $T_{ij}$ take place.
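As an illustration only (not the authors' code), the three steps can be organized as in the following sketch. The weight matrix `T`, the (sweeps, temperature) schedule format, the pattern layout (inputs first, outputs last), and all helper names are assumptions of this sketch; the heat bath rule is written in the ±1 form used throughout the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def heat_bath_sweep(S, T, temp, clamped):
    """One sweep: update every unclamped unit with the heat bath rule (eq. (2.7))."""
    for i in range(len(S)):
        if i in clamped:
            continue
        field = T[i] @ S                                   # sum_j T_ij S_j
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * field / temp))
        S[i] = 1 if rng.random() < p_plus else -1

def run_phase(T, S, clamped, schedule, n_stat):
    """Anneal through (n_sweeps, temp) pairs, then estimate <S_i S_j> at the final temperature."""
    for n_sweeps, temp in schedule:
        for _ in range(n_sweeps):
            heat_bath_sweep(S, T, temp, clamped)
    corr = np.zeros((len(S), len(S)))
    for _ in range(n_stat):
        heat_bath_sweep(S, T, schedule[-1][1], clamped)
        corr += np.outer(S, S)
    return corr / n_stat                                   # correlations, eqs. (2.6) and (2.8)

def bm_learning_pass(T, patterns, schedule, n_stat, eta, n_in, n_out):
    """Steps 1-3 for one pass over the training patterns; update after each pattern (eq. (2.9))."""
    N = T.shape[0]
    inputs = set(range(n_in))
    outputs = set(range(N - n_out, N))
    for x, y in patterns:                                  # x, y are +/-1 arrays
        S = rng.choice([-1, 1], size=N)
        S[:n_in], S[N - n_out:] = x, y
        p = run_phase(T, S.copy(), inputs | outputs, schedule, n_stat)    # clamped phase
        S[:n_in] = x
        p_free = run_phase(T, S.copy(), inputs, schedule, n_stat)         # free-running phase
        T = T + eta * (p - p_free)
        np.fill_diagonal(T, 0.0)                           # keep T_ii = 0
    return T
```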

If the updating performed by equation (2.7) in steps 1 and 2 is instead performed with the step function updating rule in equation (2.3), the system is likely to get caught in a local minimum, which could give rise to erroneous learning. With the annealing prescription, on the other hand, the global minimum is more likely to be reached.

Before moving on to the mean field theory treatment of the annealing process in the Boltzmann Machine, we will make two important comments and clarifications on the learning process described above.

Annealing Schedule. The efficiency of the hill-climbing property of equations (2.5, 2.7) depends not only on the temperatures $T$ used; it is rather the ratios $E(\vec{S})/T$ that set the fluctuation scales (i.e., the likelihood of uphill moves). In the Boltzmann Machine, the same annealing schedule is normally used for the entire learning process. This rigidity does not fully exploit the virtue of the algorithm. The reason for this is that the energy changes as learning takes place, since the $T_{ij}$'s in equation (2.4) are changing.³ Hence, the annealing schedule $T_n, T_{n-1}, \ldots, T_0$ should be adjusted in an adaptive manner during the learning phase. It turns out that in our applications, the effects from such a fine-tuning are negligible.

³Typically, $T_{ij}$'s are initialized to small random values. Thus, as learning takes place, the $T_{ij}$'s grow in magnitude.


Correlations. Our description of the BM learning algorithm above differs from the original [1] and subsequent works [2] on one subtle but important point. The learning is accomplished by measuring correlations $p_{ij}$ (see equation (2.6)) rather than co-occurrences $P_{ij}$. In the latter case, one only assigns positive increments to $P_{ij}$ when either (a) both of the units $i$ and $j$ are on at the same time [1], or (b) both are identical [2]. By expanding these co-occurrence measurements to correlations, one also captures negative increments, i.e., one assigns negative correlations in situations where two units are anticorrelated. (Note that the correlations $p_{ij}$ and $p'_{ij}$ are not probabilities since they range from $-1$ to $+1$.) This generalization improves the learning properties of the algorithm, as indicated in reference [2]. The correlation measure has the effect of doubling the value of $\Delta T_{ij}$ that would be produced by equation (2.9) using the co-occurrence measure instead of the correlations, as in reference [2].⁴ This effectively doubles the learning rate $\eta$.

⁴Note that $p_{ij} = P_{ij} - q_{ij}$ where $q_{ij}$ is a measure of the anticorrelated states and $q_{ij} = 1 - P_{ij}$. Then, $p_{ij} - p'_{ij} = 2(P_{ij} - P'_{ij})$.

3. The mean field theory equations

3.1 Derivations

The statistical weight (discrete probability) for a state in a particular configuration $\vec{S} = (S_1, \ldots, S_i, \ldots, S_N)$ at a temperature $T$ is given by the Boltzmann distribution (see equation (2.5)). From equation (2.5), one computes the average of a state dependent function $F(\vec{S})$ by

$$\langle F(\vec{S}) \rangle = \frac{1}{Z} \sum_{\vec{S}} F(\vec{S})\, e^{-E(\vec{S})/T} \qquad (3.1)$$

where $Z$ is the so-called partition function

$$Z = \sum_{\vec{S}} e^{-E(\vec{S})/T} \qquad (3.2)$$

and the summations $\sum_{\vec{S}}$ run over all possible neuron configurations.

It is clear that configurations with small values for $E(\vec{S})$ will dominate. The standard procedure to compute $\langle F(\vec{S}) \rangle$ in equation (3.1) is with Monte Carlo sampling techniques. Is there any way of estimating $\langle F(\vec{S}) \rangle$ along these lines without performing Monte Carlo simulations? It is certainly not fruitful to search for minima of $E(\vec{S})$ since then the $T$-dependence disappears and we are back to a local minima search problem. Instead, let us manipulate the summations in equations (3.1, 3.2).

A sum over $S = \pm 1$ can be replaced by an integral over continuous variables $U$ and $V$ as follows:

$$\sum_{S=\pm 1} f(S) = \sum_{S=\pm 1} \int_{-\infty}^{\infty} dV\, f(V)\, \delta(S - V). \qquad (3.3)$$

Using the $\delta$-function representation

$$\delta(x) = \frac{1}{2\pi i} \int_{-i\infty}^{i\infty} dy\, e^{xy} \qquad (3.4)$$

where the integral over $y$ runs along the imaginary axis, one obtains

$$\sum_{S=\pm 1} f(S) = \frac{1}{2\pi i} \sum_{S=\pm 1} \int_{-\infty}^{\infty} dV \int_{-i\infty}^{i\infty} dU\, f(V)\, e^{U(S-V)} = \frac{1}{\pi i} \int_{-\infty}^{\infty} dV \int_{-i\infty}^{i\infty} dU\, f(V)\, e^{-UV + \log(\cosh U)}. \qquad (3.5)$$

Generalizing to our case of $N$ neuron variables $S_j$ and letting $f(\vec{S}) = \exp(-E(\vec{S})/T)$, one obtains

$$Z = \sum_{\vec{S}} e^{-E(\vec{S})/T} = \sum_{S_1=\pm 1} \cdots \sum_{S_N=\pm 1} e^{-E(\vec{S})/T} = c \prod_i \int_{-\infty}^{\infty} dV_i \int_{-i\infty}^{i\infty} dU_i\, e^{-E'(\vec{V}, \vec{U}, T)} \qquad (3.6)$$

where $c$ is a normalization constant and the effective energy is given by

$$E'(\vec{V}, \vec{U}, T) = E(\vec{V})/T + \sum_i \left[ U_i V_i - \log(\cosh U_i) \right]. \qquad (3.7)$$

The saddle points of $Z$ are determined by the simultaneous stationarity of $E'(\vec{V}, \vec{U}, T)$ in both of the mean field variables $U_i$ and $V_i$:

$$\frac{\partial E'(\vec{V}, \vec{U}, T)}{\partial U_i} = V_i - \tanh U_i = 0 \qquad (3.8)$$

$$\frac{\partial E'(\vec{V}, \vec{U}, T)}{\partial V_i} = \frac{1}{T} \frac{\partial E(\vec{V})}{\partial V_i} + U_i = 0. \qquad (3.9)$$

Since $E'$ is real for real $V_i$ and $U_i$, the solutions to these equations are in general real.

For the neural network of equation (2.4) one gets from these equations

$$V_i = \tanh\left( \sum_j T_{ij} V_j / T \right) \qquad (3.10)$$

where the neuron variables $S_i$ have been replaced through equation (3.9) by the mean field variables $V_i$.

Thus, the non-zero temperature behavior of the network in equation (2.4) with the step function updating rule of equation (2.3) is emulated by a sigmoid updating rule (see figure 1). An important property of the effective energy function $E'(\vec{V}, \vec{U}, T)$ is that it has a smoother landscape than $E(\vec{S})$ due to the extra terms. Hence, the probability of getting stuck in a local minimum decreases.

Figure 1: Sigmoid gain functions of equation (3.10) for different temperatures $T$. The step function updating rule of equation (2.3) corresponds to $T \to 0$.

Algorithms based on equation (3.10) are, of course, still deterministic. Equation (3.10) can be solved iteratively:

$$V_i^{\text{new}} = \tanh\left( \sum_j T_{ij} V_j^{\text{old}} / T \right). \qquad (3.11)$$

One can use either local or global time steps, asynchronous or synchronous updating respectively. In most applications, the asynchronous method seems to be advantageous.
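A minimal sketch of this asynchronous iteration follows; the function name, the (sweeps, temperature) schedule format, and the convention that `V` is a float array holding the clamped ±1 values together with starting guesses for the free units are assumptions of this sketch.

```python
import numpy as np

def mft_anneal(T, V, clamped, schedule):
    """Relax V_i = tanh(sum_j T_ij V_j / temp) (eqs. (3.10), (3.11)) by asynchronous sweeps.
    `schedule` lists (n_sweeps, temperature) pairs; clamped units keep their +/-1 values."""
    free = [i for i in range(len(V)) if i not in clamped]
    for n_sweeps, temp in schedule:
        for _ in range(n_sweeps):
            for i in free:                     # asynchronous: freshly updated V_j are used
                V[i] = np.tanh(T[i] @ V / temp)
    return V
```

With the factorization of equation (3.15) below, the correlation estimates needed in equation (2.9) then follow directly from the solution, e.g. as `np.outer(V, V)`.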

Under appropriate existence and stability conditions (see e.g. [5], chapter 9), an equation of the form $f(x) = 0$ can be solved by numerical solution of

$$\frac{dx}{dt} = f(x). \qquad (3.12)$$

Solving equation (3.9) in this way for the neural network of equation (2.4), substituting for $V_j$ from equation (3.8), and making a change of variables $U_i \to U_i/T$, one gets

$$\frac{dU_i}{dt} = -U_i + \sum_j T_{ij} \tanh(U_j/T) \qquad (3.13)$$

which are identical to the RC equations for an electrical circuit of interconnected amplifiers and capacitors with capacitances $C$ and time constants $\tau$ set to one, and interconnection conductances $T_{ij}$. Similar equations were used in reference [7] to provide a neural network solution to the traveling salesman problem.
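For illustration, equation (3.13) can be integrated with a simple Euler step; the step size `dt` and the iteration count below are assumptions of this sketch, not values from the paper.

```python
import numpy as np

def mft_ode_relax(T, U, temp, dt=0.1, n_steps=500):
    """Euler integration of dU_i/dt = -U_i + sum_j T_ij tanh(U_j / temp)  (eq. (3.13))."""
    for _ in range(n_steps):
        U = U + dt * (-U + T @ np.tanh(U / temp))
    return np.tanh(U / temp)      # the mean field values V_i = tanh(U_i / temp), eq. (3.10)
```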


An alternate and more simplified derivation of equation (3.10) based on probabilistic concepts can be found in the appendix. The derivation above, however, has the nice feature of illuminating the fact that the stochastic hill-climbing property of non-zero temperature Monte Carlo can be cast into a deterministic procedure in a smoother energy landscape; rather than climbing steep hills, one takes them away. That this technique is a mean field theory approximation is clear from equations (A.8, A.10).

So far, we have computed $V_i = \langle S_i \rangle$. What we really need for the BM algorithm of equations (2.6-2.9) are the correlations $V_{ij} = \langle S_i S_j \rangle$. Again, these can be obtained by formal manipulations of the partition function along the same lines as above or with the probabilistic approach described in the appendix. One gets

$$V_{ij} = \frac{1}{2}\left[ \tanh\left( \sum_k T_{ik} V_{kj} / T \right) + \tanh\left( \sum_k T_{jk} V_{ki} / T \right) \right]. \qquad (3.14)$$

This set of equations can also be solved by the same iterative technique as used for $V_i$ in equation (3.11). One now has a system of $N \times N$ rather than $N$ equations. This fact, together with the experience that larger systems of equations in general tend to converge slower, has motivated us to make one further approximation in our application studies. We approximate $V_{ij}$ with the factorization:

$$V_{ij} = V_i V_j. \qquad (3.15)$$

3.2 Validity of the approximation

How good an approximation is the mean field theory expression of equation (3.10) together with the factorization assumption of equation (3.15) for our applications? The MFT derivation basically involves a replacement of a discrete sum with a continuous integral. Thus, the approximation should be exact for $N \to \infty$ where $N$ is the number of degrees of freedom. Let us investigate how good the approximation is when computing $p'_{ij}$ in equation (2.8) for the XOR problem. The XOR (exclusive-or) problem is the one of computing parity out of two binary digits. Thus, the patterns of the input-output mapping are⁵

00  0
01  1
10  1
11  0          (3.16)

where the first two columns are the input units and the third column is the output unit. As is well known, this problem requires the presence of hidden units (see figure 2) [10].

⁵Throughout this paper, we use ±1 in the calculations, rather than the 0,1 representation of patterns.


Figure 2: Neural network for the XOR problem with one layer of hidden units.

In the free phase of the Boltzmann Machine, the input units are clamped. When computing $p'_{ij}$, two different situations are encountered; one either computes $\langle S_i S_j \rangle$ or $\langle S_i \rangle S_j$, depending on whether $S_j$ is clamped or not. We have compared $\langle S_i \rangle$ with $V_i$ and $\langle S_i S_j \rangle$ with $V_{ij} = V_i V_j$ respectively for the free-running case with random choice of $T_{ij}$. In figure 3, we show the average values for the output unit $\langle S_{\text{out}} \rangle$ as a function of the number of sweeps used for measurements at the final annealing temperature $T = T_0$. Also shown is the mean field theory prediction, which is based on the same annealing schedule as for the Boltzmann Machine but with only one iteration for each temperature including $T = T_0$. Thus, $N_{\text{sweep}} = 1$ for the MFT value. For further details on annealing schedules, architectures, and $T_{ij}$ values, we refer to the figure caption.

Two conclusions stand out from this figure. One is that the mean field theory is a very good approximation even for relatively small systems (in this case, 5 dynamical units). The second point regards the behavior of the Boltzmann Machine as a function of $N_{\text{sweep}}$. One expects substantial fluctuations in measured quantities around expected values for small or moderate $N_{\text{sweep}}$, but with decreasing errors. That the errors are decreasing is evident from figure 3. However, the approach to asymptotia has systematic features rather than being random. The reason for this is that it takes a large number of sweeps to thermalize at $T = T_0$. From the figure, we estimate that O(100-1000) sweeps seems appropriate if one wants a performance compatible with the mean field theory approximation. In figure 4, we depict the same result for the hidden unit $S_1^h$. The same conclusion can be drawn for the correlation $\langle S_1^h S_{\text{out}} \rangle$ (see figure 5).

Figure 3: $\langle S_{\text{out}} \rangle$ and $V_{\text{out}}$ from the BM and MFT respectively as functions of $N_{\text{sweep}}$. A one-layer network with four hidden units was used as in [2]. Random values in the range [-2.5, 2.5] were used for $T_{ij}$. The annealing schedule used was $T = 50, 49, \ldots, 1$ with 10 sweeps/$T$ for BM and 1 sweep/$T$ for MFT. $N_{\text{sweep}}$ refers to time at the final temperature.

Figure 4: $\langle S_1^h \rangle$ and $V_1^h$ from the BM and MFT respectively as functions of $N_{\text{sweep}}$. For details on architecture, annealing schedule, and $T_{ij}$ values, see figure 3.

All averages and correlations show the same features as in the examples above. In figure 6, we summarize our findings by showing the average deviation $\Delta$ between the Boltzmann Machine statistics and the mean field theory results,

$$\Delta = \frac{1}{N_p} \sum_{i>j} \left| p'_{ij}(\mathrm{BM}) - p'_{ij}(\mathrm{MFT}) \right| \qquad (3.17)$$

where $N_p$ is the number of correlation pairs measured, again as a function of $N_{\text{sweep}}$. From this figure, it is clear that even for a large number of sweeps there is a small but systematic deviation between the Boltzmann Machine and the mean field theory. It is interesting to study how this discrepancy varies with the number of degrees of freedom (in our case, the number of hidden units, $n_H$). In figure 7, we show the average of $\Delta$ for 100 different random $T_{ij}$, $\langle \Delta \rangle$, as a function of the number of hidden units. It is clear that the discrepancy decreases with $n_H$. As discussed above, this phenomenon is expected.

In summary, we find the mean field theory approximation to be extremely good even for a relatively small number of degrees of freedom. As expected, the approximation improves with the number of degrees of freedom. Furthermore, using the Boltzmann Machine without allowing for ample thermalization might provide erroneous results.

Being convinced about the efficiency of the mean field approximation, we now move on to learning applications.


Figure 5: $\langle S_1^h S_{\text{out}} \rangle$ and $V_1^h V_{\text{out}}$ from the BM and MFT respectively as functions of $N_{\text{sweep}}$. For details on architecture, annealing schedule, and $T_{ij}$ values, see figure 3.

Figure 6: $\Delta$ as defined in equation (3.17) as a function of $N_{\text{sweep}}$. For details on architecture, annealing schedule, and $T_{ij}$ values, see figure 3.

Figure 7: $\langle \Delta \rangle$ ($\Delta$ defined in equation (3.17)) as a function of the number of hidden units $n_H$. Annealing schedules are as in figure 3. For the Boltzmann Machine $N_{\text{sweep}} = 1000$ was used. The statistics are based on 100 different random sets of $T_{ij}$.

4. Performance studies of the mean field theory algorithm

We have investigated the performance of the MFT algorithm in three different applications: the XOR [2], encoder [1], and line symmetry [2] problems. These problems, while small, exhibit various types of higher-order constraints that require the use of hidden units. The results of the MFT calculations are compared with the corresponding results from the BM simulations.

4.1 Annealing schedules and learning rates

Boltzmann Machine (BM). For all the applications described below, except as noted, the following annealing schedule and learning rate were used:

$$N_{\text{sweep}} @ T = 1@30,\ 2@25,\ 4@20,\ 8@15,\ 8@10,\ 8@5,\ 16@1,\ 16@0.5$$
$$\eta = 2 \qquad (4.1)$$

For the final temperature, $T = 0.5$, all 16 sweeps were used for gathering correlation statistics. This schedule, which is identical to the one used in [2], appears to provide good results for all three applications. Any attempt to reduce the annealing time leads to degradation of learning performance, and improving the performance with a longer annealing schedule results in longer learning times.


Mean field theory (MFT). A brief exploration of the iterative techniques of equation (3.11) produced good results with the following parameter choices:

$$N_{\text{sweep}} @ T = 1@30,\ 1@25,\ 1@20,\ 1@15,\ 1@10,\ 1@5,\ 1@1,\ 1@0.5$$
$$\eta = 1 \qquad (4.2)$$

Notice that this annealing schedule is almost a factor 8 faster than the Boltzmann Machine schedule.

For both the BM and MFT algorithms, except as noted below, the $T_{ij}$ are initialized with random values in the range $[-\eta, +\eta]$.
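Written out as plain data in the (sweeps, temperature) pair format assumed in the earlier sketches, the two schedules and learning rates of equations (4.1) and (4.2) read as follows; this encoding is an illustrative assumption, not a prescription from the paper.

```python
# Boltzmann Machine, eq. (4.1): 63 sweeps in total, learning rate eta = 2
BM_SCHEDULE = [(1, 30), (2, 25), (4, 20), (8, 15), (8, 10), (8, 5), (16, 1), (16, 0.5)]
BM_ETA = 2

# Mean field theory, eq. (4.2): 8 sweeps in total, learning rate eta = 1
MFT_SCHEDULE = [(1, 30), (1, 25), (1, 20), (1, 15), (1, 10), (1, 5), (1, 1), (1, 0.5)]
MFT_ETA = 1
```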

Let us move on to the different applications in more detail.

4.2 The XOR problem

This problem consists of the four patterns in equation (3.16), which exhaust the combinatorial possibilities. For both the BM and MFT algorithms, we confirm previous results (see reference [2]) that an architecture with at least four hidden units seems to be needed for a good performance. We use four hidden units with limited connectivity in order to facilitate comparisons with [2]; no active connections between two hidden units and between input and output units (see figure 2). However, it should be stressed that in contrast to feedforward algorithms like Back-propagation [10], the BM and MFT algorithms are fully capable of dealing with fully connected networks. Performance studies of different degrees of connectivity will be published elsewhere [9].

As a criterion for learning performance, one normally uses the percentage of patterns that are completed (i.e., correct output produced for a given input) during training (see e.g. reference [2]). This measure is inferior, at least for small problems, as it does not indicate what portion of the input space is being learned. Therefore, for the XOR and the encoder problems, the entire input space is tested for proper completion. Thus, the entire input space is presented during each learning cycle.
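A sketch of such a completion test: clamp the input units, relax the network, and compare the sign of the output unit(s) with the target. The function names and the reuse of `mft_anneal` from the earlier sketch are assumptions; a BM version would use the heat bath relaxation instead.

```python
import numpy as np

def completes(T, x, y, n_in, n_out, schedule):
    """True if, with the inputs clamped to x, the relaxed network reproduces the target y."""
    N = T.shape[0]
    V = np.zeros(N)
    V[:n_in] = x                                           # clamp the input units
    V = mft_anneal(T, V, clamped=set(range(n_in)), schedule=schedule)
    return bool(np.all(np.sign(V[N - n_out:]) == y))

def input_set_completed(T, patterns, n_in, n_out, schedule):
    """An input set counts as completed only if every pattern in it is completed."""
    return all(completes(T, x, y, n_in, n_out, schedule) for x, y in patterns)
```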

In figure 8, we show the percentage of completed input sets as a function of the number of learning cycles performed. (An input set consists of a collection of patterns that are presented during a single learning cycle. In the case of the XOR problem, an input set consists of the entire input space.) Each data point represents the percentage of the previous 25 learning cycles (100 patterns) in which the network correctly completed the entire input set. In all of the figures in this section, the curves presented are obtained by averaging the statistics from 100 different experiments. The significant feature of figure 8 is that MFT demonstrates a higher quality of learning than BM. This does not appear to be simply a matter of MFT learning faster. For the XOR, as well as the other experiments in this section, the quality of learning exhibited by MFT seems to be asymptotically better than BM. This has been attributed to errors in the estimates of $\langle S_i S_j \rangle$ by the BM algorithm due to the large number of sweeps at $T_0$ that are required to obtain accurate estimates (see section 3.2).

Figure 8: Percentage of completed input sets for the XOR problem as a function of learning cycles for the BM and MFT algorithms. For further details, see section 4.2.

The curves shown in figure 8 do not take into account the difference in annealing schedules between MFT and BM. To get a better idea of the computing performance improvement offered by MFT, we show in figure 9 the percentage of completed input sets as a function of the number of sweeps performed.⁶ If we consider BM to (nearly) reach its final performance value at approximately $5 \times 10^4$ sweeps while MFT does so at approximately $0.4 \times 10^4$, we can consider the MFT algorithm to achieve a factor of 10 to 15 improvement in execution time. Based on these curves, this appears to be a conservative claim.

A final evaluation of MFT performance is based on the notion of an experiment having completely learned the input space. Such a notion requires definition of a learning criterion. We consider the input space to be completely learned if the input set is correctly completed for 75 successive cycles (300 patterns). In figure 10, we show the percentage of experiments that completely learn the input space as a function of the number of sweeps. From these curves we see that MFT completely learns the XOR input space both faster and with a higher success rate than BM. Based on the results of the encoder problem (see figures 11 through 14), we expect that if the XOR experiments had been run for a longer number of learning cycles, the percentage completely learned by BM would approach the percentage of input sets completed by BM shown in figure 9.

⁶Each sweep in both MFT and BM consists of updating each unclamped unit once during both the clamped and free-running phases. We consider an MFT sweep to be equivalent to a BM sweep as both involve the same number of updates. However, in practice, a BM sweep takes longer than an MFT sweep; both require evaluation of a similar function, but BM requires in addition a random number generation (for the update) and collection of statistics (estimation of $p_{ij}$ and $p'_{ij}$).

Figure 9: Percentage of completed input sets for the XOR problem as a function of $N_{\text{sweep}}$. The data is the same as in figure 8.

Figure 10: Percentage of XOR experiments that completely learned the input space as a function of $N_{\text{sweep}}$. For details of the learning criteria, see section 4.2.

4.3 The encoder problem

The encoder problem (see reference [1]) consists of the input-output mapping

1000   1000
0100   0100
0010   0010
0001   0001          (4.3)

In its most difficult form (4-2-4), there are only two hidden units which must optimally encode the four patterns. Because there is no redundancy in the hidden layer, it is necessary to provide active connections between the units in the hidden layer. This allows the hidden units to "compete" for particular codes during learning. Connections are also provided between the units in the output layer. This allows lateral inhibition to develop so that the desired output unit can inhibit the other output units from being on at the same time. In addition, the $T_{ij}$ are initialized to zero for BM and to very small random values ($[-\eta, +\eta] \times 10^{-3}$) for MFT.⁷ Finally, we found it necessary to reduce the learning rates for both BM and MFT to $\eta = 1$ and $\eta = 0.5$ respectively in order to achieve good results. This has the effect of lowering the $E(\vec{S})/T$ ratio (see section 2), thereby introducing more thermal noise into the learning algorithm. This helps to resolve conflicts between encodings among the hidden units.

In figure 11, we show the percentage of completed input sets as a function of sweeps performed for the 4-2-4 encoder. We also show, in figure 12, the percentage of experiments that completely learn the input-output encoding as a function of sweeps. The final data points for these curves correspond to 500 learning cycles. Notice that for BM, the percentage completely learned shown in figure 12 asymptotically approaches the percentage of input sets completed shown in figure 11. Both BM and MFT have trouble learning this problem, but the MFT learning quality as measured by percentage completely learned is nearly a factor of 3 better than BM.

⁷If the $T_{ij}$ are not initially zero for problems with interconnected hidden units, there is an initial bias towards a certain internal representation of the encoding. Very often this leads to conflicts between hidden units that prevent learning from occurring. On the other hand, the $T_{ij}$ are initially set to non-zero values in problems where the hidden units are not interconnected (e.g. XOR, line symmetry) to take advantage of random bias. This improves the probability of achieving a well-distributed internal representation among the hidden units. The fact that the $T_{ij}$ are initially non-zero for MFT is a consequence of equation (3.10); all zero $T_{ij}$ yields a solution of all zero $V_i$, while in BM there is enough noise to prevent this from being a problem.

Figure 11: Percentage of input sets completed as a function of $N_{\text{sweep}}$ for the 4-2-4 encoder.

We have also investigated the encoder problem with three hidden units (4-3-4). This configuration provides some redundancy in the hidden layer and we expect better results. Figures 13 and 14 show the percentage of input sets completed and percentage of experiments that completely learned the input space for the 4-3-4 encoder with 500 learning cycles. Again, the percentage completely learned by BM shown in figure 14 asymptotically approaches the percentage of input sets completed by BM shown in figure 13. This seems to indicate that once BM begins to complete input sets, it continues to do so in a uniform manner such that the complete learning criterion is eventually met. This will not appear to be true for BM for the line symmetry problem in the next section. For the 4-3-4 encoder, MFT easily achieves nearly perfect learning quality, providing a factor of 2.5 improvement over BM learning quality.

Figure 12: Percentage of 4-2-4 encoder experiments that performed complete learning of the input space as a function of $N_{\text{sweep}}$.

Figure 13: Percentage of input sets completed as a function of $N_{\text{sweep}}$ for the 4-3-4 encoder.

Figure 14: Percentage of 4-3-4 encoder experiments that performed complete learning of the input space as a function of $N_{\text{sweep}}$.

4.4 The line symmetry problem

This is the problem of detecting symmetry among an even number of binary units [2]. We have investigated the MFT performance for a problem consisting of six input units, six hidden units (no connections between hidden units), and one output unit to indicate presence or absence of symmetry. The symmetrical input patterns that are to be learned (i.e., produce an output of 1) consist of the following 8 patterns out of the 64 possible patterns:

000 000
001 001
010 010
011 011
100 100
101 101
110 110
111 111

This problem differs from the previous problems in two ways:

Low signal-to-noise ratio (SNR). The network is required to identify 8 out of the 64 different patterns it is presented with, as opposed to 2 out of 4 for the XOR (the encoder is more like a completion problem than a classification problem). Thus, there are 8 signal patterns compared to 56 noise patterns. With a low SNR, learning to properly identify the symmetric patterns is difficult, because it is much easier for the network to identify the noise. For this problem, if the network identifies all patterns as asymmetric, it will be right at least 87.5% of the time. To remove this difficulty, the input space is adjusted so that symmetric patterns are presented as often as asymmetric ones. This is done by selecting a symmetric pattern with a 43% probability, otherwise selecting any pattern at random.⁸

⁸When we select any pattern at random with 57% probability, 12.5% (8 of 64) will be symmetric. Thus, P(symmetric) ≈ (0.57 × 0.125) + 0.43 ≈ 0.5.


Large pattern space (generalized learning). We use this problem to explore the situation in which the size of the input space approaches a point at which it becomes intractable to present the entire input space for each learning cycle. Thus, we want to explore the problem of generalized learning. Here we wish to learn the entire input space by training on a small random subset of the space during each learning cycle.

We have investigated generalized learning for the line symmetry problem by choosing ten input patterns at random from the distribution described above to be presented at each learning cycle. An input set consists of these ten random patterns, and these are checked for completion at each learning cycle prior to updating the $T_{ij}$.⁹ In figure 15, we show the percentage of input sets completed as a function of sweeps corresponding to 500 learning cycles. In contrast to corresponding curves for the XOR and encoder problems (see figures 9, 11, 13), we notice that learning in the line symmetry problem appears to be noisy. This is attributed, at least in part, to a certain amount of unlearning occurring due to training on small subsets of the input space. Again, MFT provides significant improvement in learning quality and performance. In figure 16, we show the percentage of experiments that perform complete learning (300 successive patterns completed) as a function of the number of sweeps. MFT appears to be better suited for generalized learning as nearly 90 percent of the experiments learn completely whereas less than 20 percent of the experiments learn completely with BM for 500 learning cycles.
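A sketch of this training-set construction, using the ±1 representation and the 43% rule of footnote 8; the helper names and the default parameters are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_line_symmetry_pattern(n_bits=6):
    """One input pattern in the +/-1 representation; symmetric patterns are forced
    with probability 0.43 so that they appear about half the time (see footnote 8)."""
    half = n_bits // 2
    if rng.random() < 0.43:
        first = rng.choice([-1, 1], size=half)
        bits = np.concatenate([first, first])     # second half repeats the first half
    else:
        bits = rng.choice([-1, 1], size=n_bits)   # any of the 2**n_bits patterns
    target = 1 if np.array_equal(bits[:half], bits[half:]) else -1
    return bits, target

def make_input_set(n_patterns=10):
    """Ten random patterns form one input set for a learning cycle (section 4.4)."""
    return [random_line_symmetry_pattern() for _ in range(n_patterns)]
```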

Notice that in contrast to the encoder problems, the percentage of experiments completely learned by BM shown in figure 16 does not appear to be approaching the percentage of input sets completed by BM shown in figure 15. We conclude that, due to the noisiness of generalized learning, BM for the line symmetry problem does not exhibit the uniform approach to complete learning that was exhibited by BM in the encoder problems. On the other hand, MFT performance does not appear to degrade significantly when dealing with generalized learning.

⁹While this may look like generalization in the broadest sense, in that we may be testing a pattern that has not been trained on yet, eventually all patterns will be used for training. Thus, we cannot say that the network has generalized to patterns it has never seen before. We can say that there is generalization in learning, in that the learning done by ten particular patterns is not so specific as to cause the next ten patterns to undo previous learning. This would be the case if there was not any exploitable regularity in the input space.

Figure 15: Percentage of input sets consisting of 10 random patterns completed as a function of $N_{\text{sweep}}$ for the line symmetry problem.

Figure 16: Percentage of line symmetry experiments that performed complete generalized learning of the input space (see section 4.4) as a function of $N_{\text{sweep}}$.

5. Summary and outlook

We have developed, evaluated, and implemented a mean field theory learning algorithm. It is extremely simple to use. Equation (3.10) is solved for a sequence of temperatures with appropriate units clamped. From these solutions, $p_{ij}$ and $p'_{ij}$ are computed from equation (3.15). The rest of the algorithm follows from the Boltzmann Machine updating rule of equation (2.9). The algorithm is inherently parallel.
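Putting the pieces together, a compact sketch of one learning cycle under the assumptions used in the earlier sketches (weight matrix with $T_{ii} = 0$, the `mft_anneal` helper, zero starting values for the free units, and per-pattern updating); this is an illustration, not the authors' implementation.

```python
import numpy as np

def mft_learning_cycle(T, input_set, schedule, eta, n_in, n_out):
    """One learning cycle: clamped and free-running MFT solutions per pattern, correlations
    from the factorization of eq. (3.15), and the weight update of eq. (2.9)."""
    N = T.shape[0]
    inputs = set(range(n_in))
    outputs = set(range(N - n_out, N))
    for x, y in input_set:
        V = np.zeros(N)
        V[:n_in], V[N - n_out:] = x, y
        V = mft_anneal(T, V, inputs | outputs, schedule)   # clamped phase
        p = np.outer(V, V)                                 # p_ij  ~ V_i V_j   (eq. (3.15))
        V = np.zeros(N)
        V[:n_in] = x
        V = mft_anneal(T, V, inputs, schedule)             # free-running phase
        p_free = np.outer(V, V)                            # p'_ij ~ V_i V_j
        T = T + eta * (p - p_free)                         # eq. (2.9)
        np.fill_diagonal(T, 0.0)                           # keep T_ii = 0
    return T
```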

Testing the algorithm on the XOR, encoder, and line symmetry problems gives very encouraging results. Not only are speedup factors in the range 10-30 observed as compared to the Boltzmann Machine, but the quality of the learning is significantly improved. Exploratory investigations [9] on scaling up the problems to larger sizes tentatively indicate that the promising features survive.

The next step is to investigate what the generalization properties of this algorithm are and to apply it to realistic problems.

Appendix A.

Here we present an alternate derivation of the mean field theory equations following closely the Markov approach of reference [3].

The Boltzmann distribution of equation (2.5) is valid for equilibrium configurations. Let us define a time-dependent distribution $P(S_i, t)$ that also holds for non-equilibrium situations. In a Markov process, it obeys the following master equation:

$$\frac{dP(S_i, t)}{dt} = -\sum_i W_i(S_i \to -S_i, t)\, P(S_i, t) + \sum_i W_i(-S_i \to S_i, t)\, P(-S_i, t) \qquad (A.1)$$

where $W_i(\pm S_i \to \mp S_i, t)$ is the transition probability.

At equilibrium, $dP(S_i, t)/dt = 0$ and $P(S_i, t) = P(S_i)$, so one obtains from equation (A.1):

$$W_i(S_i \to -S_i, t)\, P(S_i) = W_i(-S_i \to S_i, t)\, P(-S_i). \qquad (A.2)$$

Substituting the Boltzmann distribution (see equation (2.5)) for $P(S_i)$, one gets from equation (A.2):

$$\frac{W_i(S_i \to -S_i, t)}{W_i(-S_i \to S_i, t)} = \frac{\exp\left(-S_i \sum_j T_{ij} S_j / T\right)}{\exp\left(S_i \sum_j T_{ij} S_j / T\right)}. \qquad (A.3)$$

Thus, the transition probability can be rewritten as:

$$W_i(S_i \to -S_i, t) = \frac{1}{2\tau}\left(1 - \tanh\left(S_i \sum_j T_{ij} S_j / T\right)\right) = \frac{1}{2\tau}\left(1 - S_i \tanh\left(\sum_j T_{ij} S_j / T\right)\right) \qquad (A.4)$$

where $\tau$ is a proportionality constant and the second line follows from the fact that $S_i = \pm 1$ and that $\tanh(x)$ is odd in $x$. The average of $S_i$ is defined by

$$\langle S_i \rangle = \frac{1}{Z} \sum_{\vec{S}} S_i\, P(S_i, t) \qquad (A.5)$$

where

$$Z = \sum_{\vec{S}} P(S_i, t). \qquad (A.6)$$

Noting that $Z = 1$ and ignoring the proportionality constant $\tau$, we compute $d\langle S_i \rangle/dt$ from equations (A.1), (A.4), and (A.5):

$$\frac{d\langle S_i \rangle}{dt} = \sum_{\vec{S}} S_i \frac{dP(S_i, t)}{dt} = \sum_{\vec{S}} S_i \left[ -\sum_j W_j(S_j \to -S_j, t)\, P(S_j, t) + \sum_j W_j(-S_j \to S_j, t)\, P(-S_j, t) \right] = -\left[ \langle S_i \rangle - \tanh\left( \sum_j T_{ij} \langle S_j \rangle / T \right) \right]. \qquad (A.7)$$

The last equality follows from the fact that only the $j = i$ term contributes to the summation in the second line and we have made the mean field approximation

$$\left\langle \tanh\left( \sum_j T_{ij} S_j / T \right) \right\rangle = \tanh\left( \sum_j T_{ij} \langle S_j \rangle / T \right). \qquad (A.8)$$

Similarly, for $\langle S_i S_j \rangle$ one gets:

$$\frac{d\langle S_i S_j \rangle}{dt} = \sum_{\vec{S}} S_i S_j \frac{dP(S_i, t)}{dt} = \sum_{\vec{S}} S_i S_j \left[ -\sum_k W_k(S_k \to -S_k, t)\, P(S_k, t) + \sum_k W_k(-S_k \to S_k, t)\, P(-S_k, t) \right] = -\left[ 2\langle S_i S_j \rangle - \tanh\left( \sum_k T_{ik} \langle S_j S_k \rangle / T \right) - \tanh\left( \sum_k T_{jk} \langle S_i S_k \rangle / T \right) \right]. \qquad (A.9)$$

At equilibrium, equations (A.7) and (A.9) are identical to equations (3.10) and (3.14) with the identifications

$$V_i = \langle S_i \rangle \qquad (A.10)$$

$$V_{ij} = \langle S_i S_j \rangle. \qquad (A.11)$$

References

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A Learning Algorithm for Boltzmann Machines," Cognitive Science, 9:1 (1985) 147-169.

[2] Joshua Alspector and Robert B. Allen, "A Neuromorphic VLSI Learning System," in Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, edited by Paul Losleben, (MIT Press, 1987).

[3] Roy J. Glauber, "Time-Dependent Statistics of the Ising Model," Journal of Mathematical Physics, 4:2 (1963) 294-307.

[4] D. O. Hebb, The Organization of Behavior, (John Wiley & Sons, 1949).

[5] Morris W. Hirsch and Stephen Smale, Differential Equations, Dynamical Systems, and Linear Algebra, (Academic Press, 1974).

[6] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, USA, 79 (1982) 2554-2558.

[7] J. J. Hopfield and D. W. Tank, "Neural Computation of Decisions in Optimization Problems," Biological Cybernetics, 52 (1985) 141-152.

[8] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, "Optimization by Simulated Annealing," Science, 220:4598 (1983) 671-680.

[9] Carsten Peterson and James R. Anderson, work in progress.

[10] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, (MIT Press, 1986) 318-362.
