A New Algorithm for Learning Bayesian
Classifiers from Data
Alexander Kleiner and Bernadette Sharp
Post Print
N.B.: When citing this work, cite the original article.
Original Publication:
Alexander Kleiner and Bernadette Sharp, A New Algorithm for Learning Bayesian Classifiers
from Data, 2000, Artificial Intelligence and Soft Computing, 191-197.
Postprint available at: Linköping University Electronic Press
A NEW ALGORITHM FOR LEARNING BAYESIAN CLASSIFIERS FROM DATA
A. KLEINER
Staffordshire University
Beaconside, Stafford ST18 0AD, UK
a.kleiner@staffs.ac.uk
B. SHARP
Staffordshire University
Beaconside, Stafford ST18 0AD, UK
b.sharp@staffs.ac.uk
Abstract

We introduce a new algorithm for the induction of classifiers from data, based on Bayesian networks. Basically this problem has already been examined from two perspectives: first, the induction of classifiers by learning algorithms for Bayesian networks; second, the induction of classifiers based on the naive Bayesian classifier. Our approach is located between these two perspectives; it eliminates the disadvantages of both while exploiting their advantages. In contrast to recently appeared refinements of the naive Bayes classifier, which capture single correlations in the data, we have developed an approach which captures multiple correlations and furthermore trades off complexity against accuracy. In this paper we evaluate the implementation of our approach with data sets from the machine learning repository and data sets artificially generated by Bayesian networks.
Keywords: Machine Learning, Naive Bayes Classifier, Bayesian Networks, MDL principle
INTRODUCTION
In this paper we introduce a new algorithm for the induction of classifiers from data, based on Bayesian networks. Basically this problem has already been examined from two perspectives: first, the induction of classifiers by learning algorithms for Bayesian networks; second, the induction of classifiers based on the naive Bayesian classifier. Our approach is located between these two perspectives; it eliminates the disadvantages of both while exploiting their advantages.
The first induction of classifiers involves a search over all possible networks and has been successfully solved in [2] and [4]. However, it can be considered as unsupervised learning [7], since it does not distinguish between attribute variables and the class variable. Thus the results for a classification task are not sufficiently accurate. The second induction approach is based on the refinement of the naive Bayes classifier, which has already proved its power for classification in many applications [13]. Due to the fact that this classifier comes with the strong assumption of independence, refinements are achieved by relaxing this assumption.

Significant work in that field is found in [16], [14] and [7]. The latter approach improves the naive Bayes classifier by capturing single dependencies between the attributes. Our approach is motivated by this one but extends it by two new features: the possibility of learning multiple correlations between attributes, and a trade-off between complexity and accuracy. We argue that both features are important: on one hand, because data from real-world applications is likely to have multiple correlations between its variables, and on the other hand, because the application of classifiers to real-world problems requires fast computation. This computation, however, depends strongly on the complexity of the classifier. To realize these two features, we adopted the minimum description length (MDL) principle [2], which is a technique used in the general learning of Bayesian networks.
We shall denote variables that refer, for example, to attributes in a classification task, with capital letters, such as $A, B, C$, and particular configurations of these variables in lower case, such as $a, b, c$. A set of variables is denoted in bold, for example, $\mathbf{U} = \{A, B, C\}$.
Figure 1: The structure of the naive Bayes classifier (the class node $v_i$ is the parent of every attribute $a_1, a_2, \ldots, a_n$)
A classifier maps cases to classes; this mapping depends on particular configurations of the attributes and has to be learned by the classifier. A case is represented by the attributes $(A_1, A_2, \ldots, A_n)$ and the class $V$. Every attribute $A_i$ can be in a certain state $A_i = a_i$ from its domain of $N_{A_i}$ possible states. Each configuration $\mathbf{A}$ of these attributes belongs to a class $v_i$ from the set of classes $V$. The task is to learn a target mapping from each configuration to one of these classes. Finally, the quality of the induced classifier can be assessed by its ability to classify unknown configurations to an appropriate $v_i$.
NAIVE BAYESIAN CLASSIFIER
Among other techniques, the naive Bayesian classifier (or simply naive Bayes) is one of the most powerful tools in machine learning. It can compete with other classifiers, such as backpropagation or ID3, though its structure is less complex. Its power for text classification has been proven in [15], [11] and [13].

The Bayesian approach to achieve a mapping between classes and attributes is to identify the class with the highest probability for a particular configuration of the attributes. In statistical terms, the class identified in this way is named the maximum a posteriori (MAP) hypothesis:

$$v_{MAP} = \arg\max_{v_i \in V} P(v_i \mid a_1, a_2, \ldots, a_n) \quad (1)$$

Applying Bayes' theorem, this yields:

$$v_{MAP} = \arg\max_{v_i \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_i)\, P(v_i)}{P(a_1, a_2, \ldots, a_n)} \quad (2)$$

and, since $P(a_1, a_2, \ldots, a_n)$ is constant over all classes, this becomes:

$$v_{MAP} = \arg\max_{v_i \in V} P(a_1, a_2, \ldots, a_n \mid v_i)\, P(v_i) \quad (3)$$
This describes an approach for a correct classification of attributes with respect to their probabilities, estimated from the training data. The estimation of these probabilities, however, becomes intractable with an increasing number of attributes, since the number of possible configurations of these attributes, also known as "atomic events", grows drastically. To overcome this problem, naive Bayes comes with the "naive" underlying assumption that every attribute $A_i$ is independent from the others, whereby the number of required probability values is largely reduced. Under the assumption of independence, the conjunction of the attributes can be decomposed into a product of the probabilities of each single attribute: $P(a_1, a_2, \ldots, a_n \mid v_i) = \prod_j P(a_j \mid v_i)$, which yields the naive Bayes classifier:

$$v_{NB} = \arg\max_{v_i \in V} P(v_i) \prod_j P(a_j \mid v_i) \quad (4)$$
In other words, this learning method involves a learning step, where the estimates for all $P(v_i)$ and $P(a_j \mid v_i)$ are determined from their frequencies in the training set by simply counting their occurrences. An induced classifier can then be used to classify any configuration of the attributes by multiplying, for every class $v_i$, the probabilities $P(a_j \mid v_i)$ of each attribute and selecting the class which yields the highest probability.
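To make the counting-based learning step concrete, here is a minimal sketch of equation (4) in Python. The class interface, the variable names and the add-one (Laplace) smoothing are our own illustrative choices, not part of the original algorithm:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal naive Bayes learned by frequency counting, classifying by Eq. (4)."""

    def fit(self, cases, classes):
        self.total = len(classes)
        self.class_counts = Counter(classes)          # occurrences of each class v_i
        self.attr_counts = Counter()                  # occurrences of (j, a_j, v_i)
        self.domains = [set() for _ in cases[0]]      # observed states per attribute
        for case, v in zip(cases, classes):
            for j, a in enumerate(case):
                self.attr_counts[(j, a, v)] += 1
                self.domains[j].add(a)
        return self

    def predict(self, case):
        """Return argmax_v P(v) * prod_j P(a_j | v), computed in log space."""
        def score(v):
            s = math.log(self.class_counts[v] / self.total)
            for j, a in enumerate(case):
                # add-one smoothing keeps unseen states from zeroing the product
                num = self.attr_counts[(j, a, v)] + 1
                den = self.class_counts[v] + len(self.domains[j])
                s += math.log(num / den)
            return s
        return max(self.class_counts, key=score)
```

A classifier is induced with clf = NaiveBayes().fit(cases, classes) and applied with clf.predict(case).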
The performance of this simple approach has been measured in various applications. One interesting example is the classification of newsgroups, as reported in [11]. In this work, 20 newsgroups, each with 1000 articles, have been classified. The classes $v_i$ were given by the names of these 20 newsgroups, for example comp.sys.ibm.pc.hardware, and the attributes by words from the English language appearing in those articles. The experiment led to a remarkable result of 89% accuracy, in contrast to a random classification with an expected accuracy of 5%. Noteworthy, however, is that the assumption of conditional independence was not necessarily kept by the data. One can imagine that in the case of classification of texts in natural language, conditional dependencies must exist. For instance, it is likely to find the word "Intelligence" after the word "Artificial", or to find the word "Naive" before the word "Bayes". However, recent results showed that the naive Bayes classifier performs well even when this assumption is violated.

This leads to the obvious question whether we can achieve even better performance by using networks which consider dependencies in the data. Bayesian networks [17] provide a method to represent such dependencies between variables, and there are approaches to learn their structure and parameters from data.
LEARNING BAYESIAN NETWORKS
FOR CLASSIFICATION
Bayesian Networks
A Bayesian network $B$ for a set of random variables $\mathbf{U}$ is defined by a structure $S$, describing a directed acyclic graph, and a set of parameters $\Theta$, quantifying this structure. The structure is represented by arcs between the random variables $X_1, X_2, \ldots, X_n$ in $\mathbf{U}$, which indicate direct dependencies between them. Furthermore, the set of parameters provides, for every configuration of a node $X_i$ and its parents $pa(X_i)$, the probability $\theta_{X_i \mid pa(X_i)} = P_B(X_i \mid pa(X_i))$ of this particular configuration. Thus the joint probability distribution over $\mathbf{U}$ can be reconstructed by the multiplication of each node's probabilities:

$$P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid pa(X_i)) \quad (5)$$

If $pa(X_i)$ consisted only of the class variable $V$ for every $i \in \{1, 2, \ldots, n\}$ and $pa(V) = \emptyset$, the above would describe a Bayesian network for a naive Bayes classifier.
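To illustrate equation (5), the following sketch stores a small network as parent lists and conditional probability tables and multiplies the node probabilities; the dictionary-based representation and the numbers are illustrative assumptions only:

```python
# Structure: parent tuple per node; parameters: P_B(node_state | parent_states)
parents = {"A": (), "B": ("A",), "C": ("A",)}
cpt = {
    "A": {((), 0): 0.6, ((), 1): 0.4},
    "B": {((0,), 0): 0.9, ((0,), 1): 0.1, ((1,), 0): 0.3, ((1,), 1): 0.7},
    "C": {((0,), 0): 0.8, ((0,), 1): 0.2, ((1,), 0): 0.5, ((1,), 1): 0.5},
}

def joint_probability(assignment):
    """Eq. (5): P_B(X_1, ..., X_n) = prod_i P_B(X_i | pa(X_i))."""
    p = 1.0
    for node, pa in parents.items():
        pa_states = tuple(assignment[q] for q in pa)
        p *= cpt[node][(pa_states, assignment[node])]
    return p

print(joint_probability({"A": 1, "B": 0, "C": 1}))  # 0.4 * 0.3 * 0.5 = 0.06
```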
However, we are able to express far more complex relationships within $\mathbf{U}$. Basically these relationships are about dependence and independence between these variables. Let $\mathbf{A}, \mathbf{B}, \mathbf{C}$ be subsets of $\mathbf{U}$. Then there is conditional independence between $\mathbf{A}$ and $\mathbf{C}$ given $\mathbf{B}$ if $P(\mathbf{A} \mid \mathbf{B}) = P(\mathbf{A} \mid \mathbf{B}, \mathbf{C})$ holds whenever $P(\mathbf{B}, \mathbf{C}) > 0$. That is, when the state of $\mathbf{B}$ is known, no knowledge about $\mathbf{C}$ will alter the probability of $\mathbf{A}$ [10]. Of course this implies that this holds for every possible configuration $\mathbf{a}, \mathbf{b}, \mathbf{c}$ of the subsets $\mathbf{A}, \mathbf{B}, \mathbf{C}$. In Bayesian networks, this independence is encoded by the following definition: every variable $X_i$ is independent of its nondescendants given its parents [17].
The learning of structure and parameters
This problem can be solved by a search over all possible networks and an estimation of the parameters. To identify the network which matches the data best, a commonly used method is to calculate the log-likelihood of $B$ given $D$. Let $B = \{S, \Theta\}$ be a Bayesian network for the data set $D = \{d_1, d_2, \ldots, d_N\}$, where each $d_i$ assigns a value to every variable in $B$. Then

$$LL(B \mid D) = \sum_{i=1}^{N} \log P_B(d_i) \quad (6)$$

measures the probability that the data $D$ was generated by the network $B$. That means, the bigger $LL(B \mid D)$, the more likely the examined network can represent the underlying distribution of $D$. Unfortunately this measure is not appropriate in its pure form for learning Bayesian networks, since it favours complex networks, which lead to a higher log-likelihood than simple ones.
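Equation (6) then follows directly by summing log-probabilities over the data set; this helper reuses the joint_probability() sketch above and is likewise only illustrative:

```python
import math

def log_likelihood(data):
    """Eq. (6): LL(B | D) = sum_i log P_B(d_i)."""
    return sum(math.log(joint_probability(d)) for d in data)

data = [{"A": 1, "B": 0, "C": 1}, {"A": 0, "B": 0, "C": 0}]
print(log_likelihood(data))  # log(0.06) + log(0.432) ≈ -3.65
```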
However, it has been shown that the number of possible structures increases drastically with the number of nodes in the network. Heckerman commented on this: "If we consider Bayesian network models with n variables, the number of possible structure hypotheses is more than exponential in n" [9]. Therefore, it is impossible to consider all of these models during a search. A common technique to avoid the consideration of all possible solutions of a problem is to perform a greedy search, which usually leads to a local maximum in the search space. To apply such a greedy search to a Bayesian network, a scoring function is necessary which returns a value for locally applied changes.

Suppose $A_i$ is a node in a network with $n$ variables. Consequently there are $n-1$ possible parents for this node and $2^{n-1}$ possible combinations of them. Instead of applying all these combinations for every node, a greedy search could be implemented which performs the operations add parent and delete parent, guided by a scoring function. A greedy search works on the principle of not reconsidering operations done in previous steps. This finally leads to a locally optimal solution, as long as the score indicates the optimal operation for every step.
Since the scoring function needs to be applicable locally, the log-likelihood, which returns a value corresponding to the whole network, cannot be used. Furthermore, as mentioned above, this measure tends to favour complex structures, which we want to avoid. To solve these problems, two metric functions have been introduced, namely the Bayesian Information Criterion (BIC) [4] and the Minimum Description Length (MDL) criterion [2]. Both of these functions return a score which maximises the log-likelihood, however with a restriction by the complexity. Since these functions are similar in principle, we focus on the MDL score, which also motivates our approach.
Minimum description length (MDL) principle
The MDL score consists of two parts, which are the previously introduced log-likelihood and the complexity of the model. The approach selects a model within a trade-off between these two components. The complexity of a network can be expressed by the number of bits necessary for its representation. Suppose there are $n$ nodes in a network, each with $k$ parents; then the parents of a node can be encoded with $k \log_2(n)$ bits. Furthermore, the conditional probability tables associated with each node have to be encoded as well. For a node in a Bayesian network, one probability is needed for every possible configuration of the parents and the node itself. The information necessary to encode a Bayesian network with $n$ nodes is:

$$\sum_{i=1}^{n} \left[ k_i \log_2(n) + d\,(val(X_i) - 1) \prod_{X_j \in pa(X_i)} val(X_j) \right] \quad (7)$$

where $d$ denotes the number of bits necessary to represent the numeric value of a probability, $val(X)$ the number of possible states node $X$ can take, and $pa(X)$ denotes its set of parents. This is the encoding scheme suggested in [2]. Since this formula sums over all nodes in the network, it can easily be decomposed for a single node.
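A direct transcription of equation (7) might look as follows; the bits-per-probability constant d = 16 is an arbitrary assumption for the example:

```python
import math

def description_length(parents, val, d=16):
    """Eq. (7): bits needed to encode the network structure and its CPTs.

    parents: dict mapping each node to a tuple of parent nodes
    val:     dict mapping each node to its number of possible states
    """
    n = len(parents)
    bits = 0.0
    for node, pa in parents.items():
        parent_configs = math.prod(val[p] for p in pa)  # 1 if the node has no parents
        bits += len(pa) * math.log2(n)                  # k_i * log2(n): encode the parent set
        bits += d * (val[node] - 1) * parent_configs    # encode the CPT entries
    return bits

# The small binary network from the earlier sketch:
print(description_length({"A": (), "B": ("A",), "C": ("A",)}, {"A": 2, "B": 2, "C": 2}))
```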
The second part of the MDL score is a measure of how well the network represents the data. However, the log-likelihood cannot be used directly, since it cannot be computed locally for a single node. Instead, two observations are exploited:

- The Kullback-Leibler cross-entropy between the true distribution $P(\mathbf{X})$ and the distribution $Q(\mathbf{X})$ generated by a Bayesian network shrinks as $Q(\mathbf{X})$ more closely approximates $P(\mathbf{X})$. Due to the fact that a network which generates the true distribution also encodes the data optimally, the cross-entropy can be used as a measure to identify this network.

- As shown in [5], the cross-entropy between a true distribution $P(\mathbf{X})$ and a distribution $Q(\mathbf{X})$ is minimal if the underlying network generating $Q(\mathbf{X})$ is a maximum weight spanning tree and the weights between each pair of nodes $X_i$ and $X_j$ are defined by the mutual information between them.
The mutual information between two nodes $X_i$ and $X_j$ is defined as:

$$I(X_i; X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log_2 \frac{P(x_i, x_j)}{P(x_i)\,P(x_j)} \quad (8)$$

which sums over all possible states of $X_i$ and $X_j$. Given the mutual information between all variables, one can build a maximum weighted spanning tree and thereby approximate the maximum log-likelihood with respect to the data. The mutual information, moreover, can be applied to every single node.
Lam and Bacchus evaluated their approach with networks of different sizes. In most of the cases the MDL score was able to reconstruct the original Bayesian network which generated the data for the learning process. In [3] they extended their approach to refining existing network structures, and particularly considered the encoding of changes between one model and a potentially better one.

However, an evaluation in [6] showed that the learning of classifiers with this approach leads to poor results. In that paper they argued that the reason for this is the scoring function itself. Since the MDL score favours simple networks, it tends to reduce relations between attribute variables and the class variable. In particular, for problems with many attributes, classifiers produce poor predictions, since important attributes are cut from the class node and are therefore not able to contribute directly to the classification. They considered learning with the MDL score as "unsupervised" learning, since the learning algorithm is given no information about which node represents the class. They also claimed that the learning of a classification problem with the Bayesian information criterion (BIC) suffers from the same problem. For these reasons, the classification problem has been tackled using the naive Bayes classifier, which is described in the next section.
Figure 2: The structure of the tree augmented naive Bayes classifier (TAN)
IMPROVING NAIVE BAYES
As we saw in the previous section, standard learning algorithms for Bayesian networks are insufficient to solve classification tasks. Therefore several efforts have been undertaken to improve the naive Bayes classifier for the classification of correlated data. Since the naive Bayes classifier comes with the strong assumption of independence, these approaches are motivated by relaxing this assumption. Basically this leads to a search for correlations between the attributes and methods to reflect them in the classifier. In the literature three significant approaches can be found:

- Subfeature selection
- The joining of attributes
- The Tree Augmented Naive Bayes (TAN)
The first method is found in [14], which describes a greedy search algorithm that excludes strongly correlated attributes from the classifier. This takes place in a forward selection manner, which starts with an empty set of attributes and incrementally adds new ones until a termination criterion is met. The selection of attributes takes place with respect to a metric which identifies attributes with a crucial contribution to the classification. The metric they used was leave-one-out cross validation, since it is the most precise measure for the accuracy of a classifier.

The second approach is described in [16] and tackles the problem in the opposite way. Rather than excluding correlated attributes, this approach joins them together to achieve higher classification accuracy. They evaluated the selection of attributes in a forward and backward manner, with two possible operations, which are to add an attribute to the classifier (respectively delete it) and to join an attribute with one in the classifier. They also used the leave-one-out technique to indicate whether a change was successful or not.
The latest work in this field is the tree augmented naive Bayes (TAN) approach, described in [7]. This approach performs better than the other two and is also the motivation for our approach. It is based on the work from [5] and [8], who developed algorithms for building a maximum weighted spanning tree by the mutual information and the conditional mutual information respectively. Note, as we saw above, the first one also motivated the MDL principle.

The TAN algorithm builds a network structure depending on the mutual information between nodes. Basically it captures correlations between attributes by drawing arcs between them. However, this approach comes with intended restrictions, since it is goal-oriented towards classification tasks. The first restriction is that every attribute is connected to the class variable, which yields the structure of naive Bayes. The second restriction is that every attribute may own at most one more parent besides the class variable (see Figure 2). The resulting structure improves on naive Bayes, since it can capture single relations between two attributes. On the other hand, these restrictions to the structure avoid a search through the space of all possible networks. The results reported for this approach are equal to or better than results reported for naive Bayes.
We argue, however, that this approach comes with two crucial disadvantages:

- Only single correlations between attributes can be captured, due to the restriction to one additional parent besides the class node.
- Parents are chosen with respect to the maximum log-likelihood, but the resulting complexity is not considered.
Certainly the first is reasonable, since networks with $n > 2$ parents are more complex. However, suppose a network where the configuration of attribute $A_4$ depends on the configurations of $A_1, A_2, A_3$, which are independent from each other. The TAN Bayes would add the node with maximal influence on $A_4$ as parent and ignore the influence of the other two. In the worst case, these ignored attributes would even be connected as children to other nodes.

The second disadvantage becomes significant if there are nodes with plenty of states. The algorithm would favour networks which increase $LL(D \mid B)$, regardless of their complexity. Note that the size of a node's conditional probability table (CPT), which holds all possible configurations of the node itself and its parents, grows drastically with the number of states of each parent. For a node with $i$ states and two parents with $j$ and $k$ states, the CPT consists of $i \cdot j \cdot k$ entries. We argue further that the predictions of the classifier become less reliable with more complex nodes. This comes from the fact that the probabilities of all these configurations have to be estimated from the data.
Figure 3: An example network, as it might result from our approach
These configurations, however, are less likely to be found in the data with increasing complexity. Thus their estimates are insufficient and lead to poor predictions. To overcome this problem, Friedman and Goldszmidt introduced a smoothing operation to fill the gap of unreliable probabilities.
LEARNING CLASSIFIERS WITH MULTIPLE CORRELATIONS AND LESS COMPLEXITY
Our approach can be found between the TAN architecture and the MDL approach for learning Bayesian networks. Furthermore it combines the advantages of both and excludes their disadvantages, which were previously identified in this paper. The algorithm can be characterised by two main features as follows. First, since our algorithm is supposed to be applied to classification tasks, and the class variable is usually known, we limit the number of possible networks to those whose attributes have the class node as parent. Second, we start to search for relations between these attributes, using the MDL score, which favours simple relations with maximum contribution to the log-likelihood.
The second feature applies a greedy search for each node, to avoid the huge search space of all valid parent combinations. The metric for this greedy search is the trade-off between the mutual information and the complexity caused by the change. We used the modified version of the mutual information from [2], which defines a weight $W(A_i, pa(A_i))$ for attribute $A_i$ and its parents $pa(A_i)$ by

$$W(A_i, pa(A_i)) = \sum_{A_i, pa(A_i)} P(A_i, pa(A_i)) \log_2 \frac{P(A_i, pa(A_i))}{P(A_i)\,P(pa(A_i))} \quad (9)$$

Against this information measure stands the complexity $C(A_i, pa(A_i))$, resulting from this parent configuration, with

$$C(A_i, pa(A_i)) = val(A_i) \prod_{A_j \in pa(A_i)} val(A_j) \quad (10)$$

Define $pa_1(A_i)$ and $pa_2(A_i)$ for attribute $A_i$, where $pa_2$ is extended by one more attribute than $pa_1$. For a comparison of these two parent configurations, we compute the value pairs $(W_1, C_1)$ and $(W_2, C_2)$ for both respectively and compare the relative growth of complexity and weight:

$$\frac{W_2}{W_1} > \lambda\,\frac{C_2}{C_1} \quad (11)$$

where $\lambda$ denotes a weighting factor to control the trade-off between complexity and information gain. Using this formula, a new parent configuration is accepted if the relative information gain exceeds the relative growth in complexity.
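A sketch of the weight, the complexity and the acceptance test of equations (9)-(11), reusing mutual_information() from above; the column-index interface and the default weighting factor are illustrative assumptions:

```python
def weight(data, i, parent_idxs, class_idx):
    """Eq. (9): mutual information between attribute i and its parent set,
    with the class and all parents merged into one compound symbol."""
    xs = [row[i] for row in data]
    ys = [tuple(row[j] for j in (class_idx, *parent_idxs)) for row in data]
    return mutual_information(xs, ys)

def complexity(val, i, parent_idxs, class_idx):
    """Eq. (10): val(A_i) times the product of the parents' state counts."""
    c = val[i] * val[class_idx]
    for j in parent_idxs:
        c *= val[j]
    return c

def accept_new_parent(data, val, i, old_pa, new_pa, class_idx, lam=1.0):
    """Eq. (11): accept if the relative weight gain exceeds lam times the
    relative complexity growth."""
    w1, w2 = weight(data, i, old_pa, class_idx), weight(data, i, new_pa, class_idx)
    c1, c2 = complexity(val, i, old_pa, class_idx), complexity(val, i, new_pa, class_idx)
    return w2 / w1 > lam * (c2 / c1) if w1 > 0 else w2 > 0
```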
The greedy search needs a sorted list for each node, indicating which of the other nodes are worth becoming parents. Thus we generate a list for each node, consisting of every possible parent $A_j$, and rank them by their weight $W(A_i; V, A_j)$. Given this list, the algorithm successively adds a node from this list as parent and keeps it as parent if the achieved weight $W$ exceeds the complexity $C$. In addition to the operations add arc and delete arc, we use the operation reverse arc, which is necessary in the case that a node $A_i$ favours another node $A_j$ as parent, but $A_i$ has already been chosen as parent for $A_j$. In this case we reverse this arc, depending on the weight of both. The computational complexity of this greedy search is $O(n^2)$ for $n$ attributes, since in the worst case all possible $n-1$ parents are considered by every node. The algorithm for $n$ attributes can be summarized as follows:
Generate for every attribute a parent list $P$, corresponding to a naive Bayesian classifier with $P_V = \{\}$ and $P_i = \{V\}$ for every $i \in \{1 \ldots n\}$
Repeat for every $A_i$ with $i \in \{1 \ldots n\}$:
    Generate a list $L$, consisting of $n-1$ entries which store the weights $W(A_i; V, A_j)$ for every possible parent node $A_j$
    Sort $L$ with ascending $W$
    Repeat for all $A_j \in L$:
        Compute $W_1 = W(A_i, P_{A_i})$ and $C_1 = C(A_i, P_{A_i})$
        Add $A_j$ to $P_{A_i}$ (add arc)
        Compute $W_2 = W(A_i, P_{A_i})$ and $C_2 = C(A_i, P_{A_i})$
        If $W_2 / W_1 < \lambda\, C_2 / C_1$:
            Remove $A_j$ from $P_{A_i}$ (delete arc)
        Else if $A_i \in P_{A_j}$:
            If $W(A_i, P_{A_i}) > W(A_j, P_{A_j})$:
                Reverse arc
Return classifier
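A condensed Python rendering of this loop, building on the helpers from the previous sketch; the bookkeeping for the reverse-arc case is simplified and illustrative:

```python
def learn_structure(data, val, n_attrs, class_idx, lam=1.0):
    """Greedy parent search per attribute: start from naive Bayes (class as
    sole parent) and keep only arcs that pass the test of Eq. (11)."""
    parents = {i: [] for i in range(n_attrs)}  # extra parents besides the class
    for i in range(n_attrs):
        # Rank candidate parents A_j by the weight W(A_i; V, A_j)
        candidates = sorted(
            (j for j in range(n_attrs) if j != i and j != class_idx),
            key=lambda j: weight(data, i, [j], class_idx),
        )
        for j in candidates:
            old_pa, new_pa = parents[i], parents[i] + [j]
            if not accept_new_parent(data, val, i, old_pa, new_pa, class_idx, lam):
                continue  # delete arc: the extension does not pay off
            if i in parents[j]:
                # reverse arc: keep the direction with the higher weight
                if weight(data, i, new_pa, class_idx) > weight(data, j, parents[j], class_idx):
                    parents[j].remove(i)
                    parents[i] = new_pa
            else:
                parents[i] = new_pa  # add arc
    return parents
```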
EXPERIMENTS

Methodology

The naive Bayes, the TAN Bayes and our approach have been evaluated with, on one hand, training sets from the machine learning repository [1] and, on the other hand, data artificially generated by Bayesian networks. The latter comes with the advantage that the underlying network is already known and thus the induced classifier can be compared with it. We built Bayesian networks with the commercial package Netica and sampled a sufficient number of cases from them.
In line with other research papers, the accuracy of each classifier has been determined by leave-one-out cross validation [12]. In contrast to the less precise holdout method, where a classifier is induced with 2/3 of the training data and its accuracy measured with the other 1/3, this method induces a classifier with all samples less one and measures its accuracy with that sample, left out during the training. This process is repeated for all samples in the training data, and the accuracy is calculated as the number of correctly classified samples divided by the number of all samples. A detailed examination of the evaluation of classifiers is found in [12].
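A compact sketch of this evaluation loop; the classifier interface (fit/predict, as in the naive Bayes sketch earlier) is our own assumption:

```python
def leave_one_out_accuracy(cases, classes, make_classifier):
    """Train on all samples but one, test on the held-out sample, repeat for
    every sample, and return the fraction of correct classifications."""
    correct = 0
    for k in range(len(cases)):
        train_x = cases[:k] + cases[k + 1:]
        train_y = classes[:k] + classes[k + 1:]
        clf = make_classifier().fit(train_x, train_y)
        correct += clf.predict(cases[k]) == classes[k]
    return correct / len(cases)

# Usage: leave_one_out_accuracy(cases, classes, NaiveBayes)
```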
Results
Generally we expected from our results an accuracy better than or equal to that of the naive Bayes classifier. Furthermore we expected that the complexity of the generated classifier would be located between naive Bayes and TAN. Table 1 shows the properties of the used data sets. The results with this data are found in Table 2, which lists the accuracy for the naive Bayes classifier, the tree augmented naive Bayes and our multiple Bayes.
As can be seen in Table 2, the multiple Bayes approach has achieved for the first two data sets an accuracy equal to that of naive Bayes. The TAN classifier, however, achieved in both sets a lower accuracy than the others. These sets have not been chosen arbitrarily; they both have very little correlation between their attributes. Thus our classifier preferred the simplest structure, which is the naive Bayes (no additional parents), and thus achieved the same accuracy. Since the TAN approach ignores the balance between complexity and accuracy, it builds a classifier based on weak correlations between the attributes. The resulting parent-child connections are poorly supported by the data, which explains the loss of accuracy.

The third data set comes with 16 attributes. As we discovered with our classifier, two of these 16 attributes are significantly influenced by more than three others. This stands in contrast to the other 14 attributes, which are correlated with at most two others. Thus our classifier returned a structure where 14 attributes are connected to two or fewer parents, and the two strongly correlated nodes to more than three. Since the TAN approach is restricted to one additional parent per attribute, it cannot capture such multiple correlations; in contrast to this, our approach was able to achieve a higher accuracy. Furthermore, this could be reached with a classifier only slightly more complex than the one returned by TAN, since not all attributes had been connected with a parent.
Data set   Attributes   Cases   States
breast         11        699      10
balance         5        625       5
votes          17        435       3
ABN             5       1000       2

Table 1: Properties of the used data sets
Data set   Naive Bayes     TAN             Multi Bayes
breast     97.42 ± 0.60    92.56 ± 1.00    97.42 ± 0.60
balance    92.16 ± 1.10    85.28 ± 1.42    92.16 ± 1.10
votes      90.34 ± 1.41    89.20 ± 1.61    92.42 ± 0.60
ABN        70.00 ± 1.45    70.90 ± 1.44    73.70 ± 1.41

Table 2: Accuracy of the tested classifiers
The last data set which we examined was generated with an artificial Bayesian network. The structure of this network was chosen to reflect multiple correlations, as in the previous example. We sampled 1000 cases from this network and used them with our classifier. The result was a classifier reflecting all correlations defined in the Bayesian network, and it thus succeeded with the highest accuracy.
CONCLUSION
We proposed a new architecture for the induction of classifiers, based on Bayesian networks. Essentially this was carried out by the adoption of the MDL principle to the naive Bayes classifier.

Our results show that our refined classifier yields in all cases an accuracy equal to or better than that of naive Bayes, and furthermore it outperforms TAN in the case of data with weak or multiple correlations. Our assumption that correlations are only worth modelling in a classifier if they are cheap in terms of complexity has been reflected by these results. We intend to do further tests, to cover a wide range of different data sets.
The complexity of our algorithm is equal to that of other approaches and computationally tractable. However, the calculation of the mutual information seems to limit the speed of the induction process significantly. Thus our approach, and all other methods based on the mutual information, are likely to induce classifiers slowly if attributes with numerous states are found in the data. To overcome this problem, a simpler method for the identification of correlations in the data has to be found. This method has to be capable of identifying statistical correlations such as those posed by the parity problem as well. This area is to be explored in the next stage.
ACKNOWLEDGEMENTS
We would like to thank David Emery and Bill Walley for their useful discussions and comments.
References
[1] D. W. Aha and K. Murphy. UCI repository of machine learning databases, 1995. http://www.ics.uci.edu/mlearn/MLRepository.html.

[2] F. Bacchus and W. Lam. Learning Bayesian Belief Networks: An Approach based on the MDL Principle. In Computational Intelligence, volume 10, pages 269-293, 1994.

[3] F. Bacchus and W. Lam. Using New Data to Refine a Bayesian Network. In Uncertainty in Artificial Intelligence, pages 383-390, 1994.

[4] D. M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian Networks: The combination of knowledge and statistical data. In Machine Learning, volume 20, pages 197-243, 1995.

[5] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. In IEEE Transactions on Information Theory, volume 14, pages 462-467, 1968.

[6] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. In Machine Learning, volume 29, pages 131-163, 1997.

[7] N. Friedman and M. Goldszmidt. Building Classifiers using Bayesian Networks. In Thirteenth National Conf. on Artificial Intelligence, 1996.

[8] D. Geiger. An entropy-based learning algorithm of Bayesian conditional trees. In UAI'92, pages 92-97, 1992.

[9] D. Heckerman. A Tutorial on learning with Bayesian networks. Techn. Report, Microsoft Research, Redmond, Washington, 1995.

[10] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press Limited, University College London, England, 1996.

[11] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Computer Science Techn. Report CMU-CS-96-118, Carnegie Mellon University, 1996.

[12] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI'95, pages 1137-1143, 1995.

[13] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331-339, San Francisco, Calif., 1995. Morgan Kaufmann.

[14] P. Langley and S. Sage. Induction of selective Bayesian classifiers. In UAI'94, pages 399-406, 1994.

[15] D. Lewis. Representation and learning in information retrieval. Dissertation, Dept. of Computer and Information Science, University of Massachusetts, United States of America, 1991. (COINS Technical Report 91-93).

[16] M. J. Pazzani. Searching for dependencies in Bayesian classifiers. In Proc. of the 5th Int. Workshop on Artificial Intelligence and Statistics, 1995.

[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.