
A New Algorithm for Learning Bayesian

Classifiers from Data

Alexander Kleiner and Bernadette Sharp

Post Print

N.B.: When citing this work, cite the original article.

Original Publication:

Alexander Kleiner and Bernadette Sharp, A New Algorithm for Learning Bayesian Classifiers

from Data, 2000, Artificial Intelligence and Soft Computing, 191-197.

Postprint available at: Linköping University Electronic Press

A NEW ALGORITHM FOR LEARNING BAYESIAN CLASSIFIERS FROM DATA

A. KLEINER
Staffordshire University
Beaconside, Stafford ST18 0AD, UK
a.kleiner@staffs.ac.uk

B. SHARP
Staffordshire University
Beaconside, Stafford ST18 0AD, UK
b.sharp@staffs.ac.uk

Abstract

We introduce a new algorithm for the induction of classifiers from data, based on Bayesian networks. Basically this problem has already been examined from two perspectives: first, the induction of classifiers by learning algorithms for Bayesian networks; second, the induction of classifiers based on the naive Bayesian classifier. Our approach is located between these two perspectives; it eliminates the disadvantages of both while exploiting their advantages. In contrast to recently published refinements of the naive Bayes classifier, which capture single correlations in the data, we have developed an approach which captures multiple correlations and furthermore makes a trade-off between complexity and accuracy. In this paper we evaluate the implementation of our approach with data sets from the machine learning repository and data sets artificially generated by Bayesian networks.

Keywords: Machine Learning, Naive Bayes Classifier, Bayesian Networks, MDL principle

INTRODUCTION

In this paper we introduce a new algorithm for the induction of classifiers from data, based on Bayesian networks. Basically this problem has already been examined from two perspectives: first, the induction of classifiers by learning algorithms for Bayesian networks; second, the induction of classifiers based on the naive Bayesian classifier. Our approach is located between these two perspectives; it eliminates the disadvantages of both while exploiting their advantages.

The first induction of classifiers involves a search over all possible networks and has been successfully solved in [2] and [4]. However, it can be considered as unsupervised learning [7] since it does not distinguish between attribute variables and the class variable. Thus the results for a classification task are not sufficiently accurate. The second induction approach is based on the refinement of the naive Bayes classifier, which has already proved its power for classification in many applications [13]. Due to the fact that this classifier comes with the strong assumption of independence, refinements are achieved by relaxing this assumption.

Significant work in that field is found in [16], [14] and [7]. The latter approach improves the naive Bayes classifier by capturing single dependencies between the attributes. Our approach is motivated by this one but extends it by two new features. These two features allow the learning of multiple correlations between attributes and a trade-off between complexity and accuracy. We argue that both features are important: on one hand, because data from real-world applications is likely to have multiple correlations between its variables, and on the other hand, because the application of classifiers to real-world problems requires a fast computation. This computation, however, depends strongly on the complexity of the classifier. To realize these two features, we adopted the principle of minimum description length [2], which is a technique used in the general learning of Bayesian networks.

We shall denote variables that refer, for example, to attributes in a classification task, with capital letters, such as A, B, C, and particular configurations of these variables in lower case, such as a, b, c. A set of variables is denoted in bold, for example, U = {A, B, C}.

Figure 1: The structure of the naive Bayes classifier

The classification task consists of assigning cases to one of a set of classes. This "mapping" from cases to classes depends on particular configurations of the attributes and has to be learned by the classifier. A case is represented by the attributes (A_1, A_2, ..., A_n) and the class V. Every attribute A_i can be in a certain state A_i = a_i from its domain of N_{A_i} possible states. Each configuration A of these attributes belongs to a class v_i from the set of classes V. The task is to learn a target mapping for each configuration to one of these classes. Finally, the quality of the induced classifier can be assessed by its ability to classify unknown configurations to an appropriate v_i.

NAIVE BAYESIAN CLASSIFIER

Among other techniques, the naive Bayesian classifier (or simply naive Bayes) is one of the most powerful tools in machine learning. It can compete with other classifiers, such as backpropagation or ID3, though its structure is less complex. Its power for text classification has been proven in [15], [11] and [13].

The Bayesian approach to achieve a mapping between classes and attributes is to identify the class with the highest probability for a particular configuration of the attributes. In statistical terms, the class identified in this way is called the maximum a posteriori (MAP) hypothesis:

    v_{MAP} = \arg\max_{v_i \in V} P(v_i | a_1, a_2, \ldots, a_n)    (1)

Applying Bayes' theorem, this yields:

    v_{MAP} = \arg\max_{v_i \in V} \frac{P(a_1, a_2, \ldots, a_n | v_i) P(v_i)}{P(a_1, a_2, \ldots, a_n)}    (2)

and due to the constant presence of P(a_1, a_2, ..., a_n) this becomes:

    v_{MAP} = \arg\max_{v_i \in V} P(a_1, a_2, \ldots, a_n | v_i) P(v_i)    (3)

This describes an approach for a correct classification of attributes with respect to their probabilities, estimated from the training data. The estimation of these probabilities, however, becomes intractable with an increasing number of attributes, since the number of possible configurations of these attributes, also known as "atomic events", grows drastically. To overcome this problem, the naive Bayes comes with the "naive" underlying assumption that every attribute A_i is independent of the others, whereby the number of required probability values is largely reduced. Under the assumption of independence, the conjunction of the attributes can be decomposed into a product of the probabilities of each single attribute: P(a_1, a_2, ..., a_n | v_i) = \prod_j P(a_j | v_i), which yields the naive Bayes classifier:

    v_{NB} = \arg\max_{v_i \in V} P(v_i) \prod_j P(a_j | v_i)    (4)

In other words, this learning method involves a learning step, where the estimates for all P(v_i) and P(a_j | v_i) are determined by their frequencies in the training set by simply counting their occurrences. An induced classifier can then be used to classify any configuration of the attributes by multiplying, for every class v_i, the probabilities P(a_j | v_i) of each attribute and selecting the class which yields the highest probability.
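To make the counting scheme concrete, the following sketch (our illustration, not code from the paper; all names are our own) estimates P(v_i) and P(a_j | v_i) by frequency counts and classifies a new case according to equation (4):

    from collections import Counter, defaultdict

    def train_naive_bayes(cases, labels):
        """Estimate P(v) and P(a_j | v) by counting occurrences in the training set."""
        class_counts = Counter(labels)
        # cond_counts[(j, a, v)] = number of cases of class v whose attribute j is in state a
        cond_counts = defaultdict(int)
        for case, v in zip(cases, labels):
            for j, a in enumerate(case):
                cond_counts[(j, a, v)] += 1
        n = len(labels)
        priors = {v: c / n for v, c in class_counts.items()}
        def cond_prob(j, a, v):
            # plain frequency estimate; a smoothing term could be added for unseen states
            return cond_counts[(j, a, v)] / class_counts[v]
        return priors, cond_prob

    def classify(case, priors, cond_prob):
        """Return the class maximising P(v) * prod_j P(a_j | v)  (equation 4)."""
        best_v, best_score = None, -1.0
        for v, p_v in priors.items():
            score = p_v
            for j, a in enumerate(case):
                score *= cond_prob(j, a, v)
            if score > best_score:
                best_v, best_score = v, score
        return best_v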

The performance of this simple approach has been measured in various applications. One interesting example is the classification of newsgroups, as reported in [11]. In this work, 20 newsgroups, each with 1000 articles, have been classified. The classes v_i were given by the names of these 20 newsgroups, for example comp.sys.ibm.pc.hardware, and the attributes by words from the English language appearing in those articles. The experiment led to an amazing result of 89% accuracy, in contrast to a random classification with an expected 5% accuracy. Noteworthy, however, is that the assumption of conditional independence was not necessarily kept by the data. One can imagine that in the classification of texts in natural language, conditional dependencies must exist. For instance, it is likely to find the word "Intelligence" after the word "Artificial" or to find the word "Naive" before the word "Bayes". However, recent results showed that the naive Bayes classifier performs well even with violations of this assumption.

This leads to the obvious question whether we can achieve even better performance by using networks which consider dependencies in the data. Bayesian networks [17] provide a method to represent such dependencies between variables, and there are approaches to learn their structure and parameters from data.

LEARNING BAYESIAN NETWORKS FOR CLASSIFICATION

Bayesian Networks

A Bayesian network B for a set of random variables U is defined by a structure S, describing a directed acyclic graph, and a set of parameters Θ, quantifying this structure. The structure is represented by arcs between the random variables X_1, X_2, ..., X_n in U, which indicate direct dependencies between them. Furthermore, the set of parameters Θ provides for every configuration of a node X_i and its parents pa(X_i) the conditional probability P_B(X_i | pa(X_i)) of this particular configuration. Thus the joint probability distribution over U can be reconstructed by the multiplication of each node's probabilities:

    P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i | pa(X_i))    (5)

If pa(X_i) consisted only of the class variable V for every i ∈ {1, 2, ..., n} and pa(V) = {}, the above would describe a Bayesian network for a naive Bayes classifier. However, we are able to express far more complex relationships within U. Basically these relationships are about dependence and independence between these variables. Let A, B, C be subsets of U. Then there is conditional independence between A and C given B, if P(A | B) = P(A | B, C) holds whenever P(B, C) > 0. That is, when the state of B is known, no knowledge about C will alter the probability of A [10]. Of course this implies that this holds for every possible configuration a, b, c of the subsets A, B, C. In Bayesian networks, this independence is encoded by the following definition: every variable X_i is independent of its nondescendants given its parents [17].
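As an illustration of equation (5) (our sketch, with hypothetical data structures, not the authors' implementation), a network can be stored as a parent list plus one conditional probability table per node, and the joint probability of a full configuration is the product of the per-node entries:

    def joint_probability(config, parents, cpt):
        """
        config  : dict mapping node name -> observed state, one entry per variable in U
        parents : dict mapping node name -> tuple of parent node names (empty tuple if none)
        cpt     : dict mapping node name -> dict {(parent states..., own state): probability}
        Returns P_B(X_1, ..., X_n) as the product of P_B(X_i | pa(X_i))  (equation 5).
        """
        p = 1.0
        for node, state in config.items():
            parent_states = tuple(config[q] for q in parents[node])
            p *= cpt[node][parent_states + (state,)]
        return p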

The learning of structure and parameters

This problem can be solved by a search over all possible networks and an estimation of the parameters. To identify the network which matches the data best, a commonly used method is to calculate the log-likelihood for B given D. Let B = {S, Θ} be a Bayesian network for the data set D with D = {d_1, d_2, ..., d_N}, where d_i assigns a value to every variable in B. Then

    LL(B | D) = \sum_{i=1}^{N} \log P_B(d_i)    (6)

measures the probability that the data D was generated from the network B. That means, the bigger LL(B | D), the more likely the examined network can represent the underlying distribution of D. Unfortunately this measure is not appropriate in its pure form for learning Bayesian networks, since it favours complex networks, which lead to a higher log-likelihood than simple ones.
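Equation (6) can be evaluated directly by factoring each record according to equation (5); the sketch below (our notation, using the same hypothetical parents/cpt structures as above) sums the per-record log probabilities:

    import math

    def log_likelihood(data, parents, cpt):
        """LL(B | D) = sum over records d of log P_B(d), with P_B(d) factored per node (equations 5 and 6)."""
        ll = 0.0
        for d in data:                      # d: dict mapping node -> observed state
            for node, state in d.items():
                parent_states = tuple(d[q] for q in parents[node])
                ll += math.log(cpt[node][parent_states + (state,)])
        return ll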

However, it has been shown that the number of possible structures increases drastically with the number of nodes in the network. Heckerman commented on this: "If we consider Bayesian network models with n variables, the number of possible structure hypotheses is more than exponential in n" [9]. Therefore, it is impossible to consider all of these models during a search. A common technique to avoid the consideration of all possible solutions of a problem is to perform a greedy search, which usually leads to a local maximum in the search space. To apply such a greedy search to a Bayesian network, a scoring function is necessary, which returns a value for locally applied changes.

Suppose A_i is a node in a network with n variables. Consequently there are n - 1 possible parents for this node and 2^{n-1} possible combinations of them. Instead of applying all these combinations for every node, a greedy search could be implemented which performs the operations add parent and delete parent, guided by a scoring function. A greedy search works on the principle of not reconsidering operations done in previous steps. This finally leads to a locally optimal solution, as long as the score indicates the optimal operation for every step.

Since the scoring function needs to be applicable locally, the log-likelihood, which returns a value corresponding to the whole network, cannot be used. Furthermore, as mentioned above, this measure tends to favour complex structures, which we want to avoid. To solve these problems, two metric functions have been introduced, namely the Bayesian Information Criterion (BIC) [4] and the Minimum Description Length (MDL) criterion [2]. Both of these functions return a score which maximises the log-likelihood, however with a restriction by the complexity. Since these functions are similar in their principle, we focus on the MDL score, which also motivates our approach.

Minimum description length (MDL) principle

The MDL score consists of two parts, which are the previously introduced log-likelihood and the complexity of the model. The approach selects a model within a trade-off between these two components. The complexity of a network can be expressed by the number of bits necessary for its representation. Suppose there are n nodes in a network, each with k parents; then the parents of a node can be encoded with k log_2(n) bits. Furthermore, the conditional probability tables associated with each node have to be encoded as well. For a node in a Bayesian network, one probability is needed for every possible configuration of the parents and the node itself. The information necessary to encode a Bayesian network with n nodes is:

    \sum_{i=1}^{n} \left[ k_i \log_2(n) + d (val(X_i) - 1) \prod_{X_j \in pa(X_i)} val(X_j) \right]    (7)

where d denotes the number of bits necessary to represent the numeric value of a probability, val(X) the number of possible states node X can take, and pa(X) denotes its set of parents. This is the encoding scheme suggested in [2]. Since this formula sums over all nodes in the network, it can easily be decomposed for a single node.
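Because the sum in equation (7) decomposes per node, the description length of a single node can be computed in isolation; the sketch below (our notation, with d as an assumed number of bits per probability entry) mirrors one summand of that formula:

    import math

    def node_description_length(node, parents, val, n_nodes, d=32):
        """
        Bits needed to encode one node of the network (one summand of equation 7):
        k * log2(n) bits for the parent list plus d * (val(node) - 1) * prod_j val(parent_j)
        bits for its conditional probability table.
        """
        k = len(parents)
        table_entries = val[node] - 1
        for p in parents:
            table_entries *= val[p]
        return k * math.log2(n_nodes) + d * table_entries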

The second part of the MDL score is a measure of how well the network represents the data. However, the log-likelihood cannot be used directly, since it cannot be applied locally to a single node. Two observations address this problem:

• The Kullback-Leibler cross-entropy between the true distribution P(X) and the distribution Q(X), generated by a Bayesian network, shrinks as Q(X) more closely approximates P(X). Due to the fact that a network which generates the true distribution also encodes the data optimally, the cross-entropy can be used as a measure to identify this network.

• As shown in [5], the cross-entropy between a true distribution P(X) and a distribution Q(X) is minimal if the underlying network generating Q(X) is a maximum weight spanning tree and the weights between each pair of nodes X_i and X_j are defined by the mutual information between them.

The mutual information between two nodes X_i and X_j is defined as:

    I(X_i; X_j) = \sum_{X_i, X_j} P(X_i, X_j) \log_2 \frac{P(X_i, X_j)}{P(X_i) P(X_j)}    (8)

which sums over all possible states of X_i and X_j. Given the mutual information between all variables, one can build a maximum weighted spanning tree and therefore approximate the maximum log-likelihood with respect to the data. The mutual information, moreover, can be applied to every single node.
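Equation (8) can be estimated from data by replacing the probabilities with relative frequencies; the following helper (our illustration, not part of the original implementation) does exactly that for two discrete columns:

    import math
    from collections import Counter

    def mutual_information(xs, ys):
        """I(X; Y) = sum over states (x, y) of P(x, y) * log2(P(x, y) / (P(x) * P(y)))  (equation 8)."""
        n = len(xs)
        count_x = Counter(xs)
        count_y = Counter(ys)
        count_xy = Counter(zip(xs, ys))
        mi = 0.0
        for (x, y), c in count_xy.items():
            p_xy = c / n
            mi += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
        return mi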

Lam and Bacchus evaluated their approach with networks of different sizes. In most of the cases the MDL score was able to reconstruct the original Bayesian network which generated the data for the learning process. In [3] they extended their approach to refining existing network structures, and particularly considered the encoding of changes between one model and a potentially better one.

However, an evaluation in [6] showed that the learning of classifiers with this approach leads to poor results. In that paper they argued that the reason for this is the scoring function itself. Since the MDL score favours simple networks, it tends to reduce relations between attribute variables and the class variable. In particular, for problems with many attributes, classifiers produce poor predictions, since important attributes are cut from the class node and are therefore not able to contribute directly to the classification. They considered learning with the MDL score as "unsupervised" learning, since the learning algorithm is given no information about which node represents the class. They also claimed that the learning of a classification problem with the Bayesian information criterion (BIC) suffers from the same problem. For these reasons, the classification problem has been tackled using the naive Bayes classifier, which is described in the next section.

Figure 2: The structure of the tree augmented naive Bayes classifier (TAN)

IMPROVING NAIVE BAYES

As we saw in the previous section, standard learning algorithms for Bayesian networks are insufficient to solve classification tasks. Therefore several efforts have been undertaken to improve the naive Bayes classifier for the classification of correlated data. Since the naive Bayes classifier comes with the strong assumption of independence, these approaches are motivated by relaxing this assumption. Basically this leads to a search for correlations between the attributes and methods to reflect them in the classifier. In the literature three significant approaches can be found:

• Subfeature selection
• The joining of attributes
• The tree augmented naive Bayes (TAN)

The first method is found in [14], who described a greedy search algorithm which excludes strongly correlated attributes from the classifier. This takes place in a forward selection manner, which starts with an empty set of attributes and incrementally adds new ones until a termination criterion is met. The selection of attributes takes place with respect to a metric which identifies attributes with a crucial contribution to the classification. The metric they used was leave-one-out cross validation, since it is the most precise measure for the accuracy of a classifier.

The second approach is described in [16] and tackles the problem in the opposite way. Rather than excluding correlated attributes, this approach joins them together to achieve higher classification accuracy. They evaluated the selection of attributes in a forward and backward manner, with two possible operations, which are to add an attribute to the classifier (respectively delete it) and to join an attribute with one already in the classifier. They also used the leave-one-out technique to indicate whether a change was successful or not.

The latest work in this field is the tree augmented naive Bayes (TAN) approach, described in [7]. This approach performs better than the other two and is also the motivation for our approach. It is based on the work of [5] and [8], which developed algorithms for building a maximum weighted spanning tree by the mutual information and the conditional mutual information respectively. Note, as we saw above, the first one also motivated the MDL principle.

The TAN algorithm builds a network structure depending on the mutual information between nodes. Basically it captures correlations between attributes by drawing arcs between them. However, this approach comes with intended restrictions, since it is oriented towards classification tasks. The first restriction is that every attribute is connected to the class variable, which yields the structure of naive Bayes. The second restriction is that every attribute may own at most one more parent besides the class variable (see Figure 2). The resulting structure improves the naive Bayes, since it can capture single relations between two attributes. On the other hand, it avoids a search through the space of all possible networks by these restrictions to the structure. The results reported for this approach are equal to or better than results reported for the naive Bayes.

We argue, however, that this approach comes with two crucial disadvantages:

• Only single correlations between attributes can be captured, due to the restriction to one additional parent besides the class node.
• Parents are chosen with respect to the maximum log-likelihood, but the resulting complexity is not considered.

Certainly the first is reasonable, since networks with n > 2 parents are more complex. However, suppose a network where the configuration of attribute A_4 depends on the configurations of A_1, A_2, A_3, which are independent of each other. The TAN Bayes would add the node with maximal influence on A_4 as parent and ignore the influence of the other two. In the worst case, these ignored attributes would even be connected as children to other nodes.

The second disadvantage becomes significant if there are nodes with plenty of states. The algorithm would favour networks which increase LL(D | B), regardless of their complexity. Note that the size of a node's conditional probability table (CPT), which holds all possible configurations of the node itself and its parents, grows drastically with the number of states of each parent. For a node with i states and two parents with j and k states, the CPT consists of i * j * k entries. We argue further that the predictions of the classifier become less reliable with more complex nodes.

Figure 3: An example network, as it might result from our approach

This comes from the fact that the probabilities of all configurations of a node and its parents have to be estimated from the data. These configurations, however, are less likely to be found in the data with increasing complexity. Thus their estimates are insufficient and lead to poor predictions. To overcome this problem, Friedman and Goldszmidt introduced a smoothing operation to fill the gap of unreliable probabilities.

LEARNING CLASSIFIERS WITH MULTIPLE CORRELATIONS AND LESS COMPLEXITY

Our approach can be found between the TAN architecture and the MDL approach for learning Bayesian networks. Furthermore it combines the advantages of both and excludes their disadvantages, which were previously identified in this paper. The algorithm can be characterised by two main features as follows. First, since our algorithm is supposed to be applied to classification tasks, and the class variable is usually known, we limit the number of possible networks to those whose attributes have the class node as parent. Second, we start to search for relations between these attributes, using the MDL score, which favours simple relations with maximum contribution to the log-likelihood.

The second feature applies a greedy search for each node, to avoid the huge search space of all valid parent combinations. The metric for this greedy search is the trade-off between the mutual information and the complexity caused by the change. We use the modified version of the mutual information from [2], which defines a weight W(A_i, pa(A_i)) for attribute A_i and its parents pa(A_i) with

    W(A_i, pa(A_i)) = \sum_{A_i, pa(A_i)} P(A_i, pa(A_i)) \log_2 \frac{P(A_i, pa(A_i))}{P(A_i) P(pa(A_i))}    (9)

Against this information measure stands the complexity C(A_i, pa(A_i)), resulting from this parent configuration, with

    C(A_i, pa(A_i)) = val(A_i) \prod_{A_j \in pa(A_i)} val(A_j)    (10)

Consider two parent configurations pa_1(A_i) and pa_2(A_i) for attribute A_i, where pa_2 is extended by one more attribute than pa_1. For a comparison of these two parent configurations, we compute the value pairs (W_1, C_1) and (W_2, C_2) for both respectively and compare the relative growth of complexity and weight:

    \frac{W_2}{W_1} > \alpha \frac{C_2}{C_1}    (11)

where \alpha denotes a weighting factor to control the trade-off between complexity and information gain. Using this formula a new parent configuration is accepted if the relative information exceeds the relative complexity.
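The acceptance test of equation (11) compares the relative growth of weight and complexity; a minimal sketch (our naming, with alpha standing for the weighting factor) looks as follows:

    def accept_new_parent(w_old, w_new, c_old, c_new, alpha=1.0):
        """Accept the extended parent set pa_2 if W_2 / W_1 > alpha * C_2 / C_1  (equation 11)."""
        return (w_new / w_old) > alpha * (c_new / c_old)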

The greedy search needs a sorted list for each node, indicating which of the other nodes are worth becoming parents. Thus we generate a list for each node, consisting of every possible parent A_j, and rank them by their weight W(A_i; V, A_j). Given this list, the algorithm successively adds a node from this list as parent and keeps it as parent if the achieved weight W exceeds the complexity C. In addition to the operations add arc and delete arc we use the operation reverse arc, which is necessary in the case that a node A_i favours another node A_j as parent, but A_i has already been chosen as parent for A_j. In this case we reverse this arc, depending on the weight of both. The computational complexity of this greedy search is O(n^2) for n attributes, since in the worst case all possible n - 1 parents are considered by every node. The algorithm for n attributes can be summarized as follows:

• Generate for every attribute a parent list P, corresponding to a naive Bayesian classifier with P_V = {} and P_i = {V} for every i ∈ {1, ..., n}.
• Repeat for every A_i with i ∈ {1, ..., n}:
  - Generate a list L, consisting of n - 1 entries which store the weights W(A_i; V, A_j) for every possible parent node A_j.
  - Sort L with ascending W.
  - Repeat for all A_j ∈ L:
    - Compute W_1 = W(A_i, P_{A_i}) and C_1 = C(A_i, P_{A_i}).
    - Add A_j to P_{A_i} (add arc).
    - Compute W_2 = W(A_i, P_{A_i}) and C_2 = C(A_i, P_{A_i}).
    - If W_2 / W_1 < \alpha C_2 / C_1: remove A_j from P_{A_i} (delete arc).
    - Else if A_i ∈ P_{A_j}: if W(A_i, P_{A_i}) > W(A_j, P_{A_j}), reverse the arc.
• Return the classifier.
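The following sketch outlines the greedy search in Python. It is our illustration of the steps above, not the authors' implementation; weight(i, pa) and complexity(i, pa) are assumed helpers implementing equations (9) and (10), and the handling of a zero initial weight is our own choice.

    def learn_structure(n, class_node, weight, complexity, alpha=1.0):
        """
        Greedy parent selection sketch for a Bayesian classifier with attributes 0..n-1.
        Starts from the naive Bayes structure (the class node is the only parent of every
        attribute) and adds, deletes, or reverses arcs guided by the weight/complexity
        trade-off of equation (11).  class_node is any identifier not in 0..n-1.
        """
        parents = {i: {class_node} for i in range(n)}    # P_i = {V} for every attribute
        parents[class_node] = set()                      # P_V = {}

        for i in range(n):
            # rank the other attributes by their weight W(A_i; V, A_j) as additional parents of i
            candidates = sorted((j for j in range(n) if j != i),
                                key=lambda j: weight(i, {class_node, j}))
            for j in candidates:
                w1, c1 = weight(i, parents[i]), complexity(i, parents[i])
                parents[i].add(j)                        # add arc
                w2, c2 = weight(i, parents[i]), complexity(i, parents[i])
                # accept the new parent only if the relative gain exceeds the relative complexity
                accepted = w1 == 0 or (w2 / w1) > alpha * (c2 / c1)
                if not accepted:
                    parents[i].remove(j)                 # delete arc
                elif i in parents[j]:
                    # the arc is requested in both directions: keep the direction with higher weight
                    if weight(i, parents[i]) > weight(j, parents[j]):
                        parents[j].remove(i)             # reverse arc
                    else:
                        parents[i].remove(j)
        return parents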

EXPERIMENTS

Methodology

The naive Bayes, the TAN Bayes and our approach have been evaluated, on the one hand, with training sets from the machine learning repository [1] and, on the other hand, with data artificially generated by Bayesian networks. The latter comes with the advantage that the underlying network is already known and thus the induced classifier can be compared with it. We built Bayesian networks with the commercial package Netica and sampled sufficient cases from them.

In line with other research papers, the accuracy of each classifier has been determined by leave-one-out cross validation [12]. In contrast to the less precise holdout method, where a classifier is induced with 2/3 of the training data and its accuracy measured with the other 1/3, this method induces a classifier with all samples but one and measures its accuracy with that sample, left out during the training. This process is repeated for all samples in the training data, and the accuracy is calculated as the number of correctly classified samples divided by the number of all samples. A detailed examination of the evaluation of classifiers is found in [12].
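Leave-one-out cross validation as described above can be written generically; the sketch below (our helper, assuming train(cases, labels) returns a function mapping a case to a predicted class) counts the correctly classified held-out samples:

    def leave_one_out_accuracy(cases, labels, train):
        """
        Induce a classifier on all samples but one, classify the left-out sample,
        repeat for every sample, and return the fraction classified correctly.
        """
        correct = 0
        for k in range(len(cases)):
            classify = train(cases[:k] + cases[k + 1:], labels[:k] + labels[k + 1:])
            if classify(cases[k]) == labels[k]:
                correct += 1
        return correct / len(cases)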

Results

Generally we expected from our results an accuracy better than or equal to the naive Bayes classifier. Furthermore we expected that the complexity of the generated classifier would be located between naive Bayes and TAN. Table 1 shows the properties of the data sets and training sets. The results with this data are found in Table 2, which lists the accuracy for the naive Bayes classifier, the tree augmented naive Bayes and the multiple Bayes.

As can be seen in Table 2, the multiple Bayes approach achieved for the first two data sets an accuracy equal to the naive Bayes. The TAN classifier, however, achieved in both sets less accuracy than the others. These sets have not been chosen arbitrarily; they both have very little correlation between their attributes. Thus our classifier preferred the most simple structure, which is the naive Bayes (no additional parents), and therefore achieved the same accuracy. Since the TAN approach ignores the balance between complexity and accuracy, it builds a classifier based on weak correlations between the attributes. The resulting parent-child connections are poorly supported by the data, which explains the loss of accuracy.

The third data set comes with 16 attributes. As we discovered with our classifier, two of these 16 attributes are significantly influenced by more than three others. This stands in contrast to the other 14 attributes, which are correlated with at most two others. Thus our classifier returned a structure where 14 attributes are connected to two or fewer parents, but the two strongly correlated nodes to more than three.

Data set   Attributes   Cases   States
breast         11        699      10
balance         5        625       5
votes          17        435       3
ABN             5       1000       2

Table 1: Properties of the used data sets

Data set   Naive Bayes     TAN             Multi Bayes
breast     97.42 ± 0.60    92.56 ± 1.00    97.42 ± 0.60
balance    92.16 ± 1.10    85.28 ± 1.42    92.16 ± 1.10
votes      90.34 ± 1.41    89.20 ± 1.61    92.42 ± 0.60
ABN        70.00 ± 1.45    70.90 ± 1.44    73.70 ± 1.41

Table 2: Accuracy (%) of the tested classifiers

Since the TAN approach is restricted to a single additional parent per attribute, it could not capture these multiple correlations. In contrast to this, our approach was able to achieve a higher accuracy. Furthermore, this could be reached with a slightly more complex classifier than the one returned by TAN, since not all attributes had been connected with a parent.

The last data set which we examined was generated with an artificial Bayesian network. The structure of this network was chosen to reflect multiple correlations, as in the previous example. We sampled 1000 cases from this network and used them with our classifier. The result was a classifier reflecting all correlations defined in the Bayesian network, which thus succeeded with the highest accuracy.

CONCLUSION

We proposed a new architecture for the induction of classifiers, based on Bayesian networks. Essentially this was carried out by the adoption of the MDL principle to the naive Bayes classifier.

Our results show that our refined classifier yields in all cases an accuracy equal to or better than that of naive Bayes, and furthermore it outperforms TAN in the case of data with weak or multiple correlations. Our assumption, that correlations are only worth modelling in a classifier if they are cheap in terms of complexity, has been reflected by these results. We intend to do further tests to cover a wide range of different data sets.

The complexity of our algorithm is equal to that of other approaches and computationally tractable. However, the calculation of the mutual information seems to limit the speed of the induction process significantly. Thus our approach, and all other methods based on the mutual information, are likely to induce classifiers slowly if attributes with numerous states are found in the data. To overcome this problem, a simpler method for the identification of correlations in the data has to be found. This method has to be capable of identifying statistical correlations, such as those represented by the parity problem, as well. This area is to be explored in the next stage.

ACKNOWLEDGEMENTS

We would like to thank David Emery and Bill Walley for their useful discussions and comments.

References

[1] D. W. Aha and K. Murphy. UCI repository of machine learning databases, 1995. http://www.ics.uci.edu/mlearn/MLRepository.html.
[2] F. Bacchus and W. Lam. Learning Bayesian Belief Networks: An Approach based on the MDL Principle. In Computational Intelligence, volume 10, pages 269-293, 1994.
[3] F. Bacchus and W. Lam. Using New Data to Refine a Bayesian Network. In Uncertainty in Artificial Intelligence, pages 383-390, 1994.
[4] D. M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian Networks: The combination of knowledge and statistical data. In Machine Learning, volume 20, pages 197-243, 1995.
[5] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. In IEEE Transactions on Information Theory, volume 14, pages 462-467, 1968.
[6] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. In Machine Learning, volume 29, pages 131-163, 1997.
[7] N. Friedman and M. Goldszmidt. Building Classifiers using Bayesian Networks. In Thirteenth National Conf. on Artificial Intelligence, 1996.
[8] D. Geiger. An entropy-based learning algorithm of Bayesian conditional trees. In UAI'92, pages 92-97, 1992.
[9] D. Heckerman. A Tutorial on Learning with Bayesian Networks. Techn. Report, Microsoft Research, Redmond, Washington, 1995.
[10] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press Limited, University College London, England, 1996.
[11] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Computer Science Techn. Report CMU-CS-96-118, Carnegie Mellon University, 1996.
[12] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI'95, pages 1137-1143, 1995.
[13] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331-339, San Francisco, Calif., 1995. Morgan Kaufmann.
[14] P. Langley and S. Sage. Induction of selective Bayesian classifiers. In UAI'94, pages 399-406, 1994.
[15] D. Lewis. Representation and learning in information retrieval. Dissertation, Dept. of Computer and Information Science, University of Massachusetts, United States of America, 1991. (COINS Technical Report 91-93.)
[16] M. J. Pazzani. Searching for dependencies in Bayesian classifiers. In Proc. of the 5th Int. Workshop on Artificial Intelligence and Statistics, 1995.
[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
