A New Algorithm for Learning Bayesian
Classifiers from Data
Alexander Kleiner and Bernadette Sharp
Post Print
N.B.: When citing this work, cite the original article.
Original Publication:
Alexander Kleiner and Bernadette Sharp, A New Algorithm for Learning Bayesian Classifiers
from Data, 2000, Artificial Intelligence and Soft Computing, 191-197.
Postprint available at: Linköping University Electronic Press
A NEW ALGORITHM FOR LEARNING BAYESIAN CLASSIFIERS FROM DATA
A. KLEINER
Staffordshire University
Beaconside, Stafford ST18 0AD, UK
a.kleiner@staffs.ac.uk
B. SHARP
Staffordshire University
Beaconside, Stafford ST18 0AD, UK
b.sharp@staffs.ac.uk
Abstract

We introduce a new algorithm for the induction of classifiers from data, based on Bayesian networks. Basically this problem has already been examined from two perspectives: first, the induction of classifiers by learning algorithms for Bayesian networks; second, the induction of classifiers based on the naive Bayesian classifier. Our approach is located between these two perspectives; it eliminates the disadvantages of both while exploiting their advantages. In contrast to recently appeared refinements of the naive Bayes classifier, which capture single correlations in the data, we have developed an approach which captures multiple correlations and furthermore trades off complexity against accuracy. In this paper we evaluate the implementation of our approach with data sets from the machine learning repository and data sets artificially generated by Bayesian networks.
Keywords: Machine Learning, Naive Bayes Classifier, Bayesian Networks, MDL principle
INTRODUCTION
In this paper we introduce a new algorithm for the induction of classifiers from data, based on Bayesian networks. Basically this problem has already been examined from two perspectives: first, the induction of classifiers by learning algorithms for Bayesian networks; second, the induction of classifiers based on the naive Bayesian classifier. Our approach is located between these two perspectives; it eliminates the disadvantages of both while exploiting their advantages.
The first induction of classifiers involves a search over all possible networks and has been successfully solved in [2] and [4]. However, it can be considered as unsupervised learning [7], since it does not distinguish between attribute variables and the class variable. Thus the results for a classification task are not sufficiently accurate. The second induction approach is based on the refinement of the naive Bayes classifier, which has already proved its power for classification in many applications [13]. Due to the fact that this classifier comes with the strong assumption of independence, refinements are achieved by relaxing this assumption.

Significant work in that field is found in [16], [14] and [7]. The latter approach improves the naive Bayes classifier by capturing single dependencies between the attributes. Our approach is motivated by this one but extends it by two new features: the possibility of learning multiple correlations between attributes, and a trade-off between complexity and accuracy. We argue that both features are important: on one hand, because data from real-world applications is likely to have multiple correlations between its variables, and on the other hand, because the application of classifiers to real-world problems requires fast computation. This computation, however, depends strongly on the complexity of the classifier. To realize these two features, we adopted the minimum description length (MDL) principle [2], which is a technique used in the general learning of Bayesian networks.
We shall denote variables that refer, for example, to attributes in a classification task, with capital letters, such as $A, B, C$, and particular configurations of these variables in lower case, such as $a, b, c$. A set of variables is denoted in bold, for example, $\mathbf{U} = \{A, B, C\}$.
Figure 1: The structure of the naive Bayes classifier (the class node $v_i$ is the parent of every attribute $a_1, a_2, \ldots, a_n$)
A classifier maps cases to classes; this mapping depends on particular configurations of the attributes and has to be learned by the classifier. A case is represented by the attributes $(A_1, A_2, \ldots, A_n)$ and the class $V$. Every attribute $A_i$ can be in a certain state $A_i = a_i$ from its domain of $N_{A_i}$ possible states. Each configuration $\mathbf{A}$ of these attributes belongs to a class $v_i$ from the set of classes $V$. The task is to learn a target mapping from each configuration to one of these classes. Finally, the quality of the induced classifier can be assessed by its ability to classify unknown configurations to an appropriate $v_i$.
NAIVE BAYESIAN CLASSIFIER
Among other techniques, the naive Bayesian classifier (or simply naive Bayes) is one of the most powerful tools in machine learning. It can compete with other classifiers, such as backpropagation or ID3, though its structure is less complex. Its power for text classification has been proven in [15], [11] and [13].

The Bayesian approach to achieve a mapping between classes and attributes is to identify the class with the highest probability for a particular configuration of the attributes. In statistical terms, the class identified in this way is named the maximum a posteriori (MAP) hypothesis:

$$v_{MAP} = \arg\max_{v_i \in V} P(v_i \mid a_1, a_2, \ldots, a_n) \quad (1)$$

Applying Bayes' theorem, this yields:

$$v_{MAP} = \arg\max_{v_i \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_i)\, P(v_i)}{P(a_1, a_2, \ldots, a_n)} \quad (2)$$

and, since $P(a_1, a_2, \ldots, a_n)$ is constant over all classes, this becomes:

$$v_{MAP} = \arg\max_{v_i \in V} P(a_1, a_2, \ldots, a_n \mid v_i)\, P(v_i) \quad (3)$$
This describes an approach for a correct classification of attributes with respect to their probabilities, estimated from the training data. The estimation of these probabilities, however, becomes intractable with an increasing number of attributes, since the number of possible configurations of these attributes, also known as "atomic events", grows drastically. To overcome this problem, naive Bayes comes with the "naive" underlying assumption that every attribute $A_i$ is independent from the others, whereby the number of required probability values is largely reduced. Under the assumption of independence, the conjunction of the attributes can be decomposed into a product of the probabilities of each single attribute: $P(a_1, a_2, \ldots, a_n \mid v_i) = \prod_j P(a_j \mid v_i)$, which yields the naive Bayes classifier:

$$v_{NB} = \arg\max_{v_i \in V} P(v_i) \prod_j P(a_j \mid v_i) \quad (4)$$
In other words, this learning method involves a learning step, where the estimates for all $P(v_i)$ and $P(a_j \mid v_i)$ are determined from their frequencies in the training set by simply counting their occurrences. An induced classifier can then be used to classify any configuration of the attributes by multiplying, for every class $v_i$, the probabilities $P(a_j \mid v_i)$ of each attribute and selecting the class which yields the highest probability.
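To make the counting-based learning step concrete, here is a minimal sketch of equation (4) in Python. The class interface, the variable names and the add-one (Laplace) smoothing are our own illustrative choices, not part of the original algorithm:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal naive Bayes learned by frequency counting, classifying by Eq. (4)."""

    def fit(self, cases, classes):
        self.total = len(classes)
        self.class_counts = Counter(classes)          # occurrences of each class v_i
        self.attr_counts = Counter()                  # occurrences of (j, a_j, v_i)
        self.domains = [set() for _ in cases[0]]      # observed states per attribute
        for case, v in zip(cases, classes):
            for j, a in enumerate(case):
                self.attr_counts[(j, a, v)] += 1
                self.domains[j].add(a)
        return self

    def predict(self, case):
        """Return argmax_v P(v) * prod_j P(a_j | v), computed in log space."""
        def score(v):
            s = math.log(self.class_counts[v] / self.total)
            for j, a in enumerate(case):
                # add-one smoothing keeps unseen states from zeroing the product
                num = self.attr_counts[(j, a, v)] + 1
                den = self.class_counts[v] + len(self.domains[j])
                s += math.log(num / den)
            return s
        return max(self.class_counts, key=score)
```

A classifier is induced with clf = NaiveBayes().fit(cases, classes) and applied with clf.predict(case).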
The performance of this simple approach has been measured in various applications. One interesting example is the classification of newsgroups, as reported in [11]. In this work, 20 newsgroups, each with 1000 articles, have been classified. The classes $v_i$ were given by the names of these 20 newsgroups, for example comp.sys.ibm.pc.hardware, and the attributes by words from the English language appearing in those articles. The experiment led to a remarkable result of 89% accuracy, in contrast to a random classification with an expected accuracy of 5%. Noteworthy, however, is that the assumption of conditional independence was not necessarily kept by the data. One can imagine that in the case of classification of texts in natural language, conditional dependencies must exist. For instance, it is likely to find the word "Intelligence" after the word "Artificial", or to find the word "Naive" before the word "Bayes". However, recent results showed that the naive Bayes classifier performs well even when this assumption is violated.

This leads to the obvious question whether we can achieve even better performance by using networks which consider dependencies in the data. Bayesian networks [17] provide a method to represent such dependencies between variables, and there are approaches to learn their structure and parameters from data.
LEARNING BAYESIAN NETWORKS
FOR CLASSIFICATION
Bayesian Networks
A Bayesian network $B$ for a set of random variables $\mathbf{U}$ is defined by a structure $S$, describing a directed acyclic graph, and a set of parameters $\Theta$, quantifying this structure. The structure is represented by arcs between the random variables $X_1, X_2, \ldots, X_n$ in $\mathbf{U}$, which indicate direct dependencies between them. Furthermore, the set of parameters provides, for every configuration of a node $X_i$ and its parents $pa(X_i)$, the probability $\theta_{X_i \mid pa(X_i)} = P_B(X_i \mid pa(X_i))$ of this particular configuration. Thus the joint probability distribution over $\mathbf{U}$ can be reconstructed by the multiplication of each node's probabilities:

$$P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid pa(X_i)) \quad (5)$$

If $pa(X_i)$ consisted only of the class variable $V$ for every $i \in \{1, 2, \ldots, n\}$ and $pa(V) = \emptyset$, the above would describe a Bayesian network for a naive Bayes classifier.
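To illustrate equation (5), the following sketch stores a small network as parent lists and conditional probability tables and multiplies the node probabilities; the dictionary-based representation and the numbers are illustrative assumptions only:

```python
# Structure: parent tuple per node; parameters: P_B(node_state | parent_states)
parents = {"A": (), "B": ("A",), "C": ("A",)}
cpt = {
    "A": {((), 0): 0.6, ((), 1): 0.4},
    "B": {((0,), 0): 0.9, ((0,), 1): 0.1, ((1,), 0): 0.3, ((1,), 1): 0.7},
    "C": {((0,), 0): 0.8, ((0,), 1): 0.2, ((1,), 0): 0.5, ((1,), 1): 0.5},
}

def joint_probability(assignment):
    """Eq. (5): P_B(X_1, ..., X_n) = prod_i P_B(X_i | pa(X_i))."""
    p = 1.0
    for node, pa in parents.items():
        pa_states = tuple(assignment[q] for q in pa)
        p *= cpt[node][(pa_states, assignment[node])]
    return p

print(joint_probability({"A": 1, "B": 0, "C": 1}))  # 0.4 * 0.3 * 0.5 = 0.06
```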
However, we are able to express far more complex relationships within $\mathbf{U}$. Basically these relationships are about dependence and independence between these variables. Let $\mathbf{A}, \mathbf{B}, \mathbf{C}$ be subsets of $\mathbf{U}$. Then there is conditional independence between $\mathbf{A}$ and $\mathbf{C}$ given $\mathbf{B}$ if $P(\mathbf{A} \mid \mathbf{B}) = P(\mathbf{A} \mid \mathbf{B}, \mathbf{C})$ holds whenever $P(\mathbf{B}, \mathbf{C}) > 0$. That is, when the state of $\mathbf{B}$ is known, no knowledge about $\mathbf{C}$ will alter the probability of $\mathbf{A}$ [10]. Of course this implies that this holds for every possible configuration $\mathbf{a}, \mathbf{b}, \mathbf{c}$ of the subsets $\mathbf{A}, \mathbf{B}, \mathbf{C}$. In Bayesian networks, this independence is encoded by the following definition: every variable $X_i$ is independent of its nondescendants given its parents [17].
The learning of structure and parameters
This problem can be solved by a search over all possible networks and an estimation of the parameters. To identify the network which matches the data best, a commonly used method is to calculate the log-likelihood of $B$ given $D$. Let $B = \{S, \Theta\}$ be a Bayesian network for the data set $D = \{d_1, d_2, \ldots, d_N\}$, where each $d_i$ assigns a value to every variable in $B$. Then

$$LL(B \mid D) = \sum_{i=1}^{N} \log P_B(d_i) \quad (6)$$

measures the probability that the data $D$ was generated by the network $B$. That means, the bigger $LL(B \mid D)$, the more likely the examined network can represent the underlying distribution of $D$. Unfortunately this measure is not appropriate in its pure form for learning Bayesian networks, since it favours complex networks, which lead to a higher log-likelihood than simple ones.
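Equation (6) then follows directly by summing log-probabilities over the data set; this helper reuses the joint_probability() sketch above and is likewise only illustrative:

```python
import math

def log_likelihood(data):
    """Eq. (6): LL(B | D) = sum_i log P_B(d_i)."""
    return sum(math.log(joint_probability(d)) for d in data)

data = [{"A": 1, "B": 0, "C": 1}, {"A": 0, "B": 0, "C": 0}]
print(log_likelihood(data))  # log(0.06) + log(0.432) ≈ -3.65
```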
However, it has been shown that the number of possible structures increases drastically with the number of nodes in the network. Heckerman commented on this: "If we consider Bayesian network models with n variables, the number of possible structure hypotheses is more than exponential in n" [9]. Therefore, it is impossible to consider all of these models during a search. A common technique to avoid the consideration of all possible solutions of a problem is to perform a greedy search, which usually leads to a local maximum in the search space. To apply such a greedy search to a Bayesian network, a scoring function is necessary which returns a value for locally applied changes.

Suppose $A_i$ is a node in a network with $n$ variables. Consequently there are $n-1$ possible parents for this node and $2^{n-1}$ possible combinations of them. Instead of applying all these combinations for every node, a greedy search could be implemented which performs the operations add parent and delete parent, guided by a scoring function. A greedy search works on the principle of not reconsidering operations done in previous steps. This finally leads to a locally optimal solution, as long as the score indicates the optimal operation for every step.
Since the scoring function needs to be applicable locally, the log-likelihood, which returns a value corresponding to the whole network, cannot be used. Furthermore, as mentioned above, this measure tends to favour complex structures, which we want to avoid. To solve these problems, two metric functions have been introduced, namely the Bayesian Information Criterion (BIC) [4] and the Minimum Description Length (MDL) criterion [2]. Both of these functions return a score which maximises the log-likelihood, however with a restriction by the complexity. Since these functions are similar in principle, we focus on the MDL score, which also motivates our approach.
Minimum description length (MDL) principle
The MDL score consists of two parts, which are the previously introduced log-likelihood and the complexity of the model. The approach selects a model within a trade-off between these two components. The complexity of a network can be expressed by the number of bits necessary for its representation. Suppose there are $n$ nodes in a network, each with $k$ parents; then the parents of a node can be encoded with $k \log_2(n)$ bits. Furthermore, the conditional probability tables associated with each node have to be encoded as well. For a node in a Bayesian network, one probability is needed for every possible configuration of the parents and the node itself. The information necessary to encode a Bayesian network with $n$ nodes is:

$$\sum_{i=1}^{n} \left[ k_i \log_2(n) + d\,(val(X_i) - 1) \prod_{X_j \in pa(X_i)} val(X_j) \right] \quad (7)$$

where $d$ denotes the number of bits necessary to represent the numeric value of a probability, $val(X)$ the number of possible states node $X$ can take, and $pa(X)$ denotes its set of parents. This is the encoding scheme suggested in [2]. Since this formula sums over all nodes in the network, it can easily be decomposed for a single node.
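A direct transcription of equation (7) might look as follows; the bits-per-probability constant d = 16 is an arbitrary assumption for the example:

```python
import math

def description_length(parents, val, d=16):
    """Eq. (7): bits needed to encode the network structure and its CPTs.

    parents: dict mapping each node to a tuple of parent nodes
    val:     dict mapping each node to its number of possible states
    """
    n = len(parents)
    bits = 0.0
    for node, pa in parents.items():
        parent_configs = math.prod(val[p] for p in pa)  # 1 if the node has no parents
        bits += len(pa) * math.log2(n)                  # k_i * log2(n): encode the parent set
        bits += d * (val[node] - 1) * parent_configs    # encode the CPT entries
    return bits

# The small binary network from the earlier sketch:
print(description_length({"A": (), "B": ("A",), "C": ("A",)}, {"A": 2, "B": 2, "C": 2}))
```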
The second part of the MDL score is a measure of how well the network represents the data. However, the log-likelihood cannot be used directly, since it cannot be computed locally for a single node. Instead, two observations are exploited:

- The Kullback-Leibler cross-entropy between the true distribution $P(\mathbf{X})$ and the distribution $Q(\mathbf{X})$ generated by a Bayesian network shrinks as $Q(\mathbf{X})$ more closely approximates $P(\mathbf{X})$. Due to the fact that a network which generates the true distribution also encodes the data optimally, the cross-entropy can be used as a measure to identify this network.

- As shown in [5], the cross-entropy between a true distribution $P(\mathbf{X})$ and a distribution $Q(\mathbf{X})$ is minimal if the underlying network generating $Q(\mathbf{X})$ is a maximum weight spanning tree and the weights between each pair of nodes $X_i$ and $X_j$ are defined by the mutual information between them.
The mutual information between two nodes $X_i$ and $X_j$ is defined as:

$$I(X_i; X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log_2 \frac{P(x_i, x_j)}{P(x_i)\,P(x_j)} \quad (8)$$

which sums over all possible states of $X_i$ and $X_j$. Given the mutual information between all variables, one can build a maximum weighted spanning tree and thereby approximate the maximum log-likelihood with respect to the data. The mutual information, moreover, can be applied to every single node.
Lam and Bacchus evaluated their approach with networks of different sizes. In most of the cases the MDL score was able to reconstruct the original Bayesian network which generated the data for the learning process. In [3] they extended their approach to refining existing network structures, and particularly considered the encoding of changes between one model and a potentially better one.

However, an evaluation in [6] showed that the learning of classifiers with this approach leads to poor results. In that paper they argued that the reason for this is the scoring function itself. Since the MDL score favours simple networks, it tends to reduce relations between attribute variables and the class variable. In particular, for problems with many attributes, classifiers produce poor predictions, since important attributes are cut from the class node and are therefore not able to contribute directly to the classification. They considered learning with the MDL score as "unsupervised" learning, since the learning algorithm is given no information about which node represents the class. They also claimed that the learning of a classification problem with the Bayesian information criterion (BIC) suffers from the same problem. For these reasons, the classification problem has been tackled using the naive Bayes classifier, which is described in the next section.
Figure 2: The structure of the tree augmented naive Bayes classifier (TAN)
IMPROVING NAIVE BAYES
As we saw in the previous section, standard learning algorithms for Bayesian networks are insufficient to solve classification tasks. Therefore several efforts have been undertaken to improve the naive Bayes classifier for the classification of correlated data. Since the naive Bayes classifier comes with the strong assumption of independence, these approaches are motivated by relaxing this assumption. Basically this leads to a search for correlations between the attributes and methods to reflect them in the classifier. In the literature three significant approaches can be found:

- Subfeature selection
- The joining of attributes
- The Tree Augmented Naive Bayes (TAN)
The first method is found in [14], which describes a greedy search algorithm that excludes strongly correlated attributes from the classifier. This takes place in a forward selection manner, which starts with an empty set of attributes and incrementally adds new ones until a termination criterion is met. The selection of attributes takes place with respect to a metric which identifies attributes with a crucial contribution to the classification. The metric they used was leave-one-out cross validation, since it is the most precise measure for the accuracy of a classifier.

The second approach is described in [16] and tackles the problem in the opposite way. Rather than excluding correlated attributes, this approach joins them together to achieve higher classification accuracy. They evaluated the selection of attributes in a forward and backward manner, with two possible operations, which are to add an attribute to the classifier (respectively delete it) and to join an attribute with one in the classifier. They also used the leave-one-out technique to indicate whether a change was successful or not.
The latest work in this field is the tree augmented naive Bayes (TAN) approach, described in [7]. This approach performs better than the other two and is also the motivation for our approach. It is based on the work from [5] and [8], who developed algorithms for building a maximum weighted spanning tree by the mutual information and the conditional mutual information respectively. Note, as we saw above, the first one also motivated the MDL principle.

The TAN algorithm builds a network structure depending on the mutual information between nodes. Basically it captures correlations between attributes by drawing arcs between them. However, this approach comes with intended restrictions, since it is goal-oriented towards classification tasks. The first restriction is that every attribute is connected to the class variable, which yields the structure of naive Bayes. The second restriction is that every attribute may own at most one more parent besides the class variable (see Figure 2). The resulting structure improves on naive Bayes, since it can capture single relations between two attributes. On the other hand, these restrictions to the structure avoid a search through the space of all possible networks. The results reported for this approach are equal to or better than results reported for naive Bayes.
We argue, however, that this approach comes with two crucial disadvantages:

- Only single correlations between attributes can be captured, due to the restriction to one additional parent besides the class node.
- Parents are chosen with respect to the maximum log-likelihood, but the resulting complexity is not considered.
Certainly the first is reasonable, since networks with $n > 2$ parents are more complex. However, suppose a network where the configuration of attribute $A_4$ depends on the configurations of $A_1, A_2, A_3$, which are independent from each other. The TAN Bayes would add the node with maximal influence on $A_4$ as parent and ignore the influence of the other two. In the worst case, these ignored attributes would even be connected as children to other nodes.

The second disadvantage becomes significant if there are nodes with plenty of states. The algorithm would favour networks which increase $LL(D \mid B)$, regardless of their complexity. Note that the size of a node's conditional probability table (CPT), which holds all possible configurations of the node itself and its parents, grows drastically with the number of states of each parent. For a node with $i$ states and two parents with $j$ and $k$ states, the CPT consists of $i \cdot j \cdot k$ entries. We argue further that the predictions of the classifier become less reliable with more complex nodes. This comes from the fact that the probabilities of all these configurations have to be estimated from the data.
Figure 3: An example network, as it might result from our approach
These configurations, however, are less likely to be found in the data with increasing complexity. Thus their estimates are insufficient and lead to poor predictions. To overcome this problem, Friedman and Goldszmidt introduced a smoothing operation to fill the gap of unreliable probabilities.
LEARNING CLASSIFIERS WITH MULTIPLE CORRELATIONS AND LESS COMPLEXITY
Our approach can be found between the TAN architecture and the MDL approach for learning Bayesian networks. Furthermore it combines the advantages of both and excludes their disadvantages, which were previously identified in this paper. The algorithm can be characterised by two main features as follows. First, since our algorithm is supposed to be applied to classification tasks, and the class variable is usually known, we limit the number of possible networks to those whose attributes have the class node as parent. Second, we start to search for relations between these attributes, using the MDL score, which favours simple relations with maximum contribution to the log-likelihood.
The second feature applies a greedy search for each node, to avoid the huge search space of all valid parent combinations. The metric for this greedy search is the trade-off between the mutual information and the complexity caused by the change. We used the modified version of the mutual information from [2], which defines a weight $W(A_i, pa(A_i))$ for attribute $A_i$ and its parents $pa(A_i)$ by

$$W(A_i, pa(A_i)) = \sum_{A_i, pa(A_i)} P(A_i, pa(A_i)) \log_2 \frac{P(A_i, pa(A_i))}{P(A_i)\,P(pa(A_i))} \quad (9)$$

Against this information measure stands the complexity $C(A_i, pa(A_i))$, resulting from this parent configuration, with

$$C(A_i, pa(A_i)) = val(A_i) \prod_{A_j \in pa(A_i)} val(A_j) \quad (10)$$

Define $pa_1(A_i)$ and $pa_2(A_i)$ for attribute $A_i$, where $pa_2$ is extended by one more attribute than $pa_1$. For a comparison of these two parent configurations, we compute the value pairs $(W_1, C_1)$ and $(W_2, C_2)$ for both respectively and compare the relative growth of complexity and weight:

$$\frac{W_2}{W_1} > \lambda\,\frac{C_2}{C_1} \quad (11)$$

where $\lambda$ denotes a weighting factor to control the trade-off between complexity and information gain. Using this formula, a new parent configuration is accepted if the relative information gain exceeds the relative growth in complexity.
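A sketch of the weight, the complexity and the acceptance test of equations (9)-(11), reusing mutual_information() from above; the column-index interface and the default weighting factor are illustrative assumptions:

```python
def weight(data, i, parent_idxs, class_idx):
    """Eq. (9): mutual information between attribute i and its parent set,
    with the class and all parents merged into one compound symbol."""
    xs = [row[i] for row in data]
    ys = [tuple(row[j] for j in (class_idx, *parent_idxs)) for row in data]
    return mutual_information(xs, ys)

def complexity(val, i, parent_idxs, class_idx):
    """Eq. (10): val(A_i) times the product of the parents' state counts."""
    c = val[i] * val[class_idx]
    for j in parent_idxs:
        c *= val[j]
    return c

def accept_new_parent(data, val, i, old_pa, new_pa, class_idx, lam=1.0):
    """Eq. (11): accept if the relative weight gain exceeds lam times the
    relative complexity growth."""
    w1, w2 = weight(data, i, old_pa, class_idx), weight(data, i, new_pa, class_idx)
    c1, c2 = complexity(val, i, old_pa, class_idx), complexity(val, i, new_pa, class_idx)
    return w2 / w1 > lam * (c2 / c1) if w1 > 0 else w2 > 0
```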
The greedy search needs a sorted list for each node, indicating which of the other nodes are worth becoming parents. Thus we generate a list for each node, consisting of every possible parent $A_j$, and rank them by their weight $W(A_i; V, A_j)$. Given this list, the algorithm successively adds a node from this list as parent and keeps it as parent if the achieved weight $W$ exceeds the complexity $C$. In addition to the operations add arc and delete arc, we use the operation reverse arc, which is necessary in the case that a node $A_i$ favours another node $A_j$ as parent, but $A_i$ has already been chosen as parent for $A_j$. In this case we reverse this arc, depending on the weight of both. The computational complexity of this greedy search is $O(n^2)$ for $n$ attributes, since in the worst case all possible $n-1$ parents are considered by every node. The algorithm for $n$ attributes can be summarized as follows:
Generate for every attribute a parent list $P$, corresponding to a naive Bayesian classifier with $P_V = \{\}$ and $P_i = \{V\}$ for every $i \in \{1 \ldots n\}$
Repeat for every $A_i$ with $i \in \{1 \ldots n\}$:
    Generate a list $L$, consisting of $n-1$ entries which store the weights $W(A_i; V, A_j)$ for every possible parent node $A_j$
    Sort $L$ with ascending $W$
    Repeat for all $A_j \in L$:
        Compute $W_1 = W(A_i, P_{A_i})$ and $C_1 = C(A_i, P_{A_i})$
        Add $A_j$ to $P_{A_i}$ (add arc)
        Compute $W_2 = W(A_i, P_{A_i})$ and $C_2 = C(A_i, P_{A_i})$
        If $W_2 / W_1 < \lambda\, C_2 / C_1$:
            Remove $A_j$ from $P_{A_i}$ (delete arc)
        Else if $A_i \in P_{A_j}$:
            If $W(A_i, P_{A_i}) > W(A_j, P_{A_j})$:
                Reverse arc
Return classifier
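A condensed Python rendering of this loop, building on the helpers from the previous sketch; the bookkeeping for the reverse-arc case is simplified and illustrative:

```python
def learn_structure(data, val, n_attrs, class_idx, lam=1.0):
    """Greedy parent search per attribute: start from naive Bayes (class as
    sole parent) and keep only arcs that pass the test of Eq. (11)."""
    parents = {i: [] for i in range(n_attrs)}  # extra parents besides the class
    for i in range(n_attrs):
        # Rank candidate parents A_j by the weight W(A_i; V, A_j)
        candidates = sorted(
            (j for j in range(n_attrs) if j != i and j != class_idx),
            key=lambda j: weight(data, i, [j], class_idx),
        )
        for j in candidates:
            old_pa, new_pa = parents[i], parents[i] + [j]
            if not accept_new_parent(data, val, i, old_pa, new_pa, class_idx, lam):
                continue  # delete arc: the extension does not pay off
            if i in parents[j]:
                # reverse arc: keep the direction with the higher weight
                if weight(data, i, new_pa, class_idx) > weight(data, j, parents[j], class_idx):
                    parents[j].remove(i)
                    parents[i] = new_pa
            else:
                parents[i] = new_pa  # add arc
    return parents
```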
EXPERIMENTS

Methodology

The naive Bayes, the TAN Bayes and our approach have been evaluated with, on one hand, training sets from the machine learning repository [1] and, on the other hand, data artificially generated by Bayesian networks. The latter comes with the advantage that the underlying network is already known and thus the induced classifier can be compared with it. We built Bayesian networks with the commercial package Netica and sampled a sufficient number of cases from them.
In line with other research papers, the accuracy of each classifier has been determined by leave-one-out cross validation [12]. In contrast to the less precise holdout method, where a classifier is induced with 2/3 of the training data and its accuracy measured with the other 1/3, this method induces a classifier with all samples less one and measures its accuracy with that sample, left out during the training. This process is repeated for all samples in the training data, and the accuracy is calculated as the number of correctly classified samples divided by the number of all samples. A detailed examination of the evaluation of classifiers is found in [12].
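A compact sketch of this evaluation loop; the classifier interface (fit/predict, as in the naive Bayes sketch earlier) is our own assumption:

```python
def leave_one_out_accuracy(cases, classes, make_classifier):
    """Train on all samples but one, test on the held-out sample, repeat for
    every sample, and return the fraction of correct classifications."""
    correct = 0
    for k in range(len(cases)):
        train_x = cases[:k] + cases[k + 1:]
        train_y = classes[:k] + classes[k + 1:]
        clf = make_classifier().fit(train_x, train_y)
        correct += clf.predict(cases[k]) == classes[k]
    return correct / len(cases)

# Usage: leave_one_out_accuracy(cases, classes, NaiveBayes)
```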
Results
Generally we expected from our results an accuracy better than or equal to that of the naive Bayes classifier. Furthermore we expected that the complexity of the generated classifier would be located between naive Bayes and TAN. Table 1 shows the properties of the used data sets. The results with this data are found in Table 2, which lists the accuracy for the naive Bayes classifier, the tree augmented naive Bayes and our multiple Bayes.
As can be seen in Table 2, the multiple Bayes approach has achieved for the first two data sets an accuracy equal to that of naive Bayes. The TAN classifier, however, achieved in both sets a lower accuracy than the others. These sets have not been chosen arbitrarily; they both have very little correlation between their attributes. Thus our classifier preferred the simplest structure, which is the naive Bayes (no additional parents), and thus achieved the same accuracy. Since the TAN approach ignores the balance between complexity and accuracy, it builds a classifier based on weak correlations between the attributes. The resulting parent-child connections are poorly supported by the data, which explains the loss of accuracy.

The third data set comes with 16 attributes. As we discovered with our classifier, two of these 16 attributes are significantly influenced by more than three others. This stands in contrast to the other 14 attributes, which are correlated with at most two others. Thus our classifier returned a structure where 14 attributes are connected to two or fewer parents, and the two strongly correlated nodes to more than three. Since the TAN approach is restricted to one additional parent per attribute, it cannot capture such multiple correlations; in contrast to this, our approach was able to achieve a higher accuracy. Furthermore, this could be reached with a classifier only slightly more complex than the one returned by TAN, since not all attributes had been connected with a parent.
Data set   Attributes   Cases   States
breast         11        699      10
balance         5        625       5
votes          17        435       3
ABN             5       1000       2

Table 1: Properties of the used data sets
Data set   Naive Bayes     TAN             Multi Bayes
breast     97.42 ± 0.60    92.56 ± 1.00    97.42 ± 0.60
balance    92.16 ± 1.10    85.28 ± 1.42    92.16 ± 1.10
votes      90.34 ± 1.41    89.20 ± 1.61    92.42 ± 0.60
ABN        70.00 ± 1.45    70.90 ± 1.44    73.70 ± 1.41

Table 2: Accuracy of the tested classifiers
The last data set which we examined was generated with an artificial Bayesian network. The structure of this network was chosen to reflect multiple correlations, as in the previous example. We sampled 1000 cases from this network and used them with our classifier. The result was a classifier reflecting all correlations defined in the Bayesian network, and it thus succeeded with the highest accuracy.
CONCLUSION
We proposed a new architecture for the induction of classifiers, based on Bayesian networks. Essentially this was carried out by the adoption of the MDL principle to the naive Bayes classifier.

Our results show that our refined classifier yields in all cases an accuracy equal to or better than that of naive Bayes, and furthermore it outperforms TAN in the case of data with weak or multiple correlations. Our assumption that correlations are only worth modelling in a classifier if they are cheap in terms of complexity has been reflected by these results. We intend to do further tests, to cover a wide range of different data sets.
The complexity of our algorithm is equal to that of other approaches and computationally tractable. However, the calculation of the mutual information seems to limit the speed of the induction process significantly. Thus our approach, and all other methods based on the mutual information, are likely to induce classifiers slowly if attributes with numerous states are found in the data. To overcome this problem, a simpler method for the identification of correlations in the data has to be found. This method has to be capable of identifying statistical correlations such as those posed by the parity problem as well. This area is to be explored in the next stage.
ACKNOWLEDGEMENTS
We would like to thank David Emery and Bill Walley for their useful discussions and comments.
References
[1] D. W. Aha and K. Murphy. UCI repository of machine learning databases, 1995. http://www.ics.uci.edu/mlearn/MLRepository.html.

[2] F. Bacchus and W. Lam. Learning Bayesian Belief Networks: An Approach based on the MDL Principle. In Computational Intelligence, volume 10, pages 269-293, 1994.

[3] F. Bacchus and W. Lam. Using New Data to Refine a Bayesian Network. In Uncertainty in Artificial Intelligence, pages 383-390, 1994.

[4] D. M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian Networks: The combination of knowledge and statistical data. In Machine Learning, volume 20, pages 197-243, 1995.

[5] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. In IEEE Transactions on Information Theory, volume 14, pages 462-467, 1968.

[6] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. In Machine Learning, volume 29, pages 131-163, 1997.

[7] N. Friedman and M. Goldszmidt. Building Classifiers using Bayesian Networks. In Thirteenth National Conf. on Artificial Intelligence, 1996.

[8] D. Geiger. An entropy-based learning algorithm of Bayesian conditional trees. In UAI'92, pages 92-97, 1992.

[9] D. Heckerman. A Tutorial on learning with Bayesian networks. Techn. Report, Microsoft Research, Redmond, Washington, 1995.

[10] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press Limited, University College London, England, 1996.

[11] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Computer Science Techn. Report CMU-CS-96-118, Carnegie Mellon University, 1996.

[12] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI'95, pages 1137-1143, 1995.

[13] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331-339, San Francisco, Calif., 1995. Morgan Kaufmann.

[14] P. Langley and S. Sage. Induction of selective Bayesian classifiers. In UAI'94, pages 399-406, 1994.

[15] D. Lewis. Representation and learning in information retrieval. Dissertation, Dept. of Computer and Information Science, University of Massachusetts, United States of America, 1991. (COINS Technical Report 91-93).

[16] M. J. Pazzani. Searching for dependencies in Bayesian classifiers. In Proc. of the 5th Int. Workshop on Artificial Intelligence and Statistics, 1995.

[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.