Part I: Discrete and semi-discrete Data Matrices

Stefan Arnborg
Swedish Institute of Computer Science
SICS TR T99:08, ISSN 1100-3154, ISRN: SICS-T99/08-SE

Abstract

This tutorial summarises the use of Bayesian analysis and Bayes factors for finding significant properties of discrete (categorical and ordinal) data. It overviews methods for finding dependencies and graphical models, latent variables, robust decision trees and association rules.
1 Introduction
Data mining is complementary to Bayesian data analysis. Whereas data mining is often seen as the problem of grinding through massive data sets for the purpose of finding unexpected dependencies in the form of correlations, association rules and segmentations, Bayesian data analysis is typically seen as an activity of evaluating detailed models for small data sets. We are interested in the middle ground, where data is scarce enough to pose delicate questions of validity and significance of our findings, but where we do not yet have detailed mathematical models. We are developing tools and methodology for exploratory analysis of small and fragile data sets, as a preparatory step for a more detailed analysis, as can be performed in the Bayesian framework with, e.g., the BUGS system [33].
The application area is human brain research. Here, many different types of data are recorded for patients and for healthy control persons. Besides results of established and well standardized tests and background data, many results from imaging investigations (measuring cell structure, blood flow, receptor presence, etc.) are entered as extracted features of images mapped to brain atlases. Genetic data related to brain development is also emerging. Some data entered are uncertain, others are being standardized. We seldom have a complete data set for any individual, since the data collection process is costly and often infeasible for patients in bad condition. The objective of data mining on these data is a deeper understanding of the interplay between physiological and psychiatric conditions, and also improved procedures for diagnosing patients and choosing therapies.
The purpose of this report is to explain the advantage of the Bayesian approach in the present application, and how the Bayes factor can be used to display the information or knowledge we are after in an application. It is also our intention to give a full account of the computations required. It can serve as a survey of the area, although it focuses on techniques being investigated in the present project. Several of the computations we describe have been analysed at length, although not exactly in the way and with the same conclusions as found here. The contribution here is a systematic treatment that is confined to pure Bayesian analysis and puts several established data mining methods in a joint Bayesian framework. We do not want to enter the discussion of why the Bayesian approach is superior to its alternatives, but some background material is included. We will see that, although many computations of Bayesian data mining are straightforward, one soon reaches problems where difficult integrals have to be evaluated, and presently only Markov Chain Monte Carlo (MCMC) methods are available. There are several recent books describing the Bayesian method from a theoretical [3], an ideological [19, 32] and an application oriented [7] perspective. A main historic influence leading to increased interest in Bayesian methods is Harold Jeffreys, who wrote particularly two books on scientific inference and probability theory from a Bayesian perspective [21, 20]. A current survey of MCMC methods, which can solve some complex evaluations required in Bayesian modeling, can be found in the book [17]. Books explaining theory and use of graphical models are Lauritzen [22], Cox and Wermuth [10], and Whittaker [35]. A tutorial on Bayesian network approaches to data mining is found in Heckerman [18]. This present report describes data mining in a relational data structure with discrete data (discrete data matrix) and the simplest generalizations to numerical data. A second part will describe general real valued data matrices, raster data representing, e.g., scalar and/or vector fields, as well as time series and strings.
2 Data model
We consider a data matrix where rows are cases and columns are variables. In our application, the row is associated with a person or an investigation (patient and date). The columns describe a large number of variables that could be recorded, such as background data (occupation, sex, age, etc.), and numbers extracted from investigations made, like sizes of brain regions, receptor densities and blood flow by region, etc. Categorical data can be equipped with a confidence (probability that the recorded datum is correct), and numerical data with an error bar. Every datum can be recorded as missing, and the reason for missing data can be related to the patient's condition or to external factors (like equipment unavailability or time and cost constraints). Only the latter type of missing data is (at least approximately) unrelated to the domain of investigation. On the level of exploratory analysis we confine ourselves to discrete and multivariate normal distributions, with Dirichlet and inverse Wishart priors. In this way, no delicate and costly MCMC methods will be required until missing data and/or segmentation is introduced. If the data do not satisfy these conditions (e.g., normality for a real variable), they may do so after suitable transformation and/or segmentation. Another approach is to ignore the distribution over the real line and regard a numerical attribute as an ordinal one, i.e., one defined only up to order, as when subjects rank their appreciation of a phenomenon in organized society or their valuation of their own emotions.
2.1 Multivariate data models
Given a data matrix, the first question that arises concerns the relationships between its variables (columns). Could some pairs of variables be considered independent, or do the data indicate that there is a connection between them, either directly causal, mediated through another variable, or introduced through sampling bias? These questions are analyzed using graphical models, directed or decomposable [24]. As an example, in figure 1, $M_1$ indicates a model where $A$ and $B$ are dependent, whereas they are independent in model $M_2$. In figure 2, we describe a directed graphical model $M_4''$ indicating that variables $A$ and $B$ are independently determined, but the value of $C$ will be dependent on the values for $A$ and $B$. The similar decomposable model $M_4$ indicates that the dependence of $A$ and $B$ is completely explained by the mediation of variable $C$. We could think of the data generation process as determining $A$, then $C$ dependent on $A$ and last $B$ dependent on $C$, or equivalently, determining first $C$ and then $A$ dependent on $C$ and $B$ dependent on $C$.
[Figure omitted: undirected and directed two-variable graphs $M_1$, $M_1'$, $M_2$, $M_2'$ on the vertices $A$ and $B$.]

Figure 1: Graphical models, dependence or independence?
Bayesian analysis of graphical models involves selecting all or some graphs on the variables, dependent on prior information, and comparing their posterior probabilities with respect to the data matrix. A set of highest posterior probability models is retained, and in interpreting these one must, as always in statistics, constantly remember that dependencies are not necessarily causalities.

[Figure omitted: graphs $M_3$, $M_3'$, $M_4$, $M_4'$, $M_4''$ on the vertices $A$, $B$ and $C$.]

Figure 2: Graphical models
A second question that arises concerns the relationships between rows (cases) in the data matrix. Are the cases built up from distinguishable classes, so that each class has its data generated from a simpler graphical model than that of the whole data set? In the simplest case these classes can be directly read off in the graphical model. In a data matrix where inter-variable dependencies are well explained by the model $M_4$, if $C$ is a categorical variable taking only few values, splitting the rows by the value of $C$ could give a set of data matrices in each of which $A$ and $B$ might be independent. However, the interesting cases are those where the classes cannot be directly seen in a graphical model, because then the classes are not trivially derivable. If the data matrix of the example contained only variables $A$ and $B$, because $C$ was unavailable or unknown to interfere with $A$ and $B$, the highest posterior probability graphical model might be one with a link from $A$ to $B$. The classes would still be there, but since $C$ would be latent or hidden, the classes would have to be derived from the $A$ and $B$ variables only. A different case of classification is where the values of one numerical variable are drawn from several normal distributions with different means and variances. The full column would fit very badly to any single normal distribution, but after classification, each class could have a set of values fitting a normal distribution well. A classification system built on Bayesian methodology is described by Cheeseman and Stutz [8].
A third question, often the one of highest practical concern, is whether some designated variable can be reliably predicted, in the sense that it is well related to combinations of values of other variables, not only in the data matrix, but also with high confidence in new cases that are presented. This question leads to another concept that has been extensively studied, namely association rules. Consider a data matrix well described by model $M_4$ in figure 2. It is conceivable that the value of $C$ is a good predictor of variable $B$, and better than $A$. It also seems likely that knowing both $A$ and $C$ is of little help compared to knowing only $C$, because the influence of $A$ on $B$ is completely mediated by $C$. On the other hand, if we want to predict $C$, it is well conceivable that knowing both $A$ and $B$ is better than knowing only one of them.

Finally, it is possible that a data matrix with many categorical variables with many values gives a scattered matrix with very few cases compared to the number of potentially different cases. Generalization is a technique by which a coarsening of the data matrix can yield better insight, such as replacing the age and sex variables by the categories kids, young men, adults and seniors in a car insurance application. The question of relevant generalization is clearly related to the problems of finding association rules and to classification. For ordinal variables, this line of inquiry leads naturally to the concept of decision trees, that can be thought of as a recursive splitting of the data matrix by the size of one of its ordinal variables.
3 Bayesian analysis, uninformative priors, and over-fitting
A natural procedure for estimating dependencies among categorical variables is by means of conditional probabilities estimated as frequencies in the data matrix. Likewise, correlations can be used to find dependencies among real valued variables. Such procedures usually lead to selection of the more detailed models and give poor generalizing performance, in the sense that new sets of data are likely to have completely different dependencies. Various penalty terms have been tried to avoid over-fitting. However, the Bayesian method has a built-in mechanism that favors the simplest models compatible with the data, and also selects more detailed models as the amount of data increases. The procedure is to compare posterior model probabilities, where the posterior probability of a model is obtained by combining its prior distribution of parameters with the probability of the data as a function of the parameters, using Bayes' rule. Thus, if $p_1(\theta_1)$ is the prior pdf of the parameter (set) $\theta_1$ of model $M_1$ and the probability of obtaining the case (row of data matrix) $d$ is $p(d|M_1,\theta_1)$, then the probability in model $M_1$ of the data matrix $D$ containing the ordered cases $\{d_i\}_{i\in I}$ is:

$$p(D|M_1) = \int \prod_{i\in I} p(d_i|M_1,\theta_1)\,p(\theta_1)\,d\theta_1, \tag{1}$$

and the posterior probability of model $M_1$ given the data $D$ is, by Bayes' rule,

$$p(M_1|D) = \frac{p(D|M_1)\,p(M_1)}{p(D)}. \tag{2}$$
From a frequentist or orthodox statistical point of view it is questionable to do this interchange and consider the probability of a model given the data. This is exactly what makes the difference between Bayesian and frequentist methods. If the data matrix is unordered, one should multiply with a multinomial coefficient, but this is often not done; whether or not this is done does not matter for computation of Bayes factors, see below. Two models $M_1$ and $M_2$ can now be related with respect to the data by the Bayes factor $p(D|M_1)/p(D|M_2)$. This is a factor which is multiplied with the prior odds between the two models, $p(M_1)/p(M_2)$, to get the posterior odds $p(M_1|D)/p(M_2|D)$. The posterior odds can now take the place of a new prior for the next data batch, and the procedure can be repeated. It should be noted, however, that the model averaging is done for each batch; whether this is appropriate or not depends on the application, and often it is not.
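As a concrete illustration of how a Bayes factor turns prior odds into posterior odds, and how the posterior odds can serve as the prior for the next batch, consider the following minimal sketch in Python. The helper name and the marginal likelihood values are hypothetical placeholders, not taken from the report:

```python
# Minimal sketch: turning Bayes factors into posterior odds, batch by batch.
# The marginal likelihoods p(D_batch | M_i) would come from integrals such
# as equation (1); here they are hypothetical numbers for illustration.

def posterior_odds(prior_odds: float, p_data_m1: float, p_data_m2: float) -> float:
    """Multiply prior odds p(M1)/p(M2) by the Bayes factor p(D|M1)/p(D|M2)."""
    bayes_factor = p_data_m1 / p_data_m2
    return prior_odds * bayes_factor

odds = 1.0  # no prior preference between M1 and M2
batches = [(0.012, 0.004), (0.020, 0.015), (0.001, 0.003)]  # (p(D|M1), p(D|M2))
for p1, p2 in batches:
    odds = posterior_odds(odds, p1, p2)
    print(f"posterior odds p(M1|D)/p(M2|D) = {odds:.3f}")

# With two exhaustive models, the posterior probability of M1 is:
prob_m1 = odds / (1.0 + odds)
```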
A high value of the Bayes factor, say more than 100, speaks strongly in favor of model $M_1$, while a value below .01 gives strong support for $M_2$. Values closer to one (i.e., in the range .3 to 3), however, tell us that the data are insufficient to decide between the models, and this is unavoidable: methods that decide in those cases cannot be well designed. This appears to be a significant difference between the Bayesian approach and many analyses occurring in AI and data mining: we do not consider our data as an imperfect image of an ideal underlying and completely precise probability model. On the contrary, we ask which imperfect underlying models best serve to describe our data. If we tried to get much more data than we have, we would not necessarily become wiser, since the data collection process may well be such that cases are not independent, and the data collection process may change the nature of the data through the sampling process.
A disturbing feature of the Bayesian methodology is that it requires prior distributions. Priors give an impression of subjectivity, which they should not do. The prior is an assessment of a state of information, and is not related to a subject except that the information state is possessed by a subject. Often the information state is difficult to deal with since its form is fairly open-ended: just imagine information related to an open mathematical problem, or even an NP-hard optimization problem. However, every well-founded choice between alternatives must involve the prior beliefs of, objectively the state of information held by, the decision maker in some way, and the Bayesian method is one (in fact the only) consistent way of doing this. Bayesian methodology provides an expedient for the case where no strong prior beliefs should influence the conclusion, namely uninformative or weakly informative priors. For such prior distributions, more data is typically needed to reach a definite conclusion than for cases where there is distinct prior information to include in the analysis. With the Bayesian method there is no need to penalize more detailed models to avoid over-fitting: if $M_2$ is more detailed than $M_1$ in the sense of having more parameters to fit, then the parameter dimension is larger in $M_2$ and the prior density $p(\theta_2)$ is smaller than $p(\theta_1)$ in the region of good fit, which automatically penalizes $M_2$ against $M_1$. This automatic penalization has been found appropriate in many application cases, and should be complemented by explicit prior model probabilities only when there is definite prior information about the models. An asymptotic estimate of the penalization of detailed models implicit in the Bayes factor approach is a factor $n^{(p_1-p_2)/2}$, where $n$ is the number of data points (cases) and $p_i$ is the number of parameters in model $M_i$. This estimate was first found by Schwarz [31], and is known, when used to penalize more detailed models in a likelihood based model comparison, as the Bayesian information criterion (BIC). So deciding between the models using the likelihood ratios with the BIC as a penalizing factor is an approximation to the 'orthodox Bayesian' procedure of comparing posterior probabilities, and it is useful when the integration required for posterior determination is infeasible or otherwise unwanted. Some discussions of this point can be found in Ch. 24 of Jaynes [19] and also in Neal [25].
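The BIC approximation is easy to state in code. The following sketch, with hypothetical log-likelihood values of our own choosing, compares two models by the Schwarz-penalized likelihood; an exact Bayes factor would instead come from integrals like equation (1):

```python
import math

def bic(log_likelihood: float, n_params: int, n_cases: int) -> float:
    """Schwarz's Bayesian information criterion for one model."""
    return log_likelihood - 0.5 * n_params * math.log(n_cases)

# Hypothetical fitted log-likelihoods for models M1 (3 parameters)
# and M2 (9 parameters) on n = 200 cases.
n = 200
bic_m1 = bic(-310.0, 3, n)
bic_m2 = bic(-305.0, 9, n)

# exp(BIC difference) approximates the Bayes factor p(D|M2)/p(D|M1):
approx_bayes_factor = math.exp(bic_m2 - bic_m1)
print(f"approximate Bayes factor M2 vs M1: {approx_bayes_factor:.2e}")
# The 6 extra parameters cost 6 * 0.5 * log(200), about 15.9 in log score,
# outweighing the gain of 5.0 in log-likelihood, so M1 is preferred.
```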
The discussion above relates to choosing one of two models. Clearly, there is a possibility that the data discredit both these models, or that we have a whole family of models to choose from.

Consider the problem of comparing models in a family $\{M_1,\ldots,M_k\}$, and having no prior preference for any of them. If the models do not overlap, we should choose the probabilities $\{p(M_i|D)/\sum_j p(M_j|D)\}$ as the probabilities of these models given the data. By overlapping we mean that parameter sets of prior non-zero probability exist which give the same distribution in two models. We usually do not have overlap, since, e.g., in the case of nested models the region of overlap, the whole 'less specific' model, has prior probability zero in the more specific model. Typically, a nested family forming a tree or directed acyclic graph structure is chosen, where the dimension of the parameter space increases as one descends in the tree, and where the root is associated with the fewest parameters. The root model is the least specific one in the family.
In the modeling effort, the analyst must decide, on grounds of what is known in general terms about the application and the purpose of the analysis, which model family to consider. Here we must remember that inference is not an idle activity, but should normally be used to make decisions. Clearly, it is not adequate to select a model from its posterior probability without considering the consequences of decisions. In Bayesian decision theory (see, e.g., Berger [1]), we introduce actions and expected utility of actions given a 'state of the world', which could be a model or a model with its parameter. However, in Bayesian decision theory, the rational decision making follows from only the posterior and the utility functions (statisticians seem to be a pessimistic breed and usually talk about loss functions, but this is of course really the same thing). For this reason we do not introduce loss functions in this report.
3.1 The Bayesian debate and the unavoidability of Bayesian analysis
There was a quite heated debate among statisticians on the proper application of mathematical tools in the interpretation of experimental data. This debate started between Fisher and Pearson and continued between Fisher and Jeffreys. What is most remembered is the discussion between Bayesians and 'frequentists' (as traditional statisticians were called by Bayesians). For a trained pure mathematician the controversy between frequentist and Bayesian views does simply not appear: he is interested in abstract spaces with probability measures. The debate has, however, provoked strong reactions among statisticians and also recently in the AI community. Bayesians are known for their arrogance and claim to own the truth. It is unfortunate that this claim is not presented in many textbooks, because it is easy to understand, and also quite surprising. It is generally agreed that Bayes' original paper is deep and challenging, but it is also too vague and incoherent to be convincing, and many readers have rejected it outright. There is apparently no documented evidence that Laplace actually saw the paper or heard of it, but the work of Laplace is a continuation of the ideas in Bayes' work. Unfortunately, he did not succeed in convincing his colleagues and successors in the scientific community. His idea of the rule of succession is a clear application of Bayesian analysis, but it was rejected because his readers did not accept his choice of prior information (deciding the number of days, all with sunrise, since creation, by reading the Bible) and discarded the method on the basis of one dubious application. Obviously, if the Bible is reliable on this point, other information on the order of Nature found in it might contradict his application. Other sources of prior information were known by Laplace, but he did not use them for this purpose. Several great 19th century mathematicians have more or less by instinct used the ideas of Bayes and Laplace when performing computations on experimental data (typically in astronomy), but these efforts were more or less ignored when the discipline of statistics was created in the early 20th century.
The first derivation of the necessity of Bayesian methods was done by R. T. Cox in 1946 [11], and has been repackaged by Jaynes with a lot of motivating discussion. Basically, the analysis investigates which family of rules for reasoning with the plausibility of statements about the world is permissible, in the sense that they satisfy the following criteria:

I: The plausibility of a statement is a real number and dependent on information we have on the plausibility of other statements.

II: Consistency. If the plausibility of a statement can be derived in two ways, the two results must be equal.

III: Common sense. Some properties of statements known to be true or known to be false, and continuity rules.

From these criteria follows that any permissible way to reason with plausibility is equivalent to Bayesian analysis. A very short outline follows, where we do not in fact show that the Bayesian method satisfies the criteria (this is not usually questioned):
Let $A$, $B$, $C$, ... be statements, combinable with the invisible logical and operator: $AB$ means $A$ and $B$. The negation of a statement $A$ is written $\bar A$. Statements must in some way be considered objective and relate to states of the world, and have an agreed interpretation. Let $A|C$ be the plausibility of $A$ given the additional information that $C$ is true. $C$ is thus the context in which we consider the plausibility of $A$. That such a notation must be present in every calculus to derive plausibility is clear: there must for example be a way to relate a measured value ($A|C$) to the reality behind it ($B|C$) using background information on the measurement process and its accuracy ($C$). Numerical values (parameters, measured values, etc.) enter this framework by a limit process. We cannot start with infinite domains and directly put plausibility measures on them. The plausibility of $AB|C$ must be derivable from one or more of the plausibilities $A|C$, $B|AC$, $B|C$ and $A|BC$. It can be shown that we must consider either $B|AC$ and $A|C$, or $A|BC$ and $B|C$; any other alternative can be shown inadequate by violating common sense in some situation. As an example, we cannot derive the plausibility of $AB|C$ from only the plausibilities of $A|C$ and $B|C$, since that gives us no means to consider how $A$ and $B$ relate to each other: it would force us to assume, for example, that the plausibility of a person having a left blue and a right brown eye would depend only on the plausibilities of left blue and right brown eye, not allowing us to consider the dependency between these two statements.
Thus, we can assume that the plausibility of $AB|C$ is a function of the plausibilities of $A|BC$ and $B|C$, the other case being a natural consequence of the commutativity of the and operator:

$$AB|C = F(A|BC,\; B|C). \tag{3}$$

The common sense requirement tells us that the function $F$ must be continuous, and monotonically increasing in both its arguments. It can have a stationary point for its first argument only if the second argument represents impossibility, and vice versa. We assume it twice continuously differentiable, although there exists a fairly complex proof that this is not necessary for our conclusions [19].

Now we consider the consistency requirement. Since the and operator is not only commutative but also associative, $ABC = (AB)C = A(BC)$, we can derive a consistency requirement for $F$:

$$ABC|D = F(AB|CD,\; C|D) = F(A|BCD,\; BC|D). \tag{4}$$

Expanding once more, we get:

$$F(F(A|BCD,\, B|CD),\, C|D) = F(A|BCD,\, F(B|CD,\, C|D)). \tag{5}$$

This must hold for any statements $A, B, C, D$, and thus $F$ must satisfy the following functional equation in its range of definition:

$$F(x, F(y,z)) = F(F(x,y), z). \tag{6}$$

The above is called the equation of associativity. The trivial constant solution is clearly useless. Which non-trivial solutions are there? We can differentiate equation (6) with respect to $x$, $y$ and $z$, and see that the following equality holds, where the right side, and thus also the left side, is independent of $z$ (we use the notation $F_1(x,y) = \partial F(x,y)/\partial x$ and $F_2(x,y) = \partial F(x,y)/\partial y$):

$$\frac{F_2(x, F(y,z))\,F_1(y,z)}{F_1(x, F(y,z))} = \frac{F_2(x,y)}{F_1(x,y)}. \tag{7}$$
Let $G(x,y) = F_2(x,y)/F_1(x,y)$; then (7) says $G(x,F(y,z))\,F_1(y,z) = G(x,y)$, and the left side of this (which is thus algebraically independent of $z$) we denote $U$. Likewise, after a little algebra: $G(x,F(y,z))\,F_2(y,z) = G(x,y)\,G(y,z)$, and the left side we denote by $V$. Now $\partial V/\partial y$ is identical to $\partial U/\partial z$ and thus zero, since $U$ is independent of $z$. But then $V$, which can be written $G(x,y)G(y,z)$, is independent of $y$. This can only happen if $G(y,z)$ and $1/G(x,y)$ have a common factor dependent on $y$, and no other dependence on $y$. So we must have $G(y,z) = H(y)E(z)$ and $G(x,y) = E_1(x)/H(y)$ for some functions $E$, $E_1$ and $H$; substituting $y$ for $x$ and $z$ for $y$ in the latter gives $G(y,z) = E_1(y)/H(z)$. Comparing the two expressions for $G(y,z)$ shows that $E(z)H(z)$ is a constant $r$ and $E_1 = rH$. In other words, $G$ must have the form $G(x,y) = rH(x)/H(y)$, and this is also by definition equal to $F_2(x,y)/F_1(x,y)$. This is what we need to separate variables and put the differential of $v = F(x,y)$ on an integrable form:

$$\frac{dv}{H(v)} = \frac{dx}{H(x)} + r\,\frac{dy}{H(y)}, \tag{8}$$

which can be integrated, using $w(x) = \exp\left(\int^x \frac{dt}{H(t)}\right)$, to:

$$w(F(x,y)) = w(x)\,w^r(y), \tag{9}$$

but the equation of associativity also gives us

$$w(F(F(x,y),z)) = w(x)\,w^r(y)\,w^r(z) = w(F(x,F(y,z))) = w(x)\,w^r(y)\,w^{r^2}(z), \tag{10}$$
and in every non-trivial and useful case we must have $r = 1$. We can now investigate what $w(x)$ must be when $x$ represents truth or falsity, and we get $w(x) = w(x)w(T)$, $w(F) = w(x)w(F)$, and some more conditions we do not have to use. It is possible that the values $1$ and $\infty$ are obtained, since truth and falsity might be considered a limit case. The first condition yields $w(T) = 1$; the other could mean either $w(F) = 0$ or $w(F) = \infty$ ($-\infty$ is ruled out since we cannot allow $w(x)$ to pass zero on its way from $w(T)$ to $w(F)$). But the solution going from $1$ to $\infty$ can be replaced by its inverse, which goes from $1$ to $0$. We are now very close to probability rules, since the function $w$ goes from $0$ for impossibility to $1$ for truth, and our rule for the conjunction of statements can be written

$$w(AB|C) = w(A|BC)\,w(B|C). \tag{11}$$
It now remains to find out how plausibilities of complements must be treated. Since $A\bar A$ is always false and either of $A$ or $\bar A$ must be true, the plausibility of $\bar A$ must be a function of the plausibility of $A$. Introduce the function $S$ on the unit interval, $S: [0,1]\to[0,1]$, such that $w(\bar A|C) = S(w(A|C))$. By considering Aristotelian logic, our choice of $w(T) = 1$ and $w(F) = 0$, and reasonable common sense, we find that $S$ is a monotone and continuous function decreasing from $1$ to $0$ on the unit interval. We will assume that $S$ is differentiable; again this is not necessary, but it is almost required by common sense and simplifies the argument. Also, since $A = \bar{\bar A}$, we have $S(S(x)) = x$. This is not all, however, because $S$ must also be consistent with the product rule:

$$w(A\bar B|C) = w(A|C)\,w(\bar B|AC) = w(A|C)\,S(w(B|AC)), \tag{12}$$
$$w(B\bar A|C) = w(B|C)\,w(\bar A|BC) = w(B|C)\,S(w(A|BC)). \tag{13}$$

Rearranging these constraints and using the commutativity $AB = BA$, we find $w(A\bar B|C) = w(A|C)\,S(w(B|AC)) = w(A|C)\,S(w(AB|C)/w(A|C))$, and

$$w(A|C)\,S\!\left(\frac{w(A\bar B|C)}{w(A|C)}\right) = w(B|C)\,S\!\left(\frac{w(B\bar A|C)}{w(B|C)}\right). \tag{14}$$

Equation (14) must hold for all statements $A$, $B$, and $C$. In particular, we may choose $B$ such that $\bar B$ implies $A$; then $A\bar B = \bar B$ and $B\bar A = \bar A$, and with $x = w(A|C)$ and $y = w(B|C)$ we obtain the following fundamental equation governing the possible functions $S$:

$$x\,S\!\left(\frac{S(y)}{x}\right) = y\,S\!\left(\frac{S(x)}{y}\right), \qquad S(y) \le x \le 1. \tag{15}$$

The analysis of this equation is not entirely trivial, but it can readily be verified that among its solutions are the (easily obtainable) solutions to the simpler equation:

$$S(x)^m + x^m = 1, \qquad m > 0. \tag{16}$$
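As a quick check (our addition, using the solution form implied by (16)), write $S(x) = (1-x^m)^{1/m}$. Then

$$x\,S\!\left(\frac{S(y)}{x}\right) = x\left(1 - \frac{1-y^m}{x^m}\right)^{1/m} = \left(x^m + y^m - 1\right)^{1/m},$$

which is symmetric in $x$ and $y$, so the two sides of (15) agree; the condition $S(y) \le x$ guarantees $x^m + y^m - 1 \ge 0$.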
[Figure omitted: the family of solution curves of (16) in the unit square, for different values of $m$.]

Figure 3: Sample solutions to (16).
For the different values of $m$, the curve family will cover the interior of the unit square (see figure 3). It is also easy, by considering the choice $y = S(x) + \epsilon$ as $\epsilon \to 0$, to see that $S$ is governed by a first order differential equation. Therefore, there are no more solutions than these. It might seem odd that the solution $S(x) = 1 - x$ is not the only one, since it would fit well with equation (11) and the choice of $w(A|C)$ as the probability of $A|C$. However, by taking the $m$th power of equation (11), we find that we can still interpret all possible ways to compute with plausibilities as Bayesian analysis, simply by letting probability correspond to $w(A|C)^m$.
The next question in this line of inquiry concerns proper choices of priors; here we have no help whatsoever from the preceding discussion. Building a repertoire of methods to assign priors would start with simple symmetry considerations: If I have no background knowledge whatsoever to find differences in plausibility between a set of $n$ exclusive and exhaustive hypotheses, then the prior probability of each should be set to the same value and the probabilities should sum to one. Thus, each hypothesis will have prior probability $1/n$. This leads to the standard assignments for coin tossing and urn drawing experiments considered in basic probability texts. Translating this rule, by limit forming operations, to pdfs with continuous parameter spaces leads naturally to the concept of minimum-information (maximum entropy) priors, which have revolutionized the methods for analyzing physics data and are spreading to other sciences. We do not describe this revolution here, see e.g. Jaynes [19]. A remaining problem is that we simply cannot consider all possible hypotheses. This means that the set of hypotheses we actually consider must in some sense be realistic. This is a key problem that must get a convincing solution in every application. Uninformative priors have been found applicable to many different problems as a first quantitative grinding of the collected data. However, once the big lines have been uncovered, there is usually plenty of scope for investigating more specific and application related models.
3.2 An educational example: Tossing a coin
When it comes to the interpretation of experimental outcomes, we can illustrate the controversy with an example that has been discussed frequently by statisticians, first by Lindley (see, e.g., [7, 32, 19]): Assume we toss a coin 12 times and observe the outcome ttthhtttttth, where t means tail and h means head. We are interested in what this means for our objective of learning whether or not the coin is fair. The probability of this sequence for a fair coin is $0.5^{12}$, as it is for any other sequence of 12 tosses. So it does not seem extraordinary, nor does a sequence consisting of millions of heads only, because it also has the same probability as any other sequence of the same length. The frequentist's approach is to define a test. We order the possible outcomes linearly or map them to the real line, and this induces a pdf of a real-valued quantity. If the current outcome lies far out on the tail of this distribution, we reject the hypothesis that the coin is fair. It is accepted that a 5% cutoff can be used, and this gives us a 5% risk of rejecting a true hypothesis. Of course the map of outcomes to the real line must be defined in some impartial way, essentially before we have seen the actual outcome. Typically, at least if we are more concerned with fairness than with independence, we choose the number of tails in the sequence, which has a binomial distribution. The probability of 9 or more tails in 12 tosses of a fair coin is slightly more than 5% ($\sum_{i=9}^{12}\binom{12}{i}2^{-12} = .075$), so we could reasonably assume that the coin is fair.
There is a very fundamental problem with this approach, however, and that is that we made an assumption about the possible unobserved outcomes that is not justified. We just assumed that the outcome is one of the possible outcomes when tossing 12 times. The actual sequence observed does not exclude the possibility that the experimenter tossed the coin until he had 3 heads. If that were the case we should instead compute the distribution of the number of tails seen before the third head. This distribution is different; in particular it admits arbitrarily large values. A rapid calculation shows that with this rule we should reject the null hypothesis at the 5% level for the same outcome of the experiment (the probability of 9 or more tails is $\sum_{j=9}^{\infty}\binom{j+2}{j}2^{-(j+3)} = .0325$). This dependence on the unknown experimental design violates a fundamental statistical principle saying that only the likelihood of the observed data can influence our belief in a hypothesis. This principle, the Likelihood Principle, was proposed by Fisher and Barnard, but it was first given a detailed analysis by Birnbaum in 1962 [5]. In the subsequent debate, frequentists have proposed that the Likelihood Principle is not applicable in this case and that the experimental design could in practise be relevant information. A Bayesian only admits that the probability, under the fairness assumption, of the outcome observed is $0.5^{12} = .000244$ and that the probability of 9 tails is a factor $\binom{12}{9}$ larger. In order to evaluate the experiment he needs prior beliefs. Such prior belief could be an alternative model, defined before the experiment is observed. If the alternative model is that the coin gives tails with probability $2/3$, the probability of the observed sequence is $(2/3)^9(1/3)^3 = .000963$, and the probability of 9 tails under the alternative model is again a factor $\binom{12}{9}$ larger. So a Bayes factor of 3.9 in favor of the alternative hypothesis is observed, and a Bayesian starting out with no preference (probability 1/2 for each alternative) would end up with a preference for the alternative, which could be quantified as probability .8 for the unfair alternative and .2 for the fair alternative. This preference should not be regarded as a rejection of the less believed alternative, but can easily be reversed by more information. There is a tempting alternative hypothesis in this case, namely that the true probability is the observed frequency, .75 for tails. This model has the highest probability ($.001173$) of those alternatives assuming independent outcomes. Even higher (probability 1) we reach if we assume that the observed sequence is the only possible outcome and that the tosses were not independent; but now we have definitely used the data too much, since we would probably not designate this hypothesis as a major alternative before the experiment.
Now, let the alternative hypothesis be: The probability of tails is an unknown number $\theta$. Figure 4 shows the probability of the outcome as a function of $\theta$. We do not know anything about $\theta$, but we must assume some distribution for it. One obvious alternative is the uniform distribution. This gives the model probability $\int_0^1 \theta^9(1-\theta)^3\,d\theta = .00035$. The resulting Bayes factor is 1.4 in favor of the hypothesis of unfairness, much weaker than 4.8 for the maximum likelihood hypothesis ($\theta = 0.75$). A Bayesian with no prior preference of the hypotheses fair against unfair would end up by assigning probability 0.411 to the fair and .589 to the unfair hypothesis.
Figure 4: Posterior frequency distribution
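The numbers in this example are easy to reproduce. The following sketch recomputes the likelihoods and Bayes factors above with standard library functions only; the Beta integral is evaluated in closed form as $9!\,3!/13!$:

```python
from math import factorial

t, h = 9, 3                                # observed tails and heads
n = t + h

p_fair = 0.5 ** n                          # 0.000244
p_23   = (2/3) ** t * (1/3) ** h           # point alternative, theta = 2/3
p_ml   = 0.75 ** t * 0.25 ** h             # 0.001173, maximum likelihood

# Uniform-prior marginal likelihood: the integral of theta^t (1-theta)^h
# over [0,1] is the Beta function B(t+1, h+1) = t! h! / (n+1)!.
p_uniform = factorial(t) * factorial(h) / factorial(n + 1)   # 0.00035

print(f"theta = 2/3  vs fair: {p_23 / p_fair:.1f}")          # 3.9
print(f"theta = 0.75 vs fair: {p_ml / p_fair:.1f}")          # 4.8
bf = p_uniform / p_fair                                      # 1.4
print(f"uniform prior vs fair: {bf:.1f}")
print(f"posterior probability of fairness: {1 / (1 + bf):.3f}")  # 0.411
```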
It might seem unreasonable to let the probabilities less than $1/2$ water out our belief in unfairness, when data clearly suggest that the probability, if it is not $1/2$, is greater. Let us split the unfairness case into two and consider three models: $M_l$, bias for heads; $M_f$, fair; $M_h$, bias for tails. We again assume a uniform distribution of $\theta$, in the interval 0 to 1/2 for $M_l$ and in 1/2 to 1 for $M_h$. A similar calculation leads to the posterior probabilities 0.034, 0.259 and 0.706, respectively. Clearly, by separating the unfairness hypothesis into low and high bias, we decreased our belief in the fairness alternative. Unfortunately, this is to some extent an illusion. The real reason why our posterior belief in the fairness decreased is that our prior belief in fairness decreased when we replaced two equally believable hypotheses (prior probability 1/2 each) by three equally believable hypotheses (prior probability 1/3 each). It would be equally natural to keep prior probability 1/2 for the fair model and give 1/4 each to the two bias models, and then the posterior probability of fairness would not change. This is one example of non-robustness problems appearing when doing Bayesian analyses with weak priors.
In any case, this is a result that seems much weaker than the frequentist's ability to reject the fairness assumption given the information that the experimenter tossed the coin until 3 heads were observed. In 'fairness' it should be noted that the two views would yield similar results if 120 tosses were made with 30 observed heads: The Bayes factor would be $10^6$ in favor of unfairness, and the level of the frequency test would be $10^{-7}$, two equally convincing reasons to reject the fairness assumption.

The process of dividing the unfairness case into two can be continued, and in the limit we obtain the concept of a posterior distribution for $\theta$ over the unit interval. This analysis is carried out, with a number of nice graphical results, by Sivia [32]. The resulting posterior with a uniform prior, $t$ tails and $h$ heads is the normalized likelihood function, the Beta distribution, $p(\theta|h,t) = c\,\theta^t(1-\theta)^h$. In the next section we will perform a generalized derivation, where we allow more than 2 outcomes: we go from a Bernoulli distribution to a general discrete distribution, and we use the more general Dirichlet conjugate family instead of Beta distributions.
There is no mathematical reason to reject one of the frequentist or Bayesian approaches. Bayesians accused frequentists of not accepting probability as dependent on information, whereas frequentists accused Bayesians of putting up with the non-robustness caused by dependence on prior information. Admittedly, it is difficult to translate prior information to prior probability, but Bayesians claim that it is unavoidable. Whether the frequentists' reliance on experimental design is worse than the Bayesians' reliance on priors is of course impossible to say without a lot of experience. Several other arguments have been put forward in this debate, but those above seem to be the most critical. Today, Bayesian views are gaining ground, perhaps largely due to interest from the AI camp, where several less convincing ways to deal with imprecise information have been tried. Although we promote the pure Bayesian view in this report, it must be remembered that anyone investigating real data must explore it from many angles, in order to avoid being misled by too constrained or inappropriate models. In practice such explorations are perhaps best performed with various visualization tools. An old saying is that a proper visualization hits the investigator between the eyes with the truth. There is some truth in this.
4 Graphical model choice - local analysis
We will analyze a number of models involving two or three variables of categorical type, as a preparation for the task of determining likely decomposable or directed graphical models. First, consider the case of two variables, $A$ and $B$, and our task is to determine whether or not these variables are dependent. Since we know that Bayes' method is the only method that gives us the right answer, we already know how to proceed. We must define one model $M_2$ that captures the concept of independence, and one model $M_1$ that captures the concept of dependence, and ask which one produced our data. The Bayes factor is $P(D|M_2)/P(D|M_1)$, which we multiply with the prior odds (which we assume is one) to get the posterior odds. There is some latitude in defining the data model for dependence and independence, but the alternatives lead us to quite similar computations, as we shall see.
Let $d_A$ and $d_B$ be the number of possible values for $A$ and $B$, respectively. It is natural to regard categorical data as produced by a discrete probability distribution, and then it is convenient to assume Dirichlet distributions for the parameters (probabilities of the possible outcomes) of the distribution. We will find that this analysis is the key step in determining a full graphical model for the data matrix. Our analysis is analogous to those of Dawid and Lauritzen [12] and Madigan and Raftery [24], but their analyses are in many ways more general and use a likelihood approach with penalization of detailed models using the BIC criterion and other similar techniques.

For a discrete distribution over $d$ values, the parameter set is a sequence of probabilities $x = (x_1,\ldots,x_d)$, constrained by $0 \le x_i$ and $\sum_i x_i = 1$ (often the last parameter $x_d$ is omitted; it is determined by the first $d-1$ ones). A prior distribution over $x$ is the conjugate Dirichlet distribution with a parameter set $\alpha = (\alpha_i)_{i=1}^d$, constrained by $0 \le \alpha_i$. Then the Dirichlet distribution with parameter set $\alpha$ is

$$Di(x|\alpha) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)}\prod_i x_i^{\alpha_i - 1},$$

where $\Gamma(n+1) = n!$ for natural number $n$. The normalizing constant $\Gamma(\sum_i\alpha_i)/\prod_i\Gamma(\alpha_i)$ gives a useful mnemonic for integrating $\prod_i x_i^{\alpha_i-1}$ over the $(d-1)$-dimensional unit simplex (with $x_d = 1 - \sum_{i<d} x_i$). It is very convenient to use Dirichlet priors, for the posterior is also a Dirichlet distribution: After having obtained data with frequency count vector $n$, we just add it to the prior parameter vector to get the posterior parameter vector $\alpha + n$. It is also easy to handle priors that are mixtures of Dirichlets, because the mixing propagates through and we only need to mix the posteriors of the components to get the posterior of the mixture. We do not need this here, however.
With no specific prior information for $x$, it is necessary from symmetry considerations to assume all Dirichlet parameters equal, $\alpha_i = \alpha$. A convenient prior is the uniform prior ($\alpha = 1$). This is, e.g., the prior used by Laplace to derive the rule of succession, see Ch. 18 of [19]. Other priors have been used, e.g., $\alpha = 1/2$ in the case $d = 2$, which is a minimum information (Jeffreys) prior. The value $\alpha = 1/2$ has also been used for $d > 2$ (Madigan and Raftery [24]). Cheeseman and Stutz [8] report the use of $\alpha = 1 + 1/d$. Experiments have shown little difference between these choices, but it is easy to see that the Jeffreys prior promotes $x_i$ close to 0 or 1 somewhat, whereas $\alpha = 1 + 1/d$ penalizes extreme probabilities. If we get significant differences between different uninformative priors, this warrants a closer investigation of the adequacy of data and modeling assumptions. We will mostly use the uniform prior. In many cases an expert's deliberated prior information can be expressed as an equivalent sample that is just added to the data matrix, and then this modified matrix can be analyzed with the uniform prior. Likewise, a number of experts can be mixed to form a mixture prior. If the data has occurrence vector $(n_i)_{i=1}^d$ for the $d$ possible data values in a case, and $n = n_{\cdot} = \sum_i n_i$, then the probability for these data given the discrete distribution parameters $x$ is

$$p(n|x) = \binom{n}{n_1,\ldots,n_d}\prod_i x_i^{n_i}. \tag{17}$$
The first factor is the multinomial coefficient; it is sometimes omitted. This would give the probability not of getting a particular contingency table (data matrix), but a given ordered sample with the frequency counts $n_i$. The difference between these two views disappears when the multinomial coefficients cancel in the division leading to Bayes factors. Integrating out the $x_i$ with the prior gives the probability of the data given model $M$ ($M$ is characterized by a parameterized probability distribution and a prior on its parameters):

$$p_J(n|M) = \int p(n|x)p(x)\,dx = \int \binom{n}{n_1,\ldots,n_d}\prod_i x_i^{n_i}\,\frac{\Gamma(\sum_i\alpha_i)}{\prod_i\Gamma(\alpha_i)}\prod_i x_i^{\alpha_i-1}\,dx$$
$$= \binom{n}{n_1,\ldots,n_d}\frac{\Gamma(d\alpha)}{\Gamma(\alpha)^d}\,\frac{\prod_i\Gamma(n_i+\alpha)}{\Gamma(n+d\alpha)} \tag{18}$$
$$= \frac{\Gamma(n+1)\,\Gamma(d\alpha)\,\prod_i\Gamma(n_i+\alpha)}{\Gamma(\alpha)^d\,\Gamma(n+d\alpha)\,\prod_i\Gamma(n_i+1)}. \tag{19}$$

As is easily seen, the uniform prior gives a probability for each sample size that is independent of the actual data:

$$p_u(n|M) = \frac{\Gamma(n+1)\,\Gamma(d)}{\Gamma(n+d)}. \tag{20}$$
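These marginal likelihoods are products of Gamma functions and are best computed in log space. A small sketch (the helper name is ours, not the report's):

```python
from math import lgamma

def log_marginal(counts, alpha=1.0):
    """log p(n|M) from equation (19): Dirichlet-multinomial marginal
    likelihood of a count vector, with symmetric prior parameter alpha."""
    d = len(counts)
    n = sum(counts)
    return (lgamma(n + 1) + lgamma(d * alpha)
            + sum(lgamma(c + alpha) for c in counts)
            - d * lgamma(alpha) - lgamma(n + d * alpha)
            - sum(lgamma(c + 1) for c in counts))

# With alpha = 1 the result depends only on n and d, as equation (20) states:
print(log_marginal([9, 3]))    # same value ...
print(log_marginal([6, 6]))    # ... for any split of 12 cases over 2 values
```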
Consider now the data matrix over $A$ and $B$. Let $n_{ij}$ be the number of rows with value $i$ for $A$ and value $j$ for $B$. Let $n_{\cdot j}$ and $n_{i\cdot}$ be the marginal counts where we have summed over the 'dotted' index, and $n = n_{\cdot\cdot} = \sum_{ij} n_{ij}$. Let model $M_1$ (figure 1) be the model where the $A$ and $B$ value for a row is combined to a categorical variable ranging over $d_A d_B$ different values, with a Jeffreys or uniform prior. The probability of the data given $M_1$ is obtained by adapting the products and replacing $d$ by $d_A d_B$ in equations (19) and (20):

$$p_J(n|M_1) = \frac{\Gamma(n+1)\,\Gamma(d_Ad_B\alpha_{AB})\,\prod_{ij}\Gamma(n_{ij}+\alpha_{AB})}{\Gamma(\alpha_{AB})^{d_Ad_B}\,\Gamma(n+d_Ad_B\alpha_{AB})\,\prod_{ij}\Gamma(n_{ij}+1)}, \tag{21}$$

$$p_u(n|M_1) = \frac{\Gamma(n+1)\,\Gamma(d_Ad_B)}{\Gamma(n+d_Ad_B)}. \tag{22}$$
We could also consider a different model $M_1'$, where the $A$ column is generated first and then the $B$ column is generated for each value of $A$ in turn. With uniform priors we get:

$$p_u(n|M_1') = \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)^{d_A}}{\Gamma(n+d_A)}\prod_i \frac{\Gamma(n_{i\cdot}+1)}{\Gamma(n_{i\cdot}+d_B)}. \tag{23}$$

Observe that we are not allowed to decide between the undirected $M_1$ and the directed model $M_1'$ based on equations (22) and (23). This is because these models define the same set of pdfs involving $A$ and $B$, the difference lying only in the parametrization and the parameter priors. The ratio of (22) and (23) is thus not a true Bayes factor, although it might be useful for seeing how well data fit the two parameterizations and parameter priors. A difference compared to real Bayes factors is that we cannot resolve the hypothesis by taking more data. The factor just measures relative stretch in the parametrization in the high likelihood areas.
In the next model $M_2$ we assume that the $A$ and $B$ columns are independent, each having its own discrete distribution. There are two different ways to specify prior information in this case. We can either consider the two columns separately, each being assumed to be generated by a discrete distribution with its own prior. Or we could follow the style of $M_1'$ above, with the difference that each $A$ value has the same distribution of $B$-values. Now the first approach: Assuming parameters $x^A$ and $x^B$ for the two distributions, a row with values $i$ for $A$ and $j$ for $B$ will have probability $x^A_i x^B_j$. For discrete distribution parameters $x^A, x^B$, the probability of the data matrix $n$ will be:

$$p(n|x^A,x^B) = \binom{n}{n_{11},\ldots,n_{d_Ad_B}}\prod_{i,j=1}^{d_A,d_B}(x^A_i x^B_j)^{n_{ij}} = \binom{n}{n_{11},\ldots,n_{d_Ad_B}}\prod_{i=1}^{d_A}(x^A_i)^{n_{i\cdot}}\prod_{j=1}^{d_B}(x^B_j)^{n_{\cdot j}}.$$

Integration over the priors for $A$ and $B$ gives the data probability given model $M_2$:

$$p_J(n|M_2) = \int p(n|x^A x^B)\,p(x^A)\,p(x^B)\,dx^A dx^B = \frac{\Gamma(n+1)}{\prod_{ij}\Gamma(n_{ij}+1)}\,\frac{\Gamma(d_A\alpha_A)}{\Gamma(\alpha_A)^{d_A}}\,\frac{\Gamma(d_B\alpha_B)}{\Gamma(\alpha_B)^{d_B}}\,\frac{\prod_i\Gamma(n_{i\cdot}+\alpha_A)}{\Gamma(n+d_A\alpha_A)}\,\frac{\prod_j\Gamma(n_{\cdot j}+\alpha_B)}{\Gamma(n+d_B\alpha_B)}.$$

If we select the uniform prior we obtain less canceling of terms than we did for $M_1$ in equation (22):

$$p_u(n|M_2) = \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)\,\Gamma(n+d_B)}\,\frac{\prod_i\Gamma(n_{i\cdot}+1)\,\prod_j\Gamma(n_{\cdot j}+1)}{\prod_{ij}\Gamma(n_{ij}+1)}. \tag{24}$$
From equations (22) and (24) we obtain the Bayes factor for the undirected case:

$$\frac{p_u(M_2|D)}{p_u(M_1|D)} = \frac{p_u(n|M_2)}{p_u(n|M_1)} = \frac{\Gamma(n+d_Ad_B)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)\,\Gamma(n+d_B)\,\Gamma(d_Ad_B)}\,\frac{\prod_i\Gamma(n_{i\cdot}+1)\,\prod_j\Gamma(n_{\cdot j}+1)}{\prod_{ij}\Gamma(n_{ij}+1)}. \tag{25}$$
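Equation (25) is easy to evaluate for a concrete data matrix. A minimal sketch in Python (the helper name and the 0-based value coding are our own conventions; log space avoids overflow):

```python
from math import lgamma, exp

def log_bf25(cases, a, b, dA, dB):
    """log Bayes factor p(n|M2)/p(n|M1) of equation (25): independence
    of columns a and b against dependence, with uniform priors.
    Column values are assumed coded as 0..dA-1 and 0..dB-1."""
    rows, cols = [0] * dA, [0] * dB
    cells = [[0] * dB for _ in range(dA)]
    for case in cases:
        i, j = case[a], case[b]
        cells[i][j] += 1
        rows[i] += 1
        cols[j] += 1
    n = len(cases)
    return (lgamma(n + dA * dB) + lgamma(dA) + lgamma(dB)
            - lgamma(n + dA) - lgamma(n + dB) - lgamma(dA * dB)
            + sum(lgamma(x + 1) for x in rows)
            + sum(lgamma(x + 1) for x in cols)
            - sum(lgamma(x + 1) for r in cells for x in r))

# Two strongly associated binary columns: the factor is far below 1,
# i.e., the data speak for dependence (M1).
cases = [(0, 0)] * 30 + [(1, 1)] * 28 + [(0, 1)] * 4 + [(1, 0)] * 5
print(exp(log_bf25(cases, 0, 1, 2, 2)))
```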
The second approach to model independence between $A$ and $B$ gives the following:

$$p_u(n|M_2') = \frac{\Gamma(n+1)\,\Gamma(d_A)}{\Gamma(n+d_A)}\int \Big(\prod_i \binom{n_{i\cdot}}{n_{i1},\ldots,n_{id_B}}\prod_j (x^B_j)^{n_{ij}}\Big)\,\Gamma(d_B)\,dx^B$$
$$= \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)}\int \Big(\prod_i \binom{n_{i\cdot}}{n_{i1},\ldots,n_{id_B}}\Big)\prod_j (x^B_j)^{n_{\cdot j}}\,dx^B$$
$$= \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)\,\Gamma(n+d_B)}\,\frac{\prod_i\Gamma(n_{i\cdot}+1)\,\prod_j\Gamma(n_{\cdot j}+1)}{\prod_{ij}\Gamma(n_{ij}+1)}. \tag{26}$$

We can now find the Bayes factor relating models $M_1'$ (equation 23) and $M_2'$ (equation 26), with no prior preference for either:

$$\frac{p_u(M_2'|D)}{p_u(M_1'|D)} = \frac{p_u(n|M_2')}{p_u(n|M_1')} = \frac{\prod_j\Gamma(n_{\cdot j}+1)\,\prod_i\Gamma(n_{i\cdot}+d_B)}{\Gamma(d_B)^{d_A-1}\,\Gamma(n+d_B)\,\prod_{ij}\Gamma(n_{ij}+1)}. \tag{27}$$
Consider now a data matrix with three variables, $A$, $B$ and $C$ (figure 2). The analysis of the model $M_3'$ where full dependencies are accepted is very similar to $M_1$ above (equation 22). For the model $M_4$ without the link between $A$ and $B$ we should partition the data matrix by the value of $C$ and multiply the probabilities of the blocks with the probability of the partitioning defined by $C$. Since we are ultimately after the Bayes factor relating $M_4$ and $M_3$, respectively $M_4'$ and $M_3'$, we can simply multiply the Bayes factors relating $M_2$ and $M_1$ (equation 25), respectively $M_2'$ and $M_1'$ (equation 27), for each block of the partition to get the Bayes factors sought:

$$\frac{p_u(M_4|D)}{p_u(M_3|D)} = \frac{p_u(n|M_4)}{p_u(n|M_3)} = \frac{\Gamma(d_A)^{d_C}\,\Gamma(d_B)^{d_C}}{\Gamma(d_Ad_B)^{d_C}}\prod_c \frac{\Gamma(n_{\cdot\cdot c}+d_Ad_B)\,\prod_j\Gamma(n_{\cdot jc}+1)\,\prod_i\Gamma(n_{i\cdot c}+1)}{\Gamma(n_{\cdot\cdot c}+d_A)\,\Gamma(n_{\cdot\cdot c}+d_B)\,\prod_{ij}\Gamma(n_{ijc}+1)}, \tag{28}$$
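In code, equation (28) is just equation (25) applied within each block of the partition by $C$, with the log factors added; a sketch reusing log_bf25 from above (helper name again ours):

```python
def log_bf28(cases, a, b, c, dA, dB, dC):
    """log Bayes factor p(n|M4)/p(n|M3) of equation (28): conditional
    independence of columns a and b given column c (values 0..dC-1)."""
    return sum(log_bf25([x for x in cases if x[c] == v], a, b, dA, dB)
               for v in range(dC))
```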
and in the directed case we have:

$$\frac{p_u(M_4'|D)}{p_u(M_3'|D)} = \frac{p_u(n|M_4')}{p_u(n|M_3')} = \Gamma(d_B)^{(1-d_A)d_C}\prod_c \frac{\prod_j\Gamma(n_{\cdot jc}+1)\,\prod_i\Gamma(n_{i\cdot c}+d_B)}{\Gamma(n_{\cdot\cdot c}+d_B)\,\prod_{ij}\Gamma(n_{ijc}+1)}.$$

For the model search procedures described below we must also be able to compare models $M_5$ and $M_6$ of figure 5:

$$\frac{p_u(M_5|D)}{p_u(M_6|D)} = \frac{p_u(n|M_5)}{p_u(n|M_6)} = \frac{1}{\Gamma(d_B)^{(d_A-1)d_C}}\prod_c \frac{\prod_j\Gamma(n_{\cdot jc}+1)\,\prod_i\Gamma(n_{i\cdot c}+d_B)}{\Gamma(n_{\cdot\cdot c}+d_B)\,\prod_{ij}\Gamma(n_{ijc}+1)}.$$
[Figure omitted: two directed graphs on vertices $A$, $B$, $C$; in $M_5$ there are arrows from both $A$ and $C$ to $B$, in $M_6$ only from $C$ to $B$.]

Figure 5: Directed models
5 Graphical model choice - global analysis
If we have many variables, their interdependencies can be modeled as a graph with vertices corresponding to the variables. The example of figure 6 is from [23], and shows the dependencies in a data matrix related to heart disease. Of course, a graph of this kind can give a data probability to the data matrix in a way analogous to the calculations in the previous section, although the formulae become rather involved, and the number of possible graphs increases dramatically with the number of variables. It is completely infeasible to list and evaluate all graphs if there are more than a handful of variables. An interesting possibility to simplify the calculations would use some kind of separation, so that an edge in the model could be given a score independent of the inclusion or exclusion of most other potential edges. Indeed, the derivations of the last section show how this works. Let $C$ in that example be a compound variable, obtained by merging columns $\{c_1,\ldots,c_d\}$.

[Figure omitted: graphical model on the variables Mental Work, Lipoproteins, Physical Work, Smoking, Anamnesis and Blood Pressure.]

Figure 6: Symptoms and causes relevant to heart problems
If two models $G$ and $G'$ differ only by the presence and absence of the edge $\{A,B\}$, and if there is no path between $A$ and $B$ except through the vertex set $C$, then the expressions for $p(n|M_4)$ and $p(n|M_3)$ above will become factors of the expressions for $p(n|G)$ and $p(n|G')$, respectively, and the other factors will be the same in the two expressions. Thus, the Bayes factor relating the probabilities of $G$ and $G'$ is the same as that relating $M_4$ and $M_3$. This result is independent of the choice of distributions and priors of the model, since the structure of the derivation follows the structure of the graph of the model; it is equally valid for Gaussian or other data models, as long as the parameters of the participating distributions are assumed independent in the prior assumptions. A beautiful abstract analysis of this phenomenon can be found in Dawid and Lauritzen [12].
We can now think of various 'greedy' methods for building high probability interaction graphs relating the variables (columns in the data matrix). It is convenient and customary to restrict attention to either decomposable (chordal) graphs or directed acyclic graphs. Chordal graphs are fundamental in many applications of describing relationships between variables (typically variables in systems of equations or inequalities). They can be characterized in many different but equivalent ways, see Rose [29] and Rose, Lueker and Tarjan [30]. One simple way is to consider a decomposable graph as consisting of the union of a number of maximal complete graphs (cliques, or maximally connected subgraphs), in such a way that (i) there is at least one vertex that appears in only one clique (a simplicial vertex), (ii) if an edge to a simplicial vertex is removed, another decomposable graph remains, and (iii) the graph without any edges is decomposable. A characteristic feature of a simplicial vertex is that its neighbors are completely connected. This recursive definition can be reversed into a generation procedure: Given a decomposable graph $G$ on the set of vertices, find two vertices $s$ and $n$ such that (i) $s$ is simplicial, i.e., its neighbors are completely connected, and (ii) the graph $G'$ obtained by adding the edge between $s$ and $n$ to $G$ is also decomposable. We will call such an edge a permissible edge of $G$. This procedure describes a generation structure (a directed acyclic graph whose vertices are decomposable graphs on the set of vertices) containing all decomposable graphs on the variable set. An interesting feature of this generation process is that it is easy to compute the Bayes factor comparing the posterior probabilities of the graphs $G$ and $G'$ as graphical models of the data: Let $s$ correspond to $A$, $n$ to $B$, and let the compound variable obtained by fusing the neighbors of $s$ correspond to $C$ in the analysis of section 4. Without explicit prior model probabilities we have:

$$\frac{p(G'|D)}{p(G|D)} = \frac{p_u(n|M_3)}{p_u(n|M_4)}. \tag{29}$$
A search for high probability graphs can now be organized as follows:

1. Start from the graph $G_0$ without edges.

2. Repeat: find, among the permissible edges, one that gives the highest Bayes factor, and add it if the factor is greater than 1. Keep a set of highest probability graphs encountered.

3. Then repeat: For the high probability graphs found in the previous step, find simplicial edges whose removal increases the Bayes factor the most (or decreases it the least).

For each graph kept in this process, its Bayes factor relative to $G_0$ can be found by multiplying the Bayes factors in the generation sequence. A procedure similar to this one is reported by Madigan and Raftery [24], and its results on small variable sets were found good, in that it found the best graphs reported in other approaches. It must be noted, however, that we have now passed into the realm of approximate analysis, since we cannot (yet) know that we will find all high probability graphs. One splendid example of this is where we have many binary categorical columns, all generated randomly and independently of each other except the last one, which is the parity function of the other ones. If we start searching from the empty graph, we will never find this relationship, since the intermediate graphs will have low probability. Likewise, if some arbitrary subset of the columns are interrelated by a parity constraint, it seems unlikely, although possible, that we will find it even if we start the search from the saturated model (graph with all edges). A simplified sketch of the greedy edge-scoring step is given below.
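The following sketch illustrates the greedy loop of step 2 under strong simplifying assumptions of our own: a candidate edge $\{a,b\}$ is scored by the marginal dependence evidence from equation (25) (via log_bf25 from section 4's sketch), ignoring the chordality bookkeeping and the separating set $C$, so this is only the scoring loop, not the full permissible-edge search:

```python
from itertools import combinations

def greedy_edges(cases, domains):
    """Greedily add the edge with the strongest dependence evidence,
    stopping when no remaining edge has Bayes factor > 1 for dependence."""
    edges, n_vars = set(), len(domains)
    while True:
        candidates = [e for e in combinations(range(n_vars), 2)
                      if e not in edges]
        if not candidates:
            return edges
        # score > 0 means equation (25) favors dependence of the pair
        score = lambda e: -log_bf25(cases, e[0], e[1],
                                    domains[e[0]], domains[e[1]])
        best = max(candidates, key=score)
        if score(best) <= 0.0:
            return edges
        edges.add(best)

# Toy data: column 2 copies column 0, column 1 is unrelated noise.
cases = [(i % 2, (i // 2) % 2, i % 2) for i in range(40)]
print(greedy_edges(cases, [2, 2, 2]))   # finds {(0, 2)}
```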
Another family of graphical models are the directed acyclic models. They can be treated similarly, since here we check locally, for a variable $B$ that has been found dependent on a set $C$, whether it can be inferred also to depend on variable $A$. We compare thus models $M_5$ and $M_6$ of figure 5. The inclusion or exclusion of the arrow from $A$ to $B$ can be inferred independently of all arrows not going to $B$. A problem with directed graphical models is that different acyclic graphs can represent the same family of probability distributions, and this requires some careful argumentation.
6 Graphical model choice - categorical, ordinal and Gaussian variables

We now consider data matrices made up from ordinal and real valued data. The standard model for a real valued variable is the normal distribution. It has nice theoretical properties manifesting themselves in such forms as the central limit theorem, the least squares method, principal components, etc. However, it must be noted that it is also unsatisfactory for many data sets occurring in practice, because of its narrow tails and because many real life distributions deviate terribly from it. Several approaches to solve this problem are available. One is to consider a variable as being obtained by mixing several normal distributions. This is a special case of the classification or segmentation problem discussed below. Another is to disregard the distribution over the real line, and consider the variable as just being made up of an ordered set of values. This leads naturally to the recursive splitting of the data set by a decision tree, also discussed below.
7 Missing values and errors in data matrix
Data collected from experiments are seldom perfect. The problem of missing and erroneous data is a vast field in the statistics literature. First of all there is a possibility that 'missingness' of data values is significant for the analysis, in which case missingness should be modeled as an ordinary data value. Then the problem has been internalized, and the analysis can proceed as usual, with the important difference that the missing values are not available for analysis. A more sceptical approach was developed by Ramoni and Sebastiani [27], who consider an option to regard the missing values as adversaries (the conclusions on dependence would then be true no matter what the missing values are). The other possibility is that missingness is known to have nothing to do with the objectives of the analysis. For example, in a medical application, if data is missing because of the bad condition of the patient, missingness is significant if the investigation is concerned with patients. But if data is missing because of unavailability of equipment, it is probably not, unless maybe if the investigation is related to hospital quality. In Bayesian data analysis, the problem of missing or erroneous data creates significant complications, as we will see. As an example, consider the analysis of the two-column data matrix with binary categorical variables $A$ and $B$, analyzed against models $M_1$ and $M_2$ of section 4. Suppose we obtained $n_{00}$, $n_{01}$, $n_{10}$ and $n_{11}$ cases with the values 00, 01, etc. We then have a posterior Dirichlet distribution with parameters $n_{ij}$ for the probabilities of the four possible cases. If we now receive a case where both $A$ and $B$ are unknown, it is reasonable that this case is altogether ignored. But what shall we do if a case arrives where $A$ is known, say 0, but $B$ is unknown? One possibility is to waste the entire case, but this is not orthodox Bayesian, since we are not making use of information we have. Another possibility is to use the current posterior to estimate a pdf for the missing value; in our case the probability that $B$ has value 0 is $p_0 = n_{00}/n_{0\cdot}$. So our posterior is now either a Dirichlet with parameters $n_{00}+1$, $n_{01}$, $n_{10}$ and $n_{11}$ (probability $p_0$) or one with parameters $n_{00}$, $n_{01}+1$, $n_{10}$ and $n_{11}$ (probability $1-p_0$). But this means that the posterior is now a weighted average of two Dirichlet distributions; in other terms, it is not a Dirichlet distribution at all! As the number of missing values increases, the number of terms in the posterior will increase exponentially, and the whole advantage with conjugate distributions will be lost.
The related case of errors in data is more difficult to treat. How do we describe data where there are known uncertainties in the recording procedure? This is a problem worked on for centuries when it comes to real valued quantities as measured in physics and astronomy, and is one of the main features of interpretation of physics experiments. When it comes to categorical data there is less help in the literature; an obvious alternative is to relate recorded vs actual values of discrete variables as a probability distribution, or, which is fairly expedient in our approach, as an equivalent sample.
8 Decision trees
Decision trees are typically used when we want to predict a variable, the class variable, from other, explanatory, variables in a case, and we have a data matrix of known cases. When modeling data with decision trees, we are usually trying to segment the data set into ranges, n-dimensional boxes of which some are unbounded, such that a particular variable, the class variable, is fairly constant over each box. If the class variable is truly constant in each box, we have a tree that is consistent with respect to the data. This means that for new cases, where the class variable is not directly available, it can be well predicted by the box into which the case falls. The method is suitable where the variables used for prediction are of any kind (categorical, ordinal or numerical) and where the predicted variable is categorical or ordinal with a small domain. There are several efficient ways to heuristically build good decision trees, and it is a central technique in the field of machine learning. Practical experience has given many cases where the predictive performance of decision trees is good, but also many counter-intuitive phenomena have been uncovered by practical experiments. Recently, several treatments of decision trees have been published where it is discussed whether or not the smallest possible tree consistent with all cases is the best one. This turned out not to be the case, and the argument that a smallest decision tree should be preferred because of some kind of Occam's razor argument is apparently not valid, neither in theory nor in practise [34, 2]. The Bayesian approach gives the right information on the credibility and generalizing power of a decision tree. It is explained in recent papers by Chipman, George and McCulloch [9] and by Paass and Kindermann [26]. A decision tree statistical model is one where a number of boxes are defined on one set of variables by recursive splitting of one box into two, by splitting the range of one designated variable into two. Data are assumed to be generated by a discrete distribution over the boxes, and for each box it is assumed that the class variable value is generated by another discrete distribution. Both these distributions are given uninformative Dirichlet prior distributions, and thus the posterior probability of a decision tree can be computed from data. Since larger trees have more parameters, there is an automatic penalization of large trees, but the distribution of cases into boxes also enters the picture, so it is not clear that the smallest tree giving perfect classification will be preferred, or even that a consistent tree will be preferred over an inconsistent one. The decision trees we described here do not give a clear-cut decision on the value of the decision variable for a case, but a probability distribution over values. As the name of this data model indicates its use for decision making, one can get better trees for an application by including information about the utility of the decision in the form of a loss function and by comparing trees based on the expected utility rather than model probability.

For a decision tree $T$ with $d$ boxes, data with $c$ classes, and where the number of cases in box $i$ with class value $k$ is $n_{ik}$, and $n = n_{\cdot\cdot}$, we have, with uniform priors on both the assignment of case to box and of class within box,

$$p(D|T) = \frac{\Gamma(n+1)\,\Gamma(d)}{\Gamma(n+d)}\prod_i \frac{\Gamma(n_{i\cdot}+1)\,\Gamma(c)}{\Gamma(n_{i\cdot}+c)}. \tag{30}$$
However, in order to compare two trees $T$ and $T'$, we would have to form the set of intersection boxes and ask about the probability of finding the data with a common parameter over the boxes belonging to a common box of $T$, relative to the probability of the data when the parameters are common in boxes of $T'$. For the case where $T$ and $T'$ only differ by the splitting of one box $i$ into $i'$ and $i''$, the calculation is easy ($n_{i''k} + n_{i'k} = n_{ik}$):

$$\frac{p(D|T')}{p(D|T)} = \frac{\Gamma(n_{i\cdot}+c)}{\Gamma(n_{i'\cdot}+c)\,\Gamma(n_{i''\cdot}+c)}\prod_k \frac{\Gamma(n_{i'k}+1)\,\Gamma(n_{i''k}+1)}{\Gamma(n_{ik}+1)}. \tag{31}$$
9 Segmentation - Latent variables
Segmentation and latent variable analysis is directed at describing the data set
as a collection of subsets, each having a simpler description than the full data
matrix. Suppose data set D is partitioned into d_c classes {D^{(i)}}, and each of
these has a high posterior probability p(D^{(i)}|M_i) with respect to some model
set {M_i}. Then we think that the classification is a good model for the data. However,
some problems remain to consider. First, what is it that we compare the
classification against, and second, how do we accomplish the partitioning of the
cases? The first question is the simplest to answer: we compare a classification
model against some other model, based on classification or not. The second
is trickier, since the introduction of this section is somewhat misleading. The
prior information for a model based on classification must have some
information about classes, but it does not have an explicit division of the data into
classes available. Indeed, if we were allowed to make this division into classes
on our own, seeking the highest posterior class model probabilities, we would
probably over-fit by using the same data twice - once for class assignment and
once for posterior model probability computation. The statistical model
generating segmented data could be the following: a case is first assigned to a
class by a discrete distribution obtained from a suitable uninformative Dirichlet
distribution, and then its visible attributes are assigned by a class-dependent
distribution. This model can be used to compute a probability of the data
matrix, and then, via Bayes rule, a Bayes factor relating the model with another
one, e.g., one without classes or with a different number of classes. One can also
have a variable number of classes and evaluate by finding the posterior
distribution of the number of classes. The data probability is obtained by integrating,
over all assignments of cases to classes, the data probabilities according to the
respective class model. Needless to say, this integration is feasible only for a
handful of cases, where the data is anyway too meager to permit any kind of
significant conclusion on the number of classes and their distributions.
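As an illustration of this generative model (a minimal sketch with hypothetical
names, for discrete attributes only): class proportions and the per-class attribute
distributions are drawn from uniform Dirichlet priors, each case first draws a
class, and then each attribute value is drawn from the class-dependent distribution.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_segmented_data(n_cases, n_classes, attr_domain_sizes):
        """Sample a data matrix from the segmentation model sketched above."""
        proportions = rng.dirichlet(np.ones(n_classes))    # class distribution
        attr_dists = [[rng.dirichlet(np.ones(m)) for m in attr_domain_sizes]
                      for _ in range(n_classes)]           # class-dependent distributions
        classes = rng.choice(n_classes, size=n_cases, p=proportions)
        data = np.array([[rng.choice(m, p=attr_dists[k][a])
                          for a, m in enumerate(attr_domain_sizes)]
                         for k in classes])
        return classes, data   # classes is latent; only data is observed

    latent, D = generate_segmented_data(100, 3, [2, 2, 4])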
The most well-known procedures for automatic classification are built on
expectation maximization. With this technique, a set of class parameters is
refined by assigning cases to classes probabilistically, with the probability of each
case membership determined by the likelihood vector for it under the current class
parameters[8]. We can also solve the problem with the MCMC approach[28].
The MCMC approach to classification is the following: assume that we have
a data matrix and want a classification of its cases which makes the attributes
independent. Define a class assignment randomly, and compute the probability
of data, given the model with independent attributes, as in (24), which is easy to
generalize to more attributes. The MCMC will now implement a move function,
proposing a changed class for some case. The move is accepted if the posterior
probability increases, and otherwise with a probability given by the ratio of new
to old data probability (see section 11). This procedure is reasonably efficient,
since it is possible to evaluate the class probabilities incrementally, by keeping
just the current contingency table for each class and updating it as cases change
class. Since absolute probabilities are held updated, we also avoid a common
complication in MCMC applications, arising when the dimension of the parameter
space changes. Although this complication can sometimes be avoided, it is not
always so; the reversible jump process was designed to cope with it[6].
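A sketch of one such move, under stated assumptions: the data probability below
is a uniform-Dirichlet marginal per class and attribute (intended in the spirit of
(24), whose exact form is given earlier in the report), and the move is the
Metropolis rule just described. For clarity the probability is recomputed from
scratch at each move; the incremental update mentioned above would touch only
the terms of the two classes involved.

    import numpy as np
    from scipy.special import gammaln

    def log_data_prob(data, assign, n_classes, attr_domain_sizes):
        """Log probability of the data given a class assignment, with
        independent attributes within each class and uniform Dirichlet priors."""
        logp = 0.0
        for k in range(n_classes):
            block = data[assign == k]
            for a, m in enumerate(attr_domain_sizes):
                counts = np.bincount(block[:, a], minlength=m)
                logp += (gammaln(m) + np.sum(gammaln(counts + 1))
                         - gammaln(len(block) + m))
        return logp

    def mcmc_step(data, assign, n_classes, attr_domain_sizes, rng):
        """Propose a new class for one random case; accept by the Metropolis rule."""
        i = rng.integers(len(assign))
        old = assign[i]
        lp_old = log_data_prob(data, assign, n_classes, attr_domain_sizes)
        assign[i] = rng.integers(n_classes)
        lp_new = log_data_prob(data, assign, n_classes, attr_domain_sizes)
        if np.log(rng.random()) >= lp_new - lp_old:
            assign[i] = old                   # reject the move
        return assign

    # Usage: start from a random assignment and iterate mcmc_step; the visited
    # assignments are (correlated) samples from the posterior over classifications.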
10 Association rules
Association rules are special sets of rules used to predict data in data mining.
The literature on association rules emphasizes rapid extraction, since typically
a data matrix has very many potential association rules and the data matrices
considered are very large. An association rule is written A -> B, where A
and B are conditions on a data case. They can be either defined by giving a
predicate on the value of an attribute, or as a conjunction of such conditions
for several attributes. In the literature, binary attributes are often assumed.
The usefulness of such a rule depends on how well it satisfies the intuitive
condition of the rule: whenever A is true for a case, B is also true. The support
of the rule is the fraction of cases where both A and B are true, whereas the
confidence is the fraction of cases with A true where also B is true. The lift
of a rule is the factor by which its confidence exceeds the confidence we would
have with independence between A and B, computed in a ML framework,
i.e., n_{AB}n_{::}/(n_{A:}n_{:B}), where the notation is an obvious adaptation of the
contingency table notation used previously. Clearly, the concept of lift assumes a
large database, where statistical fluctuation can be ignored. In order to assess
the significance of an association rule, we need the machinery of Bayes factors,
and then we can easily assess the generalization expectable from a proposed
rule. In short, let the significance of a rule be the Bayes factor between a model
that gives dependence between A and B and a model that does not. This gives
s(A \to B) = \frac{\Gamma(n_{::}+c)}{\Gamma(n_{A:}+c)\,\Gamma(n_{\bar{A}:}+c)} \cdot \frac{\Gamma(n_{AB}+1)\,\Gamma(n_{\bar{A}B}+1)}{\Gamma(n_{:B}+1)} \cdot \frac{\Gamma(n_{A\bar{B}}+1)\,\Gamma(n_{\bar{A}\bar{B}}+1)}{\Gamma(n_{:\bar{B}}+1)}    (32)
Since the significance depends on four quantities, which can have a very
large span of values in practical applications, this concept seems necessary for
throwing out rules that cannot be expected to generalize because either the
database, the support or the lift (or some combination) is too small. However, there
are more dangers in applying data mining results, particularly the problem of
biased sampling, which no test on sampled data can reveal.
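A sketch combining the standard rule measures with the significance (32),
assuming binary conditions (so c = 2); the inputs are the four cells of the A/B
contingency table, and the function name is ours, not from the literature.

    import numpy as np
    from scipy.special import gammaln

    def rule_measures(n_ab, n_a_nb, n_na_b, n_na_nb):
        """Support, confidence, lift and log-significance of the rule A -> B."""
        n = n_ab + n_a_nb + n_na_b + n_na_nb
        n_a, n_b = n_ab + n_a_nb, n_ab + n_na_b
        support = n_ab / n
        confidence = n_ab / n_a
        lift = n_ab * n / (n_a * n_b)
        c = 2                                 # B is binary here
        log_s = gammaln(n + c) - gammaln(n_a + c) - gammaln(n - n_a + c)
        for x, y in [(n_ab, n_na_b), (n_a_nb, n_na_nb)]:   # columns B, not-B
            log_s += gammaln(x + 1) + gammaln(y + 1) - gammaln(x + y + 1)
        return support, confidence, lift, log_s

    # Identical confidence and lift, very different significance:
    print(rule_measures(3, 0, 0, 3))      # log-significance about 2.2: weak
    print(rule_measures(300, 0, 0, 300))  # log-significance about 408: overwhelming

The example shows the point made above: a rule with perfect confidence and high
lift generalizes poorly when the supporting counts are small, and the Bayes factor
exposes this where the ML-based measures cannot.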
Mining of large files for association rules typically reveals very large quantities
of significant rules. Many papers have been devoted to finding an interesting
subset of such rules. Two basic approaches exist: in one, a measure of
interestingness or surprisingness is defined for a particular rule; in the other, a rule is
evaluated in the context of an already existing rule set.
11 Approximate analysis with Metropolis-Hastings simulation
Several of the cases mentioned previously, where analytical solutions become
infeasible - because of the breakdown of the simple conjugacy principle (missing and
erroneous values), because of the large number of models to be considered (graphical
models on many variables), or because of the analytical difficulty of computing the
data probability (classification) - have been attacked separately with Monte Carlo
methods. We will outline a method that solves all these cases at once.
The basic problem solved by MCMC methods is sampling from a multivariate
distribution over many variables. The distribution can be given generically as
p(x, y, z, ..., w). If some variable, e.g., y, represents measured signals, then
the actual values measured, say a, can be substituted, and sampling will be
from the conditional distribution proportional to p(x, a, z, ..., w). If some other
variable, say x, represents the quantity we are trying to measure, then sampling
and selecting the x component will give samples from the posterior of x given
the measurements. In other words, we will get the best possible estimate of the
quantity given the measurements and the statistical model of the measurement process.
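To make the mechanics concrete, here is a minimal random-walk Metropolis sketch
(hypothetical names; a toy Gaussian measurement model): only the quotient of the
target density at two points is needed, so an unnormalized log density suffices.

    import numpy as np

    def metropolis(log_density, x0, n_steps, step=0.5, rng=None):
        """Random-walk Metropolis over a target known up to a constant factor."""
        rng = rng or np.random.default_rng()
        x = np.asarray(x0, dtype=float)
        lp = log_density(x)
        samples = []
        for _ in range(n_steps):
            proposal = x + step * rng.standard_normal(x.shape)  # symmetric proposal
            lp_prop = log_density(proposal)
            if np.log(rng.random()) < lp_prop - lp:             # accept by the quotient
                x, lp = proposal, lp_prop
            samples.append(x.copy())
        return np.array(samples)

    # Posterior of x given a measurement y = a, with x ~ N(0,1) a priori and
    # y | x ~ N(x, 0.5^2); the exact posterior mean for a = 1.2 is 0.96.
    a = 1.2
    log_post = lambda x: -0.5 * x @ x - 0.5 * ((a - x[0]) / 0.5) ** 2
    draws = metropolis(log_post, x0=[0.0], n_steps=5000)
    print(draws[1000:].mean())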
The two basic methods for MCMC computation are the Gibbs sampler and
the Metropolis-Hastings algorithm. Both generate a Markov chain with states
over the domain of a multivariate target distribution and with the target
distribution as its unique limit distribution. Both exist in several more or less
refined versions. The Metropolis algorithm has the advantages that it does not
require sampling from the conditional distributions of the target distribution,
but only finding the quotient of the distribution at two arbitrary given points,
and that it can be chosen from a set of variants with better convergence properties.
A thorough introduction is given by Neal[25]. To sum up, MCMC methods can be used
to estimate distributions that are not tractable analytically or numerically. We
get real estimates of posterior distributions and not just approximate maxima
of functions. On the negative side, the Markov chains generated have high
autocorrelation, so a sample over a sequence of steps can give a highly misleading