Part I: Discrete and semi-discrete Data Matrices

Stefan Arnborg
Swedish Institute of Computer Science
SICS TR T99:08, ISSN 1100-3154, ISRN: SICS-T99/08-SE

Abstract

This tutorial summarises the use of Bayesian analysis and Bayes factors for finding significant properties of discrete (categorical and ordinal) data. It overviews methods for finding dependencies and graphical models, latent variables, robust decision trees and association rules.
1 Introduction
Data mining is complementary to Bayesian data analysis. Whereas data mining is often seen as the problem of grinding through massive data sets for the purpose of finding unexpected dependencies in the form of correlations, association rules and segmentations, Bayesian data analysis is typically seen as an activity of evaluating detailed models for small data sets. We are interested in the middle ground, where data is scarce enough to pose delicate questions of validity and significance of our findings, but where we do not yet have detailed mathematical models. We are developing tools and methodology for exploratory analysis of small and fragile data sets, as a preparatory step for a more detailed analysis, as can be performed in the Bayesian framework with, e.g., the BUGS system [33].
The application area is human brain research. Here, many different types of data are recorded for patients and for healthy control persons. Besides results of established and well standardized tests and background data, many results from imaging investigations (measuring cell structure, blood flow, receptor presence, etc.) are entered as extracted features of images mapped to brain atlases. Genetic data related to brain development is also emerging. Some data entered are uncertain, others are being standardized. We seldom have a complete data set for any individual, since the data collection process is costly and often infeasible for patients in bad condition. The objective of data mining on these data is a deeper understanding of the interplay between physiological and psychiatric conditions, and also improved procedures for diagnosing patients and choosing therapies.
The purpose of this report is to explain the advantage of the Bayesian approach in the present application, and how the Bayes factor can be used to display the information or knowledge we are after in an application. It is also our intention to give a full account of the computations required. It can serve as a survey of the area, although it focuses on techniques being investigated in the present project. Several of the computations we describe have been analysed at length, although not exactly in the way and with the same conclusions as found here. The contribution here is a systematic treatment that is confined to pure Bayesian analysis and puts several established data mining methods in a joint Bayesian framework. We do not want to enter the discussion of why the Bayesian approach is superior to its alternatives, but some background material is included. We will see that, although many computations of Bayesian data mining are straightforward, one soon reaches problems where difficult integrals have to be evaluated, and presently only Markov Chain Monte Carlo (MCMC) methods are available. There are several recent books describing the Bayesian method from a theoretical [3], an ideological [19, 32] and an application oriented [7] perspective. A main historic influence leading to increased interest in Bayesian methods is Harold Jeffreys, who wrote particularly two books on scientific inference and probability theory from a Bayesian perspective [21, 20]. A current survey of MCMC methods, which can solve some complex evaluations required in Bayesian modeling, can be found in the book [17]. Books explaining theory and use of graphical models are Lauritzen [22], Cox and Wermuth [10], and Whittaker [35]. A tutorial on Bayesian network approaches to data mining is found in Heckerman [18]. This present report describes data mining in a relational data structure with discrete data (discrete data matrix) and the simplest generalizations to numerical data. A second part will describe general real valued data matrices, raster data representing, e.g., scalar and/or vector fields, as well as time series and strings.
2 Data model
We consider a data matrix where rows are cases and columns are variables. In our application, the row is associated with a person or an investigation (patient and date). The columns describe a large number of variables that could be recorded, such as background data (occupation, sex, age, etc.), and numbers extracted from investigations made, like sizes of brain regions, receptor densities and blood flow by region, etc. Categorical data can be equipped with a confidence (probability that the recorded datum is correct), and numerical data with an error bar. Every datum can be recorded as missing, and the reason for missing data can be related to the patient's condition or to external factors (like equipment unavailability or time and cost constraints). Only the latter type of missing data is (at least approximately) unrelated to the domain of investigation. On the level of exploratory analysis we confine ourselves to discrete and multivariate normal distributions, with Dirichlet and inverse Wishart priors. In this way, no delicate and costly MCMC methods will be required until missing data and/or segmentation is introduced. If the data do not satisfy these conditions (e.g., normality for a real variable), they may do so after suitable transformation and/or segmentation. Another approach is to ignore the distribution over the real line and regard a numerical attribute as an ordinal one, i.e., one defined only up to order, as when subjects rank their appreciation of a phenomenon in organized society or their valuation of their own emotions.
2.1 Multivariate data models
Given a data matrix, the first question that arises concerns the relationships between its variables (columns). Could some pairs of variables be considered independent, or do the data indicate that there is a connection between them, either directly causal, mediated through another variable, or introduced through sampling bias? These questions are analyzed using graphical models, directed or decomposable [24]. As an example, in figure 1, $M_1$ indicates a model where $A$ and $B$ are dependent, whereas they are independent in model $M_2$. In figure 2, we describe a directed graphical model $M_4''$ indicating that variables $A$ and $B$ are independently determined, but the value of $C$ will be dependent on the values for $A$ and $B$. The similar decomposable model $M_4$ indicates that the dependence of $A$ and $B$ is completely explained by the mediation of variable $C$. We could think of the data generation process as determining $A$, then $C$ dependent on $A$ and last $B$ dependent on $C$, or equivalently, determining first $C$ and then $A$ dependent on $C$ and $B$ dependent on $C$.
[Figure omitted: undirected and directed two-variable graphs $M_1$, $M_1'$, $M_2$, $M_2'$ on the vertices $A$ and $B$.]

Figure 1: Graphical models, dependence or independence?
Bayesian analysis of graphical models involves selecting all or some graphs on the variables, dependent on prior information, and comparing their posterior probabilities with respect to the data matrix. A set of highest posterior probability models is retained, and in interpreting these one must, as always in statistics, constantly remember that dependencies are not necessarily causalities.

[Figure omitted: graphs $M_3$, $M_3'$, $M_4$, $M_4'$, $M_4''$ on the vertices $A$, $B$ and $C$.]

Figure 2: Graphical models
A second question that arises concerns the relationships between rows (cases) in the data matrix. Are the cases built up from distinguishable classes, so that each class has its data generated from a simpler graphical model than that of the whole data set? In the simplest case these classes can be directly read off in the graphical model. In a data matrix where inter-variable dependencies are well explained by the model $M_4$, if $C$ is a categorical variable taking only few values, splitting the rows by the value of $C$ could give a set of data matrices in each of which $A$ and $B$ might be independent. However, the interesting cases are those where the classes cannot be directly seen in a graphical model, because then the classes are not trivially derivable. If the data matrix of the example contained only variables $A$ and $B$, because $C$ was unavailable or unknown to interfere with $A$ and $B$, the highest posterior probability graphical model might be one with a link from $A$ to $B$. The classes would still be there, but since $C$ would be latent or hidden, the classes would have to be derived from the $A$ and $B$ variables only. A different case of classification is where the values of one numerical variable are drawn from several normal distributions with different means and variances. The full column would fit very badly to any single normal distribution, but after classification, each class could have a set of values fitting a normal distribution well. A classification system built on Bayesian methodology is described by Cheeseman and Stutz [8].
A third question, often the one of highest practical concern, is whether some designated variable can be reliably predicted, in the sense that it is well related to combinations of values of other variables, not only in the data matrix, but also with high confidence in new cases that are presented. This question leads to another concept that has been extensively studied, namely association rules. Consider a data matrix well described by model $M_4$ in figure 2. It is conceivable that the value of $C$ is a good predictor of variable $B$, and better than $A$. It also seems likely that knowing both $A$ and $C$ is of little help compared to knowing only $C$, because the influence of $A$ on $B$ is completely mediated by $C$. On the other hand, if we want to predict $C$, it is well conceivable that knowing both $A$ and $B$ is better than knowing only one of them.

Finally, it is possible that a data matrix with many categorical variables with many values gives a scattered matrix with very few cases compared to the number of potentially different cases. Generalization is a technique by which a coarsening of the data matrix can yield better insight, such as replacing the age and sex variables by the categories kids, young men, adults and seniors in a car insurance application. The question of relevant generalization is clearly related to the problems of finding association rules and to classification. For ordinal variables, this line of inquiry leads naturally to the concept of decision trees, that can be thought of as a recursive splitting of the data matrix by the size of one of its ordinal variables.
3 Bayesian analysis, uninformative priors, and over-fitting
A natural procedure for estimating dependencies among categorical variables is by means of conditional probabilities estimated as frequencies in the data matrix. Likewise, correlations can be used to find dependencies among real valued variables. Such procedures usually lead to selection of the more detailed models and give poor generalizing performance, in the sense that new sets of data are likely to have completely different dependencies. Various penalty terms have been tried to avoid over-fitting. However, the Bayesian method has a built-in mechanism that favors the simplest models compatible with the data, and also selects more detailed models as the amount of data increases. The procedure is to compare posterior model probabilities, where the posterior probability of a model is obtained by combining its prior distribution of parameters with the probability of the data as a function of the parameters, using Bayes' rule. Thus, if $p_1(\theta_1)$ is the prior pdf of the parameter (set) $\theta_1$ of model $M_1$ and the probability of obtaining the case (row of data matrix) $d$ is $p(d|M_1,\theta_1)$, then the probability in model $M_1$ of the data matrix $D$ containing the ordered cases $\{d_i\}_{i\in I}$ is:

$$p(D|M_1) = \int \prod_{i\in I} p(d_i|M_1,\theta_1)\,p(\theta_1)\,d\theta_1, \tag{1}$$

and the posterior probability of model $M_1$ given the data $D$ is, by Bayes' rule,

$$p(M_1|D) = \frac{p(D|M_1)\,p(M_1)}{p(D)}. \tag{2}$$
From a frequentist or orthodox statistical point of view it is questionable to do this interchange and consider the probability of a model given the data. This is exactly what makes the difference between Bayesian and frequentist methods. If the data matrix is unordered, one should multiply with a multinomial coefficient, but this is often not done; whether or not this is done does not matter for computation of Bayes factors, see below. Two models $M_1$ and $M_2$ can now be related with respect to the data by the Bayes factor $p(D|M_1)/p(D|M_2)$. This is a factor which is multiplied with the prior odds between the two models, $p(M_1)/p(M_2)$, to get the posterior odds $p(M_1|D)/p(M_2|D)$. The posterior odds can now take the place of a new prior for the next data batch, and the procedure can be repeated. It should be noted, however, that the model averaging is done for each batch; whether this is appropriate or not depends on the application, and often it is not.
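As a concrete illustration of how a Bayes factor turns prior odds into posterior odds, and how the posterior odds can serve as the prior for the next batch, consider the following minimal sketch in Python. The helper name and the marginal likelihood values are hypothetical placeholders, not taken from the report:

```python
# Minimal sketch: turning Bayes factors into posterior odds, batch by batch.
# The marginal likelihoods p(D_batch | M_i) would come from integrals such
# as equation (1); here they are hypothetical numbers for illustration.

def posterior_odds(prior_odds: float, p_data_m1: float, p_data_m2: float) -> float:
    """Multiply prior odds p(M1)/p(M2) by the Bayes factor p(D|M1)/p(D|M2)."""
    bayes_factor = p_data_m1 / p_data_m2
    return prior_odds * bayes_factor

odds = 1.0  # no prior preference between M1 and M2
batches = [(0.012, 0.004), (0.020, 0.015), (0.001, 0.003)]  # (p(D|M1), p(D|M2))
for p1, p2 in batches:
    odds = posterior_odds(odds, p1, p2)
    print(f"posterior odds p(M1|D)/p(M2|D) = {odds:.3f}")

# With two exhaustive models, the posterior probability of M1 is:
prob_m1 = odds / (1.0 + odds)
```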
A high value of the Bayes factor, say more than 100, speaks strongly in favor of model $M_1$, while a value below .01 gives strong support for $M_2$. Values closer to one (i.e., in the range .3 to 3), however, tell us that the data are insufficient to decide between the models, and this is unavoidable: methods that decide in those cases cannot be well designed. This appears to be a significant difference between the Bayesian approach and many analyses occurring in AI and data mining: we do not consider our data as an imperfect image of an ideal underlying and completely precise probability model. On the contrary, we ask which imperfect underlying models best serve to describe our data. If we tried to get much more data than we have, we would not necessarily become wiser, since the data collection process may well be such that cases are not independent, and the data collection process may change the nature of the data through the sampling process.
A disturbing feature of the Bayesian methodology is that it requires prior distributions. Priors give an impression of subjectivity, which they should not do. The prior is an assessment of a state of information, and is not related to a subject except that the information state is possessed by a subject. Often the information state is difficult to deal with since its form is fairly open-ended: just imagine information related to an open mathematical problem, or even an NP-hard optimization problem. However, every well-founded choice between alternatives must involve the prior beliefs of, objectively the state of information held by, the decision maker in some way, and the Bayesian method is one (in fact the only) consistent way of doing this. Bayesian methodology provides an expedient for the case where no strong prior beliefs should influence the conclusion, namely uninformative or weakly informative priors. For such prior distributions, more data is typically needed to reach a definite conclusion than for cases where there is distinct prior information to include in the analysis. With the Bayesian method there is no need to penalize more detailed models to avoid over-fitting: if $M_2$ is more detailed than $M_1$ in the sense of having more parameters to fit, then the parameter dimension is larger in $M_2$ and the prior density $p(\theta_2)$ is smaller than $p(\theta_1)$ in the region of good fit, which automatically penalizes $M_2$ against $M_1$. This automatic penalization has been found appropriate in many application cases, and should be complemented by explicit prior model probabilities only when there is definite prior information about the models. An asymptotic estimate of the penalization of detailed models implicit in the Bayes factor approach is a factor $n^{(p_1-p_2)/2}$, where $n$ is the number of data points (cases) and $p_i$ is the number of parameters in model $M_i$. This estimate was first found by Schwarz [31], and is known, when used to penalize more detailed models in a likelihood based model comparison, as the Bayesian information criterion (BIC). So deciding between the models using the likelihood ratios with the BIC as a penalizing factor is an approximation to the 'orthodox Bayesian' procedure of comparing posterior probabilities, and it is useful when the integration required for posterior determination is infeasible or otherwise unwanted. Some discussions of this point can be found in Ch. 24 of Jaynes [19] and also in Neal [25].
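The BIC approximation is easy to state in code. The following sketch, with hypothetical log-likelihood values of our own choosing, compares two models by the Schwarz-penalized likelihood; an exact Bayes factor would instead come from integrals like equation (1):

```python
import math

def bic(log_likelihood: float, n_params: int, n_cases: int) -> float:
    """Schwarz's Bayesian information criterion for one model."""
    return log_likelihood - 0.5 * n_params * math.log(n_cases)

# Hypothetical fitted log-likelihoods for models M1 (3 parameters)
# and M2 (9 parameters) on n = 200 cases.
n = 200
bic_m1 = bic(-310.0, 3, n)
bic_m2 = bic(-305.0, 9, n)

# exp(BIC difference) approximates the Bayes factor p(D|M2)/p(D|M1):
approx_bayes_factor = math.exp(bic_m2 - bic_m1)
print(f"approximate Bayes factor M2 vs M1: {approx_bayes_factor:.2e}")
# The 6 extra parameters cost 6 * 0.5 * log(200), about 15.9 in log score,
# outweighing the gain of 5.0 in log-likelihood, so M1 is preferred.
```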
The discussion above relates to choosing one of two models. Clearly, there is a possibility that the data discredit both these models, or that we have a whole family of models to choose from.

Consider the problem of comparing models in a family $\{M_1,\ldots,M_k\}$, and having no prior preference for any of them. If the models do not overlap, we should choose the probabilities $\{p(M_i|D)/\sum_j p(M_j|D)\}$ as the probabilities of these models given the data. By overlapping we mean that parameter sets of prior non-zero probability exist which give the same distribution in two models. We usually do not have overlap, since, e.g., in the case of nested models the region of overlap, the whole 'less specific' model, has prior probability zero in the more specific model. Typically, a nested family forming a tree or directed acyclic graph structure is chosen, where the dimension of the parameter space increases as one descends in the tree, and where the root is associated with the fewest parameters. The root model is the least specific one in the family.
In the modeling effort, the analyst must decide, on grounds of what is known in general terms about the application and the purpose of the analysis, which model family to consider. Here we must remember that inference is not an idle activity, but should normally be used to make decisions. Clearly, it is not adequate to select a model from its posterior probability without considering the consequences of decisions. In Bayesian decision theory (see, e.g., Berger [1]), we introduce actions and expected utility of actions given a 'state of the world', which could be a model or a model with its parameter. However, in Bayesian decision theory, the rational decision making follows from only the posterior and the utility functions (statisticians seem to be a pessimistic breed and usually talk about loss functions, but this is of course really the same thing). For this reason we do not introduce loss functions in this report.
3.1 The Bayesian debate and the unavoidability of Bayesian analysis
There was a quite heated debate among statisticians on the proper application of mathematical tools in the interpretation of experimental data. This debate started between Fisher and Pearson and continued between Fisher and Jeffreys. What is most remembered is the discussion between Bayesians and 'frequentists' (as traditional statisticians were called by Bayesians). For a trained pure mathematician the controversy between frequentist and Bayesian views does simply not appear: he is interested in abstract spaces with probability measures. The debate has, however, provoked strong reactions among statisticians and also recently in the AI community. Bayesians are known for their arrogance and claim to own the truth. It is unfortunate that this claim is not presented in many textbooks, because it is easy to understand, and also quite surprising. It is generally agreed that Bayes' original paper is deep and challenging, but it is also too vague and incoherent to be convincing, and many readers have rejected it outright. There is apparently no documented evidence that Laplace actually saw the paper or heard of it, but the work of Laplace is a continuation of the ideas in Bayes' work. Unfortunately, he did not succeed in convincing his colleagues and successors in the scientific community. His idea of the rule of succession is a clear application of Bayesian analysis, but it was rejected because his readers did not accept his choice of prior information (deciding the number of days, all with sunrise, since creation, by reading the Bible) and discarded the method on the basis of one dubious application. Obviously, if the Bible is reliable on this point, other information on the order of Nature found in it might contradict his application. Other sources of prior information were known by Laplace, but he did not use them for this purpose. Several great 19th century mathematicians have more or less by instinct used the ideas of Bayes and Laplace when performing computations on experimental data (typically in astronomy), but these efforts were more or less ignored when the discipline of statistics was created in the early 20th century.
The first derivation of the necessity of Bayesian methods was done by R. T. Cox in 1946 [11], and has been repackaged by Jaynes with a lot of motivating discussion. Basically, the analysis investigates which family of rules for reasoning with the plausibility of statements about the world is permissible, in the sense that they satisfy the following criteria:

I: The plausibility of a statement is a real number and dependent on information we have on the plausibility of other statements.

II: Consistency. If the plausibility of a statement can be derived in two ways, the two results must be equal.

III: Common sense. Some properties of statements known to be true or known to be false, and continuity rules.

From these criteria follows that any permissible way to reason with plausibility is equivalent to Bayesian analysis. A very short outline follows, where we do not in fact show that the Bayesian method satisfies the criteria (this is not usually questioned):
Let $A$, $B$, $C$, ... be statements, combinable with the invisible logical and operator: $AB$ means $A$ and $B$. The negation of a statement $A$ is written $\bar A$. Statements must in some way be considered objective and relate to states of the world, and have an agreed interpretation. Let $A|C$ be the plausibility of $A$ given the additional information that $C$ is true. $C$ is thus the context in which we consider the plausibility of $A$. That such a notation must be present in every calculus to derive plausibility is clear: there must for example be a way to relate a measured value ($A|C$) to the reality behind it ($B|C$) using background information on the measurement process and its accuracy ($C$). Numerical values (parameters, measured values, etc.) enter this framework by a limit process. We cannot start with infinite domains and directly put plausibility measures on them. The plausibility of $AB|C$ must be derivable from one or more of the plausibilities $A|C$, $B|AC$, $B|C$ and $A|BC$. It can be shown that we must consider either $B|AC$ and $A|C$, or $A|BC$ and $B|C$; any other alternative can be shown inadequate by violating common sense in some situation. As an example, we cannot derive the plausibility of $AB|C$ from only the plausibilities of $A|C$ and $B|C$, since that gives us no means to consider how $A$ and $B$ relate to each other: it would force us to assume, for example, that the plausibility of a person having a left blue and a right brown eye would depend only on the plausibilities of left blue and right brown eye, not allowing us to consider the dependency between these two statements.
Thus, we can assume that the plausibility of $AB|C$ is a function of the plausibilities of $A|BC$ and $B|C$, the other case being a natural consequence of the commutativity of the and operator:

$$AB|C = F(A|BC,\; B|C). \tag{3}$$

The common sense requirement tells us that the function $F$ must be continuous, and monotonically increasing in both its arguments. It can have a stationary point for its first argument only if the second argument represents impossibility, and vice versa. We assume it twice continuously differentiable, although there exists a fairly complex proof that this is not necessary for our conclusions [19].

Now we consider the consistency requirement. Since the and operator is not only commutative but also associative, $ABC = (AB)C = A(BC)$, we can derive a consistency requirement for $F$:

$$ABC|D = F(AB|CD,\; C|D) = F(A|BCD,\; BC|D). \tag{4}$$

Expanding once more, we get:

$$F(F(A|BCD,\, B|CD),\, C|D) = F(A|BCD,\, F(B|CD,\, C|D)). \tag{5}$$

This must hold for any statements $A, B, C, D$, and thus $F$ must satisfy the following functional equation in its range of definition:

$$F(x, F(y,z)) = F(F(x,y), z). \tag{6}$$

The above is called the equation of associativity. The trivial constant solution is clearly useless. Which non-trivial solutions are there? We can differentiate equation (6) with respect to $x$, $y$ and $z$, and see that the following equality holds, where the right side, and thus also the left side, is independent of $z$ (we use the notation $F_1(x,y) = \partial F(x,y)/\partial x$ and $F_2(x,y) = \partial F(x,y)/\partial y$):

$$\frac{F_2(x, F(y,z))\,F_1(y,z)}{F_1(x, F(y,z))} = \frac{F_2(x,y)}{F_1(x,y)}. \tag{7}$$
Let $G(x,y) = F_2(x,y)/F_1(x,y)$; then (7) says $G(x,F(y,z))\,F_1(y,z) = G(x,y)$, and the left side of this (which is thus algebraically independent of $z$) we denote $U$. Likewise, after a little algebra: $G(x,F(y,z))\,F_2(y,z) = G(x,y)\,G(y,z)$, and the left side we denote by $V$. Now $\partial V/\partial y$ is identical to $\partial U/\partial z$ and thus zero, since $U$ is independent of $z$. But then $V$, which can be written $G(x,y)G(y,z)$, is independent of $y$. This can only happen if $G(y,z)$ and $1/G(x,y)$ have a common factor dependent on $y$, and no other dependence on $y$. So we must have $G(y,z) = H(y)E(z)$ and $G(x,y) = E_1(x)/H(y)$ for some functions $E$, $E_1$ and $H$; substituting $y$ for $x$ and $z$ for $y$ in the latter gives $G(y,z) = E_1(y)/H(z)$. Comparing the two expressions for $G(y,z)$ shows that $E(z)H(z)$ is a constant $r$ and $E_1 = rH$. In other words, $G$ must have the form $G(x,y) = rH(x)/H(y)$, and this is also by definition equal to $F_2(x,y)/F_1(x,y)$. This is what we need to separate variables and put the differential of $v = F(x,y)$ on an integrable form:

$$\frac{dv}{H(v)} = \frac{dx}{H(x)} + r\,\frac{dy}{H(y)}, \tag{8}$$

which can be integrated, using $w(x) = \exp\left(\int^x \frac{dt}{H(t)}\right)$, to:

$$w(F(x,y)) = w(x)\,w^r(y), \tag{9}$$

but the equation of associativity also gives us

$$w(F(F(x,y),z)) = w(x)\,w^r(y)\,w^r(z) = w(F(x,F(y,z))) = w(x)\,w^r(y)\,w^{r^2}(z), \tag{10}$$
and in every non-trivial and useful case we must have $r = 1$. We can now investigate what $w(x)$ must be when $x$ represents truth or falsity, and we get $w(x) = w(x)w(T)$, $w(F) = w(x)w(F)$, and some more conditions we do not have to use. It is possible that the values $1$ and $\infty$ are obtained, since truth and falsity might be considered a limit case. The first condition yields $w(T) = 1$; the other could mean either $w(F) = 0$ or $w(F) = \infty$ ($-\infty$ is ruled out since we cannot allow $w(x)$ to pass zero on its way from $w(T)$ to $w(F)$). But the solution going from $1$ to $\infty$ can be replaced by its inverse, which goes from $1$ to $0$. We are now very close to probability rules, since the function $w$ goes from $0$ for impossibility to $1$ for truth, and our rule for the conjunction of statements can be written

$$w(AB|C) = w(A|BC)\,w(B|C). \tag{11}$$
It now remains to find out how plausibilities of complements must be treated. Since $A\bar A$ is always false and either of $A$ or $\bar A$ must be true, the plausibility of $\bar A$ must be a function of the plausibility of $A$. Introduce the function $S$ on the unit interval, $S: [0,1]\to[0,1]$, such that $w(\bar A|C) = S(w(A|C))$. By considering Aristotelian logic, our choice of $w(T) = 1$ and $w(F) = 0$, and reasonable common sense, we find that $S$ is a monotone and continuous function decreasing from $1$ to $0$ on the unit interval. We will assume that $S$ is differentiable; again this is not necessary, but it is almost required by common sense and simplifies the argument. Also, since $A = \bar{\bar A}$, we have $S(S(x)) = x$. This is not all, however, because $S$ must also be consistent with the product rule:

$$w(A\bar B|C) = w(A|C)\,w(\bar B|AC) = w(A|C)\,S(w(B|AC)), \tag{12}$$
$$w(B\bar A|C) = w(B|C)\,w(\bar A|BC) = w(B|C)\,S(w(A|BC)). \tag{13}$$

Rearranging these constraints and using the commutativity $AB = BA$, we find $w(A\bar B|C) = w(A|C)\,S(w(B|AC)) = w(A|C)\,S(w(AB|C)/w(A|C))$, and

$$w(A|C)\,S\!\left(\frac{w(A\bar B|C)}{w(A|C)}\right) = w(B|C)\,S\!\left(\frac{w(B\bar A|C)}{w(B|C)}\right). \tag{14}$$

Equation (14) must hold for all statements $A$, $B$, and $C$. In particular, we may choose $B$ such that $\bar B$ implies $A$; then $A\bar B = \bar B$ and $B\bar A = \bar A$, and with $x = w(A|C)$ and $y = w(B|C)$ we obtain the following fundamental equation governing the possible functions $S$:

$$x\,S\!\left(\frac{S(y)}{x}\right) = y\,S\!\left(\frac{S(x)}{y}\right), \qquad S(y) \le x \le 1. \tag{15}$$

The analysis of this equation is not entirely trivial, but it can readily be verified that among its solutions are the (easily obtainable) solutions to the simpler equation:

$$S(x)^m + x^m = 1, \qquad m > 0. \tag{16}$$
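As a quick check (our addition, using the solution form implied by (16)), write $S(x) = (1-x^m)^{1/m}$. Then

$$x\,S\!\left(\frac{S(y)}{x}\right) = x\left(1 - \frac{1-y^m}{x^m}\right)^{1/m} = \left(x^m + y^m - 1\right)^{1/m},$$

which is symmetric in $x$ and $y$, so the two sides of (15) agree; the condition $S(y) \le x$ guarantees $x^m + y^m - 1 \ge 0$.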
[Figure omitted: the family of solution curves of (16) in the unit square, for different values of $m$.]

Figure 3: Sample solutions to (16).
For the different values of $m$, the curve family will cover the interior of the unit square (see figure 3). It is also easy, by considering the choice $y = S(x) + \epsilon$ as $\epsilon \to 0$, to see that $S$ is governed by a first order differential equation. Therefore, there are no more solutions than these. It might seem odd that the solution $S(x) = 1 - x$ is not the only one, since it would fit well with equation (11) and the choice of $w(A|C)$ as the probability of $A|C$. However, by taking the $m$th power of equation (11), we find that we can still interpret all possible ways to compute with plausibilities as Bayesian analysis, simply by letting probability correspond to $w(A|C)^m$.
The next question in this line of inquiry concerns proper choices of priors; here we have no help whatsoever from the preceding discussion. Building a repertoire of methods to assign priors would start with simple symmetry considerations: If I have no background knowledge whatsoever to find differences in plausibility between a set of $n$ exclusive and exhaustive hypotheses, then the prior probability of each should be set to the same value and the probabilities should sum to one. Thus, each hypothesis will have prior probability $1/n$. This leads to the standard assignments for coin tossing and urn drawing experiments considered in basic probability texts. Translating this rule, by limit forming operations, to pdfs with continuous parameter spaces leads naturally to the concept of minimum-information (maximum entropy) priors, which have revolutionized the methods for analyzing physics data and are spreading to other sciences. We do not describe this revolution here, see e.g. Jaynes [19]. A remaining problem is that we simply cannot consider all possible hypotheses. This means that the set of hypotheses we actually consider must in some sense be realistic. This is a key problem that must get a convincing solution in every application. Uninformative priors have been found applicable to many different problems as a first quantitative grinding of the collected data. However, once the big lines have been uncovered, there is usually plenty of scope for investigating more specific and application related models.
3.2 An educational example: Tossing a coin
When it comes to the interpretation of experimental outcomes, we can illustrate the controversy with an example that has been discussed frequently by statisticians, first by Lindley (see, e.g., [7, 32, 19]): Assume we toss a coin 12 times and observe the outcome ttthhtttttth, where t means tail and h means head. We are interested in what this means for our objective of learning whether or not the coin is fair. The probability of this sequence for a fair coin is $0.5^{12}$, as it is for any other sequence of 12 tosses. So it does not seem extraordinary, nor does a sequence consisting of millions of heads only, because it also has the same probability as any other sequence of the same length. The frequentist's approach is to define a test. We order the possible outcomes linearly or map them to the real line, and this induces a pdf of a real-valued quantity. If the current outcome lies far out on the tail of this distribution, we reject the hypothesis that the coin is fair. It is accepted that a 5% cutoff can be used, and this gives us a 5% risk of rejecting a true hypothesis. Of course the map of outcomes to the real line must be defined in some impartial way, essentially before we have seen the actual outcome. Typically, at least if we are more concerned with fairness than with independence, we choose the number of tails in the sequence, which has a binomial distribution. The probability of 9 or more tails in 12 tosses of a fair coin is slightly more than 5% ($\sum_{i=9}^{12}\binom{12}{i}2^{-12} = .075$), so we could reasonably assume that the coin is fair.
There is a very fundamental problem with this approach, however, and that is that we made an assumption about the possible unobserved outcomes that is not justified. We just assumed that the outcome is one of the possible outcomes when tossing 12 times. The actual sequence observed does not exclude the possibility that the experimenter tossed the coin until he had 3 heads. If that were the case we should instead compute the distribution of the number of tails seen before the third head. This distribution is different; in particular it admits arbitrarily large values. A rapid calculation shows that with this rule we should reject the null hypothesis at the 5% level for the same outcome of the experiment (the probability of 9 or more tails is $\sum_{j=9}^{\infty}\binom{j+2}{j}2^{-(j+3)} = .0325$). This dependence on the unknown experimental design violates a fundamental statistical principle saying that only the likelihood of the observed data can influence our belief in a hypothesis. This principle, the Likelihood Principle, was proposed by Fisher and Barnard, but it was first given a detailed analysis by Birnbaum in 1962 [5]. In the subsequent debate, frequentists have proposed that the Likelihood Principle is not applicable in this case and that the experimental design could in practise be relevant information. A Bayesian only admits that the probability, under the fairness assumption, of the outcome observed is $0.5^{12} = .000244$ and that the probability of 9 tails is a factor $\binom{12}{9}$ larger. In order to evaluate the experiment he needs prior beliefs. Such prior belief could be an alternative model, defined before the experiment is observed. If the alternative model is that the coin gives tails with probability $2/3$, the probability of the observed sequence is $(2/3)^9(1/3)^3 = .000963$, and the probability of 9 tails under the alternative model is again a factor $\binom{12}{9}$ larger. So a Bayes factor of 3.9 in favor of the alternative hypothesis is observed, and a Bayesian starting out with no preference (probability 1/2 for each alternative) would end up with a preference for the alternative, which could be quantified as probability .8 for the unfair alternative and .2 for the fair alternative. This preference should not be regarded as a rejection of the less believed alternative, but can easily be reversed by more information. There is a tempting alternative hypothesis in this case, namely that the true probability is the observed frequency, .75 for tails. This model has the highest probability ($.001173$) of those alternatives assuming independent outcomes. Even higher (probability 1) we reach if we assume that the observed sequence is the only possible outcome and that the tosses were not independent; but now we have definitely used the data too much, since we would probably not designate this hypothesis as a major alternative before the experiment.
Now, let the alternative hypothesis be: The probability of tails is an unknown number $\theta$. Figure 4 shows the probability of the outcome as a function of $\theta$. We do not know anything about $\theta$, but we must assume some distribution for it. One obvious alternative is the uniform distribution. This gives the model probability $\int_0^1 \theta^9(1-\theta)^3\,d\theta = .00035$. The resulting Bayes factor is 1.4 in favor of the hypothesis of unfairness, much weaker than 4.8 for the maximum likelihood hypothesis ($\theta = 0.75$). A Bayesian with no prior preference of the hypotheses fair against unfair would end up by assigning probability 0.411 to the fair and .589 to the unfair hypothesis.
Figure 4: Posterior frequency distribution
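The numbers in this example are easy to reproduce. The following sketch recomputes the likelihoods and Bayes factors above with standard library functions only; the Beta integral is evaluated in closed form as $9!\,3!/13!$:

```python
from math import factorial

t, h = 9, 3                                # observed tails and heads
n = t + h

p_fair = 0.5 ** n                          # 0.000244
p_23   = (2/3) ** t * (1/3) ** h           # point alternative, theta = 2/3
p_ml   = 0.75 ** t * 0.25 ** h             # 0.001173, maximum likelihood

# Uniform-prior marginal likelihood: the integral of theta^t (1-theta)^h
# over [0,1] is the Beta function B(t+1, h+1) = t! h! / (n+1)!.
p_uniform = factorial(t) * factorial(h) / factorial(n + 1)   # 0.00035

print(f"theta = 2/3  vs fair: {p_23 / p_fair:.1f}")          # 3.9
print(f"theta = 0.75 vs fair: {p_ml / p_fair:.1f}")          # 4.8
bf = p_uniform / p_fair                                      # 1.4
print(f"uniform prior vs fair: {bf:.1f}")
print(f"posterior probability of fairness: {1 / (1 + bf):.3f}")  # 0.411
```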
It might seem unreasonable to let the probabilities less than $1/2$ water out our belief in unfairness, when data clearly suggest that the probability, if it is not $1/2$, is greater. Let us split the unfairness case into two and consider three models: $M_l$, bias for heads; $M_f$, fair; $M_h$, bias for tails. We again assume a uniform distribution of $\theta$, in the interval 0 to 1/2 for $M_l$ and in 1/2 to 1 for $M_h$. A similar calculation leads to the posterior probabilities 0.034, 0.259 and 0.706, respectively. Clearly, by separating the unfairness hypothesis into low and high bias, we decreased our belief in the fairness alternative. Unfortunately, this is to some extent an illusion. The real reason why our posterior belief in the fairness decreased is that our prior belief in fairness decreased when we replaced two equally believable hypotheses (prior probability 1/2 each) by three equally believable hypotheses (prior probability 1/3 each). It would be equally natural to keep prior probability 1/2 for the fair model and give 1/4 each to the two bias models, and then the posterior probability of fairness would not change. This is one example of non-robustness problems appearing when doing Bayesian analyses with weak priors.
In any case, this is a result that seems much weaker than the frequentist's ability to reject the fairness assumption given the information that the experimenter tossed the coin until 3 heads were observed. In 'fairness' it should be noted that the two views would yield similar results if 120 tosses were made with 30 observed heads: The Bayes factor would be $10^6$ in favor of unfairness, and the level of the frequency test would be $10^{-7}$, two equally convincing reasons to reject the fairness assumption.

The process of dividing the unfairness case into two can be continued, and in the limit we obtain the concept of a posterior distribution for $\theta$ over the unit interval. This analysis is carried out, with a number of nice graphical results, by Sivia [32]. The resulting posterior with a uniform prior, $t$ tails and $h$ heads is the normalized likelihood function, the Beta distribution, $p(\theta|h,t) = c\,\theta^t(1-\theta)^h$. In the next section we will perform a generalized derivation, where we allow more than 2 outcomes: we go from a Bernoulli distribution to a general discrete distribution, and we use the more general Dirichlet conjugate family instead of Beta distributions.
There is no mathematical reason to reject one of the frequentist or Bayesian approaches. Bayesians accused frequentists of not accepting probability as dependent on information, whereas frequentists accused Bayesians of putting up with the non-robustness caused by dependence on prior information. Admittedly, it is difficult to translate prior information to prior probability, but Bayesians claim that it is unavoidable. Whether the frequentists' reliance on experimental design is worse than the Bayesians' reliance on priors is of course impossible to say without a lot of experience. Several other arguments have been put forward in this debate, but those above seem to be the most critical. Today, Bayesian views are gaining ground, perhaps largely due to interest from the AI camp, where several less convincing ways to deal with imprecise information have been tried. Although we promote the pure Bayesian view in this report, it must be remembered that anyone investigating real data must explore it from many angles, in order to avoid being misled by too constrained or inappropriate models. In practice such explorations are perhaps best performed with various visualization tools. An old saying is that a proper visualization hits the investigator between the eyes with the truth. There is some truth in this.
4 Graphical model choice - local analysis
We will analyze a number of models involving two or three variables of categorical type, as a preparation for the task of determining likely decomposable or directed graphical models. First, consider the case of two variables, $A$ and $B$, and our task is to determine whether or not these variables are dependent. Since we know that Bayes' method is the only method that gives us the right answer, we already know how to proceed. We must define one model $M_2$ that captures the concept of independence, and one model $M_1$ that captures the concept of dependence, and ask which one produced our data. The Bayes factor is $P(D|M_2)/P(D|M_1)$, which we multiply with the prior odds (which we assume is one) to get the posterior odds. There is some latitude in defining the data model for dependence and independence, but the alternatives lead us to quite similar computations, as we shall see.
Let $d_A$ and $d_B$ be the number of possible values for $A$ and $B$, respectively. It is natural to regard categorical data as produced by a discrete probability distribution, and then it is convenient to assume Dirichlet distributions for the parameters (probabilities of the possible outcomes) of the distribution. We will find that this analysis is the key step in determining a full graphical model for the data matrix. Our analysis is analogous to those of Dawid and Lauritzen [12] and Madigan and Raftery [24], but their analyses are in many ways more general and use a likelihood approach with penalization of detailed models using the BIC criterion and other similar techniques.

For a discrete distribution over $d$ values, the parameter set is a sequence of probabilities $x = (x_1,\ldots,x_d)$, constrained by $0 \le x_i$ and $\sum_i x_i = 1$ (often the last parameter $x_d$ is omitted; it is determined by the first $d-1$ ones). A prior distribution over $x$ is the conjugate Dirichlet distribution with a parameter set $\alpha = (\alpha_i)_{i=1}^d$, constrained by $0 \le \alpha_i$. Then the Dirichlet distribution with parameter set $\alpha$ is

$$Di(x|\alpha) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)}\prod_i x_i^{\alpha_i - 1},$$

where $\Gamma(n+1) = n!$ for natural number $n$. The normalizing constant $\Gamma(\sum_i\alpha_i)/\prod_i\Gamma(\alpha_i)$ gives a useful mnemonic for integrating $\prod_i x_i^{\alpha_i-1}$ over the $(d-1)$-dimensional unit simplex (with $x_d = 1 - \sum_{i<d} x_i$). It is very convenient to use Dirichlet priors, for the posterior is also a Dirichlet distribution: After having obtained data with frequency count vector $n$, we just add it to the prior parameter vector to get the posterior parameter vector $\alpha + n$. It is also easy to handle priors that are mixtures of Dirichlets, because the mixing propagates through and we only need to mix the posteriors of the components to get the posterior of the mixture. We do not need this here, however.
With no specific prior information for $x$, it is necessary from symmetry considerations to assume all Dirichlet parameters equal, $\alpha_i = \alpha$. A convenient prior is the uniform prior ($\alpha = 1$). This is, e.g., the prior used by Laplace to derive the rule of succession, see Ch. 18 of [19]. Other priors have been used, e.g., $\alpha = 1/2$ in the case $d = 2$, which is a minimum information (Jeffreys) prior. The value $\alpha = 1/2$ has also been used for $d > 2$ (Madigan and Raftery [24]). Cheeseman and Stutz [8] report the use of $\alpha = 1 + 1/d$. Experiments have shown little difference between these choices, but it is easy to see that the Jeffreys prior promotes $x_i$ close to 0 or 1 somewhat, whereas $\alpha = 1 + 1/d$ penalizes extreme probabilities. If we get significant differences between different uninformative priors, this warrants a closer investigation of the adequacy of data and modeling assumptions. We will mostly use the uniform prior. In many cases an expert's deliberated prior information can be expressed as an equivalent sample that is just added to the data matrix, and then this modified matrix can be analyzed with the uniform prior. Likewise, a number of experts can be mixed to form a mixture prior. If the data has occurrence vector $(n_i)_{i=1}^d$ for the $d$ possible data values in a case, and $n = n_{\cdot} = \sum_i n_i$, then the probability for these data given the discrete distribution parameters $x$ is

$$p(n|x) = \binom{n}{n_1,\ldots,n_d}\prod_i x_i^{n_i}. \tag{17}$$
The first factor is the multinomial coefficient; it is sometimes omitted. This would give the probability not of getting a particular contingency table (data matrix), but a given ordered sample with the frequency counts $n_i$. The difference between these two views disappears when the multinomial coefficients cancel in the division leading to Bayes factors. Integrating out the $x_i$ with the prior gives the probability of the data given model $M$ ($M$ is characterized by a parameterized probability distribution and a prior on its parameters):

$$p_J(n|M) = \int p(n|x)p(x)\,dx = \int \binom{n}{n_1,\ldots,n_d}\prod_i x_i^{n_i}\,\frac{\Gamma(\sum_i\alpha_i)}{\prod_i\Gamma(\alpha_i)}\prod_i x_i^{\alpha_i-1}\,dx$$
$$= \binom{n}{n_1,\ldots,n_d}\frac{\Gamma(d\alpha)}{\Gamma(\alpha)^d}\,\frac{\prod_i\Gamma(n_i+\alpha)}{\Gamma(n+d\alpha)} \tag{18}$$
$$= \frac{\Gamma(n+1)\,\Gamma(d\alpha)\,\prod_i\Gamma(n_i+\alpha)}{\Gamma(\alpha)^d\,\Gamma(n+d\alpha)\,\prod_i\Gamma(n_i+1)}. \tag{19}$$

As is easily seen, the uniform prior gives a probability for each sample size that is independent of the actual data:

$$p_u(n|M) = \frac{\Gamma(n+1)\,\Gamma(d)}{\Gamma(n+d)}. \tag{20}$$
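These marginal likelihoods are products of Gamma functions and are best computed in log space. A small sketch (the helper name is ours, not the report's):

```python
from math import lgamma

def log_marginal(counts, alpha=1.0):
    """log p(n|M) from equation (19): Dirichlet-multinomial marginal
    likelihood of a count vector, with symmetric prior parameter alpha."""
    d = len(counts)
    n = sum(counts)
    return (lgamma(n + 1) + lgamma(d * alpha)
            + sum(lgamma(c + alpha) for c in counts)
            - d * lgamma(alpha) - lgamma(n + d * alpha)
            - sum(lgamma(c + 1) for c in counts))

# With alpha = 1 the result depends only on n and d, as equation (20) states:
print(log_marginal([9, 3]))    # same value ...
print(log_marginal([6, 6]))    # ... for any split of 12 cases over 2 values
```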
Consider now the data matrix over $A$ and $B$. Let $n_{ij}$ be the number of rows with value $i$ for $A$ and value $j$ for $B$. Let $n_{\cdot j}$ and $n_{i\cdot}$ be the marginal counts where we have summed over the 'dotted' index, and $n = n_{\cdot\cdot} = \sum_{ij} n_{ij}$. Let model $M_1$ (figure 1) be the model where the $A$ and $B$ value for a row is combined to a categorical variable ranging over $d_A d_B$ different values, with a Jeffreys or uniform prior. The probability of the data given $M_1$ is obtained by adapting the products and replacing $d$ by $d_A d_B$ in equations (19) and (20):

$$p_J(n|M_1) = \frac{\Gamma(n+1)\,\Gamma(d_Ad_B\alpha_{AB})\,\prod_{ij}\Gamma(n_{ij}+\alpha_{AB})}{\Gamma(\alpha_{AB})^{d_Ad_B}\,\Gamma(n+d_Ad_B\alpha_{AB})\,\prod_{ij}\Gamma(n_{ij}+1)}, \tag{21}$$

$$p_u(n|M_1) = \frac{\Gamma(n+1)\,\Gamma(d_Ad_B)}{\Gamma(n+d_Ad_B)}. \tag{22}$$
We could also consider a different model $M_1'$, where the $A$ column is generated first and then the $B$ column is generated for each value of $A$ in turn. With uniform priors we get:

$$p_u(n|M_1') = \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)^{d_A}}{\Gamma(n+d_A)}\prod_i \frac{\Gamma(n_{i\cdot}+1)}{\Gamma(n_{i\cdot}+d_B)}. \tag{23}$$

Observe that we are not allowed to decide between the undirected $M_1$ and the directed model $M_1'$ based on equations (22) and (23). This is because these models define the same set of pdfs involving $A$ and $B$, the difference lying only in the parametrization and the parameter priors. The ratio of (22) and (23) is thus not a true Bayes factor, although it might be useful for seeing how well data fit the two parameterizations and parameter priors. A difference compared to real Bayes factors is that we cannot resolve the hypothesis by taking more data. The factor just measures relative stretch in the parametrization in the high likelihood areas.
In the next model $M_2$ we assume that the $A$ and $B$ columns are independent, each having its own discrete distribution. There are two different ways to specify prior information in this case. We can either consider the two columns separately, each being assumed to be generated by a discrete distribution with its own prior. Or we could follow the style of $M_1'$ above, with the difference that each $A$ value has the same distribution of $B$-values. Now the first approach: Assuming parameters $x^A$ and $x^B$ for the two distributions, a row with values $i$ for $A$ and $j$ for $B$ will have probability $x^A_i x^B_j$. For discrete distribution parameters $x^A, x^B$, the probability of the data matrix $n$ will be:

$$p(n|x^A,x^B) = \binom{n}{n_{11},\ldots,n_{d_Ad_B}}\prod_{i,j=1}^{d_A,d_B}(x^A_i x^B_j)^{n_{ij}} = \binom{n}{n_{11},\ldots,n_{d_Ad_B}}\prod_{i=1}^{d_A}(x^A_i)^{n_{i\cdot}}\prod_{j=1}^{d_B}(x^B_j)^{n_{\cdot j}}.$$

Integration over the priors for $A$ and $B$ gives the data probability given model $M_2$:

$$p_J(n|M_2) = \int p(n|x^A x^B)\,p(x^A)\,p(x^B)\,dx^A dx^B = \frac{\Gamma(n+1)}{\prod_{ij}\Gamma(n_{ij}+1)}\,\frac{\Gamma(d_A\alpha_A)}{\Gamma(\alpha_A)^{d_A}}\,\frac{\Gamma(d_B\alpha_B)}{\Gamma(\alpha_B)^{d_B}}\,\frac{\prod_i\Gamma(n_{i\cdot}+\alpha_A)}{\Gamma(n+d_A\alpha_A)}\,\frac{\prod_j\Gamma(n_{\cdot j}+\alpha_B)}{\Gamma(n+d_B\alpha_B)}.$$

If we select the uniform prior we obtain less canceling of terms than we did for $M_1$ in equation (22):

$$p_u(n|M_2) = \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)\,\Gamma(n+d_B)}\,\frac{\prod_i\Gamma(n_{i\cdot}+1)\,\prod_j\Gamma(n_{\cdot j}+1)}{\prod_{ij}\Gamma(n_{ij}+1)}. \tag{24}$$
From equations (22) and (24) we obtain the Bayes factor for the undirected case:

$$\frac{p_u(M_2|D)}{p_u(M_1|D)} = \frac{p_u(n|M_2)}{p_u(n|M_1)} = \frac{\Gamma(n+d_Ad_B)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)\,\Gamma(n+d_B)\,\Gamma(d_Ad_B)}\,\frac{\prod_i\Gamma(n_{i\cdot}+1)\,\prod_j\Gamma(n_{\cdot j}+1)}{\prod_{ij}\Gamma(n_{ij}+1)}. \tag{25}$$
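Equation (25) is easy to evaluate for a concrete data matrix. A minimal sketch in Python (the helper name and the 0-based value coding are our own conventions; log space avoids overflow):

```python
from math import lgamma, exp

def log_bf25(cases, a, b, dA, dB):
    """log Bayes factor p(n|M2)/p(n|M1) of equation (25): independence
    of columns a and b against dependence, with uniform priors.
    Column values are assumed coded as 0..dA-1 and 0..dB-1."""
    rows, cols = [0] * dA, [0] * dB
    cells = [[0] * dB for _ in range(dA)]
    for case in cases:
        i, j = case[a], case[b]
        cells[i][j] += 1
        rows[i] += 1
        cols[j] += 1
    n = len(cases)
    return (lgamma(n + dA * dB) + lgamma(dA) + lgamma(dB)
            - lgamma(n + dA) - lgamma(n + dB) - lgamma(dA * dB)
            + sum(lgamma(x + 1) for x in rows)
            + sum(lgamma(x + 1) for x in cols)
            - sum(lgamma(x + 1) for r in cells for x in r))

# Two strongly associated binary columns: the factor is far below 1,
# i.e., the data speak for dependence (M1).
cases = [(0, 0)] * 30 + [(1, 1)] * 28 + [(0, 1)] * 4 + [(1, 0)] * 5
print(exp(log_bf25(cases, 0, 1, 2, 2)))
```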
The second approach to model independence between $A$ and $B$ gives the following:

$$p_u(n|M_2') = \frac{\Gamma(n+1)\,\Gamma(d_A)}{\Gamma(n+d_A)}\int \Big(\prod_i \binom{n_{i\cdot}}{n_{i1},\ldots,n_{id_B}}\prod_j (x^B_j)^{n_{ij}}\Big)\,\Gamma(d_B)\,dx^B$$
$$= \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)}\int \Big(\prod_i \binom{n_{i\cdot}}{n_{i1},\ldots,n_{id_B}}\Big)\prod_j (x^B_j)^{n_{\cdot j}}\,dx^B$$
$$= \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)\,\Gamma(n+d_B)}\,\frac{\prod_i\Gamma(n_{i\cdot}+1)\,\prod_j\Gamma(n_{\cdot j}+1)}{\prod_{ij}\Gamma(n_{ij}+1)}. \tag{26}$$

We can now find the Bayes factor relating models $M_1'$ (equation 23) and $M_2'$ (equation 26), with no prior preference for either:

$$\frac{p_u(M_2'|D)}{p_u(M_1'|D)} = \frac{p_u(n|M_2')}{p_u(n|M_1')} = \frac{\prod_j\Gamma(n_{\cdot j}+1)\,\prod_i\Gamma(n_{i\cdot}+d_B)}{\Gamma(d_B)^{d_A-1}\,\Gamma(n+d_B)\,\prod_{ij}\Gamma(n_{ij}+1)}. \tag{27}$$
Consider now a data matrix with three variables, $A$, $B$ and $C$ (figure 2). The analysis of the model $M_3'$ where full dependencies are accepted is very similar to $M_1$ above (equation 22). For the model $M_4$ without the link between $A$ and $B$ we should partition the data matrix by the value of $C$ and multiply the probabilities of the blocks with the probability of the partitioning defined by $C$. Since we are ultimately after the Bayes factor relating $M_4$ and $M_3$, respectively $M_4'$ and $M_3'$, we can simply multiply the Bayes factors relating $M_2$ and $M_1$ (equation 25), respectively $M_2'$ and $M_1'$ (equation 27), for each block of the partition to get the Bayes factors sought:

$$\frac{p_u(M_4|D)}{p_u(M_3|D)} = \frac{p_u(n|M_4)}{p_u(n|M_3)} = \frac{\Gamma(d_A)^{d_C}\,\Gamma(d_B)^{d_C}}{\Gamma(d_Ad_B)^{d_C}}\prod_c \frac{\Gamma(n_{\cdot\cdot c}+d_Ad_B)\,\prod_j\Gamma(n_{\cdot jc}+1)\,\prod_i\Gamma(n_{i\cdot c}+1)}{\Gamma(n_{\cdot\cdot c}+d_A)\,\Gamma(n_{\cdot\cdot c}+d_B)\,\prod_{ij}\Gamma(n_{ijc}+1)}, \tag{28}$$
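In code, equation (28) is just equation (25) applied within each block of the partition by $C$, with the log factors added; a sketch reusing log_bf25 from above (helper name again ours):

```python
def log_bf28(cases, a, b, c, dA, dB, dC):
    """log Bayes factor p(n|M4)/p(n|M3) of equation (28): conditional
    independence of columns a and b given column c (values 0..dC-1)."""
    return sum(log_bf25([x for x in cases if x[c] == v], a, b, dA, dB)
               for v in range(dC))
```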
and in the directed case we have:

$$\frac{p_u(M_4'|D)}{p_u(M_3'|D)} = \frac{p_u(n|M_4')}{p_u(n|M_3')} = \Gamma(d_B)^{(1-d_A)d_C}\prod_c \frac{\prod_j\Gamma(n_{\cdot jc}+1)\,\prod_i\Gamma(n_{i\cdot c}+d_B)}{\Gamma(n_{\cdot\cdot c}+d_B)\,\prod_{ij}\Gamma(n_{ijc}+1)}.$$

For the model search procedures described below we must also be able to compare models $M_5$ and $M_6$ of figure 5:

$$\frac{p_u(M_5|D)}{p_u(M_6|D)} = \frac{p_u(n|M_5)}{p_u(n|M_6)} = \frac{1}{\Gamma(d_B)^{(d_A-1)d_C}}\prod_c \frac{\prod_j\Gamma(n_{\cdot jc}+1)\,\prod_i\Gamma(n_{i\cdot c}+d_B)}{\Gamma(n_{\cdot\cdot c}+d_B)\,\prod_{ij}\Gamma(n_{ijc}+1)}.$$
[Figure omitted: two directed graphs on vertices $A$, $B$, $C$; in $M_5$ there are arrows from both $A$ and $C$ to $B$, in $M_6$ only from $C$ to $B$.]

Figure 5: Directed models
5 Graphical model choice - global analysis
If we have many variables, their interdependencies can be modeled as a graph with vertices corresponding to the variables. The example of figure 6 is from [23], and shows the dependencies in a data matrix related to heart disease. Of course, a graph of this kind can give a data probability to the data matrix in a way analogous to the calculations in the previous section, although the formulae become rather involved, and the number of possible graphs increases dramatically with the number of variables. It is completely infeasible to list and evaluate all graphs if there are more than a handful of variables. An interesting possibility to simplify the calculations would use some kind of separation, so that an edge in the model could be given a score independent of the inclusion or exclusion of most other potential edges. Indeed, the derivations of the last section show how this works. Let $C$ in that example be a compound variable, obtained by merging columns $\{c_1,\ldots,c_d\}$.

[Figure omitted: graphical model on the variables Mental Work, Lipoproteins, Physical Work, Smoking, Anamnesis and Blood Pressure.]

Figure 6: Symptoms and causes relevant to heart problems
If two models $G$ and $G'$ differ only by the presence and absence of the edge $\{A,B\}$, and if there is no path between $A$ and $B$ except through the vertex set $C$, then the expressions for $p(n|M_4)$ and $p(n|M_3)$ above will become factors of the expressions for $p(n|G)$ and $p(n|G')$, respectively, and the other factors will be the same in the two expressions. Thus, the Bayes factor relating the probabilities of $G$ and $G'$ is the same as that relating $M_4$ and $M_3$. This result is independent of the choice of distributions and priors of the model, since the structure of the derivation follows the structure of the graph of the model; it is equally valid for Gaussian or other data models, as long as the parameters of the participating distributions are assumed independent in the prior assumptions. A beautiful abstract analysis of this phenomenon can be found in Dawid and Lauritzen [12].
We can now think of various 'greedy' methods for building high probability interaction graphs relating the variables (columns in the data matrix). It is convenient and customary to restrict attention to either decomposable (chordal) graphs or directed acyclic graphs. Chordal graphs are fundamental in many applications of describing relationships between variables (typically variables in systems of equations or inequalities). They can be characterized in many different but equivalent ways, see Rose [29] and Rose, Lueker and Tarjan [30]. One simple way is to consider a decomposable graph as consisting of the union of a number of maximal complete graphs (cliques, or maximally connected subgraphs), in such a way that (i) there is at least one vertex that appears in only one clique (a simplicial vertex), (ii) if an edge to a simplicial vertex is removed, another decomposable graph remains, and (iii) the graph without any edges is decomposable. A characteristic feature of a simplicial vertex is that its neighbors are completely connected. This recursive definition can be reversed into a generation procedure: Given a decomposable graph $G$ on the set of vertices, find two vertices $s$ and $n$ such that (i) $s$ is simplicial, i.e., its neighbors are completely connected, and (ii) the graph $G'$ obtained by adding the edge between $s$ and $n$ to $G$ is also decomposable. We will call such an edge a permissible edge of $G$. This procedure describes a generation structure (a directed acyclic graph whose vertices are decomposable graphs on the set of vertices) containing all decomposable graphs on the variable set. An interesting feature of this generation process is that it is easy to compute the Bayes factor comparing the posterior probabilities of the graphs $G$ and $G'$ as graphical models of the data: Let $s$ correspond to $A$, $n$ to $B$, and let the compound variable obtained by fusing the neighbors of $s$ correspond to $C$ in the analysis of section 4. Without explicit prior model probabilities we have:

$$\frac{p(G'|D)}{p(G|D)} = \frac{p_u(n|M_3)}{p_u(n|M_4)}. \tag{29}$$
A search for high probability graphs can now be organized as follows:

1. Start from the graph $G_0$ without edges.

2. Repeat: find, among the permissible edges, one that gives the highest Bayes factor, and add it if the factor is greater than 1. Keep a set of highest probability graphs encountered.

3. Then repeat: For the high probability graphs found in the previous step, find simplicial edges whose removal increases the Bayes factor the most (or decreases it the least).

For each graph kept in this process, its Bayes factor relative to $G_0$ can be found by multiplying the Bayes factors in the generation sequence. A procedure similar to this one is reported by Madigan and Raftery [24], and its results on small variable sets were found good, in that it found the best graphs reported in other approaches. It must be noted, however, that we have now passed into the realm of approximate analysis, since we cannot (yet) know that we will find all high probability graphs. One splendid example of this is where we have many binary categorical columns, all generated randomly and independently of each other except the last one, which is the parity function of the other ones. If we start searching from the empty graph, we will never find this relationship, since the intermediate graphs will have low probability. Likewise, if some arbitrary subset of the columns are interrelated by a parity constraint, it seems unlikely, although possible, that we will find it even if we start the search from the saturated model (graph with all edges). A simplified sketch of the greedy edge-scoring step is given below.
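The following sketch illustrates the greedy loop of step 2 under strong simplifying assumptions of our own: a candidate edge $\{a,b\}$ is scored by the marginal dependence evidence from equation (25) (via log_bf25 from section 4's sketch), ignoring the chordality bookkeeping and the separating set $C$, so this is only the scoring loop, not the full permissible-edge search:

```python
from itertools import combinations

def greedy_edges(cases, domains):
    """Greedily add the edge with the strongest dependence evidence,
    stopping when no remaining edge has Bayes factor > 1 for dependence."""
    edges, n_vars = set(), len(domains)
    while True:
        candidates = [e for e in combinations(range(n_vars), 2)
                      if e not in edges]
        if not candidates:
            return edges
        # score > 0 means equation (25) favors dependence of the pair
        score = lambda e: -log_bf25(cases, e[0], e[1],
                                    domains[e[0]], domains[e[1]])
        best = max(candidates, key=score)
        if score(best) <= 0.0:
            return edges
        edges.add(best)

# Toy data: column 2 copies column 0, column 1 is unrelated noise.
cases = [(i % 2, (i // 2) % 2, i % 2) for i in range(40)]
print(greedy_edges(cases, [2, 2, 2]))   # finds {(0, 2)}
```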
Another family of graphical models are the directed acyclic models. They can be treated similarly, since here we check locally, for a variable $B$ that has been found dependent on a set $C$, whether it can be inferred also to depend on variable $A$. We compare thus models $M_5$ and $M_6$ of figure 5. The inclusion or exclusion of the arrow from $A$ to $B$ can be inferred independently of all arrows not going to $B$. A problem with directed graphical models is that different acyclic graphs can represent the same family of probability distributions, and this requires some careful argumentation.
6 Graphical model choice - categorical, ordinal and Gaussian variables

We now consider data matrices made up from ordinal and real valued data. The standard model for a real valued variable is the normal distribution. It has nice theoretical properties manifesting themselves in such forms as the central limit theorem, the least squares method, principal components, etc. However, it must be noted that it is also unsatisfactory for many data sets occurring in practice, because of its narrow tails and because many real life distributions deviate terribly from it. Several approaches to solve this problem are available. One is to consider a variable as being obtained by mixing several normal distributions. This is a special case of the classification or segmentation problem discussed below. Another is to disregard the distribution over the real line, and consider the variable as just being made up of an ordered set of values. This leads naturally to the recursive splitting of the data set by a decision tree, also discussed below.
7 Missing values and errors in data matrix
Data collected from experiments are seldom perfect. The problem of missing and erroneous data is a vast field in the statistics literature. First of all there is a possibility that 'missingness' of data values is significant for the analysis, in which case missingness should be modeled as an ordinary data value. Then the problem has been internalized, and the analysis can proceed as usual, with the important difference that the missing values are not available for analysis. A more sceptical approach was developed by Ramoni and Sebastiani [27], who consider an option to regard the missing values as adversaries (the conclusions on dependence would then be true no matter what the missing values are). The other possibility is that missingness is known to have nothing to do with the objectives of the analysis. For example, in a medical application, if data is missing because of the bad condition of the patient, missingness is significant if the investigation is concerned with patients. But if data is missing because of unavailability of equipment, it is probably not, unless maybe if the investigation is related to hospital quality. In Bayesian data analysis, the problem of missing or erroneous data creates significant complications, as we will see. As an example, consider the analysis of the two-column data matrix with binary categorical variables $A$ and $B$, analyzed against models $M_1$ and $M_2$ of section 4. Suppose we obtained $n_{00}$, $n_{01}$, $n_{10}$ and $n_{11}$ cases with the values 00, 01, etc. We then have a posterior Dirichlet distribution with parameters $n_{ij}$ for the probabilities of the four possible cases. If we now receive a case where both $A$ and $B$ are unknown, it is reasonable that this case is altogether ignored. But what shall we do if a case arrives where $A$ is known, say 0, but $B$ is unknown? One possibility is to waste the entire case, but this is not orthodox Bayesian, since we are not making use of information we have. Another possibility is to use the current posterior to estimate a pdf for the missing value; in our case the probability that $B$ has value 0 is $p_0 = n_{00}/n_{0\cdot}$. So our posterior is now either a Dirichlet with parameters $n_{00}+1$, $n_{01}$, $n_{10}$ and $n_{11}$ (probability $p_0$) or one with parameters $n_{00}$, $n_{01}+1$, $n_{10}$ and $n_{11}$ (probability $1-p_0$). But this means that the posterior is now a weighted average of two Dirichlet distributions; in other terms, it is not a Dirichlet distribution at all! As the number of missing values increases, the number of terms in the posterior will increase exponentially, and the whole advantage with conjugate distributions will be lost.
The related case of errors in data is more difficult to treat. How do we describe data where there are known uncertainties in the recording procedure? This is a problem worked on for centuries when it comes to real valued quantities as measured in physics and astronomy, and is one of the main features of interpretation of physics experiments. When it comes to categorical data there is less help in the literature; an obvious alternative is to relate recorded vs actual values of discrete variables as a probability distribution, or, which is fairly expedient in our approach, as an equivalent sample.
8 Decision trees
Decision trees are typically used when we want to predict a variable, the class variable, from other, explanatory, variables in a case, and we have a data matrix of known cases. When modeling data with decision trees, we are usually trying to segment the data set into ranges, n-dimensional boxes of which some are unbounded, such that a particular variable, the class variable, is fairly constant over each box. If the class variable is truly constant in each box, we have a tree that is consistent with respect to the data. This means that for new cases, where the class variable is not directly available, it can be well predicted by the box into which the case falls. The method is suitable where the variables used for prediction are of any kind (categorical, ordinal or numerical) and where the predicted variable is categorical or ordinal with a small domain. There are several efficient ways to heuristically build good decision trees, and it is a central technique in the field of machine learning. Practical experience has given many cases where the predictive performance of decision trees is good, but also many counter-intuitive phenomena have been uncovered by practical experiments. Recently, several treatments of decision trees have been published where it is discussed whether or not the smallest possible tree consistent with all cases is the best one. This turned out not to be the case, and the argument that a smallest decision tree should be preferred because of some kind of Occam's razor argument is apparently not valid, neither in theory nor in practise [34, 2]. The Bayesian approach gives the right information on the credibility and generalizing power of a decision tree. It is explained in recent papers by Chipman, George and McCulloch [9] and by Paass and Kindermann [26]. A decision tree statistical model is one where a number of boxes are defined on one set of variables by recursive splitting of one box into two, by splitting the range of one designated variable into two. Data are assumed to be generated by a discrete distribution over the boxes, and for each box it is assumed that the class variable value is generated by another discrete distribution. Both these distributions are given uninformative Dirichlet prior distributions, and thus the posterior probability of a decision tree can be computed from data. Since larger trees have more parameters, there is an automatic penalization of large trees, but the distribution of cases into boxes also enters the picture, so it is not clear that the smallest tree giving perfect classification will be preferred, or even that a consistent tree will be preferred over an inconsistent one. The decision trees we described here do not give a clear-cut decision on the value of the decision variable for a case, but a probability distribution over values. As the name of this data model indicates its use for decision making, one can get better trees for an application by including information about the utility of the decision in the form of a loss function and by comparing trees based on the expected utility rather than model probability.

For a decision tree $T$ with $d$ boxes, data with $c$ classes, and where the number of cases in box $i$ with class value $k$ is $n_{ik}$, and $n = n_{\cdot\cdot}$, we have, with uniform priors on both the assignment of case to box and of class within box,

$$p(D|T) = \frac{\Gamma(n+1)\,\Gamma(d)}{\Gamma(n+d)}\prod_i \frac{\Gamma(n_{i\cdot}+1)\,\Gamma(c)}{\Gamma(n_{i\cdot}+c)}. \tag{30}$$
However, in order to compare two trees $T$ and $T'$, we would have to form the set of intersection boxes and ask about the probability of finding the data with a common parameter over the boxes belonging to a common box of $T$, relative to the probability of the data when the parameters are common in boxes of $T'$. For the case where $T$ and $T'$ only differ by the splitting of one box $i$ into $i'$ and $i''$, the calculation is easy ($n_{i''k} + n_{i'k} = n_{ik}$):

$$\frac{p(D|T')}{p(D|T)} = \frac{\Gamma(n_{i\cdot}+c)}{\Gamma(n_{i'\cdot}+c)\,\Gamma(n_{i''\cdot}+c)}\prod_k \frac{\Gamma(n_{i'k}+1)\,\Gamma(n_{i''k}+1)}{\Gamma(n_{ik}+1)}. \tag{31}$$
9 Segmentation - Latent variables
Segmentation and latent variable analysis is directed at describing the data set
as a collection of subsets, each having a simpler description than the full data
matrix. Suppose data set D is partitioned into d_c classes {D^{(i)}}, and each of
these has a high posterior probability p(D^{(i)}|M_i) with respect to some model
set {M_i}. Then we think that the classification is a good model for the data. However,
some problems remain to consider. First, what is it that we compare the
classification against, and second, how do we accomplish the partitioning of the
cases? The first question is the simplest to answer: we compare a classification
model against some other model, based on classification or not. The second
is trickier, since the introduction of this section is somewhat misleading. The
prior information for a model based on classification must have some
information about classes, but it does not have an explicit division of the data into
classes available. Indeed, if we were allowed to make this division into classes
on our own, seeking the highest posterior class model probabilities, we would
probably over-fit by using the same data twice - once for class assignment and
once for posterior model probability computation. The statistical model
generating segmented data could be the following: a case is first assigned to a
class by a discrete distribution obtained from a suitable uninformative Dirichlet
distribution, and then its visible attributes are assigned by a class-dependent
distribution. This model can be used to compute a probability of the data
matrix, and then, via Bayes rule, a Bayes factor relating the model with another
one, e.g., one without classes or with a different number of classes. One can also
have a variable number of classes and evaluate by finding the posterior
distribution of the number of classes. The data probability is obtained by integrating,
over all assignments of cases to classes, the data probabilities according to the
respective class model. Needless to say, this integration is feasible only for a
handful of cases, where the data is anyway too meager to permit any kind of
significant conclusion on the number of classes and their distributions.
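As an illustration of this generative model (a minimal sketch with hypothetical
names, for discrete attributes only): class proportions and the per-class attribute
distributions are drawn from uniform Dirichlet priors, each case first draws a
class, and then each attribute value is drawn from the class-dependent distribution.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_segmented_data(n_cases, n_classes, attr_domain_sizes):
        """Sample a data matrix from the segmentation model sketched above."""
        proportions = rng.dirichlet(np.ones(n_classes))    # class distribution
        attr_dists = [[rng.dirichlet(np.ones(m)) for m in attr_domain_sizes]
                      for _ in range(n_classes)]           # class-dependent distributions
        classes = rng.choice(n_classes, size=n_cases, p=proportions)
        data = np.array([[rng.choice(m, p=attr_dists[k][a])
                          for a, m in enumerate(attr_domain_sizes)]
                         for k in classes])
        return classes, data   # classes is latent; only data is observed

    latent, D = generate_segmented_data(100, 3, [2, 2, 4])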
The most well-known procedures for automatic classification are built on
expectation maximization. With this technique, a set of class parameters is
refined by assigning cases to classes probabilistically, with the probability of each
case membership determined by the likelihood vector for it under the current class
parameters[8]. We can also solve the problem with the MCMC approach[28].
The MCMC approach to classification is the following: assume that we have
a data matrix and want a classification of its cases which makes the attributes
independent. Define a class assignment randomly, and compute the probability
of data, given the model with independent attributes, as in (24), which is easy to
generalize to more attributes. The MCMC will now implement a move function,
proposing a changed class for some case. The move is accepted if the posterior
probability increases, and otherwise with a probability given by the ratio of new
to old data probability (see section 11). This procedure is reasonably efficient,
since it is possible to evaluate the class probabilities incrementally, by keeping
just the current contingency table for each class and updating it as cases change
class. Since absolute probabilities are held updated, we also avoid a common
complication in MCMC applications, arising when the dimension of the parameter
space changes. Although this complication can sometimes be avoided, it is not
always so; the reversible jump process was designed to cope with it[6].
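A sketch of one such move, under stated assumptions: the data probability below
is a uniform-Dirichlet marginal per class and attribute (intended in the spirit of
(24), whose exact form is given earlier in the report), and the move is the
Metropolis rule just described. For clarity the probability is recomputed from
scratch at each move; the incremental update mentioned above would touch only
the terms of the two classes involved.

    import numpy as np
    from scipy.special import gammaln

    def log_data_prob(data, assign, n_classes, attr_domain_sizes):
        """Log probability of the data given a class assignment, with
        independent attributes within each class and uniform Dirichlet priors."""
        logp = 0.0
        for k in range(n_classes):
            block = data[assign == k]
            for a, m in enumerate(attr_domain_sizes):
                counts = np.bincount(block[:, a], minlength=m)
                logp += (gammaln(m) + np.sum(gammaln(counts + 1))
                         - gammaln(len(block) + m))
        return logp

    def mcmc_step(data, assign, n_classes, attr_domain_sizes, rng):
        """Propose a new class for one random case; accept by the Metropolis rule."""
        i = rng.integers(len(assign))
        old = assign[i]
        lp_old = log_data_prob(data, assign, n_classes, attr_domain_sizes)
        assign[i] = rng.integers(n_classes)
        lp_new = log_data_prob(data, assign, n_classes, attr_domain_sizes)
        if np.log(rng.random()) >= lp_new - lp_old:
            assign[i] = old                   # reject the move
        return assign

    # Usage: start from a random assignment and iterate mcmc_step; the visited
    # assignments are (correlated) samples from the posterior over classifications.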
10 Association rules
Association rules are special sets of rules used to predict data in data mining.
The literature on association rules emphasizes rapid extraction, since typically
a data matrix has very many potential association rules and the data matrices
considered are very large. An association rule is written A -> B, where A
and B are conditions on a data case. They can be either defined by giving a
predicate on the value of an attribute, or as a conjunction of such conditions
for several attributes. In the literature, binary attributes are often assumed.
The usefulness of such a rule depends on how well it satisfies the intuitive
condition of the rule: whenever A is true for a case, B is also true. The support
of the rule is the fraction of cases where both A and B are true, whereas the
confidence is the fraction of cases with A true where also B is true. The lift
of a rule is the factor by which its confidence exceeds the confidence we would
have with independence between A and B, computed in a ML framework,
i.e., n_{AB}n_{::}/(n_{A:}n_{:B}), where the notation is an obvious adaptation of the
contingency table notation used previously. Clearly, the concept of lift assumes a
large database, where statistical fluctuation can be ignored. In order to assess
the significance of an association rule, we need the machinery of Bayes factors,
and then we can easily assess the generalization expectable from a proposed
rule. In short, let the significance of a rule be the Bayes factor between a model
that gives dependence between A and B and a model that does not. This gives
s(A \to B) = \frac{\Gamma(n_{::}+c)}{\Gamma(n_{A:}+c)\,\Gamma(n_{\bar{A}:}+c)} \cdot \frac{\Gamma(n_{AB}+1)\,\Gamma(n_{\bar{A}B}+1)}{\Gamma(n_{:B}+1)} \cdot \frac{\Gamma(n_{A\bar{B}}+1)\,\Gamma(n_{\bar{A}\bar{B}}+1)}{\Gamma(n_{:\bar{B}}+1)}    (32)
Since the significance depends on four quantities, which can have a very
large span of values in practical applications, this concept seems necessary for
throwing out rules that cannot be expected to generalize because either the
database, the support or the lift (or some combination) is too small. However, there
are more dangers in applying data mining results, particularly the problem of
biased sampling, which no test on sampled data can reveal.
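A sketch combining the standard rule measures with the significance (32),
assuming binary conditions (so c = 2); the inputs are the four cells of the A/B
contingency table, and the function name is ours, not from the literature.

    import numpy as np
    from scipy.special import gammaln

    def rule_measures(n_ab, n_a_nb, n_na_b, n_na_nb):
        """Support, confidence, lift and log-significance of the rule A -> B."""
        n = n_ab + n_a_nb + n_na_b + n_na_nb
        n_a, n_b = n_ab + n_a_nb, n_ab + n_na_b
        support = n_ab / n
        confidence = n_ab / n_a
        lift = n_ab * n / (n_a * n_b)
        c = 2                                 # B is binary here
        log_s = gammaln(n + c) - gammaln(n_a + c) - gammaln(n - n_a + c)
        for x, y in [(n_ab, n_na_b), (n_a_nb, n_na_nb)]:   # columns B, not-B
            log_s += gammaln(x + 1) + gammaln(y + 1) - gammaln(x + y + 1)
        return support, confidence, lift, log_s

    # Identical confidence and lift, very different significance:
    print(rule_measures(3, 0, 0, 3))      # log-significance about 2.2: weak
    print(rule_measures(300, 0, 0, 300))  # log-significance about 408: overwhelming

The example shows the point made above: a rule with perfect confidence and high
lift generalizes poorly when the supporting counts are small, and the Bayes factor
exposes this where the ML-based measures cannot.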
Mining of large files for association rules typically reveals very large quantities
of significant rules. Many papers have been devoted to finding an interesting
subset of such rules. Two basic approaches exist: in one, a measure of
interestingness or surprisingness is defined for a particular rule; in the other, a rule is
evaluated in the context of an already existing rule set.
11 Approximate analysis with Metropolis-Hastings simulation
Several of the cases mentioned previously, where analytical solutions become
infeasible - because of the breakdown of the simple conjugacy principle (missing and
erroneous values), because of the large number of models to be considered (graphical
models on many variables), or because of the analytical difficulty of computing the
data probability (classification) - have been attacked separately with Monte Carlo
methods. We will outline a method that solves all these cases at once.
The basic problem solved by MCMC methods is sampling from a multivariate
distribution over many variables. The distribution can be given generically as
p(x, y, z, ..., w). If some variable, e.g., y, represents measured signals, then
the actual values measured, say a, can be substituted, and sampling will be
from the conditional distribution proportional to p(x, a, z, ..., w). If some other
variable, say x, represents the quantity we are trying to measure, then sampling
and selecting the x component will give samples from the posterior of x given
the measurements. In other words, we will get the best possible estimate of the
quantity given the measurements and the statistical model of the measurement process.
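To make the mechanics concrete, here is a minimal random-walk Metropolis sketch
(hypothetical names; a toy Gaussian measurement model): only the quotient of the
target density at two points is needed, so an unnormalized log density suffices.

    import numpy as np

    def metropolis(log_density, x0, n_steps, step=0.5, rng=None):
        """Random-walk Metropolis over a target known up to a constant factor."""
        rng = rng or np.random.default_rng()
        x = np.asarray(x0, dtype=float)
        lp = log_density(x)
        samples = []
        for _ in range(n_steps):
            proposal = x + step * rng.standard_normal(x.shape)  # symmetric proposal
            lp_prop = log_density(proposal)
            if np.log(rng.random()) < lp_prop - lp:             # accept by the quotient
                x, lp = proposal, lp_prop
            samples.append(x.copy())
        return np.array(samples)

    # Posterior of x given a measurement y = a, with x ~ N(0,1) a priori and
    # y | x ~ N(x, 0.5^2); the exact posterior mean for a = 1.2 is 0.96.
    a = 1.2
    log_post = lambda x: -0.5 * x @ x - 0.5 * ((a - x[0]) / 0.5) ** 2
    draws = metropolis(log_post, x0=[0.0], n_steps=5000)
    print(draws[1000:].mean())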
The two basic methods for MCMC computation are the Gibbs sampler and
the Metropolis-Hastings algorithm. Both generate a Markov chain with states
over the domain of a multivariate target distribution and with the target
distribution as its unique limit distribution. Both exist in several more or less
refined versions. The Metropolis algorithm has the advantages that it does not
require sampling from the conditional distributions of the target distribution,
but only finding the quotient of the distribution at two arbitrary given points,
and that it can be chosen from a set of variants with better convergence properties.
A thorough introduction is given by Neal[25]. To sum up, MCMC methods can be used
to estimate distributions that are not tractable analytically or numerically. We
get real estimates of posterior distributions and not just approximate maxima
of functions. On the negative side, the Markov chains generated have high
autocorrelation, so a sample over a sequence of steps can give a highly misleading