
One Year Master Thesis, 15 hp

One Year Master’s programme in Statistical Sciences, 60 hp

RATING CORRUPTION WITHIN INSURANCE COMPANIES USING

BAYESIAN NETWORK CLASSIFIERS

Oscar Öhman


Acknowledgements

I would like to thank my supervisor Priyantha Wijayatunga, for helping me develop a deeper understanding of the underlying theory for Bayesian Network classifiers, as well as for how to implement it in practice. This thesis could not have been done without him sharing his impressive expertise in the field.


Popular scientific abstract

Detecting money-laundering schemes within seemingly legal yet suspicious insurance companies, by rating the legality of their financial transactions, is an important task for our society. The forces behind these schemes may be drug traffickers or even terrorist organizations. Since the rating of insurance companies may depend on various business features, financial professionals use various statistical models for classifying them into different levels of corruption. However, in order to use these models in practice, they need to be built, ideally from past data, i.e., from a set of insurance companies whose ratings and business features are known. The aim of this thesis is therefore to build four classification models for corruption rating of insurance companies using past data.

Classification is a common practice in statistical sciences, which consists of predicting the class/category/group affiliation of an observation, based on the characteristics of that observation. An observation can be an individual, a company, an e-mail, a car brand, and so on.

Classification can be applied to questions such as whether an e-mail is spam or not, whether an individual will be able to pay back a loan or not, or whether a company is corrupt or not.

The classification models used to achieve the aim of this thesis are different types of so-called Bayesian Network (BN) classifiers. These kinds of models have shown high performance in classification tasks in fields such as economics. Here, we build four Bayesian network models for classifying insurance companies into different levels of corruption and compare their classification performances. Since the different models are derived in different ways, the comparison can be regarded as a test of how effectively each method predicts the corruption level of each insurance company in our data set.

All four models have resulted in at least partially impressive predictive performances.

Three of the models were created using standard procedures, while one was created using a more complex, but more suitable, procedure. This more complex method was expected to result in more accurate predictions. Yet, the simpler models triumphed over the complex one.


Abstract

Bayesian Network (BN) classifiers are a type of probabilistic model. The learning process consists of two steps, structure learning and parameter learning. Four BN classifiers will be learned. These are two different Naive Bayes classifiers (NB), one Tree Augmented Naive Bayes classifier (TAN) and one Forest Augmented Naive Bayes classifier (FAN). The NB classifiers will utilize two different parameter learning techniques, namely generative learning and discriminative learning. Generative learning uses maximum likelihood estimation (MLE) to optimize the parameters, while discriminative learning uses conditional likelihood estimation (CLE). The latter is more appropriate given the target at hand, while the former is less complicated. These four models are created in order to find the model best suited for predicting/rating the corruption levels of different insurance companies, given their features. Multi-class Area under the receiver operating characteristic (ROC) curve (AUC), as well as accuracy, is used to compare the predictive performances of the models. We observe that the classifiers learnt by generative parameter learning performed remarkably well, even outperforming the NB classifier with discriminative parameter learning. Unfortunately, this might imply an optimization issue when learning the parameters discriminatively. Another unexpected result was that the CL-TAN classifier had the highest multi-class AUC, even though FAN is supposed to be an upgrade of CL-TAN. Further, the generatively learned NB performed about as well as the other two generative classifiers, which was also unexpected.


Summary (Sammanfattning)

Title: Rating corruption levels within insurance companies using Bayesian networks

Bayesian networks (BN) are a type of probability model used for classification. The learning process of such a model consists of two steps, structure learning and parameter learning. Four different BN classifiers will be estimated. These are two Naive Bayes classifiers (NB), one Tree Augmented Naive Bayes classifier (TAN) and one Forest Augmented Naive Bayes classifier (FAN). The two NB classifiers differ in that one uses generative parameter estimation, while the other uses discriminative parameter learning. Chow and Liu's (CL) famous algorithm, which involves computing conditional mutual information (CMI), is often used to find the optimal tree structure. This variant of TAN is known as CL-TAN. FAN is another kind of upgrade of NB, which can be regarded as a strengthened variant of CL-TAN, where the feature variables are connected to each other in a way that yields a forest-like structure. The two parameter learning methods used are generative learning and discriminative learning. The former uses maximum likelihood estimation (MLE) to optimize the parameters. This is convenient, but it does not estimate what is actually intended. The latter method instead uses conditional maximum likelihood estimation (CLE), which gives a more correct, but also more complicated, estimation. These four models will be trained in order to find the model that best rates the corruption levels within different insurance companies, given their characteristics in the form of feature variables. A multi-class variant of the Area under the receiver operating characteristics (ROC) curve (AUC) is used to assess the predictive precision of each model. The analysis yielded remarkable results for the generative models, which rated considerably more precisely than the discriminative NB model.

Unfortunately, this may be an indication of optimization problems in the discriminative parameter learning of NB. Another remarkable result was that, among the generative models, CL-TAN had the highest AUC, even though FAN in theory should be an improved variant of CL-TAN. The result of the generative NB model was also remarkable, as this model had almost as high an AUC as the generative CL-TAN and FAN models.


Contents

1 Introduction
1.1 Aim
2 Theory
2.1 Bayesian Network
2.1.1 Bayesian Probability
2.1.2 Factorization
2.1.3 Directed Acyclic Graphs
2.2 Bayesian Network Classifiers
2.2.1 Measures of Predictive Performance
2.2.2 Naive Bayes Classifier
2.2.3 Tree Augmented Naive Bayes Classifier
2.2.4 Forest Augmented Naive Bayes Classifier
2.2.5 Parameter Learning
3 Data
4 Results
4.1 Naive Bayes Classifiers
4.1.1 Classifier 1
4.1.2 Classifier 2
4.2 Tree Augmented Naive Bayes Classifier
4.2.1 Classifier 3
4.3 Forest Augmented Naive Bayes Classifier
4.3.1 Classifier 4
4.4 Summary
5 Conclusion
6 References
7 List of abbreviations
8 Appendix A


1 Introduction

Fraudulent behaviour unfortunately occurs in certain insurance companies. They might partake in shady financial transactions, while some insurance companies might even take part in money-laundering schemes for drug traffickers, terrorist organizations and other types of criminal networks. These kinds of insurance companies could potentially play a huge role in keeping criminal and destructive activities alive and well. It is therefore of high importance, for their customers, investors, employees and for our society in general, to find a way to detect fraudulent behaviour within insurance companies.

Bayesian Networks (BN) (Pearl, 1988) are a type of probabilistic model which has previously been proven effective in these kinds of tasks. They have been used for detection of fraudulent behaviour within different types of companies, in papers such as Khan et al. (2013), Glancy and Yadav (2011) and Wijayatunga, Mase and Nakamura (2006). With a BN, it is possible to visualize causal relationships between different types of features. A BN classifier can be built by initially learning its structure, and then its parameters given the structure. Learning the structure of a BN can, however, become too complex when there are many features with different levels of dependency between them. This creates an incentive to simplify the process of learning the structure. One such simplified BN classifier makes the assumption that all features are conditionally independent of each other, given the class variable. This results in a fixed, simple structure and a BN classifier known as the Naive Bayes classifier (NB).

The assumption of NB can often be regarded as too simplistic, as such simple relationships are rare in practice. Yet NB has proven to be surprisingly effective in many situations. Still, efforts have been made to extend NB in order to find more realistic structures. One of the most well-known extensions of NB is the Tree Augmented Naive Bayes classifier (TAN), optimized with Chow and Liu's algorithm from 1968 (Chow and Liu, 1968). It results in a tree-like structure, where dependencies between features are allowed (Friedman, Geiger and Goldszmidt, 1997). This type of TAN can be abbreviated as CL-TAN.

But this approach has been proven to have its downsides as well. The Forest Augmented Naive Bayes classifier (FAN) can be regarded as an upgrade of CL-TAN, as it improves on the method through mainly two important tweaks (Jiang et al., 2005). There are other types of BN classifiers as well, but in this thesis the focus will lie on the three aforementioned types of classifiers.

The parameters of BN classifiers are often learned using maximum likelihood estimation (MLE). This is not ideal in practice though, as we are interested in learning conditional probabilities. Efforts have been made to develop more suitable ways of learning the parameters, by using conditional likelihood estimation (CLE) instead of MLE. This has proven to be a very complex task though (Pernkopf and Bilmes, 2005).

1.1 Aim

Two NB classifiers will be trained, using two different types of parameter learning methods.

CL-TAN and FAN classifiers will only have their parameters learnt using MLE, as the complexity of learning their parameters using CLE goes beyond the scope of this thesis.


The aim of this thesis will be to compare the predictive performance of these four different classifiers, in order to find the classifier best suited for the task of rating the corruption levels of different insurance companies. These insurance companies will be rated, based on 10 different feature variables, on a scale between 1 and 7, with 1 representing extremely high levels of corruption and 7 representing very low levels of corruption. These 10 feature variables contain information about the insurance companies such as how many policies under 500 000 euros there are, whether there have been any abnormal behaviours, the number of missing documents for a policy, etc. Area under the Receiver Operating Characteristics (ROC) curve (AUC), as well as accuracy, will be used as a measure of predictive performance.


2 Theory

This section will review the underlying theory of the methods necessary to achieve the aim of this thesis.

2.1 Bayesian Network

In order to provide a proper explanation of how a BN classifier is learned, the building blocks of a BN must first be explained. The following sub-sections cover these building blocks.

2.1.1 Bayesian Probability

Classical statistical methods assume that there is a true, fixed probability distribution. This distribution is unknown and treated as a member of a parametric family. The estimates of the parameters are used to extract an approximation of the objectively true probability distribution.

Bayesian methods use a different approach. By using one's background knowledge of where the parameter might lie, one can compute an exact subjective probability distribution. It will not be an estimate of the probability distribution, but rather the probability distribution itself (Koski and Noble. 2009, 14). This background knowledge is known as the prior distribution $p(H_i)$, where $\{H_i\}_{i=1}^{r}$ is a finite exhaustive set of mutually exclusive hypotheses, i.e., prior assessments of the probability distribution (Koski and Noble. 2009, 12).

The prior probability serves as the probability distribution over the event space before any data is collected. It might contain things like background information, model assumptions or just a vague function introduced for mathematical convenience (Koski and Noble. 2009, 9). Once new evidence in the shape of data has been collected, Bayes’ rule can be utilized in order to update the prior probability distribution into a posterior probability distribution

$$p(H_i \mid E) = \frac{p(E \mid H_i)\, p(H_i)}{\sum_{i=1}^{r} p(H_i)\, p(E \mid H_i)}, \qquad (1)$$

where E represents the new evidence, i.e the observed data. In other words, the posterior probability distribution will be a conditional probability, of the prior assessments given the observed data (Koski and Noble. 2009, 10 & 12).
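As a small illustration of equation (1), the following R sketch (not from the thesis; the numbers are made up) updates a prior over two hypotheses with a likelihood for some observed evidence E:

# Hypothetical prior p(H_i) and likelihood p(E | H_i) for two hypotheses
prior      <- c(H1 = 0.7, H2 = 0.3)
likelihood <- c(H1 = 0.2, H2 = 0.6)
# Bayes' rule, equation (1): posterior is proportional to likelihood times prior
posterior  <- likelihood * prior / sum(likelihood * prior)
posterior  # H1 = 0.4375, H2 = 0.5625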

2.1.2 Factorization

Each random variable Xj in a finite set of 1 + d random variables

X = (C, X1, X2, · · · , Xj, · · · , Xd), where j = 1, 2, · · · , d, has its own state space Xj, or C for the class variable C. A binary random variable Xj, for instance, has the state space Xj = {0, 1}. It can have two different states, so its cardinality is therefore |Xj| = 2. A total state space can be defined as the product set

C

d

Y

j=1

Xj.

The total state space can be regarded as all of the potential outcome values (Wijayatunga, Mase and Nakamura, 2006).

Say there is a joint probability distribution $p(x_1, x_2, x_3, x_4)$, in which all the variables are binary. If we were to specify the total state space of this joint distribution, we would end up with $2^4 = 16$ different states, i.e., 16 different potential outcome values, or parameters. Imagine having a lot more than just four variables, some if not all being multinomial. This would lead to an infeasible number of parameters to compute, which would be highly impractical. Instead, we could decompose the joint probability distribution into a product of conditional probabilities, by assuming some conditional independencies.

By considering how these variables are connected and how they affect each other, one could, for instance, factorize the joint probability distribution into

$$p(x_1, x_2, x_3, x_4) = p(x_4 \mid x_1, x_2)\, p(x_3 \mid x_1)\, p(x_1)\, p(x_2), \qquad (2)$$

where $x_4$ will be dependent on $x_1$ and $x_2$, while $x_3$ will be dependent on $x_1$. The factorization approach will be a lot more effective than just calculating the joint probability for each potential state of each variable. By utilizing equation (2) instead, only $4 + 2 + 1 + 1 = 8$ parameters will be necessary to estimate. This stems from the fact that only $2^2 = 4$ parameters (or states) need to be specified for $p(x_4 = 1 \mid x_1, x_2)$, only $2^1 = 2$ for $p(x_3 = 1 \mid x_1)$, and just $2^0 = 1$ each for $p(x_1 = 1)$ and $p(x_2 = 1)$. The other states are given by normalisation, i.e., $p(x_4 = 0 \mid x_1, x_2) = 1 - p(x_4 = 1 \mid x_1, x_2)$, for instance. Hence, these states are not necessary to estimate (Barber. 2012, 30-31). Further, equation (2) can be generalized into the form

$$p(X_1, \cdots, X_d) = \prod_{j=1}^{d} p(X_j \mid pa(X_j)), \qquad (3)$$

where $pa(X_j)$ denotes the parent variables of $X_j$. A variable $X_j$ is a parent variable of $X_i$ if $X_i$ is dependent on $X_j$, i.e., if $X_j$ affects $X_i$. This also makes $X_i$ the child of the parent variable (Barber. 2012, 37).
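The parameter counting above can be re-checked with a minimal R sketch (not from the thesis), assuming, as in the example, that all variables are binary:

# Number of parameters under a factorization, given the number of parents of each
# variable; each binary variable with k binary parents needs 2^k parameters.
n_params <- function(n_parents) sum(2^n_parents)
# Full joint of four binary variables versus the factorization in equation (2),
# where x4 has parents {x1, x2}, x3 has parent {x1}, and x1, x2 have no parents.
c(joint = 2^4, factored = n_params(c(2, 1, 0, 0)))   # 16 versus 4 + 2 + 1 + 1 = 8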

2.1.3 Directed Acyclic Graphs

Equation (3) can be visualized in a Directed Acyclic Graph (DAG) $G = (V, E)$, where $V$ is a finite set of nodes and $E$ is a finite set of edges (which consists of pairs of nodes) (Koski and Noble. 2009, 41). Each $p(X_j \mid pa(X_j))$ in (3), known as a local probability distribution, is represented as a node $j$ in a corresponding DAG. One joint probability distribution can be factorized in more than one way, which means that one joint probability distribution could yield several potential DAGs (Barber. 2012, 37). A DAG visualizes the relationships between the variables expressed in (3). It consists only of directed edges, which manifest themselves as arrows between the nodes. An arrow going from $X_j$ to $X_i$ means that $X_j \in pa(X_i)$. Undirected edges manifest themselves as arrow-less lines between the nodes, where both nodes of an edge affect each other. Further, a DAG is acyclic in the sense that one cannot start from a node and come back to the same node while following paths of directed edges (Barber. 2012, 22-23).

Figure 1: Example of a Directed Acyclic Graph.

Figure 1 is a DAG which represents the example factorization given in equation (2). All edges are directed and it is impossible to start from, say, $X_1$ and come back to it again. The set $V$ contains $\{X_1, X_2, X_3, X_4\}$, while the set $E$ contains $\{(X_1, X_3), (X_1, X_4), (X_2, X_4)\}$. The variable $X_4$ can be defined as a so-called collider, as it blocks the path between the variables $X_1$ and $X_2$. But nodes in a DAG can also be connected in other ways, such as in a chain or a fork. Given that $V = \{U, W, Z\}$ is a set of nodes, a chain would be defined as $U \rightarrow W \rightarrow Z$, while a fork would be defined as $U \leftarrow W \rightarrow Z$.

A DAG often has a lot more paths of nodes and directed edges than the one in Figure 1. A DAG with nodes U and Z can have many paths of other nodes and edges between them. If at least one path connects U and Z, they are said to be d-connected (d means directional). But if a path is a chain or a fork, one can block the path between U and Z by conditioning on one of the nodes on the path. This will make U and Z d-separated given at least one of the nodes on the path, given that all other paths between U and Z are blocked. If a path between U and Z contains a collider, the path will be unconditionally blocked and U and Z will be d-separated, given that all other paths between U and Z are blocked. If U and Z are d-separated, it means that they are independent. A path between U and Z blocked by a collider makes U and Z unconditionally or marginally independent. A path between U and Z blocked by conditioning on a node in a fork or a chain makes U and Z conditionally independent. If U and Z are d-connected, they are said to be dependent (Pearl, Glymour and Jewell. 2016, 45-48).


2.2 Bayesian Network Classifiers

BNs can be used for classification tasks. This initially requires learning the structure of a BN, and then the parameters given the structure. But learning the structure of a BN can often be very complex. Basically, it is about finding the optimal structure and forming a visual representation of the joint distribution $p(x_1, \cdots, x_d)$, with the purpose of classification. This is computationally demanding, as there may be a lot of dependencies and independencies between variables to consider (Khanteymoori, Homayounpour and Menhaj, 2008).

A BN can be defined as $B = \{G, \Theta\}$, where $G$ is a DAG, while $\Theta$ denotes the parameters of the BN and encodes the factorization properties of the distribution (Pernkopf, Wohlmayr and Tschiatchek, 2012). Learning a BN classifier is about finding a network $B$ that best matches a training data set $S = \{x^{(m)}\}_{m=1}^{M}$ with random variable vector $X = (C, X_1, ..., X_d)$ and sample size $M$, where $x^{(m)}$ is the $m$th observation (record), for $m = 1, ..., M$. A common approach for this optimization is to introduce a scoring function, such as the minimal description length (MDL) scoring function. This scoring function is given by

$$MDL(B \mid S) = \frac{\log M}{2}\, |B| - LL(B \mid S).$$

The first term counts the number of bits needed to encode $B$ (Friedman, Geiger and Goldszmidt, 1997). One bit is a single binary unit of information. The amount of uncertainty, of $B$ in this case, will be halved by one bit of information (Evans. 2011, 4). So in other words, the first term can be regarded as a measurement of the length of describing $B$. The notation $|B|$ is the number of parameters in the network, while $\frac{\log M}{2}$ is the number of bits used for each parameter in $\Theta$. The second term is the negated log likelihood function of $B$ given $S$, which is defined as

$$LL(B \mid S) = \sum_{m=1}^{M} \log P_B(x^{(m)}),$$

where $P_B$ is a probability distribution. It measures the number of bits needed to describe $S$ based on $P_B$. The higher $LL(B \mid S)$ is, the closer $B$ is to modeling the probability distribution in $S$. Hence, the aim is to maximize $LL(B \mid S)$ in order to find the optimal structure of a BN classifier (Friedman, Geiger and Goldszmidt, 1997).

Different methods have been introduced in order to simplify the process of learning the structure of BN classifiers (Khanteymoori, Homayounpour and Menhaj, 2008). NB is one of the oldest and most common ones. But because of its strict and often unrealistic assumptions, other methods, such as CL-TAN and FAN, have been developed in order to loosen up the strict assumptions of NB (Downes and Tang, 2004).

2.2.1 Measures of Predictive Performance

The standard procedure in classification tasks is to divide the observed sample into training data and test data. The training data is used to fit the classifier, while the test data is used to evaluate the predictive performance of the classifier. A standard measure of predictive performance in classification tasks is accuracy, obtained from a confusion matrix.

This type of measurement gives the proportion of observations that are correctly classified by a fitted classifier, when classifying test observations.

Figure 2: Confusion matrix.

A confusion matrix is presented in Figure 2. As can be noted, four different potential outcomes for an observation exist, which are the true positives (TP), false positives (FP), false negatives (FN) and the true negatives (TN). The column totals consist of the total number of positives (P) and negatives (N) in the test data. The diagonal represents correctly classified observations, while the off-diagonal represents observations that are given the wrong class prediction by a given classifier. A perfect classifier is one that results in $TP = P$ and $TN = N$, which will give an accuracy of 100 %. Hence, the accuracy is measured by

$$\text{Accuracy} = \frac{TP + TN}{P + N}.$$

Other important performance metrics that can be obtained from a confusion matrix are

$$TPR \approx \frac{\text{Positives (1) correctly classified}}{\text{Total positives}} \quad \text{and} \quad FPR \approx \frac{\text{Negatives (0) incorrectly classified}}{\text{Total negatives}},$$

where the R is an abbreviation for rate (Fawcett, 2006). Further, the so-called misclassification costs, denoted $co_0$ and $co_1$ in a binary case, are integral parts of a confusion matrix. One can define $co_0$ as the cost of misclassifying a class 0 observation, while $co_1$ can be defined analogously. They are key when deciding the threshold which minimizes the misclassification rate. But the costs are often very difficult to determine, hence they are often just set to 1, such that $co_0 = co_1$. This will result in a threshold of $\frac{1}{2}$ by the formula

$$t = \frac{co_1}{co_0 + co_1}, \qquad (4)$$

such that a point is classified into class 0 if $\hat{p} > t = \frac{1}{2}$ (Hand and Till, 2001). Because of this, a suitable alternative to the accuracy measurement would be the Area Under the Receiver Operating Characteristics (ROC) Curve (AUC) measurement. It is based on a distribution $\hat{p}(x)$ and ignores the costs, as well as the class prior distributions.

It relies solely on comparing the class distributions within $\hat{p}(x)$, as AUC can be used as a measure of difference between probability distributions (Hand and Till, 2001).
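As a small illustration of the confusion-matrix quantities above, the following R sketch (not from the thesis; the counts are made-up toy numbers) computes accuracy, TPR and FPR for a binary case laid out as in Figure 2:

TP <- 50; FN <- 10      # class 1 (positive) test observations: P = TP + FN
FP <- 5;  TN <- 35      # class 0 (negative) test observations: N = FP + TN
P <- TP + FN; N <- FP + TN
c(accuracy = (TP + TN) / (P + N),   # proportion correctly classified
  TPR      = TP / P,                # positives correctly classified
  FPR      = FP / N)                # negatives incorrectly classified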

Let $\hat{p}(x)$ be the estimated probability that an observation with covariate vector $x^T$, containing the covariates $X_j$, belongs to class 1. Define $f(\hat{p}) = f(\hat{p}(x) \mid 1)$ as the probability density function for the estimated probability that a class 1 point is classified into class 1. Define $g(\hat{p}) = g(\hat{p}(x) \mid 0)$ as the probability density function for the estimated probability that a class 1 point is classified into class 0. Define $F(\hat{p})$ and $G(\hat{p})$ as the corresponding cumulative distribution functions. Figure 3 is an example of a ROC-curve. $F(\hat{p})$ is estimated by TPR, while $G(\hat{p})$ is estimated by FPR. The ROC curve in Figure 3 indicates a satisfying AUC, as the curve goes through the upper left side of the figure, which indicates that $F(\hat{p}) > G(\hat{p})$. Given that the AUC = 0.9, a class 1 point will have a 90 % chance of being classified into class 1, while only having a 10 % chance of being classified into class 0. One could also think of a ROC-curve as moving a threshold $t$ from 0 to 1, and plotting $F(t)$ against $G(t)$ (Hand and Till, 2001).


Figure 3: The red ROC-curve will result in a satisfying AUC-value, while the gray ROC-line will result in an AUC-value of 50 %.

The ROC curve can be regarded as a trade-off between TPR and FPR. Since TPR and FPR can be represented as probability distributions, the range of each of them will naturally be [0, 1]. The aim is to have a ROC curve where the classifier will have the highest probability possible of ranking a randomly chosen positive observation higher than a randomly chosen negative observation (Fawcett, 2006).

The area under a diagonal line which goes from the origin to the point (1, 1), covers 50% of the total area of a graph, with horizontal and vertical axes ranging from 0 to 1. As this is equal to just random guessing, a reasonable classifier should have an AUC above 0.5 (Fawcett, 2006). But it is preferable if the AUC is above 0.9. This would represent a curve which passes through the far upper left corner, not unlike the ROC curve in Figure 3.

Costs may be even more tricky to determine in multi-class problems. Yet accuracy for multi-class problems has been used frequently in the field by extending equation (4). But this is rarely a feasible option in practice. Instead, multi-class extensions of AUC have been developed, which is a more suitable approach (Hand and Till, 2001).

In binary classification tasks, the AUC is a measure of the discriminability of a pair of classes, and will return a scalar value. When dealing with multi-class tasks, computing an AUC will be a bit more complicated, as it would be required to combine multiple pairwise discriminability values. Hand and Till (2001) proposed a formula which deals with this problem, which they define as

$$AUC_{total} = \frac{2}{|C|(|C| - 1)} \sum_{\{c_i, c_j\} \in C} AUC(c_i, c_j), \qquad (5)$$

where the sum of all AUC-values, for all distinct pairs of classes $i$ and $j$ ($i \neq j$), is calculated regardless of order. The number of such pairs amounts to $|C|(|C| - 1)/2$. The strength of equation (5) is that it is insensitive to changes in class distribution. But its weakness is that it is profoundly difficult to visualize the surface of the area (Fawcett, 2006).
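A minimal R sketch (not the thesis's actual code) of equation (5), averaging pairwise AUC values over all distinct class pairs, is given below. The pROC package is the one used later in the thesis; the response and score vectors here are made-up toy data:

library(pROC)
set.seed(1)
truth <- factor(sample(1:3, 300, replace = TRUE))       # three classes
score <- runif(300) + 0.2 * as.numeric(truth)           # toy score, higher for higher classes
pairs <- combn(levels(truth), 2, simplify = FALSE)
pair_auc <- sapply(pairs, function(p) {                 # AUC(c_i, c_j) for each class pair
  keep <- truth %in% p
  as.numeric(auc(roc(droplevels(truth[keep]), score[keep], quiet = TRUE)))
})
mean(pair_auc)                                          # AUC_total of equation (5)

pROC::multiclass.roc(truth, score) computes the same Hand and Till (2001) measure directly.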

2.2.2 Naive Bayes Classifier

NB simplifies the process of learning the structure of BNs for classification by assuming that all the features are conditionally independent of each other given the class variable (Downes and Tang, 2004). With this assumption, the structure will be fixed, and no structure learning will in fact be required (Francois and Leray, 2006). BNs can often have highly complex structures. This makes it apparent that the assumption which NB makes is quite a daring one. Yet NB has proven to be remarkably effective in more than a few cases (Downes and Tang, 2004). This phenomenon has long puzzled various scientists in the field, but efforts have been made to explain it. Rish (2001), for instance, concluded that NB mainly performs well on data sets with either completely independent feature variables, or functionally dependent feature variables, which are two opposite extremes.

Rish (2001) defines functional dependency as, given equal class priors, $X_j = f_j(X_1)$, where $j = 2, 3, \cdots, d$, while $f_j(\cdot)$ is a one-to-one mapping. This type of situation is not what one intuitively would regard as optimal for NB, yet it is. Rish (2001) also noted that NB is at its worst between these two extremes.

With the assumption of NB, the joint probability distribution can be factorized into

$$p(X, C) = p(C) \prod_{j=1}^{d} p(X_j \mid C), \qquad (6)$$

where the class variable $C \in \{c_1, c_2, \cdots, c_k, \cdots, c_{K-1}, c_K\}$ will be the parent variable of all features, while all of the features $X_j$ will be independent of each other, given $C$ (Barber. 2012, 243). Figure 4 shows the corresponding NB structure.


Figure 4: A visualization of a Naive Bayes classifier structure.

The directed edges in Figure 4 represent the probabilistic influence that $C$, as a parent, has on the features (Wijayatunga, Mase and Nakamura, 2006). Equation (6) can also be regarded as a discriminant function, where each observation is assigned to the class $c_k$ for which $p(X, C = c_k)$ is the largest (Downes and Tang, 2004).
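As a minimal sketch (not the thesis's actual code) of fitting an NB classifier with the bnclassify package used later in the thesis: the data frame insurance, its class column rating and the smoothing value are assumptions made here for illustration, and all columns are assumed to be factors.

library(bnclassify)
structure_nb <- nb(class = "rating", dataset = insurance)   # fixed NB structure as in Figure 4
model_nb     <- lp(structure_nb, insurance, smooth = 0)     # generative (MLE) parameter learning
pred_nb      <- predict(model_nb, insurance)                # most probable class per observation
head(pred_nb)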

2.2.3 Tree Augmented Naive Bayes Classifier

The simple and often unrealistic assumption of NB has sparked an incentive to find ways to improve upon it. Friedman, Geiger and Goldszmidt (1997) made an effort to relax this assumption by allowing for dependencies between features. Their successful attempt resulted in a BN classifier with a tree-like structure, known as TAN. The idea was to relax the assumption of NB by allowing each feature to have one parent (feature) variable, in addition to having $C$ as a parent. This resulted in a joint probability distribution which could be factorized into

$$p(X, C) = p(C) \prod_{j=1}^{d} p(X_j \mid pa(X_j)), \qquad (7)$$

where $pa(X_j)$ not only contains $C$, but also one other feature variable (Friedman, Geiger and Goldszmidt, 1997). This made it distinct from NB, but it also meant that the structure would not be fixed. Finding the most suitable structure from the joint probability distribution would require some kind of optimization technique. In order to achieve this, Friedman, Geiger and Goldszmidt (1997) utilized a famous algorithm derived in Chow and Liu (1968). Their algorithm is known as the Chow-Liu tree and consists of fitting the optimal structure of a tree by determining its edges using the mutual information between each pair of variables. This method is further explained in algorithm 1 (Friedman, Geiger and Goldszmidt, 1997).


Algorithm 1 The Chow-Liu tree.

1. Calculate the mutual information between each pair of variables $X_i$ and $X_j$, where $i \neq j$. The mutual information function is given by

$$I_{\hat{P}_S}(X_i, X_j) = \sum_{x_i, x_j} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)},$$

and measures how much information $X_i$ provides about $X_j$. The notation $\hat{P}_S$ is the estimated probability distribution of $X$.

2. Set up an undirected graph, with the variables as nodes and with weights, measured by mutual informations, as undirected edges.

3. Maximize the log likelihood function in the MDL scoring function, by fitting a maximum weighted spanning tree of the undirected graph in step 2.

4. Transform the resulting undirected maximum weighted spanning tree from step 3 into a directed tree, by arbitrarily selecting a root variable from the set of variables and setting the direction of all edges outwards from it.

Friedman, Geiger and Goldszmidt (1997) modified algorithm 1 and made it into an extension of NB, which is here referred to as CL-TAN. Instead of calculating the mutual information between each pair of feature variables, they calculated the conditional mutual information (CMI) between each pair of feature variables, given the class variable $C$. Usually, in BN classification trees, $C$ is the root variable. But during the process of fitting the optimal structure for a CL-TAN using the preceding steps in algorithm 1, $C$ is left out. It is not added until the finalizing step, in which one simply adds a node containing $C$ to the already fitted tree structure, as well as directed edges going outwards from $C$ to each feature variable. After this last step, $C$ will be the root variable. But a root variable will still be necessary during the step corresponding to step 4 in algorithm 1. Since it cannot be $C$, a feature variable $X_{root}$ is arbitrarily chosen as the root from the set of feature variables, similar to step 4 in algorithm 1. The modifications of algorithm 1, which will result in a CL-TAN, can more formally be expressed in algorithm 2 (Friedman, Geiger and Goldszmidt, 1997).


Algorithm 2 The structure learning procedure of a CL-TAN.

1. Calculate the CMI between each pair of variables $X_i$ and $X_j$, where $i \neq j$. The CMI function is given by

$$I_{\hat{P}_S}(X_i, X_j \mid C) = \sum_{x_i, x_j, c} p(x_i, x_j, c) \log \frac{p(x_i, x_j \mid c)}{p(x_i \mid c)\, p(x_j \mid c)},$$

and measures how much information $X_i$ provides about $X_j$, given $C$.

2. Set up an undirected graph, with the variables as nodes and with weights, measured by CMI’s, as undirected edges.

3. Maximize the log likelihood function in the MDL scoring function, by fitting a maximum weighted spanning tree of the undirected graph in step 2.

4. Transform the resulting undirected maximum weighted spanning tree from step 3 into a directed tree, by arbitrarily selecting a root variable Xroot from the set of feature variables and setting the direction of all edges outwards from it.

5. Add a node containing C and set directed edges outwards from it, to all feature variables. This node will now be the root of the resulting directed network, which is now a CL-TAN.
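A minimal sketch (not the thesis's actual code) of learning a CL-TAN with the bnclassify package, whose tan_cl() function implements a Chow-Liu tree-augmented structure as in algorithm 2. The data frame insurance, the choice of root and the smoothing value are assumptions for illustration:

library(bnclassify)
structure_tan <- tan_cl(class = "rating", dataset = insurance,
                        score = "loglik", root = "Contractors")   # Chow-Liu tree over the features
model_tan     <- lp(structure_tan, insurance, smooth = 0.2)       # generative parameter learning
plot(structure_tan)                                               # inspect the learned tree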

2.2.4 Forest Augmented Naive Bayes Classifier

The choice of $X_{root}$ affects the structure of the tree, but the prediction accuracy is usually not significantly different depending on which $X_{root}$ one might choose (Jiang et al., 2005).

But even if the prediction accuracy might remain approximately the same, the AUC might still be affected negatively depending on which $X_{root}$ one chooses. CL-TAN does not take this into consideration when arbitrarily assigning the root variable in step 4 of algorithm 2. Another issue with CL-TAN is that it might create several irrelevant edges as a result of maximizing the log likelihood function in the MDL scoring function. This could potentially result in over-fitting. Jiang et al. (2005) introduced a modification of CL-TAN, which they named FAN. It fixed the previously mentioned issues with CL-TAN and generally gave better AUC-values for most of the data sets used in their paper. In their experiments, CL-TAN rarely improved over NB in terms of AUC, while FAN did so on most of the data sets (Jiang et al., 2005).

FAN is very similar to CL-TAN. But it has mainly two alterations when learning the structure, which make it distinct from CL-TAN, but also more robust in theory. The first alteration is that it does not choose $X_{root}$ arbitrarily; rather, it chooses the $X_{root}$ which has the maximum mutual information with $C$. This optimization can be expressed as

$$X_{root} = \arg\max_{X_j} I_{\hat{P}_S}(X_j, C),$$

where $j = 1, \cdots, d$ (Jiang et al., 2005). In other words, the feature having the greatest influence on $C$ will be selected, which in most cases will ensure a more satisfying AUC.

The second alteration is that it adds a threshold to the structure learning process, which will filter out irrelevant edges. This will result in divisions of the nodes into groups of nodes within the network. In other words, $C$ will not be directed towards just one large tree but instead towards two or more smaller trees, such that the resulting structure is forest-like. Hence the name Forest Augmented Naive Bayes (Jiang et al., 2005). In practice, these irrelevant edges can be filtered out by replacing the LL function in the MDL scoring function with a penalized log-likelihood function, such as AIC. Keeping the LL function will result in one large tree, as it lacks penalization (Mihaljevic, Bielza and Larranga, 2018).

As the AUC is quite sensitive to the selection of $X_{root}$, it can potentially be of high importance to choose a suitable candidate. CL-TAN will just arbitrarily choose one, which can potentially yield a poor AUC. FAN, on the other hand, will use optimization techniques to find the most suitable candidate for $X_{root}$. This might lead to a better AUC-value, and this is the main reason why FAN will often outperform CL-TAN in terms of AUC (Jiang et al., 2005).
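A minimal sketch (not the thesis's actual code) of the forest-like structure in bnclassify: using the penalized AIC score in tan_cl() drops weak edges, so the augmenting structure becomes a forest rather than one large tree, as described above. The data frame insurance and the smoothing value are assumptions for illustration:

library(bnclassify)
structure_fan <- tan_cl(class = "rating", dataset = insurance, score = "aic")  # penalized score
model_fan     <- lp(structure_fan, insurance, smooth = 0.4)                    # generative parameters
# The penalization typically leaves fewer augmenting arcs than the plain Chow-Liu tree:
narcs(structure_fan) <= narcs(tan_cl("rating", insurance, score = "loglik"))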

2.2.5 Parameter Learning

The parameters of BN classifiers can be learned from data at hand and expert opinion, if available, for a given structure. While structure learning can be thought of as a way to visualize the factorization of a joint density, parameter learning can be thought of as a way to express this factorization numerically. In Wijayatunga and Mase (2006) and in Pernkopf and Bilmes (2005), two different paradigms of parameter learning, namely, generative and discriminative learning, are described. A generative classifier will learn the parameters of the joint probability distribution from the training data, using techniques such as MLE, often enhanced with a Bayesian smoothing (Dirichlet) prior.

Each variable $X_j$, given the value of its parents $pa(X_j)$, has its own local conditional distribution, $P(X_j = l \mid pa(X_j) = h)$, for each $l \in \mathcal{X}_j$ and $h \in pa(\mathcal{X}_j)$, where $\mathcal{X}_j$ and $pa(\mathcal{X}_j)$ are the state spaces of $X_j$ and $pa(X_j)$, respectively. Each of these local distributions can be presented in conditional distribution tables, one table for each local distribution. Let $\theta^j_{l|h}$ denote a specific conditional distribution table entry, i.e., $P(X_j = l \mid pa(X_j) = h) = \theta^j_{l|h}$. It is simply the probability that $X_j$ takes on its $l$th value assignment, given that $pa(X_j)$ take the $h$th assignment. For a given BN structure $B$, the log likelihood function of the parameters $\Theta$ ($LL_\Theta$) may be expressed as

$$LL_\Theta(B \mid S) = \sum_{m=1}^{M} \log P_\Theta(X = x^{(m)}) = \sum_{m=1}^{M} \sum_{j=1}^{d} \log P_\Theta\big(x_j^{(m)} \mid pa(x_j)^{(m)}\big) = \sum_{j=1}^{d} \sum_{l=1}^{|\mathcal{X}_j|} \sum_{h=1}^{|pa(\mathcal{X}_j)|} u^j_{l|h} \log\big(\theta^j_{l|h}\big), \qquad (8)$$

where $P_\Theta\big(x_j^{(m)} \mid pa(x_j)^{(m)}\big)$ is the conditional probability value of $X_j$ for the $m$th observation. The notation $u^j_{l|h}$ is the number of observations such that $X_j = l$ and $pa(X_j) = h$ in $S$. Here, $|A|$ denotes cardinality (number of elements) of the set $A$ and it is assumed that the elements are named with consecutive integers starting with 1 (Pernkopf and Bilmes, 2005).

The MLE of $\theta^j_{l|h}$ derived from equation (8) is

$$\hat{\theta}^j_{l|h} = \frac{u^j_{l|h}}{\sum_{l=1}^{|\mathcal{X}_j|} u^j_{l|h}}, \qquad (9)$$

and the Bayesian estimate is

$$\hat{\vartheta}^j_{l|h} = \frac{u^j_{l|h} + \alpha^j_{l|h}}{\sum_{l=1}^{|\mathcal{X}_j|} \big(u^j_{l|h} + \alpha^j_{l|h}\big)}, \qquad (10)$$

where $\alpha^j_{l|h}$ is the Dirichlet prior (hypothetical) count corresponding to the observed count $u^j_{l|h}$. Note that the so-called Dirichlet prior serves as the prior probability distribution over the set of parameters, and can generally be formulated as

$$\pi(\theta_1, \theta_2, \cdots, \theta_q) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_q)}{\prod_{j=1}^{q} \Gamma(\alpha_j)} \prod_{j=1}^{q} \theta_j^{\alpha_j - 1}, \qquad (11)$$

if $\theta_j \geq 0$ and $\sum_{j=1}^{q} \theta_j = 1$. Otherwise, $\pi(\theta_1, \theta_2, \cdots, \theta_q) = 0$. $\Gamma$ denotes the Euler Gamma function (Koski and Noble. 2009, 25). Note that there are other types of Bayesian smoothing priors in existence as well.
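A minimal R sketch (not from the thesis; the counts are made up) of the two generative estimates in equations (9) and (10) for a single conditional distribution table column, i.e., the values of $X_j$ for one fixed parent configuration $h$:

u     <- c(12, 3, 0, 5)                       # observed counts u^j_{l|h}, l = 1, ..., 4
alpha <- 0.5                                  # Dirichlet prior count alpha^j_{l|h}
theta_mle   <- u / sum(u)                     # equation (9): assigns probability zero to l = 3
theta_bayes <- (u + alpha) / sum(u + alpha)   # equation (10): smoothed away from zero
rbind(mle = theta_mle, bayes = theta_bayes)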

But using the generative parameter estimation methods shown above to learn the parameters is rarely optimal for prediction tasks, as it makes little sense to optimize the joint likelihood when the real target of interest lies in finding the best conditional densities of the class given the features. A more natural solution is to maximize the conditional likelihood instead, in order to find the optimal parameters. So in Wijayatunga and Mase (2006) and in Pernkopf and Bilmes (2005), the conditional log likelihood function (CLL) of a given BN structure $B$ is defined as

$$CLL(B \mid S) = \log \prod_{m=1}^{M} P_\Theta\big(C = c_k^{(m)} \mid X_{1:d} = x_{1:d}^{(m)}\big) = \sum_{m=1}^{M} \log \frac{P_\Theta\big(C = c_k^{(m)},\, X_{1:d} = x_{1:d}^{(m)}\big)}{\sum_{c' = 1}^{|C|} P_\Theta\big(C = c',\, X_{1:d} = x_{1:d}^{(m)}\big)}, \qquad (12)$$

where $c_k^{(m)}$ is the class of the $m$th observation, while $x_{1:d}^{(m)}$ are the observed values of the $d$ covariates for the $m$th observation. But one major problem with this approach is that, unlike $LL_\Theta$, the CLL lacks a closed-form maximizer. This means that one would need to maximize it using iterative optimization techniques, which can be highly complex. Using a conjugate gradient descent algorithm with line-search is a common approach for performing this (Pernkopf and Bilmes, 2005).

But since the conjugate gradient descent algorithm is mainly a local optimizer, there will not be any guarantees of finding a global maximum during the line-search, unless one somehow manages to map each conditional BN model to a logistic regression model, where the likelihood surface is known to be concave. In that case, there would only be one single optimum available, which obviously would be the global maximum (Wettig et al., 2003).
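To make the discriminative approach concrete, the following is a simplified, self-contained R sketch (not the thesis's actual code) of CLL-based parameter learning for an NB model using optim() with the conjugate gradient method. The data frame train and its class column rating are assumptions for illustration; the parameters are kept unconstrained and mapped to probabilities through a softmax, and no gradient is supplied, so this only illustrates the idea in equation (12).

# Discriminative (conditional likelihood) parameter learning for a Naive Bayes model.
discriminative_nb <- function(train, class = "rating") {
  y  <- train[[class]]
  X  <- train[, setdiff(names(train), class), drop = FALSE]   # factor features
  K  <- nlevels(y)
  Lv <- sapply(X, nlevels)

  # Unconstrained weights: K class weights plus |X_j| * K weights per feature;
  # probabilities are recovered by softmax normalization.
  n_par <- K + sum(Lv) * K
  unpack <- function(w) {
    logp_c <- w[1:K] - log(sum(exp(w[1:K])))                  # log p(C = k)
    offset <- K
    logp_x <- lapply(seq_along(X), function(j) {
      m <- matrix(w[offset + seq_len(Lv[j] * K)], nrow = Lv[j], ncol = K)
      offset <<- offset + Lv[j] * K
      sweep(m, 2, apply(m, 2, function(col) log(sum(exp(col)))))   # log p(X_j = l | C = k)
    })
    list(logp_c = logp_c, logp_x = logp_x)
  }

  # Negative conditional log likelihood, equation (12).
  neg_cll <- function(w) {
    p <- unpack(w)
    log_joint <- matrix(p$logp_c, nrow = nrow(X), ncol = K, byrow = TRUE)
    for (j in seq_along(X))
      log_joint <- log_joint + p$logp_x[[j]][as.integer(X[[j]]), ]
    log_denom <- apply(log_joint, 1, function(r) max(r) + log(sum(exp(r - max(r)))))
    -sum(log_joint[cbind(seq_len(nrow(X)), as.integer(y))] - log_denom)
  }

  fit <- optim(rep(0, n_par), neg_cll, method = "CG", control = list(maxit = 200))
  list(logp = unpack(fit$par), cll = -fit$value, converged = fit$convergence == 0)
}

Because optim() with method = "CG" is a local optimizer, several random restarts are a pragmatic way of reducing the risk of ending up in a poor local maximum, which relates directly to the concern raised above.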


3 Data

The data consists of 3284 Italian insurance companies and 10 feature variables, along with the class variable rating. There is no missing data in the data set. The class variable was originally a continuous index score variable, with values ranging from 20.78281 to 188.32453. The higher the score, the more legitimate the insurance company is considered to be. This variable has been transformed into a multi-class variable with 7 different states, using K-means cluster analysis with the package arules (Hahsler et al., 2019). K-means clustering is an unsupervised learning method, in which one partitions a data set into K distinct, non-overlapping clusters, or groups, using the following algorithm (James et al. 2017, 386-388):

Algorithm 3 K-means algorithm.

• For each variable Xj and C

1. Randomly assign numbers between 1 and K, to each observation.

2. Compute the centroid for each of the K clusters. The centroid for the kth cluster is a vector containing the p feature means for the observations in the kth cluster.

3. Reassign each observation to the cluster with the closest centroid in terms of Euclidean distance.

4. Repeat step 2-3 until the clusters stop changing.

In this case, for the class variable rating (C), K = 7. Hence, the continuous index score variable will be divided into 7 clusters, or classes. Further, the feature variables are:

1. MathRes: The size of the reserves, or provisions, in terms of euros, within an insurance company. A reserve is necessary for covering future transactions to policy holders.

2. Contractors: The number of entities that have signed a policy in a certain insurance company.

3. Alarms: The number of abnormal behaviours. If more than 4 alarms are logged by the authorities, suspicion might arise about that specific insurance company. Abnormal behaviours are regulated by specific regulations issued by the legislators.

4. MissingDocs: The number of missing documents for a policy. If an insurance company lacks documents for a certain policy, suspicion arises.

5. ReligEnt: The number of religious entities that have signed policies in a specific insurance company. There have been several cases where illegal money was hidden through religious entities.

6. Onlus: ONLUS is an abbreviation of the Italian name for nonprofit organizations. This variable is a count of the number of nonprofit organizations that have signed policies in a specific insurance company. It is included for the same reason as ReligEnt, namely that illegal money might be hidden within such entities.

7. Policies500k: The number of policies worth less than 500 000 euros in a certain insurance company.

8. Policies500k 1000k: The number of policies worth between 500 000 and 1 000 000 euros in a certain insurance company.

9. NoSav: The number of entities that did not deliver the Sav. The Sav is an optional document in which the contractor declares his/her income, family status, spouse's work, and so on. Those who do not deliver this form are more likely to hide something.

10. PremIns: The number of entities that have chosen to sign the premium insurance.

All of the feature variables initially had a wide range of values, with high maximums. For simplicity's sake, each feature variable had its values divided into 10 clusters. So the state space of each feature variable $X_j$ is $\mathcal{X}_j = \{1, 2, \cdots, 10\}$, i.e., 10 different states for each feature variable. The number of observations in each cluster for each feature variable is presented in Table 1. Table 2 consists of the number of observations in each cluster for the class variable rating.
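A minimal R sketch (not the thesis's actual code) of this kind of K-means discretization, here shown for the class variable with K = 7 (the features were treated analogously with 10 clusters each). The vector index_score is a made-up placeholder for the real index scores:

set.seed(123)
index_score <- runif(3284, min = 20.8, max = 188.3)       # placeholder for the real scores
km <- kmeans(index_score, centers = 7, nstart = 25)       # Algorithm 3 via base R
# Relabel the clusters so that class 1 corresponds to the lowest scores and class 7 to the highest.
rating <- factor(rank(km$centers)[km$cluster], levels = 1:7)
table(rating)

arules::discretize(index_score, method = "cluster", breaks = 7) gives a comparable clustering-based discretization through the package cited above.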

Table 1: The number of observations in each cluster, for each feature variable.

Cluster X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1749 1432 2515 2625 3120 2507 1448 2987 1315 1422

2 392 537 394 319 121 274 509 185 533 503

3 331 411 221 160 17 159 380 64 431 350

4 243 320 98 80 7 166 334 22 336 311

5 211 240 34 45 8 101 255 6 297 219

6 150 150 11 30 3 42 172 3 186 177

7 97 99 6 12 3 18 103 8 119 141

8 64 70 2 6 1 10 71 7 61 73

9 43 24 2 4 1 6 11 1 5 64

10 4 1 1 3 3 1 1 1 1 24

Table 2: The number of observations in each class/cluster, for the variable rating.

Class #Observations

1 1021

2 156

3 697

4 489

5 341

6 344

7 236


4 Results

NB, CL-TAN and FAN classifiers have been trained in the statistical software tool R (R Core Team, 2018), using both generative and discriminative parameter learning methods for NB, while only using generative parameter learning for CL-TAN and FAN. This resulted in a total of four different classifiers. The classifiers that had their parameters learnt by generative methods were created using the package bnclassify (Mihaljevic, Bielza and Larranga, 2018). The NB classifier with discriminative parameter learning has been learnt using a conjugate gradient descent algorithm with line-search. 80% of the data was used for training the classifiers, while the rest was used as test data. The package pROC was used in order to calculate the multi-class AUC for each classifier, using the formula defined by Hand and Till (2001) in equation (5) (Robin et al., 2011). The package caret was used in order to obtain estimates of the accuracy, along with a confusion matrix, for each classifier (Kuhn, 2019).
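A minimal sketch (not the thesis's actual code) of this evaluation pipeline: an 80/20 train/test split, prediction, multi-class AUC with pROC and a confusion matrix with caret. The data frame insurance is an assumption for illustration:

library(bnclassify); library(pROC); library(caret)
set.seed(1)
idx   <- sample(nrow(insurance), size = 0.8 * nrow(insurance))
train <- insurance[idx, ]
test  <- insurance[-idx, ]
model <- lp(nb("rating", train), train, smooth = 0)      # e.g. classifier 1
probs <- predict(model, test, prob = TRUE)               # class probability matrix
preds <- predict(model, test)                            # predicted classes
multiclass.roc(test$rating, probs)                       # Hand and Till (2001) multi-class AUC
confusionMatrix(preds, test$rating)                      # accuracy and confusion matrix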

4.1 Naive Bayes Classifiers

Classifier 1 and classifier 2 are NB classifiers. Both of them have the exact same fixed structure, which can be seen in Figure 5. But while the parameters of classifier 1 are learned generatively, using MLE, classifier 2 learns its parameters discriminatively, using CLE.

Figure 5: NB structure for classifier 1 and 2.


4.1.1 Classifier 1

For classifier 1, the parameters corresponding to the fixed structure in Figure 5 were trained using MLE. The Bayesian smoothing prior $\alpha$ was tuned using 10-fold cross validation. It was estimated that $\alpha = 0$ would yield the highest predictive accuracy. So in other words, no Bayesian smoothing prior was used for the generative learning of classifier 1's parameters. After performing prediction on the test data, the multi-class AUC was estimated to be 0.9321. Table 3 presents the confusion matrix of classifier 1. It correctly classified 71.54 % of all test observations.

Table 3: Confusion matrix for classifier 1. Rows represent the predicted class assignments, while the columns represent the class assignments in the test data.

1 2 3 4 5 6 7

1 220 20 10 1 2 0 0

2 0 0 0 1 0 0 0

3 3 6 106 12 8 4 0

4 0 1 27 57 21 12 1

5 0 0 0 14 26 11 3

6 0 0 0 0 8 33 11

7 0 0 0 0 2 9 28
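A minimal sketch (not the thesis's actual code) of how the smoothing value could be tuned by 10-fold cross-validation with bnclassify, as described for classifier 1. The training data frame train and the candidate grid are assumptions for illustration:

library(bnclassify)
alphas <- c(0, 0.1, 0.2, 0.4, 0.8, 1)
cv_acc <- sapply(alphas, function(a)
  cv(lp(nb("rating", train), train, smooth = a), train, k = 10))   # cross-validated accuracy
alphas[which.max(cv_acc)]                                          # best smoothing value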

4.1.2 Classifier 2

For classifier 2, the parameters corresponding to the fixed structure in Figure 5 were learned using CLE. Custom-made functions and the built-in R function optim, with the conjugate gradient method, were used in order to perform iterative optimization of the parameters. After doing predictions on the test data with the classifier, the multi-class AUC was estimated to be 0.9086. Table 4 presents the confusion matrix of classifier 2. It correctly classified 58.30 % of all test observations.

Table 4: Confusion matrix for Classifier 2.

1 2 3 4 5 6 7

1 220 21 8 0 0 0 0

2 0 1 74 6 6 1 0

3 0 0 0 0 0 0 0

4 3 5 60 61 21 7 0

5 0 0 1 18 28 15 1

6 0 0 0 0 12 39 8

7 0 0 0 0 0 7 34

4.2 Tree Augmented Naive Bayes Classifier

Figure 6 represents the structure of classifier 3. This classifier has a CL-TAN structure, along with generative parameter learning. The root variable was arbitrarily chosen to be Contractors. The log likelihood was maximized by calculating the total weight of all CMIs, in order to find the optimal structure. Hence, the resulting structure is a tree.


Figure 6: CL-TAN structure for classifier 3.

4.2.1 Classifier 3

For classifier 3, the parameters corresponding to the structure in Figure 6 were generatively learned, using MLE with a Bayesian smoothing Dirichlet prior $\alpha = 0.2$. The value of $\alpha$ was decided upon by tuning it using 10-fold cross validation. After doing predictions with the classifier on the test data, the multi-class AUC was estimated to be 0.9389. Table 5 presents the confusion matrix of classifier 3. It correctly classified 76.86 % of all test observations.

Table 5: Confusion matrix for classifier 3.

1 2 3 4 5 6 7

1 222 17 10 0 0 0 0

2 1 5 1 0 0 0 0

3 0 5 108 18 0 0 0

4 0 0 23 57 14 1 0

5 0 0 1 10 45 18 2

6 0 0 0 0 7 40 13

7 0 0 0 0 1 10 28

4.3 Forest Augmented Naive Bayes Classifier

Figure 7 represents the structure of classifier 4. This classifier has a FAN structure, along with generative parameter learning. The feature variable NoSav was selected as the root variable, as it had the maximum mutual information with the class variable rating. In order to find the optimal structure, the AIC was maximized in the MDL scoring function when fitting the undirected maximum weight spanning tree. As expected, this resulted in a structure with fewer edges in comparison to the CL-TAN structure. Hence, the resulting structure here is a "forest" rather than one large "tree", as the class variable rating has edges going to more than one "tree".

Figure 7: FAN structure for classifier 4.

4.3.1 Classifier 4

For classifier 4, the parameters corresponding to the structure in Figure 7 were generatively learned, using MLE with a Bayesian smoothing Dirichlet prior $\alpha = 0.4$. The value of $\alpha$ was decided upon by tuning it using 10-fold cross validation. After doing predictions with the classifier on the test data, the multi-class AUC was estimated to be 0.9366. Table 6 presents the confusion matrix of classifier 4. It correctly classified 79.91 % of all test observations.

Table 6: Confusion matrix for classifier 4.

1 2 3 4 5 6 7

1 222 18 8 0 2 0 0

2 1 5 2 1 0 0 0

3 0 4 116 12 0 2 0

4 0 0 17 59 17 4 0

5 0 0 0 13 40 8 2

6 0 0 0 0 8 52 10

7 0 0 0 0 0 3 31

4.4 Summary

A summary of the results is presented in Table 7. The generative classifiers outperformed the discriminative NB classifier, both in terms of multi-class AUC and accuracy. All of the generative classifiers had approximately equal multi-class AUC-values, with classifier 3, the CL-TAN classifier, having a value slightly above the other two generative classifiers. Classifier 4, the FAN classifier, had the highest accuracy, though.

Table 7: The multi-class AUC and accuracy of each classifier.

Classifier 1 2 3 4

Multi-class AUC 0.9321 0.9086 0.9389 0.9366
Accuracy 0.7154 0.5830 0.7686 0.7991


5 Conclusion

In terms of multi-class AUC, the resulting performances of the classifiers were quite unexpected. For one, the NB classifier with discriminative parameter learning was expected to perform better than the generative one, as its learning criterion targets the conditional distribution of the class given the features, which is the actual quantity used for classification. Yet, classifier 1 outperforms classifier 2. The difference in terms of accuracy is even larger, as classifier 2 only correctly classifies 58.30 % of the test observations. This could be due to the possibility that the line-search of the conjugate gradient algorithm might have stopped at a local maximum, rather than the global maximum, when maximizing the CLL. It would be highly unfortunate if this were the case, as a well-functioning iterative optimization of the CLL could have had the potential of resulting in something better than the classifiers with generative parameter learning.

But the generative classifiers did result in some quite remarkable multi-class AUC values, especially classifier 1 and classifier 3. The FAN classifier was expected to yield the highest multi-class AUC, yet it was not significantly better than either NB or CL-TAN. In fact, classifier 3, the CL-TAN classifier with generative parameter learning, even performed slightly better than the FAN classifier in terms of multi-class AUC. This was particularly unexpected, as FAN is supposed to be an improvement over CL-TAN. But it is important to take into consideration that the methods were only tested and compared on one data set in this thesis. Jiang et al. (2005) used a wide range of data sets when they compared FAN with CL-TAN and NB, and in a few rare cases, CL-TAN did outperform FAN. In terms of accuracy among the generative classifiers, though, the results were as expected. FAN outperformed NB and CL-TAN, while CL-TAN outperformed NB.

One possible explanation for the remarkable performance of NB in terms of multi-class AUC could be that its assumption is not entirely unrealistic for the data set used in this thesis. Running the CMI algorithm for all pairs of feature variables, given the class variable, generally resulted in low CMI-values, as can be seen in Table 8 in Appendix A. The average CMI over all pairs of feature variables is as low as 0.1016193. In other words, most feature variables seem to be nearly independent of each other, given the class variable.


6 References

Barber, David. 2012. Bayesian Reasoning and Machine Learning. Cambridge: Cambridge University Press.

Chow, C.K. & Liu, C.N. 1968. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory. 14(3): 462-467.

Downes, Tom & Tang, Adelina. 2004. Boosting the Tree Augmented Naive Bayes Classifier. Intelligent Data Engineering and Automated Learning - IDEAL 2004, Proceedings. vol 3177: 708-713.

Evans, David. 2011. Introduction to Computing: Explorations in Language, Logic, and Machines. Scotts Valley, CA: CreateSpace Independent Publishing Platform.

Fawcett, Tom. 2006. Introduction to ROC analysis. Pattern Recognition Letters. 27(8): 861-874.

Francois, Olivier C.H. & Leray, Philippe. 2006. Learning the Tree Augmented Naive Bayes Classifier from incomplete datasets. Third European Workshop on Probabilistic Graphical Models. 91-98.

Friedman, Nir & Geiger, Dan & Goldszmidt, Moises. 1997. Bayesian Network Classifiers. Machine Learning. 29(2-3): 131-163.

Glancy, Fletcher H. & Yadav, Surya B. 2011. A computational model for financial reporting fraud detection. Decision Support Systems. 50(3): 595-601.

Hand, David J. & Till, Robert J. 2001. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning. 45(2): 171-186.

Hahsler, Michael & Buchta, Christian & Gruen, Bettina & Hornik, Kurt. 2019. arules: Mining Association Rules and Frequent Itemsets. R package version 1.6-3. https://CRAN.R-project.org/package=arules

James, Gareth & Witten, Daniela & Hastie, Trevor & Tibshirani, Robert. 2017. An Introduction to Statistical Learning: with Applications in R. New York: Springer.

Jiang, Liangxiao & Zhang, Harry & Cai, Zhihua & Su, Jiang. 2005. Learning Tree Augmented Naive Bayes for Ranking. Database Systems for Advanced Applications. 3453: 688-698.

Khan, Nida S. & Larik, Asma S. & Rajput, Quratulain & Haider, Sajjad. 2013. A Bayesian Approach for Suspicious Financial Activity Reporting. International Journal of Computers and Applications. 35(4): 181-187.

Khanteymoori, Ali Reza & Homayounpour, Mohammad Mehdi & Menhaj, Mohammad Bagher. 2008. A Bayesian Network Based Approach for Data Classification Using Structural Learning. Advances in Computer Science and Engineering. 6: 25-32.

Koski, Timo & Noble, John M. 2009. Bayesian Networks - An Introduction. Chichester, West Sussex: John Wiley & Sons, Inc.

Kuhn, Max. 2019. caret: Classification and Regression Training. R package version 6.0-84. https://CRAN.R-project.org/package=caret

Mihaljevic, Bojan & Bielza, Concha & Larranga, Pedro. 2018. bnclassify: Learning Discrete Bayesian Network Classifiers from Data. R package version 0.4.1. https://CRAN.R-project.org/package=bnclassify

Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems. San Francisco: Morgan Kaufmann Publishers Inc.

Pearl, Judea & Glymour, Madelyn & Jewell, Nicholas P. 2016. Causal Inference in Statistics: A Primer. Chichester, West Sussex: John Wiley & Sons Ltd.

Pernkopf, Franz & Bilmes, Jeff. 2005. Discriminative versus Generative Parameter and Structure Learning of Bayesian Network Classifiers. Proceedings of the 22nd International Conference on Machine Learning. 657-664.

Pernkopf, Franz & Wohlmayr, Michael & Tschiatchek, Sebastian. 2012. Maximum Margin Bayesian Network Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence. 34(3): 521-532.

R Core Team. 2018. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Rish, Irina. 2001. An Empirical Study of the Naïve Bayes Classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence. 3: 41-46.

Robin, Xavier & Turck, Natacha & Hainard, Alexandre & Tiberti, Natalia & Lisacek, Frederique & Sanchez, Jean-Charles & Muller, Markus. 2011. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 12(1): 77. DOI: 10.1186/1471-2105-12-77. http://www.biomedcentral.com/1471-2105/12/77/

Wettig, Hannes & Grünwald, Peter Daniel & Roos, Teemu & Myllymäki, Petri & Tirri, Henry. 2003. When Discriminative Learning of Bayesian Network Parameters Is Easy. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence. 491-496.

Wijayatunga, Priyantha & Mase, Shigeru. 2006. Asymptotic Properties of Maximum Collective Conditional Likelihood Estimators for Naive Bayes Classifiers. International Journal of Statistics and Systems.

Wijayatunga, Priyantha & Mase, Shigeru & Nakamura, Masanori. 2006. Appraisal of companies with Bayesian Networks. International Journal of Business Intelligence and Data Mining. 1(3): 329-346.


7 List of abbreviations

• AUC = Area Under the receiver operating characteristics Curve

• BN = Bayesian Network

• C = Class variable

• CMI = Conditional Mutual Information

• CLE = Conditional Likelihood Estimation

• CLL = Conditional Log Likelihood

• CL-TAN = Chow-Liu Tree Augmented Naive Bayes classifier

• DAG = Directed Acyclic Graph

• D-separation = Directional separation

• FAN = Forest Augmented Naive Bayes classifier

• FPR = False Positive Rate

• LL = Log Likelihood

• LLΘ = Log Likelihood function for learning the parameters Θ.

• MDL = Minimal Description Length

• MLE = Maximum Likelihood Estimation

• NB = Naive Bayes classifier

• ROC = Receiver Operating Characteristics

• TAN = Tree Augmented Naive Bayes classifier

• TPR = True Positive Rate


8 Appendix A

Table 8: CMI between all pairs of features, given the class variable.

CMI Xi Xj

1 0.271235958307267 MathRes Contractors
2 0.0569247268878248 MathRes Alarms
3 0.0622883641902543 MathRes MissingDocs
4 0.0277644247038606 MathRes ReligEnt
5 0.0522224492598726 MathRes Onlus
6 0.294730873604441 MathRes Policies500k

7 0.0596510797095202 MathRes Policies500k_1000k

8 0.151434017760695 MathRes NoSav

9 0.148486105985983 MathRes PremIns

10 0.0612800658692125 Contractors Alarms
11 0.0572864372003765 Contractors MissingDocs
12 0.0211157996782412 Contractors ReligEnt
13 0.0733634907799765 Contractors Onlus
14 0.709496864552543 Contractors Policies500k

15 0.0288952181795292 Contractors Policies500k_1000k
16 0.39493266762415 Contractors NoSav

17 0.28546575883532 Contractors PremIns
18 0.0675189915917944 Alarms MissingDocs
19 0.0113624200242166 Alarms ReligEnt

20 0.0347991922807538 Alarms Onlus

21 0.0636283841306609 Alarms Policies500k

22 0.0250954114075188 Alarms Policies500k_1000k

23 0.0564177640079568 Alarms NoSav

24 0.0523582107943061 Alarms PremIns

25 0.0147521418934955 MissingDocs ReligEnt
26 0.0349550032134391 MissingDocs Onlus
27 0.068235237482966 MissingDocs Policies500k

28 0.020481021851589 MissingDocs Policies500k_1000k
29 0.06216150867727 MissingDocs NoSav

30 0.0589804779991272 MissingDocs PremIns
31 0.0268664999304138 ReligEnt Onlus
32 0.0224899628963697 ReligEnt Policies500k

33 0.00964685236966623 ReligEnt Policies500k_1000k
34 0.0241080668835678 ReligEnt NoSav

35 0.0287989554826333 ReligEnt PremIns
36 0.0687651149610515 Onlus Policies500k

37 0.0224304674728637 Onlus Policies500k_1000k

38 0.0659967126523935 Onlus NoSav

39 0.0780432536248366 Onlus PremIns

40 0.026418910946431 Policies500k Policies500k_1000k
41 0.325271850431448 Policies500k NoSav

42 0.326639722593272 Policies500k PremIns
43 0.0345993142425629 Policies500k_1000k NoSav
