

DEGREE PROJECT IN ENGINEERING PHYSICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Sum-Product Network in the context of missing data

PIERRE CLAVIER

KTH ROYAL INSTITUTE OF TECHNOLOGY


Sum-Product Network in the context of missing data

Pierre Clavier

Degree Projects in Mathematical Statistics (30 ECTS credits)
Degree Programme in Engineering Physics
KTH Royal Institute of Technology, year 2020

Supervisor at Paris Descartes University: Olivier Bouaziz
Supervisor at Sorbonne University: Gregory Nuel
Supervisor at KTH: Anja Janssen
Examiner at KTH: Anja Janssen


TRITA-SCI-GRU 2020:031
MAT-E 2020:009

KTH Royal Institute of Technology
School of Engineering Sciences (KTH SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Sum-Product Networks in the context of missing data

Swedish version (Sammanfattning, translated)

Abstract

In recent years, interest in new Deep Learning methods has increased considerably because of their robustness and their applications in a wide range of fields. However, the lack of theoretical knowledge about these models and their hard-to-interpret nature raise many questions. It is in this context that the Sum-Product Network emerged, a model with a certain ambivalence since it sits between a linear neural network without activation functions and a probabilistic graph. In typical applications with real data, we often encounter incomplete, censored or truncated data, yet the learning of these graphs from such real data is still missing. The aim of this degree project is to study some fundamental properties of Sum-Product Networks and to try to extend their learning and training to incomplete data. Likelihood estimation with the help of EM algorithms will be used to extend the learning of these graphs to incomplete data.


Sum-Product Network in the context of missing data

English version

Abstract

In recent years, the interest in new Deep Learning methods has increased considerably due to their robustness and applications in many fields. However, the lack of interpretability of these models and the lack of theoretical knowledge about them raise many issues. It is in this context that Sum-Product Network (SPN) models have emerged. From a mathematical point of view, SPNs can be described as Directed Acyclic Graphs. In practice, they can be seen as deep mixture models and as a consequence they can be used to represent very rich collections of distributions. The objective of this master thesis was threefold. First, we formalized the concept of SPNs with proper mathematical notation, using the concepts of Directed Acyclic Graphs and Bayesian Network theory. Then we developed a new method for learning the structure of an SPN, based on K-means clustering and mutual information.

Finally, we proposed a new method for the estimation of the parameters of a fixed SPN in the context of incomplete data. Our estimation method is based on maximum likelihood estimation with the EM algorithm.


Acknowledgements

I would like to warmly thank my two tutors and professors, Olivier Bouaziz and Grégory Nuel, for their support and help during my Master thesis. I particularly appreciated the quality of their supervision, which allowed me to discover the world of research in excellent conditions. I would also like to thank my tutor and examiner at KTH, Anja Janssen, who accepted to be my tutor at my double-degree host university and who gave me guidance in the writing of my Master thesis.

Many thanks also to my parents and sister, who welcomed me back home during my Master thesis, and to my Swedish teachers and friends in the Applied and Computational Mathematics Master's programme, who welcomed me to their lovely country, where I benefited from an excellent quality of education and life. Finally, I would like to thank all my colleagues and PhD students at the MAP5 laboratory: Remi, Claire, Anton, Alexandre, Pierre-Louis, and especially Vincent, who gave me helpful advice on the writing of my report, and Warith, who introduced me to my supervisors.


Contents

1 Introduction to Sum-Product Network
  1.1 Introduction to Sum-Product Network
  1.2 SPNs as deep mixtures
  1.3 Induced trees of SPNs as membership of a deep mixture
  1.4 Probability of the evidence of an SPN

2 State of the art for SPN
  2.1 Learning a graph structure using LearnSPN

3 A new algorithm for learning the structure of an SPN
  3.1 Independence test for rows
  3.2 Independence test of columns
  3.3 Implementation

4 Learning parameters with maximum likelihood approach
  4.1 Maximum likelihood estimation using EM algorithm
  4.2 EM for SPN
  4.3 Forward/Backward messages
    4.3.1 Forward pass or forward message
    4.3.2 Backward message or backward pass
  4.4 Updates of EM for SPNs
    4.4.1 Leaf updates
    4.4.2 Sum weight updates
    4.4.3 EM algorithm summary
  4.5 Estimating parameters of SPN with incomplete data
    4.5.1 Incomplete data introduction
    4.5.2 EM with incomplete data in standard context
    4.5.3 Evaluation of weight updates
    4.5.4 Evaluation of Gaussian truncated leaves parameters
  4.6 Simulation of EM algorithm for SPN
    4.6.1 Simulation for complete data
    4.6.2 Simulation for missing data

A Appendix
  A.1 Short recall on graph theory
  A.2 K-Means algorithm
  A.3 Foundation of EM and application to mixture problems
    A.3.1 Framework of EM algorithm and conditions for convergence
    A.3.2 Application of EM for Gaussian mixtures
  A.4 Regression in the context of missing data
  A.5 L2 decomposition in terms of bias and variance


Notations:

• S: an SPN
• Xi: a random variable; xi an element of the image of Xi
• X: the random vector X = (X1, ..., XN)
• val(X): the image of the random vector X, i.e. the set of values assumed by X
• val(XI): the image of the random vector XI
• E: expectation of a random variable
• V: variance of a random variable
• pXi: the probability mass function (PMF) of the discrete random variable Xi, or its probability density function if Xi is continuous
• p: shorthand for a joint probability density function or mixed distribution function
• ϕl(xIl): the probability density of leaf l, evaluated at xIl for the random vector XIl
• S: the joint distribution of all random variables, computed at the root of the SPN
• Sq: the joint distribution rooted at node q, i.e. the forward pass computed at node q
• Fq: the backward pass of the network at node q
• G: a graph composed of a set of vertices V and a set of edges E
• W: the set of sum weights
• ch(q), pa(q): the children and parents of a node q
• wqi: the sum weight shared between a sum node q and its child i
• θ: the set of all leaf parameters; θl the parameters of leaf l
• c: an induced tree
• E(c): the set of all edges connected to a sum node in the induced tree c
• L(c): the set of all leaves included in the induced tree c
• E(S): the set of all edges connected to a sum node in the SPN S
• L(S): the set of all leaves of the SPN S


Introduction

Context of the problem

Over the last decade, the number of new Deep Learning methods has increased considerably. One of the reasons for this emergence is the rise in computational power, which allowed numerically expensive methods to be developed. Deep Learning methods are robust across many problems, in many cases beating state-of-the-art methods. However, research on theoretical proofs of their effectiveness is still in its infancy. The fact that the coefficients learned by neural networks are very difficult to interpret from a mathematical point of view triggered the idea of developing more interpretable networks connected to probabilistic graphs. The Sum-Product Networks (SPNs) introduced by Poon and Domingos (2011) are halfway between neural networks without activation functions and deep probabilistic mixture models.

It is this ambivalence that makes them interesting and that led us to study them from a theoretical point of view. In practice, SPNs have demonstrated their effectiveness in many areas like language processing (Cheng et al., 2014), speech recognition (Peharz et al., 2014) or classification (Gens and Domingos, 2012). The aim of this master thesis is to become familiar with these probabilistic graphs and to use them in the context of incomplete data.

Purpose of the thesis

The Master thesis is divided into four parts. In the first part we start by presenting a general definition of SPNs and we explain how they can be used to compute probabilities from a joint distribution. We also explain why SPNs can be seen as a deep mixture model.

In the second part we review the existing learning methods for SPNs. In the third part we develop a method for learning the graph structure of an SPN. Our method is inspired by the work of Gens and Domingos (2013). The main advantage of our algorithm, as compared to existing algorithms such as SPFlow (Molina et al., 2019), is that it avoids overfitting issues.

Finally, in the last part, we study the problem of interval-censored data and we propose a new estimation method in this context based on SPNs. This is a notable contribution since, to our knowledge, this problem had never been addressed before.


1 Introduction to Sum-Product Network

Before formally defining an SPN, we start by recalling some notions about Directed Acyclic Graphs (DAGs) and trees in general, in order to understand SPNs in depth. We refer the reader to Appendix A.1 or to Francis Bach (2017) for more details on DAGs.

Figure 1: a) directed graph, b) DAG c) tree

In the example of Fig. 1, we consider for all graphs the set V = {1, 2, 3, 4} of vertices and a set E of directed edges. For the graphs a, b and c the sets of directed edges correspond respectively to Ea = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 2)}, Eb = {(1, 2), (1, 3), (3, 4), (2, 4)} and Ec = {(1, 2), (2, 4), (2, 3)}. Graph a has a cycle and is not a DAG. However, b and c are DAGs, and c is a special case of DAG called a tree, as each of its nodes has at most one parent.

Hereafter, only DAGs will be considered.

1.1 Introduction to Sum-Product Network

In this subsection, we construct step by step a SPN. We start by defining the random vector X = (X1, . . . , XN) with ∀i ∈ I = [[1, N ]] = {1, . . . , N }, Xi a real univariate random variable.

The aim of SPNs is to represent the joint probability distribution of X.

According to the original paper of Poon and Domingos (2011), an SPN S can be defined as a DAG G = (V, E), with E ⊂ V × V and V, E representing respectively the set of vertices and the set of edges, where:

1. The DAG has a root node called 0 with no parents.


2. Every node q ∈ V is of one of three types: a leaf (a node without children), a sum node (represented by +) or a product node (represented by ×).

3. Nodes which are neither the root nor leaves are internal nodes and correspond to the algebraic operations sum or product.

All nodes q ∈ V are associated to a function Sq. Each leaf l is associated to a random variable or a random vector XIl, where Il ⊂ I and Il ≠ ∅, and we use the notation XIl to represent the joint variables Xi for i ∈ Il. For example, if Il = {1, 2, 8}, then XIl = (X1, X2, X8).

Then let xIl represent a realization of XIl. We define ϕl(xIl) as a density ϕl associated to the leaf l and evaluated at xIl. For ease of presentation we only consider densities for ϕl in what follows, but all our results still hold true with a discrete probability measure for ϕl. In practice, the ϕl are often parametric functions, in which case they are denoted ϕl(·|θl).

Between every sum node q and each of its children i ∈ ch(q) := {i : (q, i) ∈ E}, there is a sum weight wqi ∈ W, with W the set of all sum weights. For every sum node q and every i ∈ ch(q), we impose that wqi ≥ 0.

Figure 2: A SPN over binary random variables inspired by Poon and Domingos (2011)

In Fig. 2, an SPN is described over the random vector X = (X1, X2, X3), where the random variables are placed in the leaves of the graph, numbered 7, 8, 9, 10, 11, 12. There are family relationships in the graph; for example, 0 is the parent of 1 and 2, so 1, 2 ∈ ch(0).


Moreover, the sum weights shown in Fig. 2 are such that:

w01 = 0.95, w02 = 0.05, w38 = 0.1, w39 = 0.9

Regarding the leaves, we have I7 = {1} and node 7 has the density ϕ7(x1); for node 8, I8 = {2} and the density is ϕ8(x2), etc. It should be noted that a random variable can appear in different leaves, but a priori with different probability measures which do not share the same parameters, for example ϕ8(x2) and ϕ9(x2).

Now we explain how SPNs can be used to compute the joint density of X. We define the function Sq associated to the node q, in the following way:

Sq(x) =
    ϕq(xIq)                       if q is a leaf,
    ∏_{i∈ch(q)} Si(x)             if q is a product node,        (1)
    ∑_{i∈ch(q)} wqi Si(x)         if q is a sum node.

We set S := S0, when q = 0 is the root of the SPN. We can describe this equation in words:

• If q is a leaf, pick the density of the considered random variable

• If q is a product node, its density is the product of densities of its children.

• If q is a sum node, its density is the weighted sum of the densities of its children (see the sketch below).
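Equation (1) translates directly into a short recursion. The following Python sketch is only an illustration: the dict-based node representation, the helper names leaf/product/sum_node and the particular leaf densities are our own choices, not the author's or SPFlow's code. It evaluates the SPN of Fig. 2 at a point x.

# Minimal sketch of the recursion in Equation (1): evaluating an SPN at a point x.
# The node representation and the leaf densities are illustrative choices.
from scipy.stats import norm, bernoulli

def leaf(var, density):
    # density: a callable mapping a value of X_var to a density / pmf value
    return {"type": "leaf", "var": var, "density": density}

def product(*children):
    return {"type": "product", "children": list(children)}

def sum_node(weights, *children):
    return {"type": "sum", "weights": weights, "children": list(children)}

def S(q, x):
    """Evaluate S_q(x) following Equation (1)."""
    if q["type"] == "leaf":
        return q["density"](x[q["var"]])
    if q["type"] == "product":
        value = 1.0
        for child in q["children"]:
            value *= S(child, x)                # product of the children's densities
        return value
    return sum(w * S(child, x)                  # weighted sum at a sum node
               for w, child in zip(q["weights"], q["children"]))

# The SPN of Fig. 2 with the weights of Equation (2). The leaf distributions are
# arbitrary here: N(0,1) for X1, Bernoulli(1/2) for X2 and N(2,1) for X3.
g01, g21, ber = norm(0, 1).pdf, norm(2, 1).pdf, bernoulli(0.5).pmf
spn = sum_node([0.95, 0.05],
    product(leaf(0, g01),                                        # leaf 7
            sum_node([0.9, 0.1], leaf(1, ber), leaf(1, ber)),    # node 3: 0.9*phi9 + 0.1*phi8
            sum_node([0.5, 0.5], leaf(2, g21), leaf(2, g21))),   # node 5: 0.5*phi10 + 0.5*phi11
    product(leaf(0, g01),                                        # leaf 12
            sum_node([0.3, 0.7], leaf(2, g21), leaf(2, g21)),    # node 6: 0.3*phi10 + 0.7*phi11
            sum_node([0.2, 0.8], leaf(1, ber), leaf(1, ber))))   # node 4: 0.2*phi8 + 0.8*phi9

print(S(spn, [0.0, 1, 2.0]))   # joint density evaluated at x = (0, 1, 2)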

With the example of Fig. 2 we can easily compute S, ∀x ∈ val(X):

S(x) = 0.95 ϕ7(x1)(0.9 ϕ9(x2) + 0.1 ϕ8(x2))(0.5 ϕ10(x3) + 0.5 ϕ11(x3))
     + 0.05 ϕ12(x1)(0.3 ϕ10(x3) + 0.7 ϕ11(x3))(0.2 ϕ8(x2) + 0.8 ϕ9(x2))        (2)

We now define the scope of a node q of the SPN, which represents a subset of the variables {X1, . . . , XN}.

Definition 1.1 (Scope of a node of an SPN) Let S be an SPN over the random variables X = (X1, ..., XN) and let q be a node of S. We define:

sc(q) =
    {Xi}                     if q is the leaf containing the random variable Xi
    ∪_{i∈ch(q)} sc(i)        otherwise

For the example in Fig. 2, the scope of the root node is {X1, X2, X3}, but the scope of node 3 is only {X2}.

We now want to investigate if an SPN evaluated at its root or at any internal node is a probability density. For this purpose, we provide three properties of SPNs which are decomposability, completeness and normality.


Definition 1.2 (Decomposability of SPNs) A product node q of an SPN is said to be decomposable iff

∀i, i′ ∈ ch(q), i ≠ i′ ⇒ sc(i) ∩ sc(i′) = ∅

If all product nodes are decomposable, then the SPN is said to be decomposable.

In other words, in a decomposable SPN, no random variable can appear in more than one child of a product node. For example in Fig. 2, the two product nodes 1 and 2 are decomposable. Indeed, ch(1) = {7, 3, 5} and sc(7) = {X1}, sc(3) = {X2}, sc(5) = {X3}. For node 2, ch(2) = {4, 6, 12} and sc(4) = {X2}, sc(6) = {X3}, sc(12) = {X1}.

Definition 1.3 (Completeness of a sum node of an SPN) We say that a sum node q is complete iff:

∀i, i′ ∈ ch(q), sc(i) = sc(i′)

An SPN is complete if all its sum nodes are complete.

The SPN in Fig. 2 is complete, as the children of the sum nodes 3 and 4 all have the same scope {X2}, and the children of the sum nodes 5 and 6 all have the same scope {X3}. Finally, the children of the root 0 have the same scope {X1, X2, X3}.

This property of the SPN has the consequence that, for a sum node, all children must have the same scope. That is why sum nodes split the dataset into clusters while keeping the same random variables in all children's scopes. The last condition, normality, is a condition on the weights and allows us to obtain a normalised density.

Definition 1.4 (Normalised SPN) A sum node q is said to be normalised iff the weights associated to it sum to one:

∑_{i∈ch(q)} wqi = 1

If all sum nodes of an SPN are normalised, the SPN is said to be normalised.

Note that an SPN can be normalised without any difficulty: dividing each weight of a sum node by the sum of all weights of that node gives a normalised sum node, and doing so for every sum node gives a normalised SPN. At this point, we have obtained sufficient conditions on SPNs to enable them to represent a probability density. In what follows we will always consider these conditions to be satisfied. This leads to the following definition of an SPN representing a probability density.

Definition 1.5 (Validity of an SPN) An SPN S over the random vector X is said to be valid iff there exists a normalised probability measure φ : val(X) → [0, 1] such that:

∀x ∈ val(X), φ(x) = S(x)


SPNs are intended to represent a probability density and under the three conditions of decomposability, completeness and normality, Poon and Domingos (2011) have proved that this is the case.

Theorem 1 (Poon and Domingos, 2011) A decomposable, complete and normalised SPN is valid.

This theorem is fundamental as it gives conditions to obtain a valid SPN. Note that the theorem has also been demonstrated under weaker assumptions, where decomposability is replaced by a condition called consistency, but the general idea remains the same. We refer the reader to the proof in the appendix of Poon and Domingos (2011).
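In practice, the three conditions of Theorem 1 can be checked mechanically with one recursion over the graph, computing the scope of Definition 1.1 along the way. The sketch below is again only an illustration and assumes the same dict-based node representation as the evaluation sketch of Section 1.1.

# Sketch: checking decomposability, completeness and normalisation by recursion,
# computing the scope of Definition 1.1 on the way. Nodes are the same dicts as
# in the evaluation sketch above.
def scope(q):
    if q["type"] == "leaf":
        return {q["var"]}
    return set().union(*(scope(child) for child in q["children"]))

def is_decomposable(q):
    if q["type"] == "leaf":
        return True
    ok = all(is_decomposable(child) for child in q["children"])
    if q["type"] == "product":
        scopes = [scope(child) for child in q["children"]]
        # children of a product node must have pairwise disjoint scopes
        ok = ok and sum(len(s) for s in scopes) == len(set().union(*scopes))
    return ok

def is_complete(q):
    if q["type"] == "leaf":
        return True
    ok = all(is_complete(child) for child in q["children"])
    if q["type"] == "sum":
        # children of a sum node must all have the same scope
        ok = ok and len({frozenset(scope(child)) for child in q["children"]}) == 1
    return ok

def is_normalised(q):
    if q["type"] == "leaf":
        return True
    ok = all(is_normalised(child) for child in q["children"])
    if q["type"] == "sum":
        ok = ok and abs(sum(q["weights"]) - 1.0) < 1e-9
    return ok

# For the SPN of Fig. 2 all three checks return True, so by Theorem 1 it is valid.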

Finally, from a given SPN S, we define a sub-DAG as a DAG whose root is an internal node of S and the scope of its root contains a subset of the leaves of S. We denote such a sub-DAG, Sq where q is the internal node of S and the root of Sq. Another important consequence of a complete and decomposable SPN S is that any sub-DAG of S is a SPN.

Now that we have provided a mathematical definition of our SPN, we are going to look at SPNs as a generalisation of simple mixture models.

1.2 SPNs as deep mixtures

In this part, we try to understand the probabilistic meaning of the nodes of an SPN. In Fig. 3, a product node is represented with three children. It can be interpreted as independence between random variables that are not in the same child of the product node. This structural interpretation can be used to learn the structure of an SPN by searching for independence between random variables and splitting them when they are independent. Note that a random variable can only be present in one of the children of a product node, as product nodes need to respect the condition of decomposability. Considering a product node with three univariate random variables X, Y, Z in its scope, we get:

Figure 3: A product node between three random variables

S(x, y, z) = ϕ1(x)ϕ2(y)ϕ3(z)

For a sum node, the notion of mixture is present as the operation + in SPN is in a sense


a mixture model between different classes. In Fig. 4 we show the weighted sum of three density functions with the same scope, composed of the random variable X. This is the classical mixture model. A sum node in an SPN is equivalent to clustering the data while keeping the same random variables in each child node, in order to respect the condition of completeness.

Figure 4: A Sum node with three children

S(x) = w1ϕ1(x) + w2ϕ2(x) + w3ϕ3(x)

As a result, SPNs can be seen as deep mixtures of random variables with many layers.

The term deep comes from the fact that, with many layers of sum nodes, SPNs encode mixtures of mixtures and hence a complex density. Product nodes, in turn, capture independence between random variables. One of the challenges with SPNs is to estimate the parameters θ of the densities ϕl(·|θl) associated to the leaves. This can be performed using the Expectation-Maximization (EM) algorithm (see Dempster et al. (1977)).

1.3 Induced trees of SPNs as membership of a deep mixture

In standard mixture problems, it is of interest to compute probabilities of cluster membership.

In SPNs, the notion of cluster membership is defined by induced trees. They represent admissible paths from the root. Those admissible paths must have all variables in their scopes in order to represent the joint distribution of X. More formally, an induced tree c of a SPN can be constructed recursively as follows:

1. Start at the root node. Include the root of S in c.

2. If q is a product node, include in c all children i ∈ ch(q). Then continue with all children.

3. If q is a sum node, include in c only one child i ∈ ch(q), with associated weight wqi. Then continue with the chosen child only.

4. If q is a leaf, stop.


Figure 5: An SPN with an induced tree in bold, from Desana and Schnörr (2016)

Let N(S) be the set of all sum nodes of the SPN S. In the same fashion, N(c) represents the set of all sum nodes of the induced tree c. Introduce also E(c) = {(q, i) : q ∈ N(c), i the child of q selected in c}, the set of all edges of the induced tree connected to a sum node. We denote by C the number of induced trees of S. Then, it can be shown (see Desana and Schnörr (2016)) that the probability of belonging to the induced tree c (its cluster membership probability), denoted p(c|W), is equal to:

p(c|W) = ∏_{(q,i)∈E(c)} wqi

From this property, we can easily see that ∑_{c=1}^{C} p(c|W) = 1.
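As an illustration, the SPN of Fig. 2 has C = 2 × 2 × 2 = 8 induced trees: one binary choice at the root and one at each of the two sum nodes below the chosen product node. With the weights of Equation (2), the induced tree that selects child 1 at the root, leaf 9 at node 3 and leaf 10 at node 5 has

p(c|W) = w01 w39 w5,10 = 0.95 × 0.9 × 0.5 = 0.4275,

and summing over all eight induced trees gives 0.95(0.9 + 0.1)(0.5 + 0.5) + 0.05(0.3 + 0.7)(0.2 + 0.8) = 1, as expected.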

In the next definition we define the density p(x|c, θ) of X given the induced tree c. We let L(c) be the set of all leaves belonging to the induced tree c. We have:

∀x ∈ val(X), p(x|c, θ) = ∏_{l∈L(c)} ϕl(xIl|θl)

For example, in Fig. 5, we have for the random vector X = (A, B) : p(x|c, θ) = ϕ7(b)ϕ9(a) and p(c|W ) = w21w94

Finally, we can define the joint density of X and of the induced tree membership c, given the leaf parameters θ and the sum weights W:

p(c, x|θ, W) = p(c|W) p(x|c, θ) = ∏_{(q,i)∈E(c)} wqi ∏_{l∈L(c)} ϕl(xIl|θl)

Dennis and Ventura (2015) showed the fundamental property that summing out the induced trees gives the likelihood of the data S(x|θ).


Theorem 2 (Dennis and Ventura, 2015)

S(x|θ) = ∑_{c=1}^{C} p(x, c|θ, W) = ∑_{c=1}^{C} p(c|W) p(x|c, θ).

This property is interesting as it shows that induced trees define latent classes in the same manner as in mixture problems. It also justifies why we will use the EM algorithm to estimate the SPN parameters (see Section 4.2).

1.4 Probability of the evidence of an SPN

We define the evidence as an event {X ∈ 𝒳} = ∩_{i=1}^{N} {Xi ∈ 𝒳i}. The probability of an evidence computed from the SPN S is denoted by PS and is equal to:

PS(X ∈ 𝒳) = PS(X1 ∈ 𝒳1, . . . , XN ∈ 𝒳N) = ∫_{𝒳1} · · · ∫_{𝒳N} S(x) dx = ∫_{𝒳} S(x) dx

This integral can be computed from Equation (1). It is important to stress that the functions ϕl in Equation (1) can be either densities or discrete probability measures. In case some ϕl are discrete and others are densities, this integral should be interpreted with caution. Define X1, . . . , XJ to be J discrete random variables and XJ+1, . . . , XN to be N − J continuous random variables. Then we can compute the evidence {X1 = x1, . . . , XJ = xJ, XJ+1 ∈ 𝒳J+1, . . . , XN ∈ 𝒳N} from the SPN S in the following way:

PS(X1 = x1, . . . , XJ = xJ, XJ+1 ∈ 𝒳J+1, . . . , XN ∈ 𝒳N) = ∫_{𝒳J+1} · · · ∫_{𝒳N} S(x1, . . . , xJ, xJ+1, . . . , xN) dxJ+1 . . . dxN

All those definitions are also valid for a sub-SPN Sq as defined in Section 1.1. From Equation (1) we can easily compute the probability of an evidence recursively. We provide the general definition for any sub-SPN Sq:

PSq(X ∈ 𝒳) =
    ∫_{𝒳Iq} ϕq(xIq) dxIq             if q is a leaf and 𝒳Iq is an interval
    ϕq(xIq)                          if q is a leaf and 𝒳Iq = {xIq}
    ∏_{i∈ch(q)} PSi(X ∈ 𝒳)           if q is a product node
    ∑_{i∈ch(q)} wqi PSi(X ∈ 𝒳)       if q is a sum node


As a consequence, the probability of an evidence for an SPN S can be recursively calculated from all the sub-SPNs of S.

Remarkably, any probability of evidence can be computed in time linear in card(V). It should be noted that for a normalised SPN S, the evidence {X ∈ val(X)} = ∩_{i=1}^{N} {Xi ∈ val(Xi)} has probability equal to 1. In other words, PS(X ∈ val(X1) × · · · × val(XN)) = 1.

We now look again at the example of Fig. 2, where the density of the SPN was given by Equation (2). Consider the random vector X = (X1, X2, X3) and suppose, for example, that ϕ7, ϕ12 ∼ N(0, 1) and ϕ10, ϕ11 ∼ N(2, 1). Moreover, ϕ8 and ϕ9 are Bernoulli probability measures for X2 such that ϕ9(X2 = 1) = 1/2 and ϕ9(X2 = 0) = 1/2, with the same values for ϕ8. If we want to compute the evidence 𝒳 = (𝒳1, 𝒳2, 𝒳3) = ({X1 ≥ 0}, {X2 = 1}, {X3 ≤ 2}), we obtain:

PS(X ∈ 𝒳) = 0.95 PS7(X1 ∈ 𝒳1)(0.9 PS9(X2 ∈ 𝒳2) + 0.1 PS8(X2 ∈ 𝒳2))(0.5 PS10(X3 ∈ 𝒳3) + 0.5 PS11(X3 ∈ 𝒳3))
           + 0.05 PS12(X1 ∈ 𝒳1)(0.3 PS10(X3 ∈ 𝒳3) + 0.7 PS11(X3 ∈ 𝒳3))(0.2 PS8(X2 ∈ 𝒳2) + 0.8 PS9(X2 ∈ 𝒳2))

with PS7(X1 ∈ 𝒳1) and PS12(X1 ∈ 𝒳1) equal to ∫_0^{+∞} ϕ(x1|µ = 0, σ = 1) dx1 = 1/2, with PS10(X3 ∈ 𝒳3) and PS11(X3 ∈ 𝒳3) equal to ∫_{−∞}^{2} ϕ(x3|µ = 2, σ = 1) dx3 = 1/2, and with PS8(X2 ∈ 𝒳2) and PS9(X2 ∈ 𝒳2) replaced by P(X2 = 1) = 1/2.
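This calculation can be reproduced by replacing each leaf with the probability of its evidence set and applying the recursive formula for PSq. Below is a minimal Python sketch (an illustration using scipy, not the author's code):

# Sketch: probability of the evidence {X1 >= 0, X2 = 1, X3 <= 2} for the SPN of
# Fig. 2, with phi7, phi12 ~ N(0, 1), phi10, phi11 ~ N(2, 1) and Bernoulli(1/2)
# leaves for X2. Each leaf contributes the probability of its evidence set.
from scipy.stats import norm

p_x1 = norm(0, 1).sf(0.0)    # P(X1 >= 0) at a N(0, 1) leaf  = 0.5
p_x2 = 0.5                   # P(X2 = 1)  at a Bernoulli(1/2) leaf
p_x3 = norm(2, 1).cdf(2.0)   # P(X3 <= 2) at a N(2, 1) leaf  = 0.5

# The recursion collapses onto Equation (2), with leaf densities replaced by
# leaf evidence probabilities.
evidence = (0.95 * p_x1 * (0.9 * p_x2 + 0.1 * p_x2) * (0.5 * p_x3 + 0.5 * p_x3)
            + 0.05 * p_x1 * (0.3 * p_x3 + 0.7 * p_x3) * (0.2 * p_x2 + 0.8 * p_x2))
print(evidence)              # 0.125 (up to floating point)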

If we want to marginalise a random variable, we replace its evidence set by the entire image of that variable. For example, if in our Gaussian case we want to compute the probability of 𝒳 = ({X1 ≥ 0}, {X3 ≤ 0}), we input into the SPN the evidence {X1 ≥ 0, X2 = 0, X3 ≤ 0} ∪ {X1 ≥ 0, X2 = 1, X3 ≤ 0}.

Computing marginal densities and the partition function is easy in an SPN, and it allows us to do fast and exact inference in time linear in the size of the network, since at each node we only perform a single algebraic operation.

The notable facts on SPNs discussed in this chapter are :

• SPNs are able to represent under some conditions (decomposability and completeness) the joint distribution over random variables.


• The advantage is that computational cost is always linear in the size of the network and allows fast and exact inference.

• The fact that SPNs can be interpreted as deep, hierarchical mixture models is very appealing: sum nodes can be thought of as latent variables, which makes it possible to learn and represent complex probability densities.

• However, their structure is complex because of the constraints of completeness and decomposability.

2 State of the art for SPN

SPNs, introduced by Poon and Domingos (2011), allow fast and exact inference in time linear in the size of the network and are able to deal with large datasets. Computing marginal distributions or the partition function is fast compared to other methods. As probabilistic models, SPNs provide a flexible representation of complex and high-dimensional data. In recent years SPNs have found many applications in various fields, such as:

• Language processing, (Cheng et al., 2014)

• Speech Recognition, (Peharz et al., 2014)

• Image completion (Poon and Domingos, 2011; van de Wolfshaar and Pronobis, 2019)

• Classification (Gens and Domingos, 2012)

• Regression with Gaussian processes (Trapp et al., 2018)

Even though SPNs can also be represented as deep mixtures (Peharz et al., 2016), they have received only moderate attention in the Deep Learning community. In order to infer an SPN from a dataset, two tasks must be performed: learning the graph of the SPN, and learning the parameters of the leaves and the weights. The former task is the most difficult due to the special structural requirements (conditions of decomposability and completeness) of SPNs.

This is in contrast with neural networks, where learning the parameters is straightforward but the results are less interpretable (Gens and Domingos, 2012; Vergari et al., 2015). Nevertheless, some applications in Deep Learning and Auto-Encoders have been published recently (Tan and Peharz, 2019; Peharz et al., 2019).

Similar graphical methods such as Bayesian Networks (BNs) have received a great deal of attention in recent years because of their ability to represent complex data. However, they are more computationally intensive. In particular, learning and computing marginals in BNs is most of the time intractable or computationally expensive, although there exist some exceptions for low-treewidth models (Wainwright and Jordan, 2008).

Theoretical properties of SPNs were first studied by Delalleau and Bengio (2011), who analysed the representational power of SPNs with respect to their deep or shallow architectures.

It has been proved that deep SPNs (with more than two internal layers in their structure) perform better at representing complex distributions than shallow SPNs (with only one internal node layer). Martens and Medabalimi (2014) showed that the expressiveness of SPNs is correlated with their depth. A more recent theoretical study (Peharz et al., 2015) demonstrated that the less strict constraint on product nodes called consistency does not lead to more compact networks than the stronger condition of decomposability.

One of the most complex problems with SPNs is learning the SPN from data. There are two separate tasks: learning the graph G of the SPN, and learning the parameters θ of the leaves and the sum weights W.

The learning of the graph G has been studied extensively, and the first algorithm proposed was LearnSPN (Gens and Domingos, 2013). A number of algorithms try to simplify or strengthen LearnSPN, for example Vergari et al. (2015). Moreover, some techniques have been developed to avoid having to choose the nature of the leaves (Molina et al., 2018), estimating a leaf without any parametric model; these are called mixed SPNs. Other greedy algorithms try to find a structure with SVD clustering (Adel et al., 2015), or reduce the problem of choosing the threshold in independence tests, as in Prometheus (Jaini et al., 2018). All these algorithms use heuristic structure exploration; they do not optimise a global objective and are based on a constraint approach to the SPN structure. The main idea is to look for a structure that represents the conditional dependencies between random variables found in the data.

In probabilistic graphical models there are other techniques based on a scoring approach, but they are still not well developed in the SPN community. Friedman and Koller (2003) tried to find a structure for a Bayesian Network model by Bayesian methods with MCMC. Other scoring functions exist to find a relevant structure, such as the minimum description length (MDL), which leads to a penalised likelihood criterion. With this method, the aim is to find a structure that maximises the likelihood of the data given the structure, with a penalisation term proportional to the number of parameters to estimate in the graph.

As the problem of jointly finding a relevant graph and its parameters is a very hard and open problem, models based on estimating the parameters of the SPN once the graph is fixed have been developed. Most of them are based on maximum likelihood estimation. As it is very difficult to maximise the likelihood of an SPN directly, algorithms based on gradient optimisation (Peharz et al., 2019; Butz et al., 2019) or Expectation-Maximization (Poon and Domingos, 2011; Desana and Schnörr, 2016; Zhao et al., 2016) have been created. EM algorithms are particularly interesting and take advantage of the latent interpretation of the sum nodes in an SPN. However, these iterative algorithms are not guaranteed to reach a global maximum, and for complex structures a local maximum is often reached. Instead of maximising the likelihood, Rashwan et al. (2016) have therefore proposed Bayesian moment matching (BMM). These techniques lend themselves to online learning but still need a prefixed structure to operate on.

Apart from Trapp et al. (2019), which takes missing data into account and is able to learn the graph and the parameters within a Bayesian learning framework, no maximum likelihood method takes incomplete data into account. Bayesian methods require a lot of computational time and are very complex to implement. After reviewing this state of the art, the goal was to create an algorithm that could take incomplete data into account in SPNs. As EM is an algorithm well adapted to real data with incomplete values, or values censored in an interval for example, we will try to extend it to deal with incomplete data in SPNs.

2.1 Learning a graph structure using LearnSPN.

The learning of an SPN can be subdivided into several sub-tasks: learning the graph G of the SPN together with the nature of its leaves, learning the parameters θ of its leaves, and learning the sum weights W. The two approaches that can be considered to learn a probabilistic graphical model are:

• Learning based on a score. After definition of a score function, the aim is to maximize it with respect to the data. For example a suitable score function could be a BIC function.

• Learning based on constraints. In this approach, the goal is to find the best structure that respects and represents dependencies between variables in the data. For example in an SPN, we will perform independence tests to split columns for a product node or clustering tasks to find mixtures in a sum node.

We will focus on greedy algorithms, which are better known and less computationally expensive than the Bayesian approach for graph learning tasks. The Bayesian approach can explore more graph configurations than greedy approaches, but its complexity is much greater. We will therefore implement a greedy learning version. A recent package called SPFlow (Molina et al., 2019) has been implemented in Python and we will use their code to create our own algorithm.

The first algorithm that was proposed to learn an SPN is LearnSPN (Gens and Domingos, 2013). The main idea behind this algorithm is to grow a tree from top to bottom that recursively, and in a greedy manner, partitions the data matrix into sub-matrices which have common characteristics. At the end, we fit a distribution to every sub-matrix. For example, if we decide that a given sub-matrix follows a normal distribution, we compute its empirical mean and variance, which specify the distribution.

In a product node, variables that are not in the same edge are considered independent. So we try to partition the columns by separating features that are as independent as possible. Performing an independence test, such as an F- or G-test (Gens and Domingos, 2013), can accomplish this task. However, these techniques use a threshold as a hyperparameter to decide whether or not the columns are independent.

In a sum node, the split is equivalent to finding clusters that represent mixture components in the rows of the data matrix. It can therefore be performed by K-Means, a GMM or any other algorithm that aggregates rows.

During learning, the weights of a sum node are chosen proportional to the number of instances that fall into the created clusters. Finally, when there are not enough instances or features in the sub-matrix, we fit a distribution as a leaf that is supposed to represent the final samples.

Figure 6: Partitioning of data matrix with a SPN

Figure 7: An SPN partitioning the data set

In the figures above, we have arbitrarily represented different splits of the data set according to the SPN nodes. At the end we obtain sub-matrices which are supposed to be more representative of a single distribution than the entire data set. The following algorithm sums up this explanation. In LearnSPN, splitFeatures is the function that splits the columns and clusterInstance is the one that splits the rows of the data set. The function UnivariateDistribution fits a chosen distribution to the sub-matrix if we have decided not to split the dataset further. It is relevant to notice that in this algorithm there is no strict alternation of sum and product nodes, although there are variations of the algorithm where we alternate between the two. Denoting by |T| the number of rows and by |Ti| the number of rows in cluster i, the sum weights in this algorithm are equal to |Ti|/|T|, the proportion of samples in cluster i.

Algorithm 1: LearnSPN
Input: a set T of rows and a set V of feature columns; m, the minimum number of instances required to split the dataset
Output: an SPN S encoding a probability distribution learned from T

if |V| == 1 then
    S ← UnivariateDistribution(T, V)
else
    {V_j}_{j=1}^{C} ← splitFeatures(V, T)
    if C > 1 then
        S ← ∏_{j=1}^{C} LearnSPN(T, V_j, m)
    else
        {T_i}_{i=1}^{R} ← clusterInstance(T, V)
        S ← ∑_{i=1}^{R} (|T_i| / |T|) · LearnSPN(T_i, V, m)
return S

In the implementation of our greedy algorithm for graph learning, we will be careful not to take more than two clusters for both column and row splits, as recommended in Vergari et al. (2015). Moreover, as there is no criterion to choose between a sum and a product node, we will alternate between sum and product nodes. But which functions should be used for the row and column splits? We investigate this question in the next section and create our own version of LearnSPN, since in many cases the trees obtained by existing packages overfit the data.

3 A new algorithm for learning the structure of an SPN

In this section we present our algorithm designed to learn the structure of an SPN. We observe n replications (Xj,1, . . . , Xj,N) for j = 1, . . . , n. We start by sorting our observations into a matrix of size n × N, where the jth line corresponds to (Xj,1, . . . , Xj,N). The idea of the algorithm is to alternately split the lines and the columns of this data matrix. Each split along the lines will correspond to a + node and each split along the columns will correspond to a × node.


3.1 Independence test for rows

In order to split the data set into subsets, we need to find adequate clusters that will lead to relevant SPNs. A non-parametric method is preferable as we do not assume the nature of the distributions. This is why we will use a K-means type of algorithm to split the data with respect to the rows. K-means is the most common algorithm in the literature due to its simplicity and its interpretability. It is also the method implemented in most of the algorithms based on LearnSPN (Vergari et al., 2015; Molina et al., 2018).

We will use the K-Means++ algorithm (see Arthur and Vassilvitskii (2007)), which replaces the purely random initialization of standard K-Means with cleverly chosen initialization points; this initialization guarantees an expected clustering cost within a factor O(log K) of the optimum. We refer the reader to Appendix A.2 for more details on the K-Means++ algorithm.
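In practice this row split takes only a few lines with scikit-learn, whose KMeans uses the k-means++ initialization by default. The sketch below is illustrative (the data matrix is synthetic, not taken from the thesis); it also computes the sum weights as cluster proportions, as in LearnSPN.

# Sketch: split the rows of a data (sub)matrix into two clusters with k-means++.
# In a LearnSPN-type algorithm this submatrix is the block of instances
# reaching the current sum node.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, size=(100, 4)),
                  rng.normal(3.0, 1.0, size=(100, 4))])   # two groups of rows

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(data)                  # cluster label of each row

rows_0, rows_1 = data[labels == 0], data[labels == 1]
weights = [len(rows_0) / len(data), len(rows_1) / len(data)]   # sum-node weights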

3.2 Independence test of columns

There are several techniques to test independence between variables, such as the F-test based on a covariance matrix, or the G-test used in Vergari et al. (2015). However, these techniques need an arbitrary threshold and, as a result, their performance depends on the choice of this threshold. As a consequence, we recommend the use of non-parametric methods which do not require any threshold and can capture complex dependencies between variables. A natural idea is to use the mutual information between two distributions, defined as follows.

Given a joint distribution over two random variables, pX,Y(x, y) = p(X = x, Y = y), and its marginals pX(x) = ∑_y pX,Y(x, y) and pY(y) = ∑_x pX,Y(x, y), we define the mutual information between X and Y as:

I(X, Y) = ∑_{x,y} pX,Y(x, y) log [ pX,Y(x, y) / (pX(x) pY(y)) ] = D(pX,Y ‖ pX pY)

D(pX,Y ‖ pX pY) is the Kullback-Leibler divergence between the joint distribution pX,Y and the product of the marginals pX pY, where the Kullback-Leibler divergence between two distributions p and q is defined as:

D(p ‖ q) = ∑_x p(x) log [ p(x) / q(x) ]

The interesting property of the Kullback-Leibler divergence is that D(p ‖ q) ≥ 0, with equality if and only if p = q. This leads to the following property:

I(X, Y ) = 0 ⇔ X ⊥⊥ Y


So, finding random variables with low mutual information helps us find independent random variables. We will now investigate whether mutual information captures the dependence between random variables better than standard tests based on the correlation matrix, such as the F-test.

As an example, we compare an F-test using a covariance matrix with the method from the package NPEET (Ver Steeg, 2014), which efficiently computes mutual information via K nearest neighbours. We simulate three independent random variables (X1, X2, X3) following the uniform distribution U[0, 1] and we set:

y = X1 + sin(5πX2) + 0.3ε

with ε ∼ N(0, 1). In Figure 8 we present the result for one sample of size 1 000. The given values for the F-tests and the mutual information (MI) are all normalized, that is, they are divided by the maximum value over the three tests. We clearly see that the F-test on the covariance matrix captures the linear dependence on X1, whereas it is unable to capture the dependence through the sine function. Mutual information captures both the linear and the sinusoidal dependence. Finally, neither test detects any association between y and X3, which are indeed independent. Note that, even though MI takes values between 0 and 1 in theory, in practice negative values can occur due to numerical approximations.
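The comparison of Figure 8 can be reproduced with standard estimators; the sketch below uses scikit-learn's f_regression and mutual_info_regression (a k-nearest-neighbour MI estimator in the same spirit as NPEET) rather than the exact code used for the figure.

# Sketch: compare an F-test with a kNN mutual-information estimator on
# y = X1 + sin(5*pi*X2) + 0.3*eps, with X1, X2, X3 ~ U[0, 1] independent.
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 3))                  # X1, X2, X3 ~ U[0, 1]
y = X[:, 0] + np.sin(5 * np.pi * X[:, 1]) + 0.3 * rng.normal(size=1000)

f_stat, _ = f_regression(X, y)                 # linear-dependence test statistic
mi = mutual_info_regression(X, y, random_state=0)

print("normalized F :", f_stat / f_stat.max())
print("normalized MI:", mi / mi.max())
# As described for Figure 8: the F statistic is large only for X1 (linear effect),
# MI is large for both X1 and X2 (sinusoidal effect), and both are small for X3.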

In our method to split columns, we use MI as a pseudo-distance between random variables. We start by computing the matrix of MI distances for each pair of columns. From this matrix, our goal is to find the best partition into two subsets of variables such that variables in different subsets share little mutual information. In order to do so, a natural idea is to use agglomerative clustering (also called hierarchical clustering) with the distance induced by MI. A computationally efficient implementation of this algorithm exists in the Python package scikit-learn.

Figure 8: Mutual Information vs. F-test of independence


The principle of this algorithm is the following: initially, each feature in the scope of the product node forms its own class. The number of classes is then reduced iteratively: at each step, two classes are merged, reducing the number of classes by one. At the end of the algorithm, we obtain the desired number of clusters, which is defined as a hyperparameter.

In order to construct the hierarchical clustering, we need to define a criterion, based on our distance 1 − MI, to cluster the columns of the data matrix step by step. The three most popular linkage criteria in hierarchical clustering are:

1. Complete criterion: max{d(a, b) : a ∈ A, b ∈ B}

2. Single criterion: min{d(a, b) : a ∈ A, b ∈ B}

3. Average criterion: (1/(|A||B|)) ∑_{a∈A} ∑_{b∈B} d(a, b)

where A and B are two clusters of columns (each column being viewed as a vector of R^n) and d(a, b) represents the distance 1 − MI between a and b. As explained before, MI takes values between 0 and 1, with 0 representing independence and 1 representing the strongest possible association.

After several simulation experiments and implementations on real datasets, we decided to use the single criterion with the distance 1 − MI. There are several drawbacks with this technique, as it only allows us to compare random variables two by two (and not three by three or more). Moreover, the time complexity of this task is at most O(n³), and in the general case O(n² log n).
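A column split of this kind can be sketched as follows with scikit-learn's agglomerative clustering on a precomputed 1 − MI matrix; the MI estimator (mutual_info_regression instead of NPEET) and the toy data are illustrative choices, not the code used in the thesis.

# Sketch of the column split: single-linkage agglomerative clustering on the
# pseudo-distance 1 - MI between columns, with MI estimates clipped to [0, 1].
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_selection import mutual_info_regression

def column_split(data):
    n_cols = data.shape[1]
    dist = np.zeros((n_cols, n_cols))
    for j in range(n_cols):
        mi = mutual_info_regression(data, data[:, j], random_state=0)
        dist[j] = 1.0 - np.clip(mi, 0.0, 1.0)       # clip estimates to [0, 1]
    np.fill_diagonal(dist, 0.0)
    dist = 0.5 * (dist + dist.T)                    # symmetrize the estimated matrix
    clustering = AgglomerativeClustering(n_clusters=2, linkage="single",
                                         metric="precomputed")  # 'affinity=' on older scikit-learn
    return clustering.fit_predict(dist)             # group label of each column

rng = np.random.default_rng(0)
a, b = rng.normal(size=1000), rng.normal(size=1000)
data = np.column_stack([a, a + 0.1 * rng.normal(size=1000), b])  # columns 0, 1 dependent
print(column_split(data))   # e.g. [0 0 1]: the independent column is split off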

3.3 Implementation

The idea is to use the work of Molina et al. (2019) and adapt it to obtain our own algorithm. This package implements a structure-learning procedure for SPNs. However, it leads to shallow graphs that are extremely wide and unable to capture complex probability distributions from the data. We instead aim for simple graphs with always two splits per node. Once a simple structure is fixed, we refine the parameters with maximum likelihood optimisation methods such as EM. The aim is not to overfit when learning the graph, but to obtain approximate initial points for the weights and leaf parameters: if we start too far from the exact parameters, it is difficult to find good estimates, as non-convex optimisation algorithms most of the time find only local maxima. In Fig. 9 and Fig. 10, we show two SPNs constructed for four normal random variables (V0, V1, V2, V3), with our algorithm on the left and with the package SPFlow on the right.
