DEGREE PROJECT IN MATHEMATICAL STATISTICS, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Graphical lasso for covariance structure learning in the high dimensional setting

VIKTOR FRANSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY


Graphical lasso for covariance structure learning in the high dimensional setting

VIKTOR FRANSSON

Master’s Thesis in Mathematical Statistics (30 ECTS credits)
Master Programme in Mathematics (120 credits)
Royal Institute of Technology, year 2015

Supervisor at KTH: Tatjana Pavlenko
Examiner: Tatjana Pavlenko

TRITA-MAT-E 2015:76
ISRN-KTH/MAT/E--15/76-SE

Royal Institute of Technology
SCI School of Engineering Sciences
KTH SCI, SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

This thesis considers the estimation of undirected Gaussian graphical models, especially in the high-dimensional setting and in situations where the true observations are assumed to be non-Gaussian.

The first aim is to present and compare the performance of existing Gaussian graphical model estimation methods. Furthermore, since these models rely heavily on the normality assumption, various methods for relaxing the normality assumption are presented. In addition to the existing methods, a modified version of the joint graphical lasso is introduced which builds on the strengths of the community Bayes method. The community Bayes method is used to partition the features (or variables) of a dataset consisting of several classes into several communities that are estimated to be mutually independent within each class, which allows the computations of the joint graphical lasso method to be split into several smaller parts. The method is also inspired by the cluster graphical lasso and is applicable to both Gaussian and non-Gaussian data, provided that the normality assumption is relaxed.

Results show that the introduced cluster joint graphical lasso method outperforms competing methods, producing graphical models which are easier to interpret thanks to the added information obtained from the clustering step of the method. The cluster joint graphical lasso is applied to a real dataset consisting of p = 12582 features, which resulted in a computational gain of a factor 35 compared to the competing method; this is very significant when analysing large datasets. The method also allows for parallelization, where computations can be spread across several computers, greatly increasing the computational efficiency.


Sammanfattning

Graphical lasso for covariance structure learning in the high-dimensional setting

This report treats the estimation of undirected Gaussian graphical models, especially in the high-dimensional setting where the true observations are assumed to be non-Gaussian.

The first aim is to present and compare the performance of existing methods for estimating Gaussian graphical models. Since the models depend strongly on the normality assumption, several methods for relaxing that assumption are presented. In addition to the existing methods, a modified version of the joint graphical lasso is introduced which builds on the strengths of the community Bayes method. The community Bayes method is used to partition the variables of datasets consisting of several classes into several communities which are assumed to be mutually independent within each class. This means that the computations of the joint graphical lasso can be split into several smaller problems.

The method is also inspired by the cluster graphical lasso and is applicable to both Gaussian and non-Gaussian data, provided that the normality assumption is relaxed.

The results show that the introduced cluster joint graphical lasso method outperforms competing methods, producing graphical models that are easier to understand thanks to the extra information obtained from the clustering step of the method. The joint graphical lasso is also applied to a real dataset consisting of p = 12582 variables, which resulted in a reduction of the computation time by a factor 35 compared with competing methods. This is very significant when analysing large datasets. The method also allows for parallelization, where computations can be spread over several computers, which further greatly increases the computational efficiency.


Acknowledgements

I would like to thank my supervisor at KTH, Tatjana Pavlenko, for the opportunity to work on this project, for all the great and helpful discussions, for the great introduction to the multivariate statistical analysis field and for providing necessary information which has inspired me. The subject was new for me and it is truly an interesting area in many aspects.

Finally, I am most grateful for the support from my family during my entire studies.

Stockholm, November 5, 2015 Viktor Fransson


Contents

1 Introduction
1.1 Background
1.2 Purpose
1.3 Outline

2 Mathematical Background
2.1 Graphical Lasso
2.2 Hierarchical Clustering
2.2.1 Choice of Dissimilarity measure
2.2.2 Single Linkage
2.2.3 Complete Linkage
2.2.4 Average Linkage
2.3 Covariance Thresholding
2.4 Cluster Graphical Lasso
2.4.1 Background
2.4.2 Graphical lasso and single linkage clustering
2.4.3 Algorithm
2.5 Joint Graphical Lasso
2.5.1 Background
2.5.2 Algorithm
2.6 Nonparanormal
2.6.1 Background
2.6.2 Estimation
2.7 Nonparanormal SKEPTIC
2.7.1 Background
2.7.2 Estimation
2.8 Community Bayes

3 Theory and method
3.1 Cluster Joint Graphical Lasso
3.2 Tuning parameter selection

4 Simulation study
4.1 Generating a three-class dataset
4.2 Limitations
4.3 Notations
4.4 Evaluating performance
4.5 Convincing example
4.6 Performance as a function of tuning parameters
4.6.1 Summary of results
4.6.2 Noise level 0%
4.6.3 Noise level 2%
4.6.4 Noise level 5%
4.6.5 Noise level 10%
4.7 Timing comparison
4.8 Performance of the normal relaxation methods

5 Application to real datasets
5.1 Financial dataset (S&P 500)
5.2 Leukemia dataset

6 Results and discussion

7 Appendix
7.1 Topology of networks

1 Introduction

This section gives a short background to the main subject of this thesis, intended to give the reader a brief understanding of the topic. The topics mentioned here are presented in more detail later on.

1.1 Background

Consider a data matrix X_{n×p} consisting of n observations of p features (or variables) where p > n. This is a high-dimensional setting, and in recent years there has been much interest in estimating undirected graphical models G = (V, E) on the basis of such a data matrix. In a Gaussian graphical model, each variable is represented by a node and each nonzero off-diagonal element of the inverse covariance matrix is represented by an edge between the corresponding pair of nodes (the nodes are assumed not to be connected to themselves). Here V = {1, ..., p} denotes the nodes (or vertices) corresponding to the p variables (or features) in X, and the edge set E describes the conditional dependence between X_i and X_{i'}, i ≠ i'. Graphical models are of particular interest in the analysis of gene expression datasets, since genes are believed to operate in pathways or networks, and this can be visualized by estimating a Gaussian graphical model. Examples illustrating such networks are presented as an application in this thesis.

Suppose the observations x_1, ..., x_n ∈ R^p are independent and identically distributed N(µ, Σ), where µ ∈ R^p and Σ is a positive definite p × p covariance matrix. One normally wants to estimate a sparse inverse covariance matrix (i.e. one where many of the elements are exactly equal to zero), since zeros in the inverse covariance matrix Σ^{-1} correspond to pairs of variables that are conditionally independent given all of the other variables in the dataset. More precisely, Σ^{-1}_{ii'} = 0 for some i ≠ i' if and only if the ith and i'th features are conditionally independent given the other variables.

A natural way to estimate the inverse covariance matrix (or precision matrix) Σ−1 is with maximum likelihood. Under a Gaussian model, this approach involves maximizing

$$\log\det\Sigma^{-1} - \operatorname{tr}\left(S\Sigma^{-1}\right). \qquad (1.1)$$

Here S denotes the empirical covariance matrix of X. Maximizing (1.1) with respect to Σ^{-1} leads to the maximum likelihood estimate S^{-1}, which is usually denoted Θ̂, where Θ = Σ^{-1} denotes the true precision matrix (i.e. the precision matrix is equivalent to the inverse covariance matrix). Generally S^{-1} will not contain any element exactly equal to zero, which makes the estimate hard to interpret; one is more interested in the zeros, which correspond to pairs of variables that are estimated to be conditionally independent. Another disadvantage of (1.1) appears when p > n: then S is singular, so S^{-1} and hence the maximum likelihood estimate cannot be computed. As a result, this maximum likelihood approach is not suited for high-dimensional problems. Therefore, in recent years, other methods for estimating Σ^{-1} in the high-dimensional setting have been proposed, the most popular being the graphical lasso proposed by Yuan and Lin (2007). They proposed that one should instead maximize the penalized log-likelihood defined as

$$\log\det\Theta - \operatorname{tr}(S\Theta) - \lambda\|\Theta\|_1 \qquad (1.2)$$

with respect to Θ, penalizing only the off-diagonal elements of Θ. The use of the l1 (or lasso) penalty with tuning parameter λ has the effect that the solution Θ̂ is positive definite for all λ > 0, even if S is singular (which is the case when p > n). The only difference between (1.1) and (1.2) is the added penalty term −λ||Θ||_1. The solution to (1.2) will be sparse (due to the lasso penalty), with some elements exactly equal to zero. These elements correspond to pairs of variables that are estimated to be conditionally independent.

Sparsity increases with the penalty parameter λ until, eventually, all nodes are unconnected. Recent improvements regarding the identification of block structure prior to running the graphical lasso algorithm have led to drastic speed improvements for very high-dimensional problems. The improvement uses the fact that the connected components of the solution can be calculated beforehand, so the computations can be split into several smaller computations.

Calculating the connected components of the solution beforehand also allows the calculations to be parallelized (or spread) over several computers, making it possible to solve otherwise infeasible large-scale problems. More about this can be read in Section 2.1.

Recent studies have also found a connection between hierarchical clustering and the graphical lasso method, more precisely a connection regarding the connected components. The connected components of the graphical lasso solution are identical to the clusters obtained by performing single linkage clustering based on a similarity matrix defined as the absolute value of the scaled empirical covariance matrix. Single linkage hierarchical clustering is known to perform poorly due to trailing effects, and the cluster graphical lasso method exploits this fact by looking beyond single linkage clustering. The method was published by Tan et al. (2013) and involves clustering the features using an alternative to single linkage clustering, and then performing the graphical lasso on each of the estimated clusters. This leads to greatly improved results in terms of computational time and interpretability. Furthermore, the clusters obtained by performing single linkage clustering based on this similarity matrix are also identical to the connected components of the thresholded covariance matrix. More about this in Sections 2.2, 2.3 and 2.4.


However, there are some drawbacks with the graphical lasso method when analysing datasets where the observations belong to several distinct but related classes. In that case one would expect the graphical models to be similar but not identical across the classes. A good example is the analysis of gene expression data where samples are taken from both healthy and sick tissue. If the standard graphical lasso is used on such data, one assumes that all observations are drawn from the same distribution, which normally is not the case, and one also ignores the similarities between the classes. Instead one wants to estimate one graphical model for healthy and one for sick tissue. A recent paper has investigated this issue and proposed the joint graphical lasso method for estimating multiple graphical models corresponding to distinct but related classes. This method is an extension of the graphical lasso to the case of multiple datasets and is based on a penalized log-likelihood approach. For specific values of the tuning parameters, the joint graphical lasso and graphical lasso methods produce the same results. More about the joint graphical lasso can be read in Section 2.5. A modified and computationally cheaper version of the joint graphical lasso will be presented, which builds on the strengths of the community Bayes algorithm, an algorithm that partitions the features of a dataset consisting of several classes into multiple conditionally independent communities (or clusters). This results in smaller tractable problems, with similarities to the cluster graphical lasso, but is applicable only to datasets consisting of more than one class. This modified version of the joint graphical lasso method leads to greatly improved results and reduced computation time. The method is presented in Section 3 and the community Bayes method is presented in Section 2.8.

Nearly all methods for estimating graphical models rely heavily on the assumption of normality, where the observations are assumed to be Gaussian distributed. For distributions with random noise or heavy tails, and for real datasets, one can generally not assume this to be the case. Therefore the nonparanormal and the nonparanormal SKEPTIC have been proposed for relaxing the Gaussian assumption. The first method is based on a semiparametric Gaussian copula and the second method is based on the nonparametric rank-based statistics Spearman's rho and Kendall's tau. Results show that nonparanormal graphical models can be used as a replacement for Gaussian graphical models, even when the true distribution is Gaussian. More about this in Sections 2.6 and 2.7.

1.2 Purpose

The purpose of this thesis is to present the existing methods for estimating Gaussian graphical models mentioned above and to explore and compare these methods using both simulations and applications to real datasets. A modified version of the joint graphical lasso with similarities to the cluster graphical lasso method will be presented. In the comparisons, different hierarchical clustering methods will be used: single linkage clustering, complete linkage clustering and average linkage clustering (these clustering methods are used in the cluster graphical lasso and the cluster joint graphical lasso). Two real datasets will be considered, the first being a one-class financial dataset consisting of closing prices of the S&P 500 index and the second a three-class, high-dimensional gene expression dataset with observations from three kinds of leukemia. The financial dataset is meant to illustrate how these methods can be used in the world of finance. When analysing one-class datasets the cluster graphical lasso is expected to outperform the standard graphical lasso method, and when analysing the three-class dataset the cluster joint graphical lasso methods are expected to outperform the standard joint graphical lasso methods. One can also expect the graphical lasso and cluster graphical lasso methods to perform poorly when applied to a multi-class dataset.

1.3 Outline

The rest of the thesis is organized as follows. In Section 2 we present all methods mentioned briefly in the introduction which are used in the analysis. In Section 3 we present the modified version of the joint graphical lasso, which combines two of the methods described in Section 2. In Section 4 we present results on simulated data, mainly in the form of ROC curves, which are used to assess the performance of the methods (the true and estimated networks are compared). In Section 5 we apply the methods to the two real datasets mentioned in the previous section, which are freely available for download. In Section 6 we present the main results and conclude with a discussion.

2 Mathematical Background

This section will present the mathematical background and theory from the literature study that is to be used during the analysis. For more information the reader is referred to the papers related to the methods.

2.1 Graphical Lasso

This section presents the graphical lasso method, which uses a block coordinate descent procedure.

Given a data matrix X_{n×p} consisting of n observations of p features (or variables) following a multivariate Gaussian distribution with mean µ and covariance Σ, one is often interested in estimating a sparse undirected graphical model through the use of an l1 (or lasso) penalty. We define the precision matrix as Θ = Σ^{-1} and let S denote the empirical covariance matrix of X_{n×p}, defined as S = (1/n)X^T X. The graphical lasso problem is to maximize the penalized log-likelihood defined as

$$\log\det\Theta - \operatorname{tr}(S\Theta) - \lambda\|\Theta\|_1 \qquad (2.1)$$

over positive definite matrices Θ. Here λ is a nonnegative penalizing (or tuning) parameter applied only to the off-diagonal elements of Θ, ||Θ||_1 denotes the sum of the absolute values of the elements of Θ, and tr denotes the trace, defined as

$$\operatorname{tr}(A) = a_{11} + a_{22} + \dots + a_{nn} = \sum_{i=1}^{n} a_{ii}. \qquad (2.2)$$

Banerjee et al. (2007) show that problem (2.1) is convex and consider the estimation of Σ. They let W be the estimate of Σ and show that the problem can be solved by optimizing over each row and corresponding column of W using block coordinate descent.

The gradient (optimality conditions) for maximizing (2.1) can be written as

$$\Theta^{-1} - S - \lambda\Gamma(\Theta) = 0, \qquad (2.3)$$

and since W = Σ = Θ^{-1}, this is equivalent to

$$W - S - \lambda\Gamma(\Theta) = 0. \qquad (2.4)$$

Here we used the fact that the derivative of log det Θ is Θ^{-1}. Furthermore, Γ_{ii'} ∈ sign(Θ_{ii'}) is the sub-gradient of |Θ_{ii'}|, applied componentwise to the elements of Θ.

The graphical lasso uses a block coordinate method for solving (2.4), and in what follows W, S, Θ and Γ are partitioned as

$$
W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}, \quad
S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^T & s_{22} \end{pmatrix}, \quad
\Theta = \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^T & \theta_{22} \end{pmatrix}, \quad
\Gamma = \begin{pmatrix} \Gamma_{11} & \gamma_{12} \\ \gamma_{12}^T & \gamma_{22} \end{pmatrix}, \qquad (2.5)
$$

where the matrices have been partitioned into two parts, part one being the first p − 1 rows and columns and part two being the pth row and column. Using the relationship WΘ = I we have

$$
\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}
\begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^T & \theta_{22} \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0^T & 1 \end{pmatrix}. \qquad (2.6)
$$


Considering the pth column of (2.4), we get

$$w_{12} - s_{12} - \lambda\Gamma(\theta_{12}) = 0. \qquad (2.7)$$

Furthermore, (2.6) implies that

$$w_{12} = -W_{11}\,\theta_{12}/\theta_{22}. \qquad (2.8)$$

Substituting (2.8) into (2.7) gives

$$-W_{11}\,\theta_{12}/\theta_{22} - s_{12} - \lambda\Gamma(\theta_{12}) = 0. \qquad (2.9)$$

The graphical lasso operates on the above gradient equation for β = θ_{12}/θ_{22}, that is,

$$-W_{11}\beta - s_{12} - \lambda\Gamma(\theta_{12}) = 0, \qquad (2.10)$$

where Γ(θ_{12}) ∈ sign(β) since θ_{22} > 0. (2.10) is the stationary equation of the l1-regularized quadratic program

$$\min_{\beta\in\mathbb{R}^{p-1}} \left\{ \tfrac{1}{2}\,\beta^T W_{11}\beta + \beta^T s_{12} + \lambda\|\beta\|_1 \right\}, \qquad (2.11)$$

where W_{11} is assumed to be fixed. This is analogous to a lasso regression of the last variable on the rest, except that the cross-product matrix S_{11} is replaced by the current estimate W_{11}. Exploiting the sparsity in β, this problem can be solved efficiently using elementwise coordinate descent. From β̂ one obtains ŵ_{12} = −W_{11}β̂ from (2.8), and (2.6) further implies that

$$\frac{1}{\hat\theta_{22}} = w_{22} + \hat\beta^T \hat w_{12} = w_{22} - \hat\beta^T W_{11}\hat\beta. \qquad (2.12)$$

Finally, θ̂_{12} = β̂ θ̂_{22} can be recovered from β̂ and θ̂_{22}. Worth noting is that after having solved for β and updated w_{12}, the graphical lasso can move on to the next block; after convergence, θ_{12} and θ_{22} are disentangled in this way.


This completes the graphical lasso procedure, proposed by Friedman et al. (2008b), which builds on the work of Banerjee et al. (2008). The graphical lasso algorithm is summarized in Algorithm 1.

Algorithm 1 Graphical Lasso

1: procedure GraphicalLasso(S, λ)
2: Initialize W = S + λI. The diagonal of W remains unchanged in what follows.
3: Repeat for j = 1, 2, ..., p, 1, 2, ..., p, ... until convergence:
4: a) Partition the matrix W into two parts, part one being all but the jth row and column and part two being the jth row and column (as in (2.5), with the jth variable in the last position).
5: b) Solve the estimating equations W_{11}β + s_{12} + λΓ(β) = 0, i.e. the lasso problem (2.11), using elementwise coordinate descent.
6: c) Update ŵ_{12} = −W_{11}β̂.
7: In the final cycle, for each j solve for θ̂_{12} = β̂ θ̂_{22}, with 1/θ̂_{22} = w_{22} + w_{12}^T β̂.
8: end procedure

Note that from (2.4), the solution satisfies w_{ii} = s_{ii} + λ for all i, since Θ_{ii} > 0. For more details regarding the graphical lasso, see Friedman et al. (2008b); a direct link to the paper is available in the references as well.
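As an illustration of the estimator that Algorithm 1 computes, the following is a minimal sketch using scikit-learn's GraphicalLasso (an assumption on my part, since the thesis does not prescribe an implementation); the parameter alpha plays the role of λ in (2.1), and the data matrix below is purely illustrative.

```python
# Minimal sketch: sparse precision matrix estimation with the graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # n = 100 observations, p = 20 features
X = X - X.mean(axis=0)                  # center the columns

model = GraphicalLasso(alpha=0.2, max_iter=200).fit(X)
Theta_hat = model.precision_            # sparse estimate of Sigma^{-1}

# Zeros in Theta_hat correspond to pairs of features estimated to be
# conditionally independent; the adjacency matrix of the estimated graph is:
adjacency = (np.abs(Theta_hat) > 1e-8) & ~np.eye(20, dtype=bool)
print(adjacency.sum() // 2, "edges selected")
```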

2.2 Hierarchical Clustering

In order to understand the methods in the following sections we introduce hierarchical clustering, which will be used to partition the variables (or features) of the datasets into several disjoint clusters. As mentioned in the introduction, one can greatly decrease the required computational time and increase the interpretability of the results by identifying the connected components of the solution beforehand.

Hierarchical clustering is suitable for this.

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters without selecting the number of clusters beforehand. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusions and the distance (or level) at which each fusion took place. An example can be seen in Figure 1, where dendrogram trees produced by three different methods have been cut at a height that results in six disjoint clusters. Note that the methods produce different results, which is caused by the chosen linkage; more about this follows in the coming sections.

There are generally two types of strategies for performing hierarchical clustering: agglomerative and divisive clustering. In agglomerative clustering, each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. In divisive clustering, all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. This section covers the agglomerative clustering techniques that will be used during the analysis.

[Figure 1: three dendrograms, one per linkage method: single linkage (SLC), complete linkage (CLC) and average linkage (ALC).]

Figure 1: Left: Dendrogram obtained from hierarchical clustering with single linkage clustering. Middle: Dendrogram obtained from hierarchical clustering with complete linkage clustering. Right: Dendrogram obtained from hierarchical clustering with average linkage clustering. Note that in all cases the true data consisted of 6 clusters, each with 10 features and 100 observations. The colors indicate which clusters would have been obtained if the dendrogram is cut at a height resulting in 6 clusters.

Assume we have a dataset consisting of n observations. The agglomerative clustering algorithms begin with every observation representing a single, disjoint cluster. At each of the n − 1 steps the closest two (least dissimilar) clusters are merged into a single cluster which produces one less cluster at the higher level. In order to be able to compare the clusters one requires a measure of dissimilarity (or distance) between the clusters to be defined.


Let A and B represent two such clusters. The dissimilarity d(A, B) between A and B is computed from the set of pairwise observation dissimilarities d_{ii'}, where one member of the pair, i, is in A and the other, i', is in B.

The following sections explain how the dissimilarity measure is chosen and calculated, and present the three most common ways of defining intergroup dissimilarities.

2.2.1 Choice of Dissimilarity measure

In order to decide which clusters should be combined, a measure of dissimilarity between the sets of observations is required. In most methods of hierarchical clustering this is achieved by the use of an appropriate measure of distance between pairs of observations.

Sometimes other dissimilarity measures are preferred, which is the case in this report: namely a correlation-based distance, which considers two observations to be similar if their features are highly correlated, even if they are far apart in the sense of Euclidean distance.

Important to note is that the dissimilarity is defined as one minus the similarity.

The choice of dissimilarity measure has a strong effect on the resulting dendrogram and should be guided by the type of data being clustered and the question at hand. For example, one must consider whether the variables should be scaled prior to calculating the dissimilarities between observations, and whether a correlation-based or a distance-based dissimilarity should be used. Since this choice has a large impact on the results, it should be made with care before performing the clustering.

In our case we use a correlation-based distance measure, since we are interested in clustering the features (or variables): the dissimilarity is calculated as one minus the absolute value of the scaled covariance, i.e. one minus the absolute value of the correlation. Since the absolute correlation is a similarity measure, one minus the absolute correlation can be used as a dissimilarity measure. This results in a p × p matrix when the correlation of the p features is calculated. Given the observation matrix X_{n×p} consisting of n observations and p features, the symmetric dissimilarity matrix is defined as

D = 1 − |Cor(X)| (2.13)

where Cor(X) denotes the correlation matrix of X.
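A minimal sketch of this clustering step (the data matrix and the number of clusters are illustrative assumptions): the dissimilarity (2.13) is computed and the features are clustered with each of the three linkage methods described in the following subsections.

```python
# Minimal sketch: correlation-based dissimilarity and hierarchical clustering of features.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                   # n = 100 observations, p = 20 features

D = 1.0 - np.abs(np.corrcoef(X, rowvar=False))   # p x p dissimilarity, cf. (2.13)
np.fill_diagonal(D, 0.0)                         # exact zeros on the diagonal
condensed = squareform(D, checks=False)          # condensed form expected by linkage()

for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    labels = fcluster(Z, t=6, criterion="maxclust")   # cut into 6 clusters, as in Figure 1
    print(method, "->", len(np.unique(labels)), "clusters")
```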


2.2.2 Single Linkage

In single linkage hierarchical clustering (also known as nearest neighbour clustering), the dissimilarity between two clusters is defined as the shortest distance (smallest dissimilarity) between a point in one cluster and a point in the other:

$$d_{\mathrm{SLC}}(A, B) = \min_{i\in A,\ i'\in B} D_{ii'} \qquad (2.14)$$

Single linkage clustering only requires a single dissimilarity D_{ii'} to be small for two groups to be considered close together, without taking the other observations in the groups into account. Therefore single linkage clustering tends to produce trailing clusters, due to its tendency to combine groups at low thresholds, which is also illustrated in Figure 1.

2.2.3 Complete Linkage

In complete linkage hierarchical clustering (also known as furthest neighbour clustering), the dissimilarity between two clusters is defined as the longest distance (largest dissimilarity) between a point in one cluster and a point in the other:

$$d_{\mathrm{CLC}}(A, B) = \max_{i\in A,\ i'\in B} D_{ii'} \qquad (2.15)$$

Complete linkage clustering is the opposite extreme compared with single linkage clustering and tends to produce compact clusters. One weakness of complete linkage clustering is that observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster.

2.2.4 Average Linkage

In average linkage hierarchical clustering, the dissimilarity between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster:

$$d_{\mathrm{ALC}}(A, B) = \frac{1}{N_A N_B} \sum_{i\in A}\sum_{i'\in B} D_{ii'} \qquad (2.16)$$

where N_A and N_B denote the number of points in clusters A and B. Average linkage clustering is a compromise between the two extremes of single and complete linkage clustering and attempts to produce clusters that are relatively compact and far apart.


For more details regarding the hierarchical clustering method, the reader is referred to textbooks covering the topic; some suggestions can be found in the references.

2.3 Covariance Thresholding

Recently, Witten et al. (2011) and Mazumder & Hastie (2012) presented a necessary and sufficient condition that can be used to check whether the graphical lasso solution will be block diagonal, without solving the original problem. This condition is implemented in the latest version of the graphical lasso algorithm and leads to massive computational gains, since the problem can be split into several disjoint problems. The condition uses the property that the connected components of the non-zero pattern of the graphical lasso solution Θ̂^{(λ)} are equivalent to the connected components of the covariance matrix S thresholded at level λ, as will be explained. The low cost of thresholding the covariance matrix S is what causes the massive computational gains over the existing graphical lasso procedure. Note that if λ is very small, no advantage is gained, since both the graphical lasso solution and the thresholded covariance matrix will consist of only one large connected component. The opposite extreme is λ = 1 (assuming the empirical covariance has been scaled to be a correlation matrix), resulting in all nodes (or features) being unconnected in the solution; such a graph would not give much information and is therefore not of interest.

In what follows we let Θ̂^{(λ)} denote the graphical lasso solution obtained with tuning parameter λ, and we let S denote the empirical covariance matrix estimated from the data. Mazumder & Hastie (2012) noticed that the sparsity pattern of Θ̂^{(λ)} can be visualized by calculating the associated adjacency matrix, defined as

$$\hat E^{(\lambda)}_{ii'} = \begin{cases} 1 & \text{if } \hat\Theta^{(\lambda)}_{ii'} \neq 0,\ i \neq i', \\ 0 & \text{otherwise,} \end{cases} \qquad (2.17)$$

which defines a symmetric graph Ĝ^{(λ)} = (V̂, Ê^{(λ)}) on the vertices (or features) V̂ = {1, ..., p} with edges defined by (2.17).

Suppose the graph Ĝ^{(λ)} admits a decomposition into K̂(λ) connected components,

$$\hat G^{(\lambda)} = \bigcup_{l=1}^{\hat K(\lambda)} \hat G^{(\lambda)}_l, \qquad (2.18)$$

where Ĝ^{(λ)}_l = (V̂_l, Ê^{(λ)}_l) are the components of the graph Ĝ^{(λ)}. Note that K̂(λ) can take values from 1 through p. If K̂(λ) = p all features are unconnected, and if K̂(λ) = 1 all features are connected into one large component.


The sparsity pattern of the empirical covariance matrix S thresholded at level λ can also be visualized by calculating the associated adjacency matrix, defined as

$$E^{(\lambda)}_{ii'} = \begin{cases} 1 & \text{if } |S_{ii'}| > \lambda,\ i \neq i', \\ 0 & \text{otherwise.} \end{cases} \qquad (2.19)$$

This defines a symmetric graph G^{(λ)} = (V, E^{(λ)}) on the vertices (or features) V = {1, ..., p} with edges defined by (2.19).

Similarly to (2.18), suppose the graph G^{(λ)} also admits a decomposition into K(λ) connected components,

$$G^{(\lambda)} = \bigcup_{l=1}^{K(\lambda)} G^{(\lambda)}_l, \qquad (2.20)$$

where G^{(λ)}_l = (V_l, E^{(λ)}_l) are the components of the graph G^{(λ)}.

To calculate the connected components of Ĝ^{(λ)}, one must first compute the graphical lasso solution Θ̂^{(λ)}. To calculate the connected components of G^{(λ)}, one only needs to perform a simple screening of S, completely independently of the graphical lasso solution; this screening requires much less computational effort. The surprising connection is that the connected components, in the sense of vertices (or features), of (2.18) and (2.20) are exactly the same. Note that the edge structures (2.17) and (2.19) of the two graphs need not be preserved.

Using the above observation, we can calculate the connected components of Θ̂^{(λ)} beforehand, without computing the graphical lasso solution. Since the cost of calculating the components of the thresholded empirical covariance matrix is orders of magnitude smaller than that of computing the graphical lasso solution, we can separate the graphical lasso problem (2.1) into K(λ) separate problems of the same form as (2.1). The subproblems have size |V_l| × |V_l|, l = 1, ..., K(λ), one for each connected component. Hence, for certain values of λ, when the solution separates into several smaller parts, computing the graphical lasso solution becomes feasible even though it may be impossible to compute it for the whole p × p matrix. This fact has been used to speed up other graphical methods as well. Another advantage is that the computations can be parallelized over several computers, effectively speeding up heavy computations.

Mazumder & Hastie (2012) presented the following Theorem summarizing their results.

Theorem 1. For any λ > 0, the connected components of the estimated concentration graph Ĝ^{(λ)}, as defined in (2.17) and (2.18), induce exactly the same vertex (or feature) partition as those of the thresholded empirical covariance graph G^{(λ)}, as defined in (2.19) and (2.20). That is, K̂(λ) = K(λ), and there exists a permutation π on {1, ..., K(λ)} such that

$$\hat V^{(\lambda)}_i = V^{(\lambda)}_{\pi(i)}, \quad \forall i = 1, \dots, K(\lambda).$$

As a result, if the graphical lasso solution is known to be block diagonal with K blocks after the screening of S, the solution takes the form

$$\hat\Theta = \begin{pmatrix} \hat\Theta_1 & & \\ & \ddots & \\ & & \hat\Theta_K \end{pmatrix}, \qquad (2.21)$$

where each Θ̂_k is obtained by maximizing the penalized log-likelihood

$$\log\det\Theta_k - \operatorname{tr}(S_k\Theta_k) - \lambda\|\Theta_k\|_1 \qquad (2.22)$$

for k = 1, ..., K, where S_k is the block of S corresponding to the features in the kth component. By inspection one can see that this is equal to the graphical lasso problem (1.2), applied separately for each k = 1, ..., K. For more details regarding the thresholding procedure, see Witten et al. (2011) and Mazumder & Hastie (2012); direct links to the papers are available in the references as well.
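A minimal sketch of this screening step (illustrative, with a hypothetical helper name; the thesis does not prescribe an implementation): threshold the empirical covariance matrix, find the connected components, and solve a graphical lasso problem of the form (2.22) on each block.

```python
# Sketch: block-wise graphical lasso via covariance thresholding (Theorem 1).
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.covariance import graphical_lasso

def blockwise_graphical_lasso(S, lam):
    p = S.shape[0]
    # Adjacency matrix of the thresholded covariance graph, cf. (2.19).
    adj = (np.abs(S) > lam) & ~np.eye(p, dtype=bool)
    n_comp, labels = connected_components(adj, directed=False)

    Theta = np.zeros_like(S)
    for l in range(n_comp):
        idx = np.where(labels == l)[0]
        if len(idx) == 1:
            # Isolated feature: no off-diagonal penalty applies, so invert the variance
            # (conventions that also penalize the diagonal would use 1/(S_ii + lam)).
            Theta[idx[0], idx[0]] = 1.0 / S[idx[0], idx[0]]
            continue
        _, Theta_block = graphical_lasso(S[np.ix_(idx, idx)], alpha=lam)
        Theta[np.ix_(idx, idx)] = Theta_block   # assemble the block diagonal solution (2.21)
    return Theta
```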

2.4 Cluster Graphical Lasso

This section presents the cluster graphical lasso method and the surprising connection between the graphical lasso method and single linkage hierarchical clustering.

2.4.1 Background

Tan et al. (2013) presented the cluster graphical lasso method, which improves the estimation of Gaussian graphical models by exploiting the fact that there is a surprising connection between the graphical lasso and single linkage hierarchical clustering. They presented the idea that the graphical lasso is a two-step procedure: in the first step, single linkage hierarchical clustering is performed on the variables in order to identify the connected components, and in the second step the graphical lasso is performed on the subset of variables within each cluster. More precisely, the graphical lasso method determines the connected components of the estimated network via single linkage hierarchical clustering. Single linkage clustering is known to perform badly (especially due to trailing effects, which can be seen in Figure 1), which is why they proposed the cluster graphical lasso method, which involves clustering the features using an alternative to single linkage clustering and then performing the graphical lasso on the subset of variables within each cluster.

Tan et al. (2013) were inspired by the results of Witten et al. (2011) and Mazumder & Hastie (2012), namely the following theorem:

Theorem 2. The connected components of the graphical lasso solution with tuning parameter λ are the same as the connected components of the undirected graph corresponding to the p × p edge matrix (or adjacency matrix) E^{(λ)} defined in (2.19) (see Section 2.3 for details).

Using this result, one can consider a partition of the features into two disjoint sets A and B. If, for example, |S_{ii'}| ≤ λ for all i ∈ A and all i' ∈ B, then the features in A and B are unconnected in the graphical lasso solution. With this result, solving the standard graphical lasso problem (2.1) can be seen as a two-step procedure: in the first step the connected components of the undirected graph are identified using the adjacency matrix E^{(λ)} defined in (2.19), and in the second step the graphical lasso is performed with parameter λ on each connected component separately, as in (2.22), resulting in a solution of the form (2.21).

They further show that identifying the connected components by screening the covariance matrix S as in (2.19) is equivalent to performing single linkage hierarchical clustering on the basis of the similarity matrix given by the absolute values of the elements of the empirical covariance matrix S, and then cutting the resulting dendrogram tree at level λ. Their proposed cluster graphical lasso algorithm has two big advantages over the standard graphical lasso method: first, it allows the use of a different clustering method, and second, it decouples the tuning parameter of the clustering step from the λ of the graphical lasso problem. This results in improved detection of the connected components of the estimated graphical model in the high-dimensional setting.

2.4.2 Graphical lasso and single linkage clustering

This section will clarify the results presented in the previous section. Assume that we have a dataset Xn×p consisting of n observations from p features, where the columns have been standardised to have mean zero and variance one. Let S denote the empirical covariance matrix of Xn×p (which in this case is the correlation matrix of Xn×p).

Theorem 3 establishes the surprising connection between the two seemingly unrelated methods: more precisely, the connected components obtained by the graphical lasso and the clusters obtained by performing single linkage hierarchical clustering on the similarity matrix S̃, defined as S̃ = |S|, are identical.

Theorem 3. Let C_1, ..., C_K denote the clusters that result from performing single linkage hierarchical clustering using the similarity matrix S̃ and cutting the resulting dendrogram at a height 0 ≤ λ ≤ 1. Let D_1, ..., D_R denote the connected components of the graphical lasso solution with tuning parameter λ. Then K = R and there exists a permutation π such that C_k = D_{π(k)} for k = 1, ..., K.

The cutting of the dendrogram might not be clear to all readers, so it will now be clarified with an example.

Suppose we have generated a data matrix and calculated the similarity matrix S̃ to be

$$\tilde S = \begin{pmatrix}
1 & 0.3809929 & 0.4250369 & 0.4279785 \\
0.3809929 & 1 & 0.5916069 & 0.5209770 \\
0.4250369 & 0.5916069 & 1 & 0.9724252 \\
0.4279785 & 0.5209770 & 0.9724252 & 1
\end{pmatrix}. \qquad (2.23)$$

The resulting dendrogram of (2.23) can be seen in Figure 2. Note that when creating dendrograms with R one uses the dissimilarity matrix (and not the similarity matrix), defined as D = 1 − S̃. For instance, cutting the dendrogram at height h = 0.3 (green line) results in the clusters {1}, {2}, {3, 4}. Theorem 3 says that these are the same clusters that would be obtained by performing the graphical lasso with tuning parameter λ = 1 − h = 1 − 0.3 = 0.7. The "one minus h" again comes from the fact that we are dealing with dissimilarities.


[Figure 2: single linkage dendrogram of the four variables in (2.23).]

Figure 2: Dendrogram obtained using single linkage hierarchical clustering. The horizontal lines indicate different cuts of the dendrogram tree (note that the tree can be cut at any height). Cutting the dendrogram at the blue line (h = 0.5) results in the clusters {1}, {2, 3, 4}. Cutting the dendrogram at the green line (h = 0.3) results in the clusters {1}, {2}, {3, 4}. Cutting the dendrogram at the pink line (h = 0.1) results in the clusters {1}, {2}, {3, 4} as well. The figure is to be seen as a tree, growing from the bottom up; a cut breaks the branches, creating clusters consisting of the elements of each branch. Depending on the height at which one cuts the tree, different clusters are obtained. These clusters (or connected components) are equivalent to what would be obtained using the graphical lasso with tuning parameter λ = 1 − h.
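A minimal sketch (illustrative, not from the thesis) that replays this example numerically: the flat clusters from single linkage clustering at cut height h coincide with the connected components of the graph obtained by thresholding S̃ at λ = 1 − h.

```python
# Sketch: Theorem 3 on the similarity matrix in (2.23).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy.sparse.csgraph import connected_components

S_tilde = np.array([
    [1.0,       0.3809929, 0.4250369, 0.4279785],
    [0.3809929, 1.0,       0.5916069, 0.5209770],
    [0.4250369, 0.5916069, 1.0,       0.9724252],
    [0.4279785, 0.5209770, 0.9724252, 1.0],
])

lam = 0.7                                  # graphical lasso tuning parameter
h = 1.0 - lam                              # corresponding dendrogram cut height

D = 1.0 - S_tilde                          # dissimilarity matrix
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method="single")
clusters = fcluster(Z, t=h, criterion="distance")

adj = (S_tilde > lam) & ~np.eye(4, dtype=bool)   # thresholded covariance graph, cf. (2.19)
_, components = connected_components(adj, directed=False)

print(clusters)      # labels for {1}, {2}, {3, 4}
print(components)    # the same partition, up to relabelling
```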

2.4.3 Algorithm

The cluster graphical lasso is a simple method. The cluster graphical lasso algorithm is presented in Algorithm 2. In short, the features are partitioned into K clusters using a clustering method of choice based on ˜S and then one performs graphical lasso on each of the K obtained clusters of variables.

Worth mentioning is that the cluster graphical lasso method can also be interpreted as a penalized log-likelihood problem similar to (2.1), where we impose an infinite penalty on |Θ_{ii'}| if the ith and i'th features are in different clusters:

$$\log\det\Theta - \operatorname{tr}(S\Theta) - \sum_{i\neq i'} w_{ii'}\,|\Theta_{ii'}|, \qquad (2.24)$$

where

$$w_{ii'} = \begin{cases} \lambda_k & \text{if } i, i' \in C_k, \\ \infty & \text{if } i \in C_k,\ i' \in C_{k'},\ k \neq k'. \end{cases} \qquad (2.25)$$

Algorithm 2 Cluster graphical lasso

1: procedure ClusterGraphicalLasso(S, K, λ)
2: Let C_1, ..., C_K be the clusters obtained by performing a clustering method of choice based on the similarity matrix S̃. The kth cluster contains |C_k| features.
3: for k = 1 to K do
4: a) Let S_k be the empirical covariance matrix for the features in the kth cluster. Here, S_k is a |C_k| × |C_k| matrix.
5: b) Solve the graphical lasso problem (2.1) using the covariance matrix S_k with a given value of λ_k. Let Θ̂_k denote the graphical lasso estimate.
6: end for
7: Combine the K resulting graphical lasso estimates into a p × p matrix Θ̂ that is block diagonal with blocks Θ̂_1, ..., Θ̂_K.
8: Return Θ̂
9: end procedure

For more details regarding the cluster graphical lasso, see Tan et al. (2013); a direct link to the paper is available in the references as well.
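A minimal sketch of Algorithm 2 (illustrative assumptions on my part: average linkage in the clustering step and a common penalty λ_k = lam for every cluster; the function name is hypothetical).

```python
# Sketch: cluster graphical lasso (Algorithm 2).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.covariance import empirical_covariance, graphical_lasso

def cluster_graphical_lasso(X, n_clusters, lam, method="average"):
    p = X.shape[1]
    S = empirical_covariance(X)
    S_tilde = np.abs(np.corrcoef(X, rowvar=False))        # similarity matrix
    D = 1.0 - S_tilde                                      # dissimilarity, cf. (2.13)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method=method)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")

    Theta = np.zeros((p, p))
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        if len(idx) == 1:
            Theta[idx[0], idx[0]] = 1.0 / S[idx[0], idx[0]]  # single feature: no off-diagonal penalty
            continue
        _, Theta_k = graphical_lasso(S[np.ix_(idx, idx)], alpha=lam)
        Theta[np.ix_(idx, idx)] = Theta_k                  # block diagonal assembly
    return Theta, labels
```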

2.5 Joint Graphical Lasso

This section presents the joint graphical lasso method, which uses an ADMM algorithm to solve the resulting convex optimization problem.

2.5.1 Background

Danaher et al. (2012) presented the joint graphical lasso which builds on the graphical lasso and estimates multiple related Gaussian graphical models for a high dimensional dataset where the observations belong to distinct classes. The estimation borrows strength across the classes in order to estimate multiple graphical models which share certain characteristics such as the location or values of non-zero elements of Θ.

We now let K denote the number of classes in the dataset and let Σ_k^{-1} denote the true precision matrix for the kth class. Using the joint graphical lasso algorithm, estimates of Σ_1^{-1}, ..., Σ_K^{-1} will be obtained. As usual we let Θ^{(k)} = Σ_k^{-1} for k = 1, ..., K.


Suppose we have K datasets X^{(1)}, ..., X^{(K)} with K ≥ 2, where X^{(k)} is an n_k × p matrix and the p features are common to all K datasets. We also assume, as usual, that the observations are independent and that the observations within each dataset (each class) are identically distributed, x^{(k)}_1, ..., x^{(k)}_{n_k} ∼ N(µ_k, Σ_k) for k = 1, ..., K. We further assume that µ_k = 0 within each class. Let S^{(k)} denote the empirical covariance matrix of X^{(k)}. Then the joint graphical lasso problem is to maximize the penalized log-likelihood

$$\sum_{k=1}^{K} n_k \left( \log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right) \right) - P(\{\Theta\}) \qquad (2.26)$$

over positive definite matrices Θ^{(1)}, ..., Θ^{(K)}. Here P({Θ}) denotes a convex penalty function, so that (2.26) is strictly concave in {Θ}, and tr, as usual, denotes the trace. They propose two forms of P({Θ}) which encourage Θ^{(1)}, ..., Θ^{(K)} to share certain characteristics, such as the location or value of nonzero elements, and which also result in sparse estimated precision matrices. These two forms are called the fused graphical lasso and the group graphical lasso and will now be explained in more detail.

The fused graphical lasso is the solution to (2.26) where P ({Θ}) is defined as

$$P(\{\Theta\}) = \lambda_1 \sum_{k=1}^{K} \sum_{i\neq i'} \left|\Theta^{(k)}_{ii'}\right| + \lambda_2 \sum_{k<k'} \sum_{i,i'} \left|\Theta^{(k)}_{ii'} - \Theta^{(k')}_{ii'}\right|. \qquad (2.27)$$

Here λ1 and λ2 are non-negative tuning parameters. Using P({Θ}) as defined in (2.27), penalties are first applied to each off-diagonal element of the K precision matrices, and secondly penalties are applied to differences between corresponding elements of each pair of precision matrices. As usual, when λ1 is large the solution will be sparse. When λ2 is large, many elements of Θ^{(1)}, ..., Θ^{(K)} will be identical across classes, and for very large λ2 all elements are equal across the classes. Therefore the fused graphical lasso encourages both similar network structure and similar edge values.

The group graphical lasso is the solution to (2.26) where P ({Θ}) is defined as

$$P(\{\Theta\}) = \lambda_1 \sum_{k=1}^{K} \sum_{i\neq i'} \left|\Theta^{(k)}_{ii'}\right| + \lambda_2 \sum_{i\neq i'} \sqrt{\sum_{k=1}^{K} \left(\Theta^{(k)}_{ii'}\right)^2}. \qquad (2.28)$$

Here λ1 and λ2 are non-negative tuning parameters. Using P({Θ}) as defined in (2.28), penalties are first applied to each off-diagonal element of the K precision matrices, and secondly a group lasso penalty is applied to the (i, i') element across all K precision matrices. This group lasso penalty encourages a similar sparsity pattern across all the estimated precision matrices, i.e. the zeros are encouraged to be located in the same places across all K precision matrices. When λ1 = 0 and λ2 > 0, all Θ̂^{(k)} will have an identical pattern of non-zero elements, since λ1 affects the sparsity only within each class while λ2 affects the sparsity across all classes simultaneously.

In both cases above, if λ2 = 0, one obtains K uncoupled graphical lasso optimization problems of the form (2.1). The group lasso penalty encourages a weaker form of similarity across the K estimated precision matrices than the fused lasso penalty: the fused lasso penalty encourages shared edge values across the K estimated precision matrices, whereas the group lasso penalty encourages only a shared sparsity pattern.
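A minimal sketch (illustrative; the function names are hypothetical) of how the two penalties (2.27) and (2.28) can be evaluated for a list of K precision matrices, for example when monitoring the objective (2.26) in experiments.

```python
# Sketch: the fused (2.27) and group (2.28) penalties.
import numpy as np

def fused_penalty(Thetas, lam1, lam2):
    off = ~np.eye(Thetas[0].shape[0], dtype=bool)
    p1 = lam1 * sum(np.abs(T[off]).sum() for T in Thetas)
    p2 = lam2 * sum(np.abs(Thetas[k] - Thetas[kp]).sum()
                    for k in range(len(Thetas)) for kp in range(k + 1, len(Thetas)))
    return p1 + p2

def group_penalty(Thetas, lam1, lam2):
    off = ~np.eye(Thetas[0].shape[0], dtype=bool)
    p1 = lam1 * sum(np.abs(T[off]).sum() for T in Thetas)
    stacked = np.stack(Thetas)                           # shape (K, p, p)
    p2 = lam2 * np.sqrt((stacked ** 2).sum(axis=0))[off].sum()
    return p1 + p2
```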

2.5.2 Algorithm

The joint graphical lasso uses an alternating direction method of multipliers (ADMM) algorithm, which is briefly presented below as proposed by Danaher et al. (2012).

Problem (2.26) can be rewritten as the minimization problem

$$\min_{\{\Theta\},\{Z\}} \left\{ -\sum_{k=1}^{K} n_k \left( \log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right) \right) + P(\{Z\}) \right\} \qquad (2.29)$$

over positive definite matrices Θ^{(1)}, ..., Θ^{(K)}, subject to the constraint Z^{(k)} = Θ^{(k)} for k = 1, ..., K, where {Z} = Z^{(1)}, ..., Z^{(K)}. The scaled augmented Lagrangian for (2.29) is given by

$$L_\rho(\{\Theta\},\{Z\},\{U\}) = -\sum_{k=1}^{K} n_k\left(\log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right)\right) + P(\{Z\}) + \frac{\rho}{2}\sum_{k=1}^{K}\left\|\Theta^{(k)} - Z^{(k)} + U^{(k)}\right\|_F^2, \qquad (2.30)$$

where {U} = U^{(1)}, ..., U^{(K)} are dual variables. In short, an ADMM algorithm corresponding to (2.30) is as follows.

Algorithm 3 Outline of an ADMM algorithm

1: {Θ_(i)} ← argmin_{Θ} L_ρ({Θ}, {Z_(i−1)}, {U_(i−1)})
2: {Z_(i)} ← argmin_{Z} L_ρ({Θ_(i)}, {Z}, {U_(i−1)})
3: {U_(i)} ← {U_(i−1)} + {Θ_(i)} − {Z_(i)}


In more detail the ADMM algorithm corresponding to (2.30) is as follows.

Algorithm 4 ADMM algorithm for the joint graphical lasso

1: Initialize the variables Θ^{(k)} = I, U^{(k)} = 0, Z^{(k)} = 0 for k = 1, ..., K.
2: Select a scalar ρ > 0.
3: for i = 1, 2, 3, ... until convergence do
4: i) For k = 1, ..., K, update Θ^{(k)}_(i) as the minimizer (with respect to Θ^{(k)}) of

$$-n_k\left(\log\det\Theta^{(k)} - \operatorname{tr}\left(S^{(k)}\Theta^{(k)}\right)\right) + \frac{\rho}{2}\left\|\Theta^{(k)} - Z^{(k)}_{(i-1)} + U^{(k)}_{(i-1)}\right\|_F^2.$$

Letting VDV^T denote the eigendecomposition of S^{(k)} − ρZ^{(k)}_(i−1)/n_k + ρU^{(k)}_(i−1)/n_k, the solution is given by V D̃ V^T, where D̃ is the diagonal matrix with jth diagonal element

$$\frac{n_k}{2\rho}\left(-D_{jj} + \sqrt{D_{jj}^2 + 4\rho/n_k}\right).$$

ii) Update {Z_(i)} as the minimizer (with respect to {Z}) of

$$\frac{\rho}{2}\sum_{k=1}^{K}\left\|Z^{(k)} - \left(\Theta^{(k)}_{(i)} + U^{(k)}_{(i-1)}\right)\right\|_F^2 + P(\{Z\}). \qquad (2.31)$$

iii) For k = 1, ..., K, update U^{(k)}_(i) as U^{(k)}_(i−1) + (Θ^{(k)}_(i) − Z^{(k)}_(i)).
5: end for
6: The resulting Θ̂^{(1)}, ..., Θ̂^{(K)} from this algorithm are the JGL estimates of Σ_1^{-1}, ..., Σ_K^{-1}, and the algorithm is guaranteed to converge to the global optimum.
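A minimal sketch (illustrative; the function name is hypothetical) of the Θ-update in step (i), which is available in closed form via an eigendecomposition.

```python
# Sketch: closed-form Theta-update for one class in Algorithm 4, step (i).
import numpy as np

def jgl_theta_update(S_k, Z_k, U_k, n_k, rho):
    """Minimize -n_k(logdet(T) - tr(S_k T)) + (rho/2)||T - Z_k + U_k||_F^2 over T > 0."""
    A = S_k - rho * Z_k / n_k + rho * U_k / n_k
    D, V = np.linalg.eigh(A)                          # A = V diag(D) V^T
    D_tilde = (n_k / (2.0 * rho)) * (-D + np.sqrt(D ** 2 + 4.0 * rho / n_k))
    return (V * D_tilde) @ V.T                        # V diag(D_tilde) V^T
```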

Depending on the choice of P({Z}), Equation (2.31) will look different. Letting A^{(k)} = Θ^{(k)}_(i) + U^{(k)}_(i−1), the minimization problem (2.31) can be rewritten as

$$\frac{\rho}{2}\sum_{k=1}^{K}\left\|Z^{(k)} - A^{(k)}\right\|_F^2 + P(\{Z\}). \qquad (2.32)$$

If P ({Z}) is defined as in the fused graphical lasso case (2.27) then (2.32) takes the form

$$\frac{\rho}{2}\sum_{k=1}^{K}\left\|Z^{(k)} - A^{(k)}\right\|_F^2 + \lambda_1\sum_{k=1}^{K}\sum_{i\neq i'}\left|Z^{(k)}_{ii'}\right| + \lambda_2\sum_{k<k'}\sum_{i,i'}\left|Z^{(k)}_{ii'} - Z^{(k')}_{ii'}\right|. \qquad (2.33)$$


If P ({Z}) is defined as in the group graphical lasso case (2.28) then (2.32) takes the form

$$\frac{\rho}{2}\sum_{k=1}^{K}\left\|Z^{(k)} - A^{(k)}\right\|_F^2 + \lambda_1\sum_{k=1}^{K}\sum_{i\neq i'}\left|Z^{(k)}_{ii'}\right| + \lambda_2\sum_{i\neq i'}\sqrt{\sum_{k=1}^{K}\left(Z^{(k)}_{ii'}\right)^2}. \qquad (2.34)$$

Inspired by the results of Witten et al. (2011) and Mazumder & Hastie (2012), presented in Section 2.3, one can greatly reduce the computational time and power required when running the joint graphical lasso algorithm by determining beforehand whether the joint graphical lasso solution will be block diagonal. Then one can simply perform the joint graphical lasso on the features within each block separately and still obtain exactly the same solution as one would obtain by performing the joint graphical lasso on the whole p × p matrix. As usual, the reason for this is that computing an eigendecomposition of the whole p × p matrix is very computationally heavy when p is large.

More precisely, if we determine that the joint graphical lasso solution with tuning parameters λ1 and λ2 will be block diagonal, i.e. that the solutions Θ̂^{(1)}, ..., Θ̂^{(K)} are block diagonal, each with the same R blocks, where the rth block contains p_r features such that $\sum_{r=1}^{R} p_r = p$, then rather than running the joint graphical lasso on the p × p matrices we can run the joint graphical lasso method on the R blocks separately, where the matrices have size p_r × p_r for r = 1, ..., R. Similarly to Equation (2.21), the solution of the joint graphical lasso problem will be of the form

$$\hat\Theta^{(k)} = \begin{pmatrix} \hat\Theta^{(k)}_1 & & \\ & \ddots & \\ & & \hat\Theta^{(k)}_R \end{pmatrix}, \quad k = 1, \dots, K. \qquad (2.35)$$

They proposed the following two theorems with sufficient and necessary conditions that check if the resulting joint graphical lasso solution will be block diagonal:

Theorem 4. The connected components of the fused graphical lasso solution with tuning parameters λ1 and λ2 and K = 2 classes are the same as the connected components of the undirected graph fulfilling the following conditions (letting A and B denote two disjoint partitions of the p variables):

(a) $|n_1 S^{(1)}_{ii'}| \leq \lambda_1 + \lambda_2 \quad \forall i \in A,\ \forall i' \in B$
(b) $|n_2 S^{(2)}_{ii'}| \leq \lambda_1 + \lambda_2 \quad \forall i \in A,\ \forall i' \in B$
(c) $|n_1 S^{(1)}_{ii'} + n_2 S^{(2)}_{ii'}| \leq 2\lambda_1 \quad \forall i \in A,\ \forall i' \in B$

With K > 2 classes, the condition is

$$|n_k S^{(k)}_{ii'}| \leq \lambda_1 \quad \forall i \in A,\ \forall i' \in B,\ k = 1, \dots, K.$$

Theorem 5. The connected components of the group graphical lasso solution with tuning parameters λ1 and λ2 and K ≥ 2 classes are the same as the connected components of the undirected graph fulfilling the following condition (letting A and B denote two disjoint partitions of the p variables):

$$\sum_{k=1}^{K}\left(\left|n_k S^{(k)}_{ii'}\right| - \lambda_1\right)_+^2 \leq \lambda_2^2 \quad \forall i \in A,\ \forall i' \in B.$$

Hence for certain values of λ1 (preferably large) and λ2 the solution is separated into R sub-problems resulting in drastic speed improvements.
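A minimal sketch (illustrative; the function name is hypothetical) of the screening rule in Theorem 5: features i and i' are joined by an edge whenever the condition is violated, and the connected components of that graph give the blocks on which the group graphical lasso can be solved separately.

```python
# Sketch: block screening for the group graphical lasso (Theorem 5).
import numpy as np
from scipy.sparse.csgraph import connected_components

def ggl_screen(S_list, n_list, lam1, lam2):
    """S_list: K empirical covariance matrices; n_list: class sample sizes."""
    p = S_list[0].shape[0]
    # Sum over classes of (|n_k S^(k)_{ii'}| - lambda_1)_+^2, cf. Theorem 5.
    total = np.zeros((p, p))
    for S_k, n_k in zip(S_list, n_list):
        total += np.maximum(np.abs(n_k * S_k) - lam1, 0.0) ** 2
    adj = (total > lam2 ** 2) & ~np.eye(p, dtype=bool)   # edge if the condition is violated
    n_blocks, labels = connected_components(adj, directed=False)
    return n_blocks, labels
```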

For more details regarding the joint graphical lasso algorithm, see Danaher et al. (2012); a direct link to the paper is available in the references as well.

2.6 Nonparanormal

This section presents the nonparanormal distribution, a method that can be used to relax the normality assumption.

2.6.1 Background

To overcome the problem that most methods for estimating sparse undirected graphs rely heavily on normality, one may use a semiparametric Gaussian copula (or nonparanormal) as proposed by Liu et al. (2009). In their paper they consider a nonparanormal extension of undirected graphical models based on the multivariate Gaussian distribution in the high-dimensional setting. More precisely, they consider a high-dimensional Gaussian copula with non-parametric marginals, which is referred to as the nonparanormal distribution. Worth noting is that, as of today, it is not possible to test whether high-dimensional data follow a Gaussian distribution or not, since no such test exists. One can only assume that most real-life observations do not follow a normal distribution, which is why the normality assumption should be treated carefully, and the use of a relaxation technique can have a great impact on the resulting estimated networks.

We say that a random vector X = (X_1, ..., X_p)^T has a nonparanormal distribution if there exist functions {f_i}_{i=1}^p such that Z ≡ f(X) ∼ N(µ, Σ), where f(X) = (f_1(X_1), ..., f_p(X_p)) (i.e. the transformed variables follow a Gaussian distribution). The nonparanormal distribution is defined as


$$X \sim \mathrm{NPN}(\mu, \Sigma, f). \qquad (2.36)$$

The joint probability density function of X (when the f_i's are monotone and differentiable) is given by

$$p_X(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}\left(f(x)-\mu\right)^T\Sigma^{-1}\left(f(x)-\mu\right)\right\} \prod_{i=1}^{p}\left|f_i'(x_i)\right|. \qquad (2.37)$$

Note that the density in (2.37) is not identifiable unless the fi’s preserve the means and variances, that is

$$\mu_i = \mathrm{E}(Z_i), \qquad (2.38)$$
$$\sigma_i^2 = \Sigma_{ii} = \mathrm{Var}(Z_i) = \mathrm{Var}(X_i). \qquad (2.39)$$

Worth noting is that these conditions only depend on the diagonal of Σ and not on the full covariance matrix. Further, we let F_i(x) denote the marginal distribution function of X_i. By the definition of the distribution function we get

$$F_i(x) = \mathrm{P}(X_i \leq x) = \mathrm{P}(Z_i \leq f_i(x)) = \Phi\left(\frac{f_i(x)-\mu_i}{\sigma_i}\right). \qquad (2.40)$$

This implies that

$$\begin{aligned}
F_i(x) &= \Phi\left(\frac{f_i(x)-\mu_i}{\sigma_i}\right)\\
\Phi^{-1}\left(F_i(x)\right) &= \frac{f_i(x)-\mu_i}{\sigma_i}\\
\sigma_i\,\Phi^{-1}\left(F_i(x)\right) &= f_i(x)-\mu_i\\
f_i(x) &= \mu_i + \sigma_i\,\Phi^{-1}\left(F_i(x)\right). \qquad (2.41)
\end{aligned}$$

We further define the transformation as

$$h_i(x) = \Phi^{-1}\left(F_i(x)\right). \qquad (2.42)$$
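A minimal sketch of the transformation (2.41)-(2.42) applied columnwise to a data matrix (an assumption on my part: the marginal distribution functions F_i are replaced by a simple empirical CDF; the estimator of Liu et al. additionally truncates/Winsorizes this estimate, which is omitted here).

```python
# Sketch: nonparanormal transformation of each feature.
import numpy as np
from scipy.stats import norm, rankdata

def nonparanormal_transform(X):
    n, p = X.shape
    F_hat = rankdata(X, axis=0) / (n + 1.0)      # empirical CDF values in (0, 1)
    H = norm.ppf(F_hat)                          # h_i(x) = Phi^{-1}(F_i(x)), cf. (2.42)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    return mu + sigma * H                        # f_i(x) = mu_i + sigma_i h_i(x), cf. (2.41)
```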

