The Multilinear Normal Distribution:

Introduction and Some Basic Properties

Martin Ohlson∗

M. Rauf Ahmad∗

Dietrich von Rosen†∗

LiTH-MAT-R-2011/02-SE

∗ Department of Mathematics, Linköping University, SE–581 83 Linköping, Sweden.

Corresponding author: Martin Ohlson. Tel.: +46 13 281447. E-mail address: martin.ohlson@liu.se

† Energy and Technology, Swedish University of Agricultural Sciences, SE–750 07 Uppsala, Sweden.



Abstract

In this paper, the multilinear normal distribution is introduced as an extension of the matrix-variate normal distribution. Basic properties such as marginal and conditional distributions, moments, and the characteristic function are also presented. The estimation of parameters using a flip-flop algorithm is briefly discussed.

Keywords: Flip-flop algorithm; matrix normal distribution; marginal and conditional distributions; maximum likelihood estimators; moments; tensor product.


1 Introduction

The matrix normal distribution, being an extension of the ordinary multivariate (vector-) normal distribution, can be regarded as a bilinear normal distribution, that is, a distribution of a two-way (two-component) array, each component representing a vector of observations. The complexity of data, which has become a norm of the day for a variety of applied research arenas, requires a consideration of extensions of the bilinear normal distribution. The present paper presents such an extension, correspondingly named the multilinear normal distribution (Kollo and von Rosen, 2005, Ch. 2), based on a parallel extension of bilinear matrices to multilinear tensors (Comon, 2009). Although the adjective multilinear has not yet found its way into the general statistical literature, one may trace the same or similar nomenclature with reference to the analysis of complicated data structures in different research paradigms, a commonly used alternative expression being the analysis of multiway data (Kroonberg, 2008, p. 16). Kroonberg (2008) also gives some useful references on multiway analysis, particularly based on tensor algebra; see also Coppi and Bolasco (1989).

Although it is, at least theoretically, a relatively uncharted territory of research compared to the multivariate normal distribution, some interesting and very useful applications of the multilinear distribution can be found in the literature. In particular, the emergence of complicated and enormous data sets in recent decades has given serious impetus for such applied literature to flourish. As a byproduct, this has generated a large body of literature on the theory and applications of tensors in statistics.

One of the most important uses of the multilinear normal distribution, and hence of tensor analysis, is perhaps in magnetic resonance imaging (MRI). A nice work, particularly focusing on the need to go from the matrix-variate to the tensor-based multilinear normal distribution, is given in Basser and Pajevic (2003). They convincingly argue why a vectorial treatment of a complex data set which actually needs a tensorial treatment, and the application of multilinear normality, can lead to wrong or inefficient conclusions. For some more relevant work in the same direction, see Basser and Pajevic (2000, 2007), Bihan et al. (2001), and the references cited therein, whereas a Bayesian perspective is given in Martín-Fernández et al. (2004); see also Hoff (2011). Analysis of multilinear, particularly trilinear, data has a specific attraction in chemometrics and spectroscopy; see for example Leurgens and Ross (1992) and Burdick (1995). Other areas of application include signal processing (Kofidis and Regalia, 2001), morphometry (Lepore et al., 2008), geostatistics (Liu and Koike, 2007), and statistical mechanics (Soize, 2007), to mention a few. The extensive use of tensor-variate analysis in these and other similar fields has generated a special tensorial nomenclature, for example diffusion tensor, dyadic tensor, stress and strain tensors, etc. (McCullagh, 1987). Similarly, special tensorial decompositions, for example the PARAFAC and Tucker decompositions (Kroonberg, 2008), have been developed; for a general comprehensive review of tensor decompositions and their various applications, see Comon (2001) and Kolda and Bader (2007).

The use of tensors, and their associated distributional structure, is even older, with most frequent applications in the theory of linear models. Some classical treatises on tensors and multilinear algebra are Arfken and Weber (1995), Northcott (1984), and Carmeli and Malin (2000). For a comprehensive exposition of the use of tensors in statistics, see McCullagh (1987). In another unique contribution, McCullagh had already introduced tensor notation in statistics with particular reference to the computation of polynomial cumulants (McCullagh, 1984); see also Kaplan (1952) and Dauxois et al. (1994). The decomposition of ANOVA models into the potential sources of variation is always an important task in the theory of linear models. A tensorial treatment of the ANOVA decomposition is given in Takemura (1983), whereas a study of multilinear skewness and kurtosis in linear models is given in Pukelsheim (1980); see also Drygas (1985). Hext (1963) gives an interesting application in the theory of design of experiments, with particular emphasis on rock magnetism. That paper uncovers some very attractive theoretical and geometrical features of tensors when they are considered from a statistical perspective. The geometrical consideration of tensors in statistics, sometimes even more important than the purely theoretical treatment, owes much to placing the multivariate normal model in a Riemannian geometry (Skovgaard, 1984). For the simplest case, the geometrical structure of the parameter space of the bivariate normal distribution, see Sato et al. (1979), which also uses tensor notation to simplify complicated expressions.

This paper formally introduces the multilinear normal distribution, i.e., a normal distribution for the analysis of multiway data, and discusses some of its basic properties. The rest of the paper is organized as follows. Section 2 introduces the multilinear normal distribution, along with some notation which simplifies the calculations that follow. In Section 3, some properties of the multilinear normal distribution, such as marginal and conditional distributions, moments, and the characteristic function, are given. Section 4 presents an estimation procedure for the parameters of the distribution.

2 Model

Let $X = (x_{i_1,\ldots,i_k}) : \times_{i=1}^{k} p_i$ be a tensor of order $k$, with dimensions $p_1, p_2, \ldots, p_k$. Figure 1 shows the special case $k = 3$. If $p_i = 1$ for $2 \le i \le k$ or for $3 \le i \le k$, we have the special cases where the tensor equals a vector or a linear mapping, respectively.

In order to perform explicit computations, the tensor has to be represented via coordinates. In this paper, the representation will mainly be in vector form. However, the representation of the tensor $X : \times_{i=1}^{k} p_i$ as a vector can be done in several ways. If we look at the tensor space in Figure 1, this means that we can look upon the tensor from different directions.

[Figure 1: The box visualizes a three-dimensional data set as a third order tensor.]
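To make the different vector representations concrete, the following minimal sketch (Python/NumPy; an illustration that is not part of the original text, with the array name X and the dimensions chosen arbitrarily) builds a small third order tensor and forms two of its possible vectorizations.

    import numpy as np

    # A small third order tensor X with dimensions (p1, p2, p3) and entries x_{i1 i2 i3}.
    p1, p2, p3 = 2, 3, 4
    X = np.arange(p1 * p2 * p3, dtype=float).reshape(p1, p2, p3)

    # Vectorization in which i1 varies slowest and i3 fastest; this ordering matches
    # the basis e^1_{i1} (x) e^2_{i2} (x) e^3_{i3} used below, since the first factor
    # of a Kronecker product changes most slowly.
    x_vec = X.reshape(-1)

    # "Looking at the tensor from another direction": permute the modes first and
    # then vectorize, which gives a different but equivalent representation.
    x_vec_perm = np.transpose(X, (2, 0, 1)).reshape(-1)

    # Both vectors contain the same p1*p2*p3 entries, only ordered differently.
    assert sorted(x_vec) == sorted(x_vec_perm)

The sketches that follow all use the first ordering, so that a Kronecker product taken over the modes acts on the vectorized tensor in the natural way.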

Put $e^{p}_{1:k} = e^1_{i_1} \otimes \cdots \otimes e^k_{i_k}$ and $d^{q}_{1:l} = d^1_{j_1} \otimes \cdots \otimes d^l_{j_l}$, where $p = (p_1, \ldots, p_k)$ and $q = (q_1, \ldots, q_l)$, respectively, and $\otimes$ denotes the Kronecker product. To emphasize the dimension, we will write $p_k$, or $p(1{:}k)$, instead of $p$. The vectors $e^i_j : p_i \times 1$ are the unit basis vectors, i.e., a $p_i$-vector with 1 in the $j$th position and 0 elsewhere; similarly for $d^i_j : q_i \times 1$. Further, let $p^*_{j:l} = \prod_{i=j}^{l} p_i$ and $p^+_{j:l} = \sum_{i=j}^{l} p_i$, with the special cases $p^* = p^*_{1:k}$ and $p^+ = p^+_{1:k}$, respectively. When there is no ambiguity, we shall drop the dimension from the basis vectors and write $e^1_{i_1}$ as $e_{i_1}$, $e^{p}_{1:k}$ as $e_{1:k}$, etc.

Definition 2.1.
(i) $T_p = \{ x : x = \sum_{I_p} x_{i_1 \ldots i_k} e^{p}_{1:k} \}$, where $I_p$ is the index set $I_p = \{ i_1, \ldots, i_k : 1 \le i_j \le p_j,\ 1 \le j \le k \}$,
(ii) $T_{pq} = \{ X : X = \sum_{I_p \cup I_q} x_{i_1 \ldots i_k, j_1 \ldots j_l} e^{p}_{1:k} (d^{q}_{1:l})' \}$,
(iii) $T^{\otimes}_{pq} = \{ X \in T_{pq} : X = X_1 \otimes \cdots \otimes X_k, \text{ where } X_i : p_i \times q_i \}$,
(iv) $T^{\otimes}_{p} = \{ X \in T^{\otimes}_{pp} : X = X_1 \otimes \cdots \otimes X_k, \text{ where } X_i : p_i \times p_i \}$.

Note that the tensor space in (i) is described using vectors, whereas the one in (ii) is described using matrices.

The univariate, multivariate, and matrix normal distributions are well known. We may observe that a matrix-variate normal distribution, $X \sim N_{p,n}(\mu, \Sigma, \Psi)$, can be defined as $X = \mu + \Sigma^{1/2} U \Psi^{1/2}$, where $U = (u_{nl})$, $u_{nl} \sim N(0, 1)$, i.i.d. This can also be written as
$$\sum_{ij} X_{ij} e_i d_j' = \sum_{ij} \mu_{ij} e_i d_j' + \sum_{ik} \sum_{nl} \sum_{mj} \tau_{ik} \delta_{mj} u_{nl}\, e_i e_k' e_n d_l' d_m d_j',$$
where $\Sigma = \tau\tau'$, $\Psi = \delta\delta'$, and $e_i : p \times 1$, $d_j : n \times 1$ are unit basis vectors, which in turn equals
$$\sum_{ij} X_{ij} e_i d_j' = \sum_{ij} \mu_{ij} e_i d_j' + \sum_{ij} \sum_{km} \tau_{ik} \delta_{mj} u_{km}\, e_i d_j'.$$

Writing the basis vectors as a Kronecker product, i.e., $e_i d_j' \rightarrow d_j \otimes e_i$, the vec-representation of the matrix normal distribution is obtained. This leads to the following extension of the matrix-variate normal distribution.

Definition 2.2. A tensor $X$ is multilinear normal (MLN) of order $k$,
$$X \sim N_p(\mu, \Sigma),$$
if
$$x = \mu + \Sigma^{1/2} u,$$
where $x \in T_p$, $\mu \in T_p$, $\Sigma \in T^{\otimes}_p$, $p = (p_1, \ldots, p_k)$, and the elements of $u \in T_p$ are independent standard normally distributed. The square root $\Sigma^{1/2}$ can be any square root.

The dispersion matrix in Definition 2.2, $\Sigma \in T^{\otimes}_p$, expanded in terms of its component matrices, can be written as $\Sigma = \Sigma_{1:k} = \Sigma_1 \otimes \cdots \otimes \Sigma_k$. Moreover, $X \sim N_p(\mu, \Sigma)$ and $x \sim N_p(\mu, \Sigma)$ pertain to the same distribution.

Relieving Definition 2.2 of its basis vectors, a coordinate-free (vector-space) version of the multilinear normal distribution follows immediately as (see Wichura, 2006; Eaton, 2007)
$$x_{i_1 \ldots i_k} = \mu_{i_1 \ldots i_k} + \sum_{j_1, \ldots, j_k} \tau^1_{i_1 j_1} \tau^2_{i_2 j_2} \cdots \tau^k_{i_k j_k} u_{j_1 j_2 \ldots j_k},$$
where $\Sigma_i = \tau^i (\tau^i)' : p_i \times p_i$ with $\tau^i = (\tau^i_{kl}) : p_i \times p_i$.
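As a small illustration of Definition 2.2 (a sketch only, assuming the conventions above; the factor matrices, dimensions, and the helper name random_spd are invented for the example), one draw from the multilinear normal distribution can be simulated with a Kronecker structured square root built from Cholesky factors of the components of $\Sigma$.

    import numpy as np

    rng = np.random.default_rng(0)
    p = [2, 3, 4]                                # (p1, p2, p3), i.e. k = 3

    def random_spd(d, rng):
        """A random symmetric positive definite d x d matrix (example input)."""
        A = rng.standard_normal((d, d))
        return A @ A.T + d * np.eye(d)

    Sigmas = [random_spd(d, rng) for d in p]     # Sigma_1, ..., Sigma_k
    mu = rng.standard_normal(np.prod(p))         # mean in vectorized form

    # One valid square root of Sigma_1 (x) ... (x) Sigma_k is the Kronecker
    # product of square roots of the factors (here: Cholesky factors).
    L = np.linalg.cholesky(Sigmas[0])
    for S in Sigmas[1:]:
        L = np.kron(L, np.linalg.cholesky(S))

    u = rng.standard_normal(np.prod(p))          # iid N(0, 1) elements of u
    x = mu + L @ u                               # one draw from N_p(mu, Sigma)
    X = x.reshape(p)                             # the same draw as an order-3 tensor

Since $\mathrm{kron}(A, B)\,\mathrm{kron}(A, B)' = (AA') \otimes (BB')$, the matrix L above is indeed a square root of the Kronecker structured dispersion matrix, which is all that Definition 2.2 requires.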

Using the vector representation $x \in T_p$, we can conveniently write the pdf of the multilinear normal distribution, extending the pdf of the ordinary multivariate normal distribution.

Theorem 2.1. The density function of the multilinear normal distribution (Definition 2.2) is given by
$$f_X(x) = (2\pi)^{-p^*/2} \left( \prod_{i=1}^{k} |\Sigma_i|^{-p^*/(2p_i)} \right) \exp\left\{ -\frac{1}{2} (x - \mu)' \Sigma_{1:k}^{-1} (x - \mu) \right\}, \qquad (1)$$
where $\Sigma$ is positive definite, $x, \mu \in T_p$, and $\Sigma_{1:k} \in T^{\otimes}_p$.
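As a complement to Theorem 2.1, the sketch below (Python/NumPy; only an illustration, with the helper name mln_logpdf invented) evaluates the log of the density (1) without ever forming the $p^* \times p^*$ matrix $\Sigma_{1:k}$, by applying each $\Sigma_i^{-1}$ along its own mode of the tensor $x - \mu$.

    import numpy as np

    def mln_logpdf(x, mu, Sigmas):
        """Log of the density (1) for vectorized x, exploiting the Kronecker structure."""
        p = [S.shape[0] for S in Sigmas]
        p_star = int(np.prod(p))
        # log|Sigma_{1:k}| = sum_i (p*/p_i) log|Sigma_i|
        logdet = sum((p_star / d) * np.linalg.slogdet(S)[1]
                     for d, S in zip(p, Sigmas))
        # Quadratic form (x - mu)' Sigma_{1:k}^{-1} (x - mu): solve with one
        # factor at a time, working on the tensor reshaping of x - mu.
        R = (x - mu).reshape(p)
        for i, S in enumerate(Sigmas):
            Ri = np.linalg.solve(S, np.moveaxis(R, i, 0).reshape(p[i], -1))
            shape_i = [p[i]] + [d for j, d in enumerate(p) if j != i]
            R = np.moveaxis(Ri.reshape(shape_i), 0, i)
        quad = float((x - mu) @ R.reshape(-1))
        return -0.5 * (p_star * np.log(2 * np.pi) + logdet + quad)

Applied to quantities such as those in the previous sketch, the result agrees with the ordinary multivariate normal log-density evaluated with the explicitly formed Kronecker product covariance, but it avoids the $p^* \times p^*$ covariance matrix entirely.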

We close this section by giving some comments on a special matrix which will be used in Section 4.


Definition 2.3. The matrix $K_{s,r} \in T^{\otimes}_p$ is the tensor commutation operator, defined as an orthogonal matrix satisfying
$$K_{s,r}\, e^{p}_{1:k} = e^{p(1:s-1)}_{1:s-1} \otimes e^{p_r}_{r} \otimes e^{p(s+1:r-1)}_{s+1:r-1} \otimes e^{p_s}_{s} \otimes e^{p(r+1:k)}_{r+1:k}, \qquad s \le r,$$
i.e., $K_{s,r}$ interchanges basis vectors.

For notational convenience, we shall write $K_{s,r}\, e^{p}_{1:k} = e^{s,r}_{1:k}$.

The tensor commutation operator works along the same lines as the well-known commutation matrix, which is used to interchange the vectors in a Kronecker product of two vectors. Applying the tensor commutation operator to the vector $x \in T_p$, we write
$$K_{s,r}\, x = x^{s,r}.$$

The following two theorems about the properties of the tensor commutation operator follow directly from Definition 2.3 and from properties of the commutation matrix.

Theorem 2.2. Let $K_{s,r} \in T^{\otimes}_p$ be the tensor commutation operator. Then (i) $K_{s,r} = K_{r,s}'$, and (ii) $K_{s,r} K_{r,s} = I_{p^*}$.

Theorem 2.3. Let $\Sigma_{1:k} \in T^{\otimes}_p$, and let $K_{s,r} \in T^{\otimes}_p$ be the tensor commutation operator. Then
$$K_{s,r}\, \Sigma_{1:k}\, K_{r,s} = \Sigma_{1:s-1} \otimes \Sigma_r \otimes \Sigma_{s+1:r-1} \otimes \Sigma_s \otimes \Sigma_{r+1:k}.$$

Again, for notational convenience, we shall write $K_{s,r}\, \Sigma_{1:k}\, K_{r,s} = \Sigma^{s,r}_{1:k}$.
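A small numerical check of Definition 2.3 and Theorems 2.2-2.3 (Python/NumPy; an illustration under the conventions above, with the function name tensor_commutation and all dimensions invented): the operator is built explicitly as a permutation matrix and the factor swap is verified for a $k = 3$ example.

    import numpy as np
    from functools import reduce

    def tensor_commutation(p, s, r):
        """Permutation matrix K_{s,r} that swaps basis positions s and r (1-based)."""
        k, p_star = len(p), int(np.prod(p))
        perm = list(range(k))
        perm[s - 1], perm[r - 1] = perm[r - 1], perm[s - 1]
        p_perm = [p[q] for q in perm]
        K = np.zeros((p_star, p_star))
        for col in range(p_star):
            idx = np.unravel_index(col, p)                    # (i1, ..., ik), i1 slowest
            row = np.ravel_multi_index([idx[q] for q in perm], p_perm)
            K[row, col] = 1.0
        return K

    p, s, r = [2, 3, 4], 2, 3
    rng = np.random.default_rng(1)
    Sigmas = []
    for d in p:
        A = rng.standard_normal((d, d))
        Sigmas.append(A @ A.T + d * np.eye(d))                # Sigma_1, Sigma_2, Sigma_3

    K = tensor_commutation(p, s, r)
    Sigma = reduce(np.kron, Sigmas)                           # Sigma_{1:k}

    # Theorem 2.2: K_{s,r} is orthogonal.
    assert np.allclose(K @ K.T, np.eye(int(np.prod(p))))

    # Theorem 2.3: conjugating Sigma_{1:k} by K_{s,r} swaps the factors s and r.
    swapped = list(Sigmas)
    swapped[s - 1], swapped[r - 1] = swapped[r - 1], swapped[s - 1]
    assert np.allclose(K @ Sigma @ K.T, reduce(np.kron, swapped))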

3 Properties of the MLN distribution

In this section, we establish some properties of the multilinear normal distribution. The proof of the following theorem is trivial.

Theorem 3.1. Let $x \sim N_p(\mu, \Sigma)$, where $x \in T_p$, and let $A \in T^{\otimes}_{qp}$. Then
$$Ax \sim N_q(A\mu, A\Sigma A'),$$
where $A\Sigma A' \in T^{\otimes}_q$.

By appropriately choosing A in Theorem 3.1, several interesting special cases can be studied.


3.1 Marginal distributions

The matrix
$$M_r = \sum_{I_s \cup I_t} e^{s}_{1:k} (d^{t}_{1:l})', \qquad M_r \in T_{st},$$
facilitates the computation of several marginal distributions, represented as $M_r x \sim N_{p_m}(M_r \mu, M_r \Sigma M_r')$, where
$$I_s = \{ i_1, \ldots, i_k : 1 \le i_v \le s_v,\ 1 \le v \le k \}, \qquad (2)$$
$$I_t = \{ j_1, \ldots, j_l : 1 \le j_v \le t_v,\ 1 \le v \le l,\ r^1_w \le j_w \le r^2_w,\ r^1_w, r^2_w \text{ known} \}.$$

The index set $I_t$, by imposing restrictions on $M_r$ through $r^1_w$ and $r^2_w$, generates marginal distributions $M_r x$. In the sequel, we shall focus on the specific marginal distributions
$$x_{\bullet r_l} = M_{r_1} x, \qquad (3)$$
$$x^{\bullet r_l} = M_{r_2} x, \qquad (4)$$
where
$$M_{r_1} = \sum_{I_s \cup I_t} e^{s}_{1:k} (d^{t}_{1:l})', \qquad I_t = \{ j_1, j_2, \ldots, j_l : 1 \le j_r \le m \}, \qquad (5)$$
$$M_{r_2} = \sum_{I_s \cup I_t} e^{s}_{1:k} (d^{t}_{1:l})', \qquad I_t = \{ j_1, j_2, \ldots, j_l : m + 1 \le j_r \le p_r \}, \qquad (6)$$
and $I_s$ is given in (2). The following theorems specify certain independence conditions on the marginals.

Theorem 3.2. The normal variables $x_{\bullet r_l}$ and $x^{\bullet r_l}$ are independent if and only if $\Sigma^r_{12} = 0$, where $\Sigma^r_{12}$ is the upper right $m \times (p_r - m)$ partition of $\Sigma_r$ in $\Sigma_{1:k}$.

Proof: Because of normality, independence holds if and only if (for $r = 1$ or $r = k$, obvious modifications of the proof have to be made)
$$0 = C[x_{\bullet r_l}, x^{\bullet r_l}] = M_{r_1} C[x, x] M_{r_2}' = M_{r_1} \Sigma M_{r_2}' = (I_{p^*_{1:r-1}} \otimes (I : 0) \otimes I_{p^*_{r+1:k}})\, \Sigma_{1:k}\, (I_{p^*_{1:r-1}} \otimes (0 : I)' \otimes I_{p^*_{r+1:k}}) = \Sigma_{1:r-1} \otimes \Sigma^r_{12} \otimes \Sigma_{r+1:k}. \qquad (7)$$
Since $\Sigma_i$, $i = 1, 2, \ldots, k$, differ from zero, (7) holds if and only if $\Sigma^r_{12} = 0$.
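The factorization in (7) can be illustrated numerically (Python/NumPy; a sketch only, with the dimensions, the split point m, and all matrices invented): the cross-covariance of the two marginals is a Kronecker product whose middle factor is $\Sigma^r_{12}$, so it vanishes exactly when $\Sigma^r_{12} = 0$.

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(2)
    p = [2, 4, 3]                    # (p1, p2, p3); here r = 2 is the split mode
    r, m = 2, 2                      # first m slices of mode r versus the remaining p_r - m

    Sigmas = []
    for d in p:
        A = rng.standard_normal((d, d))
        Sigmas.append(A @ A.T + d * np.eye(d))
    Sigma = reduce(np.kron, Sigmas)                            # Sigma_{1:k}

    pr = p[r - 1]
    E1 = np.hstack([np.eye(m), np.zeros((m, pr - m))])         # (I : 0)
    E2 = np.hstack([np.zeros((pr - m, m)), np.eye(pr - m)])    # (0 : I)
    M1 = np.kron(np.eye(p[0]), np.kron(E1, np.eye(p[2])))      # picks the first marginal
    M2 = np.kron(np.eye(p[0]), np.kron(E2, np.eye(p[2])))      # picks the second marginal

    # Cross-covariance of the two marginals, as in (7).
    cross = M1 @ Sigma @ M2.T
    S12 = Sigmas[r - 1][:m, m:]                                # Sigma^r_{12}
    assert np.allclose(cross, np.kron(Sigmas[0], np.kron(S12, Sigmas[2])))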

Theorem 3.3. The normal variables $x_{\bullet r_l}$ and $x^{\bullet s_l}$ are not independent if $s \neq r$.

Proof: Studying (7), the statement of the theorem is evident, i.e., for any matrices $P_i$, $Q_i$, $i = 1, 2$, $(P_1 \otimes P_2)(Q_1 \otimes Q_2) = 0$ if and only if $P_i Q_i = 0$ for either $i = 1$ or $i = 2$.

Theorem 3.4. Let $A \in T^{\otimes}_{sp}$ and $B \in T^{\otimes}_{tp}$. Then $Ax$ and $Bx$ are independent if and only if $A\Sigma B' = 0$, i.e., if for some $r$, $A_r \Sigma_r B_r' = 0$.


3.2 Moments, characteristic function and cumulants

When comparing the multivariate normal distribution with the multilinear normal distribution, the difference lies in the structure of the parameter space generated by $\mu$ and $\Sigma$. The elements of both distributions are organized in vectors of non-repeated normal components. Thus, it is easy to imagine that moments for the multilinear normal distribution can be obtained from the multivariate ones. Indeed, the characteristic as well as the cumulant generating functions for the multilinear normal distribution follow immediately from those of the multivariate normal distribution.

Theorem 3.5. Let $x \sim N_p(\mu, \Sigma)$, where $x \in T_p$. The characteristic function of $x$ equals
$$\rho(t) = E[e^{i t' x}] = e^{i t'\mu - \frac{1}{2} t'\Sigma t}, \qquad t \in T_p,$$
and the cumulant generating function equals
$$\varphi(t) = \ln E[e^{i t' x}] = i t'\mu - \tfrac{1}{2} t'\Sigma t, \qquad t \in T_p.$$

To compute moments, we need a suitable differential operator (matrix derivative). Let $Y \in T_{pq}$ be a function of $X \in T_{rs}$, with vectorized versions $y$ and $x$, defined as
$$y = \sum_{i_{1:k_1}} \sum_{j_{1:k_2}} y_{i_{1:k_1} j_{1:k_2}}\, e^{q(1:k_2)}_{j_{1:k_2}} \otimes e^{p(1:k_1)}_{i_{1:k_1}}, \qquad x = \sum_{m_{1:k_3}} \sum_{n_{1:k_4}} x_{m_{1:k_3} n_{1:k_4}}\, d^{s(1:k_4)}_{n_{1:k_4}} \otimes d^{r(1:k_3)}_{m_{1:k_3}},$$
respectively. Then
$$\frac{dY}{dX} = \frac{dy}{dx} = \sum_{i_{1:k_1}} \sum_{j_{1:k_2}} \sum_{m_{1:k_3}} \sum_{n_{1:k_4}} \frac{\partial y_{i_{1:k_1} j_{1:k_2}}}{\partial x_{m_{1:k_3} n_{1:k_4}}} \left( d^{s(1:k_4)}_{n_{1:k_4}} \otimes d^{r(1:k_3)}_{m_{1:k_3}} \right) \left( e^{q(1:k_2)}_{j_{1:k_2}} \otimes e^{p(1:k_1)}_{i_{1:k_1}} \right)'. \qquad (8)$$
Higher order derivatives may be defined recursively, i.e.,
$$\frac{d^k Y}{dX^k} = \frac{d}{dX} \frac{d^{k-1} Y}{dX^{k-1}}.$$


Applying this operator to the characteristic function in Theorem 3.5 and evaluating at $t = 0$ gives
$$\left.\frac{d\rho(t)}{dt}\right|_{t=0} = \left.(i\mu - \Sigma t)\rho(t)\right|_{t=0} = i\mu,$$
which, on further differentiation, gives
$$\left.\frac{d^2\rho(t)}{dt^2}\right|_{t=0} = \left.\bigl(-\Sigma + (i\mu - \Sigma t)(i\mu - \Sigma t)'\bigr)\rho(t)\right|_{t=0} = -\Sigma - \mu\mu'.$$
Moreover, differentiating the cumulant generating function gives, similarly to the multivariate case, the above moments, and all cumulants of order higher than two equal 0.

Using the same differential operator directly on the pdf generates the Hermite polynomials, defined through
$$\frac{d^k f_X(x)}{dx^k} = (-1)^k H_k(x, \mu, \Sigma) f_X(x),$$
where $f_X(x)$ is given in (1) and $H_k(x, \mu, \Sigma)$ are the Hermite polynomials. Clearly, for $k = 0$,
$$\frac{d^0 f_X(x)}{dx^0} = f_X(x).$$

Theorem 3.6. Let $x \sim N_p(\mu, \Sigma)$, where $x \in T_p$. The Hermite polynomials $H_k(x, \mu, \Sigma)$, $k = 0, 1, 2$, are given by
$$H_0(x, \mu, \Sigma) = 1,$$
$$H_1(x, \mu, \Sigma) = \Sigma^{-1}(x - \mu),$$
$$H_2(x, \mu, \Sigma) = \Sigma^{-1}(x - \mu)(x - \mu)'\Sigma^{-1} - \Sigma^{-1}.$$

One possible application of Hermite polynomials is in Edgeworth expansions.
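A short sketch of the first two Hermite polynomials of Theorem 3.6 (Python/NumPy; an illustration only, with $\Sigma$ supplied as a full matrix for simplicity, e.g. as the explicit Kronecker product of its factors, and the function name hermite_1_2 invented):

    import numpy as np

    def hermite_1_2(x, mu, Sigma):
        """H_1 and H_2 of Theorem 3.6 for a vectorized observation x."""
        h1 = np.linalg.solve(Sigma, x - mu)              # Sigma^{-1} (x - mu)
        h2 = np.outer(h1, h1) - np.linalg.inv(Sigma)     # H_1 H_1' - Sigma^{-1}
        return h1, h2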

3.3 Conditional distributions

Having the joint and marginal densities, we can compute the conditional densities. Define
$$\mu_{\bullet r_l} = M_{r_1}\mu, \qquad (9)$$
$$\mu^{\bullet r_l} = M_{r_2}\mu, \qquad (10)$$
$$\Sigma^r_{1\bullet 2} = \Sigma^r_{11} - \Sigma^r_{12}(\Sigma^r_{22})^{-1}\Sigma^r_{21}, \qquad \Sigma_r = (\Sigma^r_{ij}). \qquad (11)$$


Theorem 3.7. Let $x \sim N_p(\mu, \Sigma)$, where $x \in T_p$, and let $x_{\bullet r_l}$, $x^{\bullet r_l}$, $\mu_{\bullet r_l}$, $\mu^{\bullet r_l}$ be as defined in (3), (4), (9), and (10), respectively. Then $x_{\bullet r_l} \mid x^{\bullet r_l}$ has the same distribution as
$$\mu_{\bullet r_l} + (I_{p^*_{1:r-1}} \otimes \Sigma^r_{12}(\Sigma^r_{22})^{-1} \otimes I_{p^*_{r+1:k}})(x^{\bullet r_l} - \mu^{\bullet r_l}) + (\Sigma_{1:r-1} \otimes \Sigma^r_{1\bullet 2} \otimes \Sigma_{r+1:k})^{1/2} u, \qquad (12)$$
where $u \sim N_p(0, I_{p^*})$.

Proof: The conditional distribution has to be normal. Therefore, only the conditional mean and dispersion are computed. Since
$$D[M_{r_1} x] = M_{r_1} \Sigma M_{r_1}', \qquad D[M_{r_2} x] = M_{r_2} \Sigma M_{r_2}', \qquad C[M_{r_1} x, M_{r_2} x] = M_{r_1} \Sigma M_{r_2}',$$
the conditional mean is
$$E[M_{r_1} x \mid M_{r_2} x] = M_{r_1}\mu + M_{r_1} \Sigma M_{r_2}' (M_{r_2} \Sigma M_{r_2}')^{-1} (M_{r_2} x - M_{r_2}\mu),$$
which is identical to (12). The conditional dispersion equals
$$D[M_{r_1} x \mid M_{r_2} x] = M_{r_1} \Sigma M_{r_1}' - M_{r_1} \Sigma M_{r_2}' (M_{r_2} \Sigma M_{r_2}')^{-1} M_{r_2} \Sigma M_{r_1}' = \Sigma_{1:r-1} \otimes \Sigma^r_{1\bullet 2} \otimes \Sigma_{r+1:k}.$$
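For concreteness, the parameters appearing in (12) can be assembled as follows (Python/NumPy; a sketch only, with the helper name conditional_params invented and $r$ given 1-based). The function returns the regression matrix acting on $x^{\bullet r_l} - \mu^{\bullet r_l}$ and the conditional dispersion matrix.

    import numpy as np
    from functools import reduce

    def conditional_params(Sigmas, r, m):
        """Regression matrix and conditional dispersion from (12)."""
        S_r = Sigmas[r - 1]
        S11, S12 = S_r[:m, :m], S_r[:m, m:]
        S21, S22 = S_r[m:, :m], S_r[m:, m:]
        S_1dot2 = S11 - S12 @ np.linalg.solve(S22, S21)          # Sigma^r_{1.2}
        left = reduce(np.kron, Sigmas[:r - 1], np.eye(1))        # Sigma_{1:r-1}
        right = reduce(np.kron, Sigmas[r:], np.eye(1))           # Sigma_{r+1:k}
        # Regression matrix I (x) Sigma^r_{12} (Sigma^r_{22})^{-1} (x) I, where the
        # identity matrices have dimensions p*_{1:r-1} and p*_{r+1:k}.
        reg = np.kron(np.eye(left.shape[0]),
                      np.kron(S12 @ np.linalg.inv(S22), np.eye(right.shape[0])))
        cond_cov = np.kron(left, np.kron(S_1dot2, right))
        return reg, cond_cov

The conditional mean in (12) is then obtained as $\mu_{\bullet r_l}$ + reg @ ($x^{\bullet r_l} - \mu^{\bullet r_l}$), and cond_cov is the dispersion matrix whose square root multiplies $u$.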

4 Inference

In the following, it is indicated how maximum likelihood estimators of the unknown parameters in the MLN distribution can be obtained. The likelihood equations for the covariance matrices $\Sigma_1, \ldots, \Sigma_k$ of $\Sigma_{1:k}$ will be derived. Moreover, to achieve a unique parametrization we set $\sigma^{(2)}_{p_2 p_2} = \sigma^{(3)}_{p_3 p_3} = \cdots = \sigma^{(k)}_{p_k p_k} = 1$, where $\Sigma_r = (\sigma^{(r)}_{ij})$, similarly to Srivastava et al. (2008).

Assume that there are $n$ independent tensor observations $X_j : \times_{i=1}^{k} p_i$, $j = 1, \ldots, n$, from $f_X(x)$ in (1), with vector representations $x_j$. From the joint likelihood function, one can easily see that $\mu$ will be estimated by averaging. Hence, in the subsequent computations, we assume $\mu = 0$ without any loss of generality. Then the likelihood function for $\Sigma_{1:k}$ is given by
$$L = L(\Sigma_{1:k}) = (2\pi)^{-np^*/2} \prod_{i=1}^{k} |\Sigma_i|^{-np^*/(2p_i)} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{n} x_j' \Sigma_{1:k}^{-1} x_j \right\}, \qquad (13)$$
which can also be written as
$$L = (2\pi)^{-np^*/2} \prod_{i=1}^{k} |\Sigma_i|^{-np^*/(2p_i)} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{n} \mathrm{tr}\left( \Sigma_1^{-1} (X^{(1)}_j)' \Sigma_{2:k}^{-1} X^{(1)}_j \right) \right\}, \qquad (14)$$
where
$$X^{(1)}_j = \sum_{I_p} x_{i_1 \ldots i_k} (e^2_{i_2} \otimes \cdots \otimes e^k_{i_k})(e^1_{i_1})' = \sum_{I_p} x_{i_1 \ldots i_k}\, e^{p_{2:k}}_{2:k} (e^1_{i_1})' : p^*_{2:k} \times p_1,$$
since $\mathrm{vec}\, X^{(1)}_j = x_j$. For simplicity, we will omit the superscript "(1)" and write the matrix as $X^{(1)}_j = X_j$. Now the trace in (14) can be rewritten as in Ohlson et al. (2010):
$$\mathrm{tr}\left\{ \Sigma_1^{-1} X_j' (\Sigma_2 \otimes \Sigma_{3:k})^{-1} X_j \right\} = \mathrm{tr}\left\{ \Sigma_1^{-1} X_j' \left( I_{p_2} \otimes \Sigma_{3:k}^{-1/2} \right) \left( \Sigma_2^{-1} \otimes I_{p^*_{3:k}} \right) \left( I_{p_2} \otimes \Sigma_{3:k}^{-1/2} \right) X_j \right\}$$
$$= \sum_{l=1}^{p^*_{3:k}} \mathrm{tr}\left\{ \Sigma_1^{-1} X_j' \left( I_{p_2} \otimes \Sigma_{3:k}^{-1/2} e^{p^*_{3:k}}_l \right) \Sigma_2^{-1} \left( I_{p_2} \otimes (e^{p^*_{3:k}}_l)' \Sigma_{3:k}^{-1/2} \right) X_j \right\} = \sum_{l=1}^{p^*_{3:k}} \mathrm{tr}\left\{ \Sigma_1^{-1} Y_{jl}' \Sigma_2^{-1} Y_{jl} \right\}, \qquad (15)$$
where
$$Y_{jl} = \left( I_{p_2} \otimes (e^{p^*_{3:k}}_l)' \Sigma_{3:k}^{-1/2} \right) X_j.$$

Hence, given $\Sigma_{3:k}$, we have $n\, p^*_{3:k}$ independent observations $Y_{jl}$, $j = 1, \ldots, n$, $l = 1, \ldots, p^*_{3:k}$.

Under the condition $\sigma^{(2)}_{p_2 p_2} = 1$, the likelihood equations for $\Sigma_1$ and $\Sigma_2$ follow from Srivastava et al. (2008):
$$\Sigma_1 = \frac{1}{p^*_{2:k}\, n} \sum_{j=1}^{n} \sum_{l=1}^{p^*_{3:k}} Y_{jl}' \Sigma_2^{-1} Y_{jl}, \qquad (16)$$
$$\Sigma_2 = \frac{1}{p_1 p^*_{3:k}\, n} \sum_{j=1}^{n} \sum_{l=1}^{p^*_{3:k}} Y_{jl} \Sigma_1^{-1} Y_{jl}'. \qquad (17)$$


Rewriting (16), we have
$$\Sigma_1 = \frac{1}{p^*_{2:k}\, n} \sum_{j=1}^{n} \sum_{l=1}^{p^*_{3:k}} Y_{jl}' \Sigma_2^{-1} Y_{jl} = \frac{1}{p^*_{2:k}\, n} \sum_{j=1}^{n} \sum_{l=1}^{p^*_{3:k}} X_j' \left( I_{p_2} \otimes \Sigma_{3:k}^{-1/2} e^{p^*_{3:k}}_l \right) \Sigma_2^{-1} \left( I_{p_2} \otimes (e^{p^*_{3:k}}_l)' \Sigma_{3:k}^{-1/2} \right) X_j = \frac{1}{p^*_{2:k}\, n} \sum_{j=1}^{n} X_j' \Sigma_{2:k}^{-1} X_j. \qquad (18)$$

Using the tensor commutation operator (Definition 2.3), we can also write the likelihood function as
$$L = (2\pi)^{-np^*/2} \prod_{i=1}^{k} |\Sigma_i|^{-np^*/(2p_i)} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{n} (x^{s,r}_j)' (\Sigma^{s,r}_{1:k})^{-1} x^{s,r}_j \right\}, \qquad (19)$$
and, for $s = 2$, $r \neq 1, 2$,
$$L = (2\pi)^{-np^*/2} \prod_{i=1}^{k} |\Sigma_i|^{-np^*/(2p_i)} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{n} \mathrm{tr}\left( \Sigma_1^{-1} (X^{2,r}_j)' (\Sigma^{2,r}_{2:k})^{-1} X^{2,r}_j \right) \right\}, \qquad (20)$$
where
$$X^{2,r}_j = \sum_{i_1, \ldots, i_k} x^{(j)}_{i_1 \ldots i_k}\, e^{2,r}_{2:k} (e^1_{i_1})' : p^*_{2:k} \times p_1.$$

The trace in (20) equals
$$\mathrm{tr}\left( \Sigma_1^{-1} (X^{2,r}_j)' (\Sigma^{2,r}_{2:k})^{-1} X^{2,r}_j \right) = \sum_{l=1}^{p^*_{2:k}/p_r} \mathrm{tr}\left( \Sigma_1^{-1} (Y^{2,r}_{jl})' \Sigma_r^{-1} Y^{2,r}_{jl} \right), \qquad (21)$$
where
$$Y^{2,r}_{jl} = \left( I_{p_r} \otimes (e^{p^*_{2:k}/p_r}_l)' (\Sigma^{2,r}_{2:k\setminus r})^{-1/2} \right) X^{2,r}_j,$$
$$\Sigma^{2,r}_{2:k\setminus r} = \Sigma_3 \otimes \cdots \otimes \Sigma_{r-1} \otimes \Sigma_2 \otimes \Sigma_{r+1} \otimes \cdots \otimes \Sigma_k, \qquad (22)$$
i.e., $\Sigma^{2,r}_{2:k\setminus r}$ is $\Sigma^{2,r}_{2:k}$ with $\Sigma_r$ deleted. We will also use the same notation for the basis vectors, where $e^{2,r}_{1:k\setminus r}$, given as
$$e^{2,r}_{1:k\setminus r} = e^1_{i_1} \otimes e^3_{i_3} \otimes \cdots \otimes e^{r-1}_{i_{r-1}} \otimes e^2_{i_2} \otimes e^{r+1}_{i_{r+1}} \otimes \cdots \otimes e^k_{i_k}, \qquad (23)$$
means $e^{2,r}_{1:k}$ with $e^r_{i_r}$ removed. Hence, given $\Sigma^{2,r}_{2:k\setminus r}$, we have $n\, p^*_{2:k}/p_r = n\, p^*_{2:r-1} p^*_{r+1:k}$ independent observations. Again, since $\sigma^{(r)}_{p_r p_r} = 1$, using the techniques in Srivastava et al. (2008), we have the following likelihood equations for $\Sigma_1$ and $\Sigma_r$:
$$\Sigma_1 = \frac{1}{p^*_{2:k}\, n} \sum_{j=1}^{n} \sum_{l=1}^{p^*_{2:k}/p_r} (Y^{2,r}_{jl})' \Sigma_r^{-1} Y^{2,r}_{jl}, \qquad (24)$$
$$\Sigma_r = \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} \sum_{l=1}^{p^*_{2:k}/p_r} Y^{2,r}_{jl} \Sigma_1^{-1} (Y^{2,r}_{jl})'. \qquad (25)$$

The following theorem can now be stated:

Theorem 4.1. The likelihood equations for maximizing the likelihood function (13) under the conditions $\sigma^{(2)}_{p_2 p_2} = \sigma^{(3)}_{p_3 p_3} = \cdots = \sigma^{(k)}_{p_k p_k} = 1$ are given by
$$\Sigma_1 = \frac{1}{p^*_{2:k}\, n} \sum_{j=1}^{n} X_j' \Sigma_{2:k}^{-1} X_j \qquad (26)$$
and, for $r = 2, \ldots, k$,
$$\Sigma_r = \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} (X^{2,r(r)}_j)' (\Sigma^{2,r}_{1:k\setminus r})^{-1} X^{2,r(r)}_j, \qquad (27)$$
where $X^{2,r(r)}_j = \sum_{I_p} x^{(j)}_{i_1 \ldots i_k}\, e^{2,r}_{1:k\setminus r} (e^r_{i_r})'$, and $\Sigma^{2,r}_{1:k\setminus r}$ and $e^{2,r}_{1:k\setminus r}$ are given analogously to (22) and (23).


Proof: $\Sigma_1$ is given in (18). We will now prove the expression for $\Sigma_r$. From (25), we have
$$\Sigma_r = \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} \sum_{l=1}^{p^*_{2:k}/p_r} Y^{2,r}_{jl} \Sigma_1^{-1} (Y^{2,r}_{jl})'$$
$$= \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} \sum_{l=1}^{p^*_{2:k}/p_r} \left( I_{p_r} \otimes (e^{p^*_{2:k}/p_r}_l)' (\Sigma^{2,r}_{2:k\setminus r})^{-1/2} \right) X^{2,r}_j \Sigma_1^{-1} (X^{2,r}_j)' \left( I_{p_r} \otimes (\Sigma^{2,r}_{2:k\setminus r})^{-1/2} e^{p^*_{2:k}/p_r}_l \right)$$
$$= \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} \sum_{I_p} \sum_{I'_p} x^{(j)}_{i_1 \ldots i_k} x^{(j)}_{i'_1 \ldots i'_k} (e^1_{i_1})' \Sigma_1^{-1} e^1_{i'_1} \sum_{l=1}^{p^*_{2:k}/p_r} \left( I_{p_r} \otimes (e^{p^*_{2:k}/p_r}_l)' (\Sigma^{2,r}_{2:k\setminus r})^{-1/2} \right) e^{2,r}_{2:k} (e^{2,r}_{2':k'})' \left( I_{p_r} \otimes (\Sigma^{2,r}_{2:k\setminus r})^{-1/2} e^{p^*_{2:k}/p_r}_l \right)$$
$$= \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} \sum_{I_p} \sum_{I'_p} x^{(j)}_{i_1 \ldots i_k} x^{(j)}_{i'_1 \ldots i'_k} (e^1_{i_1})' \Sigma_1^{-1} e^1_{i'_1} \sum_{l=1}^{p^*_{2:k}/p_r} e^r_{i_r} (e^r_{i'_r})' \otimes \left( (e^{p^*_{2:k}/p_r}_l)' (\Sigma^{2,r}_{2:k\setminus r})^{-1/2} e^{2,r}_{2:k\setminus r} \right) \left( (e^{2,r}_{2':k'\setminus r'})' (\Sigma^{2,r}_{2:k\setminus r})^{-1/2} e^{p^*_{2:k}/p_r}_l \right)$$
$$= \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} \sum_{I_p} \sum_{I'_p} x^{(j)}_{i_1 \ldots i_k} x^{(j)}_{i'_1 \ldots i'_k} (e^1_{i_1})' \Sigma_1^{-1} e^1_{i'_1}\, (e^{2,r}_{2:k\setminus r})' (\Sigma^{2,r}_{2:k\setminus r})^{-1} e^{2,r}_{2':k'\setminus r'}\, e^r_{i_r} (e^r_{i'_r})'$$
$$= \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} \sum_{I_p} \sum_{I'_p} x^{(j)}_{i_1 \ldots i_k} x^{(j)}_{i'_1 \ldots i'_k} \left( (e^1_{i_1} \otimes e^{2,r}_{2:k\setminus r})(e^r_{i_r})' \right)' \left( \Sigma_1 \otimes \Sigma^{2,r}_{2:k\setminus r} \right)^{-1} (e^1_{i'_1} \otimes e^{2,r}_{2':k'\setminus r'})(e^r_{i'_r})'$$
$$= \frac{1}{p^*_{1:r-1} p^*_{r+1:k}\, n} \sum_{j=1}^{n} (X^{2,r(r)}_j)' \left( \Sigma^{2,r}_{1:k\setminus r} \right)^{-1} X^{2,r(r)}_j,$$
since
$$\sum_{l=1}^{p^*_{2:k}/p_r} e^{p^*_{2:k}/p_r}_l (e^{p^*_{2:k}/p_r}_l)' = I_{p^*_{2:k}/p_r}.$$
Thus the proof is complete.

The likelihood equations (26) and (27) are nested, and no explicit solution to them exists. One way to solve these equations is to use the so-called flip-flop algorithm (Srivastava et al., 2008). We may also note that, by using the results of Srivastava et al. (2008), the algorithm converges to a unique solution provided there is enough data and $\sigma^{(2)}_{p_2 p_2} = \sigma^{(3)}_{p_3 p_3} = \cdots = \sigma^{(k)}_{p_k p_k} = 1$ is imposed as part of the starting values.
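To make the estimation procedure concrete, here is a minimal flip-flop sketch for tensor data (Python/NumPy). It is only an illustration of iterating updates of the form (26)-(27): each $\Sigma_r$ is refreshed from the mode-$r$ unfoldings of the centred observations, whitened by the current estimates of the remaining factors, and the constraint $\sigma^{(r)}_{p_r p_r} = 1$, $r = 2, \ldots, k$, holds for the starting values and is re-imposed by rescaling after every sweep. Function and variable names are invented, and convergence monitoring is kept deliberately simple.

    import numpy as np
    from functools import reduce

    def flip_flop_mle(X, n_iter=100, tol=1e-8):
        """Kronecker structured covariance MLE for centred tensor data.

        X has shape (n, p1, ..., pk); returns estimates of Sigma_1, ..., Sigma_k.
        """
        n, *p = X.shape
        k = len(p)
        p_star = int(np.prod(p))
        Sigmas = [np.eye(d) for d in p]              # starting values satisfy the constraint

        for _ in range(n_iter):
            old = [S.copy() for S in Sigmas]
            for r in range(k):
                # Inverse of the Kronecker product of all factors except Sigma_r,
                # ordered as Sigma_1 (x) ... (without Sigma_r) ... (x) Sigma_k.
                others = [np.linalg.inv(S) for i, S in enumerate(Sigmas) if i != r]
                W_inv = reduce(np.kron, others, np.eye(1))
                # Mode-r unfoldings of the observations: p_r x (p*/p_r) matrices.
                Xr = np.moveaxis(X, r + 1, 1).reshape(n, p[r], -1)
                S_new = sum(Xr[j] @ W_inv @ Xr[j].T for j in range(n))
                Sigmas[r] = S_new / (n * p_star / p[r])
            # Identification: divide Sigma_2, ..., Sigma_k by their last diagonal
            # element and move the scale into Sigma_1, leaving Sigma_{1:k} unchanged.
            for r in range(1, k):
                c = Sigmas[r][-1, -1]
                Sigmas[r] = Sigmas[r] / c
                Sigmas[0] = Sigmas[0] * c
            if max(np.max(np.abs(S - O)) for S, O in zip(Sigmas, old)) < tol:
                break
        return Sigmas

With data centred at the estimated mean (the average over the $n$ observations), this implements one common form of the flip-flop iteration for the nested equations (26) and (27).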

References

Arfken, G.B., and Weber, H.J. (1995). Mathematical methods for physicists. Fourth edition. Academic Press, New York.

Basser, P.J., and Pajevic, S. (2000). Statistical artifacts in diffusion tensor MRI (DT-MRI) caused by background noise. Magnetic Resonance in Medicine, 44, 41-50.

Basser, P.J., and Pajevic, S. (2003). A normal distribution for tensor-valued random variables: Applications to diffusion tensor MRI. IEEE Transactions on Medical Imaging, 22, 7, 785-794.

Basser, P.J., and Pajevic, S. (2007). Spectral decomposition of a 4th-order covariance tensor: Applications to diffusion tensor MRI. Signal Processing, 87, 220-236.

Bihan, D.L., Mangin, J.-F., Poupon, C., Clark, C.A., Pappata, S., Molko, N., and Chabriat, H. (2001). Diffusion tensor imaging: concepts and applications. Journal of Magnetic Resonance Imaging, 13, 534-546.

Burdick, D.S. (1995). An introduction to tensor products with applications to multiway data analysis. Chemometrics and Intelligent Laboratory Systems, 28, 229-237.

Comon, P. (2001). Tensor decompositions: State of the art and applications. In: McWhirter, J.G., and Proudler, I.K. (Eds.), Mathematics in signal processing, V. Oxford University Press, Oxford.

Comon, P. (2009). Tensors versus matrices: Usefulness and unexpected properties. IEEE/SP 15th Workshop on Statistical Signal Processing, Cardiff, UK.


Coppi, R., and Bolasco, S. (Eds.) (1989). Multiway data analysis. Elsevier, Amsterdam.

Carmeli, M., and Malin, S. (2000). Theory of spinors. World Scientific, Singapore.

Danielson, D.A. (2003). Vectors and tensors in engineering and physics. Westview Press, Cambridge, MA.

Dauxois, J., Romain, Y., and Viguier-Pla, S. (1994). Tensor products and statistics. Linear Algebra and its Applications, 210, 59-88.

Drygas, H. (1985). Linear sufficiency and some applications in multilinear estimation. Journal of Multivariate Analysis, 16, 71-84.

Eaton, M.L. (2007). Multivariate statistics: A vector space approach. Lecture Notes - Monograph Series, Vol. 53, Institute of Mathematical Statistics, Ohio.

Green, A.E., and Zerna, W. (1968). Theoretical elasticity. Second edition. Oxford University Press, Oxford.

Hext, G.R. (1963). The estimation of second-order tensors, with related tests and designs. Biometrika, 50, 3/4, 353-373.

Hoff, P.D. (2011). Hierarchical multilinear models for multiway data. Computational Statistics & Data Analysis, 55, 530-543.

Kaplan, E.L. (1952). Tensor notation and the sampling cumulants of k-statistics. Biometrika, 39, 3/4, 319-323.

Kollo, T., and von Rosen, D. (2005). Advanced multivariate statistics with matrices. Springer, New York.

Kroonberg, P.M. (2008). Applied multiway data analysis. Wiley, New York.

Kofidis, E., and Regalia, P.A. (2001). Tensor approximation and signal processing applications. In: Olshevsky, V. (Ed.), Structured Matrices in Mathematics, Computer Science, and Engineering I, Contemporary Mathematics, Vol. 280, 103-133. Proceedings of the AMS-IMS-SIAM Joint Summer Research Conference, University of Colorado, Boulder, June 27-July 1, 1999.

Kolda, T.G., and Bader, B.W. (2007). Tensor decompositions and applications. SANDIA Report, California.

Lawden, D.F. (1982). An introduction to tensor calculus, relativity, and cosmology. Third edition. Wiley, New York.

Leurgens, S., and Ross, R.T. (1992). Multilinear models: Applications in spectroscopy. Statistical Science, 7, 3, 289-319.


Lepore, N., Brun, C., Chou, Y.-Y., Chiang, M.-C., Dutton, R.A., Hayashi, K.M., Luders, E., Lopez, O.L., Aizenstein, H.J., Toga, A.W., Becker, J.T., and Thompson, P.M. (2008). Generalized tensor-based morphometry of HIV/AIDS using multivariate statistics of deformation tensors. IEEE Transactions on Medical Imaging, 27, 1, 129-141.

Liu, C., and Koike, K. (2007). Extending multivariate space-time geostatistics for environmental data analysis. Mathematical Geology, 39, 289-305.

Martín-Fernández, M., Westin, C.-F., and Alberola-López, C. (2004). 3D Bayesian regularization of diffusion tensor MRI using multivariate Gaussian Markov random fields. Lecture Notes in Computer Science, 351-359, Springer, New York.

McCullagh, P. (1984). Tensor notation and cumulants of polynomials. Biometrika, 71, 3, 461-476.

McCullagh, P. (1987). Tensor methods in statistics. Chapman & Hall, London.

Morse, P.M., and Feshbach, H. (1953). Methods of theoretical physics. McGraw-Hill, New York.

Northcott, D.G. (1984). Multilinear algebra. Cambridge University Press, Cambridge.

Oeding, L. (2008). Report on geometry and representation theory of tensors for computer science, statistics and other areas. Technical report, American Institute of Mathematics, Palo Alto, California.

Ohlson, M., Ahmad, M.R., and von Rosen, D. (2010). More on the Kronecker structured covariance matrix. Submitted to Communications in Statistics - Theory and Methods.

Pukelsheim, F. (1980). Multilinear estimation of skewness and kurtosis in linear models. Metrika, 27, 103-113.

Sato, Y., Sugawa, K., and Kawaguchi, M. (1979). The geometrical structure of the parameter space of the two-dimensional normal distribution. Reports on Mathematical Physics, 16, 1, 111-119.

Shashua, A., and Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany.

Skovgaard, L.T. (1984). A Riemannian geometry of the multivariate normal model. Scandinavian Journal of Statistics, 11, 4, 211-223.


Soize, C. (2007). Tensor-valued random fields for meso-scale stochastic model of anisotropic elastic microstructure and probabilistic analysis of representative volume element size. Probabilistic Engineering Mechanics, 23, 307-323.

Sokolnikoff, I.S. (1951). Tensor analysis: Theory and applications. Wiley, New York.

Srivastava, M., von Rosen, T., and von Rosen, D. (2008). Models with a Kronecker product covariance structure: estimation and testing. Mathematical Methods of Statistics, 17, 357-370.

Takemura, A. (1983). Tensor analysis of ANOVA decomposition. Journal of the American Statistical Association, 78, 384, 894-900.

Wichura, M.J. (2006). The coordinate-free approach to linear models. Cambridge University Press, Cambridge.
