
Applications of Common Principal Components in Multivariate and High-Dimensional Analysis

Toni Duras

Jönköping University

Jönköping International Business School
JIBS Dissertation Series No. 131, 2019


Doctoral Thesis in Statistics

Applications of Common Principal Components in Multivariate and High-Dimensional Analysis
JIBS Dissertation Series No. 131

© 2019 Toni Duras and Jönköping International Business School

Publisher: Jönköping International Business School, P.O. Box 1026, SE-55111 Jönköping, Tel.: +46 36 10 10 00, www.ju.se
Printed by BrandFactory AB 2019
ISSN 1403-0470
ISBN 978-91-86345-93-8


Acknowledgment

First, I would like to express my sincere gratitude to my advisor, Professor Thomas Holgersson, for his continuous support during my PhD studies and for his patience, motivation, enthusiasm, and knowledge. His guidance has helped me from the very beginning. I am very happy to have him as my advisor and mentor.

Besides my advisor, I would like to give special thanks to Professor Pär Sjölander and Professor Kristofer Månsson for their encouragement and insightful comments. You have always been there whenever I needed help or sought advice.

Thanks to all my friends and colleagues at JIBS for making me feel at home. Waking up in the morning and going to work has never been difficult with you as my colleagues.

Thank you to all my fellow PhD candidates: Pingjing Bo, Vincent Byusa, Miquel Correa, Orsa Kekezi, Emma Lappi, Amedeus Malisa, Helena Nilsson, Aleksandar Petreski, Jonna Rickardsson, and Yvonne Umulisa. Thank you for your support in our shared struggle.

Some special words of gratitude go to my friends Orsa Kekezi, Emma Lappi, Jonna Rickardsson and Dr. Tina Wallin. We spent a great deal of time together, countless hours filled with laughter and joy, and became really good friends. I look up to all of you in different ways. Who knew that it would take me over 25 years to find out that I had sisters?

Monica Bartels and Katarina Blåman, I am grateful for your willingness to help as soon as I had any administrative problems or questions. The school would not run particularly well without you two!

Professor Charlotta Melander, I genuinely appreciate all the encouragement and support you have given me. I remember, in particular, when you said that you would be happy if your kids had me as a role model. That comment really warmed my heart!

I really like football and lifting weights, and luckily, I share those passions with others at JIBS. Associate Professor Agostino Manduchi, Professor Kristofer Månsson, Professor Paul Nystedt, Associate Dean of Faculty Lars Pettersson, Mikael Radouch and Professor Pär Sjölander; I enjoy our daily discussions about football, and I see them as a necessity in my everyday life at JIBS, thank you. I would also like to thank my gym partner, Brofessor Johan Klaesson, for the motivation, friendly competition and banter in the holy temple of gains.

Last but not least, I would like to thank those who mean the most to me, my family: my parents Damir and Ulrica, and my brother Pierre. Thank you for supporting me no matter what I pursue in life. I love you very much.


Abstract

This thesis consists of four papers, all exploring some aspect of common principal component analysis (CPCA), the generalization of PCA to multiple groups. The basic assumption of the CPC model is that the space spanned by the eigenvectors is identical across several groups, whereas eigenvalues associated with the eigenvectors can vary. CPCA is used in essentially the same areas and applications as PCA.

The first paper compares the performance of the maximum likelihood and Krzanowski's estimators of the CPC model on two real-world datasets and in a Monte Carlo simulation study. The simplicity and intuition of Krzanowski's estimator, together with the findings in this paper, support and promote its use for CPC models over the maximum likelihood estimator. The second paper uses CPCA as a tool for imposing restrictions on system-wise regression models. The paper contributes to the field by proposing a variety of explicit estimators, deriving their properties and identifying the appropriate amount of smoothing that should be imposed on the estimator. In the third paper, a generalization of the fixed effects PCA model to multiple populations in a CPC environment is proposed. The model rests mainly on geometrical, rather than probabilistic, assumptions and is designed to account for any prior information about the noise in the data to yield better estimates, obtained by minimizing a least squares criterion with respect to a specified metric. The fourth paper surveys some properties of the orthogonal group and the associated Haar measure on it. It is demonstrated how seemingly abstract results contribute to applied statistics and, in particular, to PCA.


Contents

1. Introduction
   1.1 Aim of the thesis
   1.2 Outline of the thesis
2. Mathematical foundation
   2.1 The orthogonal group
   2.2 The singular value decomposition
   2.3 Wishart matrices
   2.4 The Haar measure
   2.5 Regression analysis in multivariate settings
3. Principal component analysis
   3.1 Principal component analysis in a single population
       3.1.1 Principal components in a population
       3.1.2 Principal components of a sample
       3.1.3 Asymptotic theory of the sample principal components
   3.2 Principal components in multiple populations
       3.2.1 The common principal component model
       3.2.2 Sample common principal components
       3.2.3 Asymptotic theory for common principal components
   3.3 Identification of the common principal component model
       3.3.1 Log-likelihood tests
       3.3.2 Akaike information criterion
       3.3.3 Bayesian information criterion
       3.3.4 Bootstrap vector correlation distribution
4. Summary of papers and future research
   4.1 Summary of the papers
   4.2 Future research
References

I  A comparison of two estimation methods for common principal components
   1 Introduction
   2 Preliminaries
     2.1 Common principal components analysis
     3.1 Maximum likelihood estimation
     3.2 Krzanowski's estimation of CPC
   4 Identification of common eigenvector models
   5 Empirical analysis
     5.1 Data and variables
     5.2 Common principal component analysis
   6 Simulation study
   7 Summary
   References
   A Appendix
     A.1 The algorithm for proportional covariance matrices
     A.2 Sample covariances and correlations of sample CPCs
     A.3 Tables

II  Common principal components with applications in regression
   1 Introduction
   2 Preliminaries
   3 Common principal component analysis
   4 Common principal component regression
     4.1 Univariate principal component regression
     4.2 Common principal components in seemingly unrelated regressions
     4.3 Smoothing and modeling criteria
   5 Summary
   References
   A Appendix

III  An extension of the fixed effects principal component model to a common principal component environment
   1 Introduction
   2 Preliminaries
   3 The fixed effects model for PCA
     3.1 Estimation
     3.2 Choice of metric M
     3.3 Choice of dimension q
   4 Extending the fixed effects PCA model to a CPC environment
     4.1 The fixed effects model for CPCA
   5 Model identification
     5.1 Log-likelihood tests
     5.2 Akaike information criterion
     5.3 Bayesian information criterion
     5.4 Bootstrap vector correlation distribution
   6 Summary
   References

IV  A small excursion on the Haar measure on Op
   A Orthogonal Weingarten functions

JIBS dissertation series

1. Introduction

Univariate statistical analysis is the simplest form of data analysis, as it considers one variable at a time. The essential characteristics of a univariate distribution are summarized by descriptive statistics such as measures of central tendency and variability. Most statistical research conducted before the second half of the 20th century was aimed at univariate problems. Multivariate statistical analysis, which is what much of this thesis concerns, addresses data consisting of a collection of measurements (variables) on a number of subjects or objects (observations). While the multivariate measures of central tendency and variability have analogous relevance, the essential aspect of multivariate analysis is the dependence between variables. The multivariate approach allows for investigating the joint behavior of variables, assessing how they influence each other and how they are related. Multivariate datasets are collected in a range of disciplines and are among the most common types of data encountered in science and practice today.

The early development of multivariate statistical analysis was based on the multivariate normal distribution, with work tracing back to several authors including Adrian (1809), Laplace (1811), Plana (1813), Gauss (1823), Bravais (1844) and Edgeworth (1892). However, it was after the publications on bivariate problems (correlation, regression, and homoscedasticity) by Galton (1886, 1889, 1894) that the theory and application of the multivariate normal distribution and multivariate analysis truly became popular. Various researchers contributed to the pioneering work of multivariate analysis. Pearson (1896, 1901) developed the famous Pearson product correlation coefficient for continuous variables and the foundation of principal component analysis (PCA). Fisher (1915) derived the distribution of Pearson's correlation coefficient, which was then generalized by Wishart (1928), who derived the distribution of the quadratic form of multivariate normal vectors, today known as the Wishart distribution, which is vital to multivariate statistics. In the 1930s, multivariate analysis flourished with many important publications, sparking new fields of multivariate analysis, such as multivariate distance (Mahalanobis, 1930, 1936), generalized variances (Wilks, 1932), Wilks' Λ (a term coined by Rao (1948)), principal component analysis (Pearson, 1901, Hotelling, 1933), canonical correlation analysis (Hotelling, 1936), discriminant analysis (Bose, 1936, Fisher, 1936, 1938), derivation of exact distributions for different statistics (Wishart, 1928, Hotelling et al., 1931, Bartlett, 1934a, Cochran, 1934, Bose, 1936, Bose and Roy, 1938), distributions of eigenvalues (Fisher, 1939, Girshick, 1939, Hsu, 1939, Roy, 1939), and multivariate testing (Wilks, 1932, Bartlett, 1934b, 1939). By the 1950s, multivariate analysis was considered to play an important role in the statistician's toolbox, and the first books on the subject appeared, e.g., Roy (1957), Kendall (1957) and especially Anderson (1958). The field of multivariate statistical analysis has since continued to grow and prosper, and today there are many books written on the topic, for example, Srivastava and Khatri (1979), Mardia et al. (1979), Fang and Zhang (1990), Anderson (2003), Kollo and von Rosen (2006) and Muirhead (2009). The reader is referred to von Rosen (2018) for a more comprehensive description of the development of multivariate statistical analysis.

PCA is one of the oldest and best-known techniques for multivariate analysis and is essential to the work in this thesis. The idea is to reduce the dimensionality of a dataset of correlated variables while retaining as much as possible of the variation in the dataset. This is achieved by transforming the dataset into new uncorrelated variables, ordered so that the first few contain most of the variation present in the original variables. PCA is considered to have been developed by Pearson (1901) and Hotelling (1933). Pearson (1901) tackled the geometric problem of representing data points with the best-fitting lines and planes, where the fit is measured by the sum of the squared perpendicular distances from the points to the plane. Hotelling (1933) studied whether a set of variables, usually correlated, could be represented by another set of independent variables, possibly of smaller dimension. The asymptotic distributions of the sample principal components have been derived by Girshick (1939), Lawley (1953, 1956) and Anderson (1963).

PCA is applicable in many different statistical situations. In multiple regression, multicollinearity causes problems with interpreting and estimating parameters; instead of performing the regression on the original data, these problems can be dealt with by performing the regression on the principal components. As the principal components are uncorrelated, the issue of multicollinearity disappears, and the interpretation is facilitated (Mansfield et al., 1977, Gunst, 1983). Principal components can also be used to detect outliers, i.e., observations that are extreme in at least one dimension or that represent extreme combinations of variables; via graphical inspection of the first few and/or the last few principal components, such observations can be detected (Gnanadesikan and Kettenring, 1972, Besse, 1994a). Principal components have also proven useful in assessing the normality of random vectors (Srivastava, 1984). PCA can further be used to identify clusters among the variables in a dataset. If the variables have a cluster structure, then there will be one principal component with high variance and some with low variance associated with each cluster (Rao, 1964). Clusters help identify the structure of the data, possibly allowing the dimension of the data to be reduced by keeping one principal component for each cluster without losing too much information. For a profound and detailed description of PCA, its applications and properties, see Jolliffe (2011).

The generalization of PCA to multiple populations is called common principal component analysis (CPCA) and was developed by Flury (1984, 1988). It assumes that the principal component transformation is identical in each population, while the variance associated with the components may differ among populations. That is, the covariance matrix of each population has an identical set of eigenvectors while allowing for population-specific eigenvalues. CPCA is essentially used in the same areas and applications as its one-population predecessor. However, the generalization of one-population PCA to multiple populations comes at the cost of being more mathematically involved. CPCA has received surprisingly little attention in the literature, and many problems in the field remain to be solved.
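To make the principal component regression idea mentioned above concrete, the following is a minimal sketch in Python (numpy only). All data, the true coefficient vector and the number of retained components k are invented for illustration and are not taken from the thesis; the sketch simply regresses on the leading component scores and maps the coefficients back to the original variables.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, k = 200, 5, 2                             # k retained components (illustrative)

# synthetic collinear regressors: last column is nearly a sum of the first two
X = rng.standard_normal((n, p))
X[:, -1] = X[:, :2].sum(axis=1) + 0.01 * rng.standard_normal(n)
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = X @ beta_true + rng.standard_normal(n)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)                         # sample covariance of the regressors
eigvals, Pi = np.linalg.eigh(S)
Pi = Pi[:, np.argsort(eigvals)[::-1][:k]]       # first k eigenvectors (PC coefficients)

Z = Xc @ Pi                                     # uncorrelated component scores
gamma = np.linalg.lstsq(Z, y - y.mean(), rcond=None)[0]
beta_pcr = Pi @ gamma                           # back-transform to the original scale
print(np.round(beta_pcr, 2))
```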

PCA and CPCA depend only on the elliptical geometry of the underlying distributions (Hallin et al., 2014). Therefore, estimating and identifying structures of covariance matrices is certainly important. The estimation and comparison of multiple populations generally involve assumptions about their variances. In many cases, the variance relationship among populations is either tested or assumed to be equal. In univariate cases, this usually makes sense, since the variances cannot exhibit any complex relationship; they are either equal or not. In multivariate cases, the covariance matrices contain information about both the variance of each variable and the covariances among variables, thus allowing for much more complicated relationships among covariance matrices. The test of equality is often rejected even when the covariance matrices exhibit some similarities, as they likely have a more complex relationship than the two extreme cases of being equal or unrelated. The estimation of covariance matrices plays a significant role in many multivariate statistical methods, and identifying a fitting relationship among covariance matrices helps improve the estimation and is therefore highly important.

The similarities of covariance structures of several populations are summarized in the literature as Flury's hierarchy of models (Flury, 1988). This hierarchy includes five model levels: (i) equality; (ii) proportionality; (iii) the common principal component (CPC) model, where the eigenvectors are shared across populations; (iv) the partial common principal component model, where only a subset of the eigenvectors is shared across populations; and (v) unrelated covariance matrices. Of course, no statistical model is universally acceptable for any data or purpose. As Box and Draper (1987) stated, "all models are wrong but some are useful"; how, then, do we identify useful models? An improper model choice can lead to disappointing performance and misleading conclusions, and there are numerous methods available for identifying the most suitable model (see, e.g., Bozdogan (1987), Sclove (1987), Rao et al. (2001), Pepler et al. (2016)). In line with the principle of parsimony, if two or more competing models provide a good fit to the data at hand, then the model with the fewest parameters to estimate should be selected. Thus, a balance between goodness of fit and parsimony is sought. Akaike's information criterion (AIC, Akaike (1973)) and the Bayesian information criterion (BIC, also known as the Schwarz information criterion (Schwarz, 1978)) are the most commonly used criteria for model selection. Flury (1988) proposed and used log-likelihood ratio tests (chi-squared tests) and the AIC (Akaike, 1973). These methods all have their foundation in maximum likelihood theory, assuming normally distributed populations, which is unsupported by many real datasets. Several nonparametric methods without distributional assumptions are available in the literature. The reader is referred to Pepler (2014) and Pepler et al. (2016) for a detailed review of the particular case of identifying similarities among several covariance matrices.

In this thesis, the CPC model (the third level in Flury's hierarchy) is of main interest. The CPCs have commonly been estimated with maximum likelihood (ML) theory, relying on the assumption of multivariate normality, which does not hold for many real-world datasets. No explicit form of the ML estimator exists; rather, an algorithm is required to solve the ML equations. Flury and Gautschi (1986) proposed an iterative numerical algorithm (FG) that requires considerable computational power. It loops through all pairs of columns of the eigenvector matrix to complete a single iteration, which makes it both inefficient and sometimes infeasible in higher-dimensional settings (Browne and McNicholas, 2014). Krzanowski (1984) proposed a simple approximation of the CPCs that requires only standard computer packages for PCA. If the population covariance matrices have the same set of eigenvectors, then any weighted sum (or product) of the marginal covariance matrices does too. Thus, the CPC coefficients can be estimated by performing an SVD of any weighted sum (or product) of the sample covariance matrices. This estimator has a simple functional form, is intuitively appealing, and is readily available. Robust methods have been proposed by Boente and Orellana (2001), Boente et al. (2002) and Boente et al. (2009), still relying on some distributional assumptions. Hallin et al. (2014) point out that using robust methods comes with a loss of efficiency and that any covariance-based estimation method is known to be fragile. Instead, they provide a rank-based estimator (R-estimator) of the eigenvectors that is root-n consistent and asymptotically normal under any elliptical density, regardless of any moment assumptions, and also efficient under some prespecified elliptical densities.
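As a minimal sketch of the Krzanowski-style pooling described above (not the thesis's own code), the function below eigendecomposes a weighted sum of sample covariance matrices and reads off group-specific eigenvalues; the function name, the weights and the toy matrices are invented for illustration.

```python
import numpy as np

def krzanowski_cpc(covs, weights=None):
    """Approximate common principal components by eigendecomposing a
    weighted sum of the sample covariance matrices (Krzanowski-style)."""
    p = covs[0].shape[0]
    if weights is None:
        weights = np.ones(len(covs))
    pooled = sum(w * np.asarray(S) for w, S in zip(weights, covs))
    eigvals, eigvecs = np.linalg.eigh(pooled)          # ascending order
    Q = eigvecs[:, np.argsort(eigvals)[::-1]]          # estimated common eigenvectors
    # group-specific eigenvalues: diagonal of Q' S_g Q for each group
    lambdas = np.array([np.diag(Q.T @ S @ Q) for S in covs])
    return Q, lambdas

# toy example: two groups sharing eigenvectors but not eigenvalues
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))       # common eigenvectors
S1 = V @ np.diag([5.0, 3.0, 1.0, 0.5]) @ V.T
S2 = V @ np.diag([4.0, 0.8, 2.5, 0.3]) @ V.T
Q, lam = krzanowski_cpc([S1, S2], weights=[50, 80])
print(np.round(lam, 2))                                # per-group eigenvalues
```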

An important application of PCA is within regression analysis. Regression analysis is a technique for modeling and analyzing several variables when one variable can be seen as dependent upon the other variables, often with the aim of establishing the causal effect of one variable on another (the dependent one). Regression analysis is used today in almost any field dealing with data, for example, economics, psychology, sociology, political science, and biology, to name a few. The first building blocks of regression were published in the early 19th century by Legendre (1805) and Gauss (1809), who were trying to establish the orbits of bodies around the sun using the method of least squares. The word regression was first used by Galton (e.g., Galton (1886, 1890)), with a focus only on biology. Yule (1897) and Pearson (1903) extended Galton's work to a general statistical setting while assuming the joint distribution of the variables to be normal. Fisher (1922, 1925) loosened the normality assumption for the joint distribution by assuming that only the conditional distribution of the dependent variable is normal (Aldrich et al., 2005). Currently, regression analysis and its methods are widespread and have been developed to handle very different types of data, for example, methods for addressing regressions with various types of missing data; regressions where the response variables are graphs, images, or curves; regressions for time series; multivariate regressions; nonparametric and Bayesian regression methods; and regressions with more response variables than observations. Important literature on regression analysis includes Amemiya (1985), Judge et al. (1988), Greene (2003), Hair et al. (2013) and Draper and Smith (2014).

Since the principal component transformation is defined by an often random orthogonal matrix with normalized eigenvectors as columns, results from random matrix theory (RMT) are relevant. The early development of random matrix theory was made by Wishart (1928) and Wigner (1951, 1955), and the field has exploded in activity in recent years (Tao, 2012). The purpose of RMT is to understand the properties of matrices with random elements drawn from some probability distribution, referred to as random matrix ensembles, usually through statistics of the matrix eigenvalues. For example, PCA is essentially concerned with eigenvectors and eigenvalues of a covariance matrix: the coefficients of the first component are the elements of the normalized eigenvector corresponding to the largest eigenvalue, and the variance of the first principal component is the largest eigenvalue. One concern is to understand the properties of the population covariance matrix from the observed sample covariance matrix, which is random. If the sample data are normally distributed, the observed sample covariance matrix follows a Wishart distribution with the population covariance matrix as the scale matrix. Then, the joint distribution of the eigenvalues of the observed sample covariance matrix is known (e.g., Theorem 3.2.18 in Muirhead (2009)) and depends on the population matrix only through its eigenvalues. Thus, the eigenvalues of the observed sample covariance matrix are very important. However, the distribution is generally not easily derived, as it involves an integral over the orthogonal group. For more in-depth reading on the topic, see Mehta (2004), Bai and Silverstein (2010) and Tao (2012).

The last couple of decades have seen rapid developments in computer science. The storage capacity and processing speed of computers have skyrocketed. This has led to easier and inexpensive ways of collecting and storing large amounts of data, to the point that it is becoming difficult to analyze them. Such data are commonly referred to as high-dimensional data and have dimensions much larger than the cases found in classical multivariate statistical analysis. For instance, in genetics, physics, wireless communication, finance, health and computer facial recognition, the case of high dimension (p) and large samples (n) applies. The goals of analyzing high-dimensional data and multivariate data are often the same; what separates the two is the way the goal is reached. The advancements in computer technology have helped in many aspects, but they also pose new challenges for statisticians. When p > n, many of the classical methods and limit theorems are no longer valid, and a new statistical toolkit is required to achieve the goals. For example, covariance matrices become singular when p > n and cannot be inverted as many multivariate methods demand. According to Bai and Silverstein (2010), the parameter estimation of structured covariance matrices is poor when the number of parameters is more than four, and if the regressors in a linear regression are more than six and considered random (or contain measurement errors), then the estimates will be far from the theoretical ones unless the sample size is very large. Assume the data consist of p = 20 parameters and a sample size of n = 100. We can consider the asymptotic setup as p being fixed and n tending to infinity, as p = 2√n, or as p = 0.2n, and each case has its own limiting behavior. Asymptotics in which the dimension and sample size increase simultaneously such that their ratio tends to some positive constant, i.e., p/n → c (the Kolmogorov condition), are referred to as (n, p)-asymptotics (Fujikoshi et al., 2011), increasing dimension asymptotics (Serdobol'skii, 1999), or p/n-asymptotics (Kollo et al., 2011). The question is which result is closer to the problem at hand.

1.1. Aim of the thesis

As stated in the introduction, in many fields today (e.g., genetics, physics, computer science, and finance), it is common to collect large datasets with high dimension (p) and large sample size (n), to the point that it is difficult to analyze them. To be able to conduct proper statistical inference about the population of interest, some form of data reduction is necessary. The aim of this thesis is to contribute to multivariate statistical analysis by demonstrating how results based on the multipopulation generalization of PCA (CPCA) can be used to reduce the parameter space in multivariate and high-dimensional models. PCA is one of the best-known techniques for multivariate analysis and is used in a wide range of statistical applications, with implementations available in practically every statistical computer package. It is thus surprising to learn that CPCA has received little attention in much of the statistical literature and that many problems and properties remain unsolved. Since CPCA is essentially applied in the same areas as its one-population counterpart, it plays an important role in multivariate analysis, which has led us to further investigate the field. This thesis contributes to the quest of reducing the parameter space by proposing a new way of reducing the parameter space of the regressors in seemingly unrelated regression models, presenting a new model-based approach for estimating CPCA, and using results on the Haar measure to project matrices from the orthogonal p × p space down to lower-dimensional spaces (e.g., R^p and R).

1.2. Outline of the thesis

The thesis is composed of two parts. The first part consists of five chapters. The first chapter contains the introduction to the thesis along with its aim and outline. The following chapter presents important results in the research area to facilitate the understanding and reading of the papers presented in the second part of the thesis. In the third chapter, the theoretical framework of PCA and CPCA and the identification of the CPC model are described. The last chapter of part one summarizes the papers in the thesis and briefly discusses future research areas. The second part of the thesis contains the four papers: (i) A Comparison of Two Estimation Methods for Common Principal Components, (ii) Common Principal Components with Applications in Regression, (iii) An Extension of the Fixed Effects Principal Component Model to a Common Principal Component Environment, and (iv) A Small Excursion on the Haar Measure on Op.


2. Mathematical foundation

2.1. The orthogonal group

In this section, the orthogonal group is described. Since eigenvector matrices are orthogonal and used in creating principal components, this group is essential for all papers in the thesis.

Definition 1 (Orthogonal group). The set of orthogonal p × p matrices is denoted O_p and defined as O_p = {O ∈ R^{p×p} : O′O = OO′ = I_p}. O_p forms a group, with matrix multiplication as the group operation. The elements of O ∈ O_p can be regarded as the coordinates of a point on a p(p − 1)/2-dimensional surface in R^{p²}, and this surface is a subset of the sphere of radius p^{1/2} in R^{p²} (Muirhead, 2009).

Property 1 (Continuity (Bai and Silverstein, 2010)). O_p forms a compact topological group, where the topology is given by identifying O_p as a subspace of R^{p²}. The mappings f_1 : O_p × O_p → O_p and f_2 : O_p → O_p defined by f_1(O_1, O_2) = O_1 O_2 and f_2(O) = O^{−1} = O′ are continuous.

Definition 2. For any matrices A ∈ F^{n×p} and B ∈ F^{m×q}, where F is a field such as R or C, their Kronecker product is denoted A ⊗ B and is defined as
\[
A \otimes B =
\begin{pmatrix}
a_{11}B & a_{12}B & \cdots & a_{1p}B \\
a_{21}B & a_{22}B & \cdots & a_{2p}B \\
\vdots & \vdots & & \vdots \\
a_{n1}B & a_{n2}B & \cdots & a_{np}B
\end{pmatrix}
\in F^{nm \times pq}.
\]

Theorem 1. If A ∈ O_p and B ∈ O_k, then A ⊗ B ∈ O_{kp} and B ⊗ A ∈ O_{kp}.

Proof. (A ⊗ B)(A ⊗ B)′ = AA′ ⊗ BB′ = I_p ⊗ I_k = I_{kp} and (A ⊗ B)′(A ⊗ B) = A′A ⊗ B′B = I_p ⊗ I_k = I_{kp}; thus, A ⊗ B ∈ O_{kp}. Similarly, B ⊗ A ∈ O_{kp}.
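Theorem 1 is easy to check numerically. The following minimal sketch (numpy; the dimensions 3 and 2 are arbitrary illustrative choices) draws two orthogonal matrices via QR decompositions of Gaussian matrices and verifies that their Kronecker product is orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # A in O_3
B, _ = np.linalg.qr(rng.standard_normal((2, 2)))   # B in O_2
C = np.kron(A, B)                                  # should lie in O_6
print(np.allclose(C @ C.T, np.eye(6)), np.allclose(C.T @ C, np.eye(6)))
```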

2.2. The singular value decomposition

The singular value decomposition (SVD) is one of the most useful matrix decompositions, used in many statistical applications such as singular spectrum analysis in time series, matrix approximation in dimension reduction techniques, least squares approximation of a square matrix by a scalar multiple of an orthogonal or unitary matrix, and more. It is also useful as a computational tool for calculating various quantities (Seber, 2008) and is used in all papers of this thesis. The SVD can factorize both real and complex matrices, but only real matrices are covered in this thesis.

Definition 3 (Eigenvalues and eigenvectors). Let A be a p × p real matrix. The eigenvalues (also known as characteristic roots) of A are the roots of the characteristic polynomial c(λ) = det(A − λI_p). Sometimes f(λ) = det(λI_p − A) = (−1)^p det(A − λI_p) is used as the characteristic polynomial, since λ^p then has coefficient 1. Eigenvalues are usually ordered by the magnitude of their absolute values, |λ_1| ≥ · · · ≥ |λ_p| ≥ 0.

For every λ_j, there exists a nonzero vector π such that

Aπ = λ_j π.

π is called an eigenvector associated with λ_j and is usually normalized to have unit length. Eigenvectors corresponding to distinct eigenvalues are unique (up to a constant) and linearly independent. Eigenvectors of eigenvalues with multiplicity greater than one are no longer unique and can be either linearly independent or dependent.

With the eigenvalues and eigenvectors defined, the SVD can be established.

Definition 4 (Singular value decomposition (Seber, 2008)). Any real n × p matrix A of rank r ≤ min{n, p} = m can be expressed as
\[
A_{n\times p} = U_{n\times n}\,\Lambda^{1/2}_{n\times p}\,\Pi'_{p\times p},
\]
where U ∈ O_n and Π ∈ O_p, and Λ = diag(λ_{ij}) with λ_{11} ≥ λ_{22} ≥ · · · ≥ λ_{rr} > 0 = λ_{r+1,r+1} = · · · = λ_{mm} and λ_{ij} = 0 for all i ≠ j. The λ_{ii} (i = 1, 2, . . . , m) are the eigenvalues of A′A (equivalently of AA′), and their square roots are the singular values of A. The columns of U and Π are the orthonormalized eigenvectors of AA′ and A′A, respectively, ordered by the magnitude of the corresponding eigenvalues.

The spectral decomposition, defined below, follows from the SVD, and recasts a matrix in terms of its eigenvalues and eigenvectors.

Definition 5 (Spectral decomposition). Let A be any p × p real symmetric matrix with eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_p and corresponding orthonormal eigenvectors π_1, π_2, . . . , π_p. Then,
\[
A = \Pi\Lambda\Pi' = \sum_{i=1}^{p} \lambda_i \pi_i \pi_i',
\]
where Π_{p×p} = (π_1, π_2, . . . , π_p) and Λ_{p×p} = diag(λ_1, λ_2, . . . , λ_p).

In fact, the eigenvector matrix Π is the matrix of coefficients for the principal components, as will be discussed in more detail in Chapter 3.
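The following minimal numerical sketch (numpy; the matrix dimensions are arbitrary illustrative choices) checks Definitions 4 and 5: it reconstructs a rectangular matrix from its SVD and verifies that the eigenvalues of A′A are the squared singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

# singular value decomposition: A = U diag(s) Vt
U, s, Vt = np.linalg.svd(A, full_matrices=True)
S = np.zeros((6, 4))
np.fill_diagonal(S, s)
print(np.allclose(A, U @ S @ Vt))            # reconstruction

# spectral decomposition of the symmetric matrix A'A
eigvals, Pi = np.linalg.eigh(A.T @ A)        # ascending order
eigvals, Pi = eigvals[::-1], Pi[:, ::-1]     # reorder to descending
print(np.allclose(A.T @ A, Pi @ np.diag(eigvals) @ Pi.T))
print(np.allclose(np.sqrt(eigvals), s))      # singular values = sqrt of eigenvalues
```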


2.3. Wishart matrices

The Wishart distribution was first derived by Wishart (1928). It is the multivariate extension of the chi-square distribution and shares many properties with that distribution. The Wishart distribution is one of the most important distributions in multivariate analysis and is widely used; for example, it appears in many test statistics, and the sample covariance matrix of a normally distributed sample with independent observations follows a Wishart distribution. The Wishart distribution is covered in most theoretical books on multivariate analysis; for some general references, see Srivastava and Khatri (1979), Gupta and Nagar (1999), Anderson (2003), Kollo and von Rosen (2006) and Muirhead (2009). This section presents the definition together with a number of useful properties and moments.

Definition 6. Let X denote an n × p matrix distributed as N_{n,p}(0, I_n, Σ), where n ≥ p and Σ is positive definite. The matrix W = X′X is then distributed according to the Wishart distribution with n degrees of freedom and scale matrix Σ, denoted W ∼ W_p(n, Σ).

In the special case when the mean of X is M ≠ 0, W follows a noncentral Wishart distribution, denoted W_p(∆, n, Σ), where ∆ = MM′.

It is clear from the above definition that the sample covariance matrix of X is Wishart-distributed.

Theorem 2 (Theorem 2.4.2 in Kollo and von Rosen (2006)). Let W ∼ W_p(∆, n, Σ) and A ∈ R^{q×p}. Then,
\[
AWA' \sim W_q(A\Delta A', n, A\Sigma A').
\]
From this theorem, we can derive the following corollaries.

Corollary 1 (Mardia et al. 1979, p. 67). Let W ∼ W_p(n, Σ) and let a ∈ R^p be such that a′Wa ≠ 0. Then, it follows from Theorem 2 that

(i) Σ^{−1/2} W Σ^{−1/2} ∼ W_p(n, I_p),
(ii) a′Wa ∼ W_1(n, a′Σa),
(iii) a′Wa / a′Σa ∼ χ²_n.

Corollary 2 (Corollary 2.4.2.1 in Kollo and von Rosen (2006)). Let W ∼ W_p(n, I_p) and let O ∈ O_p be independent of W. Then W and OWO′ have the same distribution.

Corollary 3 (Corollary 2.4.2.2 in Kollo and von Rosen (2006)). Let W_{ii}, i = 1, . . . , p, denote the diagonal elements of W ∼ W_p(∆, n, cI_p). Then (1/c) W_{ii} ∼ χ²(n, δ_{ii}), where ∆ = (δ_{ij}), and W_{ii} is independent of W_{jj} if i ≠ j.

Theorem 3 (Corollary 2.4.3.1 in Kollo and von Rosen (2006)). Let X ∼ N_{n,p}(M, I_n, Σ) and let Q be an n × n symmetric and idempotent matrix such that QM = 0. Then,
\[
X'QX \sim W_p(\operatorname{rank}(Q), \Sigma).
\]
In particular, let X ∼ N_{n,p}(1_n µ′, I_n, Σ), where µ is a p-dimensional vector, and let Q = I_n − (1/n) 1_n 1_n′. The matrix Q has rank n − 1 and is idempotent and symmetric. Then X′QX equals (n − 1) times the unbiased sample covariance matrix, and from Theorem 3, X′QX ∼ W_p(n − 1, Σ).

Sums of Wishart matrices are often analyzed, so the following theorem is of importance.

Theorem 4 (Theorem 2.4.1 in Kollo and von Rosen (2006)). Let W_1 ∼ W_p(∆_1, n, Σ) be independent of W_2 ∼ W_p(∆_2, m, Σ). Then,
\[
W_1 + W_2 \sim W_p(\Delta_1 + \Delta_2, n + m, \Sigma).
\]

Now, some expected values of the elements of W ∼ W_p(n, Σ) and of some of its scalar- and matrix-valued functions are considered.

Theorem 5 (Theorem 3.3.15 in Gupta and Nagar (1999)). Let W = (W_{ij}) ∼ W_p(n, Σ); then,

(i) E[W_{ij}] = n σ_{ij},
(ii) Cov[W_{ij}, W_{kl}] = n(σ_{ik}σ_{jl} + σ_{il}σ_{jk}),
(iii) E[WAW] = n ΣA′Σ + n tr(ΣA)Σ + n² ΣAΣ,
(iv) E[tr(AW)W] = n ΣAΣ + n ΣA′Σ + n² tr(AΣ)Σ,
(v) E[tr(AW) tr(BW)] = n tr(AΣBΣ) + n tr(A′ΣBΣ) + n² tr(AΣ) tr(BΣ),

where Σ = (σ_{ij}) and A and B are constant p × p matrices.

Through differentiating the moment generating function of W ∼ W_p(n, Σ), De Waal and Nel (1973) derived the following moments:

(i) E[W²] = n[(n + 1)Σ + (tr Σ)I_p]Σ,
(ii) E[W³] = n[(n² + 3n + 4)Σ² + 2(n + 1)(tr Σ)Σ + (n + 1)(tr Σ²)I_p + (tr Σ)²I_p]Σ,
(iii) E[W⁴] = n[(n³ + 6n² + 21n + 20)Σ³ + (3n² + 10n + 12)(tr Σ)Σ² + (2n² + 5n + 5)(tr Σ²)Σ + 3(n + 1)(tr Σ)²Σ + (n² + 2n + 4)(tr Σ³)I_p + 3(n + 1)(tr Σ)(tr Σ²)I_p + (tr Σ)³I_p]Σ.
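The first moment in Theorem 5, E[W_ij] = n σ_ij, is straightforward to verify by simulation. The following minimal Monte Carlo sketch (numpy; the dimension, degrees of freedom, scale matrix and replication count are arbitrary illustrative choices) averages simulated Wishart matrices W = X′X and compares with nΣ.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 3, 20, 20000
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
C = np.linalg.cholesky(Sigma)

W_sum = np.zeros((p, p))
for _ in range(reps):
    X = rng.standard_normal((n, p)) @ C.T   # rows ~ N_p(0, Sigma)
    W_sum += X.T @ X                        # W = X'X ~ W_p(n, Sigma)

print(np.round(W_sum / reps, 2))   # Monte Carlo estimate of E[W]
print(np.round(n * Sigma, 2))      # theoretical E[W_ij] = n * sigma_ij
```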


Theorem 6 (Theorem 3.3.23 in Gupta and Nagar (1999), using Pielaszkiewicz's (2015) notation). Let W ∼ W_p(n, Σ); then,
\[
E\!\left[(\operatorname{tr} W)^k\right] = 2^k \sum_{\kappa} \left[\tfrac{n}{2}\right]_{\kappa} Z_{\kappa}(\Sigma),
\]
where Z_κ(·) is the zonal polynomial corresponding to κ, κ = (k_1, . . . , k_m) is a partition of k such that k_i ≥ 0 for all i = 1, . . . , m and Σ_i k_i = k, and [a]_κ = ∏_{i=1}^{m} [a − i + 1]_{k_i}, with [a]_k = (a + k)!/a! for a ∈ C and k ∈ N_0.

Subrahmaniam (1976) established the following result using zonal polynomials:
\[
E\!\left[\left(\operatorname{tr}(\Sigma^{-1}W)\right)^{k}\right] = 2^{k}\left[\tfrac{np}{2}\right]_{k}.
\]
Note that it follows from Theorem 2 that E[(tr(Σ⁻¹W))^k] = E[(tr V)^k], where V ∼ W_p(n, I_p).

Kollo and von Rosen (2006) provided the following result for mixed moments of two independent Wishart matrices.

Theorem 7 (Theorem 2.4.15 in Kollo and von Rosen (2006)). Let W_1 ∼ W_p(n, I_p) be independent of W_2 ∼ W_p(m, I_p), where p ≤ n and p ≤ m. Let F = (W_1 + W_2)^{−1/2} W_2 (W_1 + W_2)^{−1/2} and Z = W_2^{−1/2} W_1 W_2^{−1/2}. Then,
\[
\text{(i)}\quad E[Z] = \frac{n}{m-p-1} I_p, \quad m-p-1 > 0,
\]
\[
\text{(ii)}\quad D[Z] = \frac{2n(m-p-2)+n^2}{(m-p)(m-p-1)(m-p-3)}\,(I_p + K_{p,p}) + \frac{2n(m-p-1)+2n^2}{(m-p)(m-p-1)^2(m-p-3)}\,\operatorname{vec} I_p (\operatorname{vec} I_p)', \quad m-p-3 > 0,
\]
\[
\text{(iii)}\quad E[F] = \frac{m}{n+m} I_p,
\]
\[
\text{(iv)}\quad D[F] = \left(a_3 - \frac{m^2}{(n+m)^2}\right) \operatorname{vec} I_p (\operatorname{vec} I_p)' + a_4 (I_p + K_{p,p}),
\]
where, for a matrix A = (a_1, . . . , a_p), vec A = (a_1′, . . . , a_p′)′, D[A] denotes the dispersion matrix of A, defined as D[A] = E[(vec A − E[vec A])(vec A − E[vec A])′], K_{p,p} (called a commutation matrix) is a partitioned pp × pp matrix consisting of p × p blocks in which the (j, i)-th element of the (i, j)-th block equals one while all other elements in that block are zero, and
\[
a_4 = \frac{n-p-1}{(n+m-1)(n+m+2)}\left[\left(n-p-2+\frac{1}{n+m}\right)c_2 - \left(1+\frac{n-p-1}{n+m}\right)c_1\right],
\]
\[
a_3 = \frac{n-p-1}{n+m}\left[(n-p-2)c_2 - c_1\right] - (n+m+1)c_4,
\]
\[
a_2 = \frac{n(m-p-2)+n^2+n}{(m-p)(m-p-1)(m-p-3)}, \qquad
a_1 = \frac{n^2(m-p-2)+2n}{(m-p)(m-p-1)(m-p-3)}.
\]


We continue by listing some moments of the inverted Wishart distribution.

Definition 7 (Inverted Wishart distribution). V_{p×p} is said to have an inverted Wishart distribution, denoted V ∼ W_p^{−1}(n, Ψ), if its inverse V^{−1} follows a Wishart distribution W_p(n, Ψ^{−1}).

Moments of the inverted Wishart matrix have been obtained using different techniques. Kaufman (1967) used a factorization theorem, Gupta (1968) worked with invariance arguments, Haff (1979) established an identity by applying Stokes' theorem and derived the first two moments, and von Rosen (1988) identified a general method for obtaining the rth-order moment and gave explicit expressions up to the fourth order. Parts of these results are summarized in the following three theorems.

Theorem 8 (Theorem 3.4.3 in Gupta and Nagar (1999)). Let V ∼ W_p^{−1}(n, Ψ); then,
\[
\text{(i)}\quad E[v_{ij}] = \frac{\psi_{ij}}{n-2p-2}, \quad n-2p-2 > 0,
\]
\[
\text{(ii)}\quad \operatorname{Cov}[v_{ij}, v_{kl}] = \frac{2(n-2p-2)^{-1}\psi_{ij}\psi_{kl} + \psi_{ik}\psi_{jl} + \psi_{il}\psi_{kj}}{(n-2p-1)(n-2p-2)(n-2p-4)}, \quad n-2p-4 > 0,
\]
\[
\text{(iii)}\quad E[VAV] = \frac{\operatorname{tr}(A\Psi)\Psi + (n-2p-2)\Psi A\Psi}{(n-2p-1)(n-2p-2)(n-2p-4)}, \quad n-2p-4 > 0,
\]
where V = (v_{ij}), Ψ = (ψ_{ij}), and A_{p×p} is a constant positive semidefinite matrix.

Theorem 9 (Theorem 3.4.4 in Gupta and Nagar (1999)). Let V ∼ W_p^{−1}(n, Ψ); then,

(i) E[V³] = (c_1c_3 + c_2c_3 + c_1c_4 + 5c_2c_4)Ψ³ + (2c_2c_3 + c_1c_4 + c_2c_4)(tr Ψ)Ψ² − (c_2c_3 + c_2c_4)(tr Ψ²)Ψ − c_2c_4(tr Ψ)²Ψ, for n − 2p − 6 > 0,
(ii) E[tr(V)V] = c_1(tr Ψ)Ψ + 2c_2Ψ², for n − 2p − 4 > 0,
(iii) E[tr(V)V⁻¹] = [(n − p − 1)(tr Ψ)Ψ⁻¹ − 2I_p] / (n − 2p − 2), for n − 2p − 2 > 0,

where c_1 = (n − 2p − 3)c_2, c_2 = [(n − 2p − 1)(n − 2p − 2)(n − 2p − 4)]⁻¹, c_3 = (n − 2p − 4)[(n − 2p − 6)(n − 2p)]⁻¹, and c_4 = 2[(n − 2p − 6)(n − 2p)]⁻¹.

Theorem 10 (Theorem 3.4.5 in Gupta and Nagar (1999)). Let V ∼ W_p^{−1}(n, Ψ); then,

(i) E[VAV] = c_1 ΨAΨ + c_2 (ΨA′Ψ + tr(AΨ)Ψ),
(ii) E[tr(AV)V] = c_1 tr(AΨ)Ψ + c_2 [ΨA′Ψ + ΨAΨ],
(iii) E[tr(AV) tr(BV)] = c_1 tr(AΨ) tr(BΨ) + c_2 [tr(BΨAΨ) + tr(BΨA′Ψ)],

where c_1 and c_2 are defined as in Theorem 9 and A_{p×p}, B_{p×p} are constant matrices.

For additional Wishart moments, see Gupta and Nagar (1999) and Kollo and von Rosen (2006).


2.4. The Haar measure

Modern-day statistical analysis often involves high-dimensional data, typically under the assumption of a limiting ratio between the sample size n and the dimension p such that p/n → c for a constant c as n → ∞ and p → ∞. Regardless of the asymptotic assumption, it is evident that a raw dataset, in the form of a random matrix X_{n×p}, provides little but descriptive information about the population of interest. Some kind of data reduction is necessary for proper statistical inference. The Haar measure plays an important role in many such results and deserves more attention in statistical analysis. If it can be established that a random orthogonal matrix is a random Haar matrix, then certain properties and product moments of the entries of the matrix are known (Meckes, 2008, Bai and Silverstein, 2010, Collins and Matsumoto, 2017), as will be demonstrated in this chapter and, in more detail, in Paper IV.

Definition 8 (Haar measure). A probability measure h_p defined on the Borel σ-field B_{O_p} of O_p is said to be a Haar measure if, for any set A ∈ B_{O_p} and matrix O ∈ O_p, h_p(OA) = h_p(A), where OA = {OB : B ∈ A} (Bai and Silverstein, 2010).

If a random matrix H ∈ O_p is distributed according to the Haar measure, H ∼ h_p, then it is called a p-dimensional Haar-distributed orthogonal random matrix, or a random Haar matrix for short.

The Haar measure can also be defined on topological groups rather different from the orthogonal group. For example, within ergodic theory, a Haar measure can be defined as follows.

Definition 9. Let T = {z ∈ C : |z| = 1} denote the unit circle. Then the Haar measure dh on T is given by (Eisner et al., 2015)
\[
\int_{T} f(h)\,dh = \int_0^1 f(e^{2\pi i t})\,dt.
\]

The theorem below establishes the asymptotic distribution of the eigenvalues of a random unitary Haar matrix.

Theorem 11 (Section 3.1.5 in Müller et al. (2013)). Let the random matrix H be a square p × p matrix with independent identically distributed complex Gaussian entries with zero mean and finite positive variance. Then, the empirical distribution of the eigenvalues of the unitary random matrix
\[
T = H(H^{*}H)^{-1/2}
\]
converges, as p → ∞, to a nonrandom distribution function whose density is given by
\[
p_T(z) = \frac{1}{2\pi}\delta(|z| - 1). \tag{2.4.1}
\]
The essential statement of (2.4.1) is that the eigenvalues are uniformly distributed on the unit circle.
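The statement of Theorem 11 can be illustrated numerically. The following minimal sketch (numpy; the value p = 300 and the number of histogram bins are arbitrary illustrative choices) builds T as the unitary polar factor of a complex Gaussian matrix, which equals H(H*H)^(−1/2), and checks that the eigenvalue moduli are one and that the eigenvalue angles look roughly uniform.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 300
# complex Gaussian entries with zero mean and finite variance
H = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
U, _, Vh = np.linalg.svd(H)
T = U @ Vh                       # unitary polar factor, equal to H (H*H)^(-1/2)
z = np.linalg.eigvals(T)
print(np.allclose(np.abs(z), 1.0))           # eigenvalues lie on the unit circle
angles = np.angle(z)
hist, _ = np.histogram(angles, bins=10, range=(-np.pi, np.pi))
print(hist)                                  # roughly uniform counts across bins
```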

Theorem 12 (Theorem 1.1.1 in Girko (2012)). The Haar measure exists on any separated, locally compact topological group G. If h and h′ are two Haar measures on G, then h′ = ch, where c is a positive number.

Since O_p is compact, there exists a unique Haar measure that is also a probability measure, called the normalized Haar measure, obtained by imposing the condition h(O_p) = 1.

Property 2. Let G_n be a compact group with unique Haar probability measure h^{(n)}, for n ∈ N. The product measure ⊗_n h^{(n)} is the unique Haar probability measure on the product group ∏_n G_n. In particular, if G and H are compact groups with Haar measures h and µ, then the product measure h ⊗ µ is a Haar measure on the product group G × H (Eisner et al., 2015).

See Wijsman (1990) for a thorough description of invariant measures of groups with applications in statistics.

The moments of a random Haar matrix H are given by integrating with respect to the Haar measure (Gu, 2013). To establish certain product moments of the entries of Haar matrices, some permutation theory and the orthogonal Weingarten function are introduced.

Definition 10 (Partitions & Blocks, Definition 1.11 in Gu (2013)). Let S be a finite totally ordered set. q = {V_1, . . . , V_r} is called a partition of the set S if and only if the V_i (1 ≤ i ≤ r) are pairwise disjoint, nonempty subsets of S such that V_1 ∪ · · · ∪ V_r = S. Further, V_1, . . . , V_r are called the blocks of q.

Definition 11 (Pairings, Definition 1.12 in Gu (2013)). For any integer n ≥ 1, let P(n) be the set of all partitions of [n] = {1, 2, . . . , n}. If n is even, then a partition q ∈ P(n) is called a pair partition, or pairing, if each block of q consists of exactly two elements. The set of all pairings of [n] is denoted by P_2(n). For any partition q ∈ P(n), let #(q) denote the number of blocks of q (also counting singletons).

Definition 12 (Join, Definition 1.14 in Gu (2013)). Let q ≤ r mean that each block of q is completely contained in one of the blocks of r. Let P be a finite partially ordered set, and let q, r be in P. If the set U = {t ∈ P : t ≥ q and t ≥ r} is nonempty and has a minimum t_0 (i.e., an element t_0 ∈ U that is smaller than all the other elements of U), then t_0 is called the join of q and r and is denoted by q ∨ r.

Lemma 1 (Lemma 2 in Mingo and Popa (2013)). Let q, r ∈ P_2(n) be pairings and (i_1, i_2, . . . , i_k) a cycle of qr, where qr is the product of q and r as permutations, multiplied from right to left. Let j_s = r(i_s); then (j_k, j_{k−1}, . . . , j_1) is also a cycle of qr, and these two cycles are distinct; {i_1, . . . , i_k, j_1, . . . , j_k} is a block of q ∨ r, and all blocks are of this form; and 2#(q ∨ r) = #(qr).

Example 1 (Example 1.15 in Gu (2013)). Let q, r ∈ P_2(8) with q = (1, 2)(3, 5)(4, 8)(6, 7) and r = (1, 6)(2, 5)(3, 7)(4, 8). Then qr = (1, 7, 5)(2, 3, 6)(4)(8) and q ∨ r = {{1, 2, 3, 5, 6, 7}, {4, 8}}. We notice that the cycles of qr appear in pairs in the sense that if (i_1, . . . , i_k) is a cycle of qr, then so is (j_k, . . . , j_1) with j_s = r(i_s); for instance, the cycle (1, 7, 5) is paired in this way with (2, 3, 6).


Definition 13 (Admissible and Strongly Admissible (Collins and Matsumoto, 2017)). An ordered sequence i = (i_1, . . . , i_{2n}) of 2n positive integers is called admissible for a pairing q ∈ P_2(2n) if

(s, t) ∈ q ⇒ i_s = i_t.

Further, i is called strongly admissible for q if

(s, t) ∈ q ⇐⇒ i_s = i_t.

Using Definitions 10-13, we can present a very useful and important method for obtaining moments of Haar-distributed matrices and functions thereof.

Definition 14 (Orthogonal Weingarten function (Collins and Matsumoto, 2017)). Let q, r be two pairings in P_2(2n), and let i = (i_1, . . . , i_{2n}) and j = (j_1, . . . , j_{2n}) be strongly admissible for q and r, respectively. Then, the orthogonal Weingarten function Wg^H(q, r, p) is defined as
\[
\mathrm{Wg}^{H}(q, r, p) = \int_{O_p} H_{i_1 j_1} \cdots H_{i_{2n} j_{2n}}\, dh = E\!\left[H_{i_1 j_1} \cdots H_{i_{2n} j_{2n}}\right],
\]
where H = (H_{ij}) ∈ O_p is a p-dimensional random Haar matrix and dh is the normalized Haar measure on O_p.

Example 2 (From Collins and Matsumoto (2017)). If q = (1, 3)(2, 6)(4, 5) and r = (1, 2)(3, 4)(5, 6), then we can write
\[
\mathrm{Wg}^{H}(q, r, p) = \int_{O_p} H_{21} H_{11} H_{22} H_{32} H_{33} H_{13}\, dh.
\]
The value of Wg^H(q, r, p) depends only on the cycle structure of qr and is therefore often denoted by that cycle structure. Since the cycles of qr always appear in pairs, it is enough to choose one cycle from each pair. For example, Wg^H([2, 1], p) denotes the common value for all q, r ∈ P_2(6) for which the cycle decomposition of qr consists of two transpositions and two fixed points (Gu, 2013). A table of some values of the orthogonal Weingarten function is provided in the appendix of Paper IV.

Lemma 2 (Lemma 4.1 in Collins and Matsumoto (2017)). Let H be a p-dimensional random Haar matrix, and let i = (i_1, . . . , i_{2n}) and j = (j_1, . . . , j_{2n}) be two sequences of positive integers in [p]. Then,
\[
E\!\left[H_{i_1 j_1} \cdots H_{i_{2n} j_{2n}}\right] = \int_{O_p} H_{i_1 j_1} \cdots H_{i_{2n} j_{2n}}\, dh = \sum_{q \in P_2(2n)} \sum_{r \in P_2(2n)} \delta_q(i)\,\delta_r(j)\, \mathrm{Wg}^{H}(q, r, p),
\]
where
\[
\delta_q(i) = \begin{cases} 1 & \text{if } i \text{ is admissible for } q, \\ 0 & \text{otherwise.} \end{cases}
\]

The moments of an odd number of factors vanish, so E[H_{i_1 j_1} · · · H_{i_{2n+1} j_{2n+1}}] = 0 (Collins and Śniady, 2006).

The application and usefulness of Lemma 2 are demonstrated in a couple of examples below.

Example 3 (From Matsumoto (2012)). Suppose H is a p-dimensional Haar matrix; then,
\[
E\!\left[H_{1 j_1} H_{1 j_2} H_{2 j_3} H_{2 j_4}\right] = \frac{1}{p(p + 2)(p - 1)}\left((p+1)\delta_{j_1 j_2}\delta_{j_3 j_4} - \delta_{j_1 j_3}\delta_{j_2 j_4} - \delta_{j_1 j_4}\delta_{j_2 j_3}\right)
\]
for p ≥ 2 and j_1, j_2, j_3, j_4 ∈ [p], where
\[
\delta_{ij} = \begin{cases} 1 & \text{for } i = j, \\ 0 & \text{for } i \neq j. \end{cases}
\]

Example 4 (Example 1.17 in Gu (2013)). Suppose H is a p-dimensional Haar matrix where p ≥ 2.

1. Integrands of type H_{i_1 j_1} · · · H_{i_{2n} j_{2n}} whose indexes do not admit any pairing integrate to zero. For example, E[H_{11}³ H_{12}] = E[H_{11}H_{11}H_{11}H_{12}] = 0.

2. If H_{i_1 j_1}H_{i_2 j_2}H_{i_3 j_3}H_{i_4 j_4} = H_{11}² H_{22}² = H_{11}H_{11}H_{22}H_{22}, then the indexes only admit one pairing, namely q = (1, 2)(3, 4), r = (1, 2)(3, 4), and qr = (1)(2)(3)(4). Thus,
\[
E\!\left[H_{11}^2 H_{22}^2\right] = \mathrm{Wg}^{H}(q, r, p) = \mathrm{Wg}^{H}([1, 1], p) = \frac{p + 1}{p(p - 1)(p + 2)}.
\]

3. If H_{i_1 j_1}H_{i_2 j_2}H_{i_3 j_3}H_{i_4 j_4} = H_{11}⁴ = H_{11}H_{11}H_{11}H_{11}, then all pairings are admissible because all the indexes are the same, equal to 1. A similar computation shows that
\[
E\!\left[H_{11}^4\right] = \frac{3(p + 1) - 6}{p(p - 1)(p + 2)} = \frac{3}{p(p + 2)}.
\]

The following property yields a simple way of obtaining a random p-dimensional Haar matrix. This is important from a statistical perspective, perhaps even the main application of the Haar measure within statistics.

Property 3. Let nS ∼ W_p(n, I_p) and let P be the eigenvector matrix of S (i.e., S = PDP′), where the signs of the first row of P are iid symmetrically distributed. Then P ∼ h_p (Anderson, 1958).

Thus, the component coefficient matrix of any PCA on standard normally distributed vectors is a random Haar matrix. Paper IV demonstrates how the properties of the Haar measure on the orthogonal group play an important role in dimension reduction by projecting from O_p to a lower-dimensional space.
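To connect Property 3 with the Weingarten results above, the following is a minimal Monte Carlo sketch (numpy only; the values of p, n and the number of replications are arbitrary illustrative choices). It generates random Haar matrices by randomizing the column signs of the eigenvector matrix of a standard Wishart matrix, in the spirit of Property 3, and compares the simulated fourth moment of H_11 with the closed form 3/(p(p + 2)) from Example 4.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 4, 50, 40000

def haar_from_wishart(rng, n, p):
    """Eigenvector matrix of a standard Wishart matrix with randomised
    column signs; by Property 3 this is Haar-distributed on O_p."""
    X = rng.standard_normal((n, p))
    _, P = np.linalg.eigh(X.T @ X)
    signs = rng.choice([-1.0, 1.0], size=p)
    return P * signs                 # flip each column's sign at random

m4 = np.mean([haar_from_wishart(rng, n, p)[0, 0] ** 4 for _ in range(reps)])
print(m4, 3.0 / (p * (p + 2)))       # Example 4: E[H_11^4] = 3 / (p(p + 2))
```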


2.5. Regression analysis in multivariate settings

Multivariate regression models are models where the set of explanatory variables is identical for each dependent variable,
\[
y_1 = X\beta_1 + \epsilon_1, \quad y_2 = X\beta_2 + \epsilon_2, \quad \ldots, \quad y_G = X\beta_G + \epsilon_G,
\]
where y_g is a vector of dependent variables, X is a matrix of explanatory variables, β_g is a vector of fixed unknown regression coefficients, and ε_g is a vector of unobservable random errors. There are G marginal linear regression models, and it is assumed that a total of n observations are used to estimate them. These models do not generally introduce any severe problems compared to classical univariate regression models. If the set of explanatory variables differs across dependent variables, the analysis is more complicated. Such models are known as seemingly unrelated regression (SUR) models (see Srivastava and Giles (1987)):

\[
y_1 = X_1\beta_1 + \epsilon_1, \quad y_2 = X_2\beta_2 + \epsilon_2, \quad \ldots, \quad y_G = X_G\beta_G + \epsilon_G,
\]
where each regression model involves k_g explanatory variables, for a total of k = Σ_{g=1}^{G} k_g, and n > k_g observations. Let ε = (ε_1′, . . . , ε_G′)′. Assume exogeneity of the explanatory variables and homoskedasticity, i.e.,
\[
E[\epsilon \mid X_1, \ldots, X_G] = 0 \quad \text{and} \quad E[\epsilon_g \epsilon_g' \mid X_1, \ldots, X_G] = \sigma_{gg} I_n.
\]
Further, assume the error terms are uncorrelated across observations but correlated across regression models, i.e.,
\[
E[\epsilon_{is}\epsilon_{jt} \mid X_1, \ldots, X_G] = \sigma_{ij} \ \text{for } s = t, \ \text{and } 0 \ \text{otherwise.}
\]
More compactly, we write
\[
E[\epsilon\epsilon' \mid X_1, \ldots, X_G] = \Omega =
\begin{pmatrix}
\sigma_{11}I_n & \sigma_{12}I_n & \cdots & \sigma_{1G}I_n \\
\sigma_{21}I_n & \sigma_{22}I_n & \cdots & \sigma_{2G}I_n \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{G1}I_n & \sigma_{G2}I_n & \cdots & \sigma_{GG}I_n
\end{pmatrix}
= \Sigma \otimes I_n,
\]
where Σ = (σ_{ij}). Each regression model can be estimated consistently by the ordinary least squares (OLS) estimator, given by
\[
\hat{\beta}_{g,OLS} = \operatorname*{argmin}_{\beta_g} \lVert y_g - X_g\beta_g \rVert^2 = (X_g'X_g)^{-1}X_g'y_g.
\]

However, if the dependent variables are related, i.e., correlated, then it is usually more efficient to estimate the parameters of all regression equations jointly and incorporate the correlations using the generalized least squares (GLS) method (Srivastava and Giles, 1987, Greene, 2003). Consider the regression models as a single stacked model,
\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{pmatrix}
= \begin{pmatrix}
X_1 & 0 & \cdots & 0 \\
0 & X_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & X_G
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_G \end{pmatrix}
+ \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_G \end{pmatrix}
= X\beta + \epsilon.
\]
GLS estimates β by minimizing the squared Mahalanobis distance of the residual vector,
\[
\hat{\beta}_{GLS} = \operatorname*{argmin}_{\beta} (y - X\beta)'\Omega^{-1}(y - X\beta) = \left(X'\Omega^{-1}X\right)^{-1} X'\Omega^{-1} y. \tag{2.5.1}
\]
The information in Σ is effectively used in the GLS estimator to increase the precision of the point estimators of the regression parameters (Zellner (1962), Binkley (1982), Binkley and Nelson (1988)). OLS and GLS are equivalent in two special cases: (i) the fairly common case where all equations have the same regressors (multivariate regression), for example, in the capital asset pricing model in empirical finance (Greene, 2003), and (ii) when Σ is diagonal.

Note that Σ = (σ_{ij}) is usually unknown, and GLS therefore appears to be of limited use. However, it is possible to estimate Σ and use Ω̂ = Σ̂ ⊗ I_n in (2.5.1) to obtain modified GLS estimates, commonly known as feasible generalized least squares (FGLS) estimates (or sometimes as Zellner's efficient estimator (ZEF) after Zellner (1962)). The least squares residuals are often used to consistently estimate σ_{ij}, i.e.,
\[
\hat{\sigma}_{ij} = s_{ij} = \frac{e_i' e_j}{n}, \tag{2.5.2}
\]
where e_j denotes the residual vector of the jth regression equation. Other estimators are available in the literature, for example, the maximum likelihood estimators discussed in Srivastava and Giles (1987) and a Bayesian approach discussed in Zellner (1971).
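As a minimal sketch of the FGLS procedure just described (not the thesis's own implementation), the function below runs equation-by-equation OLS, estimates Σ from the residuals as in (2.5.2), and then computes the stacked FGLS estimator. It assumes numpy and scipy are available; the two-equation toy system and all numbers are invented for illustration.

```python
import numpy as np
from scipy.linalg import block_diag

def sur_fgls(ys, Xs):
    """Feasible GLS for a SUR system: OLS residuals give sigma_hat_ij = e_i'e_j/n,
    then beta_FGLS = (X' Omega_hat^-1 X)^-1 X' Omega_hat^-1 y on the stacked system."""
    n = ys[0].shape[0]
    # step 1: equation-by-equation OLS and residuals
    betas_ols = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(Xs, ys)]
    E = np.column_stack([y - X @ b for y, X, b in zip(ys, Xs, betas_ols)])
    Sigma_hat = E.T @ E / n                       # cf. (2.5.2)
    # step 2: FGLS on the stacked system
    Omega_inv = np.kron(np.linalg.inv(Sigma_hat), np.eye(n))
    X_big = block_diag(*Xs)
    y_big = np.concatenate(ys)
    XtOi = X_big.T @ Omega_inv
    beta_fgls = np.linalg.solve(XtOi @ X_big, XtOi @ y_big)
    return beta_fgls, Sigma_hat

# toy system with two equations and correlated errors
rng = np.random.default_rng(6)
n = 200
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
Sig = np.array([[1.0, 0.7], [0.7, 1.5]])
eps = rng.multivariate_normal([0.0, 0.0], Sig, size=n)
y1 = X1 @ np.array([1.0, 2.0]) + eps[:, 0]
y2 = X2 @ np.array([0.5, -1.0, 3.0]) + eps[:, 1]
beta, S = sur_fgls([y1, y2], [X1, X2])
print(np.round(beta, 2))
```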

In many applications, the SUR model is too heavily parameterized, and some constraints may be necessary to obtain efficient estimates. In Paper II, it is demonstrated how to estimate the parameters of a SUR model when the explanatory variables share eigenvectors, proposing a smoothing based on common principal component regression. To test linear hypotheses about β, consider the following. The general linear hypothesis imposes a set of restrictions on the SUR model and can be written as
\[
R_1\beta_1 = q_1, \quad R_2\beta_2 = q_2, \quad \ldots, \quad R_G\beta_G = q_G,
\]


or, more compactly, as Rβ = Q, where
\[
R = \begin{pmatrix}
R_1 & 0 & \cdots & 0 \\
0 & R_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & R_G
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_G \end{pmatrix}
\quad \text{and} \quad
Q = \begin{pmatrix} q_1 \\ q_2 \\ \vdots \\ q_G \end{pmatrix}.
\]
Assume that J_g restrictions are made for β_g = (β_{g1}, . . . , β_{gk})′. Then R_g is J_g × k, q_g is J_g × 1, and each row of R_g contains the coefficients of one of the restrictions. The general linear hypothesis about β is evaluated through the F-ratio statistic,
\[
F[J, Gn - k] = \frac{\left(R\hat{\beta}_{GLS} - Q\right)'\left[R\left(X'\Omega^{-1}X\right)^{-1}R'\right]^{-1}\left(R\hat{\beta}_{GLS} - Q\right)/J}{e'\Omega e/(G(n - k))}, \tag{2.5.3}
\]
which requires the unknown Ω. Instead, we estimate Ω by (2.5.2) and use the FGLS estimator of β. Then, since e′Ωe/(G(n − k)) converges to 1 as n → ∞, (2.5.3) will behave as
\[
\hat{F} = \frac{1}{J}\left(R\hat{\beta}_{FGLS} - Q\right)'\left[R\,\widehat{\operatorname{Cov}}\!\left[\hat{\beta}_{FGLS}\right]R'\right]^{-1}\left(R\hat{\beta}_{FGLS} - Q\right),
\]
which follows the F-distribution. This is only valid asymptotically; generally, J F[J, m] converges to a chi-squared distribution with J degrees of freedom as n → ∞. Thus, the following statistic (a Wald statistic), with an asymptotic chi-square distribution with J degrees of freedom,
\[
J\hat{F} = \left(R\hat{\beta}_{FGLS} - Q\right)'\left[R\,\widehat{\operatorname{Cov}}\!\left[\hat{\beta}_{FGLS}\right]R'\right]^{-1}\left(R\hat{\beta}_{FGLS} - Q\right),
\]
can also be used (Greene, 2003).


3. Principal component analysis

Principal component analysis (PCA) is the backbone of this thesis; therefore, it is presented in its own chapter. The idea of PCA is to transform a dataset of correlated variables into uncorrelated components (called principal components) through linear combinations of the original variables. The components are ordered by the amount of variance they account for. In other words, the first component is the linear combination that accounts for the largest possible variation. The second component is the linear combination that accounts for the largest possible share of the remaining variation (the variation not captured by the first component) while being orthogonal to the first component, and so on. Unless the cloud of points of the original variables is spherical, most of the variation is captured by the first few principal components, making PCA a suitable method for reducing the dimensionality of data.

PCA is still a field of research under development, with uses in a wide range of statistical techniques. It requires computational power (unless the dimension is very small); thus, the rapid growth in computational power has contributed to the relevance of the method. As previously mentioned, PCA is used as an ingredient in other applications, such as regression analysis, detection of outliers, assessment of the normality of random vectors and cluster analysis, to name a few.

A profound description of the properties and applications of PCA is given by Jolliffe (2011), who dedicated an entire book to the subject.

The derivation of the principal components in sections 3.1 and 3.2 is based on the work of Flury (1988) and Pepler et al. (2016).

3.1. Principal component analysis in a single population

3.1.1. Principal components in a population

Let x be a vector in R^p with existing second moments. For simplicity, we assume that E[x] = 0. The covariance matrix of x is E[xx′] = Σ, which is assumed to have distinct eigenvalues. Now find a linear combination

y = a′x, a ∈ R^p,

such that the variance of y is maximized. To prevent the variance of y from being arbitrarily large, we impose a normalization constraint on a so that the inner product is unity, ⟨a, a⟩ = a′a = 1.


Consider the spectral decomposition of Σ,

$$
\Sigma = \sum_{i=1}^{p} \lambda_i \pi_i \pi_i' = \Pi \Lambda \Pi', \qquad (3.1.1)
$$

where $\Pi = (\pi_1, \pi_2, \ldots, \pi_p) \in O_p$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$. The eigenvalues are ordered in decreasing order, $\lambda_1 > \lambda_2 > \cdots > \lambda_p > 0$, and the eigenvectors are ordered accordingly. From (3.1.1), we can establish $\Pi'\Sigma\Pi = \Lambda$. Since $\Pi \in O_p$, it is clear that $\pi_1, \ldots, \pi_p$ form a basis of $\mathbb{R}^p$, so $a$ can be written as

$$
a = \alpha_1\pi_1 + \alpha_2\pi_2 + \cdots + \alpha_p\pi_p, \qquad (3.1.2)
$$

for some $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_p)' \in \mathbb{R}^p$ such that $\langle \alpha, \alpha \rangle = 1$. An upper bound of the variance of $y$ is identified as

$$
\mathrm{Var}[y] = a'\Sigma a = \alpha'\Pi'\Sigma\Pi\alpha = \alpha'\Lambda\alpha = \sum_{i=1}^{p}\lambda_i\alpha_i^2 \le \lambda_1\sum_{i=1}^{p}\alpha_i^2 = \lambda_1.
$$

The variance of any normalized linear combination of $x$ thus cannot surpass $\lambda_1$, but by selecting $a = \pi_1$, this bound is attained. Thus, the first principal component is defined by

$$
y_1 = \pi_1'x.
$$

To obtain the rest of the principal components, this procedure is repeated under the additional constraint of being uncorrelated with the previous principal components. To prove that the $k$th principal component is given by

$$
y_k = \pi_k'x,
$$

where $\pi_k$ is the eigenvector corresponding to the $k$th largest eigenvalue of $\Sigma$, we assume that $k-1$ principal components have been established. For the $k$th principal component, the variance $a'\Sigma a$ should be maximized under the constraints $\langle a, a \rangle = 1$ and

$$
a'\Sigma\pi_j = 0, \quad j = 1, 2, \ldots, k-1. \qquad (3.1.3)
$$

We have $\Sigma\pi_i = \lambda_i\pi_i$; thus, (3.1.3) implies $a'\pi_j = 0$ for $j = 1, 2, \ldots, k-1$. It then follows that $\alpha_1 = \alpha_2 = \cdots = \alpha_{k-1} = 0$ in (3.1.2), so

$$
\mathrm{Var}[y] = a'\Sigma a = \sum_{i=k}^{p}\lambda_i\alpha_i^2 \le \lambda_k\sum_{i=k}^{p}\alpha_i^2 = \lambda_k,
$$

and the bound is attained by $a = \pi_k$; the proof is complete.

The $p$ principal components of a vector $x \in \mathbb{R}^p$ are obtained by

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}
  = \begin{pmatrix} \pi_1'x \\ \pi_2'x \\ \vdots \\ \pi_p'x \end{pmatrix}
  = \Pi'x.
$$

The covariance matrix of $y$ is

$$
\mathrm{Cov}[y] = \Pi'\Sigma\Pi = \Lambda,
$$

clearly showing that the principal components are uncorrelated. Note that the eigenvectors are only unique up to sign. Further, if the eigenvalues are not all distinct, then the eigenvectors associated with multiple eigenvalues are not uniquely defined. However, in this thesis, all eigenvalues are assumed to be distinct.

3.1.2. Principal components of a sample

Let $x_1, \ldots, x_n$ be a sample of random $p$-dimensional vectors, and define the $n \times p$ matrix

$$
X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}.
$$

Let $\bar{x}$ denote the vector of column means of $X$ and let $S$ denote the unbiased covariance matrix,

$$
S = \frac{1}{n-1}\left(X'X - n\bar{x}\bar{x}'\right).
$$

To find the sample principal component coefficients, a spectral decomposition of $S$ is performed,

$$
S = PLP',
$$

where $(l_i, p_i)$ is the $i$th eigenvalue-eigenvector pair of $S$, $L = \mathrm{diag}(l_1, l_2, \ldots, l_p)$ and $P = (p_1, p_2, \ldots, p_p)$ contains the corresponding eigenvectors. For continuous data, $S$ is positive definite and has distinct eigenvalues with probability 1. The eigenvectors are orthogonal and normalized to have unit length. Note that the value of $\bar{x}$ is irrelevant since centering does not affect the covariance matrix.

By the invariance property of maximum likelihood estimators (Casella and Berger, 2002), the maximum likelihood estimates of $\Pi$ and $\Lambda$ are the eigenvectors and eigenvalues of $\frac{n-1}{n}S$, i.e.,

$$
\hat{\Pi} = P \quad \text{and} \quad \hat{\Lambda} = \frac{n-1}{n}L.
$$

The $j$th sample principal component of the $k$th row of $X$ is then the linear combination

$$
\hat{y}_{jk} = p_j'x_k, \quad j = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, n.
$$

All sample principal components of $X$ can more compactly be written as

$$
\hat{Y} = XP.
$$
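As a small illustration of the formulas above, the following sketch computes the sample eigenvalues, eigenvectors and principal component scores of a data matrix via the spectral decomposition of $S$; the simulated data and the function name sample_pca are hypothetical.

```python
import numpy as np

def sample_pca(X):
    """Sample principal components via the spectral decomposition S = P L P'.

    X : (n, p) data matrix.
    Returns the eigenvalues l (descending), the eigenvector matrix P,
    and the component scores Y_hat = X P.
    """
    S = np.cov(X, rowvar=False, ddof=1)          # unbiased covariance matrix
    l, P = np.linalg.eigh(S)                     # eigh returns ascending eigenvalues
    order = np.argsort(l)[::-1]                  # reorder to descending
    l, P = l[order], P[:, order]
    return l, P, X @ P                           # scores: Y_hat = X P

# Example with simulated data (illustrative only)
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], [[4, 2, 0], [2, 3, 1], [0, 1, 2]], size=200)
l, P, Y = sample_pca(X)
print(np.round(l, 2))                            # estimated eigenvalues
```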

3.1.3. Asymptotic theory of the sample principal components

Continuing with the notation from sections 3.1.1 and 3.1.2, assume in what follows that the rows of the data matrix $X_{n \times p}$ are distributed as $N_p(0, \Sigma)$, where all eigenvalues of $\Sigma$ are again assumed to be distinct. Then, the asymptotic distribution of the sample principal components $\hat{Y} = XP$ tends to $N_p(0, \Lambda)$. The first-order approximations of the asymptotic distributions of $P$ and $L$ are (Johnson and Wichern, 2002)

(i) $\sqrt{n}(l - \lambda)$ is approximately $N_p(0, 2\Lambda^2)$, where $l$ and $\lambda$ are the vectors of diagonal elements of $L$ and $\Lambda$, respectively.

(ii) $\sqrt{n}(p_i - \pi_i)$ is approximately $N_p(0, Q_i)$, where

$$
Q_i = \lambda_i \sum_{\substack{k=1 \\ k \neq i}}^{p} \frac{\lambda_k}{(\lambda_k - \lambda_i)^2}\, \pi_k\pi_k'.
$$

(iii) Each eigenvalue of $S$ is distributed independently of the elements of its associated eigenvector.

From (i)-(iii), a large sample $100(1-\alpha)\%$ confidence interval for $\lambda_j$ can be obtained by

$$
\left(\frac{l_j}{1 + z_{\alpha/2}\sqrt{2/(n-1)}}\ ;\ \frac{l_j}{1 - z_{\alpha/2}\sqrt{2/(n-1)}}\right).
$$

The standard deviation of the $k$th element of the $j$th eigenvector is given by

$$
s(p_{jk}) = \sqrt{\frac{l_j}{n-1}\sum_{\substack{h=1 \\ h \neq j}}^{p}\frac{l_h}{(l_h - l_j)^2}\,p_{hk}^2}.
$$
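A minimal sketch of these two large-sample quantities is given below, assuming the sample eigenvalues l are sorted in decreasing order and P holds the corresponding eigenvectors as columns; the function names are illustrative.

```python
import numpy as np
from scipy import stats

def eigenvalue_ci(l, n, j, alpha=0.05):
    """Large-sample confidence interval for the j-th population eigenvalue."""
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(2.0 / (n - 1))
    return l[j] / (1 + half), l[j] / (1 - half)

def loading_sd(l, P, n, j, k):
    """Asymptotic standard deviation of the k-th element of the j-th eigenvector."""
    h = np.arange(len(l)) != j                       # indices h != j
    var = l[j] / (n - 1) * np.sum(l[h] * P[k, h] ** 2 / (l[h] - l[j]) ** 2)
    return np.sqrt(var)
```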


Let the $j$th eigenvector be partitioned as

$$
\pi_j = \begin{pmatrix} \pi_j^{(1)} \\ \pi_j^{(2)} \end{pmatrix},
$$

where $\pi_j^{(2)}$ contains the last $p - r$ loadings for the possibly redundant variables. To test the hypothesis

$$
H_0: \pi_j^{(2)} = 0,
$$

Flury (1988) used the statistic

$$
T = (n-1)\, p_j^{(2)'} \left(\sum_{\substack{h=1 \\ h \neq j}}^{p} \frac{l_j l_h}{(l_j - l_h)^2}\, p_h^{(2)} p_h^{(2)'}\right)^{-1} p_j^{(2)}, \qquad (3.1.4)
$$

where $p_j^{(2)}$ is the estimate of $\pi_j^{(2)}$. Under the null hypothesis, $T$ is asymptotically chi-square-distributed with $p - r$ degrees of freedom.
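The statistic (3.1.4) is straightforward to compute once the eigenvalues and eigenvectors of $S$ are available. The sketch below is one possible implementation, assuming l is sorted in decreasing order, P stores the eigenvectors as columns, and the last $p - r$ variables are the candidates for redundancy; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def flury_redundancy_test(l, P, n, j, r):
    """Test H0: the last p - r loadings of the j-th eigenvector are zero (eq. 3.1.4)."""
    p = len(l)
    P2 = P[r:, :]                                    # last p - r rows of the loadings
    M = sum(l[j] * l[h] / (l[j] - l[h]) ** 2 * np.outer(P2[:, h], P2[:, h])
            for h in range(p) if h != j)
    pj2 = P2[:, j]
    T = (n - 1) * pj2 @ np.linalg.solve(M, pj2)
    return T, stats.chi2.sf(T, df=p - r)
```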

PCA is often used as a technique to reduce the dimension, where only $q < p$ components are used to represent the variation in the data. Thus, it is of interest to test whether the last $p - r$ coefficients are simultaneously equal to zero for all $q$ retained eigenvectors. Flury (1988) extended (3.1.4) to test such a hypothesis. Let $Q$ denote the set of indices of the $q$ retained eigenvectors; the null hypothesis is then

$$
H_0(Q): \pi_j^{(2)} = 0 \quad \forall j \in Q,
$$

and the test statistic is

$$
T = (n-1)\sum_{j \in Q} p_j^{(2)'} \left(\sum_{h \notin Q} \frac{l_j l_h}{(l_j - l_h)^2}\, p_h^{(2)} p_h^{(2)'}\right)^{-1} p_j^{(2)},
$$

which, under the null hypothesis, is asymptotically chi-squared with q(p − r) degrees of freedom.

To test whether an eigenvector is equal to a specific normalized vector $\pi_j^0$, i.e.,

$$
H_0: \pi_j = \pi_j^0,
$$

Anderson (1963) proposed using the test statistic

$$
X^2 = (n-1)\left(l_j\, \pi_j^{0'} S^{-1}\pi_j^0 + l_j^{-1}\, \pi_j^{0'} S\, \pi_j^0 - 2\right), \qquad (3.1.5)
$$

which, under the null hypothesis, is asymptotically chi-squared with $p - 1$ degrees of freedom.
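A minimal sketch of Anderson's statistic (3.1.5) is given below, assuming $S$, its eigenvalues l (in decreasing order) and a hypothesized unit-length vector pi0 are available; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def anderson_eigenvector_test(S, l, pi0, j, n):
    """Test H0: the j-th population eigenvector equals the hypothesized vector pi0 (eq. 3.1.5)."""
    X2 = (n - 1) * (l[j] * pi0 @ np.linalg.solve(S, pi0)
                    + (pi0 @ S @ pi0) / l[j] - 2.0)
    p = S.shape[0]
    return X2, stats.chi2.sf(X2, df=p - 1)
```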

Flury (1988) generalized (3.1.5) to test the hypothesis that $q$ eigenvectors are equal to $q$ predetermined eigenvectors,

$$
H_0: [\pi_1, \pi_2, \ldots, \pi_q] = \left[\pi_1^0, \pi_2^0, \ldots, \pi_q^0\right],
$$

where $\pi_1^0, \pi_2^0, \ldots, \pi_q^0$ are $q$ orthonormal vectors. The test statistic is given by

$$
X_q^2 = (n-1)\left(\frac{1}{4}\sum_{j=1}^{q-1}\sum_{h=j+1}^{q}\hat{\theta}_{jh}^{-1}\left(p_h'\pi_j^0 - p_j'\pi_h^0\right)^2 + \sum_{j=1}^{q}\sum_{h=q+1}^{p}\hat{\theta}_{jh}^{-1}\left(p_h'\pi_j^0\right)^2\right), \qquad (3.1.6)
$$

where $\hat{\theta}_{jh} = l_jl_h/(l_j - l_h)^2$, and under the null hypothesis, $X_q^2$ is asymptotically chi-squared with $q\left[p - (q+1)/2\right]$ degrees of freedom.

To test whether the last $p - q$ eigenvalues are small and equal, meaning that the last components only reflect noise and can be disregarded, the hypothesis is

$$
H_0: \lambda_{q+1} = \lambda_{q+2} = \cdots = \lambda_p,
$$

and the test statistic is given by (Bartlett, 1951)

$$
u = \left(n - \frac{2p + 11}{6}\right)\left((p-q)\ln\bar{l} - \sum_{j=q+1}^{p}\ln l_j\right),
$$

where $\bar{l} = \sum_{j=q+1}^{p} l_j/(p-q)$ is the average of the last $p - q$ eigenvalues. Under the null hypothesis, $u$ is approximately chi-squared with $\frac{1}{2}(p - q - 1)(p - q + 2)$ degrees of freedom.
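Bartlett's statistic can be computed directly from the ordered sample eigenvalues, as in the sketch below; the function name and arguments are illustrative.

```python
import numpy as np
from scipy import stats

def bartlett_equal_eigenvalues(l, n, q):
    """Bartlett's (1951) test of H0: the last p - q eigenvalues are equal.

    l : sample eigenvalues in decreasing order, n : sample size, q : retained components.
    """
    p = len(l)
    tail = np.asarray(l)[q:]                          # last p - q sample eigenvalues
    lbar = tail.mean()
    u = (n - (2 * p + 11) / 6.0) * ((p - q) * np.log(lbar) - np.sum(np.log(tail)))
    df = (p - q - 1) * (p - q + 2) / 2.0
    return u, stats.chi2.sf(u, df=df)
```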

3.2. Principal components in multiple populations

Consider the situation where the same variables are measured for multiple natural populations, for example, males and females or different countries. If analyzed separately, it might be of interest to see how much the principal components differ among the populations or, more explicitly, whether the structure of the covariance matrices differs among populations, and if so, how much. If the covariance matrices are identical (identical eigenvalues and eigenvectors), it is natural to pool all data together before performing a PCA to improve its properties; if not, pooling the data is unsuitable. However, in multivariate settings, the matter goes beyond the dichotomous choice between equal or unequal covariance matrices. Indeed, they can be different in numerous ways. Some of these have been considered by Flury (1988), who listed a hierarchy. One possibility is that the covariance matrices share the same principal axes but their relative importance may vary among the groups; then, the CPC assumption is appropriate. It assumes that the covariance matrices have common eigenvectors and can thus be diagonalized by the same orthogonal transformation, while possibly having population-specific eigenvalues.

3.2.1. The common principal component model

Consider a set of $G$ positive definite covariance matrices $\Sigma_g$, $g = 1, 2, \ldots, G$, each of dimension $p \times p$. Under the CPC assumption, all covariance matrices are diagonalized by a single orthogonal matrix $\Pi = (\pi_1, \pi_2, \ldots, \pi_p)$, i.e.,

$$
\Sigma_g = \Pi\Lambda_g\Pi' = \sum_{i=1}^{p}\lambda_{gi}\pi_i\pi_i', \quad g = 1, 2, \ldots, G,
$$

where $\Lambda_g = \mathrm{diag}(\lambda_{g1}, \lambda_{g2}, \ldots, \lambda_{gp})$ is the diagonal matrix with the eigenvalues of the $g$th covariance matrix on its diagonal. In other words, the covariance matrices share the same eigenvectors but have individual eigenvalues.

3.2.2. Sample common principal components

Let $X_1, X_2, \ldots, X_G$ be $n_g \times p$ data matrices, where the same set of $p$ variables is measured in $G$ populations. The unbiased sample covariance estimator of $\Sigma_g$ is given by

$$
S_g = \frac{1}{n_g - 1}\left(X_g'X_g - n_g\bar{x}_g\bar{x}_g'\right), \quad g = 1, 2, \ldots, G.
$$

According to the CPC assumption, the same orthogonal matrix, $P$, diagonalizes all covariance matrices simultaneously, i.e.,

$$
P'S_gP = L_g, \quad g = 1, 2, \ldots, G,
$$

where $\mathrm{diag}(L_g) = (l_{g1}, l_{g2}, \ldots, l_{gp})$ contains the sample eigenvalues of group $g$ on the diagonal. Generally, the estimated $P$ will not diagonalize $S_g$ completely because of sampling variation, even if the CPC assumption is valid. However, if the CPC assumption holds, the off-diagonal elements of $L_g$ will be quite small but not precisely zero.

To find the maximum likelihood estimates of $P = (p_1, p_2, \ldots, p_p)$ under the CPC assumption, the following system of equations has to be solved (Flury, 1984):

$$
p_l'\left(\sum_{g=1}^{G}(n_g - 1)\,\frac{l_{gl} - l_{gj}}{l_{gl}\,l_{gj}}\,S_g\right)p_j = 0, \quad l, j = 1, 2, \ldots, p;\ l \neq j,
$$

where $l_{gj} = p_j'S_gp_j$. The equations are to be solved under the orthogonality constraint on the eigenvectors. The exact solutions can be found using an iterative numerical algorithm, such as the FG-algorithm (Flury and Gautschi, 1986).

It loops through all pairs of columns of the eigenvector matrix to complete a single iteration and requires considerable computational power, which makes it both inefficient and sometimes infeasible in higher-dimensional settings (Browne and McNicholas, 2014). For the maximum likelihood estimates of the CPC model, no self-evident way of ordering the eigenvectors exists, and Flury (1986) suggested using the rank order of the first group, i.e.,

$$
p_1'S_1p_1 > p_2'S_1p_2 > \cdots > p_p'S_1p_p.
$$

A simpler, intuitive and readily available method for finding estimates of $P$ and $L_g$ using any standard computer package for PCA was proposed by Krzanowski (1984). Assuming the CPC model holds, we can write

$$
\Pi'\bar{\Sigma}\Pi = G^{-1}\left(\Pi'\Sigma_1\Pi + \ldots + \Pi'\Sigma_G\Pi\right) = G^{-1}\left(\Lambda_1 + \ldots + \Lambda_G\right) = \bar{\Lambda},
$$


where $\bar{\Lambda}$ and $\Pi$ collect the eigenvalues and eigenvectors of $\bar{\Sigma}$, the arithmetic mean of the group covariance matrices. The estimates of $\Pi$ can thus be obtained by a PCA of the sample counterpart $\bar{S}$, i.e., $\bar{S} = P\bar{D}P'$, and the estimates of $\Lambda_g$ are established as $L_g = \mathrm{diag}\left(P'S_gP\right)$. Here, it is natural to rank order the eigenvectors after the magnitude of the eigenvalues of $\bar{S}$, as

$$
p_1'\bar{S}p_1 > p_2'\bar{S}p_2 > \ldots > p_p'\bar{S}p_p.
$$

$\bar{S}$ can be substituted by any weighted sum or product of the marginal sample covariance matrices since they too have the same eigenvectors.
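A minimal sketch of Krzanowski's approach is given below, assuming the groups are supplied as a list of data matrices; the function name krzanowski_cpc is illustrative and not taken from any existing package.

```python
import numpy as np

def krzanowski_cpc(groups):
    """Krzanowski-type CPC estimates from a list of (n_g, p) data matrices.

    Returns the common eigenvector estimate P (columns ordered by the
    eigenvalues of the pooled average S_bar) and a list of the estimated
    group-specific eigenvalues diag(P' S_g P).
    """
    S_list = [np.cov(X, rowvar=False, ddof=1) for X in groups]
    S_bar = sum(S_list) / len(S_list)               # arithmetic mean of covariances
    d, P = np.linalg.eigh(S_bar)
    order = np.argsort(d)[::-1]                     # descending eigenvalues of S_bar
    P = P[:, order]
    L = [np.diag(P.T @ S @ P) for S in S_list]      # group-specific eigenvalues
    return P, L
```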

While there exist other estimation methods for the CPC model, this thesis will only cover the maximum likelihood estimator and Krzanowski’s estimator. Some other methods will be briefly mentioned, and the reader is referred to the references for more details.

Trendafilov (2010) proposed a stepwise estimation procedure in which, for a chosen number of eigenvectors $q < p$, each estimated eigenvector explains more variance than the next. This estimator coincides with the maximum likelihood estimator if all eigenvalues are simultaneously decreasing across the groups. If not, the stepwise procedure retains such an order, at least approximately.

Hallin et al. (2014) pointed out that any covariance-based estimation method is known to be fragile. They provided a rank-based estimator (R-estimator) of the eigenvectors that is root-$n$ consistent and asymptotically normal under any elliptical density, regardless of any moment assumptions, and efficient under some prespecified elliptical densities. Boente and Orellana (2001) and Boente et al. (2002) suggest replacing the sample covariance matrices $S_1, S_2, \ldots, S_G$ with more robust estimates. Boente et al. (2006) propose a projection pursuit estimation technique for CPC. However, using robust methods comes with a loss of efficiency, and PCA and CPC are no exceptions (Hallin et al., 2014).

When the eigenvectors have been estimated, the $j$th sample common principal component is the linear combination

$$
y_j = p_j'x_g = p_{j1}x_{g1} + p_{j2}x_{g2} + \cdots + p_{jp}x_{gp}, \quad j = 1, 2, \ldots, p,
$$

where $p_{ji}$ is the $i$th loading (element) of the $j$th eigenvector and $x_{gi}$ is the $i$th element of the vector $x_g$ from the $g$th group. The matrix equivalent for group $g$ is defined as

$$
Y_g = X_gP,
$$

with sample covariance matrix

$$
\mathrm{Cov}[Y_g] = L_g = P'S_gP,
$$

where the diagonal elements of $L_g$ are the sample variances of the common principal components. The sample correlation matrix is given as

$$
R_g = \left[\mathrm{diag}(L_g)\right]^{-1/2} L_g \left[\mathrm{diag}(L_g)\right]^{-1/2}.
$$
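Continuing the hypothetical krzanowski_cpc sketch above, the group-level scores and their covariance and correlation matrices can be obtained as follows; again, the names are illustrative.

```python
import numpy as np

def cpc_scores(X_g, S_g, P):
    """Scores, covariance and correlation of the common principal components for one group."""
    Y_g = X_g @ P                                   # Y_g = X_g P
    L_g = P.T @ S_g @ P                             # Cov[Y_g] = P' S_g P
    d = np.sqrt(np.diag(L_g))
    R_g = L_g / np.outer(d, d)                      # [diag(L_g)]^{-1/2} L_g [diag(L_g)]^{-1/2}
    return Y_g, L_g, R_g
```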


References
