Aspects of Common Principal Components

Licentiate Thesis

Toni Duras

Licentiate Thesis in Statistics

Aspects of Common Principal Components
JIBS Research Reports No. 2017-2

© 2017 Toni Duras and Jönköping International Business School

Publisher: Jönköping International Business School, P.O. Box 1026, SE-55111 Jönköping
Tel.: +46 36 10 10 00
www.ju.se

Printed by BrandFactory AB 2017
ISSN 1403-0462

Acknowledgment

Foremost, I would like to express my sincere gratitude to my advisor Professor Thomas Holgersson for the continuous support during my Ph.D. studies, for his patience, motivation, enthusiasm, and knowledge. His guidance has helped me immensely from the very beginning. I am very happy to have him as my advisor and mentor.

Besides my advisor, I would like to thank the rest of the statistics department at JIBS, Professor Pär Sjölander and Associate Professor Kristofer Månsson, for their encouragement and insightful comments. I appreciate our daily talks about football together with Associate Professor Agostino Manduchi and Professor Paul Nystedt; they help me stay sane.

Thanks to all my colleagues from the Department of Economics for making me feel at home, and for all the fun we have had during the last few years. I appreciate all of you. A special thanks to my fellow Ph.D. candidates and friends: Malin Allgurin, Mark Bagley, Pingjing Bo, Orsa Kekezi, Emma Lappi, Amedeus Malisa, Helena Nilsson, Aleksandar Petreski, Jonna Rickardsson, and Tina Wallin, for your support in our shared struggle and for the countless hours filled with laughter and joy.

Monica Bartels, Katarina Blåman and Marie Petersson, I am grateful for your willingness to help as soon as I had any administrative problems or questions. Thank you for keeping the fifth floor running so well.

I would also like to thank my bro, Professor Johan Klaesson, for the friendly competition and banter in the gym, and for pushing me to grow, physically. My gym buddy is smarter and stronger than yours.

Last but not least, I would like to thank my family: my parents Damir and Ulrica, and my little brother Pierre, for supporting me no matter what I pursue throughout my life. I love you very much.

Abstract

The focus of this thesis is the common principal component (CPC) model, the generalization of principal components to several populations. Common principal components refer to a group of multidimensional datasets such that their inner products share the same eigenvectors and are therefore simultaneously diagonalized by a common decorrelator matrix. Common principal component analysis is essentially applied in the same areas and analyses as its one-population counterpart. The generalization to multiple populations comes at the cost of being more mathematically involved, and many problems in the area remain to be solved.

This thesis consists of three individual papers and an introductory chapter. In the first paper, the performance of two different estimation methods for the CPC model is compared on two real-world datasets and in a Monte Carlo simulation study. The second paper shows that the orthogonal group and the Haar measure on this group play an important role in PCA, both in single- and multi-population principal component analysis. The last paper considers using common principal component analysis as a tool for imposing restrictions on system-wise regression models. When the exogenous variables of a multi-dimensional model share common principal components, each of the marginal models in the system is, up to its eigenvalues, identical. They hence form a class of regression models situated in between the classical seemingly unrelated regressions, where each set of explanatory variables is unique, and multivariate regression, where each marginal model shares the same common set of regressors.

Introduction and summary of the thesis

Estimation for multiple populations often involves assumptions about their variances. The variances are often tested for equality, or assumed to be equal or unequal, across populations. In the univariate case this usually makes sense, since variances cannot exhibit very complex relationships. In the multivariate case, the covariance matrices contain information about both the variance of each variable and the covariances between variables, which allows for much more complicated relationships between covariance matrices. The covariance matrices might have different structures, and the hypothesis of equality is often rejected. They might nevertheless exhibit some similarities, possibly through a relationship more complex than the two extreme cases of being equal or unrelated. Estimation of covariance matrices plays a significant role in many multivariate statistical methods, so identifying relationships among covariance matrices that help improve the estimation is highly important.

One such relationship is the common principal component (CPC) model. Introduced by Flury (1984), it is the generalization of principal components to several populations. The model assumes that the principal component transformation is identical in each population, while the importance of each component may vary. That is, the covariance matrices share an identical set of eigenvectors but are allowed individual eigenvalues. The similarities of the covariance structures of several populations are summarized in a hierarchy of models (Flury, 1988): equality, proportionality, the CPC model, the partial CPC model (only a subset of the eigenvectors is shared across populations), and unrelated covariance matrices. Model identification is typically done with a goodness-of-fit measure, such as the Akaike information criterion (AIC), the Schwarz criterion (BIC), or likelihood ratio (LR) statistics. See Pepler et al. (2016) for a comparison of eight methods for selecting a suitable model among the five levels for two populations. If the CPC model is reasonable, it produces estimated covariance matrices with less bias than if equality were assumed, and with elements that have smaller variances than the elements of the ordinary unbiased covariance matrix estimator (Pepler, 2014).
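In matrix notation, the CPC assumption described above can be stated as follows. The symbols $\Sigma_i$, $B$, $\Lambda_i$ and $\lambda_{ij}$ are notation assumed here for illustration, for $k$ populations and $p$ variables; they are not taken from this summary.

```latex
\[
H_{\mathrm{CPC}}:\quad \Sigma_i = B \Lambda_i B^{\top}, \qquad i = 1, \dots, k,
\]
\[
B^{\top} B = I_p, \qquad
\Lambda_i = \operatorname{diag}(\lambda_{i1}, \dots, \lambda_{ip}).
\]
```

Under this notation, equality corresponds to $\Lambda_1 = \dots = \Lambda_k$ and proportionality to $\Lambda_i = \rho_i \Lambda_1$ for positive constants $\rho_i$, while unrelated covariance matrices place no common restriction on the $\Sigma_i$.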

The CPC model is the focus throughout the thesis. The model has commonly been estimated with maximum likelihood (ML) theory, which relies on the assumption of multivariate normal data, an assumption that does not hold for many real-world datasets. No explicit form of the ML estimator exists, but an iterative numerical algorithm called the FG-algorithm (Flury and Gautschi, 1986) can solve the ML equations exactly. Krzanowski (1984) proposed obtaining estimates of the CPC model from a principal component analysis (PCA) of a weighted sum of the sample covariance matrices. Such an estimator is free from any normality assumption on the data, has a simple functional form, and has intuitive appeal.
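The idea behind Krzanowski's approach, a PCA of a weighted sum of the covariance matrices, can be sketched in a few lines. This is a minimal illustration and not the implementation used in the thesis: the function name `krzanowski_cpc`, the default of equal weights, and the synthetic covariance matrices are all assumptions made for the example.

```python
import numpy as np

def krzanowski_cpc(cov_mats, weights=None):
    """Hypothetical helper: common eigenvectors from a weighted sum of
    covariance matrices, plus group-specific variances along those axes."""
    S = np.asarray(cov_mats, dtype=float)            # shape (k, p, p)
    k = S.shape[0]
    # Equal weights are an assumption; sample-size based weights are another option.
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    pooled = np.tensordot(w, S, axes=1)              # weighted sum, shape (p, p)
    _, B = np.linalg.eigh(pooled)                    # eigenvalues in ascending order
    B = B[:, ::-1]                                   # sort by decreasing pooled variance
    # lambdas[i, j] = j-th diagonal element of B' S_i B
    lambdas = np.einsum("ji,kjl,li->ki", B, S, B)
    return B, lambdas

# Synthetic example: two covariance matrices that share eigenvectors exactly.
rng = np.random.default_rng(0)
B0, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # random orthogonal matrix
covs = [B0 @ np.diag(d) @ B0.T for d in ([5.0, 3.0, 1.0], [8.0, 2.0, 0.5])]
B, lambdas = krzanowski_cpc(covs)
```

When the matrices share eigenvectors exactly, as in this constructed example, `B.T @ S_i @ B` is diagonal for every group. With sample covariance matrices the off-diagonal elements are only approximately zero.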

Just as its one-population counterpart, the CPC model can be used as a dimension-reduction technique within other types of statistical analysis. Multivariate methods often involve a lot of parameters. In general, if there are $p$ parameters in question, then the number of unknown parameters in the covariance matrix estimation is of order $p^2$. More parameters to estimate lead to less stable estimates in terms of variance. If the CPC model is appropriate, the principal components are estimated more accurately, which improves the quality of the analysis in which they are used.
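To make the parameter counts concrete, the sketch below tallies the number of free covariance parameters under four levels of the hierarchy for $k$ groups and $p$ variables. The function name is hypothetical, and the counts assume the usual parameterization (an orthogonal matrix contributes $p(p-1)/2$ free parameters, each diagonal matrix contributes $p$); the partial CPC model is left out because its count depends on how many eigenvectors are shared.

```python
def covariance_parameter_counts(p, k):
    """Hypothetical helper: free parameters in k covariance matrices of size p x p."""
    full = p * (p + 1) // 2                   # one unrestricted symmetric matrix
    return {
        "equality": full,                     # a single common matrix
        "proportionality": full + (k - 1),    # common matrix plus k - 1 constants
        "cpc": p * (p - 1) // 2 + k * p,      # shared eigenvectors, k sets of eigenvalues
        "unrelated": k * full,                # k unrestricted matrices
    }

counts = covariance_parameter_counts(p=4, k=3)
print(counts)
# {'equality': 10, 'proportionality': 12, 'cpc': 18, 'unrelated': 30}
```

For four variables and three groups, the CPC model uses 18 parameters where unrelated covariance matrices need 30, and the gap widens as $k$ grows.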

PCA is applicable in many different statistical situations. In multiple regression, multicollinearity causes problems with interpreting and estimating parameters. Both problems can be dealt with by performing the regression on the principal components instead of the original data: since the principal components are uncorrelated, the issue of multicollinearity disappears. Principal components can also be used to detect outliers. Outliers are either extreme observations in at least one dimension or have extreme combinations of variables, and a graphical representation of the scores of the first few or the last few principal components can facilitate their detection. Principal components have also proved useful in assessing the normality of random vectors. Srivastava (1984) developed measures of skewness and kurtosis for multivariate populations, based on principal components, from which tests of multivariate normality are derived, together with a graphical method for assessing multivariate normality. PCA can also be used to identify clusters among the variables in a data set. If the variables have a cluster structure, then there will be one principal component with high variance and some with low variance associated with each cluster. Clusters help with identifying the structure of the data, possibly allowing the dimension of the data to be reduced by keeping one PC for each cluster without losing too much information. For a thorough description of PCA, its applications and properties, see Jolliffe (2002). Common principal component analysis (CPCA) is essentially applied in the same areas and analyses as its one-population counterpart. The generalization from one population to multiple populations comes at the cost of being more mathematically involved, and many problems in the area remain to be solved.
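The regression use mentioned above can be illustrated with a small sketch of regression on principal components. Everything here is assumed for the example: the synthetic data with two nearly collinear regressors, the choice to keep `q = 2` components, and the variable names. It is meant only to show that the component scores are uncorrelated, not to reproduce any analysis in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
z = rng.standard_normal(n)
# Synthetic regressors: the first two are nearly collinear, the third is independent.
X = np.column_stack([
    z + 0.01 * rng.standard_normal(n),
    z + 0.01 * rng.standard_normal(n),
    rng.standard_normal(n),
])
y = X @ np.array([1.0, 1.0, 0.5]) + 0.1 * rng.standard_normal(n)

# PCA of the regressors via the eigendecomposition of their sample covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]                 # decreasing variance
eigvals, V = eigvals[order], V[:, order]
scores = Xc @ V                                   # principal component scores

# The scores are uncorrelated: their covariance matrix is diagonal up to rounding.
score_cov = np.cov(scores, rowvar=False)

# Regress the centered response on the first q components only (q is an assumed choice).
q = 2
gamma, *_ = np.linalg.lstsq(scores[:, :q], y - y.mean(), rcond=None)
beta_pcr = V[:, :q] @ gamma                       # coefficients on the original variables
```

Dropping the last component removes the near-singular direction created by the two collinear regressors, which is where the instability of ordinary least squares would come from.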

In this thesis, the performance of ML estimation and Krzanowski’s estimation is compared on real-world datasets and in a simulation study. It is shown that the orthogonal group and the Haar measure on this group play an important role in PCA, both in single- and multi-population PCA. Moreover, it is proposed how CPCA can be used in seemingly unrelated regressions, and some of its properties are derived.

The first paper is an application of the CPC model to two real-world data sets: annual Swedish municipality-level data related to innovations, collected for 11 years where each year constitutes a population, and the Iris flower data (Anderson, 1935), where each type of Iris flower is considered a population. The model identification procedures are based on the previously mentioned goodness-of-fit measures, AIC, BIC and an LR statistic, calculated for all five models in the hierarchy of similarities among covariance matrices. The models are estimated and compared using ML estimation and Krzanowski’s estimation, and a Monte Carlo simulation study is conducted in which the performance of the two estimation methods is compared for the CPC model. The simulation study investigates how the accuracy of the estimation methods is affected by autocorrelation and by the number of covariance matrices, dimensions and sample sizes, for multivariate normal data and for chi-square distributed data. The model selection procedure for the two data sets is not much affected by which estimation method is used, and Krzanowski’s estimation method consistently outperforms maximum likelihood estimation in the simulation study, seemingly making it the preferred method.

The second paper describes some properties of the orthogonal group and the associated Haar measure on it. It is demonstrated how recent results by Bai and Silverstein (2010), Meckes (2008) and Wijsman (1990) play an important role in single- and multi-population principal component analysis.

The last paper considers a class of seemingly unrelated regression (SUR) models obtained by incorporating common principal component analysis into the specification of the covariates. The model utilizes a method for simultaneous diagonalization of a set of quadratic forms, thereby imposing a constraint on the regression parameter space. The proposed estimators form a class of exogenous adaptive estimators which, under the normality assumption, possess a simple closed-form sampling distribution. This in turn facilitates inference on population parameters, such as hypothesis tests or interval estimates.

References

Anderson, E. (1935). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59:2–5.

Bai, Z. and Silverstein, J. W. (2010). Spectral analysis of large dimensional random matrices, volume 20. Springer.

Flury, B. (1988). Common principal components and related multivariate models. Wiley.

Flury, B. N. (1984). Common principal components in k groups. Journal of the American Statistical Association, 79(388):892–898.

Flury, B. N. and Gautschi, W. (1986). An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form. SIAM Journal on Scientific and Statistical Computing, 7(1):169–184.

Jolliffe, I. (2002). Principal component analysis. Wiley Online Library.

Krzanowski, W. (1984). Principal component analysis in the presence of group structure. Applied Statistics, pages 164–168.

Meckes, E. (2008). Linear functions on the classical matrix groups. Transactions of the American Mathematical Society, 360(10):5355–5366.

Pepler, P. T. (2014). The identification and application of common principal components. PhD thesis, Stellenbosch: Stellenbosch University.

Pepler, P. T., Uys, D., and Nel, D. (2016). A comparison of some methods for the selection of a common eigenvector model for the covariance matrices of two groups. Communications in Statistics-Simulation and Computation, 45(8):2917–2936.

Srivastava, M. S. (1984). A measure of skewness and kurtosis and a graphical method for assessing multivariate normality. Statistics & Probability Letters, 2(5):263–267.

Wijsman, R. A. (1990). Invariant measures on groups and their use in statistics. IMS.
