
Approximation of misclassification probabilities in linear discriminant analysis with repeated measurements

Edward Kanuti Ngailo^{a,1}, Dietrich von Rosen^{a,b}, Martin Singull^{a}

^{a} Department of Mathematics, Linköping University, SE-581 83 Linköping, Sweden
^{b} Department of Energy and Technology, Box 7032, SE-750 07 Uppsala, Sweden

^{1} Department of Mathematics, University of Dar Es Salaam, Box 35042, Dar Es Salaam, Tanzania

Abstract

In this paper, we propose approximations for the probabilities of misclassification in linear discriminant analysis when the means follow a growth curve structure. The discriminant function can classify a new observation vector of p repeated measurements into one of two multivariate normal populations with equal covariance matrix. We derive certain relations between the statistics under consideration in order to obtain approximations of the misclassification errors. Finally, we perform Monte Carlo simulations to evaluate the performance of the proposed results.

AMS Classification:

Keywords: Approximation, Growth Curve model, linear discriminant function, probability of misclassification.

1 Introduction

In the 1930s multivariate statistics was a blossoming area that attracted many researchers. One of the first to deal with discriminant analysis and classification as we know it today was Fisher (1936). Fisher published several papers on discriminant analysis, including Fisher (1938), in which he reviewed his 1936 work and related it to the contributions by Hotelling (1931), with his famous $T^2$ statistic, and by Mahalanobis (1936), with his $\Delta^2$ statistic and other distance measures.

Linear discriminant analysis is a technique which is commonly used for supervised classification problems. It is used for modeling differences between groups, that is, separating two or more classes by maximizing the distance between means under the assumption of the same covariances. Hastie et al. (2009) explored other supervised classification techniques which use object characteristics to identify the class an object belongs to, for example whether a new email is spam or non-spam, or whether a patient is diagnosed with a disease or not. These techniques include Bayesian classifiers, which are probabilistic classifiers based on Bayes' theorem; neural networks, which involve training, testing and using nonlinear functions to minimize the classification error; and support vector machines, which build a model that separates the classes by optimizing the margins between support vectors. Several textbooks, for example McLachlan (1992), Johnson and Wichern (2007), Rencher and Christensen (2012) and James et al. (2013), have treated discriminant analysis in detail.

The statistical problem treated by Fisher (1936) was that of assigning an unknown observation into one of two known groups on the basis of p measured characteristics, i.e., a feature vector $x = (x_1, \ldots, x_p)'$. Furthermore, the groups are assumed to follow distributions with the same covariance. The sample based Fisher's discriminant function (Anderson (2003)) equals
$$L(x; \bar{x}_1, \bar{x}_2, S_{pl}) = (\bar{x}_1 - \bar{x}_2)'S_{pl}^{-1}x - \tfrac{1}{2}(\bar{x}_1 - \bar{x}_2)'S_{pl}^{-1}(\bar{x}_1 + \bar{x}_2), \qquad (1)$$

where $\bar{x}_1$ and $\bar{x}_2$ denote the sample mean vectors of the two groups and $S_{pl}$ is the pooled sample covariance matrix given by

$$S_{pl} = \frac{(n_1 - 1)S^{(1)} + (n_2 - 1)S^{(2)}}{n - 2}, \quad \text{where } S^{(j)} = \frac{1}{n_j - 1}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_j)(x_{ij} - \bar{x}_j)',$$

with sample size $n = n_1 + n_2$. The classification rule for a new observation x is: classify x into $\pi_1$ if $L(x; \bar{x}_1, \bar{x}_2, S_{pl}) > 0$, and otherwise classify x into $\pi_2$. In order to understand the performance of the classification rule it is important to evaluate the probability of misclassification.
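As a concrete illustration, the following is a minimal sketch of the sample-based discriminant function (1) and the associated rule. The paper itself gives no code, so the use of Python/NumPy and the helper names (pooled_covariance, fisher_L) are our own choices.

```python
import numpy as np

def pooled_covariance(X1, X2):
    """Pooled sample covariance S_pl from two data matrices (rows = observations)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    S1 = np.cov(X1, rowvar=False)   # unbiased sample covariance, group 1
    S2 = np.cov(X2, rowvar=False)   # unbiased sample covariance, group 2
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def fisher_L(x, x1bar, x2bar, S_pl):
    """Evaluate L(x; x1bar, x2bar, S_pl) as in (1)."""
    d = x1bar - x2bar
    w = np.linalg.solve(S_pl, d)                  # S_pl^{-1}(x1bar - x2bar)
    return w @ x - 0.5 * w @ (x1bar + x2bar)

# rule: classify x into pi_1 if fisher_L(x, x1bar, x2bar, S_pl) > 0, otherwise into pi_2
```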

Discriminant analysis is usually applied to multivariate statistical problems in which several variables are collected simultaneously. However, Burnaby (1966) and Lee (1977, 1982), among others, considered discriminant analysis procedures for repeated measurements. Repeated measures discriminant analysis procedures are applied to data collected at multiple occasions on the same individual. Roy and Khattree (2005a,b) developed procedures based on univariate and multivariate repeated measures data, focusing on different covariance structures. Recently, Lix and Sajobi (2010) reviewed the literature on discriminant analysis for univariate and multivariate repeated measures data, focusing on covariance patterns and linear mixed-effects models with applications to psychological research.

One of the first to discuss discrimination between growth curves was Burnaby (1966). His focus was to generalize procedures for constructing discriminant functions as well as to propose a generalized distance between populations of repeated measurements. In our work, the classification of growth curves relies solely on the Growth Curve model given by Potthoff and Roy (1964). The classical linear classification function is modified by including changes on the group means. Lee (1977) considered classification of individuals into one of two growth curves using a Bayesian approach. The study was later extended by Nagel (1979). Again, Lee (1982) developed both non-Bayesian and Bayesian classification of growth curves. Their study considered two different covariance structures, the arbitrary positive definite covariance matrix and Rao's simple covariance structure. The present paper uses a non-Bayesian approach with a model assumption of an arbitrary positive definite covariance matrix. Unlike Lee (1982), our work explores the approximation of misclassification probabilities. Mentz and Kshirsagar (2005) developed the classification function with means following the Potthoff and Roy (1964) Growth Curve model structure. They investigated the performance of the classification function based on both an arbitrary covariance matrix and structured covariance matrices (compound symmetry and Rao's simple covariance structure).

The modifications required for the classification function (1) when the repeated measurements of the populations obey some structure have not been studied much in the literature. Apart from modifying the classical linear classification function (1) to include the changes caused by the Growth Curve model for the means, the present work studies in particular the classification function when the number of repeated measurements is allowed to grow close to the size of the sample. The results are derived for the case when the covariance matrix is known and when it is unknown. The follow-up analysis involves investigating the performance of the proposed approximations through Monte Carlo simulations.

The organization of this paper is as follows. In Section 2, the main idea is given and the classification function based on Potthoff and Roy (1964) is presented. In Section 3, the approximations of the probabilities of misclassifications are derived for both known and unknown covariance matrices and the proposed results are supported by a simulation study in Section 4. In Section 5, we finalize the paper by giving a brief summary of the results.

Throughout this paper matrices will be denoted by bold upper case letters, vectors by bold lower case letters, and elements of matrices by ordinary letters.

2 Classification into one of two growth curves

In this section we start by introducing the Growth Curve model of Potthoff and Roy (1964) which is helpful in formulating the linear discriminant function. The Growth Curve model is given by

$$X = ABC + E, \qquad E \sim N_{p,n}(0, \Sigma, I_n), \qquad (2)$$

where X is a $p \times n$ matrix of repeated measurements and E is the $p \times n$ error matrix with columns assumed to be independently p-variate normally distributed with mean vector 0 and a positive definite unknown covariance matrix $\Sigma$. Assuming a polynomial growth of order $q - 1$ and k independent groups with $n_i$, $i = 1, \ldots, k$, individuals, the design matrices $A: p \times q$, $C: k \times n$ and the parameter matrix $B: q \times k$ are given by

$$A = \begin{pmatrix} 1 & t_1 & \cdots & t_1^{q-1} \\ 1 & t_2 & \cdots & t_2^{q-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & t_p & \cdots & t_p^{q-1} \end{pmatrix}, \qquad C = \begin{pmatrix} 1'_{n_1} & 0'_{n_2} & \cdots & 0'_{n_k} \\ 0'_{n_1} & 1'_{n_2} & \cdots & 0'_{n_k} \\ \vdots & \vdots & \ddots & \vdots \\ 0'_{n_1} & 0'_{n_2} & \cdots & 1'_{n_k} \end{pmatrix}, \qquad B = (b_1, \ldots, b_k), \qquad (3)$$

where $1'_{n_i}$ and $0'_{n_i}$ are row vectors of ones and zeros, respectively. From now on suppose $k = 2$. The optimal classification rule that minimizes the probability of misclassification is: assign $x: p \times 1$ to $\pi_1$ if

$$q_1 f_1(x) > q_2 f_2(x), \qquad (4)$$

and to $\pi_2$ otherwise. Here $f_1(x)$ and $f_2(x)$ are the density functions for x belonging to $\pi_1$ or $\pi_2$, respectively, and $q_1$ and $q_2$ are the prior probabilities of the two populations. Given the Growth Curve model (2), we have the probability density functions

$$f_i(x) = (2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\left\{\Sigma^{-1}(x - Ab_i)(x - Ab_i)'\right\}\right\}, \qquad i = 1, 2,$$

and the classification rule (4) can be rewritten as

$$(x - Ab_2)'\Sigma^{-1}(x - Ab_2) - (x - Ab_1)'\Sigma^{-1}(x - Ab_1) > 2\ln\left(\frac{q_2}{q_1}\right),$$
which is identical to
$$(b_1 - b_2)'A'\Sigma^{-1}x - \tfrac{1}{2}(b_1 - b_2)'A'\Sigma^{-1}A(b_1 + b_2) > \ln\left(\frac{q_2}{q_1}\right).$$

Assuming $q_1 = q_2$, the linear classification function is given by

$$L(x; b_1, b_2, \Sigma) = (b_1 - b_2)'A'\Sigma^{-1}x - \tfrac{1}{2}(b_1 - b_2)'A'\Sigma^{-1}A(b_1 + b_2), \qquad (5)$$

and the classification rule is defined as: classify x to $\pi_1$ if $L(x; b_1, b_2, \Sigma) > 0$, and to $\pi_2$ otherwise. In this paper we will study two cases: $\Sigma$ known and $\Sigma$ unknown.
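For illustration only, a minimal sketch of the classification function (5) for known $\Sigma$ is given below; the function name and the NumPy-based implementation are our own assumptions and not part of the paper.

```python
import numpy as np

def growth_curve_L(x, A, b1, b2, Sigma):
    """L(x; b1, b2, Sigma) = (b1-b2)'A'Sigma^{-1}x - (1/2)(b1-b2)'A'Sigma^{-1}A(b1+b2)."""
    v = np.linalg.solve(Sigma, A) @ (b1 - b2)     # Sigma^{-1} A (b1 - b2), a p-vector
    return v @ x - 0.5 * v @ (A @ (b1 + b2))

# rule: assign x to pi_1 if growth_curve_L(x, A, b1, b2, Sigma) > 0, and to pi_2 otherwise
```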


2.1 Estimators of the parameters B and Σ

There exist different approaches to estimate the model parameters B and $\Sigma$ in the Growth Curve model of Potthoff and Roy (1964), see e.g., Srivastava and Khatri (1979), Kollo and von Rosen (2005), von Rosen (2018). The maximum likelihood estimator of B, when A and C are assumed to have full rank, is given by

$$\hat{B} = (\hat{b}_1, \hat{b}_2) = (A'S^{-1}A)^{-1}A'S^{-1}XC'(CC')^{-1},$$

where

$$S = X(I_n - P_{C'})X', \quad \text{with } P_{C'} = C'(CC')^{-1}C.$$

With C given in (3) and k = 2, we have the estimators for the mean parameters as

$$\hat{b}_i = (A'S^{-1}A)^{-1}A'S^{-1}\bar{x}_i, \qquad (6)$$

where $\bar{x}_i = \frac{1}{n_i}X^{(i)}1_{n_i}$, with $X^{(i)}$ being the observations from group $\pi_i$, $i = 1, 2$. Furthermore, the maximum likelihood estimator for $\Sigma$ is given by

$$n\hat{\Sigma} = (X - A\hat{B}C)(X - A\hat{B}C)' = S + (I_p - P_A)XP_{C'}X'(I_p - P_A)', \qquad (7)$$

which is positive definite with probability one when $p < n - k$, and
$$P_A = A(A'S^{-1}A)^{-1}A'S^{-1}$$
is a projection onto $\mathcal{C}(A)$, where $\mathcal{C}(\cdot)$ denotes the column space of a matrix. When $\Sigma$ is known, the estimator of the mean parameter B is given by

$$\hat{B} = (A'\Sigma^{-1}A)^{-1}A'\Sigma^{-1}XC'(CC')^{-1} = (\hat{b}_1, \hat{b}_2), \quad \text{where} \quad \hat{b}_i = (A'\Sigma^{-1}A)^{-1}A'\Sigma^{-1}\bar{x}_i \sim N_q\left(b_i, \tfrac{1}{n_i}(A'\Sigma^{-1}A)^{-1}\right).$$

Lemma 1. Let $\hat{\Sigma}$ be given in (7) and suppose that all included inverses exist. Then

$$A'(n\hat{\Sigma})^{-1} = A'S^{-1}.$$

The proof of this lemma can be found, for example, in von Rosen (2018).
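A hedged sketch of how the estimators (6)–(7) could be computed numerically is given below; the function name growth_curve_mle and the NumPy-based implementation are ours and are not taken from the paper. Lemma 1 can then be checked numerically by comparing $A'(n\hat{\Sigma})^{-1}$ with $A'S^{-1}$ on simulated data.

```python
import numpy as np

def growth_curve_mle(X, A, C):
    """Maximum likelihood estimators in the Growth Curve model X = ABC + E."""
    p, n = X.shape
    Pc = C.T @ np.linalg.solve(C @ C.T, C)        # P_{C'} = C'(CC')^{-1}C
    S = X @ (np.eye(n) - Pc) @ X.T                # S = X(I_n - P_{C'})X'
    SiA = np.linalg.solve(S, A)                   # S^{-1}A
    Bhat = np.linalg.solve(A.T @ SiA,             # (A'S^{-1}A)^{-1} A'S^{-1} X C'(CC')^{-1}
                           SiA.T @ X @ C.T @ np.linalg.inv(C @ C.T))
    R = X - A @ Bhat @ C                          # residual matrix
    Sigma_hat = R @ R.T / n                       # n*Sigma_hat = (X - A Bhat C)(X - A Bhat C)'
    return Bhat, Sigma_hat, S
```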

3 Approximation of probabilities of misclassification

In this section we consider the approximations of the probabilities of misclassification using the linear classification function given in (5). While in general it is hard to obtain the exact probabilities of misclassification, there have been extensive studies of their asymptotic approximations, including asymptotic expansions (see for example Okamoto (1963), Siotani (1982), McLachlan (1992), Fujikoshi (2000), Fujikoshi et al. (2010), Shutoh et al. (2011)). The main purpose of this section is to derive the approximations for the probabilities of misclassification by expressing the linear classification function in (5) as a location and scale mixture of the standard normal distribution. The probabilities of misclassification by the linear classification function (5) are denoted

$$e(2|1) = \Pr(L \le 0 \mid x \in \pi_1), \qquad e(1|2) = \Pr(L > 0 \mid x \in \pi_2),$$

where $e(2|1)$ is the probability of allocating an observation x of p repeated measurements into $\pi_2$, although it is known to come from $\pi_1$, and similarly for $e(1|2)$. In this study we are interested in deriving the approximation of $e(2|1)$. Note that in this article the term probabilities of misclassification is used interchangeably with misclassification errors.

3.1 Approximation of misclassification errors, known Σ

In this subsection we assume that $\Sigma$ is known. Suppose that the observation x of p repeated measurements comes from $\pi_1$. Then the conditional distribution of $L_0 = L(x; \hat{b}_1, \hat{b}_2, \Sigma)$ given $(\hat{b}_1, \hat{b}_2)$ is $N(-U_0, V_0)$, that is, $E[L_0 \mid \hat{b}_1, \hat{b}_2] = -U_0$ and $\mathrm{Var}(L_0 \mid \hat{b}_1, \hat{b}_2) = V_0$, where

$$U_0 = (\hat{b}_1 - \hat{b}_2)'A'\Sigma^{-1}A(\hat{b}_1 - b_1) - \tfrac{1}{2}V_0, \qquad (8)$$

$$V_0 = (\hat{b}_1 - \hat{b}_2)'A'\Sigma^{-1}A(\hat{b}_1 - \hat{b}_2). \qquad (9)$$

Hence, $L_0$ given $\hat{b}_1, \hat{b}_2$ can now be expressed as a location and scale mixture of the standard normal distribution given by

$$L_0 = V_0^{1/2}Z_0 - U_0, \qquad (10)$$
where $Z_0 = V_0^{-1/2}(\hat{b}_1 - \hat{b}_2)'A'\Sigma^{-1}(x - Ab_1)$.

Given $\hat{b}_1, \hat{b}_2$, $Z_0$ is obviously independent of $(U_0, V_0)$ and is distributed as $N(0, 1)$. The probability of misclassification, where x is assigned to $\pi_2$ when it actually belongs to $\pi_1$, can be expressed using (10) as

$$e_0(2|1) = \Pr(L_0 \le 0 \mid x \in \pi_1, \hat{b}_1, \hat{b}_2, \Sigma) = E[1_{\{L_0 \le 0\}}] = E_{(U_0,V_0)}\left[E[1_{\{L_0 \le 0\}} \mid U_0, V_0]\right]$$
$$= E_{(U_0,V_0)}\left[\Pr(L_0 \le 0 \mid U_0, V_0)\right] = E_{(U_0,V_0)}\left[\Pr(V_0^{1/2}Z_0 - U_0 \le 0 \mid U_0, V_0)\right] = E_{(U_0,V_0)}\left[\Phi(V_0^{-1/2}U_0)\right], \qquad (11)$$

where $\Phi(\cdot)$ denotes the cumulative distribution function of $N(0, 1)$, $E[\cdot]$ denotes the expectation and $1_{\{\cdot\}}$ denotes the indicator function. As an approximation of (11), we propose

$$e_0(2|1) \approx \Phi\left((E[V_0])^{-1/2}E[U_0]\right), \qquad (12)$$

obtained by replacing $U_0$ and $V_0$ with $E[U_0]$ and $E[V_0]$, in a similar manner as was done by Fujikoshi (2000) and Shutoh et al. (2011).

Theorem 1. The expectations of $U_0$ and $V_0$ defined by (8) and (9) equal
$$E[V_0] = \Delta^2 + \frac{n_1 + n_2}{n_1 n_2}q, \qquad E[U_0] = -\frac{1}{2}\left(\Delta^2 + \frac{n_1 - n_2}{n_1 n_2}q\right),$$
where $\Delta^2 = (b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2)$.

Proof. Firstly, $E[V_0]$ is derived. It is utilized that $\hat{b}_1$ and $\hat{b}_2$ are independently distributed. Then, since $\hat{b}_1 - \hat{b}_2 \sim N_q\left(b_1 - b_2, \left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)(A'\Sigma^{-1}A)^{-1}\right)$,

$$E[V_0] = E[(\hat{b}_1 - \hat{b}_2)'A'\Sigma^{-1}A(\hat{b}_1 - \hat{b}_2)] = E[\mathrm{tr}(A'\Sigma^{-1}A(\hat{b}_1 - \hat{b}_2)(\hat{b}_1 - \hat{b}_2)')]$$
$$= \mathrm{tr}\left(A'\Sigma^{-1}A\,E[(\hat{b}_1 - \hat{b}_2)(\hat{b}_1 - \hat{b}_2)']\right) = \mathrm{tr}\left(A'\Sigma^{-1}A\left(\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)(A'\Sigma^{-1}A)^{-1} + (b_1 - b_2)(b_1 - b_2)'\right)\right)$$
$$= (b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2) + \frac{n_1 + n_2}{n_1 n_2}q.$$

Moreover, $E[U_0]$ is calculated as
$$E[U_0] = E\left[(\hat{b}_1 - \hat{b}_2)'A'\Sigma^{-1}A(\hat{b}_1 - b_1)\right] - \tfrac{1}{2}E[V_0].$$
Consider
$$E[(\hat{b}_1 - \hat{b}_2)'A'\Sigma^{-1}A(\hat{b}_1 - b_1)] = E[\hat{b}_1'A'\Sigma^{-1}A\hat{b}_1 - \hat{b}_1'A'\Sigma^{-1}Ab_1 + \hat{b}_2'A'\Sigma^{-1}Ab_1 - \hat{b}_2'A'\Sigma^{-1}A\hat{b}_1]$$
$$= E[\hat{b}_1'A'\Sigma^{-1}A\hat{b}_1] - b_1'A'\Sigma^{-1}Ab_1 + b_2'A'\Sigma^{-1}Ab_1 - E[\hat{b}_2'A'\Sigma^{-1}A\hat{b}_1]$$
$$= E[\hat{b}_1'A'\Sigma^{-1}A\hat{b}_1] - b_1'A'\Sigma^{-1}Ab_1 = \frac{q}{n_1}.$$
Then
$$E[U_0] = \frac{q}{n_1} - \frac{1}{2}E[V_0] = -\frac{1}{2}\left((b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2) + \frac{n_1 - n_2}{n_1 n_2}q\right).$$


Combining Theorem 1 with the approximation (12), the following theorem is obtained.

Theorem 2. For the linear classification function based on (5) with unknown $b_1, b_2$ and known $\Sigma$, the misclassification errors can approximately be evaluated via
$$e_0(2|1) \approx \Phi(\gamma_0), \quad \text{where} \quad \gamma_0 = -\frac{1}{2}\,\frac{\Delta^2 + \frac{n_1 - n_2}{n_1 n_2}q}{\sqrt{\Delta^2 + \frac{n_1 + n_2}{n_1 n_2}q}},$$
with $\Delta^2 = (b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2)$ and $\Phi(\cdot)$ the distribution function of the standard normal distribution.

If $n_1$ and $n_2$ tend to infinity, then $e_0(2|1) \approx \Phi(-\tfrac{1}{2}\Delta)$. Although we can obtain the approximation for the misclassification errors, we cannot use $b_1$ and $b_2$ directly in the distance measure $\Delta^2$ since they are usually unknown. Thus, when utilizing $e_0(2|1)$, $b_1$ and $b_2$ are replaced by $\hat{b}_1$ and $\hat{b}_2$, respectively.
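A small sketch of how the approximation in Theorem 2 could be evaluated is shown below; the function name and arguments are ours, and, as noted above, $\Delta^2$ would in practice be computed from $\hat{b}_1$ and $\hat{b}_2$.

```python
import numpy as np
from scipy.stats import norm

def e0_approx(b1, b2, A, Sigma, n1, n2):
    """Approximate e_0(2|1) by Phi(gamma_0) as in Theorem 2 (known Sigma)."""
    q = A.shape[1]
    d = A @ (b1 - b2)
    Delta2 = d @ np.linalg.solve(Sigma, d)        # (b1-b2)'A'Sigma^{-1}A(b1-b2)
    gamma0 = -0.5 * (Delta2 + (n1 - n2) / (n1 * n2) * q) \
             / np.sqrt(Delta2 + (n1 + n2) / (n1 * n2) * q)
    return norm.cdf(gamma0)
```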

3.2 Approximation of misclassification errors, unknown Σ

In Subsection 3.1 we assumed that $\Sigma$ was known. In this subsection we shall assume that all parameters of the populations $\pi_i$, $i = 1, 2$, are unknown. Hence, we want to study $L(x; \hat{b}_1, \hat{b}_2, \hat{\Sigma})$. However, from Lemma 1 we note that $\frac{1}{n}A'\hat{\Sigma}^{-1} = A'S^{-1}$ and hence $L(x; \hat{b}_1, \hat{b}_2, n\hat{\Sigma}) = L(x; \hat{b}_1, \hat{b}_2, S)$. We study $L(x; \hat{b}_1, \hat{b}_2, n\hat{\Sigma}) = L(x; \hat{b}_1, \hat{b}_2, S)$ instead of $L(x; \hat{b}_1, \hat{b}_2, \hat{\Sigma})$ because we are just interested in the sign of L. Moreover, the distribution of the estimator S is a Wishart distribution, whereas the distribution of the maximum likelihood based estimator $\hat{\Sigma}$ is unknown (cf. von Rosen (2018)).

Theorem 3. Assume that the observation x comes from $\pi_1$. Then the statistic $L(x; \hat{b}_1, \hat{b}_2, S)$ can be expressed as $L = V^{1/2}Z - U$, where
$$V = (\hat{b}_1 - \hat{b}_2)'A'S^{-1}\Sigma S^{-1}A(\hat{b}_1 - \hat{b}_2), \qquad (13)$$
$$Z = V^{-1/2}(\hat{b}_1 - \hat{b}_2)'A'S^{-1}(x - Ab_1),$$
$$U = (\hat{b}_1 - \hat{b}_2)'A'S^{-1}A(\hat{b}_1 - b_1) - \tfrac{1}{2}\tilde{V}, \qquad (14)$$
and $\tilde{V} = (\hat{b}_1 - \hat{b}_2)'A'S^{-1}A(\hat{b}_1 - \hat{b}_2)$ is the sample Mahalanobis squared distance between the two populations.

The result (13) is obtained by noting that the conditional distribution of $(\hat{b}_1 - \hat{b}_2)'A'S^{-1}(x - Ab_1)$ given $\hat{b}_1, \hat{b}_2, S$ is $N(0, V)$ when x comes from $\pi_1$. Given $\hat{b}_1, \hat{b}_2, S$, we see that Z follows a standard normal distribution, i.e., $Z \sim N(0, 1)$; since this conditional distribution does not depend on $\hat{b}_1, \hat{b}_2, S$, it also holds unconditionally. Moreover, Z and (U, V) are independent. Analogously to (11), the probability of misclassification when x comes from $\pi_1$ is given by

$$e(2|1) = \Pr(L(x; \hat{b}_1, \hat{b}_2, S) \le 0 \mid x \in \pi_1, \hat{b}_1, \hat{b}_2, S) = E_{(U,V)}\left[\Phi(V^{-1/2}U)\right]. \qquad (15)$$

Moreover, as in Subsection 3.1, we consider the same type of approximation of the probability of misclassification (15):

$$e(2|1) \approx \Phi\left((E[V])^{-1/2}E[U]\right), \qquad (16)$$

which is found by replacing U and V by E[U ] and E[V ], respectively. Fujikoshi (2000) and Shutoh et al. (2011) presented a similar approach.

To find the expectations in the probability of misclassification (16), we need the following lemma.

Lemma 2. Let $S: p \times p$ be a random matrix distributed according to $W_p(n-2, \Sigma)$, where $\Sigma$ is positive definite. Then

(i) $E[S^{-1}\Sigma S^{-1}] = (n-3)d_1\Sigma^{-1}$,

(ii) $E[S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}] = d_2\Sigma^{-1} - d_3\left(\Sigma^{-1} - \Sigma^{-1}A(A'\Sigma^{-1}A)^{-1}A'\Sigma^{-1}\right)$,

(iii) $E[S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}]$
$= (n-3)d_1\Sigma^{-1} + \left((n-3)d_4 - (n+p-2q-3)d_5\right)\left(\Sigma^{-1} - \Sigma^{-1}A(A'\Sigma^{-1}A)^{-1}A'\Sigma^{-1}\right)$,

where
$$d_1 = \frac{1}{(n-p-2)(n-p-3)(n-p-5)}, \quad \text{if } n - p - 5 > 0,$$
$$d_2 = \frac{1}{n-p-3}, \quad \text{if } n - p - 3 > 0,$$
$$d_3 = \frac{1}{n-(p-q)-3}, \quad \text{if } n - 2 > p,$$
$$d_4 = \frac{1}{(n-(p-q)-2)(n-(p-q)-3)(n-(p-q)-5)}, \quad \text{if } n - (p-q) - 5 > 0,$$
$$d_5 = \frac{1}{(n-q-2)(n-(p-q)-3)(n-q-5)}, \quad \text{if } n - q - 2 > 0,\ n - 2 > (p-q) - 1.$$


Proof. (i) Let $\Sigma = MM'$. Define $S^* = M^{-1}S(M')^{-1}$ and note that $S^* \sim W_p(n-2, I_p)$. Then

$$E[S^{-1}\Sigma S^{-1}] = E[(M')^{-1}S^{*-1}M^{-1}\Sigma(M')^{-1}S^{*-1}M^{-1}] = (M')^{-1}E[S^{*-2}]M^{-1} = (n-3)d_1(M')^{-1}M^{-1} = (n-3)d_1\Sigma^{-1},$$

since $E[S^{*-2}] = (n-3)d_1 I_p$ (see Gupta (1968), p. 388, and Gupta and Nagar (2000), p. 100, for more details and derivations). Furthermore, for the proofs of (ii) and (iii) see for example von Rosen (2018), p. 447.
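Lemma 2 (i) can also be verified numerically. The following is a hedged Monte Carlo sketch (the dimensions, the choice of $\Sigma$ and the use of scipy.stats.wishart are our own) comparing the simulated mean of $S^{-1}\Sigma S^{-1}$ with $(n-3)d_1\Sigma^{-1}$.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
p, n, reps = 4, 30, 10000
Sigma = 0.5 * np.eye(p) + 0.5                     # an arbitrary positive definite matrix
acc = np.zeros((p, p))
for _ in range(reps):
    S = wishart.rvs(df=n - 2, scale=Sigma, random_state=rng)   # S ~ W_p(n-2, Sigma)
    Si = np.linalg.inv(S)
    acc += Si @ Sigma @ Si
d1 = 1.0 / ((n - p - 2) * (n - p - 3) * (n - p - 5))
print(np.round(acc / reps, 4))                                 # Monte Carlo estimate
print(np.round((n - 3) * d1 * np.linalg.inv(Sigma), 4))        # Lemma 2 (i)
```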

We are now ready to derive the expectations in (16).

Theorem 4. If $b_1, b_2$ and $\Sigma$ are unknown, then the expectations of U and V defined by (14) and (13) are
$$E[V] = c_1\Delta^2 + \frac{n_1 + n_2}{n_1 n_2}\left(pc_1 + (p-q)c_2\right), \qquad E[U] = -\frac{1}{2}\left(c_3\Delta^2 + \frac{n_1 - n_2}{n_1 n_2}\left((c_4 - c_5)p + c_5 q\right)\right),$$

where $\Delta^2$ is the squared Mahalanobis distance given by

$$\Delta^2 = (b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2),$$
$$c_1 = \frac{f-1}{(f-p)(f-p-1)(f-p-3)}, \qquad c_2 = \frac{1}{f-(p-q)-1}\left(\frac{f-1}{(f-(p-q))(f-(p-q)-3)} - \frac{f+p-2q-1}{(f-q)(f-q-3)}\right),$$
$$c_3 = \frac{1}{f-p-1}, \qquad c_4 = \frac{1}{f-p-1}, \qquad c_5 = \frac{1}{f-(p-q)-1},$$

and $f = n_1 + n_2 - 2$, and all constants are supposed to exist.

Proof. Note that S and $\hat{B} = (\hat{b}_1, \hat{b}_2)$ are not independent. However,

$$E[V] = E_S\left[E[V \mid S]\right] = E_S\left[E[(\hat{b}_1 - \hat{b}_2)'A'S^{-1}\Sigma S^{-1}A(\hat{b}_1 - \hat{b}_2) \mid S]\right] = E_S\left[\mathrm{tr}\left(A'S^{-1}\Sigma S^{-1}A\,E[(\hat{b}_1 - \hat{b}_2)(\hat{b}_1 - \hat{b}_2)' \mid S]\right)\right]. \qquad (17)$$

The conditional covariance matrix is

$$\mathrm{Cov}(\hat{B} \mid S) = (CC')^{-1} \otimes (A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}, \qquad (18)$$

where $\otimes$ denotes the Kronecker product and $\mathrm{Cov}(\cdot)$ denotes the covariance. Result (18) can be found, for example, in Kollo and von Rosen (2005). Since for our choice of C in (3) we have
$$(CC')^{-1} = \begin{pmatrix} \frac{1}{n_1} & 0 \\ 0 & \frac{1}{n_2} \end{pmatrix},$$

and it follows that

$$\mathrm{Cov}(\hat{b}_i \mid S) = \frac{1}{n_i}(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}, \qquad i = 1, 2.$$

From (17) and since

$$\hat{b}_1 - \hat{b}_2 \mid S \sim N_q\left(b_1 - b_2,\ \left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}\right), \qquad (19)$$
we obtain
$$E_S\left[E[V \mid S]\right] = E_S\left[\mathrm{tr}\left(A'S^{-1}\Sigma S^{-1}A\left(\tfrac{1}{n_1}(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1} + \tfrac{1}{n_2}(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1} + (b_1 - b_2)(b_1 - b_2)'\right)\right)\right]$$
$$= \frac{n_1 + n_2}{n_1 n_2}E_S\left[\mathrm{tr}\left(A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}\right)\right] + E_S\left[(b_1 - b_2)'A'S^{-1}\Sigma S^{-1}A(b_1 - b_2)\right]$$
$$= \frac{n_1 + n_2}{n_1 n_2}\mathrm{tr}\left(\Sigma\,E_S\left[S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\right]\right) + (b_1 - b_2)'A'E_S\left[S^{-1}\Sigma S^{-1}\right]A(b_1 - b_2). \qquad (20)$$

Using Lemma 2 (i) and (iii), with $S \sim W_p(n_1 + n_2 - 2, \Sigma)$, we have

$$E_S\left[S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\right] = c_1\Sigma^{-1} + c_2\left(\Sigma^{-1} - \Sigma^{-1}A(A'\Sigma^{-1}A)^{-1}A'\Sigma^{-1}\right),$$

where $c_1 = (f-1)d_1$, $c_2 = (f-1)d_4 - (f+p-2q-1)d_5$ and $f = n_1 + n_2 - 2$. Furthermore, from (20) we have

$$E_S\left[E[V \mid S]\right] = \frac{n_1 + n_2}{n_1 n_2}\mathrm{tr}\left(\Sigma\left(c_1\Sigma^{-1} + c_2(\Sigma^{-1} - \Sigma^{-1}A(A'\Sigma^{-1}A)^{-1}A'\Sigma^{-1})\right)\right) + c_1(b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2)$$
$$= \frac{n_1 + n_2}{n_1 n_2}\left(pc_1 + (p-q)c_2\right) + c_1(b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2),$$


and E[U] is calculated as follows:
$$E[U] = E_S\left[E[U \mid S]\right] = E_S\left[E\left[(\hat{b}_1 - \hat{b}_2)'A'S^{-1}A(\hat{b}_1 - b_1) - \tfrac{1}{2}\tilde{V} \mid S\right]\right] = E_S\left[E[(\hat{b}_1 - \hat{b}_2)'A'S^{-1}A(\hat{b}_1 - b_1) \mid S]\right] - \tfrac{1}{2}E_S\left[E[\tilde{V} \mid S]\right],$$

where $\tilde{V} = (\hat{b}_1 - \hat{b}_2)'A'S^{-1}A(\hat{b}_1 - \hat{b}_2)$. Note that $\hat{b}_1$ and $\hat{b}_2$ are unbiased and, given S, they are independently normally distributed. Consider

$$E_S\left[E[(\hat{b}_1 - \hat{b}_2)'A'S^{-1}A(\hat{b}_1 - b_1) \mid S]\right]$$
$$= E_S\left[E[\hat{b}_1'A'S^{-1}A\hat{b}_1 - \hat{b}_1'A'S^{-1}Ab_1 + \hat{b}_2'A'S^{-1}Ab_1 - \hat{b}_2'A'S^{-1}A\hat{b}_1 \mid S]\right]$$
$$= E_S\left[E[\hat{b}_1'A'S^{-1}A\hat{b}_1 \mid S]\right] - E_S\left[E[\hat{b}_1'A'S^{-1}Ab_1 \mid S]\right]$$
$$= E_S\left[\mathrm{tr}\left(A'S^{-1}A\,E[\hat{b}_1\hat{b}_1' \mid S]\right)\right] - E_S\left[b_1'A'S^{-1}Ab_1\right]$$
$$= E_S\left[\mathrm{tr}\left(A'S^{-1}A\left(\tfrac{1}{n_1}(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1} + b_1 b_1'\right)\right)\right] - E_S\left[b_1'A'S^{-1}Ab_1\right]$$
$$= \frac{1}{n_1}E_S\left[\mathrm{tr}\left(A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}\right)\right] = \frac{1}{n_1}\mathrm{tr}\left(\Sigma\,E_S\left[S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\right]\right) = \frac{1}{n_1}\left((c_4 - c_5)p + c_5 q\right), \qquad (21)$$

where we have used Lemma 2 (ii) in the last equality. Now, using (19), $E_S\left[E[\tilde{V} \mid S]\right]$ equals
$$E_S\left[E[\tilde{V} \mid S]\right] = E_S\left[E[(\hat{b}_1 - \hat{b}_2)'A'S^{-1}A(\hat{b}_1 - \hat{b}_2) \mid S]\right] = E_S\left[\mathrm{tr}\left(A'S^{-1}A\,E[(\hat{b}_1 - \hat{b}_2)(\hat{b}_1 - \hat{b}_2)' \mid S]\right)\right]$$
$$= E_S\left[\mathrm{tr}\left(A'S^{-1}A\left(\tfrac{1}{n_1}(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1} + \tfrac{1}{n_2}(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1} + (b_1 - b_2)(b_1 - b_2)'\right)\right)\right]$$
$$= \frac{n_1 + n_2}{n_1 n_2}E_S\left[\mathrm{tr}\left(A'S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\Sigma S^{-1}A(A'S^{-1}A)^{-1}\right)\right] + E_S\left[(b_1 - b_2)'A'S^{-1}A(b_1 - b_2)\right]$$
$$= \frac{n_1 + n_2}{n_1 n_2}\mathrm{tr}\left(\Sigma\,E_S\left[S^{-1}A(A'S^{-1}A)^{-1}A'S^{-1}\right]\right) + (b_1 - b_2)'A'E_S\left[S^{-1}\right]A(b_1 - b_2)$$
$$= c_3(b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2) + \frac{n_1 + n_2}{n_1 n_2}\left((c_4 - c_5)p + c_5 q\right), \qquad (22)$$
where $c_3 = \frac{1}{f-p-1}$. We arrived at (22) using Lemma 2. Combining (21) and (22) we get

$$E[U] = -\frac{1}{2}\left(c_3(b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2) + \frac{n_1 - n_2}{n_1 n_2}\left((c_4 - c_5)p + c_5 q\right)\right).$$

From the results given above we obtain the following theorem.

Theorem 5. For the classification based on (5) with unknown $b_1, b_2$ and $\Sigma$, the misclassification errors can be approximated as
$$e(2|1) \approx \Phi(\gamma), \quad \text{where} \quad \gamma = -\frac{1}{2}\,\frac{c_3\Delta^2 + \frac{n_1 - n_2}{n_1 n_2}\left((c_4 - c_5)p + c_5 q\right)}{\sqrt{c_1\Delta^2 + \frac{n_1 + n_2}{n_1 n_2}\left(pc_1 + (p-q)c_2\right)}},$$
with
$$c_1 = \frac{f-1}{(f-p)(f-p-1)(f-p-3)}, \qquad c_2 = \frac{1}{f-(p-q)-1}\left(\frac{f-1}{(f-(p-q))(f-(p-q)-3)} - \frac{f+p-2q-1}{(f-q)(f-q-3)}\right),$$
$$c_3 = \frac{1}{f-p-1}, \qquad c_4 = \frac{1}{f-p-1}, \qquad c_5 = \frac{1}{f-(p-q)-1}, \qquad f = n_1 + n_2 - 2,$$
and $\Delta^2 = (b_1 - b_2)'A'\Sigma^{-1}A(b_1 - b_2)$.
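The following is a minimal sketch of how the approximation in Theorem 5 could be evaluated numerically; the function name is ours, all denominators are assumed to be positive (cf. the existence conditions for the constants), and in practice $\Delta^2$ would be replaced by its estimate.

```python
import numpy as np
from scipy.stats import norm

def e_approx(Delta2, p, q, n1, n2):
    """Approximate e(2|1) by Phi(gamma) as in Theorem 5 (unknown Sigma)."""
    f = n1 + n2 - 2
    c1 = (f - 1) / ((f - p) * (f - p - 1) * (f - p - 3))
    c2 = ((f - 1) / ((f - (p - q)) * (f - (p - q) - 3))
          - (f + p - 2 * q - 1) / ((f - q) * (f - q - 3))) / (f - (p - q) - 1)
    c3 = c4 = 1 / (f - p - 1)
    c5 = 1 / (f - (p - q) - 1)
    EU = -0.5 * (c3 * Delta2 + (n1 - n2) / (n1 * n2) * ((c4 - c5) * p + c5 * q))
    EV = c1 * Delta2 + (n1 + n2) / (n1 * n2) * (p * c1 + (p - q) * c2)
    return norm.cdf(EU / np.sqrt(EV))

# e.g. e_approx(Delta2=1.0, p=10, q=4, n1=40, n2=40)
```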

4 Simulation study

The approximation of misclassification errors in the case of repeated measures observations that follow a Potthoff and Roy (1964) growth curve structure on the means has not been proposed before. In this section a simulation study is performed to investigate the performance of the approximations of the misclassification errors proposed in Theorems 2 and 5. We compare their performance with the relative frequencies, i.e., the number of times an observation is misclassified to $\pi_1$'s alternative $\pi_2$ while it actually comes from $\pi_1$, calculated by Monte Carlo simulation using the classification function (5), both for unknown $b_1, b_2$ and known $\Sigma$ and for unknown $b_1, b_2, \Sigma$. Put $n = 80$ with $n_1 = n_2 = n/2$, and $(t_1, \ldots, t_p) = (0.50, \ldots, 2.50)$. Set $p \in \{10, 20, 30, 40, 50, 60, 70, 72, 74, 100, 120\}$ for unknown $b_1, b_2$ and known $\Sigma$, and $p \in \{10, 20, 30, 40, 50, 60, 70, 72, 74\}$ for unknown $b_1, b_2, \Sigma$. Data $X: p \times n$ are generated using the Growth Curve model $X = ABC + E$, $E \sim N_{p,n}(0, \Sigma, I)$, where the design and parameter matrices are respectively given as

$$A = \begin{pmatrix} 1 & t_1 & t_1^2 & t_1^3 \\ 1 & t_2 & t_2^2 & t_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & t_p & t_p^2 & t_p^3 \end{pmatrix}, \qquad C = \begin{pmatrix} 1'_{40} & 0'_{40} \\ 0'_{40} & 1'_{40} \end{pmatrix}, \qquad B = \begin{pmatrix} 0.117 & 0.125 \\ -0.345 & -0.313 \\ 0.232 & 0.314 \\ -0.042 & -0.075 \end{pmatrix}.$$

For $\Sigma$ we have $\Sigma = DRD$, where $D = \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$, $\sigma_i = \sqrt{a_0 + (i-1)d}$ with $d = \frac{2 - a_0}{p - 1}$, $i = 1, \ldots, p$, $a_0 = 0.1$, and $R = (\rho_{ij})$, where $\rho_{ij} = (-1)^{i+j}r^{|i-j|^{\gamma}}$ with $r = 0.2$ and $\gamma = 0.1$, for $i, j = 1, \ldots, p$.
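A hedged sketch of the data-generating setup described above is given below; it assumes $n_1 = n_2$ and the third order polynomial means, and the function name and the NumPy-based construction are our own.

```python
import numpy as np

def simulate_growth_data(p, n1=40, n2=40, a0=0.1, r=0.2, gamma=0.1, seed=None):
    """Generate X = ABC + E with E ~ N_{p,n}(0, Sigma, I) and Sigma = DRD."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.50, 2.50, p)
    A = np.vander(t, 4, increasing=True)                 # columns 1, t, t^2, t^3
    C = np.kron(np.eye(2), np.ones(n1))                  # assumes n1 = n2
    B = np.array([[0.117, 0.125], [-0.345, -0.313],
                  [0.232, 0.314], [-0.042, -0.075]])
    d = (2 - a0) / (p - 1)
    D = np.diag(np.sqrt(a0 + d * np.arange(p)))          # sigma_i = sqrt(a0 + (i-1)d)
    i, j = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    R = (-1.0) ** (i + j) * r ** (np.abs(i - j) ** gamma)
    Sigma = D @ R @ D
    E = rng.multivariate_normal(np.zeros(p), Sigma, size=n1 + n2).T   # p x n errors
    return A @ B @ C + E, A, C, B, Sigma
```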

In Figure 1, the third order growth curves are given, showing the true mean growth profiles used when the data were simulated together with the estimated growth profiles for the two populations $\pi_1$ and $\pi_2$. Note that Figure 1 is produced for the case when $b_1, b_2, \Sigma$ are unknown. In the left plot, the dashed lines are close to the solid lines for p = 10. This means that the mean growth profile is well estimated with p = 10 repeated measurements. However, in the right plot, the discrepancy between the dashed and solid lines becomes considerably large as the number of repeated measurements increases. This means that the mean growth profile is poorly estimated with a large number of repeated measurements, i.e., p close to n.

In Table 1 the approximations of the misclassification errors are investigated for p repeated measurements in the case when $\Sigma$ is known. As before, $e_0(2|1)$ denotes the "true" approximation of the misclassification errors and $\hat{e}_0(2|1)$ denotes the estimated approximation of the misclassification errors. $R_f$ and $\hat{R}_f$ stand for the relative frequencies of misclassification computed using the classification function (5), for known and unknown $\Sigma$, respectively. An observation x is generated from $\pi_1$ with known $b_1$ and $\Sigma$, i.e.,

$$x = Ab_1 + e, \qquad e \sim N_p(0, \Sigma). \qquad (23)$$

$R_f$ is calculated as the relative number of events $\{L(x; \hat{b}_1, \hat{b}_2, \Sigma) \le 0\}$ with the observation x generated using (23) with $b_1$ and $\Sigma$ as the true values. $\hat{R}_f$ is calculated as the relative number of events $\{L(x; \hat{b}_1, \hat{b}_2, \Sigma) \le 0\}$ where the observation x is generated using (23) with $b_1$ and $\Sigma$ as the estimated values.


Figure 1: The third order growth curves describe the sample mean per group (solid lines) and the estimated mean growth curves (dashed lines) for the populations π1 and π2. In the left plot p = 10 whereas in the right plot p = 74.

In Table 1, the estimated values lie close to the "true" misclassification errors $e_0(2|1)$ when more information (repeated measurements) is included, and the misclassification errors become smaller for larger p. This means that the more information we have, the smaller the misclassification errors are. This can be seen in Figure 2 (left plot). It can be noted that there are values of the misclassification errors for $p > n - 2$, which is possible because $\Sigma$ is known.

In Table 2 we present results for the approximation of the misclassification errors in the case when $\Sigma$ is unknown. As earlier, $e(2|1)$ denotes the "true" approximation for the misclassification errors and $\hat{e}(2|1)$ denotes the estimated approximation for the misclassification errors. $R_f$ and $\hat{R}_f$ are the values of the relative frequencies. $R_f$ is calculated as the relative number of events $\{L(x; \hat{b}_1, \hat{b}_2, \hat{\Sigma}) \le 0\}$ with the observation x generated using (23) with $b_1$ and $\Sigma$ as the true values. $\hat{R}_f$ is calculated as the relative number of events $\{L(x; \hat{b}_1, \hat{b}_2, \hat{\Sigma}) \le 0\}$ where the observation x is generated using (23) with $b_1$ and $\Sigma$ as the estimated values.

In Table 2 the values of the estimated misclassification errors decrease when the number of repeated measurements is relatively small, as for p = 10 through p = 60. For a large number of repeated measurements, close to the sample size, the misclassification errors increase, see for example $p \in \{70, 72, 74\}$. This is due to the sample covariance matrix, which is not a good estimator when the number of repeated measurements gets larger. Also, in the last column of Table 2, the relative frequency $\hat{R}_f$ has zero values, for example from p = 60 through p = 74; this is because a new observation in the simulation is generated based on the estimated parameters instead of the true $b_1$ and $\Sigma$.

We conclude that the proposed approximation can be suggested for use when the number of repeated measurements is not close to the sample size. Moreover, the classification function (5) is not good for the approximation, especially when the covariance matrix is unknown and the number of repeated measurements grows near the sample size. The simulation results pave the way for proposing new estimators and investigating the case when the number of repeated measurements is comparable to the sample size or exceeds it.

Our method of evaluating the misclassification errors by an approximation approach can be extended in many ways, including generalizing the present study to extensions of the Potthoff and Roy (1964) model, such as the sum of profiles model (e.g., Verbyla and Venables (1988)).

Table 1: $e_0(2|1)$ and $\hat{e}_0(2|1)$ are the values of the "true" and estimated approximations of the misclassification errors, computed using Theorem 2 with $b_1, b_2, \Sigma$ and with $\hat{b}_1, \hat{b}_2, \Sigma$, respectively. $R_f$ and $\hat{R}_f$ are the values of the relative frequencies. $R_f$ is calculated as the relative number of events $\{L(x; \hat{b}_1, \hat{b}_2, \Sigma) \le 0\}$ with the observation x generated using (23) with $b_1$ and $\Sigma$ as the true values. $\hat{R}_f$ is calculated as the relative number of events $\{L(x; \hat{b}_1, \hat{b}_2, \Sigma) \le 0\}$ where the observation x is generated using (23) with $b_1$ and $\Sigma$ as the estimated values.

  p     e_0(2|1)   R_f     est. e_0(2|1)   est. R_f
  10    0.185      0.181   0.180           0.155
  20    0.104      0.104   0.102           0.091
  30    0.062      0.057   0.061           0.060
  40    0.038      0.024   0.037           0.048
  50    0.023      0.020   0.024           0.026
  60    0.015      0.013   0.015           0.013
  70    0.009      0.007   0.010           0.011
  72    0.008      0.006   0.008           0.006
  74    0.007      0.006   0.007           0.005
  100   0.002      0.003   0.003           0.002
  120   0.001      0.003   0.001           0.001

5 Summary

We have developed a linear classification function for the case when the means follow the Potthoff and Roy (1964) Growth Curve model. The developed classification function can assign a new observation of p repeated measurements to one of two specified groups. After developing a classification rule it is natural to enquire how well the decision rule can classify a new observation of p repeated measurements. In general, it is hard to obtain an exact expression for the probability of misclassification. We express the linear discriminant function as a location and scale mixture of the standard normal distribution and derive approximations for the probability of misclassification.

It seems that a larger p is better for classification when $\Sigma$ is known, but when $\Sigma$ is unknown and p is close to n we have problems with the instability of the sample covariance matrix S. If $p > n - 2$, then S is singular and a regular inverse cannot be taken.


Table 2: $e(2|1)$ and $\hat{e}(2|1)$ are the values of the "true" and estimated approximations of the misclassification errors, computed using Theorem 5 with $b_1, b_2, \Sigma$ and with $\hat{b}_1, \hat{b}_2, \hat{\Sigma}$, respectively. $R_f$ and $\hat{R}_f$ are the values of the relative frequencies. $R_f$ is calculated as the relative number of events $\{L(x; \hat{b}_1, \hat{b}_2, \hat{\Sigma}) \le 0\}$ with the observation x generated using (23) with $b_1$ and $\Sigma$ as the true values. $\hat{R}_f$ is calculated as the relative number of events $\{L(x; \hat{b}_1, \hat{b}_2, \hat{\Sigma}) \le 0\}$ where the observation x is generated using (23) with $b_1$ and $\Sigma$ as the estimated values.

  p    e(2|1)   R_f     est. e(2|1)   est. R_f
  10   0.216    0.208   0.188         0.162
  20   0.160    0.133   0.117         0.071
  30   0.139    0.121   0.074         0.029
  40   0.137    0.104   0.051         0.006
  50   0.151    0.108   0.035         0.002
  60   0.188    0.166   0.028         0.000
  70   0.281    0.229   0.032         0.000
  72   0.319    0.243   0.041         0.000
  74   0.385    0.278   0.090         0.000

References

Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, Third edition. Wiley Series in Probability and Statistics, New York.

Burnaby, T. (1966). Growth-invariant discriminant functions and generalized distances. Biometrics, 22:96–110.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188.

Fisher, R. A. (1938). The statistical utilization of multiple measurements. Annals of Eugenics, 8(4):376–386.

Fujikoshi, Y. (2000). Error bounds for asymptotic approximations of the linear discriminant function when the sample sizes and dimensionality are large. Journal of Multivariate Analysis, 73:1–17.

Fujikoshi, Y., Ulyanov, V. V., and Shimizu, R. (2010). Multivariate Statistics: High-Dimensional and Large-Sample Approximations. Wiley, New York.

Gupta, A. and Nagar, D. (2000). Matrix Variate Distributions. Chapman and Hall/CRC, Boca Raton.

Gupta, S. D. (1968). Some aspects of discrimination function coefficients. Sankhyā: The Indian Journal of Statistics, Series A, 30:387–400.


Figure 2: (a) True and estimated approximations for the misclassification errors computed using Theorem 2 and the relative frequencies $R_f$ and $\hat{R}_f$ for known $\Sigma$, calculated using (5). (b) True and estimated approximations for the misclassification errors calculated using Theorem 5 and the relative frequencies $R_f$ and $\hat{R}_f$, computed using the classification function (5) when $\Sigma$ is unknown.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, Stanford, California.

Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2:360–378.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer, New York.

Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Prentice Hall Upper Saddle River, New Jersey.

Kollo, T. and von Rosen, D. (2005). Advanced Multivariate Statistics with Matrices, volume 579. Springer Science, New York.

Lee, J. C. (1977). Bayesian classification of data from growth curves. South African Statistical Journal, 11(2):155–166.

Lee, J. C. (1982). Classification of growth curves. In Krishnaiah, P. R. and Kanal, L. W. (Eds.), Handbook of Statistics, Vol. 2. North-Holland.

Lix, L. and Sajobi, T. (2010). Discriminant analysis for repeated measures data: A review. Frontiers in Psychology.

(21)

Mahalanobis, P. C. (1936). On the generalized distance in statistics. National Institute of Science of India, 2:49–55.

McLachlan, G. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.

Mentz, G. B. and Kshirsagar, A. M. (2005). Classification using growth curves. Communications in Statistics - Theory and Methods, 33(10):2487–2502.

Nagel, D. (1979). Bayesian classification estimation and prediction of growth curves. South African Statistical Journal, 13(2):127–137.

Okamoto, M. (1963). An asymptotic expansion for the distribution of the linear discrim-inant function. The Annals of Mathematical Statistics, 34(4):1286–1301.

Potthoff, R. F. and Roy, S. (1964). A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika, 51:313–326.

Rencher, A. C. and Christensen, W. (2012). Methods of Multivariate Analysis. Wiley, Toronto, Canada.

Roy, A. and Khattree, R. (2005a). Discrimination and classification with repeated measures data under different covariance structures. Communications in Statistics-Simulation and Computation, 34(1):167–178.

Roy, A. and Khattree, R. (2005b). On discrimination and classification with multivariate repeated measures data. Journal of Statistical Planning and Inference, 134(2):462–485.

Shutoh, N., Hyodo, M., and Seo, T. (2011). An asymptotic approximation for EPMC in linear discriminant analysis based on two-step monotone missing samples. Journal of Multivariate Analysis, 102(2):252–263.

Siotani, M. (1982). Large sample approximations and asymptotic expansions of classification statistics. Handbook of Statistics, 2:61–100.

Srivastava, M. S. and Khatri, C. G. (1979). An Introduction to Multivariate Statistics. North Holland, New York.

Verbyla, A. P. and Venables, W. (1988). An extension of the Growth Curve model. Biometrika, 75(1):129–138.

von Rosen, D. (2018). Bilinear Regression Analysis: An Introduction. Springer, New York.
