
U.U.D.M. Project Report 2020:13

Degree project in mathematics, 30 credits. Supervisor: Silvelyn Zwanzig. Examiner: Denis Gaidashev. June 2020

Department of Mathematics

Uppsala University

Conditional mean variables: A method for estimating latent linear relationships with discretized observations


Conditional mean variables: A method for estimating latent linear relationships with discretized observations

Mathias Berggren

Abstract

A moment-based method is proposed for estimating the linear relationships between latent continuous variables that can only be observed through discretized (usually ordered) categories. This method is based on finding the conditional means of the underlying variable given the observations. The consistency of these conditional mean-based estimates and their correspondence to the parameters in the underlying model is proven and demonstrated in a number of theorems. Finally, the methodology is compared to some related methods.

Acknowledgements

I would like to thank my supervisor Silvelyn Zwanzig for her help and guidance in writing this thesis, and in particular for her great help in structuring and presenting the ideas in it more clearly.


Table of contents

Abstract
1. Introduction
1.1 Linear regression
1.2 The linear regression model with categorically observed variables
2. Conditional mean variables
2.1 Basic properties of conditional mean variables
2.2 Conditional mean variables in linear regression
2.2.1 As one independent variable
2.2.2 As multiple independent variables – Some special cases
2.2.3 As dependent variable
2.2.4 As independent and dependent variables
2.3 A linear relationship between two approximately standard uniform distributions
2.4 Factor analysis with conditional mean variables
3. Connections to other methods
3.1 Mann-Whitney U test and Wilcoxon signed rank test
3.2 Spearman correlation
3.3 Polyserial and Polychoric correlation
3.4 Logistic and probit regression
3.5 Some additional methods


1. Introduction

This thesis is concerned with the application of (usually ordered) categorical variables for estimating an underlying, i.e. latent, continuous variable and its relation to other variables. We will specifically focus on how the categorical variable allows estimation of the slope parameters in a linear relationship between the latent variable and some other variable, both when the latent variable is an independent variable and when it is the dependent variable. In order to do so we must assume the distribution of the latent variable, which allows us to associate with each category a conditional mean of the latent variable. This in turn allows us to estimate that variable's linear relationship.

The approach we take here, which is to construct these conditional mean variables and apply them to estimate the moments in the usual least squares linear regression estimation, is connected to several other methods for working with latent variables, and we will relate it to some of the most common of them. However, to the author's knowledge, this way of working with conditional mean variables as substitutes for the latent variable has not been pursued in this focused way before. It is the author's belief that working with the moment method directly allows some additional insights into how these latent variable approaches work. As an example, Theorem 7 in this thesis provides a simple alternative to probit regression and polyserial correlation estimation within the multivariate normal model. To facilitate such insight, we will frequently complement our mathematical proofs with illustrations of how the conditional mean variables achieve their goal.

We begin by providing an overview of the problem in Chapter 1. In Chapter 2, the main part of this thesis, we develop the conditional mean variable method and its use in regression. This method is summarized in a number of theorems, which are exemplified and illustrated to provide insight into the functioning of conditional mean variables. This is then related to existing methods used for ordered categorical data in Chapter 3.

1.1 Linear regression

In regression analysis we want to predict the value of one variable (the dependent variable, here always denoted by 𝑌) from a set of other variables (the independent variables, here always denoted by $X_j$ for the 𝑗:th variable). In its simplest form, linear regression, the independent variables are modelled as having a linear relationship with the dependent variable as follows:

(Eq. 1.1) $Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon$

We will write single parameters and variables as per above, while writing matrices and vectors in bold notation. If we define:

$\boldsymbol{X} := \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}, \qquad \boldsymbol{\beta} := \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$

Then Eq 1.1 can be written in matrix notation as per below:

(Eq. 1.2) $Y = \beta_0 + \boldsymbol{X}^T \boldsymbol{\beta} + \varepsilon$

This relationship states that the 𝑗:th of the 𝑝 independent variables is weighted by the unknown parameter $\beta_j$. In addition, there is an unknown intercept $\beta_0$ and an error term 𝜀 with mean zero that captures 𝑌's random variation around the linear relationship. Usually 𝜀 is modelled as being independent of all the 𝑋-variables, with $E\varepsilon = 0$ and $Var(\varepsilon) < \infty$. In case we have 𝑛 independent and identically distributed observations from this model, the data will be denoted by $(\boldsymbol{X}_i, Y_i)_{i=1,\ldots,n}$, or $(\boldsymbol{X}_i, Y_i)$ for short. If there is only one independent variable, this will be written as $(X_i, Y_i)$, to keep with our matrix notation. Given these observations, let $\hat{\boldsymbol{\beta}}$ be the $(p+1)\times 1$ vector of beta-estimates, let $\boldsymbol{X}_{obs}$ be the $n\times(p+1)$ matrix of the 𝑛 observations (the rows) on the 𝑝 variables (the columns) plus a leading column of ones to capture the intercept, and let $\boldsymbol{Y}_{obs}$ be the $n\times 1$ vector of observations on the dependent variable. Assume further that $\boldsymbol{X}_{obs}^T\boldsymbol{X}_{obs}$ is of full rank, i.e. that $n > p$ and that no variable is a linear transformation of another. Then, by the Gauss-Markov least squares theorem (Aitken, 1935), the minimum variance linear unbiased estimator of the betas is equivalent to the moment estimator, given by:

(Eq. 1.3) $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}_{obs}^T \boldsymbol{X}_{obs})^{-1} \boldsymbol{X}_{obs}^T \boldsymbol{Y}_{obs}$

Denote by $\boldsymbol{X}_{obs,1:p}$ the same matrix of 𝑋-observations as before but without the leading column of ones, by $\bar{\boldsymbol{X}}_{obs,1:p}$ the vector of the 𝑝 sample means of the 𝑋-variables in their corresponding columns, and by $\bar{Y}$ the sample mean of the 𝑌-variable. Then an equivalent way to write these estimates, which we will find useful, is:

(Eq. 1.4a) $\hat{\boldsymbol{\beta}}_{1:p} := \begin{pmatrix}\hat{\beta}_1 \\ \vdots \\ \hat{\beta}_p\end{pmatrix} = \left((\boldsymbol{X}_{obs,1:p}-\bar{\boldsymbol{X}}_{obs,1:p})^T(\boldsymbol{X}_{obs,1:p}-\bar{\boldsymbol{X}}_{obs,1:p})\right)^{-1}(\boldsymbol{X}_{obs,1:p}-\bar{\boldsymbol{X}}_{obs,1:p})^T(\boldsymbol{Y}_{obs}-\bar{Y})$

(Eq. 1.4b) $\hat{\beta}_0 = \bar{Y} - \hat{\boldsymbol{\beta}}_{1:p}^T\,\bar{\boldsymbol{X}}_{obs,1:p}$

That is because the true beta-parameters under the linear model (Eq 1.2) fulfill:

(Eq. 1.5a) $\boldsymbol{\beta} = \begin{pmatrix}\beta_1 \\ \vdots \\ \beta_p\end{pmatrix} = Cov(\boldsymbol{X},\boldsymbol{X})^{-1} Cov(\boldsymbol{X}, Y)$

(Eq. 1.5b) $\beta_0 = EY - \begin{pmatrix}\beta_1 \\ \vdots \\ \beta_p\end{pmatrix}^T \begin{pmatrix}EX_1 \\ \vdots \\ EX_p\end{pmatrix}$

Where 𝑿 is the $p\times 1$ vector of the 𝑋-variables (as in Eq. 1.2). The equations above also hold under the somewhat weaker assumption that $E(\varepsilon\mid\boldsymbol{X}=\boldsymbol{x}) = 0$ for all 𝒙 and $Var(\varepsilon) < \infty$, rather than the error being independent of all 𝑋. This ensures that $Cov(X_j, \varepsilon) = 0$ for all 𝑗. If the variance of the error is not constant, the estimates still converge to the correct values, but they are no longer the minimum variance unbiased linear estimators. If 𝑾 is the $n\times n$ diagonal weight matrix with the reciprocal of the error variance for the 𝑖:th observation on the 𝑖:th entry of the diagonal, then this is instead fulfilled for the following weighted estimator:

(Eq. 1.6) $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}_{obs}^T \boldsymbol{W} \boldsymbol{X}_{obs})^{-1} \boldsymbol{X}_{obs}^T \boldsymbol{W} \boldsymbol{Y}_{obs}$

We will generally work with the non-weighted estimates, but this weighting can be useful to keep in mind when we study the estimates' optimality properties. An example of a linear relationship between two variables is illustrated in Figure 1.1.1.
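As a concrete illustration of the moment form of these estimators, the following minimal Python sketch (using numpy; the simulated data, sample size, and parameter values are illustrative choices, not taken from the thesis) computes the least squares estimates both via Eq. 1.3 and via the centred form in Eq. 1.4a-1.4b, and the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 2
X = rng.normal(size=(n, p))                    # n observations on p independent variables
beta_true = np.array([0.5, -1.0])
Y = 2.0 + X @ beta_true + rng.normal(size=n)   # linear model with intercept beta_0 = 2

# Eq. 1.3: design matrix with a leading column of ones for the intercept
X_obs = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.solve(X_obs.T @ X_obs, X_obs.T @ Y)

# Eq. 1.4a/1.4b: centred (moment) form
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()
beta_slopes = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)
beta_0 = Y.mean() - beta_slopes @ X.mean(axis=0)

print(beta_hat)             # approximately [2, 0.5, -1]
print(beta_0, beta_slopes)  # the same values, obtained via Eq. 1.4
```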


Figure 1.1.1: An example of a linear relationship. Here $X \sim U(0,1)$ and $Y\mid X \sim Beta\left(\frac{\alpha+\beta X}{2\alpha(1-\alpha)}, \frac{1-(\alpha+\beta X)}{2\alpha(1-\alpha)}\right)$, hence $E(Y\mid X) = \alpha + \beta X$ by the usual properties of the Beta distribution, and $\varepsilon\mid X \sim Beta\left(\frac{\alpha+\beta X}{2\alpha(1-\alpha)}, \frac{1-(\alpha+\beta X)}{2\alpha(1-\alpha)}\right) - E(Y\mid X)$. Shown is a sample of 10 000 draws following this relationship, with $\beta = 0.75$, $\alpha = 0.5(1-\beta)$, as well as the true mean line (thick black line) and the least squares estimated line (thin red line). For a justification for looking at this type of relationship, see Section 2.3.

1.2 The linear regression model with categorically observed variables

Importantly for our purposes, it is usually assumed that all the 𝑋 and 𝑌 variables of the model can be observed perfectly. If some cannot, say if there is some error in the variables, the estimation usually becomes more complicated. We will look at a special case of such errors, namely when some or all variables cannot be observed more precisely than the category they get sorted into – corresponding to a part of some coarser partition of the support of the variables. If this is the case, we will call the underlying variable that we are truly interested in the latent variable (following usual terminology for unobserved variables), and the categorical variable the categorically observed variable. We will also sometimes refer to this categorically observed variable as the discretization of the corresponding latent variable, capturing that a continuous range of the latent variable's support has been made into one discrete category on the observed variable. That is, if 𝑋 is a categorically observed variable, and the partition of the support of 𝑋 which decides the categories is given by $\mathcal{A}_X = \{\mathcal{A}_{X,k}\}_{k=1,\ldots,K}$, then $x \in \mathcal{A}_{X,k}$ means the observed variable gets sorted into the 𝑘:th category. We will drop the 𝑋-subscript when it is obvious which variable the partition refers to. The number of categories, 𝐾, is possibly infinite, although we will usually work with the case when 𝐾 is small. If the support of 𝑋 is denoted by $\Omega_X$, then for $\mathcal{A}_X$ to be a partition of $\Omega_X$ it is required that:

$\mathcal{A}_{X,k} \cap \mathcal{A}_{X,j} = \emptyset,\ k \neq j, \qquad \bigcup_{k=1,\ldots,K} \mathcal{A}_{X,k} = \Omega_X$

One way to describe this model is to define 𝐾 new variables ∆𝑋,𝑘 as follows:

$\Delta_{X,k} := \begin{cases}1 & \text{if } X \in \mathcal{A}_k \\ 0 & \text{otherwise}\end{cases}, \qquad \Delta_X := \{\Delta_{X,k}\}_{k=1,\ldots,K}$

Then, if 𝑌 is observed and 𝑋 is categorically observed, and we have 𝑛 independent and identically distributed observations, we will write $(\Delta_{X_i}, Y_i)$ as the model for the data (understood as $(\Delta_{X_i}, Y_i)_{i=1,\ldots,n}$), to keep with the notation in the previous section. Correspondingly, if 𝑋 is observed and 𝑌 is categorically observed we write $(X_i, \Delta_{Y_i})$, and if both are categorically observed we write $(\Delta_{X_i}, \Delta_{Y_i})$. If we have $p > 1$ categorically observed independent variables, each with $K_j$ categories, we write the partition as $\mathcal{A}_{X_j} = \{\mathcal{A}_{X_j,k}\}_{k=1,\ldots,K_j}$ for the 𝑗:th such variable, and the variables as:

$\Delta_{X_j,k} := \begin{cases}1 & \text{if } X_j \in \mathcal{A}_{X_j,k} \\ 0 & \text{otherwise}\end{cases}, \qquad \Delta_{X_j} := \{\Delta_{X_j,k}\}_{k=1,\ldots,K_j}, \qquad \Delta_{\boldsymbol{X}} := \{\Delta_{X_j}\}_{j=1,\ldots,p}$

The model for the data is then written as $(\Delta_{\boldsymbol{X}_i}, Y_i)$ if the dependent variable is not categorically observed (and correspondingly in other cases). In case only some of the independent variables are categorically observed, say variables 1 and 3 out of 4 independent variables, then we will denote the observed model with subscripts, where the first subscript, using 𝑗 or the exact number, denotes the variable, and the second subscript, using 𝑖, denotes the observation, i.e. $(\Delta_{\boldsymbol{X}_{\{1,3\}},i}, \boldsymbol{X}_{\{2,4\},i}, Y_i)$. As before, we will drop the 𝑋 and 𝑌 subscripts if it is obvious which variable the ∆-variables refer to.

As an example, say we want to estimate the temperature in °C of some water (under standard conditions). If we do not have access to a thermometer and only estimate the temperature by looking at whether the water is a solid, a liquid, or a gas, this would result in a categorical partition with observed model $(\Delta_{X_i}) = (\{\Delta_1, \Delta_2, \Delta_3\}_i)$, corresponding to three categories that indicate whether the temperature is below 0 ($\mathcal{A}_1 = \{x: x \leq 0\}$), between 0 and 100 ($\mathcal{A}_2 = \{x: 0 < x \leq 100\}$), or over 100 ($\mathcal{A}_3 = \{x: x > 100\}$).

Only knowing that some of the variables are categorically observed is not enough to capture the underlying relationship between the 𝑋 and 𝑌 variables. If an independent variable can only be observed categorically, dummy-coding the different categories is commonly used. This means that one category is designated as the baseline and scored 0, and a variable for each other category is constructed and scored 1 if that category is observed and 0 if it is not. Correspondingly, if it is the dependent variable that can only be observed categorically, a common solution is to use regression models for categorical variables, such as logistic or probit regression, which assume some probability distribution for being in the different categories given the independent variables (a logistic or normal distribution, respectively). However, especially for dummy-coding, such solutions no longer estimate the original relationship, as they ignore how the scores on the underlying variable relate to the other variables and only look at how the observed categories do so. This is important, since the discretized measures need not follow the same relationship as their latent variables.

In order to correctly capture the underlying true relationship between the latent variables, the idea we follow here is to assume that some additional information is known, and to study how this allows estimating the original relationship. Specifically, if the categorically observed variable is an independent variable, we assume that the distribution of the underlying variable is known. So if the 𝑗:th independent variable is categorically observed, we know $F_{X_j}$ completely, that is, we know $X_j \sim F_{X_j}$. Similarly, if the categorically observed variable is a dependent variable, we assume that the conditional distribution given the independent variables, as well as its marginal distribution, is known up to but not including the beta-parameters in the linear model. This is the same as assuming that we know $F_{\varepsilon\mid\boldsymbol{X}}$ and $F_Y$ completely, except for the beta-parameters, which can affect their distribution. That is, for any possible values of the parameters 𝜷, if $F_{Y,\boldsymbol{\beta}}$ denotes the distribution of 𝑌 given those parameter values, then we know the distributions so that $\varepsilon\mid\boldsymbol{X} \sim F_{\varepsilon\mid\boldsymbol{X},\boldsymbol{\beta}}$ and $Y \sim F_{Y,\boldsymbol{\beta}}$. In addition, we assume some knowledge that allows us either to know the partition of the support, or otherwise to estimate this partition. This will usually, but not always, be that the categories are ordered, so that if $x_1 \in \mathcal{A}_k$ and $x_2 \in \mathcal{A}_{k+1}$, then $x_1 < x_2$, for all 𝑘.

With our model in place, we now turn to developing the method needed to estimate the beta-parameters under this model. As we will show, the problem will be solved by the application of conditional mean variables, which take the conditional mean of the latent variable given the part of its partition that gives rise to the categorical observation.


2. Conditional mean variables

As explained in the previous chapter, the general idea we follow, common to other latent variable approaches, is to consider the categorical observations as resulting from an underlying (latent) variable, so that when that variable is in a particular part of its support, the observation gets sorted into a corresponding category (see Figure 2.1.1 below for an illustration). Knowledge of how the support is partitioned, and correspondingly which part of the partition a category is connected to, then provides numerical information about the underlying variable that can be used in regression. Evidently, if we have a limited number of categories, we cannot have a one-to-one function between the category and the exact value of the latent variable. Luckily though, this is not usually needed. As we will show in this chapter, in order to get consistent estimators of the beta-parameters in the underlying model, it is enough to know – or be able to estimate – the conditional means of the underlying variable given that it is in a particular part of the partition, and then apply a function that assigns this conditional mean as the value of all observations in the corresponding category. With this in mind, we now turn to the construction of these conditional mean variables.

Definition 1: $Z_{\boldsymbol{X},\mathcal{A}}$ is a conditional mean variable with respect to 𝑿 and some partition $\mathcal{A} = \{\mathcal{A}_k\}_{k=1,\ldots,K}$ of $\Omega_{\boldsymbol{X}}$, if:

$\boldsymbol{X} \in \mathcal{A}_k \;\rightarrow\; Z_{\boldsymbol{X},\mathcal{A}} = E(\boldsymbol{X}\mid\boldsymbol{X}\in\mathcal{A}_k) = \frac{\int_{\mathcal{A}_k}\boldsymbol{x}f(\boldsymbol{x})\,d\boldsymbol{x}}{\int_{\mathcal{A}_k}f(\boldsymbol{x})\,d\boldsymbol{x}}, \quad \forall k$

Definition 2: $R_{\boldsymbol{X},\mathcal{A}}$ is a conditional remainder with respect to 𝑿 and some partition $\mathcal{A} = \{\mathcal{A}_k\}_{k=1,\ldots,K}$ of $\Omega_{\boldsymbol{X}}$, if:

$\boldsymbol{X} \in \mathcal{A}_k \;\rightarrow\; R_{\boldsymbol{X},\mathcal{A}} = \boldsymbol{X} - E(\boldsymbol{X}\mid\boldsymbol{X}\in\mathcal{A}_k), \quad \forall k$

It directly follows that:

$\boldsymbol{X} = Z_{\boldsymbol{X},\mathcal{A}} + R_{\boldsymbol{X},\mathcal{A}}$


We will frequently omit the subscript 𝒜 in the 𝑍 and 𝑅 variables, and so only write e.g. $Z_{\boldsymbol{X}}$ when the particular partition is not relevant. We will also frequently refer to $R_{\boldsymbol{X}}$ simply as a remainder, which is to be understood as a remainder with respect to $Z_{\boldsymbol{X}}$.

As per the previous chapter, another way to write these variables is to, if the partition has 𝐾 parts, define 𝐾 new variables ∆𝑘 as follows:

$\Delta_k := \begin{cases}1 & \text{if } X \in \mathcal{A}_k \\ 0 & \text{otherwise}\end{cases}$

So that:

$P(\Delta_k = 1) = P(X \in \mathcal{A}_k) = \int_{\mathcal{A}_k} f(x)\,dx$

Then the 𝑍𝑋,𝒜 variable can be written as:

(Eq. 2.1) $Z_{X,\mathcal{A}} = \sum_{k=1}^{K} E(X\mid X\in\mathcal{A}_k)\,\Delta_k$

Similarly, the 𝑅𝑋,𝒜 variable can be written as:

(Eq. 2.2) $R_{X,\mathcal{A}} = X - \sum_{k=1}^{K} E(X\mid X\in\mathcal{A}_k)\,\Delta_k = \sum_{k=1}^{K}\left(X - E(X\mid X\in\mathcal{A}_k)\right)\Delta_k$

Where the last equality holds because the sum of all $\Delta_k$ variables must equal 1 (since 𝑋 always lies in exactly one part of the partition of its support).

With the definition of conditional mean variables and their remainder in place, we now turn to some basic properties of these variables, which will be useful when we apply them in linear regression.


2.1 Basic properties of conditional mean variables

This section is a prerequisite for the proofs with regard to regression in the following sections. Here we formulate and analytically prove some properties of conditional mean variables and their corresponding remainders in two useful theorems.

Theorem 1: Suppose that $Z_X$ is a conditional mean variable with respect to 𝑋 and some partition 𝒜 of $\Omega_X$, that $R_X$ is the corresponding remainder, that 𝑎 and 𝑏 are some constants, and that $Y_1$ and $Y_2$ are some other variables, possibly dependent on 𝑋. Then:

(T1.1) $EZ_X = EX$

(T1.2) $Cov(Z_X, R_X) = 0$

(T1.3) $Z_{a+bX,\,a+b\mathcal{A}} = a + bZ_{X,\mathcal{A}}$

(T1.4) $Z_{Y_1+Y_2\mid X,\mathcal{A}} = Z_{Y_1\mid X,\mathcal{A}} + Z_{Y_2\mid X,\mathcal{A}}$

Where $a + b\mathcal{A}_k := \{a + bx : x \in \mathcal{A}_k\}$, and by $Z_{Y_1+Y_2\mid X,\mathcal{A}}$ we mean:

$Z_{Y_1+Y_2\mid X,\mathcal{A}} := \sum_{k=1}^{K} E\bigl((Y_1+Y_2\mid X)\mid X\in\mathcal{A}_k\bigr)\Delta_k = \sum_{k=1}^{K} E(Y_1+Y_2\mid X\in\mathcal{A}_k)\Delta_k$

Where $\Delta_k$ is the indicator variable for whether $X \in \mathcal{A}_k$ or not.

We will take the proofs step by step.

Proof of T1.1: This part is proven by simply writing out the equation of $EZ_X$, using the notation in Eq. 2.1:

$EZ_X = E\left(\sum_{k=1}^{K} E(X\mid X\in\mathcal{A}_k)\Delta_k\right) = \sum_{k=1}^{K} E(X\mid X\in\mathcal{A}_k)\,P(X\in\mathcal{A}_k)$

Since:

$E(X\mid X\in\mathcal{A}_k) = \frac{\int_{\mathcal{A}_k} x f(x)\,dx}{\int_{\mathcal{A}_k} f(x)\,dx}, \qquad P(X\in\mathcal{A}_k) = \int_{\mathcal{A}_k} f(x)\,dx$

This means:

$EZ_X = \sum_{k=1}^{K}\frac{\int_{\mathcal{A}_k} x f(x)\,dx}{\int_{\mathcal{A}_k} f(x)\,dx}\int_{\mathcal{A}_k} f(x)\,dx = \sum_{k=1}^{K}\int_{\mathcal{A}_k} x f(x)\,dx = \int_{\Omega_X} x f(x)\,dx = EX$

Which proves T1.1. □

Corollary T1.1: By the same assumptions as for Theorem 1, 𝐸𝑅𝑋 = 0.

Proof: 𝐸𝑅𝑋 = 𝐸(𝑋 − 𝑍𝑋) = 𝐸𝑋 − 𝐸𝑍𝑋 = 𝐸𝑋 − 𝐸𝑋 = 0 □

Proof of T1.2: This can be proven by the law of total expectation:

$Cov(Z_X, R_X) = E\bigl((Z_X - EZ_X)^T(R_X - ER_X)\bigr) = E_{Z_X}\Bigl(E_{R_X}\bigl((Z_X - EZ_X)^T(R_X - ER_X)\mid Z_X\bigr)\Bigr) = E_{Z_X}\Bigl((Z_X - EZ_X)^T E_{R_X}\bigl((R_X - ER_X)\mid Z_X\bigr)\Bigr)$

Using Corollary T1.1 and the definition of 𝑅𝑋, the inner expectation can be simplified as follows:

$E_{R_X}\bigl((R_X - ER_X)\mid Z_X\bigr) = E_{R_X}(R_X\mid Z_X) = E_X(X - Z_X\mid Z_X) = E_X(X\mid Z_X) - Z_X$

But the expected value of 𝑋 given any value of $Z_X$ is, by Definition 1, equal to $Z_X$, since the value of $Z_X$ tells us which part of the partition 𝒜 we are in, and since the expected value of 𝑋 there equals the value taken by $Z_X$ in that part of the partition. Hence:

$Cov(Z_X, R_X) = E_{Z_X}\Bigl((Z_X - EZ_X)^T E_{R_X}\bigl((R_X - ER_X)\mid Z_X\bigr)\Bigr) = E_{Z_X}\bigl((Z_X - EZ_X)^T(E_X(X\mid Z_X) - Z_X)\bigr) = E_{Z_X}\bigl((Z_X - EZ_X)^T(Z_X - Z_X)\bigr) = 0\ \ \square$


Note that the previous proof not only showed that the covariance is zero as a whole, but that the expected value of $R_X$ given $Z_X$ equals zero, no matter what value $Z_X$ takes. This should not be surprising if one recalls that $R_X$ was defined as the remainder of 𝑋 in each part of the partition. Property T1.2 gives us some interesting relationships for the variances of $Z_X$ and $R_X$, which we turn to next.

Corollary T1.2: By the same assumptions as for Theorem 1:

(C.T1.2.1) $Cov(Z_X, X) = Var(Z_X)$

(C.T1.2.2) $Cov(R_X, X) = Var(R_X)$

(C.T1.2.3) $Var(X) = Var(Z_X) + Var(R_X)$

Proof: All these proofs follow from simple expansions of the terms and application of T1.2.

C.T1.2.1 is proven by:

$Cov(Z_X, X) = Cov(Z_X, Z_X + R_X) = Cov(Z_X, Z_X) + Cov(Z_X, R_X) = Var(Z_X) + 0 = Var(Z_X)$

The proof of C.T1.2.2 is analogous. Finally, C.T1.2.3 is proven by:

$Var(X) = Var(Z_X + R_X) = Var(Z_X) + 2Cov(Z_X, R_X) + Var(R_X) = Var(Z_X) + Var(R_X)\ \ \square$

Proof of T1.3: Note that, ∀𝒜𝑘:

$X \in \mathcal{A}_k \leftrightarrow a + bX \in a + b\mathcal{A}_k$

As such:

$Z_{a+bX,\,a+b\mathcal{A}} = \sum_{k=1}^{K} E(a+bX\mid a+bX\in a+b\mathcal{A}_k)\Delta_k = \sum_{k=1}^{K} E(a+bX\mid X\in\mathcal{A}_k)\Delta_k = a + b\sum_{k=1}^{K} E(X\mid X\in\mathcal{A}_k)\Delta_k = a + bZ_{X,\mathcal{A}}\ \ \square$


Note that the third property of Theorem 1 means that it is in a sense unimportant which linear transformation of a latent variable and its conditional mean variable we are working with (barring theoretical reasons for preferring a particular scale). Thus, if the latent variable has a normal distribution, it is enough to treat it as a standard normal distribution; the conditional means will just follow the same linear transformation. Similarly, if the latent variable follows a uniform distribution, it is enough to treat it as a standard uniform distribution, and so on for any family of distributions closed under linear transformations.

Proof of T1.4: To repeat, with 𝑍𝑌1+𝑌2|𝑋,𝒜 we meant:

$Z_{Y_1+Y_2\mid X,\mathcal{A}} = \sum_{k=1}^{K} E\bigl((Y_1+Y_2\mid X)\mid X\in\mathcal{A}_k\bigr)\Delta_k = \sum_{k=1}^{K} E(Y_1+Y_2\mid X\in\mathcal{A}_k)\Delta_k$

Where $\{\Delta_k\}_{k=1,\ldots,K}$ are the variables that indicate whether $X\in\mathcal{A}_k$. Consequently, the proof follows from applying the usual properties of expectation, i.e. by splitting the expectation on the right-hand side, which yields:

$Z_{Y_1+Y_2\mid X,\mathcal{A}} = \sum_{k=1}^{K} E(Y_1\mid X\in\mathcal{A}_k)\Delta_k + \sum_{k=1}^{K} E(Y_2\mid X\in\mathcal{A}_k)\Delta_k = Z_{Y_1\mid X,\mathcal{A}} + Z_{Y_2\mid X,\mathcal{A}}$

Which completes the proof of the final property of Theorem 1. □
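As a quick numerical illustration of these properties, the following minimal Python simulation sketch (the partition, sample size, and seed are arbitrary illustrative choices, and the closed form for the normal conditional means is the one derived in Example 1 below) constructs $Z_X$ and $R_X$ from simulated draws of a standard normal 𝑋 and checks T1.1, Corollary T1.1, T1.2, and C.T1.2.3 empirically.

```python
import numpy as np
from scipy import stats

# Simulation check of T1.1, T1.2 and C.T1.2.3 for a standard normal X and the
# ordered partition {(-inf,-1], (-1,0.5], (0.5,inf)} (an arbitrary example partition).
rng = np.random.default_rng(1)
x = rng.normal(size=200_000)
cuts = np.array([-np.inf, -1.0, 0.5, np.inf])

# conditional means E(X | X in A_k) for a standard normal (closed form, see Example 1)
phi, Phi = stats.norm.pdf, stats.norm.cdf
cond_means = (phi(cuts[:-1]) - phi(cuts[1:])) / (Phi(cuts[1:]) - Phi(cuts[:-1]))

cat = np.searchsorted(cuts[1:-1], x)   # which part of the partition each draw falls in
z = cond_means[cat]                    # Z_X: each observation mapped to its conditional mean
r = x - z                              # R_X: the conditional remainder

print(z.mean(), x.mean())              # T1.1:      E Z_X = E X (both close to 0)
print(r.mean())                        # Cor. T1.1: E R_X = 0
print(np.cov(z, r)[0, 1])              # T1.2:      Cov(Z_X, R_X) = 0 (up to sampling noise)
print(x.var(), z.var() + r.var())      # C.T1.2.3:  Var(X) = Var(Z_X) + Var(R_X)
```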

We now turn to a couple of examples to illustrate how to find the probabilities and values of the conditional mean variable. We will exemplify with a normal and a uniform distribution, two examples that we will frequently return to due to their attractive properties. However, as seen from the definitions and Theorem 1, conditional mean variables and their remainders can be defined and worked with for a variable with any distribution. Our examples will also use ordered partitions, where each part takes all values between two endpoints. This will be useful when we afterwards discuss how to estimate the conditional mean variables from ordered categories, but no theorem in this thesis requires the partition to have this simple form unless specifically stated in the theorem.


Example 1 (Normal distribution): Suppose 𝑋 follows a normal distribution with mean 𝜇 and variance $\sigma^2$, and that the 𝑘:th part of the partition 𝒜 is given by $\mathcal{A}_k = \{x: c_{k-1} < x \leq c_k\}$, where $c_{k-1} < c_k$. Then the probability of being in that part of the partition becomes:

$P(X\in\mathcal{A}_k) = \int_{\mathcal{A}_k} f(x)\,dx = \int_{c_{k-1}}^{c_k} f(x)\,dx = \Phi\!\left(\frac{c_k-\mu}{\sigma}\right) - \Phi\!\left(\frac{c_{k-1}-\mu}{\sigma}\right)$

And the conditional mean becomes:

$E(X\mid X\in\mathcal{A}_k) = \frac{\int_{c_{k-1}}^{c_k} x f(x)\,dx}{\int_{c_{k-1}}^{c_k} f(x)\,dx} = \frac{\int_{c_{k-1}}^{c_k}\frac{x-\mu}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)dx + \int_{c_{k-1}}^{c_k}\frac{\mu}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)dx}{\Phi\!\left(\frac{c_k-\mu}{\sigma}\right)-\Phi\!\left(\frac{c_{k-1}-\mu}{\sigma}\right)} = \frac{\left[-\frac{\sigma}{\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right]_{x=c_{k-1}}^{c_k}}{\Phi\!\left(\frac{c_k-\mu}{\sigma}\right)-\Phi\!\left(\frac{c_{k-1}-\mu}{\sigma}\right)} + \mu = \mu - \sigma\,\frac{\varphi\!\left(\frac{c_k-\mu}{\sigma}\right)-\varphi\!\left(\frac{c_{k-1}-\mu}{\sigma}\right)}{\Phi\!\left(\frac{c_k-\mu}{\sigma}\right)-\Phi\!\left(\frac{c_{k-1}-\mu}{\sigma}\right)}$

So $Z_{X,\mathcal{A}}$ takes this value with probability $\Phi\!\left(\frac{c_k-\mu}{\sigma}\right) - \Phi\!\left(\frac{c_{k-1}-\mu}{\sigma}\right)$.
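The following small computational sketch of Example 1 uses hypothetical values for $\mu$, $\sigma$ and the interval endpoints (and uses scipy as an implementation choice, not something prescribed by the thesis): it evaluates the closed-form conditional mean above and checks it against direct numerical integration of the definition.

```python
import numpy as np
from scipy import stats, integrate

# Conditional mean of X ~ N(mu, sigma^2) on the interval (c_lo, c_hi], as in Example 1.
mu, sigma = 100.0, 15.0        # illustrative values
c_lo, c_hi = 100.0, 115.0

a, b = (c_lo - mu) / sigma, (c_hi - mu) / sigma
prob = stats.norm.cdf(b) - stats.norm.cdf(a)
cond_mean = mu - sigma * (stats.norm.pdf(b) - stats.norm.pdf(a)) / prob   # closed form

# check against direct numerical integration of the definition
num, _ = integrate.quad(lambda x: x * stats.norm.pdf(x, mu, sigma), c_lo, c_hi)
den, _ = integrate.quad(lambda x: stats.norm.pdf(x, mu, sigma), c_lo, c_hi)
print(prob, cond_mean, num / den)   # the two conditional means agree
```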

Example 2 (Uniform distribution): Suppose 𝑋 follows a uniform distribution between the real numbers 𝑎 and 𝑏, where $a < b$, i.e. $X \sim U(a,b)$, and that the 𝑘:th part of the partition 𝒜 is given by $\mathcal{A}_k = \{x: c_{k-1} < x \leq c_k\}$, where $c_{k-1} < c_k$. Then the probability of being in that part of the partition becomes:

$P(X\in\mathcal{A}_k) = \int_{\mathcal{A}_k} f(x)\,dx = \int_{c_{k-1}}^{c_k}\frac{1}{b-a}\,dx = \frac{c_k - c_{k-1}}{b-a}$

And the conditional mean becomes:

$E(X\mid X\in\mathcal{A}_k) = \frac{\int_{\mathcal{A}_k} x f(x)\,dx}{\int_{\mathcal{A}_k} f(x)\,dx} = \frac{\int_{c_{k-1}}^{c_k}\frac{x}{b-a}\,dx}{\int_{c_{k-1}}^{c_k}\frac{1}{b-a}\,dx} = \frac{\frac{c_k^2 - c_{k-1}^2}{2(b-a)}}{\frac{c_k - c_{k-1}}{b-a}} = \frac{c_k + c_{k-1}}{2}$

So $Z_{X,\mathcal{A}}$ takes the value $(c_k + c_{k-1})/2$ with probability $(c_k - c_{k-1})/(b-a)$.

For an illustration of how the conditional mean variable and its remainder relate to the underlying distribution, see Figure 2.1.1. For an illustration of the relationship between two continuous variables versus one dependent continuous variable and one independent conditional mean variable, see Figure 2.1.2.

Figure 2.1.1: An example of how a conditional mean variable $Z_{X,\mathcal{A}}$ and its corresponding remainder $R_{X,\mathcal{A}}$ are constructed from an underlying variable 𝑋 and a partition 𝒜 of $\Omega_X$. Here 𝑋 follows a standard normal distribution, and 𝒜 is a partition where $\mathcal{A}_1 = \{x: x \leq -1 \text{ or } x > 1\}$ (area in light red), $\mathcal{A}_2 = \{x: -1 < x \leq 0.5\}$ (area in light green), and $\mathcal{A}_3 = \{x: 0.5 < x \leq 1\}$ (area in light blue). Left figure: The value that $Z_{X,\mathcal{A}}$ takes in each part of the partition, shown in the corresponding color. The corresponding probabilities, equal to the area of each part of the partition, are shown by the height of the vertical line. Here $Z_{X,\mathcal{A}} + R_{X,\mathcal{A}}\,(= X)$ lies in the green part of the distribution when $X\in\mathcal{A}_2$. Right figure: The partition from the left, rearranged so that the corresponding vertical line (the conditional mean) is over zero, to show the distribution of $R_{X,\mathcal{A}}$. When $X\in\mathcal{A}_2$, then $R_{X,\mathcal{A}}$ lies in the green area.

Figure 2.1.2: A constructed scatterplot between two variables 𝑋 and 𝑌 (red dots), and the corresponding scatterplot between $Z_X$ and 𝑌 (blue dots). Here there are three ordered categories (cutoffs between categories are shown as grey horizontal lines) and hence three corresponding conditional means (blue horizontal lines). As shown by the blue arrows between the (𝑋, 𝑌) dots and the ($Z_X$, 𝑌) dots, the latter are simply found by projecting the 𝑋-value onto the conditional mean for its category. The true (linear) mean relationship between 𝑋 and 𝑌 is shown as a black line, while the least squares estimate for this relationship with these observations is shown as a red line (when predicting 𝑌 by 𝑋) and a blue line (when predicting 𝑌 by $Z_X$), respectively.

Normally, even if we can assume some underlying distribution from theory, we will not know the partition for each observed category. However, in many cases it can be natural to assume that the categories can be ordered, so that the observations in a higher-numbered category correspond to higher values of the latent variable than the observations in a lower-numbered category – that is to say, that we have an ordered categorical variable. Examples include highest schooling attained (primary school, secondary school, high school), as an indicator of total time of schooling; the answer to a 3-point scale about political preference (left-wing, moderate, right-wing), as a rough indicator of one's position on the left/right spectrum; or, for that matter, weight in kilograms, as an indicator of the exact weight. When this holds true, the underlying partition must be describable by a set of ordered cutoffs that decide which category a value on the latent variable corresponds to. For example, suppose 𝑋 follows a standard uniform distribution and that 𝒜 is described by the two cutoffs {0.5, 0.75}. Then any value of 𝑋 up to at most 0.5 corresponds to the lowest category, values between 0.5 and 0.75 correspond to the middle category, and values over 0.75 correspond to the highest category. Similarly, the example from Figure 2.1.1 would be an ordered categorical variable if $\mathcal{A}_1 = \{x: x \leq -1 \text{ or } x > 1\}$ was split into two new parts so that $\mathcal{A}_1 = \{x: x \leq -1\}$ and $\mathcal{A}_4 = \{x: x > 1\}$. This brings us to our second theorem:

Theorem 2: Denote the cumulative distribution function of a continuous variable 𝑋 by $F_X$, the corresponding quantile function (the inverse of the cumulative distribution function) by $F_X^{-1}$, and the cutoffs for the ordered categorical observations of 𝑋 by $c_k$ for the 𝑘:th such cutoff. We identify $c_0$ with the lower endpoint and $c_K$ with the upper endpoint of the support of 𝑋 (possibly $\mp\infty$ respectively), so that $\Omega_X = \bigcup_{k=1}^{K}\mathcal{A}_k = \bigcup_{k=1}^{K}(c_{k-1}, c_k]$. Furthermore, assume we have a sample of 𝑛 independent categorical observations of 𝑋, i.e. we observe $(\Delta_{X_i})$, with $n_j$ observations in category 𝑗. Then we get a consistent estimator of each cutoff $c_k$ by:

(T2.1) $\hat{c}_k := F_X^{-1}\!\left(\sum_{j=1}^{k}\frac{n_j}{n}\right)$

We also get consistent estimators of the conditional means of each category by:

(T2.2) $\hat{E}(X\mid X\in\mathcal{A}_k) := \frac{\int_{\hat{c}_{k-1}}^{\hat{c}_k} x f(x)\,dx}{\int_{\hat{c}_{k-1}}^{\hat{c}_k} f(x)\,dx}$

Furthermore, let 𝑌 be some other variable and 𝑔 a continuous function, and suppose we have 𝑛 independent observations of $(\Delta_{X_i}, Y_i)$. Let $\hat{z}_{X,i}$ denote the observed value of $\hat{Z}_{X,i} := \sum_{k=1}^{K}\hat{E}(X\mid X\in\mathcal{A}_k)\,\Delta_{X_i,k}$, where $\Delta_{X_i,k}$ indicates whether the 𝑖:th observation falls in category 𝑘, and let $y_i$ denote the observed value of $Y_i$. Then we get a consistent estimator of $E(g(Z_X, Y))$ by:

(T2.3) $\hat{E}\bigl(g(Z_X, Y)\bigr) := \frac{1}{n}\sum_{i=1}^{n} g(\hat{z}_{X,i}, y_i)$

Proof: If the ordering holds, the cutoffs must satisfy:

$c_k = F_X^{-1}\!\left(\sum_{j=1}^{k} P(X\in\mathcal{A}_j)\right) \;\leftrightarrow\; F_X(c_k) = \sum_{j=1}^{k} P(X\in\mathcal{A}_j)$

With a sample of independent observations of 𝑋, we may construct a consistent estimator of the probabilities as:

$\hat{P}(X\in\mathcal{A}_j) = \frac{n_j}{n}$

Since this is simply the empirical probability, this estimate will converge to the true probability of being in that category. Consequently, by the continuous mapping theorem, any continuous transformation of the probabilities, where these probability estimates are simply plugged in instead of the true values, will converge to the true value of that transformation (see e.g. van der Vaart, 2000). Thus, in particular, the cutoffs are just as easily estimated by:

$\hat{c}_k = F_X^{-1}\!\left(\sum_{j=1}^{k}\frac{n_j}{n}\right)$

Which proves T2.1. Furthermore, 𝑓 is a continuous function by the continuity of the distribution of 𝑋. Thus $\int_{\hat{c}_{k-1}}^{\hat{c}_k} x f(x)\,dx \,/ \int_{\hat{c}_{k-1}}^{\hat{c}_k} f(x)\,dx$ is also a continuous transformation of the cutoffs. Therefore, the estimate in T2.2 converges to the correct conditional means, also by the continuous mapping theorem. Finally, the $\hat{z}_{X,i}$ are continuous transformations of the estimated conditional means, and the continuity of 𝑔 ensures that the estimate in T2.3 is a continuous transformation of $\hat{z}_{X,i}$ and $y_i$. Consequently, the continuous mapping theorem and the law of large numbers ensure that the estimate in T2.3 converges to the correct value as well. □
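A minimal sketch of the estimators in Theorem 2, assuming a standard normal latent variable and some illustrative category counts (the counts, and the use of scipy for the quantile function and the integrals, are assumptions made for this example only):

```python
import numpy as np
from scipy import stats, integrate

# Sketch of T2.1 and T2.2 for an assumed standard normal latent X.
# counts[j] = n_j, the number of observations in ordered category j (illustrative numbers).
counts = np.array([180, 450, 370])
n = counts.sum()

# T2.1: estimated cutoffs, with c_0 = -inf and c_K = +inf as the support endpoints
cum_props = np.cumsum(counts) / n
c_hat = np.concatenate(([-np.inf], stats.norm.ppf(cum_props[:-1]), [np.inf]))

# T2.2: estimated conditional means, plugging the estimated cutoffs into the definition
def cond_mean(lo, hi):
    num, _ = integrate.quad(lambda x: x * stats.norm.pdf(x), lo, hi)
    den, _ = integrate.quad(stats.norm.pdf, lo, hi)
    return num / den

cond_means_hat = [cond_mean(c_hat[k], c_hat[k + 1]) for k in range(len(counts))]
print(c_hat)            # estimated cutoffs c_hat_1, ..., c_hat_{K-1} (with the endpoints)
print(cond_means_hat)   # estimated E(X | X in A_k) for each category
```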

Theorem 2 ensures that we have consistent estimators of the conditional mean values and of the probabilities corresponding to the different categories, and thus a consistent estimator of the corresponding conditional mean variable, which we can then apply for further use (T2.3). Therefore, if we are for example interested in estimating any moment of the conditional mean variable, or any transformation thereof, such as its variance or its covariance with another variable, T2.3 ensures that any estimator where the values of the observations are replaced by the estimated values as in Theorem 2 will still converge to the true value of that moment. For example, suppose we want to estimate the covariance between $Z_X$ and 𝑌, with the model for the observations $(Z_{X_i}, Y_i)$. That is, we observe the conditional means for the categorical observations of 𝑋 directly, without having to estimate them. Denote the 𝑖:th observation of $Z_X$ by $z_{X,i}$, the 𝑖:th observation of 𝑌 by $y_i$, and the respective sample means by $\bar{z}_X$ and $\bar{y}$. Then this covariance can, as usual, be estimated by the following method of moments estimator:

$\widehat{Cov}(Z_X, Y) = \frac{1}{n}\sum_{i=1}^{n}(z_{X,i} - \bar{z}_X)(y_i - \bar{y})$

As a moment estimator, this will converge to the true covariance of 𝑍𝑋 and 𝑌, i.e. 𝐶𝑜𝑣(𝑍𝑋, 𝑌).

Importantly, the same holds even if we have to estimate the conditional means through estimating the partition, due to only having the observations $(\Delta_{X_i}, Y_i)$. In that case we replace the conditional mean values that the $z_{X,i}$ take, i.e. replace

$\frac{\int_{\mathcal{A}_k} x f(x)\,dx}{\int_{\mathcal{A}_k} f(x)\,dx} \quad\text{by}\quad \frac{\int_{\hat{\mathcal{A}}_k} x f(x)\,dx}{\int_{\hat{\mathcal{A}}_k} f(x)\,dx}$

using the additional knowledge that $\hat{\mathcal{A}}_k$ can be estimated by some continuous transformation of the probabilities, e.g. via the cutoffs in Theorem 2, so that:

$\hat{\mathcal{A}}_k = \{x: \hat{c}_{k-1} < x \leq \hat{c}_k\} = \left\{x: F_X^{-1}\!\left(\sum_{j=1}^{k-1}\frac{n_j}{n}\right) < x \leq F_X^{-1}\!\left(\sum_{j=1}^{k}\frac{n_j}{n}\right)\right\}$

Theorem 2 (by the continuous mapping theorem) then ensures that the moment estimators with these values plugged in instead of the true values will still converge to the true moments of $Z_X$. This property will be useful when we estimate the variance of $Z_X$ and the covariance between $Z_X$ and another variable for purposes of regression.

The above argument allows conditional mean variables to be used for a range of different ordered categorical variables. However, as indicated, the method of conditional mean variables can also be used in other cases, where the underlying partition is either known from theory or where theory can tell us how to estimate it. Such a case is exemplified below.

Example 3 (An IQ-test question): In IQ testing, the distribution of intelligence is often thought to follow a normal distribution, usually modeled so that the IQ scores follow a roughly $N(100, 15^2)$ distribution. A typical IQ-test question consists of a set of images which follow some pattern, with one image left out, and a row of alternatives of which only one fits together with the other images. It is then the test-taker's task to indicate which of these alternatives is the correct one. See the left part of Figure 2.1.3 for an illustration. Suppose we have access to only one such question, that it has eight alternatives, and that we want to estimate the IQ score of a person who answered the question correctly in the least squares way, i.e. by the mean IQ of the people who answered the question correctly, with only the observations $(\Delta_{X_i})$ (𝑋 = IQ). Since everyone who just guesses has a 1/8 chance of answering correctly, treating the categories (answered wrongly, answered correctly) as ordered categorical with a simple cutoff between them would not yield a proper estimate of the conditional mean IQ of the people who answered correctly – not even approximately so. Instead, suppose everyone with an IQ above 𝑐 is capable of answering the question correctly, and that everyone below 𝑐 cannot and simply guesses (which seems reasonable if all eight alternatives are designed to seem equally plausible if one hasn't figured out the correct pattern). Thus, the conditional mean of those who answered correctly becomes:

$E(X\mid\text{answer correctly}) = \frac{\frac{\int_{-\infty}^{c} x f(x)\,dx}{\int_{-\infty}^{c} f(x)\,dx}\int_{-\infty}^{c} f(x)\,dx \cdot \frac{1}{8} + \frac{\int_{c}^{\infty} x f(x)\,dx}{\int_{c}^{\infty} f(x)\,dx}\int_{c}^{\infty} f(x)\,dx \cdot 1}{\int_{-\infty}^{c} f(x)\,dx \cdot \frac{1}{8} + \int_{c}^{\infty} f(x)\,dx \cdot 1} = \frac{\int_{-\infty}^{c} x f(x)\,dx \cdot \frac{1}{8} + \int_{c}^{\infty} x f(x)\,dx}{\int_{-\infty}^{c} f(x)\,dx \cdot \frac{1}{8} + \int_{c}^{\infty} f(x)\,dx}$

(25)

23

That is to say, everyone who has an IQ (= 𝑋) below 𝑐, and thus has a conditional mean score of $E(X\mid X\leq c) = \int_{-\infty}^{c} x f(x)\,dx / \int_{-\infty}^{c} f(x)\,dx$, makes up a proportion $P(X\leq c) = \int_{-\infty}^{c} f(x)\,dx$ of the whole population and has probability 1/8 of answering correctly. In addition, everyone who has an IQ above 𝑐, with a conditional mean score of $E(X\mid X > c) = \int_{c}^{\infty} x f(x)\,dx / \int_{c}^{\infty} f(x)\,dx$, makes up a proportion $P(X > c) = \int_{c}^{\infty} f(x)\,dx$ of the whole population and always answers correctly. For an illustration of this partition, see the right part of Figure 2.1.3. In case we have an independent sample of 𝑛 observations $(\Delta_{X_i})$, where $n_1$ is the number of people who answered the question incorrectly, 𝑐 can therefore be estimated by:

$\frac{n_1}{n} = \frac{7}{8}\int_{-\infty}^{\hat{c}} f(x)\,dx \;\leftrightarrow\; \hat{c} = F_X^{-1}\!\left(\frac{n_1}{n}\cdot\frac{8}{7}\right)$
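A small sketch of the estimate in Example 3 in Python: the sample counts below are hypothetical, and the truncated-normal means are computed with the closed form from Example 1 rather than by integration.

```python
import numpy as np
from scipy import stats

# Example 3 as a sketch: IQ ~ N(100, 15^2), one 8-alternative question,
# n1 of n people answered incorrectly (illustrative numbers).
n, n1 = 1000, 530
mu, sigma = 100.0, 15.0

# everyone below c guesses (7/8 of them answer incorrectly), so n1/n estimates (7/8)*P(X <= c)
c_hat = stats.norm.ppf((n1 / n) * 8 / 7, loc=mu, scale=sigma)

# conditional mean of X among those who answered correctly: a mixture of the guessers below
# c_hat (weight 1/8 of their proportion) and everyone above c_hat, as in the example
p_lo = stats.norm.cdf(c_hat, mu, sigma)
m_lo = mu - sigma * stats.norm.pdf((c_hat - mu) / sigma) / p_lo        # E(X | X <= c_hat)
m_hi = mu + sigma * stats.norm.pdf((c_hat - mu) / sigma) / (1 - p_lo)  # E(X | X > c_hat)
mean_correct = (m_lo * p_lo / 8 + m_hi * (1 - p_lo)) / (p_lo / 8 + (1 - p_lo))
print(c_hat, mean_correct)
```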

The properties of conditional mean variables that were proven in this section are useful when we apply the variables in linear regression, which we turn to next.

Figure 2.1.3: Left figure: A typical IQ question. The test-taker is tasked with indicating which figure, from the eight alternatives at the bottom, fits into the empty space above. This figure is copied from the article Carpenter, P., Just, M., & Shell, P. (1990). What one intelligence test measures: A theoretical account of the processing in the Raven progressive matrices test. Psychological Review, 97(3), 404-431. Right figure: The distribution of $IQ \sim N(100, 15^2)$ if the question to the left has difficulty 115 (the black vertical line) and probability 1/8 of answering correctly if guessing randomly. The proportion of everyone who would answer incorrectly is in red (7/8 of the distribution under 115), and the proportion of everyone who would answer correctly is in light blue (everyone over 115 and 1/8 of the distribution under 115). The conditional means are indicated by a vertical line in a darker shade of the corresponding color. Shown in purple is also the conditional mean of people with an IQ over 115. Note that the true conditional mean of those who answer correctly, taking random guessing into account, is significantly lower.


2.2 Conditional mean variables in linear regression

As previously stated, in linear regression we have the model:

$Y = \beta_0 + \boldsymbol{X}^T\boldsymbol{\beta} + \varepsilon = \beta_0 + \sum_{j=1}^{p}\beta_j X_j + \varepsilon, \qquad E(\varepsilon\mid\boldsymbol{X}=\boldsymbol{x}) = 0\ \forall\boldsymbol{x}$

The goal of this section is to develop the method needed when some or all of the independent and/or dependent variables cannot be observed directly, and we only have access to a corresponding conditional mean variable (either directly, or, more commonly, because we can estimate it from the distribution of observations for a usually ordered categorical indicator variable). We will deal with the cases where the conditional mean variable functions as the independent, the dependent, or both the independent and the dependent variable, in succession. This is because the conditional mean variables are easiest to apply as an independent variable. Using them as the dependent variable requires conditioning not just on the latent variable, but also on all independent variables, which usually requires estimating the parameters and updating the conditional mean variables iteratively. Once this is done, we will combine the methods from these parts to allow us to estimate the parameters in the underlying model when we have conditional mean variables as both independent and dependent variables. This will also finally allow us to work with multiple independent variables.

2.2.1 As one independent variable

Here we look at when the model for our observations is (∆𝑋𝑖, 𝑌𝑖). That is, when we have one independent categorically observed variable (𝑝 = 1), and one dependent observed variable. In the case of one independent and dependent variable, we now have the latent model:

(Eq. 2.2.1.1) $Y = \beta_0 + \beta_1 X + \varepsilon, \qquad E(\varepsilon\mid X=x) = 0\ \forall x, \qquad X = Z_X + R_X$

The usual least squares moment estimators are:

(Eq. 2.2.1.2) $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$

This is because:

(Eq. 2.2.1.3) $\beta_1 = \frac{Cov(X,Y)}{Var(X)}, \qquad \beta_0 = EY - \beta_1 EX$

I.e. functions of moments. The important element, which we neither know in advance nor can estimate directly from our observations, is the term $Cov(X, Y)$. Crucial for our purposes, then, is whether this can be estimated if we replace 𝑋 by $Z_X$. This is handled by the following theorem:

Theorem 3: Suppose 𝑍𝑋 is a conditional mean variable with respect to 𝑋 with 𝑉𝑎𝑟(𝑍𝑋) > 0. Furthermore, suppose the relationship 𝑌 = 𝛽0+ 𝛽1𝑋 + 𝜀 holds, where 𝐸𝜀|𝑋=𝑥 = 0 ∀𝑥. Then:

(T3) $Cov(X, Y) = \frac{Var(X)}{Var(Z_X)}\,Cov(Z_X, Y)$

Proof: The proof is quite straightforward: we just need to expand the expression $Cov(Z_X, Y)$ in terms of 𝑌 and use the second property of conditional mean variables from Theorem 1. First:

𝐶𝑜𝑣(𝑍𝑋, 𝑌) = 𝐶𝑜𝑣(𝑍𝑋, 𝛽0+ 𝛽1𝑋 + 𝜀) = 𝛽1𝐶𝑜𝑣(𝑍𝑋, 𝑋) + 𝐶𝑜𝑣(𝑍𝑋, 𝜀)

By usual properties of covariances. Further, by the assumption that $E(\varepsilon\mid X=x) = 0\ \forall x$, it must also hold that $E(\varepsilon\mid Z_X=z_X) = 0\ \forall z_X$, as $Z_X$ depends only on 𝑋. Thus:

𝐶𝑜𝑣(𝑍𝑋, 𝜀) = 𝐸((𝑍𝑋− 𝐸𝑍𝑋)𝜀) = 𝐸𝑍𝑋((𝑍𝑋− 𝐸𝑍𝑋)𝐸𝜀(𝜀|𝑍𝑋)) = 0

Thus:

𝐶𝑜𝑣(𝑍𝑋, 𝑌) = 𝛽1𝐶𝑜𝑣(𝑍𝑋, 𝑋) = 𝛽1𝑉𝑎𝑟(𝑍𝑋)

Where the last equality comes from the first property of Corollary T1.2. Furthermore, by Equation 2.2.1.3:

$\beta_1 = \frac{Cov(X, Y)}{Var(X)}$

This shows that:

$\beta_1 = \frac{Cov(Z_X, Y)}{Var(Z_X)} = \frac{Cov(X, Y)}{Var(X)}$

Thus, multiplying by 𝑉𝑎𝑟(𝑋) completes our proof. □

As a direct consequence of Theorem 3:

$\beta_1 = \frac{Cov(Z_X, Y)}{Var(Z_X)}$

As before, denote by 𝑧𝑋,𝑖 the 𝑖:th observation of 𝑍𝑋, by 𝑦𝑖 the 𝑖:th observation of 𝑌, and by 𝑧̅𝑋 and 𝑦̅ the respective sample means. Then we get a consistent estimator of 𝛽1 by:

(Eq. 2.2.1.4) $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(z_{X,i}-\bar{z}_X)(y_i-\bar{y})}{\sum_{i=1}^{n}(z_{X,i}-\bar{z}_X)^2}$

Further, by the first property of Theorem 1, $EZ_X = EX$, so we get a consistent estimator of $\beta_0$ by:

(𝐸𝑞. 2.2.1.5) 𝛽̂0 = 𝑦̅ − 𝛽̂1𝑧̅𝑋

Hence, these consistent estimators are found by simply replacing the role of 𝑋 by $Z_X$ in the moment estimators. As these estimates are derived from the least squares method, we can therefore compute them by finding the least squares regression line between $Z_X$ and 𝑌, which by Theorem 3 corresponds to the same underlying line as the least squares regression between 𝑋 and 𝑌, and is given by these moments. For an illustration of the relationship between $Z_X$ and 𝑌, see Figure 2.2.1.1.
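The following simulation sketch illustrates Eq. 2.2.1.4-2.2.1.5 in a setting similar to Figure 2.2.1.1: $X \sim U(0,1)$ is observed only through the ordered partition with cutoffs {0.5, 0.75}, the conditional means are estimated from the category proportions as in Theorem 2 (Example 2 gives interval midpoints in the uniform case), and the slope and intercept are recovered. For simplicity the Beta error of the figure is replaced here by a Gaussian error, which is an arbitrary mean-zero choice; the sample size and seed are also illustrative.

```python
import numpy as np

# Sketch of Eq. 2.2.1.4-2.2.1.5: X ~ U(0,1) is only observed through the ordered
# partition with cutoffs {0.5, 0.75}; Y follows the latent linear model.
rng = np.random.default_rng(2)
n = 100_000
beta0, beta1 = 0.125, 0.75

x = rng.uniform(size=n)
y = beta0 + beta1 * x + rng.normal(scale=0.1, size=n)   # any mean-zero error works here

cutoffs = np.array([0.5, 0.75])
cat = np.searchsorted(cutoffs, x)                       # the categorical observation of X

# Theorem 2 for a standard uniform latent variable: estimated cutoffs are the cumulative
# category proportions, and the conditional means are interval midpoints (Example 2)
props = np.bincount(cat, minlength=3) / n
c_hat = np.concatenate(([0.0], np.cumsum(props)))       # c_hat_0, ..., c_hat_K
z_values = (c_hat[:-1] + c_hat[1:]) / 2
z = z_values[cat]                                       # the conditional mean variable Z_X

beta1_hat = np.sum((z - z.mean()) * (y - y.mean())) / np.sum((z - z.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * z.mean()
print(beta1_hat, beta0_hat)                             # close to 0.75 and 0.125
```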


Figure 2.2.1.1: The example from Figure 1.1.1, with $X \sim U(0,1)$, $Y\mid X \sim Beta\bigl((\alpha+\beta X)/(2\alpha(1-\alpha)),\ (1-(\alpha+\beta X))/(2\alpha(1-\alpha))\bigr)$, $\beta = 0.75$, $\alpha = 0.5(1-\beta)$. The same 10 000 observations (grey dots) and the true mean line between 𝑋 and 𝑌 (thick black line) are shown. In addition, we now have a 3-part ordered partition of 𝑋 described by the cutoffs {0.5, 0.75} (vertical black lines). The scatterplot with the conditional mean variable $Z_X$ in place of 𝑋 is shown as black dots (creating three thick vertical lines). The moment estimate between $Z_X$ and 𝑌 is shown as a thin red line, which, just like the estimate with 𝑋, closely corresponds to the true mean line for this many observations. In addition, the three pairs of sample means of (𝑋, 𝑌) when 𝑋 is in each part of the partition are shown as grey triangles. Since $E(Y\mid X\in\mathcal{A}_k) = E(\alpha + \beta X + \varepsilon\mid X\in\mathcal{A}_k) = \alpha + \beta E(X\mid X\in\mathcal{A}_k)$ in each part, and since $Z_X = E(X\mid X\in\mathcal{A}_k)$ there, $Z_X$ and 𝑌 follow the correct linear relationship.

2.2.1.1 Optimal partition for estimating the slope

A question one can ask about this substitution of the independent variable with a conditional mean variable is which partition allows the smallest possible variance in the estimates of the beta-parameters. In the usual moment least squares regression between one independent variable 𝑋 and a dependent variable 𝑌, if the errors have constant variance $\sigma^2$, the variance of the slope estimate converges as follows:

$Var\bigl(\sqrt{n}\,\tilde{\beta}_1\bigr) = Var\!\left(\sqrt{n}\,\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right) \rightarrow \frac{\sigma^2}{Var(X)}$


Since the "error" in the case of a conditional mean variable is not just 𝜀, but $\beta_1 R_X + \varepsilon$ (since $Y = \beta_0 + \beta_1 X + \varepsilon = \beta_0 + \beta_1 Z_X + \beta_1 R_X + \varepsilon$), if we write $\hat{\beta}_1$ for this estimate, the variance of $\sqrt{n}\,\hat{\beta}_1$ converges as follows:

(Eq. 2.2.1.1.1) $Var\bigl(\sqrt{n}\,\hat{\beta}_1\bigr) \rightarrow \frac{\beta_1^2\,Var(R_X) + \sigma^2}{Var(Z_X)}$

We may note that by Corollary T1.2, $Var(X) = Var(Z_X) + Var(R_X)$, and since variances are non-negative, this means that $Var(X) \geq Var(Z_X)$. Hence $Var(\sqrt{n}\,\hat{\beta}_1) \geq Var(\sqrt{n}\,\tilde{\beta}_1)$. This is as expected, as we now have less information in our observations. Further, by Corollary T1.2 we may rewrite Equation 2.2.1.1.1 as:

$Var\bigl(\sqrt{n}\,\hat{\beta}_1\bigr) \rightarrow \frac{\beta_1^2\bigl(Var(X) - Var(Z_X)\bigr) + \sigma^2}{Var(Z_X)} = \frac{\beta_1^2\,Var(X) + \sigma^2}{Var(Z_X)} - \beta_1^2$

Since $Var(X)$, $\beta_1$, and $\sigma^2$ depend on the latent variable 𝑋 and the relationship between 𝑋 and 𝑌, the only term that can vary due to the partition is $Var(Z_X)$. Hence, maximizing this term minimizes the variance of the slope estimate. As the intercept estimate is only the sample mean of 𝑌 minus the slope estimate times the mean of 𝑋 (which we assume is known if the conditional mean variable can be constructed), minimizing the variance of the slope estimate is enough to minimize the variance of this estimate as well. The next theorem gives the conditions for $Var(Z_X)$ to be maximized for a given distribution of the latent variable 𝑋 and a given number 𝐾 of ordered categories, and hence defines when the slope estimate based on $Z_X$ has the lowest possible variance with this construction.

Theorem 4: Let 𝑋 be a given variable, observed through $\Delta_X$ with a fixed number 𝐾 of ordered categories for its partition, where we denote the cutoffs between these categories by $\{c_k\}_{k=1,\ldots,K-1}$, with $c_0$ and $c_K$ denoting the lower and upper endpoints of the support of 𝑋. Then the variance of the corresponding conditional mean variable $Z_X$ is maximized, corresponding to an optimal case for estimating the slope on a dependent variable $Y = \beta_0 + \beta_1 X + \varepsilon$, where $E(\varepsilon\mid X=x) = 0\ \forall x$, if the partition fulfills:

(T4) $c_k = \frac{E(X\mid c_{k-1} < X \leq c_k) + E(X\mid c_k < X \leq c_{k+1})}{2}$


Proof: First, we may observe that by Theorem 1, $EZ_X$ is a constant ($= EX$), so maximizing $Var(Z_X)$ for a given variable 𝑋 is the same thing as maximizing $EZ_X^2$, which, by the preceding argument, is enough for the partition to be optimal for estimating the slope. When we have 𝐾 ordered categories as above, this means:

$EZ_X^2 = \sum_{k=1}^{K}\left(\frac{\int_{c_{k-1}}^{c_k} x f(x)\,dx}{\int_{c_{k-1}}^{c_k} f(x)\,dx}\right)^{2}\int_{c_{k-1}}^{c_k} f(x)\,dx = \sum_{k=1}^{K}\frac{\left(\int_{c_{k-1}}^{c_k} x f(x)\,dx\right)^{2}}{\int_{c_{k-1}}^{c_k} f(x)\,dx}$

Writing this as 𝑔(𝑐1, … , 𝑐𝐾−1), the partial derivatives become:

$\frac{\partial}{\partial c_k} g(c_1,\ldots,c_{K-1}) = 2c_k f(c_k)\frac{\int_{c_{k-1}}^{c_k} x f(x)\,dx}{\int_{c_{k-1}}^{c_k} f(x)\,dx} - f(c_k)\frac{\left(\int_{c_{k-1}}^{c_k} x f(x)\,dx\right)^{2}}{\left(\int_{c_{k-1}}^{c_k} f(x)\,dx\right)^{2}} - 2c_k f(c_k)\frac{\int_{c_k}^{c_{k+1}} x f(x)\,dx}{\int_{c_k}^{c_{k+1}} f(x)\,dx} + f(c_k)\frac{\left(\int_{c_k}^{c_{k+1}} x f(x)\,dx\right)^{2}}{\left(\int_{c_k}^{c_{k+1}} f(x)\,dx\right)^{2}}$

$= f(c_k)\Bigl(2c_k E(X\mid c_{k-1}<X\leq c_k) - \bigl(E(X\mid c_{k-1}<X\leq c_k)\bigr)^{2} - 2c_k E(X\mid c_k<X\leq c_{k+1}) + \bigl(E(X\mid c_k<X\leq c_{k+1})\bigr)^{2}\Bigr)$

Setting this equal to zero and supposing 𝑓(𝑐𝑘) ≠ 0, we may reorder this as:

$2c_k\bigl(E(X\mid c_k<X\leq c_{k+1}) - E(X\mid c_{k-1}<X\leq c_k)\bigr) = \bigl(E(X\mid c_k<X\leq c_{k+1})\bigr)^{2} - \bigl(E(X\mid c_{k-1}<X\leq c_k)\bigr)^{2}$

Since, by our assumptions, $E(X\mid c_k<X\leq c_{k+1}) \neq E(X\mid c_{k-1}<X\leq c_k)$, this can be rewritten as:

$c_k = \frac{E(X\mid c_{k-1}<X\leq c_k) + E(X\mid c_k<X\leq c_{k+1})}{2}$

Calculating the second derivatives reveals that this gives the maximum of the variance of $Z_X$ for ordered partitions. □

Put in words, Theorem 4 says that the maximal variance is attained when each cutoff lies in the middle of its two adjacent conditional means (see Figure 2.2.1.1.1). This gives us 𝐾 − 1 equations in the 𝐾 − 1 unknown cutoffs, which allows us to find this maximum. Or at least this is so in theory, although in general it might not be so easy to solve analytically. One simple example is given below.

Example 4 (Maximal variance for a uniform latent variable): Suppose $X \sim U(0,1)$, and that we seek a 𝐾-part partition 𝒜 which maximizes the variance of $Z_X$. This is given when:

$c_k = \frac{E(X\mid c_{k-1}<X\leq c_k) + E(X\mid c_k<X\leq c_{k+1})}{2} = \frac{\frac{c_{k-1}+c_k}{2} + \frac{c_k+c_{k+1}}{2}}{2} \;\leftrightarrow\; c_k = \frac{c_{k-1}+c_{k+1}}{2} \;\leftrightarrow\; c_{k+1} = 2c_k - c_{k-1}$

Since $c_0 = 0$ and $c_K = 1$, we have that:

$c_1 = \frac{0 + c_2}{2} \;\leftrightarrow\; c_2 = 2c_1, \qquad c_3 = 2c_2 - c_1 = 4c_1 - c_1 = 3c_1$

And similarly for higher 𝑘, so that $c_k = kc_1$. In particular, we must have that:

$1 = c_K = Kc_1 \;\leftrightarrow\; c_1 = \frac{1}{K} \;\rightarrow\; c_k = \frac{k}{K}$

So in the case of a standard uniform distribution, the variance of an ordered conditional mean variable is maximized when each part of the partition covers an equal 1/𝐾 of the support (note that by the third property of Theorem 1 this means that for $U(a,b)$ the same thing is fulfilled when each part covers $(b-a)/K$ of the support).

As noted, in the general case it will not be as easy to find an analytical solution. However, if $c_1$ is decided, $E(X\mid c_0<X\leq c_1)$ follows directly, which also directly decides $E(X\mid c_1<X\leq c_2)$, since these conditional means must be equidistant from $c_1$ by Theorem 4; but this in turn decides $c_2$, and so on. Hence, a numerical procedure for finding the maximum needs only one free parameter. For an example, see Figure 2.2.1.1.1.

Figure 2.2.1.1.1: Numerical approximation of the maximal-variance cutoffs for the conditional mean variable when $X \sim N(0,1)$, partitions are ordered, and 𝐾 = 3. Shown are the distribution of 𝑋 and the approximately optimal cutoffs (vertical black lines), with each part of the partition in a different color and the conditional means in a darker shade of the corresponding color, with height equal to their probability (the area of each part of the partition). The variance of the conditional mean variable, when the first cutoff takes different values up to 0 and the second cutoff is found by applying the equality in Theorem 4, is also shown in grey.
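A sketch of the one-parameter numerical procedure just described, for $X \sim N(0,1)$ and 𝐾 = 3 (the setting of Figure 2.2.1.1.1). The root-finding approach and the brackets used below are illustrative assumptions, not the thesis's own implementation.

```python
import numpy as np
from scipy import stats, optimize

# One-parameter search for the maximal-variance cutoffs of a standard normal X with K = 3.
phi, Phi = stats.norm.pdf, stats.norm.cdf

def cond_mean(lo, hi):
    """E(X | lo < X <= hi) for standard normal X (closed form from Example 1)."""
    return (phi(lo) - phi(hi)) / (Phi(hi) - Phi(lo))

def residual(c1):
    # Theorem 4 at c1 fixes the second conditional mean, which in turn fixes c2 ...
    m1 = cond_mean(-np.inf, c1)
    m2 = 2 * c1 - m1
    c2 = optimize.brentq(lambda c: cond_mean(c1, c) - m2, c1 + 1e-6, 6.0)
    m3 = cond_mean(c2, np.inf)
    # ... and the remaining equation is Theorem 4 at c2, whose root is searched for in c1
    return c2 - (m2 + m3) / 2

c1_opt = optimize.brentq(residual, -1.5, -0.3)
print(c1_opt)   # about -0.61; by symmetry the second cutoff is about +0.61
```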

2.2.1.2 Variance of the slope estimate as a function of the precision of the partition of X

As the final part of this section, we will study how the variance of the slope estimate varies as a function of making the partition of 𝑋 finer (i.e. splitting up the partition more and more). Since the factor that determined the variance of the estimate was found to be the variance of the conditional mean variable, i.e. $Var(Z_X)$, it is enough to study how this variance changes as the partition becomes finer. For simplicity, let us study how the variance changes when we just make one part of the partition finer, by splitting it into two new parts in some way. Say we do so for the 𝑘:th part of the partition, $\mathcal{A}_k$, and let us denote the new parts $\mathcal{A}_{k;1}$ and $\mathcal{A}_{k;2}$, which means that:

𝒜𝑘;1∪ 𝒜𝑘;2= 𝒜𝑘, 𝒜𝑘;1∩ 𝒜𝑘;2 = ∅

Denoting the old partition by 𝒜 and the new by ℬ, it is easily seen that:

ℬ = {𝒜\{𝒜𝑘}, 𝒜𝑘;1, 𝒜𝑘;2}

Hence $Z_{X,\mathcal{A}}$, the conditional mean variable of 𝑋, is also a conditional mean variable of $Z_{X,\mathcal{B}}$, i.e.:

$Z_{X,\mathcal{A}} = Z_{Z_{X,\mathcal{B}},\mathcal{A}}$

By Corollary T1.2:

$Var(X) = Var(Z_{X,\mathcal{A}}) + Var(R_{X,\mathcal{A}})$

Consequently:

$Var(Z_{X,\mathcal{A}}) = Var(Z_{Z_{X,\mathcal{B}},\mathcal{A}}) = Var(Z_{X,\mathcal{B}}) - Var(R_{Z_{X,\mathcal{B}},\mathcal{A}}) \leq Var(Z_{X,\mathcal{B}})$

With strict inequality when $Var(R_{Z_{X,\mathcal{B}},\mathcal{A}}) > 0$, which is true provided neither of $\mathcal{A}_{k;1}$ and $\mathcal{A}_{k;2}$ is empty. Thus, taking a finer partition always increases the variance of the conditional mean variable, which also improves the slope estimate in the regression. Furthermore, since Corollary T1.2 implies $Var(X) \geq Var(Z_{X,\mathcal{A}})$, this means that as the partition gets finer and finer over the whole support, so that the conditional mean variable converges to 𝑋, its variance will converge to the variance of 𝑋, and the slope estimator based on the conditional mean variable will converge to having the same efficiency as the one based on 𝑋 directly, as expected.


2.2.2 As multiple independent variables – Some special cases

Here we take a brief look at some special cases where the model for our observations is $(\Delta_{\boldsymbol{X}_i}, Y_i)$, that is, where we have several ($p > 1$) categorically observed independent variables. The reason we did not allow more than one independent conditional mean variable before is that we have not yet shown how to find consistent estimates when the dependent variable, or for that matter both related variables, can only be observed through their partitions. As we will show in the next section, in general we cannot simply replace 𝑌 with $Z_Y$ in the moment equations. Similarly, for two independent variables we cannot apply the reasoning in Theorem 3, because there is nothing that guarantees that the effect of the conditional remainders cancels out as it did in $Cov(Z_X, X) = Cov(Z_X, Z_X)$, i.e.:

𝐶𝑜𝑣(𝑍𝑋1, 𝑋2) = 𝐶𝑜𝑣(𝑍𝑋1, 𝑍𝑋2 + 𝑅𝑋2) = 𝐶𝑜𝑣(𝑍𝑋1, 𝑍𝑋2) + 𝐶𝑜𝑣(𝑍𝑋1, 𝑅𝑋2)

Where there is no guarantee that the latter covariance cancels out (and it generally does not). Since the slopes in the multivariate case are given by:

$Cov(\boldsymbol{X},\boldsymbol{X})^{-1}Cov(\boldsymbol{X}, Y) = Cov(\boldsymbol{X},\boldsymbol{X})^{-1}Cov(\boldsymbol{X}, \boldsymbol{X}^T\boldsymbol{\beta} + \varepsilon) = Cov(\boldsymbol{X},\boldsymbol{X})^{-1}Cov(\boldsymbol{X}, \boldsymbol{X}^T\boldsymbol{\beta})$

Replacing 𝑿 by 𝒁𝑿 will not yield the correct slopes, since 𝐶𝑜𝑣(𝒁𝑿, 𝑌) ≠ 𝐶𝑜𝑣(𝒁𝑿, 𝒁𝑿𝑇𝜷), i.e.:

$Cov(\boldsymbol{Z}_{\boldsymbol{X}}, Y) = Cov(\boldsymbol{Z}_{\boldsymbol{X}}, \boldsymbol{X}^T\boldsymbol{\beta} + \varepsilon) = Cov(\boldsymbol{Z}_{\boldsymbol{X}}, \boldsymbol{X}^T\boldsymbol{\beta}) = \begin{pmatrix}\sum_{j=1}^{p} Cov(Z_{X_1}, X_j\beta_j) \\ \vdots \\ \sum_{j=1}^{p} Cov(Z_{X_p}, X_j\beta_j)\end{pmatrix}$

We have that $Cov(Z_{X_j}, X_j\beta_j) = Cov(Z_{X_j}, Z_{X_j}\beta_j)$ for all 𝑗, but in general $Cov(Z_{X_j}, X_l\beta_l) \neq Cov(Z_{X_j}, Z_{X_l}\beta_l)$ for $j \neq l$, so $Cov(\boldsymbol{Z}_{\boldsymbol{X}},\boldsymbol{Z}_{\boldsymbol{X}})^{-1}Cov(\boldsymbol{Z}_{\boldsymbol{X}}, Y) \neq \boldsymbol{\beta}$.

This said, there are some cases when multiple independent variables can be replaced by their conditional mean variables without any further complications. These cases are summarized in the following theorem:


Theorem 5: Suppose $Y = \beta_0 + \boldsymbol{\beta}^T\boldsymbol{X} + \varepsilon$, let $Z_{X_j}$ be the conditional mean variable of $X_j$ with respect to a certain partition $\mathcal{A}^{(j)}$, and let $R_{X_j}$ be the corresponding remainder. If $Cov(Z_{X_j}, R_{X_l}) = 0$ for all 𝑗, 𝑙, then $Cov(\boldsymbol{Z}_{\boldsymbol{X}},\boldsymbol{Z}_{\boldsymbol{X}})^{-1}Cov(\boldsymbol{Z}_{\boldsymbol{X}}, Y) = \boldsymbol{\beta}$.

Proof: If $Cov(Z_{X_j}, R_{X_l}) = 0$ for all 𝑗, 𝑙, this means that:

$Cov(Z_{X_j}, X_l) = Cov(Z_{X_j}, Z_{X_l})\quad \forall j, l$

$Cov(Z_{X_j}, Y) = Cov(Z_{X_j}, \beta_0 + \boldsymbol{\beta}^T\boldsymbol{X} + \varepsilon) = \sum_{l=1}^{p}\beta_l\,Cov(Z_{X_j}, X_l) = \sum_{l=1}^{p}\beta_l\,Cov(Z_{X_j}, Z_{X_l})\quad \forall j$

In the case that we observe the 𝑋-variables directly, we have:

$Cov(X_j, X_l) = Cov(X_j, X_l)\quad \forall j, l$

$Cov(X_j, Y) = Cov(X_j, \beta_0 + \boldsymbol{\beta}^T\boldsymbol{X} + \varepsilon) = \sum_{l=1}^{p}\beta_l\,Cov(X_j, X_l)\quad \forall j$

We see that the covariances of the $Z_X$-variables, which make up $Cov(\boldsymbol{Z}_{\boldsymbol{X}},\boldsymbol{Z}_{\boldsymbol{X}})$ and $Cov(\boldsymbol{Z}_{\boldsymbol{X}}, Y)$, follow the same pattern as the covariances of the 𝑋-variables, which make up $Cov(\boldsymbol{X},\boldsymbol{X})$ and $Cov(\boldsymbol{X}, Y)$, with $Cov(X_j, X_l)$ replaced by $Cov(Z_{X_j}, Z_{X_l})$ everywhere and no further changes. Thus:

$Cov(\boldsymbol{Z}_{\boldsymbol{X}},\boldsymbol{Z}_{\boldsymbol{X}})^{-1}Cov(\boldsymbol{Z}_{\boldsymbol{X}}, Y) = Cov(\boldsymbol{X},\boldsymbol{X})^{-1}Cov(\boldsymbol{X}, Y) = \boldsymbol{\beta}$

As stated. □

The simplest case where Theorem 5 applies is when all the 𝑋-variables are independent of each other. If so, the same thing will hold true for all their conditional mean variables, and so
