
School of Education, Culture and Communication

Division of Applied Mathematics

MASTER THESIS IN MATHEMATICS / APPLIED MATHEMATICS

Estimation and Testing the Quotient of Two Models

by

Marko Dimitrov

Master thesis in mathematics / applied mathematics

DIVISION OF APPLIED MATHEMATICS

MÄLARDALEN UNIVERSITY SE-721 23 VÄSTERÅS, SWEDEN


School of Education, Culture and Communication

Division of Applied Mathematics

Master thesis in mathematics / applied mathematics

Date:

2018-06-01

Project name:

Estimation and Testing the Quotient of Two Models

Author: Marko Dimitrov
Supervisor(s): Christopher Engström
Reviewer: Milica Rančić
Examiner: Sergei Silvestrov
Comprising: 15 ECTS credits


Abstract

In the thesis, we introduce linear regression models such as Simple Linear Regression, Multiple Regression, and Polynomial Regression. We explain the basic methods of estimating the model parameters, Ordinary Least Squares (OLS) and Maximum Likelihood Estimation (MLE). We give the properties of the estimates and the assumptions the model must satisfy for the estimates to be the Best Linear Unbiased Estimates (BLUE). The basic bootstrap methods are introduced. A real-world problem is simulated in order to see how measurement error affects the quotient of two estimated models.


Acknowledgments

I would like to thank my supervisor, Senior Lecturer Christopher Engström of the School of Education, Culture and Communication at Mälardalen University. Prof. Engström consistently allowed this paper to be my own work but steered me in the right direction whenever he thought I needed it.

I would also like to thank Prof. Dr. Miodrag Đorđević, who was involved in the validation survey for this master thesis. Without his participation and input, the validation survey could not have been successfully conducted.

I would also like to acknowledge Senior Lecturer Milica Rančić of the School of Education, Culture and Communication at Mälardalen University as the reviewer, and I am gratefully indebted to her for her very valuable comments on this thesis.

The data used in the master thesis comes from ship log data gathered at Qtagg AB, from one ship, collected over roughly half a month, and I wish to acknowledge Qtagg AB for the data.

Finally, I must express my very profound gratitude to my friends and girlfriend for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Author:


Contents

List of Figures

List of Tables

Introduction

1 Simple Linear Regression
  1.1 The Model
  1.2 Estimation of the Model Parameters
    1.2.1 Ordinary Least Squares
    1.2.2 Properties of the Ordinary Least Square Estimators
    1.2.3 An Estimator of the Variance and Estimated Variances
  1.3 Hypothesis Testing, Confidence Intervals and t-test
  1.4 The Coefficient of Determination
  1.5 Maximum Likelihood Estimation

2 Multiple Regression
  2.1 The Model
  2.2 Estimation of the Model Parameters
    2.2.1 Ordinary Least Squares
    2.2.2 Properties of the Ordinary Least Squares Estimators
    2.2.3 An Estimator of the Variance and Estimated Variance
  2.3 Maximum Likelihood Estimation
    2.3.1 Properties of the Maximum Likelihood Estimators
  2.4 Polynomial Regression
    2.4.1 Orthogonal Polynomials

3 The Bootstrap
  3.1 Introduction
    3.1.1 Statistics
  3.2 The Bootstrap Estimates
  3.3 Parametric Simulation
    3.3.1 Approximations
  3.4 Non-parametric Simulation
  3.5 Confidence Intervals

4 Simulation and Evaluation
  4.1 Mathematical Description of the Problem
  4.2 An Analogy With the Real World
  4.3 Parameter Estimation
  4.4 Confidence Intervals
  4.5 True Values of the Quotient
  4.6 Evaluation of the Results

5 Discussion
  5.1 Conclusions
  5.2 Future Work
  5.3 Fulfillment of Thesis Objectives

A Definitions
  A.1 Linear Algebra
  A.2 Matrix Calculus
  A.3 Statistics

B Probability Distributions
  B.1 Binomial Distribution
  B.2 Uniform Distribution
  B.3 Generalized Pareto Distribution
  B.4 Normal Distribution
  B.5 Log-normal Distribution
  B.6 Gamma Distribution
  B.7 Student Distribution
  B.8 Chi-Square Distribution

Bibliography

Index

List of Figures

4.1 The data (velocity) without measurement errors

4.2 Case 1 - The data (fuel efficiency) without measurement errors

4.3 Case 2 - The data (fuel efficiency) without measurement errors

4.4 Case 3 - The data (fuel efficiency) without measurement errors

4.5 A sample of data taken from the Uniform Distribution

4.6 A sample of data taken from the Generalized Pareto Distribution

4.7 A sample of data taken from the Normal Distribution

4.8 A sample of data taken from the Log-normal Distribution

4.9 A sample of data taken from the Gamma Distribution

4.10 A sample of data taken from the Student’s t-distribution

List of Tables

4.1 Table of the confidence intervals for the mean of the quotient

Abbreviations and Acronyms

SSR   Sum of Squares due to Regression
SST   Total Sum of Squares
BLUE  Best Linear Unbiased Estimates
MLE   Maximum Likelihood Estimation
OLS   Ordinary Least Squares
RSS   Residual Sum of Squares
CDF   Cumulative Distribution Function
EDF   Empirical Distribution Function
PDF   Probability Density Function
i.i.d. Independent and Identically Distributed
df    Degrees of Freedom

Introduction

Regression analysis is a statistical technique used for analyzing data and for finding a relationship between two or more variables. Behind regression analysis lies elegant mathematical and statistical theory. It can be used in many fields: engineering, economics, biology, medicine, etc. The book by Dougherty [4] gives good examples of where regression can be used and how to use it.

We explain Simple Linear Regression; for much more information, proofs, theorems, and examples the author refers the reader to the book by Weisberg [10]. There are also good examples of Simple Linear Regression in the book by Dougherty [4].

Multiple Regression analysis is, perhaps, more important than Simple Linear Regression. There are a lot of results and books about Multiple Regression, starting with Rencher and Schaalje [6], to which the author refers the reader. I would also like to mention the books by Wasserman [9], Montgomery et al. [5], Seber and Lee [7], Casella and Berger [1], and Weisberg [10], which contain much more information. Besides simple linear regression and multiple regression, books such as Casella and Berger [1] and Weisberg [10] cover other linear regression models as well as nonlinear models.

For a better understanding of Polynomial Regression, the author refers the reader to the book by Wasserman [9] (Chapter 7), which gives enough information on when Regression Analysis works well and when it does not. The problem we mention in the thesis, ill-conditioning, is well explained and solved in Chapter 7 of Wasserman [9].

An excellent introduction to the bootstrap methods and confidence intervals is given in Davis [2] and Davison and Hinkley [3]. In the book by Van Der Vaart and Wellner [8] (Sections 3.6 to 3.9), the bootstrap empirical process is introduced, taking the bootstrap method to the next level.

Our goal is to simulate a real-world problem and use the methods we mention. We will make some assumptions, estimate two models (Simple Linear, Multiple, or Polynomial regression models), and look at the quotient of the two models. One could search for the distribution of the quotient of the two models, but that can be very complicated, which is why we introduce the bootstrap method. By computing confidence intervals for the mean of the quotient, we get the results. We introduce different types of measurement errors into the data and see how that affects the quotient.

Formulation of Problem and Goal of the Thesis

This project was inspired by my supervisor Christopher Engström. The formulation of the problem studied in the project and the goal of the project, given by the supervisor, are below.

When creating a new control system or new hardware for a vehicle or some other machine, there is also a need to test it in practice, for example, if you want to evaluate whether one method is more fuel efficient than another. The standard way to do this is to perform the testing in a controlled environment where you can limit the number of outside influences on the system.

However, doing the tests in a controlled environment is not always possible, either because of cost considerations or because the thing you want to test is hard to achieve in a controlled environment.

The goal of the thesis is to evaluate how well a quotient between two models, for example, the fuel efficiency of two engines, behaves when the data is taken in a non-controlled environment where the two engines cannot be tested simultaneously and the effects of outside factors are large.

Mathematically, the problem can be described as follows:

1) Given two sets of data, a model is constructed for each in order to predict one of the variables given the others (regression problem);

2) From these two models try to predict the quotient between the two predicted variables if they were given the same input parameters, for example by computing confidence intervals;

3) Introduce bias or different types of errors with known distributions into the data and determine how this affects the randomness in the result;

4) The project should be made using a mixed theoretical and experimental approach but may lean towards one or the other.

Chapter 1

Simple Linear Regression

1.1 The Model

Regression is a method of finding the relationship between two variables Y and X. The variable Y is called a response variable and the variable X is called a covariate. The variable X is also called a predictor variable or a feature. In the simple linear regression we have only one covariate, but, as we will see later, there can be more covariates.

Let us assume that we have a set of data D = {(yi, xi)}, i = 1, 2, . . . , N. To find the relationship between Y and X we estimate the regression function

\[ r(x) = E(Y \mid X = x) = \int y \, f(y \mid x)\, dy. \tag{1.1} \]

The simplest choice is to assume that the regression function is a linear function:

\[ r(x) = \theta_0 + \theta_1 x, \]

where x is a scalar (not a vector).

Besides the regression function (mean function), the simple linear regression model consists of another function,

\[ \mathrm{Var}(Y \mid X = x) = \sigma^2, \]

which is the variance function.

By changing the parameters θ0 and θ1 we can get every possible line. To us, the parameters are unknown and we have to estimate them by using the data D.

Since the variance σ2 is positive, the observed value, in general, will not be the same as the expected value. Because of that, to account for the difference between those values, we look at the error

\[ \xi_i = y_i - (\theta_0 + \theta_1 x_i) \]

for every i ∈ {1, 2, . . . , N}. The errors depend on the parameters and are not observable; therefore they are random variables.

We can write the simple linear regression model as

\[ y_i = \theta_0 + \theta_1 x_i + \xi_i, \quad i = 1, 2, \ldots, N. \tag{1.2} \]

The model is called simple because there is only one covariate used to predict the response variable, and linear means that the model (1.2) is linear in the parameters θ0 and θ1; to be precise, the assumption is that the regression function (1.1) is linear. Considering that the ξi are random variables, the yi are random variables as well.

For the model to be complete, we have to make the following assumptions about the errors ξi, i = 1, 2, . . . , N.

1. E(ξi | xi) = 0 for all i = 1, 2, . . . , N;

2. Var(ξi | xi) = σ2 for all i = 1, 2, . . . , N;

3. Cov(ξi, ξj | xi, xj) = 0 for all i ≠ j, i, j = 1, 2, . . . , N.

The first assumption guarantees that the model (1.2) is well defined. It is equivalent to E(yi | xi) = θ0 + θ1 xi, which means that yi depends only on xi and all other factors are random, contained in ξi.

The second assumption implies Var(yi | xi) = σ2: the variance is constant and does not depend on the values of xi.

The third assumption is equivalent to Cov(yi, yj | xi, xj) = 0. The errors, as well as the variables yi, are uncorrelated with each other. Under the assumption of normality, this would mean that the errors are independent.

1.2 Estimation of the Model Parameters

1.2.1 Ordinary Least Squares

One of many methods to estimate the unknown parameters θ0 and θ1 in (1.2) is the Ordinary Least Squares (OLS) method.

Let θ̂0 and θ̂1 be the estimates of θ0 and θ1. We define the fitted line by

\[ \hat{r}(x) = \hat{\theta}_0 + \hat{\theta}_1 x, \]

the fitted values as

\[ \hat{y}_i = \hat{r}(x_i), \]

the residuals as

\[ \hat{\xi}_i = y_i - \hat{y}_i = y_i - (\hat{\theta}_0 + \hat{\theta}_1 x_i), \]

and the residual sum of squares (RSS) by

\[ \mathrm{RSS} = \sum_{i=1}^{N} \hat{\xi}_i^2. \tag{1.3} \]

By minimizing the residual sum of squares we get the estimates θ̂0 and θ̂1. Those estimates are called the least squares estimates. The function we want to minimize is

\[ \mathrm{RSS}(\theta_0, \theta_1) = \sum_{i=1}^{N} \big(y_i - (\theta_0 + \theta_1 x_i)\big)^2 \tag{1.4} \]

and by solving the linear system

\[ \frac{\partial\, \mathrm{RSS}(\theta_0, \theta_1)}{\partial \theta_0} = 0, \qquad \frac{\partial\, \mathrm{RSS}(\theta_0, \theta_1)}{\partial \theta_1} = 0 \tag{1.5} \]

we get θ̂0 and θ̂1. When we differentiate, the linear system (1.5) becomes

\[ -2 \sum_{i=1}^{N} \big(y_i - (\theta_0 + \theta_1 x_i)\big) = 0, \qquad -2 \sum_{i=1}^{N} \big(y_i - (\theta_0 + \theta_1 x_i)\big) x_i = 0. \]

Solving the linear system we get the least squares estimates

\[ \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}, \qquad \hat{\theta}_1 = \frac{\sum_{i=1}^{N} x_i y_i - N \bar{x}\bar{y}}{\sum_{i=1}^{N} x_i^2 - N \bar{x}^2} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \tag{1.6} \]

where

\[ \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i \quad \text{and} \quad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i. \]

The estimates given in (1.6) are the estimates which minimize the function (1.4), provided the second derivatives are positive, which can be checked. We can also notice that the function (1.4) has no maximum; therefore the estimates give the minimum.
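To make the closed-form estimates (1.6) concrete, the following is a minimal Python/NumPy sketch (the thesis's own computations are done in MATLAB; the function name and the synthetic data below are assumptions made purely for illustration).

```python
import numpy as np

def ols_simple(x, y):
    """Ordinary least squares estimates (1.6) for y_i = theta0 + theta1*x_i + xi_i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # theta1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    theta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # theta0_hat = y_bar - theta1_hat * x_bar
    theta0_hat = y_bar - theta1_hat * x_bar
    return theta0_hat, theta1_hat

# Small synthetic check: data generated from a known line plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=x.size)
print(ols_simple(x, y))   # estimates should be close to (2.0, 3.0)
```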

1.2.2 Properties of the Ordinary Least Square Estimators

To estimate the parameters θ0 and θ1, the three assumptions in Section 1.1 were not used. Even if the assumption E(ξi | xi) = 0 for all i = 1, 2, . . . , N does not hold, we can define ŷi = θ̂0 + θ̂1 xi to fit the data D = {yi, xi}, i = 1, . . . , N.

The estimates θ̂0 and θ̂1 are also random variables, because they depend on the statistical errors. If the assumptions in Section 1.1 hold, then by the Gauss-Markov theorem (see Theorem 1 in Chapter 2) the estimators θ̂0 and θ̂1 are unbiased and have the minimum variance among all linear unbiased estimators of the parameters θ0 and θ1,

\[ E(\hat{\theta}_0 \mid X) = \theta_0, \qquad E(\hat{\theta}_1 \mid X) = \theta_1, \]

and the variances of the estimates are

\[ \mathrm{Var}(\hat{\theta}_0 \mid X) = \sigma^2 \left[ \frac{1}{N} + \frac{\bar{x}^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \right], \qquad \mathrm{Var}(\hat{\theta}_1 \mid X) = \frac{\sigma^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}. \tag{1.7} \]

Since θ̂0 depends on θ̂1, it is no surprise that the estimates are correlated,

\[ \mathrm{Cov}(\hat{\theta}_0, \hat{\theta}_1 \mid X) = \frac{-\sigma^2 \bar{x}}{\sum_{i=1}^{N} (x_i - \bar{x})^2}. \]

The estimates θ̂0 and θ̂1 are called Best Linear Unbiased Estimates (BLUE).

1.2.3 An Estimator of the Variance and Estimated Variances

Ordinary Least Squares does not yield an estimate of the variance. Naturally, an estimate σ̂2 should be obtained by averaging the squared residuals, because

\[ \sigma^2 = E\Big[\big(y_i - E(y_i \mid x_i)\big)^2 \,\Big|\, x_i\Big]. \]

From assumption 2 in Section 1.1, we have the constant variance σ2 for every yi, i = 1, 2, . . . , N. Also, we use ŷi to estimate E(yi | xi). To get an unbiased estimate σ̂2 of σ2, we divide the RSS (1.3) by its degrees of freedom (df), where the residual df is the number of cases in the data D (that is, N) minus the number of parameters, which is 2. The estimate is

\[ \hat{\sigma}^2 = \frac{\mathrm{RSS}}{N - 2}; \tag{1.8} \]

this quantity is called the residual mean square.

To estimate the variances of θ̂0 and θ̂1 we simply replace σ2 with σ̂2 in (1.7). Therefore,

\[ \widehat{\mathrm{Var}}(\hat{\theta}_0 \mid X) = \hat{\sigma}^2 \left[ \frac{1}{N} + \frac{\bar{x}^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \right], \qquad \widehat{\mathrm{Var}}(\hat{\theta}_1 \mid X) = \frac{\hat{\sigma}^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \]

are the estimated variances.
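The residual mean square (1.8) and the estimated variances can be computed directly from the formulas above. Here is a small, self-contained Python sketch (illustrative only; the helper name and the synthetic data are mine, not the thesis's).

```python
import numpy as np

def ols_variance_estimates(x, y):
    """Residual mean square (1.8) and estimated variances of theta0_hat, theta1_hat."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = x.size
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)
    theta1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
    theta0 = y_bar - theta1 * x_bar
    rss = np.sum((y - theta0 - theta1 * x) ** 2)
    sigma2_hat = rss / (N - 2)                      # (1.8)
    var_theta0 = sigma2_hat * (1.0 / N + x_bar ** 2 / sxx)
    var_theta1 = sigma2_hat / sxx
    return sigma2_hat, var_theta0, var_theta1

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=x.size)
print(ols_variance_estimates(x, y))   # sigma2_hat should be close to 0.09
```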

1.3 Hypothesis Testing, Confidence Intervals and t-test

Until now, we did not need to make any assumptions about the distribution of the errors besides the three assumptions in Section 1.1. Suppose that we add the following assumption:

\[ \xi_i \mid x_i \sim N(0, \sigma^2), \quad i = 1, 2, \ldots, N. \]

Since the predictions are linear combinations of the errors, we have

\[ y_i \mid x_i \sim N(\theta_0 + \theta_1 x_i, \sigma^2), \quad i = 1, 2, \ldots, N. \]

With this assumption we can construct confidence intervals for the model parameters and test hypotheses. Perhaps we are most interested in hypotheses about θ1; by testing the hypotheses

\[ H_0 : \theta_1 = 0, \qquad H_1 : \theta_1 \neq 0 \tag{1.9} \]

we can determine whether there is actually a linear relationship between X and Y.

In general, we can test the hypotheses

\[ H_0 : \theta_1 = c, \qquad H_1 : \theta_1 \neq c, \tag{1.10} \]

where c is an arbitrary constant. Depending on what we need to determine, we choose the constant c. Before we examine the hypotheses (1.9) and (1.10) we need the following properties:

• θ̂1 ∼ N(θ1, σ2 / ∑(xi − x̄)2);

• (N − 2) σ̂2 / σ2 ∼ χ2(N − 2);

• θ̂1 and σ̂2 are independent random variables,

where σ̂2 is given by (1.8).

Using these properties, the hypothesis test of (1.10) is based on the t-statistic

\[ t = \frac{\hat{\theta}_1 - c}{\sqrt{\widehat{\mathrm{Var}}(\hat{\theta}_1 \mid X)}}, \tag{1.11} \]

where the denominator is the estimated standard deviation of θ̂1. The t-statistic given by (1.11) has a t(N − 2, δ) distribution. The non-centrality parameter δ is given by

\[ \delta = \frac{E(\hat{\theta}_1 \mid X)}{\sqrt{\mathrm{Var}(\hat{\theta}_1 \mid X)}} = \frac{\theta_1}{\sigma} \sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}. \tag{1.12} \]

The hypothesis (1.9) is just a special case of the hypothesis (1.10), which means the t-statistic for (1.9) is

\[ t = \frac{\hat{\theta}_1}{\hat{\sigma} \Big/ \sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}}, \]

where t is distributed as t(N − 2), because from (1.12), if H0 : θ1 = 0 holds we have δ = 0.

For the two-sided alternative hypothesis given in (1.9), we reject the null hypothesis H0 at significance level α when |t| ≥ t_{α/2, N−2}, where t_{α/2, N−2} is the upper α/2 percentage point of the central Student's t-distribution.

The probability p corresponding to the absolute value of the observed t (obtained through the inverse of the distribution function) is called the p-value. Considering that

\[ p > \alpha \implies \frac{p}{2} > \frac{\alpha}{2} \implies |t| < t_{\alpha/2, N-2}, \]

we accept the null hypothesis H0 when p > α. Alternatively, if p ≤ α we reject H0.

Finally, to get the confidence interval, starting from

\[ P\{ |t| \le t_{\alpha/2, N-2} \} = 1 - \alpha \]

and using transformations, a 100(1 − α)% confidence interval for θ1 is given by

\[ \hat{\theta}_1 - t_{\alpha/2, N-2} \cdot \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}} \;\le\; \theta_1 \;\le\; \hat{\theta}_1 + t_{\alpha/2, N-2} \cdot \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}}. \]
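A hedged illustration of the t-test (1.11) and the confidence interval for θ1, using SciPy's Student's t distribution for the quantiles; the function name, the synthetic data and the default α = 0.05 are assumptions made for this sketch, not part of the thesis.

```python
import numpy as np
from scipy import stats

def t_test_and_ci(x, y, c=0.0, alpha=0.05):
    """t-statistic (1.11) for H0: theta1 = c, its p-value, and the CI for theta1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = x.size
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)
    theta1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
    theta0 = y_bar - theta1 * x_bar
    sigma2_hat = np.sum((y - theta0 - theta1 * x) ** 2) / (N - 2)
    se_theta1 = np.sqrt(sigma2_hat / sxx)            # sqrt of estimated Var(theta1_hat)
    t_stat = (theta1 - c) / se_theta1                # (1.11)
    p_value = 2 * stats.t.sf(abs(t_stat), df=N - 2)  # two-sided p-value
    t_crit = stats.t.ppf(1 - alpha / 2, df=N - 2)    # upper alpha/2 point
    ci = (theta1 - t_crit * se_theta1, theta1 + t_crit * se_theta1)
    return t_stat, p_value, ci

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 40)
y = 0.7 + 1.2 * x + rng.normal(0, 0.5, size=x.size)
print(t_test_and_ci(x, y))   # H0: theta1 = 0 should be rejected for these data
```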

1.4 The Coefficient of Determination

We define the coefficient of determination as

\[ R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}}, \]

where SSR = ∑(ŷi − ȳ)2 is the sum of squares due to regression and SST = ∑(yi − ȳ)2 is the total sum of squares.

Since it can be proved that

\[ \mathrm{SST} = \mathrm{RSS} + \mathrm{SSR}, \]

the total sum of squares is in fact the total amount of variation in the yi. Considering this, we have

\[ 1 = \frac{\mathrm{SST}}{\mathrm{SST}} = \frac{\mathrm{RSS} + \mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{RSS}}{\mathrm{SST}} + R^2, \]

which means that R2 is the proportion of the variation that is explained by the model (by the regression).

From 0 ≤ RSS ≤ SST, it follows that R2 ∈ [0, 1]. The bigger R2 is, the more of the variability of Y is explained by the model.

We can always add more variables to the model and the coefficient will not decrease, but that does not mean that the new model is better. The error sum of squares should be reduced to get a better model.

Some computer packages use the adjusted coefficient of determination, given by

\[ R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS}/\mathrm{df}}{\mathrm{SST}/(N-1)}. \]

1.5 Maximum Likelihood Estimation

While the OLS method does not require assumptions about the errors to estimate the parameters, the Maximum Likelihood Estimation (MLE) method can be used if the distribution of the errors is known.

For the set of data D = {yi, xi}, i = 1, . . . , N, if we assume that the errors in the simple regression model are normally distributed,

\[ \xi_i \mid x_i \sim N(0, \sigma^2), \quad i = 1, 2, \ldots, N, \]

then

\[ y_i \mid x_i \sim N(\theta_0 + \theta_1 x_i, \sigma^2), \quad i = 1, 2, \ldots, N. \]

Since the parameters θ0, θ1 and σ2 are unknown, the likelihood function is given by

\[ L(y_i, x_i; \theta_0, \theta_1, \sigma^2) = \prod_{i=1}^{N} (2\pi\sigma^2)^{-\frac{1}{2}} \exp\!\left\{ -\frac{1}{2\sigma^2}(y_i - \theta_0 - \theta_1 x_i)^2 \right\} = (2\pi\sigma^2)^{-\frac{N}{2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \theta_0 - \theta_1 x_i)^2 \right\}. \tag{1.13} \]

The values θ̂0, θ̂1 and σ̂2 that maximize the function (1.13) are called the maximum likelihood estimates. Finding the maximum of the function (1.13) is the same as finding the maximum of its natural logarithm,

\[ \ln L(y_i, x_i; \theta_0, \theta_1, \sigma^2) = -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \theta_0 - \theta_1 x_i)^2. \tag{1.14} \]

To find the maximum of (1.14) we solve the system

\[ \frac{\partial \ln L(\theta_0, \theta_1, \sigma^2)}{\partial \theta_0} = 0, \qquad \frac{\partial \ln L(\theta_0, \theta_1, \sigma^2)}{\partial \theta_1} = 0, \qquad \frac{\partial \ln L(\theta_0, \theta_1, \sigma^2)}{\partial \sigma^2} = 0, \]

or equivalently

\[ \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \theta_0 - \theta_1 x_i) = 0, \qquad \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \theta_0 - \theta_1 x_i) x_i = 0, \qquad -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{N} (y_i - \theta_0 - \theta_1 x_i)^2 = 0. \tag{1.15} \]

The solution to (1.15) gives us the maximum likelihood estimates

\[ \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}, \qquad \hat{\theta}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i)^2}{N}, \tag{1.16} \]

where θ̂0 and θ̂1 are the same as the estimates in (1.6), which we obtained using the OLS method. From σ̂2 we can get an unbiased estimator of the parameter σ2, and σ̂2 is itself asymptotically unbiased.

Since the MLE method requires more assumptions, naturally, with more assumptions come better properties: the estimators have the minimum variance among all unbiased estimators.

Therefore, under the assumption of normality, the maximum likelihood estimates are the same as the OLS estimates.
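The agreement between the MLE and the OLS estimates can also be checked numerically. The sketch below (my own illustration in Python, not the thesis's MATLAB code) maximizes the log-likelihood (1.14) with a general-purpose optimizer and compares the result with the closed-form OLS solution; the parametrization by log σ2 and the synthetic data are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, x, y):
    """Negative of the log-likelihood (1.14); params = (theta0, theta1, log_sigma2)."""
    theta0, theta1, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)          # parametrize by log(sigma^2) to keep it positive
    resid = y - theta0 - theta1 * x
    n = x.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / (2 * sigma2)

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 80)
y = 2.0 - 0.8 * x + rng.normal(0, 0.6, size=x.size)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0, 0.0]), args=(x, y))
theta0_mle, theta1_mle = res.x[0], res.x[1]
sigma2_mle = np.exp(res.x[2])            # close to RSS / N, the biased estimate in (1.16)

# Closed-form OLS for comparison; slope and intercept should agree with the MLE.
x_bar, y_bar = x.mean(), y.mean()
theta1_ols = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta0_ols = y_bar - theta1_ols * x_bar
print(theta0_mle, theta1_mle, "vs OLS:", theta0_ols, theta1_ols)
```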

Chapter 2

Multiple Regression

In this chapter, we generalize the methods for estimating parameters from Chapter 1. Namely, we want to predict the variable Y using several features X1, X2, . . . , Xk, k ∈ N. Basically, we add features to explain parts of Y that have not been explained by the other features.

2.1 The Model

The regression function (1.1), under the assumption of linearity, for this problem becomes

\[ r(x) = E(Y \mid X = x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_k x_k, \]

where x is a vector, x = (x1, x2, . . . , xk).

Therefore, the multiple regression model can be written as

\[ y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_k x_k + \xi. \]

In order to estimate the parameters in the model, θ0, θ1, . . . , θk, we need N observations, a data set D. Suppose that we have data D = {yi, xi}, i = 1, . . . , N, where xi is a vector, xi = (xi1, xi2, . . . , xik), k ∈ N is the number of feature variables, and i = 1, 2, . . . , N. Hence, we can write the model for the ith observation as

\[ y_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + \ldots + \theta_k x_{ik} + \xi_i, \quad i = 1, 2, \ldots, N. \tag{2.1} \]

By saying linear model, we mean linear in the parameters. There are many examples where a model is not linear in the xij's but is linear in the θj's.

For k = 1, we get the simple regression model, so it is no surprise that the three assumptions from Section 1.1 should hold for the multiple regression as well, i.e.,

1. E(ξi | xi) = 0 for all i = 1, 2, . . . , N;

2. Var(ξi | xi) = σ2 for all i = 1, 2, . . . , N;

3. Cov(ξi, ξj | xi, xj) = 0 for all i ≠ j, i, j = 1, 2, . . . , N.

The interpretation of these assumptions is similar to the interpretation given in Section 1.1.

For k = 2, the mean function E(Y | X) = θ0 + θ1X1 + θ2X2 is a plane in 3 dimensions. If k > 2, we get a hyperplane; we cannot imagine or draw a k-dimensional plane for k > 2. Notice that the mean function given above means that we are conditioning on all values of the covariates.

For easier interpretation of the results, we would like to write the model (2.1) in matrix form. Start by writing (2.1) as

\[
\begin{aligned}
y_1 &= \theta_0 + \theta_1 x_{11} + \theta_2 x_{12} + \ldots + \theta_k x_{1k} + \xi_1 \\
y_2 &= \theta_0 + \theta_1 x_{21} + \theta_2 x_{22} + \ldots + \theta_k x_{2k} + \xi_2 \\
&\;\;\vdots \\
y_N &= \theta_0 + \theta_1 x_{N1} + \theta_2 x_{N2} + \ldots + \theta_k x_{Nk} + \xi_N,
\end{aligned}
\]

which gives us a clear view of how to write the model in matrix form. Simply,

\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{12} & x_{13} & \ldots & x_{1k} \\
1 & x_{21} & x_{22} & x_{23} & \ldots & x_{2k} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & x_{N3} & \ldots & x_{Nk}
\end{bmatrix}
\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_k \end{bmatrix}
+
\begin{bmatrix} \xi_1 \\ \xi_2 \\ \vdots \\ \xi_N \end{bmatrix},
\]

and if we denote

\[
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}; \quad
\mathbf{X} = \begin{bmatrix}
1 & x_{11} & x_{12} & x_{13} & \ldots & x_{1k} \\
1 & x_{21} & x_{22} & x_{23} & \ldots & x_{2k} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & x_{N3} & \ldots & x_{Nk}
\end{bmatrix}; \quad
\boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_k \end{bmatrix}; \quad
\boldsymbol{\xi} = \begin{bmatrix} \xi_1 \\ \xi_2 \\ \vdots \\ \xi_N \end{bmatrix}, \tag{2.2}
\]

it becomes y = Xθ + ξ.

The three assumptions can be expressed as E(ξ | X) = 0 and Cov(ξ | X) = σ2 I, where Var(ξi | xi) = σ2 and Cov(ξi, ξj | xi, xj) are contained in Cov(ξ | X) = σ2 I.

X is an N × (k + 1) matrix. We require the matrix to have full column rank, which means that N has to be greater than the number of columns (k + 1); otherwise, one of the columns could be a linear combination of the other columns. Throughout this chapter N will be greater than k + 1 and the rank of the matrix X will be k + 1, rank(X) = k + 1. The parameters θ are called regression coefficients.

2.2 Estimation of the Model Parameters

Our goal is to estimate the unknown parameters θ and σ2 from the data D. Depending on the distribution of the errors, we can use different methods to estimate the parameters.

2.2.1 Ordinary Least Squares

The method that does not require any assumptions about the exact distribution of the errors is Ordinary Least Squares.

The fitted values ŷi are given by

\[ \hat{y}_i = \hat{\theta}_0 + \hat{\theta}_1 x_{i1} + \hat{\theta}_2 x_{i2} + \ldots + \hat{\theta}_k x_{ik}, \quad i = 1, 2, \ldots, N. \]

To obtain the OLS estimators of the parameters in θ, we seek θ̂0, θ̂1, θ̂2, . . . , θ̂k that minimize

\[ \sum_{i=1}^{N} \hat{\xi}_i^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} \big(y_i - (\hat{\theta}_0 + \hat{\theta}_1 x_{i1} + \hat{\theta}_2 x_{i2} + \ldots + \hat{\theta}_k x_{ik})\big)^2. \tag{2.3} \]

One way to minimize the function is to find the partial derivatives with respect to each θ̂j, j = 0, 1, . . . , k, set the results equal to zero and solve the k + 1 equations to find the estimates. One of the reasons why we wrote the model in matrix form is to simplify these calculations. Therefore, since we assumed rank(X) = k + 1 < N, the following procedure holds.

Firstly, (2.3) can be written in matrix form as ξ̂'ξ̂ = ∑(yi − x'iθ̂)2, where

\[
\hat{\boldsymbol{\xi}} = \begin{bmatrix} \hat{\xi}_1 \\ \hat{\xi}_2 \\ \vdots \\ \hat{\xi}_N \end{bmatrix}; \quad
\hat{\boldsymbol{\theta}} = \begin{bmatrix} \hat{\theta}_0 \\ \hat{\theta}_1 \\ \vdots \\ \hat{\theta}_k \end{bmatrix}; \quad
\mathbf{x}_i = \begin{bmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{ik} \end{bmatrix}.
\]

So, the function (2.3) we want to minimize now becomes

\[
\sum_{i=1}^{N} \hat{\xi}_i^2 = \hat{\boldsymbol{\xi}}'\hat{\boldsymbol{\xi}} = \sum_{i=1}^{N} (y_i - \mathbf{x}_i'\hat{\boldsymbol{\theta}})^2
= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})
= \mathbf{y}'\mathbf{y} - (\mathbf{X}\hat{\boldsymbol{\theta}})'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol{\theta}} + (\mathbf{X}\hat{\boldsymbol{\theta}})'\mathbf{X}\hat{\boldsymbol{\theta}}
= \mathbf{y}'\mathbf{y} - 2\mathbf{y}'\mathbf{X}\hat{\boldsymbol{\theta}} + \hat{\boldsymbol{\theta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\theta}},
\]

where we used basic matrix operations. Now we use matrix calculus to obtain the estimates. Differentiating ξ̂'ξ̂ with respect to θ̂ and setting the result equal to zero, we get

\[ -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{0}, \]

from which we have

\[ \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}'\mathbf{y}. \]

From the assumption rank(X) = k + 1, the matrix X'X is a positive-definite matrix, therefore the matrix is nonsingular, so (X'X)−1 exists. Now we have the solution

\[ \hat{\boldsymbol{\theta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{2.4} \]

Checking that the Hessian of ξ̂'ξ̂ is a positive-definite matrix shows that θ̂ is indeed the minimum. The Hessian is 2X'X, which is a positive-definite matrix because of the assumption on the rank.

Since θ̂ minimizes the sum of squares, we call it the ordinary least squares estimator.
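As an illustration of the matrix solution (2.4), here is a minimal Python/NumPy sketch; solving the normal equations with np.linalg.solve instead of forming (X'X)−1 explicitly is a standard numerical choice, not something prescribed by the thesis, and the function name and synthetic data are assumptions made for the example.

```python
import numpy as np

def ols_multiple(X_features, y):
    """OLS estimator (2.4): theta_hat = (X'X)^{-1} X'y, with an intercept column added."""
    X_features = np.asarray(X_features, dtype=float)
    y = np.asarray(y, dtype=float)
    N = X_features.shape[0]
    X = np.column_stack([np.ones(N), X_features])     # design matrix with a leading column of 1s
    # Solve the normal equations X'X theta = X'y instead of inverting X'X explicitly.
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return X, theta_hat

rng = np.random.default_rng(4)
N, k = 100, 3
X_feat = rng.normal(size=(N, k))
true_theta = np.array([1.0, 2.0, -1.0, 0.5])          # theta0, theta1, theta2, theta3
y = np.column_stack([np.ones(N), X_feat]) @ true_theta + rng.normal(0, 0.2, size=N)
X, theta_hat = ols_multiple(X_feat, y)
print(theta_hat)                                       # should be close to true_theta
```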

2.2.2 Properties of the Ordinary Least Squares Estimators

Even without the assumptions we could obtain the OLS estimators, but their properties would not be as nice as with the assumptions.

Let us assume that E(y | X) = Xθ. The following holds:

\[ E(\hat{\boldsymbol{\theta}} \mid \mathbf{X}) = E\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \mid \mathbf{X}\big) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' E(\mathbf{y} \mid \mathbf{X}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\theta} = \boldsymbol{\theta}, \tag{2.5} \]

which means that θ̂ is an unbiased estimator of θ.

Let us now assume that Cov(ξ | X) = σ2 I. Under this assumption we can find the covariance matrix of θ̂:

\[
\mathrm{Cov}(\hat{\boldsymbol{\theta}} \mid \mathbf{X})
= \mathrm{Cov}\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \mid \mathbf{X}\big)
= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\, \mathrm{Cov}(\mathbf{y} \mid \mathbf{X})\, \big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\big)'
= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \sigma^2 \mathbf{I}\, \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}. \tag{2.6}
\]

Using these two properties, we can prove one of the most important theorems, known as the Gauss-Markov Theorem.

Theorem 1 (Gauss-Markov Theorem). If y = Xθ + ξ, E(ξ | X) = 0, Cov(ξ | X) = σ2 I and rank(X) = k + 1, then the ordinary least squares estimator given by (2.4) is the Best Linear Unbiased Estimator (BLUE): it has minimum variance among all linear unbiased estimators.

Proof. The linearity of the estimator is easy to see from (2.4). The proof that the estimator is unbiased is given in (2.5). Let us now prove that the variance σ2(X'X)−1 of the least squares estimator is the minimum among all linear unbiased estimators.

Assume that we have a linear estimator β̂ = B1 y of θ. Without loss of generality, there exists a matrix B such that B1 = (X'X)−1X' + B. Besides linearity, the estimator β̂ should also be unbiased, so E(β̂ | X) = θ must hold, and also

\[
E(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = E(\mathbf{B}_1\mathbf{y} \mid \mathbf{X}) = \mathbf{B}_1 E(\mathbf{y} \mid \mathbf{X})
= \big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}\big) E(\mathbf{X}\boldsymbol{\theta} + \boldsymbol{\xi} \mid \mathbf{X})
= \big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}\big)\mathbf{X}\boldsymbol{\theta}
= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\theta} + \mathbf{B}\mathbf{X}\boldsymbol{\theta}
= (\mathbf{I} + \mathbf{B}\mathbf{X})\boldsymbol{\theta},
\]

which implies that BX = 0.

The estimator was arbitrary; let us prove that its variance is greater than or equal to the variance of the OLS estimator. If we prove that Cov(β̂ | X) ⪰ Cov(θ̂ | X), it will imply that the variances of the θ̂i are the minimum among all others, because the diagonal elements of the matrices are the variances of the estimators. Note that ⪰ above means that Cov(β̂ | X) − Cov(θ̂ | X) is a positive semi-definite matrix. The following holds:

\[
\mathrm{Cov}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \mathrm{Cov}(\mathbf{B}_1\mathbf{y} \mid \mathbf{X}) = \mathbf{B}_1 \mathrm{Cov}(\mathbf{y} \mid \mathbf{X}) \mathbf{B}_1' = \sigma^2 \mathbf{B}_1\mathbf{B}_1'
= \sigma^2 \big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}\big)\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}\big)'
= \sigma^2 \big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{B}' + \mathbf{B}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} + \mathbf{B}\mathbf{B}'\big)
= \sigma^2 \big((\mathbf{X}'\mathbf{X})^{-1} + \mathbf{B}\mathbf{B}'\big),
\]

where we used that BX = 0 (and hence X'B' = 0). From (2.6) we have Cov(θ̂ | X) = σ2(X'X)−1, so

\[ \mathrm{Cov}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) - \mathrm{Cov}(\hat{\boldsymbol{\theta}} \mid \mathbf{X}) = \sigma^2 \mathbf{B}\mathbf{B}' \succeq \mathbf{0}, \]

since BB' is a positive semi-definite matrix. Considering the comment above, the OLS estimator is BLUE.

2.2.3 An Estimator of the Variance and Estimated Variance

Under the assumptions in Section 2.1, the variance of yi is constant for all i = 1, 2, . . . , N. Therefore,

\[ \mathrm{Var}(y_i \mid \mathbf{x}_i) = \sigma^2 = E\big[(y_i - E(y_i \mid \mathbf{x}_i))^2 \mid \mathbf{x}_i\big], \]

and also

\[ E(y_i \mid \mathbf{x}_i) = \mathbf{x}_i'\boldsymbol{\theta}. \]

Naturally, based on the data D = {yi, xi}, i = 1, . . . , N, we estimate the variance as

\[ \hat{\sigma}^2 = \frac{1}{N - k - 1} \sum_{i=1}^{N} (y_i - \mathbf{x}_i'\hat{\boldsymbol{\theta}})^2, \]

or, in matrix form,

\[ \hat{\sigma}^2 = \frac{\mathrm{RSS}}{N - k - 1}, \tag{2.7} \]

where RSS = (y − Xθ̂)'(y − Xθ̂) is the Residual Sum of Squares.

The statistic (2.7), under the assumptions in Section 2.1, is an unbiased estimator of the parameter σ2, i.e. E(σ̂2 | X) = σ2.

Using (2.6) and (2.7), the unbiased estimator of Cov(θ̂) is

\[ \widehat{\mathrm{Cov}}(\hat{\boldsymbol{\theta}}) = \hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}. \]

If we add one assumption to the Gauss-Markov theorem (Theorem 1), namely E(ξi4 | xi) = 3σ4, then the estimated variance (2.7) has minimum variance among all quadratic unbiased estimators, which can be proven; see Theorem 7.3g in [6].
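A short sketch of the estimator (2.7) and of Cov-hat(θ̂) = σ̂2(X'X)−1 in Python/NumPy (illustrative only; the helper name and the simulated design matrix are assumptions made for the example).

```python
import numpy as np

def variance_and_covariance(X, y, theta_hat):
    """sigma2_hat = RSS/(N-k-1) as in (2.7) and Cov_hat = sigma2_hat * (X'X)^{-1}."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    N, p = X.shape                       # p = k + 1 columns (intercept included)
    resid = y - X @ theta_hat
    rss = resid @ resid
    sigma2_hat = rss / (N - p)           # (2.7), with p = k + 1
    cov_hat = sigma2_hat * np.linalg.inv(X.T @ X)
    return sigma2_hat, cov_hat

rng = np.random.default_rng(5)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
theta_true = np.array([0.5, 1.5, -2.0])
y = X @ theta_true + rng.normal(0, 0.3, size=N)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat, cov_hat = variance_and_covariance(X, y, theta_hat)
print(sigma2_hat)                         # should be close to 0.09
print(np.sqrt(np.diag(cov_hat)))          # estimated standard errors of theta_hat
```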

2.3 Maximum Likelihood Estimation

So far, no assumptions have been made about the distribution of the errors. To obtain the Maximum Likelihood Estimator, we need to make such assumptions. In this section, we will assume normality of the random variable ξ. So, let ξ ∼ N_N(0, σ2 I), where N_N stands for the N-dimensional normal distribution. From the covariance matrix we have that the errors are uncorrelated, which, under the assumption of normality, means that they are independent as well.

The random variable y is normally distributed with expectation Xθ and covariance matrix σ2 I, which implies that the joint probability density function, which we denote by ϕ(y, X; θ, σ2), is

\[ \varphi(\mathbf{y}, \mathbf{X}; \boldsymbol{\theta}, \sigma^2) = \prod_{i=1}^{N} \varphi(y_i; \mathbf{x}_i, \boldsymbol{\theta}, \sigma^2), \]

because the yi are independent random variables. Equivalently, we can write it as

\[ \varphi(\mathbf{y}, \mathbf{X}; \boldsymbol{\theta}, \sigma^2) = (2\pi)^{-\frac{N}{2}} |\sigma^2 \mathbf{I}|^{-\frac{1}{2}} \exp\!\left\{ -\frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})' (\sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) \right\}, \]

from the definition of the density of the multivariate normal distribution. When y and X are known, the density function is treated as a function of the parameters θ and σ2, and in this case we call it the likelihood function and denote it by

\[ L(\mathbf{y}, \mathbf{X}; \boldsymbol{\theta}, \sigma^2) = (2\pi)^{-\frac{N}{2}} |\sigma^2 \mathbf{I}|^{-\frac{1}{2}} \exp\!\left\{ -\frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})' (\sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) \right\}. \tag{2.8} \]

By maximizing the function (2.8) for given y and X we obtain the maximum likelihood estimators of θ and σ2. Maximizing the logarithm of the function (2.8) is the same as maximizing the likelihood function, so, for easier calculation, we maximize the logarithm of the likelihood function. The log-likelihood function is

\[ \ln L(\mathbf{y}, \mathbf{X}; \boldsymbol{\theta}, \sigma^2) = -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}). \tag{2.9} \]

When we find the gradient of the function ln L(y, X; θ, σ2) with respect to θ and set it equal to 0, we get the Maximum Likelihood Estimator θ̂, which is given by

\[ \hat{\boldsymbol{\theta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \]

and it is the same as the estimator obtained with the OLS method. The biased estimator of the variance σ2 which we get from this is given by

\[ \hat{\sigma}_b^2 = \frac{1}{N} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}). \]

The unbiased estimator of the variance is

\[ \hat{\sigma}^2 = \frac{1}{N - k - 1} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}). \]

To verify that the estimator θ̂ actually maximizes the function (2.9), we calculate the Hessian matrix of the function (2.9) with respect to θ and prove that it is a negative definite matrix. Since this Hessian is −X'X/σ2, under the assumption from the beginning of the chapter that rank(X) = k + 1, we have −X'X/σ2 ≺ 0, which proves the claim.

2.3.1 Properties of the Maximum Likelihood Estimators

The following properties of the estimators hold under the assumption of normality of the distribution of the errors. We state them without proofs.

• θ̂ ∼ N_{k+1}(θ, σ2(X'X)−1);

• (N − k − 1) σ̂2 / σ2 ∼ χ2(N − k − 1);

• θ̂ and σ̂2 are independent;

• θ̂ and σ̂2 are jointly sufficient statistics for θ and σ2;

• The estimators θ̂ and σ̂2 have minimum variance among all unbiased estimators.

2.4 Polynomial Regression

In this section we introduce the Polynomial Regression model as a special case of the Multiple Regression model, with only a few properties and short descriptions. You can read more about Polynomial Regression in the book [7, Chapter 8] and in the book [10, Section 5.3]. Namely, if we set x_{ij} = x_i^j in (2.1), j = 1, 2, . . . , k, k ≤ N − 1, we get

\[ y_i = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \ldots + \theta_k x_i^k + \xi_i, \quad i = 1, 2, \ldots, N. \tag{2.10} \]

The inspiration for such a model arises from the Weierstrass approximation theorem (see [2, Chapter VI]), which states that every continuous function on a finite interval can be uniformly approximated as closely as desired by a polynomial function.

Although this seems like a great solution, a better approximation requires a higher polynomial degree, which means more unknown parameters to estimate. Theoretically k can go up to N − 1, but when k is greater than approximately 6, the matrix X'X becomes ill-conditioned and other problems arise.

The matrix X now becomes

\[
\mathbf{X} = \begin{bmatrix}
1 & x_1 & x_1^2 & x_1^3 & \ldots & x_1^k \\
1 & x_2 & x_2^2 & x_2^3 & \ldots & x_2^k \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_N & x_N^2 & x_N^3 & \ldots & x_N^k
\end{bmatrix} \tag{2.11}
\]

and the matrices y, θ and ξ are the same. The model (2.10) can be written as

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\xi}. \tag{2.12} \]

Even though the problem of finding the unknown parameters in Polynomial Regression is similar to the problem in Multiple Regression, Polynomial Regression has special features.

The model (2.10) is the kth order polynomial model in one variable. When k = 2 the model is called quadratic, when k = 3 the model is called cubic, and so on. The model can also be in two or more variables; for example, a second order polynomial in two variables is given by

\[ y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_{11} x_1^2 + \theta_{22} x_2^2 + \theta_{12} x_1 x_2 + \xi, \]

which is known as a response surface model.

For our purposes, we will study only Polynomial Regression in one variable.

We want to keep the order of the model as low as possible. By fitting a higher order polynomial we will most likely overfit the model, which means that such a model will not be a good predictor, nor will it enhance our understanding of the unknown function.

From our assumption of rank(X) = k + 1 (full column rank) in polynomial regression models, when we increase the order of the polynomial, as we mentioned, the matrix X'X becomes ill-conditioned. This implies that the parameters will be estimated with error, because (X'X)−1 might not be computed accurately.

2.4.1 Orthogonal Polynomials

Before computers, people had trouble calculating the powers x^0, x^1, . . . , x^k manually (by hand), but in order to fit the Polynomial Regression this is necessary. Assume we fit the Simple Linear Regression model to some data and we want to increase the order of the model, but not start again from the very beginning. What we want is a situation where adding an extra term merely refines the previous model. We can achieve that using a system of orthogonal polynomials. Now, with computers, this has less use. The system of orthogonal polynomials can mathematically be obtained using the Gram-Schmidt method. The kth orthogonal polynomial has degree k.

As we mentioned, ill-conditioning is a problem as well. In the polynomial regression model, the covariate columns x, x2, . . . , xk are strongly correlated rather than independent. This issue can also be solved by orthogonal polynomials.

There are continuous orthogonal polynomials and discrete orthogonal polynomials. The continuous orthogonal polynomials are the classical orthogonal polynomials such as Hermite polynomials, Laguerre polynomials and Jacobi polynomials. We use discrete orthogonal polynomials, where the orthogonality relation involves summation.

The columns of the matrix X in the model (2.12) are not orthogonal. So, if we want to add another term θ_{k+1} x_i^{k+1}, the matrix (X'X)−1 will change (we need to calculate it again). Also, the lower order estimates θ̂i, i = 0, 1, . . . , k, will change.

Let us instead fit the model

\[ y_i = \theta_0 P_0(x_i) + \theta_1 P_1(x_i) + \theta_2 P_2(x_i) + \ldots + \theta_k P_k(x_i) + \xi_i, \quad i = 1, 2, \ldots, N, \tag{2.13} \]

where the Pj(xi) are orthogonal polynomials, Pj(xi) is a jth order polynomial, j = 0, 1, . . . , k, and P0(xi) = 1. From orthogonality we have

\[ \sum_{i=1}^{N} P_m(x_i) P_n(x_i) = 0, \quad m \neq n, \quad m, n = 0, 1, \ldots, k. \]

The model (2.13) can be written in matrix form, y = Xθ + ξ, considering that the matrix X is now

\[
\mathbf{X} = \begin{bmatrix}
P_0(x_1) & P_1(x_1) & P_2(x_1) & \ldots & P_k(x_1) \\
P_0(x_2) & P_1(x_2) & P_2(x_2) & \ldots & P_k(x_2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
P_0(x_N) & P_1(x_N) & P_2(x_N) & \ldots & P_k(x_N)
\end{bmatrix},
\]

and from orthogonality the following holds:

\[
\mathbf{X}'\mathbf{X} = \begin{bmatrix}
\sum_{i=1}^{N} P_0^2(x_i) & 0 & 0 & \ldots & 0 \\
0 & \sum_{i=1}^{N} P_1^2(x_i) & 0 & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \ldots & \sum_{i=1}^{N} P_k^2(x_i)
\end{bmatrix}.
\]

We know that the ordinary least squares estimator is given by θ̂ = (X'X)−1X'y, or equivalently

\[ \hat{\theta}_j = \frac{\sum_{i=1}^{N} P_j(x_i) y_i}{\sum_{i=1}^{N} P_j^2(x_i)}, \quad j = 0, 1, 2, \ldots, k. \]

From (2.6) we have the variance, or equivalently

\[ \mathrm{Var}(\hat{\theta}_j \mid \mathbf{x}) = \frac{\sigma^2}{\sum_{i=1}^{N} P_j^2(x_i)}. \]

It is interesting to notice that

\[ \hat{\theta}_0 = \frac{\sum_{i=1}^{N} P_0(x_i) y_i}{\sum_{i=1}^{N} P_0^2(x_i)} = \frac{\sum_{i=1}^{N} y_i}{N} = \bar{y}. \]

If we want to add a term θ_{k+1} P_{k+1}(x_i) to the model (2.13), then the estimator of θ_{k+1} will be

\[ \hat{\theta}_{k+1} = \frac{\sum_{i=1}^{N} P_{k+1}(x_i) y_i}{\sum_{i=1}^{N} P_{k+1}^2(x_i)}. \]

To obtain this estimator, we do not change the other terms in the model; we only look at the newly added term. Because of the orthogonality, there is no need to find (X'X)−1 or any of the other estimators again. This is a way to easily fit a higher order polynomial regression model. We can terminate the process when we find the optimal (for our purpose) model.
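The construction described above can be sketched with a discrete Gram-Schmidt orthogonalization of the Vandermonde columns. The following Python example (an illustration under my own naming, not the thesis's code) also checks the two properties mentioned in this section: θ̂0 = ȳ, and adding a higher-order term leaves the lower-order coefficients unchanged.

```python
import numpy as np

def discrete_orthogonal_design(x, k):
    """Columns P_0(x_i), ..., P_k(x_i), orthogonalized by Gram-Schmidt (discrete inner product)."""
    x = np.asarray(x, dtype=float)
    V = np.vander(x, k + 1, increasing=True)   # columns 1, x, x^2, ..., x^k
    P = np.empty_like(V)
    for j in range(k + 1):
        p = V[:, j].copy()
        for m in range(j):                     # subtract projections onto previous polynomials
            p -= (P[:, m] @ V[:, j]) / (P[:, m] @ P[:, m]) * P[:, m]
        P[:, j] = p
    return P

def fit_orthogonal_poly(x, y, k):
    """theta_hat_j = sum P_j(x_i) y_i / sum P_j(x_i)^2, as in Section 2.4.1."""
    P = discrete_orthogonal_design(x, k)
    theta = np.array([(P[:, j] @ y) / (P[:, j] @ P[:, j]) for j in range(k + 1)])
    return theta, P

rng = np.random.default_rng(6)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0, 0.1, size=x.size)
theta_k3, P = fit_orthogonal_poly(x, y, k=3)
theta_k2, _ = fit_orthogonal_poly(x, y, k=2)
print(theta_k3[:3], theta_k2)                       # lower-order coefficients coincide
print(theta_k3[0], y.mean())                        # theta_hat_0 equals the sample mean
```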

Chapter 3

The Bootstrap

3.1 Introduction

Let x1, x2, . . . , xN be a homogeneous sample of data, which can be observed as the outcomes of independent and identically distributed (i.i.d.) random variables X1, X2, . . . , XN, with Probability Density Function (PDF) f and Cumulative Distribution Function (CDF) F. Using the sample, we can make inferences about a parameter θ (a population characteristic). To do that, we need a statistic S, which we assume we have already chosen and which is an estimate of θ (a scalar).

For our needs, we are focused on how to calculate confidence intervals for the parameter θ using the PDF of the statistic S. We could also be interested in its bias, standard error, or its quantiles.

There are two situations: the non-parametric one, which we are interested in, and the parametric one. Statistical methods based on a mathematical model with a parameter τ that fully determines the PDF f are called parametric methods, and the model is called a parametric model. In this case, the parameter θ is a function of the parameter τ. The statistical methods where we use only the fact that the random variables are i.i.d. are non-parametric methods, and the models are called non-parametric models.

For the non-parametric analysis, the empirical distribution is important. The empirical distribution assigns equal probabilities to each element of the sample xi, i = 1, 2, . . . , N. The probabilities are 1/N. The Empirical Distribution Function (EDF) F̂, as an estimate of the CDF F, is defined as

\[ \hat{F}(x) = \frac{\text{number of elements in the sample} \le x}{N}. \]

The function F̂ can also be written as

\[ \hat{F}(x) = \frac{1}{N} \sum_{i=1}^{N} I_{A_i}, \tag{3.1} \]

where I_{A_i} is the indicator of the event A_i = {x_i ≤ x}.

Because of the importance of the EDF, we will define it more formally. Define the function ν(x) as

\[ \nu(x) = \big|\{ j : X_j \le x,\; j = 1, 2, \ldots, N \}\big|, \quad x \in \mathbb{R}, \]

where | · | denotes the cardinality of a set.

Now we can define the EDF as

\[ \hat{F}(x) = \frac{\nu(x)}{N}, \quad x \in \mathbb{R}. \tag{3.2} \]

The random variable F̂(x) is a statistic with values in the set {0, 1/N, 2/N, . . . , (N − 1)/N, 1}. The distribution of this random variable is

\[ P\!\left( \hat{F}(x) = \frac{k}{N} \right) = P(\nu(x) = k) = \binom{N}{k} F(x)^k (1 - F(x))^{N-k}, \quad k = 0, 1, 2, \ldots, N, \]

where F is the CDF; that is, ν(x) = N F̂(x) follows a Binomial Distribution with parameters N and p = P(X ≤ x) = F(x), x ∈ R.

Considering the fact that E(I_{A_i}) = F(x) for x ∈ R,

\[ \hat{F} \xrightarrow{\;N \to \infty\;} F \text{ almost surely, i.e. } P(\hat{F} \xrightarrow{\;N \to \infty\;} F) = 1, \]

which can be proven by Borel's law of large numbers.
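A minimal sketch of the EDF (3.2) in Python (the function name and the simulated sample are assumptions for illustration): ν(x) is obtained by counting the sample points not exceeding x.

```python
import numpy as np

def edf(sample):
    """Return the empirical distribution function F_hat of the sample, as in (3.2)."""
    data = np.sort(np.asarray(sample, dtype=float))
    N = data.size
    def F_hat(x):
        # nu(x) = number of sample points <= x, so F_hat(x) = nu(x) / N
        return np.searchsorted(data, x, side="right") / N
    return F_hat

rng = np.random.default_rng(7)
sample = rng.normal(0, 1, size=200)
F_hat = edf(sample)
print(F_hat(0.0))                         # should be close to the true CDF value Phi(0) = 0.5
print(F_hat(np.array([-1.0, 0.0, 1.0])))  # evaluates at several points at once
```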

3.1.1 Statistics

Many statistics can be represented as a property of the EDF. For example, x̄ = (1/N)∑xi (the sample average) is the mean of the EDF. Generally, the statistic s is a function of x1, x2, . . . , xN and is not affected by reordering the data, which implies that the statistic s depends on the data only through the EDF F̂. So, the statistic s can be written as a function of F̂, s = s(F̂). The statistical function s(·) can be perceived as a rule for computing the statistic s from the function F̂. This function is useful in the non-parametric case, since the parameter θ is defined by the function as s(F) = θ. The mean and the variance can be written as statistical functions:

\[ s(F) = \int x \, dF(x), \qquad s(F) = \int x^2 \, dF(x) - \left( \int x \, dF(x) \right)^2. \]

For the parametric methods we often define θ as a function of the model parameter τ, but the same definition holds for them too.

The notation S = s(·) will be used for the statistic viewed as a function, and s for the estimate of θ based on the data x1, x2, . . . , xN. The estimate can usually be expressed as s = s(F̂), which mirrors the relation θ = s(F) between the parameter θ and the CDF F.

From the definition (3.1), F̂ → F as N → ∞, as we mentioned before; then, if s(·) is continuous, S converges to θ as N → ∞ (consistency).

We will not go into more details; applying the bootstrap does not require such formality. We will assume that S = s(F̂).

3.2 The Bootstrap Estimates

Finding the distribution of the statistic S can help us make inferences about the parameter θ. For example, if we want to obtain a 100(1 − 2α)% confidence interval for θ, we could possibly show that the statistic S has approximately a normal distribution with mean θ + β and standard deviation σ, where β is the bias of S. Under the assumption that the bias and the variance are known,

\[ P(S \le s \mid F) \approx \Phi\!\left( \frac{s - (\theta + \beta)}{\sigma} \right), \]

where the function Φ is

\[ \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{t^2}{2}} \, dt, \quad z \in \mathbb{R}. \]

If the α quantile of the standard normal distribution is z_α = Φ^{-1}(α), then a 100(1 − 2α)% confidence interval for θ is

\[ s - \beta - \sigma \cdot z_{1-\alpha} \le \theta \le s - \beta - \sigma \cdot z_{\alpha}, \tag{3.3} \]

which we obtained from

\[ P\big( \beta + \sigma \cdot z_{\alpha} \le S - \theta \le \beta + \sigma \cdot z_{1-\alpha} \big) \approx 1 - 2\alpha. \]

However, the bias and the variance will almost never be known. Therefore, we need to estimate them. Express β and σ2 as

\[ \beta = b(F) = E(S \mid F) - s(F), \qquad \sigma^2 = v(F) = \mathrm{Var}(S \mid F), \]

where S | F means that the random variables from which S is calculated have distribution F (X1, X2, . . . , XN are i.i.d. with CDF F). Assume that F̂ is an estimate of the function F; then we can obtain the estimates of β and σ2 as

\[ B = b(\hat{F}) = E(S \mid \hat{F}) - s(\hat{F}), \qquad V = v(\hat{F}) = \mathrm{Var}(S \mid \hat{F}). \tag{3.4} \]

These estimates are called the bootstrap estimates.

3.3 Parametric Simulation

The bootstrap idea has two steps: first estimating the parameters, and then approximating their properties using simulation. We do that because sometimes we cannot simply express a formula for calculating the parameter estimates.

The practical alternative is re-sampling the data from a fitted parametric model, and calculating the properties of S which we need.

Let F_τ be the CDF and f_τ the PDF. Suppose that we have data x1, x2, . . . , xN and a parametric model for the distribution of the data. Let F̂(x) = F_τ̂(x) be the CDF of the fitted model, which we get when we estimate τ (usually) by Maximum Likelihood Estimation with τ̂. Denote by X* a random variable distributed according to F̂.

3.3.1 Approximations

Suppose now that the exact calculation is for some reason too complicated. As mentioned, the alternative is to simulate data sets (re-sample) and estimate the properties. Let X1*, . . . , XN* be an i.i.d. data set from the distribution F̂. Denote by S* the statistic calculated from the simulated data set. By repeating the process R times, we obtain R values S1*, S2*, . . . , SR*. The estimator of the bias now becomes

\[ B = b(\hat{F}) = E(S \mid \hat{F}) - s = E^*(S^*) - s, \]

and this is estimated by

\[ B_R = \frac{1}{R} \sum_{r=1}^{R} S_r^* - s = \bar{S}^* - s. \]

Here, s is the parameter value for the fitted model, so S* − s is the analog of S − θ. Similarly, the estimator of the variance of S is

\[ V_R = \frac{1}{R-1} \sum_{r=1}^{R} (S_r^* - \bar{S}^*)^2. \]

As R increases, by the law of large numbers, B_R converges to B (the exact value under the fitted model), and likewise V_R converges to V.
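To illustrate the estimators B_R and V_R, here is a small parametric-bootstrap sketch in Python where the statistic S is the sample mean and the fitted parametric model is a normal distribution estimated by maximum likelihood; the choice of statistic, model, R and the data are all assumptions made for the example.

```python
import numpy as np

def parametric_bootstrap_bias_var(sample, R=2000, seed=0):
    """B_R and V_R for S = sample mean under a fitted normal model (parametric bootstrap)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    N = sample.size
    # Fit the parametric (normal) model by maximum likelihood.
    mu_hat, sigma_hat = sample.mean(), sample.std()
    s = sample.mean()                          # the observed statistic s = s(F_hat)
    # Re-sample R data sets from the fitted model and recompute the statistic.
    s_star = rng.normal(mu_hat, sigma_hat, size=(R, N)).mean(axis=1)
    B_R = s_star.mean() - s                    # bootstrap estimate of the bias
    V_R = s_star.var(ddof=1)                   # bootstrap estimate of the variance
    return B_R, V_R

rng = np.random.default_rng(8)
data = rng.normal(5.0, 2.0, size=100)
print(parametric_bootstrap_bias_var(data))    # bias near 0, variance near var(data)/N
```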

3.4 Non-parametric Simulation

Suppose that we have X1, X2, . . . , XN for which it is sensible to assume that they are i.i.d. from an unknown distribution F. Using the EDF F̂ we estimate the CDF F, and we use F̂ as we would use a parametric model: first we see whether we can calculate the required properties directly; if not, we simulate data sets (re-sampling) and approximate the properties empirically.

Simulation using the EDF is based on the fact that the EDF puts equal probabilities on each value of the data set x1, x2, . . . , xN. So, every simulated sample (re-sample) X1*, X2*, . . . , XN* is taken at random, with replacement, from the data values x1, x2, . . . , xN.

3.5 Confidence Intervals

The distribution of S can be used to calculate confidence intervals, which is the main goal of the bootstrap for our needs. There are multiple ways to use bootstrap simulation; we will describe two methods.

We could use the normal approximation of the distribution of S. This means that we estimate the limits in (3.3) using the bootstrap estimates of the bias and the variance. Alternatively, using the bootstrap method, we can estimate the quantiles of S − θ by s*_{((R+1)p)} − s, where we assume that (R + 1)p is a whole number, so the p quantile of S − θ is the (R + 1)p-th ordered value of s* − s, that is s*_{((R+1)p)} − s. So, a 100(1 − 2α)% confidence interval is

\[ 2s - s^*_{((R+1)(1-\alpha))} \le \theta \le 2s - s^*_{((R+1)\alpha)}, \tag{3.5} \]

which can be obtained from

\[ P(a \le S - \theta \le b) = 1 - 2\alpha \implies P(S - b \le \theta \le S - a) = 1 - 2\alpha. \]

The interval (3.5) is called the basic bootstrap confidence interval. The bigger R is, the more accurate the confidence interval will be. Typically, one takes R > 1000, but there are more factors that the accuracy depends on; for more details you can check the books mentioned in the Bibliography.

When the distribution of S − θ depends on unknowns, we try to mimic the Student's t statistic; therefore we define a studentized version of S − θ as

\[ Z = \frac{S - \theta}{\sqrt{V}}, \]

where V is an estimate of Var(S | F). With this, we eliminate the unknown standard deviation, as when making inference about the normal mean.

The Student's t 100(1 − 2α)% confidence interval for the mean is

\[ \bar{x} - \hat{\sigma} \cdot t_{N-1}(1-\alpha) \le \theta \le \bar{x} - \hat{\sigma} \cdot t_{N-1}(\alpha), \]

where σ̂ is the estimated standard deviation of the mean and t_N(α) is the α quantile of the Student's t distribution with N degrees of freedom. We can obtain a 100(1 − 2α)% confidence interval for θ analogously, from the distribution of Z, as follows:

\[ s - \sqrt{V} \cdot z_{1-\alpha} \le \theta \le s - \sqrt{V} \cdot z_{\alpha}, \]

where z_p is the p quantile of Z. To estimate the quantiles of Z, we use replicates of the studentized bootstrap statistic

\[ Z^* = \frac{S^* - s}{\sqrt{V^*}}, \]

where the values are obtained from the re-samples X1*, X2*, . . . , XN*. When we use the simulated values z1*, z2*, . . . , zR* to estimate z_α, we obtain the studentized bootstrap confidence interval for θ:

\[ s - \sqrt{V} \cdot z^*_{((R+1)(1-\alpha))} \le \theta \le s - \sqrt{V} \cdot z^*_{((R+1)\alpha)}. \]

The studentized bootstrap method is used to obtain confidence intervals in our non-parametric problem.
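The two intervals described in this section can be sketched as follows (a Python illustration, not the thesis's implementation); the statistic is taken to be the sample mean, V is its usual variance estimate, and R, α and the data are assumptions for the example.

```python
import numpy as np

def bootstrap_cis(sample, R=1999, alpha=0.025, seed=0):
    """Basic bootstrap interval (3.5) and the studentized bootstrap interval for the mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(sample, dtype=float)
    N = x.size
    s = x.mean()
    V = x.var(ddof=1) / N                         # estimated variance of the sample mean
    # Non-parametric re-samples: draw with replacement from the data.
    idx = rng.integers(0, N, size=(R, N))
    xs = x[idx]
    s_star = xs.mean(axis=1)
    V_star = xs.var(axis=1, ddof=1) / N
    # Basic bootstrap interval (3.5), using ordered values of s* as quantile estimates.
    s_sorted = np.sort(s_star)
    lo = int((R + 1) * alpha) - 1                 # (R+1)*alpha-th ordered value (0-based)
    hi = int((R + 1) * (1 - alpha)) - 1
    basic = (2 * s - s_sorted[hi], 2 * s - s_sorted[lo])
    # Studentized bootstrap interval.
    z_star = np.sort((s_star - s) / np.sqrt(V_star))
    stud = (s - np.sqrt(V) * z_star[hi], s - np.sqrt(V) * z_star[lo])
    return basic, stud

rng = np.random.default_rng(9)
data = rng.lognormal(mean=0.0, sigma=0.5, size=80)
print(bootstrap_cis(data))    # both intervals should typically cover the true mean (about 1.13)
```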

Chapter 4

Simulation and Evaluation

In this chapter, we will simulate a real-world problem using data created in the programming language MATLAB.

The goal is to estimate two models and test the quotient between them. In order to do that, we are assuming that we know the true relationship between the variables we observe - we know the true models.

The data from which we need to estimate a model has measurement errors. Therefore, suppose that we know the real data and the data with the measurement error. We want to see how the assumption of the measurement error’s distribution affects the quotient.

4.1 Mathematical Description of the Problem

Let D̃1 = {ỹi1, x̃i1}, i = 1, . . . , N1, and D̃2 = {ỹj2, x̃j2}, j = 1, . . . , N2, be two sets of data.

We assume that the data x̃i1, x̃j2, ỹi1 and ỹj2 are known to us, and that there is an error in measurement when obtaining the data, which means

\[
\begin{aligned}
\tilde{x}_{i1} &= x_{i1} + \xi_{i1}, \quad i = 1, 2, \ldots, N_1; &\qquad \tilde{x}_{j2} &= x_{j2} + \xi_{j2}, \quad j = 1, 2, \ldots, N_2; \\
\tilde{y}_{i1} &= y_{i1} + \varepsilon_{i1}, \quad i = 1, 2, \ldots, N_1; &\qquad \tilde{y}_{j2} &= y_{j2} + \varepsilon_{j2}, \quad j = 1, 2, \ldots, N_2,
\end{aligned}
\]

where ξi1 and ξj2, as well as εi1 and εj2, follow the same distribution for every i = 1, 2, . . . , N1 and j = 1, 2, . . . , N2. For the purpose of the problem, we will assume that we also know the true values of the data, xi1, xj2, yi1 and yj2.

We know the true relationship between the observed variables Y and X, which means that we know the true models that fit the data. Let D1 = {yi1, xi1}, i = 1, . . . , N1, and D2 = {yj2, xj2}, j = 1, . . . , N2. Depending on the problem, we will use either Simple Linear Regression or Polynomial Regression (with the OLS method) to create the models. We set the parameters of one of the two true models to be 5% smaller than the parameters of the other model.

Fitting models to the data sets D̃1 and D̃2, we will get two models y1(x) and y2(x), and we are interested in the quotient

\[ \frac{y_1(x)}{y_2(x)}. \tag{4.1} \]

Since the goal is to see how and whether the assumption about the distribution of the measurement error affects the quotient (4.1), we will repeat this process for different errors drawn from the same distribution and obtain a confidence interval for the quotient using the bootstrap method. We know the true quotient, the true ratio between the models, and we will use it to check whether it belongs to the confidence interval we obtain.

4.2 An Analogy With the Real World

We want to simulate the relationship between velocity and fuel efficiency in ships. True data always comes with measurement errors due to many factors, which is the reason why we add measurement error to our data.

The assumption that we know the true models can help us understand the relationship between velocity and fuel consumption. Suppose, for example, we want to see which of two engines is better and by how much (does it consume more or less fuel). This can be hard because of the measurement errors; the results might lead us in the wrong direction. By testing the quotient of the two models, assuming different errors, we can see whether the assumption of a particular error should affect our decision about which of the two engines consumes less, and how certain we can be in our decision.

4.3 Parameter Estimation

The velocity and fuel consumption data we use comes from ship log data gathered at Qtagg AB, from one ship, collected over roughly half a month.

Our real data sets D1 and D2 consist of velocity measured in knots and fuel efficiency measured in liters per hour. The plotted data for the real velocity (without measurement errors) is given in figure (4.1).

Figure 4.1: The data (velocity) without measurement errors

Our assumption of knowing the true models gives us insight into how the fuel efficiency data should look. We consider only three cases.

1. The true models are

\[ y_1(x) = \theta_{11} x, \qquad y_2(x) = \theta_{12} x, \tag{4.2} \]

where we choose θ11 ≈ 128.518, and we set the other parameter to be 5% bigger, θ12 ≈ 134.944. As we mentioned, we will always set the parameters of one model to be 5% bigger. Using the models, we can obtain the true data yi1 and yj2. The data is given in figure (4.2).

2. The true models are

\[ y_1(x) = \theta_{31} x^3, \qquad y_2(x) = \theta_{32} x^3, \tag{4.3} \]

where θ31 ≈ 0.5942 and θ32 ≈ 0.6239. The true data yi1 and yj2, using those models, are given in figure (4.3).

3. The true models are

\[ y_1(x) = \theta_{01} + \theta_{11} x + \theta_{21} x^2 + \theta_{31} x^3, \qquad y_2(x) = \theta_{02} + \theta_{12} x + \theta_{22} x^2 + \theta_{32} x^3, \tag{4.4} \]

where θ1 = [−3.04317, 126.6566, −13.4759, 0.9254]' and θ2 has coefficients 5% greater. The true data, using those models, are given in figure (4.4).

Figure 4.2: Case 1 - The data (fuel efficiency) without measurement errors

The data in all three cases are without any measurement errors. In the next sections, we add errors to these data.

When we add errors, we get new data from which we create the corresponding models. We estimate the parameters using the OLS method described earlier. Depending on the case, we estimate the parameters so as to get the same type of model as the true model. We describe how to do that below.

Assume that we have data D̃1 and D̃2. Using the results for the OLS method, we obtain estimates for each case as follows:

1. In this case we have only one parameter to estimate for each model. Let

\[
\tilde{\mathbf{X}}_i = \begin{bmatrix} \tilde{x}_{1i} \\ \tilde{x}_{2i} \\ \vdots \\ \tilde{x}_{N_i i} \end{bmatrix}, \qquad
\tilde{\mathbf{Y}}_i = \begin{bmatrix} \tilde{y}_{1i} \\ \tilde{y}_{2i} \\ \vdots \\ \tilde{y}_{N_i i} \end{bmatrix}, \qquad i = 1, 2.
\]

The OLS estimates are

\[ \hat{\theta}_{1i} = (\tilde{\mathbf{X}}_i'\tilde{\mathbf{X}}_i)^{-1}\tilde{\mathbf{X}}_i'\tilde{\mathbf{Y}}_i, \qquad i = 1, 2. \]

2. This case does not differ much from the first case. Using the OLS method and the results on polynomial regression, we get

\[ \hat{\theta}_{3i} = (\tilde{\mathbf{X}}_i'\tilde{\mathbf{X}}_i)^{-1}\tilde{\mathbf{X}}_i'\tilde{\mathbf{Y}}_i, \qquad i = 1, 2, \]

where

\[
\tilde{\mathbf{X}}_i = \begin{bmatrix} \tilde{x}_{1i}^3 \\ \tilde{x}_{2i}^3 \\ \vdots \\ \tilde{x}_{N_i i}^3 \end{bmatrix}, \qquad
\tilde{\mathbf{Y}}_i = \begin{bmatrix} \tilde{y}_{1i} \\ \tilde{y}_{2i} \\ \vdots \\ \tilde{y}_{N_i i} \end{bmatrix}, \qquad i = 1, 2.
\]

Figure 4.3: Case 2 - The data (fuel efficiency) without measurement errors

3. This case is classical polynomial regression; there is no need to describe it further.

Now that we know how to estimate the parameters, we will just state the results in the next sections.

4.4 Confidence Intervals

In order to obtain confidence intervals for the mean of the quotient, we need data. In the first and second cases, we actually need to obtain a confidence interval for the mean of the quotient of the coefficients. To simulate the data, we do the following:

• For the first and second cases we do the same. First, we add errors from the same distribution to the real data sets. From those two sets, we estimate two models as explained above. After the estimation, we save the quotient of the estimated parameters, because it is actually the quotient of the two models. In order to obtain more of the quotients, we repeat the process. Denote the data set of the quotients by Q.

• The third case is quite different. Here, the quotient is not just a simple ratio of two parameters. Of course, we first estimate the models using D̃1 and D̃2. After estimation, using the same input parameters x ∈ {xi1, xj2, i = 1, 2, . . . , N1, j = 1, 2, . . . , N2} in (4.1), we obtain a set of data S1. This set of data is not enough to see how the error affects the quotient. In the same way as we obtained the set S1, we repeat the process and obtain sets S1, S2, . . . , S9000 (we decided that 9000 sets should be enough). Now we compute the mean of every data set Si, i = 1, 2, . . . , 9000, and the collection of these means is the data set of the quotients Q.

Figure 4.4: Case 3 - The data (fuel efficiency) without measurement errors

In both cases, in order to obtain the confidence interval, we use the bootstrap method on the data set Q.
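To make the procedure above concrete, here is a minimal Python sketch of case 1: measurement errors are added to both data sets, both models of the form y = θx are refitted by OLS, the quotient of the estimated coefficients is stored in Q, and the studentized bootstrap of Section 3.5 is applied to the mean of Q. The thesis's own implementation is in MATLAB; the velocity grid, the number of repetitions and the helper names here are assumptions made for illustration.

```python
import numpy as np

def simulate_quotients(x, theta1=128.518, theta2=134.944, n_rep=500, seed=0):
    """Case 1 sketch: add measurement errors, refit y = theta*x for both data sets,
    and collect the quotient of the two estimated coefficients (the data set Q)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y1, y2 = theta1 * x, theta2 * x                # true (error-free) responses
    Q = np.empty(n_rep)
    for r in range(n_rep):
        # errors on both covariates and responses, here Uniform(-0.3, 1) as in Section 4.4
        x1t = x + rng.uniform(-0.3, 1.0, x.size)
        x2t = x + rng.uniform(-0.3, 1.0, x.size)
        y1t = y1 + rng.uniform(-0.3, 1.0, x.size)
        y2t = y2 + rng.uniform(-0.3, 1.0, x.size)
        th1 = (x1t @ y1t) / (x1t @ x1t)            # OLS through the origin
        th2 = (x2t @ y2t) / (x2t @ x2t)
        Q[r] = th1 / th2
    return Q

def bootstrap_mean_ci(Q, R=1999, alpha=0.025, seed=1):
    """Studentized bootstrap confidence interval for the mean of Q (Section 3.5)."""
    rng = np.random.default_rng(seed)
    Q = np.asarray(Q, dtype=float)
    N = Q.size
    s, V = Q.mean(), Q.var(ddof=1) / N
    idx = rng.integers(0, N, size=(R, N))
    Qs = Q[idx]
    z_star = np.sort((Qs.mean(axis=1) - s) / np.sqrt(Qs.var(axis=1, ddof=1) / N))
    lo, hi = int((R + 1) * alpha) - 1, int((R + 1) * (1 - alpha)) - 1
    return s - np.sqrt(V) * z_star[hi], s - np.sqrt(V) * z_star[lo]

x = np.linspace(8.0, 14.0, 200)                    # placeholder velocities in knots
Q = simulate_quotients(x)
print(bootstrap_mean_ci(Q))                        # compare with the true quotient 1/1.05 = 0.952381
```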

We want to see how the assumption that the measurement error has a particular distribution affects the quotient, i.e., to see how the confidence intervals of the mean behave.

For each assumption of the distribution, we will state the confidence intervals.

• Assuming the random variables ξi1, ξj2, εi1, εj2 have the Uniform Distribution on the interval (−0.3, 1) (see (B.2)),

ξi1, ξj2, εi1, εj2 ∼ U(−0.3, 1),

we get the following confidence intervals for the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get

[0.951974, 0.9519996]

2. when the true models are given in (4.3):

[0.9523961, 0.9523962]

3. and when the true models are given in (4.4):

[0.9136, 1.1509]

A sample of data taken from the Uniform Distribution is given in figure (4.5).

Figure 4.5: A sample of data taken from the Uniform Distribution

• Assuming the random variables ξi1, ξj2, εi1, εj2 have the Generalized Pareto Distribution with parameters ξ = 0.1, µ = 0.2 and σ = 0.2 (see (B.3)), we get the following confidence intervals for the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get

[0.951977, 0.951992]

2. when the true models are given in (4.3):

[0.950903, 0.950956]

3. and when the true models are given in (4.4):

[0.89138, 0.98558]

Figure 4.6: A sample of data taken from the Generalized Pareto Distribution

• Assuming the random variables ξi1, ξj2, εi1, εj2 have a Normal Distribution with parameters µ = 0, σ² = 0.4 (see (B.4)), we get the following confidence intervals of the mean of the quotient:

1. in the first case, when the true models are given in (4.2): [0.933671, 0.933703]
2. in the second case, when the true models are given in (4.3): [0.952035, 0.952140]
3. in the third case, when the true models are given in (4.4): [0.5428, 0.9387]


Figure 4.7: A sample of data taken from the Normal Distribution

• Assuming the random variables ξi1, ξj2, εi1, εj2 have a Log-Normal Distribution with parameters µ = 0, σ² = 0.1 (see (B.5)), we get the following confidence intervals of the mean of the quotient:

1. in the first case, when the true models are given in (4.2): [0.951259, 0.951266]
2. in the second case, when the true models are given in (4.3): [0.9489897, 0.949012]
3. in the third case, when the true models are given in (4.4): [0.9204, 0.9523]


Figure 4.8: A sample of data taken from the Log-normal Distribution

• Assuming the random variables ξi1, ξj2, εi1, εj2 have a Gamma Distribution with parameters α = 21, β = 0.02 (see (B.6)), we get the following confidence intervals of the mean of the quotient:

1. in the first case, when the true models are given in (4.2): [0.951975, 0.951981]
2. in the second case, when the true models are given in (4.3): [0.9509001, 0.950922]
3. in the third case, when the true models are given in (4.4): [0.9467, 0.9542]


Figure 4.9: A sample of data taken from the Gamma Distribution

• Assuming the random variables ξi1, ξj2, εi1, εj2 have a Student's t-distribution with degrees of freedom df = 15 (see (B.7)), we get the following confidence intervals of the mean of the quotient:

1. in the first case, when the true models are given in (4.2): [0.951709, 0.951783]
2. in the second case, when the true models are given in (4.3): [0.950322, 0.950608]
3. in the third case, when the true models are given in (4.4): [0.6981, 1.8737]


Figure 4.10: A sample of data taken from the Student’s t-distribution

• Assuming the random variables ξi1, ξj2, εi1, εj2 have a Chi-Square Distribution with degrees of freedom df = 0.8 (see (B.8)), we get the following confidence intervals of the mean of the quotient:

1. in the first case, when the true models are given in (4.2): [0.952203, 0.952241]
2. in the second case, when the true models are given in (4.3): [0.951385, 0.951726]
3. in the third case, when the true models are given in (4.4): [0.9467, 0.9542]


Figure 4.11: A sample of data taken from the Chi-Square Distribution
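For reference, the error samples used in the comparisons above could be drawn as in the following Python sketch (our illustration, assuming numpy/scipy parameterizations). The parameters are the ones stated in the bullets; in particular, whether β = 0.02 for the Gamma Distribution denotes a scale or a rate parameter is our assumption (scale is used here), and the sample size n is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000  # illustrative number of error samples

error_samples = {
    "Uniform(-0.3, 1)":     rng.uniform(-0.3, 1.0, size=n),
    "Generalized Pareto":   stats.genpareto.rvs(c=0.1, loc=0.2, scale=0.2,
                                                size=n, random_state=rng),
    "Normal(0, 0.4)":       rng.normal(loc=0.0, scale=np.sqrt(0.4), size=n),
    "Log-normal(0, 0.1)":   rng.lognormal(mean=0.0, sigma=np.sqrt(0.1), size=n),
    "Gamma(21, 0.02)":      rng.gamma(shape=21.0, scale=0.02, size=n),
    "Student's t (df=15)":  stats.t.rvs(df=15, size=n, random_state=rng),
    "Chi-Square (df=0.8)":  rng.chisquare(df=0.8, size=n),
}
```

Any of these arrays can be passed (via a suitable draw_error callback) to the simulation sketch given earlier.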

4.5 True Values of the Quotient

From (4.2) and (4.3) we know the true quotient of the models, which is
$$\frac{\theta_{11}}{\theta_{12}} = \frac{128.518}{139.944} = 0.952381, \qquad \frac{\theta_{31}}{\theta_{32}} = \frac{0.5942}{0.6239} = 0.952381,$$
which is the same in both cases, because we chose the parameters that way. Similarly, the true quotient of the model (4.4) is the same:

θtrue = 0.952381.

4.6 Evaluation of the Results


Distribution \ CI    | Case 1                 | Case 2                 | Case 3
---------------------|------------------------|------------------------|--------------------
Uniform              | [0.951974, 0.9519996]  | [0.9523961, 0.9523962] | [0.9136, 1.1509]
Generalized Pareto   | [0.951977, 0.951992]   | [0.950903, 0.950956]   | [0.89138, 0.98558]
Normal               | [0.933671, 0.933703]   | [0.952035, 0.952140]   | [0.5428, 0.9387]
Log-normal           | [0.951259, 0.951266]   | [0.9489897, 0.949012]  | [0.9204, 0.9523]
Gamma                | [0.951975, 0.951981]   | [0.9509001, 0.950922]  | [0.9467, 0.9542]
Student's t          | [0.951709, 0.951783]   | [0.950322, 0.950608]   | [0.6981, 1.8737]
Chi-Square           | [0.952203, 0.952241]   | [0.951385, 0.951726]   | [0.9467, 0.9542]

Table 4.1: Table of the confidence intervals for the mean of the quotient

The true value of the quotient of the models, for each of the three cases, is θtrue = 0.952381.
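As a quick sanity check for the discussion that follows, one can verify directly which of the reported intervals contain θtrue. The short Python sketch below is our illustration; the interval values are copied from Table 4.1 (only a subset of entries is shown).

```python
# Check whether selected reported intervals contain the true quotient.
theta_true = 0.952381

intervals = {
    ("Uniform", "Case 3"):     (0.9136, 1.1509),
    ("Normal", "Case 1"):      (0.933671, 0.933703),
    ("Gamma", "Case 3"):       (0.9467, 0.9542),
    ("Student's t", "Case 3"): (0.6981, 1.8737),
}

for key, (lo, hi) in intervals.items():
    contains = lo <= theta_true <= hi
    print(f"{key}: contains theta_true = {contains}")
```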

From the confidence intervals given in Table (4.1), we can notice a couple of things.

• In case 1, when we assume the true models are (4.2) and we fit the same type of models, the confidence intervals are narrow. This is good: we can be confident about how large the quotient will be.

The true value of the quotient is close to the limits of the confidence interval for almost all distributions we tested in case 1, except when we assume that the measurement error follows the Normal Distribution.

Conclusion: if we know that the true models are given by (4.2) (using hypothesis testing, for example), then, under the assumptions we made, no matter what the error distribution is, the quotient will be close to the true value.

In the real world, if we have two different engines and models (4.2) for the fuel efficiency of the engines, then no matter what the measurement errors are (as long as they come from the same distribution and follow our assumptions), the quotient of the estimated models will indicate which of the two engines consumes less fuel and by how much. If the quotient is greater than 1, the first engine consumes more; if the quotient is smaller than 1, the second engine consumes more.

In practice, this will almost never hold exactly: the measurement error distribution will likely not be the same for the fuel efficiency and the velocity.

• The second case is similar to the first one. The true value is still close to the confidence interval limits, which is again good. The same conclusion as above applies.

• Of our three cases, the third has the widest confidence intervals. This is to be expected, since it is the most general case. We notice that for distributions such as the Log-normal, Gamma, and Chi-Square, which are fairly similar, the confidence intervals are the narrowest and the true value of the quotient belongs to each of these intervals. More testing (with different distribution parameters) is needed to see whether we can draw a conclusion similar to the one in cases 1 and 2. Of course, we cannot be as confident as in the previous cases.

• The errors with distributions such as the Log-normal, Gamma, and Chi-Square have the smallest effect on the quotient, which is perhaps unexpected. It would be interesting to test further under the assumption that the measurement error distribution is one of those three. We could, for example, assume that the true models are given by (4.3) and, instead of models of the same type, fit polynomial regression models.


Chapter 5

Discussion

5.1 Conclusions

Estimating the unknown parameters in the Simple Regression model (1.2) and the Multiple Regression model (2.1) using the Ordinary Least Squares (OLS) method, explained in subsection (1.2.1) and in section (2.2.1) respectively, we created models which simulate the relationship between Fuel Efficiency and Velocity. The model (2.10) was used as a special case of the model (2.1).

The estimates are given in (1.6) for the Simple Regression model and in (2.4) for the Multiple Regression model. For Polynomial Regression, the special case of Multiple Regression, the estimates are the same as in (2.4), but the matrix X is given in (2.11).
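As a reminder of what these estimates look like computationally, the following Python sketch (our illustration, not code from the thesis) builds a polynomial design matrix and computes the least-squares solution. The degree, the column ordering, and the use of a numerical least-squares solver instead of forming (XᵀX)⁻¹Xᵀy explicitly are our choices.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree], one row per observation."""
    return np.vander(x, N=degree + 1, increasing=True)

def ols_estimate(X, y):
    """Least-squares solution of X beta = y, equivalent to the OLS estimate."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat
```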

The estimates can be obtained without any assumptions about the model. However, when the three assumptions in (1.1) (or the assumptions on page 17) hold, the estimates are the Best Linear Unbiased Estimates (BLUE) among all linear unbiased estimators; the Gauss-Markov theorem on page 19 states this. The theorem is important, and we gave the proof using matrix calculus. The properties of the OLS estimates are given in (1.2.2) and in (2.2.2).

The Estimator of the Variance and the Estimated Variances are given in (1.2.3) and in (2.2.3). The basics of Hypothesis Testing, Confidence Intervals, and the t-test were introduced in section (1.3).

In sections (1.5) and (2.3) we mention another way to estimate the parameters, the Maximum Likelihood Estimation (MLE) method. It turns out that the estimates we get using the MLE method are the same as the estimates we get using the OLS method.

In Chapter (3), we mention the Bootstrap and the bootstrap properties we need for our simulation. We introduce the Empirical Distribution Function (EDF) in (3.2), which is perhaps the most important ingredient of the bootstrap. The Bootstrap Estimates are given in (3.4). Our goal was to obtain Confidence Intervals for the parameter θ using the Bootstrap method. For that purpose, we introduced two basic Bootstrap methods for computing confidence intervals; the results are given in (3.5) and (3.6), and they are called the Basic Bootstrap Confidence Interval and the Studentized Bootstrap Confidence Interval, respectively.

Everything we mentioned so far is used in our simulation. We estimate two models and look at the quotient of the two estimated models. The goal of the simulation is to see how the measurement error affects this quotient.

