
School of Education, Culture and Communication

Division of Applied Mathematics

MASTER THESIS IN MATHEMATICS / APPLIED MATHEMATICS

Principal Component Modelling of Fuel Consumption of Seagoing Vessels and Optimising Fuel Consumption as a Mixed-Integer Problem

by

Jean-Paul André Ivan

Master's thesis in mathematics / applied mathematics

DIVISION OF APPLIED MATHEMATICS

MÄLARDALEN UNIVERSITY SE-721 23 VÄSTERÅS, SWEDEN


School of Education, Culture and Communication

Division of Applied Mathematics

Master thesis in mathematics / applied mathematics

Date:

2020-09-30

Project name:

Principal Component Modelling of Fuel Consumption of Seagoing Vessels and Optimising Fuel Consumption as a Mixed-Integer Problem

Author :

Jean-Paul André Ivan

Supervisors: Christopher Engström
Reviewer: Milica Rančić
Examiner: Doghonay Arjmand
Course Code: MMA692
Course Name: Degree Project in Mathematics

Comprising:


Abstract

The fuel consumption of a seagoing vessel is, through a combination of Box-Cox transforms and principal component analysis, reduced to a univariate function of the primary principal component with mean model error −3.2 % and error standard deviation 10.3 %.

In the process, a Latin-hypercube-inspired space partitioning sampling technique is developed and successfully used to produce a representative sample used in determining the regression coefficients.

Finally, a formal optimisation problem for minimising the fuel use is described. The problem is derived from a parametrised expression for the fuel consumption, and has only 3, or 2 if simplified, free variables at each timestep.

Some information has been redacted in order to comply with NDA restrictions. Most redactions are names (of vessels or otherwise), units, and in some cases (especially on figures) quantities.

Keywords

Fuel consumption modelling, dimensionality reduction, PCA, Box-Cox transform, recursive space partitioning, parametrisation


Acknowledgements

I would like to thank Q-Tagg and their staff, who spent many hours serving as the primary point of contact between my work and the activities of the company. I thank Christopher Engström for his patient and trusting supervision and Karl Lundengård for his tireless work for both the programme and his students.

I would also like to thank Heidi Marais, whose support from the beginning of my degree never wavered. Without her constant love and encouragement this thesis may never have been finished.

I thank my parents for their support and curiosity, and my father for teaching me mathematics and science.

I thank all my friends, in Sweden, South Africa, and across the world for the multitudinous support they offered.

Finally, I would like to thank The University of Pretoria in general, Dr. Lauber Martins, Dr. Helen Inglis, Dr. Daniel Wilke, Carl Sandrock, and Prof. Phillip de Vaal in particular for teaching me so much more than just their course material.


Contents

List of Figures
List of Tables

1 Overview
  1.1 Existing Work
    1.1.1 Hellström Procedure
    1.1.2 Other Work
  1.2 Summary of Approach
  1.3 Scope
  1.4 Comments on Notation

2 Mathematical Preliminaries
  2.1 Review of Linear Regression
    2.1.1 Simple Linear Regression
    2.1.2 Multiple Linear Regression
    2.1.3 Multivariate Linear Regression
  2.2 Data Preprocessing: Filtering
    2.2.1 The Hampel Filter
    2.2.2 The Savitzky-Golay Filter
  2.3 Fitting and Re-Expression
    2.3.1 Tukey Transforms
    2.3.2 Box-Cox Transforms
      2.3.2.1 Univariate Transformations
      2.3.2.2 Multivariate Transformations
    2.3.3 Validating Transformations
  2.4 Assumptions Applied to Regressions and Transformed Variables

3 Principal Component Analysis and Principal Component Regression for Modelling Fuel Consumption
  3.1 Literature Review and Background
    3.1.1 Deriving PCs as a Maximal Variance Basis
    3.1.2 PCs as a Maximal Correlation Basis
    3.1.3 Sample PCs
    3.1.4 The Singular Value Decomposition
      3.1.4.1 The Eigendecomposition of a Matrix
      3.1.4.2 Derivation of the SVD
      3.1.4.3 Geometric Interpretation of the SVD
    3.1.5 Meaning and Interpretation of PCs
    3.1.6 Selection of PCs
      3.1.6.1 Ad-Hoc Methods
        3.1.6.1.1 Total Explained Variance
        3.1.6.1.2 Eigenvalue Magnitude - Guttman-Kaiser Criterion
        3.1.6.1.3 Eigenvalue Magnitude - Broken Stick Rule
        3.1.6.1.4 Scree Plots
      3.1.6.2 Cross-Validation and Partial Correlation
        3.1.6.2.1 Cross-Validation
        3.1.6.2.2 Approximation of Cross Validation
        3.1.6.2.3 Partial Correlation
      3.1.6.3 Hypothesis Testing Procedures
        3.1.6.3.1 Bartlett's Test
        3.1.6.3.2 Bentler-Yuan Linearity Test
    3.1.7 Regression Using PCs
      3.1.7.1 Derivation of PCR
      3.1.7.2 Advantages of PCR
  3.2 Methodology
    3.2.1 Preprocessing and Filtering
      3.2.1.1 Algorithmic Filtering
      3.2.1.2 Filter Parameters
      3.2.1.3 Heuristic Filtering
      3.2.1.4 Remaining Preprocessing
    3.2.2 Coarse Variable Selection
    3.2.3 Individual Engine Performance and Three-Engine Twin-state Asymmetry
    3.2.4 Individual Engine Performance and Master-Slave Asymmetry
    3.2.5 Grouping Engines
    3.2.6 Preliminary Collinearity Elimination
    3.2.7 Variable Elimination due to Practical Considerations
    3.2.8 Sampling Methods
      3.2.8.1 Random Sampling
      3.2.8.2 Latin Hypercube Sampling
      3.2.8.4 LHS-Like Algorithm
      3.2.8.5 More Sampling Methods: Trip Subsections
    3.2.9 Linear Regression
      3.2.9.1 Linear Regression on Unmodified Data
      3.2.9.2 Linear Regression After Marginal Box-Cox
      3.2.9.3 PCR
    3.2.10 Model Comparison
  3.3 Findings
    3.3.1 Fuel Consumption Estimation
    3.3.2 Full Model Compared to m Retained Components

4 A Mixed-Integer Optimisation Problem for Minimising Fuel Consumption
  4.1 Deriving Fuel Consumption as an Integral Over z1
    4.1.1 Sailing as Movement in a High-Dimensional Space
    4.1.2 Sailing as a Curve
    4.1.3 An Expression for Fuel Consumption from z1
  4.2 Formal Problem Statement
  4.3 Discretisation and Simplification
    4.3.1 Derivatives of r(t)
    4.3.2 The Objective Function
    4.3.3 Equality Constraints
    4.3.4 Inequality Constraints
    4.3.5 Discretised Problem Statement
    4.3.6 Further Simplification

5 Conclusions and Future Work
  5.1 Conclusion
  5.2 Further Work

References

Appendices
  A.1 Additional Figures - Fuel Consumption and Retained PCs
  A.2 Some PCs from Early Work
  A.3 Thesis Requirements


List of Figures

2.1 Example of power transform validation by Riani.
3.1 An example of a scree plot for data with 8 variables.
3.2 Filters applied to measurements for water depth below keel for one vessel trip.
3.3 Filters applied to measurements for water depth below keel for one vessel trip.
3.4 Example of a Latin hypercube sampling with missing data.
3.5 Example data which presents difficulties w.r.t. obtaining a representative sample.
3.6 Example data with divisions produced by the LHS-like algorithm.
3.7 Sampling by taking one point from each quadrant.
3.8 Sampling using depth-modified sampling rate.
3.9 Sampling using depth-modified sample rate on a dataset with N = 161, n_rsr = 32, and n_max = 12.
3.10 Trip 11, 4-component fuel consumption model, 1-2.
3.11 Trip 13, 4-component fuel consumption model, 1-2 and 2-2.
3.12 Trip 20, 4-component fuel consumption model, 1-2 and 2-2.
3.13 Trip 21, 4-component fuel consumption model, 1-1, 1-2, and 2-2.
3.14 Trip 33, 4-component fuel consumption model, 1-1, 1-2, and 2-2.
3.15 Trip 38, 4-component fuel consumption model, mostly 2-2.
3.16 Trip 40, 4-component fuel consumption model, 1-1, 1-2, and 2-2 with frequent switching.
3.17 Trip 42, 4-component fuel consumption model, 1-1 and 1-2 with instability.
3.18 Trip 47, 4-component fuel consumption model, 1-1, 1-2, and 2-2 with frequent switching.
3.19 Trip 53, 4-component fuel consumption model, 1-1.
3.20 Trip 57, 4-component fuel consumption model, 1-1 and 2-2 with sharp deceleration.
3.21 Trip 65, 4-component fuel consumption model, 1-1 and 1-2 with instability.
3.22 Trip 69, 4-component fuel consumption model, 1-1 and 2-2, worst performance among all trips.
3.23 Trip 11, fuel consumption prediction comparison with different PCs retained.
3.24 Trip 65, fuel consumption prediction comparison with different PCs retained, showing instability removal.
4.1 Example of simplification to optimiser by allowing only shifting of twinstate change.
A.1 Trip 13, fuel consumption prediction comparison with different PCs retained, showing maintained poor performance.
A.2 Trip 20, fuel consumption prediction comparison with different PCs retained, showing the same poor performance across models.
A.3 Trip 21, fuel consumption prediction comparison with different PCs retained, performance in different regions maintained.
A.4 Trip 33, fuel consumption prediction comparison with different PCs retained, excellent performance across models in unusual circumstances.
A.5 Trip 38, fuel consumption prediction comparison with different PCs retained, showing the same poor performance across models.
A.6 Trip 40, fuel consumption prediction comparison with different PCs retained, acceptable performance across models.
A.7 Trip 42, fuel consumption prediction comparison with different PCs retained, excellent performance across models with stability improvements.
A.8 Trip 47, fuel consumption prediction comparison with different PCs retained, acceptable performance across models.
A.9 Trip 53, fuel consumption prediction comparison with different PCs retained, excellent performance throughout entire trip.
A.10 Trip 57, fuel consumption prediction comparison with different PCs retained, acceptable performance across models.
A.11 Trip 69, fuel consumption prediction comparison with different PCs retained.

List of Tables

1.1 Notation guide, examples.
2.1 Revised symbols for linear models.
2.2 Dimensions of revised models.
3.1 Jolliffe's recommended method of simplifying the interpretation of PCs.
3.2 Upper and lower bounds for heuristic filters.
3.3 Coarse variable selection - twinstate asymmetry.
3.4 Coarse variable selection - master/slave asymmetry.
3.5 Coarse variable selection - engine variables before collinearity elimination.
3.6 Coarse variable selection - variables remaining after selection.
3.7 Data subset selection methods.
3.8 PC interpretation table for 4 component models.
3.9 Bounds for PC interpretation; Table 3.8.
3.10 Comparison of performance of four-, three-, two-, and single-component PCR.
A.1 PC interpretation table prior to collinearity elimination for the first six PCs for sailing in the 1-1 twinstate.
A.2 Bounds for PC interpretation; Table A.1.
A.3 Variable reference for old PCA.


Chapter 1

Overview

Seagoing vessels operate in complex environments, but the dynamic effects they experience are often very slow and can often be mitigated to some extent by control systems on-board the vessel. These slow dynamic effects mean that it may be feasible to attempt to model the fuel consumption of such a vessel using regressive techniques for independent data, forgoing any time-series analysis in favour of these simpler methods. Such a model could then be used to attempt to minimise fuel use during a vessel's journey.

Simplification through PCA, with an awareness of the desirable variables which an ideal solution would include in the resulting optimisation procedure, was the principal idea kept in mind during the review of the existing methodology and the body of literature.

1.1 Existing Work

1.1.1 Hellström Procedure

The approach taken to the modelling was somewhat divorced from previous work. The existing models used at Q-Tagg, the company in co-production with which this work was performed, are based on the work in [22].

In this work Hellström describes fuel consumption as a function of a vessel's speed through water, denoted F(v_s), which is modified by a depth factor, D(v_g, d) (a function of speed over ground and depth), and a wind factor, W(v_w, ϕ_w) (a function of wind speed and wind direction).¹ The relation given is

    f = F(v_s)(1 + D(v_g, d))(1 + W(v_w, ϕ_w)) ,

giving the fuel consumption rate f in L h⁻¹. The total fuel consumption is then, using the same notation that we use in Chapter 4, given as a sum of the fuel consumption over the subsections of the trip,

    F_{tot.} = \sum_{i=1}^{N-1} \frac{‖x_{i+1} − x_i‖}{v_g} f_i ,

where ‖x_{i+1} − x_i‖ / v_g gives the time required to complete subsection i of the trip at speed over ground v_g.

Footnote 1: In [22] the function W(v_w, ϕ_w) depends on wind conditions measured by the Beaufort scale. The Beaufort scale roughly corresponds to well-defined wind speed intervals [37, pg. 41] but is a subjective measure based on conditions at sea [37, pg. 40]. In this work we use directly measured wind speed throughout and have made small changes to Hellström's procedure to exclude any subjective measurements of ocean conditions.
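As a concrete reading of the two expressions above, the following Python sketch accumulates the fuel used over the subsections of a fixed route. It is not code from [22]; the callables F, D, and W and all argument names are placeholders for the model components described there.

```python
import numpy as np

def hellstrom_total_fuel(x, v_s, v_g, d, v_w, phi_w, F, D, W):
    """Sketch of the total fuel consumption in Hellström's model.

    x          : (N, 2) array of waypoints along the route
    v_s, v_g   : speed through water / over ground on each subsection (length N-1)
    d          : water depth on each subsection
    v_w, phi_w : wind speed and direction on each subsection
    F, D, W    : callables for the base consumption, depth factor, and wind
                 factor; their exact forms are given in [22], not reproduced here.
    """
    total = 0.0
    for i in range(len(x) - 1):
        f_i = F(v_s[i]) * (1 + D(v_g[i], d[i])) * (1 + W(v_w[i], phi_w[i]))
        time_i = np.linalg.norm(x[i + 1] - x[i]) / v_g[i]   # time spent on subsection i
        total += time_i * f_i                                # fuel used on subsection i
    return total
```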

This is all built on several assumptions which were incompatible with several desirable properties for a model extensible in the fashion required. The final result of the work is a reduction of the fuel minimisation problem to a pair of root-finding problems. The latter of the pair actually represents a set of root-finding problems: one for each subsection of the trip. Stating the problem in this way means that a method looking to optimise engine configurations as well will have to solve this pair of root-finding problems for each configuration being evaluated. This exponentially increases the number of root-finding problems as a function of route length.

The construction of this method is also such that it is hard to include desiderata such as the principal issue of this work, engine configuration, or other potentially interesting variables such as allowing the route itself to be a target of optimisation.

1.1.2 Other Work

In [52] we saw a similar idea to that which this work aims to produce. A two-factor model-predictive control (MPC) strategy is presented, accounting for both dynamic and static effects, where static effects such as wind are evaluated along the vessel's route in a time-dependent fashion. In this work we do not necessarily wish to feature dynamic effects and explicitly separate our solution from the control problem; hoping to pose the problem as a large-scale optimisation problem and allow the control systems to manage small-scale dynamic effects freely. This is obviously not the 'best' method in terms of expected results, but permits the problem to be simple enough to investigate whether dimensionality reduction via PCA may be a viable strategy for more complex approaches.

In [14] a method that discretises the optimisation problem as a directed acyclic path is developed, with several components depending on work in [23], which is again based on some results in [11]. The net constraint imposed by this series of dependencies is that the analysis that is carried through requires a fixed path and convex cost functions, as in the original work in [11]. As mentioned before, the preference would be to develop models that could be used for a method including variable routes. It is also difficult to guarantee from the outset that a realistic cost function (fuel consumption) will be convex.

This trend towards convex cost functions and fixed routes pervades the literature; see for example [21] and [15] in addition to the previously mentioned set. In [55], however, a routing method is described which optimises based on weather conditions. The authors reference work in the same vein, and present a dynamic programming formulation and approach to the problem. In this work we see that routing with complex, analytic cost functions is possible, but computationally expensive. The inclusion of engine configuration in the problem, a variable we had trouble finding relevant literature on, ensures that the problem will already be computationally difficult through the introduction of an integer variable. However, the potential to make route variation continuous, rather than discrete, offers a potential means to simplify the problem from another angle.

We thus proceeded with the following in mind:

1. If the complex analytical models of [55] could be replaced by regressive, linear models, the computational aspects of the problem may be simplified greatly. The regressive, linear models might be simplified further by PCA.

2. Correct formulation of the routing problem may permit an acceptable approximation of the mixed-integer problem to be continuous, and perhaps even convex.

1.2 Summary of Approach

An almost immediate challenge in this investigation was the sheer volume of data on the behaviour of the vessels we investigated. The database on which we sought to base our regressive models consisted of nearly 80 variables recorded at a frequency of 4 Hz over a period of several months. Although all but about 20 to 25 of the original variables in the dataset were immediately discarded, the remaining database was still inconveniently large and difficult to manage. This was not helped by the addition of several derived variables into the dataset.

After this initial investigation, we then paused to consider what might help in making the task of modelling and optimising fuel consumption simpler. Perhaps the optimisation of the fuel consumption could be approximated well with a convex problem? That would help tremendously. From this somewhat dubious assumption, working backwards from a rough sketch of what such an approximation might look like, the following was determined:

1. Even if the problem could be made convex, the dimensionality of the dataset would have to be reduced if there was to be any hope of the problem being at all tractable.

2. The problem had a fundamentally discrete component: how many engines a vessel was using at any given time.

3. If the problem, or some sub-problem, was to be made convex it would be easiest to do so if the fuel consumption model could be linear.

In order to address the first and last points, we decided to investigate principal component analysis (PCA). PCA was among the first of the multivariate analysis techniques to be developed [28, pg. ix], and is primarily a technique of dimensionality reduction. The principal components (PCs) produced by a PCA are mutually orthogonal basis vectors in the parent space, ordered by the amount of variance they represent within the dataset they are derived from.

We hoped to perform a PCA, reduce the dimensionality of the dataset, and then create a linear regressor from the PCs; a technique known as principal component regression (PCR). Initial results were poor, and a lot of review on how the PCs were being calculated and selected was performed in the hope that the problem lay there (this forms a large part of Chapter 3). When this approach did not bear fruit, greater attention was paid to the filtering and transformation of the data, which forms a part of the much smaller body of theory collected in Chapter 2. A great deal of improvement came of this avenue of investigation, and it also revealed that the approach of random sampling of the dataset for our regressions was insufficiently discerning. Investigating different sampling techniques became the next and last step needed to produce a good linear model. The exact sampling technique used was a bad version of binary space partitioning, followed by a simple way of sampling from this partitioning.²

In order to deal with the discrete nature of the contribution of the engine configuration to the problem we considered several means of creating a continuous measure that could be used as a proxy for the engine configuration. However, none so far have been successful. As such, the optimisation problem stated in this work remains a mixed-integer program.

The statement of the problem that was eventually selected involves parametrisation of the vessel's journey as a function of time. The style of parametrisation results in a problem that is quite complex, but allows for changes in engine configuration and continuous alteration of the vessel's route, and can be re-formulated for the optimisation of very long voyages where the Earth's curvature becomes relevant.

The statement of the optimisation problem is in some sense incomplete; there is still room for simplification, and possible reformulation of the parametrisation in order to make the continuous part of the problem convex. Making the non-discrete part of the problem convex is still a feasible hope, since the goal of producing a linear model (and thus a potentially affine objective function) succeeded.

The details of the problem still require significant work before the formulation presented can be compared to that in [55], but the present work may reveal a promising idea since the form of the optimisation problem suggests an acceptable approximation which is, in contrast to all the literature that we could find, purely continuous rather than remaining a mixed-integer problem.

Footnote 2: While trying to find a suitable sampling method, and having some difficulty in doing so, we developed our own unrefined take on binary space partitioning. We discovered the class of binary space partitioning algorithms only after a good deal of investigation into this matter (see §3.2.8). With regards to the manner of sampling from the partition: we suspect there may be a name for the technique that we are unaware of, but it may be the case that it is a minor novelty.

1.3 Scope

This work covers the following:

1. Brief description of linear regression.

2. Description of power transforms, particularly the Box-Cox transform, and some underlying theory.

3. Description and derivation of principal component analysis.

4. A description of practical techniques of preprocessing and sampling used in preparing the data for linear estimation.

5. A comparison of the performance of the following techniques of predicting the fuel consumption of a seagoing vessel:

(a) Linear regression.

(b) Linear regression after a marginal Box-Cox power transformation.

(c) A four-component PCR after a marginal Box-Cox power transformation.

(d) PCR using the same loadings as before, but with three, two, and one component retained.

6. Derivation of a formal optimisation problem for the fuel consumption of a seagoing vessel using a set of single-component PCRs as the consumption model.

7. Discretisation and simplification of the optimisation problem.

Additionally, any points where a different procedure could be considered, or where decisions were made that might have benefited from more detailed deliberation are noted and described in §5.2.

1.4 Comments on Notation

We attempt to remain consistent with the meaning attached to common indexing variables, aside from where being too inflexible would create more confusion than it resolves. For example, the index i usually indexes the ith observation on a variable, vector, or matrix. For example, x_i, y_i are the ith observations on a variable x and a vector y.

The index j usually indexes the jth variable in a vector or matrix. For example: for Y, an (N × p) matrix of N observations on p variables, y_{ij} is the ith observation on the jth variable, y_i is a vector of the ith observation on all p variables, and y_j is a vector of all N observations on the jth variable.

We avoid having the subscript as the only means of identifying the contents of a variable, and context should always be sufficient to identify what a subscript is indexing. For example: in (3.21), with some simplification to avoid unnecessary complications here, we have

    ¯x_j = \frac{1}{N} \sum_{i=1}^{N} x_i ,    (1.1)

from which it should be clear that ¯x_j is a (p × 1) vector containing the mean of each of the p variables over N samples.

The index k usually indexes the kth largest eigenvalue of a matrix, or a value associated with it. For example: α_k could be an eigenvector associated with the kth largest eigenvalue of some matrix.

The idea of keeping the meaning of a particular subscript mostly consistent is adopted from [27], which served as a notational inspiration and where consistent subscripts were often found to be useful for understanding relations quickly.

We also take inspiration from [27] in our choice of vector representation. With few or no exceptions³ lowercase letters are a scalar, boldface letters are a vector, and uppercase letters are a matrix. Indexes do not affect whether a symbol is a scalar, vector, or matrix, but often show relationships between lowercase, uppercase, and boldface symbols.

Footnote 3: One notable exception occurs in (3.13) where we felt that the uppercase ς was not sufficiently different from the lowercase ς, so we opted to use a ς that is both boldface and uppercase to denote the singular values of a matrix.

Another feature of the notation in [27] is that attaching a subscript to derived variables gives information about what set of variables was used to produce the derived variable. For example, the mean of each observation is denoted:

    ¯x_i = \frac{1}{p} \sum_{j=1}^{p} x_j .

Notice that the barred variable is in boldface; this is intentional and is intended to help identify that ¯x_i is an (N × 1) vector: one element for each observation in X. This may be useful if X contains data which all have the same units, such as the speeds of several engines. For comparison, a vector containing the mean of every variable over all observations is denoted ¯x_j as shown in (1.1).

Again, an index is never left as the only means of identifying the meaning of a symbol; but we hope that in adopting this notation we speed up comprehension.

The mean of an indexed vector is denoted by an overline: ¯x_i; compare this to a vector of means: ¯x_i (set in boldface). A short reference on how to interpret indexing subscripts on matrices and vectors is presented in Table 1.1.

Table 1.1: Table of examples of notation, showing interactions between subscripts and text cases/boldface.

  Symbol            Meaning
  X                 A matrix.
  x_i (boldface)    A vector formed from X, usually of i observations (row-wise).
  x_j (boldface)    A vector formed from X, usually of j variables (column-wise).
  x_{ij}            A variable from X (usually row i, column j).
  ¯x_i (boldface)   A mean taken over the feature indexed by i.
  ¯x_i              The mean of the vector x_i.

A carat, or 'hat', denotes a predicted or approximated variable. For example: in a linear regression y = xβ + ε where ε is an error term, we might say that ŷ = xβ is the prediction of y.

We will also do our best to follow ideas expressed in [47], with special attention to points (3) and (4)⁴: we will attempt to choose our notation such that 'unnatural' or 'undesirable' expressions are difficult to express or, at the very least, cumbersome.

Footnote 4: Desiderata 3, "(Preservation of quality, I) Every 'natural' concept in [a mathematical field] should be easily expressible using the notation."
Desiderata 4, "(Preservation of quality, II) Every 'unnatural' concept in [a mathematical field] should be difficult to express using the notation. (In particular, it is possible for a notational system to be too expressive to be suitable for a given application domain.) Contrapositively, expressions that look clean and natural in the notation system ought to correspond to natural objects or concepts in [the mathematical field]."


Chapter 2

Mathematical Preliminaries

2.1 Review of Linear Regression

Here we discuss the methods we will use to perform regression on our datasets, beginning with a quick overview of simple and multiple linear regression before moving to the main topic of concern: general/multivariate linear models.

2.1.1 Simple Linear Regression

Simple linear regressions are models of a scalar variable y which is assumed (see §2.4) to be linearly related to a scalar explanatory variable x. The simple linear model is often, for example in [10, pg. 340], represented as:

    y = α + xβ + ε ,    (2.1)

where α and β are model parameters, x is an (N × 1) vector of observations on the predictor variable, y is an (N × 1) vector of predicted values of y for each observation, and ε is an (N × 1) vector of error terms. Essentially, we assume that any deviation from the linear relation ŷ = α + xβ is due to some random error, which is usually assumed to be normally distributed and have zero mean (see §2.4). We will discuss typical assumptions of simple linear regression models in §2.4.

When minimising the average squared sum of the error terms¹, or in this case the square of the l₂-norm of ε, one way to derive the model parameters is by noting that the minimum error occurs at a stationary point of the error in terms of the model parameters. So, assuming we have N observations on the predictor and dependent variables, we have the error given by

    \frac{1}{N} ε^T ε = \frac{1}{N}(y − α − xβ)^T (y − α − xβ)
                      = \frac{1}{N}(y^T y − 2α 1^T y − 2β x^T y + α² 1^T 1 + 2αβ 1^T x + β² x^T x) ;    (2.2)

the partial derivative with respect to α is then

    \frac{∂}{∂α}\left(\frac{1}{N} ε^T ε\right) = −\frac{2}{N}(y − α − xβ)^T 1 .

For α, noticing that \frac{1}{N} y^T 1 = ¯y, \frac{1}{N} x^T 1 = ¯x, and \frac{1}{N} α 1^T 1 = α, and letting ∂(ε^T ε)/∂α = 0, we obtain²

    α = ¯y − ¯x β .    (2.3)

From (2.3) we see that if our data has zero mean we will conveniently have α = 0. Later, almost all of our data will be normalised such that it has zero mean and unit variance, and we can easily apply a centring normalisation to any dataset by working with y − ¯y in place of the original data, so we will continue with the assumption that our data has zero mean. Expression (2.2) then becomes

    \frac{1}{N} ε^T ε = \frac{1}{N}(y^T y − 2β x^T y + β² x^T x)

and the partial derivative with respect to β is now

    \frac{∂}{∂β}\left(\frac{1}{N} ε^T ε\right) = −\frac{2}{N} x^T (y − xβ) .

From setting ∂(ε^T ε)/∂β = 0 we now have x^T (y − xβ) = 0. Solving for β we obtain the standard equation for the simple linear regression coefficient,

    β = (x^T x)^{−1} x^T y .

So, (2.1) becomes:

    y = xβ + ε ,
    ŷ = xβ ,    (2.4)

where ŷ is a vector of values of y predicted by the model for any vector of predictor variables x.

Footnote 2: The vectors of ones, denoted 1, appear as we understand that scalar-vector addition adds the scalar to every element of the vector.
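A minimal numerical illustration of (2.3) and (2.4), on synthetic data with arbitrary coefficients and noise level: centring the data makes the intercept vanish, and the slope reduces to the scalar form of (x^T x)^{−1} x^T y.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(scale=0.3, size=100)   # synthetic data, true slope 2.5

x_c, y_c = x - x.mean(), y - y.mean()           # centring makes alpha = 0, as in (2.3)
beta = (x_c @ y_c) / (x_c @ x_c)                # scalar form of (x^T x)^{-1} x^T y
y_hat = x_c * beta                              # predictions for the centred response
```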

2.1.2 Multiple Linear Regression

Multiple linear regression models extend simple linear regression models by allowing the dependent (scalar) variable y to be a function of a (p_x × 1) vector of explanatory variables x [10, pg. 389]. If for each observation on y we place the corresponding x in a row of an (N × p_x) matrix X we have

    y = α + Xβ + e .

The problem, however, is typically expressed as

    y = Xβ + e ,    (2.5)

where X has a leading column of ones, serving to include α within the X matrix,

    X = [ 1  x_{1,1}   ...  x_{1,p_x}
          ⋮     ⋮      ⋱       ⋮
          1  x_{N,1}   ...  x_{N,p_x} ] ,

and in this case it is convenient to redefine p_x such that the new X is still (N × p_x) in order to avoid having to always keep track of an extra −1.

(N × px) in order to avoid having to always keep track of an extra −1.

Considering the mean squared error once again, we obtain a familiar result,

    \frac{1}{N} ε^T ε = \frac{1}{N}(y − Xβ)^T (y − Xβ) = \frac{1}{N}(y^T y − 2β^T X^T y + β^T X^T X β) ,

and the partial derivative in the simple case is replaced with the Jacobian with respect to β:

    ∇_β \left(\frac{1}{N} ε^T ε\right) = \frac{1}{N}(−2 X^T y + 2 X^T X β) .

The minimum of the mean squared error obtained from solving this equation occurs at β = (X^T X)^{−1} X^T y. This is analogous to the solution for a simple linear regression, but adds the constraint that X^T X be invertible. X having full rank ensures that X^T X is positive definite, and thus invertible. This is equivalent to ensuring that there are no perfect collinearities in any of the predictor variables [10, pg. 403].

If we discard the error term as before, the multiple linear regression equation, once again very similar to the simple case, becomes

    ŷ = Xβ .    (2.6)
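The multiple-regression estimate can be sketched in the same way; the data are synthetic and the leading column of ones plays the role of the intercept as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p_x = 200, 3
X_raw = rng.normal(size=(N, p_x))
y = 1.0 + X_raw @ np.array([0.5, -2.0, 3.0]) + rng.normal(scale=0.1, size=N)

X = np.hstack([np.ones((N, 1)), X_raw])       # prepend the column of ones
beta = np.linalg.solve(X.T @ X, X.T @ y)      # requires X^T X to be invertible
y_hat = X @ beta                              # fitted values, as in (2.6)
```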

2.1.3 Multivariate Linear Regression

Strictly speaking, there are two types of multivariate regression models: multivariate simple and multivariate multiple. Multivariate simple models have more than one dependent variable and a single predictor variable, while multivariate multiple models are the full generalisation of all the preceding types of linear models; hence their other common name: general linear models³.

From this point on the phrase “multivariate linear regression/model” refers implicitly to multivariate multiple models or regressions.

The general form of a multivariate linear model should come as no surprise:

    Y* = A* + X*B* + E* ,

with Y* now an (N_Y* × p_y) matrix of p_y dependent variables over N_Y* observations, A* now (N_Y* × p_y), X* remaining (N_Y* × p_x), B* now (p_x × p_y), and E* now (N_Y* × p_y).⁴ As in the multiple regression case in (2.5) we can absorb the intercepts in A* into X*B* by adding another row of model parameters to B* and a column of ones to X*, and as before it is best to redefine p_x to avoid some extra bookkeeping, and then

    Y* = X*B* + E* .

This is quite cumbersome, however. The solution offered by [35, pg. 173] is to reduce the problem back to a multiple linear regression problem by defining

    y = [y*_{i,1}^T , . . . , y*_{i,p_y}^T]^T ,    X = I_{p_y} ⊗ X* ,    ε = [ε*_{i,1}^T , . . . , ε*_{i,p_y}^T]^T ,

where y*_{i,n} is an (N_Y* × 1) vector taken from the nth column of Y*, and ε*_{i,n} is an (N_Y* × 1) vector taken from the nth column of E*. We then have β*_{i,n}, of size (p_x × 1), similarly defined, with

    β = [β*_{i,1}^T , . . . , β*_{i,p_y}^T]^T ,

and the system y = Xβ + ε is now a multiple linear model with the same parameters encoded in β and ε as in the general linear model Y* = X*B* + E*.

Footnote 3: Not to be confused with generalised linear models, which remove some assumptions we will mention later (§2.4).

Footnote 4: We introduce Y*, A*, X*, B*, E*, N_Y* since we will reduce the problem in a moment and wish to reserve the 'usual' symbols for the reduced version, as well as to make it clear that this is not a desirable representation for the problem.

In light of this reduction, which effectively changes the problem from one with p_y response variables to a problem with a single response variable, we will restate and clean up our notation. We do this as some symbols are preserved from this point through the entirety of the present work, since several concepts are common to many of the techniques discussed. Table 2.1 gives these changes. Table 2.2 lists the sizes of the most relevant matrices under these changes for convenience. What is given in these tables applies to the expressions in §2.1.2 with p_y = 1.

Table 2.1: Table of revised symbols used in this section.

                                    Y* = X*B* + E*      y = Xβ + ε
  Number of observations            N_Y*                N = p_y N_Y*
  Number of response variables      p_y                 1
  Number of predictor variables     p_x                 p = p_x p_y

Table 2.2: Table of dimensions of variables using revised symbols.

  Variable     Dimensions
  y            N × 1
  X            N × p
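A small numerical sketch of this reduction (sizes and data are illustrative only): stacking the columns of Y* and forming X = I_{p_y} ⊗ X* recovers the columns of B* from a single multiple regression.

```python
import numpy as np

# Illustrative sizes: N_Y* = 4 observations, p_x = 2 predictors, p_y = 3 responses.
rng = np.random.default_rng(0)
N_Y, p_x, p_y = 4, 2, 3
X_star = rng.normal(size=(N_Y, p_x))
B_star = rng.normal(size=(p_x, p_y))
Y_star = X_star @ B_star                 # noiseless general linear model

y = Y_star.T.reshape(-1, 1)              # stack the columns of Y* into one vector
X = np.kron(np.eye(p_y), X_star)         # X = I_{p_y} (Kronecker product) X*
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# beta stacks the columns of B*, so reshaping recovers the original coefficients.
assert np.allclose(beta.reshape(p_y, p_x).T, B_star)
```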

2.2 Data Preprocessing: Filtering

The data for this work in particular contained a substantial amount of noise, and no single signal processing technique could resolve all the largest issues encountered: extremely large impulsive noise, sensor failures, and consistent small-scale noise. The best combination found was a combination of the Hampel filter and the Savitzky-Golay filter; see §3.2.1 for details. Here we describe the two filters and some of the theory involved. The description is by no means exhaustive, with a great deal of generalisations, extensions, and details on parameter selection remaining unmentioned. The omissions, however, do not cover any details that were actually used, and the descriptions are complete in the context of the application of the filters in this work.

2.2.1 The Hampel Filter

The Hampel filter is a generalisation of a class of filters known as median-based filters, itself belonging to a class known as decision-based filters [40]. [40], quoting [3], claims that the Hampel filter and variations thereof have been 'reinvented again and again'. The original description of the filter may have been the dissertation by Hampel, [18]⁵. The same author analyses some properties of the filter, and others, in [19] and seems to be, if not the originator, the populariser of the term 'median absolute deviation' (MAD) [20] which appears in this and related filters.

Without access to Hampel's dissertation, the implementation in this work was based on the description in [40]. They define a moving data window w_k^K of half-width K,

    w_k^K = [x_{k−K} . . . x_k . . . x_{k+K}] ,

and the action of the filter as

    y_k = x_k   if |x_k − m_k| ≤ t S_k ,
    y_k = m_k   if |x_k − m_k| > t S_k ,

where m_k = median(w_k^K) and

    S_k = 1.4826 · median_{j ∈ [−K, K]} (|x_{k−j} − m_k|)

Footnote 5: Many attempts to acquire this text were made to confirm this, but the dissertation could not be obtained.

is the MAD scale estimate. According to [40] the factor of 1.4826 makes the MAD scale estimate an unbiased estimate of the standard deviation for normally distributed data. It should be possible, then, to obtain similar results using S_k = s_k ∼ σ, that is, letting S_k be the (sample) standard deviation of the window, which for larger windows tends towards the (population) standard deviation for the entire signal.⁶ The tuning parameter t, however, means that the filter can freely be adjusted until the desired aggressiveness/conservativeness is achieved, and the exact method of calculating S_k becomes less important if we are free to tune as much as necessary.
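A minimal Python sketch of the filter as defined above; this is not the implementation used in the thesis, and window handling at the edges of the signal is simplified by truncating the window.

```python
import numpy as np

def hampel_filter(x, K=60, t=3.0):
    """Replace outliers with the window median (Hampel filter).

    x : 1-D signal, K : window half-width, t : tuning parameter.
    """
    x = np.asarray(x, dtype=float)
    y = x.copy()
    for k in range(len(x)):
        lo, hi = max(0, k - K), min(len(x), k + K + 1)
        window = x[lo:hi]
        m_k = np.median(window)
        S_k = 1.4826 * np.median(np.abs(window - m_k))   # MAD scale estimate
        if np.abs(x[k] - m_k) > t * S_k:
            y[k] = m_k                                    # decision: replace with median
    return y
```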

2.2.2 The Savitzky-Golay Filter

The Savitzky-Golay filter is a convolutional filter that fits points in a window w_k^K to a polynomial over the window. Tables of convolution coefficients for the filter were published in [43] and some corrections in [45]. These tables contained analytical solutions for the convolution coefficients under the assumption that observations are equally spaced in the time domain. Using the same notation as in the Hampel filter, the action of the Savitzky-Golay filter is defined by

    y_k = \frac{1}{|C|} \sum_{i=−K}^{K} C_i y_{k+i} ,

where we use |C| as shorthand for \sum_{i=−K}^{K} C_i.

We used the SciPy [49] implementation of the Savitzky-Golay filter, for which details can be found at [50], [51]. In [17] the coefficients are given as the elements of the matrix C = (J^T J)^{−1} J^T where J is a Vandermonde matrix, J_{k+K+1, j} = k^{j−1} with k = −K, . . . , K. An interesting side note: a look at the tables in [43] reveals that polynomials of degree 2n and 2n + 1 share convolution coefficients.
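For reference, the SciPy implementation mentioned above is called as follows; the window length and polynomial order here are illustrative and are not the parameters chosen in §3.2.1.2.

```python
import numpy as np
from scipy.signal import savgol_filter

t = np.linspace(0.0, 10.0, 501)
noisy = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)

# window_length must be odd; it corresponds to 2K + 1 in the notation above.
smoothed = savgol_filter(noisy, window_length=51, polyorder=3)
```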

Footnote 6: §5.2: an analysis, either statistical or empirical, of the difference in behaviour/performance of the Hampel filter when using the MAD scale estimate with t = 1, the sample standard deviation, and the population standard deviation.

2.3 Fitting and Re-Expression

If we are to perform linear regression on data or the principal component loadings (discussed in §3.1.1) we must first transform the data to remove any non-linear relationships. The work in [48], an important source in this area, refers to this procedure as re-expression or as 'straightening and flattening'. More recent work, for example [53], tends to simply refer to these techniques as transforms or power transforms.

2.3.1 Tukey Transforms

In [48, pg. 172] Tukey discusses applications of the idea of using power transforms to re-express data using what he calls a 'ladder'. After an exploratory plot (we imagine some y = f(x)) of the data is made, the trend can be made linear by replacing y with a transformed variable from the set

    { . . . , y³, y², y^{1/2}, y^{1/3}, . . . , log y, . . . , y^{−1/2}, y^{−1}, y^{−2}, . . . } .

While [48, pg. 172] does not explicitly mention fractional powers other than y^{1/2} and y^{−1/2}, in a later example [48, pg. 191, pg. 192] he does discuss using fractional powers in general as a side note.

2.3.2 Box-Cox Transforms

2.3.2.1 Univariate Transformations

The paper [7] by the eponymous Box and Cox offers a formal approach to selecting power transforms of two forms which, with modern computing, make the 'ladder' method of finding power transforms antiquated. In [7] Box and Cox consider a simple and a generalised case of power transforms (for the univariate case these are given in (2.7) and (2.8)), as well as a further generalisation which allows for negative data. The most general case is not used in this work so we ignore negative data⁷.

In the univariate case, the 'linear model' we are solving is y = c + ε, that is, we seek to have the normal least squares regression for a constant y have normally distributed residuals.

Footnote 7: §5.2: what is the difference between the workaround used in this work for negative data and the most general form of the transform?

As mentioned, the univariate transforms are given by

    y^{(λ)} = \frac{y^λ − 1}{λ}              (λ ≠ 0)
            = ln y                            (λ = 0) ,    (2.7)

    y^{(λ)} = \frac{(y + λ₂)^{λ₁} − 1}{λ₁}   (λ₁ ≠ 0)
            = ln(y + λ₂)                      (λ₁ = 0) .    (2.8)

The maximised log likelihood⁸,⁹ for a particular choice of λ is then, from [7],

    L_max(λ) = −\frac{n}{2} ln s²(λ) + (λ − 1) \sum_{i=1}^{n} ln y_i ,
    L_max(λ) = −\frac{n}{2} ln s²(λ) + (λ₁ − 1) \sum_{i=1}^{n} ln(y_i + λ₂) ,

for the unshifted and shifted cases respectively, where s²(λ) is the sample variance of y^{(λ)} for a particular λ.
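In practice this maximisation need not be done by hand; for example, scipy.stats.boxcox maximises the unshifted log-likelihood over λ and returns the transformed data. The data below are synthetic and only illustrate the call.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.5, size=1000)   # positive, skewed example data

# boxcox maximises the log-likelihood above over lambda (form (2.7))
y_transformed, lam = stats.boxcox(y)
print(f"estimated lambda = {lam:.3f}")
```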

2.3.2.2 Multivariate Transformations

In the multivariate case we can consider two approaches: marginal normality and joint normality. Marginal normality is straightforward¹⁰ since, as [26, pg. 197] points out, it is equivalent to solving a λ_j for each variable independently. If λ is the vector of parameters for the power transform, then

    L_max(λ_j) = −\frac{n}{2} ln s²_{jj}(λ_j) + (λ_j − 1) \sum_{i=1}^{n} ln y_{ij} ,
    L_max(λ_j) = −\frac{n}{2} ln s²_{jj}(λ_j) + (λ_{1,j} − 1) \sum_{i=1}^{n} ln(y_{ij} + λ_{2,j}) ,

for the unshifted and shifted cases respectively, where s²_{jj}(λ_j) is the variance¹¹ of y_j^{(λ_j)}, an (N × 1) vector of observations on y_j^{(λ_j)} (y_j under the transformation given by λ_j), and y_{ij} is the ith observation on the jth variable.

Footnote 8: That is, the maximum likelihood for E[y^{(λ)}] = Xβ. In the multivariate case this refers to the maximum likelihood for E[Y^{(λ)}] = XB.

Footnote 9: §5.2: when can this problem be solved analytically, since dL_max/dλ has a closed form?

Footnote 10: §5.2: when can this problem be solved analytically, since dL_max/dλ has a closed form?

Footnote 11: Can also be denoted cov(y_j^{(λ_j)}, y_j^{(λ_j)}).

Although [26, pg. 197] claims that in most cases marginal normality is sufficient and the improvement gained by attempting to force joint normality is often unnecessary in practical applications, given the non-linear nature of the data in this work and the many interactions between the variables, we believe it prudent to at least be aware that joint normality may offer some improvement if needed. We refer to [2] and [26, pg. 197], who give the formulae for L_max derived from (2.7):

    L_max(λ) = −\frac{n}{2} ln(det S(λ)) + \sum_{j=1}^{p} \left( (λ_j − 1) \sum_{i=1}^{n} ln y_{ij} \right) ,
    L_max(λ) = −\frac{n}{2} ln(det S(λ)) + \sum_{j=1}^{p} \left( (λ_{1,j} − 1) \sum_{i=1}^{n} ln(y_{ij} + λ_{2,j}) \right) ,

and extension to the shifted case (if necessary) is straightforward.

In [26, pg. 197] it is recommended that solving the optimisation problems for joint normality should benefit from using the values of λ obtained for marginal normality as a starting point¹². Both [7] and [33] give relations for likelihood estimates on normalised transforms. If we have the transformed data y_j^{(λ_j)}, the normalisation used by both works is y_j^{(λ_j)} / [gm(y_j)]^{λ_j − 1}, where gm(y_j) is the geometric mean of y_j¹³.

Footnote 12: §5.2: what are the properties of this optimisation problem? Can it be solved analytically?

2.3.3 Validating Transformations

The quality of a transformation is often affected very strongly by several influential observations, most often outliers, and worse still by clusters of outliers [42]. This is to be expected since the transformations we have discussed aim for (multivariate) normality in the residuals, and a tight grouping of observations (and therefore residuals) or exceptional outliers will skew the maximum likelihood function much more than other data. The work by [42], however, is applied mostly to time-series analysis; and although we hope to progress to time series models eventually, the first models used here are purely state-based.

We did draw some inspiration on a method for validating a choice of transformation from [42] during initial work where the Tukey transform was used. Eventually we moved away from the Tukey transform and to the Box-Cox transform when the former did not produce satisfactory results. We mention this method briefly as a sort of record of the progression of the present work. The method in question involves comparing the effect of subset size on the residuals associated with a certain transformation, and accepting the transformation as valid if residuals decrease or stay low as subset size increases. Figure 2.1 shows an example.


Figure 2.1: An example plot of one of the methods used in [42] to evaluate the quality of a power transform, here λ = 1.8 maintains a low residual error as sample size increases.

The quality of a transformation is also dependent on how well the data being transformed align with the global assumptions made on the transformed variables. In this case our global assumptions are linearity, homoscedasticity, uncorrelatedness, and normality (see §2.4) according to [41], which also gives methods for validating these global assumptions¹⁴.

Footnote 14: §5.2: do the results of these tests change if we use the geometric mean normalisation mentioned above?

2.4 Assumptions Applied to Regressions and Transformed Variables

In linear regressions, and the Box-Cox transform, we see mention that regressions or transformations will only hold, be well-behaved, or be statistically justified if the variables (on which the regression is being performed or to which our original variables are being transformed) obey certain assumptions. These assumptions are usually listed as: linearity, homoscedasticity, uncorrelatedness, and normality. We will define the terms here for completeness.

The assumption of linearity is simply that the variables on which we perform our regression are actually linearly related [46]. With respect to the Box-Cox transform the assumption of linearity means that there exists a power transform, of the form which the Box-Cox transform uses, which will make the target variables linearly correlated [7].

Homoscedasticity is the requirement that the variables under consideration, both the predictors and the predicted variable, should have the same variance at different levels [46]. This is usually checked after performing the regression: the residuals of the regression should have roughly equal variance for all values of the predicted variable. This interpretation can be carried over directly to the residuals of the Box-Cox transform.

Uncorrelatedness can refer to two properties: the uncorrelatedness of the residuals of a regression, or the uncorrelatedness of the variables in the regression - in other words, an absence of multicollinearity [46]. In the work by Box and Cox both of these seem important.

Normality refers to the expectation that a regression or transformation exists which has normally distributed residuals. The Box-Cox transform is designed around this very idea and defines the transformation in terms of the exponent required to achieve normally distributed residuals, so it should be obvious that this cannot work if no such transformation exists.


Chapter 3

Principal Component Analysis and Principal Component Regression for Modelling Fuel Consumption

3.1 Literature Review and Background

3.1.1 Deriving PCs as a Maximal Variance Basis

We begin with a p-element random variable, ˜x, with zero mean. We will later consider a population or sample of N observations on ˜x. This assumption can be made without loss of generality since given a random variable w with non-zero mean we can subtract the mean ¯w and work with w − ¯w instead.

We define a matrix ˜X, where observations on the random variable ˜x appear row-wise,

    ˜X = [˜x_1 . . . ˜x_N]^T .

For most work a standardised variable, which we will denote x (without a tilde), is actually more relevant than unnormalised observations. For now we derive PCs first using the unmodified observations ˜x and thereafter using standardised variables (§3.1.2), and defer discussion of the choice of standardisation and its effects to §3.1.2.

If we wish to transform our basis to one which maximises variance in successive basis vectors we are first seeking some ˜α_1 which maximises

    var(˜α_1^T ˜x) = ˜α_1^T ˜Σ ˜α_1

(˜Σ is the covariance matrix of the random variable ˜x - or the observation matrix X - identity given by [26, pg. 76]). There are two typical constraints imposed to keep this maximum finite. The first is the typical choice for basis vectors, ˜α_1^T ˜α_1 = 1 [28, pg. 5]; the second, according to [28, pg. 25], was suggested by Hotelling, ˜α_1^T ˜α_1 = ˜λ_1, where ˜λ_1 is the largest eigenvalue of ˜Σ.¹

The problem of maximising ˜α_1^T ˜Σ ˜α_1 subject to ˜α_1^T ˜α_1 = 1 can be rewritten using Lagrange multipliers [28, pg. 5],

    max_{˜α_1}  ˜α_1^T ˜Σ ˜α_1 − ˜λ(˜α_1^T ˜α_1 − 1) .    (3.1)

A maximum, if it exists, will be found at a stationary point. So, from the partial derivative with respect to ˜α, which has had some constants factored out, in (3.2) we see that this becomes an instance of the eigenvalue problem in (3.3):

    \frac{∂}{∂˜α} \left( ˜α_1^T ˜Σ ˜α_1 − ˜λ(˜α_1^T ˜α_1 − 1) \right) = 0
    ˜Σ ˜α_1 − ˜λ ˜α_1 = 0    (3.2)
    (˜Σ − ˜λ I_p) ˜α_1 = 0 .    (3.3)

We know that ˜λ and ˜α_1 must be an eigenvalue and eigenvector pair of ˜Σ. We can also guarantee that ˜λ (and any other eigenvalue of ˜Σ) is non-negative, since ˜Σ is a covariance matrix and therefore symmetric and positive semi-definite.

Multiplying (3.2) by ˜α_1^T we also obtain ˜α_1^T ˜Σ ˜α_1 = ˜α_1^T ˜λ ˜α_1. So the maximisation problem becomes

    max_{˜α}  ˜α_1^T ˜λ ˜α_1 − ˜λ(˜α_1^T ˜α_1 − 1)   ⇔   max_{˜α} ˜λ ,

so the eigenvalue given by ˜λ must be as large as possible. The first basis vector of maximal variance is therefore given by the eigenvector ˜α_1 of the covariance matrix ˜Σ associated with the largest eigenvalue, which we denote ˜λ_1. This is the case for all subsequent basis vectors according to [28, pg. 6]: the next basis vector of maximal variance is given by ˜α_2 and so on. We give an inductive proof here.

Footnote 1: §5.2: there is a body of literature on the effects of these two normalisations; here we use the variant more common in contemporary work, ˜α_1^T ˜α_1 = 1.

Claim. The basis for a set of observations which successively maximises variance in the direction of each basis vector is given by the eigenvectors of the covariance matrix of the observations.

Proof. Orthogonality of the basis vectors is enforced with Lagrange multipliers, as before. For the next basis vector we require that ˜α_2^T ˜α_2 = 1 and ˜α_1^T ˜α_2 = 0, so the maximisation problem becomes

    max_{˜α_2} var(˜α_2^T ˜x) = max  ˜α_2^T ˜Σ ˜α_2 − ˜λ(˜α_2^T ˜α_2 − 1) − φ_1 ˜α_1^T ˜α_2 .

And in general we always require that ˜α_k^T ˜α_k = 1 and ˜α_k^T ˜α_{k'} = 0 for any k ≠ k'. The problem for some ˜α_k then becomes

    max_{˜α_k} var(˜α_k^T ˜x) = max  ˜α_k^T ˜Σ ˜α_k − ˜λ(˜α_k^T ˜α_k − 1) − \sum_{k'=1}^{k−1} φ_{k'} ˜α_{k'}^T ˜α_k .    (3.4)

Call the expression which we maximise in (3.4) f; then we have

    \frac{∂f}{∂˜α_k} = 2 ˜Σ ˜α_k − 2 ˜λ ˜α_k − \sum_{k'=1}^{k−1} φ_{k'} ˜α_{k'} ,

with the stationary points occurring at ∂f/∂˜α_k = 0, so the maximum must occur at

    ˜Σ ˜α_k − ˜λ ˜α_k − \frac{1}{2} \sum_{k'=1}^{k−1} φ_{k'} ˜α_{k'} = 0 .    (3.5)

For k = 1, the summation is empty and (3.4) reduces to (3.1).

For k > 1 we show that φ_k = 0 for all k and thus we obtain the eigenvalue problem for all k. First we show the base case of φ_1 = 0. For k = 2, left multiply (3.5) by ˜α_1^T and obtain

    ˜α_1^T ˜Σ ˜α_2 − ˜α_1^T ˜λ ˜α_2 − \frac{1}{2} ˜α_1^T φ_1 ˜α_1 = 0 .    (3.6)

The first two terms in (3.6) are 0 since the two basis vectors are orthogonal, and so φ_1 = 0. With φ_1 = 0 in the case of k = 2, the equality in (3.5) reduces to the eigenvalue problem, and continuing inductively we obtain the following.

    Base case:   φ_1 = 0 .
    Hypothesis:  φ_1 = · · · = φ_{k−1} = 0 .
    Step:
    ˜Σ ˜α_{k+1} − ˜λ ˜α_{k+1} − \frac{1}{2} \sum_{k'=1}^{k} φ_{k'} ˜α_{k'} = 0
    ˜Σ ˜α_{k+1} − ˜λ ˜α_{k+1} − \frac{1}{2} φ_k ˜α_k = 0 .    (3.7)

So from (3.7) we notice that it must be the case that φ_k = 0, by the same argument as in (3.6). So the directions of maximum variance are ˜α_k, each with variance ˜λ_k, where ˜λ_k is the kth largest eigenvalue of ˜Σ.²

Definition 3.1: Principal Component Loadings/Coefficients

The derived basis vectors ˜α_k are called principal component loadings or principal component coefficients.

Definition 3.2: Principal Components

The vector ˜z = ˜A^T ˜x, with ˜A = [˜α_1 . . . ˜α_p], is called the principal components of ˜x, with ˜z_{ij} giving the principal component score of observation ˜x_i in the direction of the jth loading.

These definitions are based on recommendations by [28, pg. 6] which are intended to avoid the sort of confusion pointed out in [32] that occurs when referring to the basis formed by the eigenvectors, ˜α_1 . . . ˜α_p, as principal components as well.

For a set of N observations we obtain a set of N principal components, as in

    ˜Z = (˜A^T ˜X^T)^T = ˜X ˜A .
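The derivation above translates directly into a few lines of linear algebra. The sketch below (illustrative only, not the thesis code) computes the loadings, eigenvalues, and scores from the covariance matrix of a centred data matrix.

```python
import numpy as np

def pca_loadings(X_tilde):
    """Loadings, eigenvalues, and scores from the covariance eigendecomposition.

    X_tilde : (N, p) array of observations (rows). The data are centred, the
    covariance matrix is eigendecomposed, and the eigenvectors are ordered by
    decreasing eigenvalue, matching the derivation above.
    """
    X = X_tilde - X_tilde.mean(axis=0)        # zero-mean data, as assumed
    S = np.cov(X, rowvar=False)               # (p, p) sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    A = eigvecs[:, order]                     # loadings, one column per PC
    Z = X @ A                                 # principal component scores
    return eigvals[order], A, Z
```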

3.1.2 PCs as a Maximal Correlation Basis

Consider now, as mentioned before, a random variable x standardised by some weight w_j,

    x_{ij} = ˜x_{ij} / w_j ,

and construct a matrix of observations on this variable as before,

    X = [x_1 . . . x_N]^T .

Recall that for data with non-zero mean we assume that first the mean is subtracted to fix this 'problem': x_{ij} = (˜x_{ij} − ¯˜x_{ij}) / w_j.³

Footnote 2: This derivation is based on [28, pg. 5 - pg. 6], but has been modified to include the inductive argument for the subsequent components.

The choice of w_j influences the resulting properties of the loadings, and there are many normalisations that suggest themselves as worth investigating. The most commonly used is the standard norm: w_j = \sqrt{var(˜x_j)} [28, pg. 21]. From this point forward, unless explicitly stated, x will refer to a random variable standardised in this way.⁴

In the previous section we treated our observations as a random vector ˜x and used ˜Σ = var(˜x). To derive the PCs for the normalised observations consider them once more as a random vector x, where x_j is one of the random variables in x. From the same definition, Σ = var(x), we obtain

    Σ = cov\left(\frac{˜x_j}{w_j}, \frac{˜x_{j'}}{w_{j'}}\right) = \frac{E[(˜x_j − E[˜x_j])(˜x_{j'} − E[˜x_{j'}])]}{w_j w_{j'}} .

In the case where w_j = \sqrt{var(˜x_j)} we thus have Σ = corr(˜x). Deriving PCs from the correlation matrix of ˜x is therefore equivalent to deriving PCs using the covariance matrix of the normalised variable x. In cases where w_j takes on some value other than \sqrt{var(˜x_j)} the derivation is a little more involved (see footnote 4).

From this point forward the kth largest eigenvalue of Σ and its corresponding eigenvector will be denoted by λ_k and α_k. Similarly, we have A = [α_1, . . . , α_p], with the principal components given by

    Z = XA .

Footnote 3: A comment on our choice of notation: we wish to avoid stacking of tildes, bars, carats, or other similar accents as in ¯˜x_{ij}. This is the reason we chose to have our normalised variables denoted without a tilde, since these appear in the majority of the work.

Footnote 4: §5.2: the effects of the standard norm are very well documented because of its desirable simplifying properties: zero mean and unit variance. It is not by any means the only useful norm that should be considered, but much of the theory of PCA has been developed under the assumption of the standard norm.

In [28, pg. 21] it is remarked that there is no simple relationship between the loadings obtained from the unnormalised data ˜X and those obtained from the normalised data X. The relationship between these loadings does however appear as part of the derivation of sample PCs. We obtain

    Σ ∼ S = ˜V^{−1/2} ˜S ˜V^{−1/2}

and, as noted, there is no simple relationship between the loadings obtained from ˜S and those of S. See §3.1.3, expression (3.9), for a full derivation of this relation.

3.1.3 Sample PCs

Some properties of PCs differ depending on whether they are derived from a population or a sample [28, pg. 29]. Since in applied contexts we usually have access to samples, not populations, investigating the differences is important. Our N observations on ˜x should be normalised as before. We denote the expectation of the jth variable across all N normalised observations E[x_j]. Note that E[x_j] = 0 under the standard norm.

The sample covariance gives an unbiased estimator of the population covariance, and for any standardisation that maps the sample mean to 0 we can derive the following:

    Σ ∼ S = \frac{1}{N−1} \sum_{i=1}^{N} (x_{ij} − E[x_j])(x_{ij'} − E[x_{j'}])    (3.8)
          = \frac{1}{N−1} \sum_{i=1}^{N} x_{ij} x_{ij'}
          = \frac{1}{N−1} X^T X .

If the population mean is known to be zero, which in practice is infrequently the case, the correct unbiased estimator is given by [28, pg. 30] as

    Σ ∼ S = \frac{1}{N} X^T X .

We further remark that, in practice, the number of samples is often so large that the two estimators give essentially identical estimates anyway. For low sample sizes, however, the difference is obviously meaningful.


The critical observation here is that for correctly standardised data we need not calculate the covariance matrix directly: we can use the sample to produce the corresponding covariance matrix. This formulation is a combination of that presented in [28, pg. 31] and [26, pg. 450], the latter of which makes direct use of the definition of sample variance as in (3.8).

We mentioned in §3.1.2 that our derivation for sample PCs would be useful in showing that the mapping from ˜Σ-loadings to Σ-loadings is non-trivial. Once again, using the unbiased estimator for covariance, note that the sample covariance is asymptotic to the population covariance as in

    ˜Σ ∼ ˜S = \frac{1}{N−1} \sum_{i=1}^{N} (˜x_{ij} − ¯˜x_j)(˜x_{ij'} − ¯˜x_{j'}) .

For a normalised variable we have

    Σ ∼ S = \frac{1}{N−1} \sum_{i=1}^{N} (x_{ij} − E[x_j])(x_{ij'} − E[x_{j'}])
          = \frac{1}{N−1} \sum_{i=1}^{N} \frac{˜x_{ij} − E[˜x_j] − E[˜x_{ij} − E[˜x_j]]}{\sqrt{var(˜x_j)}} \cdot \frac{˜x_{ij'} − E[˜x_{j'}] − E[˜x_{ij'} − E[˜x_{j'}]]}{\sqrt{var(˜x_{j'})}}
          = \frac{1}{N−1} \sum_{i=1}^{N} \frac{˜x_{ij} − E[˜x_j]}{\sqrt{var(˜x_j)}} \cdot \frac{˜x_{ij'} − E[˜x_{j'}]}{\sqrt{var(˜x_{j'})}}
          = ˜V^{−1/2} ˜S ˜V^{−1/2} ,    (3.9)

where ˜V^{−1/2} = diag(var(˜x_j)^{−1/2}), and we have the same result as [26, pg. 72]. This transformation consists of two scalings and a rotation and so there is no simple relationship between the loadings obtained for a covariance matrix and those obtained for a correlation matrix, as mentioned in §3.1.2. This means that correlation and covariance PCs are sufficiently different representations of data that we should expect (and it is the case in practice) one will be more useful than the other in a given circumstance. The correlation PCs are often more useful, since most datasets contain measurements of different units and orders of magnitude, but this is not always the case.

, and we have the same result as [26, pg. 72]. This transformation consists of two scalings and a rotation and so there is no simple relationship between the loadings obtained for a covariance ma-trix and those obtained for a correlation mama-trix, as mentioned in §3.1.2. This means that correlation and covariance PCs are sufficiently different represen-tations of data that we should expect (and it is the case in practice) one will be more useful than the other in a given circumstance. The correlation PCs are often more useful, since most datasets contain measurements of different units and orders of magnitude, but this is not always the case.

The derivation for sample PCs is very similar to that in §3.1.1, but we will repeat part of the argument in brief. We look to maximise aT

Figure

Figure 2.1 shows an example.
Table 3.1: Table showing Jolliffe’s recommended method of simplifying the analysis and interpretation of PCs.
Figure 3.2: Filters, with window half width K = 60 s, applied to measure- measure-ments for water depth below keel [ ] against time [ ] for one vessel trip.
Figure 3.3: A smaller section of the same trip as in Figure 3.2, showing the effectiveness of the combined filters.
+7

References

Related documents

Figure 4.3: The difference in instantaneous fuel consumption using two different estimation methods in stars, using steady state engine maps and compensating for the transient

MCA: Multi-Criteria Analysis MFA: Material Flow Analysis MIPS: Material Intensity Per unit of Service NPK: Nitrogen, Potassium and Phosphorus basic fertiliser NS: Normalised Score

The essential part of the presented approach is that the deconvolution process is not only guided by the comparison to the reference binding energy values that often show

Corresponding American studies of energy savings in urban areas (see, for example, Wagner 1980, OECD 1982) have suggested relatively small savings potentials. Studies have covered

PACE makes outlier detection, estimation of the function outside the observation du- ration and the gathering of common statistical properties, like mean and variance in

Puetter highlights the importance of this group, claiming that “[t]he group not only pre-agrees all critical Council decisions with relevance for the euro area member

Higher Total Physical Activity is Associated with Lower Arterial Stiffness in Swedish, Young Adults: The Cross-Sectional Lifestyle, Biomarkers, and Atherosclerosis Study..

Study I, an in vitro study, investigated the effects of two different local anaesthetics, lidocaine and ropivacaine, on cell viability and cell prolifer- ation in colon cancer