Methods of high-dimensional statistical analysis for the prediction and monitoring of engine oil quality


DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Methods of high-dimensional statistical analysis for the prediction and monitoring of engine oil quality

FREDRIK BERNTSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Methods of high-dimensional statistical analysis for the prediction and monitoring of engine oil quality

FREDRIK BERNTSSON

Master's Thesis in Mathematical Statistics (30 ECTS credits)
Master Programme in Applied and Computational Mathematics (120 credits)

Royal Institute of Technology, year 2016
Supervisors at Scania: Joakim Voltaire, Mats Ridemark
Supervisor at KTH: Tatjana Pavlenko
Examiner: Tatjana Pavlenko

TRITA-MAT-E 2016:65 ISRN-KTH/MAT/E--16/65-SE

Royal Institute of Technology
SCI School of Engineering Sciences
KTH SCI, SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

Engine oils fill important functions in the operation of modern internal combustion engines. Many essential functions are provided by compounds that are either sacrificial or susceptible to degradation. The engine oil will eventually fail to provide these functions, with possibly irreparable damage as a result. To decide how often the oil should be changed, there are several laboratory tests to monitor the oil condition, e.g. FTIR (oxidation, nitration, soot, water), viscosity, TAN (acidity), TBN (alkalinity), ICP (elemental analysis) and GC (fuel dilution). These oil tests are, however, often labor intensive and costly, and it would be desirable to supplement and/or replace some of them with simpler and faster methods. One way is to utilise the whole spectrum of the FTIR measurements already performed. FTIR is traditionally used to monitor chemical properties at specific wavelengths, but it also provides information, in a more multivariate way, relevant for viscosity, TAN and TBN. In order to make use of the whole FTIR spectrum, methods capable of handling high-dimensional data have to be used. Partial Least Squares Regression (PLSR) will be used to predict the relevant chemical properties.

This survey also considers feature selection methods based on the second-order statistic Higher Criticism, as well as Hierarchical Clustering. The feature selection methods are used to ease further research on how infrared data may be put to use as a tool for more automated oil analyses.

Results show that PLSR may be utilised to provide reliable estimates of the mentioned chemical quantities. In addition, the mentioned feature selection methods may be applied without losing prediction power. The feature selection methods considered may also aid analysis of the engine oil itself, and future work on how to utilise infrared properties in the analysis of engine oil in other situations.


Sammanfattning

Engine oils fill important functions in a modern combustion engine. Many important functions are provided by substances that are either consumed over time or vulnerable to degradation. After a sufficiently long time, the engine oil will no longer be able to provide the engine with these functions. To find out how quickly an oil degrades during operation, and how often it needs to be changed, there are several laboratory analysis methods. The most common are FTIR (oxidation, nitration, soot, water), viscosity (lubricity), TAN (acidity), TBN (alkaline buffer), ICP (elemental analysis) and GC (fuel dilution). These oil tests are often labor intensive and costly, and it would be desirable to replace or complement them with measurements from an FTIR spectrometer. In order to use the whole spectrum, statistical methods capable of handling high-dimensional data are required. Partial Least Squares Regression is used to predict relevant tests from chemical oil analysis.

This study also considers feature selection methods based on the second-order statistic Higher Criticism, as well as hierarchical clustering. The feature selection methods are used to contribute to future development of how data from the infrared spectrum can be used to automate oil analyses.

The results show that PLSR can be used to provide accurate predictions of the above-mentioned quantities. Furthermore, the feature selection methods in this study can be used to reduce the number of variables while retaining the prediction power. The feature selection methods can also be used as tools that, in the future, contribute knowledge of how the oil's properties in the infrared spectrum can be exploited for the analysis of engine oil in other situations.


Acknowledgements

I want to thank my supervisors at Scania, Joakim Voltaire and Mats Ridemark, who both have been indispensable to this project. In addition, I am very grateful for the help provided by my supervisor Tatjana Pavlenko at KTH. Special regards also to Natalia Stepanova, who provided valuable input on statistical issues.


Contents

1 Introduction
1.1 Problem Motivation
1.2 Statistical concerns
2 Lubricants
2.1 Lubricants
2.2 Oil degradation
2.3 Viscosity
2.4 Oxidation, Nitration, Sulfation and Additive depletion
2.5 Oil acidity
3 FTIR-Spectroscopy
4 Theory of Partial Least Squares
4.1 Introduction to Partial Least Squares
4.2 Theory of PLS
4.3 Regression properties
5 Feature Selection with Higher Criticism Thresholding
5.1 Multivariate Linear Classification
5.2 Feature selection by thresholding
5.3 Higher Criticism Hypothesis Testing
5.4 Variants of Higher Criticism Thresholding
6 Feature selection based on Higher Criticism with Block Diagonal Covariance Structure
7 Hierarchical Clustering
8 PLSR prediction model
9 Feature Selection
9.1 Feature selection with Higher Criticism
9.2 Block-HC feature selection
9.3 Feature selection with Hierarchical Clustering
9.4 Comparison of feature selection methods
10 Conclusions
11 Future work
Bibliography


Chapter 1

Introduction

1.1 Problem Motivation

This thesis investigates the possibility of complementing or replacing certain traditional chemical analyses in the monitoring of engine oil. Oil analyses are needed in order to prolong the lifetime of the machinery as well as of the oil itself, and to limit the downtime of the machinery. Engine oil, or crankcase lubricant, is an essential element in the function of any internal combustion engine. However, due to the engine's hostile environment, engine oils are also susceptible to degradation and the formation of acidic compounds, along with depletion of some functional additives. While routine oil changes may provide a satisfactory engine-oil change policy in some applications, a more sophisticated policy is often desirable. Traditional chemical analyses provide a way to monitor the state of the engine oil, e.g. TBN (Total Base Number), TAN (Total Acid Number) and viscosity. There are standardised test procedures for many relevant measures. However, these analyses are often time consuming and labor intensive, and some of them may also require handling and disposal of unhealthy chemicals. An alternative to these analyses, for some relevant properties, is the use of FTIR spectroscopy.

One way to make use of the FTIR spectrum is as a trending and screening tool [5]. Typically a difference spectrum between the used engine oil and a reference oil is analysed. With knowledge of the sources of certain peaks, conclusions may be drawn about the engine oil's state. There are also ways to measure TBN with FTIR [6], by allowing a chemical to react with the oil to produce a quantifiable peak in the spectrum. The benefit of this kind of method is that it can be standardized and permits a wider test range of different types of lubricants.

Another approach that has been used for oil condition monitoring is to use a statistical method (Principal Component Analysis, PCA) to determine a domain of samples with approved functionality from the scores of the first principal components [1]. There are also approaches that try to predict physical quantities important to the oil's condition status based on measurements other than spectroscopic data, such as in [8]. That study presents a model to monitor the oil's condition based on its permittivity and viscosity. [8] also highlights one of the drawbacks of FTIR spectroscopy, namely the complexity and high cost of the instrument itself. These drawbacks make an on-line application with a full-spectrum FTIR spectrometer undesirable. However, IR sensors limited to specific wavelengths are simpler and cost less, which motivates the study and identification of the most relevant wavelengths.


The FTIR spectrometer used in this survey measures the transmission in the mid-infrared region (4000–400 cm−1), which is primarily able to detect bonds of organic molecules [10]. It is also possible to use the near-infrared region (13,000–4000 cm−1) to draw conclusions about the oil's condition [7]. However, more sophisticated mathematical analysis is then required, due to the very complex behavior of the spectrum in this region.

1.2 Statistical concerns

In this survey we make use of collected FTIR data to predict chemical properties relevant to the condition of engine oil samples collected from Scania's test site.

The number of variables obtained from each oil sample is p = 3550. Many of the generic problems that arise when high-dimensional complex data are considered are present. Is it possible, for example, to examine an internal covariance structure by visual examination of the data? While this is often difficult from a purely data-driven approach, knowledge provided by chemistry may serve as a measure of the reliability of the results. Another difficulty is due to what may be called the "p > n" problem, where n is the number of samples. This situation forces us to consider methods other than classical statistical textbook methods.

One objective of this survey is prediction, and as in any prediction problem, the prediction error is a suitable measure of how powerful the model is. The objective to predict relevant quantities is derived from the desire to replace certain chemical measures, which in turn is ultimately derived from the question of whether or not the engine oil is able to fulfill its functions. Hence the tolerance for errors is rather large; compare, for example, with the situation of estimating a fundamental physical constant, where any further increase in accuracy has some value.

The other objective of interest in this survey is to reduce the number of features included in the model. The motivation for this is twofold. One reason is simply that, from a computational perspective, a smaller model is more desirable than one having a larger number of features. The other, and more important, comes from the desire to use the model as a tool to analyze the oil itself. A method to discard uninformative variables may help to analyze the composition of the oil.

This survey will primarily consider the following statistical methods:

• Partial Least Squares Regression

• Higher Criticism Feature Selection

• Hierarchical Clustering

PLSR will serve as the prediction method and therefore also as the performance measure. Higher Criticism feature selection and Hierarchical Clustering will be used as tools to select features and thereby reduce the size of the model, in the sense that the number of features is reduced. PLSR will also serve as the performance measure in these reduced models. Two kinds of Higher Criticism feature selection will be considered: the conventional one, where the variables are assumed to be independent, and one where the covariance structure is believed to be block diagonal. The unsupervised learning method Hierarchical Clustering will also be considered.


Chapter 2

Lubricants

2.1 Lubricants

The principle of supporting a load on a friction-reducing film is known as lubrication. The substance of the film itself is referred to as a lubricant, and the action of applying a lubricant for the purpose of lubrication is known as lubricating.

Lubrication is not new; it has been taken advantage of for as long as moving parts have been used in machinery. The industrial revolution came with increased demands on lubrication, due to the introduction of more advanced machinery. Nowadays a major part of the global lubricant production is designated for internal combustion engines. Lubricants in internal combustion engines compose the main focus of the subsequent chapter and of the survey as a whole.

Lubricants designated for use in internal combustion engines are also commonly referred to as engine oils. Engine oils from the dawn of the automotive era were not very specialized or sophisticated in comparison with contemporary engine oils; frequent oil changes were required in order to keep the engine in operation. The engine oil's primary function, to provide lubrication, has been supplemented by further functionalities such as cooling, cleaning, suspending contaminants, and protection against corrosive damage. In addition, engine oils should ideally keep these properties over a wide range of external conditions such as varying temperature, speed and pressure.

Engine oils are composed of a base oil and an additive package. The base oils of internal combustion engines are composed of hydrocarbons with C20–C70 carbon atoms. The molecular structure of the base oil is a blend of hydrocarbons designed to have advantageous viscosity properties. The additive package is designed to maintain or enhance certain properties, or to add new functionality to the engine oil. Examples of already existing properties are viscosity and oxidation stability; added functionalities are, for example, cleaning and increased anti-wear properties.

2.2 Oil degradation

An engine oil in use will encounter chemical and physical impacts from heat, pressure, air, corrosive agents, contaminants and wear that change its lubricating properties over time. It is therefore essential to monitor these changes to know when an oil change is needed. The most important engine oil properties to measure are described in the following sections.


Figure 2.1: The Stribeck curve: the friction coefficient µ between two surfaces as a function of the Sommerfeld number λ, spanning the boundary, mixed, elastohydrodynamic and hydrodynamic lubrication regimes.

2.3 Viscosity

Viscosity is an essential parameter to monitor to prevent wear during engine operation.

A fluid's viscosity determines both how much load the fluid can take and how easily the fluid can form a lubricating film between surfaces in relative motion. These aspects are summarized by the Stribeck curve, see figure 2.1. Here λ is the Sommerfeld number
$$\lambda = \frac{\omega \cdot \eta}{p}$$
where ω is the angular frequency, η the dynamic viscosity and p the pressure. The Sommerfeld number is dimensionless. During the operation of an engine, the lubricant should ideally operate in the hydrodynamic lubrication regime. If the viscosity is too low, oil films tend to be thinner and non-protective, which increases the risk of wear in the mixed and/or boundary lubrication regimes. On the other hand, with a too viscous fluid it may take a longer time to establish a hydrodynamic film, and once the film is formed it will still require more energy and a higher fuel cost than necessary.

Since the viscosity decreases with temperature, viscosity-modifying agents are present in the oil's additive package to adjust for changes in viscosity during e.g. cold start-ups and changes in external climate. Typically, the viscosity is measured according to standard ASTM D446 at two different temperatures, 40°C and 100°C.


Table 2.1: Wavenumber intervals 1/λ [cm−1] at which the reactions and additives can be monitored.

Reaction / additive                          Wavenumber interval 1/λ [cm−1]
Compounds created:
  Oxidation                                  1700-1750
  Nitration                                  1600-1650
  Sulfation                                  1100-1200
Compounds consumed:
  Alkaline reserve (calcium carbonate)       1516
  Antioxidants                               3650
  Antiwear (ZDDP)                            960-980, 660

2.4 Oxidation, Nitration, Sulfation and Additive depletion

Oxidation is a common potential problem in any application that makes use of lubricants for its operation. Oxidation as a reaction has gotten its name from the element oxygen and is, in our setting, primarily the reaction involving base oil molecules and oxygen. Oxidation in other settings includes rusting and other corrosive processes, which are often very harmful to the functionality of the device; the same applies to oxidation in engine oil. The rate of base oil oxidation increases exponentially above a certain temperature. Crankcase lubricants are in an environment with high temperatures and a high degree of acidity, which makes them very susceptible to oxidation. In addition, the presence of wear particles, water and other contaminants may catalyse the oxidation process.

The oxidation process produces acidic compounds as well as polymeric compounds of higher molecular weight than the original molecules. If oxidation is allowed to proceed and further polymerization occurs, sludge is formed. The sludge can precipitate as a thin film, forming varnish or lacquer on metal surfaces. Sludge may cause blocking of valves and clogging of other orifices, and lacquer and varnish deposits increase wear if formed on sliding surfaces. Since oxidation contributes to the formation of bigger and heavier molecules, the viscosity will increase and impair fuel economy.

As with oxidation, nitration is a concern in engine oils. Heat may cause nitrogen and oxygen from the atmosphere to react and form so-called NOx gases, which in turn may cause oil thickening.

Sulfation may occur when sulfur from the diesel or the base oil reacts with oxygen or water at high temperatures to form various sulfurous compounds, including sulfur-based acids. These acids may in turn cause the formation of sludge, sedimentation and varnish.

FTIR spectra have been used as a tool to keep these processes under control; their impact may be monitored at the wavenumbers displayed in table 2.1.

2.5 Oil acidity

The above-mentioned processes are all responsible for degradation of the engine oil, and in particular the formation of acidic compounds is harmful. The level of acidity is measured by TAN (Total Acid Number), and the measurement is performed according to standard ASTM D664. The alkaline reserve is measured by TBN (Total Base Number) according to ASTM D4739. TAN and TBN are reported as the amount of potassium hydroxide needed to neutralize one gram of the sample. These standards are performed by titration, a type of wet-chemistry analysis that is rather labor intensive and requires the use of reagents. Titration is one of the chemical analysis methods that we hope to be able to replace by analysis of the FTIR spectrum.


Chapter 3

FTIR-Spectroscopy

Fourier Transform Infrared Spectroscopy (FTIR spectroscopy) measures the transmission (and/or absorption) of different wavelengths of infrared light by a translucent specimen. In our project we use FTIR spectroscopy to analyze engine oils. A typical spectrum obtained from the FTIR spectrometer is displayed in figure 3.1. The FTIR spectrometer is built around an interferometer, typically a Michelson interferometer. The interferometer generates a polychromatic beam dependent on the interferometer's mirror position, and a detector measures the intensity of the light transmitted through the specimen. The intensity as a function of the position of the interferometer's moving mirror is called an interferogram. The Fourier transform then transforms the interferogram into a transmission spectrum as in figure 3.1.

The FTIR instrument measures the absorption taking place in the infrared region. What determines the rate of absorption at a particular wavelength is the concentration of molecules in the specimen able to absorb that wavelength. The presence of a certain molecule will contribute to an increased rate of absorption for all wavelengths that can be absorbed by that molecule. The interval of wavenumbers considered by our FTIR spectrometer is 451–4000 cm−1, which is part of the so-called mid-infrared region. As displayed in figure 3.1, the y-axis represents the percentage of transmission.


Figure 3.1: Visual representation of an FTIR spectroscopy measurement of a crankcase lubricant: transmission versus wavenumber 1/λ [cm−1].


Chapter 4

Theory of Partial Least Squares

4.1 Introduction to Partial Least Squares

Partial Least Squares (PLS) is a dimension reduction method, most commonly used in combination with regression. PLS models the relationship between blocks of observed data by an intermediate representation in latent spaces. A visual representation of this setting is displayed in figure 4.2. There is a variety of variants of the PLS method suitable for slightly different settings; the one presented here is the most common, PLS1 or PLS2, used in the regression variant. The difference between PLS1 and PLS2 lies only in the number of response variables: PLS1 allows only one response variable, while PLS2 allows multiple.

PLS is usually recognized as coming from the field of chemometrics and was first developed during the 1970s, when the use of computers in chemical analyses became increasingly widespread. Analytical instruments generating a large number of correlated data points per sample may see usage of PLS.

4.2 Theory of PLS

Consider the general setting of two blocks of data, $\mathcal{X} \subset \mathbb{R}^p$ and $\mathcal{Y} \subset \mathbb{R}^m$. Here p is the number of variables in the $\mathcal{X}$-block and m is the number of variables in the $\mathcal{Y}$-block. Now construct matrices of n observations from these two sets of variables, $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times m}$. The data matrices are assumed to be centered ($\sum_{i=1}^{n} X_{ik} = 0$ for $k \in \{1, \dots, p\}$, and similarly for Y). PLS finds representations of X and Y such that
$$X = T P^T + E, \qquad Y = U Q^T + F. \tag{4.1}$$
Here T and U are matrices of size $n \times P$, where P is the number of variables in the latent spaces. The matrices T and U are often called scores and correspond to the observations' coordinates in the latent space. The matrices $P$ and $Q$, of size $p \times P$ and $m \times P$ respectively, are often called loadings and represent the latent variables' coordinates in the observed spaces. If the variables in the observed spaces are highly collinear, then typically $p \gg P$.

So far we have not said anything about how T and U are chosen. PLS finds T and U such that the following objective function is maximized for each pair of columns $(t, u)$ in T and U:
$$\mathrm{cov}(t, u)^2 = \mathrm{cov}(Xw, Yc)^2 = \max_{|r| = |s| = 1} \mathrm{cov}(Xr, Ys)^2 \tag{4.2}$$
where $\mathrm{cov}(t, u) = t^T u / n$ denotes the sample covariance. PLS finds bases $\{t_i\}_{i=1}^{P}$ and $\{u_i\}_{i=1}^{P}$ such that the covariance between each pair of basis vectors $(t_k, u_k)$ is maximized.

PLS is in its classical form based on the algorithm Nonlinear Iterative Partial Least Squares (NIPALS) in order to find the pair $(t_1, u_1)$. Start by choosing some $u_0$ and repeat the following steps until some convergence criterion is fulfilled:

1. $w_1 = X^T u_0 / (u_0^T u_0)$
2. normalize: $\|w_1\| \to 1$
3. $t_1 = X w_1$
4. $c_1 = Y^T t_1 / (t_1^T t_1)$
5. normalize: $\|c_1\| \to 1$
6. $u_1 = Y c_1$.

The graphical representation of the dimensions of the different matrices in figure 4.1 may be helpful for understanding the NIPALS algorithm.

Figure 4.1: Graphical representation of the dimensions of the matrices present in the NIPALS algorithm.

When the NIPALS algorithm has converged, we determine the loadings $p_1$ and $q_1$:
$$p_1 = X^T t_1 / (t_1^T t_1) \quad \text{and} \quad q_1 = Y^T u_1 / (u_1^T u_1).$$
What differs between variants of PLS is usually the way the matrices X and Y are deflated. In PLS1 and PLS2 the matrices are deflated according to
$$X_1 = X_0 - t p^T \quad \text{and} \quad Y_1 = Y_0 - t t^T Y_0 / (t^T t) = Y_0 - t c^T. \tag{4.3}$$
The procedure then starts over with the deflated matrices $X_1$ and $Y_1$ until a satisfactory number P of bases $\{t_i\}_{i=1}^{P}$ have been computed.
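To make the iteration concrete, the following is a minimal Python sketch of the NIPALS loop with the deflation step (4.3). The computations in this thesis use MATLAB's plsregress; this re-implementation, and the function name nipals_pls2, are illustrative assumptions only.

```python
import numpy as np

def nipals_pls2(X, Y, n_comp, tol=1e-10, max_iter=500):
    """PLS2 via NIPALS; X (n x p) and Y (n x m) are assumed centered."""
    X, Y = X.copy(), Y.copy()
    T, W, P, C = [], [], [], []
    for _ in range(n_comp):
        u = Y[:, [0]]                        # arbitrary starting vector u0
        for _ in range(max_iter):
            w = X.T @ u / float(u.T @ u)     # 1. w = X'u / (u'u)
            w /= np.linalg.norm(w)           # 2. ||w|| -> 1
            t = X @ w                        # 3. t = Xw
            c = Y.T @ t / float(t.T @ t)     # 4. c = Y't / (t't)
            c /= np.linalg.norm(c)           # 5. ||c|| -> 1
            u_new = Y @ c                    # 6. u = Yc
            if np.linalg.norm(u_new - u) < tol * np.linalg.norm(u_new):
                u = u_new
                break
            u = u_new
        p_vec = X.T @ t / float(t.T @ t)     # loading p = X't / (t't)
        c_vec = Y.T @ t / float(t.T @ t)     # regression weight for Y-deflation
        X = X - t @ p_vec.T                  # deflation, eq. (4.3)
        Y = Y - t @ c_vec.T
        T.append(t); W.append(w); P.append(p_vec); C.append(c_vec)
    return np.hstack(T), np.hstack(W), np.hstack(P), np.hstack(C)
```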

It is also possible to reformulate the problem of finding the weights $w_i$, $i = 1, \dots, P$, into an eigenvalue problem:
$$X^T Y Y^T X w_i = \lambda w_i, \qquad i = 1, \dots, P. \tag{4.4}$$
This can be seen by again considering the NIPALS algorithm. Let $w_n$ be the $w$ obtained in the $n$th iteration of NIPALS, and follow the steps backwards:
$$\begin{aligned}
w_n &= X^T u_{n-1} / (u_{n-1}^T u_{n-1}) \\
&= X^T Y c_{n-1} / \bigl[(u_{n-1}^T u_{n-1})(c_{n-1}^T c_{n-1})\bigr] \\
&= X^T Y Y^T t_{n-1} / \bigl[(u_{n-1}^T u_{n-1})(c_{n-1}^T c_{n-1})(t_{n-1}^T t_{n-1})\bigr] \\
&= X^T Y Y^T X w_{n-1} / \bigl[(u_{n-1}^T u_{n-1})(c_{n-1}^T c_{n-1})(t_{n-1}^T t_{n-1})(w_{n-1}^T w_{n-1})\bigr]
\end{aligned}$$
which, at convergence, reduces the original problem to solving equation (4.4). The variant of PLS called SIMPLS uses an SVD decomposition of equation (4.4) instead of the NIPALS algorithm.

It is worth noting that the PLS algorithm is completely symmetric if equation (4.3) is neglected: excluding (4.3), PLS only tries to find latent spaces with bases that have as high covariance as possible. The representations of the data in the latent space, i.e. the scores, are orthogonal. This is the same as stating that all column vectors $t_i$ of the score matrix T are mutually orthogonal:
$$(t_i, t_j) = t_i^T t_j = 0 \quad \text{for } i \neq j.$$
This is easily proven by applying the NIPALS algorithm backwards. Suppose $i < j$; then
$$\begin{aligned}
X_j &= X_{j-1} - X_{j-1} w_{j-1} t_{j-1}^T X_{j-1} / (t_{j-1}^T t_{j-1}) \\
&= X_{j-1} \bigl(I - w_{j-1} t_{j-1}^T X_{j-1} / (t_{j-1}^T t_{j-1})\bigr) \\
&= \dots = X_{i+1} Z \\
&= \bigl(X_i - t_i t_i^T X_i / (t_i^T t_i)\bigr) Z
\end{aligned}$$
where Z is a matrix made from the product of the factors that arise in the induction steps. Now observe that
$$t_i^T X_j = \bigl(t_i^T X_i - t_i^T t_i \, t_i^T X_i / (t_i^T t_i)\bigr) Z = \underbrace{\bigl(t_i^T X_i - t_i^T X_i\bigr)}_{= 0} Z = 0$$
and therefore
$$t_i^T t_j = t_i^T X_j w_j = 0 \quad \text{for } i < j,$$
which was what was to be shown.
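As a quick numerical sanity check of this orthogonality, one can run the hypothetical nipals_pls2 sketch from above on random centered data and inspect the Gram matrix of the scores:

```python
import numpy as np

# Numerical check that the NIPALS scores are mutually orthogonal,
# using the hypothetical nipals_pls2 sketch above on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 40)); X -= X.mean(axis=0)
Y = rng.standard_normal((30, 4));  Y -= Y.mean(axis=0)
T, W, P, C = nipals_pls2(X, Y, n_comp=5)
G = T.T @ T                                  # Gram matrix of the scores
off_diag = G - np.diag(np.diag(G))
assert np.abs(off_diag).max() < 1e-6 * np.abs(G).max()
```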

4.3 Regression properties

PLS is a way to find a lower-dimensional representation of high-dimensional data, as depicted in figure 4.2. However, we also want to find a relation between the latent spaces, of the form
$$U = T D + H \tag{4.5}$$


Figure 4.2: Path diagram of the PLS model. T and U are the latent variables; X and Y are the observed data variables, assumed to be derived from the latent variables according to the directions indicated by the arrows.

where U is the latent representation of Y, H is the residual matrix from the regression, and D holds the regression coefficients. Now put (4.5) into (4.1):
$$Y = (T D + H) Q^T + F = T D Q^T + \underbrace{(H Q^T + F)}_{:= F} \tag{4.6}$$
Furthermore, let us continue the NIPALS-inspired notation by letting $C := Y^T T$ and assume that the scores T are orthonormalized, i.e. $T^T T = I$, where I is the identity matrix. This lets us rewrite equation (4.6) as
$$Y = T C^T + F. \tag{4.7}$$

The next step is to reformulate (4.7) into a form that contains X instead of the scores T. This is done based on the relation
$$T = X W (P^T W)^{-1}. \tag{4.8}$$
Putting this back into equation (4.7), we retrieve
$$Y = X B + F. \tag{4.9}$$

The regression coefficients B are given by
$$B = W (P^T W)^{-1} C^T = X^T U (T^T X X^T U)^{-1} T^T Y. \tag{4.10}$$
It can also be shown that the PLSR algorithm (PLS1) is equivalent to the conjugate gradient method applied to find the coefficients b minimizing
$$\frac{1}{2} b^T A b - z^T b$$
where $A := X^T X$ and $z := X^T y$, along A-orthogonal directions. The conjugate gradient solution retrieved after p iterations is equivalent to the PLSR solution with p components.
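A minimal sketch of forming B as in equation (4.10) from the matrices returned by the hypothetical nipals_pls2 sketch above; note that new samples must be centered with the training means before prediction:

```python
import numpy as np

# Sketch: B = W (P^T W)^{-1} C^T from eq. (4.10), and prediction by
# Y_hat = X_new B as in eq. (4.9); W, P, C from the nipals_pls2 sketch.
def pls_coefficients(W, P, C):
    return W @ np.linalg.solve(P.T @ W, C.T)

def pls_predict(X_new, W, P, C):
    return X_new @ pls_coefficients(W, P, C)
```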


Chapter 5

Feature Selection with Higher Criticism Thresholding

Data glut is responsible for the breakdown of the applicability of many classical classification methods in a number of today's highly relevant fields, such as astronomy and genomics. In this survey we are dealing with the same issue: the FTIR spectrometer produces data points numbering in the few thousands for each sample. What has not changed, though, is the difficulty of obtaining good observations. In our case, the number of engine oil samples is still limited by the number of oil tests. Another characteristic of fields such as astronomy and genomics is the sparsity of relevant features in data samples. Among all measured data points, most are irrelevant for classification.

This section largely follows the structure of the background presentations in [3] and [2].

5.1 Multivariate Linear Classification

Consider the simple setting of a linear classifier acting on samples $X_i \in \mathbb{R}^{1 \times p}$ labeled by $Y_i \in \{-1, 1\}$. A sample $X_i$ is assumed to be distributed as $X_i \sim N_p(Y_i \mu, \Sigma)$, where $\mu \in \mathbb{R}^p$ is the mean contrast vector and $\Sigma \in \mathbb{R}^{p \times p}$ is the covariance matrix. A linear classifier assigns a new unlabeled sample $X \in \mathbb{R}^{1 \times p}$ to a class based on the sign of
$$T(X) = \sum_{j=1}^{p} w(j) X(j) \tag{5.1}$$
where $w(\cdot)$ is a weight function yet to be determined. Classical approaches such as Linear Discriminant Analysis (LDA) suggest $w \propto \Sigma^{-1} \mu$. The trouble in our setting is that $p > n$; hence we will not be able to estimate a full-rank $\hat\Sigma$, and the estimate of the covariance matrix will not be invertible, as required by many methods. Similar approaches, using a generalised inverse of $\Sigma$, often generate noisy results.


Figure 5.1: Graphs of the three families of weight functions $\hat w_T^{\mathrm{clip}}$, $\hat w_T^{\mathrm{hard}}$ and $\hat w_T^{\mathrm{soft}}$.

5.2 Feature selection by thresholding

The weight function $w(\cdot)$ typically has one of the following forms:
$$\hat w_T^{\mathrm{clip}}(z) := \mathrm{sign}(z)\, 1_{\{|z| > T\}}$$
$$\hat w_T^{\mathrm{soft}}(z) := \mathrm{sign}(z)\, \max(|z| - T, 0) \tag{5.2}$$
$$\hat w_T^{\mathrm{hard}}(z) := z\, 1_{\{|z| > T\}}$$
where $\mathrm{sign}(\cdot)$ is the sign function and $1_A(x)$ is the indicator function, defined by
$$1_A(x) := \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A \end{cases}$$

Graphs of these functions are displayed in figure 5.1. The Z-scores z are calculated according to
$$Z(j) = n^{-1/2} \sum_{i=1}^{n} Y_i X_i(j). \tag{5.3}$$
The most important aspect of all three weight functions is that they are non-zero only beyond a certain threshold value T. If a feature is irrelevant, the weight function gives its coefficient the value zero. This improves the interpretability of the model.
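A minimal Python sketch of the three weight functions in (5.2) and the Z-scores in (5.3); the function names are ours:

```python
import numpy as np

def w_clip(z, T):
    # clipping weights: sign(z) * 1{|z| > T}
    return np.sign(z) * (np.abs(z) > T)

def w_soft(z, T):
    # soft thresholding: sign(z) * max(|z| - T, 0)
    return np.sign(z) * np.maximum(np.abs(z) - T, 0.0)

def w_hard(z, T):
    # hard thresholding: z * 1{|z| > T}
    return z * (np.abs(z) > T)

def z_scores(X, y):
    # feature Z-scores, eq. (5.3); y holds class labels in {-1, +1}
    return (y @ X) / np.sqrt(X.shape[0])
```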

One crucial question remains: how shall we choose the threshold T? We will determine T by use of so-called Higher Criticism Thresholding.

5.3 Higher Criticism Hypothesis Testing

The concept of second-level significance testing, or, by its catchier name, Higher Criticism, was first introduced in 1976 by Tukey in his class notes for the course Statistics 411 at Princeton University. Tukey introduced the concept of Higher Criticism by means of a story. A young psychologist administers many hypothesis tests as a part of a research project, and finds that of 250 tests, 11 were significant at the 5% level. The young researcher feels very proud of this fact and is ready to make a big deal about it, until a senior researcher points out that one would expect 12.5 significant tests in the purely null case, merely by chance. In that sense, finding only 11 significant results is actually somewhat disappointing.

Tukey then proposed the second-level significance test
$$HC_{0.05,p} = \sqrt{p}\,\bigl[(\text{Fraction significant at } 0.05) - 0.05\bigr] / \sqrt{0.05 \cdot 0.95}$$
and suggested that large values, such as 2 or greater, should indicate evidence against the overall body of tests. Tukey did not develop these ideas further at the time, but a reasonable extension is to consider the biggest HC value among a number of confidence levels:
$$HC_p = \max_{0 < \alpha < \alpha_0} \sqrt{p}\,\bigl[(\text{Fraction significant at } \alpha) - \alpha\bigr] / \sqrt{\alpha (1 - \alpha)}$$

There are many different kinds of HC testing. The variant of HC testing used in this survey is based on the notion of P-values.

In our setup there are p independent tests of unrelated hypotheses, $H_{0,j}$ versus $H_{1,j}$, where
$$H_{0,j}: X(j) \sim N(0, 1)$$
$$H_{1,j}: X(j) \sim N(\mu_j, 1), \quad \mu_j > 0. \tag{5.4}$$
In the sparse setting we expect the great majority of the null hypotheses to be true, and a small fraction of tests where the null hypothesis is false. In Higher Criticism testing we test whether the whole body of null hypotheses is true. We let $\pi_j$, $j = 1, \dots, p$, be the feature P-values conditional on $H_0$, i.e. $\pi_j = P(|X(j)| > |Z(j)| \mid H_0)$. We also introduce notation for the ordered feature P-values, $\pi_{(1)} \le \pi_{(2)} \le \dots \le \pi_{(p)}$. We recall that, by the probability integral transform, $\pi_j \overset{\mathrm{iid}}{\sim} U[0, 1]$, $j = 1, \dots, p$, under the null hypothesis.

Definition 5.3.1 (HC-testing). The HC-test is defined by the HC objective function
$$HC(i, \pi_{(i)}) := \max_{1 \le i \le p \alpha_0} \sqrt{p}\, \frac{i/p - \pi_{(i)}}{\sqrt{(i/p)(1 - i/p)}} \tag{5.5}$$
where $\alpha_0 \in (0, 1)$, typically $\alpha_0 = 1/10$.

In order to obtain the HC threshold (HCT), we apply the HC-test to the feature P-values and define the HC threshold as follows.

Definition 5.3.2 (HC-threshold). Let the maximum of the HC objective function be attained at $\hat\imath$, and define the HCT by $\hat t_{HC} = |Z|_{(\hat\imath)}$.

There are other principles for choosing the threshold value T; often mentioned alternatives are False Discovery Rate Thresholding (FDRT) and Bonferroni Thresholding. Higher Criticism Thresholding has been shown to perform well in comparison to these methods in the rare-weak (RW) cases studied in [4].
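The HC threshold can be computed directly from the feature Z-scores. A minimal sketch, assuming two-sided normal P-values under H0 as in the text; the helper name hc_threshold is ours:

```python
import numpy as np
from scipy import stats

def hc_threshold(z, alpha0=0.10):
    """HC thresholding sketch (Defs. 5.3.1 and 5.3.2)."""
    p = len(z)
    pvals = 2.0 * stats.norm.sf(np.abs(z))       # pi_j = P(|N(0,1)| > |z_j|)
    order = np.argsort(pvals)                    # ascending pi_(1) <= ... <= pi_(p)
    k = int(np.ceil(alpha0 * p))                 # search range 1 <= i <= alpha0 * p
    i = np.arange(1, k + 1)
    frac = i / p
    hc = np.sqrt(p) * (frac - pvals[order][:k]) / np.sqrt(frac * (1.0 - frac))
    i_hat = int(np.argmax(hc))                   # index of the HC maximum
    return np.abs(z)[order][i_hat]               # t_HC = |Z|_(i_hat)
```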

Since the development of HC has been driven by the need to investigate cases where the useful features are rare and weak, most scientific papers deal with asymptotic cases.


5.4 Variants of Higher Criticism Thresholding

The variant of Higher Criticism thresholding defined in 5.3.1 is only one of many possible objective functions. In [9], a broader class of HC objective functions is considered.

Definition 5.3.1 is extended to the representation
$$T(i, \pi_{(i)}) := \max_{1 \le i \le p \alpha_0} \sqrt{p}\, \frac{i/p - \pi_{(i)}}{q(i/p)} \tag{5.6}$$
where q is an Erdős–Feller–Kolmogorov–Petrovski (EFKP) upper-class function. An EFKP upper-class function is a symmetric function $q(u): (0, 1) \mapsto \mathbb{R}_+$ such that
$$\limsup_{u \to 0} |B(u)| / q(u) \overset{\mathrm{a.s.}}{=} b$$
for some $b < \infty$, where $B(u)$ is a Brownian bridge on the interval [0, 1]. For details, see for example [9]. An important example of q(u) is
$$q(u) = \sqrt{u(1 - u) \log\log\bigl(1/(u(1 - u))\bigr)}. \tag{5.7}$$
While this choice of q(·) has been shown to have better asymptotic properties than the original choice, using (5.7) did not noticeably affect the outcome, and these results are therefore not included in this report.


Chapter 6

Feature selection based on Higher Criticism with Block Diagonal Covariance Structure

The theory presented in this section follows the reasoning in [11]. In equation (5.4) we modeled the variables $X(j)$, $j = 1, \dots, p$, as mutually independent. This is a very strong assumption, and one could argue heuristically that any more general covariance structure would model the data better. Therefore we instead assume that our training data $(X_i, Y_i)$, $i = 1, \dots, n$, where $(X \times Y) \in \mathbb{R}^p \times \{0, 1\}$, are modeled according to
$$X_i \mid Y_i = 0 \sim N_p(\mu_1, \Sigma), \qquad X_i \mid Y_i = 1 \sim N_p(\mu_2, \Sigma). \tag{6.1}$$
Here $\Sigma$ is a covariance matrix of the block-diagonal form
$$\Sigma := \begin{pmatrix} \Sigma_{[1]} & & & \\ & \Sigma_{[2]} & & \\ & & \ddots & \\ & & & \Sigma_{[b]} \end{pmatrix} \tag{6.2}$$
(with zeros off the diagonal blocks), where $\Sigma_{[1]}, \dots, \Sigma_{[b]}$ all are covariance matrices of size $p_0 := p/b$. $\mu_1$ and $\mu_2$ are mean vectors such that $\mu := \mu_2 - \mu_1 \neq 0$. We also require the number of samples from the two classes to be equal.

The idea of this block variant of Higher Criticism feature selection is to consider blocks of variables instead of individual variables, to construct P-values for blocks of variables, and then to test whether data from the two classes come from the same distribution in a given block. We introduce the following notation for the variable vector: $X := (X_{[1]}^T, \dots, X_{[b]}^T)^T$, where the $X_{[k]}$ are column vectors of $p_0$ variables each. We also construct estimates of


the mean vectors $\mu_1$ and $\mu_2$ by
$$\hat\mu_{1,[k]} = \frac{2}{n} \sum_{j=1}^{n/2} X^{(1)}_{j,[k]}, \qquad \hat\mu_{2,[k]} = \frac{2}{n} \sum_{j=1}^{n/2} X^{(2)}_{j,[k]}, \qquad k = 1, \dots, b,$$
where $X^{(m)}_{j,[k]}$ denotes the observed data points of the variables in block k from class m. Also let
$$\hat\mu_{[k]} := \hat\mu_{2,[k]} - \hat\mu_{1,[k]}, \qquad k = 1, \dots, b.$$

We also estimate the covariance structures $\Sigma_{[k]}$ by the estimator $\hat\Sigma_{[k]}$, defined by
$$\hat\Sigma_{[k]} := \frac{1}{n-2} \left( \sum_{j=1}^{n/2} \bigl(X^{(1)}_{j,[k]} - \hat\mu_{1,[k]}\bigr)\bigl(X^{(1)}_{j,[k]} - \hat\mu_{1,[k]}\bigr)^T + \sum_{j=1}^{n/2} \bigl(X^{(2)}_{j,[k]} - \hat\mu_{2,[k]}\bigr)\bigl(X^{(2)}_{j,[k]} - \hat\mu_{2,[k]}\bigr)^T \right).$$

We introduce the statistics $T_k$, $k = 1, \dots, b$, defined by
$$T_k := \frac{n(n - p_0 - 1)}{4(n - 2) p_0}\, \hat\Delta_k, \qquad k = 1, \dots, b, \tag{6.3}$$
where
$$\hat\Delta_k = \hat\mu_{[k]}^T \hat\Sigma_{[k]}^{-1} \hat\mu_{[k]}.$$
It can be shown that, under the assumption that the data come from (6.1), the $T_k$ have the following properties:

(a) $T_1, \dots, T_b$ are mutually independent;

(b) for all $k = 1, \dots, b$, $T_k \sim F_{p_0,\, n - p_0 - 1}(\gamma_k)$,

where F denotes the F-distribution and $\gamma_k$ is a non-centrality parameter given by
$$\gamma_k = \frac{n}{4} \Delta_k^2 := \frac{n}{4}\, \mu_{[k]}^T \Sigma_{[k]}^{-1} \mu_{[k]}.$$
Similarly to equation (5.4), we want to consider the two hypotheses
$$H_{0,k}: T_k \overset{\mathrm{i.i.d.}}{\sim} F_{p_0,\, n - p_0 - 1}(0)$$
$$H_{1,k}: T_k \overset{\mathrm{i.i.d.}}{\sim} (1 - \epsilon)\, F_{p_0,\, n - p_0 - 1}(0) + \epsilon\, F_{p_0,\, n - p_0 - 1}(\gamma) \tag{6.4}$$
where γ is a strictly positive non-centrality parameter and $\epsilon \in (0, 1)$ is a sparsity parameter. We keep the feature block $X_{[k]}$ if $w_k := 1(\gamma_k > 0)$, and we estimate $w_k$ by the estimator $\hat w_k := 1(T_k > t)$, where $T_k$ is the statistic from equation (6.3) and 1 is the indicator function. The threshold t is chosen using the HC threshold of definition 5.3.2.

A classifier assigning class membership to a new sample $X_0$ is
$$\Psi(X_0) := 1\left( \sum_{k = 1:\, \hat w_k = 1}^{b} \left( X_{0,[k]} - \frac{\hat\mu_{1,[k]} + \hat\mu_{2,[k]}}{2} \right)^{T} \hat\Sigma_{[k]}^{-1} \bigl( \hat\mu_{2,[k]} - \hat\mu_{1,[k]} \bigr) \le 0 \right) \tag{6.5}$$
which assigns $X_0$ to class $\Psi(X_0)$, similarly to the classifier in equation (5.1). This variant of block Higher Criticism has been shown to have better asymptotic properties in the rare-weak setup than the naive variant.
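A minimal sketch of computing the block statistics $T_k$ of equation (6.3) and their P-values under $H_{0,k}$ in (6.4), to which the HC objective of definition 5.3.1 can then be applied; the helper name block_t_statistics is ours:

```python
import numpy as np
from scipy import stats

def block_t_statistics(X1, X2, p0):
    """Sketch of the block statistics T_k of eq. (6.3).

    X1, X2: (n/2, p) arrays of samples from the two classes; p0 must
    divide p. Returns T_k and the block P-values under H_{0,k} in (6.4).
    """
    n = X1.shape[0] + X2.shape[0]
    p = X1.shape[1]
    b = p // p0
    T = np.empty(b)
    for k in range(b):
        s = slice(k * p0, (k + 1) * p0)
        A1c = X1[:, s] - X1[:, s].mean(axis=0)
        A2c = X2[:, s] - X2[:, s].mean(axis=0)
        mu = X2[:, s].mean(axis=0) - X1[:, s].mean(axis=0)   # mu_hat_[k]
        S = (A1c.T @ A1c + A2c.T @ A2c) / (n - 2)            # Sigma_hat_[k]
        delta = mu @ np.linalg.solve(S, mu)                  # Delta_hat_k
        T[k] = n * (n - p0 - 1) / (4.0 * (n - 2) * p0) * delta
    pvals = stats.f.sf(T, p0, n - p0 - 1)    # central F_{p0, n-p0-1} under H0
    return T, pvals
```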


Chapter 7

Hierarchical Clustering

Hierarchical clustering is an unsupervised statistical learning technique used to group together observations that are close to each other according to some metric. The easiest, or most straightforward, choice of metric is the Euclidean metric

$$d_{\mathrm{euclidean}}(x, y) = \|x - y\|_2 := \sqrt{(x - y)^T (x - y)}.$$

However, if the dimensionality p becomes large, the Euclidean distance fails to fulfill the intuition of how it behaves in lower dimensions. In this survey the correlation metric is more appealing, since we deal with collected spectral data. The correlation metric is defined as
$$d_{\mathrm{correlation}}(x, y) := 1 - \frac{(x - \bar x)^T (y - \bar y)}{\sqrt{(x - \bar x)^T (x - \bar x)\, (y - \bar y)^T (y - \bar y)}}. \tag{7.1}$$
Once a metric for the distance between points is chosen, the algorithm of hierarchical clustering is rather self-explanatory.

1. Determine the pairwise distances between all observations. If there are n samples in total, the number of distances that have to be calculated is $\binom{n}{2} = n(n-1)/2$.

2. For $i = n, n-1, \dots, 2$: examine all inter-cluster dissimilarities among the i clusters, and fuse the two clusters with the smallest distance between them.

The way the distance between two clusters is calculated is known as the linkage criterion. Four different linkage criteria are presented below.

• Complete: the distance between two clusters is the distance between the two nodes, one in each cluster, that are furthest apart.

• Single: the distance between two clusters is the distance between the two nodes, one in each cluster, that are closest together.

• Average: the distance between two clusters is the average of all inter-cluster distances.

• Centroid: the distance between two clusters is the distance between the two average points (centroids) of the clusters.


The most common way to present this kind of hierarchical clustering is with the help of a so-called dendrogram; one can be seen in figure 9.9. The height of a branch gives the distance between the two fused clusters.
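As a sketch, the feature clustering used later in section 9.3 (correlation metric (7.1), complete linkage) can be obtained with SciPy; the helper name cluster_features is ours:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_features(X, n_clusters=10):
    # Pairwise correlation distances, eq. (7.1), between the p columns
    # (features) of X, in condensed form: p(p - 1)/2 values.
    D = pdist(X.T, metric='correlation')
    Z = linkage(D, method='complete')        # complete-linkage agglomeration
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    return Z, labels   # Z can be drawn with scipy.cluster.hierarchy.dendrogram
```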


Chapter 8

PLSR prediction model

Without any further modifications, we apply PLS for regression of the available relevant chemical quantities. We denote the FTIR spectrum by X and define the response variable
$$Y := \begin{pmatrix} \text{TAN} & \text{TBN} & \text{Visc40} & \text{Visc100} \end{pmatrix}$$
so that $Y \in \mathbb{R}^{n \times 4}$, where n again is the number of observations. The overall goal is to find the regression coefficients $\beta \in \mathbb{R}^{p \times 4}$ that solve the equation
$$Y = X \beta + \epsilon \tag{8.1}$$
where ε is some residual term.

We use MATLAB's plsregress function on a subset of all observations, with 10-fold cross-validation and a large number of included components, 50. The measures RMSPE (Root Mean Square Prediction Error) and Relative Error in figure 8.1 are defined as
$$\mathrm{RMSPE}(y, \hat y) := \sqrt{\frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (\hat y_i - y_i)^2} \tag{8.2}$$
and
$$\text{Relative Error}(y, \hat y) := \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \frac{|\hat y_i - y_i|}{|y_i|} \tag{8.3}$$

where $\hat y$ and $y$ are the predicted and measured values respectively, and $n_{test}$ is the number of elements in the test set. In order to assure that there are no hidden caveats in the predicted values, we plot the measured values versus the predicted values of the considered quantities in figure 8.2. The predictions in those figures are based on coefficients retrieved from 20 PLS bases. Apart from TAN, the model does a good job of predicting the physical quantities: the predicted values lie close to the red line, with few exceptions. For TAN, on the other hand, the model performs more poorly; in particular, it has problems predicting samples with high TAN values. Those samples are all predicted to have lower TAN values than they actually have. We can also see from the relative-error plot that TAN is the quantity with the highest relative error.
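For reference, a simplified hold-out sketch of this workflow in Python, with scikit-learn's PLSRegression standing in as an assumed substitute for MATLAB's plsregress (the thesis's exact 10-fold cross-validation scheme is omitted):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

def rmspe(Y_true, Y_pred):
    return np.sqrt(np.mean((Y_pred - Y_true) ** 2, axis=0))            # eq. (8.2)

def relative_error(Y_true, Y_pred):
    return np.mean(np.abs(Y_pred - Y_true) / np.abs(Y_true), axis=0)   # eq. (8.3)

def evaluate_pls(X, Y, n_components=20, test_size=0.25, seed=0):
    # X: (n, 3550) spectra; Y: (n, 4) responses TAN, TBN, Visc40, Visc100.
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X, Y, test_size=test_size, random_state=seed)
    model = PLSRegression(n_components=n_components).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    return rmspe(Y_te, Y_hat), relative_error(Y_te, Y_hat)
```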

In figure 8.3 we can see the regression coefficients β of equation (8.1), with the intercept excluded. The absolute values of the regression coefficients say something about their importance.


Figure 8.1: Performance statistics of the PLS regression on a test set: RMSPE and Relative Error versus the number of components included in the PLSR model, for TAN, TBN, Visc40 and Visc100.

It is clearly visible that, for all quantities, the 1800–450 cm−1 interval has regression coefficients with large absolute values in comparison to the remaining variables. However, due to the high variance in most intervals of regression coefficients, it is difficult to draw any further conclusions regarding whether or not the compounds responsible for certain wavenumbers are depleted or created.


Figure 8.2: Plots of predicted versus measured values of TAN, TBN, Visc40 and Visc100.


Figure 8.3: Regression coefficients obtained from PLSR for TBN, Visc40, Visc100 and TAN, as functions of wavenumber 1/λ [cm−1]. 20 PLS bases are used.


Chapter 9

Feature Selection

As displayed in figure 8.1, most of the physical quantities considered can be predicted with a relative error below 10%. The exception is the TAN quantity, which the regression model fails to predict at an adequate level of accuracy. However, what the prediction models fail to achieve is an easily interpretable set of regression coefficients. The high variability between adjacent regression coefficients displayed in figure 8.3 makes it difficult to identify broad trends other than the absolute values of the coefficients.

9.1 Feature selection with Higher Criticism

One way to reduce the number of variables is to apply the Higher Criticism feature selection presented in the background section. In order to do so, we need to reformulate our setup with continuous response variables into a binary classification setup. One reasonable way to construct two classes from a physical quantity is to sort the quantity and assign the lower end to one class and the higher end to the other. In this setup we assign 1/10 of the sorted vector to the class with the lower values and an equally large selection to the other class. We do this for the physical quantities considered in chapter 8, which again are
$$Y = \begin{pmatrix} \text{TAN} & \text{TBN} & \text{Visc40} & \text{Visc100} \end{pmatrix}.$$

We also need to center the spectrum variables. If the original spectrum matrix is denoted by $\tilde X \in \mathbb{R}^{n \times p}$, then we center all variables according to
$$X_j := \tilde X_j - \frac{1}{n} \sum_{i=1}^{n} \tilde X_{ij}$$
and apply our further analysis to this $X_j$.
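A minimal sketch of this construction; the helper name make_binary_problem is ours:

```python
import numpy as np

def make_binary_problem(X_raw, y, frac=0.10):
    # Center every spectrum variable over all n samples, then keep the
    # lowest and highest fractions of the sorted response as the classes.
    X = X_raw - X_raw.mean(axis=0)
    k = int(np.floor(frac * len(y)))
    order = np.argsort(y)
    idx = np.concatenate([order[:k], order[-k:]])
    labels = np.concatenate([-np.ones(k), np.ones(k)])
    return X[idx], labels
```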

In figures 9.1–9.4 the results from the Higher Criticism feature selection are displayed. The subplots of each figure are read as follows:

(a) displays the sorted Z-scores as defined in (5.3). The dashed red line represents the values $(\hat\imath/p, \hat t_{HC})$.


(b) displays the ascending sorted P-values $(i/p, \pi_{(i)})$, $i = 1, \dots, p$, under the null hypothesis in (5.4). The dashed red line represents the expected distribution of P-values under the null hypothesis.

(c) displays the second-order Higher Criticism statistic of definition 5.3.1 under the null hypothesis (5.4). The blue line is $\bigl(i/p,\ \sqrt{p}\,(i/p - \pi_{(i)})/\sqrt{(i/p)(1 - i/p)}\bigr)$ for $i = 1, \dots, p$, and the red dashed line marks $\hat\imath/p$, where the HC maximum is attained.

(d) displays the Z-scores from equation (5.3) for all wavenumbers. The selected features are filled with green color, and the HC threshold is represented by the dashed red line.


Figure 9.1: HC feature selection for TAN.

Figure 9.2: HC feature selection for TBN.


Figure 9.3: HC feature selection for Visc40.

Figure 9.4: HC feature selection for Visc100.


As a measure of performance we apply the binary classifier of equation (5.1) to a separate test set. The misclassification rates are displayed in table 9.1.

Table 9.1: HC misclassification rates

          soft     hard     clip
TAN       0.0096   0.0192   0.0192
TBN       0.0962   0.1058   0.1058
Visc40    0.0865   0.0962   0.0962
Visc100   0.0385   0.0385   0.0385
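A minimal sketch of this evaluation, reusing the hypothetical z_scores, w_soft and hc_threshold helpers from the chapter 5 sketches:

```python
import numpy as np

def misclassification_rate(X_tr, y_tr, X_te, y_te, weight=w_soft):
    z = z_scores(X_tr, y_tr)              # eq. (5.3)
    t = hc_threshold(z)                   # HC threshold, Def. 5.3.2
    w = weight(z, t)                      # thresholded weights, eq. (5.2)
    pred = np.sign(X_te @ w)              # linear classifier, eq. (5.1)
    return float(np.mean(pred != y_te))
```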

9.2 Block-HC feature selection

Another variant of feature selection with a setup similar to Higher Criticism (HC) is Block Higher Criticism feature selection. It has been shown to have better asymptotic properties than ordinary single-variable HC feature selection. However, it comes at a cost, namely that some kind of prior knowledge has to be exploited in order to create the covariance matrix (6.2). Fortunately, the physics of spectroscopy provides a way to do this: simply assign variables with similar wavenumbers to the same group. This also coincides with how we have represented the spectrum samples. A physical explanation is that a certain molecule will affect not only a single discrete variable but a whole interval. Another, non-physical, motivation is that if there is an underlying complex covariance structure among the spectrum variables, then a covariance structure with fewer restrictions should be able to capture a bigger part of that structure.

In order to use Block-HC feature selection we have to determine the parameter p0, the number of elements in each group. While any positive integer value of p0 is permitted in the asymptotic case, we are limited to the divisors of p (= 3550). Since theory does not provide us with any tool for choosing p0, we instead use the misclassification rate as a measure to help determine p0.

Table 9.2: Block-HC misclassification rates

          p0 = 1   p0 = 5   p0 = 10  p0 = 25  p0 = 50
TAN       0.3585   0.2324   0.4542   0.1746   0.3810
TBN       0.0148   0.0028   0.2303   0.0085   0.0486
Visc40    0.0049   0.0028   0.3014   0.0521   0.1514
Visc100   0.1070   0.0486   0.2148   0.0908   0.1120

In table 9.2 the misclassification rates are displayed. Both p0 = 5 and p0 = 25 give decent misclassification rates, but we choose p0 = 25, since it deviates most from the previous HC feature selection method.

In figures 9.5–9.8 the results from the Block Higher Criticism are displayed. The subplots of each figure are read as follows:

(a) displays the sorted T-scores as defined in (6.3). The dashed red horizontal line represents the threshold value $\hat t_{HC}$.

(b) displays the ascending sorted P-values $(i/b, \pi_{(i)})$, $i = 1, \dots, b$, under the null hypothesis in (6.4). The dashed red line represents the expected distribution of P-values under the null hypothesis.


(c) displays the second-order Higher Criticism statistic of definition 5.3.1 under the null hypothesis (6.4). The blue line is $\bigl(i/b,\ \sqrt{b}\,(i/b - \pi_{(i)})/\sqrt{(i/b)(1 - i/b)}\bigr)$ for $i = 1, \dots, b$, and the red dashed line marks $\hat\imath/b$, where the HC maximum is attained.

(d) displays the T-scores from equation (6.3) for all groups. The selected features are filled with green color, and the HC threshold is represented by the dashed red line.


Figure 9.5: Block-HC feature selection for TAN.

Figure 9.6: Block-HC feature selection for TBN.


Figure 9.7: Block-HC feature selection for Visc40.

Figure 9.8: Block-HC feature selection for Visc100.


9.3 Feature selection with Hierarchical Clustering

Another strategy for feature selection is Hierarchical Clustering applied to the features. We use the correlation metric defined in (7.1) together with complete linkage. The result is shown in figure 9.9. Since the clustering does not take any correlation between the spectrum and the physical quantities into account, some additional selection policy has to be used. We choose to use available knowledge about the properties of FTIR spectroscopy in order to select which clusters to keep. We know that the carbonyl bond, C=O, which plays an important role in determining the degree of oxidation, is visible in the 1830–1650 cm−1 region. The 1500–650 cm−1 region is called the fingerprint region and displays a complex behavior. Based on this, we keep all variables in clusters {4, 5, 6, 7, 8, 9}, which primarily have variables in the mentioned regions.
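A minimal sketch of this selection step, assuming the labels produced by the hypothetical cluster_features helper from chapter 7:

```python
import numpy as np

def select_clusters(X, labels, keep=(4, 5, 6, 7, 8, 9)):
    # Keep only the spectrum variables whose cluster label is retained.
    mask = np.isin(labels, keep)
    return X[:, mask], mask
```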

Figure 9.9: Dendrogram of the spectrum variables' clusters (top). In the bottom plot, the features' cluster assignments (clusters 1–10) are displayed together with two spectra (TBN 13.5 and TBN 1.1), as transmission versus wavenumber 1/λ [cm−1].
