
Chemometrics

  Unravelling information from complex data generated by modern instruments

  Pedro F. M. de Sousa


Department of Materials and Environmental Chemistry

ISBN 978-91-7911-500-5

Pedro F. M. de Sousa

obtained an MSc in Analytical Chemistry from the University of Cádiz (Spain) in 2014. He carried out his doctoral studies at the Department of Materials and Environmental Chemistry at Stockholm University from 2015 to 2021, under the supervision of Dr. Magnus Åberg.

This thesis is based on the work developed in four scientific projects published as papers in scientific journals. The development of new analytical instruments allows the acquisition of more information from complex chemical mixtures. The richness in data, however, is often accompanied by interpretation problems due to its complexity. The studies developed in these projects have been essentially focused on a data analysis perspective, interpreting complicated data by means of algorithms. Several chemometrical approaches, based on multivariate data analysis and signal processing algorithms, have been studied and employed in each project. Most of the data analysis problems studied in these projects are related to liquid chromatography hyphenated to mass spectrometry systems, including tandem mass spectrometry. One of the projects has been related to spectrophotometric data.


Chemometrics

Unravelling information from complex data generated by modern instruments

Pedro F. M. de Sousa

Academic dissertation for the Degree of Doctor of Philosophy in Analytical Chemistry at Stockholm University to be publicly defended on Wednesday 9 June 2021 at 10.00 in Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, online via Zoom, public link is available at the department website.

Abstract

Chemometrics is a discipline dedicated to solving problems arising from complicated analytical systems, combining statistics, mathematics, and computational programming languages.

This thesis is based on the work developed in four scientific projects published as papers in scientific journals. The studies developed in these projects have been essentially focused on a data analysis perspective, interpreting complicated data by means of algorithms and employing chemometrical methodologies. Several chemometrical approaches, based on multivariate data analysis and signal processing algorithms, have been studied and employed in each project. Most of the data analysis problems studied in these projects are related to liquid chromatography hyphenated to mass spectrometry systems, including tandem mass spectrometry. One of the projects has been related to spectrophotometric data.

Chromatographic peak shifts have been attributed to a lack of control over the nominal chromatographic parameters.

The purpose of the work presented in Paper I was to study retention time data, obtained experimentally by provoking peak shifts with controlled effects, to demonstrate that there are patterns associated with such changing factors affecting chromatographic processes. PCR (Principal Component Regression) models were calculated for each compound (98 compounds), using the retention time data of each compound as responses (y) and the retention time data of the remaining compounds as regressors (X). The results demonstrate that the peak shifts of each compound across samples are correlated with the peak shifts of the other compounds in the chromatographic data. This work confirmed a previous work, in which an algorithm was developed to improve the alignment of peaks in a large number of complex samples, based on peak shift patterns.

Partial Least Squares (PLS) is one of the most widely used chemometric techniques. In the work presented in Paper II, a previously reported modified PLS algorithm was studied. This algorithm was developed with the purpose of avoiding the overfitting that the classical PLS exhibits with increasing noise in X. However, its results on less-noisy data were not as good as those of the classical PLS. From this study, we have developed another modified algorithm that does not overfit with increasing noise in X and that converges to the solutions of the classical PLS on less-noisy data.

DNA adductomics is a recent omics field that studies modifications in DNA. The goal of the project in Paper III was to develop a program with a graphical interface to interpret LC-MS/MS data acquired with a data-independent acquisition (DIA) method, in order to identify adducts in DNA nucleosides. The results were compared with those obtained manually. The program detected over 150 potential adducts, whereas manually, in a previous work, only about 25 were found. This program can detect adducts automatically in a matter of seconds.

Cancer has been associated with processes related to exposure to pollutants and the consumption of certain food products. These processes have been related to electrophilic compounds that react with DNA (forming adducts). When DNA modifications occur, defense mechanisms in the cell are often triggered, frequently leading to the rupture of the cell. Fragments of DNA (micronuclei) are then circulating in the bloodstream. In this work (Paper IV), electrophilic additions to hemoglobin (adducts) and the expression of micronuclei in blood samples from 50 children were studied. One of the goals of the project was to find correlations between the adducts in hemoglobin and the expression of micronuclei. PLS was used to model the data. However, the results were not conclusive (R2 = 0.60), i.e., there may be some trends, but there are other variables, not modelled, that may influence the variance in the expression of micronuclei.

Keywords: Chemometrics, Chromatography, Mass spectrometry, DIA, PCA, PLS, Experimental Design, DNA, adductomics.

Stockholm 2021

http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-192497

ISBN 978-91-7911-500-5 ISBN 978-91-7911-501-2

Department of Materials and Environmental Chemistry


I dedicate this thesis to my wife Zhaneta, my sons André and Jon, and my parents António and Fernanda.


List of papers

I. Sousa, P. F. M., de Waard, A., Åberg, K. M. Elucidation of chromatographic peak shifts in complex samples using a chemometrical approach. Analytical and Bioanalytical Chemistry. 410, 5229–5235 (2018).

II. Sousa, P. F. M., Åberg, K. M. Can we beat overfitting? - A closer look at Cloarec's PLS algorithm. Journal of Chemometrics. 32, e3002 (2018).

III. Sousa, P. F. M., Martella, G., Åberg, K. M., Esfahani, B., Motwani, H. V. nLossFinder—A Graphical User Interface Program for the Non-targeted Detection of DNA Adducts. Toxics. 9, 78 (2021).

IV. Carlsson, H., Aasa, J., Kotova, N., Vare, D., Sousa, P. F. M., Rydberg, P., Abramsson-Zetterberg, L., Törnqvist, M. Adductomic Screening of Hemoglobin Adducts and Monitoring of Micronuclei in School-Age Children. Chemical Research in Toxicology. 30 (5), 1157-1167 (2017).


The contributions of the author of this thesis to the papers presented are as follows:

I. The respondent was responsible for the experimental work, while supervising a master's student in this project. The data analysis and the writing of the paper were done entirely by the respondent. The idea of the project was suggested by the supervisor, as a continuation of previously developed projects.

II. The respondent was responsible for the development of the algorithms, the data analysis, and the writing of the paper. The idea of the project was suggested by the supervisor, who also contributed to the development of the algorithm proposed in this paper.

III. The respondent was responsible for the development of the program presented in this paper, the data analysis, and the writing of most of the paper. The experiments and part of the writing of the paper were performed by co-workers. The idea of the project started from a problem suggested by a co-worker, but it was designed by the respondent.

IV. The respondent was involved only in part of the data analysis and in writing a small part of the paper. The idea, the experiments, and the writing of most of the paper were carried out by the other authors.


Contents

Populärvetenskaplig sammanfattning (Popular scientific summary)
Introduction
Chemometrics
Regression Analysis
Principal Component Analysis
Principal Component Regression
Partial Least Squares
Method Validation
Design of Experiments
Signal processing
Applications
Peak detection in Liquid Chromatography
Adductomics
Discussion
Paper I
Paper II
Paper III
Paper IV
Conclusions
References


Populärvetenskaplig sammanfattning (Popular scientific summary)

Analytical instruments are constantly being developed as technical progress is made. Better instruments can give better analytical results, i.e., much more information can now be obtained from samples with modern instruments than was previously possible. However, the development of instruments results in ever larger and more complex datasets. Chemometrics is a discipline dedicated to solving problems in the extraction and interpretation of scientific information from complex analytical data. The problems are solved by applying mathematical and statistical methods. Depending on the instrument and technique, different data processing methods can be used. In the work presented in this thesis, several approaches involving the programming of algorithms have been used to understand complex analytical data. The research questions, the algorithms and their applications have been developed, studied and reported in four articles published in scientific journals.

Paper I. Chromatography is an analytical technique that has been used for the separation of chemical compounds from the mid-1800s up to the present day. Nowadays, with advances in instrumentation, this technique can enable the identification of thousands of compounds in a single analysis. In a chromatographic process, chemical compounds are separated over time according to their physicochemical properties and are then detected by, e.g., a mass spectrometer, which measures the amount of the compound as well as its more or less unique mass. However, there are factors that can affect the chromatographic process and that are difficult to keep under complete control. In a series of many analyzed samples, the same compound tends to vary in its separation time (retention time), which can pose a problem when assigning compound identities between samples and makes it difficult to organize them correctly in a table (matrix). This problem, known as the correspondence problem, was investigated in one of the projects presented in this thesis. A statistical approach that can detect patterns in the chromatographic retention of compounds is presented, and such patterns can be used to identify compounds across samples with the aid of dedicated algorithms. The results from this work complement earlier work from our group, in which algorithms for matching compounds between samples based on patterns have been developed and refined.

Paper II. Noise is always present in analytical instrument data, and it can be regarded as enemy number one for an analytical chemist. Partial Least Squares (PLS) is a well-known multivariate modelling algorithm. In earlier work, it was demonstrated that the classical PLS overfits the data more and more with increasing noise levels, i.e., the classical PLS algorithm models the noise rather than the underlying information, resulting in models with a poor ability to generalize to data from new samples. An overfitted model is a model that is good at predicting the data used to calculate the model but does not work well for new data. A modified PLS algorithm was proposed in earlier work by Olivier Cloarec. We have tested this proposed algorithm on several datasets, and the results were not satisfactory, i.e., for non-noisy data the results were not as good as for the classical PLS algorithm. Based on this investigation, we have developed another modified PLS algorithm that does not overfit noise, as the classical PLS does, while giving better models than the previously proposed algorithm. Its performance is comparable to that of the classical PLS model for datasets with low noise levels.

Paper III. DNA adductomics is a new field dedicated to measuring chemical modifications of DNA called DNA adducts. Such modifications are relevant in cancer studies and in other studies as markers of exposure to environmental pollutants. A computer program was developed to detect these adducts in complex data obtained with an analytical instrument. A much larger number of potential adducts (over 150) was detected by this program than had previously been achieved through time-consuming manual work (about 20).

Paper IV. Cancer has been associated with processes related to exposure to pollutants and the consumption of certain types of food. Chemical compounds that react with DNA and form DNA adducts appear to be particularly important. When the DNA of a cell is modified with adducts, defense mechanisms are often triggered that lead to the breakdown of the cell. Fragments of DNA are then released into the bloodstream. In this work, such compounds were found to react with DNA in blood cells. In an attempt to find a correlation between these DNA adducts in blood cells and free DNA fragments in the blood, blood samples from over 50 children were examined. The results, however, were not conclusive, i.e., there was no direct correlation between these compounds attached to the DNA of blood cells and the amount of free DNA fragments in the blood.


Introduction

Chemometrics is a discipline dedicated to chemistry, particularly to analytical chemistry, an area of remarkable importance in science that interacts with other areas, such as biochemistry, organic chemistry, and physical chemistry. Analytical chemistry is mainly focused on the development and implementation of methodologies that can provide quantitative and qualitative scientific information from analytical systems in the most efficient way possible. Analytical systems (i.e., the analytical instrumentation employed in the detection or identification of chemical compounds in samples) are constantly evolving, fulfilling the ongoing demands of academic and professional analytical research. Such developments in analytical instrumentation, however, tend to increase the complexity of the analytical data acquired from samples. On the one hand, more scientific information can be obtained from such complicated analytical data. On the other, such data may be too complicated to interpret with inadequate data analysis methods, and the potential of such instrumental approaches may be underused. Classical approaches in analytical chemistry, univariate or multivariate, tend to become less efficient in the analysis and interpretation of complicated datasets from evolving analytical instrumental systems. Thus, up-to-date computational data analysis methodologies are sought, aiming to maximize the utility of novel instruments and the potential to obtain more scientific information from complex analytical systems. Chemometrics is dedicated to solving such problems arising from complicated analytical systems, combining statistics, mathematics, and computational programming languages.

Different strategies can be adopted to extract information from multivariate (multicomponent) analytical datasets, depending on their level of complexity. Quoting Yi-Zeng Liang [1], "There are three kinds of multicomponent systems, white, gray and black": (i) "white", when we know the composition of a system, e.g., a trivial system where the chemical composition of a set of samples is well known; (ii) "gray", when the chemical composition of the system is partially known; or (iii) "black", when there is no knowledge whatsoever about the composition of the system. Analytical approaches can be employed to study data in a targeted way, such as for "white" or "gray" systems. Untargeted approaches can be employed to study "gray" or "black" systems. The choice of instrumental method and data analysis approach is crucial to obtaining any scientific information from such complicated analytical data systems. The complexity of analytical systems is what drives chemometricians to develop new strategies (algorithms) that can aid in the interpretation of such complicated data. In the work presented in this thesis, several data analysis problems have been studied from a chemometrical perspective, and methods have been proposed.


Chemometrics

Statistical methods have been employed in chemistry for quite some time (more than a century). With the evolution of analytical instrumentation, gradually, more data can be acquired and in shorter times. Analytical processes can benefit from more information provided from analyses, i.e., when more parameters (variables) can be studied. However, the inclusion of many parameters in an analytical process may result in complicated data, such as multivariate data matrices, which may be hard to interpret. Thus, to analyze data with high levels of complexity, specially designed statistical and computational methods are sought. Chemometrics is a discipline dedicated to the application of statistical methods for the interpretation of such emerging complicated analytical data in chemistry, employing computationally intense approaches, mainly from a multivariate perspective.

The origins of chemometrics are traceable to the early 1960s, supposedly when scientists began to have access to computational systems. Before those years, computerized systems were expensive bulky machines, practically only accessible to dedicated engineers and mathematicians. Technological developments have gradually provided access to computational systems to scientists in general, stimulating mathematical and statistical data analysis approaches in several fields of science. Other than chemistry, fields such as geology and biology also evolved from the benefits of the accessibility of computational systems, solving problems of interpretation of large multivariate datasets by means of computational approaches. Biology evolved remarkably with studies related to the human genome discovery, which has led to the creation of a discipline that nowadays is known as bioinformatics. Disciplines in science, such as bioinformatics, have evolved in parallel with chemometrics, following the technological evolution of analytical instrumentation and computerized systems [2,3].

One of the earliest examples of the development and application of chemometrics comes from a publication in 1960, when a mathematical technique, dubbed factorial analysis, was developed to study chemical shifts in proton magnetic resonance data using different solvents [4]. Another example, published in 1969, was the development of a computerized learning machine method to interpret patterns in combined data from mass spectra, infrared spectra and melting points of chemical compounds [5]. These methods have eventually been integrated into chemometrics, although the concept of chemometrics itself had not yet been established at the time these methods were developed.

The concept of chemometrics was presented for the first time in a publication by Svante Wold in Technometrics in 1972 [2,6], and appeared in the title of a publication for the first time in the Journal of Chemical Information and Computer Sciences in 1975, by B. R. Kowalski, emphasizing the concept of pattern recognition [7]. In 1974, Svante Wold, together with Bruce R. Kowalski, founded the International Chemometrics Society. Thus, S. Wold and B. R. Kowalski are considered the founders of chemometrics [3].

Chemometrics was not well accepted by the scientific community in the beginning, when its potential had not yet been recognized. However, the skepticism over chemometrics declined gradually, as solutions to problems associated with the interpretation of complicated analytical data obtained from instruments were developed. The first journal that dedicated a section to chemometrics was Analytica Chimica Acta in 1977, introducing the section "Computer Techniques and Optimization", which endured until 1982. In the meantime, in 1980, Analytical Chemistry changed the name of one of its sections, "Statistical and Mathematical Methods in Analytical Chemistry", to "Chemometrics". Chemometrics began to be generally accepted as a discipline in chemistry by the scientific community around the 1980s. Two journals, specifically dedicated to chemometrics, were created and have published work related to applications and developments in the field since then: Chemometrics and Intelligent Laboratory Systems and the Journal of Chemometrics [8].

Several books dedicated to chemometrics have been published based on developments achieved by researchers in the discipline. For example, Chemometrics - Data Analysis for the Laboratory and Chemical Plant, by Richard G. Brereton [9], is a book for an intermediate level of learning, for those who wish to understand chemometrics without much advanced knowledge of complex mathematics. At an advanced level, requiring deeper mathematical knowledge, the Handbook of Chemometrics and Qualimetrics is a compendium edited by D. L. Massart and B. G. M. Vandeginste, composed of two volumes, Part A [10] and Part B [11], containing detailed information on chemometrical methodologies.

Several chemometrical approaches relevant to the scope of the research presented in this thesis are introduced in the following sections: Multiple Linear Regression (MLR), Principal Component Regression (PCR), and Partial Least Squares (PLS), the latter being a more robust soft-modelling method, with a wider range of applications, relative to the former two (MLR and PCR). These multivariate data analysis methods (MLR, PCR, and PLS) have been employed and studied from different perspectives in the work presented in this thesis. The validation of models employing such methods is also presented here, as the typical chemometrical approach for the assessment of the predictive capability of a model. Then, a brief overview of Design of Experiments (DoE) methods in chemometrics is also introduced, focusing on factorial DoE methods, which have been employed in one of the projects. Finally, 'Signal Processing' methods are also introduced, focusing on the methods employed in one of the projects and on in-house software used in some of the projects.


Regression Analysis

The developments in analytical instrumentation and methodologies aim to provide better scientific information for decision-making from measurements and facts. The quality of such information, however, may depend on the instrumental technique employed, and ultimately on the mathematical methodologies applied in the analysis of the data obtained from the measurements [12].

Regression analysis is a mathematical approach dedicated to the study of relationships between measurements and facts. Regression techniques essentially aim to model data as well as possible in order to efficiently predict facts from measurements. In analytical chemistry, e.g., instrumental signals (measurements, explanatory variables, predictors, or regressors) and concentrations of compounds (facts, responses, or regressands) are hypothetically mathematically correlated.

The method of ‘least squares’ is a fundamental method in regression anal- ysis, and it is considered as one of the oldest ‘general estimation’ concepts in statistics. The first publication of this method appeared in 1805 by Adrien Marie Legendre

13

, although Carl Friedrich Gauss claimed to have discovered it 1795.

14

This method has led to the development of regression methodologies and applications in most scientific disciplines (geology, psychology, biology, chemistry, physics, etc.).

15

In regression, a dependent variable (or variables), designated as y, and independent variables, designated as x, are considered, and a relation between y and x is determined to describe how y varies with x. Such relationships can be established mathematically by linear, multilinear, or higher-order polynomial equations [10]. The simplest regression case is the univariate linear regression, where a single independent variable x describes y, and the relationship between x and y (the regression) is defined by a straight line, i.e., by a linear equation of the form $y = \beta_0 + \beta_1 x_1$, where $\beta_0$ and $\beta_1$ are the intercept and the 'slope', respectively.

The least squares method is essentially an optimization process where, in the case of a linear univariate regression, the best possible line that can describe the data in x and y is determined by minimizing the 'sum of the squares of the residuals', i.e., the distances between the modelled data and the model. In Multiple Linear Regression (MLR), this principle is extended to several independent variables, with one regression coefficient for each independent variable. These variables $x_i$ can represent other functions, e.g., $k$th powers ($x^k$) or interactions (e.g., $x_x \times x_y \times x_z$). The MLR algebraic equation is

$\mathbf{Y} = \mathbf{X}\mathbf{B}$

where Y is a matrix (or a vector y) containing the dependent variable data, X is a matrix containing the independent variable data, and B is a matrix (or a vector b) with the (least squares) regression coefficient estimates. The regression coefficients are calculated as

$\hat{\mathbf{B}}_{\mathrm{MLR}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}$

To obtain a sensible MLR model, however, the number of least squares estimated coefficients should be equivalent to the number of significant independent variables in the system, i.e., a nominal model equation defining such a system must be known a priori. Thus, MLR may not be a suitable method to model systems with unknown variables, and it is therefore considered a 'hard modelling' application [17].
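To make the least squares estimation above concrete, the following minimal NumPy sketch computes MLR coefficients via the normal equations. The simulated data, sample sizes, and variable names are hypothetical and only serve as an illustration; in practice a least squares solver is numerically preferable to an explicit inversion of XᵀX.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 samples, 3 independent variables plus an intercept column.
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
b_true = np.array([1.0, 0.5, -2.0, 3.0])           # assumed "nominal" coefficients
y = X @ b_true + rng.normal(scale=0.1, size=n)     # responses with a little noise

# Normal-equation estimate: B_MLR = (X'X)^-1 X'y
b_mlr = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalent based on a least squares solver
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b_mlr                                  # fitted (auto-predicted) responses
print(np.round(b_mlr, 3), np.round(b_ls, 3))
```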

Soft-modelling regression applications, Principal Component Regression (PCR) and Partial Least Squares Regression (PLS), are more robust than MLR, in the sense that they can better model complex data with numerous and unknown variables, and without any knowledge of the nominal equations that describe the system. Moreover, correlations between independent variables in X constitute a problem in MLR due to the inversion of XᵀX, a problem that does not arise with PCR or PLS. MLR is, however, an excellent application for experimental design modelling. In this case, the independent variables are well known and mutually orthogonal (independent). MLR has been employed directly on experimental designs in the development of Paper I. The least squares operation is employed implicitly in the PLS and PCR algorithms.


Principal Component Analysis

The developments of new technologies aim to provide better quality in the acquisition of scientific information from measurements. Such developments in analytical instrumentation tend to provide larger and more complex data from measurements. Consequently, great challenges may be posed to obtain valid scientific information, while making use of the maximum potential of such instruments. Large datasets with many variables (dimensions) tend to be complicated to interpret. Principal component analysis (PCA) is a technique that can reduce the dimensionality of data, according to correlations between variables, without a significant loss of statistical information [18]. The application of PCA can facilitate the interpretation of complicated analytical data, e.g., by providing a visualization of the structure of the data, to detect outliers in data, or to assess the quality of sample replicates in data. PCA is considered one of the most important techniques in multivariate data analysis, and it constitutes the basis of numerous multivariate data analysis methods in several disciplines (chemistry, biology, geology, etc.) [18,19]. In chemistry, PCA is involved in several chemometrical multivariate data analysis approaches, such as classification, calibration (or regression), and curve resolution [9].

In classification (also designated as supervised pattern recognition), samples (or objects) can be discriminated (or classified) according to similarities (or patterns) characterized by several measured variables (multivariate data). This can be achieved by a simple visual interpretation of PCA 'plots' or by means of statistically dedicated classification algorithms, e.g., SIMCA (Soft Independent Modelling of Class Analogy) [19,20]. In calibration, Principal Components Regression (PCR) is a soft-modelling regression technique that combines the PCA variable reduction capability with the least squares method [21]. This method is further explained in the next section. In curve resolution, PCA can be employed, e.g., in the decomposition of analytical signals, such as the deconvolution of overlapping chromatographic peaks, with spectroscopic or mass spectral data support [9].

Several PCA algorithms have been developed, although essentially for the same purpose and maintaining a convergence of results (numerical solutions). Depending on the purpose and the data in question, the choice of algorithm may vary. One of the most commonly used algorithms is the Singular Value Decomposition (SVD). The theoretical foundations of PCA were further developed by Harold Hotelling in the 1930s, introducing a new algorithm, designated as the 'power' method, and contributing some theoretical aspects of PCA that are known today [25]. Another well-known PCA algorithm is NIPALS (Nonlinear Iterative Partial Least Squares). This algorithm was preliminarily outlined and presented in a scientific report by Ronald Fisher in 1932. NIPALS was then redeveloped by Herman Wold in 1966 [26,27]. The NIPALS algorithm is remarkably famous for also being implemented in Partial Least Squares Regression (PLS). NIPALS and SVD are probably the most used and taught PCA algorithms in chemometrics. SVD was the algorithm of choice employed in the work presented in Paper I and Paper II.

PCA has been adopted in diverse scientific fields. Some modifications, such as transformations (rotations) and constraints introduced in a PCA algorithm, can simplify the interpretation of results [18]. PCA has been integrated into multivariate data analysis algorithms, such as Orthogonal Partial Least Squares (O-PLS or O2-PLS), as an implicit data pre-processing step [28,29]. In the work presented in Paper II, a modified Partial Least Squares (PLS) algorithm was developed, where PCA is also employed as part of a pre-processing step implicit in the algorithm. PCA can also be combined with other methods, such as ANOVA (Analysis of Variance). ANOVA-simultaneous component analysis (ASCA) is a generalization of ANOVA, with applications to multivariate data, and particularly related to the study of data obtained from designed experiments [30]. Such a method has been employed in Paper I.

Algebraically, variables are defined as columns in a data matrix, and graphically as orthogonal axes in a cartesian coordinate system, with the number of axes equal to the number of variables (illustrated as v1 and v2 in Figure 1 in a trivial 2-variable system). PCA consists essentially of orthogonal transformations (projections) of the original variable data vectors into new vectors that comprise the maximum variance of the original data vectors. These 'new' vectors (eigenvectors) are mutually orthogonal and have the same origin as the original variable axes (red and green lines in Figure 1). The Principal Components (PCs) are essentially a new set of orthogonal axes with the directions of the eigenvectors. The values resulting from the projection of the original data onto the PCs are designated as the scores. The eigenvectors are also designated as the loadings.

In theory, and independently of the number of original variables, the n resulting Principal Components (PCs) are equivalent to the n sources of variability in the system; e.g., the varying concentrations of n compounds in samples analyzed in an analytical instrument over many variables (with a dimensionality much larger than n) will reduce the original variables to n PCs. The number of PCs that comprise the variance in a data matrix is designated as the 'rank' of the system (matrix), which can be defined as a chemical or a mathematical rank. The chemical rank corresponds to the number of significant PCs, i.e., the number of chemical sources of variation in the system. Additional sources of variance may, however, be present in a system, e.g., contaminants or noise, which may increase the rank. The mathematical rank accounts for all the sources of variance of the system, including even less significant PCs that explain little variance [31].

Figure 1. Illustrative (trivial) example of PCA with 14 samples and 2 variables. The thick lines (red and green) are the directions (eigenvectors, or loadings) of the Principal Components (PCs). The thinner lines, perpendicular to the thick lines, are orthogonal projections of the data onto the directions of the PCs. These orthogonal projections (scores) are the PCs of the original data.

The scores are the result of the dimension reduction of the original variables, with the number of rows equal to the number of samples and the number of columns equal to the number of PCs (i.e., the number of significant sources of variation). The PCA decomposition can be written as

$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}$

where X is the original data matrix, T is the scores matrix, P is the loadings matrix, and E is the residual matrix.

Figure 2. Principal Component Analysis (adapted from [9]).

In SVD, the equation is defined as

$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^{\mathrm{T}}$

where U is a matrix with normalized scores, S is a diagonal matrix containing the singular values, and V is a matrix containing the eigenvectors (loadings). This equation is equivalent to the PCA equation, where the PCA scores T correspond to the multiplication of the normalized SVD scores U with the diagonal matrix of singular values S (i.e., T = US), and the PCA loadings P correspond to the SVD eigenvectors V [31].
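As an illustration of the SVD-based PCA described above, the following sketch computes scores as T = US and loadings as P = V with NumPy. The randomly generated data matrix and its dimensions are hypothetical, and mean-centring before the decomposition is assumed, as is common practice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data matrix: 14 samples, 5 variables, mean-centred before PCA.
X = rng.normal(size=(14, 5))
Xc = X - X.mean(axis=0)

# Singular Value Decomposition of the centred data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

T = U * s            # PCA scores, T = U S
P = Vt.T             # PCA loadings, P = V

# Rank-a reconstruction X ~ T_a P_a^T for a chosen number of components
a = 2
X_hat = T[:, :a] @ P[:, :a].T

explained = s**2 / np.sum(s**2)   # fraction of variance explained per component
print(np.round(explained, 3))
```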


Principal Component Regression

Principal Component (or Components) Regression (PCR) is a multivariate regression method proposed by William Massy in 1965 [32]. PCR is essentially a regression technique combining the least squares method and Principal Component Analysis (PCA). Multicollinearity in the data matrix X can pose challenges in the least squares modeling in MLR. Multicollinearity is characterized by large correlations between two or more variables in the independent data matrix X. The least squares method on such data results in large variances in the estimated regression coefficients, and consequently biased estimations (predictions) of the responses Ŷ are obtained. The dimensionality reduction of the independent variable data X by means of PCA is one of the solutions to overcome the problem of multicollinearity in complex multivariate data, i.e., matrices with many variables (columns) [21,33].

The algebraic equation that defines PCR is analogous to the MLR equation (Y = XB), but instead of applying the least squares to all the data in the matrix X, the PCA scores (T) of the matrix X are used instead:

$\mathbf{Y} = \mathbf{T}\mathbf{B}_{\mathrm{T}}$

where Y is a matrix (or a vector y) containing the response variable(s), T is a matrix with the PCA scores for the n 'significant' principal components, i.e., the number of components that explain the variance in the data matrix X, and B_T is a matrix (or a vector b_T) containing the least squares estimates.

Solving the equation above for B_T, i.e., to obtain the least squares estimates that relate the scores T with the responses Y, results in

$\hat{\mathbf{B}}_{\mathrm{T}} = (\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}\mathbf{T}^{\mathrm{T}}\mathbf{Y}$

The PCR model equation, relating X with Y, is Y = XB_PCR. Since the loadings P are orthogonal, the PCA equation X = TPᵀ can be rewritten as T = XP. Therefore, Y = TB_T becomes Y = XPB_T. Comparing with the PCR regression equation Y = XB_PCR, the least squares coefficients B_T (that relate T and Y) are related to the regression coefficients B̂_PCR (that relate X with Y) as

$\hat{\mathbf{B}}_{\mathrm{PCR}} = \mathbf{P}\hat{\mathbf{B}}_{\mathrm{T}}$

The deduction of PCR is explained in detail in the literature, employing different PCA algorithms (see Massy [32], Jolliffe [21], Brereton [9], or Geladi & Kowalski [34]).

The predicted responses Ŷ from a PCR model can be obtained from a predictor data matrix X and the PCR regression coefficients B̂_PCR as

$\hat{\mathbf{Y}} = \mathbf{X}\hat{\mathbf{B}}_{\mathrm{PCR}}$
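The following sketch illustrates the PCR steps just described (PCA on X, least squares on the retained scores, and back-transformation of the coefficients). The function names pcr_fit and pcr_predict, and the simulated multicollinear data, are hypothetical illustrations, not the implementation used in the papers.

```python
import numpy as np

def pcr_fit(X, Y, n_components):
    """Least squares regression on the leading PCA scores of mean-centred X."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                   # loadings of the retained components
    T = Xc @ P                                # scores, T = X P
    B_T = np.linalg.solve(T.T @ T, T.T @ Yc)  # B_T = (T'T)^-1 T'Y
    B_pcr = P @ B_T                           # coefficients relating X and Y
    return B_pcr, x_mean, y_mean

def pcr_predict(X, B_pcr, x_mean, y_mean):
    return (X - x_mean) @ B_pcr + y_mean

# Hypothetical multicollinear example: 30 samples, 10 correlated variables, 1 response.
rng = np.random.default_rng(2)
latent = rng.normal(size=(30, 2))                                  # two underlying sources
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(30, 10))
y = latent @ np.array([[1.5], [-0.8]]) + 0.05 * rng.normal(size=(30, 1))

B, xm, ym = pcr_fit(X, y, n_components=2)
y_hat = pcr_predict(X, B, xm, ym)
```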

Although PCR presents a solution for the problem of regression with complicated and multicollinear data, yet another more 'robust' regression method has been developed for modeling complicated data. This method, designated as Partial Least Squares (PLS), is explained in the next section.

PCR was employed in the modelling of the chromatographic retention times in Paper I. In Paper II, PCR was compared with three different Partial Least Squares (PLS) algorithms.


Partial Least Squares

Partial Least Squares (PLS), also designated as Projection to Latent Structures, is a widely employed multivariate regression analysis technique in several fields, e.g., analytical chemistry, physical chemistry, and biochemistry. The foundations of PLS are attributed to Herman Wold, with the development of the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm in the 1960s and 1970s. The NIPALS algorithm was first developed to model complicated data sets composed of numerous matrices (blocks of matrices), designated as path models, employed in social sciences data, namely in the field of economics [27,35,36]. Herman Wold also proposed the NIPALS algorithm as an alternative solution to Principal Component Analysis (PCA), which is still employed nowadays [37]. In the 1980s, the NIPALS algorithm was studied further and redeveloped for regression applications in chemistry, mainly by the groups of Svante Wold and Harald Martens, establishing the PLS (NIPALS) algorithm as it is known nowadays (the classical PLS) [34,36].

Partial Least Squares regression (PLS-R) and Principal Component Regression (PCR) were developed with the purpose of overcoming the multicollinearity problem in multiple linear regression (MLR). PCR is performed in two independent steps. First, the PCs of X are extracted (calculated). Then, the least squares method is employed to model the response data in Y with the PCs of X, i.e., the modelling makes use of the PCs 'of interest' in the X data while Y remains intact. This independence between Y and the PCs of X may introduce errors in the estimation of the regression coefficients, mainly because useful information that correlates X and Y may be lost in the residuals, i.e., in PCs that are not included in the model [34]. In PLS, the PCs (which are called latent variables) are calculated to explain not only X but also Y.

In PLS, latent variables (LVs) are calculated analogously to the way PCs are calculated in PCR. However, the LVs in PLS are calculated as the directions that explain the maximum covariance between X and Y, whereas in PCR (or PCA) the PCs have the directions that explain the maximum variance in X, independently of the variance in Y.

The NIPALS algorithm, employed in PLS, is a sequential algorithm in which the directions of the latent variables (designated as weights W), i.e., the 'eigenvector-like' directions of maximum covariance between the X- and Y-blocks, are determined one latent variable at a time.

Figure 3. PLS decomposition of X- and Y-blocks (adapted from [9]).

The PLS algorithm (NIPALS) is illustrated in Figure 4, where the following steps are repeated for each latent variable:

1. The first step is the calculation of the weight(s) w, which is a normalized vector with the combined variance–covariance of the X- and Y-block data. For each latent variable, the starting vector u (Y-scores) is one of the columns from the Y-block (one variable y). The vector u is iteratively recalculated (in step 4) until the resulting X-scores t (step 2) converge between iterations.

2. The X-scores vector t is calculated by projecting the X-block data onto the weights vector w.

3. The Y-loadings vector q is calculated from the Y-block data and the X-scores t (solving the relation Y = TQᵀ for Q).

4. The vector u (Y-scores) is calculated from the Y-block data and the Y-loadings q (solving the relation Y = UQᵀ for U). Note that in PLS the scores T are the same for the Y- and X-blocks, and the Y-scores u are just temporary scores employed in the iterative process to determine the 'actual' scores t.

5. The scores vector t is checked for convergence, i.e., compared with the scores obtained in the previous iteration. The process is repeated from step 1, using the vector u calculated in step 4, until convergence is reached. If there is only one response vector (y), convergence is reached after the first iteration.

6. The X-loadings vector p is calculated from the X-block data and the scores t (solving the relation X = TPᵀ for P).

7. Both X- and Y-blocks are deflated by subtracting the variance explained by the latent variable (as tpᵀ and tqᵀ, respectively).

8. The algorithm is repeated (from step 1), with the deflated X- and Y-blocks, as many times as the number of desired latent variables. The choice of the number of latent variables can be evaluated by cross-validation methods, which are described below.

Figure 4. PLS-NIPALS algorithm.

The regression coefficients B relating the predictor data in X and the response data in Y can be calculated from the weights, X-loadings, and Y-loadings (a common formulation is B_PLS = W(PᵀW)⁻¹Qᵀ), and the predicted responses are then obtained as

$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{B}_{\mathrm{PLS}}$
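The sketch below is a minimal, hypothetical implementation of the NIPALS steps listed above, including deflation and a common coefficient formulation; it assumes mean-centred X and Y blocks, and the function name pls_nipals and the usage names X0/Y0 are illustrative, not the code used in the papers.

```python
import numpy as np

def pls_nipals(X, Y, n_lv, max_iter=500, tol=1e-10):
    """Sketch of the NIPALS PLS algorithm (steps 1-8 above) on centred X and Y (Y is 2-D)."""
    X, Y = X.copy(), Y.copy()
    n, p = X.shape
    m = Y.shape[1]
    W, P = np.zeros((p, n_lv)), np.zeros((p, n_lv))
    T, Q = np.zeros((n, n_lv)), np.zeros((m, n_lv))
    for a in range(n_lv):
        u = Y[:, [0]]                                 # step 1: start u as one column of Y
        t_old = np.zeros((n, 1))
        for _ in range(max_iter):
            w = X.T @ u
            w /= np.linalg.norm(w)                    # step 1: normalized weights
            t = X @ w                                 # step 2: X-scores
            q = Y.T @ t / (t.T @ t)                   # step 3: Y-loadings
            u = Y @ q / (q.T @ q)                     # step 4: Y-scores
            if np.linalg.norm(t - t_old) < tol:       # step 5: convergence check
                break
            t_old = t
        p_a = X.T @ t / (t.T @ t)                     # step 6: X-loadings
        X -= t @ p_a.T                                # step 7: deflation of X
        Y -= t @ q.T                                  #          and of Y
        W[:, [a]], P[:, [a]], T[:, [a]], Q[:, [a]] = w, p_a, t, q   # step 8: next LV
    B = W @ np.linalg.inv(P.T @ W) @ Q.T              # coefficients, B = W (P'W)^-1 Q'
    return B

# Hypothetical usage on centred data (X0, Y0 are assumed 2-D data arrays):
# B_pls = pls_nipals(X0 - X0.mean(0), Y0 - Y0.mean(0), n_lv=3)
# Y_hat = (X_new - X0.mean(0)) @ B_pls + Y0.mean(0)
```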

The PLS-NIPALS algorithm can be employed to model a single response vector y in Y, or several responses simultaneously. These approaches are designated as PLS1 and PLS2, respectively. The results may differ when modelling each variable in Y separately or all together in a single model. In principle, PLS1 may produce better models than PLS2, and not the other way around. As suggested in the literature (see Brereton [9]), the best procedure is to test both methods, PLS1 and PLS2, and verify whether significant differences justify modelling a single response individually or modelling the whole Y-block.

Several adaptations of PLS have been developed since its inception in the 1970s. These adaptations tend to improve the method as a multivariate calibration technique, providing better predictive models, or to simplify the interpretation of results, i.e., of the model parameters (scores, loadings, and residuals).

One example of a modification of the original PLS algorithm is the Orthogonal Projections to Latent Structures (O-PLS) algorithm. The origin of this algorithm comes from another algorithm proposed by Svante Wold et al. [38], designated as Orthogonal Signal Correction (OSC). Trygg et al. [28] redeveloped the concept into a more practical algorithm. Multivariate instrumental data (e.g., spectroscopic data) often contain systematic variations that are unrelated to the responses Y. Ideally, these unrelated variations should end up in the residuals of the model E. However, in practice some unrelated variation is included in the components part of the model. This algorithm provides means to separate the systematic variation in X into a part that is correlated to Y (X_y) and another part that is not correlated to Y (X_o), i.e., that is orthogonal to Y. Modelling just the correlated part X_y can reduce the complexity and improve the interpretation of PLS models. A detailed description of the algorithm can be found in a publication by Trygg et al. [29], and some applications are reported in the literature, e.g., in drug design [39] and in the analysis of hyperspectral images [40].

There are numerous other adaptations of the classical PLS-NIPALS, e.g., Quadratic Partial Least Squares (QPLS) [41], Straightforward Implementation of Modification of Partial Least Squares (SIMPLS) [42], Interval Partial Least Squares (IPLS) [43], Weighted Partial Least Squares (WPLS) [44], Total PLS (TPLS) [45], etc. A review by Mehmood et al. [46] describes several developed modifications of PLS, as well as their most common applications in different fields in science.

A modified PLS algorithm was proposed by Cloarec [47], addressing the overfitting problem that PLS-NIPALS has with increasing noise in the data in X. This modified PLS algorithm was studied in the work presented in Paper II, which led to the development of another modified PLS algorithm. In Paper I, a PLS algorithm was employed to model chromatographic retention time shifts. In Paper IV, PLS was employed to analyze the experimental data from blood samples.


Method Validation

Chemometrics is dedicated, among other things, to the development of modelling methods that can relate chemical or physical properties of complex samples with analytical instrumental signals at a multivariate level, e.g., concentrations of several chemical compounds analyzed by ultraviolet-visible (UV-Vis) or near infrared (NIR) spectroscopy, or by mass spectrometry. Such relationships can be used to predict the properties of unknown samples analyzed under the same experimental conditions. The MLR, PCR and PLS methods explained in the previous sections are modelling techniques that can be used to construct predictive models (calibration or regression). Such predictive models can be quantitative, e.g., to determine chemical compound concentrations, or qualitative, e.g., to discriminate groups or classes of samples (male/female, healthy/unhealthy, etc.) in the data.

The quality of a predictive model depends on the quality of the data used in its construction and on its representativity of the system analyzed, i.e., whether the modelled data can represent all, or the most significant, sources of variability in the system. PCR and PLS require that a specific number of Principal Components (PCs) or Latent Variables (LVs) be used to model the data. If too few are used, the model will not explain enough variance in the studied variables, resulting in underfitted models. On the other hand, if too many are used, the result will be overfitted models [48].

Multiple Linear Regression (MLR) is a 'hard modelling' approach that relies on well-defined mathematical model equations. Provided such modelling requirements are fulfilled by the data, the quality of a 'hard model' can be assessed practically by its fit, i.e., from the prediction of the data that was used to construct the model itself (the calibration or training set), and ultimately by predicting the nominal values of an external dataset. The fit of a model can be expressed as the 'auto-prediction' coefficient of determination R², or as the Root Mean Square Error (RMSE) [49].

In soft modelling, e.g., in PCR or in PLS, the auto-prediction (R²) does not provide a reliable assessment of the predictive quality of a model. The R² increases with the number of components included in the model and reaches a 100% fit when all possible components are included in the model, i.e., the model eventually fits the noise as well. Cross-validation provides a more realistic assessment of the predictive ability, and when the number of latent variables has been established, an external data set should be tested to confirm the quality of the model.

In cross-validation, subsets of the calibration set (training set) samples are used to construct models, and these models are then used to predict the responses of the remaining (left-out) samples in the set. This process is repeated until the responses of all the samples have been predicted. The 'left-out' samples can be blocks of samples (2, 3, 4, …, half the set, etc.) or just one at a time (leave-one-out). For the latter, e.g., if the training set is composed of 20 samples, 20 calibration models are calculated, each one leaving a different sample out (out of the 20). These 20 models predict the responses of the samples that were left out of each model, and the residuals (calculated from the difference between the predicted and the real values) are used to calculate the cross-validation coefficient of determination, designated as Q². The value of Q² increases with the number of components used in the model until the rank of the system is reached, i.e., until the number of sources of variation (latent variables) that correlate X and Y is optimal. After that point, i.e., when including more latent variables than the actual rank of the system, the model will start to overfit the data, and Q² will start to decrease. The equations to calculate R², Q², and RMSE are presented in Paper II.
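A minimal sketch of leave-one-out cross-validation as described above is given below. The helper names (q2_loo, and the pcr_fit/pcr_predict functions from the earlier PCR sketch) are hypothetical, and Q² is computed here as 1 − PRESS/TSS, a common formulation.

```python
import numpy as np

def q2_loo(X, Y, n_components, fit, predict):
    """Leave-one-out cross-validation: Q2 = 1 - PRESS/TSS for a given model size.

    `fit(X, Y, n_components)` must return the model parameters that `predict` expects.
    """
    n = X.shape[0]
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                  # leave sample i out
        model = fit(X[keep], Y[keep], n_components)   # calibrate on the remaining samples
        y_pred = predict(X[[i]], *model)          # predict the left-out sample
        press += np.sum((Y[[i]] - y_pred) ** 2)
    tss = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - press / tss

# Hypothetical usage with the pcr_fit/pcr_predict sketches from the PCR section:
# q2_per_size = [q2_loo(X, y, a, pcr_fit, pcr_predict) for a in range(1, 6)]
# The number of components where Q2 peaks would be taken as the optimal model size.
```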

Although cross-validation is a good method to assess the optimal number of components to use in a model, the quality of prediction of the model is, as mentioned, better assessed by predicting the responses of an external test set employing the calibration model. The resulting coefficient of determination, or RMSE, for the latter is the best validation figure of merit for assessing the quality of prediction of a model [9].

There are other methods that can be employed in the validation of multivariate models, such as the bootstrap. In this case, the model is recalculated using a random selection of samples from the training set, predicting the responses of the respective samples used in the model. The process is repeated many times, and the average of the resulting residuals obtained in the iterations is used to calculate Q² [9,50]. Nonetheless, cross-validation is the validation approach most commonly employed for PLS or PCR in chemometrics, and it was used in the work presented in this thesis when applicable (Papers I, II and IV).


Design of Experiments

Experiments have a central role in science. Scientific studies, such as those related to chemistry and to industrial processes in general, often rely on experiments to research, develop and optimize processes. From such a perspective, the efficiency of experiments is an important aspect. Thus, it is necessary to improve the methods of experimental research, such as by applying statistical and mathematical methods, or by developing design of experiments methods [51].

Statistical experimental design, commonly known as Design of Experiments (DoE), is a concept that became formalized in statistics in the 1920s, although there are records of the application of such principles since the 18th century. DoE methods have become more available to scientists (other than dedicated statisticians) with the development of computers in the 1970s, and with the consequent availability of DoE programmable algorithm strategies or software applications. The optimization of chromatographic peak separation and the optimization of chemical extraction processes are some examples of the application of DoE in analytical chemistry [52].

Before considering the DoE statistical methods, one should consider the alternative methodologies, such as logical and intuitive methods to solve problems in the development of experiments. Such methodologies may include 'trial & error' experiments, or the 'One Factor at a Time' (OFAT) method. In OFAT, the effects of factors are experimentally studied by varying one factor at a time while maintaining the others at fixed values. This approach is potentially inefficient, mainly because the interactions between variables cannot be accounted for. For instance, when several hypothetical factors influencing a chemical reaction are studied, such as the concentration of the reagents, the pH, the temperature, and the time of the reaction, in a first set of experiments the temperature of the reaction is varied while maintaining the other factors fixed. Then, in another set of experiments, the pH is varied while maintaining the other factors fixed, and so on, until all the factors have been varied individually, and at several levels (different levels of pH, different temperatures, etc.), while maintaining the other factors fixed. Such a method can be efficient if the interactions between factors are not significant; e.g., if the temperature can influence the pH, varying these two factors independently may not account for their combined effect.

DoE methods can generally be divided into screening and optimization designs. In screening designs, the aim is to identify the factors that significantly affect the responses of a process. In optimization designs, the effects of factors known to influence a process are studied to determine how their effects maximize or minimize the responses, and thus optimize such a process.

Full factorial and fractional factorial designs are the most fundamental DoE screening methods, and are frequently mentioned in chemometrics books [9]. For the sake of this thesis, only these DoE methods are described here, as essential tools employed in part of the work developed (Paper I).

In factorial designs, all the factors are studied simultaneously, considering all the possible combinations between them. The number of experiments (N) depends on the number of factors (k) and the number of levels (l), where N = lᵏ. A design matrix (table) is generated with coded values, i.e., for each factor, the actual experimental unit level values (pH, temperature, etc.) are represented as equidistant numbers around zero, e.g., in a 5-level design the levels are [-2, -1, 0, 1, 2].
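As a small illustration of how such coded design matrices can be generated, the sketch below enumerates all level combinations of a full factorial design (N = lᵏ runs); the function name and the example factor/level choices are hypothetical.

```python
import itertools
import numpy as np

def full_factorial(n_factors, levels):
    """Coded full factorial design matrix with all level combinations (l**k rows)."""
    return np.array(list(itertools.product(levels, repeat=n_factors)), dtype=float)

# Hypothetical examples using coded levels:
design_2lvl_3fac = full_factorial(3, [-1, 1])              # 2-level, 3-factor design: 8 runs
design_5lvl_2fac = full_factorial(2, [-2, -1, 0, 1, 2])    # 5-level, 2-factor design: 25 runs
print(design_2lvl_3fac.shape, design_5lvl_2fac.shape)       # (8, 3) (25, 2)
```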

A full factorial design with 2 levels and 10 factors would require 2¹⁰ = 1024 experiments to estimate all the effects of the main factors and interactions. In such a case, although the number of levels is the lowest possible in a DoE design (2 levels with coded values [-1, 1]), the number of factors is quite large (10), and consequently the number of required experiments is large (1024). DoE is a part of chemometrics aiming at the optimization of processes and at providing the most efficient and least time-consuming experimental processes possible. Obviously, 1024 experiments is quite an excessive number for the purpose. Thus, there are DoE strategies that can reduce the number of experiments, while maintaining a certain level of control over the process of acquiring the best information possible from the experiments. With 10 factors and 2 levels, e.g., a fractional factorial design can reduce the number of experiments to half (2¹⁰⁻¹ = 512), or to even less: to 256, 128, 64, 32, or 16 experiments. Fractional factorial designs, or other experiment-reducing design strategies, such as Plackett-Burman experimental designs, may reduce the number of experiments substantially. Depending on the level of fractionation, main factors and interaction terms eventually become confounded with other main factors or with other interaction terms.

Only non-confounded factor effects can be modelled (estimated) properly. Typically, interactions between more than 2 factors (e.g., 3- or 4-factor interactions) can be assumed to have very small or insignificant effects on the responses. Although there may be some risks associated with such assumptions, such risks must be accepted in exchange for a reasonable (or smaller) number of required experiments. These approaches should depend on the system analyzed, and obviously on the perspective of the analyst, as there are no specific rules on which modelling approach should prevail.

Assuming that 2-factor interactions are significant above other higher combinations (interactions), a fractional factorial design with resolution IV (I + III) is usually the 'best' option, i.e., when the main factors (I) are only confounded with 3-factor interactions (III) or higher. With a lower resolution, i.e., resolution III (I + II), the main factors (I) are confounded with the 2-factor interactions (II) or higher. This can be risky when 2-factor interactions are significant and confounded with main factors. Following the example above, i.e., in a DoE with 10 factors and 2 levels, rather than 1024 experiments, 32 experiments would be required for a fractional factorial design with resolution IV, and 16 experiments for a design with resolution III.

In the development of the work presented in Paper I, several factors that hypothetically affect the chromatographic retention time shifts of compounds in liquid chromatography were investigated by employing a screening DoE, i.e., just to study the main factor effects. Such a design was employed to study the variation in retention time of several compounds simultaneously (over 50 models, one for each compound), while varying 8 hypothetically influencing factors (these factors are described in Figure 5 below). A reduced (fractional) factorial design with 2 levels was employed, and 2⁸⁻⁴ = 16 experiments (resolution IV) were performed for this stage of the study.

After studying the results from the screening design, 2 of the 8 studied factors were chosen for further study. These 2 factors (the temperature and the pH of the mobile phase) were studied via a full factorial design with 5 levels and 2 factors.

A higher number of experimental levels (more than 2) allows the factors to be studied in terms of higher-order effects (other than just interactions), such as quadratic or other higher-order effects that the factors may have on the response data. Such a 'high-resolution' approach (5 levels) is obviously costly regarding the number of experiments; thus only 2 factors were studied, to obtain a reasonably small number of experiments (5² = 25).

Other than screening and optimization, or the generation of experimental data for studies based on controlled factors, DoE can also be employed to construct multivariate calibration mixtures. In such a case, each sample (experiment) contains a mixture of analytes (factors) at different concentrations (levels). These factors are mutually independent (orthogonal) within the experimental dataset (samples).

In DoE, independently of the method of choice, the design matrices are constructed in such a way that the variables (factors) are mutually orthogonal.
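This orthogonality can be verified directly, as in the short sketch below for a coded 2-level full factorial: the cross-product matrix of the factor columns is diagonal, i.e., all cross-products between different factors are zero:

```python
# Sketch: factor columns of a balanced coded design are mutually orthogonal,
# so X.T @ X is diagonal (all cross-products between different factors are zero).
import numpy as np
from itertools import product

X = np.array(list(product([-1, 1], repeat=3)))  # 2**3 full factorial, coded levels
print(X.T @ X)
# [[8 0 0]
#  [0 8 0]
#  [0 0 8]]
```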


Signal processing

Analytical instrumental techniques such as near infrared spectroscopy (NIR) or liquid chromatography hyphenated to mass spectrometry (LC-MS), or hyphenated to UV-Vis diode array detectors (LC-DAD), provide analytical information from samples by computerized processes, as digitized signals.

A chromatogram or a NIR spectrum, typically, consists of several signal peaks, with different retention times or wavelengths respectively, with different intensities, and different shapes. In chromatographic hyphenated techniques, such as in LC-MS systems, one spectrum is obtained at each chromatographic retention time scan in each sample analyzed, whereas in a NIR analysis only one spectrum is processed for each sample analyzed.

Superimposed peaks are common in NIR spectra, due to the overlapping of the pure chromophore spectral peaks of chemical compounds. In chromatographic hyphenated methods, although the chromatographic process aims to separate the compounds in a sample and ideally to minimize the overlap of detected spectral signals, the overlapping of signals (peaks) is expected in complex mixtures of chemical compounds. Thus, although from a different data dimensionality perspective, both hyphenated chromatographic data and spectrophotometric data should be expected to be composed of superimposed peaks, influenced, or distorted, by noise or by overlapping neighboring peaks. Chemometrics is dedicated in part to the development of signal-processing methods, with the purpose of revealing the underlying information in such complex multivariate analytical data structures. [9]

The goal of signal processing methods is to obtain signals from the data that are as pure as possible, which, e.g., is fundamental in the establishment of correlations between controllable analytical variables, such as concentrations of analytes, and the respective pure (or as pure as possible) instrumental signals.

Digital signal processing methods related to analytical chemistry data are divided into two categories, designated as domain transformations and filter methods. The latter includes, e.g., polynomial least-squares smoothing, differentiation, median smoothing, matched filtering, boxcar averaging, and Kalman filtering. Domain transformations include, e.g., Fourier transforms (FT) and wavelet transforms (WT). [53]
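As a hedged illustration of these two categories (with an invented synthetic signal and arbitrary filter settings), the sketch below smooths a noisy peak with a Savitzky-Golay polynomial least-squares filter and, alternatively, removes high-frequency components after a Fourier transform:

```python
# Sketch: a filter method (Savitzky-Golay smoothing) and a domain transformation
# (Fourier low-pass filtering) applied to the same synthetic noisy peak.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
clean = np.exp(-0.5 * ((t - 5) / 0.4) ** 2)             # clean Gaussian peak
noisy = clean + rng.normal(scale=0.05, size=t.size)     # added random noise

smoothed = savgol_filter(noisy, window_length=21, polyorder=3)

spectrum = np.fft.rfft(noisy)
spectrum[30:] = 0                                        # crude low-pass cut-off (illustrative)
fourier_filtered = np.fft.irfft(spectrum, n=t.size)

for name, x in [("raw", noisy), ("SG", smoothed), ("FT", fourier_filtered)]:
    print(name, np.std(x - clean))
```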

Noise is always expected to be present in any data acquisition process. In analytical chemistry instrumental data analysis, the nature of the noise can be defined as random or chemical. Random noise, moreover, can be defined as structured (with patterns over the acquired signals) or stationary (independent from the signals). The random noise observed in analytical chemistry instrumental methods is typically stationary, and it can have two behaviors, designated as homoscedastic and heteroscedastic noise. In the former, the level of noise does not vary with the intensity of the signals, i.e., it is stable and constant, and it is typically considered normally distributed, with positive or negative increments added to the signals. Heteroscedastic noise is also considered normally distributed, with positive or negative increments to the signals, but it is proportional to the magnitude of the signals. For the latter, the signal-to-noise ratio is typically considered constant.

The type of stationary noise, whether homoscedastic or heteroscedastic, depends both on the architecture of the analytical instrument and on the analytical method employed, i.e., the variance of replicate signals (different samples) may vary, significantly or not, with the magnitude of the signals, depending on the method employed. For instance, in linear regression methods, such as the classical univariate least-squares calibration, the noise is usually assumed to be homoscedastic in the linear range of a calibration model. Linear regression methods in analytical chemistry are typically limited to a certain dynamic range, depending on the close-to-linear relationship that the instrument signals have with the concentration of the analytes. In such a case, the noise heteroscedasticity may, in principle, be considered insignificant due to such limited ranges in signal magnitude. Studies have demonstrated that heteroscedastic noise is present in data acquired from LC-DAD (liquid chromatography hyphenated to diode array detection, UV-Vis) instruments, and even in spectrophotometric methods, such as NIR-FT. Therefore, independently of the analytical instrumentation, heteroscedastic noise is, in theory, always expected to some extent. [9,11,54]
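The two behaviors can be simulated as in the following short sketch (all parameters invented for illustration): the homoscedastic noise has a constant standard deviation, whereas the heteroscedastic noise has a standard deviation proportional to the signal magnitude, which keeps the signal-to-noise ratio roughly constant:

```python
# Sketch: homoscedastic vs heteroscedastic noise added to the same clean peak.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 500)
clean = np.exp(-0.5 * ((t - 5) / 0.6) ** 2)

homoscedastic = clean + rng.normal(scale=0.02, size=t.size)       # constant noise level
heteroscedastic = clean + rng.normal(scale=0.05 * clean + 1e-3)   # noise grows with the signal

# The residual spread is roughly constant along the homoscedastic trace,
# but increases with the signal intensity in the heteroscedastic one.
print(np.std(homoscedastic - clean), np.std(heteroscedastic - clean))
```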

Along with the instrumental random noise (high frequency), there is also another source of noise designated as chemical noise, which is characteristic of certain analytical approaches, namely LC-MS. Chemical noise typically results from the contributions of low-intensity interferent compound signals, which can be difficult to distinguish from low-intensity signals of interest. Such interferents occur in complex samples, e.g., biological samples, or are eventually present in the mobile phase used in liquid chromatography. [55]

Such overlapping interferents in LC-MS have a suppressive effect on the ionization process (electrospray ionization), which affects the signal magnitudes (designated as matrix effects). Chemical noise behaves differently from the random instrumental noise, and consequently such signal contributions to the data cannot be easily filtered by the same processes that filter random noise. High-frequency random noise can be removed by domain transformations (e.g., FT and WT). [56]

Wavelet transforms gained popularity in the 1980s, with many developments following. In a Fourier transform, a signal acquired in the time domain is transformed into a frequency domain. [58] Hence, FT is employed as a signal pre-processing method in interferometer detectors, such as in NIR-FT (Near Infrared – Fourier Transform), or in the state-of-the-art high-resolution orbitrap mass analyzers, where the periodic signals generated in the orbitrap mass analyzer are transformed and converted into m/z signals. In WT, different wavelet functions can be applied to a signal. WT can have a better specificity to fit signal functions than FT, e.g., for post-processing mass and spectroscopic spectral data.
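A minimal sketch of wavelet-based denoising is given below, using the PyWavelets package; the chosen wavelet ('db4'), decomposition level, and threshold are arbitrary illustrative values, not settings from any method in the thesis:

```python
# Sketch: wavelet-transform denoising of a synthetic noisy peak (PyWavelets).
import numpy as np
import pywt

rng = np.random.default_rng(2)
t = np.linspace(0, 10, 512)
clean = np.exp(-0.5 * ((t - 5) / 0.3) ** 2)
noisy = clean + rng.normal(scale=0.05, size=t.size)

coeffs = pywt.wavedec(noisy, 'db4', level=5)
# Soft-threshold the detail coefficients; keep the approximation coefficients.
denoised_coeffs = [coeffs[0]] + [pywt.threshold(c, 0.1, mode='soft') for c in coeffs[1:]]
denoised = pywt.waverec(denoised_coeffs, 'db4')[:t.size]
print(np.std(noisy - clean), np.std(denoised - clean))
```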

However, the denoising of hyphenated data by these transform methods, such as the removal of chemical noise (interferents) in LC-MS data, can be complicated due to similarities between the chemical noise signals and low-intensity signals of interest. The sampling frequency along the chromatographic time domain can be low for low-abundance compounds, i.e., peaks of low-abundance compounds are represented by too few spectral data points, which may be difficult to distinguish from chemical or random noise when applying domain transform approaches such as FT or WT. [56] Thus, for feature detection in chromatographic data from complex samples, alternative peak detection methods are sought.

In the work presented in Paper III, a graphical user interface (GUI) program (named nLossFinder) was developed to analyze HPLC-HRMS data.

In a part of this program, algorithms are employed to extract features from the data, i.e., to detect peaks in pure ion chromatograms (PICs), which are extracted from the raw data. First, an algorithm based on a nearest-neighbors method extracts the PICs from the raw data. Then, a match filter is employed to detect peaks in these PICs. This match filter uses the second derivative of a Gaussian equation as a curve (peak) fitting function for the PIC data. Then, another curve is calculated, based on an estimate of the noise of the PIC data, as a weighted standard deviation of the PIC data at each data point. Overlaying the match filter output on the estimated noise curve provides the identification of peaks in the PICs, allowing low-intensity peaks to be discriminated from chemical noise. This match filter strategy (using the second derivative of the Gaussian equation) was first proposed by Danielson et al. [59], and adapted in the development of TracMass2 [60], which is a program developed for analyzing HPLC-HRMS data, with the purpose of extracting and aligning chromatographic features in numerous and complex samples. The peak detection algorithms in nLossFinder (Paper III) were adapted from TracMass2.
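The general idea can be sketched as below (a simplified illustration, not the actual nLossFinder or TracMass2 code): the PIC is convolved with the second derivative of a Gaussian, and points where the filter response clearly exceeds a robust noise estimate are flagged as belonging to a peak:

```python
# Sketch: matched filtering of a pure ion chromatogram (PIC) with the second
# derivative of a Gaussian (simplified illustration; not the nLossFinder code).
import numpy as np

def gaussian_second_derivative(half_width, sigma):
    x = np.arange(-half_width, half_width + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    return (x**2 / sigma**4 - 1 / sigma**2) * g      # d2/dx2 of a Gaussian

rng = np.random.default_rng(3)
scan = np.arange(600)
pic = (50 * np.exp(-0.5 * ((scan - 300) / 4) ** 2)   # one chromatographic peak
       + rng.normal(scale=1.0, size=scan.size))      # random noise

kernel = -gaussian_second_derivative(half_width=15, sigma=4.0)  # sign flipped: peaks give a positive response
response = np.convolve(pic, kernel, mode='same')

noise_level = np.median(np.abs(response)) / 0.6745   # robust noise estimate of the filter response
detected = np.flatnonzero(response > 5 * noise_level)
print(detected.min(), detected.max())                # scan indices flagged around the peak apex
```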

Another signal processing filter was employed in the work presented in Paper II. The Savitzky-Golay (SG) algorithm combines polynomial least-squares smoothing and a differentiation filter, and can be employed in the estimation of underlying signals in the presence of noise. [59] In the work presented in Paper II, SG has been implemented to estimate the noise in spectra. This noise estimate is then introduced into the proposed modified PLS algorithm, as a data pre-processing step embedded in the algorithm.
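A minimal sketch of the general idea is given below (not the exact procedure of Paper II): the spectrum is smoothed with a Savitzky-Golay filter and the residual between the measured and smoothed spectrum is taken as a noise estimate; the window length and polynomial order are arbitrary illustrative choices:

```python
# Sketch: noise estimation in a spectrum via Savitzky-Golay smoothing
# (general idea only; filter settings are illustrative).
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(4)
wavelength = np.linspace(400, 700, 301)
clean = np.exp(-0.5 * ((wavelength - 550) / 25) ** 2)              # clean spectral band
measured = clean + rng.normal(scale=0.01, size=wavelength.size)

smoothed = savgol_filter(measured, window_length=15, polyorder=2)
noise_estimate = measured - smoothed                                # residual as a noise estimate
print(np.std(noise_estimate))   # close to the simulated noise level (0.01)
```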

References
