
Linköping University | Department of Physics, Chemistry and Biology
Bachelor project thesis, 16 hp | Educational Program: Chemical Analysis Engineering
Spring term 2020 | LITH-IFM-x-EX--20/3854--SE

Evaluation of Homogeneity in Drug Seizures Using Near-Infrared (NIR) Hyperspectral Imaging and Principal Component Analysis (PCA)

Olle Strindlund

Examiner: Johan Dahlén

Division, Department: Department of Physics, Chemistry and Biology, Linköping University
Date: 2020-05-20
Language: English
ISRN: LITH-IFM-x-EX--20/3854--SE
Title: Evaluation of Homogeneity in Drug Seizures Using Near-Infrared (NIR) Hyperspectral Imaging and Principal Component Analysis (PCA)
Author: Olle Strindlund
Keywords: Homogeneity, drug seizures, pharmaceuticals, NIR hyperspectral imaging, chemometrics, PCA


Abstract

The selection of a representative sample is a delicate problem when drug seizures comprising a large number of units arrive at the Swedish National Forensic Centre (NFC). If deviating objects are found in the selected sample, additional analyses are required to investigate how representative the results are for the entire population. This generates further pressure on the operational analysis flow. With the goal of providing a tool on which forensic scientists at NFC can base their assessment of how representative the selected sampling of large drug seizures is, this project investigated the possibilities of evaluating the level of homogeneity in drug seizures using near-infrared (NIR) hyperspectral imaging along with principal component analysis (PCA). A total of 27 sample groups (homogeneous, heterogeneous and seized sample groups) were analyzed and different predictive models were developed. The models were based on quantifying the variation either in NIR spectra or in PCA scores plots. It was shown that in the spectral range of 1300-2000 nm, using a pre-processing combination of area normalization, quadratic (second-polynomial) detrending and mean centering, promising predictive abilities of the models in their evaluation of the level of homogeneity in drug seizures were achieved. A model in which the approximated signal-dependent variation was related to the quotient of significant and noise explained variance given by PCA showed the most promising predictive abilities when quantifying the variation in NIR spectra. Similarly, a model in which a rectangular area, defined by the maximum distances along PC1 and PC2, was related to the cumulative explained variance of the two PCs showed the most promising predictive abilities when quantifying the variation in PCA scores plots. Different zones, within which sample groups are expected to appear based on their degree of homogeneity, could be established for both models. The two models differed in sensitivity. However, more comprehensive studies are required to evaluate the models' applicability from an operational point of view.

Keywords: Homogeneity, drug seizures, pharmaceuticals, NIR hyperspectral imaging, chemometrics, PCA


Abbreviations

API – Active pharmaceutical ingredient
CI – Chemical imaging
CV – Coefficient of variation
EDA – Exploratory data analysis
NFC – Swedish National Forensic Centre
NIPALS – Nonlinear iterative partial least squares
NIR – Near-infrared
NPS – New psychoactive substances
PC – Principal component
PCA – Principal component analysis
SD – Standard deviation
SVD – Singular value decomposition
S/N – Signal-to-noise ratio


Contents

1. Introduction ... 1

1.1 Background and problem formulation ... 1

1.2 Swedish National Forensic Centre ... 1

1.3 Previous studies ... 2

1.4 Aim of thesis ... 3

1.5 Limitations ... 3

2. Theory ... 4

2.1 NIR spectroscopy ... 4

2.1.1 Vibrational spectroscopy ... 4

2.1.2 Theory of vibrational spectroscopy ... 4

2.1.3 The polyatomic molecule ... 5

2.1.4 Interpreting NIR spectra ... 5

2.2 NIR hyperspectral imaging ... 6

2.3 Pre-processing data ... 7

2.3.1 The purpose of pre-processing data ... 7

2.3.2 Normalization ... 7

2.3.3 Quadratic detrending ... 7

2.3.4 Mean centering ... 7

2.4 Techniques for exploratory analysis ... 8

2.4.1 What is exploratory analysis? ... 8

2.4.2 PCA ... 8

2.4.3 Cluster analysis ... 10

3. Materials and methods ... 11

3.1 Sample groups ... 11

3.1.1 Homogeneous sample groups ... 11

3.1.2 Heterogeneous sample groups ... 11

3.1.3 Seized sample groups ... 11

3.2 Instrumentation ... 11

3.3 Data treatment ... 12

3.4 Quantifying variation in NIR spectra ... 12

3.4.1 Predictive models ... 12

3.4.2 Choosing wavelength range ... 12

3.4.3 Model A ... 12

3.4.4 Model B ... 12

3.5 Quantifying variation in PCA scores plots ... 13

4. Results and discussion ... 14

4.1 Finding conditions for measurements ... 14

4.1.1 Evaluating frames for the UmBio tray ... 14

4.1.2 Choosing wavelength range ... 15

4.2 Quantifying variation in NIR spectra ... 18

4.2.1 Choosing the number of significant PCs ... 18

4.2.2 Model A ... 20

4.2.3 Analysis of seized sample groups ... 23

4.2.4 Model B ... 25

4.3 Quantifying variation in PCA scores plots ... 33

4.3.1 Visual assessment of scores plot ... 33


4.3.3 Predictive models ... 36

4.4 Ethical and societal aspects ... 44

5. Conclusions ... 45

Acknowledgments ... 46

References ... 47

Appendix ... 50

Appendix I. System and process ... 50

Appendix II. The UmBio Inspector ... 52

Appendix III. Effect on NIR spectra after pre-processing ... 54

Appendix IV. Choosing the number of PCs ... 55

Appendix V. Reflection problems for certain capsules ... 57

Appendix VI. Scores plots ... 59


1. Introduction

1.1 Background and problem formulation

The illicit importation, distribution and consumption of narcotic substances and pharmaceuticals are a growing problem in Sweden [1,2]. A doubling of handled cases of drug-related seizures over the last ten years was reported by the Swedish National Forensic Centre (NFC), which is responsible for analyzing all drug seizures in Sweden [3]. This development is partly due to Europe's emergence as an integral part of the global drug market, increasing availability, and partly due to a shift in priority by the Swedish Police Authority and Swedish Customs [1,4]. Cannabis makes up the majority of seizures in Sweden [2]. However, seizures of pharmaceuticals regulated as narcotics, such as synthetic opioids or benzodiazepines, have increased rapidly since the nineties, and often comprise large numbers of tablets or capsules [1,2].

When large seizures of tablets or capsules arrive at NFC, selection of a representative sample size for analysis is required. The sample size must be large enough to provide a true representation of the entire seizure, but not so large that it puts unnecessary strain on the operational analysis flow at NFC. If deviating objects are found in the selected sample, additional analyses are required to determine how representative the results are for the entire population, which generates further constraints on the operational analysis flow. To circumvent this, NFC purchased the UmBio Inspector in 2012. The UmBio Inspector is a near-infrared (NIR) hyperspectral imaging instrument which provides a rapid, non-destructive analysis technique suitable for large sample quantities. Using the NIR hyperspectral imaging system, the homogeneity of large drug seizures is meant to be studied; from the results, the representative nature of the selected sample size can be assessed. In combination with principal component analysis (PCA), a multivariate analysis technique in chemometrics, previous studies at NFC have shown that this instrument can differentiate samples in a seizure based on chemical composition, i.e. detect deviating objects in an inhomogeneous seizure [5,6]. However, before the instrument is implemented into routine procedures, its ability to evaluate the level of homogeneity within a seizure needs to be studied.

1.2 Swedish National Forensic Centre

NFC operates as an independent expert organization under the Swedish Police Authority and is responsible for the overall forensic processes in the Swedish judicial system [7]. The organization has about 500 employees and is divided into four main units: the document and information technology unit, the biological unit, the chemical and technology unit and the drug analysis unit. Most of the organization's capacity is located at the central forensic laboratory in Linköping, but there are also eleven smaller, regionally operated laboratories spread across Sweden, from Malmö in the south to Umeå in the north [8]. In addition to its responsibility for conducting forensic investigations and analyses, NFC also has an obligation to conduct research, development and education in the forensic field [7]. The drug analysis unit's main duties are to investigate, through qualitative and quantitative chemical analyses, whether seizures contain any substance regulated as a narcotic substance, a good dangerous to public health or a doping agent. The unit also identifies previously unidentified substances that could be regulated in the future, such as new psychoactive substances (NPS). In 2018, a total of about 50 000 cases were processed at the drug analysis unit [3].


1.3 Previous studies

Traditional NIR spectroscopy has found broad applications in many fields of industry, especially within the food and beverage industry [9]. Applications for the determination of moisture content, protein structure as well as carbohydrate and fat content in various food or beverage samples have been reported [9-12]. NIR hyperspectral imaging, however, is an evolution of traditional NIR spectroscopy which, beyond the spectra, introduces spatial dimensions of the samples. It is a technique that has increased in popularity over the years due to its versatile, rapid and non-destructive analysis that requires little to no sample preparation. The technique has applications in the fields of pharmaceuticals, biomedicine and the food industry [13].

One study reported the possibility of performing both a qualitative and a quantitative assessment of homogeneity at different blending stages of pharmaceutical samples using NIR hyperspectral imaging (or NIR chemical imaging (CI)) combined with different chemometric models [14]. Homogeneity in this context refers to the degree to which a homogeneous mixture of the active pharmaceutical ingredient (API) and the excipients has been achieved. Analysis was carried out on four binary mixtures of ibuprofen and starch at different concentrations using diffuse reflectance mode on a NIR hyperspectral imaging spectrometer (Think Spectrally Roda-25, Valencia, Spain) with a mercury-cadmium-telluride detector. Images had a resolution of 320×256 pixels, and spectra were measured with 10 scans in the range 1200-2000 nm, with 7 nm increments, and a total acquisition time of 120 s. Savitzky-Golay smoothing and standard normal variate (SNV) transformation were the chosen pre-processing techniques. PCA, cluster analysis (K-means and Fuzzy C-means clustering) and correlation coefficients were investigated for the qualitative assessment of homogeneity of the samples, while classical least-squares regression (CLS) and multivariate curve resolution-alternating least squares (MCR-ALS) were used for the quantitative investigation. The study showed that PCA, Fuzzy C-means and correlation coefficients all have potential in exploratory (qualitative) analysis of homogeneity at different stages of blending, while K-means clustering showed tendencies toward misinterpretation of results. For quantitative investigations, spectra of the pure compounds were required. Both CLS and MCR-ALS showed promise in predicting concentrations in the samples; however, CLS predicted starch concentrations poorly [14].

Another study investigated the possibility of using NIR hyperspectral imaging combined with different multivariate data processing methods to assess the homogeneity of dried sugar-protein mixtures [15]. Proteins are an integral part of biopharmaceuticals, and producing stable proteins can be challenging. Sugars or other additives are therefore added to stabilize the proteins, and a homogeneous protein-sugar distribution is desired as it improves stability. In the study, lysozyme and trehalose mixtures with different lysozyme contents were prepared as tablets and analyzed using a Sapphire NIR Imager with Sapphire Go Software (v. 1.0). Images had a resolution of 160×160 pixels, and spectra were measured between 1200 and 2400 nm, with 4 nm increments. Normalization, mean centering and scaling were used as pre-processing techniques. It was concluded that multivariate approaches (PCA and correlation coefficients) were more suitable for homogeneity determination in the samples than univariate ones (single-wavelength or peak-ratio methods). It was found that the second principal component in PCA could be used to determine sample homogeneity. As a quantitative approach, partial least-squares (PLS) regression was the most accurate in describing sample homogeneity [15].


1.4 Aim of thesis

The aim of this project was to gain further insight into the evaluation of homogeneity in drug seizures using NIR hyperspectral imaging. A quantitative measure of the expected variation in different combinations of pre-processed NIR spectra and in PCA scores plots, for a range of pharmaceutical products, seized sample groups and sample groups of known heterogeneity, was investigated in order to provide a reference point in the evaluation of homogeneity in drug seizures. The sample groups of known heterogeneity were either sample groups that had differences in API concentration, or sample groups that had different APIs within their respective populations.

1.5 Limitations

As the project will only be conducted for ten weeks, it will be part of the larger, long-term project conducted at NFC, in which the implementation of the NIR hyperspectral imaging system (UmBio Inspector) for evaluating homogeneity in drug seizures into routine procedures is under investigation. Planning and evaluation of the project can be seen in Appendix I. Moreover, in this project, homogeneity will only be investigated in the sense of the cohesiveness between samples within a population, and not in a manner of identification or quantification of substances.


2. Theory

2.1 NIR spectroscopy

2.1.1 Vibrational spectroscopy

To understand NIR spectroscopy, the fundamentals of infrared spectroscopy need to be comprehended. Infrared spectroscopy covers a range of spectroscopic techniques in which spectra are generated by molecules interacting with electromagnetic radiation, causing them to vibrate; this is why it is also referred to as vibrational spectroscopy [16,17]. The most common techniques utilizing this phenomenon are NIR, mid-infrared (MIR) and Raman spectroscopy. They differ in that Raman spectroscopy is based on scattering of radiation, whilst MIR and NIR are based on absorption processes [17].

2.1.2 Theory of vibrational spectroscopy

When the atoms in a diatomic molecule interact with energy through radiation (e.g. infrared wavelengths), they begin to vibrate, and the distance between them changes in an oscillating fashion. If the system experiences a restoring force proportional to the displacement (R), the potential energy (V) of a chemical bond in the diatomic molecule can be described by the quantum harmonic oscillator [16,18]. That potential energy is given by the parabolic function

$$V(R) = \frac{1}{2}k(R - R_e)^2 \qquad (1)$$

where k is the force constant, R is the distance of displacement (vibrational distance), $R_e$ is the equilibrium distance between the atoms prior to the vibration and $(R - R_e)$ is the change in distance from equilibrium during the vibration. The vibrational frequency, $v_0$, of the vibration can consequently, via the harmonic oscillator, be described as

$$v_0 = \frac{1}{2\pi}\sqrt{\frac{k}{\mu}} \qquad (2)$$

where $\mu$ is the reduced mass of the diatomic molecule, given by

$$\mu = \frac{m_1 m_2}{m_1 + m_2} \qquad (3)$$

When solving the Schrödinger equation for this system, it is found that the only allowed energy levels ($E_v$) are given by

$$E_v = \left(v + \frac{1}{2}\right)\hbar\omega, \qquad \omega = \sqrt{\frac{k}{\mu}}, \qquad v = 0, 1, 2, 3, \dots \qquad (4)$$

where $\hbar$ is the reduced Planck constant, $\omega$ is the frequency of the vibrational oscillation and $v$ is the vibrational quantum number. In this system, transitions (absorption or emission) are only allowed between adjacent energy levels, so that the specific selection rule

$$\Delta v = \pm 1 \qquad (5)$$

is maintained. However, at larger vibrational excitations, the approximation of the harmonic oscillator poorly reflects reality [16]. As the approximation does not allow the bond to dissociate in its parabolic form, the anharmonic model is introduced. In the anharmonic system the restoring force is no longer proportional to the displacement, i.e. the bond can dissociate at larger vibrational excitations [9,18]. In this system the allowed energy levels must be described differently, and when solving the Schrödinger equation they can be given as

$$E_v = \left(v + \frac{1}{2}\right)hc\tilde{v} - \left(v + \frac{1}{2}\right)^2 \chi hc\tilde{v} \qquad (6)$$

where v is the vibrational quantum number, h is Planck's constant, $\tilde{v}$ is the wavenumber corresponding to the fundamental vibrational frequency defined in Equation 2 ($v_0 = c\tilde{v}$) and $\chi$ is the anharmonicity constant, defined as

$$\chi = \frac{hc\tilde{v}}{4D_e} \qquad (7)$$

where $D_e$ is the depth of the potential well in the anharmonic system [18]. In this system, the selection rule described in Equation 5 no longer applies, meaning that transitions to higher energy levels are possible, such as ∆v = ±2, ±3, …. These transitions are known as overtones: the transition ∆v = ±2 is the first overtone, ∆v = ±3 the second overtone, and so on. These overtones are of great importance for NIR spectroscopy; the observed absorption bands in the NIR region of 700-2500 nm (14 300-4000 cm⁻¹) are almost exclusively derived from overtones and combinations of vibrational transitions [9]. For MIR and Raman spectroscopy these overtones are of less importance as those techniques deal with less energetic wavelengths [16].
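As an illustration of Equation 6, the sketch below computes the positions of the fundamental band and the first two overtones for a generic C–H stretch. The fundamental wavenumber (3000 cm⁻¹) and anharmonicity constant (0.02) are assumed, illustrative values, not taken from the thesis; the point is that the overtones land inside the 700-2500 nm NIR window.

```python
# Anharmonic-oscillator energy levels (Equation 6), expressed as
# wavenumbers: E_v / (hc) = (v + 1/2) * wn - (v + 1/2)**2 * chi * wn
def level(v, wn, chi):
    """Energy of vibrational level v, divided by hc (i.e. in cm^-1)."""
    return (v + 0.5) * wn - (v + 0.5) ** 2 * chi * wn

def band(v_to, wn, chi):
    """Wavenumber (cm^-1) of the 0 -> v_to absorption band."""
    return level(v_to, wn, chi) - level(0, wn, chi)

# Illustrative values for a generic C-H stretch (assumed):
wn, chi = 3000.0, 0.02

for name, v in [("fundamental (dv=1)", 1),
                ("1st overtone (dv=2)", 2),
                ("2nd overtone (dv=3)", 3)]:
    cm = band(v, wn, chi)
    print(f"{name}: {cm:.0f} cm^-1 = {1e7 / cm:.0f} nm")
# fundamental (dv=1): 2880 cm^-1 = 3472 nm  (mid-IR)
# 1st overtone (dv=2): 5640 cm^-1 = 1773 nm (inside the NIR range)
# 2nd overtone (dv=3): 8280 cm^-1 = 1208 nm (inside the NIR range)
```

Note how anharmonicity makes each overtone fall slightly below an integer multiple of the fundamental, as Equation 6 predicts.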

2.1.3 The polyatomic molecule

When considering a polyatomic molecule, vibrational modes are more complicated. For a linear molecule with N atoms, the number of vibrational degrees of freedom is 3N – 5, while for a nonlinear molecule it is 3N – 6. From the vibrational degrees of freedom, the number of fundamental vibrational frequencies of the molecule is given (Equation 2) [9]. There are different types of vibrations, which generally are described as bending or stretching. Bending involves a change in the bond angle in a molecule or group and is further divided into four types: wagging, rocking, twisting and scissoring. Stretching, on the other hand, involves vibrations that cause a continuous elongation and shortening of the bond between atoms (symmetrical or asymmetrical). All of these types of vibrations may cause overtones or combinations of vibrational transitions; however, the likelihood of anharmonicity decreases with heavier atoms in the molecule [9].

2.1.4 Interpreting NIR spectra

Almost all observed absorption bands in NIR spectroscopy are a consequence of overtones of stretching vibrations, or of combinations of bending and stretching vibrations, in AHy functional groups [9]. As hydrogen is so light relative to other atoms, it deviates from the harmonic potential quite easily when undergoing stretching vibrations, increasing the likelihood of overtones and combinations. Generally, overtone and combination bands of carbon-hydrogen (CH), oxygen-hydrogen (OH) and nitrogen-hydrogen (NH) stretching frequencies are most readily observable in NIR spectra, due to their abundance in organic molecules and because these bonds possess the largest fundamental vibrational energies, increasing anharmonicity [17]. Moreover, because the probability decreases for each successive overtone, the intensity of the corresponding absorption bands also decreases, usually by a factor of 10 to 100 between overtones. As a result, interpretation of untreated NIR spectra becomes difficult due to a decrease in specificity. However, in combination with multivariate data analysis methods, NIR spectroscopy becomes a powerful analytical technique [16].

2.2 NIR hyperspectral imaging

NIR hyperspectral imaging combines spectra with spatial information to increase the amount of data that can be extracted from NIR spectroscopy [16]. The hyperspectral image can be seen as a three-dimensional data matrix, also known as a hyperspectral cube (Figure 1). The cube has two planar dimensions (x- and y-axes) and one spectral dimension (z-axis). The planar dimensions constitute pixel coordinates in the image and accommodate the spatial information about the samples, and for every pixel an entire NIR spectrum is represented in the spectral dimension [17]. By introducing the spatial dimensions, NIR hyperspectral imaging enables identification of the distribution of chemical composition throughout the entire sample, which is not possible with traditional one-point NIR spectroscopy. These abilities make NIR hyperspectral imaging highly suitable for studying homogeneity in samples [19].

Figure 1. Schematic of a hyperspectral cube. The pixel coordinates (spatial dimensions) are given by the x- and y-axes, while the z-axis represents the spectral dimension. Every pixel contains a full NIR spectrum.

The generated data is large, multivariate and difficult to interpret, which is why a logical sequence of data processing and multivariate, chemometric data analysis methods is required before any results can be visualized. These steps often deal with background removal, noise reduction and scatter correction, followed by a form of exploratory analysis, usually PCA (Section 2.4.2) [17]. Noise reduction and scatter correction are data pre-processing methods that will be discussed in Section 2.3. Background removal, however, is a data reduction step that removes any background pixels (and their associated spectra) from the data set matrix. The background pixels (or the background in general) are of no interest for the purpose of the analysis and need to be removed in order to base the analysis on a clean data set derived only from the samples in question [17].
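The unfolding of a hyperspectral cube into the two-way matrix that PCA and other chemometric methods operate on, together with a background-removal step, can be sketched as follows. This is a minimal NumPy illustration; the intensity-threshold rule for identifying background pixels is an assumption of the sketch, not the method used by the UmBio software.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hyperspectral cube: 4 x 5 pixels, 20 wavelength channels.
# Sample pixels get reflectance in [0.5, 1.0); the first pixel row is
# scaled down to mimic a dark background.
cube = 0.5 + 0.5 * rng.random((4, 5, 20))
cube[0, :, :] *= 0.01

ny, nx, n_wl = cube.shape

# Unfold: each pixel becomes one row of a (pixels x wavelengths) matrix,
# the form expected by PCA and other multivariate methods.
X = cube.reshape(ny * nx, n_wl)

# Background removal (illustrative): drop pixels whose mean intensity
# falls below a threshold, keeping only sample spectra.
keep = X.mean(axis=1) > 0.1
X_clean = X[keep]

print(X.shape)        # (20, 20): all 20 pixels
print(X_clean.shape)  # (15, 20): the 5 background pixels removed
```

After analysis, scores can be folded back to the (ny, nx) pixel grid to visualize the spatial distribution of chemical composition.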


2.3 Pre-processing data

2.3.1 The purpose of pre-processing data

Pre-processing methods in chemometric data analysis are ways of removing unwanted variation that can influence the interpretation of data generated from experiments [20]. Unwanted variation often originates from instrumental or experimental artefacts and needs to be cleaned from the raw data in order for it to be properly and correctly interpreted for its intended purpose. There is a range of pre-processing methods, which can be used individually or in combination; however, the choice of methods should be thoroughly considered based upon the nature of the data [20]. When dealing with spectral data, especially NIR spectra, the use of pre-processing techniques is essential, since NIR spectra often experience baseline deviation and other unwanted systematic variation due to light scattering and different effective pathlengths through the sample [21].

2.3.2 Normalization

Normalization is one of the pre-processing methods used to compensate for physical variance between samples caused by light scattering [21,22]. The purpose of normalization is to scale the samples, and there are different types of transformations that can be used [23]. One of these is unit vector normalization, where, for spectroscopic applications, every spectrum in a set of samples is divided by its norm (Euclidean norm), giving the spectrum a length of one. Fixing the length of (normalizing) the spectra corrects for baseline shifts, also known as multiplicative effects [22]. Another commonly used transformation for spectroscopic applications is area normalization, where the area under (the integral of) each spectrum is set to one. This is done to compensate for the light's different effective pathlengths through the samples [23].

2.3.3 Quadratic detrending

Quadratic detrending is a pre-processing method used to remove what are known as additive effects in spectral data. Additive effects are trends observed in the spectra, such as offsets or changes of slope, caused by sudden deviations in instrumental response or sample composition [22,24]. The trends are often curvilinear (arched) in generated NIR spectra, which is why a quadratic (second-degree) polynomial fit is often subtracted from the spectra to standardize the variation and thereby detrend them. Detrending only compensates for additive effects and is therefore often used in combination with normalization methods, which also compensate for multiplicative effects [24].

2.3.4 Mean centering

Mean centering is a mathematical data-transformation method which manipulates every column (variable) of a data set to have a mean of zero. This is done by subtracting the column mean from each variable, which removes any constant offset in the data set [25]. For spectroscopic techniques, such as NIR, an average spectrum of the complete data set is defined and subsequently subtracted from each spectrum within the data set [16]. As the data is centered around the mean, differences between the samples with regard to the relative response of the variables are more readily observable [26].
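The pre-processing combination applied later in the thesis (area normalization, then quadratic detrending, then mean centering) can be sketched as a single pipeline. This is a minimal NumPy illustration; the `preprocess` helper and its implementation details, such as fitting the baseline with `np.polyfit`, are assumptions of this sketch, not the UmBio software's actual routine.

```python
import numpy as np

def preprocess(X, wl):
    """Area normalization -> quadratic detrending -> mean centering.

    X  : (n_samples, n_wavelengths) matrix of raw spectra
    wl : (n_wavelengths,) wavelength axis
    """
    # 1. Area normalization: scale each spectrum so its summed intensity
    #    is one, compensating for multiplicative (pathlength) effects.
    X = X / X.sum(axis=1, keepdims=True)

    # 2. Quadratic detrending: fit a second-degree polynomial to each
    #    spectrum and subtract it, removing curvilinear (additive) baselines.
    coeffs = np.polyfit(wl, X.T, deg=2)        # (3, n_samples): one fit each
    baseline = (np.vander(wl, 3) @ coeffs).T   # evaluate fits -> (n, n_wl)
    X = X - baseline

    # 3. Mean centering: subtract the column (wavelength) means so every
    #    variable has zero mean across the sample set.
    return X - X.mean(axis=0)

# Toy spectra: a shared absorption peak at different scalings, plus
# differently curved baselines.
wl = np.linspace(1300.0, 2000.0, 50)
peak = np.exp(-((wl - 1700.0) ** 2) / 5000.0)
raw = np.vstack([s * peak + a * 1e-6 * (wl - 1300.0) ** 2
                 for s, a in [(1.0, 1.0), (1.5, 2.0), (2.0, 0.5)]])

processed = preprocess(raw, wl)
print(np.allclose(processed.mean(axis=0), 0.0))  # True: columns centered
```

The ordering matters: normalization is applied to the raw intensities, detrending to the normalized spectra, and centering last so that PCA operates on zero-mean variables.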


2.4 Techniques for exploratory analysis

2.4.1 What is exploratory analysis?

Exploratory data analysis (EDA) refers to an approach in which data is analyzed by searching for patterns, trends or similarities between samples. It is exploratory in the sense that, prior to the analysis, the objects (samples) have no known class assignments, which is why patterns, trends and similarities within the data sets are found in an unsupervised way; this is why it is also referred to as unsupervised data analysis [25]. Within the approach, many different techniques can be employed depending on the purpose of the analysis. However, results are commonly presented visually, i.e. as plots and diagrams. There are both univariate and multivariate techniques in EDA, chosen based on the type of data and the purpose of the analysis [27]. For spectroscopic applications (multivariate data sets), the aim of the analysis is typically either to search for similar groups of samples or variables, known as cluster analysis, or to more readily observe variation within the data set by reducing the high dimensionality of the variables (PCA) [25].

2.4.2 PCA

The main purpose of using PCA is to reduce the dimensions of large multivariate data sets with correlated variables, without losing significant variation. This is done by transforming the observed values to new, uncorrelated variables – principal components (PCs). Consider the original data set as a matrix X with n rows (objects) and p columns (variables); with PCA it is projected onto a new coordinate system given by two smaller matrices, known as the score matrix, T, and the loading matrix, L, according to

$$\mathbf{X} = \mathbf{T}\mathbf{L}^T \qquad (8)$$

where T is the score matrix with n rows and d columns (the number of principal components) and $L^T$ is the transposed loading matrix with d rows and p columns [25]. Every variable has, by definition, a loading on (contribution to) each PC. These loadings, $l$, form loading vectors, $l_d$, which define the direction of every PC. This direction is defined such that the maximum amount of variation in the samples is preserved, i.e. a line is fitted such that the relative distances between the samples are preserved. The loading vectors are also often normalized to length one, such that $l_d^T l_d = 1$, in order to make data and variables more comparable [28]. Furthermore, when projected (Figure 2), every object (sample) in the data set receives its coordinates on the PCs through its scores, t. The scores can be seen as linear combinations of the loadings and the original variables, and for the first principal component (PC1) they are given by

𝑡$,$ = 𝑥$,$𝑙$,$+ 𝑥$,"𝑙",$+ … + 𝑥$,(𝑙(,$ (9) 𝑡",$ = 𝑥",$𝑙$,$ + 𝑥","𝑙",$+ … + 𝑥",(𝑙(,$ . . . 𝑡),$ = 𝑥),$𝑙$,$ + 𝑥),"𝑙",$+ … + 𝑥),(𝑙(,$

where tₙ,₁ is the score of object n on PC1 and x are the observed values for all p variables in that object (sample) [25]. The second principal component (PC2) is orthogonal to PC1, PC3 orthogonal to PC2, etc. As many PCs can be computed as there are variables in the original data set. However, as a reduction of dimensions is desirable, the PCs are determined in an iterative way, such that PC1 explains the maximum amount of variance in the original data, PC2 explains the maximum amount of variance not described by PC1, and so on [25]. There are different algorithms and methods for computing these components. The most common are nonlinear iterative partial least squares (NIPALS) and singular value decomposition (SVD). NIPALS is the simplest algorithm in PCA and computes the PCs, or latent variables, one by one in decreasing order of importance, i.e. by decreasing amount of explained variance. The algorithm is efficient when only the first few PCs are required to explain a substantial amount of variance in the data set. SVD, on the contrary, is a non-sequential algorithm and calculates all the PCs at once. It is more efficient when most or all of the PCs in the model are required to explain a substantial amount of variance in the data set [29]. The underlying theory of these algorithms will not be discussed in this thesis.
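As an illustration only (not part of the original thesis work), the SVD route can be sketched in a few lines of Python/NumPy, where the scores, unit-length loadings and explained variance per PC follow directly from the decomposition of the mean-centered data matrix:

```python
import numpy as np

def pca_svd(X, n_components=3):
    """PCA via singular value decomposition (SVD).

    X: (n_samples, n_variables) data matrix, e.g. one NIR spectrum per row.
    Returns scores (t), loadings (l) and explained variance (%) per PC.
    """
    Xc = X - X.mean(axis=0)                        # mean centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T                 # unit-length loadings (l'l = 1)
    scores = Xc @ loadings                         # t = Xl, cf. Eq. (9)
    explained = (s**2 / np.sum(s**2) * 100.0)[:n_components]
    return scores, loadings, explained
```

Note that all PCs are obtained at once from the SVD, in contrast to the one-by-one deflation used by NIPALS.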

Figure 2. Projection of samples in a three-dimensional (x1, x2 and x3) coordinate system onto a new, dimensionally reduced coordinate system defined by PC1 and PC2.

After the PCs have been computed, the significant PCs must be separated from noise PCs. Noise PCs contain no information about the differences in the chemical system (samples) and only contribute to the background [30]. The distinction can be made in various ways and is often based on statistical or experimental criteria which need to be fulfilled. Two such estimations are percentage of explained variance and cross validation. With percentage of explained variance, the number of necessary PCs is the number required to reach a set limit of cumulative explained variance, e.g. 80 %. The explained variance is a measure of how much variance of the original data is expressed in the new model, and the limit is set based on previous empirical data from similar data sets [23,25]. Cross validation, on the other hand, is a way to validate that the PCA model is applicable to new, similar data and thereby confirm the number of necessary PCs. One object at a time, or groups of objects, are removed from the data set and the PCA is run again, upon which a prediction error for the removed objects is calculated. The number of PCs that yields the lowest average prediction error separates significant PCs from noise PCs [23,30].
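The explained-variance criterion can be expressed compactly; the sketch below (illustrative, with a hypothetical 80 % limit) returns the smallest number of PCs whose cumulative explained variance reaches the preset limit:

```python
import numpy as np

def n_significant_pcs(explained_variance, cumulative_limit=80.0):
    """Number of PCs needed to reach a cumulative explained-variance limit.

    explained_variance: per-PC explained variance in %, decreasing order.
    """
    cumulative = np.cumsum(explained_variance)
    # Index of the first PC at which the cumulative variance >= the limit.
    return int(np.searchsorted(cumulative, cumulative_limit)) + 1
```

For example, with per-PC explained variances of 60, 25, 10 and 5 %, two PCs reach the 80 % limit.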

Results from a PCA are often depicted and interpreted by means of graphical (visual) representation, i.e. different types of plots. The main results of interest are usually the explained variance, loadings and scores plots [23]. The explained variance, as previously mentioned, is a measure of how much variation in the original data set each PC accounts for in the new model. The loadings plots are scatter plots and represent each variable's loadings on the respective PCs. The larger the loadings (positive or negative), the more impact those variables have on the scores of the samples. Variables close to each other in the loadings plot are correlated and have similar impact on the scores. For spectroscopic applications, the loadings plot is often depicted as a line plot to be more readily comparable to the spectral data. Noise PCs can readily be distinguished using loadings plots, due to their unstable and oscillating appearance, i.e. no resemblance to the spectral data is observed [30]. The scores plot is also a scatter plot, which provides every object's (sample's) projection (score) onto the PCs [25]. As the loadings define the directions of the PCs, they are often used to interpret the scores plot. The connection can be summarized as follows: variables with large, positive loadings give i) higher than average values in that variable for objects with a positive score and ii) lower than average values in that variable for objects with a negative score. The opposite applies for negative loadings [23]. Objects close to each other in the scores plot have similar scores and are thereby similar to each other. The formation of groupings, or clusters, of similar objects in scores plots is therefore prevalent, and is one of PCA's advantages [25].

2.4.3 Cluster analysis

Cluster analysis is another method for unsupervised analysis, in which samples in data sets are grouped together (clustered) based on their similarity. The similarity is often decided by measures of distance: the closer two samples are, the more they resemble each other. There are different types of distance measurements, but the Euclidean distance is the most commonly used [25]. The Euclidean distance, dₑ, is a straight-line distance measurement. In a one-dimensional Euclidean space (vector), the distance between two objects, i and j, is given as the absolute difference between the coordinates of the two objects, according to

dᵢ,ⱼ = |i₁ − j₁| (10)

In a multivariate vector space, this distance measurement is more complex. The most common technique for cluster analysis is K-means clustering, in which the given data set is divided into k clusters [23]. Usually the number of clusters within the data set is already known through other EDA techniques, such as PCA, and K-means clustering then finds the optimum dispersion of the data set into the defined clusters. The assignment of data to clusters can be based on different calculations. Two commonly used calculations are based on the largest distance i) between objects (complete linkage, Figure 3a) or ii) between defined centroids (centroid linkage, Figure 3b) in different clusters [25].
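The distance measures involved can be sketched as follows (illustrative Python; the cluster contents in the helper functions are hypothetical examples, not data from this study):

```python
import numpy as np

def euclidean(i, j):
    """Straight-line (Euclidean) distance between two objects; in one
    dimension this reduces to |i1 - j1|, cf. Eq. (10)."""
    return float(np.linalg.norm(np.asarray(i, float) - np.asarray(j, float)))

def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise distance between objects of two clusters (Figure 3a)."""
    return max(euclidean(a, b) for a in cluster_a for b in cluster_b)

def centroid_linkage(cluster_a, cluster_b):
    """Distance between the mean positions (centroids) of two clusters (Figure 3b)."""
    return euclidean(np.mean(cluster_a, axis=0), np.mean(cluster_b, axis=0))
```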

Figure 3. a) Complete linkage and b) centroid linkage distance-based (di,j) calculations in the arrangement of samples in different clusters.


3. Materials and methods

3.1 Sample groups

3.1.1 Homogeneous sample groups

A total of eight different regulated pharmaceutical drugs were used as a basis to represent a homogeneous sample group. Both tablets and capsules were studied. A total of six tablets were used: Ksalol® 1 mg alprazolam (Galenika a.d., Zemun, Serbia), Xanor® 2 mg alprazolam (Pfizer AB, Sollentuna, Sweden), Stesolid® 10 mg diazepam (Teva Sweden AB, Helsingborg, Sweden), Iktorivil® 2 mg clonazepam (Roche AB, Solna, Sweden), Ergenyl® 100 mg sodium valproate (Sanofi Aventis AB, Bromma, Sweden), and Orgabolin 2 mg ethyloestrenol (Organon AB, Oss, Netherlands). Additionally, two capsule-based pharmaceuticals were used: Omeprazol Actavis 20 mg omeprazole (Actavis Group PTC ehf., Hafnarfjordur, Iceland) and Lansoprazol Mylan 30 mg lansoprazole (Mylan AB, Stockholm, Sweden). If available, a total of 100 samples of each pharmaceutical were selected from seizures or the reference inventory at NFC (Linköping, Sweden) and analyzed on five consecutive trays (i.e. a total population of 500 tablets or capsules).

3.1.2 Heterogeneous sample groups

Several sample groups with different degrees of heterogeneity were prepared. A seizure of capsules containing either pregabalin or gabapentin as the API was used to prepare binary sets on the tray. In each set, 50 capsules with a given distribution of the two groups were chosen and analyzed on five consecutive trays (i.e. a total population size of 250 capsules). The sets had distributions of 50/50 (125/125 capsules), 70/30 (175/75 capsules) and 90/10 (225/25 capsules). Capsules prepared at NFC with 40 %, 30 % and 20 % MDMA, with lactose as excipient, were also used to compile heterogeneous sample groups. Combinations of 40/30/20 (150/150/150 capsules), 40/30 (150/75 capsules) and 40/20 MDMA (150/50 capsules) were analyzed.

3.1.3 Seized sample groups

Several seized sample groups provided by NFC were used: OxyContin® 80 mg oxycodone hydrochloride (MundiPharma A/S., Gothenburg, Sweden), Tramadol-X 200 mg tramadol hydrochloride (Signature, India), Tamoxifen-EGIS 20 mg tamoxifen (Poland), Winstrol 20 mg stanozolol (Pro Medica), Dolol® 50 mg tramadol (Nycomed Danmark ApS, Roskilde, Denmark) and Lyrica® 300 mg pregabalin (Pfizer Manufacturing GmbH, Freiburg, Germany). A total of four ecstasy seizures (Ecstasy I, II, III and IV) were also used to represent potentially heterogeneous sample groups. From each seizure, 50-100 tablets were selected and analyzed on five consecutive trays. Additionally, three seizures of regulated doping agents, Doping I (oxymetholone), Doping II (stanozolol) and Doping III (tamoxifen), were also investigated. For all three seizures, 80 tablets were selected and analyzed on five consecutive trays.

3.2 Instrumentation

The hyperspectral images were recorded in diffuse reflectance mode with a LUMO SDK SWIR camera (Specim, Spectral Imaging Ltd., Oulu, Finland) in combination with Breeze v. 2019 2.0 software (Prediktera, Umeå, Sweden). The samples were evenly distributed on a tray (42×40 cm) with a capacity of 100 samples and scanned using a UmBio Inspector (UmBio, Umeå, Sweden) equipped with a conveyor belt. The conveyor belt had a travel speed of 200 mm/s, which was reduced to 100 mm/s under the camera, where the images were recorded for 4 s (400 mm) with an integration time of 1500 µs. The images had a resolution of 320×320 pixels, and spectra were recorded over the wavelength range 1100-2400 nm with 6,34 nm increments. The hyperspectral images were initially processed to identify objects on the sample tray with SACman v. 0.80 (NFC, Linköping, Sweden) and later exported to The Unscrambler® X 10.3 (64-bit, CAMO Software AS., Oslo, Norway) for processing of the spectra with chemometric models.

3.3 Data treatment

The data pre-processing techniques utilized and investigated in this homogeneity study were area normalization, quadratic (second degree polynomial) detrending and mean centering. These were chosen based on their applicability in the discriminant studies of drug seizures using NIR hyperspectral imaging previously conducted at NFC [5,6]. PCA with SVD computation was the chosen exploratory analysis technique employed on the pre-processed data.
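The pre-processing chain can be sketched as below (an illustrative Python rendering of the three steps; the exact implementations in the software used in the study may differ, e.g. in how the spectral area is integrated):

```python
import numpy as np

def preprocess(spectra):
    """Combination (ii): area normalization, quadratic detrending, mean centering.

    spectra: (n_samples, n_wavelengths) array of NIR spectra.
    """
    X = np.asarray(spectra, dtype=float)
    # 1) Area normalization: divide each spectrum by its total area (here: sum).
    X = X / X.sum(axis=1, keepdims=True)
    # 2) Quadratic detrending: subtract a second-degree polynomial fitted
    #    to each spectrum over the wavelength index.
    idx = np.arange(X.shape[1], dtype=float)
    coeffs = np.polyfit(idx, X.T, deg=2)          # shape (3, n_samples)
    trend = (coeffs[0] * idx[:, None] ** 2 + coeffs[1] * idx[:, None] + coeffs[2]).T
    X = X - trend
    # 3) Mean centering: subtract the mean spectrum of the sample group.
    return X - X.mean(axis=0)
```

After the final step, every wavelength (column) has zero mean across the sample group, so all remaining variation describes differences between samples.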

3.4 Quantifying variation in NIR spectra

3.4.1 Predictive models

By quantifying the variation in NIR spectra for the different sample groups (i.e. homogeneous, heterogeneous and seized sample groups), predictive models to be used for evaluating the level of homogeneity in drug seizures were developed. Two models (Model A and Model B) were investigated with two different combinations of pre-processing techniques: (i) area normalization and mean centering and (ii) area normalization, quadratic detrending and mean centering. The models were based on standard deviations (SD) within the NIR spectra combined with the explained variance of the PCA model as measurements of variation.

3.4.2 Choosing wavelength range

A reduction of the spectral range was reported necessary in previously conducted discriminant studies at NFC. Therefore, it was also investigated whether the same applied when studying homogeneity. The effect on spectral variation using both combinations of pre-processing techniques was studied after different reductions of the wavelength range had been applied.

3.4.3 Model A

The model was based on expressing the variation in NIR spectra as a percentage of the total spectral area. The spectral variation was estimated by calculating the sum of 4SD across each variable (wavelength). The sum provided an area constituted by ± 2SD (covering 95,5 % of the population) across the spectra, which readily could be related to the total spectral area. The spectral variation (percentage) was further related to the quotient between the significant explained variance and the explained variance related to noise, given by PCA for the respective sample groups. PCA was conducted using 20 PCs, and the initial three PCs were chosen (after investigation) to best represent significant explained variance, with noise consequently accounting for the remainder up to 100 % cumulative explained variance.
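A sketch of Model A, under the assumption that the SD is taken per wavelength over the pre-processed spectra and the total spectral area is the area under the mean spectrum (illustrative Python, not the exact implementation used in the study):

```python
import numpy as np

def model_a(spectra, explained_variance, n_significant=3):
    """Model A: spectral variation (+/- 2SD band) as a percentage of the
    total spectral area, and the quotient of significant vs. noise
    explained variance.

    spectra: (n_samples, n_wavelengths) spectra (e.g. area-normalized).
    explained_variance: per-PC explained variance in % from a PCA.
    """
    X = np.asarray(spectra, dtype=float)
    sd = X.std(axis=0)
    band_area = np.sum(4.0 * sd)              # +/- 2SD band summed over wavelengths
    total_area = np.sum(X.mean(axis=0))       # area under the mean spectrum
    spectral_variation = 100.0 * band_area / total_area
    significant = float(np.sum(explained_variance[:n_significant]))
    noise = 100.0 - significant               # remainder up to 100 %
    return spectral_variation, significant / noise
```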

3.4.4 Model B

The second model was based on approximating the signal variation in NIR spectra derived from noise and signal (samples), respectively. The noise variation was approximated by calculating the average of the 10 variables (wavelengths) that exhibited the lowest SD, while the signal variation was approximated by calculating the average of the 10 variables that exhibited the largest SD, consequently corrected for noise (by subtracting the respective noise variation). The signal variation was further related to the quotient between the significant explained variance and the explained variance related to noise, given by PCA for the respective sample groups.
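Model B can be sketched in the same spirit (illustrative Python; k = 10 variables as in the text):

```python
import numpy as np

def model_b(spectra, explained_variance, k=10, n_significant=3):
    """Model B: signal-to-noise ratio of spectral variation.

    Noise variation: mean SD of the k lowest-SD wavelengths.
    Signal variation: mean SD of the k highest-SD wavelengths, noise-corrected.
    """
    sd = np.sort(np.asarray(spectra, dtype=float).std(axis=0))
    noise_var = sd[:k].mean()
    signal_var = sd[-k:].mean() - noise_var   # corrected for noise
    significant = float(np.sum(explained_variance[:n_significant]))
    return signal_var / noise_var, significant / (100.0 - significant)
```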

3.5 Quantifying variation in PCA scores plots

To evaluate the level of homogeneity in drug seizures, predictive models were also developed by quantifying the variation in PCA scores plots for the different sample groups (i.e. homogeneous, heterogeneous and seized sample groups). The same combinations of pre-processing techniques as described in Section 3.4.1 were used. The models were based on relating the size of the clusters of the sample groups in their respective two-dimensional scores plots, given by PC1 and PC2, to the explained variance of those PCs. The cluster sizes were estimated by calculating either i) a rectangular or ii) an elliptical area. The rectangular area was calculated by multiplying the maximum distances along PC1 and PC2. The elliptical area, on the other hand, was calculated by multiplying π with half of the maximum distances along PC1 and PC2.
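The two cluster-size estimates can be sketched as follows (illustrative Python; `scores` is a hypothetical (n_samples, 2) array of PC1/PC2 scores):

```python
import numpy as np

def cluster_areas(scores):
    """Rectangular and elliptical estimates of cluster size in a
    two-dimensional PC1/PC2 scores plot."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max(axis=0) - scores.min(axis=0)   # max distance along PC1, PC2
    rectangular = span[0] * span[1]
    elliptical = np.pi * (span[0] / 2.0) * (span[1] / 2.0)
    return rectangular, elliptical
```

The elliptical estimate treats the half-spans as semi-axes, which for the same spans always gives π/4 of the rectangular area.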


4. Results and discussion

4.1 Finding conditions for measurements

4.1.1 Evaluating frames for the UmBio tray

Frames for the tray, on which all samples were evenly distributed, had been manufactured at NFC (Appendix II). The frames had 10 rows (1-10) and 10 columns (A-J), fitting a total of 100 samples. Initial acquisitions with Frame A resulted in a large peak between 1777 and 1800 nm in the NIR spectra derived from a small group of samples in the investigated population (Figure 4). This peak was present for five initially investigated sample groups (Tramadol-X, Dolol®, Tamoxifen-EGIS, Ksalol® and Winstrol), and as different molecules have unique NIR spectra, it was improbable that all five pharmaceuticals shared the same large peak. Furthermore, it could be observed that the deviating groups for all five materials were located in row 10 of Frame A, indicating a positionally derived anomaly rather than something caused by the samples themselves.

Figure 4. a) NIR spectra after area normalization, quadratic detrending and mean centering and b) corresponding hyperspectral image of the tray with 100 samples using Frame A. With the image-processing software SACman v. 0.80, it could be concluded that the large peak near 1800 nm derived from row ten on the tray.

Consequently, another frame (Frame B), which had a different position on the tray compared with Frame A, was tested with the same samples. The peak did not appear in the NIR spectra (Figure 5) for any of the samples run using this frame, further supporting a positionally derived anomaly for Frame A. The peak between 1777 and 1800 nm is believed to originate from reflectance from the rim of the tray, as row 10 in Frame A was positioned 5 cm from the lower rim of the tray, compared to 8,5 cm for Frame B. All subsequent measurements were therefore conducted with Frame B.


Figure 5. Obtained NIR spectra after area normalization, quadratic detrending and mean centering using Frame B and the same sample groups as with Frame A. No large peak near 1800 nm was present.

4.1.2 Choosing wavelength range

It was concluded that the spectral range needed to be reduced from the full range of 1100-2400 nm. The variables with the largest SD were, for all sample groups, consistently located in the extreme regions of the spectral range, specifically 1100-1200 and 2200-2400 nm. This was the case for both combinations of pre-processed data, i.e. (i) area normalization and mean centering (Figure 6) and (ii) area normalization, quadratic detrending and mean centering (Figure 7). The effect the different pre-processing techniques had on the spectral appearance can be seen in Appendix III. Additionally, when studying the loadings plots for all investigated samples (Stesolid® is used as an example), it was found that the variables in these regions (1100-1200 and 2200-2400 nm) had large loadings for the first PCs, thereby defining the direction of those PCs. The loadings in these regions also indicated noise-like behavior, i.e. oscillating tendencies. The initial PCs should inherently explain a large portion of the variation in the data set (variation between samples) and should thereby not be noise-dependent. The cause was attributed to the NIR system's detector. From the white reference measurements, it could be seen that the detector had the highest effective response in the region of 1300-2000 nm, i.e. a lower response in the outer regions (1100-1300 and 2000-2400 nm). The spectral range was therefore shortened from 1100-2400 nm to 1300-2000 nm.


Figure 6. a) SD plotted against the wavelength (nm) in the full spectral range (1100-2400 nm) with the corresponding b) PC1, c) PC2 and d) PC3 loadings plots given by PCA after (i) area normalization and mean centering of Stesolid®.

Note that the spectral range was shortened to 1300-2000 nm prior to applying area normalization, (quadratic detrending) and mean centering again. Strictly speaking, these techniques are therefore applied as post-processing; however, they will still be referred to as pre-processing techniques in this thesis.



Figure 7. a) SD plotted against the wavelength (nm) in the full spectral range (1100-2400 nm) with the corresponding b) PC1, c) PC2 and d) PC3 loadings plots given by PCA after (ii) area normalization, quadratic detrending and mean centering of Stesolid®.

The effect of the shortened spectral range (1300-2000 nm) on the SD and the loadings plots will only be shown for the second combination of pre-processing techniques, (ii) area normalization, quadratic detrending and mean centering (Figure 8). When the spectral region was shortened, the regions with the largest SD became more unique for every sample group. Furthermore, these regions (large SD) were also correlated to the regions of high loadings (positive and negative) in the first few PCs, in which no noise-like tendencies were observed any longer. This suggests that the first PCs contain significant information about the variation in the data set.



Figure 8. a) SD plotted against the wavelength (nm) in the reduced spectral range (1300-2000 nm) with the corresponding b) PC1, c) PC2 and d) PC3 loadings plots given by PCA after (ii) area normalization, quadratic detrending and mean centering of Stesolid®.

4.2 Quantifying variation in NIR spectra

4.2.1 Choosing the number of significant PCs

The initial version of the models utilized a cut-off level to differentiate the significant PCs from noise PCs. The cut-off level was set at 2 % of the explained variance for each PC. Homogeneous and heterogeneous sample groups generally reached the cut-off level of 2 % within the three or four initial PCs with the first (i) pre-processing combination. For the second (ii) combination of pre-processing methods, however, there was a difference in the required number of PCs between homogeneous and heterogeneous sample groups. Heterogeneous sample groups still reached the level with three or four PCs, while homogeneous sample groups required more PCs. Pharmaceuticals in tablet form (homogeneous) required six to ten PCs before the cut-off level was reached, while some capsule-based pharmaceuticals (homogeneous) required five to six PCs. The differences in explained variance between homogeneous and heterogeneous sample groups are visualized in Figure 9.



Figure 9. Explained variance for a) Stesolid® (homogeneous, tablets) and b) 50/50 pregabalin/gabapentin capsules (heterogeneous, capsules) given by PCA with SVD computation performed on 20 PCs.

It was suspected that, with the cut-off level of 2 %, the noise level was underestimated in the models of very homogeneous sample groups, i.e. too much noise was embodied in the significant PCs. This was supported by the loadings plots given by the PCA, as higher-order PCs indicated noise-like behavior (Figure 10). The loadings plots indicated that oscillating patterns (noise) emerged after the three initial PCs in all samples, even though these PCs accounted for more than 2 % explained variance each. This effect was present regardless of sample type, tablet or capsule (Appendix IV).

Figure 10. Loadings plots for a) PC3, b) PC4, c) PC5 and d) PC6 given by PCA after (ii) area normalization, quadratic detrending and mean centering for Ksalol® in the spectral range of 1300-2000 nm.

Confirming homogeneity should, in principle, not require a large number of PCs, as the majority of the variation within the data set is by definition explained by the initial PCs in a PCA. For homogeneous sample groups, there are small variations within the data set, which explains why a smaller cumulative explained variance from the initial PCs is expected compared to sample groups with elements of heterogeneity. In an ideal homogeneous sample group, variation in the data set would be derived only from noise, which is why, in a PCA model (using SVD computation) performed on 20 PCs, every PC would explain an equal amount of variance, i.e. 5 %. For standardization purposes from an operational point of view, the use of a definite number of PCs to evaluate the homogeneity of drug seizures is preferable for NFC, rather than determining the 2 % cut-off level in every case. As PCs higher than PC3 displayed significant influence from noise, the evaluation of homogeneity in drug seizures was further investigated using the first three PCs as representative of signal-dependent variance for both combinations of pre-processing techniques ((i) and (ii)).

4.2.2 Model A

The results for the investigated sample groups, where the spectral variation is expressed as a percentage of the total spectral area, are given in Table 1. The sample groups were divided by their origin, i.e. a) homogeneous sample tablets, b) homogeneous sample capsules, c) seized sample groups and d) heterogeneous sample groups. The first three PCs defined the significant cumulative explained variance, whilst noise, consequently, explained the remaining variance up to 100 %. Using the given values, the distribution of spectral variation between material- and noise-dependent variation can be deduced. All samples (except Winstrol, 8,20 %) had a spectral variation as a percentage of total spectral area within the tolerance limit of 4,5 % set by the use of ± 2SD. The homogeneous sample groups (tablets and capsules) had, as expected, a lower percentage of spectral variation compared with the sample groups with different degrees of heterogeneity. It is also evident that the distribution of spectral variation between signal- and noise-dependent variation, given by the cumulative explained variance, differs between homogeneous and heterogeneous sample groups. This was anticipated due to the inherently small variation between the samples in homogeneous sample groups, as described in Section 4.2.1. Furthermore, homogeneous sample capsules in general had a higher cumulative explained variance from the first three PCs compared with homogeneous sample tablets. This was expected, as tablets in general are more uniform in terms of their composition.


Table 1. The spectral variation (± 2SD, %) given as a percentage of total spectral area, together with significant and noise explained variance (%) and their quotient given by PCA, for the respective sample groups: a) homogeneous sample tablets, b) homogeneous sample capsules, c) seized sample groups and d) heterogeneous sample groups. Data are given in the reduced spectral range of 1300-2000 nm after (i) area normalization and mean centering. PCA (using SVD computation) was performed with 20 PCs.

Sample group         Spectral variation   Significant explained   Noise explained   Quotient
                     (± 2SD, %)           variance (%)            variance (%)

a) Homogeneous sample tablets
Stesolid®            1,44                 72,26                   27,74             2,61
Xanor®               1,91                 79,84                   20,16             3,96
Iktorivil®           1,85                 78,07                   21,93             3,56
Ksalol®              2,61                 83,55                   16,45             5,08
Orgabolin            2,03                 82,36                   17,64             4,67
Ergenyl®             2,68                 77,61                   22,39             3,47

b) Homogeneous sample capsules
Lansoprazol          2,07                 84,62                   15,38             5,50
Omeprazol            2,47                 84,05                   15,95             5,27

c) Seized sample groups
Tamoxifen-EGIS       2,88                 93,11                   6,89              13,51
OxyContin®           2,74                 91,51                   8,49              10,78
Winstrol             8,20                 99,13                   0,87              114,56
Tramadol-X           4,07                 95,71                   4,29              22,29
Lyrica®              2,74                 92,82                   7,18              12,94
Dolol®               4,61                 96,75                   3,25              29,76
Ecstasy I            3,27                 94,80                   5,20              18,25
Ecstasy II           3,61                 94,42                   5,58              16,92
Ecstasy III          2,24                 90,84                   9,16              9,92
Ecstasy IV           4,56                 96,10                   3,90              24,62
Doping I             3,29                 86,15                   13,85             6,22
Doping II            3,14                 89,98                   10,02             8,98
Doping III           1,66                 91,88                   8,12              11,32

d) Heterogeneous sample groups
40/20 MDMA           3,69                 93,82                   6,18              15,18
40/30 MDMA           3,43                 93,35                   6,65              14,03
40/30/20 MDMA        3,56                 92,90                   7,10              13,09
50/50 Preg/Gaba      3,27                 96,80                   3,20              30,22
70/30 Preg/Gaba      4,83                 95,90                   4,10              23,38
90/10 Preg/Gaba      4,08                 94,04                   5,96              15,78

Several different ways of representing the given data were tested. A model in which the spectral variation was related to the quotient between significant and noise explained variance showed promise in its predictive ability (Figure 11). A distinct linear correlation between the spectral variation and the quotient could be observed: with an increased spectral variation, the quotient increased proportionally. Indicatively, homogeneous sample groups are expected to have a spectral variation lower than 3 %, and heterogeneous sample groups above that level. Additionally, heterogeneous sample groups (orange triangles) had elevated quotients (above 12), giving separation in both dimensions. If sample groups possess a spectral variation lower than 3 % but an elevated quotient of explained variance (above 5,5), as with many seized sample groups (blue), it could be an indication of a certain degree of heterogeneity. The homogeneous nature of the seized sample groups will be discussed further in Sections 4.2.3 and 4.2.4. Note that Winstrol is not included in Figure 11, as its large quotient (114,56) would diminish the visibility of the separation of the remaining samples. Note also that a similar coding of colors and shapes will be used throughout the thesis; the explanation will therefore not be repeated in every diagram. Different sizes of the triangles representing MDMA capsules (small orange triangles) and pregabalin/gabapentin capsules (large orange triangles) are used in order to more readily compare heterogeneity derived from differences in API concentration (MDMA capsules) with substance-derived heterogeneity (pregabalin/gabapentin capsules). It was evident, as expected, that substance-derived heterogeneity gave larger spectral variation and quotient of explained variance compared with heterogeneity derived from differences in API concentration (Figure 11).

Figure 11. Spectral variation (y-axis) expressed as percentage of total spectral area after (i) area normalization and mean centering vs. quotient of significant and noise explained variance (x-axis) for the respective sample groups: homogeneous sample tablets (circles, green), homogeneous sample capsules (triangles, green), seized sample tablets (circles, blue), seized sample capsules (triangles, blue), ecstasy tablets (seized, squares, blue), doping tablets (seized, rhombs, blue), pregabalin/gabapentin capsules (heterogeneous, large triangles, orange) and MDMA capsules (heterogeneous, triangles, orange).

Even though a sufficient separation between homogeneous and heterogeneous sample groups was achieved using the total spectral variation as a measurement for combination (i) area normalization and mean centering, the model was initially abandoned. When the model was created from detrended data (Figure 12), i.e. combination (ii) area normalization, quadratic detrending and mean centering, little to no predictive ability was observed. The spectral variation appeared to be similar for homogeneous and heterogeneous sample groups, only allowing separation in one dimension (by difference in quotient). Some seized sample groups had a spectral variation of up to 65 %, due to their low total spectral area (area under the mean spectrum). After detrending, the area under the mean spectrum is no longer normalized to 1, which is why a large variation in total spectral area was observed for the respective sample groups. The area under the mean spectrum will be highly dependent on the NIR activity of the constituents of the sample material, making comparison between sample groups difficult. Moreover, as the successful discriminant studies with NIR hyperspectral imaging conducted at NFC used combination (ii) area normalization, quadratic detrending and mean centering as pre-processing techniques, it is, again from an operational standpoint, preferable to use the same combination in the evaluation of homogeneity [5,6]. Omitting quadratic detrending could also lead to the inclusion of unwanted variation in the data set, especially for densely packed pharmaceutical products [24]. Further models (for both variation in NIR spectra and in PCA scores plots) will therefore only be presented for their predictive ability using the second combination of pre-processing techniques, (ii) area normalization, quadratic detrending and mean centering.

Figure 12. Spectral variation (y-axis) expressed as percentage of total spectral area after (ii) area normalization, quadratic detrending and mean centering vs. quotient of significant and noise explained variance (x-axis) for the respective sample groups: homogeneous sample tablets (circles, green), homogeneous sample capsules (triangles, green), seized sample tablets (circles, blue), seized sample capsules (triangles, blue), ecstasy tablets (seized, squares, blue), doping tablets (seized, rhombs, blue), pregabalin/gabapentin capsules (heterogeneous, large triangles, orange) and MDMA capsules (heterogeneous, triangles, orange).

4.2.3 Analysis of seized sample groups

MDMA tablets (seized) and doping tablets (seized) are samples which, in NFC's experience, often exhibit variation in the API when seizures are analyzed, which is why they were chosen as potentially suitable representatives for heterogeneous sample tablets. To assess the homogeneity (or heterogeneity) of the chosen seizures, six tablets from Ecstasy II, III, IV and Doping I, II, III were extracted from their respective sample groups, and the API was quantified by gas chromatography (GC) with flame-ionization detection (FID, ecstasy tablets) and nuclear magnetic resonance (NMR) spectroscopy (doping tablets). The tablets were chosen by utilizing the hyperspectral image and the scores plot given by PCA for the respective sample groups. As it was not known which PC would account for the variation in concentration of the API, the six tablets were chosen such that they would represent a sufficient variation in both PC1 and PC2 (Figure 13). Ecstasy II (Figure 13a) and Doping III (Figure 13b) are shown as examples.

Figure 13. Two-dimensional scores plots for a) Ecstasy II and b) Doping III given by PCA using SVD computation performed with 20 PCs. Marked (circled, red) objects in the respective scores plots are the objects selected for analysis.

The results of the quantification of API in the respective tablets are shown in Table 2. The mean (percentage of API in the tablets), SD and coefficient of variation (CV) were calculated for the respective sample sizes. It was found that the objects' positions along PC2, i.e. their scores on PC2, were correlated to the variation in API concentration of the tablets. If the total loading (sum of loadings) of PC2 was positive, objects with a positive score on PC2 would have higher than average (mean) values, and vice versa for total negative loadings. The findings suggested that Ecstasy II and III had similar and small variation (under 4 % CV) in the selected sample size, while Ecstasy IV, in relation, had a four times larger variation. No quantification of Ecstasy I was performed due to time constraints. For all doping tablets, a variation (CV) under 10 % was found. It is important to note that there is an increased uncertainty in the quantification of oxymetholone (Doping I), as the integrated singlet peak of interest overlaps slightly with an adjacent peak. As doping tablets generally contain small amounts of API, an increased variation can be expected. Note also that only a single analysis was performed on each doping tablet (normally a duplicate analysis is performed at NFC), which is why the relative variation might not be directly comparable to that of the ecstasy tablets.

Table 2. Calculated mean (% API in tablets), SD (%) and CV (%) from the results of the analysis of seized sample groups representing a potentially heterogeneous nature (Ecstasy II, III, IV and Doping I, II, III). Six tablets from each population were selected and analyzed by GC-FID (ecstasy tablets) and NMR (doping tablets).

Sample group   Mean (%)   SD (%)   CV (%)
Ecstasy II     36,31      1,39      3,83
Ecstasy III    52,83      1,62      3,07
Ecstasy IV     35,24      4,34     12,32
Doping I        9,62      0,96      9,94
Doping II       2,08      0,15      7,02
Doping III      0,88      0,06      6,72
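The CV reported in Table 2 is simply the relative standard deviation, CV = SD/mean × 100 %. A minimal check against the Ecstasy II row (function name is an assumption; the values are as reported in the table):

```python
def coefficient_of_variation(mean: float, sd: float) -> float:
    """Coefficient of variation (CV) in percent: SD relative to the mean."""
    return sd / mean * 100.0

# Ecstasy II from Table 2: mean 36.31 % API, SD 1.39 %
cv = coefficient_of_variation(36.31, 1.39)
print(round(cv, 2))  # → 3.83
```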


4.2.4 Model B

The results for Model B can be observed in Table 3. The model investigated the predictability of homogeneity by approximating the signal and noise variation as the average of the 10 variables with the largest and the smallest SD, respectively, and further relating them to the explained variance given by the PCA. Note that the average variation for the signal-dependent variables (wavelengths) is noise corrected; the average noise-dependent variation was subtracted from the respective signal variation. A signal-to-noise ratio (S/N) could thereby be established. It is also important to note that the variables with the 10 largest SD were located in unique regions for each specific sample group, which supports the validity of approximating the signal variation in this manner. The spectral variation was found to correlate strongly with the loadings plot of PC1, meaning that the variable regions with the 10 largest SD had large loadings (positive or negative) in PC1, while regions with the 10 smallest SD consequently had small loadings. With the addition of quadratic detrending, there was an observed decrease in significant explained variance for the initial three PCs for all types of sample groups. With every step of pre-processing, additional unwanted, systematic variation is removed (variation derived solely from differences between samples in the population is enhanced), which is why the decrease was expected.
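The S/N approximation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact implementation used in the thesis; the function name and the synthetic spectra matrix are invented for the example.

```python
import numpy as np

def approximate_snr(spectra: np.ndarray, k: int = 10):
    """Approximate signal and noise variation from a spectra matrix
    (rows = objects, columns = wavelengths).

    Noise variation:  average of the k smallest per-wavelength SDs.
    Signal variation: average of the k largest per-wavelength SDs,
                      noise corrected by subtracting the noise average.
    """
    sd = spectra.std(axis=0, ddof=1)   # SD per wavelength across objects
    sd_sorted = np.sort(sd)
    noise = sd_sorted[:k].mean()       # average of the k smallest SDs
    total = sd_sorted[-k:].mean()      # average of the k largest SDs (total)
    signal = total - noise             # noise-corrected signal variation
    return total, noise, signal, signal / noise

# Synthetic example: 40 spectra over 70 wavelengths, with one band
# carrying stronger between-object variation than the baseline noise
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1e-4, size=(40, 70))
X[:, 30:40] += rng.normal(0.0, 5e-4, size=(40, 10))
total, noise, signal, snr = approximate_snr(X)
print(snr > 1.0)  # the strong band dominates, so S/N exceeds 1
```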


Table 3. Approximated signal and noise variation in NIR spectra, given by the average of the 10 largest SD and the average of the 10 smallest SD, respectively. The average of the 10 largest SD is given as total (not noise corrected) and signal (noise corrected); significant and noise explained variance (%) with their quotient are given by PCA for the respective sample groups: a) homogeneous sample tablets, b) homogeneous sample capsules, c) seized sample groups and d) heterogeneous sample groups. Data are given in the reduced spectral range of 1300-2000 nm after (ii) area normalization, quadratic detrending and mean centering. PCA (using SVD computation) was performed with 20 PCs. Note that the average largest and smallest SD are multiplied by 10^4.

Reduced spectral range (1300-2000 nm)

Sample group       Avg 10 largest   Avg 10 smallest   Avg 10 largest   S/N    Significant expl.   Noise expl.     Quotient
                   SD (total)       SD (noise)        SD (signal)             variance (%)        variance (%)

Homogeneous sample tablets
Stesolid®          0,31             0,18              0,13             0,76   47,74               52,26            0,91
Xanor®             0,45             0,23              0,22             0,99   59,17               40,83            1,45
Iktorivil®         0,42             0,22              0,20             0,91   57,26               42,74            1,34
Ksalol®            0,56             0,28              0,28             0,99   62,00               38,00            1,63
Orgabolin          0,44             0,21              0,23             1,07   61,97               38,03            1,63
Ergenyl®           0,67             0,31              0,36             1,15   63,22               36,78            1,72

Homogeneous sample capsules
Omeprazol          0,63             0,27              0,36             1,31   71,79               28,21            2,54
Lanoprazol         0,56             0,20              0,36             1,85   72,23               27,77            2,60

Seized sample groups
Tamoxifen          0,58             0,23              0,35             1,41   79,16               20,84            3,80
OxyContin®         0,64             0,22              0,42             1,95   82,53               17,47            4,72
Winstrol           2,35             0,41              1,94             4,69   97,34                2,66           36,60
Tramadol-X         0,91             0,29              0,62             2,16   91,05                8,95           10,17
Lyrica®            0,61             0,29              0,32             1,09   88,68               11,32            7,84
Dolol®             1,29             0,30              0,99             3,36   94,60                5,40           17,52
Ecstasy I          0,85             0,32              0,53             1,79   91,84                8,16           11,25
Ecstasy II         0,85             0,22              0,63             2,04   86,16               13,84            6,23
Ecstasy III        0,72             0,34              0,38             1,14   79,27               20,72            3,83
Ecstasy IV         1,10             0,33              0,77             2,15   91,37                8,63           10,58
Doping I           0,82             0,38              0,44             1,17   71,70               28,30            2,53
Doping II          0,83             0,35              0,49             1,40   81,91               18,10            4,53
Doping III         0,45             0,17              0,28             1,59   86,40               13,60            6,35

Heterogeneous sample groups
40/20 MDMA         0,70             0,30              0,41             1,37   87,55               12,45            7,03
40/30 MDMA         0,68             0,35              0,33             0,93   86,22               13,78            6,26
40/30/20 MDMA      0,70             0,31              0,39             1,26   88,24               11,76            7,51
50/50 Preg/Gaba    1,25             0,36              0,90             2,51   93,84                6,16           15,25
70/30 Preg/Gaba    1,15             0,36              0,80             2,21   92,30                7,70           11,99
90/10 Preg/Gaba    0,88             0,36              0,53             1,47   90,59                9,41            9,63
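The explained-variance quotient in Table 3 contrasts the variance captured by the initial ("significant") PCs with the variance left to the remaining ("noise") PCs. A minimal numpy sketch of that calculation via SVD is given below; the three-PC split and all names are assumptions based on the text, and the data are synthetic.

```python
import numpy as np

def explained_variance_quotient(spectra: np.ndarray, n_significant: int = 3):
    """Quotient between the explained variance of the first n_significant PCs
    and that of the remaining PCs, via SVD on the mean-centered spectra matrix
    (rows = objects, columns = wavelengths)."""
    centered = spectra - spectra.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values
    ratios = s**2 / (s**2).sum()                   # explained variance ratios
    significant = ratios[:n_significant].sum()
    noise = ratios[n_significant:].sum()
    return significant, noise, significant / noise

# Synthetic spectra matrix: 30 objects, 50 wavelengths
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 50))
sig, noi, quotient = explained_variance_quotient(X)
print(abs(sig + noi - 1.0) < 1e-9)  # the ratios always sum to 1
```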
