A multivariate model for identiﬁcation of bacteria using Pyrolysis-GC/MS


Nelson D. Urbina
Uppsala University, Department of Mathematics
September 4, 2001

Abstract

It was possible to separate and identify each strain of bacterium using multivariate data analysis models with the fatty acid proﬁle of the cell wall.

The fatty acid profile for each bacterium was obtained using a chromatogram of the pyrolyzed samples. These data were used to build a library of the bacterial fatty acid profiles.

The data were analyzed using multivariate data analysis methods to get a better understanding and to ﬁnd an eﬃcient way of identifying the unknown strains of bacteria.

A program that uses the results from the data analysis was made for the fast identiﬁcation of the new samples.


To all the persons that I met in SCA packaging and graphic research, but especially to Tomas Nordstrand who ﬁrst recommended me for this ﬁnal thesis, my supervisor Ulrika Andreasson for her trust and all the guidance throughout the project work, Elisabeth Olsson for her patience and teaching when using the Py-GC/MS, and my friend Birgitta Grass–Backlund for all her help when growing the bacteria samples (and the opportunities to practice my Swedish).

To my supervisor in Uppsala University, Prof. Dag Jonsson, whose recommendations and suggestions helped me get all this together.

To Carin Olofsson from the Department of Clinical microbiology laboratory in Umeå University, who kindly supplied all the bacteria used in this project. To Håkan Norberg from Mid Sweden University, who supplied some of the fatty acids for the calibration of the pyrolysation program.

I also want to send my appreciation to all the persons in the “Universidad Simón Bolívar”, my home university, which gave me the opportunity to come to Sweden and make this final thesis. Especially, I want to thank my coordinator, Prof. Bernardo Feijoo, who always believed in me, and Carmen Caleya and Irelis Baldirio from the international office, without whom I would not be here.


Chapter 1

Introduction

Biofouling and biocorrosion are two of the major problems caused by accumulation of bacteria in industrial environments [1].

The accumulation of microorganisms produces contaminants that cause oxidation and corrosion of machinery, which reduces the lifetime of the equipment. It can also increase the power consumption of the plant when the fouling occurs in pipes, reducing or even blocking the fluid flow.

If the bacteria are not detected at an early stage of the process, the consequence can be reduced productivity and lower efficiency of the plant. It can also lead to a full stop of production in order to clean the equipment or to replace a damaged piece, causing a loss of time and money.

1.1 The paper industry problem

In the recycled paper process, all or part of the pulp for making the new paper comes from recycled paper, which probably has been exposed to unhygienic conditions. Thus, the recycled paper usually gets contaminated with all types of microorganisms.

One common problem in the mills is the presence of slime deposits produced by the microorganisms [2]. The sticky nature of the slime can obstruct machinery movement as well as produce hygienic problems in the mill environment.

The presence of “endospore forming” bacteria, which are the only ones that can survive the drying section, can also cause problems in the ﬁnal product. The endospore forming bacteria have the ability to change themselves into spores.

Spores are the most resistant life form known. Bacterial spores can survive extreme conditions such as very high temperatures, absence of oxygen and many harmful chemicals. When the optimal conditions return, the spores transform themselves back into bacteria and start to multiply again.

1.1.1 Product safety

It is important to keep the levels of bacterial spores in the final paper low enough to prevent the revived bacteria from multiplying to levels that might be harmful.

For example, some of the endospore forming bacteria, like the Bacillus cereus, can produce toxins during their growth process. This particular type of bacteria can be found in food products, speciﬁcally in rice and beans, but is often harmless if the bacterium level is low enough and the corresponding hygienic measures are followed.1

This bacterium can be a problem if it is present in the paper intended for packaging of food products. It is then important to be aware of the level of this bacterium (and of any other potentially dangerous bacteria) during the paper manufacturing process to keep it below a limit in the ﬁnal product. Even if the bacteria do not produce toxins, they can damage the ﬁnal paper. When the bacteria start growing, they will produce visible spots on the paper that will make it unusable, especially if it is white paper.

Imagine the loss of a consumer that has bought white paper to have in stock and ﬁnds it unusable after a month or less of storage.

1.2 The bacteria identiﬁcation problem

It is of special interest to get rid of the bacteria in the early stages of the process, before they reach the final product. First, because it is harder or even impossible to eliminate the bacteria from the final paper and second, because of all the problems mentioned before that the bacteria can cause throughout the manufacturing process.

Unfortunately, there is not a “universal” biocide for bacterial contamination. In order to apply the most effective treatment we must know which types of bacteria are present in the industry. What’s more, different types of bacteria can be found at different levels of the manufacturing process in the same industry.

This is why we must have eﬃcient methods to identify the bacteria in the pulp and in the whitewater.

1 The food should not stand at room temperature for a long period of time after


Chapter 2

Background

Different methods have been used historically for the identification of bacteria. In taxonomy, identification is mainly achieved by finding differences between the species by means of phenotypic analyses.

Phenotypic analyses aim to characterize an organism by considering what it looks like, what it does or what enzymes it contains. As bacteria are so small, these kinds of procedures are ineffective for a proper identification of closely related species.

Nevertheless, phenotypic analyses have played an important role in bacterial classification and identification, especially in applied situations like clinical diagnostic microbiology.

New methods have been developed for a more effective identification of bacterial species. The term molecular taxonomy is commonly used to describe this collection of taxonomic methods, since in many cases the methods involve the chemical analysis of some cell component such as the cell wall.

2.1 Identiﬁcation methods

The conventional taxonomy methods used for bacteria identification are microbiological tests. These include the gram-positive/gram-negative classification test,1 the ability to ferment lactose, rod or non-rod shape, and others [3]. This process usually takes several weeks, as the bacterium must first be isolated from the sample and from other bacteria.

1 The gram stain is an important differential staining procedure widely used in bacteriology. Based on their reaction to the gram stain, bacteria can be divided into two major groups: gram-positive and gram-negative. After gram staining, gram-positive bacteria appear purple and gram-negative bacteria appear red.

Some of the molecular taxonomy methods include DNA:DNA hybridization and ribotyping [3]. These methods, though highly effective, need special and expensive equipment. For this reason, special laboratories perform these analyses. The laboratories charge for each identification, which can be expensive if several identifications have to be carried out.

2.2 Fatty acid analyses using Pyrolysis-Gas Chromatography/Mass Spectrometry

Fatty acid analysis is a good alternative for the identification of paper-related bacteria [4]. Fatty acids are fatty carboxylic acids2 that are obtained from hydrolysis of naturally occurring fats and oils.

Fatty acid analysis has become very popular lately because the equipment needed for these analyses is common in chemical analysis laboratories. This analysis also gives a fast and reliable identification of the major microorganisms of importance.

In the bacteria samples, the fatty acids are the main organic compounds present in the cell wall structure. Because the chemistry of fatty acids can be so variable, including differences in chain length, the presence or absence of unsaturated groups and branched chains, and hydroxy groups, the fatty acid profiles are an important tool for bacterial identification.

One of the procedures used for fatty acid analyses of bacteria samples is pyrolysis-gas chromatography/mass spectrometry (Py-GC/MS). Using Py-GC/MS, the actual sample analysis takes nearly an hour, including sample preparation.3

The result from this analysis is a chromatogram showing the amount of the basic compounds that are present in the samples, together with the mass spectrum for each compound. The analyst uses the mass spectra for the identiﬁcation of each compound and the chromatogram works as a proﬁle of the sample [5, 6].

This method has been nicknamed FAME, for fatty acid methyl ester, as the fatty acids need to be chemically modiﬁed to form their corresponding methyl esters. These methyl esters are more volatile and easier to measure by the Py-GC/MS instrument than the original fatty acids.

Figure 2.1 shows the chromatogram of one of the Bacillus species, B. licheniformis. The mass spectrum at the bottom corresponds to 13-methyl tetradecanoic acid methyl ester, which was the highest peak in the chromatogram.

[Figure 2.1 image: total ion chromatogram (EI+ TIC, retention time 10 to 55 min) of the sample acquired on 28-Jun-2001, with the mass spectrum of the peak at 29.483 min (scan 1388, base peak m/z 74) below it.]

Figure 2.1: The chromatogram for B. licheniformis. At the bottom is the mass spectrum for the highest peak in the chromatogram, which was identified as 13-methyl tetradecanoic acid, methyl ester.

2 Fatty carboxylic acids are carbon chains (CH3−CH2−CH2−...) with one carboxylic group (−COOH) attached to one end.

The fatty acid composition works as a fingerprint for each bacterium and we can use it as a tool for the identification [7, 8]. We can determine the fatty acid profile for each bacterium from a chromatogram, after identification of the peaks.

2.3 Multivariate data analysis

Even though Py-GC/MS is more effective for bacteria identification than the classical microbiological methods and more efficient than the other molecular taxonomy methods, the chromatograms are not easy to interpret and the identification is not straightforward.

Multivariate data analysis (MDA) is a set of tools that allows us to analyze large quantities of data in a fast and enhanced way [9, 10]. Good results were obtained by previous experiments, showing that MDA can be used for the identiﬁcation of the paper industry related bacteria [11, 12].


Chapter 3

Methodology

One principal component (PC) model was fitted to identify the relationships between the bacteria and their fatty acid profiles, and one PC model was fitted for each bacterium studied in order to make the identification. Because the final practical objective involved the development of a computer application for the identification, a projection to latent structures (PLS) model was also fitted for making fast discriminations between the observed groups of bacteria.

In this way, instead of checking an unknown strain against each bacterium model, the strain is first pre-classified into a group and then compared with the bacteria models in that group.

Even if the strain is not present in our library of bacterial fatty acids, this approach will give us an idea of what kind of strain we have.

The programs used for the MDA were Simca-P, R and Matlab. In all cases, the data were mean-centered to zero and scaled to unit variance.
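The scaling step can be illustrated with a minimal sketch. It is not the code used in the project (the analyses were run in Simca-P, R and Matlab); the function name `autoscale`, the use of the sample standard deviation (n − 1) and the example peak areas are assumptions made for illustration only.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Mean-center each column and scale it to unit variance.

    rows: list of samples, each a list of peak areas (one per fatty acid).
    Assumption: the sample standard deviation (n - 1) is used, and a
    constant column is set to zero instead of being divided by zero.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up peak areas: three samples, two fatty acids.
profiles = [[10.0, 200.0], [12.0, 260.0], [14.0, 320.0]]
scaled = autoscale(profiles)
```

After this transformation every fatty acid has mean zero and unit variance, so a fatty acid with large absolute peak areas does not dominate the principal components merely because of its scale.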

The computer application was written in MS Visual Basic. It works as a macro for the program MassLab, version 1.4, in the Py-GC/MS computer.

3.1 The bacteria studied

In this project, 15 bacterial strains were studied. They are listed in Table 3.1 together with their principal characteristics.

The pure bacteria samples came originally from different laboratories. Most of them were kindly provided by Carin Olofsson from the clinical bacteriological laboratory in Norrland University Hospital, Umeå.

Bacterial strain          Gram group    Spore forming   Toxic
Aeromonas hydrophila      G-negative    No              No
Bacillus brevis           G-positive    Yes             No
Bacillus cereus           G-positive    Yes             Food pathogen
Bacillus licheniformis    G-positive    Yes             No
Bacillus megaterium       G-positive    Yes             No
Bacillus polymyxa         G-positive    Yes             No
Bacillus pumilus          G-positive    Yes             No
Bacillus subtilis         G-positive    Yes             No
Escherichia coli          G-negative    No              Food pathogen
Enterococcus faecalis     G-positive    No              Often harmless
Micrococcus luteus        G-positive    No              Harmless
Proteus mirabilis         G-negative    No              No
Pseudomonas aeruginosa    G-negative    No              No
Pseudomonas cepacia       G-negative    No              No
Staphylococcus aureus     G-positive    No              Pathogen

Table 3.1: Bacteria studied in this project. In the case of S. aureus, two different samples were studied. One sample came originally from the United States (ATCC 6538) and the other from Göteborg (CCUG 15915).

The strains were incubated on a nutritive medium for two days. The samples were analyzed in the Py-GC/MS within four weeks after incubation.1 Figure 3.1 shows two of the bacteria cultivated in our laboratories from the pure samples.

3.2 The variables measured

Table 3.2 lists the fatty acids measured in the chromatogram, together with their common abbreviations. This list was obtained keeping in mind previous works in the same area [11, 7].

The reference for the mass spectrum and the retention time of each selected fatty acid was obtained using the Lipid Standard, 189-19 from Sigma and the Bacterial Acid Methyl Esters CP Mix, 47080-U from Supelco.

Some of the references for the fatty acids not present in these two laboratory samples were obtained directly from the bacteria samples.

1 See Appendix A for details on the bacterial growth and Appendix B for reference to the


Fatty acid                  Abbreviation  Fatty acid                        Abbreviation
Octanoic                    C8:0          9-hexadecenoic, 10-methyl         ∆-C17:0
Decanoic                    C10:0         Heptadecanoic                     C17:0
Undecanoic                  C11:0         2-hydroxyhexadecanoic             C16:0, 2-OH
2-hydroxydecanoic           C10:0, 2-OH   9,12,15-octadecatrienoic          C18:3, c-9,12,15
Dodecanoic                  C12:0         6,9,12-octadecatrienoic           C18:3, c-6,9,12
Dodecanoic, 11-methyl       i-C13:0       Trans-9,12-octadecadienoic        C18:2, t-9,12
Tridecanoic                 C13:0         9,12-octadecadienoic              C18:2, cis-9,12
2-hydroxydodecanoic         C12:0, 2-OH   9-octadecenoic                    C18:1, cis-9
3-hydroxydodecanoic         C12:0, 3-OH   Trans-9-octadecenoic              C18:1, trans-9
Tridecanoic, 12-methyl      i-C14:0       Octadecanoic                      C18:0
9-tetradecenoic             C14:1, cis-9  9-octadecenoic, 10-methyl         ∆-C19:0
Tetradecanoic               C14:0         Nonadecanoic                      C19:0
Tetradecanoic, 13-methyl    i-C15:0       5,8,11,14-eicosatetraenoic        C20:4, c-5,8,11,14
Tetradecanoic, 12-methyl    a-C15:0       5,8,11,14,17-eicosapentaenoic     C20:5
10-pentadecenoic            C15:1         11,14,17-eicosatrienoic           C20:3
Pentadecanoic               C15:0         11,14-eicosadienoic               C20:2
2-hydroxytetradecanoic      C14:0, 2-OH   11-eicosenoic                     C20:1
Pentadecenoic, 14-methyl    i-C16:1       8,11,14-eicosatrienoic            C20:3
3-hydroxytetradecanoic      C14:0, 3-OH   Eicosanoic                        C20:0
Pentadecanoic, 14-methyl    i-C16:0       Henicosanoic                      C21:0
9-hexadecenoic              C16:1, cis-9  4,7,10,13,16,19-docosahexaenoic   C22:6
Pentadecenoic, 13-methyl    a-C16:1       13,16-docosadienoic               C22:2
Hexadecanoic                C16:0         13-docosenoic                     C22:1
11-hexadecenoic, 15-methyl  i-C17:1       Docosanoic                        C22:0
Hexadecanoic, 15-methyl     i-C17:0       Tricosanoic                       C23:0
Hexadecanoic, 14-methyl     a-C17:0       15-tetracosenoic                  C24:1, cis-15
10-heptadecenoic            C17:1         Tetracosanoic                     C24:0


Figure 3.1: Grown strains of bacteria from pure samples. To the left is Bacillus cereus and to the right Bacillus licheniformis.

Variable   Expression
Var 1      (Trans-9-octadecenoic acid) / (Hexadecanoic acid)
Var 2      (Trans-9-octadecenoic acid) / (Pentadecenoic acid, 13-methyl)
Var 3      (Tetradecanoic acid, 12-methyl) / (Tetradecanoic acid, 13-methyl)
Var 4      (Hexadecanoic acid, 14-methyl) / (Tetradecanoic acid, 13-methyl)

Table 3.3: New variables for the hard-to-identify groups

In practice, only some of the fatty acids were included in the models, because some of them never appeared in the samples or their amounts were too small to be considered significant for the classification [13]. Four new variables were generated to separate two groups of bacteria that share similar profiles.2 These variables are listed in Table 3.3.
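As an illustration, the generated variables could be computed from a table of peak areas as sketched below. This assumes that each entry of Table 3.3 is a ratio of two peak areas, with numerator and denominator read as in the code comments; that reading, the function names and the example areas are assumptions, not taken from the original program. The dictionary keys follow the abbreviations of Table 3.2.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator peak is absent."""
    return numerator / denominator if denominator > 0 else None


def generated_variables(areas):
    """Compute the four ratio variables from peak areas keyed by abbreviation.

    Assumed reading of Table 3.3 (numerator / denominator):
      Var 1: trans-9-octadecenoic / hexadecanoic
      Var 2: trans-9-octadecenoic / 13-methyl pentadecenoic
      Var 3: 12-methyl tetradecanoic / 13-methyl tetradecanoic
      Var 4: 14-methyl hexadecanoic / 13-methyl tetradecanoic
    """
    def get(key):
        return areas.get(key, 0.0)

    return {
        "Var 1": ratio(get("C18:1, trans-9"), get("C16:0")),
        "Var 2": ratio(get("C18:1, trans-9"), get("a-C16:1")),
        "Var 3": ratio(get("a-C15:0"), get("i-C15:0")),
        "Var 4": ratio(get("a-C17:0"), get("i-C15:0")),
    }


# Made-up relative peak areas for one sample.
sample = {
    "C18:1, trans-9": 2.0,
    "C16:0": 8.0,
    "a-C15:0": 30.0,
    "i-C15:0": 60.0,
    "a-C17:0": 15.0,
}
new_vars = generated_variables(sample)
```

Returning `None` when the denominator peak is missing is one possible convention; a real implementation would have to decide how such samples enter the model.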

3.3 Separating into groups

As stated before, the final objective of this project is to write a computer application that automatically identifies the bacteria using the fatty acid profile from the Py-GC/MS.

2 The Bacillus licheniformis, Bacillus pumilus, Bacillus subtilis group and the

To make the identification more efficient, the bacteria were divided into naturally occurring groups or clusters. To determine these groups, a hierarchical cluster analysis was done using the Euclidean distance to generate the distance matrix and the single linkage method to merge the clusters.3
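A minimal sketch of single-linkage clustering with the Euclidean distance is given below to make the procedure concrete. It is a plain, unoptimized illustration and not the routine used in the project; the function name `single_linkage` and the example score values are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(points, n_clusters):
    """Agglomerative clustering with Euclidean distance and single linkage.

    Starts with one cluster per point and repeatedly merges the two clusters
    whose closest members are nearest, until n_clusters remain.
    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = smallest pairwise distance.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Made-up two-dimensional PC scores for five samples.
scores = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]
groups = single_linkage(scores, 3)
```

Stopping at a fixed number of clusters corresponds to cutting the dendrogram at the height where that many branches remain.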

The clusters obtained in this analysis were used to define the dummy variables that classify the bacteria into groups for doing the PLS–discriminant analysis (PLS–DA).

Using the dummy variables, a PLS–DA model was fitted. This model is used as a classification tool. If the bacterium is not identified as one of the bacteria in the group, the group characteristics are good enough to have an idea of the type of the unknown bacteria.
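The dummy variables can be pictured as a 0/1 matrix with one column per group. The sketch below shows one common way to build such a matrix and to turn predicted responses back into a group; the function names and the rule of picking the largest predicted response are assumptions for illustration, not necessarily what Simca-P does internally.

```python
def dummy_matrix(labels):
    """One-hot encode group labels into a 0/1 response matrix for PLS-DA."""
    groups = sorted(set(labels))
    matrix = [[1 if label == g else 0 for g in groups] for label in labels]
    return groups, matrix


def assign_group(predicted_row, groups):
    """Pick the group whose predicted dummy response is largest (assumed rule)."""
    best = max(range(len(groups)), key=lambda k: predicted_row[k])
    return groups[best]


# Made-up group memberships for six samples.
sample_groups = [1, 3, 3, 4, 5, 2]
groups, Y = dummy_matrix(sample_groups)

# A hypothetical row of predicted responses for an unknown strain.
predicted = [0.05, 0.10, 0.70, 0.20, -0.05]
group_of_unknown = assign_group(predicted, groups)
```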

3.4 Individual PC models

One PC model for each of the most important bacterium strains was ﬁtted using between six and eight analyzed samples.4

In each model, only the fatty acids that appeared consistently and in high enough amounts were used.5

When an unknown strain is analyzed, its absolute and normalized distances to each model in its corresponding group are calculated. These two measurements are used in a probabilistic test to check whether the bacterium belongs to one of the PC models [10]. If it does, it is identified as the bacterium corresponding to that model.

3 This metric and linkage method were chosen in order to increase the difference between the groups for a more effective discriminant analysis.

4 For reasons of time, it was not possible to analyze enough samples for fitting a reliable PC model for each bacterium.

5 If the amount of one fatty acid measured on the chromatogram was too low, its


Chapter 4

Results and discussion

The most common fatty acids measured were C14:0, i-C15:0, a-C15:0, C15:0, i-C16:0, C16:0, i-C17:0 and a-C17:0. This group of fatty acids was enough to give a classification of the samples.

With some exceptions, fatty acids with more than 18 carbons were not detected in the chromatograms. Fatty acids with fewer than 10 carbons were not found in the chromatograms.

It was possible to differentiate and identify the bacteria strains. Some problems appeared when trying to separate B. licheniformis, B. subtilis and B. pumilus.

Using the ratios of the dominant fatty acids in their chromatograms, a sufficiently significant difference can be found to separate B. subtilis from this group. It seems that it is also possible to separate B. licheniformis from the B. pumilus strain, but unfortunately there were not enough pyrolysed samples of B. pumilus to fit a model.

Something similar happened when the Staphylococcus aureus strain from Göteborg (CCUG 15915) and the Aeromonas hydrophila strain were compared. However, the fact that the S. aureus strain from the United States (ATCC 6538) showed a fatty acid profile different from that of the S. aureus strain from Göteborg (CCUG 15915) made it necessary to send the two strains to a specialized laboratory for identification.1

The results from the laboratory showed that the S. aureus strain from Göteborg (CCUG 15915) was S. aureus mixed with A. hydrophila, while the S. aureus strain from the United States (ATCC 6538) was in fact S. aureus in two variations.

1 The fatty acid profile for the S. aureus strain from Göteborg (CCUG 15915) appeared to be too similar to the A. hydrophila strain, while the S. aureus strain from the United States (ATCC 6538) showed a profile more similar to the M. luteus and the B. brevis strains, without any problems for the identification.


Figure 4.1: Score and loading plot of the first two PCs of the model with all the bacteria

From these results, it was decided to exclude the S. aureus strain from Göteborg (CCUG 15915) from the rest of the analyses.

4.1 Overview

When all the bacteria are included in the same model, some important diﬀerences can be seen between them. The model was ﬁtted using eight of the original fatty acids and two of the generated variables.

Figure 4.1 shows the score plot of the first two principal components together with the corresponding loading plot. The loading plot also shows the variables used to fit this model.

These two components explain 72.2% of the total variance. Together with the third component, the model explains 86% of the variance.

In Figure 4.2, the first vs. the third component is displayed together with their loading plot. The same color coding as before is used.


Figure 4.2: Score and loading plot of the first and third PCs of the model with all the bacteria

The B. megaterium, B. licheniformis, B. pumilus and B. subtilis strains are characterized by high levels of 13-methyl tetradecanoic acid, 12-methyl tridecanoic acid, 15-methyl hexadecanoic acid and 12-methyl tetradecanoic acid.

In particular, the B. megaterium strains have higher amounts of 12-methyl tridecanoic acid than the B. licheniformis, B. pumilus and B. subtilis strains, which had more 15-methyl hexadecanoic acid.

The Micrococcus luteus and Bacillus brevis strains showed high levels of 14-methyl hexadecanoic acid and 14-methyl pentadecanoic acid.

It is important to notice that the gram-negative bacteria tend to have high amounts of tetradecanoic acid and hexadecanoic acid in comparison with the gram-positive strains.

However, one gram-positive strain showed the same pattern: E. faecalis. B. cereus also seems to follow this pattern, but it lies in the middle between these strains and the rest of the gram-positive strains.


Figure 4.3: Bacteria clusters obtained from a hierarchical cluster analysis of the PC scores of the global model. The metric used for calculating the distance matrix was the Euclidean distance. The linkage method for grouping the clusters was the single linkage method.

4.2 The cluster analysis

The dendrogram obtained from the cluster analysis is shown in Figure 4.3. Notice that, with the exception of the E. faecalis strain and one sample of B. cereus, all the gram-negative strains grouped together in one cluster. It was decided to divide the bacteria into five groups for the discriminant analysis.

Group 1 contains only B. megaterium.
Group 2 is B. polymyxa.
Group 3 includes the strains A. hydrophila, B. cereus, E. coli, E. faecalis, P. aeruginosa, P. cepacia and P. mirabilis.
Group 4 is formed by the B. licheniformis, B. pumilus and B. subtilis strains.
Group 5 is the cluster formed by B. brevis, M. luteus and S. aureus (ATCC 6538).

4.3 The discriminant analysis


Figure 4.4: Discrimination coefficients of the PLS–DA model for the selected group of bacteria

components. The same selection of fatty acids was used as in the global model.

Figure 4.4 shows the coeﬃcients of the PLS model for each of the bacteria groups.

Group 1 is characterized by 12-methyl tridecanoic acid. Also, the branched fatty acids 12-methyl tetradecanoic acid and 13-methyl tetradecanoic acid appeared in this group.

Group 2 shows a more complex profile, but high values of the generated variables 3 and 4 are the most important difference. It is also characterized by the presence of 14-methyl pentadecanoic acid and 12-methyl tridecanoic acid.

The main characteristic of group 3 is the level of hexadecanoic and tetradecanoic acid. A secondary characteristic is the level of 15-methyl hexadecanoic acid.

The presence of 15-methyl hexadecanoic acid is the main difference between group 4 and the other groups. 14-methyl hexadecanoic acid and 13-methyl tetradecanoic acid are also distinctive of this group.

Finally, group 5 is identified by the presence of the branched fatty acids 12-methyl tetradecanoic acid and 14-methyl hexadecanoic acid.


Figure 4.5: The score plots of the combined weights of the PLS–DA model. To the left are the weights of the first vs. the second component. To the right are the weights of the first vs. the third component.

components of the PLS–DA model. Similar conclusions as before can be obtained from these plots.

It is important to note from Figure 4.5 that groups 1 and 4 are quite similar, but they are distinguished by the difference between 12-methyl tridecanoic acid and 15-methyl hexadecanoic acid. Group 1 shows more of the former and group 4 more of the latter.

Groups 2 and 5 also share similar fatty acid profiles, but not as much as groups 1 and 4.

Group 3 is completely different from the rest of the groups. This group can be identified by the presence of tetradecanoic and hexadecanoic acids.

4.4 The PC models for each bacterium

At the time of writing, only six individual models were fitted. Each model was fitted using only the more representative set of fatty acids for the bacterium.

The individual models were ﬁtted for the strains of B. subtilis, B. brevis, B. licheniformis, B. megaterium, M. luteus and B. cereus.

The B. subtilis model was fitted using the fatty acids i-C14:0, i-C15:0, [...] generated variables 3 and 4.

The B. brevis model includes the fatty acids i-C13:0, i-C15:0, a-C15:0, i-C16:0, C16:0, i-C17:0 and a-C17:0 and the generated variables 3 and 4.

The B. licheniformis model contains ten fatty acids and the generated variables 3 and 4. The fatty acids in this model are C10:0, C12:0, i-C14:0, i-C15:0, a-C15:0, i-C16:0, C16:0, i-C17:1, i-C17:0 and a-C17:0.

The B. megaterium model uses seven fatty acids and the generated variables 3 and 4 to fit the data. The fatty acids are i-C14:0, C14:0, i-C15:0, a-C15:0, i-C16:0, C16:0 and a-C17:0.

The M. luteus model fits the data with the fatty acids i-C14:0, C14:0, i-C15:0, a-C15:0, i-C16:0, C16:0, i-C17:0, a-C17:0 and C22:1. It also includes the generated variables 3 and 4.

The B. cereus model was the most complex of all the individual models. It was fitted using 16 fatty acids and all four generated variables. The fatty acids in this model are C12:0, i-C13:0, i-C14:0, C14:0, i-C15:0, a-C15:0, C15:0, C14:0 3-OH, i-C16:0, C16:1 cis-9, a-C16:1, C16:0, i-C17:1, i-C17:0, a-C17:0 and C18:1 trans-9.

Model              # of PCs   R2       Q2
B. subtilis        3          98.7%    84.7%
B. brevis          2          98.4%    92.4%
B. licheniformis   5          99.7%    90.6%
B. megaterium      4          99.3%    85.5%
M. luteus          4          99.8%    94.3%
B. cereus          6          99.6%    91.5%

Table 4.1: Cumulative explained variance and the cumulative fraction of the predicted variation for each individual model.

Table 4.1 shows the cumulative explained variance (R2) and the cumulative fraction of the predicted variation (Q2) for the selected number of components in each model.2
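The two statistics can be sketched as follows, under the usual definitions R2 = 1 − RSS/SS and Q2 = 1 − PRESS/SS, where SS is the total sum of squares of the centered data, RSS is the residual sum of squares of the fitted model and PRESS is the residual sum of squares of cross-validated predictions. The function name and the numbers below are made up for illustration; how Simca-P forms its cross-validation groups is not reproduced here.

```python
def fraction_explained(observed, estimated):
    """Return 1 - sum((y - y_hat)^2) / sum(y^2).

    Assumption: the observed values are already mean-centered, so the total
    sum of squares is simply the sum of squared values.
    With fitted values this gives R2; with cross-validated predictions, Q2.
    """
    residual_ss = sum((y - e) ** 2 for y, e in zip(observed, estimated))
    total_ss = sum(y ** 2 for y in observed)
    return 1.0 - residual_ss / total_ss


# Made-up centered data, model fit and cross-validated predictions.
observed = [1.0, -1.0, 2.0, -2.0]
fitted = [0.9, -1.1, 1.8, -2.1]
cross_validated = [0.5, -0.5, 1.5, -1.5]

r2 = fraction_explained(observed, fitted)
q2 = fraction_explained(observed, cross_validated)
```

Because cross-validated predictions are made for samples left out of the fit, Q2 is normally lower than R2, as is the case for every model in Table 4.1.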

2 The cumulative Q2 is the estimated fraction of the total variation of the


Chapter 5

Conclusions

Bacterial fatty acid profile analysis is clearly a powerful tool for the identification of bacteria when it is used together with MDA techniques. With the help of a computer program, it can identify in a very short time the strains that have previously been analyzed and registered in the library.

It is important to study the chromatograms of closely related species before fitting a model. Sometimes it is necessary to come up with new variables to make a successful identification.

Normally, the ratios of the most important peaks will do. It is not recommended to base the identification of closely related species on the fatty acids that appear in low amounts in the chromatograms.1
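The 1% guideline given in the footnote to this paragraph can be turned into a simple check, sketched below. The function name, the default threshold argument and the example peak areas are illustrative assumptions.

```python
def low_peaks(areas, threshold=0.01):
    """Return the fatty acids whose peak area is below threshold * highest peak area."""
    highest = max(areas.values())
    return sorted(name for name, area in areas.items()
                  if area < threshold * highest)


# Made-up peak areas for one chromatogram.
areas = {"i-C15:0": 4.3e6, "a-C15:0": 2.1e6, "C16:0": 9.0e5, "C22:1": 2.0e4}
flagged = low_peaks(areas)
```

Fatty acids returned by such a check would be candidates to leave out of a model, or at least to inspect before relying on them.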

5.1 Repeatability of the experiments

It is important to repeat the experiments in the same way as they were done in this project. Failure to do so will strongly influence the profiles of the bacteria.

The time that passes between the incubation of the bacteria and the actual pyrolysation affects the fatty acid profile of the bacteria. For this reason, it is important to make the analysis as soon as possible after the bacteria are incubated. The bacteria should not be kept longer than four weeks before the pyrolysation.

The characteristics of the Py-GC/MS machine will also affect the profiles. The type of column used in the GC and the temperature of pyrolysation are probably the most important factors that will affect the detection of the fatty acids, changing the corresponding profiles.

1 Whether a fatty acid amount is low depends on the precision of the machine and on the variation of the fatty acids within the same species. In any case, fatty acids whose relative peak area is less than 1% of the highest peak area should be treated with care.

The running time of the GC program can be decreased by changing the rate at which the temperature rises and the initial and final temperature values. However, the GC program used will also affect the range of fatty acids that can be detected by the equipment.

If the temperature is raised too fast, some fatty acids can overlap in the chromatogram, and if the temperature interval is shortened too much, the machine can overlook some of the fatty acids.

5.2 Some recommendations

The bacterial fatty acids library should be extended to include more bacteria profiles. This will increase the effectiveness of the model to identify different types of bacteria.

New models must be fitted every time new data are added to the library, and the goodness of fit of each model should be tested.


Appendix A

Method for bacterial growth

The work was done under sterile conditions, with sterile solutions and equipment. Distilled water was used for preparation of the medium and the dilution liquid.

The colonies were counted after incubation.

A.1 Materials for bacterial growth

1. Reagents

• TSA – Tryptone soy agar.
• Ethyl alcohol 70% for cleaning.
• Tartaric acid 5 M for pH-adjustment.
• Sodium chloride at 0.9% and Tween 80 at 0.05% mixed with one liter of deionized water for diluting.

2. [...]

• Pasteur pipettes
• Microliter pipette
• Pipette tips of 1000 µl
• Measuring cylinder of 100 ml
• Magnetic spin wedge
• Forceps

3. Equipment for freezing

• Cryogenic vials + racks
• Inoculating loops of 10 µl
• Scissors
• Freeze incubator at −25 °C

A.2 Preparation of the agar

To prepare the agar, the dry nutrient powder was added to one liter of water and shacked vigorously. The preparation was boiled until full dissolution took place.

The flasks were sterilized in an autoclave for 15 minutes at 121 °C. After sterilization, the medium was stored in an incubation cabinet at 50 °C. The pH of the medium was adjusted to the pH of the samples (∼4.5–5).

A.3 Preparation of the bacterial growth

Agar medium was added to each petri dish. When the agar was solid, the pure bacteria were distributed in the medium.

The dishes were placed in a plastic bag and incubated upside down at 37 °C for 24 to 48 hours. After this, the samples were stored in a freeze incubator at −25 °C.

A.4 Method for freezing

A sterile inoculating loop of 10 µl was used to scrape oﬀ the bacteria without agar.


Appendix B

Method for pyrolysation

10–40 µg of the cultivated bacteria from the pure samples were placed on the platinum foil of the pyrolyser instrument using a sterile inoculating loop. The sample was then dried using a warm stream of air for five minutes. Because the pyrolysation process works best with volatile compounds, the sample needs to be methylated in order to obtain the methyl esters of the fatty acids (see footnote 1). To achieve this, around 1 µl of tetramethylammonium hydroxide (TMAH) was applied to the platinum foil. The sample was dried again for five minutes and placed in the pyrolyser chamber.

In the pyrolyser chamber, the sample was decomposed under heat treatment at 700 °C in an inert atmosphere of helium gas. The helium gas was also used as a carrier gas for the pyrolysed sample through the chromatographic column and to the mass spectrometer.

B.1 Program settings

The program used begins at 100 °C and holds this temperature for 10 minutes. The temperature is then increased at 8 °C/min to 180 °C. After this, the temperature is increased to 280 °C at 4 °C/min. This temperature is held for ten minutes.

The split flow of the helium was set to 15 ml/min and the flow through the column to 1 ml/min. The total running time of the program is 55 minutes.
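The stated running time can be checked from the program settings. The sketch below adds the hold times and the ramp durations; the function name `program_runtime` and the segment layout are illustrative choices and are not part of the instrument software.

```python
def program_runtime(start_temp, initial_hold, segments):
    """Total run time in minutes of a GC temperature program.

    segments: list of (target_temp_C, rate_C_per_min, hold_min_at_target).
    """
    total = initial_hold
    temp = start_temp
    for target, rate, hold in segments:
        total += (target - temp) / rate + hold  # ramp duration plus hold at the target
        temp = target
    return total


# Settings from section B.1: 100 °C held 10 min, 8 °C/min to 180 °C,
# then 4 °C/min to 280 °C, held 10 min.
runtime = program_runtime(100.0, 10.0, [(180.0, 8.0, 0.0), (280.0, 4.0, 10.0)])
print(runtime)  # 55.0
```

The same function can be used to see how a faster ramp or a narrower temperature interval would shorten the run, keeping in mind the trade-offs discussed in the conclusions.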

Footnote 1: The fatty acid methyl ester (FAME) is the same fatty acid with the proton in the carboxylic acid group (-COOH) replaced by a methyl group (-COOCH3). This makes the fatty acid more volatile and easier to measure in the chromatogram.


B.2 Pyrolysation equipment

The experiments were performed using a Pyrola 2000 filament pulse pyrolyser from Pyrol AB, Lund, Sweden.

The gas chromatograph was a Carlo Erba Top 8000 combined with a Voyager benchtop quadrupole mass spectrometer from Thermo Quest.


Appendix C

Statistical theory

Multivariate data analysis (MDA) refers to a wide assortment of descriptive and inferential techniques that analyze data in a manner that takes into account the relationships among variables. In contrast to univariate statistics, MDA methods use these relationships to extract the information from the data matrices in a more effective way.

In this project, three different methods from MDA were used to extract the results from the data. These methods were principal component analysis (PCA), hierarchical cluster analysis (HCA) and discriminant analysis using projection to latent structures (PLS–DA).

The basic theory for these methods is outlined in this appendix. More detailed information can be found in statistical books on MDA [9, 10].

C.1 Principal component analysis

The PCA objective is to reduce the number of variables (columns) in a data matrix. To accomplish this, new variables are generated from the original variables.

The new variables are called scores and they are estimated together with a set of vectors called loadings. The loadings measure the relationships between the scores and the original variables.

The scores are calculated successively so that each one retains the largest possible fraction of the remaining variance in the observations. Each new score therefore explains less variance than the previous one.

As a result, the first scores retain the major proportion of the variance, while the last scores usually explain an insignificant fraction, which is usually attributed to noise or other perturbations in the data collection process.

This makes it possible to ignore the last scores when doing the calculations. The original data matrix can then be approximated as a function of the calculated scores and loadings as:

$$X = T P^{\top} + E \approx T P^{\top} \tag{C.1}$$

Here X is the original data matrix, T is the matrix whose columns are the scores, P is the matrix whose columns are the loadings and $E = X - T P^{\top}$ is the matrix of residuals.
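The decomposition into scores, loadings and residuals can be illustrated numerically. The following is a minimal sketch that computes a mean-centered PCA through the singular value decomposition in NumPy; the function name `pca` and the small data matrix are hypothetical, and the software used in the project may compute the components with a different algorithm.

```python
import numpy as np


def pca(X, n_components):
    """Return scores T, loadings P, residuals E and explained variance fractions."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # mean centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]    # scores, one column per component
    P = Vt[:n_components].T                       # loadings, one column per component
    E = Xc - T @ P.T                              # residual matrix
    explained = s ** 2 / np.sum(s ** 2)           # fraction of variance per component
    return T, P, E, explained[:n_components]


# Hypothetical 5 x 3 data matrix in which the first two columns are strongly related.
X = np.array([[2.0, 4.1, 1.0],
              [3.0, 6.0, 1.2],
              [4.0, 7.9, 0.9],
              [5.0, 10.2, 1.1],
              [6.0, 11.8, 1.0]])
T, P, E, explained = pca(X, n_components=1)
print(explained)
```

With one component, the centered data equal the product of the scores and the transposed loadings plus the residuals, and almost all of the variance is explained because the first two columns carry nearly the same information.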

The scores can be plotted together with the loadings to identify the more relevant characteristics in the data. Groups of observations can be detected together with the relationships within the observations, within the variables, and between the observations and the variables.

PCA can also be applied for identiﬁcation. If we ﬁt a PC model with only the observations of one particular group or class, the model can be used to see if a new observation belongs to the same class.

This is accomplished by calculating the probability that the observation belongs to the model. If the observation k belongs to the model, the value of

$$\frac{\sum_{i} e_{ik}^{2} \, / \, (m - a)}{\sum_{i,j} e_{ij}^{2} \, / \, \left[ (n - a - a_0)(m - a) \right]} \tag{C.2}$$

is approximately F-distributed, where $e_{ij}$ are the elements of the residual matrix, m is the number of original variables, n is the number of observations used to fit the model, a is the number of selected principal components and $a_0$ is a constant value. If the data have been mean centered then $a_0 = 1$, otherwise $a_0 = 0$.

Thus, we can test the null hypothesis of the new observations belonging to the model versus the alternative hypothesis of not belonging to the model.
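A rough numerical sketch of the ratio in (C.2) is given below, assuming mean-centered data (so that the constant a0 equals 1) and a PC model fitted through the singular value decomposition. The function name `residual_variance_ratio` and the data are hypothetical. The comparison against an F critical value is left out, because the degrees of freedom to be used are not stated here.

```python
import numpy as np


def residual_variance_ratio(X_class, x_new, a):
    """Ratio of the residual variance of x_new to the pooled residual variance of the class model."""
    X_class = np.asarray(X_class, dtype=float)
    n, m = X_class.shape
    mean = X_class.mean(axis=0)
    Xc = X_class - mean                            # mean centering, hence a0 = 1
    a0 = 1
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:a].T                                   # loadings of the class model
    E = Xc - (Xc @ P) @ P.T                        # residuals of the training observations
    pooled = np.sum(E ** 2) / ((n - a - a0) * (m - a))
    e_new = np.asarray(x_new, dtype=float) - mean
    e_new = e_new - (e_new @ P) @ P.T              # residual of the new observation
    return (np.sum(e_new ** 2) / (m - a)) / pooled


# Hypothetical class whose observations lie close to a straight line.
X_class = [[1, 2, 3.1], [2, 4, 5.9], [3, 6, 9.05], [4, 8, 12.0], [5, 10, 14.95], [6, 12, 18.1]]
ratio_inside = residual_variance_ratio(X_class, [3.5, 7.0, 10.5], a=1)
ratio_outside = residual_variance_ratio(X_class, [3.5, 7.0, 20.0], a=1)
print(ratio_inside, ratio_outside)
```

An observation that follows the class pattern gives a small ratio, while one that departs from it gives a large ratio, which is what leads to the rejection of the null hypothesis.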

C.2 Hierarchical cluster analysis

In contrast with PCA, the hierarchical cluster analysis objective is to reduce the number of rows in a data matrix, by means of grouping together similar observations into clusters.

The “hierarchical” adjective means that the clustering process works sequentially, beginning with all the observations classified as individual clusters and ending when all the observations are grouped into the same cluster.

in mind the natural characteristics and previous knowledge about the data (see footnote 1).

To begin an HCA, a new matrix is generated from the original data matrix. The new matrix is called the dissimilarity matrix and is denoted here as D. The matrix D measures the differences between the observations in terms of the variables.

The elements of the dissimilarity matrix have the following properties:

$$d_{ij} \geq 0, \qquad d_{ii} = 0, \qquad d_{ij} = d_{ji} \tag{C.3}$$

where $d_{ij}$ corresponds to the dissimilarity between observation i and observation j.

There are different ways of calculating the dissimilarities or distances between two observations. The one used in this project is called the Euclidean distance.

The Euclidean distance between observation i and observation j is defined as:

$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^{2}} \tag{C.4}$$

where m is the number of columns in the original data matrix.

There are several ways of determining the clusters from the dissimilarity matrix. The method used here is the single linkage method, also called the nearest neighbor method.

The single linkage method groups together the clusters that show the least dissimilarity, beginning with the two observations that have the lowest dissimilarity in the matrix D and going up until the highest dissimilarity.

Each time a new cluster is formed, the dissimilarity matrix has to be updated to reflect the dissimilarity between the new cluster and the rest of the clusters.

The result of this analysis is a tree diagram, usually called a dendrogram, showing each step in the clustering procedure. Figure C.1 shows an example of this type of plot.

Footnote 1: For example, observations that are expected to group together because they are related

Figure C.1: A typical example of a dendrogram.
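The clustering procedure of this section can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: instead of updating a stored dissimilarity matrix, the sketch recomputes the smallest pairwise Euclidean distance between clusters at each step, which gives the same merges for single linkage. The function names and the four observations are hypothetical.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two observations."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(data):
    """Agglomerative clustering with single linkage.

    Returns the merges as (cluster_a, cluster_b, dissimilarity),
    where each cluster is a tuple of observation indices.
    """
    clusters = [(i,) for i in range(len(data))]   # every observation starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(euclidean(data[i], data[j])
                        for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return merges


# Hypothetical observations in two variables.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
merges = single_linkage(points)
for cluster_a, cluster_b, d in merges:
    print(cluster_a, cluster_b, d)
```

The list of merges, ordered by increasing dissimilarity, contains the information that a dendrogram displays: which clusters are joined at each step and at what height.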

C.3 Discriminant analysis using projection to latent structures

The discriminant analysis objective is to determine the diﬀerences between groups of observations that are known to be closely related.

One way of doing a discriminant analysis is to generate dummy variables that represent the previously defined groups (see footnote 2). A regression model is then fitted using the original data as the predictors and the dummy variables as the responses.

The PLS model can be thought of as an extension of PCA to multivariate regression. In a PLS analysis there are two different data matrices, X and Y.

The matrix X normally contains data obtained from experimental sources and is called the matrix of predictors. On the other hand, the matrix Y contains data that can be informative about the experimental process and is usually called the matrix of responses.

The main objective of PLS is to find the maximal correlation between the X matrix and the Y matrix in order to make predictions of Y as a function of X. It also aims to reveal the relationships between the responses and the predictors, and the relationships within the observations.

Footnote 2: A dummy variable is a variable that can only take two values, usually one or zero.

PLS consists of fitting two separate PC models: one model for the X matrix and one model for the Y matrix. These models pursue the same objectives as the normal PCA in each individual matrix.

Nevertheless, the calculation of the PC models is not the same as in a normal PCA. Besides the two groups of scores generated in the PC models, one for the X matrix and another for the Y matrix, a third set of vectors is generated, usually called weights and denoted as W. These weights are used to calculate the X matrix scores in order to maximize the correlation between the PC models of the X and Y matrices.

Normally, the matrix W is called the x–weight matrix and the loadings of the Y matrix are called the y–weight matrix.

The general equations obtained from a PLS analysis are:

$$\begin{aligned} X &= T P^{\top} + E \\ Y &= U C^{\top} + F \\ U &= T + H \end{aligned} \tag{C.5}$$

where T and U are the matrices of scores, P and C are the matrices of loadings, E and F are the matrices of residuals, and H is the matrix of residuals of the inner relation.

The major diﬀerence between PCA and PLS is that the former is a maximum variance projection of X while the latter is a maximum covariance model of the relationships between X and Y .

It is possible to combine discriminant analysis with PLS analysis. Defining Y as the matrix of dummy variables and X as the matrix of original data, a PLS model is fitted to find the relationships between the data and the grouping criteria.

This approach is referred to as PLS–discriminant analysis, or PLS–DA for short, and it can also be used to classify a new observation into one of the groups.

When a new observation is analyzed, its Y values are predicted for each dummy variable (i.e. each group). The highest predicted value corresponds to the group into which the observation should be classified (if the observation belongs to one of the groups).
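The classification rule can be illustrated with a small PLS–DA sketch. It assumes the NIPALS algorithm, which is one common way of fitting a PLS model and not necessarily the algorithm of the software used in this project. The function names `pls_fit` and `plsda_classify`, the number of components and the profiles are hypothetical.

```python
import numpy as np


def pls_fit(X, Y, n_components, tol=1e-10, max_iter=500):
    """Fit a PLS model with NIPALS; returns the centering vectors and the coefficient matrix."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    E, F = X - x_mean, Y - y_mean                 # residual matrices, deflated per component
    W, P, C = [], [], []
    for _component in range(n_components):
        # start from the Y column with the largest remaining sum of squares
        u = F[:, np.argmax((F ** 2).sum(axis=0))].copy()
        for _iteration in range(max_iter):
            w = E.T @ u
            w /= np.linalg.norm(w)                # x-weights
            t = E @ w                             # X scores
            c = F.T @ t / (t @ t)                 # y-weights
            u_new = F @ c / (c @ c)               # Y scores
            converged = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
            u = u_new
            if converged:
                break
        p = E.T @ t / (t @ t)                     # X loadings
        E = E - np.outer(t, p)
        F = F - np.outer(t, c)
        W.append(w)
        P.append(p)
        C.append(c)
    W, P, C = np.array(W).T, np.array(P).T, np.array(C).T
    B = W @ np.linalg.inv(P.T @ W) @ C.T          # regression coefficients for centered data
    return x_mean, y_mean, B


def plsda_classify(model, X_new):
    """Predict the dummy variables and assign each observation to the highest one."""
    x_mean, y_mean, B = model
    Y_pred = (np.asarray(X_new, dtype=float) - x_mean) @ B + y_mean
    return Y_pred.argmax(axis=1), Y_pred


# Hypothetical profiles of three groups, two observations per group.
X = np.array([[10, 1, 1], [9, 2, 1], [1, 10, 1], [2, 9, 1], [1, 1, 10], [1, 2, 9]], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[labels]                             # one dummy variable per group
model = pls_fit(X, Y, n_components=2)
classes, Y_pred = plsda_classify(model, [[9.5, 1.5, 1.0], [1.5, 1.0, 9.5]])
print(classes)
```

In this sketch an observation is always assigned to some group. Deciding whether it belongs to any of the groups at all needs an additional check, such as the residual test described in section C.1.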

PLS–DA is a powerful tool to qualify and determine the differences between groups of data. Its only drawback is that the groups must be known beforehand and the observations in the same group must be very similar while observations from different groups should be quite different (see footnote 3).

Footnote 3: This problem can be overcome if one considers that the scientist normally has a


Bibliography

[1] H. Flemming, G. Geesey (Editors): Biofouling and Biocorrosion in Industrial Water Systems: Proceedings of the International Workshop on Industrial Biofouling and Biocorrosion, Springer Verlag, June 1991

[2] D. Gudlauski: Whitewater system closure means managing microbiological buildup, Pulp and Paper, March 1996

[3] M. Madigan, J. Martinko, J. Parker: Biology of Microorganism, 9th edition, Prentice Hall, 2000

[4] U. Tillman, Birgitta Grass-Backlund, Elisabeth Olsson: Analysis of slime deposits from pulp and paper processes and identification of bacteria found in pulp samples, using pyrolysis-GC/MS technique, SCA Packaging Research, report F2552, 1999

[5] J. Perry: Introduction to analytical gas chromatography, chromatographic science series, volume 14, 1981

[6] D. Williams, I. Fleming: Spectroscopic methods in organic chemistry, 3rd edition, McGraw-Hill, 1980

[7] P. Kämpfer: Limits and possibilities of total fatty acid analysis for classification and identification of bacillus species, System. Appl. Microbiol. 17, pp. 86-98, 1994

[8] K. Voorhees, F. Basile, M. Beverly, C. Abbas-Hawks, A. Hendricker, R. Cody, T. Hadfield: The use of biomarker compounds for the identification of bacteria by pyrolysis-mass spectrometry, Journal of analytical and applied pyrolysis 40–41, pp. 111-134, 1997

[9] J. Jobson: Applied multivariate data analysis, volume II, Springer-Verlag, 1992

[10] L. Eriksson, E. Johansson, N. Kettaneh-Wold and S. Wold: Introduction to Multi- and Megavariate data analysis using projection methods (PCA & PLS), Umetrics, June 1999

[11] T. Gustafsson: Identifiering av mikroorganismer med Py-GC/MS och kemometri, STFI, Stockholm 1998

[12] K. Isberg, S. Johnsrud: Identiﬁcation of bacillus using pyrolytic in situ methylation GC-MS and chemometrics, STFI, December 1999
