Academic year: 2021

Quantifying metabolic fluxes using mathematical modeling

Victor Viberg

Supervisor: William Lövfors
Examiner: Gunnar Cedersund

Linköping University, SE-581 83 Linköping, 013-28 10 00, www.liu.se


Abstract

Background

Cancer is one of the leading causes of death in Sweden. In order to develop better treatments against cancer, we need to understand it better. One area of special interest is cancer metabolism and its metabolic fluxes. As these fluxes cannot be measured directly, modeling is required to determine them. Due to the complexity of cell metabolism, some limitations in the metabolism model are required. As the TCA-cycle (TriCarboxylic Acid cycle) is one of the most important parts of cell metabolism, it was chosen as a starting point.

The primary goal of this project has been to evaluate the previously constructed TCA-cycle model. The first step of the evaluation was to determine the CIs (Confidence Intervals) of the model parameters, in order to determine the parameters' identifiability. The second step was to validate the model, to see whether it could predict data on which it had not been trained. The last step of the evaluation was to determine the uncertainty of the model simulation.

Method

The TCA-cycle model was created using isotopically labeled data and EMUs (Elementary Metabolic Units) in OpenFlux, an open-source toolbox. The CIs of the TCA-cycle model parameters were determined both using OpenFlux's built-in functionality and using a method called PL (Profile Likelihood). The model validation was done using a leave-one-out method. In conjunction with the leave-one-out method, a method called PPL (Prediction Profile Likelihood) was used to determine the CIs of the TCA-cycle model simulation.

Results and Discussion

Using PL to determine CIs had mixed success. The failures of PL are most likely caused by a poor choice of settings. In the cases in which PL succeeded, it gave results comparable to those of OpenFlux. However, the settings in OpenFlux are important, and the wrong settings can severely underestimate the confidence intervals. The confidence intervals from OpenFlux suggest that approximately 30% of the model parameters are identifiable. The validation shows that the model is able to predict certain parts of the data on which it has not been trained. The PPL yields a small confidence interval for the simulation. These two results regarding the model simulation suggest that, even though the identifiability of the parameters could be better, the model structure as a whole is sound.

Conclusion

The majority of the model parameters in the TCA-cycle model are not identifiable, which is something future studies need to address. However, the model is able to predict data on which it has not been trained.


Acknowledgments

First let me thank my supervisor William and examiner Gunnar for their support and an interesting project. I would also like to thank Nicolas, Irena and Roland for their help during the project. My gratitude also goes to the staff of IMT at Linköping University for interesting discussions; without you I would be without a clear definition of cabinet. The greatest thanks goes to my friends and family for all the support and patience both during the project and throughout the years. Without you this would not have been possible.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Abbreviations

1 Background
  1.1 Biological background
  1.2 Isotopically Labeled Experimental data
  1.3 Systems Biology
  1.4 Metabolic Flux Modeling
  1.5 Prior work
  1.6 Aims
  1.7 Delimitations

2 Method
  2.1 Constructing the model in OpenFlux
  2.2 Determining Confidence interval of parameters
  2.3 Validation by leaving one out
  2.4 Visualization of the model

3 Results
  3.1 Determining confidence interval
  3.2 Validation and determining model uncertainty
  3.3 Visualization of the model

4 Discussion
  4.1 Discussion on the results
  4.2 Discussion on the method
  4.3 Future work
  4.4 The work in a wider context

5 Conclusion


List of Figures

1.1 A schematic of a simple mock system with two different sets of metabolic fluxes. It shows different metabolites A-F and how they are connected to each other via metabolic reactions R1-R5. The activity of these reactions is the metabolic flux rate (here given by the numbers). An arrow pointing in both directions indicates that the corresponding reaction is reversible. The number next to the diamond shapes representing reactions is an arbitrary value of their respective flux rate.

1.2 Example of how a pyruvate molecule could be labeled with C13 in different ways. The circle around carbon atoms represents the atom being C13. This creates four different weight classes for the pyruvate molecule.

1.3 A simple system in which C13 is distributed. Black filled circles represent atom positions which are always labeled, white circles atom positions which are always unlabeled, and half black and white circles atom positions which may be either unlabeled or labeled. The amount of C13 in each metabolite (A-F) is dependent on the flow rate of each reaction, which in turn dictates the height of the bars in the bar diagram. The arrows represent the atom transitions in different reactions. The difference between system I and system II is that the flux rate of R2 is zero in system II.

1.4 A simple system in which several flux parameter sets were able to explain experimental data. The curves are the cost as a function of parameter value.

3.1 Result for Profile Likelihood for flow nr 16. The X-axis shows different values for flow parameter 16 and the Y-axis shows the lowest cost for that point. The red horizontal line is the cutoff, defined as the cost at optimum + inv.χ2(0.95, 1). The two green vertical lines (overlapped by vertical blue) are the values for the confidence interval obtained from OpenFlux without fine iteration; the two blue vertical lines are the values obtained from OpenFlux with fine iteration. Note that the results have been trimmed and peaks which were considered failed optimizations were removed; the untrimmed results are presented in supplementary material 5 - profile likelihood. The step length used, as described in section 2.2 Determining Confidence interval of parameters, is 0.001.

3.2 Result for Profile Likelihood for flow nr 19. The X-axis shows different values for flow parameter 19 and the Y-axis shows the lowest cost for that flow. The red horizontal line is the cutoff, defined as the cost at optimum + inv.χ2(0.95, 1). The two green vertical lines are the values for the confidence interval obtained from OpenFlux without fine iteration; the two blue vertical lines are the values obtained from OpenFlux with fine iteration. Note that the results have been trimmed and peaks which were considered failed optimizations were removed; the untrimmed results are presented in supplementary material 5 - profile likelihood.

3.3 The interval obtained using OpenFlux without fine iteration is 0-7499.995 and therefore not included in the figure. Note that the results have been trimmed and peaks which were considered failed optimizations were removed; the untrimmed results are presented in supplementary material 5 - profile likelihood. This experiment was ended prematurely. The step length used, as described in section 2.2, is 1.

3.4 Result for Profile Likelihood for flow nr 39. The X-axis shows different values for flow parameter 39 and the Y-axis shows the lowest cost for that point. The red horizontal line is the cutoff, defined as the cost at optimum + inv.χ2(0.95, 1). The green vertical lines are the values for the confidence interval obtained from OpenFlux without fine iteration. The step length used, as described in section 2.2, is 1.

3.5 Comparison of simulated data and experimental data, from parameters optimized without akg in the dataset. The X-axis shows different mass distributions and the Y-axis the proportion of molecules having that mass distribution. The bars are ordered, in all subfigures: Glutamine experimental data, Glutamine simulation data, Glucose experimental data, Glucose simulation data. The sums of square residuals are also included for each metabolite, in the boxes on the left.

3.6 Comparison of simulated data and experimental data, from parameters optimized with akg in the dataset. The X-axis shows different mass distributions and the Y-axis the proportion of molecules having that mass distribution. The bars are ordered, in all subfigures: Glutamine experimental data, Glutamine simulation data, Glucose experimental data, Glucose simulation data. The sums of square residuals are also included for each metabolite, in the boxes on the left.

3.7 The X-axis shows different mass distributions and the Y-axis the cost of the parameter where the constraint to have the corresponding proportion of the mass distribution is met. This is shown for M0+ to M5+. The cutoff line is set at F(optimum) + inv.χ2(0.95, 1), which is around 153 + 3.8. The Y-axis is cut short at 300, as the cost of the lower MIDs continues to very high values far from the optimum.

3.8 The X-axis shows different mass distributions and the Y-axis the cost of the parameter where the constraint to have the corresponding proportion of the mass distribution is met. This is shown for M0+ to M5+. The cutoff line is set at F(optimum) + inv.χ2(0.95, 1), which is around 153 + 3.8. The Y-axis is cut short at 300, as the cost of the lower MIDs continues to very high values far from the optimum.

3.9 The X-axis shows different mass distributions and the Y-axis the cost of the parameter where the constraint to have the corresponding proportion of the mass distribution is met. This is shown for M0+ to M5+. The cutoff line is set at F(optimum) + inv.χ2(0.95, 1), which is around 147 + 3.8. The Y-axis is cut short at 180; the lower MIDs continue to very high values.

3.10 The X-axis shows different mass distributions and the Y-axis the cost of the parameter where the constraint to have the corresponding proportion of the mass distribution is met. This is shown for M0+ to M5+. The cutoff line is set at F(optimum) + inv.χ2(0.95, 1), which is around 147 + 3.8. The Y-axis is cut short at 180; the lower MIDs continue to very high values.

3.11 A comparison between model simulation for akg and experimental data. Experimental data for akg was not present when training the model. The bars represent the distribution of different masses in the metabolites. The error bars on simulated data represent the confidence intervals presented in Figures 3.7 and 3.8. Left to right: experimental data for Glutamine, simulated Glutamine, experimental data for Glucose, simulated Glucose.


List of Tables

1.1 A list of the metabolites which were measured using mass spectrometry. The measurements of these metabolites are the experimental data which the project is based upon.

2.1 Metabolic uptake flows which were added to the model. RxnID is the name of the reaction or flow. fluxPar is the number of the corresponding parameter given in OpenFlux. Reaction is the stoichiometric equation for the reaction. Atom map transfer is a mapping of where specific atoms go after the reaction. Basis is the desired value of that flux parameter. These fields are required in the model file of OpenFlux, except for fluxPar. Any fields which are not presented here were left blank in the model file.

3.1 Confidence intervals obtained using OpenFlux without using the perform fine iteration setting. RxnID is the name given to a certain reaction. fluxPar is the number of the parameter given in OpenFlux. min and max are the stoichiometric range for a given parameter. low and high are the lower and higher limits of the confidence interval. %resolved is a measure of how large a portion of the stoichiometric range is limited by the confidence interval; a higher value suggests a more identifiable parameter. flag_high/low are the break statuses for the algorithm; see the OpenFlux manual for details on their meaning.

3.2 Confidence intervals obtained using OpenFlux with the perform fine iteration setting. RxnID is the name given to a certain reaction. fluxPar is the number of the parameter given in OpenFlux. min and max are the stoichiometric range for a given parameter. low and high are the lower and higher limits of the confidence interval. %resolved is a measure of how large a portion of the stoichiometric range is limited by the confidence interval; a higher value suggests a more identifiable parameter. flag_high/low are the break statuses for the algorithm; see the OpenFlux manual for details on their meaning.

3.3 A comparison of time consumption between different methods for determining Confidence Intervals.

3.4 Different costs of the model. The first column is the baseline cost for the model. The second is akg's contribution to the total cost in the first. The third is the cost of the parameters without comparing to the full dataset; this is the cost the optimization function uses. The fourth is the cost of parameters optimized without akg but compared with the full dataset. The last is the cutoff from the χ2 test used.

3.5 An explanation of how to interpret supplementary material 6 - model visualization a, or as a PDF in supplementary material 6 - model visualization b.


List of Abbreviations

akg        Alpha-KetoGlutarate
ATP        Adenosine TriPhosphate
EMU        Elementary Metabolic Unit
FBA        Flux Balance Analysis
FBS        Fetal Bovine Serum
hPL        Human Platelet Lysate
INCA       Isotopomer Network Compartmental Analysis
ISB-group  Integrative Systems Biology group
MID        Mass Isotopomer Distribution
MFA        Metabolic Flux Analysis
NaN        Not a Number
NMR        Nuclear Magnetic Resonance
ODE        Ordinary Differential Equation
pyr        Pyruvate
SBML       Systems Biology Markup Language
TCA-cycle  TriCarboxylic Acid cycle


1 Background

In Sweden, cancer is one of the leading causes of death [1]. Cancer is caused by mutations at the cellular level and leads to unwanted cell growth. Even though cancer research has come a long way over the past couple of decades, our ability to detect and treat cancer is still lacking. However, if we were able to understand the metabolism of cancer cells, and how it differs from that of healthy cells, it might be possible to create more efficient treatments against cancer as well as to detect cancer earlier. Both earlier detection and better treatment could prove essential in increasing the survival rate of cancer patients, and to achieve these goals we need to further study cell metabolism.

The growing scientific field of systems biology can provide powerful tools to study biological processes such as cell metabolism. The mathematical models used in systems biology can be used to explain and draw conclusions from experimental data from biological systems. Modeling can be necessary because biological systems are sometimes too complex to understand without mathematical tools.

In this project, I have used a type of mathematical modeling that can be used to explain and analyze data from experiments using stable carbon isotopes as tracers. The kind of model used in this project can provide useful information on the interactions between different molecules that make up the metabolism of the cell. Many of these interactions cannot be measured directly, and therefore modeling can be one of the few ways to determine them quantitatively [2].

1.1 Biological background

In this project, mathematical models have been used to model a complex biological system. Therefore it is necessary to have some basic knowledge of the underlying biological aspects.

Cancer

Cancer has a major impact on our society today. One portion of the impact comes from the fact that, in Sweden, one out of three people will get a cancer diagnosis at some point in their lives.


In addition to the suffering caused by cancer, it is also necessary to consider its monetary cost to society. The cost of cancer in Sweden is estimated at 36 billion SEK yearly and is expected to rise to 70 billion SEK by 2040 [3]. One of the reasons why cancer has such a high impact is the difficulty of treating it, which is caused by the varying nature of the disease. There are several types of cancer, caused by different types of mutations, which cause rapid and unwanted cell growth [4]. These different mutations can require drastically different treatments. It is therefore important to have reliable methods to distinguish between cancer types so that the right treatment is used.

One of cancer cells' most distinguishing features is their change in metabolism. The increase in cell metabolism means that cancer cells will take up most of the nutrients from surrounding tissue. One example of the change in cell metabolism of cancer cells is known as the Warburg effect [5]. Through the Warburg effect, cancer cells can use anaerobic metabolism even when oxygen is present. The increase in metabolism that the Warburg effect provides gives cancer cells an advantage over healthy cells.

Tricarboxylic Acid Cycle

Since the metabolism of cancer tissue can vary so much from that of healthy tissue, it is interesting to study. Metabolism consists of several levels of different processes. One such process is the TCA-cycle (Tricarboxylic Acid Cycle), also known as the citric acid cycle or the Krebs cycle. The TCA-cycle is a series of reactions which release energy stored in different nutrients by converting pyruvate to ATP (Adenosine TriPhosphate) [6]. Since the TCA-cycle is an essential part of metabolism, providing up to 90% of the total energy in the cell [6], it has been studied a great deal. This large amount of research has led to the reactions of the cycle being relatively well mapped [7] [8]; in contrast, the flux rates are not yet characterized, as they are not directly measurable.

Metabolic Fluxes

Metabolic reactions are the processes by which metabolites (a metabolite is any molecule which is relevant in a metabolic process) are transported between different cell compartments or converted into other metabolites. The flux rate is defined as the rate at which these reactions occur. In Figure 1.1, an example of a metabolic network is shown. Fluxes are represented with arrows between the different metabolites (A-F). As can be seen in Figure 1.1, the two systems have different sets of metabolic fluxes, which in this case leads to an increase of metabolite E in system II. In a real scenario, this difference could represent the difference between a healthy cell and a cancer cell.

It should be noted that metabolic reactions can represent different mechanisms such as diffusion or active transport. However, no such distinction is made in the modeling during this project.


Figure 1.1: A schematic of a simple mock system with two different sets of metabolic fluxes. It shows different metabolites A-F and how they are connected to each other via metabolic reactions R1-R5. The activity of these reactions is the metabolic flux rate (here given by the numbers). An arrow pointing in both directions indicates that the corresponding reaction is reversible. The number next to the diamond shapes representing reactions is an arbitrary value of their respective flux rate.

1.2 Isotopically Labeled Experimental data

To model the metabolic fluxes, experimental data is required. The data used herein has been collected by a collaborating group led by Roland Nilsson in Stockholm. Roland Nilsson has used a method based on labeling metabolites using C13 (carbon-13), an isotope of carbon.

Isotopic Labeling

The goal of isotopic labeling is to create a way to track and differentiate between molecules. It is possible to replace certain atoms in a molecule with an isotope (the same element but with a different number of neutrons in the nucleus) of the same atom [9]. The resulting altered molecule is then referred to as an isotopomer of the original molecule. An example of how a molecule can be labeled differently using C13 is shown in Figure 1.2. In Figure 1.2, we see different groups of isotopomers. These groups of isotopomers have different mass due to the additional neutrons in some atomic nuclei. This difference in mass is measurable using mass spectrometry.


Figure 1.2: Example of how a pyruvate molecule could be labeled with C13 in different ways. The circle around carbon atoms represents the atom being C13. This creates four different weight classes for the pyruvate molecule.

As can be seen in Figure 1.3, if the atom transitions of a reaction are known, then it is possible to predict where different atoms will appear after a chain of reactions. If we can predict where atoms may end up, we may also predict where we will have C13. However, we will rarely have 100% C13 or regular carbon in a single position, but rather a proportion. The proportion between C13 and regular carbon depends on the flux rates of the reactions. Different flux rates will result in different amounts of C13 ending up in different places. In Figure 1.3, the proportion of different masses caused by different numbers of neutrons in the nuclei can be seen in the bar plot; the data represented in the bar plot is the type of data that the TCA-cycle model is based upon.

Let us consider an example of how different flux rates give different measurements. If we were to change the flux rates of the reactions, this would change the final distribution in the different metabolites. Looking at Figure 1.3 again, if we as an example change the flux rate of R2 to zero (which is done in system II), this change would lead to M1+ (assuming no noise) being equal to 1 and all other mass groups (M0+, M2+, and M3+) being equal to zero. This is because, due to the atom transitions in the system, the only possible isotopomer in F is one with exactly one C13.
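This dependence of the measured distributions on the flux rates can be illustrated with a small calculation: at metabolic steady state, the MID of a pool fed by two routes is the flux-weighted mixture of the incoming MIDs. The routes, fluxes, and MIDs below are hypothetical and not taken from the thesis model; this is only a minimal sketch of the principle:

```python
# Toy illustration (hypothetical numbers): at steady state, the mass
# isotopomer distribution (MID) of a metabolite fed by two reactions
# is the flux-weighted average of the incoming MIDs.

def mix_mids(mid_a, flux_a, mid_b, flux_b):
    """Flux-weighted mixture of two incoming MIDs."""
    total = flux_a + flux_b
    return [(flux_a * a + flux_b * b) / total for a, b in zip(mid_a, mid_b)]

# Route 1 delivers fully labeled molecules (all mass M1+),
# route 2 delivers unlabeled molecules (all mass M0+).
mid_r1 = [0.0, 1.0]   # [M0+, M1+]
mid_r2 = [1.0, 0.0]

# Equal fluxes give a 50/50 label mixture in the downstream pool.
print(mix_mids(mid_r1, 1.0, mid_r2, 1.0))   # [0.5, 0.5]

# Setting the flux of route 2 to zero (cf. system II in Figure 1.3)
# forces M1+ = 1 in the downstream pool.
print(mix_mids(mid_r1, 1.0, mid_r2, 0.0))   # [0.0, 1.0]
```

Changing the flux ratio between the two routes changes the simulated MID, which is exactly why measured MIDs carry information about the fluxes.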


Figure 1.3: A simple system in which C13 is distributed. Black filled circles represent atom positions which are always labeled, white circles atom positions which are always unlabeled, and half black and white circles atom positions which may be either unlabeled or labeled. The amount of C13 in each metabolite (A-F) is dependent on the flow rate of each reaction, which in turn dictates the height of the bars in the bar diagram. The arrows represent the atom transitions in different reactions. The difference between system I and system II is that the flux rate of R2 is zero in system II.

Experimental data from Roland Nilsson

The experimental data which this project is built upon originates from Roland Nilsson's research group. To obtain the experimental data, they grew nine cell cultures of mammary epithelial cells. The cells originated from "Human breast cancer cells generated by oncogenic transformation of primary mammary epithelial cells." [10]. When the cell cultures reached a certain size, they were transferred into a different medium. The cell cultures were divided into three groups of three cultures each, and each group had a different medium. The different mediums had different carbon labels: one group had 13C-labeled Glutamine, one had 13C-labeled Glucose, and the third had no added labels. After being divided into groups and having changed medium, the cell cultures were put into an incubator until the cells reached metabolic steady state (a state in which the flux rates no longer change). Afterwards the cells were extracted and quickly frozen. After being frozen, the cell cultures were sent to another lab, which measured the mass distribution of nine different metabolites (listed in Table 1.1) using mass spectrometry.


Table 1.1: A list of the metabolites which were measured using mass spectrometry. The measurements of these metabolites are the experimental data which the project is based upon.

Name                     Name in model
Pyruvate                 pyr
Citrate/Isocitric acid   cit
Aconitate                acnt
Alpha-ketoglutarate      akg
Glutamate                glu_L
Glutamine                gln_L
Fumarate                 fum
Malate                   mal_L
Aspartate                asp_L

1.3 Systems Biology

Due to the complexity of some biological processes or mechanisms, it can sometimes be difficult or even impossible to understand them without mathematical tools. These tools are used within the field of systems biology. In other words, systems biology is the science of explaining biological processes or systems using mathematics. The explanation is done using models; a model is an abstract representation of a system or process [11].

Mathematical models can be used in biology to test hypotheses of how a biological system works [12]. This testing is done by creating different models corresponding to different hypotheses and testing which, if any, best explains experimental data. Using models in hypothesis testing is one way to gain a mechanistic understanding of how biological systems and processes work. This understanding can allow us to gain insight into parameters which cannot be directly measured, such as the metabolic fluxes. To be able to draw the right conclusions it is necessary to choose the right kind of model. There are many types of mathematical models, ranging from stochastic Monte Carlo models to deterministic ODE-models (Ordinary Differential Equations). ODE-models are the type most commonly used by the ISB-group (Integrative Systems Biology), in which this project was done, and they are suitable to model concentrations, volumes, and many other quantities which change with respect to another variable. However, due to the nature of the experimental data in this project, another approach was used, which is described further in section 1.4 Metabolic Flux Modeling below.

1.4 Metabolic Flux Modeling

Elementary Metabolic Units

Metabolic flux analysis is the field of determining metabolic fluxes using data. There are several different methods to perform metabolic flux analysis; one of the most promising is perhaps EMU (Elementary Metabolic Units). EMU is a modeling technique which can be used to model large amounts of isotopically labeled data. One major advantage of EMUs compared to other methods capable of modeling these kinds of systems is that it can drastically reduce the number of required equations [7]. An EMU consists of all mass variations of a certain metabolite. The EMU is made by combining information into MIDs (Mass Isotopomer Distributions) for each metabolite in the system. The MID is a vector which contains the distribution of different molecule masses for a certain metabolite. As an example, the MID for metabolite F in Figure 1.3 would be MIDF = [M0+, M1+, M2+, M3+], where Mn+ is the proportion of molecules in a sample which has n 13C atoms. It is the MIDs that are presented in the bar plot in Figure 1.3.

Due to the nature of MIDs, EMU reactions require specific handling. For all reactions, the size of the MIDs (the size of a MID is the number of carbon atoms which may be replaced by 13C)


must be the same before and after a reaction. Below are examples of how specific types of reactions are handled.

• Condensation reaction: a reaction where two or more molecules condense into one. The reaction is written as A + B = C, where A and B condense into C. The MID of C is equal to the convolution of the MIDs of A and B: MIDC = MIDA * MIDB.

• Cleavage reaction: a reaction where a molecule is divided into two or more molecules. The reaction is written as A = B + C, where A cleaves into B and C. This type of reaction is essentially a split where a certain number of carbon atoms are removed.

• Unimolecular reaction: a reaction where the molecules before and after the reaction have the same number of carbon atoms. The reaction is written as A = B.

It should be noted that molecules which do not contain any carbon are not present in these reactions.
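The condensation rule above can be illustrated in code: the convolution of two MID vectors is ordinary discrete convolution. The two-carbon metabolites and their MIDs below are invented for illustration; this is a sketch, not the project's implementation:

```python
import numpy as np

# Condensation A + B = C: the MID of C is the discrete convolution of
# the MIDs of A and B (a molecule of C with n labels can be formed by
# any split of those n labels between A and B).
def condense(mid_a, mid_b):
    return np.convolve(mid_a, mid_b)

# Hypothetical one-carbon metabolites, each 50% unlabeled / 50% labeled.
mid_a = np.array([0.5, 0.5])   # [M0+, M1+]
mid_b = np.array([0.5, 0.5])

mid_c = condense(mid_a, mid_b)
print(mid_c)   # [0.25, 0.5, 0.25], a valid MID that still sums to 1
```

Note that a valid MID always sums to 1, and convolution preserves this, which is a handy sanity check when implementing EMU arithmetic.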

After designing a network using EMUs, the network is broken down into smaller networks. The breakdown creates one subnetwork for each of the different sizes of EMUs in the system, where the first network consists of all EMU components of size one and the last consists of EMUs up to the largest size in the system. A subnetwork can contain smaller EMUs, but only the minimum amount needed to create it. This breakdown of the network ensures that the smallest number of equations necessary is used. For more information on how networks are broken down and how the equations are set up, see "Elementary metabolite units (EMU): A novel framework for modeling isotopic distributions" [7].

With these EMU reactions, we can now simulate an output by simply inserting a flux parameter set into the equations. The simulation gives us a simulated data set, which we can compare to our actual experimental data. Most likely, there will be some difference between the simulation and the experimental data, and therefore we need a method to minimize this difference.
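A natural deviation measure of this kind, referred to in the figure captions of this report, is a sum of squared residuals. The sketch below compares an invented simulated MID with an invented measured one; it is a generic illustration, not the project's exact objective function:

```python
def ssr_cost(simulated, measured):
    """Sum of squared residuals between simulated and measured MID values."""
    return sum((s - m) ** 2 for s, m in zip(simulated, measured))

sim  = [0.25, 0.50, 0.25]   # simulated MID (hypothetical)
meas = [0.20, 0.55, 0.25]   # measured MID (hypothetical)
print(ssr_cost(sim, meas))  # approximately 0.005
```

A perfect fit gives a cost of zero, and the optimization described below searches for the flux parameter set that drives this cost as low as possible.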

Optimization

Since it is not feasible to measure the fluxes directly [2], some method to generate flux parameters which can be put into our EMU reactions is required. The goal is to find the flux parameter set which gives the least deviation of the simulation from experimental data. To find this parameter set, optimization can be used. Optimization is the process where the variables of a function are changed to either maximize or minimize an objective function, given a set of constraints that the variables must fulfill. An example of an objective function with constraints is shown in equation 1.1, where y(x, z) is to be minimized under the constraints x > 0 and z > 0. The function in equation 1.1 is a bit different from the objective function used in this project; in this project the objective function is a measure of our simulation's deviation from data, and it can be seen in equation 1.2.

min y(x, z) = x^2 * z - log(z)/x,   x > 0, z > 0    (1.1)

For very simple problems, it might be possible to solve the optimization analytically; however, for the problems involved in this project, other methods are required. Non-analytical optimization methods can be divided into two main categories, which are explained below.


Local optimization

A local optimization method is a method which, from a starting guess, finds a local optimum. It does this by moving along the gradient of the objective function [13]. Moving along the gradient means that the optimizer will take a step in the parameter space in the direction that improves the value of the objective function the most. However, moving along the gradient requires a place to start, and therefore it is important to have a good starting guess. The starting guess can be chosen in different ways: with the help of prior knowledge, by randomization, or from the result of a global optimization (which is further explained below).

There are a couple of factors to consider when using local optimization. The largest advantage of local optimization algorithms is that they are fast and easy to implement. However, local optimization methods are not without problems. The biggest problem is that, since they follow the gradient of the objective function, they get stuck at a local optimum rather than the global one, as they cannot move in an unfavorable gradient direction. This also makes them sensitive to parameter spaces that cause large variations in the objective function.

In this project, local optimization was implemented using MATLAB's fmincon function with the interior-point algorithm.
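The gradient-following behavior, and how it traps the search in a local optimum, can be illustrated with a minimal sketch (written in Python rather than MATLAB's fmincon, with a made-up one-dimensional objective that is not part of the model):

```python
def objective(x):
    # Made-up objective with two local minima (for illustration only)
    return x**4 - 3 * x**2 + x

def gradient(x):
    return 4 * x**3 - 6 * x + 1

def local_optimize(x0, lr=0.01, steps=5000):
    """Plain gradient descent: repeatedly step along the negative gradient
    from a starting guess until the iterate settles in a minimum."""
    x = x0
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

# Two different starting guesses end up in two different local minima:
left = local_optimize(-2.0)
right = local_optimize(+2.0)
```

Started at −2.0 the descent ends in the left minimum, started at +2.0 in the right one; neither run can tell which of the two is the global optimum, which is exactly why good starting guesses, or a global method, are needed.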

Global optimization

To be certain that the global optimum has been found, one would have to search through every point in the parameter space. The problem is that searching through the whole parameter space is unfeasible, as it would take too much time. In order to search through a large enough portion of the parameter space in a reasonable amount of time, we can utilize global optimization algorithms. There are several different global optimization algorithms available, which work on different principles. However, something all global optimization methods have in common is that they attempt to get around the problem of local optimization methods getting stuck at a local optimum.

In this project two different global optimization methods were used, which are described below.

• Multi start local optimization uses a local optimizer with several different starting points. These starting points are generated randomly and are preferably evenly distributed in the parameter space. The results of the optimizations from the different starting points are then compared, and the best local optimum is the final result. While perhaps not a true global optimization algorithm, if a large enough number of starting guesses with a good distribution is used, it is possible to search a substantial part of the parameter space. This is the method used by OpenFlux [14], and whenever an optimum is said to have been optimized using OpenFlux in this report, this is the method used.

• Simulated annealing is a global optimization algorithm that gets its name from the process of annealing (also known as tempering) of crystalline materials [15]. Simulated annealing works by taking a random step from an initial starting point and checking if the new point has a lower value of the objective function. What makes this algorithm special is that it can keep the new point even if it gives a worse value of the objective function; it accepts such a point based on random chance. The chance of keeping a worse point is proportional to a simulated temperature, as well as to the difference in objective-function value between the new point and the previous one. The simulated temperature is then lowered, meaning that the chance to pass the test decreases as well, and the process is repeated until the temperature is low enough or steps no longer produce a significant change in the value of the objective function. According to Search and optimization by metaheuristics: Techniques and algorithms inspired by nature, this process



provides a high probability that the global optimum is reached [15]. In this project, this algorithm was implemented using MATLAB's simulannealbnd function.
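The acceptance rule described above can be sketched as follows (a Python illustration on the same kind of made-up objective; the step size and cooling schedule are arbitrary choices, not those of simulannealbnd):

```python
import math
import random

def objective(x):
    # Made-up objective with two local minima (for illustration only)
    return x**4 - 3 * x**2 + x

def simulated_annealing(x0, temp=10.0, cooling=0.995, steps=3000):
    random.seed(1)
    x, fx = x0, objective(x0)
    best_x, best_f = x, fx
    for _ in range(steps):
        candidate = x + random.uniform(-0.5, 0.5)  # random step
        fc = objective(candidate)
        # Always accept better points; accept worse points with probability
        # exp(-(fc - fx) / temp), which shrinks as the temperature drops.
        if fc < fx or random.random() < math.exp(-(fc - fx) / temp):
            x, fx = candidate, fc
            if fx < best_f:
                best_x, best_f = x, fx
        temp *= cooling  # lower the simulated temperature
    return best_x, best_f

best_x, best_f = simulated_annealing(x0=2.0)
```

Because worse points are sometimes accepted while the temperature is high, the search can climb out of the right-hand basin that plain gradient descent from x = 2.0 would be stuck in.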

Model Rejection

Once a set of parameters has been determined using optimization, it is necessary to determine if the model simulation using those parameters is able to explain the experimental data. Sadly, there is no method available that can determine if a model is correct [12]. The opposite is, however, doable, and there are several methods available that can determine if a model is likely false and should be rejected. One such method is the weighted sum of squared residuals, which can then be compared to an inverse χ² distribution. The weighted sum of squared residuals is often referred to as Cost and is defined in equation 1.2, where Y_n is the measured value in data point n and Ŷ_n(V) is the simulated value for the same data point, as a function of the current parameter set V. The difference between simulated and measured value is squared and then divided by σ_n², the estimated standard deviation for the experimental data in that data point. By dividing by the standard deviation, a data point with a low estimated standard deviation will contribute more to the total cost. The costs of the individual data points are then summed over all data points n to form the total cost for the model.

Cost(V) = Σ_n (Y_n − Ŷ_n(V))² / σ_n²   (1.2)

The total cost is then used in a χ² test, seen in equation 1.3. If the cost is lower than a certain threshold, we say that the model has passed the χ² test. This threshold is defined in equation 1.3, where χ² is the inverse of the χ² distribution, f is the degrees of freedom and α is the significance level of the test. The degrees of freedom can be chosen in several ways; in this project they were chosen as the number of data points minus the number of free fluxes. The significance level α was set to 0.95 in this project.

Cost < χ²(α, f)   (1.3)

If the model passes the χ² test, it means that the deviation from experimental data is normally distributed, as shown in equation 1.4.

(Y_n − Ŷ_n)² / σ_n² = deviation_n²,  with deviation_n ~ N(0, 1)   (1.4)

If the assumption in equation 1.4 holds true, one could argue that the deviations of the simulated data from the experimental data are caused by normally distributed noise [16]. If the deviation is normally distributed, it can be interpreted as the model being able to adequately explain the experimental data. If, however, the deviation cannot be explained as noise, the model needs to be rejected.
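The rejection test in equations 1.2-1.3 can be sketched as follows (Python, with made-up data; for f = 5 − 2 = 3 degrees of freedom, the χ²(0.95, f) threshold is the standard table value 7.815):

```python
def cost(measured, simulated, sigma):
    """Weighted sum of squared residuals, as in equation 1.2."""
    return sum((y - yhat) ** 2 / s ** 2
               for y, yhat, s in zip(measured, simulated, sigma))

# Made-up data: 5 data points and 2 free fluxes gives f = 5 - 2 = 3
measured  = [0.40, 0.30, 0.15, 0.10, 0.05]
simulated = [0.42, 0.28, 0.16, 0.09, 0.05]
sigma     = [0.03, 0.03, 0.03, 0.03, 0.03]

chi2_threshold = 7.815           # inverse chi2(0.95, 3), from a chi2 table
total = cost(measured, simulated, sigma)
passed = total < chi2_threshold  # equation 1.3: model is not rejected
```

Here the total cost is well below the threshold, so this hypothetical model would not be rejected.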

Another way a model can be rejected is if it conflicts with previous knowledge that is not represented in the experimental data. One such example is if a concentration takes a negative value in the model, or a property is outside of biologically reasonable bounds.

Uncertainty, Confidence Interval And Profile Likelihood

Whether or not a model is rejected does not give a full picture of a model's validity. In many cases there are several parameter sets which can explain data at least well enough to pass the χ² test described in section 1.4 Model Rejection. If these parameter sets have passed, they all represent possible explanations of the data.


For R5, we have been able to obtain flux values between 60 and 80 that explain the experimental data well enough to pass a χ² test. In the plot in Figure 1.4, we can see that there is an optimal parameter set which provides the best cost, and that the cost gets worse the further away from the optimum we move. However, for R3 and R4 there seems to be no upper limit on their values, which means that no matter how high they are, the model can compensate somewhere else. This can be seen in the plot as well: the cost goes asymptotically to some value instead of increasing the further away from the optimum we move. For R2 there is a limit on which values it can take, but there is no way to determine its sign; in other words, the direction of the net flux is unknown. This is the uncertainty of the model parameters, and it is necessary to determine how high the uncertainty of the model parameters is.

Figure 1.4: A simple system in which several flux parameter sets were able to explain experimental data. The curves show the cost as a function of parameter value.

One method which can be used to determine the model parameter uncertainty is profile likelihood [17]. A profile likelihood is performed by fixing a parameter to a certain value and optimizing all other parameters as usual to obtain the lowest cost. The process is then repeated with the fixed parameter's value slightly changed, over and over, until the cost of the model becomes too high [17] [18]. The algorithm of this process is shown in equation 1.5, where F is the objective function, defined as the least-squares function previously described in equation 1.2, with additional input arguments which regulate the constraints; p_n is the parameter that is subjected to a constraint; p_n,opt is the value that parameter p_n had at the optimum, in other words the value of p_n that provided the lowest cost; ∆p_n is the step length of the parameter; s is the step number; and d is the direction, which can take the value of either +1 or −1 depending on which direction the search goes in.

min F(V, s, d),  p_n = p_n,opt + ∆p_n · s · d   (1.5)

s is increased until the optimization algorithm can no longer find a parameter set that provides a sufficiently low cost. The threshold for the cost is defined in equation 1.6. The cutoff used in this project is defined as F(V_opt), the cost at the previously located optimum, plus the value of the inverse χ² distribution with a single degree of freedom and a significance level of α = 0.95. After reaching a step number s where the cost is above this cutoff threshold, the direction is changed and everything else is reset to the starting point at the optimum.



This provides s_min and s_max, which in turn give p_n,min and p_n,max. The confidence interval is then set to CI(p_n) = [p_n,min, p_n,max].

min F(V, s, d) < F(V_opt) + χ²(0.95, 1)   (1.6)
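The whole procedure of equations 1.5-1.6 can be sketched on a toy problem (Python; the two-parameter model y = p1 + p2·t and its data are made up, and a simple grid search stands in for the inner optimizer):

```python
# Profile likelihood sketch: fix p1 on a grid of steps away from its optimum,
# re-optimize the free parameter p2 each time, and stop when the cost
# exceeds the cutoff of equation 1.6.

T = [0.0, 1.0, 2.0, 3.0]
DATA = [1.1, 2.9, 5.2, 6.9]   # made-up measurements of y = p1 + p2*t
SIGMA = 0.2

def cost(p1, p2):
    return sum(((y - (p1 + p2 * t)) / SIGMA) ** 2 for t, y in zip(T, DATA))

def optimize_p2(p1):
    # Inner optimization over the free parameter p2 (grid search stand-in).
    return min((cost(p1, i * 0.01), i * 0.01) for i in range(400))

def profile_likelihood(p1_opt, cost_opt, step=0.01, chi2_1dof=3.841):
    cutoff = cost_opt + chi2_1dof    # equation 1.6, alpha = 0.95, 1 dof
    bounds = {}
    for d in (+1, -1):               # equation 1.5: p1 = p1_opt + step*s*d
        s = 0
        while optimize_p2(p1_opt + step * s * d)[0] <= cutoff:
            s += 1
        bounds[d] = p1_opt + step * (s - 1) * d
    return bounds[-1], bounds[+1]    # lower and upper confidence limits

# Locate the optimum on the same grids, then profile p1:
cost_opt, p1_opt = min((optimize_p2(i * 0.01)[0], i * 0.01)
                       for i in range(300))
lo, hi = profile_likelihood(p1_opt, cost_opt)
```

For this identifiable toy parameter the profile rises past the cutoff on both sides, giving a finite confidence interval around p1_opt; for an unidentifiable parameter (compare R3 and R4 in Figure 1.4) the loop in one direction would not terminate within the allowed range.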

Prediction Profile Likelihood

The uncertainty of the model parameters is not the only interesting uncertainty aspect of a model; the uncertainty of the model simulation is also of interest. Model simulation uncertainty is a measure of how much the simulation can vary and still provide an acceptable cost. To determine it, a method called prediction profile likelihood can be used. Prediction profile likelihood is similar to profile likelihood; the difference is that the constraint is put on a property of the model simulation instead of on a parameter [19].

Identifiability

Given a set of confidence intervals, it is possible to determine the identifiability of a parameter. A parameter is said to be identifiable if it is possible to determine its true interval [20]; note that just because a true interval can be found does not necessarily mean that the model is true. True interval merely refers to the fact that, within the model, the parameter has an interval we can find. The opposite of identifiability is unidentifiability, which can have different causes; two notable examples are the model structure and the available data. For the former case, take equation 1.7 as an example: even if y is known, it is not possible from this information alone to determine a and b separately; they can only be determined as a sum.

y=a+b (1.7)
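This structural unidentifiability is easy to demonstrate numerically (Python sketch with a hypothetical measurement y = 5):

```python
y_measured = 5.0   # hypothetical single measurement of y = a + b
sigma = 0.1

def cost(a, b):
    # Weighted squared residual for the model y = a + b
    return ((y_measured - (a + b)) / sigma) ** 2

# Very different parameter sets give exactly the same cost, so a and b
# cannot be determined individually -- only their sum is identifiable:
costs = [cost(a, 5.0 - a) for a in (-100.0, 0.0, 2.5, 100.0)]
```

Every pair along the line a + b = 5 fits perfectly, so a profile likelihood of a (or b) alone would be completely flat.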

In the latter case of unidentifiability, the measurement noise is too high, or the amount of experimental data is insufficient, to adequately determine a parameter. An example can be seen in Figure 1.4, where R3 and R4 are not identifiable. However, if more experimental data could be obtained, they might become identifiable. In Figure 1.4, for example, if metabolite E could be measured, it might be possible to reduce the confidence intervals of R3 and R4.

Another thing to consider is the cutoff used in profile likelihood. If the cutoff were lowered, it would obviously become easier for a parameter to be classified as identifiable. In this project the cutoff was defined as in equation 1.6.

From now on in this report, the two types of unidentifiability will be handled together, and a parameter will be classified as either identifiable or not, with no distinction as to the cause. A parameter will be classified as identifiable if its confidence interval is limited in a meaningful way.

1.5 Prior work

Up until now, Nicolas Sundqvist has created a TCA-cycle model using the experimental data from Roland Nilsson's group. The TCA-cycle model is able to explain the experimental data. However, Nicolas Sundqvist has not determined the confidence intervals outside of using OpenFlux's inbuilt functionality. As OpenFlux uses a different method to determine the confidence intervals than profile likelihood, it is interesting to see if OpenFlux gives results comparable to profile likelihood, as described in section 1.4. There has been no previous attempt at testing if the model can predict experimental data for which it has not been previously trained; this can be done by leaving parts of the experimental data set out of training.


In conjunction with the leave-one-out method, it might also be interesting to do a prediction profile likelihood on the model simulation. The results from the prediction profile likelihood give a measure of the uncertainty of the model simulation, something which has not been previously studied. Nicolas Sundqvist has previously made a mapping of an earlier version of the TCA-cycle model; this mapping does not contain any information added in later versions. Hopefully, a method which can create a visualization of the TCA-cycle model with little manual input can be created. An automatic visualization method would simplify the visualization of future versions of the TCA-cycle model.



1.6 Aims

The aim of the project was to continue to develop and evaluate the TCA-cycle model which Nicolas Sundqvist has created. This includes checking the identifiability of the model parameters as well as the uncertainty of both the model parameters and the model simulation. It is also important to create a visual representation of the model, to make it easier to see how it fits into biology at large. The following list concretizes the project's aims.

1. Determine the confidence interval of the model parameters in the TCA-cycle model and determine their identifiability.

2. Validate the TCA-cycle model using a leave-one-out method on the experimental data, to see if the model can still adequately explain it.

3. Determine the TCA-cycle model's simulation uncertainty using prediction profile likelihood.

4. Visualize the model.

1.7 Delimitations

This project has been limited by the available experimental data: the experimental data used is only from non-cancer mammary epithelium cells and is limited to three measurements per experiment. The TCA-cycle model was also limited to only use the metabolic reactions provided by Roland Nilsson, who leads a collaborating research group.


2 Method

2.1 Constructing the model in OpenFlux

The TCA-cycle model was built in OpenFlux V2.1 according to the manual, using reactions provided by Roland Nilsson. The model file can be found in supplementary material 1 - Model file. The model was worked on in MATLAB 2017(a); however, some simulations were run on older MATLAB versions, with MATLAB 2016(a) being the oldest.

Nicolas Sundqvist's changes from standard OpenFlux

Nicolas Sundqvist had made some changes to the model and toolbox that diverge from standard OpenFlux. Most of these changes were made to allow multiple datasets to be used simultaneously. The changes Nicolas Sundqvist made are listed below.

1. Rewrote OpenFlux to allow multiple experiments to be examined simultaneously; most notably, this included changing the cost function so that the cost includes the comparison of simulated values and measured data from more than one experiment.

2. Several additions that made OpenFlux more user friendly, such as functionality to save results from several different optimizations.

3. The MIDs of certain metabolites were averaged between different compartments. As an example, pyr_m, which is pyruvate in the mitochondrion, was defined in the model as pyr_m = (pyr_m,"OpenFlux" + pyr_c)/2, where pyr_m,"OpenFlux" is the MID OpenFlux would have given pyr_m. This is because there is no way to determine whether the measured pyruvate is located in the mitochondrion or the cytosol. This was implemented by changing x_sim.m, which is a model file generated by OpenFlux.

4. The standard deviation was set to a constant σ = 0.03. The standard deviation was changed in this project, as explained and motivated later in section 2.1 Changes from Nicolas Sundqvist's model.


Changes from Nicolas Sundqvist's model

During the project, some changes were made to Nicolas Sundqvist's TCA-cycle model. They are detailed below.

1. The pyruvate uptake flux rate was set to 1 arbitrary flow unit; in other words, the flux connected to pyr_ex = pyr_c was set equal to 1. This normalization was done to give a reference point for the different flux rates.

2. Three new uptake flux rates were added, which represent external metabolites being added to the system. These were added because OpenFlux does not allow reversible reactions to use a given set flux value; see the earlier point 1. The added flux rates can be seen in Table 2.1.

Table 2.1: Metabolic uptake flows which were added to the model. RxnID is the name of the reaction or flow. fluxPar is the number of the corresponding parameter given in OpenFlux. Reaction is the stoichiometric equation for the reaction. Atom transfer is a mapping of where specific atoms go after the reaction. Basis is the desired value of that flux parameter. These fields are required in the model file of OpenFlux, except for fluxPar. Any fields which are not presented here were left blank in the model file.

RxnID fluxPar Reaction atom transfer rxnType basis

Pyr uptake v(1) pyr_ex = pyr_c abc = abc F 1

gln_L uptake v(2) gln_L_ex = gln_L_c abcde = abcde F

ac uptake v(3) ac_ex = ac_m ab = ab F

3. Instead of using a constant standard deviation σ = 0.03, a lower floor on the standard deviation was implemented: the estimated standard deviation was used if it was higher than 0.03, and otherwise 0.03 was used as the standard deviation in that data point. This floor was implemented because some data points had much lower estimated standard deviations than others, by a factor of a thousand in some cases, and it was assumed that there was at least some amount of structural error not included in the estimated standard deviation.
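The floor can be expressed as a per-data-point rule (Python sketch; the estimated standard deviations are hypothetical):

```python
SIGMA_FLOOR = 0.03

def effective_sigma(estimated_sigma):
    """Use the estimated standard deviation unless it falls below the floor,
    in which case the floor value is used instead."""
    return max(estimated_sigma, SIGMA_FLOOR)

# Hypothetical estimated standard deviations, some far below the floor:
estimated = [0.00003, 0.01, 0.05, 0.002]
sigmas = [effective_sigma(s) for s in estimated]
```

This keeps data points with tiny estimated standard deviations from dominating the cost in equation 1.2.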

2.2 Determining Confidence interval of parameters

Two methods of determining confidence intervals were used in this project: OpenFlux's Task 2 - Sensitivity analysis and profile likelihood. When using OpenFlux, two different settings were used: with perform fine iteration and without perform fine iteration. See the supplementary material OpenFlux manual for full details on the difference. The profile likelihood test was implemented as a function in MATLAB, which used the following algorithm.

1. Set d = 1, where d determines the direction of the search.

2. The current step is set to s = 0, the step length ∆ is set to the desired step length (depending on the flow), and the over-cutoff count is set to k = 0.

3. The starting guesses for the global optimization, V_g,s=0, and the local optimization, V_l,s=0, are both set equal to a previously determined optimal value: V_g,s=0 = V_l,s=0 = V_opt.

4. The target value is set to p_n,target = p_n,opt + ∆p_n · s · d, where p_n is the model parameter to be tested.

5. A global optimization is started with V_g,s as the starting point, using simulated annealing as the algorithm. An ad hoc cost, quadratically proportional to the difference between p_n,target and the actual simulated value, is used to guide it to the correct position,



as well as an ad hoc cost which punishes deviations from 1 for flux parameter 1 (which is supposed to be kept at 1 at all times). The cost function for the global optimization is shown in equation 2.1. V_l,s is set to the result.

Cost(V) = Σ_n (Y_n − Ŷ_n(V))² / σ_n² + 10 · (p_n,target − p_n,actual)² + 10 · (V(1) − 1)²   (2.1)

6. Two local optimizations are started, using V_g,s and V_l,s−1 as starting guesses. V_g,s is set to whichever of the two results has the lowest cost. In this step, MATLAB's non-linear constraints are used to find p_n,target.

7. The cost of V_g,s and the actual value of p_n are saved to a file. Note that the actual value of p_n can differ from p_n,target.

8. Check if the cost is lower than the cutoff; if it is not, increase the over-cutoff count k by one. If it is lower than the cutoff, set k = 0.

9. Check if k <= 5; if it is, increase s by one (s = s + 1) and repeat from step 3.

10. Set p_n,max to the actual value of p_n if d = 1, or p_n,min to the actual value of p_n if d = −1.

11. Check if d = 1; if it is, change d to −1 and repeat from step 2.

12. The test is completed.
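The penalized objective of equation 2.1 used in step 5 can be sketched as follows (Python; the simulate function and all numbers are hypothetical stand-ins for the EMU simulation):

```python
MEASURED = [0.5, 0.3, 0.2]
SIGMA = [0.03, 0.03, 0.03]

def simulate(v):
    # Hypothetical stand-in: a simple map from two fluxes to three MID values.
    return [v[0] * 0.5, v[1] * 0.3, v[0] * 0.1 + v[1] * 0.1]

def penalized_cost(v, n, target, weight=10.0):
    sim = simulate(v)
    data_cost = sum((y - yhat) ** 2 / s ** 2
                    for y, yhat, s in zip(MEASURED, sim, SIGMA))
    # Ad hoc penalties, as in equation 2.1: steer parameter n toward its
    # target value, and keep flux parameter 1 (v[0] here) pinned at 1.
    return (data_cost
            + weight * (target - v[n]) ** 2
            + weight * (v[0] - 1.0) ** 2)

# The penalty pulls the optimizer toward v[n] = target without being a
# hard constraint: deviating is allowed but increasingly expensive.
base = penalized_cost([1.0, 1.0], n=1, target=1.0)
pushed = penalized_cost([1.0, 1.2], n=1, target=1.0)
```

The soft penalty is what lets the global optimizer wander the parameter space while still being guided toward the constrained value, with the hard enforcement left to the local optimizer's non-linear constraints in step 6.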

The step lengths used varied between flows and were chosen relative to the confidence intervals calculated using Task 2 - Sensitivity analysis, with fluxes having large ranges getting a larger step size. The full list of step lengths can be found in supplementary material 3 - confidence intervals from OpenFlux. The settings for the optimization algorithms can be found in supplementary material 4 - settings.

2.3 Validation by leaving one out

To remove parts of the dataset from the training set, the standard deviation in those data points was set to ∞, which in MATLAB is done through the Inf input. Task 1 - Parameter Estimation was then run in OpenFlux with the edited values. Since the data points had infinite error, they did not contribute to the final cost of the model. Infinite standard deviation was used instead of removing the data from the data set because it was easier to implement.
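The effect of setting the standard deviation to Inf can be sketched in Python (hypothetical numbers): a data point with infinite standard deviation contributes zero to the cost, so it is effectively removed from training while the rest of the pipeline is left unchanged.

```python
import math

def cost(measured, simulated, sigma):
    # Weighted sum of squared residuals; sigma = inf zeroes out a data point.
    return sum((y - yhat) ** 2 / s ** 2
               for y, yhat, s in zip(measured, simulated, sigma))

measured  = [0.40, 0.30, 0.15]
simulated = [0.10, 0.28, 0.16]          # hypothetical: first point fits badly
sigma_full     = [0.03, 0.03, 0.03]
sigma_left_out = [math.inf, 0.03, 0.03]  # first point left out of training

full_cost = cost(measured, simulated, sigma_full)
training_cost = cost(measured, simulated, sigma_left_out)
```

The badly fitting first point dominates the full cost but vanishes from the training cost, which is exactly the leave-one-out behavior wanted here.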

These new data sets were then used to train a new set of parameters, with the goal of seeing if the model could predict the left-out pieces of experimental data. Then a profile likelihood test was performed, where the property tested was each piece of the MID, from M0+ to Mn+. akg (alpha-ketoglutarate) was chosen because this metabolite had the highest individual cost contribution of all the metabolites when using the original data set, as well as a high standard deviation in the data. The cost and standard deviation can both be seen in Figure 3.6.

This profile likelihood test was implemented as detailed in the list below.

1. A step integer w = 0 was defined.

2. p_n, the property being studied, was set to MID_w+.

3. The same algorithm as in section 2.2 Determining Confidence interval of parameters was used, except with the parameter replaced by the property.


2.4 Visualization of the model

The model structure was visualized using Cytoscape V3.6.0, which is an open source visualization toolbox. To generate the necessary data list for Cytoscape, a MATLAB script was written that extracted information from the OpenFlux model file, as well as from the result file of OpenFlux's Task 2 - Sensitivity analysis.
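The extraction step can be sketched in Python (the role played by the MATLAB script; the reaction entries are hypothetical, mirroring the model-file fields in Table 2.1):

```python
import csv
import io

# Hypothetical reaction list in the shape of the OpenFlux model-file fields.
reactions = [
    {"rxnID": "Pyr uptake",   "substrate": "pyr_ex",
     "product": "pyr_c",   "flux": 1.0},
    {"rxnID": "gln_L uptake", "substrate": "gln_L_ex",
     "product": "gln_L_c", "flux": 0.095},
]

def to_cytoscape_edge_list(rxns):
    """Write a tab-separated source/interaction/target/flux table that
    Cytoscape can import as a network from a delimited text file."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "interaction", "target", "flux"])
    for r in rxns:
        writer.writerow([r["substrate"], r["rxnID"], r["product"], r["flux"]])
    return out.getvalue()

edge_list = to_cytoscape_edge_list(reactions)
```

Because the edge list is generated directly from the model file, a new visualization can be produced for every model version with no manual redrawing.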


3 Results

3.1 Determining confidence interval

To be able to determine how identifiable the parameters of the TCA-cycle model were, it was first necessary to determine the parameters' confidence intervals. The confidence intervals obtained using the profile likelihood described in section 2.2 Determining Confidence interval of parameters, as well as the results obtained using OpenFlux's Task 2 - Sensitivity analysis, are presented below.

Confidence intervals determined using OpenFlux

First, OpenFlux was used to determine the confidence intervals. In Table 3.1, a selection of the confidence intervals obtained using OpenFlux's Task 2 - Sensitivity analysis is shown. The confidence intervals in Table 3.1 were obtained without using the perform fine iteration setting. For the full settings used in Task 2 - Sensitivity analysis, see supplementary material 4 - settings, and for more confidence intervals, see supplementary material 3 - confidence intervals from OpenFlux. The corresponding values obtained using perform fine iteration are presented in Table 3.2.

Table 3.1: Confidence intervals obtained using OpenFlux without using the perform fine iteration setting. RxnID is the name given to a certain reaction. fluxPar is the number of the parameter given in OpenFlux. min and max are the stoichiometric range for a given parameter. low and high are the lower and higher limits of the confidence interval. %resolved is a measure of how large a portion of the stoichiometric range is limited by the confidence interval; a higher value suggests a more identifiable parameter. flag_high/low are the break statuses for the algorithm; see the OpenFlux manual for details on their meaning.

fluxPar opVal min max low flag_low high flag_high %resolved

v(5) 32.438 0 7500 0 converged 7499.995 boundary 0.00%

v(16) 0.095 0 8999 0.092 converged 0.107 converged 100.00%

v(19) 0.169 0 749.5 0.162 converged 0.178 converged 100.00%


Table 3.2: Confidence intervals obtained using OpenFlux with the perform fine iteration setting. RxnID is the name given to a certain reaction. fluxPar is the number of the parameter given in OpenFlux. min and max are the stoichiometric range for a given parameter. low and high are the lower and higher limits of the confidence interval. %resolved is a measure of how large a portion of the stoichiometric range is limited by the confidence interval; a higher value suggests a more identifiable parameter. flag_high/low are the break statuses for the algorithm; see the OpenFlux manual for details on their meaning.

fluxPar opVal min max low flag_low high flag_high %resolved

v(5) 32.438 0 7500 0 converged 7500 converged 0.00%

v(16) 0.095 0 8999 0.095 choke 0.106 converged 100.00%

v(19) 0.169 0 749.5 0.135 converged 0.215 converged 100.00%

v(39) 0 0 14249 0 converged 0.456 converged 100.00%

Comparison between results from OpenFlux and Profile Likelihood

Profile likelihood was used to determine confidence intervals of the model parameters outside of OpenFlux, using the profile likelihood previously described in section 2.2 Determining Confidence interval of parameters. Attempts were made to create curves for each parameter, but due to a server crash only curves for parameters 2-11, 13-19, and 33-44 were completed. Figures 3.1 and 3.2 show the profile likelihood curves for two different parameters (parameters 16 and 19). In each figure, the confidence intervals from OpenFlux are also presented as a comparison; these are the same intervals as in the previous section 3.1 Confidence intervals determined using OpenFlux. All completed curves are presented in supplementary material 5 - profile likelihood.

In Figure 3.2, we can see that the results from profile likelihood and OpenFlux (with fine iteration) agree quite well. However, in Figure 3.1, we can see that OpenFlux slightly underestimates the upper limit of the confidence interval. In both Figures 3.1 and 3.2, we see that there is quite a large difference between using fine iteration and not using it.



Figure 3.1: Result of profile likelihood for flow nr 16. The X-axis shows different values for flow parameter 16 and the Y-axis shows the lowest cost at that point. The red horizontal line is the cutoff, which is defined as the cost at the optimum + inv. χ²(0.95, 1). The two green vertical lines (overlapped by the vertical blue lines) are the values for the confidence interval obtained from OpenFlux without fine iteration. The two blue vertical lines are the values for the confidence interval obtained from OpenFlux with fine iteration. Note that the results have been trimmed and peaks which were considered failed optimizations were removed; the untrimmed results are presented in supplementary material 5 - profile likelihood. The step length used, as described in section 2.2 Determining Confidence interval of parameters, is 0.001.



Figure 3.2: Result of profile likelihood for flow nr 19. The X-axis shows different values for flow parameter 19 and the Y-axis shows the lowest cost for that flow. The red horizontal line is the cutoff, which is defined as the cost at the optimum + inv. χ²(0.95, 1). The two green vertical lines are the values for the confidence interval obtained from OpenFlux without fine iteration. The two blue vertical lines are the values for the confidence interval obtained from OpenFlux with fine iteration. Note that the results have been trimmed and peaks which were considered failed optimizations were removed; the untrimmed results are presented in supplementary material 5 - profile likelihood. The step length is as described in section 2.2 Determining Confidence interval of parameters.


3.1. Determining confidence interval

Problems with Profile Likelihood method

There were some problems with profile likelihood. First of all, profile likelihood was much slower than OpenFlux's method; in Table 3.3, one can see a comparison of the time needed to perform the analysis. This also led to some tests being canceled prematurely, as they took too much time. An example of a prematurely canceled test is shown in Figure 3.3. The function attempting to find the confidence interval for flux parameter 5 was canceled on the fifth day for the right-hand interval, and if one compares it to the value from OpenFlux in Table 3.1 or 3.2, one can see that it was not close to completion.

In some cases the optimization failed to find any good value. One example is presented in Figure 3.4, where the points do not seem to follow any pattern. It also does not agree in any way with the results from OpenFlux seen in Table 3.1.

Table 3.3: A comparison of time consumption between different methods for determining Confidence Intervals.

Method  Mean time (sec)  Mean time (h:m:s)

OpenFlux with fine iteration (N = 32)  1569  00:26:10

OpenFlux without fine iteration (N = 15)  9013  02:30:13

Profile likelihood  -  1-5 days


Figure 3.3: Result of profile likelihood for flow nr 5. The X-axis shows different values for flow parameter 5 and the Y-axis shows the lowest cost at that point. The red horizontal line is the cutoff, which is defined as the cost at the optimum + inv. χ²(0.95, 1). The interval obtained using OpenFlux, without fine iteration, is 0-7499.995 and is therefore not included in the figure. Note that the results have been trimmed and peaks which were considered failed optimizations were removed; the untrimmed results are presented in supplementary material 5 - profile likelihood. This experiment was ended prematurely. The step length used is as described in section 2.2.



Figure 3.4: Result of profile likelihood for flow nr 39. The X-axis shows different values for flow parameter 39 and the Y-axis shows the lowest cost at that point. The red horizontal line is the cutoff, which is defined as the cost at the optimum + inv. χ²(0.95, 1). The green vertical lines are the values for the confidence interval obtained from OpenFlux, without fine iteration. The step length used, as described in section 2.2, is 1.

3.2 Validation and determining model uncertainty

To validate the model, the data for akg was removed from the data set, and OpenFlux was then used to try to find a parameter set which could explain the data. When evaluating the validation of the model, three different results were considered: a χ² test, how well the model fit the data, and the confidence interval of the simulation of akg.

Below, the results from the uncertainty analysis are presented. In Figure 3.5, the simulation using the parameters which were optimized without measurements of akg in the dataset is compared to the full dataset, which means that the cost includes the cost from akg as well. In Figure 3.6 the same thing is plotted, except that the parameters used there were optimized with the experimental data for akg in the dataset. In Table 3.4, a comparison of cost values for each model is shown.

Table 3.4: Different costs of the model. The first column is the baseline cost for the model. The second is akg's contribution to the total cost in the first. The third is the cost of the parameters without comparing to the full dataset; this is the cost the optimization function uses. The fourth is the cost of the parameters optimized without akg but compared with the full dataset. The last is the cutoff from the χ² test used.

Full Dataset cost akg contribu-tion to cost Optimization without akg Full dataset without train-ing to akg χ2 cutoff 197.2 47.4 147.0 200.64 313
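The χ2 test in the last column amounts to a simple threshold comparison: a parameter set is not rejected if its cost stays below the cutoff. A minimal check using the numbers from Table 3.4 (the cutoff of 313 is taken as given):

```python
# Costs from Table 3.4; the chi2 cutoff of 313 is taken as given.
chi2_cutoff = 313.0
costs = {
    "full dataset": 197.2,
    "optimized without akg": 147.0,
    "full dataset, trained without akg": 200.64,
}
for name, c in costs.items():
    verdict = "passes" if c < chi2_cutoff else "fails"
    print(f"{name}: cost {c} {verdict} the chi2 test")
```

All three costs fall well below the cutoff, which is why the model is not rejected even when trained without akg.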





Figure 3.6: Comparison of simulated data and experimental data, from parameters optimized with akg in the dataset. On the X-axis are the different mass distributions and on the Y-axis the proportion of molecules having that mass distribution. The bars are ordered in the following order for all subfigures: Glutamine experimental data, Glutamine simulated data, Glucose experimental data and Glucose simulated data. The sum of square residuals are



After determining that the model could pass a χ2 test even without having been trained on the full dataset, a prediction profile likelihood test was performed on the model simulation. In Figures 3.7 and 3.8, the results of the prediction profile likelihood test are presented. Due to a mistake when choosing a starting point for the search algorithm, the optimal starting parameter set was not used; instead, one of the top five parameter sets provided by OpenFlux was used. This is why one point, the lowest, in each curve stands out: that point is the optimal starting position. If the assumption holds that the optimization is optimal, meaning that it finds the global minimum, then these curves still provide information about the confidence interval, which is why they are included here.
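The prediction profile likelihood scan can be illustrated with a toy model. Here the prediction (one simulated MID proportion) happens to be a parameter itself, so the constraint can be enforced by direct substitution; in the real model, the constraint on the simulated MID must instead be handled inside the optimizer. All functions and numbers below are illustrative assumptions, not the thesis model:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# Toy stand-in for the model cost; v[0] doubles as the prediction
# (one simulated MID proportion), v[1] is a nuisance parameter.
def cost(v):
    return 150.0 + 400.0 * (v[0] - 0.3) ** 2 + (v[1] - 1.0) ** 2

def ppl(target):
    # Constrain the prediction to `target` and re-optimize the rest.
    return minimize(lambda w: cost([target, w[0]]), [1.0]).fun

v_opt = [0.3, 1.0]
cutoff = cost(v_opt) + chi2.ppf(0.95, 1)   # F(optimum) + inv. chi2(0.95, 1)
targets = np.linspace(0.0, 0.6, 61)        # candidate MID proportions
profile = np.array([ppl(t) for t in targets])
ci = targets[profile <= cutoff]            # prediction confidence interval
print(round(ci.min(), 2), round(ci.max(), 2))
```

The set of prediction values whose constrained re-optimization stays below the cutoff is the prediction confidence interval; this is exactly the quantity plotted against MID in Figures 3.7 to 3.10.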

[Plot for Figure 3.7: Cost F(V) as a function of different MIDs for simulated Glucose, with curves for M0+ to M5+ and a cutoff line; axis values omitted.]

Figure 3.7: The X-axis shows different mass distributions and the Y-axis the cost of the parameter set for which the constraint to have the corresponding proportion of the mass distribution is met. This is shown for M0+ to M5+. The cutoff line is set at F(optimum) + inv. χ2(0.95, 1), which is around 153 + 3.8. The Y-axis is cut short at 300, as the cost of the lower MIDs continues to very high values far away from the optimum.


[Plot for Figure 3.8: Cost F(V) as a function of different MIDs for simulated Glutamine, with curves for M0+ to M5+ and a cutoff line; axis values omitted.]

Figure 3.8: The X-axis shows different mass distributions and the Y-axis the cost of the parameter set for which the constraint to have the corresponding proportion of the mass distribution is met. This is shown for M0+ to M5+. The cutoff line is set at F(optimum) + inv. χ2(0.95, 1), which is around 153 + 3.8. The Y-axis is cut short at 300, as the cost of the lower MIDs continues to very high values far away from the optimum.

In an attempt to fix the previously mentioned mistake of using the wrong starting guess, the tests were redone. This time, however, the confidence intervals became much smaller and only contained a few points. The results from these redone tests are presented in Figures 3.9 and 3.10.


[Plot for Figure 3.9: Cost F(V) as a function of MID for Glucose, with subfigures for M0+ to M5+; axis values omitted.]


[Plot for Figure 3.10: Cost F(V) as a function of MID for Glutamine, with subfigures for M0+ to M5+; axis values omitted.]

Figure 3.10: The X-axis shows different mass distributions and the Y-axis the cost of the parameter set for which the constraint to have the corresponding proportion of the mass distribution is met. This is shown for M0+ to M5+. The cutoff line is set at F(optimum) + inv. χ2(0.95, 1), which is around 147 + 3.8.



After determining the confidence intervals of the MIDs of akg, they could be used to draw conclusions about the uncertainty of the model. In Figure 3.11, a comparison between simulated data and experimental data for akg is presented. The results from the prediction profile likelihood are also shown in Figure 3.11 as error bars on the simulated data. These are based on the confidence intervals from Figures 3.9 and 3.10, where the correct starting point for the search algorithm was used.

Figure 3.11: A comparison between the model simulation for akg and experimental data. Experimental data for akg was not present when training the model. The bars represent the distribution of different masses in the metabolite. The error bars on the simulated data represent the confidence intervals presented in Figures 3.9 and 3.10. From left to right: experimental data for Glutamine, simulated Glutamine, experimental data for Glucose and simulated Glucose.
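Reading an error bar off a prediction profile amounts to taking the extreme MID values whose profiled cost stays below the cutoff. A small helper, with purely illustrative numbers rather than the thesis data, might look like:

```python
import numpy as np

def ci_from_profile(mids, costs, cutoff):
    """Return (lower, upper) bounds of the prediction confidence interval:
    the extreme MID values whose profiled cost lies below the cutoff."""
    mids, costs = np.asarray(mids), np.asarray(costs)
    inside = mids[costs <= cutoff]
    if inside.size == 0:
        raise ValueError("no point below the cutoff; the scan may have failed")
    return inside.min(), inside.max()

# Illustrative profile for one MID of akg (not the thesis data).
mids   = [0.555, 0.560, 0.565, 0.570, 0.575, 0.580]
costs  = [175.0, 160.0, 149.0, 148.0, 155.0, 172.0]
cutoff = 147.0 + 3.84            # F(optimum) + inv. chi2(0.95, 1)
lo, hi = ci_from_profile(mids, costs, cutoff)
print(lo, hi)
```

The resulting (lo, hi) pair directly gives the lower and upper ends of one error bar in a plot like Figure 3.11.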

3.3 Visualization of the model

The initial goal of the visualization was to create a method to automatically visualize the model using only the OpenFlux model file and the confidence intervals as input. The end result does, however, still require some manual input.
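One way the remaining manual steps could be automated is to emit the reaction network in Cytoscape's simple interaction format (SIF) directly from the parsed model file. The reaction IDs and metabolites below are hypothetical placeholders, not the thesis model:

```python
# Hypothetical sketch: exporting a reaction network as a Cytoscape SIF file.
# Reaction IDs and metabolites are illustrative only.
reactions = {
    "R1": (["CIT"], ["AKG", "CO2"]),
    "R2": (["AKG"], ["SUC", "CO2"]),
}

lines = []
for rxn, (substrates, products) in reactions.items():
    for s in substrates:
        lines.append(f"{s}\tsubstrate_of\t{rxn}")
    for p in products:
        lines.append(f"{rxn}\tproduces\t{p}")
sif = "\n".join(lines)
print(sif)  # one edge per line; importable as a network in Cytoscape
```

Each line encodes one edge (source, interaction type, target), which Cytoscape can load as a network; carbon-free metabolites such as H2O could then be split into per-reaction copies, matching the de-cluttering rule described below.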

The visualization can be viewed in Cytoscape using supplementary material 6 - model visualization a, or as a pdf in supplementary material 6 - model visualization b. To reduce clutter, all metabolites that do not contain any carbon are separated from each other. This means, as an example, that an H2O molecule is not connected to other H2O molecules unless they are connected to the same reaction. In Table 3.5, an explanation of how to interpret supplementary material 6 - model visualization a, or the pdf in supplementary material 6 - model visualization b, is provided. Whether a reaction is reversible or not is indicated by the
