
PROMINENT VARIABLE DETECTION IN LIPID NANOPARTICLE EXPERIMENTS: A SIMULATION STUDY ON NON-PARAMETRIC CHANGE POINT ANALYSIS

Submitted by

Hongru Zhai

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a two-year Master of Arts degree in Statistics in the Faculty of Social Sciences

Supervisors:

Salman Toor & Philip Fowler

Phil Harrison


ABSTRACT

Images are important sources of information, and powered by the newest robotic microscopy technologies, the volume of image data available is larger than ever before. Hundreds of variables can be calculated from the images. The medical research group within the HASTE project conducts an LNP (Lipid Nanoparticle) experiment that is designed to transport a drug to target cells. In the LNP experiment, microscopy images are taken over time to record whether the drug is taken up. We propose that non-parametric change point analysis can be used to identify the variable that shows the earliest state change (potentially signifying the drug uptake) among all variables calculated from the images. Two algorithms for non-parametric change point analysis, an agglomerative and a divisive one, are studied through simulation, leading us to implement the agglomerative algorithm on the LNP experiment data. Furthermore, the simulation results show that the prominent variable detection accuracy improves when more time points are included in the experiment. In the application, correlation is most likely to be detected as the sole prominent variable.


Contents

1 Introduction
2 Theory and Methodology
  2.1 Change Point Analysis
  2.2 The Two Types of Algorithms
    2.2.1 Divisive Algorithm
    2.2.2 Agglomerative Algorithm
  2.3 Simulation
3 Simulation Study
  3.1 Simulation Design
  3.2 Results and Conclusions from Simulations
4 Application
  4.1 Data
    4.1.1 The LNP Experiment
    4.1.2 Calculating Variables From the LNP Experiment Images
  4.2 Prominent Variable Detection for the LNP Dataset
  4.3 Results
5 Discussion
  5.1 Limitations and Future Works
6 Acknowledgements
A Code Availability
B Complementary Data Descriptions
C Measures of the Change
  C.1 Laplacian Variance
  C.2 Correlation
  C.3 Summation of the Pixel Intensities


1 Introduction

In medical research, drug delivery with nanoparticles (particles of matter between 1 and 100 nanometres in diameter) as drug carriers has been gaining importance in recent years. To achieve the desired therapeutic effect, medical experiments are designed to transport the drug by nanoparticles to target cells that are cultured in small containers called wells. Before exploring the effect of the drug, it is usually important to know whether the drug is successfully taken up by the target cell. Traditionally, green fluorescent protein (GFP) is used to signify drug uptake by exhibiting bright green fluorescence in microscopy images when the drug is successfully taken up by a cell.

Microscopy images can be taken over time to record experiments in which the drug is taken up and triggers the GFP. Figure 1 depicts the GFP triggered by a successful uptake of the drug in a number of cells. When the drug carried by lipid nanoparticles (LNP) enters the target cells (a successful uptake), the target cells show up in Figure 1 as green stains. The more green stain shown in the images, the higher the proportion of cells that take up the drug successfully.

Figure 1: An example of GFP expression in 9 wells from 3 groups of high, medium and low LNP doses, where each column corresponds to one group.

When the cells start to take up the drug, a series of characteristic changes in the microscopy images is expected to occur, which demands suitable measures to extract the information and describe the changes. Since a change in pixel intensities across the whole image can generally signify the GFP fluorescence caused by drug uptake, measures associated with pixel intensity are adopted to extract information about drug uptake from the images. To quantify the GFP, variables calculated from the microscopy images, such as the total correlation in the images and the Laplacian variance, can be selected as measures. A brief description of these measures can be found in Appendix C.

Of particular interest is the variable that shows a significant quantitative change first, because changes in the variables (calculated from the images) indicate increasing drug uptake activity. The earliest change of a variable tells us when the uptake starts, and the variable that changes first might be the "best" variable for detecting the starting point of the uptake.

The HASTE project (Hierarchical Analysis of Spatial and Temporal Image Data) is a collaborative effort between Uppsala University, AstraZeneca and Vironova to tackle the challenge of acquiring, analyzing and interpreting image data from such drug delivery experiments. Henceforth, the drug delivery experiment that the HASTE project provided will be referred to as the LNP experiment.

To address the task, we turn to non-parametric change point analysis, which gives us estimates of the locations of the change points for each of our three variables. The non-parametric change point analysis can be realised through two algorithms, an agglomerative and a divisive one, both based on hierarchical clustering. To determine which algorithm is most suitable for the LNP experiment data, the two algorithms are studied through simulation.

We sum up the goals of the thesis as follows: conducting a simulation study to help choose a suitable change point analysis algorithm for the LNP experiment data, and then applying the change point analysis with the preferred algorithm to the LNP experiment data. The thesis is also intended to serve as an example for anyone who works with longitudinal data and wants to apply non-parametric change point analysis.

This thesis is structured as follows. Chapter 2 introduces the non-parametric change point analysis with details of the two algorithms. Chapter 3 explains the simulation study and shows the simulation results. In Chapter 4, an overview of the LNP experiment data is provided and the change point analysis is carried out for the LNP experiment dataset using the algorithm selected from the simulation results. Finally, in Chapter 5, the limitations of the thesis and future works are discussed.

2 Theory and Methodology

2.1 Change Point Analysis

Change point analysis is a method that finds change points by detecting distributional changes within a series of time-ordered observations. The idea of detecting change points has seen use in a variety of fields. It stems from financial modeling of correlated assets with historical market data (Talih & Hengartner 2005). The use of change point analysis is especially popular in quality control, where the cumulative sum (CUSUM) based control chart is widely adopted by quality engineers to monitor changes in the production process (Taylor 2000). Change point analysis has also been used to develop tools capable of detecting social network changes within the Al-Qaeda terrorist organization (McCulloh & Carley 2008).

Although change point analysis is used in many fields of study, strict assumptions are often needed. Previously proposed change point analysis methods, such as those based on the cumulative sum, only detect changes in the mean of a process. For more complex uses, such as the LNP drug delivery experiment, where the images generate variables whose distributions can potentially vary over time, a more generalised change point analysis is needed and preferred.

In the context of the LNP experiment, change point analysis can be applied to detect successful drug uptake. If GFP expression happens in the experiment, we assume that the expression process does not proceed at a constant speed, but instead consists of phases with different speeds of expression. In the LNP experiment, a successful drug uptake can lead to a distributional change, which can in turn be detected by the change point analysis as a change point.

The analysis of the LNP drug delivery experiment in this thesis uses the non-parametric change point analysis proposed by Matteson & James (2014). Unlike parametric change point analysis, the non-parametric method enables the detection of any distributional change in a time-ordered series.


2.2 The Two Types of Algorithms

For non-parametric change point analysis, a divisive and an agglomerative algorithm were proposed by Matteson & James (2014) to provide estimates of the number and the locations of change points. We now briefly describe the two algorithms.

2.2.1 Divisive Algorithm

The divisive algorithm uses a bisection approach to estimate the change point locations hierarchically. A divergence measure is used by the algorithm to quantify how different two distributions are; the divergence measure $\hat{Q}(X_\tau, Y_\tau)$ it uses is specified in Szekely and Rizzo (2005).

The divergence measure used in the algorithm is originally based on the assumption that $E|X|^\alpha, E|Y|^\alpha < \infty$. In other words, the divergence measure is only defined when the $\alpha$-th absolute moment exists.

The estimation starts with estimating a single change point.

Let $Z_1, Z_2, \ldots, Z_T$ be an independent time-ordered sequence, and let $1 \le \tau < T$ be an integer, where $\tau$ is the time of the unknown change point. Define the two sets of observations $X_\tau = \{Z_1, Z_2, \ldots, Z_\tau\}$ and $Y_\tau = \{Z_{\tau+1}, Z_{\tau+2}, \ldots, Z_T\}$. The location of the change point is then estimated by

$$\hat{\tau} = \underset{\tau}{\operatorname{argmax}} \; \hat{Q}(X_\tau, Y_\tau),$$

where $\hat{Q}(X_\tau, Y_\tau)$ is the measure of divergence mentioned above. Estimation of the change point is achieved by maximizing the empirical divergence measure between the clusters $X_\tau$ and $Y_\tau$; that is, we look for the point $\tau$ at which $\hat{Q}(X_\tau, Y_\tau)$ is maximized.

The first estimated change point partitions the time sequence into two clusters $\hat{C}_1 = \{Z_1, \ldots, Z_{\hat{\tau}}\}$ and $\hat{C}_2 = \{Z_{\hat{\tau}+1}, \ldots, Z_T\}$. Given these two clusters, we apply the single change point procedure within each of the clusters to get two more change points.
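To make the search concrete, the following R sketch implements a brute-force version of the single change point estimator; the divergence below assumes the energy-statistic form of $\hat{Q}$ from Matteson & James (2014) and is illustrative only (in practice, the ecp package should be used).

    # Empirical divergence between two univariate samples; assumes the
    # energy-statistic form of Q-hat (our reading of Matteson & James 2014).
    qhat <- function(x, y, alpha = 1) {
      m <- length(x); n <- length(y)
      between  <- mean(abs(outer(x, y, "-"))^alpha)      # between-cluster term
      within_x <- if (m > 1) mean(dist(x)^alpha) else 0  # within-cluster terms
      within_y <- if (n > 1) mean(dist(y)^alpha) else 0
      (m * n / (m + n)) * (2 * between - within_x - within_y)
    }

    # Single change point estimate: the tau that maximizes Q-hat.
    estimate_cp <- function(z, alpha = 1) {
      taus <- seq_len(length(z) - 1)
      q <- vapply(taus,
                  function(tau) qhat(z[1:tau], z[(tau + 1):length(z)], alpha),
                  numeric(1))
      taus[which.max(q)]
    }

    set.seed(1)
    z <- c(rnorm(40, mean = 0), rnorm(48, mean = 2))  # true change after t = 40
    estimate_cp(z)                                    # expected to be close to 40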

Iteratively applying this estimation technique results in partitioning the time series into $k$ clusters $\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_k$. Conditioning on the previously estimated change points, a significance test determines whether to continue the estimation procedure. The test is performed after each iteration: if it gives a p-value less than the chosen significance level $p_0$, the algorithm goes on to estimate an additional change point; if the p-value is not less than the pre-specified $p_0$, the algorithm terminates.

In the applications studied in Matteson & James (2014), the divisive algorithm gave consistent estimates of the change points under standard regularity assumptions.

2.2.2 Agglomerative Algorithm

In practice, the time series in an experiment can be rather long, and as its length grows, the divisive algorithm becomes more and more computationally inefficient without pre-partitioning the series. The agglomerative algorithm was therefore proposed as an alternative that runs faster in practice: it restricts the set of possible change point locations to a small subset in order to reduce the computational time.

Given an initial clustering, in which each cluster does not necessarily consist of a single time point, adjacent pairs of clusters may be merged by optimizing a goodness-of-fit statistic. To determine which adjacent cluster pair should be merged, the algorithm computes and records the increase or decrease in the goodness-of-fit statistic for every possible merge. This procedure is repeated until the whole time series is merged into one single cluster. Ultimately, the estimates of the change points are given by the clustering that maximizes the goodness-of-fit statistic over the whole merging process.

To avoid overfitting, a penalty on the goodness-of-fit statistic can be used. For penalized estimation, the change point locations are estimated by maximizing the penalized goodness-of-fit statistic (James & Matteson 2015)

$$\tilde{S}_k = \hat{S}_k + \text{penalty}(\vec{\tau}(k)),$$

where $\vec{\tau}(k) = \{\tau_1, \tau_2, \ldots, \tau_k\}$ is the vector of the $k$ change points estimated using the agglomerative algorithm with goodness-of-fit statistic $\hat{S}_k$. The penalty functions penalize undersized segmentations, and two examples of penalty functions are given below. In the simulation and application of this thesis the second penalty function is implemented, but how to choose the penalty function is not studied here.


1. $\text{penalty}(\vec{\tau}(k)) = -k$

2. $\text{penalty}(\vec{\tau}(k)) = \dfrac{1}{1+k} \displaystyle\sum_{i=1}^{k-1} (\tau_{i+1} - \tau_i)$
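As a sketch, the two penalty functions can be written in the form that the penalty argument of e.agglo() in the ecp package accepts, namely a function of the vector of estimated change point locations (our reading of the package interface):

    library(ecp)

    penalty1 <- function(cps) -length(cps)  # penalty 1: minus the number of change points
    penalty2 <- function(cps) {             # penalty 2: average spacing of the change points
      sum(diff(cps)) / (1 + length(cps))
    }

    # Hypothetical usage on a univariate series z, passed as a one-column matrix:
    # fit <- e.agglo(as.matrix(z), alpha = 1, penalty = penalty2)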

For both the divisive and the agglomerative algorithm, we can choose whether to detect distributional changes in the mean or also in the variance by varying the value of the parameter $\alpha$. When applying the change point analysis with the ecp package in R, setting $\alpha = 1$ results in the detection of distributional change in the mean, whereas setting $\alpha = 2$ enables the detection of change in the variance.
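A minimal sketch of the effect of the $\alpha$ setting in an ecp call, on a toy series with a mean change followed by a variance change (illustrative only, not the thesis code):

    library(ecp)

    set.seed(42)
    z <- matrix(c(rnorm(50, mean = 0, sd = 1),   # segment 1
                  rnorm(50, mean = 3, sd = 1),   # segment 2: mean shift at t = 51
                  rnorm(50, mean = 3, sd = 3)),  # segment 3: variance change at t = 101
                ncol = 1)

    e.divisive(z, min.size = 10, alpha = 1)$estimates  # aimed at the mean change
    e.divisive(z, min.size = 10, alpha = 2)$estimates  # the variance change as well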

2.3 Simulation

Given the two available algorithms for non-parametric change point analysis, we naturally want to know which algorithm is better for our data. To examine their performance, the algorithms are studied in the next chapter in a simulation setting.

In this thesis, the correlation, the Laplacian variance and the summation of pixel intensities are calculated from the LNP experiment images and used as three variables that represent the state of the LNP experiment. The simulation aims to mimic these three variables as calculated from the LNP experiment images. The three variables are further explained in Appendix C.

In the later application of the change point analysis to the LNP experiment data, we aim to know which variable has the earliest first change point. Thus, in the simulation, we take as the performance measure the accuracy with which each algorithm correctly detects the variable that has the earliest first change point. If one algorithm gives higher accuracy in detecting this variable, it will be preferred over the other algorithm on data similar to the simulation data.


3 Simulation Study

3.1 Simulation Design

The simulation starts with mimicking the three variables calculated from the LNP experiment images. For each variable, the number of change points is preset to 3, dividing the 88 time points into 4 strata. With the number of change points set, we need to generate where the three change points occur. An equivalent way of doing this is to determine how many observations fall into each stratum (the sizes of the strata). We use a multinomial distribution to determine the number of observations that fall into each stratum; the chosen values of the multinomial parameter p are summarized in Table 1. As an example, for Variable C the number of observations falling into each stratum is determined by the distribution Multinomial(0.25, 0.25, 0.25, 0.25).

With the sizes of the strata determined, we generate the observations in each stratum from normal distributions with parameters that depend on the stratum. In this way, Variable A, Variable B and Variable C are generated from normal distributions whose parameters µ and σ take the values shown in Table 1.

The values of µ and σ were chosen so that Variable A, Variable B and Variable C resemble the correlation, the Laplacian variance and the sum of pixel intensities, respectively, from the LNP experiment data.

Table 1: Distribution parameters used to generate the simulation data

Parameter   Variable     1st stratum   2nd stratum   3rd stratum   4th stratum
µ           Variable A   0.3           0.2           0.1           0.18
            Variable B   850           700           650           730
            Variable C   7946000       8330000       7000000       6800000
σ           Variable A   0.15          0.08          0.12          0.18
            Variable B   250           150           250           300
            Variable C   1500000       1250000       1800000       1200000
p           Variable A   0.15          0.3           0.4           0.15
            Variable B   0.4           0.1           0.3           0.2
            Variable C   0.25          0.25          0.25          0.25
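The generation of one simulated variable can be sketched as follows, using the Table 1 parameters for Variable A (the thesis's actual scripts are linked in Appendix A):

    set.seed(7)
    n  <- 88
    p  <- c(0.15, 0.30, 0.40, 0.15)  # multinomial stratum probabilities, Variable A
    mu <- c(0.30, 0.20, 0.10, 0.18)  # per-stratum means
    sg <- c(0.15, 0.08, 0.12, 0.18)  # per-stratum standard deviations

    sizes <- as.vector(rmultinom(1, size = n, prob = p))          # stratum sizes
    varA  <- unlist(mapply(rnorm, n = sizes, mean = mu, sd = sg)) # the series
    cps   <- cumsum(sizes)[1:3]      # the three true change point locations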


With the three variables generated, we are ready to apply the two algorithms to estimate the positions of the change points for these variables. For convenience, the variable with the earliest change point among the three variables will be referred to as the prominent variable in the rest of this thesis.

The prominent variable given by an algorithm is compared with the true prominent variable; if they match, the algorithm has correctly selected the prominent variable. By repeating this procedure a number of times, we can estimate the detection accuracy of the algorithms, i.e. the percentage of times that the algorithm selects the correct prominent variable. A sketch of this procedure is given below.
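A self-contained R sketch of this accuracy estimation follows. The distribution parameters are illustrative rather than those of Table 1, and the extraction of the first change point from the estimates field of e.agglo() reflects our assumption about its output format:

    library(ecp)

    gen_var <- function(n, p, mu, sg) {  # simulate one variable with 3 change points
      sizes <- as.vector(rmultinom(1, n, p))
      list(x   = unlist(mapply(rnorm, n = sizes, mean = mu, sd = sg)),
           cp1 = sizes[1] + 1)           # first true change point
    }
    first_cp <- function(x) {            # earliest interior estimated change point
      est <- e.agglo(as.matrix(x), alpha = 1)$estimates
      suppressWarnings(min(est[est > 1 & est <= length(x)]))
    }

    set.seed(11)
    p <- rep(0.25, 4)
    hits <- replicate(50, {
      vars <- list(gen_var(88, p, c(0, 1, 2, 3),    rep(1, 4)),
                   gen_var(88, p, c(5, 6, 7, 8),    rep(1, 4)),
                   gen_var(88, p, c(9, 10, 11, 12), rep(1, 4)))
      detected <- which.min(vapply(vars, function(v) first_cp(v$x), numeric(1)))
      truth    <- which.min(vapply(vars, function(v) v$cp1,        numeric(1)))
      detected == truth
    })
    mean(hits)  # estimated detection accuracy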

In the simulation study, we examine the detection accuracy under different algorithm settings. For instance, the simulations give the detection performance for α = 1 and α = 2: when α = 1, changes in the mean are considered, whereas α = 2 also enables the detection of changes in the variance.

The other algorithm setting studied by simulation is the starting clustering for the agglomerative algorithm. We also examine the algorithm performance as the length of the time series is set to either 88 or 500 time points, to see if more time points help improve the accuracy. The number 88 was chosen since that is the number of time points in the LNP experiment data.

All simulations with the divisive and agglomerative algorithms are carried out using the current version of the ecp package (James & Matteson 2015) in R (R Core Team 2018). Each simulation setting was run 1000 times for the cases with 88 time points and 50 times for the cases with 500 time points; running 1000 iterations for the 500 time point cases would have been extremely time consuming. The summary table of the prominent variable matching results (known in machine learning as a "confusion matrix") is produced by the confusionMatrix() function in the R library caret (Kuhn 2020). Table 2 shows an example of such a confusion matrix, from which the detection accuracy is calculated as $(426 + 23 + 0)/1000 = 44.9\%$.


Table 2: Confusion Matrix, Divisive Algorithm, α = 1

                                   True Prominent Variable
                                     A     B    C    sum
Predicted Prominent Variable   A   426    14    0    440
                               B   474    23    0    497
                               C    58     5    0     63
                             sum   958    42    0   1000
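A confusion matrix in this style can be produced along the following lines (toy vectors, not the thesis results):

    library(caret)

    # Detected and true prominent variables over a handful of simulated runs.
    detected <- factor(c("A", "B", "B", "A", "C"), levels = c("A", "B", "C"))
    truth    <- factor(c("A", "A", "B", "C", "C"), levels = c("A", "B", "C"))
    confusionMatrix(data = detected, reference = truth)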

3.2 Results and Conclusions from Simulations

The results are summarized in Table 3, which shows the detection accuracy for the different settings. The agglomerative clustering procedure can have starting clusters that contain more than one time point: in Table 3, "Starting Clustering = 8" means that each starting cluster contained 8 adjacent time points, while "NA" means that no starting clustering was specified. From the simulation results we see that when the time points are few (88 time points), both algorithms generally give lower detection accuracy, with the highest accuracy, 64.6%, recorded for the agglomerative algorithm under the setting α = 1 with no specified starting clustering. When simulating with time series of 500 time points, the performance of both algorithms improves considerably, with the highest accuracy, 96%, recorded for the agglomerative algorithm under the setting α = 2 with a starting clustering of 100 time points.

For the simulations with 500 time points, it is worth noting that although the agglomerative algorithm achieved the highest accuracy under the setting α = 2 with a starting clustering of 100 time points, its accuracy deteriorates under some other settings; for example, with α = 2 and no starting clustering, the agglomerative algorithm only achieved 54% accuracy. In contrast, the divisive algorithm is stable, with accuracy always around 90%.

For the simulations with 88 time points, although both algorithms performed worse than with 500 time points, the agglomerative algorithm registered better results than the divisive algorithm.

The simulation results above bring us to the following conclusion: when the number of time points is large enough (for example 500) and we have little knowledge of the number of potential change points in the dataset, the divisive algorithm is favored, for its good performance and little variability in performance. When the number of time points is not so large, as in the LNP experiment data, the agglomerative algorithm is the algorithm of choice, for its comparatively better performance.


Table 3: The Summary of the Results. (*Unit: Time Points per Cluster)

α   Algorithm       Starting Clustering*   # Time Points   # Iterations   Accuracy
1   Divisive        NA                     88              1000           44.9%
1   Agglomerative   NA                     88              1000           64.6%
1   Agglomerative   8                      88              1000           50.4%
2   Divisive        NA                     88              1000           47.9%
2   Agglomerative   NA                     88              1000           50.1%
2   Agglomerative   8                      88              1000           54.4%
1   Divisive        NA                     500             50             88%
1   Agglomerative   NA                     500             50             70%
1   Agglomerative   50                     500             50             60%
2   Divisive        NA                     500             50             90%
2   Agglomerative   NA                     500             50             54%
2   Agglomerative   50                     500             50             60%
1   Agglomerative   20                     500             50             38%
2   Agglomerative   20                     500             50             44%
1   Agglomerative   100                    500             50             92%
2   Agglomerative   100                    500             50             96%


Based on the simulation study, the agglomerative algorithm was chosen for the analysis of the LNP experiment data. Specifically, we apply the agglomerative algorithm with α = 1 and no starting clustering specified, the setting that outperformed the other agglomerative settings for that number of time points.


4 Application

4.1 Data

4.1.1 The LNP Experiment

The LNP experiment was conducted on a cell-culture plate with multiple wells. A well is a small container on the plate where biologists culture cells. In the LNP experiment, cells were pre-cultured and maintained in 60 numbered wells. The LNP doses were added to all the wells immediately before the experiment started, and the wells were then imaged by a Yokogawa CV7000 robotic microscope every 15 minutes for 22 hours. This resulted in 88 images per well and 5,280 images in total. Each image was taken at a resolution of 2554 × 2154 pixels.

Missingness exists in the dataset: 21 of the 60 wells are missing the image taken at the first time point. Since the wells with a missing first time point do not seem to depend on any obvious feature of the data, MCAR (Missing Completely At Random) is assumed for the dataset. Under MCAR, ignoring the missing observation (the first time point) still leads to valid estimation. Table 4 summarises which wells do and do not contain the first time point. For the wells with missingness, only 87 images are included in the dataset.


Table 4: A summary of missingness in the 60 wells of the LNP experiment

Number of Time Points in the Dataset   Numbering of the Wells

88   B02, B03, B07, C02, C03, C05, C06, C08, C09,
     D02, D03, D04, D05, D06, D07, D08,
     E02, E03, E04, E05, E06, E07, E08, E09,
     F02, F03, F04, F05, F06, F07, F08,
     G02, G03, G04, G05, G06, G07, G08, G09

87   B04, B05, B06, B08, B09, B10, B11,
     C04, C07, C10, C11, D09, D10, D11,
     E10, E11, F09, F10, F11, G10, G11

4.1.2 Calculating Variables From the LNP Experiment Images

Before implementing the selected algorithm on the LNP dataset, we need to turn the image data into data that the algorithm can operate on. For this purpose, the correlation, the Laplacian variance and the summation of pixel intensities are calculated from the LNP experiment images and used as three variables that represent the state of the LNP experiment. The three variables are further detailed in Appendix C.

To be clear, at a given time point in a given well, each variable takes one single value. We look at the 88 values from the 88 time points in a well and calculate the difference between the largest and the smallest; this difference is called the range of the variable in that well. If the image changes over time, we expect the variables to change too, so a large range over time within a certain well indicates that the image has changed a lot for this well.
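As a sketch, the per-well range can be computed as follows, assuming a long-format data frame with hypothetical columns well, time and correlation:

    set.seed(3)
    lnp <- data.frame(well        = rep(c("B02", "B03"), each = 88),
                      time        = rep(1:88, times = 2),
                      correlation = runif(176, 0.2, 0.5))  # toy values

    ranges <- aggregate(correlation ~ well, data = lnp,
                        FUN = function(v) max(v) - min(v))
    ranges  # a large range suggests the images in that well changed a lot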

Figure 2 gives a visual summary of the range of the variable correlation across all 60 wells. The figure shows that a range of correlation larger than 0.1 is observed in 27 wells, indicating that an apparent change (potentially a drug uptake) occurred during the experiment. We expect the change point analysis to be able to detect those changes.


Figure 2: The ranges of correlation across the 60 wells. Several wells notably do not have a wide range of values, indicating a less active experiment over time in these wells.

A similar characteristic is also found for the ranges of Laplacian variance and sum (for figures, see Appendix B). Comparing the figures for the three variables, the wells with a larger range in one variable correspondingly have larger ranges in the other two variables, as we expected, because all three variables respond to the change of state in an image.

4.2 Prominent Variable Detection for the LNP Dataset

As mentioned before, the variable with the earliest change point among all variables is referred to as the prominent variable. The goal of this chapter is to use non-parametric change point analysis to detect the prominent variable in the 60 wells of the LNP experiment. As we learned from the simulation study, the agglomerative algorithm is the better choice for the LNP dataset, since it performs better than its divisive counterpart on datasets with fewer time points, which is exactly the case here. In this application, the estimates are therefore given solely by the agglomerative algorithm under the settings α = 1 and no starting clustering.

By applying the agglomerative algorithm, the change points are estimated for each variable; comparing the estimated first change points of the three variables then reveals the prominent variable.

4.3 Results

Table 5: Part of the detection results for the wells. (CP = Change Point)

FOV    Well   1st CP (Correlation)   1st CP (LaplacianVar)   1st CP (Sum)   Prominent Variable
F001   C02    32                     29                      44             Laplacian Variance
F001   C03    28                     25                      25             LapVar and Sum
F001   C06    21                     31                      31             Correlation
F001   G06    33                     37                      25             Sum

After applying the change point analysis with the agglomerative algorithm to the LNP experiment data, correlation is found to be the prominent variable in 29 out of 60 wells, with the first change points located between time points 21 and 37. In the wells that have correlation as the prominent variable, the first change points are most frequently found between time points 20 and 30. Table 5 shows part of the detection results.

Laplacian variance is found to be the prominent variable in 35 out of 60 wells, with the first change points located between time points 25 and 66. In the wells where Laplacian variance is the prominent variable, the first change point mostly happens later than the 30th time point.

Finally, sum is observed as the prominent variable in 36 out of 60 wells, with the first change points found between time points 25 and 66. In the wells where sum is the prominent variable, the distribution of the first change points largely resembles that of Laplacian variance, mostly falling after the 30th and before the 40th time point.

It is worth noticing that multiple variables serve as "co-prominent" variables in a number of wells, intuitively because the three variables are all measures of state change of the same subjects. This "co-prominent" phenomenon is predominantly observed between Laplacian variance and sum, while it is rarely, if ever, observed between correlation and the other two variables.

Generally, across all wells, correlation is most likely to be detected as the sole prominent variable: 18 wells have correlation as the sole prominent variable, accounting for the majority of the 27 wells that have a sole prominent variable.


(a) Well C06 at time point 20 (b) Well C06 at time point 22

Figure 3: A comparison of well C06 before and after the detected first change point ($\hat{t} = 21$). The images were brightened for presentation purposes.


As a check, we take well C06 as an example, where the earliest change point is at time point 21 with correlation as the prominent variable. Figure 3 and Figure 4 show comparisons of well C06 (images and correlation values) before and after the first change point ($\hat{t} = 21$).


5 Discussion

In this thesis, the suitability of two algorithms for non-parametric change point analysis was explored by simulation. The simulation study shows that the agglomerative algorithm gives better prominent variable detection accuracy than the divisive algorithm when the number of time points in the dataset is small; the divisive algorithm performed worse, with a sizable gap to its agglomerative counterpart. In our simulation study, the agglomerative algorithm performs best when no starting clustering is pre-specified. Although the estimation process of the agglomerative algorithm could benefit from a reduction in computational time by specifying a certain starting clustering, doing so requires great familiarity with the experiment to avoid sacrificing detection accuracy.

The simulation study also shows that when time points become abundant (increasing to 500), the performance of both algorithms improves greatly, and both can achieve very high detection accuracy, up to 96%. In this case, the divisive algorithm becomes worth choosing because of its simplicity and high performance. In the LNP experiment dataset, the number of time points per well is far from abundant, which eventually led us to choose the agglomerative algorithm for the analysis.

By applying the non-parametric change point analysis to the LNP experiment dataset, correlation is detected as the prominent variable in most of the wells. The changing patterns of Laplacian variance and sum largely resemble each other, so it can be argued that the two variables differ little in how they react to state changes in the LNP experiment.

5.1 Limitations and Future Works

The simulation study is subject to some sources of error, such as the lack of simulations with other numbers of time points. In this thesis, we only simulated data with 88 and 500 time points, so simulations with other numbers of time points are highly suggested for future research. Moreover, the simulation only covered the case with three true change points; simulations with other numbers of true change points are also worth exploring. Another source of error is the validity of the MCAR (Missing Completely At Random) assumption: if it is violated, the estimates of the change points can be biased. A check of the validity of the MCAR assumption will be needed in future work.


Even though the agglomerative algorithm provides an arguably satisfactory detection accuracy (close to 65%) when there is not a sufficient number of time points in the data, the small number of time points in the experiment will still bottleneck further performance improvements. The simulation study with 500 time points shows the prospect of improving the detection accuracy, so we advise the scientists at AstraZeneca to consider taking images more frequently during the imaging process in the future.

The change point analysis could also be applied multivariately over the three variables to potentially boost the detection performance, but we have yet to explore that possibility.

6 Acknowledgements

Upon the completion of this thesis, I sincerely express my gratitude to Salman Toor for helping me and introducing me to the project. I am grateful to Phil Harrison and Håkan Wieslander for their help regarding features and images. A special thanks goes to Philip Fowler for the helpful discussions on the method and simulation. I also would like to thank Alan Sabirsh at AstraZeneca for providing the data and a walk-through of the experiments in our Zoom meetings. Finally, I would like to thank the whole HASTE group for having me in the past semester; it has been a pleasure to take part in the project.


References

Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s Journal of Software Tools.

Forslid, G., Wieslander, H., Bengtsson, E., Wählby, C., Hirsch, J., Stark, C. R., & Sadanandan, S. K. (2017, Oct). Deep convolutional neural networks for detecting cellular changes due to malignancy. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 82-89. doi: 10.1109/ICCVW.2017.18

James, N., & Matteson, D. (2015). ecp: An R package for nonparametric multiple change point analysis of multivariate data. Journal of Statistical Software, 62(7), 1-25. doi: 10.18637/jss.v062.i07

Kuhn, M. (2020). caret: Classification and regression training [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=caret (R package version 6.0-86)

Matteson, D. S., & James, N. A. (2014). A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association, 109(505), 334-345. doi: 10.1080/01621459.2013.849605

McCulloh, I., & Carley, K. M. (2008, March). Social network change detection. Retrieved from https://ssrn.com/abstract=2726799

Pech-Pacheco, J. L., Cristobal, G., Chamorro-Martinez, J., & Fernandez-Valdivia, J. (2000). Diatom autofocusing in brightfield microscopy: a comparative study. In Proceedings of the 15th International Conference on Pattern Recognition (ICPR), 3, 314-317.

R Core Team. (2018). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/

Sobel, I. (2014, 02). An isotropic 3x3 image gradient operator. Presentation at Stanford A.I. Project 1968.

Szekely, G., & Rizzo, M. (2005, 02). Hierarchical clustering via joint between-within distances: Extending Ward's minimum variance method. Journal of Classification, 22, 151-183. doi: 10.1007/s00357-005-0012-9

Talih, M., & Hengartner, N. (2005). Structural learning with time-varying components: tracking the cross-section of financial time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3), 321-341. doi: 10.1111/j.1467-9868.2005.00504.x

Taylor, W. (2000, 03). Change-point analysis: A powerful new tool for detecting changes. Deerfield, IL: Baxter Healthcare Corporation.

Van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., & Yu, T. (2014). scikit-image: image processing in Python. PeerJ, 2, e453. doi: 10.7717/peerj.453


A Code Availability

The R scripts used for the simulation and the analysis can be found on GitHub in the repository https://github.com/HongruZhai/MasterThesis

B Complementary Data Descriptions


C Measures of the Change

The application starts with selecting variables that measure the change over time in the images. With the modern development of computer vision techniques, libraries that can capture characteristics of images have been developed and are actively maintained.

Previously, pharmacologists at AstraZeneca found that sharper edges (the parts of the image that are more defined and distinguishable) appearing in the images indicate that GFP expression is starting (Forslid et al. 2017). Motivated by this characteristic and to utilize the prior information, a measure that can effectively capture the edges in the image should be selected.

Among the image processing tools, OpenCV (Bradski 2000) and scikit-image (Van der Walt et al. 2014) stand out as two popular computer vision libraries that offer a range of options for calculating variables. Three variables that capture the changes in an image, namely the correlation, the Laplacian variance and the summation of pixel intensities, are calculated with functions from the OpenCV and scikit-image libraries. In our case, the three measures act as different ways of summarising the information in an image. Brief descriptions of the unique features of these three measures are given in the following sections.

C.1 Laplacian Variance

In image processing, Laplacian variance is a focus measure mostly used in cases such as blur detection and edge detection. The basic idea is that when an image is blurred, fewer edges are likely to be present, and the lower contrast between adjacent areas of the image leads to a lower Laplacian variance (Forslid et al. 2017). Figure 7 exemplifies the "fewer edges vs. more edges" comparison.

The OpenCV function Laplacian calculates the second order derivatives of the intensity function of the image using the Sobel operator (Sobel 2014). The image is convolved with a 3×3 kernel matrix, and the variance of the resulting response gives the Laplacian variance. The formula can be found in Pech-Pacheco, Cristobal, Chamorro-Martinez, and Fernandez-Valdivia (2000).

Laplacian variance can capture the edges in the images from the LNP experiment, and it also accounts for local changes in an image, not only changes in the image as a whole. In the OpenCV library, the Laplacian variance is calculated with the function cv2.Laplacian().
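The thesis computes this measure with OpenCV in Python; as a rough base-R sketch of the idea (not the OpenCV implementation), one can convolve the image with a 3×3 Laplacian kernel and take the variance of the response:

    # Laplacian variance of a grayscale image given as a numeric matrix.
    lap_var <- function(img) {
      k <- matrix(c(0,  1, 0,
                    1, -4, 1,
                    0,  1, 0), nrow = 3, byrow = TRUE)  # Laplacian kernel
      n <- nrow(img); m <- ncol(img)
      resp <- matrix(0, n - 2, m - 2)
      for (i in 2:(n - 1)) for (j in 2:(m - 1))
        resp[i - 1, j - 1] <- sum(img[(i - 1):(i + 1), (j - 1):(j + 1)] * k)
      var(as.vector(resp))
    }

    sharp  <- matrix(rep(c(0, 1), 50 * 25), 50, 50)  # many intensity edges
    blurry <- matrix(0.5, 50, 50)                    # no edges at all
    lap_var(sharp) > lap_var(blurry)                 # TRUE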


C.2 Correlation

The variable correlation is calculated from the gray-level co-occurrence matrix (GLCM). In image processing, the co-occurrence matrix records how often a pixel with gray-level (grayscale intensity) value i occurs horizontally adjacent to a pixel with value j. The correlation property of the GLCM can be used as a measure of image texture. Similar to the usual notion of correlation in mathematics, its value ranges from -1 to 1. It measures how correlated a pixel is to its neighbor over the whole image.

In the LNP images, a large value of correlation indicates that adjacent intensity values tend to be of a similar scale, making it less likely that there are relatively many edges. When GFP expression begins to accelerate, the correlation value will be subject to change. In the scikit-image library, the function normalizes the gray-level co-occurrence matrix to sum to 1 before computing the correlation. The correlation also accounts for local changes in the image.
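The thesis computes this property with scikit-image; the base-R sketch below follows the standard GLCM correlation definition for horizontally adjacent pixels and is illustrative only:

    glcm_corr <- function(img, levels = 8) {
      q <- matrix(cut(img, levels, labels = FALSE), nrow(img))  # quantize intensities
      P <- matrix(0, levels, levels)
      for (i in 1:nrow(q)) for (j in 1:(ncol(q) - 1))           # horizontal neighbors
        P[q[i, j], q[i, j + 1]] <- P[q[i, j], q[i, j + 1]] + 1
      P <- P / sum(P)                                           # normalize to sum to 1
      ii <- row(P); jj <- col(P)
      mu_i <- sum(ii * P); mu_j <- sum(jj * P)
      s_i  <- sqrt(sum((ii - mu_i)^2 * P))
      s_j  <- sqrt(sum((jj - mu_j)^2 * P))
      sum((ii - mu_i) * (jj - mu_j) * P) / (s_i * s_j)
    }

    set.seed(5)
    smooth <- matrix(rep(seq(0, 1, length.out = 50), each = 50), 50, 50)
    noisy  <- matrix(runif(2500), 50, 50)
    glcm_corr(smooth) > glcm_corr(noisy)  # smoother texture, higher correlation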

C.3 Summation of the Pixel Intensities

Another measure that summarises the information in an image is the summation of the pixel intensities. Unlike the previous two, the summation only detects "global" change in the image, so one disadvantage is that if changes happen in opposite directions in different parts of the image, the summation may incorrectly indicate that no change is happening.
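A toy illustration of this failure mode:

    img_before <- matrix(c(1, 0,
                           0, 1), nrow = 2, byrow = TRUE)
    img_after  <- matrix(c(0, 1,
                           1, 0), nrow = 2, byrow = TRUE)  # every pixel changed...
    sum(img_before) == sum(img_after)                      # ...but the sum sees no change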

Although it lacks the ability to capture regional changes, the summation is included to portray the total visible objects in the image. The measure simply takes the sum of the intensities of every pixel in the image. The summation of the pixel intensities is capable of monitoring the change in GFP expression, but due to its lack of regional detection capacity, it can be less sensitive to minor changes in the image than the other two measures considered in the LNP experiment.


(a) Fewer Edges (b) More Edges

(c) Fewer Edges (d) More Edges

Figure 7: A comparison of images with fewer edges and images with more edges. (a) and (c) show images from an earlier phase of the experiment, while (b) and (d) show images from a later phase. The images were brightened for presentation purposes.
