General image classifier for fluorescence microscopy using transfer learning

UPTEC F 19033

Degree project, 30 credits, June 2019

Håkan Öhrn

Faculty of Science and Technology, UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address: Box 536, 751 21 Uppsala

Telephone: 018 – 471 30 03

Fax: 018 – 471 30 00

Website: http://www.teknat.uu.se/student

Abstract

Modern microscopy and automation technologies enable experiments which can produce millions of images each day. The valuable information is often sparse, and requires clever methods to find useful data. In this thesis a general image classification tool for fluorescence microscopy images was developed using features extracted from a general Convolutional Neural Network (CNN) trained on natural images. The user selects interesting regions in a microscopy image and then, through an iterative process, using active learning, continually builds a training data set to train a classifier that finds similar regions in other images. The classifier uses conformal prediction to find samples that, if labeled, would most improve the learned model, as well as to specify the frequency of errors the classifier commits. The results show that with an appropriate choice of significance level one can reach a high confidence in true positives. The active learning approach increased the precision, with the downside of finding fewer examples.

Examiner: Tomas Nyberg
Subject reader: Carolina Wählby
Supervisor: Håkan Wieslander

Summary

Technological development has made it possible to collect large amounts of microscopy data, but a large part of the data collected is uninteresting. Smart methods are therefore required to single out relevant information. The purpose of this degree project was to develop an application that enables a user, starting from an overview image, to select interesting regions in an image. The program should then identify similar regions in other images.

Research has shown that pre-trained Convolutional Neural Networks (CNNs) can be used to find relevant features in new images. CNNs are designed to imitate how animals perceive images and objects. Neurons in the visual cortex of animals respond to a specific region of the visual field. CNNs are similarly built from layers containing neurons. The neurons are connected in different structures and consist of weights. These can then be trained to distinguish between different images. Training a CNN requires a large number of images. It has been shown that, regardless of the type of images a CNN is trained on, the first layers learn to identify lines and edges while the later layers identify more complex shapes and objects. If one does not have many images to train on, one can exploit the ability of pre-trained CNNs to find relevant features in images and then train a classifier on these features instead of on the original image.

To make the classifier as good as possible, one can use a method called active learning. The idea is to first train a model and then let a user classify the data that the model was most uncertain about. A new model is then trained with the new data included in the training data, and in this way the model improves. In this project it was particularly important to be sure that the regions the model classifies as positive were correct, so the user was also asked to check a selection of these. To see which data the model was uncertain about, a framework called Transductive Conformal Prediction (TCP) was used. TCP also makes it possible to specify the frequency of errors the model is allowed to make.

The results showed that the application could learn to recognise the interesting regions for different types of problems. However, more tests must be carried out to establish the general performance of the model.

Acknowledgement

This project is part of the Swedish Foundation for Strategic Research (SSF) project HASTE under the call Big Data and Computational Science.

I sincerely express my gratitude to my supervisor Håkan Wieslander, for your guidance and for helping me refine my report. I would also like to thank Ola Spjuth and Phil Harrison for introducing me to conformal prediction. In addition I would like to thank Johan Karlsson and Alan Sabirsh at AstraZeneca for providing data from the Yokogawa microscope as well as contributing to the discussion on method and solution. Finally I would like to thank my thesis reviewer, Carolina Wählby, for your guidance and feedback on my work.

1 Introduction

1.1 Background

Modern microscopy and automation technologies enable experiments which can produce millions of images each day [19]. The valuable information is often sparse, and requires clever methods to find useful data.

A common approach today is to use specialised software such as CellProfiler [2] that relies on hand-tuned segmentation and feature extraction to find specific features [18]. Another approach, taken in reference [23], is to transfer features learned from natural images to obtain morphological profiles for fluorescence microscopy images. This method obviates the need for single-cell identification and therefore does not require human input to tweak parameters.

In this thesis, we try to develop a general image classification tool for fluorescence microscopy images based on the method used in reference [23]. The application asks the user to select interesting regions in a microscopy image and then, through an iterative process, using active learning, continually builds a training data set to train a classifier that finds similar regions in other images.

Ideally, the classification output for this problem has a high confidence in true positives. High recall of positive samples is not prioritised, since generally only a few samples are required for an analysis. To produce results in accordance with this requirement, the classifier uses conformal prediction to output a prediction region instead of a single-label prediction. A prediction region contains the true value of an example with probability at least $1 - \epsilon$, where $\epsilon$ is some chosen significance level.

1.2 Purpose

The purpose of this master's thesis is to see if one can create a general image classification tool for fluorescence microscopy using transfer learning. This would be used to speed up the processing of microscopy images.

1.2.1 Goals

• Evaluate the ability of a CNN as a feature extractor to classify images using transfer learning with active learning. There will be a focus on achieving a high precision for the positive class.

• Create a simple user interface that makes it easy for a user to select regions of interest and with an iterative approach collect training data.

1.3 Delimitation

This project has been limited to only investigating one layer of a specific CNN, Resnet 18.

Also, a very simple active learning approach is used.

2 Theory

2.1 Fluorescence Imaging

Fluorescence microscopy of live cells has become an integral part of modern cell biology. The method relies on the absorption of light with a specific wavelength near the peak of the fluorophore excitation spectrum. This can be done using fluorescence-labelled proteins for live-cell microscopy, using fluorescence-labelled antibodies that bind to fixed cells, or by looking at the intracellular autofluorescence. The microscope camera filters away the excitation light, so that only light emitted by the fluorescent proteins inside the cells is captured [7]. The intensity values in the image represent the amount of fluorophore present in a specific area of the specimen and hold information about the spatial appearance and local concentration of fluorophores [28].

2.2 Feature Extraction and Reduction

2.2.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are similar to ordinary neural networks in that they are made up of neurons that have learnable weights and biases. The CNN architecture makes the assumption that the inputs are images. This allows certain properties to be encoded that vastly reduce the number of parameters in the network [17]. Regular neural networks receive a single vector as input and transform it through a series of hidden layers. The neurons in a layer of a regular neural network do not share any connections and, in contrast to those in a CNN, are completely independent of the other neurons in their layer.

CNNs emulate the visual processing system of animals. The individual neurons in the visual cortex respond to a specific overlapping region in the visual field. CNNs have a similarly built spatial architecture where a specific region in one layer is connected to a specific region in the next layer [4]. This means that the inputs to a neuron in layer $m$ come from a subset of units in layer $m - 1$, where those units have spatially contiguous receptive fields. This architecture allows the network to concentrate on low-level features in the first hidden layers, then combine them into higher-level features in the next hidden layer.

Images have similar hierarchical structure in the sense that small local simple features such as edges combine to create more complex features such as faces and trees. This similarity makes CNNs very successful at image recognition [12].

2.2.2 Preprocessing

Before sending an input image through a CNN the image should first be subject to normalisation. This is because a network struggles to modify its weights in a way that suits all ranges of pixel intensities [16]. When training using a pre-trained model the input image is expected to be normalised using specific values. The input image is loaded into a range of [0, 1] and then normalised using set values for mean and standard deviation.
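As a minimal sketch of this step, the snippet below scales 8-bit pixel values to [0, 1] and then standardises each channel. The mean and standard deviation are the values commonly used for networks pre-trained on ImageNet; the text does not state which values were used, so they are an assumption here, as are the function names.

```python
# Per-channel statistics commonly used with ImageNet-pretrained networks.
# Assumed for illustration; the text only says "set values".
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalise_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Map one 8-bit RGB pixel to [0, 1], then standardise each channel."""
    scaled = [value / 255.0 for value in rgb]
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]


def normalise_image(image):
    """Normalise an image given as rows of (R, G, B) tuples."""
    return [[normalise_pixel(pixel) for pixel in row] for row in image]
```

A black pixel maps to the negative mean divided by the standard deviation in each channel, and a white pixel to (1 - mean) / std.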

2.2.3 Augmentation of Images

Augmentation is a method used to reduce overfitting by synthetically enlarging the data set [20]. The basic idea of augmenting an image is to manipulate the image in a way that does not change the original label.

There are several transformation methods that can be used for augmentation, e.g. translation, mirroring and rotation. Generally one does not use interpolation to augment microscopy images, because it alters the intensities. Therefore the maximum number of examples one can produce from a single image is 8, through rotation in steps of 90° and mirroring.
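The eight interpolation-free variants can be enumerated as four rotations by 90°, each with and without mirroring. The sketch below does this for a single-channel image stored as a list of rows; the function names are illustrative and not taken from the thesis code.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def mirror(image):
    """Flip a 2D image left to right."""
    return [list(row[::-1]) for row in image]


def augment(image):
    """Return the 8 rotated/mirrored variants of an image, original included."""
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)
        variants.append(mirror(current))
        current = rotate90(current)
    return variants
```

Because every variant only rearranges pixels, no intensity value is altered.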

2.2.4 Transfer Learning with CNN

The general idea of transfer learning is to use knowledge from an existing model, trained to perform a different task for which there is a lot of labeled training data, in a new task where there is little data.

One approach to transfer learning is to use a CNN to discover the best representation of one's problem, in other words to find the most relevant features for the problem. This means that instead of starting the learning process from scratch, learned features from a previous network can be transferred to a new problem.

To train a neural network from scratch you need a lot of data to avoid overfitting. When you do not have access to a large amount of data, transfer learning becomes useful, since you can build a solid machine learning model with comparatively little training data. This process only works if the features extracted from the neural net are general, meaning suitable for both the original and the new task [29].

The earlier layers of a CNN capture low-level image features such as edges, while higher convolutional layers capture more and more complex details, such as body parts and faces if the CNN is trained on people. The final fully connected layer of a CNN is specialised to solve the original task. For example, the last layer of Resnet18 [13] indicates which features are relevant to classify an image into one of its 1000 object categories. This last layer is only relevant to the original task. However, features contained in one of the final convolutional layers or early fully connected layers at the end of the CNN capture general information about how an image is composed and what combinations of edges and shapes it contains. For a new task, we can use the extracted general features of a pre-trained CNN and train a new model on these features [25].

Studies have shown that you can achieve state-of-the-art results when transferring features from higher layers in neural nets [6][30]. In [23] a generic CNN pre-trained on natural images was able to extract biologically meaningful features from microscopy images without the need to segment individual cells. The network was modified by cutting off the final classification layer and using the extracted features in a k-nearest neighbour algorithm.

2.2.5 Principal Component Analysis

Machine learning problems with a large number of variables not only make training slow but also make it harder to find a good solution. This problem is often referred to as the curse of dimensionality [12].

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. It retains the variation in the data set while reducing the number of variables. The idea is that it identifies the hyperplane that lies closest to the data, and then projects the data onto it [12].

Suppose that $x$ is a vector of $p$ variables and there is some correlation between them. The first step is to look for a linear function $\alpha_1' x$ of the elements of $x$ having maximum variance, where $\alpha_1$ is a vector of $p$ constants $\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1p}$, so that

$$\alpha_1' x = \alpha_{11}x_1 + \alpha_{12}x_2 + \ldots + \alpha_{1p}x_p = \sum_{j=1}^{p} \alpha_{1j}x_j. \tag{1}$$

The apostrophe means that the vector is transposed. The next step is finding a linear function $\alpha_2' x$ that is uncorrelated with $\alpha_1' x$. This process continues until a $k$th stage linear function $\alpha_k' x$ is found that has the maximum variance but is uncorrelated with all previous $\alpha_d' x$, where $d = 1, 2, \ldots, (k - 1)$. $\alpha_k' x$ is the $k$th principal component (PC). Up to $p$ PCs can be found, but most variation in $x$ can usually be captured by $m$ PCs where $m \ll p$.

To find the PCs, consider the case where the vector $x$ has a covariance matrix $\Sigma$. It turns out that for $k = 1, 2, \ldots, p$, the $k$th PC is given by $z_k = \alpha_k' x$, where $\alpha_k$ is an eigenvector of $\Sigma$ corresponding to its $k$th largest eigenvalue $\lambda_k$. If $\alpha_k$ is chosen to have unit length, $\alpha_k' \alpha_k = 1$, then $\mathrm{var}(z_k) = \lambda_k$. The derivation for finding the PCs can be found in [15].

Instead of arbitrarily choosing the number of dimensions to reduce the data set to, it is preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance. A common value is to retain 95% of the variation [12].
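Since the variance of the $k$th PC equals the eigenvalue $\lambda_k$, the fraction of variance retained by the first $m$ PCs is the sum of the $m$ largest eigenvalues divided by the sum of all eigenvalues. The sketch below picks the smallest such $m$ from a list of eigenvalues that is assumed to be already computed; the function name is hypothetical.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of principal components whose eigenvalues
    sum to at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return count
    return len(ordered)
```

For eigenvalues 6.0, 3.0, 0.8, 0.15 and 0.05, the first two components retain 90% of the variance and the first three retain 98%, so three components are kept at the 95% level.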

2.3 Classification

In this project a random forest classifier is used. The classifier is an enhanced variation of the decision tree, a simpler classifier. Comprehending random forests requires an understanding of decision trees. Therefore in this section we will first describe the decision tree algorithm before moving on to its variation.

2.3.1 Decision trees

Decision trees are a conceptually simple but powerful method. The basic idea is to partition the feature space into a set of regions and then assign a class to the samples in each one. To create a decision tree the algorithm needs to be able to automatically decide on the splitting variables, the split points and the topology of the tree. Let us say our data consist of $p$ inputs and a response for each of $N$ observations $(x_i, y_i)$ for $i = 1, 2, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$.

Suppose that the model has split the feature space into $M$ regions $R_1, R_2, \ldots, R_M$. Let the classification of observations in region $R_m$ be $k(m) = \arg\max_k \hat{p}_{mk}$, where

$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k) \tag{2}$$

and $N_m$ is the number of observations in region $R_m$.

To find the best splitting nodes a greedy approach is used. Starting at the root with all the data, consider a splitting variable $j$ and a split point $s$ that minimise the node impurity $Q_m(T)$. There are several impurity measures, but one commonly used is the Gini index [8]

$$\sum_{k \neq k'} \hat{p}_{mk}\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}). \tag{3}$$

The impurity measure functions as a score for how well a split separates the different classes. If the split divides the data into two daughter nodes where the classes are completely separate, the impurity measure will give a score of 0, while a split that produces two nodes with a 50/50 distribution of two classes will give a score of 0.5, the maximum of the Gini index for two classes.
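The Gini index of equation (3) can be evaluated directly from the class labels in a node. The sketch below also scores a split by weighting the two daughter nodes by their size, which is a common convention assumed here rather than something stated in the text.

```python
def gini(labels):
    """Gini index of one node: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(k) / n for k in set(labels)]
    return sum(p * (1.0 - p) for p in proportions)


def split_impurity(left, right):
    """Size-weighted Gini index of the two daughter nodes of a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n
```

A split into two pure nodes scores 0, while a split that leaves both nodes evenly mixed between two classes scores 0.5.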

For each splitting variable, the determination of the split point $s$ can be done quickly, and thus scanning through all of the inputs and determining the best pair $(j, s)$ is not computationally expensive. This splitting of the data set continues until some criterion is met. A common strategy is to grow the tree very large and stop the splitting when all classes are separated into different leaf nodes. The resulting tree is then pruned using some pruning method to reduce the risk of overfitting the training data [8].

2.3.2 Bagging

A major issue with decision trees is their high variance. A small change in the training data can result in a very different series of splits, making the trees unstable. This instability is due to the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it [9].

A solution to this problem is to generate several trees from the training data and average their output, a technique called bagging [10]. To be able to produce several trees from the same data set one needs to use a general tool called bootstrapping. The idea is to randomly draw new data sets with replacement from the existing data set. Suppose we have a training set $Z = (z_1, z_2, \ldots, z_N)$ where $z_i = (x_i, y_i)$. We then draw with replacement from the training data a sample with the same size as the original training data. This is done $B$ times, creating $B$ bootstrap data sets [11]. For each bootstrap sample $Z^{*b}$, $b = 1, 2, \ldots, B$, we grow a tree. The bagging estimate is then defined as

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x) \tag{4}$$

in the regression case, or as a majority vote in the case of classification. Each tree generated in bagging is identically distributed. This means that the expectation of an average of $B$ such trees is the same as the expectation of any one of them. Therefore the bias of the bagged trees is the same as that of the individual trees, and the only means of improvement is through variance reduction. In bagging you reduce the variance of the average

$$\mathrm{Var}\left[\hat{f}_{\mathrm{bag}}\right] = \rho\sigma^2 + \frac{1 - \rho}{B}\sigma^2 \tag{5}$$

where $\rho$ is the positive pairwise correlation and $\sigma^2$ the variance. As $B$ becomes larger the second term disappears. This means that it is the correlation between the trees that becomes the dominating factor [9].
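Two small helpers illustrate the ideas above: drawing a bootstrap sample of the same size as the training set, and evaluating the variance of the bagged average in equation (5) for a given number of trees. The function names and the use of Python's `random.Random` are choices made for this sketch only.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_variance(rho, sigma2, n_trees):
    """Variance of an average of n_trees identically distributed trees
    with pairwise correlation rho and individual variance sigma2."""
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2
```

With a pairwise correlation of 0.5 and a variance of 2.0, a single tree gives a variance of 2.0, while 1000 trees give about 1.001, close to the floor of 1.0 set by the first term.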

2.3.3 Random forest

Random forests are a modification of bagging. The basic idea is to decorrelate the bagged trees and then average their output. The decorrelation works by randomly picking $m$ variables as candidates for each split, where $m < p$ and $p$ is the number of variables [1]. This process reduces the first term in equation 5 and therefore reduces the variance of the average.

Typical values for $m$ are $\sqrt{p}$, but sometimes as small as 1. When the number of variables is large, but the fraction of relevant variables small, random forests are likely to perform poorly with small $m$. This is because the chance of a relevant variable being selected for a split becomes small. Note that as $m$ becomes smaller the bias becomes larger [9]. The generalisation error for forests converges to a limit as the number of trees in the forest becomes large [1]. A benefit of random forests is that the method requires very little tuning [9]. The algorithm for random forest classification is as follows:

1. For $b = 1$ to $B$:

   (a) Draw a bootstrap sample $Z^{*}$ of size $N$ from the training data.

   (b) Grow a random-forest tree $T_b$ on the bootstrapped data by recursively repeating the following steps for each node of the tree until the minimum node size $n_{min}$ is reached:

       i. Select $m$ variables at random from the total $p$ variables.
       ii. Select the best variable/split point among the $m$.
       iii. Split the node into two daughter nodes.

2. Output the ensemble of trees $\{T_b\}_1^B$.

3. To make a prediction for a new data point $x$: let $\hat{C}_b(x)$ be the class prediction of the $b$th random-forest tree. Then $\hat{C}_{rf}^B(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_1^B$.
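The steps above can be sketched in a deliberately simplified form where every tree is a single split (a stump), so steps i to iii are carried out once per tree instead of recursively. This is an illustration under that assumption, not the classifier used in the thesis; all names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index of a non-empty list of labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def fit_stump(X, y, features):
    """Best single split (feature, threshold, left label, right label)
    among the given candidate features."""
    majority = Counter(y).most_common(1)[0][0]
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2.0
            left = [label for row, label in zip(X, y) if row[j] <= s]
            right = [label for row, label in zip(X, y) if row[j] > s]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, s,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no split possible: always predict the majority class
        return (0, float("inf"), majority, majority)
    return best[1:]


def fit_forest(X, y, n_trees=25, m=1, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample with m random features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # step (a): bootstrap sample
        features = rng.sample(range(p), m)           # step i: m of p variables
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict(forest, x):
    """Step 3: majority vote over the class predictions of all trees."""
    votes = [left if x[j] <= s else right for j, s, left, right in forest]
    return Counter(votes).most_common(1)[0][0]
```

A full implementation would grow each tree recursively until the minimum node size is reached and would redraw the $m$ candidate variables at every node.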

2.4 Conformal Prediction

Conformal prediction is a method that allows us to make reliable predictions in the sense that one can specify the frequency of errors the classifier commits [27]. It outputs a prediction region instead of a single label classification prediction like standard prediction algorithms, such as random forest or SVM.

A prediction region is a set $\Gamma^\epsilon$ that contains the true value of an example with probability at least $1 - \epsilon$, where $\epsilon$ is some chosen significance level. For a binary classifier with the two classes A and B, the prediction region could be any of the following sets: {A}, {B}, {A, B}, meaning that the example is assigned to both classes, or {null}, meaning that the example belongs to none of the classes [22].

Let us say that we are given a training set of examples $(x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$ where $x_i \in X$ is a vector of attributes and $y_i \in Y$ is a label out of a finite set of possible labels. The goal is to find $y_n$ for a vector of attributes $x_n$. The combination of label and example is an object $z_i = (x_i, y_i) \in Z = X \times Y$. A nonconformity measure is a set of measurable mappings $\{A_n : n \in \mathbb{N}\}$ of type $A_n : Z^{(n-1)} \times Z \to (-\infty, +\infty]$, where $Z^{(n-1)}$ is the set of all bags of elements of $Z$ of size $n - 1$. For each possible label $y$, we state the hypothesis that $y_n = y$, and the nonconformity measure assigns a nonconformity score $\alpha_i$ to every example $\{z_i, i = 1, \ldots, n\}$, including the new example, by evaluating the nonconformity

$$\alpha_i := A_n(\{\{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n\}\}, z_i), \quad i = 1, \ldots, n$$

between a set and its elements. The double braces $\{\{\ldots\}\}$ indicate that it is a multiset.

For each tested hypothesis $y_n = y$ we compare $\alpha_n$ to all the other $\alpha_i$'s and calculate the p-value

$$p(y) = \frac{|\{i = 1, \ldots, n : \alpha_i \geq \alpha_n\}|}{n}. \tag{6}$$

The p-value shows how well a new example with this label conforms with the rest of the sequence. A small p-value means that it is very nonconforming, in other words that the new example is different from previous examples. If the p-value is large (close to 1), then $z_n$ is very conforming.

The conformal predictor determined by the nonconformity measure $A_n$, $n \in \mathbb{N}$, and a significance level $\epsilon$ is then defined as a function $\Gamma : Z^{*} \times X \times (0, 1) \to 2^Y$, where $2^Y$ is the set of all subsets of $Y$, such that the prediction set $\Gamma^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n)$ is defined as the set of all labels $y \in Y$ such that $p(y) > \epsilon$ [27].

The prediction regions produced by a conformal predictor have the property of always being valid. This means that the frequency of errors it commits does not exceed $\epsilon$ at a chosen confidence level $1 - \epsilon$. This property holds under the randomness assumption, which means that the examples are independently drawn from the same distribution. Because validity is guaranteed, a conformal predictor is instead evaluated based on its efficiency. In classification problems, a measure of efficiency could be the number of prediction sets containing two or more labels [22].

How to compute the nonconformity score depends on the underlying classification algorithm. In this thesis a random forest algorithm is used. An intuitive method of scoring using random forest is presented in reference [5]. A random forest is constructed from a training set $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$. The conformity score of a new example $(x, y)$ is then equal to the percentage of correct predictions given for $x$ by the decision trees, that is, the fraction of trees that predict $y$. The conformity score is simply $1 - A(x, y)$, where $A(x, y)$ is the nonconformity measure.
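The pieces described so far can be put together in a small sketch: a forest-based nonconformity score, the p-value of equation (6), and the prediction set for a significance level $\epsilon$. The nonconformity scores of the training examples are assumed to be given; in a transductive setting they would be recomputed for every hypothesised label. The function names are hypothetical.

```python
def forest_nonconformity(tree_votes, label):
    """1 minus the fraction of trees that predict `label`."""
    return 1.0 - tree_votes.count(label) / len(tree_votes)


def p_value(train_scores, new_score):
    """Fraction of nonconformity scores (new example included) that are
    at least as large as the new example's score, as in equation (6)."""
    scores = list(train_scores) + [new_score]
    return sum(1 for a in scores if a >= new_score) / len(scores)


def prediction_set(p_values, epsilon):
    """Labels whose p-value exceeds the significance level epsilon."""
    return {label for label, p in p_values.items() if p > epsilon}
```

For example, with training scores 0.1, 0.2, 0.3 and 0.4 and a new score of 0.25, three of the five scores are at least 0.25, so the p-value is 0.6.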

This form of conformal prediction, where each new sample is compared to all previous examples, is called Transductive Conformal Prediction (TCP). It requires training a standard machine learning model for each class for every new test example. This makes the method very computationally demanding [22].

Another form is Inductive Conformal Prediction (ICP), where instead of comparing to all previous examples one sets aside a portion of the training data to form a calibration set [22].

2.5 Active learning

Most classification problems are solved with a random sampling approach. This means that the training samples used for learning are randomly chosen, which is called passive learning.

Active learning refers to algorithms that autonomously select the data points from which they will learn. The hypothesis is that if data for the training set is carefully selected, then the learning model will reach a better performance with fewer training examples [24][14].

Active learning is used in cases where large amounts of unlabeled data are available, but labels are in some way costly to obtain. An active learning algorithm finds data points that, if labeled, would most improve the learned model.

For such a model to work the learning algorithm must be able to send unlabeled data to an expert who acts as a source of labels for such instances. This allows the learner to create a training set of data points that would be most informative for the learning process.

2.5.1 Pool-based sampling

One way of incorporating active learning is through pool-based sampling. In that case all of the instances for which the learner queries the expert for labels come from existing unlabeled data. It is an iterative process where data points are first sampled from an input distribution, and then the most informative instances for learning are selected for labeling and added to the training set [21][14].

During active learning the method of evaluating which of the unlabeled data are informative is crucial. There are several strategies, such as uncertainty sampling [21] and query by committee [3]. When using an uncertainty sampling strategy in a pool-based manner it is beneficial to have a classifier that not only makes classification decisions but also estimates their certainty [21]. Predictions made by conformal predictors are a suitable candidate for such a task.

2.6 Model Evaluation

When evaluating a binary classification model, one can divide predictions into four categories, which can be seen in table 1.

Table 1: The different prediction categories.

                    Predicted positive     Predicted negative
Actual positive     True positive (TP)     False negative (FN)
Actual negative     False positive (FP)    True negative (TN)

With these four categories one can now calculate measures of how good the model's predictions are. Precision (7) is the fraction of samples predicted as positive that are actually positive. Recall (8) is the fraction of actual positive samples that are correctly classified as positive [26].

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
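Equations (7) and (8) translate directly into code. Returning 0.0 when the denominator is zero is a convention chosen for this sketch, not something specified in the text.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn) if tp + fn else 0.0
```

A classifier with 8 true positives, 2 false positives and 8 false negatives has a precision of 0.8 but a recall of only 0.5, which is the kind of trade-off this project accepts.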

3 Data

To evaluate the method two different data sets are used. The reason for using two data sets is to see whether the application can solve distinctly different image classification problems.

3.1 Data set with organoids

This data set depicts organoids imaged with the Yokogawa microscope. The original data set had 4 channels, but only 3 were used, since the input for the Resnet 18 CNN is a 3-channel image. The data set consists of 4 different images where some of the organoids contain red lines or web-like structures.

Figure 1: 4 fluorescence microscopy images of organoids.

3.2 Data set with cells

The data set depicts clusters of cells. In some of the images the cells have been exposed to a drug that changes their appearance, as seen in figure 2. In cells where no drug is present the stain resides in the cytoplasm; in other words, the cells have a green nucleus with a red area around it. When the drug is present the stain is instead found inside the nucleus.

This data set originally only contained 2 channels. To be able to use the images as an input for the CNN another channel was added containing all zeros. The data consist of a total of 80 images.
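Adding the empty channel amounts to appending a zero to every two-channel pixel. The sketch below assumes an image stored as rows of per-pixel tuples; this representation and the function name are illustrative only.

```python
def pad_to_three_channels(image):
    """Append zero-valued channels to every pixel until it has three."""
    return [[tuple(pixel) + (0,) * (3 - len(pixel)) for pixel in row]
            for row in image]
```

Pixels that already have three channels are returned unchanged.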

Figure 2: The right image shows cells where the drug is present, the left image shows a sample where there is no drug present.

4 Application structure

4.1 Overview

In this project a general image classification tool is developed. The idea is that a user selects interesting regions in an image, the classifier then finds regions in other images with the same characteristics.

The tool is structured so that the user initially selects positive and negative regions from a fluorescence-labeled microscopy image. Then a pool of samples is generated from all images using a sliding window algorithm. Afterwards an iterative process starts where a classifier is trained on the user-selected regions. New samples are presented to the user, showing both the regions that the classifier is most certain are positive and the regions whose class the classifier is most unsure about, and the user is asked to verify their class, creating a larger pool of training data. After x iterations the regions are grouped into positive, negative, both or neither, and the positions of the positive regions in the original image are stored.

Figure 3: The application flowchart, with three steps: choosing the initial training data, iterative learning, and final classification.

4.2 Choosing the initial training data

4.2.1 Selection of the positive and negative data points

In the first step the user is asked to choose an image in which to pick positive and negative regions. The image is shown to the user with a movable square placed in the image, as seen in figure 4. The user can then scale the square to an appropriate size and select areas that have the desired features. Once the first region is chosen the dimensions of the square are fixed. This is done so that the scaling information from the object is not lost. After a chosen number of regions has been selected, the user is required to select the same number of regions that do not have the required feature (negative class). This is done in the same fashion as when selecting the positive class.

Figure 4: Figure showing how regions are selected.

4.2.2 Preprocessing and feature extraction

To prepare the images for the CNN they are first preprocessed using normalisation. The selected regions are converted into 8 times more samples using the image augmentation techniques seen in figure 5 and scaled to 224 × 224, the desired input size for the CNN. Features are extracted using a pretrained Resnet 18 network. The network was modified by removing the final classification layer, so that the penultimate layer represented the feature embedding.

Figure 5: Original image mirrored and rotated.

4.3 Iterative learning

4.3.1 Sliding window algorithm

The original images are divided into several smaller regions using a sliding window algorithm. The basic idea is that a rectangular region of fixed width and height “slides” across the image, producing several smaller images that can be individually classified. The size of the rectangle is set by the dimensions of the rectangle from when the first positive region was selected. The stride can be manually adjusted but is shifted somewhat so that the whole original image is captured. Each extracted region is scaled to the required input size and fed through the modified CNN. The output is stored as a vector with 512 features. This step is only done once.
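A minimal sketch of the window placement is given below. How exactly the stride is shifted to cover the whole image is not specified in the text; here the last window along each axis is simply moved back so that it ends at the image border, which is an assumption, as are the function names and the use of a square window.

```python
def window_starts(length, window, stride):
    """Start offsets of windows along one axis. The last window is shifted
    back to end at the border so that the full axis is covered."""
    if window > length:
        raise ValueError("window is larger than the image")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def sliding_windows(height, width, window, stride):
    """(top, left) corners of all square windows covering the image."""
    return [(top, left)
            for top in window_starts(height, window, stride)
            for left in window_starts(width, window, stride)]
```

With an axis of length 10, a window of 4 and a stride of 4, the regular positions 0 and 4 would leave the last two pixels uncovered, so an extra window starting at 6 is added.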

4.3.2 Principal component analysis and classification

To reduce the number of features, PCA is used, capturing 95% of the variance. The resulting number of features differs somewhat depending on the training data. Next, the samples produced by the sliding window algorithm are classified using conformal prediction with an underlying random forest classification algorithm using 100 trees. This gives each sample 2 p-values, one per class, each indicating how well the sample conforms with the associated class.

4.3.3 Active learning step with pool based sampling

The data points that, if labeled, would most improve the learned model are the ones for which the classifier is most uncertain about which class they belong to. These are labeled by the user and added to the training data for the next iteration.

Since it is crucial that the classifier produces high confidence for true positives, samples that the classifier is confident belong to the positive class are presented to the user for verification.

To find the most certain positive samples, the application selects the 25 samples with the highest $T$-value, where $T = p_{\text{positive}} - p_{\text{negative}}$ and $p_{\text{class}}$ indicates how well a new example conforms with the training data belonging to that class.

These images are presented to the user as in figure 6, and the user then clicks the images that do not belong to the positive class. To find the samples where the model is most uncertain, the application selects the 25 samples with the smallest absolute value $|T|$. These images are presented to the user in the same fashion as the 25 most certain ones. Once the samples have been labeled, they are augmented and added to the training data.

Figure 6: The user clicks on the wrongly classified images. All the images are then added to the training data set under their corresponding label.
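The two selection rules of the active learning step can be sketched as below, taking T as the difference between the positive and the negative p-value. The function name and the example p-values are hypothetical, and the sketch does not check whether the two selections overlap, since the text does not say how the application handles that case.

```python
def select_for_labeling(p_values, n=25):
    """Pick samples for the user to verify.

    p_values: list of (p_positive, p_negative) pairs, one per sample.
    Returns two index lists: the n most certain positives (largest T)
    and the n most uncertain samples (smallest |T|).
    """
    t = [p_pos - p_neg for p_pos, p_neg in p_values]
    indices = range(len(t))
    most_certain = sorted(indices, key=lambda i: t[i], reverse=True)[:n]
    most_uncertain = sorted(indices, key=lambda i: abs(t[i]))[:n]
    return most_certain, most_uncertain


# Hypothetical p-values for five samples; n is reduced to 2 for readability.
pool = [(0.9, 0.1), (0.5, 0.45), (0.2, 0.8), (0.7, 0.1), (0.3, 0.5)]
most_certain, most_uncertain = select_for_labeling(pool, n=2)
```

Here samples 0 and 3 are the most certain positives, while samples 1 and 4 have nearly equal p-values for both classes and are therefore the most uncertain.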

4.4 Final classification

For the final classification, the p-values of each sample decide its class. If $p_{\text{class}} > \epsilon$, where $1 - \epsilon$ is the chosen confidence, the sample is deemed to be of that class. A sample can therefore be assigned to the positive class, the negative class, neither or both. Figure 7 shows a scatter plot of p-values together with the different prediction regions. As $\epsilon$ increases, the classifier will try for a narrower prediction region. This means that the no class region (yellow) becomes larger while the both class prediction region (blue) becomes smaller. The single-class regions also become larger as long as $\epsilon < 0.5$.

Figure 7: The different prediction regions in a p-value scatter plot for $\epsilon = 0.2$. Samples in the green region are classified as positive and samples in the red region as negative, while samples in the yellow region belong to no class and samples in the blue region belong to both classes.
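The decision rule for the four prediction regions can be written out as a short sketch. The function name and the example p-values are illustrative only; the rule itself is the threshold test on each p-value described above.

```python
def prediction_region(p_positive, p_negative, epsilon):
    """Return the prediction region of a sample at significance epsilon."""
    in_positive = p_positive > epsilon
    in_negative = p_negative > epsilon
    if in_positive and in_negative:
        return "both"
    if in_positive:
        return "positive"
    if in_negative:
        return "negative"
    return "none"


# Hypothetical (p_positive, p_negative) pairs.
samples = [(0.9, 0.1), (0.3, 0.25), (0.15, 0.1), (0.05, 0.6)]
regions_at_02 = [prediction_region(p, q, epsilon=0.2) for p, q in samples]
regions_at_04 = [prediction_region(p, q, epsilon=0.4) for p, q in samples]
```

The second sample shows the effect of the significance level: at 0.2 both p-values exceed the threshold and it lands in the both class region, while at 0.4 neither does and it moves to the no class region.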

5 Experiments

To evaluate the method, a series of tests were conducted. To investigate how well the method generalises, three different image classification problems were evaluated using two different data sets. A test was also conducted without active learning. The tests were conducted with three iterations of the active learning step. The application would normally include the regions that the user labels manually during the iterative learning process in its final result. For the purpose of these tests, those regions were not included in the result.

5.1 Organoid data set

Figure 1 shows four different microscopy images of organoids. The objects in the images are quite different from one another, but all images feature some organoids with veins in them, visible as red lines or a web-like structure. The first three images were used for training and the last image for testing.

5.1.1 Object detection

The first experiment was conducted to try to simply identify the organoids. If the organoid is visible in the sample image it is defined as a positive, even if only a small part of the image contains the object.

5.1.2 Finding organoids with veins

The next experiment was to find the veins in the organoids. Visually they look like red lines or web-like structures, as in figure 8a.

Figure 8: (a) Example of positive class. (b) Example of negative class.

5.1.3 Finding veins without active learning

To evaluate the effect of the active learning approach on performance, a test without active learning was conducted. Instead of selecting training data with TCP at each step, an ordinary random forest was used to pick 25 positively and negatively classified images for the user to verify at each iteration. In the final step, TCP was used to classify the images.

5.2 Finding cells not affected by a drug

A classification problem using the data set containing cells was also evaluated. Samples containing cells that are not affected by the drug, where the stain resides in the cytoplasm, were defined as the positive class. Eight images were used to train the classifier, half of which contained cells affected by the drug. Seventy-two images were used to test the classifier, half of which contained cells affected by the drug.

6 Results

6.1 Object detection

Figure 9 shows the result of trying to find the organoids in the organoid data set. The two top graphs show how the positive and negative samples are distributed over the different prediction regions as the significance $\epsilon$ changes. The two bottom graphs show the precision and recall for the positive class.

Figure 9: Plots showing the precision and recall as well as how positive and negative test samples are distributed over the different prediction regions depending on $\epsilon$ when trying to find organoids in the image.
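For reference, precision and recall for the positive class can be computed from the prediction regions as in the sketch below. It assumes that a sample counts as a predicted positive only when it falls in the positive-only region, so that samples in the both class and no class regions count as missed positives. The text does not state this convention explicitly, and the data are invented for illustration.

```python
def precision_recall(regions, is_positive):
    """Precision and recall for the positive class.

    regions: predicted region name per sample ("positive", "negative", "both", "none").
    is_positive: True for samples whose true class is positive.
    Assumption: only the "positive" region counts as a positive prediction.
    """
    pairs = list(zip(regions, is_positive))
    true_pos = sum(1 for region, truth in pairs if region == "positive" and truth)
    false_pos = sum(1 for region, truth in pairs if region == "positive" and not truth)
    actual_pos = sum(1 for truth in is_positive if truth)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else float("nan")
    recall = true_pos / actual_pos if actual_pos else float("nan")
    return precision, recall


# Invented example: four true positives, two true negatives.
regions = ["positive", "none", "positive", "negative", "both", "negative"]
is_positive = [True, True, True, False, True, False]
precision, recall = precision_recall(regions, is_positive)
```

In this invented example every positive prediction is correct (precision 1.0) although half of the true positives are missed (recall 0.5), the kind of trade-off the plots make visible.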

Figure 10 shows the distribution of p-values for the positive and negative samples when trying to find organoids.

Figure 10: The p-values for the positive and negative samples when trying to find organoids in the image.

6.2 Finding organoids with veins

Figure 11 shows the result of trying to find the organoids with veins in the organoid data set. The two top graphs show how the positive and negative samples are distributed over the different prediction regions depending on $\epsilon$. The two bottom graphs show the precision and recall for the positive class. Figure 13 shows the images in the positive prediction region for a fixed significance $\epsilon$.

Figure 11: Plots showing the precision and recall as well as how positive and negative test samples are distributed over the different prediction regions depending on $\epsilon$ when trying to find organoids with veins.

Figure 12 shows the p-values for the positive and negative samples when trying to find organoids with veins in them.

Figure 12: The p-values for the positive and negative samples when trying to find organoids with veins in the image.

Figure 13 shows the samples in the positive prediction region for a fixed significance.

Figure 13: Examples of positively classified samples for $\epsilon = 0.2$ after three iterations.

6.2.1 Without active learning

Figure 14 shows the result of trying to find organoids with veins without using active learning. The two top graphs show how the positive and negative samples are distributed over the different prediction regions depending on the significance $\epsilon$. The two bottom graphs show the precision and recall for the positive class.

It takes about 4.5 min to classify 572 images without active learning on a MacBook Pro with a 2.3 GHz Intel Core i5. Using TCP at every step takes about three times as long.

Figure 14: Plots showing the precision and recall for the positive class as well as how positive and negative test samples are distributed over the different prediction regions depending on $\epsilon$ when trying to find organoids with veins without using active learning.

Figure 15 shows the p-values for the positive and negative samples when trying to find organoids with veins in them when not using active learning.

Figure 15: The p-values for the positive and negative samples when trying to find organoids with veins in the image without active learning.

6.3 Finding cells not affected by a drug

Figure 16 shows the result of trying to find the cells not affected by the drug in the cell data set. The two top graphs show the distribution of the positive and negative samples over the different prediction regions depending on the significance $\epsilon$. The two bottom graphs show the precision and recall for the positive class.

Figure 16: Plots showing the precision and recall for the positive class as well as how positive and negative test samples are distributed over the different prediction regions depending on $\epsilon$ when trying to find cells not affected by the drug.

Figure 17 shows the p-values for the positive samples and figure 18 shows the p-values for the negative samples when trying to find cells not affected by the drug.

Figure 17: The p-values for the positive samples when trying to find cells not affected by the drug.

Figure 18: The p-values for the negative samples when trying to find cells not affected by the drug.

7 Discussion

7.1 Application and user interaction

The application is easy to use in the sense that it only requires the user to mark regions with and without the features of interest and to click on negative regions as they are shown. Having the user click only the samples that lack the required features minimises the amount of user input at each iteration, because the samples the classifier was unsure of tended to belong to the positive class.

As an alternative to having the user initially mark negative regions, an interaction that the user might find particularly tedious, the application could randomly select regions of the image that the user has not marked as positive, thus requiring no user input.

The downside of this approach is that the user must select all the positive regions in the image, to avoid a positive region being selected as a negative region.

A downside of using TCP in the application is that it is very computationally demanding, making it slow to run on a standard computer. TCP is, however, parallelizable, which makes it possible to reduce the computation time significantly. A benefit of choosing TCP over other conformal prediction methods such as ICP is that it does not require a large data set. ICP would require a substantial portion of the labeled data to be set aside and is therefore not suitable for this application.

One way of improving the speed of the classification would be to reduce the number of iterations and instead let the user label more than 25 images at each iteration. This would increase the burden on the user but would make each iteration collect more training data.

7.2 Organoid data set

7.2.1 Organoid detection

In figure 9 we can see that as $\epsilon$ becomes larger, a portion of the positive samples is placed in the no class prediction region. When $\epsilon$ becomes larger, the classifier tries for a narrower prediction set, which results in samples being placed in the no class prediction region. During the training process one could see that samples where only a small part of the image contained an organoid had low p-values for both the positive and the negative class. It is mainly these samples that cause a substantial portion of the positive samples to be placed in the no class prediction region.

The precision of the classifier seems to be consistently around 1 regardless of $\epsilon$.

7.2.2 Finding organoids with veins

As $\epsilon$ becomes larger in figure 11, more samples are moved from the both class prediction region to the positive, negative and no class prediction regions. When $\epsilon$ becomes larger, the classifier tries for a narrower prediction set, which makes it move samples from the both class prediction region to the other prediction regions.

The classifier tends to miss many of the positive samples, which are instead placed in the no class prediction region. One reason could be that the desired feature looked quite different in the different images. Of the three images used to collect training data, two looked quite different from the test image, as seen in figure 1. The organoids in two of the images are of a different colour and their veins are much thinner and blurrier compared to the other two.

The precision is fixed at 1 throughout the whole span of $\epsilon$. In many applications the focus can be put on maximising precision for the positive class, and false negatives can be disregarded, because generally only a few samples are required for an analysis. Therefore the results should still be considered good, even though the classifier misses a large portion of the positives.

The classifier found it easier to identify organoids than to identify specifically organoids with veins in them. This can clearly be seen when comparing figures 10 and 12. In figure 10 the samples align along either the positive or the negative axis, while in figure 12 the positive samples are more spread out. This could be because the features that separate organoids from background are better represented in the extracted layer than the features that capture the red veins.

A note on the selection process: during training, the classifier presented the samples that it was most unsure of. These tended to be samples with very unclear veins in them.

7.2.3 The effect of active learning

When the classifier does not use active learning, one can see in figure 14 that it tends to find more positive samples than when using active learning. As $\epsilon$ increases, the number of positive samples placed in the both class prediction region is reduced, but the number of samples placed in the no class prediction region increases. The precision for the positive class also decreases somewhat as $\epsilon$ increases. The negative samples are almost exclusively placed in the negative class prediction region, independent of $\epsilon$.

When comparing figures 12 and 15, the samples tend to have low p-values when not using active learning. The reason for the classifier placing a larger number of positive samples in the positive prediction region when not using active learning could be that half of the samples the active learner presents to the user for labeling are chosen to make sure that the classifier is labeling the positive samples correctly. However, choosing the most confident positive regions as training data might not improve the classifier's ability to find new examples of positive regions as much as when random positive/negative samples are used.

The small improvement in performance might not warrant the significantly longer computation time. However, as the computation is parallelizable, this would not be a problem on a powerful enough CPU.

7.3 Cell data set

The classifier produced a result similar to the previous experiments when trying to find cells not affected by the drug. In figure 16 we can see that as $\epsilon$ increases, the number of positive samples in the both class prediction region is reduced, while the number of samples in the positive, negative and no class prediction regions increases. The precision for the positive class seems to be constant at 0.95 regardless of the value of $\epsilon$. As $\epsilon$ becomes larger, the recall for the positive class increases as well. Figures 17 and 18 show that the wrongly classified samples tend to have p-values < 0.5.

A difficulty with the data set is that cells tend to cluster, which makes it very difficult to select a region that captures individual cells. Instead, larger regions were selected, and if a region contained cells with the desired features it was labeled as positive. This might have made it more difficult for the classifier to find the correct features.

7.4 Conclusion

The main focus of this project was to create a general classifier that achieves high precision for the positive class. The results show that with an appropriate choice of $\epsilon$ one can reach a high confidence in true positives. The active learning approach increased the precision, with the downside of finding fewer examples. The ability to choose $\epsilon$ makes it possible to decide whether to find a few but most likely correctly classified positive examples, or a larger pool of samples where some will not have the desired features. However, when active learning was used, the choice of $\epsilon$ did not affect the precision.

7.5 Future work

The application still needs more evaluation to verify its generalisation performance. It would also be of interest to experiment with different layers for feature extraction. Late layers are beneficial when the objects one wants to classify are similar to the objects the CNN was originally trained to classify, whereas features found in the earlier layers express more basic structures such as edges and lines. There are also more advanced active learning approaches to evaluate.

To reduce the computation time, it could be interesting to start with TCP and, as the training data becomes larger, transition to another conformal predictor such as inductive conformal prediction.

