
UPTEC F11068

Degree project (Examensarbete), 30 credits, January 2012

Comparison of automated feature extraction methods for image based screening of cancer cells

Michael Brennan


Abstract

Comparison of automated feature extraction methods for image based screening of cancer cells

Michael Brennan

Image based screening is an important tool used in research for development of drugs to fight cancer. Phase contrast video microscopy, a cheap and fast image screening technology, enables rapid generation of large amounts of data, which in turn requires a fast method for analyzing that data. As videos contain a lot of redundant information, the difficulty lies in extracting usable information, in the form of features, from the videos by compressing the available information or filtering out redundant data. In this thesis, the problem is approached in an experimental fashion where three different methods have been devised and tested, to evaluate different ways to automatically extract features from phase contrast microscopy videos containing cultured cancer cells. The three methods considered are, in order: an adaptive linear filter, an on-line clustering algorithm, and an artificial neural network. The ambition is that the outputs from these methods can be used to create time-varying histograms of features that can be used in further mathematical modeling of cell dynamics. It is concluded that, while the results of the first method are not impressive and can be dismissed, the remaining two are more promising and are able to successfully extract features automatically and aggregate them into time-varying histograms.

Subject reader (Ämnesgranskare): Ewert Bengtsson. Supervisor (Handledare): Mats Gustafsson.

Contents

1 Introduction
2 Review of related work
3 Problem description and goals
  3.1 Goals
4 Theory
  4.1 Basic concepts
  4.2 Adaptive linear filters
    4.2.1 Matched filter
  4.3 Clustering
    4.3.1 On-line clustering
  4.4 Feedforward multilayer artificial neural networks
    4.4.1 The neuron
    4.4.2 The single layer perceptron
    4.4.3 Multilayer perceptron
  4.5 Principal component analysis
5 Material
6 Methods
  6.1 Adaptive linear filters
  6.2 Clustering
  6.3 Artificial neural networks
7 Results and analysis
  7.1 Adaptive linear filter
  7.2 Clustering
  7.3 Artificial neural networks
8 Conclusions and discussion
  8.1 The adaptive linear classifier
  8.2 The clustering algorithm
  8.3 The artificial neural network
  8.4 General discussion
  8.5 Conclusion

1 Introduction

Image based screening of cultured cells is a widely used technique in laboratories doing biomedical research. From microscopy images or videos, it is possible to study live cells and their behavior as they grow when perturbed with different kinds of drugs. In the search for effective drugs to fight cancer, it is a great asset to be able to study the cells in detail, where the ultimate goal is to develop drugs that will not harm normal healthy cells, but hinder cancer cells from continuing to grow.

At our research site when this thesis work was performed¹, a video capture of a growing cell culture normally runs for three days. This permits the cells to grow, after which the recorded videos can be carefully studied. The phase contrast microscope available is capable of scanning about 1500 different wells containing cells in the same run, each of which has a separate cell culture growing. This renders close to 53000 image frames, totaling about 60 gigabytes (GB) of data for a three-day period.

Clearly, since this throughput generates such a large amount of data, it is not a feasible option to let a person observe all of the videos obtained from the microscope. As the procedure to obtain video microscopy images is becoming increasingly cheaper and more automated, the requirement of human interaction in the evaluation of the images becomes a severe impediment to the total throughput. Thus, there is a corresponding need for automation of the identification of cells and their features in video microscopy images. With such a high throughput and an inexpensive way of obtaining microscopy images of cells, it becomes possible, for example, to easily try numerous different new drugs on real cancer cells and quickly see the results of the treated cells in computer images. But it requires replacing the human observer with a computer that can process all the data and filter out all unimportant information. The idea of using computers to interpret images used in biomedical applications is not new, but rather than searching for a pattern known in advance, the idea here is to exploit the low cost of the generation of data and focus on the search for new patterns and unexplored behavior of cancer cells.

Alas, automated identification of cells in phase contrast images is a difficult task. While it is easy for a person to quickly identify different types of cells by looking at an image, it is not trivial to program a computer to reliably and rapidly detect and identify cells in images. Several factors, such as variability in cell size and morphology, as well as background noise, contribute to the difficulty of the problem.

This project evaluates three different approaches for automatic feature extraction from phase contrast video microscopy: an adaptive linear filter, a clustering algorithm and an artificial neural network. All three methods make use of a sliding window that traverses the microscopy images and attempts to classify what is displayed in the sub-image extracted by the sliding window. The idea is then to use the output of such a classifier to generate time-varying histograms of all extracted features, and further use these as a basis for a mathematical model of the cell dynamics in the videos.

¹ Cancer Pharmacology & Computational Medicine, Dept of Medical Sciences, UU

This master thesis project was carried out during autumn 2011 as a part of the Engineering Physics programme at Uppsala University. The work was performed at Cancer Pharmacology & Computational Medicine, Dept of Medical Sciences, UU. My supervisor was Prof. Mats Gustafsson, and my examiner was Prof. Ewert Bengtsson, to both of whom I am very grateful for their support during this work.

2 Review of related work

Most research in the area of computer based cell detection in microscopy images involves either fluorescence images or phase contrast (or bright field) images. The absence of fluorescent markers in the latter makes them inherently more challenging to process, but similar techniques are still often used for both types of images.

Both artificial neural networks (ANN) and support vector machines (SVM) are common classifiers used to detect cells in images. For instance, in the fluorescence imaging case, Nattkemper et al. successfully used ANNs together with principal component analysis to detect lymphocytes in a high throughput screening system with fluorescence images [1], while Huang et al. used SVMs to capture sub-cellular information such as protein distribution in stained cells [2].

Research regarding automatic detection of cells in phase contrast images (or bright field images, which are similar) is also highly active. Long et al. [3] used a feedforward neural network for classification of cultured cells in bright field images. In [4] the same authors used SVMs on the same type of images to try to avoid over-fitting, which is a common problem with ANNs. In both approaches they ran the classifiers directly on the image by using a sliding window to extract features. This is to be compared with segmenting the image prior to classification, which, with the introduction of an extra processing step, requires more computer power. Both of these papers describe conditions similar to the ones considered in this thesis and their solutions give good results. However, as they only use binary classifiers they are not able to distinguish between different cell types in the images.

Most of the research relating to the subject seems to restrict itself to analyzing only static images of cells and does not take change over time into consideration. This is especially true for images of cells taken with fluorescence microscopes, where generating several consecutive images of the same cells is hard or unfeasible. Abassi et al. addressed this issue and developed a real-time electronic cell sensing system [5]. Here images were not used, but it shows the importance of studying cells over time, which is something often overlooked. Different drugs may perturb the cells at different points in time, and to measure this they used impedance measurements of cells growing on microelectronic sensors to generate cell kinetic profiles. This gives the ability to study the temporal interaction of drugs with the cells. This dynamic interaction would be desirable to accomplish even for microscopy images, and Li et al. present a fully automated tracking system that can track hundreds to thousands of cells in phase contrast video microscopy [6]. This is very close to the aims of the thesis project reported here, but as in the other cases, the cell detection algorithm does not distinguish between different cell types.

3 Problem description and goals

Drug development and discovery is highly dependent on a reliable method to observe living cell cultures. This is often done with the help of microscopes, and the identification of cells has traditionally been performed by experienced human observers. With the advancements towards high-throughput microscopy devices, however, the necessity of a human observer to examine all generated image data is a serious impediment to the speed. To attain and keep a high throughput in drug testing it is therefore necessary, where applicable, to replace the human observer with computer algorithms that can by themselves identify cells and gather important statistics from the images.

A common aid in the identification of cells is fluorescence microscopy, whereby cells are stained with fluorescent markers prior to observation. A fluorescent marker emits light observable by the microscope. This will in effect generate microscopy images containing cells, or parts of cells, visualized with different colors depending on their current state. An observer will then be able to quickly identify different types of cells by looking at their color. The development of automated algorithms capable of extracting features from these types of images is very active.

This project, however, focuses on the processing and extraction of data from images retrieved from a different type of microscope, the phase contrast microscope. Phase contrast microscopy is a method that does not depend on the staining of cells. Instead, it shows a plain optical image of the cell population, without colors, but uses the phase shift of light to further enhance the image. The phase of light is altered when the light passes through different media, which reveals extra information about the cells that will be visible in the resulting image.

Phase contrast microscopy has both advantages and disadvantages compared to fluorescence microscopy.

• Advantages

  – Cheap: The concept of phase contrast microscopy is simple, and as it requires no staining it is much cheaper than fluorescence microscopy.

  – Fast: As staining is both a difficult and time consuming procedure, eliminating the need for it makes phase contrast microscopy a faster alternative.

  – Perturbation free: By staining cells, they are actually modified and perturbed before study. Cells observed with a phase contrast microscope need no modification before study.

  – Temporal measurement: Being perturbation free enables video microscopy. The growing cell culture can be observed over time by taking snapshots at regular time intervals, generating the frames of a video. This allows observation of cell behavior over time.

• Disadvantages

  – Less information: As a phase contrast microscopy image is just an optical image of the actual cell population, no information regarding the biochemical parameters of the cells is available. This is to be compared with the stained cells in fluorescence microscopy, where the different colors reveal the state of the cells and can give a helpful hint of what underlying processes are taking place in the cells.

  – Harder cell detection: Fluorescent markers aid the detection of cells. Without such emitters the resulting images are only gray scale, making it a much more challenging task to develop algorithms that enable computers to reliably detect cells.

Thus, images containing stained cells hold more information and are easier to process with a computer, but the generation of such images is more expensive. In contrast, phase contrast images can be generated in large amounts, giving more samples for study.

The motivation of this project is to exploit the availability of large amounts of data and effectively use computers to find patterns in these images. Large amounts of data, too much for a human to process, are already available in the research environment, which demonstrates the need for a fast algorithm that can process all the collected data. The method proposed here differs from many others in this field: rather than just detecting and counting cells in videos, we ultimately want to automatically extract features from phase contrast microscopy videos and create histograms of these features over time.

These histograms can then hopefully be used in a later stage for modeling of cell dynamics.


The features, and the methods used for extracting them, can take different forms, and thus the histograms will have different characteristics depending on the method used. Many different approaches are worth trying. For example, both supervised cell detection classification and unsupervised classification can be used. With unsupervised classification, it should be possible to detect unknown and unusual patterns not previously seen in the growing cell cultures, and we are immediately no longer limited to the mindset of just detecting cells, which is likely the most commonly used approach in this field. Whichever method is used for feature extraction in this thesis, the result will always be a time-varying histogram, but with different scales and extracted information. Essentially, this feature extraction is just a compression of information, where we search for algorithms that are able to obtain information-rich features that can describe the cell dynamics with much less data and without redundant information.

3.1 Goals

Based on the preceding discussion of the problem, the main idea of this project is to study different algorithms that automatically extract features which can be displayed as time-varying histograms. This project therefore had the following two goals:

1. Development of novel image processing algorithms for fast and robust extraction of features from microscopy images of cell populations.

2. Comparison of the speed and performance of the developed algorithms to determine their eligibility for extracting features, which are to be used later in mathematical modeling of cell population dynamics.

4 Theory

4.1 Basic concepts

Before delving into the theory of each type of algorithm used for processing images, this section presents common concepts used by all methods considered in this thesis and widely used in the scientific discipline known as pattern recognition. Pattern recognition in its simplest sense means to take a set of objects and use a classifier to assign each object in the set to a particular class. The kinds of objects considered and how they are represented depend on the application.

When classifying objects from a given set X, it is necessary to be able to describe them mathematically. This is done with feature vectors, which are defined as x = [x1, x2, . . . , xM]^T. Each of the components x1, x2, . . . is called a feature, and how they are interpreted depends on context. If M is the number of features retrieved from each sample, each feature vector has dimension M and the space spanned by all feature vectors is called feature space.

Many pattern recognition algorithms require a set of data, called training data, that is used for training the classifier to separate feature vectors into different classes. After the classifier has been trained with all available training data, it can use the information gained from the training to try to determine what class a new unknown feature vector belongs to. An important aspect to be aware of when training a classifier for pattern recognition is over-fitting. This happens when a classifier adapts too closely to the data in its training set and thereby loses generality. In the case of artificial neural networks, presented later, this happens when the network has too many free parameters in the form of the number of layers and the number of neurons. A classifier which exhibits over-fitting will adapt to particular details of the specific training data during training. The classifier will correctly classify all or almost all of the features from the training data, but when new unknown features are presented the classifier will yield poor performance.

The remaining three sections in this chapter introduce the theory behind the three particular approaches tried in this project: adaptive linear filters, clustering and artificial neural networks.

4.2 Adaptive linear filters

Linear classifiers are the simplest of the three different types of classifiers considered in this project. Their major advantage is their simplicity and computational speed.

Before continuing, the concept of linear separability will be introduced. Let ω1, ω2 be two classes in the M-dimensional feature space. ω1 and ω2 are said to be linearly separable if there exists a hyperplane, defined by w*^T x = 0, that satisfies

    w*^T x > 0   ∀x ∈ ω1,    (1)
    w*^T x < 0   ∀x ∈ ω2,    (2)

where T denotes the transpose and w* is the normal vector of the hyperplane separating the two classes. Figure 1 illustrates the concept.

For a linear classifier to work without error, the classes it distinguishes between have to be linearly separable. This is rare in real life problems and our case should be no exception; however, linear classifiers can serve as a good approximation if the feature vectors of the different classes are at least separable to some degree. The advantage of an attractive computational speed can outweigh the disadvantage of some errors in classification. The linear classifier considered here is actually a type of filter called the matched filter, but modified to work as a classifier.

Figure 1: Two examples of linearly separable and non-linearly-separable classes in two-dimensional feature space (features x1 and x2). (a) Two classes that are linearly separable: a hyperplane (a line in two dimensions) fully separates the two classes. (b) Two classes that are not linearly separable: no hyperplane (line, in two dimensions) can fully separate the two classes into two regions.

4.2.1 Matched filter

The matched filter is a type of linear filter used in signal processing, designed to maximize the signal-to-noise ratio (SNR) of a signal with white noise added to it. It can be used for detecting known signals in the presence of noise. The time-discrete version of the matched filter is the cross-correlation between the known signal f and the incoming signal with added noise, x:

    y_i = Σ_{k=0}^{M} x_{i−k} f_k,    (3)

where the length of the known signal f is M.

This cross-correlation is actually a scalar product performed at every offset of the incoming signal:

    y_i = f^T x_i.    (4)

The known signal is written in vector form, f, with length M, and x_i is the part of the incoming signal at offset i, also with length M. The vector f can then be interpreted as a feature vector in an M-dimensional feature space where the classifier separates the space into two halves with a hyperplane. With an appropriately chosen threshold it will separate the feature space into two classes.

This classifier falls into the family of supervised classifiers, which means it has to be told in advance, by training, what it is supposed to look for. To be able to detect different kinds of objects, several adaptive filters would have to be used, one for each type of object.
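To make the mechanics concrete, the following is a minimal sketch (in Python with NumPy/SciPy, not the thesis code) of a matched-filter style detector applied in two dimensions, as is done later for images: the known template is cross-correlated with an image and the response is thresholded. The function and parameter names here are illustrative only.

    import numpy as np
    from scipy.signal import fftconvolve

    def matched_filter_response(image, template):
        # Cross-correlation equals convolution with a flipped kernel;
        # fftconvolve computes it via the FFT for speed.
        kernel = template[::-1, ::-1]
        return fftconvolve(image, kernel, mode="same")

    def detect(image, template, threshold):
        # Boolean map of locations where the filter response exceeds the threshold.
        return matched_filter_response(image, template) > threshold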

4.3 Clustering

Clustering is an unsupervised approach to pattern recognition. In contrast to supervised learning, there is no predetermined class labeling of the different classes. Instead, the data is automatically clustered into groups that make sense, in some way or another.

The very nature of the problem of the project suggests that an unsupervised approach, such as clustering, would be promising. Indeed, if an algorithm could obtain certain useful statistics of the available data set, gathered from normal images of cell populations, it should be able to differentiate images that contain unusual patterns deviating from the normal. For example, a clustering algorithm could process a number of video microscopy images that are known beforehand to contain only cells that behave in a normal manner. The algorithm would partition feature vectors extracted from the images into different clusters depending on their characteristics. When the algorithm has completed its training, inputting a new image to the algorithm would return a distribution of the classifications of all the feature vectors.

Now, if the processed image is similar to the previous images seen by the classifier, the distribution of feature vector classifications should look similar to earlier results. But if fed with an image containing something unusual, it would be desirable if the distribution changes to such a degree that it signals the need for review by a human observer.

4.3.1 On-line clustering

Algorithm 1 presents pseudo-code for the on-line clustering algorithm studied in this project. The algorithm seems to be new, but is inspired by the adaptive resonance theory of neural networks introduced by Stephen Grossberg in the 1980s [7]. It is a sequential algorithm which is straightforward and easy to understand. The method adds clusters as they are needed when encountering new vectors from the data set, and it is not known beforehand how many clusters are needed for a particular data set. The clusters here are represented by prototype vectors, w, which are vectors from the same feature space as the vectors in the data set X.

Algorithm 1 On-line clustering

    m ← 1
    N_1 ← 1
    w_1 ← x_1
    for i = 2 → N do
        found ← false
        for j = 1 → m do
            if ||w_j − x_i|| < τ then
                w_j ← (w_j · N_j + x_i) / (N_j + 1)
                N_j ← N_j + 1
                found ← true
                break
            end if
        end for
        if not found then
            m ← m + 1
            w_m ← x_i
            N_m ← 1
        end if
    end for

Described in words, the algorithm will for each feature vector compare its Euclidean distance to each prototype vector. The comparison is made in the same order as the prototype vectors were created, and the feature vector will belong to the first prototype it finds with a distance smaller than a predetermined value, τ. When such a prototype has been found, a counter for that prototype is incremented and the prototype is adjusted to be slightly closer to the added feature vector. If no prototype with a distance smaller than τ is found, a new prototype is added, representing a new cluster. The new prototype is set to be a copy of the feature vector.

The only adjustable parameter of Algorithm 1 is τ, which determines how small the distance between the feature vector and a prototype vector has to be for the feature vector to be considered representable by that prototype. A larger value of τ will enable the prototypes to catch more vectors from the data, and a smaller value of τ will cause the prototypes to be pickier about which feature vectors they accept. Consequently, a larger τ will generate a smaller number of prototypes (and thereby clusters) than a smaller τ, if operated on the same data set.

When the algorithm has finished processing all data in a set, it will have created a number of clusters, each represented by its corresponding prototype vector. At this point the clusters are not populated by any data. This has to be done in a second pass, as the prototype vectors are adjusted during the training and the clusters are not definite until the first run of the algorithm has been completed. The second pass is performed by running an algorithm which is essentially the same as Algorithm 1, but with the code for creating new prototype vectors removed. The output from the second pass can be viewed as a histogram where each bin corresponds to a cluster.
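The following is a minimal NumPy sketch of Algorithm 1 and the second assignment pass, written for illustration only (it is not the thesis MATLAB code, and the function and variable names are made up here).

    import numpy as np

    def online_clustering(X, tau):
        # First pass (Algorithm 1): build prototypes from the rows of X.
        prototypes = [X[0].astype(float)]
        counts = [1]
        for x in X[1:]:
            for j, w in enumerate(prototypes):
                if np.linalg.norm(w - x) < tau:
                    # Running-mean update: nudge the prototype toward the new vector.
                    prototypes[j] = (w * counts[j] + x) / (counts[j] + 1)
                    counts[j] += 1
                    break
            else:
                # No prototype within tau: start a new cluster.
                prototypes.append(x.astype(float))
                counts.append(1)
        return np.array(prototypes)

    def cluster_histogram(X, prototypes, tau):
        # Second pass: assign each vector to the first prototype (in creation
        # order) within distance tau, producing one histogram bin per cluster.
        hist = np.zeros(len(prototypes), dtype=int)
        for x in X:
            for j, w in enumerate(prototypes):
                if np.linalg.norm(w - x) < tau:
                    hist[j] += 1
                    break
        return hist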

4.4 Feedforward multilayer artificial neural networks

The remaining type of classifier introduced here is nonlinear: the feedforward multilayer artificial neural network (ANN). A nonlinear classifier is a classifier that is not limited to hyperplanes as boundaries when separating classes in feature space. Instead, the boundaries are more general and can assume much more complex shapes. Because of this, nonlinear classifiers do not impose the requirement that the classes to be separated are linearly separable in order to obtain a fully correct classification. As Figure 1b showed, a linear classifier cannot completely separate two classes that do not have this property. Nonlinear classifiers do not bear this limitation, as illustrated in Figure 2. When entering the realm of nonlinear classifiers it is important to be aware that, while not being restricted to linearly separable classes, and thereby greatly increasing the domain of problems that can be solved, there can be a considerable cost in terms of required computing power. Nonlinear classifiers are also inherently more complex and less well understood than linear classifiers.

Artificial neural networks get their fundamental idea from the biological networks of neurons that exist in our brains and are designed to mimic their behavior. There are different types of ANNs, differing in implementation and complexity. Here we present the type that is likely the most popular one in use, the multilayer perceptron. It builds upon concepts known as the neuron and the single layer perceptron.

Figure 2: Nonlinear classifiers are able to assume general decision boundaries between classes in feature space. The samples are identical to those in Figure 1b.

4.4.1 The neuron

A neural network, simply put, consists of a number of neurons connected together. While a human brain contains up to 10^11 neurons connected in a very complex arrangement, an artificial neural network intended to be run by a computer has to be extremely simplified.

A very simple model of a neuron is one that computes a weighted sum of the signals it receives as input from other neurons connected to it. If this sum exceeds some predetermined threshold the neuron will output a one and the neuron is said to fire; otherwise, it will output a zero. Hence, it can be formulated as

    n_i = Θ( Σ_j w_ij n_j − µ_i ),    (5)

where n_i is either 1 or 0 depending on whether the i-th neuron fires or not. The weight w_ij determines the strength of the connection from neuron j to neuron i. Θ(x) is the unit step function, or Heaviside function:

    Θ(x) = 1 if x ≥ 0, and 0 otherwise.    (6)

The threshold value µ_i is specific to the i-th neuron, and if the weighted sum exceeds it the neuron will fire. This model of the neuron is sometimes referred to as the McCulloch-Pitts neuron [9].

A slightly more general definition of an artificial neuron is obtained by replacing the binary threshold function Θ(x) with a nonlinear function g(x). The model from Equation (5) then becomes

    n_i = g( Σ_j w_ij n_j − µ_i ),    (7)

where the function g(x) is called the activation function.
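As a small illustration, the two neuron models in Equations (5) and (7) can be written in a few lines of Python with NumPy (a sketch for clarity only; the function names are made up here).

    import numpy as np

    def mcculloch_pitts_neuron(inputs, weights, threshold):
        # Equation (5): fire (output 1) if the weighted sum reaches the threshold.
        return 1 if np.dot(weights, inputs) - threshold >= 0 else 0

    def neuron(inputs, weights, threshold, activation=np.tanh):
        # Equation (7): the same weighted sum, but passed through a nonlinear
        # activation function g instead of the unit step.
        return activation(np.dot(weights, inputs) - threshold)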

4.4.2 The single layer perceptron

The artificial neuron in Equation (5) is actually a linear classifier. To see this, consider the argument of the unit step function: Σ_j w_ij n_j − µ_i. If the outputs from M neurons are used as input, then j = 1, . . . , M and we can replace the sum with the more compact vector notation

    w_i^T n − µ_i,

where w_i = [w_i1, w_i2, . . . , w_iM]^T and n = [n_1, n_2, . . . , n_M]^T. Given an M-dimensional feature vector x, each of the input neurons n_1, n_2, . . . , n_M can be identified with the components of the feature vector x: x_1, x_2, . . . , x_M. Thus, by exchanging the neurons for a feature vector we get

    w_i^T x − µ_i,    (8)

where w is called the weight vector and µ_i the threshold. Setting this function equal to zero gives a hyperplane, dividing feature space into two halves, each one representing a class. Thus, the classification of an unknown feature vector will follow the simple rule:

    If w^T x − µ > 0, assign x to ω1.    (9)
    If w^T x − µ < 0, assign x to ω2.    (10)

Applying the unit step function to the result will then yield a one if the vector belongs to one class, or a zero if it belongs to the other class.

Assuming the classes are linearly separable, an algorithm known as the perceptron is guaranteed to find a weight vector w and a threshold µ such that all features are classified correctly. This is known as training the perceptron. After training, the perceptron can be used to classify unknown feature vectors into different classes.
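A minimal Python sketch of perceptron training follows (assuming linearly separable classes and ±1 labels; this is illustrative, not the thesis code, and the names are made up here):

    import numpy as np

    def train_perceptron(X, labels, learning_rate=1.0, max_epochs=100):
        # X: N x M matrix of feature vectors, labels: +1 or -1 per row.
        w = np.zeros(X.shape[1])
        mu = 0.0
        for _ in range(max_epochs):
            errors = 0
            for x, y in zip(X, labels):
                prediction = 1 if np.dot(w, x) - mu > 0 else -1
                if prediction != y:
                    # Nudge the hyperplane toward the misclassified sample.
                    w += learning_rate * y * x
                    mu -= learning_rate * y
                    errors += 1
            if errors == 0:
                # All training samples classified correctly: training has converged.
                break
        return w, mu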

4.4.3 Multilayer perceptron

When linear classifiers such as the perceptron are not good enough to classify the data, more complex classifiers that are not limited to linearly separable classes have to be introduced. The multilayer perceptron (MLP) is such a classifier and is a natural extension of its single layer counterpart. As the name implies, this classifier uses several layers of neurons wired together in a network and represents one very common type of ANN. It also uses a model of a neuron with a nonlinear activation function, like the one in Equation (7), instead of the binary threshold.

An MLP consists of two or more layers of neurons. The most common type is the feedforward neural network, where the information only flows in one direction. Starting from the nodes in the input layer, these provide inputs to the neurons in the first hidden layer, whose outputs become inputs to the second hidden layer, and so the signals propagate all the way to the output layer. No connections are made back to neurons belonging to an earlier or the same layer.

To train an MLP, training pairs {(y_i, x_i)}_{i=1}^{N} have to be available. That is, for an input (feature) vector x_i, it is desirable that the MLP, f, with parameters θ (weights and thresholds), outputs a vector y_i,

    y_i = f(x_i; θ),    (11)

consistent with the training pairs. As the output layer can hold more than one neuron, y_i is no longer a scalar, but a vector. Training is achieved by minimizing a cost function J(θ). This can be done with the method of least squares, by minimizing

    J(θ) = (1/N) Σ_{i=1}^{N} ||y_i − f(x_i; θ)||².    (12)

MLP training is often based on gradient descent or some quasi-Newton (second order) method. The gradients of the objective function being minimized are calculated by means of the computationally efficient backpropagation algorithm. The name comes from the fact that the prediction errors at the network output are propagated back through earlier layers (implementing the chain rule). Backpropagation requires the activation function to be differentiable, and hence the unit step function can no longer be used. Instead, a nonlinear function approximating the unit step function is used. A popular family of functions used for this purpose are the sigmoid functions, one of which is defined as

    ϕ(x) = 1 / (1 + e^(−αx)),    (13)

where α is a slope parameter. Another popular activation function belonging to the same family is the hyperbolic tangent function:

    ϕ(x) = tanh(x).    (14)

The former has a range of (0, 1) while the latter has a range of (−1, 1).

To construct a neural network with good performance, the configuration of an MLP generally has to be determined experimentally. While extremely simple networks featuring only a couple of neurons can be fully understood, the network quickly becomes incomprehensibly complicated with the addition of only a few neurons, owing to the numerous weights and connections that wire all the neurons together. A good set of parameters can often be found in an iterative fashion, where the MLP is trained several times with a systematic adjustment of the parameters each time. For each iteration, the network can be evaluated by inputting test data and calculating the error, which is how much the expected output differs from the actual output. The parameters that yield the smallest error should be a good choice for the MLP, but certain precautions should also be taken to avoid finding local minima of the objective function J.
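For reference, a forward pass through a feedforward MLP with sigmoid activations, as in Equations (11) and (13), can be sketched as follows (the layer structure and names are illustrative; this is not the thesis code):

    import numpy as np

    def sigmoid(x, alpha=1.0):
        # Equation (13): smooth approximation of the unit step.
        return 1.0 / (1.0 + np.exp(-alpha * x))

    def mlp_forward(x, layers):
        # layers: list of (W, mu) pairs, one per layer, where W has shape
        # (number of outputs, number of inputs) and mu is the threshold vector.
        activation = x
        for W, mu in layers:
            activation = sigmoid(W @ activation - mu)
        return activation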

4.5 Principal component analysis

Since a classifier assigns the feature vectors given to it to different classes, it is not surprising that the nature of the feature vectors highly affects the performance of the classifier, both during training and after. The data fed to a classifier should obviously be relevant for classification. If irrelevant or redundant features are present in a feature vector, not only will this require more computer resources, because of the larger feature vector, but it often also decreases the generalization capabilities of the classifier. With irrelevant or redundant features, it is hard for the classifier to know which ones really are important for determining a correct classification.

One way to filter out redundant data, prior to training and classification, is to use the method of principal component analysis, or PCA. This reduces the dimension of the feature vector by mapping it to a lower-dimensional space while still retaining much of the information present in the feature vector, producing a compact representation of the original high-dimensional feature space. By a change of basis, the basis vectors are aligned with the directions of most variance, in order from the greatest to the lowest. A brief explanation of the transformation follows [11].

First, without loss of generality, assume that the data samples (the set of feature vectors) have zero mean. This can otherwise be achieved simply by subtracting the mean. Then we introduce a linear transformation,

    y = A^T x.    (15)

A is a matrix whose columns, a_1, . . . , a_M, are eigenvectors of the covariance matrix C_x of x, when x is viewed as a random variable. Since the distribution of the random variable is not known, neither is the covariance matrix. But given N feature vectors, x_1, x_2, . . . , x_N, the covariance matrix C_x can be approximated by

    C_x ≈ (1/N) Σ_{i=1}^{N} x_i x_i^T.    (16)

The transform given in Equation (15) is known as the Karhunen-Loève (KL) transform, and the original vector x can be represented in the new basis as

    x = Σ_{i=1}^{M} y_i a_i.    (17)

A feature of this transform is that the components of the feature vectors in the new basis are uncorrelated, which makes it possible to easily reduce the dimension by removing some of the components of the vector. The vector with reduced dimension serves as an approximation to the original feature vector,

    x̂ = Σ_{i=1}^{l} y_i a_i,    (18)

where l < M, and has the error

    x − x̂ = Σ_{i=l+1}^{M} y_i a_i.    (19)

Now, it can be shown that if the transformation matrix A is constructed in such a way that the first l columns of eigenvectors a_i, i = 1, . . . , l, correspond to the l largest eigenvalues of the covariance matrix C_x, the mean square error of Equation (19) will be the smallest. The components corresponding to the largest eigenvalues also correspond to the directions with the largest variance in the feature vector data. The eigenvector with the largest eigenvalue will align itself with the direction in feature space that has the most variance in the sample data. The eigenvector with the next largest eigenvalue aligns itself with the direction of greatest variance that is orthogonal to the first eigenvector, and so on. See Figure 3 for an example. The number l, indicating the dimension of the new lower-dimensional space, is called the number of latent dimensions. Naturally, the more latent dimensions, the smaller the error.
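A minimal NumPy sketch of the PCA/KL transform described above (illustrative only; the function and variable names are made up here):

    import numpy as np

    def pca_transform(X, n_latent):
        # X: N x M matrix with one feature vector per row.
        X_centered = X - X.mean(axis=0)                 # assume zero mean, Eq. (15)
        cov = X_centered.T @ X_centered / X.shape[0]    # covariance estimate, Eq. (16)
        eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: the matrix is symmetric
        order = np.argsort(eigvals)[::-1]               # sort by decreasing eigenvalue
        A = eigvecs[:, order[:n_latent]]                # keep the l largest directions
        return X_centered @ A, A                        # projections y = A^T x, and A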

5 Material

All implemented computer algorithms were tested on a standard desktop computer with an Intel Pentium 4 CPU at 3 GHz and 2 GB RAM, running Debian Linux Squeeze 32-bit edition. The software used was MATLAB version 7.11.0.584 (R2010b) and Python 2.6.6.

Figure 3: Demonstration of how PCA transforms vectors in two-dimensional feature space by projecting them onto one dimension. The two eigenvectors, v1 and v2, are orthogonal and scaled according to their respective eigenvalues. Projecting the data onto the largest eigenvector keeps the data as separated as possible in one dimension.

The code for the linear classifier and the clustering algorithm was written in MATLAB, while the artificial neural networks were implemented in Python with the help of the Fast Artificial Neural Network Library (FANN) [12]. FANN is an open source neural network library written in C, implementing multilayer artificial neural networks. It claims to be up to 150 times faster than other libraries and supports different types of training algorithms and activation functions. The ones used in this project (determined by experimenting and finding the ones with the most plausible results) were a training algorithm known as quickprop, designed as an improvement on the standard backpropagation algorithm in terms of speed [8], and the sigmoid activation function (see Equation (13)).

1280 × 1024 pixel TIFF images were used for testing the algorithms; these were live images containing cultured cancer cells, retrieved from a phase contrast microscope used in the laboratory.

6 Methods

All methods used in this project share the same procedure for extracting features from the microscopy images. Every image frame of a microscopy video is traversed with a square window, one pixel at a time, extracting sub-images of dimension d × d. The pixel values of each sub-image are extracted, column-wise, and stacked into a d²-dimensional feature vector, x. The feature vector of a sub-window is hence just an enumeration of all pixel values in that particular region of the larger image. The adaptive linear filter and the clustering algorithm use these feature vectors as input to their classifiers without modification, while the artificial neural network also compresses the feature vectors with principal component analysis (PCA). The ANN also has an option to give the traversing window a stride larger than one, to speed up execution.
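A minimal sketch of this sliding-window feature extraction (in Python with NumPy; the window size d and the stride are parameters, and the function name is made up here):

    import numpy as np

    def extract_feature_vectors(frame, d, stride=1):
        # Slide a d x d window over the image and stack each sub-image,
        # column-wise, into a d*d-dimensional feature vector.
        rows, cols = frame.shape
        features = []
        for r in range(0, rows - d + 1, stride):
            for c in range(0, cols - d + 1, stride):
                window = frame[r:r + d, c:c + d]
                features.append(window.flatten(order="F"))  # column-wise stacking
        return np.array(features)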

6.1 Adaptive linear filters

The code for calculating the adaptive linear filter was implemented in MATLAB. Prior to training, the user is requested to select regions of interest in an image. This is done interactively by letting the user click in an image to select regions containing patterns which the resulting classifier is supposed to detect. The clicked locations mark the centers of the sub-images which are to be used as training data. The user is also asked to select negative classification examples, which are regions in the image that are not of interest. These are necessary to present the classifier with typical background structures that act as a contrast and should be rejected by the classifier.

If N is the total number of selected samples, the training data consists of input/output pairs

    Ω = {(y_i, x_i)}_{i=1}^{N}.    (20)

This set is composed of two disjoint subsets, representing the positive and negative regions selected by the user, Ω = Ω_pos ∪ Ω_neg. The output is binary: positive if the feature vector belongs to Ω_pos and negative if it belongs to Ω_neg. Thus,

    (1, x) if x ∈ Ω_pos,    (21)
    (0, x) if x ∈ Ω_neg.    (22)

Next, training is performed using the method of least squares (MLS), which will find a filter, f (also of dimension d², the same as the feature vectors), that minimizes the error

    E = Σ_{∀x∈Ω_pos} ||f(x) − 1||² + Σ_{∀x∈Ω_neg} ||f(x) − 0||².    (23)

After the adaptive filter has been found, and hence the classifier has been trained, running the classifier is only a matter of applying the filter to a new image.

The adaptive linear filter is a form of matched filter, and as mentioned earlier in Section 4.2.1, this is a cross-correlation between the filter and the input signal. Since cross-correlation is actually a form of convolution, it is advantageous to first transform the problem to the frequency domain with the Fourier transform. This is because convolution in the spatial domain is equivalent to multiplication in the frequency domain, achieving a much faster computation [10]. Dealing with a discrete signal, as opposed to a continuous one, it is necessary to use the discrete Fourier transform (DFT). In practice, the DFT is calculated with something known as the fast Fourier transform (FFT), which is simply an efficient algorithm for evaluating a DFT.

MATLAB provides a filter2 function that can filter data with a two-dimensional filter. This function is called for each frame in the video, where the input is a 1280 × 1024 matrix representing the image frame and a d × d matrix representing the filter. The function returns the filtered output as a 1280 × 1024 matrix, which is the same size as the input image.

The filter2 function uses the FFT behind the scenes, which makes such filtering a very fast operation. Applying the constructed adaptive linear filter is therefore just a matter of calling this function. Using the FFT for filtering is limited to linear filters, and moving from a linear filter to a nonlinear one will in general drastically increase execution time, as no shortcut via transformation to the frequency domain can be taken.
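As an illustration of the two steps above, the following Python sketch trains a least-squares filter from flattened sub-images and applies it to a frame with FFT-based filtering (a hypothetical stand-in for the MATLAB code, not a reproduction of it):

    import numpy as np
    from scipy.signal import fftconvolve

    def train_ls_filter(samples, labels, d):
        # samples: N x d*d matrix of flattened sub-images (column-wise stacking),
        # labels: 1 for positive and 0 for negative examples.
        # Solves min_f ||samples @ f - labels||^2 and reshapes f to d x d.
        f, *_ = np.linalg.lstsq(samples, labels, rcond=None)
        return f.reshape(d, d, order="F")

    def apply_filter(frame, filt, threshold):
        # Cross-correlate the filter with the frame via the FFT and threshold
        # the response to obtain a detection map.
        response = fftconvolve(frame, filt[::-1, ::-1], mode="same")
        return response > threshold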

6.2 Clustering

The clustering algorithm, also implemented in MATLAB, takes every feature vector extracted by the moving window and feeds them, in order, to the algorithm (based on Algorithm 1). Given a value of the parameter τ, the database of prototype vectors is built as the code processes, in order, the feature vectors given to it. When finished, a list of prototype vectors (representing the clusters) is returned. This database is then used to classify feature vectors from new videos fed to the classifier. As no information about the classes is known beforehand, no user interaction is necessary for selecting patterns.

6.3 Artificial neural networks

The code for the artificial neural network classifier is written in Python, together with FANN [12], a free open source neural network library. As this is again a supervised approach, where the network has to be trained with a number of predetermined classes, the user is requested, for each of the classes in the classifier, to select N training samples from microscopy images of cells. Using PCA to reduce dimensionality by projecting the feature vectors onto a subspace, the outcome is a training set Φ, composed of N input/output pairs

    Φ = {(I_i, O_i)},  i = 1, . . . , N,    (24)

where O_i is the class assigned to feature vector I_i.

During training, this set is divided into two disjoint subsets Φ_train and Φ_test (Φ = Φ_train ∪ Φ_test), where the former is used for training of the network and the latter is used to test the generalization capabilities of the network. The division is formed by randomly selecting 10 % of the elements from Φ and assigning them to Φ_test, while the remaining elements are assigned to Φ_train. To find a good candidate configuration, which is the number of layers, the number of hidden nodes and the number of latent dimensions, several configurations are systematically tried and the error is calculated for both training and test data. Because many parameters can be adjusted, and the training of a network takes significantly longer than just running a trained network, this can be a very time consuming process.
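The split and the systematic configuration search can be sketched as follows (a Python sketch for illustration; train_fn and error_fn are placeholders for the actual FANN-based training and evaluation code, whose API is not reproduced here):

    import random

    def split_dataset(pairs, test_fraction=0.10):
        # Randomly assign about 10 % of the (input, output) pairs to the test set.
        pairs = list(pairs)
        random.shuffle(pairs)
        n_test = int(round(test_fraction * len(pairs)))
        return pairs[n_test:], pairs[:n_test]   # (training set, test set)

    def search_configurations(pairs, configurations, train_fn, error_fn):
        # Try each (layers, hidden nodes, latent dimensions) configuration and
        # keep the one with the smallest error on the test set.
        train_set, test_set = split_dataset(pairs)
        best = None
        for config in configurations:
            network = train_fn(train_set, config)
            err = error_fn(network, test_set)
            if best is None or err < best[0]:
                best = (err, config, network)
        return best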

7 Results and analysis

7.1 Adaptive linear lter

To test its performance, the adaptive linear filter was trained with images of bubbles which have been appearing on some types of cells. These bubbles are clearly visible in the images and are a good example of a pattern for which automatic detection would be desirable. Figure 4 shows the selected samples used for training, which contain both the positive and negative examples selected interactively. Here only eight samples from each class are used to construct the adaptive filter, as adding more samples does not improve, but rather decreases, the performance of the resulting filter. Each sample has the dimension 35 × 35 pixels, large enough to accommodate an image of a typical bubble searched for in the images. After performing the method of least squares on the training samples from Figure 4, the resulting filter has coefficients very close to zero. To let it utilize the full range of the 8-bit values used in the images, the coefficients are shifted and scaled. This is performed by the operation

    f_m = f − min(f),    (25)

followed by

    f_s = 255 · [f_m / max(f_m)],    (26)

where f is the original filter. The resulting adaptive filter, produced by performing the method of least squares and applying the above formulas, is shown in Figure 5.
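In code, the shift-and-scale of Equations (25) and (26) amounts to the following (a short Python illustration; the function name is made up here):

    import numpy as np

    def rescale_to_8bit(f):
        # Shift the filter so its minimum is zero (Eq. 25), then scale its
        # maximum to 255 (Eq. 26) so it spans the full 8-bit range.
        f_m = f - f.min()
        return 255.0 * f_m / f_m.max()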

Figure 4: Sixteen training samples of dimension 35 × 35 (positive and negative classification examples).

Figure 5: The adaptive filter.

In order to evaluate the usefulness of the adaptive filter, two different videos were fed to the classifier, one in which bubbles are predominant and one in which they are absent. To visualize the result and get a feeling for the performance, the output from running the classifier on only the last frame of each video is shown in Figures 6 and 7. The figures show the original images of the cell cultures overlaid with regions in red, marking areas where the classifier believed it found the pattern sought for. These marked areas correspond to pixels in the filtered image (which is not shown here) that exceed a certain threshold.

By inspecting these figures, one can quickly draw the conclusion that the adaptive linear filter does not perform well in recognizing patterns. While all detections in Figure 6 do correctly mark the presence of a bubble, it finds only a small portion of them and thereby misses the vast majority of bubbles. In Figure 7, the detections are all false alarms, marking areas where no bubbles are present.

Figure 6: (a) Output from the adaptive linear filter after processing the last image frame of a video containing cells where bubbles have been appearing; the classifier is trained to detect these bubbles and red areas mark their presence. (b) Close-up view of the marked rectangle in (a). All detections marked by the classifier are correct, but constitute only a small portion of all bubbles present in the image; most bubbles remain undetected.

Figure 7: (a) Output from the adaptive linear filter after processing the last image frame of a video containing untreated cancer cells; the classifier is trained to detect bubbles, which are not present in this video. (b) Close-up view of the marked rectangle in (a). All detections marked by the classifier are incorrect: no bubbles are present in the image, but several marked regions exist in the output.

The speed of the adaptive linear filter averages 58.98 seconds per video, and thus approximately 2 seconds per image. While relatively fast, the performance of the classifier is not very good, as it does not reliably detect patterns.

7.2 Clustering

The output of the clustering algorithm is presented here as histograms instead of annotated images, as this classifier works in a fundamentally different way compared to the other two classifiers. Normally, the output of the algorithm after having processed a video is a time sequence of features clustered into different groups. This can be displayed as a three-dimensional plot that shows the bins from each time step, but as there turns out to be much variation between time steps in the output, such a plot tends to look very disorganized and hard to interpret. Thus, here only the output corresponding to the last frame of each video is presented, to give a simple overview of how the algorithm performs.

Figure 8 shows the results from running the algorithm on four different cell microscopy videos. Each bin in the histograms represents one particular cluster, or prototype, and the y-axis specifies the number of features classified to each cluster. The bins are ordered in the same order as the clusters were formed during training. The y-axis is logarithmic, as the vast majority of features are classified into the first few clusters; with a linear scale, none but the first bin would be visible in the plots. The histograms for all four videos in Figure 8 were created with the same database acquired from training, and thus the ordering of the bins and the prototypes they represent is the same in each of the four histograms. It should be noted that running the training algorithm on another set of videos would create a different set of prototypes, and consequently, histograms created from that database would not have the same meaning as histograms created from an earlier database.

Figures 8a and 8b show histograms from two videos where cells have not been treated with any drug and thus are growing normally. These two videos were also part of the videos used for training. Figures 8c and 8d show the output from two videos, which were not part of the training, where the first one contained cells which had bubbles, and the other contained cells that merged into clumps.

The main motivation for using a clustering algorithm here is for it to be able to detect and distinguish patterns in images deviating from the normal images used in training. It would in this case be desirable if the last two histograms distinguished themselves from the first two, as they contain unusual patterns resulting from cells with a strange behavior that we do want to detect. It can be seen in the figure that the last two histograms of feature vectors are in fact slightly different from the first two: some feature vectors, present in the first two histograms, are absent in the last two. While it is more difficult to discern between the last two histograms individually, it is promising that they do differ from the first two normal histograms.

Training of the clustering algorithm consisted of building a database of prototype vectors from a set of videos of cells. These were videos of cells not exposed to any drug. The configurable parameter τ was set to 800, resulting in the generation of 200 clusters. Training takes around 19.7 hours for a set of 48 different video captures of growing cells (corresponding to 1568 images), averaging 45 seconds per image.

With the finished database, processing one video takes 44.14 seconds, with an average of 1.33 seconds per image. The advantage of this type of clustering is the fact that it is completely unsupervised; no prior training is needed to teach the classifier what patterns to look for. Also, this is the fastest classifier considered in this thesis.

Figure 8: Output from the clustering algorithm when processing four different cell microscopy videos: (a) normal cells, (b) normal cells, (c) cells with bubbles, (d) cells forming clumps. Each histogram shows the number of feature vectors assigned to each prototype, on a logarithmic y-axis. Only the last histogram of each video is shown.

7.3 Artificial neural networks

The samples used for training of the artificial neural network are given in Figure 9. Each class contains 300 samples, selected manually from cell microscope images prior to training. The patterns represented by each class are, in order: living cells, dead cells, bubbles, cells clumped together, and background/noise. The last class is used to present counterexamples to the first four classes, which are the ones of interest, and it therefore contains both images of background and noise, such as edges of cells or other undesired perturbations of the image.

Figure 9: All samples from all classes (Class 1 to Class 5) used in training of the ANN.

The configuration yielding the best performance, in terms of correct classification of different cell types, consists of two hidden layers, containing 200 and 150 neurons respectively. The number of latent dimensions in the PCA preprocessing is 30 (which is hence also the number of neurons in the input layer). The sub-window extracting the features has the dimensions 31 × 31 pixels. These values were obtained experimentally by systematic training of the network with different parameters. The decision was based both on the classification error (see Equation (12)) and on visual inspection of the classification results.

Figures 10 and 11 show two images produced by the ANN classifier. Just as in the results from the adaptive linear filter, overlaid images are used to illustrate the classification of different cell types. The different colors mark the presence of samples belonging to the different classes.

Figure 10 shows the output from the last image of a video where bubbles are present and Figure 11 shows the output from the last image of a video where cell clumps were formed.

It is clear that this classifier has the ability to separate features into multiple classes, as it classifies most of the cells in the images correctly. Unfortunately, there are also many false alarms scattered in the images, but as these are often small regions of only a few pixels, they should be easy to filter out. Also, it is evident from Figure 11 that the classifier performance depends on the class: some classes, such as the dead cells, are easy to detect, while others, such as the cell clumps, show unimpressive performance with a lot of false alarms.

Training time depends on the configuration: the more layers and neurons in the network, the longer the training time. With the configuration used here, training takes approximately 57 minutes. Execution time with the trained net is around 7.65 hours for one video containing 32 frames, resulting in an average of 14.34 minutes per image. By increasing the stride of the moving sub-window to 3 pixels, execution time can be reduced ninefold, but results become more inaccurate. While this classifier can separate between multiple classes and does so relatively well, it is far from perfect, as many areas in the images are misclassified. It is also several orders of magnitude slower than both of the other two classifiers.

8 Conclusions and discussion

All classifiers treated in this project are able to generate time-varying histograms, as it is just a matter of gathering all features extracted for each image frame into a graphical representation. But as they all work on different levels and extract features differently, the histograms will have different characteristics. For example, the number of features, and thus the number of bins in the histograms, varies from 2 (for the adaptive linear filter) to 200 (for the clustering algorithm), and the features of the different classifiers have different meanings. The features obtained with the ANN classifier represent the different cell types detected, while the features from the clustering algorithm represent the prototype vectors found during training. Therefore it is of little meaning to compare histograms from the classifiers directly; instead their feature extraction capabilities are discussed. Following is a discussion of the three classifiers in order, followed by a general discussion and conclusion where their performances are compared.

8.1 The adaptive linear classier

Thanks to the fast Fourier transform, the adaptive linear classifier is very fast. However, due to its inability to reliably detect patterns in its current state, there is no immediate way to put the adaptive linear classifier to productive use without significant improvement.

Adding more samples for training of the adaptive lter makes the clas- sier perform even worse, which gives a hint of cause of the problem: the method to calculate the adaptive lter is apparently not very eective. As


(a) Output from the ANN classifier after processing the last image frame in the video. The video contains common elements such as dead cells, living cells and some bubbles.

(b) Close-up view of the marked rectangle in (a).

Figure 10: Many of the detections are correct, but there is a considerable number of false alarms. Red is living cells, green is dead cells, blue is bubbles and purple is cell clumps. The last class is not marked in the image but would fill up most of the uncolored parts.


(a) Output from the ANN classifier after processing the last image frame in a video containing cells that have merged into large cell clumps.

(b) Close-up view of the marked rectangle in (a).

Figure 11: Example of how the classifier fails on images with cell clumps. Red is living cells, green is dead cells, blue is bubbles and purple is cell clumps. The last class is not marked in the image but would fill up most of the uncolored parts.


As the adaptive filter is essentially a kind of average of all the training samples, it is presumably a poor representative of the underlying signal that a normal matched filter uses to detect patterns in noisy channels.
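A sketch of this mechanism (an interpretation for illustration, not the code used in this project): the filter is formed as the mean of the training sub-images and applied with FFT-based correlation, as a matched filter would be.

import numpy as np
from scipy.signal import fftconvolve

def build_adaptive_filter(training_patches):
    # training_patches: array of shape (n_samples, 35, 35); the filter is their mean
    return training_patches.mean(axis=0)

def filter_response(image, filt):
    # cross-correlation is convolution with the flipped filter; the FFT keeps it fast
    return fftconvolve(image, filt[::-1, ::-1], mode="same")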

The use of linear classifiers seems to be of limited value for the problem considered here. Nevertheless, if a relatively small improvement could be made, giving the linear classifier the ability to filter out easy patterns with high reliability, it could prove useful as a preprocessing step. This would relieve the workload of classifiers in later stages, which are normally more expensive to run.

8.2 The clustering algorithm

The clustering algorithm is both very fast and produces histograms that reflect the information stored in the video microscopy images, which makes this a promising method. What the algorithm does is compress the large and redundant amount of data stored in a microscopy image into a compact form with only a few features. This could be used as an intermediate step in the search for features useful in mathematical modeling of cell dynamics. There is, however, a risk that the extracted features are not very information rich and that the observed changes are of a more random nature. This can be tested by using more data to build the prototype database and evaluating it on more videos.

A bin in the generated histograms indicates how many feature vectors (extracted by the sub-window) are close to a particular prototype vector. As a feature vector holds the pixel values of a sub-image, it is in actuality an image, and all images assigned to the same cluster look similar to the image represented by the prototype vector. Since the algorithm does not try to identify cells, the prototypes will simply be building blocks with different shapes that can be used to reconstruct the whole image. Unless the number of prototypes is really high, the produced prototypes are not likely to resemble images of cells. If there is a vast number of prototypes, some of them may look like cells, but this will only involve a fraction of all the prototypes in the database, as many sub-windows will overlap edges and other parts of cells. Because of this, it is not surprising, and also evident from the results, that adding some cells of a different type to a video microscopy image will not induce a change in just one or a few bins of the histogram, but instead a change across the whole spectrum. This is a consequence of not interpreting the images in any way.
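In code, the bin assignment amounts to a nearest-prototype lookup. A sketch with assumed array shapes:

import numpy as np

def nearest_prototype(feature, prototypes):
    # feature: flattened sub-window; prototypes: (n_prototypes, d) array of prototype vectors
    distances = np.linalg.norm(prototypes - feature, axis=1)
    return int(np.argmin(distances))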

8.3 The artificial neural network

This classifier was to some extent successful in classifying cell types from all classes. The reason the artificial neural network was much more successful than the adaptive linear counterpart in classifying the features, and thereby detecting different cell types, is most likely its nonlinear properties.

Even though the adaptive linear classifier adapts to a kind of mean of all interesting sub-images, it still relies on separability of the classes to be effective.

Although the ANN classifier is much more reliable in detecting patterns than the adaptive linear filter, it is far from perfect and there are many erroneous classifications. The error rate also varies between classes: for example, the symmetric and (to the human eye) easily distinguished dead cells are very reliably detected, while the more irregularly shaped living cells are detected less reliably, with many false positives.

A straightforward way to improve the results of the ANN classifier would be to add more training data. This is plausible since the current results show that the classifier correctly classifies most of the cells, but there are many false alarms in a classified image. These would probably be alleviated by more training data, making the network more confident in its classifications.

Many of the false positives tend to be isolated single pixels or groups of only a few pixels. These can easily be removed by filtering small regions out of the classifier output, further improving the results. From inspection, the classifier output is more likely to be correct when the pixels marking a particular class form large connected regions.
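A sketch of such a post-filter (assumed, using scipy.ndimage; the minimum region size is an illustrative parameter):

import numpy as np
from scipy import ndimage

def remove_small_regions(class_mask, min_size=4):
    # class_mask: boolean image marking pixels assigned to one class
    labeled, n_regions = ndimage.label(class_mask)
    sizes = ndimage.sum(class_mask, labeled, range(1, n_regions + 1))
    keep_labels = np.where(sizes >= min_size)[0] + 1
    return np.isin(labeled, keep_labels)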

The reason for the longer execution time compared to the other classifiers is the size of the neural network. A network with two hidden layers of 200 and 150 neurons, together with the 30 input and 5 output neurons, yields a total of 36750 connections, each with its own weight. During training all of these have to be set, which explains the long training time. Execution after training is naturally faster, but still takes time if every possible sub-window in the image is to be processed. As mentioned in the results section, the stride of the moving sub-window can be increased to speed up execution, as fewer features then have to be classified by the network. But it is a trade-off, and the price of a greater stride is a less accurate classification result.
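The connection count follows from summing the products of successive layer sizes (bias terms not included): 30 · 200 + 200 · 150 + 150 · 5 = 6000 + 30000 + 750 = 36750.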

Another way to achieve a speedup could be to remove uninteresting features in a preprocessing step, avoiding feeding trivial features into the neural network. For example, the parts of the microscopy images that are just background are quite uninteresting. If a large percentage of a microscopy image consists of background (few cells), valuable time is wasted if all sub-images containing background have to be processed by the ANN. Since the background is plain gray, it should be easy to filter such sub-images out, perhaps with a linear filter.
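A sketch of such a background test (assumed; the variance threshold is an illustrative parameter, exploiting that plain gray background shows almost no intensity variation):

import numpy as np

def is_background(patch, var_threshold=1e-3):
    # skip sub-windows whose intensity variance is negligible
    return float(np.var(patch)) < var_threshold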

8.4 General discussion

An important issue to be aware of in this problem is the irregularity of cell sizes and their morphology. In both the adaptive linear classifier and the ANN it is, to some extent, assumed that all cells, or all patterns searched for, more or less fill a particular size of 35 × 35 pixels. While most cells do
