DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

Signal Extraction from Scans of Electrocardiograms

JULIEN FONTANARAVA

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Signal Extraction from Scans of Electrocardiograms

Julien Fontanarava
julienfo@kth.se

Master's programme in Computer Science
Supervisor: Pawel Herman
Examiner: Hedvig Kjellström
Host company: Cardiologs

January 3, 2019


Abstract

In this thesis, we propose a Deep Learning method for fully automated digitization of ECG (electrocardiogram) sheets. We perform the digitization of ECG sheets in three steps: layout detection, column-wise signal segmentation, and finally signal retrieval, each of them performed by a Convolutional Neural Network (CNN). These steps leverage the advances in object detection and pixel-wise segmentation brought by the rise of CNNs in image processing.

We train each network on synthetic images that reflect the challenges of real-world data. The use of these realistic synthetic images aims at making our models robust to the variability of real-world ECG sheets. Compared with computer vision benchmarks, our networks show promising results. Our signal retrieval network significantly outperforms our implementation of the benchmark. Our column segmentation model shows robustness to overlapping signals, an issue of signal segmentation that computer vision methods are not equipped to deal with.

Overall, this fully automated pipeline provides a gain in time and precision for physicians willing to digitize their ECG database.


Sammanfattning

Signal Extraction from Scans of Electrocardiograms

In this degree project, we propose a deep learning method for fully automated digitization of ECG sheets. We perform the digitization of the ECG sheets in three steps: layout detection, column-wise signal segmentation, and finally signal retrieval, each of them performed by a convolutional neural network. These networks are inspired by networks used for object detection and pixel-wise segmentation. We train each network on synthetic images that reflect the challenges of real-world data. The use of these realistic synthetic images aims to make our models robust to the variability of real-world ECG sheets. Compared with computer vision benchmarks, our networks show promising results. Our signal retrieval network significantly outperforms our implementation of the benchmark. Our column segmentation model shows robustness to overlapping signals, an issue of signal segmentation that computer vision methods cannot handle. Overall, this fully automated pipeline provides a gain in time and precision for physicians wishing to digitize their ECG databases.


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope and objectives
  1.3 Thesis Outline

2 Related Work
  2.1 ECG Fundamentals
  2.2 ECG Digitization
    2.2.1 Original Work
    2.2.2 The evolution of the methodology
  2.3 Digitization Shift
    2.3.1 Document Analysis
    2.3.2 Optical Music Recognition
  2.4 CNNs in Computer Vision
  2.5 CNN fundamentals
    2.5.1 Edge detection
    2.5.2 Image segmentation
    2.5.3 Object detection

3 Methods
  3.1 Signal Retrieval
    3.1.1 Data
    3.1.2 Curve Retrieval
    3.1.3 ECG Map Retrieval
    3.1.4 Benchmark
  3.2 Signal Segmentation
    3.2.1 Problem
    3.2.2 A two-step pipeline
  3.3 Layout Detection
    3.3.1 Synthetic Data Generation
    3.3.2 Model
  3.4 Column Segmentation
    3.4.1 Data
    3.4.2 Evaluation
    3.4.3 Method
    3.4.4 Benchmark

4 Results
  4.1 Signal retrieval
    4.1.1 Curve retrieval
    4.1.2 ECG Map Retrieval
    4.1.3 Column-wise pixel scan
    4.1.4 Active contour
    4.1.5 Comparison
  4.2 Layout Detection
  4.3 Column Segmentation

5 Discussion
  5.1 Performance
    5.1.1 Signal Retrieval
    5.1.2 Column Segmentation
    5.1.3 Full Pipeline
  5.2 Critical evaluation and future work
    5.2.1 Signal Retrieval
    5.2.2 Layout Detection
    5.2.3 Column Segmentation
  5.3 Ethics and Sustainability
    5.3.1 Usage
    5.3.2 Storage

6 Conclusion


Chapter 1

Introduction

1.1 Problem Statement

Electrocardiograms (ECGs) register the electrical activity of the heart by recording the differences of potential between pairs of electrodes placed on the body. Due to the simplicity and low cost of the examination, they constitute one of the first medical procedures performed by a physician to detect an abnormality in the conduction of the electrical signal throughout the heart, either through an abnormal heartbeat morphology or an abnormality in the heart rhythm. A resting electrocardiogram, usually recording 12 signals, is performed on a patient at rest for a few seconds (typically 10 seconds)¹.

Nowadays, resting electrocardiograms are both acquired digitally and printed for immediate analysis by the physician. The digital acquisition gives several numerical signals, with voltage values corresponding to time samples. The printed version shows the 12 signals traced on paper. An example of the printed version is shown in Figure 1.1.

¹ https://en.wikipedia.org/wiki/Electrocardiography

(9)

Figure 1.1: A standard scanned ECG paper

The way these signals are displayed on paper can vary (c.f. Section 3.2.1). All the signals can be displayed for the whole recording duration, i.e. spanning the whole width corresponding to 10 seconds on the ECG sheet. Signals can also be displayed for less than 10 seconds each, which means the signal from a lead will be drawn for the first few seconds, typically 2.5 or 5 seconds, and the signal from another lead will replace it for the following few seconds (c.f. Figure 1.1). This format is meant to have all the leads fit a page in a compact and readable way, enabling the physician to see different angles of the heart's electrical activity for a few seconds. In that case, the last row of the ECG sheet usually displays one of the 12 signals for the whole length (c.f. Figure 1.1), and is commonly called the "rhythm lead". Indeed, its use is to allow the physician to refer to one lead for the whole recording duration to spot the heartbeats, and compare with other leads that only span part of the recording duration.

As long as both paper and digital versions are available, physicians can both read ECGs on the spot and keep a record of them. However, for many years and in a number of medical studies, only the paper version has been stored. This is damaging for record-keeping, as paper versions degrade quite fast, as described by Waits and Soliman [1]. It is also impractical for processing, making the automation of certain processing tasks, such as the computation of certain signal parameters, difficult and time-consuming.

We aim at providing a tool for full and faithful digitization of an ECG sheet. This tool has two main goals. First, it can enable the cardiology community to use their datasets more thoroughly and more quickly. Secondly, it ensures the storage and sharing of their databases.


Figure 1.2: A scanned single lead image

A number of papers have been published since 1987 to deal with the issue of digitizing ECGs. The methods presented in these works aiming at converting single lead images to their corresponding digital signal are outlined in a 2017 survey by Waits and Soliman [1]. These steps, particularly the removal of the grid, and the smoothing operations that are performed on the image prior to extraction of the signals, lead to a loss of quality of the image. As the authors state, ”no method resulted in a flawless gridline removal”. Furthermore, computer vision approaches, such as those outlined in Zhang et al. [2] and later works, provide few prospects of improvement, as they require fundamentally changing the algorithm to obtain better results.

Finally, in the work of Zhang et al. [2] and subsequent works (c.f. [3]), the discussion focuses on the technology for scanning a single lead. To the best of our knowledge, no work has focused on an automated pipeline including segmentation of the signals prior to extracting the coordinates of each signal on the paper.

The purpose of this work is to provide a full end-to-end pipeline that outputs a digital signal for each of the 12 signals shown on the paper, without the user specifying the boxes in which each signal is located. In other words, our goal is to automatically detect and locate all signals before digitizing each of them individually.

We also aim to overcome certain difficulties, notably degraded paper quality, scanning artifacts, and ink degradation, which are not the main focus of these papers.

To avoid the loss of information that can occur in a full computer vision pipeline, and to address a variety of paper qualities and layouts, we decide to explore machine learning approaches, and more precisely CNNs, which have proven efficient on similar problems (c.f. Chapter 2).

1.2 Scope and objectives

This work focuses on the main parts of the pipeline that turns an image into several digitized signals. Therefore, the algorithms presented make the assumption that the image does not need to be rotated.


The algorithms focus on inferring the pixel coordinates of each curve. They do not consider the conversion from pixels to voltage; this voltage can be derived from the grid and calibration by a physician. Indeed, one square of the grid represents a certain amplitude and a certain time, which are specified by calibration square waves on the sides of the ECG sheet.

Therefore, a separate, possibly simpler algorithm can be used to determine a conversion scale from the number of pixels contained in a grid square on the image, as in [3].
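As an illustration of the conversion left to such a separate algorithm, the following minimal sketch assumes the common 25 mm/s and 10 mm/mV calibration and a grid spacing already estimated in pixels; in practice the scale should be read from the calibration square waves, and the function name is ours, not part of the pipeline.

```python
# Hypothetical sketch of the pixel-to-physical-units conversion, not part of our pipeline.
# It assumes the common 25 mm/s, 10 mm/mV calibration and that the size of a small grid
# square (in pixels) has already been estimated from the image, as in [3].

def pixels_to_signal(y_pixels, baseline_row, px_per_small_square,
                     mm_per_s=25.0, mm_per_mv=10.0, small_square_mm=1.0):
    """Convert per-column pixel ordinates into (time in s, amplitude in mV) pairs."""
    px_per_mm = px_per_small_square / small_square_mm
    samples = []
    for column, y in enumerate(y_pixels):
        t = column / (px_per_mm * mm_per_s)                 # seconds elapsed at this column
        mv = (baseline_row - y) / (px_per_mm * mm_per_mv)   # minus: image rows grow downwards
        samples.append((t, mv))
    return samples
```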

We also do not delve into the problem of extracting meta-information that would be contained on the sheet, e.g. text information, as it is a different and well-studied issue.

In the models presented, and in the data we use and synthesize (c.f. Section 3.1.1), we make a number of assumptions on the images that correspond to the standards of ECG sheet formatting. The assumptions will be presented in the sections related to the data used for the training of each method.

This leads us to two questions:

• Regarding the problem of single lead signal retrieval (c.f. Figure 1.2), can CNNs trained on synthetic images (using real ECG signals) overcome the lack of generalization of the image processing methods from [3, 4] to a variety of input image characteristics?

• Regarding the problem of image segmentation, can a deep learning approach to object detection locate individual ECG signals on a scan with more precision than approaches using thresholding and clustering algorithms similar to [5]?

1.3 Thesis Outline

To investigate these questions, we propose in this work a pipeline for fully automated digitization, from a raw ECG image to the extracted curves of each signal present on the image.

We first approach the issue by presenting a network that extracts a signal from an image containing a single lead. We then add on top of it a method to automate the detection and location of each signal on the sheet. To leverage the specificity of the ECG format, we divide this location pipeline into two parts presented in Section 3.2 (c.f. Figure 3.11).

We will present related work in Chapter 2. We will then present the motivation, evolution and performance of the different parts of this architecture. We will begin with signal retrieval in Section 3.1, as it has been the most studied part of ECG digitization. We will then present the two parts of signal segmentation in Section 3.2.


Chapter 2

Related Work

2.1 ECG Fundamentals

The principle of an ECG is to use the voltage between two electrodes placed on the skin to record the electrical activity of the heart along the axis of these two electrodes. Using differences of potential between 10 electrodes, 12 derivations, i.e. 12 different axes, of the heart's electrical activity are recorded, creating a three-dimensional map used to identify and locate heart abnormalities. These axes are split into three groups: limb leads and augmented limb leads, which both record voltages between electrodes located on the wrists and ankles, and precordial leads, which record the differences of potential between 6 electrodes placed on the chest.

The main components of the ECG signal are the P, QRS and T waves (c.f. Figure 2.1), which reflect the activity of different muscles in the heart. A cycle of contraction of the heart's muscles is as follows: an electrical signal is propagated at a regular pace from the sinus node to the atria. The contraction of the atria corresponds to the P wave. The electrical signal is then propagated through the atrio-ventricular node to reach the ventricles. The subsequent contraction of the ventricles corresponds to the higher-intensity QRS wave, followed by the T wave that occurs at the relaxation of the ventricular muscles.

Abnormalities in the shape of these different waves can indicate issues in the muscles' contraction (e.g. an infarction is often detected using the shape of the T wave in several derivations), while abnormalities in the pattern and rhythm at which these waves occur are used for the diagnosis of arrhythmias (atrial fibrillation is shown by the absence of P waves and the irregularly irregular occurrence of QRS complexes).


Figure 2.1: PQRST waves explanation figure¹

2.2 ECG Digitization

A number of papers have, in the last decades, studied the problem of ECG digitization, starting in 1987 with the work by Zhang et al. [2]. These works focus on the issue of recovering a digital signal from an image containing one ECG signal. They use diverse computer vision approaches which we will outline in this section.

In 2017, the survey by Waits and Soliman [1] outlines the most relevant works in the field, with summaries of their approaches and limitations. This survey describes a pipeline common to most of the digitization works it mentions.

The first step is optical scanning. This is simply the operation of obtaining a scanned image of the physical ECG sheet.

The second step that is commonly referred to is gridline removal ([4]). Since the vast majority of ECG sheets have a background grid for the purpose of measuring time and voltage on the signal, many algorithms start with isolating the signal from this background grid [6, 7].

After obtaining an image where only the signal is present, the next step is to convert this 2D image - which can be seen as a 2D binary map of the signal's coordinates - into a 1D vector.

To that end, a variety of continuity algorithms are presented in the various works.

To fix certain issues in the signal extraction, such as irregularities due to noise, some additional filters are sometimes applied.

¹ Source: Agateller (Anthony Atkielski), https://commons.wikimedia.org/w/index.php?curid=1560893


Finally some works [4] focus on extracting other relevant data from the ECG scan, such as information contained in the text (Patient information or diagnosis for instance). This is however not the focus of our study.

We will now detail the evolution of these different steps of the signal extraction pipeline through the works that have addressed this issue.

2.2.1 Original Work

Originally, the work by Zhang et al. [2] relies on histogram filtering for step 2. Two peaks are identified in an ECG image's histogram as the signal (with an intensity value nearing zero) and noise (around 40). A value between these peaks is chosen for thresholding, a first step of signal isolation, followed by grid removal using detection of the grid lines on the image gradient. Then, step 3 is performed through eight-neighborhood tracing: this consists in selecting points belonging to the curve one after the other. On the binary image, pixels are selected iteratively as one of the positive pixels in the direct neighbourhood (i.e. the eight pixels directly surrounding their predecessor).
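As an illustration, the sketch below implements the eight-neighborhood tracing idea in a simplified form; it is not the exact algorithm of [2] (in particular it omits the QRS recovery step discussed next), and the function name and left-to-right preference are our own choices.

```python
import numpy as np

# Simplified sketch of eight-neighborhood tracing on a binary image (1 = signal pixel):
# starting from a seed pixel, repeatedly move to a positive pixel among the eight
# neighbors of the current one, preferring moves to the right.

def trace_curve(binary, seed, max_steps=100000):
    h, w = binary.shape
    path, visited = [seed], {seed}
    # neighbor offsets ordered to favor left-to-right progression
    offsets = [(0, 1), (-1, 1), (1, 1), (-1, 0), (1, 0), (-1, -1), (1, -1), (0, -1)]
    for _ in range(max_steps):
        r, c = path[-1]
        for dr, dc in offsets:
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and binary[nr, nc] and (nr, nc) not in visited:
                path.append((nr, nc))
                visited.add((nr, nc))
                break
        else:
            break  # no unvisited positive neighbor: stop (QRS gaps need the post-processing described next)
    return path
```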

One difficulty faced in this work is the difference between the QRS wave (a high-frequency part of the signal, c.f. Figure 2.1) and the rest of the signal, which can lead to problems in such continuity methods. This is an issue faced in several approaches, as it is difficult to reconcile the need to filter out unwanted irregularities with the need to keep the frequency spike of the QRS wave.

Indeed, after the eight-neighborhood tracing, an additional post-processing method is used for QRS recovery. The discontinuity region corresponding to the QRS wave is detected, and some gap in continuity is accepted in the reconstruction in this detected region.

After this final step, the ECG is considered recovered.

This paper does not provide a quantitative way of validating the method - presenting only qualitative results.

2.2.2 The evolution of the methodology

After this introductory paper [2], a number of papers have worked on iterations of the different steps of the pipeline.

Thresholding: There have been attempts, e.g. [8, 4], similar to the original method reported in [2]: a thresholding value is selected based on a gap between two peaks of the grayscale image's histogram, and pixels below this value are kept.

Swamy et al. [6] propose a more complex approach. Using Otsu's adaptive thresholding algorithm, the authors define a first thresholding value, which they use to crop the image, in order to find a second threshold value that is applied on the cropped image. Otsu's algorithm is also used by Hussain et al. [7].

This use of an adaptive algorithm such as Otsu's thresholding is well justified, as this algorithm computes a threshold out of a bimodal image - i.e. with two, and only two, separate distributions of intensities. In these papers, these two distributions are assumed to be the signal and the grid.
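For illustration, a minimal sketch of a single Otsu thresholding pass on a grayscale ECG crop is given below, using scikit-image; this is a stand-in for the two-pass approach of Swamy et al. [6], not their implementation.

```python
import numpy as np
from skimage.filters import threshold_otsu

# Illustration only: one Otsu pass that keeps the dark (ink) pixels of a 2-D grayscale crop.
def binarize(gray: np.ndarray) -> np.ndarray:
    t = threshold_otsu(gray)             # threshold separating the two intensity modes
    return (gray < t).astype(np.uint8)   # dark pixels are kept as 1
```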

Shen and Laio [9] present a thresholding method that uses low-pass filtering on a 2D Fourier transform of the image. However, the paper concludes that a grayscale intensity thresholding method performs better, because of factors such as additional salt-and-pepper noise and amplitude displacements of the recovered signal when using the Fourier transform.

Applying a Fourier transform followed by an inverse Fourier transform indeed introduces more distortion and noise than staying in the image space.

Gridline removal: When it is not assumed that intensity thresholding is sufficient to isolate the signal, some papers, such as Zhang et al. [2], detail methods that detect and remove gridlines. To that end, Mitra and Mitra [10] use the run-length smearing algorithm.

This technique changes the value of pixels of the binarized image when these pixels are surrounded by a number of blank points (value 1) greater than a certain threshold. By using this technique, they assume that grid points will have more adjacent blank points than the curve. In the case of a dotted grid line, this assumption is reasonable. It is less so with continuous gridlines.

Continuity Algorithm: Once these preprocessing steps have been performed, it is assumed that only the signal's curve remains on the obtained binarized image. The next objective is to obtain a 1D vector from this 2D array. This may need some additional processing steps. Indeed, for Ravichandran et al. [4], the creation of salt-and-pepper noise by imperfect gridline removal makes an additional step of median filtering necessary before applying the continuity algorithm.

Several algorithms are proposed. The algorithm proposed by Zhang et al. [2] separated its treatment of QRS complexes from the rest of the signal, but other methods generally aim to detect the entirety of the signal with a single algorithm. One preferred method is the column-wise pixel scan, which has different variants. We consider here a binarized image: positive pixels for what is considered signal, and null pixels for what is considered background. Swamy et al. [6] and Ravichandran et al. [4] compute the upper and lower envelopes of the signal as the highest and lowest positive pixels of each column. The recovered signal is then the mean of the upper and lower envelopes. Chebil et al. [8] instead consider another form of column-wise pixel scan that computes the signal pixel as the median of all positive pixels of each column. This variant aims to overcome errors due to outliers such as remaining salt-and-pepper noise.

Mitra and Mitra [10] use a thinning algorithm to go from a several-pixel thick signal to a one-pixel wide signal, and choose at each column the remaining pixel as the signal pixel.

Finally, Patil and Karandikar [11] propose a method that replaces the thresholding approach with image enhancement techniques: the k-fill method is used for anti-aliasing purposes.

The authors then use vertical scanning to identify high-valued pixels corresponding to the signal in the enhanced image.

ECG Scan: The method presented by Badilini et al. [3] stands out in the survey by Waits and Soliman [1], as it led to a software used by physicians. It is also somewhat different from other works, as it develops a method based on a vertical active contour. This iterative method is based on minimizing an external (attraction to low intensity) and an internal (smoothness) energy. It does not require gridline removal, and incorporates smoothness constraints, as opposed to other methods such as the column-wise pixel scan.

Segmentation: One paper, by Lozano-Fernandez et al. [5], focuses solely on segmenting the image. The authors focus more precisely on smartphone photos. They perform grayscale intensity thresholding to remove the grid. They then compute a horizontal projection of the pixel values. They locate local maxima on this projection that are close to the global maximum. Finally, the vertical bounds corresponding to each maximum are found as the lowest and highest pixels with values greater than zero. After testing on thirty-one ECGs and varying the resolution, the authors note errors in some cases of dotted grid lines, and when signals overlap vertically.

Validation: A number of validation datasets and metrics have been used in the papers mentioned above. The evaluation metrics are of two kinds. Comparison is made either directly on the distance between the true curve and the recovered one, or on clinical parameters of the recovered signal, such as the heart rate or cardiac activity intervals (c.f. Figure 2.1).

Among papers presenting direct comparison results, Badilini et al. [3] compute the median RMS (root mean square) deviation for 60 ECGs, insisting on the fact that these ECGs are from a "well-controlled research environment". Indeed, they do not claim to provide a solution effective in all real-life cases. Lobodzinski et al. [12] plot the difference between truth and prediction for 240 ECGs.

Other researchers reason in terms of clinical parameters: Swamy et al. [6] compare the number of R peaks detected and the computed heart rate for truth and prediction, but only for 5 signals. Ravichandran et al. [4] also present results on a limited number of ECGs (10), on which they present a number of validation metrics: fit between true and predicted signals, and correlation between clinical parameters of true and predicted signals (RR, PR, QRS, QT and QTc intervals).

These examples show that both validation methods and validation datasets differ between studies. No standard evaluation method has been established. These works consequently lack comparison with other studies, making it difficult to assess which constitutes the best approach.

2.3 Digitization Shift

Our work follows a trend of going from pure computer vision methods to deep learning approaches in different areas of digitization. We take the examples of Document Analysis and Optical Music recognition, in which this shift has been taking place for a few years.

2.3.1 Document Analysis

Document analysis consists in extracting text from documents. One of its sub-fields, optical character recognition, has been widely studied and consists in the classification of characters in order to convert an image (containing characters) into text. Like ECG digitization, many early works in this field rely on multiple steps of processing in order to obtain the final digitized document. Due to the importance and variability of the issue, different works focus on different areas: different types of documents (newspapers, for instance) or different languages (some papers focus on Arabic or Chinese documents, for instance). This diversity also led to an abundance of methods.

Computer Vision:

O’Gorman and Kasturi [13] gather the different issues of document analysis, and methods that have been developed to perform these tasks. This book outlines several steps that are similar to those seen in section 2.2. Image thresholding is first studied, used in this case more to remove noise and shadows due to the scanning process than for background removal (which was the primary focus of thresholding methods in section 2.2). After other steps of preprocessing such as noise reduction and thinning, methods are described for layout analysis and, finally, character classification.

Layout analysis consists in segmenting different blocks of text - corresponding to article blocks in a newspaper, for instance. Several methods are developed, separated into top-down and bottom-up categories. Top-down approaches rely on recursive splitting; one such approach performs recursive splits based on vertical and horizontal projections. It leverages the same idea as the approach used by Lozano-Fernandez et al. [5]. Bottom-up approaches are described as creating a hierarchy of blocks recursively merged into bigger blocks.

Machine-printed character recognition is described in several steps. First, characters have to be segmented - an example being again intensity thresholds on projections of the image.

Different features of an image containing a single character are then discussed, such as the bounding-box aspect ratio or projected intensity histograms. Finally, a classifier is used, the example given in the book being a naive Bayesian classifier. The book emphasizes that further improvements expected at the time of publication were the use of character context for disambiguation.

Machine Learning

Both layout analysis and character recognition have been studied from the prism of Deep Learning.

Meier et al. [14] analyze the layout of newspapers using a Fully Convolutional Network (FCN), taking as input a newspaper sheet image, and outputting a segmentation mask of the same size. It is trained on newspapers annotated with the bounding boxes of expected output blocks of text.

Ul-Hasan et al. [15] present a method for optical character recognition from lines of text in English or in Fraktur script, using deep learning (LSTMs). Due to the lack of real-world examples, they complete their training data with synthetic generation and degradation of Fraktur texts, a method similar to the one we use for the training of our networks.


2.3.2 Optical Music Recognition

A similar issue, optical music recognition, consists in recovering a musical score from a scan of its printed sheet. The computer vision pipeline is quite similar to character recognition. The main difference lies in the detection and removal of the staff lines. This process, as well as an overview of the rest of the pipeline, is detailed in Rebelo et al. [16]. This issue of staff lines somewhat resembles the issue of grid removal for ECGs. One method is to use a horizontal projection (i.e. summing the intensity values of the pixels in each row). Regular bursts of intensity on this projection correspond to the staff lines. This can face problems, as there can be distortions of the staff lines, making them curved, which makes the process more difficult.

As for a deep learning approach, Dorfer et al. [17] analyze score sheets - not for digitization - but for matching with an audio piece following the score on the sheet. The network takes the raw image containing a line of the score as input. This is an example of CNNs removing preprocessing issues. Regarding staff line removal, rather than requiring explicit removal of the lines, the network learns to ignore the staff pattern by training on images with staff lines. Such a method can be an inspiration to avoid preprocessing in the case of ECG grids.

2.4 CNNs in Computer Vision

Our motivation for the use of CNN as a preferred Machine Learning technique comes from a series of breakthroughs in recent years on issues similar to ours.

2.5 CNN fundamentals

Convolutional Neural Networks (CNNs) are a kind of network made popular by the work of LeCun et al. [18]. The fundamental block of a convolutional network is the convolutional layer. This layer is mainly composed of a small square kernel of weights that is applied to all pixels of an image in a convolutional manner. Instead of applying a different weight to each pixel like a fully connected layer, the use of a sliding kernel in the convolutional layer dramatically reduces the number of parameters. This has two implications. First, it is far less costly than the fully connected layer. Secondly, it learns to recognize information that is invariant throughout the image (e.g. the detection of lines, angles, and in higher-level representations, shapes).

A convolutional layer has a few essential parameters: the size of the 2-D kernel (i.e. k1 × k2, where k1 and k2 correspond to the width and height of the kernel), the number of filters (i.e. the number of channels in the output tensor), the stride, corresponding to the step at which the convolution is applied, and the padding applied at the boundaries of the image, which determines how the convolution affects the output dimensions.
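The following minimal Keras snippet illustrates these parameters on a hypothetical input; the sizes are illustrative only.

```python
import tensorflow as tf

# Illustration of the convolutional layer parameters listed above, using Keras.
# A 3x3 kernel with 64 filters, stride 1 and 'same' padding keeps the spatial size
# and outputs 64 channels; 'valid' padding would shrink the borders instead.
x = tf.keras.Input(shape=(128, 240, 3))            # height, width, RGB channels
y = tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3),
                           strides=(1, 1), padding="same")(x)
print(y.shape)  # (None, 128, 240, 64)
```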

CNNs are applied in the vast majority of current deep learning computer vision works. They are often used alternately with downsampling operations - for instance pooling layers (as in VGG [19]) - to obtain tensors in which height and width progressively decrease while the number of channels increases, thus going from detailed low-level information (e.g. lines, edges) to more global high-level information (e.g. complex shapes such as faces).

2.5.1 Edge detection

First of all, the issue of edge detection is closely related to signal retrieval. Indeed the curves we aim to isolate from the rest of the image can be considered as dark edges in the image.

That is why methods meant for edge detection have been used for signal retrieval (the use of active contour [3] or of image gradients [2]).

This issue of edge detection has been treated in a number of works that propose deep learning approaches with good performance. Bertasius et al. [20] propose to detect strong edges in an image in two steps. First, they select candidate pixels in the image using the Canny edge detector (which performs a double thresholding to select points belonging to an edge). They then use a CNN on a surrounding patch to classify this candidate pixel as edge or not. A window is selected around the candidate point at each of the 5 convolutional layers of the network. This is intended to leverage both low-level information from the first layers and higher-level information relative to objects in the last layers, in order to discriminate between strong and weak contours (for instance, a tree's contour versus the details of the tree's bark). A multi-scale variant of this network is also presented, where different patch sizes are used around the candidate points, which significantly increases edge detection performance.

Xie and Tu [21] take a different approach to multi-scale edge detection using deep learning. Instead of explicitly using different scales around candidate points, they take the whole image and output a prediction of the edge map at the end of each block of convolutions of a VGG. This is called holistically nested because the outputs from different blocks of convolutions give predictions at different scales and are linked to each other.

The results of these approaches show that a CNN is able to learn a hierarchy between edges and can associate low-level and high-level information to select which edges are relevant.

2.5.2 Image segmentation

Another use of CNNs that can be leveraged for our issue is image segmentation. Indeed, the signal's curve can be considered both as an edge (making it a subject of edge detection) and as an object (making it a problem of segmentation). Pixel-wise segmentation has been extensively investigated using CNNs. In the field of medical imaging, Ronneberger et al. [22] propose a network based on the idea of using a fully convolutional network (FCN) for semantic segmentation [23], with convolutions followed by transposed convolutions. Like [21, 20], this method uses feature maps of different levels: at deconvolution, it merges the output of the deconvolution layer with the feature map from the corresponding convolution step, hence the name U-Net.

Semantic segmentation - i.e. classifying each pixel as belonging to a certain type of object - is also a widely addressed issue. Instance segmentation goes a step further by not only classifying whether a pixel belongs to a type of object, but also by separating different instances of the same type of object. To that end, Mask-RCNN [24] builds on an object detection model [25]. The authors split their original object detection network (Faster RCNN) into two branches: a branch for box regression and classification, and a parallel branch for pixel-wise segmentation of each detected bounding box. Instance segmentation could be useful to separate overlapping curves, just as it can separate overlapping real-world objects.

2.5.3 Object detection

Finally, the signal segmentation considered here is a sub-problem of object detection, as our goal is to detect and isolate several signals on an image. Two main approaches have been used to address this subject. A first method is to split the object detection pipeline into two steps: candidate boxes are proposed by a first algorithm, and a neural network is then used to classify and correct the candidate boxes. The series of RCNN papers made a first significant breakthrough in deep learning based object detection using such a method.

A Region Proposal Network [25] first selects regions of interest. Another network [26] takes as input each proposed region, for which it proposes, for each possible class (a number K of classes is decided before training), a bounding box and a probability (softmax over all classes). It is this work which led to the Mask-RCNN work [24] mentioned in Section 2.5.2.

Other works propose a one-shot pipeline, directly outputting probabilities and bounding-box regressions from a set of default boxes. You Only Look Once (YOLO) [27] outputs probability and bounding-box adjustments from default boxes that form a grid decomposition of the image. The Single-Shot (Multibox) Detector (SSD) [28] develops a more complex architecture meant to be more precise at different object scales. Like YOLO, predictions are made on a set of default boxes, but these boxes are defined in a different way to be more precise at every scale. Predictions are made on feature maps of different scales, with several anchor boxes (representing different aspect ratios) at each location of each feature map.


Chapter 3

Methods

3.1 Signal Retrieval

The first focus of our work is to apply machine learning methods to the problem of signal retrieval: from an image with a single lead (c.f. Figure 1.2), outputting a 1D vector corresponding to the extracted signal. We limit our work to an output corresponding to the ordinate of the signal at each pixel of the image's width. The estimator we consider can be formalized as:

f : A \to (\hat{y}_i)_{i \in [0..M-1]}    (3.1)

where A is an image array of dimension (M, N, 3) and \hat{y}_i is the predicted ordinate corresponding to abscissa i.

In this section, we present the methods studied to address the following question: regarding the problem of single lead signal retrieval (c.f. Figure 1.2), can CNNs trained on synthetic images (using real ECG signals) overcome the lack of generalization to a variety of input image characteristics of the image processing methods from [3, 4]?

Our goal is to outperform the computer vision methods of [3, 4] in regressing a signal recovered from an image. We intend to replace a pipeline containing several steps with a single deep learning model. In our case, the deep learning model will be a CNN, for which we will discuss possible architectures and outputs. We expect the CNN to adapt to the variability of the training data, therefore advantageously replacing a human-engineered pipeline. For instance, avoiding the need to manually remove the background of the signal can prevent us from creating further noise or from losing information [4].

We will first present our work on data generation for training and evaluation. We will then present the performance of two deep learning architectures and the differences that led from one to the other. Finally, we will compare them to a computer vision benchmark.


3.1.1 Data

Synthetic data

To train our CNNs effectively, we need a large amount of representative scan data. Extracting such data from real scans would require a large paper database as well as considerable annotation time. Instead, to have a sufficient number of examples with good variability, we use synthetic images rather than actual scans. Our synthetic data uses real digital signals, drawn on an image with elements added to resemble a real scan (noise, grid pattern, text, etc.).

We have at our disposal a large database of digital 12-lead ECG signals. From this database, we create synthetic images that mimic a scan of these signals printed on paper. For this first problem of signal retrieval, we intend to mimic a single lead image such as Figure 1.2.

We use the Python Cairo library for graph drawing, as it provides good customization properties.

We want to mimic the variability of real scans, with varying thickness and color of the grid pattern, which depend on the machines. We also draw two common grid patterns, dotted and continuous, as shown in Figure 3.1.


Figure 3.1: The same signal displayed on a dotted grid (a), and on a continuous grid (b).

During training, we also add two common kinds of noise found in scanned images (c.f. Figure 3.2); a sketch of these augmentations is given after the list:

• Gaussian noise: each pixel's value varies following a Gaussian distribution centered on its value.

• Salt-and-pepper noise: each channel of a pixel has a certain probability of being corrupted, that is, turned to value 0 or 255.
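A minimal sketch of these two augmentations on a uint8 RGB array is given below; the noise levels are illustrative values, not the exact settings used in our training.

```python
import numpy as np

# Sketch of the two noise augmentations, applied to a uint8 RGB image array.
# sigma and p are illustrative, not the training-time values.

def add_gaussian_noise(img, sigma=8.0, rng=np.random.default_rng()):
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_and_pepper(img, p=0.01, rng=np.random.default_rng()):
    out = img.copy()
    mask = rng.random(img.shape) < p                # each channel of each pixel independently
    out[mask] = rng.choice([0, 255], size=mask.sum())
    return out
```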



Figure 3.2: Two noise patterns applied to an image with a continuous grid at training time: Gaussian noise (a) and salt-and-pepper noise (b).

Evaluation Data

For evaluation, we use both synthetic and real data.

For synthetic evaluation data, we use the image creation process described in section 3.1.1.

Since the digital signals we use come from our database, we have additional information about them. In particular, we have at our disposal the start and end times of QRS complexes and T waves (Figure 2.1). This is interesting because, as seen with [2], QRS and T waves do not have the same characteristics: the QRS complex is far sharper than the T wave, leading to different issues of signal recovery. Using those time segments, we can compute the mean square error separately on those two portions of the signal, and in that way estimate how the different methods handle high-frequency (QRS) and low-frequency (T) parts of the signal.
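A minimal sketch of this segment-wise evaluation is given below, assuming the annotation times have already been converted into pixel-column ranges; the function name is ours.

```python
import numpy as np

# Sketch of segment-wise evaluation: given predicted and true ordinates (one value per
# pixel column) and the column ranges covered by QRS complexes or T waves, compute the
# MSE restricted to those columns.

def segment_mse(y_pred, y_true, segments):
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    cols = np.concatenate([np.arange(start, end) for start, end in segments])
    return float(np.mean((y_pred[cols] - y_true[cols]) ** 2))

# e.g. compare segment_mse(pred, truth, qrs_segments) with segment_mse(pred, truth, t_wave_segments)
```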

To complete our evaluation dataset, we also turn to real annotated data. The issue, however, is the lack of actual scans for which we have the corresponding signal coordinates. To gather this data, we developed an internal annotation tool, which enabled us to annotate 212 strips from ECG scans that display realistic variability. Six representative samples of this dataset will be used to report the performance of our approaches.

The goal of our annotation tool is to enable the user to extract individual signal strips, each associated with its annotation (the signal's list of coordinates), from a full scanned ECG sheet. The tool is built around an algorithm and an interface.

Extraction algorithm: We use a vertical active contour algorithm, described in Section 3.1.4 as a benchmark for our deep learning methods. This algorithm offers a first proposal of coordinates. To refine this proposal, the user can also place anchor points on the image, through which the predicted curve must pass. The active contour algorithm is then re-run using this newly applied constraint.

Interface: Our interface takes the form of a web-app. The frontend presents an interface on which the user can:

1. Upload an image,

2. Select boxes on the image for annotation,

3. Visualize the proposed signal coordinates and fix them using anchor points, with a zooming feature to ensure precision,

4. Validate rectangles that are considered to have a correct associated signal, or delete rectangles.

Finally, in the backend, the annotated images are stored in a database, and validated boxes can be downloaded along with their annotation (i.e. their list of coordinates).

3.1.2 Curve Retrieval

The first direction we explore is a CNN which takes as input an image, and outputs directly the signal’s coordinates, as described in formula 3.1. In other words, we want the CNN to explicitly regress the vertical position of the curve’s pixels along the x axis.

We will refer to it in the following sections as Curve Net.

Architecture

The architecture described below is illustrated in Figure 3.3.

Input and Output: Our architecture has to output a vector from an image input. Originally, we decided to set fixed width and height for the input image as (W, H) = (256, 480), and to output a vector v = [v_1, ..., v_480], with v_i ∈ [0..255], corresponding to the ordinate of the curve at each abscissa. However, we observe that learning is significantly faster and more efficient when we first apply a linear downsampling to obtain an image of dimensions (W, H) = (240, 128) before giving it as an input to the network.

To predict the same output, this downsampling of the input reduces both the number of parameters of the network and the information available in the input. These two factors can explain this easier learning, as a smaller input space and a less complex model can lead to better generalization, up to a certain point. Finally, our architecture respects the following specifications:

• Input: a tensor A = (A_{i,j,k})_{i=1..240, j=1..128, k=1..3} with A_{i,j,k} ∈ [0..255], corresponding to the downsampled image, where [0..255] corresponds to RGB pixel values.

• Output: a tensor y = (y_i)_{i ∈ [0..479]} with y_i ∈ [0..255], where [0..255] corresponds to the ordinate value of the curve on the original image.


Hidden Layers The base architecture of our network is the 16-layer VGGnet [19], as it has been extensively used with success in works that inspired us [24, 21]. VGGnet is composed of five blocks of convolutions, each followed by a Max-Pooling layer. It then ends with three dense (fully-connected) layers for classification. We use the 4 first blocks of convolutions of this network, and remove the 5th block. Indeed, besides removing significant computation cost, at this block, our tensor would have dimension (15, 8), which we do not consider relevant for the estimation task at hand. We also remove the top dense layers as we do not have the same classification output as the original VGGNet.

The first two blocks of convolutions have two convolutional layers each, and the third and fourth blocks have 3 convolutional layers each. The depth of the output feature map doubles with each block, as the spatial dimensions are divided by two. The feature map has 64 channels after block 1, 128 after block 2, 256 after block 3, and finally 512 after block 4. Our estimator has a different output from the works mentioned regarding image segmentation and edge detection (Sections 2.5.1 and 2.5.2): it outputs a 1D vector instead of a 2D map. However, we keep the idea of a prediction that leverages multiple levels of the CNN to increase precision.

By merging information from the first layers, keeping low-level details of the image, and last layers, keeping object-level information, our intention is to have a pixel-precise (low-level) feature map in which only relevant objects have been kept (high-level: e.g. removing the background).

To leverage the full potential of every level, we merge the output feature maps of each block, following downsampling (for block 1) or upsampling operations (for blocks 2, 3, 4), to obtain a concatenated feature map of dimensions (width×height, depth) = (240×32, 960). Its depth is the sum of the depths of feature maps from blocks 1, 2, 3, 4: (64 + 128 + 256 + 512 = 960).

A horizontal upsampling doubles the width of the merged feature map, giving it dimension (480 × 32, 960).

After obtaining our feature map, the next steps collapse the vertical dimension: one 3 × 3 convolution with stride 1 × 1, followed by a 3 × 3 convolution with stride 1 × 3, which divides the vertical dimension by 3, and finally a column-wise max pooling. We obtain a tensor of dimension (height × width, depth) = (1 × 480, 256).

Finally, two 1D convolutions - (32 filters, window size 1) to reduce the depth-wise dimension and (1 filter, window size 5) for curve regularity - lead us to the desired output: a vector of length 480.


Figure 3.3: Architecture of the deep curve recovery network, where the feature maps from the different VGG16 blocks are upsampled and then all merged together before the final layers.
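A hedged Keras sketch of this architecture is given below. It reproduces the overall structure (first four VGG16 blocks, merging of resized feature maps, collapse of the vertical dimension, two final 1D convolutions), but the intermediate resolutions and activation choices only approximate the text, and the helper name is ours.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of the Curve Net idea (Figure 3.3): reuse the first four VGG16 blocks,
# resize their feature maps to a common grid, concatenate them, collapse the vertical
# dimension, and regress the 480-value output vector with two 1-D convolutions.

def build_curve_net(input_shape=(128, 240, 3), out_len=480):
    backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                           input_shape=input_shape)
    block_outputs = [backbone.get_layer(name).output
                     for name in ("block1_pool", "block2_pool",
                                  "block3_pool", "block4_pool")]
    # Bring every block to the same (height, width) before merging: 64+128+256+512 = 960 channels.
    merged = layers.Concatenate()([layers.Resizing(32, 240)(f) for f in block_outputs])
    x = layers.UpSampling2D(size=(1, 2))(merged)                # horizontal upsampling to width 480
    x = layers.Conv2D(256, 3, strides=(1, 1), padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, strides=(3, 1), padding="same", activation="relu")(x)
    x = layers.Lambda(lambda t: tf.reduce_max(t, axis=1))(x)    # column-wise max pooling
    x = layers.Conv1D(32, 1, activation="relu")(x)              # reduce the depth-wise dimension
    x = layers.Conv1D(1, 5, padding="same")(x)                  # smooth the regressed curve
    return tf.keras.Model(backbone.input, layers.Reshape((out_len,))(x))
```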

Training

We choose to use the mean squared error (MSE) as our loss function. This measures the mean squared distance between prediction and ground truth for each ordinate.

We fine-tune our network, keeping the VGGNet layers fixed and initialized with parameters pre-trained on ImageNet (the network fails to converge when trained from scratch).

For better generalization, we also add an L2 regularization on the weights, with a factor of 10^{-3}.

For training, we use the Adadelta [29] optimizer, updating the parameters as follows:

\Delta\theta_t = -\frac{RMS(\Delta\theta)_{t-1}}{RMS(g)_t} \, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t

where g_t = \nabla_\theta J(\theta), J(\theta) refers to the objective function, \theta refers to the parameters (weights) of the network, and RMS is defined for a variable x at step t as:

RMS(x)_t = \sqrt{E(x^2)_t + \epsilon}, \qquad E(x)_t = \gamma E(x)_{t-1} + (1-\gamma)\, x_t, \qquad E(x)_0 = x_0

where \epsilon is a small constant added to avoid division by zero, \gamma is the chosen decay parameter of the exponential average E(x)_t, and x_0 is the value of the variable at step 0. This ensures a robust adaptive learning rate without the tuning of a global learning rate (as opposed to Stochastic Gradient Descent), and showed a more stable learning curve than Adagrad during our experiments.

We train our network on batches of size 8.


Our network is defined and trained using the Keras library [30] with a TensorFlow backend.

All following networks are trained using the same framework.

The network is trained on 30 000 images generated following Section 3.1.1, associated with their ground-truth signal coordinates. In addition to salt-and-pepper and Gaussian noise, we perform random vertical crops of the images to ensure invariance to the vertical position of the baseline.
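A sketch of this training setup, under the same assumptions as the architecture sketch above, could look as follows; in the actual model the L2 penalty of 10^{-3} would be attached to each trainable layer at construction time via a kernel regularizer.

```python
import tensorflow as tf

# Sketch of the training setup: MSE loss, Adadelta optimizer, frozen pre-trained VGG16
# blocks and batches of size 8. `build_curve_net` refers to the sketch above;
# `train_images` and `train_targets` stand for the synthetic dataset of Section 3.1.1.
# In the full model, every trainable Conv layer would also be built with
# kernel_regularizer=tf.keras.regularizers.l2(1e-3).

model = build_curve_net()
for layer in model.layers:
    if layer.name.startswith("block"):   # keep the ImageNet-initialized VGG16 blocks fixed
        layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adadelta(), loss="mse")
# model.fit(train_images, train_targets, batch_size=8)
```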

3.1.3 ECG Map Retrieval

Here we turn to pixel-wise segmentation to propose an effective signal-recovery model. Our model is a single CNN that takes as input an image, and outputs:

• Intermediate output (Fig. 3.4): a probability map of the same size as the image, where locations corresponding to signal pixels on the image have a value close to 1, and pixels corresponding to anything else have a value close to 0.

• Final output (Fig. 3.5): a softmax map of twice the height and width of the image, which aims at selecting, at each column, one pixel that represents the signal. From this softmax map, we extract the coordinates through a column-wise argmax.

We will refer to it in following sections as ECG Map Net.

Figure 3.4: Probability map obtained after merging the 4 blocks from the VGG16 structure


Figure 3.5: Softmax map obtained after the final layers of the network, including upsampling and column wise softmax activation.

Architecture

Figure 3.6: Architecture of the ECG Map Retrieval Net

Holistically Nested architecture: Our architecture follows the one proposed by Xie and Tu [21] to segment our image into a map of signal presence probability. The idea is to use the last levels of the CNN to eliminate anything that is not signal, and the first layers to be pixel-precise. Following [21], after each block of the VGG Net, we fork into a branch continuing to the next block of convolutions and a branch which gives an intermediate probability map.


We can see in Figure 3.6 a fork at the end of each VGGNet block. Each branch gives an intermediate probability map, shown in the figure. At each of these blocks, this branch contains a 3 × 3 2D convolutional layer (with respective numbers of filters 1, 64, 128, 256 for blocks 1, 2, 3, 4). In the case of blocks 2, 3 and 4, this is followed by a transposed convolutional layer with 1 filter which respectively multiplies the height and width of the feature map by 2, 4 and 8. The output of this branch is therefore a feature map of depth 1 and of dimension (W × H) equal to the dimension of the image. A sigmoid activation is finally applied to obtain a probability map of the same dimension as the final output. This intermediate probability map can be used to apply a multi-level loss, as will be seen in Section 3.1.3. After 4 blocks, we thus obtain 4 intermediate probability maps.

After the first 4 blocks of the VGGNet, we concatenate our 4 probability maps depth-wise, and obtain our final probability map after two additional 3 × 3 convolutions of respective filter depths 256 and 1.

In Figure 3.6, blocks 1 to 4 progressively refine the detection of the signal until the curve is the only part of the image detected with non-negligible probability.

Getting the softmax map: From those intermediate probability maps, we obtain our softmax map through a few additional layers. We merge the 4 intermediate predictions and apply a 3 × 3 convolution of depth 256, whose weights are shared with the convolution following the merge in the first part of the architecture. This is followed by a transposed convolution of depth 1 that doubles the dimensions of our map. Finally, we apply a softmax layer column-wise to outline the location of the signal at each column.

Figure 3.7: Last layers of the ECG Map Retrieval Net
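A minimal Keras sketch of these last layers and of the coordinate extraction is given below; `merged_maps` stands for the depth-wise concatenation of the four intermediate probability maps, and the filter sizes follow the text only approximately.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of the last layers of the ECG Map Net (Figure 3.7) and of the coordinate
# extraction: the merged intermediate maps go through a 3x3 convolution, a transposed
# convolution that doubles the spatial size, and a column-wise softmax.

def softmax_head(merged_maps):
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(merged_maps)
    x = layers.Conv2DTranspose(1, 3, strides=2, padding="same")(x)   # doubles height and width
    return layers.Softmax(axis=1)(x)                                  # softmax over each column (axis 1 = height)

def extract_coordinates(softmax_map):
    # One ordinate per column: the row with the highest softmax value.
    return tf.argmax(softmax_map[..., 0], axis=1)                     # shape: (batch, width)
```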


Training

Losses: We use two losses for the outputs of our network (a sketch of both is given after the list):

• For the probability maps, we use the weighted cross-entropy error outlined in [21], as there is a clear imbalance between positive and negative pixels in the probability map. This error is described in [21] as taking the form:

L(\hat{y}, y) = -\beta \sum_{i,j \,:\, y_{i,j}=1} \log(\hat{y}_{i,j}) \;-\; (1-\beta) \sum_{i,j \,:\, y_{i,j}=0} \log(1 - \hat{y}_{i,j})

where \beta is the proportion of positive samples, \hat{y} is the prediction treated as the probability of each pixel being positive, and y is the binary target.

• For the softmax map, we use a weighted squared error:

L(\hat{y}, y) = \beta \sum_{i,j \,:\, y_{i,j}=1} (1 - \hat{y}_{i,j})^2 \;+\; (1-\beta) \sum_{i,j \,:\, y_{i,j}=0} \hat{y}_{i,j}^2

We penalize errors more weakly than for the probability maps, as several pixels of a column can be relevant choices. This indeed makes convergence possible, whereas weighted cross-entropy did not produce relevant outputs.
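A TensorFlow sketch of the two losses, written directly from the formulas above, could look as follows; `beta` is the proportion of positive pixels as defined in the text, and the epsilon guard is our addition.

```python
import tensorflow as tf

# Sketch of the two weighted losses, with y a binary target map, y_hat the predicted
# map in [0, 1], and beta the proportion of positive pixels. A small epsilon guards the logs.

def weighted_crossentropy(y, y_hat, beta, eps=1e-7):
    pos = -beta * tf.reduce_sum(y * tf.math.log(y_hat + eps))
    neg = -(1.0 - beta) * tf.reduce_sum((1.0 - y) * tf.math.log(1.0 - y_hat + eps))
    return pos + neg

def weighted_squared_error(y, y_hat, beta):
    pos = beta * tf.reduce_sum(y * (1.0 - y_hat) ** 2)
    neg = (1.0 - beta) * tf.reduce_sum((1.0 - y) * y_hat ** 2)
    return pos + neg
```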

We use the Adadelta optimizer and apply the same regularization on weights as in section 3.1.2.

Influence of intermediate losses By intermediate losses we mean two things:

• The probability map that is an intermediate output to our final output: the SoftMax map.

• The intermediate probability maps obtained after each block of convolutional layers.

Applying a loss on the probability map : We find that the use of the probability map as an intermediate output is what enables the model to converge. Indeed, with it, training from scratch gives better results than fine-tuning on VGG. But without it - i.e. using only the softmax output in the loss calculation - training does not converge.

Multi-scale losses : A multi-scale loss is used in [21], applied both to the final outputs and the intermediate outputs. In our case those intermediate outputs are the probability maps obtained after blocks 1 to 4. We therefore sum the weighted cross-entropy errors of all 5 levels.

We note that in our case, using these multi-scale losses provides a better training curve, as outlined in figure 3.8. We see that the loss on the SoftMax Layer goes down to around 1400 with the use of intermediate losses, while stabilizing to around 1600 without them.

This is not surprising as we directly backpropagate the error on intermediate losses to their respective blocks, avoiding too deep a backpropagation.


Figure 3.8: Effect of the use of multi-scale losses on training: evolution of the final output loss (weighted squared error) with epochs.

Training data Here we present the data used for training, along with an example in figure 3.9. The images used are those described in section 3.1.1. To these images we associate the two outputs necessary for training:

• A probability map that corresponds to all pixels that are colored when drawing the curve. This includes anti-aliasing pixels; this region of interest in the probability map is thus a bit thicker than the actual curve.

• A binary map with value 1 at only one pixel per column, corresponding to the curve’s exact coordinates.

We augment these data during training by adding random text, random noise and random vertical crops.

Figure 3.9: Example of a training sample: (a) input image augmented with random text; (b) intermediate output, the probability map; (c) final output, the max map.

Histogram Equalization: Before applying the CNN to our data, we add a preprocessing step of histogram equalization. The motivation for histogram equalization - mentioned in Section 4.1.1 - is better generalization from our synthetic data. The effect on real-world data estimation is shown in Figure 4.4.


3.1.4 Benchmark

Here, we intend to implement computer vision methods representative of the state of the art for comparison with our methods. We select two methods:

• A column-wise pixel scan method, selected as an example of the common multi-step pipeline described in Chapter 2, and close to [4].

• Vertical active contour, as described by Badilini et al. [3], which stands out from other methods as it does not require preprocessing of the image, and imposes continuity constraints on the result.

Column-wise Pixel Scan

This method performs the following steps (a code sketch of this pipeline is given after the list):

1. Grayscale conversion of the image

2. Median filtering with a kernel of size 3 (we find that a first filtering of the image increases performance).

3. Grayscale thresholding: we use a 2-component k-means initialized with centroids at values 0 and 255. We assign value 1 to pixels belonging to the cluster with the lowest intensity (making the assumption that lower intensities correspond to the signal), and value 0 to the rest.

4. Median-filtering of kernel 3, as motivated in [4] by the creation of salt & pepper noise during the thresholding procedure.

5. Column-wise pixel scan: At each column, we select the lowest and highest positive pixels as the envelope of the signal. The signal’s coordinates are then computed as the mean of the upper and lower envelopes.
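A sketch of this benchmark pipeline is given below; the choices of SciPy for median filtering and scikit-learn for k-means are ours, not necessarily those of the thesis code.

```python
import numpy as np
from scipy.ndimage import median_filter
from sklearn.cluster import KMeans

# Sketch of the column-wise pixel scan benchmark; `rgb` is a uint8 RGB image array.

def column_wise_scan(rgb):
    gray = rgb.mean(axis=2)                                   # 1. grayscale conversion
    gray = median_filter(gray, size=3)                        # 2. first median filtering
    km = KMeans(n_clusters=2, init=np.array([[0.0], [255.0]]), n_init=1)
    labels = km.fit_predict(gray.reshape(-1, 1)).reshape(gray.shape)
    signal_cluster = np.argmin(km.cluster_centers_.ravel())   # darkest cluster = signal
    binary = (labels == signal_cluster).astype(np.uint8)      # 3. grayscale thresholding
    binary = median_filter(binary, size=3)                    # 4. remove salt & pepper noise
    coords = np.full(binary.shape[1], np.nan)
    for col in range(binary.shape[1]):                        # 5. mean of upper/lower envelopes
        rows = np.flatnonzero(binary[:, col])
        if rows.size:
            coords[col] = (rows.min() + rows.max()) / 2.0
    return coords
```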

Active Contour

The active contour algorithm consists in iteratively moving the points of a curve in the direction that minimizes a defined energy function. As stated by Badilini et al. [3], that energy function is:

E(s(x)) = λlineEline(s(x)) + λsmoothEsmooth(s(x)) + λlengthElength(s(x)) (3.2) Where Eline corresponds to an external energy function, that is minimized in lower intensity regions of the image, and Esmooth and Elength are internal energy functions that diminish with lower length and higher smoothness, to which λ weighting parameters are applied:

$$E_{line} = I(x, y), \qquad E_{smooth} = \frac{1}{2}\,\alpha \left\lVert \frac{dv}{ds} \right\rVert^2, \qquad E_{length} = \frac{1}{2}\,\beta \left\lVert \frac{d^2 v}{ds^2} \right\rVert^2$$


where α and β are parameters chosen by the user.

We keep only the vertical component of the scikit-image [31] implementation, which minimizes the energy in Equation 3.2 using the method of Kass et al. [32]:

$$A y_n - \Delta P_y(y_{n-1}) = -\gamma\,(y_n - y_{n-1})$$

$$\Rightarrow \; y_n = (A + \gamma I)^{-1}\left(\gamma\, y_{n-1} + \Delta P_y(y_{n-1})\right)$$

where $A$ corresponds to the second- and fourth-order derivative operator, $y$ is the vector of vertical coordinates of the curve, and $\Delta P$ is the gradient of the external potential corresponding to the intensity of each pixel of the image. Since the update of $y_n$ depends on external information only through the local gradient $\Delta P_y(y_{n-1})$, a starting point far from the signal curve (and possibly close to local minima such as grid lines) makes it hard for the active contour to evolve towards the expected result. A solution to this issue is to apply the active contour to a smoothed image.

We performed a grid search on the parameters $\alpha$, $\beta$ and $w_{line}$ to find the best-performing values. However, we did not find a set of parameters that accurately captures QRS complexes while limiting the noise on the baseline. The performance of our method is not meant to fully reflect the performance of other works using active contour (e.g. [3]): we use the same energy formulation, but possibly different parameter values and a different energy minimization algorithm.
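For reference, a rough sketch of such a baseline using the unmodified scikit-image implementation is shown below. The thesis version additionally restricts point motion to the vertical axis; the parameter values, boundary condition and file name here are placeholders.

import numpy as np
from skimage import color, filters, io
from skimage.segmentation import active_contour

image = color.rgb2gray(io.imread("lead_strip.png"))
smoothed = filters.gaussian(image, sigma=2)   # smoothing helps the snake escape grid lines

# Initialise the snake as a horizontal line through the middle of the strip.
n_cols = image.shape[1]
init = np.stack([np.full(n_cols, image.shape[0] / 2.0),    # rows (vertical coordinates)
                 np.arange(n_cols, dtype=float)], axis=1)  # columns

# Negative w_line attracts the contour towards dark pixels (the ink of the trace).
snake = active_contour(smoothed, init,
                       alpha=0.01, beta=0.1, gamma=0.01,   # placeholder weights
                       w_line=-1, w_edge=0,
                       boundary_condition="fixed")         # endpoints kept fixed here

signal_rows = snake[:, 0]   # estimated vertical coordinate of the curve at each column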

Finally, to allow correction of the algorithm's proposals, we add the possibility of setting anchor points: points that remain fixed during the iterations of the active contour algorithm. These anchor points are not used during the evaluation of active contour; we used them to annotate real-world data for evaluation, as described in section 3.1.1.


3.2 Signal Segmentation

3.2.1 Problem

To automate the ECG digitization pipeline, a signal segmentation step is essential. The goal of signal segmentation is to detect and isolate the different signals on an ECG sheet.

Several factors add complexity to the issue.

Diversity of layouts: In a resting ECG (recorded during a medical exam), the duration of the recording is 10 seconds, and the number of leads (signals) is usually 12, corresponding to the different directions along which the voltage is recorded.

Although ECG paper sheets are well formatted, these 12 leads can be displayed in different ways, the most common being:

• A 2 × 6 layout: 6 rows, on each of which 2 signals are displayed one after the other. The first 5 seconds come from the first signal's recording, and the last 5 seconds come from the second signal's recording.

• A 3 × 4 layout: In this layout, there are 4 signals per row, each spanning 2.5 seconds.

In these common layouts, rows are sometimes added at the bottom of the sheet. These rows can be called "rhythm leads": on each of these rows, only one signal is displayed (i.e. spanning all 10 seconds). Their goal is to enable the reader to follow the rhythm and to provide a reference for the reading of the other leads. In the case of Figure 1.1, the layout is 2 × 6 with one rhythm lead.

Signal overlap: A certain space on the amplitude axis is allotted to each signal. Signals with a higher amplitude than this allotted space (in particular during QRS complexes) can overlap with other signals. This is the case in Figure 3.10, where leads V2, V3 and V4 overlap. This makes it very difficult to recover each of the overlapping signals; in particular, it is hard to precisely detect their individual bounding boxes.


Figure 3.10: ECG sheet with overlapping signals

Image quality: As with individual leads, variability also lies in the quality of the scan and in paper or ink ageing. Text, handwritten or machine-printed, can appear around or even on the curves.

3.2.2 A two-step pipeline

We first intend to train a single object detection method - such as those mentioned in section 2.5.3 - on synthetic data representing a full ECG sheet, associated with the bounding boxes of all signals (section 3.3.1). However, a single unconstrained object detection pipeline would not be a good fit for ECG sheets: as opposed to real-world object detection, the detection of leads on an ECG sheet should exploit - and comply with - the formatting of an ECG.

The layouts described in 3.2.1 display a grid of signals. Horizontally, this means that leads start and end simultaneously, so locating each lead independently would be both unnecessary and problematic: a discrepancy in the detection of horizontal boundaries could lead to poor time synchronization in the reconstructed signal. What remains to be treated independently are the vertical boundaries of each signal, as amplitudes are unconstrained and specific to each signal. This is also where the core challenge lies, as overlapping signals make vertical bounding box regression more complex.


Figure 3.11: Dividing the issue: layout detection followed by column segmentation

This leads us to divide our pipeline into two steps (c.f. Figure 3.11):

1. Layout detection (Section 3.3): evaluating the number of rows and columns of signals on the sheet, as well as the bounding box of the signal area.

2. Column Segmentation (Section 3.4): On each detected column, locating each signal and estimating vertical bounding boxes for these signals.

3.3 Layout Detection

Here we aim at detecting the layout of an ECG sheet, which combines classification and regression tasks. Taking an image as input, we output the bounding box of the ECG signal region and the numbers of columns, rows and rhythm leads (c.f. Figure 3.12). To that end, we use a CNN whose top layers are 4 parallel dense layers, one for each of these 4 outputs.


Figure 3.12: Layout detection - Specification

3.3.1 Synthetic Data Generation

We do not have a sufficient number of real scans, and those we have represent only a fraction of the actual variability of scanned ECG sheets. Furthermore, annotating a sufficient number of images would require a considerable amount of time.

To address the variability of scans described in section 3.2.1, we build a pipeline for synthetic image generation with several controllable factors of variability in its output, illustrated in Figure 3.13:

• Different layouts: 3 × 4 (a, b), 2 × 6 (c, d).

• Random number of rhythm leads added after this layout, between 0 and 4.

• Different types of separators between signals in a row: dotted (a) or continuous (b).

• Different layouts of calibration square waves: both sides (a) or a single side (b, c, d)

• Added artifacts both in the upper and lower margins (shapes and text of different opacity), and on the signal (traces (b, c) or added text)

• The same variability as signal retrieval data: Variation of the color and type of grid, color and thickness of the signal.


Figure 3.13: Different examples of synthetic images generated from 12-lead ECGs

We use the PyCairo library to generate these images, and we generate the ground-truth annotations associated with each image by saving the bounds of the signal region and the numbers of columns, rows and rhythm leads.
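The sketch below illustrates the kind of PyCairo drawing primitives involved, generating a grid and a single synthetic trace. The grid spacing, colors and the random noise standing in for the signal are illustrative only; the actual generation pipeline is considerably richer.

import numpy as np
import cairo

WIDTH, HEIGHT = 1100, 400
surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, WIDTH, HEIGHT)
ctx = cairo.Context(surface)

ctx.set_source_rgb(1, 1, 1)   # white paper background
ctx.paint()

# ECG-style grid: thin line every 5 px, thicker line every 25 px (illustrative spacing).
ctx.set_source_rgb(1.0, 0.75, 0.75)
for x in range(0, WIDTH, 5):
    ctx.set_line_width(1.5 if x % 25 == 0 else 0.5)
    ctx.move_to(x + 0.5, 0)
    ctx.line_to(x + 0.5, HEIGHT)
    ctx.stroke()
for y in range(0, HEIGHT, 5):
    ctx.set_line_width(1.5 if y % 25 == 0 else 0.5)
    ctx.move_to(0, y + 0.5)
    ctx.line_to(WIDTH, y + 0.5)
    ctx.stroke()

# One synthetic lead drawn as a polyline (random noise stands in for a real ECG signal).
rows = HEIGHT / 2 + 20 * np.random.randn(WIDTH)
ctx.set_source_rgb(0, 0, 0)
ctx.set_line_width(1.2)
ctx.move_to(0, rows[0])
for x in range(1, WIDTH):
    ctx.line_to(x, rows[x])
ctx.stroke()

surface.write_to_png("synthetic_lead.png")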

3.3.2 Model

Architecture Our architecture takes as input a 224 × 224 image and is composed of 5 convolutional layers with 3 × 3 kernels, each followed by a max pooling layer. On top of this convolutional block, we add one dense layer of size 2048, followed by 4 parallel dense layers. One of them predicts the coordinates $(x_1, y_1, x_2, y_2)$ of the bounding box. The other three, predicting the numbers of rows, columns and rhythm leads, are each followed by a softmax activation layer and predict a categorical output of the form $(p_i),\ i \in \{1, \dots, N_{categories}\}$, where the $p_i$ are probabilities summing to 1 and $N_{categories}$ is:

• 4 for the number of columns: between 1 and 4 columns

• 14 for the number of rows: between 1 and 14 rows

• 5 for the number of rhythm leads: between 0 and 4 rhythm leads.


This approach is illustrated in Figure 3.14.

Figure 3.14: Architecture of the Layout Detection Network
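As a complement to Figure 3.14, a minimal Keras-style sketch of this architecture is given below. The framework, the number of filters per convolutional layer and the input channel count are not specified in this section, so they are assumptions here.

from tensorflow.keras import layers, Model

def build_layout_detector():
    inputs = layers.Input(shape=(224, 224, 3))              # assuming an RGB input
    x = inputs
    for n_filters in (32, 64, 128, 256, 256):               # assumed filter counts
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(2048, activation="relu")(x)

    bbox = layers.Dense(4, name="bbox")(x)                              # (x1, y1, x2, y2)
    columns = layers.Dense(4, activation="softmax", name="columns")(x)  # 1 to 4 columns
    rows = layers.Dense(14, activation="softmax", name="rows")(x)       # 1 to 14 rows
    rhythm = layers.Dense(5, activation="softmax", name="rhythm")(x)    # 0 to 4 rhythm leads
    return Model(inputs, [bbox, columns, rows, rhythm])

model = build_layout_detector()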

Training We use the following losses on our 4 outputs:

• For the regression loss (bounding box estimation), we use the smooth L1 loss used in [26, 28]:

$$L(\hat{y}, y) = \sum_{i=1}^{4} \mathrm{smooth}_{L1}(\hat{y}_i - y_i)$$

where $\hat{y}$ and $y$ represent the prediction and the target for the 4 bounding box coordinates, and

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| & \text{otherwise} \end{cases}$$

• For the classification losses (number of rows, columns and rhythm leads), we use the categorical cross-entropy:

$$\mathrm{Error}(\hat{p}, p) = -\sum_{i=1}^{N_{categories}} p_i \log(\hat{p}_i) = -\log(\hat{p}_k)$$

where $p$ is the one-hot target probability vector, $k$ is the expected category and $\hat{p}$ is the predicted probability.
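These losses could be wired up roughly as follows; the smooth L1 term is written as a custom Keras loss, and the output names refer to the hypothetical model defined in the sketch after Figure 3.14, not to the thesis code.

import tensorflow as tf

def smooth_l1(y_true, y_pred):
    # Smooth L1 loss, summed over the 4 bounding-box coordinates.
    diff = tf.abs(y_true - y_pred)
    per_coord = tf.where(diff < 1.0, 0.5 * tf.square(diff), diff)
    return tf.reduce_sum(per_coord, axis=-1)

model.compile(optimizer="adam",
              loss={"bbox": smooth_l1,
                    "columns": "categorical_crossentropy",
                    "rows": "categorical_crossentropy",
                    "rhythm": "categorical_crossentropy"})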

We train our network on 30,000 images generated as described in section 3.3.1, each associated with a ground-truth bounding box (illustrated in Figure 3.15) and categories.

Figure 3.15: Synthetic images with their ground-truth bounding boxes. (a) 6 rows, 2 columns and 4 rhythm leads. (b) 3 rows, 4 columns and 1 rhythm lead.

3.4 Column Segmentation

Our final task is to propose a method for column segmentation. Assuming the first step of segmentation (layout detection) has been carried out, we now have at our disposal a column of signals; our task is to determine the bounding box of each signal in the column.

The main challenge in this task is the case of overlapping signals, which makes bounding box delimitation more difficult. This challenge cannot really be addressed by classical computer vision methods, and our goal is to allow overlapping bounding boxes in order to capture the precise locations of overlapping signals. To measure the performance of our model, we aim at answering the question: can an object detection deep learning method locate individual ECG signals on a scan with more precision than approaches using thresholding and clustering inspired by [5]?

3.4.1 Data

For this task, our generated data corresponds to columns of signals. The data variability differs from the layout data (section 3.3.1) in the following ways:
