
Simulating Fetal ECG Using Machine Learning on Ultrasound Images


DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2020

Simulating Fetal ECG Using Machine Learning on Ultrasound Images

MATHILDA VILLOT BERLING
JULIA ÖNERUD

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH

This project was performed in collaboration with

Center for Fetal Medicine, Department of Obstetrics and Gynecology, Karolinska University Hospital

Supervisors: Jonas Johnson and Lotta Herling

Simulating Fetal ECG Using Machine Learning on Ultrasound Images

Simulering av foster-EKG genom maskininlärning på ultraljudsbilder

MATHILDA VILLOT BERLING
JULIA ÖNERUD

Degree project in medical engineering
First level, 15 hp

Supervisors at KTH: Tobias Nyberg, Mattias Mårtensson
Examiner: Mats Nilsson

KTH Royal Institute of Technology

School of Engineering Sciences in Chemistry, Biotechnology and Health
SE-141 86 Flemingsberg, Sweden

http://www.kth.se/cbh
2020


Abstract

ECG is used clinically to detect a multitude of medical conditions, such as heart problems like arrhythmias and heart failure, and to give a good general image of the function of the heart with a quick and harmless exam. In many clinical cases, normal ECG measurements cannot be taken, for example in fetuses, where ECG signals from the mother's own body interfere with the measurement. This paper examines the use of machine learning algorithms to simulate ECG graphs from ultrasound data alone. These algorithms are trained on ultrasound and ECG data acquired simultaneously from the same patient. The data used to train the algorithms was taken from samples acquired from 100 adult patients. The results of this method for simulating an ECG indicate good potential for future usefulness, where machine learning to acquire a simulated ECG can help clinicians evaluate fetal heart function, as well as in other cases where ECG cannot be measured normally.

Keywords: ECG, Ultrasound, Fetal-ECG, Heart, Machine learning, Simulated


Sammanfattning

ECG is used clinically to detect a variety of conditions, such as heart failure and arrhythmias, but also to give a general picture of heart function with a quick and harmless examination. In many clinical cases, however, normal ECG measurement is not possible, such as for fetuses, where ECG signals from the mother's own body interfere with the measurement. This paper examines the use of machine learning algorithms to simulate ECG graphs from ultrasound data alone. These algorithms were trained on ultrasound and ECG data obtained simultaneously from the same examination of a patient. The ultrasound data used in this paper came from 100 measurements from different adult patients. The results of the examination of the ECG simulation method indicate good potential for future usefulness, since machine learning algorithms for simulating ECG can make it easier for clinicians to evaluate fetal heart function, or in other cases where ECG cannot be measured normally.

Keywords: ECG, Ultrasound, Fetal ECG, Heart, Machine learning, Simulated


Contents

Abstract
Sammanfattning
Contents
Abbreviations
Glossary
1 Introduction
1.1 Aim
2 Background
2.1 Tissue Doppler
2.2 Electrocardiography
2.3 Fetal Heart Physiology
2.4 Importance of Fetal Echocardiography
2.5 Machine Learning
2.6 Artificial Neural Networks
2.7 Statistical Methods
2.8 Machine Learning in Fetal Cardiology
3 Method
3.1 Datasets
3.2 Programming language and hardware
3.3 Initial processing of training datasets
3.4 Further processing of training datasets
3.4.2 Dataset 1B: Tissue Doppler dataset of one heart cycle
3.4.4 Dataset 2B: Cine-loop dataset, rate of change
3.4.5 Dataset 2C, 2D, 2E and 2F: Cine-loop dataset, rate of change and minimized data
3.5 Training the algorithm
3.6 Evaluating performance
3.6.1 Dataset 1A and 1B: The tissue Doppler datasets
3.6.3 Dataset 2: The cine-loop datasets
3.6.4 Dataset 2F: The best performing cine-loop dataset
3.7 Testing on fetal data
4 Results
4.1 Dataset 1A: The tissue Doppler dataset of multiple heart cycles
4.2 Dataset 1B: The tissue Doppler dataset of one heart cycle
4.3 Dataset 2B through E: The cine-loop datasets
4.4 Dataset 2F: The cine-loop dataset, rate of change and minimized data
5 Discussion
5.1 Results on dataset 1
5.2 Results on dataset 2
5.3 Fetal data results
5.4 Results regarding aim
5.5 General improvements of models
5.6 Improvements for results on fetal data
5.7 Results of study on the future of fetal diagnostics
6 Conclusion
7 References
Appendix 1: Optimised parameters for learning models


Abbreviations

ML – Machine learning
ANN – Artificial neural network
MSE – Mean squared error
ECG – Electrocardiogram
PCC – Pearson correlation coefficient
ROI – Region of interest

Glossary

Hyperparameters – layers and nodes in an ANN
Mean squared error – difference between predicted value and true value, squared
Variance – flexibility of the model
Bias – error of the model
Overfitting – a model highly adapted to training data that generalizes poorly
R-squared – statistical measure based on variance
Pearson – statistical measure based on linearity
Cine-loop – echocardiography images in digital form, as a sequence with a determined number of frames

1 Introduction

Electrocardiogram (ECG) is used clinically to detect a multitude of medical conditions, such as heart problems like arrhythmias and heart failure, and also to give a good general image of the function of the heart [1]. In situations with fetal cardiac dysfunction and structural cardiac anomalies, it is important for both the fetus and the mother to detect these problems prenatally, to minimise perinatal complications and to have more time to prepare for possible post-birth surgeries or interventions. Problematically, the results from classical ECG are of significantly diminished quality when performed on a fetus compared to a postnatal patient. This is because the mother's own electrical signals from the heart and body add a large amount of noise to the measurement [2]. A method of obtaining a fetal ECG could therefore be an important tool for diagnosing cardiac conditions.

Echocardiography can be used to obtain images of the fetal heart, and with tissue Doppler ultrasonography the velocity of the walls of the fetal heart can be obtained in order to evaluate fetal heart function. Tissue Doppler ultrasonography shows promise in assessing fetal cardiac function. However, it requires an experienced sonographer spending a large amount of time analysing the data. This is where an ECG curve could be of great help in evaluating fetal cardiac function more accurately and efficiently, due to the simplicity of the ECG.

Many studies have been performed on different methods of extracting the fetal ECG signal from the ECG signal of the mother, via filtering and data separation, but these have problems with accuracy because the noise cannot be completely removed [2]. The solution proposed here is to instead obtain the supplementary ECG from the less noisy ultrasound measurements, using algorithms trained with machine learning. These algorithms learn from data from adult patients containing both ultrasound measurements and classical ECG measurements. Once trained, the algorithms are tested on prenatal patients. The aim of this project is therefore to create machine learning algorithms that can produce an ECG from ultrasound data, and to assess how well they work.

1.1 Aim

The aim of this project was to train a model that could produce a plausible simulated ECG curve from tissue Doppler ultrasound data sampled from fetuses, and for unseen ultrasound data in adults also achieve:

1) A result better than noise

2) P-wave, QRS-complex and T-wave visible in every heart cycle for 90% of the test samples

3) A mean Pearson correlation coefficient (PCC) score of at least 0.7

2 Background

The background of this project is twofold: knowledge about the diagnostic techniques and physiology, combined with the science of machine learning. This background segment provides the insights needed to understand our study in both these aspects.

2.1 Tissue Doppler

Tissue Doppler is a form of echocardiography that measures the velocity of the myocardium (heart muscle) throughout one or more heartbeats using the Doppler effect. The Doppler effect is the principle that ultrasound reflected back from an object will have an altered frequency depending on the velocity of the object it is reflected on [3]. Therefore, by simply comparing the frequencies sent out and received by the transducer, motion can be deduced [3]. The velocities of the myocardium, the heart valves and the blood can all be used to find signs of heart defects and problems, which makes tissue Doppler one of the more important modalities when it comes to cardiovascular defects and diseases.

2.2 Electrocardiography

Electrocardiography is the process of creating an ECG, which is a graph of the measured electrical activity of the heart. The electrical activity is measured using a multitude of electrodes placed in direct contact with the skin, which detect the small electrical changes in the body that result from cardiac muscle depolarization and repolarization during each heartbeat [4]. The ECG contains three main parts, known as the P-wave, the QRS complex and the T-wave (Figure 1). The P-wave represents atrial depolarization, the T-wave represents the repolarization of the ventricles, and the most important one, the QRS complex, represents a combination of the depolarization of the left and right ventricles and the contraction of the large ventricular muscles [5]. Clinicians can quickly detect cardiac anomalies by looking at the ECG, and these three main parts in particular, and seeing whether the amplitude of a certain part looks strange or a certain interval is too long.

Figure 1: Left: Schematic diagram of normal sinus rhythm for a human heart as seen on ECG (with English labels), with the P part representing the depolarization of the left and right atrium, the QRS part representing electrical impulses spreading through the ventricles and indicating ventricular depolarization, and the T part representing ventricular repolarization [6]. Right: ECG (green) and simultaneous tissue Doppler (yellow) in a combined plot.

2.3 Fetal Heart Physiology

The heart of the fetus is markedly different from an adult heart, both in physiology and in function. These differences are partly due to the fetus still being in stages of development, having a much higher amount of stem cells in circulation and a vastly different circulatory need compared to an adult [7]. One clear difference that exists because of this is the much higher heart rate of a fetus compared to an adult, with heart rates between 120 and 160 bpm being normal [8]. The fetus is also fully dependent on the placenta, which is located inside the womb with connections to both the uterus and the liquid-filled sac within which the fetus is held. Oxygen and nourishment are transferred through the placenta and via the umbilical cord to the fetus, and there is no direct contact between the circulatory systems of the fetus and the mother. The lungs of the fetus are filled with amniotic fluid during gestation, and only a small amount of blood is pumped past the lungs [7].

Figure 2: Fetal Circulatory System-02.jpg, CC BY 3.0 License [9]

Since there is less need for blood to be pumped past the lungs while they are filled with amniotic fluid, the fetal heart does not have a separate pulmonary artery and aorta. Instead, they are connected by a blood vessel called the ductus arteriosus. This extra blood vessel closes after birth, and the pulmonary artery and aorta become separate. There is also an opening between the left and right atria in the fetal heart, called the foramen ovale, which allows blood to flow directly from the right atrium to the left (Figure 2) [9]. As with the ductus arteriosus, the foramen ovale also closes and disappears shortly after birth [7].

2.4 Importance of Fetal Echocardiography

Fetal ECG is important because it would aid the clinician in correctly diagnosing the fetus, which in turn makes two things possible. Firstly, it helps with planning the perinatal management and identifying what kind of intervention may be required in the delivery room or within the first days of life. Secondly, it helps to identify fetuses who may benefit from fetal cardiac intervention, meaning different medicines or surgery on the fetus's heart while in the womb [10]. Fetal echocardiography is used to detect arrhythmias, a collection of ailments where the heart beats irregularly. Examples include atrioventricular block (AV block), an impairment of the electrical signals in which the atria and ventricles beat asynchronously, and supraventricular extrasystole (SVES), an early depolarisation which causes the heart to beat irregularly [5].

2.5 Machine Learning

Machine learning (ML) is the science of computational learning, combining statistics with computer science to build algorithms which can process data and derive complex conclusions and models that would otherwise be impossible to discern. ML algorithms can be categorized in different ways, but all are based on inputs: the measured data which in turn affects the output of a system [11].

2.6 Artificial Neural Networks

There are many different methods in machine learning for achieving a well-performing model; a very flexible and diverse method is the artificial neural network (ANN), a nonlinear statistical model [12]. According to Rebala [13], ANNs were initially created to mimic the function of neurons in the human brain in an oversimplified manner: each neuron is modelled with multiple inputs and one single output, and the neurons are connected to each other in a network. He further states that the neurons, or nodes, form columns, or layers, that are not connected vertically inside the layer, but the layers are in turn connected to each other via each node (Figure 3). In computer science terms, the artificial neuron is simply a function regulated by a weight factor that controls the strength of its impact on other artificial neurons or functions via their connection [13].

Figure 3: Schematic overview of an ANN; the rectangles represent the different layers, circles are the artificial neurons/nodes, and thin lines represent the connections between different nodes. a) input layer b) hidden layer c) output layer.

The input layer of artificial neurons gathers input data from the given dataset and sends that information to the next layer, weighted accordingly [14]. The middle, or hidden, layers process this input with an activation function, typically a sigmoid function, and send the information to the output layer, which in turn produces an output [13]. The artificial neural network can have multiple layers and multiple nodes in each layer, i.e. hyperparameters [15]. Sharma [16] explains the sigmoid functions as a class of functions with similar shapes and attributes, resembling an 'S' shape (Figure 4, right); examples include Softmax, the logistic function and tanh. Their purpose as an activation function is to make the connections non-linear, which is needed for the ANN to find complex correlations [17]. Another activation function is ReLU (Figure 4, left), which according to Sharma [16] is the most commonly used activation function today; both shapes are written out in the sketch after Figure 4.


Figure 4: Example of a Sigmoid function with typical 'S' shape (right) and ReLU function (left).
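For concreteness, the two activation shapes in Figure 4 can be written out as functions. The following is a small illustrative sketch (NumPy is used only for demonstration; these are the textbook definitions, not code from the project):

```python
import numpy as np

def logistic(x):
    # The classic sigmoid: squashes any input smoothly onto (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

# tanh, another sigmoid-shaped activation mapping onto (-1, 1), is np.tanh.
```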

For the artificial neural network to correctly model the behaviour of the system, it needs to be trained on a given dataset. To train the algorithm, the standard approach is to change the weights according to the stochastic gradient descent (SGD) method, although other methods exist [18]. The SGD method uses partial derivatives of a loss function (a function defining how wrong the algorithm is) with respect to the weights in order to find a local minimum of the loss function, and changes the weights accordingly [13]. The limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm is another extensively used optimisation algorithm. Although it has limitations, it often converges faster than standard SGD [19]. A variant of the SGD method is Adam, presented by Scikit-learn [20] as an SGD-based optimiser that works well on larger datasets. They further explain that LBFGS is more useful for smaller datasets, with faster convergence and better results.

The multilayer perceptron is a basic ANN, with inputs flowing through the network in a unidirectional way – forward [18]. The documentation for the Multilayer Perceptron Regressor (MLPRegressor) from Scikit-learn describes the learning algorithm with 23 tuning parameters, for example hidden layer sizes, activation function and solver (optimisation method) [20]. The documentation further shows that the available solvers are LBFGS, SGD and Adam. The MLPRegressor has five methods: 'fit', which uses training data (both input and target) to train the model; 'predict', which predicts an output given an input after training; 'score', which evaluates the model; and 'get_params' as well as 'set_params', which are methods for configuring parameters [20].
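As an illustration of these five methods, here is a minimal sketch with placeholder data and illustrative hyperparameter values, not those used in this project:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(20, 300)   # placeholder inputs, e.g. velocity traces
y = np.random.rand(20, 300)   # placeholder targets, e.g. ECG traces

model = MLPRegressor(hidden_layer_sizes=(64,), activation='relu',
                     solver='lbfgs', max_iter=500)
model.fit(X, y)                   # train on input/target pairs
y_pred = model.predict(X)         # predict an output for a given input
r2 = model.score(X, y)            # evaluate the model (R2 score)
params = model.get_params()       # read the current parameter configuration
model.set_params(max_iter=1000)   # reconfigure a parameter
```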

2.7 Statistical Methods

The most common way to evaluate the accuracy of an ML regression model is to use the mean squared error (MSE) [17]. Since there is little interest in how well the model performs on training data, an unseen portion of the data from the dataset, also called the testing data, is used to assess model performance. Overfitting is a common problem in ML algorithms, not least in regard to artificial neural networks. Overfitting occurs when the algorithm is too flexible in regard to the training portion of the dataset, and perceives patterns occurring randomly in the training dataset that are not properties of the system (Figure 5) [21]. When a loss function is at its minimum, the model is usually overfitted. To prevent overfitting there are a number of methods that reduce the flexibility of the model, for example weight decay or an early stopping rule [12].

Figure 5: Example of a regression problem a) Overfitted example b) Not overfitted example c) true regression

R-squared (R2) is a commonly used statistical measure for quantifying the variance of a regression problem; it usually measures the overall fit of the model on a scale from negative infinity to 1, with higher scores indicating a better fit [22]. Another statistical measure is the Pearson correlation coefficient (PCC), which measures the strength of the linear association between two variables on a scale from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating uncorrelated variables [23].
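A small sketch of how these three measures can be computed for a predicted and a true ECG trace with standard library calls (the arrays here are placeholders):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import pearsonr

y_true = np.sin(np.linspace(0, 6, 300))        # placeholder "true" ECG
y_pred = y_true + 0.1 * np.random.randn(300)   # placeholder prediction

mse = mean_squared_error(y_true, y_pred)   # lower is better
r2 = r2_score(y_true, y_pred)              # (-inf, 1], closer to 1 is better
pcc, _ = pearsonr(y_true, y_pred)          # [-1, 1], strength of linear association
```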

2.8 Machine Learning in Fetal Cardiology

Garcia-Canadilla et al. [24] state that ML in fetal cardiology is of great interest and under active development, since evaluations of cardiac function and structures in fetuses often face challenges; examples include fetal movement, small heart size and inexperienced medical personnel. Garcia-Canadilla also states that ML can facilitate the assessment of the fetal heart, for example by improving image acquisition, extracting information for evaluation and diagnosing abnormalities. Many papers have been published on the extraction of maternal ECG from abdominal ECG readings to produce a fetal ECG or fetal QRS complexes with the use of machine learning methods. For example, Yu et al. [25] propose using independent component analysis, Muduli et al. [26] focus on deep learning, and Lukosevicius et al. [27] propose a method using ANN. Another approach, from Sulas et al. [28], is to use data from pulsed-wave Doppler to extract features, including the heartbeat of the fetus, using ANN.

One issue that arises when using ML methods to diagnose conditions is the "black-box" effect, which is especially apparent when using deep learning methods [24]. The "black-box" effect is problematic since the decisions made by the model cannot be logically followed by medical personnel; most ML methods are completely non-transparent [29].

3 Method

Many datasets were used in this project to test different methods of training the algorithm. The training datasets (datasets 1 and 2) consist of adult data with corresponding ECG, and the testing dataset (dataset 3) consists of fetal data without corresponding ECG.

3.1 Datasets

The training datasets were based on adult ultrasound data with regular heart rhythm, imaged on a Vivid S6 ultrasound imaging system equipped with an M4S-RS (1.9-4.1 MHz) phased-array transducer (GE CV Ultrasound, Haifa, Israel), with correlating ECG data taken simultaneously. The data was exported in the EchoPAC software, version 201 (GE Vingmed Ultrasound AS, Horten, Norway).

Two types of data were exported in the EchoPAC software, which gave rise to dataset 1 and dataset 2. An overview of the training datasets can be found in figure 7.

Dataset 1 consisted of 100 colour tissue Doppler ultrasound and ECG samples from 100 different adults, with lengths ranging between 1 and 3 seconds. The Doppler ultrasound data was exported by placing a region of interest (ROI) on the septal wall of the heart (Figure 6) while in 'q-analysis' mode, and exporting the processed velocity curve of that ROI to a .txt-file. The ECG data was stored in the same .txt-file.

Dataset 2 consisted of 100 ultrasound cine-loops taken from 100 different adult patients, with cine-loop lengths ranging between 1 and 4 seconds, sampled and saved in .avi files, along with corresponding ECG data for each of the cine-loops, saved in a .txt-file.

A testing dataset consisting of fetal data was also used to evaluate the performance of the models on fetal data. The fetal data consisted of tissue Doppler ultrasound data in two patient groups. The first group, named dataset 3A, consisted of four samples from fetuses with normal heart function. The second group, named dataset 3B, consisted of four samples from fetuses with irregular heart rhythm, where arrhythmias such as AV block and SVES were present.

Figure 6: Placement of the ROI when extracting tissue velocity data in the EchoPAC software


Figure 7: Overview of the training datasets based on adult ultrasound data

3.2 Programming language and hardware

The language used for processing, training, visualising and evaluating both datasets was Python 3.7 (Python Software Foundation, Wilmington, DE, United States), with accompanying libraries such as NumPy, Scikit-learn, Matplotlib and SciPy. The MLPRegressor from Scikit-learn was used as our learning algorithm for all datasets. The processing, training and evaluation of the models were done on a 2017 MacBook Air with a 1.8 GHz Intel Core i5 processor for dataset 1, and on a stationary computer with an AMD Ryzen 3900X CPU and an AMD Radeon RX 5700 XT GPU running Windows 10 for dataset 2.

3.3 Initial processing of training datasets

The .txt-files from dataset 1 were extracted into arrays and looped to normalise the length to 3 s. An interpolation function (interp1d, SciPy) was used on both tissue Doppler data and ECG to resample the data to 500 points, so that all data had a common x-axis. The data was also smoothed using a Savitzky-Golay filter.
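A minimal sketch of this resampling and smoothing step; the function name, target length and filter settings are illustrative assumptions, not values taken from the project code:

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter

def resample_and_smooth(t, signal, n_points=500, window=15, polyorder=3):
    # Resample a trace onto a common x-axis, then smooth it.
    t_common = np.linspace(t[0], t[-1], n_points)
    resampled = interp1d(t, signal)(t_common)
    return t_common, savgol_filter(resampled, window, polyorder)

# Applied to both traces from one .txt-file, e.g.:
# t_c, velocity_500 = resample_and_smooth(t_raw, velocity_raw)
# _,   ecg_500      = resample_and_smooth(t_raw, ecg_raw)
```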

Dataset 2 consisted of cine-loops, each made up of consecutive images, called frames, that in turn consisted of grey-level pixel values. The initial processing of these cine-loops was simply to retrieve these pixel values and store them in an array that could more easily be used by our later functions and neural networks.

The measurements in both datasets were visually evaluated on quality, in order to categorize individual measurements into three groups: low, medium and high quality. For dataset 1, the quality depended on several issues: ECG data and velocity curve not aligned, ECG or velocity reading null or noisy, and velocity not sampled on the correct ROI (incorrect shape of the plot). For dataset 2, the quality was evaluated based on null or noisy data and grainy or low-resolution frames. In either dataset, if the ECG was found to be upside-down it was flipped to show a correct trace.

3.4 Further processing of training datasets

Each dataset was processed with different methods, resulting in two new datasets based on dataset 1 and four new datasets based on dataset 2. The processing of these datasets is presented in this section. All datasets were normalized using the mean and standard deviation (Equation 1), which sets the mean of each dataset to zero and the standard deviation to one. See figure 7 for an overview of the training datasets. Datasets 1A and 2A were used with no further processing.

Equation 1: z = (x − μ) / σ, where z denotes the normalized data point, x the original data point, and μ and σ the mean and standard deviation of the dataset.
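In code, Equation 1 is a one-line z-score normalization (a sketch; `data` stands for any NumPy array of velocity or ECG values):

```python
import numpy as np

def normalize(data):
    # Equation 1: zero mean, unit standard deviation.
    return (data - np.mean(data)) / np.std(data)
```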

3.4.2 Dataset 1B: Tissue Doppler dataset of one heart cycle

Dataset 1B consisted of velocity traces and ECG data from dataset 1 divided into heart cycles. Using the ECG, the data was cut from R-peak to R-peak so that each sample of the dataset consisted of one heart cycle; in this process, high and medium quality data was used. Since each heart cycle has a unique length, all samples had different lengths and unique x-axes. The sample frequency was normalised using an interpolation function to 300 samples per heart cycle. The resulting dataset consisted of velocity inputs of one heart cycle sampled 300 times and corresponding ECG targets of one heart cycle, sampled 300 times. After quality evaluation and segmentation into heart cycles, dataset 1B consisted of 162 heart cycles.

3.4.4 Dataset 2B: Cine-loop dataset, rate of change

Dataset 2B was processed by looking at how fast the pixels of the frames changed. This was done by creating new frames whose pixels represent the change in the original pixels over multiple images. This resulted in a lower total number of frames than before, now containing information about how much the pixel values had changed over a set number of cine-loop frames instead of information about the current state. The number of frames in each cine-loop varied between 50 and 259, and the cine-loops also had varying lengths of ECG that did not correspond to the number of frames; in order to correlate the cine-loops with the ECG, both had to be interpolated into arrays of the same size. Both the ECG data and the cine-loop data were transformed into arrays of length 64 using interpolation: one array per starting cine-loop and one per corresponding ECG.
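A sketch of this step under stated assumptions: `frames` is a (n_frames, height, width) grey-level array, the change is taken as an absolute difference over a fixed frame step, and the step size is a placeholder:

```python
import numpy as np
from scipy.interpolate import interp1d

def rate_of_change_frames(frames, step=4):
    # Replace a stack of frames with frames of pixel change over `step` frames.
    frames = frames.astype(float)          # (n_frames, height, width)
    return np.abs(frames[step:] - frames[:-step])

def interpolate_along_time(arr, length=64):
    # Resample an array to `length` points along its first (time) axis.
    x_old = np.linspace(0.0, 1.0, arr.shape[0])
    x_new = np.linspace(0.0, 1.0, length)
    return interp1d(x_old, arr, axis=0)(x_new)
```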

3.4.5 Dataset 2C, 2D, 2E and 2F: Cine-loop dataset, rate of change and minimized data

Datasets 2C through 2E were processed in similar ways. For each frame in the original cine-loops, a new frame was created where the pixel values in a square area of the original frame were averaged into one pixel in the new frame. For dataset 2C that square was 4x4 pixels, for 2D 8x8 pixels and for 2E 16x16 pixels, thereby reducing the image size by a factor of 16, 64 and 256 respectively. After that, the same processing as for 2B was applied to all three datasets. Dataset 2F was processed the same way as 2E, but using only high quality data.
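The block averaging can be sketched as follows, assuming square blocks and cropping so the frame dimensions divide evenly (names and the reshape trick are illustrative):

```python
import numpy as np

def block_average(frame, block=16):
    # Average each `block` x `block` square of pixels into one pixel.
    h, w = frame.shape
    h, w = h - h % block, w - w % block   # crop to a multiple of the block size
    return (frame[:h, :w]
            .reshape(h // block, block, w // block, block)
            .mean(axis=(1, 3)))

# e.g. with block=16, a 256x256 frame becomes 16x16 (a 256-fold reduction, as for 2E/2F)
```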

3.5 Training the algorithm

The inputs and targets from each of the datasets were split into a training and a testing group using the train_test_split method from Scikit-learn, with 30% of the dataset used for testing. The MLPRegressor was then trained on the training dataset using the method 'fit', with the input of the dataset as the input (X) and the target of the dataset as the true values of the output (y), as outlined below.
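In outline (placeholder arrays stand in for the real inputs and ECG targets):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

inputs = np.random.rand(100, 300)    # placeholder velocity/cine-loop inputs
targets = np.random.rand(100, 300)   # placeholder ECG targets

X_train, X_test, y_train, y_test = train_test_split(inputs, targets,
                                                    test_size=0.3)
model = MLPRegressor(solver='lbfgs', activation='relu', max_iter=500)
model.fit(X_train, y_train)           # X: inputs, y: true outputs
predicted_ecg = model.predict(X_test)
```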

For tissue Doppler datasets 1A and 1B, the optimal parameters of the MLPRegressor were chosen with an optimiser function. Three optimisers were constructed, one for each type of solver: 'Adam', 'SGD' and 'LBFGS'. For each type of activation ('tanh', 'logistic' and 'ReLU') and each type of solver, the remaining parameters of the MLPRegressor were iterated one by one over a chosen interval corresponding to that specific parameter, see appendix 2. The parameter value that corresponded to the model with the best PCC score was then selected, and the next parameter went through the same process. The result was therefore nine models optimised on parameters, one for each combination of activation and solver. The nine models were evaluated on performance.
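The one-parameter-at-a-time search can be sketched as a simple coordinate search (a hypothetical outline: the parameter names, grids and per-sample PCC scoring are placeholders, not the project's exact optimiser):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.neural_network import MLPRegressor

def evaluate_pcc(params, X_tr, y_tr, X_te, y_te):
    # Train one model and score it by the mean per-sample PCC.
    preds = MLPRegressor(**params).fit(X_tr, y_tr).predict(X_te)
    return np.mean([pearsonr(p, t)[0] for p, t in zip(preds, y_te)])

def coordinate_search(base, grids, X_tr, y_tr, X_te, y_te):
    # Optimise the parameters one at a time, keeping the best value of each.
    params = dict(base)
    for name, values in grids.items():
        scores = [evaluate_pcc({**params, name: v}, X_tr, y_tr, X_te, y_te)
                  for v in values]
        params[name] = values[int(np.argmax(scores))]
    return params

# e.g. base = {'solver': 'lbfgs', 'activation': 'relu', 'max_iter': 500}
#      grids = {'hidden_layer_sizes': [(32,), (64,), (128,)],
#               'alpha': [1e-4, 1e-3, 1e-2]}
```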

For datasets 2A through 2F, the same parameter optimization method was used, with optimization of two parameters for the 'LBFGS' solver and three for the 'Adam' solver. The 'SGD' solver was not used past initial testing for dataset 2.

For the differing sizes of datasets 2A through 2F, different neural net layer sizes and structures were used. For datasets 2A through 2D the neural net was simply layered, meaning that it had only one hidden layer with fewer than 50 neurons, and for datasets 2E and 2F the neural net had many hidden layers with between 32 and 384 neurons, see appendix 1.

3.6 Evaluating performance

The evaluation of the models was implemented differently for the different datasets. The statistical measures used for evaluation were PCC, MSE and the R2 score. All datasets except 2A were evaluated on the statistical measures and a visual score from 0 to 10.

For datasets 1A and 2F, the data was divided into low, medium and high quality, which required an evaluation of which portions to use. The evaluation was a test devised with three different portions of datasets 1 and 2 respectively: only high quality data; high and medium quality data; and data of all qualities, tested on the best performing model in each dataset group. The portion with the best PCC over 10 iterations with random partitions of training and testing data was chosen. Dataset 1B as well as datasets 2B through 2E used high and medium quality data, without an evaluation test.

3.6.1 Dataset 1A and 1B: The tissue Doppler datasets

For datasets 1A and 1B, the evaluation was done iteratively for all combinations of solver and activation function, 10 times. The solver and activation combination that received the best overall score was chosen as the best performing and was further tested on the different data quality portions.

3.6.2 Dataset 2: The cine-loop datasets

Due to the size of datasets 2B through 2F, the long duration of each optimization and the large differences in performance based on pre-processing, datasets 2A through 2F were evaluated by comparing different types of pre-processing and a few key parameter values.

3.6.3 Dataset 2F: The best performing cine-loop dataset

Dataset 2F was seen to be the best performing of the cine-loop datasets, and as such it was chosen for a more thorough evaluation. For datasets 2A through 2E, only one test/train split of 70% training and 30% testing data was trained and optimized on, due to time constraints, but for 2F many randomized 70/30 splits were used, and the model was evaluated on the average result of training on all of the splits, in order to more reliably evaluate the effectiveness of the algorithms trained on the dataset.

3.7 Testing on fetal data

Since the fetal data consisted only of tissue Doppler data, it was tested on models trained on datasets 1A and 1B. The fetal data was pre-processed to fit the inputs of these datasets; the model trained on dataset 1B takes heart cycle data as input, so the fetal data was cut into heart cycles by manually segmenting the data. Since the fetal data does not have a correlating ECG, only visual correlation could be shown, with no statistical measures.

4 Results

The results are presented per model trained and tested on each dataset. The statistical measures presented are PCC (closer to 1 indicates a better fit), MSE (lower error is better), R2 (closer to 1 indicates a better fit) and a visual score, where 0 is lowest and 10 highest. For each of the datasets, the quality of the gathered data was evaluated. This evaluation showed that out of 100 samples from dataset 1, 51 were high quality, 24 medium quality and 22 low quality. Out of 100 samples from dataset 2, 52 were high quality, 31 medium quality and 17 low quality.

Examples of ECG data from each evaluated quality in dataset 1 are presented in figure 8.

Figure 8: Examples of results of the data quality evaluation for dataset 1, for the ECG signal (yellow). X-axis in seconds and Y-axis in 10 V.

4.1 Dataset 1A: The tissue Doppler dataset of multiple heart cycles

The results of the 10 trials for the different optimised models are presented in figure 9, and the average scores are presented in table 1. The best performing model on a majority of the statistical measures presented in table 1 has the combination of solver and activation LBFGS and ReLU, with an average PCC of 0.517 and a visual score of 5.8. Further results on data portions regarding quality were tested on this model and are presented in table 2; the results show that performance was optimised when only high-quality data was used. Parameter results from the optimised models are presented in appendix 1. The combination SGD and ReLU could not be trained.


Figure 9: Test results for each model performance trained on dataset 1A (all quality data) based on the statistical measures and visual scores. All axes have dimensionless values.

Table 1: Average of statistical scores for each combination of activation and solver for models trained on dataset 1A

Solver & activation    Average PCC    Average MSE    Average R2    Average visual score
LBFGS + ReLU           0.517          2662           0.0777        5.8
LBFGS + logistic       0.401          2795           -0.086        5.0
LBFGS + tanh           0.380          2952           -0.018        5.0
Adam + logistic        0.352          3341           -0.010        3.7
Adam + ReLU            0.254          5384           -1.029        3.7
Adam + tanh            0.385          2845           0.015         3.0
SGD + logistic         0.425          3137           -0.069        5.9
SGD + tanh             0.424          3753           -0.259        5.3

Table 2: Results of test of data portion used on best performing model

Data portion                      Average PCC on 10 iterations
High quality                      0.538
High and medium quality           0.517
High, medium and low quality      0.506

The visual results for the best performing model on adult data (solver LBFGS, activation ReLU), trained using only high-quality data, are presented in figure 10.


Figure 10: Visual results on adult data for the best performing model trained on dataset 1A using only high quality data. Red curve indicates predicted ECG, green curve indicates true ECG and yellow curve indicates tissue Doppler velocity. X-axis in seconds and Y-axis in 10 V (ECG) and cm/s (tissue Doppler).

The results on fetal data 3A (normal heart function and rhythm) for the best performing model (solver LBFGS, activation ReLU) on dataset 1A are presented in figure 11b, and the results on fetal data 3B (irregular heart rhythm) for the same model are presented in figure 11a. Only visual results can be presented, since corresponding ECG on fetal data does not exist. In both cases an ECG curve is present (only slightly so for 3B), but it does not correctly identify the heart cycles.

Figure 11: Visual results on a) abnormal heart function and b) healthy heart function tissue Doppler data from prenatal patients, for the best performing model trained on dataset 1A using only high-quality data. Red curve indicates predicted ECG and yellow curve fetal tissue Doppler velocity. Abnormal heart function ailments in a) are as follows: top left: arrhythmia, top right: long QT, bottom left: SVES, bottom right: AV block III. X-axis in seconds and Y-axis in 10 V (ECG) and cm/s (tissue Doppler).

4.2 Dataset 1B: The tissue Doppler dataset of one heart cycle

The results of the 10 trials for the different optimised models are presented in figure 12, and the average scores are presented in table 3. The best performing model on a majority of the statistical measures presented in table 3 has the combination of solver and activation Adam and ReLU, with an average PCC of 0.752 and a visual score of 8.2, although the model with solver Adam and activation tanh and the model with solver LBFGS and activation tanh also performed well. Parameter results from the optimised models are presented in appendix 1.


Figure 12: Test results for each model performance trained on dataset 1B (medium and high quality data) based on the statistical measures and visual scores. All axes have dimensionless values.

Table 3: Average of statistical scores for each combination of activation and solver for models trained on dataset 1B

Solver & activation    Average PCC    Average MSE    Average R2    Average visual score
LBFGS + ReLU           0.720          1588           0.511         7.7
LBFGS + logistic       0.734          1544           0.531         7.5
LBFGS + tanh           0.752          1657           0.549         7.7
Adam + logistic        0.727          1629           0.513         7.4
Adam + ReLU            0.752          1546           0.550         8.2
Adam + tanh            0.723          1627           0.511         8.4
SGD + logistic         0.734          1598           0.524         7.5
SGD + tanh             0.725          1500           0.522         7.7
SGD + ReLU             -0.582         7419           -1.398        0.9

The visual results on adult data with high and medium quality data, with solver Adam and activation ReLU, are shown in figure 13.

Figure 13: Results on adult data for the best performing model with high and medium quality data; yellow: tissue Doppler velocity, green: true ECG, red: predicted ECG. X-axis in seconds and Y-axis in 10 V (ECG) and cm/s (tissue Doppler).

The results on fetal data 3A (normal heart function and rhythm) for the best performing model (solver Adam, activation ReLU) on dataset 1B are presented in figure 14; the ECG curve could in most cases be accurately predicted, with some exceptions. The results on fetal data 3B (irregular heart rhythm) for the best performing model on dataset 1B are presented in figure 15. Only visual results can be presented, since corresponding ECG on fetal data does not exist.

Figure 14: Visual results on healthy heart function tissue Doppler data from prenatal patients, from the highest scoring model trained on dataset 1B using high and medium quality data. Red curve indicates predicted ECG and yellow curve fetal tissue Doppler velocity. X-axis in seconds and Y-axis in 10 V (ECG) and cm/s (tissue Doppler).

Figure 15: Visual results on abnormal heart function tissue Doppler data from prenatal patients, from the highest scoring model trained on dataset 1B using high and medium quality data. Red curve indicates predicted ECG and yellow curve fetal tissue Doppler velocity. Abnormal heart function ailments are as follows: top left: arrhythmia, top right: long QT, bottom left: SVES, bottom right: AV block III. X-axis in seconds and Y-axis in 10 V (ECG) and cm/s (tissue Doppler).

4.3 Dataset 2B through E: The cine-loop datasets

Table 4 shows performance results from neural net training on datasets 2B through 2E with different activation and solver settings, optimised towards a higher PCC score and evaluated on three statistical measures and one visual. Parameter results from the optimized models are presented in appendix 1. The best performing model on a majority of the statistical measures was trained on dataset 2D-E with the combination of activation and solver tanh and Adam, with an average PCC of 0.425 and a visual score of 5.


Table 4: Statistical scores for each combination of activation and solver for models trained on datasets 2B-C and 2D-E

Dataset & combination     Average PCC    Average MSE    Average R2    Average visual score
2B-C ReLU, Adam           0.135          2071000        -973          0
2B-C tanh, Adam           0.374          2465           -0.156        4
2B-C ReLU, LBFGS          0.293          2600           -0.224        3
2B-C tanh, LBFGS          0.334          3080           -0.449        4
2D-E ReLU, Adam           0.333          2690           -0.265        0
2D-E tanh, Adam           0.425          2352           -0.107        5
2D-E ReLU, LBFGS          0.365          2642           -0.243        3
2D-E tanh, LBFGS          0.407          2847           -0.340        6

Figures 16-18 show plots from the neural net training on datasets 2B through 2E with their visual scores. These illustrate how different degrees of similarity between the simulated and real ECG graphs were visually scored.

Figure 16: Dataset 2E, tanh activation and LBFGS solver. Visual score 6. In each graph in figures 16-18, the simulated ECG is the red line while the green line is the real ECG. X-axis in seconds and Y-axis in 10V.


Figure 17: Dataset 2C, ReLU activation and LBFGS solver. Visual score 4. X-axis in seconds and Y-axis in 10V. The simulated ECG is the red line while the green line is the real ECG

Figure 18: Dataset 2B, ReLU activation and Adam solver. Visual score 0. X-axis in seconds and Y-axis in 10V. The simulated ECG is the red line while the green line is the real ECG

4.4 Dataset 2F: The cine-loop dataset, rate of change and minimized data

Table 5 shows performance results from neural net training on dataset 2F, while figure 19 shows graphs of simulated ECG compared to real ECG from training done on one train/test split of dataset 2F. Parameter results from the optimised models are presented in appendix 1. The performance results were acquired by training the neural net 11 times with different test/train split randomizations and then averaging the results. For each of these 11 runs, 8 results were acquired with 8 different combinations of layer complexity, activation settings and solver settings. For all combinations, parameters were optimized towards a higher PCC score. The performance results are three statistical measures and one visual. The best performing model on a majority of the statistical measures is the combination of activation and solver ReLU and LBFGS with complex layers, with an average PCC of 0.637 and a visual score of 7.


Table 5: Statistical scores for each combination of activation, solver and layer complexity for dataset 2F

Layers & combination            Average PCC    Average MSE    Average R2    Average visual score
Simple layers, ReLU, Adam       0.420          627478         -152.8        1
Simple layers, tanh, Adam       0.580          5949           -0.169        6
Simple layers, ReLU, LBFGS      0.607          5623           -0.085        6
Simple layers, tanh, LBFGS      0.632          5804           -0.142        7
Complex layers, ReLU, Adam      0.610          6391           -0.258        4
Complex layers, tanh, Adam      0.540          6287           -0.227        7
Complex layers, ReLU, LBFGS     0.637          5584           -0.072        7
Complex layers, tanh, LBFGS     0.604          5975           -0.177        8

Figure 19: Dataset F, complex layer, activation ReLU and solver LBFGS. Visual score 8. X-axis in seconds and Y-axis in 10V. The simulated ECG is the red line while the green line is the real ECG

5 Discussion

The results of the different machine learning models showed promising signs of being able to produce a fetal ECG from ultrasound data. Moreover, they also showed which improvements could be made to strengthen the indication that an adult training set can be used to predict the fetal ECG. The continuation of other connected projects would further the development and indicate any clinical usability of the method.

5.1 Results on dataset 1

As seen in figure 9, the performance on dataset 1A varied considerably between test runs; for the same model, the PCC score could fluctuate between approximately 0.15 and 0.50. Since the test and training data was shuffled between each iteration, variance in this regard could be due to inconsistent data: depending on which data happens to fall in training, the model performs differently. In comparison, dataset 1B had less varying results between test runs, with the largest fluctuation of PCC being approximately 0.6 to 0.75, see figure 12. Since the data in dataset 1B was cut into heart cycles, the data was less varied and more consistent, which could explain this difference between datasets 1A and 1B.

Performance on dataset 1A was also optimised when using only high-quality data (table 2), which could further the indication that lower quality data has a stronger negative impact on the model than a smaller amount of data.

By comparing tables 1 and 3, it can be seen that the different models on dataset 1A were more diverse in performance than on dataset 1B, with the exception of SGD + ReLU for dataset 1B. This can be explained by the same reasoning as earlier: the data is more consistent for 1B and can be more easily interpreted by any model. SGD + ReLU performed comparably worse on dataset 1B than the other models, and on dataset 1A it produced unviable values. A conclusion could be that this combination of solver and activation function does not suit dataset 1 at all.

Both datasets had their best performing model with activation ReLU, which is unsurprising since it is the most used activation function today [16]. However, the two datasets differed in best performing solver: LBFGS for dataset 1A and Adam for dataset 1B. The Adam solver generally performs better on larger datasets [20], which could be an explanation: since dataset 1B was cut into heart cycles, it contains more training/testing samples.

Regarding both visual and statistical measures, the best performing model on dataset 1B outperforms the best performing model on dataset 1A; visually this can be seen by comparing figures 10 and 13, and statistically by comparing tables 1 and 3. The model trained on dataset 1A predicted some ECG curves nearly perfectly, whereas some are mostly noise and have no distinguishable pattern. The model trained on dataset 1B is much more consistent and rarely predicts noise, although one could argue that it could be overfitted to the healthy adult heart and might not detect changes or medical conditions altering heart function. More data on patients with different heart function would be needed to assess this.

5.2 Results on dataset 2

Observations from early testing on dataset 2 showed that the performance of the trained algorithms varied drastically depending on pre-processing, solver type and activation type, and varied less with layer size and maximum number of iterations. The remaining parameters made little to no difference, and since optimization of these parameters was omitted due to time constraints, they are also omitted from the results. The 'SGD' solver was not used for dataset 2 because it never produced a result better than noise during initial testing.

The performance results for datasets 2B through 2E, shown in table 4, indicated improvement that correlated with the amount of pre-processing. The neural net training went faster the smaller the dataset was in size, and also achieved higher average PCC scores. Visually, the improvement due to pre-processing can be seen by comparing the graphs in figure 16 with those in figure 17, with figure 17 being a typical simulated ECG curve for the low pre-processing dataset 2C and figure 16 a typical simulated ECG curve for the high pre-processing dataset 2E. The algorithms trained on low pre-processing datasets like 2B and 2C generally missed more QRS complexes and were visually scored lower than those trained on the high pre-processing datasets. A couple of the graphs for datasets 2B and 2C are visually close to noise, similar to the one in figure 18, while for datasets 2D and 2E all trained algorithms produced simulated ECG curves resembling the correct ones, except for the graphs acquired using the combination of ReLU activation and Adam solver. As seen in figure 16, the simulated ECGs made using dataset 2E missed QRS complexes only a small number of times and were visually very similar to the real ECG.

The different combinations of activations and solvers gave noticeably different results, with the combination of ReLU activation and Adam solver only producing results better than noise for datasets 2E and 2F, while tanh activation with both the Adam and LBFGS solvers produced statistically and visually good results. For dataset 2, as shown in tables 4 and 5, the LBFGS solver consistently outperformed the Adam solver in terms of statistical and visual results, with Adam only catching up in performance on the smallest datasets, 2E and 2F. This could be explained by how the different solvers operate, with Adam using a type of gradient descent [30] and LBFGS using a more complex approach [31].

In terms of visual score, the correlation with dataset size was also clear, with the smaller datasets having less noisy graphs and more often finding the QRS complex as well as the P-wave and the T-wave. Dataset 2F had the best performance in both visual score and statistical score, only rarely failing to find the different parts of the heart cycle in its testing.

In some cases, the simulated ECG visually resembles a normal ECG more than the corresponding real ECG does, as shown in figure 18. The real ECGs in these cases, which do not resemble a normal ECG, could in the future be removed by adding a pre-processing step where the real ECGs are checked for their similarity to all types of normal ECGs, including arrhythmias, and removed if they are too different from all of them. How this would best be done would need to be researched.

In the evaluation, only dataset 2F was evaluated based on an average over different test/training data splits. This was done to save time, since the optimization for each of the iterations took over thirty minutes. Dataset 2F got the best results out of all the cine-loop datasets, so it was chosen for the more thorough evaluation. It could be argued that all of the datasets should have been evaluated based on averages, but the time investment to make that possible was decided not to be worth it.

5.3 Fetal data results

The fetal datasets 3A and 3B were tested on the best performing models trained on datasets 1A and 1B. The results on dataset 1A were not satisfactory, since the model could not accurately identify the different heart cycles (see figure 11). This could be due to the higher fetal heart rate compared to adults, if the model was overfitted to the heart rate of an adult.

The results on dataset 1B were better (see figure 14); the resulting ECG curve could in most cases be predicted. Because the inputs were cut into heart cycles for models trained on dataset 1B, the ECG was easier to predict. Abnormalities in the tissue Doppler curves from fetal dataset 3B gave results unlike a normal ECG for the model trained on dataset 1B, as seen in figure 15. This is probably because the model was not trained on adult data containing irregular rhythm, and thus has not learned what the corresponding ECG would look like.

Generally, the results of the best performing model trained on dataset 1B indicate that an algorithm trained on adult data could predict a potential fetal ECG for a patient with regular heart rhythm. However, for predicting a fetal ECG of a patient with irregular heart rhythm, more training and testing data would be necessary to assess the possibility and the performance.

5.4 Results regarding the aim

The first aim of this project was to produce a model that could produce a plausible simulated ECG curve from tissue Doppler ultrasound data. The results in figure 14 show that the best performing model on dataset 1B could produce an ECG signal with the right characteristics of an ECG, which is therefore considered plausible. This aim is consequently considered met.

The second aim was to obtain statistical results for the models when tested on adult data. The first statistical aim was to produce a result better than noise; an example of a result equal to noise can be found in figure 18, which received a visual score of 0. The best performing model in this paper obtained an average visual score of 8.2, which is considered better than noise. The second statistical aim was to have the P-wave, QRS complex and T-wave visible in every heart cycle for 90% of the test samples; examples of models meeting this aim can be found in figures 13 and 19. The third statistical aim was a PCC score of at least 0.7, which was met by the best performing model trained on dataset 1B, see table 3.

5.5 General improvements of models

As exemplified in figure 8, there was a lot of low-quality data in the datasets for this study, which reduced the amount of useful data for training and testing. An improvement in the quality of the overall data would increase the amount of viable data and produce more accurate models for both datasets. The amount of data could also be increased, allowing the model to more accurately learn the correlation between the movement of the heart and the ECG. More data covering diverse ranges of heart function and rhythm could also improve all models to correctly present an ECG for each abnormal heart function case.

For dataset 2 there are many possible improvements apart from just having more data. Since a lot of pre-processing is done to the cine-loop data in order to make it usable for machine learning, changes to that pre-processing can lead to big improvements in terms of better algorithms or faster training times. Given more time, many different ways of reducing frame size and measuring change over time in the cine-loops could be tested. This is something that could be done automatically via code, but on a good home computer, testing just one of the pre-processing types would take around ten hours, which would mean weeks to find the best type.
