AI-based Age Estimation using X-ray Hand Images

A comparison of Object Detection and Deep Learning models

Erik Westerberg

June 8, 2020

Faculty of Computing


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the bachelor's degree in software engineering. The thesis is equivalent to 10 weeks of full-time studies.

Contact Information:
Author: Erik Westerberg
E-mail: erikwesterberg92@gmail.com
University advisor: Abbas Cheddad

Department of Computer Science

Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona, Sweden


Abstract

Bone age assessment can be useful in a variety of ways. It can help pediatricians predict growth and the onset of puberty, identify diseases, and assess whether a person lacking proper identification is a minor or not. It is a time-consuming process that is also prone to intra-observer variation, which can cause problems in many ways. This thesis attempts to improve and speed up bone age assessments by using different object detection methods to detect and segment the bones that are anatomically important for the assessment, and by using these segmented bones to train deep learning models to predict bone age. A dataset consisting of 12811 X-ray hand images of persons ranging from infancy to 19 years of age was used. In the first research question, we compared the performance of three state-of-the-art object detection models: Mask R-CNN, Yolo, and RetinaNet. We chose the best-performing model, Yolo, to segment all the growth plates in the phalanges of the dataset. We proceeded to train four different pre-trained models, Xception, InceptionV3, VGG19, and ResNet152, using both the segmented and the unsegmented dataset, and compared the performance. We achieved good results using both the unsegmented and the segmented dataset, although the performance was slightly better using the unsegmented dataset. The analysis suggests that we might be able to achieve a higher accuracy using the segmented dataset by adding the detection of growth plates from the carpal bones, the epiphysis, and the diaphysis. The best-performing model was Xception, which achieved a mean absolute error of 1.007 years using the unsegmented dataset and 1.193 years using the segmented dataset.


Acknowledgments


Contents

1 Introduction
1.1 Bone age assessment and Machine Learning
1.1.1 Ethical Aspects
1.2 Segmentation Problem
1.3 Purpose and Objective
1.4 Scope
2 Research Questions
3 Literature Review
3.1 Object Detection
3.2 Deep Learning Background
3.2.1 Common problems in Deep Learning
3.2.2 Regression Problem
3.3 Bone Age Assessment using Deep Learning
4 Research Methodology
4.1 Theoretical
4.2 Methods and Experiments
4.3 Experimental Procedure
4.3.1 RQ1
4.3.2 RQ2
4.3.3 RQ3
4.4 Performance Metrics
4.4.1 RQ1
4.4.2 RQ2 and RQ3
5 Analysis and Results
5.1 RQ1
5.2 RQ2
5.3 RQ3
5.4 Ethical Concerns

1 Introduction

1.1 Bone age assessment and Machine Learning

Applications Bone age assessment (BAA) can be useful in a variety of situations. For example, it can be used to predict how much longer a child will grow, when they will enter puberty, or even their final height [1]. It can also be used to monitor the progress of children being treated for conditions that affect growth. BAA is also very useful when it comes to identifying people lacking proper identification [2]. In recent years, there has been a significant increase in the number of refugees lacking proper identification seeking asylum in Europe. Unaccompanied individuals under the age of 18 are entitled to special rights according to the United Nations Convention on the Rights of the Child, so from a legal standpoint, an accurate assessment is important to create a fair process.

BAA Methods Generally, two methods are used for manual BAA: the Greulich-Pyle (GP) and the Tanner-Whitehouse (TW) methods [3]. The GP-method uses an atlas containing reference images collected during the years 1931-1942. During the assessment, the assessed radiograph is compared to reference radiographs in the atlas. Two bone regions, the epiphysis and the diaphysis, are used for the assessment. Most institutions use a rapid, modified version, which is faster than the original method but also less accurate.

The TW-method calculates maturity scores based on several regions in the hand. These regions are the carpals, radius, ulna, and short bones. This method was developed during the years 1950-1960 but was updated in 2001 based on newly discovered patterns of development. The third iteration of the method is currently being used: TW3.

The most important regions for both of these methods are the growth plates, which are where cartilage gradually turns into bone tissue [4]. The gaps between the bones are wide during childhood, but gradually disappear as the child matures, as shown in Figure 1.


Comparison In 1999, Bull et al. [3] made a comparison between the two methods. The study concluded that the GP-method was more prone to intra-observer variation than the TW-method, meaning that the estimated age can vary depending on which doctor is performing the assessment. This can cause major problems, both for skeletal maturity evaluation and for assessing whether a person is a minor or not. Bull et al. suggested that only one of the methods be used during manual BAA, preferably the TW-method.

CASAS The first automated system for BAA was suggested by Tanner et al. in 1994: the CASAS system [5]. They proposed an automated computer-based skeletal age scoring system based on the TW2 method. The biggest problem with this system was that it could not locate all the regions that were anatomically meaningful for the assessment, hence it was only semi-automated. It could at most process 90% [6] of the cases, which meant that it still needed to be supervised by a trained pediatrician.

BoneXpert In 2009, Thodberg et al. introduced a fully automated method for determining skeletal maturity: BoneXpert [6]. It used methods from statistics and machine learning, which at the time had not been used for this task. The BoneXpert system divides the BAA into three steps. The first step reconstructs the bone borders, which are used for the assessment. The second step computes the bone age values for each area. The third step converts the bone age values into either GP or TW bone age using simple postprocessing. This software is licensed on a pay-per-analysis basis. BoneXpert connects as a DICOM node to a DICOM network. To perform a BAA, a hand radiograph is pushed from the PACS (Picture Archiving and Communication System) to the BoneXpert DICOM node. The BoneXpert server performs the assessment and then returns an annotated image to PACS, storing it next to the original radiograph. BoneXpert is not open-sourced, meaning that we do not know what is going on internally during the BAA. It is also relatively expensive. In 2011, the price was 10 euros per assessment [7]. Current pricing is, to our knowledge, not available through their website.

RSNA In 2017, a competition was hosted by the Radiological Society of North America (RSNA) in which the competitors set out to determine the bone age of a person based on their hand radiograph [8] using machine learning techniques. All of the competitors used a dataset provided by Kaggle, which is the same dataset that is used in this thesis. The competition was judged based on each competitor's best submitted mean absolute distance (MAD).

A total of 260 individuals and teams participated on the challenge website. In the end, 105 submissions from 48 unique users were received. The best result was a MAD of 4.2 months, meaning that the average error between each assessed bone age and its corresponding ground truth had an absolute value of 4.2 months. The concept of this competition is very different from a commercially intended product such as BoneXpert. The data collected from projects such as BoneXpert is rarely shared outside the organization, so it does not contribute to the community in the same way.


1.1.1 Ethical Aspects

Bone age assessments for Unaccompanied Minors Bone age assessments are one of several methods used to assess the age of refugees lacking proper identification. Other methods include interviews, dental assessments, assessment by a doctor, and psychological methods [9]. Minors are entitled to welfare and various other benefits, hence a proper assessment is very important in two respects: it minimizes the risk of a child being incorrectly assessed as an adult, and it minimizes the risk of an adult being assessed as a child and illegitimately claiming these benefits. Sweden's National Board of Forensic Medicine began carrying out age estimations in 2018 [10]. Between mid-March and the end of October, 7858 age assessments were carried out using X-rays of wisdom teeth and MRI scans of knee joints. The assessments suggested that 6628 individuals who had stated that they were minors were actually adults. This signifies the importance of accurate assessments, both for the sake of the individuals involved and for the public's trust in the government.

The currently used bone age assessment methods have been derived primarily from a white population, and some researchers suggest there is evidence that these methods are less generalizable to children of other ethnicities [11], which would make them unfair since most of the refugees come from other geographical regions. According to Bull et al. [3], the 95% confidence interval (CI) is different for the TW2 and GP-methods. 362 radiographs were assessed using both methods, and the GP-method achieved a CI of -2.46 to 2.18 years, while the TW2 method achieved a CI of -1.42 to 1.43 years. This is a significant concern, since which method is used depends on the country. If either of the methods is used, especially the GP-method, the assessed individual runs a significant risk of being falsely assessed as an adult. It should also be noted that a child's bone age might not necessarily correspond with their chronological age. Healthy, tall children tend to have an advanced bone age, while healthy, short children tend to have a delayed bone age [12]. Some conditions, syndromes, and diseases which can contribute to delayed or advanced bone age are shown in Table 1 [11].

Delayed Bone Age                        Advanced Bone Age
Untreated growth hormone deficiency     Childhood obesity
Chronic diseases                        Leydig cell hyperplasia
Inflammatory bowel disease              McCune-Albright syndrome
Celiac disease                          Sotos and Beckwith-Wiedemann syndromes
Cystic fibrosis
Anorexia
Depression

Table 1: Potential causes of delayed and advanced bone age.

These conditions can significantly impact the result of the age assessment and thus contribute to an unfair assessment. The US Immigration and Customs Enforcement (ICE) suggested that wrist and hand X-rays may be considered, but that they should not be used exclusively for age assessments [13].

Bone age assessments for Medical Conditions Bone age assessments have various uses in healthcare. As mentioned, they can be used to identify delayed and advanced bone age. Endocrine disorders are commonly associated with delayed bone age, meaning that a normal bone age can help rule out many such disorders [11]. Bone age assessments can thus be used as a tool to rule out certain conditions. They can also indicate certain diseases, but most of them demand further investigation and testing for a diagnosis. Children with chronic diseases may have delayed bone age, but not necessarily so. Bone age assessments are also useful when it comes to treating children with short stature.


One study [14] concluded that bone age assessments should be part of the follow-up for this kind of treatment. In another study by Wilson [15], the author argued the opposite: that few data support the usefulness of bone age assessments when monitoring children receiving growth hormone therapy. The author argued that bone age determination is fraught with technical difficulties and that the interobserver difference is significant. There appear to be many different opinions on the usage of bone age assessments and their accuracy.

1.2 Segmentation Problem

Image segmentation is the process of dividing an image into multiple sub-images, where each sub-image contains a certain object. Each sub-image is represented by a set of coordinate points that defines a bounding box around the object. An example of predicted bounding boxes and their corresponding segmented sub-images can be seen in Figure 2. Each point is an absolute pixel position in the image, starting from the top-left corner.

Figure 2: Bounding boxes around the detected growth plates.
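As an illustration of this coordinate convention, the following minimal sketch crops one sub-image from a radiograph using OpenCV. The file name and box coordinates are placeholders, not values from the thesis.

import cv2

# Load a radiograph as a grayscale image (placeholder file name)
image = cv2.imread("radiograph.png", cv2.IMREAD_GRAYSCALE)

# Hypothetical predicted bounding box: (x1, y1) top-left, (x2, y2) bottom-right,
# given as absolute pixel positions with the origin in the top-left corner
x1, y1, x2, y2 = 102, 311, 164, 372

# NumPy indexing is [row, column], i.e., [y, x]
growth_plate = image[y1:y2, x1:x2]
cv2.imwrite("growth_plate_crop.png", growth_plate)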

Three different object detection methods are used to segment the growth plates in this thesis: Mask R-CNN, Yolo, and RetinaNet. All of them are state-of-the-art object detection methods that have been trained on ImageNet for general-purpose object detection. They require additional training on custom datasets to detect custom objects. For this, we need to provide images and corresponding ground truth labels in either PascalVOC or MS COCO format. Open-source versions of the frameworks, available on GitHub, have been utilized for all three methods.


Backbone The backbone feature extractor of Mask R-CNN is usually ResNet50 or ResNet101 (ResNet101 in this thesis). It starts by detecting low-level features such as edges and corners, and moves on to detect high-level features such as cars and people.

RPN Next, Mask R-CNN uses a region proposal network (RPN) to scan the image for objects, using a kernel that slides over the image. The regions that the RPN scans over are called anchors, of which there are hundreds of thousands in the scanned image. This can be done in parallel on the GPU, so it is rather fast. It does not scan all the anchors, only the anchors present in the feature map generated by the backbone.

Finally, Mask R-CNN generates bounding boxes with an accuracy score around each detected object in the image.

Yolo (You Only Look Once) Yolo’s objective is the same as Mask R-CNN’s. However, it has a different approach. Mask R-CNN’s algorithm is based on classification, while Yolo’s is based on regression [17]. Instead of selecting one anchor at a time, Yolo predicts classes and bounding boxes for the whole image at once, hence “You Only Look Once”. It is commonly used for real-time object detection because of its speed. It is significantly faster than Mask R-CNN.

Yolo works by splitting the image into an S x S grid, where each cell is responsible for predicting 5 bounding boxes. Most of these cells will not contain any objects, so Yolo generates a confidence score for each bounding box so that we can easily filter out bounding boxes that are not likely to contain any objects.

RetinaNet RetinaNet uses a feature pyramid network (FPN) built on top of ResNet50/ResNet101 as its backbone [18]. The FPN generates a feature map, which is passed on to the subnetworks. The first subnet starts by predicting the probability of objects being present for N anchors for each object class. The next subnet outputs the coordinates of the object in the image. Just like Yolo, it generates a lot of bounding boxes, which can be filtered out based on the confidence scores.

1.3 Purpose and Objective

This thesis explores the possibility of training neural networks to categorize X-ray hand images by their correct bone age. To do this, the RSNA Bone Age dataset will be used: a dataset containing 12811 X-ray hand images [8] with corresponding bone age and gender. Each image in the dataset is an unsegmented radiograph, meaning that it contains many details that are not anatomically necessary for the assessment. The objective of this thesis is to find the best segmentation method for this problem, as well as to evaluate how this segmentation affects the accuracy of different pre-trained models when it comes to age estimation. Does it increase or decrease accuracy?

1.4 Scope

The main focus of this research will be the growth plate segmentation and age estimation using both the segmented and the unsegmented dataset. Transfer learning will be utilized with several different pre-trained models, since training a randomly initialized model from scratch is very time-consuming.

2 Research Questions

• RQ1: What image segmentation method will be most useful for the growth plate segmentation?
– Motivation: Since the growth plate segmentation will be applied to the whole dataset of 12811 images, we need to evaluate which segmentation method is the most suitable in terms of both computational time and accuracy. Both computational resources and annotated images are finite.

• RQ2: How do different pre-trained models perform on the RSNA Bone Age dataset?
– Motivation: To find out if the segmentation from RQ1 is useful, we first need to see how the pre-trained models perform without the segmentation. Evaluating different models is necessary, since the better a model is at learning from the total area of a hand, the better it may be at learning features from the segmented growth plates.

• RQ3: Can the accuracy (from RQ2) be improved by extracting each hand's growth plates from every image in the dataset (RQ1), as compared to using the total area of a hand?
– Motivation: The most anatomically significant parts for a BAA are the growth plates. Therefore, there are many parts of the image which are not needed to perform the assessment. To reduce the risk of having the models learn from unrelated features, it might be necessary to remove unwanted information from the radiographs.

3 Literature Review

All performance metrics discussed in this literature review are the same ones used in the empirical study of this thesis, and they are described in more detail in Section 4.4.

3.1 Object Detection

In 2018, a study was performed to compare the detection of balls in handball images [19]. The performance of Yolo and Mask R-CNN was analyzed in terms of speed and accuracy, using a custom dataset with images gathered from various handball videos, as well as publicly available images from Google. Both models used weights pre-trained on the MS-COCO dataset and were trained additionally on the custom dataset.

Both Yolo and Mask R-CNN performed relatively well on the public dataset using only the pre-trained weights, while performing poorly on the custom dataset. Yolo was better at handling objects further away but had a higher number of false positives. Training both models on the custom dataset significantly increased their accuracy on the custom dataset, but decreased their accuracy on the public dataset, especially for Yolo.


All three models were compared in a study by Mukhopadhyay et al. [21], in which they used Mask R-CNN, Yolo, and RetinaNet to detect cars, auto-rickshaws, and various objects on Indian roads. They found that Mask R-CNN performed significantly better than both RetinaNet and Yolo. Yolo outperformed RetinaNet by a large margin, while also outperforming Mask R-CNN in computational time. False negatives were more prevalent with RetinaNet, while there was no significant difference between the other models.

Although Mask R-CNN was the most accurate model, the authors concluded that Yolo would be the best model for their purpose, since it is significantly faster than Mask R-CNN. They proposed an expert system with live lane and obstacle tracking on roads, hence Yolo's speed would make it ideal for real-time object detection.

The studies reviewed suggest that Mask R-CNN will be the most accurate model in RQ1, although Yolo will come close and most likely be better suited because of its speed since the segmentation will be performed on 12811 radiographs. RetinaNet should perform the worst, though requiring less computational time than Mask R-CNN.

3.2 Deep Learning Background

Deep learning is a subset of machine learning in which the firing of neurons is mimicked to simulate the function of the human brain and make intelligent decisions. The architecture mimicking the human brain is called a neural network. These networks are multi-layered, consisting of three kinds of layers: the Convolutional, Pooling, and Fully Connected layers, as seen in Figure 3.

Figure 3: A simple overview of a Convolutional Neural Network. The output is either a classification or a continuous value (regression).

These networks are trained using large amounts of data, which are labeled with the correct category or value. In general, all deep learning problems can be divided into one of two categories: classification or regression. Classification means that a given input is categorized into one of k classes. Regression means that we map an input (an image) to a continuous value, for example age. Deep learning is useful for finding patterns in images, which can help doctors identify diseases such as cancer. One area in which deep learning has not been utilized to such a high degree is bone age estimation.

Convolutional Neural Networks


Convolutional Layer The first layer of the CNN is called the Kernel or Convolutional Layer. A kernel of a predefined size, for example [3, 3, 1], moves across the image, each step being referred to as a stride. The kernel starts in the top-left corner of the image and performs matrix multiplications on the pixel values that it is currently hovering over. It then moves to the right by the stride value and repeats the process until it reaches the full width of the image. It then moves down one step, goes back to the left side of the image, and starts over. This is done to extract features from the image, such as edges. This also reduces the resolution of the image before it is passed to the next layer.

Pooling Layer Next is the Pooling layer. In this layer, Max or Average Pooling is applied to the image to decrease the computational power necessary to process the data. Max pooling is the most commonly used method since it can help to suppress noise in the image. A simple example of reducing the resolution of a 4x4 image using max pooling is shown in Figure 4.

Figure 4: Max Pooling performed on a 4x4 grayscale image using a 2x2 kernel with a stride size of 2. The colors are only used for clarification.

A kernel of a defined size moves across the image with a set stride size and picks the largest value from each area. It generally performs better than average pooling. After our image has passed through all of the layers and all the filters have been applied, it is flattened. An image of size 2x2 pixels can be considered a matrix, as shown in Eq. 1:

\begin{pmatrix} 240 & 222 \\ 251 & 231 \end{pmatrix}   (1)

Flattening this matrix transforms it into a linear list, as shown in Eq. 2:

\begin{pmatrix} 240 & 222 & 251 & 231 \end{pmatrix}   (2)
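To make the pooling and flattening steps concrete, the following minimal NumPy sketch performs 2x2 max pooling with a stride of 2 on a 4x4 image (as in Figure 4) and flattens the 2x2 matrix from Eq. 1 into the list of Eq. 2. The pixel values in the 4x4 image are illustrative.

import numpy as np

# 4x4 grayscale image with illustrative pixel values
image = np.array([[12, 20, 30, 0],
                  [8, 12, 2, 0],
                  [34, 70, 37, 4],
                  [112, 100, 25, 12]])

# 2x2 max pooling with stride 2: take the largest value in each 2x2 block
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                 # 2x2 pooled result

# Flattening the matrix from Eq. 1 into the list of Eq. 2
matrix = np.array([[240, 222],
                   [251, 231]])
print(matrix.flatten())       # [240 222 251 231]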

Fully Connected Layer This flattened image is then passed to the fully connected layer and forward propagated through the fully connected network. Forward propagation is a term used in deep learning, which refers to each pixel value being passed through the fully connected network and its hidden layers. In this step, all the images passing through have to be of the same size. Each pixel value is passed into its neuron and then passed through the network. The network finally outputs its final prediction.

(14)

Sigmoid One of the most common activation functions in CNNs is the sigmoid. The sigmoid transforms its input into a value between 0 and 1. It is defined as in Eq. 3:

f(x) = \frac{1}{1 + e^{-x}}   (3)

The sigmoid is useful since it keeps the values small and helps us observe small changes in the outputs.

ReLU Another popular activation function is the Rectified Linear Unit (ReLU):

f(x) = \max(0, x)   (4)

The main advantage of ReLU is that it does not make all neurons fire at the same time. Zeiler et al. [22] suggested that ReLU be used in deep learning networks since such networks are easier to optimize, converge and generalize better, and compute faster.

Loss Functions The goal of the training process is to minimize the error of the predictions, so training can be seen as a minimization problem. In deep learning, the value of this objective function is referred to as the loss. The loss function's job is to evaluate the performance of our algorithm: if the algorithm is performing well, the loss function outputs a small number; if it is far off, it outputs a large number. The loss function is used during backpropagation to adjust the weights of the model. The weights of the model are adjusted and the loss is checked accordingly. If the loss decreased, the adjustments of the weights were in the right direction; if it increased, the weights were not adjusted in the right way.

Gradient Descent Gradient descent is an optimization algorithm commonly used in neural networks to optimize the weights. It can be thought of as climbing down a mountain to the lowest point. We start at the mountain top: at our current prediction. We then take a step in the direction of the negative gradient and get a new prediction. We then recalculate the gradient and take another step in that direction. This is repeated until we reach the lowest point, or a local minimum. The size of the steps we take is referred to as the learning rate. With a low learning rate, we can make sure that we take small steps in the right direction, as the negative gradient is recurrently calculated; however, this requires a lot of computational power. It is therefore common practice to use a gradually decreasing learning rate: start with relatively big steps, optimize until the error no longer decreases, and then decrease the step size.
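The following is a minimal, purely illustrative sketch of gradient descent with a decaying learning rate on a one-parameter quadratic loss; the thesis itself trains its models with Keras rather than a hand-written loop.

def loss(w):
    return (w - 3.0) ** 2            # simple quadratic loss with its minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)           # derivative of the loss

w = 10.0                             # starting point ("the mountain top")
learning_rate = 0.8                  # initial step size
previous_loss = loss(w)

for step in range(100):
    w -= learning_rate * gradient(w)     # step in the direction of the negative gradient
    current_loss = loss(w)
    if current_loss >= previous_loss:    # error stopped decreasing: shrink the step size
        learning_rate *= 0.1
    previous_loss = current_loss

print(w)   # close to 3.0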

3.2.1 Common problems in Deep Learning

Overfitting Overfitting occurs when we reach a point during training where the model performs well on the data that we trained it with, but does not perform well on never before seen data. There are several ways to mitigate this, but the best one is generally to gather more training data. However, this is not always possible, since it can be time-consuming or expensive to do so. Fortunately, there are plenty of large datasets for various tasks available for free on the internet. Websites such as Kaggle [8], which provides the dataset used in this thesis, also provide datasets for other machine learning tasks. At this moment, there are roughly 39000 datasets available on their website. Various other datasets, such as MNIST [23] (handwritten digits), can easily be found via a Google search.

Underfitting Underfitting refers to a model which performs well on neither the training data nor the test data. This occurs when the model is not able to recognize any patterns in the dataset. It is generally caused by too small a dataset, too much variance in the dataset, or both.

Both overfitting and underfitting can be decreased by using k-fold cross-validation.

K-fold cross-validation is when we divide our dataset into k partitions, for example {A, B, C, D, E}. Then, we first use {A, B, C, D} for training and {E} for validation. Next, we use {B, C, D, E} for training and {A} for validation, and so on. This means that we use all partitions for both training and validating our model.
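A minimal sketch of this partitioning with scikit-learn's KFold is shown below; scikit-learn is used here purely for illustration, since the thesis itself uses a fixed 70/15/15 train/validation/test split rather than cross-validation.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 dummy samples with 2 features each
y = np.arange(10)                  # dummy targets

# 5 folds mirror the {A, B, C, D, E} example above
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    # a model would be trained on (X_train, y_train) and evaluated on (X_val, y_val) here
    print(f"fold {fold}: {len(train_idx)} training samples, {len(val_idx)} validation samples")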

3.2.2 Regression Problem

Both classification and regression are considered supervised learning. Our problem is a regression problem, since the objective is to predict age: a continuous value. To pick the most suitable models for the training process in RQ2 and RQ3, some research has to be done to evaluate the models available in the Keras TensorFlow library. Keras provides the following models, which have all been trained and evaluated on ImageNet:

Xception, VGG16, VGG19, ResNet, ResNetV2, InceptionV3, InceptionResNetV2, MobileNet, MobileNetV2, DenseNet and NASNet.

According to the study by Canziani et al. [24], InceptionV4 is the most accurate architecture on ImageNet. However, it is not included in the Keras applications library and is therefore excluded from this study, both for convenience and because of the only slight performance improvement that it provides.

The best overall performer in terms of Top-1 accuracy, apart from InceptionV4, is its previous version: InceptionV3. However, in another study by F. Chollet [25] at Google, Xception, a modified version of Inception, outperforms the InceptionV3 architecture. Xception achieved a Top-1 accuracy of 0.790, versus 0.782 for InceptionV3. The author also notes that InceptionV3 was developed with a focus on ImageNet, which might explain why it performs so well on it. A comparison was also made using the JFT dataset, on which neither of the architectures has been trained. This time, Xception beat InceptionV3 by a bigger margin. The next best model after InceptionV3 is ResNet152. ResNet152 is the best-performing architecture within the ResNet family. It has 152 layers, which is much deeper than InceptionV3's 48 and Xception's 71 layers, so training time will be slower. The ResNet family was partially introduced to deal with the vanishing gradient problem.

Next is VGG19. VGG19 is a network developed by Oxford University's Visual Geometry Group and is 19 layers deep. It is quite similar to ResNet152 but performs convolutions on the whole images, which makes it slower. Since these models seem to be the best-performing models within the research, they will be utilized in RQ2 and RQ3. Based on the papers reviewed, we hypothesize that Xception will be the best performer on the unsegmented dataset, followed closely by InceptionV3. ResNet152 will most likely be slower but more accurate than VGG19.

3.3 Bone Age Assessment using Deep Learning


Pre-trained weights from ImageNet were used during the training. By the 1000th iteration during training, the decrease in error had already started to plateau, which indicated convergence of the network weights. They acquired quite good results for the training set: a mean absolute difference of 6.4 months was achieved, and the largest error was 24.4 months. However, the results were not as good for the testing set. The MAD for the testing set was 18.9 months and the largest error was 69.2 months, i.e., about 5.8 years off. The authors suggested that this could be due to overfitting, since the training set was quite small (400 images).

Another similar study was done by Lee et al. [27] in 2017. They proposed a fully automated deep learning system for bone age assessment, which utilized a preprocessing pipeline to standardize each radiograph in the collected dataset and reduce as much noise as possible. To this end, they proposed a preprocessing engine that consists of a CNN to segment the hand and generate a corresponding mask, followed by a vision pipeline to standardize the images and maximize their invariant features. A different dataset than the one used in this thesis was used for the radiographs.

Radiographs were collected using an internal report search engine called Render, where all radiology reports containing the exam code "XRBAGE" were collected. Radiographs from patients of chronological age 5-18+ were collected from various sources. Ages 0-4 were excluded for various reasons. The total number of radiographs after removing uninterpretable and deformed images was 4278 female and 4047 male radiographs – a total of 8325 radiographs.

Since the radiographs were collected from a lot of different sources, they varied a lot in both contrast and aspect ratio, and some of the radiographs were inverted (white background, gray bones). The inverted radiograph issue, as well as the different aspect ratios, were handled in the preprocessing pipeline. The differing aspect ratios were handled by giving all the radiographs a height of 512 pixels and then using zero-padding to fill out the width to 512 pixels. Zero padding means filling out the image with black pixels until the desired width is reached.

Three networks for transfer learning were considered: AlexNet, GoogleNet, and VGG16, as they have all been validated in the ImageNet Large Scale Visual Recognition Competition. According to the study by Canziani et al. [24], VGG16 is the best performer of the three based on accuracy, and AlexNet is the worst. However, GoogleNet uses about 25 times fewer trainable parameters and achieves performance comparable to VGG16. Therefore, they decided to use a pre-trained GoogleNet model from Caffe Zoo. Various tweaks had to be made to fine-tune the network, since GoogleNet is trained on RGB images while radiographs are in grayscale format. Data augmentation was used to artificially increase the size of the collected dataset.

The authors produced four different models, on which they measured accuracy. The first model was trained with the original radiographs, simply reshaped to a resolution of 224 x 224 pixels. This achieved a test accuracy of 39.06% for females and 40.60% for males. Females and males were assigned an age within one year of the ground truth 75.59% and 75.54% of the time, respectively.


The final result in the study achieved a classification accuracy of 98% with an error margin of 2 years, and 90% accuracy with an error margin of 1 year. The authors suggest that the CNN could be using unknown features from the radiograph to perform the classifications, other than what is used in traditional manual bone age assessments and that further investigation is needed to determine this.

4 Research Methodology

4.1 Theoretical

The main source of information in this thesis is Google Scholar. Mendeley Desktop is used to handle references. Relevant articles are selected by a certain set of content criteria. For a paper to be of relevance, it needs to include one or more of the following:

Bone Age Assessments

BAAs are one of the main focuses of this thesis. To be able to train a deep learning model to automatically predict bone age, a basic knowledge of how manual BAAs are performed is necessary. Not all parts of the radiograph are anatomically relevant for performing a BAA, so to avoid the network learning from irrelevant features, a basic understanding of manual BAAs is acquired from papers discussing the GP or TW method.

Examples of search phrases used: ”bone age assessment methods”, ”bone age assessments AND comparison”, ”bone age assessments AND deep learning”, ”bone age assessments AND rsna”,

Object Detection methods

Since this thesis is limited to using three different models for the growth plate segmentation, Mask R-CNN, Yolo, and RetinaNet, only articles in which these three models were used will be considered. Comparisons between the three are preferred.

Examples of search phrases used: ”mask r-cnn vs yolo”, ”object detection AND convolutional neural network”, ”object detection AND comparison”, ”object detection methods”

Convolutional Neural Networks

CNNs are widely used in both deep learning and deep learning-based object detection. Therefore, some knowledge regarding CNNs is necessary to successfully segment the dataset in the desired way and also to train the models in the later stage. Preferably, articles regarding medical imaging should be used.

Examples of search phrases used: ”convolutional neural networks”, ”pre-trained models”, ”deep learning models AND comparison”,”xception AND inceptionv3 AND resnet AND vgg”.

4.2 Methods and Experiments


Data is collected during training, validation, and testing of the models in all three research questions. The data is analyzed using Python and Matplotlib by creating relevant graphs.

4.3 Experimental Procedure

4.3.1 RQ1

What image segmentation method will be most useful for the growth plate segmentation?

Dataset The first step in the setup is to download the RSNA X-ray Hand dataset. It can be downloaded from Kaggle's website [8]. The dataset comes pre-split into training and testing. However, the testing set does not include ground truths. Therefore, the testing set is excluded and the training set of 12611 images is split into training, validation, and testing sets in a 70/15/15 ratio, as shown in Table 4.

Total Nr of Images    12611
Training              8827
Validation            1892
Testing               1892

Table 2: Dataset Split.

id      bone age    male
1377    180         False
1378    12          False
1379    94          False

Table 3: Dataframe Structure.

Table 4: Dataset split and dataframe structures.

The gender and age distribution can be seen in Figure 5. The genders are distributed as 54% male and 46% female.

Figure 5: Gender and Age Distribution.

The mean bone age of the dataset is 127 months and the standard deviation is 41 months. The dataset includes radiographs within a large range of bone ages, the maximum being 228 months (19 years) and the minimum being 1 month. The dataframes also include the gender for each radiograph, although we do not consider it during either training or testing.


Figure 6: LabelImg, software for creating image annotations.

These annotation files along with their corresponding radiographs are used to train Mask R-CNN, Yolo, and RetinaNet. To train the models, Google Colab is utilized, which is a free cloud service that can be used to execute code on a Tesla K80 GPU. All project files need to be uploaded to Google Drive to do this.

During training, the 192 annotated training radiographs are used to train each model continuously for 10 hours. Colab has a limit of 12 hours of continuous execution for each kernel. Mask R-CNN takes about three and a half hours per epoch, so it can train for at most three epochs during a 12-hour runtime. Therefore, for fairness, Yolo and RetinaNet were also limited to 10 hours of training time. The source code of all models has been modified to save the predicted bounding box coordinates during testing in the following format:

classname, confidence, x1, y1, x2, y2

Classname is the name of our class: plate. Confidence is the percentage expressing how sure the model is that the segmented region is a growth plate. (X1, Y1) and (X2, Y2) refer to the top-left and bottom-right coordinates of the predicted bounding box.
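As a minimal illustration, the sketch below parses one saved prediction line; the example line is made up, and whitespace separation between the fields is an assumption based on Listing 7 rather than something stated explicitly.

# Hypothetical saved prediction line in the "classname, confidence, x1, y1, x2, y2" format
line = "plate 0.93 102.0 311.0 164.0 372.0"

classname, confidence, x1, y1, x2, y2 = line.split()
confidence = float(confidence)
x1, y1, x2, y2 = map(float, (x1, y1, x2, y2))

print(classname, confidence, (x1, y1), (x2, y2))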

Yolo To set up the Yolo environment, we need to clone the Darknet repository [28], which is an implementation of YOLO. After that, we need to change several files in the repository and convert the annotations for the training set.

Yolo requires a special format for its annotation files, so we need to run a script available in the repository to convert them from XML format to the Yolo txt format. When all the annotations have been converted and uploaded to Google Drive in the correct folder, the training process is initiated with the command shown in Listing 1.

./darknet detector train cfg/obj.data cfg/yolov3-tiny.cfg darknet53.conv.74

Listing 1: Code to initiate training Yolo.


./darknet detector test cfg/obj.data cfg/yolov3-tiny.cfg backup/yolov3-tiny.backup data/4535.png

Listing 2: Code to use Yolo for detecting objects

Mask R-CNN The setup for Mask R-CNN is similar to that of YOLO. We start by cloning the required repository [29]. Then we need to put all the annotations and radiographs in the correct folder. For Mask R-CNN, we are utilizing transfer learning with the pre-trained COCO weights. Training is initiated using the command shown in Listing 3.

model.train(train_set, test_set, learning_rate=2*config.LEARNING_RATE, epochs=5, layers='heads')

Listing 3: Code to initiate training Mask R-CNN.

The weights are automatically backed up after every epoch. After training the model for 3 epochs, it is tested by running the command shown in Listing 4.

result = model.detect([img])
r = result[0]
visualize.display_instances(img, r['rois'], r['masks'], r['class_ids'], test_set.class_names, r['scores'], title="Predictions")

Listing 4: Code to use Mask R-CNN for detecting objects.

During detection, the segmented radiographs and predicted bounding boxes are saved to /predictions/images and /predictions/bounding_boxes, respectively. The bounding boxes will be used to compare Mask R-CNN's performance metrics against Yolo's and RetinaNet's.

RetinaNet For the final model, we are using this implementation of RetinaNet [30]. To train this model, we first need to generate CSV files for the training and testing set. This is done by using a script provided in the GitHub repository. After that, the annotations and images, along with the generated CSV files, are placed in the dataset directory. Training is initiated with the command shown in Listing 5.

retinanet-train --weights resnet50_coco_best_v2.1.0.h5 --steps 400 --epochs 20 --snapshot-path snapshots --tensorboard-dir tensorboard csv dataset/train.csv dataset/classes.csv

Listing 5: Code to initiate training RetinaNet.

During training, the model is continuously backed up after each epoch. When training is done, the trained model needs to be converted to inference mode before it is loaded. This is done using the commands in Listing 6.

retinanet-convert-model final.h5 inference/model.h5
model_path = os.path.join('inference', 'model.h5')

Listing 6: Code to convert RetinaNet’s trained model into inference mode and load it.


from os import listdir
from os.path import isfile, join

import numpy as np
from tqdm import tqdm
from keras_retinanet.utils.image import read_image_bgr, preprocess_image, resize_image
from keras_retinanet.utils.colors import label_color

test_path = '../testing_images_clahe/'
result_images_path = 'predictions/images/'
result_bounding_box_path = 'predictions/bounding_boxes/'
test_images = [f for f in listdir(test_path) if isfile(join(test_path, f))]

# 'model' is the RetinaNet inference model loaded after Listing 6
for f in tqdm(test_images):
    image = read_image_bgr(test_path + f)
    labels_to_names = {0: 'plate'}

    # Preprocess and resize the radiograph before inference
    image = preprocess_image(image)
    image, scale = resize_image(image)

    boxes, scores, labels = model.predict_on_batch(np.expand_dims(image, axis=0))
    boxes /= scale
    lines = []

    for box, score, label in zip(boxes[0], scores[0], labels[0]):
        # Predictions are sorted by score, so stop at the first low-confidence box
        if score < 0.7:
            break
        lines.append(str(label) + " " + str(score) + " " + str(box[0]) + " " +
                     str(box[1]) + " " + str(box[2]) + " " + str(box[3]))
        color = label_color(label)

    # Write detected bounding boxes to file
    with open(result_bounding_box_path + f[:-3] + "txt", 'w') as file:
        for l in lines:
            file.write("%s\n" % l)

Listing 7: Code for running the RetinaNet model in inference mode and saving the predicted bounding boxes.

Contrast Limited Adaptive Histogram Equalization (CLAHE) Due to the large variance in brightness and contrast in the radiographs, we also tried applying CLAHE to the training and testing sets, to see if it has any positive effect on the models. CLAHE is an adaptive histogram equalization (AHE) technique used to improve the contrast in images. Normal AHE has a common problem of creating too much noise in areas that are quite homogeneous, which might be a problem with some of the radiographs since they have black and gray backgrounds. To prevent this noise, we use CLAHE, which extends normal AHE by preventing the over-amplification of these areas. This function is part of the OpenCV library for Python.
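A minimal sketch of applying CLAHE with OpenCV is shown below; the clip limit and tile grid size are illustrative defaults, not values reported in the thesis.

import cv2

# Load a radiograph as a grayscale image (placeholder file name)
image = cv2.imread("radiograph.png", cv2.IMREAD_GRAYSCALE)

# Contrast Limited Adaptive Histogram Equalization
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
equalized = clahe.apply(image)

cv2.imwrite("radiograph_clahe.png", equalized)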

4.3.2 RQ2

All four models are pre-trained on ImageNet and are trained using the same hyperparameters, as shown in Table 5. The Keras TensorFlow framework is used to train the models, and we start by setting up our training, validation, and test generators. Since the dataset is large, all images are downscaled to a resolution of 256x256. We normalize the target, i.e., the age variable. This is done by calculating the standard deviation and mean of the dataset and then performing the operation shown in Eq. 5:

Normalized\ Age = \frac{Actual\ Age - mean(Age)}{std(Age)}   (5)

This means that values close to the mean get a value close to 0, values lower than the mean get a negative value, and values above the mean get a positive value. This speeds up the training, since we do not need to use dropout layers in between the layers added to the pre-trained model.

We also add additional layers on top of the pre-trained model. Our problem is not a classification problem but a regression problem, so these added layers output a single real value instead of a class.
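A minimal sketch of this setup in Keras is given below, using Xception as the backbone. The specific head layers (global average pooling and a single linear unit), the optimizer, and the loss are illustrative assumptions; the thesis only states that additional layers are added to produce a real-valued output.

import numpy as np
from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Normalize the age labels as in Eq. 5 (ages in months, illustrative values)
ages = np.array([180.0, 12.0, 94.0, 66.0])
age_mean, age_std = ages.mean(), ages.std()
normalized_ages = (ages - age_mean) / age_std

# Pre-trained backbone without its ImageNet classification head
base = Xception(weights="imagenet", include_top=False, input_shape=(256, 256, 3))

# Regression head: pool the feature maps and output a single real value
x = GlobalAveragePooling2D()(base.output)
age_output = Dense(1, activation="linear")(x)
model = Model(inputs=base.input, outputs=age_output)

model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# A prediction is mapped back to months with: age = prediction * age_std + age_mean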

Training The hyperparameters used during training are shown in Table 5.

Hyperparameter               Value
Seed                         42
Batch Size                   32
Shuffle                      True
Image Size                   256 x 256
Initial LR*                  0.001
LR Patience                  5
LR Decrease factor           0.1
Training steps per Epoch     30
Validation steps per Epoch   10

Table 5: Hyperparameters used during training.

We train the networks using a batch size of 32 and 30 steps per epoch, which means that we pass 960 images to the network per epoch. We experimented with lower and higher batch sizes. Increasing the batch size seemed to reduce the network’s ability to learn while decreasing it slowed down the training process. A batch size of 32 seems to be the optimal value. During validation, we use a step size of 10 and the same batch size, which means that we pass 320 validation images each epoch, and the weights are updated as many times. A preprocessing function is applied from each model’s corresponding library. After the training is completed, each model is used to predict the age of all the 1892 testing images. The log for each training process is also saved so that we can plot the training process for each model against the others. The workflow from passing an image to the pre-trained model to outputting an age prediction can be seen in Figure 7.


Figure 8: The complete workflow of RQ3.

4.3.3 RQ3

Can the accuracy (from RQ2) be improved by extracting each hand's growth plates from every image in the dataset (RQ1), as compared to using the total area of a hand?

To evaluate whether the segmentation achieved in RQ1 has any positive effect, we use the same models as in RQ2, but this time we train them on the individual extracted growth plates. The trained Yolo model from RQ1 is used to extract the growth plates from the 12611 training images. A custom script written in Python is used to run the model on all the radiographs in the training and testing sets, and then exclude radiographs which did not meet the minimum criterion of 10 successfully segmented growth plates.

The training and validation sets contain a total of 10719 images, of which 240 were excluded due to Yolo not being able to detect enough growth plates. From the training set, 137 761 growth plates were successfully segmented. The testing set contains 1892 images, of which Yolo could successfully segment 1305, resulting in 14140 segmented growth plates for the testing set. Each segmented growth plate from the training set is passed to the same networks used in RQ2. Figure 8 shows the complete workflow, from the radiograph being passed to and segmented by Yolo, to the pre-trained model predicting an age for each growth plate and the final age prediction being calculated.

The networks are trained using the same hyperparameters as in RQ2, as shown in Table 5, except for the input size. The segmented growth plates are sub-images of the radiographs, which means that they contain fewer pixels than the whole radiographs. Each segmented growth plate has a resolution ranging from about 60x60 to 80x80. Xception is the model that requires the largest input of all the models: 75x75 pixels. Therefore, all images are resized to a 75x75 resolution before they are passed to the networks for training.


plates. All these predictions are then used to calculate the final age prediction for the radiograph.

id         bone age    male
1721_0     66          True
1721_1     66          True
1721_2     66          True
1721_3     66          True
1721_4     66          True
1721_5     66          True
1721_6     66          True
1721_7     66          True
1721_8     66          True
1721_9     66          True
1721_10    66          True
1721_11    66          True
1721_12    66          True
1721_13    66          True

Table 6: Testing dataframe structure. The number before and after the underscore is the original id of the radiograph and the id of its respectively segmented growth plates.

Five different formulas are used to calculate the final age prediction, based on the individual age predictions for each radiograph. The most accurate one over the whole testing set will be compared to the MAE achieved in RQ2. The formulas are listed below, and a small code sketch of them follows the list.

• Mean = (sum of all predictions) / (total number of predictions)
• Median = the (n+1)/2-th prediction when the predictions are sorted
• Median-and-Mean Average = (Median(P) + Mean(P)) / 2
• Mode, meaning the most common prediction
• Minimum, the smallest prediction
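The sketch below applies these five aggregations to the per-growth-plate predictions of a single radiograph. The prediction values are illustrative, and rounding before taking the mode is an assumption, since the thesis does not state how the mode of continuous predictions is computed.

import numpy as np
from statistics import multimode

# Hypothetical per-growth-plate age predictions (in months) for one radiograph
predictions = np.array([63.0, 66.0, 66.0, 68.0, 71.0, 66.0, 64.0, 69.0, 70.0, 66.0])

mean_age = predictions.mean()
median_age = np.median(predictions)
median_and_mean_average = (median_age + mean_age) / 2
mode_age = multimode(predictions.round().tolist())[0]   # most common (rounded) prediction
minimum_age = predictions.min()

print(mean_age, median_age, median_and_mean_average, mode_age, minimum_age)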

Early stopping with a patience of 30 epochs is used, which means that the training is terminated automatically if the MAE on the validation set has not improved for 30 epochs. Since the input size is much smaller than in RQ2 (75x75 vs 256x256), we use a higher patience, since the smaller resolution of the training images requires less computational time.
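In Keras, this early stopping, together with the learning-rate patience and decrease factor from Table 5, can be expressed with callbacks roughly as sketched below; monitoring the validation MAE and restoring the best weights are assumptions, as the thesis does not spell out these details.

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop when the validation MAE has not improved for 30 epochs
    EarlyStopping(monitor="val_mae", patience=30, restore_best_weights=True),
    # LR Patience 5 and LR Decrease factor 0.1 from Table 5
    ReduceLROnPlateau(monitor="val_mae", patience=5, factor=0.1),
]

# model.fit(train_generator, validation_data=validation_generator,
#           steps_per_epoch=30, validation_steps=10, callbacks=callbacks)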

4.4 Performance Metrics

4.4.1 RQ1

True Positives A true positive is when a predicted bounding box overlaps the corresponding ground truth (GT) bounding box by at least a certain percentage, 70% in our experiment. This is referred to as a correct prediction.

False Positives A false positive is when our model predicts an object where there is no object. It does not intersect any GT-bounding boxes and is an incorrect classification.

False Negatives A false negative is when a growth plate that is present in the ground truth is not detected by the model, i.e., a missed object.

Using these three values for each model, we can calculate the performance metrics that will be used to evaluate which model will be used for the segmentation in RQ3.

Precision Precision describes how many of the model's predicted bounding boxes match GT bounding boxes. In simple terms, it is the fraction of the model's predictions that are actually correct. It is calculated as shown in Eq. 6:

Precision = \frac{TP}{TP + FP}   (6)

Recall Recall, also referred to as sensitivity, is the true positive rate of the predictions made by our model. It measures the probability that our model can successfully detect the objects. It is calculated as shown in Eq. 7:

Recall = \frac{TP}{TP + FN}   (7)

F1 Score The F1 score is the harmonic mean of precision and recall. It is useful since it takes both false positives and false negatives into account. Precision does not take into account how many of the present objects were not detected, while recall does not take into account how many objects were falsely detected. It is a good score for evaluating the overall performance of a model.

F1\ Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}   (8)
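The sketch below computes Eqs. 6-8 from raw counts and includes an intersection-over-union helper, which is a common way to implement the overlap check behind the true-positive definition; the thesis only states a 70% overlap threshold, so treating it as IoU is an assumption, and the counts are illustrative.

def iou(box_a, box_b):
    # Intersection over union of two (x1, y1, x2, y2) boxes
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

def detection_metrics(tp, fp, fn):
    precision = tp / (tp + fp)                            # Eq. 6
    recall = tp / (tp + fn)                               # Eq. 7
    f1 = 2 * precision * recall / (precision + recall)    # Eq. 8
    return precision, recall, f1

print(iou((102, 311, 164, 372), (100, 305, 160, 370)))    # overlap of two example boxes
print(detection_metrics(tp=950, fp=20, fn=45))            # illustrative counts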

4.4.2 RQ2 and RQ3

To evaluate the second and third research questions, the Mean Absolute Error (MAE) will be used. MAE is one of the most commonly used performance metrics for regression problems. It calculates the average absolute distance between the expected value and its corresponding predicted value. Mean Absolute Error is a synonym for Mean Absolute Distance (MAD), which was used in the study by Lee et al. [26] and in the RSNA competition [8].

When we evaluate our models on the testing dataset, some predictions may be very accurate while other predictions may be far off. To account for this, we need a good general measurement. By performing all the predictions, calculating the absolute value of the difference between prediction and ground truth, summing these values, and calculating the mean, we get a good general metric for evaluating the performance of our model. MAE is calculated as shown in Eq. 9,

MAE = \frac{\sum |age - age_p|}{N}   (9)

where age is the actual age in months, age_p is the predicted age, |.| returns the absolute value of the difference, and N is the total number of observations (i.e., the number of samples). After calculating the MAE for the models in RQ2 and RQ3, we can get a sense of whether the achieved segmentation improved the accuracy or not.
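A minimal sketch of Eq. 9 with NumPy is shown below; the age values are illustrative and in months.

import numpy as np

actual_ages = np.array([120.0, 66.0, 180.0, 94.0])       # ground-truth bone ages (months)
predicted_ages = np.array([113.2, 71.5, 174.9, 101.3])   # model predictions (months)

mae = np.abs(actual_ages - predicted_ages).mean()
print(mae)   # mean absolute error in months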


5 Analysis and Results

5.1 RQ1

After training Mask R-CNN, Yolo, and RetinaNet continuously for 10 hours each on Colab using the 192 training radiographs labeled with ground truth, we were able to achieve the results as shown in Figure 9.

Figure 9: Precision, Recall and F1 score for the three models when testing on the 100 testing images.

By looking at the precision, we can see that it is high for all the models, which means that they all do rather well when it comes to the accuracy of their predictions. Not that many of the predictions are wrong, meaning that we do not have many false positives for any of the models.

However, when we observe the recall, we see that Mask R-CNN has trouble detecting a large number of the growth plates; more than 16% stay undetected. Many of the images have low resolution, rotated hands, and growth plates close to each other (especially in young children), but Mask R-CNN seems to have trouble even with very clear images that none of the other models have any trouble with. This agrees with the study by Sumit et al. [20], in which Mask R-CNN had trouble identifying multiple human beings in clear images. Mask R-CNN also had more false positives than Yolo, which contradicts the results in the study by Buric et al. [19]. RetinaNet and Yolo generally seem to perform well on radiographs which are clear to human perception, while Mask R-CNN sometimes performs poorly on the same images.


Figure 10: RetinaNet detecting growth plates outside the distal phalanges.

RetinaNet also has trouble with the labels in the radiographs' corners, sometimes detecting their edges. The total number of true positives, false positives, and false negatives for each model can be seen in Figure 11.

Figure 11: Total TP, FP and FN for all the models.

To give the models a general score, the F1 score was calculated using the precision and recall for each model. In this context, Mask R-CNN performed the worst, because of its low recall, achieving a score of 0.91. Yolo outperforms RetinaNet by a very small margin: 0.974 vs 0.971.


Figure 12: Before - After CLAHE contrast improvement.

After applying CLAHE to the testing set, the models were tested to see if the contrast improvement had any positive effect. The precision improved by 0.16% for Mask R-CNN, while Yolo's was unaffected. RetinaNet's precision dropped by 0.6%. The results are shown in Figure 13.

Figure 13: Performance after applying CLAHE on the testing set.

The recall dropped by 1.73% for Mask R-CNN, while slightly improving for Yolo and noticeably so with 1.23% for RetinaNet.


but not noticeably so. However, we cannot know the true effect of CLAHE without also applying it on the training set, so we did that for Mask R-CNN and Yolo. The results can be seen in Figure 14. RetinaNet was excluded due to time constraints.

Figure 14: Performance after applying CLAHE on the train and testing set.

Mask R-CNN's precision increased slightly, by 0.001, while the recall dropped substantially from 0.838 to 0.625. Applying CLAHE to the training and testing sets seems to have substantially increased the number of false negatives for Mask R-CNN. Yolo's precision stayed the same, while its recall dropped from 0.955 to 0.917. CLAHE seems to hurt both models, especially Mask R-CNN. However, it should be noted that this could be due to the re-training of the models, and not actually because of CLAHE being applied. Further training and testing are required to conclude definitively that CLAHE is the cause of this decrease in performance.

5.2 RQ2


Figure 15: Error on the validation set during training for Xception, InceptionV3 and ResNet152.

Xception terminated after 63 epochs, having achieved the best MAE of 9.5293 months. ResNet152 finished after 53 epochs with an MAE of 13.9263 months, and InceptionV3 finished after 79 epochs with an MAE of 9.8129 months. ResNet152 was the first network to reach its lowest local minimum, which might be due to its depth, resulting in a vanishing gradient. The second network to converge was Xception. It converged slightly after ResNet152 but with a 32% smaller MAE. 17 epochs later, InceptionV3 reached its lowest local minimum at an MAE of 9.8129. This was expected, since Xception is a modified, improved version of Inception.

VGG19 had trouble learning from the dataset and ended up converging at about 30 MAE. This could be because VGG19 is too shallow to learn the complex features needed for BAA. When evaluated on the testing set, VGG19 ended up predicting about 150 months for all images, which is close to the mean age of the training dataset. This suggests that the model was simply not able to learn from the data and found that it could minimize its error by predicting the mean for every image. Looking at the identity lines for each of the three successfully trained models in Figure 16, we see that both Xception and InceptionV3 perform quite well on images within any age group, while ResNet152 performs quite well on images centered around the mean (126.67 months) but struggles with the lower and higher ranges. This could be due to slight overfitting to the training set, since the dataset is quite heavily centered around the mean.

Figure 16: Identity lines for the three successfully trained models.


All models have some difficulty with infant hands: one such case is shown in Figure 17. However, Xception and InceptionV3 are usually off by 20-30 months, while ResNet152 is far off, by almost 120 months.

Figure 17: All three models predicting the age of a baby hand, ResNet152 being far off.

As in the study by Lee et al. [27], we evaluated how well the three models perform within four different age ranges: prepuberty (2-8 years), early-and-mid puberty (9-13 years), late puberty (14-16 years), and post puberty (17+ years). The MAE per age group and model can be seen in Table 7.

Age Groups           Xception   InceptionV3   ResNet152
all                  12.08      13.66         33.61
prepuberty           13.86      17.31         47.75
early-mid puberty    11.37      12.42         18.57
late puberty         10.46      11.20         39.88
post puberty         11.85      13.62         32.99

Table 7: MAE per model and age group.
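Such a per-group breakdown can be produced with a short helper. A hedged sketch follows; the group boundaries in months are assumptions derived from the year ranges given above, not the exact cut-offs used:

import numpy as np

# Age groups in months, assumed from the year ranges used above.
GROUPS = {
    "prepuberty": (2 * 12, 8 * 12),
    "early-mid puberty": (9 * 12, 13 * 12),
    "late puberty": (14 * 12, 16 * 12),
    "post puberty": (17 * 12, 30 * 12),
}

def mae_per_group(y_true, y_pred):
    """Mean absolute error overall and within each age group (ages in months)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    errors = np.abs(y_true - y_pred)
    result = {"all": errors.mean()}
    for name, (lo, hi) in GROUPS.items():
        mask = (y_true >= lo) & (y_true <= hi)
        if mask.any():
            result[name] = errors[mask].mean()
    return result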

The study by Lee et al. took gender into account during both training and evaluation; since we do not, the age ranges in this thesis differ slightly.

Prepuberty All models seem to perform the worst on prepuberty radiographs. Four samples are shown in Figure 18.

Figure 18: Four samples from the prepuberty range.


A possible improvement could be to apply the CLAHE histogram equalization technique on the testing set, although it did not make any significant difference when used for the detectors in RQ1.

Late puberty Both Xception and InceptionV3 performed the best within the late-puberty range; samples are shown in Figure 19.

Figure 19: Four samples from the late puberty range.

These radiographs do not vary as much in characteristics as the radiographs in the prepuberty range.

Early-and-mid puberty ResNet152 performed the best in the early-and-mid puberty range. The radiographs in this range vary in characteristics in a way similar to the late puberty range, but ResNet152 performed significantly better here, with about a 50% smaller MAE. Samples can be seen in Figure 20. This suggests that ResNet152 suffers from slight overfitting to the training dataset, which is heavily distributed around this range. Xception and InceptionV3 seem less affected by this.

Figure 20: Four samples from the early-and-mid puberty range.


Figure 21: Final MAE on the testing set for all three models.

5.3 RQ3

Figure 22: Error on the validation set during training for Xception and InceptionV3.

Xception reached its lowest local minimum of 10.7632 MAE after 57 epochs. It stopped early 30 epochs later, due to no further decrease in error. InceptionV3 reached its lowest local minimum after 72 epochs, with an MAE of 15.5753. This was expected, since it also reached its local minimum later than Xception in RQ2. We also tried training InceptionV3 an additional time, during which it reached its lowest local minimum after 111 epochs with an MAE of 14.6911. For fairness, we used the first MAE, since the other models were not allowed to train past 100 epochs.
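The early-stopping behaviour described here corresponds roughly to the following Keras callback setup. This is a sketch under the assumption that the patience was 30 epochs and the cap 100 epochs, as the description suggests; the model and data arrays are taken from the earlier sketch, and the batch size is a placeholder:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop when the validation MAE has not improved for 30 consecutive epochs.
    EarlyStopping(monitor="val_mae", patience=30, restore_best_weights=True),
    # Also keep the weights from the best epoch on disk.
    ModelCheckpoint("best_model.h5", monitor="val_mae", save_best_only=True),
]

# x_train, y_train, x_val, y_val: preprocessed radiographs and ages in months
# (assumed to exist from the data-loading step).
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=100,      # hard cap on training length
    batch_size=32,   # placeholder value
    callbacks=callbacks,
)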


Figure 23: InceptionV3’s MAE when predicting on individual growth plates.

When observing the identity line for InceptionV3, we see that it does not make any predictions above 150 months. The model did not have this problem in RQ2, so this is a significant performance decrease. Performance also seems to have decreased for radiographs close to the mean of the dataset, with some predictions being very far off to the right. There do not appear to be any significant changes in the lower ranges.


Figure 24: Xception’s MAE when predicting on individual growth plates.


Figure 25: Identity line for Xception in RQ2 vs Xception in RQ3.

As in RQ2, we evaluated how well Xception and InceptionV3 perform within the four different age ranges; the results are shown in Table 8 below.

Age Groups           Xception      InceptionV3
all                  14.44300358   22.2357084
prepuberty           14.54484937   20.87413088
early-mid puberty    14.0084219    15.91264059
late puberty         13.69136133   30.4433828
post puberty         14.4523507    22.16788289

Table 8: MAE per age group for Xception and InceptionV3.


in RQ2, suffers from overfitting to the training dataset. The segmentation using Yolo in RQ3 seems to have decreased the overall performance of InceptionV3.

The final age prediction has so far been calculated as the mean of all individual growth-plate predictions per radiograph. Since Xception was the best-performing model, we investigated whether a better score could be achieved using different aggregation formulas. We started by analyzing the individual predictions per radiograph; some samples are shown in Table 9.

GT   Growth Plate Predictions                                                                            Mean
27   27.43, 31.02, 41.91, 43.09, 48.68, 50.20, 50.35, 50.87, 52.03, 54.45, 55.06, 63.15, 69.39, 99.71    52.67
30   14.21, 16.88, 18.74, 22.65, 28.48, 28.86, 32.32, 35.21, 35.58, 38.90, 41.81, 42.21                  29.65
36   30.01, 32.01, 38.89, 41.30, 42.07, 46.72, 46.74, 49.61, 50.54, 51.76, 53.45, 58.93                  45.17
42   41.92, 46.15, 46.33, 46.44, 47.50, 47.58, 49.66, 50.52, 51.46, 51.69, 52.70, 53.20                  48.76
55   54.29, 64.18, 68.46, 71.81, 82.28, 83.20, 86.44, 91.09, 91.83, 92.77, 95.69, 97.29, 98.85, 109.82   84.86
60   60.53, 69.18, 75.36, 77.84, 79.38, 85.49, 88.45, 90.74, 91.86, 93.92, 97.44, 99.12, 99.24           85.27
67   72.54, 76.56, 78.00, 78.45, 78.78, 79.77, 82.96, 83.10, 83.51, 84.96, 87.17, 89.51, 90.59, 92.16    82.72
69   69.91, 72.77, 73.62, 75.86, 77.53, 78.26, 79.27, 79.99, 81.07, 82.68, 85.14, 85.65, 87.62           79.18
72   46.05, 50.06, 54.53, 69.23, 70.51, 72.20, 72.88, 75.20, 76.69, 79.11, 79.53, 86.38, 89.78           70.94

Table 9: Individual growth plate age predictions, calculated mean and corresponding ground truths. All predictions were rounded to two decimals in the table.

When observing Table 9, we see that the mean sometimes gives a good score. In the second row, where GT=30, the mean of all predicted ages gives a final age of about 29 months, which is very accurate. In the first row, with GT=27, the final predicted age of 52.67 months is far off, however, at nearly double the ground truth. The standard deviation of the predicted ages is fairly similar for these two rows, about 10 vs 11 months, so there does not seem to be any clear relationship between the standard deviation of the predicted ages and the accuracy of the mean.

Lowest, Median and Mode Since the lowest predictions sometimes seem to be the most accurate, we also tried calculating the bone age as simply the lowest prediction. This did not improve the results; instead, it raised the MAE to 23.79573857. The method appears slightly more accurate in the lower ranges, where outliers strongly affect the mean.

Taking the median of the predicted values decreased the MAE from 14.44300358 to 14.3173284, an improvement of about 1%. This is still higher than the MAE of 12.08723898 that Xception achieved in RQ2. The segmentation using Yolo thus seems to lower the accuracy of Xception slightly, while significantly decreasing it for InceptionV3 and making ResNet152 unable to learn from the dataset. VGG19's performance neither decreased nor increased, since it was not able to learn from the unsegmented dataset in the first place. Simply taking the lowest predicted value worked well for some radiographs but decreased performance significantly on the whole dataset, to an MAE of 23.79573857 months. Using the mode gave the same accuracy as taking the minimum: since none of the predicted ages are identical, it simply returned the first prediction. We also tried the mode after rounding all predictions, which decreased the MAE to 16.43351548269581, still worse than both the mean and the median.

Since the mean gave a good score for some radiographs while the median did for others, we also tried taking the average of the two. This gave an MAE of 14.37235406, which is better than the mean alone but worse than the median. Of all the evaluated methods, taking the median prediction turns out to be the best.
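The aggregation variants compared above can be summarised in a short helper. This is a minimal sketch, not the exact script used; in particular, the mode variant shown here only implements the rounded version:

import numpy as np
from statistics import median, mode

def aggregate(preds, method="median"):
    """Combine per-growth-plate age predictions into one bone age estimate (months)."""
    preds = sorted(float(p) for p in preds)
    if method == "mean":
        return float(np.mean(preds))
    if method == "median":
        return float(median(preds))
    if method == "min":
        return preds[0]
    if method == "mode":
        # Predictions are continuous, so the mode only becomes meaningful after rounding.
        return float(mode(round(p) for p in preds))
    if method == "mean-median":
        return (float(np.mean(preds)) + float(median(preds))) / 2
    raise ValueError(f"unknown method: {method}")

# Example with the GT=30 row from Table 9:
row = [14.21, 16.88, 18.74, 22.65, 28.48, 28.86, 32.32, 35.21, 35.58, 38.90, 41.81, 42.21]
mean_estimate, median_estimate = aggregate(row, "mean"), aggregate(row, "median")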


Figure 26: Relationship between the standard deviation and error for each method.

We see that most of the predictions per radiograph have a standard deviation between 5 and 15 months. All methods except the minimum seem to perform well within this range. The median and mean have similar overall performance, though the mean seems to perform slightly better when the standard deviation is lower, while the median does better when the standard deviation is higher.

The attention maps generated in the study by Lee et al. [27] showed that their model was mainly learning features from regions around the phalanges on radiographs in the early-and-mid and late-puberty ranges. These are the regions where all the growth plates in this thesis were segmented, and also where Xception performed the best. This suggests that we could improve the overall MAE by extending our segmentation to also include growth plates in the carpal bones, which is where their model was learning features from in the pre and post-puberty ranges.

5.4 Ethical Concerns

RQ2 Error (months)   RQ3 Error (months)
74.480835            115.2178879
64.12715912          114.3650551
57.5043945           106.4831238
47.7598419           102.4392548
44.9095154           101.9046936
44.64319611          101.3532562
44.6193085           99.6309967
44.4885559           88.27610016
43.8568115           84.04755783
43.6403503           79.41596985
42.56364441          73.40597153
42.1785736           69.4181366
42.0700684           65.76865387
41.77805328          64.0119476
40.1562958           62.30851364
40.1433258           61.24718475
37.90615082          60.31703186
37.5257874           55.5433197
36.6811218           53.06529999
36.3927307           52.55615997
36.13266754          48.1257782
36.0345688           47.99790955
36.0195312           47.78572845
36.00326538          47.00524902
35.3105164           46.93473434

Table 10: The 25 highest errors in RQ2 and RQ3.


6 Conclusion and Future Work

6.1 Answering RQ1

What image segmentation method will be more useful for the growth plate segmentation? Based on the results from RQ1's experiments, we conclude that Yolo is the best model for the growth plate segmentation. Mask R-CNN suffers from low recall, resulting in many undetected growth plates. This might be improved with a larger training set, but manually labeling 14 growth plates per training image is a very tedious task and not ideal for our scope. Mask R-CNN also has a much slower inference time, up to 5x slower than Yolo's and RetinaNet's, which is a problem since the segmentation is applied to the whole dataset of about 12000 images.

RetinaNet has a higher recall than Mask R-CNN, but instead suffers from lower precision and thus a higher false-positive rate. RetinaNet tends to detect unrelated bones and general noise in the images as growth plates. This is not ideal, since many regions outside the RoIs would be passed to the CNN in RQ3.

Applying CLAHE to the training and testing set does not appear to yield any noticeable improvement in precision for the models, but it has a strongly negative effect on Mask R-CNN's recall.

Yolo has the best F1 score of all the models, which indicates a good balance between precision and recall. It had a single false positive, which was only counted as such because it did not meet the Intersection over Union criterion of 70% that defines a true positive in our experiment. RetinaNet has more true positives than Yolo, but also more false positives, and in our case false positives do more damage than a somewhat lower number of true positives. We use a custom script to filter out radiographs that do not have enough segmented growth plates, so even if we get fewer total true positives with Yolo, losing some images will likely not make a huge difference, since we will still have around 130 000 segmented growth plates. By using Yolo, we can at least be confident that most of them are correctly segmented.
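The filtering step mentioned here can be sketched as follows. This is a hypothetical helper, not the exact script used; the minimum-count threshold and the detection data layout are assumptions:

MIN_PLATES = 10  # assumed minimum number of detected growth plates per radiograph

def filter_radiographs(detections: dict, min_plates: int = MIN_PLATES):
    """Keep only radiographs where the detector found enough growth plates.

    `detections` maps an image filename to its list of detected bounding boxes.
    Returns the kept and dropped filenames as two lists.
    """
    kept, dropped = [], []
    for image_name, boxes in detections.items():
        (kept if len(boxes) >= min_plates else dropped).append(image_name)
    return kept, dropped

# Example with a toy detection dictionary:
example = {"img_001.png": [(10, 10, 30, 30)] * 14, "img_002.png": [(5, 5, 20, 20)] * 3}
kept, dropped = filter_radiographs(example)  # kept: img_001.png, dropped: img_002.png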

The results from this experiment differ somewhat from the hypothesis based on the literature review. Yolo was the fastest detector, as expected. However, both RetinaNet and Yolo outperformed Mask R-CNN in terms of recall and F1 score. Mask R-CNN had slightly higher precision than RetinaNet, due to RetinaNet's high false-positive rate, but overall Mask R-CNN performed the worst, followed by RetinaNet and then Yolo.

Hence, we conclude that Yolo is currently the best object detection technique for our purpose, and proceed with it to RQ3.

6.2 Answering RQ2

How do different pre-trained models perform on the RSNA Bone Age dataset? Xception, InceptionV3, and ResNet152 were all able to learn features from the dataset. ResNet152 performed quite well on images with an age centered around the mean but performed very poorly in the higher and lower age ranges. Xception and InceptionV3 both performed better than ResNet152, with Xception achieving an 11% smaller MAE on the testing set than InceptionV3. Xception also reached its local minimum earlier than InceptionV3. VGG19 was not able to learn any features from the unsegmented dataset and ended up predicting the mean age of the training set for every single radiograph in the testing set.

Overall, Xception appears to be the best model for predicting bone age, and we hypothesize that it will also be the best performer in RQ3. Xception achieved an MAE of 12.08723898 months on the 1892 testing radiographs.

6.3 Answering RQ3

References
