Classification of Atypical Femur Fracture with Deep Neural Networks

YUPEI CHEN

Degree Project in Medical Engineering, Second Cycle, 30 Credits
Stockholm, Sweden 2019

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH


Classification of Atypical Femur Fracture with Deep Neural Networks

YUPEI CHEN

Master in Medical Engineering
Date: August 7, 2019

Supervisor: Chunliang Wang
Reviewer: Örjan Smedby
Examiner: Mats Nilsson

School of Engineering Sciences in Chemistry, Biotechnology and Health


Abstract

Atypical Femur Fracture (AFF) is a type of stress fracture that occurs in conjunction with prolonged bisphosphonate treatment. In practice, AFF is very rarely distinguished correctly from Normal Femur Fracture (NFF) on the first diagnostic X-ray examination. This project aims at developing an algorithm based on deep neural networks to assist clinicians with the diagnosis of atypical femur fracture.

Two diagnostic pipelines were constructed using a Convolutional Neural Network (CNN) as the core classifier. One is a fully automatic pipeline, where the X-ray image is input directly into the network after only standardized pre-processing steps. The other, interactive pipeline requires the user to re-orient the femur bone above the fracture to a vertical position and move the fracture line to the image center before the repositioned image is sent to the CNNs. Three popular CNN architectures, namely VGG19, InceptionV3 and ResNet50, were tested for classifying the images as either AFF or NFF. Transfer learning was used: the networks were pre-trained on images from ImageNet. The diagnostic accuracy was evaluated using 5-fold cross-validation.

With the fully automatic diagnosis pipeline, we achieved diagnostic accuracies of 82.7%, 89.4% and 90.5% with VGG19, InceptionV3 and ResNet50, respectively. With the interactive diagnostic pipeline, the diagnostic accuracy was improved to 92.2%, 93.4% and 94.4%, respectively. To further validate the results, class activation mapping is used to indicate the discriminative image regions that the neural networks use to identify a certain class.

Keywords: deep neural networks, classification, atypical femur fracture, class activation mapping


Acknowledgements

I would like to express my gratitude to the following persons:

Chunliang Wang, my supervisor, for offering me the opportunity to conduct the project and guiding me throughout the thesis project, during which I have learnt a lot.

Jörg Schilcher, for providing the dataset and medical advice for this project.

Örjan Smedby, for reviewing my report and giving me feedback about my progress during the project.

Mehdi Astaraki, Gabriel Carrizo, Irene Brusini and David Andersson, for offering me technical advice and ideas in each group meeting.

Last but not least, thanks to my family and friends, who have always supported and encouraged me throughout my studies.


Acronyms

AFF Atypical Femur Fracture
NFF Normal Femur Fracture

CNN Convolutional Neural Network

ILSVRC ImageNet Large Scale Visual Recognition Challenge

CAM Class Activation Mapping


Contents

1 Introduction
  1.1 Atypical Femur Fracture
  1.2 Deep Neural Networks
  1.3 Aim

2 Methods
  2.1 Dataset and Pre-processing
  2.2 Network Architectures
    2.2.1 VGG
    2.2.2 Inception
    2.2.3 ResNet
  2.3 Transfer Learning
  2.4 Automatic and Interactive Pipelines
  2.5 Class Activation Mapping
  2.6 Cross Validation
  2.7 Experimental Setup

3 Results
  3.1 Automatic Method
  3.2 Interactive Method
  3.3 Comparison with Multiple Metrics
  3.4 Visualization of Results

4 Discussion
  4.1 User Intervention
  4.2 Network Visualization
  4.3 Future Work

5 Conclusions

Bibliography

A State of the Art
  A.1 Atypical Femur Fracture
  A.2 Image Classification
  A.3 Deep Learning
    A.3.1 Feedforward Neural Network
    A.3.2 Convolutional Neural Network
    A.3.3 Transfer Learning
    A.3.4 Dropout and Batch Normalization
    A.3.5 Class Activation Mapping
  A.4 Current Research on Bone Fracture Classification


Chapter 1

Introduction

1.1 Atypical Femur Fracture

Femur fracture is one of the most common fractures across all ages worldwide, with an estimated 9 to 22 femur fractures per 1000 people presenting every year [1]. Antiresorptive drugs such as bisphosphonates have been used successfully in the treatment and prophylaxis of osteoporosis for decades. In recent years, however, a strong association has been shown between bisphosphonate treatment and the occurrence of an insufficiency type of fracture in the femur. These fractures do not occur in the metaphyseal area of the femur, as most fragility-type fractures do. Their location is in the diaphyseal or subtrochanteric area of the bone, which has coined the name Atypical Femur Fracture (AFF). The insufficiency type of fracture in AFF is associated with specific radiographic features, including a transverse or short oblique fracture configuration and focal cortical thickening. These features differ from Normal Femur Fracture (NFF), which shows oblique fracture lines and no signs of cortical thickening [2]. Early diagnosis of AFF is important, since a complete fracture may occur afterwards. The main method for diagnosing AFF is orthogonal radiographs. In practice, however, the diagnostic accuracy of identifying AFF in radiology reports is poor, due to the subtle difference between these two types of fractures and the unknown mechanism behind bisphosphonate-associated AFF [3].

1.2 Deep Neural Networks

Deep learning has been very successful in recent years due to its competitive performance in many research fields, including computer vision, speech recognition and natural language processing. For image analysis, convolutional networks have proven effective in many tasks such as classification [4] and segmentation [5]. In this project, we want to test the ability of deep neural networks to classify AFF and NFF from X-ray images, with and without user intervention.

1.3 Aim

This project aims at:

1. Evaluating the ability of deep neural networks to distinguish AFF from NFF.

2. Visualizing the learned features using Class Activation Mapping (CAM).

3. Testing the importance of additional user intervention with automatic and interactive pipelines.


Chapter 2

Methods

2.1 Dataset and Pre-processing

The dataset was provided by Doctor Jörg Schilcher from the Department of Orthopedics and Experimental and Clinical Medicine, Faculty of Health Science, Linköping University, Linköping, Sweden. The original dataset was extracted from the clinical PACS at Linköping University Hospital and anonymized. An ethical permit was obtained. There are 94 subjects with Atypical Femur Fracture and 106 subjects with Normal Femur Fracture, and each subject has several X-ray images. Manual screening was conducted to remove images of bad quality, leaving 796 images in total, of which 397 show AFF and 399 show NFF.

The dataset is pre-processed with several standardized image processing methods: downsampling, normalization and augmentation. The X-ray images are of different shapes, and all input images are downsampled before being fed into the networks. The image intensities are rescaled to pixel values in the range 0-1, a common technique in digital image processing: images may have pixel values in different ranges, and rescaling lets them contribute more evenly to the loss and weight updates, so all images can be treated in the same manner. Additionally, to eliminate the effect of varying image sizes, all images are converted to squares using zero-padding, which avoids distortion.
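The thesis does not include its pre-processing code, and the framework used is not named; the following is a minimal Python sketch of the steps just described (zero-padding to a square, downsampling, and intensity rescaling to 0-1), with NumPy and Pillow as assumed dependencies.

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, size: int = 256) -> np.ndarray:
    """Zero-pad to a square, downsample, and rescale intensities to [0, 1]."""
    w, h = img.size
    side = max(w, h)
    # Zero-pad to a square so that downsampling does not distort the anatomy.
    canvas = Image.new("L", (side, side), 0)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    # Downsample to the fixed network input size.
    canvas = canvas.resize((size, size), Image.BILINEAR)
    arr = np.asarray(canvas, dtype=np.float32)
    # Rescale pixel values into the range 0-1.
    return (arr - arr.min()) / max(arr.max() - arr.min(), 1e-8)
```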

To gain more robust results from more varied images, the images are augmented through rotation, shifting and zooming. Images are randomly rotated within ±10 degrees and zoomed within 10%; width and height are shifted within 10% as well. These parameters are set to produce reasonable transformations of the input images. The specific factors are given in section 2.7 and sketched in the code below.
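As a sketch of the augmentation just described (the thesis does not name its framework; Keras' ImageDataGenerator is an assumption), the stated factors translate to:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotation within ±10 degrees, width/height shift within 10%,
# and zoom within 10%, as specified in section 2.7.
augmenter = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)
```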

2.2 Network Architectures

Deep neural networks consist of an input layer, an output layer and at least two hidden layers. The feedforward network is a representative deep neural network, and its core is parametric function approximation. For image classification, convolutional networks are the most widely applied deep neural networks. Convolutional layers are their basic component: in image processing, a filter slides across the whole image, and the resulting feature maps are stacked along the channel dimension in the convolutional layer [6].

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has become the benchmark for large-scale visual recognition [7]. It is a contest where software programs compete to correctly classify and detect objects and scenes. Several network structures with top performance in this contest in recent years are tested in this project; they are described below.

2.2.1 VGG

The first neural network tested in this project is VGG. VGG networks are characterized by their uniform architecture [8] and consist of stacked convolutional layers, pooling layers and fully connected layers. VGG is similar to AlexNet [9] but with more filters and layers. VGG networks vary in depth, and the 19-layer model is one of the best-performing variants; in this project VGG19 is used for the experiments. See section A.3.2 in the appendix for more details.

2.2.2 Inception

In this project InceptionV3 is applied as a representative of the Inception family. The first Inception architecture, GoogLeNet, won ILSVRC 2014 with a top-5 error rate of 6.67%, which is close to human-level performance. It is 22 layers deep and was trained with image distortions and the RMSprop optimizer. The use of inception blocks, built from very small convolutions, reduces the number of parameters from 60 million (AlexNet) to 4 million. The computational cost of Inception is much lower than that of VGG and AlexNet, which makes it accessible with limited memory and computational resources.


There are several versions of the Inception architecture. After the first introduction of GoogLeNet, the architecture was refined in various ways, first by the introduction of batch normalization in InceptionV2 [10], and later by additional factorization ideas in the third iteration, referred to as InceptionV3 [11]. Although it is 42 layers deep, the computational cost of InceptionV3 is only about 2.5 times higher than that of GoogLeNet, and it is much more efficient than VGG.

2.2.3 ResNet

ResNet was introduced by He et al. [12] and won ILSVRC 2015. It focuses on solving the degradation problem that appears with increasing depth. Residual functions are applied in the residual network on the hypothesis that optimizing a residual mapping is easier than optimizing the original, unreferenced mapping. It turns out that the network converges faster and can gain accuracy from considerably increased depth [12]. A simplified residual block is sketched below.
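As an illustration of the residual idea (not the exact ResNet50 block), a simplified identity block in Keras might look as follows; the layer sizes are illustrative.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Simplified identity block: two conv layers learn a residual F(x)
    that is added back to the input through a skip connection.
    Assumes x already has `filters` channels so the addition is valid."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([x, y]))
```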

2.3 Transfer Learning

We initialize the CNNs with networks pre-trained on ImageNet, a technique called transfer learning. Transfer learning applies known weights when training deep neural networks in order to obtain better performance. In practice, transfer learning is common when training convolutional neural networks, because the dataset at hand is usually not large enough and training from scratch with random initialization is too expensive. ImageNet [13] is currently the largest publicly available dataset for object recognition and is widely applied in transfer learning. Despite the significant differences between natural and medical images, Shin et al. [14] found that transfer learning from ImageNet can still make medical image recognition tasks more effective. This cross-modality transferability makes transfer learning from ImageNet-trained CNN representations popular in recognition tasks across imaging modalities.
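In Keras terms (an assumed framework, since the thesis does not list its code), ImageNet transfer learning for one of the three architectures could be sketched as follows; the two-unit softmax head for AFF/NFF is an assumption consistent with the task.

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Load ResNet50 with ImageNet weights, dropping the 1000-class ImageNet head.
base = ResNet50(weights="imagenet", include_top=False, input_shape=(256, 256, 3))
x = GlobalAveragePooling2D()(base.output)
out = Dense(2, activation="softmax")(x)  # new two-class head: AFF vs. NFF
model = Model(inputs=base.input, outputs=out)
```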

2.4 Automatic and Interactive Pipelines

Two diagnostic pipelines were constructed using the Convolutional Neural Network (CNN) as the core classifier, in order to study the influence of user intervention on network performance. One is a fully automatic pipeline, where the X-ray image is input directly into the network after the size and intensity normalization steps described above. Three popular CNN architectures, VGG19, InceptionV3 and ResNet50, were tested for classifying the images as either AFF or NFF. Transfer learning was used: the networks were pre-trained using images from ImageNet.

The other, interactive pipeline requires user intervention before the images are sent to the CNNs. In the X-ray images the femur bones are in different positions, whereas we want the fracture at the image center with the femur bone above it vertical. We therefore wrote a script that moves the fracture to the center and rotates the femur above it to a vertical position. The rotated images are then cropped to 256×256 pixels around the fracture center, so that the networks can be expected to focus more on the fracture features. The whole process is driven by two clicks in the user interface. Manual screening was conducted again afterwards to ensure the quality of these images; several images were removed, leaving 389 images with AFF and 388 with NFF.

Figure 2.1 gives an example of the described method.

Figure 2.1: Illustration of Interactive Method
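The repositioning script itself is not listed in the thesis. The sketch below shows how the two clicks could drive the transformation, assuming the first click marks a point on the femoral shaft above the fracture and the second marks the fracture center; the function name, click semantics and rotation sign convention are all assumptions.

```python
import numpy as np
from scipy import ndimage

def reposition(image: np.ndarray, shaft_pt, fracture_pt, size: int = 256):
    """Shift the fracture to the image center, rotate so the femur above it
    is vertical, and crop a size x size patch. Points are (row, col)."""
    h, w = image.shape
    (ys, xs), (yf, xf) = shaft_pt, fracture_pt
    # Translate so the fracture click lands at the image center.
    shifted = ndimage.shift(image, (h / 2 - yf, w / 2 - xf), order=1)
    # Angle of the fracture-to-shaft direction relative to vertical.
    angle = np.degrees(np.arctan2(xs - xf, yf - ys))
    # Rotate about the center; the sign may need flipping depending on
    # the image coordinate convention.
    rotated = ndimage.rotate(shifted, -angle, reshape=False, order=1)
    cy, cx = h // 2, w // 2
    return rotated[cy - size // 2:cy + size // 2, cx - size // 2:cx + size // 2]
```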

2.5 Class Activation Mapping

So far, neural networks have largely been treated as a 'black box': we are not entirely aware of what is happening inside them. Visualizing the networks is therefore important for a deeper understanding of the training process in deep neural networks. In this project, class activation mapping is applied to visualize the features that the network learns. It indicates the discriminative image regions used by the network to identify a certain type of fracture [15].
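Following the recipe of Zhou et al. [15], a class activation map is the class-weighted sum of the last convolutional feature maps. A minimal sketch, assuming a Keras model whose last convolutional layer feeds a global average pooling layer and a dense softmax (as in the architectures above):

```python
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, conv_layer_name, class_idx):
    """Weight the last conv feature maps by the dense-layer weights of the
    target class and sum over channels (CAM of Zhou et al. [15])."""
    conv_model = tf.keras.Model(model.input,
                                model.get_layer(conv_layer_name).output)
    fmaps = conv_model.predict(image[np.newaxis])[0]   # (h, w, channels)
    # The dense kernel has shape (channels, classes); take the target column.
    weights = model.layers[-1].get_weights()[0][:, class_idx]
    cam = np.maximum(fmaps @ weights, 0)               # (h, w), ReLU-clipped
    return cam / (cam.max() + 1e-8)                    # normalized for display
```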

2.6 Cross Validation

Cross validation is a common technique for obtaining robust results with deep neural networks. The dataset is split into K folds, of which one fold is used for validation and the others for training. The training and validation process is repeated K times, each time with a different validation fold, and the final result is reported as the average together with its standard deviation. In this way, cross validation gives a more reliable performance estimate with less bias.
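A minimal sketch of this procedure with scikit-learn's KFold, assuming build_model returns a freshly compiled Keras model and that images and labels are NumPy arrays:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(images, labels, build_model, k=5):
    """Train on k-1 folds, validate on the held-out fold, and report the
    mean validation accuracy together with its standard deviation."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True).split(images):
        model = build_model()  # fresh, re-initialized model for each fold
        model.fit(images[train_idx], labels[train_idx],
                  epochs=100, batch_size=5, verbose=0)
        _, accuracy = model.evaluate(images[val_idx], labels[val_idx], verbose=0)
        scores.append(accuracy)
    return np.mean(scores), np.std(scores)
```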

2.7 Experimental Setup

To prepare the input data for the neural networks, the original images are converted from DICOM to JPEG format. For the convenience of transfer learning from ImageNet, the grayscale images are converted to three-channel RGB images. Moreover, the images are padded to squares and downsampled to 256×256 pixels to reduce data size and computing time. The image intensity is normalized to the range 0-1. The images are augmented through random rotation (±10 degrees), shifting (< 10%) and zooming (< 10%).

During the training process, the batch size is set to 5. InceptionV3 and ResNet50 are trained for 100 epochs; VGG19 is trained for 200 epochs since it converges more slowly. The learning rate is set to $10^{-5}$ for all three models as a result of fine tuning. The Stochastic Gradient Descent (SGD) optimizer is used for VGG19, while RMSprop and Adam are used for InceptionV3 and ResNet50, respectively.
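In Keras terms (again an assumed framework), the per-architecture setup described above could be written as follows, with model, train_images and train_labels carried over from the earlier sketches:

```python
from tensorflow.keras.optimizers import SGD, RMSprop, Adam

# Optimizer and number of epochs per architecture, as described above;
# the learning rate is 1e-5 for all three models.
setups = {
    "VGG19":       (SGD(learning_rate=1e-5), 200),
    "InceptionV3": (RMSprop(learning_rate=1e-5), 100),
    "ResNet50":    (Adam(learning_rate=1e-5), 100),
}

optimizer, epochs = setups["ResNet50"]
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",  # assumed two-class loss
              metrics=["accuracy"])
model.fit(train_images, train_labels, batch_size=5, epochs=epochs)
```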


Chapter 3

Results

3.1 Automatic Method

In the first stage, VGG19, InceptionV3 and ResNet50 are tested with the standardized pre-processed input data described in section 2.1. The accuracy evaluated on the validation dataset is 82.7%, 89.4% and 90.5% for VGG19, InceptionV3 and ResNet50, respectively. Training one fold for 100 epochs takes around four hours. The accuracy plots of the networks are shown in Figure 3.1.

From the accuracy plots we can see that ResNet has the best performance among the three networks. For further examination, 5-fold cross validation is applied to each of the three networks. The results are shown in Table 3.1.

Table 3.1: Cross Validation for Automatic Method

| K-Fold             | VGG19   | InceptionV3 | ResNet  |
|--------------------|---------|-------------|---------|
| Fold 1             | 81.6 %  | 85.3 %      | 89.1 %  |
| Fold 2             | 84.4 %  | 89.1 %      | 90.7 %  |
| Fold 3             | 84.4 %  | 91.7 %      | 90.0 %  |
| Fold 4             | 81.6 %  | 91.1 %      | 92.5 %  |
| Fold 5             | 81.4 %  | 90.0 %      | 90.0 %  |
| Average            | 82.7 %  | 89.4 %      | 90.5 %  |
| Standard Deviation | ±1.57 % | ±2.52 %     | ±1.27 % |


Figure 3.1: Accuracy Plots for Automatic Method. Panels: (a) VGG19, (b) InceptionV3, (c) ResNet50.

3.2 Interactive Method

In the second stage, the images are manipulated with limited user intervention. Accuracy plots for repeated experiments with the same setup are shown in Figure 3.2; the validation accuracy is improved to 92.2%, 93.4% and 94.4% for VGG19, InceptionV3 and ResNet50, respectively. Cross validation was again run for the three networks, with results shown in Table 3.2. We can see a clear improvement compared to the automatic method, which is discussed further in the next chapter.


Figure 3.2: Accuracy Plots for Interactive Method. Panels: (a) VGG19, (b) InceptionV3, (c) ResNet50.

Table 3.2: Cross Validation with Interactive Method

| K-Fold             | VGG19   | InceptionV3 | ResNet  |
|--------------------|---------|-------------|---------|
| Fold 1             | 90.4 %  | 97.0 %      | 94.1 %  |
| Fold 2             | 94.9 %  | 94.3 %      | 94.3 %  |
| Fold 3             | 91.4 %  | 91.7 %      | 94.2 %  |
| Fold 4             | 95.1 %  | 88.0 %      | 97.0 %  |
| Fold 5             | 89.0 %  | 96.2 %      | 95.2 %  |
| Average            | 92.2 %  | 93.4 %      | 94.4 %  |
| Standard Deviation | ±2.73 % | ±3.66 %     | ±2.03 % |


3.3 Comparison with Multiple Metrics

Besides accuracy, we calculated several other metrics for evaluation and comparison: precision, sensitivity and specificity. The results are shown in Table 3.3, from which we can see that ResNet50 has the best overall performance for both methods, and that there is a clear improvement with user intervention compared with the automatic method.

Table 3.3: Comparison with Multiple Metrics

| Network     | Method      | Accuracy      | Sensitivity   | Specificity   | Precision     |
|-------------|-------------|---------------|---------------|---------------|---------------|
| VGG19       | Automatic   | 82.7(±1.57) % | 85.4(±4.04) % | 79.6(±4.82) % | 81.0(±3.39) % |
| VGG19       | Interactive | 92.2(±2.73) % | 93.0(±2.00) % | 91.6(±5.18) % | 91.6(±5.18) % |
| InceptionV3 | Automatic   | 89.4(±2.52) % | 90.6(±3.44) % | 88.4(±2.61) % | 88.4(±2.61) % |
| InceptionV3 | Interactive | 93.4(±3.66) % | 92.8(±5.02) % | 94.2(±3.70) % | 94.8(±3.27) % |
| ResNet50    | Automatic   | 90.5(±1.27) % | 89.0(±3.32) % | 92.2(±2.95) % | 92.2(±2.95) % |
| ResNet50    | Interactive | 94.4(±2.03) % | 94.4(±1.52) % | 95.8(±1.92) % | 95.8(±1.92) % |

3.4 Visualization of Results

The final stage is to visualize the features the networks are learning with class activation mapping. The maps indicate the discriminative image regions used by the neural networks to identify AFF or NFF. Some examples are shown in Figure 3.3. In cases (a)-(d) the network focuses on the fracture regions, as expected, but in (e) and (f) the network attends to other regions, which may result in misclassification.


Figure 3.3: Attention Maps. Panels (a)-(f).


Chapter 4

Discussion

From the experimental results we can see that ResNet50 outperforms the other two networks, in line with other studies indicating that ResNet has better classification performance [12]. This may be related to the fact that ResNet uses residual blocks to counter the vanishing-gradient problem in increasingly deep networks. Moreover, in this project VGG19 converged the slowest and ResNet50 the fastest, meaning that ResNet can learn faster than Inception and VGG in this case, which is beneficial when training time is limited.

In 2018, Kim and MacKinnon [16] showed that transfer learning from pre-trained non-medical images could be used to improve the performance of deep neural networks in wrist fracture identification. They achieved an accuracy of 95.4% with a dataset of 1,389 radiographs (695 "fracture" and 694 "no fracture"), which is better than what we achieved in this project. We may assume that classifying fracture versus non-fracture images is an easier task than classifying two types of fractures. This is supported by the study of Chung et al. [17], who applied a CNN to classify proximal humerus fractures using plain anteroposterior shoulder radiographs. Their results showed high performance in automatically detecting fractures versus normal shoulders, with a top-1 accuracy of 96%, but relatively low performance in classifying different types of fractures, with 65-86% top-1 accuracy. Comparing with our results, we think the application of transfer learning, the choice of network architectures and user intervention all contributed to the boosted performance in classifying AFF and NFF.


4.1 User Intervention

Comparing the two pipelines, both methods perform relatively well at identifying AFF or NFF. With user intervention in the interactive method, the accuracy is improved by around 4% compared with the automatic method. This is consistent with our hypothesis that proper user intervention benefits the results: the input images are more standardized, so the network focuses more on the fracture region as we expect. We do, however, need to balance the effort put into user intervention against the improvement obtained. In general the improvement from the interactive method is promising, but whether to apply user intervention in practice needs to be decided based on the practical circumstances.

4.2 Network Visualization

Network visualization plays an important part in this project for evaluation and validation. At the beginning of the project we obtained suspiciously high performance in cross validation, but after visualization with class activation mapping we found that the network was learning from the annotations carried over from the DICOM conversion, as indicated in Figure 4.1. This was a mistake we needed to learn from, but the incident also showed us the significance of attention maps as a validation and visualization method for neural networks.

Figure 4.1: Attention Maps

After the mistake had been corrected, attention maps were generated again. From the attention maps generated for both AFF and NFF, we can see that in most cases the networks use the fracture region for classification. In some failed cases, such as (e) and (f) in Figure 3.3, the image regions used by the network are artefacts. Accordingly, reducing artefacts with appropriate image processing might be a path to further improvement. Additionally, attention maps could be used to indicate to the user how reliable a predicted result is: if the image region used by the network is not the fracture region, the prediction should be considered less reliable.

For comparison, Figures 4.2 and 4.3 give examples of the attention maps generated for the trained ResNet with the automatic and interactive methods, respectively. The regions attended to with the interactive method are more consistent with our expectations than those of the automatic method. This is reasonable, since less noise is present in the inputs to the neural networks, so we would expect the networks to focus more on the fracture features and achieve better performance.

Figure 4.2: Attention Maps from Automatic Method

4.3 Future Work

In this project we adopted an image-targeted approach for convenience, treating each image as an independent sample. A patient-targeted approach would, however, be more convincing and realistic, since clinicians diagnose AFF based on several X-ray images. Moreover, accuracy is expected to improve with more data, and training could be accelerated with multi-GPU setups.


Figure 4.3: Attention Maps from Interactive Method

For visualization, attention maps may play an even more important role in assisting clinicians with diagnosis, since wrongly predicted subjects can be disregarded when the network is not focusing on the fracture features as expected. Therefore, further applications of Class Activation Mapping can be developed for this project.


Chapter 5

Conclusions

In this project, we tested two types of CNN-based diagnostic pipelines for classifying AFF and NFF using plain radiographs. Three network architectures and two diagnostic pipelines were investigated, and a series of experiments was run for comparison. From the results we conclude that all three networks display great potential in identifying AFF versus NFF, with ResNet50 showing the best overall performance compared with VGG19 and InceptionV3. The two diagnostic pipelines, fully automatic and interactive, were implemented and compared, and the results demonstrate the value of proper user intervention, which brings significant improvements in network performance.

The results show promising performance, with high diagnostic accuracy achieved using limited user intervention. We therefore believe the proposed method could be applied to assist clinicians with the diagnosis of AFF. However, more clinical data and a clinical perspective are required for further development.


Bibliography

[1] J. Halvorson and S. Medda. Diaphyseal Femur Fracture. 2019. URL: https://www.ncbi.nlm.nih.gov/books/NBK493169/ (visited on 05/29/2019).

[2] Joong Sup Shin, Nak Chul Kim, and Kyoung Ho Moon. "Clinical features of atypical femur fracture". In: Osteoporosis and Sarcopenia 2.4 (2016), pp. 244-249.

[3] Katrina Harborne et al. "Compliance with established guidelines for the radiological reporting of atypical femoral fractures". In: The British Journal of Radiology 89.1057 (2016), p. 20150443.

[4] Olga Russakovsky et al. "ImageNet large scale visual recognition challenge". In: International Journal of Computer Vision 115.3 (2015), pp. 211-252.

[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation". In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241.

[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[7] Olga Russakovsky et al. "ImageNet Large Scale Visual Recognition Challenge". In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211-252. DOI: 10.1007/s11263-015-0816-y.

[8] Karen Simonyan and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097-1105.

[10] Sergey Ioffe and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift". In: arXiv preprint arXiv:1502.03167 (2015).

[11] Christian Szegedy et al. "Rethinking the Inception architecture for computer vision". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2818-2826.

[12] Kaiming He et al. "Deep residual learning for image recognition". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016.

[13] Jia Deng et al. "ImageNet: A large-scale hierarchical image database". In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248-255.

[14] Hoo-Chang Shin et al. "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning". In: IEEE Transactions on Medical Imaging 35.5 (2016), pp. 1285-1298.

[15] Bolei Zhou et al. "Learning deep features for discriminative localization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2921-2929.

[16] D.H. Kim and T. MacKinnon. "Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks". In: Clinical Radiology 73.5 (2018), pp. 439-445.

[17] Seok Won Chung et al. "Automated detection and classification of the proximal humerus fracture by using deep learning algorithm". In: Acta Orthopaedica 89.4 (2018), pp. 468-473.

[18] Carolyn J. Crandall et al. "Comparative effectiveness of pharmacologic treatments to prevent fractures: an updated systematic review". In: Annals of Internal Medicine 161.10 (2014), pp. 711-723.

[19] Jian Guo Liu and Philippa J. Mason. Essential Image Processing and GIS for Remote Sensing. John Wiley & Sons, 2013.

[20] Katarzyna Janocha and Wojciech Marian Czarnecki. "On loss functions for deep neural networks in classification". In: arXiv preprint arXiv:1702.05659 (2017).

[21] Diederik P. Kingma and Jimmy Ba. "Adam: A method for stochastic optimization". In: arXiv preprint arXiv:1412.6980 (2014).

[22] Antoine Bordes, Léon Bottou, and Patrick Gallinari. "SGD-QN: Careful quasi-Newton stochastic gradient descent". In: Journal of Machine Learning Research 10 (2009), pp. 1737-1754.

[23] Yann LeCun et al. "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE 86.11 (1998), pp. 2278-2324.

[24] Nitish Srivastava et al. "Dropout: a simple way to prevent neural networks from overfitting". In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929-1958.

[25] Xiang Li et al. Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift. 2018. arXiv: 1801.05134 [cs.LG].


Appendix A

State of the Art

This state of the art reviews the background in the fields of atypical and typical femur fracture, image classification and deep neural networks, together with current research on bone fracture classification using deep learning.

A.1 Atypical Femur Fracture

In clinical routine, antiresorptive drugs such as bisphosphonates are associated with reducing the risk of bone fracture [18]. In recent years, however, complications in the use of antiresorptive drugs have become more and more apparent. Atypical Femur Fracture (AFF) is a type of stress fracture that occurs in conjunction with prolonged bisphosphonate treatment. Unlike Normal Femur Fracture (NFF), AFF usually occurs in the subtrochanteric area with specific features, including transverse or short oblique fracture configurations, generalized thickening of the femoral cortices and non-comminuted fracture patterns, as indicated in Figure A.1 [2].

Early diagnosis of AFF is important, since a complete fracture may occur afterwards. In practice, however, AFF is very rarely identified correctly as distinct from NFF on the first diagnostic X-ray examination [3].

A.2 Image Classification

Image classification belongs to the field of pattern recognition in computing research. Image pixels can be classified by their statistical properties or by their spatial relations with neighboring pixels. In general, image classification can be categorized into supervised and unsupervised classification [19].


Figure A.1: Radiographs of Atypical Femur Fracture. (A, B) show AFF in the femoral shaft region, and (C) in the subtrochanteric area. Radiographic features of AFF: (1) medial spike; (2) transverse fracture pattern; (3) localized periosteal thickening of the lateral cortex; (4) generalized thickening of the femoral cortices [2].

Unsupervised classification, also called clustering, is usually based on the statistics of the image data distribution. It is characterized by objective features and is entirely data driven, utilizing cluster statistics without any ground truth, and is therefore widely applied where no ground truth knowledge is available. Ultimately, interpretation based on ground truth is still needed to obtain final results containing clustering and thematic information.

Supervised classification, on the other hand, is based on training data in which ground truth knowledge is included. On some occasions the classification can be misguided, since biased subjective knowledge may enter the process.

Since both methods have limitations, hybrid classification has been proposed. In this approach, unsupervised classification is performed first and its results are interpreted using ground truth knowledge; supervised classification is then applied, using the previous results as training data, to obtain the final results. The hybrid approach is a comprehensive training procedure which provides more reliable and objective results [19].

A.3 Deep Learning

Deep learning is a branch of machine learning which has become popular in recent years due to its top performance in many research fields, such as natural language processing, speech recognition and computer vision. Deep neural networks have at least two hidden layers in addition to the input and output layers. Learning can be divided into supervised, semi-supervised and unsupervised [6]. Below, a simple feedforward network is described theoretically in terms of its core, parametric function approximation, to give the basic concept behind the deep neural networks that are now widely applied in industry.

A.3.1 Feedforward Neural Network

Feedforward neural networks, also called multilayer perceptrons (MLPs), are representative deep learning models [6]. Their main goal is to approximate some function f*. The network defines a function f(x; θ), where the parameters θ, usually consisting of weights w and biases b, are learnt to obtain the best approximation. Feedforward networks usually chain many functions together; the length of the chain gives the depth of the network. The last layer is called the output layer, and the layers in between are called hidden layers, because their outputs are not directly observed. The dimensionality of the hidden layers determines the width of the network. In each layer, the output is a linear transformation by weights and bias followed by a non-linear activation function, as in equation A.1. By updating the weights and biases through back-propagation, we drive f(x) to match f*(x) and obtain the desired output y.

$$ y = \operatorname{activation}\left( \sum_i w_i x_i + b \right) \tag{A.1} $$
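As a concrete instance of equation A.1, a single layer's forward pass in NumPy (shapes chosen arbitrarily, ReLU as the activation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input with 4 features
W = rng.standard_normal((8, 4))   # weights of a layer with 8 hidden units
b = np.zeros(8)                   # biases

# Equation A.1 applied to the whole layer: linear transformation
# followed by a non-linear activation (here ReLU).
y = np.maximum(W @ x + b, 0)
```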

A.3.2 Convolutional Neural Network

With the rapid development of deep learning, deep neural networks have been applied in several research fields, including image classification. The Convolutional Neural Network (CNN) is a representative feedforward model, characterized by partially connected neurons.

Convolutional layers are the basic building blocks of a CNN, and their basic operation is convolution. In image processing, a 3D tensor is taken as the input volume, and several 3D kernels are applied in the convolutional layers to obtain an output 3D volume (height, width, channels). A feature map is generated as each filter slides across the whole image, and the feature maps are stacked along the channel dimension. The output volume is then taken as the input volume for the next layer. Zero padding is usually applied in convolutional layers to prevent size shrinkage, and should be selected carefully according to the network structure [6].

In network training, one epoch means the network has been trained, forward and backward, once on the entire dataset. In practice one epoch is too large to feed into the computer at once, so it is divided into many batches; an iteration is the processing of one batch, and the number of iterations per epoch equals the number of samples in one epoch divided by the batch size. For example, with the 796 images in this project and a batch size of 5, one epoch takes ⌈796/5⌉ = 160 iterations.

During the training process, the loss function is minimized. The loss function is defined from the difference between the predicted results and the ground truth, so a smaller loss indicates better predictions. Square loss, cross-entropy loss, logistic loss and hinge loss are typical loss functions used in deep neural networks [20]; the choice should depend on the network and the given ground truth. An optimizer is used to update the weights and biases in order to reduce the error; Adam [21] and Stochastic Gradient Descent (SGD) [22] are two of the most widely used. The learning rate is an important optimizer parameter, since it defines how fast the network learns: if it is too low the network converges slowly, while if it is too high the accuracy becomes low and jumpy. The learning rate therefore usually needs to be fine-tuned over training trials to reach proper convergence [6].

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a contest where software programs compete to correctly classify and detect objects and scenes. It has become the benchmark for large-scale visual recognition [7]. Several network structures with top performance in this contest are described below.

LeNet

LeNet is a 7-layer CNN introduced by LeCun et al. [23] in 1998, which was successfully applied to handwritten digit recognition [23]. It consists of two convolutional layers, two pooling layers, two fully connected layers and one output layer, as Figure A.2 shows. This model has limited use today, since larger and deeper convolutional networks are required to process high-resolution images.


Figure A.2: Architecture of LeNet-5 [23] ©[1998]IEEE

AlexNet

In 2012, AlexNet outperformed all previous competitors in ILSVRC, reducing the error from 26% to 15.3% [9]. AlexNet resembles LeNet but is deeper, with more filters and stacked convolutional layers: five convolutional layers, three pooling layers and two fully connected layers, with approximately 60 million parameters. It attaches a ReLU activation function after every convolutional and fully connected layer, which speeds up training roughly five times compared with sigmoid or tanh functions at the same accuracy. Dropout is applied to prevent overfitting.

VGG

VGG (named after the Visual Geometry Group) networks achieved top results in ILSVRC 2014 and are characterized by their uniform architecture [8]. The structure of VGG is similar to AlexNet but with more filters and layers. VGG networks vary in depth, and the 16-layer and 19-layer models are the two best-performing variants. The configuration is shown in Figure A.3.

Figure A.3: VGG configurations. The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as "conv(receptive field size)-(number of channels)" [8].

A.3.3 Transfer Learning

Transfer learning is a technique that stores knowledge obtained from one problem and applies it to a related but different problem. In practice it is common in convolutional neural networks, because available datasets are usually not large enough to train from scratch, and training deep networks from random initialization is expensive and resource-intensive. Pre-trained networks are therefore widely used to obtain initial weights. The theoretical justification is that networks detect edges in the earlier layers, shapes in the middle layers and other deep features in the later layers; the general features from early layers can thus be reused in another task, while the later layers can be retrained to extract more task-specific features [14].

ImageNet [13] has more than 1.2 million 256×256 images categorized into 1000 object classes. It is currently the largest publicly available dataset for object recognition and is widely applied in transfer learning. Despite the significant differences between natural and medical images, Shin et al. [14] found that transfer learning from ImageNet can still make medical image recognition tasks more effective. This cross-modality transferability makes transfer learning from ImageNet-trained CNN representations popular in recognition tasks across imaging modalities.

A.3.4 Dropout and Batch Normalization

Dropout was introduced by Srivastava et al. [24] in 2014 to prevent neural networks from overfitting [24]. It is a simple but effective method to boost the performance of neural networks. The term "dropout" literally refers to dropping out units in the network, as indicated in Figure A.4: with retain ratio p, each neuron is kept in the training process with probability p. With the use of dropout, calculation time decreases and final performance improves.

Figure A.4: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped [24].

Batch normalization [10] was introduced in 2015 for regularization and for speeding up network training. For each batch, the mean value and standard deviation are calculated and the batch of data is normalized by them. In this way, batch normalization regularizes the model and reduces the dependence of the gradients on the scale of the parameters or their initial values. However, the two methods often fail to yield extra benefits when combined. This incompatibility was studied in [25], where variance shift was found to be the key reason for the decreased performance: as indicated in Figure A.5, the neural variance of an input X in the test phase differs from that in the training phase because of dropout, yet batch normalization treats the variance accumulated from training as the population statistic [25].

Figure A.5: Simplified mathematical illustration of "variance shift". Note that p denotes the Dropout retain ratio and a comes from a Bernoulli distribution which has probability p of being 1 [25].
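As an illustration of how the two techniques appear in practice (a hypothetical Keras block with arbitrary layer sizes, not code from the thesis):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(64,)),
    layers.BatchNormalization(),  # normalize activations batch-wise
    layers.Dropout(0.5),          # drop each unit with probability 0.5 at train time
    layers.Dense(2, activation="softmax"),
])
```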

A.3.5 Class Activation Mapping

Class Activation Mapping (CAM) was introduced in 2016 for CNNs with global average pooling layers. In the original paper [15], the CAM is generated by projecting the weights of the output layer back onto the convolutional feature maps, which highlights the class-specific discriminative regions. The procedure is shown in Figure A.6. There are two main points in this method. First, the class activation map of a certain class is the weighted sum of the feature maps from the last convolutional layer. Second, a global average pooling layer is used to convert each feature map into a single value, and acts as the glue for calculating the associated weights. CAM has wide applications in various fields. In this project the classification results are verified and visualized with CAM to guarantee that the correct features are learnt when classifying the dataset.

Figure A.6: The procedure to generate CAM [15] ©[2016]IEEE

A.4 Current Research on Bone Fracture Classification

In 2018, Kim and MacKinnon [16] found that deep neural networks can be applied to radiographs for wrist fracture detection. Moreover, transfer learning from networks pre-trained on non-medical images improved the performance, which makes deep learning widely accessible with high accuracy. They concluded that "Similar applications of this technique could be used to improve efficiency and reduce patient harm." [16]

Another research team, Chung et al. [17], applied a CNN to detect and classify proximal humerus fractures using plain anteroposterior shoulder radiographs in 2018 [17]. Their results showed that the CNN performed well both at distinguishing different types of proximal humerus fractures and at recognizing normal shoulders, and they concluded that deep neural networks can accurately detect and classify proximal humerus fractures on plain anteroposterior shoulder radiographs. They pointed out that further studies are necessary to explore the feasibility of applying deep neural networks in the clinic for fracture detection, and whether this could improve outcomes compared with current orthopedic assessments in hospitals.

To the author's knowledge, nothing has been published on classifying AFF versus NFF using deep neural networks. The previous research indicates that there is great potential in applying deep neural networks to fracture detection, with performance improving significantly even with limited resources. The topic of classifying AFF and NFF using deep neural networks is therefore meaningful, and the results are expected to be promising.


TRITA CBH-GRU-2019:082

www.kth.se
