
Master of Science in Engineering: Game and Software Engineering June 2019

Semantic Segmentation of Historical Document Images Using Recurrent Neural Networks

Jakob Ahrneteg

Dean Kulenovic


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Engineering: Game and Software Engineering. The thesis is equivalent to 20 weeks of full time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Implementation Source: https://github.com/Grimwan/ReccurentNeuralDSS

Contact Information:
Authors:
Jakob Ahrneteg, E-mail: jaag14@student.bth.se
Dean Kulenovic, E-mail: dekb14@student.bth.se

University advisor:
Doctoral Student Florian Westphal
Department of Computer Science

Faculty of Computing, Blekinge Institute of Technology, SE-371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


Abstract

Background. This thesis focuses on the task of historical document semantic segmentation with recurrent neural networks. Document semantic segmentation involves the segmentation of a page into different meaningful regions and is an important prerequisite step of automated document analysis and digitisation with optical character recognition. At the time of writing this thesis, convolutional neural network based solutions are the state-of-the-art for analyzing document images, while the use of recurrent neural networks in document semantic segmentation has not yet been studied. In contrast to a convolutional neural network, a recurrent neural network is able to 'memorize' previously seen inputs; thus it should be possible for a recurrent neural network to recognize relationships between different document regions and as a result achieve higher or comparable performance results to a convolutional neural network in document semantic segmentation.

Objectives. The main objective of this thesis is to investigate if recurrent neural networks are a viable alternative to convolutional neural networks in document semantic segmentation. By using a combination of a convolutional neural network and a recurrent neural network, another objective is to determine if the combination can improve upon using the recurrent neural network alone.

Methods. To investigate the impact of recurrent neural networks in document semantic segmentation, three different recurrent neural network architectures are implemented and trained, and their performance is evaluated with Intersection over Union. Afterwards, their segmentation results are compared to those of a convolutional neural network. By performing pre-processing on training images and multi-class labeling, prediction images are ultimately produced by the employed models.

Results. The gathered performance data shows a 2.7% performance difference between the best recurrent neural network model and the convolutional neural network. Notably, this recurrent neural network model has a more consistent performance than the convolutional neural network but comparable performance results overall. For the other recurrent neural network architectures, lower performance results are observed, which is connected to the complexity of these models. Furthermore, by analyzing the performance results of a model using a combination of a convolutional neural network and a recurrent neural network, it can be noticed that the combination performs significantly better, with a 4.9% performance increase compared to the case of only using the recurrent neural network.

Conclusions. This thesis concludes that recurrent neural networks are likely a viable alternative to convolutional neural networks in document semantic segmentation, but that further investigations are required. Furthermore, by combining a convolutional neural network with a recurrent neural network, it is concluded that the performance of a recurrent neural network model is significantly increased.

Keywords: semantic segmentation, page segmentation, recurrent neural network, layout analysis


Sammanfattning

Background. This work concerns semantic segmentation of historical documents with recurrent neural networks. Document semantic segmentation involves dividing a document into different regions, which is important for subsequent automated document analysis and digitisation with optical character recognition. Convolutional neural networks are currently the leading option for processing document images, while recurrent neural networks have never been used for document semantic segmentation. This is interesting because, considering that a recurrent neural network can 'memorize' previously seen input data, it should be possible for a recurrent neural network to recognize coherent patterns between different document regions and thereby achieve a better or comparable result to a convolutional neural network for document semantic segmentation.

Objectives. The aim of this work is to investigate whether a recurrent neural network can achieve performance results comparable to a convolutional neural network for document semantic segmentation. A further aim is to investigate whether a combination of a convolutional neural network and a recurrent neural network can give a better result than using only a recurrent neural network.

Methods. To determine whether a recurrent neural network is a suitable alternative for document semantic segmentation, the performance results of three different recurrent neural network models are evaluated. These results are then compared with the performance results of a convolutional neural network. Furthermore, image pre-processing and multi-class labeling are performed so that the models can ultimately produce measurable results in the form of prediction images.

Results. By evaluating the performance results of the models, we measure a performance difference of 2.7% in a comparison between the best model and a convolutional neural network. Notably, the best model exhibits a more even distribution of performance. For the two models that showed lower performance, it can be concluded that their outcome is due to lower model complexity. Furthermore, in a comparison of two models, where one combines a convolutional neural network with a recurrent neural network while the other uses only a recurrent neural network, a performance difference of 4.9% is measured.

Conclusions. The results suggest that a recurrent neural network is probably a suitable alternative to a convolutional neural network for document semantic segmentation, although further research is required. It is further concluded that a combination of the two variants contributes to a better performance result.

Keywords: semantic segmentation, document segmentation, recurrent neural network, layout analysis


Acknowledgments

We would like to thank our supervisor, Florian Westphal, for his engagement, countless proofreadings and availability regarding questions and meetings. This work would not have been possible without him.


Contents

Abstract
Sammanfattning
Acknowledgments
1 Introduction
  1.1 Thesis Scope
  1.2 Research Questions
2 Background
  2.1 Document Semantic Segmentation
  2.2 Convolutional Neural Network
  2.3 Recurrent Neural Network
    2.3.1 Long Short-Term Memory
    2.3.2 Bidirectional Long Short-Term Memory
    2.3.3 ReNet
  2.4 Performance Metrics
    2.4.1 Binary Cross-Entropy
    2.4.2 Intersection over Union
3 Related Work
  3.1 Convolutional Neural Network
  3.2 Recurrent Neural Network
4 Method
  4.1 Image Pre-processing
  4.2 Multi-class Labeling
  4.3 Implementation
    4.3.1 Bidirectional Long-Short Term Memory
    4.3.2 Bidirectional Long-Short Term Memory+
    4.3.3 ReNet
5 Experiment
  5.1 Dataset
  5.2 Training
  5.3 Evaluation
6 Results
  6.1 Performance
  6.2 Image Prediction
7 Discussion
  7.1 Performance
    7.1.1 Model Complexity
    7.1.2 Performance Variation
    7.1.3 Overlapping Classes
    7.1.4 Footprint Size
  7.2 Validity Threats
8 Conclusions
9 Future Work
  9.1 ReNet
  9.2 Image Post-processing
  9.3 Augmentation
  9.4 Multi-dimensional Processing
References
A Raw Data


List of Figures

1.1 Semantic segmentation of a historical document image from a medieval manuscript [32]. In (b), the different ground truth regions: main-text-body (blue), comments (green) and decorations (red) are visually highlighted.
2.1 Semantic segmentation and binarization of a historical document image [32].
2.2 Image classification for a typical CNN architecture.
2.3 Illustration of a traditional RNN model.
2.4 LSTM memory block with one cell.
4.1 BLSTM model architecture.
4.2 ReNet model architecture.
6.1 IU results for the evaluated architectures on all of the test images. Boxes indicate the upper and lower quartile while the separation line is the median. Whiskers represent the lowest and highest observation.
6.2 Image sample variations from a testing page of the CB55 manuscript. Prediction images are produced by the ReNet architecture.
6.3 Image sample variations from a testing page of the CSG18 manuscript. Prediction images are produced by the ReNet architecture.
6.4 Image sample variations from a testing page of the CSG863 manuscript. Prediction images are produced by the ReNet architecture.
7.1 Image samples of different patch sizes.


List of Tables

5.1 Page properties of the separate manuscripts [40].
6.1 Complexity of the evaluated architectures.
A.1 IU (%) results of individual test images for the evaluated architectures. Images 1-10 correspond to the CB55 manuscript, while images 11-20 and 21-30 belong to the CSG18 and CSG863 manuscripts, respectively.


Chapter 1

Introduction

In recent years, machine learning centered on deep learning has gathered substantial attention. Modern semantic segmentation incorporates deep learning to separate and classify pixel-wise elements in an image. This approach has been applied to many applications, such as self-driving vehicles [23], handwriting recognition [34] and medical image diagnostics [28]. In contrast to applications such as self-driving vehicles, where pixel precision of image regions such as street signs, pedestrians and roads is not as critical for a working application, document semantic segmentation (DSS) requires a high degree of precision to produce readable documents. DSS involves the segmentation of a page into different meaningful regions (see Figure 1.1), a process which is an important prerequisite step of automated document analysis and digitisation with optical character recognition (OCR). This is especially challenging with historical document images, as these documents commonly suffer from degradation and feature unique layouts, writing styles, ornaments or decorations.

Neural networks excel at semantic segmentation and the current state-of-the-art for analyzing document images are convolutional neural network (CNN) based solutions, such as U-Nets [22, 27], fully convolutional networks (FCNs) [39, 40, 41] and deep supervised networks (DSNs) [36]. Recently, recurrent neural networks (RNNs) have successfully been used for document image binarization (DIB) [2, 38] – that is, the separation of text foreground from page background. But as of today, the impact of RNNs in DSS has not yet been studied.

Considering the relationship between DIB and DSS and the fact that RNNs operate on sequences of data, it should be possible to employ an RNN for DSS if we further consider it as a sequence learning problem. Also, in contrast to a CNN, an RNN is able to 'memorize' previously seen inputs, which should enable it to recognize relationships between different document regions and as a result achieve higher or comparable performance results to a CNN in DSS. This thesis aims to investigate the use of RNNs in DSS by designing, training and evaluating the performance of three different RNN configurations and comparing their segmentation results to a U-Net implementation [24]. The results indicate that RNNs are likely a viable alternative to CNNs in DSS but that further investigations are required. Furthermore, by combining a CNN with an RNN, it is concluded that the performance of an RNN model is significantly increased.



(a) Original image. (b) Segmentation result.

Figure 1.1: Semantic segmentation of a historical document image from a medieval manuscript [32]. In (b), the different ground truth regions: main-text-body (blue), comments (green) and decorations (red) are visually highlighted.

1.1 Thesis Scope

In this thesis, we focus on the task of historical document image segmentation. This particular task will be performed with three different configurations of RNNs and their performance evaluated with Intersection over Union (IU). Afterwards, the performance results will be compared to a U-Net implementation. For training we will use a single publicly available dataset, DIVA-HisDB [32], which consists of three medieval manuscripts of high-resolution images. This dataset has been chosen because the images provide challenging layouts with overlapping classes, such as decorations and main-text-body (see Figure 1.1), and because it has been used in a recent layout analysis competition [31].

1.2 Research Questions

Given the above information, we define our research questions for this thesis as:

RQ1: Are recurrent neural networks a viable alternative to convolutional neural networks in document semantic segmentation?

This is the main research question and it gives insight into whether an RNN is able to produce comparable performance results to a CNN architecture. By examining the results we are able to determine if RNNs are a viable alternative to CNNs in DSS. To answer this question we perform an experiment where we train and evaluate the performance of three different RNN configurations and compare their segmentation results to a U-Net implementation.


RQ2: What is the performance impact of combining convolutional neural networks with recurrent neural networks in document semantic segmentation?

The combination of a CNN and an RNN is possible as we can, for instance, place a CNN architecture before the processing part of an RNN. However, whether the combination can generate better performance compared to the regular RNN case is a worthwhile investigation. To answer this question we design an RNN model with and without a CNN part and then perform an experiment where we evaluate the performance of the two configurations.


Chapter 2

Background

This chapter introduces the necessary concepts for the remainder of this thesis. It contains an overview of page segmentation, neural network architectures and the employed performance metrics.

2.1 Document Semantic Segmentation

Document semantic segmentation involves separating page background and classifying multiple text foreground elements, such as main text, comments and decorations. In this context, an original document image is provided along with a ground truth image to either aid or visualize the desired outcome. DSS is related to DIB as they both deal with eliminating background noise and labeling document text. However, DIB is a binary semantic classification problem, while DSS can be denoted as multi-class semantic segmentation as the separation and classification involves more than two elements. An overview of the two segmentation methods can be seen in Figure 2.1.

(a) Segmented image.

(b) Binarized image.

Figure 2.1: Semantic segmentation and binarization of a historical document image [32].

The produced segmentation result is useful for automated document analysis where the separated regions can be detected and transcribed. Therefore, pixel labeling in DSS requires a high degree of precision to produce readable documents. This is especially important for post-processing applications such as OCR, as mislabeling directly affects the outcome of the transcription process into digitised characters [39]. Notably, labeling historical document images is a challenging task as these documents commonly suffer from degradations, such as faded ink or bleedthrough, and often include ornaments, decorations or unique layouts. Thus, degradation factors such as bleedthrough can easily be mistaken for text foreground, while faded ink appears similar to background noise rather than text [37], ultimately reducing the precision of a segmentation algorithm.

2.2 Convolutional Neural Network

Convolutional neural networks are primarily used for image classification but have proven successful for a variety of tasks, such as speech recognition [1], pose estimation [17] and visual saliency detection [5]. There are numerous variations of CNN architectures, including fully convolutional networks [39, 40, 41] and deep supervised networks [36]; however, their underlying structure is similar, as it involves three types of layers, namely convolutional, pooling and fully-connected layers. The convolutional layer learns feature representations of input images by using several convolution kernels. Here, a kernel is a matrix of weight values which is used for computing feature maps that provide feature information such as edges or curves. Furthermore, the pooling layer is usually placed between two convolutional layers and aims to reduce the resolution of feature maps. This is practical as kernels in succeeding convolutional layers are able to encode more abstract features as the number of pooling and convolutional layers increases.

After several convolutional and pooling layers, it is common to use one or more fully-connected layers that connect neurons in the previous layer with every single neuron of the current layer. Lastly, the network uses a softmax activation function to produce probabilities of classes from the input data [14]. In Figure 2.2 the whole process of image classification for a typical CNN architecture can be seen.

Figure 2.2: Image classification for a typical CNN architecture.


2.3 Recurrent Neural Network

Recurrent neural networks are defined by recurrent connections which allow them to 'memorize' previously seen inputs. This significant behavior of RNN architectures has achieved state-of-the-art results in applications such as time series prediction [7], machine translation [6] and speech recognition [15]. The traditional model consists of an input layer, a hidden layer and an output layer, where each layer has n units. Here, the three layers are denoted as I, H and O, respectively. In this model, the input can be represented as a sequence of vectors through time t such as {..., x_{t-1}, x_t, x_{t+1}, ...}, where x_t = (x_1, x_2, ..., x_p). Furthermore, the input units receive the information and have connections with the hidden units, h_t = (h_1, h_2, ..., h_M), where the connections are defined by a weight matrix W_{IH}. Notably, the hidden units are connected to each other through time with recurrent connections; thus the hidden layer defines the state of 'memory'. This can be expressed as

h_t = f_H(o_t),   (2.1)

where

o_t = W_{IH} x_t + W_{HH} h_{t-1} + b_H,   (2.2)

f_H is the hidden layer activation function and b_H the bias vector of the hidden units. Lastly, the hidden units are connected to the output units, y_t = (y_1, y_2, ..., y_k), with weighted connections W_{HO}, where y_t can be computed as

y_t = f_O(W_{HO} h_t + b_O).   (2.3)

In this case, f_O is the activation function and b_O the bias vector in the output layer. It follows from the non-linear state equations (2.1), (2.2) and (2.3) that an RNN is able to iterate through time; thus in each time step, the hidden state, which summarizes all necessary information about the past time steps, gives a prediction based on the input vector to the output layer [29]. In Figure 2.3 an illustration of the RNN model is provided.

Figure 2.3: Illustration of a traditional RNN model.


2.3.1 Long Short-Term Memory

The traditional RNN model fails with learning long-term sequential information due to vanishing and exploding gradient problems [16]. The Long Short-Term Memory (LSTM) network addresses these issues by having a memory block with one cell that holds values over time, while three multiplicative gates control the input and output information flow of the cell (see Figure 2.4).

Figure 2.4: LSTM memory block with one cell.

The LSTM network preserves signals and propagates errors for longer time than traditional RNNs. This is significant as it allows the structure of the network to potentially remember information for longer periods. Notably, the standard LSTM architecture is by formulation one-dimensional, but it is possible to extend the processing into multiple dimensions to enhance the learning of long-term dependencies [29].

2.3.2 Bidirectional Long Short-Term Memory

Bidirectional Long Short-Term Memory (BLSTM) networks, proposed by Graves and Schmidhuber [13], are a variant of the LSTM model. They are similar to Bidirectional RNNs [30] as they both consider input sequences in both the past and future when calculating the output vector. This is possible as BLSTM networks use two LSTM networks, one processing an input sequence in a forward direction while the other processes the sequence backwards [29].


2.3.3 ReNet

The ReNet architecture introduced by Visin et al. [35] consists of four one-dimensional RNNs that process images horizontally and vertically in separate directions. The deep neural network model replaces the convolution and pooling layer stage in the CNN architecture and is suggested as a viable alternative to CNNs.

2.4 Performance Metrics

Performance measurements yield a statistical representation of pixel mislabeling. In this thesis, different performance evaluation methods are used during training and final validation. For training, the binary cross-entropy (BCE) loss function is used, as it can measure the loss between a background and text foreground label. However, BCE is not ideal for DSS as it does not take into account class imbalances, where a page could consist of more background than foreground [38]. To address this issue and produce reasonable performance results, Intersection over Union is employed as an evaluation metric for final validation.

2.4.1 Binary Cross-Entropy

Binary cross-entropy is a special case of cross-entropy loss that only deals with classification of two classes. It can be defined as

L = -y_i * log(ŷ_i) - (1 - y_i) * log(1 - ŷ_i),

where i is represented by the context of the application, ŷ_i is a predicted label and y_i the target label. The equation yields the loss value of a label compared to another and can be explained by an example where i is a point from a distribution of red and green points and the two labels are the colors green and red. Furthermore, let us consider that we have a green point and we predict the value of it being green as 0.01, that is, the point has a very low probability of being green. However, as this is not the actual case, the loss function should produce a high loss value. This is true since the logarithm of values between 0 and 1 is negative, thus adding a negative sign before the logarithm term gives us a positive value. Conversely, if the prediction of our green point had been 1, then the loss value would be 0.
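The red/green point example can be checked numerically; the following snippet is only a sanity check of the equation above.

```python
import math

def bce(y_true, y_pred):
    """Binary cross-entropy of a single prediction, as defined above."""
    return -y_true * math.log(y_pred) - (1.0 - y_true) * math.log(1.0 - y_pred)

# A green point (target label 1) predicted as green with probability 0.01
# yields a high loss; a confident correct prediction yields a loss near 0.
print(bce(1.0, 0.01))  # ~4.61
print(bce(1.0, 0.99))  # ~0.01
```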

2.4.2 Intersection over Union

The Jaccard index, commonly known as Intersection over Union, measures the similarity between two sets. It can be expressed as

IU(A, B) = |A ∩ B| / |A ∪ B|,

where A and B are sets of data. In the context of image classification, A is a set of ground truth bounding boxes and B a set of predicted bounding boxes, both extracted from a provided ground truth image and a resulting output prediction image. By examining the formula, it can be interpreted for the task of image classification

(24)

as the ratio between the area of intersection and the area of union between ground truth and predicted bounding boxes, where a resulting value closer to 1 is desired, as a value of 1 indicates a perfect match between a ground truth and predicted bounding box.
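For intuition, the same ratio can be computed for two boolean masks. This toy example is not the evaluation procedure of this thesis (Section 5.3 uses the DIVA evaluation tool), only an illustration of the formula.

```python
import numpy as np

def iou(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 1.0

gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True    # ground truth region: 4 pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True  # predicted region: 6 pixels, 4 of them overlapping
print(iou(gt, pred))   # 4 / 6 ≈ 0.667
```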


Chapter 3

Related Work

Over the past years, many methods have been proposed for segmentation of document images. These methods can be broadly divided into three categories: granular-based [3, 11], block-based [25, 26] and texture-based [10, 18]. Granular-based and block-based methods are somewhat similar as they both focus on merging image areas until homogeneous regions are produced, while texture-based methods rely on the extraction of texture features and classification with statistical models. Furthermore, block-based methods use a top-down approach while the other two methods adopt a bottom-up procedure, where bottom-up approaches are far more computationally expensive but also superior in terms of the produced segmentation result [40]. Today, modern page segmentation approaches incorporate machine learning, where convolutional neural network based solutions are the current state-of-the-art for segmentation of document images [39].

3.1 Convolutional Neural Network

The usage of CNNs for page segmentation has been explored by Chen et al. [8], who employed a three layer CNN with only one convolutional layer. In their implementation model, the traditional pooling layer is omitted, which means the data is instead directly passed to a fully-connected layer and then processed through an output layer. By utilizing a superpixel algorithm, the computation time of the pixel labeling process is further reduced.

Xu et al. [40] proposed a fully convolutional network based framework for the task of page segmentation which is optimized by stochastic gradient descent with backpropagation. The design of the FCN model is based on the VGG 16-layer network [33], which is a deep CNN architecture intended for large-scale image classification tasks. However, several modifications are applied in order to fit the use case of page segmentation, such as low-level processing in an earlier stage of the network and additional convolutional layers before the last stages. Furthermore, as the produced segmentation result contains noise and mislabeled pixels, heuristic based post-processing is applied to further enhance the performance.

Wick and Puppe [39] also implemented an FCN model for page segmentation; however, their model is an adaptation of the U-Net [27] but without skip connections between the encoder and decoder part of the network. In their page segmentation framework, binarization is applied to input images as a pre-processing step to produce a bitmask which is used to separate background and foreground. Afterwards, a binarized image can be multiplied with the segmentation result of the network to generate a final prediction image. Additionally, the proposed model is very fast and is able to learn and predict a complete page in one step.

3.2 Recurrent Neural Network

Recurrent neural networks have been used in various applications, such as time series prediction [7], machine translation [6] and speech recognition [15]. There has also been usage of RNNs for image classification, where Afzal et al. [2] employed a 2D BLSTM for binarization of document images. In their work, they considered each pixel as a timestep and processed image patches with four independent LSTMs, one for each direction from the corners of an image. Westphal et al. [38] further improved on the previous work by using Grid LSTM [20] cells for multi-dimensional input support. Their approach allowed context information to be incorporated in each binarization step for efficient binarization. Notably, the aforementioned works with RNNs in image classification only apply to DIB, as there has not been any published work related to RNNs for DSS.


Chapter 4

Method

This chapter describes the proposed method for recurrent segmentation of historical document images, which includes pre-processing steps of the images, multi-class labeling and the resulting implementations of the employed RNN architectures.

4.1 Image Pre-processing

The amount of GPU memory allocated is related to the resolution of an image, which itself is connected to the complexity of a model, as the resolution scales with the complexity. Here, complexity refers to the number of parameters required to fit arbitrary data. Therefore, considering these limitations, it is not possible to upload entire images to the GPU, which means the images have to be split into patches of f × f pixels, with f being a configured footprint size. Notably, when the resolution of an original image does not allow the whole image to be split into equally divided patches of the footprint size, we pad the image with zeros. Afterwards, when the training of a network has been completed, the padding is removed and image patches are ultimately combined into a prediction image.
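A minimal version of this padding and patching step could look as follows; the function name and the example sizes are illustrative, not taken from the actual implementation (see the linked repository).

```python
import numpy as np

def split_into_patches(image, f):
    """Zero-pad an H x W x C image so both sides divide f, then cut f x f patches."""
    h, w, c = image.shape
    pad_h, pad_w = (-h) % f, (-w) % f          # padding needed per axis
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode='constant')
    patches = (padded
               .reshape(padded.shape[0] // f, f, padded.shape[1] // f, f, c)
               .swapaxes(1, 2)
               .reshape(-1, f, f, c))
    return patches, (h, w)  # keep the original size to strip the padding later

patches, original_size = split_into_patches(np.ones((100, 70, 3)), 32)
print(patches.shape)        # (12, 32, 32, 3): a 4 x 3 grid of patches
```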

4.2 Multi-class Labeling

In an image, each pixel can be encoded into an image class. This means that a pixel either belongs to a particular class or not, which can be thought of as a number of either 0 or 1, i.e. a binary number. For multiple classes, each pixel can be represented as a sequence, {b_1, b_2, ..., b_n}, where b_t is a binary number corresponding to an image class. Furthermore, there are cases when a pixel can be a part of multiple classes. Thus, an image could for example consist of the following classes: background, main-text-body and decorations, where main-text-body and decorations can overlap, as sentences in the main-text-body could begin with a decoration element. In those cases we have chosen to use a bitmask approach when representing sequences, as it allows a vast array of possible class combinations.

To generate prediction values of the classes, all our models have an output layer that uses a sigmoid activation function for each image class at the end of the layer. The sigmoid functions produce probability values between 0 and 1, but as we have encoded our classes as binary numbers, we ultimately have to convert all of the probability values into binary numbers. In our approach, this is done by setting a threshold value of 0.5: all of the values that fall into the upper range of this threshold will have a value of 1, while the lower range of values will have a value of 0.
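A sketch of this thresholding step, with hypothetical class names and sigmoid outputs:

```python
import numpy as np

# Hypothetical sigmoid outputs of one pixel for three image classes
# (background, main-text-body, decorations).
probabilities = np.array([0.12, 0.91, 0.67])

# Threshold at 0.5: the upper range maps to 1, the lower range to 0.
bits = (probabilities >= 0.5).astype(np.uint8)
print(bits)  # [0 1 1]: the pixel belongs to two overlapping classes
```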

4.3 Implementation

This section presents implementation details of the employed RNN architectures.

4.3.1 Bidirectional Long-Short Term Memory

In the Bidirectional Long-Short Term Memory model we use two BLSTM networks, B_1 and B_2, which read image patches with four independent LSTMs from separate directions. The different LSTM networks are denoted as L^C_1, L^C_2, L^R_1 and L^R_2, where {L^C_1, L^C_2} ∈ B_1 and {L^R_1, L^R_2} ∈ B_2. Then, for B_1, L^C_1 starts reading a sequence of pixels column-wise from left to right while L^C_2 processes the sequence from right to left. For B_2, we have L^R_1 reading a sequence row-wise from top to bottom while L^R_2 processes the sequence from bottom to top. This is optimal as we are able to predict the value of a single pixel with contextual information from two dimensions.

After the LSTM networks are done with their processing, we align the results and then concatenate them into one feature sequence (A). Here, the alignment ensures that the output data matches the corresponding input pixels. As a last step in the network we process the feature sequence through an output layer (B), which generates an output that has the same size as the input. In Figure 4.1 an overview of the model architecture can be seen.

Figure 4.1: BLSTM model architecture.

In detail, the LSTM networks read one whole row or column of pixels for each timestep, where the pixel values vary between 0 and 1. Furthermore, each of the LSTMs outputs whole sequences and has the same amount of memory cells as the total size of one sequence, i.e. f × c, where f is the current footprint size and c is the number of color channels in the data. As a minor optimization we also add a bias of 1 to the LSTMs' forget gate at initialization. This is recommended by Jozefowicz et al. [19] as it improves the general performance of an LSTM network.
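The following Keras sketch illustrates the structure of this model under the assumptions f = 32, c = 3 and four image classes. It simplifies the alignment step (the features read along the swapped axes are not permuted back to pixel order here) and is not the exact implementation from the linked repository.

```python
from keras import layers, models

f, c, num_classes = 32, 3, 4            # assumed footprint, channels, classes
inp = layers.Input(shape=(f, f, c))

# B1: read the patch along one spatial axis; each timestep holds f*c values,
# and Bidirectional runs the forward and backward LSTMs in one layer.
rows = layers.Reshape((f, f * c))(inp)
b1 = layers.Bidirectional(layers.LSTM(f * c, return_sequences=True))(rows)

# B2: swap the spatial axes so the second BLSTM reads the other direction.
cols = layers.Permute((2, 1, 3))(inp)
cols = layers.Reshape((f, f * c))(cols)
b2 = layers.Bidirectional(layers.LSTM(f * c, return_sequences=True))(cols)

# (A) concatenate the two reading results into one feature sequence.
features = layers.Concatenate()([b1, b2])

# (B) output layer: one sigmoid per pixel and image class.
out = layers.Dense(f * num_classes, activation='sigmoid')(features)
out = layers.Reshape((f, f, num_classes))(out)

model = models.Model(inp, out)
```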


4.3.2 Bidirectional Long-Short Term Memory+

This implementation extends the Bidirectional Long-Short Term Memory model in section 4.3.1 by adding a regular CNN before the processing part of the two BLSTM networks. This is significant as the CNN reads block-wise compared to the sequential LSTM, which allows it to produce data with preserved spatial information. Therefore, with the assistance of the CNN, it is possible to pass pre-processed data which is far more 'optimized' than the usual input to the BLSTM networks. Furthermore, the designed CNN consists of three convolutional layers with two pooling layers placed in-between. Here, the first convolutional layer uses 128 filters while the second and third use 64, along with kernel dimensions of 3×3 pixels. Lastly, we use rectified linear units (ReLUs) as activation functions, and the pooling layers downscale the feature maps produced by the convolutional layers to half of their original size.
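A sketch of the described CNN part, under the same assumed footprint as above; the padding mode and input shape are assumptions not stated in the text.

```python
from keras import layers, models

cnn_front = models.Sequential([
    layers.Conv2D(128, (3, 3), activation='relu', padding='same',
                  input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),   # 32x32 -> 16x16, halving the feature maps
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),   # 16x16 -> 8x8
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
])
# The resulting 8x8x64 feature maps are then reshaped into sequences and
# passed to the two BLSTM networks of section 4.3.1 in place of raw pixels.
```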

4.3.3 ReNet

The ReNet architecture is similar to the structure of a CNN which indicates that an employed model could potentially generate high performance results. In our implementation, we adapt the described architecture in the original article [35] to fit the use case of our application.

Figure 4.2: ReNet model architecture.

The proposed architecture in Figure 4.2 consists of two vertical and horizontal reading pass iterations, where vertical means column-wise reading and horizontal means row-wise reading. Here, each pass uses two LSTM networks with the same settings as described in section 4.3.1. In the first iteration, n = 1, we begin by providing an image patch to the two networks of the vertical pass, where L^{V_1}_1 sweeps across pixels from left to right while L^{V_1}_2 sweeps from right to left. Notably, all of the vertical passes function as a convolutional+pooling layer, which means that the horizontal pass will operate on downsampled data. Also, at the end of this pass we align and concatenate the reading results into one feature sequence (A_1) and then save it for additional processing in the following horizontal pass.

The horizontal pass is similar to the previous vertical pass but differs in the aspect that the LSTM networks read pixels horizontally. Thus, we have L^{H_1}_1 which sweeps from top to bottom whereas L^{H_1}_2 sweeps from bottom to top. Like the vertical pass, this pass also concludes by aligning and concatenating the reading results into one feature sequence (B_1). Furthermore, at this point, we repeat the whole process of vertical and horizontal processing one more time before ultimately processing the output data from the last horizontal pass through an output layer (C), which generates an output with the same dimensions as the input.
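A rough Keras sketch of the two vertical+horizontal iterations follows. It omits the downsampling performed by the vertical passes and the exact alignment of the implementation, and all sizes are assumptions.

```python
from keras import layers, models

f, c, num_classes = 32, 3, 4
inp = layers.Input(shape=(f, f, c))

x = inp
for n in range(2):                       # two vertical+horizontal iterations
    ch = int(x.shape[-1])
    # Vertical pass: two LSTMs sweep the patch in opposite directions (A_n).
    seq = layers.Reshape((f, f * ch))(x)
    v = layers.Bidirectional(layers.LSTM(f * ch, return_sequences=True))(seq)
    v = layers.Reshape((f, f, 2 * ch))(v)
    # Horizontal pass: transpose and sweep along the other axis (B_n).
    hseq = layers.Permute((2, 1, 3))(v)
    hseq = layers.Reshape((f, 2 * f * ch))(hseq)
    h = layers.Bidirectional(layers.LSTM(2 * f * ch, return_sequences=True))(hseq)
    x = layers.Reshape((f, f, 4 * ch))(h)

# (C) output layer with one sigmoid per pixel and image class.
out = layers.Dense(num_classes, activation='sigmoid')(x)
model = models.Model(inp, out)
```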


Chapter 5

Experiment

This chapter explains the experimental research design, which includes details about the chosen dataset along with the training procedure of the different models and how the performance results were obtained.

5.1 Dataset

The DIVA-HisDB dataset [32] is a collection of three handwritten medieval manuscripts consisting of 120 pages in total, including training, validation and testing pages (see Table 5.1).

Manuscript   Training (pages)   Validation (pages)   Testing (pages)   Resolution (pixels)
CSG18        20                 10                   10                3328 × 4992
CSG863       20                 10                   10                3328 × 4992
CB55         20                 10                   10                4872 × 6496

Table 5.1: Page properties of the separate manuscripts [40].

The dataset images provide challenging layouts, and for the whole dataset there are four RGB encoded classes: background (0x000001), comments (0x000002), decorations (0x000004) and main-text-body (0x000008). It is also possible for the annotated classes to overlap; thus a pixel can be a part of both the comment and decoration class. If this occurs, the value of the pixel will be the sum of the corresponding classes, i.e. 0x000006 for both comment and decoration.
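A hypothetical decoder for this bitmask encoding; the class-to-bit mapping follows the values listed above.

```python
# Each ground-truth pixel value is a bitwise sum of class flags, so testing
# individual bits recovers all (possibly overlapping) classes of a pixel.
CLASSES = {
    0x000001: 'background',
    0x000002: 'comments',
    0x000004: 'decorations',
    0x000008: 'main-text-body',
}

def decode(pixel_value):
    """Return the class names encoded in one ground-truth pixel value."""
    return [name for bit, name in CLASSES.items() if pixel_value & bit]

print(decode(0x000006))  # ['comments', 'decorations']: overlapping classes
print(decode(0x000008))  # ['main-text-body']
```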

5.2 Training

The training of the four neural network architectures (U-Net, BLSTM, BLSTM+, ReNet) was performed on the GPU with Keras 2.2.4 (https://keras.io/) and TensorFlow 1.12.0 (https://www.tensorflow.org/) as the backend. For the experiments we use the Adam optimization algorithm [21] configured with Keras' default parameters, while the training performance is computed with BCE loss. Images are also divided into samples of patches with a footprint size of 32 pixels and processed in batches containing 50 samples. These


settings have been the same for all of the models to produce fair evaluation results. The computer specifications for the experiments were as follows:

• CPU: Intel Xeon E5-1620 v4 3.50 GHz

• GPU: Nvidia GeForce GTX 1080, 8 GB VRAM
• RAM: 16 GB DDR4

• OS: Windows 10

We begin the training procedure by training a model on the training+validation pages of the CB55 manuscript, where a validation split of 20% is used throughout the training of all the manuscripts. Upon completion we save the trained model to disk and clear the memory of loaded images. This is required as we cannot load image patches from the three manuscripts into RAM at once; thus we have to continually save and load a model while resuming training on the training+validation pages of the CSG18 and CSG863 manuscripts, respectively, until the training on all the manuscripts has been completed. Notably, we train all our models for 100 epochs on each of the three manuscripts, where an epoch is defined as a complete training round of all the provided samples. In our case, one epoch consists of 259584 samples for the CSG18 and CSG863 manuscripts, and 493696 samples for the CB55 manuscript. It should also be mentioned that it is possible to train for more epochs, but we find that this limit gives a fair trade-off between accuracy and training time, where complete training time is at least a week for a single model.
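In Keras terms, the configuration described above amounts to something like the following sketch, where `model`, `x_train` and `y_train` stand in for one manuscript's model and patch data; the filename is hypothetical.

```python
from keras.optimizers import Adam

# Adam with Keras' default parameters, BCE loss, batches of 50 patch samples,
# 100 epochs per manuscript and a 20% validation split.
model.compile(optimizer=Adam(), loss='binary_crossentropy')
model.fit(x_train, y_train, batch_size=50, epochs=100, validation_split=0.2)
model.save('model_cb55.h5')  # saved to disk before switching to the next
                             # manuscript's training pages
```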

5.3 Evaluation

Performance results of each model are obtained by using the DIVA layout analysis evaluation tool [4]. This command-line tool has been chosen as it is used in the ICDAR2017 competition [31], for which the DIVA-HisDB dataset [32] is the provided dataset. The evaluation tool requires a prediction image and a ground truth image as input before producing a vast array of different precision results. In our approach we employ a batch script for processing the ground truth and produced prediction images and then solely extract the mean IU evaluation result for each individual prediction image.


Chapter 6

Results

This chapter presents statistical performance results of the evaluated architectures along with visualizations of the produced prediction images.

6.1 Performance

The performance of the evaluated architectures can be seen in Figure 6.1, where the U-Net model achieves the highest (92.953) and also the lowest (44.137) IU score. For the RNN architectures we observe significant performance differences between the three test models, with the BLSTM having the worst performance and the ReNet the best performance. In the case of the BLSTM and BLSTM+ models we observe a performance difference of 4.9%, which is caused by the pre-processing part of the CNN in the BLSTM+ model. Furthermore, by comparing the results of the ReNet and the U-Net, it can be seen that the ReNet model has a more consistent performance than the U-Net but also comparable performance results overall. In summary, the performance difference between these two models was 2.7%. Detailed performance results are provided in Table A.1 in appendix A.

Figure 6.1: IU results for the evaluated architectures on all of the test images. Boxes indicate the upper and lower quartile while the separation line is the median. Whiskers represent the lowest and highest observation.


It should be noted that the performance varies for all of the models between the different manuscripts, where the last manuscript has the best performance results overall (see Table A.1). For this dataset, a variation is expected as some manuscripts have pages which are more complex than others. However, in our case, the reason why the last manuscript significantly outperforms the other ones likely has to do with the training order of the manuscripts.

Architecture   Parameters
U-Net          31,454,853
BLSTM           7,023,274
BLSTM+         21,224,064
ReNet         170,302,464

Table 6.1: Complexity of the evaluated architectures.

In general, the complexity of a model can determine the outcome of the performance results, as utilizing more parameters allows a model to fit arbitrary data better. This is shown in Table 6.1, where the ReNet and the U-Net have the highest complexity of the evaluated architectures. Notably, the U-Net does not require the same number of parameters as the ReNet, which could be explained by the fact that the ReNet is an RNN, thus having to perform more processing as it relies on previous information.

6.2 Image Prediction

In this section, we only present image prediction results of the ReNet model from testing pages of the three manuscripts. This is the case since the labeling results produced by all of the RNN models are very similar. The prediction results are shown in Figures 6.2, 6.3 and 6.4, along with samples of original, ground truth and enlarged prediction images. In the ground truth and prediction images the colors black, red, green and blue represent the image classes background, decorations, main-text-body and comments, respectively. Notably, in the cases when classes overlap, such as background and decorations, a mix of the colors from the annotated classes is used. By examining the enlarged samples we can notice that overlapping classes are an issue for the network, as they are likely harder to label since they can be misinterpreted as another class. Furthermore, in Figure 6.2 we observe that the network has problems distinguishing between the classes main-text-body and comments, while Figure 6.4 shows significantly better results for the same issues. Here, the prediction samples in Figure 6.2 belong to the first manuscript in the training order while the samples in Figure 6.4 correspond to the last manuscript. Thus, as mentioned earlier, this outcome is likely the result of the training order of the manuscripts.

It is possible to identify similar issues in all of the prediction images, such as line thickness, edges of the foreground and small areas where the background has been mislabeled as foreground. In all of those cases, blocky patterns of pixels can be observed, which can be explained by the choice of footprint size, as the produced image patches are too small to be able to provide our model with complete contextual information.

(a) Original. (b) Ground truth.

(c) Enlarged prediction. (d) Prediction.

Figure 6.2: Image sample variations from a testing page of the CB55 manuscript. Prediction images are produced by the ReNet architecture.

(a) Original. (b) Ground truth.

(c) Enlarged prediction. (d) Prediction.

Figure 6.3: Image sample variations from a testing page of the CSG18 manuscript. Prediction images are produced by the ReNet architecture.



(a) Original. (b) Ground truth.

(c) Enlarged prediction. (d) Prediction.

Figure 6.4: Image sample variations from a testing page of the CSG863 manuscript. Prediction images are produced by the ReNet architecture.


Chapter 7

Discussion

7.1 Performance

In Chapter 6, we concluded that the U-Net achieved the best performance of the evaluated architectures. However, it was interesting to note that the ReNet had a more consistent performance than the U-Net. This could possibly be explained by the fact that the ReNet has an underlying RNN architecture, which means that it is able to recognize relationships between document labels in some of the testing images better than the U-Net. In a comparison between the two models we observed a performance difference of 2.7%, which indicates that an RNN is able to produce comparable performance to a CNN in DSS. However, this is only true for the current footprint size; thus we cannot ultimately claim, without varying the footprint size, that RNNs are a viable alternative to CNNs in DSS. Furthermore, we observed that the BLSTM+ model performed significantly better than the regular BLSTM variant. This was expected since the CNN part of the BLSTM+ model could pass 'optimized' pre-processed data to its BLSTM counterpart. Notably, as an observation, it should be possible to achieve even higher performance results with this model if we replaced the regular CNN with a U-Net instead. However, in this case, the U-Net would probably perform the overall processing of the model, which indicates that this configuration might not benefit from an RNN architecture.

7.1.1 Model Complexity

As previously mentioned, the complexity of a model can ultimately determine the outcome of the performance results. This is the case since utilizing more parameters allows a model to fit arbitrary data better. Therefore, if the BLSTM and BLSTM+ models would perform more processing, their performance results would likely be higher. Notably, this also applies to the ReNet, which means that if we would add more iterations to the vertical and horizontal passes, we could potentially produce equal or even better performance than the U-Net. However, as the complexity of the ReNet is already very high, this suggestion is not viable with the current GPU memory limitation, as the whole model needs to be stored on the GPU. A possible workaround would be to decrease the resolution of image patches, since the resolution scales with the complexity of a model. But as this would likely generate even worse performance results, it should not be considered an option.



7.1.2 Performance Variation

For the different manuscripts we observed a significant variation in the performance results, where the last manuscript in the training order has the best performance results overall. This is reasonable because the last manuscript is the one on which all of our models end their training, thus ultimately producing the highest performance results, as the models are likely overfitting the data since their training weights are adjusted to favor the images in this particular manuscript. The underlying problem for this outcome has to do with our RAM limitation, as it forces us to train on manuscripts separately rather than loading all of the data into memory at once and then having Keras shuffle it before each epoch. Notably, it is possible to shuffle the images of the different manuscripts manually, but as it is impossible to know beforehand which combinations perform better than others, this process would be rather time consuming.

7.1.3 Overlapping Classes

In the provided prediction images of the ReNet model, labeling overlapping classes tends to be a recurring problem for the network. This could be related to the fact that the network can misinterpret the combination of classes as only one of the combined classes. In some cases, such as for decoration elements, this is reasonable as there can be class imbalances between background and foreground.

7.1.4 Footprint Size

We identified similar issues with the edges and line thickness of the foreground in the prediction images of the three manuscripts. We argue that this can be explained by the choice of footprint size as the current image patches are too small to be able to provide our models with complete contextual information (see Figure 7.1).

(a) 32×32 image patch. (b) 64×64 image patch.

Figure 7.1: Image samples of different patch sizes.

Thus, it is likely that if we increased the footprint size, multiple issues would be reduced and the performance results for all of the models further improved. However, the current footprint size of 32 pixels was chosen because of our GPU memory limitation, as the resolution of an image scales with the complexity of a model.



7.2 Validity Threats

In this thesis, we have evaluated the performance of our models on a single dataset. This indicates that the contributions of our work are not completely reliable. However, as the DIVA-HisDB dataset [32] is a competitive dataset and contains pages with varying degrees of difficulty, it could be argued that the conclusions can be justified. For future work, if training time is not a limitation, benchmarking on other datasets could be an option to further support the claims of the results.


Chapter 8

Conclusions

In this thesis, we have evaluated the performance of multiple RNN architectures in DSS and compared their performance results to a CNN. By analyzing the results, the CNN achieves the best performance overall while the ReNet architecture shows a more consistent but similar performance. In a comparison between the two models we observed a performance difference of 2.7%, which indicates that an RNN is able to produce comparable performance to a CNN in DSS. However, as this is only true for the current footprint size, we cannot truly claim, without varying the footprint size, that RNNs are a viable alternative to CNNs in DSS. This means that the first research question, which asks whether RNNs are a viable alternative to CNNs in DSS, cannot be completely answered, as further investigations are required; however, it is likely that RNNs are a viable alternative to CNNs in DSS.

For the BLSTM and BLSTM+ models, we observed a significant 4.9% performance increase for the BLSTM+ model. This implies that a combination of a CNN and an RNN generates a better performance overall. The second research question, which involves the performance impact of combining a CNN and an RNN in DSS, can therefore be properly answered, as a combination of a CNN and an RNN gives a higher performance overall compared to the case without a CNN.

During the evaluation we noticed a performance variation for all of the models on the different manuscripts in the dataset, which was likely the result of the training order of the manuscripts. With access to more RAM it could have been possible to prevent this issue and instead produce a more consistent performance. Notably, the complexity of a model is related to the outcome of the performance results; thus performing more processing in all of the RNN models should generate even higher performance results. However, this suggestion is not viable for the ReNet as we are limited by the GPU memory.

As for the produced prediction images, we noticed that overlapping classes tend to be a recurring issue for a model. In some cases, this is reasonable as overlapping classes can be misinterpreted as only one of the combined classes. Also, for the current footprint size, we concluded that small image patches were connected to labeling problems with the edges and line thickness of the foreground. Therefore, by increasing the footprint size a model should be able to generate better predictions, as a larger footprint size allows processing of additional contextual information.


Chapter 9

Future Work

In Chapter 8, we concluded that performing more processing and increasing the footprint size should allow higher performance results. Apart from these options, this chapter describes other alternatives which could be considered in future work.

9.1 ReNet

In the ReNet architecture we currently use LSTM as the recurrent unit, but it could also be replaced with a Gated Recurrent Unit (GRU) [9]. This could be an advantage as the GRU utilizes fewer parameters, which should allow the ReNet model to perform more processing more cheaply. However, as the reported performance results of the LSTM and GRU architectures are very similar, this option should at best be considered a rather minor improvement.
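In Keras, the swap would be a one-line change in each reading pass; the unit count below is just the f × c value used earlier and is an assumption.

```python
from keras import layers

# GRU as a drop-in replacement for LSTM in a bidirectional reading pass.
recurrent = layers.Bidirectional(layers.GRU(96, return_sequences=True))
```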

9.2 Image Post-processing

Post-processing can be applied to image patches before they are combined into a prediction image. This includes methods such as noise reduction, region correction and overlap refinement whereas noise reduction is the most common. All of these methods are employed as post-processing steps by Xu et al. [40] which as an outcome achieves a slight performance increase in their segmentation process.

9.3 Augmentation

Augmentation involves altering image pixels and manipulating the transformation of an image, which provides machine learning models with data variation. This is significant as multiple data variations can prevent overfitting, which occurs when a model adapts to particular patterns in the training data. Furthermore, augmentation can be divided into two categories, namely offline and online augmentation. Offline augmentation is usually used for smaller datasets and is performed as a pre-processing step before images are processed in batches. Online augmentation, on the other hand, can be performed directly on images and is the preferred approach when memory is an issue. In our case, online augmentation would be the more suitable choice, and as a result our models could possibly produce higher performance on all of the manuscripts in the dataset.



9.4 Multi-dimensional Processing

In this thesis, we only process data in two dimensions, but other RNN architectures such as the Grid Long Short-Term Memory (Grid LSTM) [20] and the Multi-dimensional Long Short-Term Memory (MDLSTM) [12] could be employed to enable multi-dimensional processing. Here, multi-dimensional processing means that we process a sequence along as many dimensions as there are in the data. This has the advantage that a model is able to predict data with maximum contextual information, which could further prove to be a major optimization.


References

[1] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545, Oct 2014.

[2] M. Afzal, J. Pastor-Pellicer, F. Shafait, T. M. Breuel, A. Dengel, and M. Liwicki. Document image binarization using LSTM: A sequence learning approach. In Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing, HIP '15, pages 79–84, New York, NY, USA, 2015. ACM.

[3] M. Agrawal and D. Doermann. Voronoi++: A dynamic page segmentation approach based on Voronoi and Docstrum features. In 2009 10th International Conference on Document Analysis and Recognition, pages 1011–1015, July 2009.

[4] M. Alberti, M. Bouillon, R. Ingold, and M. Liwicki. Layout analysis evaluator. https://github.com/DIVA-DIA/DIVA_Layout_Analysis_Evaluator. Accessed: 2019-06-05.

[5] N. Anantrasirichai, I. D. Gilchrist, and D. R. Bull. Visual salience and priority estimation for locomotion using a deep convolutional neural network. In 2016 IEEE International Conference on Image Processing (ICIP), pages 1599–1603, Sep. 2016.

[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.

[7] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu. Recurrent neural networks for multivariate time series with missing values. arXiv:1606.01865, 2016.

[8] K. Chen, M. Seuret, J. Hennebert, and R. Ingold. Convolutional neural networks for page segmentation of historical document images. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 965–970, Nov 2017.

[9] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics.


[10] R. Cohen, A. Asi, K. Kedem, J. El-Sana, and I. Dinstein. Robust text and drawing segmentation algorithm for historical documents. In Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, HIP '13, pages 110–117, New York, NY, USA, 2013. ACM.

[11] A. Garz, R. Sablatnig, and M. Diem. Layout analysis for historical manuscripts using sift features. In 2011 International Conference on Document Analysis and Recognition, pages 508–512, Sep. 2011.

[12] A. Graves, S. Fernández, and J. Schmidhuber. Multi-dimensional recurrent neural networks. In Artificial Neural Networks – ICANN 2007, pages 549–558, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

[13] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks: The Official Journal of the International Neural Network Society, 18:602–610, July 2005.

[14] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen. Recent advances in convolutional neural networks. Pattern Recognition, 77:354–377, 2018.

[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.

[17] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, Jan 2013.

[18] N. Journet, J. Ramel, R. Mullot, and V. Eglin. Document image characterization using a multiresolution analysis of the texture: application to old documents. International Journal of Document Analysis and Recognition (IJDAR), 11(1):9–18, Oct 2008.

[19] R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 2342–2350. JMLR.org, 2015.

[20] N. Kalchbrenner, I. Danihelka, and A. Graves. Grid long short-term memory. arXiv:1507.01526, 2015.

[21] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.


[22] S. A. A. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. H. Maier-Hein, S. M. Eslami, D. Rezende, and O. Ronneberger. A probabilistic u-net for segmentation of ambiguous images. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 6965–6975, USA, 2018. Curran Associates Inc.

[23] K. L. Lim, T. Drage, and T. Bräunl. Implementation of semantic segmentation for road and lane detection on an autonomous ground vehicle with lidar. In 2017 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), pages 429–434, Nov 2017.

[24] M. Masyagin. robin. https://github.com/masyagin1998/robin. Accessed: 2019-06-05.

[25] G. Nagy, S. Seth, and M. Viswanathan. A prototype document image analysis system for technical journals. Computer, 25(7):10–22, July 1992.

[26] N. Ouwayed and A. Belaïd. Multi-oriented text line extraction from handwritten arabic documents. In 2008 The Eighth IAPR International Workshop on Document Analysis Systems, pages 339–346, Sep. 2008.

[27] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.

[28] H. R. Roth, H. Oda, X. Zhou, N. Shimizu, Y. Yang, Y. Hayashi, M. Oda, M. Fujiwara, K. Misawa, and K. Mori. An application of cascaded 3d fully convolutional networks for medical image segmentation. Computerized Medical Imaging and Graphics, 66:90–99, 2018.

[29] H. Salehinejad, J. Baarbe, S. Sankar, J. Barfett, E. Colak, and S. Valaee. Recent advances in recurrent neural networks. arXiv:1801.01078, 2018.

[30] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, Nov 1997.

[31] F. Simistira, M. Bouillon, M. Seuret, M. Würsch, M. Alberti, R. Ingold, and M. Liwicki. Icdar2017 competition on layout analysis for challenging medieval manuscripts. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 1361–1370, Nov 2017.

[32] F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, and R. Ingold. Diva-hisdb: A precisely annotated large dataset of challenging medieval manuscripts. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 471–476, Oct 2016.

[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.


[34] S. Stewart and B. Barrett. Document image page segmentation and character recognition as semantic segmentation. In Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, HIP2017, pages 101–106, New York, NY, USA, 2017. ACM.

[35] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. C. Courville, and Y. Bengio. Renet: A recurrent neural network based alternative to convolutional networks. arXiv:1505.00393, 2015.

[36] Q. Vo, S. Kim, H. Yang, and G. Lee. Binarization of degraded document images based on hierarchical deep supervised network. Pattern Recognition, 74, 09 2017.

[37] F. Westphal. Efficient document image binarization using heterogeneous computing and parameter tuning. International Journal on Document Analysis and Recognition (IJDAR), 21(1):41–58, Jun 2018.

[38] F. Westphal, N. Lavesson, and H. Grahn. Document image binarization using recurrent neural networks. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pages 263–268, April 2018.

[39] C. Wick and F. Puppe. Fully convolutional neural networks for page segmentation of historical document images. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pages 287–292, April 2018.

[40] Y. Xu, W. He, F. Yin, and C. Liu. Page segmentation for historical handwritten documents using fully convolutional networks. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 541–546, Nov 2017.

[41] X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer, and C. L. Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4342–4351, July 2017.


Appendix A

Raw Data

Image   U-Net    BLSTM    BLSTM+   ReNet
1       72.084   59.822   68.678   69.315
2       69.128   53.078   61.591   62.473
3       67.462   48.722   57.787   58.859
4       65.381   50.698   58.389   59.279
5       66.838   51.469   60.204   60.210
6       63.077   47.757   55.288   57.213
7       67.847   52.155   58.878   58.008
8       68.941   52.913   62.162   58.871
9       70.902   53.829   63.307   57.499
10      64.618   44.812   62.031   58.922
11      51.053   52.431   49.911   49.449
12      52.301   54.570   51.854   52.478
13      71.192   64.670   74.373   74.459
14      64.980   64.605   70.127   70.876
15      53.328   63.365   59.467   55.349
16      44.137   52.237   52.630   51.437
17      59.631   63.818   66.804   64.432
18      60.315   64.231   64.436   63.841
19      66.339   65.397   72.146   73.757
20      62.627   57.872   66.608   67.290
21      92.953   79.998   84.114   86.678
22      90.798   73.323   78.096   83.991
23      92.876   81.095   84.484   89.075
24      64.705   52.936   55.260   77.380
25      76.743   46.956   45.915   65.214
26      76.842   69.221   69.862   72.806
27      77.004   66.064   70.195   72.204
28      88.608   74.035   76.487   79.762
29      82.628   74.593   77.714   80.635
30      85.748   72.836   77.516   78.247
Avg     69.703   60.317   65.210   67.000

Table A.1: IU (%) results of individual test images for the evaluated architectures. Images 1-10 correspond to the CB55 manuscript, while images 11-20 and 21-30 belong to the CSG18 and CSG863 manuscripts, respectively.
