http://www.diva-portal.org

Postprint

This is the accepted version of a paper published in Pattern Recognition Letters. This paper has been peer-reviewed but does not include the final publisher proof-corrections or journal pagination.

Citation for the original published paper (version of record):

Ayyalasomayajula, K.R., Malmberg, F., Brun, A. (2019)
PDNet: Semantic segmentation integrated with a primal-dual network for document binarization
Pattern Recognition Letters, 121: 52-60
https://doi.org/10.1016/j.patrec.2018.05.011

Access to the published version may require subscription.

N.B. When citing this work, cite the original published paper.

Permanent link to this version:


PDNet: Semantic Segmentation Integrated with a Primal-Dual Network for Document Binarization

Kalyan Ram Ayyalasomayajula∗∗, Filip Malmberg, Anders Brun

Division of Visual Information and Interaction, Dept. of Information Technology, Uppsala University, Uppsala, 751 05, Sweden.

ABSTRACT

Binarization of digital documents is the task of classifying each pixel in an image of the document as belonging to the background (parchment/paper) or foreground (text/ink). Historical documents are often subjected to degradations that make the task challenging. In the current work a deep neural network architecture is proposed that combines a fully convolutional network with an unrolled primal-dual network, trained end-to-end, and achieves state-of-the-art binarization on four out of seven datasets. Document binarization is formulated as an energy minimization problem. A fully convolutional neural network is trained for semantic segmentation of pixels, which provides the labeling cost associated with each pixel. This cost estimate is refined along the edges, using a primal-dual approach, to compensate for any over- or under-estimation of the foreground class. We provide the necessary overview of the proximal operator, which gives the theoretical underpinning required to train a primal-dual network using a gradient descent algorithm. Numerical instabilities encountered due to the recurrent nature of the primal-dual approach are handled. We provide experimental results on the document binarization competition datasets, along with the network changes and hyperparameter tuning required for stability and performance of the network. The network, when pre-trained on a synthetic dataset, performs better as measured by the competition metrics.

Accepted in PRL DOI:10.1016/j.patrec.2018.05.011

1. Introduction

The process of binarizing digital documents deals with classifying each pixel as belonging to the background (parchment/paper) or foreground (text/ink) while preserving most of the relevant visual information in the image. Binarization is a common pre-processing step in most tasks performed on document images, such as word spotting and transcription, where a high-quality and accurate binarization significantly simplifies the task at hand. In addition to the challenges due to uneven illumination and artifacts introduced by capturing devices, historical documents may have other degradations such as bleed-through; fading or paling of the ink in some areas; smudges, stains and blots covering the text; textured backgrounds; and handwritten documents with heavy or feeble pen strokes for cursive or calligraphic effects, to name a few. In general, this makes the task of document binarization very challenging, as shown in Fig. 1. The task is often subjective, with multiple acceptable outcomes encountered in corner cases, such as considering an ink blot as being part of the foreground or background. The problem of document binarization garners interest in the field, which has led to the document image binarization contest (DIBCO) Pratikakis et al. (2011a), for automatic methods with minimum parameters to tune.

∗∗Corresponding author: e-mail: kalyan.ram@it.uu.se (Kalyan Ram Ayyalasomayajula)

The task of document binarization borrows techniques from denoising, background removal, image segmentation and image in-painting; hence there exist several successful methods with individual strengths. The classical approaches have tried to separate the pixels into two classes using a single global threshold or a series of finer local thresholds. The approach by Otsu (1979) tries to maximize the gray-level separation between the foreground (FG) and background (BG) classes by maximizing the inter-class variance. However, local intensity variations and other artifacts introduced when creating a digital image have led to

Fig. 1: Examples of typical image degradations from the DIBCO dataset: (a) smudging of text, (b) staining of the parchment, (c) textured background, (d) uneven pen strokes, (e) scanning artifacts, (f) bleed-through of ink from the other side of the document, (g) blotting over text, (h) feeble contrast between ink and parchment, (i) artifacts from document aging, (j) fading of text.

the success of locally adaptive techniques, such as the methods of Niblack (1986) and Sauvola and Pietikäinen (2000). The techniques discussed in all these classical methods are generic and applicable to any image in general. However, developing an approach specific to document images has been the trend in winning entries of DIBCO in the past. These methods seek to improve binarization by modeling properties of FG/BG in document images specifically. Lu et al. (2010), for instance, have modeled the background using polynomial smoothing followed by local thresholding on detected text strokes, while Bar et al. (2007) iteratively grow the FG and BG within a 7×7 window.
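For illustration, a minimal NumPy/SciPy sketch of the two classical baselines just mentioned is given below: a global Otsu threshold and a locally adaptive Sauvola threshold. The window size, the parameter k and the dynamic range R are typical values assumed here, not settings quoted from the cited papers.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def otsu_threshold(gray):
    """Global threshold maximizing the inter-class variance (Otsu, 1979)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist.astype(np.float64) / hist.sum()
    omega = np.cumsum(p)                        # probability of the lower class
    mu = np.cumsum(p * np.arange(256))          # cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return np.nanargmax(sigma_b)                # gray level with maximal between-class variance

def sauvola_mask(gray, window=25, k=0.2, R=128.0):
    """Locally adaptive threshold T = m(1 + k(s/R - 1)) (Sauvola and Pietikäinen, 2000)."""
    g = gray.astype(np.float64)
    m = uniform_filter(g, window)                                        # local mean
    s = np.sqrt(np.maximum(uniform_filter(g * g, window) - m * m, 0.0))  # local std
    return g > m * (1.0 + k * (s / R - 1.0))                             # True = background/paper

# usage on a uint8 gray-scale page image `img`:
# bg_otsu = img > otsu_threshold(img)
# bg_sauvola = sauvola_mask(img)
```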

The recent success of deep neural methods in vision-related tasks has been broadly due to their ability to effectively encode the spatial dependencies and redundancy in an image. A fully convolutional neural network (FCNN) (Shelhamer et al., 2015) is best suited for semantic labeling of pixels, which is the primary objective in segmentation. The crucial idea is to use skip connections to combine the coarse features from deep layers with fine features from shallow layers to improve the final segmentation. Training such models on text images, however, often results in a loss of finer details along edges. Hence a post-processing step such as a graph-cut (Boykov and Kolmogorov, 2004) often improves the results, as shown in our previous work (Ayyalasomayajula and Brun, 2017). Here, we improve upon our previous work by incorporating the energy minimization step directly in the network to facilitate joint end-to-end training of both the semantic segmentation and energy minimization steps. To this end, we adopt a primal-dual update (PDUpdate) scheme (Ochs et al., 2015). See Fig. 2(a). This framework helps in training the unary cost associated with pixel labeling, the pairwise cost (Blake et al., 2011) associated with smoothness of neighboring pixels and the cost of overlooking an edge when merging regions, within a single framework, resulting in an optimally segmented image. Our contributions in the

present method can be summarized as:

• A stable framework that allows end-to-end training of an energy minimization function along with a semantic labeling network, termed Primal-Dual Net (PDNet).

• Improved segmentation output from a semantic labeling network that is lightweight in terms of trainable weights.

• A numerically stable, unrolled PDUpdate scheme when formulating binarization as a total-variation problem, which can be extended to generic image-based segmentation with multiple classes.

• Improved gradient propagation from the PDUpdate, using modified class weighting in the loss function.

This paper is divided into five main sections describing the approach and an appendix that provides a summary of the necessary theory. An overview of document binarization with relevant background is covered in the introduction section. This is followed by a section reviewing some of the related work, highlighting some of the challenges that were overcome in our approach. An overview of the network architecture and the functionality of its basic blocks is given in the methodology section, covering all the details on architectural changes made in building the network. This is followed by an experimental section that covers the results on the DIBCO dataset for the architectural choices discussed previously. The article is concluded with the contributions of the current work and possible directions for future research. All the necessary details on the theoretical framework are covered in an appendix towards the end.

2. Related work

The basic algorithm proposed in this paper draws motivation from other ideas that have used a high-level loss function


Fig. 2: (a) The basic architecture of the binarization network (PDNet), with the unary network (ENet), primal-dual update (PDUpdate), loss function and finite-difference based edge estimation blocks. Network modules are shown in solid lines and layers are shown as dashed lines. (b) This image summarizes the advantage of PDNet over simple semantic segmentation. Upon zooming into a typical FG patch, the pixels in blue are segmented by both ENet and PDNet. Pixels marked in red and green are over- and under-estimated by ENet, respectively. These are successfully delineated in PDNet due to the PDUpdate scheme.

as an energy associated with binarization. This loss is then optimized by minimizing the said energy over the image; typical examples include the use of Markov random fields (MRF) for binarization (Mishra et al., 2011) and intensity variation reduction using a Laplacian kernel (Howe, 2011). These methods take both the global and the local aspects of the image into consideration to label the pixels. The latter uses the Laplacian of the image to obtain invariance in BG intensity, followed by a graph-cut with suitable source-sink priors (seed points for FG and BG, respectively) and edge estimates required to build an image graph. Although the fundamental ideas of using a defined loss as employed in Howe's method had been explored previously as separate methods, combining them into an energy function proved particularly effective. A further improvement of Howe's approach was proposed in our previous method (Ayyalasomayajula and Brun, 2014) by defining a 3D binarization space comprising the intensity, horizontal and vertical derivatives at a pixel. A hierarchical clustering exploiting the inherent topology in this space led to effective detection of seeds for the source and sink estimates and refinement of edges to improve the binarization result.

Before proceeding with the details of the method, we would like to motivate the reader towards an end-to-end trainable model as proposed in the current approach. The methods winning the DIBCO competitions 2009-2016 tune their parameters by training on the labeled data available from the previous competitions. Though this approach is common to all supervised learning based approaches, these binarization methods fail to support their claim to generalizability through an exhaustive k-fold cross validation, where the model is trained on all the labeled data from the other competition years and tested on the data for the DIBCO competition year under consideration. We do compare our results with these state-of-the-art methods in Table 2; however, it must be noted that these DIBCO competition winners are different for each year (Gatos et al. (2011), Pratikakis et al. (2010), Pratikakis et al. (2011b), Pratikakis et al. (2012), Pratikakis et al. (2013), Ntirogiannis et al. (2014), Pratikakis et al. (2016)). As the current approach is based on a deep neural network, we focus on FCNN-based approaches for further discussion.

To our knowledge there are two methods based on FCNNs that provide exhaustive cross-validation results on each of the DIBCO datasets for document binarization, namely Ayyalasomayajula and Brun (2017) and Tensmeyer and Martinez (2017). We used a post-processing step based on graph-cut in Ayyalasomayajula and Brun (2017) to improve the segmentation output of an FCNN; however, that approach is not end-to-end trainable. The proposed approach is an improvement over the former, as it is an end-to-end trained energy minimization approach formulated as a total-variation scheme on the FCNN output. As shown in Fig. 2(b), the output from the latter network compensates for the overestimated FG along the thick stroke boundaries and preserves the underestimated FG that is missed along the thin strokes by the former network.

We conclude this section by discussing some of the shortcomings in the other deep networks proposed for document image binarization. As introduced in the previous section, binarization can be solved through various formulations involving pixel classification. The approach that is relevant in all the FCNN-based binarization methods is that of semantic segmentation. In the semantic segmentation literature (Garcia-Garcia et al., 2017), integrating the context information surrounding a pixel has shown improved segmentation output. The proposed algorithm and Tensmeyer and Martinez (2017) broadly fall into this category of refining the output from an FCNN. The former uses a total-variation framework and the latter uses feature augmentation along with a loss function tailored for the DIBCO metrics to meet this end. The performance of Tensmeyer and Martinez (2017) depends on two aspects. First, the binarization output is obtained from an ensemble of 5 networks trained on the DIBCO data instead of a single trained network. Second, feature augmentation such as the image Laplacian (Howe, 2011) or a Relative Brightness feature is used, which affects the binarization quality. Zhou et al. (2002) have shown that using an ensemble is always better than using a single trained network, and Hariharan et al. (2014) have shown the advantage of feature augmentation on the classification output of deep networks. Both these tricks, though often used in training deep networks, do not alter the fundamental segmentation output of the FCNN; instead they improve the final output of segmentation after a suitable loss function. However, improving the fundamental output of the FCNN is known to improve the segmentation output (Zheng et al., 2015). PDNet is a

generic network for semantic segmentation based on this idea of CRF unrolling (Zheng et al., 2015), aimed at improving the fundamental segmentation output from the FCNN, and it achieves either state-of-the-art or close to state-of-the-art results with a single trained network instance instead of an ensemble.

3. Methodology

The basic architecture of the proposed end-to-end binarization network PDNet is shown in Fig. 2(a). The network is built of three basic blocks:

– Unary network: This is a semantic segmentation network that is capable of classifying each pixel in a given image into respective classes. Ideally such a network is quite capable of segmenting a given image by itself. Underlying such a classification is a cost associated with labeling each pixel as a particular class. We use the efficient neural network (ENet) proposed by Paszke et al. (2016). The motivation for such an architecture is presented in the following section. However, as shown in our previous work, Ayyalasomayajula and Brun (2014) and Ayyalasomayajula and Brun (2017), text segmentation is sensitive to edge artifacts. Instead of using this network output directly, we use the output prior to a typical softmax-like classification layer as the cost term for each pixel, which can be further refined to improve the segmentation result.

– Primal-Dual Update: The design of this network is inspired by previous methods that made use of conditional random fields (CRF), such as Chen et al. (2014), as a post-processing layer in segmentation. This allows for a way to incorporate structural information among neighboring pixels in segmentation. However, we wanted to extend this idea further in text images, where the label propagation between neighbors should be encouraged but also restricted along the edges, as shown in Fig. 2(b). Use of a PDUpdate scheme in tasks that involve a total-variation formulation of the energy function has already been explored in depth super-resolution, Riegler et al. (2016), and in the multi-class labeling problem, Ochs et al. (2015). We extend these ideas into a more stable architecture that permits end-to-end training within the intended theoretical framework of the underlying proximal operator, eliminating the exploding gradient problem due to its recurrent structure.

– Loss function: A typical document image has a lot of background pixels as opposed to written text, which naturally leads to class imbalance between the two classes in the training data. The loss function used in the network is a weighted Spatial Cross Entropy loss, Badrinarayanan et al. (2015), often used to counter any class imbalance in the training samples. However, due to the redistribution of pixel labels resulting from the total-variation regularization that is part of the PDUpdate, these weights need to be re-adjusted. We propose an empirical approach to achieve this intended outcome.

Table 1: ENet architecture for an example input of 512×512. C in the fullconv layer is the number of classes; BN is the bottleneck layer, indexed 1-5.

Name       Type            Output size
initial                    16 × 256 × 256

Encoder
BN1.0      downsampling    64 × 128 × 128
4×BN1.x                    64 × 128 × 128
BN2.0      downsampling    128 × 64 × 64
BN2.1                      128 × 64 × 64
BN2.2      dilated 2       128 × 64 × 64
BN2.3      asymmetric 5    128 × 64 × 64
BN2.4      dilated 4       128 × 64 × 64
BN2.5                      128 × 64 × 64
BN2.6      dilated 8       128 × 64 × 64
BN2.7      asymmetric 5    128 × 64 × 64
BN2.8      dilated 16      128 × 64 × 64
Repeat section 2, without BN2.0

Decoder
BN4.0      upsampling      64 × 128 × 128
BN4.1                      64 × 128 × 128
BN4.2                      64 × 128 × 128
BN5.0      upsampling      16 × 256 × 256
BN5.1                      16 × 256 × 256
fullconv                   C × 512 × 512

3.0.1. ENet architecture

The ENet architecture is inspired by scene parsing CNNs based on probabilistic auto-encoders, Ngiam et al. (2011), where two separate neural networks are combined as an encoder-decoder pair. The encoder is trained to classify an input through downsampling and the decoder is used to upsample the encoder output. The ENet architecture also makes use of the bottleneck module that was introduced in ResNets (He et al., 2015a). A bottleneck structure consists of a main branch that is separated from an extension consisting of convolution filters. These two branches are later merged using elementwise addition, as shown in Fig. 3(b). The convolution layer conv is either a regular, dilated or full convolution, and if the bottleneck is downsampling, then a max pooling layer is added to the main branch.
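The original implementation is in Torch7; purely for illustration, a hedged PyTorch-style sketch of a non-downsampling bottleneck of this kind is given below. The internal channel count, dilation and dropout rate are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of an ENet-style bottleneck (cf. Fig. 3(b)): an identity main branch plus
    an extension branch of 1x1 projection -> conv -> 1x1 expansion -> regularizer,
    merged by elementwise addition and followed by a PReLU."""
    def __init__(self, channels, dilation=1, dropout=0.1):
        super().__init__()
        internal = channels // 4                       # illustrative reduction factor
        self.ext = nn.Sequential(
            nn.Conv2d(channels, internal, kernel_size=1, bias=False),   # projection, no bias
            nn.BatchNorm2d(internal), nn.PReLU(),
            nn.Conv2d(internal, internal, kernel_size=3, padding=dilation,
                      dilation=dilation, bias=False),                   # regular or dilated conv
            nn.BatchNorm2d(internal), nn.PReLU(),
            nn.Conv2d(internal, channels, kernel_size=1, bias=False),   # expansion
            nn.BatchNorm2d(channels),
            nn.Dropout2d(dropout),                                      # regularizer
        )
        self.out_act = nn.PReLU()

    def forward(self, x):
        return self.out_act(x + self.ext(x))           # merge main branch and extension

# usage: y = Bottleneck(128, dilation=2)(torch.randn(1, 128, 64, 64))
```

A downsampling variant would additionally place a max pooling layer on the main branch, as described above.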

We conclude the section by discussing a few key aspects of the ENet architecture as shown in Table 1. The encoder consists of the bottleneck sections 1-3, and sections 4 and 5 are part of the decoder (indicated by BNs.l; s for section and l for layer within the section). ENet has some important architectural details that improve the speed of training while keeping the number of parameters quite low. The projection layers do not include bias terms, to reduce the number of kernel calls and overall memory operations without affecting the accuracy. The network architecture heavily reduces the input size in the first two blocks, allowing small feature maps to be retained for further processing. This is because visual information can be highly compressed due to its inherent spatial redundancy. ENet opts for a large encoder with a smaller decoder, as opposed to a more symmetric design. This is motivated by the fact that the encoder must be able to operate on smaller-resolution data, reducing the role of the decoder to that of simple upsampling.

Fig. 3: (a) ENet initial block. Convolution with kernel size 3 × 3, stride 2; MaxPooling is performed with non-overlapping 2 × 2 windows, there are 13 convolution filters, which sums up to 16 feature maps after concatenation. (b) ENet bottleneck module. conv is either a regular, dilated, or transposed convolution (also known as deconvolution) with 3×3 filters, or a 5×5 convolution decomposed into two asymmetric ones.

ENet further exploits the redundancy in convolution weights by strategic replacement of n × n convolutions with two n × 1 and 1 × n convolution filters, as discussed in Jin et al. (2014) and Szegedy et al. (2015). Strong downsampling of the feature space needs to be compensated with equally adept upsampling. Convolution layers were alternated with dilated convolutions to obtain a wide receptive field, while at the same time avoiding overly downsampling the input. Parametric Rectified Linear Units (PReLU) (He et al., 2015b) were used to learn the negative slope of the non-linearities through an additional learnable parameter. Most of these aspects on receptive fields, non-linear activation functions and the related limitations in text segmentation were raised in our previous work (Ayyalasomayajula and Brun, 2017), making the ENet architecture worth exploring for text binarization.
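As a small worked example of the asymmetric factorization mentioned above (a sketch; the channel count is an arbitrary choice), replacing one 5×5 convolution by a 5×1 followed by a 1×5 convolution keeps the 5×5 receptive field while needing 2·5·c·c instead of 25·c·c weights:

```python
import torch.nn as nn

n, c = 5, 64
full = nn.Conv2d(c, c, kernel_size=n, padding=n // 2, bias=False)
asym = nn.Sequential(                       # 5x5 decomposed into 5x1 followed by 1x5
    nn.Conv2d(c, c, kernel_size=(n, 1), padding=(n // 2, 0), bias=False),
    nn.Conv2d(c, c, kernel_size=(1, n), padding=(0, n // 2), bias=False),
)
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(full), params(asym))           # 25*c*c = 102400 vs 2*5*c*c = 40960 weights
```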

3.0.2. Primal-Dual Update scheme

The primal-dual update is built on three basic concepts:

– Total variation formulation for segmentation.

– Proximal operator approach to decouple the underlying total variation and gradient operations.

– Bregman function based proximal operator smoothing to make proximal calculations differentiable.

as discussed in the Appendix. The primal-dual formulation of the segmentation problem is given by

$$\min_{u=(u_l)_{l=1}^{k}}\ \max_{p=(p_l)_{l=1}^{k}}\ \sum_{l=1}^{k}\Big(\langle \nabla u_l, p_l\rangle + \langle u_l, f_l\rangle\Big) + \delta_U(u) - \delta_P(p) \qquad (1)$$

with the primal and dual updates given by

$$\hat{u} = \Pi_U(\bar{u} - \tau\nabla^T p - \tau f), \qquad \hat{p} = \Pi_P(\bar{p} + \sigma\nabla u), \qquad (2)$$

where $u = (u_l)_{l=1}^{k}$, $p = (p_l)_{l=1}^{k}$, $f = (f_l)_{l=1}^{k}$ are the primal, dual and cost vectors for classes $1, \cdots, k$, respectively. $\delta_U(u)$, $\delta_P(p)$ are the indicator functions for the primal and dual variables $u$, $p$ corresponding to the constraint sets $U$, $P$, respectively, defined in Eqs. 10 and 15. The orthogonal projections onto $U$, $P$ are given by $\Pi_U$, $\Pi_P$, respectively. One approach to obtain a closed-form representation for the projections in Eq. 2 that satisfies the constraints implicitly is to use Bregman proximity functions. The updates for $p$, $u$ are given in Eqs. 21 and 24, respectively. For the segmentation result to converge, the primal and dual updates need to be iterated. PDNet has the primal-dual updates unrolled five times. A PDUpdate with three such unrolled iterations is shown in Fig. 4. The $\bar{p}^k_i$, $\bar{u}^k_i$, and $u_{sum}$ are initialized to $0$, $\frac{1}{k}$ and $\frac{1}{k}$, respectively. The overall output of the network can be interpreted as a perturbation of the unary cost using primal-dual updates to give a more controlled segmentation. The final segmentation is obtained by applying a weighted cross entropy loss on the final $u_{sum}$.
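For concreteness, the following NumPy sketch shows one way to realize the unrolled updates using the closed-form Bregman steps of Eqs. 21 and 24, with the signs as in Fig. 4 and a forward-difference gradient. The fixed values of τ, σ and α, the boundary handling and the small numerical guards are assumptions made here; in PDNet these parameters are learned by back-propagation and the block runs inside the network, taking f from the unary network's pre-softmax output.

```python
import numpy as np

def grad(u):
    """Forward-difference gradient of a 2-D field (h = 1, zero difference at the border)."""
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return np.stack([gx, gy])

def grad_T(p):
    """Adjoint of grad applied to a dual field p = (p1, p2) (the nabla-transpose in Eq. 2)."""
    gx, gy = p
    dx = np.zeros_like(gx); dy = np.zeros_like(gy)
    dx[0] = gx[0]; dx[1:-1] = gx[1:-1] - gx[:-2]; dx[-1] = -gx[-2]
    dy[:, 0] = gy[:, 0]; dy[:, 1:-1] = gy[:, 1:-1] - gy[:, :-2]; dy[:, -1] = -gy[:, -2]
    return -(dx + dy)

def pd_unrolled(f, n_unroll=5, tau=0.05, sigma=0.05, alpha=0.2):
    """Unrolled PDUpdate on unary costs f of shape (k, H, W); returns a label map (H, W)."""
    f = np.asarray(f, dtype=np.float64)
    k = f.shape[0]
    u = np.full_like(f, 1.0 / k)                   # primal, initialised to 1/k
    p = np.zeros((k, 2) + f.shape[1:])             # dual, initialised to 0
    usum = np.full_like(f, 1.0 / k)                # accumulated output, initialised to 1/k
    for _ in range(n_unroll):
        # primal step (Eq. 24): multiplicative exponentiated update, renormalised over classes
        c = np.stack([grad_T(p[l]) for l in range(k)]) + f
        u = u * np.exp(-2.0 * tau * c)
        u = u / np.maximum(u.sum(axis=0, keepdims=True), 1e-8)
        # dual step (Eq. 21): keeps every component of p inside (-1, 1)
        for l in range(k):
            e = np.exp(-2.0 * sigma * grad(u[l]))
            q = np.clip(p[l], -1.0 + 1e-7, 1.0 - 1e-7)   # numerical guard, cf. Sec. 3.1.2
            r = (1.0 - q) / (1.0 + q)
            p[l] = (e - r) / (e + r)
        usum = usum + alpha * u                    # accumulate the perturbed unary estimate
    return usum.argmax(axis=0)
```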

3.0.3. Loss function

A cross entropy criterion (CEC) combines a logistic softmax over the classes with a classwise negative log likelihood criterion to obtain the final classification. A common problem in classification is imbalance in the samples over classes. This problem can be countered by adjusting the weights associated with each class, often estimated from the class histograms. However, due to the smoothing introduced by the PDUpdate, the weights for the CEC loss estimated from the histogram alone overcompensate for the imbalance. We propose a power law over the histogram-based weight calculation. The weighting used is the inverse square root of the class histograms. This power law was determined by computing the training loss, with the PDUpdate as part of the network, over various exponents, as shown in Fig. 5(a).
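A minimal sketch of this weighting, assuming the training labels are available as an integer array and that a standard PyTorch cross-entropy criterion stands in for the Spatial Cross Entropy loss of the Torch7 implementation:

```python
import numpy as np
import torch
import torch.nn as nn

def power_law_class_weights(labels, num_classes=2, exponent=-0.5):
    """Class weights as a power law of the class histogram; exponent -0.5 gives the
    inverse square root weighting used together with the PDUpdate (Fig. 5(a))."""
    hist = np.bincount(np.asarray(labels).reshape(-1), minlength=num_classes)
    freq = np.maximum(hist / hist.sum(), 1e-12)    # guard against empty classes
    w = freq.astype(np.float64) ** exponent
    return torch.tensor(w / w.sum(), dtype=torch.float32)

# usage: per-pixel weighted cross entropy on class scores of shape (N, C, H, W)
# criterion = nn.CrossEntropyLoss(weight=power_law_class_weights(train_labels))
```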

The source code for the implementation of the PDNet used for binarization is made publicly available at https://github.com/krayyalasomayajula/pdNet.git

Each unrolled block in Fig. 4 computes, for pixel $i$ and gradient component $j$,

$$\hat{u}^k_i = \frac{\exp\!\big(-2\tau(\nabla^T p^k)_i - 2\tau f^k_i\big)\, u^k_i}{\sum_{l=1}^{k}\exp\!\big(-2\tau(\nabla^T p^l)_i - 2\tau f^l_i\big)\, u^l_i}, \qquad
\hat{p}^k_j = \frac{\exp\!\big(-2\sigma(\nabla u^k)_j\big) - \frac{1-\bar{p}^k_j}{1+\bar{p}^k_j}}{\exp\!\big(-2\sigma(\nabla u^k)_j\big) + \frac{1-\bar{p}^k_j}{1+\bar{p}^k_j}}, \qquad
\hat{u}_{sum} = u_{sum} + \alpha\,\hat{u}^k_i,$$

starting from $u^k_i = \frac{1}{k}$, $p^k_i = 0$, $u_{sum} = \frac{1}{k}$; the final unrolled block evaluates only the $\hat{u}^k_i$ and $\hat{u}_{sum}$ updates.

Fig. 4: PDUpdate scheme unrolled thrice. $\bar{p}^k_i$ is initialized to zeros; $\bar{u}^k_i$ and $u_{sum}$ are initialized to $\frac{1}{k}$. The parameters τ, σ, and α are learned in the network through gradient descent. The arrows indicate the dependencies of each block.

3.1. Architectural modifications

In the subsections below, we summarize the architectural changes made to the basic network blocks.

3.1.1. ENet architecture initial block changes

When carrying out experiments on the DIBCO dataset we experimented with both color and gray-scale images, as well as with using both color and gray channels together. Although state-of-the-art results were obtained using gray-scale images, we would like to highlight some extreme cases, as shown in Fig. 6(a). When using color and gray channels, the initial block was modified to include the RGB and two gray channels, with the max pooling applied to the RGB channels alone; the results for this case were similar to using the RGB channels alone. When training the network on RGB or gray-scale alone, the original ENet architecture was used without any changes.

3.1.2. Clamped primal-dual updates

The typical primal-dual updates, when properly initialized, are usually stable. However, when training a PDUpdate scheme on image data in deep networks, the exploding gradient problem is commonly encountered. Implementations by Riegler et al. (2016) and Ochs et al. (2015) have dealt with this issue by gradient clipping to specific bounds during back-propagation. We resorted to another approach of clamping the values in the primal and dual updates, as shown in Fig. 5(b) and sketched after the list below. This approach has two advantages:

– The clamping as shown in Fig. 5(b) resets the pixels where instability was encountered to their initial values, $0$ and $\frac{1}{k}$ for $\bar{p}^k_i$ and $\bar{u}^k_i$, respectively. The cost estimates for these pixels can be refined in further iterations; thus the resulting scheme is more faithful to the theoretical primal-dual approach.

– The gradients in this approach are not clipped during back-propagation, leading to faster training of the network and making the loss converge within 10 epochs, as opposed to 30 epochs in a network without clamping.
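A minimal PyTorch-style sketch of this reset (one reading of Sec. 3.1.2 and Fig. 5(b); the bound, the floor and the exact reset criterion are assumptions here):

```python
import torch

def reset_unstable(u, p, k, bound=1e30, eps=1e-8):
    """Reset entries of the primal u and dual p where an update became numerically
    unstable (non-finite or outside +/- bound) back to their initial values
    (1/k for u, 0 for p), instead of clipping gradients during back-propagation."""
    bad_u = ~torch.isfinite(u) | (u.abs() > bound)
    bad_p = ~torch.isfinite(p) | (p.abs() > bound)
    u = torch.where(bad_u, torch.full_like(u, 1.0 / k), u)
    p = torch.where(bad_p, torch.zeros_like(p), p)
    # very small magnitudes are floored at eps so later divisions stay finite (Fig. 5(b))
    u = torch.where(u.abs() < eps, torch.full_like(u, eps), u)
    return u, p
```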

3.1.3. Choice of labeling

The convention often followed in semantic segmentation is to allow an unknown class, to neglect objects such as background scenery or to include classes that need to be ignored. Generally, the presence of such classes does not change the result, but it does increase the training time for PDNet. The unknown class is usually labeled 1 and the other classes are then labeled incrementally. We begin by relabeling the FG, BG and unknown classes to 1, 2 and 3, respectively. Once the unary network is trained, the cost vector can be truncated to include just the FG and BG costs. Training PDNet on these costs leads to faster convergence, as the size of $\bar{p}^k_i$, $\bar{u}^k_i$ and the further vectors used in the computations is reduced by 30%.

3.2. Tuning hyperparameters

The binarization network has two hyperparameters to tune: the exponent to be used in the power law for the class weights in the CEC loss function, and the weight to be associated with the edges along the class boundaries. As mentioned previously, the exponent in the power law for the weights was determined through an exhaustive search carried out on the training error for various exponents. The weight associated with the edges is a shared parameter over all the unrolled loops and is multiplied with the τ weight of each loop unroll. Training for both the edge weight and τ can cause instability. We instead initialized the edge weight to 1.0 and trained on a small training set to determine convergence. This value was later used in the network. Although the edge weight determined by this approach is suboptimal, it is compensated for by learning the τ values through back-propagation in the network over the complete training sets. This method of training for the edge weight produced a better validation loss; though this mode of pre-training is not critical for the network performance, it can save time required for the overall training of the network.

4. Experimental results

4.0.1. DIBCO dataset

The experiments were conducted on the DIBCO datasets (Pratikakis et al., 2011a) from the binarization competitions of


Fig. 5: (a) Tuning of the exponent to be determined in the power law. The training error is plotted against exponent values from [−1, −0.1] ∪ [−0.1, −0.01] in steps of −0.1 and −0.01 for the first and second intervals, respectively. The ellipse indicates the range of suitable exponents that produce identical segmentation results. The exponent used in PDNet is −0.5. (b) The network learns permissible values for the primal and dual variables u_i, p_i in the PDUpdate by setting a large maximum value of λ_max = 10^30 in the positive and negative directions. At the same time, values less than |ε| are clamped to 10^−8.

the years 2009-2016, consisting of 76 images in total. The images and ground-truth images were augmented by applying an identical deformation field transformation to both. These augmented images were then converted into cropped images of 128 × 256 pixels with 25% overlap horizontally and vertically, to create more data for training. The cropped size of 128 × 256 was selected based on the encoder requirement for the width and height to be equal to 2^n for some n, and to fit the training batches into memory in the Torch7 framework (Collobert et al., 2011). The training set was picked by including all the augmented images, except those from the competition year under evaluation. The validation set was made by randomly picking 3000 original cropped images from the DIBCO datasets, excluding the year under evaluation. The trained network was then used to produce binary output on the dataset of the DIBCO competition year. The images were then stitched to the original size to compute the DIBCO evaluation metrics. The three main evaluation metrics for comparison are F-Measure, Peak Signal to Noise Ratio (PSNR) and the Distance Reciprocal Distortion metric (DRD), definitions of which can be found in Pratikakis et al. (2011a) and Ntirogiannis et al. (2014).
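A sketch of the patch extraction described above, assuming a 128 (rows) × 256 (columns) crop size and 25% overlap; how the image border is treated is not specified in the text, so the simple truncation below is an assumption:

```python
import numpy as np

def extract_patches(img, gt, ph=128, pw=256, overlap=0.25):
    """Crop an image / ground-truth pair into fixed-size patches with the stated overlap."""
    sh, sw = int(ph * (1 - overlap)), int(pw * (1 - overlap))   # strides: 96 and 192 pixels
    H, W = img.shape[:2]
    patches = []
    for y in range(0, max(H - ph, 0) + 1, sh):
        for x in range(0, max(W - pw, 0) + 1, sw):
            patches.append((img[y:y + ph, x:x + pw], gt[y:y + ph, x:x + pw]))
    return patches
```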

4.0.2. Synthetic Dataset

To have better weight initialization and to prevent overfitting the network to the data, the network was pre-trained on synthetic data. Documents resembling historical handwritten and printed material were generated synthetically. Various filters were applied to resemble background textures and degradations in the parchment. The text was generated using handwriting and machine-printed fonts from Google (Fonts, 2011). Fig. 6(b) shows a few cropped images from the synthetic dataset. The results from binarization on the DIBCO dataset using the network pre-trained on the synthetic dataset are presented in Table 2.

4.0.3. Training

The network was trained in three stages:

– Pre-training the unary network on the synthetic dataset.

– Training the unary network on the DIBCO dataset.

– Combined training of the unary and PDUpdate scheme on the DIBCO dataset.

The encoder of the unary network was pre-trained on patches of size 128 × 256 with batch size 30 on the synthetic dataset. The pre-trained encoder was then used to initialize the decoder weights, and the decoder was then pre-trained on the synthetic dataset using the same patch size and batch size. This gives the pre-trained unary model. This model was then used to initialize the weights of the unary model that is to be trained on the augmented DIBCO dataset. The weights were trained using ADAM (Kingma and Ba, 2014), with the learning rate, weight decay, and momentum set to 5 × 10^-4, 2 × 10^-4 and 0.9, respectively. The unary models were trained for 20 epochs and the model with the least loss was picked for the further steps. When training the unary and PDUpdate jointly, the learning rate was set to 5 × 10^-4 for the first 10 epochs and decayed to 2 × 10^-4 and 1 × 10^-4 for epochs 11-15 and 16-20, respectively. The results on the DIBCO dataset with the proposed binarization network are summarized in Table 2.
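A sketch of the optimizer settings and learning-rate schedule above, written with PyTorch for illustration (the paper's implementation uses Torch7; the placeholder model and the training-loop skeleton are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 2, kernel_size=3, padding=1)   # placeholder for the PDNet weights

# ADAM settings from Sec. 4.0.3: lr 5e-4, weight decay 2e-4, momentum (beta1) 0.9
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), weight_decay=2e-4)

def lr_for_epoch(epoch):
    """Step decay used when training the unary network and PDUpdate jointly."""
    return 5e-4 if epoch <= 10 else 2e-4 if epoch <= 15 else 1e-4

for epoch in range(1, 21):
    for group in optimizer.param_groups:
        group["lr"] = lr_for_epoch(epoch)
    # ... iterate over augmented DIBCO patches, compute the weighted CEC loss, step ...
```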

PD_g and PD_c are the outputs from the primal-dual networks trained on gray-scale and color images, respectively. GC is the result obtained from a graph-cut based approach, as discussed in our previous work, Ayyalasomayajula and Brun (2017). It takes the output from the unary model as seed points, along with the output from the classification layer acting as costs for the pixels and Canny edges as boundary estimates for segmentation. This approach requires three parameters: a) the cost associated with pixel labeling as obtained from the unary network, b) the weight associated with edges, and c) the Canny threshold to estimate the boundaries. It then employs an external graph-cut method (Boykov and Kolmogorov, 2004) to obtain the final segmentation. In

Fig. 6: (a) (1) Document with low contrast between FG and BG; (2) binary output from the network trained on gray-scale images alone; (3) binary output using color input; (4) document with a strong color BG; (5) binary output from the network trained on gray-scale images alone; (6) binary output using color input. (b) A few samples from the synthetic text data.

Table 2: Comparison of the results for F-Measure, PSNR, DRD for various methods.

Year   FMeasure (↑)                         PSNR (↑)                             DRD (↓)
       PDg    PDc    GC     TM     DBC      PDg    PDc    GC     TM     DBC      PDg   PDc    GC    TM    DBC
2009   91.50  90.46  89.24  89.76  91.24    19.25  17.82  17.28  18.43  18.66    3.06  3.45   4.05  4.89  -
2010   92.91  90.45  89.84  94.89  91.50    20.40  19.62  18.73  21.84  19.78    1.85  3.37   3.29  1.26  -
2011   91.87  85.68  88.36  93.60  88.74    19.07  17.43  17.22  20.11  17.97    2.57  15.45  4.27  1.85  5.36
2012   93.04  89.64  91.97  92.53  92.85    20.50  19.63  19.80  20.60  21.80    2.92  4.78   2.81  2.48  2.66
2013   93.97  93.20  90.59  93.17  92.70    21.30  20.75  19.05  20.71  21.29    1.83  2.23   3.18  2.21  3.10
2014   89.99  93.79  92.40  91.96  96.88    20.52  20.79  18.68  20.76  22.66    7.42  2.30   2.72  2.72  0.90
2016   90.18  89.89  88.79  89.52  88.72    18.99  18.88  18.05  18.67  18.45    3.61  3.68   4.33  3.76  3.86

contrast to this approach, the current method does all these parameter estimations and the energy minimization in an end-to-end manner in a single framework. TM are the results from another CNN-based approach developed by Tensmeyer and Martinez (2017), which augments the segmentation result with a relative darkness feature, Wu et al. (2015), to aid in binarization. DBC are the results from the winning entries in the DIBCO competitions using various classical approaches, as given in Ntirogiannis et al. (2014) and Pratikakis et al. (2016).

5. Conclusion

In the current work we have extended our previous approach of using graph-cuts as post-processing to an FCNN to obtain binarized document images. In this paper, we propose an architecture that combines the use of an energy minimization function and FCNN-based feature learning. The energy minimization framework is flexible in imposing constraints on the desired segmentation. This primal-dual network imposes a total variation based energy minimization, with an unrolled primal-dual update scheme, in a gradient descent trainable CNN architecture. The unary CNN, on the other hand, learns pixel labeling costs, and the combined network learns all the associated parameters within a single end-to-end trainable framework. The numerical instability caused by the recurrent nature of the primal-dual steps was solved by using a clamping function on the primal and dual updates within the PDUpdate block. Propagation of gradients from the PDUpdate was further facilitated by employing a weighted cross entropy loss adjusted by a power law. The final primal-dual architecture improves the binarization results compared to the FCNN output alone, even when the latter is combined with post-processing such as graph-cut or feature augmentation. The binarization achieves state-of-the-art results on four out of seven DIBCO datasets, with results comparable to the best on the rest. Further investigation of dropout layers in the unary model and architectural variations in the PDUpdate are planned for future work. The results from applying a trained network to other historical document images and cases of transfer learning are of practical interest. Using more sophisticated unary and primal-dual schemes could also yield an even better result on document binarization.

Acknowledgments

This project is a part of q2b, From quill to bytes, an initiative sponsored by the Swedish Research Council (Vetenskapsrådet, D.Nr 2012-5743), Riksbankens Jubileumsfond (R.Nr NHS14-2068:1) and Uppsala University. The authors would like to thank Tomas Wilkinson of the Dept. of Information Technology, Uppsala University, for discussions on debugging and performance optimization in the Torch framework.

References

Ayyalasomayajula, K., Brun, A., 2014. Document binarization using topological clustering guided laplacian energy segmentation, in: Int. Conf. on Frontiers in Handwriting Recognition, pp. 523–528.

Ayyalasomayajula, K., Brun, A., 2017. Historical document binarization combining semantic labeling and graph cuts, in: LNCS, Scandinavian Conference on Image Analysis, Vol(1), pp. 386–396.

Badrinarayanan, V., Kendall, A., Cipolla, R., 2015. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR abs/1511.00561.

Bar, I., Beckman, I., Kedem, K., Dinstein, I., 2007. Binarization, character extraction, and writer identification of historical hebrew calligraphy documents. Int. Jou. on Document Analysis and Recognition 9, 89–99.

Blake, A., Kohli, P., Rother, C., 2011. Markov Random Fields for Vision and Image Processing. The MIT Press, chapter 1.

Boykov, Y., Kolmogorov, V., 2004. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1124–1137.

Chambolle, A., Pock, T., 2011. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40, 120–145.

Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic image segmentation with deep convolutional nets and fully connected crfs. CoRR abs/1412.7062.

Collobert, R., Kavukcuoglu, K., Farabet, C., 2011. Torch7: A matlab-like environment for machine learning, in: BigLearn, NIPS Workshop.

Fonts, 2011. https://github.com/google/fonts.

Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Rodríguez, J.G., 2017. A review on deep learning techniques applied to semantic segmentation. CoRR abs/1704.06857.

Gatos, B., Ntirogiannis, K., Pratikakis, I., 2011. DIBCO 2009: Document image binarization contest. Int. J. Doc. Anal. Recognit. 14, 35–44. doi:10.1007/s10032-010-0115-7.

Hariharan, B., Arbeláez, P.A., Girshick, R.B., Malik, J., 2014. Hypercolumns for object segmentation and fine-grained localization. CoRR abs/1411.5752.

He, K., Zhang, X., Ren, S., Sun, J., 2015a. Deep residual learning for image recognition. CoRR abs/1512.03385.

He, K., Zhang, X., Ren, S., Sun, J., 2015b. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR abs/1502.01852.

Howe, N., 2011. A laplacian energy for document binarization. Int. Conf. on Document Analysis and Recognition, 6–10.

Jin, J., Dundar, A., Culurciello, E., 2014. Flattened convolutional neural networks for feedforward acceleration. CoRR abs/1412.5474.

Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.

Lu, S., Su, B., Tan, C., 2010. Document image binarization using background estimation and stroke edges. IJDAR 13(4), 303–314.

Mishra, A., Alahari, K., Jawahar, C., 2011. An MRF model for binarization of natural scene text. Int. Conf. on Document Analysis and Recognition.

Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y., 2011. Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), ACM, pp. 689–696.

Niblack, W., 1986. An introduction to digital image processing, in: Introduction to the Electronic Age. Prentice-Hall, Englewood Cliffs, New York, NY, pp. 115–116.

Ntirogiannis, K., Gatos, B., Pratikakis, I., 2014. ICFHR2014 competition on handwritten document image binarization (H-DIBCO 2014), pp. 809–813.

Ochs, P., Ranftl, R., Brox, T., Pock, T., 2015. Bilevel optimization with nonsmooth lower level problems. Scale Space and Variational Methods in Computer Vision (SSVM), Lecture Notes in Computer Science, vol 9087.

Otsu, N., 1979. A threshold selection method from gray level histograms. IEEE Trans. Systems, Man and Cybernetics 9, 62–66.

Paszke, A., Chaurasia, A., Kim, S., Culurciello, E., 2016. ENet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147.

Pratikakis, I., Gatos, B., Ntirogiannis, K., 2010. H-DIBCO 2010 - handwritten document image binarization competition, in: 2010 12th International Conference on Frontiers in Handwriting Recognition, pp. 727–732. doi:10.1109/ICFHR.2010.118.

Pratikakis, I., Gatos, B., Ntirogiannis, K., 2011a. ICDAR 2011 document image binarization contest. International Conference on Document Analysis and Recognition, 1506–1510.

Pratikakis, I., Gatos, B., Ntirogiannis, K., 2011b. ICDAR 2011 document image binarization contest (DIBCO 2011), in: 2011 International Conference on Document Analysis and Recognition, pp. 1506–1510. doi:10.1109/ICDAR.2011.299.

Pratikakis, I., Gatos, B., Ntirogiannis, K., 2012. ICFHR 2012 competition on handwritten document image binarization (H-DIBCO 2012), in: 2012 International Conference on Frontiers in Handwriting Recognition, pp. 817–822. doi:10.1109/ICFHR.2012.216.

Pratikakis, I., Gatos, B., Ntirogiannis, K., 2013. ICDAR 2013 document image binarization contest (DIBCO 2013), in: 2013 12th International Conference on Document Analysis and Recognition, pp. 1471–1476. doi:10.1109/ICDAR.2013.219.

Pratikakis, I., Zagoris, K., Barlas, G., Gatos, B., 2016. ICFHR2016 handwritten document image binarization contest (H-DIBCO 2016), pp. 619–623.

Riegler, G., Ferstl, D., Rüther, M., Bischof, H., 2016. A deep primal-dual network for guided depth super-resolution. CoRR abs/1607.08569.

Sauvola, J., Pietikäinen, M., 2000. Adaptive document image binarization. Pattern Recognition 33, 225–236.

Shelhamer, E., Long, J., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2015. Rethinking the inception architecture for computer vision. CoRR abs/1512.00567.

Tensmeyer, C., Martinez, T., 2017. Document image binarization with fully convolutional neural networks, in: ICDAR, IEEE.

Wu, Y., Rawls, S., Abd-Almageed, W., Natarajan, P., 2015. Learning document image binarization from data. CoRR abs/1505.00529.

Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.S., 2015. Conditional random fields as recurrent neural networks, in: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), IEEE Computer Society, Washington, DC, USA, pp. 1529–1537. doi:10.1109/ICCV.2015.179.

Zhou, Z.H., Wu, J., Tang, W., 2002. Ensembling neural networks: Many could be better than all. Artificial Intelligence 137, 239–263.

Appendix

Total variation formulation: Formulating image segmentation as a saddle point problem and applying the proximal operator to the primal-dual variables is a well studied problem in fixed point analysis of convex functions (Chambolle and Pock, 2011). An overview of the theory is provided here to make the discussion comprehensive and to provide intuition for the PDUpdate discussed in the method. We consider segmentation of an image into k pairwise disjoint regions as a total variation on the segmentation image:

$$\min_{(R_l)_{l=1}^{k},\,(c_l)_{l=1}^{k}}\ \frac{1}{2}\sum_{l=1}^{k}\mathrm{Per}(R_l, \Omega) + \frac{\lambda}{2}\sum_{l=1}^{k}\int_{R_l}|g(x) - c_l|^{2}\,dx, \qquad (3)$$

where $\mathrm{Per}(R_l, \Omega)$ is the perimeter of the region $R_l$ in a domain $\Omega$, $g : \Omega \to \mathbb{R}$ is the input image, $c_l \in \mathbb{R}$ are the optimal mean values and the regions $(R_l)_{l=1}^{k}$ form a partition of $\Omega$, that is, $R_l \cap R_m = \emptyset$ for $l \neq m$ and $\cup_{l=1}^{k} R_l = \Omega$; $\lambda$ is a regularization weight. This is an optimization problem between the data fitting term $|g(x) - c_l|$ and the length term $\mathrm{Per}(R_l, \Omega)$, where the ideal mean values of the regions, $c_l = \frac{\int_{R_l} g(x)\,dx}{|R_l|}$, are unknown a priori as they depend on the partition we seek.

By introducing a labeling function $u = (u_l)_{l=1}^{k} : \Omega \to \mathbb{R}^{k}$ and the per-class cost $f_l = \frac{\lambda}{2}|g(x) - c_l|^{2}$, Eq. (3) can be generalized as

$$\min_{u=(u_l)_{l=1}^{k}} J(u) + \sum_{l=1}^{k}\int_{\Omega} u_l f_l\,dx, \quad \text{s.t.}\ \sum_{l=1}^{k} u_l(x) = 1,\ u_l \geq 0,\ \forall x \in \Omega, \qquad (4)$$

where $J(u)$ is the relaxation term.

Proximal operator: The basic structure in Eq. 4 is of the form

$$\min_{x \in X} F(Kx) + G(x), \qquad (5)$$

involving a linear map $K : X \to Y$ with the induced norm $\|K\| = \max\{\|Kx\| : x \in X \text{ with } \|x\| \leq 1\}$, where $X$, $Y$ are the primal and dual spaces, respectively. The corresponding dual formulation of this equation is a generic saddle-point problem

$$\min_{x \in X}\ \max_{y \in Y}\ \langle Kx, y\rangle + G(x) - F^{*}(y), \qquad (6)$$

where $\langle Kx, y\rangle$ is the inner product induced by the vector space $Y$ and $F^{*}$ is the convex conjugate of $F$.

The advantage of such an approach is discussed further down with the segmentation example; for now we can observe that this structure readily presents a computationally tractable algorithm. The dual variable y ∈ Y acts like a bounded slack variable introduced to ease the solution in the resulting dual space Y. Introducing the dual variable y relieves the initial composition of F(Kx), making computations involving ⟨Kx, y⟩ independent of F*(y). As per the structure of the segmentation problem, F is typically an indicator function capturing the constraints on x, which translates to F* being an indicator function capturing the constraints on its dual variable y. Since the primal space has ||x|| ≤ 1, if the dual variable is bounded, which is most often the case, then iterating repeatedly between the two variables should converge to a solution.

The solution takes a form involving the proximal operator or the gradient of the functions F, G, depending on whether they are convex, or convex as well as differentiable, respectively. The basic idea behind a proximal operator is a generalization of projection onto a vector space. This makes it an ideal operator that can be used in a gradient descent algorithm, where the iteration involves taking a suitable step towards the solution along the gradient direction. But since the function need not be differentiable, the gradient need not necessarily exist, and hence the question of uniqueness along the gradient direction does not arise. This results in a set of permissible vectors that, though not strictly a gradient, can act as one at a given point x; such a set of permissible vectors is called the subgradient. The set ∂F is the subgradient; it is also the set of underestimators of F at x. A closely related operator is the resolvent operator, with the property

$$x = (I + \tau\partial F)^{-1}(y) = \arg\min_{x}\left\{\frac{\|x - y\|^{2}}{2\tau} + F(x)\right\}. \qquad (7)$$

It can be interpreted as the closest point x in the set under consideration to the given point y, under an associated error F(x). The primal-dual formulation allows for an algorithm that iterates between the primal and dual variables x, y, respectively, in this case leading to convergence according to the

forward-backward algorithm

$$\begin{aligned}
y^{n+1} &= \mathrm{prox}_x(y^{n}) = (I + \sigma\partial F^{*})^{-1}(y^{n} + \sigma K\bar{x}^{n});\\
x^{n+1} &= \mathrm{prox}_y(x^{n}) = (I + \tau\partial G)^{-1}(x^{n} - \tau K^{*}y^{n+1});\\
\bar{x}^{n+1} &= x^{n+1} + \theta(x^{n+1} - x^{n}),
\end{aligned} \qquad (8)$$

where $\tau$, $\sigma$ are step lengths along the dual and primal subgradients and $\theta$ is the relaxation parameter in iterating the primal variable. Consider the image discretised over a Cartesian grid of size $M \times N$ as $\{(ih, jh) : 1 \leq i \leq M, 1 \leq j \leq N\}$, where $h$ is the grid spacing and $(i, j)$ are the indices in discrete notation. $X$ is a vector space in $\mathbb{R}^{MN}$ equipped with the standard inner product $\langle u, v\rangle$ for $u, v \in X$. The gradient is defined as $\nabla : X \to Y$, $\nabla u = \left(\frac{u_{i+1,j} - u_{i,j}}{h}, \frac{u_{i,j+1} - u_{i,j}}{h}\right)$, with $Y = X \times X$ equipped with the inner product defined as

$$\langle p, q\rangle_Y = \sum_{i,j} p^{1}_{i,j}q^{1}_{i,j} + p^{2}_{i,j}q^{2}_{i,j}, \qquad p = (p^{1}, p^{2}),\ q = (q^{1}, q^{2}) \in Y.$$
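A small numerical check (with h = 1) of the discretization just defined: the forward-difference gradient below and the divergence built from it satisfy ⟨∇u, p⟩_Y = −⟨u, div p⟩_X, the relation used later in the primal-dual formulation. The boundary handling is one standard convention and an assumption here.

```python
import numpy as np

def grad(u):
    """Forward differences, h = 1, with zero difference at the last row/column."""
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return np.stack([gx, gy])

def div(p):
    """Discrete divergence chosen so that <grad u, p> = -<u, div p>."""
    gx, gy = p
    dx = np.zeros_like(gx); dy = np.zeros_like(gy)
    dx[0] = gx[0]; dx[1:-1] = gx[1:-1] - gx[:-2]; dx[-1] = -gx[-2]
    dy[:, 0] = gy[:, 0]; dy[:, 1:-1] = gy[:, 1:-1] - gy[:, :-2]; dy[:, -1] = -gy[:, -2]
    return dx + dy

rng = np.random.default_rng(0)
u = rng.standard_normal((6, 9))
p = rng.standard_normal((2, 6, 9))
assert np.isclose(np.sum(grad(u) * p), -np.sum(u * div(p)))   # adjoint relation holds
```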

Applying the above framework to Eq. 4 with $J(u) = \frac{1}{2}\sum_{l=1}^{k}\int_{\Omega}|\nabla u_l|$, we have

$$\min_{u=(u_l)_{l=1}^{k}}\ \frac{1}{2}\sum_{l=1}^{k}\left(\int_{\Omega}|\nabla u_l| + \langle u_l, f_l\rangle\right) + \delta_U(u), \qquad (9)$$

where $G(u) = \delta_U(u)$ is the indicator function for the unit simplex,

$$U = \left\{u \in X^{k} : \sum_{l=1}^{k} u_l(x) = 1,\ u_l \geq 0\right\}. \qquad (10)$$

$f = (f_l)_{l=1}^{k} \in X^{k}$ is the discretized weighting function, or the cost per pixel, $u = (u_l)_{l=1}^{k}$ is the primal variable and $X^{k}$ is the extension of the vector space for $k$ classes. Considering

$$\frac{1}{2}\sum_{l=1}^{k}\left(\int_{\Omega}|\nabla u_l| + \langle u_l, f_l\rangle\right) = \max_{p=(p_l)_{l=1}^{k}}\ \sum_{l=1}^{k}\Big(\langle\nabla u_l, p_l\rangle + \langle u_l, f_l\rangle\Big) - \delta_P(p), \qquad (11)$$

where $p \in Y^{k}$ is the dual variable, with $Y^{k}$ the extension of the gradient vector space for $k$ classes, and $\delta_P(p)$ is the indicator function for $p \in P$, defined as

$$P = \left\{p \in Y^{k} : \|p_l\|_{\infty} \leq \tfrac{1}{2}\right\}. \qquad (12)$$

We have the primal-dual formulation as

$$\min_{u=(u_l)_{l=1}^{k}}\ \max_{p=(p_l)_{l=1}^{k}}\ \sum_{l=1}^{k}\Big(\langle\nabla u_l, p_l\rangle + \langle u_l, f_l\rangle\Big) + \delta_U(u) - \delta_P(p), \qquad (13)$$

with $u$, $p$ related as $\langle\nabla u, p\rangle_Y = -\langle u, \mathrm{div}\,p\rangle_X$. This result relates the scalar function $u$ and the vector field $p$. A corollary of the aforementioned result is the relation $-\mathrm{div} = \nabla^{*}$, where $\mathrm{div}$, $\nabla^{*}$ are the divergence in $Y$ and the conjugate of the gradient $\nabla$, respectively. Further, since $\nabla^{*} = -\nabla^{T}$, it turns out that $\mathrm{div} = \nabla^{T}$.
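As a worked special case (not spelled out in the text) that connects Eq. 7 to the projections Π_U and Π_P appearing in Eq. 2: taking F to be the indicator function δ_C of a convex set C, the resolvent reduces to the orthogonal projection onto C,

$$\big(I + \tau\,\partial\delta_C\big)^{-1}(y) \;=\; \arg\min_{x}\left\{\frac{\|x-y\|^{2}}{2\tau} + \delta_C(x)\right\} \;=\; \arg\min_{x\in C}\|x-y\|^{2} \;=\; \Pi_C(y).$$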

Bregman Proximity Functions: The sets considered so far are the unit simplex and the unit ball (or more strictly a ball of radius $\frac{1}{2}$), as defined by $U$, $P$ in Eqs. 10 and 12, respectively. The resolvents of these sets are the orthogonal projection onto the unit simplex and the point projection onto the unit ball, respectively. However, in the case of segmentation, when a more sophisticated relaxation that yields better delineation along edges, like paired calibrations, is used in Eq. 4, given by

$$J(u) = \int_{\Omega}\Psi(Du); \quad \text{s.t.}\ \Psi(a) = \sup_{b}\left\{\sum_{l=1}^{k}\langle a_l, b_m\rangle : |a_l - b_m| \leq 1,\ 1 \leq l \leq m \leq k\right\}, \qquad (14)$$

where $a = (a_1, \cdots, a_k)$, $b = (b_1, \cdots, b_k)$, the corresponding set for the dual variables is no longer a unit ball, but an intersection of unit balls given by

$$P = \left\{p \in Y^{k} : |p_l - p_m|_{\infty} \leq 1,\ 1 \leq l \leq m \leq k\right\}. \qquad (15)$$

The resolvent of this set is an orthogonal projection onto such an intersection of unit balls. As the relaxations get more sophisticated, the corresponding resolvent set becomes more complex and orthogonal projections onto it get computationally more involved. One approach towards getting a solution that can be used in a computational algorithm is to use Bregman functions. Suppose we have a convex function $\psi(x)$ that is continuously differentiable on the interior of its domain, $\mathrm{int}(X)$, and continuous on its closure, $\mathrm{cl}(X)$; we use $\bar{x}$ to denote a point from $\mathrm{int}(X)$. A Bregman proximity function $D_\psi : X \times \mathrm{int}(X) \to \mathbb{R}$ generated by $\psi$ is defined as

$$D_\psi(x, \bar{x}) = \psi(x) - \psi(\bar{x}) - \langle\nabla\psi(\bar{x}), x - \bar{x}\rangle. \qquad (16)$$

In iterative algorithms, the Bregman proximity function can be used with the proximity operator for a convex function $g : X \to \mathbb{R}$ as

$$\mathrm{prox}^{\psi}_{\alpha g}(\bar{x}) = \arg\min_{x \in X}\ \alpha g(x) + D_\psi(x, \bar{x}). \qquad (17)$$

In the image segmentation problem, the basic class of functions of interest is of the form $g(x) = \langle x, c\rangle + \delta_X(x)$, as seen in Eq. 13. The associated proximal operator is

$$\mathrm{prox}^{\psi}_{\alpha g}(\bar{x}) = \arg\min_{x \in X}\ \alpha\langle x, c\rangle + D_\psi(x, \bar{x}). \qquad (18)$$

The necessary and sufficient condition for optimality, which has a unique solution for Eq. 18, is

$$\nabla\psi(\bar{x}) - c = \nabla\psi(x). \qquad (19)$$

This constraint is implicitly taken care of by the Bregman proximity function. For further details on Bregman functions one may refer to Ochs et al. (2015). In image segmentation the dual variables belong to the intersection of unit balls as shown in Eq. 15, so each coordinate of the dual variable $p$ should satisfy $-1 \leq p_j \leq 1$ and solve the dual problem

$$\max_{p=(p_l)_{l=1}^{k}}\ \sum_{l=1}^{k}\langle\nabla u_l, p_l\rangle - \delta_P(p). \qquad (20)$$

A suitable Bregman proximity function that encodes the dual variable constraints, and the corresponding proximity solution along each coordinate $i$, are given by

$$\psi(x) = \frac{1}{2}\big[(1+x)\log(1+x) + (1-x)\log(1-x)\big]; \qquad \big(\mathrm{prox}^{\psi}_{\alpha g}(\bar{x})\big)_i = \frac{\exp(-2\alpha c_i) - \frac{1-\bar{x}_i}{1+\bar{x}_i}}{\exp(-2\alpha c_i) + \frac{1-\bar{x}_i}{1+\bar{x}_i}}, \qquad (21)$$

where $c_i = (\nabla u)_i$ is the $i$-th component of $\nabla u$. Similarly, the primal problem deals with

$$\min_{u=(u_l)_{l=1}^{k}}\ \langle u_l, f_l\rangle + \delta_U(u), \qquad (22)$$

with the primal variables $u$ restricted by $u_i \geq 0$ along each coordinate. The Bregman function encoding these constraints and the corresponding proximal solution are given by

$$\psi(x) = x\log x, \qquad \big(\mathrm{prox}^{\psi}_{\alpha g}(\bar{x})\big)_i = \bar{x}_i\exp(-2\alpha c_i). \qquad (23)$$

To satisfy $\sum_{l=1}^{k} u_l(x) = 1$ in Eq. 10 it is sufficient to normalize it as

$$\big(\mathrm{prox}^{\psi}_{\alpha g}(\bar{x})\big)_i = \frac{\bar{x}_i\exp(-2\alpha c_i)}{\sum_{j=1}^{K}\bar{x}_j\exp(-2\alpha c_j)}, \qquad (24)$$

where $c_i = (\nabla^{T}p)_i - f_i$, with $f_i$ denoting the cost associated with the $i$-th class. Eqs. 21 and 24 can now be used in Eq. 8 to converge to a solution for image segmentation.
