Thesis no: MGCS-2014-06
Clustering of Image Search Results to Support Historical Document
Recognition
Javier Espinosa
Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.
The thesis is equivalent to 10 weeks of full-time studies.
Contact Information:
Author:
Javier Espinosa
E-mail: jaes13@student.bth.se
University advisor:
Dr. Niklas Lavesson
Dept. Computer Science & Engineering
Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57
Abstract
Context. Image searching in historical handwritten documents is a challenging problem in computer vision and pattern recognition. The amount of documents which have been digitized is increasing each day, and the task of finding occurrences of a selected sub-image in a collection of documents is of special interest to historians and genealogists.
Objectives. This thesis develops a technique for image searching in historical documents, divided into three phases. First, the document is segmented into sub-images according to the words on it. These sub-images are defined by a feature vector with measurable attributes of their content. Based on these vectors, a clustering algorithm computes the distance between vectors to decide which images match the one selected by the user.
Methods. The research methodology is experimentation. A quasi-experiment is designed based on repeated measures over a single group of data. The image processing, feature selection, and clustering approach are the independent variables, whereas the accuracy measurements are the dependent variable. This design provides a measurement net based on a set of outcomes related to each other.
Results. The statistical analysis is based on the F1 score to measure the accuracy of the experimental results. This test analyses the accuracy of the experiment with regard to the true positives, false positives, and false negatives detected. The average F-measure for the experiment conducted is F1 = 0.59, whereas the actual performance of the method is a matching ratio of 66.4%.
Conclusions. This thesis provides a starting point for the development of a search engine for historical document collections based on pattern recognition. The main research findings focus on image enhancement and segmentation for degraded documents, and on image matching based on feature definition and cluster analysis.
Keywords: Historical documents, Computer vision, Feature extraction, Clustering.
Acknowledgments
First and foremost, I thank my supervisor Dr. Niklas Lavesson. Without his support, encouragement, academic guidance, and patience this thesis would not have been possible. His professionalism is an example to follow.
Secondly, I thank Arkiv Digital AD AB for providing the data for the thesis; its image database has been the foundation of all the experiments conducted.
Thirdly, I thank my parents for giving me the opportunity and privi- lege of studying abroad, in addition to their unconditional trust and support.
Last but not least, I would like to thank the friends who have always supported me, even without understanding my work. One way or another, they have always helped me.
Contents

Abstract
Acknowledgments
1 Introduction
    1.1 Terminology
2 Background
3 Related Work
    3.1 Identification of Gap
4 Aim and Objectives
5 Approach
    5.1 Software library
    5.2 Image processing
        5.2.1 Binarization
        5.2.2 Image filtering
        5.2.3 Segmentation
    5.3 Feature extraction
    5.4 Clustering
    5.5 Data preparation
    5.6 Data representation
    5.7 Cluster borders
6 Research Methodology
    6.1 Experimental design
    6.2 Data collection
    6.3 Data analysis
    6.4 Validity threats
        6.4.1 Internal validity
        6.4.2 External validity
        6.4.3 Construct validity
        6.4.4 Statistical validity
7 Results
    7.1 Image processing
    7.2 Clustering
8 Analysis
    8.1 Image processing analysis
    8.2 Features analysis
    8.3 Clustering approach
9 Conclusions and Future Work
References
Chapter 1
Introduction
Computer-based historical document recognition and analysis has become an important research area due to the amount of documents from the past which have been digitized. In order to learn from and preserve our cultural heritage, national archives and companies such as Arkiv Digital AD AB [1] have scanned or photographed millions of historical documents. Historical documents are usually handwritten and many of them are damaged. Thus, it is more time consuming to extract knowledge from such documents than from high-quality, typed documents.
For genealogists and historians, finding multiple occurrences of a name, a marker, or some other fragment of text in a collection of historical documents is still a manual task. There exists no solution for automatically searching for a selected sub-image of one historical handwritten document image in a collection of document images. The application of content-based image retrieval techniques in conjunction with machine learning may introduce a solution to reduce the time it takes to manually skim through all the documents to interpret the contents (Dharani and Aroquiaraj, 2013; Zakariya et al., 2010). Several possible solutions exist for the searching problem, but the following background is focused on image processing for sub-image extraction and clustering for image matching.
1.1 Terminology
To avoid misunderstandings and to make the explanations in the rest of the document clear, some terms are defined according to their specific meaning in this context. A document denotes a collection of digital images from the same book.
[1] http://www.arkivdigital.net/
An image is understood as the digital image of one page from a book. And finally, a sub-image is an image extracted from one of the aforementioned pages.
Chapter 2
Background
In relation to image processing, the Optical Character Recognition (OCR) methodology provides a solution to extract the elements from handwritten or machine-printed documents (Vamvakas et al., 2008; Stamatopoulos et al., 2009). The OCR methodology can be divided into three general steps: a binarization step to obtain a black-and-white copy with imperfections removed; a recursive segmentation of lines, words and characters; and the character recognition using a database elaborated for the case. The last step presents a disadvantage in relation to database availability, since the character recognition is a supervised task. OCR techniques only work properly with machine-printed documents, which use only one type of font, making the recognition process manageable. Handwritten documents, however, present different writing styles and fonts; thus, creating a database for each document becomes infeasible. The support for very old handwritten documents is scarce (Vamvakas et al., 2008). However, the first two steps emerge as a common factor for historical document image processing.
Image binarization is the first step in image processing. In historical documents it is common to have a low contrast between the background colour and the ink, physical noise, or even ink bleeding through from the other side (Su et al., 2013, 2011).
The binarization operation consists of transforming a colour picture into a black-and-white picture. In the process, each pixel is turned black or white based on a threshold. An appropriate configuration of this threshold is essential to remove the undesirable elements, since any element in the background may cause a bad segmentation in the next step.
Once the picture has been pre-processed, in the segmentation phase the picture is divided into different sub-pictures which can contain separated lines, words or characters (Messaoud et al., 2012). This step is decisive for the remaining work, and its success depends on the accuracy of the segmentation model. There is currently no standard for segmentation. Different algorithms have been developed and each one can be used depending on the type of document and the distribution of its content (Likforman-Sulem et al., 2007).

Figure 1: Example page of a historical document image. Source: Arkiv Digital AD AB, www.arkivdigital.se
Machine learning techniques are used for image recognition tasks. According to Mohri et al. (2012), there are different techniques depending on the training data available and the test data used to evaluate the algorithm. Supervised learning, for example, builds its model making use of a set of labelled data called the training set. In semi-supervised learning the learner makes use of a training set with labelled and unlabelled data, and makes predictions for the unknown points. And unsupervised learning only makes use of unlabelled data. Clustering is an example of unsupervised learning.
Content and pattern recognition are the two different objectives once the segmentation step is complete. If the purpose of the technique is to obtain the transcript of the document, content recognition techniques may be applied. In this case, we have a classification problem where supervised or semi-supervised machine learning techniques may be applied (Ball and Srihari, 2009; Vajda et al., 2011; Marinai et al., 2005). A training set is also a necessary condition to predict which character matches in the transcription. Historical documents present disadvantages which maximize the human effort, and such approaches only work for particular cases. Each document has been written by different authors, and each author has a different writing style. To carry out content recognition in that case, a special dataset with a large amount of labelled characters, based on the author's writing style, is required for the training data.
On the other hand, we have content-based recognition techniques. In this case the transcription of the document is unavailable, but in comparison with the previous technique this approach presents more advantages. The possibility of using an unsupervised model minimizes the human interaction, and previous labelling work over the same document is not required to proceed with the recognition process (Panichkriangkrai et al., 2013; Phan et al., 2012). In this scenario we have an input image used as a pattern, and a bundle of sub-pictures corresponding to all the segmented words previously obtained. The recognition is performed in two phases: feature (or metadata) extraction, and clustering. Each image is represented as a vector of features, and these features define the content of the image without any knowledge about it. The distance between each feature vector and the others is used in the clustering phase to make the decision. A clustering algorithm is used because there is no target value to be predicted, but the images can be divided into groups. The aforementioned distances between vectors are used by the clustering algorithm to separate the images which match the pattern from those which do not.
Chapter 3
Related Work
Image recognition and document analysis form a wide-open research area. There are many applications based on image matching, such as the Google and Bing image search engines (Tang et al., 2012; Jing et al., 2013). These web applications use image features to evaluate the image and calculate distances to rank the outputs according to their similarity. Other studies also aim to develop a search method for handwritten documents: Imura and Tanaka (2010) propose a method based on a string-matching technique. This method treats the text in the document as pattern images by defining them with a vector of statistical features of character shapes. The text search is made by comparing the feature vectors, measuring the distance between them to make the decision.
Following the pattern matching work, previous studies in pattern recognition have focused their attention on Asian languages. In studies such as Vajda et al. (2011) or Marinai et al. (2005), the common objective of the authors is character retrieval from historical documents. The study conducted by Marinai et al. (2005) uses a segmentation procedure based on a recursive column-row cut method, and a clustering technique divided into three steps: character normalization, feature extraction, and classification. The feature extraction consists of eight different values corresponding to the gradient in eight different planes. The clustering method used was K-means, and the accuracy was measured as the ratio of correctly clustered characters to the total. The evaluation was made manually, and the approach achieved a total accuracy of 71%.
In the study conducted by Vajda et al. (2011), the pre-processing steps include binarization and histogram-based segmentation. Once the characters are separated, two features are used to define the content: the histogram of oriented gradients, and the pixel-wise difference for global distributions. Similar characters are obtained by measuring the feature differences between the pattern and the rest of the characters. To evaluate the accuracy, the same manual method explained in the previous paragraph is used. The accuracy achieved in this study is between
86% and 97%.
Studies of documents with different alphabets present other techniques for handwriting recognition. Semi-supervised learning techniques are applied in order to minimize the human effort in the transcription of the document. In Vajda et al. (2011), clustering algorithms are used to gather the characters into different groups, which are then manually labelled. The process used by Ball and Srihari (2009) reverses the order: samples of handwritten characters are labelled before using unsupervised learning for recognition.
For the image pre-processing steps, previous studies of historical documents have improved the binarization and segmentation techniques. Rangsanseri and Rodtook (2001) develop the idea of local thresholds to balance the background contrast or the irregular illumination in the document. The local thresholds are calculated dynamically for each pixel based on its neighbourhood. In degraded documents, this local threshold outperforms those based on static values. Su et al. (2013) also propose a binarization technique based on adaptive image contrasting. An adaptive threshold based on the image gradient is used for edge detection in the text. This improvement is tolerant to typical types of document degradation and to irregular illumination at the moment of digitization.
The segmentation method studied by Likforman-Sulem et al. (2007) is based on histogram projections. A handwritten text can vary with respect to font and size depending on the author's style, and the histogram projection adapts the segmentation according to these factors. A horizontal histogram projection is calculated to detect the lines, in which each peak represents a different line, and the same procedure with a vertical projection is used for the word segmentation.
One step beyond the last segmentation example is the framework introduced by Messaoud et al. (2012). The authors combine the pixel projection with angle correction, in order to address the problem of text-lines with an angle deviation with respect to the horizontal guideline. The method was tested using the IAM historical database (IAM-HistDB) and images from the ICFHR 2010 competition, outperforming the results based only on pixel projections.
3.1 Identification of Gap
Content-based image search in historical documents is still an open research area. Previous studies have developed specific techniques to address one particular study object, but there is no general technique for image searching in documents written in Latin alphabets. The principal challenge remains in finding elements in a collection of document images which match a previously selected sub-image. These can be names, markers, or some other fragments of text. In other words, the handwriting recognition problem is reduced to an image matching problem.
In Figure 2 there is an example of one possible image searching objective. The same name, Augusta, occurs in consecutive rows, apparently written by the same person. With a suitable content recognition and image searching technique, the detection of the same name in image documents would be possible with a sub-image input example. After selecting the first Augusta, the sample in the second row would be recognized regardless of subtle differences between the writing styles of each letter.
Figure 2: Example page from the Uppsala Cathedral household records, years 1881-1883. The first column represents a machine-typed index and the subsequent eight columns contain tabular, handwritten information about individuals from the Uppsala Cathedral parish. Source: Arkiv Digital AD AB, www.arkivdigital.se
Chapter 4
Aim and Objectives
The aim of this thesis is to develop a technique for image searching in historical handwritten document images. Image processing techniques such as binarization and segmentation, in conjunction with clustering algorithms, will be analyzed and evaluated in order to achieve high performance. As a result of achieving the aim, a complete methodology for image searching in historical documents, and a window-based software tool to perform the sub-image selection and matching will be provided.
The objectives to fulfill the stated aim are to:
1. Find an optimal configuration for binarization and segmentation to correctly extract the words in a document.
2. Study which combination of features is optimal to define a text sub-image without any additional knowledge about the content.
3. Design a clustering approach to analyse all the sub-images features and provide a solution for the image matching problem.
4. Develop a window-based software tool to perform the image recognition experiments and evaluations, enabling a visual tool to validate the results.
Chapter 5
Approach
The approach is separated into three different phases: image processing, feature extraction, and clustering analysis. The objective of the image processing is to split the documents into sub-images containing all the objects of interest. The image separation is a trade-off, and it is necessary to find the balance in the binarization and segmentation steps to separate the content information from the background. Different thresholds have to be tuned to find a suitable configuration for the studied datasets.
Once the image documents are fragmented into sub-images, it is necessary to define their content without any knowledge about it. The content of a picture can be expressed by defining a set of features, creating a vector for each image. The feature vector will be used later by the clustering algorithm to match the different sub-images.
The last step of the process is the cluster analysis. With a previously selected sample of the document, the clustering algorithm identifies which sub-images match the sample. The aforementioned feature vector is used to measure the disparity between images in order to make the decision.
The steps of the approach are now explained in detail, including a description of the software library used for the application development.
5.1 Software library
OpenCV [1] is an open source computer vision and machine learning software library. It includes programming functions focused on real-time image processing.
The window-based software tool for the project is developed on this platform.
The library includes a set of algorithms, e.g., for binarization, segmentation, and
[1] http://opencv.org/
recognition. This platform also has a large user community and it is extensively used in companies and by researchers. For these reasons, OpenCV provides the tools to develop the aforementioned application. OpenCV has interfaces for different programming languages, but since its primary interface is in C++, the software tool is written in that language to make the most of the platform.
5.2 Image processing
The image processing procedure is conducted in three steps: binarization, filtering, and segmentation. The binarization process consists of converting a colour image to black-and-white using a threshold for all the pixels in the image. Afterwards the image is filtered to remove all the remaining imperfections, like background noise, holes or bumps. Later, the segmentation process splits the binary image into multiple sub-images using vertical and horizontal projections of pixels.
5.2.1 Binarization
A digital image is processed as an array of pixels, and each pixel is defined with 8 bits, from 0 for black to 255 for white. To binarize an image it is necessary to use a threshold to decide the new value of each pixel. This threshold is defined as a number between 0 and 255 and it can be static or dynamic. A static threshold uses the same value for all the pixels of the image, whereas a dynamic threshold recalculates its value for each pixel according to a formula. Figure 3 illustrates a static threshold operation.
For the binarization of historical documents the chosen threshold is dynamic, considering that it provides a better result when the background or the illumination is irregular (Su et al., 2013). The dynamic threshold calculates its value for each individual pixel as a weighted sum of the neighbouring pixels. The weighted sum is calculated as a cross-correlation with a Gaussian window, subtracted by a chosen constant C. In this way, the function takes into account a predefined number of neighbouring pixels for the adaptive threshold. Moreover, the threshold type used is inverted: the dark pixels become white and the bright pixels black. White pixels are represented with the value 255, so it is convenient to invert the threshold in order to avoid histograms filled with high numbers, which would be a disadvantage when finding the maxima and minima to cut the image.
The binarization function is defined according to formula 1, where maxValue is 255 for a black-and-white image, and T(x, y) is the calculated threshold, i.e., the aforementioned weighted sum of the neighbouring pixels subtracted by the constant C.

    dst(x, y) = { maxValue   if src(x, y) > T(x, y)
                { 0          otherwise                (1)

Figure 3: Example of threshold operation. The red contour represents the pixel values of a picture, and the horizontal blue line a static threshold. The pixels above the threshold are set to the predefined maximum value and the others are set to zero. Source: OpenCV Documentation, docs.opencv.org
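The inverted dynamic threshold can be sketched in plain Python as follows. This is a minimal illustration, not the thesis implementation: the actual tool uses OpenCV's Gaussian-weighted adaptive threshold, whereas this sketch uses a plain neighbourhood mean, and the function name and parameters are illustrative assumptions.

```python
def binarize_inverted(img, block=3, C=2):
    """Turn a grayscale image (list of lists, values 0-255) into an
    inverted binary image: ink (dark pixels) becomes 255, background 0.

    The per-pixel threshold T is the mean of a block x block
    neighbourhood minus the constant C (OpenCV instead uses a
    Gaussian-weighted sum; the plain mean here is an assumption)."""
    h, w = len(img), len(img[0])
    r = block // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # neighbourhood around (x, y), clipped at the image borders
            vals = [img[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            T = sum(vals) / len(vals) - C
            # inverted test: pixels darker than T become white (255)
            out[y][x] = 255 if img[y][x] < T else 0
    return out
```

For example, a dark stroke pixel (value 50) on a bright background (value 200) falls below its local threshold and is mapped to 255, while the uniform background stays at 0.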
5.2.2 Image filtering
In handwritten documents it is common to find undesirable elements like ink drops, background marks that persist after the binarization step, or even holes and bumps (Nina et al., 2011). These imperfections may cause a decrease in the segmentation accuracy. The purpose of this filtering process is to remove those elements from the black-and-white image.
The filtering process is based on erosion and dilation operations. These functions have opposite effects on the image. The erosion operation enlarges the dark areas of the image and reduces the bright regions, whereas dilation reduces the dark areas and dilates the brighter zones. Figure 4 illustrates the effect of both operations on the same image. The objective of the erosion filter is to remove undesired elements in the images such as ink drops, whereas the dilation filter is aimed at enlarging the text lines in order to facilitate text detection.
Figure 4: Example of filtering, from left to right: original image, erosion filtering, dilation filtering. Source: OpenCV Documentation, docs.opencv.org
The image filtering process is conducted in two different phases: the dilation filter is applied before the segmentation, and the erosion filter before the feature extraction. In order to increase the accuracy of the segmentation, the dilation operation is performed to expand the white pixels. The segmentation process is based on the projection of these pixels; therefore, thicker lines cause a better detection of lines and words.

The erosion filter is applied to the sub-images once they are extracted. The extracted features are based on the distribution of the pixels over the image, but after the dilation, the content of the images loses quality regarding the writing style. The objective of the erosion operation is to return the image to its original form, and to remove the aforementioned imperfections.
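Both filters can be sketched in plain Python on an inverted binary image where ink is white (255). The thesis tool uses OpenCV's erode and dilate functions; this stand-alone version assumes a fixed 3x3 square kernel, which is an illustrative choice rather than the thesis configuration.

```python
def dilate(img):
    """Grow the white (255) regions by one pixel: a pixel becomes
    white if any pixel in its 3x3 neighbourhood is white."""
    h, w = len(img), len(img[0])
    return [[255 if any(img[j][i] == 255
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2)))
             else 0
             for x in range(w)] for y in range(h)]

def erode(img):
    """Shrink the white regions by one pixel: a pixel stays white
    only if every pixel in its 3x3 neighbourhood is white."""
    h, w = len(img), len(img[0])
    return [[255 if all(img[j][i] == 255
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2)))
             else 0
             for x in range(w)] for y in range(h)]
```

Dilating before segmentation thickens the strokes so the pixel projections form clearer bands; eroding the extracted sub-images afterwards shrinks the strokes back towards their original width.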
5.2.3 Segmentation
The method used for the segmentation process is based on horizontal and vertical
pixel projections. These projections are calculated as the sum of pixels along
a particular direction. This concept is inspired by the histogram projections presented by dos Santos et al. (2009), which are used to separate lines and words in handwritten documents.
Figure 5: Example of horizontal and vertical pixel projection histograms. Source:
http://archive.nlm.nih.gov/
The results of the projections are column and row vectors where the maximum values represent the lines with the highest amount of white pixels, and vice versa. These maximum points are used to identify where the image has to be cut. A recursive function over all the images is used to conduct the process, separating first the lines and then the objects of interest in them.
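The projection-and-cut idea can be sketched as follows. This is a simplified illustration of the histogram-projection approach described above, not the thesis code; the min_pixels cut threshold is a hypothetical parameter standing in for the tuned configuration mentioned later in this chapter.

```python
def horizontal_projection(img):
    """Count the white (255) pixels in each row of a binary image."""
    return [row.count(255) for row in img]

def find_bands(projection, min_pixels=1):
    """Return (start, end) index pairs where the projection stays at
    or above min_pixels: each band is one candidate text line. The
    same function applied to a vertical projection yields words."""
    bands, start = [], None
    for i, v in enumerate(projection):
        if v >= min_pixels and start is None:
            start = i
        elif v < min_pixels and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(projection)))
    return bands
```

Applied recursively (rows first, then columns within each detected line band), this reproduces the line-then-word separation described above.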
Handwritten documents present particular threats for an accurate segmentation (Sanchez et al., 2008). Due to the quality of the picture or the writing style, many lines in the document may appear diagonally, making the horizontal line detection harder. In other cases, the same line may contain sub-lines, as in the figure below.
Figure 6: Example of line with detection threats. The document style is not the same in all the columns, some of them only have one word per column whereas others have three lines of text. Source: Arkiv Digital AD AB, www.arkivdigital.se
The vertical segmentation to extract complete words also presents threats in handwritten documents. The process is based on the search for minima in the projection, which are points that identify a space between written elements. But this theory is only reliable in the ideal case where the background is clear and there is a lack of other elements. If the document presents tables, lines, or similar elements, the minima of the projection may be unreliable.

In order to solve the problem, and considering that the objective is not to identify the content but to detect the text, the sub-images are extracted between minima regardless of whether the word is complete or not. The extraction is not necessarily full-text based. The same vertical segmentation process is conducted over the user sample in order to obtain the same sub-images as those extracted from the original image. Further, the cluster analysis is conducted over each sub-image extracted from the user selection. This selection is made in order to address the problem of the extracted images containing incomplete words.
To evaluate the validity of the previous techniques, it is necessary to manually count the number of incorrectly extracted elements. The ratio between the correctly separated elements and the total gives us the accuracy of the method. If the previous techniques do not achieve a suitable result, the steps are repeated, tuning the parameters of the process until a suitable configuration is obtained.
5.3 Feature extraction
Once the elements are extracted into separate images, the following step consists of defining the content. For this purpose, a feature vector is created for each image. Kumar and Bhatia (2014) classify the feature extraction models into three groups: statistical features, signal processing methods, and geometrical or topological features.

For this particular case of handwritten historical documents, the selected features have to be meaningful and efficient (Impedovo and Pirlo, 2011). The features must define the content of each sample, and they have to provide a balance between computing time and quality performance. Statistical and geometrical features are based on the distribution of pixels and the contour shapes. Focused on the writing style, they may provide more information with less complexity (Impedovo et al., 2012). However, signal transformations and series expansions are focused on deformations like rotations or translations, properties which are more relevant for pictures than for written documents.
To extract the features, the image is no longer modified due to its basic level of segmentation, similar in some cases to a syllable extraction. Each sub-image is analysed separately and the same numerical attributes are extracted for each. First the geometrical features are extracted; these features only depend on the distribution of the pixels. The rows and columns of the sub-image are the first parameters, measuring the dimensions of the sub-image. The number of white pixels and their density in the sub-image are the subsequent parameters, calculated in order to measure the amount of text. The last geometrical parameters are the mean and the standard deviation of the image, understanding the image as an array of data.
The statistical features are calculated afterwards. They compare the extracted sub-images against the user samples by template matching overlapping. The overlapped images are compared using three different methods based on square differences, correlation, and clustering coefficients. The result is a similarity ratio between both images. This triad of features is computed for each sub-sample of the user selection.
Once the features are extracted for each image, they are used to construct a feature vector. This vector contains the numerical features previously calculated, and represents the content of the image for the pattern recognition. Therefore, all the image features may be represented as a matrix where each row belongs to an image and each column to a feature.
    Features =
                 Feature 1   Feature 2   ...   Feature N
    Element 1    a_{1,1}     a_{1,2}     ...   a_{1,n}
    Element 2    a_{2,1}     a_{2,2}     ...   a_{2,n}
      ...          ...         ...       ...     ...
    Element M    a_{m,1}     a_{m,2}     ...   a_{m,n}
                                                          (2)
Once the features matrix is built the next step is the cluster analysis.
5.4 Clustering
For the analysis and the classification of the extracted images according to their features, a clustering algorithm is used. Jain et al. (2004) classify clustering algorithms based on the input and output data representation, and on the optimization process. For the sub-images scenario, the input data are the feature vectors, or the feature matrix, and the output data are a pair of clusters with the matched and non-matched images. The clustering algorithm for this image matching has to be centre-based. The main cluster is built using the user sample as centre; it contains the matched images, while the other images belong to the second cluster.
Well-known centre-based clustering algorithms, such as K-means or K-medoids, require a priori knowledge of the centres of the clusters. Therefore, they cannot be used, since the centre of the non-matching cluster is undefined. The algorithm to match the images needs the centre of the main cluster and a function to mark out its area. The clustering algorithm developed is explained in detail in the subsequent sections.
For this approach, the outputs given are the input image, the extracted images, and those in the match cluster. This way, the evaluation is made manually by counting the correctly matched images and their total. The ratio between the correctly matched images and the total gives us the accuracy of the feature extraction and the clustering step. Thus, the total accuracy of the developed technique may be expressed as the combination of the two previously calculated accuracies.
Regarding the input data and the expected outputs, the algorithm developed is divided into two phases: analyse the matrix of features previously built, and divide the data into two clusters, one with the matched elements and the other with the non-matched. The algorithm is centre-based: the representation of the sample provided by the user defines the centre of the matching cluster. The objectives to reach the aim of the algorithm are to prepare the data for analysis, and to define the function that represents the border of the cluster.
According to the classification of Dubes and Jain (1988), this clustering algorithm is exclusive, intrinsic, and partitional. Exclusive clustering means that the clusters are non-overlapping: one element cannot belong to both clusters. It is intrinsic because a dataset to train the clusters is unavailable; there is no a priori knowledge about the elements to classify. And it is partitional, or non-hierarchical, because it creates single partitions without sub-categories or groups.
5.5 Data preparation
Before proceeding with the clustering algorithm, it is necessary to prepare or transform the data. The feature vector contains attributes related to the distribution of pixels in the image, and triads of features measuring the difference with the user-selection sub-images. To prepare the data it is necessary to take only the triad of features which compares the image to the pertinent user sub-image.
    Features = ( a_0, ..., a_5, a_6, a_7, a_8, ..., a_{n-2}, a_{n-1}, a_n )   (3)

where a_0, ..., a_5 are the pixel-distribution features, (a_6, a_7, a_8) form Triad 1, and (a_{n-2}, a_{n-1}, a_n) form Triad m.
Next, it is necessary to remove the features which do not bring information to the clustering approach. In the previous chapter, it is explained that the extracted features have to be concise and meaningful. But it is possible that different features provide the same information, or no information at all, for differentiating one image from the others. For that reason, the cross-correlation, variation, and standard deviation of the features are calculated beforehand. A high-dimensional data matrix may cause lower accuracy; in such matrices the distances between the closest and farthest neighbours may be similar. Therefore, the features with little variation (no information) or high cross-correlation (same information) are discarded.
Once the number of features is reduced, in order to avoid predominant attributes, the next step to prepare the data is to standardize the values. This way all the measures are on the same scale. The data is transformed according to formula 4, creating a new scale for all the attributes where the results lie between -1 and 1:
    a'_{m,n} = a_{m,n} / max_m |a_{m,n}|   (4)
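Formula 4 applied column by column can be sketched as follows. The `or 1` guard against an all-zero column is an added assumption, not part of the thesis formula, and the function name is illustrative.

```python
def scale_features(matrix):
    """Scale each feature (column) of the features matrix into
    [-1, 1] by dividing every value by the largest absolute value
    in its column, as in formula 4."""
    cols = len(matrix[0])
    # per-column maxima; 'or 1' avoids dividing by zero when a
    # column is entirely zero (an assumption beyond the formula)
    maxima = [max(abs(row[j]) for row in matrix) or 1
              for j in range(cols)]
    return [[row[j] / maxima[j] for j in range(cols)]
            for row in matrix]
```

For instance, the matrix [[2, -10], [1, 5]] is scaled to [[1.0, -1.0], [0.5, 0.5]], so no single attribute dominates the distance computation.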
These new feature values, with a suitable number of dimensions and the same scale, create a new matrix, which is used to represent the data in the following steps.
5.6 Data representation
The data is represented using a proximity matrix. This matrix contains the distances between the values of the sample image and the others, measuring the difference of all the transformed feature values.
The distances, or proximity indices, are calculated according to the formula 5 for the Euclidean distance, which is the most common way to measure a distance between two points:
    p_{m,n} = sqrt( sum_{k=1}^{d} (a_{m,k} - a_{n,k})^2 )   (5)
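The Euclidean proximity indices of formula 5 can be sketched as follows, using the scaled feature values; the helper names are illustrative, not the thesis implementation.

```python
from math import sqrt

def euclidean(u, v):
    """Euclidean distance between two scaled feature vectors of
    equal dimension d, as in formula 5."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def proximity_to_sample(sample, vectors):
    """One row of the proximity matrix: the distance from the
    user-sample feature vector to every extracted sub-image
    vector."""
    return [euclidean(sample, v) for v in vectors]
```

Sub-images whose proximity index falls within the border of the main cluster, centred on the user sample, are reported as matches.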