UPTEC F 15016

Degree project, 30 credits. April 2015

Segmentation and Beautification of Handwriting using Mobile Devices

Jesper Dürebrandt

Faculty of Science and Technology, UTH unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Segmentation and Beautification of Handwriting using Mobile Devices

Jesper Dürebrandt

Converting handwritten or machine printed documents into a computer readable format allows more efficient storage and processing. The recognition of machine printed text is very reliable with today's technology, but the recognition of offline handwriting still remains a problem to the research community due to the high variance in handwriting styles. Modern mobile devices are capable of performing complex tasks such as scanning invoices, reading traffic signs, and online handwriting recognition, but there are only a few applications that treat offline handwriting.

This thesis investigates the segmentation of handwritten documents into text lines and words, how the legibility of handwriting can be increased by beautification, and how these steps can be implemented for modern mobile devices. Text line and word segmentation are crucial steps towards implementing a complete handwriting recognition system.

The results of this thesis show that text line and word segmentation, along with handwriting beautification, can be implemented successfully for modern mobile devices, and a survey concludes that the writing on processed documents is more legible than on their unprocessed counterparts. An application for the operating system iOS is developed for demonstration.

ISSN: 1401-5757, UPTEC F 15016
Examiner: Tomas Nyberg
Subject reader: Anders Brun
Supervisor: André Strindby

Populärvetenskaplig sammanfattning (Popular Science Summary)

Converting handwritten or printed documents into a machine-readable format enables more efficient storage and processing. With today's technology, the recognition of printed text is very reliable, but due to the high variance in handwriting styles, the recognition of offline handwriting remains an open problem for the research community. Modern mobile devices are capable of performing advanced tasks, such as scanning invoices, reading traffic signs, and online handwriting recognition, but only a few applications deal with offline handwriting.

This thesis investigates text line and word segmentation of handwritten documents, how the legibility of handwriting can be increased by beautification, and how this can be implemented for mobile devices. Text line and word segmentation are critical steps towards an implementation of a complete handwriting recognition system.

The results of this thesis show that text line and word segmentation, together with beautification of handwritten text, can be implemented for mobile devices, and the survey supports that the legibility of the processed handwritten documents increases. An application for the operating system iOS has been developed for demonstration.

Contents

1 Introduction
1.1 Background
1.2 History
1.3 Bontouch
1.4 Objective
1.5 Scope and Limitations

2 Related Work
2.1 Handwriting Recognition
2.2 The IAM Handwriting Database
2.3 Detection
2.4 Segmentation
2.4.1 Text Line Segmentation
2.4.2 Word Segmentation
2.5 Beautification

3 Theory
3.1 Text Line Features
3.2 Digital Image Processing
3.2.1 Representation of Digital Images
3.2.2 Adjacency and Connectivity of Pixels
3.2.3 Differentiating an Image
3.2.4 Smoothing Linear Filter
3.2.5 Global Optimum Thresholding Using Otsu's Method

4 Development Environment and Code
4.1 OpenCV

5 Method
5.1 Binarization
5.2 Connected Component Analysis
5.3 Text Line Detection
5.3.1 Vertical Projection Profile Method
5.3.2 Smearing Method
5.4 Text Line Segmentation
5.5 Word Segmentation
5.6 Beautification

6 Results
6.1 Binarization
6.2 Text Line Segmentation
6.2.1 Vertical Projection Profile Method
6.2.2 Smearing Method
6.2.3 Hypothesis Test
6.3 Word Segmentation
6.4 Beautification
6.5 iOS Application

7 Discussion
7.1 Binarization
7.2 Text Line Segmentation
7.3 Word Segmentation
7.4 Beautification
7.5 iOS Application
7.6 Conclusions

8 Future Work

References

Appendices
A Survey

List of Figures

2.1 A sample from the IAM Handwriting Database and its cropped version.
3.1 Useful font metrics for describing the body of a text line.
3.2 Examples of baseline distortions.
3.3 Neighborhoods of pixel p.
3.4 Connected components in an image.
3.5 Examples of smoothing linear filters.
3.6 Otsu's method performed on an image of handwritten text.
4.1 An iPhone 6.
5.1 Parts of a HWR system.
5.2 Proposed system.
5.3 Binarization by background subtraction and Otsu's method.
5.4 Text line detection based on the SVPP and its derivatives.
5.5 The Smearing method.
5.6 Text line segmentation.
5.7 Steps in the word segmentation process.
5.8 Steps in the baseline correction process.
6.1 Comparison of binarization using Otsu's method with and without background subtraction.
6.2 Text line segmentation result.
6.3 The found text line bounding boxes are compared to the ground truth (blue boxes). Green boxes (correct) and red boxes (incorrect) are drawn over the ground truth.
6.4 Text line segmentation performance.
6.5 Execution time for the implementation using the VPP method.
6.6 Text line segmentation performance.
6.7 Execution time for the implementation using the Smearing method.
6.8 Word segmentation result.
6.9 The found word bounding boxes are compared to the ground truth (blue boxes). Green boxes (correct) and red boxes (incorrect) are drawn over the ground truth.
6.10 Word segmentation performance.
6.11 iOS Application and its functionality.
7.1 Image containing CCs that belong to two text lines.
7.2 Image containing CCs that belong to two words.

Chapter 1

Introduction

1.1 Background

Character Recognition (CR) is an umbrella term for the conversion of handwritten or machine printed text into a computer readable format. CR allows documents of all kinds to be digitally stored, edited, and searched, which is more efficient and secure than the use of traditional paper. It may also be used for assisting tools, such as an application that reads text out loud for the visually impaired. CR is nevertheless a difficult task, since text is written in different languages, sizes, directions, and font styles. Additionally, the image might be defective, containing non-textual objects that should not be processed. The recognition process can be decomposed into four main steps: detection of the text, segmentation into text lines and words, normalization of the segmented words, and finally classification of the words [1].

Here, normalization refers to the process of converting the segmented words into a standard format in order to reduce the variance of the input to a classifier. Baseline correction is a common normalization step that will be used in this thesis for beautification of the handwritten text.

Being a large field of research, CR is divided into several subcategories, in which research in some areas has come very far while others require more attention. Optical Character Recognition (OCR), the recognition of machine printed text, is considered a solved problem, and there are various applications available. Those applications may utilize the knowledge of the font style and the fact that the text is written in separated straight lines. Handwriting Recognition (HWR) is, however, a more complicated task, since people have highly individual handwriting styles. The text cannot be assumed to be written in straight lines, and it is difficult to generalize handwritten text in all its forms. Online HWR, recognition as a user writes text on a touch screen or with a digital pen, is also considered a solved problem. The stroke direction and the time lapse between characters and words during writing help to effectively eliminate the detection and segmentation phases. Unfortunately, such information is not readily available in offline HWR, recognition after the text has been written. Offline HWR is still regarded as an unsolved problem by the research community.

Modern mobile devices are equipped with high quality cameras and processors, which allows us to perform complex image analysis right in the palm of the hand. Yet, the performance is only a fraction of that of a personal computer. Since mobile applications have to be responsive in order to provide a good user experience, performance is important when developing applications for mobile devices. This thesis investigates how detection and segmentation of handwriting into text lines and words can be implemented for modern mobile devices. In addition, it discusses a means of improving handwriting legibility through baseline correction and text alignment.

1.2 History

The Russian scientist Tyuring made early attempts on CR in 1900 when he researched an assistive tool for the visually impaired [8]. Tauschek and Handel received patents on OCR in Germany in 1929 and in the U.S. in 1933, respectively [10].

Their approaches were based on template matching, which was the first computer implemented method and the most popular in the field until the 1980's. In the absence of computers, their systems used mechanical templates that light passed through, triggering photodetectors in order to recognize different characters. In the computer implementations, the characters were matched to the contents of a library of template images. It was however realized that the templates could only capture the features of printed characters; the variation in handwritten characters required a more flexible method. This initiated the research of structural analysis and statistical methods. When electronic tablets that captured the pen-tip movement were invented in the 1950's, the research of online HWR commenced. The overall progress of the different CR fields was however constrained by the best computational resources available through the technology of a given time [1]. Most of the progress in CR has been made since the 1990's, as the advancements in technology allowed implementation of complex methods such as Neural Networks and Hidden Markov Models, which today are the state of the art.

1.3 Bontouch

Bontouch1 is a company in Stockholm that partners in long term collaborations with clients who are serious about mobile challenges. They have several years of experience through relations with their clients and through continuous development and governance of their customers' mobile strategies for most mobile platforms, including iOS and Android. Among such collaborations are apps that locate and enhance documents in images, and the first banking app in Sweden to perform OCR scanning of invoices using the camera of the mobile device. Bontouch is now researching the field of HWR in order to further enhance the user experience. A presentation and discussions with customers of Bontouch indicate that there is an interest in the applications of this thesis.

1http://www.bontouch.com

1.4 Objective

There are different methods of performing offline HWR, but most methods deal with localization, segmentation, normalization, and classification in isolated steps. The work in this thesis is focused upon the localization and segmentation of text into text lines and words. The text lines will also be corrected so that they lie on a horizontal baseline, which is a common part of the normalization. The purpose is to beautify handwritten notes, i.e. to make them more legible. This is useful for users that write and share handwritten notes in their everyday life. The result will also be used as the first step of a complete HWR system. The implementation will be carried out in C++ using the open source computer vision framework OpenCV, which allows easy integration on many smartphone platforms, such as iOS and Android. The result of this thesis will be demonstrated in an iOS application that aims to increase the legibility of offline handwriting.

1.5 Scope and Limitations

Text is encountered in many different forms and contexts, some making detection easier and some making it more difficult. The data used in this thesis is limited to handwritten notes written with a dark color on a light background. This reduces the difficulty of finding the text in the images and puts more emphasis on the segmentation of the text. The images are also assumed to be cropped to the bounds of documents of text written mainly horizontally. This means that the text lines are allowed to be skewed or curvilinear. Furthermore, it is assumed that characters from adjacent lines and adjacent words in a line are all disconnected.

Chapter 2

Related Work

2.1 Handwriting Recognition

HWR is difficult even for the human eye; reading in the absence of context fails 4% of the time [8]. As opposed to reading machine printed characters, the context plays a great part in distinguishing handwritten characters. Until the 1990's, the implemented approaches for CR only utilized local information of a character. It was simply not feasible to feed more information into the system due to the growth in computation time. The input to the classifier was typically a 9×9 image of a single character or a set of features that describe a single character [10]. The result may be corrected by comparison with a dictionary or advanced language models, but this method isn't capable of correcting all errors that arise from the segmentation into characters. Since the language models are specific to certain languages, it is also difficult to extend to more languages. The difficulties of segmentation are efficiently described by Sayre's paradox: a character cannot be segmented before having been recognized and cannot be recognized before having been segmented [15].

Using modern methodologies such as Neural Networks, Hidden Markov Models, and combinations thereof, it is possible to use an image of a complete word or sentence as input to the classifier, as described in [4] and [5, pp. 1-8]. These methods effectively perform the segmentation and classification simultaneously. Such models are examples of supervised learning, where a classifier is trained on a large set of known samples. The idea is that the model finds a function that generalizes the data in the training set, and that function may later be applied to unknown samples. Classification using such a method is typically a computationally inexpensive operation, but the training of the network requires a lot of time.

Until recent years, offline HWR has been applied almost exclusively to images obtained using scanners. As cameras are now available in most mobile devices, which are carried at all times, an interest in applications that utilize spontaneous image capture is on the rise. Camera based input for offline HWR is not yet considered a trend, but it is a promising application field [5, pp. 67-75]. The iOS application developed in this thesis shows the use of HWR related tasks on camera based input.

2.2 The IAM Handwriting Database

The Department of Computer Science and Applied Mathematics at the University of Bern in Switzerland maintains the IAM (Institut für Informatik und angewandte Mathematik) Offline Handwriting Database [9], which has been used extensively in research on handwriting recognizers and writer identification. It contains 1539 scanned forms of handwritten English text written by 657 different writers1. The forms contain a field of machine printed text and a blank space where the writer repeats the text in handwriting, as shown in Figure 2.1a. Each form comes with a ground truth XML file with labels for each sentence, text line, and word, as well as estimates of various parameters from the preprocessing steps. There are 13353 isolated and labeled text lines and 115320 isolated and labeled words in total. Since this thesis is focused upon handwritten text, the samples are cropped as shown in Figure 2.1b.

(a) A sample from the IAM database. (b) Cropped version of the sample.

Figure 2.1: A sample from the IAM Handwriting Database and its cropped version.

2.3 Detection

The difficulty of finding text in images varies a lot depending on the type of image.

Some applications search for text in natural images, where a traffic sign, for instance, might be surrounded by foliage and other objects. In this field, the Stroke Width Transform (SWT) [3] has shown good results. This approach does not require supervised learning, but instead relies on clustering techniques. By detecting edges in the image, the stroke width of each element is approximated, and the text is then detected under the assumption that the textual elements are drawn with homogeneous stroke width and may therefore be distinguished using clustering techniques.

For images of documents where a foreground of dark text is written on a bright background, the text may instead be distinguished by the pixel values. In good quality images, there is typically a global threshold value that separates the background and foreground. It may however be difficult to distinguish the foreground from the background if there is lighting variation in the image; the value of a background pixel in a shaded area might be the same as that of a brighter foreground pixel. In [16], the lighting variation is handled by computing a continuous background surface, which is used in order to compute a local threshold at each pixel. The implemented binarization method described in section 5.1 is inspired by this method: it estimates a continuous background and subtracts it from the original image in order to reduce lighting variations and distinguish the foreground.

2.4 Segmentation

2.4.1 Text Line Segmentation

An evaluation of different text line segmentation techniques is found in [12]. Among the evaluated techniques are a Vertical Projection Profile (VPP) based method and a Smearing method. The VPP is essentially a histogram produced by taking the sum of the pixel intensities in each row, from which the line boundaries can be extracted. The rectangle based filtering method blurs the image with a horizontally elongated median filter, followed by binarization using Otsu's method. Then, in the ideal case, the obtained binary image contains a connected component for each text line. The evaluation on the data set found that the latter method is more successful.

The implemented Smearing method described in section 5.4 takes inspiration from this method, but uses an averaging filter instead of a median filter.

A VPP method further developed in [14] investigates the VPPs of several non-overlapping vertical segments of the image. The text line boundaries are then extracted using the local minima and maxima of the first derivative of the VPP. A Hidden Markov Model based on the text and gap area statistics of the document is used to refine the boundaries. A connected component is assigned to a text line if its area intersection with a text line boundary exceeds a given threshold. The implemented VPP method described in section 5.3 is similar to this method, but finds text line regions using the sign of the second derivative of the VPP. Furthermore, the text line segmentation method described in section 5.4 takes inspiration from this method, but instead assigns each connected component to the text line with which it has the most intersecting pixels.

Another text line segmentation method, described in [2], utilizes Seam Carving to segment the text lines. In a document image, high energy regions typically correspond to foreground components and low energy regions correspond to background components.

Seam Carving computes minimum energy seams in an image, and by constraining the carving between two consecutive text lines, a separating seam is produced.

The constraints are calculated using a projection profile based on an edge image computed using the Sobel operator.

2.4.2 Word Segmentation

The words in a text line are unlikely to be connected, and this assumption is used in the word segmentation method proposed in [14]. A connected component analysis is performed on each segmented text line, and the gaps between the connected components are classified as either within or between words. The classification is based on a threshold computed using the histogram over all gaps in the document. The gap, or separability, between connected components is measured by treating the connected components as elements of two distinct classes and finding the margin of an SVM classifier that separates the classes. Using the classified gaps, the connected components are grouped into words.

The implemented word segmentation method described in section 5.5 takes inspiration from the method in [14] by assigning pixels to different words based on connected components. The implemented method differs by finding the boundaries of words using a horizontal projection profile, and then assigning each connected component to the word with which it has the most intersecting pixels.

2.5 Beautification

Baseline correction is a normalization procedure that reduces the variance in the relative position of characters with respect to their neighbors along a text line. It is done by finding the baseline of a text line and transforming the image so that the detected baseline becomes horizontal. In [11], the baseline is found by a VPP method. A segmented text line is horizontally traversed by a sliding window, for which the VPP is computed and stored in the center column of the window. The resulting image is then binarized using Otsu’s method. The baseline can then be extracted by examining the binary image. Each pixel column is then translated vertically in order to straighten the baseline. The implementation of the baseline correction method described in section 5.6 is based on this method.

In [4], several normalization procedures are implemented using supervised learning techniques. Four Multi Layer Perceptrons, a type of Neural Networks, are trained for binarization, slope removal, slant removal, and size normalization, respectively.

A data set of 1000 images was semi-automatically labeled and then manually supervised and corrected in order to produce supervised training patterns.

Chapter 3

Theory

3.1 Text Line Features

In order to describe the properties of a text line we need to introduce some terms referred to as features. The baseline, midline, corpus, ascent, and descent are the common font metrics that are useful when describing the body of a text line, as illustrated in Figure 3.1. The baseline is the line that most of the characters rest upon, and the midline sits upon the lower case vowels. The area between the baseline and the midline is the corpus of the text line. The ascent and descent of a text line are the distances that characters extend above and below the baseline, respectively.

[Figure: the word "Example" annotated with the ascent, midline, corpus, baseline, and descent.]

Figure 3.1: Useful font metrics for describing the body of a text line.

In typed text, the mentioned font metrics are constant and defined by straight lines.

This does not apply to handwritten text, as the ascent and descent of text lines might overlap and the baseline might be skewed or curvilinear, as shown in Figure 3.2. What we perceive as neat text is directly related to a straight baseline. The baseline is also expected to have the least variance among the features, which means that it requires less adjustment than e.g. the midline. Therefore, it is reasonable to focus the detection and correction on the baseline, so that the text appears well ordered with minimal adjustment.

(a) A skewed baseline. (b) A curvilinear baseline.

Figure 3.2: Examples of baseline distortions.

3.2 Digital Image Processing

3.2.1 Representation of Digital Images

An image can be represented by a continuous function f(s, t), describing the image intensity at a certain point given by two continuous spatial variables s and t. By sampling and quantizing the image function, it can be stored digitally in a 2-D array, f(x, y), with M rows and N columns [6, pp. 55-59]. Hence, each element, or pixel, in f(x, y) is addressed by the discrete coordinate (x, y), where x = 0, 1, 2, ..., M-1 and y = 0, 1, 2, ..., N-1. The image is written as a numerical array as

$$f(x,y) = \begin{bmatrix} f(0,0) & f(0,1) & \cdots & f(0,N-1) \\ f(1,0) & f(1,1) & \cdots & f(1,N-1) \\ \vdots & \vdots & \ddots & \vdots \\ f(M-1,0) & f(M-1,1) & \cdots & f(M-1,N-1) \end{bmatrix}.$$

We now have a convenient way of addressing each pixel in a digital image and can therefore manipulate it in various ways. Note that the digital image can be stored in different data types, which impacts the precision of the intensity levels and the required storage space. There are also several color spaces in which the image can be represented. The most common format is 8-bit precision in RGB (Red, Green, Blue) color space, where each pixel stores three 8-bit integers for the intensity level in the red, green, and blue color channels, respectively. 8-bit precision allows 2^8 = 256 intensity levels in the range 0 to 255 in each channel. Grayscale images have one channel, and each pixel is represented by a single integer where 0 is black and 255 is white. In binary images each pixel is represented by either 0 or 1.
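As a concrete illustration, below is a minimal sketch of this representation in practice using OpenCV's cv::Mat, which stores images in exactly this row-major array layout (the file name is illustrative):

#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    // Load as 8-bit, 3-channel BGR (OpenCV's default channel order).
    cv::Mat color = cv::imread("document.jpg", cv::IMREAD_COLOR);
    if (color.empty()) return 1;

    // M rows and N columns, addressed by discrete (row, column) coordinates.
    int M = color.rows, N = color.cols;

    // Read one pixel: three 8-bit intensities in the range 0..255.
    cv::Vec3b p = color.at<cv::Vec3b>(0, 0);
    std::printf("B=%d G=%d R=%d, size=%dx%d\n", p[0], p[1], p[2], M, N);

    // Convert to a single-channel grayscale image: one integer per pixel,
    // where 0 is black and 255 is white.
    cv::Mat gray;
    cv::cvtColor(color, gray, cv::COLOR_BGR2GRAY);
    return 0;
}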

3.2.2 Adjacency and Connectivity of Pixels

This section introduces adjacency and connectivity of pixels; for more details, see [6, pp. 68]. The 4-neighbors of pixel p are the set of horizontal and vertical neighbors at coordinates (x±1, y) and (x, y±1), as illustrated in Figure 3.3a. The 8-neighbors of p include the 4-neighbors as well as the diagonal neighbors at coordinates (x±1, y±1), as illustrated in Figure 3.3b. We say that pixel p and pixel q are adjacent if they are defined as neighbors and belong to the same set of intensity values.

(a) The 4-neighbors of pixel p. (b) The 8-neighbors of pixel p.

Figure 3.3: Neighborhoods of pixel p.

(16)

Two pixels p and q are connected if there is a path of adjacent pixels between them. Thus, the connectivity is subject to the definition of adjacency. A set of pixels in which all are connected is referred to as a Connected Component (CC).

If we consider the subset of pixels of an image in Figure 3.4a, we see that the definition of adjacency has an impact on the number of CCs. Figure 3.4b shows that choosing 4-neighbor adjacency produces 3 CCs, and Figure 3.4c shows that choosing 8-neighbor adjacency produces 1 CC.

(a) A subset of pixels in an image.

(b) 4-neighbors connected components.

(c) 8-neighbors connected component.

Figure 3.4: Connected components in an image.
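In practice, the choice of adjacency is a parameter of the labeling routine. A minimal sketch using OpenCV's cv::connectedComponents, which accepts 4 or 8 as its connectivity argument (the input file is assumed to be a binary image with non-zero foreground):

#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    // A binary image where foreground pixels are non-zero.
    cv::Mat binary = cv::imread("binary.png", cv::IMREAD_GRAYSCALE);
    if (binary.empty()) return 1;

    cv::Mat labels4, labels8;
    // Label CCs under 4-neighbor and 8-neighbor adjacency, respectively.
    int n4 = cv::connectedComponents(binary, labels4, 4);
    int n8 = cv::connectedComponents(binary, labels8, 8);

    // The returned counts include the background as label 0; strokes that
    // touch only diagonally form one CC under 8-adjacency but split under 4.
    std::printf("4-connectivity: %d CCs, 8-connectivity: %d CCs\n",
                n4 - 1, n8 - 1);
    return 0;
}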

3.2.3 Di↵erentiating an Image

The derivatives of an image reveal information about the transitions between intensity levels, which is the foundation of operations such as image sharpening and edge detection [6, pp. 158]. In this thesis we will make use of the first derivatives.

We cannot resort to the conventional definition of the first derivatives,

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h}, \qquad \frac{\partial f}{\partial y} = \lim_{h \to 0} \frac{f(x, y+h) - f(x, y)}{h},$$

since the images are discrete: the minimum distance h over which the intensity can change is limited to one pixel, and the absolute value of the change in intensity is limited to the number of intensity levels. The first order derivatives, ∂f/∂x and ∂f/∂y, are therefore given by the differences

$$\frac{\partial f}{\partial x} = f(x+1, y) - f(x, y), \qquad \frac{\partial f}{\partial y} = f(x, y+1) - f(x, y).$$

By this definition, we get a derivative that is defined at each pixel (except for boundary pixels, which are handled by boundary conditions) and that behaves like we are used to; it is zero in a neighborhood of pixels where the intensity is constant and non-zero in a neighborhood of pixels where the intensity is changing. Approximations of higher order derivatives are obtained by simply applying the formula above repeatedly.

3.2.4 Smoothing Linear Filter

Smoothing is a powerful tool in image processing that can be used for tasks such as noise reduction and object extraction. A smoothing linear filter, also referred to as an averaging filter, smoothens an image by replacing each pixel value by the average of its neighbors [6, pp. 152-155]. This is carried out by traversing the image with an m×n mask, taking the average of the pixel values within the mask, and storing the result in an output image at the center pixel of the mask. This effectively reduces sharp intensity transitions in the image, which removes noise but also blurs the edges. Sometimes it is desired to emphasize certain pixels in the mask, e.g. give the center pixel more importance in order to reduce blurring. A weighted average filter accomplishes this by multiplying the pixels with different coefficients. The general filter with input image f(x, y) and output image g(x, y) is given by

$$g(x,y) = \frac{\sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s,t)\, f(x+s, y+t)}{\sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s,t)},$$

where w(s, t) denotes the mask elements, a = (m-1)/2, and b = (n-1)/2. If the sum of the mask elements is more or less than one, the result will be brightened or darkened. Examples of an averaging and a weighted averaging filter, respectively, are illustrated in Figure 3.5.

$$\frac{1}{mn} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{bmatrix}$$

(a) m×n averaging filter.

$$\frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$

(b) 3×3 weighted averaging filter.

Figure 3.5: Examples of smoothing linear filters.
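In OpenCV, the plain averaging filter corresponds to cv::blur, and an arbitrary weighted mask can be applied with cv::filter2D. A minimal sketch of both (the kernel sizes are illustrative):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("text.png", cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;

    // m x n averaging filter: every mask element is 1/(m*n).
    cv::Mat averaged;
    cv::blur(img, averaged, cv::Size(5, 3)); // n = 5 columns, m = 3 rows

    // The 3 x 3 weighted averaging filter of Figure 3.5b; dividing by 16
    // keeps the mask sum at one so the result is not brightened/darkened.
    cv::Mat kernel = (cv::Mat_<float>(3, 3) << 1, 2, 1,
                                               2, 4, 2,
                                               1, 2, 1);
    kernel /= 16.0f;
    cv::Mat weighted;
    cv::filter2D(img, weighted, -1, kernel);
    return 0;
}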

3.2.5 Global Optimum Thresholding Using Otsu’s Method

In images of documents, there are typically two distinct classes of pixels: a foreground of text written on a background. It is often desirable to have a binary representation of the image where 0 and 1 represent the background and foreground, respectively. Optimum global thresholding is the process of separating data into two or more classes by finding optimal threshold values that delimit the classes. The optimality is subject to a function that measures the error of the selected thresholds. Otsu's method [13] is a popular optimum global thresholding method for binarizing document images. It finds the optimal threshold value that maximizes the variance between the delimited classes using the histogram of the image. Figure 3.6 shows how an image is binarized using Otsu's method.

(a) Grayscale image of handwritten text. (b) Binarized image using Otsu's method. (c) The Otsu threshold is marked by the dashed line in the normalized histogram (probability versus intensity).

Figure 3.6: Otsu’s method performed on an image of handwritten text.

Let's assume that the pixels of an image with N pixels and L intensity levels can be separated into two classes C_0 and C_1 by a threshold t, so that intensity levels [1, 2, ..., t] belong to C_0 and [t+1, t+2, ..., L] belong to C_1. The histogram of the image counts the frequency n_i of pixels with intensity level i, i = 1, 2, ..., L. By normalizing the histogram we can regard it as a probability density function, describing the probability p_i for a pixel to assume intensity level i:

$$p_i = \frac{n_i}{N}, \qquad \sum_{i=1}^{L} p_i = 1, \qquad p_i \ge 0.$$

The probabilities P_0 and P_1 that a pixel belongs to class C_0 and C_1, respectively, are given by

$$P_0 = \sum_{i=1}^{t} p_i = P(t), \qquad P_1 = \sum_{i=t+1}^{L} p_i = 1 - P(t),$$

and the mean intensity levels \mu_0 and \mu_1 within the classes are

$$\mu_0 = \sum_{i=1}^{t} \frac{i\, p_i}{P_0} = \frac{\mu(t)}{P(t)}, \qquad \mu_1 = \sum_{i=t+1}^{L} \frac{i\, p_i}{P_1} = \frac{\mu_T - \mu(t)}{1 - P(t)},$$

where the mean intensity level \mu(t) for C_0 and the total mean intensity level \mu_T are

$$\mu(t) = \sum_{i=1}^{t} i\, p_i, \qquad \mu_T = \sum_{i=1}^{L} i\, p_i.$$

We can now express the variance between the classes, \sigma_B^2, as a function of the chosen threshold t, and the formula for obtaining the optimal threshold t^* as

$$\sigma_B^2(t) = \frac{\left(\mu_T P(t) - \mu(t)\right)^2}{P(t)\left(1 - P(t)\right)}, \qquad \sigma_B^2(t^*) = \max_{1 \le t \le L} \sigma_B^2(t),$$

which is guaranteed to exist. Otsu's method is efficient under decent lighting conditions, for which the background and foreground show as two distinct groups in the histogram, as shown in Figure 3.6c. The peak representing the foreground is much smaller than the peak representing the background because there are far more background pixels than foreground pixels in the image. Since the histogram of an image with L intensity levels is represented by a 1-D array with L elements, the method is also computationally inexpensive.

Chapter 4

Development Environment and Code

This project is developed on a MacBook Pro laptop with a 2.3 GHz Intel Core i5 processor and 8 GB RAM. The project is proprietary and will not be open sourced.

The code is written and compiled using the IDE Xcode1. A main engine that carries out the text line and word segmentation and the beautification is written in 900 lines of C++ code and utilizes the OpenCV framework. The evaluation of the engine using the IAM database is performed in a C++ program that runs on the laptop.

The engine is also integrated in an iOS application implemented in Objective-C. The languages C++ and Objective-C interact directly in a file compiled as Objective-C++, which allows the iOS application to process images using the C++ engine.

An iPhone 6, shown in Figure 4.1, is used for deployment of the iOS application.

Figure 4.1: An iPhone 6.

4.1 OpenCV

OpenCV2 is an open source computer vision framework. It has C++, C, Java, and Python interfaces, which allow support for the most popular mobile platforms, such as iOS and Android. It is a powerful tool, as it contains hundreds of computer vision algorithms, including smoothing and thresholding [7]. It is chosen due to its portability, efficiency, and active maintenance.

1https://developer.apple.com/xcode/ide/

2http://opencv.org

Chapter 5

Method

A complete HWR system, as illustrated in Figure 5.1, typically consists of the following main steps: detection, segmentation, normalization, and classification.

Detection → Segmentation → Normalization → Classification

Figure 5.1: Parts of a HWR system

The proposed system in this thesis is focused on the segmentation and normalization steps, as the purpose is to segment handwritten text into text lines and perform baseline correction. Figure 5.2 illustrates the different parts of the proposed system and the order in which they are performed. The implemented methods in this thesis are chosen with respect to computational complexity and the requirement that they do not rely on supervised learning.

Binarization → CC analysis → Text line detection → Text line segmentation → Beautification / Word segmentation

Figure 5.2: Proposed system.

The binarization extracts the foreground of text from the background and produces a binary image. Then the connected component analysis finds connected components in the binary image and provides estimates of the text size. The text line detection finds text line zones, which are used in the text line segmentation to assign the connected components to text lines. The segmented text lines can then be segmented into words or beautified by baseline correction. The dashed arrow in Figure 5.2 indicates that word segmentation can be performed after beautification, but then there will be no ground truth to verify the results. The following sections describe each of these steps in turn.

5.1 Binarization

Binarization is the process of converting an image to a binary image where each pixel is represented by either black or white. Here, we want the binarization to produce an image where the background pixel values are equal to zero and the foreground pixel values are equal to 255. In order to classify each pixel as foreground or background, we make two general assumptions about handwritten text:

1. Handwritten text is written with a dark color on a bright background.

2. Areas containing mainly foreground pixels have a larger standard deviation of intensity values than areas containing mainly background pixels.

Noise in the image is first reduced by a Gaussian filter, a weighted average linear filter whose kernel weights are given by a Gaussian function with a standard deviation of 1 in the x and y directions. The kernel size used is 5×5.

The approach is then to consider non-overlapping square blocks of size w_b of the image, and by the second assumption we find blocks containing foreground by examining the standard deviation of the pixel values within the blocks. A mask image with M/w_b rows and N/w_b columns is created, such that each pixel represents a block. Blocks with a standard deviation above the average of all blocks are represented by ones in the mask image and the others by zeros.

In order to create an estimate of the background, the original image is downsampled to the same size as the mask image, and the background at the masked pixels is reconstructed using the OpenCV function inpaint, which is described in more detail in [17]. Inpainting is essentially an iterative process where the pixel values on the boundary of the mask are interpolated from the surrounding pixels until all the masked pixels are reconstructed.

The reconstructed part of the low resolution image is then upsampled and written over the foreground pixels of the original image. We now have an estimate of the background, as illustrated in Figure 5.3b, that represents the lighting variations. By taking the difference between the estimated background and the original image, we get an image where the foreground is easier to distinguish, as shown in Figure 5.3c. By thresholding the difference image using Otsu's method, we obtain a binary image, as illustrated in Figure 5.3d.

(a) Image to be binarized. (b) Interpolated background.

(c) Background subtracted from original. (d) Binarized image.

Figure 5.3: Binarization by background subtraction and Otsu’s method.
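A condensed sketch of the pipeline described above is given below. The block size and inpainting radius are illustrative values, not the tuned parameters of the thesis, and for brevity the upsampled background estimate is used directly rather than being written over only the foreground pixels:

#include <opencv2/opencv.hpp>
#include <opencv2/photo.hpp> // cv::inpaint

// Binarize a grayscale document image by estimating and subtracting the
// background before Otsu thresholding. wb is the block size in pixels.
cv::Mat binarize(const cv::Mat& gray, int wb = 32) {
    cv::Mat smooth;
    cv::GaussianBlur(gray, smooth, cv::Size(5, 5), 1.0, 1.0);

    // One mask pixel per wb x wb block; flag blocks whose standard
    // deviation exceeds the average block standard deviation.
    int mr = smooth.rows / wb, mc = smooth.cols / wb;
    cv::Mat sd(mr, mc, CV_64F);
    for (int r = 0; r < mr; ++r)
        for (int c = 0; c < mc; ++c) {
            cv::Scalar mean, stddev;
            cv::meanStdDev(smooth(cv::Rect(c * wb, r * wb, wb, wb)), mean, stddev);
            sd.at<double>(r, c) = stddev[0];
        }
    cv::Mat mask = sd > cv::mean(sd)[0]; // 8-bit mask, 255 at foreground blocks

    // Inpaint the flagged blocks in a downsampled image to estimate the
    // background cheaply, then upsample the estimate again.
    cv::Mat small, bgSmall, background;
    cv::resize(smooth, small, cv::Size(mc, mr));
    cv::inpaint(small, mask, bgSmall, 3, cv::INPAINT_TELEA);
    cv::resize(bgSmall, background, smooth.size());

    // Subtract the background and threshold the difference with Otsu.
    cv::Mat diff, binary;
    cv::absdiff(background, smooth, diff);
    cv::threshold(diff, binary, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    return binary;
}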

5.2 Connected Component Analysis

A Connected Component (CC) analysis is performed on the binary image in order to obtain some information about the text size and location. The CC analysis returns a label image and the bounding boxes of the CCs. In the label image, the background is represented by zeros and each group of 8-connected foreground pixels is represented by a number corresponding to the group label. Under the assumption that the text is written mainly horizontally, the average height h_CC and width w_CC of the bounding boxes provide an estimate of the text size. The sizes of the bounding boxes are also used to remove objects that are either too small or too large in area, or have a too small or too large ratio between height and width.
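A minimal sketch of this step using OpenCV's cv::connectedComponentsWithStats, which returns the label image and per-component bounding box statistics used here; the size and ratio limits are illustrative placeholders:

#include <opencv2/opencv.hpp>

// Estimate the average CC height from a binary image and drop implausible CCs.
double averageCCHeight(const cv::Mat& binary, cv::Mat& labels) {
    cv::Mat stats, centroids;
    int n = cv::connectedComponentsWithStats(binary, labels, stats, centroids, 8);

    double heightSum = 0;
    int kept = 0;
    for (int i = 1; i < n; ++i) { // label 0 is the background
        int w = stats.at<int>(i, cv::CC_STAT_WIDTH);
        int h = stats.at<int>(i, cv::CC_STAT_HEIGHT);
        int area = stats.at<int>(i, cv::CC_STAT_AREA);
        double ratio = static_cast<double>(h) / w;
        // Illustrative limits: remove specks, huge blobs, and extreme shapes.
        if (area < 10 || area > static_cast<int>(binary.total() / 4) ||
            ratio < 0.05 || ratio > 20) {
            labels.setTo(0, labels == i); // treat as background/noise
            continue;
        }
        heightSum += h;
        ++kept;
    }
    return kept > 0 ? heightSum / kept : 0.0;
}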

5.3 Text Line Detection

This section describes the method of finding text lines in a binary image, where the background pixels are zero and the foreground pixels are one. The text line detection is based on the assumption that text lines are written in a mainly horizontal direction. This still allows the text lines to be curvilinear. Furthermore, the detection is focused upon the corpus of the text lines, which is assumed to have a small variation in thickness and to be well separated from other text lines. The ascent and descent of a text line, on the other hand, may vary a lot and overlap with those of other text lines.

5.3.1 Vertical Projection Profile Method

Perfectly horizontal text lines are trivial to detect using the Vertical Projection Profile (VPP) [12] of an image. The VPP is obtained by projecting the image onto the vertical axis, which effectively is to take the sum of the pixel values in each row. The result shows distinct peaks where the text lines are. However, if the text lines are curvilinear, so that they cannot be separated by straight lines, there won't be any distinct peaks in the VPP. This problem can be solved by dividing the image into vertical segments of width w, as shown in Figure 5.4a,

$$\mathrm{VPP}(f(x,y), w) = \begin{bmatrix} \sum_{i=0}^{w} f(0,i) & \cdots & \sum_{i=N-1-w}^{N-1} f(0,i) \\ \vdots & \ddots & \vdots \\ \sum_{i=0}^{w} f(M-1,i) & \cdots & \sum_{i=N-1-w}^{N-1} f(M-1,i) \end{bmatrix}.$$

The width w is chosen small enough to linearly separate the text lines within each segment, and large enough to render peaks in the VPP. The VPPs are then stored in a 32-bit image with M rows and N/w columns. We cannot use the standard 8-bit format, since the sum of the pixel values in a row is likely to exceed 255.

In order to compensate for vertical segments that don't cover any text of a text line, we consider the Smoothed Vertical Projection Profile (SVPP), the dashed line shown in Figure 5.4b. Note in the figure that the SVPP shows peaks for all text lines although no text is covered in the sixth text line. The SVPP is produced by applying an averaging filter to the VPP image, which effectively smoothens each VPP and alters them to take their neighbors into account. The number of rows of the filter is set to h_CC, as we are only interested in peaks that are wide enough to cover the average height of the text lines. The number of columns of the filter is not as straightforward to decide, since a too narrow filter divides the text lines at large spaces and a too wide filter won't handle curvilinear text lines.

Figure 5.4c shows that the local maxima and minima of the first derivative of the SVPP indicate the midline and the baseline, respectively, of a text line. A straightforward method of finding the corpus, or the area between consecutive local maxima and minima, is to consider the second derivative. The corpus of the text line lies where the second derivative is negative, as illustrated in Figure 5.4d.

(a) Investigated vertical segment. (b) The VPP and SVPP of the segment. (c) Smoothed first derivative of the SVPP. (d) Smoothed second derivative of the SVPP.

Figure 5.4: Text line detection based on the SVPP and its derivatives.

However, the smoothing of the VPPs is a trade-off between being able to detect text lines where there is much space in between characters and the precision of the corpus location. Hence, for a good choice of the width parameter w and filter size, the output of this method will contain an area of negative values for each text line, as illustrated in Figure 5.6b. An image containing the labeled text line areas is obtained by performing a CC analysis on an image where the negative areas have been assigned the pixel value 255 and the others zero. The sizes of the CC bounding boxes are used to remove detected text lines that are either too small or too large in area, or have a too small or too large ratio between the height and width. The detected text line areas can then be used to assign the foreground pixels to certain text lines.
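A minimal sketch of the profile computation, assuming a binary input image with foreground pixels equal to one; each vertical strip is reduced to its row sums with cv::reduce:

#include <opencv2/opencv.hpp>

// Build the M x (N/w) VPP image: column j holds the row sums of strip j.
// Stored as 32-bit floats, since row sums easily exceed 255.
cv::Mat verticalProjectionProfiles(const cv::Mat& binary, int w) {
    int strips = binary.cols / w;
    cv::Mat vpp(binary.rows, strips, CV_32F);
    for (int j = 0; j < strips; ++j) {
        cv::Mat strip = binary(cv::Rect(j * w, 0, w, binary.rows));
        cv::Mat rowSums;
        cv::reduce(strip, rowSums, 1, cv::REDUCE_SUM, CV_32F); // sum each row
        rowSums.copyTo(vpp.col(j));
    }
    return vpp;
}

// Smooth the VPP image into the SVPP with an averaging filter whose height
// is the average CC height hCC; the filter width is a tuning parameter.
cv::Mat smoothProfiles(const cv::Mat& vpp, int hCC, int filterCols = 3) {
    cv::Mat svpp;
    cv::blur(vpp, svpp, cv::Size(filterCols, hCC));
    return svpp;
}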

5.3.2 Smearing Method

Another method of detecting text lines is the Smearing method. It utilizes smoothing filters to blur neighboring characters and words together, creating a blurry line representing each text line, as illustrated in Figure 5.5b. Since the gap between words may be as wide as the gap between text lines, the filter kernel needs to be selected accordingly. The height of the kernel is set to h_CC, so that the text lines are completely blurred horizontally but not fused with the text lines above and below. Selecting the width is a trade-off between handling curvilinear text lines well and a robust detection of text lines with larger gaps in-between words. The blurred image is then thresholded using Otsu's method in order to create a CC for each text line. In the same manner as in the VPP method, an image containing the labeled text line areas is obtained by performing a CC analysis.

(a) Binary image to be blurred. (b) Blurred image.

Figure 5.5: The Smearing method.
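A minimal sketch of the Smearing method, assuming h_CC has already been estimated by the CC analysis and the kernel width is left as a tuning parameter:

#include <opencv2/opencv.hpp>

// Blur characters and words into one blob per text line, then threshold
// with Otsu so each text line ideally becomes a single connected component.
cv::Mat smearTextLines(const cv::Mat& binary, int hCC, int kernelWidth) {
    cv::Mat blurred, smeared;
    // Horizontally elongated averaging filter: height hCC, width tuned to
    // bridge word gaps without fusing adjacent text lines.
    cv::blur(binary, blurred, cv::Size(kernelWidth, hCC));
    cv::threshold(blurred, smeared, 0, 255,
                  cv::THRESH_BINARY | cv::THRESH_OTSU);
    return smeared;
}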

5.4 Text Line Segmentation

This section describes the method of assigning the foreground pixels to the correct text line. The output from the text line detection algorithm provides guidelines for the boundaries of the text lines; combined with the bounding boxes of the CCs, the foreground pixels can be assigned to text lines by majority voting. This is implemented by considering the histogram over the non-zero pixel values in the labeled text line image for each CC bounding box, as illustrated in Figure 5.6b. The foreground CCs are then assigned to the text line that corresponds to the most frequent pixel value in the histogram. If a CC bounding box does not overlap with a text line area, it is assigned to the vertically nearest text line within 2h_CC. If the CC still isn't assigned to a text line, it is regarded as noise. Problems arise, however, if a foreground CC consists of several characters that belong to different text lines.

The text lines are now segmented and can be treated in separate images.

(a) Binary image to be segmented. (b) Superimposed text line areas in gray and CC bounding boxes in blue.

Figure 5.6: Text line segmentation.
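A minimal sketch of the majority vote, assuming a 32-bit label image of text line areas (lineAreas) from the detection step; the fallback assignment to the nearest line within 2h_CC is omitted for brevity:

#include <opencv2/opencv.hpp>
#include <map>

// Assign a CC (given by its bounding box) to the text line area whose
// label is most frequent inside the box; returns 0 if no area overlaps.
int assignToTextLine(const cv::Mat& lineAreas, const cv::Rect& box) {
    std::map<int, int> votes; // text line label -> pixel count in the box
    for (int y = box.y; y < box.y + box.height; ++y)
        for (int x = box.x; x < box.x + box.width; ++x) {
            int line = lineAreas.at<int>(y, x);
            if (line != 0) ++votes[line];
        }
    int best = 0, bestCount = 0;
    for (const auto& v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;
}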

5.5 Word Segmentation

The word segmentation is performed by considering the Horizontal Projection Profile (HPP) of the text line image. The HPP, shown in Figure 5.7b, is obtained by projecting the image onto the horizontal axis, which is effectively to take the sum over the pixel values in each column:

$$\mathrm{HPP}(f(x,y)) = \begin{bmatrix} \sum_{i=0}^{M-1} f(i, 0) & \cdots & \sum_{i=0}^{M-1} f(i, N-1) \end{bmatrix}.$$

A Smoothed HPP (SHPP), shown in Figure 5.7c, is obtained by smoothing the HPP using an averaging filter of width w_f, chosen so that the gaps within the words are closed but the gaps between the words are left open. By thresholding the result using Otsu's method, we obtain an image that reveals the boundaries of each word, as illustrated in Figure 5.7d. The obtained word zones are then labeled and used for assigning the connected components of the text line to different words, which is implemented similarly to the text line segmentation by a majority vote of the pixels within the bounding box of each connected component.

(a) Text line to be segmented into words.

(b) HPP of the text line image.

(c) Smoothed HPP of the text line image.

(d) Thresholded smoothed HPP using Otsu’s method.

Figure 5.7: Steps in the word segmentation process.
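A minimal sketch of the profile steps, assuming a binary text line image and a precomputed filter width w_f; the profile is rescaled to 8 bits, since OpenCV's Otsu implementation requires 8-bit input:

#include <opencv2/opencv.hpp>

// Compute the smoothed HPP of a text line image and threshold it with
// Otsu, yielding a 1 x N mask whose non-zero runs are the word zones.
cv::Mat wordZones(const cv::Mat& textLine, int wf) {
    cv::Mat hpp, shpp, zones;
    cv::reduce(textLine, hpp, 0, cv::REDUCE_SUM, CV_32F); // sum each column
    cv::blur(hpp, shpp, cv::Size(wf, 1)); // close intra-word gaps
    // Rescale to 0..255 and convert to 8 bits before Otsu thresholding.
    cv::normalize(shpp, shpp, 0, 255, cv::NORM_MINMAX);
    shpp.convertTo(shpp, CV_8U);
    cv::threshold(shpp, zones, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    return zones;
}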

5.6 Beautification

This section describes the method of finding the baseline of a text line, how it is corrected by a vertical translation, and how the text lines are aligned. The detection of the baseline and midline is performed using a vertical projection based technique on each image of the separated text lines, as in the example illustrated in Figure 5.8a. The text line images are traversed with a sliding window, in which the sum of the pixel values in each row is computed and stored in the center column of the window. By thresholding the result using Otsu's method, the image in Figure 5.8b is produced.

For each column of the VPP image there might be more than one run of consecutive foreground pixels. The longest run of foreground pixels in each column is therefore assumed to correspond to the corpus of the text line. By finding the highest and lowest pixels of the corpus, we obtain an estimate of the midline and baseline, as shown in Figure 5.8c. The baseline is then refined by comparing it with a line produced by finding the lowest foreground pixel in a small neighborhood, as shown in Figure 5.8d. The average distance between the midline and baseline provides a measure of the character height, which is used to refine the baseline, as illustrated in Figure 5.8e. In order to obtain a smooth and continuous baseline, the estimated baseline is smoothed using a linear filter with a kernel of size 1×h_CC. The distance of each point of the estimated baseline from the average y-position of the baseline is then used to translate each pixel column vertically in order to correct the baseline, as shown in Figure 5.8f.

(a) Text line to be baseline corrected.

(b) VPP image produced by sliding window and Otsu thresholding.

(c) Baseline and midline extracted from the VPP image.

(d) Baseline (black) and line produced from the lowest pixels in the image (gray).

(e) Refined baseline and midline.

(f) Result of baseline correction.

Figure 5.8: Steps in the baseline correction process.

Using the bounding boxes of the text lines, the positions of the text lines can be adjusted so that the left sides or the centers of the bounding boxes are aligned. The new left padding of a bounding box is based on the smallest distance to the left margin for left alignment, and on half of the difference between the image width and the bounding box width for center alignment.
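A minimal sketch of the final correction step, assuming the smoothed per-column baseline estimate has already been extracted as a vector of y-positions (one entry per pixel column):

#include <opencv2/opencv.hpp>
#include <numeric>
#include <vector>

// Straighten a text line by shifting each pixel column vertically so the
// estimated baseline lands on its average y-position.
cv::Mat correctBaseline(const cv::Mat& line, const std::vector<float>& baseline) {
    float avgY = std::accumulate(baseline.begin(), baseline.end(), 0.0f)
                 / baseline.size();
    cv::Mat out = cv::Mat::zeros(line.size(), line.type());
    for (int x = 0; x < line.cols; ++x) {
        int shift = cvRound(avgY - baseline[x]); // positive moves the column down
        for (int y = 0; y < line.rows; ++y) {
            int ny = y + shift;
            if (ny >= 0 && ny < line.rows)
                out.at<uchar>(ny, x) = line.at<uchar>(y, x);
        }
    }
    return out;
}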

Chapter 6

Results

6.1 Binarization

This section presents the results of the binarization by Otsu's method with background subtraction. Figure 6.1 illustrates a comparison of binarization using Otsu's method and Otsu's method with background subtraction.

(a) Image to be binarized. (b) Interpolated background.

(c) Background subtracted from original.

(d) Otsu’s method without background subtraction.

(e) Otsu's method with background subtraction.

Figure 6.1: Comparison of binarization using Otsu's method with and without background subtraction.

(31)

6.2 Text Line Segmentation

This section presents the results of the text line segmentation performed by the Vertical Projection Profile method and the Smearing method. Figure 6.2 shows an example of segmented text lines drawn in different colors. The obtained bounding boxes for the segmented text lines are compared with the ground truth bounding boxes. The performance is measured as the number of correctly detected text line bounding boxes divided by the total number of text lines. The dataset consists of 1539 forms containing 13353 isolated and labeled text lines. As the samples were binarized with a different method when the ground truth was created than the method used in this thesis, an error of 10 pixels in the vertical and horizontal displacement and in the width and height of the bounding boxes is tolerated. Figure 6.3 shows examples of the found bounding boxes compared to the ground truth. The average computation time is also presented for the implementations using the different methods.

(a) Binary image to be segmented. (b) Segmented text lines in different colors.

Figure 6.2: Text line segmentation result.

(a) Successful text line segmentation. (b) Unsuccessful text line segmentation.

Figure 6.3: The found text line bounding boxes are compared to the ground truth (blue boxes). Green boxes (correct) and red boxes (incorrect) are drawn over the ground truth.

(32)

6.2.1 Vertical Projection Profile Method

Figure 6.4 shows how the performance of the text line segmentation depends on the width w of the vertical segments. The performance is measured as the number of correctly detected text line bounding boxes, compared to the IAM database ground truth, divided by the total number of text lines. The average height h_CC of the foreground CCs is assumed to be proportional to the character width, and the parameter w is therefore expressed as a multiple of h_CC. The average execution time per image of the implementation using the VPP method is shown in Figure 6.5.

[Plot: performance versus vertical segment width as a multiple of h_CC; the best performance is 0.9542.]

Figure 6.4: Text line segmentation performance.

[Plot: execution time in seconds versus vertical segment width as a multiple of h_CC.]

Figure 6.5: Execution time for the implementation using the VPP method.

(33)

6.2.2 Smearing Method

Figure 6.6 shows how the performance of the text line segmentation depends on the width of the smoothing linear filter. The performance is measured as the number of correctly detected text line bounding boxes, compared to the IAM database ground truth, divided by the total number of text lines. The average height h_CC of the foreground CCs is assumed to be proportional to the character width, and the filter width is therefore expressed as a multiple of h_CC. The average execution time per image of the implementation using the Smearing method is shown in Figure 6.7.

[Plot: performance versus smoothing filter width as a multiple of h_CC; the best performance is 0.9393.]

Figure 6.6: Text line segmentation performance.

[Plot: execution time in seconds versus smoothing filter width as a multiple of h_CC.]

Figure 6.7: Execution time for the implementation using the Smearing method.

(34)

6.2.3 Hypothesis Test

In order to verify with statistical significance that the performances of the VPP method and the Smearing method in fact differ, a two-proportion Z-test is conducted. The Z-test is appropriate for problems with many samples, since it is based on the assumption that the test statistic is normally distributed. It is convenient, as it only has a single critical value for each significance level. We test the null hypothesis H0, that the proportions of correctly segmented text lines from the investigated methods are equal, versus the hypothesis H1, that the proportion of correctly segmented text lines from the VPP method is significantly larger than that from the Smearing method. The test statistic is

$$Z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = 5.42,$$

where p_1 = 0.9542 is the performance of the VPP method with the optimal parameter setting, p_2 = 0.9393 the performance of the Smearing method with the optimal parameter setting, n_1 = 13353 the sample size of the VPP method, n_2 = 13353 the sample size of the Smearing method, and p = (n_1 p_1 + n_2 p_2)/(n_1 + n_2). If Z > 1.96, we can reject the null hypothesis with 95% certainty. Z = 5.42 corresponds to a probability of less than 0.00001% that the null hypothesis is true. This is not a surprising result, since the sample size is large and the graphs in Figures 6.4 and 6.6 are smooth.
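Plugging in the numbers as a check (intermediate values rounded to four significant figures):

$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{0.9542 + 0.9393}{2} \approx 0.9468,$$

$$Z = \frac{0.9542 - 0.9393}{\sqrt{0.9468 \cdot 0.0532 \cdot \left(\frac{1}{13353} + \frac{1}{13353}\right)}} \approx \frac{0.0149}{0.002747} \approx 5.42.$$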

(35)

6.3 Word Segmentation

This section presents the results of the word segmentation performed by the HPP method. Figure 6.8 shows an example of segmented words drawn in different colors.

The performance of the word segmentation method is measured using the ground truth provided with the IAM database samples. The obtained bounding boxes for the segmented words are compared with the ground truth bounding boxes. The performance is measured as the number of correctly detected bounding boxes divided by the total number of words. The dataset consists of 1539 forms containing 115320 isolated and labeled words. As the samples were binarized with a different method when the ground truth was created than the method used in this thesis, an error of 10 pixels in the vertical and horizontal displacement and in the width and height of the bounding boxes is tolerated. Figure 6.9 shows examples of the found bounding boxes compared to the ground truth.

(a) Binary image to be segmented. (b) Segmented words drawn in different colors.

Figure 6.8: Word segmentation result.

Figure 6.9: The found word bounding boxes are compared to the ground truth (blue boxes). Green boxes (correct) and red boxes (incorrect) are drawn over the ground truth.

Figure 6.10 shows how the performance of the word segmentation depends on the width w_f of the smoothing linear filter. The width of the vertical segments is fixed at the empirically found optimal value w = 2. The average height h_CC of the foreground CCs is assumed to be proportional to the character width, and the parameter w_f is therefore expressed as k × h_CC, where 0 < k < 1.

[Plot: performance versus smoothing filter width as a multiple of h_CC; the best performance is 0.7297.]

Figure 6.10: Word segmentation performance.

6.4 Beautification

The survey in Appendix A was made in order to quantify the results of the baseline correction and the adjustment of the text layout. The survey was created using Polldaddy1 and was distributed via social media in order to reach a wide audience spanning several ages and backgrounds. It consists of ten questions, each presenting three options: the original document, the baseline corrected and left aligned document, and "Can't tell the difference". The participants are instructed to select the document they find the easiest to read, or "Can't tell the difference" if that is the case. The documents are randomly selected from the IAM Handwriting Database, and the order of the options is randomized for each participant. There were 42 participants in the survey, and their responses are shown in Table 6.1.

Table 6.1: Responses from the survey with 42 participants.

Processed document / Original document / Can't tell the difference

Question 1 22 10 10

Question 2 25 5 12

Question 3 16 6 20

Question 4 17 4 21

Question 5 24 8 10

Question 6 22 5 15

Question 7 23 3 16

Question 8 19 6 17

Question 9 21 3 18

Question 10 26 4 12

Total 215 (51%) 54 (13%) 151 (36%)

6.5 iOS Application

In order to demonstrate the use of text line segmentation and baseline correction, an application for iOS was developed. The application takes a cropped image of a handwritten document as input, from either the camera or the camera roll. The user can then apply baseline correction by toggling a switch and change the layout of the text by tapping the corresponding text alignment buttons. It is possible to select the original layout, left alignment, or center alignment. The functionality of the application is illustrated in Figure 6.11. It takes the application less than a second to process a 1028 x 1028 image on an iPhone 6. If the image is larger than that, it is downsampled to fit a 1028 x 1028 window.

(a) Start view of the application. (b) Image selected for processing.

(c) Baseline corrected and left aligned. (d) Baseline corrected and center aligned.

Figure 6.11: iOS Application and its functionality.

Chapter 7

Discussion

This section summarizes the results and brings the implemented segmentation and beautification methods to discussion. General conclusions are then drawn in the conclusions section.

7.1 Binarization

The binarization of the images in the IAM Handwriting Database is satisfactory, since the results match the ground truth to a great extent in the text line and word segmentation. This indicates that the occurrence of noise pixels that might affect the size of the detected bounding boxes is low. The images in the database are however not subject to great lighting variations, meaning that the effect of the background subtraction is marginal. The lighting variations are typically larger in documents captured using the iOS application, and background subtraction before thresholding using Otsu's method shows improvements in the binarization, as shown in Figure 6.1. The background may be estimated more accurately by inpainting the image at its original resolution, but since inpainting is an iterative and computationally expensive process, the processing time is greatly reduced by inpainting a low resolution image.

7.2 Text Line Segmentation

The investigated methods for text line segmentation show pleasing results, as the VPP method segmented 95.4% of the text lines and the Smearing method segmented 93.9% of the text lines correctly compared to the ground truth. Since the execution time is not significantly affected by the choice of method, the VPP method is the preferred text line segmentation method. It was however found that the ground truth contained some errors, as shown in Figure 6.3b. This indicates that the measure of correctness could be chosen differently to provide a more accurate result. The chosen measure does not take the pixels assigned to each text line into account, but only the size and location of the bounding boxes. Additionally, the chosen method of assigning the pixels to text lines fails completely in the case when a connected component belongs to two separate text lines, as shown in Figure 7.1. Since the occurrence of such cases is low in the IAM Handwriting Database, the investigated methods provided good results regardless of this issue.

Figure 7.1: Image containing CCs that belong to two text lines.

7.3 Word Segmentation

The results from the word segmentation imply that a more sophisticated method that takes the context into account is required, as the HPP method is only able to segment 73.0% of the words correctly compared to the ground truth. As the correctness was measured in the same way as for the text line segmentation, the measurement is subject to the same faults. A weakness of the chosen method is that the assignment of pixels to words is based on CCs, meaning that it cannot separate words correctly if a CC belongs to two separate words, as shown in Figure 7.2.

Figure 7.2: Image containing CCs that belong to two words.
