DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Improving character recognition by thresholding natural images

OSKAR GRANLUND
KAI BÖHRNSEN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Improving optical character recognition by segmenting natural images

OSKAR GRANLUND
KAI BÖHRNSEN

Bachelor's thesis in Computer Science
Date: June 5, 2017
Supervisor: Kevin Smith
Examiner: Örjan Ekeberg
School of Computer Science and Communication


Abstract

The current state-of-the-art optical character recognition (OCR) algorithms are capable of extracting text from images under predefined conditions. OCR is extremely reliable for interpreting machine-written text with minimal distortions, but images taken in natural scenes are still challenging. In recent years, the topic of improving recognition rates in natural images has gained interest as more powerful handheld devices are used. The main problems faced in recognition in natural images are distortions such as uneven illumination, font textures, and complex backgrounds. Different preprocessing approaches for separating text from its background have been researched lately. In our study, we assess the improvement achieved by two of these preprocessing methods, k-means and Otsu, by comparing their results from an OCR algorithm. The study showed that the preprocessing made some improvement in special cases but overall yielded worse accuracy than the unaltered images.


Sammanfattning

Today's optical character recognition (OCR) algorithms are capable of extracting text from images under predefined conditions. Modern methods have reached a high accuracy for machine-written text with minimal distortions, but images taken in natural scenes are still difficult to handle. In recent years, great interest in improving character recognition algorithms has arisen, as more powerful handheld devices are used. The main problem in recognition in natural images is distortions such as incident light, the texture of the text, and complex backgrounds. Different methods for preprocessing, and thereby separating the text from its background, have been studied recently. In our study, we assess the improvement achieved by preprocessing with two methods called k-means and Otsu by comparing the responses from an OCR algorithm. The study shows that Otsu and k-means can improve the accuracy under some conditions but in general give a worse result than the unaltered images.


Table of contents

1. Introduction
  1.1 Problem statement
  1.2 Scope and limitations
2. Background
  2.1 Text localization
    2.1.1 Edge and region based
    2.1.2 Neural networks
  2.2 Image segmentation
    2.2.1 Global Thresholding
      2.2.1.1 Histogram
      2.2.1.2 Clustering
    2.2.2 Local Thresholding
      2.2.2.1 Saliency
  2.3 Optical Character Recognition (OCR)
    2.3.1 Microsoft's Computer Vision
      2.3.1.1 Limitations
3. Method
  3.1 Dataset KAIST
  3.2 Text localization
  3.3 Thresholding
    3.3.1 Otsu's method
    3.3.2 K-means clustering
  3.4 Accuracy
4. Results
  4.1 Accuracy differences
    4.1.1 Thresholding advantages over unaltered
    4.1.2 Close to perfect thresholding but bad results
    4.1.3 Thresholding illumination problem
  4.2 Results with modified dataset
    4.2.1 Removed K-means images
    4.2.2 Removed Otsu's method images
5. Discussion
  5.1 Thresholding advantages over unaltered
  5.2 Thresholding disadvantages over unaltered
  5.3 Thresholding failure
  5.4 Self-Criticism
6. Conclusions
  6.1 Future work
7. References


1. Introduction

Technology has long been a tool for automating repetitive tasks that would otherwise have to be performed by a human. One such task is extracting the names and addresses of mail recipients or money transfers; it has been taken over by optical character recognition (OCR) techniques that make use of a camera and an OCR algorithm to identify the written characters.

The use of OCR techniques has been limited to predefined scannable surfaces containing specified frames for the words and characters of interest. Over the years, camera technology and the computational power of mobile phones have increased. This has led to more research concentrating on OCR algorithms that work in less restrictive environments, so that modern algorithms now achieve almost perfect results when interpreting machine-written text with minimal distortions [6].

Since handheld devices like smartphones have gained image-capture abilities and computational power comparable to previously much larger equipment, new use cases for OCR have evolved. Capturing images of natural scenes has become common, and using them with OCR could help increase the accessibility of historical documents or help the visually impaired read text in the real world [8].

Historical documents consist mostly of handwritten text, and today's algorithms are still inadequate for interpreting it with high accuracy [7]. Images of natural scenes are called natural images; they often contain more variation in text fonts and colors, and more distortion from illumination, than machine-written documents, and are therefore a challenge for conventional OCR algorithms.

These challenging conditions have led to a new line of research in the field of OCR in recent years, focused on modifying natural images so that they have properties similar to machine-written documents. Important factors to take into account when dealing with natural images in combination with OCR are compensating the skew of the text so that it is horizontal, compensating the viewing angle between the photographer and the text, and separating the text from its background [1][5].

This study concentrates on separating text from its background with an image binarization technique called thresholding before sending the image to a modern OCR engine, and on studying the difference in results between the unaltered image and the preprocessed image.

Previous work has been done in this area, and algorithms for thresholding have been developed. In this study, two well-known algorithms for binarization are analyzed: one based on histogram analysis and one based on color clustering. Within histogram analysis there is a well-known technique called Otsu's method, which is known for separating the foreground from the background. Color clustering is often used to narrow the search space when determining the color of the motive, or in combination with other algorithms.


1.1 Problem statement

The question we investigate in this study is: "Which thresholding method, clustering or histogram analysis, improves the accuracy of a modern OCR algorithm the most?"

The accuracy is measured as the percentage of correctly identified characters.

1.2 Scope and limitations

This study concentrates on thresholding rather than OCR; our focus is to preprocess the image by separating the text from the background. We rely on an existing OCR framework provided by Microsoft. The images are taken from the KAIST dataset, which contains a wide variety of images taken in real contexts, and are limited to the modern Latin alphabet.


2. Background

Character recognition in natural images is often based on three parts: text localization, image segmentation, and interpretation by an OCR algorithm. Text localization extracts the text areas of the image, image segmentation groups the image into regions of interest, and the OCR algorithm performs the interpretation.

2.1 Text localization

In order to provide a deeper understanding of the domain of extracting text areas from natural images, we present a few widely used methods. A lightweight approach, called edge- and region-based, operates under the assumption that text is contained within a border. Some methods are more conservative than others regarding the circumstances in which the picture is taken. Another aspect when choosing between an edge- and region-based solution and a neural network is the required computational power: neural networks tend to use far more computational power than edge-based algorithms [1].

2.1.1 Edge and region based

There are two especially interesting methods that use properties of edges in a picture to extract text.

A method called Quads uses the fact that text is often contained within a quadrilateral border. This is typical for many objects in natural images, for example books, road signs, stickers, and whiteboards, to mention a few [1]. The Quads algorithm identifies abrupt changes of color in the picture, which indicate the existence of an edge [2]. The edges function as a frame containing the text, which can then be extracted; this saves computational power, since only a relevant subset of the image needs to be processed. These regions may include some false positives, but those can be discovered by a method based on the fact that characters tend to have many edges: when all angles within such a region are measured, almost every angle should have a matching counter-angle [1].
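As an illustration of the edge-detection step such methods build on (not the Quads algorithm itself), the sketch below finds abrupt intensity changes with a Canny detector, assuming scikit-image is available and using a hypothetical input file; closed quadrilateral contours could then be searched for in the resulting edge map.

    # Minimal edge-map sketch, assuming scikit-image; "scene.jpg" is hypothetical.
    from skimage import io, color, feature

    image = io.imread("scene.jpg")
    gray = color.rgb2gray(image)            # edges are detected on intensity
    edges = feature.canny(gray, sigma=2.0)  # boolean edge map; sigma suppresses noise
    print(edges.sum(), "edge pixels found")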

Another edge-based feature extraction method is built upon character characteristics. The number of changes between concave and convex edges tends to be very small in a text area. Also, the number of inflection points tends to be much lower for a character than for non-text objects such as grass, flight plans, and bikes. This method is effective at eliminating false-positive regions [7].

2.1.2 Neural networks

Neural networks are a machine-learning model, trained on examples, that gives computers the ability to learn feature detection unsupervised. Feature extraction with neural networks has a great advantage over region-based methods, since it can recognize text without specific context. Its disadvantages are high use of CPU power and implementation complexity. Multilevel Gabor filters, meaning the repeated application of Gaussian-based filters to sub-portions of the image while analyzing the spatial variance and the peaks in the color histogram, can be used to build neural networks for separating text areas from graphical areas [3].


2.2 Image segmentation

Segmenting an image means dividing it into different groups of interest, for example background and foreground. Each pixel in the image is assigned a label which represents a property. Thresholding is one type of segmentation used by researchers to separate text from its background: it reduces the image to pixels labeled with only two colors. Global thresholding analyzes the entire image before setting the labels, while local thresholding takes the neighboring pixels into account.

2.2.1 Global Thresholding

All pixels in the image are analyzed before determining how the labels should be assigned to each of them. One well-known global thresholding method, described by N. Otsu, uses a gray-level histogram to estimate fore- and background colors with maximal contrast. Another often-used method is to group pixels into clusters after analyzing their positions in the color space.

2.2.1.1 Histogram

One often-mentioned algorithm, described by N. Otsu, reduces a grayscale image to a binarized one by analyzing its histogram. Each image is assumed to contain two classes of pixels: one for the foreground, in our case the text, and one for the background. To separate these classes, their between-class variance is maximized: the algorithm finds the threshold for which the variance between the class means, weighted by the class probabilities, is largest. The optimal threshold is found by iterating over all possible gray levels as potential thresholds [4]. Otsu's method can be extended to differentiate more than two classes by adding multiple thresholds, and it has been improved over time by different researchers refining the peak analysis [14].
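As a minimal sketch of how this is typically applied in practice, assuming scikit-image and a hypothetical input file, both the single-threshold method and its multi-class extension are available as single calls:

    # Otsu's threshold and its multi-class extension, assuming scikit-image.
    from skimage import io, color
    from skimage.filters import threshold_otsu, threshold_multiotsu

    gray = color.rgb2gray(io.imread("scene.jpg"))   # hypothetical input
    t = threshold_otsu(gray)                        # maximizes between-class variance
    binary = gray >= t                              # two classes: background/foreground
    ts = threshold_multiotsu(gray, classes=3)       # two thresholds -> three classes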

2.2.1.2 Clustering

A commonly used clustering method for character recognition in natural images is k-means. This method uses k clusters, and each pixel in the image belongs to one of them. Each cluster has one centroid, which represents the empirical mean of the previous iteration. Determining k is a difficult task, since each picture has its own unique properties. The goal is to minimize the squared distance between all pixels and their centroids. At first, the centroids are placed at random in the picture; after each iteration, each centroid is moved to the empirical mean, the center of all pixels within its cluster. Since this is an optimization problem that belongs to the complexity class NP-hard, it is often limited by a tolerance, measured as the distance between the centroid and the center of mass, or by a certain number of iterations [13]. One of the reasons k-means is such a popular choice is that it is relatively cheap in terms of computational power.

Clustering algorithms are very successful at narrowing down the number of possible colors in an image. A normal image in JPEG format can contain a maximum of about 16 million colors, which is a huge search space for a color classification algorithm. With clustering, we can reduce the picture to k colors, which greatly improves the efficiency and performance of classification algorithms.
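A minimal sketch of such color quantization, assuming scikit-learn and a hypothetical input image (k = 8 is an arbitrary illustrative choice):

    # Reduce an image to k colors with k-means, assuming scikit-learn.
    import numpy as np
    from sklearn.cluster import KMeans
    from skimage import io

    image = io.imread("scene.jpg")                  # hypothetical input, shape (h, w, 3)
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)

    km = KMeans(n_clusters=8, n_init=10).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]     # each pixel -> its centroid's color
    quantized = quantized.reshape(h, w, 3).astype(np.uint8)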


2.2.2 Local Thresholding

Local thresholding is similar to global thresholding, with the difference that the image is divided into smaller parts which are analyzed separately. The different parts of the image are later combined into one thresholded image [12].

A type of local thresholding called dynamic thresholding iterates over each pixel and sets it to either black or white. To determine whether a pixel should be black or white, a grayscale histogram is used in combination with a comparison against the neighboring pixels: if the pixel is darker than the average of its neighbors, it is set to black [5].
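A minimal sketch of such local thresholding, assuming scikit-image; the neighborhood size is an illustrative choice, not a value from this thesis:

    # Compare each pixel against the mean of its own neighborhood.
    from skimage import io, color
    from skimage.filters import threshold_local

    gray = color.rgb2gray(io.imread("scene.jpg"))   # hypothetical input
    local_t = threshold_local(gray, block_size=35, method="mean")
    binary = gray > local_t                         # darker than local mean -> False (black)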

2.2.2.1 Saliency

Saliency is a measure of how much a pixel stands out from its neighbors. In the context of separating text from the background, saliency exploits the fact that text in real scenes is often designed to draw attention. This is done by writing the text on a surface that differs from both the majority of the background and the text itself, which boosts the saliency values in the text region and can be used to segment the text from the background [11].

2.3 Optical Character Recognition (OCR)

State-of-the-art OCR frameworks often rely on advanced technologies such as neural networks. By using machine learning, the algorithm does not follow a set of predefined rules; instead, it learns feature detection unsupervised from previous tasks.

Today's OCR frameworks often keep the details of their implementation secret. Microsoft, however, states that its OCR is built upon Azure machine learning, which is publicly available, without disclosing any implementation specifics. A big research field within Microsoft is reading unstructured text, which is often what character recognition in natural images amounts to. Microsoft uses neural networks to extract the relevant areas [15].

2.3.1 Microsoft’s Computer Vision

Before an image is ready to be interpreted by an OCR algorithm, it must be skew-corrected. Microsoft's Computer Vision API uses a recognition step to detect text areas that are not horizontal and rotates the image by the needed angle to achieve horizontally aligned text areas. This step is called skew correction [15].

Figure 2.3.1.1: Skew-correction by Microsoft’s Computer Vision


2.3.1.1 Limitations

The use of the API is limited to images of at least 50 by 50 pixels. Microsoft states that the recognition rate is highly dependent on the quality of the image and that inaccurate readings can be caused by several factors. Handwritten, cursive, and small text styles are partly supported but may result in low recognition rates. Images including blur, complex backgrounds, glare, or shadows are stated to be hard for the framework to read. Microsoft also mentions in the documentation that images dominated by text are more likely to receive false positives in the results.


3. Method

The dataset used for this study contains a number of different images of real scenes; we carefully selected the ones that contain only modern Latin alphabetic characters. Each image is annotated with coordinates which together form a region containing the text, and these coordinates are used to extract the text area. The extracted images are thresholded with Otsu's method and with a variant of k-means using two clusters. Implementations of these methods from the Python scikit-learn library were used. After thresholding, the extracted images were sent to Microsoft's OCR framework for evaluation. All of these steps were done automatically except for the extraction itself: the masking coordinates were gathered manually from the annotations.

3.1 Dataset KAIST

KAIST's Scene Text Database, published for the ICDAR conference in 2011, was used in this study. The dataset contains 3,000 images captured with mobile phones or system cameras, together with annotations containing metadata. Separate classes were made for the languages English and Korean, and a mixed category was added containing randomly chosen images from these classes. For our thesis, 191 images from the English class were selected, and their annotations were reviewed to be correct and to contain only Latin letters.

Another criterion for our selection was the limitation of Microsoft's Computer Vision OCR to images with a height and width of at least 50 pixels; images containing smaller text areas were therefore removed.

Figure 3.1.1: Sample of images selected from the KAIST Scene Text Database

3.2 Text localization

Character recognition is, as mentioned above, used in combination with text detection. The annotations contain information about the position of the text, which is used to extract the text areas. Some of the selected images contain multiple text areas, which results in a total of 260 separate text-area images used in this thesis.


Figure 3.2.1: Extracted text areas of some sample images

3.3 Thresholding

We evaluated two methods for binarizing (thresholding) images: Otsu's threshold selection method and clustering with k-means.

3.3.1 Otsu's method

Images in the KAIST dataset are stored as matrices containing the RGB channels of each pixel. The RGB values are reduced from a three-dimensional space to a one-dimensional grayscale space. The grayscale value $Y$ is calculated with the following formula:

$Y = 0.2125 \cdot R + 0.7154 \cdot G + 0.0721 \cdot B$, where $0 \le R, G, B, Y \le 255$

Figure 3.3.1.1: Example of image conversion from RGB to grayscale
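The conversion can be sketched directly in NumPy (the file name is hypothetical); these weights are the same ones used by scikit-image's rgb2gray:

    # Grayscale conversion Y = 0.2125 R + 0.7154 G + 0.0721 B, using NumPy.
    import numpy as np
    from skimage import io

    rgb = io.imread("scene.jpg").astype(float)      # shape (h, w, 3)
    weights = np.array([0.2125, 0.7154, 0.0721])
    gray = rgb @ weights                            # shape (h, w)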

The optimal threshold is then selected as Otsu proposed in his paper [4], by maximizing the discriminant measure of separability between the resulting classes of gray levels.

A gray level is a value on the grayscale different from zero. The highest gray level found in the image is denoted $L$, where $1 \le L \le 255$. The number of pixels with gray level $i$ is denoted $n_i$ and is normalized to a probability $p_i$, where $p_i \ge 0$ and $\sum_{i=1}^{L} p_i = 1$.

To find the optimal threshold, a candidate value $k$ is initially set to zero and then repeatedly set to each gray level up to $L$. The image is assumed to contain two classes, one for the background and one for the objects. The probabilities of these classes are

$\omega_0 = \sum_{i=1}^{k} p_i \qquad \omega_1 = \sum_{i=k+1}^{L} p_i$

The class means are

$\mu_0 = \sum_{i=1}^{k} \frac{i \, p_i}{\omega_0} \qquad \mu_1 = \sum_{i=k+1}^{L} \frac{i \, p_i}{\omega_1}$

and the between-class variance is calculated as

$\sigma_B^2 = \omega_0 \, \omega_1 \, (\mu_1 - \mu_0)^2$

The optimal threshold $k^*$ can therefore be written as

$\sigma_B^2(k^*) = \max_{1 \le k < L} \sigma_B^2(k)$

Figure 3.3.1.2: Examples of optimal thresholds selected by histogram analysis

Images are then binarized by setting pixels with a gray level higher than or equal to the optimal threshold to zero (0) and pixels with a lower gray level to one (1).

Figure 3.3.1.3: Example of image conversion from grayscale to binarized
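A from-scratch sketch of this threshold selection, following the formulas above and assuming 8-bit grayscale input (levels 0-255); it is an illustration, not the library implementation used in the thesis:

    # Exhaustive search for the threshold maximizing the between-class variance.
    import numpy as np

    def otsu_threshold(gray):
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        p = hist / hist.sum()                     # normalized histogram p_i
        i = np.arange(256)
        best_k, best_var = 0, 0.0
        for k in range(1, 256):
            w0, w1 = p[:k].sum(), p[k:].sum()     # class probabilities omega_0, omega_1
            if w0 == 0 or w1 == 0:
                continue
            mu0 = (i[:k] * p[:k]).sum() / w0      # class mean mu_0
            mu1 = (i[k:] * p[k:]).sum() / w1      # class mean mu_1
            var_b = w0 * w1 * (mu1 - mu0) ** 2    # between-class variance sigma_B^2
            if var_b > best_var:
                best_var, best_k = var_b, k
        return best_k

    # Binarize as described above: levels >= threshold -> 0, lower levels -> 1.
    # binary = (gray < otsu_threshold(gray)).astype(np.uint8)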

3.3.2 K-means clustering

Image segmentation can be done by clustering pixels, and a commonly used method is k-means. In this thesis an existing implementation of k-means is used. An image is loaded into an array containing raw RGB data, and each value in the RGB triplet is normalized to a value ranging from zero (0) to one (1). The k-means algorithm finds, for each cluster, the subset of pixels such that the squared error between the pixels of the subset and its centroid is minimized. Each subset represents one cluster, and the grouping of pixels is shown by coloring each pixel with its nearest mean's color.

K-means is an optimization problem and belongs to the complexity class NP-hard. However, its running time can be bounded by adding a tolerance distance and a maximum number of iterations. In this thesis it is run with a tolerance of 10⁻⁴ length units and a maximum of 300 iterations. The distance is calculated by comparing the squared error between the centroid and the empirical mean within each cluster.

The objective minimized by k-means can be written as

$\sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

where $k$ is the number of clusters, $S_i$ is the set of pixels in cluster $i$, and $\mu_i$ is the position of that cluster's centroid. After each iteration the empirical mean within each cluster is calculated and the centroid is moved to it; this is repeated until 300 iterations have been run or the tolerance is reached [13].

For the thresholding approach, the image is divided into two clusters: we run k-means with k = 2.

Figure 3.3.2.1: Example of an original image and its clustering performed by k-means

Pixels assigned to the first cluster are set to one (1) for the binarization step of the thresholding. The remaining pixels, those assigned to the second cluster, are set to zero (0).

Figure 3.3.2.2: Example of an original image and its binarization after clustering
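A minimal sketch of this clustering step with the parameters stated above (k = 2, tolerance 10⁻⁴, at most 300 iterations), assuming scikit-learn and a hypothetical input file:

    # Binarize a text area by clustering its pixels into two groups.
    from sklearn.cluster import KMeans
    from skimage import io

    image = io.imread("text_area.jpg")              # hypothetical extracted text area
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3) / 255.0           # normalize RGB to [0, 1]

    km = KMeans(n_clusters=2, tol=1e-4, max_iter=300, n_init=10).fit(pixels)
    binary = km.labels_.reshape(h, w)               # cluster 0 -> 0, cluster 1 -> 1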

3.4 Accuracy

The accuracy of the text recognition has been measured by comparing all characters found by the Microsoft OCR engine to the correct annotations. Each correctly identified character adds one point to the total score, with no consideration of upper or lower case. The score is then divided by the total number of characters, which gives the accuracy as a percentage. The number of letters recognized in the wrong case and the number of false positives are counted as well; a false positive is a character that is recognized but does not exist in the annotations. Our ambition with this measurement is to provide a fair comparison between k-means, Otsu, and the unaltered images.
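Since the exact matching procedure is not specified beyond this description, the sketch below assumes a simple case-insensitive multiset comparison of the characters:

    # Accuracy = matched characters / annotated characters, ignoring case;
    # extra recognized characters are counted as false positives.
    from collections import Counter

    def score(ocr_text, annotation):
        found = Counter(ocr_text.lower())
        truth = Counter(annotation.lower())
        correct = sum(min(found[c], truth[c]) for c in truth)
        false_positives = sum((found - truth).values())
        return correct / sum(truth.values()), false_positives

    print(score("HeIlo", "Hello"))  # one misread character -> (0.8, 1)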


4. Results

In figure 4.1 the average values of the results are presented; more variation in the average accuracy was expected. We measured the caps error as the percentage of times an uppercase character was recognized instead of a lowercase one, or vice versa. Caps errors are counted as correctly identified characters. Some graphical objects are similar to characters and may occur in the response from the OCR framework; these are called false positives and do not affect the accuracy.

Method      Caps error   False positives   Accuracy
Unaltered   13%          6%                68%
K-means     10%          4%                65%
Otsu        12%          4%                63%

Figure 4.1: Summary of average results

4.1 Accuracy differences

When analyzing the results we found three interesting groups of pictures: one where thresholding performed better than the unaltered images, one where the thresholding performed well but the OCR responded with negative results, and one where we encountered illumination problems leading to low accuracy.

4.1.1 Thresholding advantages over unaltered

In 49 out of 260 pictures, k-means resulted in 100% accuracy from the Computer Vision API while the unaltered images got 0% accuracy. The same was the case for 35 images preprocessed with Otsu's method; 34 of these images resulted in 100% accuracy for both k-means and Otsu's method.

Figure 4.1.1.1: A few examples where the unaltered image achieved 0% accuracy whilst the thresholded image achieved 100% accuracy


4.1.2 Close to perfect thresholding but bad results

We found that some of the thresholded pictures achieved 0% accuracy while their unaltered counterparts received 100% accuracy. In figure 4.1.2.1, the unaltered image is shown to the left and the thresholded image to the right.

In total, 24 of the 260 k-means images were close to perfectly thresholded yet resulted in 0% accuracy; for Otsu's method there were 18 such images.

Figure 4.1.2.1: A few examples when unaltered performed 100% accuracy whilst thresholded performed 0% accuracy

4.1.3 Thresholding illumination problem

In total, 11 out of 260 pictures were affected by illumination from external light. This affected our results negatively, because the unaltered images could score a higher accuracy whilst the thresholded images resulted in low accuracy.

Figure 4.1.3.1: Example of when the motive is grouped into the background


4.2 Results with modified dataset

In section 4.1.2 we saw that a large number of pictures had very good thresholding but bad results; here we rerun our experiment without those pictures.

4.2.1 Removed K-means images

We removed the 24 images where k-means thresholded almost perfectly (see section 4.1.2) but achieved 0% accuracy, leaving a total of 236 images.

Method      Caps error   False positives   Accuracy
Unaltered   12%          6%                65%
K-means     10%          5%                70%

Figure 4.2.1.1: Summary of modified average results

4.2.2 Removed Otsu's method images

We removed the 18 images where Otsu's method thresholded almost perfectly (see section 4.1.2) but achieved 0% accuracy, leaving a total of 242 images.

Method      Caps error   False positives   Accuracy
Unaltered   14%          6%                66%
Otsu        13%          5%                68%

Figure 4.2.2.1: Summary of modified average results


5. Discussion

The results show that the differences between the unaltered, Otsu, and k-means images are very small, almost negligible. On average, the unaltered images perform slightly better than k-means, and Otsu performs slightly worse than k-means; however, no matter which of the three methods is used, the difference lies within a span of 5 percentage points. One problem with the grouping of pixels is that images with uneven illumination or varying color may cause a subset, or the whole set, of characters to be grouped into the background.

The KAIST dataset has a wide variety of pictures with very few limitations: the text was not always fronto-parallel to the camera, and some pictures had varying text color, in some cases caused by illumination. This, in combination with the good thresholding but bad results described in section 4.1.2, was the reason Otsu's method and k-means failed to improve the accuracy.

What Microsoft publishes about its framework is mostly a very high-level abstraction of the implementation. One way to eliminate the uncertainties of the OCR implementation would be to try more frameworks, preferably open-source implementations. This would give a good comparison of how big a factor the OCR implementation is for our preprocessed images.

Before we decided to go forward with this idea, we tried sending perfectly thresholded images (manually thresholded to black and white) to the Microsoft OCR framework and got positive results. However, when we preprocessed images we ended up with a total of 50 cases where the segmentation was good but, we believe, jagged edges caused problems for the OCR. When Otsu and k-means group pixels, the processed images have characters with jagged edges due to the low resolution of the unaltered image.

Both Otsu's method and k-means with k = 2 guarantee that the image is binarized into one black and one white color, but they do not guarantee that the text is black on a white background. We ran tests to see whether black text on a white background would yield a different result than white text on a black background, and the differences proved negligible.

5.1 Thresholding advantages over unaltered

Images containing a complex background with different patterns and colors gained higher accuracy after thresholding when the background's color variation differed from the text color. The dataset contains images that should be easily interpretable by humans and that are therefore designed to draw attention. This attention is achieved by placing the text on surfaces with a different color, which is a great property for the thresholding methods used in this thesis.


5.2 Thresholding disadvantages over unaltered

In figure 4.1.2.1 we can see just a few of the many examples where the thresholding seems to have performed very well but the results from Microsoft's OCR framework came back negative. The unaltered images achieve 100% accuracy in all three cases whilst the thresholded images achieve 0%. To the human eye, the thresholded image is probably easier to read than the unaltered one, but the OCR may have a problem with the jagged edges that can be seen in the thresholded images in figure 4.1.2.1.

We applied a Gaussian filter to reduce the jagged edges, as can be seen in figure 5.2.1, but the result remained the same. The Gaussian filter managed to reduce the number of jagged edges, but since the result is not perfect we cannot rule out that jagged edges interfered with the results.

Figure 5.2.1: Image with applied Gaussian filter
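A minimal sketch of this smoothing step, assuming SciPy; the sigma value is illustrative, since the thesis does not state the one used:

    # Soften staircase artifacts in a binarized image with a Gaussian filter.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    binary = np.random.randint(0, 2, (50, 120)).astype(float)  # placeholder binarized image
    smoothed = gaussian_filter(binary, sigma=1.0)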

5.3 Thresholding failure

In figure 4.1.3.1 we show the problem with illumination: in total there were 11 pictures out of 260 where the text fades into the background. This is another factor in why k-means and Otsu perform worse than the unaltered images on average.

5.4 Self-Criticism

In this thesis we only used one state-of-the-art framework, provided by Microsoft. This causes uncertainty in our results, since we do not know which problems are solely related to Microsoft's OCR implementation. We used the Microsoft Computer Vision framework because it fit our usage restrictions.

During this project we did not manage to increase the number of clusters in k-means, due to the difficulty of choosing which cluster contains the text. This forced us to use k = 2, where one cluster is set to black and the other to white.


6. Conclusions

The results showed that, on average, preprocessing with thresholding did not improve the accuracy of Microsoft's OCR framework.

For images containing complex backgrounds in combination with high variance between the background and text colors, thresholding did improve the OCR's capability to recognize the characters present.

A small subset of the images contained strong illumination and shadows, resulting in a loss of contrast that caused text and background to melt together under our preprocessing methods.

This thesis showed that simple methods such as k-means with two clusters and Otsu's method can, under such favorable conditions, improve the accuracy of a modern state-of-the-art OCR framework.

6.1 Future work

This thesis showed that illumination is a problem for both histogram analysis and clustering.

As a starting point for future work, we suggest investigating how to increase the number of clusters in combination with a reliable method for selecting the text color. Future comparison studies may also investigate the use of multiple thresholds in Otsu's method.


7. References

1. P. Clark, M. Mirmehdi (2002), Character Recognition in Real Scenes. IJDAR, Springer-Verlag, pp. 243-257.

2. M. Petrou, J. Kittler (1991), Optimal edge detectors for ramp edges. IEEE Trans. Pattern Anal. Mach. Intell., pp. 483-491.

3. A. K. Jain, S. Bhattacharjee (1992), Text segmentation using Gabor filters for automatic document processing. Machine Vision Appl., pp. 169-184.

4. N. Otsu (1979), A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Systems, Man, and Cybernetics, pp. 62-66.

5. M. White, G. D. Rohrer, Image Thresholding for Optical Character Recognition and Other Applications Requiring Character Image Extraction.

6. J. J. Weinman, E. Learned-Miller, A. Hanson (2009), Scene text recognition using similarity and a lexicon with sparse belief propagation. IEEE Trans. Pattern Anal. Mach. Intell., pp. 1733-1746.

7. L. Neumann, J. Matas (2012), Real-Time Scene Text Localization and Recognition. CVPR IEEE Conference, Providence, pp. 1-8.

8. U. Pal, N. Sharma, T. Wakabayashi, F. Kimura (2007), Off-line handwritten character recognition of Devnagari script. International Conference on Document Analysis and Recognition (ICDAR), Curitiba, PR, Brazil: IEEE, pp. 496-500.

9. X. Chen, A. Yuille (2004), Detecting and Reading Text in Natural Scenes. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Los Angeles, USA: CVPR IEEE.

10. R. Gao, F. Shafait, S. Uchida, Y. Feng (2014), A Hierarchical Visual Saliency Model for Character Detection in Natural Scenes. CBDAR 2013, Washington, USA: Springer International Publishing Switzerland, pp. 18-19.

11. R. Gao, S. Uchida, F. Shafait, V. Finken (2014), Visual Saliency Models for Text Detection in Real World. PLoS One 9.

12. J. Weszka (1978), A Survey of Threshold Selection Techniques. Elsevier Inc., pp. 259-265.

13. A. K. Jain (2009), Data clustering: 50 years beyond K-means. Elsevier Inc., pp. 651-661.

14. J. Kittler, J. Illingworth (1985), Threshold Selection Based on a Simple Image Statistic. Elsevier Inc., pp. 125-147.

15. Microsoft Azure (2016), Text Analytics Documentation. URL: https://docs.microsoft.com/sv-se/azure/cognitive-services/computer-vision/home#a-nameocroptical-character-recognition-ocra


www.kth.se
