
Linköpings universitet

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-G--19/025--SE

Automated system tests with image recognition

focused on text detection and recognition

Automatiserat systemtest med bildigenkänning

– fokuserat på text detektering och igenkänning

Moa Eriksson

Oskar Olsson

Supervisor: Azeem Ahmad
Examiner: Ola Leifler


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Moa Eriksson Oskar Olsson


Abstract

Today’s airplanes and modern cars are equipped with displays to communicate important information to the pilot or driver. These displays need to be tested for safety reasons; displays that fail can be a huge safety risk and lead to catastrophic events. Today displays are tested by checking the output signals or with the help of a person who validates the physical display manually. However, this technique is very inefficient and can lead to important errors going unnoticed. MindRoad AB is searching for a solution where validation of the display is made from a camera pointed at it; text and numbers are then recognised using a computer vision algorithm and validated in a time-efficient and accurate way.

This thesis compares three different text detection algorithms, EAST, SWT and Tesseract, to determine the most suitable for continued work. The chosen algorithm is then optimised and the possibility to develop a program which meets MindRoad AB's expectations is investigated. As a result, several algorithms were combined into a fully working program to detect and recognise text in industrial displays.


Acknowledgments

We would like to thank MindRoad AB for providing us with the assignment, office space and all the help. Especially Anders Larsson for the assistance regarding the project.

We would also like to give our thanks to our supervisor Azeem Ahmad and examiner Ola Leifler for all constructive feedback and help with our thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Company background
  1.2 Background
  1.3 Aim and research question
  1.4 Limitations and challenges
    1.4.1 Image characteristics
  1.5 Scope
    1.5.1 Input and detection area
2 Theory
  2.1 Algorithms
    2.1.1 Image pre-processing
    2.1.2 Text detecting
    2.1.3 Text recognition
3 Method
  3.1 Timeline
  3.2 Pre-study
    3.2.1 Literature study
    3.2.2 Implementation
  3.3 Data collection
  3.4 Actual implementation
    3.4.1 Physical setup
    3.4.2 Pre-processing
    3.4.3 Tesseract
  3.5 Evaluation
    3.5.1 Measure detection
    3.5.2 Measure recognition
    3.5.3 CRR
    3.5.4 CPR
    3.5.5 WRR
4 Result
  4.1 Implementation study
    4.1.1 Recall
    4.1.2 Precision
    4.1.3 Summary
  4.2 Actual implementation
5 Discussion
  5.1 Result
    5.1.1 Measuring detection
    5.1.2 Images and threshold
    5.1.3 Measure recognition
    5.1.4 Recall and Word Recognition Rate
  5.2 Method
    5.2.1 Pre-study and selection of algorithms
    5.2.2 Pilot study and algorithm performance
    5.2.3 Data collection
    5.2.4 Implementation
    5.2.5 Evaluation
  5.3 Wider context
6 Conclusion
  6.1 Future work
Bibliography


List of Figures

1.1 Examples of displays provided by MindRoad AB (approximate drawings of the copyrighted originals), used as a baseline to collect other similar looking images for testing.
1.2 Reflection example; the left side of the image is dazzled with background light.
1.3 Overview of the approach; a photo of the display together with interesting areas and their expected data is inputted into the script, which returns true or false.
1.4 Example of input area and detected area.
2.1 Overview of Tesseract's LSTM OCR engine.
3.1 Timeline for studying the algorithms.
3.2 The images used in testing, collected from the internet and from our close environment.
3.3 ArUco markers and how they are placed when testing.
3.4 Expected and detected area.
3.5 Illustration of recall.
3.6 Illustration of precision.
3.7 Example of recognition with expected and detected characters.
4.1 Overview of recall.
4.2 Overview of precision.
4.3 Detection and recognition on images from table 4.4.
5.1 Tesseract's detection when used on green LED displays.


List of Tables

2.1 Tesseract's different PSM modes explained.
4.1 The calculated recall for each image.
4.2 The calculated precision for each image.
4.3 The total recall and precision calculated over all images for each algorithm.
4.4 The recognition rates (CRR, CPR and WRR) calculated over a selection of images.


Lexicon

ICDAR - International Conference on Document Analysis and Recognition. The abbreviation can refer to the conference itself or to the database of images provided at the conferences.

OCR - Optical Character Recognition
ER - Extremal region detection
EAST - Efficient and Accurate Scene Text detector, explained in section 2.1.2.1
SWT - Stroke width transform, explained in section 2.1.2.2
LSTM - Long Short-Term Memory, explained in section 2.1.2.3
PSM - Page segmentation mode, explained in section 2.1.2.3
TORC - Transym optical character recognition

Recall - See equation and explanation in section 3.5.1.0.1
Precision - See equation and explanation in section 3.5.1.0.2
Accuracy - See equation and explanation in section 3.5.1.0.3
CRR - Character recognition rate, see section 3.5.2
CPR - Character precision rate, see section 3.5.2
WRR - Word recognition rate, see section 3.5.2

(10)

1 Introduction

Today displays are used in a wide range of electronics, everything from fridges to airplanes. They are an important part of the interaction between man and machine, conveying important information between the user and the machine. Testing the displays is a crucial step in development to prevent potential flaws; for example, when developing screens for airplanes, a flaw in the display can convey faulty information to the pilot and lead to a crash. Automating tests for displays speeds up development by reducing manual labour and time complexity. Automated tests for displays are already a part of development but rely on simulations and screenshots; however, not all faults can be detected using simulations. Performing tests on a photo of the actual display avoids this problem, but today this is mostly done manually, and manual testing is both time consuming and expensive.

Modern computer vision has the potential to automate the testing of photos of the display. However, most text detection algorithms are focused on general open-scene photos, for example road signs, licence plates and other complex, noisy photos. They do not have sufficient performance to be used in testing; however, focusing on photos with specific characteristics allows for optimisation of the algorithm in order to improve performance. This paper explores a range of different algorithms and proposes a way to optimise one of these algorithms in order to work with photos of displays with certain characteristics.

The following sections will explain the background, aim and limitations for this thesis. The main topic is detection and recognition of words in industrial dashboards such as displays in airplanes or modern cars.

1.1 Company background

This project was requested by the company MindRoad AB. MindRoad is a small company with approximately 30 employees. They are a consultancy firm and mainly work with software systems and software development. MindRoad is looking for a solution for automated tests of photos of displays with image recognition. The tests should be highly accurate and extract correct values, text and numbers from the photo of the display.


This will be used as a confirmation of the display working correctly when new features or bug fixes are introduced.

1.2 Background

Wang [1], among others, has expressed that testing displays, for example by checking output signals, is not always reliable and can lead to undetected faults. Manual testing is also expensive and time-consuming. By running automated tests on a photo of the actual display, these potential undetected faults can be discovered in a time- and cost-efficient way. For this to be viable, the tests carried out must have extremely high recall as well as fairly good precision in a time-efficient manner so that faults like the ones mentioned above can be found. Current text detection algorithms do not have sufficient² recall. However, algorithms used for text detection today focus mostly on detection in natural scenes. The algorithms performing best in recall for these environments have a precision rate under 83%, as seen in the ICDAR competition 2015 [2]. This result is not sufficient³. Multiple studies [2, 3, 4, 5, 6, 7, 8] measure the outcome of the detection in recall and precision, and some of them also use accuracy. How this is measured is explained in section 3.5.1.

Tao et al. [3] worked with industrial dashboards and reached an overall accuracy of 94.2%. They used a hybrid solution with an OpenCV tool called extremal region detection (ER). The ER method was originally designed by Neumann and Matas [4] in 2012 and reached a 64.7% precision in the ICDAR competition. Tao et al. modified the ER algorithm in different stages so that it would fit their needs and managed to reach a better result for their type of images. This paper explores some of the most common algorithms for text detection and recognition. It also proposes modifications which use the specific image characteristics to outperform current text detection systems on photos with the characteristics described in section 1.4.1.

Examples of test cases to handle might be:

• Does this text occur in this certain area of the display?

• Is there a number between 1 and 10 in this area of the display?

• Is the number in this area between 20 and 30 when the number in another area is 1?

• Is there a number to the right of this text within a specified area?

1.3 Aim and research question

Displays are used in a wide range of appliances and electronics. They are an important part of the interaction between humans and machines. The displays handled in this paper are similar to those found in cars or airplanes, see figure 1.1. They have a variation of text and images. The background is often, but not always, dark and the text can have different colours. For more information about the image characteristics, refer to section 1.4.1.

² For this thesis, 60% recall is considered sufficient for an assisting program and 99% recall is considered sufficient for a stand-alone program.


(a) Example of car display (b) Example of airplane display

Figure 1.1: Examples of displays provided by MindRoad AB, since the original images were copyrighted this is an approximate drawing of how they looked. These images are used as a baseline to collect other similar looking images for testing.

There exist a lot of different text detection algorithms on the market today, both open-source and commercial. In preparation for this thesis, three of these text detection algorithms were selected: EAST (Efficient and Accurate Scene Text detector), SWT (Stroke Width Transform) and Tesseract. Several algorithms were looked into in a pilot literature study; these were then narrowed down by pilot tests. When performing these studies the focus was on recall, since it was deemed the most important for building test cases. Precision and speed were deemed less important for the purpose.

As text recognition in single rows already has a sufficiently high recall rate [9], we chose not to focus on the recognition part and instead to work with one of the most popular Optical Character Recognition (OCR) engines, Tesseract [10]. Tesseract has two different parts which can be run separately: text detection and OCR. In this thesis, the text detection part of Tesseract is evaluated against the other text detection algorithms, while Tesseract's OCR is used for recognition independently of the text detection algorithm.

The report will study the following question:

1. Which of the three mentioned text detection algorithms has the highest recall when deployed with Tesseract as OCR on photos of displays with the specific characteristics detailed in section 1.4.1?

1.4 Limitations and challenges

1. This project will only work with algorithms which are open source or have a license which allows free commercial use and distribution.

2. The program is restricted to only handle mild disturbances in the photo, like small reflections or shadows. An example of reflection can be seen in figure 1.2.

3. The photo of the display may have a different size and contain a frame of disturbing background that won't match the original image.

4. The types of images used are restricted to those matching the criteria set by MindRoad AB. The type is described in detail in section 1.4.1 below.


Figure 1.2: Reflection example, the left side of the image is dazzled with background light. This level of reflection is considered high.

1.4.1 Image characteristics

The image characteristics are described in order to choose which algorithms are relevant and to specify which problems need to be handled. The following assumptions can be made about the image according to information from the company and figure 1.1.

General display characteristics

• Minor reflections

• Minor shadows

• Photo might be slightly crooked

Text and numbers characteristics of the display

• High contrast

• Clear font

• One colour without gradient

• English

• Expected position is known

1.5 Scope

A photo of a display will be transferred to a computer. This photo is then handled by the computer together with test cases. Each test case consists of a user-defined area of the display as well as what is expected inside the defined area, also defined by the user (also referred to as the test person). These areas are then used to detect and recognise text in. Section 1.5.1 explains what these input and detection areas look like. An overview of the scope can be seen in figure 1.3.


Figure 1.3: Overview of approach, a photo of the display together with interesting areas and their expected data is inputted into the script. The script runs and returns true or false.

With the help of the restricted areas provided by the user in the test cases, a detection script will search the area for exactly where all text within the boundaries is located. The detected text areas will then be scanned with an OCR tool called Tesseract, described in section 2.1.3, which will extract readable and writable text from the image. The extracted text will be compared with the expected data and the script will return either true or false depending on what was found. This prototype is meant to be used during the development of new software for a screen. A guide of how the screen is meant to look is available. The creator of the test can use the guide to mark areas and what text is supposed to be there. This test can then be run repeatedly on the actual display as the software is updated.
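As a rough illustration of this flow, the sketch below crops one user-defined area, runs Tesseract on it and compares the result with the expected string. The function name, file name and coordinates are illustrative assumptions, not the thesis code.

```python
import cv2
import pytesseract

def run_test_case(photo, area, expected):
    # Crop the user-defined area (x, y, width, height) out of the display photo
    x, y, w, h = area
    crop = photo[y:y + h, x:x + w]
    # Read the text in the area; PSM 6 assumes a single uniform block of text
    text = pytesseract.image_to_string(crop, config="--psm 6").strip()
    # The test passes only if the expected text was found
    return text == expected

photo = cv2.imread("display_photo.png")              # photo of the display (assumed file name)
print(run_test_case(photo, (40, 10, 120, 30), "Friday 3"))
```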

Examples of test cases to handle are mentioned in the last part of section 1.2 (Background).

1.5.1 Input and detection area

As the test cases are provided with a specified area, the original image will be cropped to this area or areas, where text and numbers are expected, as seen in light blue in figure 1.4(a). The text detection algorithm will then go through the cropped area and extract finer areas around the text, as seen in pink in figure 1.4(b).

(a) Example of input area to limit detection in, marked in light blue
(b) Example of detection in limited area, marked with pink

Figure 1.4: Example of input area and detected area.

The detected areas can then be used for comparing positions and to create a secure test case similar to the one mentioned at the end of section 1.2 (Background).

2 Theory

In this chapter the theory is explained. The theory is the basis of the method and discussion later in the report. The chapter will deal with all algorithms used in this project.

2.1 Algorithms

This section will first explain algorithms for pre-processing the image that may be needed for detection or recognition algorithms to work. It will then explain how the text detection algorithms work and lastly explain how the text recognition algorithm works.

2.1.1 Image pre-processing

This section will go through techniques which might improve the text detection algorithms.

2.1.1.1 Thresholding

There exist two different kinds of thresholding: static and adaptive. Static thresholding has a hard limit; every pixel higher than the limit is turned white and every pixel lower is turned black. Adaptive thresholding uses an algorithm to choose which pixels are turned black or white. There are several algorithms to choose from, but some of the more popular are Mean, Otsu and Gaussian.¹
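As an illustration, the snippet below (a minimal sketch using OpenCV's Python bindings, on which the later implementation chapters rely) applies a static, an adaptive and an Otsu threshold to a grayscale image. The file name, the limit 127 and the block size 11 are example values, not values prescribed here.

```python
import cv2

gray = cv2.imread("display.png", cv2.IMREAD_GRAYSCALE)  # assumed example image

# Static thresholding: every pixel above the hard limit 127 becomes white, the rest black
_, static = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Adaptive thresholding: the limit is computed per neighbourhood (Gaussian-weighted mean)
adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 11, 2)

# Otsu's method picks a global limit automatically from the image histogram
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```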

2.1.1.2 Erosion and Dilation

Erosion erodes white, which makes white lines thinner. Dilation does the opposite and dilates the white, making white lines thicker. Both erosion and dilation are run with a kernel; the kernel can be any shape but is usually a rectangle or a circle. The size of the kernel affects how erosion and dilation are done. The scanning can also be done in several iterations, which means that the image is scanned several times.²
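A minimal sketch of these two operations with OpenCV; the 3x3 rectangular kernel, the two iterations and the input file are arbitrary example values chosen only to show the parameters discussed above.

```python
import cv2
import numpy as np

binary = cv2.imread("thresholded.png", cv2.IMREAD_GRAYSCALE)  # assumed binary input image

kernel = np.ones((3, 3), np.uint8)                   # rectangular 3x3 kernel

eroded = cv2.erode(binary, kernel, iterations=2)     # white lines become thinner
dilated = cv2.dilate(binary, kernel, iterations=2)   # white lines become thicker
```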

¹ OpenCV, Image Thresholding, https://docs.opencv.org/3.4.3/d7/d4d/tutorial_py_thresholding.html

² OpenCV, Eroding and Dilating, https://docs.opencv.org/2.4/doc/tutorials/imgproc/erosion_


2.1.2 Text detecting

As explained in section 1.5.1, the text detection algorithm will detect areas containing text within a distinct area. In this project three detection algorithms, EAST, SWT and Tesseract, have been examined. Each one is explained in detail below.

2.1.2.1 Efficient and Accurate Scene Text detector (EAST)

Zhou et al. [5] created EAST in 2017 to detect text in natural scenes. EAST is built using a fully convolutional network with a deep learning structure which looks for patterns depicting lines of text. The lines are then evaluated to decide whether they are structured from text or not. Because of the use of lines, the text must be linear, but it can handle the text being angled or rotated. EAST can be used to detect text in both photos and film. It is relatively fast and has learned to handle different text formats in diverse natural scenes. It is also possible to train EAST with data sets or from scratch, provided you have the right hardware [11].

EAST returns an array of coordinates and sizes for where it has determined there is text. The areas returned by EAST are, however, squares which are perpendicular to the image. Since EAST is open source there exist a lot of different versions and implementations. EAST may have a hard time finding single characters, due to the fact that single characters will not create a line.

2.1.2.2 Stroke Width Transform (SWT)

Stroke width transform is a technique that searches for lines with similar stroke width: the distance between two parallel edges. The technique works for most fonts, where a letter has a similar stroke width overall. SWT aims to segment pixels into characters and later group the characters into words by geometric heuristics. The technique won a competition back in 2005 called "Text Location Competition" at ICDAR³. The technique has since evolved in some aspects and Epshtein et al. [6] got good results in 2010. Epshtein et al. created a better SWT algorithm and can find edges better with the help of Canny⁴. SWT returns an array containing the potential stroke width that each separate pixel most likely belongs to. From this, letters can be extracted and then grouped into words or sentences. According to Epshtein et al. [6], SWT has a low rate of false positives, but according to Jaderberg et al. [7] this technique gets a lot of false positives when working with a noisy image, where noise can get confused with letters.

2.1.2.3 Tesseract text detection

Tesseract was developed between 1985 and 1996 and released as open-source by HP in 2005. Since 2006 it has been developed by Google, who continues to develop it to this day5. Tesser-acts overall architecture has mostly stayed the same while the individual parts has been up-dated and improved over time. More recently they have moved into artificial intelligence and uses a Long Short-Term Memory (LSTM) deep learning system for text recognition6.

Tesseract is an engine for both text detection and OCR but can do each independently. In R. Smith’s paper [10] he explains how Tesseract’s text detection is divided into different stages. Firstly a connected component analysis is made, this allows Tesseract to detect white on black as easy as black on white. Then the outlines are gathered together by nesting and organised into text lines. This is done by sorting and organising them based on x-coordinates, skewed images are handled by tracking the slope across the image and adjusting accordingly.

³ Robust Reading competition, Introduction, http://rrc.cvc.uab.es/

⁴ OpenCV, Canny Edge Detector, https://docs.opencv.org/2.4.13.7/doc/tutorials/imgproc/imgtrans/canny_detector/canny_detector.html

⁵ Tesseract github, Brief history, https://github.com/tesseract-ocr/tesseract

⁶ Tesseract github, Modernization efforts, https://github.com/tesseract-ocr/docs/blob/master/


The lines and regions are then analysed for fixed pitch and proportional text in order to break them down into words and letters.

Tesseract has several page segmentation modes (PSM) and uses these modes to know how to handle the text and make assumptions about what is on a particular image. The different modes are listed in table 2.1.⁷

PSM number | Explanation
0  | Orientation and script detection (OSD) only.
1  | Automatic page segmentation with OSD.
2  | Automatic page segmentation, but no OSD, or OCR.
3  | Fully automatic page segmentation, but no OSD. (Default)
4  | Assume a single column of text of variable sizes.
5  | Assume a single uniform block of vertically aligned text.
6  | Assume a single uniform block of text.
7  | Treat the image as a single text line.
8  | Treat the image as a single word.
9  | Treat the image as a single word in a circle.
10 | Treat the image as a single character.
11 | Sparse text. Find as much text as possible in no particular order.
12 | Sparse text with OSD.
13 | Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

Table 2.1: Tesseract's different PSM modes explained

2.1.3 Text recognition

The next step is to go through each detected area extracted from the previous step and read the text. The text must be returned as something the computer can handle: readable and writable text. This will be done with Tesseract.

2.1.3.1 Tesseract

Tesseract 4 uses a specially designed LSTM neural network architecture for text recognition. It works by sliding a window through the line of text; each frame is run through two LSTM networks, see figure 2.1. One of the LSTM networks is reversed, which means that the input and output are reversed. Sutskever et al. [12] discovered that reversed LSTM leads to decreased perplexity and increased accuracy compared to regular LSTM. The results from the LSTM and the reversed LSTM are stacked by a 1x1 convolution to decrease dimensionality and increase the speed of the softmax nonlinearity filter. The softmax nonlinearity filter is used to calculate probabilities depending on the scores from different matches. The data is lastly run through Tesseract's language model and beam search, and the result is outputted.

⁷ Tesseract github, Tesseract OCR wiki, https://github.com/tesseract-ocr/tesseract/wiki/


Figure 2.1: Overview of Tesseract's LSTM OCR engine. The data goes through several steps in order for recognition to be performed.

Patel et al. [13] compare Tesseract with the commercial engine Transym Optical Character Recognition (TORC). They test the engines by trying to recognise license plates in images. The tests show that Tesseract is considerably faster as well as having a higher accuracy.

Nilsson [9] conducted a wider study where she compares Tesseract with several different OCR engines, both open-source and commercial. Nilsson found that Tesseract was the best performing engine in all the tests; she also notes that this is extra impressive since Tesseract did not recognise the Swedish letters presented in the tests.

3 Method

This chapter will go into detail of how the pre-study was conducted and explain how the data was collected. It will also summarise how the code was implemented and what packages were required to make the prototype work. The chapter will finish up with an evaluation of the result from the implemented prototype.

This thesis will answer the following question:

1. Which of EAST, SWT and Tesseract has the highest recall, when deployed with Tesseract as OCR, on photos of displays as mentioned in section 1.3?

3.1 Timeline

Figure 3.1 shows an overview of how and where the algorithms are validated and selected and/or eliminated.

Figure 3.1: Timeline for studying the algorithms. The pre-study will be performed through literature, then three algorithms will be examined and tested more thoroughly, and as a last step one algorithm will be implemented and tested on the specific use case.

The pre-study will cover the most common algorithms discovered during the literature study. After the literature study only three algorithms are selected for further study. These three algorithms are studied on a deeper level and a test implementation is also made for each of them. One of these three is then selected to go through to the real implementation.


3.2 Pre-study

The project started with a pre-study to investigate and test some of the most common algorithms for detection and recognition. The pre-study started with a literature study to get a wider perspective of the more popular algorithms for detection in similar environments and to find three suitable candidates for a pilot test.

3.2.1 Literature study

There is a wide variation of algorithms on the market. To find the most suitable algorithms for this study, several articles researching detection and recognition in similar environments were analysed. There are, however, very few studies of text detection in industrial dashboards and displays as used in this study. Most studies work with photos of road signs, bus numbers and other non-display images.

For recognition, Tesseract was, by a margin, the most popular and best performing algorithm compared to other OCR algorithms. Tesseract was also well documented with guidelines and code. More about Tesseract is found in section 2.1.3.

Some of the algorithms that were investigated were not provided as code or lacked important implementation guidelines and could therefore not be run in the pilot test. Some of the investigated algorithms that did not go through to a real pilot test were SikuliX, ER-Filter, FASText, YOLO, CNN/RCNN in different combinations and Textboxes, among others. The three algorithms that were selected all had a unique way of handling detection, each with its own combination of machine learning and/or manual detection. The selected algorithms were SWT, EAST and Tesseract, all described in section 2.1.2. All three performed well in earlier literature studies. They are also well documented, and some were even provided with guidelines for implementation and for understanding important parameters in the code.

3.2.2 Implementation

For the study, each of the three chosen algorithms was implemented. Through these implementations, a test with several images was constructed to analyse and evaluate how the algorithms worked on photos of displays.

Each algorithm was then run on the 20 selected images shown in figure 3.2, which match the image characteristics described in section 1.4.1. The recall and precision were then calculated for the 20 images to measure and evaluate the algorithms' detection rates. See chapter 4 for the results.

3.2.2.1 Implementation of EAST

Two ways to implement EAST in Python 3 were examined: an open-source TensorFlow/Keras implementation [11, 14] and an OpenCV implementation. Since OpenCV is well established, the OpenCV method was chosen.

To implement EAST using OpenCV in Python 3, the opencv-contrib-python, numpy and imutils packages were used. opencv-contrib-python is the OpenCV implementation for Python and is used to access the algorithms as well as a number of functions for opening and editing images. Numpy is used to handle matrices in Python and is often used by algorithms when handling images. Imutils is a helper package with useful functions implemented using OpenCV and Numpy.
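For reference, a minimal sketch of how EAST can be run through OpenCV's DNN module is shown below. The frozen model file name, the 320x320 input size and the thresholds are assumptions taken from OpenCV's public example, not values from the thesis, and the geometry decoding is abbreviated to the common single-scale case.

```python
import cv2
import numpy as np
from imutils.object_detection import non_max_suppression

net = cv2.dnn.readNet("frozen_east_text_detection.pb")   # pre-trained EAST graph (assumed file)
image = cv2.imread("display.png")                        # assumed example photo
(H, W) = (320, 320)                                      # EAST needs dimensions divisible by 32
blob = cv2.dnn.blobFromImage(cv2.resize(image, (W, H)), 1.0, (W, H),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])

rects, confidences = [], []
for y in range(scores.shape[2]):
    for x in range(scores.shape[3]):
        score = scores[0, 0, y, x]
        if score < 0.5:                                  # confidence threshold (example value)
            continue
        # Distances from this cell to the box edges, plus the rotation angle
        top, right, bottom, left = (geometry[0, i, y, x] for i in range(4))
        angle = geometry[0, 4, y, x]
        cos, sin = np.cos(angle), np.sin(angle)
        offsetX, offsetY = x * 4.0, y * 4.0              # the output map is 1/4 of the input size
        endX = int(offsetX + cos * right + sin * bottom)
        endY = int(offsetY - sin * right + cos * bottom)
        startX, startY = int(endX - (right + left)), int(endY - (top + bottom))
        rects.append((startX, startY, endX, endY))
        confidences.append(float(score))

# Suppress overlapping detections; each remaining box is a candidate text area
boxes = non_max_suppression(np.array(rects), probs=confidences)
```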

(21)

3.3. Data collection

3.2.2.2 Implementation of SWT

There exist a lot of implementations of SWT in Python; however, these either do not supply test results [15] or the results provided are less promising than the original paper [16]. The implementation was therefore continued by using a wrapper for the C implementation of the algorithm [17]. However, the wrapper was made for Python 2 and some major changes had to be made to the supplied code to get it working in Python 3, mainly because bytes are handled differently in Python 2 and Python 3. The wrapper is created using the Simplified Wrapper and Interface Generator (SWIG). The prerequisites are scikit-image, numpy, matplotlib, the gcc compiler and SWIG. The gcc compiler is used to compile the C code and SWIG is used as a bridge between C and Python. Numpy is used to represent matrices; scikit-image and matplotlib are used to edit and display the images.

3.2.2.3 Implementation of Tesseract

For Tesseract there exists a well developed and commonly used wrapper for Google's Tesseract engine¹. Since this is a well established standard for Tesseract in Python 3, no alternatives were explored. The packages used in Python were numpy, matplotlib, opencv-contrib-python and pytesseract, as well as Tesseract 4 installed on the system. Numpy was used for matrices, matplotlib and opencv-contrib-python were used for image handling, and Tesseract 4 was used together with the wrapper pytesseract for the Tesseract functions.
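A minimal sketch of how detection boxes and recognised text can be obtained through pytesseract is shown below; the file name is an assumption and the confidence filter is an example value.

```python
import cv2
import pytesseract
from pytesseract import Output

image = cv2.imread("display.png")                        # assumed example photo

# image_to_data returns, per detected word, its bounding box, confidence and text
data = pytesseract.image_to_data(image, output_type=Output.DICT)

for i in range(len(data["text"])):
    if float(data["conf"][i]) > 0 and data["text"][i].strip():
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        print(f"'{data['text'][i]}' at ({x}, {y}, {w}, {h})")
```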

3.3 Data collection

The data collected for the actual test in this project consists of photos. The images need to have similar² characteristics or in some way resemble the displays provided by the company. The images most suitable for this project are often protected as trade secrets. Therefore data were collected from displays found in the nearby environment, for example from the university and at MindRoad AB.

Some of these photos were taken in an uncontrolled environment, which led to unfocused images, tilted images and/or images with strong reflections. To compensate for this, images from the internet, similar to the images shown by the company, were also collected for testing, but they will not be published in this report due to copyright restrictions; instead drawn representations of these images are shown. However, the drawings are only meant to give an overview and will not contain reflections or shadows. Figure 3.2 shows all images which were used in testing.

¹ PyPI, pytesseract 0.2.6 - Project description, https://pypi.org/project/pytesseract/
² A majority of the characteristic features, section 1.4.1, are fulfilled

(a) Photo of a fitness watch with 4 words to be recognised.
(b) Image from a car with 2 clear words.
(c) A cutout from figure 1.1b showing 4 words.
(d) Photo from a car display showing several single numbers and some single letters.
(e) Photo of a fitness watch, showing both numbers and letters.
(f) A photo of a display with reflections and flare.
(g) A blurry image of a fuel gauge showing a number followed by letters.
(h) A display showing numbers which are large and pixelated.
(i) A display showing numbers which are large and pixelated.
(j) A display showing a blurry number.
(k) A display showing several separate small text lines.
(l) A display showing a big clear number.
(m) A display showing lots of text, mostly letters.
(n) A display showing lots of text, mostly letters.
(o) A car display showing small numbers and some text.
(p) A cutout from figure 1.1b showing 2 words and 2 letters.
(q) An image showing lots of single character numbers.
(r) A gauge showing 5 words.
(s) A photo of a car display showing words and numbers on different lines.
(t) A photo showing words on different lines.

Figure 3.2: These images were collected from the internet as well as gathered in our close environment. The images from figure 1.1 were used as a baseline. Images b, c, d, g, j, k, l, m, n, o, p, q, r, s and t were collected from the internet while a, e, f, h and i were photographed by us. The images taken from the internet are only shown as drawn examples here due to copyright restrictions.

3.4 Actual implementation

This section will explain the setup and describe in detail how Tesseract was implemented and optimised. It also explains how we implemented ArUco detectors to crop the image and how we adjust the image to work better with Tesseract.


3.4.1 Physical setup

Johansson and Dahl [18] developed a system similar to the one in this paper. The authors used a USB camera pointed at the screen and ran tests in real time. According to their own tests they achieved an error rate of 6%. However, they were mostly focused on image and figure recognition and not text. They used OpenCV to accomplish this.

Tao et al. [3] also worked with similar image characteristics to the ones used in this project. They used a simulated signal controller in order to increase the sample size while still testing on images with specific image characteristics. The simulated signal controller provided ground truths which the results from the tests could be compared to.

The prototype in this project will, similar to the methods mentioned in the articles above, be able to use a camera to take the pictures used in testing. One of the challenges is to match the picture coordinates to the correct image used by the test person. To solve this, the display is marked with two ArUco markers³ in the upper left and bottom right corners, as in figure 3.3(b). This will help the prototype crop the picture to match the image, so that the test person can estimate areas in the image where text can be detected.

3.4.2 Pre-processing

This section explains the pre-processing done to the photo before it is passed to Tesseract.

3.4.2.1 ArUco markers

ArUco markers are a special kind of marker which can be easily detected by existing algorithms. The ArUco markers are detected with OpenCV's library cv2.aruco and the photo is cropped where the ArUco markers are expected to connect with the actual display. The dictionary of ArUco markers is limited to aruco.DICT_4x4_50, which means that only small (4x4) markers are detected, and only the first 50, which makes the detection work faster. The ArUco marker used in this project is shown in figure 3.3(a). The point of the bottom left corner of the marker placed in the top left is saved, and the opposite is done for the marker placed in the bottom right; an example of this can be viewed in figure 3.3(b), where the detected corners are marked in red. These mark the corners where the image is cropped. The code is developed in such a way that a rotation of the ArUco markers will not matter. The cropped photo is also scaled to match the width and height of the test person's image so that the coordinates fully match.
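A minimal sketch of this detection and cropping step, assuming the pre-4.7 cv2.aruco API from opencv-contrib-python that was current when the thesis was written; the corner bookkeeping is simplified compared with the description above, the constant name in code is DICT_4X4_50, and the file name and reference size are assumptions.

```python
import cv2

image = cv2.imread("display_photo.png")                       # assumed example photo
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Only the first 50 markers of the 4x4 dictionary are considered, which keeps detection fast
dictionary = cv2.aruco.Dictionary_get(cv2.aruco.DICT_4X4_50)
corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary)

# Simplified: take all corner points of the two detected markers and crop the
# region between their extremes (the thesis uses specific marker corners).
pts = [tuple(map(int, pt)) for marker in corners for pt in marker[0]]
xs, ys = [p[0] for p in pts], [p[1] for p in pts]
cropped = image[min(ys):max(ys), min(xs):max(xs)]

# Scale the crop to the size of the test person's reference image (assumed size)
reference_size = (800, 480)
cropped = cv2.resize(cropped, reference_size)
```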

³ OpenCV, Detection of ArUco Markers, https://docs.opencv.org/3.1.0/d5/dae/tutorial_aruco_


(a) ArUco marker (Marker ID = 1)
(b) Example of ArUco marking

Figure 3.3: ArUco markers. (a) shows an example of an ArUco marker, the same marker ID used in testing for this report. (b) visualises how the ArUco markers are placed when testing; the red points mark the positions where the image will be cropped.

3.4.2.2 Pre-processing

Before text detection and recognition are run on the image, it is pre-processed. This is a series of steps to prepare the image to be optimal for the Tesseract algorithm. The images work better when they have a high resolution; therefore images are scaled up by a factor of 2. Scaling too much leads to blurry images, which leads to worse results. The image is then grayscaled to be able to perform thresholding, and static thresholding with a limit of 127 is performed. One iteration of dilation with a square kernel of size 1x1 is then performed, followed by one iteration of erosion with a square kernel of size 1x1. Thresholding, dilation and erosion are done to reduce noise in the picture.
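Put together, the steps above correspond roughly to the following sketch; the function name is ours, not from the thesis code.

```python
import cv2
import numpy as np

def preprocess(image):
    # Upscale by a factor of 2; higher resolution tends to help Tesseract
    image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    # Grayscale followed by static thresholding with a limit of 127
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    # One iteration of dilation and one of erosion with a 1x1 square kernel to reduce noise
    kernel = np.ones((1, 1), np.uint8)
    binary = cv2.dilate(binary, kernel, iterations=1)
    binary = cv2.erode(binary, kernel, iterations=1)
    return binary
```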

3.4.3 Tesseract

Tesseract is then run with different settings depending on the expected data from the tests. Different tests run with different filters, but all of the runs use the same basic settings with page segmentation mode 6 (assume a single uniform block of text). The filter is set up so Tesseract only searches for expected characters. For example, when searching for the word "success", the filter would be ['s', 'u', 'c', 'e', 's'].
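One way to express such a character filter with pytesseract is Tesseract's character whitelist option; whether the thesis used exactly this mechanism is not stated, so the sketch below is only one plausible realisation.

```python
import pytesseract

def read_filtered(image, expected_chars):
    # PSM 6: assume a single uniform block of text; the whitelist restricts
    # recognition to the characters that can occur in the expected data.
    config = f"--psm 6 -c tessedit_char_whitelist={''.join(expected_chars)}"
    return pytesseract.image_to_string(image, config=config).strip()

# Example: only the characters of the word "success" are allowed
# text = read_filtered(cropped_area, ['s', 'u', 'c', 'e', 's'])
```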

A problem might occur when the test is looking for a word with a space, for example "Friday 3". Tesseract might detect "Friday" and "3" separately; therefore an algorithm is run to check for words and/or characters which are close to each other and combine them. This algorithm works by assuming that such text has extremely similar y values and only a small difference in x values.
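A simplified sketch of that grouping step is shown below; the gap thresholds are illustrative assumptions.

```python
def merge_nearby(boxes, max_gap=20, max_y_diff=5):
    """Combine detections that lie on roughly the same line and are close in x.

    boxes: list of (x, y, w, h, text) tuples for the detected words/characters.
    """
    merged = []
    for bx, by, bw, bh, btext in sorted(boxes, key=lambda b: b[0]):
        if merged:
            x, y, w, h, text = merged[-1]
            # Same line (similar y) and only a small horizontal gap: merge into one phrase
            if abs(by - y) <= max_y_diff and bx - (x + w) <= max_gap:
                merged[-1] = (x, min(y, by), (bx + bw) - x, max(h, bh), text + " " + btext)
                continue
        merged.append((bx, by, bw, bh, btext))
    return merged

# Example: "Friday" and "3" detected separately become one entry "Friday 3"
print(merge_nearby([(10, 5, 60, 20, "Friday"), (75, 6, 12, 20, "3")]))
```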

3.5 Evaluation

This section will explain the evaluation.

3.5.1 Measure detection

To measure recall, precision and accuracy, the following states need to be known: True positive, False positive, True negative, False negative. These states can be found in figure 3.4, which shows the following:

Figure 3.4(a) is the original image. Figure 3.4(b) shows the expected areas marked in green. Figure 3.4(c) is an example of detected areas marked in orange. A merge between the expected and the detected areas is shown in figure 3.4(d).

Every section where a green area collides with an orange area ("68", "9", "26" and "Fredag") results in a True positive state. Areas only marked in green ("%", ":" and "3") result in a False negative state. Orange areas on their own (the battery symbol) result in a False positive state. The remaining area and background count as True negative. This method is called rectangle matching [19].

(a) Original image (b) Expected areas in green

(c) Example of detected area in orange (d) Expected- and detected area combined

Figure 3.4: Expected and detected area

The following three sections will explain recall, precision and accuracy and the differences between them.

3.5.1.0.1 Recall  Recall measures the difference between the correct hits and the expected number of correct hits and is calculated with the following equation:

Recall = True positive / Expected result = True positive / (True positive + False negative)   (3.1)

This equation is also illustrated in figure 3.5.

Recall is the proportion of the detected True positive out of the expected hits. The recall for the detection in figure 3.4(d) is 80%.

Figure 3.5: Illustration of recall, where the overlapped area symbolises the true positive; this is then divided by the total area of what was expected.

3.5.1.0.2 Precision  Precision measures the difference between the correct hits and the total number of hits and is calculated with the following equation:

Precision = True positive / Detected result = True positive / (True positive + False positive)   (3.2)

This equation is also illustrated in figure 3.6.

Precision answers how precise the detection hits were; in other words, how much of the detected area was false or true. The precision in figure 3.4(d) is 95%.

Figure 3.6: Illustration of precision, where the overlapped area symbolises the true positive; this is then divided by the total area of what was detected.

3.5.1.0.3 Accuracy  Accuracy is often seen as a combined indicator of recall and precision, which is somewhat true, but it is calculated with the whole image in mind while recall and precision focus on the distinct expected and detected areas. Accuracy is calculated with the following equation:

Accuracy = (True positive + True negative) / Total   (3.3)

Accuracy uses Total = True positive + True negative + False positive + False negative, which is the total area of the image. The accuracy in figure 3.4(d) is 99%.

This measurement was, however, seen as an unnecessary part of the result. This is due to the fact that accuracy focuses on the total image area. If the expected and detected areas were really small, it would result in a high percentage of correctly spotted background, which creates an unbalanced result.
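The area-based rectangle matching above can be sketched as follows; the implementation is a simplification that sums pairwise overlaps, so it ignores cases where boxes overlap each other, and it assumes at least one expected and one detected box.

```python
def area(box):
    # box = (x1, y1, x2, y2)
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def overlap(a, b):
    # Area of the overlapping rectangle of two boxes (0 if they do not touch)
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))

def detection_metrics(expected, detected):
    # True positive is the overlapped area between expected and detected boxes
    tp = sum(overlap(e, d) for e in expected for d in detected)
    recall = tp / sum(area(e) for e in expected)       # equation 3.1
    precision = tp / sum(area(d) for d in detected)    # equation 3.2
    return recall, precision
```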

3.5.2 Measure recognition

Chen et al. [20] measure recognition in character recognition rate (CRR), character precision rate (CPR) and word recognition rate (WRR). CRR and CPR work in a similar way to recall and precision for detection, but instead of comparing areas, characters are compared.

Figure 3.7: Example of recognition. Expected and detected characters with the corresponding True positive, False positive and False negative characters.

Figure 3.7 shows a recognition example with the expected words "Recognition rate" and the detected words "8ecognili 0 rate". These words are split into characters and evaluated character by character.

3.5.3 CRR

CRR is calculated in the same way as equation 3.1: the number of correct characters (true positives) divided by the number of expected characters. The example in figure 3.7 therefore gets a CRR of:

CRR = 11/15 = 73%

3.5.4 CPR

CPR is calculated in the same way as equation 3.2: the number of correct characters (true positives) divided by the number of detected characters. The example in figure 3.7 therefore gets a CPR of:

CPR = 11/14 = 79%

3.5.5 WRR

WRR looks at the number of correctly detected words out of the expected words. "rate" was the only word detected correctly in this example; therefore the example only reached a WRR of 1/2 = 50%.
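The three rates can be reproduced with a small sketch; using difflib to count matching characters is our simplification of the character-by-character comparison, not necessarily how the thesis counted true positives.

```python
from difflib import SequenceMatcher

def recognition_rates(expected, detected):
    # Compare characters only, as in the example in figure 3.7 (spaces are ignored)
    exp, det = expected.replace(" ", ""), detected.replace(" ", "")
    # Characters that match between the strings act as the true positives
    tp = sum(b.size for b in SequenceMatcher(None, exp, det).get_matching_blocks())
    crr = tp / len(exp)                      # correct characters / expected characters
    cpr = tp / len(det)                      # correct characters / detected characters
    exp_words, det_words = expected.split(), detected.split()
    wrr = sum(w in det_words for w in exp_words) / len(exp_words)
    return crr, cpr, wrr

# The example from figure 3.7: CRR = 11/15, CPR = 11/14, WRR = 1/2
print(recognition_rates("Recognition rate", "8ecognili 0 rate"))
```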

4 Result

In this chapter the results from the tests are presented: first the results from the first implementation and then the actual implementation.

4.1 Implementation study

This section will go through the results from the implementations of EAST, SWT and Tesseract.

4.1.1 Recall

The recall results of the test are shown in table 4.1 and summarised in figure 4.1. Table 4.1 is sorted in descending order by the last column, Tesseract's results. Figure 4.1 shows recall on the y-axis and the different images on the x-axis; EAST is marked in blue, SWT in orange and Tesseract in gray.


Recall summary

Image name   EAST [%]   SWT [%]   Tesseract [%]
q            0          0         100
a            87         54        97
b            89         91        88
p            91         76        88
m            75         86        88
r            89         26        88
c            72         0         88
h            43         14        84
e            88         0         84
t            81         73        83
l            85         86        81
o            62         87        77
f            80         71        69
n            82         72        68
j            0          72        67
s            75         44        65
g            0          100       60
k            89         86        52
i            62         33        37
d            43         46        26

Table 4.1: The calculated recall for each image. The images are available in figure 3.2.

Tesseract performed best on image q with 100% recall and did not fail (did not get 0%) on any of the images.

EAST reached a maximal recall of 90% on image p, fourth in line in table 4.1, but failed to detect anything in images q, j and g.

SWT reached a maximal recall of 100% on image g, four from the bottom in table 4.1, which can be seen as an orange peak to the right in figure 4.1. SWT did, however, fail to detect anything in images q, c and e.


4.1.2 Precision

The precision results of the test are shown in table 4.2 and summarised in figure 4.2. Table 4.2 is sorted in the same order as the recall results so that the results for the same images are easy to compare. Figure 4.2 shows precision on the y-axis and the different images on the x-axis; EAST is marked in blue, SWT in orange and Tesseract in gray.

Precision summary

Image name   EAST [%]   SWT [%]   Tesseract [%]
q            0          0         25
a            92         96        15
b            92         100       100
p            82         97        57
m            70         86        73
r            73         75        52
c            90         0         45
h            94         30        47
e            93         0         75
t            81         43        18
l            88         100       100
o            66         44        73
f            95         96        50
n            84         84        63
j            0          100       100
s            95         100       100
g            0          47        46
k            73         13        20
i            92         63        78
d            55         73        34

Table 4.2: The calculated precision for each image. The images are available in figure 3.2.


4.1.3 Summary

Table 4.3 shows a summary of the average recall and precision in the test. This is done by summing all recall results for each algorithm and dividing by the number of images executed. The same is done for precision. Table 4.3 is sorted in descending order by recall.

Total summary

Algorithm Recall [%] Precision [%]

Tesseract 74 59

EAST 65 71

SWT 56 62

Table 4.3: The total recall and precision calculated over all images for each algorithm.

Tesseract performed best with an average recall of 74%, followed by EAST with an average of 65%. SWT performed worst with a 56% recall.

4.2 Actual implementation

This section will go through the results from running Tesseract as detection together with Tesseract's OCR tool.

Table 4.4 shows the results from running Tesseract detection and recognition. It is sorted by the last column, WRR, followed by the CRR column.

Recognition rate

Image name   CRR [%]   CPR [%]   WRR [%]
j            100       100       100
l            100       100       100
r            100       65        100
b            92        92        50
p            69        53        50
c            55        43        50
g            50        60        0
e            21        75        0
Average      73        73        56

Table 4.4: The recognition rates (CRR, CPR and WRR) calculated over a selection of images.

The two best performing images were both images of timers and the resulting images with detection and recognition can be seen in figure 4.3a and figure 4.3b followed by the other images in the same order as in table 4.4.


(a) Detection in green and recognition in blue on image (l) from figure 3.2, one of the best performing images with a 100% WRR.
(b) Detection in green and recognition in blue on image (j) from figure 3.2, one of the best performing images with a 100% WRR.
(c) Detection in green and recognition in blue on image (r) from figure 3.2, with a 100% WRR.
(d) Detection in green and recognition in blue on image (b) from figure 3.2, with a 50% WRR.
(e) Detection in green and recognition in blue on image (p) from figure 3.2, with a 50% WRR.
(f) Detection in green and recognition in blue on image (c) from figure 3.2, with a 50% WRR.
(g) Detection in green and recognition in blue on image (g) from figure 3.2, one of the worst performing images with a 0% WRR.
(h) Detection in green and recognition in blue on image (e) from figure 3.2, one of the worst performing images with a 0% WRR.

Figure 4.3: Detection and recognition on images from table 4.4.

5 Discussion

5.1 Result

This section will discuss the results.

5.1.1 Measuring detection

We chose to measure text detection with recall and precision; these metrics were chosen because they are among the more popular choices. Using one of the more popular methods gives a wider net of other papers to compare the results with. The data from other papers, which focus on general text detection, can be directly compared with the data from the tests on images with the specific characteristics explored in this paper.

We chose to focus more on recall because recall ignores false positives. Since the information about what is expected in a given area is known, this information can be used to filter away false positives, which means false positives are not a problem for our specific use case. Since recall completely ignores false positives, a hypothetical algorithm which always marks the whole image as text would lead to a recall of 100%. Therefore precision is used as a secondary measurement to make sure the algorithm is actually marking the text.

5.1.2 Images and threshold

During the pilot test of Tesseract we discovered that Tesseract failed for some images. The photos that failed were images of displays with a green LED background and grey/black characters, as seen in figure 5.1a. We learned that this was caused by the contrast between the text and the background being too low for the thresholding. This leads the thresholding to remove the text as well as the background and results in an image which is almost entirely white, as shown in figure 5.1b. This can be remedied by decreasing the limit for the thresholding, but that would lead to noisier images and lower the performance of Tesseract. Another option is to run all images through several thresholding limits, do text detection on all of them and combine the results. This would lead to increased performance but slower execution time. However, this problem was very contained and only affected images with a green lit background. These images did not match the image characteristics described in section 1.4.1, which state that the text should have high contrast. The images were ultimately deemed outside the scope and removed from the testing pool.

This does not affect our result but may be of interest for future work if other kinds of industrial displays where green LED displays occur are used. Figure 5.1c shows the finished result with detected areas, and the result is poor.

(a) Original image of green LED display
(b) Green LED display after thresholding
(c) Green display with Tesseract detection

Figure 5.1: Tesseract's detection when used on green LED displays.

5.1.3 Measure recognition

Images j and l from our test images, shown in figure 3.2, both got a perfect score of 100% in CRR, CPR and WRR. These two images are a close match to an actual test case: closely cropped and showing close to one word. Images b and g also match this criterion but have worse results. Image b had 92% CRR, 92% CPR and 50% WRR; the text detection on b was pretty good with a recall of 88%, and the worse result on b is attributed to Tesseract mistaking the letter v for the symbol ¥. Image g had 50% CRR, 60% CPR and 0% WRR; g only had 60% recall and this is partly why it performed worse. The text detection missed the "mi" after the numbers, but Tesseract OCR also mistook a "5" for an "8". Image e performed the worst in the tests with 21% CRR, 75% CPR and 0% WRR. This bad result is attributed to the text detection failing; in figure 4.3h it can be observed that the text detection only detected one word which stretches over two different words. This caused the recognition to fail as well and led to a bad result.

5.1.4 Recall and Word Recognition Rate

The end result was a 74% recall for detection and a WRR of 56%, which is a lot lower than what we hoped to achieve. We stated earlier that a recall of 60% was considered sufficient. This means that the detection was better than the algorithms we looked into during the pre-study, so some optimisation was possible for the images. However, a 56% WRR is extremely low; even though we were hard in our judgement of successfully recorded words, some optimisation is still required to make this prototype sufficient and reach a WRR closer to 100%. The optimisation may lie in scaling the images into better areas and could be helped by good communication between the prototype and the user; this could be a possible optimisation to look into in future development. Some of the possible reasons could also be issues with our data collection and a lack of time during our optimisation of Tesseract; this is discussed further in sections 5.2.3 and 5.2.4.

5.2 Method

This section will discuss our method.

5.2.1 Pre-study and selection of algorithms

During the literature study a wide spectrum of tools was examined. Some of them work excellently for GUI testing, SikuliX for example. However, SikuliX is created for screenshots and does not handle the issues which might occur when dealing with photos. SikuliX does not deal with tilted images, reflections, differing resolutions, etc.

As mentioned in section 3.2, some algorithms showed great results in previous research but proved difficult to implement. ERFilter is one example where previous studies looked good, but where we could not get the algorithm to work due to installation errors. This was mostly attributed to the fact that ERFilter is built in C++ and we had to use a wrapper. Using a wrapper limits the ways the algorithm can be customised to increase performance on photos with the image characteristics. In the case of ERFilter the wrapper did not work properly. We acknowledge that we could have gotten these algorithms to work if we had focused on them, but they were abandoned in the interest of time.

The best results in the ICDAR competition 2011 [8] were from Kim's method. This algorithm is not published and can therefore not be researched or used in this project.

The other algorithms we looked into had worse results than EAST, SWT and Tesseract. After examining the potential candidates, EAST, SWT and Tesseract showed the most potential to be modified in different ways to work for our images.

5.2.2 Pilot study and algorithm performance

During the pilot study, all three algorithms were implemented. When EAST was provided with images similar to figure 1.1(a), we noticed a huge flaw when it came to detecting single characters: EAST could not detect any single characters. This is a big problem since single characters occur frequently in the photos. EAST did however provide a good result, with 65% recall and 71% precision as shown in table 4.3. But since EAST could not handle single characters, it was not considered an algorithm to continue working with. This flaw lies in EAST's architecture and cannot be remedied. It occurs because EAST's detection works by drawing a line from the beginning of a word to the end. Such a line can only be formed if more than one character is lined up; since a single character does not form a line, EAST will never be able to detect it.

Earlier studies have not highlighted this problem. However, after understanding EAST's algorithm, this is an understandable error which cannot be reduced by any modification without rethinking the algorithm entirely, which is outside the scope of this paper. This meant that EAST was no longer an option for the prototype.

SWT provided good results for most images, but when run with images similar to figure 1.1(b), we noticed an error when detecting characters with a thin stroke width. Looking at the result in table 4.3, SWT was fairly good with 56% recall and 62% precision, but text with a thin stroke width might occur more frequently than in the test images. Since stroke width is the basis of the whole algorithm, this cannot be fixed without considerably altering the algorithm. There are however some mitigating options which can be tested. Dilation can be applied to the image before the algorithm is run; this dilates the lines and leads to thicker stroke widths. However, dilating images with an already thick stroke width can make the text indiscernible. Since images within the image characteristics can have varying degrees of stroke width, dilating the images to the necessary level was deemed harmful to the overall performance of the algorithm. Earlier studies did not mention this being a problem with SWT, although other studies had problems detecting text in noisy images as well as bent text. These are however not part of our image characteristics and do therefore not affect us. But since SWT cannot handle thin stroke widths, SWT is not an option for the prototype.
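To make the dilation mitigation concrete, the following is a minimal sketch in Python with OpenCV of how thin strokes could be thickened before handing the image to an SWT implementation. It is not the prototype's actual code; the file name, kernel size and threshold value are assumptions for the example.

import cv2
import numpy as np

# Hypothetical input photo (the file name is only an example).
image = cv2.imread("display_photo.png", cv2.IMREAD_GRAYSCALE)

# OpenCV dilation grows bright regions, so the text is first made white on black.
_, binary = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY_INV)

# One dilation pass with a small kernel thickens thin strokes slightly.
kernel = np.ones((2, 2), np.uint8)
thickened = cv2.dilate(binary, kernel, iterations=1)

# 'thickened' would then be passed to the SWT implementation instead of the raw photo.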

Tesseract showed extremely poor results in the beginning and was originally abandoned in search of better options, but after reading about other papers' success with Tesseract we decided to go back and check the settings. The page segmentation mode was set to automatic; after changing this option to "assume a uniform block of text", the result improved considerably. The result from Tesseract is the best, with 74% recall and 59% precision as seen in table 4.3. A flaw with images with a green lit background and black text was discovered in Tesseract. It happens because of the thresholding, but since a green background with dark text does not occur within the image characteristics, this was not seen as a big problem. No other flaws were identified with Tesseract and we decided to go on with Tesseract.
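As an illustration of the settings change described above, a minimal sketch using the pytesseract wrapper is given below. The wrapper, the file name and the exact configuration strings are assumptions for the example, not necessarily what the prototype uses.

import cv2
import pytesseract

image = cv2.imread("display_photo.png")  # hypothetical example photo

# Automatic page segmentation (the default) gave poor results on our kind of images.
auto_text = pytesseract.image_to_string(image, config="--psm 3")

# "Assume a single uniform block of text" (psm 6) improved the result considerably.
block_text = pytesseract.image_to_string(image, config="--psm 6")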

5.2.3 Data collection

Due to trade secrets at the company we did not get access to any photos of the actual displays which the program is planned to be run on. Our supervisor provided us with two images taken from Google, shown in figure 1.1. These two images served as a base for the data collection. This is of course a huge problem, since we were not able to test our algorithm on the photos where it was actually going to be deployed. Instead we opted to collect the images ourselves, from the internet and our nearby environment. The photos were supposed to be taken in a controlled environment to minimise reflection, glare and rotation. This proved to be really hard to do in an open environment; some of the images in the tests have very high reflection and glare, as seen in figure 1.2. This might have affected our result negatively, with images where shadows and reflections could have ruined the thresholding or shading, or where unfocused images make detection unreliable. We could have collected more images to get a more reliable result, but due to the difficulty of finding similar displays and of taking photos in a controlled environment, this would probably have led to a misleading result.

5.2.4 Implementation

When implementing Tesseract there are several factors which contribute to a better result. When run on the original images, Tesseract as a detector performs rather badly, with a lot of false positives. Tesseract relies heavily on image manipulation before being run. The most drastic improvement comes when the image is run through a threshold algorithm in pre-processing. There are several different threshold algorithms and we tested a small range of them; we found the static threshold with a value of 127 to be the most effective, while the other threshold algorithms produced a lot of dots and noise which Tesseract mistook for characters.
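A minimal sketch of this pre-processing step with OpenCV is shown below; the file name is hypothetical and the code only illustrates the static threshold described above.

import cv2

image = cv2.imread("display_photo.png")          # hypothetical example photo
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # Tesseract is run on a single channel

# Static (fixed) threshold at 127: pixels above the value become white, the rest black.
_, binarised = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Other thresholding variants we tried produced dots and noise that Tesseract mistook for characters.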

The resolution of the image also seemed to be important; images with low resolution performed considerably worse than images with high resolution. However, scaling images too much led to even worse results than leaving them untouched. Through some testing we noticed that scaling the images by a factor of two raised the resolution without sacrificing clarity.
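The scaling step could look like the sketch below, again assuming OpenCV; the interpolation method and file name are assumptions, since the report only states that a factor of two was used.

import cv2

binarised = cv2.imread("binarised_photo.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input

# Scaling by a factor of two raised the effective resolution without sacrificing clarity;
# larger factors tended to give worse results than leaving the image untouched.
scaled = cv2.resize(binarised, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)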



Through research we noticed that other researchers have had success with dilation and erosion when running Tesseract. We tried several different combinations, but anything other than one iteration of dilation followed by erosion with a 1x1 kernel led to a worse result. We speculate that this is because dilation and erosion are good at removing noise, but our images do not contain much noise, so the operations only made the letters harder for Tesseract to read.
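A sketch of the morphological step that gave the best result in our tests is shown below, assuming OpenCV and NumPy; the file name is hypothetical.

import cv2
import numpy as np

binary = cv2.imread("binarised_photo.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input

# One iteration of dilation followed by erosion with a 1x1 kernel; larger kernels or
# more iterations made the letters harder for Tesseract to read.
kernel = np.ones((1, 1), np.uint8)
processed = cv2.dilate(binary, kernel, iterations=1)
processed = cv2.erode(processed, kernel, iterations=1)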

Tesseract also has several different modes, as seen in table 2.1; these modes help Tesseract make assumptions about what to look for. We tried all of the different modes and psm6 was superior in all aspects. We speculate this is because psm6 looks for a single uniform block of text, which is always the case in our images. Better performance might be possible by running different modes depending on the expected data, for example psm10 when looking for single characters; however, this was not explored.
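If mode selection per expected content were explored, it could look like the sketch below. The helper function and the mapping from expected content to mode are assumptions; this was not implemented in the project.

import pytesseract

def recognise(image, expect_single_character=False):
    # psm 10 treats the image as a single character, psm 6 as a single uniform block of text.
    psm = 10 if expect_single_character else 6
    return pytesseract.image_to_string(image, config="--psm {}".format(psm))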

5.2.5 Evaluation

Due to the company secrets involved we were not able to get in touch with actual professional testers, which means that our evaluation is not validated by any of the potential users, who know more about how an automated test should work to be efficient. This of course makes our evaluation weaker.

Our evaluation relies on how we perceived information while reading about automated tests in other articles and theses. The same goes for how the evaluation of text detection and recognition was performed; these were, however, very well documented and standardised. Our supervisor at the company is not a professional tester, but has close contact with people who are professional testers, and has provided us with tips and guidelines on how they work and what they are requesting.

5.3 Wider context

At this point, the program is far from giving a solid result. When implemented correctly, with restrictions on what and where to detect, it will however produce false negatives rather than false positives. In this respect, our program could work as a helping tool and exclude areas that are definitely correct.

Even if this program were developed further and reached a 95% recognition rate, we still could not rely on it fully. But as mentioned before, if the test case is constructed correctly, the test will only produce false negatives and never false positives. A false positive could have a catastrophic outcome: if, for example, an airplane shows the wrong altitude but it is stated as correct, the airplane could crash. Further development of this program should therefore focus on restricting the possibility of a false positive. A human could not be fully replaced at this point, but the program could facilitate the work required.


6 Conclusion

Throughout this report, we have aimed to answer the following question:

1. Which of EAST, SWT and Tesseract as a text detection algorithm has the highest recall when deployed with Tesseract as OCR on photos of displays with the characteristics mentioned in section 1.4.1?

These three algorithms were selected after a literature study where several algorithms were investigated and evaluated. Tesseract, SWT and EAST all showed potential to be efficient detection algorithms. To answer the research question we needed to test how well each algorithm performed on photos of industrial displays. These tests exposed defects in EAST and SWT. EAST could not detect single characters, which is something that will occur in our type of images, and it was therefore not suited for further testing. SWT could not detect characters or words with a thin stroke width, which can be common in our images. Tesseract, however, showed an excellent result compared to the other two. Our research question is therefore answered: Tesseract is the best algorithm to combine with Tesseract as OCR when detecting and recognising text in our images.

6.1 Future work

This thesis can be used as a basis for further work in both text detection and image detection. We looked into different options for text detection and some optimisation; however, our focus was not on optimising but rather on choosing an algorithm. Further work could therefore look deeper into how to optimise Tesseract for these images. Not all optimisation options were explored; for example, training Tesseract with images that match the characteristics considered in this paper is an interesting idea.

A program to build the test cases was also developed in this project but was not evaluated by any professional testers. Further work might look into how such a system could be implemented to optimise accessibility and efficiency, and also test the program in the proper environment.

Text detection is not the only important part of an automated test system either. Another project could look at image recognition, to detect symbols and figures as well.


Bibliography

[1] Z. Wang. “Error Pattern Recognition Using Machine Learning”. M.S. thesis. Linköping University, 2018.

[2] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, E. Valveny, M. Iwamura, J. Matas, L. Neumann, V.R. Chandrasekhar, S. Lu, F. Shafait, and S. Uchida. “ICDAR 2015 competition on Robust Reading”. In: IEEE Computer Society (2015).

[3] Y. Tao, Y. Yue, and P. Craig. “A Hybrid Approach to Detection and Recognition of Dashboard Information in Real-time”. In: ICSAI (2017).

[4] L. Neumann and J. Matas. “Real-Time Scene Text Localization and Recognition”. In: CVPR (2012).

[5] X. Zhou, C. Yao, H. Wen, and Y. Wang. “EAST: An Efficient and Accurate Scene Text Detector”. In: arXiv (2017).

[6] B. Epshtein, E. Ofek, and Y. Wexler. “Detecting Text in Natural Scenes with Stroke Width Transform”. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2010).

[7] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. “Reading Text in the Wild with Convolutional Neural Networks”. In: International Journal of Computer Vision Volume 116 (2016).

[8] A. Shahab, F. Shafait, and A. Dengel. “ICDAR 2011 Robust Reading Competition Challenge 2: Reading Text in Scene Images”. In: ICDAR, IEEE Conference (2011).

[9] E. Nilsson. “Test av OCR-verktyg för Linux”. M.S. thesis. Linnéuniversitetet, 2010.

[10] R. Smith. “An Overview of the Tesseract OCR Engine”. In: ICDAR (2007).

[11] argman. A tensorflow implementation of EAST text detector. https://github.com/argman/EAST. [Viewed 2019-04-02]. 2018.

[12] I. Sutskever, O. Vinyals, and Q. V. Le. “Sequence to Sequence Learning with Neural Networks”. In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger. Curran Associates, Inc., 2014, pp. 3104–3112. URL: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.



[13] C. Patel, A. Patel, and D. Patel. “Optical Character Recognition by Open Source OCR Tool Tesseract: A Case Study”. In: International Journal of Computer Applications Volume 55, No. 10 (2012).

[14] J. Zdenek. Implementation of EAST scene text detector in Keras. https://github.com/kurapan/EAST. [Viewed 2019-05-09]. 2019.

[15] C. Bunn. Stroke Width Transform. https://github.com/mypetyak/StrokeWidthTransform. [Viewed 2019-05-09]. 2015.

[16] N. Lintz. Stroke Width Transform - Results. https://github.com/nlintz/StrokeWidthTransform/tree/master/results. [Viewed 2019-05-09]. 2014.

[17] M. Zabłocki. Fast Stroke Width Transform for Python. https://github.com/marrrcin/swt-python. [Viewed 2019-05-09]. 2017.

[18] F. Johansson and O. Dahl. “Autonomous Validation through Visual Inspection”. M.S. thesis. Högskolan i Halmstad, 2017.

[19] C. Wolf and J.M. Jolion. “Object count/Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms”. In: LIRIS-RR (2005).

[20] D. Chen, J. M. Odobez, and H. Bourlard. “Text detection and recognition in images and videos”. In: IDIAP-RR 02-61 (2002).
