DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020
TRAINING A NEURAL NETWORK USING
SYNTHETICALLY GENERATED DATA
FREDRIK DIFFNER
HOVIG MANJIKIAN
Degree Project in Computer Science
Date: June 2020
Supervisor: Christopher Peters
Examiner: Pawel Herman
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
Abstract
A major challenge in training machine learning models is gathering and labeling a sufficiently large training data set. A common solution is to use a synthetically generated data set to expand or replace a real one. This paper examines the performance of a machine learning model trained on a synthetic data set versus the same model trained on real data. The approach was applied to the problem of character recognition using a machine learning model based on convolutional neural networks. A synthetic data set of 1,240,000 images and two real data sets, Char74k and ICDAR 2003, were used. The model trained on the synthetic data set achieved an accuracy about 50% higher than that of the same model trained on the real data set (Char74K).
Keywords
Synthetic data set, Generating synthetic data sets, Machine learning, Deep learning, Convolutional neural networks, Machine learning model, Character recognition in natural images, Char74k, ICDAR2003.
Sammanfattning
When developing machine learning models, the lack of a sufficiently large training data set can be a problem. A common solution is to use synthetically generated data to either expand or entirely replace a data set of real data. This thesis examines the performance of a machine learning model trained on synthetic data compared with the same model trained on real data. This was applied to the problem of using a convolutional neural network to recognize characters in images from "natural" scenes. A synthetic data set of 1,240,000 images and two data sets of characters from photographs, Char74K and ICDAR2003, were used. The results show that a model trained on the synthetic data set performed about 50% better than the same model trained on Char74K.
Nyckelord
Synthetic data set, Generating synthetic data, Machine learning, Machine learning model, Deep learning, Convolutional neural networks, Character recognition in images, Char74k, ICDAR2003.
Acknowledgements
We would like to express our gratitude to Christopher Peters for his invaluable
constructive criticism and aspiring guidance throughout this work.
Contents
1 Introduction
1.1 Problem
1.2 Methodology
1.3 Delimitations
2 Background
2.1 Synthetic data sets
2.2 Neural Networks
2.3 Char74K and ICDAR2003 data sets
2.4 Related Work
3 Methodology
3.1 Generating the synthetic data set
3.2 Pre-processing Char74K and ICDAR2003 data sets
3.3 Neural Network
3.4 Evaluation
4 Results
4.1 Results from related studies
4.2 Results from this study
5 Discussion and Conclusions
5.1 Discussion on the results
5.2 Discussion on the synthetic data set
5.3 Discussion on the neural network
5.4 Conclusions
5.5 Future Work
References
1 Introduction
The rapid growth of high-resolution video material generated by modern devices (e.g. mobile devices, security cameras) makes detecting and recognizing characters in this material an important problem, especially in areas such as data mining and categorization.
Text detection and character recognition is a classic pattern recognition problem.
For the Latin script, this is largely a solved problem in certain cases, such as images of scanned documents [2]. However, text detection and character recognition in natural images (photographs) pose a much more difficult problem, where characters can be much harder to recognize (e.g. the characters of a neon sign outside a restaurant). In recent years, numerous approaches have been evaluated, both by dividing the problem into stages of text detection, extraction, and recognition, and through end-to-end solutions [20].
This thesis focuses only on the character recognition problem in natural images. For this problem, algorithms implementing neural networks have proven to perform well [2, 5, 20]. Thus, this paper will concentrate mainly on such algorithms.
However, one major problem when training neural networks is the lack of a large and properly labeled data set for training, which also applies to character recognition in natural images [4, 5, 20]. A solution in this situation is to generate a synthetic data set. Generating synthetic data sets is a relatively new area in the machine learning domain. The concept is to generate artificial data that can be used to train machine learning models. The key advantages of training a machine learning model on a synthetically generated data set are feasibility and flexibility. It is often the case that the real data set is too small or missing completely, and in many of these cases it is not feasible to gather more real data because of time or budget constraints [3]. Furthermore, there are cases where real data is available in abundance but still cannot be used because of privacy or confidentiality concerns [14]. In such cases, generating a synthetic data set is a viable option, provided that the results of training the machine learning model on such a data set are acceptable for the purpose. One area where the approach of generating synthetic data is applicable is character recognition. Even though the use of synthetic data has been evaluated in end-to-end solutions and word recognition [6, 7], the area of character recognition lacks studies evaluating the use of synthetic data alone for training neural networks.
1.1 Problem
For the problem of character recognition in natural images, the lack of large data sets for training neural networks is a constraint. Synthetic data sets could be a substitute when the real data set is too small or missing completely. The research question posed in this thesis is thus: can a synthetic data set alone be used to train a neural network for character recognition in natural images?
Our hypothesis is that a convolutional neural network trained on a synthetically generated data set will have a similar performance to the same network trained on a real data set.
1.2 Methodology
Generally, creating a realistic synthetic data set for training can be complicated, since many aspects need to be considered. However, achieving realism artificially in the area of characters in natural images is less difficult, due to the narrow range of aspects of realism involved. A natural image often contains distortion, blurriness, noise, and gradient lighting, and these artifacts can easily be reproduced by software.
TensorFlow is used for creating, training, and evaluating the neural network. The performance of the network is then compared against state-of-the-art results from other studies.
1.3 Delimitations
Text recognition in natural images has traditionally been divided into several sub-problems: detection, extraction, and recognition. This thesis focuses only on generating a synthetic data set for the sub-problem of character recognition, where characters have already been detected and extracted from the natural image.
There are, however, recent studies that achieved good results by developing end-to-end solutions [7, 20].
To evaluate a synthetic data set, a neural network with good performance is needed. When building a neural network, its performance often depends on several parameters. Setting these parameters requires extensive testing, which consumes time, especially when heavy computing power is unavailable.
Due to limitations in time and computing power (a desktop PC with a single mid-range GPU), only a limited amount of testing is possible. Consequently, the parameters of the neural network will be calibrated only lightly, which could lead to an insufficiently optimized network. The limited access to computing power is also a constraint when creating the synthetic data set, which affects the size of the data set.
2 Background
In this section, a background on the techniques used in this paper will be given. First, an introduction to synthetic data sets will be presented; then the area of neural networks will be covered, with a focus on convolutional neural networks. This part is mainly based on the book Neural Networks and Deep Learning by Nielsen¹. Then the Char74K and ICDAR2003 data sets will be introduced, and finally, related work will be presented.
2.1 Synthetic data sets
One common problem when training neural networks is the lack of a large and properly labeled training data set. Gathering such a data set can be very time- and resource-consuming [3]. For example, the Char74K data set, a data set of real images used for character recognition, was gathered by taking 1922 images on the streets of Bangalore (India), cropping out every character, and labeling it [4]. Another concern when dealing with real images is privacy and confidentiality. Sometimes a large and properly labeled real data set is available but unusable because of the high risk of disclosing confidential data, or of disclosing the identity of people without their consent [14]. This is a very common case in, for example, the health care sector.
A common solution in this situation is to generate a synthetic data set, which is often a cheaper and more easily scalable alternative to a real data set. One area where this approach is applicable is character recognition. Since machine learning models are often used for recognizing characters in natural images, the availability of a training data set is crucial. For example, Gupta et al. developed an engine that overlays synthetic text onto an existing background image in a natural way, accounting for the local 3D scene geometry. The synthetic pictures were later used to train a Fully-Convolutional Regression Network, which outperformed state-of-the-art techniques for text detection in natural images [6]. Another example is Jaderberg et al.'s proposal of a new framework for state-of-the-art word recognition; they showed that the new framework could be trained with synthetic data [7].
¹ Nielsen, Michael A. Neural Networks and Deep Learning. Determination Press, 2015.
However, generating synthetic data is not as simple as one would wish. It is important that the synthetic data is well defined in order to achieve reliable results [11]. Many aspects play a major role in defining the synthetic data, such as quantity, quality, variation, and realism. This paper tries to balance all of these aspects, but prioritizes realism, that is, the degree to which the synthetic data mimics the real data.
2.2 Neural Networks
Artificial Neural Networks are a subset of Machine Learning, which in turn is a subset of Artificial Intelligence. Neural networks have proven to be good at solving problems related to image recognition, speech recognition, and natural language processing. The technology is vaguely inspired by the biological neural networks that constitute animal brains. The same concept is applied in computers to enable pattern recognition in observation data. A neural network consists of small processing units called neurons. Each neuron i takes a set X of inputs, where each input x_j ∈ X, with 0 ≤ x_j ≤ 1, has an attached weight w_ij. Each neuron also has a bias b_i. The output y_i of the neuron is then calculated by applying an activation function f to its input,

y_i = f(∑_{j=1}^{n} x_j · w_ij + b_i)

where n is the size of the input set X and 0 ≤ y_i ≤ 1.
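The neuron computation above can be sketched in a few lines of Python. The sigmoid is used here as an illustrative activation function satisfying 0 ≤ y_i ≤ 1; the thesis does not fix a particular f at this point.

```python
import math

def sigmoid(z):
    # Squashes any real z into (0, 1), matching the constraint 0 <= y_i <= 1.
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    # y_i = f(sum_j x_j * w_ij + b_i)
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Example: three inputs with their weights and a small bias.
y = neuron_output([0.5, 0.2, 0.9], [0.4, -0.6, 0.1], 0.05)
```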
A common architecture of a neural network is the feed forward network. In a feed forward network, the neurons are arranged in several layers: one input layer, some hidden layers, and one output layer. The outputs from neurons in one layer then become inputs to the neurons in the next layer. If every neuron in a layer receives input from all of the neurons in the previous layer, the layer is said to be fully connected [12].
2.2.1 Convolutional Neural Networks
Convolutional neural networks (CNNs) have proven to be effective for image classification in recent years [9][16][17]. A CNN is a type of feed forward network. It may contain one or more fully connected layers, and may also contain one or more convolutional layers as hidden layers. The difference between fully connected layers and convolutional layers is that the latter consist of one or more filters, where each neuron does not have connections to all neurons in the previous layer. Instead, each neuron is connected to a small region of the previous layer, a local receptive field or kernel, see figure 2.1. If the previous layer is seen as a matrix, the local receptive field is a k by k sub-matrix, where k is the side length of the local receptive field. The stride s decides where the local receptive field of the next neuron in the filter is located. If s = 1, the local receptive field is moved one column to the right; if s = 2, it moves two columns to the right, and so on. The size of the filter then becomes ⌈(a − k + 1)/s⌉ by ⌈(a − k + 1)/s⌉ when the input layer is an a by a matrix.
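The output-size formula can be checked with a small helper function (an illustrative sketch; the function name is ours, not from the thesis):

```python
import math

def filter_size(a, k, s):
    # Number of receptive-field positions per side for an a-by-a input,
    # a k-by-k kernel, and stride s: ceil((a - k + 1) / s).
    return math.ceil((a - k + 1) / s)

# A 40x40 input with a 5x5 kernel and stride 1 gives a 36x36 filter.
size = filter_size(40, 5, 1)
```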
Figure 2.1: A convolutional layer (right) with a 5 by 5 local receptive field.
Source: Nielsen [12]
Each neuron in a filter shares the same weights and bias, but the weights and biases can differ between the filters in the same convolutional layer. Hence, each filter can be seen as learning a separate feature from the input layer. Several filters can run in parallel, each learning a different feature. A pooling layer can be used after one or more convolutional layers to summarize and reduce the number of parameters passed to the next layer.
To be able to make predictions, one or more fully connected layers are used at the end of the pipeline, where the last layer has as many outputs as there are classes the network should be able to distinguish [12].
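The weight sharing and stride described above can be illustrated with a minimal pure-Python convolution over a square grayscale matrix (an illustrative sketch, not the thesis implementation; in practice TensorFlow provides this as a built-in layer):

```python
def convolve(image, kernel, stride=1):
    # Slide a k-by-k kernel over an a-by-a image. Every output neuron
    # uses the same weights (the kernel), illustrating weight sharing.
    a, k = len(image), len(kernel)
    out = []
    for i in range(0, a - k + 1, stride):
        row = []
        for j in range(0, a - k + 1, stride):
            # Weighted sum over one local receptive field.
            acc = sum(image[i + u][j + v] * kernel[u][v]
                      for u in range(k) for v in range(k))
            row.append(acc)
        out.append(row)
    return out
```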
2.2.2 Training
The network "learns" by training on a set of training data. There are two general types of training: unsupervised and supervised. In unsupervised training, the data set given for training contains no labels; it is up to the network itself to extract and learn classes of patterns from the training data. In supervised training, the correct labels are given with the training data. Training with labeled data is often more efficient, but the disadvantage is that it requires a large amount of labeled data.
The aim of the network during training is to minimize the cost function C(W, B), which is done by changing the weights W and biases B. The cost function usually depends on the difference between the network's predicted output y(x) and the actual classification of a training input x. A weight w is updated with the formula w ← w − η · ∂C/∂w, where η is the learning rate. The bias is updated in a similar way. This is done through an algorithm called backpropagation. Backpropagation is rather involved, and this paper will not go into further detail here.
The main computational burden when training a neural network is the computation of the partial derivatives. A common way to speed up this procedure is to estimate the cost function by training only on a random subset of the training data. This is done iteratively, with a new random subset in each iteration. The technique is known as Stochastic Gradient Descent [12].
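The update rule w ← w − η · ∂C/∂w can be illustrated on a one-parameter toy cost function (an illustrative sketch; the network's actual cost is computed over the training data):

```python
def gradient_descent(w, grad, eta=0.1, steps=100):
    # Repeatedly apply the update rule w <- w - eta * dC/dw.
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy cost C(w) = (w - 3)^2 with gradient dC/dw = 2 * (w - 3);
# gradient descent drives w toward the minimum at w = 3.
w_final = gradient_descent(w=0.0, grad=lambda w: 2 * (w - 3))
```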
2.2.3 Overfitting
When training on a subset, there is a risk that the network learns features that are specific to the training data set but do not apply in general. This problem is called overfitting. One way to reduce overfitting is early stopping: training stops when the classification accuracy on a validation data set stops increasing [12]. Another common way is the dropout technique, where a random subset of neurons is excluded in each training iteration. This makes individual components of the network less dependent on other components, and hence the network more stable [12][10].
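The early-stopping rule can be sketched as a training loop that monitors validation accuracy (hypothetical helper names; the patience threshold is an illustrative choice, not a value from the thesis):

```python
def train_with_early_stopping(epochs, val_accuracy, patience=3):
    # Stop when validation accuracy has not improved for `patience` epochs.
    best, stale, stopped_at = -1.0, 0, None
    for epoch in range(epochs):
        acc = val_accuracy(epoch)  # accuracy on the held-out validation set
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                stopped_at = epoch
                break
    return best, stopped_at

# Simulated accuracy curve that peaks and then plateaus.
curve = [0.5, 0.6, 0.7, 0.72, 0.71, 0.70, 0.69, 0.69]
best, stopped = train_with_early_stopping(len(curve), lambda e: curve[e])
```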
2.3 Char74K and ICDAR2003 data sets
The Char74K data set contains images of characters from both the Latin and Kannada scripts, cropped from images of street scenes taken in Bangalore, India, with a standard camera. The data set also contains handwritten Latin and Kannada characters. However, in this work the handwritten characters and the Kannada alphabet images were not used. Only the Latin alphabet and the Arabic numerals were used, which amounted to 12498 images distributed over 62 classes, see figure 2.2.
The ICDAR2003 data set contains 11615 images of Latin characters and Arabic numerals from natural scenes. The data set was used in the ICDAR Robust Reading competition 2003 [4].
(a) Char74K (b) ICDAR 2003
Figure 2.2: Example images from the Char74K and ICDAR 2003 data sets.
Both of these data sets have been widely used in research projects in the area of character recognition, both for training machine learning models and for benchmarking performance in terms of accuracy. One major difference between the two is that Char74k contains more extreme cases, such as images rotated 90 degrees or more, or very blurry images. This makes Char74k the more difficult data set for character recognition. Although the ICDAR2003 data set does not contain as many extreme cases as Char74k, there are still some, for example images with a width of only 1 or 2 pixels.
2.4 Related Work
2.4.1 Character recognition in natural images
In 2009, De Campos et al. tackled the problem of recognizing characters in images of natural scenes [4]. Their focus was mostly on situations where traditional optical character recognition (OCR) was ineffective. They gathered and labeled a data set of images containing English and Kannada characters, taken with a standard camera in the streets of Bangalore.
De Campos et al. used three classification schemes: nearest neighbor classification, support vector machines, and multiple kernel learning. For the OCR technology, they used the commercial system ABBYY FineReader 8.0². The result was that by using only 15 training images per class, the multiple kernel learning model outperformed the commercial OCR solution by about 25% in classification accuracy. The best result in English character recognition, an accuracy of 55.26%, was achieved by the multiple kernel learning model.
The paper also found that the performance achieved by training on synthetically generated fonts was nearly as good as that of the multiple kernel learning model trained on natural images.
2.4.2 Scene Text Analysis using Deep Belief Networks
Ray et al. performed exhaustive experiments on the problem of character recognition using two machine learning algorithms: Deep Belief Networks and Convolutional Neural Networks [13]. These algorithms were evaluated on four data sets: Char74K English, Char74K Kannada, the ICDAR 2003 Robust OCR data set, and the SVT-CHAR data set. The accuracy achieved using Convolutional Neural Networks was 82.3% for Char74k and 84.46% for ICDAR 2003. These results were achieved using only real data sets for training.
² http://www.abbyy.com
3 Methodology
In this section, the methods used in this thesis will be presented. First, the methods used for generating the synthetic data set will be covered; then the method used to adjust the images in the ICDAR2003 and Char74K data sets to fit the neural network will be presented. After that, the architecture and implementation of the convolutional neural network will be covered, and lastly, the methods used for evaluation will be described.
3.1 Generating the synthetic data set
The synthetic data set was generated using an Adobe Photoshop script. The script automated Photoshop to iteratively generate 20,000 random versions of each glyph according to Algorithm 1, see figure 3.1.
Figure 3.1: The first 40 synthetically generated ’a’ characters.
Function generate()
    FONTS ← the list of fonts in appendix A1
    GLYPHS ← ['a', ..., 'z', 'A', ..., 'Z', '0', ..., '9']
    for glyph in GLYPHS do
        backgroundLayer ← a blank white layer
        textLayer ← glyph in black color
        savedState ← the image's current state in Photoshop
        repeat 20000 times
            textLayer ← set font type to uniformRandom(FONTS)
            textLayer ← set font size to gaussianRandom(µ = 45, σ = 5)
            textLayer ← set angle to gaussianRandom(µ = 0°, σ = 20°)
            textLayer ← center the text's x-position at gaussianRandom(µ = 20, σ = 4)
            textLayer ← center the text's y-position at gaussianRandom(µ = 20, σ = 4)
            if uniformRandom(0, 1) > 0.40 then
                textLayer ← set color to uniformRandom(hex color number)
                backgroundLayer ← set color to uniformRandom(hex color number)
            end
            if uniformRandom(0, 1) > 0.40 then
                apply a lighting gradient with an angle of uniformRandom(180°, 135°, 90°, 45°, 0°)
            end
            by defining intervals on uniformRandom(), apply no blur (40% chance), little blur (30% chance), normal blur (20% chance), or extra blur (10% chance)
            by defining intervals on uniformRandom(), apply no noise (40% chance), little noise (30% chance), normal noise (20% chance), or extra noise (10% chance)
            saveNewImage()
            restore savedState
        end
    end
end

Algorithm 1: The pseudo code for the synthetic data generator.
In total the script generated 62 glyphs × 20,000 images/glyph = 1,240,000 images of size 40x40 pixels. As seen in Algorithm 1, the randomness of the generated images was drawn from a Gaussian distribution for some of the aspects. The Gaussian distribution was achieved by implementing the Box-Muller transform, which uses two uniform random number generators. The Box-Muller transform's basic form is

Z = √(−2 ln V) · cos(2π · U) · σ + µ    (1)

where V and U are two uniformly distributed random numbers, µ is the mean, and σ is the standard deviation. The performance of this method is illustrated in Figure 3.2.
(a) Box-Muller transform performance using µ = 0, σ = 1
(b) Random glyph angle distribution using µ = 0, σ = 20
(c) Random glyph size distribution using µ = 45, σ = 5
(d) Random glyph position distribution in a 40x40 pixel image using µ = 20, σ = 4
Figure 3.2: Graphs illustrating the performance of the Box-Muller transform when generating 20000 random values.
The uniform random number generator used in Algorithm 1 was the standard JavaScript function Math.random(), which generates a number between 0 and 1. This function was also used to generate the values of V and U in the Box-Muller transform.
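The Gaussian sampling in Algorithm 1 can be sketched in Python, with random.random() standing in for the script's Math.random() (a sketch of equation (1), not the actual Photoshop script):

```python
import math
import random

def gaussian_random(mu, sigma):
    # Box-Muller transform: two uniform numbers V, U become one normally
    # distributed number Z = sqrt(-2 ln V) * cos(2*pi*U) * sigma + mu.
    v = 1.0 - random.random()  # shift into (0, 1] to avoid ln(0)
    u = random.random()
    z = math.sqrt(-2.0 * math.log(v)) * math.cos(2.0 * math.pi * u)
    return z * sigma + mu

# E.g. a font size drawn as in Algorithm 1:
size = gaussian_random(mu=45, sigma=5)
```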
3.2 Pre-processing Char74K and ICDAR2003 data sets
The Char74K and ICDAR2003 data sets contain images of different sizes, see figure 2.2. The data sets were pre-processed with a Python script that squarified the images and resized them to 40x40 pixels. The squarification algorithm did not stretch or squeeze the images but extended them by adding new pixels so that the shorter side became as long as the longer side. Algorithm 2 describes this process.
Function squarify(image)
    input: the image to be resized
    edgePixelValues ← a list of the color values of all pixels on the edges of image
    mean ← the mean color value of the colors in edgePixelValues
    greaterPixelValues ← all pixel values i in edgePixelValues where i > mean
    lesserPixelValues ← all pixel values i in edgePixelValues where i < mean
    extensionColor ← the color to be used for extending the image
    if ∥greaterPixelValues∥ > ∥lesserPixelValues∥ then
        extensionColor ← the mean color value in greaterPixelValues
    else
        extensionColor ← the mean color value in lesserPixelValues
    end
    newImage ← extend image by adding pixels of color extensionColor so that the shorter side becomes as long as the longer side
    resize newImage to 40x40 pixels
    save newImage
end
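The logic of Algorithm 2 can be sketched in pure Python on a grayscale image stored as a list of rows (an illustrative sketch; the actual script also resizes the result to 40x40 pixels with an image library):

```python
def squarify(image):
    # image: rectangular list of rows of grayscale values (0-255).
    h, w = len(image), len(image[0])
    if h == w:
        return image
    # Collect the pixel values along all four edges of the image.
    edges = (image[0] + image[-1]
             + [row[0] for row in image[1:-1]]
             + [row[-1] for row in image[1:-1]])
    mean = sum(edges) / len(edges)
    greater = [p for p in edges if p > mean]
    lesser = [p for p in edges if p < mean]
    # Extend with the mean color of whichever side of the edge
    # histogram holds the majority of pixels.
    chosen = greater if len(greater) > len(lesser) else lesser
    ext = round(sum(chosen) / len(chosen)) if chosen else round(mean)
    if h < w:
        # Pad rows until the image is square (padding at the bottom here).
        image = image + [[ext] * w for _ in range(w - h)]
    else:
        # Pad columns on the right of each row.
        image = [row + [ext] * (h - w) for row in image]
    return image
```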