DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020
TRAINING A NEURAL NETWORK USING
SYNTHETICALLY GENERATED DATA
FREDRIK DIFFNER
HOVIG MANJIKIAN
Degree Project in Computer Science
Date: June 2020
Supervisor: Christopher Peters
Examiner: Pawel Herman
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
Abstract
A major challenge in training machine learning models is gathering and labeling a sufficiently large training data set. A common solution is to use a synthetically generated data set to expand or replace a real one. This paper examines the performance of a machine learning model trained on a synthetic data set versus the same model trained on real data. The approach was applied to the problem of character recognition using a machine learning model based on convolutional neural networks. A synthetic data set of 1,240,000 images and two real data sets, Char74k and ICDAR 2003, were used. The model trained on the synthetic data set achieved an accuracy about 50% higher than that of the same model trained on the real data set (Char74K).
Keywords
Synthetic data set, Generating synthetic data sets, Machine learning, Deep learning, Convolutional neural networks, Machine learning model, Character recognition in natural images, Char74k, ICDAR2003.
Sammanfattning
When developing machine learning models, the lack of a sufficiently large training data set can be a problem. A common solution is to use synthetically generated data to either expand or entirely replace a data set of real data. This thesis examines the performance of a machine learning model trained on synthetic data compared with the same model trained on real data. This was applied to the problem of using a convolutional neural network to recognize characters in images from "natural" scenes. A synthetic data set of 1,240,000 images and two data sets of characters from photographs, Char74K and ICDAR2003, were used. The results show that a model trained on the synthetic data set performed about 50% better than the same model trained on Char74K.
Nyckelord
Synthetic data set, Generating synthetic data, Machine learning, Machine learning model, Deep learning, Convolutional neural networks, Character recognition in images, Char74k, ICDAR2003.
Acknowledgements
We would like to express our gratitude to Christopher Peters for his invaluable
constructive criticism and aspiring guidance throughout this work.
Contents
1 Introduction
1.1 Problem
1.2 Methodology
1.3 Delimitations
2 Background
2.1 Synthetic data sets
2.2 Neural Networks
2.3 Char74K and ICDAR2003 data sets
2.4 Related Work
3 Methodology
3.1 Generating the synthetic data set
3.2 Pre-processing Char74K and ICDAR2003 data sets
3.3 Neural Network
3.4 Evaluation
4 Results
4.1 Results from related studies
4.2 Results from this study
5 Discussion and Conclusions
5.1 Discussion on the results
5.2 Discussion on the synthetic data set
5.3 Discussion on the neural network
5.4 Conclusions
5.5 Future Work
References
1 Introduction
The rapid growth of high-resolution video material generated by modern devices (e.g. mobile devices, security cameras) makes detecting and recognizing characters in this material an important problem, especially in areas such as data mining and categorization.
Text detection and character recognition is a classic pattern recognition problem.
For the Latin script, this is largely a solved problem in certain cases, such as images of scanned documents [2]. However, text detection and character recognition in natural images (photographs) pose a much more difficult problem, where characters can be much harder to recognize (e.g. the characters of a neon sign outside a restaurant). In recent years, numerous approaches have been evaluated, both by dividing the problem into stages of text detection, extraction, and recognition, and through end-to-end solutions [20].
This thesis focuses only on the character recognition problem in natural images. For this problem, algorithms implementing neural networks have proven to perform well [2, 5, 20]. Thus, this paper will concentrate mainly on such algorithms.
However, one major problem when training neural networks is the lack of a large and properly labeled data set for training, which also applies to character recognition in natural images [4, 5, 20]. A solution in this situation is to generate a synthetic data set. Generating synthetic data sets is a relatively new area in the machine learning domain. The concept is to generate artificial data that can be used to train machine learning models. The key advantages of training a machine learning model on a synthetically generated data set are feasibility and flexibility. It is often the case that the real data set is too small or missing completely, and in many of these cases it is not feasible to gather more real data because of time or budget constraints [3]. Furthermore, there are cases where real data is available in abundance but still cannot be used because of privacy or confidentiality concerns [14]. In such cases, generating a synthetic data set is a viable option, provided that the results of training the machine learning model on such a data set are acceptable for the purpose. One area where the approach of generating synthetic data is applicable is character recognition. Even though the use of synthetic data has been evaluated in end-to-end solutions and word recognition [6, 7], the area of character recognition lacks studies evaluating the use of synthetic data alone for training neural networks.
1.1 Problem
For the problem of character recognition in natural images, the lack of large data sets for training neural networks is a constraint. Synthetic data sets could be a substitute when the real data set is too small or missing completely. The research question posed in this thesis is thus: can a synthetic data set alone be used to train a neural network for character recognition in natural images?
Our hypothesis is that a convolutional neural network trained on a synthetically generated data set will have a similar performance to the same network trained on a real data set.
1.2 Methodology
Generally, creating a realistic synthetic data set for training can be complicated, since many aspects need to be considered. However, achieving realism artificially in the area of characters in natural images is less difficult, due to the narrow range of aspects of realism involved. A natural image often contains distortion, blurriness, noise, and gradient lighting, and these artifacts can easily be reproduced by software.
TensorFlow is used for creating, training, and evaluating the neural network. The performance of the network is then compared against state-of-the-art results from other studies.
1.3 Delimitations
Text recognition in natural images has traditionally been divided into several sub-problems: detection, extraction, and recognition. This thesis focuses only on generating a synthetic data set for the sub-problem of character recognition, where characters have already been detected and extracted from the natural image.
There are, however, recent studies that achieved good results by developing end-to-end solutions [7, 20].
To evaluate a synthetic data set, a neural network with good performance is needed. When building a neural network, its performance often depends on several parameters. Setting these parameters requires extensive testing, which consumes time, especially when heavy computing power is unavailable.
Due to limitations in time and computing power (a desktop PC with a single mid-range GPU), only a limited amount of testing is possible. Consequently, the parameters of the neural network will be calibrated only lightly, which could lead to an insufficiently optimized network. The limited access to computing power is also a constraint when creating the synthetic data set, which affects the size of the data set.
2 Background
In this section, a background on the techniques used in this paper will be given. First, an introduction to synthetic data sets will be presented; then the area of neural networks will be covered, with a focus on convolutional neural networks. This part is mainly based on the book Neural Networks and Deep Learning by Nielsen¹. Then the Char74K and ICDAR2003 data sets will be introduced, and finally, related work will be presented.
2.1 Synthetic data sets
One common problem when training neural networks is the lack of a large and properly labeled training data set. Gathering such a data set can be very time- and resource-consuming [3]. For example, the Char74K data set, a data set of real images used for character recognition, was gathered by taking 1922 images on the streets of Bangalore (India), cropping out every character, and labeling it [4]. Another concern when dealing with real images is privacy and confidentiality. Sometimes a large and properly labeled real data set is available but unusable because of the high risk of disclosing confidential data, or of disclosing the identity of people without their consent [14]. This is a very common case in, for example, the health care sector.
A common solution in this situation is to generate a synthetic data set, which is often a cheaper and more easily scalable alternative to a real data set. One area where this approach is applicable is character recognition. Since machine learning models are often used for recognizing characters in natural images, the availability of a training data set is crucial. For example, Gupta et al. developed an engine that overlays synthetic text onto an existing background image in a natural way, accounting for the local 3D scene geometry. The synthetic pictures were later used to train a Fully-Convolutional Regression Network, which outperformed state-of-the-art techniques for text detection in natural images [6]. Another example is Jaderberg et al.'s proposal of a new framework for state-of-the-art word recognition; they showed that the new framework could be trained with synthetic data [7].
¹ Nielsen, Michael A. Neural Networks and Deep Learning. Determination Press, 2015.
However, generating synthetic data is not as simple as one would wish. It is important that the synthetic data is well defined in order to achieve reliable results [11]. Many aspects play a major role in defining the synthetic data, such as quantity, quality, variation, and realism. This paper tries to balance all of these aspects, but prioritizes realism, that is, the degree to which the synthetic data mimics the real data.
2.2 Neural Networks
Artificial Neural Networks are a subset of Machine Learning, which in turn is a subset of Artificial Intelligence. Neural networks have proven to be good at solving problems related to image recognition, speech recognition, and natural language processing. The technology is vaguely inspired by the biological neural networks that constitute animal brains. The same concept is applied in computers to enable pattern recognition in observation data. A neural network consists of small processing units called neurons. Each neuron i takes a set X of inputs, where each input x_j ∈ X, with 0 ≤ x_j ≤ 1, has an attached weight w_ij. Each neuron also has a bias b_i. The output y_i of the neuron is then calculated by applying an activation function f to its input,

y_i = f(∑_{j=1}^{n} x_j · w_ij + b_i)

where n is the size of the input set X and 0 ≤ y_i ≤ 1.
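The neuron computation above can be sketched in a few lines of Python. The sigmoid is used here as an illustrative activation function satisfying 0 ≤ y_i ≤ 1; the thesis does not fix a particular f at this point.

```python
import math

def sigmoid(z):
    # Squashes any real z into (0, 1), matching the constraint 0 <= y_i <= 1.
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    # y_i = f(sum_j x_j * w_ij + b_i)
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Example: three inputs with their weights and a small bias.
y = neuron_output([0.5, 0.2, 0.9], [0.4, -0.6, 0.1], 0.05)
```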
A common architecture of a neural network is the feed forward network. In a feed forward network, the neurons are arranged in several layers: one input layer, some hidden layers, and one output layer. The outputs from neurons in one layer then become inputs to the neurons in the next layer. If every neuron in a layer receives input from all of the neurons in the previous layer, the layer is said to be fully connected [12].
2.2.1 Convolutional Neural Networks
Convolutional neural networks (CNNs) have proven to be effective for image classification in recent years [9][16][17]. A CNN is a type of feed forward network. It may contain one or more fully connected layers, and may also contain one or more convolutional layers as hidden layers. The difference between fully connected layers and convolutional layers is that the latter consist of one or more filters, where each neuron does not have connections to all neurons in the previous layer. Instead, each neuron is connected to a small region of the previous layer, a local receptive field or kernel, see figure 2.1. If the previous layer is seen as a matrix, the local receptive field is a k by k sub-matrix, where k is the side length of the local receptive field. The stride s decides where the local receptive field of the next neuron in the filter is located. If s = 1, the local receptive field is moved one column to the right; if s = 2, it moves two columns to the right, and so on. The size of the filter then becomes ⌈(a − k + 1)/s⌉ by ⌈(a − k + 1)/s⌉ when the input layer is an a by a matrix.
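The output-size formula can be checked with a small helper function (an illustrative sketch; the function name is ours, not from the thesis):

```python
import math

def filter_size(a, k, s):
    # Number of receptive-field positions per side for an a-by-a input,
    # a k-by-k kernel, and stride s: ceil((a - k + 1) / s).
    return math.ceil((a - k + 1) / s)

# A 40x40 input with a 5x5 kernel and stride 1 gives a 36x36 filter.
size = filter_size(40, 5, 1)
```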
Figure 2.1: A convolutional layer (right) with a 5 by 5 local receptive field.
Source: Nielsen [12]
Each neuron in a filter shares the same weights and bias, but the weights and biases can differ between the filters in the same convolutional layer. Hence, each filter can be seen as learning a separate feature from the input layer. Several filters can run in parallel, each learning a different feature. A pooling layer can be used after one or more convolutional layers to summarize and reduce the number of parameters passed to the next layer.
To be able to make predictions, one or more fully connected layers are used at the end of the pipeline, where the last layer has as many outputs as there are classes the network should be able to distinguish [12].
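The weight sharing and stride described above can be illustrated with a minimal pure-Python convolution over a square grayscale matrix (an illustrative sketch, not the thesis implementation; in practice TensorFlow provides this as a built-in layer):

```python
def convolve(image, kernel, stride=1):
    # Slide a k-by-k kernel over an a-by-a image. Every output neuron
    # uses the same weights (the kernel), illustrating weight sharing.
    a, k = len(image), len(kernel)
    out = []
    for i in range(0, a - k + 1, stride):
        row = []
        for j in range(0, a - k + 1, stride):
            # Weighted sum over one local receptive field.
            acc = sum(image[i + u][j + v] * kernel[u][v]
                      for u in range(k) for v in range(k))
            row.append(acc)
        out.append(row)
    return out
```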
2.2.2 Training
The network "learns" by training on a set of training data. There are two general types of training: unsupervised and supervised. In unsupervised training, the data set given for training contains no labels; it is up to the network itself to extract and learn classes of patterns from the training data. In supervised training, the correct labels are given with the training data. Training with labeled data is often more efficient, but the disadvantage is that it requires a large amount of labeled data.
The aim of the network during training is to minimize the cost function C(W, B), which is done by changing the weights W and biases B. The cost function usually depends on the difference between the network's predicted output y(x) and the actual classification of a training input x. A weight w is updated with the formula w ← w − η · ∂C/∂w, where η is the learning rate. The bias is updated in a similar way. This is done through an algorithm called backpropagation. Backpropagation is rather involved, and this paper will not go into further detail here.
The main computational burden when training a neural network is the computation of the partial derivatives. A common way to speed up this procedure is to estimate the cost function by training only on a random subset of the training data. This is done iteratively, with a new random subset in each iteration. The technique is known as Stochastic Gradient Descent [12].
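The update rule w ← w − η · ∂C/∂w can be illustrated on a one-parameter toy cost function (an illustrative sketch; the network's actual cost is computed over the training data):

```python
def gradient_descent(w, grad, eta=0.1, steps=100):
    # Repeatedly apply the update rule w <- w - eta * dC/dw.
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy cost C(w) = (w - 3)^2 with gradient dC/dw = 2 * (w - 3);
# gradient descent drives w toward the minimum at w = 3.
w_final = gradient_descent(w=0.0, grad=lambda w: 2 * (w - 3))
```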
2.2.3 Overfitting
When training on a subset, there is a risk that the network learns features that are specific to the training data set but do not apply in general. This problem is called overfitting. One way to reduce overfitting is early stopping: training stops when the classification accuracy on a validation data set stops increasing [12]. Another common way is the dropout technique, where a random subset of neurons is excluded in each training iteration. This makes individual components of the network less dependent on other components, and hence the network more stable [12][10].
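The early-stopping rule can be sketched as a training loop that monitors validation accuracy (hypothetical helper names; the patience threshold is an illustrative choice, not a value from the thesis):

```python
def train_with_early_stopping(epochs, val_accuracy, patience=3):
    # Stop when validation accuracy has not improved for `patience` epochs.
    best, stale, stopped_at = -1.0, 0, None
    for epoch in range(epochs):
        acc = val_accuracy(epoch)  # accuracy on the held-out validation set
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                stopped_at = epoch
                break
    return best, stopped_at

# Simulated accuracy curve that peaks and then plateaus.
curve = [0.5, 0.6, 0.7, 0.72, 0.71, 0.70, 0.69, 0.69]
best, stopped = train_with_early_stopping(len(curve), lambda e: curve[e])
```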
2.3 Char74K and ICDAR2003 data sets
The Char74K data set contains images of characters from both the Latin and Kannada scripts, cropped from images of street scenes taken in Bangalore, India, with a standard camera. The data set also contains handwritten Latin and Kannada characters. However, in this work the handwritten characters and the Kannada alphabet images were not used. Only the Latin alphabet and the Arabic numerals were used, which amounted to 12498 images distributed over 62 classes, see figure 2.2.
The ICDAR2003 data set contains 11615 images of Latin characters and Arabic numerals from natural scenes. The data set was used in the ICDAR Robust Reading competition 2003 [4].
(a) Char74K (b) ICDAR 2003
Figure 2.2: Example images from the Char74K and ICDAR 2003 data sets.
Both of these data sets have been widely used in research projects in the area of character recognition, both for training machine learning models and for benchmarking performance in terms of accuracy. One major difference between the two is that Char74k contains more extreme cases, such as images rotated 90 degrees or more, or very blurry images. This makes Char74k the more difficult data set for character recognition. Although the ICDAR2003 data set does not contain as many extreme cases as Char74k, there are still some, for example images with a width of only 1 or 2 pixels.
2.4 Related Work
2.4.1 Character recognition in natural images
In 2009, De Campos et al. tackled the problem of recognizing characters in images of natural scenes [4]. Their focus was mostly on situations where traditional optical character recognition (OCR) was ineffective. They gathered and labeled a data set of images containing English and Kannada characters, taken with a standard camera in the streets of Bangalore.
De Campos et al. used three classification schemes: nearest neighbor classification, support vector machines, and multiple kernel learning. For the OCR technology, they used the commercial system ABBYY FineReader 8.0². The result was that by using only 15 training images per class, the multiple kernel learning model outperformed the commercial OCR solution by about 25% in classification accuracy. The best result in English character recognition, an accuracy of 55.26%, was achieved by the multiple kernel learning model.
The paper also found that the performance achieved by training on synthetically generated fonts was nearly as good as that of the multiple kernel learning model trained on natural images.
2.4.2 Scene Text Analysis using Deep Belief Networks
Ray et al. performed exhaustive experiments on the problem of character recognition using two machine learning algorithms: Deep Belief Networks and Convolutional Neural Networks [13]. These algorithms were evaluated on four data sets: Char74K English, Char74K Kannada, the ICDAR 2003 Robust OCR data set, and the SVT-CHAR data set. The accuracy achieved using Convolutional Neural Networks was 82.3% for Char74k and 84.46% for ICDAR 2003. These results were achieved using only real data sets for training.
² http://www.abbyy.com
3 Methodology
In this section, the methods used in this thesis will be presented. First, the methods used for generating the synthetic data set will be covered; then the method used to adjust the images in the ICDAR2003 and Char74K data sets to fit the neural network will be presented. After that, the architecture and implementation of the convolutional neural network will be covered, and lastly, the methods used for evaluation will be described.
3.1 Generating the synthetic data set
The synthetic data set was generated using an Adobe Photoshop script. The script automated Photoshop to iteratively generate 20,000 random versions of each glyph according to Algorithm 1, see figure 3.1.
Figure 3.1: The first 40 synthetically generated ’a’ characters.
Function generate()
    FONTS ← the list of fonts in appendix A1
    GLYPHS ← ['a', ..., 'z', 'A', ..., 'Z', '0', ..., '9']
    for glyph in GLYPHS do
        backgroundLayer ← a blank white layer
        textLayer ← glyph in black color
        savedState ← the image's current state in Photoshop
        repeat 20000 times
            textLayer ← set font type to uniformRandom(FONTS)
            textLayer ← set font size to gaussianRandom(µ = 45, σ = 5)
            textLayer ← set angle to gaussianRandom(µ = 0°, σ = 20°)
            textLayer ← center the text's x-position at gaussianRandom(µ = 20, σ = 4)
            textLayer ← center the text's y-position at gaussianRandom(µ = 20, σ = 4)
            if uniformRandom(0, 1) > 0.40 then
                textLayer ← set color to uniformRandom(hex color number)
                backgroundLayer ← set color to uniformRandom(hex color number)
            end
            if uniformRandom(0, 1) > 0.40 then
                apply a lighting gradient with an angle of uniformRandom(180°, 135°, 90°, 45°, 0°)
            end
            by defining intervals on uniformRandom(), apply no blur (40% chance), little blur (30% chance), normal blur (20% chance), or extra blur (10% chance)
            by defining intervals on uniformRandom(), apply no noise (40% chance), little noise (30% chance), normal noise (20% chance), or extra noise (10% chance)
            saveNewImage()
            restore savedState
        end
    end
end

Algorithm 1: The pseudo code for the synthetic data generator.
In total the script generated 62 glyphs × 20,000 images/glyph = 1,240,000 images of size 40x40 pixels. As seen in Algorithm 1, the randomness of the generated images was drawn from a Gaussian distribution for some of the aspects. The Gaussian distribution was achieved by implementing the Box-Muller transform, which uses two uniform random number generators. The Box-Muller transform's basic form is

Z = √(−2 ln V) · cos(2π · U) · σ + µ    (1)

where V and U are two uniformly distributed random numbers, µ is the mean, and σ is the standard deviation. The performance of this method is illustrated in Figure 3.2.
(a) Box-Muller transform performance using µ = 0, σ = 1
(b) Random glyph angle distribution using µ = 0, σ = 20
(c) Random glyph size distribution using µ = 45, σ = 5
(d) Random glyph position distribution in a 40x40 pixel image using µ = 20, σ = 4
Figure 3.2: Graphs illustrating the performance of the Box-Muller transform when generating 20000 random values.
The uniform random number generator used in Algorithm 1 was the standard JavaScript function Math.random(), which generates a number between 0 and 1. This function was also used to generate the values of V and U in the Box-Muller transform.
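The Gaussian sampling in Algorithm 1 can be sketched in Python, with random.random() standing in for the script's Math.random() (a sketch of equation (1), not the actual Photoshop script):

```python
import math
import random

def gaussian_random(mu, sigma):
    # Box-Muller transform: two uniform numbers V, U become one normally
    # distributed number Z = sqrt(-2 ln V) * cos(2*pi*U) * sigma + mu.
    v = 1.0 - random.random()  # shift into (0, 1] to avoid ln(0)
    u = random.random()
    z = math.sqrt(-2.0 * math.log(v)) * math.cos(2.0 * math.pi * u)
    return z * sigma + mu

# E.g. a font size drawn as in Algorithm 1:
size = gaussian_random(mu=45, sigma=5)
```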
3.2 Pre-processing Char74K and ICDAR2003 data sets
The Char74K and ICDAR2003 data sets contain images of different sizes, see figure 2.2. The data sets were pre-processed with a Python script that squarified the images and resized them to 40x40 pixels. The squarification algorithm did not stretch or squeeze the images but extended them by adding new pixels so that the shorter side became as long as the longer side. Algorithm 2 describes this process.
Function squarify(image)
    input: the image to be resized
    edgePixelValues ← a list of the color values of all pixels on the edges of image
    mean ← the mean color value of the colors in edgePixelValues
    greaterPixelValues ← all pixel values i in edgePixelValues where i > mean
    lesserPixelValues ← all pixel values i in edgePixelValues where i < mean
    extensionColor ← the color to be used for extending the image
    if ∥greaterPixelValues∥ > ∥lesserPixelValues∥ then
        extensionColor ← the mean color value in greaterPixelValues
    else
        extensionColor ← the mean color value in lesserPixelValues
    end
    newImage ← extend image by adding pixels of color extensionColor so that the shorter side becomes as long as the longer side
    resize newImage to 40x40 pixels
    save newImage
end
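The logic of Algorithm 2 can be sketched in pure Python on a grayscale image stored as a list of rows (an illustrative sketch; the actual script also resizes the result to 40x40 pixels with an image library):

```python
def squarify(image):
    # image: rectangular list of rows of grayscale values (0-255).
    h, w = len(image), len(image[0])
    if h == w:
        return image
    # Collect the pixel values along all four edges of the image.
    edges = (image[0] + image[-1]
             + [row[0] for row in image[1:-1]]
             + [row[-1] for row in image[1:-1]])
    mean = sum(edges) / len(edges)
    greater = [p for p in edges if p > mean]
    lesser = [p for p in edges if p < mean]
    # Extend with the mean color of whichever side of the edge
    # histogram holds the majority of pixels.
    chosen = greater if len(greater) > len(lesser) else lesser
    ext = round(sum(chosen) / len(chosen)) if chosen else round(mean)
    if h < w:
        # Pad rows until the image is square (padding at the bottom here).
        image = image + [[ext] * w for _ in range(w - h)]
    else:
        # Pad columns on the right of each row.
        image = [row + [ext] * (h - w) for row in image]
    return image
```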