A Neural Network Approach to Arbitrary Symbol Recognition on Modern Smartphones

Academic year: 2021


FINAL

SAMUEL WEJÉUS

Master’s Thesis at CSC, KTH
Supervisor: Jens Lagergren
Examiner: Anders Lansner
Project Sponsor: Bontouch AB


Abstract

Making a computer understand handwritten text and symbols has numerous applications, ranging from reading bank checks and mail addresses to digitalizing arbitrary note taking. Benefits include automation of processes, efficient electronic storage, and possible augmented usage of parsed content.


Referat

Recognition of symbols using neural networks on modern mobile phones

Making a computer understand handwritten text and symbols has many applications, ranging from reading bank checks and postal addresses to digitalizing arbitrary notes. Benefits of such a process include automation of procedures, electronic storage, and possible extended use of the parsed material.

This report aims to give an overview of how systems for off-line recognition of handwritten text can be built. We intend to show how such systems can be divided into smaller isolated parts that can be realized individually. The focus will be on recognition of individual handwritten symbols, and we will present how this can be done using convolutional neural networks on a modern mobile phone. An application performing recognition of said symbols will be created for Apple’s iOS operating system in the form of a proof-of-concept application.


Contents

List of Figures
List of Tables
Acronyms

1 Introduction
  1.1 History
  1.2 The Client
  1.3 Applications
  1.4 Problem Statement
  1.5 Challenges
  1.6 Limitations of Scope

2 Related Work
  2.1 The MNIST Database
  2.2 Current State of Field
  2.3 Handwriting Recognition
    2.3.1 Writing Styles and Related Issues
    2.3.2 Writer-dependence vs. Writer-independence
    2.3.3 On-line vs. Off-line handwriting
    2.3.4 Segmentation
    2.3.5 Features and Feature Selection
  2.4 Problem Reduction Techniques
  2.5 Description of a Complete HWR System

3 Theory
  3.1 Choice of Symbol Classifier
  3.2 Artificial Neural Networks
    3.2.1 Network Structure and Network Training
    3.2.2 Deep Learning of Representations
    3.2.3 Convolutional Neural Networks


List of Figures

2.1 Samples from the MNIST dataset
2.2 Hard samples from the MNIST set
2.3 Example of recognition using the Evernote OCR system
2.4 Variation in writing style
2.5 Example of a captured word image before explicit segmentation
2.6 Example of different results after performing explicit segmentation
2.7 Example of sliding-window technique
2.8 Hypothetical HWR System Pipeline
3.1 Visual model of the McCulloch and Pitts neuron
3.2 Overview of a simple neural network model
3.3 Plot of network performance over time
3.4 A typical two-stage ConvNet
3.5 Edge detection using convolution
4.1 Training using different learning rates
4.2 Input views of application prototype
4.3 Example of user drawing custom shapes
4.4 Drawing classification pipeline
4.5 Classification using camera capture
4.6 Pre-processing stages of sample captured with camera
4.7 Number of correct classifications using prototype (individual numbers)
4.8 Number of correct classifications using prototype (total)


List of Tables


Chapter 1

Introduction

The process of parsing samples of handwritten text into symbols is usually referred to as recognition or classification. The purpose of the resulting representation is that it can be interpreted by a machine. One usually makes a distinction between two types of recognition: if the characters are printed in typewriter fonts, their recognition is referred to as Optical Character Recognition (OCR), and if the characters are written by hand we call the process Handwriting Recognition (HWR). Furthermore, handwriting can be distinguished as being either on-line or off-line, depending on when the text is captured. If the text is captured while the author is writing, it is referred to as on-line mode; otherwise it is referred to as off-line.

A complete recognition system consists of three parts, commonly corresponding to three separate problems: localization, segmentation, and recognition. The goal of each of these parts is: isolating and finding contours of individual words (localization), separating a word into individual characters (segmentation), and finally mapping segmented chunks to the correct interpretation (recognition).

Today the most popular approaches to character recognition involve some form of Machine Learning (ML) technique. A classifier is the set of techniques used; it can be seen as a black box producing output in the form of a classification given some input. Classification is regarded as an instance of supervised learning, that is, learning performed using a set of correctly identified observations. Machine learning can be used for a complete recognition system or for specific parts. A recognition system involves complex tasks that need large computational resources. Building recognition systems for smartphones has not yet been widely investigated [24]; consequently, it is of interest to determine a set of suitable technologies for this type of device.


An application running on a smartphone will be presented as a proof-of-concept for how neural networks can be used efficiently on smartphones.

1.1 History

Character Recognition (CR) was first studied in the beginning of the 1900s, taking a mechanical approach using photocells [1]. Common techniques investigated included simple template matching and structural analysis. Initial development came to a halt when researchers realized the huge diversity and variability of handwritten input [48]. The history of research efforts has not been a linear process. The problem of CR was at first a very popular subject in the field of pattern recognition, since it was regarded as an easy problem to solve.

As in many other fields of science and technology, progress is usually tangled and is often made when research diverges and different results are then cross-bred [48]. Modern state-of-the-art recognition systems use techniques from various fields of pattern recognition, machine learning, and artificial intelligence. Today, on-line HWR is considered a close to solved problem [54]. The problem of off-line handwriting is, however, much harder and is considered an open question in the research community [55].

1.2 The Client

Bontouch AB is an IT consulting company whose aim is to partner in long-term collaborations with its customers. The company focuses on mobile solutions for platforms like Android (Google)1 and iOS (Apple)2. Bontouch is located in Stockholm but has a global market. Projects developed for clients include, among others, Sweden’s first banking app, for Skandinaviska Enskilda Banken AB (SEB)3, which makes use of off-line OCR scanning of a predefined printed OCR font on invoices.

For future projects, Bontouch is very interested in how current recognition systems can be extended or replaced to recognize arbitrary input. The main interest for Bontouch AB is to get an overview of the state of the field: what can be accomplished today, and how could a recognition system be implemented on a mobile platform?

1.3 Applications

HWR software simplifies the process of extracting data from handwritten documents and storing it in electronic formats. There are numerous applications for HWR systems; in fact, many such applications are commercially available today. These systems range from reading bank checks and verifying signatures to automatically creating digital versions of books. General recognition of symbols would also prove valuable in subsystems, for instance in autonomous cars for interpreting traffic signs. An interesting application of CR, specifically using a smartphone, could for instance be a currency converter for tourists: by simply aiming the smartphone camera at price tags, prices could be converted in real-time.

1. http://developer.android.com/
2. https://developer.apple.com/devcenter/ios/

1.4 Problem Statement

This project seeks to determine how a system for recognition of handwritten symbols can be implemented on a modern smartphone. We will focus on the single symbol off-line case of handwriting, i.e., the input is a static image containing one symbol. This report will discuss the different stages of a complete recognition system to give an overview of the challenges faced, but mainly focus on the recognition phase. A prototype application will be developed and evaluated in order to test such a recognition system with real-world data. The prototype needs to be platform agnostic in the sense that it should work on both Android and iOS equipped devices, as requested by the customer.

1.5 Challenges

The real challenge in automatic handwriting recognition applications is how they conform to changes in their environment. In the ideal case, an HWR system should be able to operate properly without any assumptions about the data captured from the real world. Conditions that can affect the results include varying colors in the scene, illumination conditions, and variation in writing style between independent users. Creating recognition systems for embedded devices such as smartphones adds additional challenges, such as severe limitations in memory and CPU performance. Even with the impressive performance of modern smartphones such as the iPhone 5 or Google Nexus 4, such devices deliver only a fraction of the performance of a modern desktop PC [47]. Studies have shown that users tend to rate responsive applications higher, resulting in positive market success [69]. The procedure from capturing input data to final classification thus has to be fast.

The following list points out the problems an embedded recognition system must solve in order to meet the stated requirements.

1. Localization. Given a document containing several words and/or symbols, their positions have to be located.


3. Variance. Handwriting is subject to high variance in writing style between different authors. Factors include size, rotation, elongation, and skew, or the equivalent of different fonts. Even a single person writing the same symbol twice is subject to variation in size and position.

4. Real-time processing. Compared to a modern desktop PC, a smartphone puts strong constraints on memory usage, battery consumption, and CPU usage.

5. Scene invariance. Capturing scenes from the real world involves heavy pre-processing to select the correct color channels and to separate foreground symbols from the background. Lighting conditions also play an important role in scene separation.

6. Platform agnostic. The proposed solution should be able to run on the majority of modern smartphone devices. In practice this means Android and iOS devices.

1.6 Limitations of Scope

Creating a complete handwriting recognition system is a huge task, and hence some limitations of scope have to be established for this report. The focus will be on the recognition phase of a HWR system. The prototype will be trained to classify single digits using the Modified NIST (National Institute of Standards and Technology) (MNIST) database (explained further in section 2.1). It is hypothesized that the same classifier trained on digits can also be trained for arbitrary symbols; this will be explained further in sections 3.2.2 and 3.2.3.

Localization of words or lines in a document is normally carried out in an isolated procedure and will not be discussed in depth, other than mentioning recommended procedures. As will be explained in the “Related Work” chapter, there are strong suggestions that techniques combining several stages of the CR pipeline produce better results. Combining segmentation and recognition into one single classification step for word recognition uses different techniques than performing explicit segmentation followed by recognition. Since segmentation could be an integrated part of a HWR system, an overview of the subject is provided in section 2.3.4. The focus of the suggested prototype will be on single symbol recognition, constrained by the assumptions listed below.

• Only one symbol at a time will be recognized.

• When using real-world data, the input sample has to be well illuminated.
• The classifier will only deal with binary data and assumes the images have


Chapter 2

Related Work

This chapter presents an overview of approaches to HWR, both historical and modern. We elaborate on the problems inherent in CR and describe the current state of the field. A description of a complete HWR system is given in section 2.5. In order to train and test a HWR system we will use the MNIST dataset, a popular benchmark set consisting of handwritten digits. Section 2.4 deals with methods that can make the CR problem easier to solve.

2.1 The MNIST Database

In order to make objective experiments, an adequate dataset is needed. One such dataset is the MNIST database [71, 35]. The MNIST database was created by Yann LeCun et al. to be used as a benchmark for testing various classifiers of handwritten input. It was created by letting 250 different authors write single digit numbers by hand. The set consists of a training set of 60,000 samples and a test set of 10,000 samples. The two sets are completely disjoint in the sense that the authors in the training set are not the same as the authors in the test set. The digits are centered and size-normalized in fixed-size images. Each sample consists of a 28x28 pixel grayscale image of a single digit together with a label giving its correct classification. The MNIST database is a popular benchmark in the pattern recognition community, which makes comparison with other classifiers easier. Consisting of 60,000 training samples, the MNIST set is considered large enough for reliable training and testing of classifiers. Using the MNIST set will also make evaluation of our proposed solution easier, since the prototype can then be compared with solutions available in the research community. Typical samples from the MNIST set are shown in figure 2.1. The set also includes samples that are hard to classify even for a human, as shown in figure 2.2.
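As an aside, the raw MNIST files are distributed in the simple big-endian IDX format and are straightforward to parse. The sketch below (in Python with NumPy, used here purely as an illustration; the file names in the docstrings are the standard download names) loads the image and label files into arrays:

```python
import gzip
import struct

import numpy as np

def load_mnist_images(path):
    """Parse an IDX image file (e.g. train-images-idx3-ubyte.gz) into an
    array of shape (n_samples, rows, cols) of uint8 grayscale pixels."""
    with gzip.open(path, "rb") as f:
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        pixels = np.frombuffer(f.read(), dtype=np.uint8)
    return pixels.reshape(n, rows, cols)

def load_mnist_labels(path):
    """Parse an IDX label file (e.g. train-labels-idx1-ubyte.gz) into a
    one-dimensional array of digit labels 0-9."""
    with gzip.open(path, "rb") as f:
        magic, n = struct.unpack(">II", f.read(8))
        assert magic == 2049, "not an IDX label file"
        return np.frombuffer(f.read(), dtype=np.uint8)
```

The magic numbers 2051 and 2049 identify IDX image and label files, respectively; checking them guards against accidentally swapping the two files.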

When using machine learning algorithms, the most important part is not how well the classifier adapts to the training data but how well it generalizes to unseen data. Since the training and test sets were written by different authors, performance on the MNIST test set should give a good indication of how well the classifier will perform on real-world data.

Figure 2.1. Samples from the MNIST dataset.

Figure 2.2. Hard samples from the MNIST set.

2.2 Current State of Field

Today there exists, to the best knowledge of the author, no complete system for handwriting recognition with impeccable accuracy. Theories for such systems have been investigated in [58], which uses a combination of several techniques including neural networks and so-called Hidden Markov models.

Taking the step from single letter recognition to word recognition, researchers have found that it is more effective to do both at once instead of separating the recognition into explicit segmentation and recognition phases. Another extension from single symbol to word recognition, similar to Hidden Markov Models (HMMs), is to use neural networks in combination with Graph Transformer Networks (GTNs). GTNs, as presented in [7], give an indication that using neural networks as a base for building word recognizers is a promising topic for future research.

Relaxing the demand for correct word recognition and accepting a few errors on the character level, commercial applications such as the note-taking application Evernote1 have successfully implemented what we here refer to as searchable word recognition [29, 53, 19]. Such a system does not try to find one correct interpretation of a word but instead finds all likely possibilities, as can be seen in figure 2.3.

Figure 2.3. Example of recognition using the Evernote OCR system (image from Evernote techblog).

Since the scope of this report is limited to single digit recognition, an overview of recent results on this type of problem will now be presented.

To date, the best results reported for various types of classifiers trained on the MNIST dataset [71] are presented in table 2.1 below. The table is an aggregation of different attempts at building a classifier using the listed techniques; only the best result reported for each type is listed. Looking at the last line of the table, we can see that recent results imply that a specific type of neural network, namely convolutional networks, has produced the lowest error and is considered the most efficient approach for recognition of handwritten symbols.

Table 2.1. Best reported MNIST error rates for various classifier types.

Classifier type    Best result for type (error %)    Reference
Linear             7.6                               LeCun et al. [35]
Non-Linear         3.3                               LeCun et al. [35]
K-NN               0.52                              Keysers et al. [30]
Boosted Stumps     0.87                              Kegl et al. [28]
SVM                0.56                              DeCoste, Scholkopf
MLP                0.35                              Ciresan et al. [11]
ConvNet            0.23                              Ciresan et al. [10]


2.3 Handwriting Recognition

In this section we explain the problems inherent in creating a handwriting recognition system. Most pattern recognition problems involve using extracted features as input for classifiers. In section 2.3.5 we explain why this approach is considered unwieldy for CR systems.

2.3.1 Writing Styles and Related Issues

Differences in writing style are the biggest source of difficulty when it comes to understanding handwritten text. A simple illustration of the differences between styles can be seen in figure 2.4. The first lines, starting from the top, are referred to as discrete handwriting, and the range moves towards more connected, or continuous, types of writing. The styles of the last couple of lines are commonly referred to as connected and pure cursive, respectively.

Most people use a mixed writing style, alternating between the discrete and connected styles, as can be seen on the last line of figure 2.4.

Figure 2.4. Variation in writing style [64].

A fully discrete writing style is considered easier to parse, due to the great advantage that the text is easy to segment. Segmentation is one of the hardest problems, if not the hardest, when building text recognition systems.


2.3.2 Writer-dependence vs. Writer-independence

A writer-independent system is unaffected by differences in writing style between users; such a system should be able to recognize handwriting not previously seen during training. Writer-independent systems are generally regarded as much harder to construct [68]. This is not limited to how machines view handwriting; even humans are much more capable of recognizing their own handwriting than that of a stranger. The difficulty stems from the high degree of variance inherent in handwriting across authors, and a writer-independent system must learn to cope with this extra level of complexity by being better at generalizing. In the opposite case, a writer-dependent system only has to learn a few different styles, making the problem easier.

2.3.3 On-line vs. Off-line handwriting

There are two primary settings for recognizing handwritten text: the text is captured in either on-line or off-line mode. On-line recognition consists of recognizing text as it is being written, capturing data about speed, movement, and pressure from some input device. Several ways of capturing text in on-line mode exist, ranging from PDAs to pressure-sensitive styluses. In on-line mode a rich set of features can be captured from the writing process; the captured sensor data can include speed, geometrical position, pressure, and temporal information. None of these are present in off-line mode, where handwriting is recognized after the text has been written, usually from static images captured with a scanner or camera. This conveys significantly less information.

It is thus commonly agreed that on-line handwriting recognition is the easier of the two primary problems, since the extra information makes it easier to solve. The off-line case, on the other hand, is significantly more complex and is still considered an open research problem [20].

2.3.4 Segmentation

Segmentation is a vital part of a CR system and one major source of classification errors [56]. In the on-line case there is a great benefit in knowing the input device’s pen-up and pen-down movements; the input is time-ordered, which helps in deciding accurate segmentation points. This information is not available in the off-line case. Handwritten words often do not contain clear boundary points between characters, especially in the case of cursive handwriting, where characters mostly overlap, as illustrated in figure 2.4 in section 2.3.1. Many methods have been suggested for segmentation. According to [8] it is possible to identify three “pure” strategies:

• Classic approach: Segmentation points are determined based on “character-like” properties.


• Holistic methods: Instead of segmenting symbols, the whole word is recognized, thus sidestepping individual character recognition.

For classical methods it is common to use heuristics to locate good segmentation points [6]. These can be more or less sophisticated; examples include using histograms or character properties such as average character width or contour extraction. A problem with classical methods such as explicit segmentation is that a decision on segmentation points is not local and can affect future decisions. In figures 2.5 and 2.6 we can see how an incorrect segmentation decision affects the segmentation of subsequent characters.
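As an illustration of the histogram heuristic mentioned above, the following sketch proposes segmentation points from the vertical projection profile of a binarized word image. It only handles the easy, fully discrete case where characters are separated by blank columns, and the function name and parameters are our own:

```python
import numpy as np

def segmentation_candidates(binary_word, min_gap=1):
    """Propose cut points from the vertical projection histogram of a
    binarized word image (foreground pixels == 1). The centre of every
    run of empty columns lying between two inked regions is returned."""
    profile = binary_word.sum(axis=0)  # ink count per column
    cuts, in_gap, seen_ink, start = [], False, False, 0
    for x, ink in enumerate(profile):
        if ink > 0:
            if in_gap and seen_ink and x - start >= min_gap:
                cuts.append((start + x - 1) // 2)  # centre of the gap
            in_gap, seen_ink = False, True
        elif not in_gap:
            in_gap, start = True, x
    return cuts
```

On cursive input, where characters overlap, the profile rarely reaches zero, which is precisely why explicit segmentation breaks down there.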

Figure 2.5. Example of a captured word image before explicit segmentation. Image taken from [20].

Figure 2.6. Example of different results after performing explicit segmentation. Image taken from [20].

For recognition-based approaches, a lot of research points in the direction that segmentation cannot be done in isolation and must instead be carried out in conjunction with recognition [6, 8, 3]. This leads to an interesting paradox: it is necessary to segment in order to recognize, but it is also necessary to recognize in order to segment. One way of overcoming this is discussed in [33]. Unfortunately it is

to single symbol recognition. Popular techniques that have provided good results include the use of HMMs and GTNs.

The sliding-window approach, illustrated in figure 2.7, is an example of how simultaneous segmentation and recognition can be achieved. After localization finds a possible word chunk, the word is split into several small strips which are then analyzed. Using statistical tools, the start and end of a single character can be identified; the system can then conclude with high probability that some series of strips corresponds to a character.

When a character is identified it also marks a segmentation point; thus segmentation and recognition are performed simultaneously. A popular probabilistic model for this type of operation is the HMM. The HMM technique is a huge topic in itself and out of scope for this report; the interested reader is referred to [20, 7].
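A minimal sketch of the strip extraction step is given below; the frame-level classifier or HMM that would consume the strips is omitted, and the width and stride values are illustrative only:

```python
import numpy as np

def sliding_windows(word_img, width=4, stride=2):
    """Scan a word image left to right, yielding (position, strip) pairs
    of fixed-width vertical strips; each strip would be scored by a
    frame-level classifier or HMM in a full recognizer."""
    height, total_width = word_img.shape
    for x in range(0, total_width - width + 1, stride):
        yield x, word_img[:, x:x + width]
```

A run of consecutive high-scoring strips can then be merged into a character hypothesis, with its boundaries doubling as segmentation points.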

Figure 2.7. Example of the sliding-window technique. Image chunks extracted from a single word. Image from IAM-OnDB.

2.3.5 Features and Feature Selection

Features are used to make the classification of a sample easier. Understanding what features are, how they are constructed, and how they are selected is of high importance, and the question of which features to use when classifying data matters for a deeper theoretical understanding of pattern recognition. We will also see why selecting good features for CR is cumbersome work and how we can circumvent it.

What is a Feature?

In the case of CR, as with many problems related to pattern recognition, a fundamental problem is to find a measurement or function that can describe the data. Given some input data, an extracted feature is a number or vector used as an indicator of a set of predefined characteristics that are present in the input.


as random variables. This is natural, as the measurements resulting from different patterns exhibit a random variation.

For many methods used in machine learning it is of high importance to extract good features that give high class separability. The features we are interested in are those that constitute the invariants that make class separation possible. Finding such invariant features in the case of symbol recognition is considered hard, since the structure of an arbitrary symbol is not known [1].

Selecting good Features

A feature can be almost any function calculating some desired property, and many features (dozens to hundreds) may be available for various objects. An important question is which ones to include in the classification process. Not all combinations are good, since some features have a high mutual correlation. If too many features are used, the classifier will be worse at generalizing. The central assumption of feature selection techniques is that the data contains many redundant or irrelevant features.

Selecting which features to use can be carried out algorithmically. A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets and an evaluation measure that scores the different subsets. Selecting a subset of features provides three main benefits when constructing predictive models:

• Improved model interpretability
• Shorter training times
• Enhanced generalization by reducing over-fitting

It is widely reported in the literature that so-called wrapper methods tend to be superior as a feature selection technique [63]. Other types can be characterized as either filter or embedded methods [32].

In the simplest case, a wrapper method starts with an empty set of features, tests how well a candidate feature performs on the dataset, extends the selected set with that feature if it performs better than before, and repeats, terminating at an acceptable error level or a desired number of features. In practice, for a finite number of features, an initial improvement in performance is obtained as the feature set grows, but increasing the set further might increase the probability of error. This phenomenon is known as the “peaking phenomenon” [66, p. 267].
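The procedure just described can be sketched as a greedy forward search. The `evaluate` callback stands in for training and scoring a classifier on a chosen feature subset (higher is better); everything here is an illustrative skeleton, not a production feature selector:

```python
def forward_selection(features, evaluate, max_features=None):
    """Greedy wrapper method: start from the empty set and repeatedly add
    the single feature that most improves evaluate(subset); stop when no
    candidate improves the score (the 'peaking' point) or when
    max_features is reached."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        score, feature = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break  # adding any further feature no longer helps
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```

The early-exit test is where the peaking phenomenon shows up: once every remaining candidate lowers the score, the search stops rather than grow the subset further.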


2.4 Problem Reduction Techniques

Reducing the problem domain often helps increase the accuracy of HWR systems. Some techniques for problem reduction are listed below. None of these actually solves the problem of handwriting recognition completely, but they can help decrease error rates. The techniques listed will not be investigated further but are left here as a reference.

• Specifying allowed symbol ranges.

• Utilizing some form of dictionary (n-grams are especially popular in the literature [9]).

• Input is captured from specialized forms.

• Train a new classifier for each new user (usually not relevant since training takes time and the amount of data needed is usually not available).

2.5 Description of a Complete HWR System

We now give an overview of how the different parts of a HWR system are connected. Figure 2.8 is a graphical description of the dataflow of a HWR system in the form of a block diagram. The pipeline is similar to what is described in [55] and [58]. The different blocks in the diagram constitute a separation into subsystems, each of which could be developed individually using various approaches. The output from one submodule should be seen as the input to the next.
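The modular dataflow of figure 2.8 can be expressed as plain function composition, which makes the "output of one submodule is input to the next" property explicit. The stage functions below are placeholders to be filled in by whatever technique each subsystem uses:

```python
def hwr_pipeline(image, localize, preprocess, segment, recognize, postprocess):
    """Chain the HWR subsystems of figure 2.8: each stage consumes the
    output of the previous one, so any stage can be swapped independently."""
    result = image
    for stage in (localize, preprocess, segment, recognize, postprocess):
        result = stage(result)
    return result
```

Because the stages only agree on their input/output contract, an explicit-segmentation system and a simultaneous segmentation-and-recognition system differ only in how the middle stages are realized.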


Figure 2.8. Hypothetical HWR system pipeline. An input document image flows through localization (location of paragraph, line extraction, word detection), optional pre-processing (baseline correction, slant, and size normalization), then either explicit character segmentation followed by character recognition or simultaneous segmentation and recognition (implicit or explicit), and finally post-processing of the recognition hypotheses with language models (dictionary, n-grams) to produce enhanced output.


Chapter 3

Theory

In this chapter a theoretical introduction to neural networks will be presented, focusing on concepts related to a specific network type called convolutional neural networks. For single digit recognition implemented on a smartphone, convolutional neural networks were found to be the most efficient model with regard to low error, as explained previously in section 2.2. The reasons why this is a suitable classifier will be further discussed and motivated below.

3.1 Choice of Symbol Classifier

When choosing a suitable classifier to be used on a smartphone, several factors must be taken into consideration. The most important factors are classification speed, performance, and ease of implementation. High performance in this setting is defined as a low classification error rate. Since machine learning techniques are normally divided into distinct training and classification phases, only classification speed was deemed important: training can be carried out separately on any desktop PC.

Deciding which classifier to use depended heavily on finding libraries that could be used to implement it on the target platforms. The final choice was strongly affected by the discovery of a library called Torch, which supports the creation of arbitrary network structures. A comparison of libraries can be found in section 4.1.


3.2 Artificial Neural Networks

In 1943, W. McCulloch and W. Pitts investigated how a computational model of the brain’s nervous activity could be constructed. The result was a mathematical model of a neuron: a binary device with fixed threshold logic capable of imitating the functionality of simple electrical circuits [44]. An Artificial Neural Network (ANN) is a network of simple processing elements that operate in isolation on local data and communicate with other elements [44]. ANNs draw inspiration from the structure of the human brain but have moved far from their biological relative.

The most basic building blocks of the brain are the nerve cells, which do all the processing. These basic constructs are called neurons, both in the brain and in our abstract simulation of it [43]. An artificial neuron, also referred to as a perceptron, is normally described as follows (using the notation in table 3.1):

• There are n + 1 inputs with signals x0 through xn.
• Each input has a weight, w0 through wn.
• An activation function determines whether the neuron should “fire” (produce output) given the input.

Table 3.1. Notation.

x_j^l    Input to node j of layer l
w_ij^l   Weight from node i of layer l−1 to node j of layer l
σ(x)     Activation function
θ_j^l    Bias of node j of layer l
O_j^l    Output of node j of layer l
t_j      Target value of node j at the current layer


Figure 3.1. Visual model of McCulloch and Pitts’ mathematical model of a neuron. The inputs xi are multiplied by their respective weights wi and then summed. If the activation function gives an output higher than some threshold, the neuron fires; otherwise it does not.

Usually the x_0 input is referred to as a bias unit and is assigned the value +1. The illustration in figure 3.1 is a helpful description of equation 3.1, the mathematical definition of the neuron model.

y_k = σ( ∑_{j=0}^{n} w_{kj} x_j )    (3.1)
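As an illustration, equation 3.1 maps directly to code. The sketch below is ours, not part of the thesis prototype (which was written in Lua/Torch); `perceptron` and `sigmoid` are hypothetical helper names:

```python
import math

def sigmoid(x):
    # Logistic activation function: sigma(x) = 1 / (1 + e^-x).
    return 1.0 / (1.0 + math.exp(-x))

def perceptron(weights, inputs):
    # Equation 3.1: y = sigma(sum_j w_j * x_j).
    # inputs[0] is the bias unit, fixed to +1, so weights[0] acts as the bias.
    total = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(total)

# A neuron with bias -1.5 and two unit weights approximates logical AND:
# only when both inputs are 1 does the summed input become positive.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(perceptron([-1.5, 1.0, 1.0], [1, a, b])))
```

With threshold 0.5 (rounding the sigmoid output), this single neuron realizes AND, the kind of linearly separable function a lone perceptron can represent.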

By connecting a vast number of neurons into an interacting network, it is possible to approximate very advanced types of functions. In principle, neural networks have the ability to realize any type of function mapping from one domain to another [52, 40]. The most popular model for neural networks is the so-called Multi Layer Perceptron (MLP) model.

A network is ordered into several layers: the first is called the input layer, the last is called the output layer, and the layers in-between are referred to as hidden layers.


• Input Layer

A vector of input values (x_0 … x_n); this usually corresponds to a set of features or the raw input from a sample. In the case of images, input neurons can be mapped to pixel values. Each layer also has an additional constant neuron called the bias, used for normalization.

• Hidden Layer(s) [one or more]

Values arriving at a neuron in a hidden layer from each input neuron are multiplied by a weight (w_kj), summed up, and used as input to the activation function. The output value, y_k, is used as input for consecutive layers.

• Output Layer

The output layer goes through the same process of weight multiplication and summation as the hidden layers, but the data output by the last layer is treated as the result of the classification. It is up to the designer of the network to interpret this data.

Each layer of neurons feeds only the very next layer and receives input only from the immediately preceding layer, as shown by the connections illustrated in figure 3.2.

Figure 3.2. Overview of a simple neural network model: an input layer, a hidden layer, and an output layer connected by weights.

The data used as input for the first layer is typically the feature vectors obtained as described in section 2.3.5, and the output is a vector interpreted as a classification. The strength of neural networks rests on three properties:


For adaptiveness and self-organization, it can be shown that a network can change its behavior while learning, adapting to changes in training data. Using more complex network structures it is possible to form arbitrary decision boundaries, and parallelization is inherent in networks since they are built from many independent parts [22].

Feed-forward Algorithm

Executing the feed-forward algorithm on a network is what constitutes making a classification. Given a set of input features, the resulting output is the answer to which class the input sample belongs. The algorithm works as follows: each connection receives its input from the previous layer, or, in the case of the first layer, the input from the current sample x_j. Every connection is associated with a weight w_i that reflects the degree of importance of that specific connection. The output value of the ith perceptron is determined by multiplying each connection with its associated weight, summing up, and applying an activation function as shown in equation 3.3. The threshold coefficient (a constant simply set to “1” in figure 3.1) is a so-called bias term used for normalization (in equation 3.3 the bias is denoted θ).

O_i = σ(ξ_i)    (3.2)

ξ_i = θ + ∑_j w_{ij} x_j    (3.3)

Several of these basic units arranged in layers, with consecutive layers connected, form a network.

Each of these units can be trained. Training a network of basic units makes the network mimic the behavior of various functions: calculating output = f(input) of some arbitrary function f by sending the input to the first layer, executing the feed-forward algorithm, and reading the output of the last layer.
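The layer-by-layer feed-forward pass described above can be sketched as follows. This is a minimal Python illustration rather than the thesis's Torch code, and the 2-2-1 layer sizes and weight values are made up for demonstration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    # Equations 3.2/3.3: each output O_i = sigma(theta_i + sum_j w_ij * x_j).
    return [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

def feed_forward(sample, layers):
    # layers is a list of (weights, biases) pairs, one per layer after the input layer.
    activation = sample
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation

# A 2-2-1 network with arbitrary weights: the last layer's output is the classification.
layers = [([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2]),   # hidden layer
          ([[1.2, -0.7]], [0.05])]                    # output layer
print(feed_forward([1.0, 0.0], layers))
```

Reading off the output vector and interpreting it (for example, picking the largest component) is the designer's choice, as noted in the output-layer description above.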

Backpropagation Algorithm

The best-known and simplest training algorithm for feed-forward networks is backpropagation, a classic steepest-descent method [57]. This section only provides a brief introduction to the topic; for a more extensive description see [57, 62].


The name backpropagation comes from the fact that the error can only be computed directly for the last layer; errors for the preceding layers are calculated by “pushing” the error backwards. Iterating through a complete training set is called one epoch, and in each epoch the weights are modified in the direction that reduces the error. There are many types of activation functions used in neural networks, but we assume a sigmoid function here since it has a convenient derivative:

σ′(x) = σ(x) · (1 − σ(x))    (3.4)

Given a set of training data points t_j, the process is started by doing a feed-forward pass for some sample. We can then calculate the output error E_k the network makes as:

E_k = (O_k − t_k) · σ′(O_k)    (3.5)

After the error for each neuron in the output layer has been calculated, the errors are propagated backwards, one hidden layer at a time. The error for a neuron in a hidden layer is calculated as the sum of the products of the errors of the neurons in the next layer and the weights connecting to them.

Since the gradient points in the direction of the greatest rate of increase of a scalar field, the weight change is subtracted in the update stage, because we are seeking a minimum by gradient descent. The weights between layers ℓ and ℓ − 1 are then updated by subtracting the product of the calculated error in layer ℓ and the output of layer ℓ − 1, where the error and output values are taken from the neurons connected by weight i, j.

Δw_ij = −λ E_j^ℓ O_i^{ℓ−1}    (3.6)

w_ij = w_ij + Δw_ij    (3.7)

The λ in equation 3.6 is called the learning rate; it is usually a small number between 0 and 1 and determines how fast the network converges. Larger values can cause the weights to change too quickly and can actually cause them to diverge rather than converge. Some literature suggests determining the learning rate as a function of the current epoch, decreasing it continuously as more epochs are completed.
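Equations 3.4 to 3.7 for a single output neuron can be sketched as below. This is an illustrative delta-rule update of our own, not the thesis implementation, with made-up inputs and weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_output_weights(inputs, weights, bias, target, lr):
    # Forward pass for one output neuron (equations 3.2/3.3).
    out = sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
    # Error term, equation 3.5, using sigma'(x) = sigma(x)(1 - sigma(x)) (eq. 3.4).
    err = (out - target) * out * (1.0 - out)
    # Weight update, equations 3.6/3.7: delta_w = -lr * error * input.
    new_weights = [w - lr * err * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * err
    return new_weights, new_bias, out

# Repeated updates drive the output towards the target value 1.0.
w, b, out = [0.1, -0.1], 0.0, 0.0
for _ in range(200):
    w, b, out = update_output_weights([1.0, 0.5], w, b, target=1.0, lr=0.5)
print(round(out, 2))
```

Note how the update slows down as the sigmoid saturates: the σ′ factor in equation 3.5 shrinks as the output approaches 0 or 1.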

3.2.1 Network Structure and Network Training


• The number of hidden neurons should be between the size of the input layer and the size of the output layer.

• The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.

• The number of hidden neurons should be less than twice the size of the input layer.
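The three rules of thumb above are easy to state in code (a trivial helper of our own, not something from the thesis):

```python
def hidden_size_heuristics(n_in, n_out):
    # Heaton's rules of thumb for the number of hidden neurons [25]:
    return {
        "between": (min(n_in, n_out), max(n_in, n_out)),  # between input and output size
        "two_thirds_rule": (2 * n_in) // 3 + n_out,       # 2/3 input size + output size
        "upper_bound": 2 * n_in,                          # fewer than twice input size
    }

# For a 28x28 pixel input (784 neurons) and 10 output classes:
print(hidden_size_heuristics(784, 10))
```

For MNIST-sized input this suggests trying a hidden layer of roughly 532 neurons and staying below 1568; as the text stresses, these are starting points for experimentation, not optima.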

Theoretical limits on the number of neurons needed have been proved in [31], but in practice no general method for choosing an optimal network topology has been found. The decision is highly dependent on the complexity of the input data or the function you want to imitate.

Choosing the right network structure often starts with best-practice rules and is then further refined. Common approaches are based on adding variables and pruning nodes. Normally an initial structure is chosen; by repeating a procedure of training, testing, and modification on different network sizes, changes in performance can be measured. Choosing which structure to use is then a matter of desired performance.

The first problem faced in model selection is how many hidden layers to use. When choosing the number of hidden layers, the list below can act as a guide, reproduced here as described by Jeff Heaton [25]. The network topology is also important for how well the network generalizes to previously unseen data: making the structure too complex might result in overfitting, while making it too simple might result in underfitting [62].

• Zero hidden layers: Only capable of representing linear separable functions or decisions.

• One hidden layer: Can approximate any function that contains a continuous mapping from one finite space to another.

• Two hidden layers: Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy.

Training time also affects the performance of the classifier. One method of dynamically optimizing the training is early stopping. Train too little and the network will not learn the desired function or pattern; train too much and the result is bad generalization. The error over time for training and testing is illustrated in figure 3.3.
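Early stopping can be sketched as follows; `train_epoch` and `validation_error` are hypothetical callbacks standing in for a real training loop, and the patience threshold is our own choice:

```python
def train_with_early_stopping(train_epoch, validation_error, patience=5, max_epochs=100):
    # Stop when the validation error has not improved for `patience` epochs,
    # guarding against the overtraining shown in figure 3.3.
    best_err, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_err

# Toy run: validation error falls until epoch 10, then rises again.
errors = iter([1.0 / e if e <= 10 else 0.1 + 0.01 * (e - 10) for e in range(1, 101)])
result = train_with_early_stopping(lambda: None, lambda: next(errors))
print(result)
```

In the toy run above, training halts shortly after epoch 10, the point where the validation curve in figure 3.3 would turn upwards.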


The validation set used for early stopping should contain samples with no relation to the training data and must not be used during training. The disadvantage of this approach is that it reduces the amount of data available for both training and validation [62].

Figure 3.3. Plot of network performance over time. Training too much results in bad generalization and the calculated error for the validation set increases. Image from Willamette University.

3.2.2 Deep Learning of Representations

Most machine learning and pattern recognition systems rely on the user or programmer to provide some knowledge about the problem instance. The machine can then use this knowledge to learn about patterns inherent in the training data. Knowledge in this case means the feature vectors calculated for each input sample. In other words, we rely on a priori explicit knowledge about a set of objects in order to gain knowledge about a set of similar objects.

As mentioned previously, finding a general structure in handwritten letters or symbols is an open research problem, and with that, finding a correct set of features for training a recognizer is currently not considered possible. So if we cannot find or describe the correct set of features, maybe we can get that knowledge by observing the world around us?


Deep learning is a new concept popularized in 2006 by Geoffrey E. Hinton [26]. Neural networks are modeled into layers corresponding to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts. Techniques based on deep learning are in active use today at Microsoft and Google, who for instance created a deep-learning neural network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos [41, 42, 36].

There is theoretical evidence for multiple levels of representation: evidence from brain research shows that humans first learn simple concepts and then compose them into more complex ones.

It is hypothesized that the visual cortex functions as a generic algorithm, built in layers, for how we perceive the world [4]: it extracts features about an object or representation at different, increasing levels of abstraction [59]. Ranzato et al. (2007) suggest that the brain learns by first extracting edges, then patches, then surfaces, then objects, and so on. Dividing recognition into different levels of understanding is what gave the inspiration for the so-called convolutional neural networks.

3.2.3 Convolutional Neural Networks

Hubel et al. [27] showed that there exists a complex arrangement of cells within the visual cortex. These cells are sensitive to small sub-regions of the input space, called receptive fields, which together cover the entire visual field. Two basic cell types have been identified: simple cells, which respond maximally to specific edge-like stimulus patterns, and complex cells, which are locally invariant to the exact position of the stimulus. The combination of these cell types is what makes up human vision, and they are well suited to exploit the strong spatially local correlation present in images.

ConvNets are a form of MLP feed-forward network with a special architecture inspired by biology, specifically constructed to learn features at multiple levels. ConvNets can be viewed as a multi-modular architecture with trainable components in multiple stages.

For completeness it might be interesting to mention that other models can be found in the literature, such as the NeoCognitron and HMAX [23, 59]. As shown in [60], convolutional neural networks achieve the best performance to date on handwriting recognition tasks, which serves as motivation enough for investigation.

Structure of a Convolutional Network


additional dimension could correspond to time). A network is divided into several stages, wherein each stage is composed of three layers: a filter bank layer, a non-linearity layer, and a feature pooling layer. A typical ConvNet is composed of one, two or three such 3-layer stages, followed by a classification module. Each layer type is now described for the case of image recognition. For details on the inner workings of ConvNets and a comparative overview we refer to the work by Dr. Yann LeCun and Dr. Patrice Simard [35].

Figure 3.4. A typical two stage ConvNet. Image from Stanford VISTA Lab.

Filter layer A convolutional layer consisting of several feature maps. Each feature map is the result of a convolution between the input image and a kernel. The kernel used for the convolution is initialized with random values when a new training session is started and its values are modified during the training phase to more efficiently extract interesting features.

Non-Linearity Layer The purpose of this layer is to limit the output to some more reasonable range. Typically some sort of squashing function is used and is similar to the use of the activation function in an ordinary MLP.

Pooling Layer The pooling layer down-samples its input, making it k times smaller, where k is some constant. Common down-sampling methods include taking the average of some region. The result of a pooling layer is that superior features are elicited, which leads to faster convergence and better generalization [49]. Normally this is achieved by taking the average over a neighborhood of pixels, in effect producing a smaller image. These layers summarize the results of previous layers and extract the most contributing parts of the result of a convolution, making features more location invariant.
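A k × k average-pooling step can be sketched in a few lines (an illustration of ours, assuming the image dimensions are divisible by k):

```python
def average_pool(image, k):
    # Down-sample a 2D image by a factor k: each output pixel is the mean
    # of a k-by-k neighbourhood, making features more location invariant.
    h, w = len(image), len(image[0])
    return [[sum(image[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w, k)]
            for i in range(0, h, k)]

pooled = average_pool([[1, 3, 2, 4],
                       [5, 7, 6, 8],
                       [0, 0, 9, 9],
                       [0, 0, 9, 9]], 2)
print(pooled)  # [[4.0, 5.0], [0.0, 9.0]]
```

Small shifts of a feature inside a pooling region leave the pooled value almost unchanged, which is where the location invariance comes from.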


Mathematical Convolution

To give some more insight into what kind of features the filter layers of a ConvNet can learn to extract, and how this helps in classification, a short introduction to mathematical convolution is now presented. Convolution is a mathematical operation described by the function:

(f ∗ g)(t) ≝ ∫_{−∞}^{∞} f(τ) g(t − τ) dτ    (3.8)

For images, f corresponds to the input image and g to a kernel [18], multiplied according to equation 3.8. Convolution can be thought of as sliding the kernel (a small matrix) over the image, multiplying the pixels of the image that lie under the kernel entries, and then summing. The result of this operation on one pixel in the input image is the new value for that pixel in the output image. As an example, a simple method for edge detection is to calculate the derivative of color intensity at each position by convolution. The result of convolving an image with an edge-detection kernel known as the Laplacian operator is shown in figure 3.5 below.

Figure 3.5. Edge detection using convolution.
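The discrete 2D form of equation 3.8, as applied to images, can be sketched as below. The implementation is our own minimal "valid-mode" version (no padding, so the output shrinks by the kernel size minus one), using the 3 × 3 Laplacian kernel:

```python
def convolve2d(image, kernel):
    # Slide the kernel over the image; each output pixel is the sum of
    # element-wise products between the kernel and the patch under it.
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[u][v] * image[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)]
            for i in range(oh)]

# The Laplacian kernel responds strongly at intensity edges, weakly in flat regions.
laplacian = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]
image = [[0, 0, 0, 9, 9],
         [0, 0, 0, 9, 9],
         [0, 0, 0, 9, 9]]
edges = convolve2d(image, laplacian)
print(edges)  # [[0, 9, -9]]
```

The zero response in the flat region and the strong positive/negative pair at the intensity step illustrate why such a kernel acts as an edge detector; in a ConvNet filter layer, kernels like this are not hand-picked but learned during training.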

Advantages of Convolutional Neural Networks

Neural networks act as black-box universal approximators and achieve their best performance if the data being modeled has a high tolerance to errors. The scenarios for which NNs work best are:


The different types of layers in a convolutional network combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights (or weight replication), and spatial or temporal sub-sampling. They have been designed especially to recognize patterns directly from digital images with a minimum of pre-processing operations. ConvNets are still quite new and not yet widely explored, but as pointed out, the research around deep learning is gaining more and more recognition [41].

As a concluding note, it is interesting to emphasize that, compared to traditional methods for classification, a ConvNet is never told what structures to look for: no data on what constitutes a certain symbol is ever given before the network starts its analysis [61]. During training the network forces itself to discover recurring structures. All in all, the network itself basically acts as its own feature extractor.


Chapter 4

Results

In this chapter the results of building a prototype application for symbol recognition running on a smartphone are presented. It is important not to reinvent the wheel, hence a comparison of libraries that could be used for neural network development is given. The result of the comparison is that a library called Torch meets the requirements stated in section 1.4. A convolutional neural network was trained using Torch and then evaluated for speed and classification performance using the constructed prototype.

4.1 Comparison of Neural Network Libraries

Numerous libraries for building neural networks exist. Many candidate libraries were found, and some were rejected from further investigation for various reasons, including not fulfilling the initial requirements described in section 1.5. The need for cross-platform support (as a reminder, the target platforms were iOS and Android) was the top priority when searching for possible packages. This requirement ruled out many popular candidates written either in Python (not supported on iOS at the time of writing) or Java (also not supported on iOS). Some libraries that passed the initial screening were later dismissed for not being considered mature enough or not being actively maintained.

The final list of libraries that met the stated requirements and were considered for evaluation includes: FANN, OpenCV, OpenNN, NNFpp, Libann, and Torch. Each is presented below. The result of the comparison was that Torch was the most suitable library to use; for obvious reasons Torch will also be given a more thorough presentation.

4.1.1 Investigated Libraries


FANN (Fast Artificial Neural Network Library) provides support for both fully connected and sparsely connected networks. This library was chosen as a candidate because it is written in pure C without external dependencies (thus cross-platform compilation is possible on iOS and Android). It is small in size and has a simple API. FANN has reportedly been used successfully in gesture recognition together with the Microsoft Kinect system1. FANN supports the MLP network structure, and networks can be trained using various backpropagation algorithms.

OpenCV (Open Source Computer Vision Library) is a library of programming functions mainly aimed at real-time computer vision, originally developed by Intel. Besides offering a rich set of image manipulation functionality, OpenCV also contains various machine learning algorithms, neural networks being one of them. OpenCV was considered a candidate due to its extreme popularity, and cross-platform binaries exist for both iOS and Android. OpenCV is also used internally at Bontouch in various other projects, which would make it easier for other developers familiar with OpenCV to reuse code developed in this project if that were ever the case. Neural networks in OpenCV are very basic, supporting only naive MLPs trained with simple backpropagation.

OpenNN (previously known as “Flood”) The neural network implemented in OpenNN is based on the multilayer perceptron. That classical model is extended with scaling, unscaling, bounding, probabilistic, and conditions layers [37].

NNFpp Similar in functionality to the previously mentioned libraries, the strength of NNFpp is its pure object-oriented implementation as a small set of C++ classes. After it passed the initial screening, it was discovered that the latest update dates back to 2007-02; the project is probably discontinued.

Libann is another C++ library built on the STL. Differentiating features include support for multi-layer perceptron networks, Kohonen networks, a Boltzmann machine, and a Hopfield network. Contrary to what is suggested on its homepage, Libann is not actively maintained, with the latest release dating back to 2004-02 (probably discontinued).

Torch Torch provides a Matlab2-like environment for machine learning algorithms, built using a combination of C and a scripting language called Lua. The advantage of Torch is that it is highly customizable and can be tuned at a fine-grained level [12]. Torch also includes support for CUDA3, which offers dramatic increases in computing performance when training.

1 http://leenissen.dk/fann/wp/2011/05/kinect-neural-network-gesture-recognition/
2 http://www.mathworks.se/products/matlab/


4.1.2 Evaluation

Testing of the different libraries was carried out in an iterative fashion, partly to learn more about neural network design and partly to identify features and limitations of the various libraries tested. As a way of learning how to use the different libraries, the first test was to create a network that resembles the XOR function. This was chosen because it is a classic example for neural networks, being the first function proved impossible to implement with a single-layer perceptron, as shown by Marvin Minsky and Seymour Papert [45].
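While no single-layer perceptron can represent XOR, one hidden layer suffices. The weights below are the classic hand-picked solution (an OR unit and a NAND unit feeding an AND unit), shown here as an illustrative Python sketch of our own rather than the trained test networks built during the evaluation:

```python
def step(x):
    # Threshold activation of the original McCulloch-Pitts neuron.
    return 1 if x > 0 else 0

def neuron(weights, bias, inputs):
    return step(bias + sum(w * x for w, x in zip(weights, inputs)))

def xor(a, b):
    h1 = neuron([1, 1], -0.5, [a, b])      # hidden unit computing OR
    h2 = neuron([-1, -1], 1.5, [a, b])     # hidden unit computing NAND
    return neuron([1, 1], -1.5, [h1, h2])  # output unit computing AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))  # prints the XOR truth table: 0, 1, 1, 0
```

The point of the exercise in the evaluation was the same: verify that each library can learn a function that is not linearly separable.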

At the start of the evaluation procedure it was discovered that OpenNN, NNFpp, and Libann could all be ruled out for not being actively maintained or not being considered mature. This requirement stems from the fact that, in the ideal case, software developed in this project might end up being used in a commercial system, and using immature or unmaintained code is considered bad practice.

The next step of the evaluation was to investigate support for building convolutional neural networks. Both FANN and OpenCV are only able to implement MLPs. An attempt was made to extend either of the two to support the ConvNet structure, but implementing the convolutional step was deemed too time consuming, since it would involve rewriting layer construction and training procedures and implementing a general convolution operator.

Torch is partly written in Lua, which can be integrated into normal C/C++ programs and has extremely good performance [12]. Torch also has support for ConvNets. The drawback of using Torch is that it would introduce a new programming language not previously used by the client, something that is preferably avoided. On the plus side, Torch is considered very stable, has a carefully crafted API, and is under active development.

Lua is a small language and generally considered very easy to learn [51]. The conclusion was that using Torch, and with it Lua, would be the most beneficial choice, due to its high performance, its possibilities for rapid development, its active development, and its rich feature set.

4.2 Lua


Lua outperforms other scripting languages by an order of magnitude and is in many cases comparable to the speed of C [38].

Lua is open source under the MIT license, making it free to use without limitations [65]. Its cross-platform possibilities range from being embeddable in C/C++, Objective-C, Java, C#, Smalltalk, Fortran, Ada, and Erlang, to being embeddable even in other scripting languages, such as Perl and Ruby. Lua supports numerous platforms, including smartphone platforms such as Android, iOS4, and Windows Phone. For a developer new to Lua it is also very easy to learn the basics; a developer can start writing Lua programs in just 15 minutes without any prior knowledge [51]. Lua can also be made very secure through sand-boxing techniques and has proven to be a good choice in many systems available today [2, 46].

4.3 Torch

The advantages of using Torch are threefold:

1. It is easy to develop numerical algorithms.

2. It is easy to extend (including the use of other libraries).

3. It is fast [12].

Using Torch it is possible to implement arbitrary types of network structure, and development is very rapid thanks to the dynamic scripting nature of Lua. Torch relies on an object-oriented model for neural network creation, letting the user experiment with different network structures, activation functions, and training procedures through a simple interface, without the need to care about details if not desired. Also, being based on Lua, Torch is easy to extend or embed using libraries written in Lua or C (and its derivatives) via Lua's transparent C interface.

Torch is not only fast; it is the fastest machine learning library available to date when compared with alternative first-class implementations, including Matlab, EBLearn, and Theano [12, 13, 5]. This high performance is obtained via efficient usage of OpenMP/SSE and CUDA for low-level numeric routines.

4.4 Prototype

All the technology used to create the symbol recognition system was chosen to be platform agnostic. In order to test this hypothesis, a prototype for the Apple iOS operating system was implemented. iOS apps are built using Objective-C, which is a superset of C; since Lua is C compliant, the target platform was chosen because of this easy bridge between Lua and iOS. For Android-based systems it is possible to include C code in Java using the Java Native Interface (JNI) [15].

The speed of the various iPhone models released varies greatly [47]. Based on recent internal reports from Bontouch, more than 80% of its customers in Sweden use an iPhone 4 or a more recent model, and hence the iPhone 4 was chosen as the target device. In this section I will give a more exhaustive explanation of the implemented prototype.

4.4.1 Network Training

Creating a trained ConvNet consisted of designing the network using Lua and Torch and training it on a desktop PC with the MNIST dataset. After around 50 epochs the error levels tended to stabilize around 1%, and training was stopped. The trained network was then saved to disk using Torch's built-in functionality.

Many different ConvNet structures have been investigated in the literature. According to [34, 71], a network structure referred to as LeNet is recommended for character recognition tasks, having achieved low error rates (0.95%) on the MNIST set. Structures with better results exist, the best being [10] with an error rate of 0.23%, achieved by a committee of 35 ConvNets using elastic distortions. The network structure chosen for the prototype was LeNet, because of the popularity of this type of network, with many references available both from the research community and from private resources on the Internet. Since the target platform was a smartphone, network size was an issue: a larger network means more computations in the recognition stage, and because of that the committee of 35 ConvNets used by Ciresan et al. [10] was ruled out from investigation.
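To see why network size matters on a phone, the feature-map sizes of a LeNet-style network can be traced through its stages. The parameters below (a 32 × 32 input, 5 × 5 convolutions, 2 × 2 pooling) are those of the classic LeNet-5 and are assumed for illustration; the exact parameters of the prototype's network may differ:

```python
def conv_out(size, kernel):
    # A valid convolution shrinks each spatial dimension by kernel - 1.
    return size - kernel + 1

def pool_out(size, k):
    # Pooling divides each spatial dimension by k.
    return size // k

size = 32                   # assumed input resolution
size = conv_out(size, 5)    # stage 1 filter bank -> 28
size = pool_out(size, 2)    # stage 1 pooling     -> 14
size = conv_out(size, 5)    # stage 2 filter bank -> 10
size = pool_out(size, 2)    # stage 2 pooling     -> 5
print(size)  # spatial size fed into the final classification module
```

Each stage multiplies the work by the number of feature maps, so the recognition cost on the device grows quickly with both depth and map count; this is the trade-off that ruled out the 35-network committee.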

During training, different values for the learning rate, data randomization, and positive/negative output were tested in order to find a well-performing set of training parameters.


[Figure: plots of training and testing error per epoch for ConvNets trained with learning rates 0.01, 0.001, and 0.0001.]


4.4.2 iPhone Application

To test the capabilities and performance of the classifier on real-world data using a smartphone, an iPhone application was developed. The prototype application consists of two views to capture two types of input: a user can either draw a single digit using finger touch movements or capture a digit using the built-in iPhone camera. In both cases the captured input is pre-processed and then classified in an off-line fashion by sending the pre-processed image to the previously trained network. The two views of the application are shown in figure 4.2.

Figure 4.2. Input views of application prototype.

Draw View


Figure 4.3. Example of user drawing custom shapes.


Camera View

It is also possible to capture input using the built-in camera. The procedure is similar to that for freehand drawing. The main difference is that a sample taken from a photograph is bound to include a lot of noise, and image extraction is more complicated. A different pre-processing stage is therefore used for camera capture, described in section 4.4.2.

Figure 4.5. Classification using camera capture.

Pre-processing and Classification Details

Image pre-processing was performed using OpenCV. In order to extract the best possible sample to send to the classifier, it was important to remove noise and prepare the sample. Noise removal was done by a series of blurring and thresholding steps, after which an erosion operation was performed to make the sample even more robust.

After the image had been filtered, the sample was size-normalized and centered by finding its bounding box and padding it so the sample would be positioned in the middle of an image, with a padding of 1/5 of the width of the bounding box. This makes the samples more compliant with the MNIST training data. The procedure is summarized below and illustrated in figure 4.6.

1. Blurring
2. Thresholding
3. Erosion
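The geometric part of this pipeline, cropping to the bounding box and padding by 1/5 of its width, can be sketched in plain Python. The thesis prototype used OpenCV for the full pipeline, so this is only an illustration of the centering step, with hypothetical helper names:

```python
def bounding_box(image):
    # Rows/columns that contain any foreground (non-zero) pixels.
    rows = [i for i, row in enumerate(image) if any(row)]
    cols = [j for j in range(len(image[0])) if any(row[j] for row in image)]
    return rows[0], rows[-1], cols[0], cols[-1]

def crop_and_pad(image):
    # Crop to the bounding box, then pad by 1/5 of the box width on every
    # side so the symbol sits centered, as expected by MNIST-style input.
    top, bottom, left, right = bounding_box(image)
    box = [row[left:right + 1] for row in image[top:bottom + 1]]
    pad = max(1, (right - left + 1) // 5)
    width = (right - left + 1) + 2 * pad
    blank = [0] * width
    padded = [blank[:] for _ in range(pad)]
    padded += [[0] * pad + row + [0] * pad for row in box]
    padded += [blank[:] for _ in range(pad)]
    return padded

sample = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 0]]
out = crop_and_pad(sample)
print(len(out), len(out[0]))
```

A final resize to the 28 × 28 resolution of the MNIST training images would follow in the real pipeline.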


Figure 4.6. Pre-processing stages of sample captured with camera. Borders for bounding boxes have been added for clarity.

4.4.3 Testing

The prototype was tested by letting several users try both freehand drawing and taking pictures of digits written in their own handwriting. The survey was performed by letting 10 persons use the two capture modes for all possible inputs (digits 0-9) and mark whether the prototype made a correct classification or not. Users were told to mimic their normal writing style as much as possible and not try to “break” the classifier by deliberately writing hard-to-interpret digits. See figures 4.7 and 4.8 for a full comparison. The survey was done in order to test the prototype in a real-world scenario. The result of the test was approximately 9/10 correct classifications.

The calculation time for the various steps involved in the recognition system was measured and can be seen in tables 4.1 and 4.2 for various devices. The calculation time is taken as the average of 10 consecutive executions, measured in nanoseconds and converted to seconds for readability. As can be seen, the recognition time is low, making ConvNets suitable as classifiers for smartphones.

[Figure 4.7: bar charts of the percentage of correct classifications per digit (0-9), for freehand drawing and camera capture.]


[Bar chart: total percentage of correct classifications for freehand drawing and camera capture.]

Figure 4.8. Number of correct classifications using prototype (total).

Device      Pre-processing   Classification
iPhone 4    0.0076228        0.0320454
iPhone 4S   0.0074817        0.0176622
iPhone 5    0.0035123        0.0102338
iPhone 5S   0.0016783        0.0061551

Table 4.1. Freehand drawing speed performance of prototype on various devices (time in seconds).

Device      Pre-processing   Classification
iPhone 4    0.0370634        0.0288854
iPhone 4S   0.0298186        0.0173494
iPhone 5    0.0134305        0.0099449
iPhone 5S   0.0086724        0.0056117

Table 4.2. Camera capture speed performance of prototype on various devices (time in seconds).

Chapter 5

Discussion

Neural networks have been successfully used to solve many complex problems and diverse tasks, ranging from self-driving cars and autonomous flying vehicles to preventing credit card fraud [67, 50]. The power of a model lies in its ability to generalize and predict, and to make accurate quantitative anticipations of data in new experiments. Neural networks have these properties.

The ability of a convolutional neural network (ConvNet) to generalize to unknown data is highly dependent on the dataset used for training. The goal of machine learning is not to build a theory of what constitutes a certain symbol, but to use acquired knowledge and predictive power to understand what distinguishes one symbol from another.


Figure 5.1. 30 filters trained on the MNIST set.

The advantage of a ConvNet is its ability to acquire this knowledge without human intervention deciding on what features to use. This advantage leads to the difference between convolutional networks and other types of machine learning procedures in their approach to classification. While others, for example the Support Vector Machine (SVM), concentrate on finding a function (or kernel, in SVM language) that uses predefined features to separate classes, ConvNets instead focus on finding the most efficient features that can separate classes. Since finding structure in handwritten symbols is still an open research problem, the use of convolutional networks is ideal for classification problems in which the separating structure of individual samples is unknown. This is due to the automatic discovery of features by deep learning.
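The basic operation behind such learned features is the 2D convolution of a small filter over the image, producing a feature map. A minimal pure-Python sketch follows; the filter values here are hand-picked for illustration, whereas in a ConvNet they are learned during training:

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide `kernel` over `image`
    and produce one feature-map value per position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map

# A 4x4 "image" containing a vertical edge, and a vertical-edge filter.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_filter = [[-1, 1],
               [-1, 1]]
print(convolve2d(image, edge_filter))
# → [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The feature map responds strongly (value 2.0) exactly where the edge is, which is the kind of response a learned filter such as those in figure 5.1 produces for its preferred pattern.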

Using ConvNets for recognition of handwritten symbols offers great potential, since handwriting is subject to high variance and we do not (yet, if ever) know whether any general structure can be found. Thus, instead of using handcrafted features, ConvNets can learn about the input space by discovering features themselves as part of the learning process.

Data Distortion and Decreasing Error Rates

One way of decreasing error rates is suggested by Simard et al. in [60]. They argue that generalization can be improved by including distortions of the sample images. Distortions are applied randomly to each sample just before training. Types of distortions that could be used include:

(49)

• Rotation: A slight rotation is applied to the sample.

• Elastic: Non-linear transformation, visually equivalent to pulling or pushing fields of pixels in a wave-like fashion.

The reason why these distortions decrease error rates is believed to be twofold: first, they increase the size of the training set; second, and more importantly, the distortions force the network to look for, and learn, more invariant features and aspects of the patterns, thus focusing on features that separate the classes. Data distortion was not applied to the training samples for this project.
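An elastic distortion in the spirit of [60] can be sketched in pure Python: a random per-pixel displacement field is smoothed (a simple box blur stands in here for the Gaussian filter of the paper) and the image is resampled with nearest-neighbor lookup. All names and parameter values are illustrative, not taken from the prototype:

```python
import random

def box_blur(field, radius=1):
    """Crude smoothing of a 2D displacement field (stand-in for a Gaussian)."""
    h, w = len(field), len(field[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [field[ii][jj]
                    for ii in range(max(0, i - radius), min(h, i + radius + 1))
                    for jj in range(max(0, j - radius), min(w, j + radius + 1))]
            out[i][j] = sum(vals) / len(vals)
    return out

def elastic_distort(image, alpha=2.0, seed=0):
    """Displace each pixel by a smoothed random field scaled by `alpha`,
    sampling the source image with nearest-neighbor lookup."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    dx = box_blur([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)])
    dy = box_blur([[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si = min(h - 1, max(0, round(i + alpha * dy[i][j])))
            sj = min(w - 1, max(0, round(j + alpha * dx[i][j])))
            out[i][j] = image[si][sj]
    return out
```

With `alpha = 0` the image passes through unchanged; increasing `alpha` produces progressively stronger wave-like warps of the digit. Production implementations typically use a proper Gaussian filter and bilinear interpolation instead.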

5.1 Conclusions

This report has presented an overview of how to design a modern handwriting recognition system for smartphones. A proof-of-concept application for character classification, using a neural network approach, has been successfully developed and evaluated. The evaluation shows that, using the guidelines suggested in section 3.2.1, neural networks can be trained efficiently for the purpose of recognizing digits in images. It should be evident from the discussion of feature extraction in section 2.3.5 that crafting good features manually, with high class separability, is hard, if not impossible. As a result, this report has presented an alternative approach to crafting features: letting the classifier itself decide what constitutes a good feature. A suitable technique with the ability to learn visual patterns is the convolutional neural network, which is designed to mimic human visual perception of images.

It is observed that training of networks takes a long time. Thus it is not considered reasonable to perform this type of training on a smartphone. There are several reasons for this. Firstly, a smartphone has severely constrained resources (limited memory and CPU performance) compared to a desktop machine. Secondly, the design of the operating systems used in both Android and iOS prevents computationally intensive applications from running in the background for a long time. Applications are only allowed full access to system resources when running in the foreground. Once an application is put in the background, the operating system gives no guarantees that these resources will remain available, and the application can be killed at any time for the benefit of other applications. A typical user only keeps one application active for a short period of time and then switches to another. The conclusion is thus that performing adaptive learning on the device is not considered possible when using neural networks.


Chapter 6

Future Work

Using the symbol recognition techniques outlined in earlier chapters works quite well, but the system can be improved in various ways. Expanding the training set using various distortions, as explained in chapter 5, could lead to better results. A simple extension of the prototype is to implement this functionality and apply data distortions while training. As presented in [60], using elastic distortions could lead to a decrease in error rates on the MNIST set. It would be very interesting to see how their inclusion would change the performance of the prototype on real-world data.

Extensions

To extend the proposed system to facilitate word recognition, a new architecture and recognition pipeline would be needed. Separate modules for localization, word segmentation and word recognition have to be constructed. All these modules can be
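The module structure described above can be sketched as a pipeline of composable stages. Every implementation below is a hypothetical toy stub (operating on a string "page" rather than image data) meant only to show how localization, word segmentation and word recognition would chain together:

```python
def localize(page):
    """Find text regions (stub: treat each line of the toy 'page' as a region)."""
    return page.splitlines()

def segment_words(region):
    """Split a text region into word images (stub: split on whitespace)."""
    return region.split()

def recognize_word(word):
    """Per-symbol classifier (stub: identity; this is where a ConvNet
    such as the one in the prototype would classify each symbol)."""
    return "".join(symbol for symbol in word)

def recognize(page):
    """Full pipeline: localization -> word segmentation -> word recognition."""
    return [recognize_word(w)
            for region in localize(page)
            for w in segment_words(region)]

print(recognize("hello world\nfoo"))  # → ['hello', 'world', 'foo']
```

Because each stage has a narrow interface (region in, smaller units out), the stub implementations can be swapped for real ones independently, which is the point of splitting the system into separate modules.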


Acknowledgements


References
