Classifying Age and Gender on Historical Photographs using Convolutional Neural Networks


Självständigt arbete i informationsteknologi, 4 juni 2021

Classifying Age and Gender on Historical Photographs using Convolutional Neural Networks

Ulrika Bremberg Liv Cederin

Gabriel Lindgren Filip Pagliaro

Civilingenjörsprogrammet i informationsteknologi


Institutionen för informationsteknologi

Besöksadress:

ITC, Polacksbacken, Lägerhyddsvägen 2

Postadress:

Box 337, 751 05 Uppsala

Hemsida:

https://www.it.uu.se

Abstract

Classifying Age and Gender on Historical Photographs using Convolutional Neural Networks

Ulrika Bremberg Liv Cederin Gabriel Lindgren Filip Pagliaro

This project intends to classify faces in historical photographs by age and gender. The goal was to demonstrate an algorithm specialized in classifying historical images, as well as an interface where users can insert pictures for analysis. The project aims to facilitate historical research by contributing new tools for image analysis. The algorithm is developed in the programming language Python and uses Convolutional Neural Networks (CNNs) to classify age and gender. The user interface is developed in the JavaScript framework React.js and communicates with the Python algorithm via a Node.js server. The main results are that the gender classification algorithm has an accuracy of 96% and the age detection algorithm has a mean age error of 4.3 years. The results also indicate that our algorithms perform better on historical images than commonly used state-of-the-art classification models.

Extern handledare: Anders Hast, Digital Humaniora vid Uppsala Universitet
Handledare: Mats Daniels, Anne Peters, Björn Victor och Tina Vrieler
Examinator: Björn Victor


Sammanfattning

Detta projekt ämnar att klassificera kön och ålder i historiska fotografier. Målet var att demonstrera en algoritm som är specialinriktad på historiska bilder, samt ett gränssnitt där användare kan ladda upp bilder och få dem klassificerade. Projektet syftar till att underlätta historisk forskning genom att bidra med nya verktyg för bildanalys. Algoritmen är utvecklad i programmeringsspråket Python och använder Convolutional Neural Networks (CNN) för att klassificera kön och ålder. Användargränssnittet är utvecklat i JavaScript-ramverket React.js och kommunicerar med Python-algoritmen via en Node.js-server. De huvudsakliga resultaten är att noggrannheten för klassificeringsalgoritmen för kön ligger på 96%, medan klassificeringsalgoritmen för ålder har ett medelfel på 4.3 år. Resultaten indikerar även att våra algoritmer presterar bättre på historiska bilder än redan befintliga välkända klassificeringsmodeller.


Contents

1 Introduction
2 Background
  2.1 City Faces
  2.2 Face Recognition and Historical Research
  2.3 Face and Landmark Detection
    2.3.1 Viola Jones Detection
    2.3.2 Histogram of Oriented Gradients
    2.3.3 Convolutional Neural Networks
  2.4 Gender Recognition
  2.5 DeepFace
3 Goals and Delimitations
  3.1 Ethical Considerations
    3.1.1 Privacy and Facial Recognition
    3.1.2 Gender
    3.1.3 Ethnicity
4 Related work
  4.1 Gender Recognition
  4.2 Historical Portraits
5 Method
  5.1 Programming Language
  5.2 Image Preprocessing
    5.2.1 Grayscale & BGR
    5.2.2 Histogram Equalization & CLAHE
    5.2.3 Canny Edge Detection
    5.2.4 Median Filtering
    5.2.5 Normalizing Images
  5.3 Convolutional Neural Networks
  5.4 Convolutional Neural Network Layers
    5.4.1 Convolutional Layers
    5.4.2 Pooling Layers
    5.4.3 Dropout Layers
    5.4.4 Fully Connected Layers
  5.5 Activation Functions
  5.6 Face Detection
  5.7 Visualization
  5.8 Web Interface
6 System Architecture
  6.1 Web Interface
  6.2 Server
  6.3 Machine Learning Algorithms
  6.4 Front End and Back End Communication
7 Requirements & Evaluation methods
  7.1 System Evaluation
8 Preprocessing the Images
  8.1 Dividing the Data Set
  8.2 Face Detection
9 Algorithm for Gender Classification
10 Algorithm for Age Detection
11 Integration of User Interface and Algorithm
12 Evaluation Results
  12.1 Accuracy Results
  12.2 User Interface Results
13 Results & Discussion
  13.1 Predicting using the DeepFace Model
  13.2 Comparing Preprocessing Methods
  13.3 User Interface
  13.4 Discussion
    13.4.1 Accuracy Results
    13.4.2 Preprocessing
    13.4.3 Meeting the Goals & Delimitations
14 Conclusions
15 Future Work
  15.1 Improving the Data set
  15.2 Data Augmentation
  15.3 Preprocessing
  15.4 Other Algorithms and Frameworks

1 Introduction


Computer vision, including image recognition and detection, has become a great part of everyday life, and there are powerful techniques for extracting large amounts of information from images. The technology can be utilized in different ways. For instance, large phone companies such as Apple and Samsung offer products with face unlock today [Zdz20]. Meanwhile, the retail chain Walmart patented a recognition system in 2018 that tracks customers' movement in their aisles, and has reportedly also been developing systems aiming to track customer moods [Smi20].

Besides the modern usages of computer vision, image recognition can help with interpreting the past. This project is developed in collaboration with City Faces, which is run by the City of Stockholm, with the purpose of digitizing historical biographical data from the city archive, together with a collection of historical photographs provided by the Stockholm City museum. They seek to use computer vision to be able to link portrait collections to social data. This kind of information is valuable for developing data-driven historical research.

This project focuses on developing machine learning models to determine the age and gender of the people in the photographs. The models are trained on 11 000 digitized images provided by City Faces, and are therefore specialized in historical photographs. In comparison, existing modern models are often trained on modern photographs. We are investigating how preprocessing and machine learning algorithms can be optimized to handle historical material, and also how well modern image recognition systems work in this context.

In the next section we will give a background to the field of face detection and image processing, while also discussing the areas and tools related to the implemented models.

In Section 4 we present other projects covering fields similar to ours, such as the Photo Sleuth project [MTML19].

Since the data set only has binary genders documented, we have been constrained to view gender as binary in this project. We discuss this, other ethical aspects, and delimi- tations of our project in Section 3.1.

In Section 5 and onward, the final models will be presented, including the methods used, the system architecture, and the implementations. Finally, the results of the models' evaluations are presented and discussed.

Declaration of division of labor. Everyone in the group has spent most of their time on developing the final machine learning models. This has included testing and combining different parameters, preprocessing methods, and architectures that we will address in the report. A very important part of the project has also been to explore new libraries, watch video tutorials, and read several articles about computer vision. All of us have therefore contributed with new ideas and perspectives to the models, challenging each other's work in order to achieve the best possible result.

Consequently, the sections include everyone's writing. Outside of this, we have each had smaller areas of responsibility:

Liv Cederin was mainly responsible for the user interface and was also the main writer and researcher of the ethical aspects of this project. Gabriel Lindgren spent the most time testing various preprocessing methods and developed automated ways to introduce these in our project. Filip Pagliaro set up the face detection algorithm and helped onboard the project onto the SNIC platform. Ulrika Bremberg spent a lot of time working with data augmentation methods and researching related works and models.

2 Background

In the following sections we will explain some areas relevant to the project. Firstly, we introduce the stakeholder of the project. Secondly, a historical perspective on face recognition will be discussed, both in terms of history and technology. Furthermore, we will give an overview of some face detection models that we considered and tested on the historical photos in our project. We will also introduce research about facial attributes that can enable gender recognition. In the final section we discuss the DeepFace framework.

2.1 City Faces

This project is an extension of City Faces, which is a project run by the City of Stockholm with the purpose of digitizing historic biographical data from the city archive together with historical photographs provided by the Stockholm City museum. The collection consists of about 400 000 photographs taken of people in Stockholm between 1860 and 1930, of which only a few have been digitized so far [sto20]. We have been given access to approximately 11 000 photographs in this data set.

To connect the photographs to the biographical data, computer vision and machine learning models are being developed. Our model focuses on determining the age and gender of the people in the photographs. This information is, for example, interesting in situations when the stakeholder wants to get an overview of the ages and genders in a data set. It can also help them label unlabeled data and hence be useful in further model training.

2.2 Face Recognition and Historical Research

Photographs provide a great amount of information by being an accessible medium illustrating history and giving insights about times past. Historians often use photographs in research for these reasons. One common usage is identifying people and places in the images to gain insights about human life through history. This has commonly been done manually by humans, but in recent times computer vision and facial recognition algorithms have been utilized to automate this process. One example is the “Civil War Photo Sleuth”, a project working on creating a database of people photographed during the American Civil War era that in the end will be used together with facial recognition to rediscover identities lost to time [MTML19].

2.3 Face and Landmark Detection

In “The Quest for Artificial Intelligence” [Nil10], Nils J. Nilsson writes about the first attempts to develop approaches for facial recognition. One of the first steps was taken in the 1960s. The principle then was to extract coordinates of facial attributes that the computer analyzed for recognition. Later, in 1970, Michael D. Kelly developed a program that could automatically find these sorts of features in order to detect people. Using the nearest-neighbor method [Sub19], where the idea is to categorize unseen data by comparing its distance to already categorized data, he could identify individuals after automatically detecting the coordinates of attributes in digital pictures [Nil10].

Since then, more modern algorithms have been developed to accomplish this task. The following sections give a brief overview of how some of these more modern face detection algorithms, which were considered for this project, work.

2.3.1 Viola Jones Detection

One of the most common algorithms for face detection was developed by Paul Viola and Michael Jones in 2001 [VJ03] and is called the “Viola Jones Algorithm”. The algorithm uses so-called rectangular features, as shown in Figure 1.

Figure 1 The four rectangular features in the Viola Jones Algorithm [VJ03]

In brief, the rectangles are used to discover differences in color between areas of the face by subtracting and/or summing up the values of the pixels in two or more rectangles. To understand how these calculations can be valuable, imagine a grayscale image. A black pixel would typically have a value close to zero, whereas a white pixel would have a value close to 255. An example of a rectangular feature is the three-rectangle feature (C in Figure 1), which is calculated by subtracting the sum within the outer rectangles from the center region. These sorts of subtractions can detect important features that build up a face, for example finding the width of the nose by interpreting the bridge of the nose as lighter than the eye areas. This is illustrated in Figure 3.

Before the rectangular features of an image can be extracted, the original image needs to be represented as an integral image. The integral image at a location (x, y) is, in short, calculated as the “sum of the pixels above and to the left” of (x, y), as shown in Figure 2. This representation is crucial in order for the Viola Jones algorithm to be able to compute the rectangular features in constant time; otherwise these calculations would be too expensive.
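To make the calculation concrete, the following is a minimal Python sketch (not code from the project) of how an integral image can be built with NumPy and used to sum a rectangle in constant time; the rectangle coordinates are arbitrary illustration values.

```python
import numpy as np

def integral_image(img):
    """Integral image with a zero row/column prepended: ii[y, x] holds the
    sum of all pixels strictly above and to the left of (x, y)."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using only four lookups."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

# Example: a three-rectangle feature similar to feature C in Figure 1
# (outer sums subtracted from the centre sum); coordinates are arbitrary.
gray = np.random.randint(0, 256, size=(96, 96)).astype(np.int64)
ii = integral_image(gray)
left_part = rect_sum(ii, 40, 30, 12, 4)
centre_part = rect_sum(ii, 40, 34, 12, 4)
right_part = rect_sum(ii, 40, 38, 12, 4)
feature_value = centre_part - (left_part + right_part)
print(feature_value)
```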

Having the integral image, the next step is to build an efficient classifier with the objective of selecting the most critical features out of the extracted rectangular features. Viola and Jones mention that the most relevant “...feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks...”, see Figure 3. The second feature is, somewhat similarly, based on the fact that our eyes are darker than the bridge of our noses and, as previously mentioned, a three-rectangle feature is used to detect this. The final face detector combines several of these classifiers to be able to find the full face. The face detector can quickly distinguish background regions of the image, since they do not follow the pattern of a normal face, and discard them. Remaining is then the detected face [VJ03].

Figure 2 The integral image used in Viola Jones detection, as a sum of the pixels above and to the left of a certain pixel (x, y) [VJ03]

Figure 3 Example of the most prominent rectangular features in the Viola Jones Algorithm [VJ03]

2.3.2 Histogram of Oriented Gradients

Histogram of Oriented Gradients (HOG) [DT05] is an algorithm or feature descriptor (a description of features in, e.g., an image encoded into a series of numbers) used within object detection, such as face detection. HOG makes use of the pixel values (pixel colors) of an image by dividing the image into small portions and calculating oriented gradients for these. These oriented gradients can be seen as vectors pointing towards where the color of the portion gets darker (see Figure 4). All vectors combined then make up the contours of an object and can be fed to a Support Vector Machine (SVM) for classification of whether the vectors represent, for example, a face or not.


Figure 4 Example of HOG-features extracted from a picture of the data set
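As an illustration, HOG features similar to those in Figure 4 could be extracted with scikit-image as sketched below; the parameter values are common defaults, and the file name is hypothetical, neither taken from the report.

```python
import cv2
from skimage.feature import hog

gray = cv2.imread("portrait.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file name
features, hog_image = hog(
    gray,
    orientations=9,              # number of gradient direction bins
    pixels_per_cell=(8, 8),      # size of the small portions ("cells")
    cells_per_block=(2, 2),      # cells grouped for local normalization
    visualize=True,              # also return a visualization like Figure 4
)
# `features` is the vector that could be fed to an SVM classifier.
```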

2.3.3 Convolutional Neural Networks

A Convolutional Neural Network (CNN) [YY97] is a type of deep learning neural network often used in image analysis, capable of robust visual pattern recognition while being resistant to deformation of the object and variation in the object's position in the image [FMI83]. For example, in the case of face detection, the person in the picture does not need to be centered in the picture or looking straight at the camera for the model to be able to detect the face.

Training a CNN is done by analyzing multiple overlapping regions within the image using a filter (a matrix of weights) to generate a feature map which highlights different elementary features such as edges or corners. These features are then combined and analyzed by the following layers in the network to obtain higher-level features, as can be seen in Figure 5 [YY97]. The layers will be further explained when describing the implementation of our model in Section 5.4. The resulting model is then able to effectively recognize patterns in images unaffected by distortions to the pattern [FMI83].

Thus, CNNs are a very powerful tool for detecting faces and face landmarks, since they are resistant to distortions to the face. They can even be trained for full face recognition or gender recognition if provided with proper training data [ZNRK19].

Figure 5 Pattern recognition using convolution [YY97]

2.4 Gender Recognition

Humans can classify sex with more than 95% accuracy, even when the subject has an altered hairstyle or after the removal of facial hair or makeup [AHM+18]. Additionally, facial expressions, pose, and lighting conditions have a very small effect on the human ability to make this categorization (however, we perform much worse when pictures are presented upside down) [Nil10]. This suggests that there are facial features distinguishing men and women. Determining the most prominent characteristics can hence be effective when developing gender classifiers. In an article by Abbas et al. [AHM+18], the finding when analyzing 3D pictures was that the nose ridge is the most discriminant portion of the face. However, they also refer to a study by Gilani et al. [GRS+14] suggesting that the distances between the eyes and the forehead landmarks are the most effective for separating gender.

Figure 6 Examples of facial landmarks [DRC+18] [VJ03]

In both of these studies the starting point was to find facial landmarks (see the numbered points in Figure 6), that is, important points on the face. To illustrate how these are chosen, the landmarks can be categorized into three different classes. Firstly, there are the biological landmarks, such as the pupils or the nostrils. Secondly, there are mathematical landmarks. These describe geometrical features of the face and can, for example, be a point in the middle of two biological landmarks, such as a point between the eyebrows. Finally, there are the pseudo landmarks, which are defined using at least two mathematical landmarks [AHM+18]. Comparing the landmarks of different faces can then be used to distinguish certain masculine and feminine features to be used in gender classification algorithms. Furthermore, what is especially interesting is the distance between these features. The morphometric perspective of facial gender analysis can hence explain how humans, and especially computers, see gender.

2.5 DeepFace

As stated by the DeepFace developers: “DeepFace is a lightweight face recognition and facial attribute analysis (age, gender, emotion, and race) framework for Python. It is a hybrid face recognition framework wrapping state-of-the-art models: VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace, DeepID, ArcFace, and Dlib” [SIOA20].

Within the context of this project, DeepFace was used to create an initial benchmark of how well modern models perform on historical photographs. As it wraps these state-of-the-art models, it provides a fair insight into their performance (see Sections 12.1 and 13.1).
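For reference, a benchmark run of this kind could look roughly like the sketch below; the file name is hypothetical and the exact call should be checked against the DeepFace documentation.

```python
from deepface import DeepFace

result = DeepFace.analyze(
    img_path="historical_portrait.jpg",   # hypothetical file name
    actions=["age", "gender"],            # only the attributes relevant here
    enforce_detection=False,              # do not fail if no face is confidently found
)
print(result)
```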

3 Goals and Delimitations

As many modern face detection and recognition models are only trained on modern images, they struggle when faced with historical photographs, as discussed later in Section 13.1. This project aims to contribute with a more nuanced picture of facial analysis by implementing an algorithm trained on historical images. The project also intends to answer if and how this can be done, as well as to investigate the challenges in developing classification models for historical photographs compared to modern ones. The purpose is to facilitate historical research, such as genealogy, by providing researchers with new tools to analyze photographs.

The end goal is to demonstrate an algorithm specialized in detecting and classifying the attributes gender and age of faces in historical images. The final product will be in the form of a user interface where users will be able to insert pictures for analysis. With this system we want to make it possible to categorize images regardless of their age and condition.


3.1 Ethical Considerations

In the following sections we will discuss ethical aspects of our project, more specifically how it handles gender and ethnicity. Even though the current ethical issues are difficult to solve due to the constraints of the provided data set, they are important to address for further discussion and development of our project.

3.1.1 Privacy and Facial Recognition

As mentioned in the introduction, face recognition is a large part of today's society and is utilized in many different areas. One problematic aspect is that algorithms are used to monitor people's patterns of behavior for commercial or surveillance purposes [amn21], which raises ethical issues regarding people's privacy. However, as this project does not cover the same types of facial recognition algorithms as used in those systems, this aspect is only considered and not discussed further. We do not intend to use our project to identify and recognize individuals but rather to sort archives and simplify the work of historians by classifying images.

3.1.2 Gender

In a study from the University of Colorado Boulder [SPB19], gender classification algorithms from large companies like Amazon, Google Cloud Vision, and IBM Watson Visual Recognition are analyzed. In all algorithms, gender is classified in a binary manner with only two genders (male and female), and other genders are not included in the classification. The study also discovered that the accuracy of every algorithm was much lower when testing on transgender individuals, and concluded that it is likely that the training data does not include transgender individuals.

The people in our data set are described solely as males and females. During the time the pictures were taken, other gender identities were not accepted and therefore not documented in the data set, even if people in the pictures might have felt differently. This binary classification excludes other genders and is an ethical issue in this project. However, this study aims to specifically classify historical images, and due to poor documentation of other genders we have decided to only include male and female genders in the classification model.

Another ethical issue is the fact that our data set contains approximately 86% males and 14% females. Training the model on all images could lead to distorted results, as the model would have more material to determine the male gender than the female. We take the UN's sustainability goal 5, “gender equality” [fn:20], into account in this project, and try to minimize gender differences as much as possible. This is why we initially trained the model on an equal number of males and females, something we will discuss further in Sections 8.1 and 13.4.3.

3.1.3 Ethnicity

Questions regarding ethnicity become relevant in this project since it analyzes and categorizes people's appearances. According to a study from George Mason University [KW16], commercial face recognition engines classify white people with a higher verification accuracy than black people. The largest gap in accuracy was between white males and black females. In another project, referred to as “Gender Shades” [BG18], it was discovered that commonly used gender classification algorithms, including those from Microsoft and IBM, performed 34% better on white males compared to black females. There are more studies that confirm these findings, for example the publication “Face Recognition Vendor Test” from the National Institute of Standards and Technology [GNH19].

This problem is well known in face recognition contexts and can be addressed from several different angles. Many algorithms have been trained on imbalanced data sets, often with a majority of white males. Another problematic aspect is that many default camera settings are not optimized for darker skin tones, which can result in lower quality of these images [BG18].

There are clear connections between the above-mentioned ethnic issues and our project. The picture data set portrays a small ethnic group of people, where almost everyone is from Europe and mostly from Sweden. This limits the study in the sense that not all ethnic backgrounds are represented, which may lead to results similar to the studies above. There is a possibility that the accuracy of the model will differ when we change the ethnic diversity of the test data. By dealing with appearances and different ethnicities there is also a risk that the algorithm will be used in the wrong contexts, for example in a non-serious manner like a “funny” classification application where groups of people risk feeling excluded or violated.

However, since our data set is very limited, we will not actively work with issues regarding ethnicity other than defining the problems and making it clear that our model is not perfect in this regard. We aspire to be transparent and careful with the contexts in which the research will be used.


4 Related work

There are other systems similar to our gender and age detection project, and the following sections will discuss some of these. These systems were chosen because of their similarity to our project and the useful conclusions drawn in them.

4.1 Gender Recognition

The idea of training a machine learning algorithm to classify gender from images has been studied before. An example is “A Gender Recognition System from Facial Image” by Md. Tawhid et al. [TD18]. They wanted to address the problem of classifying faces taken in uncontrolled environments where there is, for example, a high rate of noise in a photograph. To account for this problem they provide a framework including a preprocessing stage using the algorithm BHEP (Bilateral Histogram Equalization with Preprocessing), which uses the harmonic mean to divide the histogram of an image in order to enhance it. Using BHEP for preprocessing resulted in higher accuracies than not using any image enhancement techniques.

In another paper [JLY11], Jing et al. compare two image enhancement techniques, namely Median Filtering and HE (Histogram Equalization), for preprocessing of images used in facial recognition. They mention that “Pretreating can eliminate and reduce the aspect, which is disadvantageous to enhancing recognition rate, caused by noise, illumination, background, and so on in face images effectively. Pretreating can recover, retain, and enhance image feature which is [advantageous] to enhancing recognition rate.” (p. 2719). Their results show that HE effectively enhances the images but has no obvious effect on reducing the noise, while Median Filtering successfully reduces the noise.

4.2 Historical Portraits

Other attempts to process historical photos using machine learning have been made, however with slightly different purposes. An example is Photo Sleuth, a web-based platform to identify soldiers from the American Civil War (1861-1865), which was a widely photographed conflict where a vast majority of photos were unidentified [MTML19]. Mohanty et al. describe that modern face recognition algorithms can be used to simplify identification in the photographs, but they rarely work solely on their own. Therefore they let users of the platform help in identifying and adding photos.


As we have also observed in our project, Mohanty et al. explain that historical photos are “often achromatic, low resolution, and faded or damaged, which might result in loss of useful information for identification”, which adds an additional layer to consider when developing useful algorithms. In our case, some of the photographs will therefore have to be removed or edited before we can depend on them for our research.

Another project analyzing the usefulness of machine learning in this context is Rasmus and Green's study [RG12] on how using genetic and contextual data can improve face recognition. The aim of their project was to improve the accuracy of recognizing individuals in a family photograph album by using family trees. The tree itself was constructed by estimating relationships between the individuals based on, for example, the physical distance between them and the co-occurrence of individuals in photos. This information was then used by the algorithm in combination with face recognition to determine when individuals lived and how they were related.

Similarly to Rasmus and Green, a possible long-term goal of this project is to use our research as a component in genealogy. Predicting age and gender can be part of helping users find pictures of their relatives in historical photo archives. A challenge with this aspect of the research is that old photos often lack the metadata, such as GPS position and a timestamp, that modern photos have. However, with the data set used in this project we at least have access to the date of each photograph, which is useful.

5 Method

In the following sections we will describe the methods used in our project, and explain considerations taken when selecting some methods over others. Particularly we will describe the tools, languages, and frameworks used while developing our final model.

In the first section we will introduce the main programming language we have utilized in this project. Then we will discuss preprocessing techniques that we have used on our data set. Furthermore, we will motivate the choice to use Convolutional Neural Networks (CNN) for classifying age and gender. Following this, we will describe the layers in our CNN model and the functions we have used to build it. We will also introduce how we have implemented face detection, using a pre-trained CNN model.

In the last sections we will first briefly discuss functions that help us interpret the results of our model and introduce the platform on which all training and testing in our project took place. Finally, we will briefly mention the methods used in our web interface.


5.1 Programming Language

For developing the machine learning model we chose Python [VRJL95] as the programming language, as it offers numerous machine learning and computer vision libraries, making it efficient to use. Some of these libraries will be addressed in the following sections, but examples include OpenCV [Bra00] for computer vision and scikit-learn [PVG+11] for machine learning.

Python is widely used for computer vision tasks, resulting in a large community of people contributing to keeping libraries updated and developing new ones. In a report from SlashData [CSS+18] from 2018, they claim that 69% of machine learning developers and data scientists use Python. In comparison, only 24% used R [RSt19], another programming language suitable for machine learning. R also provides a wide range of techniques for statistics and graphics, but due to Python's vast popularity and the group's previous familiarity with the language, Python was the preferred choice.

5.2 Image Preprocessing

In order to train the model in a gender-equal fashion, the data set was initially randomly divided into equal parts men and women. As later described in Section 5.6, all images are fed through a face detection model to isolate the individual faces as new square images. All images of the faces are then standardised to a width and height of 96 pixels in order to reduce computing needs.

Apart from the previously mentioned steps, other preprocessing methods were tested and compared, which will be discussed in the following sections.

5.2.1 Grayscale & BGR

The models were tested using both colored (BGR) and black-and-white (grayscale) images to determine whether color affected the performance of the models. An image encoded using BGR is represented and stored as three stacked matrices, where each matrix is a color channel (blue, green, and red, see Figure 7) of size image width × image height. Each value in the matrices represents how intense the pixel color of that channel is. An image encoded as grayscale will only have pixel values between 0 and 255 (0 = black, 255 = white) and will therefore only consist of one channel. This will in turn reduce the amount of processing required by the model.


Figure 7 Example of BGR (left) and grayscale (right) image representation
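A minimal OpenCV sketch of the two representations (the file name is hypothetical, not a file from the project):

```python
import cv2

img_bgr = cv2.imread("face.jpg")                        # 3 channels: blue, green, red
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)    # 1 channel, values 0-255
print(img_bgr.shape, img_gray.shape)                    # e.g. (96, 96, 3) vs (96, 96)
```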

5.2.2 Histogram Equalization & CLAHE

Histogram Equalization (HE) and Contrast Limited Adaptive Histogram Equalization (CLAHE) are two different preprocessing techniques [cla] that are used in order to enhance the contrast of images, as can be seen in Figure 8.

Figure 8 Example of image preprocessed using Histogram Equalization (middle) and Contrast Limited Adaptive Histogram Equalization (right)
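Both techniques are available in OpenCV; the sketch below uses common default parameters and a hypothetical file name, not necessarily the values used in the project.

```python
import cv2

gray = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)     # hypothetical file name

# Global histogram equalization (HE)
equalized = cv2.equalizeHist(gray)

# CLAHE: equalization applied per tile with a limit on contrast amplification
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
clahe_img = clahe.apply(gray)
```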


5.2.3 Canny Edge Detection

Canny Edge detection [Ope] is a popular preprocessing step used to extract edges in an image. The resulting image can be seen in Figure 9.

Figure 9 Example of image preprocessed using Canny Edge Detection
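In OpenCV, Canny edge detection can be applied as sketched below; the threshold values and file name are illustrative, not taken from the report.

```python
import cv2

gray = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file name
# The two values are the lower and upper hysteresis thresholds.
edges = cv2.Canny(gray, 100, 200)
```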

5.2.4 Median Filtering

Median Filtering is a technique for removing noise in images, especially salt-and-pepper noise. The method replaces each pixel value of the image with the median value of its surrounding pixels [HYT79]. As can be seen in Figure 10, the noise (added for demonstration purposes) is successfully removed by the method, creating a smoother image.

Figure 10 Example of image preprocessed using Median Filtering with noise added for demonstration purpose
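A corresponding OpenCV sketch, with an illustrative kernel size and file name (neither taken from the report):

```python
import cv2

gray = cv2.imread("noisy_face.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file name
# Replace every pixel with the median of its 3 x 3 neighbourhood.
denoised = cv2.medianBlur(gray, 3)
```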

5.2.5 Normalizing Images

In the final step of preprocessing, the images are transformed into lists of pixel values and divided by 255. This rescales the values to the range 0-1, which helps the algorithm to, for example, remove distortions appearing from shadows and high levels of light. Reducing the range of the values also removes any implied significance in the difference of magnitudes between the values.
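Putting the fixed steps together, the preprocessing could be sketched as follows; this is an illustration of the steps described above, not the project's exact code, and the choice of grayscale conversion varied between experiments.

```python
import cv2
import numpy as np

def preprocess(face_img):
    """Resize a cropped face, optionally convert to grayscale, and normalize."""
    face_img = cv2.resize(face_img, (96, 96))          # standardized size (Section 5.2)
    gray = cv2.cvtColor(face_img, cv2.COLOR_BGR2GRAY)  # optional grayscale conversion
    return gray.astype(np.float32) / 255.0             # rescale pixel values to 0-1
```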

5.3 Convolutional Neural Networks

The main part of this project, the age and gender classification, is done using a Convolutional Neural Network (CNN). A CNN is, as previously explained in Section 2.3.3, a deep learning architecture. These architectures are based on a sequence of heterogeneous layers which execute different operations organized in a computational graph. The output of each layer is sent to the next layer until the final output is reached [Giu17]. The chosen layers used in our model will be discussed in Section 5.4, and are illustrated in Figure 11.

An alternative to CNNs would be the Support Vector Machine (SVM) [CL11], but as Chaganti et al. mention, SVM works better when the data sets are smaller [CYN+20]. When testing SVM on our data set, the results also showed poor performance compared to our CNN model.

The CNN model was developed using the open-source software library TensorFlow [AAB+15], which is used for implementing machine learning algorithms and particularly neural networks. Another alternative considered was PyTorch [PGC+17], but due to TensorFlow's popularity and larger community it was favoured over PyTorch. The model also uses the deep learning API Keras [ker15], which runs on top of TensorFlow and works as an interface or framework for problem solving within machine learning.

5.4 Convolutional Neural Network Layers

The process of developing the Convolutional Neural Networks mainly consists of combining layers in different ways while tweaking the parameters of the layers (such as size). In the following sections we will introduce the selected layers in our model and their associated Keras functions.


Figure 11 A simplified model of the layers in our algorithm. The dots between “feature learning” and “classification” represent repetition of the convolution and pooling before the flattening layer.

5.4.1 Convolutional Layers

Convolutional layers work very well for image classification tasks and are mostly applied to two-dimensional inputs (such as a 2D image). These layers perform discrete convolution between a kernel and the two-dimensional input [Giu17]. A kernel is a matrix that is moved across the input, forming an output matrix with some features enhanced. The effect can, for example, be as simple as the picture getting sharpened or blurred [Gan19]. An example of a kernel is the Laplacian kernel, as shown in Figure 12. In practice, the kernel values are not predetermined like in the figure (0, -1, etc.) but instead determined through the training process to extract features that the model considers to be important.

Each convolutional layer is given a number of kernels to convolve with the input. Each kernel produces a 2D map [Ros18]; in Figure 11 this corresponds to one of the white squares in the convolution. Layers early in the model learn fewer, bigger kernels and detect patterns such as edges. Convolutional layers deeper in the model learn more complex patterns and therefore use more, smaller kernels [Giu17]. We used the class Conv2D [Kera] in Keras to create the convolutional layers in our model.


Figure 12 Laplacian Convolution. The pixel values of I are example values and not true to the picture. k represents a Laplacian 3 × 3 kernel
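As an illustration of applying a fixed kernel, a 3 × 3 Laplacian kernel like the one in Figure 12 can be convolved over an image with OpenCV; the file name is hypothetical and the code is only an illustration, not part of the project.

```python
import cv2
import numpy as np

# A 3 x 3 Laplacian kernel similar to the one in Figure 12.
laplacian = np.array([[ 0, -1,  0],
                      [-1,  4, -1],
                      [ 0, -1,  0]], dtype=np.float32)

gray = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
edges = cv2.filter2D(gray, -1, laplacian)   # slide the kernel over the image
```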

5.4.2 Pooling Layers

When the number of convolutions is very high, pooling layers are used to reduce the complexity [Giu17]. The pooling layer transforms groups of pixels in an image into a single value. This can be done according to different strategies, but one of the most common, and the one we have used in our model, is “max pooling”. In max pooling, a two-dimensional group of N × M pixels is transformed into a single value, namely the greatest value in the group. For our model, we use max pooling with a 2 × 2 group. For this, Keras offers the class MaxPooling2D, which we used.

The convolution and pooling layers are part of the feature learning of the model, that is, extracting and analyzing, for example, the width of the nose in a picture. These two layers are repeated several times to detect both basic and complex patterns in the pictures [Giu17]. After the feature learning, the input sent to the next layers is normally flattened, as shown in Figure 11.

5.4.3 Dropout Layers

The purpose of the dropout layer is to prevent overfitting [Giu17]. Overfitting is when a model learns the training data too well [Bro16]. Consequently, when new test data (in our case a new photo) is applied to the model it fails to make a reliable prediction, as this data was not part of the original training data.

To counteract this, the dropout layers randomly set some of the input elements to 0 [Giu17, Bro18], hence dropping some of the inputs to that layer. This is especially common in big models. Dropout also increases the overall training performance since it reduces the number of nodes that need to be trained during each step of the training process. It is also computationally cheap, especially considering the great effect it has on reducing overfitting [Bro18]. The dropout only occurs during the training of the model and therefore does not impact the predictions of a trained model [Bro18]. The dropout layers are not included in Figure 11, but normally occur before and after the fully connected layers that will be explained next.

We used two Dropout layers in our model, the first discarding 50% of the input elements and the second one 80%. This may sound like a lot, but it is important to understand that all the layers are repeated several times. In each dropout it is random which inputs are set to 0, so a dropout of 80% does not mean that 80% of the pixels are discarded forever. In Keras we used the class Dropout [Kerc] for our Dropout layers, which is the basic Dropout layer in Keras.

5.4.4 Fully Connected Layers

Fully connected layers, also called “dense” layers, are normally used as intermediate or output layers in the CNN architecture [Giu17]. We created these using the Keras function Dense [Kerb]. They are particularly useful when wanting to represent a probability distribution [Giu17], for example when determining whether the output should be classified as a man or a woman. The dense layer can then have an output vector where each element represents the probability of a specific class, as shown in Figure 11.

5.5 Activation Functions

The nodes in the fully connected layers and the convolutional layers use activation functions, which determine the output of the individual nodes. Commonly used activation functions, which are also used in our models, are the Rectified Linear Unit (ReLU) for when the output needs to be linear (for example when predicting age) and softmax for when the output consists of multiple classes (predicting gender) [Giu17].
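For illustration, the two activation functions can be written in a few lines of NumPy; this is a textbook definition, not code from the project, and the example class order is arbitrary.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: negative activations become 0, positives pass through."""
    return np.maximum(0, x)

def softmax(x):
    """Turns a vector of scores into a probability distribution over the classes."""
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 0.5])))   # e.g. probabilities for [woman, man]
```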

5.6 Face Detection

Another part of the machine learning pipeline where we use a CNN is face detection. Even before preprocessing the images, all faces in the images are detected and cropped to be used as input data to the classification models. This is done using a CNN-based pre-trained model from the Dlib library [Kin09] to ensure irrelevant features are not extracted from, e.g., the image background.


Alternatives considered and tested were a Viola-Jones based model from the OpenCV library [Bra00] and a HOG-based model, also from the Dlib library, as discussed in Section 2.3.2. However, these models did not perform as well (they could not find all the faces), especially when the person in the photograph did not directly face the camera. We discuss these attempts further in Section 8.2. Therefore, the Dlib CNN model was favored.

5.7 Visualization

After training the model, it was tested using the test data from the train_test_split function. This function comes from the library scikit-learn [PVG+11] and randomly divides the data set into a training set and a validation set. To visualize the model's accuracy (how often it classified an image correctly) we used confusion_matrix and classification_report from the scikit-learn library, as well as pyplot from the library Matplotlib [mat21].
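The evaluation flow could be sketched as below; the random data and the simple classifier are placeholders standing in for the project's preprocessed images and CNN model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Placeholder data standing in for flattened 96 x 96 face images and gender labels.
X = np.random.rand(200, 96 * 96)
y = np.random.randint(0, 2, size=200)       # 0 = female, 1 = male (arbitrary encoding)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000)     # stand-in for the CNN model
clf.fit(X_train, y_train)                   # train only on the training set
y_pred = clf.predict(X_test)                # predict on the held-out test set

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```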

5.8 Web Interface

The web interface was developed in the JavaScript [jav] framework React.js together with the JavaScript run-time environment Node.js. As the group had previous experience with React.js, it was favored over other JavaScript frameworks such as Angular [ang] or Vue.js [vue], which would have required time to learn. Since the interface is a small part of the final product we prioritized developing it quickly over researching the most suitable framework for this task. There were also no complicated algorithms or functions to be developed in the web interface, so essentially any framework would have worked in this context.

6 System Architecture

The system consists of two main components: the server with the machine learning models and the web interface. Their purposes and the communication between them will be further discussed in the following sections.


Figure 13 Overview of the high-level system architecture. The interface sends the images to the server for classification, which then responds with the resulting predictions for age and gender.

6.1 Web Interface

The web interface is a simple user interface built using React.js. It allows the user to upload images to the server for prediction and then displays the results of the classification. The web interface also provides the user with the option to download the results, in the form of the predicted ages and genders, as a spreadsheet file (.csv) provided by the server. The .csv format was chosen because it is easy to open in Excel and therefore easy to integrate into other projects.

6.2 Server

The server houses the machine learning pipeline, which processes all the images provided by the user through the web interface. All images are run through the pipeline, as can be seen in Figure 14. The results are then returned to the web interface and condensed into a spreadsheet file (.csv) that can later be downloaded by the user through the web interface, as mentioned before.


6.3 Machine Learning Algorithms

To classify gender and age the image uploaded to the server is fed to a model for gender prediction and another for age prediction, after which the results are sent back to the user. The reason for splitting the age and gender classification into two algorithms is to achieve as high accuracy as possible, as each algorithm is optimized for its own purpose.

6.4 Front End and Back End Communication

The front end interface and the Python algorithm communicate via an API developed with Express [Exp], a Node.js [nod] back end application framework. The communication is done via HTTP request methods that Express provides, more specifically POST and GET. The API listens on port number 4000, to which the front end sends the HTTP requests via the promise-based HTTP client Axios [She] and the built-in JavaScript fetch method.

Figure 14 Overview of the machine learning pipeline

7 Requirements & Evaluation methods

One of our requirements is that the system should have an accuracy of 99%, meaning that 99% of all test images analysed will be correctly classified in terms of gender and age. We believe that this is as close to a perfect model as we could get, given that there are always some edge cases which cannot be covered.


Another requirement is that the algorithm should provide a higher accuracy on historical images specifically, in comparison to already existing algorithms, in order to make the algorithm attractive to a potential user. Furthermore, the machine learning model should be made available for use via generated model files containing the weights of the CNNs.

One requirement on the web interface is that it must have a reasonable response time (no more than 10 seconds, given the amount of heavy calculations done by the machine learning models). In order to classify images it should be possible for the user to upload either a single image or a folder of multiple images. The classification results must be returned to the user and visualized on the web interface. The results should also be available to download as a spreadsheet file (.csv).

7.1 System Evaluation

In order to produce an accuracy score we split the data set into two parts, a test set and a training set. The splitting is done using the function train_test_split provided by the library scikit-learn (see Section 5.7), using 25% of the images, randomly selected by the function, as the test set and the rest as the training set.

The model is then trained using only the images in the training set and evaluated on the testing set. As the model makes predictions about the test set, the predictions are compared to the actual labels of the images and an accuracy score is calculated.

To evaluate how well the system works in comparison to existing algorithms, the same data set of historical images was tested with the framework DeepFace, described in Section 2.5.

To further analyse our model in comparison with existing algorithms, a selection of historical photographs from another data set, called Photo Sleuth [MTML19], was retrieved and classified by both our model and the DeepFace model. The data set consisted of 25 males and 25 females. However, age labels could not be found for these photographs, hence only gender classification could be tested.

Since the web interface is very simple in its functionality, the project group decided not to include an external test group to test the usability. Instead, we tested the interface internally by measuring the response time and checking that the correct values are returned to the web page and visualized accordingly. The interface only contains one web page and the testing is not very extensive, hence our decision to test within the project group.


8 Preprocessing the Images

In order to get as good results as possible from the machine learning algorithm, we preprocessed the images in our data set. The faces in the images were extracted (as described in Section 5.6) and later processed by the preprocessing methods described in Section 5.2. The following sections discuss these techniques as well as some others that were attempted.

8.1 Dividing the Data Set

The provided data set was imbalanced and consisted of 11 000 photos of approximately 86% males and 14% females. The risk with an imbalanced data set is that the model ignores the smaller classes while focusing on labeling the larger ones correctly [Ste20]. In these cases the model risks getting a high accuracy from repeatedly predicting the majority class, in our case males. There are ways to cope with imbalanced data sets, one being under-sampling, which is to simply remove data from the majority class [Ste20]. As previously discussed in Section 3.1.2, we therefore reduced the data set to include equal amounts of males and females. For the main part of the project we used this equally distributed data set, but at a late stage of the development process it turned out that the accuracy was still better when training on the full data set (more on this in Section 13.4.1).

Furthermore, a lot of the images in the data set lacked age labels, which meant they could not be used when training the age model. Consequently, all images missing corresponding age labels were also removed to generate a data set usable for age detection.

8.2 Face Detection

The first attempt to detect the faces in the pictures was to use Viola-Jones detection in OpenCV [Bra00]. In the beginning we looked at both face and eye detection to get an understanding of how well this worked. It was immediately obvious that the algorithm was struggling. In several pictures, multiple faces were found when only one was present, as shown in Figure 15. For example, the algorithm interpreted several buttons as faces, and nostrils as eyes.
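For reference, a Viola-Jones style detection attempt with OpenCV could look like the sketch below; the cascade file is the standard frontal-face cascade shipped with OpenCV, while the image file name and parameters are illustrative. Keeping only the largest detection, as discussed in the next paragraph, is shown at the end.

```python
import cv2

# Pre-trained frontal face Haar cascade that ships with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

gray = cv2.imread("portrait.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical file name
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Simple work-around: keep only the detection with the largest area.
if len(faces) > 0:
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    face_crop = gray[y:y + h, x:x + w]
```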

A simple way of still getting the correct face in each picture was to choose the face with the largest area. This worked in most cases, however we realized that this could be inconvenient when more than one person was present in the picture. Also, we still had the problem that some detections did not portray a face but rather a button or a shoulder.

Figure 15 Unsuccessful face detection using Viola Jones method

Another problem was pictures in profile, which the original rectangular features are not designed to handle, as discussed in Section 2.3.1. Therefore we downloaded a file called "haarcascade_profileface.xml" from the OpenCV CascadeClassifiers [Bra13]. This also gave disappointing results, and we realized that we had to use another strategy than Viola-Jones detection in order not to lose a large part of our data set during the detection stage. As mentioned in Section 5.6, the final system therefore uses a CNN-based pre-trained model from the Dlib library, using the method cnn_face_detection_model_v1 and weights provided in the Dlib documentation¹ [Kin09]. The model managed to find all faces in the data set (even when the person in the photo was not directly facing the camera) without any false positives, outperforming our previous attempts.

¹ http://dlib.net/python/index.html#dlib.cnn_face_detection_model_v1

In the final machine learning pipeline the face detection model locates and crops all the faces in the photographs which are then passed on for further preprocessing and classification.
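A minimal sketch of this detection step with Dlib is shown below; the weights file name and image path are placeholders, and the exact pipeline code in the project may differ.

```python
import cv2
import dlib

# Weights file from the Dlib model zoo (placeholder path).
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

img = cv2.imread("photograph.jpg")              # hypothetical file name
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)      # dlib expects RGB images
detections = detector(rgb, 1)                   # 1 = upsample once before detecting

crops = []
for det in detections:
    r = det.rect                                # bounding box of the detected face
    crops.append(img[r.top():r.bottom(), r.left():r.right()])
```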

9 Algorithm for Gender Classification

The model was initially based on a simpler Visual Geometry Group (VGG) model [SZ14] due to its documented good performance in image recognition tasks and its simplicity, as it only uses kernel sizes of 3 × 3.

The architecture of the final model, however, consists of three convolutional layers with an increasing number of kernels of decreasing size: the first has 32 5 × 5 kernels, the second 64 4 × 4 kernels, and the last convolutional layer has 128 3 × 3 kernels. The convolutional layers are then followed by a single dense layer with 64 nodes before the output layer. The full structure can be seen in Figure 16. This relatively small network was favoured over a more complex network architecture, such as the aforementioned VGG model, for the sake of performance and due to the small size of our data set.


Figure 16 Diagram of the gender classification algorithm. The image is inserted at the top and processed through each layer (top-down), and a classification is made and given as output.


10 Algorithm for Age Detection

In the early stages of developing the age detection algorithm, attempts were made to predict the exact age of the people in the photographs. However, this approach was quickly abandoned as it resulted in poor accuracy, varying between approximately 9% and 15%, depending on the data set the algorithm was trained on.

Since the performance of the model was very poor when trying to predict an exact age, age spans were used instead. The first approach was to create static age spans of 10 years each: for example, span 1 representing ages between 0 and 10 and span 2 representing ages between 10 and 20. The results from this approach gave us very clear insights. The algorithm was very good at predicting ages in the span 20-30 and fairly good in the span 30-40, but predicted very poorly in all other spans. This was not very surprising since the ages in our data set are concentrated between 20 and 35 years. We came to the conclusion that an enlargement of the data set would be most effective, but since that was not an option we decided to further develop the spans.

Instead of categorizing the ages using static spans, we decided to use dynamic spans by letting the model predict an exact age of the person and then adding a margin of error around the predicted age using the mean age error of the model. The resulting predictions are then returned to the user in the form predicted age ± m.a.e. (e.g. 25 ± 4.3 years). In comparison to static spans, this improved the cases where the actual age of a person is at the edge of a span.
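A sketch of how such a dynamic span can be formed from the exact-age prediction; the function and variable names are illustrative, not taken from the project.

```python
MEAN_AGE_ERROR = 4.3   # mean age error of the age model, in years

def age_span(predicted_age, mae=MEAN_AGE_ERROR):
    """Return the prediction as a (lower, upper) span: predicted age +/- m.a.e."""
    return predicted_age - mae, predicted_age + mae

lower, upper = age_span(25.0)
print(f"Predicted age: 25 ± {MEAN_AGE_ERROR} years ({lower:.1f}-{upper:.1f})")
```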

Although dynamic spans still suffer from the same problems as the previous methods, due to the lack of representation of people aged 40 and above in our data set, they allow for more precise predictions for ages below 40 and hence increase the overall accuracy.

Because of the aforementioned problem, testing showed that there was little to no performance gained by using a more complex network for the age classifier. Therefore the final model used for age detection is very similar to the gender classifier. It uses the same layers as can be seen in Figure 16, with the only exception being the final output layer, which instead consists of a single node using the ReLU (Rectified Linear Unit) activation function.


11 Integration of User Interface and Algorithm

The interface is built in the JavaScript framework React.js and consists of a web page that provides two user functions. A user can upload photographs to the website and then send them to the Python back end via a Node.js web server (see Section 6.2). If any of the uploaded files is not an image, the user will be informed that they have to redo the upload. All communication between the front end interface and the back end Python algorithms is done via the Express API described in Section 6.4.

When the images are valid, on the other hand, the server temporarily saves them to a folder on the back end. Thereafter the images are run through the Python face detection algorithm described in Section 8.2. All faces that are found are cropped out of the image, resized to 96 × 96 pixels, and preprocessed as described in Section 5.2. The pictures where faces could not be detected are saved to a separate folder.

The preprocessed images are then analyzed by the classification algorithms described in the sections above, resulting in a predicted age and gender. The results are then sent back to the user side and displayed on the web page. The results consist of all age and gender predictions, as well as a list of the images where no faces could be detected. After the uploaded images have run through this entire process, they are deleted immediately, and the system does not save any information that the user provides. The user can thereafter download the results as a spreadsheet (.csv) file and save them to their computer.
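A rough Python sketch of this server-side flow is given below. The face detector, the preprocessing function, the label order, and all file paths are placeholders rather than the project's actual identifiers; the real detection and preprocessing are the ones described in Sections 8.2 and 5.5.2.

# Rough sketch of the back-end pipeline; detector, preprocessing and
# label order are assumptions, not the project's actual code.
import csv
import cv2

def analyze_uploads(image_paths, detect_faces, preprocess, age_model, gender_model):
    results, failed = [], []
    for path in image_paths:
        image = cv2.imread(path)
        faces = detect_faces(image)              # face detection (Section 8.2)
        if not faces:
            failed.append(path)                  # no face found, reported back to the user
            continue
        for (x, y, w, h) in faces:
            crop = cv2.resize(image[y:y + h, x:x + w], (96, 96))
            batch = preprocess(crop)             # assumed to return a (1, 96, 96, channels) array
            age = float(age_model.predict(batch)[0])
            prob = float(gender_model.predict(batch)[0])
            gender = "female" if prob > 0.5 else "male"   # label order is an assumption
            results.append({"file": path, "age": round(age), "gender": gender})
    return results, failed

def save_results_csv(results, out_path="results.csv"):
    # The predictions can be downloaded as a spreadsheet (.csv) file.
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "age", "gender"])
        writer.writeheader()
        writer.writerows(results)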

12 Evaluation Results

In this section the results from the evaluation methods described in Section 7.1 are presented.

12.1 Accuracy Results

Evaluating the performance of our final classification models, the gender classification algorithm provided an accuracy of 96% and the age detection algorithm resulted in a 5 year mean age error (m.a.e.).

Using the same test images from our data set in the DeepFace classification model (see Section 2.5), an accuracy of 84% was obtained for gender classification. For age detection the DeepFace model resulted in a 9 year m.a.e.


When testing with the Photo Sleuth data set we got an accuracy of 82% for the gender classification using our own classification model, while the DeepFace model gave an accuracy of 56%.

Although the accuracy of the gender classification algorithm did not reach our initial requirement of 99%, we still consider our results a success, as our model outperformed the DeepFace model as seen above. This meets the requirement that our algorithm should provide a higher accuracy for historical images compared to already existing models.

12.2 User Interface Results

The web interface was evaluated by the project group and the results showed that all the functionality worked according to the requirements. The response time was also considered to be reasonable, with an average response time of 8.3 seconds for analyzing 10 images, which is lower than our initial requirement.

13 Results & Discussion

In this section we will discuss results from the pretrained DeepFace model and compare these to the performance of our system. We will cover what preprocessing method worked best for our model, and discuss our final product and how it satisfies our goals from Section 3.

13.1 Predicting using the DeepFace Model

DeepFace (see Section 2.5) was used to create an initial benchmark for the project. As mentioned in Section 12, when testing the DeepFace model with the photographs from our data set, an accuracy of 84% for gender classification and a mean age error (m.a.e.) of 9 years were achieved. Furthermore, the DeepFace models were also tested on the Photo Sleuth images, with an accuracy of 56%. However, during these tests the age prediction was not evaluated since the Photo Sleuth images lack age labels.


13.2 Comparing Preprocessing Methods

The models were trained on data sets preprocessed using HE, Median Filtering, grayscale conversion, and Canny Edge Detection. As can be seen in Figures 17 and 18, Histogram Equalization resulted in the highest accuracy for the gender model, while the age model benefited the most from training on images converted to grayscale.

Figure 17 Validation set accuracy of the gender model when applying different preprocessing techniques to the data set (higher is better).


Figure 18 Mean age error when applying different preprocessing methods (lower is better).

13.3 User Interface

The user interface resulted in a web page where images can be uploaded, displayed, and sent away for classification analysis. The results from the analysis are then displayed on the web page and can be downloaded as a spreadsheet (.csv) file, which is easy to load into a Python program for further analysis using e.g. the Pandas library [Jef20].
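As an example of such further analysis, the downloaded file can be read back with Pandas; the column names below are assumptions for illustration, not necessarily the exact headers produced by the interface.

import pandas as pd

# Column names ("gender", "age") are assumed for illustration.
df = pd.read_csv("results.csv")
print(df.groupby("gender")["age"].describe())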

Screenshots of the web page can be seen in Figure 19 and Figure 20.


Figure 19 Web page before a photo has been uploaded.

Figure 20 Web page after a picture has been uploaded and analyzed.


13.4 Discussion

In the end, the results indicate that the project was a success with regards to the main task of developing an age and gender classification algorithm, as well as a web interface according to the requirements in Section 7. In the following section we will discuss our results from the technical progress, evaluation results, and how the project’s purpose and aims were met.

13.4.1 Accuracy Results

As described in Section 7, a requirement set by the project group was that the system should have an accuracy of 99%. However, this requirement was not met and the results indicate an accuracy of 96% for gender classification and a 4.3 year margin of error for age classification.

In the end, in contrast to our suspicions in Section 3.1.1, the results also showed that the model was not affected by the unequal distribution of genders. When trained on the full data set the model even performed better, with the aforementioned accuracy of 96%, compared to when trained on a reduced and equally distributed data set (where it achieved a top accuracy of 93%).

When tested on the images from Photo Sleuth, the model trained on equally distributed data was again outperformed by the model trained on the full data set, with an accuracy of 71% compared to 82%. This shows that the model indeed benefits from being trained on the full data set even though it is imbalanced.

Regarding age detection, the results show a very good m.a.e. (4.3 years) when classifying people in the lower age spans. However, a much higher m.a.e. was observed for people over 40 years, sometimes as high as 40 years, leading to people aged 80 being predicted to be 40 years old. Even though this impacts the reliability of the final model, we believe that the results are reasonable considering the previously discussed problems with an uneven distribution of ages within the data set.

13.4.2 Preprocessing

The results from the preprocessing tests show that the gender classifier performed best when the photographs were filtered using Histogram Equalization (HE). As described in Section 5.2.2, HE is a technique for enhancing the contrast of an image.

Many of the historical photographs are faded or generally of poor visual quality.


Furthermore, in many of the images the colors of the faces and the backgrounds tend to blend together, making it harder to distinguish important details of the face. The contrast enhancement from the HE filtering helps prevent these disturbances, which is probably the reason why it results in a better validation set accuracy.
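For reference, this kind of contrast enhancement can be reproduced with OpenCV as in the following sketch (the file names are placeholders):

import cv2

# Load a (placeholder) photograph, convert it to grayscale and enhance
# the contrast with histogram equalization.
image = cv2.imread("portrait.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
equalized = cv2.equalizeHist(gray)
cv2.imwrite("portrait_he.jpg", equalized)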

13.4.3 Meeting the Goals & Delimitations

The main goal of this project was to demonstrate an algorithm specialized in classifying age and gender in historical images. The purpose was to contribute with a more nuanced picture of facial classification by providing a new tool for researchers within history.

As previously discussed, the accuracy scores did not meet the requirements set for the algorithm. However, even though the requirements were not met, the results show higher accuracy scores than the commonly used DeepFace algorithm (see Section 2.5) when tested on historical photographs from two separate data sets. This shows that our algorithm is specialized in historical images and could function as an addition to already existing models, which meets the main purpose of this project. Therefore one can conclude that the project has succeeded in its purpose and aims.

One of our goals was to minimize the gender differences of our model; we wanted it to be equally good at predicting men and women. Therefore, we initially discarded a large part of our data set, resulting in an equal number of females and males. However, when testing this we obtained worse results than when we trained the model on the entire data set. Even though an equally distributed data set is preferred (see Section 8.1), it appears that the gain in size from using the full data set outweighed the fact that it was imbalanced. Unfortunately, this most likely means that the algorithm is slightly better at predicting males than females.

14 Conclusions

The aim of this project was to develop two CNN models for age and gender classification, trained exclusively on historical photographs. The expectation was to produce prediction accuracies higher than modern state-of-the-art models. This was indeed achieved: our models successfully predict gender with an accuracy of 96% and age with a mean age error of 4.3 years, despite being trained on a relatively small data set. This, together with the interface, allows users without technical backgrounds to find out the age and gender of the people in their own photographs.

Initially the idea was to train our models on an equally distributed data set, i.e. equal parts women and men. However, this did not turn out to be favourable, as the data set got reduced from 11 000 to 3 266 images. The resulting model experienced severe overfitting, which resulted in relatively poor performance (see Section 13.4.1). Instead, we chose to train the gender model on all images and the age model on all images that had a corresponding age label. This turned out to be more successful in terms of accuracy.

Furthermore, multiple preprocessing methods were tested and compared during training. The methods tested were grayscale conversion, Histogram Equalization (HE), Median Filtering, and Canny Edge Detection. While the gender model benefited most from using HE, the age model worked best with images represented in grayscale.

15 Future Work

As previously mentioned, developing machine learning models is a very experimental process. Although many different preprocessing approaches and model configurations were tested for this project, there is still much to be done to achieve an “optimal” model.

15.1 Improving the Data set

One simple further development would be to increase the size and variety of the data set. For example, the bias problems discussed in Section 3.1.2 could be alleviated by expanding the data set and including more photographs of people with different ethnic backgrounds.

Another interesting idea with regard to the data set would be to include modern photographs edited to imitate historical photographs. This could expand the training data set without being as limited by the availability of varied photographs as it is now.

15.2 Data Augmentation

As previously discussed, the data set was very imbalanced, as only around 14% of the pictures portrayed women. Consequently, the training and test sets became fairly small. A common method to extend small data sets is to augment the images.

Common augmentation strategies include, for example, tilting, zooming, and shifting the images horizontally or vertically [Sar19]. In Figure 21, one of the pictures in the data set has had a random combination of different augmentation techniques applied, creating six independent images that can be used to train the algorithm on.

We attempted to use the Keras ImageDataGenerator [Kerd] on our data set. During the few test rounds where this was applied to our model, we only obtained worse results than without the augmented images. Moreover, the training took significantly more time, which restricted our testing. We still believe that augmentation could be useful for our model, but due to a lack of time to try new parameter values and verify the results, we decided to remove the augmentation.
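For illustration, an augmentation generator of the kind we experimented with can be configured as below; the specific parameter values are examples, not the exact ranges we tested.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Example parameter values only; the ranges used in our tests are not reproduced here.
datagen = ImageDataGenerator(
    rotation_range=15,        # tilting
    zoom_range=0.1,           # zooming
    width_shift_range=0.1,    # horizontal shifting
    height_shift_range=0.1,   # vertical shifting
    horizontal_flip=True,
)

# x_train has shape (n, 96, 96, channels) and y_train holds the labels;
# datagen.flow(x_train, y_train, batch_size=32) then yields randomly
# augmented batches that can be passed directly to model.fit().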

Another possibility would have been to use data augmentation to increase the number of images for only a certain class. For example, in the case of age detection, data augmentation could be used to increase only the number of images of people aged 40 and above (thereby balancing the data set).

With all of this in mind, we believe that augmentation could be a useful tool for extending the data set when developing this project further.

Figure 21 Example of data augmentation of one of the pictures.

References
