
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Color Features for Boosted Pedestrian Detection

Master's thesis in Computer Vision, carried out at the Institute of Technology at Linköping University

by Niklas Hansson
LiTH-ISY-EX--15/4899--SE

Linköping 2015


Color Features for Boosted Pedestrian Detection

Master's thesis in Computer Vision, carried out at the Institute of Technology at Linköping University

by

Niklas Hansson
LiTH-ISY-EX--15/4899--SE

Supervisors: Gustav Persson, Autoliv Electronics AB; Fahad Khan, ISY, Linköpings universitet

Examiner: Lasse Alfredsson, ISY, Linköpings universitet


Division, Department: Computer Vision Laboratory (CVL), Department of Electrical Engineering, SE-581 83 Linköping
Date: 2015-11-04
Language: English
Report category: Master's thesis (Examensarbete)
URL, electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-122494
ISRN: LiTH-ISY-EX--15/4899--SE
Title: Färgsärdrag för boostingbaserad fotgängardetektering (Color Features for Boosted Pedestrian Detection)
Author: Niklas Hansson





Abstract

The car has become more and more intelligent throughout the years. Today's radar- and vision-based safety systems can warn a driver and brake the vehicle automatically if obstacles are detected. Research projects such as the Google Car have even succeeded in creating fully autonomous cars.

The demands for obtaining the highest rating in safety tests such as Euro NCAP are also steadily increasing, and as a result, the development of these systems has become more attractive for car manufacturers. In the near future, a car must have a system for detecting, and performing automatic braking for, pedestrians to receive the highest safety rating of five stars. The prospect is that the volume of active safety systems will increase drastically when car manufacturers start installing them not only in luxury cars, but also in regularly priced ones. The use of automatic braking places high demands on the performance of active safety systems; false positives must be avoided at all costs.

Dollár et al. [2014] introduced Aggregated Channel Features (ACF), which is based on a 10-channel LUV+HOG feature map. The method uses decision trees learned through boosting and has been shown to outperform previous algorithms in object detection tasks. The rediscovery of neural networks, and especially Convolutional Neural Networks (CNN), has increased performance in almost every field of machine learning, including pedestrian detection. Recently, Yang et al. [2015] combined the two approaches by using the feature maps from a CNN as input to a decision-tree-based boosting framework. This resulted in state-of-the-art performance on the challenging Caltech pedestrian data set.

This thesis presents an approach to improve the performance of a cascade of boosted classifiers by investigating the impact of using color information for pedestrian detection. The color self similarity feature introduced by Walk et al. [2010] was used to create a version better adapted for boosting. This feature is then used in combination with a gradient-based feature at the last step of a cascade.

The presented feature increases performance compared to the classifiers currently used at Autoliv, both on data recorded by Autoliv and on the benchmark Caltech pedestrian data set.


Acknowledgments

I would like to show my appreciation to Autoliv for giving me the opportunity to carry out my master's thesis at their Linköping office and for providing me with the necessary equipment. A special thanks to Gustav Persson, my supervisor at Autoliv, who has provided me with great assistance in framework-related issues and theoretical discussions.

Linköping, November 2015
Niklas Hansson

Ne omittas solum ambulantem


Contents

Notation

1 Introduction
  1.1 Problem formulation
  1.2 Related work
  1.3 Overview
  1.4 Limitations
  1.5 Results
  1.6 Autoliv
  1.7 Outline

2 Image descriptors
  2.1 Viola-Jones
  2.2 Histogram of oriented gradients

3 Supervised learning
  3.1 Image features
  3.2 Binary classification
    3.2.1 Per window approach
    3.2.2 Sliding window approach
  3.3 Cascade of boosted classifiers
    3.3.1 Boosting
    3.3.2 AdaBoost
    3.3.3 Cascade using boosting

4 Data acquisition
  4.1 Extraction
    4.1.1 Pre-processing
  4.2 Data sets
    4.2.1 Caltech
    4.2.2 Autoliv

5 Classifier training
  5.1 Bootstrapping
  5.2 Pre-cascade
  5.3 Image features

6 Color features
  6.1 Color spaces
    6.1.1 rg
    6.1.2 Opponent colors
    6.1.3 HSI
    6.1.4 Color Names
  6.2 Color Self Similarity
    6.2.1 Adaptation to boosting
    6.2.2 Feature integration

7 Evaluation
  7.1 Per window evaluation
    7.1.1 ROC - Per window
  7.2 Sliding window evaluation
    7.2.1 ROC - Sliding window
    7.2.2 Baseline cascade
    7.2.3 Evaluation pipeline
    7.2.4 Log-average miss ratio
    7.2.5 Classification rate

8 Experimental results
  8.1 Autoliv dataset
    8.1.1 Parameter selection
    8.1.2 Sliding window - Autoliv
  8.2 Caltech dataset

9 Conclusions
  9.1 Discussion
  9.2 Future work

A Appendix
  A.1 Integral image

Notation

Sets

R      The set of real numbers
R^N    The set of real-valued vectors of dimension N

Variables

x_i    Sample patch i
y_i    Label corresponding to sample x_i, y_i ∈ {0, 1}
Φ      Feature vector
Φ_∇    Confidential gradient-based feature
u      Vector (bold face)
u      Scalar variable

Abbreviations

CSS    Color Self Similarity
VJ     Viola-Jones
HOG    Histogram of Oriented Gradients
PD     Pedestrian Detection
R_tp   True positive rate
R_fp   False positive rate
fppi   False Positives Per Image
ROI    Region Of Interest
ROC    Receiver Operating Characteristic
bb     Bounding Box


1 Introduction

Automotive safety has in recent years started to rely more and more on vision-based systems. At the present day these are mostly used as driver aids, detecting vehicles, pedestrians and animals. Pedestrians are especially exposed in traffic: in bad weather and poor lighting conditions they are difficult for drivers to detect. Pedestrians also exhibit large variations in illumination, which increases the difficulty of correctly classifying them.

The demand for increased detection performance is large, partly for creating better warning systems. The main objective, however, is automatic braking, which requires high reliability; a false detection could lead to terrible consequences. Traditionally, grayscale images have been used in pedestrian detection, due to the color constancy problem. It has been shown in a variety of publications that color information is an important complementary cue to shape, and many new features have been suggested to bypass the color constancy problem and increase detection performance on benchmark data sets such as PASCAL VOC 2007 and Caltech.

1.1 Problem formulation

The main objective of this work is to evaluate the impact of color in pedestrian detection (PD), which, in combination with the existing features at Autoliv, could increase the overall performance. The features shall be adapted to fit into the framework of a cascade of boosted classifiers. The desired color feature should possess photometric invariance while being computationally efficient, with real-time capability.


The reflected colors of an object change with the scene illumination. In computer vision this is referred to as the color constancy problem. It is especially apparent in outdoor scenes under different weather conditions and times of day. Several methods have been proposed to counter the color constancy problem, such as the ones presented in Walk et al. [2010], Van de Weijer and Schmid [2006], and Wang et al. [2012].

1.2 Related work

Freund and Schapire [1997] introduced AdaBoost (Adaptive Boosting), where a weak learner is created by evaluating a set of features and choosing the best one. The problem is then re-weighted such that misclassified samples have more influence before a new weak learner is trained. This allows the algorithm to adapt to more difficult samples. The individual learners do not have to perform miracles, as long as they are better than random; by combining all weak learners, their joint contribution can produce a strong learner.

Viola and Jones [2001] proposed an efficient method for object detection. They introduced the concept of a cascade relying on AdaBoost. The cascade consists of several steps of boosted classifiers where the number of weak classifiers in each step increases. "Easy" examples are rejected at an early stage, and more complex classifiers can be used efficiently on a small number of hard examples. The combination of Haar wavelets as features together with integral images results in fast training and classification: the computational cost of evaluating the features on an image patch using integral images is independent of its size.

Lowe [2004] presented a new scale- and rotation-invariant feature called the Scale Invariant Feature Transform, known in the literature as SIFT. Keypoints were extracted from images using differences of Gaussians. At these locations, local image descriptors based on the human perception of edges were computed. The descriptor was created from histograms of gradient orientation within the region of a keypoint and normalized to account for changes in illumination. SIFT was used in object detection by matching the descriptor from keypoints to the nearest neighbour from a training set. Keypoints were then allowed to vote for object poses using the Hough transform, resulting in a higher accuracy of 2D location, scale and orientation.

Dalal and Triggs [2005] proposed a feature called Histogram of Oriented Gradients (HOG), which has been widely used in pedestrian detection. The feature relies on histograms over the gradient orientations within cells, which makes it good at capturing shape.

Van de Weijer and Schmid [2006] investigated color descriptors in terms of geometric robustness, photometric stability, and generality; i.e., the descriptor should be robust against changes in illumination but also have discriminative power in order to be used for multi-class classification of images. The proposed descriptors were histograms of hue, weighted with the saturation, and of the opponent angle, constructed from the first order derivatives of the opponent colors O1 and O2. The choice of descriptor depended on the data: hue obtained good results on highly saturated sets, while the opponent angle performed better in the presence of diffuse lighting and when the colors were less saturated.

Van de Sande et al. [2010] investigated the invariance of different color descriptors with respect to changes in illumination. These descriptors were used to train an SVM classifier, Cortes and Vapnik [1995], on different data sets, e.g. PASCAL VOC 2007 for scene and object recognition. They concluded that, with no previous knowledge of the data, the scale- and shift-invariant Opponent-SIFT performed best, followed by the scale-invariant C-SIFT.

The bag-of-words model has been widely used in document classification and recently also in computer vision, e.g. by Sivic and Zisserman [2003]. Color names (CN) are created by computing probabilities for a number of linguistic color labels given the rgb value of a pixel. Van de Weijer et al. [2009] proposed a new way of learning color names, where weakly labeled (many false positives) images from Google were used. The traditional way of learning CN is by hand-labelling color chips into a set of eleven CN; this is done under controlled illumination and against a neutral background. Since natural images under different illumination were used in that paper, CN with very low saturation (achromatic), such as white, gray and black, occupy a larger part of the color space. In real-world applications the ability to represent achromatic colors is a desirable feature.

Khan et al. [2012a] continued the work of Van de Weijer et al. by further investigating the hue and opponent angle descriptors proposed in Van de Weijer and Schmid [2006] and the CN from Van de Weijer et al. [2009] for object detection. A common problem when trying to avoid the color constancy problem is the loss of discriminative power. A comparison was made between the discriminative properties and the dimensionality of the investigated descriptors. The results showed that CN outperforms the others, with significantly better discriminative power using only eleven CNs, while still possessing some photometric invariance. The recognition performance using CN increased significantly on the PASCAL VOC data sets (2007, 2009).

In Khan et al. [2012b] and Khan et al. [2013], the authors compared early and late fusion of shape and color cues for object and action recognition. They also proposed the use of color attention maps (bottom-up and top-down) to modify the bag-of-words model for shape cues. The use of attention maps outperformed early and late fusion on PASCAL VOC data sets (2007, 2009).

Walk et al. [2010] introduced the concept of Color Self Similarity (CSS) for pedestrian detection. The similarities of color histograms between different cells in an image patch were used as features. This resulted in state-of-the-art performance when used in combination with HOG and histograms of oriented flow. Häselich et al. [2013] proposed an efficient version of CSS, resulting in an acceleration by a factor of four. As mentioned before, color cues in combination with shape features have been shown to increase detection performance. Goto et al. [2013] proposed an alternative view, where CSS is computed on an image patch and HOG features are then extracted from the similarity images.

More biological approaches have been taken by Zhang et al. [2012] and Wang et al. [2012]. In both articles, the authors use features that try to mimic the primate visual system. Zhang et al. [2012] created descriptors that imitate the single and double opponent neurons. The single opponent neuron is responsible for capturing surface information such as color, and is triggered by color opponency, e.g. red vs. green. The double opponent is used for extraction of boundaries, and is responsive to both color and spatial opponency. The proposed descriptors outperform the hue-SIFT and opponent-SIFT descriptors for object recognition and scene categorization, as used by Van de Weijer and Schmid [2006]. In Wang et al. [2012], the Color Maximal Dissimilarity Pattern (CMDP) was proposed, where oriented filters designed to mimic simple cells in the human visual system were used to capture edge information. The behavior of more complex cells was obtained by max-pooling, resulting in a physiologically motivated suppression effect and a more invariant feature for pedestrian detection. CMDP produced results similar to the CSS proposed by Walk et al. [2010] when combined with the HOG feature on the INRIA pedestrian data set.

1.3 Overview

In order to incorporate and evaluate the color features described in this work, a framework for the whole classification chain, from extraction of samples to evaluation of classifiers, is needed. The following sections briefly describe the most vital parts of this framework.

Data extraction

First, a data set has to be acquired and samples need to be extracted from it. This has to be done with care, since the performance of a classifier is strongly dependent on image quality, distribution of samples, and the extraction method used to construct the training and validation sets.

Color features

The purpose of this work is to evaluate color features for pedestrian detection (PD). These will be used in combination with existing features from Autoliv's classification framework, consisting of a cascade of boosted classifiers. By utilizing the additional information from color channels, the performance is expected to improve in cases where shape cues are insufficient.

Inspired by the success of using color for image classification, we investigate the impact of color for pedestrian detection. Walk et al. [2010] introduced the Color Self Similarity feature (CSS), which utilizes color channel histograms in small cells; an intersection between these histograms is computed and used to construct the feature. The dimensionality of the feature vector is however very large, dim(Φ) ≈ 8000, which makes it unsuitable for real-time applications. By modifying the feature such that only a small region is used, the dimensionality can be reduced. This modification also makes it suitable for boosted classifiers, since the position of the feature can be varied. The details of this modification will be presented in Chapter 6.
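To make the similarity computation concrete, the sketch below shows the histogram-intersection core of such a boosting-adapted CSS feature in numpy. The cell size, bin count, and the use of raw color values in [0, 1] are illustrative assumptions; the actual cell layout, color spaces and parameters are presented in Chapter 6.

```python
import numpy as np

def cell_histogram(cell, bins=4):
    """3-D color histogram of one cell, L1-normalized.
    cell: (h, w, 3) array with color values in [0, 1]."""
    hist, _ = np.histogramdd(cell.reshape(-1, 3),
                             bins=(bins,) * 3, range=[(0, 1)] * 3)
    return hist.ravel() / max(hist.sum(), 1)

def css_similarity(cell_a, cell_b, bins=4):
    """Histogram intersection between two cells: 1 means identical
    color distributions, 0 means disjoint. Each cell pair yields one
    scalar feature; restricting the pairs to a small region around a
    reference cell keeps the candidate pool manageable for boosting."""
    ha, hb = cell_histogram(cell_a, bins), cell_histogram(cell_b, bins)
    return float(np.minimum(ha, hb).sum())

# Toy usage on two 5 x 5 cells of a random "patch"
rng = np.random.default_rng(0)
patch = rng.random((90, 45, 3))
print(css_similarity(patch[0:5, 0:5], patch[5:10, 0:5]))
```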

Cascade training

The existing classification framework uses a cascade of boosted classifiers. The first stages are designed to find easy negatives by using simple but fast features to quickly reduce the number of remaining samples. At later stages, more complex features can be incorporated to focus on more difficult samples, while still maintaining speed due to the reduced number of samples. The proposed color feature will be incorporated at the end of the cascade.

The selection of features consists of several steps. First, a number of feature candidates are generated, differing in size and position. Their responses are then computed on a training set, and the candidate that best separates the classes is chosen. The procedure is repeated to produce several winners, or weak learners. The responses from the chosen features are summed to produce a so-called strong learner. This output also serves as a measurement, or confidence, of the class membership of a certain sample.

At this point we have determined which features to use, and their positions and sizes. In order to create a classifier, we need to set a threshold on the output confidence to separate the classes. These features have however only seen the training set; if the threshold were determined from images used in training, the classifier would not generalize well to other scenes. The solution is to use a validation set, which has been unknown to the classifier during training. By computing the confidences on this set, and using a constraint such that e.g. 95% of pedestrians are correctly classified, the performance on unseen data will most likely improve.

The process is repeated to create more classifiers, or steps, which combined in a series form a cascade, see Figure 1.1. At each step, new negatives have to be extracted to maintain the ratio between positive and negative samples. New negatives are drawn, propagated, and added to the training or validation set. The process becomes very time consuming at later stages, where only a small fraction of samples pass through.

Classification and evaluation

Given a cascade of classifiers, the performance needs to be evaluated. This is performed by constructing a test set which is independent of the training and validation sets. A search grid is created and applied to frames in the test set, from which patches are drawn, classified, and clustered. The remaining detections and their corresponding confidences are saved to text files and matched against the ground truth using an overlap criterion.


Figure 1.1: Overview of the cascade structure (labeled data → Step 1 → Step 2 → Step 3 → ... → Step N → positively classified, with rejected samples leaving at each step). Each new step is trained from samples with confidence above the threshold from previous steps. Samples with confidences below the thresholds are rejected and removed from further training.

Given the matches and their confidences, performance curves can be produced and classifiers with different feature combinations can be compared.
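As a sketch of how a trained cascade is applied, the function below propagates patches through the steps and keeps the survivors' confidences; the stand-in classifiers and thresholds are placeholders, since the actual features and operating points are confidential.

```python
import numpy as np

def run_cascade(patches, steps):
    """Propagate patches through a cascade.
    steps: list of (classify, threshold) pairs, where classify maps an
    array of patches to an array of confidences. Returns the indices
    of surviving patches and their confidences from the last step."""
    alive = np.arange(len(patches))
    conf = np.zeros(len(patches))
    for classify, thresh in steps:
        if alive.size == 0:
            break
        conf[alive] = classify(patches[alive])
        alive = alive[conf[alive] >= thresh]   # reject below threshold
    return alive, conf[alive]

# Toy usage: three steps with stand-in confidence functions
rng = np.random.default_rng(1)
patches = rng.random((1000, 90, 45, 3))
fake_step = lambda p: rng.random(len(p))
survivors, scores = run_cascade(patches, [(fake_step, 0.2),
                                          (fake_step, 0.4),
                                          (fake_step, 0.6)])
print(f"{survivors.size} of {len(patches)} patches reach the end")
```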

1.4 Limitations

The main purpose of this work is to investigate the impact of color-based features for PD. The outcome of this work might not be suitable for target implementation with current technology; in that case the real-time goal might be discarded in order to obtain results. However, the rapid increase in computational power could make it a possibility in the near future.

The complete results presented in the final report might not be possible to recreate, due to confidentiality concerning the algorithms used by Autoliv.

1.5 Results

This thesis shows that an increase in detection performance can be obtained by using a boosting-adapted CSS feature, based on either color names or an opponent color space. The results are consistent on Autoliv data and on the Caltech set. Details can be found in Chapter 8.

1.6 Autoliv

Autoliv is world leading in developing and manufacturing automotive safety systems. At Autoliv Electronics in Linköping, the main focus is camera-based systems, where object detection and classification have an important role.


1.7 Outline

First, two common image descriptors are presented, followed by an introduction to supervised learning, the problem it solves and what it requires. The chapter continues with the principle of a cascade of boosted classifiers, how boosting works, and an example, namely AdaBoost. The next chapter presents how data extraction is performed, followed by the cascade training procedure. Afterwards, the investigated color features are presented, followed by the evaluation procedure. Finally, the obtained results and the conclusions drawn from them are presented.


2 Image descriptors

Extracting information from images is crucial in computer vision. Image descriptors come in many forms, but the common denominator is that they extract information from a region. The procedure can be based on e.g. shape, color, texture or motion. In many cases, the descriptors use image features that mimic the behavior of the human visual system; during millennia of evolution we have developed our vision, and naturally many applications in computer vision try to utilize some of its prominent features. In this chapter two examples of features are introduced, Viola-Jones and HOG, both of which have gained great popularity within the image classification community. These will be used to describe the principle of boosted classifiers. The topic of this work, color features, will be presented in Chapter 6.

2.1 Viola-Jones

In a time of limited computational power, the contributions made by Viola and Jones [2001] had a large impact on the field of object detection. The introduction of boosted cascade classifiers made the leap to real-time object detection possible: a feasible computational time could be combined with satisfying performance. The use of features that could be evaluated quickly was key in order to discard easy negatives at the early steps. By increasing the number of components at later steps, the desired performance could be achieved.

The presented principle was to use Haar-like features to extract the difference in intensity between neighboring image regions. Four typical feature variations are shown in Figure 2.1; there are, on the other hand, many possible variations one could use to further increase performance. The intensities within each region are summed and the results from the gray areas are subtracted from the white areas, resulting in a scalar value. The features can be seen as simple edge or ridge detectors: Figures 2.1a and 2.1b are examples of edge detectors, and Figures 2.1c and 2.1d can detect ridges. The last major contribution was the integral image, see A.1, where each pixel holds the accumulated intensity of previous pixels, that is, the sum of all pixels from the original image within the rectangle created by the top left corner and the current pixel. This construction makes it possible to perform summation over rectangular regions with only four arithmetic operations, independent of region size.

Figure 2.1: Examples of Viola-Jones features: (a) vertical, (b) horizontal, (c) diagonal, (d) ridge. The sums within the gray and white regions are computed; the response is given by the difference between the white (+1) and gray (−1) regions.

The VJ feature was first used in face detection. The horizontal feature, Figure 2.1b, can capture the difference in intensity between the forehead and the eyes, and the ridge variation, Figure 2.1d, can detect the bridge of a nose. VJ can also be applied in other applications such as pedestrian detection, where a combination of features can be used to detect the silhouette of a human. The construction of these features provides variability which a boosted classifier can take advantage of: the size, position, and type of feature can be varied to create a set of candidates used in the training of a boosted classifier. This will be discussed in more detail in Section 3.3.1.
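A minimal numpy sketch of the integral image (see Appendix A.1) and one two-rectangle VJ feature; the vertical variant and the coordinates are illustrative choices.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] holds the sum of img[:y, :x]; the extra zero row and
    column remove the need for boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y, x, h, w):
    """Sum over img[y:y+h, x:x+w] with four lookups, independent of
    the region size."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_vertical(ii, y, x, h, w):
    """Two-rectangle edge feature, cf. Figure 2.1a: left half (+1)
    minus right half (-1)."""
    return box_sum(ii, y, x, h, w // 2) - box_sum(ii, y, x + w // 2, h, w // 2)

img = np.arange(12.0).reshape(3, 4)
ii = integral_image(img)
print(haar_vertical(ii, 0, 0, 3, 4))   # responds to a vertical edge
```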

2.2 Histogram of oriented gradients

An image descriptor which has gained large popularity within the field of pedestrian detection, due to its desirable properties, is HOG. It was introduced by Dalal and Triggs [2005] and relies on directional histograms of image gradients, which can be used to capture the silhouette of upright pedestrians.

The first step in HOG is to compute the image gradient and divide the image into blocks and cells. The original implementation used cells of 6 × 6 pixels and blocks of 3 × 3 cells. Within each cell, a weighted histogram of gradient directions is created; the histograms have 9 bins and the gradient magnitudes are used as weights. Depending on the application, the directions can be signed or unsigned. In pedestrian detection, the unsigned version is preferred over the signed, due to the varying reflective properties of clothing. The histogram bins are in this case spread out over the interval [0, π] rather than the [0, 2π] used in the signed case. In order to improve accuracy, block normalization is performed, using e.g. L2-normalization. The histograms are then concatenated into a vector and used as a feature.

The use of gradient orientation results in a powerful edge feature. It can represent the complex shape of a pedestrian and gives a more discriminative description compared to VJ; HOG is, however, computationally heavier. The feature is local at cell level but uses all cells in a patch. This results in a complete representation, suitable for a standard classification framework such as Support Vector Machines (SVM), Cortes and Vapnik [1995]. An adaptation to boosting can be made by mimicking VJ features: by generating individual blocks with varying positions and cell sizes, a set of boosting-adapted HOG features can be created.
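A minimal sketch of the per-cell histogram computation just described (unsigned orientations in [0, π), 9 bins, magnitude weights), leaving out the block normalization step:

```python
import numpy as np

def hog_cell_histograms(gray, cell=6, bins=9):
    """One magnitude-weighted histogram of unsigned gradient
    orientation per cell of `cell` x `cell` pixels."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned: [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    n_cy, n_cx = gray.shape[0] // cell, gray.shape[1] // cell
    hists = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = np.s_[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
            np.add.at(hists[cy, cx], bin_idx[sl].ravel(), mag[sl].ravel())
    return hists

# Blocks of 3 x 3 cells would then be L2-normalized and concatenated.
rng = np.random.default_rng(2)
print(hog_cell_histograms(rng.random((90, 45))).shape)   # (15, 7, 9)
```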


3 Supervised learning

There exist two major types of machine learning techniques: supervised and unsupervised. Unsupervised learning uses unlabelled data and tries to separate samples into a number of classes/clusters, e.g. K-means [MacQueen et al., 1967] or expectation-maximization [Dempster et al., 1977]. The advantage of not having to label data manually comes at the cost of not being able to specify classes. Supervised learning uses manually labelled data: during training, samples belong to a specific class, so certain categories such as cars, traffic signs, or pedestrians can be targeted. This work focuses on supervised binary classification of pedestrians. In order to perform supervised training, labelling must be performed, usually manually, by marking image regions containing the object of interest.

3.1 Image features

The goal of classification is to separate classes. Training a classifier using only the intensity of images might work in some cases, but is in general a naive approach. Another method is to construct features that capture discriminative information between classes, such as the overall contour of specific objects, while at the same time allowing for small variations in pose. Feature construction can be performed in a number of ways depending on the application. Gradient-based descriptors, e.g. HOG, have been widely used in pedestrian detection to capture the silhouette of humans in an upright position. Computing feature vectors for all patches within a training set results in points in a multi-dimensional space, and a separating surface can be created to discriminate between classes using e.g. an SVM [Cortes and Vapnik, 1995]. The classification performance could be increased by using additional complementary information, based on e.g. depth or color. This will however increase the dimensionality of the feature space. Another possibility is to use individual features separately in a boosting framework, which is discussed in Chapter 6.

3.2 Binary classification

Binary classification targets the problem of classifying samples into two classes, e.g. pedestrian or not, by using labels y_i ∈ {0, 1} for all samples x_i. The classical convention is to assign one (1) to the object of interest (positive) and zero (0) to background areas or patches. Classes are not restricted to the presence of an object; the task could for instance be extended to separate several different types of animals. This is a more difficult problem, called multi-way classification, which is not treated in this thesis. That approach is more common in object recognition, while binary classification is often used for object detection. The circumstances under which a classifier is used vary. There are two major types: per window and sliding window. These are closely related, with some differences in application, which will be clarified in the two following sections.

3.2.1 Per window approach

A per window approach has a test set consisting of image patches x_i, where the object usually is centred, and each patch has a label y_i. This method is mainly used for object recognition, where the purpose is to distinguish objects from each other or determine the presence of an object. For the purpose of comparing classifiers this method can be suitable. Compared to a sliding window approach, Section 3.2.2, fewer patches are usually classified, and the ratio between positives and negatives in a test set can be adjusted, resulting in a faster evaluation process. Another advantage is that only the classifiers' performances are compared; there are no additional variables, such as grid construction, clustering and non-maximum suppression, that influence the performance. Even though the actual detection performance cannot be measured, comparison between similar classifiers is possible.

3.2.2 Sliding window approach

Instead of a set of labeled patches, complete image sequences are provided, together with annotations describing the type, position and size of objects. In order to train a classifier, positive patches have to be extracted and preferably resampled to a fixed size. Since no negative samples are explicitly given, these have to be drawn from regions not overlapping with positives. This boils down to the starting point of a per window approach, with a set of equally sized patches of labeled data. The sliding window method is used to detect objects in images and specify their positions. In order to find an object, a search grid is constructed from which image patches are extracted; the patches are classified and clustered, revealing the size and position of detected objects. In real-world applications, where the position and size of objects are of importance, approaches such as sliding window are necessary. There exist other methods, e.g. Uijlings et al. [2013], where the image structure is used to obtain a small set of candidate patches by returning the locations with high object probabilities. This reduces the computational time by avoiding an exhaustive search and allows the incorporation of more complex classifiers.
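As a sketch, a search grid over a geometric series of scales can be generated as below; the scale range and count anticipate the values used in Chapter 4, while the stride fraction is an assumed parameter.

```python
import numpy as np

def search_grid(img_h, img_w, h_min=40, h_max=420, n_scales=30,
                aspect=0.5, stride_frac=0.1):
    """Yield (y, x, h, w) windows over a geometric series of window
    heights, with a step size proportional to the window height."""
    for h in np.geomspace(h_min, h_max, n_scales).round().astype(int):
        w = int(round(h * aspect))
        step = max(1, int(h * stride_frac))
        for y in range(0, img_h - h + 1, step):
            for x in range(0, img_w - w + 1, step):
                yield y, x, h, w

print(sum(1 for _ in search_grid(480, 640)), "candidate windows")
```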

3.3 Cascade of boosted classifiers

The problem with traditional classifiers, such as SVMs or neural networks, is achieving real-time performance using a sliding window approach. The features are usually constructed using information extracted from the whole patch, and each patch within the grid has to be classified. This results in high complexity and often large dimensionality of the feature vectors. Another approach was introduced by Viola and Jones [2001], where the classifier was divided into a chain of boosted classifiers, called a cascade. The first steps use fast but simple features, e.g. VJ, to discard easy negatives. At later steps the number of samples has been reduced, which allows longer individual steps, or more complex features, to be evaluated in reasonable time, e.g. the boosting-adapted HOG discussed in Chapter 2. A more detailed description of the training procedure of a cascade will be presented in Chapter 5.

3.3.1 Boosting

Boosting circumvents the dimensionality problem by using many modified low dimensional feature candidates. The features could be constructed such that the size and position of the ROI of the feature can be varied. Each weak learner is found by evaluating several modified versions of a feature and choosing the best. By linearly combining several weak learners, a strong learner can be obtained.

Additive models

Friedman et al. [2000] presented a statistical view of boosting. Functions are iteratively fitted and linearly combined into a strong learner, see Eq. (3.1), where β_i is a scaling constant and γ_i a parametrization of some basis function b(x; γ), which can be anything from wavelets and complex exponentials to vectors in some N-dimensional feature space. The combination f_i(x) of β_i and b(x; γ_i) is commonly referred to as a weak learner. Determining β and γ can be done in a greedy, forward step-wise manner: given a set of samples X = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} with labels y ∈ {0, 1}, the parameters that minimize some loss function, e.g. the expected squared error in Eq. (3.2), are used as a weak learner. The process is iterated until the desired number of weak learners have been trained or until the error has decreased below a specified value.

F(x) = \sum_{i=1}^{M} f_i(x) = \sum_{i=1}^{M} \beta_i b(x; \gamma_i)    (3.1)

\{\beta_i, \gamma_i\} = \operatorname*{arg\,min}_{\beta, \gamma} E\left[y - F_{i-1}(x) - \beta b(x; \gamma)\right]^2    (3.2)

3.3.2 AdaBoost

In this section a derivation of AdaBoost, Freund and Schapire [1997], is presented in Example 3.1, followed by the algorithm in Alg. 3.1.

Example 3.1: Derivation of AdaBoost

A strong classifier F(x), which takes a sample x as input, can be obtained by creating a linear combination of a set of weak learners f(x), and can be used to perform classification as in Eq. (3.3). Here c_m is a positive weight, f_m(x) ∈ {−1, 1}, M is the number of weak learners, and the labels are taken as y_i ∈ {−1, 1} in this derivation. AdaBoost aims at minimizing the cost function J(F) in Eq. (3.4). The optimization is performed in a stage-wise fashion, one weak learner at a time; the strong learner at iteration m is given by F_m(x) = F_{m−1}(x) + c_m f_m(x). The optimization problem can be formulated as in Eq. (3.5), where x_i represents sample i. This can be expressed as Eq. (3.6), using weights w_i constructed from the already determined F_{m−1}(x).

\hat{y}(x) = \operatorname{sign}\left[F(x)\right] = \operatorname{sign}\left[\sum_{m=1}^{M} c_m f_m(x)\right]    (3.3)

J(F) = E\left[e^{-yF(x)}\right]    (3.4)

\operatorname*{arg\,min}_{f_m, c_m} J(F_m) = \operatorname*{arg\,min}_{f_m, c_m} \sum_{i=1}^{N} e^{-y_i\left(F_{m-1}(x_i) + c_m f_m(x_i)\right)}    (3.5)

\operatorname*{arg\,min}_{f_m, c_m} \sum_{i=1}^{N} w_i\, e^{-y_i c_m f_m(x_i)}, \quad \text{using } w_i = e^{-y_i F_{m-1}(x_i)}    (3.6)

Using the indicator function in Eq. (3.7), the expression from Eq. (3.6) can be separated into mis-classified and correctly classified samples as in Eq. (3.8). Since Eq. (3.8) can be written as Eq. (3.10), minimizing the cost function is equivalent to finding the weak learner that minimizes the weighted error in Eq. (3.9). Computing the derivative of Eq. (3.10) with respect to c_m and setting it to zero results in the optimal c_m shown in Eq. (3.11).

I_k = \begin{cases} 1 & \text{if } k = \text{true} \\ 0 & \text{otherwise} \end{cases}    (3.7)

\operatorname*{arg\,min}_{f_m, c_m} \left[\sum_{i=1}^{N} w_i\, I_{f_m(x_i) \neq y_i}\, e^{c_m} + \sum_{i=1}^{N} w_i\left(1 - I_{f_m(x_i) \neq y_i}\right) e^{-c_m}\right]    (3.8)

e_m = \frac{\sum_{i=1}^{N} w_i\, I_{f_m(x_i) \neq y_i}}{\sum_{i=1}^{N} w_i}    (3.9)

\operatorname*{arg\,min}_{f_m, c_m} \sum_{i=1}^{N} w_i\left[e_m\left(e^{c_m} - e^{-c_m}\right) + e^{-c_m}\right]    (3.10)

c_m = \frac{1}{2} \log\left(\frac{1 - e_m}{e_m}\right)    (3.11)

Algorithm 3.1: AdaBoost

Initialize weights w_i = 1/N, where N is the number of samples.
for each iteration m = 1, 2, ..., M do
    1. Train weak learner f_m with weight w_i assigned to sample i.
    2. Determine the weighted error rate e_m and calculate c_m as
       c_m = \frac{1}{2} \log\left(\frac{1 - e_m}{e_m}\right)
    3. Update each weight as w_i ← w_i e^{-y_i c_m f_m(x_i)}
    4. Renormalize the weights w_i with the sum of all weights.
end for
Output the classifier as: \hat{y}(x) = \operatorname{sign}\left[\sum_{m=1}^{M} c_m f_m(x)\right]
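A compact, runnable rendering of Algorithm 3.1, using decision stumps on single feature dimensions as weak learners; the stump form is an assumption chosen for illustration, whereas the thesis selects among boosting-adapted image feature candidates.

```python
import numpy as np

def train_adaboost(X, y, M=20):
    """X: (N, D) feature responses, y in {-1, +1}.
    Returns weak learners as (dim, thresh, polarity, c_m) tuples."""
    N, D = X.shape
    w = np.full(N, 1.0 / N)                      # uniform initial weights
    learners = []
    for _ in range(M):
        best = None
        for d in range(D):                       # 1. evaluate candidates
            for t in np.unique(X[:, d]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, d] - t) >= 0, 1, -1)
                    err = w[pred != y].sum()     # 2. weighted error e_m
                    if best is None or err < best[0]:
                        best = (err, d, t, pol, pred)
        err, d, t, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        c = 0.5 * np.log((1 - err) / err)        #    c_m from Eq. (3.11)
        w *= np.exp(-y * c * pred)               # 3. re-weight samples
        w /= w.sum()                             # 4. renormalize
        learners.append((d, t, pol, c))
    return learners

def strong_confidence(X, learners):
    """F(x) = sum_m c_m f_m(x); in practice thresholded, not sign()."""
    F = np.zeros(len(X))
    for d, t, pol, c in learners:
        F += c * np.where(pol * (X[:, d] - t) >= 0, 1, -1)
    return F

rng = np.random.default_rng(3)
X = rng.random((200, 5))
y = np.where(X[:, 0] + X[:, 2] > 1, 1, -1)
model = train_adaboost(X, y)
print((np.sign(strong_confidence(X, model)) == y).mean())
```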

Usually a threshold is used rather than the sign function, Eq. (3.12). The threshold t is determined by specifying a desired tp- or fp-rate, Eq. (3.13) and (3.14), on the validation set.

\hat{y}_i = \begin{cases} 1 & \text{if } F(x_i) \geq t \\ 0 & \text{otherwise} \end{cases}    (3.12)

R_{tp} = \frac{N_{(y=1,\, F>t)}}{N_{y=1}}    (3.13)

R_{fp} = \frac{N_{(y=0,\, F>t)}}{N_{y=0}}    (3.14)
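Determining the threshold t for a desired R_tp then amounts to taking a quantile of the positive validation confidences; a sketch with the kind of 97% operating point used later in Chapter 4:

```python
import numpy as np

def threshold_for_tp_rate(conf_pos, rate=0.97):
    """Return t such that approximately `rate` of the positive
    validation confidences satisfy F(x) >= t, cf. Eq. (3.13)."""
    return float(np.quantile(np.asarray(conf_pos), 1.0 - rate))

conf_pos = np.random.default_rng(4).normal(2.0, 1.0, 10_000)
t = threshold_for_tp_rate(conf_pos)
print(t, (conf_pos >= t).mean())   # ~97% of positives are kept
```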

Forward feature selection by boosting

The image descriptors presented in Chapter 2 can be used in the training of a boosted classifier. This is performed by generating a number of candidates from a set of suitable features, each candidate modified by changing its position, scale, and feature type. These candidates are then used while training the classifier. The training procedure can be summarized as in Figure 3.1: candidates are generated and weak learners are trained; the candidate that decreases J(F) the most is chosen as the next weak learner, followed by an update and normalization of the weights. The procedure is repeated until the desired number of weak learners have been trained.

Figure 3.1: Training of a boosted classifier (labeled data → generate candidates → train weak learners → F ← F + c_m f_m → strong learner, with re-weighting in the loop). Candidates are generated and their responses computed on the data set. The best candidate is chosen and incorporated into the strong learner. The problem is re-weighted and the procedure repeated until the desired number of weak learners have been trained.

3.3.3 Cascade using boosting

The idea behind a cascade is to link classifiers into a chain in order to obtain performance while reducing computational time. At each step, characteristic feature properties should be taken advantage of. The fast evaluation of VJ, described in Section 2.1, makes it a perfect candidate for the early steps: even though many false positives will pass through, the number of remaining samples will have been greatly reduced. At this point it is a good idea to introduce more complex boosting-adapted features, e.g. HOG from Section 2.2. The impact of the increased complexity is dampened by the reduction in the number of samples passed on from previous steps, while the larger discriminative power will boost the performance. There are however no restrictions on which combinations to use. Two types of features could be used in the same step, allowing the boosting process to choose from a larger set of features; if the two features are complementary, this could increase the performance significantly compared to using only one.

During training it is important that the data used at a specific step have been propagated through all previous classifiers; in this way each classifier focuses on information not captured by previous steps. An important factor to take into account is the choice of threshold described in Section 3.3.2. In a cascade of length N, thresholds determined by R_tp = 0.97 at each step would in theory result in R_tp,tot = 0.97^N for the complete cascade; for example, a six-step cascade would retain 0.97^6 ≈ 0.83 of the positives. The desired performance of the complete cascade therefore has to be taken into account when selecting the per-step thresholds.


4 Data acquisition

To train a classifier, training data has to be acquired. This chapter focuses on the principles used when acquiring positive and negative samples. An issue when training a cascade is that correctly classified negatives are discarded; additional classifiers should preferably be trained on a balanced data set, with more or less equal numbers of positive and negative samples. This requires the acquisition of additional negatives which have been misclassified as positives by previous steps in the cascade. In this way each new classifier focuses on the weaknesses of previous steps. At later steps, a large number of new negatives have to be drawn to fulfil this requirement, due to the increased classification performance. Two data sets have been used: the Caltech pedestrian data set, widely used in research, and data recorded by Autoliv. The data acquisition needs to be performed carefully, since it is crucial for the performance of a classifier; a large number of variables need to be taken into account and the evaluation conditions need to be thoroughly investigated. The acquisition procedure presented here is mainly designed to generate data suitable for training a classifier using a sliding window approach, although evaluation will also be performed on a per window set in the Autoliv case. The creation of per window sets follows the same procedure as described in this chapter; the difference between the two methods concerning evaluation will become apparent in Chapter 7.

4.1 Extraction

A sliding window approach aims at detecting objects of different sizes and positions. Therefore, training data has to contain samples extracted from different scales, to ensure the presence of pedestrians of varying sizes. The data sets are extracted using a number of parameters. The procedure is based on the one used by Autoliv and cannot be discussed in detail. The maximum and minimum heights h_min and h_max specify the allowed interval of a pedestrian's height in pixels. A border factor (F_b) is used to increase the size of a patch such that the whole silhouette of a pedestrian is visible. Afterwards, all patches are converted to a fixed size of N × M. To avoid distortion, patches should be extracted using an aspect ratio r = w/h = M/N. Intervals are created using h_min, h_max, and the number of scales in the grid. This allows us to compute a positive's corresponding scale s given its height in pixels. The patch is extracted by computing the center of the annotation bounding box (bb); the appropriate region is then extracted using the specific width and height corresponding to scale s.

Extracting negatives by randomly choosing their sizes should be avoided, as it could result in a classifier that discriminates between the two classes based on artefacts deriving from the resampling. The sliding window evaluation will be performed by classifying patches from a search grid, so it is desirable that both negatives and positives used in training are extracted from the same grid as is used in evaluation. Otherwise a similar problem as described above occurs: the classifier has seen a certain set of scales but is evaluated on another, where other types of resampling artefacts are present, which would lead to a decrease in performance. To avoid these scenarios, negatives were drawn from the evaluation search grid. Only patches which do not overlap with any pedestrian markings were used when randomly generating negatives. These were then converted to a size of N × M pixels using bilinear interpolation.
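A sketch of the positive-patch extraction described above: scale the annotation height by the border factor, map it to a grid scale, crop around the bb center, and resample bilinearly (here with OpenCV). The nearest-scale lookup stands in for the thesis' interval mapping, boundary handling is simplified, and the values in the usage lines are made up.

```python
import numpy as np
import cv2  # bilinear resampling

def extract_positive(frame, bb, heights, border=1.3, out_wh=(45, 90)):
    """bb: (x, y, w, h) annotation; heights: the grid's geometric
    series of patch heights; out_wh: fixed output (width, height)."""
    x, y, w, h = bb
    s = int(np.argmin(np.abs(heights - h * border)))    # nearest scale
    ph = int(round(heights[s]))
    pw = int(round(ph * 0.5))                           # aspect r = 0.5
    cx, cy = x + w / 2.0, y + h / 2.0                   # bb center
    y0, x0 = int(round(cy - ph / 2)), int(round(cx - pw / 2))
    patch = frame[max(y0, 0):y0 + ph, max(x0, 0):x0 + pw]
    return cv2.resize(patch, out_wh, interpolation=cv2.INTER_LINEAR)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
heights = np.geomspace(40, 420, 30)
print(extract_positive(frame, (300, 100, 40, 100), heights).shape)  # (90, 45, 3)
```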

4.1.1 Pre-processing

Most pre-processing techniques require expensive calculations, so any gain in classification performance needs to be weighed against the increased complexity. One approach is to increase the color contrast by performing histogram equalization. When dealing with color images, this is not as straightforward as for a one-channel grayscale image. The common procedure is to convert the image to an opponent color space such as Lab, or alternatively HSI or HSV, which are cylindrical color spaces, see Section 6.1. The common denominator among these is that the equalization can be performed on the L, I, or V channel without distorting the color information. Initial testing using HSI and HSV showed that the contrast was increased, but the image appeared colorless due to very low saturation. It is worth mentioning that at this point the data sets had already been extracted, and the pre-processing was performed at patch level. To reduce computational time in a target application, pre-processing would preferably be applied before candidate patches are extracted.

To enhance the colors, histogram equalization on separate RGB channels was investigated. This introduced severe artefacts, since each channel was mapped to create a uniformly distributed histogram. Another option is to map the values within each channel such that one percent of the pixels are saturated to zero and one percent to the maximum value. By computing cumulative histograms, the values I_min and I_max for these percentiles can be found. The mapping was performed using Eq. (4.1), where I(x) is the image; values below I_min or above I_max were set to zero or one, respectively.

This method introduced fewer artefacts than histogram equalization while increasing the colorfulness of the image. This technique is henceforth referred to as saturation normalisation (SN).

I_{out}(x) = \frac{I_{in}(x) - I_{min}}{I_{max} - I_{min}}    (4.1)
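A numpy sketch of SN, Eq. (4.1), applied per channel: the lowest and highest percentiles saturate to zero and one, and the values in between are mapped linearly.

```python
import numpy as np

def saturation_normalise(img, pct=1.0):
    """Per-channel contrast stretch of Eq. (4.1): clip the lowest and
    highest `pct` percent of values, map the rest to [0, 1]."""
    out = np.empty(img.shape, dtype=float)
    for c in range(img.shape[2]):
        lo, hi = np.percentile(img[..., c], [pct, 100.0 - pct])
        out[..., c] = np.clip((img[..., c] - lo) / max(hi - lo, 1e-12), 0, 1)
    return out

rng = np.random.default_rng(6)
img = rng.integers(30, 200, (90, 45, 3)).astype(float)
print(saturation_normalise(img).min(), saturation_normalise(img).max())
```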

4.2 Data sets

The two data sets which have been investigated are the Caltech data set and data recorded by Autoliv. The details of the two sets are presented in the following sections, along with the acquisition procedure. Sample frames and extracted patches are shown in Figure 4.1.

4.2.1 Caltech

The Caltech data set is the largest currently existing public pedestrian data set, containing 250 000 frames and approximately 2 000 unique individuals, Dollár et al. [2009], Dollár et al. [2012]. An evaluation script together with results from state-of-the-art classifiers is provided. In the evaluation process, the ROC (Receiver Operating Characteristic) is used. Since the set is video based, pedestrians are occluded in large portions of the frames in which they appear, and pedestrians are not always in an upright position. These properties increase the difficulty of the classification task. An occlusion flag is provided in the annotations, and such samples can be ignored in training and evaluation.

JPEG compression artefacts such as blockiness and color leakage are apparent in the images. The annotations are not perfect, and the pedestrians are not always centred within the bb. These conditions make it difficult to extract positives in a consistent way.

The Caltech set consists of 11 sets with around ten 60-second videos each. Sets 0-5 are reserved for training and sets 6-10 for evaluation. The training set was divided into four training sets (0-3) and two validation sets (4, 5). The annotations come with the labels {Person, Person?, People}. Person is given to objects containing a single, clearly marked pedestrian; Person? to annotations where the human eye cannot classify the object with certainty; and People to large groups of persons that would be tedious to separate. Pedestrians within the set are mainly concentrated in the height interval 20-160 pixels, with a mean aspect ratio r = w/h = 0.41. The poor quality of small patches resulted in poor performance, so only pedestrians with a height of 50-300 pixels were used in training. In evaluation, pedestrians outside this height interval were also ignored, since the classifiers had no such training data. During acquisition, 30 unique scales and a border factor of F_b = 1.3 were used. A geometric series of fixed patch heights, associated with intervals spanning the range h_min to h_max, was used to extract samples. These were resized to a fixed size of 45 × 90 pixels using bilinear interpolation, resulting in an aspect ratio r = 0.5. The patch dimensions were chosen to facilitate the construction of 5 × 5 cells.

While collecting positive samples, only non-occluded annotations with the label Person were used, in order to keep the difficulty of the classification problem down. The height of a ground truth bb was compared with the fixed intervals to obtain the correct scale for extraction. The patch was extracted using the scale-specific patch size and the bb's center coordinates, and converted to the fixed size of 45 × 90 pixels using bilinear interpolation. Positive samples were extracted from all frames in the training set; unique pedestrians were extracted several times in different frames, with the motivation to obtain samples from a variety of distances and poses. The extracted initial training set contained 72 748 samples: 38 472 positives and 41 276 negatives. The validation set contained 30 294 samples, of which 7 491 were positives and 22 803 negatives. Negatives were chosen randomly from the grid in areas not overlapping with any type of annotation and resampled to a size of 45 × 90 pixels.

4.2.2 Autoliv

The video sequences used by Autoliv are stored using lossless compression, so the artefacts visible in Caltech are not a problem, even at smaller scales. The annotations of pedestrians are more elaborate; e.g., there are fields for degree of occlusion and sharpness. In training, only non-occluded pedestrians were used. The bbs are marked with pixel precision, resulting in tighter annotations. In the sequences, not only pedestrians are marked; there are also vehicle and two-wheeler markings. The cyclist label is especially important due to the similarity to pedestrians, and this information can be used to ignore such objects in evaluation.

The data acquisition was performed in a similar way as for the Caltech set, with some minor parameter changes. A height interval of h_min = 40 and h_max = 420, 30 scales, and a fixed patch size of 45 × 90 pixels were used. The more precise markings allow a larger border factor, set to F_b = 1.4, since the labelling noise is smaller compared to the Caltech set.

The conditions for pedestrians to be extracted and included in a data set were: no occlusion, no strange pose, and not unsharp. The extracted training data had 66 500 samples: 34 000 positives and 32 500 negatives. The validation set contained 32 000 samples, divided into 16 000 positives and an equal number of negatives. The negative ground truth was constructed by prohibiting overlap with annotations indicating pedestrians or cyclists. The extracted patches formed the initial training and validation sets. During training of the cascade, additional negatives were drawn, see Chapter 5. No new positives were included beyond this point, and at each step 3% of the positives were discarded when determining thresholds from a tp level of 97%.


Figure 4.1: A variety of sample images from the two data sets: (a), (b) Autoliv sample images; (c), (d) Caltech sample images. (e) Sample patches from the 6th step of the Autoliv training set: furthest left, SN in RGB space of the mean pedestrian, followed by pedestrian samples; positive samples on the left and negative samples on the right. (f) Sample patches from the 6th step of the Caltech training set: furthest left, the mean pedestrian in RGB, followed by pedestrian samples; positive samples on the left and negative samples on the right.


5 Classifier training

At the heart of machine learning and classification lies the training process. The following chapter presents the procedure used when training a cascade of classifiers. The chapter is divided into two sections. The first describes how the training of a base cascade is performed; it is created to mimic the existing classifiers used by Autoliv. The second contains the method used when incorporating color features and promising feature combinations.

5.1 Bootstrapping

Bootstrapping is a method used to increase the performance of a classifier, see e.g. Dalal and Triggs [2005]. It can be performed in various ways, but a common denominator is that one mines for hard negatives to force a classifier to become better, hence the name bootstrapping. In this work, we trained a simple classifier using fewer feature evaluations. Hard negatives were extracted by finding the patches which passed through both the cascade and the newly trained classifier. These patches were added to the data set and increased the difficulty of the classification task. The goal of this method was to tweak the final classifier to correctly classify these samples by increasing their frequency.
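A minimal sketch of the mining step is given below; the predict interfaces of the cascade and the bootstrap classifier are hypothetical stand-ins.

```python
def mine_hard_negatives(candidate_patches, cascade, bootstrap_clf):
    """Keep only negatives misclassified by both the existing cascade
    and the newly trained bootstrap classifier (hypothetical interfaces)."""
    hard = []
    for patch in candidate_patches:
        # A candidate is a hard negative if it is (wrongly) accepted
        # by the cascade as well as by the bootstrap classifier.
        if cascade.predict(patch) == 1 and bootstrap_clf.predict(patch) == 1:
            hard.append(patch)
    return hard

# The mined patches are then appended to the negative training set,
# increasing the frequency of difficult samples in the next round.
```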

5.2 Pre-cascade

The proposed features are to be incorporated at the last stage of a cascade. In order to create a realistic and fair environment for the evaluation, a pre-cascade was trained according to Alg. 5.1, using a similar approach as the one used by Autoliv. All additional classifiers presented in this work were trained as the 6th and final step in a cascade, with the pre-cascade providing the previous steps. The pre-cascade is thus a common denominator for all classifiers trained on the same data set. The details of which features were used are not discussed due to confidentiality. All classifiers were trained using data propagated through the pre-cascade and were appended at its end to form a complete cascade. The intention is to evaluate the impact of including color features at the last step of a cascade.

Training a cascade of classifiers is more complex than training a single step. In order to keep the ratio of positives and negatives constant throughout the training of the individual steps, new data has to be acquired, see Alg. 5.1. To guarantee enough samples, minimum ratios were used and the extraction was repeated until these were fulfilled. The number of negatives needed was computed using Eq. (5.1) and (5.2). At later stages the number of negatives needed to reach the desired ratio became too large to store, so a maximum limit of negatives was introduced, at which the extraction was aborted. Sequences were chosen randomly to make sure data from all sets were used. During extraction a skip parameter was used to only extract every 30th frame. The start frame was chosen randomly, ensuring that all frames were eventually used. Negatives were drawn using the same parameters and search grid as in the extraction of the base data set in Chapter 4; the difference was that positives were not extracted.

Algorithm 5.1 Training of a cascade of boosted classifiers

for i = 1, 2, ..., K classifiers do
    if Bootstrap then
        Train bootstrap classifier and add to cascade at step i
        while N_{y=0}/N_{y=1} < r_boot do
            Draw new negatives according to Eq. (5.1) and (5.2), r_boot = 0.5
            Propagate new negatives through cascade + bootstrap classifier
        end while
    end if
    Train main classifier for step i
    Propagate data through last step
    while N_{y=0}/N_{y=1} < r_prop do
        Draw new negatives according to Eq. (5.1) and (5.2), r_prop = 0.4
        Propagate new negatives through cascade
    end while
end for
Output propagated data set and pre-cascade.

$$N_{des} = 2(r + \Delta) \cdot N_{y=1} - N_{y=0} \qquad (5.1)$$

$$N_{draw} = \frac{N_{des}}{\prod_{i=1}^{K} R_{fp_i}} \qquad (5.2)$$



The number of desired new negatives was obtained by multiplying the number of positives with the specified ratio, which depends on the current training phase, and then subtracting the number of negatives already in the set, Eq. (5.1). An overshooting parameter Δ ≈ 0.1 was added to avoid slow convergence. The number of negatives theoretically misclassified by the cascade can be computed using Eq. (5.2), where R_{fp_i} is the false positive rate on the validation set for the classifier at step i, given a specified tp rate R_tp. Dividing the desired number of negatives by the product of the fp rates of the K classifiers in the cascade gives an approximation of the number of negatives that have to be extracted to obtain the desired ratio.
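As a small numeric illustration of Eq. (5.1) and (5.2), assuming the per-step fp rates are available as a list; the function name and the example numbers are ours:

```python
def negatives_to_draw(n_pos, n_neg, fp_rates, r, delta=0.1):
    """Eq. (5.1): desired new negatives; Eq. (5.2): negatives that must be
    drawn, given the per-step false positive rates of the cascade."""
    n_des = 2 * (r + delta) * n_pos - n_neg          # Eq. (5.1)
    prod_fp = 1.0
    for r_fp in fp_rates:                            # product over the K steps
        prod_fp *= r_fp
    n_draw = n_des / prod_fp                         # Eq. (5.2)
    return n_des, n_draw

# Hypothetical example: 34 000 positives, 20 000 negatives left,
# five cascade steps each with an fp rate of 0.5:
print(negatives_to_draw(34_000, 20_000, [0.5] * 5, r=0.4))
# -> (14000.0, 448000.0): few survivors per drawn patch means many draws.
```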

5.3 Image features

After obtaining the base cascade, the new color features were introduced. After the last step of the pre-cascade had been trained, negatives were propagated to reach a ratio of r = 0.5 within the training and validation sets. Bootstrapping was not performed, to ensure that all additional classifiers were trained on identical sets. Another motivation was that training could then be performed overnight.


6 Color features

Previous chapters have touched upon the vital parts needed in the process of training a boosted classifier. We have discussed the concept of image descriptors and how they can be used as features, followed by the art of data extraction with the methods and parameters influencing it. The next step is to introduce color features, the main focus of this work. The proposed feature is an adapted version of CSS, presented by Walk et al. [2010]. The authors showed that color can be a useful cue when used properly, especially in combination with gradient based features such as HOG, described in Section 2.2. A large part of the evaluation consists of finding appropriate color spaces to use. Various representations possess different amounts of invariance to changes in scene illumination, and choosing the proper candidate could increase an eventual gain in performance even further.

6.1 Color spaces

The performance of the feature will depend on the color space used; e.g. invariance to changes in light intensity is desirable to avoid the color constancy problem. The invariance properties of different color spaces were investigated by Van de Sande et al. [2010] and were used to find promising candidates. In the following sections it is assumed that the color channels are normalised such that R, G, B ∈ [0, 1].

6.1.1 rg

The RGB color space does not possess any invariance properties; however, it is trivial to conclude that the closely related rg space is invariant to scaling of the light intensity, see Eq. (6.1). The conversion from RGB to rg normalizes each channel with the intensity sum, and since r + g + b = 1, the b channel is redundant.


$$\begin{bmatrix} r \\ g \end{bmatrix} = \frac{1}{R+G+B} \begin{bmatrix} R \\ G \end{bmatrix} = \frac{1}{a(R+G+B)} \begin{bmatrix} aR \\ aG \end{bmatrix} \qquad (6.1)$$
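A minimal NumPy sketch of the conversion in Eq. (6.1); the small epsilon guarding against division by zero for black pixels is our addition:

```python
import numpy as np

def rgb_to_rg(img):
    """Convert an RGB image with channels in [0, 1] to the rg space.
    Only r and g are returned, since r + g + b = 1 makes b redundant."""
    eps = 1e-8                                  # guard against black pixels
    intensity_sum = img.sum(axis=2) + eps       # R + G + B per pixel
    r = img[:, :, 0] / intensity_sum
    g = img[:, :, 1] / intensity_sum
    return np.stack([r, g], axis=2)
```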

6.1.2 Opponent colors

The opponent color space, defined in Eq. (6.2), encodes the color information in the first two channels and the intensity in the third. Opponent colors are popular in image compression since the color channels can be heavily compressed with only a small loss in image quality. O_1 and O_2 are invariant to a shift in light intensity, Van de Sande et al. [2010], while O_3 does not possess any invariance properties.

$$\begin{bmatrix} O_1 \\ O_2 \\ O_3 \end{bmatrix} = \begin{bmatrix} (R-G)/\sqrt{2} \\ (R+G-2B)/\sqrt{6} \\ (R+G+B)/\sqrt{3} \end{bmatrix} \qquad (6.2)$$

In the standard model, the channels will have different ranges. This would require channel specific bins to construct color histograms. An alternative is to translate and scale the color channels such that $\hat{O}_i \in [0, 1]$. The resulting transformations are displayed in Eq. (6.3), (6.4) and (6.5).

$$\hat{O}_1 = \frac{O_1 + 1/\sqrt{2}}{\sqrt{2}} \qquad (6.3)$$

$$\hat{O}_2 = \frac{O_2 + \sqrt{2/3}}{\sqrt{8/3}} \qquad (6.4)$$

$$\hat{O}_3 = \frac{R+G+B}{3} \qquad (6.5)$$
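The opponent conversion and rescaling in Eq. (6.2)–(6.5) translate directly into a few lines of NumPy; the function name is ours:

```python
import numpy as np

def rgb_to_opponent_normalized(img):
    """Eq. (6.2) followed by the rescaling in Eq. (6.3)-(6.5),
    mapping each opponent channel into [0, 1] for R, G, B in [0, 1]."""
    R, G, B = img[:, :, 0], img[:, :, 1], img[:, :, 2]
    O1 = (R - G) / np.sqrt(2)
    O2 = (R + G - 2 * B) / np.sqrt(6)
    O1_hat = (O1 + 1 / np.sqrt(2)) / np.sqrt(2)          # Eq. (6.3)
    O2_hat = (O2 + np.sqrt(2 / 3)) / np.sqrt(8 / 3)      # Eq. (6.4)
    O3_hat = (R + G + B) / 3                             # Eq. (6.5)
    return np.stack([O1_hat, O2_hat, O3_hat], axis=2)
```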

The transformation shrinks the interval: O_1 and O_2 are mainly concentrated around zero in natural images, and large positive or negative values of these color channels are rare. There exists a risk that only a few histogram bins will be used, decreasing the discriminating power drastically. To solve the problem, the number of bins could be increased, or pre-processing could be applied to $\hat{O}_1$ and $\hat{O}_2$ to utilize all bins. In contrast to the pre-processing discussed earlier, this operation would be applied after the color space conversion. From here on, the normalized versions $\hat{O}_i$ will be used, but with the notation O_i.

6.1.3 HSI

HSI (Hue Saturation Intensity) is a cylindrical color space with a variety of definitions and relatives, such as HSV and HSL. HSI is more intuitive since I is the mean intensity of the color channels, compared to V = max(R, G, B) and L = 0.5 max(R, G, B) + 0.5 min(R, G, B). The definitions of hue and saturation also vary between these related color spaces. HSI is computed according to Eq. (6.6), (6.7) and (6.8).

$$H = \arctan\left(\frac{O_1}{O_2}\right) = \arctan\left(\frac{\sqrt{3}(R-G)}{R+G-2B}\right) \qquad (6.6)$$

$$S = 1 - \frac{\min(R, G, B)}{I} \qquad (6.7)$$

$$I = \frac{R+G+B}{3} \qquad (6.8)$$

A well known problem is the instability of hue at low saturation. Van de Weijer and Schmid [2006] showed that the certainty of the hue channel is inversely proportional to the saturation value. They achieved a more robust representation by weighting the hue histogram with the saturation, followed by a normalization such that the histogram sums to one. Ignoring the normalization could lead to very small values in reference cells and consequently large intersection scores for small similarities. The hue histogram is scale- and shift invariant to changes in light intensity, while the other two channels possess no invariance, Van de Sande et al. [2010].
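A sketch of the saturation-weighted hue histogram, assuming hue has been rescaled to [0, 1] and using an illustrative bin count:

```python
import numpy as np

def weighted_hue_histogram(hue, sat, n_bins=3):
    """Hue histogram weighted by saturation, then normalized to sum to one,
    in the spirit of Van de Weijer and Schmid [2006]."""
    # Low-saturation pixels contribute little, suppressing unstable hue values.
    hist, _ = np.histogram(hue, bins=n_bins, range=(0.0, 1.0), weights=sat)
    total = hist.sum()
    return hist / total if total > 0 else hist  # bins sum to 1
```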

6.1.4 Color Names

Color Names (CN) is a linguistic representation of colors, e.g. blue, yellow and black, using probabilities. The method was introduced by Van de Weijer et al. [2009], who computed the probability distributions of eleven color names using Google images. Khan et al. [2012a] used CN in combination with HOG features for object detection and achieved good results. The representation possesses some degree of invariance to light changes, although this is hard to derive mathematically. CN also possesses the ability to encode achromatic colors such as white, gray and black. Given a pixel's RGB value, the probabilities for the individual color names can be obtained from a lookup table. The procedure to extract the distribution of color names within a cell is explained in Alg. 6.1.

The probabilities of a certain color name n_i were summed for all pixels within a cell and normalized with the number of pixels in that cell. We repeated the procedure for all color names n_i, i = 1, ..., 11. The result was a probability distribution for these color names within the region. The normalization with the number of pixels ensured that $\sum_i p_R(n_i \mid f(\mathbf{x})) = 1$, where $f(\mathbf{x})$ denotes the RGB pixel values and $\mathbf{x}$ the pixel coordinates. In order to perform the computations quickly, we created an integral image for each color name.



Algorithm 6.1 Procedure to compute CN within a region

Given an RGB value f(x) from an image f at pixel coordinates x and a color name lookup table p(n_i | f(x)), i = 1, ..., 11
for all cells k, l do
    1. Compute the color name probabilities in cell R containing N pixels:
       p_R(n_i | f(x)) = (1/N) * sum_{x in R} p(n_i | f(x)),  for all i
    2. Form the histogram:
       H(k, l, b) = [p_R(n_1 | f(x)), ..., p_R(n_11 | f(x))]
end for
Create a histogram integral image II(k, l, b) from H(k, l, b)
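A sketch of the per-cell computation in Alg. 6.1; the layout of the lookup table (a quantized RGB cube with eleven probabilities per entry) is an assumption on our part, since the actual table format is not specified here:

```python
import numpy as np

def cell_color_names(cell_rgb, cn_lookup, levels=32):
    """Mean color name distribution over a cell.
    cn_lookup: array of shape (levels, levels, levels, 11) holding
    p(n_i | RGB) for the eleven color names (assumed format)."""
    q = np.clip((cell_rgb * levels).astype(int), 0, levels - 1)  # quantize RGB
    probs = cn_lookup[q[:, :, 0], q[:, :, 1], q[:, :, 2]]        # (h, w, 11)
    n_pixels = probs.shape[0] * probs.shape[1]
    # Sum over the cell and normalize, so the 11 entries sum to one
    return probs.reshape(n_pixels, 11).sum(axis=0) / n_pixels
```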

6.2 Color Self Similarity

The method proposed by Walk et al. [2010] was to convert the image into HSV. The patches were then divided into M 8 × 8 cells, on which color histograms were computed using three bins. To measure the similarity between cells, a number of methods were proposed: the standard L1 or L2 norm, histogram intersection, and the popular χ² distance, a common measure for comparing histograms in computer vision. The best measure was shown to be histogram intersection, described in Eq. (6.9), which was used throughout this work. To compute histogram intersection, cell k was used as reference and intersected with the M cells created in the patch. The intersection score is φ_k(j) ∈ [0, 1], where φ_k(j) = 0 indicates no similarity at all and φ_k(j) = 1 a large similarity.

$$\phi_k(j) = \frac{\sum_{i=1}^{N_{bins}} \min\left(\mathrm{ref}_k(i), \mathrm{target}_j(i)\right)}{\sum_{i=1}^{N_{bins}} \mathrm{ref}_k(i)} \qquad (6.9)$$

$$\Phi_{css} = \left[\phi_1^T, \phi_2^T, ..., \phi_{M-1}^T\right]^T \qquad (6.10)$$

Here ref_k is the color histogram of the reference cell and target_j is the j-th target cell, with j ∈ {1, 2, ..., M}, j ≠ k, and φ_k ∈ R^{M−1}, since the intersection of a cell with itself is discarded. The procedure is repeated using all cells as reference, resulting in M intersection vectors which are concatenated and L2 normalized to form the complete feature vector in Eq. (6.10). CSS uses all information in a patch and, as a result, Φ ∈ R^{M(M−1)/2}; the dimensionality follows from excluding self-intersections and from not intersecting any pair of histograms twice.
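The full CSS computation over precomputed cell histograms can be sketched as follows; the loop mirrors Eq. (6.9) and (6.10), computing each pair once, which is exact when the histograms are normalized so that the intersection is symmetric. The function name and input layout are ours:

```python
import numpy as np

def css_feature(cell_histograms):
    """cell_histograms: (M, n_bins) color histograms, one per cell.
    Returns the L2-normalized vector of pairwise intersection scores."""
    M = cell_histograms.shape[0]
    scores = []
    for k in range(M):
        ref = cell_histograms[k]
        for j in range(k + 1, M):            # skip self and duplicate pairs
            target = cell_histograms[j]
            # Histogram intersection, Eq. (6.9)
            scores.append(np.minimum(ref, target).sum() / ref.sum())
    phi = np.array(scores)                   # length M(M-1)/2
    norm = np.linalg.norm(phi)
    return phi / norm if norm > 0 else phi
```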

The absolute color of an object is not represented, since it causes problems under varying illumination. Instead, the similarity between regions within a patch is encoded.
