
Measuring Student Attention with Face Detection: Viola-Jones versus Multi-Block Local Binary Pattern using OpenCV


DEGREE PROJECT IN COMPUTER SCIENCE, DD143X, FIRST LEVEL

STOCKHOLM, SWEDEN 2015

Measuring Student Attention with Face Detection:

VIOLA-JONES VERSUS MULTI-BLOCK LOCAL BINARY PATTERN USING OPENCV

ANNA LINDELÖF & JOSEFINE ERIKSSON

(ANNALI9@KTH.SE & JOSERIKS@KTH.SE)


Supervisor: Richard Glassey

Examiner: Örjan Ekeberg


Abstract

The purpose of this study is to discuss and attempt to approach an answer to the question of how face detection could be used to measure attention in a lecture hall. The conclusion might help further studies in using face detection to provide teachers with tools which can be used to improve learning during lectures.

Face detection in real-time applications became possible in 2001, when Viola and Jones presented a new method several times faster than any previous attempt. In 2007, Liao et al. presented a method using multi-block local binary patterns (MB-LBP) for the purpose of overcoming the simplicity and limitations of the Viola-Jones method.

Computer vision libraries such as OpenCV make it easy to implement such algorithms. OpenCV currently supports both the Viola-Jones algorithm and the MB-LBP algorithm.

This study compared these two face detection methods to see how they perform in terms of sensitivity and precision, and attempted to identify limitations of both methods when used to detect attention in a simulated lecture environment. The study was conducted using boosted algorithms and functionality provided by OpenCV. The input data consisted of a recorded simulated lecture with 6 subjects performing different poses, labeled either attention or no attention, during certain periods of time, each pose recognized from a previously recorded actual lecture as a commonly occurring pose.

The most significant difference in performance identified in the study was that the MB-LBP method performed face detection in an image three times faster than Viola-Jones, which confirmed previously reported results. Both methods generated high sensitivity values for all poses, but low precision values for two of the poses. The ability of both methods to detect downward-tilted faces contributed to a high number of false positives being returned when subjects performed the two poses of taking notes or performing activities labeled as no attention. Due to the low precision values caused by this, neither method was considered to measure attention effectively. It is therefore suggested to instead train an MB-LBP-based method for the specific task of measuring attention in a lecture hall, by training it to reject downward-tilted faces and to accept only instances conforming to the chosen definition of attention.


Sammanfattning

The purpose of this study is to discuss and attempt to approach an answer to the question of how face detection can be used to measure attention in a lecture hall. The conclusion of this study may help future studies use face detection to provide lecturers with tools that can be used to improve learning during lectures.

Face detection in real-time applications became possible in 2001, when Viola and Jones presented a new method several times faster than previous attempts. In 2007, Liao et al. presented a method using multi-block local binary patterns (MB-LBP) with the aim of overcoming the simplicity and limitations of the Viola-Jones method. Computer vision libraries such as OpenCV make it easier to implement such algorithms. OpenCV currently supports both the Viola-Jones algorithm and the MB-LBP algorithm.

This study compared these two face detection methods to see how they perform in terms of sensitivity and precision, and attempted to identify limitations of both methods when used to detect attention in a simulated lecture. The study was carried out using boosted algorithms and functionality provided by OpenCV. The input data consisted of a recording of a simulated lecture with 6 test subjects performing different poses, defined as either attention or no attention, during certain periods of time. Each pose was identified from a previously recorded actual lecture as a commonly occurring pose.

The most important difference in performance identified in this study was that the MB-LBP method performed face detection in an image three times faster than Viola-Jones, which confirms previously reported results. Both methods generated high sensitivity values for all poses, but low precision values for two of the poses. Both methods' ability to detect downward-tilted faces contributed to a high number of false positives being returned when the test subjects performed the two poses in which subjects took notes or performed activities defined as no attention. Due to the low precision values this caused, neither method was considered to measure attention effectively.

It is therefore suggested to instead train an MB-LBP-based method for the specific purpose of measuring attention in a lecture hall, by training the method to reject downward-tilted faces and to accept only instances conforming to the chosen definition of attention.


Contents

1 Introduction
  1.1 Objective
  1.2 Motivation
  1.3 Structure
  1.4 Terminology
    1.4.1 False Positive
    1.4.2 Object Class Detection
    1.4.3 Feature Based Detection
    1.4.4 Hierarchical Knowledge-Based Method
    1.4.5 Face Localization
    1.4.6 Texture Analysis
2 Background
  2.1 Previous Work on Measuring Attention
  2.2 History of Face Detection
  2.3 Face Detection Algorithms
    2.3.1 Viola-Jones
    2.3.2 Multi-Block Binary Patterns with Boosting
      2.3.2.1 MB-LBP Feature Extraction
  2.4 Relevant Data Extraction Methods
    2.4.1 Integral Image
    2.4.2 Haar-Like Features
    2.4.3 Local Binary Patterns
      2.4.3.1 LBP Feature Extraction
      2.4.3.2 Performance
  2.5 Boosting Algorithms
    2.5.1 AdaBoost
  2.6 Attentional Cascade Structure
3 Method
  3.1 Testing Environment
    3.1.1 Hardware
    3.1.2 Software
  3.2 Acquisition of Data
    3.2.1 Simulated Lecture
      3.2.1.1 Stages Based on Commonly Used Poses
    3.2.2 Data Obtained Through Manual Face Detection
  3.3 Detection of Faces
  3.4 Measuring Time
  3.5 Sensitivity and Precision
  3.6 Chosen Definition of Attention
4 Result
  4.1 Tables
  4.2 Graphs
  4.3 Output Images
  4.4 Observations
5 Discussion
6 Conclusion
7 Appendix
8 References


1 Introduction

Face detection belongs to the more general computer vision domain of object class detection. Object class detection utilizes features and sets of rules called classifiers to detect a predefined object in digital images or video recordings. Face detection aims to specifically detect faces and is a subject which has gained increased research interest because it is an essential step in applications using features such as facial recognition and identification. Today, face detection is present in many applications used every day, such as most modern camera phones, as well as handheld cameras, photo organization programs [31], surveillance [32], and human-computer interaction [33].

Face detection can be defined through the following scenario: given an arbitrary image, the purpose of face detection is to determine whether there are any faces present in the image, and if so, return the region containing each face and its location [8].

There is currently interest in developing systems which can measure the attention of students in a classroom or detect attention in social settings, and a common approach is to detect attention using head orientation and gaze tracking [24][26]. Tracking a person's gaze might be one of the most precise methods for determining the focus of attention. However, highly accurate gaze tracking devices are generally expensive. Therefore, the question of whether face detection, which has been widely implemented in real-time applications on handheld devices, could be used to measure attention is worth investigating.

1.1 Objective

The initial question which inspired this study is formulated as follows: Is it possible, using the face detection techniques of today, to measure the attention of students during a lecture in order to help the teacher improve the lectures?

The original question was further narrowed down to which of the two frontal face detection methods, the Viola-Jones and the local binary patterns (LBP) methods provided by the OpenCV library, would perform better in detecting the attention of students in a lecture hall with good lighting conditions. Further, the objective is to identify some of the limitations of each method when used to measure students' attention in a lecture hall using open source OpenCV. More specifically, using a controlled simulated lecture, the study investigates how the performance of the methods, in terms of sensitivity and precision, varies for a selected group of poses commonly performed by students during a lecture.

The comparison was based on their ability to classify attention and no attention according to the study's chosen definition of attention. It was theorized that limitations of the Viola-Jones and the MB-LBP method could possibly be shared by other methods which are trained specifically to detect faces.

1.2 Motivation

The motivation behind this study was to provide teachers with feedback on how attentive their students are throughout a lecture, in order for them to improve their ability to teach effectively. This study was not intended to solve the problem, but to give some insight into the limitations of using face detection methods to measure the attention of students in a lecture hall.

The choice of using open source resources and only face detection for measuring attention was motivated by the will to start the development of an inexpensive method that is easily available. The expectation is that the results of this study will assist future studies in reaching a low-cost but satisfactorily effective solution for measuring attention in an educational environment.

1.3 Structure

The remainder of the contents is structured as follows. The second section contains the background information, which gives a short review of previous work and a brief summary of the history of face detection, with the main focus on the Viola-Jones and the MB-LBP methods. The section also includes explanations of relevant concepts of face detection methods. Both the Viola-Jones method and the local binary pattern method are explained in this section.

In the third section, the method used in the study is discussed. Both the software and the hardware that were used are specified here. The fourth section includes all the results in the form of tables, graphs and example output images. The results are discussed and analyzed in the fifth section. The conclusion follows in the sixth section.

In the appendix in section seven, the copyright information of one of the methods provided by OpenCV is included. All sources used and references made in this study are listed in the eighth section.

1.4 Terminology

Terminology is given here to explain specific terms that are used in this thesis.

1.4.1 False Positive

A false positive occurs when a positive result is incorrectly returned for a negative input. In this study, false positives include faces of subjects who are not paying attention according to the chosen definition of attention.


1.4.2 Object Class Detection

Object class detection is a part of computer vision that detects certain objects a detector has been defined to find, such as faces, cars or other objects, in images or recordings.

1.4.3 Feature Based Detection

Feature-based detection is the initial stage used when detecting something in an image; this stage searches only for certain features in the pixels.

1.4.4 Hierarchical knowledge-based method

Knowledge-based methods in face detection use knowledge about human faces to derive rules which are implemented to help detect faces. The term hierarchical refers to the method also using levels with different rules when processing an image, where the higher levels focus on general descriptions of a face and the lower levels focus on details of specific features.

1.4.5 Face Localization

The goal of face localization is to identify the position of a face in an image containing a single face.

1.4.6 Texture analysis

Texture analysis examines the textures present in an image. These textures are visual patterns which can be used to describe properties such as color, brightness, randomness and regularity perceived in an image.


2 Background

2.1 Previous Work on Measuring Attention

There has been previous research and development of systems using different approaches to measure or identify the focus of attention of a subject.

In 2002, the report [24] was published, describing a system tracking the focus of subjects in a meeting scenario by using head orientation and sound. To estimate the head orientation of a subject, neural networks trained to identify head pan and tilt were used. The report stated that an experiment had been conducted using a head-mounted gaze tracking system which confirmed the assumption that head orientation indicates the direction of the subject's focus of attention, since the result of the experiment suggested this was true 89% of the time. The accuracy of predicting the subjects' focus of attention for the system in this specific case was 73% using only head pose and 76% when also using sound.

In more recent years, a study [25] investigating what students focus their attention on during a lecture used eye-tracking glasses to determine the direction of attention. In 2013, a report [26] was published describing work in progress on a system which can evaluate attention in a classroom and provide the teacher with this information in real time or as a summarized report at the end of a lecture. At the time of writing, the potential approaches to identifying attention were quantifying body motion and gaze direction.

2.2 History of Face Detection

Research in face detection can be traced at least as far back as [7] in the 1970s, where the approach was a feature-based method operating on a set of images taken in a controlled environment. These images had strict restrictions resembling a passport photo: they had to contain a frontal face of a certain size, have an uncluttered background and have sufficient lighting. Applications using methods dependent on the input satisfying such criteria are, however, limited in their use.

The research interest remained relatively low until the 1990s [2], when implementation of face recognition methods became more practical, and advances in storage and compression of video spurred a new interest in developing prominent methods for face detection [3]. Research efforts up until the 2000s, when many different approaches were made to the face detection problem, such as edge detectors, support vector machines, neural networks and eigenfaces, are well recorded in the surveys [3] and [8]. In 1994, Yang and Huang tried to approach the problem of developing methods which can handle more unconstrained input images, which had until then been a mostly unaddressed issue [4]. They used a hierarchical knowledge-based method of three levels, utilizing mosaic images and discarding negatives at each level. The result was reported as promising compared to previous attempts at face localization, especially since the method seemed able to handle scalability of faces. However, at this stage face detection was still a relatively slow process.

In 1996, Ojala et al. presented local binary patterns (LBP) for use in texture analysis because of their computational simplicity and gray-scale invariance [9]. In 1998, Papageorgiou et al. presented a trainable framework for object detection [12] which would later inspire Viola and Jones in their much-noticed work on rapid object detection [1].

One of the most significant achievements has been a machine learning method [1] published by Viola and Jones in 2001, using an integral image, classifier training with AdaBoost and an attentional cascade structure, making real-time image-processing applications a reality [5]. Lienhart and Maydt later extended Viola and Jones's Haar-like feature set by adding rotated rectangular features and center-surround features to the original set, achieving a notable improvement in performance [6]. Li et al. also made an extension to handle multi-view faces by introducing even more Haar features and improving the boosting algorithm [11].

Since the publication of Viola and Jones [1], boosting-based face detection schemes have become a standard in real-world applications [5]. The importance of the Viola-Jones method is clearly stated in the survey [5], which focuses on its impact and the subsequent advances in face detection using boosting and other learning-based algorithms.

In 2007, in the context of face detection, the concept of LBP was extended to multi-block LBP (MB-LBP) by Liao et al. [10]. It was reported that by using MB-LBP instead of Haar-like features (see section 2.4.2) in combination with a boosting-based learning algorithm, it was possible to achieve higher discrimination between images, and because the MB-LBP feature set is smaller than the Haar-like feature set, the training time is shorter [10].

Due to its computational simplicity, MB-LBP and variations based on LBP are still relevant today and used in implementations for both face detection and recognition, even targeting embedded systems [30][27][28].

2.3 Face Detection Algorithms

2.3.1 Viola-Jones

Viola-Jones is a method that can be used for face detection and which, according to the paper published by Paul Viola and Michael Jones in 2001 [1], was effective in detecting faces and achieved faster performance than previous face detection methods. According to the report, their method was around 600 times faster than the Schneiderman-Kanade detector and 15 times faster than the Rowley-Baluja-Kanade detector.

Viola-Jones detects a face in an image by searching for an area that is slightly darker than the area beneath it. These areas are enclosed in two rectangles: the slightly darker area, which should contain the eyes, forms one rectangle, and the rectangle beneath it covers the nose.

The difference between the two rectangles is then calculated by looking at the shade of each pixel and assigning it a number for that shade. Haar basis functions are used to decide which number each pixel is given [1].

If the difference between the two rectangles matches the expected result for a face, the image undergoes another calculation involving three vertical rectangles: one for each eye and one for the nose. If the differences between the eye rectangles and the nose rectangle match the expected result, a face has been detected [1]. This process is repeated several times on one image. The image is divided into sub-images, each of which is tested with the method described above in three stages. The stages are similar, except that the higher stages require more specific results to decide whether the sub-image really contains a face.

By testing each sub-image three times, the probability of getting false positives is much lower than when testing each sub-image only once. This results in detecting most of the faces while discarding objects that merely look like faces [1].

2.3.2 Multi-Block Binary Patterns with Boosting

Liao et al. proposed in [10] to use multi-block local binary patterns (MB-LBP) combined with the integral image and an adjusted boosting algorithm in order to overcome the simplicity and limits of the Haar-like features used in the Viola-Jones method. Because the MB-LBP feature set is significantly smaller than the Haar-like feature set, the training time and the time spent constructing classifiers are considerably lower than when using Haar-like features. Liao et al. also found that MB-LBP, compared to the traditional Haar-like features and the original LBP features, produced a higher detection rate (15% and 8% higher, respectively) for equal feature set sizes and showed potential for capturing more structural information from an image. For the boosting algorithm, Liao et al. used Gentle AdaBoost for the selection of features and the construction of the weak classifiers. A multi-branch regression tree was also utilized together with these classifiers [10].

2.3.2.1 MB-LBP Feature Extraction MB-LBP builds upon the concept of LBP (see section 2.4.3), but instead of using a neighbourhood of 3x3 pixels it uses 3x3 rectangular blocks, which has the advantage of capturing larger feature structures in the image. The integral image is used to calculate each block's average intensity, after which the MB-LBP operator converts the rectangular blocks into a binary code which describes a feature of the MB-LBP feature set [10].

Figure 2.1: Feature extraction of the MB-LBP representation. The integral image is used on a block of pixels of a chosen size (a.) to calculate its average gray value (b.). After all blocks' average intensities have been obtained (c.), the MB-LBP operator is applied, resulting in (d.) and a binary code (e.) which describes an MB-LBP feature (f.).

The formula for the MB-LBP operator is identical to the definition of the LBP operator except for the definition of g, which represents the average intensity of a block of pixels instead of the intensity of a single pixel. The formula can be expressed as follows:

MB-LBP = Σ_{k=1}^{8} s(g_k − g_c) · 2^{k−1}

where g_c is the average intensity of the center block and g_k, k = 1, 2, ..., 8, are the average intensities of the neighbouring blocks. The function s(x) is defined as:

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0

[10].
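As a concrete illustration, the operator can be sketched in Java, the language used in this study. This is a minimal sketch rather than OpenCV's implementation: block averages are computed with plain loops here, whereas a real implementation would obtain them from the integral image (section 2.4.1), and the class and method names are our own. With 1x1 blocks the operator reduces to ordinary LBP.

```java
/** Minimal sketch of the MB-LBP operator on a 3x3 grid of blocks. */
public class MbLbp {

    /** Average intensity of the w-by-h block whose top-left corner is (x, y). */
    public static double blockAverage(int[][] img, int x, int y, int w, int h) {
        long sum = 0;
        for (int i = y; i < y + h; i++)
            for (int j = x; j < x + w; j++)
                sum += img[i][j];
        return (double) sum / (w * h);
    }

    /**
     * MB-LBP code for the 3x3 block neighbourhood whose top-left corner is
     * (x, y), with each block of size w-by-h. The eight neighbouring block
     * averages are thresholded against the center block's average using
     * s(x) = 1 if x >= 0, else 0, and packed into an 8-bit code.
     */
    public static int mbLbp(int[][] img, int x, int y, int w, int h) {
        double center = blockAverage(img, x + w, y + h, w, h);
        // Neighbour block offsets, clockwise from the top-left block.
        int[][] offsets = {
            {0, 0}, {1, 0}, {2, 0}, {2, 1}, {2, 2}, {1, 2}, {0, 2}, {0, 1}
        };
        int code = 0;
        for (int k = 0; k < 8; k++) {
            double g = blockAverage(img, x + offsets[k][0] * w, y + offsets[k][1] * h, w, h);
            if (g - center >= 0) code |= 1 << k;  // s(g_k - g_c) * 2^(k-1) with k starting at 0
        }
        return code;
    }

    public static void main(String[] args) {
        // A 3x3 image with 1x1 blocks reduces MB-LBP to ordinary LBP.
        int[][] img = {
            {9, 9, 9},
            {1, 5, 9},
            {1, 1, 1}
        };
        System.out.println(mbLbp(img, 0, 0, 1, 1));
    }
}
```

The resulting 8-bit code is one MB-LBP feature; a boosted learner then selects the most discriminative block positions and sizes from the feature set.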

2.4 Relevant Data Extraction Methods

2.4.1 Integral Image

The integral image was introduced to computer vision by Viola and Jones in [1] as one of the three essential steps in their rapid object detection method. The concept is closely related to summed-area tables [1], which had been used in computer graphics as far back as 1984 by Crow [13]. However, when used in image processing it is now often referred to as the integral image.

The integral image is a tool used to extract information from an image, in this case calculating the sum of the pixels within a rectangular region of a specific size and location in the original image.

The integral image at a point (x, y) is the sum of the pixels above and to the left of that point in the image [1]. The integral image values for all positions in the original image can be calculated and stored in an array in a single pass over the image, which makes it possible to calculate the pixel sum of any rectangular sub-image with only four accesses to that array [1]. The summed pixel values of a rectangular feature can thus be calculated efficiently. In Viola and Jones, this is used to check for the presence of Haar-like features in an image [1].

Figure 2.2: Example of a calculation using the integral image.

For example, let A, B, C and D be four points in the integral image as in Figure 2.2, where each point's value is the sum of the pixels above and to its left, and let a1, a2, a3 and a4 be the areas summed at A, B, C and D respectively. To calculate the pixel sum of the rectangle whose bottom-right corner is at D and whose top-left corner is at A, the sum is:

Sum = D - B - C + A

A becomes positive because its area is subtracted twice from the value at D, once as part of B and once as part of C [1].
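To make the one-pass construction and the four-access lookup concrete, here is a minimal Java sketch. The names are our own, not OpenCV's, and a one-pixel zero border is added as a common implementation convenience so the lookups need no boundary checks.

```java
/** Sketch of the integral image and the four-access rectangle sum. */
public class IntegralImage {

    /**
     * One pass over the image: ii[y][x] holds the sum of all pixels above
     * and to the left of (x, y), inclusive. Row and column 0 are a zero
     * border that simplifies the lookups below.
     */
    public static long[][] build(int[][] img) {
        int h = img.length, w = img[0].length;
        long[][] ii = new long[h + 1][w + 1];
        for (int y = 1; y <= h; y++)
            for (int x = 1; x <= w; x++)
                ii[y][x] = img[y - 1][x - 1] + ii[y - 1][x] + ii[y][x - 1] - ii[y - 1][x - 1];
        return ii;
    }

    /**
     * Pixel sum of the w-by-h rectangle with top-left corner (x, y),
     * using only four array accesses: D - B - C + A, as in Figure 2.2.
     */
    public static long rectSum(long[][] ii, int x, int y, int w, int h) {
        return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x];
    }

    public static void main(String[] args) {
        int[][] img = {
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9}
        };
        long[][] ii = build(img);
        // Sum of the whole image, then of the bottom-right 2x2 rectangle.
        System.out.println(rectSum(ii, 0, 0, 3, 3)); // 45
        System.out.println(rectSum(ii, 1, 1, 2, 2)); // 5 + 6 + 8 + 9 = 28
    }
}
```

A two-rectangle Haar-like feature (section 2.4.2) then reduces to rectSum over the dark rectangle minus rectSum over the light one, regardless of the rectangles' size.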


2.4.2 Haar-Like Features

Haar-like features use Haar basis functions and the integral image to determine where there is a face in the image [1]. This is accomplished by checking all sub-images, dividing them into rectangles and calculating the difference between the rectangle sums. The positions of the rectangles are decided by searching for an area that is darker than the area beneath it; this area corresponds to the eyes and the nose of the face being detected. Different sets of rectangles can be used to calculate where a face is positioned and to confirm, by using all sets of rectangles, that it is truly a face that has been located [21][1].

Haar basis functions are a mathematical construction which, in simple terms, assigns results either the value 1 or 0 depending on the value of the result. Due to their complexity, Haar basis functions are not explained in this study; the interested reader is referred to [14][16].

The figures below, Figure 2.3 a-c, show examples of what the rectangles can look like, where the dark area represents the darker region in the photo. For example, the eye region is darker than the nose and cheek region, which means that the upper rectangle will be darker than the lower one, as shown in Figure 2.3.a.

Figure 2.3: Examples of what the rectangles of a Haar-like feature can look like.

2.4.3 Local Binary Patterns

Local binary patterns (LBP) is a feature which can be used for classification in object detection and, more importantly in this case, in face detection. The concept can be traced back to 1990, when Wang and He presented a texture classification method using texture spectrums [19][20]. The texture spectrum represents the occurrence distribution of texture units, where a texture unit consists of 8 elements with possible values of 0, 1 or 2 [19]. These texture units describe the local texture at a certain region in an image, and they are the predecessors of local binary patterns, since Ojala et al. later presented the LBP method, a simplified, binary version of Wang and He's method [9]. LBP can be used to identify edges, lines, spots and corners in an image [9].


Figure 2.4: The process of obtaining an LBP. In this example, a binary sequence of eight elements is obtained by first thresholding the outer values of the 3x3 pixel area and then, starting in the upper left corner and moving clockwise, extracting an LBP describing the texture of the local pixel area of the image.

2.4.3.1 LBP Feature Extraction A local binary pattern is calculated in a 3x3 pixel area, where the center pixel's value is used to threshold the eight neighbouring pixels. If a surrounding pixel's value is higher than or equal to the threshold value, it is represented by a 1, otherwise by a 0. Then, following the surrounding pixels clockwise or counterclockwise from a previously defined starting point, a sequence of 1s and 0s is obtained which forms the local binary pattern [19][9]. The occurrences of the 2^8 = 256 different patterns in the original image can then be presented in a histogram describing the different textures occurring in the image.

Figure 2.5: Example calculation of the summed texture unit value.

After a local binary pattern has been calculated, each bit can also be multiplied by the weight of its corresponding pixel position, after which the resulting values are added to create a texture unit value (see Figure 2.5) [9].
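The 3x3 thresholding described above can be sketched in Java as follows. This is an illustrative sketch, not OpenCV's implementation: the starting corner, direction and bit weights are conventions, and ties (a neighbour equal to the center) map to 1 here, matching s(x) = 1 for x ≥ 0.

```java
/** Sketch of the basic LBP operator on a 3x3 pixel neighbourhood. */
public class Lbp {

    /**
     * LBP code of the pixel at (x, y): the eight neighbours are thresholded
     * against the center pixel and read clockwise from the upper-left
     * corner into an 8-bit code whose bit weights are powers of two.
     */
    public static int lbp(int[][] img, int x, int y) {
        // Neighbour offsets, clockwise from the upper-left corner.
        int[][] offsets = {
            {-1, -1}, {0, -1}, {1, -1}, {1, 0}, {1, 1}, {0, 1}, {-1, 1}, {-1, 0}
        };
        int center = img[y][x];
        int code = 0;
        for (int k = 0; k < 8; k++) {
            if (img[y + offsets[k][1]][x + offsets[k][0]] >= center)
                code |= 1 << k;  // weight 2^k
        }
        return code;
    }

    /** Histogram of the 256 possible patterns over the interior of the image. */
    public static int[] histogram(int[][] img) {
        int[] hist = new int[256];
        for (int y = 1; y < img.length - 1; y++)
            for (int x = 1; x < img[0].length - 1; x++)
                hist[lbp(img, x, y)]++;
        return hist;
    }

    public static void main(String[] args) {
        int[][] img = {
            {6, 5, 2},
            {7, 6, 1},
            {9, 8, 7}
        };
        System.out.println(lbp(img, 1, 1));
    }
}
```

Note that adding a constant to every pixel leaves the code unchanged, which is the grayscale invariance property discussed in the next section.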

2.4.3.2 Performance One of the advantages of using LBP is its grayscale invariance property [9]. In other words, it is resistant to illumination variations in the image. It is also computationally very simple and therefore well suited for real-time processing [9].


2.5 Boosting Algorithms

2.5.1 AdaBoost

AdaBoost is an algorithm that learns by training itself and was created by Freund and Schapire [15]. It is one of the most used machine learning algorithms in boosting. It is used in the Viola-Jones method to make it learn to detect faces better [1].

AdaBoost uses many weak rules to calculate a result. The rules are weak because they are based on simple assumptions, but compared to random guessing these assumptions return much better results. An input passes through all the rules before it is accepted, and at the end the algorithm adjusts the weights of the rules to obtain better results [15].

The algorithm trains itself by running until the training error reaches zero, which means that the algorithm no longer gives any false results on the training data, although this is very hard to achieve. The problem is that if the algorithm runs too many times it might overfit, making the results worse again [15].
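The training loop can be illustrated with a toy Java sketch. This is not the variant used in the Viola-Jones or MB-LBP cascades (those boost feature-based classifiers and Liao et al. use Gentle AdaBoost); the weak rules here are one-dimensional threshold stumps, and all names are our own.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy AdaBoost with 1-D decision stumps as the weak rules. */
public class ToyAdaBoost {

    /** A weak rule: sign * (x >= threshold ? +1 : -1), weighted by alpha. */
    static class Stump {
        double threshold; int sign; double alpha;
        int predict(double x) { return (x >= threshold ? 1 : -1) * sign; }
    }

    final List<Stump> stumps = new ArrayList<>();

    /** Train for `rounds` rounds on points xs with labels ys in {-1, +1}. */
    public void train(double[] xs, int[] ys, int rounds) {
        int n = xs.length;
        double[] w = new double[n];
        for (int i = 0; i < n; i++) w[i] = 1.0 / n;       // uniform start
        for (int t = 0; t < rounds; t++) {
            // Pick the stump (threshold at a sample point, either sign)
            // with the lowest weighted error on the current weights.
            Stump best = null; double bestErr = Double.MAX_VALUE;
            for (double thr : xs) {
                for (int sign : new int[]{-1, 1}) {
                    Stump s = new Stump(); s.threshold = thr; s.sign = sign;
                    double err = 0;
                    for (int i = 0; i < n; i++)
                        if (s.predict(xs[i]) != ys[i]) err += w[i];
                    if (err < bestErr) { bestErr = err; best = s; }
                }
            }
            bestErr = Math.max(bestErr, 1e-10);           // avoid log(0)
            best.alpha = 0.5 * Math.log((1 - bestErr) / bestErr);
            stumps.add(best);
            // Re-weight: misclassified points get heavier, then normalize.
            double total = 0;
            for (int i = 0; i < n; i++) {
                w[i] *= Math.exp(-best.alpha * ys[i] * best.predict(xs[i]));
                total += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= total;
        }
    }

    /** Strong classifier: sign of the alpha-weighted vote of all stumps. */
    public int predict(double x) {
        double vote = 0;
        for (Stump s : stumps) vote += s.alpha * s.predict(x);
        return vote >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4, 5, 6};
        int[] ys = {-1, -1, -1, 1, 1, 1};
        ToyAdaBoost ab = new ToyAdaBoost();
        ab.train(xs, ys, 3);
        System.out.println(ab.predict(1.5) + " " + ab.predict(5.5));
    }
}
```

In the Viola-Jones setting, each "stump" is instead a threshold on one Haar-like feature value, and the re-weighting step is what forces later rounds to concentrate on the hard, previously misclassified sub-images.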

2.6 Attentional Cascade Structure

An attentional cascade structure is a cascade of boosted classifiers which can be compared to a degenerate decision tree [1]. The idea of the structure is to achieve a high detection rate while doing little computation, essentially decreasing the time needed to process each image [1]. It makes use of the fact that most sub-images are negatives. Therefore, it uses simple classifiers to reject a majority of the negative sub-images in the initial stages of the detection process, before reaching the more complex and time-consuming classifiers, which are tasked with ensuring high accuracy [1]. Only a minority of the sub-images reaches the final stages, where most of the complex computations occur, ensuring that computational time is not spent on images which can easily be rejected by simpler classifiers.

The classifiers at the beginning of the cascade use very simple features. In the case of the Viola-Jones algorithm, AdaBoost is used to first generate classifiers with as few as two Haar-like features [1]. The threshold is then manually lowered to ensure that each simple classifier has a false negative rate close to zero, detecting almost all positive images, but as a result it has a high false positive rate [1]. Since each classifier is trained only on sub-images which have passed all previous classifiers in the cascade, the task of classifying an image as face or non-face gets harder at each stage.


Figure 2.6: Attentional Cascade Structure.
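The early-rejection behaviour of the cascade can be sketched with a minimal Java illustration. Each stage here is just a boolean test on an abstract "window" of feature values; the stage contents and all names are invented for illustration, and a counter records how much work early rejection saves.

```java
import java.util.List;
import java.util.function.Predicate;

/** Minimal sketch of an attentional cascade: cheap stages reject first. */
public class Cascade {

    public final List<Predicate<double[]>> stages;
    public long evaluations = 0;   // how many stage evaluations were performed

    public Cascade(List<Predicate<double[]>> stages) { this.stages = stages; }

    /**
     * A window is accepted only if every stage accepts it. Most windows
     * are rejected by the first (cheapest) stages, so later, more
     * expensive stages run on only a small fraction of the input.
     */
    public boolean detect(double[] window) {
        for (Predicate<double[]> stage : stages) {
            evaluations++;
            if (!stage.test(window)) return false;  // early rejection
        }
        return true;
    }

    public static void main(String[] args) {
        // Invented stages: each checks one feature of the "window".
        Cascade c = new Cascade(List.of(
            w -> w[0] > 0.5,   // stage 1: very cheap, high false positive rate
            w -> w[1] > 0.5,   // stage 2
            w -> w[2] > 0.5    // stage 3: would be the expensive one
        ));
        int accepted = 0;
        for (int i = 0; i < 1000; i++) {
            double[] w = {(i % 10) / 10.0, (i % 7) / 7.0, (i % 3) / 3.0};
            if (c.detect(w)) accepted++;
        }
        System.out.println(accepted + " accepted, " + c.evaluations
                + " stage evaluations instead of " + 3 * 1000);
    }
}
```

In the real Viola-Jones cascade each stage is itself a boosted classifier over Haar-like (or MB-LBP) features, but the control flow is the same: a negative result at any stage ends the evaluation of that sub-window immediately.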


3 Method

The chosen programming language for this study was Java. It was chosen because it is a popular and widespread programming language with extensive libraries, which will hopefully make it easier and more convenient for others to recreate or build upon this study. The downside of using Java is the lack of control over resource usage, which is normal for high-level languages. A common choice would be a language such as C or C++, which allows more optimized resource usage when implementing systems that should run in real time on portable devices. However, this study is not intended to produce a high-speed system for measuring attention in a classroom; it intends to give insight into the possibilities and limits of the two chosen algorithms for the purpose of measuring attention. The time measurements in the study are for the purpose of comparing the speed of the two algorithms when implemented and run under the same conditions, and should not be seen as an attempt to minimize runtime. The Viola-Jones and the MB-LBP methods were chosen because both are well known and both are easily accessible through OpenCV. OpenCV also provides functionality to train both methods, which could be used for further studies.

3.1 Testing Environment

The testing environment used for this study is listed below to allow future studies to compare their results and this study's result with each other de- pending on the hardware and software used.

3.1.1 Hardware

The computer used for running the Haar-like and LBP method was run on a computer with a processor: Intel(L) Core(TM) i7 3517U CPU @ 1.90 GHz 2.40 GHz.

The camera that was used in the three actual lectures was a Panasonic DMC-TZ20 camera, from the developers Panasonic Lumix. It can make 1920x1080, HD, format recordings. The camera have a 8 GB memory card which can make 29 minutes and 59 seconds recordings, which in return gave MTS movie les which took 3,46 GB in memory on the computer for each lecture recorded.

The digital system camera that was used was a Canon EOS 500D. It recorded in 1920x1080 HD, using the AV setting and the autofocus function.

Recorded in format MOV.

(20)

3.1.2 Software

The softwares used for this study is listed below.

• Operating system: Windows 8.1 (64-bit based)

• Java Standard Edition, Version 7 Update 17.

• Eclipse Standard, Version 4.4

• OpenCV, Version 2.4.10.

• Matlab, R2011a

3.2 Acquisition of Data

In the initial stage of the study, three actual lectures were recorded, each 30 minutes long. From these recordings four poses commonly used by the students were identied. A recording of a simulated lecture was also made (see Section 3.2.1). Only one camera was used for the recording and it was positioned at the front of the room. A second recording of a simulated lecture was recorded and it's images were only used for input when measuring the average time to process one image (see Section 3.4).

By using the Java method FFmpegFrameGrabber, one frame for each second from the recording of the simulated lecture were extracted, obtaining a total of 714 images as test data. The subjects in the test data from the simulated recording were manually classied as paying attention or not paying attention frame by frame. The Viola-Jones and the MB-LBP method was then applied to the test data in order to detect faces (see Section 3.3).

The output data from both the Viola-Jones and the MB-LBP method were manually viewed and false positives and false negatives were recorded, categorized, and saved as les containing only integer values (see Section 3.2.2). The test data generated output data of the same size (714 images) for both methods. Each output image contained a drawn rectangle for each face the method had detected.

3.2.1 Simulated Lecture

The simulated lecture was intended to represent a simplied version of a

lecture which would test a method's ability to classify common poses and

generate an output graph showing clear transitions between stages for meth-

ods achieving high attention classication performance. It was conducted

by having a person impersonating a teacher in the front of room carry out

a lecture while occasionally directing the 6 subjects by telling them which

stage (see 3.2.1.1.) to change to and when. The subjects were purposely

seated so that the faces of all subjects were clearly visible in the recording

(21)

and no subject was directly seated in front of another subject.The purpose of the controlled transitions between stages were meant to simulate dierent levels and types of attention during a lecture. Each stage was repeated a total of three times and the time spent in each stage varied but was within the margins of 45 seconds to 2 minutes.

3.2.1.1 Stages Based on Commonly Used Poses The four dierent stages used in the simulated lecture are based on the four poses that were observed to be frequently used by students in the recordings of the actual lectures. These poses are listed below.

1. Sitting upright and looking towards the front of the lecture hall 2. Leaning the head on hands and looking towards the front of the lecture

hall

3. Taking notes

4. Looking away from the front of the lecture hall

The rst stage is the stage where complete attention is simulated. The subjects were told to pay complete attention and to sit upright and look towards the person impersonating the teacher.

During the second stage the subjects were still told to be paying attention but to also rest their head on one or both hands. It was decided to have this stage in the simulated lecture because it is a very common pose in a lecture that students rest their head against their hands when the students are beginning to get tired but are still paying attention.

The third stage is when the subjects simulate taking notes. The subjects were told to write down what the person at the front of the lecture hall wrote on the whiteboard. This stage was added since taking notes were identied as a common activity during the recorded actual lectures.

The fourth and last stage of the simulated lecture is simulating no at- tention paid by the subjects. The subjects were told to look wherever they chose except they were not allowed to look towards the front of the lecture hall. It was suggested that they could start texting on their phones or they could pretend to be sleeping by laying their head down on the table or lean back in their seat.

3.2.2 Data Obtained Through Manual Face Detection

To calculate the false negative and false positive, a manual face detection

of the output images were done by a human in order to calculate precision

and sensitivity. This was achieved by going through each output image

returned by each method and identifying false positives, which includes any

(22)

subwindow not containing a complete face of a subject paying attention, and false negatives, which are attentive subjects the method failed to detect.

The results of the manually obtained data were then saved to les containing integer values and indexed by the image's number in the sequence of images.

False positive of each method was also categorized as subimages con- taining backgrounds, half-faces, double-faces (see gure 4.4), and downward- tilted faces or faces where the subject where texting, taking notes, or looking directly down, appearing to be concentrating on something on the retractable bench in front of them.

3.3 Detection of Faces

The algorithms used for this study were the les haarcascade_frontalface_alt.xml as the Viola-Jones-based method and lbpcascade_frontalface.xml as the MB- LBP-based method which were both provided by OpenCV. For the face de- tection, the algorithm was rst loaded into the cascade classier. The input image was converted using the Highgui.imwrite method and stored using the Mat class. Using the MatOfRect class for storing the detections of faces, the detectMultiScale method of the cascade classier was used to detect faces in the Mat stored input image.

Using the Core.rectangle method, colored squares were drawn in the im- age where faces had been detected. Lastly, the output image was stored to a le using the Highgui.imwrite method.

3.4 Measuring Time

The average time each method required to process one image was estimated by using Java's System.nanoTime method. The method was called before and after the face detection and then the dierence between the values re- turned by the method was calculated.

To get a more reliable result and to decrease the inuence that Java's built-in garbage collector have on the results, this process was repeated for 1000 images from the simulated lecture. The 1000 calculated times in nanoseconds was saved to a le. The average time was then calculated and the result was converted to seconds.

The process of calculating the average time required to detect faces in

an image for each of the two algorithms by using 1000 images was repeated

for three trials. The rst and the second trial used all of the 714 images

from the recording of the simulated lecture and the rst 286 images from the

recording.of a second simulated lecture only used for this purpose. The third

trial used the rst 459 of the simulated lecture and all of the 541 images from

the second simulated.

(23)

3.5 Sensitivity and Precision

While it is common to use the false positive rate and the false negative rate to evaluate the classication performance by plotting a ROC (Receiver Operat- ing Characteristics) curve [23], for this study it was decided to use sensitivity and precision to measure performance. As pointed out by Powers in [23], for the task of doing extensive analysis of a binary classication algorithm, using only sensitivity and precision for evaluating performance would not be sucient to describe the behaviour of the nal method. Since the scope of the study is limited to trying to identify limits and areas of potential when evaluating the two chosen algorithms, sensitivity and precision is considered to be able to provide this level of insight in the study. Another incentive for choosing precision was given by Saito and Rehmsmeier in [22] describing the importance of measuring the precision in order to avoid the accuracy paradox and provide a better reecting performance when using imbalanced datasets where negative instances outnumber the positive instances.

Sensitivity and precision for measuring attention by each of the two meth- ods were calculated according to the following formulas:

where true positives are the number of instances correctly returned as positives and selection is the total number of instances returned as positive by the method (see Figure 3.1).

Sensitivity, also known as recall rate, was used to measure how many of the subjects paying attention could be correctly detected by the method.

Precision, also known as positive predictive value, was used to measure

how many of the detections returned by the method as paying attention were

actually paying attention according to the chosen denition of attention.

(24)

Figure 3.1: Visual representation of the relation between selection, positive and negative instances and true and false positives.

3.6 Chosen Denition of Attention

In this study, a subject paying attention is dened according to the two following rules:

1 The subject is paying attention if they are looking towards the front of the lecture hall.

2 The subject who has their eyes closed in one frame is paying attention if the subject is paying attention in the previous frame and the following frame according to the previous rule.

The rst rule implies that a subject do not have to look directly at the teacher in order to be paying attention. This is to allow for the teacher to move around, write information on boards if present and also use projectors and overheads. Students are considered to be paying attention when looking towards one of these educational tools in the front of the lecture hall and therefore not necessarily right at the teacher. Langton et al. describe in [29]

how the direction of a person's gaze is an important indicator, but that head

and body orientation has also been identied as indicators of attention. In

[24] a relatively high correlation between gaze direction and the direction the

head is facing was reported, therefore the rst rule is designed to reect this

observation.

(25)

The purpose of the second rule is to allow subjects to blink without aecting the measurement of attention as this occur frequently in a recording as noticed in [24].

The assumption that a student taking notes is not paying attention dur- ing the whole duration of that activity may not be the most eective de- nition considering the goal is to provide the teacher with relevant data on poor student attention. Therefore, the chosen denition of attention should be further rened and tested in order to be more eective, but for the scope of this study the chosen denition of attention was considered to be sucient to gain a preliminary insight.

Below follow some example images where the two rules are applied.

Figure 3.2: Top: Subjects circled in green are labeled as paying attention

according to the rst rule. Bottom: Subject circled in green are labeled as

paying attention but subject circled in red are labeled as not paying attention

according to the rst rule.

(26)

Figure 3.3: According to the rst rule, the subject is paying attention in the left image but is not paying attention in the right image.

Figure 3.4: According to the rst rule, the subject is paying attention in the left and the right image but is not paying attention in the middle image.

Figure 3.5: According to the second rule, the subject is paying attention in

all three images.

(27)

4 Result

In this section the results of the study are presented in tables, graphs and images. The tables represent how well the methods did in terms of sensitivity, precision and detection time. The graphs represent the number of detected faces by each of the Viola-Jones and the MB-LBP method on the test data from the simulated lecture and the number of false negatives and false posi- tives for each method. The images are examples of output images of the two methods.

The Haar cascade is used by the Viola-Jones method, while the LBP cascade is used by the MB-LBP method.

4.1 Tables

Table 4.1 and 4.2 show the sensitivity and precision of the LBP cascade and the Haar cascade respectively when applied to the test data from the full recording of the simulated lecture.

Table 4.1: LBP cascade used on Simulated Lecture LBP Cascade

Sensitivity 0.9226 Precision 0.6678

Table 4.2: Haar Cascade used on Simulated Lecture Haar Cascade

Sensitivity 0.9280 Precision 0.6606

Table 4.3 and table 4.4 show the sensitivity and precision of the LBP

cascade and Haar cascade respectively for each stage of the simulated lec-

ture. The stages are ordered such that stage 1.1 is the rst stage 1 in the

simulated lecture, followed by stage 2.1 which is the rst occurrence of stage

2.Transitions between stages were excluded.

(28)

Table 4.3: Simulated Lecture at each Stage for LBP LBP Cascade

Stage Sensitivity Precision 1.1 0.9299 0.9428 2.1 0.9528 0.9643 3.1 0.9280 0.3636 4.1 0.8738 0.0403 1.2 0.9501 0.9497 2.2 0.8051 0.8563 3.2 0.8152 0.3780 4.2 0.9255 0.0148 1.3 0.9577 0.9280 2.3 0.9226 0.9417 3.3 0.9284 0.3350 4.3 0.9329 0.0241

Table 4.4: Simulated Lecture at each Stage for Haar Haar Cascade

Stage Sensitivity Precision 1.1 0.9235 0.9743 2.1 0.9532 0.9582 3.1 0.9408 0.3915 4.1 0.8673 0.0336 1.2 0.9505 0.9805 2.2 0.8725 0.8719 3.2 0.7694 0.3756 4.2 0.9304 0.0183 1.3 0.9620 0.8889 2.3 0.9292 0.9556 3.3 0.9249 0.3629 4.3 0.9384 0.0182

Table 4.5 presents the fraction of non-faces composing the false positives

for each method. In this study, non-faces include subimages containing no

face but also subimages containing only half-faces and the double detection

of a face (see Figure 4.5).

(29)

Table 4.5: Estimation of the fraction of false positives generated by non- faces.

Fraction of Non-Faces False Positives Haar Cascade 0.0940

LBP Cascade 0.0775

Table 4.6 presents the fraction of faces composing the false positives for each method which were subimages containing downward-tilted faces or faces where the eyes were directed downwards with the subject either texting, writing, or otherwise looking down towards the retractable bench directly in front of them.

Table 4.6: Estimation of the fraction of false positives generated by faces tilted down or by faces with eyes looking directly downwards.

Fraction of Downward-tilted Faces False Positives Haar Cascade 0.6993

LBP Cascade 0.7296

Table 4.7. presents the result of three trials estimating the average time for each of the two face detection methods to process an image where the

rst and second trial uses the same data of 1000 images for the calculations.

Table 4.7: Estimated average time to process a single image.

Average Detection Time (s) per Image Trials Haar Cascade LBP Cascade

1 1.5144 0.4047

2 1.5033 0.4132

3 1.5400 0.4037

4.2 Graphs

The graphs show the number of faces detected by the Haar-cascade and LBP cascade for images extracted for each second of the recording of the simulated lecture. The manual attention detection is also represented in three of these graphs to show how well the methods detected attentive faces.

Polynomial curve tting was used to show the dierences in faces detected

in all four stages more clearly. With the use of best-t polynomial a line

shows how the overall attention changes in the total time of the recording.

(30)

Figure 4.1: Comparison of detected faces of the Haar cascade and the LBP cascade

Figure 4.2 show the face detections by the Haar cascade and the LBP cascade together with the manual attention detection which uses manually generated data based on the chosen denition of attention.

Figure 4.2: Detected faces by the Haar cascade and the LBP cascade com- pared to manual attention detection

Figure 4.3 show the false positives for both the LBP cascade and the

Haar cascade. Note that false positives are all subimages not containing the

face of a subject who is paying attention according to the chosen denition

of attention.

(31)

Figure 4.3: False positives of Haar cascade and LBP cascade attention de- tection

4.3 Output Images

Following are examples of output images. Face detections are highlighted in red.

Figure 4.4: Image 655 of 714 from the Haar cascade output of the simulated lecture. All detections are false positives since no subject is paying attention.

The detection of the subject in the front row to the right is dened as a double

face detection.

(32)

Figure 4.5: Output examples of some of the most noteworthy of non-face false positives. The lower detection in c.) is not dened as a double face since the majority of the subimage does not contain a part of a face. The non-face false positives in e.) and a.) are examples of detections occurring more than once for the Haar cascade the LBP cascade respectively.

4.4 Observations

The comparison of face detections in the simulated lecture for the Viola- Jones and the MB-LBP method can bee seen in gure 4.1 where the MB-LBP generally detects less faces.

The Viola-Jones method, using the Haar cascade performed slightly bet- ter in terms of sensitivity when applied to the full recording of the simulated lecture by generating a slightly higher sensitivity value compared to the LBP cascade's sensitivity value as seen in table 4.1 and table 4.2. However, sen- sitivity values of both methods were relatively high and shows that both methods succeeded in identifying approximately 92% of the subjects when they were paying attention. Even when the sensitivity values of the dierent stages were compared in tables 4.3 and 4.4, the lowest value of each method, occurring in stage 3.2 and 2.2 for the Haar and LBP cascade respectively, re- presents a correct classication within the margins of 76-81% with the Haar cascade scoring the lowest sensitivity value.

The precision values for the Haar cascade and the LBP cascade when

(33)

applied to the full recording of the simulated lecture were approximately 0.66 as shown in table 4.1 and 4.2. A major contributing factor to the low precision values of the full recording can be seen in table 4.3 and table 4.4.

For both of the two methods the precision values for stages 3 and 4 are notably lower than for the stages 1 and 2 which largely contributes to the low overall precision value. Figure 4.3 conrms that stages 3 and 4 produces the highest false positive values which lower the performance in terms of precision.

Table 4.6 shows that the majority of the generated false positives occur when the subject has a downward posed face. According to table 4.5, only approximately 9% and 7% of the false positives from stages 3 and 4 for the Haar and the LBP cascades respectively were non-faces (see gure 4.5).

In stages 3 and 4 compared to stages 1 and 2, it can be observed that the

number of face detections generally seem to vary more over a slightly larger

scope and more frequently detect a less number of faces in the rst and last

interval of stages as seen in gure 4.1. However, the second interval does not

conform to this pattern.

(34)

5 Discussion

Measuring attention of students is a complex task with many dierent factors which should be taken into account when developing a system to perform this task eectively. In this study, some assumptions were made in order to simplify the problem in accordance to the scope of the study. The results of the study also depend on certain predened conditions. These were the reso- lution of the images, the chosen position of the camera, the non-consideration of students concealed by other students when seated and the good lighting conditions.

The low overall precision values for both the Viola.Jones and the MB- LBP method indicates that approximately 1

3 of the data returned was not relevant. This was considered to not be a high enough precision value to me- asure attention eectively. The overall sensitivity values for both methods were considered to be relatively high when using face detection. However, dening attention is a more complex task than the chosen denition of at- tention in this study reects. The simplied denition of attention may have impacted the performance of both methods to generate higher sensitivity and precision values than if a more complex denition had been used.

For the test data of this study, the dierences were less than 1% in sen- sitivity and precision for the whole recording with the Viola-Jones method generating a higher sensitivity value while the MB-LBP method generated a slightly higher precision value.The reliability of the results for sensitivity and precision depend signicantly on the quality of the data. Due to li- mitations on resources and the time limit of the study, the data used was restricted in terms of variation and quantity. Additionally, as pointed out in [23], measuring sensitivity and precision do not provide a complete analysis on a method's classication performance. Considering these limitations, the dierence in the Viola-Jones and the MB-LBP methods' ability to detect at- tentive faces and their levels of precision were not considered to be signicant enough to draw denite conclusions on which method have the better per- formance for detecting attention. The dierent time lengths of stages should also be considered since it has a signicant eect on the overall performance.

The shorter duration of stage 4 compared to other stages will have resulted in higher overall precision value since both methods performed the poorest in terms of precision for stage 4.

The dierence in performance which could be clearly identied was the ti- me to process an image for face detection where MB-LBP method performed signicantly faster when both methods operated under the same conditions, conrming previous claims in [17] that LBP features can be used to perform face detection faster than the computationally heavier Haar-like features.

Another clear observation for both methods was the limited performance on

correctly classify attention when student were taking notes and when they

(35)

were not paying attention.

The method of measuring sensitivity and precision combined with ma- nually categorizing false positives could identify the clear limitation of both methods' tendency to detect faces labeled as not paying attention, speci- cally downward-tilted faces which frequently occurred in stages 3 and 4. The high detection of false positives presented an obstacle to visually be able to clearly distinguish between stages in an output graph. How well the subjects followed the instructions during the simulated lecture may also have aected the readability of the output graphs slightly.

Gaze-tracking head devices could be used to specically detect the direc- tion of the gaze instead of detecting the frontal face. This approach is closer to the chosen denition of attention since it considers which direction the subject is looking and it has previously been used in systems tasked with estimating focus of attention [24][25]. A less expensive and less intrusive approach would be to investigate the performance of a method trained for the specic purpose of detecting faces conforming to the chosen denition of attention. The MB-LBP method would be the preferred method for further study considering it achieved similar performance in sensitivity and precision compared to the Viola-Jones method but spending less computational time on face detections. Shorter training time due to its smaller feature set also adds to the advantage.

Further study is needed in order to investigate the potential of scalability of using face detection for the purpose of detecting attention of students.

Factors which need to be considered in order to implement an eective system

include the preconditions mentioned in the beginning of the section. The

expectation is that the gray-scale invariant MB-LBP would achieve the best

performance for poor lighting conditions. Using Multiple cameras positioned

in the classroom has been a common approach to avoiding the problem of

students concealed by students in the front seats [26]. A challenge when using

one camera is to capture high resolution images with the focus adjusted to

clearly capture faces of a large group of people seated dierent distances

from the camera. However, this were outside the scope of this study.

(36)

6 Conclusion

The conclusions that can be drawn from this study is that both methods,

Viola-Jones and MB-LBP generates very similar results on the test data used

in this study in terms of sensitivity and precision. Both methods achieved

high sensitivity and precision values when subjects performed limited poses

where the subjects were required to look only towards the front of the lecture

hall. The result of this study conrmed previous claims that the LBP met-

hod process images faster than when using Haar-features like the Viola-Jones

method. In this study, the MB-LBP method performed more than three ti-

mes faster than the Viola-Jones method. The greatest limitation identied

was the high number of false positives generated by both methods for input

where the subjects were repeatedly performing poses with their faces til-

ted downwards. These poses occurred frequently when subjects were taking

notes or not paying attention. Since both methods performances indicated

an ability to detect downward tilted faces which is not compatible with the

chosen denition of attention and since this generated very low precision

values, the trained Viola-Jones and the trained MB-LBP provided by the

OpenCV is not considered to eectively measure attention of students. Since

the MB-LBP method was shown to perform better in terms of time spent on

computations, a suggested continuation of this study is to create and analyze

an MB-LBP algorithm trained for the specic purpose of detecting attention

of students in an educational environment.

(37)

7 Appendix

The code for Haar Cascade which was provided by the open source library OpenCV [18] has a copyright claim. The code was written by Rainer Lienhart and is called haarcascade_frontalface_alt.xml. The copyright (C) is under Intel Corporation 2000. All the rights of the code belong to them.

The code under this copyright is only used for educational purposes and is not intended to be used for commercial use or any other work of prot.

The copyright says the following:

 Stump-based 20x20 gentle adaboost frontal face detector. Created by Rai- ner Lienhart.

////////////////////////////////////////////

IMPORTANT: READ BEFORE DOWNLOADING, COPYING, INSTAL- LING OR USING.

By downloading, copying, installing or using the software you agree to this license. If you do not agree to this license, do not download, install, copy or use the software.

Intel License Agreement For Open Source Computer Vision Library Copyright (C) 2000, Intel Corporation, all rights reserved. Third party copyrights are property of their respective owners.

Redistribution and use in source and binary forms, with or without mo- dication, are permitted provided that the following conditions are met:

* Redistribution's of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistribution's in binary form must reproduce the above copyright no- tice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* The name of Intel Corporation may not be used to endorse or promote products derived from this software without specic prior written permission.

This software is provided by the copyright holders and contributors äs

isänd any express or implied warranties, including, but not limited to, the

implied warranties of merchantability and tness for a particular purpose

are disclaimed. In no event shall the Intel Corporation or contributors be

liable for any direct, indirect, incidental, special, exemplary, or consequential

damages (including, but not limited to, procurement of substitute goods or

services; loss of use, data, or prots; or business interruption) however caused

and on any theory of liability, whether in contract, strict liability, or tort

(including negligence or otherwise) arising in any way out of the use of this

software, even if advised of the possibility of such damage.

(38)

8 References

[1] Viola, Paul, and Michael Jones. "Rapid object detection using a boosted cascade of simple features." Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. Vol. 1. IEEE, 2001.

[2] Chellappa, Rama, Charles L. Wilson, and Saad Sirohey. "Human and machine recognition of faces: A survey." Proceedings of the IEEE 83.5 (1995): 705-741.

[3] Hjelmås, Erik, and Boon Kee Low. "Face detection: A survey." Com- puter vision and image understanding 83.3 (2001): 236-274.

[4] Yang, Guangzheng, and Thomas S. Huang. "Human face detection in a complex background." Pattern recognition 27.1 (1994): 53-63.

[5] Zhang, Cha, and Zhengyou Zhang. A survey of recent advances in face detection. Tech. rep., Microsoft Research, 2010.

[6] Lienhart, Rainer, and Jochen Maydt. "An extended set of haar-like fea- tures for rapid object detection." Image Processing. 2002. Proceedings.

2002 International Conference on. Vol. 1. IEEE, 2002.

[7] Sakai, Toshiyuki, Makoto Nagao, and Takeo Kanade. Computer analysis and classication of photographs of human faces. Kyoto University, 1972.

[8] Yang, Ming-Hsuan, David Kriegman, and Narendra Ahuja. "Detecting faces in images: A survey." Pattern Analysis and Machine Intelligence, IEEE Transactions on 24.1 (2002): 34-58.

[9] Ojala, Timo, Matti Pietikäinen, and David Harwood. "A comparative study of texture measures with classication based on featured distribu- tions." Pattern recognition 29.1 (1996): 51-59.

[10] Liao, Shengcai, et al. "Face detection based on multi-block lbp repre- sentation." Advances in biometrics. Springer Berlin Heidelberg, 2007.

11-18.

[11] Li, Stan Z., et al. "Statistical learning of multi-view face detection."

Computer VisionECCV 2002. Springer Berlin Heidelberg, 2002. 67- 81.

[12] Papageorgiou, Constantine P., Michael Oren, and Tomaso Poggio. "A general framework for object detection." Computer vision, 1998. sixth international conference on. IEEE, 1998.

[13] Crow, Franklin C. "Summed-area tables for texture mapping." ACM

SIGGRAPH computer graphics 18.3 (1984): 207-212.

(39)

[14] Weisstein, Eric W. Haar Function. From MathWorldA Wolfram Web Resource. Last accessed 18.03.2015 http://mathworld.wolfram.com/

HaarFunction.html

[15] Schapire, Robert E. "Explaining adaboost." Empirical inference.

Springer Berlin Heidelberg, 2013. 37-52.

[16] The Haar Basis by Fazal Majid, Yale University, last modied 27.01.1995, last accessed 18.03.2015, http://users.math.yale.edu/

users/majid/manual/node28.html

[17] Chang-yeon, Jo. "Face Detection using LBP features." Final Project Report 77 (2008).

[18] OpenCV, OpenCV's website, by Itseez, last accessed 17.04.2015. http://opencv.org/

[19] Wang, Li, and Dong-Chen He. "Texture classification using texture spectrum." Pattern Recognition 23.8 (1990): 905-910.

[20] Wang, Li, and D. Ch He. "A new statistical approach for texture analysis." Photogrammetric Engineering and Remote Sensing 56.1 (1990): 61-66.

[21] Mita, Takeshi, Toshimitsu Kaneko, and Osamu Hori. "Joint haar-like features for face detection." Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on. Vol. 2. IEEE, 2005.

[22] Saito, Takaya, and Marc Rehmsmeier. "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLoS ONE 10.3 (2015): e0118432.

[23] Powers, David Martin. "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation." (2011).

[24] Stiefelhagen, Rainer. "Tracking focus of attention in meetings." Proceedings of the 4th IEEE International Conference on Multimodal Interfaces. IEEE Computer Society, 2002.

[25] Rosengrant, David, et al. "Following student gaze patterns in physical science lectures." AIP Conference Proceedings-American Institute of Physics. Vol. 1413. No. 1. 2012.

[26] Raca, Mirko, and Pierre Dillenbourg. "System for assessing classroom attention." Proceedings of the Third International Conference on Learning Analytics and Knowledge. ACM, 2013.


[27] Girish, G. N., and Pradip K. Das. "Face recognition using MB-LBP and PCA: A comparative study." Computer Communication and Informatics (ICCCI), 2014 International Conference on. IEEE, 2014.

[28] Ge, Zhubei, et al. "Face Detection Based on Multi-block Quad Binary Pattern." Computer Vision-ACCV 2014 Workshops. Springer International Publishing, 2014.

[29] Langton, Stephen RH, Roger J. Watt, and Vicki Bruce. "Do the eyes have it? Cues to the direction of social attention." Trends in cognitive sciences 4.2 (2000): 50-59.

[30] Acasandrei, Laurentiu, and Angel Barriga. "Hardware-software face detection system based on multi-block local binary patterns." Sixth International Conference on Graphic and Image Processing (ICGIP 2014). International Society for Optics and Photonics, 2015.

[31] Steinberg, Eran, et al. "Classification And Organization Of Consumer Digital Images Using Workflow, And Face Detection And Recognition." U.S. Patent Application 14/542,261.

[32] Wang, Nai-Jian, Sheng-Chieh Chang, and Pei-Jung Chou. "Real-time multi-face detection on FPGA for video surveillance applications." Journal of the Chinese Institute of Engineers ahead-of-print (2014): 1-8.

[33] Fuzhen, Huang, and Bian Houqin. "Identity authentication system using face recognition techniques in human-computer interaction." Control Conference (CCC), 2013 32nd Chinese. IEEE, 2013.
