

MASTER THESIS

Feature Engineering and Machine Learning for Driver Sleepiness Detection

Authors

Henrik Mårtensson

Oliver Keelan

Supervisors

Christer Ahlström

Tuan Pham

Examiner

Ingemar Fredriksson

Department of Biomedical Engineering

University of Linköping

October 16, 2017


University of Linköping

Abstract

Feature Engineering and Machine Learning for Driver Sleepiness Detection

by

Henrik Mårtensson and Oliver Keelan

Falling asleep while operating a moving vehicle is a contributing factor to the statistics of road related accidents. It has been estimated that 20% of all accidents where a vehicle has been involved are due to sleepiness behind the wheel. To prevent accidents and to save lives is of utmost importance. In this thesis, given the world's largest dataset of driver participants, two methods of evaluating driver sleepiness have been evaluated. The first method was based on the creation of epochs from lane departures and KSS, whilst the second method was based solely on the creation of epochs from KSS. From the epochs, a number of features were extracted from both physiological signals and the car's controller area network. The most important features were selected via a feature selection step, using sequential forward floating selection. The selected features were trained and evaluated on linear SVM, Gaussian SVM, KNN, random forest and adaboost. The random forest classifier was chosen in all cases when classifying previously unseen data.

The results show that method 1 was prone to overfit. Method 2 proved to be considerably better, and did not suffer from overfitting. The test results regarding method 2 were as follows: sensitivity = 80.3%, specificity = 96.3% and accuracy = 93.5%.

The most prominent features overall were found in the EEG and EOG domains, together with the sleep/wake predictor feature. However, there are indications that complexity measures might contribute to the detection of sleepiness as well, especially Higuchi's fractal dimension.


Acknowledgements

First and foremost, we would like to express our gratitude to Christer Ahlström for his constant support, advice and patience during this thesis. We would like to thank VTI for providing the master thesis project and Tuan Pham for his valuable input regarding feature creation. Furthermore, we would like to thank Ingemar Fredriksson for taking the time to be our examiner. Thank you to Anna Anund for her positive and supportive spirit throughout the thesis work. We would also like to express our gratefulness towards our families, not only during this thesis, but through all five years of studies on the biomedical engineering programme.

Obviously, the coffee machine deserves recognition as well, for always being there.


List of Abbreviations

ANS Autonomic nervous system

AUC Area Under Curve

CAN Controller Area Network

CART Classification And Regression Tree

ECG Electrocardiography

EEG Electroencephalography

EOG Electrooculography

FD Fractal Dimension

FPR False Positive Rate

HFD Higuchi’s Fractal Dimension

HRV Heart Rate Variability

KNN K-nearest Neighbour

KSS Karolinska Sleepiness Scale

PSD Power Spectral Density

RMSSD Root Mean Square of the Successive Differences

ROC Receiver Operating Characteristic

SFS Sequential Forward Selection

SFFS Sequential Forward Floating Selection

SA Sinoatrial Node

SVM Support Vector Machine

SWP Sleep Wake Predictor


Contents

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Question formulations
1.4 Limitations

2 Background
2.1 Experiment setup
2.1.1 Data acquisition

3 Theory
3.1 Sleepiness
3.1.1 EEG
3.1.2 ECG
3.1.3 EOG
3.1.4 CAN
3.2 Complexities
3.2.1 Sample Entropy
3.2.2 Higuchi's fractal dimension
3.3 Machine Learning
3.3.1 Ensemble methods
3.3.2 Support vector machine
3.3.3 K-Nearest Neighbour
3.3.4 Feature Selection
3.3.5 Classification evaluation
3.3.6 Data leakage

4 Method
4.1 Prephase
4.2 Signal processing
4.2.1 Filtering
4.3 Dividing data into KSS epochs
4.3.1 Epochs based on lane departure and KSS
4.3.2 Epochs based solely on KSS
4.4 Features
4.4.1 EEG - frequency features
4.4.2 ECG
4.4.3 EOG
4.4.4 Complexities
4.4.5 CAN
4.4.6 Sleep Wake Predictor
4.5 Data organisation
4.6 Feature selection
4.7.1 Evaluation of classifiers

5 Results
5.1 Summary
5.2 Feature selection
5.3 Epochs based on lane departures and KSS
5.3.1 SWP Included
5.3.2 SWP Excluded
5.3.3 Test results
5.4 Method 2 - Epochs based solely on KSS
5.4.1 Including SWP
5.4.2 Excluding SWP

6 Discussion
6.1 Data segmentation
6.1.1 Epochs based on lane departures and KSS
6.1.2 Epochs based solely on KSS
6.2 Refining the signals
6.3 Feature selection
6.3.1 Feature selection fluctuation
6.3.2 The selected features for classification
6.4 Creation of the classifier
6.4.1 Separating of data
6.4.2 Imbalanced data
6.4.3 Choice of classifier
6.5 Test Results
6.6 Karolinska sleepiness scale as ground truth
6.7 Limitations
6.8 Future work

7 Conclusion


1 Introduction

This chapter will present the thesis and motivate why it is of importance in section 1.1. In section 1.2 the aim of the project is presented, followed by a problem statement in section 1.3.

1.1 Motivation

Falling asleep while operating a moving vehicle is a contributing factor to the statistics of road related accidents. It has been estimated that 20% of all accidents where a vehicle has been involved are due to sleepiness behind the wheel. Furthermore, the occurrences are overrepresented on roads where the speed limit is 90 or 110 km/h, i.e. highways and wide two-lane roads [1]. An alert driver might become sleepy when the driving experience becomes monotonous. Both road design and a lack of environmental variability have been identified as monotonous conditions impacting the mental load of the driver. [2][3]

Preventing traffic accidents is of utmost importance. Many studies have been performed either in simulated environments or on real roads. However, a limitation of these studies is that they usually involve small sample groups. Given the world's largest dataset collected on real roads, new opportunities are given to investigate different physiological signals as well as driving behaviours.

1.2 Aim

Given the world's largest dataset of driver participants, the aim of this master thesis is to investigate if it is possible to create a robust algorithm for detecting sleepiness. Features that may be characteristic of sleepiness are extracted from the data, and two approaches are evaluated. The first method is based on dividing the data into epochs (segments) based upon lane departures and the Karolinska sleepiness scale (KSS). KSS is a subjective scale of sleepiness ranging from 1 (extremely alert) to 9 (very sleepy, fighting sleep). The second method depends solely on epochs based on KSS.


1.3 Question formulations

• Which features are highly representative for driver sleepiness in each respective method?

• Can the approach in method 1 be used in order to determine driver sleepiness with the help of machine learning?

• Is there an enhanced performance when using the approach of method 2 with machine learning?

1.4 Limitations

This thesis covers a large number of different features from different signals. The number of features used is determined in the feature selection stage, which drastically decreases the number of features of interest. Furthermore, this thesis focuses on different supervised learning techniques, and thus excludes any detailed descriptions of unsupervised and reinforcement learning.


2 Background

In this thesis work the data used have been collected from three separate experiments carried out by VTI between 2009 and 2011. Each experiment is independent from the others and has been conducted in real conditions on different road types. Below follows a general description of the experimental setup for the real road scenarios.

2.1 Experiment setup

The experiments were based on a within-subject design in order to obtain data from each driver during different times of the day as well as during different levels of sleepiness, but on the same road for the different trials. In total, 85 participants were randomly recruited from the national register of vehicle owners in Sweden, with the minimum inclusion criteria of being in the range of 30-60 years of age and having driven at least 5000 kilometres during the last year. Each participant received a compensation of 3000 Swedish kronor. The vehicles used for the experiments were a Saab 9-3 Aero or a Volvo XC70, depending on the experiment. Each vehicle was equipped with the necessary sensors and equipment for monitoring and logging data. Apart from the participant, a test leader was seated in the passenger seat, who was responsible for the dual command to ensure the safety aspects during the experiment. In addition to this, there was also a test leader in the back seat responsible for the data acquisition of the experiment.

The experiments were conducted either on Riksväg 34 or on the E4 motorway outside of Linköping. Every five minutes of the experiment the participants were prompted to report a subjective sleepiness rating on the KSS scale. KSS is the abbreviation for Karolinska sleepiness scale and is a tool for evaluating subjective sleepiness. It is based on a nine-graded scale which ranges from extremely alert (1) to very sleepy, great effort to keep awake, fighting sleep (9) [16], see table 1 showing the nine different grades.

Value Description

1 Extremely alert

2 Very alert

3 Alert

4 Rather alert

5 Neither alert nor sleepy

6 Some signs of sleepiness

7 Sleepy, but no effort to stay awake

8 Sleepy, some effort to stay awake

9 Very sleepy, great effort to stay awake, fighting sleep


2.1.1 Data acquisition

In order to record the physiological data of interest, a system called Vitaport from Temec Instruments was used. In these experiments, horizontal and vertical electrooculogram (EOG) were recorded on three channels, however only the vertical channel was used later on. Electroencephalogram (EEG) was recorded on three channels positioned at Fz-A1, Cz-A2 and Oz-Pz. Electromyogram (EMG) was measured on one channel placed under the jaw and used for artifact handling of the EEG. Lastly, the electrocardiogram (ECG) was measured on one channel. For EEG, ECG and EMG the sampling frequency was 256 Hz, whilst it was 512 Hz in the case of EOG. In addition to the physiological data, driving behaviour data were also acquired from the car's controller area network (CAN), including for example speed and lane position. The sampling frequency was either 40 or 10 Hz, depending on the experiment. Other external sensors, such as Smart Eye, were also used for eye tracking.


3 Theory

The following chapter comprises the relevant theory in relation to driver sleepiness. First sleepiness is defined, followed by the impact of sleepiness on EEG, ECG and EOG. The subsequent section describes different complexity measures, such as sample entropy and Higuchi's fractal dimension, which are not yet widely applied in driver sleepiness research. The last section of the chapter gives a brief introduction to machine learning.

3.1 Sleepiness

In order to study sleepiness, especially regarding driving scenarios, a sound understanding of the topic is required. Sleepiness, in general, can be defined as the transitional state between wakefulness and sleep. Sleepiness can be observed both subjectively (the feelings of sleepiness) and physiologically as different patterns and behaviours of physiological signals. Furthermore, sleepiness can be observed behaviourally in drivers.[15]

3.1.1 EEG

EEG measurements are the most commonly used features when investigating driver sleepiness. This is mainly due to their direct correlation with one's mental state. The human brain contains an enormous diversity of EEG rhythms, which are all characterised by their frequency range and amplitude. The frequency content is a result, amongst other things, of the mental state of the subject, i.e. degree of consciousness, sleepiness and wakefulness. [13]

Figure 1: The frequency characteristics of each band in the human brain.

As figure 1 illustrates, the EEG rhythms can be divided into different categories depending on the frequency content. High-frequency, low-amplitude rhythms imply an active brain, which is associated with alertness or dream sleep. Low-frequency, high-amplitude rhythms, on the other hand, are commonly associated with drowsiness and non-dreaming sleep states. The relationship between the electrical activity and the corresponding amplitude is due to the synchronisation, or lack thereof, of the neurons. Neurons in an active brain fire rapidly when processing information, however they are not synchronised with their neighbouring neurons. This yields a low EEG amplitude. The lack of synchronisation originates from the treatment of the received information. Each neuron serves a particular role in a complex cognitive task, hence the signals are not outputted simultaneously. The converse applies when the neurons are not engaged in information processing. Here, the neurons are excited by a common rhythmic input, which yields a high synchronisation amongst the neurons and subsequently a large EEG amplitude. [13]

In general, signals obtained from the scalp can contain amplitudes ranging from a few microvolts to approximately 100 µV and a frequency content which can vary between 0.5 and 30-40 Hz. As previously mentioned, the frequency content can be separated, and it is usually divided into 5 different frequency bands. The interpretation of these bands depends on different aspects such as the age and mental state of the subject. The EEG of a newborn compared to an adult's differs heavily, since a newborn generally has considerably higher frequency content. Figure 1 displays the respective morphology of each band along with its frequency range, however the frequency band limits vary somewhat depending on the author. [13]

When investigating driver sleepiness, some frequency bands are of higher importance, especially the alpha, theta and in some cases beta bands. The alpha frequency band, which ranges from 8-13 Hz, corresponds to a wakeful state. Theta, however, occurs during drowsiness and in certain sleep stages. Lastly, the beta band has a fast rhythm with low amplitude and can be observed e.g. during sleep stages. The underlying principle regarding the stated bands is to detect the frequency shift which occurs when a subject transitions from an alert state to a sleepy state. [13][15]

3.1.2 ECG

Measurements based solely on ECG for investigating driver sleepiness are, as of today, relatively uncommon. However, some researchers [56] have conducted studies in order to see the possible contributions ECG might make to the field of driver sleepiness. Heart rate variability, i.e. HRV, is a term frequently used in the context of sleepiness. HRV refers to the variability of the beat-to-beat interval, where the normal heart rhythm lies between 50 and 100 beats/minute [13]. At rest, the heart rate is essentially regular, however there are some fluctuations even when physical or mental stress is absent. The cause behind the small fluctuations is the autonomic nervous system (ANS), which influences the firing rate of the sinoatrial node. Increased parasympathetic activity, which is present when the body is at rest, results in a decrease of the heart rate, increasing the HRV. Furthermore, an increase in sympathetic activity, i.e. when the body is alert and ready for activation, yields an increase of the heart rate, decreasing the HRV [13]. The main area of interest regarding driver sleepiness is the frequency domain of the HRV, especially the low- (0.04-0.15 Hz) and high-frequency (0.15-0.4 Hz) bands. The low frequency band has been shown to correlate with sympathetic activity (alertness), whilst the high frequency band has been shown to correlate more with parasympathetic activity (sleepiness) [18][20].

3.1.3 EOG

In addition to the frequently used EEG measurements, such as the frequency bands, different EOG measurements can be used as well to detect sleepiness. The EOG measures the steady corneal-retinal potential, which is proportional to vertical and horizontal movements of the eye. This yields an objective way to quantify the direction of the gaze. The EOG is usually of interest when investigating subjects with sleep disorders, where the presence of rapid eye movement is important for determining different sleep stages. However, blink duration and eyelid closure speed have both been proven to be quite interesting indicators when investigating driver sleepiness. [58][13]

Studies have shown that the blink duration tends to increase with sleep loss, along with a delay of lid opening and lid closure speed. The blink rate, on the other hand, has been reported both to be affected and unaffected by sleep loss, depending on the author. As of today, measures correlated to blink duration are considered the most responsive to sleepiness. [58]

3.1.4 CAN

Apart from physiological signals, driver sleepiness is often measured based on driving performance. The important measurements can be found in the controller area network of the vehicle. Commonly used measures include the steering wheel angle, where the driver tends to make larger sudden adjustments when sleepy, and the lateral position of the car, which has been found to vary more as the driver gets sleepy. Another easily obtainable measurement is the vehicle's speed, where the deviation from the speed limit has been found to correlate with sleepiness. However, it can be hard to measure speed correctly since it also depends on the environment and traffic. [15]

3.2 Complexities

There is as yet no concise definition within the scientific community of what constitutes a complex system. It is a term used when the behaviour of a system proves difficult to predict and control, such as the human brain. There is no complexity measurement which on its own can reliably capture all possibly relevant complexity of, e.g., the brain. However, by combining different measurements, complexity measures can help to yield a better description of the current situation. The following section highlights two complexity measurements, sample entropy and Higuchi's fractal dimension, which have been shown to be successful in the field of biological signals. [42][55]

3.2.1 Sample Entropy

The ability to quantify and distinguish the complexity of different biological signals has become of increased interest in recent years due to its easy implementation, and it can be vital in detecting early biological patterns [6]. Sample entropy measures the probability that two sequences that match point-wise within a threshold limit also match at the succeeding point. Sample entropy was first presented by Richman and Moorman [8].

Given a time series with N data points

u(n) = u(1), u(2), u(3), \ldots, u(N) \qquad (1)

form a sequence of vectors

X_m(i) = [u(i), u(i+1), \ldots, u(i+m-1)] \qquad (2)

where m is the embedded dimension, i.e. the length of the pattern. Next, define the distance between X_m(i) and X_m(j) as

d[X_m(i), X_m(j)] = \max_{k=0,\ldots,m-1} |u(i+k) - u(j+k)| \qquad (3)

With a given number of (pattern) vectors, X_m(i), calculate the number of vectors such that

d[X_m(i), X_m(j)] < r \qquad (4)

Let N^m(i) be the number of vectors that can be regarded as similar, i.e. the number of vectors X_m(j) (1 ≤ j ≤ N - m, j ≠ i) whose every element lies within the threshold value r of the corresponding element of X_m(i). We may now define the correlation sum for 1 ≤ i ≤ N - m as

A_i^m(r) = \frac{N^m(i)}{N - m} \qquad (5)

and define \Phi^m(r) as

\Phi^m(r) = \frac{1}{N - m} \sum_{i=1}^{N-m} A_i^m(r) \qquad (6)


The final equation for sample entropy is defined as the negative natural logarithm of the quotient between \Phi^{m+1}(r) and \Phi^m(r):

SE(m, r) = -\ln \frac{\Phi^{m+1}(r)}{\Phi^m(r)} \qquad (7)

The difference between the values generated for the m and m+1 dimensions is to be viewed as the complexity quantity. A smaller value of SE means that there is a larger probability that two sequences (X_m(i) and X_m(j)) that are similar for m points stay similar at the succeeding point; the signal is then to be regarded as repetitive and of low complexity. A higher value of SE implies that similar patterns are not predictive of the succeeding point, and the signal is to be regarded as irregular, i.e. a high-complexity signal [9].
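As a concrete illustration of equations 1-7, the following MATLAB sketch computes sample entropy for a single time series. The function name and the choice of r as 0.2 times the standard deviation of the signal in the example call are our own assumptions, not values taken from the thesis.

function SE = sample_entropy(u, m, r)
% Minimal sample entropy sketch following equations (1)-(7).
% u: time series, m: embedded dimension, r: tolerance threshold.
% Self-matches (i == j) are excluded, as in Richman and Moorman.
    u = u(:);
    N = numel(u);
    B = 0;                               % template matches of length m
    A = 0;                               % template matches of length m+1
    for i = 1:N-m
        for j = 1:N-m
            if i == j, continue; end
            % Chebyshev distance between X_m(i) and X_m(j), equation (3)
            if max(abs(u(i:i+m-1) - u(j:j+m-1))) < r
                B = B + 1;               % similar for m points, equation (4)
                if abs(u(i+m) - u(j+m)) < r
                    A = A + 1;           % still similar at the succeeding point
                end
            end
        end
    end
    SE = -log(A/B);                      % equation (7)
end

% Example call on one epoch (the tolerance 0.2*std is an assumed convention):
% se = sample_entropy(eeg_epoch, 2, 0.2*std(eeg_epoch));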

3.2.2 Higuchi’s fractal dimension

Higuchi’s fractal dimension (HFD), has, over these past twenty years, occupied an important place in the analysis of different physiological signals [43]. How-ever, the method was first introduced in the article, Approach to an irregular

time series on the basis of the fractal theory, as a nonlinear measure for the earth’s changing magnetic field [33]. It was not until scientists began

investi-gating the possibilities of applying various algorithms such as Katz, Pickover and Khorasani, etc. for fractal dimension (FD) calculations when HFD was first linked to physiological signals [43]. Katz, creator of Katz’s algorithm, described in an article [46] a successful fractal characterisation of the EEG of healthy and sleeping subjects. Furthermore, he demonstrated the potential of FD and how useful it could be for distinguishing normal- and abnormal states regarding one’s ECG.

A fractal is a shape that retains its structural detail despite scaling, which is why complex systems or objects can be described with the help of FD [47]. One reason why FD is gaining more popularity, apart from its speed and accuracy, is that it is a highly sensitive measure for detecting hidden information contained in physiological time series. There are several methods of calculating the FD, however only three are widely accepted: Higuchi's, Katz's and Petrosian's methods [43].

HFD is considered superior to well established linear methods for physiological signals, due to the fact that HFD is an accurate numerical measure which is not dependent on the nature of the analysed signal, i.e. stationary or nonstationary, deterministic or stochastic [33]. It can also serve as a feature on its own or in combination with other mathematical and statistical (linear or nonlinear) methods, e.g. combining the method with spectral features to investigate sleepiness [33].


HFD is a nonlinear measure of waveform complexity carried out in the time domain, where discrete functions or signals are analysed as a time sequence [33]. It is acquired through equation 8, where several lengths of the signal or curve have been calculated. Note that L(k) is the mean value of the retrieved curve lengths for each respective k time series.

HFD = \frac{\ln(L(k))}{\ln(1/k)} \qquad (8)

The resulting values lie between 1 and 2 and are interpreted by their magnitude, i.e. a low value (close to 1) implies that the investigated signal is predictable, whilst a high value (close to 2) implies a very unpredictable behaviour of the signal. k, as in equation 8, refers to the free parameter k_max and has a crucial role when calculating HFD. However, the estimation of k_max was not elaborated extensively in Higuchi's original work. Although different methods have appeared throughout the years, there is, as of today, no strongly established and accepted method of estimating the parameter. Some researchers [46] suggested an increasing k_max with an increasing N (number of data points). Gómez et al., on the other hand, propose that a suitable k_max can be calculated by plotting HFD values over a range of different k_max; k_max is then determined at the saturation point, i.e. when the curve starts to plateau. [33][27]
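A minimal MATLAB sketch of Higuchi's algorithm is given below, in the spirit of equation 8. The helper name, the use of the curve-length normalisation from Higuchi's original formulation and the least-squares slope over all k up to k_max are our own choices for the example; k_max is assumed to be much smaller than the signal length.

function D = higuchi_fd(x, kmax)
% Minimal sketch of Higuchi's fractal dimension.
% x: signal, kmax: the free parameter k_max discussed in the text.
    x = x(:);
    N = numel(x);
    L = zeros(kmax, 1);
    for k = 1:kmax
        Lm = zeros(k, 1);
        for m = 1:k
            idx = m:k:N;                             % subsampled curve for offset m
            n = numel(idx) - 1;                      % number of increments
            % curve length, normalised as in Higuchi's original formulation
            Lm(m) = sum(abs(diff(x(idx)))) * (N - 1) / (n * k^2);
        end
        L(k) = mean(Lm);                             % mean curve length L(k)
    end
    k = (1:kmax)';
    p = polyfit(log(1./k), log(L), 1);               % slope of ln(L(k)) versus ln(1/k)
    D = p(1);                                        % HFD, typically between 1 and 2
end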

3.3 Machine Learning

The term machine learning was first coined by Arthur Lee Samuel in 1959 in his article, Some studies in machine learning using the game of checkers, where he highlighted the possibility of computers learning from experience and thus eliminating the need for detailed programming [10].

Much has happened since then, however the principle is unchanged. Machine learning can be defined as the construction of algorithms which have the capability of automatically detecting patterns in data. Uncovered patterns can then be used to predict future data, or to perform various kinds of decision making in uncertain situations. Machine learning is usually divided into three different categories: supervised, unsupervised and reinforcement learning. Supervised learning is a well-defined method since the computer is trained on mapping the input (raw data) to a certain label. Unsupervised learning, also known as knowledge discovery, has no labels, i.e. nothing defining what is correct or not, leaving the algorithm on its own to find structure in its input. Reinforcement learning applies rewards and punishments depending on the outcome of the algorithm, however this method is not as commonly used. [10]

This section treats the theory of the classifiers which were adopted in this thesis, beginning with the ensemble methods, i.e. Adaptive Boosting and random forest. It then continues with support vector machines and lastly describes K-nearest neighbour (KNN).


3.3.1 Ensemble methods

Ensemble methods are learning algorithms that combine a set of models, i.e. classifiers, and then classify new data points by taking a weighted vote of their predictions. Studies have shown that ensembles of classifiers usually yield better prediction performance than the individual classifiers used. However, certain conditions have to be fulfilled by the individual members in order to achieve such a result. It is important that the members are accurate on their own as well as diverse. [52]

A classifier is considered accurate if its error rate is lower than a random guess on new feature values. Diverse classifiers make different predictions on new data points. To put this in context, imagine an ensemble of three classifiers h_1(x), h_2(x) and h_3(x) that are going to predict on a new data set S. If the three classifiers are identical, i.e. not diverse, then the outputs of the classifiers are the same, which implies that they are correct or incorrect at the same time. However, if the errors made by h_1(x), h_2(x) and h_3(x) are uncorrelated, there is a possibility that the majority vote correctly classifies S. Note that the individual classifiers have to be constructed with error rates below 0.5 (more accurate than a random guess) in order to avoid an increase of the error rate for the voted ensemble. [52]

Adaptive Boosting

Boosting is a well established technique within machine learning and ensemble methods, and is widely applied in multiple areas. The principle behind boosting is to combine weak learners and their rules of thumb into a single committee whose predictions will be more accurate than those of a single weak learner. A weak learner is a classifier, predictor etc. whose prediction is slightly better than a random guess. Adaptive Boosting (adaboost) can utilise almost any classifier as a weak learner, however the most basic and common learner is a decision stump, which is no more than a single-level decision tree. [10]

Adaboost automatically adapts to the data, hence its name, and is considered to be one of the first practical boosting algorithms where it is not required to define a large number of hyperparameters [12]. The principle behind adaboost is to weight classification outputs differently depending on the performance of the weak learners. The training conducted with the weak classifiers should be on a random subset of the total training set. Note that the same datapoint can occur in different subsets when the datapoints are selected at random.

Like any learning algorithm, adaboost takes a set of training examples as input, (x_1, y_1), \ldots, (x_m, y_m), where each x_i is a feature vector from the feature space \chi and each y_i is the respective label or class \{-1, +1\}. Initially every data point has the same weight in the distribution D_t (the sum of all weights equals 1). This implies that the first (t = 1) weak classifier h_t is trained with equal probability given to all training points. Once the weak classifier has been trained, its output weight \alpha_t is computed as in equation 9. [12][47]

\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \qquad (9)

Note that \epsilon_t refers to the weighted error rate, i.e. the fraction of misclassifications registered on the training set. Furthermore, once \alpha_t has been acquired, the next step is to update the weights of the respective training points. This is achieved by equation 10, where D_t(i) is the weight for the i-th training point.

D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t} \qquad (10)

The weights are normalised by dividing each weight by Z_t, which in this case is the sum of all weights. This ensures that the distribution sums to 1, as previously mentioned. Equation 10 is evaluated for each of the training points (x_i, y_i), where each weight from the previous training round is either up-weighted or down-weighted. By iterating the abovementioned process, the final predictor can be described as equation 11,

H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \qquad (11)

which is a linear combination of all the weak classifiers where the final decision is the sign of the sum. [12]
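A toy numerical example may clarify one round of the weight update; the vectors below are made-up stand-ins for a weak classifier's output and are not data from the thesis.

% One adaboost round on synthetic labels, illustrating equations (9)-(11).
y      = [ 1;  1; -1; -1;  1];             % true labels
h_pred = [ 1; -1; -1; -1;  1];             % weak classifier output, one point wrong
D      = ones(5,1) / 5;                    % initially uniform weight distribution

eps_t = sum(D(h_pred ~= y));               % weighted error rate epsilon_t
alpha = 0.5 * log((1 - eps_t)/eps_t);      % output weight, equation (9)
D = D .* exp(-alpha * (y .* h_pred));      % up-/down-weight each point, equation (10)
D = D / sum(D);                            % normalise with Z_t (sum of all weights)
% After T such rounds the committee predicts with the weighted vote of
% equation (11): H(x) = sign(sum_t alpha_t * h_t(x)).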

Random Forest

One of the most common techniques within ensemble learning is tree-based methods. A tree within machine learning refers to a model that can predict the value of a target variable based on multiple input variables, figure 2.

Figure 2: An example of a simple tree structure design, where the numerical boxes are the respective children of a node split. Terminal node determines the outcome of the target variable.


They map non-linear relationships rather well, unlike linear models. Random forest is such a method, and was introduced by Leo Breiman in the early 2000s. Breiman defines it as a classifier comprised of a collection of tree-structured classifiers, h(x, v_k), where the v_k are independent identically distributed random vectors (k = 1, 2, ...) and each tree casts a unit vote for the most popular class at input x.

Bootstrap aggregating (bagging) is used in tandem with random feature selection when creating a "forest" of classifiers. Imagine a data set, D, comprising n data points (x_n) and their corresponding labels (y_n). Bagging enables one to draw n data points uniformly at random with replacement. The replacement implies that the same data point can be picked in different random subsets, or several times in the same subset. This, according to Breiman, enhances the accuracy when random features are used. Furthermore, it has been shown to improve generalisation by avoiding specialising the selected data points to a single training set. Breiman also addresses the fact that bagging can be used to give ongoing estimates of the generalisation error of the combined ensemble of trees. [50][52]

The best suited threshold, i.e. split point, for the subset is determined by the randomised node optimisation equation,

\theta_j^* = \arg\max_{\theta_j \in D_j} I_j \qquad (12)

where \theta denotes all the possible thresholds for the entire data set D. This implies that each node has only a subset of thresholds, D_j, from which the best one (\theta_j^*) is determined by maximising the information gain (I_j). Note that j in this case refers to the j-th node. The information gain can be estimated by different techniques; entropy is commonly used [52]. Once the node has been separated into its children, the process continues until the minimum node size has been reached. Another technique mentioned by Breiman is to use the CART (Classification And Regression Tree) methodology when separating the feature space, which is based upon binary statements, i.e. yes or no questions. CART is applied in this thesis since it is far easier to understand [51][50].

By iterating the abovementioned process, the final output is an ensemble of random trees whose predictions are averaged, i.e. divided by the number of trees, in order to yield the final classification of the data point.
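For reference, bagged CART trees of this kind can be sketched with MATLAB's TreeBagger (Statistics and Machine Learning Toolbox). The synthetic data, the number of trees and the parameter choices below are assumptions for the example and do not reproduce the configuration used in the thesis.

% Sketch of a random forest of CART trees on synthetic two-class data.
rng(1);                                        % reproducible bootstrap draws
X = [randn(50, 5); randn(50, 5) + 1];          % stand-in feature matrix (one row per epoch)
Y = [repmat({'alert'}, 50, 1); repmat({'sleepy'}, 50, 1)];

forest = TreeBagger(100, X, Y, ...
                    'Method', 'classification', ...
                    'OOBPrediction', 'on');     % out-of-bag error, Breiman's ongoing estimate
oobErr = oobError(forest);                      % generalisation error estimate per added tree
Yhat   = predict(forest, randn(1, 5) + 1);      % majority vote over the 100 trees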

3.3.2 Support vector machine

If data are linearly separable, it is possible to create infinitely many decision boundaries, i.e. lines, that can separate the classes from each other. Naturally, some decision boundaries are more suitable than others, and which one is optimal depends on the criterion. In machine learning, the idea of support vector machines is to optimise the hyperplane (thick blue line in figure 3) that best divides a dataset [47]. A less desired hyperplane passes close to the datapoints, inducing high sensitivity to noise and less generalisation capability, such as line B in figure 3 [53]. The datapoints from each class closest to the hyperplane are called support vectors, and optimising the distance, i.e. the margin, between the support vectors and the hyperplane increases the chance of unseen data being correctly classified by the SVM model [53]. This is illustrated in figure 3, where the hyperplane A has a considerably larger margin between the two classes compared to B.

Figure 3: Two hyperplanes (thick blue lines), separating the red and blue class. The light blue area indicates the size of the margin.

Mathematically, an SVM hyperplane that is to divide two classes can be seen as an optimisation problem. As can be seen in figure 4, the maximum margin is equal to \frac{2}{\|w\|}. Thus, minimising \|w\| results in the largest distance between the support vectors, and the largest separability is obtained [47]. Therefore, the optimisation problem can be described as in equation 13.

\text{minimise } \|w\| \quad \text{subject to } y_i(w \cdot x_i - b) \geq 1 \qquad (13)

where y_i indicates the class to which datapoint x_i belongs and w is the normal vector to the hyperplane. A hyperplane is generally described by the expression inside the parenthesis. [47]


Figure 4: A mathematical expression of the margin and hyperplanes in SVM

Optimally, there exists a hyperplane dividing the different classes such that no data points fall within the margins (hard margin). However, this is not always the case, and to be able to use SVM in these cases the soft margin is introduced. The soft margin allows certain data points to fall within the margins, meaning that the hyperplane can separate most datapoints, but not all. The optimisation problem of the maximum margin calculation is now changed by introducing a loss function, equation 14

\text{minimise } \|w\| + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to } y_i(w \cdot x_i - b) \geq 1 - \zeta_i \qquad (14)

where \zeta_i is the slack variable, measuring the degree of misclassification of datapoint i, and C is the error penalty parameter. A large C indicates harder constraints, which ultimately converges to the hard-margin SVM, whilst a low value indicates that datapoints can indeed be present within the margin boundaries. [47]

Even though the original SVM was made to solve only linear problems, it is today possible to solve non-linear problems by the application of kernels. The idea is to transform the data non-linearly to a higher dimensional space where a linear classifier can be applied. Normally, transforming the data to a higher dimensional space would involve a lot of steps, including the calculation of \langle f(x), f(y) \rangle, where f is a mapping function from input space to a feature space and \langle \cdot, \cdot \rangle denotes the dot product. This implies that the transformation to feature space first requires the calculation of f(x) and f(y) and then the dot product between them. However, by applying the "kernel trick", it is possible to obtain the same result without all of these calculations. The caveat is that the transformation must be done with a kernel. The "trick" works because the optimisation step in SVM only uses pair-wise dot products \langle x, y \rangle, and these dot products can easily be replaced by a kernel function K(x, y). This will implicitly transform the data set to a higher-dimensional space in which the problem might be solvable, and at the same time the trick will save memory and reduce the computational cost. In e.g. a Gaussian SVM, a Gaussian kernel function is used to map the feature vector to a higher dimensional space. [47]
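A soft-margin Gaussian SVM of this kind can be sketched with fitcsvm from MATLAB's Statistics and Machine Learning Toolbox; the synthetic data and the value of the box constraint (the penalty C in equation 14) below are assumptions for the example.

% Sketch of a Gaussian-kernel soft-margin SVM on synthetic data.
rng(2);
X = [randn(50, 2); randn(50, 2) + 2];          % two loosely separated classes
Y = [-ones(50, 1); ones(50, 1)];               % labels in {-1, +1}

mdl  = fitcsvm(X, Y, ...
               'KernelFunction', 'gaussian', ...% the "kernel trick" with an RBF kernel
               'BoxConstraint', 1, ...          % C in equation (14): large -> harder margin
               'Standardize', true);
Yhat = predict(mdl, [0 0; 2 2]);                % classify two new points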

3.3.3 K-Nearest Neighbour

K-nearest neighbour (KNN) is among the simplest machine learning algorithms of today. It is an instance-based classifier, meaning that it does not train a model; instead the stored training instances are seen as the knowledge. [47]

The classification procedure of the KNN algorithm is straightforward. The training samples are stored and used as the knowledge of the classifier. This makes the training phase very fast since no real calculation is performed. Next, an unseen test sample is compared with the closest matching training sample. Different comparison measures can be used, and one of the simplest is the Euclidean distance, which measures the distance between the training and test sample. This is the approach when K = 1, i.e. when the closest training sample decides the class membership of the test sample. Only using the closest training sample as the decider of class membership is usually not recommended since it becomes sensitive to noise. [47]

When increasing the value of K, the membership of a test sample is majority voted upon by the K nearest training samples [47]. Consider figure 5, where the green dot is the test sample to be given a class membership. The inner circle corresponds to K = 3, which means that the test sample would be given the red triangle class membership. If K is increased to 5 (the dotted line in figure 5), the test sample is instead given the blue square class membership.
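A corresponding sketch with fitcknn (Statistics and Machine Learning Toolbox) is shown below, using the Euclidean distance and K = 5 as in the dotted circle of figure 5; the synthetic data are again only an illustration.

% Sketch of a K-nearest-neighbour classifier on synthetic data.
rng(3);
X = [randn(50, 2); randn(50, 2) + 2];
Y = [zeros(50, 1); ones(50, 1)];

knn  = fitcknn(X, Y, 'NumNeighbors', 5, 'Distance', 'euclidean');
Yhat = predict(knn, [1 1]);                     % majority vote among the 5 nearest samples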


3.3.4 Feature Selection

It is not uncommon for a data set to consist of a large number of features. This can result in a computationally heavy process, and increase the processing time when evaluating a classifier at a later stage. Thus, it is common to select a subset of features prior to the training stage, which both speeds up the overall processing time and helps protect against unwanted side effects of a large feature set, such as overfitting. [47]

Sequential Forward Floating Selection

To find an optimal set of features, different approaches can be used. A powerful approach is to use a so-called wrapper method, which is particularly good when relationships between features are of interest. In a wrapper approach, a subset of features is selected via a search procedure that often involves training a set of feature candidates on a classification model, to evaluate the performance of the candidates. Both a bottom-up approach, starting with an empty feature set and adding one feature at a time, and a top-down approach, starting with a full feature set and excluding one at a time, can be used in a wrapper method. [47]

In Sequential Forward Floating Selection (SFFS) a subset of features is to be selected. However, the number of possible subsets of size d can become very large depending on the size of the original feature set, D, according to equation 15. Thus it is necessary to decrease the number of subsets evaluated [48].

\binom{D}{d} = \frac{D!}{(D - d)!\, d!} \qquad (15)

One approach to decrease the number of subsets is to use a suboptimal search. First, each individual feature is evaluated against a classification method and the best one is selected. When the best feature has been selected, another feature is added, and the feature pair is evaluated. The addition of features continues until there is no longer an increase in performance when adding another feature from the feature space [48]. This method is known as sequential forward selection (SFS), and is the predecessor to SFFS. The problem with SFS is that once a feature is added it cannot be discarded, resulting in a suboptimal performance [49]. SFFS initially applies the method of SFS, where the feature that is most significant with respect to the subset X_k is added to form a new subset, X_{k+1}. This stage is called inclusion. The next stage is to exclude the least significant feature in the feature set X_{k+1}. If the newly added feature is the least significant feature, the inclusion of additional features continues. However, if the least significant feature is part of the previous subset of features, X_k, such that the evaluation criterion, J, becomes larger when removing it, the redundant feature is excluded and the remaining subset of features, X'_k, is kept. The exclusion of the least significant features continues until equation 16 becomes true. [49]

J(X'_k - x_s) \leq J(X_{k-1}) \qquad (16)

Here J is the evaluated criterion, such as a trade-off between sensitivity and specificity or the classification accuracy, and x_s is the least significant feature. When equation 16 holds true, the subset of features, X'_k, is set to X_k and the inclusion of features starts over again. The backtracking over the least significant feature for each newly added feature, together with the optimisation of the evaluation criterion, is known as the "floating" stage and is the difference between SFS and SFFS. Since the evaluation criterion is tested for each newly added feature, a better subset can be achieved compared to the original SFS method. If no improvement is made when adding new features to the subset, the process is ultimately terminated, and the optimum feature subset is said to be found. [49]

Since SFFS (and SFS) includes a classification stage, the computational cost can become large, especially for larger datasets. Also SFFS can be prone to overfitting if small datasets are used.
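The simplified sketch below outlines the inclusion and floating-exclusion loop in MATLAB. It is our own condensed illustration, not the thesis implementation, and the criterion handle crit (for example a cross-validated accuracy, higher being better) is assumed to be supplied by the caller.

function best = sffs_sketch(X, y, crit)
% Simplified SFFS sketch. X: feature matrix, y: labels, crit: function handle
% scoring a candidate feature matrix, e.g. crit = @(Xs, ys) <cross-validated accuracy>.
    best = [];                                 % indices of the current subset X_k
    remaining = 1:size(X, 2);
    bestScore = -inf;
    improved = true;
    while improved && ~isempty(remaining)
        improved = false;
        % Inclusion step: add the most significant remaining feature
        scores = arrayfun(@(f) crit(X(:, [best f]), y), remaining);
        [s, iAdd] = max(scores);
        if s > bestScore
            best = [best remaining(iAdd)];
            remaining(iAdd) = [];
            bestScore = s;
            improved = true;
            % Floating exclusion: remove the least significant feature as long
            % as the criterion keeps improving (cf. equation 16)
            while numel(best) > 2
                scores = arrayfun(@(i) crit(X(:, best([1:i-1, i+1:end])), y), 1:numel(best));
                [s, iDrop] = max(scores);
                if s > bestScore
                    remaining = [remaining, best(iDrop)];
                    best(iDrop) = [];
                    bestScore = s;
                else
                    break
                end
            end
        end
    end
end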

3.3.5 Classification evaluation

There are different ways of estimating the performance of a classifier. One approach, illustrated in figure 6, is to create a confusion matrix, which enables one to visualise the performance. It is divided into rows and columns, where each row refers to the instances of an actual class, whilst each column represents the instances of a predicted class.

Figure 6: Confusion matrix based upon sleepy and alert observations.

The most commonly used performance indicators derived from the confusion matrix when investigating driver sleepiness are accuracy, sensitivity and specificity. The accuracy is defined as the number of correctly classified samples divided by the total number of samples. It is derived from the confusion matrix by equation 17.

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (17)

Sensitivity, on the other hand, describes, in the sense of this thesis, the correctly classified sleepy participants, and is defined as equation 18.

\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (18)

Lastly, specificity is defined as equation 19, which is a measure of the correctly classified alert participants.

\text{Specificity} = \frac{TN}{TN + FP} \qquad (19)
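The measures in equations 17-19 can be computed from a confusion matrix as in the short MATLAB sketch below; the label vectors are made-up examples and confusionmat assumes the Statistics and Machine Learning Toolbox.

% Accuracy, sensitivity and specificity from a confusion matrix (eqs. 17-19).
yTrue = [1 1 1 0 0 0 0 0 1 0];                  % 1 = sleepy, 0 = alert (toy labels)
yPred = [1 0 1 0 0 0 1 0 1 0];

C  = confusionmat(yTrue, yPred);                % rows: actual class, columns: predicted
TN = C(1,1); FP = C(1,2); FN = C(2,1); TP = C(2,2);

accuracy    = (TP + TN) / (TP + TN + FP + FN);  % equation (17)
sensitivity = TP / (TP + FN);                   % equation (18)
specificity = TN / (TN + FP);                   % equation (19)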

Another performance indicator, which is not directly derived from the confusion matrix although it is based upon sensitivity and specificity, is the ROC curve (Receiver Operating Characteristic). It is a measure which only depends on the sensitivity and the false positive rate (FPR), i.e. 1 - specificity, and is useful when combining results from different datasets with different class distributions. Usually, the y-axis of the ROC curve refers to the sensitivity, whilst the x-axis is the FPR. Thus, the ideal prediction of a classifier would yield a point in the upper left corner of the ROC space. This implies that the classifier has classified every single observation correctly, regardless of, as in this thesis, the mental state of the participant. The diagonal divides the ROC space, and if the shape of the curve converges to a linear appearance the classifier's prediction performance decreases. Points above the diagonal represent classification results that are better than random guessing, whereas points below are signs of a poorer performing classifier. [47]

Accuracy can, in some cases, be quite a misleading measurement depending on the distribution of datapoints, i.e. if a class is highly overrepresented. If so, there is a risk that the classifier only predicts the overrepresented class, which subsequently yields a high accuracy even though it wrongfully classified the other class. The ROC curve captures a greater scope than the accuracy, since it enables one to see the classifier's performance regarding the underrepresented class as well. [47]

Another approach to consider in order to minimise the risk of a classifier that only predicts one specific class is to introduce a cost function. Usually a classifier contains some scoring procedure when predicting data, i.e. it assigns rewards and penalties to correctly and wrongfully classified datapoints. A cost function enables one to adjust this scoring procedure regarding the margin of what is classified as a particular class and what is not. The idea is to force the classifier to be more careful during its predictions. [47]

3.3.6 Data leakage

Data leakage is a common phenomenon within machine learning. One has to be observant if the prediction result exceeds the expected outcome, although this is not always the case. A classifier is considered good if it generalises well on unseen data. However, the classifier's performance on the training data should not differ by more than a few percent from the performance on the testing data. If the results differ significantly, i.e. the classifier generalises considerably better on the test data, it is possible that data leakage has occurred. Data leakage is a term used when the classifier has unintentionally been exposed to the sample data that is intended for the evaluation part. Since the algorithm has already seen the "unseen" data it can enhance its performance, but the output would be misleading. [54]

In order to avoid data leakage it is important to partition the data samples prior to the training and testing phases. Another aspect to consider is to exclude samples that have been included in the training phase of the classifier when evaluating its performance on the training data. If these samples are included, the outcome will be the same, i.e. enhanced performance, since the classifier has already been exposed to those particular samples. [54]
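One common way to guard against this, sketched below, is to set aside whole participants before any feature selection or training takes place. The variable names and the synthetic participant vector are assumptions for the example and do not describe the thesis's actual data split.

% Sketch of a leakage-free split: hold out whole drivers before training.
participantID = repelem(1:10, 20)';              % synthetic: 10 drivers, 20 epochs each
rng(4);
drivers  = unique(participantID);
testDrv  = drivers(randperm(numel(drivers), 2)); % reserve 2 drivers for testing only
isTest   = ismember(participantID, testDrv);
trainIdx = find(~isTest);                        % used for feature selection and training
testIdx  = find(isTest);                         % touched once, for the final evaluation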


4 Method

The aim of this chapter is to describe the chosen methods upon which the thesis is based. It is structured in chronological order, starting with the prephase, emphasising the importance of homogeneous data and time synchronisation. The chapter continues by explaining all the necessary steps, i.e. signal processing, feature extraction, feature selection and so on.

4.1 Prephase

As previously mentioned, the thesis is structured similarly to the workflow of the project. It began with reviewing articles within the same research area, as well as other fields of interest, in order to gain knowledge and ideas of how to approach the given problem.

In addition to the literature research, a programming environment had to be chosen for the signal processing, feature extraction and evaluation of different classifiers. MATLAB was chosen, after some consideration, due to previous knowledge of the software and its user friendliness regarding signal processing tasks. MATLAB was installed on four separate computers provided by VTI, two stationary and two laptops. The stationary computers were accessible remotely from the laptops and were utilised when additional processing power was needed.

As stated earlier, each respective experiment was performed on a separate occasion, which had resulted in different configurations of how the retrieved data were named and stored. The first step of solving the problem at hand was therefore to establish a homogeneous data structure throughout the three independent experiments. It is possible to ignore this particular part, however there is a risk of losing coherency and operability since the algorithms would have to be adjusted depending on the arrangement of each individual data structure. Once the structure of the participants was established, illustrated in figure 7, the CAN and physiological data had to be time synchronised. This is mainly to align the measurements such that values are extracted at the same time instances. If the synchronisation is wrongfully executed, it is likely to produce erroneous results. Lastly, some vectors in the same structure varied in length. These were adjusted to the size of the time synchronisation vector by interpolation, resulting in equal length of all vectors in the structure.


Figure 7: The resulting data structure of each participant

4.2 Signal processing

The following section describes the different signal processing stages which were performed, removing unwanted noise and artifacts. Each respective physiological signal has been treated in the same systematic manner, i.e. by visual inspection and filtering according to the inspection in combination with recommendations from the studied literature.

Note that the provided physiological signals had already been somewhat filtered, since the digital recorder, Vitaport, is equipped with such options. However, further corrections were made regarding the refinement of each signal, which are highlighted in the sections below. Furthermore, all frequencies were normalised according to the definition, that is, the frequency divided by half of the sampling frequency, also known as the Nyquist frequency [14].

4.2.1 Filtering

There have been several filtering methods to which the physiological signals have been exposed throughout the project. Initially, the procedure of finding appropriate filters began with empirical tests, along with suggestions found in the literature. Since every signal contains different amounts of disturbances and artifacts from the electronics and other physiological sources, a trade-off had to be made regarding the acceptance of noise post filtering.


For all signals, a zero-phase digital filter has been applied, which filters the input data in both the forward and reverse directions [30].

EEG: The EEG data differ from the other physiological measurements since all three channels were used to extract features from. The filtering has therefore been applied to each respective channel. Firstly, the EEG signals were processed through a band-pass filter, more specifically a Chebyshev type II, applying the cheby2 command, which is a predefined function in MATLAB's signal processing toolbox. Cheby2, also known as the inverse Chebyshev filter, has a flat passband and can be applied in cases where the frequency content is more important than having a constant amplitude [32]. The band-pass filter was adjusted to the frequencies of interest, i.e. a lower cutoff limit of 4 Hz and an upper cutoff limit of 30 Hz, which captures the beta, theta and alpha frequency bands. Figure 8 illustrates the differences of a filtered signal compared to its original shape.


Figure 8: Filtered EEG signal, illustrated in red, compared to its original state, illus-trated in dark grey.
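A minimal sketch of this band-pass stage is given below; the filter order (4) and the 40 dB stopband attenuation are assumed values chosen for the example, not parameters reported in the thesis, and the signal is a synthetic stand-in.

% Chebyshev type II band-pass (4-30 Hz) applied with zero-phase filtering.
fs = 256;                                       % EEG sampling frequency [Hz]
eeg_raw = randn(10*fs, 1);                      % stand-in for one EEG channel
Wn = [4 30] / (fs/2);                           % cutoffs normalised by the Nyquist frequency
[b, a] = cheby2(4, 40, Wn);                     % two-element Wn gives a band-pass design
eeg_filt = filtfilt(b, a, eeg_raw);             % forward-reverse, zero-phase filtering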

In addition to the refinement of the EEG data, an artifact removal process was applied, since the EEG recordings had noticeably more prominent artifacts than the ECG and EOG data. The algorithm itself is rather simple, focusing on amplitude variations, and discards amplitudes that surpass a fixed threshold. The threshold is calculated by multiplying the root mean square of the signal with a constant, C. In MATLAB the predefined function rms is used to calculate the root mean square.

The value of the constant, C, was set to 3.1, and was found through empirical testing to be satisfactory since it located all visible artifact peaks without the risk of removing actual data. Indices where the amplitudes are greater than the calculated threshold are stored in a vector. The algorithm iterates in reverse order through the vector and discards each respective peak. Once processed, the output has the appearance as in figure 9, where the red coloured signal has been treated. The dark grey coloured signal refers to the original state.


Figure 9: A comparison between untreated, dark grey coloured, and treated, red coloured, EEG signal.
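The thresholding step can be sketched as below, reusing eeg_filt from the previous sketch; the vectorised removal is our simplification of the reverse-order loop described in the text, but has the same effect.

% Amplitude-threshold artifact rejection: discard samples above C times the RMS.
C   = 3.1;                                      % empirically chosen constant (see text)
thr = C * rms(eeg_filt);                        % rms() from the Signal Processing Toolbox
artifactIdx = find(abs(eeg_filt) > thr);        % indices exceeding the threshold
eeg_clean = eeg_filt;
eeg_clean(artifactIdx) = [];                    % remove the flagged samples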

EOG: Fortunately, the filtering process of the EOG data was relatively simple due to their good original condition. In this case a 5th order low-pass Butterworth filter with a cutoff frequency of 11.52 Hz, using the butter command which is a predefined function in MATLAB, helped to reduce excessive noise, illustrated in figure 10. The cutoff frequency and choice of filter originated from a provided blink algorithm created by Jammes B. et al. [17].


Figure 10: Filtered EOG signal, illustrated in red, compared to its original state, illustrated in dark grey.


ECG: Lastly, the ECG signals were, like the EEG, treated with a band-pass filter, however in this case a 5th order Butterworth. The chosen lower cutoff limit was 0.5 Hz whilst the upper cutoff limit was set to 45 Hz, which is usually the span where the dominating frequencies are located [13]. Figure 11 is an example of how the resulting signal, coloured red, will appear when processed through the filter. Notice that the baseline wandering has been significantly reduced compared to the original shape.


Figure 11: Filtered ECG signal, illustrated in red, compared to its original state, illustrated in dark grey.
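The EOG and ECG stages can be sketched analogously; the synthetic signals below stand in for the recorded channels, and note that butter with a two-element cutoff vector returns a band-pass design of twice the stated order.

% Zero-phase Butterworth filtering of the EOG (low-pass) and ECG (band-pass).
fs_eog = 512;  eog_raw = randn(10*fs_eog, 1);   % stand-in EOG channel
fs_ecg = 256;  ecg_raw = randn(10*fs_ecg, 1);   % stand-in ECG channel

[b, a]   = butter(5, 11.52/(fs_eog/2), 'low');          % 5th-order low-pass, 11.52 Hz
eog_filt = filtfilt(b, a, eog_raw);

[b, a]   = butter(5, [0.5 45]/(fs_ecg/2), 'bandpass');  % band-pass, 0.5-45 Hz
ecg_filt = filtfilt(b, a, ecg_raw);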

4.3 Dividing data into KSS epochs

The division of data into KSS epochs has been approached in two different ways, as stated earlier in section 1.2. The first approach defines an epoch as the preceding minute(s) leading up to a lane departure, whilst the second approach divides the signal into epochs with no consideration of lane departures.

4.3.1 Epochs based on lane departure and KSS

The first approach treats lane departures as a lack of control due to sleepiness. Initially, every lane departure caused by an unintentional act of the driver had to be determined. Documentation from the experiments was provided by VTI, where notifications concerning the date and time of the lane departure events from the lane tracker sensor were stored. In addition to the saved time points was a binary score, 1/0, indicating whether a lane departure had occurred intentionally or not. Note that a score of 1 refers to an unintentional act, whilst a score of 0 is an intentional act made by the participant. In some cases the binary score from the lane tracker was misleading, where obvious lane violations had been wrongfully classified. Corrections were made by analysing the video recordings and manually encoding the correct binary score for each respective participant.


Once the process of adjusting and establishing the lane departures was accomplished, stage number two was initialised. This stage comprised determining the lane departures of interest and finding the corresponding baseline events. This was achieved by only utilising lane departures where the driver had described his/her physical status as an 8 or 9 on the KSS scale. These values correspond to being considerably tired, table 1, and were mostly found, not unexpectedly, in trial 3, the night trials.

Although there are many ways to establish the baseline events, this thesis has based them upon KSS in combination with the location of an occurred lane departure. In this case, a baseline event can be defined as a normal driving state where the participants are considered alert. By investigating the participant's KSS score in trial 1 and 2 at the same location where a lane departure has been noted in trial 3, one obtains exactly the same driving environment. However, in order to be classified as a baseline event the participant has to have a KSS value lower than 7. The value 7 has been excluded since this is recognised as the transition period from an alert to a drowsy state. This yields a wider separation between the two classes and thus minimises the differences in the sleepiness estimates given by the participants [15]. As can be seen in figure 12, the lane departure in trial 3 is aligned with its respective baseline event from trial 1 and 2. Note the window length, which determines the duration of each epoch. The length of the window was initially set to 5 minutes, but since no overlapping of the lane departure windows was allowed, the window size had to be decreased to 1 minute due to very few data points for longer window lengths. The reason for not accepting overlap between lane departure windows was that each epoch should not be distorted by other lane departures. Further elaboration regarding the window size, along with this approach in general, is given in section 6.


Figure 12: The arrangement of the first method.

4.3.2 Epochs based solely on KSS

The second approach differs considerably from the abovementioned technique, although they share KSS as an important pillar. This is a relatively straightforward way of dividing the data into epochs. The two main concerns are the time points at which the experiment starts and finishes, since the recording of the measurements is initialised before the experiment and sometimes continues after the actual experiment is finished. Knowing these time points enables one to divide the data into epochs of 2.5 minutes each, as figure 13 illustrates. As in the first approach, KSS values of less than 7 were regarded as an indicator of alertness, while KSS values of 8 and 9 were an indicator of sleepiness; as in method 1, a KSS value of 7 was discarded. In cases where the last KSS epoch was shorter than 2.5 minutes, the epoch was discarded in order to form homogeneous epochs over the entire duration of the experiment. The length of each epoch was chosen for several reasons. Since a KSS value was reported every 5 minutes, 2.5-minute epochs divide the reporting interval evenly, yielding two epochs per 5 minutes. Furthermore, a window of at least 2 minutes is recommended by the Task Force of The European Society of Cardiology and The North American Society of Pacing and Electrophysiology when using ECG signals in the frequency domain [28].
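To make the epoch construction concrete, a minimal MATLAB sketch of the second method is given below. The variable names (sig, kss), the sampling rate and the assumption that each 5-minute KSS report covers the two following 2.5-minute epochs are illustrative assumptions, not the exact implementation used in the thesis.

    fs       = 512;                                    % assumed sampling rate [Hz]
    epochLen = round(2.5 * 60 * fs);                   % 2.5-minute epochs in samples
    nEpochs  = floor(numel(sig) / epochLen);           % an incomplete final epoch is discarded
    epochs   = reshape(sig(1:nEpochs*epochLen), epochLen, nEpochs)';  % one epoch per row
    kssEpoch = kss(ceil((1:nEpochs) / 2));             % two epochs share each 5-minute KSS report
    labels   = nan(nEpochs, 1);
    labels(kssEpoch <  7) = 0;                         % alert (KSS < 7)
    labels(kssEpoch >= 8) = 1;                         % sleepy (KSS 8-9); KSS 7 epochs stay unlabelled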


Figure 13: The approach of the second method, where the signal has been divided into epochs based on their KSS

4.4 Features

Figure 14 illustrates the procedure from a raw input signal to an extracted feature. The preprocessing of the data is the same across the physiological signals; however, the feature extraction differs. Below follows a short description of the usefulness of each respective feature and how it was extracted.

Figure 14: Flowchart of the feature extraction process, where input data is a raw signal.

4.4.1 EEG - frequency features

Analysing frequency features has, throughout the years, been a well-established method for determining the mental state of a person. The electrical impulses captured in an EEG signal have a wide variety of rhythms and wave forms, and are conventionally characterised by their frequency range and relative amplitude [13].


Table 2 contains the frequency features that were extracted during the project. These features were aimed at investigating the high-frequency as well as the low-frequency content of the provided EEG signals. High-frequency/low-amplitude rhythms have been shown to correlate with an active brain associated with alertness or dream sleep. Low-frequency/high-amplitude rhythms, on the other hand, are usually associated with drowsiness and non-dreaming sleep states [13]. Since the brain is fairly complex, extracting different descriptive features can ease the process of understanding the entirety of one's mental state.

Nr       Feature
1 - 3    Absolute Power Theta
4 - 6    Absolute Power Alpha
7 - 9    Relative Power Theta
10 - 12  Relative Power Alpha
13 - 15  (θ + α)/β
16 - 18  α/β
19 - 21  (θ + α)/(α + β)
22 - 24  θ/β

Table 2: Extracted EEG features from the frequency domain

Initially, the power spectral density (PSD) had to be determined in order to calculate the respective absolute powers of Theta and Alpha. There are several approaches for estimating the PSD; in this project it was acquired by using MATLAB's built-in functions fft and xcorr. In addition to the PSD, the corresponding frequency vector was calculated.
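As an illustration, a minimal sketch of this PSD estimate is given below; the 'biased' autocorrelation scaling, the mean removal and the sampling rate are assumptions rather than a description of the exact implementation.

    fs   = 512;                                 % assumed EEG sampling rate [Hz]
    x    = eegEpoch - mean(eegEpoch);           % one 2.5-minute EEG epoch, mean removed
    rxx  = xcorr(x, 'biased');                  % autocorrelation estimate
    psdE = abs(fft(rxx));                       % PSD via the Wiener-Khinchin relation
    f    = (0:numel(rxx)-1) * fs / numel(rxx);  % corresponding frequency vector [Hz]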

Once the PSD and its corresponding frequency vector had been established, the process of calculating the absolute powers of Theta and Alpha could be initiated. The absolute band power is the sum of the PSD within a certain frequency range. In laboratory studies, the theta band of an EEG has been proven to be an indicator of sleepiness [38][39], while sleepiness in active individuals in a real environment has been more associated with the alpha band of the EEG [40][41]. Since both the theta and alpha band have been proven to indicate sleepiness, but in different situations, features from both bands were extracted.

In total, there were three different frequency bands of interest from which values were extracted. However, the third one, the Beta band, was only used for additional calculations and did not serve as a feature itself. Furthermore, the relative power was obtained by dividing the frequency band of choice by the sum of the extracted absolute powers of the Theta and Alpha bands. Lastly, features 13-24 [31] are all calculated as illustrated in table 2.
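A minimal sketch of the band power and ratio calculations is shown below, reusing psdE and f from the PSD step. The band limits (4-8, 8-13 and 13-30 Hz) are conventional values and assumptions here, since the exact limits are not restated in this section.

    theta = sum(psdE(f >= 4  & f < 8));         % absolute Theta power (features 1-3)
    alpha = sum(psdE(f >= 8  & f < 13));        % absolute Alpha power (features 4-6)
    beta  = sum(psdE(f >= 13 & f < 30));        % Beta power, used only in the ratios
    relTheta = theta / (theta + alpha);         % relative Theta power (features 7-9)
    relAlpha = alpha / (theta + alpha);         % relative Alpha power (features 10-12)
    ratios = [(theta + alpha)/beta, alpha/beta, (theta + alpha)/(alpha + beta), theta/beta];  % features 13-24, per channel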


4.4.2 ECG

From the ECG signal a number of different features can be extracted, both in the time domain and in the frequency domain. Although not as frequently used as EEG in the field of driver sleepiness, the ECG carries a lot of useful information in both domains. In driver sleepiness research the main area of interest is the frequency domain of the heart rate variability (HRV), especially the low- and high-frequency bands [18][19]. However, features of interest can also be extracted from the time domain and used for estimating driver sleepiness, such as the root mean square of the successive differences between RR-intervals (RMSSD), which might be correlated to the high-frequency band of the HRV and thus to sleepiness [20]. Table 3 illustrates the features extracted from the ECG signal.

Nr   Feature
25   LF/HF-ratio
26   Relative Power HF
27   Relative Power LF
28   RMSSD

Table 3: Features extracted from the frequency- and time domain of the ECG signals

To estimate the LF/HF-ratio, the R-peak of each heartbeat was extracted by squaring the filtered signal and removing eventual baseline wander, to emphasise each peak, and then using the built-in function findpeaks in MATLAB. In findpeaks, the minimum distance between R-peaks was set to 200 ms, since this is approximately the refractory period during which no new R-peaks can be introduced [13]. By taking the time differences between successive R-peaks, the RR-interval series, i.e. the HRV signal, was obtained. Once the PSD of the HRV had been computed, the frequencies of interest could be extracted. The PSD used was the Lomb-Scargle periodogram, calculated with the built-in function plomb in MATLAB. The advantage of the Lomb-Scargle periodogram is that it can estimate the power spectrum of unevenly sampled data, such as HRV, without the need for interpolation [21]. By calculating the absolute power of the low- (0.04-0.15 Hz) and high-frequency (0.15-0.4 Hz) bands, the LF/HF-ratio could be obtained. An example of the different frequency bands is illustrated in figure 15.
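The steps above can be summarised in the following minimal sketch; the sampling rate, the variable ecgSq (the filtered, baseline-corrected and squared ECG) and the use of trapz for the band powers are assumptions made for illustration.

    fs = 512;                                                        % assumed ECG sampling rate [Hz]
    [~, locs] = findpeaks(ecgSq, 'MinPeakDistance', round(0.2*fs));  % R-peaks, 200 ms refractory period
    tR = locs / fs;                                                  % R-peak times [s]
    rr = diff(tR);                                                   % RR-intervals [s]
    [pxx, f] = plomb(rr, tR(2:end));                                 % Lomb-Scargle PSD of the unevenly sampled HRV
    lfMask = f >= 0.04 & f < 0.15;                                   % low-frequency band
    hfMask = f >= 0.15 & f <= 0.4;                                   % high-frequency band
    lf = trapz(f(lfMask), pxx(lfMask));                              % absolute LF power
    hf = trapz(f(hfMask), pxx(hfMask));                              % absolute HF power
    lfHfRatio = lf / hf;                                             % feature 25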



Figure 15: The PSD of the HRV for a participant. The blue coloured area defines the low frequency band (0.04-0.15 Hz), whilst the red area defines the high frequency band (0.15-0.4 Hz)

The relative power was calculated by dividing the power of the low- or high-frequency band, depending on which is sought, by the sum of the power of both bands.

To calculate the RMSSD, the RR-intervals were acquired with the same approach as in the LF/HF-ratio case. The built-in function rms in MATLAB was used to calculate the root mean square of the successive RR-interval differences.
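Features 26-28 then follow directly, as in the minimal sketch below, reusing lf, hf and rr from the sketch above.

    relHF = hf / (lf + hf);        % feature 26: relative HF power
    relLF = lf / (lf + hf);        % feature 27: relative LF power
    rmssd = rms(diff(rr));         % feature 28: RMSSD of the RR-intervals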

4.4.3 EOG

Another popular method when investigating driver sleepiness is the measurement of eye activity. This approach of estimating one's mental state by observing blink patterns, blink durations, etc. is well documented and has proven to be a successful measure of drowsiness [23]. Several articles have highlighted the usage of EOG measurements in driver sleepiness research [22][24]. Table 4 lists the resulting features extracted from the EOG signal.

Note that the entire EOG feature set has been extracted with an automatic blink detection algorithm developed by Jammes B. et al. [17], provided by VTI. Below follows a description of how the features were retrieved when applying the algorithm.


Nr   Feature
29   Mean Blink Duration
30   Mean Blink Amplitude
31   Mean Lid Closure Speed
32   Mean Lid Opening Velocity
33   90th Percentile of Blink Duration
34   90th Percentile of Blink Amplitude
35   90th Percentile of Lid Closure Speed
36   90th Percentile of Lid Opening Velocity
37   Number of Blinks per Minute

Table 4: Features extracted from the time domain of the EOG signal

As previously mentioned, the EOG data acquisition included horizontal and vertical channels; however, only the vertical channel was used. The vertical channel records the blinks made by the participant, and applying the algorithm enables the blink duration to be extracted. Firstly, the algorithm preprocesses the acquired data by low-pass filtering. It then computes the derivative of the signal, which is used to search for time sequences where the derived signal exceeds one threshold and falls below another threshold within a short time period. If the amplitude of the original signal in such a time sequence exceeds a subject-specific threshold, the sequence is regarded as a blink event. The subject-specific threshold is estimated from a calibration segment prior to the feature extraction process, which in this case was the first 10 minutes of the EOG signal.
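The description above can be illustrated with the following loose sketch. It is a simplification and not the Jammes et al. implementation; the filter order, cut-off frequency and thresholds are illustrative assumptions.

    fs = 512;                                     % assumed EOG sampling rate [Hz]
    [b, a] = butter(4, 10/(fs/2));                % low-pass filter, assumed 10 Hz cut-off
    eogF = filtfilt(b, a, vertEog(:));            % filtered vertical EOG channel
    dEog = [0; diff(eogF)] * fs;                  % derivative of the filtered signal
    speedThr = 3 * std(dEog);                     % illustrative eyelid speed threshold
    ampThr   = prctile(eogF(1:10*60*fs), 95);     % illustrative subject-specific amplitude
                                                  % threshold from the first 10 minutes
    closing  = dEog >  speedThr;                  % candidate closing phases
    opening  = dEog < -speedThr;                  % candidate opening phases
    % A candidate is accepted as a blink when an opening phase follows a closing phase
    % within a short time period and the filtered amplitude in between exceeds ampThr.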

Figure 16: Not authorized to show the image due to copyright reasons.

The blink amplitude was extracted by locating the maximum value of the filtered EOG signal within the located blink complex (debh-finb, see figure 16). The lid closure speed was extracted by taking the derivative of the segment where the EOG signal exceeded the eyelid speed threshold for the closing phase. The lid opening speed applies the same approach, taking the derivative of the values below the eyelid opening threshold.

Features 33-36 were calculated by taking the 90th percentile of the values underlying features 29-32 (prior to the mean calculation of each feature). The 90th percentile is a robust way of looking at blinks of longer duration. The percentile was acquired by applying MATLAB's built-in function prctile. Lastly, feature 37 was calculated by counting the number of accepted blinks per epoch and dividing it by 2.5 (the window length in minutes). Decimals were rounded to the nearest integer.
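A minimal sketch of features 33-37 is given below, assuming that the blink detection step returns per-epoch vectors of blink durations, amplitudes and lid speeds (the variable names are assumptions).

    dur90   = prctile(blinkDur,   90);            % feature 33: 90th percentile of blink duration
    amp90   = prctile(blinkAmp,   90);            % feature 34: 90th percentile of blink amplitude
    close90 = prctile(closeSpeed, 90);            % feature 35: 90th percentile of lid closure speed
    open90  = prctile(openSpeed,  90);            % feature 36: 90th percentile of lid opening velocity
    blinksPerMin = round(numel(blinkDur) / 2.5);  % feature 37: accepted blinks per minute (2.5-min epoch)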

4.4.4 Complexities

Complexity measurements are not widely applied within the field of driver sleepiness detection. However, some researchers have considered complexity features as a contributing factor when estimating sleepiness while driving [5][29]. In that context, the most frequently evaluated physiological signal appears to be the EEG. Complexity measurements are, however, popular in other research areas such as speech recognition, and it is therefore of interest to investigate whether complexity features evaluated on different physiological signals can assist in creating a robust classifier.

Nr       Feature
38 - 40  Higuchi Fractal Dimension, EEG
41       Higuchi Fractal Dimension, EOG
42       Higuchi Fractal Dimension, ECG
43 - 45  Sample Entropy, EEG
46       Sample Entropy, EOG
47       Sample Entropy, ECG

Table 5: Extracted complexity features

The extraction of the Higuchi Fractal Dimension and Sample Entropy features, see table 5, was based upon the work of Jesús Monge-Álvarez [34]. HFD requires one parameter, as previously mentioned, i.e. kmax. The parameter kmax, which is a fixed constant, had to be determined before values could be extracted. Unfortunately Higuchi, the author of the method, selected two kmax values without extensively elaborating on why these values were chosen [33]. There is, as of today, no widely accepted method for estimating kmax; however, there are several different approaches. One particular method involves plotting complexity values over different values of kmax. According to Gómez et al., the optimal kmax is reached when the curvature of the plot begins to plateau [27]. The same methodology was adopted in this thesis, however a fixed window (2.5 min) was placed over each respective physiological signal for all
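For reference, a minimal sketch of Higuchi's fractal dimension following the original definition [33] is given below; it is an illustration rather than the Monge-Álvarez implementation used in the thesis, and x denotes one 2.5-minute epoch of a physiological signal.

    function fd = higuchiFd(x, kmax)
        % Higuchi's fractal dimension of the time series x for a given kmax.
        N  = numel(x);
        Lk = zeros(kmax, 1);
        for k = 1:kmax
            Lm = zeros(k, 1);
            for m = 1:k
                idx   = m:k:N;                                              % subsampled series for offset m
                nSeg  = numel(idx) - 1;                                     % number of increments
                Lm(m) = (N - 1) / (nSeg * k) * sum(abs(diff(x(idx)))) / k;  % normalised curve length
            end
            Lk(k) = mean(Lm);                                               % mean curve length at scale k
        end
        p  = polyfit(log(1 ./ (1:kmax)'), log(Lk), 1);                      % fit log L(k) against log(1/k)
        fd = p(1);                                                          % slope = Higuchi's fractal dimension
    end

Plotting higuchiFd(x, k) for increasing k and looking for the plateau described by Gómez et al. then gives one way of selecting kmax.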
