
Linköping University

Department of Biomedical Engineering

MASTER THESIS

Deep learning to classify driver sleepiness

from electrophysiological data

AUTHORS

Ida Johansson

Frida Lindqvist

SUPERVISORS

Christer Ahlström

Martin Hultman

EXAMINER

Ingemar Fredriksson

June 12, 2019

LIU-IMT-TFK-A–19/571–SE


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.



Abstract

Deep learning to classify driver sleepiness from electrophysiological data

Ida Johansson and Frida Lindqvist

Driver sleepiness is a cause of crashes, and it is estimated that 3.9 to 33 % of all crashes may be related to sleepiness at the wheel. An objective measurement of driver sleepiness is desirable, as it reduces sensitivity to subjective variation, and using deep learning for classification of driver sleepiness could be a step toward such a measure. In this master thesis, deep learning was used to investigate classification of electrophysiological data, electroencephalogram (EEG) and electrooculogram (EOG), from drivers into levels of sleepiness. The EOG reflects eye position and the EEG reflects brain activity. Initially, the intention was to also include electrocardiogram (ECG), which reflects heart activity, but these data were later excluded. Both raw time series data and data transformed into time-frequency domain representations were fed into the developed neural networks, and for comparison, manually extracted features were used in a shallow neural network architecture. EOG and EEG data were investigated both separately and combined as input. The data were labeled using the Karolinska Sleepiness Scale, and the scale was divided either into the two labels "fatigue" and "alert" for binary classification, or into five labels for comparison of classification and regression. The effect of example length was investigated using 150 seconds, 60 seconds and 30 seconds of data.

Different variations of the main network architecture were used depending on the data representation. The best result was obtained using a combination of a convolutional neural network (CNN) and a long short-term memory (LSTM) network with time-distributed 150 second EOG data as input. The accuracy was in this case 80.4 %, and the majority of both alert and fatigue epochs were classified correctly, with 85.7 % and 66.7 % respectively. Using the optimal threshold from the created receiver operating characteristics (ROC) curve resulted in a more balanced classifier, with 76.3 % correctly classified alert examples and 79.2 % correctly classified fatigue examples. The results from the EEG data, both in terms of accuracy and distribution of correctly classified examples, were less promising than for the EOG data. Combining EOG and EEG signals slightly increased the proportion of correctly classified fatigue examples; however, more promising results were obtained when balancing the classifier for EOG signals alone. The overall result of this project shows that there are patterns in the data connected to sleepiness that the neural network can find, which makes further work on applying deep learning to the area of driver sleepiness interesting.


Acknowledgement

Firstly, we would like to thank our supervisors Martin Hultman and Christer Ahlström for support and advice throughout the thesis work, and Ingemar Fredriksson for being the examiner. Secondly, we would like to express our gratitude to VTI and Anna Anund for providing the master thesis, the data and a warm welcome. For practical help with computers and graphics cards we would also like to thank Peter Ståhl (VTI) and Anders Eklund (IMT). The Titan X used for this research was donated by the NVIDIA Corporation.

Lastly, we are grateful for friends, family and colleagues who have provided us with energy and positivity throughout this thesis work.

Linköping, 2019 Ida Johansson and Frida Lindqvist


List of Abbreviations

AI Artificial intelligence
CNN Convolutional neural network
ECG Electrocardiography
EDF European Data Format
EEG Electroencephalogram
EOG Electrooculogram
EMD Empirical mode decomposition
EMG Electromyography
ELU Exponential linear unit
HHT Hilbert-Huang transform
HRV Heart rate variability
IMF Intrinsic mode function
IMT Department of Biomedical Engineering
KSS Karolinska Sleepiness Scale
LSTM Long short-term memory
PSD Power spectral density
PReLU Parametric rectified linear unit
ReLU Rectified linear unit
RNN Recurrent neural network
ROC Receiver operating characteristics


Contents

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Question formulations
  1.4 Limitations
2 Background
  2.1 Experiments
  2.2 Related work
3 Theory
  3.1 Sleepiness
    3.1.1 Karolinska sleepiness scale
  3.2 Electrophysiological data
    3.2.1 Electrooculogram, EOG
    3.2.2 Electroencephalogram, EEG
    3.2.3 Electrocardiogram, ECG
  3.3 Signal processing
    3.3.1 Spectrogram
    3.3.2 Hilbert-Huang transform
    3.3.3 Welch power spectral density estimate
  3.4 Deep learning
    3.4.1 Convolutional neural networks, CNN
    3.4.2 Recurrent neural networks, RNN
    3.4.3 Bidirectional RNNs
    3.4.4 Autoencoder
    3.4.5 Overfitting
    3.4.6 Evaluation methods
4 Method
  4.1 Prephase
  4.2 Signal processing
    4.2.1 EOG
    4.2.2 EEG
    4.2.3 ECG
  4.3 Dividing data into KSS epochs
  4.4 Data representation
    4.4.1 EOG
    4.4.2 EEG
  4.5 Network architecture
    4.5.1 Autoencoder
    4.5.2 CNN-LSTM for 1D time series data
    4.5.3 CNN-LSTM for EEG spectrograms as input data
    4.5.4 Network for Hilbert-Huang data
    4.5.5 CNN-LSTM with subnetworks for EEG and EOG 1D time series data
    4.5.6 CNN-LSTM with EEG and EOG 1D time series data as separate channels
    4.5.7 Shallow network for EOG and EEG features for comparison
    4.5.8 Summary of used network architectures
  4.6 Training of neural network
  4.7 Evaluation
5 Results
  5.1 EOG
    5.1.1 Binary classification of raw 1D EOG with CNN-LSTM
    5.1.2 Binary classification of EOG expert features with shallow network
    5.1.3 Multi-class and regression comparison using EOG data
  5.2 EEG
    5.2.1 Binary classification of raw 1D EEG with CNN-LSTM
    5.2.2 Binary classification of EEG spectrograms with CNN-LSTM
    5.2.3 Binary classification of EEG Hilbert-Huang transform with CNN-LSTM
    5.2.4 Binary classification of EEG expert features with shallow network
  5.3 EOG and EEG combined
    5.3.1 Binary classification of raw 1D EOG and EEG in separate channels with CNN-LSTM
    5.3.2 Binary classification of raw 1D EOG and EEG in separate subnetworks with CNN-LSTM
    5.3.3 Binary classification of EOG and EEG expert features in separate subnetworks with shallow network
  5.4 The effect of increasing the number of training examples
6 Discussion
  6.1 EOG
  6.2 EEG
  6.3 EOG and EEG combined
  6.4 The effect of increasing the number of training examples
  6.5 Feature extraction
  6.6 Method discussion
  6.7 Decision not to use ECG signals
  6.8 Imbalanced data
  6.9 Karolinska sleepiness scale reliability and noisy labels
  6.10 Limitations
  6.11 Lessons learned
  6.12 Future work


1 Introduction

The chapter begins with an introduction to the master thesis, followed by the aim, the research questions posed and the delimitations of the thesis. For this report it is important to separate sleep from sleepiness. Sleep is a state with a periodic suspension of consciousness and inhibition of nearly all voluntary muscles [1]; furthermore, the ability to react to stimuli is decreased. Sleepiness, however, is the need to go to sleep, and the symptoms may vary depending on how tired the subject is [2]. The experienced level of sleepiness is subjective and can differ between individuals. It would therefore be useful to have an objective measurement of sleepiness, providing better tools for understanding the physiological changes when going from alert to sleepy. This is the main research focus of this master thesis.

Obtaining crash statistics on driver sleepiness is not an easy task. One reason is that sleepiness can be interpreted differently and is a collective name for several driver conditions [3]. In case of a crash, the adrenalin level most likely rises and the feeling of being tired disappears [3]. A driver who has been in a crash might not always be willing to contact the police, and accident reports filed by the police may not mention sleepiness as a cause of the crash [3].

Deep learning has been around since 1979, but interest in the area has accelerated in recent years. Today, there is active research on applying deep learning to solve complex problems in almost all research areas. Deep learning is a branch of machine learning in which the model tries to learn patterns and important features from raw data automatically. The availability of large datasets and increased computing power have revolutionized the deep learning area and greatly improved the predictive capacity of today's computing devices. Deep learning models can be applied to, for example, image, audio and text data. [4]

1.1 Motivation

Driver sleepiness is a cause of crashes in traffic and may result in death or serious injuries [5]. In a study conducted in Auckland, New Zealand, during 1998 to 1999, all crashes where the drivers were admitted to hospital or died were identified and analyzed [6]. The result of the study showed a strong association between driver sleepiness and the risk of a car crash [6]. A similar conclusion is reported in a systematic review on crash risks related to driver sleepiness by Bioulac et al. [7] from 2017. They conclude that there is an increased risk of a motor vehicle crash for drivers experiencing sleepiness at the wheel, and they report increasing evidence that sleepiness at the wheel is one of the most important factors related to the risk of motor vehicle crashes. International research shows that as much as 3.9 to 33 % of all traffic crashes might be a result of driver sleepiness [7].

Sleepiness may affect driving performance, such as visual awareness, maneuvering and tracking the speed of the vehicle [5]. A monotonous driving experience can make an alert driver tired [5]. Classification of driver sleepiness could be a step toward an objective measurement of sleepiness, which would be beneficial since it decreases the dependency on subjective measures; a subjective assessment of sleepiness is more sensitive to subject variations. Deep learning has already been used to classify different stages of sleep with successful outcomes [8]. Work on classifying driver sleepiness according to the Karolinska Sleepiness Scale (KSS) using machine learning (support vector machines, k-nearest neighbour, adaptive boosting and random forests) has been carried out previously [9]. However, detection of different stages of sleepiness according to KSS using deep learning is, to our knowledge, an unexplored area.

Deep learning often requires large amounts of data, and given one of the world's largest datasets of electrophysiological data, collected both on real roads and in simulators, it should be possible to apply deep learning models to investigate patterns in the data.


1.2 Aim

The aim of the master thesis is to investigate classification of electrophysiological data from drivers into levels of sleepiness by the use of deep learning. The goal is to study whether a developed neural network can find patterns characteristic of driver sleepiness in the electrophysiological data. A dataset of driver participants collected from numerous experiments will be used and divided into training examples based on the Karolinska Sleepiness Scale (KSS). KSS is a subjective measurement of sleepiness and is described further in section 3.1.1.

1.3 Question formulations

• Is the amount of data provided enough to train a deep learning network for classification of sleepiness?

• Which will give the best result, feeding the network with raw time series data or the data transformed into time-frequency domain representation?

• How will the classification results be affected if the network includes subnetworks for the different electrophysiological signals?

1.4 Limitations

The thesis is limited to implementing neural networks using the Keras library in Python. Only data provided from experiments conducted by VTI will be used. KSS will be used as labels indicating the level of sleepiness, and no other manual assessment of the level of sleepiness of the data will be performed.

The goal of this thesis is not to provide a detection system for driver sleepiness that can be implemented for commercial use, but to investigate the possibilities of providing an objective measure of driver sleepiness for research purposes.


2 Background

For this thesis, data from 12 different experiments collected by the Swedish National Road and Transport Research Institute, VTI, were used. The experiments were carried out between 2004 and 2015 and were conducted both in simulators and on regular roads. The chapter also includes some related work in the area of driver sleepiness and deep learning.

2.1 Experiments

The experimental setup varies between the experiments in factors such as the number of participants and the driving conditions. In total, data from 12 separate experiments were used; they are summarized in table 1.

Table 1: Summary of experiments.

Nr  Name            Environment  Driving sessions per participant  Participants
1   ASHMI           Road         3                                 24
2   D4S             Road         3                                 43
3   Sleepeye        Road         2                                 18
4   Drowsi WP1      Simulator    6                                 14
5   Drowsi WP4      Road         5                                 18
6   ERANET          Road         2                                 24
7   Slippery roads  Simulator    4                                 12
8   Vibsec          Simulator    3                                 12
9   VDM             Simulator    18                                30
10  Sensation SP6   Simulator    2                                 10
11  Sensation SP2   Simulator    1                                 44
12  Awake pilot 15  Simulator    2                                 20

Five of the experiments were collected on real roads and seven from driving sessions in a simulator. For the experiments conducted on real roads, the cars used were a Volvo 850, Volvo S8, Saab 9-3 Aero or Volvo XC70, depending on the experiment. The vehicles were equipped to collect the data: electrophysiological measurements (EOG, which reflects eye position, ECG, which reflects heart activity, and EEG, which reflects brain activity), and in some of the experiments also data from the vehicle, such as speed and lane position, and video recordings of the subjects. The experiments were carried out either on Riksväg 34 or on highway E4 outside of Linköping, Sweden.

The simulator used was VTI's third-generation high-fidelity moving-base driving simulator, which is equipped with a visual system, audio system, vibration table, vehicle model etc. to give the driver realistic driving conditions. One of the experiments was conducted with truck drivers as participants, who drove in a truck driving simulator.

The length of the driving sessions ranges from 45 to 90 minutes, but in some cases the subject fell asleep before the session was completed and the session was then interrupted. During the driving sessions conducted on real roads, two test leaders were present in the car together with the participant: one in the front seat, responsible for the dual commands and for ensuring safety, and one in the backseat, in charge of the logging equipment.

The drivers were recruited from the register of vehicle owners. There was a restriction on the allowed age of the participants: for one experiment the required age was less than 25 years, and for the other experiments the participants were 30-60 years old. Most of the experiments also required that the participants had driven a minimum number of kilometers during the preceding year.

For collection of electrophysiological data, the portable recording system Vitaport 2 (Temec Instruments, the Netherlands) was used. The system is a digital recorder with a large number of channels, one of which is a marker channel used to synchronize data. In the VDM experiment the system g.HIamp (g.tec Medical Engineering GmbH, Austria) was used instead of the Vitaport; g.HIamp is a 256-channel biosignal amplifier used for measurements of brain functions. For the majority of the experiments, the sampling rate was 256 Hz for EEG and ECG, and 512 Hz for EOG. For the VDM experiment, the sampling rate was 256 Hz for the EOG measurements as well. Therefore, the EOG data were later downsampled to match the sampling rate of the other electrophysiological data. During driving and recording of data, the participants subjectively assessed their sleepiness based on the KSS. They were asked to report a sleepiness level (KSS value) every 5 minutes, corresponding to how they had been feeling during the preceding 5 minutes. A histogram of the distribution of KSS levels in the datasets is seen in figure 1. The KSS levels are described further in section 3.1.1.
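The downsampling of the 512 Hz EOG to the 256 Hz rate of the other signals can be sketched as below. The polyphase resampler and the placeholder signal are illustrative assumptions; the thesis does not state in this chapter which resampling method was actually used.

```python
import numpy as np
from scipy.signal import resample_poly

# Downsample 512 Hz EOG to 256 Hz (a rational factor of 1/2).
# resample_poly applies an anti-aliasing FIR filter before decimating.
fs_in, fs_out = 512, 256
eog_512 = np.random.randn(10 * fs_in)            # placeholder 10 s recording
eog_256 = resample_poly(eog_512, up=1, down=2)   # still 10 s, now at 256 Hz
```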


Figure 1: Histogram over the KSS levels from all datasets.

2.2 Related work

Previous studies on driver sleepiness can roughly be divided into three groups according to how the data are collected. The first group uses contactless approaches, for example cameras that compute blink rate. The second uses a contact approach, with sensors that measure physiological signals from the driver. Both approaches have benefits and drawbacks. Theoretically, physiological signals are more accurate as they originate from the driver; however, noise and artifacts arising from body movements may reduce the accuracy of the signal, and the driver also has to wear sensors, which might feel uncomfortable. The contactless alternative has drawbacks since it might be affected by, for example, light conditions or other external parameters, or the cameras may fail to detect the eyes. The third group is vehicle-based measurements, which for example measure the deviation from the lane position or the pressure on the accelerator pedal. [10]

A lot of work has been carried out in the field of deep learning on EEG data, and 156 published articles are compiled and compared in a review by Yannick R. et al. [11] from 2019. In the study, no article investigated sleepiness, but 15 investigated sleep staging with the help of deep learning. A majority of the reviewed work used preprocessing steps such as filtering of the EEG data, but a smaller share of them used artifact handling. Almost equal numbers of articles used frequency-domain features and raw EEG data as input. The reported network architectures for EEG in the reviewed articles vary over the years, but in total almost half (41 %) used convolutional neural networks (CNNs), while smaller shares used recurrent neural networks (RNNs) (14 %) and autoencoders (13 %). The share of papers reporting a combination of CNN and RNN was smaller (7 %); however, as the number of papers grows, the number of studies using a combination of CNN and RNN is growing with it. This suggests that RNNs have shown increasing potential in the analysis of EEG signals. It was also shown that the share of papers reporting architectures such as autoencoders decreases as the total number of papers increases. When autoencoder architectures were used, it was common to apply a two-step training process, where the first step is unsupervised feature learning and the second step is training a classifier on these features.

Inspiration for network architectures was taken from a number of articles investigating sleep stage classification using deep learning. Biswal et al. [8] investigated neural network architectures of CNN, RNN and a combination of CNN and RNN, for different data representations of EEG data. These representations included raw time series data, spectrograms and expert-defined features. Wei et al. [12] used empirical mode decomposition (EMD) and the Hilbert transform to produce time-frequency representations of EEG data for a convolutional neural network architecture. Stephansen et al. [13] investigated sleep stage scoring for diagnosis of narcolepsy, where the neural network architecture included convolutional subnetworks for EEG, EOG and electromyogram (EMG) modalities. The extracted features from the subnetworks were either fed into fully connected layers or into layers utilizing memory, LSTM. Mossoz [14] studied drowsiness characterization with a focus on eye closure dynamics using a temporal convolutional neural network. The network processes face images and extracts features related to eye closures.

In this thesis, deep learning is used on both raw time series data and time-frequency domain representations of it, to investigate whether patterns can be detected that distinguish between alert and fatigued drivers.


3 Theory

This chapter comprises theory related to the master thesis. Electrophysiological signals in relation to sleepiness will be discussed as well as the area of deep learning and the construction of neural networks.

3.1 Sleepiness

Sleepiness is a normal physiological state and can be defined as one's tendency to fall asleep [2]. Signs of sleepiness can be found in electrophysiological signals, and these can therefore be used in detection systems to prevent crashes related to sleepiness [15]. Furthermore, the behavior of the driver might also change as the driver gets tired [15]. The complexity of the physiological state of sleepiness makes it difficult to measure, and it is not yet a numerically defined quantity [16].

3.1.1 Karolinska sleepiness scale

A subjective measurement of sleepiness is the Karolinska sleepiness scale, KSS, which is a scale of nine steps [17]. Often, subjective rating is the only possible alternative when assessing sleepiness, and the rating is based on the subject's own feelings and experience [18]. The nine steps in the KSS can be described as follows:

1. Extremely alert
2. Very alert
3. Alert
4. Rather alert
5. Neither alert nor sleepy
6. Some signs of sleepiness
7. Sleepy, but no difficulty remaining awake
8. Sleepy, some effort to keep alert
9. Extremely sleepy, fighting sleep
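For binary classification, the nine-step scale above has to be collapsed into the two labels "fatigue" and "alert". A minimal sketch of such a mapping is given below; the split point (here KSS ≥ 7, "sleepy, but no difficulty remaining awake") is an assumed threshold, since this section does not fix where the thesis draws the line.

```python
def kss_to_binary(kss: int, threshold: int = 7) -> str:
    """Map a KSS rating (1-9) to a binary sleepiness label.

    The threshold of 7 is an illustrative assumption, not necessarily
    the split point used in the thesis.
    """
    if not 1 <= kss <= 9:
        raise ValueError("KSS ratings lie on a nine-step scale (1-9)")
    return "fatigue" if kss >= threshold else "alert"
```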

3.2 Electrophysiological data

The electrophysiological data available for this master thesis include the electrocardiogram (ECG), electrical signals from the heart recorded at the skin surface with the help of electrodes. The data also include the electroencephalogram (EEG), recorded from electrodes placed on the surface of the head, and the electrooculogram (EOG), which reflects different types of eye motion and is recorded by electrodes placed around the eye. The amplitudes of the electrophysiological signals are low, and an amplifier connected to the electrode wires is therefore needed [19].

3.2.1 Electrooculogram, EOG

The electrooculogram (EOG) measures eye-position-dependent DC potentials recorded using surface electrodes positioned around the eyes [20]. The electrode placement, with three electrode pairs, is illustrated in figure 2. The recorded signal arises from a potential present between the fundus of an eye with a functional retina and the cornea [20]. This electric potential field can be described by a positive pole at the cornea and a negative pole at the retina, giving rise to a potential between 0.1 and 4.0 mV [21]. When the eye moves, the positive dipole moves accordingly, and thus a signal dependent on the movement can be obtained [21]. The measured signal is therefore not due to muscle activity [21]. However, several aspects affect the EOG signal, such as muscle artifacts, light conditions and fatigue [21]. A drowsy driver tends to show an increase in blink frequency and an increase in the duration of each blink [22].


[Electrode labels in figure 2: vertical left (Vl+, Vl-), vertical right (Vr+, Vr-), horizontal (H+, H-).]

Figure 2: Electrode placement of EOG measurement.

3.2.2 Electroencephalogram, EEG

Electroencephalography (EEG) measures electrical potential differences on the scalp surface that derive from brain activity [11]. The measured activity arises from a collection of millions of cortical neurons producing an electrical field [19]. Electrical fields can propagate rapidly, and EEG therefore requires high temporal resolution [11]. It is difficult to extract relevant information from raw EEG since it represents several neural sources of activity at the same time [23]. However, specific responses called event-related potentials can be extracted from the EEG; these potentials are related to a specific event [23]. The EEG signal is most often characterized by its frequency content and amplitude, which reflect the mental state of the subject [19]. A low amplitude and high frequency correspond to an active brain, while a high amplitude and low frequency reflect a state of drowsiness [19]. This is because the cortical neurons of a subject in a drowsy state have a higher degree of synchronized activity, resulting in a higher amplitude [19]. However, signs of sleepiness are not a simple matter to detect, and it is not correct to say that a given frequency band reflects a specific process [23]. Therefore, the whole event-related potential field acquired from the EEG measurement is useful in the evaluation of driver sleepiness [24].

A subject that is alert processes a lot of information, and the cortical neurons will have high but unsynchronized activity [19]. Previous research suggests that changes in the alpha waves present in the EEG signal are related to sleepiness [25]. The electroencephalographic rhythms can be classified into five frequency bands: gamma rhythm (> 30 Hz), beta rhythm (14-30 Hz), alpha rhythm (8-13 Hz), theta rhythm (4-7 Hz) and delta rhythm (< 4 Hz) [19]. The appearance of the different rhythms can be seen in figure 3. The most commonly reported marker for the sleep onset period (SOP) is attenuation of EEG alpha waves [26]. These changes in EEG potentials are mostly related to when a subject is trying to fall asleep. On the contrary, when the subject is fighting sleep, the signal appearance might be somewhat different, as the subject alternates between sleep and wakefulness [24].
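The band decomposition above can be illustrated with a short sketch that estimates the relative power in each rhythm from a Welch PSD. The 2-second windows, the 45 Hz upper edge for gamma and the exact band boundaries are illustrative assumptions, not the thesis's processing pipeline.

```python
import numpy as np
from scipy.signal import welch

def band_power(eeg, fs=256.0):
    """Relative EEG power per classical frequency band from Welch's PSD.

    fs = 256 Hz matches the EEG sampling rate used in the experiments;
    the band edges follow the rhythms listed in the text.
    """
    bands = {"delta": (0.5, 4), "theta": (4, 7), "alpha": (8, 13),
             "beta": (14, 30), "gamma": (30, 45)}
    f, psd = welch(eeg, fs=fs, nperseg=int(2 * fs))  # 2 s windows -> 0.5 Hz bins
    total = psd.sum()
    return {name: psd[(f >= lo) & (f < hi)].sum() / total
            for name, (lo, hi) in bands.items()}
```

A steady 10 Hz oscillation, for example, lands almost entirely in the alpha band.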


Figure 3: EEG rhythms.

3.2.3 Electrocardiogram, ECG

An electrocardiogram (ECG) is obtained by placing electrodes at specific points on the subject's body and recording an electrical signal from the heart from different electrode combinations [27]. The magnitude of the signal ranges from a few microvolts up to 1 V. From the ECG signal, the heart rate variability (HRV) can be used to investigate sleepiness [28]. Wakefulness is characterized by increased sympathetic activity, whereas increased parasympathetic activity is characteristic of relaxation [28]. Furthermore, the heart rate decreases at rest, which increases the heart rate variability [28]. Sleepiness corresponds to parasympathetic activity that is counteracted as the subject tries to stay awake [28]. During rest, a normal heart rate is between 50 and 100 beats per minute [19]. However, the autonomic nervous system (ANS) influences the firing rate of the sinoatrial node (SA node), which causes variability in the heart rate [19]. Sympathetic activity yields an increase in heart rate and thereby also a decrease in HRV [19]. It is important to remember that many things can affect the heart rate variability; it does not necessarily need to be sleepiness, but can, for example, be simple relaxation. Classifying driver sleepiness based solely on ECG signals is relatively uncommon today, and the focus is mostly on signals from electroencephalography.
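As a minimal illustration of how HRV can be quantified, the sketch below computes SDNN, a basic time-domain HRV measure: the standard deviation of the intervals between successive R peaks. It assumes R-peak times have already been detected; the peak detector itself is outside the scope of this sketch.

```python
import numpy as np

def sdnn(r_peak_times_s):
    """SDNN in milliseconds from R-peak occurrence times given in seconds.

    A low SDNN indicates low heart rate variability, which, as noted in
    the text, is associated with increased sympathetic activity.
    """
    rr_ms = np.diff(np.asarray(r_peak_times_s)) * 1000.0  # RR intervals (ms)
    return float(np.std(rr_ms, ddof=1))
```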

3.3 Signal processing

Biological signals are often affected by noise and by interference from other biological signals originating from a different place in the body [19]. The signal can also be corrupted as a result of poorly connected electrodes or interference from an external source [19]. Processing of electrophysiological data can involve filtering with different types of filters to remove noise and artifacts. However, filtering changes the appearance of the signal and consequently affects the obtained result. Filters improve the signal-to-noise ratio, which might be crucial for the analysis of electrophysiological data [29]. The filter parameters must be adjusted based on the appearance of the data, since a bad filter design might introduce distortions [29]. Typically, for electrophysiological data there is an overlap of signal and noise, and neither has a clearly defined appearance, which complicates the filtering process [29].

There are several types of filters, whose properties can be adjusted by altering the impulse response and frequency response of the filter [29]. These responses describe the transfer function of the filter in the time and frequency domains and are important for understanding how to achieve good filter designs [29]. The most common filter types are called Butterworth, Chebyshev, elliptic and Bessel-Thomson [30]. The Butterworth filter is a common choice for smoothing electrophysiological data due to its maximally flat frequency response in the passband and acceptable roll-off rate [30]. The Butterworth filter can be used for removal of artifacts and unwanted noise, and the squared magnitude of the Butterworth frequency response is defined by

$$|H(j\omega)|^2 = \frac{1}{1 + (\omega/\omega_c)^{2N}} \qquad (1)$$

where $\omega_c$ is the cut-off frequency, $N$ is the order of the filter and $\omega$ is the angular frequency [31].
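As an illustration of equation (1) in practice, the sketch below applies a low-pass Butterworth filter with SciPy; the order (N = 4) and the 30 Hz cut-off are illustrative values, not the settings used in the thesis. `filtfilt` runs the filter forward and backward, so no phase distortion is introduced.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 256.0  # sampling rate (Hz), matching the EEG/ECG rate in the experiments
b, a = butter(N=4, Wn=30.0, btype="low", fs=fs)  # 4th-order low-pass, 30 Hz cut-off

t = np.arange(0, 2, 1 / fs)
raw = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 80 * t)  # 5 Hz signal + 80 Hz "noise"
smoothed = filtfilt(b, a, raw)  # the 80 Hz component is strongly attenuated
```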

3.3.1 Spectrogram

The Fourier transform X(f) of a signal x(t) is referred to as the spectrum of the signal [32]. Mostly, only the amplitude spectrum, |X(f)|, is of interest, but there is also the phase spectrum, defined by arg{X(f)} [32]. The spectrogram shows how the spectral density of the signal varies with time [33]. The short-time Fourier transform is computed by dividing the time-domain data into sections, which usually overlap [33]. The Fourier transform of each section is taken, and the magnitude of its frequency spectrum is calculated [33]. Each section corresponds to a vertical line, and these lines are placed next to each other to form an image or a two-dimensional surface [33].
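The sectioning procedure above can be sketched with SciPy's spectrogram routine; the window length and overlap below are illustrative choices, not the settings used in the thesis.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 256.0
t = np.arange(0, 4, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)  # a steady 10 Hz, "alpha-like" tone

# Overlapping windowed FFTs: each column of Sxx is the power spectrum of
# one section; the columns side by side form the time-frequency image.
f, times, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
peak_freq = f[np.argmax(Sxx.mean(axis=1))]  # dominant frequency over time
```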

3.3.2 Hilbert-Huang transform

The Hilbert-Huang transform (HHT) can be used to analyze time series that are non-stationary and non-linear [34]. The output from the transform is a dynamical time-frequency spectrum matrix [12]. The HHT consists of Empirical Mode Decomposition (EMD) and the Hilbert transform (HT). The EMD procedure is adaptive and is based on the assumption that the data consist of independent intrinsic mode functions (IMFs) [34]. Each of the intrinsic mode functions satisfies two conditions [35]:

1. The number of zero crossings and the number of extrema should be equal or differ by one

2. For any given point of the data, the mean of the envelopes of the local maxima and the local minima should be zero

A signal s(t) can be decomposed using the EMD procedure as

s(t) = Σ_{i=1}^{N} IMF_i(t) + r_N(t)  (2)

where IMF_1(t), IMF_2(t), ..., IMF_N(t) are all the intrinsic mode functions in the signal [12]. There is also a negligible residue of the signal, described by r_N(t), and the decomposition stops when the residue becomes a monotonic function, or a function with only one extremum, so that no more intrinsic mode functions can be extracted [35]. From the empirical mode decomposition a number of IMFs are obtained, and these functions represent stationary signals at different scales [12]. The Hilbert transform can then be applied to each IMF to compute the instantaneous frequency [34]. Constructing the analytic signal of each IMF with the Hilbert transform is defined by

z_i(t) = IMF_i(t) + jH(IMF_i(t)) = a_i(t)e^(jθ_i(t))  (3)

where θ_i(t) and a_i(t) are defined by equations 4 and 5 respectively [12]:

θ_i(t) = tan⁻¹( H(IMF_i(t)) / IMF_i(t) )  (4)

a_i(t) = √( IMF_i²(t) + H²(IMF_i(t)) )  (5)

From this, the instantaneous frequency ω_i(t) of IMF_i(t) can be computed using the following equation [12]:

ω_i(t) = dθ_i(t)/dt  (6)

The instantaneous frequencies are by definition non-negative and thereby "physically meaningful" [36]. Negative instantaneous frequencies are caused by multiple extrema between two zero-crossings or by large fluctuations in amplitude in the signal [36]. The Hilbert-Huang transform has been applied in many fields since it is able to handle nonlinear systems, and it has shown promising results in automatic sleep stage scoring based on EEG signals [12].
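The Hilbert transform step (equations 3-6) can be sketched for a single IMF using an FFT-based analytic signal; the function names are this sketch's own:

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal z(t) = x(t) + jH(x(t)), computed in the frequency
    domain by zeroing negative frequencies and doubling positive ones."""
    X = np.fft.fft(x)
    h = np.zeros(len(x))
    h[0] = 1.0
    h[1:len(x) // 2] = 2.0
    h[len(x) // 2] = 1.0          # assumes an even-length signal
    return np.fft.ifft(X * h)

def instantaneous_frequency(x, fs):
    """Equations 3-6 applied to one IMF: take the phase of the analytic
    signal, then differentiate the unwrapped phase to get frequency in Hz."""
    z = analytic_signal(x)
    phase = np.unwrap(np.angle(z))
    return np.diff(phase) * fs / (2 * np.pi)
```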


3.3.3 Welch power spectral density estimate

The Welch method is a way to estimate the Power Spectral Density (PSD) by splitting the data into segments and averaging their respective periodograms. The segments are usually overlapped by 50 % or 75 %, and each segment of data is windowed. By using overlap, the length of each segment can be increased along with the number of segments. The result is a decreased variance of the estimate while providing an increased resolution. [37]
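A minimal sketch of the Welch estimate, assuming a Hann window and 50 % overlap (the normalization follows the usual window-power convention):

```python
import numpy as np

def welch_psd(x, fs, seg_len=256):
    """Welch PSD estimate: average the windowed periodograms of
    overlapping segments (50 % overlap assumed in this sketch)."""
    hop = seg_len // 2
    window = np.hanning(seg_len)
    scale = fs * np.sum(window ** 2)
    n_segs = 1 + (len(x) - seg_len) // hop
    psd = np.zeros(seg_len // 2 + 1)
    for i in range(n_segs):
        seg = x[i * hop : i * hop + seg_len] * window
        psd += np.abs(np.fft.rfft(seg)) ** 2 / scale
    psd /= n_segs                 # average over segments
    psd[1:-1] *= 2                # one-sided spectrum
    freqs = np.fft.rfftfreq(seg_len, 1 / fs)
    return freqs, psd
```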

3.4 Deep learning

Machine learning is the ability of artificial intelligence (AI) systems to acquire knowledge by learning patterns in data [38]. Deep learning is a type of machine learning that uses models with several layers, called neural networks, to build complex concepts from simple concepts [38]. An example of a neural network with only one hidden layer is shown in figure 4. The typical type of deep neural network, as used in deep learning, is the deep feedforward network, or feedforward neural network. These networks consist of an input layer, more than one layer of hidden units and an output layer [39]. The goal for the network is to approximate some parameters θ such that a model mapping y = f(x; θ) is as similar as possible to the function f* where y = f*(x) is the mapping of input x to category y [38]. The network will use the hidden layers to produce the output that is desired in that application, despite this information not being explicitly specified in the training data [38].

The term neural network is inspired by biological neurons, and the analogy is that they both compute their own activation value from a received input [38]. Common layers of a typical feedforward neural network are the so-called dense or fully connected layers, which are layers where all neurons in one layer connect to all neurons in another layer.

Figure 4: Example of a neural network with an input layer (four inputs), one hidden layer and an output layer.


Activation functions

It is common to use activation functions in the deep learning model to introduce nonlinearities into the system. These activation functions are used to compute the output values of the neurons in the hidden layers and the output layer [38]. Examples of common activation functions are shown in figure 5.

Figure 5: Common activation functions used in deep learning layers: a) linear, f(z) = z; b) logistic sigmoid, σ(z) = 1/(1 + e^(−z)); c) hyperbolic tangent, tanh(z) = (e^z − e^(−z))/(e^z + e^(−z)); d) ReLU(z) = max(z, 0); e) PReLU(z) = max(z, αz) with α ∈ (0, 1); and f) ELU(z) = z for z > 0 and α(e^z − 1) for z ≤ 0, with α ∈ (0, 1).

The ReLU activation function (figure 5d) in particular is a standard choice in many applications. The function is piecewise linear and because of this, many of the positive aspects of linear models are preserved, such as easy optimization [38]. For recurrent neural networks, described further down in section 3.4.2, information needs to be propagated over various time steps, and this is a less complicated operation when some of the computations are linear [38].

As can be seen in figure 5, ReLU is linear for all positive values and zero for all negative values. Leaky ReLU, given by max(z, αz) with α ∈ (0, 1), differs from ReLU in that it has a small slope for the negative values, defined by the parameter α, instead of just being zero [38]. PReLU (Parametric ReLU) is a type of leaky ReLU; the difference is that leaky ReLU fixes the parameter α to a small value such as 0.01, whereas PReLU treats α as a parameter that is learned by the network [38]. ReLU does not produce negative values and therefore the mean activation is always larger than zero, which will act as a bias for the next layer [40]. If units with non-zero mean activation do not cancel each other out, there will be a bias shift [40]. It is therefore desirable to push the mean activations towards zero, which is possible using leaky ReLU, ELU and PReLU, which all produce negative values [40].
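The activation functions in figure 5 are straightforward to express with NumPy:

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(z, 0)."""
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: small fixed slope alpha for negative inputs."""
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    """ELU: identity for z > 0, saturating to -alpha for large negative z."""
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def sigmoid(z):
    """Logistic sigmoid, mapping the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```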


Regression and classification

Machine learning can be used to solve many kinds of tasks and two examples are classification tasks and regression tasks. Classification and regression are similar but the output format differs. Classification is when the network algorithm specifies which category some input belongs to and can either be binary with only two class labels or multi-class classification with more than two class labels. The output is thereby categorical (discrete). In a regression task the network is fed with data and the output becomes a predicted numerical value (continuous). [38]

Cost function

When training a neural network or other types of machine learning algorithms, the goal is to find parameters, weights or other structures such that a cost function is minimized. The cost function is minimized using optimization algorithms based on gradient descent which will be discussed further down. The cost function is a crucial point when designing a neural network and the choice of cost function should depend on the type of output activation function used. [38]

Cross-entropy is a common choice of cost function in deep learning applications. It uses the negative log-likelihood, which is the same thing as minimizing the cross-entropy between the training set and the model distribution. The cross-entropy is the difference between the probability distribution of the model's predictions and the training dataset's probability distribution. A cross-entropy loss function in combination with sigmoid or softmax output activation functions improves the performance of neural network models compared to mean squared error. In regression problems, the cost function of choice is typically mean squared error in combination with a linear activation function in an output layer with one node. [38]
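As an illustration, the binary cross-entropy between labels and sigmoid outputs can be computed as follows (the clipping constant is this sketch's own guard against log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean negative log-likelihood for binary labels, where y_pred
    are sigmoid outputs in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))
```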

Optimization algorithms

Training a neural network is a difficult optimization problem and many optimization algorithms have been proposed. Gradient descent is one of the most common optimization algorithms used in deep learning, and there is a large number of algorithms that optimize gradient descent. Gradient descent is used to approach a (local) minimum of a function by taking steps in the opposite direction of the gradient. The size of these steps is given by the learning rate η, which is a parameter that can be tuned, but the choice of a proper learning rate can be a challenging task. A too large learning rate can make it impossible for the solution to converge to a minimum and make it fluctuate around it. A learning rate that is too small leads to slow convergence. [41] The three basic variants of gradient descent are based on the amount of data used to calculate the gradient. When the entire dataset is used to compute the gradient of the cost function, it is called batch gradient descent or vanilla gradient descent. Stochastic gradient descent is when the gradient is calculated and the parameters updated after each training example. The method in between these two extreme cases is mini-batch gradient descent, where the gradient is calculated and the parameters updated based on a mini-batch of training examples with a size that is usually between 50 and 256. This method is most commonly used when training neural networks. The expression stochastic gradient descent is sometimes also used when referring to the mini-batch method. [41]

In addition to the above mentioned methods, there are further variants of algorithms that use momentum and adaptive learning rates to reduce oscillations and speed up training. The traditional stochastic gradient descent algorithm tends to oscillate around local optima, especially where the steepness differs between dimensions [41]. Momentum algorithms accumulate a moving average of previous gradients (which will be exponentially decaying), and move in this direction [38]. By introducing the momentum term, the algorithm is accelerated towards the optimum and the oscillations are dampened [41].
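A minimal sketch of gradient descent with momentum on a known gradient function (all parameter values are illustrative):

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: accumulate an exponentially
    decaying moving average of past gradients and step along it."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)    # moving average of gradients
        w = w - lr * v            # parameter update
    return w
```

On the convex test function f(w) = w², the iterate is driven towards the minimum at zero while the momentum term damps the oscillations.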

AdaGrad is an algorithm which will adapt the learning rate to do small updates for frequent parameters and large updates for infrequent parameters [41]. Because of this, there is no need to manually tune the learning rate [41] and for convex optimization this algorithm should, in theory, improve the performance [38]. However, this algorithm suffers from a drastic decrease in


the effective learning rate too early in the training process [38]. RMSprop is a modification of AdaGrad intended to achieve better performance in nonconvex settings [38]. One of the most popular algorithms is called Adaptive Moment Estimation (Adam), which calculates adaptive learning rates for each parameter [41]. The Adam algorithm can be seen as a combination of RMSprop and momentum, but with a few differences in characteristics [38]. In the algorithm, an exponentially decaying average of past squared gradients is stored, which is also the case for RMSprop, but in addition, an exponentially decaying average of past gradients is stored as well [41]. It has been shown empirically that Adam performs favorably compared to other similar methods and that it has good performance in practice [41].

Regularization

When training a neural network it is important to avoid overfitting, described in section 3.4.5. Overfitting is basically when the model starts to learn noise, which is data that does not represent any true properties of the signal. Two types of regularization methods will be described: L1 regularization and L2 regularization. Both are types of weight regularization. The most common technique used in weight regularization is to add a weight penalty to the loss function to keep the weights small. If small changes in the input result in large changes of the output, this could be an indication that the network is overfitting. [42]

L1 regularization calculates the sum of the absolute values of the weights and pushes weights towards zero, which results in fewer weights being allowed to grow. L2 regularization, which is more commonly used, instead calculates the sum of squared values of the weights. This regularization method penalizes larger weights more severely and drives most weights to smaller values. L1 and L2 regularization can be used either separately or together. [42]
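The two weight penalties added to the loss can be sketched as follows (λ is an illustrative regularization strength, not a value from the text):

```python
import numpy as np

def l1_penalty(weights, lam=0.01):
    """L1: sum of absolute weights, pushing weights towards zero."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam=0.01):
    """L2: sum of squared weights, penalizing large weights more severely."""
    return lam * np.sum(weights ** 2)
```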

Flatten

Flatten is a layer used in neural networks to flatten the input [43] and an illustration can be seen in figure 6. This is a practical operation used when it is required to change the dimension of the data within the network.


Figure 6: Illustration of flatten.
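Flattening is just a reshape; a small NumPy sketch (the matrix values are illustrative):

```python
import numpy as np

# Flatten a 3x3 feature map into a vector of 9 elements.
x = np.array([[1, 2, 5],
              [6, 3, 4],
              [7, 8, 9]])
flat = x.reshape(-1)
```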

Training a Neural Network

After setting all the fixed parameters described above, the actual training can begin. Generally, the initial weights are randomized and later updated as the network learns [44]. The backpropagation algorithm is a method often used when training deep learning networks. It calculates the gradient and allows information from the cost to flow backwards in the network. Nevertheless, it is important not to confuse backpropagation with the optimization algorithm for the neural network, as it only computes the gradient. The gradient is then used by the optimization algorithm, such as gradient descent, to perform learning. [38]

Backpropagation consist of propagation and weight update. The first step is that the input data propagates through the network and an output is generated [38]. This output should be as close to


the desired output as possible [38]. From this output the error for that set of weights is calculated, as the aim is to minimize the loss function [45]. Thereafter, the derivative of the error is calculated with respect to all of the weights in order to optimize them [45]. As the weights are adjusted, the process starts over and iterates until the network learns and converges [45]. A schematic overview of the backpropagation process is presented in figure 7.


Figure 7: Schematic overview of the training of a neural network.

3.4.1 Convolutional neural networks, CNN

Convolutional neural networks (CNNs) are a type of feedforward neural network [38]. They have been successfully used in various practical applications processing both 1D and 2D data, such as time series classification and image classification. An illustration of a convolutional neural network can be seen in figure 8.


Figure 8: Architecture of a convolutional neural network.

As the name implies, convolution is used as an operator in the network. The mathematical operator takes the input data and a filter (more commonly called a kernel) and performs convolution, which is the same as correlation except that the kernel is flipped relative to the input [46]. However, many neural network libraries implement cross-correlation but still call it convolution [38]. The convolution operation is defined by

s(t) = ∫ x(a)w(t − a) da = (x ∗ w)(t)  (7)

where x is often referred to as the input (the data) and w as the kernel [38]. Note the minus sign in the function w, which is where the function is flipped relative to the input as mentioned earlier. The output can, in machine learning applications, be called a feature map [38]. Usually, this operation is discretized due to the fact that the data are discrete; the operation is then

s(t) = Σ_{a=−∞}^{∞} x(a)w(t − a)  (8)


and this can also be expanded to two or three dimensional convolution [38], which in 2D gives

s(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n)K(i − m, j − n)  (9)

where I is the image or matrix of data and K is the kernel [38]. In a traditional neural network every output unit interacts with every input unit, but in a convolutional neural network the kernel is made smaller than the input, which results in sparse localized interactions and also fewer operations when computing the output [38]. The two dimensional convolution operation can be illustrated as in figure 9.

Figure 9: Illustration of the two dimensional convolution operation used in convolutional neural networks.

CNNs have fewer connections and parameters compared to a feedforward neural network of similar size and are therefore easier to train [47]. These fewer connections are sometimes referred to as sparse weights, or sparse connectivity, and result from the kernel being smaller than the input matrix [38]. The decreased number of parameters is beneficial, since problems such as overfitting, which will be discussed further down, arise as the number of parameters increases. The amount of training data required tends to increase with the number of parameters. The number of parameters in a convolutional layer is given by the number of filters and the filter size. For example, if the filter size is 5×5 and the number of filters is 4, there are 5×5 parameters per filter, assuming no bias. It follows that the total number of parameters is 5×5×4 = 100.
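A sketch of the "valid" 2D convolution used in CNN layers; as noted above, deep learning libraries typically implement cross-correlation (no kernel flip) under this name, and so does this sketch:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution as implemented in most deep learning
    libraries, i.e. cross-correlation: the kernel slides over the image
    without being flipped."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

Note how the output is smaller than the input and how each output value depends only on a kernel-sized neighborhood, which is the sparse connectivity described above.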

An important layer often used in convolutional neural networks is the max-pooling layer, which acts on the hidden layers of the network. By using max-pooling, the spatial size can be reduced, which also reduces overfitting [48]. The pooling layers combine outputs from clusters of neurons at one layer and put them together into a single neuron at the next layer [49]. The data thereby gets a more compact representation, which reduces the computational cost, and there is also better robustness to noise and invariance to image transformations [49]. An illustration of the max-pooling operation can be seen in figure 10.

Figure 10: Illustration of the 2 × 2 max-pooling operation.
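A 2 × 2 max-pooling step with stride 2 can be sketched as follows (assuming the feature map has even side lengths):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: reshape the feature map into
    2x2 blocks and keep the maximum of each block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```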


3.4.2 Recurrent neural networks, RNN

A recurrent neural network (RNN) is a network specialized in processing sequential data, since it has a temporal dimension [38]. RNNs store previous outputs and use them when making predictions for the current output [50]. However, RNN architectures have problems with vanishing or exploding gradients, and therefore LSTM architectures, described further down, have been proposed as a strong alternative to traditional RNNs [51]. In the case of an exploding gradient, the weights may tend to oscillate, and in the case of a vanishing gradient, learning long term dependencies may take a very long time or not be possible at all [51].

Long Short-Term Memory, LSTM

Long Short-Term Memory networks are a type of RNN that has the ability to handle long term dependencies through the presence of memory blocks in the recurrent hidden layer of the network architecture [52]. The memory blocks each contain an input gate and an output gate, which control the flow of input activation and output activation respectively [52]. Because of the presence of gates, LSTM networks are usually called gated RNNs [38]. The LSTM architecture is designed to handle the problems that conventional RNNs have, such as vanishing or exploding gradients [51]. The concept is to create paths through time where the derivatives neither vanish nor explode [38]. The LSTM components are designed for the network to accumulate information over time and to learn when to forget information that is not needed anymore [38]. LSTMs have been successfully used in applications such as speech recognition and unconstrained handwriting recognition [38].


Figure 11: Visualization of an LSTM memory cell.

An example of how an LSTM memory cell can be visualized is shown in figure 11. In an LSTM network, several memory cells are connected recurrently to each other. The memory from the previous cell flows into the current cell in the pipe at the top of the image. The symbol ⊗ denotes element-wise multiplication, and at (1) in the figure there is an element-wise multiplication between the previous cell state and the output of a sigmoid function, σ, deciding which information to remember and which to forget. Values that are multiplied by 0 will be forgotten completely, since the result is 0, and values multiplied by 1 will not be forgotten at all, since they remain the same. The result is that some of the old information is forgotten, and this part is therefore called the forget gate. The equation for the forget gate, f^(t), for timestep t and cell i, is as follows:

f_i^(t) = σ( b_i^f + Σ_j U_{i,j}^f x_j^(t) + Σ_j W_{i,j}^f h_j^(t−1) )  (10)

where x^(t) is the current input and h^(t−1) is the output from the previous cell [38]. The other parameters b^f, U^f and W^f are the bias, input weights and recurrent weights respectively [38].


At the multiplication symbol at (2), the amount of new information to be added is decided; this is called the input gate. The sigmoid function regulates which information should be forgotten and which should be kept by multiplying with values between 0 and 1. The tanh activation function receives both the previous hidden state and the input and regulates the values to be between −1 and 1. The input gate, g_i^(t), can be described mathematically by the following equation:

g_i^(t) = σ( b_i^g + Σ_j U_{i,j}^g x_j^(t) + Σ_j W_{i,j}^g h_j^(t−1) )  (11)

where the parameters b^g, U^g and W^g are specific for the input gate [38].

The result is added, with element-wise addition illustrated by the symbol ⊕ at (3), to the result from the forget gate, and the cell state is updated. This is given by the following equation:

s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) tanh( b_i + Σ_j U_{i,j} x_j^(t) + Σ_j W_{i,j} h_j^(t−1) )  (12)

where s_i^(t) is the cell state and b, U and W are the bias, input weights and recurrent weights of the LSTM cell [38].

Finally, there is the output gate, which controls the output of the current LSTM cell. It regulates the amount of new memory that should influence the output. The input and the output from the previous cell are fed into a sigmoid function, and the cell state is fed into a tanh activation function. The two are then multiplied (4) and the result is the information that the output of the current cell should carry. The output gate, o_i^(t), can be described by the following formulas:

o_i^(t) = σ( b_i^o + Σ_j U_{i,j}^o x_j^(t) + Σ_j W_{i,j}^o h_j^(t−1) )  (13)

h_i^(t) = tanh(s_i^(t)) o_i^(t)  (14)

where b^o, U^o and W^o are the bias, input weights and recurrent weights specific to the output gate [38].
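Equations 10-14 for one time step can be sketched directly in NumPy (the parameter dictionary layout and key names are this sketch's assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step following equations 10-14. p holds weight
    matrices U*, W* and biases b* for the forget (f), input (g),
    candidate (c) and output (o) parts."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)  # forget gate, eq 10
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)  # input gate, eq 11
    c = np.tanh(p["bc"] + p["Uc"] @ x + p["Wc"] @ h_prev)  # candidate state
    o = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)  # output gate, eq 13
    s = f * s_prev + g * c                                  # cell state, eq 12
    h = np.tanh(s) * o                                      # cell output, eq 14
    return h, s
```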

Together with LSTM, a wrapper called TimeDistributed, found in the Keras library in Python, can be used. TimeDistributed is visualized in figure 12 and is used to apply a layer to each temporal slice of an input, with the weights shared between timesteps [53]. It can be used for arbitrary layer types [54].



Figure 12: Visualization of TimeDistributed where T is the number of timesteps. Note that the input to each timestep is not the same, but the processing of each input is, due to the weight sharing.

3.4.3 Bidirectional RNNs

A bidirectional RNN makes it possible for the output prediction to depend on the whole input signal and not only on information from the past. It combines RNNs that move forward and backwards in time and an illustration of this can be seen in figure 13. [38]


Figure 13: Visualization of a bidirectional RNN, where x is the input signal, h is the hidden sequence and y is the output data.

As mentioned, one state is moving forward through time and another state is moving backwards through time. This output is thereby computed as a representation that depends on both the


future and the past. The bidirectional RNN can also be used on two-dimensional input data. In this case four RNNs are used, one for each direction (up, left, right and down). [38]

3.4.4 Autoencoder

An autoencoder is a neural network that consists of two parts, an encoder and a decoder. The autoencoder is trained to compress data and then reconstruct it again. The purpose is not a perfect reconstruction but rather an output that resembles the input data [38]. The model is forced to choose which aspects of the data should be prioritized and thereby learns useful properties of the data [38]. Autoencoders can be used for dimensionality reduction and to retrieve features from data [38]. A visualization of an autoencoder can be seen in figure 14.


Figure 14: Visualization of an autoencoder.

The encoder is where the input data are transformed into a hidden representation [55]. The hidden representation is then mapped back to a reconstructed vector, and this mapping is called the decoder [55]. An autoencoder is trained by trying to minimize the reconstruction error [55]. An autoencoder would be useful for investigating whether there are any patterns at all in the data that is to be used in the network. If so, the encoder part could be used as the first part of the neural network in the form of pre-trained layers.

3.4.5 Overfitting

When training large neural networks there is a risk of overfitting, which is when the validation error starts to increase even though the training error continues to decrease.

Early stopping

A method called early stopping can be used to prevent overfitting. The algorithm returns the parameter setting from the point when the validation error was at its lowest, instead of returning the latest parameters [38]. The early stopping algorithm thereby decides the best amount of time to train (number of iterations) and then terminates the training [38]. A drawback of using early stopping is that the best parameters have to be stored, which may increase the training time [38]. However, this time is mostly negligible, and the algorithm additionally compensates by reducing the computational cost, since it limits the number of training iterations [38].
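A sketch of the early stopping logic, assuming a `step` callback that trains one epoch and returns the validation error (the patience mechanism is the common implementation, not something specified in the text):

```python
import numpy as np

def train_with_early_stopping(step, patience=5, max_epochs=100):
    """Keep track of the epoch with the lowest validation error and
    stop after `patience` epochs without improvement."""
    best_err, best_epoch, waited = np.inf, 0, 0
    for epoch in range(max_epochs):
        val_err = step(epoch)
        if val_err < best_err:
            best_err, best_epoch, waited = val_err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_err
```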

Dropout

A dropout layer can also be used to prevent overfitting. The idea is to drop units from the neural network during training and thereby prevent units from co-adapting too much. The dropout is randomized and the unit, with all its connections, is temporarily removed from the network. All possible "thinned" neural networks share weights, and for each training case a new thinned network is trained. This can be seen as training a collection of thinned networks with extensive weight sharing. If a neural network has n units there are 2^n possible thinned networks. [56] The result of using dropout is visualized in figure 15.



Figure 15: Visualization of a thinned network as a result of using dropout.
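The random removal of units can be sketched as an inverted dropout mask (the rescaling by 1/(1 − rate) is the common implementation trick for keeping the expected activation unchanged, not something stated in the text):

```python
import numpy as np

def dropout(activations, rate=0.5, seed=0):
    """Zero each unit with probability `rate` during training and
    rescale the surviving units so the expected activation is unchanged."""
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)
```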

Batch Normalization

Batch normalization is a method used in deep learning to reduce the internal covariate shift, which is the change in distribution of network activations. Small changes in the parameters of the network get amplified as deeper layers are reached. The network needs to adapt to these changing distributions, the covariate shifts, which slows down the training. By reducing this shift, the training is improved and converges faster. Adding batch normalization to the neural network fixes the means and the variances of each layer's inputs. This makes the network less sensitive to the learning rate parameter. The result is that it is possible to use higher learning rates and speed up the training, and it also reduces the need for dropout as it regularizes the model. [57]
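The training-time normalization step can be sketched as follows (γ and β are the learned scale and shift parameters; here they are fixed for illustration):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization at training time: normalize each feature
    over the mini-batch, then scale and shift with gamma and beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```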

3.4.6 Evaluation methods

It is important to be able to evaluate the performance of the machine learning algorithm that is applied to the classification problem. Therefore, a few measures and methods that are commonly used for evaluation of a machine learning algorithm are brought up here. Usually the model is not evaluated on the same data that has been used for training; instead, a separate test set of data is saved for evaluation [38]. If the neural network or other algorithm performs well on the unseen test set, then the algorithm is able to generalize well [38].

Accuracy

Accuracy is a common measure for evaluation of a machine learning algorithm. The accuracy is defined as the proportion of correctly classified outputs [38], as in the following equation:

Accuracy = Number of correct predictions / Total number of predictions  (15)

Confusion matrix

The confusion matrix is a way to visualize and discriminate between classification methods [58], and one example of what a confusion matrix can look like is shown in table 2. The columns of the confusion matrix show the predicted class and the rows show the actual class. The instances on the diagonal, the True Positives and True Negatives, are the examples that the model classifies correctly for each class respectively [58]. The remaining two instances, False Positives and False Negatives, are the misclassified examples [58].


                     Predicted positive    Predicted negative
Actual positive      True Positive         False Negative
Actual negative      False Positive        True Negative

Table 2: Confusion matrix.
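The entries of table 2, and the measures derived from them in the following sections, can be counted from binary label vectors (the function names are this sketch's own):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """TP, TN, FP, FN for binary labels, with 1 as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # equation 15

def sensitivity(tp, fn):
    return tp / (tp + fn)                    # equation 16

def specificity(tn, fp):
    return tn / (tn + fp)                    # equation 17
```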

Sensitivity

The sensitivity is a measure of the true positive participants that are classified correctly and can be calculated using equation 16 [58].

Sensitivity = True Positives / (True Positives + False Negatives)  (16)

Specificity

The specificity is a measure of the true negative participants that are classified correctly. This can be calculated using equation 17 [58].

Specificity = True Negatives / (True Negatives + False Positives)  (17)

Mean absolute error

For a regression problem with a continuous output, instead of probabilities for class labels, it can be useful to use another evaluation method than accuracy alone. One common method for evaluation is the Mean Absolute Error (MAE), which is described by the following formula:

MAE = (1/n) Σ_{j=1}^{n} |y_j − ŷ_j|  (18)

where y_j is the actual value, ŷ_j is the estimated output from the network and n is the number of training examples.

Mean squared error

Similar to MAE is Mean Squared Error (MSE) which is defined by the following formula

MSE = (1/n) Σ_{j=1}^{n} (y_j − ŷ_j)²  (19)

where again y_j is the actual value, ŷ_j is the estimated output from the network and n is the number of training examples.
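Equations 18 and 19 expressed in NumPy:

```python
import numpy as np

def mae(y_true, y_pred):
    """Equation 18: mean absolute error."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mse(y_true, y_pred):
    """Equation 19: mean squared error."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```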

Receiver Operating Characteristics, ROC

Receiver operating characteristic (ROC) curves are a useful tool for visualizing and organizing classifiers such as neural networks. ROC graphs are especially useful where the class distribution and classification error costs are unequal. When constructing the ROC graph, the true positive


rate and the false positive rate are needed. The true positive rate is equivalent to the sensitivity defined above, and the false positive rate is given by:

False positive rate = \frac{\text{False Positives}}{\text{Total negatives}}    (20)

which is the same as 1 - specificity. The ROC curve is then constructed by plotting the true positive rate on the Y axis against the false positive rate on the X axis. The graph therefore illustrates the trade-off between benefit (true positives) and cost (false positives). A discrete classifier that predicts the class label directly results in a single point in ROC space. If the classifier instead predicts a probability for a certain class, this probability can be thresholded such that probabilities below the threshold belong to one class and probabilities above it to the other. In this case, different threshold values can be tried to obtain a curve of how the true positive rate and the false positive rate vary with the threshold. [59]
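The threshold sweep described above can be sketched as follows, where each threshold yields one (false positive rate, true positive rate) point on the curve. The scores are toy values, not results from this thesis:

```python
import numpy as np

# Sketch: construct ROC points from predicted class-1 scores by sweeping
# the decision threshold. Toy data for illustration.
def roc_points(y_true, scores):
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = (y_true == 1).sum(), (y_true == 0).sum()
    pts = []
    for thr in np.unique(np.concatenate(([-np.inf], scores))):
        pred = scores > thr
        tpr = (pred & (y_true == 1)).sum() / pos   # sensitivity
        fpr = (pred & (y_true == 0)).sum() / neg   # 1 - specificity
        pts.append((fpr, tpr))
    return sorted(pts)

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y_true, scores)
print(pts)  # from (0, 0) for the highest threshold up to (1, 1) for the lowest
```
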

Figure 16 shows an example of how a curve can be represented in ROC space when a test set has been thresholded and plotted.

[Figure: ROC curve with the false positive rate on the x-axis and the true positive rate on the y-axis; a dotted diagonal line marks random guessing.]

Figure 16: The appearance of the ROC curve as a result of thresholding a test set.

The point (0,0) in ROC space corresponds to a classifier that makes no positive classifications and thereby no false positive errors, but at the cost of gaining no true positives. Opposite to this scenario is the point (1,1), where the classifier only gives positive classifications. Both of these points lie on the dotted random-guessing line in figure 16. The closer the classifier is to the north-west corner (minimized Euclidean distance), the better it is, and the point (0,1) corresponds to perfect classification. The point on the curve that is closest to (0,1) is given by the threshold that maximizes the true positive rate and minimizes the false positive rate of the system. A classifier that appears below the random-guessing line, in the lower right triangle, achieves worse results than random guessing. The area under a ROC curve (AUC) is a statistical property stating the probability that an arbitrarily chosen positive instance is ranked higher than an arbitrarily chosen negative instance. [59]
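The probabilistic interpretation of AUC stated above can be computed directly as a rank statistic: the fraction of (positive, negative) score pairs where the positive example receives the higher score, with ties counting half. A minimal sketch with toy scores, not thesis results:

```python
# Sketch: AUC as the probability that a random positive instance is
# scored higher than a random negative instance (ties count 0.5).
def auc_rank(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

This pairwise formulation gives the same value as the area under the empirically constructed ROC curve, without needing to build the curve first.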


4 Method

The master thesis started with a prephase including a literature study and initial pre-processing of the data, such as filtering and normalization. The following sub-chapters describe how the data were processed and divided into epochs, along with the architecture of the developed neural networks. In addition, they describe how the networks were trained and evaluated. Python version 3.7 with the Anaconda distribution was chosen for the implementation of the networks due to its convenience, while all data processing was performed in Matlab 2019a.

4.1 Prephase

The master thesis project was initiated with a prephase that included setting up the software environment and getting the data in the correct format for the upcoming work. Two laptops, provided by VTI, were used for pre-processing of the data. One stationary computer equipped with an NVIDIA Titan X Pascal graphics processing unit was used for training the neural networks.

The experiment data had been acquired on separate occasions and under separate conditions, and the configuration of the experiments therefore differed between them. For all of the experiments, the data were provided in European Data Format (EDF), where each driving session was stored in a separate EDF file. EDF is a format used for exchange and storage of biological and physical signals. The corresponding KSS values were stored in Excel, .dat or text files. The KSS values were given every five minutes and expanded into a vector with a sample frequency of 8 Hz. Some of the experiments included decimal KSS values, which were rounded to fit the KSS scale. Previous master thesis students at VTI [60] had established a structure for three of the experiments used, and the same structure was followed for the remaining experiments. The structure is illustrated in figure 17. There is an exception in the dataset called VDM, where the EOG signal has a sampling frequency of 256 Hz and the KSS values had a sampling frequency of 50 Hz.
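As an illustration of how KSS ratings given every five minutes could be expanded into a label vector at 8 Hz, with decimal ratings rounded to the integer scale as described above, the following sketch can be used. The hold-forward expansion (each rating held until the next) and the function name are assumptions, not taken from the thesis:

```python
import numpy as np

# Sketch: expand KSS ratings (one every five minutes) into an 8 Hz label
# vector; decimal ratings are rounded to the integer KSS scale.
def kss_to_8hz(kss_ratings, fs=8, interval_s=300):
    kss = np.rint(np.asarray(kss_ratings, float)).astype(int)
    # Each rating is held constant for its five-minute interval.
    return np.repeat(kss, fs * interval_s)

vec = kss_to_8hz([6.4, 7, 8.2])   # three five-minute ratings
print(len(vec))                   # 3 * 8 * 300 = 7200 samples
print(vec[0], vec[-1])            # 6 8
```
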

[Figure: tree diagram of the data structure for each participant, with the KSS values sampled at 8 Hz and the physiological signals ECG (256 Hz), EEG and EOG (512 Hz).]

Figure 17: Data structure for each participant in each of the experiments.

The physiological data from each experiment were time-synced relative to each other and relative to the KSS values in Matlab, using a timesync vector that indicated when the experiment started, i.e., time zero. The time-synced data labeled with KSS values were saved in separate mat-files (version 7.3) for each driving session.
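The alignment idea can be sketched as follows. The function name, the sampling rate and the offset are hypothetical, since the thesis does not specify the implementation; the point is simply that samples recorded before the start marker are discarded so that all signals share the same time origin:

```python
import numpy as np

# Hypothetical sketch: align a signal to the experiment's time zero by
# dropping samples recorded before the start marker.
def sync_to_time_zero(signal, fs, start_time_s):
    """Drop samples before start_time_s (seconds into the recording)."""
    start_idx = int(round(start_time_s * fs))
    return signal[start_idx:]

eog = np.arange(512 * 10)                 # 10 s of fake samples at 512 Hz
synced = sync_to_time_zero(eog, 512, 2.0)  # experiment starts at t = 2 s
print(len(synced))  # 4096 samples (8 s remain)
```
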

4.2 Signal processing

The electrophysiological signals were processed by the use of filters and thresholds to suppress noise and artifacts. These steps were performed in Matlab. All signals went through some general steps illustrated in figure 18. Each step will be described further in the following chapters.


[Flowchart: Convert file format → Filtering → Split into epochs → Normalization → Remove epochs with artifacts]

Figure 18: Flowchart of data processing steps performed.

4.2.1 EOG

The electrooculogram signals were fairly free from noise and the filtering therefore consisted of a simple low-pass Butterworth filter with cut-off frequency 11.52 Hz. The Butterworth filter was implemented in Matlab using the built-in function butter. The effect of the filtering can be seen in figure 19.

[Figure with two panels showing the original and filtered EOG signals: (a) Filtering of EOG; (b) Zoomed in.]

Figure 19: Filtering of the EOG signal using a low-pass filter.

For the EOG signals, there were several motion artifacts and baseline distortions that were processed in Matlab to improve the quality of the signals. An algorithm was developed to reduce the impact of baseline distortions. The first step of the algorithm was to extract the absolute value of the derivative of the signal using the Matlab functions abs and diff. A threshold was set to extract abnormally high peaks in the derivative. The threshold of choice was found through trial and error to be 100 µV. The peaks were extracted using the function findpeaks in Matlab. Breakpoints were then applied at the positions of these peaks, and the linear trends of the separate parts of the signal were estimated and subtracted using the function detrend in Matlab. The result can be seen in figure 20.
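A scipy translation of this baseline-correction algorithm could look like the sketch below. The 100 µV derivative threshold comes from the text; the toy two-segment signal and the +1 breakpoint alignment are illustrative assumptions:

```python
import numpy as np
from scipy.signal import find_peaks, detrend

# Sketch: baseline correction via derivative thresholding and piecewise
# linear detrending (scipy equivalents of abs/diff/findpeaks/detrend).
def correct_baseline(signal, threshold=100.0):
    deriv = np.abs(np.diff(signal))
    peaks, _ = find_peaks(deriv, height=threshold)  # abnormally high slopes
    # Place breakpoints just after each detected jump and remove a separate
    # linear trend from every segment between them.
    return detrend(signal, type='linear', bp=peaks + 1)

seg1 = np.linspace(0, 50, 500)        # slow upward drift
seg2 = np.linspace(-200, -150, 500)   # sudden baseline jump, then drift
corrected = correct_baseline(np.concatenate([seg1, seg2]))
print(np.max(np.abs(corrected)) < 1e-6)   # True: both trends removed
```
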


[Figure: EOG signal before and after baseline correction.]

Figure 20: Correction for baseline distortion on the EOG signal.

The second algorithm was developed to reduce the impact of motion artifacts in the signal. This algorithm begins, similarly to the previous one, by extracting the absolute value of the derivative in the same way. The indices of the signal where the derivative is small (less than 0.001) and the amplitude is simultaneously below -500 µV were extracted and saved in a separate "marker" signal vector. Single indices with no surrounding values were then removed from the marker signal, and a median filter of order 15 was applied using the function medfilt1 in Matlab. The marker signal corresponding to the detected positions of small derivatives can be seen in figure 21.
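The marker construction can be sketched in Python as follows. The flat-derivative and amplitude thresholds and the filter order come from the text; the toy signal and the edge handling are illustrative assumptions:

```python
import numpy as np
from scipy.signal import medfilt

# Sketch: mark samples where the derivative is nearly flat (< 0.001) while
# the amplitude is below -500 µV, drop isolated single markers, and apply
# a median filter of order 15 (scipy counterpart of medfilt1).
def artifact_marker(signal, flat=1e-3, level=-500.0):
    deriv = np.abs(np.diff(signal, append=signal[-1]))
    marker = (deriv < flat) & (signal < level)
    # Remove single marked samples with no marked neighbours.
    isolated = marker & ~np.roll(marker, 1) & ~np.roll(marker, -1)
    return marker & ~isolated

signal = np.zeros(100)
signal[40:60] = -800.0   # flat, low-amplitude artifact segment
signal[5:7] = -600.0     # brief dip: only one flat-derivative sample, isolated
marker = artifact_marker(signal)
filtered = medfilt(signal, kernel_size=15)

print(bool(marker[45]), bool(marker[5]))  # True False
print(filtered[5])                        # 0.0 (brief dip smoothed away)
```
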

[Figure: EOG signal with the final marker signal indicating the detected artifacts.]

Figure 21: Marker signal of the detected artifacts in the EOG signal.

This marker signal only indicates where the EOG signal has an abnormally low derivative for a long period of time, but the entire artifact is wider. To extract the start and end positions of the
