An IoT Solution for Urban Noise Identification in Smart Cities: Noise Measurement and Classification



Author: Yasser Alsouda

Supervisor: Sabri Pllana

Examiner: Sven-Erik Sandström

Date: 2019-02-14

Master Degree Project

An IoT Solution for Urban Noise Identification in Smart Cities


Abstract

Noise is defined as any undesired sound. Urban noise and its effect on citizens are a significant environmental problem, and the increasing level of noise has become a critical problem in some cities. Fortunately, noise pollution can be mitigated by better planning of urban areas or controlled by administrative regulations. However, the execution of such actions requires well-established systems for noise monitoring. In this thesis, we present a solution for noise measurement and classification using a low-power and inexpensive IoT unit. To measure the noise level, we implement an algorithm for calculating the sound pressure level in dB. We achieve a measurement error of less than 1 dB. Our machine learning-based method for noise classification uses Mel-frequency cepstral coefficients for audio feature extraction and four supervised classification algorithms (that is, support vector machine, k-nearest neighbors, bootstrap aggregating, and random forest). We evaluate our approach experimentally with a dataset of about 3000 sound samples grouped in eight sound classes (such as car horn, jackhammer, or street music). We explore the parameter space of the four algorithms to estimate the optimal parameter values for the classification of sound samples in the dataset under study. We achieve noise classification accuracy in the range of 88%–94%.

Keywords: urban noise, sound pressure level (SPL), internet of things (IoT), machine learning, support vector machine (SVM), k-nearest neighbors (KNN), bootstrap aggregating (Bagging), random forest, mel-frequency cepstral coefficients (MFCC).


Acknowledgements

First, I would like to express my gratitude to my supervisor Dr. Sabri Pllana for his support throughout the entire process of this thesis. Without his contribution, this work could not have been successfully conducted.

Additionally, I would like to thank my teachers and my classmates in the Department of Physics and Electrical Engineering, especially Dr. Sven-Erik Sandström and Dr. Ellie Cijvat, who helped me to join this master's programme.

Finally, I must express my love and gratitude to my family and my friends for their unfailing support during my years of study. This accomplishment would not have been possible without them. Thank you!


Publications

• Y. Alsouda, S. Pllana, and A. Kurti, “IoT-based Urban Noise Identification Using Machine Learning: Performance of SVM, KNN, Bagging, and Random Forest,” in International Conference on Omni-layer Intelligent Systems (COINS), May 5–7, 2019, Crete, Greece. ACM, 2019. [1].

• Y. Alsouda, S. Pllana, and A. Kurti, “A Machine Learning Driven IoT Solution for Noise Classification in Smart Cities,” in Machine Learning Driven Technologies and Architectures for Intelligent Internet of Things (ML-IoT), August 31, 2018, Prague, Czech Republic. Euromicro, 2018. [2].


Contents

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Related Work
1.3.1 IoT Solutions for Noise Measurement
1.3.2 Machine Learning Methods for Sound Classification
1.4 Outline

2 Background
2.1 Smart City
2.2 Internet of Things (IoT)
2.3 Machine Learning
2.4 Raspberry Pi and Mic-Hat

3 Noise Measurement
3.1 Fundamentals of Sound: Physics and Perception
3.2 Sound Pressure Level
3.3 Equal-loudness contours
3.4 Frequency Weighting
3.5 Time weighting
3.6 Algorithm for calculating sound pressure level

4 Noise Classification
4.1 Dataset
4.2 Feature Extraction
4.3 Classification Algorithms
4.3.1 Support Vector Machines (SVM)
4.3.2 K-Nearest Neighbors (KNN)
4.3.3 Bootstrap Aggregating (Bagging) and Random Forest

5 Experimental Evaluation
5.1 Setup
5.2 Measurement
5.3 Classification
5.3.1 SVM Parameter Space Exploration
5.3.2 KNN Parameter Space Exploration
5.3.3 Bagging and Random Forest Parameter Space Exploration
5.3.4 Performance comparison of classifiers

6 Conclusion
6.1 Future work


List of Figures

2.1 The key fields and some related domains of urban development needed for smart cities.
2.2 The three layers of IoT architecture.
2.3 Machine learning techniques and common machine learning algorithms.
2.4 Noise classification hardware platform consisting of a Raspberry Pi Zero W and a ReSpeaker 2-Mic Pi HAT.
3.5 Illustration of rarefactions and compressions of a sound wave.
3.6 A representation of the frequency distribution along the human cochlea, from base to apex [3].
3.7 Illustration of center frequency and band limits of one octave and 1/3 octave bands.
3.8 The sound pressure level of common noise sources.
3.9 Equal-loudness contours for pure tones according to ISO 226:2003.
3.10 A, C and Z frequency weighting response.
3.11 Demonstrating the behavior of Fast and Slow weighting [4].
3.12 Demonstrating the behavior of Impulse weighting [4].
4.13 Our machine learning based approach for noise classification.
4.14 The procedure for generating MFCCs of environmental sounds.
4.15 Mel scale versus normal frequency scale (Hz).
4.16 An example of a Mel filterbank with 10 filters spaced between 1 and 8 kHz.
4.17 An illustration of SVM for a 2-class classification problem.
4.18 The effect of C and γ on the decision boundaries in SVM for a 2-class classification problem.
4.19 An illustration of KNN for a 2-class classification problem for k = 3.
4.20 The effect of k on the decision boundaries in KNN for a 3-class classification problem.
4.21 Illustration of the classification tree structure for classifying an instance into five classes.
4.22 The structure of tree-based ensemble methods.
5.23 Tacklife SLM01 Classic Sound Level Meter.
5.24 Performance evaluation of SPL measurement using the Raspberry Pi and Mic-Hat without calibration.
5.25 Performance evaluation of SPL measurement using the Raspberry Pi and Mic-Hat after calibration (C_cal = 5).
5.26 Heat map of the SVM accuracy as a function of γ and C.
5.27 The effect of the parameter γ on the performance of the SVM classifier.
5.28 The effect of the parameter C on the performance of the SVM classifier.
5.29 The confusion matrix of SVM-based noise classification.
5.30 Performance of the KNN classifier for various values of nearest neighbors k and Euclidean, Manhattan, and Chebyshev distances.
5.31 The confusion matrix of KNN-based noise classification.
5.32 Performance of the Bagging and Random Forest classifiers for various numbers of decision trees.
5.33 The confusion matrix of Bagging-based noise classification.


List of Tables

2.1 Major properties of the Raspberry Pi Zero W.
3.2 The physical quantities of a sound wave.
3.3 The center frequency and band limits of one octave and 1/3 octave.
3.4 Weighting factors for center frequencies in one octave band.
4.5 Classes of sound samples in the dataset.
5.6 Major properties of Tacklife SLM01.
5.7 Accuracy [%] of SVM.
5.8 Time [seconds] for training and testing of the SVM model on the Pi Zero W. The time for feature extraction is not included.
5.9 Accuracy [%] of KNN.
5.10 Time [seconds] for training and testing of the KNN model on the Pi Zero W. The time for feature extraction is not included.
5.11 Accuracy [%] of Bagging and Random Forest.
5.12 Time [seconds] for training and testing of the Bagging and Random Forest models on the Pi Zero W. The time for feature extraction is not included.
5.13 Performance comparison of classifiers with respect to accuracy, precision, recall, F1 score, and execution time. Training and testing are performed on the Raspberry Pi Zero W. We use 3/4 of the dataset (3042 sound samples) for training and 1/4 of the dataset for testing.


List of Abbreviations and Symbols

CART . . . Classification and Regression Trees

dB . . . Decibel

DCT . . . Discrete Cosine Transform

DFT . . . Discrete Fourier Transform

DT . . . Decision Tree

FFT . . . Fast Fourier Transform

IoT . . . Internet of Things

KNN . . . K-Nearest Neighbors

MFCC . . . Mel-Frequency Cepstrum Coefficients

SPL . . . Sound Pressure Level

SVM . . . Support Vector Machines

WHO . . . World Health Organization

C_cal . . . The calibration constant

C . . . The soft margin parameter in SVM

γ . . . The influence of the training data on the decision boundary in SVM

k . . . The number of nearest neighbors in KNN


1 Introduction

1.1 Motivation

About 85% of Swedes live in urban areas, and the quality of life and the health of citizens are affected by noise. Noise is defined as any undesired environmental sound. The World Health Organization (WHO) [5] recommends a noise level below 30 dB in bedrooms for good sleep and below 35 dB in classrooms for good teaching conditions. Recent studies [6] have found that exposure to noise pollution may increase the risk of health issues such as heart attack, obesity, impaired sleep, or depression.

Following the Environmental Noise Directive (END) 2002/49/EC [7], each EU member state has to assess environmental noise and develop noise maps every five years. As sources of noise (such as traffic, construction sites, music, and sporting events) may change over time, there is a need for continuous monitoring of noise. Health-damaging noise often occurs for only a few minutes or hours, and it is not enough to measure the noise level every five years. Furthermore, sound at the same dB level may be perceived as annoying noise or as pleasant music. Therefore, it is necessary to go beyond the state-of-the-art approaches that measure only the dB level [8, 9, 10] and also identify the type of the noise. For instance, it is important that the environmental protection unit and law enforcement unit of a city know whether the noise is generated by a jackhammer at a construction site or by a gunshot. The Internet of Things (IoT) is a promising technology for improving many domains, such as eHealth [11, 12], and it may also be used to address the issue of noise pollution in smart cities [13].

1.2 Objectives

The major objectives of this project are:

• O1: to study and develop noise measurement techniques that can be used for continuous monitoring of the noise level in urban areas,

• O2: to study machine learning techniques for noise classification on resource-limited devices.

1.3 Related Work

1.3.1 IoT Solutions for Noise Measurement

Goetze et al. [8] provide an overview of a platform for distributed urban noise measurement, which is part of an ongoing German research project called StadtLärm. A wireless distributed network of audio sensors based on the quad-core ARM BCM2837 SoC was employed to receive urban noise signals, pre-process the obtained audio data, and send it to a central unit for data storage and higher-level audio processing. A web application was used as a final stage for visualization and administration of both processed and unprocessed audio data. The authors in [9] used the Ameba RTL 8195AM and Ameba 8170AF as IoT platforms to implement a distributed sensing system for visualization of noise pollution. In [10], two hardware alternatives, the Raspberry Pi platform and Tmote-Invent nodes, were evaluated in terms of their cost and feasibility for analyzing urban noise and measuring psycho-acoustic metrics according to Zwicker's annoyance model.

In comparison to related work, which is limited to the system model and visualization of distributed sound sensing systems, our approach focuses on using a single low-cost IoT unit for both noise measurement and classification.


1.3.2 Machine Learning Methods for Sound Classification

In [14], a combination of two supervised classification methods, SVM and KNN, was used as a hybrid classifier with the MPEG-7 audio low-level descriptor as the sound feature. The experiments were conducted on 12 classes of sounds. Khunarasal et al. [15] proposed an approach to classify 20 different classes of very short sounds. The study investigated various audio features (e.g., MFCC, MP, LPC, and spectrogram) together with KNN and neural network classifiers.

We complement the related work with a study of noise classification on a low-power and inexpensive device, namely the Raspberry Pi Zero W.

1.4 Outline

The rest of the thesis is organized as follows. Chapter 2 gives an overview of smart cities, the Internet of Things, machine learning, and the Raspberry Pi platform. The fundamental concepts of noise and our method for noise measurement are presented in Chapter 3. In Chapter 4, we describe the proposed method and the algorithms we use for noise classification. Chapter 5 presents the experimental evaluation of our approach. The thesis is concluded in Chapter 6.


2 Background

This chapter introduces topics relevant to the thesis: the smart city, the Internet of Things, machine learning, and our hardware platform.

2.1 Smart City

Today, more than half of the world's population lives in urban areas. Rapid urbanization has raised several challenges and problems. The mission of smart cities is to mitigate such problems and optimize the use of resources.

A "Smart City" [16] is a developed urban area that uses information and communication technology (ICT), in collaboration with government and society, to respond to various urban challenges and thus enhance the quality of urban life. Common examples of Smart City projects include automation and salubrity of public buildings, smart waste management services, air quality monitoring, noise monitoring, traffic congestion monitoring, energy consumption monitoring, smart parking, smart lighting, etc.

In spite of the increased use of the term "Smart City" in recent years, there is still no universally accepted definition. According to the European Smart Cities project [17], a smart city performs well in six key fields of urban development: smart economy, smart people, smart government, smart mobility, smart environment, and smart living. Figure 2.1 shows some domains from each key field.

Figure 2.1: The key fields and some related domains of urban development needed for smart cities.


2.2 Internet of Things (IoT)

For the past three decades, the Internet has mainly served people as a way to exchange data and information, for example by sending and receiving e-mails. Nowadays, physical devices and machines with computational intelligence (such as home appliances, mobile phones, and vehicles) are ubiquitous in our lives. These devices can sense and communicate with the ambient environment to perform specific tasks. In contrast to the Internet of people, adding Internet connectivity to such smart devices (things), so that they can exchange data and interact over the Internet without requiring human-to-human or human-to-machine interaction, is called the Internet of Things.

The number of IoT devices has increased dramatically during the last few years and is estimated to reach 30 billion by 2020 [18]. Furthermore, several urban IoT systems have been designed and deployed to support the Smart City vision [19].

The most common IoT architecture consists of three layers: perception layer, network layer, and application layer (Figure 2.2) [20].

• Perception layer: the perception layer is responsible for sensing the ambient environment and collecting data using different types of sensors. The data collected by sensors are transmitted to the network layer either directly (wired) or wirelessly through various wireless personal area networks (Wi-Fi, Bluetooth, ZigBee, etc.).

• Network layer: the network layer is the brain of the IoT system. Its main task is to process the data delivered from the perception layer and then transmit it through the Internet to the application layer using wired or wireless network technologies (3G/4G, LTE, fiber optics, etc.)

• Application layer: the application layer is responsible for the final presentation and the effective utilization of the processed data. Various IoT applications include smart home, smart city, e-government, e-health, etc.


2.3 Machine Learning

Machine Learning is described by Mitchell [21] as follows, “a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”.

In general, machine learning techniques are categorized into two main types: supervised learning and unsupervised learning (Figure 2.3).

• Supervised learning: learning techniques that build and train a model based on a known set of data (input and output) to predict the outputs of new data in the future. Supervised learning problems are divided into classification and regression problems.

– Classification: supervised learning technique that is used to predict qualitative responses, such as a color or disease.

– Regression: supervised learning technique that is used to predict quantitative responses like weight.

• Unsupervised learning: in contrast to supervised learning, unsupervised learning techniques are based only on input data, without corresponding output.

– Clustering: similar to classification, clustering techniques are used to group a set of input data based on the relationship between this data, such as grouping clients by their purchasing behavior.
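As a toy illustration of supervised classification (this is not one of the classifiers studied in the thesis, and the feature values and class names below are invented for the example), a nearest-neighbor rule predicts a qualitative label for a new point from labeled training data:

```python
import math

# Toy supervised classification: a 1-nearest-neighbor classifier.
# Training set: (feature vector, label) pairs; labels are qualitative classes.
train = [
    ((1.0, 1.0), "quiet"),
    ((1.2, 0.8), "quiet"),
    ((8.0, 9.0), "loud"),
    ((9.0, 8.5), "loud"),
]

def classify(x):
    """Predict the label of x as the label of its nearest training point."""
    return min(train, key=lambda pair: math.dist(pair[0], x))[1]

print(classify((1.1, 0.9)))  # a point near the "quiet" cluster
print(classify((8.5, 9.2)))  # a point near the "loud" cluster
```

Replacing the labels with numeric targets and the voting rule with an average would turn the same idea into regression, which predicts quantitative responses.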


2.4 Raspberry Pi and Mic-Hat

Figure 2.4 depicts our hardware experimentation platform that comprises a Raspberry Pi Zero W and a ReSpeaker 2-Mic Pi HAT.

The Raspberry Pi [22] is a low-power and low-cost single-board computer the size of a credit card. It may be used as an affordable computer to learn programming or to build smart devices. A Raspberry Pi Zero W with Wi-Fi capability is used for our experiments. The Raspberry Pi Zero W (see Table 2.1) comes with a single-core CPU running at 1 GHz and 512 MB of RAM, and costs only about $10.

For sound sensing, we use a dual-microphone expansion board for the Raspberry Pi called the ReSpeaker 2-Mic Pi HAT [23]. This board is based on the WM8960 audio codec, has two microphones for collecting sound data, and is designed for building flexible and powerful sound applications.

Figure 2.4: Noise classification hardware platform consists of a Raspberry Pi Zero W and a ReSpeaker 2-Mic Pi Hat.

Table 2.1: Major properties of the Raspberry Pi Zero W.

Property       Raspberry Pi Zero W
SoC            Broadcom BCM2835, 1 x ARM1176JZF-S core, 1 GHz
RAM            512 MB
Storage        micro SD
USB            1 x micro USB port
Wireless LAN   802.11 b/g/n
Bluetooth      4.1
HDMI           mini
GPIO           40 pins


3 Noise Measurement

Noise is defined as any unwanted or unpleasant sound. Noise measurement is the process of determining or monitoring the emission from one or several noise sources in terms of sound pressure level. In this chapter, we introduce the fundamental concepts about the nature of sound and present our method for calculating the sound pressure level that is measured in dB or dBA.

3.1 Fundamentals of Sound: Physics and Perception

In physical terms, sound is a pressure wave that propagates through a transmission medium such as air or water. The propagation speed of a sound wave depends on the medium; in air it is approximately 340 m/s. Sound waves consist of a pattern of low- and high-pressure regions called rarefactions and compressions, respectively. In Figure 3.5, lighter areas represent rarefactions and darker areas represent compressions.

Figure 3.5: Illustration of rarefactions and compression of a sound wave.

Sound can be studied as a physical phenomenon when it propagates through a medium, where it is described by its physical quantities (Table 3.2). When sound is received at the human ear, it is described by its perception in the brain.

Table 3.2: The physical quantities of a sound wave.

Property    Unit
Pressure    Pascal (Pa) = N/m^2
Intensity   Pa·m/s = W/m^2
Energy      Joule = W·s
Velocity    m/s


The simplest sound wave can be represented as a sinusoid with a single frequency component (pure tone). However, most of the sounds that exist in nature are a combination of various tones with different frequencies.

The analysis of incoming sound is performed by the hearing part of the inner ear, called the cochlea. The cochlea is the sensory organ that analyzes sound in terms of frequency; in other words, it works like a spectrum analyzer. Figure 3.6 shows the frequency distribution along the cochlea's basilar membrane when a sound wave propagates from the base to the apex: high frequencies are detected at the base and low frequencies at the apex.

Figure 3.6: A representation of the frequency distribution along the human cochlea, from base to apex [3].

The human ear tends to perceive the frequencies of a sound relatively, so relative units are needed for frequencies. For this reason, the octave band is usually used as a typical relative scale. Each octave band has a center frequency, a lower frequency, and an upper frequency (Figure 3.7).

Figure 3.7: Illustration of center frequency and band limits of one octave and 1/3 octave bands.


The center frequency and band limits of one octave bands and third octave bands within the audible frequency range are shown in Table 3.3.

Table 3.3: The center frequency and band limits of one octave and 1/3 octave.

Center frequency [Hz]   Center frequency [Hz]   Band limits [Hz]
(Octave band)           (1/3 Octave band)       Lower     Upper
31.5                    25                      22        28
                        31.5                    28        35
                        40                      35        44
63                      50                      44        57
                        63                      57        71
                        80                      71        88
125                     100                     88        113
                        125                     113       141
                        160                     141       176
250                     200                     176       225
                        250                     225       283
                        315                     283       353
500                     400                     353       440
                        500                     440       565
                        630                     565       707
1000                    800                     707       880
                        1000                    880       1130
                        1250                    1130      1414
2000                    1600                    1414      1760
                        2000                    1760      2250
                        2500                    2250      2825
4000                    3150                    2825      3530
                        4000                    3530      4400
                        5000                    4400      5650
8000                    6300                    5650      7070
                        8000                    7070      8800
                        10000                   8800      11300
16000                   12500                   11300     14140
                        16000                   14140     17600
                        20000                   17600     22500
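The band limits follow from the constant-ratio definition of octave bands: the upper limit of a one-octave band is twice the lower limit, and the center frequency is their geometric mean, so the limits are fc/√2 and fc·√2 (for a 1/3-octave band, fc·2^(±1/6)). A small sketch; note that the printed table values are rounded, so only approximate agreement should be expected:

```python
import math

def octave_band_limits(fc):
    """Lower and upper limits [Hz] of the one-octave band centered at fc."""
    return fc / math.sqrt(2), fc * math.sqrt(2)

def third_octave_band_limits(fc):
    """Lower and upper limits [Hz] of the 1/3-octave band centered at fc."""
    return fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)

lo, up = octave_band_limits(1000)
print(round(lo), round(up))  # 707 1414, matching the 1 kHz octave band
```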


3.2 Sound Pressure Level

Noise exposure is usually measured in terms of sound pressure, since it is the most accessible parameter of a sound wave. The human hearing system can sense sounds over a wide range of sound pressures. Hence, it is not appropriate to express the sound pressure on a linear scale, but rather on a logarithmic one. For this reason, measurements of the sound pressure are usually expressed in decibels (dB).

The decibel is a base-ten logarithmic scale that reduces the wide range of sound pressures into a more convenient and manageable range, represented as the ratio between the sound pressure and a reference pressure (usually 20 µPa in air). The reference pressure is the hearing threshold, i.e., the quietest sound a young, healthy human can sense at 1 kHz.

The value of the sound pressure in dB is usually referred to as the sound pressure level (SPL), which takes values between 0 dB (the threshold of hearing) and 120-140 dB (the threshold of pain). The minimum variation of the sound pressure level that the human ear can recognize is about 3 dB (i.e., a doubling of the sound intensity on the linear scale). Figure 3.8 shows the approximate sound pressure levels of some common noise sources.

The sound pressure level compared to the reference level is defined as follows:

SPL = 10 log10(p^2 / p_ref^2)    (1)

where p is the effective sound pressure [µPa] and p_ref is the reference pressure [µPa].
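Equation (1) can be checked with a few lines of Python; for example, an effective pressure of 1 Pa corresponds to roughly 94 dB:

```python
import math

P_REF = 20e-6  # reference pressure: 20 µPa, the hearing threshold at 1 kHz

def spl(p):
    """Sound pressure level [dB] of an effective pressure p [Pa], Eq. (1)."""
    return 10 * math.log10(p ** 2 / P_REF ** 2)

print(spl(20e-6))         # 0.0 dB: the threshold of hearing
print(round(spl(1.0), 1)) # 94.0 dB: 1 Pa effective pressure
```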


3.3 Equal-loudness contours

The human ear can only hear sounds within the frequency range 20 Hz - 20 kHz, known as the audible frequency range. Within this range, the sensitivity of our hearing system at the same sound pressure level is not equal at different frequencies. An equal-loudness contour is a graphical representation of how we perceive the same loudness as a function of frequency. The equal-loudness curves are obtained by mapping the sound pressure level (expressed in dB) of a pure tone to the perceived loudness level (expressed in phon, the unit of loudness level).

There are several sets of equal-loudness contours specified for the human ear. The first set was measured by Fletcher and Munson [24] whereas the definitive set, called "Normal equal-loudness-level contours" [25], is defined in the international standard ISO 226:2003 (Figure 3.9). The graph in the figure emphasizes that the sensation of our hearing system depends on frequency. Frequencies between 1 kHz and 5 kHz are the most sensitive, while at very low and very high frequencies within the audible frequency range the curves rise up implying less sensitivity of the human ear.

Experimentally, a reference tone at 1 kHz was chosen; at this frequency, the SPL value and the loudness level are equal. For example, a loudness of 10 phons corresponds to 10 dB at 1 kHz but 15 dB at 250 Hz. For this reason, we need some form of frequency weighting.


3.4 Frequency Weighting

As discussed in Section 3.3, the sensitivity of our hearing system varies with the frequency of the sound. For noise measurement applications and for studying the effect of noise on humans, measurements must account for frequency in addition to the sound pressure level. Frequency weighting filters emphasize the important parts of the audible frequency spectrum and attenuate the other parts.

Three types of frequency weighting are described in the international standard IEC 61672-1:2013: A-weighting, C-weighting, and Z-weighting. The frequency responses of the A, C, and Z weightings are plotted in Figure 3.10.

• A-Weighting:

A-weighting is the most common frequency weighting, and it is widely used in environmental noise measurements. Based on the equal-loudness contour at 40 phons (Section 3.3), A-weighting represents the subjective sound perceived by humans. It covers the entire audible frequency range and reflects the response of the human ear by emphasizing the mid frequencies, to which the human ear is most sensitive, and attenuating the very low and very high frequencies, where the human ear is less sensitive. Measurements made using A-weighting are usually expressed in dBA.

• C-Weighting:

This type of frequency weighting represents what the human ear hears at high sound pressure levels (above 100 dB). At these levels, the ear's response starts to flatten, and all frequencies are perceived nearly equally. The C-weighting filter has a flat frequency response over several octaves in the middle of the audible frequency range. It is often applied for peak measurements of sound pressure levels, which are expressed in dBC.

• Z-Weighting (Zero-weighting):

A flat frequency response that produces an equal sound pressure level over the frequency range 10 Hz - 20 kHz. It represents the actual incoming sound and can be used for analyzing the noise itself rather than its effect on humans. Measurements made using Z-weighting are usually expressed in dBZ.

Equations (2) and (3) define the functions used to obtain the frequency responses of A- and C-weighting, respectively. Table 3.4 shows the weighting factors, as additive corrections to the SPL, of the A, C, and Z frequency weightings applied to the center frequencies of the one-octave bands.

R_A(f) = (12194^2 · f^4) / [(f^2 + 20.6^2) · sqrt((f^2 + 107.7^2)(f^2 + 737.9^2)) · (f^2 + 12194^2)]

A(f) = 20 log10(R_A(f)) + 2.00    (2)

R_C(f) = (12194^2 · f^2) / [(f^2 + 20.6^2) · (f^2 + 12194^2)]

C(f) = 20 log10(R_C(f)) + 0.06    (3)


Figure 3.10: A, C and Z frequency weighting response.

Table 3.4: Weighting factors for center frequencies in one octave band.

Frequency [Hz]   31.5   63    125   250   500  1000  2000  4000  8000  16000
A-weighting     -39.4 -26.2 -16.1  -8.6  -3.2     0   1.2   1.0  -1.1   -6.6
C-weighting      -3.0  -0.8  -0.2   0.0   0.0     0  -0.2  -0.8  -3.0   -8.5
Z-weighting         0     0     0     0     0     0     0     0     0      0
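Equations (2) and (3) can be implemented directly. As a sanity check, both weightings are normalized to approximately 0 dB at 1 kHz, and the A-weighting factor at 31.5 Hz comes out close to the -39.4 dB listed in Table 3.4 (the table rounds the exact values). A minimal sketch:

```python
import math

def r_a(f):
    # Eq. (2): A-weighting magnitude response (IEC 61672-1)
    return (12194.0 ** 2 * f ** 4) / (
        (f ** 2 + 20.6 ** 2)
        * math.sqrt((f ** 2 + 107.7 ** 2) * (f ** 2 + 737.9 ** 2))
        * (f ** 2 + 12194.0 ** 2)
    )

def r_c(f):
    # Eq. (3): C-weighting magnitude response
    return (12194.0 ** 2 * f ** 2) / (
        (f ** 2 + 20.6 ** 2) * (f ** 2 + 12194.0 ** 2)
    )

def a_weight(f):
    """A-weighting correction in dB at frequency f [Hz]."""
    return 20 * math.log10(r_a(f)) + 2.00

def c_weight(f):
    """C-weighting correction in dB at frequency f [Hz]."""
    return 20 * math.log10(r_c(f)) + 0.06

# Both curves pass through ~0 dB at the 1 kHz reference frequency:
print(round(a_weight(1000), 2), round(c_weight(1000), 2))
```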

3.5 Time weighting

Noise is measured by detecting the sound pressure level of a noise source. The SPL of some noise sources, like environmental noise, can fluctuate quickly, which makes it difficult to read the SPL value in real time. Time weighting is the process of damping the reaction to sudden fluctuations in order to create a smoother display on sound level meters. Two types of time weighting are specified in the IEC 61672-1 standard: Slow (S) and Fast (F). A third type, called Impulse (I), was specified in the older standard IEC 651.

• Slow:

The slow time weighting can be used in situations where the measured noise fluctuates a lot. The S weighting has a time constant (τ = 1 s) and a decay rate of 4.3 dB/s.


• Fast:

The fast time weighting is the typical and most common time weighting used in normal noise measurements. The F weighting has a time constant (τ = 125 ms) and a decay rate of 34.7 dB/s.

• Impulse:

The impulse time weighting is roughly four times quicker than the F weighting. With a time constant (τ = 35 ms) and a very slow decay rate (2.9 dB/s), I weighting is appropriate for measuring sharp impulsive noise such as a gunshot.
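The quoted decay rates for S and F follow from modeling the detector as a first-order exponential averager with time constant τ: once the input stops, the displayed level falls at 10·log10(e)/τ dB/s. A quick consistency check (this detector model is an assumption on our part, and it does not apply to the asymmetric Impulse weighting, whose slow decay is produced by a separate peak-hold stage):

```python
import math

def decay_rate(tau):
    """Decay rate [dB/s] of a first-order exponential averager with
    time constant tau [s] after the input is removed."""
    return 10 * math.log10(math.e) / tau

print(round(decay_rate(1.0), 1))    # Slow (tau = 1 s): 4.3 dB/s
print(round(decay_rate(0.125), 1))  # Fast (tau = 125 ms): 34.7 dB/s
```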

Figure 3.11 illustrates the behavior of the Slow and Fast weighting filters, while Figure 3.12 illustrates the behavior of the Impulse weighting filter. The SPL value in these graphs increases suddenly from 50 dB to 80 dB and drops back to 50 dB after 6 s. The S measurement (in orange) shows slow rise and fall times, approximately 5 s and 6 s, respectively. The F measurement (in green) shows a faster reaction, with a rise time of 0.6 s and a fall time of 1 s. In contrast to the S and F weightings, the I weighting (shown in blue) is asymmetric: it has a very fast rise time but a very slow fall time, 35 ms and 9 s, respectively.

Figure 3.11: Demonstrating the behavior of Fast and Slow weighting [4].


3.6 Algorithm for calculating sound pressure level

The sound pressure level can be calculated using two algorithms: one in the time domain and one in the frequency domain [26]. In the first algorithm, the sound pressure level over a time interval T is represented by the equivalent continuous sound level:

L_eq,T = 10 log10( (1/T) · ∫_0^T p^2(t) dt / p_ref^2 )    (4)

Where:

p(t) is the instantaneous sound pressure generated by the noise source,
p_ref is the reference sound pressure, and
T is the time interval.

The second algorithm is based on analyzing the signal in the frequency domain. The Discrete Fourier Transform (DFT) is used to obtain the frequency spectrum of the signal and then calculate its average energy in the frequency domain. The N-point DFT is given by:

X[k] = sum_{n=0}^{N-1} x[n] · e^(-j·2π·k·n/N),   k = 0, ..., N-1    (5)

To reduce the complexity of computing the DFT, the fast Fourier transform (FFT) algorithm can be used. The FFT computes the DFT rapidly by reducing the number of required operations, which is particularly useful for real-time measurements. One of the most common FFT algorithms is the Cooley-Tukey algorithm, also known as the radix-2 FFT [27]. This algorithm requires a window length that is a power of two (N = 2^m, where m is an integer). The minimum number of samples N_min for a sampling frequency f_s and an observation interval Δt is given by:

N_min = f_s · Δt    (6)

The frequency weighting can be applied directly to the frequency-spectrum samples. Applying A-weighting and C-weighting to X[k] yields the A-weighted and C-weighted samples X_A[k] and X_C[k], respectively:

X_A[k] = α_A(f_k) · X[k]    (7)

X_C[k] = α_C(f_k) · X[k]    (8)

f_k = k · Δf = k · f_s / N    (9)

α_A(f) and α_C(f) are the linear frequency responses of the A-weighting and C-weighting filters presented in Section 3.4, whereas f_k are the frequencies for each sample of X[k].

At this stage, the signal level can be obtained by integrating the total energy of the signal over a time response interval (fast or slow). The energy of a signal is defined as the sum of the squared magnitudes of the samples in the time domain. Since the weighted samples are produced in the frequency domain, the energy of the signal cannot be calculated directly in the time domain.


One way to calculate the energy is then to perform the inverse Fourier transform (IFFT), which is computationally expensive. A cheaper alternative is to compute the energy of the signal directly in the frequency domain by applying Parseval's theorem. For the DFT, Parseval's relation is given by:

$$x = \sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X[k]|^2 \qquad (10)$$

From the symmetry property of the DFT, the samples of the signal in the frequency domain are conjugate-symmetric around the sample N/2, since the signal is real-valued in the time domain:

$$X\!\left[\frac{N}{2} + k\right] = X^{*}\!\left[\frac{N}{2} - k\right] \qquad (11)$$

Since the samples are symmetrical in the frequency domain, the energy of the signal can be calculated using only the first (N/2 + 1) samples of X, or of the weighted samples (X_A or X_C). This reduces the computational cost by a factor of two, since only half of the samples are considered. The total energy can then be approximated as:

$$x \simeq \frac{2}{N} \sum_{k=0}^{N/2} |X[k]|^2 \qquad (12)$$

For a single observation interval Δt, the average instantaneous energy of the signal within this interval is given by:

$$\tilde{x} \simeq \frac{2}{N\,\Delta t} \sum_{k=0}^{N/2} |X[k]|^2 \qquad (13)$$

The final sound pressure level in dB (if no frequency weighting is used) can be obtained by the following expression:

$$SPL = 10 \log_{10}\!\left(\frac{\tilde{x}}{ref}\right) + C_{cal} = 10 \log_{10}(\tilde{x}) - 10 \log_{10}(ref) + C_{cal} \qquad (14)$$

Where ref is a constant related to the standard reference pressure, and C_cal is the calibration constant.
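The frequency-domain algorithm can be sketched end to end: FFT, optional per-bin weighting, Parseval-based energy over half the spectrum, and conversion to dB. This is a minimal illustration under our own assumptions (calibrated pressure samples, a single rectangular observation window, and a function name of our choosing), not the exact thesis implementation:

```python
import numpy as np

def spl_frequency_domain(x, fs, p_ref=2e-5, c_cal=0.0, weighting=None):
    """SPL in dB from one observation window via the FFT (Eq. 5-14)."""
    n = len(x)
    X = np.fft.rfft(x)                      # first N/2 + 1 bins of a real signal
    if weighting is not None:               # optional A/C filter response per bin
        X = X * weighting(np.fft.rfftfreq(n, d=1.0 / fs))
    # Parseval (Eq. 10-13): mean-square pressure from half the spectrum
    mean_square = 2.0 * np.sum(np.abs(X) ** 2) / n ** 2
    return 10.0 * np.log10(mean_square / p_ref ** 2) + c_cal
```

Doubling the half-spectrum sum slightly overweights the DC and Nyquist bins, but for a tonal or broadband window this error is negligible; for a 1 kHz sine of amplitude 1 Pa the result is about 91 dB, matching the time-domain definition of Equation (4).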


4 Noise Classification

In this chapter, we describe our method for classification of noise using machine learning on the Raspberry Pi. The proposed noise classification system is illustrated in Figure 4.13. MFCCs are extracted from a training dataset of sound samples to train SVM, KNN, bagging, and random forest models, which are then used to predict the type of sensed environmental sounds.

Figure 4.13: Our machine learning based approach for noise classification.

4.1 Dataset

To investigate the performance of the system, we conduct experiments with eight different classes of environmental sounds: quietness, silence, car horn, children playing, gunshot, jackhammer, siren, and street music. For the purpose of this study, we chose noise-relevant environmental sound clips from popular sound datasets, such as UrbanSound8K [28] and Sound Events [29]. The total dataset contains 3042 sound excerpts with lengths of up to four seconds. Table 4.5 provides information about the environmental sound samples that we use for experimentation.


Table 4.5: Classes of sound samples in the dataset

Class             Samples   Duration [min:sec]
Quietness         40        02:00
Silence           40        02:00
Car horn          312       14:38
Children playing  560       36:47
Gun shot          235       06:39
Jackhammer        557       32:34
Siren             662       43:17
Street music      636       42:24
Total             3042      180:19

4.2 Feature Extraction

In machine learning, audio features are a set of distinctive properties that describe a sound signal well. These features should be informative and discriminative. Feature extraction is the process of transforming input data into a set of features and selecting the most informative and meaningful ones, reducing the large number of available features. Feature extraction is the first step in any sound classification system. Choosing the appropriate set of features is important, and it can be critical for the efficiency of classification algorithms. A variety of audio features have been proposed for sound classification [30], such as the Zero-Crossing Rate (ZCR), MPEG-7 audio features, and Mel-Frequency Cepstrum Coefficients (MFCC).

The MFCCs [31] are a well-known feature set. These features are widely used in the area of sound classification because they correlate well with human hearing: they analyze sound waves linearly at low frequencies and logarithmically at high frequencies.

In this project, we use the MFCCs for audio feature extraction. The MFCCs can be obtained using the procedure depicted in Figure 4.14.


The Mel-Frequency Cepstrum Coefficients are defined as the real cepstrum of a windowed short-time signal derived from the FFT of that signal. Each step of the MFCC extraction procedure is briefly described in the following:

1. Framing and Windowing:

In the first step, the input signal is divided into a number of small frames of a specific length. The frame length is usually between 20 and 40 ms, and the number of samples per frame should be a power of two in order to facilitate the implementation of the Fast Fourier Transform. Then, each frame is multiplied by a window function. There are multiple window functions that can be used, such as Rectangular, Triangular, Blackman, Hamming, and Hann (Hanning). For audio applications, Hamming and Hann are the most widely used window functions; they are defined by:

Hamming: $w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{M-1}\right), \qquad 0 \le n \le M-1 \qquad (15)$

Hann: $w(n) = 0.5 - 0.5 \cos\!\left(\frac{2\pi n}{M-1}\right), \qquad 0 \le n \le M-1 \qquad (16)$

Where M denotes the window length.
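Equations (15) and (16) agree with the window generators shipped in NumPy, which can be checked directly (a quick sanity check, not part of the thesis code; the frame length is an assumed example):

```python
import numpy as np

M = 400                    # e.g. a 25 ms frame at 16 kHz (assumed setting)
n = np.arange(M)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (M - 1))   # Eq. (15)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / (M - 1))        # Eq. (16)

# NumPy ships both windows with exactly these definitions.
assert np.allclose(hamming, np.hamming(M))
assert np.allclose(hann, np.hanning(M))
```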

2. Fast Fourier Transform:

To better analyze a discrete signal, it is usually converted into its frequency components using the Discrete Fourier Transform (DFT). The DFT of a periodic signal x_N[n] with N consecutive samples, and its inverse, are defined as:

$$X_N[k] = \sum_{n=0}^{N-1} x_N[n]\, e^{-j 2\pi n k / N}, \qquad 0 \le k < N \qquad (17)$$

$$x_N[n] = \frac{1}{N} \sum_{k=0}^{N-1} X_N[k]\, e^{j 2\pi n k / N}, \qquad 0 \le n < N \qquad (18)$$

The Fast Fourier Transform (FFT) is a family of algorithms that computes the DFT in a much faster way, reducing the computational complexity of the DFT in Equation (17) from $O(N^2)$ to $O(N \log_2 N)$. After framing and windowing the input signal, an FFT algorithm is applied to each windowed frame.

3. Mel-Frequency Cepstrum:

The Cepstrum of a signal is defined as the IFFT of the log magnitude of the FFT of that signal:

$$c[n] = \mathrm{FFT}^{-1}\{\log |\mathrm{FFT}\{x[n]\}|\} \qquad (19)$$

The Mel-Frequency Cepstrum of a windowed signal derived from the FFT can be obtained in a similar way to the real cepstrum, but using a perceptual scale of pitches called the Mel scale [32]. In contrast to the linear frequency scale (in Hz), the Mel scale is a logarithmic frequency scale that reflects the perceived frequency of a pure tone (Figure 4.15).


Figure 4.15: Mel scale versus normal frequency scale (Hz).

The Mel-scale formula as a function of the linear frequency f (in Hz), and its corresponding inverse, are expressed as:

$$M(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) = 1127 \ln\!\left(1 + \frac{f}{700}\right) \qquad (20)$$

$$M^{-1}(m) = 700\left(10^{\frac{m}{2595}} - 1\right) = 700\left(e^{\frac{m}{1127}} - 1\right) \qquad (21)$$
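Equations (20) and (21) translate directly into a pair of helper functions (the names are ours); by construction of the scale, 1000 Hz maps to approximately 1000 mel:

```python
import math

def hz_to_mel(f):
    """Eq. (20): linear frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Eq. (21): Mel value back to linear frequency in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```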

In this step, the power spectrum obtained from Step (2) is mapped onto the Mel scale using a Mel-scale filterbank. The filterbank [33] is an array of band-pass filters that computes the average spectrum around each center frequency, with increasing bandwidth.

Figure 4.16 shows an example of a filterbank containing ten triangular filters spaced between 1 and 8 kHz, where lower frequencies are spaced more closely to each other than high frequencies to approximate the behavior of the human auditory system.


The filterbank with M triangular filters (m = 1, 2, ..., M) is defined by:

$$H_m[k] = \begin{cases} 0 & k < f[m-1] \\ \dfrac{k - f[m-1]}{f[m] - f[m-1]} & f[m-1] \le k \le f[m] \\ \dfrac{f[m+1] - k}{f[m+1] - f[m]} & f[m] \le k \le f[m+1] \\ 0 & k > f[m+1] \end{cases} \qquad (22)$$

Where f[m] denotes the boundary points, which are uniformly spaced on the Mel scale. The boundary points can be obtained as:

$$f[m] = \frac{N}{F_s}\, M^{-1}\!\left(M(f_l) + m\, \frac{M(f_h) - M(f_l)}{M + 1}\right) \qquad (23)$$

M() is the Mel scale given by Equation (20). f_l and f_h are the lowest and highest frequencies of the filterbank, respectively, and F_s is the sampling frequency, where f_l, f_h, and F_s are expressed in Hz.
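Equations (22) and (23) can be combined into a small filterbank constructor. The sketch below is a simplified illustration (boundary points rounded down to FFT bins, names our own), not the exact implementation used in the thesis:

```python
import numpy as np

def mel_filterbank(num_filters, n_fft, fs, f_low, f_high):
    """Triangular Mel filters H_m[k] (Eq. 22) with boundary points f[m]
    spaced uniformly on the Mel scale (Eq. 23)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), num_filters + 2)
    f = np.floor((n_fft / fs) * mel_to_hz(mels)).astype(int)  # boundary bins
    H = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        rise = np.arange(f[m - 1], f[m])          # ascending slope of Eq. 22
        fall = np.arange(f[m], f[m + 1])          # descending slope of Eq. 22
        H[m - 1, rise] = (rise - f[m - 1]) / max(f[m] - f[m - 1], 1)
        H[m - 1, fall] = (f[m + 1] - fall) / max(f[m + 1] - f[m], 1)
    return H
```

Each filter peaks at 1 at its center bin, and the lower-frequency filters are narrower, mirroring Figure 4.16.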

4. Logarithm:

The logarithm of the power at the output of each filter of the Mel-scale filterbank is computed:

$$S[m] = \ln\!\left[\sum_{k=0}^{N-1} |X_a[k]|^2\, H_m[k]\right], \qquad 0 \le m < M \qquad (24)$$

Where X_a[k] is the FFT of the input signal.

5. Discrete Cosine Transform:

The Mel-frequency cepstrum is then the Discrete Cosine Transform (DCT) of the M filter outputs. The DCT is similar to the DFT, but uses only real numbers. The DCT of the filter outputs S[m] is defined by:

$$c[n] = \sum_{m=0}^{M-1} S[m] \cos\!\left(\frac{\pi n (m + \frac{1}{2})}{M}\right), \qquad 0 \le n < M \qquad (25)$$

The result of the Discrete Cosine Transform gives the Mel-frequency cepstrum coefficients. The parameter M determines the number of coefficients, and its value varies for different applications. For sound classification systems, typically the first 13 Mel-Frequency Cepstrum Coefficients are used [33].

Foote [34] proposes the use of the first 12 MFCCs plus an energy term as sound features. In this project, we explore the feasibility of performing feature extraction based on MFCCs. We compute the first 12 MFCCs of all frames of the entire signal and append the frame energy to each feature vector. Thus, each audio signal is transformed into a sequence of 13-dimensional feature vectors.
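Putting Steps 1–5 together, a single frame can be turned into this 13-dimensional vector (12 MFCCs plus the log frame energy). The following is a bare-bones NumPy sketch under our own simplifying choices (26 filters, a full-band filterbank, natural log); the thesis itself uses the librosa package for this task:

```python
import numpy as np

def mfcc_frame(frame, fs, num_filters=26, num_coeffs=12):
    """Steps 1-5 for one frame: window, power spectrum, Mel filterbank,
    log (Eq. 24), DCT (Eq. 25); returns 12 MFCCs + log frame energy."""
    n = len(frame)
    windowed = frame * np.hamming(n)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    # Boundary bins uniformly spaced on the Mel scale (Eq. 23)
    def mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_inv(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = np.floor((n / fs) * mel_inv(
        np.linspace(mel(0.0), mel(fs / 2.0), num_filters + 2))).astype(int)
    S = np.empty(num_filters)
    for m in range(1, num_filters + 1):
        h = np.zeros_like(power)
        rise = np.arange(edges[m - 1], edges[m])
        fall = np.arange(edges[m], edges[m + 1])
        h[rise] = (rise - edges[m - 1]) / max(edges[m] - edges[m - 1], 1)
        h[fall] = (edges[m + 1] - fall) / max(edges[m + 1] - edges[m], 1)
        S[m - 1] = np.log(np.sum(power * h) + 1e-12)      # Eq. (24)
    # DCT of the filter outputs (Eq. 25); keep coefficients c[1..12]
    i = np.arange(1, num_coeffs + 1)[:, None]
    m_idx = np.arange(num_filters)[None, :]
    coeffs = (S[None, :] * np.cos(np.pi * i * (m_idx + 0.5) / num_filters)).sum(axis=1)
    energy = np.log(np.sum(windowed ** 2) + 1e-12)        # appended energy term
    return np.concatenate([coeffs, [energy]])
```

Applying this to every frame of a clip yields the sequence of 13-dimensional vectors described above.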


4.3 Classification Algorithms

Classification algorithms are supervised learning techniques used for classifying data into different categories. Amid the diversity of classification algorithms, selecting the proper one is not straightforward, since no single algorithm fits all applications and there is always a trade-off between different model characteristics, such as complexity, accuracy, memory usage, and speed of training.

In this section, we explain four supervised classification methods: support vector machine, k-nearest neighbors, and two tree-based methods, namely bootstrap aggregating and random forest.

4.3.1 Support Vector Machines (SVM)

SVM [35] is a popular supervised algorithm mostly used for solving classification problems. The main goal of the SVM algorithm is to design a model that finds the optimal hyperplane separating the training data into two classes. There may be many hyperplanes that separate all the training data correctly, but the best choice is the hyperplane that leaves the maximum margin, defined as the distance between the hyperplane and the closest samples. These samples are called the support vectors.

Considering the example of two linearly separable classes (circles and squares) shown in Figure 4.17, both hyperplanes (one and two) can classify all the training instances correctly, but hyperplane one is the better choice since it has the greater margin (m1 > m2).

Figure 4.17: An illustration of SVM for a 2-class classification problem.

When the data is non-linearly separable, a nonlinear classifier can be created by applying the kernel trick [36]. Using the kernel trick, a non-separable problem can be converted into a separable one using kernel functions that transform the low-dimensional input space into a high-dimensional space. Selecting the appropriate kernel and its parameters has a significant impact on the SVM classifier. Another important parameter for the SVM classifier is the soft-margin parameter C, which controls the trade-off between the simplicity of the decision boundary and the misclassification penalty of the training points. A low value of C makes the classifier tolerant of misclassified data points (that is, a smooth decision boundary), while a high value of C makes it aim for a perfect classification of the training points (that is, a complex decision boundary).

One of the kernel functions commonly used in SVM classification is the radial basis function (RBF). The RBF kernel on two feature vectors x and x′ is expressed by Equation 26:

$$K(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) = \exp\!\left(-\gamma \|x - x'\|^2\right) \qquad (26)$$

The RBF parameter γ determines the influence of the training data points on the exact shape of the decision boundary. For high values of γ, the details of the decision boundary are determined only by the closest points, while for low values of γ, even faraway points are considered in drawing the decision boundary.

Figure 4.18 demonstrates the effect of the parameters C and γ for a two-class classification problem, where C ∈ [10⁻², 10²] and γ ∈ [10⁻¹, 10¹]. It can be seen from the figure that the decision boundary is smooth for low values of C (first row of sub-figures) and more complex for higher values (last row), while increasing the value of the parameter γ decreases the influence of the distant training points.

In this project, we explore the effect of the parameters C and γ on an SVM classifier trained by our dataset of sound samples.

Figure 4.18: The effect of C and γ on the decision boundaries in SVM for a 2-class classification problem.


4.3.2 K-Nearest Neighbors (KNN)

KNN is one of the simplest machine learning algorithms used for classification. In spite of its simplicity, it is considered one of the top 10 classification algorithms [37]. KNN works based on the minimum distance (such as the Euclidean distance) between the test point and all training points. The class label of the test point is then determined by the most frequent class among the k nearest neighbors of the test point.

The KNN classifier is illustrated with an example in Figure 4.19. Two class labels are represented with squares and circles, and the purpose of the KNN algorithm is to predict the correct class of the triangle. Suppose k = 3; then the model will find the three nearest neighbors of the triangle. The class label of the triangle can be predicted by finding the most frequent class among the three nearest neighbors, which is the class of squares in this case.

Figure 4.19: An illustration of KNN for a 2-class classification problem for k = 3.

KNN is a lazy learning algorithm, which means that the classifier does not generalize from the training data points before the testing phase. As a result, KNN and other lazy learning algorithms have a very fast training phase, in which the training task is reduced to just memorizing the training points. On the other hand, the testing phase becomes computationally expensive and requires more time, since all the training points are needed to make a prediction. In addition, more memory is needed to store all the training points.

The performance of KNN is affected by a number of key elements. The most important consideration is the number of nearest neighbors k. Since k is highly dependent on the training data, there is no choice that is optimal for all datasets. In most cases, a smaller value of k creates a complex model (rough decision boundary) and increases the accuracy of KNN. When k is too small, the model becomes sensitive to outliers or noisy features if the class boundaries are not clear (over-fitting). On the other hand, a larger value of k creates a simpler model (smooth decision boundary), which suppresses the effect of noisy features but leads to a higher misclassification rate. When k is too large, all testing points are assigned to the same dominant class (under-fitting).


Figure 4.20 shows the effect of the value of k on the decision boundary in a 3-class classification problem. It can be seen from the figure that low values of k lead to a rough decision boundary and thus a complex model. If a point from the blue class is nested among the points of another class (green or red), then this point will be classified correctly, but all nearby points of the other classes will be classified incorrectly. The decision boundary becomes smoother with higher values of k, and the model becomes simpler. In this case, the model shows some tolerance for nested points during the training phase.

Figure 4.20: The effect of k on the decision boundaries in KNN for a 3-class classification problem.

Another key element that affects the performance of KNN is the distance metric. An appropriate choice of distance metric for a given application can improve the accuracy of KNN. There are several distance functions used to compute the distance between two data points. Commonly used distance functions include:

• Euclidean distance: $d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$

• Manhattan distance: $d(q, p) = \sum_{i=1}^{n} |q_i - p_i|$

• Chebyshev distance: $d(q, p) = \max_i |q_i - p_i|$

Where d denotes the distance between two data points in an n-dimensional vector space, p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn).
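The three distance functions can be written out in a few lines (a quick reference sketch with our own function names):

```python
import numpy as np

def euclidean(q, p):
    """Straight-line distance: sqrt of the sum of squared differences."""
    return float(np.sqrt(np.sum((np.asarray(q) - np.asarray(p)) ** 2)))

def manhattan(q, p):
    """City-block distance: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(np.asarray(q) - np.asarray(p))))

def chebyshev(q, p):
    """Maximum absolute coordinate difference."""
    return float(np.max(np.abs(np.asarray(q) - np.asarray(p))))
```

For q = (1, 2, 3) and p = (4, 6, 3), these give 5, 7, and 4, respectively.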


4.3.3 Bootstrap Aggregating (Bagging) and Random Forest

Bagging and random forest are very robust ensemble supervised learning methods [38]. An ensemble learning method is a technique that combines several machine learning algorithms to achieve a more accurate and stable prediction than could be achieved by any one algorithm by itself. Even though bagging and random forest can be applied to all types of classification and regression algorithms, they are particularly useful for decision trees. In this project, we consider bagging and random forest as tree-based methods for the classification task.

In machine learning, decision trees (DTs) are a simple and popular supervised learning method that uses a tree-like structure as a predictive model [39]. They are often referred to as classification and regression trees (CART). Decision trees can be graphically represented as a hierarchical structure consisting of multiple elements (nodes) connected to each other by incoming and outgoing lines (branches or edges). There are three types of nodes in any classification tree:

• Root node: which is the starting node at the top of the tree structure. A root node has outgoing branches (children) and no incoming branches (parent). It contains an attribute test condition to separate records that have different characteristics.

• Internal nodes: which include all nodes between the root and terminal nodes. Each internal node has both a parent and at least one child.

• Terminal nodes: each tree ends with one or more terminal nodes that have a parent and no children. Each terminal node, also known as a leaf, is assigned a class label.

Figure 4.21 shows a simple example of a classification tree used for classifying an instance into five different categories. The root node and each internal node contain a conditional test that compares the instance properties (x1 and x2) with the region borders (θi) and splits the node based on the comparison result. Finally, each terminal node contains a class label that refers to one of the five regions.


When there are many nodes in a tree, the model becomes more complex and tends to overfit. Moreover, decision trees are sensitive to their training data, hence changing the training data may result in quite different predictions.

However, the predictive performance of decision trees can be significantly improved by aggregating many tree-models using ensemble methods, such as bagging and random forest. The structure of tree-based ensemble methods is represented in Figure 4.22.

Figure 4.22: The structure of tree-based ensemble methods.

• Bagging (bootstrap aggregating): Bootstrap aggregating is a supervised learning algorithm designed to increase the prediction accuracy of other learning algorithms. The idea behind bagging is to train multiple models of the same type and then average their outputs (in regression) or take a majority vote (in classification) to construct a more powerful prediction model.

For decision-tree classification, the bagging classifier builds and trains N tree classifiers on N training subsets, drawn randomly with replacement from the original training set (bootstrapping). Finally, the overall prediction of the bagging classifier is obtained by taking a majority vote over the classes predicted by each of the N tree models.

The bagged trees are not pruned, and all features in a subset are considered when the tree is looking for the best split at each node. The bagging classifier has one main parameter, which is the number of bagged trees.


• Random forest: The random forest algorithm, also known as random decision forest, is an extension of bagged decision trees. In bagging, the trees are similar to each other and hence highly correlated. To overcome this problem, random forest provides an improvement that decorrelates the trees. In contrast to bagging, where all M features of a training subset are candidates for splitting each node in a tree, random forest considers only a subset of m features, randomly selected from the total number of features M, when searching for the best split at each node. Typical values of the parameter m for classification tasks are m = √M and m = log₂(M).
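Both ensembles are available in scikit-learn, where the distinction discussed above surfaces as the max_features parameter. A minimal sketch on synthetic 13-dimensional data (the dimensionality mirrors our feature vectors, but the data itself is synthetic and illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=13, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: decision trees (the default base estimator) on bootstrap
# samples; every split may consider all M features.
bag = BaggingClassifier(n_estimators=60, random_state=0).fit(X_tr, y_tr)

# Random forest: each split draws only m = sqrt(M) (or "log2") features,
# which decorrelates the trees.
rf = RandomForestClassifier(n_estimators=60, max_features="sqrt",
                            random_state=0).fit(X_tr, y_tr)
```

The choice of 60 trees echoes the value that performed best in our experiments (Section 5.3.3).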


5 Experimental Evaluation

In this chapter, we evaluate our solution for noise measurement and classification. Each method is evaluated by conducting a number of experiments on the Raspberry Pi platform.

5.1 Setup

The first step is to prepare our platform (the Raspberry Pi and the ReSpeaker microphone) for performing experiments. The Raspberry Pi has a micro SD card slot on its underside, so we used a 16 GB micro SD card to store our data and the operating system. Then, the Raspbian operating system was downloaded and installed. When the Raspberry Pi was ready, the microphone was mounted on its upper side and prepared for recording. All recorded sounds are sampled at a rate of 44.1 kHz and stored as WAV files.

5.2 Measurement

To investigate the efficiency of our platform for noise measurement, we implement our method for calculating the sound pressure level on the Raspberry Pi, using the ReSpeaker microphone for collecting data. For evaluation, we used a sound level meter called Tacklife SLM01 [40] to compare our results with reference values. The Tacklife is a commercial sound level meter (Figure 5.23) that can measure sound pressure levels between 40 and 130 dBA (see Table 5.6).


Table 5.6: Major properties of Tacklife SLM01.

Property               Tacklife SLM01
Frequency range        31.5 Hz - 4 kHz
Measuring level range  40 - 130 dB
Frequency weighting    A
Time weighting         FAST
Accuracy               ±1.5 dB
Resolution             0.1 dB

To examine our method, we created several sound clips of type WAV. All sounds are sine waves at 1 kHz with various sound pressure levels, from 60 to 85 dB. For each sound wave, we measured the sound pressure level using the Raspberry Pi and compared the results with reference values measured by the Tacklife SLM01 sound level meter.

First, our experiment was done without any calibration (Ccal = 0), where both devices were set 1 meter away from the sound source. Figure 5.24 shows the results of our measurement without calibration, where the horizontal axis represents the SPL values measured by the sound level meter (SLM01), and the vertical axis represents the SPL values measured by the Raspberry Pi. It is obvious that all measured values of the sound pressure level are lower than the reference values, with an approximate observational error between 4 and 5.7 dB. For this reason, we need to tune the calibration constant to decrease the error.


From the measurement data shown in Figure 5.24, we observe that Ccal = 5 gives better measurement accuracy, with a measurement error of less than 1 dB. This error is not noticeable, since the smallest difference in sound pressure level the human ear can detect is about 3 dB. Figure 5.25 shows the measurements before and after calibration in comparison with the reference values obtained by the sound level meter.

Figure 5.25: Performance evaluation of SPL measurement using the Raspberry Pi and Mic-Hat after calibration (Ccal = 5).


5.3 Classification

In this section, we investigate the performance of SVM, KNN, Bagging, and Random Forest on eight different classes of environmental sounds: quietness, silence, car horn, children playing, gunshot, jackhammer, siren, and street music. For training the models, we use a dataset of 3042 samples of environmental sounds (see Table 4.5). We divide the dataset randomly into two subsets: 75% of the samples are used for training and 25% for testing. All experiments are repeated 20 times with different subsets, and the obtained results are averaged. We have implemented all algorithms in Python using open-source packages for machine learning and audio analysis (that is, scikit-learn [41] and librosa [42]).

5.3.1 SVM Parameter Space Exploration

To optimize the performance of SVM, we used grid search to select the best combination of the parameters γ and C for the RBF kernel. To explore the SVM accuracy, we plot the heat map depicted in Figure 5.26 as a function of γ and C, where γ ∈ [10⁻¹¹, 10¹] and C ∈ [10⁻⁴, 10⁸]. Table 5.7 shows the SVM model accuracy [%] for various values of the γ and C parameters. After evaluating the model, we achieved 93.87% accuracy for γ = 0.00167 and C = 3, as shown in Figure 5.27 and Figure 5.28. Figure 5.29 shows the confusion matrix, which compares the predicted class with the actual class.
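The grid search can be expressed with scikit-learn's GridSearchCV. This condensed sketch searches the same C and γ values as Table 5.7, but on synthetic stand-in data rather than the sound dataset, so its selected parameters are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 13-dimensional stand-in for the MFCC feature vectors.
X, y = make_classification(n_samples=300, n_features=13, n_informative=8,
                           n_classes=4, random_state=0)

param_grid = {"C": [0.1, 1, 3, 5, 10, 100],
              "gamma": [0.0001, 0.00167, 0.01, 0.1]}
# 5-fold cross-validation over every (C, gamma) combination.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
best = search.best_params_
```

On the thesis dataset, the analogous search selected C = 3 and γ = 0.00167.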


Figure 5.27: The effect of the parameter γ on the performance of the SVM classifier.


Table 5.7: Accuracy [%] of SVM.

       γ
C      0.0001   0.00167   0.01    0.1
0.1    64.14    67.07     22.96   21.18
1      79.19    92.31     72.66   29.88
3      82.85    93.87     75.22   31.53
5      84.40    93.86     75.21   31.53
10     85.90    93.83     75.19   31.53
100    89.24    93.70     75.18   31.54

Table 5.8: Time [seconds] for training and testing of the SVM model on the Pi Zero W. The time for feature extraction is not included.

       γ = 0.0001     γ = 0.00167    γ = 0.01       γ = 0.1
C      Train  Test    Train  Test    Train  Test    Train  Test
0.1    8.03   2.37    11.90  2.59    21.98  2.87    31.56  4.64
1      5.00   1.90    11.93  1.98    26.37  2.58    33.00  4.56
3      4.36   1.63    12.29  1.99    26.70  2.65    33.42  4.50
5      4.50   1.62    12.44  1.99    26.76  2.56    33.36  4.51
10     4.29   1.41    12.33  1.98    26.85  2.56    35.32  4.77
100    5.50   1.17    12.29  1.98    26.59  2.58    34.24  4.65


5.3.2 KNN Parameter Space Exploration

For the KNN classifier, we examine the influence of the parameter k and of different distance functions: Euclidean, Manhattan, and Chebyshev (Section 4.3.2).

Figure 5.30 illustrates the classification accuracy of KNN for various values of k for each distance function. It can be seen from the figure that, for the same value of k, the Euclidean distance outperforms the Chebyshev distance, whereas the Manhattan distance outperforms both. All distance functions give better results for low values of k; as k increases, the accuracy of the KNN classifier drops significantly.

Figure 5.30: Performance of the KNN classifier for various values of nearest neighbors k and Euclidean, Manhattan, and Chebyshev distances.

Table 5.9 presents the results for the KNN accuracy, where the Manhattan distance with k = 1 proved to be the best parameter combination, with a sound type recognition accuracy of 93.88%. The time consumption for training and testing a KNN classifier on our dataset is shown in Table 5.10. The training phase has the same trivial time consumption for all values of the parameters k and distance, since it is limited to just memorizing the training data, as discussed in Section 4.3.2.

Figure 5.31 depicts the confusion matrix of the KNN classifier, which compares the predicted and the correct label of each class. The prediction accuracy of different classes varies from 85% (for the class Car horn) to about 100% for the two classes (Quietness and Silence).


Table 5.9: Accuracy [%] of KNN.

k     Euclidean   Manhattan   Chebyshev
1     93.46       93.88       90.43
5     88.88       89.42       85.01
10    83.34       84.13       80.58
50    68.20       69.66       67.01

Table 5.10: Time [seconds] for training and testing of the KNN model on the Pi Zero W. The time for feature extraction is not included.

      Euclidean        Manhattan        Chebyshev
k     Train   Test     Train   Test     Train   Test
1     0.05    0.21     0.05    0.5      0.05    0.14
5     0.05    0.37     0.05    0.92     0.05    0.24
10    0.05    0.47     0.05    1.15     0.05    0.31
50    0.05    0.80     0.05    1.71     0.05    0.57


5.3.3 Bagging and Random Forest Parameter Space Exploration

To evaluate the performance of bagging and random forest, we examine the accuracy of both models for different numbers of decision trees (N). For random forest, we also explore the influence of the variable m, which represents the number of features considered for splitting each node (Section 4.3.3). The accuracy of both models is shown in Figure 5.32, in which the number of trees varies within the range 1 – 100, while the variable m of the random forest classifier takes one of the two common values, m = √M (in orange) and m = log₂(M) (in green).

Figure 5.32: Performance of the Bagging and Random Forest classifiers for various numbers of decision trees.

It can be seen from the graph that, for both classifiers, using more than one tree gives better performance than using a single tree. The accuracy starts at 71.15% for bagging and 68% for random forest when only one tree is considered. For a small number of trees, the bagging classifier shows higher accuracy, since all features are considered for the best split at each node (m = M). As the number of trees grows beyond 15, the random forest classifier begins to outperform the bagging classifier, as expected. The accuracy of both classifiers levels off after a certain point (about 50 trees), at around 88% for bagging and 90% for random forest.

Table 5.11 and Table 5.12 present the performance of Bagging and Random Forest with respect to classification accuracy and the time needed for training and testing, respectively. Figure 5.33 and Figure 5.34 show the confusion matrices for the Bagging and Random Forest (m = √M) classifiers, respectively, when using 60 decision trees.


Table 5.11: Accuracy [%] of Bagging and Random Forest.

N     Bagging   Random Forest (sqrt)   Random Forest (log)
1     71.15     68.24                  68.47
5     81.20     80.59                  81.01
10    85.07     85.68                  85.53
15    86.18     86.74                  87.56
20    86.57     87.88                  88.30
25    87.24     88.55                  88.71
30    87.03     88.91                  88.96
35    87.40     89.14                  89.28
40    87.29     89.30                  89.36
45    87.80     89.63                  89.30
50    87.73     89.63                  89.65
55    87.81     89.66                  89.80
60    87.81     89.91                  89.92
65    88.07     89.75                  89.87
70    88.03     89.86                  89.87
75    88.06     89.72                  90.00
80    88.00     90.09                  90.08
85    88.00     90.08                  90.06
90    88.13     90.03                  90.12
95    88.04     90.00                  90.17
100   88.09     90.17                  90.07

Table 5.12: Time [seconds] for training and testing of the Bagging and Random Forest models on the Pi Zero W. The time for feature extraction is not included.

      Bagging          Random Forest (sqrt)   Random Forest (log)
N     Train   Test     Train   Test           Train   Test
1     0.36    0.018    0.169   0.016          0.171   0.016
5     1.76    0.046    0.817   0.037          0.837   0.037
10    3.45    0.080    1.620   0.064          1.616   0.064
15    5.18    0.115    2.448   0.091          2.470   0.090
20    7.14    0.149    3.306   0.119          3.235   0.118
25    8.83    0.183    4.054   0.144          3.991   0.144
30    10.73   0.218    4.849   0.172          4.819   0.171
35    12.20   0.251    5.608   0.199          5.662   0.199
40    13.75   0.286    6.464   0.224          6.430   0.224
45    16.29   0.320    7.327   0.252          7.228   0.251
50    17.23   0.356    8.106   0.276          8.081   0.278
55    19.11   0.389    8.955   0.304          8.784   0.303
60    21.03   0.423    9.609   0.329          9.560   0.394


Figure 5.33: The confusion matrix of Bagging-based noise classification.


5.3.4 Performance comparison of classifiers

Amid the diversity of classification algorithms, selecting the proper one is not straightforward, since no single algorithm fits all applications and there is always a trade-off between different model characteristics, such as accuracy and execution time.

In this section, we compare the performance of SVM, KNN, Bagging, and Random Forest with respect to noise identification accuracy and the time needed for model training and testing on the Raspberry Pi Zero W. Table 5.13 lists the performance results of the four studied classifiers. The parameter values that result in the best accuracy of each classifier are as follows:

• SVM: C = 3 and γ = 0.00167.

• KNN: k = 1 with Manhattan distance.

• Bagging: N = 60.

• Random forest: N = 60.
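The comparison can be sketched with scikit-learn as below, using the best parameter values listed above. The synthetic dataset is a placeholder for the real MFCC features, and timings on a desktop machine will differ from the Pi Zero W.

```python
# Fit and time the four studied classifiers with the parameter values that
# gave the best accuracy in the text. The dataset is synthetic stand-in data.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=40, n_informative=20,
                           n_classes=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

classifiers = {
    "SVM": SVC(C=3, gamma=0.00167),
    "KNN": KNeighborsClassifier(n_neighbors=1, metric="manhattan"),
    "Bagging": BaggingClassifier(n_estimators=60, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=60, random_state=0),
}
for name, clf in classifiers.items():
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)
    t_train = time.perf_counter() - t0
    t0 = time.perf_counter()
    acc = clf.score(X_te, y_te)
    t_test = time.perf_counter() - t0
    print(f"{name:14s} acc={acc:.3f} train={t_train:.3f}s test={t_test:.3f}s")
```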

With respect to noise identification accuracy, all four classifiers perform well; the achieved accuracy is in the range 87.81% – 93.88%. We may observe that the accuracy of Bagging and Random Forest is slightly lower than the accuracy of SVM and KNN.

Regarding the model training time, we may observe in Table 5.13 that the fastest classifier is KNN. With respect to the testing time, the fastest classifier is the random forest and the slowest is SVM.

Table 5.13: Performance comparison of classifiers with respect to accuracy, precision, recall, F1 score, and execution time. Training and testing are performed on the Raspberry Pi Zero W. We use 3/4 of the data-set (3042 sound samples) for training and 1/4 of the data-set for testing.

Classifier       SVM     KNN     Bagging   Random Forest
Accuracy [%]    93.87   93.88     87.81       89.91
Precision       0.953   0.947     0.91        0.93
Recall          0.937   0.941     0.89        0.90
F1 score        0.944   0.944     0.90        0.91
Training [s]    12.29    0.05    21.03        9.61
Testing [s]      1.99    0.5      0.42        0.33
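The metrics of Table 5.13 can be computed with scikit-learn as in the following sketch. Macro averaging is an assumption here (the averaging used in the thesis is not stated in this section), and the toy label vectors are illustrative only.

```python
# Compute accuracy, precision, recall, and F1 score for a multi-class
# prediction. Macro averaging (unweighted mean over classes) is assumed.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 3, 3]  # illustrative ground-truth class labels
y_pred = [0, 1, 2, 1, 1, 0, 3, 2]  # illustrative predicted class labels

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))
```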


6 Conclusion

In this thesis, we have presented a solution for noise measurement and classification using a low-power and inexpensive IoT unit, that is, the Raspberry Pi Zero W.

For noise measurement, we implemented an algorithm that captures sound from real-world sources using the ReSpeaker 2-Mic Pi HAT and calculates the sound pressure level in dB. After calibration with the calibration constant Ccal = 5, we achieved a measurement error of less than 1 dB.
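A minimal sketch of the SPL computation described above, assuming normalized samples in [-1, 1] with full scale as the reference amplitude; the calibration constant Ccal then absorbs the microphone-specific offset. The exact reference used in the thesis's implementation is an assumption here.

```python
# Sound pressure level of a block of normalized microphone samples:
# 20 * log10(rms / ref) plus a calibration constant (Ccal = 5 in the text).
import math

def sound_pressure_level(samples, c_cal=5.0, ref=1.0):
    """Return the SPL in dB for samples normalized to [-1, 1]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / ref) + c_cal

# A full-scale constant signal has rms = 1, so the SPL equals Ccal.
print(sound_pressure_level([1.0] * 100))  # -> 5.0
```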

Our machine learning approach for noise classification is based on MFCC as audio features and four supervised classification algorithms (that is, SVM, KNN, Bagging, and Random Forest) to classify noise sources. We used a dataset of about 3000 samples of environmental sounds divided into eight categories (such as car horn, jackhammer, or street music). We have observed in our experiments with various environmental sounds that all algorithms provide high noise classification accuracy, in the range 87.81% – 93.88%. Experiments with various values of the parameter C indicate that SVM achieves its highest accuracy for C = 3, while the accuracy of KNN decreases as k increases. The accuracy of both Bagging and Random Forest increases with the number of bagged trees N; in our experiments, N = 60 led to the highest accuracy for both classifiers. With an accuracy of 93.88% and an execution time of less than a second for both training and testing, KNN with k = 1 outperformed the other algorithms.
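As a self-contained illustration of the MFCC pipeline used for feature extraction (power spectrum, triangular mel filterbank, log, DCT), the following is a hedged NumPy/SciPy sketch for a single audio frame; the frame and filterbank sizes are illustrative, not the thesis's exact settings.

```python
# Simplified MFCC of one audio frame: Hamming window -> power spectrum ->
# triangular mel filterbank -> log energies -> DCT, keeping n_mfcc coefficients.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_mfcc=13, n_mels=26):
    n_fft = len(frame)
    # Power spectrum of the windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # Triangular filters spaced evenly on the mel scale up to Nyquist
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, len(spectrum)))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i, k] = (right - k) / max(right - center, 1)
    # Log filterbank energies, decorrelated by the DCT
    energies = np.log(fbank @ spectrum + 1e-10)
    return dct(energies, type=2, norm="ortho")[:n_mfcc]
```

In practice a library such as librosa or python_speech_features would be used instead; this sketch only shows the standard steps.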

6.1 Future work

Future work will investigate the usefulness of our solution for a large number of Raspberry Pi devices in an environment that combines features of the Edge and Cloud computing systems.


