Master Thesis

Electrical Engineering Thesis no:

December 2015

IMPLEMENTATION AND EVALUATION OF AUDITORY MODELS FOR HUMAN ECHOLOCATION

VIJAY KIRAN GIDLA

Department of Applied Signal Processing Blekinge Institute of Technology

371 79 Karlskrona, Sweden


This thesis is submitted to the Department of Applied Signal Processing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering.

Contact Information

Author:
VIJAY KIRAN GIDLA
E-mail: vijaykiran.gidla@outlook.com

University advisor:
Docent BO SCHENKMAN
Department of Applied Signal Processing
Blekinge Institute of Technology

Department of Applied Signal Processing
Blekinge Institute of Technology
371 79 KARLSKRONA, SWEDEN
Internet: www.bth.se/ing
Phone: +46 455 385000


Abstract

Blind people use echoes to detect objects and to find their way, an ability known as human echolocation. Previous research has found some of the favorable conditions for the detection of objects, with many factors yet to be analyzed and quantified. Studies have also shown that blind people are more efficient than the sighted at echolocating, with performance varying among individuals. This motivated research in human echolocation to move in a new direction, to get a fuller understanding of the high detection performance of the blind. Psychoacoustic experiments alone cannot determine whether the superior echo detection of blind listeners should be attributed to perceptual or physiological causes. Along with the perceptual results, it is vital to know how the sounds are processed in the auditory system. Hearing research has led to the development of several auditory models that combine physiological and psychological results with signal analysis methods. These models try to describe how the auditory system processes signals. Hence, to analyze how the sounds are processed for the high detection performance of the blind, auditory models available in the literature were used in this thesis. The results suggest that repetition pitch is useful at shorter distances and is determined from the peaks in the temporal profile of the autocorrelation function computed on the neural activity pattern. The loudness attribute also plays a role in providing information for the listeners to echolocate at shorter distances. At longer distances, timbre aspects such as sharpness might be used by the listeners to detect the objects. It was also found that the repetition pitch, loudness and sharpness attributes in turn depend on the room acoustics and the type of stimuli used. These results show the fruitfulness of combining results from different disciplines through the mathematical framework given by signal analysis.

Keywords: Human echolocation, Psychoacoustics, Physiology, Signal analysis, Auditory models.


Acknowledgment

Firstly, I would like to express my sincere gratitude to my advisor, Docent Bo Schenkman, who supported me throughout my master thesis. I would not have been able to complete my thesis without his support, patience and motivation. His guidance helped me to think critically about the results I found from the analysis in this thesis, and his valuable comments during the writing helped me to order my analysis into a good framework. I am really grateful for having had such an advisor for my master thesis.

Besides my advisor, I would like to thank my examiner, Sven Johansson, who was patient and cooperative with the submission of my thesis. I would like to thank Professor Brian C. J. Moore and Professor Jan Schnupp for allowing me to use the figures from their books in my thesis.

My sincere thanks also go to my senior Abel Gladstone Mangam, who recommended me to my advisor to perform research in human echolocation. I thank the staff at the BTH library and IT help desk, who were very supportive in providing me with the literature and software I needed for my thesis. Last but not least, I would like to thank my parents, who supported me throughout my thesis.


Contents

Abstract i

Contents iii

List of Figures v

List of Tables vii

Abbreviations x

1 Introduction 1

2 Physiology and Perception 4

2.1 Physiology of hearing . . . 4

2.1.1 Auditory periphery . . . 4

2.1.2 Central auditory nervous system . . . 7

2.2 Perception . . . 8

2.2.1 Loudness . . . 8

2.2.2 Pitch . . . 9

2.2.3 Timbre . . . 11

3 Room acoustics 12

3.1 Review of studies analyzing acoustic signals . . . 12

3.2 Sound recordings . . . 13

3.3 Signal analysis . . . 14

3.3.1 Sound Pressure Level (SPL) . . . 14

3.3.2 Autocorrelation Function (ACF) . . . 15

3.3.3 Spectral Centroid (SC) . . . 16

4 Auditory models 22

4.1 Description of the auditory image model . . . 22

4.1.1 Pre Cochlear Processing (PCP) . . . 22

4.1.2 Basilar Membrane Motion (BMM) . . . 23

4.1.3 Neural Activity Pattern (NAP) . . . 23

4.1.4 Strobe Temporal Integration (STI) . . . 24

4.1.5 Autocorrelation Function (ACF) . . . 25

4.2 Auditory analysis . . . 25

4.2.1 Loudness analysis . . . 25

4.2.2 Auto correlation analysis for pitch perception . . . 29

4.2.3 Sharpness analysis for timbre perception . . . 41


5 Analysis of the perceptual results 43

5.1 Description of the non parametric modeling: . . . 43

5.2 Analysis . . . 44

5.2.1 Distance . . . 44

5.2.2 Loudness . . . 46

5.2.3 Pitch . . . 46

5.2.4 Sharpness . . . 47

6 Discussion 49

6.1 Echolocation and loudness . . . 49

6.2 Echolocation and pitch . . . 50

6.3 Echolocation and sharpness . . . 50

6.4 Echolocation and room acoustics . . . 51

6.5 Echolocation and binaural information . . . 51

6.6 Advantages or disadvantages of auditory model approach to human echolocation . . . 52

6.7 Theoretical implications of thesis . . . 53

7 General Conclusion 54

7.1 Conclusions . . . 54

7.2 Future work . . . 55

Bibliography 56

Appendices 60

A Room acoustics 60

A.1 Calibration Constant . . . 60

A.2 Sound Pressure Level . . . 61

A.3 Spectral Centroid . . . 65

B Auditory models 73

B.1 Loudness . . . 73

B.2 Sharpness . . . 76

B.3 Pitch strength using strobe temporal integration . . . 79


List of Figures

2.1 Anatomy of the human ear . . . 4
2.2 Cochlea unrolled, in cross section . . . 5
2.3 Cross section of the cochlea, and the schematic view of the organ of Corti . . . 6
2.4 An illustration of the most important pathways and nuclei from the ear to the auditory cortex . . . 7
2.5 Basic structure of the models used for the calculation of loudness . . . 8
2.6 A simulation of the basilar membrane motion for a 200 Hz sinusoid . . . 9
2.7 A simulation of the basilar membrane motion for a 500ms iterated ripple noise with gain=1, delay=10ms and number of iterations = 2 . . . 10
3.1 Sound recordings made in the anechoic, conference and the lecture room . . . 13
3.2 The autocorrelation function of a 5ms signal recorded in the anechoic chamber (Experiment 1) with reflecting object at 100cm . . . 17
3.3 The autocorrelation function of a 500ms signal recorded in the anechoic chamber (Experiment 1) with reflecting object at 100cm . . . 17
3.4 The autocorrelation function of a 5ms signal recorded in the conference room (Experiment 1) with reflecting object at 100cm . . . 18
3.5 The autocorrelation function of a 500ms signal recorded in the conference room (Experiment 1) with reflecting object at 100cm . . . 18
3.6 The autocorrelation function of a 5ms signal recorded in the lecture room (Experiment 2) with reflecting object at 100cm . . . 19
3.7 The autocorrelation function of a 500ms signal recorded in the lecture room (Experiment 2) with reflecting object at 100cm . . . 19
3.8 The mean of the spectral centroid for the 10 versions as a function of time of the left ear 500ms recording in the anechoic chamber (Experiment 1) . . . 20
3.9 The mean of the spectral centroid for the 10 versions as a function of time of the left ear 500ms recording in the conference room (Experiment 1) . . . 21
3.10 The mean of the spectral centroid for the 10 versions as a function of time of the left ear 500ms recording in the lecture room (Experiment 2) . . . 21
4.1 The frequency response used to design the gm2002 filter of the PCP module in the AIM . . . 22
4.2 The NAP of a 200 Hz signal in the 1209 Hz frequency channel . . . 24
4.3 The Dual profile of a 5ms signal recorded in the anechoic room (Experiment 1) . . . 31
4.4 The Dual profile of a 5ms signal recorded in the conference room (Experiment 1) . . . 32
4.5 The Dual profile of a 5ms signal recorded in the lecture room (Experiment 2) . . . 33
4.6 The Dual profile of a 50ms signal recorded in the anechoic room (Experiment 1) . . . 34
4.7 The Dual profile of a 50ms signal recorded in the conference room (Experiment 1) . . . 35
4.8 The Dual profile of a 500ms signal recorded in the anechoic room (Experiment 1) . . . 36
4.9 The Dual profile of a 500ms signal recorded in the conference room (Experiment 1) . . . 37
4.10 The Dual profile of a 500ms signal recorded in the lecture room (Experiment 2) . . . 38
4.11 An example to illustrate the pitch strength measure computed using the pitch strength module of the AIM . . . 39


5.1 The parametric (Weibull fit) and non parametric (Local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 5ms recordings in the anechoic chamber. (b) For the 5ms recording in the conference room . . . 44
5.2 The parametric (Weibull fit) and non parametric (Local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 50ms recordings in the anechoic chamber. (b) For the 50ms recording in the conference room . . . 44
5.3 The parametric (Weibull fit) and non parametric (Local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 500ms recordings in the anechoic chamber. (b) For the 500ms recording in the conference room . . . 45
A.1 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5ms recording in the anechoic chamber (Experiment 1) . . . 65
A.2 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5ms recording in the anechoic chamber (Experiment 1) . . . 65
A.3 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5ms recording in the conference room (Experiment 1) . . . 66
A.4 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5ms recording in the conference room (Experiment 1) . . . 66
A.5 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5ms recording in the lecture room (Experiment 2) . . . 67
A.6 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5ms recording in the lecture room (Experiment 2) . . . 67
A.7 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 50ms recording in the anechoic chamber (Experiment 1) . . . 68
A.8 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 50ms recording in the anechoic chamber (Experiment 1) . . . 68
A.9 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 50ms recording in the conference room (Experiment 1) . . . 69
A.10 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 50ms recording in the conference room (Experiment 1) . . . 69
A.11 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500ms recording in the anechoic chamber (Experiment 1) . . . 70
A.12 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500ms recording in the anechoic chamber (Experiment 1) . . . 70
A.13 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500ms recording in the conference room (Experiment 1) . . . 71
A.14 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500ms recording in the conference room (Experiment 1) . . . 71
A.15 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500ms recording in the lecture room (Experiment 2) . . . 72
A.16 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500ms recording in the lecture room (Experiment 2) . . . 72
B.1 The temporal profiles of the stabilised auditory image for a 500ms signal recorded in the conference room (Experiment 1) . . . 79


List of Tables

3.1 Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500ms duration signals in the anechoic and conference room of Experiment 1 . . . 15
3.2 Mean of the sound pressure level (dBA) for the left and right ears over the 10 versions of the 500ms duration signals in the lecture room of Experiment 2 . . . 15
4.1 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, conference and the lecture room with 5ms duration signal . . . 28
4.2 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, conference and the lecture room with 50ms duration signal . . . 28
4.3 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, lecture and conference room with 500ms duration signal . . . 29
4.4 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, conference and the lecture room with 5ms duration signal . . . 40
4.5 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, conference and the lecture room with 50ms duration signal . . . 40
4.6 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, lecture and conference room with 500ms duration signal . . . 40
4.7 Mean of the 10 versions of the mean of the median of the sharpness (acums) for the recordings in the anechoic, conference and the lecture room with 5ms duration signal . . . 41
4.8 Mean of the 10 versions of the mean of the median of the sharpness (acums) for the recordings in the anechoic, conference and the lecture room with 50ms duration signal . . . 42
4.9 Mean of the 10 versions of the mean of the median of the sharpness (acums) for the recordings in the anechoic, conference and the lecture room with 500ms duration signal . . . 42
5.1 Detection thresholds of object distance (cm) for duration, room, and listener groups . . . 46
5.2 Threshold values of loudness (sones) for duration, room, and listener groups . . . 46
5.3 Threshold values of the pitch strength (autocorrelation index) for duration, room, and listener groups . . . 47
5.4 Threshold values of the mean of the mean of median sharpness (acums) for duration, room, and listener groups . . . 48
A.1 Calibrated levels with and without A weighting . . . 60
A.2 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with 5ms duration signal . . . 61
A.3 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with 5ms duration signal . . . 61
A.4 SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with 5ms duration signal . . . 61
A.5 SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with 5ms duration signal . . . 61
A.6 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with 5ms duration signal . . . 62
A.7 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with 5ms duration signal . . . 62


A.8 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with 50ms duration signal . . . 62
A.9 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with 50ms duration signal . . . 62
A.10 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with 50ms duration signal . . . 63
A.11 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with 50ms duration signal . . . 63
A.12 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with 500ms duration signal . . . 63
A.13 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with 500ms duration signal . . . 63
A.14 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with 500ms duration signal . . . 64
A.15 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with 500ms duration signal . . . 64
A.16 SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with 500ms duration signal . . . 64
A.17 SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with 500ms duration signal . . . 64
B.1 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with 5ms duration signal . . . 73
B.2 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with 50ms duration signal . . . 73
B.3 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with 500ms duration signal . . . 73
B.4 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with 5ms duration signal . . . 74
B.5 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with 50ms duration signal . . . 74
B.6 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with 500ms duration signal . . . 74
B.7 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with 5ms duration signal . . . 74
B.8 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with 5ms duration, 32 clicks signal . . . 74
B.9 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with 5ms duration, 64 clicks signal . . . 75
B.10 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with 500ms duration signal . . . 75
B.11 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with 5ms duration signal . . . 76
B.12 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with 50ms duration signal . . . 76
B.13 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with 500ms duration signal . . . 76
B.14 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with 5ms duration signal . . . 77
B.15 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with 50ms duration signal . . . 77
B.16 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with 500ms duration signal . . . 77


B.17 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with 5ms duration signal . . . 78
B.18 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with 5ms duration, 32 clicks signal . . . 78
B.19 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with 5ms duration, 64 clicks signal . . . 78
B.20 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with 500ms duration signal . . . 78


Abbreviations

ACF Autocorrelation Function.

AIM-MAT Auditory Image Model in Matlab.

AIM Auditory Image Model.

BMM Basilar Membrane Motion.

CC Calibrated Constant.

ELC Equal Loudness Contour.

ERB Equivalent rectangular bandwidth.

FIR Finite Impulse Response.

GLM Generalized linear models.

H-C-L Halfwave rectification, Compression, Lowpass filtering.

H-L Halfwave rectification, Lowpass filtering.

HP-AF High Pass Asymmetric Function.

ILD Interaural Level Difference.

IRN Iterated Rippled Noise.

ITD Interaural Time Difference.

MAF Minimum Audible Field.

MAP Minimum Audible Pressure.

NAP Neural Activity Pattern.

PCP Pre Cochlear Processing.

PS Pitch Strength.

RMS Root Mean Square.

RP Repetition Pitch.

SAI Stabilized Auditory Image.

SF Strobe Finding.

SPL Sound Pressure Level.

STI Strobe Temporal Integration.

TI Temporal Integration.

autocorr Autocorrelation Module in Auditory Image Model.

cGC Compressive Gamma Chirp.

dcGC Dynamic Compressive Gamma Chirp.

fMRI Functional Magnetic Resonance Imaging.

gm2002 Glasberg and Moore 2002.

pGC Passive Gamma Chirp.

sf2003 Strobe Finding 2003.

ti2003 Temporal Integration 2003.


Chapter 1

Introduction

Human echolocation, formerly known as “facial vision” or “obstacle sense”, is the ability of the blind to detect objects in their environment, audition being the sensory basis for this ability (Dallenbach and Supa, 1944; Dallenbach and Cotzin, 1950). A blind person may use his or her self-generated sounds, e.g. the voice, but it is also usual to use sounds generated by mechanical means such as the shoes, a cane, or some device like a clicker to detect an object (Schenkman and Nilsson, 2010). There are different factors that influence this ability of the blind, and researchers over the years have performed various experiments to understand it.

The discriminating power of this ability was initially studied, and it was found that both blind and sighted listeners could detect and discriminate objects (Kellogg, 1962; Köhler, 1964; Rice, Feinstein, and Schusterman, 1965; as cited in Arias and Ramos, 1997). Later, the effect of various factors influencing the echolocation ability of the blind was studied by e.g. Schenkman (1985), concluding that self-made vocalizations and clicks were the most effective echolocation signals and that an auditory analysis similar to the autocorrelation function (Bilsen and Ritsma, 1969; Yost, 1996) could represent its underlying psychophysical mechanism.

The influence of the precedence effect on human echolocation was investigated by Seki, Ifukube, and Tanaka (1994), who performed a localization task in the vertical plane and found that the blind were more resistant to the precedence effect, with performance accuracy decreasing with decreasing distance of the (reflected) sound source. Studies were also made to find the influence of exploratory movements in echolocation. It was found that, for some distances, participants were somewhat more accurate when moving than when stationary (Miura et al, 2008). Later studies by Rowan et al (2013) and Wallmeier, Geßele, and Wiegrebe (2013) also showed that binaural information is useful in locating objects when using echolocation.

Experiments were done to find the environmental conditions and the type of signals that would favour echolocation. Schenkman and Nilsson (2010) analysed the effect of reverberation on the performance of the blind by using signals recorded in an anechoic and a conference room. They found that the blind performed better for longer distances in the latter case. However, Kolarik et al (2014) note that the reverberation time in the study of Schenkman and Nilsson (2010) was rather low (T60 = 0.4 s), and it is possible that longer reverberation times would lead to impaired rather than improved performance. The effects of reverberation time on echolocation performance have yet to be quantified.


Regarding the type of signals favourable for echolocation, Rojas et al (2009, 2010) suggested that short sounds generated at the palate are the most effective for echolocation. On the other hand, Schenkman and Nilsson (2010) reported that longer duration signals are beneficial for echolocation. Therefore, to find the type of signals favorable for echolocation, Schenkman, Nilsson, and Grbic (2011) studied the influence of click trains and longer duration noise signals on echolocation performance. They found that detection of the object at 100 cm was best with both 32 clicks/500ms and 500ms noise, and at 150 cm with 32 clicks/500ms rather than the 500ms noise signal, contradicting the results of their previous experiment, which favored the longer duration signals. Schenkman, Nilsson, and Grbic (2011) assumed that the decrease in performance was due to the difference in the experimental setup.

In order to clarify the cause for the decrease in performance, a physical analysis was made on the stimuli used in the experiments of Schenkman, Nilsson, and Grbic (2011) and is presented in the room acoustics chapter of this thesis. Although the analysis was made to explain the decrease in performance, it is to be noted that the experiments performed by Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) excluded exploratory movements, which are probably advantageous for the blind (Miura et al, 2008). Hence, more experimental testing that considers all these factors is required to conclude which types of signals are favourable for echolocation.

Another aspect that has been the focus of recent research in human echolocation is the variability of echolocation ability among the blind and sighted. Several studies have reported that blind participants have echolocation abilities superior to those of sighted participants (Dufour et al, 2005; Schenkman and Nilsson, 2010; Schenkman and Nilsson, 2011; Kolarik et al, 2013), with variability among individuals (Schenkman and Nilsson, 2010; Teng and Whitney, 2011; Teng et al, 2012). However, the results from the psychoacoustic experiments could not explain whether the high echolocation ability of the blind is due to their extensive practice, brain plasticity, or both. In some cases even the characteristics of the acoustic stimulus that determine the detection by the blind are not known.

To discover whether physiological differences are the cause for the high detection of the blind, several researchers have analyzed the brain activity of the participants. Thaler, Arnott and Goodale (2011) conducted a study using functional magnetic resonance imaging (fMRI) in one early and one late blind participant and demonstrated that echolocation activates occipital and not auditory cortical areas, with stronger activation in the early blind participant. A more recent study by the same authors (Thaler et al, 2014) suggests that the echo-motion response in blind experts may represent reorganization rather than exaggeration of responses observed in sighted novices, and that this reorganization may involve the recruitment of visual cortical areas. However, the extent to which such recruitment contributes to the echolocation abilities of the blind remains unclear, and a combined study using neuroimaging techniques and psychoacoustic methods may give a clearer insight into the role of physiology in the high echolocation ability of the blind.

Although it is expected that the combination of neuroimaging and psychoacoustic methods can give us some insight into the high echolocating ability of the blind, these do not reveal the information in the acoustic stimulus that determines it (at least when the information is not known) or how this information is represented in the human auditory system. A reasonable way to find the information necessary for the high echolocation ability of the blind is to perform a signal analysis on the acoustic stimulus. However, such an analysis does not show us how the information is represented in the human auditory system. To solve this problem, one may use auditory models from the literature which try to mimic human hearing. Analyzing the acoustic stimulus using these models may give us insight into the causes for the high echolocation ability of the blind.

It is vital to use signal analysis and the auditory models together in order to understand the differences between listeners in human echolocation, since one needs to consider the transmission of the acoustic sound from the source to the internal representation of the listener. Initially, when the acoustic sound travels and undergoes transformation due to the room acoustics, one should first understand which information is being received at the human ear. This is where signal analysis comes into play, as we can analyze the characteristics of the acoustic sound that are transformed under various room conditions. The second step is to analyze how the desired characteristic of the acoustic sound that carries the information is represented in the auditory system. This is where the auditory models come into play: the desired information is transformed in a way similar to how the auditory system might process it. Therefore, by keeping track of the information from the outer ear to the central nervous system one may understand the cause for the differences between the participants, and this is the research strategy of this thesis.

To model the auditory analysis performed by the human auditory system, the auditory image model of Patterson, Allerhand, and Giguere (1995), the loudness models of Glasberg and Moore (2002, 2007) and the sharpness model of Fastl and Zwicker (2007) were considered in this thesis. Matlab was chosen as the implementation environment. The auditory image model was implemented in Matlab by Bleeck, Ives, and Patterson (2004b), and the current version is known as AIM-MAT. The loudness and sharpness models were implemented in PsySound3 (Cabrera, Ferguson, and Schubert, 2007), a GUI-driven Matlab environment for analysis of audio recordings. AIM-MAT and PsySound3 were downloaded from https://code.soundsoftware.ac.uk/projects/aimmat and http://www.psysound.org, respectively, and used in the thesis.

AIMS OF THE THESIS:

(1) To find out the information in the acoustic stimulus that determines the high echolocation ability of the blind.

(2) To find out how this acoustic information, which determines the high echolocation ability of the blind, might be represented in the human auditory system.

For this we use the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011), denoted Experiment 1 and 2, respectively.

OUTLINE OF THE THESIS:

The thesis is organized as follows. As the auditory models are developed based on research in physiology and perception, a detailed review of the relevant parts of these subjects is first presented in Chapter 2. In Chapter 3, the signal analysis done on the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) to find out the information used to detect the objects is presented. Chapter 4 describes how the auditory models were designed and implemented; the analysis of the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) using the auditory models is also presented in this chapter. The results from the auditory models are compared with the perceptual results in Chapter 5. A discussion of the results is presented in Chapter 6, followed by the conclusion in Chapter 7.


Chapter 2

Physiology and Perception

A signal processing model of the human auditory system is designed on the basis of research in the physiology and psychology of hearing. Therefore, it is vital to give a background on the physiological and psychological aspects of hearing in order to understand how the models may explain human echolocation.

2.1 Physiology of hearing

The auditory system consists of the auditory periphery and the central nervous system, which encode and process the acoustic sound, respectively. A brief description of how this is done is presented below.

2.1.1 Auditory periphery

The peripheral part of the auditory system consists of the ear, which transduces the sound waves from the environment into neural responses and strengthens the perception of the sound. Figure 2.1 shows the structure of the human ear, which is further subdivided into the outer, middle and inner ear.

Figure 2.1: Anatomy of the human ear. Figure adapted from Chittka L, Brockmann A [CC-BY-2.5 (http://creativecommons.org/licenses/by/2.5)], via Wikimedia Commons.


Initially, when the sound reaches the human ear, the head, torso and pinna attenuate the sound in a frequency dependent manner, in which the sound pressure is decreased at high frequencies. After this attenuation, the sound travels through the auditory canal via the concha (the cavity which helps to funnel sound into the canal). Since the resonance frequency of the concha is close to 5 kHz and the resonance frequency of the external auditory canal is about 2.5 kHz, the concha and external auditory canal cause an increase in sound pressure level (SPL) of about 10 to 15 dB in the frequency range 1.5 kHz to 7 kHz. The tympanic membrane vibrates as a result of sound waves travelling in the external auditory canal, and the vibrations are passed along the ossicular chain (Yost, 2007).

The middle ear consists of the ossicular chain (malleus, incus and stapes), which provides an effective means to deliver sound to the inner ear, where the neural processing of hearing begins. Due to the difference in surface area between the tympanic membrane and the stapes footplate, and also due to the lever action of the ossicles, there is an increase in the pressure level between the ear drum and the inner ear of 30 dB or more. The actual pressure transformation depends on the frequency of the stimulus (Yost 2007, pp 75-79). Thus the middle ear works a little like a thumbtack, collecting pressure over a large area on the blunt, thumb end, and concentrating it on the sharp end (Schnupp, Nelken, and King, 2011).

The vibratory patterns representing the acoustic message reach the cochlea via the stapes. Along the entire length of the cochlea runs a structure known as the basilar membrane, which is narrow and stiff at the basal end of the cochlea (i.e. near the oval and round windows), but wide and floppy at the far, apical end. The basilar membrane subdivides the fluid-filled spaces inside the cochlea into upper compartments (the scala vestibuli and scala media) and a lower compartment (the scala tympani). Thus the cochlea is equipped with two sources of mechanical resistance, one provided by the stiffness of the basilar membrane, the other by the inertia of the cochlear fluids. The stiffness gradient decreases as we move farther away from the oval window, but the inertial gradient increases. As the inertial resistance is frequency dependent, the path of overall lowest resistance depends on the frequency: it is long for low frequencies, which are less affected by inertia (i.e. path B in Figure 2.2), and increasingly short for high frequencies (i.e. path A in Figure 2.2). Hence, every time the stapes pushes against the oval window, low frequencies cause vibrations at the apex of the basilar membrane, and high frequencies cause vibrations at the base. This property makes the cochlea operate as a mechanical frequency analyser.

Figure 2.2: Cochlea unrolled, in cross section. The grey shading represents the inertial gradient of the fluids and the stiffness gradient of the basilar membrane. Note that the gradients run in opposite directions. Figure redrawn with permission from Schnupp, Nelken, and King (2011).


However, it is to be noted that the cochlea does not have sharp frequency resolution, and it is perhaps more useful to think of the cochlea as a set of mechanical filters (Schnupp, Nelken, and King 2011, pp 55-64).
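To make the "bank of mechanical filters" picture concrete, the sketch below builds the impulse response of a gammatone filter, a standard signal-processing stand-in for one place along the basilar membrane. This is an illustrative simplification: AIM-MAT itself uses the related but more elaborate gammachirp filters described in Chapter 4, and the sampling rate, filter order and channel spacing below are assumptions, not values from the thesis.

```python
import numpy as np

def gammatone_ir(fc, fs=44100, n=4, duration_s=0.025):
    """Impulse response of a gammatone filter, a common approximation of one
    'mechanical filter' on the basilar membrane. The bandwidth follows the
    equivalent rectangular bandwidth (ERB) formula of Glasberg and Moore."""
    t = np.arange(int(duration_s * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000 + 1)   # ERB in Hz at center frequency fc
    b = 1.019 * erb                       # gammatone bandwidth parameter
    return t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

# A crude 'cochlea': filter a sound x with a bank of such filters, e.g.
# channels = [np.convolve(x, gammatone_ir(fc))[:len(x)]
#             for fc in np.geomspace(100, 8000, 32)]
```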

Another important phenomenon that the basilar membrane exhibits is the travelling wave. However, Schnupp, Nelken, and King (2011) say that describing the travelling wave as a manifestation of the sound energy can be misleading, and suggest that it is probably more accurate to imagine the mechanical vibrations as travelling along the membrane only in the sense that they travel mostly through the fluid next to the membrane and then pass through the membrane as they come near the point of lowest resistance. The travelling wave may then be mostly a curious side effect of the fact that the mechanical filters created by each small piece of basilar membrane, together with the associated cochlear fluid columns, all happen to be slightly out of phase with each other.

The mechanical vibrations of the basilar membrane are transduced into electrical potentials by the shearing of the stereocilia in the organ of Corti against the tectorial membrane (cf. Figure 2.3). This happens as follows: a structure named the stria vascularis leaks K+ ions from the bloodstream into the scala media and sets up an electrical voltage gradient across the basilar membrane. As the stereocilia in each bundle are not all of the same length, and as their tips are connected with each other by fine protein fiber strands known as "tip links", ion channels open in response to stretch (an increase in tension) on the tip links, allowing the K+ ions to flow through the hair cells. The hair cells then form glutamatergic, excitatory synaptic contacts with the spiral ganglion neurons at their lower end. These neurons form the long axons that travel through the auditory nerve and reach the cochlear nucleus (Schnupp, Nelken, and King, 2011).

Figure 2.3: Cross section of the cochlea, and a schematic view of the organ of Corti. Figure redrawn with permission from Schnupp, Nelken, and King (2011).

As can be seen in Figure 2.3, there are two types of hair cells: outer and inner hair cells. The inner and outer hair cells are connected to type I and type II fibers, respectively. Anatomically, type II fibers are unsuited to provide fast throughput of the encoded information (Schnupp, Nelken, and King, 2011). Hence, only the inner hair cells are known to be the biological transducers. Although the outer hair cells do not provide any neural transduction, they are known to exhibit motility, which causes the non-linear cochlear amplification. A detailed description of how this non-linear cochlear amplification can be modeled using signal processing techniques is presented in Chapter 4.

2.1.2 Central auditory nervous system

As discussed in the above section, the auditory periphery transduces the acoustic sound. However, hearing involves more than neural coding of the sound, i.e. it also involves processing of the encoded sound. This processing is done by the central auditory nervous system, which consists of the cochlear nucleus, the superior olivary complex, the inferior colliculus, the medial geniculate body and the auditory cortex, as well as other structures; Figure 2.4 illustrates this.

Figure 2.4: An illustration of the most important pathways and nuclei from the ear to the auditory cortex. The nuclei illustrated are located in the brain stem. Figure redrawn with permission from Moore (2013).

There is evidence that many cells in the dorsal cochlear nucleus react in a manner that suggests a lateral inhibition network, which helps in sharpening the neural representation of the spectral information (Yost 2007, p. 240). As the information from the left and right ears converges at the olivary nuclei, these are assumed to process the spatial perception of sound (Schnupp, Nelken, and King, 2011). The spectral and spatial information from the cochlear nucleus and the superior olivary complex is further processed and combined by the inferior colliculus. Finally, regions in the auditory cortex process complex sounds.

2.2 Perception

The physiological background remains one main inspiration for the auditory models but they are also based on how the physical and perceptual attributes of the acoustic sound are encoded in the auditory system. Loudness, pitch and timbre are three subjective attributes of the acoustic sound that are relevant for human echolocation. Therefore this section discusses how these attributes are encoded in the auditory system.

2.2.1 Loudness

Loudness is the perceptual attribute of intensity and is defined as that attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud (ASA, 1973).

There is no full understanding of the underlying mechanisms of how loudness is perceived. The dynamic range of the auditory system is wide, and different mechanisms play a role in intensity discrimination. Psychophysical experiments suggest that neural firing rates, spread of excitation and phase locking play a role in intensity perception, but the latter two may not always be essential. A disadvantage of the firing-rate account is that, although single neurons in the auditory nerve can be used to explain intensity discrimination, this does not explain why intensity discrimination is not better than observed, suggesting that discrimination is limited by the capacity of higher levels in the auditory system, which may also play a role in intensity discrimination (Moore, 2013).

Figure 2.5: Basic structure of the models used for the calculation of loudness (stimulus → fixed filter for the outer/middle ear transfer → transform spectrum to excitation pattern → transform excitation pattern to specific loudness → calculate the area under the specific loudness pattern). Figure redrawn from Moore (2013).

Several models (cf. Moore, 2013, pp 139-140) have been proposed to calculate the average loudness that would be perceived by a large group of listeners. Figure 2.5 shows the basic structure of a model used to calculate loudness. The model performs the outer and middle ear transformations and then calculates the excitation pattern. The excitation pattern is transformed into specific loudness, which involves a compressive non-linearity. The total area under the specific loudness pattern is assumed to be proportional to the overall loudness. Therefore, whatever the mechanism underlying the perception of loudness may be, the excitation pattern seems to be the essential information that should be used to design an auditory model of loudness.
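As a rough illustration of this four-stage structure, the sketch below chains the stages of Figure 2.5 on a power spectrum. Every numerical choice here (the flat transfer function, the triangular spreading kernel, the compressive exponent of 0.23) is a placeholder assumption; the actual Glasberg and Moore (2002) model, used later in this thesis via PsySound3, specifies each stage in far greater detail.

```python
import numpy as np

def loudness_sketch(spectrum_db, freqs):
    """Very simplified sketch of the loudness-model pipeline in Figure 2.5.
    All stage parameters are illustrative, not the published model values."""
    # Stage 1: fixed filter for the outer/middle ear transfer
    # (flat here; the real model applies a measured transfer function).
    transmitted_db = spectrum_db + 0.0

    # Stage 2: transform the spectrum to an excitation pattern by smearing
    # energy across neighbouring frequency bands (crude triangular spread).
    lin = 10.0 ** (transmitted_db / 10.0)
    kernel = np.array([0.25, 0.5, 1.0, 0.5, 0.25])
    excitation = np.convolve(lin, kernel / kernel.sum(), mode="same")

    # Stage 3: compressive non-linearity, excitation -> specific loudness.
    specific_loudness = excitation ** 0.23  # assumed compressive exponent

    # Stage 4: total loudness is taken proportional to the area under the
    # specific-loudness pattern (integrate over frequency).
    return np.trapz(specific_loudness, freqs)
```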


2.2.2 Pitch

Pitch is defined as “that attribute of auditory sensation in terms of which sounds may be ordered on a musical scale” (ASA, 1960).

How pitch is encoded in the auditory system is still a matter of debate. One view is that, as the cochlea is assumed to perform a spectrum analysis, the acoustic vibrations are transformed into a spectrum, coded as a profile of discharge rate across the auditory nerve. An alternative view proposes that the role of the cochlea is to transduce the acoustic vibrations into temporal patterns of neural firing. These two views are known as the place and time hypotheses. Figure 2.6 shows a simulation of the basilar membrane motion for a 200 Hz sinusoid, generated using the dynamic gammachirp filterbank module available in AIM-MAT. It can be seen that both the frequency and the temporal patterns are preserved.

According to the place hypothesis, pitch is determined from the position of maximum excitation along the basilar membrane, within the cochlea. This explains how pitch is perceived for pure tones at low levels, but it fails for pure tones at higher levels: at higher levels, due to the non-linearity of the basilar membrane (as described in the physiology section), the peaks become broader and tend to shift towards a lower frequency place. This should lead to a decrease in pitch; however, psychophysical experiments show that the pitch is stable. Another case where the place hypothesis fails is its inability to explain the pitch of stimuli whose fundamental is absent. According to the paradox of the missing fundamental, the pitch evoked by a pure tone remains the same if we add additional tones with frequencies that are integer multiples of that of the original pure tone (harmonics); it also does not change if we then remove the original pure tone (the fundamental) (de Cheveigné, 2010).

Figure 2.6: A simulation of the basilar membrane motion for a 200 Hz sinusoid, generated using the dynamic gamma chirp filter bank module available in AIM-MAT. It can be seen that both the place and the temporal information are preserved.


Figure 2.7: A simulation of the basilar membrane motion for a 500ms iterated ripple noise with gain = 1, delay = 10ms and number of iterations = 2, generated using the dynamic gamma chirp filter bank module available in AIM-MAT. It can be seen that there are no periodic repetitions to support the time hypothesis.

On the other hand, as the time hypothesis states that pitch is derived from the periodic pattern of the acoustic waveform, it overcomes the problem of the missing fundamental. However, the main difficulty with the time hypothesis is that it is not easy to extract one pulse per period in a way that is reliable and fully general. Psychoacoustic studies also show that pitch exists for stimuli that are not periodic. An example of such a stimulus is iterated ripple noise (IRN), a stimulus that models some of the human echolocation signals (cf. Figure 2.7).
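IRN is straightforward to synthesize, which is one reason it is popular in pitch research: a delayed copy of the signal is repeatedly added back to it. The sketch below is a minimal generator matching the Figure 2.7 parameters; the "add-same" network variant and the final normalization are assumptions, since the thesis does not specify how its IRN examples were generated.

```python
import numpy as np

def iterated_ripple_noise(duration_s=0.5, fs=44100, delay_s=0.010,
                          gain=1.0, iterations=2, seed=0):
    """Generate iterated ripple noise (IRN) by repeatedly adding a delayed
    copy of the signal to itself ('add-same' network, an assumed variant).

    Defaults mirror the Figure 2.7 stimulus: 500 ms, 10 ms delay, gain 1,
    2 iterations."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(int(duration_s * fs))
    d = int(delay_s * fs)
    for _ in range(iterations):
        delayed = np.concatenate([np.zeros(d), x[:-d]])  # delay by d samples
        x = x + gain * delayed
    return x / np.max(np.abs(x))  # normalize to avoid clipping

# A 10 ms delay produces a repetition pitch near 1 / 0.010 = 100 Hz.
irn = iterated_ripple_noise()
```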

In order to overcome the limitations of the place and time hypotheses, two new theories were proposed: pattern matching (De Boer 1956, cited in de Cheveigné 2010) and a theory based on autocorrelation (Licklider 1951, cited in de Cheveigné 2010). De Boer (1956) described the concept of pattern matching in his thesis. It states that the fundamental partial is the necessary correlate of pitch, but it may be absent if other parts of the pattern are present. In this way pattern matching supports the place hypothesis. Later, Goldstein (1973), Wightman (1973) and Terhardt (1974) described different models for pattern matching. One problem with the pattern matching theory is that it fails to account for the pitch of stimuli that have no resolved harmonics.


The autocorrelation hypothesis assumes temporal processing in the auditory system. It states that, instead of detecting peaks at regular intervals, the periodic neural pattern is processed by coincidence detector neurons that calculate the equivalent of an autocorrelation function (Licklider 1951, cited in de Cheveigné 2010). The spike trains are delayed within the brain by various time lags (using neural delay lines) and are combined, or correlated, with the original. When the lag is equal to the time delay between spikes, the correlation is high and the outputs of the coincidence detectors tuned to that lag are strong. Spike trains in each frequency channel are processed independently and the results combined into an aggregate pattern. However, de Cheveigné (2010) says that the autocorrelation hypothesis works too well: it predicts that pitch should be equally salient for stimuli with resolved and unresolved partials, whereas this is not the case. An alternative to a theory based on an autocorrelation-like function is the strobe temporal integration (STI) of Patterson et al (1995). According to STI, the auditory image underlying the perception of pitch is obtained by using triggered, quantised temporal integration instead of an autocorrelation-like function. STI works by finding strobes in the neural activity pattern and integrating it over a certain period.

To summarize, there is no full understanding of how pitch is perceived. Whether temporal, spectral or multiple mechanisms determine pitch perception, the underlying information that the auditory system uses to detect pitch is the excitation pattern. Hence, the excitation pattern remains the crucial information that should be simulated to design an auditory model of pitch perception.

2.2.3 Timbre

When the loudness and pitch of an acoustic sound are similar, the subjective attribute used to distinguish the sound is the timbre. Timbre has been defined as that attribute of auditory sensation which enables a listener to judge that two non-identical sounds, similarly presented and having the same loudness and pitch, are dissimilar (ANSI, 1994). One example is the difference between two musical instruments playing the same tone, e.g. guitar and piano.

Timbre is a multidimensional percept and there is no single scale on which we can order timbre. One approach to quantifying timbre is to consider the overall distribution of the spectral energy. Plomp and his colleagues showed that the perceptual differences between different sounds were closely related to the levels in 18 1/3-octave bands, thus relating the timbre to the relative level produced by the sound in each critical band. Hence, generally, for both speech and non-speech sounds, the timbre of steady tones is determined by their magnitude spectra, although the relative phases may play a small role (Plomp, as cited in Moore, 2013). When we consider time varying patterns, there are several factors that influence the perception of timbre, which include: (i) periodicity; (ii) variation in the envelope of the waveform; (iii) the spectrum changing over time; and (iv) what the preceding and following sounds are like.

The timbre information can be assessed using auditory models from the levels in the spectral envelope and the variation of the temporal envelope. Another way to preserve the fine-grain time-interval information that is necessary for timbre perception is the strobe temporal integration of Patterson et al (1995).
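One simple correlate of the overall distribution of spectral energy is the spectral centroid, which Chapter 3 computes on the echolocation recordings. The sketch below shows a basic short-time version; the frame length, hop size and Hann window are arbitrary illustrative choices, not the settings used for the analyses in this thesis.

```python
import numpy as np

def spectral_centroid_track(x, fs, frame_len=1024, hop=512):
    """Sketch of a short-time spectral centroid, a coarse correlate of
    timbre/sharpness: the magnitude-weighted mean frequency per frame."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    window = np.hanning(frame_len)
    centroids = []
    for start in range(0, len(x) - frame_len, hop):
        mag = np.abs(np.fft.rfft(x[start:start + frame_len] * window))
        centroids.append(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    return np.array(centroids)  # centroid in Hz for each frame
```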


Chapter 3

Room acoustics

Before analyzing how an acoustic sound might be represented in the auditory system using auditory models, it is vital to study the physics and room acoustics of the sound that determines human echolocation. Hence, this chapter initially reviews the studies analyzing the acoustic signals.

3.1 Review of studies analyzing acoustic signals

As discussed in Chapter 2, the iterated ripple noise stimulus models some of the human echolocation signals. Initially, a brief review of the studies performed on this stimulus in the literature is presented. Thereafter, a review of the studies of other acoustic stimuli used for understanding human echolocation is presented.

Bassett and Eastmond (1964) examined the physical variations in the sound field close to a reflecting wall. They used a loudspeaker generating Gaussian noise, placed at more than 5 m from a large horizontal reflecting panel in an anechoic chamber. A microphone was placed at a number of points between the loudspeaker and the panel, and an interference pattern was observed. Bassett and Eastmond reported a perceived pitch caused by the interference of direct and reflected sound at different distances from the wall, the pitch value being equal to the inverse of the delay. In a similar way, Small Jr. and McClellan (as cited in Bilsen, 1966) delayed identical pulses and found that the pitch perceived was equal to the inverse of the delay, naming it time separation pitch. Later, Bilsen and Ritsma (1969) stated that when a sound and the repetition of that sound are listened to, a subjective tone is perceived with a pitch corresponding to the reciprocal of the delay time, and termed the perceived pitch repetition pitch. Bilsen tried to explain the repetition pitch phenomenon using autocorrelation peaks or spectral peaks. Yost (1996) performed experiments using iterated ripple noise stimuli and concluded that autocorrelation is the underlying mechanism used by listeners to detect the repetition pitch phenomenon.
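The repetition pitch mechanism can be illustrated in a few lines of code: adding an attenuated, delayed copy of a noise signal produces an autocorrelation peak at the echo delay, and the inverse of that lag is the predicted pitch. The two-path model and the 0.5 reflection gain below are deliberate simplifications of the real recordings analyzed later in this chapter.

```python
import numpy as np

def repetition_pitch_demo(distance_m=1.0, fs=44100, c=343.0, seed=1):
    """Show that noise plus an echo yields an ACF peak at the echo delay,
    whose inverse is the repetition pitch (Bilsen and Ritsma, 1969)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(fs // 2)                # 500 ms of white noise
    delay = int(round(2 * distance_m / c * fs))     # round trip to object
    y = x.copy()
    y[delay:] += 0.5 * x[:-delay]                   # add the reflection

    y = y - y.mean()
    acf = np.correlate(y, y, mode="full")[len(y) - 1:]
    lag = np.argmax(acf[delay // 2:]) + delay // 2  # skip the zero-lag peak
    return fs / lag                                 # estimated pitch in Hz

# At 1 m the round-trip delay is ~5.8 ms, i.e. a repetition pitch near 170 Hz.
print(repetition_pitch_demo(1.0))
```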

Regarding other acoustic stimuli used for understanding human echolocation, Rojas et al (2009, 2010) conducted a physical analysis of the acoustic characteristics of orally produced pulses and finger produced pulses, showing that the former were better for echolocation. Papadopoulos et al (2011) examined the acoustic signals used in the study of Dufour et al (2005) and stated that the information for obstacle discrimination was found in the frequency dependent interaural level differences (ILD), especially in the range from 5.5 to 6.5 kHz, rather than in interaural time differences (ITD). Pelegrin Garcia, Roozen, and Glorieux (2013) performed a study using the boundary element method and found that frequencies above 2 kHz provide information for localization of the object, whereas the lower frequency range would be used for size determination. A similar analysis was performed by Rowan et al (2013) using a virtual auditory space technique, which came to the same conclusion, viz. that performance was primarily based on information above 2 kHz. In view of the above studies, several analyses were performed for this thesis, presented in the remaining part of this chapter, to identify the information necessary for the detection of the object.

3.2 Sound recordings

The sound recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) are used in our study. A brief description of how the recordings were made is given here. In Schenkman and Nilsson (2010), binaural sound recordings were conducted in an ordinary conference room and in an anechoic chamber using an artificial manikin. The object was a reflecting 1.5 mm thick aluminium disk with a diameter of 0.5 m. Recordings were conducted at 0.5, 1, 2, 3, 4, and 5 m distance between the microphones and the reflecting object. In addition, recordings were made with no obstacle in front of the artificial manikin.

Figure 3.1: Sound recordings made in Experiment 1, (a) anechoic room and (b) conference room, with the loudspeaker on the chest of the artificial manikin, and in Experiment 2, (c) lecture room, with the loudspeaker behind the artificial manikin. The pictures are reproduced with permission from Bo Schenkman.


The following durations of the noise signal were used: 500, 50, and 5 ms; the shortest corresponds perceptually to a click. The electrical signal was white noise. However, the emitted sound was not perfectly white, because of the non-linear frequency response of the loudspeaker and the system. A loudspeaker resting on the chest of the artificial manikin generated the sounds. The sound recording set-ups can be seen in Figures 3.1(a) and 3.1(b).

In Schenkman, Nilsson, and Grbic (2011), recordings were conducted in an ordinary lecture room, at 100 and 150 cm distance between the microphones and the reflecting object. The emitted sounds were either bursts of 5 ms each, varying in rate from 1 to 64 bursts per 500 ms, or a 500 ms white noise. These sounds were generated by a loudspeaker placed 1 m straight behind the center of the head of the artificial manikin. The sound recording set-up can be seen in Figure 3.1(c). From now on, the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) will be referred to as Experiment 1 and Experiment 2, respectively. A detailed description of the recordings can be found in Schenkman and Nilsson (2010) and in Schenkman, Nilsson, and Grbic (2011).

3.3 Signal analysis

To find out what information is used for detecting an object, and to analyze how the acoustics of the room affect human echolocation, a number of different analyses were performed, namely: sound pressure level, autocorrelation, and spectral centroid. Before the analysis, the recordings were calibrated with calibration constants (CC), using equation 3.1. Based on the SPLs of 77, 79 and 79 dBA for the 500 ms recordings without the object at the ear of the artificial manikin in the anechoic chamber, conference room and lecture room of Experiment 1 and Experiment 2, the CCs were calculated to be 2.4663, 2.6283 and 3.5021, respectively.¹

\[ CC = 10^{\frac{SPL \,-\, 20\log_{10}\left(\frac{\mathrm{rms}(signal)}{20\times10^{-6}}\right)}{20}} \tag{3.1} \]

As the recordings were binaural, both the left and right ear recordings were analyzed.

The recordings in Experiment 1 and Experiment 2 had 10 versions of each duration and distance. It should be noted that the recordings vary over the versions, causing the term rms(signal) in equation 3.1 to vary, and thereby the calibration constants to vary with the versions. However, as the variation is very small, in this thesis only the 9th version of the 500 ms first recording without the object (NoObject rec1) in Experiment 1 and the 9th version of the 500 ms recording without the object in Experiment 2 were used to find the above calibration constants. Another reason to choose only the 9th version is that, although the other versions may not have the same CCs, they will be relatively calibrated with respect to the recording of version 9. For example, suppose the recording in the anechoic chamber had 67 dB SPL for version 1 and 66 dB SPL for version 9 before calibration; then the levels obtained by calibrating the recordings to 77 dB SPL using the CC of the 9th version would be 78 dB SPL for version 1 and 77 dB SPL for version 9. In other words, the versions will give the same level difference, even after calibration.
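To make the procedure concrete, the following is a minimal sketch of equation 3.1 and of the relative-calibration argument in Python with NumPy. The function names and the synthetic stand-in signals are illustrative assumptions, not the analysis code used for the recordings:

import numpy as np

P_REF = 20e-6  # reference sound pressure, 20 micropascals (Pa)

def rms(signal):
    """Root mean square amplitude of a signal."""
    return np.sqrt(np.mean(np.square(signal)))

def calibration_constant(signal, target_spl_db):
    """Equation 3.1: constant that scales `signal` to `target_spl_db` dB SPL."""
    return 10 ** ((target_spl_db - 20 * np.log10(rms(signal) / P_REF)) / 20)

def level_db(signal, cc):
    """Level in dB re 20 uPa after applying the calibration constant."""
    return 20 * np.log10(cc * rms(signal) / P_REF)

# Relative-calibration argument: if version 1 is 1 dB stronger than
# version 9 before calibration, it stays 1 dB stronger after both are
# scaled with the CC derived from version 9 alone.
rng = np.random.default_rng(0)
version9 = rng.standard_normal(48000)      # stand-in for the recording
version1 = version9 * 10 ** (1 / 20)       # +1 dB relative to version 9
cc = calibration_constant(version9, 77.0)  # CC from version 9 only
print(level_db(version9, cc))              # -> 77.0 dB
print(level_db(version1, cc))              # -> 78.0 dB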

3.3.1 Sound Pressure Level (SPL)

The detection of the objects may, to a certain extent, be based on an intensity difference.

Hence, the SPL in dBA was calculated using equation 3.2, where rms(signal) is the root mean square amplitude of the signal analyzed.¹ The results for the 500 ms recordings in Experiment 1 and Experiment 2 are tabulated in Tables 3.1 and 3.2. A detailed analysis of the SPL values for all 10 versions of the 5, 50 and 500 ms recordings is presented in Tables A.2 to A.17 in Appendix A.

¹The A-weighting was not included in equation 3.1. However, the difference was found to be less than 0.5 dB and was hence neglected. See section A.1 of the appendix for more details.

\[ SPL = 20\log_{10}\left(\frac{CC \times \mathrm{rms}(signal)}{20\times10^{-6}}\right) \tag{3.2} \]

Recording       Anechoic chamber       Conference room
                Left ear   Right ear   Left ear   Right ear
NoObject rec1   77.153     77.866      79.003     78.817
NoObject rec2   77.592     77.374      78.993     78.824
Object50cm      85.182     88.216      87.539     87.457
Object100cm     81.877     82.550      82.827     82.377
Object200cm     77.097     78.044      79.598     79.481
Object300cm     76.975     78.211      78.926     78.898
Object400cm     77.051     77.986      79.016     78.860
Object500cm     76.987     78.033      79.009     78.798

Table 3.1: Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500ms duration signals in the anechoic and conference room of Experiment 1.

Recording      Lecture room
               Left ear   Right ear
NoObject       79.165     79.577
Object100cm    79.594     81.545
Object150cm    79.412     79.681

Table 3.2: Mean of the sound pressure level (dBA) for the left and right ears over the 10 versions of the 500ms duration signals in the lecture room of Experiment 2.

The tabulated SPL values in Tables 3.1 and 3.2 show the effect of room acoustics in the form of level differences, both between the ears and among the rooms. The level differences between the recording without an object and the recordings with the object at 100 and 150 cm were smaller in Experiment 2 than in Experiment 1; for example, at 100 cm the left-ear difference is about 3.8 dB in the conference room but only about 0.4 dB in the lecture room. This may be due to the differences in the experimental set-up (cf. Figure 3.1) and the acoustics of the rooms.

However, the extent to which this information is used by the participants is not straightforward, as the loudness perceived by the human auditory system cannot be related directly to the SPL (Moore, 2013). This issue is further discussed in Chapter 4.
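A corresponding sketch of equation 3.2, applied to each ear of a binaural recording, is given below. The array layout and the example values are assumptions; as stated in the footnote, the A-weighting is omitted here, so the result is an unweighted SPL:

import numpy as np

P_REF = 20e-6  # reference pressure (Pa)

def spl_db(signal, cc):
    """Equation 3.2: SPL in dB re 20 uPa of a signal scaled by CC."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return 20 * np.log10(cc * rms / P_REF)

# Hypothetical binaural recording: columns are left and right ear.
recording = np.random.default_rng(1).standard_normal((48000, 2)) * 0.05
cc = 2.4663  # CC for the anechoic chamber (section 3.3)
left_spl = spl_db(recording[:, 0], cc)
right_spl = spl_db(recording[:, 1], cc)
print(f"left: {left_spl:.3f} dB, right: {right_spl:.3f} dB")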

3.3.2 Autocorrelation Function (ACF)

Generally, intensity differences play a role in human echolocation. However, Schenkman and Nilsson (2011) showed that repetition pitch, rather than loudness, is the more important information used by the participants to detect the objects. As discussed in the pitch perception section of Chapter 2, pitch perception can often be explained using the peaks in the autocorrelation function; hence an autocorrelation analysis is performed in this section.

The repetition pitch for the recordings in Experiment 1 and Experiment 2 can be theoretically calculated using equation 3.3. The corresponding values for recordings with the object at 50, 100, 150, 200, 300, 400 and 500 cm would be approximately 344, 172, 114, 86, 57, 43 and 34.4 Hz (assuming a sound velocity of 344 m/s). As the theory based on autocorrelation uses temporal information, the repetition pitch perceived at the above frequencies can be explained by peaks in the autocorrelation function (ACF) at the inverses of these frequencies, i.e. at approximately 2.9, 5.8, 8.7, 11.6, 17.4, 23.2 and 29 ms, respectively. Therefore, the autocorrelation analysis was performed using a 32 ms frame, which covers the required pitch periods. A 32 ms hop size was used to analyze the ACF at the following time instants, 64 ms, 96 ms, etc. In order to compare the peaks among all the recordings, the ACF was not normalized to the limits -1 to 1.

\[ RP = \frac{\text{speed of sound}}{2 \times \text{distance of the object}} \tag{3.3} \]

In Experiment 1, the participants performed well with the longer duration signals (cf. Schenkman and Nilsson (2010)). The authors assumed that the higher detection ability of the participants for the longer duration signals may be because, although a subject may miss the repetition pitch at the first repetition, they may perceive it in the later repetitions. This can be visualized using the ACF in Figures 3.2 and 3.3, where for the 5 ms recording the peak was present only in the initial 32 ms frame, whereas for the 500 ms recording the peak was also present in frames at time instants greater than 32 ms. (Note that for each signal duration an additional 450 ms of silence was appended before presentation to the participants, and the ACF was analyzed in the same manner; hence the 5 ms signal had a total duration of 455 ms and the 500 ms signal a total duration of 950 ms.)
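The framed ACF analysis described above can be sketched as follows in Python with NumPy, using the 32 ms frame and hop size; the synthetic direct-plus-echo signal and all names are illustrative assumptions, not the actual recordings or analysis code:

import numpy as np

def framed_acf(signal, fs, frame_ms=32, hop_ms=32):
    """Unnormalized autocorrelation of consecutive frames.

    Returns an array of shape (n_frames, frame_len): one ACF
    (lags 0 .. frame_len-1) per 32 ms frame.
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    acfs = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # full autocorrelation; keep the non-negative lags only
        acf = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        acfs.append(acf)
    return np.array(acfs)

# Expected echo delay (the RP period, equation 3.3 inverted):
# tau = 2 * d / c, e.g. about 5.8 ms for an object at 1 m.
fs, c, d = 48000, 344.0, 1.0
tau = 2 * d / c
lag = int(round(tau * fs))                 # lag index of the expected peak

rng = np.random.default_rng(0)
direct = rng.standard_normal(fs // 2)      # 500 ms noise burst
echo = np.zeros_like(direct)
echo[lag:] = 0.3 * direct[:-lag]           # attenuated, delayed copy
acfs = framed_acf(direct + echo, fs)

skip = int(0.002 * fs)                     # ignore lags below 2 ms
peak_lag_ms = (np.argmax(acfs[0][skip:]) + skip) / fs * 1000
print(f"ACF peak at ~{peak_lag_ms:.1f} ms (expected {tau * 1000:.1f} ms)")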

The assumption of Schenkman and Nilsson (2010) could explain the high echolocation ability of the participants for the longer duration signals in Experiment 1. However, in Experiment 2 the performance decreased, although the repetitions were present in the frames at time instants greater than 32 ms (cf. Figures 3.6 and 3.7). Therefore, the conclusion that longer duration signals are always beneficial for human echolocation cannot be drawn from the available results. Comparing the peak heights at the pitch period for the recordings with the object at 100 cm for the 5 ms duration signal, the peak height is greater in the conference room than in the lecture room (cf. Figures 3.4 and 3.6). The 500 ms duration signal with the object at 100 cm in the lecture room had a greater peak height than the 5 ms recording in the conference room (cf. Figures 3.4 and 3.7), but its peak is not as distinct as that of the 500 ms duration signal in the conference room (cf. Figures 3.5 and 3.7).

The differences in peak heights between the conference room and the lecture room may be due to the room acoustics. As the ACF depends on the spectrum of the signal, the acoustics of the room certainly influence the peaks in the ACF. The reverberation time T60 was 0.4 s for the conference room and 0.6 s for the lecture room, indicating that the acoustics of the room may influence the ACF and, in turn, the echolocation ability. How this peak information is represented in the auditory system is further discussed in Chapter 4.

3.3.3 Spectral Centroid (SC)

Detection of an object may also be based on the efficient use of the timbre information available in the stimuli. To test this hypothesis, one has to identify the attributes of the acoustic signal that contribute to the perception of timbre. One such attribute is the spectral centroid (Peeters et al, 2011), which gives a time-varying value characterizing the subjective center of the timbre of a sound.

Therefore, a spectral centroid analysis was performed on the recordings and is presented in this section.
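As a sketch of how such a descriptor can be computed, the following Python (NumPy) fragment estimates a frame-wise spectral centroid; the frame length, hop size and windowing are illustrative assumptions rather than the exact settings used for the analysis:

import numpy as np

def spectral_centroid(signal, fs, frame_len=1024, hop=512):
    """Frame-wise spectral centroid in Hz.

    The centroid of each frame is the magnitude-weighted mean of the
    FFT bin frequencies: SC = sum(f * |X(f)|) / sum(|X(f)|).
    """
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1 / fs)
    centroids = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        if mag.sum() > 0:
            centroids.append(np.sum(freqs * mag) / np.sum(mag))
        else:
            centroids.append(0.0)  # silent frame
    return np.array(centroids)

# Example: white noise has a high, nearly constant centroid (~fs/4 for
# a flat magnitude spectrum); a reflecting object that emphasizes
# certain frequency bands shifts the centroid over time.
fs = 48000
noise = np.random.default_rng(0).standard_normal(fs // 2)  # 500 ms
print(spectral_centroid(noise, fs)[:5])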
