
2.2 Pre-processing

A very important part of the auralization system is the offline calculation, decoding, and storage of an accurate set of impulse responses ready to be used in the second block, which applies the room effect to a talker’s voice in real-time. This part of the system is an adaptation of the LoRA toolbox designed by Favrot and Buchholz [5]. The LoRA toolbox is a software application that uses the output (impulse response with directional information) of an acoustics simulation program to encode it in Ambisonics and decode it to a particular reproduction layout, producing an IR for each loudspeaker.
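The encode/decode chain performed by the toolbox can be illustrated with a minimal first-order sketch. This is not the LoRA implementation itself: the actual system uses higher Ambisonics orders for the early reflections and the real 29-loudspeaker layout, whereas the square layout, the mode-matching (pseudoinverse) decoder, and all function names below are illustrative assumptions.

```python
import numpy as np

def encode_foa(azimuth, elevation):
    """First-order Ambisonics (B-format, WXYZ) encoding gains for a
    plane wave arriving from the given direction (radians)."""
    w = 1.0 / np.sqrt(2.0)          # traditional B-format W weighting
    x = np.cos(azimuth) * np.cos(elevation)
    y = np.sin(azimuth) * np.cos(elevation)
    z = np.sin(elevation)
    return np.array([w, x, y, z])

def basic_decoder(speaker_dirs):
    """Mode-matching decoder for an arbitrary layout, built as the
    pseudoinverse of the re-encoding matrix.
    speaker_dirs: list of (azimuth, elevation) pairs in radians."""
    enc = np.stack([encode_foa(az, el) for az, el in speaker_dirs])
    return np.linalg.pinv(enc.T)    # maps WXYZ -> per-speaker gains

# A single reflection arriving from 45 deg azimuth, rendered to a
# hypothetical square layout of four horizontal loudspeakers.
speakers = [(np.deg2rad(a), 0.0) for a in (45, 135, 225, 315)]
D = basic_decoder(speakers)
g = D @ encode_foa(np.deg2rad(45), 0.0)   # per-loudspeaker gains
```

Applying these gains (with the reflection's delay and per-band attenuation) to each loudspeaker's signal chain yields the per-loudspeaker impulse responses; the loudspeaker nearest the incidence direction receives the largest gain.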

However, some modifications in the procedure and calculation are needed in order to match the requirements for self-voice auralization.

Figure 1: Block diagram of the real-time convolution system.

First, a computer-based room acoustic model is needed, which is then loaded into an acoustic simulation program. In the proposed system, Odeon is used [23], although alternative solutions may also be used, as long as the interface with the LoRA toolbox is implemented satisfactorily. In the acoustic simulation, the source is located at the talker’s position, avoiding positions so close to the boundaries that they could not be satisfactorily reproduced by the system due to its inherent latency (analyzed in section 3.2). The receiver point is located 1 m in front of the source. Note that this position does not correspond to the position of the ears relative to the mouth (the sound source). However, the reflection pattern is reasonably similar to the one experienced at the position of the ears. In addition, the proposed calibration method takes advantage of this approximation, as discussed in section 3.3. The source is oriented toward the audience and has a directivity pattern similar to that of average human speech [24, 25].

For rooms in a volume range of approximately 100 m³ < V < 1000 m³, the simulation parameters used are 5000 rays, a maximum reflection order of 2000, and a transition


correspond at least to the largest reverberation time among all frequency bands for the simulated room. The early part of the response is calculated with the image source method and the late part by ray tracing. Although 5000 rays is usually a low number for this kind of simulation, it is not of critical importance here, since only the envelope of the energy-time curve is of interest, not the fine structure of the late reverberation.
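The constraint that the stored response must span the largest reverberation time fixes a minimum impulse response length in samples. A trivial sketch (the 2.0 s example value is hypothetical, not a figure from the paper):

```python
import math

def ir_length_samples(t30_max, fs=44100):
    """Minimum impulse-response length (samples) so that the stored IR
    covers at least the largest reverberation time among all bands."""
    return math.ceil(t30_max * fs)

# e.g. a room whose longest band reverberation time is 2.0 s
n = ir_length_samples(2.0)
```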

The acoustic simulation program exports the discrete early reflections separately, each one with its delay, direction of incidence, and attenuation per frequency band.

The late reverberation is exported as vectorial intensity (i.e., in first order Ambisonics format WXYZ) in each of the standard octave frequency bands from 63 Hz to 8 kHz at the defined time intervals. The combination of these two components is referred to as the Directional IR in Fig. 1. The LoRA toolbox is adapted to omit the direct sound from these files, because it will be produced by the talker himself during the real-time auralization. The early reflections are then encoded in fourth order Ambisonics and decoded into the corresponding loudspeaker layout for reproduction (see Fig. 6 in [5]).

The envelope of the reverberation tail is decoded with a lower directional accuracy (first-order Ambisonics) than the early reflections, which leads to a higher degree of diffuseness in the resulting multichannel IR. The decoded envelopes are filled with noise sequences that are uncorrelated among the different channels, in order to avoid coherent interference effects and coloration of the sound. The late reverberation is added to the early reflections, and the resulting impulse response for each loudspeaker is stored as a separate WAV file with 32-bit precision and a sampling frequency of 44.1 kHz.
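The noise-filling step can be sketched as follows. This is a simplified, broadband illustration, whereas the system works per octave band; the exponential envelope, the 0.6 s decay, and the function name are assumptions for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 44100

def fill_with_noise(envelope, n_channels, rng=rng):
    """Fill a decoded late-reverberation amplitude envelope (one value
    per sample) with noise sequences that are uncorrelated across
    channels, avoiding coherent interference between loudspeakers."""
    n = len(envelope)
    noise = rng.standard_normal((n_channels, n))
    noise /= np.sqrt(np.mean(noise**2, axis=1, keepdims=True))  # unit RMS
    return noise * envelope[None, :]

# Amplitude envelope decaying 60 dB over 0.6 s, filled for 29 channels
n_late = int(round(0.6 * fs))
t = np.arange(n_late) / fs
env = 10.0 ** (-3.0 * t / 0.6)
late = fill_with_noise(env, 29)
```

Each channel then carries the same energy-time envelope but statistically independent fine structure, which is exactly what prevents coloration when the 29 loudspeaker signals superpose at the listening position.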


…processing (convolution), and reproduction.

2.3.1 Signal acquisition

The talker’s speech signal is picked up with a head-worn DPA 4066-F microphone placed on the talker’s cheek, digitized at 44.1 kHz/24 bit with a Behringer ADA8000, and sent to a PC with an RME HDSP MADI audio interface connected to an RME ADI-648 (MADI/ADAT converter). Although other placements of the microphone could be more suitable for research, as e.g. in Cabrera et al. [10], the built-in fitting accessory was quite ergonomic and well adapted to placement on the cheek. The microphone capsule is close enough to the mouth to avoid any severe influence of feedback (see analysis later in the paper). As in Pörschmann [26], the spectral distortion introduced by this placement of the microphone is corrected with an equalizer filter h_EQ(t), which adjusts the spectrum of the speech signal to match the spectrum of the on-axis speech signal at 1 m in front of the mouth. The calculation of the equalizer filter is done on an individual basis, as the placement of the microphone in relation to the mouth differs among users. The measurement of the equalizer filter is also used to calibrate the system, as detailed in the next section. The justification for applying the equalizer filter is that the calculation of the impulse response in the simulation program assumes an on-axis source signal to provide a spectrally correct output. For practical reasons, the equalizer filter was pre-convolved with the stored multichannel room impulse responses, reducing the overall delay in the system during run-time operation. Nevertheless, the conceptual representation of Fig. 1 is still valid.
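One common way to obtain such an equalizer from the two measured responses (not necessarily the authors' exact method) is regularized spectral division; the names h_mic, h_ref, and the regularization constant are hypothetical:

```python
import numpy as np

def eq_filter(h_mic, h_ref, n_taps=1024, eps=1e-3):
    """Sketch of h_EQ(t): a filter mapping the cheek-microphone
    response h_mic onto the on-axis response h_ref measured 1 m in
    front of the mouth. The eps term keeps the frequency-domain
    division from blowing up where h_mic has little energy."""
    n = max(len(h_mic), len(h_ref), n_taps)
    H_mic = np.fft.rfft(h_mic, n)
    H_ref = np.fft.rfft(h_ref, n)
    H_eq = H_ref * np.conj(H_mic) / (np.abs(H_mic) ** 2 + eps)
    return np.fft.irfft(H_eq, n)[:n_taps]

# Pre-convolving h_EQ with a stored room impulse response, as done
# off-line in the system, is a plain linear convolution:
h_eq = eq_filter(np.array([0.5]), np.array([1.0]))  # toy responses
rir = np.zeros(2048)
rir[0] = 1.0                                        # dummy RIR
rir_eq = np.convolve(rir, h_eq)
```

Folding the equalizer into the stored IRs in this way costs nothing at run time, which is why the system can keep the conceptual structure of Fig. 1 while removing one real-time filtering stage.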


…provide high-quality audio, both regarding bit depth and sampling frequency, introduce the lowest possible delay between input and output, and convolve a number of long impulse responses. Lengths of hundreds of thousands of taps are typical for room impulse responses. In the present system, 29 simultaneous convolutions are required (one for each loudspeaker).

To perform the convolutions, a free software convolver, jconvolver, is used [19].

Jconvolver is a multichannel software implementation of the variable block-size convolution scheme proposed by Gardner [27]. It runs on a four-core PC under Fedora 8 Linux, patched with the real-time kernel from Planet CCRMA, and uses the JACK audio server with the ALSA sound driver architecture. The convolver is configured with a simple script that defines the input (the speech signal from the microphone), the 29 impulse responses, and adjustments of gain and delay to account for the positions of the loudspeakers in the actual arrangement, which are at different distances from the center of the layout. With JACK, each output of the convolver is assigned to a physical output of the audio interface.
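The per-loudspeaker gain and delay adjustments follow directly from the geometry. The actual values entered in the convolver script are not given in the text; the sketch below only illustrates the underlying distance compensation, aligning every loudspeaker to the farthest one (2.0 m) and undoing the 1/r level difference:

```python
C = 343.0      # speed of sound in air, m/s
FS = 44100     # sampling frequency, Hz
R_REF = 2.0    # reference (farthest) loudspeaker distance, m

def align(distance_m):
    """Extra delay (samples) and gain for a loudspeaker at the given
    distance from the array centre, so that all wavefronts arrive
    together and at equal level at the centre of the layout."""
    delay = (R_REF - distance_m) / C * FS   # closer speaker -> more delay
    gain = distance_m / R_REF               # closer speaker -> attenuated
    return delay, gain

d, g = align(1.5)   # a loudspeaker 0.5 m closer than the reference
```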

In order to investigate the demands of the DSP software in relation to the processing capability of the hardware (quad-core Intel PC with 8 GB of RAM), a small benchmark study was carried out. In Table 1, the CPU load is measured as a function of the minimum block size (identical for JACK and jconvolver) and the length of the impulse response, while computing 29 convolutions. In Table 2, the CPU load is indicated for each combination of number of channels and minimum block size, for an impulse response of 65536 samples. The CPU load increases with the number of channels and the length of the impulse response, whereas it decreases with the block size. The drawback of lowering the CPU load by increasing the block size is an increase in latency, which is not desirable for real-time convolution. The measured low values of CPU load show that it is possible


Table 1: CPU load for different impulse response lengths and minimum block sizes, computing 29 convolutions at a 44.1 kHz sampling frequency. The latency introduced by jconvolver is indicated in parentheses.

IR length (samples)   64 (2.9 ms)   128 (5.8 ms)   256 (11.6 ms)
22050                 8.7 %         7.3 %          6.6 %
44100                 9.2 %         7.9 %          7.2 %
88200                 10.2 %        9.0 %          8.2 %
176400                13.4 %        11.8 %         11.0 %

to run alternative processes in parallel to record or monitor an input or output signal, or to run multiple instances of jconvolver on the same computer, so as to simulate more complex auditory environments, for example by adding a second sound source at a different position in the simulated room.
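The latencies quoted in parentheses in Tables 1 and 2 are consistent with an input-to-output delay of two block periods at 44.1 kHz, a common figure for block-based processing (the two-period interpretation is an assumption; the paper only lists the values):

```python
def block_latency_ms(block_size, fs=44100):
    """Input-to-output latency of two block periods, in milliseconds,
    which matches the figures quoted next to each block size."""
    return 2 * block_size / fs * 1000.0

for b in (64, 128, 256):
    print(f"{b} samples: {block_latency_ms(b):.1f} ms")
```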

2.3.3 Reproduction

The output signals are converted to the analog domain with an RME ADI-648 MADI/ADAT converter and four Behringer ADA8000 devices, amplified, and sent to 29 DYNAUDIO BM6 loudspeakers. The loudspeakers are arranged on the surface of a quasi-sphere at distances in the range 1.5 m–2.0 m from the center of the arrangement (see Fig. 2 for specific details of this layout). As the frequency response of the loudspeakers is fairly flat in the frequency range of interest for voice (100 Hz–10 kHz), no equalizers are introduced, as these could be detrimental to the audio quality for small displacements from the equalized position [28].


Table 2: CPU load for different numbers of channels and minimum block sizes, for an impulse response of 65536 samples at a 44.1 kHz sampling frequency. The latency introduced by jconvolver is indicated in parentheses.

Number of channels   64 (2.9 ms)   128 (5.8 ms)   256 (11.6 ms)
4                    2.1 %         1.6 %          1.5 %
8                    3.1 %         2.7 %          2.4 %
16                   6.0 %         5.0 %          4.4 %
32                   10.8 %        9.3 %          8.6 %

Figure 2: Position of the 29 loudspeakers in the array used for reproduction (from Favrot and Buchholz [5]).


T30 [s]   0.16   0.09   0.08   0.07   0.07   0.07

3 Practical considerations

There are some practical issues that should be addressed so that this auralization system works as intended.