Artificial Reverb vs. Real Recorded Reverb in the Back Channels in a 5.1 Surround Setup
Adrian Emilsson
Audio Technology, bachelor's level 2018
Luleå University of Technology
Department of Arts, Communication and Education
Abstract
When recording music for surround, audio engineers sometimes face limitations such as lack of time, unavailable ideal microphone positions or a noisy audience. If these cannot be dealt with on location, artificial reverbs are often used in the mix to “fill in the blanks”. In this study, three instruments were recorded separately with two 5.1 surround microphone setups. Two audio engineering students created artificial reverbs that replaced the back channels of each microphone setup. A listening test was conducted in which test subjects compared the real 5.1 recording to two other stimuli with artificial back channels in terms of realism, envelopment and preference. The results showed that the real recording and the artificial back channels were interchangeable, but that the artificial back channels tended towards more envelopment and the real recording tended towards more realism.
Table of contents
1. Introduction
1.1 History of surround sound
1.2 Recording for 5.1
1.3 Reverberation
1.4 Realism in artificial reverberation
1.5 Envelopment in the surround sound format
1.6 Applications of artificial reverbs in the surround sound format
1.7 Research question
1.8 Purpose
2. Method
2.1 Method overview
2.2 Attributes and preference rating
2.3 Stimuli
2.3.1 Choosing instruments and type of music
2.3.2 Recording stimuli
2.3.3 Creating the artificial back channels
2.3.4 Normalizing the back channels
2.3.5 Back-channel bias
2.3.6 Finalizing the stimuli
2.4 Listening test
2.5 Test subjects
3. Results
3.1 Average test scores, T-tests and p-values
3.2 Comments by the test subjects about the test
3.2.1 Comments regarding the instruments
3.2.2 Comments regarding the attributes
4. Analysis
4.1 T-test instrument analysis
4.2 T-test attribute analysis
4.3 Test subject comment analysis
4.4 Artificial vs real recorded reverberation analysis
5. Discussions and Conclusions
6. Future applications and further studies
7. Bibliography
1. Introduction
In evaluation of surround sound the primary interest is usually the sound source and how it fills the space. Berg and Rumsey (2003) found attributes that are usable when evaluating surround sound. For example, naturalness, presence, ensemble width, localization, source distance and envelopment could all be defined and understood in the surround sound realm. The majority of these attributes evaluate the audio in the front channels, neglecting what is in the back. This could mean that the back-channel audio is less relevant than the front-channel audio. This pattern can be found in other audio fields too. A recording engineer might, when faced with time, venue or equipment constraints, not record back-channel audio, presuming that it can be artificially created later with the same result. Would there be a notable difference between the recorded and the artificial audio? In a study by King, Leonard, Howie and Kelly (2017), realism and immersion in surround were investigated using both real recorded reverberation and artificial reverberation. King et al. used a 9.1 surround setup, that is, a 5.1 surround setup with four additional height channels, and only switched between the real and artificial audio in the height channels. They showed that there was no perceived difference between real and artificial height channels in terms of realism or immersion. The study by King et al. can be simplified by using a 5.1 surround system, with the benefit of a potentially greater perceivable difference between the real and artificial audio. Two of the validated attributes in the study by Berg and Rumsey (2003) that include the back channels to a greater extent, envelopment and naturalness, almost match the attributes in the study by King et al. (2017). In this study, however, the attributes in focus are realism and envelopment.
Background
1.1 History of surround sound
“Fantasound” was the forerunner of all modern surround sound systems and was used to accompany the movie “Fantasia”, created in 1938 by Disney. The system used a surround sound setup with two front speakers and only one surround speaker in the back. During the 1950s different types of surround setups emerged on the market, using different numbers of speakers. Problems of perceived location made some setups, like the quadraphonic, an unsatisfying choice for surround. This led to the modern 5.1 speaker setup, which arose during the early 1990s. It consists of a Left and a Right speaker forming an equilateral triangle with the listener, a Center speaker between them at the same distance from the listener, two Surround speakers placed at ±110° (±10°) from the center speaker, and a subwoofer somewhere in the listening environment (figure 1) (International Telecommunications Union, 2013).
Figure 1. The modern surround sound setup, without the subwoofer.
In smaller surround systems the surround speakers’ positions become a compromise between envelopment and rearward image. In practice, this can only be solved by using more speakers, which reduces the tradeoff. Height speakers can be added to complement the speakers in the horizontal plane, which will better simulate a real-world venue.
However, the ability to hear errors in elevation is generally three times less accurate than hearing errors in the horizontal plane, when listening to sounds in the front. This ability is further reduced when listening from the side or back (Holman, 2008). That might indicate that more speakers are not necessary, but a study by Hamasaki, Nishiguchi, Hiyama & Okumura (2006) showed that added speakers with different heights can produce a significant increase in direction as well as envelopment.
1.2 Recording for 5.1
There are many different recording techniques one can use when recording for surround. Every technique has a different characteristic and will be better suited for certain types of recordings. Listener preference can even vary between pieces depending on the way they were recorded (Atsushi, Francisco, Kim, Martens, Walker, 2006). As the venue itself affects the possibilities of how to record, and also plays a major part in the sound of the music, established surround microphone setups could (or should) be considered as suggestions. These established surround setups are often chosen with particular attributes of the music in mind. If desired, the recordist could capture the music with, for example, more directional properties or with a very reverberant sound. The only techniques discussed in this chapter are the Fukada Tree, the OCT Surround and the Hamasaki Square.
In stereo recordings the stereo effect is generally created by a difference in either intensity or time between the channels. Two well-known recording techniques are XY-stereo, which uses two cardioid microphones in the same spot at a 90° angle, and AB-stereo, which uses two omnidirectional microphones spaced some distance apart. XY-stereo is known to have good directional properties but to sound narrow. AB-stereo is known to have a full sound but to lack directional properties. These two can be tweaked or combined in, for example, ORTF-stereo, where there is a small distance between two cardioid microphones at a 110° angle. When recording in surround these basic techniques can be configured in many different combinations in order to capture a sound in an environment in the best way (Holman, 2008).
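The time-difference principle behind AB-stereo can be illustrated with a small calculation. The following is a minimal sketch, not taken from the cited sources: it assumes a distant source (plane-wave approximation), a speed of sound of 343 m/s, and a hypothetical function name.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 °C


def ab_time_difference(spacing_m, angle_deg):
    """Arrival-time difference in seconds between two spaced microphones.

    Assumes a distant source at angle_deg off the centre axis of the pair,
    so the extra path to the far microphone is spacing * sin(angle).
    """
    path_difference = spacing_m * math.sin(math.radians(angle_deg))
    return path_difference / SPEED_OF_SOUND


# Example: a 0.5 m spaced pair and a source 30 degrees off axis
delay_ms = ab_time_difference(0.5, 30.0) * 1000.0
```

For a coincident pair such as XY-stereo the spacing is zero, so the time difference vanishes and only the intensity difference remains.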
The Fukada tree (figure 2) consists of five cardioid microphones: four of the microphones are placed in a square with 2 m sides, facing outwards from the center of the square, and one center microphone is placed 1 m closer to the sound source. This setup has good directional properties, since it consists of only directional microphones. The Fukada tree is often used with two additional omnidirectional microphones (in grey in figure 2), most often positioned one meter outside the left and the right microphone respectively. This can yield unwanted phase issues, and depending on the recording situation their positions could be changed. Blending the cardioid and omnidirectional signals for the left and the right channels respectively can create a fuller sound while retaining the beneficial directional properties (DPA Microphones, 2017).
Figure 2. The Fukada Tree.
The OCT Surround array (figure 3) consists of three cardioid microphones and two supercardioid microphones. Compared to the Fukada tree this rig is half the size in width and one sixth of the size in length. The left and right microphones are supercardioid. The spacing between the left and the right side can be adjusted to suit different sizes of ensembles. The main idea with this setup is to produce better phantom images between the front speakers. In the Fukada tree, for example, there is a small overlap between the front microphones, which can make the half-images “mushy” (Holman, 2008).
Figure 3. The OCT Surround.
The Hamasaki square (figure 4) is not a complete surround microphone setup in itself but can complement spot microphones or a main microphone array. It is mainly used to capture ambience and not the sound source(s) directly. The Hamasaki square consists of four bidirectional microphones positioned in a square with a side of 2 m, with the null side of all microphones pointing towards the source. The square itself might not even be positioned close to the main array, and several squares can be used at the same time in different parts of the venue (DPA Microphones, 2017).
Figure 4. The Hamasaki square, with the arrow pointing towards the sound source.
Depending on the situation, combinations of these methods and traditional stereo methods must sometimes be used to create a satisfying result. When recording a symphony orchestra there is often a main array, usually positioned above the conductor. Spot microphones for the different instrumental groups are often used to highlight solos during the performance. The spot microphones are panned according to the main array. Hamasaki squares can be used to capture the space of the venue. These combinations of techniques can apply to smaller ensembles as well (Holman, 2008).
When recording for even bigger formats than 5.1 surround, additional microphones can be used to capture height channel information, or side channel information. In the study by King et al. (2017) four microphones were used to capture height information, in addition to the main 5.1 array. The height channels in that recording were mainly used for ambience enhancement. King et al. used four omnidirectional microphones with diffraction attachments and they were positioned one meter above the left, right and surround microphones pointing upwards, based on a 5.1-Fukada tree setup. These types of bigger microphone setups often mirror the speaker setups later used when listening to the reproduced recording. In that way the sound reaching a certain spot in the recording environment will originate in the same spot in the speaker environment, making it sound more natural.
As many setups, including the Fukada Tree and the OCT Surround, tend to resemble or mirror the speaker positions of the reproducing system, one could conclude that this would create the most natural sound. This, however, is not always the most desired sound when recording, since all venues have different aural qualities. The microphones are there to capture the sound and the venue, rather than to be positioned in a certain way to satisfy a specific microphone array. This, in the long run, means that the venue and the sound produced in that venue will set the terms for how to record. This also means that sound engineers can capture certain attributes depending on how they record. Sometimes the sought-after attributes may not have been captured at the venue. This could lead the mixing engineers to use artificially created sounds to accomplish the wanted result (Shriram, 2011).
1.3 Reverberation
Shriram (2011) describes reverberation in a room as “…a natural phenomenon that occurs in an enclosed space due to the sound reflecting off the different boundaries of that space”. Shriram then decomposes sound in enclosed spaces into three parts: direct sound, early reflections and reverberant sound. Direct sound is the sound coming straight from the source to the listener, unaffected by any boundaries in the room. Early reflections are described as reflections via surfaces in the room, separated from the direct sound in both time and direction. Reflections heard 5 ms to 50 ms after the direct sound are considered early reflections. The reverberant sound is a denser set of reflections heard after 50 ms; they come from all directions, and the sound has been reflected many times. Every enclosed listening environment will have a different type of reverberation, but all can be described using these three parts.
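The three-part decomposition can be expressed as a simple classification by arrival time. The sketch below is illustrative only: the function name is hypothetical, and arrivals earlier than 5 ms, which the description above does not assign to a category, are assumed here to count as part of the direct sound.

```python
def classify_arrival(delay_ms):
    """Classify a sound arrival by its delay after the direct sound.

    Thresholds follow the 5 ms and 50 ms limits quoted in the text.
    Treating delays below 5 ms as direct sound is an assumption.
    """
    if delay_ms < 0:
        raise ValueError("delay must be measured from the direct sound")
    if delay_ms < 5.0:
        return "direct sound"
    if delay_ms <= 50.0:
        return "early reflection"
    return "reverberant sound"


# Example: label a handful of arrival times (in milliseconds)
labels = [classify_arrival(t) for t in (0.0, 12.0, 50.0, 180.0)]
```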
When creating reverbs artificially there are two main types of reverb: algorithmic and convolution. The algorithmic reverb relies on calculated echoes and feedback loops. More advanced algorithmic reverbs take both the time and frequency domains into account, simulating specific rooms with their specific early reflections, how the sound is absorbed over time, general reverberation time etc. (Everest & Pohlmann, 2017). Convolution reverbs are also calculated but rely on an impulse response, which is the recorded reverberation from a specific venue. The impulse response is convolved with the dry signal (equivalent to multiplying their spectra in the frequency domain) so that it sounds as if it was recorded in that space. The advantage of algorithmic reverbs is the possibility to change the parameter settings, but this requires knowledge of what the different parameters do. The advantage of convolution reverbs is the more immediately natural-sounding reverberation, since it always builds on existing venues. That, however, also implies that you need a lot of different impulse responses in order to create different types of reverberating venues (Shriram, 2011).
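A convolution reverb combines the dry signal with the impulse response by convolution. The following is a minimal sketch of that operation in plain Python, written for clarity rather than speed; real reverbs typically use FFT-based methods. The function name and the toy sample values are assumptions made for illustration.

```python
def convolve(dry, impulse_response):
    """Direct (time-domain) convolution of a dry signal with an impulse response.

    Every input sample triggers a scaled copy of the impulse response,
    and all copies are summed into the output.
    """
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for n, x in enumerate(dry):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


# Toy example: a single click through a three-tap decaying "room"
dry_signal = [1.0, 0.0, 0.0]
room_ir = [1.0, 0.5, 0.25]  # hypothetical impulse response
wet_signal = convolve(dry_signal, room_ir)
```

Feeding a single click through the function returns the impulse response itself, followed by silence, which is what defines an impulse response.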
1.4 Realism in artificial reverberation
Studies of surround formats bigger than 5.1 came to the conclusion that additional height channels are important when creating envelopment and a sense of a realistic space (Hamasaki et al., 2006). The same study proposed a surround system using 22.2 channels. This number of channels will in turn create a demand for more discrete channels during the recording phase. This could possibly be worked around by using reverb to fill in the blanks (King et al., 2017). King et al. used a 9.1 setup consisting of a 5.1 setup and four additional speakers positioned above the right, left, right surround and left surround speakers. The height channels either had location-recorded information or artificially produced information, derived from the main 5.1 recording. They showed that there was no preference for either the recorded or the artificially produced material when used together with the 5.1 recording. They also showed that both the recorded and the artificially produced material, in the 5.1 context, had the same realism rating. Their test subjects could, however, easily differentiate between the two types of height source material when listening to the height channels alone.
When real reverberation was compared with two convolution reverbs modeled after the same venue, preferences differed between instruments with different types of timbre and spectral composition (Shriram, 2011). The test also included a rating of paired attributes for the stimuli. The subjects had to rate naturalness (artificial – natural), spaciousness (small – big), ambience (dry – wet), distance (near – far), roughness (smooth – rough) and density (scattered – compact) on a scale from 1 to 7. The natural recording of the venue proved to be the most natural sounding of the three, but was not perceived as the biggest or the most ambient, scoring in the middle of the scale. Here the convolution reverbs both sounded bigger and more ambient. This shows how different types of reverbs can be good for different purposes.
1.5 Envelopment in the surround sound format
Envelopment in audio is the sense of being surrounded by sound, and it is most commonly experienced in big reverberant halls or through a surround sound system. There are of course other enveloping audio-related experiences, for example when reverberant listening rooms are used and all speaker-produced sounds are reflected off the back and side walls. Envelopment can roughly be seen as one of two parts of spatial impression, the other being apparent source width (Soulodre, Lavoie, Norcross, 2003). Soulodre et al. found that the apparent source width is determined by the energy within the first 105 ms after the arrival of the direct sound. Envelopment, or listener envelopment, is determined by the energy arriving later than 105 ms after the direct sound. What makes a sound more or less enveloping is the level and spatial distribution of that late energy. Soulodre et al. (2003) also found that an overall higher level and a longer reverberation time were perceived as more enveloping. The attribute immersion was used by King et al. (2017) to evaluate artificial reverbs. Immersion is similar to envelopment, but does not need a surround system to be apparent.
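The split between early and late energy at 105 ms can be sketched for a sampled impulse response as follows. This is only an illustration of the time boundary described above, not the measurement procedure of Soulodre et al.; the function name is hypothetical, and sample 0 is assumed to coincide with the arrival of the direct sound.

```python
def split_energy(impulse_response, sample_rate_hz, boundary_ms=105.0):
    """Return (early_energy, late_energy) of an impulse response.

    Energy is the sum of squared sample values. Samples before the
    boundary count as early (related to apparent source width), the
    rest as late (related to listener envelopment).
    """
    boundary = int(round(sample_rate_hz * boundary_ms / 1000.0))
    early = sum(s * s for s in impulse_response[:boundary])
    late = sum(s * s for s in impulse_response[boundary:])
    return early, late


# Toy example at a 1 kHz sample rate: 105 early samples, 4 late samples
toy_ir = [1.0] * 105 + [0.5] * 4
early_energy, late_energy = split_energy(toy_ir, 1000)
```

Raising the level of the late part increases the late energy, which is in line with the finding that a higher overall level is perceived as more enveloping.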
1.6 Applications of artificial reverbs in the surround sound format
With the price of hard drive space still declining every year, we are increasingly capable of recording more music. Today there is rarely a limit to how many channels one can record simultaneously. This advance in technology will make it easier to record more channels, as long as there are enough microphones. Big microphone setups can take time to rig, and when recording live concerts there is always a possibility that the recording venue is noisy or filled with a loud audience. King et al. (2017) showed that it is possible to replace recorded audio with reverb in height channels and still have a preferable, realistic and immersive result. This could lead to fewer microphones being needed while still maintaining the audio qualities of the real recording.
The possibility to create more natural sounding environments increases with the number of channels, which in turn puts pressure on the mixing engineer to know how to utilize the surround sound format to the fullest, and how effects like reverbs behave when used in surround. One might argue that audio engineers are capable of creating suitable reverberation that can compete with or exceed the recorded material in terms of realism or envelopment.
1.7 Research question
Can real recorded reverb be replaced with artificial reverb in the back channels of a 5.1 surround sound setup with the same perceived realism and/or envelopment?
1.8 Purpose
This study could potentially lead to a better understanding of artificial reverb regarding the attributes envelopment and realism. The results could serve as guidelines when creating artificial reverb, so that the audio engineer knows what to use when a more enveloping or realistic sound is needed.
2. Method
2.1 Method overview
In order to find out if realism or envelopment could be preserved when replacing real audio with artificial, a listening test was conducted. The listening test had to have stimuli that were recorded in a way that suited this type of experiment. The study by King et al. (2017) made a similar comparison and used classical music recorded in a large hall. The study by Shriram (2011) showed that different types of reverbs are better suited for different types of instruments, and Atsushi et al. (2006) showed that music could be more or less preferable depending only on the microphone setup used in the recording. This suggested that at least two different artificial reverbs should be compared to the real recording and that at least two microphone setups should be used.
In accordance with prior research (King et al., 2017) the ambition was to make artificial reverbs that would be easy to recreate, for future purposes. The stimuli were then compared against each other in a listening test in terms of realism, envelopment and preference.
2.2 Attributes and preference rating
The attributes chosen for this test were realism and envelopment. The only difference between stimuli was in the back-channel audio. Therefore, all attributes associated with perceived direction, instrument width, or front channel information in general, were unnecessary. The attributes were chosen from those that Berg and Rumsey (2003) proposed when evaluating attributes for surround sound audio. The attribute “Naturalness”, which was described by Berg and Rumsey as “How similar to a natural (i.e. not reproduced through e g loudspeakers) listening experience the sound as a whole sounds.”, combined with the attribute “Presence”, described as “The experience of being in the same acoustical environment as the sound source, e g to be in the same room.”, could roughly be compared to the attribute “Realism” that King et al. (2017) used in a study similar to this one. In the same study King et al. also used the attribute
“Immersiveness”. What makes a sound immersive might not depend on how it was recorded or how many speakers you are listening through. It could be as simple as the sound itself drawing you into the virtual world of that sound. This also means that a disconnecting sound could have the opposite effect, even if the recording and speaker conditions were right. Instead the attribute “Room Envelopment” proposed by Berg and Rumsey (2003) was chosen. Berg and Rumsey described it as “The extent to which the sound coming from the sound source’s reflections in the room (the reverberation) envelops/surrounds/exists around you – i e not the sound source itself. The feeling of being surrounded by the reflected sound.”, which comes closer to what this study focuses on. The direct sound of the sound source is, in this study, not as interesting as what has happened to the indirect sound.
In addition to the attribute ratings a preference rating was made, also based on the study by King et al. (2017), which used both attribute and preference ratings. They found that the attribute ratings did not necessarily mirror the preference ratings. Berg and Rumsey (2003) describe preference as “If the sound as a whole pleases you. If you think the sound as a whole sounds good. Try to disregard the content of the programme, i e do not assess genre of music or content of speech.”. If a realistic and enveloping sound turns out not to be preferred, other unknown factors must determine the outcome, which makes the preference rating just as important as the attribute ratings.
2.3 Stimuli
Each stimulus was either a multichannel recording in all five full-range channels, or a front channel recording of three channels with an artificial reverb in the two back channels.
All music was recorded with a 5.1 surround sound speaker setup in mind, but neither microphone technique includes an LFE microphone. No mixing was involved, other than normalizing the artificial back channels. This meant that the LFE channel was not used, either during recording or during the test.
2.3.1 Choosing instruments and type of music
Three recordings were the foundation of the stimuli used in this experiment. The recordings were short classical music excerpts played on clarinet, piano and snare drum. These instruments are different in frequency and timbre, and they differ a lot in their tonal width. Shriram (2011) showed that different types of reverbs were preferred on different types of instruments (piano, oboe and cello). The instruments in this study were chosen with that in mind. All three musicians who were recorded in this study were asked to play a piece of music that included big dynamic differences and, if possible, also differences in frequency content. It was important that they were familiar with the piece and could really express the differences within it, while still producing a musically satisfying recording. Larger dynamic differences tend to excite both natural and artificial reverberation in a way that could make any differences more obvious. The pieces that the musicians chose to play were “Opus 10: Etude No. 3 in E major”, by Frédéric Chopin, on piano, “Monolog”, first mov., by Erland von Koch, on clarinet and “For what four?”, by Lalo Davila, on snare drum. All musicians were students at the School of Music at Luleå University of Technology.
2.3.2 Recording stimuli
For every performed musical piece a number of different simultaneous recordings were made. The surround microphone setups Fukada Tree and OCT Surround were used in this experiment. Two additional microphone setups were also used: a Hamasaki Square and a stereo pair of close-up microphones, each pointing about 10° away from center. All stimuli were recorded in the large concert room in Studio Acusticum, since the long reverberation of the hall is suitable for this type of experiment. The adjustable ceiling in the hall was at the highest level, which gave the hall a reverberation time of 2.5 s. The choice of microphone setups was partly based on the study by Atsushi et al. (2006), where the Fukada Tree showed the best preference ratings in comparison with three other recording setups. Theile (2001) proposed a then new type of surround microphone setup, the OCT Surround, and compared it to the Fukada Tree. Both the Fukada Tree and the OCT Surround use cardioid or super-cardioid microphones, which makes them easy to compare. This led to the conclusion that the same type of microphone could be used at all recording positions. This posed a potential problem, since the microphones in these two setups do not all have exactly the same characteristics. As seen in Figure 3 the left and right microphones of the OCT Surround are super-cardioids, rather than cardioids. Since this experiment focuses on the difference between different back channel information, the microphone similarity in the front channels was more important than getting all microphone characteristics right. In the recording the Neumann KM184 (Georg Neumann GmbH, 2018) was used at all recording positions. This microphone was the only condenser microphone available in sufficient numbers, meaning that its only directional pattern, cardioid, had to do. To avoid confusion, the modified OCT Surround that was used is from here on called OCTmod. All microphones were recorded using RME Micstasy preamps at the same gain and a Sequoia interface. No mixing was applied to the recordings, which means that there is only one microphone per channel.
The Hamasaki square that was used during recording was never used in the experiment due to the way it is supposed to be mixed with the rest of the channels. The idea is to take the two front microphones and add them to the front left and right channels, while the two back microphones are added to the left and right surround channels. If the reason for replacing back-channel audio is noise, it would seem strange if only the back microphones of the Hamasaki square were affected. In order to keep the number of parameters low this recording technique was scrapped, even though it is often used in this type of recording. The close-up microphones were not used directly in this test either, but were fed through some of the artificial reverbs.
Figure 5. Microphone positions viewed from above with the instrument at the X to the right. The blue dots represent the Fukada Tree and the red dots represent the OCTmod. The same spot was used for their individual center microphone. All surround setup microphones were placed 2.5 m above the stage floor.
2.3.3 Creating the artificial back channels
The artificial reverb in the back channels was created to suit the front channels of each microphone setup. The reverb was created by two audio engineering students at the School of Music at Luleå University of Technology. They were both provided with the three front channels for both microphone setups and were then asked to create back-channel audio that suited the three channels in the front. They were also given information about where the music was recorded and were instructed to, if possible, make the reverb sound enveloping and realistic. They were provided with two types of source audio when creating their back channels. One student made his back-channel audio for the Fukada Tree front channels from the close-up microphones, and the back-channel audio for the OCTmod front channels from the three OCTmod front channels. The other student was given the three Fukada Tree front channels when creating his back-channel audio for the Fukada Tree, and the close-up microphones for the OCTmod front channels. This is shown in figure 6, where the audio that was used to create the back channels is circled for each student and technique.
Student 1 → Reverb 1a, Fukada
Student 1 → Reverb 1b, OCTmod
Student 2 → Reverb 2a, Fukada
Student 2 → Reverb 2b, OCTmod
Figure 6. Visualization of the audio used for each reverb.
The three front channels of each microphone setup had a more reverberant sound than the very dry close-up microphones. The two students used a Lexicon 960 to create the back-channels. They were both familiar with how the Lexicon 960 works and the sound of the big concert room in Studio Acusticum. They did not get to listen to the real back channel recording and were never able to compare their artificial audio to the real recording. The two reverb settings each student made were supposed to suit all three instruments.
One often used technique for artificial reverbs is to feed really “dry” audio into the reverb in an attempt to blend the spot microphone audio with ambience audio. It is, however, also common to feed the main recording pair/array through the reverb, which makes the added artificial reverb correlate more with the main pair, often creating a more coherent whole. Those reverbs are blended into an already decent mix, which differs from this study, where the reverbs are used as individual channels. The two students who created the reverbs had to make one reverb from each type of input audio, since one of the audio types might turn out to be better suited for creating a reverb with certain characteristics.
2.3.4 Normalizing the back channels
When creating their back channels both students increased the input level of the Lexicon by 6 dB when using the close-up microphones, and one of the students did not use his center channel when running the three front channels through the Lexicon. Since all artificial reverbs were created without knowledge of the real back-channel audio, this led to different audio levels among the back-channel pairs. A higher level of the back channels would be perceived as more enveloping (Soulodre et al., 2003). This was solved by normalizing all back channels to the same loudness level. The artificially created back-channel audio was a compromise between the three different instruments, which meant that levels could differ a lot between transient-rich sounds and sounds without transients. Thus, the configuration of the test determined how the audio was normalized. The test subjects were to grade three different surround stimuli against each other, one real recording and two with artificial back channels, which meant that the two stimuli with artificial back channels had to be normalized to the same loudness level as the real recorded back channels. The loudness measurement was done in accordance with EBU R128, and the levels were normalized to within a tenth of a decibel (European Broadcast Union, 2014).
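The gain needed to match one back-channel pair to another follows directly from their measured loudness values. The sketch below is an assumption-laden illustration: it does not implement the EBU R128 loudness measurement itself, the function names are hypothetical, and the loudness figures in the example are invented placeholders rather than values from this study.

```python
def matching_gain_db(target_loudness, measured_loudness):
    """Gain in dB that moves the measured loudness to the target loudness."""
    return target_loudness - measured_loudness


def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def is_matched(loudness_a, loudness_b, tolerance_db=0.1):
    """True when two loudness values agree within the given tolerance."""
    return abs(loudness_a - loudness_b) <= tolerance_db


# Hypothetical example: real back channels measured at -23.0,
# artificial back channels measured at -29.0 (loudness units)
gain_db = matching_gain_db(-23.0, -29.0)
scale = db_to_linear(gain_db)
artificial_samples = [0.1, -0.2, 0.05]  # placeholder audio samples
normalized = [s * scale for s in artificial_samples]
```

After scaling, the artificial channels would be measured again, and the gain adjusted until the two loudness values agree within the chosen tolerance.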
2.3.5 Back-channel bias
The artificial back channels could have been made by the recordist, but the awareness of how the recorded audio actually sounded could have influenced the way the artificial reverb was made. That could have directed the different stimuli towards being almost inseparable, which was not the intention of this study. By using unbiased students when creating the artificial reverbs, the recorded and the artificial audio would not turn out alike. Instead, the students projected their own subjective view of realism and envelopment onto the audio they created. That is also why more than one student was asked to make an artificial reverb, so that there would be a larger representation of what engineers think of as “real” or “enveloping” when creating reverbs. Two unbiased versions is a small number, and not in any scientific way representative, but still more representative than one biased version.
2.3.6 Finalizing the stimuli
The recorded pieces were, except for the snare drum piece, too long for the intended purpose of the music and were therefore shortened. As the intention when choosing pieces of music was to have a great dynamic range within the piece, it was of great importance that the shortened versions of the music also contained that dynamic range.
The shortening was also done in a way to keep whole phrases of the piece intact. This gave test versions of the pieces a length of about one minute. Shorter parts than one minute would not have displayed all different qualities of the recordings.
2.4 Listening test
The listening test took place at the School of Music at Luleå University of Technology in
room L151, an acoustically treated mixing and control room for 5.1 surround. The
volume of the test was set by two audio engineers at a level they individually felt
comfortable with. The average of those two levels was used during the test and the test subjects were not allowed to change the volume during the test. All visible meters in the listening room that could indicate something about the sound were hidden so that the test subjects could rely only on their hearing. The subjects were asked to sit in the sweet spot, at the same distance from all speakers. This ensured that all subjects had the same listening conditions throughout the test and that no sound levels were harmful.
The test interface was built using a customized version of MUSHRA, without reference stimuli, that included three stimuli in each test trial. The trials were in random order and the order of the stimuli in each trial was also randomized. This ensured that the order of stimuli couldn’t determine the end result. The top left corner of each test page indicated what attribute/preference the test subject was to rate. On each test page there were also three buttons, one for each of the three stimuli, and above the buttons were faders, which were used for rating each stimulus on a scale from 0 to 100. A higher number was associated with a higher perception of the attribute, i.e. more
enveloping, realistic or preferable. When pushing one of the buttons the stimulus
associated with that button was played. On every page there was a section dedicated to playback, including the position in the stimulus, start and stop positions and a play, a pause and a loop button.
Figure 7. The MUSHRA test page that was used during the test. In the actual test only three stimuli were compared at the same time, with no reference track.
Each test subject underwent a brief training test before the real test on how to use the
interface which mainly consisted of learning the playback section. They were also given
a short explanation of each attribute and what type of stimuli they were supposed to
rate. They were also trained in how they were going to rate the stimuli. No information
about the actual difference was given to the test subjects.
During the test the test subjects compared three stimuli against each other and rated them according to what they thought sounded most realistic/enveloping/preferable.
Since there were three instruments recorded with two microphone setups and three different types of ratings the test subjects had to do 18 (= 3 ∙ 2 ∙ 3) different trials. This means that there were only six different sets of stimuli, containing three versions with the same front channels. The two different microphone setups were never compared against each other, as this study focuses on differences between the real recording and the artificial.
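The trial structure described above can be sketched as a small script. It is only an illustration of the 3 ∙ 2 ∙ 3 design and of the two levels of randomization; the names and the seed are assumptions and do not come from the actual test interface.

```python
import itertools
import random

INSTRUMENTS = ["piano", "clarinet", "snare drum"]
SETUPS = ["Fukada", "OCTmod"]
RATINGS = ["realism", "envelopment", "preference"]
VERSIONS = ["real recording", "Reverb 1", "Reverb 2"]

def build_trials(rng):
    """Return all 18 trials in random order, each with its three
    back-channel versions assigned to the buttons in random order."""
    trials = []
    for instrument, setup, rating in itertools.product(
            INSTRUMENTS, SETUPS, RATINGS):
        button_order = VERSIONS[:]
        rng.shuffle(button_order)      # randomize stimuli within the trial
        trials.append({
            "instrument": instrument,
            "setup": setup,
            "rating": rating,
            "button_order": button_order,
        })
    rng.shuffle(trials)                # randomize the order of the trials
    return trials

# The seed is arbitrary; each test subject would get a different one.
trials = build_trials(random.Random(2018))
print(len(trials), "trials")
```

Each combination of instrument and microphone setup appears three times, once per rating, which is why only six different sets of stimuli are needed.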
After the test each test subject was asked to write down their general thoughts about the test, for example if they found differences in how they perceived different attributes, instruments etc.
2.5 Test subjects
In total 17 test subjects, ranging from 21 to 29 in age, did the test. All test subjects were studying at the School of Music at Luleå University of Technology and all had experience of listening to live classical music and/or surround sound music. All subjects reported having normal hearing.
It was important that the test subjects didn’t know what the differences between the
stimuli were beforehand. If they knew that the differences were in the back channels
their listening focus could have been directed towards the back channels, not listening
to the surround stimuli as a whole. This is also why some students at the school were
disqualified from partaking in the test, despite them being good listeners.
3. Results
The raw data from the test is presented in the appendix, including the distribution
represented in histograms. The written comments each test subject made after the test are also in the appendix. The comments are not translated from their original language, in order not to lose vital information in translation. Histograms for the t-tests with significant p-values can be found in the appendix.
3.1 Average test scores, T-tests and p-values
The T-tests in this study were paired and two-tailed, since the compared numbers were set by the same test subjects and with a possible outcome on both sides. The confidence level of these t-tests was set to 95 %, i.e. a significance level of 0.05.
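As an illustration of what a paired, two-tailed t-test does with the ratings, the sketch below computes the t-statistic for two sets of scores given by the same 17 subjects and compares it with the two-tailed critical value for 16 degrees of freedom (about 2.120 at the 0.05 level, taken from a standard t-table). The ratings are made up for the example and are not the data of this study; the study reports exact p-values, which this sketch does not compute.

```python
import math
import statistics

# Two-tailed critical t-value for alpha = 0.05 and 16 degrees of freedom
# (17 paired ratings), from a standard t-table.
T_CRIT_DF16 = 2.120

def paired_t_statistic(a, b):
    """t-statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Hypothetical ratings (0-100) from 17 subjects, for illustration only.
real = [85, 90, 78, 88, 92, 80, 75, 95, 83, 87, 91, 79, 84, 89, 86, 93, 77]
reverb = [80, 82, 79, 70, 85, 76, 74, 88, 81, 80, 84, 75, 83, 78, 80, 90, 72]

t = paired_t_statistic(real, reverb)
print(f"t = {t:.3f}, significant at 0.05: {abs(t) > T_CRIT_DF16}")
```

Because the test is two-tailed, the absolute value of t is compared with the critical value, so a difference in either direction can be significant.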
Table 1. Average test scores and p-values for the attribute Realism. The highlighted numbers (marked *) represent p-values lower than 0.05.
REALISM            Average test score                P-value
Stimuli            Real Rec.   Reverb 1   Reverb 2   Real vs Rev1   Real vs Rev2
Piano, Fukada      82.1        81         82.3       0.8396         0.9626
Piano, OCTmod      89.2        80.5       80.8       0.0401*        0.0403*
Clarinet, Fukada   84.1        81.1       82.7       0.6509         0.7637
Clarinet, OCTmod   87.4        62.5       81         0.0028*        0.2077
Snare, Fukada      84.5        74.8       70.7       0.1010         0.0165*
Snare, OCTmod      77.2        71.4       73.2       0.3923         0.5365
Table 2. Average test scores and p-values for the attribute Envelopment. The highlighted numbers (marked *) represent p-values lower than 0.05.
ENVELOPMENT        Average test score                P-value
Stimuli            Real Rec.   Reverb 1   Reverb 2   Real vs Rev1   Real vs Rev2
Piano, Fukada      68.6        84.2       76.6       0.0068*        0.2560
Piano, OCTmod      73          71.2       70.1       0.7810         0.6563
Clarinet, Fukada   61.3        81.2       70.2       0.0130*        0.1264
Clarinet, OCTmod   71.6        83.9       66.1       0.1119         0.3192
Snare, Fukada      64.8        77.9       81.9       0.0157*        0.0055*
Snare, OCTmod      83.1        75.1       65.5       0.1824         0.0349*
Table 3. Average test scores and p-values for the attribute Preference. The highlighted numbers (marked *) represent p-values lower than 0.05.
PREFERENCE         Average test score                P-value
Stimuli            Real Rec.   Reverb 1   Reverb 2   Real vs Rev1   Real vs Rev2
Piano, Fukada      84.1        78.9       80.2       0.5063         0.5742
Piano, OCTmod      80.9        89.2       79         0.2072         0.7706
Clarinet, Fukada   84          84.9       79.1       0.8776         0.3989
Clarinet, OCTmod   85.2        79.6       81.8       0.3611         0.5834
Snare, Fukada      88.8        78.3       70.9       0.1391         0.0103*
Snare, OCTmod      85.1        76.2       78.8       0.2708         0.2897
3.2 Comments by the test subjects about the test
All test subjects found the test to be hard, and no one was really sure about all their answers, saying that some stimuli sounded too alike. The predicted test time was around 20 minutes, while the actual test time ranged from 25 minutes to one hour with an average test time around 35 minutes.
3.2.1 Comments regarding the instruments
The most common comments made by the test subjects regarded the snare drum.
Several of them thought that the snare drum sound was the easiest to evaluate because the stimuli differed the most. Some of the test subjects, however, gave the same reason for the opposite: they found it harder to evaluate the snare drum stimuli because of the bigger differences.
The comments regarding the piano circled around it being hard to evaluate. One test subject said that it was due to the front channels being much stronger than the back channels.
Others thought that the piano was the easiest one to evaluate.
The comments regarding the clarinet were, similar to the snare drum comments, about some finding it to be the easiest one to evaluate while others found it to be the other way around. Some pointed out that the clarinet stimuli contained more
instrument/instrumentalist noise than the other stimuli.
3.2.2 Comments regarding the attributes
The perceived realism in the test was hard to evaluate for the majority of the test subjects, due to them not knowing which reference frame for realism to use. Some test subjects differed in their opinion about realism with certain instruments, but the snare drum was almost always involved.
The comments about perceived envelopment varied from test subject to test subject,
with some finding it very easy to evaluate, while others were struggling with certain
instruments. The majority seemed to think that it was easier to evaluate envelopment
than realism.
Very few test subjects made any comments about the preference rating, but when they commented on preference they all spoke of it in comparison to the other attributes.
Perceived large stimuli differences tended to give clearer preference scores.
Some of the test subjects perceived a difference in the stereo width when switching
between the three stimuli, despite there being no difference in the front channels.
4. Analysis
4.1 T-test instrument analysis
Two of the significant p-values regarding the piano were connected to realism, where the real recording was perceived as significantly more realistic than both reverbs. The third significant p-value was related to envelopment, where Reverb 1 was significantly more enveloping than the real recording.
The clarinet had two significant p-values, one regarding realism where the real
recording was rated significantly higher, and one regarding envelopment where Reverb 1 was rated significantly higher.
The snare drum had five significant p-values, with three of them regarding envelopment, one realism and one preference, the only significant p-value regarding preference. Three of these were comparisons between the same audio, which probably means that there was a big difference, making it easy to evaluate. One of the significant p-values for envelopment, unlike the other significant p-values for envelopment, showed that the real recording was rated higher than Reverb 2.
When comparing the different average values of the instruments there are bigger differences between the average snare drum values and smaller between the average piano values. This is probably why five of the ten significant p-values are located among the snare drum stimuli. The spread of the data, however, was more or less the same for all tests, regardless of instrument.
4.2 T-test attribute analysis
Tables 1, 2 and 3 show that it was easier to evaluate the attributes than preference.
Four out of ten significant p-values were connected to realism, five to envelopment and only one to preference. Two of the significant p-values connected to realism were comparisons involving the same reverb (Reverb 1) and the same microphone technique, and the average test scores lean in the same direction in the third comparison as well. Three of the significant p-values connected to envelopment were comparisons involving the same reverb (Reverb 1) and the same microphone technique. Three out of the four snare drum p-values connected to envelopment were significant, indicating that the difference between the different back channels was easy to evaluate. The only significant p-value regarding preference was a snare drum comparison, indicating that the real recording sounded better in that case.
Generalized, the average test scores for the attributes and preference of each back-channel version can be displayed as:
Table 4. Average test scores for the attributes and preference of each back-channel version when using the Fukada data (blue), OCTmod data (yellow) and the average of all the data (green). The highlighted p-values (marked *) are those below 0.05 and the p-values that are bold (marked **) are below 0.0125.
Data     Attribute     Version                      P-value
                       Real rec   Rev 1   Rev 2     Real vs Rev1   Real vs Rev2
Fukada   Realism       83.5       79      78.6      0.1689         0.0905
Fukada   Envelopment   64.9       81.1    76.3      9.34E-06**     0.0016**
Fukada   Preference    85.6       80.7    76.7      0.216          0.0176*
OCTmod   Realism       84.6       71.5    78.3      0.0006**       0.0342*
OCTmod   Envelopment   75.9       76.7    67.2      0.8317         0.0273*
OCTmod   Preference    83.7       81.7    79.8      0.6086         0.2675
All      Realism       84.1       75.2    78.5      0.0005**       0.0066**
All      Envelopment   70.4       78.9    71.7      0.0016**       0.6296
All      Preference    84.7       81.2    78.3      0.2135         0.0123**
This shows that the real recording sounded significantly more realistic than both of the artificial reverbs. Reverb 1, on the other hand, generally sounded more enveloping than the other two, but only significantly more enveloping when compared with the Fukada back channels. This should be compared with Reverb 2, which shows significant p-values for both microphone techniques, but differs in that it is rated as more enveloping than the Fukada recording and less enveloping than the OCTmod recording. The real recording is generally considered more preferable, but only significantly so when combining all preference data.
The lowered significance level is due to the data being compared more than once.
According to the Holm-Bonferroni method, the level of significance should be the normal significance level (in this case 5 %) divided by the number of times the data has been compared. In this case, where the data has been compared four times, the significance level should be 0.05/4 = 0.0125 (Statistics how to, 2018). Instead of using the same significance level for all p-values, a small correction to the significance level is made for each one. The smallest p-value has to be lower than 0.0125, but the second smallest p-value only has to be lower than 0.05/3 = 0.0167, and so on. If one of the p-values is higher than its significance level, that one and all following p-values are deemed too high. Each row is its own comparison, as shown in Table 4, where only the bold values are below their corrected significance levels.
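The step-down procedure described above can be sketched in a few lines. The function name and the example p-values are assumptions made for illustration; the p-values are not taken from Table 4.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the p-value is significant
    after Holm-Bonferroni correction, in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        threshold = alpha / (m - rank)   # 0.05/4, 0.05/3, 0.05/2, 0.05/1
        if p_values[idx] > threshold:
            break                        # this and all larger p-values fail
        significant[idx] = True
    return significant

# Hypothetical p-values for four comparisons.
example = [0.001, 0.04, 0.014, 0.03]
print(holm_bonferroni(example))
```

In this example 0.001 passes against 0.0125 and 0.014 passes against 0.0167, but 0.03 fails against 0.025, so it and the remaining larger p-value are both rejected even though 0.04 is below the uncorrected 0.05.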
4.3 Test subject comments analysis
Most test subjects found it easy to hear differences in the snare drum stimuli and their
answers in the test also partly show this. Even if the more interesting results are
located among the snare drum stimuli, there is no real overall consistency between the
results. The test subjects who thought they could better hear the differences between the other instruments might have done so, but the data shows that it wasn’t the overall case.
The test subjects’ individual interpretation of the different attributes might have changed the result, as one test subject didn’t know what reference to use when comparing the acoustics of the recording. A better description of the attributes might have given clearer results.
4.4 Artificial vs real recorded reverberation analysis
The biggest difference when comparing the different stimuli versions is the amount of direct sound. The away-facing microphones still capture some of the direct sound from the sound source, and this was not accounted for by the engineers who made the artificial reverbs. As they were never able to hear the real recording, there were also other properties that differed, for example reverberation time. The big hall in Studio Acusticum has a reverberation time (RT60) of 2.5 s (Studio Acusticum, 2018). The recorded files show that the room has an RT60 of 2.33 s, while the artificial reverbs range from 2.5 s to 3 s. Soulodre et al. (2003) pointed to reverberation time as one of the reasons that some artificial reverbs sounded more enveloping. This is, however, hardly the case here, since the artificial reverb with the shortest RT60 (2.5 s) was the one that had p-values below or close to 0.01 when compared to the real recording for every
envelopment rating. The structure within the artificial reverb is the probable cause of the increased envelopment. In the real recording there is a smoother spacing of the early reflections and a smoother reverberant sound, while the artificial reverbs are rougher in the early reflections and not as smooth. The real recording more or less fades from a strong initial peak, while the artificial reverbs swell and then fade. This is more in line with the study by Soulodre et al. (2003), who argued that the late energy (105 ms after the arrival of the direct sound) is the energy responsible for envelopment. The initial peak is probably direct sound captured by the away-facing microphones. The real recording has more treble than the artificial reverbs.
Figure 8. The OCTmod back channel (left, grey), Reverb 1b (left, blue), Reverb 2b (left, red), Fukada Tree back channel (right, grey), Reverb 1a (right, blue) and Reverb 2a (right, red). A hit on the rim of the snare drum was used to evaluate the reverberation.
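The study does not state how the RT60 values were read from the recorded files. One common way to estimate RT60 from an impulse-like recording, such as the rim hit, is Schroeder backward integration followed by a line fit to the decay curve. The sketch below assumes that approach and uses a synthetic, noise-free exponential decay instead of the actual recordings; the sample rate, fit range and function names are all assumptions.

```python
import math

def schroeder_decay_db(impulse_response):
    """Backward-integrated energy decay curve in dB, 0 dB at the start."""
    curve = [0.0] * len(impulse_response)
    energy = 0.0
    for n in range(len(impulse_response) - 1, -1, -1):
        energy += impulse_response[n] ** 2
        curve[n] = energy
    total = curve[0]
    return [10 * math.log10(e / total) for e in curve if e > 0]

def estimate_rt60(impulse_response, fs, upper_db=-5.0, lower_db=-35.0):
    """Fit a line to the decay between upper_db and lower_db and
    extrapolate the slope to a 60 dB drop."""
    decay = schroeder_decay_db(impulse_response)
    points = [(n / fs, level) for n, level in enumerate(decay)
              if lower_db <= level <= upper_db]
    mean_t = sum(t for t, _ in points) / len(points)
    mean_l = sum(l for _, l in points) / len(points)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in points)
             / sum((t - mean_t) ** 2 for t, _ in points))  # dB per second
    return -60.0 / slope

# Synthetic decay whose amplitude drops 60 dB in 2.33 s (assumed test signal).
fs = 8000
true_rt60 = 2.33
k = 3 * math.log(10) / true_rt60
ir = [math.exp(-k * n / fs) for n in range(int(fs * 2 * true_rt60))]

print(f"Estimated RT60: {estimate_rt60(ir, fs):.2f} s")
```

On real recordings the background noise limits how far down the decay can be followed, which is why the fit is restricted to a range such as -5 dB to -35 dB rather than the full 60 dB.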