
2007:260 CIV

MASTER'S THESIS

Speech Quality Investigation using PESQ in a simulated Climax system for ATM

Alexander Storm

Luleå University of Technology MSc Programmes in Engineering

Space Engineering

Department of Computer Science and Electrical Engineering Division of Signal Processing


Abstract

The demand for obtaining and maintaining a certain speech quality in systems and networks has grown in recent years. It is becoming more and more common to specify the quality instead of settling for "it sounds OK". The most accurate way to perform a quality test is to let users of the services express their opinion about the perceived quality. This subjective test method is unfortunately expensive and very time consuming. Lately, new objective methods have been developed to replace these subjective tests. These objective methods have high correlation to subjective tests, are fast and are relatively cheap.

In this work the performance of one of these objective methods was investigated, with the main focus on speech distorted by impairments commonly occurring in air traffic control radio transmissions. The software in which the objective method is implemented was evaluated in order to give a recommendation for further use. Some of the test cases were also judged by people in a subjective test and then compared with the objective results.

Keywords: Objective speech quality evaluation, PESQ, Mean Opinion Score (MOS), MOS-LQO, Climax.


Preface

This is the report of the final master's thesis that concludes my journey towards my MSc Degree in Space Engineering at Luleå University of Technology. The thesis was carried out at Saab Communication in Arboga and involved examination of objective methods for grading speech quality in communication links.

I would like to thank Saab Communication for this opportunity; a special gratitude goes to my supervisors Alf Nilsson, Ronnie Olsson and Lars Eugensson at Saab for their inspiration and knowledge. At the department of Computer Science and Electrical Engineering I would like to thank my examiners Magnus Lundberg Nordenvaad and James LeBlanc.

I would also like to express my gratitude to all the inspiring people that I have had the opportunity to meet and work with during my five years at LTU. You have made these years some of the best of my life and I wish you all a delightful future.

Finally, my gratitude to my family and friends for your support.

Arboga, October 2007.

Alexander Storm


Content

1 Introduction
2 What is Speech Quality and how is it measured?
2.1 Impairments
2.2 Quality measurements
2.3 PESQ – Perceptual Evaluation of Speech Quality
2.4 Intelligibility
2.5 Future measurement methods
3 Theory
3.1 CLIMAX
3.2 Impairments
3.3 Objective measurements
3.4 PESQ-verification
3.5 Subjective measurements
4 Methods
4.1 PESQ-verification
4.2 Objective quality measurements
4.3 Subjective measurements
5 Result
5.1 PESQ-verification
5.2 Objective measurements
5.3 Subjective measurements
6 Conclusion and Discussion
6.1 PESQ-verification
6.2 Objective measurements
6.3 Subjective measurements
6.4 Error sources
6.5 The GL-VQT Software®
6.6 Additional measurements
References
Appendix 1 Glossary
Appendix 2 ITU-T P.862, Amendment 2. Conformance test 2(b)
Appendix 3 Subjektivt test av talkvalitet (Subjective test of speech quality)
Appendix 4 Results for case 9 of the objective measurements
Appendix 5 The MOS-LQS result of the subjective measurement


1 Introduction

The European organization for civil aviation equipment (Eurocae1) is developing a technical specification for a communication system using Voice over IP (VoIP) for Air Traffic Management (ATM). The specification is planned to be completed by the end of 2007 and it is expected to contain a speech quality recommendation for ATM according to the International Telecommunication Union (ITU) MOS-scale. To measure and verify the quality it is proposed that the objective PESQ-algorithm should be used. To get a feeling for what quality requirements are reasonable, some cases of speech impairments typical of ATM were investigated and tested in this work using the PESQ-algorithm. For comparison and for evaluation of the software, some of the test cases were also judged by humans to obtain subjective opinions.

The purpose of this work was to investigate which objective methods for speech quality assessment are available on the market and how they perform. Another purpose was to simulate and investigate how different impairments influence the speech quality using the objective PESQ-algorithm. Evaluation and verification of the algorithm for these specific impairments were made using subjective tests and recent research. The software in which the algorithm is implemented was also investigated and evaluated. A small comparison between intelligibility and quality of speech was also performed.

1 Eurocae is an organization where European administrations, airlines and industry can discuss technical problems. The members of Eurocae are European administrations, aircraft manufacturers, equipment manufacturers and service providers and their objective is to work out technical specifications and recommendations for the electrical equipment in the air and on the ground [1].


2 What is Speech Quality and how is it measured?

With the introduction of IP services like Voice over IP (VoIP), methods to measure the performance of these services are required. VoIP is introduced to reduce expenses by using one type of network for both voice and data.

Over the years users have become accustomed to the quality that the "ordinary" Public Switched Telephone Network (PSTN) provides, to the extent that nowadays PSTN is the standard in quality and predictability. VoIP needs to live up to this standard to be widely accepted [2]. To cope with this challenge it is important to understand the many differences between VoIP and PSTN. Examples of the main differences are presented below:

PSTN was designed for time-sensitive delivery of voice traffic. It was constructed with non-compression analog-to-digital encoding techniques, always with the voice channel in mind to give the right amount of bandwidth and frequency response [2]. The IP networks were, on the other hand, designed for non-real-time applications like file transfers and e-mails.

In PSTN the call setup and management are provided by the core of the network, while VoIP networks have put this management into the endpoints such as personal computers and IP telephones [2]. Because of this the network core is not equally controlled and regulated, which can have a negative impact on the quality.

A telephone call in PSTN gets a dedicated channel with dedicated bandwidth. This is a guarantee for a certain quality, which is about the same for every call. VoIP, on the other hand, can neither guarantee nor predict voice quality. In VoIP the calls are divided into small frames or packets which can take different routes between the caller and the receiver. The available bandwidth cannot be guaranteed; it depends on the performance and load of the network [3].

In PSTN the codec ITU-T G.711 [4] is used. It is a linear waveform codec that almost reproduces the waveform at decoding. G.711 works at an 8kHz sampling rate (8000 samples/s) and each encoded segment is 8 bits long (0,125ms), which gives a data rate of 64kbit/s (or bps). The codecs G.729 (10ms segments, ~80bits/segment, ~8kbps) and G.723.1 (30ms segments, ~180bits/segment, ~6kbps) are non-linear because they only try to process the parts of the waveform that are important for perception, leading to a smaller bandwidth requirement. The drawbacks are longer segments and low bit-rate, which can lead to higher end-to-end delay.

VoIP introduces factors, like headers, that increase the bandwidth requirement. After encoding the code words are accumulated into frames, usually of 20ms. The frames are then placed in packets before transmission. For correct delivery, headers are added to the packets.

First the IP-, UDP- and RTP-protocols each add a header, 320 bits in total. The transmission medium layer, typically Ethernet, adds an additional header of 304 bits. This adds up to a total of 95,2kbps for the VoIP transmission, eq.(1), using G.711 with its 64kbps payload.

(1280 + 320 + 304) bits/frame × 50 frames/s = 95200 bit/s ≈ 95,2 kbps    (1)
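As a check of eq.(1), the per-stream bit rate can be recomputed from the frame size and the header overhead quoted above. The Python sketch below uses the same figures (20ms frames, 64kbps G.711 payload, 320 bits of IP/UDP/RTP headers and 304 bits of Ethernet overhead per packet); the function name is illustrative only.

def voip_bitrate_bps(payload_bps=64_000, frame_ms=20,
                     ip_udp_rtp_bits=320, link_layer_bits=304):
    """Recompute the per-stream VoIP bit rate of eq.(1)."""
    frames_per_second = 1000 / frame_ms              # 50 packets per second
    payload_bits = payload_bps * frame_ms / 1000     # 1280 bits of G.711 per packet
    return (payload_bits + ip_udp_rtp_bits + link_layer_bits) * frames_per_second

print(voip_bitrate_bps())   # 95200.0 bit/s, i.e. about 95,2 kbps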

2.1 Impairments

There are many factors influencing the quality of speech transmitted over a network. It is possible to measure most of these factors, but it is not assured that these measures give a correct estimation of the quality. Quality is highly subjective; it is the user who decides whether the quality is acceptable or not.

Voice quality can be described by three key parameters [3]:

end-to-end delay – the time it takes for the signal to travel from speaker to listener.

echo – the sound of the talker’s voice returning to the talker’s ear.

clarity – a voice signal’s fidelity, clearness, lack of distortion and intelligibility.

The first two are often considered the most important, but the relationship between the factors is complex, and if any one of the three becomes unacceptable the overall quality is unacceptable.

2.1.1 Delay

Delay only affects the conversational quality; it does not introduce any distortions into the signal. The delay in PSTN is dependent on the distance the signal travels: longer distance, higher delay. In VoIP the delay is dependent on the handling of the packets: switching, signal processing (encoding, compression), packet size, jitter buffers etc. [2].

Delay becomes an issue when it reaches about 250ms; between 300ms and 500ms conversation is difficult, and at delays over 550ms a normal conversation is impossible. In PSTN the end-to-end delay is usually under 10ms, but in VoIP networks the delay can reach 50-100ms due to the operations (packetization, compression etc.) of the codec [2].

2.1.2 Echo

Like delay, echo is mainly an issue for conversational quality. It does not affect the sound quality, even though a talker can perceive it as just as disturbing as other distortions. There are two different kinds of echo: acoustic and electrical echo. Acoustic echo can be heard when a portion of the speech comes out of the loudspeaker at the far end, is picked up by the microphone and sent back to the talker. Electrical echo is introduced where a 2-wire analog line is connected to a 4-wire system. These connections are made by hybrids, and if there is an impedance mismatch between the 2-wire and the 4-wire side the speech will leak back to the talker. If the echo returns less than 30ms after it was sent the talker will usually not perceive it as annoying; this also depends on the level of the echo. If the echo returns a little more than 50ms after the transmission the conversation will be affected and the talker will perceive the conversation as "hollow" or "cave-like". Echo is a bigger problem in VoIP than in PSTN. VoIP does not introduce more echo, but it introduces more delay, which makes the echo more noticeable and annoying [3].

2.1.3 Clarity

Clarity is the parameter that is the most subjective. Clarity is dependent on the amount of various distortions introduced to and by the network. There are several kinds of distortions that influence the clarity. Some examples [5]:

Encoding and decoding of the signal. Which codec is used and what are its features.

Time-clipping. For example front end clipping (FEC) introduced by a Voice Activity Detector (VAD).

Temporary signal loss caused by packet loss.

Jitter. Variance in delay of received packets.

Noise. Background noise for example.

Level Clipping. When an amplifier is driven beyond its voltage or current capacity.


Of these the following impairments are introduced in a PSTN network:

Analog filtering and attenuation in a telephone handset and on line transmissions.

Encoding via non-uniform PCM which introduces quantization distortion. This has minimal impact on the clarity and is accepted in ordinary PSTN telephony.

Bit errors due to channel noise.

Echo due to many hybrid wire junctions.

Along with VoIP new impairments have been introduced because of the new technology of transmitting speech signals:

Low bit-rate codecs are more often used to limit the need for bandwidth. These introduce nonlinear distortion, filtering and delay.

Front-end-clipping (FEC). To lower the bandwidth requirement even more, silence suppression is used together with VADs.

Packet losses, which introduce dropouts and time-clipping.

Packet jitter, variance in packet arrival times to the receiver. This is limited by jitter buffers.

Packet delay. Can cause packet loss and jitter.

In this work, only impairments affecting the clarity and the listening quality are investigated.

2.2 Quality measurements

2.2.1 Subjective Assessment

Using people to grade the quality of speech is the most accurate way to measure the quality, since the user of the services is human. People are also used because it is hard for instruments and machines to mimic how humans perceive speech quality. The drawbacks are that subjective tests are very expensive and time consuming. They are usually used in the development phase of systems and services and are not suitable for real-time monitoring. The International Telecommunication Union (ITU) has standardized a method for subjective speech quality testing. It is described in the standard ITU-T P.800 [6].

Subjective tests are performed in a carefully controlled environment with a large number of people. A large number is required since the subjective judgment is influenced by expectations, context/environment, physiology and mood. The large number of subjects increases the accuracy and decreases the influence of deviating results.

The participants listen to the transmitted/processed speech samples and grade the perceived quality according to the scale stated in P.800, see table 2.1.

Table 2.1. Opinion scale according to ITU-T P.800.

Score Quality of the speech

5 Excellent

4 Good

3 Fair

2 Poor

1 Bad

After the test, the individual scores are collected and the results are treated statistically to produce the desired information. The most common result is the mean value. This mean is reported as the quantity MOS (Mean Opinion Score); a MOS-score of 3,6-4,2 is widely accepted as being a good score for a network. Letters are added to state the kind of test. For a listening-only test the notation is either MOS-LQS (listening-quality-subjective) or just MOS. For a conversational test the notation is MOS-CQS or MOSc (table 2.2) [7].

Table 2.2. MOS notation according to P.800.1.

Listening-Quality Conversational-Quality

Subjective MOS-LQS MOS-CQS

Objective MOS-LQO MOS-CQO

Estimated MOS-LQE MOS-CQE

In every subjective test some references should be employed; usually Modulated Noise Reference Units (MNRUs) are used. MNRU is standardized in ITU-T P.810 [8], which describes how to distort speech samples in a controlled mathematical way. The amount of MNRU distortion is measured in dBQ, where the Q-value is the ratio in decibels between the signal and the added white noise; for subjective tests several Q-values are used. These extra reference speech samples are mixed with the original samples. After the test the reference samples have both a MOS-value and a Q-value. By plotting these it is possible to obtain a relationship between MOS and Q. Figure 2.1 [9] shows an example of a regression of this relationship; the curve usually has an S-shape. With this relationship it is possible to translate every MOS-score to a Q-score. The Q-scores tend to be more language and experiment independent, which makes it possible to compare scores from different experiments at different laboratories, something that is not possible using the MOS-scores only.

Figure 2.1. An example of a regression of the relationship between MOS- and Q-values.
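A regression like the one in figure 2.1 can be produced by fitting a logistic (S-shaped) curve to the MOS scores of the MNRU reference conditions. The sketch below only illustrates the idea; the Q/MOS pairs are invented for the example and it is not the regression procedure prescribed by any standard.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical MNRU reference conditions: Q-values in dBQ and their mean MOS-LQS.
q = np.array([5, 10, 15, 20, 25, 30, 35], dtype=float)
mos = np.array([1.3, 1.9, 2.6, 3.3, 3.9, 4.2, 4.4])

def s_curve(x, mos_min, mos_max, slope, q_mid):
    """Logistic model for the S-shaped MOS-vs-Q relationship."""
    return mos_min + (mos_max - mos_min) / (1 + np.exp(-slope * (x - q_mid)))

params, _ = curve_fit(s_curve, q, mos, p0=[1.0, 4.5, 0.3, 18.0])
print("fitted parameters:", np.round(params, 3))
print("MOS at Q = 22 dB:", round(float(s_curve(22.0, *params)), 2))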

Together with listening and conversational tests there are also talking quality tests [6]. Listening tests are by far the most widely used since they are the easiest to perform. The judgment is also stricter in listening tests, since the participants will be more focused and sensitive to small impairments that would not be caught during a conversational test. In ITU-T P.800 three different listening tests are described: the ACR-, the DCR- and the CCR-method.

ACR – Absolute Category Rating

Here the subjects are presented with sentences with a length of 6-10s. After each sentence the listeners rate the perceived quality according to table 2.1. The mean value of all ratings is the MOS-LQS. ACR is the most frequently used listening test. It works well at low Q-values (Q<20dB) but shows a reduction in sensitivity at higher Q-values (good quality circuits).

One reason for the low sensitivity can be that in ACR different sentences are often used for different systems.


DCR – Degradation Category Rating

DCR shows higher sensitivity at high Q-values (Q>20dB) than ACR. In DCR the listeners are presented with pairs of the same sentence where the first sample is of high quality and the second one is processed by the system. After each pair the listeners rate the degradation of the last sample compared to the first unprocessed sample according to a degradation opinion scale (table 2.3).

Afterwards the degradation MOS (DMOS) is calculated.

Table 2.3. Degradation opinion scale (ITU-T P.800).

Score The degradation is:

5 Inaudible

4 Audible but not annoying

3 Slightly annoying

2 Annoying

1 Very annoying

CCR-Comparison Category Rating

The CCR-method is similar to DCR, but the order in which the samples are presented to the listener is random. In half of the pairs the processed sample is the first sample, and for the other half of the test the second sample is the processed one. After each pair the listeners rate the quality of the second sample compared to the quality of the first sample. The rating is according to the comparison opinion scale in table 2.4.

Table 2.4. Comparison opinion scale.

Score  The quality of the 2nd compared to the 1st is:

3   Much better
2   Better
1   Slightly better
0   About the same
-1  Slightly worse
-2  Worse
-3  Much worse

This leads to a Comparison MOS (CMOS). An advantage of CCR over DCR is the possibility to assess processes that either have degraded or improved the speech quality.


2.2.2 Objective Assessment

Even though subjective tests give the most accurate measurement of speech quality, objective methods are still desired. Subjective tests are, as stated earlier, both expensive and time consuming. There have been many different techniques for objective assessment over the years and they can be divided into different groups [10]. First of all, the measurements can be either passive or active.

Passive measurements

The passive measurements are divided into planning and monitoring tools.

Planning tools

The E-model, ITU-T G.107

The E-model is a method for estimating the performance of networks. It is used as a transmission planning tool and it is described in ITU-T G.107 [15].

The foundation of the model is eq.(2).

R = Ro - Is - Id - Ie + A    (2)

Ro is the basic signal-to-noise ratio. Is represents all impairments which occur simultaneously with speech, for example loudness, quantization distortion and side tone level. Id is the “delay impairment factor” which includes all impairments due to delay and echo effects. Ie is the “equipment impairment factor” and represents all impairments caused by the equipment; low bit-rate codecs for example. Finally the “advantage factor” A represents the user’s expectation of quality. For example, using a mobile phone out in the woods, people can be more forgiving on quality issues because they are satisfied with just being able to establish a connection.

All this sums up to the Rating Factor, R, which ranges from 0 to 100 where 100 is the highest rating, i.e. best quality. The R-value can then be converted into a MOS-CQE (conversational-quality-estimated) or a MOS-LQE score for comparison with other objective measurements.
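A minimal sketch of eq.(2), together with the R-to-MOS conversion commonly quoted alongside G.107, is shown below. The default impairment values are placeholders chosen for illustration, not planning values taken from the standard.

def e_model_r(ro=93.2, i_s=1.4, i_d=8.0, i_e=11.0, a=0.0):
    """Rating factor of eq.(2): R = Ro - Is - Id - Ie + A (placeholder inputs)."""
    return ro - i_s - i_d - i_e + a

def r_to_mos(r):
    """R-to-MOS conversion commonly quoted together with G.107 (MOS-CQE)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

r = e_model_r()
print(f"R = {r:.1f}, estimated MOS = {r_to_mos(r):.2f}")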



Monitoring tools

In inactive or non-intrusive monitoring measurements the actual traffic is examined; there is no need for speech samples to be sent through the system. It is possible to monitor the system 24 hours a day and the monitoring does not affect or intrude on the system. The drawback is that the accuracy and the correlation to subjective tests are lower than for intrusive measurements.

ITU-T P.563

The ITU-T P.563 [16] describes a new standard for non-intrusive measurements. P.563 is a single-ended method for objective speech quality assessment. It is based on models of voice production and perception. It measures the effects of one-way distortions and noise on speech and delivers a MOS-score that can be mapped to a MOS-LQO score for example.

Active measurements

Active measurements are divided into electroacoustic or psychoacoustic measurements, and the basic idea is to transmit a waveform from one end of the system and receive it at the other end. The received (degraded) waveform is then compared to the original waveform, resulting in a quality score based on the difference between the two waveforms. The advantages of these methods are that they have the highest correlation with subjective measurements and that the original waveform can be constructed to match the objective of the measurements: different languages, specific distortions etc. The drawback is that the technique uses a specific speech sample which is transmitted, not live traffic. Among these tests, signal-to-noise ratio (SNR) and total harmonic distortion (THD) can be mentioned [10].

Electroacoustic measurements

Electroacoustic measurements were among the first objective techniques to measure the perceived quality of waveforms [10]. One example of the earlier methods is the 21- and 23-Tone Multifrequency Test, where a complex waveform containing several equally spaced frequencies is transmitted through the system. At the end the signal-to-distortion ratio (SDR) is calculated as a power ratio in decibels (dB); the ratio is an indication of the quality. This method was soon questioned since it gave very low SDR-values for some codecs even though the users did not perceive any degradation. The reason was that the codecs in question affected parts of the transmission that were not very important for human perception. Later (1989) proposals were made to change the multifrequency waveform to digital files containing recorded speech. The processing was basically the same, but the method was not successful due to poor correlation to subjective test results [10 p.120].

Psychoacoustic measurements

The problem with electroacoustic measurements was that they only consider and measure different characteristics of the transmitted signal; the actual content of what was being transmitted was not considered. The increasing use of communication services raised the need for weighted measurements that consider how humans perceive different kinds of impairments.

PSQM – Perceptual Speech Quality Measure (KPN Netherland)

PSQM was one of the earliest standardized methods to measure speech quality from a human perception point of view. PSQM was standardized in 1996 through the ITU-T Recommendation P.861 [11]. The purpose was to objectively measure the quality of telephone narrow-band (300-3400Hz) speech signals transmitted through different codecs under different controlled conditions. PSQM measured the perceptual distance between the input and the output signal. The result was a score from 0 to infinity, where 0 corresponded to a perfect match. The objective was to map this score onto a MOS-score, but because the results varied with the language used there was no good mapping function, resulting in low correlation to subjective scores [11]. Another reason for the low correlation was the weak time alignment function. The PSQM-algorithm was developed further in 1997 to cope with these limitations. The new method, which was included in P.861, was called PSQM+ and it solved problems like how to judge and handle severe distortions and time clipping.

PAMS – Perceptual Analysis Measurement System (British Telecom)

The PAMS-algorithm is based on a different signal processing technique than PSQM. Both compare a source signal with the transmitted version of it, but PAMS gives a score between 0 and 5, which corresponds to the same scale as subjective MOS testing. PAMS calculates and analyses the Error Surface to get the score. The error surface is the difference between the Sensation Surfaces of the output and input speech samples. The score is then the average of the error surface at different frequencies. This process will be described in the next section.


Both PSQM+ and PAMS showed unsatisfactory correlation with subjective tests for a couple of test cases. The solution was the combination of the perceptual model of PSQM99 (an extension of PSQM+) and the powerful time alignment function of PAMS. The new algorithm was called Perceptual Evaluation of Speech Quality (PESQ) and became the new standard ITU-T P.862 [12] in February 2001. With the introduction of P.862 the PSQM standard, P.861, was withdrawn. As for the earlier methods, PESQ is intended for narrow-band telephone signals. Since this work focuses on the performance of the PESQ-algorithm, it is given a more elaborate explanation.

2.3 PESQ – Perceptual Evaluation of Speech Quality

Figure 2.2. The basic functionality of the PESQ-method.

Figure 2.2 shows the basic block representation of the PESQ test procedure. A speech sample is first inserted into the system under test and then collected at the output of the system. The collected sample is then compared to the original speech sample in the PESQ-algorithm resulting in a PESQ Raw-score which is mapped to get the highest correlation to subjective MOS-scores. The resolution of the inserted speech sample file should be 16-bit and the sample rate should be 8kHz (PESQ is also validated for a sample rate of 16kHz).

Figure 2.3. Block representation of the PESQ algorithm.


The PESQ-algorithm is illustrated in figure 2.3 [13]. The first step of the processing is to compensate for any gain or attenuation of the system under test. The signals are aligned to the same constant power level in the level alignment block; this level is the same as the normal listening level used in subjective tests. In the input filter the algorithm models and compensates for the filtering that takes place in the handset of the telephone in a listening test; it is assumed that the handset's frequency response follows the characteristics of an IRS (Intermediate Reference System) receiver. Since the exact filtering is hard to characterize, PESQ is rather insensitive to the filtering of the handset.

To enable comparison between the two signals, time alignment is required. The degraded signal is often delayed, sometimes with variable delays, and PESQ uses the technique from PAMS to cope with this problem. The time alignment process is divided into two main stages: an envelope-based crude delay estimation and a histogram-based fine time alignment. The envelope-based approach starts by calculating the envelopes of the whole length of the degraded and reference signal respectively. This is achieved with the help of a voice activity detector (VAD). These envelopes are then cross-correlated by frames in order to find the delay. This procedure yields a resolution of about ±4ms. Subsequently the signals are divided up into utterances; an utterance is a continuous speech burst with pauses shorter than a certain length (200ms). These utterances are examined using the same envelope-based delay estimation. The first step in the histogram-based estimation is to divide the signals into frames of 64ms with 75% overlap. These frames are Hann-windowed and cross-correlated, and the index of the maximum of the cross-correlation gives the delay estimate for each frame. A weighted histogram of the delay estimates is then constructed, normalized and smoothed by convolution with a symmetric triangular kernel of a width of 1ms. The location of the maximum in the histogram is then combined with the previous delay estimation, yielding the final delay estimation for the utterance. The maximum is also divided by the sum of the histogram before convolution to give a confidence measure between 0 (no confidence) and 100 (full confidence). In many cases there can be delay changes within the utterance. To test for this, each utterance is split up into smaller parts on which the envelope- and histogram-based delay estimations are performed.

The splitting process is repeated at several points and the confidence is measured and compared to the confidence before the split. As long as the confidence is higher than before the splitting the process continues to find the right delay estimation.
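The envelope-based crude delay estimation can be illustrated with a few lines of code. The sketch below is not the P.862 reference implementation; it only shows the core idea of cross-correlating smoothed signal envelopes and reading the delay from the position of the correlation maximum.

import numpy as np

def envelope(x, fs, win_ms=4):
    """Smooth the magnitude of x with a short moving average."""
    win = int(fs * win_ms / 1000)
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def crude_delay(reference, degraded, fs):
    """Delay estimate from the cross-correlation maximum of the two envelopes."""
    e_ref = envelope(reference, fs)
    e_deg = envelope(degraded, fs)
    xcorr = np.correlate(e_deg - e_deg.mean(), e_ref - e_ref.mean(), mode="full")
    lag = np.argmax(xcorr) - (len(e_ref) - 1)   # positive lag: degraded arrives late
    return lag / fs

# Toy check: a gated tone stands in for a speech burst, delayed by 25 ms.
fs = 8000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t) * ((t > 0.3) & (t < 0.7))
delayed = np.roll(speech, int(0.025 * fs))
print(f"estimated delay: {crude_delay(speech, delayed, fs) * 1000:.1f} ms")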


The auditory transform is a psychoacoustic model that mimics the properties of human hearing. In this the signals are mapped into an internal representation in the time-frequency domain by a short-term Fast Fourier Transform (FFT) with a Hann-window over 32ms frames. The result is components called cells, see figure 2.4. During the FFT the frequency scale is warped into a modified Bark scale, called the pitch power density, which reflects the human sensitivity at lower frequencies. This Bark spectrum is then mapped to a loudness scale (Sone) to obtain the perceived loudness in each time-frequency cell. During this mapping equalisation is made to compensate for filtering in the tested system and for time varying gain [12]. The achieved representation is called the Sensation Surface.

Figure 2.4. The time-frequency cells.

In the Disturbance Processing block the sensation surface of the degraded signal is subtracted from the sensation surface of the reference signal, resulting in the Error Surface, containing the difference in loudness for every cell. An example of an error surface is shown in fig. 4.2. Two different disturbance parameters are calculated: the absolute (symmetric) disturbance and the additive (asymmetric) disturbance. The absolute disturbance is a measure of the absolute audible error and it is obtained by examining the error surface. If the difference in the error surface is positive, components like noise have been added; if the difference is negative, parts of the signal have been lost, due to coding distortion for example. For each cell the minimum of the original and degraded loudness is computed and divided by 4. This gives a threshold which is subtracted from the absolute loudness difference; values that are less than zero after this subtraction are set to zero. This is called masking: the influence of small distortions that are inaudible in the presence of loud signals is neglected. The additive disturbance is a measure of the audible errors that are significantly louder than the reference. It is calculated for each cell by multiplying the absolute disturbance with an asymmetry factor. This asymmetry factor is the ratio of the degraded and original pitch power densities raised to the power of 1,2. Factors that are less than 3 are set to zero and those over 12 are clipped at that value, so that only those cells where the degraded pitch power density exceeds the reference pitch power density remain, i.e. additive disturbance is computed for positive disturbances only. The two disturbance parameters are aggregated along the frequency axis, resulting in two frame disturbances. If these frame disturbances are above a threshold of 45 they are identified as bad intervals. The delay is then recalculated for these intervals and they are once again cross-correlated in the time alignment block. If this correlation is below a threshold it is concluded that the interval is matching noise against noise and the interval is no longer considered bad. For a correlation above the threshold a new frame disturbance is calculated and replaces the original disturbance if it is smaller.

In the Cognitive Model the frame disturbance values and the asymmetrical frame disturbance value are aggregated over intervals of 20 frames. These summed values are then aggregated over the entire active interval of the speech signal.

Finally the PESQ score is a linear combination of the average disturbance value and the average asymmetrical disturbance value and ranges from -0,5 to 4,5, eq.(3).

PESQMOS = 4,5 - 0,1·dSYM - 0,0309·dASYM    (3)

dSYM is the average disturbance value and dASYM is the average asymmetrical disturbance value.

This PESQ Raw-score shows poor correlation with MOS-LQS in some cases. To obtain higher correlation the PESQ Raw-score is usually mapped into the MOS-LQO (MOS-Listening Quality Objective) score (ITU-T P.862.1, 11/2003 [14]). The mapping function, shown in eq.(4) and in figure 2.5, gives a score from 1,02 to 4,55, which corresponds to the P.800 MOS-LQS, see table 2.1. The maximum value 4,5 for the PESQ-score was chosen because it is the same as for a clear and undistorted condition in a typical ACR-LQ test.

y = 0,999 + (4,999 - 0,999) / (1 + e^(-1,4945·x + 4,6607))    (4)

x represents the PESQ Raw-score and y the MOS-LQO score.
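Eq.(4) is straightforward to implement. The sketch below evaluates the P.862.1 mapping for a few raw scores; the endpoint values agree with the range 1,02 to 4,55 quoted above.

import math

def raw_to_mos_lqo(x):
    """Eq.(4): map a PESQ Raw-score x to MOS-LQO (P.862.1)."""
    return 0.999 + (4.999 - 0.999) / (1 + math.exp(-1.4945 * x + 4.6607))

for raw in (-0.5, 1.0, 2.5, 4.0, 4.5):
    print(f"raw score {raw:+.1f}  ->  MOS-LQO {raw_to_mos_lqo(raw):.2f}")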


Figure 2.5. The MOS-LQO mapping function (x-axis: P.862 PESQ Raw-score, y-axis: mapped MOS-LQO).

The produced MOS-LQO score estimates the listening quality only; it does not take into account impairments that influence the conversational quality (MOS-CQO), like delay, jitter, echo, sidetone and the level of the incoming speech.

Table 2.5. Comparison between some objective methods. The average and worst-case correlation coefficients for 38 subjective tests are shown.

No. of tests  Type             Corr.coeff.   PESQ   PAMS   PSQM   PSQM+
19            Mobile network   average       0,962  0,954  0,924  0,935
                               worst-case    0,905  0,895  0,843  0,859
9             Fixed network    average       0,942  0,936  0,881  0,897
                               worst-case    0,902  0,805  0,657  0,652
10            VoIP/multi-type  average       0,918  0,916  0,674  0,726
                               worst-case    0,810  0,758  0,260  0,469

Table 2.5 [13] shows a comparison between PESQ, PAMS, PSQM and PSQM+.

The table shows the correlation coefficients for the different algorithms compared with 38 subjective tests. The conclusion is that PESQ has the highest correlation in both average and worst-case. PESQ shows high accuracy for a wide range of conditions. For some conditions PAMS is close but it is less accurate for some other conditions. PSQM and PSQM+ show lower correlation in conditions including VoIP, packet loss etc.


2.4 Intelligibility

In communications, quality is closely related to intelligibility (the degree to which speech can be understood); high quality usually means high intelligibility. However, it is important to distinguish between the two; in many cases intelligibility is crucial while high quality is a desirable bonus [17]. Even though they correlate well in many cases, the relationship is much less clear in other cases. For example, a small quality drop can have a big influence on the intelligibility. On the other hand, even at low quality scores it might be possible to perceive and understand the transmitted information without great effort. It is also possible to improve the quality while decreasing the intelligibility and vice versa. An example is the use of noise suppression schemes to lower the background noise and improve the perceived quality; these systems tend to decrease the intelligibility [18]. The PESQ-algorithm was not developed to assess speech intelligibility. However, since there is a relation between quality and intelligibility, it might be possible to extend the PESQ-algorithm to correlate well with subjective intelligibility tests. Research is being done to investigate this relation and how PESQ performs in intelligibility tests [18], [19] and [20].

2.4.1 Subjective measurements

Just like quality, intelligibility is a subjective judgment indicating how well a human listener can decode speech information [17]. It is measured using statistical methods where trained talkers speak standardized word lists through the system under evaluation. The words are received at the far end and trained listeners try to recognize which words have been spoken.

There are a number of different standardized word lists to use; one is the Modified Rhyme Test (MRT) [21], [22]. It consists of 50 six-word lists of rhyming words. The whole list is presented to the listener and the talker pronounces one of the six words in each list. The listener marks the word he thinks was spoken. After the test has been done by at least five listeners, the results are collected and treated statistically to obtain the desired information.

Table 2.6 shows the first five rows of six rhyming words in the MRT.

Table 2.6. The first five rows of words in the MRT.

went  sent  bent  dent  tent  rent
hold  cold  told  fold  sold  gold
pat   pad   pan   path  pack  pass
lane  lay   late  lake  lace  lame
kit   bit   fit   hit   wit   sit


A similar method is the Diagnostic Rhyme Test (DRT). It consists of 96 rhyming pairs of words which are constructed from a consonant-vowel-consonant sound sequence. Examples of the word pairs are presented in table 2.7. The words only differ in the initial consonant and they are chosen so that the result can be interpreted to show which kinds of consonants are hard to recognize, and thereby pinpoint what needs to be altered in the system to obtain correct intelligibility. Consonants are chosen because they are more important for intelligibility than vowels [10]. They are also more sensitive to additive impairments like noise, tones etc., as they contain 20 times less average power than vowels. Since consonants are shorter in duration, 10-100ms, compared to vowels, 10-300ms, they are also more sensitive to losses and additive pulses.

Table 2.7. Examples of word pairs in the DRT, grouped by the specific speech feature they test.

Voicing:      veal-feel, bean-peen, gin-chin, dint-tint, zoo-sue
Nasality:     meat-beat, need-deed, mitt-bit, nip-dip, moot-boot
Sustenation:  vee-bee, sheet-cheat, vill-bill, thick-tick, foo-pooh
Sibilation:   zee-thee, cheep-keep, jilt-gilt, sing-thing, juice-goose
Graveness:    weed-reed, peak-teak, bid-did, fin-thin, moon-noon
Compactness:  yield-wield, key-tea, hit-fit, gill-dill, coop-poop

The subjective intelligibility tests result in a percentage, 0-100%, representing the proportion of words that were recognized correctly. These results are more straightforward to interpret than the corresponding subjective MOS (1-5). The MOS-score reflects more impressions than just intelligibility and the scores can vary quite a lot among different listeners [17].

These subjective tests do not always reflect reality. In normal life speech is made up of sentences, which increases the intelligibility because of the flow of words. The MRT and the DRT consist of random words, and even though the words are equally distorted, real-life sentences are perceived as having higher overall intelligibility.

2.4.2 Objective measurements

There are a couple of indices for objective speech intelligibility. The two most fundamental are the Speech Transmission Index (STI) [23] and the Speech Intelligibility Index (SII) [24]. The STI gives a number between 0 and 1 where 1 represents good intelligibility and low influence of acoustical system properties and/or background noise (compare with subjective tests, 0-100%).

The STI is based on the assumption that speech can be described as a fundamental waveform which is modulated by low-frequency signals [23].

The STI-score is calculated from the Modulation Transfer Function (MTF) of the system (figure 2.6). The MTF is the reduction in the modulation index of the signal at the transmitter, m(1), and at the receiver, m(2), eq.(5).

Figure 2.6. The STI-method.

MTF = m(2) / m(1)    (5)
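The core of eq.(5) is simply a ratio per modulation frequency. The toy example below uses hypothetical modulation indices and is not the full STI procedure of [23], which weights and combines many modulation and octave bands.

# Toy illustration of eq.(5): the MTF is the ratio of the modulation index at the
# receiver, m(2), to that at the transmitter, m(1). The values are hypothetical.
m1 = {0.63: 1.00, 1.25: 1.00, 2.5: 1.00}   # modulation indices at the transmitter
m2 = {0.63: 0.85, 1.25: 0.70, 2.5: 0.45}   # modulation indices at the receiver

mtf = {f: m2[f] / m1[f] for f in m1}
for f, value in mtf.items():
    print(f"modulation frequency {f} Hz: MTF = {value:.2f}")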

The SII-method is described in [24]. It is a development from the STI-method and it works in a similar way and produces scores between 0 and 1.

Correlation

The SII-method correlates well with subjective tests. An arising problem is that this objective method, along with others, is limited to linear systems. Testing modern applications such as low bit-rate coding does not produce well-correlated scores. Research is being done to find new objective methods that can deal with these non-linear systems [25].



2.5 Future measurement methods

The area of objective speech quality measurement is expanding fast. It is getting more and more common that there are demands regarding speech quality in specifications involving communication solutions. A couple of years ago this was not the case; measurements to obtain a quality score had to be made subjectively. This was far too expensive and time consuming to be used in an everyday manner. Today the objective tools have become accurate, fast and cheap enough for extensive usage. Subjective tests are still more accurate, but given the benefits of objective measurements they will continue to expand into new areas of use.

The research of today is struggling with the following tasks:

Higher correlation for non-intrusive measurements, like P.563, with subjective tests. Today intrusive measurements, like P.862, give more accurate results of the speech quality. A new ITU-T standard is under development with the working name P.VTQ. It is a tool for predicting the impact of IP-network impairments on the quality and for monitoring the transmission quality. It uses metrics from the RTCP-XR (RTP Control Protocol-Extended Report) to calculate the quality and it gives a MOS-score on the ACR Listening Quality Scale [26].

Combine quality and intelligibility measurements. Extend the PESQ- algorithm to include intelligibility measurements and give a common score for both quality and intelligibility.

Make intelligibility measurements work in VoIP applications. The objective STI-method is inaccurate in non-linear and time-variant packetized networks.

Modify and extend standards like ITU-T P.862 and ITU-T G.107 so that they have the same accuracy for VoIP systems as in "old" PSTN networks. ITU-T P.OLQA is a new standard under development. It will be the "Universal" model for objective prediction of listening quality. It will include not only speech but also other new 3G-applications [26].

Develop tools for predicting and monitoring conversational quality from mouth to ear. This includes both the electrical connection and the acoustical part. ITU-T P.CQO is a new standard that is being developed to deal with this task [26].


3 Theory

The main objective for this work is to investigate how different impairments degrade the quality of speech in ATM (Air Traffic Management) radio. The ATM system under consideration is the CLIMAX system.

3.1 CLIMAX

Climax, or the offset carrier system, is an Air Traffic Control (ATC) communications system working in the VHF-band (30-300MHz). The construction of Climax started in the United Kingdom in the 1960s and it is now widespread in Europe. Sweden does not use the Climax system, but it becomes an issue when flying to countries where the system is used.

This multi-carrier system is intended for ground-to-air communication and is based on the idea of having 2-5 transmitters transmitting on the same frequency with a slight offset. Climax offers greater ground coverage, higher redundancy and better coverage at low altitude and in harsh environments [27].

Climax is limited to a 25kHz channel-spaced environment. This can cause problems since the 8.33kHz environment is spreading in Europe because of the need for more available frequencies; an 8.33kHz receiver does not have enough bandwidth to operate correctly in a Climax environment.

To prevent audible beats (homodynes) caused by frequency differences, the multiple carriers are separated in frequency according to table 3.1 [28].

Table 3.1. Frequency arrangement for Climax channels (fc is the assigned channel frequency).

No. of Climax legs   Leg 1 Tx freq.   Leg 2 Tx freq.   Leg 3 Tx freq.   Leg 4 Tx freq.   Leg 5 Tx freq.
2                    fc+5kHz          fc-5kHz          -                -                -
3                    fc+7.5kHz        fc               fc-7.5kHz        -                -
4                    fc+7.5kHz        fc-7.5kHz        fc+2.5kHz        fc-2.5kHz        -
5                    fc-2.5kHz        fc-7.5kHz        fc               fc+2.5kHz        fc+7.5kHz


3.1.1 Operations

The Air Traffic Control Centre (ATCC) transmits the audio signal to all the transmitters and the airplane receives the signal from all the transmitters within coverage (figure 3.1). The pilot will then hear all the incoming transmissions simultaneously.

Figure 3.1. The basic structure of Climax. The airplane receives the signal from three antennas.

When the pilot receives multiple transmissions there is a great risk that the transmissions are mutually delayed due to different transmission paths. It is crucial that this delay does not reduce the quality and intelligibility of the transmitted speech. The European Organisation for Civil Aviation Equipment (EUROCAE) has proposed that this delay difference may vary between 0 and 10ms for Climax. For differences over 10ms, echo effects will start to become disturbing and annoying [29].

For air-to-ground communication it works differently. The transceiver on the airplane transmits at the centre frequency and the transmission is received at each aerial within range. The ATCC then selects the aerial with the best performance by Best Signal Selection (BSS), so that only one transmission reaches the air traffic controller (figure 3.2) [27].


Figure 3.2. The airplane transmits to the ATCC; BSS is used for better performance.

3.2 Impairments

For the investigation five main impairments are examined in this work:

I. The Climax case (delay).

II. Speech with added noise.

III. Speech with an added tone.

IV. Packet (frame) losses.

V. Speech with added noise pulses.

3.2.1 Case I

Here the unique feature of the Climax system was investigated. It was simulated that a pilot receives two transmissions with the same speech from two different transmitters. One of the speech samples was delayed to simulate the echo effect that the pilot will perceive; longer delays give a more disturbing echo. How disturbing the echo gets also depends on the propagation loss, which may be different for the two paths, resulting in different levels of the two received signals. Only simulations were investigated; no real radio transmissions were made. Most of the delay originates in the equipment used for the transmission; the propagation time in air is negligible. The propagation loss, on the other hand, is mostly dependent on the transmission path in the air. Examples of measured delays are shown in table 3.2 [30].


Table 3.2. Examples of measured delays from Sundsvall ATCC.

Station      Round-trip (ms)   One-way (ms)   Delay difference (ms)
Arvidsjaur   13,4              6,70           0,0
Gällivare    16,1              8,05           1,4
Måttsund     18,1              9,05           2,4
Storuman     18,7              9,35           2,7

The measurements in table 3.2 were made at the Sundsvall ATCC in Sweden. The one-way latency is obtained by assuming that the round-trip latency is twice the one-way latency. The delay difference is the delay relative to the shortest one-way latency. It should be noted that the stations operated at 125,60MHz and that these measurements were performed on TDM-connections, not VoIP.
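The Case I simulation can be sketched as the sum of a direct copy and a delayed, attenuated copy of the same signal. The delay and level offset below are example values within the ranges discussed; the function is an illustration, not the exact test setup used in this work.

import numpy as np

def climax_two_path(speech, fs, delay_ms=5.0, level_db=-3.0):
    """Sum of the direct signal and a delayed, attenuated copy (Case I)."""
    delay_samples = int(round(delay_ms * fs / 1000))
    delayed = np.concatenate([np.zeros(delay_samples), speech])[:len(speech)]
    gain = 10 ** (level_db / 20)        # level offset from different propagation loss
    return speech + gain * delayed

# Example: an 8 ms delay difference and a 6 dB level difference between the two legs.
fs = 8000
t = np.arange(2 * fs) / fs
clean = 0.3 * np.sin(2 * np.pi * 440 * t)   # stand-in for a speech file
received = climax_two_path(clean, fs, delay_ms=8.0, level_db=-6.0)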

3.2.2 Case II and III

For the cases with added noise or tones, the Signal-to-Noise Ratio (SNR) was the measure that was varied. The SNR is a measure of the level of the desired signal compared to the level of the background noise. The SNR is measured in decibels (dB) and is calculated as in eq.(6):

SNR [dB] = 10·log10(Psignal / Pnoise) = Psignal [dB] - Pnoise [dB]    (6)

For the cases investigated in this work, Psignal is the average RMS power of the entire clean signal and Pnoise is the corresponding power of the noise/tone.

Where noise was added, the influence of both white and pink noise was investigated. White noise is characterized by containing all frequencies with the same mean energy; its power is evenly distributed over all frequencies. Pink noise emphasizes the lower part of the frequency spectrum: it distributes its energy evenly over all octaves, that is, the power density decreases by 3dB/octave towards higher frequencies. This makes it, for example, more pleasant to listen to.
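For Case II, the noise must be scaled so that it is added at a chosen SNR according to eq.(6). A sketch for white noise is shown below; pink noise would additionally require a -3dB/octave shaping filter. The sine signal only stands in for a speech sample.

import numpy as np

def add_noise_at_snr(speech, snr_db, seed=0):
    """Add white noise to `speech` at the given SNR (eq. 6)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    p_signal = np.mean(speech ** 2)     # average power of the clean signal
    p_noise = np.mean(noise ** 2)       # average power of the unscaled noise
    gain = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: noise added at 15 dB SNR.
fs = 8000
t = np.arange(fs) / fs
speech = 0.3 * np.sin(2 * np.pi * 300 * t)
noisy = add_noise_at_snr(speech, snr_db=15)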



3.2.3 Case IV

In a digital voice transmission the speech is divided into packets and frames, usually containing 20ms of speech. Depending on the performance and the load of the network, some of these packets can be lost during transmission. Losses can occur if the network is congested, i.e. components receive too many packets, which makes their buffers overflow and causes packets to be discarded. Congestion can also lead to packet rerouting, which can result in packets arriving too late at the jitter buffers, again leading to packet discarding. Individual packets can also be discarded by different applications because they are damaged by bit errors due to circuit noise or equipment malfunction [2], [3].

The effect of a packet loss on the quality depends on many factors. First, what is the content of the lost packet? Of course, lost packets containing speech affect quality more than lost packets containing silence. What kind of speech the packets contain is also important: whether it contains vowel sounds or consonant sounds, whether they occur at the beginning or the end of a syllable, or whether the whole syllable is lost. The time when the packet is lost is also important, especially when dealing with bursts of lost packets [31]. For example, bursts towards the end of a telephone call are subjectively perceived as more negative regarding quality than bursts occurring at the beginning of the call.

Another factor that influences the effect of packet loss is which codec is being used. Waveform codecs like G.711 encode the whole waveform, with no compression and a high bit-rate. Packet loss with this codec affects the perceived quality much less than with perceptual codecs like G.729, G.723 and G.721, which encode only the relevant part of the voice signal.

The PESQ-algorithm has earlier been tested and verified for packet losses with a normal distribution. Studies have also been made to investigate how the PESQ measures the impact of specific packet losses [32].
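Case IV can be approximated by zeroing out 20ms frames of the speech signal with a given loss probability. The sketch below uses a simple independent-loss model, not a burst model, and replaces lost frames with silence rather than applying any packet loss concealment.

import numpy as np

def drop_frames(speech, fs, loss_prob=0.05, frame_ms=20, seed=0):
    """Zero out 20 ms frames with probability `loss_prob` (i.i.d. loss model)."""
    rng = np.random.default_rng(seed)
    frame_len = int(fs * frame_ms / 1000)
    out = np.array(speech, dtype=float, copy=True)
    for start in range(0, len(out), frame_len):
        if rng.random() < loss_prob:
            out[start:start + frame_len] = 0.0   # lost packet replaced by silence
    return out

# Example usage (clean_speech is a hypothetical numpy array of samples):
# degraded = drop_frames(clean_speech, fs=8000, loss_prob=0.10)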

3.2.4 Case V

The case with noise pulses occurs when an analog radio is disturbed by transmitters using frequency hopping. When the signals of the transmitters are mixed together, intermodulation products are created, and if these products coincide with the frequency of the radio, a noise pulse is perceived by the radio, see table 3.3. The length of the noise pulse is the time the frequency hopper remains at that specific frequency. An example is the Bluetooth® technology, which uses frequency hopping. It changes channels 1600 times per second, i.e. it remains 0,625ms on a channel [33].


Table 3.3. Examples of intermodulation products.

Intermodulation products (fA)

f1 + f2 = fA
f1 - f2 = fA
2f1 - f2 = fA
2f2 - f1 = fA
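Case V can be simulated by adding short noise bursts at random positions, for example of length 0,625ms to mimic a Bluetooth-style frequency hopper. The sketch below is only an illustration; the pulse level and count are arbitrary parameters.

import numpy as np

def add_noise_pulses(speech, fs, n_pulses=20, pulse_ms=0.625, level=0.2, seed=0):
    """Add `n_pulses` short white-noise bursts at random positions."""
    rng = np.random.default_rng(seed)
    pulse_len = max(1, int(round(fs * pulse_ms / 1000)))
    out = np.array(speech, dtype=float, copy=True)
    for _ in range(n_pulses):
        start = int(rng.integers(0, len(out) - pulse_len))
        out[start:start + pulse_len] += level * rng.standard_normal(pulse_len)
    return out

# Example usage (clean_speech is a hypothetical numpy array of samples):
# disturbed = add_noise_pulses(clean_speech, fs=8000, n_pulses=40)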

3.3 Objective measurements

As PESQ is the most accurate and most widely used objective speech quality tool, it was used for the investigation of the five cases. Several files were made to cover most of the realistic real-life cases. The files were examined using software in which the PESQ-algorithm is implemented. Each tested file resulted in a MOS-LQO score.

3.4 PESQ-verification

Before the testing of the five cases, the software itself was tested to make sure that it worked as expected. Speech files with impairments that are neglected by PESQ were examined and were expected to result in a maximum quality score.

To fully verify the algorithm a conformance test can be made according to ITU-T P.862, Amendment 2 [36]. This test contains three test cases for narrow-band operation, where the test scores of the enclosed files should not diverge from the scores of a reference implementation by more than a certain value. Test case number 2 of these three specified cases was performed; case 2 validates P.862 with variable delay.

3.5 Subjective measurements

To get a hint of whether PESQ judges the impaired files correctly, a subjective test was performed with some of the files from the objective measurements. The Absolute Category Rating method used delivers a MOS-LQS score, which was compared with the objective MOS-LQO score. The results should be treated very carefully though; to be able to compare the numeric values, the subjective test needs to be performed in a standardized way in a tightly controlled environment [6]. For example, no calibration with MNRUs [8] was performed; the test could therefore not be repeated somewhere else with accurate results. The test was only made to investigate how PESQ ranks the files compared to the subjects: does PESQ, for example, rank added noise versus an added tone differently than humans do?

References
