Institutionen för systemteknik
Department of Electrical Engineering
Master’s Thesis
Language Independent Speech Visualization
Thesis done in Automatic Control, Department of Electrical Engineering,
Linköping University by
Jan Braunisch
LiTH-ISY-EX--11/4501--SE
Linköping 2011
Supervisor: Mehmet Guldogan, ISY, Linköpings universitet
Examiner: Fredrik Gustafsson, ISY, Linköpings universitet
Date: 2011-09-01
Swedish title: Språkoberoende talvisualisering
Keywords: speech visualization, hearing aids, speech processing, speech feature extraction, windowing, pitch detection
Abstract
A speech visualization system is proposed that could be used by a deaf person for understanding speech. Several novel techniques are proposed, including:
• Minimizing spectral leakage in the Fourier transform by using a variable-length window.
• Making use of the fact that there is no spectral leakage in order to calculate how much of the energy of the speech signal is due to its periodic component vs. its nonperiodic component.
• Modelling the mouth and lips as a band-pass filter and estimating the central frequency and bandwidth of this filter in order to assign colours to unvoiced speech sounds.
Our tests show that the proposed system clearly shows the differences between most speech sounds, and that the visualizations for two persons pronouncing the same speech sound are usually very similar.
Contents
1 Introduction
1.1 Objectives and Contributions of this Work
1.2 Organization of the Thesis
2 Speech and Speech Processing
2.1 What is Speech
2.2 Different Kinds of Speech Sounds
2.2.1 Vowels
2.2.2 Consonants
2.3 Pitch and Loudness
2.4 The Source–Filter Model
2.5 Speech Processing Techniques
2.5.1 Windowed Discrete Fourier Transform
2.5.2 Linear Prediction
3 The Speech Visualization System
3.1 Sound Input
3.2 Adaptive Window
3.3 Extra Smoothing
3.4 Fourier Transform
3.5 dBA Weighting
3.6 Pitch Detection
3.7 Energy Analysis
3.8 Formant Extraction and Vowel Colour
3.9 Energy Distribution Analysis
3.9.1 Efficient Calculation of fc and Bw
3.9.2 Handling Background Noise
3.10 Display
3.10.1 Spectrogram
3.10.2 Energy and Vowel Colour
4 Results
5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
5.3 Acknowledgements
Chapter 1
Introduction
1.1 Objectives and Contributions of this Work
Deaf people face great difficulties communicating with other people. Not only can they not hear what other people say, but not hearing speech also makes it very difficult for them to learn to speak themselves. Sign languages have been invented, but most people don't know any sign language, and sign languages are difficult to use over the phone.
What is needed is a speech visualization system that can convert the spoken language into visual patterns that can be used to understand what is being said. If designed correctly, such a system would not only be useful for understanding what other people say, but also for learning how to speak oneself, since it would allow comparing the visualization of one’s own speech with that of others, thus permitting one to learn by imitation.
Very little seems to have been done previously on this subject. Some research groups have developed aids for teaching deaf persons to speak, but these have been too crude to be used for understanding speech. Wang et al. have published several papers about speech visualization [19], but they have not shown that their methods produce anything practically useful.
The only successful attempt at producing a speech visualization suitable for understanding speech seems to have been made by Watanabe et al. [17]. They used time-delay neural networks to recognize different consonants, and assign a pattern to each consonant. The vowels are coloured according to their formant frequencies (see Section 2.2.1). The consonant patterns and vowel colours are overlaid in a single visualization that also shows the pitch. The system has a few shortcomings, however: notably, it does not seem to differentiate between different fricatives, it is only designed to be used with the Japanese language, and it is too slow to run in real time.
The system we propose has the following benefits:
1. Language independence. Thus, it would be useful for deaf people who want to use more than one language, as well as for those who live among people speaking unfamiliar dialects.
2. Speaker independence.
3. Low system requirements. This makes it possible to put the system on cheap handheld computers that can be distributed to many people.
4. No reliance on proprietary systems. This makes it possible to distribute the system without paying any royalties.
Machine learning algorithms (for example neural networks) have not been used in this work. The main reason is that they require large amounts of training data. Usually, researchers use ready-made databases of training data for training their models, but most of those databases are designed for training speech-to-text systems, and would thus not be suitable for our purposes. Some databases, like the MOCHA database [18], contain simultaneous recordings of speech and articulatory parameters (tongue position, etc.). Getting the articulatory parameters from the speech signal could be useful for speech visualization, but this is a very difficult problem since it is a one-to-many mapping, meaning that, given a speech sound, there are usually many different articulatory parameters that could have given rise to this sound [16].
1.2 Organization of the Thesis
This thesis is organized as follows: Chapter 2 describes human speech and a few commonly used speech processing techniques. Chapter 3 explains the proposed speech visualization system. Chapter 4 shows the results obtained from the proposed system. Chapter 5 gives conclusions and suggests improvements to the system.
Chapter 2
Speech and Speech Processing
In this chapter, we will give an overview of how speech is produced and how different speech sounds can be classified. We will also introduce a few techniques for processing speech in computers.
2.1 What is Speech
Speech is sound produced by our speech organs, i.e. the vocal cords, oral cavity, tongue, lips, etc. When we want to analyze the speech sound that a person creates, we can, without losing any important information, consider only the one-dimensional function p(t) of air pressure with respect to time, at some position near the mouth of the speaker. It does not matter much which position we choose, since the position mostly affects the phase and amplitude of the function.
The speech sound has two main components: periodic and nonperiodic. The periodic part results from vibrations created when the air from the lungs gets pressed between the vocal cords, while the nonperiodic part usually results from letting air pass through narrow passages, or from blocking the air flow and releasing it suddenly.
In this chapter, we use spectrograms to show the frequency content of speech. These are a way to show energy as a function of time and frequency. The frequency varies along the vertical axis and the time varies along the horizontal axis, and each point of the spectrogram shows, by varying the brightness, the amount of energy at the corresponding time and frequency of the signal. Whether to let “brighter” mean “more energy” or “less energy” is arbitrary, and we have chosen the former.
2.2 Different Kinds of Speech Sounds
Every language has a number of phonemes which are combined to produce words. The phonemes of a language are the smallest distinctive speech sounds that distinguish one word from another. The study of the sounds in human speech is called phonetics; Ladefoged and Johnson [10] is an excellent introduction to this subject. We will illustrate the different kinds of speech sounds with examples from the Swedish language.
2.2.1 Vowels
Vowels are shaped when the regularly spaced pulses from the vocal cords pass through the oral cavity. Vowels are periodic sounds, and thus have a fundamental frequency and harmonics. Looking at the spectrum of a vowel, one will see that the harmonics are stronger around certain frequencies than elsewhere. These are the resonance frequencies of the vocal tract, and are called formants. The term formant was coined by Gunnar Fant in [5]. The formants differ between vowels, and are by far the most important distinguishing features of a vowel. They are written f1, f2, ..., sorted in increasing frequency; the first two or three formants are the most important ones for distinguishing between vowels.
Differences between persons in the size of the vocal tract result in differences in the formant frequencies. However, the ratios between formants (fn/fm, n ≠ m) tend to be comparable between different persons pronouncing the same vowel. This fact has been used by many researchers [15] [17]. Another useful fact about formants is that they are constant with respect to intonation, since a change in intonation only changes the fundamental frequency, not the resonance frequencies of the vocal tract.
The first four formants of the long vowels of the Swedish language are given in Table 2.1, and their spectrograms are shown in Figure 2.1.
Vowel  f1 (Hz)  f2 (Hz)  f3 (Hz)  f4 (Hz)
/a/     600      925     2540     3320
/o/     290      595     2330     3260
/u/     285     1640     2250     3250
/å/     390      690     2415     3160
/e/     345     2250     2850     3540
/i/     255     2190     3150     3730
/y/     260     2060     2675     3310
/ä/     505     1935     2540     3370
/ö/     380     1730     2290     3325

Table 2.1: Formant frequencies for the long Swedish vowels [6]. These are the average measurements of 24 male speakers.
Figure 2.1: Spectrograms of the long vowels /a/, /o/, /u/, /å/, /e/, /i/, /y/, /ä/ and /ö/.
2.2.2 Consonants
Consonants are speech sounds that are produced by completely or partially blocking the air being breathed out through the mouth. They are usually categorized by their place and manner of articulation, as well as by whether they are voiced or not.
The place of articulation refers to what part of the mouth we use to hinder the flow of air. For example, /f/ is pronounced between the lower lip and the upper teeth, which makes it labiodental. Figure 2.2 shows many places of articulation.
The manner of articulation refers to how we obstruct the air flow. For example /p/ is a plosive, which means that it is pronounced by closing the air passage, and then suddenly releasing the air.
Finally, a voiced consonant is one where the vocal cords vibrate.
Nasals
Nasals are voiced consonants pronounced with the mouth closed, so that the air only goes through the nose. The Swedish language has three nasals: /m/, /n/ and /ng/. As can be seen in the spectrograms in Figure 2.3, nasals are very similar to vowels, except that their harmonics are weaker.
Figure 2.2: Places of articulation. 1. Exo-labial, 2. Endo-labial, 3. Dental, 4. Alveolar, 5. Post-alveolar, 6. Pre-palatal, 7. Palatal, 8. Velar, 9. Uvular, 10. Pharyngeal, 11. Glottal, 12. Epiglottal, 13. Radical, 14. Postero-dorsal, 15. Antero-dorsal, 16. Laminal, 17. Apical, 18. Sub-apical. Image courtesy of “Ishwar”.
Plosives
Plosives are consonants that are formed by first blocking the air passage and then audibly releasing the air. In Swedish, there are two different kinds of plosives: voiced and aspirated.
In voiced plosives, like /b/, /d/ and /g/, the vocal cords start to vibrate at the same time as the air is released. In aspirated plosives, like /p/, /t/ and /k/, there is a big burst of air escaping through the mouth after the release.
/b/ and /p/ are pronounced using the lips, /d/ and /t/ are pronounced with the tip of the tongue against the teeth, while /g/ and /k/ are pronounced by pressing the back of the tongue against the soft palate (also called velum). Figure 2.4 shows a spectrogram of the plosives just mentioned.
Fricatives
Fricatives are produced by forcing the air through some narrow passage. They can be either voiceless, like /h/, /f/, /s/, /tj/ or /sj/, or voiced like /j/ and /v/. What differentiates different fricatives is their place of articulation. Figure 2.5 shows a spectrogram of these fricatives.
Figure 2.3: Spectrograms of the nasals /m/, /n/ and /ng/.
Figure 2.4: Spectrograms of the plosives /b/, /d/, /g/, /p/, /t/ and /k/.
Figure 2.5: Spectrogram of the fricatives /h/, /f/, /s/, /tj/, /sj/, /j/ and /v/.
/l/ and /r/
The Swedish language has only two consonants that don’t fall under any of the categories described above: /l/ and /r/. Figure 2.6 shows their spectrograms.
/l/ is a so-called lateral approximant. Lateral means that the air passes on both sides of the tongue, but is blocked by the tongue from passing through the middle of the mouth. Approximant means that there is very little obstruction of the air flow, so that it is similar to a vowel.
/r/ is a so-called trill, where the tongue is made to vibrate against the alveolar ridge (number 4 in Figure 2.2).
2.3 Pitch and Loudness
In speech, information is conveyed not only by the words we use, but also through pitch and loudness. For example, a rising pitch at the end of a sentence is used in most languages to convey that the sentence is a question, while pronouncing a word with elevated loudness and falling tone places stress on that word, see Figure 2.7. Some languages (for example all Chinese dialects) even use variations in pitch to distinguish between words.
Figure 2.6: Spectrogram of /l/ and /r/.
Figure 2.7: Pitch variation in speech (axes: time (s), frequency (kHz)).
2.4 The Source–Filter Model
In the source–filter model of speech production, speech is modelled as being produced in two independent steps: the source and the filter.
The source refers to where the sound originates. There are two different kinds of sound sources: the glottis and narrow passages. The sound produced by the glottis can usually be seen as a train of equally spaced impulses. Sometimes though, when the voice becomes “creaky”, the distances between the impulses are large and irregular, so that the impulses can be considered independent of each other.
Narrow passages are what produce the nonperiodic sounds in, for example, fricatives and plosives. For example, /sj/ is pronounced by pressing the back of the tongue against the soft palate. When air is pushed through a narrow passage, the turbulence gives rise to noise with a very broad spectrum that is usually modelled as white noise.
When the sound from the source passes through the vocal tract and the lips, its frequency content will change. The vocal tract and the lips can be modelled as a linear filter, attenuating some frequencies and enhancing others.
One important aspect of the source–filter model is that the source and the filter are independent and can be analyzed separately. For example, pitch detection is about finding the fundamental frequency and epoch detection is about finding the exact time of each impulse [1]. These analyze only the source. An example of a method of estimating the filter will be described briefly in Section 2.5.2.
2.5 Speech Processing Techniques
In this section, several fundamental signal processing techniques are introduced that will be needed to explain the proposed system.
2.5.1 Windowed Discrete Fourier Transform
When processing speech, we often want to analyze the frequency content of a signal and how it changes in time. One common way to do this is using the windowed discrete Fourier transform. This works by first windowing the sampled speech signal using a windowing function like Hann or Hamming, see Figure 2.8. Usually, to increase the temporal resolution, adjacent windows are made to overlap, see Figure 2.9.
Each window is then transformed into the frequency domain using a fast Fourier transform. What we get is the complex amplitudes of the sinusoids that make up the content of the window. Usually only the absolute values of these complex amplitudes are used, effectively discarding the information about the phase of the sinusoids. The reason why the phase can be ignored is that it is not essential for understanding speech. Note for example that the two waveforms in Figure 2.10 will sound identical to the ear.
hann(n) = (1/2)(1 − cos(2πn/(N − 1))), hamming(n) = 0.54 − 0.46 cos(2πn/(N − 1))

Figure 2.8: The Hann and Hamming windows.
Figure 2.9: Overlapping windows.
Figure 2.10: sin(2π·1000t) + sin(2π·3000t) vs. sin(2π·1000t) + sin(2π·3000t − 2π/3).
2.5.2 Linear Prediction
Linear prediction works by modelling the speech signal as an autoregressive process in order to derive an all-pole filter that estimates the filtering properties of the vocal tract [9].
The autoregressive model is given as

x̂(n) = Σ_{k=1}^{p} a_k x(n − k)

where x(n) is the speech signal and x̂(n) is an estimate of a sample based on previous samples. We will not go into how the coefficients (a_k)_{k=1}^{p} are estimated, but we mention that it can be done in O(p²) time. After having computed (a_k)_{k=1}^{p}, we get an all-pole filter estimating the vocal tract as 1/A(z), where A(z) = 1 + a_1 z^{−1} + ... + a_p z^{−p}.
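As an illustration, the coefficients can be estimated by least squares (the covariance method). This is only a sketch with names of our own choosing, not the method used in the system; the thesis does not specify the estimation method, and the O(p²) remark suggests something like Levinson–Durbin, which we omit for clarity:

```python
def lpc_covariance(x, p):
    """Estimate a_1..a_p in x_hat(n) = sum_k a_k * x(n - k) by least
    squares, solving the normal equations with naive Gaussian
    elimination (O(p^3); faster O(p^2) methods exist)."""
    N = len(x)
    # Normal equations A @ a = b with
    # A[i][j] = sum_n x(n-1-i) x(n-1-j), b[i] = sum_n x(n) x(n-1-i)
    A = [[sum(x[n - 1 - i] * x[n - 1 - j] for n in range(p, N))
          for j in range(p)] for i in range(p)]
    b = [sum(x[n] * x[n - 1 - i] for n in range(p, N)) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    a = [0.0] * p
    for r in range(p - 1, -1, -1):
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, p))) / A[r][r]
    return a

# A signal that exactly satisfies x(n) = 1.2 x(n-1) - 0.5 x(n-2)
# is recovered (almost) exactly, since the residual can be made zero:
x = [1.0, 1.0]
for n in range(2, 60):
    x.append(1.2 * x[-1] - 0.5 * x[-2])
a1, a2 = lpc_covariance(x, 2)
```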
Chapter 3
The Speech Visualization System
In this chapter, we describe in detail the proposed speech visualization system. A block diagram of the proposed system is presented in Figure 3.1. Sections 3.2 and 3.3 describe a system for dynamically adapting the window used in the windowed Fourier transform. Sections 3.6 and 3.7 describe how we extract the pitch and calculate how much of the total energy of the speech signal is due to the periodic part and the nonperiodic part of the signal, respectively. Section 3.8 describes how we colour vowels while Section 3.9 describes a novel technique of colouring unvoiced consonants. Finally, Section 3.10 describes how the visualization is drawn on the computer screen.
3.1 Sound Input
The sound is sampled from the microphone at 48000 samples per second, 16 bits per sample, mono. If the system had to run on a slow computer, the sample rate could be lowered to, for example, 24000 Hz without affecting the result very much. Fetching the sound samples from the microphone involves handling the window overlap. Say, for example, that the window length is set to 4000 samples and the overlap to 25% of the window length, i.e. 1000 samples. Then, to get the sound data for the next window, 1000 samples are taken from the end of the previous window and 3000 are taken from the microphone, giving 4000 samples that are sent on to the next step.
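The overlap handling can be sketched as follows; read_samples stands in for the real microphone input, and the names are ours:

```python
def next_window(prev_window, read_samples, win_len=4000, overlap=1000):
    """Build the next analysis window: reuse `overlap` samples from the
    end of the previous window and fetch the rest fresh."""
    fresh = read_samples(win_len - overlap)
    return prev_window[-overlap:] + fresh

# Example with a fake microphone that hands out a running counter:
counter = iter(range(10**6))
fake_mic = lambda n: [next(counter) for _ in range(n)]
w1 = fake_mic(4000)             # initial window: samples 0..3999
w2 = next_window(w1, fake_mic)  # 1000 reused + 3000 fresh samples
```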
3.2 Adaptive Window
When choosing a window function for the windowed discrete Fourier transform, there is always a tradeoff involved. Choosing a smoother windowing function will result in less spectral leakage, but it will also worsen the frequency resolution, and vice versa. To get both excellent frequency resolution and low spectral leakage at once, we have designed a variable-sized window that takes into account the shape of the waveform when determining where to place the edges of the window. More specifically, it tries to align the edges of the window so that it contains a whole number of periods, as illustrated in Figure 3.2. This method makes use of the fact that human speech never contains more than one fundamental frequency.

Figure 3.1: Block diagram of the proposed system: Sound Input (Section 3.1), Adaptive Window (Section 3.2), Extra Smoothing (Section 3.3), Fourier Transform (Section 3.4), dBA Weighting (Section 3.5), Pitch Extraction (Section 3.6), Energy Analysis (Section 3.7), Formant Extraction (Section 3.8), Energy Distribution Analysis (Section 3.9), Display (Section 3.10) with Spectrogram (Section 3.10.1) and Energy Plot (Section 3.10.2).
Figure 3.2: The adaptive windowing works by trying to align the left edge of the window so that the window contains a whole number of periods.
Say that our “raw” window (i.e. the window we get from the previous step) contains N samples, x1, x2, ..., xN, where x1 is the earliest sample. We want to select a subinterval of these samples such that this subinterval contains a whole number of periods. We always choose the right endpoint of this subinterval to be N. This implies that the subinterval will contain values that are as new as possible, and will thus minimize the delay of the system. It also implies that we have maximum freedom in selecting the left endpoint.
To determine a suitable left endpoint, we introduce a function to measure how well the left endpoint l fits with the right endpoint:

sqdiff(l) = Σ_{k=0}^{A} (x_{N−k} − x_{l−k})²

This is simply a measure of how different the series {x_n}_{n=l−A}^{l} is from the series {x_n}_{n=N−A}^{N}. This function is inspired by the square difference function often used in pitch detection [12] [3].
As our left endpoint we take the l ∈ [A + 1, N/2] that minimizes sqdiff(l). The left endpoint of this interval is chosen to ensure that sqdiff does not use any value before x1, and the right endpoint is chosen so that we never discard more than half of the raw window. Our experiments have shown that a suitable value for A is about N/10.
To sum up, after applying the adaptive windowing, we have x_l, ..., x_N, where l minimizes the sqdiff function under the constraint l ∈ [A + 1, N/2].
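A minimal sketch of the left-edge selection, using 0-based arrays (the text's x1, ..., xN becomes x[0..N−1], so the index bookkeeping differs slightly) and A = N/10; the function names are ours:

```python
import math

def sqdiff(x, l, A):
    """How different the A+1 samples ending at index l are from the
    A+1 samples ending at the last index."""
    last = len(x) - 1
    return sum((x[last - k] - x[l - k]) ** 2 for k in range(A + 1))

def adaptive_left_edge(x):
    """Pick the left edge l in [A+1, N/2] minimizing sqdiff, so that
    x[l:] contains (close to) a whole number of periods."""
    N = len(x)
    A = N // 10
    return min(range(A + 1, N // 2 + 1), key=lambda l: sqdiff(x, l, A))

# For a perfectly periodic signal (period 20), the chosen edge lines
# up with the period, making sqdiff essentially zero:
x = [math.sin(2 * math.pi * n / 20) for n in range(100)]
l = adaptive_left_edge(x)
```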
3.3 Extra Smoothing
Sometimes the adaptive windowing fails, usually because the sound signal is changing too quickly. To detect when it has failed, we measure the difference between x1 and xN compared to the total variation in {x_n}_{n=1}^{N}. That is, we check whether

|xN − x1| ≤ α (max_{n∈[1,N]} xn − min_{n∈[1,N]} xn)    (3.1)

for a suitable value of α. This test is illustrated in Figure 3.3. If this inequality does not hold, we consider the adaptive windowing to have failed. In this case, we need to apply some kind of traditional window function.
Figure 3.3: The discontinuity at the edges and the total amplitude variation within a window.
One option would be simply to apply a plain Hann window, but if |xN − x1| is only a little bigger than we desire, applying a plain Hann window would damage the frequency resolution more than necessary.
Instead, we use a window that is an interpolation of a rectangular window and a Hann window:

hann2(n, N, a) = a + (1 − a)(2/√3) hann(n, N),

where the constant 2/√3 is needed to make the energy gain of the window constant with respect to a. The parameter a controls the smoothness of the window, with a = 1 giving a pure rectangular window and a = 0 giving a pure Hann window. This function is plotted for a few values of a in Figure 3.4. We let

a = α (max_{n∈[1,N]} xn − min_{n∈[1,N]} xn) / |xN − x1|,

which is precisely enough to ensure that (3.1) holds after applying the window function to (x_n)_{n=1}^{N}.
Summing up, we check whether the adaptive windowing has failed by comparing the size of |xN − x1| to the total variation in (x_n)_{n=1}^{N}, and if needed, we apply the hann2 window described above.
Figure 3.4: The hann2 function for a = 0.2, a = 0.5 and a = 0.8.
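The hann2 window can be transcribed directly; the Hann formula is the standard one from Figure 2.8:

```python
import math

def hann(n, N):
    # Hann window: 0.5 * (1 - cos(2*pi*n / (N - 1)))
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (N - 1)))

def hann2(n, N, a):
    """Interpolation between a rectangular window (a = 1) and a Hann
    window scaled by 2/sqrt(3) (a = 0), as in the text."""
    return a + (1.0 - a) * (2.0 / math.sqrt(3.0)) * hann(n, N)
```

For a = 1 the window is identically 1; for a = 0 its peak (at the window centre) is 2/√3 ≈ 1.155, matching Figure 3.4.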
3.4 Fourier Transform
To perform the discrete Fourier transform (DFT), we make use of FFTW [7], which is a library for the C programming language that does very fast transforms of any size, including prime sizes. The transform we do is a one-dimensional real-to-complex transform, taking as input N sound samples and giving as output M = N/2 complex amplitudes X1, ..., XM and one constant term. Assuming that the sampling frequency of the input is fs Hz, the frequencies corresponding to the complex amplitudes are fs/2M, 2fs/2M, ..., fs/2.
After transforming, we get the power spectrum as P̂_m = (|X_m|/M)². The division by M is needed for the power spectrum to be independent of the size of the transform, and is related to how FFTW implements the DFT.
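The indexing and normalization can be illustrated with a naive O(N²) DFT; the actual system uses FFTW, and this sketch only mirrors the scaling described above:

```python
import cmath
import math

def power_spectrum(x):
    """Naive O(N^2) real-to-complex DFT followed by the normalization
    P_m = (|X_m| / M)^2, m = 1..M with M = N/2. Illustration only;
    the actual system uses FFTW."""
    N = len(x)
    M = N // 2
    P = []
    for m in range(1, M + 1):
        Xm = sum(x[n] * cmath.exp(-2j * cmath.pi * m * n / N)
                 for n in range(N))
        P.append((abs(Xm) / M) ** 2)
    return P  # P[m - 1] corresponds to frequency m * fs / (2 * M)

# A sinusoid exactly on bin 4 puts all its power in P[3] (|X_4| = N/2,
# so the normalized power is 1, independent of the transform size):
x = [math.sin(2 * math.pi * 4 * n / 64) for n in range(64)]
P = power_spectrum(x)
```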
3.5 dBA Weighting
The human perception of loudness depends not only on the energy of a sound signal, but also on its frequency content. For example, a 1000 Hz tone will sound louder than a 100 Hz tone of equal amplitude. To let the system “hear” the sound more like the human ear would hear it, we must adjust the power spectrum based on some approximation of the sensitivity of the human ear to different frequencies. We do this by weighting the power spectrum using the same function that is used for dBA noise weighting [13]:

dBA(f) = 12200⁴ f⁸ / ((f² + 20.6²)² (f² + 107.7²) (f² + 737.9²) (f² + 12200²)²)

We thus get the dBA-weighted power spectrum as P_m = dBA(m·fs/2M) P̂_m, m = 1, ..., M.
Figure 3.5: The dBA weighting function.
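A direct transcription of the weighting function, using the constants given above:

```python
def dBA(f):
    """Power-domain A-weighting with the constants from the text."""
    f2 = f * f
    num = 12200.0 ** 4 * f2 ** 4
    den = ((f2 + 20.6 ** 2) ** 2 * (f2 + 107.7 ** 2)
           * (f2 + 737.9 ** 2) * (f2 + 12200.0 ** 2) ** 2)
    return num / den
```

The function attenuates low and very high frequencies, as in Figure 3.5; weighting a spectrum is then just multiplying each bin by dBA at its frequency.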
3.6 Pitch Detection
Pitch detection is a difficult problem that many researchers have tried to solve. Numerous algorithms have been proposed, most of which work well most of the time, but none of which work all of the time [14] [11]. In our case, since the adaptive window makes sure that each window contains an integer number of periods of a voiced sound, the pitch will be one of the frequencies fs/2M, 2fs/2M, ..., fs/2, and there will hardly be any spectral leakage in the power spectrum.
Let us call the numbers 1, ..., M the frequency bins. Our objective is to find the bin m0 corresponding to the pitch f0 = m0·fs/2M.
We cannot just take the m corresponding to the highest power, since this is likely to be one of the harmonics. In fact, for speech sounds, P_{m0} is sometimes not even larger than P_{m0−1} or P_{m0+1}. One might attempt to find m0 by defining H_m = P_m + P_{2m} + ... and taking the m > 1 that maximizes H_m, but this would fail: if m0 = 4, we would incorrectly get m0 = 2, since H_2 is always greater than H_4.
In our first somewhat successful attempt to find the pitch, we made use of the fact that the periodic component of a speech sound usually makes up almost all of the energy of the signal. We assumed m0 to be the biggest m such that H_m > aH_1 for a suitable value of a. This method worked most of the time, but it was very sensitive to noise. Our final version works as follows:
• Let P̃_m = ((M − m)/M) P_m. (P̃_m)_{m=1}^{M} is thus a power spectrum with the higher frequencies attenuated.
• Like above, let H̃_m = P̃_m + P̃_{2m} + ...
• Let m_h be the m for which P̃_m is maximal. We assume that m_h is a harmonic of m_0.
• For each divisor m_d, 1 < m_d < m_h, of m_h, starting with the greatest one: test whether H̃_{m_d} − H̃_{m_h} is greater than it would have been if it contained only noise. If so, we expect m_d to be a multiple of m_0, and we start over from the beginning with m_h = m_d. By noise, we here refer both to background noise and to the non-periodic part of the speech signal.
• If we finish looking through the divisors of m_h without finding one that might be a harmonic of m_0, then m_h = m_0, and we are done.
This is the test we use to determine whether H̃_{m_d} − H̃_{m_h} contains more than just noise:
• Let P̃_noise = (m_d/(m_d − 1)) (H̃_1 − H̃_{m_d}). If the true pitch is m_d or m_h, then H̃_1 − H̃_{m_d} will contain about 1 − 1/m_d = (m_d − 1)/m_d of the total noise, so that P̃_noise will be a good estimate of the total noise power.
• We estimate the part of H̃_{m_h} that is not noise as G_h = H̃_{m_h} − P̃_noise/m_h, and the part of H̃_{m_d} − H̃_{m_h} that is not noise as

G_extra = H̃_{m_d} − H̃_{m_h} − (1/m_d − 1/m_h) P̃_noise.

• Finally, the condition that must be satisfied for us to assume that m_d is a multiple of m_0 is:

G_extra > βG_h.

Loosely speaking, if this inequality is true, it means that lowering m_h to m_d would significantly increase the content of periodic energy in H̃_{m_h}.
Our experiments have shown that a suitable value for β is 0.04.
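The procedure above can be sketched as follows. This is our reading of the algorithm, not the system's implementation: in particular, the attenuation factor (M − m)/M in the first step is an interpretation of the formula for P̃_m, and the function name is ours.

```python
def estimate_pitch_bin(P, beta=0.04):
    """Sketch of the divisor-walk pitch detector described above.
    P[1..M] is the power spectrum (P[0] unused); returns the bin m0."""
    M = len(P) - 1
    Pt = [0.0] * (M + 1)
    for m in range(1, M + 1):
        Pt[m] = (M - m) / M * P[m]     # attenuate higher frequencies

    def H(m):                           # harmonic sum H~_m
        return sum(Pt[k] for k in range(m, M + 1, m))

    mh = max(range(1, M + 1), key=lambda m: Pt[m])
    changed = True
    while changed:
        changed = False
        for md in range(mh - 1, 1, -1):      # divisors, greatest first
            if mh % md:
                continue
            Pnoise = md / (md - 1) * (H(1) - H(md))
            Gh = H(mh) - Pnoise / mh
            Gextra = H(md) - H(mh) - (1 / md - 1 / mh) * Pnoise
            if Gextra > beta * Gh:           # md looks like a multiple of m0
                mh = md
                changed = True
                break
    return mh

# Synthetic noise-free spectrum with harmonics at bins 5, 10, 15, 20,
# where the second harmonic is the strongest peak; the divisor walk
# should still recover the fundamental at bin 5:
P = [0.0] * 101
P[5], P[10], P[15], P[20] = 1.0, 2.0, 1.0, 0.5
```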
3.7 Energy Analysis
In order to help the user distinguish between voiced and unvoiced sounds, we want to display the total power of the speech signal, as well as how much of that power is made up of periodic sound and nonperiodic sound, respectively. Let’s call these “periodic power” and “nonperiodic power”.
We can get the total power as P_tot = H_1 in the notation of Section 3.6. To get an estimate of the periodic power, we must subtract from H_{m0} an estimate of how much nonperiodic power is contained in H_{m0}. Since nonperiodic power is usually spread out quite evenly through the spectrum, we expect H_{m0} to contain about 1/m_0 of the nonperiodic energy, while P_tot − H_{m0} should contain about (m_0 − 1)/m_0 of the nonperiodic energy. Thus, we take

P_per = H_{m0} − (1/m_0)(m_0/(m_0 − 1))(P_tot − H_{m0}) = H_{m0} − (1/(m_0 − 1))(H_1 − H_{m0})

and

P_nonper = P_tot − P_per.
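The split follows directly from H_1, H_{m0} and m_0; a small sketch with invented example numbers:

```python
def energy_split(H1, Hm0, m0):
    """Split the total power into periodic and nonperiodic parts,
    following P_per = H_m0 - (H_1 - H_m0) / (m0 - 1)."""
    P_tot = H1
    P_per = Hm0 - (H1 - Hm0) / (m0 - 1)
    return P_per, P_tot - P_per

# Hypothetical values: total power 10, harmonic sum 8, pitch bin 4
P_per, P_nonper = energy_split(10.0, 8.0, 4)
```

Note that if the signal is purely periodic (H_{m0} = H_1), the nonperiodic part comes out as exactly zero.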
3.8 Formant Extraction and Vowel Colour
Watanabe et al. [17] propose an interesting way of visualizing vowels by colours. It works by extracting the first three formants f1, f2 and f3, after which the RGB values of the colour are generated as

R = 5 f1/f3
G = (3/5) f3/f2
B = (1/3) f2/f1
The coefficients 5, 3/5, and 1/3 are chosen to make a neutral vowel colourless, and using ratios of formants instead of using the formants directly removes the influence of vocal tract length on the colour produced. Watanabe et al. also show that it is much easier to recognize a vowel by seeing its colour than by looking at its spectrogram or its formant tracks.
To extract the formants, we have taken the implementation of a formant extraction algorithm [4] used in the Praat computer program [2].
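The colour mapping is a direct transcription of the formulas above. Note how a neutral vowel with formant ratios 1 : 3 : 5 gives the colourless (1, 1, 1), and how scaling all formants by the same factor (a longer or shorter vocal tract) leaves the colour unchanged:

```python
def vowel_rgb(f1, f2, f3):
    """RGB from the first three formants, after Watanabe et al. [17]."""
    R = 5.0 * f1 / f3
    G = (3.0 / 5.0) * f3 / f2
    B = (1.0 / 3.0) * f2 / f1
    return R, G, B

# A neutral vowel (500, 1500, 2500 Hz) maps to approximately (1, 1, 1);
# a speaker with a 20% shorter vocal tract gets the same colour:
r, g, b = vowel_rgb(500.0, 1500.0, 2500.0)
r2, g2, b2 = vowel_rgb(600.0, 1800.0, 3000.0)
```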
3.9 Energy Distribution Analysis
Inspired by the method described in the previous section, we wanted to use some kind of colouring to help the user distinguish between different unvoiced speech sounds. When investigating these kinds of speech sounds, we found that their spectra were smoother than those of vowels. Thinking in terms of the source–filter model of speech production (cf. Section 2.4), it seemed that their spectra could be well approximated as those of white noise passed through different band-pass filters.
In order to have a way to define and extract the central frequency and bandwidth of the filter, we treat the power spectrum as a probability distribution

p(k·fs/(2M)) = Pk/Ptot
where Ptot = Σk Pk is the total power. We then define the central frequency and the bandwidth as the mean and standard deviation, respectively, of this distribution:

fc = Σ_{k=1}^{M} (k·fs/(2M)) · Pk/Ptot

Bw = sqrt( Σ_{k=1}^{M} (k·fs/(2M) − fc)² · Pk/Ptot )
In this way, both fc and Bw will have the dimension of frequency.
We found that these two values are useful for distinguishing voiceless consonants, see Table 3.1. The table also contains, as a reference, the skewness γ of the distribution [8], defined as

γ = (1/Bw³) · Σ_{k=1}^{M} (k·fs/(2M) − fc)³ · Pk/Ptot,
which is a dimensionless quantity showing in which direction the distribution is leaning, see Figure 3.6.
Sound   fc (Hz)   Bw (Hz)   γ
/p/      2400      1700      1.8
/s/      6000      1500     −1.1
/f/      4200      2200      0.9
/tj/     3500       900      0.7
/sj/     1300       500      3.5
/h/      1400      1100      2.0

Table 3.1: fc, Bw and γ for some unvoiced consonants in Swedish.
Figure 3.6: Negative vs. positive skew.
Although it looks like γ could be useful to help discriminate between the different consonants, γ fluctuates a lot, and we found it better to use only fc and Bw.
To produce a colour, we first convert f̂c to the mel frequency scale (explained in the next section). The reason for this conversion is to get better resolution in the lower frequencies. We then choose the Hue, Saturation and Value of the colour as

H = (f̂c − 1)/(2.3 − 1) · 360°   (with f̂c expressed in mel)

S = 0 if Bw < 500, 1 if Bw > 1700, and (Bw − 500)/(1700 − 500) otherwise

V = 1.
Finally, we convert these to RGB values using the following steps:
H′ = H/60°
C = V·S
X = C·(1 − |H′ mod 2 − 1|)^(2/3)
(R1, G1, B1) = (C, X, 0) if ⌊H′⌋ mod 6 = 0,
               (X, C, 0) if ⌊H′⌋ mod 6 = 1,
               (0, C, X) if ⌊H′⌋ mod 6 = 2,
               (0, X, C) if ⌊H′⌋ mod 6 = 3,
               (X, 0, C) if ⌊H′⌋ mod 6 = 4,
               (C, 0, X) if ⌊H′⌋ mod 6 = 5,
m = V − C
(R, G, B) = (R1 + m, G1 + m, B1 + m).
The usual way of converting from HSV to RGB is to use

X = C·(1 − |H′ mod 2 − 1|),

but this results in a very uneven colour spectrum, where red, green and blue take up a lot of space while yellow, cyan and magenta are very narrow, cf. Figure 3.7.
Figure 3.7: Top: converting HSV to RGB using X = C·(1 − |H′ mod 2 − 1|) in the algorithm above. Bottom: converting HSV to RGB using X = C·(1 − |H′ mod 2 − 1|)^(2/3). If you cannot see any difference between these, try displaying them on an
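The modified conversion can be sketched as follows, using the standard definition C = V·S; the function name is our own.

```python
def hsv_to_rgb_mod(H, S, V):
    """HSV -> RGB with the modified chroma term
    X = C * (1 - |H' mod 2 - 1|)**(2/3),
    which widens the yellow/cyan/magenta bands (cf. Figure 3.7).
    H in degrees, S and V in [0, 1]."""
    C = V * S
    Hp = (H % 360.0) / 60.0
    X = C * (1.0 - abs(Hp % 2.0 - 1.0)) ** (2.0 / 3.0)
    R1, G1, B1 = [(C, X, 0.0), (X, C, 0.0), (0.0, C, X),
                  (0.0, X, C), (X, 0.0, C), (C, 0.0, X)][int(Hp) % 6]
    m = V - C
    return (R1 + m, G1 + m, B1 + m)
```

At H = 30° (yellow side of red) the exponent raises the second channel from 0.5 to about 0.63, which is exactly the widening of the secondary colours described above.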
3.9.1 Efficient Calculation of fc and Bw

If we calculated fc and Bw directly from the definitions above, we would have to iterate from 1 to M twice: once to find Ptot and fc, and once to find Bw, since Bw depends non-linearly on fc.
Instead, we first calculate

µn = Σ_{k=1}^{M} (k·fs/(2M))^n · Pk,   n = 0, 1, 2,

which can be done in a single iteration. Note that µ0 = Ptot. We then get

fc = µ1/µ0
Bw = sqrt(µ2/µ0 − µ1²/µ0²).
If we wanted, we could also get γ as

γ = (1/Bw³)·(µ3/µ0 − 3·µ1·µ2/µ0² + 2·µ1³/µ0³).
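The single-pass computation can be sketched like this (we also accumulate µ3 so that γ is available; names are our own):

```python
import math

def spectral_moments(P, fs):
    """One pass over the power spectrum P_1..P_M: accumulate
    mu_n = sum_k (k*fs/(2M))**n * P_k for n = 0..3, then derive
    f_c, B_w and gamma (Section 3.9.1). mu_0 equals P_tot."""
    M = len(P)
    mu = [0.0] * 4
    for k, Pk in enumerate(P, start=1):
        f = k * fs / (2.0 * M)
        for n in range(4):
            mu[n] += f ** n * Pk
    fc = mu[1] / mu[0]
    Bw = math.sqrt(mu[2] / mu[0] - (mu[1] / mu[0]) ** 2)
    gamma = (mu[3] / mu[0] - 3.0 * mu[1] * mu[2] / mu[0] ** 2
             + 2.0 * (mu[1] / mu[0]) ** 3) / Bw ** 3
    return fc, Bw, gamma
```

For a spectrum that is symmetric around its mean frequency, γ comes out as zero, as the skewness formula requires.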
3.9.2 Handling Background Noise
The above method of getting fc and Bw from the µn also allows us to easily alleviate the effects of static background noise. Since (1) the noise is uncorrelated with the speech signal, and (2) µn depends linearly on the power spectrum, we know that µn,noise+signal ≈ µn,noise + µn,signal. Thus, we can get an estimate of µn,signal if we have an estimate of µn,noise.
We have noticed that the µn are quite constant when there is only background noise, which is why we can assume that µn,noise will remain quite constant even in the presence of speech.

We used the fact that, when there is no speech, µ0 (i.e. the power of the signal after dBA-weighting) will be small to devise the following scheme for estimating µn,noise:
• Let µ0,min = ∞ and µn,noise = 0, n = 0, 1, 2.
• For each window, after calculating µn:
  – If µ0 < µ0,min, let µ0,min = µ0.
  – If µ0 < (3/2)·µ0,min, there is probably no speech, so:
    ∗ Let µn,noise = (9/10)·µn,noise + (1/10)·µn, n = 0, 1, 2.
    ∗ Let µ0,min = (9/10)·µ0,min + (1/10)·µ0.
The last step above is necessary to prevent µ0,min from getting stuck at some very low value and thus hindering µn,noise from being updated further. Using this algorithm, we get a smoothed estimate of the background noise that can even follow slow changes in the noise level.
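The scheme above can be sketched as a small class; the clamping at zero in denoised() is our addition, not part of the scheme.

```python
class NoiseTracker:
    """Running estimate of the background-noise moments (Section 3.9.2).
    Moments are updated only when mu_0 suggests there is no speech; the
    leaky minimum keeps mu0_min from getting stuck at a one-off low value."""

    def __init__(self):
        self.mu0_min = float("inf")
        self.mu_noise = [0.0, 0.0, 0.0]

    def update(self, mu):
        """mu = (mu_0, mu_1, mu_2) of the current window."""
        if mu[0] < self.mu0_min:
            self.mu0_min = mu[0]
        if mu[0] < 1.5 * self.mu0_min:        # probably no speech
            for n in range(3):
                self.mu_noise[n] = 0.9 * self.mu_noise[n] + 0.1 * mu[n]
            self.mu0_min = 0.9 * self.mu0_min + 0.1 * mu[0]

    def denoised(self, mu):
        """mu_n,signal ~= mu_n - mu_n,noise (the clamp at zero is ours)."""
        return [max(0.0, a - b) for a, b in zip(mu, self.mu_noise)]
```

Fed a constant noise floor, the estimate converges geometrically (factor 9/10 per window) to the true noise moments.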
3.10 Display
The visualization is drawn on a rectangular canvas with user-configurable size. It consists of two parts: the upper part is a modified spectrogram while the lower part shows the result of the energy analysis and the formant extraction, see Figure 3.8.
Figure 3.8: The visualization. Upper part: modified spectrogram. Lower part: plot of total/periodic/nonperiodic power and vowel colour.
The visualization is drawn column by column from left to right, restarting from the left after it has reached the right edge. Each column is one pixel wide and represents the result of analyzing one frame of sound input.
The next two sections explain in detail how each part of the visualization is produced.
3.10.1 Spectrogram
The spectrogram can be seen as a real-valued function of two variables,

(time, frequency) ↦ power,
where each column of pixels corresponds to a certain point in time and each row of pixels corresponds to a certain frequency. Each pixel has a brightness based on the power contained in a certain frequency band at a certain point of time. The brighter the pixel, the more power. The colour of each column of the spectrogram is the one we get from the energy distribution analysis, cf. Section 3.9.
Mel Scale
If we drew the power spectrum on the screen with a linear frequency scale, the user would have difficulty discerning the details in the lower frequencies. For example, he wouldn't see clearly how the pitch changes. That's why we have chosen to rescale the power spectrum using the mel scale:

mel(f) = C·log(1 + f/700 Hz)
The mel scale approximates the sensitivity of the human ear to frequency changes, meaning that increasing the frequency of a tone by a certain number of mel will sound like an equally large increase no matter the original frequency of the tone. The same is not true for a linear frequency scale, where the difference between 50 and 100 Hz is perceived as much larger than the difference between, for example, 10,000 and 10,050 Hz. Since we are not interested in the absolute value of mel(f), the value of C doesn't matter, and we let it be unity. mel(f) is plotted in Figure 3.9. The difference between a linear frequency scale spectrogram and a mel scale spectrogram is shown in Figure 3.10.
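As an illustration, rows spaced uniformly on the mel axis can be mapped back to centre frequencies as below. The thesis does not specify the exact resampling, so this is only a sketch, assuming the natural logarithm and C = 1.

```python
import math

def mel(f):
    """Mel-style warp with C = 1 (only relative values matter)."""
    return math.log(1.0 + f / 700.0)

def mel_row_frequencies(n_rows, f_max):
    """Centre frequency of each spectrogram row when rows are spaced
    uniformly in mel, so low frequencies get more rows. Uses the
    inverse of mel(): f = 700*(exp(m) - 1)."""
    m_max = mel(f_max)
    return [700.0 * (math.exp(m_max * (r + 0.5) / n_rows) - 1.0)
            for r in range(n_rows)]
```

The resulting frequencies increase monotonically while their mel-spacing stays constant, which is what gives the lower frequencies more screen space.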
Log Power
One tricky thing to take into account when drawing a spectrogram is the high dynamic range of the power spectrum. For example, for a vowel, many harmonics of the fundamental frequency will be more than a hundred times weaker than the strongest harmonic, but they will still be very important for identifying the vowel. Another problem is that even small changes in the distance between the speaker and the microphone, or changes in how loud the speaker speaks, may cause the power of the speech signal to change by several orders of magnitude.
Thus, letting

pm = log Pm,

we select the brightness ∈ [0, 1] of each pixel based on the logarithm of the power according to the following rule:

brightness(pm) = 0 if pm < plow,
                 (pm − plow)/(phigh − plow) if plow < pm < phigh,
                 1 if pm > phigh.
Figure 3.9: The mel function log(1 + f/700 Hz), plotted against f (kHz).
Figure 3.10: Left: linear frequency spectrogram. Right: mel scale spectrogram.
Finally, we get the colour of each pixel by multiplying this brightness value with the RGB values from the energy distribution analysis.
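The per-pixel rule can be sketched as follows, with the brightness clamped to [0, 1]; the function name is our own.

```python
import math

def pixel_rgb(Pm, p_low, p_high, rgb):
    """Brightness of one pixel from log power, clamped to [0, 1], then
    multiplied by the column colour from the energy distribution
    analysis (Section 3.9)."""
    pm = math.log(Pm)
    if pm <= p_low:
        brightness = 0.0
    elif pm >= p_high:
        brightness = 1.0
    else:
        brightness = (pm - p_low) / (p_high - p_low)
    return tuple(brightness * c for c in rgb)
```

A power at the upper threshold reproduces the column colour at full brightness; one halfway between the thresholds (in log units) gives half brightness.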
Autolevel
To adjust to changes in sound volume and noise level, the thresholds plow and phigh need to be continuously adjusted. It is crucial to do this well, considering that the pm's can vary by many orders of magnitude, while the brightness of a pixel can only be set to 256 different values. Our algorithm works by going through the following steps each time the power spectrum has been updated:
1. Let pmax = max_m pm and psmall = log((C/M)·Σ_m Pm). We will use pmax as the highest value we want to display and psmall as the smallest. C is used for fine-tuning; a suitable value for it is 0.5.
2. If pmax > phigh, let phigh = pmax. Otherwise, let phigh = (1 − λ)·phigh + λ·pmax.
3. Similarly, if psmall < plow, let plow = psmall. Otherwise, let plow = (1 − λ)·plow + λ·psmall.
In this way, when pmax is bigger than phigh, we immediately adjust phigh. On the other hand, for every window where pmax is less than phigh, we decrease phigh a bit, which means that if the speech signal becomes weaker (for example because the speaker has moved further away from the microphone), we adjust to that.

Similarly, plow is able to follow changes in the background noise.
We have found that a good value for λ is tl/(tl + 10 s), where tl is the length of one window minus the overlap. This means that if we view the above algorithm as two low-pass filters, their time constants are 10 seconds.
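One update step of this autolevel scheme might look like the following; the names and argument order are our assumptions.

```python
import math

def autolevel(p_low, p_high, P, lam, C=0.5):
    """One autolevel step (Section 3.10.1): jump a threshold outward
    immediately when it is exceeded, otherwise let it relax with factor
    lam (two first-order low-pass filters with 10 s time constants when
    lam = t_l / (t_l + 10 s))."""
    p_max = max(math.log(Pm) for Pm in P)
    p_small = math.log(C / len(P) * sum(P))
    p_high = p_max if p_max > p_high else (1.0 - lam) * p_high + lam * p_max
    p_low = p_small if p_small < p_low else (1.0 - lam) * p_low + lam * p_small
    return p_low, p_high
```

A sudden loud frame thus raises phigh instantly, while a quiet stretch lowers it only gradually.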
3.10.2 Energy and Vowel Colour
In the lower part of the screen, we plot the energy content of the speech, as illustrated in Figure 3.8.
The total height of the curve is (log Ptot − c1)/(c2 − c1), and the height of the nonperiodic part is (log Pper − c1)/(c2 − c1). This leaves (log Ptot − log Pper)/(c2 − c1) for the periodic part, which is drawn with the colour calculated in Section 3.8. The thresholds c1 and c2 are continuously updated using a scheme similar to the one described in Section 3.10.1.
This tricky way of dividing the height of the total power into periodic and nonperiodic power makes it possible to distinguish sounds like /u/ and /v/, since it lets us see that /v/ contains more nonperiodic energy than /u/, despite the fact that the periodic energy in /v/ is many times stronger than its nonperiodic energy.
Chapter 4
Results
In this chapter, we show how our system visualizes various speech sounds. We show visualizations of the sounds of the Swedish language, as well as a few sounds from English and Chinese that don't appear in Swedish. In order to demonstrate the speaker-independence of the system, we show each sound pronounced by both a male and a female speaker. Unfortunately, we had problems getting the formant extraction to work properly, so it was disabled when producing the images below.
Vowels
Since Swedish already has a large inventory of vowels, we will only use these to demonstrate our system, see Figures 4.1 and 4.2. To demonstrate what the visualization would look like if the formant extraction were working, Figure 4.3 contains the same vowels as Figure 4.1, except that it has been manually coloured according to the formant frequencies given in Table 2.1, Section 2.2.1.
Voiced Consonants
Figures 4.4 and 4.5 show the visualizations of a number of different voiced consonants. It can be noted that the energy plot clearly shows that most of these have a larger proportion of nonperiodic energy than the vowels, thus helping the user distinguish between voiced consonants and vowels. Unfortunately, it is still difficult to distinguish these voiced consonants from vowels and from each other, since it requires looking closely at where the formant frequencies are. It is especially hard to differentiate between /b/, /d/ and /g/, since these are very short.
Unvoiced Consonants
Figures 4.6 and 4.7 show visualizations of unvoiced consonants. Undoubtedly, the colouring is very useful for helping to distinguish between them.
Figure 4.1: Visualization of Swedish vowels /a/, /o/, /u/, /å/, /e/, /i/, /y/, /ä/ and /ö/. Male speaker.
Figure 4.2: Visualization of Swedish vowels /a/, /o/, /u/, /å/, /e/, /i/, /y/, /ä/ and /ö/. Female speaker.
Figure 4.3: Visualization of Swedish vowels /a/, /o/, /u/, /å/, /e/, /i/, /y/, /ä/ and /ö/. Manually coloured using formant frequencies from Table 2.1.
Figure 4.4: Visualization of Swedish voiced consonants /b/, /d/, /g/, /l/, /j/, /v/, /r/, /m/, /n/ and /ng/ as well as English /r/, /z/ and /th/ as in ‘there’. Male speaker.
Figure 4.5: Visualization of Swedish voiced consonants /b/, /d/, /g/, /l/, /j/, /v/, /r/, /m/, /n/ and /ng/ as well as English /r/, /z/ and /th/ as in ‘there’. Female speaker.
Figure 4.6: Visualization of Swedish unvoiced consonants /p/, /t/, /k/, /s/, /f/, /tj/, /sj/ and /h/ as well as English /th/ as in ‘thing’ and Chinese /x/. Male speaker.
Figure 4.7: Visualization of Swedish unvoiced consonants /p/, /t/, /k/, /s/, /f/, /tj/, /sj/ and /h/ as well as English /th/ as in ‘thing’ and Chinese /x/. Female speaker.
Pitch and Loudness
Figures 4.8 and 4.9 demonstrate how changes in pitch and changes in loudness are visualized. It can be seen that, thanks to the adaptive window, the mel scale spectrogram and the energy plot, changes in these quantities are very easy to spot. As a comparison, Figure 4.10 shows a visualization of the same input as in Figure 4.8, but using a Hann window instead of the adaptive window. Clearly, there is a huge benefit in using the adaptive window.
Figure 4.8: Visualization of the Chinese syllable “fa” pronounced with four different tones.
Figure 4.10: Visualization of the Chinese syllable “fa” pronounced with four different tones, using the Hann window instead of the adaptive window.
Chapter 5
Conclusions and Future Work
5.1 Conclusions
This thesis introduces a speech visualization system including several novel techniques which make it a significant improvement over a plain spectrogram. The biggest improvements are that the system makes it easy to follow pitch and loudness and to distinguish between unvoiced consonants. The system has been designed without any specific language in mind, and our tests have shown that it is indeed equally good at visualizing sounds from different languages. Further, we have shown that the same speech sounds spoken by different people are visualized similarly. This shows that we have been successful in making the system both speaker and language independent. However, important work remains to be done in helping the user distinguish between different vowels and voiced consonants.
5.2 Future Work
Distinguishing between voiced plosives
One problem with our visualization is that it is difficult to distinguish between voiced plosives like /b/, /d/ and /g/.
As can be seen in Figure 5.1 (a), these consonants consist of a very short unvoiced part followed by a voiced part. In order to figure out which of these parts is most important for distinguishing between the different sounds, we took the unvoiced part of /g/ and attached it to the voiced part of /b/, and similarly with /b/ and /d/ as well as /d/ and /g/, see Figure 5.1 (b).
By listening to the combined sounds and deciding which sound they were more similar to, we hoped to find out which of the two parts of each sound is most important for distinguishing between them. Unfortunately, all three modified sounds sounded only like a mixture of the sounds used to create them, forcing us to conclude that both parts of the voiced plosives are important for distinguishing between them.
We think that it would be necessary to use some kind of machine learning, e.g. neural networks, in order to make progress here. Watanabe et al. seem to have had the same experience with this problem.
Figure 5.1: Top: the first 100 ms of /b/, /d/ and /g/. Bottom: /g/+/b/, /b/+/d/ and /d/+/g/.
Compensating for the difference in frequency response for different microphones
We noticed that fc and Bw were somewhat different between different microphones. Since we already have a system to mitigate the influence of noise on fc and Bw (see Section 3.9.2), we concluded that the difference must be due to differences in the frequency response of the microphones.
Two options for solving this problem could be:
1. Measuring the frequency response of the microphone to be used, and using this to adjust the power spectrum.
2. Giving up the idea of using fc and Bw altogether, and instead using some kind of machine learning to extract features from the spectrum. For example, neural networks could be taught to recognize how far to the back or to the front the consonant is pronounced, and how sharp or soft the consonant is.
Voicing detection
Using a method for distinguishing between voiced and unvoiced speech, it would be possible to only apply the colouring of unvoiced consonants when it actually makes sense. Also, it would be possible to create a more integrated visualization by somehow merging the vowel colouring and periodic energy into the spectrogram.
Formant extraction
The code for formant extraction that we use is somewhat unreliable, and its parameters need to be tuned depending on the speaker. Thus, an important improvement to be made is to find a better algorithm for this.
5.3 Acknowledgements
We would like to thank all the people who have released code under free, open-source licenses. Without you, our work would have been a lot more difficult. Of course, we are also grateful to everyone who read our thesis and offered suggestions. Mehmet Guldogan and Tim Nordenfur in particular provided many useful suggestions.
Bibliography
[1] P. Bamberg. Pitch tracking and epoch detection. 2000.
[2] P. Boersma and D. Weenink. Praat: doing phonetics by computer [computer program]. 2010. URL: http://www.praat.org/.
[3] A. de Cheveigné and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. 2001.
[4] D. G. Childers. Modern Spectrum Analysis. IEEE Press, 2004.
[5] G. Fant. Acoustic Theory of Speech Production. Mouton & Co, The Hague, Netherlands, 1960.
[6] G. Fant, G. Henningson, and U. Stålhammar. Formant frequencies of Swedish vowels. STL-QPSR, 10(4):026–031, 1969.
[7] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on “Program Generation, Optimization, and Platform Adaptation”.
[8] P. T. von Hippel. Mean, median, and skew: Correcting a textbook rule. Journal of Statistics Education, 13(2), 2005.
[9] K. Koppinen. Linear prediction. URL: http://www.cs.tut.fi/courses/SGN-4010/LP_en.pdf.
[10] P. Ladefoged and K. Johnson. A course in phonetics. Wadsworth, 2010.
[11] I. Luengo, I. Saratxaga, E. Navas, I. Hernaez, J. Sanchez, and I. Sainz. Evaluation of pitch detection algorithms under real conditions. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV-1057–IV-1060, April 2007.
[12] P. McLeod and G. Wyvill. A smarter way to find pitch.
[13] American National Standards of the Acoustical Society of America. ANSI/ASA S1.4-1983. 1983.
[14] L. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal. A comparative performance study of several pitch detection algorithms. Acoustics, Speech and Signal Processing, IEEE Transactions on, 24(5):399–418, October 1976.
[15] H. M. Sussman. A neuronal model of vowel normalization and representation. Brain and Language, 28(1):12–23, 1986.
[16] T. Toda, A. Black, and K. Tokuda. Acoustic-to-articulatory inversion mapping with Gaussian mixture model. INTERSPEECH-2004, pages 1129–1132, 2004.
[17] A. Watanabe, S. Tomishige, and M. Nakatake. Speech visualization by integrating features for the hearing impaired. IEEE Transactions on Speech and Audio Processing, 8(4):454–466, 2000.
[18] A. Wrench. The MOCHA-TIMIT articulatory database. Queen Margaret University College, 1999. URL: http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html.
[19] Wang X., Xue L. F., Yang D., and Han Z. Y. Speech visualization based on locally linear embedding (LLE) for the hearing impaired. In BioMedical Engineering and Informatics, 2008. BMEI 2008. International Conference on, volume 2, pages 502–505, May 2008.