
From Acoustics to Articulation

Studies on the acoustic-articulatory relationship along with methods to normalize and adapt to variations in production across different speakers.

G. ANANTHAKRISHNAN

Doctoral Thesis

Stockholm, Sweden, 2011


ISSN 1653-5723

ISRN-KTH/CSC/A–11/23-SE
ISBN 978-91-7501-215-5

KTH School of Computer Science and Communication, SE-100 44 Stockholm, Sweden

Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av teknologie doktorsexamen i datalogi fredagen den 27 januari 2012 klockan 10:00 i Sal F3, Lindstedtsvägen 26, Kungl Tekniska högskolan, Stockholm, Sweden.

© G. Ananthakrishnan, Dec 2011

Tryck: E-Print AB


To

Anvesh, my son,

Pritesh, my wife,

the apples of my eyes


Abstract

The focus of this thesis is the relationship between the articulation of speech and the acoustics of the produced speech. Several problems are encountered in understanding this relationship, given the non-linearity, variance and non-uniqueness in the mapping, as well as the differences that exist in the size and shape of the articulators, and consequently in the acoustics, for different speakers. The thesis mainly covers four topics pertaining to the articulation and acoustics of speech.

The first part of the thesis deals with variations among different speakers in the articulation of phonemes. While the speakers differ physically in the shape of their articulators and vocal tracts, the study tries to extract articulation strategies that are common to different speakers. Using multi-way linear analysis methods, the study extracts articulatory parameters which can be used to estimate unknown articulations of phonemes made by one speaker, given other articulations made by the same speaker and the corresponding articulations made by other speakers of the language. At the same time, a novel method is suggested to select the number of articulatory model parameters, as well as the articulations that are representative of a speaker's articulatory repertoire.

The second part is devoted to the study of uncertainty in the acoustic-to-articulatory mapping, specifically non-uniqueness in the mapping. Several studies in the past have shown that human beings are capable of producing a given phoneme using non-unique articulatory configurations, when the articulators are constrained. This was also demonstrated by synthesizing sounds using theoretical articulatory models. The studies in this part of the thesis investigate the existence of non-uniqueness in unconstrained read speech. This is carried out using a database of acoustic signals recorded synchronously along with the positions of electromagnetic coils placed on selected points on the lips, jaws, tongue and velum. This part, thus, largely devotes itself to describing techniques that can be used to study non-uniqueness in the statistical sense, using such a database. The results indicate that the acoustic vectors corresponding to some frames in all the phonemes in the database can be mapped onto non-unique articulatory distributions. The predictability of these non-unique frames is investigated, along with verifying whether applying continuity constraints can resolve this non-uniqueness.
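
The general mechanics of such a statistical analysis can be sketched as follows (a minimal illustration, not the code used in the thesis): fit a Gaussian mixture model to joint acoustic-articulatory vectors and inspect the articulatory mixture conditioned on a single acoustic frame; widely separated conditional means with non-negligible weights suggest a candidate non-unique frame. All array names, dimensions and the use of NumPy, SciPy and scikit-learn below are assumptions for illustration.

```python
# Illustrative sketch only (not the thesis implementation).
# `acoustic` and `artic` stand in for synchronously recorded acoustic features
# (e.g. cepstral frames) and EMA coil coordinates; here they are random placeholders.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def conditional_articulatory_mixture(gmm, a, d_a):
    """For a joint GMM over [acoustics, articulation], return the weights and means
    of the articulatory mixture p(articulation | acoustics = a)."""
    weights, means = [], []
    for k in range(gmm.n_components):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        mu_a, mu_x = mu[:d_a], mu[d_a:]
        S_aa, S_ax = S[:d_a, :d_a], S[:d_a, d_a:]
        # Conditional mean of the articulatory block given the acoustic frame
        means.append(mu_x + S_ax.T @ np.linalg.solve(S_aa, a - mu_a))
        # Responsibility of component k for this acoustic frame
        weights.append(gmm.weights_[k] * multivariate_normal(mu_a, S_aa).pdf(a))
    weights = np.asarray(weights)
    return weights / weights.sum(), np.asarray(means)

acoustic = np.random.randn(5000, 12)   # placeholder acoustic frames
artic = np.random.randn(5000, 14)      # placeholder articulatory frames
gmm = GaussianMixture(n_components=16, covariance_type="full").fit(
    np.hstack([acoustic, artic]))
w, m = conditional_articulatory_mixture(gmm, acoustic[0], d_a=12)
top = np.argsort(w)[-2:]               # the two most probable conditional modes
print("weights of the two strongest modes:", w[top])
print("distance between their articulatory means:", np.linalg.norm(m[top[0]] - m[top[1]]))
```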

The third part proposes several novel methods of looking at acoustic-articulatory relationships in the context of acoustic-to-articulatory inversion. The proposed methods include explicit modeling of non-uniqueness using cross-modal Gaussian mixture modeling, as well as modeling the mapping as local regressions. Another innovative approach towards the mapping problem has also been described in the form of relating articulatory and acoustic gestures. Definitions and methods to obtain such gestures are presented along with an analysis of the gestures for different phoneme types. The relationship between the acoustic and articulatory gestures is also outlined. A method to conduct acoustic-to-articulatory inverse mapping is also suggested, along with a method to evaluate it. An application of acoustic-to-articulatory inversion to improve speech recognition is also described in this part of the thesis.

The final part of the thesis deals with problems related to modeling how infants acquire the ability to speak, using a model built around an articulatory synthesizer adapted to infant vocal tract sizes. The main problem addressed is how infants acquire acoustic correlates that are normalized between infants and adults. A second problem, of how infants decipher the number of degrees of articulatory freedom, is also partially addressed. The main contribution is a realistic model which shows how an infant can learn the mapping between the acoustics produced during the babbling phase and the acoustics heard from the adults. The knowledge required to map corresponding adult-infant speech sounds is shown to be learnt without the total number of categories or one-to-one correspondences being specified explicitly. Instead, the model learns these features indirectly, based on an overall approval rating provided by a simulation of adult perception on the basis of the imitation of adult utterances by the infant model.

Thus, the thesis tries to cover different aspects of the relationship between the articulation and acoustics of speech in the context of variations across speakers and ages. Although not providing complete solutions, the thesis proposes novel directions for approaching the problem, with pointers to solutions in some contexts.


Sammanfattning

Denna avhandling fokuserar på sambandet mellan talets artikulation och akustik. Detta samband är svårbeskrivet, både på grund av anatomiska och akustiska skillnader mellan olika talare, och på att icke-linjäritet, varians och flertydigheter förekommer i sambandet för samma talare. Avhandlingen innehåller fyra delar som beskriver relationen mellan talets artikulation och akustik.

Den första delen handlar om variationerna i talartikulation mellan olika talare. Talarnas anatomi skiljer sig åt, men studien försöker identifiera artikulationsstrategier som är gemensamma för de olika talarna. Flera olika linjära analysmetoder används för att beräkna de artikulatoriska parametrar som kan uppskatta talarens artikulation av fonem som inte ingått i analysens träningsmaterial. Beräkningen bygger på kännedom om andra artikulationer som gjorts av talaren och motsvarande artikulationer för andra talare av språket. En ny metod presenteras därefter för att välja antalet modelleringsparametrar och de fonem som bäst beskriver de artikulatoriska variationerna hos talaren.

Den andra delen studerar osäkerheten i det akustiska-till-artikulatoriska sambandet, speciellt med avseende på hur det kan vara flertydigt (“icke-unikt”). Flera tidigare studier har visat att människor kan använda flera olika artikulationer för att skapa vissa språkljud och detta fenomen kan också visas med hjälp av artikulatorisk syntes. Studierna i denna del av avhandlingen undersöker icke-unika kopplingar mellan artikulation och akustik i läst tal, med hjälp av samtidiga inspelningar av den akustiska signalen och data från elektromagnetiska spolar placerade på utvalda punkter på läpparna, käken, tungan och velum. Tekniker som kan användas för statistisk analys av icke-unika samband i en sådan databas beskrivs. Resultaten tyder på att det för alla fonem i databasen finns förekomster av akustiska vektorer som motsvaras av icke-unika artikulatoriska sannolikhetsfördelningar. Vidare utreds om dessa icke-unika förekomster kan förutsägas, och om kontinuitetsbegränsningar kan göra den akustiskt-artikulatoriska kopplingen unik.

Den tredje delen föreslår flera nya metoder för att studera sambandet mellan akustik och artikulation. Två av metoderna använder sig av explicit modellering av icke-unika samband, med Gauss-klockor respektive styckvis avbildning. En tredje metod beskriver sambandet med hjälp av artikulatoriska och akustiska gester. Definitioner och metoder för att ta fram gester presenteras tillsammans med analyser av dessa gester för olika fonemtyper. Därefter utvärderas möjligheten att med hjälp av sådana gester beräkna de artikulationer som svarar mot en given talsignal, så kallad akustisk-till-artikulatorisk inversion. En tillämpning av inversion för att förbättra automatisk taligenkänning beskrivs också.

Den sista delen av avhandlingen behandlar problem relaterade till modellering av hur spädbarn lär sig att tala. Det största problemet består i att förstå hur ett spädbarn normaliserar det akustiska sambandet mellan det tal det hör från vuxna i omgivningen och de ljud som det själv producerar då det jollrar. Ett andra problem består i att spädbarn måste lära sig hur många relevanta artikulatoriska frihetsgrader det finns i språket, och vilka de är. Dessa problem studeras i denna avhandling med hjälp av en artikulatorisk talsyntes anpassad till spädbarns talrörstorlek. Med denna som bas skapas en realistisk modell, som visar hur ett barn kan lära sig de akustiska sambanden och utforska artikulatoriska frihetsgrader. Modellen visar hur spädbarnet kan lära sig att kategorisera olika språkljud, utan att det från början vet det totala antalet kategorier eller det akustiska sambandet mellan fonemljud i barns och vuxnas röster. Istället lär sig modellen dessa funktioner indirekt, baserat på ett övergripande godkännande av hur väl modellen imiterar vuxnas uttal (vilket simulerar vuxnas återkoppling på barnets försök att härma).

Avhandlingen försöker alltså täcka olika aspekter av sambandet mellan artikulation och akustik, med variationer mellan olika talare och åldrar. Den föreslår nya metoder för att beskriva och studera problemet med flertydighet i sambandet och ger lösningar för att, under vissa förutsättningar, hitta de unika artikulationer som skapat en given talsignal.


Acknowledgements

Firstly, I would like to thank all those people who had a direct role in terms of contributions to my thesis. I acknowledge the help of Olov, Pierre, Giampiero, Daniel, Samer, Julian and Mats. Olov was a fantastic guide, who gave me the perfect mix of freedom and guidance, so that I had a great time working on my research. His contribution towards writing the articles and this thesis has been amazing, providing me with perfectly constructed sentences that paraphrased what I wanted to say extremely precisely. Pierre was an excellent support during a period of ennui in my research. I went to Grenoble and told him, "I'll work on anything you tell me to." And what ensued was a fantastic collaboration and correspondence. He has been a guiding light in more ways than one, with his beautiful views on both speech-related issues and the world in general. Giampiero was another source of guidance, with his quiet but important way of shaping my thinking and research. My collaboration with him has been really fruitful and has made a huge impact on how I do my research now.

Daniel and I have had several fruitful collaborations and several papers together. Brainstorming with him and working with his whirlwind style was a huge motivation for me. Samer, who joined TMH along with me, has shared office space with me for more than 4 years. He has been such a fantastic person to work, discuss and disagree with, in the process opening up radical new ways of looking at things. I also enjoyed all the activity and attention that his research and its methods brought to our room. Julian was fun to work with during my stint at Grenoble and Mats was an excellent grandfather figure during my entire PhD. It was always fun to catch up with Mats, who states his views in such an understated way and yet makes a huge impact. Björn was the main source of support for me with respect to all things official and he also gave me extremely good tips on how to handle several official responsibilities.

Daniel Elenius not only provided the Matlab code to build a baseline HMM system for some of my work, but also was good fun to chat with, giving me many tips about the Swedish culture and way of life. I would like to thank Lèo, Thuy, Christian and Gaël for having the patience to listen to my synthesized vowels and help me with figuring out what went wrong with my initial attempts. Sten took keen interest in my research and his comments on my work during several presentations were indeed very helpful.

There are some projects which I must acknowledge because they provided me with the funding, means and direction to conduct my research. These include the French ANR-08-EMER-001-02 grant (project ARTIS); the Swedish Research Council project 80449001, Computer-Animated Language Teachers; and ASPI (contract no. 021324), funded by the Future and Emerging Technologies (FET) programme within the Sixth Framework Programme for Research of the European Commission. I must also thank the Knut and Alice Wallenberg Foundation and ISCA for providing me with several travel grants. I also want to thank Laurent for performing the MRI recordings for subjects 'yl' and 'hl', and S. Masaki, S. Takano, I. Fujimoto, and Y. Shimada (ATR, Kyoto, Japan) for subject 'pb' used in the first part of my thesis.

There are many others whose contributions to my thesis were more subtle. Preben was a great inspiration throughout my PhD. We had amazing philosophical discussions and he was really patient and encouraging with me trying to speak Swedish. He helped me engage in technical discussions in Swedish, which helped me build a lot of confidence. The work I did with Bo was an excellent way to keep my interest in the field of hearing and acoustics going. I got to learn a lot of things from my association with him. Robert, with his superhuman ability to be active in so many fields, was an awesome beacon in my research. He showed me how one can really wear several hats at the same time. Sebastian's master's thesis was important for me because it was the first time I had a relatively supervisory role. It became a great experience because of his amazing spirit and interest.

Laura and Sofia helped me improve my Swedish a great deal and Anita was the one who finally had the patience to continue making conversation with me until I became confident. Laura, Sofia and Preben also helped me out with some experiments for my thesis and during courses that we took together. As soon as I arrived in this new country, David, Inger, Kjell and Susanne, with their inimitable spirit, made me feel at home. They helped me out with several basic things, answering all the stupid questions I had for them. Just having the opportunity to meet and talk to Gunnar Fant was a fantastic feeling. I would definitely like to thank all the members in our lab, Chris, Anna, Simon, Jens, Jonas, Joakim, Rolf, Gabriel, David, Kjell Gustavsson, Raveesh, Peter, Alex, Marco, Kjetil, Jan, Anders Fribeg, Anders Askenfelt, Roberto, Anick, Erwin, Mattias, Kornell, Sheri, Becky, Yoko, Roxana, Markku, Kerstin, Cristine, Margereta, Glaucia, Eléna, Svante, David Alegería (may his soul rest in peace), Gunnar Julin, Johan, Beyza, Elizabeth, Björn Kjellgren, Cecilia, Fatemeh, Richard, Elke and all the others, who made my time at KTH and in Sweden completely enjoyable.

Besides the lab environment, it was my many friends who helped me live a comfortable life in Stockholm and kept me grounded. I mention only a few here: Sundar, Samarth, Arun, Ashwin, Vineeth, Mani, Saikiran, Indira, Ramnath, Sandeep, Lukman, Shimja, Deepak, Suvarna, Santanu, Satish, Sharanya, Mukta, Sumedha, Sugeerthi. There were many India-related cultural activities at KTH that I took part in, which helped me have fun during my entire stay here. Alphonsa Mam and Alex were really the people who helped me in this regard. They were family to me for the initial dark and lonely months of my stay in Sweden. Last but not least, I want to thank my family, my mom, dad, mama, mami, ammamma, tata, pati, Zob and Bosky, who encouraged me to first take up doctoral studies and then supported me through it. Sundaram tata and Radha pati are no more now, and how happy they would have been to see me at this stage. Pritesh and Anvesh, if not for you, I would not have achieved even 0.00001% of what I have in my endeavors.


Contents

The Thesis is based on the following Articles
List of Abbreviations
List of Figures
List of Tables

1 Introduction
1.1 The speech communication mechanism
1.2 Articulatory modeling and speaker normalization
1.3 Acoustic-to-Articulatory Inversion
1.4 Infant language acquisition
1.5 Concluding remarks

I Articulatory Modeling and Speaker Normalization

2 Towards a Universal Articulatory Model
2.1 Two-mode factor analysis of articulation data
2.2 Accounting for inter-speaker variability
2.3 Techniques for three-way factor analysis of articulation data
2.4 Data
2.5 Generalizing to unknown articulations

3 Predicting Unseen Articulations from Multi-speaker Articulatory Models
3.1 Three-way decomposition of the multi-speaker database
3.2 Generalizations to unknown articulations
3.3 Selecting the optimum number of factors
3.4 Universal articulatory model?

4 Selecting Articulations that are Representative of a Speaker's Articulatory Strategy
4.1 Experimental procedure
4.2 Results
4.3 Discussion and concluding remarks

II Non-uniqueness in the Acoustic-to-Articulatory Inversion

5 About Acoustic-to-Articulatory Inversion
5.1 Data
5.2 Acoustic parameterization
5.3 Obtaining weights for different articulators
5.4 Measuring uncertainty using cross-modal clustering
5.5 Non-linearity in the acoustic-to-articulatory inversion

6 In Search of Non-uniqueness in the Acoustic-to-Articulatory Mapping
6.1 Introduction
6.2 Non-uniqueness based on optimal Gaussian clustering
6.3 Non-uniqueness as a function of the conditional distribution
6.4 Measuring non-uniqueness in the multi-modal (discrete) sense
6.5 Relation with critical points
6.6 Validity of the results and effect of choosing different model parameters
6.7 Concluding remarks

7 Exploring the Predictability of Non-unique Acoustic-to-Articulatory Mappings
7.1 Entropy and its relationship with Gaussian modeling
7.2 Estimating the entropy of the conditional distribution modeled as a GMM
7.3 Non-uniqueness in terms of upper bound of the conditional entropy
7.4 Analysis of non-uniqueness in the continuous sense
7.5 Relationship with entropy
7.6 Correlations with articulator movements
7.7 Interpretation of the results
7.8 Discussion about the relationship between entropy and non-uniqueness

8 Resolving Non-uniqueness in the Acoustic-to-Articulatory Mapping
8.1 Non-uniqueness along a trajectory
8.2 How often do the two types of non-unique paths occur

III Effecting Acoustic-to-Articulatory Inversion based on Statistical Data

9 Cross-modal Clustering for Explicit Modeling of Non-uniqueness
9.1 Finding the theoretical limits to acoustic-to-articulatory inversion
9.2 Addressing the non-uniqueness problem in Inversion using cross-modal clustering
9.3 Concluding Remarks

10 Using Cross-Modal Acoustic-Articulatory Clustering to Improve Speech Recognition
10.1 How to construct a CMCHMM
10.2 Phoneme recognition using CMCHMMs
10.3 Comparing CMCHMMs with the baseline HMMs
10.4 Concluding remarks

11 Acoustic-to-Articulatory Inversion based on Local Regression
11.1 Introduction
11.2 Method
11.3 Data
11.4 Experiments and results
11.5 Concluding remarks and future work

12 Mapping between Acoustic and Articulatory Gestures
12.1 Introduction
12.2 Theory and methods
12.3 Data and experiments
12.4 A new evaluation criterion for the Inversion
12.5 Results and discussion
12.6 Non-uniqueness in the mapping between acoustic-to-articulatory Gestures
12.7 Concluding remarks and future work
Appendices
12.A List of British English phonemes in the MOCHA-TIMIT database

IV Infant Language Acquisition in the Context of Acoustic-Articulatory Relationships

13 Challenges that Infants face when Learning the Sounds in a Language
13.1 Physical limitations and challenges
13.2 Challenges in acquiring static sounds of a language

14.1 Introduction
14.2 Tools, methods and data
14.3 Experiments and results
14.4 Concluding remarks

15 Using Imitation to learn Infant-Adult Acoustic Mappings
15.1 Introduction
15.2 Experimental paradigm and tools
15.3 Experiments and results
15.4 Conclusion and future work

V Conclusions

16 Conclusions and Future Directions
16.1 Articulatory modeling and speaker normalization
16.2 Acoustic-to-Articulatory Inversion
16.3 Infant language acquisition


The Thesis is based on the following Articles

Paper A: Predicting Unseen Articulations from Multi-speaker Articulatory Models

G. Ananthakrishnan, Pierre Badin, Julián Andrés Valdés Vargas and Olov Engwall

The Proceedings of Interspeech 2010, pp. 1588–1591.

Abstract: In order to study inter-speaker variability, this work aims to assess the generalization capabilities of data-based multi-speaker articulatory models. We use various three-mode factor analysis techniques to model the variations of midsagittal vocal tract contours obtained from Magnetic Resonance Imaging (MRI) images for three French speakers articulating 73 vowels and consonants. Articulations of a given speaker for phonemes not present in the training set are then predicted by inversion of the models from measurements of these phonemes articulated by the other subjects. On average, the prediction RMSE was 5.25 mm for tongue contours, and 3.3 mm for 2D midsagittal vocal tract distances. In addition, this study has established a methodology to determine the optimal number of factors for such models.

Contributions of the co-authors: PB and JAVV were involved with collecting the data as well as hand-tracing the tongue contours. Additionally, PB contributed towards techniques which were used as a baseline, namely the classical PCA. PB and OE contributed towards revising the drafts leading to the final article. GA developed and implemented the techniques, set up and performed the experiments and wrote the article.
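
As a rough illustration of the kind of three-mode factor analysis mentioned in the abstract above (not the authors' code; the tensorly package, the tensor shape and the random data are assumptions), a speaker x articulation x contour-coordinate tensor can be decomposed with PARAFAC and the reconstruction summarized with an RMSE:

```python
# Illustrative sketch with random data; real input would be traced vocal tract
# contours stacked into a (speakers x articulations x contour coordinates) tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

X = np.random.default_rng(0).standard_normal((3, 73, 26))
weights, factors = parafac(tl.tensor(X), rank=4, normalize_factors=True)
X_hat = tl.cp_to_tensor((weights, factors))
print("reconstruction RMSE with 4 factors:", float(np.sqrt(np.mean((X - X_hat) ** 2))))
```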

Paper B: Selecting Articulations that are Representative of a Speaker's Articulatory Strategy

G. Ananthakrishnan, Pierre Badin, Julián Andrés Valdés Vargas and Laurent Lamalle

Submitted to IEEE Transactions on Audio, Speech and Language Processing.

Abstract: This paper proposes to apply a 3-way linear analysis technique, namely Tucker3 decomposition (3-mode PCA), in order to find a reduced number of articulations which are representative of the speaker's articulation strategy as well as the anatomy of the speaker's vocal tract. Based on Magnetic Resonance Imaging (MRI) data collected on three native French speakers with 73 articulations (including 10 oral and 3 nasal vowels), the paper proposes to find which of the vowels are most suitable for generalizing a speaker's specific articulation strategy, based on articulation data from other speakers. The experiments show that in order to generalize a speaker's articulatory strategy, one needs to collect data for 21 to 31 articulations. These can be used to explain around 66% of the variance of the unknown articulations. Based on perceptual studies on oral vowels, it was shown that it was possible to make viable predictions for a subset of the vowels in the database based on the method described.

Contributions of the co-authors: PB contributed towards revising the drafts leading to the final article as well as pointers on converting midsagittal distances to area functions, besides conducting the perceptual experiments. GA developed and implemented the techniques, analyzed the results, set up the perceptual experiments and wrote the article. PB and LL were involved in collecting the data. PB, GA and JAVV were involved in making MRI contour tracings.

Paper C: The Acoustic to Articulation Mapping: Non-linear or Non-unique?

Daniel Neiberg, G. Ananthakrishnan and Olov Engwall

The Proceedings of Interspeech 2008, pp. 1485–1488.

Abstract: This paper studies, statistically, the hypothesis that the acoustic-to-articulatory mapping is non-unique. The distributions of the acoustic and articulatory spaces are obtained by fitting the data to a Gaussian Mixture Model. The kurtosis is used to measure the non-Gaussianity of the distributions, and the Bhattacharyya distance is used to find the difference between distributions of the acoustic vectors producing non-unique articulator configurations. It is found that the mappings for stop consonants and alveolar fricatives are generally not only non-linear but also non-unique, while that for dental fricatives is found to be highly non-linear but fairly unique. Two more investigations are also discussed: the first is on how well the best possible piecewise linear regression is likely to perform, the second is on whether dynamic constraints improve the ability to predict different articulatory regions corresponding to the same region in the acoustic space.

Contributions of the co-authors: GA and DN developed the methodology for measuring non-linearity and non-uniqueness together. DN ran the experiments and analyzed the results. GA developed the methodology and ran the experiments regarding establishing the theoretical limits of piece-wise linear regression. The paper was written together by DN and GA with inputs from OE.
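
For reference, the Bhattacharyya distance used in the paper, here computed between two Gaussian approximations of acoustic distributions, can be written in a few lines (an illustrative sketch with hypothetical inputs, not the paper's code):

```python
import numpy as np

def bhattacharyya_gaussian(mu1, S1, mu2, S2):
    """Bhattacharyya distance between two multivariate Gaussians N(mu1, S1) and N(mu2, S2)."""
    S = 0.5 * (S1 + S2)
    diff = mu1 - mu2
    maha = 0.125 * diff @ np.linalg.solve(S, diff)
    _, ld = np.linalg.slogdet(S)
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return maha + 0.5 * (ld - 0.5 * (ld1 + ld2))

# e.g. compare the acoustic distributions of two hypothetical articulatory clusters:
# d = bhattacharyya_gaussian(mean_a, cov_a, mean_b, cov_b)
```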

Paper D: In Search of Non-uniqueness in the Acoustic-to-Articulatory Mapping

G. Ananthakrishnan, Daniel Neiberg and Olov Engwall

The Proceedings of Interspeech 2009, pp. 2799–2802.

Abstract: This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech, from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.

Contributions of the co-authors: DN proposed and prepared the data for comparison between non-uniqueness and the critical points in the trajectory. OE gave critical suggestions for improving the quality of the article in successive revisions of the draft. The main concept, implementation and writing of the paper were done by GA.

Paper E: Resolving Non-uniqueness in the Acoustic-to-Articulatory Mapping

G. Ananthakrishnan and Olov Engwall

The Proceedings of the International Conference on Acoustics, Speech and Signal Processing 2011, pp. 4628–4631.

Abstract: This paper studies the role of non-uniqueness in the Acoustic-to-Articulatory Inversion. It is generally believed that applying continuity constraints to the estimates of the articulatory parameters can resolve the problem of non-uniqueness. This paper tries to find out whether all instances of non-uniqueness can be resolved using continuity constraints. The investigation reveals that applying continuity constraints provides the best estimate in roughly 50 to 53% of the non-unique mappings. Roughly 8 to 13% of the non-unique mappings are best estimated by choosing discontinuous paths along the hypothetical high-probability estimates of articulatory trajectories.

Contributions of the co-authors: OE provided useful suggestions towards writing the article. GA developed and implemented the methodology, set up and ran the experiments and wrote the article.
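
The continuity-constraint idea discussed above can be illustrated with a small dynamic-programming sketch (a simplification with hypothetical shapes, not the paper's method): given several candidate articulatory positions per frame, pick the sequence that minimizes the total frame-to-frame movement.

```python
# Illustrative sketch: choose one candidate articulatory position per frame so that
# the chosen trajectory is as smooth (continuous) as possible.
import numpy as np

def smoothest_path(candidates):
    """candidates: list over frames, each an (M_t x D) array of candidate positions.
    Returns the index of the chosen candidate at every frame."""
    cost = np.zeros(len(candidates[0]))   # accumulated cost per candidate at frame 0
    back = []                             # back-pointers for each transition
    prev = candidates[0]
    for cand in candidates[1:]:
        # step[i, j]: movement from candidate j at the previous frame to candidate i now
        step = np.linalg.norm(cand[:, None, :] - prev[None, :, :], axis=-1)
        total = step + cost[None, :]
        back.append(np.argmin(total, axis=1))
        cost = np.min(total, axis=1)
        prev = cand
    path = [int(np.argmin(cost))]
    for b in reversed(back):              # backtrack from the last frame to the first
        path.append(int(b[path[-1]]))
    return path[::-1]

# path = smoothest_path(per_frame_candidate_positions)
```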


Paper F: On Acquiring Speech Production Knowledge from Articulatory Measurements for Phoneme Recognition

Daniel Neiberg, G. Ananthakrishnan and Mats Blomberg

The Proceedings of Interspeech 2009, pp. 1387–1390.

Abstract: The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and, to some extent, the semi-vowels, there is a decrease in accuracy for the remaining phonemes.

Contributions of the co-authors: The development of the concept and the design of the experiments were done together by DN and GA. DN set up and ran the experiments and analyzed the results. GA built the HMM-BN tools required to set up the experiment. The paper was written together by DN and GA with inputs from MB.

Paper G: Acoustic-to-Articulatory Inversion based on Local Regression

Samer Al Moubayed and G. Ananthakrishnan

The Proceedings of Interspeech 2010, pp. 937–940. The author names are placed in alphabetical order.

Abstract: This paper presents an Acoustic-to-Articulatory inversion method based on local regression. Two types of local regression, a non-parametric and a local linear regression, have been applied on a corpus containing simultaneous recordings of the positions of articulators and the corresponding acoustics. A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied on the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the non-parametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models using the same acoustic and articulatory features.

Contributions of the co-authors: SAM proposed the initial idea and implemented the non-parametric regression. GA proposed and implemented the parametric regression, set up the experiments and wrote the article.
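
A minimal sketch of the local linear regression idea described above (assumed variable names, scikit-learn as the toolkit, and a ridge-regularized fit added for numerical stability; not the paper's implementation): for every test acoustic frame, fit a linear map on its K nearest acoustic neighbours in the training data.

```python
# Illustrative sketch: local linear acoustic-to-articulatory regression.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import Ridge

def local_linear_inversion(A_train, X_train, A_test, K=50):
    """A_*: acoustic feature matrices, X_train: articulatory positions (same row order)."""
    nn = NearestNeighbors(n_neighbors=K).fit(A_train)
    preds = np.empty((len(A_test), X_train.shape[1]))
    for i, a in enumerate(A_test):
        idx = nn.kneighbors(a[None, :], return_distance=False)[0]
        reg = Ridge(alpha=1e-3).fit(A_train[idx], X_train[idx])   # local linear fit
        preds[i] = reg.predict(a[None, :])[0]
    return preds

# rmse_mm = np.sqrt(np.mean((local_linear_inversion(A_tr, X_tr, A_te) - X_te) ** 2))
```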

Paper H: Important Regions in the Articulator Trajectory

G. Ananthakrishnan and Olov Engwall

The Proceedings of the International Seminar on Speech Production 2008, pp. 305–308.

Abstract: This paper deals with identifying important regions in the articulatory trajectory based on the physical properties of the trajectory. A method to locate critical time instants as well as the key articulator positions is suggested. Acoustic-to-Articulatory Inversion using linear and non-linear regression is performed using only these critical points. The accuracy of inversion is found to be almost the same as using all the data points.

Contributions of the co-authors: OE provided important suggestions towards writing the paper. GA designed and performed the experiments besides writing the paper.
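
One simple, hypothetical way to pick such critical time instants from a single articulatory channel is sketched below: take the samples where the movement changes direction (velocity sign changes). This is only an assumed stand-in definition for illustration; the paper's actual criterion is based on the physical properties of the trajectory and may differ.

```python
# Illustrative sketch: direction-change instants of a 1-D articulatory trajectory.
import numpy as np

def direction_change_points(y):
    """Indices where the velocity of a 1-D trajectory changes sign."""
    v = np.gradient(y)
    sb = np.signbit(v).astype(np.int8)
    return np.where(np.diff(sb) != 0)[0] + 1

# Example: a synthetic oscillating trajectory has critical points near its extrema.
t = np.linspace(0, 1, 500)
print(direction_change_points(np.sin(2 * np.pi * 3 * t)))
```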


Paper I: Mapping between Acoustic and Articulatory Gestures

G. Ananthakrishnan and Olov Engwall

The Journal of Speech Communication 2011, 53(4): 567–589.

Abstract: This paper proposes a definition for articulatory as well as acoustic gestures, along with a method to segment the measured articulatory trajectories and the acoustic waveform into gestures. Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using the 2D-DCT with the same transformation as that applied on the acoustics. The relationship between the detected acoustic and articulatory gestures, in terms of the timing as well as the shape, is studied. Acoustic-to-articulatory inversion is also performed using a GMM-based regression, in order to study this relationship further. The accuracy of predicting the articulatory trajectories from the acoustic waveform is on par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45–1.55 mm for the two speakers in the database). In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the articulatory trajectories estimated using the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony.

Contributions of the co-authors: OE was involved in the discussion related to developing the critical trajectory error measure, besides providing critical suggestions on improving successive revisions of the paper. The remaining development, experiments, analysis and writing related to the article were completed by GA.
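
The 2-D parameterization described in the abstract above can be sketched as follows (illustrative segment shape and coefficient counts, not the article's code): apply a separable DCT to a time-by-channel segment, keep only a low-order block of coefficients, and reconstruct from them.

```python
# Illustrative sketch of a 2-D DCT parameterization of a trajectory segment.
import numpy as np
from scipy.fft import dctn, idctn

segment = np.random.randn(40, 14)          # e.g. 40 frames x 14 articulatory channels
C = dctn(segment, type=2, norm="ortho")
P, Q = 6, 4                                # low-order block of coefficients kept
C_trunc = np.zeros_like(C)
C_trunc[:P, :Q] = C[:P, :Q]
reconstruction = idctn(C_trunc, type=2, norm="ortho")
print("per-sample RMSE of the truncated reconstruction:",
      np.sqrt(np.mean((segment - reconstruction) ** 2)))
```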

Paper J: Imitating Adult Speech: An Infant's Motivation

G. Ananthakrishnan

The Proceedings of International Seminar on Speech Production 2011, pp. 361–368.

Abstract: This paper discusses two aspects of speech acquisition by infants which are often assumed to be intrinsic or innate knowledge, namely the number of degrees of freedom in the articulatory parameters and the acoustic correlates that find the correspondence between adult speech and the speech produced by the infant. The paper shows that being able to distinguish the different vowels in the vowel space of a certain language, both by the infants as well as the adults, is a strong motivation for choosing both these parameters.

Contributions of the co-authors: This is a single-author paper.

Paper K: Using Imitation to learn Infant-Adult Acoustic Mappings

G. Ananthakrishnan and Giampiero Salvi

The Proceedings of Interspeech 2011, pp. 765–768.

Abstract: This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant and adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces, which are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback is in terms of an overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic but continuous speech data, we demonstrate that clusters which have a good topological correspondence are perceived to be similar by a phonetically trained listener.

Contributions of the co-authors: GS was involved in formulating the problem and proposing the model, besides contributing important suggestions towards writing the paper. GA designed and implemented the model, set up the experiments and analyzed the results, besides writing the paper.
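
A conceptual sketch of the cluster-correspondence idea described above (all data, names and the formant-ratio normalization are assumptions for illustration): adult and infant utterances are clustered separately, and clusters are paired by proximity of their normalized centroids. Note that the model in the paper selects the topology through the adult's overall rating of imitations rather than through this direct matching.

```python
# Illustrative sketch: pair unsupervised adult/infant acoustic clusters.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def pair_clusters(adult_feats, infant_feats, n_clusters=8):
    ad = KMeans(n_clusters=n_clusters, n_init=10).fit(adult_feats)
    inf = KMeans(n_clusters=n_clusters, n_init=10).fit(infant_feats)
    # Crude speaker normalization: divide higher formants by F1 (formant-ratio space)
    def ratios(c):
        return c[:, 1:] / c[:, :1]
    cost = np.linalg.norm(ratios(ad.cluster_centers_)[:, None, :] -
                          ratios(inf.cluster_centers_)[None, :, :], axis=-1)
    return linear_sum_assignment(cost)     # Hungarian matching of cluster centroids

# Synthetic placeholder formant data (F1, F2, F3) for adults and an infant
adult_formants = np.abs(np.random.randn(400, 3)) * [200, 400, 500] + [500, 1500, 2500]
infant_formants = np.abs(np.random.randn(300, 3)) * [300, 600, 800] + [700, 2000, 3300]
rows, cols = pair_clusters(adult_formants, infant_formants)
print(list(zip(rows.tolist(), cols.tolist())))
```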


List of other publications by the author

1. Using an Ensemble of Classifiers for Mispronunciation Feedback

Ananthakrishnan, G., Wik, P., Engwall, O. and Abdou, S.

in Proceedings of SLaTE 2011, Venice, Italy.

2. Tracking pitch contours using minimum jerk trajectories

Neiberg, D., Ananthakrishnan, G. and Gustafson, J.

in Proceedings of Interspeech 2011, Florence, Italy.

3. Detecting confusable phoneme pairs for Swedish language learners depending on their first language

Ananthakrishnan, G., Wik, P. and Engwall, O.

in Proceedings of Fonetik 2011, Stockholm, Sweden.

4. An acoustic analysis of lion roars. II: Vocal tract characteristics

Ananthakrishnan, G., Eklund, R., Peters, G. and Mabiza, E.

in Proceedings of Fonetik 2011, Stockholm, Sweden.

5. Automatic Prominence Classification in Swedish

Al Moubayed, S., Ananthakrishnan, G. and Enflo, L.

in Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence, Chicago, USA.

6. Classification of Affective Speech using Normalized Time-Frequency Cepstra

Neiberg, D., Laukka, P. and Ananthakrishnan, G.

in Proceedings of Prosody 2010, Chicago, USA.

7. Detection of Specific Mispronunciations using Audiovisual Features

Picard, S., Ananthakrishnan, G., Wik, P., Engwall, O. and Abdou, S.

in Proceedings International Conference on Auditory-Visual Speech Processing 2010, Kanagawa, Japan.

8. Cross-modal Clustering in the Acoustic-Articulatory Space

Ananthakrishnan, G. and Neiberg, D.

in Proceedings of Fonetik 2009, Stockholm, Sweden.

9. Audiovisual speech inversion by switching dynamical modeling Governed by a Hidden Markov Process

Katsamanis, N., Ananthakrishnan, G., Papandreou, G., Engwall, O. and Maragos, P.

in Proceedings of EUSIPCO 2008, Lausanne, Switzerland.

10. On the Non-uniqueness of Acoustic-to-Articulatory Mapping

Neiberg, D. and Ananthakrishnan, G.


List of Acronyms and Abbreviations

MRI Magnetic Resonance Imaging

EMA Electromagnetic Articulography

EPG Electropalatography

US Ultra Sound

2DMDF Two Dimensional Midsagittal Distance Function
Inversion, A-to-A Acoustic-to-Articulatory Inversion

DCT Discrete Cosine Transform

2D-DCT Two Dimensional DCT

LJ EMA coil on the lower jaw

LL EMA coil on the lower lip

UL EMA coil on the upper lip

TT EMA coil on the tongue tip

TB EMA coil on the tongue body

TD EMA coil on the tongue dorsum

V EMA coil on the velum

RMSE Root Mean Square Error

CC (Pearson’s) Correlation Coefficient

EM Expectation Maximization

MAP Maximum A-posteriori Probability

MCMAP Maximum Cross-modal A-posteriori Probability

GMM Gaussian Mixture Models

HMM Hidden Markov Models

SVM Support Vector Machine

RFE Recursive Feature Elimination

RBF Radial Basis Function

PCA Principal Component Analysis

ANN Artificial Neural Network

PARAFAC Parallel Factor Analysis

CVC Consonant-Vowel-Consonant

VCV Vowel-Consonant-Vowel

MMSE Minimum Mean Square Error

MLTE Maximum Likelihood Trajectory Estimate

GMMR Gaussian Mixture Model Regression

DBN Dynamic Bayesian Network

CMCHMM Cross-modal Coupled HMM

MFCC Mel Frequency Cepstral Coefficients

ERB Equivalent Rectangular Bandwidth

CTE Critical Trajectory Error

VLAM Variable Length Articulatory Model

IPA International Phonetic Association

ASCII American Standard Code for Information Interchange
BIC Schwarz or Bayesian Information Criterion

SOM Self Organizing Maps

TH Tongue Height parameter for VLAM

TS Tongue Shape parameter for VLAM

LP Lip Protrusion parameter for VLAM

JH Jaw Height parameter for VLAM

ASP Along the Same Path

WCP With Change in Path


List of Figures

1.1 A block-diagram of the speech communication loop
2.1 Illustration of the effect of applying control parameters learnt on one speaker to the model of another speaker
2.2 Explanation of the decomposition using PARAFAC
2.3 Examples of 'as visible' tongue contours for speaker 'pb'
2.4 The closest distances between every point on the lower contour to the closest point on the upper contour
2.5 The mid-points of the closest distances, the polynomial fit with equal sampling and the 13 points used to estimate the 2DMDF
2.6 An example 2DMDF plot
2.7 Plot showing the differences in 2DMDFs for the three subjects
3.1 Comparison of the RMSE (cm) for the different three-way methods (vowels)
3.2 Comparison of the RMSE (cm) for the different three-way methods (all phonemes)
3.3 The comparative performance in terms of average RMSE (cm) of the generalization properties of the three-way models
3.4 Comparison of the average RMSE (cm) made in generalization to the 'unknown' articulation for each subject
3.5 Prediction of the 'tongue contour' using 1st level inversion for phoneme /a/
3.6 Prediction of the 'tongue contour' using 1st level inversion for phoneme /i/
3.7 Plot showing the cross-validation method to determine the optimum number of parameters for a single speaker model
3.8 Figures showing how the first three factors/control parameters control different articulation shapes of the tongue for different speakers
4.1 Plots showing the RMSE (cm), explained variance and optimum number of factors for building models using the best combination of ζ known articulations for training the model
4.2 Plot showing the average RMS error for different speakers and vowels
4.3 The first three formant values of the synthesized vowels in Experiment 2a
4.4 The first three formant values of the synthesized vowels in Experiment 2b
4.5 The vocal tract contours based on the predictions of the 2DMDF for speaker 'pb' over different numbers of articulations used for training, ζ
5.1 The positions of the EMA coils on the speaker's articulators
5.2 Non-linear separation boundary for Weighted RMSE
5.3 Simulated data to illustrate the effect of cross-modal clustering
5.4 Clustering of simulated data on application of the MAP criterion
5.5 Clustering of simulated data on application of the MCMAP criterion
5.6 Formant space and articulatory positions for vowel /e/
5.7 Formant space and articulatory positions of LJ for vowel /ə/
5.8 The figure shows the overall uncertainty (for the whole articulatory configuration; d = 14) for the British English vowels using Equation 5.10
5.9 The figure shows the uncertainty for individual articulators (d = 2) for the British vowels using Equation 5.10
5.10 Non-linearity for vowels
5.11 Non-linearity for consonants
6.1 Examples showing frames with similar acoustic features while their corresponding articulatory features show large variation
6.2 Figure explaining how choosing the correct level of quantization is crucial for estimating non-uniqueness
6.3 Non-uniqueness for vowels based on Equation 6.2
6.4 Non-uniqueness for consonants based on Equation 6.2
6.5 Acoustic and articulatory subspaces for phoneme /l/. The same data points have different distributions in the articulatory space corresponding to different distributions in the acoustic space
6.6 Acoustic and articulatory subspaces for phoneme /t/. The same data points have different distributions in the articulatory space and almost the same distribution in the acoustic space
6.7 Illustration of the modeling of the joint acoustic-articulatory space using a GMM, for a hypothetical example
6.8 The conditional probability distribution for the different articulator coils along the midsagittal plane
6.9 The histogram showing the distribution of the normalized non-uniqueness
6.10 Graph comparing the level of non-uniqueness (α^p_a) of different articulators for vowels, plosives and fricatives grouped according to their category
6.11 Graph comparing the level of non-uniqueness (α^p_a) of different articulators for other sonorants grouped according to their category
6.12 Graph showing the relationship between the NNuM and the distance from the critical point
7.1 Illustration of the upper bound of the entropy for a bimodal distribution
7.2 Comparison of the nf^NuM_{a,p} and normalized nΞ_{a,p} for LJ, UL and LL
7.3 Comparison of the nf^NuM_{a,p} and normalized nΞ_{a,p} for TT, TB and TD
8.1 Sequences of conditional probability functions of the articulatory coils given the acoustic features for the female speaker
8.2 Sequences of conditional probability functions of the articulatory coils given the acoustic features for the male speaker
8.3 Illustration of the two types of articulatory trajectory 'paths'
8.4 Illustration of an example where the TT coil has been observed to have switched from Path 1 to Path 2, even though Path 1 satisfies the continuity constraints
8.5 Comparison of the frequency of occurrences of each type of non-uniqueness with respect to the duration of the non-unique patch
10.1 Inference graph of a fully coupled HMM or CMCHMM chain
10.2 Phoneme recognition performance for the female speaker using lr/skip CMCHMMs where only acoustics is used for testing
10.3 Phoneme recognition performance for the female speaker using lr/ergodic CMCHMMs where only acoustics is used for testing
11.1 Graph illustrating the effect of K on the mRMSE over the MMSE estimates for the two local regression methods
11.2 Comparative performance between the two local regression methods against GMMR
11.3 Graph illustrating the effect of K and ND on the mRMSE over the MLTEs for the two local regression methods
11.4 Comparative performance between the two local regression methods against GMMR when ND = 1 (MLTEs)
11.5 Figure showing the measured and estimated TT trajectories along the vertical axis using the different local regressions
12.1 The original and smoothed versions of an EMA coil placed on the tongue tip
12.2 The importance function and the segmentation of a gesture for an EMA coil placed on the tongue tip
12.3 The frequency magnitude response of the ERB filter-banks W_k with B = 80 Hz and 45 filters
12.4 A part of the spectrogram from ERB filter-bank outputs
12.5 Illustration of the short- and long-term effects of minimum jerk smoothing on the spectrogram
12.6 The original and reconstructed spectrogram using the 2D-cepstrum
12.7 Multiple hypotheses about the articulatory gestures from the gesture
12.8 How the Critical Trajectory Error (CTE) measures are calculated
12.9 Comparison of the mean RMSE (mm) of the trajectory reconstruction by interpolation using only the critical points
12.10 The relationship between the critical points in the acoustic signal and the different articulatory channels
12.11 Difference in the standard deviations of the critical points corresponding to phoneme /p/ with respect to other points of the phoneme
12.12 Difference in the standard deviations of the critical points corresponding to phoneme /a/ with respect to other points of the phoneme
12.13 Phonemes with the top 10 probabilities of the co-occurrence between the articulatory critical point and the acoustic critical point for LJ
12.14 Phonemes with the top 10 probabilities of the co-occurrence between the articulatory critical point and the acoustic critical point for UL
12.15 Phonemes with the top 10 probabilities of the co-occurrence between the articulatory critical point and the acoustic critical point for LL
12.16 Phonemes with the top 10 probabilities of the co-occurrence between the articulatory critical point and the acoustic critical point for TT
12.17 Phonemes with the top 10 probabilities of the co-occurrence between the articulatory critical point and the acoustic critical point for TB
12.18 Phonemes with the top 10 probabilities of the co-occurrence between the articulatory critical point and the acoustic critical point for TD
12.19 Comparisons of the mRMSE, over a ten-fold cross-validation, for the gesture-based method with respect to the frame-based method
12.20 Comparisons of the mCC, over a ten-fold cross-validation for different methods, for the gesture-based method with respect to the frame-based method
12.21 Comparisons of the mCTE, over a ten-fold cross-validation for different methods, for the gesture-based method with respect to the frame-based method
12.22 The mRMSE in mm for individual phonemes, vowels and other sonorants, and different articulators, from bottom to top, for the gesture-based method
12.23 The mRMSE in mm for individual phonemes, diphthongs and other consonants, and different articulators, from bottom to top, for the gesture-based method
12.24 The plot of the conditional distribution of the articulatory gestures, given the acoustic gesture, for the phoneme /ɹ/ in the phrase 'great rhythm' spoken by the male speaker
13.1 The different challenges that infants face in learning speech communication
14.1 The result of the copy synthesis using a vocal tract model of 7–9 month old infants, by varying J (synchronous with TH) and acoustic correlates 'Z&T' ('Dir-J-7-9' condition)
14.2 The result of the copy synthesis using a vocal tract model of 7–9 month old infants, by varying J (synchronous with TH)
14.3 The result of the copy synthesis, by varying two parameters, J (synchronous with TH) and TS
14.4 The result of the copy synthesis using a vocal tract model of 13–15 month old infants, by varying three parameters, J (synchronous with TH)
15.1 The block diagram of our model for acquiring speaker normalization
15.2 Illustration of the adult feedback in terms of RMSE over 22 utterances between the original adult speech and the infant imitation using the 'transformed vocal tract' method
15.3 Figure showing the topologically corresponding adult-infant clusters in the formant ratio space

List of Tables

4.1 The results of the perceptual tests performed by the 4 listeners for Experiment 2a
4.2 The results of the perceptual tests performed by the 4 listeners for Experiment 2b
4.3 List of French vowels
4.4 Table showing the best combinations of articulations selected for training the models for values of ζ = 1 to 21
4.5 Table showing the best combinations of articulations selected for training the models for different values of ζ = 22 to 31
4.6 Table showing the best combinations of articulations selected for training the models for different values of ζ = 41 to 71
5.1 Weights obtained from the SVM-RFE algorithm
6.1 Table indicating the percentage of frames with more than one mode for different articulators
6.2 Table showing the average absolute non-uniqueness (NuM) in mm for different phoneme categories, estimated only for those instances where more than one peak has been observed in the conditional probability distribution
6.3 Table showing the results of the hypothesis testing for the hypothesis that the slope of the regression between NNuM and distance from the critical point is positive
6.4 Table showing the average entropy of the distribution between the different modes in the articulatory space over small time-divisions in the data time-series
6.5 Percentage of non-unique (multi-modal sense) frames as against the number of GMM components used
6.6 Table showing the percentage of frames for which at least one of the nearest 10 frames to each peak in the conditional distribution was given the same phonemic label as the current data frame
6.7 Table showing the % of non-unique frames for the TT coil with the male speaker (msak) for different lengths of context windows
7.1 Table showing the mean upper bounds of the conditional entropy for different articulator coils
7.2 Table showing the mean upper bounds of the conditional entropy for different phoneme classes
7.3 Comparison between the CC for the non-uniqueness between the different articulator coils and the CC for the actual positions of the coils
9.1 The RMSE in mm for the best possible piecewise linear prediction for the female speaker where cluster correspondence is known a priori
9.2 The mRMSE for the ideal prediction without using dynamic constraints, for the female speaker, assuming the within-cluster error is zero
9.3 The mRMSE for the ideal prediction using dynamic constraints for the female speaker, minimizing distance
9.4 The mRMSE for the ideal prediction using dynamic constraints for the female speaker, minimizing velocity
9.5 RMSE for different combinations of Articulatory and Acoustic Gaussian components in the MMSE method
9.6 RMSE for different combinations of Articulatory and Acoustic Gaussian components in the MLTE with dynamic features method
9.7 The CC for different combinations of Articulatory and Acoustic Gaussian components in the MMSE method
9.8 The CC for different combinations of Articulatory and Acoustic Gaussian components in the MLTE with dynamic features method
9.9 RMSE for different combinations of Articulatory and Acoustic Gaussian components in the MMSE method along with dynamic programming to find the best possible articulatory components
10.1 Full evaluation for the best baseline (26 Gaussians) and the best lr/ergodic CMCHMM (12 Gaussians/3 states)
10.2 Evaluation for predicting the Articulatory states and the corresponding improvement in recognition accuracy when the articulatory data is unavailable as against when it is available
12.1 Performance of the acoustic gesture detection algorithm
12.2 Table showing the results of the different types of regression (MMSE) on the entire training data-set as against selecting only the critical points from the training data to perform the training
12.3 Table comparing the performance of the proposed method for different window lengths (ws), number of 'quefrency' components (P) and number of 'meti' components (Q)
12.4 Table comparing the performance of the 'Gesture'-based and frame-based algorithms in terms of mRMSE and its standard deviation for different phoneme classes
12.5 Table indicating the percentage of frames with more than one mode for different articulators using the frame-based and gesture-based methods
14.1 Table showing the ratings provided by the three Swedish listeners on vowel productions for different conditions
15.1 Ratings by the phonetically trained listener on the correspondence between the infant and adult topologically mapped clusters for the best topological configuration using the 'Mil' method


1 Introduction

1.1 The speech communication mechanism

From a purely communication-systems point of view, speech is a channel to exchange information between human beings. The signal, which is essentially the acoustic waves carried through the medium (air), is modulated by several layers of information. Speech production is the means of encoding this information into the speech channel. The most obvious components used for speech production are the voice source in the larynx and the articulators, namely the tongue, jaws, lips, velum and the pharynx. In addition, there are cognitive, perceptual and motor aspects of speech production. These aspects are important for the muscles in the articulators to achieve a certain state of distention or relaxation so that the corresponding acoustic signal has quite precise qualities. These acoustic qualities are perceived by the listener and decoded into relevant information, thus completing the act of communication. Speech perception involves, among other things, the cochlea (inner ear), which breaks the signal into frequency components along the basilar membrane and then converts them into neural signals. This is followed by decoding the specific acoustic features encoded by the speaker to extract relevant information. Figure 1.1 illustrates this process with the help of a block diagram.

The connection from articulation to acoustics is directly observable, since a physical system realizes this segment of the communication loop. Studying it is therefore direct, although difficult. Modeling this direction of the process has gained ground through several efforts in building articulatory synthesizers (e.g., Maeda, 1982; Iskarous et al., 2003; Birkholz and Jackèl, 2003). Although the synthesis quality of these models is not as natural as that of other speech synthesis methods, the degree of control they offer is high, helping researchers understand the process of speech production.
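To make the forward direction concrete, the following minimal sketch (not from the thesis) computes the resonances of the simplest possible articulatory model: a uniform tube closed at the glottis and open at the lips, whose formants lie at odd multiples of c/(4L). Articulatory synthesizers such as Maeda's generalize this by letting the cross-sectional area vary along the vocal tract. The tube length and speed of sound below are assumed textbook values.

```python
C = 35000.0  # speed of sound in warm, humid air (cm/s), an assumed round value

def uniform_tube_formants(length_cm, n_formants=3):
    """Resonance frequencies (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * C / (4.0 * length_cm) for n in range(1, n_formants + 1)]

# An adult male vocal tract of roughly 17.5 cm gives the familiar 500/1500/2500 Hz pattern.
print(uniform_tube_formants(17.5))
```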

Figure 1.1: A block diagram of the speech communication loop. The question mark indicates the doubt regarding the need for establishing this connection during perception of normal speech communication.

Modeling the inverse direction of the process, i.e., from acoustics to articulation, is not as straightforward and is difficult to verify. There are several theories of speech perception, and not all of them maintain that the inverse direction is even necessary for decoding speech. Some theories, like the Motor theory (Liberman et al., 1967) and the Direct realist theories of speech perception (Fowler, 1996; Diehl et al., 2004), propose that acoustics are converted into an invariant articulatory representation before perception and understanding. Some other theories, like the Acoustic Landmark Theory (Quantal Theory) (Stevens, 2002), accept the role of articulatory knowledge in speech perception but do not propose an articulatory representation as the unit of perception. Other speech perception theories, like the Fuzzy-logical model (Massaro, 1989), propose that speech perception is independent of an articulatory representation.

The main disagreement between these theories concerns the unit of information that is encoded and decoded in speech communication. What is agreed upon, however, is that this unit of information should be fairly invariant across different speakers (although some amount of variation is expected); otherwise the transfer of information would not be possible. Many factors affect the transmission itself, such as noise in the environment, the structure of the pinnae, and the position and orientation of the speaker's and listener's heads. The units of information should not be affected by these factors. Decoding the speech signal also involves a classification problem on the listener's part, who needs to assign the right labels to the sequence of sounds uttered by the speaker. Even if human beings can agree upon the class labels of these sounds, measurement of their acoustic features reveals large differences, not only between productions from speakers of various ages and sexes, but also across different conversational contexts and emotional states of the speakers. Some of these differences are due to anatomical differences in the speakers' vocal tracts. A child's vocal tract has different properties (in particular, its length) compared to an adult's, and there are differences between males and females, although to a smaller extent.
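Since all formant frequencies scale roughly inversely with vocal tract length, many recognition and normalization schemes absorb such anatomical differences by warping the frequency axis with a single speaker-dependent factor. The sketch below is not a method from this thesis; it shows one common piecewise-linear form of vocal tract length normalization, with an assumed breakpoint and Nyquist frequency.

```python
import numpy as np

def vtln_warp(freqs_hz, alpha, f_nyquist=8000.0):
    """Piecewise-linear frequency warping often used for vocal tract length
    normalization: frequencies below a breakpoint are scaled by alpha, and the
    rest are mapped linearly so that f_nyquist maps onto itself."""
    freqs = np.asarray(freqs_hz, dtype=float)
    f0 = f_nyquist * min(0.875, 0.875 / alpha)   # breakpoint, kept below Nyquist
    lower = alpha * freqs
    upper = alpha * f0 + (f_nyquist - alpha * f0) * (freqs - f0) / (f_nyquist - f0)
    return np.where(freqs <= f0, lower, upper)

# Formants of a short (child-like) tract warped towards an adult-like range:
print(vtln_warp([875.0, 2625.0, 4375.0], alpha=0.6))
```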

Whether or not an inverse mapping from acoustics to articulation occurs during speech perception, there are some situations in which this inverse mapping is relevant. The first is when infants develop the ability to speak. They need to learn the correspondences between the sounds that adults produce and the articulatory configurations of their own vocal tracts that allow them to reproduce those speech sounds (Guenther, 2006). This phase is quite special, and it can be argued that it occurs only once in a lifetime for most humans. The second situation pertains to people learning a new language, often as adults. The new language may have a new set of sounds that are classified in a different way. Many adults succeed not only in perceiving this new set of speech sounds, but also in producing them. In this situation, the study of the acoustics-to-articulation mapping again becomes important (Wilson and Gick, 2006). The third situation is that of entrainment, where two or more speakers tend to change their phonetic cues in order to sound more like each other (Bailly and Lelong, 2010). Here the decoding also necessarily involves an inverse process, whereby the acoustic targets of the other speaker's speech need to be mapped onto articulatory configurations that enable the speaker to sound more like that speaker.

This thesis does not deal with trying to establish whether an inverse mapping needs to be performed during perception of normal speech. It deals more with the problems that one encounters if and when the inverse mapping is performed and proposes some solutions with respect to certain aspects and in specific situations.

There are several applications for performing this inverse mapping computationally. One application that has become increasingly popular is to give language learners feedback using intra-oral visualizations. One difficulty in applying inversion here is that speakers differ quite substantially in their vocal tract anatomy. On the one hand, this restricts how similar the sounds produced by different people can be; on the other hand, it induces people to develop different articulatory strategies to cope with these anatomical differences. The differences in articulatory strategies may, however, be motivated not only by anatomy but also by individual speaker idiosyncrasies. In any case, every speaker exhibits distinctly individualized articulatory behavior. We attempt to address this problem using an articulatory modeling approach.


1.2 Articulatory modeling and speaker normalization (Part I)

Articulatory modeling, which accounts for the variations a speaker is able to make in order to produce the different phonemes of a language, has been of special interest to speech researchers because it provides insight into the mechanisms that control the production of speech. Traditionally, articulatory modeling has been used for building realistic speech synthesis systems that model the actual articulations of human beings (Maeda, 1979). The use of articulatory models has recently spread to applications such as animated avatars with intra-oral representations (Hueber et al., 2011; Engwall et al., 2006), which serve as a tool to provide feedback on a speaker's articulation. This can be used for pronunciation training in second language learning, as well as for training hearing-impaired speakers (both adults and children) to improve their pronunciation. Silent speech interfaces (Denby et al., 2010; Hueber et al., 2010) are other applications where articulatory models are relevant. Building articulatory models also helps in understanding basic speech production behavior, especially the ability to maintain recognizable acoustic correlates of the phonemes of a language in spite of largely different vocal tract anatomies (Lammert et al., 2011). Articulatory models, together with an articulatory synthesizer, can also be used to study phenomena like infant speech acquisition (Ménard et al., 2002; Chapter 14).

Given the differences between individual speakers, one needs an articulatory model which can adapt to a particular speaker's articulatory strategies. Part I of the thesis is devoted to addressing this problem. Using a database of MRI images of the midsagittal plane of the vocal tract for three French speakers (c.f. Section 2.4), the problem is introduced in Chapter 2. Some solutions are proposed in Chapter 3, using 3-way linear analysis methods. The main idea is to find an articulatory model that explains the variations made by all three speakers. The studies suggest that it is possible to adapt the articulations made by speakers in general to the articulatory strategy of a specific speaker, provided we have articulatory data of that speaker making other articulations. Chapter 4 takes this approach further by trying to find the number of articulations from a certain speaker that are required to make viable predictions about that speaker's other articulations. This is verified using perception tests with human listeners.
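To make "3-way linear analysis" concrete, the sketch below (not the thesis' actual model or data) fits a rank-R PARAFAC decomposition by alternating least squares to a speakers × articulations × contour-points array, so that one factor matrix carries speaker weights, one articulation weights, and one a shared contour basis. The array shape and rank are assumptions chosen purely for illustration.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product of two factor matrices."""
    r = a.shape[1]
    return np.einsum('ir,jr->ijr', a, b).reshape(-1, r)

def parafac_als(x, rank, n_iter=200, seed=0):
    """Fit x[i, j, k] ~ sum_r a[i, r] * b[j, r] * c[k, r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    i, j, k = x.shape
    a = rng.standard_normal((i, rank))
    b = rng.standard_normal((j, rank))
    c = rng.standard_normal((k, rank))
    x1 = x.reshape(i, j * k)                        # mode-1 unfolding
    x2 = np.moveaxis(x, 1, 0).reshape(j, i * k)     # mode-2 unfolding
    x3 = np.moveaxis(x, 2, 0).reshape(k, i * j)     # mode-3 unfolding
    for _ in range(n_iter):
        a = x1 @ np.linalg.pinv(khatri_rao(b, c)).T
        b = x2 @ np.linalg.pinv(khatri_rao(a, c)).T
        c = x3 @ np.linalg.pinv(khatri_rao(a, b)).T
    return a, b, c

# Toy data with the assumed layout: 3 speakers, 40 articulations,
# 60 numbers per articulation (e.g. flattened midsagittal contour coordinates).
x = np.random.default_rng(1).standard_normal((3, 40, 60))
speaker_w, articulation_w, contour_basis = parafac_als(x, rank=5)
print(speaker_w.shape, articulation_w.shape, contour_basis.shape)
```

In a model of this kind, adapting to a new speaker would amount to estimating only that speaker's row of the speaker-weight matrix from a handful of measured articulations, while the articulation and contour factors are kept fixed.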

1.3 Acoustic-to-Articulatory Inversion (Parts II and III)

Acoustic-to-articulatory inversion (or just Inversion) is the process of determining the parameters of the articulation that produced a certain sound from the acoustics of the signal. The word 'Inversion' is associated with what is essentially a mapping problem, because the direction of mapping is the inverse of the traditional direction of speech communication, which produces the acoustics based on the articulatory configuration. Understanding the Inversion process gives us several insights into the problems associated with completing the speech communication loop. Although the Inversion problem is quite an old one within the speech community, the progress made in finding reasonable and workable solutions has been very slow.

Besides the aim of understanding speech communication, solving the Inversion problem has several other applications. One of the more unorthodox applications is the possibility of an extremely low bit-rate encoder. The articulation, in fact, contains almost all the information relevant to recognizing speech (except, possibly, the pitch and voicing information). The advantage that an articulatory representation has over an acoustic representation is the low number of parameters required to encode phonetic information: as few as 6 to 7 articulatory parameters, besides information about pitch and voicing. Moreover, articulatory movement is quite smooth and slowly varying compared to the acoustic signal, so the sampling rate required to adequately represent the trajectory of one articulatory parameter could be as low as 50 Hz. This would enable transmitting speech data at an extremely low bit-rate, without losing information such as co-articulation, or even the identity of the speaker (assuming the relevant voice source parameters are also transmitted). The possibility of such an application urged several researchers to focus their attention on this problem as early as 1967 (Schroeder, 1967; Mermelstein, 1967). It became a distinct possibility with the development of an articulatory synthesizer which could synthesize several vowels.
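As a rough illustration of the figures above, the following back-of-the-envelope calculation uses the parameter count and frame rate from the text; the bits per parameter and the pitch/voicing side channel are assumptions, not values from the thesis.

```python
n_params = 7              # articulatory parameters (from the text)
frame_rate = 50           # Hz (from the text)
bits_per_param = 4        # assumed quantizer resolution
source_info_bps = 400     # assumed pitch + voicing side information (bit/s)

articulatory_bps = n_params * frame_rate * bits_per_param
total_bps = articulatory_bps + source_info_bps
print(articulatory_bps, total_bps)   # 1400 and 1800 bit/s
# Compare with 64000 bit/s for standard 8 kHz, 8-bit log-PCM telephone speech.
```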

The animation of talking heads (intra-oral or external) (Beskow et al., 2003) is another use of Inversion, where the acoustics are converted into articulatory movements of the talking head. This needs to run in real time so that the articulatory movement can be synchronized with the acoustics.

Any model which explains infant speech acquisition should necessarily include the ability to perform acoustic-to-articulatory inversion (Guenther, 1995; Chapter 15), because this is essentially what infants do when they learn to speak: they convert the sounds that they hear adults making into articulatory movements of their own, so that their productions correspond to those of the adults.

Improving speech recognition has been one of the most important goals of Inversion. It has been shown that speech recognition can be improved significantly when recorded articulatory data is used along with acoustic features. The same has been attempted by augmenting acoustic data with articulatory data predicted using Inversion (e.g., Wrench and Hardcastle, 2000; Markov et al., 2006; Chapter 10).
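In practice, such augmentation is often a frame-wise concatenation of the two feature streams. The sketch below is not the thesis' experimental setup and uses assumed feature dimensions; it normalizes each stream before stacking so that neither dominates the combined observation vectors fed to a recognizer.

```python
import numpy as np

def zscore(features):
    """Normalize each feature dimension to zero mean, unit variance."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

rng = np.random.default_rng(0)
n_frames = 1000
mfcc = rng.standard_normal((n_frames, 13))   # acoustic features per frame (assumed 13 MFCCs)
ema = rng.standard_normal((n_frames, 12))    # articulatory features per frame, measured or
                                             # predicted by Inversion (assumed 12 dimensions,
                                             # e.g. x/y positions of 6 EMA coils)

augmented = np.hstack([zscore(mfcc), zscore(ema)])   # (n_frames, 25) observation vectors
print(augmented.shape)
```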

Another important application of Inversion is building better articulatory synthesizers (Birkholz and Jackèl, 2003). While articulatory data is limited and difficult to procure, acoustic data is much easier to collect. Thus, training articulatory synthesizers on recorded articulatory data can be replaced by training them with articulatory predictions made through Inversion from acoustics.

In the past, the inverse mapping was learnt using the inversion-by-synthesis approach (c.f. Chapter 5). More recently, however, thanks to the availability of
