
ARTICULATORY-ACOUSTIC RELATIONSHIPS IN SWEDISH VOWEL SOUNDS

Christine Ericsdotter

Thesis submitted for the degree of Doctor of Philosophy, Stockholm University

Autumn 2005


Doctoral Dissertation
Department of Linguistics
Stockholm University
S-106 91 Stockholm, Sweden

ISBN: 91-7155-151-4

Cover: Magnetic resonance images of male (left) and female (right) speaker producing [y]

Printed in Jena, Germany, October 2005

Also published online at http://www.ling.su.se/fon/perilus

TABLE OF CONTENTS

PART I: GOALS, RATIONALE AND DATA

Abstract
1. Introduction
1.1. Articulatory modelling
1.1.1. Ongoing articulatory modelling work
1.2. Project approach and present work initiation
1.3. The method under evaluation
1.4. The nature of the evaluation
1.5. Outline of goals and thesis plan
2. Data
2.1. Data acquisition
2.1.1. MRI medium and specifications
2.1.1.1. General on MRI
2.1.1.2. Machine specifications
2.1.1.3. MR plane distribution
2.1.2. Speech materials
2.1.3. Choice and training of subjects
2.1.4. Visualizing teeth
2.1.5. Nose marker
2.1.6. Video recordings
2.1.7. Audio recordings
2.1.8. Experimental procedure
2.2. Post-experimental data processing and preparation for analysis
2.2.1. MR images
2.2.1.1. Transformation from 2D to 3D coordinates and visualization of planes
2.2.1.2. Image processing
2.2.1.3. Choice of set of images for analysis
2.2.1.4. Integration of dental contours
2.2.1.5. Record of head movements derived from dental contours
2.2.1.6. Technique and error estimation of distance and area outlines
2.2.2. Video images
2.2.3. Sound
2.3. Additional data for method evaluation
2.3.1. Acquisition and processing of X-ray images
2.3.2. Speech in upright versus supine position
2.3.3. Some comparisons between vocal tract profiles from upright versus supine position in the present data

PART II: ARTICULATORY ANALYSIS

3. Larynx
3.1. Measurements of distances and areas
3.2. Vertical larynx position
3.2.1. Muscular regulation
3.2.2. Lung volume
3.2.3. Pitch
3.2.4. Voice quality
3.2.5. Resonances
3.2.6. Implications for the data under investigation
3.2.7. VLP determination on sagittal MR images
3.2.8. VLP determination on axial MR images
3.2.8.1. Image selection
3.2.8.2. Anatomical identification: Interpreting MR signal levels
3.2.8.3. Laryngeal proportions
3.2.8.4. Ordering the axial images by height
3.2.8.5. VLP classification on axial images
3.3. Distances and areas in the laryngopharynx
3.3.1. Mid-sagittal distances
3.3.2. Areas
3.3.3. Distance-to-area relationships
3.3.4. Modelling the larynx tube
4. Oropharynx
4.1. Dimensions at rest
4.2. Dimensions during speech production
4.2.1. Degrees of freedom
4.2.1.1. Anterior-posterior dimension
4.2.1.2. Transversal dimension
4.2.1.3. Pharynx during vowel production
4.2.2. Passive or active forces?
4.3. Distances and areas in the oropharynx
4.3.1. Distances
4.3.1.1. Mid-sagittal distances
4.3.1.2. Cross-sectional tongue shape: concavity (grooving) dimensions
4.3.1.3. Transversal distances
4.3.1.4. Covariation between sagittal and lateral distances
4.3.1.5. Comparison with previously presented data
4.3.2. Areas
4.3.3. Distance-to-area relationships
4.3.3.1. Subvelar region
4.3.3.2. Velar region
5. Oral Cavity
5.1. Dimensions and degrees of freedom
5.2. Distances and areas in the oral cavity
5.2.1. Mid-sagittal distances
5.2.2. Jaw paths
5.2.3. Areas
5.2.4. Distance-to-area relationships
5.3. Areas between front teeth
6. Lips
6.1. Articulatory-acoustic issues
6.2. Speaker-specific lip reference materials
6.2.1. Recording session
6.2.2. Tracings and scaling
6.2.3. Results
6.3. Lip data during MRI sessions
6.3.1. Tracing and scaling
6.3.2. Results from multi-mode sessions
6.3.3. Sagittal mode sessions
7. Summary of Articulatory Analysis
7.1. Distance-to-area rules: coefficients and performance
7.1.1. Conclusions before proceeding to acoustic evaluation
7.2. Extension and predictability of CE vowel groupings

PART III: ACOUSTIC EVALUATION

8. Analysis of Sound Recordings
8.1. Fundamental frequency
8.1.1. MR machine noise
8.1.2. Vowels
8.2. Short-term energy
8.3. Auditory analysis
8.4. Formant patterns
8.4.1. Articulatory stability
8.5. Reference materials and comparisons with other studies
9. Acoustic Predictions and Comparison between Observations and Predictions
9.1. Building vowel tubes from articulatory data
9.1.1. Vocal tract midline
9.1.2. Areas perpendicular to the midline
9.1.3. Composition of vowel tubes
9.1.4. Method for acoustic evaluation
9.2. Comparisons between observations and predictions
9.2.1. Vocal tract termination point
9.2.2. Performance of HS predictions
9.2.3. HS vs CE predictions
9.3. Sources of error and comments on predictions

PART IV: COMMENTS AND PERSPECTIVES

10. Concluding Remarks
10.1. Collection and analysis of articulatory and acoustic reference materials
10.2. Establishment of distance-to-area rules from articulatory data
10.3. Evaluation of method performance
10.3.1. A generalization exercise
10.3.2. Applying methods on the X-ray data
10.4. Future work

REFERENCES

PRINTED APPENDIX

A.1. Outlined area shapes on multi-mode MR images

ELECTRONIC APPENDICES

E.1. Specification of MR image orientation (CD)
E.2. Examples of articulatory images (CD)
E.3. Articulatory measurements on MR images (CD)
E.4. Acoustic measurements (CD)
E.5. Area functions (CD)
E.6. Examples of sound files (CD)

ACKNOWLEDGEMENTS

This research was funded by the National Institutes of Health (R01 DC02014) and Stockholm University (research student's grant SU 617-0230-01).

The thesis has benefited from the advice of Björn Lindblom, Stockholm University and University of Texas at Austin; Johan Sundberg, Royal Institute of Technology, Stockholm; Adrian Simpson, Friedrich-Schiller-Universität, Jena; and Olle Engstrand, Stockholm University. Valuable comments on parts of the work were also given by Thierry Metens, Unité de Résonance Magnétique de l’Hôpital Erasme, Brussels and Chris Moore, Department of Radiotherapy Physics, Christie Hospital, Manchester.

The warm encouragement from colleagues and friends, particularly Märit Aronsson, Sven Björsten, Kim Ciccone, Östen Dahl, Robert Eklund, Andreas Gyllenhammar, Pétur Helgason, Diana Krull, Josefina Larsson, Johan Liljencrants, Eva Lindström, Bob McAllister, Jin Hua Nie, Håkan Olsson, Magnus Petersson, Adrian Simpson, Karolina Smeds, Ulla Sundberg, Masja Koptjevskaja Tamm and Hartmut Traunmüller, was appreciated and much needed.

Who did what

The present work was part of the research project “Modeling Coarticulation – An integrated phonetic approach”, financed by the NIH (R01 DC02014). The following contributions have been made by project members and collaborators:

Björn Lindblom, Stockholm University and University of Texas at Austin, and Harvey Sussman, University of Texas at Austin, were the principal investigators of the project. They planned, supervised and refined the work. Björn's vast experience and knowledge were indispensable to the quality of this thesis, as was his ardent research spirit throughout my time in phonetics.

Didier Demolin, Université Libre de Bruxelles, made the MRI experiment in Brussels possible. His generous arrangements and assistance were much appreciated by subjects and experimenters alike. Thierry Metens, Unité de Résonance Magnétique de l’Hôpital Erasme in Bruxelles, was the physicist of the MRI experiment in Brussels. His expertise and patience were tremendously important to the quality of the data. Alain Soquet, Université Libre de Bruxelles, was part of the experimental team in Brussels. He assisted during the MRI experiment and gave advice on image orientation conversions (section 2.2.1.1).

The male and the female subject, whose identities are protected, showed patience and flexibility prior to and during the experiments. This was much valued.

Peter Branderud, Stockholm University, was the project research engineer and was part of the experimental team in Brussels and Stockholm. He created the optical microphone (section 2.1.7); defined the profile filters for the sound materials (section 2.2.3); defined the image processing of the MRI images (section 2.2.1.2) and of the video images (section 2.2.2); refined the teeth integration procedure (section 2.2.1.4); and expanded the Papex tool for coordinate extraction from Papyrus files. His pioneering and meticulous methods were invaluable to this work. Also, his unique combination of profound curiosity, modesty and a non-occidental sense of time was very precious throughout my time in the phonetics lab at Stockholm University.

Johan Stark, Stockholm University and Columbi Computers, was the project computer engineer. He defined the teeth integration procedure (section 2.2.1.4) and gave mathematical advice. His innovative ideas and practical way of thinking were of great importance.

Josefina Larsson and Elisabet Eir Cortes were project assistants. They converted and processed the MRI (section 2.2.1.2) and video images (section 2.2.2); carried out the profile filtering of the sound materials (section 2.2.3); and double-checked measurements. They were also the main experimenters of the additional lip experiment (section 6.2.1). Their readiness and positive approach were great assets.

Hassan Djamshidpey, Stockholm University, was the project technician. He provided and evaluated experiment equipment and carried out the sound recordings during the additional lip experiment (section 6.2.1). His composed and attentive persona has been an immense support in experimental and other trying situations throughout my time at the phonetics lab at Stockholm University.

Adrian Simpson, Friedrich-Schiller-Universität, Jena, defined and developed several scripts and methods for fast and safe data processing. He also read and commented on the text. His critical eye and genuine interest were essential and much valued, as was his willingness to discuss any issue at any time.

Johan Liljencrants, Svante Granqvist and Giampiero Salvi, Royal Institute of Technology, Stockholm, contributed with advice on acoustic modelling. Roberto Bresin, Royal Institute of Technology, Stockholm, created the Papex tool for coordinate extraction from Papyrus image files (the tool was later expanded by Peter Branderud). He also consulted on image orientation conversions (section 2.2.1.1) and was a great support during the initial phase of the work.

Judith Rinker, EnglishEdit, Uppsala, carried out the language control of the thesis text, and Ramona Benkenstein, Lektorat und Webbservice, Jena, performed the main formatting of it. Bitte Granlund made the larynx cartilage drawings.

I, Christine Ericsdotter, was initially a project assistant and took part in planning, performing and evaluating the experiments. I carried out the articulatory and acoustic measurements; specified and checked programs and scripts; supervised and evaluated methods and data processing; analysed data; and wrote the text.

PART I: GOALS, RATIONALE AND DATA

Abstract

The goal of this work was to evaluate the performance of a classical method for predicting vocal tract cross-sectional areas from cross-distances, to be implemented in speaker-specific articulatory modelling. The data forming the basis of the evaluation were magnetic resonance images from the vocal tract combined with simultaneous audio and video recordings. These data were collected from one female and one male speaker. The speech materials consisted of extended articulation of each of the nine Swedish long vowels together with two short allophonic qualities. The data acquisition and processing involved, among other things, the development of a method for dental integration in the MR image, and a refined sound recording technique required for the particular experimental conditions.

Articulatory measurements were made of cross-distances and cross-sectional areas from the speakers’ larynx, pharynx, oral cavity and lip section, together with estimations on the vocal tract termination points. Acoustic and auditory analyses were made of the sound recordings, including an evaluation of the influence of the noise from the MR machine on the vowel productions. Cross-distance to cross-sectional area conversion rules were established from the articulatory measurements. The evaluation of these rules involved quantitative as well as qualitative dimensions. The articulatory evaluation gave rise to a vowel-dependent extension of the method under investigation, allowing more geometrical freedom for articulatory configurations along the vocal tract. The extended method proved to be more successful in predicting cross-sectional areas, particularly in the velar region.

The acoustic evaluation, based on area functions derived from the proposed rules, did not, however, show significant differences in formant patterns between the classical and the extended method. This was interpreted as evidence that the classical method has higher acoustic than physiological validity on the present materials. For application and extrapolation in articulatory modelling, it is nevertheless possible that the extended method will perform better in articulation and acoustics, given its physiologically more fine-tuned foundation.

1. INTRODUCTION

1.1. Articulatory modelling

Articulatory modelling is the generic term for a multiplicity of approaches and motives for visualising and synthesizing human speech using physiological parameters. Its main motives are to gain more insight into speech production planning, control and performance, to develop more flexible speech synthesizers, and to provide effective tools for speech therapy and language learning.

Articulatory models have been used in research, teaching and entertainment for at least 300 years (see Engwall 2002: 17-22 for an up-to-date survey and Traunmüller 2000 for listening examples). But while acoustic synthesis was born and nearly perfected during the past 50 years, articulatory synthesis remains generally crude, inefficient and computationally demanding.

One reason for the deficiencies of articulatory synthesis is the laborious enterprise of acquiring and analysing the articulatory data that form the basis of the parameters driving a model.

During a great part of the history of phonetics, it was cadavers that provided data on vocal tract anatomy and physiology, placing obvious limitations on the facts to be found and implemented. Experimental articulatory studies on living subjects have, however, become practicable and more frequent in recent years, each technique with its advantages and drawbacks (see Stone 1997 for a review). Examples of measurement methods are the examination of moulds of the oral cavity (Heinz and Stevens 1964) or even of the pharynx (Ladefoged et al. 1971) from silent articulation; flexible fiberoscopy for observation of the lower pharynx and larynx (Sawashima and Hirose 1968); electromyography (EMG), which allows measurement of the electrical activity of muscles; and electropalatography (EPG), which reveals patterns of tongue-palate contact. The use of X-ray for the study of the entire vocal tract in motion has provided much of the basic information on articulatory patterns and articulatory-acoustic relationships in phonetic theory, e.g. the classical studies by Chiba and Kajiyama (1941) and Fant (1960) (see Dart 1987 for an extensive survey of X-ray investigations and Munhall 1995 for X-ray excerpts from different languages). The drawback of X-ray is that it gives only a two-dimensional view of the articulatory structures, and radiation levels place severe restrictions on the duration of recording, even if recent studies using digital X-ray involve low radiation and less radiation diffusion, protecting the eyes and brain (Branderud et al. 1998; Stark et al. 1999; Connan 2003). X-ray microbeam provided a partial solution to the radiation problem (Kiritani et al. 1976). This system registers the movements of metallic pellets glued to the tongue, lips and reference points rather than the full shape of the vocal tract, hence minimizing subject radiation exposure, but also renouncing information on the velum and pharynx. Other techniques for tracking movements of receiver coils attached to the tongue, lips and/or jaw (Branderud 1985; Schönle et al. 1987; Perkell et al. 1992) also offer real-time analysis of articulatory movements with no known health-hazarding effects, but typically cover only the mid-sagittal oral and labial parts of the vocal tract.

The methods described above have yielded interesting findings, and continue to do so. Undoubtedly, however, they impose limitations on the naturalness of the data elicited. Safer, less invasive imaging techniques allowing three-dimensional views of the full vocal tract are continually being developed within the medical field and applied in phonetics. Such techniques are, e.g., computed tomography (CT), a low-radiation X-ray snapshot technique providing a three-dimensional image of restricted parts of the vocal tract (used by, e.g., Sundberg et al. 1987 to investigate the pharynx); ultrasound, which uses high-frequency sound waves and their echoes to calculate and display tissue or organ boundaries (used in speech research since Kelsey et al. 1969 and, e.g., reviewed for imaging of the tongue in a collection of papers edited by Stone 2005); or, as in this study, magnetic resonance imaging (MRI), a technique that registers the movements of protons of hydrogen atoms set in spin in a strong magnetic field (Baer et al. 1991; see further section 2.1.1). The drawback of collecting three-dimensional full vocal tract data by MRI is that it works only on static articulations, due to the long image registration time, which also decreases the naturalness of the data. Dynamic and tagged MRI covering more than the mid-sagittal plane is, however, under development (Shadle et al. 1999; Kane et al. 2002; Albiter et al. 2003).

There is still much work to be done in piecing together the different bits of information provided by different investigations, since an instrument that competently measures one articulator might be quite inadequate for measuring another (Stone 1997: 11). Better conceptions of vocal tract changes can be achieved by combining two or more articulatory observation methods. In the present research project, MRI, X-ray, EPG and digital lip video data have been collected from a female and a male subject. In this thesis, the MRI and lip video data are analysed.

1.1.1. Ongoing articulatory modelling work

Despite the development of new methods, the acquisition and analysis of articulatory data is a time-consuming and error-ridden task. This fact has kept the number of speakers and utterances used for calibrating models or describing patterns low (one exception being the extensive microbeam database presented by Westbury 1994). From this basis, it is difficult to identify, generalize and parameterise articulatory patterns. Nevertheless, research groups in several parts of the world persist in developing models and theoretical frameworks using present data and means. A representative sample of the work in progress was presented at the latest International Congress of Phonetic Sciences (ICPhS) in Barcelona in 2003. Much of the work presented there consisted of developments from projects initiated 30 or more years ago, which gives some perspective both on the complexity of the matter and on a renewed interest in articulatory synthesis:

The Configurable Articulatory Synthesizer (CASY) from the Haskins group (Iskarous et al. 2003) is a development from previous work (Coker and Fujimura 1966; Mermelstein 1973; Rubin et al. 1981) representing articulators using simple geometric parameters. It allows for speaker variation by adjusting parameters to fit mid-sagittal articulatory data from different speakers. The current work involved the development of the geometric tongue shape from circle to conic arch, and the evaluation of more generic versions of distance-to-area functions transforming two-dimensional profiles into three-dimensional vocal tracts (see further 1.3). The aim of the synthesis was described as the accurate reproduction of static and dynamic configurations using as few geometrically simple parameters as possible, these parameters still being linguistically meaningful and capturing common speaker variability of production patterns and vocal tracts. The authors promoted a stronger link between articulatory synthesis and speech production measurements, to ensure that only potential vocal tract shapes, and no unattested ones, are generated. The conviction that speaker variability can be captured more parsimoniously with articulatory than with acoustic parameters, thereby allowing for more flexible speech synthesis, is interpreted to be the project's driving force, as is the fact that high-quality, flexible speech synthesis operating on longer stretches of speech is much needed by people who depend on text-to-speech technology (Whalen 2003).

The higher-level articulatory parameter controlled synthesizer HLsyn from the MIT group (Stevens and Hanson 2003) was published in the beginning of the nineties (Stevens and Bickley 1991) and is an attempt to capture aerodynamics in a rule-based framework. In practice, it is a quasi-articulatory front end to the Klatt formant synthesizer (Klatt and Klatt 1990). The articulatory parameters exploit many of the interdependencies between the acoustic parameters in the formant synthesis. The current work involved analysis of gestural overlap during consonant clusters as performed by the model and matched to observations. The overall research aim was described as "to formulate principles for the synthesis of running speech based on parameters related to articulation, and therefore to gain insight into the process by which speakers coordinate the movements of the articulators" (page 199). The final remarks noted the unsatisfactory situation of the experimenter being left to base some parameter adjustments on trial and error rather than on observations, since speech articulation has not been studied as extensively as speech acoustics has (page 202). The researchers conclude that "the ultimate synthesizer should be able to hear itself and to be self-correcting" (page 202), meaning that feedback and corrections should not be made by the experimenter but by functions simulating the strategies of a developing child evaluating and correcting its own speech on the basis of what it has heard. Making the user's interface look like a human vocal tract seems not to be a priority.

The model created by Shinji Maeda (Maeda 1979; 1988; 1990) is based on a high number of lateral X-ray images and labiofilm from running speech. It allows extended investigation of articulatory-acoustic relationships and has been used frequently as a research tool (at the present conference by, e.g., Ménard 2003; Ouni and Laprie 2003; and Traunmüller et al. 2003). Ongoing work involves examination of the mechanical properties of lip movements during different speaking styles, with the goal of revealing biomechanical characteristics of the muscles acting as the motor of articulatory movements, in order to refine the control of the synthesizer (Maeda and Toda 2003).

Work on audiovisual articulatory synthesis integrates acoustic, articulatory and facial expressive phenomena underlying speech production (Badin et al. 2003: 193; Beskow et al. 2003). The development of animated talking heads aims at capturing, understanding and utilizing features such as the increased speech intelligibility achieved from synchronisation of visible articulatory movements, facial expressions and the sound signal, for extended use in speech synthesis and human-computer interaction. Ongoing work is directed towards, among other things, a better understanding of the correlation between facial and vocal tract movements (Beskow et al. 2003), and the development of strategies for determining articulatory control parameters, such as coarticulation models vs. simple concatenative strategies (Badin et al. 2003).

The APEX model was first presented in the seventies (Lindblom and Sundberg 1971) and is continuously being developed and evaluated (Stark et al. 1996; Stark et al. 1998; Ericsdotter et al. 1999; Stark et al. 1999; Ericsdotter-Bresin 2000). The model produces vocalic sounds and is controlled by a small set of articulatory parameters defining a mid-sagittal profile. It can be calibrated to a specific speaker and allows for acoustic-to-articulatory inversion as well as articulatory-to-acoustic calculation of profiles via distance-to-area conversions (see further 1.3). The ongoing work involves evaluation of principal component analysis for the tongue, reinventing the currently used semi-circle tongue (Lindblom 2003), and evaluation of distance-to-area functions transforming two-dimensional profiles to three-dimensional vocal tracts (Ericsdotter 2003; present work). APEX is being developed mainly for linguistic and music theory research (see further 1.2), but is also aiming for applications within the clinical field and in language learning.


1.2. Project approach and present work initiation

The present work was initiated in the project “Modelling coarticulation – An integrated phonetic approach” (NIH R01 DC02014). In this project, articulatory movements and patterns are acknowledged as the tangible outcome of an individual’s motor planning of speech gestures. Through observing and systematizing articulatory behaviour, it should be possible to derive and map out principles of speech gesture planning and organization.

Better understanding of the nature of such principles and processes is of interest in itself as a link in the human communication system, and as such also of fundamental importance to several areas within the linguistics field, for example speech therapy, typology or sound change.

A wealth of phonetic studies has aimed at capturing speech planning principles, frequently by studying coarticulation effects in different languages. These studies have resulted in theories on the conditions, motivation and origin of articulatory patterns (Farnetani 1997; Löfqvist 1997). The conclusions and philosophies of these theories are sometimes disparate, however, even when describing similar gestural patterns. Several factors might justify these divergences, for example: (a) factual differences between the data underlying the studies, e.g. individual, sociophonetic, stylistic, language-specific or data acquisition factors; (b) differences in the conditions for building a theory, e.g. scientific tradition, or demands on the theoretical framework to be economical, physiologically true or phonologically driven; (c) the possibility that the principles are too complex, numerous and/or variable to be extracted from a small set of articulatory data (small relative to the individual's full linguistic-articulatory repertoire, and given the low number of speakers). It might, however, be necessary to investigate a small number of subjects thoroughly to define adequate questions and/or to form a theory, which can then be tested on a larger set of data.

The deeper implications of coarticulation theories are hard to test directly, but it should be possible, from an acoustic and articulatory point of view, to evaluate the performance of their various rule systems on a set of speech tasks, provided that there is a reliable evaluation method at hand. The project in which the present work was initiated aims to contribute to the development of such an evaluation method, by refining an articulatory modelling tool for linguistic research. Such a tool must meet the demands of physiologically realistic speech organ properties and theoretically unbiased parameters and degrees of freedom. The procedure for building the tool has been defined as follows:

1) copy some of the articulatory behaviour of two subjects,
2) calibrate an existing articulatory model with the anatomical properties of these subjects,
3) map the articulatory configurations into physiological parameters,
4) extrapolate non-observed articulations from these parameters.

The task of the present study was restricted to a phase in the first step of this procedure.

It does not aim to implement the results in the articulatory model, but only to provide an evaluation of a method for possible implementation in the model. For results from the coarticulation project, see Modarresi et al. (2004; 2004; 2005), Lindblom (2003), Lindblom et al. (2002) and Lindblom and Sussman (2004).

Using articulatory modelling to uncover principles and processes driving speech patterns is a delicate task, however sophisticated the model. A simplified, parameterised version of a complex pattern does not compare to the observed pattern itself. Hence, when the output of modelling is compared with the observed pattern, there will always be differences, and the reasons for these differences will be difficult to interpret. A patient yet critical eye for modelled data should be preserved.

1.3. The method under evaluation

The geometrical prediction routine under consideration in this thesis is that presented by Heinz and Stevens (1964, 1965). They developed a formula that, at each point in the VT, relates the mid-sagittal distance d to the area A of the cross-section at that point:

A = α·d^β    (Equation 1)

where α and β are constants depending on speaker and position along the vocal tract. The predictability of the cross-sectional area A from the cross-distance d was proposed under the assumptions that the tongue has a flat surface and that the upper contour has a fixed shape at each point in the VT. Eq (1) was applied to the whole vocal tract. When area functions were derived from 2D tracings of cineradiographic film images, formant frequency predictions were obtained that were "within a few percent for the first three formants" when compared to the corresponding utterances.
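Concretely, the α and β coefficients for one vocal tract position can be estimated from paired distance and area measurements by ordinary linear regression in log-log space, since log A = log α + β log d. The following sketch in Python illustrates this; the function name and the toy numbers are invented for illustration and are not measurements from this thesis.

```python
import numpy as np

def fit_heinz_stevens(d_cm, A_cm2):
    """Fit A = alpha * d**beta by least squares in log-log space.

    d_cm:  mid-sagittal cross-distances at one vocal tract position [cm]
    A_cm2: corresponding measured cross-sectional areas [cm^2]
    Returns (alpha, beta) for that position.
    """
    d = np.asarray(d_cm, dtype=float)
    A = np.asarray(A_cm2, dtype=float)
    # log A = log(alpha) + beta * log(d)  ->  ordinary linear regression
    beta, log_alpha = np.polyfit(np.log(d), np.log(A), 1)
    return np.exp(log_alpha), beta

# Toy data for one hypothetical position (not measurements from the thesis)
alpha, beta = fit_heinz_stevens([0.4, 0.8, 1.2, 1.6], [0.5, 1.4, 2.6, 4.0])
predicted_area = alpha * 1.0**beta   # area predicted at d = 1 cm
```

One such (α, β) pair per vocal tract region then turns a mid-sagittal distance profile into an area function.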

Studies made on the basis of three-dimensional articulatory data (Sundberg et al. 1987; Baer et al. 1991; Soquet et al. 2002) confirm the original finding by Heinz and Stevens that the α and β coefficients are speaker and vocal tract position dependent. However, three-dimensional articulatory data (both classical tomograms from Fant 1960 and more recent data from CT and MRI) also show that cross-sectional areas along the vocal tract do not always conform to the shapes implied by Eq (1). The aim of the present work is to systematically investigate the geometry of VT cross-sectional areas during vowel articulation and to develop distance-to-area rules that can be used in the derivation of area functions. The point of departure is the power function of Heinz and Stevens. It is expected that such an undertaking will lead to greater realism in articulatory modelling and result in significant improvements in formant frequency predictions. Several articulatory models under development and in use today depend on such conversion rules (see 1.1.1 above and Engwall 2002: 17-20). A methodical investigation of the geometrical preliminaries for the method is therefore thought to be of interest and importance to several research areas in theoretical and applied phonetics.


1.4. The nature of the evaluation

Heinz and Stevens evaluated their transformation method exclusively by acoustical means. The comparison between the acoustic outcome from the predicted area functions and the formant estimations from the actual utterances was the only validity test of their method. They could not carry out a full vocal tract articulatory evaluation, since there were no articulatory cross-sectional data from the pharynx to compare predicted articulatory data to. In the present work, cross-dimensions as well as cross-sections from the entire vocal tract were available for analysis and comparison. In the type of articulatory modelling ultimately aimed at (1.2), physiological realism and naturalness are required in addition to a high-quality acoustic outcome. Hence, the evaluation of the method's performance will be articulatory as well as acoustic. Furthermore, the evaluation should not only be quantitative, presenting numeric comparisons between predicted and observed cross-sectional areas and formants; it should also be qualitative, providing explanations, or at least speculations, on the method's drawbacks and successes.

A methodological problem is that an articulatory evaluation has fewer intrinsic errors than an acoustic evaluation. The articulatory evaluation can be made directly on the basis of measurements, and depends only on measurement and production accuracy. The acoustic evaluation must be made between acoustic reference materials, which contain measurement inaccuracies, and the acoustic output of area functions, which depends on the vocal tract midline definition, on areas predicted perpendicular to the midline rather than observed areas, on the choice of main vocal tract area in cases of multiple air-filled areas, on the choice of vocal tract termination point, and on present acoustic theory. The different conditions for the two evaluations should be kept in mind.
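To make this chain of dependencies concrete, the sketch below shows one textbook way of turning an area function into predicted formants: lossless cylindrical tube sections combined through frequency-domain chain (ABCD) matrices, with an ideally closed glottis and an ideally open lip end. This is a generic simplification for illustration, not the evaluation procedure actually used in this thesis (described in chapter 9), and it omits exactly the loss and termination effects discussed above.

```python
import numpy as np

def formants_from_area_function(areas_cm2, lengths_cm, c=35000.0, rho=0.00114,
                                fmax=5000.0, df=1.0):
    """Estimate formants of a lossless concatenated-tube model.

    areas_cm2:  cross-sectional areas ordered from glottis to lips [cm^2]
    lengths_cm: length of each section [cm]
    Returns frequencies (Hz) where |U_lips / U_glottis| peaks, assuming an
    ideal closed glottis, an ideal open lip end (P = 0) and no losses.
    """
    freqs = np.arange(df, fmax, df)
    gains = np.empty_like(freqs)
    for i, f in enumerate(freqs):
        k = 2.0 * np.pi * f / c                       # wavenumber [rad/cm]
        K = np.eye(2, dtype=complex)                  # chain (ABCD) matrix
        for A, l in zip(areas_cm2, lengths_cm):
            Zc = rho * c / A                          # characteristic impedance
            M = np.array([[np.cos(k * l),           1j * Zc * np.sin(k * l)],
                          [1j * np.sin(k * l) / Zc, np.cos(k * l)]])
            K = K @ M                                 # accumulate glottis -> lips
        gains[i] = 1.0 / abs(K[1, 1])                 # |U_lips / U_glottis|
    # local maxima of the volume-velocity transfer function = formants
    peaks = (gains[1:-1] > gains[:-2]) & (gains[1:-1] > gains[2:])
    return freqs[1:-1][peaks]

# Sanity check: a uniform 17.5 cm tube should resonate near its odd
# quarter-wavelength frequencies, about 500, 1500, 2500 Hz, ...
print(formants_from_area_function([4.0] * 35, [0.5] * 35))
```

Every design decision in such a computation (midline, section lengths, termination) shifts the predicted formants, which is precisely why the acoustic evaluation carries more intrinsic error than the articulatory one.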

1.5. Outline of goals and thesis plan

This thesis pursues the following goals and procedures:

1) Collection and analysis of articulatory and acoustic reference materials
a. Simultaneous collection of MR images, lip video and sound data
b. Measurements of mid-sagittal distances and cross-sectional areas on MR images along the vocal tract, and on video images of the lips
c. Acoustic measurements on sound records

2) Establishment and application of distance-to-area rules
a. Articulatory measurements form the basis for deriving coefficients for Eq (1)
b. Coefficients from (2a) used to derive area functions for acoustic evaluation

3) Evaluation of method performance
a. Comparison between measurements and predictions of cross-sectional areas
b. Comparison between measurements and predictions of formant patterns
c. Applicability test of the rules on vowels from a separate data set, upright position

More introductory words are found at the beginning of each chapter. The text is structured in four parts, described briefly below.

Part I: Goals, rationale and data
Chapter 1: Introduction.

Goals and rationale of the work addressed in the light of previous and ongoing research within the articulatory modelling field.

Chapter 2: Data.

Description of data acquisition and processing, discussion of methodological aspects.

Part II: Articulatory analysis
Chapter 3: Larynx.

Determination of vertical larynx positions, measurements on MR images of distances and areas in the laryngopharynx. Establishment and discussion of distance-to-area rules as well as of fixed dimensions for larynx tube modelling.

Chapter 4: Pharynx.

Measurements on MR images of distances and areas in the mid and upper pharynx.

Establishment and discussion of distance-to-area rules.

Chapter 5: Oral Cavity.

Measurements on MR images of distances and areas in the oral cavity. Derivation of jaw paths from integrated dental objects. Establishment and discussion of distance-to-area rules.

Chapter 6: Lips.

Description of additional lip video data. Relationships between lip height, width and depth as well as between mid-sagittal distance and cross-sectional area established and applied on frontal lip video data from MRI session.

Chapter 7: Summary of Articulatory Analysis.

Summary and comparison between methods for distance-to-area conversion. Error calculation and discussion of method performance.

Part III: Acoustic evaluation

Chapter 8: Analysis of Sound Recordings.

Acoustic, auditory and visual analysis of the sound records for the establishment of a sound reference material.

Chapter 9: Acoustic Predictions and Comparison between Observations and Predictions.

Description of vowel tube composition from articulatory data and acoustic evaluation of these tubes. Comparison between reference and computed sound materials.

Part IV: Comments and perspectives
Chapter 10: Concluding Remarks.

2. DATA

2.1. Data acquisition

2.1.1. MRI medium and specifications

2.1.1.1. General on MRI

Magnetic resonance imaging (MRI) allows a tomographic view of body tissues in any plane of the human body, without known risks for the patient/subject. Figure 2-1 shows examples of MR images through the mid-sagittal plane and through a coronal plane at approximately the boundary between the soft and hard palates. The technique uses a magnetic field and radio waves to image body sections. The machine applies a radio frequency pulse specific to hydrogen, directed toward the region of interest. This pulse causes the protons in that area to spin. The proton motion is registered by the machine and is encoded depending on hydrogen content, so that different body tissues are represented as different grey levels in an image. At the dark extreme of the grey scale, tissues containing little hydrogen are found, e.g. teeth and bone. At the light extreme, tissues containing a high degree of hydrogen are found, e.g. fat and water (a comprehensive description of the MRI medium can be found in, e.g., Hornak 1996-2000).

MRI is used primarily within the medical field but has been increasingly applied in speech research over the past 15 years (e.g. by Baer et al. 1991; Lakshminarayan et al. 1991; Moore 1992; Sulter et al. 1992; Yang and Kasuya 1994; Wein et al. 1995; Demolin et al. 1996; Story et al. 1996; Tiede 1996; Alwan et al. 1997; Di Girolamo et al. 1997; Narayanan et al. 1997; Badin et al. 1998; Honda and Tiede 1998; Engwall and Badin 1999; Fitch and Giedd 1999; Stone 2000; Jackson and Shadle 2000). A growing body of articulatory data collected using MRI is valuable in understanding and modelling the vocal tract in three dimensions, particularly the pharynx area, the behaviour of which during speech is traditionally hard to capture. The technique has, however, its drawbacks and limitations when applied in the field of speech, as do other experimental/diagnostic imaging techniques.

Figure 2-1. Examples of MR images. Left: mid-sagittal plane. Right: coronal plane, approximately at the boundary between soft and hard palates. Female subject. Vowel /i/.

One issue has been image acquisition time, which was several minutes per speech token in the earliest studies (e.g. 3.4 minutes in the classical study by Baer et al. 1991). In the present investigation, acquisition time was only around 14 s. This reduction depends mainly on the number of image planes used, on the development of the MR machines, and on the technical possibilities explored by the physicists controlling them. Fourteen seconds is, however, still a long time when it comes to single speech tokens, and it requires that subjects hold articulatory configurations steady for durations which are not only extra-linguistic but also motorically difficult to produce. This might result in articulatory instability and subject motion during image acquisition, which in turn might create image artefacts (see below). It also means that only sustainable speech sounds can be investigated, at least if a complete, multi-mode vocal tract view is desired. Dynamic MRI typically captures the mid-sagittal plane alone (see, e.g., Demolin 2002), although dynamic multi-plane MRI is under development (see section 1.1). In the present study, only (sustainable) vowels were investigated.

Another issue resides in the way body tissues are encoded in grey levels in the image. At the light extreme of the grey scale, tissues containing high degrees of hydrogen are found, e.g. fat. At the dark extreme, tissues containing little hydrogen are found, e.g. teeth and bone. Air-filled spaces are also encoded dark. This combination creates an obvious problem in the oral cavity, as no boundaries between teeth and air-filled spaces are visible (the subject will seem "toothless"). This problem was addressed by integrating tooth contours from plaster casts directly in the MR image, as described in sections 2.1.4 and 2.2.1.4.

A third issue is the fact that no metals are allowed within the image field. The magnetic field is so strong that it can lift a grown man by the metal in his belt buckle, or cause a paper clip to become an uncontrollable weapon. Metal can also give rise to image artefacts. MRI assumes a homogeneously applied magnetic field, and an inhomogeneous field will cause distorted images so that, e.g., a straight line will appear bent (Hornak 1996-2000, section 11).

The ban on metals has rendered sound recordings difficult, since microphones and cables usually contain metal. This obstacle was approached by the construction of an optical microphone and glass fibre cables, making high-quality sound recordings possible. This is described in section 2.1.7.

There are also complexities in image interpretation, especially when performed by a medically untrained person, as in the present study. Possible image artefacts must be identified. For example, image artefacts arising from chemical shift and subject motion must be detected and accounted for. The chemical shift artefact is caused by the difference in Larmor frequency between fat and water, and results in a misregistration between fat and water pixels in an image (Hornak 1996-2000, section 11). This misregistration looks like a dark framing line (like a membrane) between, e.g., fat and muscle. Sakai and collaborators (1990) found the chemical shift artefact in 1.5 T images to be more prominent in T1-weighted images (as in the present study) than in T2-weighted images. Fortunately, this occurs at predictable places, namely at abrupt changes in tissue histology. This should reduce confusion between artefacts and true anatomy (Dwyer et al. 1985: 18), once the conditions are known. The subject motion artefact is a blurring of the entire image or part of it, and can result in ghost images in the phase encoding direction (Hornak 1996-2000, section 11). Curtin (1989: 4) states that motion artefacts in larynx imaging, caused by swallowing, respiration and pulsatile flow in the great neck vessels, are less of a problem when working with low or medium magnetic field strengths (<1 T) than with higher field strengths.
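For a sense of scale: the chemical shift misregistration is proportional to the fat-water frequency difference at the given field strength, divided by the receiver bandwidth per pixel. The back-of-the-envelope calculation below uses standard MRI physics constants; the per-pixel bandwidth is an assumed value, since the protocol's receiver bandwidth is not specified here.

```python
# Rough size of the fat-water chemical shift artefact at 1.5 T.
GYROMAGNETIC_MHZ_PER_T = 42.58      # 1H gyromagnetic ratio / (2*pi)
B0_T = 1.5                          # field strength used in this study
FAT_WATER_SHIFT_PPM = 3.5           # approximate fat-water difference

larmor_mhz = GYROMAGNETIC_MHZ_PER_T * B0_T       # ~63.9 MHz
shift_hz = larmor_mhz * FAT_WATER_SHIFT_PPM      # MHz * ppm = Hz, ~224 Hz

bandwidth_hz_per_pixel = 220.0      # assumed, protocol-dependent
misregistration_px = shift_hz / bandwidth_hz_per_pixel   # ~1 pixel
print(f"Larmor: {larmor_mhz:.1f} MHz, shift: {shift_hz:.0f} Hz, "
      f"offset: {misregistration_px:.1f} px")
```

A shift on the order of one pixel is small but visible exactly at fat-tissue boundaries, consistent with the "membrane-like" dark line described above.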


Problems in interpreting air-filled area outlines are discussed in 2.2.1.6. Interpretation issues that arise when anatomically comparing the present materials to reference materials from the literature are discussed in section 3.2.5.2.

2.1.1.2. Machine specifications

The present MRI experiment was carried out at the Unité de Résonance Magnétique de l’Hôpital Erasme in collaboration with Université Libre de Bruxelles. The MR machine used was a Philips Medical Systems Gyroscan NT. The magnetic field strength was 1.5 T and slice thickness was 4 mm. Further specifications are listed in Table 2-1. Moderate TR was used, making the images proton density-weighted for soft tissues but T1-weighted for pure water.

Table 2-1. MR setup specification.

Location: Unité de Résonance Magnétique de l'Hôpital Erasme, Brussels
Date: 2000.06.27
MR machine: Philips Medical Systems Gyroscan NT
Magnetic field strength: 1.5 Tesla
Slice thickness: 4 mm
TE: Multi-mode images 9 ms; sagittal images 8 ms
TR: Multi-mode images 1800-1950 ms; sagittal images 1150 ms
Imaging sequence type: Turbo Spin Echo (TSE), also referred to as Fast Spin Echo (FSE); proton density-weighted images, T1-weighted only for pure water
RF coil: Quadrature neck coil
Image format: 256*256 pxl
Window centre: 256
Window width: 1702

2.1.1.3. MR plane distribution

The MR plane distribution over the vocal tract was set by one of the phonetic experimenters (Lindblom) and the physicist (Metens). There were some conflicting demands in this procedure. From a phonetic point of view, it is advantageous to cover the vocal tract with as many image planes and plane types as possible, and for articulatory-to-acoustic modelling, it is advantageous if the planes are roughly perpendicular to the sound propagation midline. But for technical reasons, the number of planes and their positions must be restricted, since information in images in which the slices intersect is degraded or lost. This means that the image plane orientation must be set in a way that avoids planes intersecting in phonetically interesting regions. There is also the issue of acquisition time for the subject. This depends on the number of planes: the more planes, the longer the acquisition time. Longer image acquisition times place higher demands on the subject's articulatory stability and increase the risk of motion artefacts. The anatomy of the subject also played a role. The size difference between the male and female subjects gave rise to differently distributed planes. The male subject was also more anatomically asymmetric, which meant that the image planes had to be tilted and fine-tuned in a way that was different from that of the female subject. The choice of the number of planes and of plane orientation was thus a trade-off between what was phonetically interesting, suitable for modelling, technically possible, anatomically permissible and acceptable for the subject with respect to imaging time. Eleven sagittal planes and a set of multi-mode planes were chosen as the final plane distribution. The latter planes are henceforth referred to as the "multi-mode images". The landmark for the first multi-mode plane was the glottis height during phonation of calibration vowels. An overview of the planes is found in Figure 2-2.

Note that the anteriormost planes are ordered backwards. This could have been changed in this text, but to avoid potential sources of error in remembering to describe plane 14 as plane 12 in the male subject, plane 15 as plane 13 in the female, etc., the original numbering was kept throughout the text. The x axis runs in the direction from the right to left side; the y axis runs in the direction from the nose to the neck; the z axis runs in the direction from feet to head.

The specifications for position and orientation of the planes in the MR machine space are presented in Appendix E1 and are discussed in 2.2.1.1.
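The kind of conversion these specifications support can be pictured as follows: each plane contributes an origin point and two in-plane direction vectors in machine coordinates, and a pixel position is mapped into 3D by a weighted sum. The sketch below shows the general form only; the function name, argument conventions and all numbers are assumptions for illustration, not the actual specifications of Appendix E1.

```python
import numpy as np

def pixel_to_machine(row, col, origin, row_dir, col_dir, spacing_mm):
    """Map a 2D pixel position on one MR slice to 3D machine coordinates.

    origin, row_dir and col_dir stand in for the per-plane position and
    orientation specifications (cf. Appendix E1); names and conventions
    here are assumed, not taken from the thesis.
    """
    origin = np.asarray(origin, dtype=float)   # 3D position of pixel (0, 0)
    u = np.asarray(row_dir, dtype=float)       # unit vector along image rows
    v = np.asarray(col_dir, dtype=float)       # unit vector along image columns
    return origin + spacing_mm * (row * u + col * v)

# e.g. a mid-sagittal plane: machine-X fixed, rows running head-to-feet
# (negative machine-Z on screen), columns running nose-to-neck (machine-Y);
# all values hypothetical
p = pixel_to_machine(128, 128,
                     origin=[0.0, -120.0, 150.0],
                     row_dir=[0.0, 0.0, -1.0],
                     col_dir=[0.0, 1.0, 0.0],
                     spacing_mm=1.0)
```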

Figure 2-2. The MR plane distribution. Top row: multi-mode planes; bottom row: sagittal planes. (Axes: machine-X [right to left side], machine-Y [nose to neck], machine-Z [feet to head]; plane numbering indicated in the figure.)

2.1.2. Speech materials

The speech materials were chosen to cover the greater part of the Swedish vowel space. Eleven vowel allophones were elicited: one each from the seven categories /i, y, ʉ, e, ø, o, u/, and two distinct qualities each from /ɛ/ ([ɛ:], [æ:]) and /a/ ([ɒ:], [a]). Table 2-2 provides a summary of the vowels elicited and their distribution in the Swedish lexicon. It also contains the ASCII encoding for each category, which is henceforth used in text, tables and figures.

The fact that these vowels were to be produced with a duration of 15 seconds placed extra-linguistic demands on the subjects. Several of the vowels are generally diphthongised when produced by Swedish speakers, at least when produced in isolation (Eklund and Traunmüller 1997) or in utterance-final position. The diphthongisation seems, however, to occur below the level of awareness, so that aiming for the vowel target appeared natural enough. Furthermore, the /a/ allophone [a] occurs only in non-long positions in Swedish; and in the speech of the female subject, /ɛ/ is realised as [æ] also in positions not followed by /r/, so the allophone [ɛ] in long position is not used in her "natural" speech. Producing these vowels as long vowels during the experiment, however, might not have been a disadvantage: being not perfectly natural, they might be expected to be produced more carefully, as they demand higher concentration.

Clearly, it is not an everyday speech activity to keep a vowel quality steady for 15 seconds. One might argue that such a task seems more like singing. Even acknowledging differences in articulatory strategy between speech and singing, there might be no great differences in the geometric relationships under investigation, but a valid question is nonetheless which vocalizing mode was actually being recorded in the present study.

Crystal (1997) identifies two main differences between music and speech: "Music is composed to be repeated; speech is not. And, if we examine modern western music, we find tones that have been given absolute values, whereas those of speech are relative" (page 173). The repetition criterion suggests that the vocalization in the present study is singing, since the subjects were to repeat the same procedure 66 times (11 vowels * 3 repetitions * 2 imaging modes). However, in many phonetic investigations speech materials are controlled and repeated, and even if this fact, to a greater or lesser extent, estranges the speech in such investigations from unscripted speech, it does not transform the speech into song. The second criterion concerns specific tone frequency values and fixed intervals in singing, as opposed to relative intonation patterns produced at different frequency levels and ranges in speech ("People are not instruments. They do not speak out of tune." Crystal 1997, page 173). According to this criterion, the vocalization in the present study is speech, since there were no pre-specified pitch levels set out for the vowels (although it turned out that the subjects chose to select and keep particular pitches; see section 8.1.1.2). Crystal does not mention the fixed time values of notes in musical scores. The duration of each vocalization was pre-specified in this study, which made the result more like singing mode.

To avoid becoming lost in theoretical arguments, it might be sensible to acknowledge that distinctions between singing and speaking depend heavily on function, tradition and training. Compare, for example, the psychological, articulatory and acoustic distance from everyday speaking mode to a) the highly elaborated and technically demanding vocalization of western classical opera; b) the melodic, storytelling type of vocalization in country music; and c) the rhythmic, small-pitch-range type of vocalization in rap music (an overview of articulatory and acoustic characteristics in different singing styles is given by Sundberg 2001: 255-266). This implies that the acoustic and articulatory output may give one definition, and the actual aim of the performer may give another. Both subjects of this study had musical training, but they were not instructed to sing; they were instructed to maintain a spoken vowel for a longer time than usual. This is taken to be of decisive importance.

Table 2-2. Recorded vowels and their ASCII coding throughout the text (distribution in the Swedish lexicon mainly according to Garlén 1988, and not including compounds). The difference in production of /ø/ between subjects is regional.

Phoneme | Example words (stressed) | Recorded allophone | ASCII code | Lip configuration
/i/ | bi | [i:] | i | Spread
/y/ | ny | [y:] | y | Outrounded
/ʉ/ | du | [ʉ:] | U | Inrounded
/e/ | se | [e:] | e | Spread
/ø/ | tö, bör | [ø:] (female subj.); [œ:] (male subj.) | oe | Outrounded
/ɛ/ | knä, kär | [ɛ:]; [æ:] | E; ae | Spread/neutral; Neutral
/a/ | dra, glatt | [ɒ:]; [a] | A; a | Outrounded; Neutral
/o/ | så | [o:] | o | Inrounded
/u/ | sko | [u:] | u | Inrounded
