
Department of English

Bachelor’s degree Project English Linguistics

Autumn 2018

Automatic phonological transcription using

forced alignment

FAVE toolkit performance on four non-standard varieties of English

Valeria Sella


Automatic phonological transcription using forced alignment

FAVE toolkit performance on four non-standard varieties of English

Valeria Sella

Abstract

Forced alignment, speech recognition software that performs semi-automatic phonological transcription, constitutes a methodological revolution in the recent history of linguistic research. Its use is progressively becoming the norm in research fields such as sociophonetics, but its general performance and range of applications have been relatively understudied. This thesis investigates the performance and portability of the Forced Alignment and Vowel Extraction program suite (FAVE), an aligner that was trained on, and designed to study, American English. It was decided to test FAVE on four non-American varieties of English (Scottish, Irish, Australian and Indian English) and a control variety (General American). First, the performance of FAVE was compared with human annotators, and then it was tested on three potentially problematic variables: /p, t, k/ realization, rhotic consonants and /l/. Although FAVE was found to perform significantly differently from human annotators on identical datasets, further analysis revealed that the aligner performed quite similarly on the non-standard varieties and the control variety, suggesting that the difference in accuracy does not constitute a major drawback to its extended usage. The study discusses the implications of the findings in relation to doubts expressed about the usage of such technology and argues for a wider implementation of forced alignment tools such as FAVE in sociophonetic research.

Keywords

Forced alignment, FAVE, phonological transcriptions, English varieties, sociophonetics.


Contents

1. Introduction and background
1.2 Forced alignment generalities
1.3 Forced aligners
1.4 FAVE on research and research on FAVE
2. Aims and research questions
3. Method
3.1 Data
3.2 Analysis
4. Results
4.1 General results
4.2 /p, t, k/
4.3 /r/
4.4 /l/
5. Discussion
6. Conclusion
References
Appendix A


1. Introduction and background

Speech recognition, the "automation of the processes of auditory perception and comprehension" (Crystal, 2010, p. 155), is becoming an essential aid in linguistics subfields that study speech. This kind of technology is especially useful in speech corpora studies, where manual phonological transcription can take up to 50-60 times real time: for every minute of recorded speech, it can take up to one hour to produce a phonetic transcription (Strik & Cucchiarini, 2014, p. 91). Such laborious, costly, and incredibly time-consuming processes can often be simplified with the use of Automatic Speech Recognition (henceforth ASR) toolkits.

A relatively recent ASR-based tool that significantly supports phonological and phonetic analyses is forced alignment, semi-automatic phonological transcription software that relies on speech recognition engines and machine-readable pronouncing dictionaries. Given an audio file and an orthographic transcription, forced alignment software annotates the sound stream automatically, i.e., it places boundaries at the beginning and end of individual phones and issues a phonetic transcription (Bailey, 2016, p. 11). It is a valuable tool for sociophonetics and any study dealing with accent variation, as it dramatically reduces the time needed for the phonological annotation of speech data. Furthermore, several areas have benefited from forced alignment, for instance language pathology, second language learning, and forensic phonetics (Foulkes, Scobbie & Watt, 2010, pp. 736-737).

At present, one of the leading forced aligners is the Forced Alignment and Vowel Extraction program suite (henceforth FAVE), developed at the University of Pennsylvania (Rosenfelder, Fruehwald, Evanini & Yuan, 2011). FAVE has been increasingly used in recent sociophonetic work, and its advantages, limitations, and issues have been discussed by several interdisciplinary scholars. However, the list of studies on forced alignment is far from long, and FAVE, although one of the leading aligners, has but a small share of that list, with research so far primarily focusing on using or testing it in the US. This toolkit, however, could prove useful in research on other, non-American varieties of English without further changes to its acoustic models, and reliably assist linguists who face tremendous loads of transcription work. Thus, there is a definite need for research on how FAVE performs, in order to determine whether it can be used at present on varieties of English quite different from the one it was trained on, without any further modification.

To this end, it was decided to examine the portability of FAVE by testing its performance on a control variety (American English) and four non-American varieties (Irish, Scottish, Australian, and Indian English), focusing on three consonantal variables (/p, t, k/ realization, rhotic consonants and /l/). For this purpose, the analysis involved comparing FAVE to human annotators. Although it was found that FAVE’s transcriptions differed significantly from manual annotations, the results revealed that the software’s performance on the non-standard varieties was surprisingly close to that of the control variety. The study concluded that FAVE’s use for non-American varieties of English is indeed a viable option.


1.2 Forced alignment generalities

The details of the functions and processes involved in speech recognition in general, and forced alignment in particular, are fairly complicated and multidimensional. Such models include several statistical and mathematical computations that a simple explanation cannot hope to do justice to. Thus, while a thorough discussion of the technology behind forced alignment is beyond the scope of the present study, a brief description of how forced alignment works was considered necessary for a broader understanding of FAVE. This section will describe the generalities of forced alignment, refer to some of the leading forced aligners, and narrow down to a more detailed explanation of how FAVE works before discussing the existing literature around it. The final section will define the aims and research questions of this paper.

As stated previously, forced alignment is speech technology software used to extract phonological transcriptions. It time-aligns text and speech, usually relying on semi-automatic ASR technology. Since the software is semi-automatic, the user is required to supply two inputs: an audio file and an orthographic transcription of its content. The process through which forced alignment yields phonological transcriptions is illustrated in Figure 1, schematically represented as three stages. The first stage is the input forced alignment requires; the second stage is the first processing of that input.

Stage two is therefore the conversion of both textual and audio input into forms the aligner can process. Regarding the textual input, the aligner runs the orthographic transcription through a pronouncing dictionary, converting it into the phonetic transcription needed for the recognition performed on the audio signal. Pronouncing dictionaries provide a phonetic transcription for as many words of a language variety as possible. The size of the pronouncing dictionary in use, as well as the variety of realizations it provides, is essential for the success rate of the aligner (Bailey, 2016, p. 12). Regarding the audio file, the aligner performs digital processing of the audio signal, in which the waveform goes through a series of mathematical transformations and is turned into a cepstrum via a Fourier transform. In essence, this process divides the audio signal into frames so that they can be processed at the next stage (Coleman, n.d., p. 16).
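To make this framing and cepstral step concrete, the sketch below frames a waveform and computes a real cepstrum per frame with NumPy/SciPy. It is only illustrative: the file name is hypothetical, a mono recording is assumed, and production aligners such as FAVE use HTK's MFCC front end rather than this minimal pipeline.

    import numpy as np
    from scipy.io import wavfile

    def frame_signal(signal, sample_rate, frame_ms=25, step_ms=10):
        # Split the waveform into overlapping frames (the Stage 2 audio step).
        frame_len = int(sample_rate * frame_ms / 1000)
        step = int(sample_rate * step_ms / 1000)
        n_frames = 1 + (len(signal) - frame_len) // step
        return np.stack([signal[i * step:i * step + frame_len]
                         for i in range(n_frames)])

    def real_cepstrum(frame):
        # Cepstrum of one frame: inverse FFT of the log magnitude spectrum.
        spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
        return np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))

    rate, audio = wavfile.read("speaker.wav")  # hypothetical mono input file
    frames = frame_signal(audio.astype(float), rate)
    cepstra = np.array([real_cepstrum(f) for f in frames])
    print(cepstra.shape)  # (number of frames, cepstral coefficients per frame)

Each row of the resulting array is one frame's cepstral representation, which is what the recognition stage described next operates on.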


Figure 1. Forced alignment process. [Flowchart: Stage 1, input (orthographic transcription; audio file); Stage 2, conversion (pronouncing dictionary yields a phonetic transcription; digital processing yields frames); Stage 3, speech recognition and alignment of audio and text (HMMs, decision making, alignment).]

In the third and final stage, there are several speech recognition components, with the Hidden Markov Model (HMM) method having a prominent role. HMM is a pattern recognition model where "the probabilities at each step are seen as depending on the outcome of previous steps" (Crystal, 2010, p. 156). In the case of speech technology, HMMs estimate the probability of a given phoneme occurring based on the presence of previous phonemes, in order to recognize, match and set accurate boundaries around them in the speech signal. They are 3-5 state models, as Figure 1 above shows, where each circle depicts the probability of being in a state and each arrow the probability of transitioning to another state (Coleman, n.d., pp. 19-20). HMM training is an iterative process in which the model gathers statistical information on the acoustics of the phones it is exposed to. Based on these observations, the HMMs calculate the probability of a state (i.e., a phone) matching a given point in the cepstrum.

Consequently, the accuracy improves with increased exposure to data. In the above exemplification of /p/ in /P EH T/, each state represents the beginning, middle, and end of a phoneme respectively. When the HMMs recognize this pattern in the cepstrum, they set the boundaries before and after the phone. After this procedure is done for all the phones, an aligned transcription is issued marking the end of the forced alignment process. This process is generally the standard one that all forced aligners follow.
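The scoring step described here can be sketched with the forward algorithm. The toy model below keeps to the same assumptions (a 3-state left-to-right phone model), but all transition and emission probabilities are made-up stand-ins for illustration; real aligners use trained Gaussian emissions over cepstral features.

    import numpy as np

    # Toy left-to-right HMM for one phone, with three states (beginning,
    # middle, end) as described above. All numbers are illustrative.
    trans = np.array([[0.6, 0.4, 0.0],   # row i: P(next state j | state i)
                      [0.0, 0.7, 0.3],
                      [0.0, 0.0, 1.0]])
    start = np.array([1.0, 0.0, 0.0])    # the model is entered at state 1

    def forward(emissions):
        # emissions[t, s] = P(frame t | state s), e.g. from Gaussians over
        # cepstral coefficients; returns P(all frames | this phone model).
        alpha = start * emissions[0]
        for t in range(1, len(emissions)):
            alpha = (alpha @ trans) * emissions[t]
        return alpha.sum()

    # Five frames scored against the three states (made-up likelihoods):
    emissions = np.array([[0.9, 0.1, 0.1],
                          [0.7, 0.5, 0.1],
                          [0.2, 0.9, 0.3],
                          [0.1, 0.6, 0.8],
                          [0.1, 0.2, 0.9]])
    print(forward(emissions))  # a higher score means a better match

An aligner evaluates such scores for candidate boundary placements and keeps the segmentation with the highest likelihood.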

1.3 Forced aligners

Over recent years, forced alignment has been increasingly used in sociolinguistics, and it is one of the latest methodological advances to have shaped the history of quantitative sociophonetic research. Forced aligners dramatically decreased the time needed for phonetic segmentation of gathered speech data and have gradually become a standard method of phonetic analysis for any large-scale project. In 1972, the vowel measurements in Labov's investigation were conducted manually, using a hard-wired spectrum analyzer with which 300-500 vowels could be studied in a week; less than 30 years later, with the appearance of the first computer spectrograms, the same amount of work could be done in a day (Labov, 2001, p. 151). Nowadays, with the fast advancement of speech recognition technology, forced aligners can perform the same task within minutes.

In the past decade, many forced aligners have been developed to respond to the need for faster speech data analysis. The Penn Phonetics Lab Forced Aligner (P2FA) (Yuan & Liberman, 2008), the Prosodylab-Aligner (PLA) (Gorman, Howell & Wagner, 2011), and the Forced Alignment and Vowel Extraction program suite (FAVE) (Rosenfelder et al., 2011) are some of the most used forced aligners for the investigation of English varieties. Several studies have made use of these aligners, reporting positive results regarding their performance and accuracy, and have offered directions for their improvement as well.

Yuan and Liberman have worked on and with P2FA (2009, 2011, 2013), while PLA has been used even on child speech by Knowles et al. (2015), with a less accurate but still quite notable performance, especially if one considers the difficulties child speech poses to speech recognition software. Worth mentioning is also the Dartmouth Linguistic Automation service (DARLA), a fully automated forced alignment tool that does not require any transcription beforehand (Reddy & Stanford, 2015). It is, however, significantly less accurate than the semi-automatic aligners, as this kind of speech technology has not yet advanced far enough to dispense with pre-processing or post-correction (Bailey, 2016, p. 13).


FAVE is a modified version of P2FA, with Evanini's (2009) automatic vowel extraction added to it. It is adapted to account for overlapping speech, and it is therefore suitable for sociolinguistic interviews. Its acoustic models are trained on official recordings from the Supreme Court of the US, and for its phonetic transcriptions it makes use of the Carnegie Mellon University pronouncing dictionary (CMU), an extensive and consistent machine-readable pronouncing dictionary of the General American English variety (Bailey, 2016, p. 12). CMU uses the ARPAbet phoneme set, a transcription convention used in speech recognition, which is related to the International Phonetic Alphabet.
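Since the CMU dictionary is a plain-text resource, its lookup step is easy to illustrate. A minimal parsing sketch follows; the file name is an assumption, and the helper function is hypothetical rather than part of FAVE's own code.

    def load_cmudict(path):
        # Parse the plain-text CMU dictionary: one "WORD  PH ON EMES" entry
        # per line; alternate pronunciations appear as WORD(1), WORD(2), ...
        entries = {}
        with open(path, encoding="latin-1") as f:
            for line in f:
                if not line.strip() or line.startswith(";;;"):  # comments
                    continue
                word, phones = line.rstrip().split("  ", 1)
                word = word.split("(")[0]          # strip variant markers
                entries.setdefault(word, []).append(phones.split())
        return entries

    cmu = load_cmudict("cmudict-0.7b")  # file name is an assumption
    print(cmu["PET"])  # e.g. [['P', 'EH1', 'T']]: ARPAbet with stress digits

The stress digits on the vowels (0, 1, 2) are part of the ARPAbet convention mentioned above.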

FAVE is scripted in the Python programming language and consists of two toolkits: FAVE-align and FAVE-extract. FAVE-align requires as inputs the audio in the WAVE format as well as a tab-delimited text file of the orthographic transcription, segmented into breath groups (Fruehwald, 2013). It uses the HTK toolkit to match the orthographic transcriptions to their corresponding place in the speech signal and produces a phonetic transcription using CMU. Finally, the software issues the time-aligned transcription in the form of a PRAAT TextGrid. PRAAT (Boersma & Weenink, 2015) interprets the audio file, visually representing it as a waveform and a spectrogram, and displays the aligned transcription in two tiers. The top tier shows the boundaries between the phonetic transcriptions of the individual phones, while the bottom one contains the orthographic transcription of the entire word (Figure 2).

Figure 2. The aligned FAVE TextGrid file and audio file, viewed in PRAAT.
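A run of FAVE-align producing such a TextGrid might look like the sketch below. The file names are hypothetical, and the exact script name and options depend on the FAVE version in use (see Fruehwald, 2013, "Using FAVE-align"), so this is an assumption rather than a definitive invocation.

    import subprocess

    # Shell out to FAVE-align (distributed as FAAValign.py in the FAVE
    # repository); all three file names below are hypothetical examples.
    subprocess.run([
        "python", "FAAValign.py",
        "speaker.wav",        # audio input in the WAVE format
        "speaker.txt",        # tab-delimited orthographic transcription
        "speaker.TextGrid",   # time-aligned output, viewable in PRAAT
    ], check=True)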

FAVE-extract, on the other hand, is the vowel formant extraction tool of FAVE. Vowel extraction is a separate function of the aligner, which can be performed only using the segmented files FAVE-align produces. Although not a part of the forced alignment process, it is a significant service FAVE offers, as it is "designed to improve the accuracy of the formant measurement and to ensure that measured formant frequencies correspond to the auditory impression of the tokens in question" (Labov, Rosenfelder & Fruehwald, 2013, p. 35).


1.4 FAVE on research and research on FAVE

As Bailey (2016) states, forced alignment programs like FAVE constitute a methodological revolution in contemporary sociolinguistic research (p. 11). Fruehwald (2014), one of the developers of FAVE, has stressed the advantages in consistency and replicability that the use of this technology may bring to sociolinguistics, pointing out that the unbiased analysis offered by automation is of great importance (pp. 19-21). FAVE has been used in many studies of language variation over the past years, and the general comments have been positive, praising its aid especially in sociophonetic research. This section will discuss a few examples and their findings on FAVE's performance, as well as their considerations of its limitations.

In the US, Labov et al. (2013) investigated the Philadelphia Neighborhood Corpus with it, reporting its invaluable assistance in vowel measurements (9,000 for fifty minutes of audio, as opposed to 300 if done manually) and noting that the aligner's accuracy increases with the number of tokens it measures (pp. 37-38). FAVE has also been used by Stanford, Severance and Baclawski (2014) on the changes of the Eastern New England dialect (ENE) to analyze the vowel variables within the shift they were investigating. Although the authors pointed out the limitations of forced alignment, they did praise its aid in "conducting a relatively large-scale study with limited resources" (p. 110). Moreover, even though testing FAVE's performance was beyond the scope of their study, they compared their manual transcriptions of a portion of their data with FAVE's, positively assessing the reliability of the software (p. 111).

Despite the prominence of FAVE in comparison with other competing toolkits, some critical issues remain. Examining FAVE, Severance, Evanini, and Dinkin (2015) evaluated its error rate as well as possible tendencies in error occurrence, using Dinkin's (2009) data from Upstate New York. Their findings showed that the mean error percentage was less than 4 percent, but the question of whether this is acceptable compared with human transcriber agreement rates was left unanswered, as they noted that no conclusive answer is to be found in the existing literature (p. 37). Still, they drew attention to the fact that a non-trivial portion of the vowel measurements was particularly erroneous, implying that there is a need for further improvements in the alignment and vowel extraction of the software (p. 40).

Moreover, FAVE has been criticized for its inaccuracy in annotating spontaneous speech. Liu, Mallard and Silva (2017) proposed methods of improvement through automatic lexicon enrichment and acoustic model improvement, yet the results of their study suggested that forced alignment is more accurate when "the acoustic models are trained in formal speech" (p. 7). They found, however, that lexical enrichment (the insertion of any words that are not included in the pronouncing dictionary in use but exist in a given orthographic transcription) improves the accuracy of the aligner, and that "being able to pinpoint words that should undergo phone reduction similar to actual conversational speech should further increase forced alignment quality" (pp. 7-8).

Another limitation of FAVE is that of portability, that is, whether or not it can be used with varieties other than American English. Outside the US, Bailey (2016) tested FAVE on Manchester English on specific phonological variables, applying pronouncing dictionary expansion as well. This is a needed modification, as FAVE does not account for variability beyond the variety it was trained on, and lexicon expansion is therefore one of the solutions. He emphasized the importance of addressing the problem that fast speech rate poses to FAVE (much less of an issue for human transcribers) and how it subsequently affects the accuracy and reliability of its phone recognition (p. 18). Bailey acknowledges that, despite their occasional inaccuracy, such tools are worth using to advance research with larger quantities of data. He especially highlights the need for further research on how forced alignment deals with the discrimination of different variants and proposes that technological enhancements can be implemented to improve the software's reliability (p. 19).

Finally, MacKenzie and Turton (2013) studied FAVE's performance on four British dialects (RP, Essex, Manchester, and Liverpool) along with two other aligners. Their focus was the accuracy of vowel segmentation, as the classes of vowels differ between the selected dialects (for example, one question was whether or not the aligner's accuracy is affected by the presence of a STRUT-FOOT distinction in the dialect). The authors mentioned recognition issues with non-standard dialect consonantal features as well. For example, since FAVE looks for non-prevocalic /r/ as in General American, it has difficulties with non-rhotic dialects, as it tries to find an /r/ where there is none. Nevertheless, the study reported that the performance of FAVE was surprisingly good. While the authors pointed out FAVE's disadvantage in customization in comparison with PLA, which can be trained on any given speaker/dialect model, FAVE can deal with "messy data" like spontaneous/conversational speech (as opposed to the formal style of reading passages/wordlists), in contrast to the other aligners. In addition, it breaks the data down into breath groups automatically, unlike PLA, which requires manual pre-processing (pp. 36-38).

2. Aims and research questions

In line with MacKenzie and Turton's (2013) study, this thesis aims to investigate to what extent forced alignment software can be reliably used to analyze non-standard varieties of English, focusing on the performance of FAVE. Their investigation indicated that FAVE is, in fact, amenable to the study of other, non-American varieties, and the purpose of the present study is to examine this proposition with further data.

However, at least two important limitations of their method must be underlined: first, MacKenzie and Turton (2013) did not include a control variety, that is, result data for a variety on which the program is known to perform well (for example General American) and which acts as a point of comparison for the overall performance results.

Second, their study focused primarily on vowel identity. It may be argued, however, that when judging the accuracy of FAVE's boundary placement around vowels, it is not so much the acoustic quality of the vowels that poses a problem as the consonantal environment in which they are found. Vowel boundaries might be more or less accurate depending on the context at hand, and inaccurate annotation of consonants can create bigger issues, as in the example of /r/ mentioned before. Thus, a selection of consonants known to vary greatly across different varieties of English will be the main topic of interest for the present study.

Despite the length limitations of the thesis, the choice of varieties was as broad as possible considering the time at hand; Scottish, Irish, Australian and Indian English were chosen to be tested in comparison with a control variety, General American. The choice of the variables was based on which consonants present greater variation in pronunciation between the four varieties and, consequently, would pose greater problems in the process of alignment. These variables are: 1) /p, t, k/ realization, 2) rhotic consonants, and 3) /l/ (clear/dark/vocalization). According to Wells' (1982) reference work, the aspiration of /p, t, k/ can vary or be completely absent in some of these varieties, while /t/ varies in place of articulation, such as in the opposition between Indian English retroflex /t/ and Scottish English T Glottaling, as well as in voicing (pp. 409-410, 430, 603, 627-628). /r/ varies as well in place of articulation, while even fully rhotic dialects such as Scottish English, affected by RP, may drop /r/ occasionally (pp. 410-411, 603, 432, 629). Regarding /l/, more subtle or gradual variation between dark and clear /l/ points to a fairly sharper division compared to RP, while there is also variation in coloring and place of articulation (pp. 412, 431, 603, 629). These characteristics and the varieties they correspond to will be further discussed in the results section.

Based on the issue of FAVE's portability and the need for a better understanding of its performance on consonantal variation, the main research question was formulated as follows: How viable is FAVE in research on non-standard varieties of English, considering its current limitations? In order to provide an answer, this question was operationalized in two different ways: 1) How do FAVE's annotations of the varieties under study compare with those produced by human annotators, and 2) What issues arise, and to what extent is the accuracy of time-alignment affected by the variation in the consonants under study? The next chapter will introduce the methodology used to answer these questions.

3. Method

3.1 Data

As stated above, the thesis aimed to assess FAVE's accuracy on a range of varieties of English, especially concerning some of the consonantal features that are expected to pose problems to the aligner. Based on the nature of the research questions, a quantitative methodological approach was chosen, similar to the procedures adopted in previous research (MacKenzie & Turton, 2013; Bailey, 2016). This subsequently necessitated comparing a selection of varieties of English with a control variety on which the software is known to perform well. Given the time limitation, but also the aspired scope of the investigation regarding the number of varieties under study, it was considered appropriate to obtain the samples from an existing corpus rather than actively collecting them. The audio material for each variety and its subsequent annotations, manual and automatic, would be the data of the investigation.

The samples were selected from the data made available by the PAC project (La Phonologie de l'Anglais Contemporain: usages, variétés et structure / The Phonology of Contemporary English: usage, varieties, and structure). This project started in 2000 as an extension of the PFC project (La Phonologie du Français Contemporain), with the primary objective of representing spoken English variation in the world. The methodology of this corpus, following Labov's work, was designed to facilitate a strict phonological comparison between varieties (Carr et al., 2004, p. 24). It provides recordings of various styles of speech (wordlists, running text, formal and informal interviews) and has so far recorded informants from 37 locations in 9 English-speaking countries. After selecting the control variety (General American) and the four non-American varieties (Scottish, Irish, Australian, and Indian English), it was decided to use the two wordlists for the annotations. These wordlists, which amount to 192 words, "allow the examination of a wide sample of segmental phenomena" and were therefore suited to this study (p. 25).

All the variables under focus (/p, t, k/, /r/, /l/) had an adequate number of instances within these lists. Moreover, they could be found in both pre-vocalic and post-vocalic positions, which was necessary for the assessment of FAVE’s performance on all consonantal environments. Out of the 192 words, 151 words had at least one of the variables relevant to the analysis of the study (see Appendix A).

Table 1. Information on the speakers in the selected recordings from PAC.

Varieties            Sex   Age   Place of birth            Source
General American     F     32    Sacramento, CA            Durand et al. 2015
Scottish English     F     48    Annbank, South Ayrshire   Durand et al. 2015
Irish English        M     31    Dublin                    Durand et al. 2015
Australian English   M     56    Melbourne                 Durand et al. 2015
Indian English       M     31    Delhi                     Domange (1)

(1) Unpublished material.

As shown in Table 1, the selection comprises recordings of two female and three male speakers, aged between 31 and 56 at the time of the recordings, each corresponding to one of the dialects under study. It is important to note at this juncture that the purpose of this study was not to provide a general description of Scottish, Irish, Australian or Indian English; such an extensive analysis lies outside the scope of the present investigation. The speakers should not be taken as representative of each variety, but rather as indicative for the purposes of scrutinizing FAVE in the face of input variation.

The general characteristics of their phonology (discussed further in relation to the variables in focus in the qualitative descriptions in the result sections) indeed agree with the general characteristics of their dialects in standard descriptions, but the very existence of variation in their speech was the actual point of interest. These informants were chosen because PAC offers the possibility to make a systematic phonological comparison between varieties to an extent which is difficult to achieve with any other available corpora.

After obtaining the wordlists for all five varieties, the next step was to produce manual transcriptions by two annotators, which would be compared with FAVE's aligned transcription of the same material. Such an approach is recommended for studies evaluating automatic phonetic transcriptions, where a consensus transcription acts as a reliable "reference point" for the automatic one (Strik & Cucchiarini, 2014, p. 95). The author of the study acted as Annotator 1 and the supervisor as Annotator 2. Both annotators manually annotated all the material in PRAAT, labelling each phone using the ARPAbet transcription convention, since a more accurate labelling tailored to each variety was not the focus of the study. The segmentation process was based on Ladefoged's (2003) descriptions of duration measurement techniques, and the annotations were performed by identifying patterns in the spectrogram and waveform while combining auditory cues from the audio. This process was discussed and rehearsed prior to the annotation of the material by running an early test annotation using Indian English, to ensure that Annotator 1 was following all the acoustic phonetics conventions appropriately. This inspection was repeated after every annotation so that any resulting disagreement between annotators would not be a product of oversight. After obtaining the transcriptions of all the material from both annotators, the orthographic transcriptions (which were checked by both annotators for mistakes) were provided to FAVE to perform the same task, and the sum of the annotations acted as the data for the analysis.

3.2 Analysis

Since variation in agreement is expected to exist between human annotators as well, FAVE's performance was not anticipated to be entirely accurate or to outperform experienced annotators. The comparison between FAVE's annotations and the manual annotations sought to determine whether the disagreement rate displayed is relatively similar to that of having a third human annotator involved in any given phonological transcription process.

After obtaining the transcriptions from both annotators and FAVE and excluding all the irrelevant data within the annotations (numbers, silences, errors, words without any of the phones of interest), the first step was to extract the onset and offset times of all the vowels surrounded by the phones of interest. For example, for the word pet, the annotated onset and offset times of /ɛ/ in each transcription had to be extracted: the onset and offset times then signify the placement of the boundaries between /p/ and /ɛ/ and between /ɛ/ and /t/ respectively, as illustrated in Figure 3. The boundary displacements (MacKenzie & Turton, 2013, pp. 49-52) were then calculated between the two annotators and between Annotator 1 and FAVE. The boundary displacements correspond to the absolute difference between each relevant time index for each annotated item. For example, in Figure 3 the boundary displacement between Annotator 1 and Annotator 2 for /p/ in pit corresponds to 0s, and between Annotator 1 and FAVE to 0.008s. The two groups (Annotator 1 versus Annotator 2 and Annotator 1 versus FAVE) were then tested for statistical differences in onset and offset positions with a paired sample t-test in R.

Figure 3. Example of boundary placements by Annotator 1, Annotator 2 and FAVE, represented as annotations of onset and offset times around the vowels (Phones column).
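The displacement computation and the paired test are straightforward to reproduce. The thesis ran them in R; an equivalent sketch with SciPy is shown below on made-up boundary times (the first token mirrors the /p/ example: 0 s between annotators, 0.008 s against FAVE).

    import numpy as np
    from scipy import stats

    # Onset times (seconds) for the same tokens from each source.
    # Toy values standing in for the PRAAT TextGrid data described above.
    a1   = np.array([0.132, 0.541, 0.909])   # Annotator 1
    a2   = np.array([0.132, 0.545, 0.901])   # Annotator 2
    fave = np.array([0.140, 0.533, 0.921])   # FAVE

    disp_humans = np.abs(a1 - a2)    # boundary displacement, A1 vs. A2
    disp_fave   = np.abs(a1 - fave)  # boundary displacement, A1 vs. FAVE

    # Paired t-test: do the two displacement series differ over the same tokens?
    t, p = stats.ttest_rel(disp_humans, disp_fave)
    print(f"t = {t:.2f}, p = {p:.3f}")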

In a second step, the displacement rates between FAVE and Annotator 1 for each variety and each group of variables were compared by running a one-way analysis of variance (ANOVA), in order to see how differently FAVE performs on each variety. Since the distribution of the results was not normal, a power transformation via square root had to be applied to the data2 prior to testing. Where ANOVA detected overall statistically significant differences between varieties for the variables of interest, a Tukey post-hoc test was performed in order to determine which specific varieties differed. In the following section, the general results will be discussed first, followed by a detailed report of the results for each variable separately.

4. Results

4.1 General results

As stated in the aims and research questions, the primary objective of this study was to scrutinize FAVE's performance on input quite different from what it was trained on. In the process of annotating the four non-American varieties of English (Irish, Scottish, Australian, and Indian English) against the control variety (General American), FAVE's annotations did differ in a statistically significant way from those of the human annotators.

2 This transformation was not applied for the paired t-test, since there the normality assumption applies not to the groups tested but to the differences between the groups.


As the results of the t-tests, presented in Tables 2 and 3, revealed, boundary displacement between Annotator 1 and Annotator 2 significantly differed from that between Annotator 1 and FAVE for all five varieties, in all the environments (onset and offset alike). For onset, the largest difference between the groups (Annotator 1 versus Annotator 2 and Annotator 1 versus FAVE) was found for Scottish English, while for offset it was for Irish English. However, the displacement rates did not vary greatly between the five varieties. FAVE performed well on General American, as expected, but, strikingly, this was also the case for the rest of the varieties to a certain extent.

Overall, its performance on the four varieties under focus was surprisingly good considering the spectrum of accent differences the software had to deal with.

Table 2. Onset boundary displacement for each of the five varieties: between the human annotators (A1-A2), between Annotator 1 and FAVE (A1-FAVE), and the difference in agreement between the two groups.

Speaker   A1-A2 Mean   A1-A2 S.D.   A1-FAVE Mean   A1-FAVE S.D.   Difference
GA        0.01         0.013        0.015          0.013          t(244) = -6.19***
SCE       0.008        0.011        0.025          0.040          t(242) = -6.36***
IRE       0.01         0.011        0.014          0.014          t(238) = -4.29***
AUE       0.01         0.012        0.019          0.023          t(237) = -5.83***
INE       0.008        0.012        0.015          0.019          t(235) = -5.71***

Note: Significance levels *** p ≤ 0.001, ** p ≤ 0.01, * p ≤ 0.05, n.s. p > 0.05

Table 3. Offset boundary displacement for each of the five varieties: between the human annotators (A1-A2), between Annotator 1 and FAVE (A1-FAVE), and the difference in agreement between the two groups.

Speaker   A1-A2 Mean   A1-A2 S.D.   A1-FAVE Mean   A1-FAVE S.D.   Difference
GA        0.019        0.021        0.035          0.031          t(244) = -8.56***
SCE       0.019        0.023        0.038          0.044          t(242) = -6.7***
IRE       0.014        0.017        0.057          0.052          t(238) = -12.65***
AUE       0.015        0.024        0.037          0.039          t(237) = -8.47***
INE       0.014        0.02         0.025          0.032          t(235) = -6.36***

Note: Significance levels *** p ≤ 0.001, ** p ≤ 0.01, * p ≤ 0.05, n.s. p > 0.05


Generally, the displacements of the onset boundary placements for both groups were smaller than the respective offset ones. The agreement between the human annotators was rather high as well as consistent across varieties, offering a solid baseline for comparison with FAVE. This is shown by the way the displacement means remained low across varieties for both onset and offset annotations. For onset positions, the least disagreement was found for Scottish English and Indian English, with the other three following closely. The disagreement rates increased for offset boundary placement, but they were still generally kept at low levels.

When comparing automatic and manual work, the disagreement was higher between Annotator 1 and FAVE than between the two annotators. The means in Tables 2 and 3 show that, instead of the expected General American, the least disagreement between Annotator 1 and FAVE occurred on Irish English for onset and on Indian English for offset measurements. Most disagreement on onset was on Scottish English, while for offset annotations it was on Irish English, which, even though it was the best for onset, this time resulted in the biggest difference between the annotator and the aligner, a disagreement considerably bigger than what the rest of the varieties displayed. As between the human annotators, displacement rates increased for offset boundary placements for all varieties. The most sizable difference between onset and offset boundary displacement was found for Irish English, where the mean and standard deviation for offset position more than doubled with respect to the onset measurements.

Overall, even though the manual annotations were considerably more accurate, FAVE performed well for all varieties. FAVE was expected to show lower disagreement, and therefore a higher resemblance to the human annotators, for the variety it was trained on, but the smallest difference was not on General American for either onset or offset. The fact that FAVE did not perform noticeably better on General American than on the other varieties is an important finding, as it overturned the original expectation that the software would clearly yield the best results for the control variety.

This unexpected result will become clearer in the next three subsections presenting the results for each variable, where the performance on the non-American varieties may often be considered as good as on the control variety, if not sometimes slightly better regarding annotation accuracy. First, a qualitative description of the observations on the data will be given before reporting the results of the differences found between varieties in the statistical analyses.

4.2 /p, t, k/

Regarding the phonation of the three stop consonants /p, t, k/, some differences in the length of aspiration in comparison with the control variety were noted. This was the case mostly with Indian English, which had unaspirated /p, t, k/ in all initial positions, but this did not greatly affect the accuracy of the annotations. It was observed that FAVE's judgment of the acoustic cues for the voice onset time (VOT) of /p, t, k/ was overall quite accurate, but in offset boundary placement the software was less precise. This inaccuracy was greater in pre-vocalic positions, as in post-vocalic position the annotation of the preceding vowel often encroached on the hold phase of the stop.


The place of articulation in some varieties was also sometimes problematic for the aligner. While for Indian English the characteristic retroflex /t/ posed no problems, Scottish and Australian English glottalization of /t/ and /k/ did in some instances affect the accuracy of the annotations in final positions. Irish English was also challenging regarding its post-vocalic, word-final /t/. In the informant's accent, there is a fricativisation of /t/ (pet therefore becomes [pɛt̞]), which was often perceived by the program as background noise and was not annotated accurately.

A one-way analysis of variance (ANOVA) on onset boundary displacement yielded significant variation among varieties, F(4, 208) = 4.456, p = 0.00178**. A post hoc Tukey HSD test showed that Australian English differed significantly from General American (p = 0.002**) and Indian English (p = 0.014*). The rest of the varieties were not significantly different from each other. For offset boundary displacement, a greater difference was found between varieties, F(4, 233) = 13.69, p = 4.81e-10***. Between varieties, it was found that Indian English differed significantly from Australian English (p = 0.000***), General American (p = 0.012*), Irish English (p = 0.000***), and Scottish English (p = 0.000***), while Irish English was also significantly different from General American (p = 0.003**). The differences between varieties are illustrated in Figure 4.

Figure 4. Boxplots displaying Annotator 1 vs. FAVE's distribution of /p, t, k/ onset and offset boundary displacement for General American (GA), Scottish English (ScE), Irish English (IrE), Australian English (AuE), and Indian English (InE).

In Figure 4, the boxplots show how the median boundary displacements (measured in square root seconds) between all varieties were quite close to each other for the onset measurements. Indian English along with General American formed the group that FAVE performed best on; the boundary displacement was only significantly different between Indian and Australian English, and General American and Australian English, with the rest of the varieties lying in between. For offset measurements, FAVE performed very well on Indian English outperforming General American, which showed much variation. There is a gradual increase in displacements moving from one variety to the next until Irish English, which, as expected, was found in the last place mainly due to /t/ annotation displacements as in the example of pet above.


4.3 /r/

From the beginning of the investigation, it was anticipated that FAVE would encounter difficulties annotating non-rhotic varieties, since it is trained on a fully rhotic variety and would try to annotate /r/ even when it is not pronounced. It was therefore decided to exclude words in which the two annotators could not find any realization of post-vocalic /r/, in order to have a proper comparison of FAVE's annotations of actual acoustic differences of /r/ between varieties. That was primarily the case with Indian and Australian English, where words like more and heart were pronounced as /mɔ:/ and /hɑ:t/ respectively. The place of articulation differed mainly for Scottish and Indian English, where instead of the retroflex or bunched /r/ of General American there was often an alveolar tap. Irish English /r/, on the other hand, presented a significant degree of retroflection and was usually associated with much vowel coloring (Wells, 1982, p. 432).

For /r/, ANOVA found no significant difference between varieties for onset boundary displacement, F(4, 152) = 1.871, p = 0.118. In contrast, for offset boundary displacement the analysis yielded significant variation between varieties, F(4, 160) = 7.136, p = 2.59e-05***. The post hoc Tukey HSD test showed that Irish English differed significantly from Australian English (p = 0.003**), General American (p = 0.001***), and Indian English (p = 0.025*), while Scottish English was significantly different from Australian English (p = 0.02*) and General American (p = 0.02*). The results are presented in Figure 5.

Figure 5. Boxplots displaying Annotator 1 vs. FAVE's distribution of /r/ onset and offset boundary displacement for General American (GA), Scottish English (ScE), Irish English (IrE), Australian English (AuE), and Indian English (InE).

As the boxplots indicate, for onset boundary displacement there was no significant difference between the varieties. Nevertheless, Indian and Australian English were in the first positions, with the rest of the varieties following closely. For offset positions it was Australian English, Indian English and General American that were ahead of Scottish and Irish English. FAVE appeared to experience difficulties when the realization of post-vocalic /r/ is acoustically different. For offset measurements, there were significant differences for half of the comparisons, with Irish English occupying the last position because of its characteristically very dark resonance of /r/, especially in word-final positions. Even though this characteristic of Irish /r/ yielded satisfying annotations for the onset, it frequently caused FAVE to displace the offset boundary, as it perceived /r/ to be longer than it was due to the r-coloring of the following vowel.

Additionally, significant variation was noted for Scottish English in comparison with the other varieties for both onset and offset measurements, as a result not only of the place (alveolar), but mainly of the manner of articulation (tap and sometimes trill) of /r/. These particular differences in the manner of the Scottish /r/ posed problems for FAVE similar to the ones it has when dealing with non-rhotic varieties, and the Scottish /r/ was challenging to annotate even for Annotator 1.

4.4 /l/

Each variety differs in its realization of /l/ from what FAVE was trained on (the velarized dark /l/ of General American). Scottish English often has a darker quality of /l/ in pre-vocalic and coda positions, Indian and Irish English /l/ is always clear, while Australian English has a pharyngealized /l/ in all environments. /l/ vocalization was expected to create a lot of variation in the annotations within the varieties, since the boundary between /l/ and a mid-back vowel is sometimes hardly detectable.

No significant difference was found between varieties when running ANOVA for onset boundary displacement, F(4, 53) = 1.653, p = 0.175. For offset boundary displacement, the ANOVA results yielded significant variation between varieties, F(4, 99) = 2.632, p = 0.0387*. A post hoc Tukey HSD test revealed that a significant difference only existed between Indian English and General American (p = 0.041*). The results are illustrated in Figure 6.

Figure 6. Boxplots displaying Annotator 1 vs. FAVE's distribution of /l/ onset and offset boundary displacement for General American (GA), Scottish English (ScE), Irish English (IrE), Australian English (AuE), and Indian English (InE).

The difference in /l/ realization created no significant difference in the onset measurements. FAVE performed best for Indian English, Australian English and General American, leaving Irish and Scottish English behind. As the boxplots in Figure 6 show, Scottish English was found last, mainly due to instances where FAVE was unable to detect any meaningful difference between the dark /l/ in onset and a mid-back rounded vowel (as in the word lock), therefore misplacing the onset boundary for /l/.

For offset measurements, the boundary displacements increased for all varieties; the group with the best performance included only Scottish English and General American, followed by a second group consisting of the rest of the varieties. The boundary displacement difference was found to be significant only between General American and Indian English, where the clear /l/ of Indian English appeared to be problematic for FAVE. In post-vocalic positions, the software performed better with varieties in which the realization of /l/ is similar to General American.

The next section offers a discussion of the present results and their significance, starting with a summary of all the results and closing with a review of the present study’s importance.

5. Discussion

This study aimed at investigating the viability of FAVE on non-standard varieties of English by examining how FAVE compares to human annotators and looking into the extent to which problematic variables such as /p, t, k/ realization, rhotic consonants and /l/ affect the accuracy of the software. In order to answer these questions, the thesis first assessed FAVE's transcriptions in comparison with those by human annotators, and at a second stage it compared the performance of FAVE between varieties for each variable under study. The overall disagreement rate between the two annotators was very small, while that between FAVE and Annotator 1, although statistically significant, was considered quite good for all of the varieties. Furthermore, there was higher disagreement in the placement of the offset than of the onset boundaries, and that was true for both groups. Based on previous research, it was expected that FAVE would perform very well with General American, which indeed it did. One key result, however, was that in all cases FAVE was found to perform at least equally well on one or more varieties along with General American. FAVE's performance on the control variety was not exceptionally better than on the other varieties, as had been anticipated; all four selected varieties stayed quite close to the accuracy rates that FAVE had on General American.

Nevertheless, General American was always in the groups that FAVE performed better on, except for /p, t, k/ where Indian English outperformed all varieties. In particular, Indian and Irish English on onset and Indian and Australian English on offset measurements were very well annotated by the software, and this was especially evident in the results for each variable. Moreover, the comparison of FAVE’s performance between varieties showed that the most significant difference between varieties was found on the /p, t, k/ variable (both the onset and offset), while for /r/ and /l/ there was a significant difference only for offset, and especially for /l/ it was only between one pair (Indian English-General American).


As stated earlier in the thesis, the limited number of studies on FAVE's portability stresses the need for further research on the program's performance. Even though they are significant contributions, most of the previous works on FAVE focused on American or British English varieties and, as such, were limited in scope. The present study tested FAVE on a broader range of varieties of English in order to scrutinize the aligner and test this portability hypothesis. The results suggested that the difference between the boundary placements, measured at between 5 and 15 milliseconds in onset and between 14 and 40 milliseconds in offset positions, although significant, should not be taken as restricting FAVE's use to American English varieties. With a post-correction of the automatic annotations, the user can save a substantial amount of time in comparison with manual annotation. The software is not yet at the accuracy levels of a human annotator and therefore cannot be compared with manual annotations as if it were a third human annotator, but its reliability is still very high. Although they use different methods and focus on different varieties, the results reported here support the conclusions of studies like MacKenzie and Turton's (2013): FAVE is indeed a viable option as a methodological tool for non-standard varieties of English. In general, the addition of results from this and any future study can further broaden the understanding of FAVE's capabilities within sociolinguistic circles, and further support the usage of such technology.

The most important conclusion to be derived from the results is that their interpretation relies heavily on how realistic the expectations of the user are. FAVE has received criticism on its accuracy, but only because it is expected to perform like its human counterparts when in reality it assists their work. Seeing how well FAVE can perform on the samples of Scottish, Irish, Australian and Indian English, this study expects that the software could be used safely without further modifying its acoustic models for research on other varieties than GA. Provided that users are aware of the software’s limitations that stem from its specific training, they can certainly use it on their data and even test it on broader variation. The need for manual corrections could be greater for some varieties than for others, and a more fine-grained analysis of acoustic variation such as the one offered in this study should be very helpful in this process. Further research along this line could help derive broader generalizations about the software’s performance, where the use of a control variety is encouraged. Lastly, the existence of studies such as the present one is important for the justification of using FAVE in future research, since forced alignment can save hundreds of hours of manual annotations in sociophonetics studies on the world’s Englishes.

6. Conclusion

The investigation of FAVE's performance on non-American varieties of English was conducted in the context of method automation in the field of linguistics via speech technology software. It aimed at contributing to the general effort of examining how viable current forced alignment tools are in assisting linguistic research. The study considered the limited literature on the topic and the issues of previous investigations, such as the heavy focus on the acoustic properties of the annotated vowels rather than the consonantal environment they are found in, when judging the accuracy of the software. It was therefore considered central for this thesis to provide more information regarding the software's capabilities, unbiased by unrealistic expectations of human-like performance, and to investigate whether the inevitable inaccuracies of automation are restrictive. Examining specific variables that were expected to create problems was essential in providing information as to what future researchers should consider and future developers could improve.

Regarding the significance of the results, discovering that in annotating most of the varieties FAVE performed as well as, or better than, it did on the control variety was considered a positive indication of the software's abilities. This suggested that FAVE is indeed transferable to the study of non-standard varieties of English and can be expected to yield quite reliable results. Furthermore, the fact that FAVE's general accuracy between varieties was, with some exceptions, not substantially different suggests that the software deals adequately with variation in pronunciation. Additionally, as the comparison with the human annotations showed, FAVE might not have the accuracy of a human annotator, but the actual magnitude of the difference does not seem troubling.

Nevertheless, some limitations of the study should be emphasized. The investigation sought to cover some of the range of English varieties, yet the present selection is not enough to support definitive conclusions on the portability of FAVE to all other varieties of English. The current results support its extended use; however, it is possible that FAVE will be more accurate in transcribing some varieties than others, depending on how far they vary from what the aligner was trained to detect and annotate. Furthermore, the selection of variables was limited due to the time and length restrictions of this paper, and testing more variables from the present varieties should therefore provide a better picture. Finally, FAVE was examined on a highly formal speech style (wordlists), and other difficulties may therefore arise from examining connected speech, especially conversational data.

Closing with the implications of this study, the importance of this thesis is two-fold; it provided further evidence supporting the portability of FAVE and drew attention to some of the consonantal variables that were expected to affect the software’s accuracy.

If used properly, that is, with awareness of its limitations and realistic expectations of its performance, FAVE is a valuable tool, especially in the hands of sociophoneticians. Taking a broader view of the linguistic research landscape, what should be taken from studies such as the present one is that advancements in speech technology ought to be implemented wherever appropriate in linguistic investigations. Ultimately, future studies are encouraged to test FAVE on more varieties of English, focusing on several different variables (consonants and vowels alike), not only for the sake of evidence for or against using the software in linguistic research, but also to provide feedback to current and future developers for the improvement of forced alignment toolkits.


References

Bailey, G. (2016). Automatic detection of sociolinguistic variation using forced alignment. University of Pennsylvania Working Papers in Linguistics, 22(2), Article 3. Retrieved from https://repository.upenn.edu/pwpl/vol22/iss2/3

Boersma, P. & Weenink, D. (2015). Praat: Doing phonetics by computer [computer program]. http://www.fon.hum.uva.nl/praat/

Brulard, I., Carr, P., & Durand, J. (Eds.). (2015). La Prononciation de l'anglais contemporain dans le monde: variation et structure. Toulouse: Presses Universitaires du Midi.

Cambridge University. (1989-2015). HTK Hidden Markov Model Toolkit. http://htk.eng.cam.ac.uk

Carnegie Mellon University. (1993-2016). CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Carr, P., Durand, J., & Pukli, M. (2004). The PAC project: Principles and methods. Tribune des Langues Vivantes, 36, 24-35. Retrieved from http://www.projet-pac.net/images/publications/carrdurandpukli0409.pdf

Coleman, J. (n.d.). Forced alignment and speech recognition systems. Retrieved from Oxford University, Phonetics Laboratory website: http://www.phon.ox.ac.uk/jcoleman/BAAP_ASR.pdf

Crystal, D. (2010). The Cambridge encyclopedia of language (3rd ed.). Cambridge: Cambridge University Press.

Dinkin, A. J. (2009). Dialect boundaries and phonological change in Upstate New York. Publicly Accessible Penn Dissertations, 79. Retrieved from https://repository.upenn.edu/edissertations/79


Evanini, K. (2009). The permeability of dialect boundaries: A case study of the region surrounding Erie, Pennsylvania. Publicly Accessible Penn Dissertations, 86. Retrieved from https://repository.upenn.edu/edissertations/86

Foulkes, P., Scobbie, J. M., & Watt, D. (2010). Sociophonetics. In W. J. Hardcastle, J. Laver, & F. E. Gibbon (Eds.), The handbook of phonetic sciences (pp. 703-738). Malden, MA: Wiley-Blackwell.

Fruehwald, J. (2013). Using FAVE-align. Retrieved December 9, 2018, from https://github.com/JoFrhwld/FAVE/wiki/Using-FAVE-align

Fruehwald, J. (2014, August 12). Automation and sociophonetics. Talk presented at Methods in Dialectology XV. Groningen: University of Groningen. Retrieved from https://jofrhwld.github.io/papers/methods_xv/#

Gorman, K., Howell, J., & Wagner, M. (2011). Prosodylab-Aligner: A tool for forced alignment of laboratory speech. Canadian Acoustics - Acoustique Canadienne, 39(3), 192-193. Retrieved November 29, 2018, from https://jcaa.caa-aca.ca/index.php/jcaa/article/view/2476/2225

Klautau, A. (2001). ARPABET and the TIMIT alphabet. Retrieved from https://web.archive.org/web/20160603180727/http://www.laps.ufpa.br/aldebaro/papers/ak_arpabet01.pdf

Knowles, T., Clayards, M., Nadig, A., Sonderegger, M., Wagner, M., & Onishi, K. (2015). Automatic forced alignment on child speech: Directions for improvement. Proceedings of Meetings on Acoustics, 25(1). doi:10.1121/2.0000125

Labov, W. (2001). Principles of linguistic change. Vol. 2, Social factors. Oxford: Blackwell.

Labov, W., Rosenfelder, I. & Fruehwald, J. (2013, March 25). One hundred years of sound change in Philadelphia: Linear incrementation, reversal, and reanalysis. Language 89(1), 30-65. Linguistic Society of America. Retrieved from Project MUSE database.

Ladefoged, P. (2003). Phonetic data analysis: An introduction to fieldwork and instrumental techniques. Malden, Oxford, Victoria: Blackwell Publishing.

Liu, C., Mallard, S., & Silva, R. (2017). Improving conversational forced alignment with lexicon expansion. Retrieved from Stanford University, CS 224S website: http://web.stanford.edu/class/cs224s/reports/

Reddy, S. & Stanford, J. (2015). Toward completely automated vowel extraction: Introducing DARLA. Linguistics Vanguard, 1(1), 15-28. doi:10.1515/lingvan-2015-0002

Rose, Y. (2014). Corpus-based investigations of child phonological development. In Durand, J., Gut, U. & Kristoffersen, G. (Eds.), The Oxford handbook of corpus phonology (pp. 265- 301). Oxford: Oxford University Press.

Rosenfelder, I., Fruehwald, J., Evanini, K. & Yuan, J. (2011). FAVE (Forced Alignment and Vowel Extraction) Program Suite. http://fave.ling.upenn.edu.

Severance, N. A., Evanini, K., & Dinkin, A. (2015). Examining the performance of FAVE for automated sociophonetic vowel analysis. Paper presented at NWAV 44, Toronto, ON. Retrieved October 11, 2018, from https://www.academia.edu/17327504/Examining_the_performance_of_FAVE_for_automated_sociophonetic_vowel_analyses

Stanford, J. N., Severance, N. A., & Baclawski, K. P. (2014). Multiple vectors of unidirectional dialect change in eastern New England. Language Variation and Change, 26(1), 103-140. doi:10.1017/S0954394513000227


Strik, H., & Cucchiarini, C. (2014). On Automatic Phonological Transcription of Speech Corpora. In Durand, J., Gut, U. & Kristoffersen, G. (Eds.), The Oxford handbook of corpus phonology (pp. 265-301). Oxford: Oxford University Press.

Gut, U. (2014). Corpus phonology and second language acquisition. In Durand, J., Gut, U., & Kristoffersen, G. (Eds.), The Oxford handbook of corpus phonology (pp. 265-301). Oxford: Oxford University Press.

Wells, J. C. (1982). Accents of English. Cambridge: Cambridge University Press.

Yuan, J., & Liberman, M. (2008). Speaker identification on the SCOTUS corpus. Proceedings of Acoustics '08, 5687-5690. Retrieved from http://languagelog.ldc.upenn.edu/myl/ICASSP_final.pdf

Yuan, J., & Liberman, M. (2009). Investigating /l/ variation in English through forced alignment. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2215-2218. Retrieved from https://www.ling.upenn.edu/~jiahong/publications/c06.pdf

Yuan, J., & Liberman, M. (2011). Automatic detection of "g-dropping" in American English using forced alignment. Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, 490-493. doi:10.1109/ASRU.2011.6163980

Yuan, J., Ryant, N., Liberman, M., Stolcke, A., Mitra, V., & Wang, W. (2013). Automatic phonetic segmentation using boundary models. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2306-2310. Retrieved from https://www.isca-speech.org/archive/archive_papers/interspeech_2013/i13_2306.pdf
