Bootstrapping an IPA dictionary for English using Montreal Forced Aligner 2.0
One of my big tinkering projects that I’ve been thinking a lot about is a multilingual aligner that uses IPA to share acoustic models across languages and maybe even help align data from languages without requiring an initial corpus of ~20 hours to get a good acoustic model. That whole project is outside the scope of the current post (but see the final thoughts section at the end for more on this). What I want to do with this blog series is document the end-to-end process of constructing this multilingual aligner, in the hopes that these tutorials are useful for other people working with their own data sets. WikiPron makes a number of crowdsourced pronunciation dictionaries available (see here for a listing); this blog post is intended to serve as a guide for cleaning those up and bootstrapping larger pronunciation dictionaries.
What I’m currently working on is getting an IPA dictionary up to the level of the Arpabet dictionary for LibriSpeech. The primary issue is coverage: the English dictionary we use for aligning LibriSpeech has 206,518 lexical entries, whereas the WikiPron English US dictionary only has 57,214 entries, leaving a large gap if we were to use it without supplementing it.
The basic process for bootstrapping a larger dictionary with new entries was as follows:
- Clean up and standardize initial WikiPron dictionary
- Train a G2P model for generating IPA pronunciations for English
- Generate Out of Vocabulary (OOV) list
- Generate pronunciations using the G2P model for a subset of OOV words
- Check pronunciations generated and add new ones to the dictionary
- Retrain G2P model as necessary
Cleaning up and standardizing initial WikiPron dictionary
The basic dictionary was downloaded from WikiPron’s scrape page. Given Wiktionary’s crowdsourced nature, I did some initial analysis of the graphemes and phones being used. There were several foreign graphemes included, so words containing these were ignored to simplify the grapheme set:
% / @ ² à á â ä æ ç è é ê ë í î ï ñ ó ô õ ö ø ù ú ü ā ą č ē ę ğ ı ł ń ō ő œ ř ū ș ț ʼ ṭ ₂
All told, there were only 487 words with these characters, so removing these still left 56,727 total entries.
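As a rough sketch of this filtering step (the file layout assumed here is WikiPron’s tab-separated word/pronunciation format, and the helper name is my own, not the exact script I used):

```python
# Sketch: drop entries whose headword contains any excluded grapheme.
# Assumes a WikiPron-style TSV (word <tab> space-separated IPA phones);
# file names and the function name are illustrative.
EXCLUDED = set("%/@²àáâäæçèéêëíîïñóôõöøùúüāąčēęğıłńōőœřūșțʼṭ₂")

def clean_dictionary(in_path, out_path):
    kept = removed = 0
    with open(in_path, encoding="utf8") as inf, \
         open(out_path, "w", encoding="utf8") as outf:
        for line in inf:
            word = line.split("\t", 1)[0]
            if set(word) & EXCLUDED:
                removed += 1  # word uses a foreign grapheme; skip it
            else:
                outf.write(line)
                kept += 1
    return kept, removed
```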
For processing the pronunciations, the initial phone set for the dictionary was:
a aː b d d͡ʒ e eː f h i iː j k l l̩ m m̩ n n̩ o oː p r s t t͡ʃ u uː v w z æ æː ð ŋ ɑ ɑː ɒ ɔ ɔː ə ɚ ɛ ɛː ɜ ɜː ɝ ɝː ɡ ɪ ɪː ɪ̯ ɫ ɹ ʃ ʊ ʊ̯ ʌ ʍ ʒ ʔ θ
The first automated pass over the dictionary removed ligatures. With phones in pronunciation dictionaries separated by spaces, the ligatures are not necessary. However, the WikiPron pronunciations had some inconsistencies in the digraphs used for affricates and diphthongs. The following sequences were normalized:
Even following this, there were some /e/, /o/ and /a/ phones left over. These were manually corrected into some combination of /eɪ/, /ɛ/, /oʊ/, /ɔ/, /ɑ/. Additionally, some orthographic characters were used in pronunciations, so these were replaced with the correct IPA character:
The voiceless /ʍ/ was merged with /w/, both because these sounds are merged in most American dialects, and /ʍ/ appeared in a very inconsistent subset of pronunciations. Similarly, /ɫ/ was merged with /l/ due to a similar inconsistency.
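These character-level cleanups can be sketched as a single per-phone normalization pass. The mapping below is illustrative, covering only the merges mentioned above (tie-bar removal, /ʍ/ → /w/, /ɫ/ → /l/, and orthographic g → IPA /ɡ/), not the full set of transformations:

```python
# Illustrative per-phone merges; not the full transformation table.
PHONE_MAP = {
    "ʍ": "w",  # merged: inconsistently applied in the source data
    "ɫ": "l",  # merged with plain /l/ for the same reason
    "g": "ɡ",  # orthographic g -> IPA U+0261
}

def normalize_pron(pron):
    """Normalize a space-separated phone string."""
    phones = []
    for p in pron.split():
        p = p.replace("\u0361", "")  # strip tie bars: t͡ʃ -> tʃ
        phones.append(PHONE_MAP.get(p, p))
    return " ".join(phones)
```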
Vowels before rhotics were standardized as well to the following set based on the set listed on the English phonology wikipedia.
To make the transcriptions in WikiPron adhere to the sets above, the following transformations were applied to standardize the vowel-rhotic sequences:
Finally, syllabic consonants were created in some unstressed syllables, both to make them consistent with syllabic consonants elsewhere in the dictionary and to aid in alignment.
In unstressed syllables, Arpabet pronunciations usually have separate schwa-consonant sequences (i.e. AH0 L, AH0 N, AH0 M), but not for syllabic r (i.e. ER0). This leads to alignment issues when using these dictionaries: the default settings impose a 30ms minimum duration per phone, so these unstressed syllables have a minimum duration of 60ms. In running spontaneous speech, we’re unlikely to see unstressed syllables that long, and the quality of these syllables is much more like their consonant articulations than a vowel-consonant sequence.
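A minimal sketch of this syllabic-consonant rewrite might look like the following. Note that a real pass would need to check stress and syllable position; this version naively collapses every schwa + /l m n/ sequence:

```python
# Illustrative rewrite: collapse schwa + sonorant into the syllabic
# consonants /l̩ m̩ n̩/ already present in the inventory.
SYLLABIC = {"l": "l̩", "m": "m̩", "n": "n̩"}

def syllabify(pron):
    phones = pron.split()
    out = []
    i = 0
    while i < len(phones):
        if i + 1 < len(phones) and phones[i] == "ə" and phones[i + 1] in SYLLABIC:
            out.append(SYLLABIC[phones[i + 1]])  # e.g. "ə l" -> "l̩"
            i += 2
        else:
            out.append(phones[i])
            i += 1
    return " ".join(out)
```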
For reference, the final phone inventory is:
As a more general point, the specific purpose of this dictionary is alignment and speech tech, and it does not strictly represent English phonology per se. However, it’s not strictly phonetic either, as /t/ and /d/ flapping is not represented in pronunciations. As the above mentions, the most important part is the number of segments, since allophonic variation can be captured through the use of clustered triphones in alignment systems.
Training a G2P model for generating IPA transcriptions
Following all of the transformations above, we have a reasonable starting point for bootstrapping a large-scale dictionary. Once MFA is installed in a conda environment, the first step is to train a G2P model:
mfa train_g2p english_us.txt english_us_ipa_g2p.zip --num_jobs 13
And voila! Sort of. It takes me about 30 minutes to train a G2P model over the (as of this post) 60,756 words in english_us.txt. The command above uses 13 processes to perform training, so depending on your setup, you may need to adjust it (the default is 3). The training can be customized (see the MFA docs on training G2P models for more details), but the default configuration should be a reasonable setup for both the current English dictionary and other dictionaries you might want to train on.
Generating an Out of Vocabulary (OOV) list
Once the G2P model is trained, we can generate a list of OOV words by running MFA’s validator utility:
mfa validate /mnt/d/Data/speech/LibriSpeech/librispeech_mfa english_us.txt -t /mnt/d/temp/validate_temp --ignore_acoustics
This command runs through all of the files in the LibriSpeech folder (the librispeech_mfa folder in the command above is a preprocessed version that uses the Prosodylab format for MFA) and checks for various issues like OOV words, errors reading the sound files, missing transcriptions, etc. The validator will generally also do a pass of generating the MFCC features and a quick monophone alignment to check for issues, but the --ignore_acoustics flag skips this step to save time. The output of this command should be something like:
(aligner) user@hostname:/mnt/d/Data/speech/librispeech_ipa$ mfa validate /mnt/d/Data/speech/LibriSpeech/librispeech_mfa english_us.txt -t /mnt/d/temp/validate_temp --ignore_acoustics
INFO - Setting up corpus information...
INFO - Parsing dictionary without pronunciation probabilities without silence probabilities
INFO - Creating dictionary information...
INFO - Setting up training data...
Skipping acoustic feature generation
INFO - ===============================Corpus===============================
292367 sound files
292367 sound files with .lab transcription files
0 sound files with TextGrids transcription files
0 additional sound files ignored (see below)
0.000 seconds total duration

DICTIONARY
There were 55415 word types not found in the dictionary with a total of 271084 instances.
Please see /mnt/d/temp/validate_temp/librispeech_mfa/corpus_data/oovs_found.txt for a full list of the word types and /mnt/d/temp/validate_temp/librispeech_mfa/corpus_data/utterance_oovs.txt for a by-utterance breakdown of missing words.

SOUND FILE READ ERRORS
There were no sound files that could not be read. There were no sound files that had unsupported bit depths.

FEATURE CALCULATION
Acoustic feature generation was skipped.

FILES WITHOUT TRANSCRIPTIONS
There were no sound files missing transcriptions.

TRANSCRIPTIONS WITHOUT FILES
There were no transcription files missing sound files.

TEXTGRID READ ERRORS
There were no issues reading TextGrids.

UNREADABLE TEXT FILES
There were no issues reading text files.

Skipping test alignments.
INFO - All done!
In the temporary directory, MFA will output an oovs_found.txt file (in this case, /mnt/d/temp/validate_temp/librispeech_mfa/corpus_data/oovs_found.txt). This file lists all words in the corpus that were not found in the dictionary, sorted by frequency in the corpus, so adding pronunciations for the initial entries is more important than adding entries for the words at the end that might only occur once or twice. You can see the exact frequencies for OOV words in the
Generating pronunciations using the G2P model for a subset of OOV words
In the validation output above, there are a lot of missing words (55,415 types and 271,084 tokens!). Generating pronunciations for all of these would take a long time, even with multiprocessing. In addition, at the start of bootstrapping, the data isn’t going to be as consistent as you might like. Some of the transformations I mentioned above were discovered only when I generated a pronunciation (and somehow got a trill /r/ as an output phone). The best process I’ve found is an iterative bootstrap approach: I generate pronunciations for the most frequent 1,000 OOV words (this doesn’t take very long), add the vetted pronunciations to the dictionary, and retrain as necessary.
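Selecting that top batch is straightforward, since oovs_found.txt is already sorted by corpus frequency. A small sketch, assuming one word (optionally followed by a count) per line, most frequent first; file names are illustrative:

```python
# Sketch: pick the N most frequent OOV words for the next G2P batch.
def top_oovs(oov_path, out_path, n=1000):
    with open(oov_path, encoding="utf8") as f:
        # keep only the word, ignoring any trailing count column
        words = [line.split()[0] for line in f if line.strip()]
    batch = words[:n]
    with open(out_path, "w", encoding="utf8") as f:
        f.write("\n".join(batch) + "\n")
    return len(batch)
```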
The command I use to generate pronunciations for the subset file (to_g2p.txt below) is:
mfa g2p english_us_ipa_g2p.zip to_g2p.txt oovs_g2p.txt --num_jobs 13 --num_pronunciations 3 -t /mnt/d/temp/g2p_temp
As with training, I’m using 13 cores to generate these pronunciations (it took about 10 minutes to process 1,500 words in the latest batch). The --num_pronunciations flag is set to 3, so for each word, the best 3 candidates will be returned. For transparent orthographies like Spanish, you probably don’t need to generate as many pronunciations, but for English it provides useful pronunciation variability in the output.
The output of the above command will be a file (oovs_g2p.txt above) with contents like the following:
By generating three candidates, we are able to capture a number of sources of variability including dialect variation (paupers as /p ɔ p ɚ z/ or /p ɑ p ɚ z/), different stress for different grammatical categories (adjective incarnate as /ɪ n k ɑ ɹ n ɪ t/ and verb incarnate as /ɪ n k ɑ ɹ n eɪ t/), and also just random orthography-phonology nonsense that English has (ploughed in the list above has the first candidate as /p l aʊ d/, but really /p l ɑ f t/ isn’t a bad guess considering coughed). Again for languages with more straightforward orthography-phonology mappings, you could probably get away with just one candidate.
Checking generated pronunciations and adding new ones to the dictionary
For selecting candidates, I try to balance speed of adding entries with the following criteria:
- including common dialectal variations (like both /ɔ/ and /ɑ/ in caught-cot merger words, though there may be some inconsistency on my part as I don’t have the distinction)
- including unstressed vowel variation (pronounce as /p ɹ ə n aʊ n s/ or /p ɹ oʊ n aʊ n s/)
- some less common dialectal variation (like news as /n uː z/ or /n j uː z/, straw as /s t ɹ ɑ/ or /s tʃ ɹ ɑ/)
- some phonetic variation, mostly around common deletions (smouldering as /s m oʊ l d ə ɹ ɪ ŋ/ or /s m oʊ l d ɹ ɪ ŋ/)
But again, speed is the main criterion, as there are a lot of missing words. The dictionary is available on GitHub here. Feel free to download it, take a look, and if you find any issues, create a pull request (or let me know).
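Once a batch of candidates has been vetted, merging them back into the main dictionary is just a set union over (word, pronunciation) pairs. A sketch, assuming tab-separated entries in both files (the function name is mine, not part of MFA):

```python
# Sketch: merge vetted G2P candidates into the main dictionary,
# skipping exact duplicates of existing (word, pronunciation) pairs.
def merge_vetted(dict_path, vetted_path):
    with open(dict_path, encoding="utf8") as f:
        entries = {tuple(line.rstrip("\n").split("\t"))
                   for line in f if line.strip()}
    added = 0
    with open(vetted_path, encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = tuple(line.rstrip("\n").split("\t"))
            if entry not in entries:
                entries.add(entry)
                added += 1
    # rewrite the dictionary sorted by headword for stable diffs
    with open(dict_path, "w", encoding="utf8") as f:
        for word, pron in sorted(entries):
            f.write(f"{word}\t{pron}\n")
    return added
```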
The dictionary can be downloaded to use with MFA alignment with the command:
mfa download dictionary english_us_ipa
and the G2P model with the command:
mfa download g2p english_us_ipa_g2p
I hope this is useful as both a tutorial and a starting point for using IPA in English alignment. The eventual goal of my tinkering with this is to include a bunch of corpora together and create an IPA-based acoustic model to use with any language that has a dictionary in IPA. Of course I think that’s still a bit of a pipe dream given variation in IPA, but the hope is that with the phone clustering under the hood for triphones in Kaldi, along with speaker-adaptation and other feature transformations, we might be able to get relatively good performance.
I have some thoughts and opinions on having an IPA parser inside of MFA that takes out some diacritics so that phone sets are more comparable across languages:
- length markers could be internally discarded, and then added back when creating TextGrids
- diphthongs and affricates could be split up internally so that you’re not reliant on a single language providing data for something like /tʂ/, and could instead leverage other language data more straightforwardly for /ʂ/
- stripping out tone marking since it’s a segmental alignment (particularly cases where it’s a separate symbol in the transcription)
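To make the idea concrete, a parser along these lines might internally normalize phones like this (the mappings are purely illustrative, not anything MFA actually implements):

```python
# Sketch of internal IPA normalization: strip length marks and split
# affricate digraphs so phone sets from different languages line up.
LENGTH_MARKS = {"ː", "ˑ"}
AFFRICATES = {"tʃ": ["t", "ʃ"], "dʒ": ["d", "ʒ"], "ts": ["t", "s"]}

def internal_phones(pron):
    """Map a space-separated phone string to internal training units."""
    out = []
    for p in pron.split():
        p = "".join(c for c in p if c not in LENGTH_MARKS)  # iː -> i
        out.extend(AFFRICATES.get(p, [p]))  # tʃ -> t + ʃ
    return out
```

The original symbols (with length marks and unsplit affricates) would be retained alongside these internal units so they could be restored when writing out TextGrids.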
I’m planning on doing more of these blog posts for other languages (and most of them I don’t speak, so I’ll need some feedback on what makes sense and what doesn’t for those). I’m hoping also that the dictionary in this post can be a bit of a living resource as I and hopefully others update it, but at the very least I hope this is useful as a tutorial on using MFA’s G2P functionality to bootstrap and speed up dictionary development.