Bootstrapping an IPA dictionary for English using Montreal Forced Aligner 2.0

One of my big tinkering projects that I’ve been thinking a lot about is a multilingual aligner that uses IPA to share acoustic models across languages, and maybe even help align data from languages without requiring an initial corpus of ~20 hours to get a good acoustic model. That whole project is outside the scope of the current post (but see the final thoughts section at the end for more on this). What I want to do with this blog series is document the end-to-end process of constructing this multilingual aligner, in the hope that these tutorials are useful for other people working with their own data sets. WikiPron provides a number of crowdsourced pronunciation dictionaries (see here for a listing); this blog post is intended to serve as a guide for cleaning those up and bootstrapping larger pronunciation dictionaries.

English but IPA
  1. Train a G2P model for generating IPA pronunciations for English
  2. Generate Out of Vocabulary (OOV) list
  3. Generate pronunciations using the G2P model for a subset of OOV words
  4. Check pronunciations generated and add new ones to the dictionary
  5. Retrain G2P model as necessary

Cleaning up and standardizing initial WikiPron dictionary

The basic dictionary was downloaded from WikiPron’s scrape page. Given Wiktionary’s crowdsourced nature, I did some initial analysis of the graphemes and phones being used. There were several foreign graphemes included, so words containing these were ignored to simplify the grapheme set:

% / @ ² à á â ä æ ç è é ê ë í î ï ñ ó ô õ ö ø ù ú ü ā ą č ē ę ğ ı ł ń ō ő œ ř ū ș ț ʼ ṭ ₂
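As a concrete sketch of this filtering step, assuming WikiPron’s two-column layout (word, tab, space-separated phones) — the helper name and sample data are illustrative, not from the actual pipeline:

```python
# Graphemes to exclude, from the list above.
EXCLUDED = set("% / @ ² à á â ä æ ç è é ê ë í î ï ñ ó ô õ ö ø ù ú ü ā ą č ē ę ğ ı ł ń ō ő œ ř ū ș ț ʼ ṭ ₂".split())

def clean_wikipron(lines):
    """Yield only entries whose orthography avoids the excluded graphemes."""
    for line in lines:
        word = line.split("\t", 1)[0]
        if not any(ch in EXCLUDED for ch in word):
            yield line

sample = ["cat\tk æ t\n", "café\tk æ f eɪ\n"]
cleaned = list(clean_wikipron(sample))  # keeps "cat", drops "café"
```

In practice you would stream the WikiPron TSV through this filter and write the surviving entries out as the starting dictionary.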
The phone set used across the remaining entries:

a aː b d d͡ʒ e eː f h i iː j k l l̩ m m̩ n n̩ o oː p r s t t͡ʃ u uː v w z æ æː ð ŋ ɑ ɑː ɒ ɔ ɔː ə ɚ ɛ ɛː ɜ ɜː ɝ ɝː ɡ ɪ ɪː ɪ̯ ɫ ɹ ʃ ʊ ʊ̯ ʌ ʍ ʒ ʔ θ

From there, I applied several transformations to standardize this inventory:

  • Digraph construction
  • Standardizing some inconsistent phone use
  • Rhotic lexical sets in US English
  • Rhotic transformations
  • Word-final unstressed syllabic consonants

The result is the inventory for US English used in the dictionary.
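To make the standardization pass concrete, here is a minimal sketch in the spirit of the transformations listed above. The specific rewrite rules (the digraph pairs, the /r/ → /ɹ/ substitution, the rhotic merges) are illustrative examples, not the exact rule set used for the dictionary:

```python
DIGRAPHS = {("t", "ʃ"): "t͡ʃ", ("d", "ʒ"): "d͡ʒ"}   # digraph construction
SUBSTITUTIONS = {"r": "ɹ", "g": "ɡ"}                # inconsistent phone use
RHOTICS = {("ə", "ɹ"): "ɚ", ("ɜ", "ɹ"): "ɝ"}        # rhotic transformations

def standardize(phones):
    """Apply single-phone substitutions, then merge two-phone sequences."""
    phones = [SUBSTITUTIONS.get(p, p) for p in phones]
    out, i = [], 0
    while i < len(phones):
        pair = tuple(phones[i:i + 2])
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair]); i += 2
        elif pair in RHOTICS:
            out.append(RHOTICS[pair]); i += 2
        else:
            out.append(phones[i]); i += 1
    return out

standardize("t ʃ ɜ r t ʃ".split())  # → ['t͡ʃ', 'ɝ', 't͡ʃ']
```

Running every pronunciation through a pass like this is also a good way to surface the kinds of inconsistencies discussed above before training the G2P model.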

Training a G2P model for generating IPA transcriptions

Following all of the transformations above, we have a reasonable starting point for bootstrapping a large-scale dictionary. After installing MFA into a conda environment, the first step is to train a G2P model:

mfa train_g2p english_us.txt --num_jobs 13

Generating an Out of Vocabulary (OOV) list

Once the G2P model is trained, we can generate a list of OOV words by running MFA’s validator utility:

(aligner) user@hostname:/mnt/d/Data/speech/librispeech_ipa$ mfa validate /mnt/d/Data/speech/LibriSpeech/librispeech_mfa english_us.txt -t /mnt/d/temp/validate_temp --ignore_acoustics
INFO - Setting up corpus information...
INFO - Parsing dictionary without pronunciation probabilities without silence probabilities
INFO - Creating dictionary information...
INFO - Setting up training data...
Skipping acoustic feature generation
INFO - ===============================Corpus===============================
292367 sound files
292367 sound files with .lab transcription files
0 sound files with TextGrids transcription files
0 additional sound files ignored (see below)
2484 speakers
292367 utterances
0.000 seconds total duration
There were 55415 word types not found in the dictionary with a total of 271084 instances.
Please see
/mnt/d/temp/validate_temp/librispeech_mfa/corpus_data/oovs_found.txt for a full list of the word types and
/mnt/d/temp/validate_temp/librispeech_mfa/corpus_data/utterance_oovs.txt for a by-utterance breakdown of missing words.
SOUND FILE READ ERRORS
There were no sound files that could not be read. There were no sound files that had unsupported bit depths.
Acoustic feature generation was skipped.
There were no sound files missing transcriptions.
There were no transcription files missing sound files.
There were no issues reading TextGrids.
There were no issues reading text files.
Skipping test alignments.
INFO - All done!

Generating pronunciations using the G2P model for a subset of OOV words

In the validation output above, there are a lot of missing words (55,415 types and 271,084 tokens!). Generating pronunciations for all of these is going to take a long time, even with multiprocessing. In addition, at the start of bootstrapping, the data isn’t going to be as consistent as you might like. Some of the transformations I mentioned above were discovered only when I generated a pronunciation (and somehow got a trill /r/ as an output phone). The best process I’ve found is an iterative bootstrap approach, where I generate pronunciations for the most frequent 1,000 OOV words (this doesn’t take very long), add the vetted pronunciations to the dictionary, and retrain as necessary.
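Selecting the top 1,000 OOV types can be scripted as below. I’m assuming a simple "word&lt;TAB&gt;count" layout for the OOV list; check the oovs_found.txt your MFA version writes and adjust the parsing if the format differs:

```python
from collections import Counter

def top_oovs(lines, n=1000):
    """Return the n most frequent OOV word types from word<TAB>count lines."""
    counts = Counter()
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        # Fall back to counting occurrences if no count column is present.
        counts[word] = int(count) if count else counts[word] + 1
    return [w for w, _ in counts.most_common(n)]

sample = ["zyzzyva\t3\n", "qat\t12\n", "cwm\t7\n"]
top_oovs(sample, n=2)  # → ['qat', 'cwm']
```

The selected words get written one per line to to_g2p.txt, which then feeds the G2P command: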

mfa g2p to_g2p.txt oovs_g2p.txt --num_jobs 13 --num_pronunciations 3 -t /mnt/d/temp/g2p_temp

Checking generated pronunciations and adding new ones to the dictionary

For selecting candidates, I try to balance speed of adding entries with the following criteria:

  • including unstressed vowel variation (pronounce as /p ɹ ə n aʊ n s/ or /p ɹ oʊ n aʊ n s/)
  • some less common dialectal variation (like news as /n uː z/ or /n j uː z/, straw as /s t ɹ ɑ/ or /s tʃ ɹ ɑ/)
  • some phonetic variation, mostly around common deletions (smouldering as /s m oʊ l d ə ɹ ɪ ŋ/ or /s m oʊ l d ɹ ɪ ŋ/)
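The bookkeeping for folding vetted variants back into the dictionary is simple; here is a sketch (file handling elided, the function name is mine) that keeps multiple pronunciations per word and skips exact duplicates:

```python
def merge_pronunciations(dictionary, vetted):
    """dictionary: {word: [pron, ...]}; vetted: iterable of (word, pron) pairs."""
    for word, pron in vetted:
        variants = dictionary.setdefault(word, [])
        if pron not in variants:
            variants.append(pron)   # keep dialectal/phonetic variants
    return dictionary

lexicon = {"news": ["n uː z"]}
merge_pronunciations(lexicon, [("news", "n j uː z"), ("straw", "s t ɹ ɑ")])
# lexicon now has both variants of "news" plus the new entry "straw"
```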
The resulting dictionary and G2P model are available through MFA’s download utility:

mfa download dictionary english_us_ipa
mfa download g2p english_us_ipa_g2p

Final thoughts

I hope this is useful as both a tutorial and a starting point for using IPA in English alignment. The eventual goal of my tinkering with this is to combine a bunch of corpora and create an IPA-based acoustic model to use with any language that has a dictionary in IPA. Of course I think that’s still a bit of a pipe dream given variation in IPA, but the hope is that with the phone clustering under the hood for triphones in Kaldi, along with speaker adaptation and other feature transformations, we might be able to get relatively good performance. A couple of normalizations that could make phone sets easier to share across languages:

  • diphthongs and affricates could be split up internally so that you’re not reliant on a single language providing data for something like /tʂ/, and could instead leverage other language data more straightforwardly for /ʂ/
  • stripping out tone marking since it’s a segmental alignment (particularly cases where it’s a separate symbol in the transcription)

