Automatic speech recognition (ASR) systems do not effectively address variations in word pronunciation. Typically, ASR dictionaries contain few alternative pronunciations for each entry. In natural speech, however, words rarely follow their citation forms. This failure to capture an important source of variability can cause recognition errors, particularly in normal conversational speech.
The automatic inference of pronunciation variation has been explored using phonetically transcribed corpora. Unfortunately, increasing the number of dictionary entry variants based on a pronunciation model also increases the confusability between dictionary entries, and thus often leads to an actual performance decrease.
Speaking mode has been considered to reduce confusability by probabilistically weighting alternative pronunciations depending on the speaking style. See F. Alleva, X. Huang, M.-Y. Hwang, Improvements on the Pronunciation Prefix Tree Search Organization, Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Atlanta, Ga., pp. 133-136, May 1996 (incorporated herein by reference). This approach uses pronunciation modeling and acoustic modeling based on a wide range of observables such as speaking rate; duration; and syllabic, syntactic, and semantic structure—contributing factors that are subsumed in the notion of speaking mode. See, e.g., M. Ostendorf, B. Byrne, M. Bacchiani, M. Finke, A. Gunawardana, K. Ross, S. Roweis, E. Shriberg, D. Talkin, A. Waibel, B. Wheatley, and T. Zeppenfeld, Systematic Variations in Pronunciation via a Language-Dependent Hidden Speaking Mode, in International Conference on Spoken Language Processing, Philadelphia, USA, 1996 (incorporated herein by reference).
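The mode-dependent weighting described above can be illustrated with a minimal sketch. It is not the cited authors' implementation: the table `PRON_WEIGHTS`, the mode labels "careful" and "fast", the phone strings, and every probability are hypothetical values chosen for illustration, and a uniform prior over words is assumed.

```python
# Hypothetical weights P(variant | word, mode). The phone strings and the
# probabilities are illustrative only and are not taken from any corpus.
PRON_WEIGHTS = {
    "support": {
        "careful": {"s ah p o r t": 0.9, "s p o r t": 0.1},
        "fast": {"s ah p o r t": 0.4, "s p o r t": 0.6},
    },
    "sport": {
        "careful": {"s p o r t": 1.0},
        "fast": {"s p o r t": 1.0},
    },
}


def word_posterior(phones, mode, weights=PRON_WEIGHTS):
    """Return P(word | phones, mode), assuming a uniform prior over words."""
    # Score each word by the weight its dictionary gives this variant in this mode.
    scores = {word: modes[mode].get(phones, 0.0) for word, modes in weights.items()}
    total = sum(scores.values())
    if total == 0.0:
        return {}  # no dictionary entry produces this phone string
    # Normalize and drop words that cannot produce the phone string at all.
    return {word: s / total for word, s in scores.items() if s > 0.0}


if __name__ == "__main__":
    for mode in ("careful", "fast"):
        print(mode, word_posterior("s p o r t", mode))
```

Under these assumed weights, the reduced variant competes only weakly with "sport" in the careful mode and more strongly in the fast mode, which is the sense in which mode-conditioned weights limit the added confusability.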
Just as the phonetic representation of careful speech is a schematization of articulatory and acoustic events, a phonetic transcription of relaxed informal speech is by its nature a simplification. Pronunciation models implementing purely phonological mappings generate phonetic transcriptions that underspecify durational and spectral properties of speech. Reduced variants as predicted by a pronunciation model ought to be phonetically homophonous (e.g., the fast variant of “support,” pronounced as /s/p/o/r/t/, is phonetically homophonous with “sport”). But to create such homophony, not only must the unstressed vowels be deleted, but the durations of the remaining phones must also take the same values as in words not derived from fast speech vowel reduction. Similarly, fast speech intervocalic voicing in a word like “faces” cannot be precisely represented as /f/ey/z/ih/z/ (phonetically homophonous with “phases”) unless both the voicing value of the fricative and the durational relationship between the stressed vowel and the fricative have changed.
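The distinction between symbolic and durational homophony can be made concrete with a small sketch. The phone durations below are invented for illustration and are not measurements. The function names and the 10 ms tolerance are likewise assumptions.

```python
# Each pronunciation is a list of (phone, duration_ms) pairs.
# All durations are made-up illustrative values, not measured data.
SPORT = [("s", 90), ("p", 70), ("o", 120), ("r", 60), ("t", 70)]
SUPPORT_FAST = [("s", 110), ("p", 95), ("o", 120), ("r", 60), ("t", 70)]


def same_phones(a, b):
    """True if the two pronunciations have identical phone symbol sequences."""
    return [p for p, _ in a] == [p for p, _ in b]


def same_phones_and_durations(a, b, tol_ms=10):
    """True if the phone sequences match and every duration differs by at most tol_ms."""
    return same_phones(a, b) and all(
        abs(da - db) <= tol_ms for (_, da), (_, db) in zip(a, b)
    )


if __name__ == "__main__":
    print(same_phones(SPORT, SUPPORT_FAST))                # symbolic comparison
    print(same_phones_and_durations(SPORT, SUPPORT_FAST))  # duration-aware comparison
```

With these assumed values, a purely symbolic transcription treats the reduced "support" and "sport" as identical, while the duration-aware comparison still separates them.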