1. Technical Field
The present application relates to speech recognition and more specifically to adapting acoustic models for specific speakers or classes of speakers.
2. Introduction
Speech recognition applications typically rely on a single, generic acoustic model to recognize speech from all potential speakers. However, a single canonical model that represents all speakers generically is not well suited to many individuals in minority accent groups of a given population. For instance, speakers with strong regional accents or foreign accents often encounter speech recognition difficulties stemming from numerous differences between their way of speaking and the single canonical model. These difficulties can slow down user speech interaction, thereby frustrating users, or prevent speech interaction altogether.
In many cases, the number of speakers making up a regional accent or foreign accent group is very small. Because of the small number of speakers, data is too sparse to build a specific acoustic model for each class of dialect or accent. One known solution is to modify pronunciation dictionaries by providing alternative phoneme sequences for words whose pronunciation differs depending on the dialect or accent. For example, speakers from the southern states pronounce many vowels as diphthongs, and some accents have low separation between certain sounds such as “l” and “r”. This approach accounts for such differences to some extent by expanding the allowed pronunciations to include all the possible variations. Its drawback is that it introduces additional confusion into the speech recognition model, which can reduce the overall speech recognition accuracy.
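The drawback of expanding the pronunciation dictionary can be illustrated with a minimal sketch. The lexicon entries, the phoneme symbols, and the function names below are hypothetical and chosen only for illustration; they are not taken from any particular recognizer. The sketch adds accent variants for an accent with low “l”/“r” separation and then counts phoneme sequences that are shared by more than one word.

```python
from collections import defaultdict

# Hypothetical baseline lexicon: word -> list of phoneme sequences.
BASE_LEXICON = {
    "light": [("L", "AY", "T")],
    "right": [("R", "AY", "T")],
    "pen": [("P", "EH", "N")],
}

# Hypothetical variants added for an accent with low "l"/"r" separation.
ACCENT_VARIANTS = {
    "light": [("R", "AY", "T")],
    "right": [("L", "AY", "T")],
}


def expand_lexicon(base, variants):
    """Return a copy of the base lexicon with the variant pronunciations added."""
    expanded = {word: list(prons) for word, prons in base.items()}
    for word, prons in variants.items():
        for pron in prons:
            if pron not in expanded.setdefault(word, []):
                expanded[word].append(pron)
    return expanded


def confusable_pronunciations(lexicon):
    """Map each phoneme sequence shared by two or more words to those words."""
    words_by_pron = defaultdict(set)
    for word, prons in lexicon.items():
        for pron in prons:
            words_by_pron[pron].add(word)
    return {p: w for p, w in words_by_pron.items() if len(w) > 1}


expanded = expand_lexicon(BASE_LEXICON, ACCENT_VARIANTS)
print(len(confusable_pronunciations(BASE_LEXICON)))  # 0
print(len(confusable_pronunciations(expanded)))      # 2
```

In this toy lexicon, the baseline has no shared pronunciations, while the expanded lexicon makes both sequences ambiguous between “light” and “right”. The recognizer can then no longer separate the two words from the pronunciation dictionary alone, for any speaker.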