In the field of this invention it is known that, with speaker-independent automatic speech recognition (ASR), such as in an automated telephone response system, if a speaker uses the system with an accent significantly different from that of the user population used to train the system, then the recognition hit-rate or accuracy will be affected by that speaker's accent. It is further known that accents, both native and non-native, can be modelled by careful manipulation of the associated baseforms (the pronunciations of the words in the recognition grammar).
It is also known that it is possible to adapt dynamically to the accent or dialect of a given user population by allowing the statistical distributions within the acoustic model to be modified in accordance with the correct recognition results observed for that user population. The acoustic model contains a number of states and transitions, both represented by Gaussian (normal) probability densities, each of which is, by definition, characterised by a mean value and a standard deviation. For a given user population, either the mean or the standard deviation or both will vary from the generalised values in the acoustic model for the general user population. Therefore, dynamic adaptation allows the mean and/or standard deviation to shift to values more appropriate to the user population. In concrete terms: there might be a mean value of X for the general population. For a specific population, the mean may consistently be less than X. Therefore, adaptation would have X shift downwards. (N.B.: there are many Gaussian distributions, and therefore many X's and associated standard deviations.)
However, this approach has the disadvantage that what was a general-purpose recognition system is now geared specifically to a given user population. In consequence, it is no longer appropriate for all users. Hit-rate may decrease even for speakers of standard variants of a given language.
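By way of illustration only, the adaptation described above and its side effect can be sketched for a single univariate Gaussian. The incremental update rule, the `rate` weight and all numeric values are assumptions chosen for the example; they do not represent any particular adaptation scheme.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def adapt(mean, std, observations, rate=0.1):
    """Shift mean and standard deviation towards the observed values.

    `rate` is a hypothetical interpolation weight (an assumption for this sketch).
    """
    var = std * std
    for x in observations:
        delta = x - mean
        mean += rate * delta                            # mean drifts towards the data
        var = (1 - rate) * var + rate * delta * delta   # variance tracks recent spread
    return mean, math.sqrt(var)

# Hypothetical general-population parameters: the "X" of the text is 10.0.
general_mean, general_std = 10.0, 2.0

# Hypothetical observations from a population whose values lie consistently below X.
population_obs = [7.0, 8.0, 9.0] * 20

adapted_mean, adapted_std = adapt(general_mean, general_std, population_obs)

# Compare how well each model fits a speaker typical of each population.
fit_specific_before = gaussian_log_pdf(8.0, general_mean, general_std)
fit_specific_after = gaussian_log_pdf(8.0, adapted_mean, adapted_std)
fit_general_before = gaussian_log_pdf(10.0, general_mean, general_std)
fit_general_after = gaussian_log_pdf(10.0, adapted_mean, adapted_std)

print(f"adapted mean={adapted_mean:.2f}, adapted std={adapted_std:.2f}")
print(f"specific population: {fit_specific_before:.2f} -> {fit_specific_after:.2f}")
print(f"general population:  {fit_general_before:.2f} -> {fit_general_after:.2f}")
```

In this sketch the adapted mean moves below X and the fit improves for the specific population, while the fit for a speaker at the original mean deteriorates.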
Another way to cater for multiple accent types would be to include additional pronunciations, known as baseforms, for one and the same word. For example, the name of the UK town ‘Hull’ is pronounced with a very different vowel sound in Southern British English from Northern British English, and so two baseforms for this word would be included. If account is then taken of the fact that many speakers either in the North or in the South may well not pronounce the ‘H’ at the beginning, then this results in four variants for ‘Hull’: H AH L, H UH L, AH L, and UH L.
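The expansion of ‘Hull’ into four baseforms can be sketched as follows. The function name `expand_baseforms` and the representation of a dropped ‘H’ as an empty string are assumptions made for this illustration; the phone symbols are those of the example above.

```python
from itertools import product

def expand_baseforms(onsets, vowels, coda):
    """Return every pronunciation formed by combining the alternatives.

    An empty string in `onsets` stands for a dropped initial sound.
    """
    variants = []
    for onset, vowel in product(onsets, vowels):
        # Skip the empty onset so that no stray separator is emitted.
        phones = [p for p in (onset, vowel, *coda) if p]
        variants.append(" ".join(phones))
    return variants

hull = expand_baseforms(onsets=["H", ""], vowels=["AH", "UH"], coda=["L"])
print(hull)  # ['H AH L', 'H UH L', 'AH L', 'UH L']
```

Each independent alternation multiplies the number of baseforms: two onsets and two vowels give four variants for a single word.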
However, this approach has the disadvantage(s) that manually increasing the baseform variants for any given term can lead to increased ambiguity within the recogniser, and can therefore reduce the overall hit-rate or accuracy, and can even reduce the efficiency of the recogniser. It is also known that adaptation can lead to the recogniser no longer being appropriate for use by the original user population or indeed by any users beyond those that have provided input to adapt the recogniser.
A need therefore exists for a system and method for automatic speech recognition wherein the abovementioned disadvantage(s) may be alleviated.