This disclosure relates to a system and a method for supporting automatic speech recognition of regional accents based on statistical information and user corrections.
Automatic speech recognition (ASR) and speech-to-text conversion have been developed to generate text more rapidly while keeping the user's hands free for other tasks. Speech recognition involves hardware and software capable of receiving a spoken sound pattern and matching it to a particular word, phrase, or action. Speech-to-text conversion is a more elaborate system that performs speech recognition continuously, converting a spoken conversation or discourse to corresponding text much as a typist at a keyboard would, but more rapidly. Current speech-to-text systems can follow a natural conversation and generate corresponding text with a relatively low rate of errors, though with some limitations.
One difficulty current speech-to-text systems have is correctly interpreting variations in speech when the meaning stays constant. A given person will tend to pronounce words slightly differently at different times; for example, as speakers become excited, they tend to speak more rapidly. Many people tend to slur words together or to partially drop phonemes from their pronunciation. A human listener is familiar with the vagaries of typical human speech and readily makes the correct interpretation, but a machine has a more difficult time making the distinction.
Different people will tend to pronounce the same words differently and to use different phrasing. Oftentimes the variations in people's speech patterns follow predictable and identifiable patterns by group, such as the place where the speakers grew up, their age or gender, or their profession or type of work. These variations in pronunciation and word use are referred to as dialects. A dialect is typically distinguished by the use or absence of certain words or phrasing, and will also typically have predictable manners of pronouncing certain syllables and/or words. It can be appreciated that the predictable nature of a dialect could be used to facilitate the learning process for a speaker-dependent speech-to-text converter.
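As an illustrative sketch only (not taken from this disclosure), the predictable nature of a dialect could be captured as substitution rules over a pronunciation lexicon, so that a recognizer anticipates dialect variants without new audio. The rule, words, and phoneme symbols below are hypothetical examples:

```python
# Hypothetical dialect rule: in a non-rhotic dialect, a word-final "r"
# phoneme is dropped, so "car" is pronounced roughly like "cah".
DIALECT_RULES = [
    ("r", ""),  # (phoneme to replace at word end, replacement)
]

def dialect_variants(phonemes):
    """Return additional pronunciations predicted by the dialect rules."""
    variants = []
    for old, new in DIALECT_RULES:
        if phonemes and phonemes[-1] == old:
            # Replace the final phoneme; an empty replacement drops it.
            variants.append(phonemes[:-1] + ([new] if new else []))
    return variants

# Base lexicon: word -> canonical phoneme sequence (hypothetical entries).
lexicon = {"car": ["k", "aa", "r"], "cat": ["k", "ae", "t"]}

# Expanded lexicon: each word keeps its canonical pronunciation and gains
# any dialect variants the rules predict.
expanded = {w: [p] + dialect_variants(p) for w, p in lexicon.items()}
# "car" gains a non-rhotic variant; "cat" is unchanged.
```

Because the rules are predictable for a given dialect, one rule set can expand an entire lexicon, which is the kind of leverage the preceding paragraph alludes to.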
Automatic speech recognition systems can work effectively for languages and accents for which a language model has been created. They do not, however, fare well in areas or domains where there is a variety of strong regional accents. Current methods of coping with variations in regional accents rely on large amounts of recorded audio being processed and added to the language model. For example, automatic speech recognition language models built for a very specific domain, such as the insurance industry, serve a restricted group of people and are therefore successful. However, such language models do not work very well for call centers, because of the large number of people calling in from different regions with problems that are not linked to any specific domain.
In addition, this mass collection of audio for domain-specific user groups is difficult and expensive. It is therefore desirable to provide an alternative method of improving automatic speech recognition for certain accents based on knowledge of the user accessing the automatic speech recognition system.
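As a minimal sketch of the idea (not the disclosed method), knowledge about the user could be used to route a recognition request to an accent-specific model, falling back to a generic one, instead of collecting bulk audio per accent. The model names and region codes here are hypothetical:

```python
# Hypothetical mapping from a user's region metadata to an
# accent-specific recognition model identifier.
ACCENT_MODELS = {
    "en-GB-scotland": "en_scottish_v2",
    "en-US-south": "en_us_southern_v1",
}
DEFAULT_MODEL = "en_generic_v3"

def select_model(user_profile):
    """Pick a recognition model from user metadata, else the generic one."""
    region = user_profile.get("region")
    return ACCENT_MODELS.get(region, DEFAULT_MODEL)

print(select_model({"region": "en-GB-scotland"}))  # accent-specific model
print(select_model({"region": "fr-FR"}))           # falls back to generic
```

The design point is that the per-user lookup is cheap compared with recording and processing large audio corpora for every accent group.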