1. Technical Field
An embodiment of the present invention generally relates to a speech recognition system. More particularly, an embodiment of the present invention relates to a speech recognition system that enables a user to access a plurality of speech recognition engines without requiring that the user train each speech recognition engine.
2. Discussion of the Related Art
Speech recognition technology enables a user to invoke one or more functions by providing verbal instructions. The accuracy of a speech recognition system depends on a number of factors. For instance, it is well known that speaker-independent (“SI”) speech recognition systems typically suffer from lower accuracy than speaker-dependent (“SD”) speech recognition systems that have been trained on speaker-specific data. Furthermore, speech recognition accuracy may be negatively affected by environmental factors, such as background noise, reverberation, or microphone performance.
Adaptation to the speaker's characteristics and the background environment may improve speech recognition accuracy. For example, acoustic model adaptation is a common approach used in desktop-based speech recognition engines to adapt SI acoustic models to a particular user's voice and to the background environment. However, all of the current engine providers require the user to explicitly train his/her acoustic models by reading a predetermined text, for a duration of between five and twenty minutes, to create an SD acoustic model. This is a time-consuming task and hence is not user-friendly. Thus, a speech recognition system having a speech recognition engine that does not require explicit training by the user is needed.
Mobile applications that utilize speech recognition technology pose additional issues. For instance, in a mobile usage model, users will very likely need to access different kinds of speech-enabled services provided by one or more service providers. Because the speech recognition engines that a particular service provider uses in its applications may differ from those used by other service providers, the current adaptation method requires the user to train each new speech recognition engine that he/she encounters while accessing different services. Furthermore, a service provider needs to maintain user profiles for all of its customers, so that a user is not required to retrain the speech recognition engines every time he/she accesses that particular service. To avoid these burdens, most speech recognition service providers use SI systems, which apply the same acoustic models to every user's speech. Consequently, speech recognition service providers must generally either compromise on accuracy or provide limited voice access capability (e.g., command and control functionality, as opposed to natural language queries).