Many speech solutions, such as speech-enabled applications and speech recognition systems, use a computing device to “listen” to a user utterance and to interpret that utterance. Depending on design considerations, a speech solution may be tasked with accurately recognizing a single user's utterances. For example, a dictation-focused solution may need to be highly accurate and tuned to a given user. In other applications, a system designer may want a speech solution to be speaker-independent and to recognize the speech of different users, provided the users are speaking in the language the application is designed to process and are uttering phrases associated with the application.
In practice, a user utterance may be “heard” by a computing device and may be broken into pieces. Individual sounds and/or collections of individual sounds may be identified and matched to a predefined list of sounds, words, and/or phrases. Translating raw audio into discrete pieces and matching those pieces to a predefined profile is complex, often involves a great deal of signal processing, and may, in some instances, be performed by a speech recognition (SR) engine executing on a computing system.
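By way of illustration only, the matching step described above may be sketched as follows. The sketch assumes that the audio has already been segmented into labeled sound units and substitutes a generic sequence-similarity measure from the Python standard library for the signal processing an SR engine would actually perform. The phrase list, the sound labels, the function name `match_phrase`, and the threshold value are hypothetical and are not drawn from any particular engine.

```python
from difflib import SequenceMatcher

# Hypothetical predefined list: each phrase maps to a sequence of sound units.
# The labels are illustrative only and do not come from a real SR engine.
PHRASES = {
    "yes": ["Y", "EH", "S"],
    "no": ["N", "OW"],
    "help": ["HH", "EH", "L", "P"],
}


def match_phrase(sounds, phrases=PHRASES, threshold=0.6):
    """Return (phrase, score) for the closest predefined phrase.

    Returns (None, score) when the best similarity score falls below
    the threshold, meaning no predefined phrase matched well enough.
    """
    best_phrase, best_score = None, 0.0
    for phrase, reference in phrases.items():
        # Similarity in [0, 1] between the heard and reference sequences.
        score = SequenceMatcher(None, sounds, reference).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_score < threshold:
        return None, best_score
    return best_phrase, best_score
```

For example, `match_phrase(["HH", "EH", "L"])` selects "help" even though the final sound unit is missing, whereas a sequence sharing no sound units with any entry is rejected.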
While SR engines may be relatively accurate, these engines and other speech solution components often require tuning. In practice, a system's recognition rate when first implemented may be unacceptably low. Tuning may improve this recognition rate, but conventional approaches to tuning may be costly. Moreover, the effectiveness of conventional tuning approaches is often difficult to quantify and predict.
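One simple way to quantify a recognition rate, offered here as an assumption rather than as a definition given above, is the fraction of test utterances for which the recognized text equals the expected text. The function name `recognition_rate` and the case-insensitive, whitespace-trimmed comparison are illustrative choices.

```python
def recognition_rate(results):
    """Return the fraction of utterances that were recognized correctly.

    `results` is an iterable of (expected, recognized) text pairs.
    Assumption: a pair counts as correct when the two texts are equal
    after trimming whitespace and ignoring case.
    """
    pairs = list(results)
    if not pairs:
        raise ValueError("no utterances to score")
    correct = sum(
        1
        for expected, recognized in pairs
        if expected.strip().lower() == recognized.strip().lower()
    )
    return correct / len(pairs)
```

Computing this value over the same set of test utterances before and after a tuning pass gives one rough indication of whether the tuning helped.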