Technological innovations in speech processing applications have led to widespread development of speech-based automated systems and applications using automated speech recognition (ASR) and/or natural language understanding techniques. For example, speech recognition systems are being implemented to support hands-free command and control of various functions within a car environment. Moreover, speech recognition systems may be implemented for dictation/transcription applications to record and recognize spoken input from one or more persons and automatically generate a textual transcription that is stored and subsequently used for various applications (archiving, indexing, etc.).
There are various factors that can negatively affect the decoding accuracy of spoken input by ASR systems. For instance, in ASR applications, speech decoding accuracy can vary depending on the type of microphone system that is used to capture spoken input, the manner in which a person uses the microphone system and/or the varying environmental conditions that may exist at different times during capture and recordation of audio input by the microphone system. By way of example, when a person uses a microphone having a manual talk switch (to manually turn the microphone on and off), the manual operation of the talk switch may lead to poor synchronization between the time at which the talk switch button is pressed and the time at which the user begins speaking. If a user presses the talk switch button and begins to speak simultaneously, the beginning of the first spoken utterance may be cut off; if the user begins speaking too late, environmental noise may be added to the audio input. Either case leads to decreased decoding accuracy.
In other circumstances, the decoding accuracy of an ASR system can be adversely affected when the distance between the speaker's mouth and the microphone varies during a speech session. For instance, for lip microphone devices, the distance between the lip microphone and the person's mouth can change during a session, resulting in possible degradation in decoding accuracy. Similar problems exist when using fixed microphones (e.g., in a car), which are sensitive to how a person is positioned near the microphone and the direction that the person faces when speaking.
Another cause of decreased decoding accuracy in ASR systems is that ASR applications typically require the microphone parameters to be adapted and adjusted to the ASR system, as well as to the speaker. For example, some conventional speech applications require the microphones to be set and re-adapted to the speech recognition system each time a new person begins a new dictation session. If certain adjustments and adaptations are not made for each new person using the speech recognition system, the error rate of the speech recognition can significantly increase.
For example, an ASR system may require various steps for adjusting the microphone system so as to optimize the speech recognition decoding accuracy. First, the ASR system determines an average level of static environmental noise in a given environment (no speech). Next, the system may request spoken input by a person in the given environment, which allows the system to determine the volume of the speaker's voice relative to the static environmental noise, which is then used to adjust the sensitivity and volume of the microphone input. Typically, after the system adjusts the volume input level, additional parameters in the ASR system may be adapted to an individual speaker who reads a particular prepared passage. In particular, each new user may be required to read a prepared passage after the volume has been adjusted, so as to adjust an array of parameters that fine-tune the microphone and better adapt the ASR system to the current user.
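The first two steps of such an adjustment procedure (measuring the static noise level, then measuring the speaker's level relative to that noise and deriving an input gain) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions, not the procedure of any particular ASR system: the function names `rms_dbfs` and `calibrate_gain`, the target speech level of -20 dBFS, and the 10 dB minimum signal-to-noise threshold are all hypothetical, and audio is assumed to arrive as floating-point samples in the range [-1, 1].

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if not samples:
        raise ValueError("samples must be non-empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    # A small floor avoids log10(0) when the input is digital silence.
    return 10.0 * math.log10(max(mean_square, 1e-12))


def calibrate_gain(noise_samples, speech_samples,
                   target_dbfs=-20.0, min_snr_db=10.0):
    """Derive an input gain from a noise-only recording and a speech recording.

    target_dbfs and min_snr_db are illustrative assumptions, not values
    taken from any specific ASR system.
    """
    # Step 1: average level of the static environmental noise (no speech).
    noise_db = rms_dbfs(noise_samples)
    # Step 2: level of the speaker's voice in the same environment.
    speech_db = rms_dbfs(speech_samples)
    # Speech level relative to the noise floor.
    snr_db = speech_db - noise_db
    # Gain (in dB) that would bring the speech to the target level.
    gain_db = target_dbfs - speech_db
    return {
        "noise_db": noise_db,
        "speech_db": speech_db,
        "snr_db": snr_db,
        "gain_db": gain_db,
        "usable": snr_db >= min_snr_db,
    }


if __name__ == "__main__":
    # Synthetic stand-ins for recorded audio: a quiet noise floor and louder speech.
    noise = [0.01, -0.01] * 100
    speech = [0.05, -0.05] * 100
    result = calibrate_gain(noise, speech)
    print({key: round(value, 2) if isinstance(value, float) else value
           for key, value in result.items()})
```

The subsequent speaker-adaptation step (reading a prepared passage to tune further parameters) is deliberately omitted from the sketch, since its details depend on the recognizer.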
These microphone adjustment procedures of the ASR system may be problematic and impractical in certain applications. For example, when an ASR system is used for transcription of conferences, these microphone adjustment procedures may be too burdensome and thus not followed. In particular, at conferences and meetings, a microphone and ASR system are typically located on the podium or in the middle of a meeting table. In some instances, the microphone is head-mountable and located at the speaker's lips for accurate input. When speaking at a conference, each speaker may have time to activate his/her user-specific (pre-trained) speech model that was previously trained and stored in the ASR system, but there is typically no time for each speaker to perform a microphone adjustment process (as described above), which may be needed to adjust the parameters of the ASR system to the speaker's personal speech patterns to obtain an optimal transcription.
The decoding accuracy of an ASR system can also be affected by the type of microphone that was used when training the ASR system or when using the ASR system. For example, decoding accuracy can be decreased when the type of microphone used by a person to train the ASR system is different from the type of microphone used by that person when giving a lecture during a transcription or dictation session. By way of specific example, a person will typically train an ASR system by providing speech training data using a wired microphone connected to the ASR system, while the same speaker may actually use a wireless microphone when using the ASR system during a lecture, meeting, or conference, which can lead to decreased decoding accuracy.