This application, and the innovations and related subject matter disclosed herein, (collectively referred to as the “disclosure”) generally concern speech recognition systems and associated techniques. More particularly, but not exclusively, this disclosure pertains to speech recognition systems having more than one microphone and a recognition engine configured to consider concurrently several versions of a given utterance, and associated processing techniques. As but one particular example, a speech recognition engine can generate several ordered lists of transcription candidates based on several independent streams of a given utterance, and a selector can determine a “best” transcription from the lists of transcription candidates. In another particular example, a speech recognition system can concurrently extract likely acoustic features (e.g., phonemes, triphones and corresponding posterior probabilities and/or likelihoods) from each of several versions of an utterance, then combine the posterior probabilities or likelihoods associated with the acoustic features to generate a refined stream of acoustic features having improved overall likelihoods. Such a system can apply one or more recognition operations to the stream of such acoustic features to derive a highest-likelihood transcription of the utterance.
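As but one illustrative sketch of combining posterior probabilities from several versions of an utterance, the following Python example fuses per-frame phoneme posteriors using a weighted sum of log-posteriors followed by renormalization. The disclosure does not specify this particular fusion rule at this point; the log-linear combination, the function name `combine_posteriors`, the dictionary-based frame layout, the uniform default weights, and the probability floor of 1e-12 are all assumptions made solely for illustration.

```python
import math


def combine_posteriors(streams, weights=None):
    """Fuse per-frame phoneme posteriors from several streams (hypothetical helper).

    streams: list of streams; each stream is a list of frames, and each frame
    is a dict mapping a phoneme label to its posterior probability.
    Returns a single fused stream as a list of normalized frames.
    """
    if weights is None:
        # Assumption: every stream is trusted equally.
        weights = [1.0 / len(streams)] * len(streams)
    fused = []
    for frames in zip(*streams):
        labels = set().union(*frames)
        # Weighted sum of log-posteriors, i.e., a weighted geometric mean.
        # Missing labels are floored at 1e-12 to avoid log(0).
        log_scores = {
            label: sum(
                w * math.log(max(frame.get(label, 0.0), 1e-12))
                for w, frame in zip(weights, frames)
            )
            for label in labels
        }
        # Renormalize so that each fused frame sums to one.
        peak = max(log_scores.values())
        unnormalized = {k: math.exp(v - peak) for k, v in log_scores.items()}
        total = sum(unnormalized.values())
        fused.append({k: v / total for k, v in unnormalized.items()})
    return fused


# Two single-frame streams that agree on "ah" with different confidence.
streams = [
    [{"ah": 0.6, "eh": 0.4}],
    [{"ah": 0.9, "eh": 0.1}],
]
fused = combine_posteriors(streams)
```

In this sketch, labels on which the streams agree are reinforced in the fused frame, while a label that any one stream considers very unlikely is strongly penalized. Other combination rules (e.g., a weighted arithmetic mean, or weights derived from an estimated signal quality of each stream) could be substituted.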
Automatic speech recognition (ASR) is especially challenging in the far-field because the direct-path speech signal is impaired by additive noise and reverberation due to additional paths between the talker and microphone. Consequently, word error rates (WER) for a near-field trained system can increase from around 10% when used with a near talker to 60-80% when applied to distant talking. Near-field speech can also be impaired by additive noise and/or reverberation, for example when recorded on a street or in small enclosures or rooms with acoustically reflective walls. In general, WER must be below 15% for a user to reliably execute common home automation tasks (e.g., set a thermostat, turn on sprinklers, send messages, play music, etc.).
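For reference, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn a reference transcript into the recognized transcript, divided by the number of words in the reference. The following Python sketch shows that calculation using a word-level edit distance; the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, and practical scoring tools typically also normalize case and punctuation.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,                           # deletion
                    curr[j - 1] + 1,                       # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substituted word ("off" -> "on")
# against a four-word reference.
wer = word_error_rate("turn the light off", "turn light on")
```

Because insertions are counted against the reference length, WER can exceed 100% when the recognized transcript contains many spurious words.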
In practice, far-field communication with home automation devices encumbers users because they must usually stand near a single receiver and speak towards that receiver. Aside from constraining a user's movement, such limited human-device communications have a subjectively unnatural feeling compared to the more unencumbered human-to-human communications, since the user may not be able to speak directly towards or into a device's microphone to control it.
As one illustrative example, a light switch and a thermostat can have respective receivers and recognition mechanisms. Currently, such devices operate independently of each other and each responds (or attempts to respond) to detected user commands, regardless of whether the detected user commands are intended by the user to be directed to that or another device. Thus, in the light switch and thermostat example, each device could easily (and possibly erroneously) infer that a user has asked it to switch off when the user says “turn the light off.” To overcome such errors, users currently must approach and speak directly at or into a given device (e.g., to increase a signal-to-noise ratio received by the respective speech recognition mechanism), as indicated in FIG. 10.
A recorded utterance can reflect a unique combination and selected degree of any number of signal characteristics. Such signal characteristics can include, and each unique combination of signal characteristics can reflect a selected degree of, frequency response, inherent reverberation, background noise, signal-to-noise ratio, signal-to-echo ratio, residual echo, and/or applied signal processing. Signal processing can be applied to a recorded utterance in an attempt to condition the signal for a speech recognition engine. Such conditioning can include suppression and/or cancellation of reverberation, echo, and/or noise. Signal characteristics can also be manipulated and influenced using other available signal processing techniques, including acoustic-beamforming, noise-suppression, de-reverberation, echo-cancelling, echo-suppression, and spatial-noise suppression techniques.
The effects of applying signal processing techniques on a given recording of an utterance, and the quality or degree of impairment to each unprocessed version of an utterance derived from one or more respective microphones, typically correspond to a state of the acoustic environment when the recording is made. And, parameters relevant to defining the acoustic environment include, for example, position of each microphone relative to the user making the utterance; position of each microphone and position of the user in relation to walls of the room and other acoustically reflective surfaces and/or objects; room dimensions; presence, position and type of other sound sources; presence, position and type of active loudspeakers; etc.
Interestingly, many of those very signal characteristics often negatively affect speech recognition systems, causing a decrease in any of several measures of accuracy or quality of recognized speech in relation to a user's actual utterance. Some of the noted signal characteristics inherently arise from the state of an acoustic environment, and some result (intentionally or unintentionally) from applying a given signal-processing technique to a signal. Speech recognition systems are notoriously sensitive to signal characteristics such as additive noise, reverberation, and non-trivial processing of input utterances. Nonetheless, it is not always apparent which characteristics might be “preferred” by a given recognizer engine in order to achieve improved (or even satisfactory) speech recognition results. For example, signals that sound perceptually “better” or “cleaner” to a human listener often do not produce a correspondingly “better” (e.g., lower) word-error-rate in a speech recognition system.
Therefore, speech recognition systems with reduced error rates are needed. A need exists for far-field receivers that can be distributed spatially throughout a room. As well, a need exists for a new front-end signal processing and speech recognition architecture that combines inputs from the distributed receivers, as well as any available near-field (e.g., mobile) devices, to achieve a desirable WER (e.g., less than or equal to about 15%) regardless of the position or orientation of a user in a given room.
As well, speech recognition systems are needed to resolve intended operations among various devices. More generally, a need exists for speech recognition systems that provide users with a more natural and less attention-consuming experience. There further remains a need for signal processors suitable for use with such speech recognition systems. There also remains a need for speech-recognition techniques suitable for such systems exposed to substantial background noise and/or systems having one or more microphones positioned in close proximity to a noise source (e.g., an air handler of an HVAC system, or one or more loudspeakers). And, a need exists for systems configured to implement such speech recognition techniques.