This disclosure relates to the field of automatic speech recognition and receiving audible commands from a speech input device, wherein the audible commands are cross-checked with image data from an imaging device or image sensor, such as a camera focused on a source of the audible commands. Spoken words are produced by mouth movements that shape sound waves, which travel from the speaker's mouth through the air. Vehicle speech entry systems for users often consist of one or more microphones positioned to detect the sound. Typically, these microphones are electromechanical assemblies that mechanically resonate over the range of frequencies of speech (sound waves at frequencies less than 20 kHz). Digital voice tokens (temporal speech fragments) can be sent to artificial voice recognition systems and converted to digital requests (e.g., information technology requests in the vehicle infotainment or vehicle control systems, or external web-based service requests transmitted through wireless networks). The purpose of these audible requests is to simplify and/or automate a desired function to enhance user comfort, convenience, and/or safety, often all three.
Numerous digital and algorithm-driven methods have been developed in an attempt to improve the performance of artificial voice recognition systems. For example, token matching systems based on learning a specific user speech characteristic from audible content are often used to improve the success rates of artificial voice recognition systems. Another typical method is to use artificial intelligence techniques to match the speech characteristic of the voice input with one or more phonetic characteristics (e.g., languages, pronunciations, etc.). An additional method, often used to reduce noise, is to require that the user press an electromechanical button, often on the steering wheel, so that voice capture is limited to the times when the button is depressed.
In some cases, a sound detection and processing system uses one or more microphones, and subsequent signal processing is utilized to reduce the effects of noise (including road noise, noise from vehicle entertainment systems, and non-user audible inputs). Noise reduction can be accomplished through appropriate geometric placement of the microphones to enhance user voice inputs while reducing noise. Also, appropriate symmetric placement of multiple microphones relative to the position of the user during normal driving helps reduce the effects of outside noise sources. Specifically, microphones are positioned symmetrically relative to the boresight vector of the natural mouth position while the eyes are naturally facing forward (e.g., if the user is a driver of the vehicle, "eyes on the road"). The subsequent phase cancellation processing of the microphone inputs has been shown to substantially reduce the effects of noise. In this example, the phase of the user speech signal detected at the multiple microphones is the same (due to the same travel distance from the user's mouth), while noise from other locations inside or outside the vehicle arrives at the multiple microphones with different phases, and thus this sound can be filtered out through various signal processing techniques.
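The effect of symmetric microphone placement described above can be illustrated with a minimal sketch. It is not the processing of any particular system: the 16 kHz sample rate, the 200 Hz tone standing in for speech, the white-noise model, the five-sample inter-microphone noise delay, and all function names are assumptions chosen for illustration. The speech component is identical at both microphones, the noise is not, and the two channels are simply averaged.

```python
import math
import random

SAMPLE_RATE_HZ = 16_000  # assumed sampling rate, for illustration only


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels from separate speech and noise components."""
    return 10.0 * math.log10(power(speech) / power(noise))


def delayed(signal, delay_samples):
    """Shift a signal later in time, zero-padding the start."""
    return [0.0] * delay_samples + signal[: len(signal) - delay_samples]


def average_channels(a, b):
    """Combine two microphone channels by sample-wise averaging."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def simulate(seed=0, noise_delay_samples=5):
    """Return (single-microphone SNR, two-microphone SNR) in dB."""
    rng = random.Random(seed)
    n = SAMPLE_RATE_HZ  # one second of audio

    # Stand-in for speech: reaches both microphones in phase because the
    # microphones are equidistant from the mouth.
    speech = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE_HZ) for t in range(n)]

    # Off-axis noise: reaches microphone B later than microphone A.
    noise_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise_b = delayed(noise_a, noise_delay_samples)

    # Averaging is linear, so the speech and noise parts of the combined
    # output can be computed separately.
    speech_out = average_channels(speech, speech)
    noise_out = average_channels(noise_a, noise_b)

    return snr_db(speech, noise_a), snr_db(speech_out, noise_out)


if __name__ == "__main__":
    single, combined = simulate()
    print(f"single microphone SNR: {single:.2f} dB")
    print(f"two-microphone SNR:    {combined:.2f} dB")
```

Under these assumptions the in-phase speech passes through the averaging unchanged, while the misaligned noise partially cancels, so the combined output shows a higher signal-to-noise ratio than either microphone alone. For noise that is uncorrelated between the two channels, the expected gain from averaging two microphones is about 3 dB. Practical systems generally use more microphones and more elaborate filtering.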
Errors in automated speech recognition processes can lead to incorrectly determining the intended user speech, resulting in potential frustration (and/or distraction) of the user. For example, the speech recognition might incorrectly interpret the sound and make the wrong request (e.g., calling the wrong person). Alternatively, the speech recognition may ignore the request. One goal of the automated speech recognition process, including the sound detection and measurement system, is to maximize the quality of the user's speech input sounds (signal) and minimize undesired sounds (noise), i.e., to maximize the signal-to-noise ratio (SNR).
One problem in the field of automated speech recognition lies in the lack of credible ways for prior art systems to double-check a perceived speech input against additional out-of-band information (i.e., information other than standard audio signal analysis). A need exists in the art for automatic speech recognition systems configured so that user commands, issued to the system for vehicle operation and performance, are confirmed in terms of origin, authorization, and content.