Many vehicles include automatic speech recognition (ASR) systems configured to control various subsystems within the vehicles, such as heaters and air-conditioners (collectively “cabin temperature controls”), power windows and mobile telephones. Such systems respond to commands uttered by human speakers (“speakers”), typically drivers, but also sometimes passengers. Environments within vehicles pose challenges for these ASR systems, at least in part due to sound reflections (“reverberations”) from hard surfaces, such as glass windows, in close proximity to the speakers, as well as road noise and wind noise.
Some vehicles include intercom systems that amplify speech detected by microphones located near front seats and play the amplified speech through loudspeakers located near rear seats, to facilitate conversation between front-seat occupants and rear-seat occupants. However, direct sounds from the speakers, combined with delayed sounds from the loudspeakers, often interfere with understanding of the speech.
So-called “smart room” conference facilities include microphones and video cameras that enable conference participants in one location to converse with, see and be seen by participants in another such facility. The multitude of microphones located throughout each such facility can, however, pick up sounds other than speech of one person who currently “has the floor,” thereby introducing noise into the audio stream.
Some home entertainment systems, such as television receivers, also include ASR systems to control volume, channel, source, etc. Similarly, some single- or multi-player games can be controlled by voice commands. Performance of entertainment, game and other systems that recognize or respond to voice commands is hampered by many of the same issues listed above.
Various techniques have been employed in attempts to improve microphone systems and front-end signal processing systems to ameliorate the problems summarized above. Some such attempts are described below.
A space in which an audio system is used may generically be referred to as a “room,” and propagation of acoustic signals within a room may be modeled by an acoustic room transfer function (RTF). For example, Jiraporn Pongsiri, et al. discuss understanding and modeling room acoustics in “Modeling the acoustic transfer function of a room,” Proceedings of the 12th International Conference on Scientific Computing and Mathematical Modeling, Chicago, Ill., pp. 44, 1999. Many audio systems include signal processors, such as filters, that are designed based on assumed or measured RTFs.
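By way of illustration only, the time-domain counterpart of an RTF is a room impulse response, and the signal arriving at a microphone may be approximated by convolving the source signal with that response. The following minimal sketch assumes a short, hypothetical impulse response made up of a direct path and two attenuated reflections; the function name `apply_room_response` and all numeric values are illustrative assumptions and are not taken from any cited reference.

```python
def apply_room_response(dry, impulse_response):
    """Convolve a dry source signal with a room impulse response."""
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for n, x in enumerate(dry):
        for k, h in enumerate(impulse_response):
            # Each source sample contributes a scaled, delayed copy of the response.
            out[n + k] += x * h
    return out


# Hypothetical impulse response: direct path, then reflections 2 and 4 samples later.
impulse_response = [1.0, 0.0, 0.5, 0.0, 0.25]
dry = [1.0, 2.0, 0.0]
wet = apply_room_response(dry, impulse_response)
```

A filter designed from an assumed or measured RTF would, in effect, attempt to undo or compensate for the delayed copies that such a convolution introduces.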
G. Schmidt and T. Haulick disclose limiting gain of rear loudspeakers in a vehicle, according to a delay between a primary source (e.g., a sound directly from a driver) and a secondary source (e.g., a sound from a loudspeaker in the rear of the vehicle) to avoid mislocalization of sounds by rear-seat passengers. E. Hänsler, G. Schmidt: Topics in Acoustic Echo and Noise Control, Springer 2006, Chapter 14 “Signal Processing for In-Car Communication Systems.” However, the authors do not disclose or suggest how to detect such a delay. The authors merely describe an experiment in which such a delay was artificially created between two loudspeakers and subjects were asked to adjust volume of the delayed sound, relative to the non-delayed sound.
In “A multi-microphone approach to speech processing in a smart-room environment,” Alberto Abad Gareta discloses using visual, audio or audio-visual information to estimate head orientation of a speaker to select microphones aimed at the speaker. See section 5.4 and pages 108 and 150.
Michael A. Casey, et al., disclose using a video camera to estimate location of a speaker and then steer a fixed beamforming algorithm to the speaker. In addition, a stereo output is controlled, based on the location estimate, to improve a 3D-spatial audio output. Vision Steered Beam-forming and Transaural Rendering for the Artificial Life Interactive Video Entertainment (ALIVE), Audio Engineering Society Convention 99, 10/1995.
Markus Guldenschuh discloses a camera for user tracking; however, the author focuses on loudspeaker arrays, not microphone steering. Transaural Beamforming; Methods for Controllable Focused Sound Reproduction, Diploma Thesis, Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria, September 2009.
Christoph Boges, et al. disclose both acoustic and visual localization techniques to estimate location of a speaker; however, this estimate is used only to steer a microphone array. Algorithms for Audiovisual Speaker Localisation in Reverberant Acoustic Environments, Proceedings of the 3rd Workshop on Positioning, Navigation and Communication (WPNC '06), March 2006.
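For context, steering a microphone array toward an estimated speaker location is commonly done with a delay-and-sum beamformer. The sketch below is a generic illustration of that well-known technique and is not the specific method of any reference cited above. The function names, the speed-of-sound constant, and the rounding of delays to whole samples are assumptions made for brevity.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def steering_delays(mic_positions, source_position):
    """Return per-microphone delays (seconds) that time-align the source."""
    distances = [math.dist(m, source_position) for m in mic_positions]
    farthest = max(distances)
    # Nearer microphones hear the source earlier, so they are delayed more.
    return [(farthest - d) / SPEED_OF_SOUND for d in distances]


def delay_and_sum(channels, delays, sample_rate):
    """Shift each channel by its delay (rounded to whole samples) and average."""
    shifts = [round(d * sample_rate) for d in delays]
    out = [0.0] * (len(channels[0]) + max(shifts))
    for channel, shift in zip(channels, shifts):
        for i, sample in enumerate(channel):
            out[i + shift] += sample / len(channels)
    return out
```

Sound from the estimated location adds coherently after alignment, whereas sound from other directions, including reflections, generally does not, which is why the accuracy of the location estimate matters.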
Thus, although steering microphone arrays with estimates of speaker location is known, problems still exist with the quality of audio signals obtained from such steered microphone arrays, as well as from non-steered microphones.