Field of the Invention
The invention relates to an electronic device and, more particularly, to directional audio-video capture for an electronic device.
Brief Description of Prior Developments
Electronic devices having speaker phone or hands free applications are known in the art. During a hands free voice call, any sounds other than the user's voice may be considered as background noise which should be attenuated (or eliminated) in order to improve the quality of the phone conversation. The noise can be related to environment, network, and audio chains corresponding to sending and receiving signals. The environmental noise (or interfering sounds/background noise) can easily interfere during the hands free voice call and sometimes can exceed the user's voice (signal) level such that it becomes very difficult to separate the two. This may cause a poor signal to noise ratio (SNR).
There are several audio-only source tracking techniques for speech communication known in the art. Conventional configurations apply a directivity pattern to the sent audio that attenuates the sensitivity outside of the source (user) direction, which makes it possible to improve the SNR and remove the unwanted signals from the source signal before the signal is transmitted. However, this assumes that the direction-of-arrival (DOA) of the signal is known or can be estimated. Additionally, audio-based tracking using the conventional techniques generally does not work for a silent moving source.
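As a rough illustration of how a directivity pattern attenuates sound arriving from outside the source direction, the following minimal sketch computes the far-field response of a delay-and-sum beamformer on a uniform linear microphone array. The function name `array_gain`, the array geometry (four microphones, 4 cm spacing), the 2 kHz test frequency, and the speed of sound are assumptions chosen for illustration and are not taken from any of the cited systems.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def array_gain(steer_deg, arrival_deg, num_mics=4, spacing_m=0.04, freq_hz=2000.0):
    """Normalized response (0..1) of a delay-and-sum beamformer.

    Hypothetical helper: a uniform linear array steered towards steer_deg,
    evaluated for a plane wave arriving from arrival_deg (both measured
    from broadside).
    """
    wavenumber = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    # Residual phase step between adjacent microphones after steering.
    phase_step = wavenumber * spacing_m * (
        math.sin(math.radians(arrival_deg)) - math.sin(math.radians(steer_deg))
    )
    total = sum(cmath.exp(1j * m * phase_step) for m in range(num_mics))
    return abs(total) / num_mics


if __name__ == "__main__":
    # Print the pattern for a beam steered towards a source at 0 degrees.
    for angle in range(-90, 91, 30):
        print(angle, round(array_gain(0.0, angle), 3))
```

The response is at its maximum when the arrival angle equals the steering angle and drops for other angles, which is why an incorrect or unknown DOA directly degrades the SNR improvement.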
In the case of hand-held mobile communication devices, the relative position of the sound sources can also change due to the movement of the device. Continuous handling of the device (e.g. due to spontaneous gestures and hand movements) makes the source tracking task much more challenging compared to a traditional meeting room setup where the device can be assumed to be relatively stationary compared to the movement of the sound source. Device movements can introduce very fast changes in the DOA that movement of the sound source alone would be unlikely to produce.
In a typical mobile communication voice call, the relative position of the user and the device can change. Since the audio-only tracking systems require audio data for the calculation of DOA angle(s), this introduces a processing delay for the tracking information (thus preventing real-time source location information updates). Unfortunately, in real-time voice communication the end-to-end delay needs to be minimized for fluent operation. This can lead to several problems. For example, when the user moves during speech pauses, the source tracker may lose the correct source position during the silent periods. When the speaker starts to talk again, the beginning of the sentence could be distorted due to incorrect location information. From the multi-microphone noise reduction point of view, this means that the user's voice is processed as a background noise source until the correct location information is taken into account.
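The dependence of audio-only tracking on audio data can be illustrated with a minimal sketch of a two-microphone DOA estimate based on the time difference of arrival found by cross-correlation. This is a generic textbook technique shown under stated assumptions, not the method of any particular prior system; the function name `estimate_doa_deg`, the sign convention, and the parameter values are hypothetical.

```python
import math


def estimate_doa_deg(left, right, spacing_m, sample_rate_hz, speed=343.0):
    """Estimate the DOA in degrees from one frame of two-microphone audio.

    Assumed convention: a positive angle means the sound reached the left
    microphone first. Returns None for a silent frame, because there is
    nothing to correlate.
    """
    energy = sum(x * x for x in left) + sum(x * x for x in right)
    if energy == 0:
        return None  # silent source: no audio-based estimate is possible

    # Largest physically possible delay between the microphones, in samples.
    max_lag = int(math.ceil(spacing_m / speed * sample_rate_hz))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
        if score > best_score:
            best_lag, best_score = lag, score

    sin_theta = best_lag * speed / (sample_rate_hz * spacing_m)
    sin_theta = max(-1.0, min(1.0, sin_theta))  # guard against rounding
    return math.degrees(math.asin(sin_theta))
```

Because a whole frame of samples must be buffered before the correlation can be evaluated, the estimate always lags the audio, and a frame containing only silence yields no estimate at all.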
Another class of directional audio capture algorithms forms the directivity pattern of the microphone array by utilizing the statistical properties of the signal. These algorithms do not utilize dedicated sound source location information but try to self-adapt to the desired source. Typically these algorithms need to adapt to changes both in the source location and in the room impulse response. This makes these algorithms relatively slow in reacting to instantaneous changes in the environment. It is also non-trivial to control an algorithm that is making autonomous decisions about the source direction without a possibility for external control. For example, in the case of a loud interfering source (a.k.a. jammer), it becomes more difficult to control the microphone array to classify the source as a noise source, especially if the signal statistics of the interfering source are similar to those of the desired source, e.g. in the case of a competing talker.
Additionally, human face detection and video tracking of human faces are known in the art. Face detection deals with the localization of a face (or multiple faces) in an input image. If no prior knowledge about the face position is available, the process includes scanning the entire image. Face tracking may extend face detection by using temporal correlation to locate a human face in a video sequence. Rather than detecting the face separately in each frame, knowledge about the face position in the previous frame is used in order to narrow the search in the current frame.
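The narrowing of the search described above can be sketched as follows. The function name `search_window`, the `(x, y, width, height)` box layout, and the 50% margin are illustrative assumptions; an actual tracker may choose the window differently.

```python
def search_window(prev_box, frame_w, frame_h, margin=0.5):
    """Return the region (x, y, width, height) to scan in the current frame.

    prev_box is the face box from the previous frame, or None when no prior
    knowledge is available, in which case the entire frame is scanned.
    """
    if prev_box is None:
        return (0, 0, frame_w, frame_h)

    x, y, w, h = prev_box
    # Grow the previous box by a margin to allow for face or device motion.
    dx, dy = int(w * margin), int(h * margin)
    left = max(0, x - dx)
    top = max(0, y - dy)
    right = min(frame_w, x + w + dx)
    bottom = min(frame_h, y + h + dy)
    return (left, top, right - left, bottom - top)
```

For a 640x480 frame and a previous face box of 40x40 pixels, the window covers 80x80 pixels instead of the full frame, which is the source of the computational saving of tracking over per-frame detection.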
For example, “Face Detection In Color Images” (R. L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:696-706, 2002), which is hereby incorporated by reference in its entirety, describes one approach to face detection based on skin color detection. Approaches for face detection (or tracking) based on skin color detection generally determine and group the skin color pixels which are found in the image. Next, for each such group of pixels, a bounding box (or the best fitting ellipse) is computed. The skin components which satisfy certain shape and size constraints are selected as face candidates. Finally, features (such as eyes and mouth) are searched for inside each face candidate based on the observation that holes inside the face candidate are due to these features being different from the skin color.
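The first three steps of such an approach (classify skin pixels, group them, and keep the groups whose bounding boxes satisfy shape and size constraints) can be sketched as below. This is not the method of the cited paper: the RGB rule in `is_skin` is a crude assumed heuristic, and the function names and thresholds (`min_pixels`, `max_aspect`) are hypothetical. The image is assumed to be a list of rows of `(r, g, b)` tuples.

```python
from collections import deque


def is_skin(r, g, b):
    """Crude RGB skin heuristic (an assumption, for illustration only)."""
    return (r > 95 and g > 40 and b > 20 and r > g and r > b
            and max(r, g, b) - min(r, g, b) > 15)


def face_candidates(image, min_pixels=4, max_aspect=2.0):
    """Return bounding boxes (x, y, width, height) of plausible skin groups."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or not is_skin(*image[y][x]):
                continue
            # Flood-fill one 4-connected group of skin pixels.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0, x1, y0, y1, count = x, x, y, y, 0
            while queue:
                cx, cy = queue.popleft()
                count += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and is_skin(*image[ny][nx])):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            box_w, box_h = x1 - x0 + 1, y1 - y0 + 1
            # Size and shape constraints on the group's bounding box.
            if count >= min_pixels and max(box_w, box_h) / min(box_w, box_h) <= max_aspect:
                boxes.append((x0, y0, box_w, box_h))
    return boxes
```

The final step, searching each candidate for non-skin holes corresponding to the eyes and mouth, is omitted from this sketch.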
Further, “Detecting Faces In Images: A Survey” (M. Yang, D. J. Kriegman, and N. Ahuja, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:34-58, 2002), which is hereby incorporated by reference in its entirety, describes one approach to face detection based on face texture information.
Moreover, “A Hybrid Approach To Face Detection Under Unconstrained Environments” (A. Hadid, M. Pietikainen, International Conference of Pattern Recognition (ICPR 2006)), which is hereby incorporated by reference in its entirety, describes one approach to face detection based on color and texture information.
U.S. Pat. No. 6,826,284, which is hereby incorporated by reference in its entirety, discloses a system where source tracking information enables device control, such as camera steering, for example.
In addition, “Knowing Who To Listen To In Speech Recognition: Visually Guided Beamforming” (U. Bub, M. Hunke, and A. Waibel, Interactive System Laboratories, IEEE 1995) and “Listen: A System For Locating And Tracking Individual Speakers” (M. Collobert, R. Ferraud, G. Le Tourneur, O. Bernier, J. E. Viallet, Y. Mahieux, D. Collobert, France Telecom, IEEE Transactions (1999)), which are hereby incorporated by reference in their entireties, disclose using a mechanical device to move a camera towards a user's face for visual and audio tracking used in fixed teleconferencing conditions.
“Joint Audio-Video Object Localization and Tracking” (N. Strobel, S. Spors and R. Rabenstein, IEEE Signal Processing Magazine (2001)), discloses an object tracking methodology.
Further, U.S. Pat. No. 5,335,011 discloses using a sound localization technique which is based on the prior knowledge of the position of each user.
However, despite the above advances, there is still a strong need to provide an improved audio capture system.