Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Audio and video recording devices may often be used at events to record conversations, meetings, speakers, concerts, and other similar events. Audio data may be extracted from the resulting audio/video recordings in order to localize audio sources, to identify the audio sources in the scene, and to transcribe the audio data for future searching and indexing. Current techniques for recording audio data may include placing microphones at known, fixed locations in a scene and spatially resolving different sound sources using more than one microphone. Microphone arrays with two or more microphones may allow for differential sensing of sound and for listening to specific areas of a scene, provided that the relative positions of the microphones are known precisely.
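By way of illustration only, the following minimal sketch shows why a precisely known microphone spacing matters for a fixed two-microphone pair. It is not taken from this application. It assumes a far-field source, a speed of sound of 343 m/s, and hypothetical helper names (estimate_lag, bearing_from_lag); the sample rate, spacing, and test signal are likewise assumed values. The arrival-time difference between the two microphones is estimated by brute-force cross-correlation and then converted to a bearing using the spacing.

```python
import math
import random

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees C


def estimate_lag(sig_a, sig_b, max_lag):
    """Return the lag in samples (positive if sig_b lags sig_a)
    that maximizes the cross-correlation of the two signals."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, a in enumerate(sig_a):
            m = n + lag
            if 0 <= m < len(sig_b):
                score += a * sig_b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_lag(lag_samples, sample_rate_hz, mic_spacing_m):
    """Far-field bearing in degrees from broadside for a two-microphone pair.
    The result is only as accurate as mic_spacing_m."""
    path_diff_m = SPEED_OF_SOUND_M_S * lag_samples / sample_rate_hz
    # Clamp so that measurement noise cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, path_diff_m / mic_spacing_m))
    return math.degrees(math.asin(ratio))


# Synthetic example: microphone B hears the same noise 10 samples after A.
rng = random.Random(0)
sig_a = [rng.gauss(0.0, 1.0) for _ in range(512)]
sig_b = [0.0] * 10 + sig_a[:-10]

lag = estimate_lag(sig_a, sig_b, max_lag=20)
angle = bearing_from_lag(lag, sample_rate_hz=48000, mic_spacing_m=0.2)
print(lag, round(angle, 1))
```

In this sketch the bearing depends directly on mic_spacing_m, so an error in the assumed spacing would translate into an error in the estimated direction. This is one reason such approaches may rely on microphones at known, fixed locations.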
With the proliferation of handheld and mobile technology, users may frequently use handheld mobile devices, such as smart phones, to capture photographs and videos of events and scenes. Users may often upload the captured videos to websites and social networks to share them with other users. The resulting archives may contain a large number of video recordings of an event or scene from a wide variety of angles and perspectives. Each of the video recordings may capture audio data for the scene. However, because the video capturing devices may be handheld, movable devices rather than devices at fixed locations within the scene, audio extraction and localization approaches similar to a fixed microphone array approach may not be directly applicable.