Embodiments of the teachings herein relate to live monitoring of audio captured by multiple spatially distributed microphones. Such captured audio may be live-streamed for presentation within an augmented reality or virtual reality context, or may be stored for later rendering in that regard. The audio is preferably captured by multiple close-up microphones that are positioned near the sound sources of interest, and by multiple microphone arrays that capture a fuller background integration. The close-up microphones may be tracked in order to facilitate realistic rendering of the tracked sound sources in the final mix.
Consider an example of a musical concert: there may be a close-up microphone near each member of the band who is playing a different musical instrument and/or vocalizing, and further microphone arrays dispersed about the stage and throughout the concert hall. With recording capability being ubiquitous in personal mobile phones, the close-up microphones may be smartphones themselves, and there may be further non-array microphones among the audience that capture sound that is incorporated into the final mix. A sound environment captured in this manner can then be processed so as to be presented to a listener as if that listener were at any location, not limited to the specific locations of the microphones themselves; a system offering this flexibility in the audio experience presented to the user is considered a free viewpoint (FVP) system.
FIG. 1 illustrates an example of an audio environment with multiple dispersed microphones capturing sound that may serve as the audio input to a FVP system. Positions 1-10 represent close-up microphones, each generating its own audio channel. In some embodiments, at least some of these microphones may generate more than one channel; for example, a stereo microphone may be utilized. Assuming the sound environment is a musical concert, the microphones at positions 1-10 may be placed near each different band member (guitarist, drummer, lead singer, backup singers, etc.). Positions OP1-OP7 designate microphone arrays, which ideally are positioned at locations deemed to best capture the overall audio environment including ambiance. As one non-limiting example, each of these can be implemented as a Nokia OZO camera, which has a 360° camera view and omnidirectional audio from 8 microphones (see https://ozo.nokia.com/, last visited Nov. 25, 2016). This environment yields a total of 66 audio channels: 10 from the close-up microphones and 56 OZO channels from the 7 different OZO arrays. If all these channels are processed and transmitted individually to a consumer device over an unmanaged Internet-protocol (IP) network such as a wireless local area network (WLAN), the receiving device will find it difficult (depending on the resource availability) to handle all the content. Further, the quality of the WLAN channels over which this content is uplinked, and of the cellular or other WLAN channel over which it is downlinked to the end user, changes dynamically. The difficulties lie in network congestion and in the latency requirements of the audio being delivered. Embodiments of these teachings are directed to managing this audio content to optimize the end user experience under these conditions of high data volume and unstable radio channel quality.
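The channel count in the FIG. 1 example can be checked with a short calculation. The sketch below is illustrative only: the microphone and array counts come from the example above, with one channel assumed per close-up microphone, while the sample rate and bit depth are assumed values (not stated in this description) chosen merely to give a sense of the uncompressed data volume. The function names are hypothetical.

```python
# Counts taken from the FIG. 1 example.
CLOSE_UP_MICS = 10   # positions 1-10, one channel each (assumed mono)
OZO_ARRAYS = 7       # positions OP1-OP7
MICS_PER_OZO = 8     # microphones per OZO array

# Assumed PCM parameters, for illustration only (not from the description).
SAMPLE_RATE_HZ = 48_000
BITS_PER_SAMPLE = 16


def total_channels(close_up=CLOSE_UP_MICS, arrays=OZO_ARRAYS,
                   mics_per_array=MICS_PER_OZO):
    """Return the number of audio channels produced by the capture setup."""
    return close_up + arrays * mics_per_array


def raw_bitrate_mbps(channels, sample_rate_hz=SAMPLE_RATE_HZ,
                     bits_per_sample=BITS_PER_SAMPLE):
    """Return the uncompressed PCM bitrate in Mbit/s for the given channels."""
    return channels * sample_rate_hz * bits_per_sample / 1_000_000


if __name__ == "__main__":
    channels = total_channels()
    print(f"channels: {channels}")
    print(f"raw bitrate: {raw_bitrate_mbps(channels):.3f} Mbit/s")
```

Under these assumed parameters the 66 channels amount to roughly 50 Mbit/s of uncompressed audio, which illustrates why transmitting every channel individually over an unmanaged WLAN can be problematic. If some close-up microphones are stereo, the count and bitrate would be correspondingly higher.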
The currently available solutions that are workable for a FVP audio environment generally rely on dedicated professional hardware and on managed audio-over-IP networks that transmit audio data in a lossless manner. Such solutions are not suitable for prosumer or consumer applications that lack access to expensive professional audio equipment and infrastructure. Some relevant prior art teachings can be seen at U.S. Pat. No. 8,856,049 (co-owned), U.S. Pat. Nos. 9,167,346 and 9,165,558; and at US Patent Application Publication Nos. 2016/0300577 and 2011/0002469.