Remote audio and video communication over a network is increasingly popular for many applications. Through remote audio and video access, students can attend classes from their dormitories, scientists can participate in seminars held in other countries, executives can discuss critical issues without leaving their offices, and web surfers can view interesting events through webcams. As this technology develops, part of the challenge is to provide customized audio to a plurality of users.
Many audio enhancement techniques, such as beam-forming and ICA (Independent Component Analysis) based blind source separation, have been developed in the past. To use these techniques in a real environment, it is critical to know the spatial parameters of users' attention. For example, if the system points a high performance beam former in an incorrect direction, the desired audio may be greatly attenuated due to the high performance of the beam former. The ICA approach suffers from a similar problem. If an ICA system is not configured with information related to what a user wants to hear, the system may provide a reconstructed source signal that suppresses the user's desired audio.
One common form of remote 2-way audio communication is the telephone. Telephone systems give us the opportunity to form a customized audio link with phones. To form telephone links with various collaborators, users are forced to remember a large number of phone numbers. Although modern telephones try to assist users by saving these phone numbers and the corresponding collaborators' names in phone memory, going through a long list of names is still a cumbersome task. Moreover, even if a user has the number of a desired collaborator, the user does not know whether the collaborator is available for a phone conversation.
Many audio pick-up systems of the prior art use far-field microphones. Far-field microphones pick up audio signals from anywhere in an environment. Because audio signals arrive from all directions, a far-field microphone may pick up noise or audio signals that a user does not want to hear. Due to this property, a far-field microphone generally has a worse signal-to-noise ratio than a close-talking microphone. Although a far-field microphone has the drawback of a poor signal-to-noise ratio, it is still widely used for teleconference purposes because remote users may conveniently monitor the audio of an entire environment.
To overcome some of the drawbacks of far-field microphones, such as the pick-up or capture of audio signals from several sources at the same time, some researchers have proposed using the ICA approach to separate sound signals blindly for sound quality improvement. The ICA approach showed some improvement in many constrained experiments. However, this approach also raised new problems when used with far-field microphones. ICA requires more microphones than sound sources to solve the blind source separation problem. As the number of microphones increases, the computational cost becomes prohibitive for real time applications. The ICA approach also requires its user to select proper nonlinear mappings. If these nonlinear mappings do not match the input probability density functions, the result will not be reliable.
Removing independent noises acquired by different microphones is another problem for the ICA approach. Because ICA solves an inverse problem, the inverse matrix it computes will not be stable if the underlying audio mixing matrix is singular. Besides all these problems, the classical ICA approach eliminates the location information of sound sources. Since the location information is eliminated, it becomes difficult for some end users to select ICA results based on location information. For example, an ideal ICA machine may separate signals from ten audio sources and provide ten channels to a user. In this case, the user must check all ten channels to select the source that the user wants to hear. This is very inconvenient for real time applications.
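The instability described above can be illustrated with a small numerical sketch. The following is not an ICA implementation; it shows only the linear unmixing step on which an ICA result ultimately relies, for a hypothetical case of two microphones and two sources. The matrices, the noise level, and the function names are illustrative assumptions and are not taken from any described system.

```python
def invert_2x2(a):
    """Invert a 2x2 matrix given as [[a11, a12], [a21, a22]]."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0.0:
        raise ValueError("mixing matrix is singular; no inverse exists")
    return [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]


def unmixing_error(mixing, mic_noise):
    """Return the largest error in the recovered sources caused by mic_noise.

    Unmixing is linear, so the error in the source estimates is simply the
    inverse mixing matrix applied to the noise on the microphone signals.
    """
    inv = invert_2x2(mixing)
    errors = [
        inv[0][0] * mic_noise[0] + inv[0][1] * mic_noise[1],
        inv[1][0] * mic_noise[0] + inv[1][1] * mic_noise[1],
    ]
    return max(abs(e) for e in errors)


# Hypothetical mixing matrices: rows are microphones, columns are sources.
well_separated = [[1.0, 0.2], [0.2, 1.0]]     # each microphone mostly hears one source
nearly_singular = [[1.0, 0.99], [0.99, 1.0]]  # both microphones hear almost the same mix

noise = [0.01, -0.01]  # assumed small independent noise on each microphone

print("error, well separated :", unmixing_error(well_separated, noise))
print("error, nearly singular:", unmixing_error(nearly_singular, noise))
```

With these assumed values, the same small microphone noise produces a far larger error in the recovered sources when the mixing matrix approaches singularity, and an exactly singular matrix cannot be inverted at all.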
Besides the ICA approach, some other researchers use the beam-forming technique to enhance audio in a specific direction. Compared with the ICA approach, the beam-forming approach is more reliable and depends on sound source direction information. These properties make beam-forming better suited for teleconference applications. Although the beam-forming technique can be used for pick-up of audio signals from a specific direction, it still does not overcome many drawbacks of far-field microphones. The far-field microphone array used by a beam-forming system may still capture noises along a chosen direction. The audio “beam” formed by a microphone array is normally not very narrow. An audio “beam” wider than necessary may further increase the noise level of the audio signal. Additionally, if a beam former is not directed properly, it may attenuate the signal the user wants to hear.
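The effect of beam width and of a misdirected beam can be sketched with a narrowband delay-and-sum beam former, one simple form of beam-forming. The example below is a minimal sketch under stated assumptions: a uniform linear array of eight microphones spaced 4 cm apart, a single 2 kHz tone, a far-field source, and a speed of sound of 343 m/s. None of these values or names come from a described system.

```python
import cmath
import math


def array_gain(steer_deg, source_deg, num_mics=8, spacing_m=0.04,
               freq_hz=2000.0, speed_of_sound=343.0):
    """Normalized magnitude response (0 to 1) of a delay-and-sum beam former.

    steer_deg is the direction the beam is pointed at; source_deg is the
    direction the sound actually arrives from. Both are measured from the
    array broadside, in degrees.
    """
    wavelength = speed_of_sound / freq_hz
    wavenumber = 2.0 * math.pi / wavelength
    # Residual phase step between adjacent microphones after steering.
    delta = wavenumber * spacing_m * (
        math.sin(math.radians(source_deg)) - math.sin(math.radians(steer_deg))
    )
    # Sum the phase-aligned microphone signals and normalize by array size.
    total = sum(cmath.exp(1j * m * delta) for m in range(num_mics))
    return abs(total) / num_mics


# Beam steered to broadside (0 degrees); sources at increasing offsets.
for offset in (0, 5, 15, 30):
    gain = array_gain(steer_deg=0.0, source_deg=float(offset))
    print(f"source at {offset:2d} degrees: relative gain {gain:.3f}")
```

Under these assumptions, a source only a few degrees away from the steering direction is barely attenuated, which reflects the relatively wide beam, while a pointing error of 30 degrees attenuates the source strongly.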
FIG. 1 illustrates a typical control structure 100 of an automatic beam former control system of the prior art. Here, the control unit 140 (implemented by a computer or processor) acquires environmental information 110 with sensors 120, such as microphones and video cameras. The microphones used for the control may be the microphones used for beam-forming. To keep the control structure clear, a single sensor block is illustrated to represent both the audio and visual sensors. Based on the audio and visual sensory information, the control unit 140 may localize the region of interest and point the beam former 130 at that region. In this system, the sensors and the controlled beam former must be aligned well to achieve quality audio output. This system also requires a control algorithm to accurately predict the region in which audience members are interested. Computer prediction of the region of interest is a considerable problem.
FIG. 2 shows the control structure 200 of a traditional human-operated audio management system. Here, the human operator 230 continuously monitors environment changes via audio and video sensors 220, and adjusts the gain of various microphones based on those changes. Compared to state-of-the-art automatic microphone management, a human-controlled audio system is often better at selecting meaningful, high quality audio signals. However, human-controlled audio systems require people to continuously monitor and control audio mixers and other equipment.
What is needed is an audio device management system that enhances audio acquisition quality by using human suggestions and by learning audio pick-up strategies and camera management strategies from user operations and input.