The invention relates to teleconferencing. More particularly, the invention relates to a method and apparatus for selecting signals in a teleconference.
The primary goal of teleconferencing systems is to provide, at a remote teleconference site, a high fidelity representation of the persons present and of events occurring at a local teleconference site. A teleconferencing system that represents the local conferencing site with sufficient fidelity enables effective communication and collaboration among the participants despite their physical separation.
In practice, it is difficult to capture the persons and events at a local conferencing site effectively using a single video feed from a single video camera and a single audio feed from a single microphone. This is especially true in conferences with more than one local conferencing participant. While employing a single camera with a wide-angle view of a local conferencing site may successfully capture more than one participant within the camera field of view, such views create a sense of distance that is neither comfortable nor engaging for the remote participant.
Several prior art video conferencing systems, including the Viewstation MP, manufactured by Polycom, Inc. of Pleasanton, Calif., have attempted to mitigate this shortcoming with a motion control video camera. The camera automatically tracks a single video conferencing participant or pans and tilts to capture multiple participants, successively, within the field of view. While this approach does provide a closer view of individual participants, the moving view captured by a panning and tilting camera as it transitions from one participant to another is disconcerting when viewed by the remote participant.
To avoid the panning and tilting motion provided by motion control cameras, several prior art conferencing systems, including the CT-4A Automatic Mixer, manufactured by Jefferson Audio Systems of Louisville, Ky., have incorporated video feeds from multiple video cameras, and audio feeds from multiple microphones. In addition, many systems allow for the transmission of video and audio feeds from sources such as laptop computers, document cameras, and video cassette recorders.
Because a teleconferencing system must operate within the limited bandwidth connecting a local and remote location, it is in practice not possible to transmit all of the audio and video signals to the remote location. Moreover, the amount of visual and aural information the remote participant can comfortably process is itself limited. It is therefore desirable to determine, among the many video and audio feeds available at the local conferencing site, which feed or feeds to transmit to the remote location.
Several prior art approaches, including U.S. Pat. No. 6,025,870 to Hardy, have suggested that the selection of the video and audio signals may be performed in a manner that simulates the shift in attention of an observer physically present at the local site. For example, the selected video signal may be obtained from a video camera offering a prominent view of the current speaker, and the selected audio signal may be obtained from a microphone offering the clearest rendering of the current dialogue. Presenting video and audio signals to the remote participant in this manner affords a more natural interaction with the local teleconferencing site.
In some instances, selection of signals in this manner requires a human operator. This approach is distracting if carried out by a meeting participant and costly if carried out by a hired director. A few systems, however, attempt to perform the signal selection in an automated manner. T. Inoue, K. Okada, and Y. Matsushita, Learning from TV Programs: Application of TV Presentation to a Videoconferencing System, Proceedings of the ACM Symposium on User Interface Software and Technology, pp. 147-154, Pittsburgh, Pa. (Nov. 14-17, 1995), propose an automated system emulating the direction techniques used in the television industry.
U.S. Pat. No. 6,025,870 to Hardy describes a system for automatically capturing the changing focus of a video conference. The system “includes a video switch for selecting focus video information, a physical video input node coupled to provide physical video information to the video switch, a graphics processing module coupled to provide graphical video information to the video switch, and a remote source interface coupled to provide remote video information to the video switch. The videoconference system further includes an audio processing module for processing audio information. A record controller is coupled to the video switch, the graphics processing module and the audio processing module. The record controller is coupled to receive event information from the audio processing module and the graphics processing module. The record controller automatically determines a focus video source from the physical video input, the graphics processing module and the remote source interface responsive to receiving the event information. The record controller controls the video switch to couple the focus video source to a video switch output responsive to determining the focus video source.”
While the systems disclosed by Inoue et al. and Hardy do provide improvement over more traditional systems, several deficiencies remain. In particular, the Inoue system merely considers a relative probability of transitions from a current signal to a subsequent signal based on the classes of the current signal and available signals, where the signal classes are defined by the subject matter represented by the video signal. The system has at most a very limited sense of the current state and context of the video conference. The system is therefore unable to make a meaningful selection of an appropriate signal based on the specific progression of events in a particular video conference, and instead transitions from one signal to another along standardized sequences.
The system disclosed by Hardy does incorporate an understanding of the current state of the conference, as indicated by the events received by the record controller. However, the ability of the system to respond to the changing state of the conference is limited to specific responses to specific events. Most notably, the system is unable to develop a continually refined assessment of the state and context of the conference. Instead, the system merely waits for a recognized event and then responds accordingly.
Moreover, neither system suggests that the selection of signals could be based on a history of the conference state, or a prediction of future conference states. Further, neither prior art system attempts to develop a quantitative estimate of the suitability of selection for each of the potentially selected signals. In these regards, the systems are more rule-based than model-based.
Finally, the prior art systems do not suggest a signal selection method that changes throughout the course of a conference to remain consistent with the changing dynamics of a typical business meeting.
What is needed is a system that continually monitors a teleconference to develop an understanding of the state and context of the conference. Based on this understanding, the system should consider and evaluate each candidate configuration of output signals, preferably quantitatively, and select from among the candidate output configurations a most desirable output configuration. In this manner, the system should develop a model of the conference, preferably incorporating a sense of continuity in the progression of selected output configurations. Further, the model is preferably varied throughout the course of the conference to allow for the changing dynamics of a typical business meeting.
Furthermore, the system, when operated at a local video conferencing site, should be compatible with any existing teleconferencing equipment present at the remote site.
Finally, the system should have interfaces that are simple and intuitive, allowing use by those with little or no computer literacy.
Importantly, the automated selection should be accomplished in a manner providing an accurate and engaging representation of the teleconference, thus allowing for more natural and meaningful interaction between physically separated teleconference participants.
The invention provides appropriate output signals to output devices in a teleconference setting. Input signals are obtained from input devices, and information describing the teleconference is received from several sensors. On a substantially continuous basis, using the descriptive information, a desirability is computed for each of several possible output configurations, where each output configuration specifies a routing of output signals to output devices. The most desirable output configuration is then selected, and output signals are provided to output devices as specified by the selected output configuration.
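By way of illustration only, the selection step described above may be sketched as follows. The sketch assumes that each output configuration is represented as a tuple of (output device, output signal) pairs and that a desirability has already been computed for each candidate. The function name select_configuration, the device and signal names, and the numerical values are hypothetical and do not appear in the description.

```python
def select_configuration(candidates, desirability):
    """Return the candidate output configuration with the highest desirability.

    candidates   -- iterable of hashable output configurations (hypothetical
                    representation: a tuple of (device, signal) pairs)
    desirability -- callable mapping a configuration to a numeric score
    """
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no candidate output configurations")
    return max(candidates, key=desirability)


# Hypothetical desirabilities for three candidate routings to one monitor.
scores = {
    (("monitor", "camera_1"),): 0.25,
    (("monitor", "camera_2"),): 0.75,
    (("monitor", "split_screen_1_2"),): 0.5,
}

best = select_configuration(scores, scores.get)
routing = dict(best)  # device -> signal mapping for the selected configuration
```

In a running system this selection would be repeated on a substantially continuous basis as new descriptive information arrives; the sketch shows a single evaluation.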
Exemplary input devices include video cameras, computers, document scanners, and microphones. Exemplary sensors include microphones, motion detectors, and security badge readers. Output signals are composed from the input signals provided by the input devices. Examples of output signal composition include selecting a single input signal and composing a split-screen view from two or more input signals. The output signals are provided to output devices such as television monitors, computer displays, video recording devices, audio recording devices, and printers.
In the preferred embodiment of the invention, the desirability of each possible output configuration is calculated based on contributions from several components. Each component is multiplied by a component weighting and then additively combined with the other components to yield the desirability. These components can include, for example, an activity component, a saturation component, and a continuity component.
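The weighted additive combination described above may be illustrated by the following minimal sketch. The function name weighted_sum, the dictionary keys, and all numerical values are hypothetical and are chosen only so that the arithmetic is easy to follow.

```python
def weighted_sum(values, weights):
    """Multiply each named value by its weighting and add the products."""
    missing = set(values) - set(weights)
    if missing:
        raise KeyError(f"no weighting supplied for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in values.items())


# Hypothetical component values for one candidate output configuration.
components = {"activity": 0.75, "saturation": -0.25, "continuity": 0.5}

# Hypothetical component weightings (adjustable parameters).
component_weightings = {"activity": 1.0, "saturation": 0.5, "continuity": 2.0}

# 1.0 * 0.75 + 0.5 * (-0.25) + 2.0 * 0.5
desirability = weighted_sum(components, component_weightings)
```

The same helper could be applied one level down, wherever a component is itself formed from weighted terms.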
The activity component is based on contributions from several activity terms. Each activity term is multiplied by an activity term weighting and then additively combined with the other activity terms to yield the activity component of the desirability. Activity terms can, for example, include an audio activity term, a motion activity term, an audio undercoverage term, and an audio overcoverage term.
The audio activity term reflects the desirability of the possible output configurations based on audio activity detected by microphones within the teleconference site.
The motion activity term reflects the desirability of the possible output configurations based on motion detected by motion sensors within the teleconference site.
The audio undercoverage term indicates an increasing desirability for those output configurations incorporating output signals related to audio activity and yet not incorporated within the output configuration currently provided to the output devices.
Finally, the audio overcoverage term indicates a decreasing desirability for those output configurations incorporating output signals not related to audio activity and yet incorporated within the output configuration currently provided to the output devices.
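One possible reading of the audio undercoverage and audio overcoverage terms is sketched below. The sketch assumes that each term grows in magnitude with the time a signal has been undercovered or overcovered; the description states only that the desirability increases or decreases, so the linear dependence on time is an assumption. The function name coverage_terms and the signal names are hypothetical.

```python
def coverage_terms(candidate, current, active, uncovered_time, overcovered_time):
    """Return (undercoverage, overcoverage) terms for one candidate.

    candidate, current -- sets of signals in the candidate configuration and
                          in the configuration currently provided
    active             -- set of signals related to detected audio activity
    uncovered_time     -- seconds each signal has been active but not shown
    overcovered_time   -- seconds each signal has been shown without activity
    (Assumption: both terms scale linearly with these durations.)
    """
    under = sum(
        uncovered_time.get(s, 0.0)
        for s in candidate
        if s in active and s not in current
    )
    over = -sum(
        overcovered_time.get(s, 0.0)
        for s in candidate
        if s not in active and s in current
    )
    return under, over


# Hypothetical state: camera_1 is shown but silent, camera_2 is active but not shown.
current = {"camera_1"}
active = {"camera_2"}
uncovered_time = {"camera_2": 4.0}
overcovered_time = {"camera_1": 3.0}

switch_terms = coverage_terms({"camera_2"}, current, active,
                              uncovered_time, overcovered_time)
stay_terms = coverage_terms({"camera_1"}, current, active,
                            uncovered_time, overcovered_time)
```

Under these assumptions the candidate showing camera_2 receives a positive undercoverage term, while the candidate that keeps camera_1 receives a negative overcoverage term.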
The saturation component indicates an increasing desirability for output configurations incorporating output signals not currently provided to the output devices, and a decreasing desirability for output configurations incorporating output signals currently provided to at least one of said output devices.
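The saturation component may be illustrated, under stated assumptions, by tracking a per-signal saturation level that rises while a signal is provided to an output device and falls while it is not. The rise and fall rates, the clamping to the range 0 to 1, and the function names update_saturation and saturation_component are hypothetical and are not taken from the description.

```python
def update_saturation(levels, shown, dt, rise=0.1, fall=0.05):
    """Return new per-signal saturation levels after dt seconds.

    Signals in `shown` become more saturated; all others recover.
    Levels are clamped to the range [0, 1] (an assumption of this sketch).
    """
    updated = {}
    for signal, level in levels.items():
        if signal in shown:
            level = min(1.0, level + rise * dt)
        else:
            level = max(0.0, level - fall * dt)
        updated[signal] = level
    return updated


def saturation_component(candidate_signals, levels):
    """Penalize a candidate by the saturation of the signals it incorporates."""
    return -sum(levels.get(s, 0.0) for s in candidate_signals)


# Hypothetical levels: camera_1 is currently shown, camera_2 is not.
levels = {"camera_1": 0.0, "camera_2": 0.5}
levels = update_saturation(levels, shown={"camera_1"}, dt=2.0)

stay = saturation_component({"camera_1"}, levels)
switch = saturation_component({"camera_2"}, levels)
```

As time passes with camera_1 shown, the penalty on configurations incorporating camera_1 grows and the penalty on configurations incorporating camera_2 shrinks.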
The continuity component is based on contributions from several continuity terms. Each continuity term is multiplied by a continuity term weighting and then additively combined with the other continuity terms to yield the continuity component of the desirability. The continuity terms can include, for example, a spatial continuity term, a context continuity term, a rapid switching continuity term, and a sustained switching continuity term.
The spatial continuity term indicates a greater desirability for output configurations similar to the output configuration currently provided to the output devices.
The context continuity term indicates a greater desirability for output configurations recently provided to the output devices.
The rapid switching continuity term indicates a greater desirability for the output configuration currently provided to the output devices, and a lesser desirability for all other output configurations, the difference in desirability attaining a maximum value when the current output configuration is initially selected and decreasing thereafter.
Finally, the sustained switching continuity term indicates a greater desirability for the output configuration currently provided to the output devices, and a lesser desirability for all other output configurations, the difference in desirability being proportional to the rate of switching between output configurations in the recent history of the conference.
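The two switching continuity terms may be sketched as follows. The description specifies only that the rapid switching difference is greatest at the moment of selection and decreases thereafter, and that the sustained switching difference is proportional to a recent switching rate; the exponential decay, the fixed-length history window, and every parameter name and default value below are assumptions made for illustration.

```python
import math


def rapid_switching_term(candidate, current, seconds_since_switch,
                         peak=1.0, time_constant=5.0):
    """Favor the current configuration most strongly right after a switch.

    Assumption: the advantage decays exponentially from `peak`.
    """
    bonus = peak * math.exp(-seconds_since_switch / time_constant)
    return bonus if candidate == current else 0.0


def sustained_switching_term(candidate, current, switch_times, now,
                             window=60.0, gain=1.0):
    """Favor the current configuration in proportion to the recent switching rate.

    Assumption: the rate is the number of switches in the last `window`
    seconds divided by `window`.
    """
    recent = [t for t in switch_times if now - window <= t <= now]
    rate = len(recent) / window
    return gain * rate if candidate == current else 0.0


# Hypothetical history: switches at 10 s, 20 s, 50 s and 100 s; evaluated at 100 s.
just_switched = rapid_switching_term("config_a", "config_a", 0.0)
settled = rapid_switching_term("config_a", "config_a", 10.0)
sustained = sustained_switching_term("config_a", "config_a",
                                     [10.0, 20.0, 50.0, 100.0], now=100.0)
```

In both functions every configuration other than the current one receives zero, so the returned value for the current configuration is exactly the difference in desirability that the corresponding term describes.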
The component weightings, activity term weightings, and continuity term weightings are adjustable parameters that can be altered to affect the selection of a most desirable output configuration. Values for the adjustable parameters may be provided to suit a particular conference style, and may be varied over the duration of an individual conference.
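One simple way the adjustable parameters might be varied over the duration of a conference is sketched below. Linear interpolation between an opening and a closing set of weightings is an assumption made for illustration; the description does not prescribe how the values change. The function name interpolate_weights and all numerical values are hypothetical.

```python
def interpolate_weights(start, end, fraction):
    """Blend two sets of weightings according to the elapsed fraction.

    fraction -- portion of the conference elapsed, clamped to [0, 1]
    Both dictionaries are expected to contain the same parameter names.
    """
    fraction = min(1.0, max(0.0, fraction))
    return {
        name: (1.0 - fraction) * start[name] + fraction * end[name]
        for name in start
    }


# Hypothetical weightings: continuity is emphasized early, activity later.
opening = {"activity": 1.0, "saturation": 0.5, "continuity": 2.0}
closing = {"activity": 2.0, "saturation": 0.5, "continuity": 1.0}

midway = interpolate_weights(opening, closing, 0.5)
```

Separate opening and closing sets could likewise be stored for each conference style and chosen when the conference begins.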
The invention thus allows a large number of input signals obtained from a wide variety of input devices to be evaluated and routed to a wide variety of output devices using a consistent and logical framework. Diverse information describing the dynamics of the conferencing environment is incorporated in an intuitive manner to provide natural and meaningful interaction between physically separated teleconference participants.