To the greatest possible extent, a video conference should eliminate the impression of a physical separation between its participants. In addition to providing high image and sound quality, a useful video conference system should facilitate each conferee's participation by automatically pre-eliminating irrelevant information so that he or she is faced with a manageable flow and can comfortably focus on the discussion. In a real-life situation, a manageable information flow is created by the participant alone by simply directing his or her eyes and ears at a speaking person, or by alternating between persons in a group currently involved in a discussion. Moreover, the conferee will in most cases arrive at the meeting with an expectation of particular participants being more frequent or more interesting speakers than others, or will develop an appreciation for this in the course of the meeting. Unconsciously guided by the recent history of the conference, the conferee will direct attention to them more often.
Bridging the gap between the extreme simplicity in selecting the focus at a natural meeting and the restricted field of view in a video conference is arguably the greatest challenge facing the constructor of a video conference system. Most likely, convenience for participants and the training time needed to get started are also crucial factors for the commercial success of video conferences.
Inherently aimed at overcoming the geographical separation of its participants, a video conference system is faced with bandwidth limitations in practically all its uses. This is an additional incentive to single out interesting visual and aural information with due care, allowing this information to be transmitted at an acceptable quality level.
Several prior art video conference systems exist and are described in patent documents.
U.S. Pat. Nos. 5,638,114, 5,801,756 and 6,025,870 represent early attempts towards solving the technical problem of selecting an image in accordance with the activity of the participants in the video conference. U.S. Pat. No. 5,638,114 discloses a television conference system which monitors audio activity at different physical locations and distributes the corresponding video signals, possibly combined into one image, from active locations to all locations. Likewise, U.S. Pat. No. 5,801,756 discloses a video conference system including a central unit which mixes selected input video signals into one output video signal, which is distributed to local terminals. The selection is based on the voice activity measured in the near past at the local terminals. The video conference system can be configured to prefer one of the terminals to others—for instance, a terminal used by a lecturer should take precedence over those of the students—but apart from this does not offer any possibility of adapting the properties of the system to different conference situations. Finally in U.S. Pat. No. 6,025,870, there is described an automatic video switch for use in a video conference system. The switch selects one focus video source, on the basis of event information provided by an audio processing module and/or a graphics processing module, and transmits its image signal to other sites.
The fundamental weakness of these three conference systems is their simple approach to selecting input video signals. It is a one-layer approach in the sense that the selection is made on the basis of voice activity either momentarily or in a time interval ranging from the present instant to a point located some non-zero distance back in time. A constructor who considers shortening the interval has to weigh more attentive switching behaviour (new speakers are let in with little delay) against an increasingly flickering image; lengthening the interval has the opposite effect. The resulting compromise solution will not always be acceptable to the potential users, and is rarely one that suits all imaginable conference situations. Indeed, the described systems do not develop a long-term understanding of the conference and its participants, but automatically take predetermined actions in response to predetermined recognised events.
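The trade-off between responsiveness and flicker in a one-layer selection can be illustrated with a minimal sketch. It is not taken from any of the cited patents: the function names `select_by_window` and `count_switches`, the per-frame activity matrix and the rule of summing activity over the interval are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_by_window(activity: Sequence[Sequence[float]], window: int) -> List[int]:
    """For each frame, pick the participant whose voice activity, summed over
    the last `window` frames (current frame included), is largest.

    `activity[t][i]` is a hypothetical activity level of participant i at frame t.
    """
    selected = []
    for t in range(len(activity)):
        start = max(0, t - window + 1)
        n = len(activity[t])
        totals = [sum(activity[k][i] for k in range(start, t + 1)) for i in range(n)]
        # max() returns the first maximal element, so ties go to the lowest index.
        selected.append(max(range(n), key=lambda i: totals[i]))
    return selected


def count_switches(selected: Sequence[int]) -> int:
    """Number of frames at which the selected participant changes."""
    return sum(1 for a, b in zip(selected, selected[1:]) if a != b)
```

With a main speaker who is briefly interrupted twice, a window of one frame follows every interjection and switches four times, whereas a window of three frames never leaves the main speaker and so ignores the interjections altogether. Neither setting suits every conference situation, which is the compromise described above.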
U.S. Pat. No. 6,812,956 proposes an approach where the task of selecting signals is formulated in the form of a standard optimisation problem. A finite number of possible output configurations (candidate solutions) are predefined, wherein each output configuration specifies a routing of output signals to output devices. The output configurations are compared and selected on a substantially continuous basis, by evaluating a desirability (target function), which maps each output configuration to a real number based on “activity”, “saturation”, “continuity”, “participant priorities” and “security levels”. The relative importance of the factors is determined by weights, which are adjustable parameters. At regular intervals, the system assesses all possible output configurations by calculating their desirabilities, concludes whether the current output configuration is the most desirable, and switches to a different configuration if this is not the case. In contrast with the one-layer switching methods mentioned above, the method disclosed in this patent takes account of two points in time: the present instant, at which momentary voice activity is measured, and a point at a predetermined distance “Δt” back in time, at which “audio undercoverage” and “audio overcoverage” are determined. These quantities, which form part of the “activity” contribution to the desirability, are expected to capture a discrepancy between the signals that were selected and the signals at which audio activity effectively took place.
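The periodic reassessment described above can be sketched as follows. The sketch assumes that the desirability is a plain weighted sum of factor values; the cited patent may combine the factors differently. The names `desirability` and `reassess`, and the representation of factors as callables, are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable

# A factor maps an output configuration to a real number (assumed interface).
Factor = Callable[[Hashable], float]


def desirability(config: Hashable,
                 factors: Dict[str, Factor],
                 weights: Dict[str, float]) -> float:
    """Weighted sum of factor values for one configuration (assumed form)."""
    return sum(weights[name] * factor(config) for name, factor in factors.items())


def reassess(current: Hashable,
             candidates: Iterable[Hashable],
             factors: Dict[str, Factor],
             weights: Dict[str, float]) -> Hashable:
    """Evaluate every candidate and switch only if one is strictly more
    desirable than the current configuration."""
    best = max(candidates, key=lambda c: desirability(c, factors, weights))
    if desirability(best, factors, weights) > desirability(current, factors, weights):
        return best
    return current
```

In such a scheme, raising the weight of a "continuity" factor relative to an "activity" factor makes the system more reluctant to switch. Every candidate is evaluated at every reassessment, so the cost per update grows with the number of predefined configurations.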
The approach suggested in U.S. Pat. No. 6,812,956 has three main drawbacks. Firstly, setting the system up for the first time will include acquiring an understanding of the meaning of the weights which are included in the desirability function. For the tuning of the parameters, a few test rounds with a realistic number of participants will be needed; the test rounds cannot be too short, since the conference system apparently reviews the past continually. Secondly, there is a need for reconfiguring or at least resetting the system as soon as a participant arrives or leaves, since this will add or remove a number of possible output configurations, for which the desirability is henceforth evaluated in the update procedure. Thirdly and most seriously, since the solution is an implementation of an extremely general approach, it is not adapted to all aspects of video conferencing. Most notably, computational complexity limits its scalability. To illustrate, in a scenario where combined pictures are allowed which include up to four participants selected from a total of p participants, the number of possible output configurations is
\[ \binom{p}{1} + \binom{p}{2} + \binom{p}{3} + \binom{p}{4} = O(p^{4}). \]
Evaluating the desirability of each configuration will impose a huge computational burden on the system for typical values of p. The intervals between reassessments of the output configuration could certainly be increased in order to reduce the impact of this problem, but doing so would inevitably increase the response time of the system.
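To give a feeling for the growth, the count of output configurations for combined pictures of up to four participants can be computed directly. The function name `count_output_configurations` and its `max_group` parameter are illustrative assumptions; the sketch only evaluates the sum of binomial coefficients stated above.

```python
from math import comb  # available in Python 3.8 and later


def count_output_configurations(p: int, max_group: int = 4) -> int:
    """Number of ways to choose between 1 and `max_group` participants
    out of `p` for a combined picture."""
    return sum(comb(p, k) for k in range(1, max_group + 1))


for p in (4, 10, 20):
    print(p, count_output_configurations(p))
# prints:
# 4 15
# 10 385
# 20 6195
```

Each of these configurations would have to be scored at every reassessment, which is why the quartic growth in p limits scalability.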
To summarise our discussion, there is a long felt need in the field of video conferencing for a satisfactory solution to the problem of selecting input signals. The requirements on a successful video conference system which solves this problem include:
the selecting algorithm faithfully reproduces typical human behaviour and produces an accurate yet engaging representation of the physically separated participants;
the response time of the system is short, so that new speakers are quickly selected and thus shifted into focus;
the system takes into account the past behaviour of the conferees, to reflect a reasonable expectation on their future activity;
the image does not flicker or make discomforting transitions;
the system has feasible scaling characteristics with respect to the number of participants and is bandwidth economical;
the system is easy to set up and can be used by layman participants after a moderate training time;
the number of participants can be easily increased or decreased while the system is running;
the system can be adapted to the needs of the respective users; and
the dynamic properties of the selecting can be easily adapted to a multitude of conference scenarios, ranging from business meetings and court proceedings, over education and professional training, to private use and entertainment applications.