In most high-end video conferencing systems, high-quality cameras with pan, tilt, and zoom capabilities are used to frame a view of the meeting room and the participants in the conference. The cameras typically have a wide field of view (FOV) and a high mechanical zoom capability. This allows for both a good overview of a meeting room and the possibility of capturing close-up images of participants. The video stream from the camera is compressed and sent to one or more receiving sites in the video conference. All sites in the conference receive live video and audio from the other sites in the conference, thus enabling real-time communication with both visual and acoustic information.
Video conferences vary a great deal in purpose, number of participants, layout of conference rooms, etc. Each meeting configuration typically requires an individual adjustment of the camera in order to present an optimal view. Adjustments to the camera may be required both before and during the video conference. For example, in a video conference room seating up to 16 persons, it is natural that the video camera is preset to frame all 16 available seat locations. However, if only 2 or 3 participants are present, the wide field of view setting leaves each participant occupying only a small part of the image, giving the receiving end a very poor visual representation.
Adjustments to the camera are typically made via a remote control, either by manually controlling the camera pan, tilt and zoom, or by choosing from a set of pre-defined camera positions. These pre-defined positions are manually programmed. Often, before or during a video conference, the users do not want to be preoccupied with the manual control of the camera, and the less experienced user may not even be aware that the camera's field of view can be changed, or how to change it. Hence, the camera is often left sub-optimally adjusted in a video conference, resulting in a degraded video experience.
Some video conferencing systems with camera tracking capabilities exist. However, the purpose of these systems is to automatically focus the camera on an active speaker. These systems are typically based on speaker localization by audio signal processing with a microphone array, either alone or in combination with image processing.
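As an illustration of the kind of audio-based speaker localization referred to above, the following is a minimal sketch only and is not taken from any particular system. It assumes the simplest possible array, two microphones a known distance apart, a far-field sound source, and a time delay between the two microphone signals that has already been measured. The function name, parameter names and the speed-of-sound constant are assumptions introduced for this example.

```python
import math

# Assumed value: approximate speed of sound in air at room temperature.
SPEED_OF_SOUND_M_S = 343.0


def arrival_angle_deg(delay_s, mic_spacing_m):
    """Estimate the direction of a far-field sound source.

    delay_s       -- measured time difference of arrival between the
                     two microphones, in seconds (hypothetical input)
    mic_spacing_m -- distance between the two microphones, in meters

    Returns the angle in degrees from the array broadside
    (0 means the source is straight ahead of the microphone pair).
    """
    # Extra path length travelled to the farther microphone,
    # expressed as a fraction of the microphone spacing.
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1],
    # where asin is undefined, so clamp it.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

An angle obtained in this way could in principle be used to drive the camera pan toward the active speaker. Note that it says nothing about how many participants are present or where the silent ones are seated.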
Some digital video cameras (for instance webcams) use video analysis to detect, center on and follow the face of one person within a limited range of digital pan, tilt and zoom. However, these systems are only suitable for one person, require that the camera is initially correctly positioned, and have a very limited digital working range. Hence, none of the conventional systems mentioned above provides automated configuration of the camera in a video conference setting.
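The limited digital working range of the single-face approach described above can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the method of any specific camera: the face is assumed to have been detected already and is given as a pixel rectangle, and digital pan, tilt and zoom are modelled as choosing a crop window inside the fixed sensor frame. The function name and parameters are hypothetical.

```python
def crop_around_face(frame_w, frame_h, face, zoom):
    """Compute a digital pan/tilt/zoom crop window centered on one face.

    frame_w, frame_h -- size of the full sensor frame in pixels
    face             -- (x, y, w, h) rectangle of the detected face,
                        assumed to come from some face detector
    zoom             -- digital zoom factor, 1.0 or greater

    Returns (left, top, crop_w, crop_h) of the crop window.
    """
    x, y, w, h = face
    # Digital zoom shrinks the window that is cut out of the frame.
    crop_w = int(frame_w / zoom)
    crop_h = int(frame_h / zoom)
    # Center of the detected face.
    cx = x + w // 2
    cy = y + h // 2
    # Center the window on the face, then clamp it so that it never
    # leaves the sensor frame. This clamp is the limited working range:
    # a face near the frame edge cannot be centered.
    left = min(max(cx - crop_w // 2, 0), frame_w - crop_w)
    top = min(max(cy - crop_h // 2, 0), frame_h - crop_h)
    return left, top, crop_w, crop_h
```

For a 1920 by 1080 frame with a face at the center and a zoom factor of 2, the window is the central 960 by 540 region. For a face in the top-left corner the window is pinned to the corner of the frame and the face stays off-center, which is why such a camera must be correctly positioned to begin with.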