Manual remote control systems for controlling the pan, zoom and other functions of a camera are available to control the image obtained by the camera. Such systems can allow video conference participants at a first location to control the image that is provided to participants at a second video conference location. Alternatively or in addition, manual systems may allow video conference participants at the second location to control functions of the camera at the first video conference location. In either case, such systems require input from a user in order to provide appropriate image content. Furthermore, such systems can be difficult to operate from a remote location.
In video conference systems, it is desirable to provide the image of the current speaker to other video conference locations. Where there are a number of conference participants at one location, systems have been developed that use audio information in order to determine the current speaker and to point an imaging camera at that speaker. In particular, systems are available that rely on multiple microphones to determine the location of the current speaker through triangulation.
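As an illustration of microphone-based triangulation in general, and not of any particular system referred to above, the following is a minimal sketch. It assumes a far-field sound source, a hypothetical speed-of-sound constant, and two microphone pairs whose bearings have already been expressed as angles from the x-axis of a shared coordinate system. The function names `bearing_from_delay` and `intersect_bearings` are illustrative only.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at about 20 degrees C


def bearing_from_delay(delay_s, mic_spacing_m, speed=SPEED_OF_SOUND_M_S):
    """Far-field bearing, in degrees from the pair's broadside direction,
    estimated from the arrival-time difference between two microphones."""
    ratio = speed * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


def intersect_bearings(p1, angle1_deg, p2, angle2_deg):
    """Intersect two bearing lines, each given by an origin (x, y) and an
    angle in degrees from the x-axis. Returns (x, y), or None if the
    lines are (nearly) parallel and no position fix is possible."""
    d1 = (math.cos(math.radians(angle1_deg)), math.sin(math.radians(angle1_deg)))
    d2 = (math.cos(math.radians(angle2_deg)), math.sin(math.radians(angle2_deg)))
    denom = d1[0] * d2[1] - d1[1] * d2[0]  # 2-D cross product of directions
    if abs(denom) < 1e-12:
        return None
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    t = (rx * d2[1] - ry * d2[0]) / denom  # distance along the first line
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])
```

Under these assumptions, each microphone pair contributes one bearing, and the estimated source position is the point where the two bearing lines cross. This is also why at least two spatially separated pairs are needed for a position fix.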
Although systems that use audio information for controlling a camera at a video conference location can adjust the image that is provided to a remote video conference location depending on sounds at the source video conference location, the operation of such systems has not been entirely satisfactory. For example, in noisy environments, spurious noises can cause the camera control system to thrash, repeatedly redirecting the camera between sound sources. In addition, such systems require the use of multiple microphones in order to enable triangulation to determine the source of the sounds. Also, where there are multiple speakers, such systems are typically unable to choose the most important speaker from among the multiple signals. In order to address this so-called “cocktail party problem,” techniques for electronically processing signals to separate desired signals from noise sources have been developed, but have had limited success. Audio-based systems are also unable to determine the location of a speaker using sign language. Furthermore, such systems are unable to determine whether something other than a speaker, such as a whiteboard or exhibit, should be imaged.
In connection with obtaining video imagery in surveillance-type applications, it is desirable to obtain imagery from areas within a scene where significant events are occurring. However, such areas generally cannot be determined in advance. Adequate coverage of each area within a scene under surveillance may therefore require a large number of cameras, an approach that is often impractical because of the expense and complication of deploying them. As a result, cameras having a wide field of view are often used, which often results in low-resolution, poor-quality images of those areas within the field of view that are determined to be of interest. Accordingly, it would be desirable to provide a system for controlling the areas of a scene imaged by a surveillance camera.