Video targeting and tracking systems that respond to user commands to change, or acquire, a target are a rapidly growing field. The speed of computers and, consequently, the speed of image processing and speech processing are such that very convenient mechanisms for aiming and re-aiming cameras can be provided. In video-conferencing systems, for example, a user can point to an object of interest to position a zoom-capable camera on a pan-tilt (PT) base. Such automated systems are more intuitive and easier to control than conventional systems that require more explicit commands such as voice-command (“command-control,” basically a speech-based symbol processor where each verbal command corresponds to an instruction, for example “PAN-LEFT,” “UP,” “DOWN,” etc.), joystick control, and continuous target tracking. Continuous tracking systems typically track moving objects using a camera with an image detector to capture images of an object. These captured images are then processed to find and track the object. When the object being tracked moves far away from the center of the camera's field of view, the camera's aim is adjusted to continue the tracking process.
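The re-aiming rule used by continuous tracking systems, as described above, can be illustrated with a minimal sketch. The function name `aim_correction`, the normalized offset units, and the `dead_zone` threshold are assumptions made for illustration only; they are not taken from any of the referenced systems.

```python
def aim_correction(target_xy, frame_size, dead_zone=0.25):
    """Return a (pan, tilt) offset as a fraction of the frame size.

    Hypothetical helper: the offset is (0.0, 0.0) while the target stays
    within `dead_zone` of the frame center, so the camera is re-aimed only
    when the target has drifted far from the center of the field of view.
    """
    width, height = frame_size
    # Offset of the target from the frame center, normalized by frame size.
    dx = (target_xy[0] - width / 2) / width
    dy = (target_xy[1] - height / 2) / height
    if abs(dx) <= dead_zone and abs(dy) <= dead_zone:
        return (0.0, 0.0)  # target still near the center: leave the aim alone
    return (dx, dy)        # target drifted: move the aim toward the target
```

For a 640 by 480 frame, a target at the center yields no correction, while a target near the right edge yields a positive pan offset.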
One system that employs such “smart” technology to allow the control of a camera is described in U.S. patent application Ser. No. 08/996,677, filed Dec. 23, 1997, entitled System and Method for Permitting Three-Dimensional Navigation Through a Virtual Reality Environment Using Camera-Based Gesture Inputs, the entirety of which is incorporated herein by reference. This patent application discusses art in which a camera distinguishes the profiles of human subjects from the background using image-processing techniques. These techniques use metrics relating to the target to distinguish the subject from the background. The subjects can then be followed by a pan/tilt/zoom (PTZ) camera. Such a system can repeatedly position, zoom, and focus on a target so that the target remains relatively centered on the screen.
Another technique, described in U.S. Pat. No. 5,187,574, is referred to as virtual or electronic zoom. Video information from one or more fixed cameras is processed electronically such that the target of interest remains visible in a desired configuration in the output video signal despite the fact that the target may not be centered in the field of view of any particular camera. Through extraction and interpolation operations, the tracking process can be accomplished through fixed cameras, which are generally less expensive than PTZ cameras.
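The extraction and interpolation operations mentioned above can be sketched roughly as follows. This is an illustrative assumption, not the method of U.S. Pat. No. 5,187,574: the frame is modeled as a list of pixel rows, the hypothetical helper `crop_window` extracts a region around the target, and nearest-neighbor repetition stands in for whatever interpolation a real system would use.

```python
def crop_window(target_xy, frame_size, crop_size):
    """Return (x0, y0, w, h) of a crop centered on the target, clamped to the frame."""
    frame_w, frame_h = frame_size
    w, h = min(crop_size[0], frame_w), min(crop_size[1], frame_h)
    # Center the window on the target, then clamp it inside the frame.
    x0 = min(max(int(target_xy[0] - w / 2), 0), frame_w - w)
    y0 = min(max(int(target_xy[1] - h / 2), 0), frame_h - h)
    return x0, y0, w, h


def electronic_zoom(frame, target_xy, crop_size, scale):
    """Extract the region around the target and enlarge it by an integer factor."""
    frame_h, frame_w = len(frame), len(frame[0])
    x0, y0, w, h = crop_window(target_xy, (frame_w, frame_h), crop_size)
    # Extraction: keep only the rows and columns inside the window.
    crop = [row[x0:x0 + w] for row in frame[y0:y0 + h]]
    # Interpolation (nearest neighbor): repeat each pixel and each row `scale` times.
    return [[px for px in row for _ in range(scale)]
            for row in crop for _ in range(scale)]
```

Because the window is clamped to the frame, a target near an edge still produces a full-size output, although the target is then no longer centered in it.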
Although capable of tracking a target, these systems lack the ability or the flexibility to locate the target to be acquired and tracked. These systems must either rely on an operator to first select the object, or the object must exhibit a property that the system is preconfigured to detect.
Another improvement is described in detail in the article “‘Finger-Pointer’: Pointing interface by Image Processing” by Masaaki Fukumoto, Yasuhito Suenaga and Kenji Mase. In this article, the authors describe a system that directs a camera to focus on a target by having an operator located within the field of view of the system point to that target. The system scans and processes the image of the operator's finger and directs a camera to be aimed in that general direction. The article also describes a system using a combination of pointing gestures and voice commands. Through the use of simple voice or gesture commands, the operator can direct the camera to perform simple functions, such as zoom in or out, or clear the screen.
One obvious problem associated with this system results from the misdirection of the camera to an object or target that the operator did not intend to target. Sources of this problem include operator error (i.e., the operator did not precisely point to the desired direction), system error (i.e., the system did not correctly interpret the operator's gesture), and inherent ambiguity (i.e., the information in the gesture is insufficient to define a target coordinate unambiguously). For example, the likelihood that the camera will focus on the wrong target will increase if multiple objects are found along the trajectory of the pointed direction or if there are multiple objects within close proximity to the targeted object. Manually redirecting the camera can be time-consuming, nullifying any benefit from having such an automated system. Further, operating an advanced video system, whether by physically re-aiming the camera or through voice commands, is a wasteful distraction.
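The inherent ambiguity described above can be made concrete with a small sketch. It assumes a simplified two-dimensional scene, a hypothetical function `candidates_along_ray`, and an arbitrary angular tolerance `max_angle_deg`; none of these come from the cited article. When more than one object falls within the tolerance of the pointed direction, the gesture alone does not identify a single target.

```python
import math


def candidates_along_ray(origin, direction, objects, max_angle_deg=10.0):
    """Return names of objects lying within max_angle_deg of the pointed direction.

    `objects` maps a name to an (x, y) position. All names and the tolerance
    are illustrative assumptions for a 2-D scene.
    """
    dir_len = math.hypot(direction[0], direction[1])
    hits = []
    for name, pos in objects.items():
        vx, vy = pos[0] - origin[0], pos[1] - origin[1]
        dist = math.hypot(vx, vy)
        if dist == 0 or dir_len == 0:
            continue  # no defined bearing for this object or direction
        # Angle between the pointed direction and the bearing to the object.
        cos_a = (vx * direction[0] + vy * direction[1]) / (dist * dir_len)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
        if angle <= max_angle_deg:
            hits.append(name)
    return hits
```

In a scene with two objects nearly in line with the operator's finger, the function returns both names, which is the situation in which a pointing-only system may aim at the wrong target.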