The increasing availability of inexpensive video cameras and high-quality projection displays is providing opportunities for developing novel interfaces that use computer vision. These interfaces enable interactive applications that impose little constraint on a user and the environment. For example, the user can interact with objects in a scene without the need for a physical coupling between the user, the objects, and a computer system, as in more conventional mouse-based or touch-based computer interfaces.
However, computer vision systems, with rare exceptions, are difficult to implement for applications where the visual appearance of objects and the scene change rapidly due to lighting fluctuations. Under dynamic lighting, traditional segmentation techniques generally fail.
The difficulty of implementation increases for interactive applications that use front-projected or rear-projected displays because the projector will illuminate foreground objects as well as the background. This makes color tracking and other appearance-based methods difficult, if not impossible, to use.
By utilizing calibrated stereo cameras, it is possible to take advantage of 3-dimensional geometric constraints in the background to segment the scene using stereo analysis. Indeed, if the geometry of the background is known, then it becomes possible to determine a depth at every pixel in pairs of images, and compare these depths to the depths in images of a scene with static geometry, i.e., a scene without moving foreground objects. However, this process involves computing a dense depth map for each pair of images acquired by the stereo camera. This is computationally expensive, and therefore unsuitable for applications that demand real-time performance.
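The depth comparison described above can be illustrated with a minimal sketch. It assumes that a dense depth map has already been computed for the current stereo pair, which is the costly step noted above. The function name `segment_by_depth`, the `tolerance` parameter, and the sample values are hypothetical and chosen only for illustration.

```python
def segment_by_depth(depth, background_depth, tolerance=0.05):
    """Mark a pixel as foreground (1) when its depth differs from the
    known background depth by more than `tolerance` (same units as depth).

    Both inputs are lists of rows of equal size. Computing `depth` for
    every stereo pair is the expensive part and is not shown here.
    """
    mask = []
    for row, bg_row in zip(depth, background_depth):
        mask.append([1 if abs(d - b) > tolerance else 0
                     for d, b in zip(row, bg_row)])
    return mask


# Hypothetical 2x3 scene: a flat background 2.0 units from the camera.
background = [[2.0, 2.0, 2.0],
              [2.0, 2.0, 2.0]]
# An object in the middle column is closer to the camera; the small
# deviation in the top-right pixel stands in for measurement noise.
frame = [[2.0, 1.2, 2.01],
         [2.0, 1.1, 2.0]]
mask = segment_by_depth(frame, background)
```

The comparison itself is trivial. The cost lies entirely in producing `frame` at every time step.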
Many prior art computer vision systems used for object recognition and motion analysis begin with some form of segmentation, see, for example, Friedman et al. “Image segmentation in video sequences: A probabilistic approach,” Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997; Stauffer et al. “Adaptive background mixture models for real-time tracking,” Proc. of CVPR-99, pages 246-252, 1999; and Wren et al. “Pfinder: Real-time tracking of the human body,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7):780-785, 1997.
Typically, a real, tangible, physical background surface is measured over an extended period of time, and a 3D model is constructed using statistical properties of the measurements. The model is then used to determine which pixels in an input image are not part of the background, and therefore must be foreground pixels. Obviously, the background in the scene must remain relatively static for the segmentation to work, or at most, vary slowly with respect to geometry, reflectance, and illumination. For many practical applications that require natural interactions and natural user environments, these constraints are too restrictive.
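As a rough illustration of this kind of statistical background model, the following sketch keeps a running mean and variance of the intensity at each pixel and flags pixels that deviate strongly from them. It is a simplified single-Gaussian stand-in, not the specific method of any of the cited works. The class name `PixelBackgroundModel` and the parameters `k` and `min_std` are assumptions made for this example.

```python
import math


class PixelBackgroundModel:
    """Per-pixel mean and variance of intensity, accumulated over time."""

    def __init__(self, width, height):
        self.n = 0
        self.mean = [[0.0] * width for _ in range(height)]
        self.m2 = [[0.0] * width for _ in range(height)]  # sum of squared deviations

    def observe(self, image):
        """Fold one background-only image into the statistics (Welford update)."""
        self.n += 1
        for y, row in enumerate(image):
            for x, v in enumerate(row):
                delta = v - self.mean[y][x]
                self.mean[y][x] += delta / self.n
                self.m2[y][x] += delta * (v - self.mean[y][x])

    def foreground_mask(self, image, k=3.0, min_std=2.0):
        """Flag pixels more than k standard deviations from the learned mean."""
        mask = []
        for y, row in enumerate(image):
            out = []
            for x, v in enumerate(row):
                var = self.m2[y][x] / max(self.n - 1, 1)
                std = max(math.sqrt(var), min_std)  # floor avoids a zero threshold
                out.append(1 if abs(v - self.mean[y][x]) > k * std else 0)
            mask.append(out)
        return mask


model = PixelBackgroundModel(width=2, height=2)
for background_frame in ([[100, 101], [99, 100]],
                         [[101, 100], [100, 99]],
                         [[99, 99], [101, 101]]):
    model.observe(background_frame)

# A bright object covers the top-right pixel only.
object_mask = model.foreground_mask([[100, 180], [101, 100]])
# The same empty scene after an abrupt illumination change.
relit_mask = model.foreground_mask([[160, 160], [160, 160]])
```

In this toy example `object_mask` flags only the pixel covered by the object, whereas `relit_mask` flags every pixel even though no object is present. This is the failure under changing illumination described above.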
Reliable segmentation for outdoor environments with a static geometry can be performed by using an explicit illumination model, see Oliver et al. “A Bayesian computer vision system for modeling human interactions,” Proceedings of ICVS99, 1999. There, the model is an eigenspace of images that describes a range of appearances in the scene under a variety of illumination conditions. Any different and unknown illumination dramatically degrades performance of the system, should it work at all. None of the above techniques accommodate rapidly changing lighting conditions, such as one would get when illuminating background and foreground objects with a dynamic, high-contrast projection display device.
Another class of prior art techniques takes advantage of the geometry in the scene. For example, Gaspar et al., in “Ground plane obstacle detection with a stereo vision system,” International Workshop on Intelligent Robotic Systems, 1994, describe constraints of a ground plane in order to detect obstacles in the path of a mobile robot.
Other methods employ special purpose multi-baseline stereo hardware to compute dense depth maps in real-time, see Okutomi et al. “A multiple-baseline stereo,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(4):353-363, 1993. Given background disparity values, their method performs real-time depth segmentation, or “z-keying,” as long as the background does not vary, see Kanade “A stereo machine for video-rate dense depth mapping and its new applications,” In Proc. of Image Understanding Workshop, pages 805-811, 1995. However, the burden of computing dense, robust, real-time stereo maps is great.
Ivanov et al., in “Fast lighting independent background subtraction,” International Journal of Computer Vision, 37(2):199-207, 2000, describe a segmentation method that first illuminates a physical background surface using a laser pointer. The location of the laser spot in stereo images is used to construct a sparse disparity map of the geometrically static, physical background surface. They use Delaunay triangulation to estimate neighborhood relationships anywhere in the 3D mesh. The disparity map is used to segment a foreground object from the background in real-time. As an advantage, a dense depth map is never explicitly computed. Instead, the pre-computed disparity map is used to rectify input images prior to direct image subtraction.
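The idea of subtracting images through a pre-computed disparity map, without ever computing depth for the input images, can be sketched as follows. This is a simplified illustration and not the implementation of Ivanov et al. It assumes grayscale images, purely horizontal integer disparities, and a background disparity map that has already been interpolated to every pixel. The function name `segment_by_disparity` and the `threshold` parameter are hypothetical.

```python
def segment_by_disparity(left, right, disparity, threshold=20):
    """Compare each left-image pixel with the right-image pixel that the
    background disparity map says should show the same background point.

    A pixel is marked foreground (1) when the two intensities disagree by
    more than `threshold`. Pixels whose match falls outside the right
    image are left as background (0).
    """
    height, width = len(left), len(left[0])
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            xr = x - disparity[y][x]  # matching column in the right image
            if 0 <= xr < width and abs(left[y][x] - right[y][xr]) > threshold:
                mask[y][x] = 1
    return mask


# Hypothetical one-row scene. The background has a disparity of 1 pixel.
disparity = [[1, 1, 1, 1, 1, 1]]
# A dark object (intensity 0) with a disparity of 3 appears at column 4
# of the left image and therefore at column 1 of the right image.
left = [[10, 50, 90, 130, 0, 210]]
right = [[50, 0, 130, 170, 210, 250]]
mask = segment_by_disparity(left, right, disparity)

# Doubling the illumination changes both views in the same way, so the
# background pixels still match each other.
bright_left = [[2 * v for v in row] for row in left]
bright_right = [[2 * v for v in row] for row in right]
bright_mask = segment_by_disparity(bright_left, bright_right, disparity)
```

In this example both masks flag column 4, where the object appears in the left image, and column 2, where the object hides the background in the right image. Because the two views are compared with each other rather than with a stored appearance, the change in brightness does not alter the result.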
As a disadvantage, their method requires a time-consuming measurement step with the laser pointer while stereo images are collected. This requires specialized equipment, and is error-prone. Because the disparity map is modeled in the form of flat triangles, the method requires a high degree of human intervention when the surface is highly curved or otherwise irregular. In this case, a sparse set of calibration points is insufficient because interpolation is ineffective in many areas.
In addition, their system requires a background surface that reflects laser light. This means that their method cannot be used to define virtual surfaces. Hereinafter, the term virtual surface means a surface that is geometrically defined in the real world and that is either tangible, i.e., a surface of a physical object, or some imaginary plane in space, not necessarily tangible, or only partially tangible.
This means their method cannot work for detecting objects in thin air, for example, a person entering through the virtual plane of an open doorway, or a ball falling through the virtual plane defined by a hoop. Nor can their system deal with objects appearing from behind the background surface.
Moreover, their laser scanning is only practical for indoor scenes, and quite unsuitable for large scale outdoor scenes where it is desired to define, geometrically, depth planes that do not in fact exist as tangible objects. Therefore, there is still a need for a robust depth segmentation technique that can operate in real-time on tangible and virtual surfaces in the physical world, at arbitrary scales.