Various techniques can be used to obtain an image of a scene. The image may be intensity information in one or more spectral bands, range information, or a combination of the two. From such an image, it is useful to compute the full 3D model of the scene. One need for this is in robotic applications where the full 3D scene model is required for path planning, grasping, and other manipulation. In such applications, it is also useful to know which parts of the scene correspond to separate objects that can be moved independently of other objects. Other applications have similar requirements for obtaining a full 3D scene model that includes segmentation into separate parts.
Computing the full 3D scene model from an image of a scene, including segmentation into parts, is referred to here as “constructing a 3D scene model” or alternatively “parsing a scene”. There are many difficult problems in doing this. Two of these are: (1) identifying which parts of the image correspond to separate objects; and (2) identifying or maintaining the identity of objects that are partially or fully occluded.
Previously, there has been no entirely satisfactory method for reliably constructing a 3D scene model, in spite of considerable research. Several technical papers provide surveys of a vast body of prior work in the area. One is such survey is Paul J. Besl and Ramesh C. Jain, “Three-dimensional object recognition”, Computing Surveys, 17(1), pp 75-145, 1985. Another is Roland T. Chin and Charles R. Dyer, “Model-based recognition in robot vision”, ACM Computing Surveys, 18(1), pp 67-108, 1986. Another is Farshid Arman and J. K. Aggarwal, “Model-based object recognition in dense-range images—a review”, ACM Computing Surveys, 25(1), pp 5-43, 1993. Another is Richard J. Campbell and Patrick J. Flynn, “A survey of free-form object representation and recognition techniques”, Computer Vision and Image Understanding, 81(2), pp 166-210, 2001.
None of the prior work solves the problem of constructing a 3D scene model reliably, particularly when the scene is cluttered and there is significant occlusion. Hence, there is a need for a system and method able to do this.