(1) Field of Invention
The present invention relates to a visual perception system for determining position and pose of a three-dimensional object and, more particularly, to a visual perception system for determining a position and pose of a three-dimensional object through matching with flexible object templates.
(2) Description of Related Art
Robotic sensing is a branch of robotics science intended to give robots sensing capabilities in order to perform specific actions. Robotic sensing primarily gives robots the ability to see, touch, hear, and move using processes that require environmental feedback. Visual perception for allowing a robot to grasp and manipulate a desired object requires scene segmentation, object identification, and localization and tracking of action points.
Appearance-based methods segment scenes based on similar texture and/or color (see List of Cited Literature References, Literature Reference Nos. 10, 14, 22, and 35). These approaches are quick and can work with a single camera, since they do not require depth information. However, they require texture-free backgrounds. Shape-based methods are generally indifferent to visual textures (see Literature Reference Nos. 32 and 36). These systems use mesh grids that are generated from a three-dimensional (3D) data source. This generation requires considerable processing time and suffers from object-class ambiguity of neighboring points.
Additionally, appearance-based methods have been used for object identification (see Literature Reference Nos. 2, 6, 19, 21, 31, and 39). These approaches can operate with only a single camera, but can be thrown off by large changes in lighting or 3D pose. Shape-based methods have also been used for object identification (see Literature Reference Nos. 15 and 25). These approaches are indifferent to visual textures, but can be thrown off by objects that are similar in shape but different in appearance (e.g., a knife versus a screwdriver).
Further, graph matching methods can recognize object parts (see Literature Reference Nos. 9, 18, and 33). These methods scale to multi-part and articulated objects, but typically rely on appearance features only and are computationally expensive. Rigid- (and piece-wise rigid-) body transforms (see Literature Reference Nos. 8, 20, and 38), which are commonly used for well-modeled objects, provide precise pose estimates, but cannot handle deformable or previously unseen objects of a known class. Moreover, through search or learning, grasp points have been computed directly from image features (see Literature Reference No. 30). This approach can handle novel objects, but is sensitive to scene lighting and irrelevant background textures. Deformation mapping (see Literature Reference Nos. 34 and 40) can handle deformable objects, but does not scale well to a large number of feature correspondences and cannot handle articulated objects. Kemp and Edsinger find tool tips by detecting fast-moving edges when the robot moves the tool (see Literature Reference No. 16). However, this approach is restricted to finding tips of objects.
Each of the aforementioned methods exhibits limitations that make it incomplete. Thus, there is a continuing need for a robotic visual perception system that provides increased levels of autonomy, robustness to uncertainty, adaptability, and versatility.