Robotic systems, such as a robotic manipulator containing a gripping component, may be used for applications involving picking up or moving objects. For instance, a robotic device may be used to fill a container with objects, create a stack of objects, or unload objects from a truck bed. In some cases, all of the objects may be of the same type. In other cases, a container or truck may contain a mix of different types of objects, such as boxed items, cans, tires, or other stackable objects. Such robotic systems may direct a robotic manipulator to pick up objects based on predetermined knowledge of where objects are in the environment.
In some examples, a robotic system may use computer vision techniques to determine a representation of three-dimensional (3D) scene geometry. By way of example, a robotic system may triangulate information observed from a scene to determine a depth to one or more surfaces in a scene. One approach to depth sensing is the use of stereo image processing. According to this approach, two optical sensors with a known physical relationship to one another are used to capture two images of a scene. By finding mappings of corresponding pixel values within the two images and calculating how far apart these common areas reside in pixel space, a computing device can determine a depth map or image using triangulation. The depth map or depth image may contain information relating to the distances of surfaces of objects in the scene.
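By way of illustration only, for a rectified stereo pair the triangulation step described above may reduce to the relation depth = focal length × baseline / disparity. The following Python sketch assumes an idealized pinhole camera model. The function names and the parameters `focal_length_px` and `baseline_m` are hypothetical and are introduced here only for the example.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in meters of a point observed by a rectified stereo pair.

    disparity_px    -- horizontal offset, in pixels, between the two
                       image locations of the same scene point
    focal_length_px -- focal length in pixels (assumed equal for both sensors)
    baseline_m      -- distance in meters between the two optical centers
    """
    if disparity_px < 0:
        raise ValueError("disparity must be non-negative")
    if disparity_px == 0:
        # Zero disparity corresponds to a point at infinite distance.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


def depth_map(disparities, focal_length_px, baseline_m):
    """Convert a 2D list of per-pixel disparities into a 2D list of depths."""
    return [
        [depth_from_disparity(d, focal_length_px, baseline_m) for d in row]
        for row in disparities
    ]
```

In this sketch, a larger disparity yields a smaller depth, which reflects that surfaces nearer to the sensors appear further apart in pixel space between the two images.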
Another approach to depth sensing is structured-light processing. The main idea of structured-light processing is to project a known illumination pattern onto a scene, and capture an image of the scene that includes the projected pattern. For example, as shown in FIG. 1, a projector 102 may project a known texture pattern onto an object 104, and an optical sensor 106 (e.g., a camera) may capture an image 108 of the object 104. A computing device may then determine a correspondence between a region in the image and a particular part of the projected pattern. Given a position of the projector 102, a position of the optical sensor 106, and the location of the region corresponding to the particular part of the pattern within the image 108, the computing device may then use triangulation to estimate a depth to a surface of the object 104.
Typically the projector 102 and optical sensor 106 are displaced horizontally along a baseline, and the projector 102 and optical sensor 106 are calibrated. The calibration process may map a pixel in the optical sensor 106 to a one-dimensional curve of pixels in the projector 102. If the sensor image and the projector image are rectified, then this curve may take the form of a horizontal line. In this case, the search for matches to the projected texture pattern can proceed along this line, making the process more efficient.
In structured-light processing, the process of matching an image region with its corresponding part of the projected pattern is known as pattern decoding. During pattern decoding, a computing device searches horizontally for a region in the image that contains a unique portion of the projected pattern. In practice, the size of the horizontal region in the image that the computing device searches within (i.e., the search space) may require that the pattern be unique over that distance for every part of the image. For instance, the projected pattern may be made up of repeating portions such that the pattern is unique for a certain size patch (e.g., 19×19 pixels) over a certain horizontal matching range, such as 128 pixels. Consequently, the search space may be a region that is at least 128 pixels wide.
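As a non-limiting sketch, the horizontal search performed during pattern decoding may be expressed as a patch comparison along a single rectified row. The example below assumes that the image and the reference pattern are rectified, are of equal size, and are stored as 2D lists of intensities. It further assumes a sum-of-absolute-differences cost and a one-directional search. These choices, together with the names `patch_cost` and `decode_pixel`, are hypothetical and are not drawn from any particular implementation. The default values correspond to the 19×19 pixel patch and 128 pixel matching range given above.

```python
def patch_cost(image, pattern, row, img_col, pat_col, half):
    """Sum of absolute differences between the image patch centered at
    (row, img_col) and the pattern patch centered at (row, pat_col)."""
    cost = 0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            cost += abs(image[row + dr][img_col + dc]
                        - pattern[row + dr][pat_col + dc])
    return cost


def decode_pixel(image, pattern, row, col, half=9, search_range=128):
    """Return the pattern column, on the same row, whose patch best matches
    the image patch centered at (row, col).

    half=9 gives a 19x19 patch. Candidates are the pattern columns
    col, col + 1, ..., col + search_range - 1 (an assumed search direction),
    clamped so that every patch stays inside the pattern.
    """
    height, width = len(image), len(image[0])
    if not (half <= row < height - half and half <= col < width - half):
        raise ValueError("patch would extend outside the image")
    last = min(col + search_range - 1, width - 1 - half)
    best_col, best_cost = None, None
    for pat_col in range(col, last + 1):
        cost = patch_cost(image, pattern, row, col, pat_col, half)
        if best_cost is None or cost < best_cost:
            best_col, best_cost = pat_col, cost
    return best_col
```

Because the search is restricted to a single row and a bounded number of columns, a match is unambiguous only if no two patches within that range of the pattern are alike, which is why the pattern may need to be unique over the full matching range.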