Image understanding systems employ various techniques, including those in the field of computer vision, to understand image and/or video data. These techniques tend to imitate human visual recognition mechanisms. For example, when viewing a scene, such as a cityscape, a forest or a cafeteria, humans often decompose that scene into a richly organized interaction of objects, functions, spaces and/or the like. Image understanding systems attempt to do the same, with varying degrees of success. Nonetheless, parsing an image into a set of objects and interactions remains a difficult and costly undertaking for image understanding systems.
Many current approaches represent an image as a two-dimensional array of pixel labels, a representation that fails to account for occlusion. Because occlusion hides portions of a scene's semantic structure from view, the visible content becomes difficult to parse. For example, when projected into a two-dimensional image, background objects are often fragmented by occluding objects in the foreground. A number of image understanding techniques attempt to address the problem of understanding images with multiple overlapping portions.
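The fragmentation described above can be illustrated with a small sketch. The example below is not from the source; the label map, the label assignments (sky, building, tree), and the connected-components helper are all hypothetical, chosen only to show how a single background object appears as multiple disjoint regions in a two-dimensional pixel-label array once a foreground object occludes part of it.

```python
# Illustrative sketch (hypothetical data): a 2D pixel-label map in which a
# foreground object occludes a background object, fragmenting it.
# Labels: 0 = sky, 1 = building (background), 2 = tree (foreground occluder).

from collections import deque

label_map = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 2, 2, 1, 1, 1],
    [1, 1, 2, 2, 1, 1, 1],
    [1, 1, 2, 2, 1, 1, 1],
]

def connected_components(grid, target):
    """Count 4-connected regions of pixels carrying the label `target`."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == target and (r, c) not in seen:
                count += 1
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == target
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return count

# The single building is split into two disjoint regions of the label map,
# because the tree occludes its middle columns.
print(connected_components(label_map, 1))  # -> 2
```

A pixel-label representation alone cannot recover that the two building regions belong to one object; resolving that requires reasoning about occlusion and depth ordering, which is the gap the text identifies.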