The connection between three-dimensional (3D) structure and image understanding is an important and long-running theme in computer vision, motivated largely by models of human perception from computational neuroscience, including Marr's 2.5D sketch. Geometric cues, whether they arise from direct 3D reconstruction or from low-level geometric reasoning, benefit a range of computer vision problems, including tracking, object detection and visual saliency prediction.
One approach to single-view scene reconstruction assigns ordinal depth to the regions of a static image segmentation based on foreground occlusions, ‘pushing’ occluded regions to lower layers and ‘popping’ occluding regions to higher layers. However, sensitivity to the initial segmentation and to the target path makes such an approach brittle. Another approach adopts a more robust minimum description length optimization for depth layer assignment, based on the assumption that persistent edges of foreground targets coincide with occlusion boundaries.
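The push/pop layering described above can be illustrated with a minimal sketch. It assumes that occlusion evidence has already been reduced to pairs of the form (occluding region, occluded region); the function name `assign_ordinal_depth`, the region labels and the relaxation loop are illustrative assumptions and are not taken from any of the approaches discussed. For simplicity, the sketch resolves a violated ordering only by popping the occluding region upwards; pushing the occluded region downwards would be the symmetric operation.

```python
def assign_ordinal_depth(regions, occlusions):
    """Assign an ordinal depth layer to each region (hypothetical helper).

    regions    -- iterable of hashable region labels
    occlusions -- list of (front, back) pairs: 'front' was seen occluding 'back'
    Returns a dict mapping region -> layer index, where higher means nearer.
    """
    regions = list(regions)
    layer = {r: 0 for r in regions}
    # With n regions, a consistent ordering settles within n passes;
    # a change on every one of n + 1 passes indicates a contradiction.
    for _ in range(len(regions) + 1):
        changed = False
        for front, back in occlusions:
            if layer[front] <= layer[back]:
                # 'Pop' the occluding region just above the occluded one.
                layer[front] = layer[back] + 1
                changed = True
        if not changed:
            return layer
    raise ValueError("contradictory occlusion observations")


# Toy scene: a pillar hides a person, and the person hides part of a wall.
layers = assign_ordinal_depth(
    ["wall", "person", "pillar"],
    [("pillar", "person"), ("person", "wall")],
)
```

In this toy scene the wall ends up on the lowest layer and the pillar on the highest. A single wrong observation, for example one caused by a poor initial segmentation, either changes the ordering or makes it contradictory, which is one way to picture the brittleness noted above.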
Another approach builds a simple three-layer model that segments scenes into static background, moving targets and static foreground occlusions. Evidence for occlusions follows from two assumptions: occlusions result in persistent foreground edges perpendicular to the direction of motion, and occlusion edges never appear inside foreground regions. The three-layer model approach is restricted to relatively simple scenes (e.g., scenes in which a person walks behind but never in front of an occluding object). A further approach additionally assumes that foreground regions change rapidly in an area of an image during occlusion (e.g., when a person disappears behind an occluding object). Neither the three-layer model approach nor the rapid change approach requires static occlusion boundaries to coincide with intensity edges, and both have the advantage of avoiding over-segmentation. However, neither approach detects horizontal occlusion boundaries that are parallel to target motion.
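The first of the two assumptions above, that occlusions produce persistent foreground edges perpendicular to the direction of motion, can be illustrated with a small synthetic sketch. The sketch assumes purely horizontal target motion and binary foreground masks; the names `persistent_vertical_edges`, `synth_frame` and `min_frames`, as well as the toy scene, are hypothetical. The second assumption (no occlusion edges inside foreground regions) is not modelled.

```python
def persistent_vertical_edges(frames, min_frames):
    """Find foreground edges that stay at the same place over many frames.

    frames     -- list of binary masks (lists of rows of 0/1), one per frame
    min_frames -- number of frames an edge must appear in to count as persistent
    Returns a set of (row, col): an edge between columns col and col + 1.
    """
    counts = {}
    for mask in frames:
        for r, row in enumerate(mask):
            # Only horizontal neighbours are compared, so only edges
            # perpendicular to horizontal motion can be found.
            for c in range(len(row) - 1):
                if row[c] != row[c + 1]:
                    counts[(r, c)] = counts.get((r, c), 0) + 1
    return {pos for pos, n in counts.items() if n >= min_frames}


WIDTH, HEIGHT = 10, 2
OCCLUDER_COLS = {5, 6}  # static foreground object hiding these columns


def synth_frame(t):
    """Foreground mask of a 3-pixel-wide target whose left edge is at column t."""
    visible = {c for c in range(t, t + 3) if c not in OCCLUDER_COLS}
    return [[1 if c in visible else 0 for c in range(WIDTH)] for _ in range(HEIGHT)]


frames = [synth_frame(t) for t in range(8)]  # target moves right by 1 pixel per frame
edges = persistent_vertical_edges(frames, min_frames=3)
```

The target's own leading and trailing edges move with it and are seen at any one column in at most two frames, whereas the mask is clipped at the occluder's left and right sides in three frames each, so only those two boundaries survive the threshold. Because the sketch compares horizontal neighbours only, a boundary running parallel to the motion would go unnoticed, which mirrors the limitation stated above.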
Moving targets have previously been exploited to segment floor regions. One approach analyses moving people to recover camera calibration, floor segmentation and ground plane parameters in a static scene. This floor segmentation approach models floor appearance based on seed pixels underneath detected target footprints, and constructs the floor segment as a connected region of pixels within a threshold colour distance from the seeds. Another approach proposes a similar model that detects floor regions by iteratively growing a region around seed pixels. Both approaches work well in simple scenes with homogeneous floor appearance and sufficiently distributed footprints. However, neither approach considers the impact of partially occluded targets with no visible footprint, or the impact of occlusions that divide the floor into disconnected segments.
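The seed-based floor segmentation described above can be sketched as a simple region-growing procedure. This is a minimal illustration under stated assumptions, not the implementation of either approach: the floor model is taken to be the mean colour of the seed pixels, colour distance is Euclidean, connectivity is 4-neighbour, and the names `grow_floor`, `max_dist`, `FLOOR` and `BOX` are hypothetical.

```python
from collections import deque
from math import dist


def grow_floor(image, seeds, max_dist):
    """Grow a floor mask outwards from footprint seed pixels.

    image    -- list of rows, each pixel an (r, g, b) tuple
    seeds    -- list of (row, col) pixels beneath detected footprints
    max_dist -- colour distance threshold (assumed Euclidean)
    Returns a list of rows of booleans marking the floor segment.
    """
    h, w = len(image), len(image[0])
    # Assumed appearance model: mean colour of the seed pixels.
    model = tuple(sum(image[r][c][k] for r, c in seeds) / len(seeds) for k in range(3))

    def floor_like(r, c):
        return dist(image[r][c], model) <= max_dist

    mask = [[False] * w for _ in range(h)]
    queue = deque()
    for r, c in seeds:
        if floor_like(r, c) and not mask[r][c]:
            mask[r][c] = True
            queue.append((r, c))
    # Breadth-first growth over 4-connected neighbours.
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr][nc] and floor_like(nr, nc):
                mask[nr][nc] = True
                queue.append((nr, nc))
    return mask


FLOOR, BOX = (100, 100, 100), (200, 50, 50)
# A box in column 3 splits an otherwise uniform floor into two parts.
image = [[BOX if c == 3 else FLOOR for c in range(6)] for _ in range(4)]
left_only = grow_floor(image, seeds=[(0, 0)], max_dist=20.0)
both_sides = grow_floor(image, seeds=[(0, 0), (0, 5)], max_dist=20.0)
```

With a footprint on the left side only, the grown region stops at the box and never reaches the floor on the right, even though it has the same colour. Adding a seed on the right recovers both parts, which illustrates why these approaches depend on sufficiently distributed footprints and struggle when occlusions divide the floor.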
Another approach for detecting floor regions moves beyond simple foreground shape analysis by detecting human actions to probe “sittable” and “walkable” surfaces of a scene, which are subsequently used to infer single-view 3D structure. Such an approach relies heavily on single-view human pose estimation, which remains an open and challenging problem.