The connection between three-dimensional (3D) scene structure and image understanding is an important and long-running theme in computer vision, largely motivated by models of human perception from computational neuroscience, including Marr's 2.5D sketch. Geometric cues, whether arising from direct 3D reconstruction or low-level geometric reasoning, benefit a range of computer vision problems including tracking, object detection and visual saliency prediction.
Conventional methods of recovering scene structure require multiple images with a change in imaging conditions (e.g., viewpoint, focus or lighting). However, applications of practical interest, such as surveillance and monitoring, are dominated by static cameras in uncontrolled environments. Where such applications do use multiple views, the views are often wide-baseline or non-overlapping, so that the conventional multi-view methods of recovering scene structure cannot be applied.
In some conventional methods, scene geometry is recovered using less robust single-view geometric cues. Since single-view geometric understanding is inherently under-constrained, one conventional approach is to incorporate constraints on admissible scene geometry. For example, one method of recovering scene structure interprets 2D line drawings based on a set of known 3D polyhedral shapes. Another method of recovering scene structure models a scene as a horizontal ground plane and a set of vertical planes. Yet another method of recovering scene structure, commonly known as the Manhattan world model, interprets indoor and urban scenes as a set of orthogonal planes. The planes correspond to, for example, the floor, ceiling and walls of a rectilinear room.
In methods of recovering scene structure that constrain scene geometry as described above, one geometric cue is the surface orientation at each image location. In one such method, the surface orientation is a semantic label corresponding to floor, ceiling, left wall, rear wall or right wall. In another method of recovering scene structure, the surface orientation is an integer label corresponding to one of the orthogonal plane orientations in the Manhattan world model. A surface orientation map that assigns a surface orientation label to each pixel location in an image may be determined. The surface orientation map indicates, for each location in the image, the orientation of the scene surface projected to that location.
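As a minimal illustration of a surface orientation map of the kind described above, the following Python sketch stores one semantic label per pixel location. The names `SurfaceLabel` and `label_fractions`, and the particular integer values assigned to the labels, are hypothetical and chosen only for this example.

```python
from collections import Counter
from enum import IntEnum


class SurfaceLabel(IntEnum):
    """Hypothetical integer encoding of the five semantic surface labels."""
    FLOOR = 0
    CEILING = 1
    LEFT_WALL = 2
    REAR_WALL = 3
    RIGHT_WALL = 4


def label_fractions(orientation_map):
    """Return the fraction of pixel locations assigned to each label.

    orientation_map is a list of rows, each row a list of SurfaceLabel
    values, so orientation_map[y][x] is the label at pixel (x, y).
    """
    counts = Counter(label for row in orientation_map for label in row)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# A tiny 2x2 map: ceiling along the top row, floor and rear wall below.
example_map = [
    [SurfaceLabel.CEILING, SurfaceLabel.CEILING],
    [SurfaceLabel.FLOOR, SurfaceLabel.REAR_WALL],
]
```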
One method of determining surface orientation for an image determines support regions for different surface orientations based on detected line segments. Three vanishing points corresponding to a Manhattan world model are detected in an input image. Then, for each detected line segment, a support region for a particular vanishing point is determined based on the triangle formed by the end-points of the line segment and one of the vanishing points. Finally, image locations where exactly two support regions for different vanishing points overlap are labelled according to the corresponding surface orientation. Such a method of determining surface orientation is sensitive to missing or noisy line segment detections, and does not assign a label to all image locations. Furthermore, this method does not provide an indication of the confidence of an assigned label, which would be useful for later algorithms that utilise the determined surface orientation.
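The triangle-based labelling described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the described method itself: line segments and vanishing points are taken as given, each segment is paired with a single vanishing point, and a pixel supported by exactly two vanishing points is labelled with the index of the remaining third vanishing point. The function names and this labelling convention are hypothetical.

```python
def _cross(o, a, b):
    """2D cross product of vectors (a - o) and (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(p, a, b, c):
    """True if point p lies inside or on the boundary of triangle abc."""
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def support_masks(width, height, segments, vanishing_points):
    """Build one boolean support mask per vanishing point.

    segments is a list of (end_point_0, end_point_1, vp_index) tuples
    (assumed input format). The support region of a segment is the
    triangle formed by its end-points and the chosen vanishing point.
    """
    masks = [[[False] * width for _ in range(height)]
             for _ in vanishing_points]
    for p0, p1, vp_index in segments:
        vp = vanishing_points[vp_index]
        for y in range(height):
            for x in range(width):
                if in_triangle((x, y), p0, p1, vp):
                    masks[vp_index][y][x] = True
    return masks


def label_orientation(masks):
    """Label pixels where exactly two support regions overlap.

    Assumes three vanishing points. Unlabelled pixels are left as None,
    which reflects the incomplete coverage noted in the text.
    """
    height, width = len(masks[0]), len(masks[0][0])
    labels = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            hits = {i for i, mask in enumerate(masks) if mask[y][x]}
            if len(hits) == 2:
                labels[y][x] = ({0, 1, 2} - hits).pop()
    return labels
```

Because a pixel receives a label only when exactly two masks overlap, a single missed line segment can leave a large region unlabelled, and the output carries no confidence value.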
Another method of determining surface orientation for an image is to learn the appearance of different semantic surfaces, such as the floor, ceiling and walls, based on multiple features determined on a superpixel segmentation of the image. In one such method, a boosted decision tree classifier learns a mapping from colour, texture, location, shape and line segment layout features to a semantic label. In another related method of determining surface orientation, the boosted decision tree classifier additionally learns from indoor geometric context recovered by finding vanishing points and fitting a box model to the floor, walls and ceiling of a room based on detected line segments. Such machine learning-based methods rely on training a classifier from a large number of training samples. However, the learned classifier may overfit the training data and not generalise well to other scenes. Further, collecting and annotating a large training set can require significant effort.
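To make the idea of a boosted classifier over superpixel features concrete, the following Python sketch trains AdaBoost over depth-one decision trees (stumps). It is a deliberately reduced stand-in for the classifiers described above: it handles only two classes (labels +1 and -1, for example "floor" versus "not floor"), and the two-element feature vectors in the usage example are invented toy values, not real data.

```python
import math


def _stump_predict(x, feature, threshold, sign):
    """Depth-one tree: predict `sign` if x[feature] > threshold, else -sign."""
    return sign if x[feature] > threshold else -sign


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for sign in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if _stump_predict(xi, feature, threshold, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best


def adaboost(X, y, rounds=5):
    """Train a weighted ensemble of stumps; y holds labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, feature, threshold, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, feature, threshold, sign))
        # Increase the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi *
                           _stump_predict(xi, feature, threshold, sign))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * _stump_predict(x, feature, threshold, sign)
                for alpha, feature, threshold, sign in model)
    return 1 if score >= 0 else -1


# Toy features per superpixel: (normalised vertical position, texture score).
X_train = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.3), (0.1, 0.9)]
y_train = [1, 1, -1, -1]
floor_model = adaboost(X_train, y_train)
```

With only four training samples the ensemble fits the training set perfectly, which illustrates why such classifiers need large annotated training sets to avoid overfitting.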
In another method of determining surface orientation, surface orientation determination is treated as an optimal superpixel label assignment problem, which is solved using a Markov Random Field (MRF) formulation. The MRF potentials are formulated based on the colour difference between neighbouring superpixels, the straightness of shared boundaries between neighbouring superpixels, whether superpixels occur inside detected rectilinear structures or span horizon lines defined by detected vanishing points, and whether the boundaries of superpixels align with vanishing points. Both the MRF-based method and the decision tree-based method described above are based on a superpixel segmentation of an image. Different superpixel segmentation methods, and different parameter settings within the same superpixel segmentation method, are known to produce significantly different superpixel segmentations. Thus, the above-described surface orientation determination methods are sensitive to the particular selection of superpixel segmentation method and parameters.
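A superpixel label assignment of this general form can be sketched as minimising an energy with per-superpixel (unary) costs and pairwise smoothness costs between neighbours. The Python sketch below is only illustrative: the pairwise weight uses colour difference alone, the other potentials mentioned above are omitted, and the solver is iterated conditional modes (ICM), a simple local optimiser chosen here for brevity rather than the solver of any described method. The names `colour_weight`, `icm_labels` and the parameter `beta` are hypothetical.

```python
import math


def colour_weight(colour_a, colour_b, beta=0.05):
    """Smoothness weight that decays as mean colours become more different."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(colour_a, colour_b)))
    return math.exp(-beta * dist)


def icm_labels(unary, edges, weights, n_iters=10):
    """Approximately minimise a Potts-style MRF energy with ICM.

    unary[i][l] is the cost of giving superpixel i the label l.
    edges is a list of (i, j) neighbouring superpixel pairs, and
    weights[k] is the penalty paid when edge k joins different labels.
    """
    n = len(unary)
    # Start from the label with the lowest unary cost at each superpixel.
    labels = [min(range(len(u)), key=u.__getitem__) for u in unary]
    neighbours = [[] for _ in range(n)]
    for (i, j), w in zip(edges, weights):
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))

    for _ in range(n_iters):
        changed = False
        for i in range(n):
            def cost(label):
                # Unary cost plus penalties for disagreeing with neighbours.
                return unary[i][label] + sum(
                    w for j, w in neighbours[i] if labels[j] != label)
            best = min(range(len(unary[i])), key=cost)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

In this sketch the graph itself comes from the superpixel segmentation, so a different segmentation changes both the nodes and the edges of the problem, which is one way to see the sensitivity noted above.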