Simplifying assumptions about the structure of the surroundings facilitate reasoning about complex environments. On a wide range of scales, from the layout of a city to structures such as buildings, furniture, and many other objects, man-made structures lend themselves to a description in terms of parallel and orthogonal planes. This intuition is formalized as the Manhattan World (MW) assumption, which posits that most man-made structures may be approximated by planar surfaces that are parallel to one of the three principal planes of a common orthogonal coordinate system.
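By way of illustration only, the MW assumption can be expressed as a test on surface normals: under a given Manhattan frame, every planar surface should have a normal that is nearly parallel to one of the three frame axes. The following is a minimal sketch under stated assumptions; the function name `nearest_mw_axis`, the tolerance parameter `tol_deg`, and its 10-degree default are hypothetical and are not taken from any referenced work.

```python
import math

def nearest_mw_axis(normal, axes, tol_deg=10.0):
    """Assign a surface normal to the closest axis of a Manhattan frame.

    normal  -- 3-vector (need not be unit length, must be non-zero)
    axes    -- three mutually orthogonal unit 3-vectors (the assumed frame)
    tol_deg -- hypothetical angular tolerance for counting as an MW inlier

    Returns (axis_index, sign, angle_deg, is_inlier).
    """
    length = math.sqrt(sum(c * c for c in normal))
    n = [c / length for c in normal]
    # Cosine of the angle between the normal and each frame axis.
    dots = [sum(a * b for a, b in zip(n, axis)) for axis in axes]
    # The closest axis is the one with the largest absolute cosine.
    idx = max(range(3), key=lambda i: abs(dots[i]))
    sign = 1 if dots[idx] >= 0 else -1
    # Clamp to guard against rounding slightly above 1.0.
    angle = math.degrees(math.acos(min(1.0, abs(dots[idx]))))
    return idx, sign, angle, angle <= tol_deg
```

With the identity frame, a floor normal such as (0, 0, -2) is an inlier of the third axis, whereas the normal (1, 1, 0) of a surface rotated 45 degrees about the vertical lies 45 degrees from its nearest axis and is rejected.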
The MW assumption has been used to estimate orientations within man-made environments for the visually impaired and for robots. In Simultaneous Localization and Mapping (SLAM) applications, the MW assumption has been used to impose constraints on the inferred map. At a coarse level, the MW assumption holds for city layouts, most buildings, hallways, offices, and other man-made environments. However, the strict MW assumption cannot represent many real-world scenes, such as those containing a rotated desk or a half-opened door, or complex city layouts (in contrast to planned cities, such as Manhattan).
A popular alternative to the MW model describes man-made structures by individual planes with no constraints on their relative normal directions. Such plane-based representations of 3D scenes have been used in scene segmentation, localization, optical flow, and other computer-vision applications. For example, the main directions of planes in a scene can be extracted using a hierarchical Expectation-Maximization (EM) approach, and the number of main directions can be inferred using the Bayesian Information Criterion (BIC). However, plane-based approaches do not exploit the orthogonal relationships between planes that are common in man-made structures. As a result, the independent location and orientation estimates of the planes are less robust, especially for planes that have few measurements or are subject to increased noise.
Another alternative to the MW model is the Atlanta World (AW). The AW model assumes that the world is composed of multiple Manhattan Worlds sharing the same z-axis. This facilitates inference from RGB images, as only a single angle per Manhattan World needs to be estimated, as opposed to a full 3D rotation. However, common indoor scenes break the AW assumption, because these scenes typically contain orientations that do not share the same z-axis.
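The single-angle parameterization of the AW model, and its limitation, can be illustrated with a short sketch. Each AW frame below is built from one yaw angle about the shared z-axis. The second function computes, in closed form, the smallest angle between a surface normal and any axis of any such frame; it is zero only for normals that are vertical or horizontal. The function names `atlanta_frame` and `atlanta_residual_deg` are hypothetical and serve only as an illustration under these assumptions.

```python
import math

def atlanta_frame(yaw_rad):
    """Axes of one Manhattan frame in an Atlanta World.

    All frames share the z-axis, so a single yaw angle fixes the frame.
    """
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [(c, s, 0.0), (-s, c, 0.0), (0.0, 0.0, 1.0)]

def atlanta_residual_deg(normal):
    """Smallest angle (degrees) between a non-zero normal and any AW axis.

    With the yaw left free, the best horizontal axis is the projection of
    the normal onto the horizontal plane, so only the vertical component
    of the normal determines the residual.
    """
    length = math.sqrt(sum(c * c for c in normal))
    nz = min(1.0, abs(normal[2]) / length)
    to_vertical = math.degrees(math.acos(nz))    # angle to the shared z-axis
    to_horizontal = math.degrees(math.asin(nz))  # angle to the best yawed axis
    return min(to_vertical, to_horizontal)
```

A wall normal such as (1, 1, 0) has zero residual because a yaw of 45 degrees aligns a frame with it, whereas a normal such as (1, 0, 1), for example that of a surface tilted 45 degrees from the horizontal, remains 45 degrees from every AW axis regardless of the yaw chosen.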
Accordingly, there is a need for an improved method and system for describing scenes, one that can accommodate multiple orientations without suffering from the limitations of the approaches described above.