Online street view applications are often combined with online mapping applications. For example, a user selects a specific geographical point of interest from a map or a satellite image, and then switches to a panoramic street view. The street view enables the user to navigate to the point of interest as it would be seen, for example, by the driver of a vehicle. Street view can also be used in many other computer vision, augmented reality, smart car, and safety applications.
A typical street scene includes components such as a road on the ground plane; objects such as pedestrians, bicyclists, and other vehicles; buildings; and sky. Understanding and labeling images of such a scene require understanding these components and their relative spatial locations. Most conventional methods treat this as two separate problems: three-dimensional (3D) scene reconstruction and component segmentation.
Recently, these two problems have been merged and solved as a single optimization problem, although several challenges still exist. Prior art segmentation focuses on classifying pixels into different classes. Such approaches are generally slow and may not respect the layered structure constraint that holds for street scenes.
Street scene labeling is related to semantic segmentation and scene understanding. Early prior art methods were usually based on hand-designed features. More recently, it has been shown that using deep neural networks for feature learning leads to better performance. Semantic segmentation and depth estimation from a stereo camera can also be solved jointly in a unified energy minimization framework.
One popular model for road scenes is the “stixel world,” which simplifies the world to a ground plane and a set of vertical sticks standing on the ground that represent obstacles. A stixel compactly and efficiently represents part of an upright object standing on the ground by its 3D foot print, height, width, and distance to the camera (depth). The stixel representation can be characterized as the computation of two, possibly very non-smooth, curves: the first curve runs on the ground plane and encloses the free space that can be reached immediately without collision, and the second curve encodes the vertical boundary of the objects delimiting that free space. To determine the stixel world, either a depth map from a semi-global stereo matching procedure (SGM) or the stereo matching cost itself can be used.
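The column-wise computation described above can be sketched in simplified form. The snippet below is a minimal illustration, not the method of any particular system: it assumes a dense depth map is already available (e.g., from SGM) along with the depth the ground plane would predict at each image row, and it extracts a single stixel per column by finding the free-space boundary (where depth first deviates from the ground plane, scanning up from the bottom) and the object top (where depth departs from the obstacle's depth). The function name, thresholding rule, and `tol` parameter are all hypothetical simplifications; real stixel methods optimize the two boundary curves globally with dynamic programming.

```python
import numpy as np

def extract_stixels(depth, ground_depth, tol=0.5):
    """Extract one stixel per column from a depth map (simplified sketch).

    depth        -- (H, W) array of metric depths, row 0 at the top
    ground_depth -- (H,) depth the flat ground plane would have at each row
    tol          -- hypothetical depth tolerance (meters) for comparisons
    """
    h, w = depth.shape
    stixels = []
    for u in range(w):
        col = depth[:, u]
        # Scan upward from the bottom row for the free-space boundary:
        # the first pixel noticeably closer than the ground plane predicts.
        base = None
        for v in range(h - 1, -1, -1):
            if ground_depth[v] - col[v] > tol:
                base = v
                break
        if base is None:
            continue  # entire column is free space, no stixel
        d = col[base]  # obstacle depth at its foot print
        # Scan upward while depth stays near the obstacle depth
        # to find the top of the vertical stick.
        top = base
        while top > 0 and abs(col[top - 1] - d) < tol:
            top -= 1
        stixels.append({"column": u, "base": base, "top": top, "depth": d})
    return stixels
```

A column whose depths all match the ground-plane prediction yields no stixel, corresponding to fully drivable free space in that direction.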
A stixmantics scene representation is more flexible than stixels. Instead of allowing only one stixel per image column, stixmantics allows multiple segments along every column and also combines nearby segments into superpixel-style entities with better geometric meaning.
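The per-column building block of such a representation can be sketched as splitting each depth column into runs of near-constant depth, so that a column may carry several stacked segments rather than a single stixel. This is a hypothetical simplification: the function name and `tol` threshold are illustrative, and a real stixmantics pipeline would additionally merge segments across neighboring columns into superpixel-style entities.

```python
def segment_column(col, tol=0.5):
    """Split one depth column into segments of near-constant depth.

    col -- list of metric depths for one image column, row 0 at the top
    tol -- hypothetical depth-jump threshold (meters) separating segments
    Returns a list of (top_row, bottom_row, mean_depth) tuples.
    """
    segments = []
    start = 0
    for v in range(1, len(col)):
        # A large depth jump between adjacent rows starts a new segment.
        if abs(col[v] - col[v - 1]) >= tol:
            segments.append((start, v - 1, sum(col[start:v]) / (v - start)))
            start = v
    segments.append((start, len(col) - 1, sum(col[start:]) / (len(col) - start)))
    return segments
```

For example, a column covering sky, a vehicle, and the road decomposes into three segments, whereas a single-stixel model would have to pick just one obstacle for that column.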