Scene labeling constitutes an important step in understanding a scene generated electronically. In practice, electronic devices and systems assign a corresponding category label to each pixel in an image or to each point in a 3D point cloud, and then jointly finalize scene segmentation and object recognition. Scene labeling has diverse applications, such as autonomous robotics and augmented reality. For both indoor and outdoor scenes, scene labeling becomes challenging due to lighting conditions, occlusion, and complicated textures. The emergence of various depth sensors, such as those found in RGB-D cameras, makes it convenient to acquire color and depth information simultaneously, which helps improve the accuracy and robustness of scene labeling. However, mobile devices (e.g., Google Tango) provide depth data of low quality compared with specialized RGB-D cameras like Kinect, owing to their limited power budget and computational capability.
Techniques now exist for segmenting or labeling the objects in a cluttered scene using an RGB-D data representation. Such present-day techniques make use of category classifiers learned from a training dataset with ground-truth labels. However, present-day labeling techniques suffer from the problem of how to represent the features from the sensor data. Traditionally, handcrafted features rely on color data and depth data individually, making it difficult to extend such techniques to different modalities and to exploit cross-modality information. Thus, the performance of such scene labeling depends on the selection and combination of handcrafted features. As an alternative, some techniques rely on unsupervised feature learning methods to learn sophisticated feature representations in order to improve algorithm performance and to fuse the color and depth data.
The publication K. Lai et al., Unsupervised Feature Learning for 3D Scene Labeling, ICRA'14, proposes a hierarchical sparse coding method to learn features from a 3D point cloud. The training of classifiers described in this paper relies on a synthetic dataset of virtual scenes separately generated using CAD models. Classifiers are also trained using RGB-D images, and these classifiers are combined with the classifiers trained from 3D data. Wang et al., in their paper Multi-Modal Unsupervised Feature Learning for RGB-D Scene Labeling, ECCV'14, propose learning features from color and depth information in a joint manner via an unsupervised learning framework. Both papers treat color and depth information as a direct concatenation during classification or feature learning.
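By way of illustration only, direct concatenation of color and depth information can be sketched as follows. The sketch is not the method of either cited paper: a simple nearest-centroid classifier stands in for the learned classifiers, and the function names (concat_features, fit_centroids, label_scene), the toy 2x2 image, and the category names are hypothetical, chosen solely for this example.

```python
from math import dist

def concat_features(rgb, depth):
    """Join each pixel's (r, g, b) color with its depth into one 4-D feature."""
    return [
        [(*rgb[y][x], depth[y][x]) for x in range(len(rgb[y]))]
        for y in range(len(rgb))
    ]

def fit_centroids(features, labels):
    """Average the features of each labeled category (None means unlabeled)."""
    sums, counts = {}, {}
    for feature_row, label_row in zip(features, labels):
        for feature, label in zip(feature_row, label_row):
            if label is None:
                continue
            acc = sums.setdefault(label, [0.0] * len(feature))
            for i, value in enumerate(feature):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(v / counts[lab] for v in acc) for lab, acc in sums.items()}

def label_scene(features, centroids):
    """Assign every pixel the category whose centroid is nearest (Euclidean)."""
    return [
        [min(centroids, key=lambda lab: dist(f, centroids[lab])) for f in row]
        for row in features
    ]

# Hypothetical toy data: a 2x2 image with color and depth scaled to [0, 1].
rgb = [[(0.9, 0.1, 0.1), (0.8, 0.2, 0.1)],
       [(0.1, 0.1, 0.9), (0.2, 0.1, 0.8)]]
depth = [[0.20, 0.25],
         [0.80, 0.75]]
# Ground-truth labels are assumed for one pixel per category only.
train_labels = [["chair", None],
                ["wall", None]]

features = concat_features(rgb, depth)
centroids = fit_centroids(features, train_labels)
predicted = label_scene(features, centroids)
```

In this sketch the depth value is merely appended as a fourth feature dimension alongside the three color channels, so no cross-modality structure is modeled beyond what the distance measure captures.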
More recently, feature learning methods, such as convolutional neural networks, have found applicability in RGB-D feature learning. Interactive labeling also constitutes an alternative method to overcome low-quality depth data and insufficient benchmark datasets. The paper SemanticPaint: Interactive 3D Labeling and Learning at your Fingertips, published by J. Valentin et al. in ToG'15, describes a state-of-the-art labeling technique wherein users simultaneously scan the environment and interactively segment scenes by reaching out and touching desired objects or surfaces. However, this technique has proven hard to implement in large, cluttered scenes where objects are out of the user's reach or may not be touched. O. Miksik et al., in their paper The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces, CHI'15, propose using a laser pointer to draw onto a 3D world to semantically segment objects. From the interactively labeled examples, Miksik et al. automatically segment the captured 3D models.
While the above-described segmentation techniques work well with high-resolution RGB-D cameras, such techniques do not work well with mobile devices such as Google Tango, which capture images with low-quality 3D data.
Thus, a need exists for a scene labeling technique that overcomes the aforementioned disadvantages of the prior art and, in particular, is able to utilize low-quality 3D data from mobile devices.