Existing three-dimensional (3D) object retrieval approaches may be categorized into (i) those operating directly on the 3D content and (ii) those which extract “2.5D” or 2D contents (stereo-pairs, multiple views of images, artificially rendered 3D objects, silhouettes, etc.).
Among “2D-to-3D” retrieval frameworks, which take a 2D image as input for performing the retrieval, several shape-based approaches, including boundary analyses, have been adapted for 3D object retrieval from 2D image(s).
For instance, T. Napoleon, “From 2D Silhouettes to 3D Object Retrieval: Contributions and Benchmarking”, in EURASIP Journal on Image and Video Processing, 2010, conducted 3D object search with multiple silhouette images. The query includes not only 2D silhouettes, but also hand-drawn sketches. Notably, this document introduced the idea of aligning silhouettes/contours using dynamic programming in a coarse-to-fine way for search efficiency. However, an important drawback of this method is that its performance is sensitive to the quality of the contour resulting from automatic detouring, which remains a great challenge.
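As an illustration only, the principle of a coarse-to-fine alignment by dynamic programming may be sketched as follows. The sketch assumes that each contour has already been reduced to a one-dimensional sequence of descriptor values (for example, distances from the centroid), which is an assumption made here and not a detail taken from the cited document. The function names `dtw_cost`, `subsample` and `coarse_to_fine_search` are hypothetical.

```python
from math import inf


def dtw_cost(a, b):
    """Dynamic-programming alignment cost between two 1D descriptor sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def subsample(seq, step):
    """Keep one sample out of `step` to obtain a coarse contour."""
    return seq[::step]


def coarse_to_fine_search(query, database, step=4, keep=2):
    """Rank candidates on coarse contours, then re-rank a shortlist at full resolution.

    `database` maps a (hypothetical) model name to its contour descriptor sequence.
    """
    coarse_query = subsample(query, step)
    ranked = sorted(
        database,
        key=lambda name: dtw_cost(coarse_query, subsample(database[name], step)),
    )
    shortlist = ranked[:keep]
    return min(shortlist, key=lambda name: dtw_cost(query, database[name]))
```

The coarse pass costs roughly 1/step² of a full alignment per candidate, so only the shortlisted candidates incur the full quadratic cost. The sketch also makes the stated drawback visible: the result depends entirely on the descriptor sequences, and therefore on the quality of the extracted contour.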
As another example of 3D retrieval from 2D images, Aono et al., “3D Shape Retrieval from a 2D Image as Query”, in Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012, use a composite feature vector that combines Zernike moments and HOG features for 3D object retrieval from a single 2D image. The HOG features are computed from shaded depth-buffer images, while the Zernike moments are computed from silhouette images. These features may not be sufficient to distinguish between similar objects with the same overall shape. Also, they often fail with partially occluded objects.
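By way of a minimal sketch, a composite feature vector of this kind may be formed by normalizing each descriptor and concatenating the results, after which retrieval reduces to a nearest-neighbor search. The computation of the Zernike moments and of the HOG features is not shown; the two descriptors are assumed to be given as lists of numbers. The weights, the Euclidean distance and the function names are assumptions chosen for illustration and are not taken from the cited document.

```python
from math import sqrt


def l2_normalize(v):
    """Scale a descriptor to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def composite_feature(zernike, hog, w_zernike=0.5, w_hog=0.5):
    """Concatenate the two normalized descriptors, each scaled by its weight."""
    return [w_zernike * x for x in l2_normalize(zernike)] + [
        w_hog * x for x in l2_normalize(hog)
    ]


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve(query_feature, index):
    """Return the model whose closest rendered view is nearest to the query.

    `index` maps a (hypothetical) model identifier to a list of composite
    features, one per rendered view of that model.
    """
    return min(
        index,
        key=lambda model_id: min(euclidean(query_feature, f) for f in index[model_id]),
    )
```

Normalizing each descriptor before concatenation prevents the longer or larger-valued descriptor from dominating the distance. Two objects sharing the same overall shape yield nearly identical silhouette descriptors, which is consistent with the limitation noted above.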
Other approaches to “2D-to-3D” matching utilize the 3D models for efficient object detection and/or fine pose estimation. For this, they rely on a collection of 3D exemplar models, which they render from a large number of viewpoints. The rendered images are then used for learning part templates to localize an object in a given image and estimate its fine pose. The main drawback of such approaches is that they require heavy annotations and calculations, and thus are not scalable. To learn a meaningful model, they need to associate each available CAD model with a set of images which contain the same CAD model and which are annotated with the object pose.
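For illustration, the set of viewpoints from which an exemplar model is rendered may be generated by distributing camera positions approximately uniformly on a sphere centered on the model. The sketch below uses a Fibonacci lattice, which is one common choice assumed here and not a technique stated in the approaches discussed; the rendering itself is not shown, and the function name `sample_viewpoints` is hypothetical.

```python
from math import cos, pi, sin, sqrt


def sample_viewpoints(n, radius=1.0):
    """Return n camera positions spread approximately evenly on a sphere."""
    golden_angle = pi * (3.0 - sqrt(5.0))
    points = []
    for i in range(n):
        # z runs from just below +1 to just above -1 in equal steps.
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = sqrt(1.0 - z * z)  # radius of the horizontal circle at height z
        theta = golden_angle * i
        points.append((radius * r * cos(theta), radius * r * sin(theta), radius * z))
    return points
```

Since each viewpoint gives rise to one rendered image per exemplar model, the number of images to render and annotate grows with the product of the number of models and the number of viewpoints, which illustrates the computational burden noted above.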
Querying a database of 3D objects with a 2D image has also been used for automatic 3D reconstruction of objects depicted in Web images, in Q. Huang et al., “Single-View Reconstruction via Joint Analysis of Image and Shape Collections”, in CVPR 2015. The approach reconstructs objects from single views. The key idea is to jointly analyze a collection of images of different objects along with a smaller collection of existing 3D models. Dense pixel-level correspondences are established between natural images and rendered images. These correspondences are used to jointly segment the images and the 3D models. The computed segmentations and correspondences are then used to construct new models. However, such a method is sensitive to the quality of the segmentation and thus could fail on images with complex backgrounds or partially occluded objects.
Hence, these methods suffer from several drawbacks. First, they can impose constraints on the image provided as input, for instance segmentation or automatic detouring of the image. In addition, they do not always allow retrieving objects that are partially occluded. Furthermore, the scalability of these methods can be limited, as they rely on learning machines that quickly reach the limits of their learning capabilities. Moreover, the discriminative power of the signatures used for retrieving the objects does not always allow retrieving relevant 3D objects; for instance, these methods are not able to determine by themselves what makes the difference between two objects in a 3D model.
Within this context, there is still a need for an improved method for recognizing a three-dimensional modeled object from a two-dimensional image.