Photographs and other 2D images depict 3D objects. Computers estimate the 3D geometry of the 3D objects depicted in such 2D images to enable numerous image editing techniques. For example, existing 3D geometry detection techniques are used to analyze a 2D image and create estimates of 3D geometry information for each of the pixels of the image. Specifically, these existing techniques are used to determine estimates of depth (e.g., distance from a camera) of the pixels in the images. For example, in a 2D image that depicts a room with furniture, the pixels of the back wall of the room are given larger depth estimates than the pixels of a sofa in the middle of the room that is estimated to be is closer to the camera. When an object, such as a bicycle, is added to the 2D image at a particular depth and in a location that is between the back wall and the sofa so that the sofa should block a portion of the bike, the pixel depth information is used to determine that the pixels of the sofa will be displayed rather than that portion of the bike, but that the pixels of the other portion of bike will replace the corresponding pixels of the back wall in the 2D image, since the back wall is deeper than the bike.
Similarly, existing techniques are used to determine estimates of normals (e.g., an orientation or direction associated with a pixel) of the pixels of 2D images. For example, existing techniques represent small areas of a 2D image by a geometry of connected polygons associated with the pixels, and the normals of the pixels identify a direction that is orthogonal to the surface of the associated polygon. When an object is inserted into the 2D image, the orientation of the inserted object can be based on the normal of a pixel at the location where the object is inserted. For example, if a new object, such as a bottle, is inserted in the center of the top surface of a table in the 2D image, the orientation of the new object can be based on the normal of the pixel at that location so that the object will appear to rest flat on the table surface.
Generally, existing techniques estimate depths and normals of the pixels of a 2D image to facilitate various image editing functions. The estimates enable image editors to more realistically insert people, cars, bicycles, advertisements, and other objects into the 2D image consistently with the depths and shapes of existing objects in the image. The estimates of depth and normals also enable adding shadows, view point changes, relighting, and other 3D realistic edits to the 2D image. Techniques to estimate geometries in images can use multiple images of a scene such as those taken by special-purpose 3D cameras, but it is often desirable to estimate the 3D geometry when only a single image of the scene is available.
One technique for estimating 3D geometry in a single 2D image is to train deep convolutional neural networks (CNNs) using a significantly large amount of training data. The technique uses depth training data to learn a first neural network and uses normal training data to learn a second neural network. The technique then uses the first neural network on an image to predict a depth map (i.e., a depth value at each pixel in the image) for the image and uses the second neural network to predict a normal map (i.e., a normal value at each pixel of the image) for the image. The technique learns the neural networks by minimizing overall differences between predicted depth/normal results and known ground truth depth/normal maps. At each pixel, the technique checks the difference between the predicted value and the ground truth and attempts to minimize the sum of all of those differences.
While the predictions of these existing CNN techniques to estimate depths and normals from a single image are good overall, the results often include many drawbacks, such as irregularities with respect to estimating the depths and normals in planar regions, among others.