Depth estimation in images is typically utilized to estimate the distance between objects in an image scene and the camera used to capture the image. This is conventionally performed using stereoscopic images or dedicated depth sensors (e.g., time-of-flight or structured-light cameras), and the resulting depth is used to identify objects, support gestures, and so on. Reliance on dedicated hardware such as stereoscopic cameras or depth sensors, however, limits the availability of these conventional techniques.
Semantic labeling in images is utilized to assign labels to pixels in an image, e.g., to describe an object represented at least in part by each pixel, such as sky, ground, a building, and so on. This may be utilized to support a variety of functionality, such as object removal and replacement in an image, masking, segmentation techniques, and so on. In conventional approaches, however, semantic labeling is typically solved separately or sequentially from depth estimation using different and unrelated techniques. These approaches lack accuracy and may propagate errors formed at early stages of the techniques to later stages.