Saliency detection has been studied under three different scenarios. Early works attempt to predict human eye fixations over an image, while later works increasingly focus on salient foreground segmentation, i.e., predicting a dense, pixel-level binary map that differentiates salient objects from the background. However, salient foreground segmentation provides no way to separate overlapping salient objects, and it requires pixel-level annotations that are expensive to acquire for large datasets. In contrast, salient object detection aims to locate salient objects and draw bounding boxes around them. Because it relies on bounding box annotations, it significantly reduces the human labeling effort and can easily separate overlapping salient objects.
With the re-emergence of the convolutional neural network (CNN), the computer vision community has witnessed numerous breakthroughs, including in salient object detection, thanks to the extraordinary discriminative and representational ability of CNNs. Prior to CNNs, heuristic methods either detected a single salient object in an image or ranked a fixed-size list of bounding boxes that might contain salient objects, without determining the exact detections. However, such methods do not solve the existence problem, i.e., determining whether any salient objects exist in an image at all, and simply rely on external binary classifiers to address it. Recently, saliency detection based on deep networks has achieved state-of-the-art performance. For example, one network designed for generic object detection generates hundreds of candidate bounding boxes, which are further ranked to output a compact set of salient objects. A probabilistic approach filters and re-ranks the candidate boxes as a substitute for non-maximum suppression (NMS). To improve accuracy, the network is applied recursively to image sub-parts, which adds overhead.
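For reference, the non-maximum suppression step that the probabilistic approach replaces can be sketched as follows. This is a minimal illustration of standard greedy NMS, not the method of any work discussed here; the names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 overlap threshold are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) with x1 < x2, y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap an already kept,
    higher-scoring box by more than iou_threshold (an assumed value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    confidences = [0.9, 0.8, 0.7]
    print(greedy_nms(candidates, confidences))
```

Because the suppression decision depends only on pairwise overlap and a fixed threshold, two distinct but heavily overlapping salient objects can be merged into one detection, which is one reason a learned or probabilistic re-ranking may be preferred.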
To accurately localize salient objects, these approaches require a large number of class-agnostic proposals covering the whole image. If only tens of boxes are used, the precision and recall of these methods drop significantly. Generic object proposals have a very low success rate of locating an object, i.e., only a few of the proposals tightly enclose the ground-truth objects, while most are redundant. Despite the application of additional refinement steps, many false positives remain. Such additional steps also make this framework infeasible for real-time applications.
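The notion of a proposal "tightly enclosing" a ground-truth object is commonly measured with intersection over union (IoU). The sketch below shows one simplified way to compute the fraction of proposals that hit an object and the fraction of objects that are covered. The function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 IoU threshold are assumptions for illustration, and the precision here counts duplicate hits on the same object as correct, which stricter evaluation protocols do not.

```python
from typing import Sequence, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) with x1 < x2, y1 < y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def proposal_precision_recall(proposals: Sequence[Box],
                              ground_truth: Sequence[Box],
                              iou_threshold: float = 0.5
                              ) -> Tuple[float, float]:
    """Simplified precision and recall for a set of box proposals.

    precision: fraction of proposals overlapping some ground-truth box
               with IoU >= iou_threshold (duplicates counted as hits).
    recall:    fraction of ground-truth boxes matched by some proposal.
    """
    hits = [any(box_iou(p, g) >= iou_threshold for g in ground_truth)
            for p in proposals]
    covered = [any(box_iou(p, g) >= iou_threshold for p in proposals)
               for g in ground_truth]
    precision = sum(hits) / len(proposals) if proposals else 0.0
    recall = sum(covered) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


if __name__ == "__main__":
    objects = [(0, 0, 10, 10)]
    proposals = [(0, 0, 10, 10), (5, 5, 15, 15),
                 (50, 50, 60, 60), (100, 100, 110, 110)]
    print(proposal_precision_recall(proposals, objects))
```

In this toy setting only one of four proposals matches the object, so recall is perfect while precision is low, which mirrors the situation described above where most proposals are redundant.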