When a person enters a room that the person has never seen before, the person's visual system immediately begins to parse the scene. The eyes move (saccade) to regions of the room that contain objects of interest. As the objects of interest are found, the brain immediately begins classifying them. If the person sees something new and unrecognized, the person might ask a friend what the item is called. While this task is trivial for most humans to accomplish, it has proven to be a very challenging problem for computers. Because human performance far exceeds that of the best machine vision systems to date, building an artificial system inspired by the principles underlying human vision is an attractive option. However, existing bio-inspired systems have not been robustly tested on real-world image datasets and are not suited for real-time applications. Additionally, existing attention and segmentation systems are slow, inefficient, perform inaccurate or arbitrary segmentation of a scene, and lack automation, so that they require a user's constant supervision. The majority of vision-based methods have limited application to natural scenes and cannot accommodate subtle changes in a scene, such as changes in the natural lighting. Such accommodations would require extensive tuning of the method to each particular scene and would require constant user maintenance to properly segment an object. Additionally, current systems implement object segmentation mechanisms that only work at a single scale, do not have the ability to break apart an image into smaller objects based on feature density, and do not have the ability to join smaller regions of similar features into larger objects with defined boundaries.
Recently, a number of researchers have shown interest in systems that compute the saliency of a scene. Two particular theories have emerged from this interest: (1) attention that focuses on features at specific loci in the scene; and (2) attention that focuses on the saliency of specific objects or colored “blobs” in a scene. These methods are called feature-based and object-based attention, respectively. Researchers have found evidence that supports each of these, and both methods have exhibited utility for a variety of applications.
Feature-based attention works at the pixel level and bases computation of attention on the saliency of a specific location within the scene. The attention work of Itti and Koch employs such an approach, which computes attention by constructing a saliency map from a set of biologically-inspired features extracted from the image (see literature reference no. 1, below in the Detailed Description). The algorithm breaks apart the image into a set of Gaussian pyramids corresponding to color, intensity, and orientation at a series of scales, which are combined across scales and merged into the saliency map. The system attends to the point that corresponds to the maximum value in the saliency map, applies inhibition of return to that location, and shifts to the next most salient point. This process continues until the program attends to a maximum number of locations, or the user terminates the program.
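By way of a non-limiting illustration, the attend-and-shift loop described above can be sketched as follows. This is a minimal sketch and not the implementation of Itti and Koch: it assumes the saliency map has already been computed and is supplied as a list of rows, and it simplifies inhibition of return to a permanent zeroing of a square neighborhood around each attended point. The function name attend and the parameters max_locations and radius are hypothetical.

```python
def attend(saliency, max_locations, radius=1):
    """Return attended (row, col) points in order of decreasing saliency.

    Assumption: inhibition of return is modeled as permanently zeroing a
    square neighborhood of half-width `radius` around each attended point.
    """
    # Work on a copy so the caller's map is not modified.
    s = [list(row) for row in saliency]
    rows, cols = len(s), len(s[0])
    visited = []
    for _ in range(max_locations):
        # Winner-take-all: locate the maximum of the current map.
        r, c = max(((i, j) for i in range(rows) for j in range(cols)),
                   key=lambda p: s[p[0]][p[1]])
        if s[r][c] <= 0:
            break  # nothing salient remains
        visited.append((r, c))
        # Inhibition of return: suppress the neighborhood of the winner.
        for i in range(max(0, r - radius), min(rows, r + radius + 1)):
            for j in range(max(0, c - radius), min(cols, c + radius + 1)):
                s[i][j] = 0.0
    return visited
```

In this sketch the loop ends either when the maximum number of locations has been attended or when no salient point remains, mirroring the two termination conditions mentioned above (the user-initiated stop is omitted).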
A significant problem with Itti's method is its inefficiency: in spite of the inhibition-feedback mechanism, the algorithm routinely returns to previously attended locations if a better point is not found. This often results in a few locations receiving almost all of the attention, while other, less conspicuous regions receive none. This is an undesired effect for the purposes for which the present invention is intended. To obtain a complete analysis of a scene, the algorithm should attend to every location. The Itti and Koch algorithm lacks the type of queuing system required to efficiently attend to every location in the image. Some groups have also found that this algorithm does not exhibit invariance to the orientation of a scene. For example, when the algorithm analyzes the same image twice, with one copy slightly rotated relative to the other, it returns dramatically different salient regions for each (see literature reference no. 2). This is an undesired effect, as invariance to the image orientation is critical. Itti's attention system was outlined in detail in U.S. Pat. No. 6,670,963.
A number of modifications have been made to the original algorithm that allow for top-down biasing of the attention mechanism to pay more attention to regions that exhibit certain desired features (see literature reference nos. 3 and 4). These methods apply a series of weights to the feature maps that directly affect the construction of the saliency map and tilt the attention system in favor of particular features. One method computes the feature weights through a training regimen, in which the system observes a series of target objects and determines which features provide a common thread between them, expressed as the feature relevance. Because this top-down biasing system must be applied through the aforementioned Itti and Koch algorithm, the overall method shares the same limitations previously discussed.
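As a non-limiting illustration of top-down biasing, the sketch below forms a saliency map as a weighted sum of feature maps. It is a simplified assumption rather than the method of the cited references: the feature maps are taken to be same-sized lists of rows keyed by name, the cross-scale combination step is omitted, and the weights are supplied directly instead of being learned through training. The function name biased_saliency and the feature names are hypothetical.

```python
def biased_saliency(feature_maps, weights=None):
    """Combine named feature maps into one saliency map.

    feature_maps: dict mapping a feature name to a 2-D list of floats.
    weights: dict mapping a feature name to its top-down weight;
             features without an entry default to a weight of 1.0.
    """
    weights = weights or {}
    names = list(feature_maps)
    rows = len(feature_maps[names[0]])
    cols = len(feature_maps[names[0]][0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for name in names:
        w = weights.get(name, 1.0)
        fmap = feature_maps[name]
        for i in range(rows):
            for j in range(cols):
                # Each feature contributes in proportion to its weight.
                saliency[i][j] += w * fmap[i][j]
    return saliency
```

Raising the weight of one feature relative to the others moves the maximum of the resulting map toward locations where that feature is strong, which is the sense in which the weights tilt the attention system.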
An extension of Itti's attention mechanism was developed by Walther et al., which segments the region around the most salient point, returning a region of interest instead of a simple point in the image (see literature reference no. 5). To do this, a saliency map is computed for an image and the most salient point in the image is located. Next, the original feature maps are examined and the feature that contributed the most to the saliency at the most salient point is determined. The algorithm processes the strongest feature map around the most salient point, employing a flooding and thresholding algorithm to determine the extent of the strongest feature. The program computes the boundary of the resulting map and returns it as the region of interest. As with the original attention algorithm, this method continues to shift its attention around the image until a predetermined number of regions have been visited, or the user stops the algorithm.
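The region-extraction steps above can be sketched, in a non-limiting way, as follows. This is a minimal sketch under stated assumptions and not the implementation of Walther et al.: the strongest feature is taken to be the map with the largest value at the salient point, the flood uses 4-connectivity, and the threshold is an assumed fraction of the value at the seed. The names strongest_feature, segment_region, and fraction are hypothetical, and the boundary-tracing step is omitted.

```python
from collections import deque

def strongest_feature(feature_maps, point):
    """Name of the feature map with the largest value at `point`."""
    r, c = point
    return max(feature_maps, key=lambda name: feature_maps[name][r][c])

def segment_region(feature_map, seed, fraction=0.5):
    """Flood outward from `seed`, keeping 4-connected pixels whose value
    is at least `fraction` of the seed value (assumed threshold rule)."""
    rows, cols = len(feature_map), len(feature_map[0])
    threshold = fraction * feature_map[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region \
                    and feature_map[nr][nc] >= threshold:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region
```

Because the flood follows only one feature map and a single threshold, a strong pixel that is not connected to the seed is excluded, while weak parts of the same physical object are also excluded.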
This method, however, exhibits a number of critical weaknesses. In particular, the segmentation employed, while feature-based, appears arbitrary when actually applied and does not correspond to actual objects. In cases where there is clutter or no distinct feature boundaries, this method often fails to clearly segment regions of interest and the algorithm often ignores parts of objects or cleaves them in half. Also, this method demonstrates the same inefficiency that the Itti algorithm displays pertaining to overlapping salient regions and continuously revisiting previously attended spaces in the image. This work is described in United States Patent Publication Number 2005/0047647.
In contrast to feature-based attention mechanisms, object-based attention mechanisms focus on tangible objects within a scene, compute attention based on the saliency of the objects composing the scene, and determine which object “stands out” from the others. Sun and Fisher proposed an object-based attention mechanism, in which they compute attention for a set of objects within a manually segmented image at multiple scales (see literature reference no. 6). For example, at a coarse scale, a canoe on water can be segmented out; at a finer scale, the canoe and each passenger on the canoe can be segmented. The algorithm computes the saliency of each object based on features extracted from the scene and shifts attention between scales and between objects. The algorithm is limited, however, in that it requires that the scene be manually segmented before the attention can be computed. The inability to automatically segment the scene dramatically reduces the utility of that method. Not only is manual segmentation of a scene in this manner a tedious process, but it effectively eliminates any chance that the algorithm could be used to analyze foreign or novel imagery, such as that used for surveillance, since it requires that one have prior access to the images.
Orabona et al. developed an object-based approach that does not require manual segmentation (see literature reference no. 7). This method converts the input image into a log-polar representation and computes three feature maps, corresponding to receptive fields with red-center and green-surround, green-center and red-surround, and blue-center with yellow-surround. In addition to using the feature maps to compute the saliency of an object within the image, the algorithm processes the feature maps to find the feature contours and then combines these contour maps into a master feature map.
A watershed flooding algorithm is applied to the master feature map, which segments the image into “proto-objects,” corresponding to regions with uniform features. The system computes the saliency of a given proto-object as the magnitude of a vector whose elements correspond to the difference between the average feature densities of the proto-object and a rectangular region surrounding it, for a given feature. The system shifts attention between regions in the order of saliency, and applies inhibition to previously attended regions. Also proposed by Orabona et al. was a simple top-down mechanism that involves computing the saliency as the weighted sum of the top-down and bottom-up elements.
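The proto-object saliency computation described above can be illustrated, in a non-limiting way, by the sketch below. It is not the implementation of Orabona et al. The description does not specify how the surrounding rectangle is constructed, so the sketch assumes it is the bounding box of the proto-object expanded by a fixed margin, with the proto-object's own pixels excluded. The names proto_object_saliency and margin are hypothetical.

```python
import math

def proto_object_saliency(feature_maps, region, margin=1):
    """Magnitude of the per-feature difference between the mean feature
    value inside `region` and the mean over a surrounding rectangle.

    feature_maps: list of 2-D lists, one per feature.
    region: set of (row, col) pixels belonging to the proto-object.
    Assumption: the surround is the bounding box grown by `margin`
    pixels, excluding pixels that belong to the proto-object.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    rs = [r for r, _ in region]
    cs = [c for _, c in region]
    r0, r1 = max(0, min(rs) - margin), min(rows - 1, max(rs) + margin)
    c0, c1 = max(0, min(cs) - margin), min(cols - 1, max(cs) + margin)
    surround = [(r, c) for r in range(r0, r1 + 1)
                for c in range(c0, c1 + 1) if (r, c) not in region]
    if not surround:
        return 0.0  # the object fills the whole rectangle
    diffs = []
    for fmap in feature_maps:
        inside = sum(fmap[r][c] for r, c in region) / len(region)
        outside = sum(fmap[r][c] for r, c in surround) / len(surround)
        diffs.append(inside - outside)
    # Saliency is the Euclidean norm of the difference vector.
    return math.sqrt(sum(d * d for d in diffs))
```

Attention would then be shifted among proto-objects in decreasing order of this value, with previously attended regions inhibited.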
However, for practical applications, the implementation of the watershed algorithm over-segments the scene into regions that encode small fragments of objects. This problem becomes overwhelmingly clear in situations of natural lighting, where objects of a single color exhibit hue gradients across the object. This method also uses a small feature set and does not treat the intensity of the image as a separate feature, which limits the ability of the algorithm to find complex feature contours. The system of Orabona et al. can be extended to more complicated scenes with natural lighting, but would require extensive tuning of the watershed threshold to elicit even a marginal improvement. Also, the system only works at a single scale. Finally, the use of the log-polar transform of the scene as input creates an undesired blind spot for object boundaries near the center of the image. The log-polar transform spreads out objects near the middle of the scene, blurring the object boundaries. This obscures the feature contrast and makes it difficult for the edge-detection algorithm to find a feature contour. Since the watershed algorithm can only process closed feature contours, these objects are ignored, which is contrary to the preferred behavior of attending to all objects.
Thus, a continuing need exists for a visual attention and object segmentation system that can compute attention for a natural scene and attend to regions in a scene based on saliency rankings, extract the boundary of an attended proto-object based on feature contours to segment the attended object, and, if biased, can boost the attention paid to specific features in a scene, such as those of a desired target object in static and video imagery. Moreover, the system should perform at multiple scales of object extraction and possess a high degree of automation, requiring only minor parameter adjustment by the user to consider an entire class of scenes and account for subtle feature changes that can be caused by natural lighting in a scene when performing object segmentation.