(1) Field of Invention
The present invention relates to a system for object detection from dynamic visual imagery and, more particularly, to a system for object detection from dynamic visual imagery that utilizes motion processing and biologically-inspired segmentation based on feature contours.
(2) Description of Related Art
Many techniques exist for finding objects of interest in imagery and segmenting such objects. For instance, several researchers have shown interest in systems that compute the saliency of a scene. Two particular theories have emerged from this interest: attention that focuses on features at specific loci in the scene, and attention that focuses on the saliency of specific objects or colored “blobs” in a scene. These methods are called feature-based and object-based attention, respectively. Researchers have found evidence that supports each of these, and both methods have exhibited utility for a variety of applications.
Feature-based attention works at the pixel level and computes attention based on the saliency of each specific location within the scene. The attention work of Itti and Koch (see List of Cited Literature References, Literature Reference No. 4) employs such an approach, which computes attention by constructing a saliency map from a set of biologically inspired features extracted from the image, and has been modified to incorporate top-down biasing of the attention (see Literature Reference Nos. 6 and 7). The algorithm decomposes the image into a set of Gaussian pyramids corresponding to color, intensity, and orientation at a series of scales, which are combined across scales and merged into the saliency map. The system attends to the point that corresponds to the maximum value in the saliency map, applies inhibition of return to that location, and shifts to the next most salient point. This process continues until the program attends to a maximum number of locations, or the user terminates the program.
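The attend-and-inhibit loop described above can be sketched as follows. This is a minimal illustration of the general mechanism, not the implementation of Literature Reference No. 4; the function name, the inhibition neighborhood, and the toy saliency map are invented for clarity.

```python
import numpy as np

def attend_with_inhibition(saliency, n_fixations=3, inhibit_radius=1):
    """Attend to the most salient points in turn, suppressing each
    attended location (inhibition of return) before the next shift."""
    s = saliency.astype(float).copy()
    fixations = []
    for _ in range(n_fixations):
        # Winner-take-all: attend to the maximum of the saliency map.
        y, x = np.unravel_index(np.argmax(s), s.shape)
        fixations.append((int(y), int(x)))
        # Inhibition of return: suppress a neighborhood around the winner.
        y0, y1 = max(0, y - inhibit_radius), y + inhibit_radius + 1
        x0, x1 = max(0, x - inhibit_radius), x + inhibit_radius + 1
        s[y0:y1, x0:x1] = -np.inf
    return fixations

# Toy saliency map with two salient peaks.
sal = np.zeros((5, 5))
sal[1, 1] = 0.9   # strongest peak
sal[3, 3] = 0.5   # second peak
print(attend_with_inhibition(sal, n_fixations=2))  # [(1, 1), (3, 3)]
```

Note that if the suppressed values eventually decay back (rather than remaining suppressed, as in this sketch), the loop can revisit previously attended locations, which is precisely the inefficiency criticized below.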
An extension of the attention mechanisms described above was developed and described in Literature Reference No. 13, which segments the region around the most salient point, returning a region of interest instead of a simple point in the image. The most significant problem with the method described in Literature Reference Nos. 4, 6, and 7 is its inefficiency; in spite of the inhibition of return mechanism, the algorithm routinely returns to previously attended locations if a better point is not found. The algorithm lacks the type of queuing system required to efficiently attend to every location in the image. Some groups (see Literature Reference No. 1) have also found that this algorithm does not exhibit invariance to the orientation of a scene: when the algorithm analyzes the same image twice, one being slightly rotated relative to the other, it returns dramatically different salient regions for each. The algorithm described in Literature Reference No. 13 exhibits its own limitations. In particular, the segmentation employed, while feature-based, appears arbitrary when actually applied and does not correspond to actual objects or blobs. These attention systems above were also outlined in detail in U.S. Pat. No. 6,670,963, entitled, “Visual Attention Model” and U.S. Patent Publication No. 2005/0047647, entitled, “System and Method for Attentional Selection”, which are hereby incorporated by reference as though fully set forth herein.
The work of Itti et al. and Walther (see Literature Reference Nos. 4-7 and 13) also processes motion cues using the same mechanism. Their methods use a frame differencing approach in which the intensity channels are compared at directional offsets in the up, down, left, right, and in-place (i.e., no offset, called "flicker") directions. This allows the system to detect motion in the frame. However, this motion is not related to any of their color channels. Furthermore, this method does not create a background model, so the system cannot "learn" what motion in the scene is normal and what motion is anomalous. As a result, the algorithms described in Itti et al. and Walther (see Literature Reference Nos. 4-7 and 13) are very noisy and return many false alarms from nonthreatening moving objects (e.g., trees).
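The directional frame-differencing scheme above can be sketched as follows. This is a simplified illustration under assumed conventions (single-channel intensity frames, a one-pixel offset, wrap-around shifting via `np.roll`); the function name and parameters are invented for clarity.

```python
import numpy as np

def directional_frame_difference(prev, curr, offset=1):
    """Compare the current intensity frame against the previous frame
    shifted up, down, left, right, and in place ("flicker")."""
    shifts = {
        "flicker": (0, 0),        # no offset: pure temporal change
        "up":      (-offset, 0),
        "down":    (offset, 0),
        "left":    (0, -offset),
        "right":   (0, offset),
    }
    maps = {}
    for name, (dy, dx) in shifts.items():
        shifted = np.roll(prev.astype(float), shift=(dy, dx), axis=(0, 1))
        maps[name] = np.abs(curr.astype(float) - shifted)
    return maps

# Two identical frames produce no flicker response.
prev = np.zeros((4, 4))
curr = prev.copy()
maps = directional_frame_difference(prev, curr)
print(maps["flicker"].max())  # 0.0
```

Because each map responds to any intensity change in its direction, with no background model to discount recurring motion, every waving branch produces a response, which is the source of the false alarms noted above.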
Object-based approaches to attention focus on tangible objects within the scene, and compute attention based on the saliency of the objects composing the scene, determining which “stands out” from the others. The authors of Literature Reference No. 12 proposed an object-based attention mechanism, in which they compute attention for a set of objects within a manually segmented image at multiple scales. For example, at a coarse scale, they segment out a canoe on the water; at a finer scale, they segment the actual canoe, as well as each of the passengers on the canoe. The algorithm computes the saliency of each object based on features extracted from the scene and shifts attention between scales and between objects, but handicaps itself with its requirement that the scene be manually segmented before the attention can be computed.
The authors of Literature Reference No. 10 developed an object-based approach that does not require manual segmentation. Their method computes three feature maps from the image and uses them to compute the segmentation using a watershed flooding process, which segments the image into "proto-objects" corresponding to regions with uniform features. The system computes the saliency of each proto-object from the feature maps as well. The system shifts attention between regions in the order of saliency, and applies inhibition to previously attended regions. This group also proposed a simple top-down mechanism for biasing attention. However, for practical applications, their implementation of the watershed process over-segments the scene into regions that encode small fragments of objects, rather than complete blobs or proto-objects. Its small feature set and lack of intensity information limit the ability of the process to find complex feature contours. Finally, the use of the log-polar transform of the scene as input creates an undesired blind spot for object boundaries near the center of the image. The log-polar transform spreads out objects near the middle of the scene, blurring the object boundaries. This obscures the feature contrast and makes it difficult for the edge-detection process to find a feature contour. Since the watershed process can only process closed feature contours, these objects are ignored.
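The grouping of uniform-feature regions into proto-objects can be sketched as follows. This is a simplified stand-in for a watershed flooding process, not the algorithm of Literature Reference No. 10: it flood-fills 4-connected regions of identical feature value, and the function name and toy feature map are invented for clarity.

```python
from collections import deque
import numpy as np

def proto_object_regions(feature_map):
    """Group connected pixels with uniform feature values into
    proto-object regions via flood fill (a simplified stand-in
    for watershed flooding)."""
    h, w = feature_map.shape
    labels = -np.ones((h, w), dtype=int)  # -1 means unlabeled
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] != -1:
                continue
            # Flood-fill the 4-connected region sharing this feature value.
            value = feature_map[sy, sx]
            queue = deque([(sy, sx)])
            labels[sy, sx] = next_label
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny, nx] == -1
                            and feature_map[ny, nx] == value):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels, next_label

# Toy feature map with three uniform regions.
fm = np.array([[0, 0, 1],
               [0, 1, 1],
               [2, 2, 2]])
labels, n = proto_object_regions(fm)
print(n)  # 3
```

A real watershed implementation floods from local minima by increasing feature level, so noisy feature maps with many spurious minima produce the over-segmentation into small fragments criticized above.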
The work described in Literature Reference No. 19 employs motion processing using a codebook in order to extract anomalous motion patterns from a series of frames in video. The object-based attention work described in Literature Reference No. 20, which improved upon the work of Literature Reference No. 10, extracts proto-objects from static imagery. However, it consists of a simple saliency algorithm and does not work on video imagery. Further, the work described in Literature Reference No. 19 is capable of processing motion information from dynamic imagery. However, the blobs it returns sometimes leave out pieces of objects or are incomplete.
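A per-pixel codebook background model of the kind referenced above can be sketched as follows. This is a minimal illustration of the general idea, not the method of Literature Reference No. 19: each pixel accumulates intensity codewords during training, and a test pixel is flagged as anomalous when it matches no codeword within a tolerance. The function names and the tolerance parameter are invented for clarity.

```python
import numpy as np

def build_codebook(training_frames, tol=10):
    """Per-pixel codebook: record intensity codewords seen during
    training; values within `tol` of an existing codeword are merged."""
    h, w = training_frames[0].shape
    codebook = [[[] for _ in range(w)] for _ in range(h)]
    for frame in training_frames:
        for y in range(h):
            for x in range(w):
                v = float(frame[y, x])
                words = codebook[y][x]
                if not any(abs(v - cw) <= tol for cw in words):
                    words.append(v)  # new codeword for this pixel
    return codebook

def anomalous_mask(frame, codebook, tol=10):
    """Flag pixels whose value matches no background codeword."""
    h, w = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            v = float(frame[y, x])
            mask[y, x] = not any(abs(v - cw) <= tol
                                 for cw in codebook[y][x])
    return mask

# Train on a static background, then present one changed pixel.
train = [np.zeros((3, 3)), np.zeros((3, 3))]
cb = build_codebook(train)
test_frame = np.zeros((3, 3))
test_frame[1, 1] = 100
mask = anomalous_mask(test_frame, cb)
print(mask.sum())  # 1
```

Because the codebook encodes what each pixel normally looks like, recurring background motion is absorbed into the codewords, while genuinely anomalous motion stands out; the connected blobs of flagged pixels are what such a method returns, sometimes incompletely, as noted above.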
Each of the prior methods discussed above exhibits limitations that make it incomplete, owing both to its limited applicability to natural scenes and to its theoretical structure. The prior work in Literature Reference Nos. 19 and 20 is also improved upon by the innovations disclosed herein. In particular, while the work of Literature Reference No. 20 was a major improvement, it is not well-suited to dynamic imagery that requires a high frame rate; the processing involved in computing the saliency for a sequence of frames in video imagery is prohibitively expensive (i.e., incapable of running at the 20-30 frames per second (fps) required of real-time video analysis systems), since processing the watershed regions is a computationally expensive procedure. This algorithm also employed pure saliency, which does not learn patterns from the scene over time.
Thus, a continuing need exists for a system for object detection and segmentation in dynamic imagery that is robust, efficient, and produces accurate segmentation of a scene.