(1) Field of Invention
The present invention relates to a system for object detection from dynamic visual imagery and, more particularly, to a system for object detection from dynamic visual imagery that fuses surprise-based object detection with motion processing based object detection.
(2) Description of Related Art
Several systems exist that detect anomalies in a scene, such as for surveillance applications. Such systems typically employ an attention algorithm (i.e., saliency) as a first pass that analyzes the entire scene and extracts regions of interest, which are typically advanced to some other system for analysis and/or identification. The attention algorithm is also often used as a seed to a “surprise” algorithm that finds salient regions of interest that change and, therefore, could be a potential anomaly in the field-of-view (FOV). The attention and surprise algorithms must be very accurate, as any event that is not detected by these algorithms cannot be analyzed by subsequent algorithms in the system. However, since attention/surprise must be computed on every pixel in the imagery from a wide FOV, the attention/surprise algorithm requires a large amount of computational resources and often requires dedicated hardware to run. Depending on the hardware used to run the algorithm, the requirement may either exceed the available resources of the hardware or require larger and more complex hardware (e.g., multiple field-programmable gate arrays (FPGAs)), which results in bulky and high-power systems.
The pure surprise algorithms (both feature- and object-based) are incomplete because they yield poor results when applied to video imagery of a natural scene. Artifacts from ambient lighting and weather often produce dynamic features that can throw off a saliency algorithm, causing it to conclude that “everything is salient”. Mathematically, it may be the case that everything in the scene is salient; however, when a system is tasked with a specific purpose, such as surveillance, one is interested only in legitimate short-term anomalies that are likely to be targets. Therefore, simple saliency systems cannot provide the service that the current invention does.
An alternative approach to “pure surprise” is to use a full surprise algorithm. Full surprise algorithms employ a great deal of additional processing on the features in each frame of the video and create statistical models that describe the scene. If anything unexpected happens, the surprise algorithm is able to return the location of the unexpected event. The closest known prior art to this invention is the surprise algorithm of Itti and Baldi (see the List of Cited Literature References, Literature Reference No. 2). The work of Itti and Baldi employs a Bayesian framework and the features that contribute to the saliency map to construct a prior distribution for the features in the scene. The current saliency map is used as the seed for a “posterior” distribution. The algorithm uses the Kullback-Leibler divergence (or KL distance) between the prior and posterior as the measure of surprise. KL distance is a non-symmetric measure of the difference between two probability distributions. Because the algorithm takes the entire history of the scene into account, it exhibits a much lower false alarm rate than a system that uses saliency exclusively.
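The surprise measure described above can be illustrated with a minimal sketch. The following is not the Itti and Baldi implementation; it is a simplified, hypothetical example assuming each feature is modeled as a discrete histogram, with the scene history providing the prior and the current frame providing the posterior:

```python
import math

def kl_divergence(prior, posterior, eps=1e-12):
    """Kullback-Leibler divergence D(posterior || prior) between two
    discrete probability distributions (lists of values summing to 1).
    eps guards against log(0); KL distance is non-symmetric, so
    D(posterior || prior) generally differs from D(prior || posterior)."""
    return sum(q * math.log((q + eps) / (p + eps))
               for p, q in zip(prior, posterior))

# Surprise at one location: compare the historical (prior) feature
# histogram against the current frame's (posterior) histogram.
prior = [0.25, 0.25, 0.25, 0.25]      # scene history: roughly uniform
posterior = [0.70, 0.10, 0.10, 0.10]  # current frame: a peak appeared
surprise = kl_divergence(prior, posterior)  # large value -> unexpected event
```

A location whose feature statistics match the scene history yields a KL distance near zero, while a sudden change yields a large value; thresholding this value flags candidate anomalies.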
However, as one might expect from the description of the algorithm, the Itti and Baldi surprise algorithm (see Literature Reference No. 2) is very complicated and computationally expensive. It was designed to run on very high-end computer hardware, and even then cannot currently run in real time on high-resolution video imagery. The computer hardware it runs on is bulky and power-consuming, which prevents its use on a mobile platform. Furthermore, the complexity of the algorithm largely prevents it from being ported to low-power hardware, which is essential for deployment on a mobile platform.
Furthermore, several other researchers have shown interest in systems that compute the saliency of a scene. Two particular theories have emerged from this interest: attention that focuses on features at specific loci in the scene, and attention that focuses on the saliency of specific objects or colored “blobs” in a scene. These methods are called feature-based and object-based attention, respectively.
Feature-based algorithms work at the pixel level and compute attention based on the saliency of specific locations within the scene. The attention work of Itti and Koch (see Literature Reference No. 3) employs such an approach, which computes attention by constructing a saliency map from a set of biologically inspired features extracted from the image. The algorithm breaks apart the image into a set of Gaussian pyramids corresponding to color, intensity, and orientation at a series of scales, which are combined across scales and merged into the saliency map. The system attends to the point that corresponds to the maximum value in the saliency map, applies inhibition of return to that location, and shifts to the next most salient point. This process continues until the program attends to a maximum number of locations, or the user terminates the program. The most significant problem with the method described in Literature Reference No. 3 is its inefficiency; in spite of the inhibition-of-return mechanism, the algorithm routinely returns to previously attended locations if a better point is not found. The algorithm lacks the type of queuing system required to efficiently attend to every location in the image. Some groups have also found that this algorithm does not exhibit invariance to the orientation of a scene: when the algorithm analyzes the same image twice, one being slightly rotated relative to the other, it returns dramatically different salient regions for each.
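The attend-and-inhibit loop described above can be sketched as follows. This is a simplified, hypothetical illustration (not the implementation of Literature Reference No. 3): it assumes the saliency map has already been computed, and it models inhibition of return by zeroing a square neighborhood around each attended location:

```python
def attend(saliency, n_fixations=3, radius=1):
    """Repeatedly attend to the maximum of a 2D saliency map (a list of
    lists of floats), applying inhibition of return by suppressing a
    neighborhood of side 2*radius+1 around each attended location.
    Returns the list of (row, col) fixation points in attention order."""
    s = [row[:] for row in saliency]  # copy so the input map is untouched
    rows, cols = len(s), len(s[0])
    fixations = []
    for _ in range(n_fixations):
        # locate the current global maximum of the (partially inhibited) map
        r, c = max(((i, j) for i in range(rows) for j in range(cols)),
                   key=lambda rc: s[rc[0]][rc[1]])
        fixations.append((r, c))
        # inhibition of return: zero out the attended neighborhood
        for i in range(max(0, r - radius), min(rows, r + radius + 1)):
            for j in range(max(0, c - radius), min(cols, c + radius + 1)):
                s[i][j] = 0.0
    return fixations
```

The inefficiency noted above arises because the suppressed values eventually lose out only to other nonzero points; once the map is largely inhibited, the argmax can revisit regions near earlier fixations rather than queuing unvisited locations.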
The work described in Literature Reference No. 12 employs motion processing using a codebook in order to extract anomalous motion patterns from a series of frames in video. The work described in Literature Reference No. 12 is capable of processing motion information from dynamic imagery. However, the blobs it returns are sometimes incomplete, leaving out pieces of objects.
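The codebook idea can be illustrated with a minimal per-pixel sketch. This is a hypothetical simplification, not the method of Literature Reference No. 12: each pixel keeps a list of codewords, here reduced to [lo, hi] intensity ranges learned from background frames, and a pixel whose current value matches no codeword is flagged as anomalous motion:

```python
def match_or_learn(codebook, value, tol=10):
    """Match an intensity value against a per-pixel codebook (a list of
    [lo, hi] ranges). A matching codeword is widened to absorb the value
    and False (not anomalous) is returned; otherwise a new codeword is
    created and True (anomalous / previously unseen) is returned."""
    for word in codebook:
        if word[0] - tol <= value <= word[1] + tol:
            word[0] = min(word[0], value)  # extend the matched range
            word[1] = max(word[1], value)
            return False
    codebook.append([value, value])        # unseen value: new codeword
    return True
```

During a training pass over background frames, every pixel builds up codewords for its normal appearance; during detection, pixels that return True are grouped into blobs. Because blobs are formed only from pixels that fail every codeword, object parts whose intensities happen to match the background model are omitted, which is consistent with the incomplete blobs noted above.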
Each of the prior methods discussed above exhibits limitations that make it incomplete, owing both to the problems it exhibits when applied to natural scenes and to its theoretical structure. Thus, a continuing need exists for an improved system for finding surprise-based objects of interest in dynamic scenes.