The “Rapid Serial Visual Presentation” (RSVP) protocol has recently been discovered as a useful tool for high-throughput filtering of images into simple “target” and “nontarget” categories. See S. Thorpe, D. Fize, and C. Marlot, in Speed of Processing in the Human Visual System. Nature, vol. 381, pp. 520-522 (1996). The RSVP protocol involves displaying small images (e.g., at 256-by-256 resolution), called “chips,” to a human subject at a very high frame rate (e.g., 10 Hertz) and measuring the electrical activity of the subject's brain using electroencephalograph (EEG) technology.
When a target image is shown to the subject, even at these high speeds, the brain perceives the target chip as different from the others and registers a “surprise,” which translates into a specific brainwave, dubbed the “P300,” which occurs at a specific, fixed time delay from the presentation of the image. A P300 is far more reliable than voluntary subject responses, such as a button press, which have varied delays. The chips that are perceived as nontargets are considered “boring” and do not elicit a P300. Therefore, the presence of a P300 signal is a valuable discriminator between what the subject considers a “surprising” versus a “boring” chip.
The concept of “targets” vs. nontargets can be extended to “Items of Interest” (IOI) vs. non-interesting items, as described in U.S. patent application Ser. No. 12/316,779, filed on Dec. 16, 2008, entitled, “Cognitive-neural method for image analysis,” which is incorporated by reference as though fully set forth herein. These items of interest are generally objects, groups of objects, or spatial patterns in images and video that are of interest to the user (observer). Such items of interest are also usually application-specific. For example, an image analyst looking for a helipad in wide-area satellite imagery will consider the helipad to be the “target” or “item of interest.” Likewise, a different image analyst looking for a convoy of moving vehicles in wide-area satellite imagery will consider such a spatio-temporal pattern to be the IOI for that application.
The P300 occurs prior to the activation of higher-level processes in the brain that identify and classify the target, but is not a “subliminal” process; the subject generally realizes that a target was viewed, but does so much more slowly than the brain produces a P300. The RSVP method captures the inherent efficiency of lower-level responses in the subject's brain.
Research has shown that even at these speeds, the human brain performs admirably well at differentiating between “target” and “nontarget” images, and is far more efficient than if the subject had manually inspected and sorted the chips. See Thorpe (1996); and Gerson, A. D., Parra, L. C., and Sajda, P., in Cortically Coupled Computer Vision for Rapid Image Search. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2): 174-179 (2006). RSVP has been used in a variety of applications, particularly those in which the subject is instructed to find targets in a sparse environment; for example, the subject might be instructed to look for buildings in satellite imagery over a desert.
As one might expect, measuring an involuntary response from a human brain poses difficulties that must be addressed. In particular, the RSVP paradigm does not allow the subject to classify chips specifically into “target” and “nontarget” bins. Rather, the chips are actually sorted into “surprising” and “boring” bins based on the presence of the P300 signal. While the typical experimental procedure for an RSVP session involves randomizing the chips, if the sequence contains a series of high-contrast chips or chips whose features are very different, an experiment can evoke a false P300 signal from nontarget images by “jarring” the visual field between dissimilar images. For example, a ground-based image might contain sky, background, foreground, and middle ground, each of which exhibits dramatically different colors, lighting, scales, and textures. A chip sequence consisting of a number of foreground images followed by a single sky image could easily produce a P300 based on the surprise of rapidly shifting from one set of image features to another. This false signal masks the surprise produced by actual targets and increases the false alarm rate.
In practice, using RSVP to analyze ground-based images presents a number of hazards that can cause the subject to exhibit a P300 neural signal without viewing a target. As noted above, the P300 signal occurs as the result of “surprise,” which can be the result of seeing a target in an image, but can also occur from rapid exposure to images that have a high contrast to one another, such as an image of the dark ground followed by an image of the bright sky. One way to reduce such “jarring” is to place similar images next to one another in the sequence.
Methods currently exist to sequence images according to their similarity. These algorithms create generally smooth sequences that nevertheless contain a handful of bad transitions that can derail an RSVP experiment, which requires precision in the image ordering. For example, the problem of computing a sequence of images whose distances from one another are minimized is analogous to the “travelling salesman” problem. See wikipedia.org/Travelling_salesman_problem. The travelling salesman problem is NP-hard: no known algorithm is guaranteed to find the optimal sequence in polynomial time, and exhaustively testing every possible image sequence is a complex and time-consuming process.
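As a rough illustration of why exhaustive search scales poorly, the following Python sketch scores every possible ordering of a tiny set of chips. It is a minimal sketch under stated assumptions, not a method taken from the cited art: each chip is assumed to be summarized by a single hypothetical feature value, and the names `sequence_cost`, `best_sequence_exhaustive`, and `count_orderings` are illustrative only.

```python
import itertools
import math

def distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sequence_cost(order, features):
    """Sum of the distances between consecutive chips in a sequence."""
    return sum(distance(features[a], features[b]) for a, b in zip(order, order[1:]))

def best_sequence_exhaustive(features):
    """Try every ordering and keep the one with the lowest total cost."""
    names = sorted(features)
    return min(itertools.permutations(names),
               key=lambda order: sequence_cost(order, features))

def count_orderings(n):
    """Number of distinct orderings of n chips (grows as n factorial)."""
    return math.factorial(n)

# Hypothetical one-dimensional features for four chips (assumed values).
features = {"a": (0.0,), "b": (2.0,), "c": (3.0,), "d": (7.0,)}
best = best_sequence_exhaustive(features)
print(best, sequence_cost(best, features))
print(count_orderings(10))
```

Four chips require scoring only 24 orderings, but ten chips already require 3,628,800, which is why exhaustive search is impractical for the image blocks used in an RSVP session.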
Another solution to the “jarring” problem is in the field of content-based image retrieval (CBIR). See Smeulders, A., Worring, M., Santini, S., Gupta, A., and Jain, R., Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on PAMI. 22(12): 1349-1380 (2000). CBIR permits image searching based on features automatically extracted from the images themselves. This field has been motivated by the need to efficiently manage large image databases and run image retrievals without exhaustive searches of the image archive each time. The system compares the features of the selected image with the features of the other images in the set and returns the most similar images. Typically, this is done by computing, for each image, a vector containing the values of a number of attributes and computing the distance between image feature vectors. Many different features and combinations have been used in CBIR systems. Color retrieval yields the best results, in that the computed color similarity agrees closely with the similarity judged by the human visual system. See Rogowitz, B. E., Frese, T., Smith, J., Bouman, C. A., and Kalin, E., Perceptual Image Similarity Experiments. Proceedings of SPIE, 3299: 576-590 (1998). Other features include texture, shape, bio-inspired features, et cetera. The best image matches are typically returned and displayed to the user in order of increasing computed distance, that is, most similar first.
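The feature-vector comparison described above can be sketched in Python as follows. This is only an illustrative sketch: the coarse RGB histogram, the Euclidean distance, the function names, and the toy pixel data are all assumptions chosen for brevity, not the features or metric of any particular CBIR system.

```python
import math

def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized, coarse RGB histogram as a flat feature vector.

    `pixels` is a list of (r, g, b) tuples with channel values in 0-255.
    """
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins_per_channel-1.
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
    total = len(pixels)
    return [h / total for h in hist] if total else hist

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_by_similarity(query_name, features):
    """Return the other image names ordered from most to least similar."""
    query = features[query_name]
    others = [name for name in features if name != query_name]
    return sorted(others, key=lambda name: euclidean(query, features[name]))

# Toy chips (assumed data): two bluish "sky" chips and one brownish "ground" chip.
chips = {
    "sky_a": [(90, 150, 250)] * 16,
    "sky_b": [(100, 160, 240)] * 16,
    "ground": [(60, 40, 20)] * 16,
}
features = {name: color_histogram(px) for name, px in chips.items()}
ranking = rank_by_similarity("sky_a", features)
print(ranking)
```

With these toy chips, the second sky chip falls into the same histogram bin as the query and is ranked ahead of the ground chip. A practical system would likely combine color with texture, shape, or other features.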
While CBIR could be naively applied to image ordering for RSVP, this would pose a number of difficulties. For a block of images to be ordered for RSVP, one could determine the feature set of each and load them into the CBIR database. Starting from an arbitrary image, one could find the closest match, then the closest match to that match, and so on, until all images have been queued. This procedure is equivalent to using the “nearest neighbor” heuristic for solving the travelling salesman problem. However, this algorithm does not guarantee the optimal result, and can actually provide the worst possible result depending on the dataset and the first image selected. See Gutin, G., Yeo, A., and Zverovich, A., Traveling Salesman Should Not be Greedy: Domination Analysis of Greedy-Type Heuristics for the TSP. Discrete Applied Mathematics. 117: 81-86 (2002).
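The greedy procedure just described, and its sensitivity to the starting image, can be illustrated with a short Python sketch. The one-dimensional feature values and the function names are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math

def distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor_order(features, start):
    """Greedy ordering: repeatedly append the unused chip closest to the last one."""
    order = [start]
    remaining = set(features) - {start}
    while remaining:
        last = features[order[-1]]
        # Iterating in sorted order makes tie-breaking deterministic.
        nxt = min(sorted(remaining), key=lambda name: distance(last, features[name]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def transition_costs(order, features):
    """Distance of each consecutive transition in the sequence."""
    return [distance(features[a], features[b]) for a, b in zip(order, order[1:])]

# Hypothetical one-dimensional features for four chips (assumed values).
features = {"a": (0.0,), "b": (2.0,), "c": (3.0,), "d": (7.0,)}

order_from_a = nearest_neighbor_order(features, "a")
order_from_b = nearest_neighbor_order(features, "b")
print(order_from_a, transition_costs(order_from_a, features))
print(order_from_b, transition_costs(order_from_b, features))
```

Starting from chip "a" gives the sequence a, b, c, d with a total cost of 7, whereas starting from chip "b" gives b, c, a, d with a total cost of 11 and a final transition of 7, which is the kind of single bad transition that could jar the viewer.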
The prior art for user relevance feedback (i.e., supervised learning) in CBIR systems primarily focuses on whether the images returned by the algorithm are similar to a seed image. See Morrison, D., Marchand-Maillet, S., and Bruno, E., Semantic Clustering of Images Using Patterns of Relevance Feedback, in Proceedings of the 6th International Workshop on Content-based Multimedia Indexing (CBMI 2008), London, UK (2008). This involves running the computer algorithm to find a candidate match for an image, and then allowing the user to answer affirmatively or negatively regarding the similarity of the image. This deviates sharply from the present invention because it does not address the issue of image sequencing or of determining the relative similarity of images that may, in fact, be very similar to one another. The CBIR prior art has no notion of ordering of the images as in the present invention.
Each of the prior art methods discussed above exhibits limitations that make it incomplete. For example, the prior art does not directly address the problem of ordering images specifically for the RSVP paradigm and, as such, produces results that are unacceptable for the application.
Further, simple metrics for determining image distance fail to sequence the images properly (according to human perception) based solely on distance. While an image distance metric can objectively order images according to some mathematical formula, the application to RSVP for an EEG study requires that the images be presented in a perceptibly smooth manner. Often, the optimal sequence from an objectively determined distance metric will still contain image sequences that exhibit a jarring effect, again, providing an unacceptable result.
Thus, a continuing need exists for an image ordering system that employs subjective feedback from a human viewer for rapid serial visual presentation to detect items of interest in images and video.