(1) Field of Invention
The present invention relates to a system for multiple-object recognition in visual images and, more particularly, to a system for multiple-object recognition using keypoints generated from local feature algorithms in visual images.
(2) Description of Related Art
Accurate and robust recognition of objects in a cluttered natural scene remains one of the most difficult problems faced by the computer vision field. The primary issue is that the same object appears differently to a viewer depending on the viewing angle (azimuth and elevation), the distance of the viewer (which affects the perceived size of the object, that is, its scale), and whether it is partially occluded by other objects (and the degree of this occlusion). Human perception solves these problems with a minimum of effort. Based on a limited number of training views, a human being can learn enough about an object to accurately recognize it in each of these scenarios. Computerized recognition also faces the problem of maintaining a reasonably sized database that acts as a memory of trained objects. One must represent the training object in a minimalist way to provide adequate speed, but must also capture enough information to retain recognition accuracy.
A number of researchers have attempted the problem of recognition of multiple objects in a scene with varying degrees of success. The most robust algorithms tend to rely on local feature extraction, which employs a set of keypoints located at stable regions in the image to identify objects. Each keypoint is assigned the label of its nearest match from a training database, and the combination of these matches is used to identify the object. This method provides a great deal more robustness than alternative algorithms such as template matching as described by Hajdu and Pitas (see Literature Reference No. 8), which simply looks for exact copies of objects from the training set and cannot account for planar rotations and scaling of training objects. Because they use local features to perform object recognition, these algorithms are collectively known as “local feature algorithms” (LFAs). Non-limiting examples of LFAs include Speeded Up Robust Features (SURF) as described by Bay et al. (see Literature Reference No. 9), SIFT as described by Lowe (see Literature Reference No. 1), jets-(Gabor) based Features as described by Jimenez (see Literature Reference No. 1), and the “semilocal” features of Carneiro and Jepson (see Literature Reference No. 10), as well as a number of color variants of these algorithms as described by Abdel-Hakim and Farag (see Literature Reference No. 12). However, although LFAs can extract keypoints, how to process these keypoints remains an open problem.
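As a non-authoritative illustration of the nearest-match labeling step described above, the following Python sketch assigns each keypoint descriptor the label of its closest entry in a hypothetical training database (the `training_db` structure, descriptor format, and function name are assumptions for illustration, not part of any cited algorithm):

```python
import math

def nearest_label(descriptor, training_db):
    """Return the label of the training descriptor closest to the
    query keypoint descriptor under Euclidean distance.

    `training_db` is a hypothetical in-memory stand-in for a trained
    keypoint database: a list of (descriptor, label) pairs.
    """
    best_entry = min(training_db,
                     key=lambda entry: math.dist(descriptor, entry[0]))
    return best_entry[1]
```

In a practical LFA pipeline the descriptors would be high-dimensional (e.g., 128-dimensional for SIFT) and the linear scan above would be replaced by an approximate nearest-neighbor index for speed.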
Recognizing and segmenting multiple objects in a scene is a far more complex problem than the simple recognition of a single object, where the computer knows a priori that there is only one object in the frame. In that instance, one could simply give each keypoint a vote and identify the object according to the most votes. This method provides very accurate identification of objects. For a scene known to contain multiple objects, the recognition algorithm must localize clusters of keypoints belonging to a single object, compensate for outlier noise caused by misclassification of keypoints, and process keypoint clusters to find the optimal object boundary and pose based on the training data.
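The single-object voting scheme described above can be sketched in a few lines of Python; this is a minimal illustration assuming keypoints have already been matched to training labels, not an implementation of any cited system:

```python
from collections import Counter

def identify_single_object(keypoint_labels):
    """Majority vote over matched keypoint labels.

    Valid only under the single-object assumption discussed above:
    every keypoint is presumed to come from the same object, so the
    most frequent label wins and misclassified keypoints are simply
    outvoted rather than spatially localized.
    """
    label, _count = Counter(keypoint_labels).most_common(1)[0]
    return label
```

The sketch makes the limitation plain: with two objects in the frame, a single global vote can only ever report one of them.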
Multiple-object recognition and segmentation is the ability to correctly identify objects in a scene and identify multiple occurrences of the same object in the scene. While algorithms exist that can perform multiple-object segmentation, they cannot accommodate scenes in which the same object appears twice in different poses. These algorithms typically miss the second occurrence of the same object or recognize the same object multiple times. The closest prior art using local feature algorithms (LFAs) to perform recognition of multiple objects in a scene was described by Lowe (see Literature Reference Nos. 1 and 2), who employed the scale-invariant feature transform (SIFT) algorithm to extract keypoints from a natural scene and perform boundary segmentation and object recognition based on a population of inlier keypoints provided by the Generalized Hough transform and least-squares fitting. The work by Lowe provided an affine transform matrix that could be applied to the outline of a training image to provide a rough estimate of the object boundary. However, this work did not employ multiple views for each training object. Rather, keypoints were extracted for a single view of each of three training objects, and the test objects were not manipulated dramatically from their training pose. Lowe did, however, show that his algorithm is robust against partial occlusion of an object by another. He also showed an extremely limited application for multiple instances of the same object; however, this was carried out by simple clustering and would not extend to images where the identical objects are close to one another, or to objects that are likely to create many misclassified keypoints. In these cases, simple clustering algorithms would fail to separate the keypoint populations provided by separate identical objects.
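For reference, the final step of the segmentation described above, applying an affine transform matrix to the training-image outline to estimate the object boundary in the scene, can be sketched as follows (the 2x3 matrix layout and function name are assumptions for illustration, not Lowe's notation):

```python
def project_outline(affine, outline):
    """Map training-image outline vertices into the test scene using a
    2x3 affine matrix [[a, b, tx], [c, d, ty]], such as one produced
    by a least-squares fit over inlier keypoint correspondences.
    """
    (a, b, tx), (c, d, ty) = affine
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in outline]
```

The 2x3 form captures rotation, scaling, shear, and translation in one matrix; the identity matrix with zero translation returns the training outline unchanged.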
Work by Zickler and Veloso (see Literature Reference No. 3) successfully finds multiple instances of the same object using clustering algorithms on sets of keypoints extracted using Principal Components Analysis (PCA) based representation for local features (PCA-SIFT). Their work demonstrates a great deal of success at labeling the centers of multiple objects in a scene. However, this algorithm does not extract any information about the pose of an object and cannot provide an object boundary. Additionally, this clustering algorithm exhibits a significant false positive rate at recognition rates greater than ninety percent. Similarly, the work of Murphy-Chutorian et al. (see Literature Reference No. 4) also successfully labels the centroids of multiple objects in a scene using a biologically-inspired recognition algorithm that shows a much lower false positive rate than Zickler and Veloso on a much larger training database. However, the biologically-inspired recognition algorithm also does not provide an object boundary or pose information and simply places a label at the center of the object.
Additionally, prior art exists that uses kernel density estimation to perform object recognition. For example, work by Moss and Hancock (see Literature Reference No. 5) and Chen and Meer (see Literature Reference No. 6) describes the use of kernel density estimation for computer vision, but not specifically for object recognition, and neither work uses feature keypoints. The work by Moss and Hancock performs object alignment, while the work by Chen and Meer is used to recover structures from heavily corrupted data.
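For context, the basic tool underlying the works cited above, a Gaussian kernel density estimate, can be written as a short sketch; the one-dimensional restriction and default bandwidth are simplifications for illustration only:

```python
import math

def gaussian_kde(samples, x, bandwidth=1.0):
    """Evaluate a 1-D Gaussian kernel density estimate at point x:
    the average of Gaussian bumps of width `bandwidth` centered on
    each sample.  Regions dense in samples yield high density.
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                      for s in samples)
```

Because density peaks track concentrations of samples, estimates of this kind can localize modes in noisy data without a predetermined cluster count, which is what makes the technique attractive for keypoint populations.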
Each of the prior methods discussed above exhibits limitations that make it incomplete. This is due to various problems with their theoretical structure that limit their effectiveness in practice, especially with regard to recognition of multiple instances of the same object in a scene. For example, the method proposed by Lowe (see Literature Reference No. 2) to recognize and segment multiple objects assumes that a single instance of each object occurs in each scene, and does not contain a mechanism to remove keypoints belonging to a segmented object (e.g., in a scene with two bottles, the prior art will only recognize one of them). The multiple-object recognition schemes of Murphy-Chutorian et al. (see Literature Reference No. 4) and Zickler and Veloso (see Literature Reference No. 3) can recognize multiple instances of the same object, but only place an object label at the centroid of the keypoints; they do not perform segmentation or extract pose information from the object, which limits their practical application in industry. In general, each of the prior art methods addresses a specific facet of the multiple-object recognition problem, but none of them solves the essential problem of multiple-object recognition and segmentation.
The problem could potentially be remedied by clustering algorithms; however, the most prominent algorithms, such as k-means clustering, require the number of clusters to be known beforehand. Since one does not know the specific number of each object in the scene prior to identification, this method will not work. More robust methods of clustering that do not require a predetermined number of clusters exist, such as the X-means algorithm described by Pelleg and Moore (see Literature Reference No. 7), but they run too slowly for efficient object recognition.
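The limitation noted above is visible even in a minimal Lloyd's-style k-means sketch: the cluster count k must be chosen before any data are seen. The one-dimensional implementation below is illustrative only (the random seeding and iteration count are arbitrary assumptions):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal 1-D Lloyd's k-means.  Note that k -- here standing in
    for the number of objects in the scene -- must be supplied up
    front, which is exactly what is unknown prior to identification.
    """
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest current center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its assigned points,
        # keeping the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```

With k fixed at the wrong value, the algorithm cannot report the correct number of objects no matter how well it converges, which is why a predetermined cluster count is disqualifying for this application.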
Thus, a continuing need exists for a system that correctly identifies objects in a visual scene, provides boundary and pose information for each object, and can identify multiple occurrences of the same object in the scene in an efficient manner.