From consumer digital cameras for photography enthusiasts to high-end computer vision systems, digital imaging is a fast-growing technology that is becoming an integral part of everyday life. In its most basic definition, a digital image is a computer-readable representation of an image of a subject or object taken by a digital imaging device, e.g., a camera, video camera, or the like. A computer-readable representation, or digital image, typically includes a number of pixels arranged in an image file or document according to one of many available graphic formats. For example, some graphic file formats include, without limitation, bitmap, Graphics Interchange Format (GIF), Joint Photographic Experts Group (JPEG) format, and the like. An object is anything that can be imaged, e.g., photographed, videotaped, or the like. In general, an object may be an inanimate physical object or part thereof, a person or a part thereof, a scenic view, an animal, or the like. An image of an object is typically acquired under viewing conditions that, to some extent, make the image unique. In imaging, viewing conditions typically refer to the relative orientation between the camera and the object (i.e., the pose), and the external illumination under which the images are acquired.
Given a collection of images of 3-dimensional objects, where the observer's viewpoint has varied between images, one may wish to cluster the images, i.e., group them according to the identity of the objects. Solving this problem requires understanding how the images of an object vary under different viewing conditions; the goal of a clustering method or algorithm is then to detect consistent patterns among the images. One conventional computer vision approach to solving this problem utilizes some kind of image feature extraction, e.g., texture, shape, filter bank outputs, etc. This approach is described in B. L. Saux and N. Boujemaa, “Unsupervised robust clustering for image database categorization,” International Conference on Pattern Recognition, Volume 1 (2002); and in H. Frigui et al., “Unsupervised clustering and feature discrimination with application to image database categorization,” Joint 9th IFSA World Congress and 20th NAFIPS Conference (2001), which are incorporated herein in their entirety. The underlying assumption of this approach is that some global or local image properties of a 3D object persist over a wide range of viewing conditions. A shortcoming of this approach is that it is usually difficult to extract these features reliably and consistently.
Appearance-based approaches utilize a different strategy for tackling the clustering problem, whereby image feature extraction no longer plays a significant role. Instead, the concept of geometric relations among images in the image space is central, and the primary analytical paradigm is the appearance manifold. Descriptions of these concepts can be found in R. Basri et al., “Clustering appearances of 3D objects,” Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (1998), in A. W. Fitzgibbon and A. Zisserman, “On affine invariant clustering and automatic cast listing in movies,” in A. Heyden, G. Sparr, M. Nielsen, P. Johansen, eds., Proceedings of the Seventh European Conference on Computer Vision, LNCS 2353, Springer-Verlag (2002), and in H. Murase and S. K. Nayar, “Visual learning and recognition of 3-D objects from appearance,” International Journal of Computer Vision, Volume 14 (1995), which are incorporated by reference herein in their entirety. While existing appearance-based methods represent an improvement over feature extraction, they nonetheless suffer from lack of reliability and from inconsistency.
The bases of the shortcomings of conventional appearance-based approaches may be understood by reference to FIG. 1. FIG. 1(a) shows a hypothetical, idealized projection of the images of three objects onto the image space, as discussed in Murase and Nayar, cited above, wherein the three axes represent the three most significant eigenvectors of the eigenspace. While limitation to three eigenvectors/axes provides an intuitively tangible example, the analysis may be mathematically extended to a larger number of eigenvectors/dimensions. Each ellipse 110 corresponds to one object. The sample points 120 of each ellipse correspond to images of the respective object taken over pose, e.g., over rotation about one axis, with constant illumination. The essence of the clustering problem is to associate images with the correct ellipse, i.e., to correctly identify a particular observed image and its projection point through association with the appropriate ellipse 110.
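By way of illustration only, the projection of images onto the three most significant eigenvectors may be sketched as follows. The sketch is not taken from Murase and Nayar or from any cited reference; it assumes hypothetical synthetic "images" of eight pixels each (the function synthetic_views and its parameters are invented for this example) and uses simple power iteration in place of a production eigensolver.

```python
import math
import random


def synthetic_views(n_views=36, n_pixels=8, noise=0.01, seed=0):
    """Hypothetical stand-in for images of one object rotated about one axis."""
    rng = random.Random(seed)
    views = []
    for k in range(n_views):
        theta = 2.0 * math.pi * k / n_views  # pose angle of view k
        views.append([
            math.cos(theta - 2.0 * math.pi * j / n_pixels)
            + 0.3 * math.cos(2.0 * theta - 0.7 * j)
            + rng.gauss(0.0, noise)  # small capture noise
            for j in range(n_pixels)
        ])
    return views


def center(rows):
    """Subtract the mean image from every image."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of the centered images."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvectors(cov, k, iters=500, seed=1):
    """Power iteration, re-orthogonalized against eigenvectors already found."""
    rng = random.Random(seed)
    d = len(cov)
    basis = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            for b in basis:  # remove components along earlier eigenvectors
                dot = sum(wi * bi for wi, bi in zip(w, b))
                w = [wi - dot * bi for wi, bi in zip(w, b)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            v = [wi / norm for wi in w]
        basis.append(v)
    return basis


def project(centered, basis):
    """Coordinates of each centered image along each basis vector."""
    return [[sum(ri * bi for ri, bi in zip(r, b)) for b in basis]
            for r in centered]


views = synthetic_views()
centered = center(views)
basis = top_eigenvectors(covariance(centered), k=3)
coords = project(centered, basis)  # one 3-D sample point per image
print(coords[0])
```

Each entry of coords plays the role of one sample point 120. Because the assumed pose angle sweeps a full rotation, consecutive entries trace a closed curve in the three-dimensional eigenspace, analogous to one ellipse 110.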
In general, the clustering problem is complicated by the inherently sampled nature of data 120 and by the presence of error in the image capture process. Such error may arise, for example, from inaccuracy or noise in the image capture apparatus. Also, the sample points of different objects may lie in close proximity to one another. Accordingly, the clustering method must be sufficiently robust to afford reliable results with limited and inaccurate data. For example, a datum 130 that is actually a member of ellipse 110a might be erroneously associated with ellipse 110b, since it might be closer to an observed datum on ellipse 110b than to any observed datum on ellipse 110a. Consequently, the clustering algorithm would need to take advantage of knowledge of the manifold structure by constructing virtual points 140 from which meaningful distance metrics can be computed.
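The misassociation of a datum such as datum 130, and the benefit of virtual points, may be illustrated with a minimal two-dimensional sketch. All coordinates, the labels "A" and "B", and the use of linear interpolation between neighboring samples are assumptions made for this example only; the text above does not specify how virtual points 140 are constructed.

```python
import math


def sample_ellipse(cx, cy, a, b, n):
    """Return n points evenly spaced in angle on an axis-aligned ellipse."""
    return [(cx + a * math.cos(2.0 * math.pi * k / n),
             cy + b * math.sin(2.0 * math.pi * k / n)) for k in range(n)]


def with_virtual_points(samples, per_gap):
    """Add per_gap linearly interpolated points between consecutive samples
    of a closed curve (an assumed, simplistic way to build virtual points)."""
    dense = []
    n = len(samples)
    for i in range(n):
        (x0, y0), (x1, y1) = samples[i], samples[(i + 1) % n]
        for s in range(per_gap + 1):  # s == 0 keeps the original sample
            t = s / (per_gap + 1)
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return dense


def nearest_label(query, points_by_label):
    """Label of the point set containing the point closest to the query."""
    qx, qy = query
    return min(
        points_by_label,
        key=lambda label: min(math.hypot(qx - px, qy - py)
                              for px, py in points_by_label[label]),
    )


# Two sparsely sampled closed curves that pass close to one another.
samples = {
    "A": sample_ellipse(0.0, 0.0, 2.0, 1.0, 6),
    "B": sample_ellipse(3.1, 0.6, 1.0, 1.5, 6),
}

# A datum from curve A at 30 degrees, plus a small assumed capture error.
datum = (2.0 * math.cos(math.pi / 6) + 0.03, math.sin(math.pi / 6) + 0.02)

by_samples = nearest_label(datum, samples)
by_virtual = nearest_label(
    datum, {k: with_virtual_points(v, 9) for k, v in samples.items()})

print("nearest observed sample:", by_samples)
print("nearest sample or virtual point:", by_virtual)
```

Under these assumed coordinates, the datum lies between two widely spaced samples of curve A, so its nearest observed sample belongs to curve B. Once interpolated points fill the gaps, the nearest point belongs to curve A, which is the curve the datum was generated from.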
Yet another basis for inaccurate clustering is depicted in FIG. 1(b), wherein hypothetical ellipses 150a and 150c have been tilted with respect to their counterparts in FIG. 1(a). Specifically, in region 160, the ellipses are close enough to one another to cause instability in conventional clustering methods. For example, as shown conceptually in region 170, several trajectories may be imagined based upon the sample points shown.
In conventional systems, e.g., Saux and Boujemaa, appearance-based clustering methods often utilize projections, e.g., geometric transformations from the image/manifold space to lower-dimensional spaces, wherein the final clustering operations are performed. In unstable regions such as region 160, it is likewise difficult to synthesize the correct trajectories in the projection space. This is illustrated in FIG. 2, which shows intermediate clustering results based on three projection methods. The projection spaces shown were derived from sample images of three model cars taken over rotation with constant illumination.
FIG. 2(a) shows the results of a principal component analysis algorithm as discussed in Murase and Nayar, which was cited above. As can be appreciated, the dispersion of the data is so high that no trajectories can be discerned; that is, there is no basis for clustering decisions. This is a consequence of the fact that the local linear estimates utilized for transformation become unstable for complex regions of the manifold structure. FIG. 2(b) shows the results of an Isomap algorithm. An example of this can be found in Joshua B. Tenenbaum, Vin de Silva and John C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, Vol. 290, No. 5500 (2000), which is incorporated herein in its entirety. While it is clear that a collection of trajectories exists, there is little separation of the individual trajectories. Thus, again there is little basis for effective clustering. FIG. 2(c) shows the results of locally linear embedding. An example of this can be found in Sam T. Roweis and Lawrence K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” Science, Vol. 290, No. 5500 (2000), which is incorporated herein in its entirety. As shown by the superimposed boundaries 210, there is a basis for making clustering decisions. However, the boundaries overlap, indicating potential for instability or incorrect clustering.
Based upon the foregoing, there is a need for an improved system and method for clustering images that will yield reliable results for a wide variety of objects in the presence of noise and inaccuracy.