The present invention relates to the field of computer-based image analysis, and more particularly to systems of classifying and searching image data dependent upon the content of the image data.
Text retrieval based on keywords has been the mainstream in the field of information retrieval. A number of existing visual retrieval systems extract and annotate the visual content of data objects manually, often with some assistance by means of user interfaces. An example of such a system is given by Rowe, L. A., Boreczky, J. S., and Eads, C. A., “Indices for User Access to Large Video Databases,” Storage and Retrieval for Image and Video Databases II, Proc. SPIE 2185, 1994, pp. 150-161. Once keywords are associated with the visual content, text retrieval techniques can be used easily. Although text descriptions may reflect the (largely conceptual) semantics of multimedia data, they may also result in a combinatorial explosion of keywords in the attempted annotation due to the ambiguous and variational nature of multimedia data. Also, there is a limit to how much semantic information textual attributes can provide, as described by Bolle, R. M., Yeo, B. L., and Yeung, M. M., “Video Query: Research Directions”, IBM Journal of Research and Development, 42(2), March 1998, pp. 233-252.
On the other hand, visual content-based retrieval systems have mainly focused on using primitive features such as colour, texture, shape, etc., for describing and comparing visual contents. Examples of such systems include: Bach, J. R., et al., “Virage Image Search Engine: An Open Framework for Image Management,” Storage and Retrieval for Image and Video Databases IV, Proc. SPIE 2670, 1996, pp. 76-87; Niblack, W., et al., “The QBIC Project: Querying Images By Content Using Colour, Textures and Shapes,” Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908, 1993, pp. 13-25; and Pentland, A., Picard, R. W., and Sclaroff, S., “Photobook: Content-Based Manipulation of Image Databases,” Intl. J. of Computer Vision, 18(3), 1996, pp. 233-254. When these feature-based techniques are applied to individual objects, an object is often the focus of retrieval. Little consideration has been given to the interrelationship among the objects.
Region-based query methods rely on colour and texture segmentation to transform raw pixel data into a small set of localised, coherent regions in colour and texture space (“blobworld”) and perform similarity-based retrieval using these regions. Such a system is described by Carson, C. et al., “Region-based image querying,” Proc. IEEE Workshop on Content-Based Analysis of Images and Video Libraries, 1997, pp. 42-49. However, the regions are derived from each individual image without reference to any consistent attributes across images in a domain. Moreover, the segmentation of regions is in general not robust and may result in perceptually incoherent regions without meaningful semantics.
The VisualSEEk system has been described by Smith, J. R. and Chang, S.-F., “VisualSEEk: A Fully Automated, Content-Based Image Query System,” Proc. ACM Multimedia 96, Boston, Mass., Nov. 20, 1996. This system and its descendants consider the spatial relationship among regions and combine it with primitive features of the regions for image retrieval. The matching algorithm merges lists of image candidates, resulting from region-based matching between a query and database images, with respect to some threshold and tends to be rather complex and ad hoc in realisation. The segmentation of regions is based on colour only, and no object or type information is extracted from the segmented regions.
In a different approach that advocates the use of global configuration, Ratan, A. L., and Grimson, W. E. L., “Training Templates for Scene Classification Using a Few Examples,” Proc. IEEE Workshop on Content-Based Analysis of Images and Video Libraries, 1997, pp. 90-97, describe a method for extracting relational templates that capture the colour, luminance and spatial properties of classes of natural scene images from a small set of examples. The templates are then used for scene classification.
Although the method automates previous efforts that handcrafted the templates, such as Lipson, P., Grimson, E., and Sinha, P., “Configuration Based Scene Classification and Image Indexing,” Proc. of CVPR'97, 1997, pp. 1007-1013, scene representation and similarity matching are computed through the relationships between adjacent small regular partitions, which are rather difficult to comprehend.
In general, for text documents, the segmentation and extraction of keywords are relatively straightforward, since keywords are symbolic in nature. For visual data, which are perceptual and pattern-based, no equivalent visual keywords have been proposed.
U.S. Pat. No. 4,839,853 issued to Deerwester, et al. on Jun. 13, 1989 describes a methodology, known as Latent Semantic Analysis (LSA), that exploits higher-order semantic structure implicit in the association of terms with text documents. Using singular value decomposition (SVD) with truncation, LSA captures underlying structure in the association of terms and documents, while attempting to remove the noise or variability in word usage that plagues word-based retrieval methods. The derived coded description achieves a reduction in dimensionality while preserving structural similarity in term-document association for similarity matching. However, no similar method has been proposed for visual domains since there is no equivalent notion of visual keywords.
U.S. Pat. No. 4,944,023 issued to Imao, et al. on Jul. 24, 1990 describes a method for describing an image by recursively and equally dividing the image into 2^n regions until each region includes two or fewer kinds of regions, and further dividing each of the 2^n regions into 2^n sub-regions of the same kind. Thus, an image is represented as a binary tree of local homogeneous regions. Though it cuts an image into types of regions and sub-regions, it neither extracts regularities of these regions across images nor uses the tree representation for comparing image similarities.

U.S. Pat. No. 5,710,877 issued to Marimount, et al. on Jan. 20, 1998 describes an image structure map (ISM) to represent the geometry, topology and signal properties of regions in an image and to allow spatial indexing of the image and those regions. Another objective of the ISM is to support interactive examination and manipulation of an image by a user. No attempt is made to define object or type information across a collection of images or to use the ISM as a means for image retrieval. U.S. Pat. No. 5,751,852 issued to Marimount, et al. on May 12, 1998 elaborates on the ISM.
U.S. Pat. No. 5,751,286 issued to Barber, et al. on May 12, 1998 describes an image query system and method underlying the QBIC system. Queries are constructed in an image query construction area using visual characteristics such as colours, textures, shapes, and sizes. Retrieval of images is performed based on the values of representations of the visual characteristics and the locations of the representations of the query in the image query construction area. Aggregate measures of low-level features are often used for similarity matching. No object or type information is compiled from the images in the database for comparing similarity between two image contents.
U.S. Pat. No. 5,781,899 issued to Hirata on Jul. 14, 1998 describes a method and system to produce an image index for an image storage and management system. Image similarity matching is based on zones, which consist of at least one pixel in the original images, divided from the images using hue, brightness, and grid information. The size of the zone-divided image is adjusted according to the similarity for integrating the zones against some threshold to determine the total number of zones for use as an index. The zones are specific to an image. No prototypical zones are compiled in advance for describing image contents.
U.S. Pat. No. 5,802,361 issued to Wang, et al. on Sep. 1, 1998 describes a method and system for querying images and videos based on a variety of attributes of images such as motion, colour, textures, segments, etc., and also on complex Boolean expressions based on these attributes. Its user interface allows a user to graphically construct a query with icons representing different kinds of image attributes by selecting the icons into a user playground area. Besides low-level features, it also supports matching of arbitrarily shaped regions representing meaningful objects such as human faces, which are templates predetermined by the system or supplied by the user.
However, there is no systematic and automatic process to extract object or type information over a collection of images and to compute its spatial distribution in a visual content.
U.S. Pat. No. 5,819,288 issued to De Bonet on Oct. 6, 1998 discloses a method for generating a semantically based, linguistically searchable, numeric descriptor of a predefined group of input images. A signature is computed for each image in a set using multi-level iterative convolution filtering with predefined Gaussian kernels for each predefined visual characteristic. Each element of the signature is taken as the averaged pixel value produced by the last convolution filter for the corresponding visual characteristic. An average and a variance are derived for each element of the signature across all the signatures of the predefined group of images. A database manager associates predefined linguistic terms with these simple statistical measures. Images are retrieved, in response to a linguistic term chosen by a user, by comparing the signatures of the images with the associated statistical measure of an image set. Though statistical information is extracted over a set of images, only global means and variances after iterative convolution are retained for signature similarity matching. Spatial distribution information is not utilised.
Thus, a need clearly exists for an improved system of classifying and searching image data dependent upon the content of the image data.
In accordance with a first aspect of the invention, there is disclosed a method of indexing and retrieving a visual document using visual keywords. The method includes the steps of: providing a plurality of visual keywords derived using a learning technique from a plurality of visual tokens extracted across a predetermined number of visual documents; comparing a plurality of visual tokens of another visual document with the visual keywords, a comparison result being represented by a three-dimensional map of detected locations of the visual keywords; and determining a spatial distribution of visual keywords dependent upon the comparison result to provide a visual-content signature for the visual document.
Preferably, the method includes the step of comparing two visual documents by matching the visual-content signatures of the visual documents. More preferably, the comparing step involves use of one or more predetermined similarity measures.
Preferably, the visual-content signature is a spatial aggregation map.
The method can also include the step of coding the visual-content signature to reduce the dimensionality of the visual-content signature to provide a coded description of the visual document. It may further include the step of comparing two visual documents by matching the coded descriptions.
Preferably, the method further includes the step of extracting a plurality of visual tokens from the content of one or more visual documents, each visual token being represented by one or more predefined visual characteristics. The extracting step can be dependent upon predefined receptive fields and displacements. The predefined receptive fields may have a predetermined size. The displacements may have predetermined sizes. Alternatively, two or more of the predefined receptive fields may have different sizes. Displacements may have different sizes too.
The method may include the step of generating the visual keywords using the learning technique from the visual tokens extracted across the predetermined number of the visual documents. Preferably, the learning technique is a supervised and/or unsupervised learning technique. Also, the method may include the step of representing the outcome of the visual-token comparing step by a vector of real values, each corresponding to a visual keyword, to provide the three-dimensional map of the detected locations of the visual keywords. The three-dimensional map is a type evaluation map.
Preferably, the extracting step utilises a plurality of predefined receptive fields and displacements in horizontal and vertical directions to tessellate the two-dimensional visual content of the visual documents. The extracting step may also include the step of transforming a visual token, being the part of the visual content covered by a receptive field, into a vector of real values representing one or more visual characteristics of the visual token.
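The tessellation described above can be illustrated with a minimal sketch. It assumes a grey-level image held as nested lists, a single receptive-field size and a single displacement per direction; the function name extract_tokens and the choice of [mean, variance] as the visual characteristics are hypothetical and serve only to show the shape of the computation.

```python
from typing import List

Image = List[List[float]]  # grey-level image as rows of pixel values


def extract_tokens(image: Image, rf_h: int, rf_w: int,
                   dy: int, dx: int) -> List[List[List[float]]]:
    """Slide an rf_h x rf_w receptive field over the image in steps of (dy, dx).

    Returns a rows x cols grid of feature vectors, one per visual token.
    Each vector is [mean, variance] of the covered pixels, an assumed
    stand-in for the visual characteristics of the token.
    """
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height - rf_h + 1, dy):
        row = []
        for left in range(0, width - rf_w + 1, dx):
            # The visual token is the part of the image under the field.
            pixels = [image[y][x]
                      for y in range(top, top + rf_h)
                      for x in range(left, left + rf_w)]
            mean = sum(pixels) / len(pixels)
            var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
            row.append([mean, var])
        grid.append(row)
    return grid


image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.5, 0.5],
    [1.0, 0.0, 0.5, 0.5],
]
# Non-overlapping 2x2 fields give a 2x2 grid of tokens.
tokens = extract_tokens(image, rf_h=2, rf_w=2, dy=2, dx=2)
# A displacement smaller than the field size gives overlapping tokens.
dense_tokens = extract_tokens(image, rf_h=2, rf_w=2, dy=1, dx=1)
```

Choosing a displacement smaller than the receptive field trades a denser grid of tokens for more computation.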
Optionally, the generating step utilises a plurality of view-based recognisers and a supervised learning technique to train the view-based recognisers from positive and negative visual tokens of a view of an object, the trained view-based recognisers being the visual keywords. Alternatively, the generating step may utilise a plurality of cluster-based parameters and an unsupervised learning technique to modify the cluster-based parameters using a collection of the visual tokens, the trained cluster-based parameters being the visual keywords.
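For the unsupervised alternative, a minimal sketch is given below. The text does not name a particular clustering algorithm, so plain k-means is used here as an assumed stand-in, with the cluster centres playing the role of the cluster-based parameters; the function names and the sample data are hypothetical.

```python
from typing import List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def learn_keywords(tokens: List[Vector], init: List[Vector],
                   iterations: int = 10) -> List[Vector]:
    """Plain k-means: refine the initial cluster centres on the tokens.

    The returned centres are taken as the visual keywords (an assumption
    of this sketch, not a statement of the claimed technique).
    """
    centres = [list(c) for c in init]
    for _ in range(iterations):
        # Assign each token to its nearest centre.
        groups: List[List[Vector]] = [[] for _ in centres]
        for t in tokens:
            nearest = min(range(len(centres)),
                          key=lambda i: sq_dist(t, centres[i]))
            groups[nearest].append(t)
        # Move each centre to the mean of its tokens (unchanged if empty).
        for i, group in enumerate(groups):
            if group:
                centres[i] = [sum(col) / len(group) for col in zip(*group)]
    return centres


# Visual tokens pooled across several visual documents (made-up values).
tokens = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]]
keywords = learn_keywords(tokens, init=[[0.0, 0.0], [1.0, 1.0]])
```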
The comparing step may further include the step of estimating a confidence factor or set membership of the visual token to each of the visual keywords. It may also include the step of iteratively computing the confidence factor or set membership for each of the visual tokens in a given visual content, resulting in the three-dimensional map of detected locations of the visual keywords. The determining step may include the step of summarising the confidence factors or set memberships of the three-dimensional map of detected locations to provide a three-dimensional map regarding occurrences of the visual keywords.
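The comparing and determining steps can be sketched as follows. The sketch assumes that set membership is a normalised exponential of the negative squared distance between a token and each keyword, and that summarising means averaging over non-overlapping blocks of token locations; both choices, and all function names, are illustrative assumptions rather than details given in the text.

```python
import math
from typing import List

Vector = List[float]
Map3D = List[List[Vector]]  # rows x cols x number of keywords


def memberships(token: Vector, keywords: List[Vector]) -> Vector:
    """Soft membership of one token to each keyword (values sum to 1)."""
    scores = [math.exp(-sum((t - k) ** 2 for t, k in zip(token, kw)))
              for kw in keywords]
    total = sum(scores)
    return [s / total for s in scores]


def type_evaluation_map(token_grid: Map3D, keywords: List[Vector]) -> Map3D:
    """Evaluate every token location against every keyword."""
    return [[memberships(tok, keywords) for tok in row] for row in token_grid]


def spatial_aggregation_map(tem: Map3D, block_rows: int,
                            block_cols: int) -> Map3D:
    """Average memberships over non-overlapping blocks of token locations."""
    n_rows, n_cols, n_kw = len(tem), len(tem[0]), len(tem[0][0])
    sam = []
    for r0 in range(0, n_rows, block_rows):
        out_row = []
        for c0 in range(0, n_cols, block_cols):
            cells = [tem[r][c]
                     for r in range(r0, min(r0 + block_rows, n_rows))
                     for c in range(c0, min(c0 + block_cols, n_cols))]
            out_row.append([sum(cell[k] for cell in cells) / len(cells)
                            for k in range(n_kw)])
        sam.append(out_row)
    return sam


# A 2x2 grid of one-dimensional tokens and two keywords (made-up values).
token_grid = [[[0.0], [0.0]],
              [[1.0], [1.0]]]
keywords = [[0.0], [1.0]]
tem = type_evaluation_map(token_grid, keywords)
sam_rows = spatial_aggregation_map(tem, block_rows=1, block_cols=2)
sam_all = spatial_aggregation_map(tem, block_rows=2, block_cols=2)
```

Here sam_rows keeps one summary per row of tokens, whereas sam_all collapses the whole grid into a single location, which discards the spatial layout.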
More preferably, the coding step utilises a singular-value decomposition transform to represent a matrix X, formed by concatenating linearised visual-content signatures of visual documents as column vectors, such that X = U S V^T of rank r, and obtains an approximate representation X_k = U_k S_k V_k^T of rank k, where k ≤ r. It may further include the step of transforming a visual-content signature of a visual document D to generate a coded description D′ as D′ = D^T U_k S_k^{-1}.
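The form of the coded description can be motivated by a short derivation. It assumes the standard properties of the singular-value decomposition (orthonormal columns of U_k and a diagonal S_k with non-zero entries, which holds because k does not exceed the rank r), and introduces m as an illustrative symbol for the length of a linearised signature.

```latex
\begin{aligned}
X_k &= U_k S_k V_k^{T}, \qquad U_k^{T} U_k = I_k \\
X_k^{T} U_k S_k^{-1} &= V_k S_k U_k^{T} U_k S_k^{-1} = V_k \\
D' &= D^{T} U_k S_k^{-1} \in \mathbb{R}^{1 \times k}, \qquad D \in \mathbb{R}^{m \times 1}
\end{aligned}
```

Under these assumptions each row of V_k is the k-dimensional coded description of one indexed document, and a new signature D is mapped into the same space by applying the same operator.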
Optionally, a plurality of similarity matching functions are used, each for comparing the similarity between the visual-content signatures of two visual documents. The method may also include the step of determining which similarity measure, among those produced by the plurality of similarity matching functions on pairs of the visual content signatures of the two visual documents, to use as a final similarity measure between the two visual documents.
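A minimal sketch of combining several similarity matching functions is shown below. The two measures (cosine similarity and a similarity derived from city-block distance) and the rule of taking the largest score as the final measure are assumptions made for illustration; the text does not specify which functions or which selection rule are used.

```python
import math
from typing import Callable, List, Sequence

Signature = Sequence[float]  # linearised visual-content signature
Measure = Callable[[Signature, Signature], float]


def cosine_similarity(a: Signature, b: Signature) -> float:
    """Cosine of the angle between two signatures (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def city_block_similarity(a: Signature, b: Signature) -> float:
    """1 / (1 + L1 distance): equals 1.0 for identical signatures."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


def final_similarity(a: Signature, b: Signature,
                     measures: List[Measure]) -> float:
    """Assumed selection rule: keep the largest score among the measures."""
    return max(m(a, b) for m in measures)


measures = [cosine_similarity, city_block_similarity]
same = final_similarity([1.0, 0.0], [1.0, 0.0], measures)
different = final_similarity([1.0, 0.0], [0.0, 1.0], measures)
```

Taking the maximum only makes sense when the measures are scaled comparably; both measures here lie in [0, 1] for non-negative signatures.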
In accordance with a second aspect of the invention, there is disclosed an apparatus for indexing and retrieving a visual document using visual keywords. The apparatus includes: a device for providing a plurality of visual keywords derived using a learning technique from a plurality of visual tokens extracted across a predetermined number of visual documents; a device for comparing a plurality of visual tokens of another visual document with the visual keywords, a comparison result being represented by a three-dimensional map of detected locations of the visual keywords; and a device for determining a spatial distribution of visual keywords dependent upon the comparison result to provide a visual-content signature for the visual document.
In accordance with a third aspect of the invention, there is disclosed a computer program product having a computer readable medium having a computer program recorded thereon for indexing and retrieving a visual document using visual keywords. The computer program product includes: a module for providing a plurality of visual keywords derived using a learning technique from a plurality of visual tokens extracted across a predetermined number of visual documents; a module for comparing a plurality of visual tokens of another visual document with the visual keywords, a comparison result being represented by a three-dimensional map of detected locations of the visual keywords; and a module for determining a spatial distribution of visual keywords dependent upon the comparison result to provide a visual-content signature for the visual document.