1. Field of the Invention
The invention relates to apparatus and an accompanying method for an iterative convolution filter for determining an image signature and which is particularly useful in a system for automatically classifying individual images, on a numerical basis, in, e.g., an image database, and, through a query-by-example paradigm, retrieving a desired image(s) therefrom.
2. Description of the Prior Art
Textual representations constitute a relatively small portion of all the currently available information in the world. Nevertheless, dating back many centuries, textual representations, whether handwritten or machine printed, have been and are still used to abstract and convey information from one person to another through a printed (including numeric) alphabet, with language and numbers representing semantic concepts. However, a language by its very nature of representing abstract concepts through an organized contextual hierarchy built on a finite vocabulary, be it words or pictographs (e.g., words, having given meanings--though contextually variable, are grouped by an individual into sentences, sentences into paragraphs, and so forth), often results in communication that is indefinite and ambiguous. An image, unburdened by the limitations inherent in any language, is a far more efficient vehicle to convey information among individuals than text is now and is ever likely to be. Accordingly, humans constantly use their sense of vision in perceiving information and communicating it amongst themselves. As such, the world is awash in and dominated by visual information of one sort or another, whether it is, e.g., a single image such as on a photographic print, printed on a piece of paper or stored in an electronic memory, or a sequence of images such as in a motion picture or in a video signal. Images are so prevalent, as a means of communication, that humans create a huge and escalating number of new images every single day. Recent advances in computing and telecommunications are only increasing the use and dominance of visual information in modern day life.
Many different ways exist to classify and search textual information, particularly with a computer. Any textual item residing in a textual database is, by its very nature, written in a language, such as English, that, for the most part, has a well-defined and bounded vocabulary. Such a vocabulary readily lends itself to searching for any word (or other similar granular linguistic construct) to locate a stored textual entry of interest. While a textual database itself can be inordinately large, i.e. contain an extraordinarily massive number of different textual entries, various algorithms exist which, by exploiting a well-defined vocabulary and its usage inherent in a language, such as English, permit a computer to efficiently index and retrieve any of the items stored in the database. In that regard, certain of these algorithms index an item by examining it for the presence of any so-called "keywords". Once any such word is found, a pointer to a stored record of that textual item is added to an appropriate classification (list) defined by (and data structure associated with) that keyword. Each such classification generally consists of a list of pointers, with each pointer defining a location in a massive textual database at which the corresponding textual record for that item (or a portion thereof, such as a bibliographic abstract) is stored. All the keyword lists collectively define a keyword database. Keyword based retrieval systems generally operate by querying the keyword database with a user supplied keyword(s) to retrieve pointers to all the corresponding records that contain the keyword(s) and then present these pointers (in the form of a numbered list of records) to the user for subsequent selection thereamong. Once the user has selected which records (s)he wants, (s)he can then instruct the system to retrieve, display and/or print the complete item stored within each of the selected records.
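The keyword indexing and retrieval scheme described above can be illustrated with a minimal sketch. It is not drawn from any particular prior-art system: the function names `build_keyword_index` and `query`, the sample records, and the choice to treat a multi-keyword query as requiring every keyword are all assumptions made purely for illustration.

```python
from collections import defaultdict


def build_keyword_index(records, keywords):
    """Map each keyword to a list of record ids (the "pointers") whose text contains it."""
    index = defaultdict(list)
    for record_id, text in records.items():
        words = set(text.lower().split())
        for keyword in keywords:
            if keyword in words:
                index[keyword].append(record_id)
    return index


def query(index, *keywords):
    """Return the sorted ids of records that contain every supplied keyword.

    Requiring all keywords to match is an illustrative assumption.
    """
    hits = [set(index.get(keyword, ())) for keyword in keywords]
    return sorted(set.intersection(*hits)) if hits else []


# Hypothetical textual records, keyed by record id.
records = {
    1: "a chair in a garden",
    2: "a wooden chair",
    3: "a garden in spring",
}
index = build_keyword_index(records, ["chair", "garden"])

print(query(index, "chair"))            # records 1 and 2
print(query(index, "chair", "garden"))  # record 1 only
```

The returned ids play the role of the numbered list of records presented to the user, who would then select which complete items to retrieve.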
Unlike text, an image, from a semantic perspective, is not defined by a linguistic or mathematical vocabulary. In that regard, any such vocabulary is often inadequate to fully describe all but very simple images. As such, human beings, whose communication is biased heavily towards using linguistic based verbal and printed expressions, are ill-equipped to fully describe anything more complex than a very simple image.
Given this linguistic deficiency, then, not surprisingly, computerized information search systems that can semantically categorize an image have yet to be developed, and probably will not be for some time. Hence, users of existing computerized search systems had little choice but to find a desired image indirectly--by performing a keyword search to locate, e.g., an article that contained not only a textual description of a desired image, such as in a figure caption, but also hopefully (though without any guarantees) the desired image itself. However, this approach often failed and hence frustrated the user because it still relied on a linguistic description--one that was inadequate to fully describe just the particular image which the user wanted and, not surprisingly, often returned an article(s) containing an image(s) other than that desired.
In view of an increasing predominance of imagery in present day (and, certainly, expected future) communications, a general problem, thus far unmet in the art, has existed for some time--though recently becoming more acute--as to just how images, by themselves and devoid of accompanying descriptive text, can be efficiently and accurately manipulated, i.e. how such large numbers of images can first be indexed into an image database and a desired image(s) then accurately retrieved therefrom.
Given the absence of a suitable vocabulary to describe image semantics, conventional image classification and retrieval schemes simply relied on a human being to subjectively assess semantic content of images, on an image-by-image basis, for both indexing and retrieval.
In particular, a conventional image repository, commonly referred to as a "photo shop", usually employs an individual to view, become familiar with and appropriately index each and every image in an image collection. Subsequently, a customer, seeking a certain image, will verbally describe various salient aspects of that image in an effort to convey a "gut feeling" of the image to that employee. As an example, suppose a customer desires an image of John Kennedy evoking an appearance of "determination". The employee, using his (her) knowledge of the image collection, will then retrieve an image from the collection thought to be most similar to that which the customer has described. For example, the employee may first examine a group of images that may have been classified as depicting Presidents in office. The employee will present the retrieved image, perhaps an image of a concerned Kennedy sitting at a desk in the Oval Office during a meeting involving the Cuban Missile Crisis, to the customer and ask whether that image is the one desired or not, and, if not, why, in terms of differences between the desired and retrieved images, the retrieved image is not desired. In this situation, the customer may respond by stating that he (she) wants an image of Kennedy standing in front of an American flag rather than in the Oval Office. Armed with this information, the employee will then return to the collection to narrow the search and retrieve an image, if it is available, that is closer to that desired, such as, e.g., Kennedy giving a charged public speech in West Berlin. This manual process will iteratively repeat until either all the similar images in the collection have been retrieved by the employee and rejected or the desired image (or one sufficiently similar) has been found.
Alternatively, depending on the granularity of the index, a relatively large group of images may be found, such as all those containing past Presidents, and a user or customer will then be forced to manually examine, seriatim, each image in the group to locate the desired image. Though human experiences are generally similar across different people, each person who views and indexes images does so based on his (her) own subjective criteria. While these criteria are generally quite broad (such as here perhaps including past Presidents as one, and American flags as another) and to a certain extent overlap among individuals (another individual might use Presidents in public), the decision as to whether an image possesses a given semantic content, hence falling within one of a number of given criteria, and then should be retrieved or not based on a verbal description of what is desired, is highly subjective and, for a common image, often varies widely across different viewers. While a single human can index and effectively deal with perhaps as many as 100,000 different images or so, a photo shop often has a collection of considerably more images, such as several hundred thousand to several million images, if not more. With such large collections, indexing is performed by several different individuals; the same occurs for retrieval. Hence, owing to the highly subjective nature of human-based indexing and retrieval, inconsistent results often occur from one individual to the next.
Moreover, no finer granularity than relatively broad criteria (e.g., images of "chairs") is generally used to classify images. Also, in large image collections, images possessing one criterion (e.g. "chairs") are generally not cross-indexed to images possessing another criterion (e.g. depicting "gardens") such that images meeting both criteria, i.e. a sub-set (e.g. "a chair in a garden"), can be readily located by themselves and to the exclusion of images meeting just one of the criteria (e.g. just chairs or just gardens). Furthermore, images are classified with a limited and bounded set of linguistic criteria (e.g. an image of a "chair"). Unfortunately, doing so often results in a customer describing an image using terms (e.g. an image of a "light switch") that have not been used as a class descriptor. Thus, such manual image indexing and retrieval methodologies tend to be highly frustrating and inefficient to use, and quite problematic in their results.
While such manual approaches are still used with relatively small image collections, these approaches become totally useless if one desires to index massive numbers of images provided by, e.g., many currently available image sources. For example, a single video stream, such as programming carried over a broadcast channel, contains a substantial number of images, though successive images possess significant commonality. Currently, hundreds of different cable channels are available each providing a different video stream. To form a comprehensive image database, each and every different image in each video stream might need to be indexed. A similar problem would be posed by indexing images that appear in all recorded footage. Another source of a potentially infinite number of images is the world wide web wherein the number of new visual data sites continues to exponentially grow, with the images provided therefrom exponentially increasing at a significantly greater rate. In any event, a huge number of different images exist both now and increasingly so in the future which are likely to constitute an image database, with far more images than any manual methodology can handle.
Clearly, indexing all such images from all such sources, or even just a small fraction of these images, into a common image database is an enormous task which is only feasible, if at all, if it can be automated in some fashion.
In an effort to overcome the deficiencies inherent in conventional manual indexing and retrieval methodologies, the art has indeed turned to automated, i.e. computerized, techniques. However, in practice, none of these techniques has yet proven entirely satisfactory.
One such technique involves a so-called "query by image content (QBIC)" paradigm. This technique is typified by work currently undertaken in "The QBIC Project" by IBM Corporation (see the web site at http://wwwqbic.almaden.ibm.com); in the so-called "Visual Information Retrieval" technology being developed at Virage Corporation (see the web site at http://www.virage.com); and in the "Photobook" project currently underway at the Media Lab at Massachusetts Institute of Technology (see the web site at http://www-white.media.mit.edu/vismod/demos/photobook). In general, the QBIC technique relies on classifying an image according to a relatively small number of pre-defined fixed image features (also referred to as characteristics or attributes), such as distribution of color across an image, shapes in an image including their position and size, texture in an image, locations of dominant edges of image objects and regions, and so forth. For each image, a computerized system scans the image and measures each such characteristic. The premise behind using such characteristics is to mimic those visual attributes with which humans are familiar and use in recognizing an image. Once these attributes are measured for each image, a sequence of numeric values, i.e. a vector, results for that image. A user desiring to find a given image in a QBIC image database queries the database by providing an example of an image similar to that which he (she) desires and then setting a weight for each such characteristic in a fashion he (she) believes accurately reflects the presence of each attribute in the desired image as compared to that in the test image. For example, if a desired image is to have less variation in color across the image than does the test image, then the user will ostensibly choose a relatively low weight for color distribution, and so forth for other weights. The attributes in the example image are then measured.
To retrieve an image, the system compares the vector for the test image, modified by the weights provided by the user, to the vector for each image in the database. A difference measure, based on a mathematical difference, between these two vectors is computed for each database image. The retrieved image with the lowest difference measure is then presented to the user. The user, upon viewing that retrieved image, can adjust the weights to refine the selection process in an attempt to retrieve another image closer to that which he (she) desires, and so forth, until presumably the closest image in the database to that desired is eventually retrieved.
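The weighted comparison just described can be sketched as follows. The text above does not specify the exact difference measure used by any QBIC system, so the weighted Euclidean distance here is an assumed stand-in, and the function names, attribute ordering and numeric values are hypothetical.

```python
import math


def weighted_distance(query_vec, db_vec, weights):
    """Weighted Euclidean distance between two attribute vectors (an assumed measure)."""
    return math.sqrt(
        sum(w * (q - d) ** 2 for q, d, w in zip(query_vec, db_vec, weights))
    )


def retrieve_closest(query_vec, database, weights):
    """Return the name of the database image with the lowest difference measure."""
    return min(
        database,
        key=lambda name: weighted_distance(query_vec, database[name], weights),
    )


# Hypothetical attribute vectors: [color distribution, texture, edge density].
database = {
    "a": [0.9, 0.1, 0.5],
    "b": [0.2, 0.8, 0.5],
}
query_vec = [0.8, 0.9, 0.5]

# Emphasizing color favors image "a"; emphasizing texture favors image "b".
print(retrieve_closest(query_vec, database, [1.0, 0.0, 1.0]))
print(retrieve_closest(query_vec, database, [0.0, 1.0, 1.0]))
```

The two queries differ only in their weights yet return different images, which illustrates how strongly the retrieved result depends on values the user must choose.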
While the QBIC technique represents an advance in machine based image indexing and retrieval, this technique suffers two basic infirmities.
First, the number of attributes is generally limited to between 5 and 10. This very small number simply fails to provide sufficient resolution to adequately describe the visual characteristics of most images. While, at first blush, it would seem trivial to extend a set of attributes to encompass additional ones, considerable difficulty exists in specifying just what each additional attribute should be. Specifically, as additional characteristics are added, they tend to become increasingly abstract and difficult for a user to comprehend and visualize.
Second, the user is burdened with selecting the proper numeric value for each weight. Not only is the user rather ill-equipped for the task of deciding just what value should be used for each weight for a visually apparent attribute, but as additional increasingly abstract image attributes are used, particularly those which the user cannot readily visualize and comprehend, the difficulty inherent in this task greatly compounds.
As a result of these practical limitations, the number of image attributes in a QBIC system remains small at approximately 5-10 rather broad characteristics.
Consequently, a fairly large group of images is usually produced in response to any query to a QBIC system, necessitating that the user manually review each and every resulting image. Doing so is often quite labor and time intensive and, as such, generally infeasible for practical use.
Apart from a significant effort potentially required of a user during a search, the user generally needs to expend a considerable amount of time and effort just to properly learn how to use a QBIC system, including how to correctly set the weights. Inasmuch as any user will still set the weights subjectively, then, if different users--even those who are highly trained--were to search for a common image in a common image database, the subjectivity exhibited by these users will likely yield different and often inconsistent images. These factors further reduce the attractiveness of using a QBIC system.
Another conventional technique for automating image classification and retrieval uses, e.g., so-called "eigenimages", which are mathematical techniques for clustering vectors in space. An additional technique known in the art measures a distribution of colors, in terms of a histogrammed frequency of occurrence across a query image and for an entire color gamut. This histogrammed distribution is also measured for each image in an image database, with a distance measure then used to compare the histogrammed results between each database image and the query image. The database image possessing the smallest distance measure is presented to the user as a retrieved image. Each of these two alternate techniques suffers the same infirmity inherent in a QBIC system; namely, for a user, it is both labor and time intensive. Specifically, both of these systems exhibit insufficient resolution, which, in turn, often yields a large group of retrieved images that a user must individually review seriatim.
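The color-histogram technique mentioned above can be sketched roughly as shown below. The quantization into four bins per channel, the L1 (sum of absolute differences) distance, the function names and the tiny synthetic "images" are all assumptions chosen for illustration; the text does not state which binning or distance measure the prior-art systems used.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized histogram of quantized RGB colors over the whole color gamut.

    Assumes 8-bit channels and that bins_per_channel divides 256 evenly.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        idx = (
            (r // step) * bins_per_channel ** 2
            + (g // step) * bins_per_channel
            + (b // step)
        )
        hist[idx] += 1.0
    return [count / len(pixels) for count in hist]


def histogram_distance(hist_a, hist_b):
    """L1 distance between two histograms (an assumed distance measure)."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def retrieve_by_color(query_pixels, database):
    """Return the name of the database image with the smallest histogram distance."""
    query_hist = color_histogram(query_pixels)
    return min(
        database,
        key=lambda name: histogram_distance(query_hist, color_histogram(database[name])),
    )


# Hypothetical four-pixel images given as lists of (R, G, B) tuples.
query_pixels = [(255, 0, 0)] * 4
database = {
    "mostly_red": [(250, 10, 5)] * 3 + [(0, 0, 255)],
    "blue": [(0, 0, 255)] * 4,
}
print(retrieve_by_color(query_pixels, database))
```

Because such a histogram discards all spatial information, visually dissimilar images with similar color content can score as close matches, which is consistent with the insufficient resolution noted above.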
Thus, an acute need still remains in the art for an effective automated image classification and retrieval system. Such a system should not be either time or labor intensive for an individual to learn, much less use. In that regard, the system should classify images with sufficient resolution so as to reduce the number of retrieved images presented to a user at any one time. The system should also not require a user to specify any desired image features or set any numeric weights thereof. Additionally, the system should substantially eliminate user induced subjectivity and produce highly consistent results, in terms of the images retrieved, across different users.