There are many systems in common use today whose function is automatic object identification. Many make use of cameras or scanners to capture images of objects, and employ computers to analyze the images. Examples are bill changing machines, optical character readers, blood cell analyzers, robotic welders, electronic circuit inspectors, to name a few. Each application is highly specialized, and the detailed design and implementation of each system is finely engineered to the specific requirements of the particular application, most notably the visual characteristics of the objects to be recognized. A device that is highly accurate in recognizing a dollar bill would be worthless in recognizing a white blood cell.
The more general problem of identifying an image (or any object through the medium of an image) based solely upon the pictorial content of the image has not been satisfactorily addressed. Considering that the premier model for a generalized identification system is the one which we all carry upon our shoulders, i.e., the human brain, it is not surprising that the general system does not yet exist. Any child can identify a broad range of pictures better than can any machine, but our understanding of the processes involved is so rudimentary as to be of no help in solving the problem.
As a result, the means that have been employed amount to the shrewd application of heuristic methods. Such methods generally are derived from the requirements of a particular problem. Current technology often uses such an approach to successfully solve specific problems, but the solution to the general image identification problem has remained remote.
The landscape of the patent literature referring to image identification is broad, but very shallow. The following is a summary of two selected patents and three commercial systems which are considered to represent the current state-of-the-art.
U.S. Pat. No. 5,893,095 to Jain et al. presents a detailed framework for a pictorial content based image retrieval system and even presents this framework in representative hardware. Flowcharts are given describing the operation of the framework system. The system depends for identification upon the matching of visual features derived from the image pictorial content. Examples of these visual features are hue, saturation and intensity histograms; edge density; randomness; periodicity; algebraic moments of shapes; etc. Some of these features are computed over the entire image and some are computed over a small region of the image. Jain does not reveal the methods through which such visual features are discerned. These visual features are expressed in Jain's system as “primitives”, which appear to be constructed from the visual features at the discretion of a human operator.
A set of primitives and primitive weightings appropriate to each image is selected by the operator and stored in a database. When an unknown image is presented for identification, it can either be processed autonomously to create primitives or the user can specify properties and/or areas of interest to be used for identification. A match is determined by comparing the vector of weighted primitive features obtained for the query image against all the weighted primitive feature vectors for the images in the database.
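The comparison of weighted primitive vectors described above can be illustrated with a minimal sketch. Jain does not disclose the distance measure, so the weighted Euclidean distance used here is an assumption, as are the function names, the three example primitives and all numeric values.

```python
import math

def weighted_distance(query, stored, weights):
    """Weighted Euclidean distance between two feature vectors.

    The metric is an assumption; the source does not specify one.
    """
    return math.sqrt(
        sum(w * (q - s) ** 2 for q, s, w in zip(query, stored, weights))
    )

def best_match(query, database):
    """Return the name of the database entry closest to the query.

    `database` maps an image name to a (features, weights) pair, mirroring
    the per-image primitives and weightings chosen by the operator.
    """
    return min(
        database,
        key=lambda name: weighted_distance(query, *database[name]),
    )

# Hypothetical primitives: [mean hue, edge density, periodicity]
database = {
    "image_a": ([0.10, 0.80, 0.30], [1.0, 2.0, 0.5]),
    "image_b": ([0.60, 0.20, 0.90], [1.0, 1.0, 1.0]),
}
query = [0.55, 0.25, 0.80]

print(best_match(query, database))  # image_b
```

Because every stored vector is visited once, the cost of a query in this sketch grows in proportion to the number of images in the database.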
Given the information provided by Jain, one skilled in the art could not construct a viable image identification system because the performance of the system is dependent upon the skill of the operator at selecting primitives, primitive weightings, and areas of interest. Assuming that Jain ever constructed a functioning system, it is not at all clear that the system described could perform the desired function. Jain does not provide any enlightenment concerning realizable system performance.
U.S. Pat. No. 5,852,823 to De Bonet teaches an image recognition system that is essentially autonomous. Image feature information is extracted through application of particular suitable algorithms, independent of human control. The feature information thus derived is stored in a database, which can then be searched by conventional means. De Bonet's invention offers essentially autonomous operation (he suggests that textual information might be associated with collections of images grouped by subject, date, etc. to thereby subdivide the database) and the use of features derived from the whole of the image. Another point of commonality is the so-called “query by example” paradigm, wherein a search of the image database is predicated upon information extracted exclusively from the pictorial content of the unknown image.
De Bonet takes some pains to distinguish his technology from that developed by IBM and Illustra Information Technologies, which are described later in this section. He is quite critical of those technologies, declaring that they can address only a small range of image identification and retrieval functions.
De Bonet refers to the features that he extracts from images as the image's signature. The signature for a given image is computed according to the following sequence of operations: (1) The image is split into three images corresponding to the three color bands. Each of these three images is convolved with each of 25 pre-determined and invariant kernels. (2) The 75 resulting images are each summed over the image's range of pixels, and the 75 sums become part of the image's signature. (3) Each of the 75 convolved images is again convolved with the same set of 25 kernels. Each of the resulting 1875 images is summed over its range of pixels, and the 1875 sums become part of the image's signature. (4) Each of the 1875 convolved images is convolved a third time with the same set of 25 kernels. The resulting 46,875 images are each summed over the image's range of pixels, and the 46,875 sums become part of the original image's signature.
In the simplest case, then, the 48,825 sums (46,875+1875+75) serving as the signature are stored in an image database, along with ancillary information concerning the image. It should be noted that this description was obtained from De Bonet's invention summary. Later, he uses just the 46,875 elements obtained from the third convolution. An unknown image is put through the same procedure. The signature of the unknown image is then compared to the signatures stored in the database one at a time, and the best signature matches are reported. The corresponding images are retrieved from an image library for further examination by the system user.
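The signature arithmetic and the repeated convolve-and-sum procedure summarized above can be sketched as follows. This follows only the steps as summarized here and is not De Bonet's implementation: the border handling (no padding), the two toy kernels, the synthetic 5x5 color bands and all function names are assumptions made for illustration.

```python
def signature_sizes(bands=3, kernels=25, levels=3):
    """Number of sums contributed by each round of convolution."""
    return [bands * kernels ** level for level in range(1, levels + 1)]

def convolve(image, kernel):
    """2-D convolution over the 'valid' region (kernel flipped, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[kh - 1 - i][kw - 1 - j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def signature(bands, kernels, levels=3):
    """Convolve every current image with every kernel, `levels` times over,
    appending the pixel sum of each resulting image to the signature."""
    sums = []
    current = list(bands)
    for _ in range(levels):
        current = [convolve(img, k) for img in current for k in kernels]
        sums.extend(sum(map(sum, img)) for img in current)
    return sums

sizes = signature_sizes()
print(sizes)       # [75, 1875, 46875]
print(sum(sizes))  # 48825

# Toy data (assumed): three synthetic 5x5 bands and two 2x2 kernels.
toy_bands = [
    [[float(r * 5 + c + b) for c in range(5)] for r in range(5)]
    for b in range(3)
]
toy_kernels = [
    [[1.0, 0.0], [0.0, -1.0]],
    [[0.25, 0.25], [0.25, 0.25]],
]
toy_signature = signature(toy_bands, toy_kernels)
print(len(toy_signature))  # 42, i.e. 6 + 12 + 24
```

With the full set of 25 kernels the number of intermediate images grows by a factor of 25 at each level, which is the source of the heavy computational load noted for this method.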
In a somewhat more complex scenario, it is posited that the system user has a group of images that are related in some way (all are images of oak trees; all are images of sailboats; etc.). With the signatures of each member of the group already calculated, the means and variances of each element of their signatures (all 48,825) are computed, thereby creating a composite signature representing all member images of the group, along with a parallel array of variances. When a signature in the database is compared to a given signature, the difference between each corresponding element of the signatures is inversely weighted by the variance associated with that element. The implicit assumption upon which the weighting process is based is that elements exhibiting the least variance would be the best descriptors for that group. In principle, the system would return images representative of the common theme of the group.
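A minimal sketch of the variance-weighted comparison described above is given below. The text states only that each element difference is inversely weighted by its variance; the use of squared differences, the small `eps` guard against zero variance, the function names and the two-element example signatures are all assumptions.

```python
from statistics import mean, pvariance

def composite_signature(signatures):
    """Element-wise mean and variance across a group of signatures."""
    columns = list(zip(*signatures))
    means = [mean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances

def weighted_difference(candidate, means, variances, eps=1e-9):
    """Sum of squared element differences, each divided by that element's
    variance, so that low-variance elements dominate the comparison."""
    return sum(
        (x - m) ** 2 / (v + eps)
        for x, m, v in zip(candidate, means, variances)
    )

# Hypothetical group: element 0 is nearly constant, element 1 varies widely.
group = [[1.0, 10.0], [1.2, 20.0], [0.8, 30.0]]
means, variances = composite_signature(group)

far_on_loose_element = [1.0, 35.0]   # differs only where the group varies
far_on_tight_element = [2.0, 20.0]   # differs where the group is consistent

print(weighted_difference(far_on_loose_element, means, variances))
print(weighted_difference(far_on_tight_element, means, variances))
```

In this sketch the second candidate scores much worse than the first, even though its raw deviation is smaller, because it departs from the group on the element that the group members agree upon.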
Additionally, such composite signatures can be stored in the image database. Then, when a signature matching a composite signature is found, the system returns a group of images which bear a relation to the image upon which the search was based.
The system is obviously very computation-intensive. De Bonet used a 200 MHz computer based upon the Intel Pro processor to generate some system performance data. He reports that a signature can be computed in 1.5 minutes. Using a database of 1500 signatures, image retrieval took about 20 seconds. The retrieval time should be a linear function of database size.
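The consequence of linear scaling can be estimated with a short calculation. The reference figures (1500 signatures, about 20 seconds) are those reported above; the assumption of strict proportionality with no fixed overhead, the one-million-image database size and the function name are illustrative only.

```python
def estimated_retrieval_seconds(n_signatures, ref_n=1500, ref_seconds=20.0):
    """Scale the reported retrieval time linearly with database size."""
    return ref_seconds * n_signatures / ref_n

per_signature_ms = 1000 * estimated_retrieval_seconds(1)
million_hours = estimated_retrieval_seconds(1_000_000) / 3600

print(f"{per_signature_ms:.1f} ms per stored signature")
print(f"{million_hours:.1f} hours for one million signatures")
```

Under these assumptions a single query against one million stored signatures would take roughly 3.7 hours on the hardware described.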
In terms of commercial products, Cognex, Inc. offers an image recognition system under the trademarked name “Patmax” intended for industrial applications concerning the gauging, identification and quality assessment of manufactured components.
The system is trained on a comprehensive set of parts to be inspected, extracting key features (mostly geometrical) and storing them in a file associated with each particular part. Thereafter, the system is able to recognize that part under a variety of conditions. It is also able to identify a part independent of object scale and to infer part orientation.
In the early to mid 1990s, IBM (Almaden Research Center) developed a general-purpose image identification/retrieval system. Reduced to software that runs under the OS/2 operating system, it has been offered for sale as Ultimedia Manager 1.0 and 1.1, successively.
The system identifies an image principally according to four kinds of information:
   1. Average color, calculated by simply adding all of the RGB color values in each pixel.
   2. Color histogram, in which the color space is divided into 64 segments. A heuristic method is used to compare one histogram to another.
   3. Texture, defined in terms of coarseness, contrast and direction. These features are extracted from gray-level representations of the images.
   4. Shape, defined in terms of circularity, eccentricity, major axis direction, algebraic moments, etc.
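The first two kinds of information listed above can be sketched briefly. The division of the RGB space into 64 segments as a 4x4x4 grid is an assumption, and because the heuristic IBM uses to compare histograms is not described, plain histogram intersection is substituted here as a stand-in. The function names and the four-pixel example are hypothetical.

```python
def average_color(pixels):
    """Mean R, G and B values over a list of (r, g, b) pixels."""
    n = len(pixels)
    return tuple(sum(p[ch] for p in pixels) / n for ch in range(3))

def color_histogram(pixels, levels=4):
    """Normalized histogram over levels**3 bins (64 when levels is 4).

    Assumes 8-bit channels and an even split of each channel.
    """
    hist = [0.0] * levels ** 3
    step = 256 // levels
    share = 1.0 / len(pixels)
    for r, g, b in pixels:
        index = (r // step) * levels * levels + (g // step) * levels + (b // step)
        hist[index] += share
    return hist

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms; a stand-in measure,
    not the heuristic used by the IBM system."""
    return sum(min(a, b) for a, b in zip(h1, h2))

pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 255, 0)]
hist = color_histogram(pixels)

print(average_color(pixels))               # (127.5, 63.75, 63.75)
print(len(hist))                           # 64
print(histogram_intersection(hist, hist))  # 1.0
```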
In addition to the distinguishing information noted above, which can be extracted from a given image automatically, the IBM system is said to have means through which a user can supplement the information extracted automatically by manually adding information such as user-defined shapes, particular areas of interest within the image, etc.
The system does not rank the stored images in terms of the quality of match to an unknown, but rather selects 20-50 good candidates, which must then be manually examined by a human. Thus, it can barely be called an image identification system.
Illustra developed a body of technology to be used for image identification and retrieval. Informix acquired Illustra in 1996.
The technology employed is the familiar one of extracting the attributes related to color, composition, structure and texture. These attributes are translated into a standard language, which fits into a file structure. Unknown images are decomposed by the same methods into terms that can be used to search the file structure. The output is said to return possible matches, ordered from the most to the least probable. The information extracted from the unknown image can be supplemented or replaced by input data supplied by the user.
Aside from the general purpose of image identification and retrieval (by Informix's Excalibur System), this technology has been applied to the archiving and retrieval of video images (by Virage, Inc. and Techmath).
Management of information is one of the greatest problems confronting our society. As the sheer volume of generated information increases dramatically every year, effective and efficient access to stored information becomes a particular concern.
While information in its physical embodiment was once stored in file cabinets, libraries, archives and the like, to be accessed through arcane means such as the Dewey Decimal System, current needs dictate that information must be stored as digital data in electronic media. Database management systems have been developed to identify and access information that can be simply and uniquely described through its alphanumeric keywords. A document entitled “New Varieties of Wheat” appearing in the Journal of Agronomy, series 10, volume 3, Jan. 4, 1999 is easy to digitize, store and retrieve. The search mechanism, given all of the identifications above, can be swift, efficient and foolproof. Similarly, cross-referencing according to field of interest, subject matter, etc. works rather well.
Currently, however, much of the information with which we are confronted is presented in pictorial form. Though we can create arbitrarily accurate representations of objects in pictorial form, such as digital images, and can readily store such images, the accessing and retrieving of this information often presents difficulties. For the sake of the present discussion, the term “digital image” is defined as a facsimile of a pictorial object wherein the geometrical and chromatic characteristics are represented in digital form.
Many such images can be stored and retrieved efficiently and accurately through associated alphanumeric keywords, i.e., meta-data. The associated information Claude Monet-Poppies-1892 might allow the unique identification and retrieval of a famous painting. Graphics used for advertising might be identified by the associated information of the date of creation, the subject matter and the advertising agency that created them. But if one considers the cases of an unattributed painting or undocumented pictorial advertising copy, i.e., no meta-data, such identifications become more problematic.
There are innumerable instances in which one has only the digital image on hand (one can always generate a digital image from a physical object if need be) and it is desired to access information in a database concerning its identification, its original nature, etc. In such cases, the seeker has no information with which to search an appropriate database, other than the information of the image itself.
Consider some examples of the cases noted above.
   (1.) Let us postulate that a person had a swatch of fabric having a particular pattern of colors, shapes, textures, etc. Further, let us assume that the swatch has no identifying labels. The person wishes to identify the textile. Assuming that a catalog of all fabrics existed, the person might be able to narrow the search through observation of the type of fabric and the like, but, in general, the person would have no choice but to visually compare his sample fabric to all the other fabrics, one at a time.
   (2.) It is desired to identify an unknown person in a photograph, when the person is not otherwise identified, but is thought to be pictorially represented in a database, for example, a database of all passport pictures. Except for the obvious partitions according to sex of subject, age of subject, and other meta-data sortings, there exists no effective way to identify the person in the photograph other than through direct comparison by humans with all the pictures in the database.
   (3.) A person possesses a porcelain dinner plate of unknown origin, which is believed to be valuable due to the observable characteristics of the object. The person wishes to ascertain the history of and the approximate value of the plate. In this case, the pictorial database exists mostly in reference books and in the minds of experts. Assuming the first case, the person must compare the object to images stored in the appropriate books, image by image. In the second case, the person must identify an appropriate expert, present the expert with the object or pictorial representations of the object, and hope that the expert can locate the proper reference in the database or provide the required information from memory.
In all the examples presented above, the problem solution rests upon humans visually comparing objects, or images of objects, to images in a database. As current and future electronic media generate, store and transmit an ever-increasing torrent of images, for a multitude of purposes, it is certain that a great many of these images will be of sufficient importance that it will be imperative for the images themselves to serve as their own descriptors, i.e., no meta-data. The problem of manually associating keyword descriptions, i.e., meta-data, with every digitally stored image to permit rapid retrieval from image databases very quickly becomes unmanageable as the number of pertinent images grows.
Assuming, then, that an image's composition itself must somehow serve as an image's description in image databases, we immediately are faced with the problem that the compositions of pictorial images are presented in a language that we neither speak nor understand. Images are composed of shapes, colors, textures, etc., rather than of words or numbers.
At a most basic level, a digitized image can be completely described in terms of hue, saturation and intensity at each pixel location. There is no more information to be had from the image. Furthermore, this definition of an image is the one definition currently existing which is universal and is presented in a language which all can understand. Viewed from this perspective, it is worth investigating further.
The naive approach to identifying an unknown image by associating it with a stored image found within a given database of digitized images would be to compare a digitized facsimile of the unknown image to each image in the database on a pixel by pixel basis. When each pixel of a stored image is found to match each pixel of the unknown image, a match between that particular stored image and the unknown image can be said to have occurred. The unknown image can now be said to be known, to the extent that the ancillary information attached to the stored image can now be associated with the unknown image.
When considered superficially, the intuitive procedure given above seems to offer a universal solution to the problem of managing image databases. Practical implementation of such an approach presents a plethora of problems. The process does not provide any obvious means for subdividing the database into smaller segments, one of which can be known a priori to contain the unknown image. Thus, the computer performing the comparisons must do what a human would have to do: compare each database image to the unknown image one at a time on a pixel-by-pixel basis. Even for a high-speed computer, this is a very time consuming process.
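The naive procedure can be sketched in a few lines. Images are represented here as lists of rows of pixel values; that representation, the function names and the tiny example database are assumptions made for illustration.

```python
def images_match(a, b):
    """True only if both images have the same size and every pixel is equal."""
    return len(a) == len(b) and all(
        row_a == row_b for row_a, row_b in zip(a, b)
    )

def identify(unknown, database):
    """Linear scan: compare the unknown image against every stored image
    and return the name attached to the first exact match, if any."""
    for name, stored in database.items():
        if images_match(unknown, stored):
            return name
    return None

# Hypothetical 2x3 single-channel images.
database = {
    "plate": [[1, 2, 3], [4, 5, 6]],
    "swatch": [[9, 9, 9], [0, 0, 0]],
}

print(identify([[9, 9, 9], [0, 0, 0]], database))  # swatch
print(identify([[9, 9, 9], [0, 0, 1]], database))  # None
```

Nothing in this sketch allows part of the database to be skipped, so the work per query is proportional to the number of stored images multiplied by the number of pixels per image, and a single differing pixel defeats the match.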
In many cases, the database images and the unknown image are not geometrically registered to each other. That is, because of relative rotation and/or translation between the database image and the unknown image, a pixel in the first image will not correspond to a pixel in the second. If the degree of relative rotation/translation between the two images is unknown or cannot be extracted by some means, identification of an unknown image by this method becomes essentially impossible for a computer to accomplish. Because a pixel-by-pixel comparison, commonly referred to as template matching, seems to be such an intuitively obvious answer to the problem, it has been analyzed and tested extensively and has been found to be impractical for any but the simplest applications of image matching, such as coin or currency recognition.
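The registration problem can be demonstrated with a one-pixel translation. The checkerboard test image, the zero fill value at the exposed border and the function names are assumptions chosen to make the effect easy to see.

```python
def shift_right(image, fill=0):
    """Translate an image one pixel to the right, padding with `fill`."""
    return [[fill] + row[:-1] for row in image]

def fraction_matching(a, b):
    """Fraction of pixel positions at which two same-sized images agree."""
    total = sum(len(row) for row in a)
    same = sum(
        pa == pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return same / total

stored = [
    [0, 9, 0, 9],
    [9, 0, 9, 0],
    [0, 9, 0, 9],
    [9, 0, 9, 0],
]
unknown = shift_right(stored)

print(fraction_matching(stored, stored))   # 1.0
print(fraction_matching(stored, unknown))  # 0.125
```

Although the shifted image shows the same pattern, only 2 of its 16 pixels agree with the stored image, so a pixel-by-pixel comparison would reject it unless the translation were first recovered and undone.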
All other image recognition schemes with which we are familiar are based upon the extraction of distinctive features from an unknown image and correlation of such features with a database of like features, with each feature set having been similarly extracted from and related to each stored image. The term pattern recognition has come to represent all such methods. Examples of such feature sets, which can be extracted and used, might be line segments, defined, perhaps, by the locations of the endpoints, by their orientation, by their curvature, etc. The reduction of images to feature sets is always an attempt to translate image composition, for which there is no language, into a restrictive dictionary of image features.
The selection of feature sets and their application to image matching have been investigated intensely. The feature sets used have been largely based upon the intuition of the process designer. Some systems of feature matching have performed quite well in image matching problems of limited scope (such as identifying a particular manufactured part as being of a pre-defined class of similar parts; distinguishing between a military tank and a military truck, etc.). However, no system has yet solved the general problem of matching an unknown image to its counterpart in an image database.