This invention relates generally to image processing and, in particular, to methods whereby still or moving images or other objects are transformed into more compact forms for comparison and other purposes.
There are many systems in common use today whose function is automatic object identification. Many make use of cameras or scanners to capture images of objects, and employ computers to analyze the images. Examples are bill changing machines, optical character readers, blood cell analyzers, robotic welders, electronic circuit inspectors, to name a few. Each application is highly specialized, and the detailed design and implementation of each system is finely engineered to the specific requirements of the particular application, most notably the visual characteristics of the objects to be recognized. A device that is highly accurate in recognizing a dollar bill would be worthless in recognizing a white blood cell.
The more general problem of identifying an image (or any object through the medium of an image) based solely upon the pictorial content of the image has not been satisfactorily addressed. Considering that the premier model for a generalized identification system is the one which we all carry upon our shoulders, i.e., the human brain, it is not surprising that the general system does not yet exist. Any child can identify a broad range of pictures better than can any machine, but our understanding of the processes involved is so rudimentary as to be of no help in solving the problem.
As a result, the means that have been employed amount to shrewd applications of heuristic methods. Such methods generally are derived from the requirements of a particular problem. Current technology often uses such an approach to successfully solve specific problems, but the solution to the general image identification problem has remained remote.
The landscape of the patent literature referring to image identification is broad, but very shallow. The following is a summary of two selected patents and three commercial systems which are considered to represent the current state of the art.
U.S. Pat. No. 5,893,095 to Jain et al. presents a detailed framework for a pictorial content based image retrieval system and even presents this framework in representative hardware. Flowcharts are given describing the operation of the framework system. The system depends for identification upon the matching of visual features derived from the image pictorial content. Examples of these visual features are hue, saturation and intensity histograms; edge density; randomness; periodicity; algebraic moments of shapes; etc. Some of these features are computed over the entire image and some are computed over a small region of the image. Jain does not reveal the methods through which such visual features are discerned. These visual features are expressed in Jain's system as "primitives", which appear to be constructed from the visual features at the discretion of a human operator.
A set of primitives and primitive weightings appropriate to each image is selected by the operator and stored in a database. When an unknown image is presented for identification, it can either be processed autonomously to create primitives or the user can specify properties and/or areas of interest to be used for identification. A match is determined by comparing the vector of weighted primitive features obtained for the query image against all the weighted primitive feature vectors for the images in the database.
Given the information provided by Jain, one skilled in the art could not construct a viable image identification system because the performance of the system is dependent upon the skill of the operator at selecting primitives, primitive weightings, and areas of interest. Assuming that Jain ever constructed a functioning system, it is not at all clear that the system described could perform the desired function. Jain does not provide any enlightenment concerning realizable system performance.
U.S. Pat. No. 5,852,823 to De Bonet teaches an image recognition system that is essentially autonomous. Image feature information is extracted through application of particular suitable algorithms, independent of human control. The feature information thus derived is stored in a database, which can then be searched by conventional means. De Bonet's invention offers essentially autonomous operation (he suggests that textual information might be associated with collections of images grouped by subject, date, etc. to thereby subdivide the database) and the use of features derived from the whole of the image. Another point of commonality is the so-called "query by example" paradigm, wherein the search of the image database is predicated upon information extracted exclusively from the pictorial content of the unknown image.
De Bonet takes some pains to distinguish his technology from that developed by IBM and Illustra Information Technologies, which are described later in this section. He is quite critical of those technologies, declaring that they can address only a small range of image identification and retrieval functions.
De Bonet refers to the features that he extracts from images as the image's signature. The signature for a given image is computed according to the following sequence of operations: (1) The image is split into three images corresponding to the three color bands. Each of these three images is convolved with each of 25 pre-determined and invariant kernels. (2) The 75 resulting images are each summed over the image's range of pixels, and the 75 sums become part of the image's signature. (3) Each of the 75 convolved images is again convolved with the same set of 25 kernels. Each of the resulting 1875 images is summed over its range of pixels, and the 1875 sums become part of the image's signature. (4) Each of the 1875 convolved images is convolved a third time with the same set of 25 kernels. The resulting 46,875 images are each summed over the image's range of pixels, and the 46,875 sums become part of the original image's signature.
In the simplest case, then, the 48,825 sums (46,875+1875+75) serving as the signature are stored in an image database, along with ancillary information concerning the image. It should be noted that this description was obtained from De Bonet's invention summary. Later, he uses just the 46,875 elements obtained from the third convolution. An unknown image is put through the same procedure. The signature of the unknown image is then compared to the signatures stored in the database one at a time, and the best signature matches are reported. The corresponding images are retrieved from an image library for further examination by the system user.
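The signature sizes quoted above follow directly from the band and kernel counts. As a quick check of the arithmetic, the following sketch (the function and parameter names are chosen here purely for illustration) reproduces the per-level counts and the total:

```python
def signature_sizes(bands=3, kernels=25, levels=3):
    """Return the number of summed images produced at each convolution level."""
    sizes = []
    count = bands
    for _ in range(levels):
        # Every image from the previous level is convolved with every kernel.
        count *= kernels
        sizes.append(count)
    return sizes


sizes = signature_sizes()   # one entry per convolution level
total = sum(sizes)          # size of the full signature
```

With three color bands and 25 kernels, the three levels yield 75, 1875 and 46,875 sums, for a total of 48,825.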
In a somewhat more complex scenario, it is posited that the system user has a group of images that are related in some way (all are images of oak trees; all are images of sailboats; etc.). With the signatures of each member of the group already calculated, the means and variances of each element of their signatures (all 48,825) are computed, thereby creating a composite signature representing all member images of the group, along with a parallel array of variances. When a signature in the database is compared to a given signature, the difference between each corresponding element of the signatures is inversely weighted by the variance associated with that element. The implicit assumption upon which the weighting process is based is that elements exhibiting the least variance would be the best descriptors for that group. In principle, the system would return images representative of the common theme of the group.
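The variance weighting described above can be illustrated with a short sketch. It is not De Bonet's implementation: the use of squared differences, the small `eps` guard against zero variance, and all names are assumptions made here for illustration.

```python
from statistics import fmean, pvariance


def composite_signature(signatures):
    """Per-element mean and variance across the signatures of a group."""
    columns = list(zip(*signatures))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances


def weighted_distance(candidate, means, variances, eps=1e-12):
    """Sum of squared differences, each divided by that element's variance.

    Elements that vary little within the group therefore dominate the score.
    eps is an assumed guard so that a zero variance does not divide by zero.
    """
    return sum(
        (c - m) ** 2 / (v + eps)
        for c, m, v in zip(candidate, means, variances)
    )


# A two-element toy group: element 0 is stable, element 1 varies widely.
group = [[1.0, 10.0], [1.2, 20.0], [0.8, 30.0]]
means, variances = composite_signature(group)
near = weighted_distance([1.0, 25.0], means, variances)  # off only in the noisy element
far = weighted_distance([1.5, 20.0], means, variances)   # off in the stable element
```

In this toy case the candidate that deviates in the low-variance element scores as the poorer match, even though its raw difference is smaller.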
Additionally, such composite signatures can be stored in the image database. Then, when a signature matching a composite signature is found, the system returns a group of images which bear a relation to the image upon which the search was based.
The system is obviously very computation-intensive. De Bonet used a 200 MHz computer based upon the Intel Pro processor to generate some system performance data. He reports that a signature can be computed in 1.5 minutes. Using a database of 1500 signatures, image retrieval took about 20 seconds. The retrieval time should be a linear function of database size.
In terms of commercial products, Cognex, Inc. offers an image recognition system under the trademarked name "Patmax" intended for industrial applications concerning the gauging, identification and quality assessment of manufactured components.
The system is trained on a comprehensive set of parts to be inspected, extracting key features (mostly geometrical) and storing them in a file associated with that particular part. Thereafter, the system is able to recognize that part under a variety of conditions. It is also able to identify parts independent of object scale and to infer part orientation.
In the early to mid 1990s, IBM (Almaden Research Center) developed a general-purpose image identification/retrieval system. Reduced to software that runs under the OS/2 operating system, it has been offered for sale as Ultimedia Manager 1.0 and 1.1, successively.
The system identifies an image principally according to four kinds of information:
1. Average color, calculated by simply adding all of the RGB color values in each pixel.
2. Color histogram, in which the color space is divided into 64 segments. A heuristic method is used to compare one histogram to another.
3. Texture, defined in terms of coarseness, contrast and direction. These features are extracted from gray-level representations of the images.
4. Shape, defined in terms of circularity, eccentricity, major axis direction, algebraic moments, etc.
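To make the second item concrete, the following sketch shows one way a 64-segment color histogram could be built and compared. The details are assumptions and are not taken from the IBM system: each RGB channel is split into four equal intervals (4 x 4 x 4 = 64 bins), and histogram intersection stands in for the unspecified heuristic comparison.

```python
def color_histogram(pixels, levels=4):
    """Count pixels in levels**3 bins; with levels=4 that is 64 bins."""
    hist = [0] * (levels ** 3)
    for r, g, b in pixels:
        # Map each 0-255 channel value to one of `levels` equal-width intervals.
        qr, qg, qb = (min(levels - 1, c * levels // 256) for c in (r, g, b))
        hist[(qr * levels + qg) * levels + qb] += 1
    return hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: shared counts divided by the first histogram's total."""
    total = sum(h1)
    return sum(min(a, b) for a, b in zip(h1, h2)) / total if total else 0.0


mostly_red = color_histogram([(255, 0, 0)] * 3 + [(0, 0, 255)])
all_blue = color_histogram([(0, 0, 255)] * 4)
similarity = histogram_intersection(mostly_red, all_blue)
```

Here the two toy images share only their blue pixels, so the similarity is one quarter.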
In addition to the distinguishing information noted above, which can be extracted from a given image automatically, the IBM system is said to have means through which a user can supplement the information extracted automatically by manually adding information such as user-defined shapes, particular areas of interest within the image, etc.
The system does not rank the stored images in terms of the quality of match to an unknown, but rather selects 20-50 good candidates, which must then be manually examined by a human. Thus, it can barely be called an image identification system.
Illustra developed a body of technology to be used for image identification and retrieval. Informix acquired Illustra in 1996.
The technology employed is the familiar one of extracting the attributes related to color, composition, structure and texture. These attributes are translated into a standard language, which fits into a file structure. Unknown images are decomposed by the same methods into terms that can be used to search the file structure. The output is said to return possible matches, ordered from the most to the least probable. The information extracted from the unknown image can be supplemented or replaced by input data supplied by the user.
Aside from the general purpose of image identification and retrieval (by Informix's Excalibur System), this technology has been applied to the archiving and retrieval of video images (by Virage, Inc. and Techmath).
Management of information is one of the greatest problems confronting our society. As the sheer volume of generated information increases dramatically every year, effective and efficient access to stored information becomes a particular concern.
While information in its physical embodiment was once stored in file cabinets, libraries, archives and the like, to be accessed through arcane means such as the Dewey Decimal System, current needs dictate that information must be stored as digital data in electronic media. Database management systems have been developed to identify and access information that can be simply and uniquely described through its alphanumeric keywords. A document entitled "New Varieties of Wheat" appearing in the Journal of Agronomy, series 10, volume 3, Jan. 4, 1999 is easy to digitize, store and retrieve. The search mechanism, given all of the identifications above, can be swift, efficient and foolproof. Similarly, cross-referencing according to field of interest, subject matter, etc. works rather well.
Currently, however, much of the information with which we are confronted is presented in pictorial form. Though we can create arbitrarily accurate representations of objects in pictorial form, such as digital images, and can readily store such images, the accessing and retrieving of this information often presents difficulties. For the sake of the present discussion, the term "digital image" is defined as a facsimile of a pictorial object wherein the geometrical and chromatic characteristics are represented in digital form.
Many such images can be stored and retrieved efficiently and accurately through associated alphanumeric keywords, i.e., meta-data. The associated information Claude Monet-Poppies-1892 might allow the unique identification and retrieval of a famous painting. Graphics used for advertising might be identified by the associated information of the date of creation, the subject matter and the creating advertisement agency. But if one considers the cases of an unattributed painting or undocumented pictorial advertising copy, i.e., no meta-data, such identifications become more problematic.
There are innumerable instances in which one has only the digital image on hand (one can always generate a digital image from a physical object if need be) and it is desired to access information in a database concerning its identification, its original nature, etc. In such cases, the seeker has no information with which to search an appropriate database, other than the information of the image itself.
Consider some examples of the cases noted above.
(1.) Let us postulate that a person had a swatch of fabric having a particular pattern of colors, shapes, textures, etc. Further, let us assume that the swatch has no identifying labels. The person wishes to identify the textile. Assuming that a catalog of all fabrics existed, the person might be able to narrow the search through observation of the type of fabric and the like, but, in general, the person would have no choice but to visually compare his sample fabric to all the other fabrics, one at a time.
(2.) It is desired to identify an unknown person in a photograph, when the person is not otherwise identified, but is thought to be pictorially represented in a database, for example, a database of all passport pictures. Except for the obvious partitions according to sex of subject, age of subject, and other meta-data sortings, there exists no effective way to identify the person in the photograph other than through direct comparison by humans with all the pictures in the database.
(3.) A person possesses a porcelain dinner plate of unknown origin, which is believed to be valuable due to the observable characteristics of the object. The person wishes to ascertain the history of and the approximate value of the plate. In this case, the pictorial database exists mostly in reference books and in the minds of experts. Assuming the first case, the person must compare the object to images stored in the appropriate books, image by image. In the second case, the person must identify an appropriate expert, present the expert with the object or pictorial representations of the object, and hope that the expert can locate the proper reference in the database or provide the required information from memory.
In all the examples presented above, the problem solution rests upon humans visually comparing objects, or images of objects, to images in a database. As current and future electronic media generate, store and transmit an ever-increasing torrent of images, for a multitude of purposes, it is certain that a great many of these images will be of sufficient importance that it will be imperative for the images themselves to serve as their own descriptors, i.e., no meta-data. The problem of manually associating keyword descriptions, i.e., meta-data, with every digitally stored image to permit rapid retrieval from image databases very quickly becomes unmanageable as the number of pertinent images grows.
Assuming, then, that an image's composition itself must somehow serve as an image's description in image databases, we immediately are faced with the problem that the compositions of pictorial images are presented in a language that we neither speak nor understand. Images are composed of shapes, colors, textures, etc., rather than of words or numbers.
At a most basic level, a digitized image can be completely described in terms of hue, saturation and intensity at each pixel location. There is no more information to be had from the image. Furthermore, this definition of an image is the one definition currently existing which is universal and is presented in a language which all can understand. Viewed from this perspective, it is worth investigating further.
The naive approach to identifying an unknown image by associating it with a stored image found within a given database of digitized images would be to compare a digitized facsimile of the unknown image to each image in the database on a pixel by pixel basis. When each pixel of a stored image is found to match each pixel of the unknown image, a match between that particular stored image and the unknown image can be said to have occurred. The unknown image can now be said to be known, to the extent that the ancillary information attached to the stored image can now be associated with the unknown image.
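The naive procedure can be sketched in a few lines. This is an illustration only; images are represented here as lists of rows of pixel values, and the function names and database layout are assumptions.

```python
def images_match(a, b):
    """True when two images (lists of pixel rows) agree at every pixel."""
    return len(a) == len(b) and all(row_a == row_b for row_a, row_b in zip(a, b))


def find_exact_match(unknown, database):
    """Compare the unknown image to each stored image in turn.

    Returns the key of the first stored image that matches pixel for pixel,
    or None when no stored image matches.
    """
    for key, stored in database.items():
        if images_match(unknown, stored):
            return key
    return None


database = {
    "checker": [[0, 1], [1, 0]],
    "bar": [[1, 1], [0, 0]],
}
found = find_exact_match([[1, 1], [0, 0]], database)    # identical to "bar"
shifted = find_exact_match([[0, 0], [1, 1]], database)  # "bar" moved down one row
```

The search visits every stored image, and the second query shows that even a one-row shift of a stored image defeats an exact pixel comparison.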
When considered superficially, the intuitive procedure given above seems to offer a universal solution to the problem of managing image databases. Practical implementation of such an approach presents a plethora of problems. The process does not provide any obvious means for subdividing the database into smaller segments, one of which can be known a priori to contain the unknown image. Thus, the computer performing the comparisons must do what a human would have to do: compare each database image to the unknown image one at a time on a pixel-by-pixel basis. Even for a high-speed computer, this is a very time consuming process.
In many cases, the database images and the unknown image are not geometrically registered to each other. That is, because of relative rotation and/or translation between the database image and the unknown image, a pixel in the first image will not correspond to a pixel in the second. If the degree of relative rotation/translation between the two images is unknown or cannot be extracted by some means, identification of an unknown image by this method becomes essentially impossible for a computer to accomplish. Because a pixel-by-pixel comparison, commonly referred to as template matching, seems to be such an intuitively obvious answer to the problem, it has been analyzed and tested extensively and has been found to be impractical for any but the simplest applications of image matching, such as coin or currency recognition.
All other image recognition schemes with which we are familiar are based upon the extraction of distinctive features from an unknown image and correlation of such features with a database of like features, with each feature set having been similarly extracted from and related to each stored image. The term pattern recognition has come to represent all such methods. Examples of such feature sets, which can be extracted and used, might be line segments, defined, perhaps, by the locations of the endpoints, by their orientation, by their curvature, etc. The reduction of images to feature sets is always an attempt to translate image composition, for which there is no language, into a restrictive dictionary of image features.
The selection of feature sets and their application to image matching have been investigated intensely. The feature sets used have been largely based upon the intuition of the process designer. Some systems of feature matching have performed quite well in image matching problems of limited scope (such as identifying a particular manufactured part as being of a pre-defined class of similar parts; distinguishing between a military tank and a military truck, etc.). However, no system has yet solved the general problem of matching an unknown image to its counterpart in an image database.
The method of this invention presents an effective means for addressing the general problem of image recognition described above. It does not depend upon feature extraction and is not related to any other image-matching system. The method derives from the study of certain stochastic processes, commonly referred to as chaos theory, in particular, the study of strange attractors. In this method, an auxiliary construct, a chaotic system, is associated with an image. The auxiliary construct is a dynamic system whose behavior is described by a system of linear differential equations whose coefficients are dynamically derived from the values of the pixels in the digital image. As the dynamic system is successively iterated, it is observed that the system converges towards an attractor state, that is, random behavior becomes predictable and the system reaches an equilibrium configuration. The equilibrium configuration uniquely represents the digital image upon which it has been constructed.
The form of the auxiliary construct that has been commonly used during the development of this invention is a rectangular, orthogonal grid, though the invention does not depend upon any particular grid form. It is assumed hereafter that a rectangular auxiliary grid is used, and it will hereafter be referred to as the warp grid. The warp grid is assigned a particular mesh scale and location relative to the original image. The locations of all grid intersections are noted and stored.
A series of transformations is then imposed upon the warp grid. Each transformation is governed by a given set of transformation rules which use the current state of the warp grid and the information contained in the invariant underlying original image. The grid intersections will generally translate about the warp grid space as the result of each transformation. However, the identity of each intersection is maintained. At each iteration of the warp grid, the image is sampled at the warp grid points. The number of warp grid points is many orders of magnitude smaller than the number of pixels in the digital image, and the number of iterations is on the order of a hundred. The total number of computational steps is well within the capabilities of ordinary personal computers to implement very rapidly. After a given number of transformations have been performed upon the warp grid, the final position of each of the grid intersections is noted. For each grid point, a vector is formed between its original position and its final position. The set of all such vectors, corresponding to all of the original grid points, constitutes a unique representation of the underlying original image, called a Visual Key.
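The overall flow of this process (place a grid, iterate, sample the image only at the grid points, then record the displacement of each point) can be sketched as follows. The transformation rule used here is a stand-in chosen purely for illustration, since this section does not specify the actual rules: each point moves a fraction `rate` toward the image-weighted centroid of itself and its four grid neighbors. The grid size, the nearest-pixel sampling and all names are likewise assumptions.

```python
def sample(image, x, y):
    """Nearest-pixel sample of the image at normalized coordinates in [0, 1]."""
    h, w = len(image), len(image[0])
    col = min(w - 1, max(0, int(x * w)))
    row = min(h - 1, max(0, int(y * h)))
    return image[row][col]


def warp_grid_vectors(image, n=4, iterations=100, rate=0.1):
    """Return {grid point identity: (dx, dy)} after iterating a toy warp rule."""
    # Initial n-by-n grid at cell centers; the (i, j) key is the point's identity.
    start = {(i, j): ((i + 0.5) / n, (j + 0.5) / n)
             for i in range(n) for j in range(n)}
    grid = dict(start)
    for _ in range(iterations):
        new_grid = {}
        for (i, j), (x, y) in grid.items():
            # Assumed rule: consider the point and its 4-connected grid neighbors.
            ids = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            pts = [grid[k] for k in ids if k in grid]
            # The image is sampled only at current grid point positions.
            weights = [sample(image, px, py) + 1e-9 for px, py in pts]
            total = sum(weights)
            cx = sum(wt * px for wt, (px, _) in zip(weights, pts)) / total
            cy = sum(wt * py for wt, (_, py) in zip(weights, pts)) / total
            new_grid[(i, j)] = (x + rate * (cx - x), y + rate * (cy - y))
        grid = new_grid
    # One vector per grid point: final position minus original position.
    return {k: (grid[k][0] - start[k][0], grid[k][1] - start[k][1]) for k in start}
```

For an image given as a list of rows of intensities in the range 0 to 1, `warp_grid_vectors(image)` returns one displacement vector per grid point. This toy rule tends to contract the grid toward bright regions and should not be read as the behavior of the invention's own transformation rules.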
This resultant set of vectors represents a coherent language through which we can compare and identify distinct images. In the preferred embodiment, which addresses the problem of matching an unknown image to an image in a database, we could use the following procedure. First we would apply a given warp grid iterative process to each original image. From each such procedure we would obtain a vector set associated with that image, and the vector set would be stored in a database. An unknown image that had a correspondent in the database could be processed in the same way and identified through matching the resultant vector set to one of the vector sets contained in the database. Of course, auxiliary information commonly used for database searching, such as keywords, could also be used in conjunction with the present invention to augment the search process.
The size of the vector set is small compared to the information contained in the image. The vector set is typically on the order of a few kilobytes. Thus, even if the database were to be searched exhaustively to find a match to an unknown image's vector set, the search process will be fairly rapid even for a database containing a significant number of vector sets. Of greater importance is the fact that the database used for identification of unknown images need not contain the images themselves, but only the vector sets and enough information to link each vector set to an actual image. The images themselves could be stored elsewhere, perhaps on a large, remote, centrally located storage medium. Thus, a personal computer system, which could not store a million images, could store the corresponding million information sets (vector sets plus identification information), each of a few kilobytes in size. As has been mentioned, the personal computer would be more than adequate to apply the image transformation operations to an unknown image in a timely manner. The personal computer could compute the vector set for the unknown image and then could access the remote storage medium to retrieve the desired image identification information.
In practice, however, the matching of vector components can be too slow to allow a very large database of many millions of images to be searched in a timely manner. As noted in the following, there may not be a perfect match between a vector set derived from an unknown image and a vector set stored in the database. A unique search method dealing with this uncertainty, which is also very fast and efficient, will be described herein.
The unknown image and the corresponding database image will generally have been made either with two different imaging devices, by the same imaging device at different times, or under different conditions with different settings. In all cases, any imaging device is subject to uncertainties caused by internal system noise. As a result, the unknown image and the corresponding image in the database will generally differ. Because the images differ, the vector sets associated with each will generally differ slightly. Thus, as noted above, a given vector set derived from the unknown image may not have an exact correspondent in the database of vector sets. A different aspect of the invention addresses this problem and simultaneously increases the efficiency of the search process.
The search process employed by this invention for finding a corresponding image in a database is called squorging, a newly coined term derived from the root words sequential and originating. The method sequentially examines candidate database images for their closeness of match in a sequential order determined by their a priori match probability. Thus, the most likely match candidate is examined first, the next most likely second, and so forth. The process terminates when a match of sufficient closeness is found, or a match of sufficient closeness has not been found in the maximum allowable number of search iterations.
The squorging method depends upon an index being prefixed to each image vector set in the database. A pre-selected group of j warp grid points is used to construct the index. Each x and y component of the pre-selected group of warp grid vectors is quantized into two intervals, represented by the digits 0 and 1. In effect, each vector set has been recast as a set of 2*j lock tumblers, with each tumbler having 2 positions. Associated with each vector set in the database, then, is a set of 2*j tumblers, each of which is set to one of 2 values. The particular value of each tumbler is determined by which interval the vector component magnitude is quantized into.
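As an illustration of the tumbler construction, the sketch below quantizes the x and y components of j pre-selected warp grid vectors into one bit each, giving an index key of 2*j binary tumblers. The interval boundary (a single threshold, defaulting to zero) and the data layout are assumptions made here; the text above does not state how the two intervals are chosen.

```python
def index_key(vectors, selected, threshold=0.0):
    """Build a tuple of 2*j tumblers from the j selected warp grid vectors.

    vectors maps a grid point identity to its (dx, dy) displacement.
    Each component contributes one tumbler: 1 if it is at or above the
    assumed threshold, otherwise 0.
    """
    tumblers = []
    for point in selected:
        dx, dy = vectors[point]
        tumblers.append(1 if dx >= threshold else 0)
        tumblers.append(1 if dy >= threshold else 0)
    return tuple(tumblers)


vectors = {(0, 0): (0.2, -0.1), (1, 0): (-0.3, 0.4), (2, 0): (0.0, 0.5)}
key = index_key(vectors, selected=[(0, 0), (1, 0)])  # j = 2, so 4 tumblers
```

Because the key is a hashable tuple, it can serve directly as a dictionary key that maps to every vector set sharing that tumbler setting.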
At this point in the process, every entry in the database is associated with a set of 2*j tumblers, with each tumbler position determined by the underlying vector set components. These tumbler sets are referred to as index keys. Note that there is not necessarily a one-to-one relationship between vector sets and index keys in the database. A single index key can be related to several vector sets.
Returning to the unknown image, selected elements of its vector set are similarly recast into an index key. However, in the case of the unknown, statistics which are known a priori are used to calculate the most probable index key associated with the unknown image, the next most probable, and so on. The index keys are calculated on demand in order of decreasing probability of the unknown index key being the correct one.
These index keys are checked sequentially against the index keys in the database until one is calculated having an exact correspondent in the database of index keys. Note that not all of the index keys in the list necessarily have exact matches in the database of index keys. If the first index key on the list matches an index key in the database, all vector sets associated with that index key are examined to determine the closest match to the vector set associated with the unknown image. Then the corresponding database image is said to most probably be the unknown image. Likewise, the second, third, etc. most probable matches can be identified.
If a match is not found within the scope of the first index key, the first index key calculated is discarded, and the next most probable index key is calculated. The squorging operation determines which tumblers in the index key to change to yield the next most probable index key. The process is repeated until a satisfactory match between the Visual Key Vector associated with the unknown image and a Visual Key Vector in the database is found.
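The idea of calculating index keys on demand in order of decreasing probability can be sketched as a best-first enumeration. This is not the squorging algorithm itself, whose details are given elsewhere; it assumes that each tumbler has a known probability of being 1 (strictly between 0 and 1), that tumblers are independent, and it simplifies the final step by returning whatever is stored under the first key found. All names are hypothetical.

```python
import heapq
import math


def keys_by_probability(p_one):
    """Yield (key, probability) for every binary key, most probable first.

    p_one[i] is the assumed probability that tumbler i is 1.
    """
    best = [1 if p >= 0.5 else 0 for p in p_one]
    # Cost (log-odds, >= 0) of flipping a tumbler away from its likelier value.
    cost = [abs(math.log(p) - math.log(1.0 - p)) for p in p_one]
    order = sorted(range(len(p_one)), key=lambda i: cost[i])
    base = sum(math.log(max(p, 1.0 - p)) for p in p_one)

    def emit(flips, total):
        key = list(best)
        for pos in flips:
            key[order[pos]] ^= 1
        return tuple(key), math.exp(base - total)

    yield emit((), 0.0)
    if not order:
        return
    heap = [(cost[order[0]], (0,))]
    while heap:
        total, flips = heapq.heappop(heap)
        yield emit(flips, total)
        last = flips[-1]
        if last + 1 < len(order):
            step = cost[order[last + 1]]
            # Extend the flip set with the next-cheapest tumbler.
            heapq.heappush(heap, (total + step, flips + (last + 1,)))
            # Or swap the last flipped tumbler for the next-cheapest one.
            heapq.heappush(
                heap, (total - cost[order[last]] + step, flips[:-1] + (last + 1,))
            )


def probe_database(p_one, database, max_keys=1000):
    """Try candidate keys in order until one exists in the database."""
    for tried, (key, _) in enumerate(keys_by_probability(p_one), start=1):
        if key in database:
            return key, database[key]
        if tried >= max_keys:
            break
    return None
```

For example, with tumbler probabilities `[0.9, 0.6, 0.2]` the first candidates are `(1, 1, 0)`, then `(1, 0, 0)`, then `(1, 1, 1)`. Each new key is derived from the previous state by changing only the tumblers that cost the least probability, and the search gives up after `max_keys` candidates.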
The squorging method does not perform very well when the individual picture objects are individual frames of a movie or video stream. The high degree of frame-to-frame correlation necessary to convey the illusion of subject motion means that individual warp grid vectors are likely to be significantly correlated. This results in an undesirably sparse distribution of index keys with some of the index keys being duplicated very many times. Therefore, in order to extend the present invention to the recognition of streams, additional algorithms referred to as "Holotropic Stream Recognition" are presented.
Holotropic Stream Recognition (HSR) employs the warp grid algorithm on each frame of the picture object stream, but rather than analyzing the warp grid vectors themselves to generate index keys, HSR analyzes the statistics of the spatial distribution of warp grid points in order to generate index keys. Furthermore, rather than employing fixed threshold levels to define individual tumbler probabilities, the HSR methodology constructs a dynamic decision tree whose threshold levels are individually adjusted each time an individual tumbler probability is generated. Finally, the method of squorging itself is replaced by a statistical inference methodology, which is effective precisely because the individual frames of a picture object stream are highly correlated.
It is the intention of this invention to permit an appropriately equipped and programmed computer to perform Picture identifications similar to those that would be performed by a trained human identifier, only with a substantially greater memory for different Pictures and significantly faster and more reliable performance.