A wide range of computing tasks today require the comparison of some sort of "search criteria" to each Record in a database. Where the search criteria is certain to be found exactly in one or more of the Records, or not at all, excellent techniques are well known to organize and index the database, to provide rapid access to the Records of interest. This situation is called the problem of exact matching. Where the search criteria is unlikely to be found exactly in one or more of the Records, particularly where some kind of similarity or relevance metric is computed between the search criteria and each Record, then the techniques that work for the former, exact-match situation do not usually apply. This latter situation is called the problem of inexact matching.
One solution to the problem of inexact matching is to explicitly compare the search criteria with each Record in the database, applying whatever similarity or relevance metric may be appropriate to the situation. This solution, however, is slow, particularly when the number of Records in the database is large.
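The exhaustive approach can be sketched as follows. This is a minimal illustration only: the similarity metric (`jaccard`, a Jaccard overlap of character bigrams), the `threshold` parameter, and the sample Records are all assumptions chosen for the example, not anything prescribed by the text.

```python
def jaccard(a: str, b: str) -> float:
    """Hypothetical similarity metric: Jaccard overlap of character bigrams."""
    def bigrams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a.lower()), bigrams(b.lower())
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def exhaustive_search(criteria, records, threshold=0.5):
    """Score the criteria against every Record; the work grows with len(records)."""
    hits = []
    for number, record in enumerate(records):
        score = jaccard(criteria, record)
        if score >= threshold:
            hits.append((number, score))
    # Best-scoring Records first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Illustrative data only.
records = ["john smith", "jon smyth", "mary jones", "john smithe"]
results = exhaustive_search("john smith", records)
```

Every Record is scored, even those that share nothing with the search criteria, which is why this approach becomes slow as the database grows.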
Better solutions are needed for the problem of inexact matching when the number of Records in the database is large, and some better methods are known. Two prominent ones are the method of superimposed coding and the method of inverted list tables. Both of these methods depend for their operation on "Features" of both the Records and the search criteria, that is, information elements that can be used to accumulate evidence about the relevance of any Record to the search criteria.
For each possible Feature, the method of superimposed coding provides a bitmap containing one bit for each Record in the database, where a "0" bit guarantees that the corresponding Record does not contain the corresponding Feature, while a "1" bit indicates that the corresponding Record may contain the corresponding Feature.
Similarly, for each possible Feature, the method of inverted list tables provides a Record number or other identifier for each Record in the database that may contain the corresponding Feature, and no other Record numbers.
It is easy to see that these two methods, superimposed coding and inverted list tables, both provide essentially the same information in different forms. Given the information in the form of either method, it is straightforward to construct the form of the other method.
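The equivalence of the two forms can be illustrated with a pair of conversion routines. This is a sketch under assumed representations: each bitmap is a Python list of 0/1 values indexed by Record number, each inverted list is a list of Record numbers, and the function names are hypothetical.

```python
def bitmaps_to_lists(row_bitmaps):
    """Turn each Feature's bitmap into the list of Record numbers whose bit is 1."""
    return {feature: [rec for rec, bit in enumerate(bits) if bit]
            for feature, bits in row_bitmaps.items()}


def lists_to_bitmaps(row_lists, num_records):
    """Turn each Feature's Record-number list into a bitmap with one bit per Record."""
    row_bitmaps = {}
    for feature, record_numbers in row_lists.items():
        bits = [0] * num_records
        for rec in record_numbers:
            bits[rec] = 1
        row_bitmaps[feature] = bits
    return row_bitmaps


# Illustrative data only: two Features over a four-Record database.
bitmaps = {"red": [1, 0, 1, 0], "apple": [1, 1, 0, 1]}
lists = bitmaps_to_lists(bitmaps)
round_trip = lists_to_bitmaps(lists, num_records=4)
```

Converting in one direction and back reproduces the starting table, so neither form carries information the other lacks.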
To use either of these methods to assist with the problem of inexact matching, one first builds a feature table in the selected form. This feature table is simply a collection of the information available under that form about each Record and for each Feature. A feature table is typically organized with rows of the table representing Features, and the information within each row representing Record status with respect to that row's Feature.
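Building such a table might look like the following sketch, shown here in the inverted list form. The Feature extractor (`extract_features`, which treats each lowercase word as a Feature) is purely an assumption for illustration; a real system would choose Features suited to its own similarity metric.

```python
def extract_features(text):
    """Hypothetical Feature extractor: each lowercase word is one Feature."""
    return set(text.lower().split())


def build_feature_table(records):
    """Return a table with one row per Feature, listing the Records that contain it."""
    table = {}
    for number, record in enumerate(records):
        for feature in extract_features(record):
            table.setdefault(feature, []).append(number)
    # Records are visited in ascending order, so every row is already sorted.
    return table


# Illustrative data only.
records = ["red apple", "green apple", "red cherry"]
table = build_feature_table(records)
```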
The feature table can then be used to eliminate quickly those Records that do not contain all of the Features of interest, as follows.
In the case of the method of superimposed coding, each row corresponds to a Feature and lists a "0" for Records that do not contain that Feature, and a "1" otherwise. The search criteria is examined for any of these Features that it may contain. The Features it does contain are then used to select row bitmaps. These row bitmaps are then logically combined using the bitwise logical AND function, with the result that the bitmap produced will contain a "0" in every bit position corresponding to a Record that does not contain all of the required Features. The Records corresponding to "1" values in the result may contain all of the required Features. Each Record corresponding to a "1" value is then exhaustively compared to the search criteria by whatever means is in use, with the logical certainty that none of the Records corresponding to "0" values in the result bitmap are worthy of further consideration.
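The bitmap filtering step can be sketched as below. The representation is an assumption made for the example: each row bitmap is a Python integer in which bit r stands for Record r, and the function name and sample rows are hypothetical.

```python
from functools import reduce


def candidates_by_bitmap(row_bitmaps, query_features, num_records):
    """AND together the row bitmaps of the query's Features; return surviving Record numbers."""
    all_ones = (1 << num_records) - 1  # with no usable Feature, no Record is eliminated
    # Only Features that have a row in the table can be used for filtering.
    selected = [row_bitmaps[f] for f in query_features if f in row_bitmaps]
    result = reduce(lambda a, b: a & b, selected, all_ones)
    return [rec for rec in range(num_records) if (result >> rec) & 1]


# Illustrative rows over four Records (bit 0 is Record 0).
row_bitmaps = {
    "red":   0b0101,  # Records 0 and 2
    "apple": 0b1011,  # Records 0, 1 and 3
    "green": 0b0010,  # Record 1
}
survivors = candidates_by_bitmap(row_bitmaps, ["red", "apple"], num_records=4)
```

Only the surviving Records would then be passed to the costly exhaustive comparison.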
Similarly, the method of inverted list tables can be employed for the same purpose. In the case of the method of inverted list tables, each row corresponds to a Feature and lists the Record numbers of the Records of the database that may contain the Feature. As before, the search criteria is examined for any of these Features that it may contain. The Features it does contain are then used to select row lists. These row lists are then logically combined by intersection: one must find those Record numbers that appear in each and every selected row list. The resulting list shows only the Record numbers of those Records that may contain all of the Features in the search criteria. Each such Record is then exhaustively compared to the search criteria by whatever means is in use, with the logical certainty that no other Records are worthy of further consideration.
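Combining row lists amounts to finding the Record numbers common to all of them. The sketch below assumes each row list is kept in ascending Record-number order so that two lists can be combined in a single merge pass; that ordering, the function names, and the sample rows are assumptions for illustration.

```python
def intersect_sorted(a, b):
    """Return the Record numbers present in both ascending lists, using one merge pass."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def candidates_by_lists(row_lists, query_features, num_records):
    """Intersect the row lists of the query's Features; return surviving Record numbers."""
    selected = [row_lists[f] for f in query_features if f in row_lists]
    if not selected:
        return list(range(num_records))  # no usable Feature, so nothing is eliminated
    selected.sort(key=len)  # start from the shortest list to keep intermediate results small
    result = selected[0]
    for row in selected[1:]:
        result = intersect_sorted(result, row)
    return result


# Illustrative rows over four Records.
row_lists = {"red": [0, 2], "apple": [0, 1, 3], "green": [1]}
survivors = candidates_by_lists(row_lists, ["red", "apple"], num_records=4)
```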
In many cases, the number of "1" values remaining in the result bitmap after application of the method of superimposed coding, or the number of Records in the result list after application of the method of inverted list tables, is substantially smaller than the total number of Records. Thus, even though every one of these candidate result Records must be evaluated further against the search criteria, the task is much smaller than it would have been if every Record in the database had to be evaluated. This ability to eliminate quickly all but a few candidate Records of the database, which alone then undergo costly close inspection, can be a great benefit with respect to the overall speed of the search process.
Both the method of superimposed coding and the method of inverted list tables can encounter situations in which the other method is superior in terms of speed of operation. In the method of superimposed coding, when two row bitmaps are logically combined, all bits of both bitmaps must be considered, and the number of computational operations that must be performed is proportional to the total number of Records. In the method of inverted list tables, when two row lists are logically combined, only those Records in each of the two row lists must be considered, and the number of computational operations that must be performed is proportional to the sum of the numbers of Records in the two lists (assuming each list is kept in Record-number order). In the "sparse" case, where the two row lists are very short and the corresponding two row bitmaps each contain very few "1" bits, the number of computational operations implied by the method of inverted list tables can be substantially less than the number of operations implied by the method of superimposed coding. Conversely, in the "dense" case, where the two row lists are very long and the corresponding two row bitmaps each contain very many "1" bits, the number of computational operations implied by the method of inverted list tables can be substantially more than the number of operations implied by the method of superimposed coding.
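A rough operation count makes the sparse and dense cases concrete. The cost model below is an assumption for illustration only: bitmaps are taken to be combined one 64-bit machine word at a time, and row lists are taken to be sorted and merged in a single pass, for which the work is bounded by the sum of the two list lengths. The database size and list lengths are made-up figures.

```python
def bitmap_and_cost(num_records, word_bits=64):
    """Word-level AND operations needed to combine two bitmaps of num_records bits."""
    return -(-num_records // word_bits)  # ceiling division


def list_merge_cost(len_a, len_b):
    """Upper bound on merge steps needed to intersect two sorted row lists."""
    return len_a + len_b


num_records = 1_000_000  # hypothetical database size

bitmap_cost = bitmap_and_cost(num_records)       # the same for any pair of rows
sparse_cost = list_merge_cost(10, 20)            # two very short row lists
dense_cost = list_merge_cost(600_000, 700_000)   # two very long row lists
```

Under these assumptions the bitmap cost is fixed by the database size, while the list cost tracks how many Records actually carry the Features, so the short lists win in the sparse case and the bitmaps win in the dense case.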
Thus both the method of superimposed coding and the method of inverted list tables have substantial weaknesses with respect to the other method under certain circumstances.