1. Technical Field
This invention relates to computerized data processing, and more particularly to structures and methods for storing and searching data using encoded signatures representing that data.
2. Description of the Prior Art
Methods for coding and organizing data to allow for faster searching are important to information systems. Signature coding is one such method. To understand the problems solved by this invention, we begin by explaining the signature generation or encoding process. We will use the term "record" to indicate a generic data object such as a database record or a text fragment within a document.
The actual encoding process consists of computing a short signature S1 containing only 1's and 0's for each record. Various known "hashing" techniques may be used for generating these signatures, and will not be discussed in detail. The resulting signature for each record is usually much smaller than the original record. The signature and the identifier of the record (called a TID) are stored on "pages" for later retrieval. A page is a fixed-size unit of storage that can contain key and signature data and may reside in memory or on disk.
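The encoding step can be illustrated with a minimal sketch. The text does not specify a hashing technique, so the signature width (SIG_BITS), the number of bits set per value (BITS_PER_VALUE), the use of SHA-256, and the function names below are all assumptions chosen only for illustration.

```python
import hashlib

SIG_BITS = 64        # assumed signature width; not specified in the text
BITS_PER_VALUE = 3   # assumed number of bit positions set per field value


def value_bits(value: str) -> int:
    """Hash one field value (or word) to a few bit positions of a signature."""
    sig = 0
    for i in range(BITS_PER_VALUE):
        digest = hashlib.sha256(f"{i}:{value.lower()}".encode("utf-8")).digest()
        pos = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << pos
    return sig


def make_signature(values) -> int:
    """Superimpose the bits of every value in a record into one signature."""
    sig = 0
    for v in values:
        sig |= value_bits(v)
    return sig


# A "page" is modeled here as a simple list of (signature, TID) pairs.
records = [("Chang", "Engineer"), ("Schek", "Scientist"),
           ("Yost", "Manager"), ("Lohman", "Scientist")]
page = [(make_signature(rec), tid) for tid, rec in enumerate(records)]
```

Because the same function is applied to records and to search terms, any value present in a record contributes the same bit positions to both signatures.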
To locate a record or text fragment containing one or more values, a signature is computed from the search terms by using the same encoding process. This "query" signature is then compared against the stored signatures. When the stored signature contains a 1 bit in each position in which there is a 1 bit in the query signature, the record associated with the signature is identified as potentially satisfying the query. The TID stored with the signature is then used to retrieve the record. The data fields in the record (or words in the text fragment) are precisely matched against the search values using a conventional string compare algorithm to determine if a match has occurred. Records which satisfy the precise match conditions are then returned to the user.
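The two-stage lookup described above (a signature test followed by an exact comparison) might be sketched as follows. The CRC-32 based signature function, the 32-bit width, and the names signature and search are assumptions made for this example, not part of the described method.

```python
import zlib

SIG_BITS = 32  # assumed signature width


def signature(values):
    """Toy encoder: each value sets one bit chosen by CRC-32 (an assumption)."""
    sig = 0
    for v in values:
        sig |= 1 << (zlib.crc32(v.encode("utf-8")) % SIG_BITS)
    return sig


def search(page, records, terms):
    """Return the TIDs of records that contain every search term."""
    query = signature(terms)
    results = []
    for stored, tid in page:
        # Stage 1: the stored signature must have a 1 wherever the query does.
        if (stored & query) != query:
            continue
        # Stage 2: fetch the record by TID and compare the values exactly,
        # which discards any record that passed stage 1 only by coincidence.
        if all(term in records[tid] for term in terms):
            results.append(tid)
    return results


records = {
    0: ("Chang", "Engineer"),
    1: ("Schek", "Scientist"),
    2: ("Yost", "Manager"),
    3: ("Lohman", "Scientist"),
}
page = [(signature(fields), tid) for tid, fields in records.items()]
```

Stage 1 can only produce extra candidates, never miss a true match, because every value in a record contributes its bits to that record's stored signature.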
To accommodate large numbers of records, "parent" signatures are computed for "groups" of records. Higher-level (e.g., grandparent) signatures are computed similarly for groups of lower-level signatures. These signatures can then be organized into a hierarchical (multi-level) file structure. One well-known method for computing a new parent signature is to superimpose or "bit-OR" a group of individual signatures. A query signature is then compared to this parent signature before it is compared to the individual signatures. If a 1 bit occurs in any position of the query signature without a corresponding 1 in the parent signature, the entire group of lower-level (child) signatures and their associated records need not be accessed for further examination. This process allows a parent signature to filter out a large number of non-matching signatures and records.
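A two-level version of this filtering can be sketched as below. The 8-bit signatures, the grouping into pairs, and the function names are illustrative assumptions; a real structure would hold many signatures per page and may have more levels.

```python
from functools import reduce
from operator import or_


def covers(stored, query):
    """True when stored has a 1 bit in every position where query has one."""
    return (stored & query) == query


def build_parent(child_sigs):
    """Superimpose (bit-OR) a group of child signatures into one parent."""
    return reduce(or_, child_sigs, 0)


def search(groups, query):
    """Return (matching TIDs, number of groups whose children were examined)."""
    hits, groups_examined = [], 0
    for parent, children in groups:
        if not covers(parent, query):
            continue  # the whole group is rejected without reading its children
        groups_examined += 1
        hits.extend(tid for sig, tid in children if covers(sig, query))
    return hits, groups_examined


# Two illustrative groups of (signature, TID) pairs.
leaf_groups = [
    [(0b00111010, 0), (0b01110100, 1)],
    [(0b10110000, 2), (0b01010100, 3)],
]
groups = [(build_parent([s for s, _ in g]), g) for g in leaf_groups]

# The query has 1 bits that the second parent lacks, so only the first
# group's children are examined.
hits, examined = search(groups, 0b00001010)   # hits == [0], examined == 1
```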
Unfortunately, when this technique is used, both saturation and combinatorial errors occur. As more signatures are superimposed into the parent signature, more bits are set to 1. At some point, saturation occurs and the parent signature contains all 1's. The parent signature then becomes useless, because it will match any query signature and never be rejected. Since several methods are known to control this saturation problem, it will not be discussed in detail.
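The saturation effect can be observed with a small simulation. The 16-bit width, the three bits set per child signature, and the use of random signatures in place of hashed records are assumptions made only to keep the sketch short.

```python
import random

SIG_BITS = 16                      # assumed, deliberately small width
ALL_ONES = (1 << SIG_BITS) - 1


def random_signature(rng, bits_set=3):
    """Stand-in for a hashed record: a signature with a few random 1 bits."""
    sig = 0
    for pos in rng.sample(range(SIG_BITS), bits_set):
        sig |= 1 << pos
    return sig


def fill_history(n_children, seed=0):
    """OR n_children signatures into a parent, recording the fraction of 1 bits."""
    rng = random.Random(seed)
    parent, history = 0, []
    for _ in range(n_children):
        parent |= random_signature(rng)
        history.append(bin(parent).count("1") / SIG_BITS)
    return parent, history


parent, history = fill_history(200)
```

The recorded fraction can only grow as children are added. Once the parent equals ALL_ONES, the test (parent & query) == query succeeds for every query, so the parent no longer filters anything.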
The second problem is that since the bits of a signature represent fields of the original records, the parent signature represents not only all existing individual records, but also nonexistent "virtual" records which appear to contain data formed by combining values from among the records in the group represented by the parent. These virtual records do not exist in the data, but are falsely indicated as existing by the parent signature. For example, assume records contain simple last name, title field pairs (Chang, Engineer), (Schek, Scientist), (Yost, Manager), and (Lohman, Scientist). Signatures for these might be (00111010), (01110100), (10110000), and (01010100). A parent signature (11111110) formed by bit-ORing these four signatures would correctly indicate the presence of the above records, but would also incorrectly indicate the presence of non-existent virtual records (Chang, Scientist), (Schek, Manager), (Yost, Engineer), etc.
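The example above can be checked directly. The four record signatures and the parent are taken from the text; the per-value bit patterns for "Chang" and "Scientist" are hypothetical, chosen only so that they are consistent with the record signatures given.

```python
# Record signatures as given in the example above.
RECORD_SIGS = {
    ("Chang", "Engineer"): 0b00111010,
    ("Schek", "Scientist"): 0b01110100,
    ("Yost", "Manager"): 0b10110000,
    ("Lohman", "Scientist"): 0b01010100,
}

# Parent formed by bit-ORing the four record signatures.
PARENT = 0
for sig in RECORD_SIGS.values():
    PARENT |= sig

# Hypothetical per-value bit patterns (not given in the text). Each is a
# subset of the bits of every record signature that contains the value.
VALUE_SIGS = {"Chang": 0b00110010, "Scientist": 0b01000100}


def covers(stored, query):
    """True when stored has a 1 bit in every position where query has one."""
    return (stored & query) == query


# Query for the non-existent virtual record (Chang, Scientist).
query = VALUE_SIGS["Chang"] | VALUE_SIGS["Scientist"]
parent_passes = covers(PARENT, query)
children_passing = [rec for rec, sig in RECORD_SIGS.items() if covers(sig, query)]
```

Under these assumed bit patterns the parent accepts the query while none of the four child signatures does, so the group is read without yielding any candidate record.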
The saturation and combinatorial error effects caused by using the superimposed method of grouping signatures result in records being accessed unnecessarily. Unnecessary accesses of records are called "data false drops". When a parent signature causes a set of child signatures to be accessed unnecessarily, this is called a "signature false drop".
Parent signatures indicate a superset of records over which an exact test must be performed. Ideally, the size of this set should match the size of the correct answer set (i.e., no false drops). Due to imperfections in hashing, and because of various saturation and combinatorial effects, this is not the case. Thus, the number of data and signature false drops is a crucial indicator of the effectiveness of any such coding scheme in eliminating non-matching records from further consideration. Several different multi-level signature organizations have already been investigated by Roberts (1979), Pfaltz (1980), Deppisch (1986), Sacks-Davis (1987), and others in attempts to solve these problems.
Pfaltz documents a multi-level signature organization using a sparse signature encoding scheme. Signatures with a low ratio of 1's to 0's are bit-OR'ed to form group signatures. While this helps the saturation problem, a combinatorial error remains. Queries composed of combinations of record values from the same group result in unnecessary accesses of record signatures. In addition to this combinatorial error, the sparse encoding scheme by Pfaltz results in an inefficient use of the signature space.
Roberts first proposed and implemented a signature storage method which minimized the combinatorial error effect by using a bit-sliced architecture. In this approach, signatures logically form rows in a matrix, and are physically stored by bit columns. When a query is processed, positions in the query signature where 1's occur indicate which columns in the matrix should be accessed and examined. The major disadvantage of this method is the high cost of updates and deletions. Since the storage for each bit column is determined by the total number of rows, the storage and update requirements for each column can be tremendous.
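The bit-sliced layout can be sketched as follows. This is a simplified in-memory model, not Roberts' implementation: each column is held as a Python integer used as a bitmap over rows, and the class and method names are assumptions.

```python
class BitSlicedFile:
    """Signatures are logical rows; storage is one bitmap per bit column."""

    def __init__(self, sig_bits):
        self.sig_bits = sig_bits
        # columns[j] holds bit j of every row: bit r of columns[j] is row r's bit j.
        self.columns = [0] * sig_bits
        self.n_rows = 0

    def insert(self, sig):
        """Append one signature; every column with a 1 bit must be updated."""
        row = self.n_rows
        for j in range(self.sig_bits):
            if (sig >> j) & 1:
                self.columns[j] |= 1 << row
        self.n_rows += 1
        return row

    def candidates(self, query):
        """AND together only the columns selected by the query's 1 bits."""
        rows = (1 << self.n_rows) - 1
        columns_read = 0
        for j in range(self.sig_bits):
            if (query >> j) & 1:
                rows &= self.columns[j]
                columns_read += 1
        matching = [r for r in range(self.n_rows) if (rows >> r) & 1]
        return matching, columns_read


store = BitSlicedFile(8)
for sig in (0b00111010, 0b01110100, 0b10110000, 0b01010100):
    store.insert(sig)

# A query with two 1 bits reads only two of the eight columns.
matching, columns_read = store.candidates(0b01000100)   # ([1, 3], 2)
```

The sketch also hints at the update cost noted above: each column spans all rows, so inserting or deleting a single signature touches storage that grows with the total number of rows.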
Sacks-Davis has devised a multi-level block approach that improves on the bit-sliced architecture first proposed by Roberts. In this approach, bit-sliced parent "block" signatures are used to reduce saturation. However, the combinatorial error problem is not solved. Furthermore, in environments where updates are frequent, the update costs of this approach are on the order of several dozen to over a hundred page accesses per signature insert, which is unacceptably high.
Deppisch has developed a multi-level method wherein leaf signatures are clustered by similarity of bit patterns. Signatures exhibit slightly less sensitivity to the combinatorial error effect due to the use of significantly larger data and query signatures. This method has two distinct disadvantages. First, more storage space is required for the larger signatures. Second, significantly more computation is required for the clustering algorithm.