One of the important operations in information processing is retrieving items with approximately matching attributes. In essence, this problem can be treated as follows. Consider a set of information items that are characterized by binary vectors whose bit-position values indicate the presence or absence of corresponding attributes. The attributes may include, for example, the presence or absence of respective search terms, letters of an alphabet constituting a word or phrase, or any other set of qualities exhibited by an item. Then, given a binary vector specified in a similar way by the attributes of a requested item, it is necessary to select from the considered set the near-matching binary vectors, i.e., those vectors which are close to the given vector in terms of the Hamming metric. Such a near-match procedure is encountered in different information processing constructions.
In computer systems, a primitive searching operation is retrieving words by exact matching. More advanced types of information retrieval are based on some sophisticated combinations involving this primitive operation. The near-match procedure is related to the problem of a neighborhood search in a multidimensional space. In general, there is no efficient solution to this problem. Normally, it is based on a brute-force approach, which implies either sequential comparison of all the elements of the system with a given information item, or taking each individual element in a neighborhood of a given information item for direct access to the system. The former strategy would require O(N) operations, where N is the number of elements in the system. The latter strategy would require a combinatorial number of accesses depending on the specified size of the neighborhood of a given binary vector.
More recently, a near-match procedure has been described by Berkovich, S. and El-Qawasheh, E., "Reversing the Error-Correction Scheme for a Fault-Tolerant Indexing," The Computer Journal, vol. 43, no. 1, pp. 54–64 (2000), incorporated herein by reference in its entirety. In this work, for the realization of the near-match procedure, the authors suggest employing a hash coding method. In this case, the comparison of information items is organized using indices that are obtained by a hash transformation function. With a hash method, the retrieval procedure takes on average O(1) operations with a small overhead to resolve collisions.
Searching with not exactly matching information items can be done using a hash transformation that tolerates some distortions. For approximate matching of words in information retrieval systems, such fault-tolerant hashing can be implemented as a two-step process. At the first step, an entering character string is transformed into a reduced form so as to provide the same code for anticipated distortions of given words. At the second step, a hash transformation is applied to this reduced form of the word. Resolution of hash coding collisions utilizes an approximate comparison of the original character strings. The most popular realization of this approach is based on the so-called Soundex concept, which is designed to provide the same reduced form for phonetically similar names (see, e.g., Wiederhold, G., Database Design, McGraw-Hill Inc., N.Y. (1983)). Coding names in Soundex is performed by retaining the first letter of the name while dropping all the vowels from the remaining letters. After that, the remaining letters are replaced by numbers according to a set of rules.
The authors suggest a systematic approach to the expression of closeness relying upon representation of object attributes by binary vectors of a fixed length. In this case, a collection of objects close to a given object can be defined as a set of binary vectors located within a certain Hamming distance from a binary vector of a given object. A relationship between a hashing technique and the Hamming distance concept has been considered in the paper Balakirsky, V. “Hashing of databases based on indirect observations of Hamming distances”, IEEE Trans. Inform. Theory, vol. 42, pp. 664–671 (1996). The representation of binary attribute vectors involving a regular partitioning of a multidimensional binary cube suggests employing the mathematical technique of error-correction codes. This idea reverses the conventional scheme of application of error-correction codes as illustrated in FIG. 1.
This method of indexing information objects with a certain tolerance in terms of the Hamming metric utilizes a hash transformation based on a decoding procedure of error-correction codes. In the conventional usage of error-correction codes, a dispatched dataword is supplemented with extra bits, so that a longer sequence of bits, called a codeword, can be used at the receiver site for recovering from distortions of the original data. Error-correction codes are described by three parameters, (n,k,d), where n is the length of the codeword, k is the length of the dataword, and d is the minimum Hamming distance between the codewords.
To provide fault-tolerant indexing, the authors suggest reversing the conventional application of error-correction coding (FIG. 1). Considering the decoding procedure as a primary operation, a neighborhood of codewords may be mapped into a smaller collection of datawords. Then, these datawords can constitute hash indices to provide search for binary vectors that are close in Hamming's metric.
For purposes of illustration, the possibility of using the Hamming code (7,4,3) is shown in Table 1. This code consists of 16 datawords of 4-bit length; each dataword corresponds to a decoding sphere of radius 1 in a 7-dimensional binary cube. To deliver a 4-bit message, the sender expands the message by adding 3 bits to it and transmits 7 bits instead of 4. For example, a message such as 1001 will be converted into the 7-bit codeword 1001011, which is the center of a decoding (i.e., codeword) sphere. The receiver can reconstruct the transmitted 7-bit codeword as long as the received word contains no more than a one-bit distortion, i.e., remains within the codeword sphere.
TABLE 1. Illustration of perfect mapping for a hash transformation with Hamming code (7,4,3): hash indices vs. codeword spheres.
In the suggested fault-tolerant indexing technique, this scheme is reversed. That is, given a 7-bit key, a hash index of this key is determined by the decoding procedure, so it represents a dataword. For instance, the above-considered 7-bit key 1001011 will get a hash index equal to 1001. Then, consider a one-bit distortion of this 7-bit key, for example, 1001111. The decoding procedure applied to this distortion will yield the same dataword: 1001. Thus, with this hashing function, any 7-bit key at a Hamming distance 1 from the given key 1001011 can be retrieved through the same hash index 1001. However, this simple retrieval tolerating a one-bit mismatch occurs only if a given key represents the center of a codeword sphere. Otherwise, keys at Hamming's distance 1 may belong to an adjacent sphere. In this case, the retrieval procedure has to be expanded to probe more hash indices.
That is, of the 128 keys, sixteen correspond to the 7-bit codewords for the respective 4-bit datawords, while the remaining 112 keys are the one-bit distortions of these codewords. Thus, while a one-bit mismatch of a 7-bit codeword will be "corrected" to yield the original dataword, a one-bit mismatch of one of the remaining 112 keys (which already represent a one-bit distortion) may either (i) negate the original mismatch, one time out of seven, or (ii) result in a key that, when "corrected," yields an incorrect dataword, six times out of seven.
Generally, a tolerance to one-bit mismatch can be implemented via brute-force by probing each hash index corresponding to all one-bit modifications of a given word. This would involve inspecting 8 hash buckets: one determined by the hash value of this key and 7 buckets determined by the hash values of all of its one-bit modifications. The suggested scheme employing error-correction codes for a fault-tolerant retrieval gives an advantage of reducing the excessiveness of this brute-force procedure.
For clarification, consider how the fault-tolerant retrieval can be organized for a 7-bit key: 1001100. This retrieval would require probing the hash values for all of the following one-bit modifications of this key:
    1001100 + 0000000 = 1001100 → 1011
    1001100 + 0000001 = 1001101 → 0001
    1001100 + 0000010 = 1001110 → 1000
    1001100 + 0000100 = 1001000 → 1101
    1001100 + 0001000 = 1000100 → 1000
    1001100 + 0010000 = 1011100 → 1011
    1001100 + 0100000 = 1101100 → 1101
    1001100 + 1000000 = 0001100 → 0001
Here and below, in manipulations with binary code vectors, the sign "+" denotes mod-2 addition. In Table 1 the modified key values are underlined in the highlighted codeword spheres. We can see that there are just 4 hash values for all possible one-bit distortions of the original vector. In other words, with this hash transformation, the number of probes in a searching procedure is reduced from 8 to 4.
The Hamming code (7,4,3) is perfect in the sense that its codeword spheres cover all possible 7-bit combinations: 2^7 = 128. Thus, a decoding procedure of such a code can be directly used for the described fault-tolerant hash transformation. The suggested approach can also be realized using non-perfect codes, but this would require a certain adaptation of the decoding procedure and would result in less regular structural developments. Unfortunately, perfect codes are rare. The Hamming codes, a family of (2^m − 1, 2^m − m − 1, 3) codes for any integer m ≥ 3, form a class of single-error-correcting perfect codes. "Besides the Hamming codes the only other nontrivial binary perfect code is the (23,12) Golay code" (see, e.g., Lin, S. and Costello, D. (1983) Error Control Coding: Fundamentals and Applications, Prentice-Hall Inc., Englewood Cliffs, NJ).
The practicality of the suggested technique of fault-tolerant indexing is determined by the amount of redundancy employed by the hash transformation. Apparently, this redundancy would be decreased if the space of binary attribute vectors were partitioned with bigger codeword spheres. In this respect, the Golay code, capable of correcting three errors by virtue of a minimum Hamming distance of 7 between codewords, has a definite advantage over single-error-correcting Hamming codes, where this minimum distance is only 3. It turns out that fault-tolerant indexing utilizing the Golay code (23,12,7) can be rather effective for near-match searches in real information systems of a reasonable size. A description of the Golay code (23,12,7) and implementations thereof can be found in U.S. Pat. No. 4,414,667 of Bennett entitled "Forward Error Correcting Apparatus" issued Nov. 18, 1983 and U.S. Pat. No. 5,968,199 of Khayrallah et al. entitled "High Performance Error Control Decoder" issued Oct. 19, 1999, both of which are incorporated herein by reference in their entirety.
The Golay code transforms 12-bit messages by appending 11 correction bits. Thus, 2^12 = 4096 binary vectors are transformed into 23-bit codewords. Each codeword is associated with a decoding sphere containing all vectors that are at Hamming distance ≤ 3 from the codeword. The number of 23-bit vectors in each of these spheres, T, is
    T = C(23,0) + C(23,1) + C(23,2) + C(23,3)
      = (1 + 23 + 253) + 1771
      = 277 + 1771
      = 2048 = 2^11    (1)

where C(n,k) denotes the binomial coefficient.
The Golay code (23,12,7) is perfect in the sense that the codeword spheres cover all possible 23-bit combinations: 2^23 = 2^12 · 2^11.
The Golay code can be applied to fault-tolerant hashing because of its basic properties and related procedures. That is, the Golay code (23,12,7) can be treated as a cyclic code, an important subclass of linear codes, whose implementation is simplified by using algebraic manipulations with generating polynomials rather than computations with parity-check matrices. Thus, the Golay code (23,12,7) can be generated by either of the following polynomials (see, e.g., Lin, S. and Costello, D. (1983) Error Control Coding: Fundamentals and Applications, Prentice-Hall Inc., Englewood Cliffs, NJ):

    G1(x) = 1 + x^2 + x^4 + x^5 + x^6 + x^10 + x^11    (2)
    G2(x) = 1 + x + x^5 + x^6 + x^7 + x^9 + x^11    (3)
The 23-bit vectors of the Golay code, having coordinates labeled from 0 through 22, are represented by congruence classes of polynomials modulo (x^23 − 1) with coefficients in the binary field GF(2).
The encoding and decoding procedures have been realized following the algorithmic constructions given in Vasconcellos, P., Vojcic, B., and Pickholtz, R. (1994) Hard Decision Decoding of the (23,12,7) Golay Code, Tech. Rep., George Washington University, USA, which are described below for the sake of convenience. The authors' application of the Golay codes, being oriented towards information retrieval, requires manipulations with binary vectors as if they were regular hash indices. This has been implemented in the C language, representing binary vectors as 32-bit "unsigned long integers". Thus, the generator polynomials are, respectively, the integers 3189 and 2787. The operations of addition, multiplication, and division of the polynomials have been expressed through the bitwise operations of "exclusive OR" and shifting.
The data message is a 12-bit binary vector, that is, an integer in the range from 0 to 4095. Adding 11 parity-check bits results in a 23-bit codeword with binary-valued components (a_0, a_1, a_2, ..., a_21, a_22). Let the data message be represented by the polynomial I(x):

    I(x) = a_11 x^11 + a_12 x^12 + ... + a_22 x^22    (4)

and the parity check by the polynomial P(x):

    P(x) = a_0 + a_1 x + a_2 x^2 + ... + a_10 x^10    (5)
Then a codeword C(x) is represented by:

    C(x) = I(x) + P(x)    (6)

A codeword C(x) must be a multiple of the generator polynomial:

    C(x) = Q(x)G(x)    (7)

where G(x) is one of the polynomials G1(x) or G2(x) and Q(x) is a polynomial of degree 11. Taking into account equation (6) and dividing I(x) by G(x), we get:

    I(x) = Q(x)G(x) + P(x)    (8)
Thus, the encoding procedure consists of the following three steps:
1—Multiply the message polynomial by x^11 to obtain I(x)
2—Divide I(x) by G(x) to obtain the remainder P(x)
3—Form the codeword by combining I(x)+P(x)
A decoding procedure converts a 23-bit codeword back into the 12-bit original message. Let M(x) be the received vector. Suppose that the original codeword was corrupted by the addition of an error pattern E(x) containing three or fewer 1s:

    M(x) = C(x) + E(x)    (9)
Further, dividing M(x) by G(x) we get:

    M(x) = A(x)G(x) + S(x)    (10)
where S(x), a polynomial of degree 10 or less, is the syndrome of the received vector M(x). The syndrome S(x)=0 if and only if M(x) is a codeword.
Since C(x) is a multiple of the polynomial G(x), we have:

    E(x) = [A(x) + Q(x)]G(x) + S(x)    (11)
There is a one-to-one correspondence between the 2^11 23-bit patterns with 3 or fewer errors and the 2^11 distinct syndromes. So, a table of 2^11 = 2048 possible error patterns associated with syndromes can be constructed. This can be done by dividing all possible error vectors by the generator polynomial. The decoding procedure goes as follows:
1—Calculate the syndrome of the received vector
2—Enter the table to obtain the error pattern
3—Add the error pattern to the received vector
Using the Golay code for fault-tolerant indexing is based on an observation expressed by the following lemma. On one hand, this lemma can be derived as a corollary of the association of the Golay code with combinatorial designs (see, e.g., Pless, V. (1998) Introduction to the Theory of Error-Correcting Codes, John Wiley & Sons, Inc., New York). On the other hand, the statement of this lemma can be easily tested by an exhaustive computer search. However, a direct proof of this lemma provides an instructive geometrical insight into the details of the suggested technique.
Given the 23-dimensional binary cube partitioned into 2^12 equal decoding spheres, a probing sphere with a Hamming radius of 1 placed in this cube either lies completely within one of the decoding spheres or spreads uniformly over six decoding spheres. Thus, consider a decoding sphere with center P0: according to equation (1), this sphere contains 2048 points, which are divided into four categories depending on their distance from P0. Consider two cases determined by the distance D(P0, T0) between P0 and the center T0 of the probing sphere:
Case #1: 0 ≤ D(P0, T0) ≤ 2. In this case, the probing sphere of radius 1 fits completely into the decoding sphere of radius 3.
Case #2: D(P0, T0) = 3. To begin with, let us introduce the set U of 23-bit vectors of weight at most 1, which constitute a probing sphere of radius 1: U = {u_0, u_1, u_2, ..., u_23}, where u_0 = (000...00), u_1 = (000...01), u_2 = (000...10), ..., u_22 = (010...00), u_23 = (100...00)
Then T0, the center of the probing sphere, can be represented with three different fixed unit vectors, say u_i, u_j, and u_k:

    T0 = P0 + u_i + u_j + u_k    (12)
The points T_s which constitute this probing sphere can be represented by means of all one-bit modifications of the vector (12):

    T_s = P0 + u_i + u_j + u_k + u_s    (13)
where s = 0, 1, 2, ..., 23. First consider the situations where s = i, s = j, or s = k. In these situations, u_s cancels one of the fixed unit vectors, and T_s stays within the decoding sphere around P0. Thus, the four vectors of U with s = i, s = j, s = k, and s = 0 produce points falling in the decoding sphere with center P0.
Now, consider a modification of T0 by another vector u_s, where s is different from i, j, k, and 0; call this vector u_w. Apparently, the resulting point falls into an adjacent decoding sphere, at a distance of 3 from its center P1. So, we have:

    P0 + u_i + u_j + u_k + u_w = P1 + u_x + u_y + u_z    (14)
The vectors u_x, u_y, and u_z are all different, and none of them can be equal to u_i, u_j, u_k, or u_w; otherwise, the distance between P0 and P1 would be less than 7, in contradiction with the property of the Golay code partitioning. The equality (14) can be rewritten in three other ways by swapping u_w with each of u_x, u_y, and u_z. This means that each of the four points of the probing sphere with center T0, namely T_w, T_x, T_y, and T_z, goes to the adjacent decoding sphere with center P1. So, of the 24 points constituting the probing sphere, we have located one quadruple of points in the original decoding sphere and another quadruple in an adjacent decoding sphere. By continuing the described procedure and selecting another vector u_s with an index s different from any of those already considered, we can find another quadruple of points falling in a different adjacent sphere. This procedure can be repeated until all of the points of U are exhausted. As a result, the 24 points of U combine into different quadruples that fall in different adjacent decoding spheres. Therefore, in case #2, when D(P0, T0) = 3, we always get 6 different decoding spheres: the original one and five adjacent ones.
The searching logic of fault-tolerant indexing can be organized with the Golay code partitioning in the same way as has been illustrated with the Hamming code. Case #1 corresponds to the creation of one central hash index; this occurs in 277/2048 ≈ 13.5% of all situations. Case #2 corresponds to the creation of 6 hash indices, one central and 5 peripheral; this occurs in 1771/2048 ≈ 86.5% of all situations.
Applying the presented analysis for the Golay code to partitionings of multi-dimensional binary cubes with perfect Hamming codes (2^m − 1, 2^m − m − 1, 3), one can obtain analogous results. Namely, a probing sphere with a Hamming radius of 1 placed in such a cube either lies completely within one of the decoding spheres or spreads uniformly over 2^(m−1) decoding spheres. So, for m = 3, i.e., for the Hamming code (7,4,3), we get four partitioning spheres, as shown in Table 1. Further, the case when the probing sphere falls within only one decoding sphere occurs very infrequently: in 1 of 2^m situations. Thus, employing Hamming codes (2^m − 1, 2^m − m − 1, 3) for the suggested technique of fault-tolerant indexing will require essentially 2^(m−1) hash indices. This implies a substantially higher redundancy in comparison with the Golay code. Therefore, the perfect partitioning offered by the Golay code appears most advantageous for the implementation of the suggested technique of fault-tolerant indexing.
The elementary computer operation available for data retrieval is a strict comparison of one word with another. In searching for an exact match, a given key is compared with certain keys stored in the system. The realization of approximate matching relies upon an expansion of the comparison operations: a matching neighborhood of a given key is involved, either upon access or through replication in the storage system. In other words, the realization of approximate matching requires redundant efforts in terms of time and/or space: on one hand, it is possible to test matching neighborhoods by increasing the number of accesses; on the other hand, this can be done by replicating the information inside the memory. The retrieval capabilities depend only on the sizes of the matching neighborhoods, irrespective of whether they are introduced on the side of "time" or of "space".
The Golay code offers different possibilities for the realization of the suggested fault-tolerant indexing technique. Variations in the sizes of the matching neighborhoods can be intermixed with a combined usage of both kinds of the Golay code corresponding to the two generating polynomials. The retrieval capabilities of these variations are presented in FIGS. 2A–2F. The recall for a given key, i.e., the percentage of accessible keys with respect to the total number of keys at a certain Hamming distance, does not depend on how the matching operations are organized. The data can be stored with no replication and accessed several times, or replicated in the storage and accessed once.
Notations for the considered formation variations are specified by the sizes of the matching neighborhoods. The case 1-1 shown in FIG. 2A means matching two neighborhoods of size 1, i.e., just a direct matching of 12-bit hashes. Of course, this matching gives a 100% recall when the 23-bit vectors are equal (Hamming distance 0), but occasionally a small portion of matches occurs for keys at higher Hamming distances. The case 1-6 shown in FIG. 2B corresponds to matching neighborhoods of size 1 vs. 6. In this situation, for Hamming distance 1 we get a 100% recall. The case 6-6 shown in FIG. 2C presents matching neighborhoods of size 6 vs. 6. The remaining three cases of FIGS. 2D–2F show the corresponding matchings when the two kinds of the Golay code partitionings are used together. So, for example, the case 2-12 means that two neighborhoods of size 1 (one per partitioning) are matched vs. two neighborhoods of size 6.
The sizes of the matching neighborhoods used in the notations of FIGS. 2A–2F give an approximate estimate of the required redundancies. The actual redundancy is less, because a matching neighborhood of size 6, represented by a sphere of radius 1, has a probability of about 0.135 of being reduced to size 1 when it falls completely within a decoding sphere. The calculated values of the actual redundancies are given in Table 2.
TABLE 2. Time-Space Redundancies

    Searching Scheme    Redundancy
    1-1                 1-1
    1-6                 1-5.32
    6-6                 5.32-5.32
    2-2                 1.97-1.97
    2-12                1.97-10.33
    12-12               10.33-10.33
The choice of an appropriate searching scheme depends upon a compromise between the desired retrieval characteristics and the implementation overhead (FIGS. 2A–2F). The schemes 1-1 and 2-2 do not offer a tangible retrieval enhancement beyond exact matching. The schemes 1-6 and 2-12 give a 100% assurance of retrieving keys only at Hamming distance 1. However, searching within a neighborhood of radius 1 may not be sufficient for some applications. The case 12-12 provides full retrieval capabilities for keys at Hamming distance 2 and more than 80% chances of retrieval for keys at Hamming distances 3 and 4. However, this imposes a high redundancy both in time and in space. The remaining case, 6-6, offers almost the same retrieval performance but with substantially less redundancy. The retrieval performance of this scheme is rather significant: it guarantees a 100% recall when the distance between two keys is less than or equal to two. It turns out that two keys at Hamming distance 2, having 6 hashes each, always share two hash values. Thus, the case 6-6 is considered an exemplary searching scheme with a reasonable overhead. Computational arrangements with this scheme are shown in the diagram of FIG. 3, while an implementation of this scheme is presented in FIG. 4, described in further detail below.
There might be further elaborations of the organization of fault-tolerant indexing. In particular, the matching neighborhood can be extended to radius 2. For a straightforward brute-force approach in the case of 23-bit vectors, this would imply a redundancy factor of 277. Using the Golay code, this factor is reduced to 22. With such high replication in memory, it becomes possible to search keys at Hamming distance 2 with only one access to memory. This can be instrumental in some time-constrained applications where saving memory is of less importance, for example, signal processing with vector quantization. Prior work has focused on the basic 6-6 scheme.
The Golay dataword is used as a hash index into a list of buckets, each bucket storing the binary attribute vectors that hash into that index. As described, 86.5% of the vectors will have six 12-bit Golay hash indices, while 13.5% will have a single 12-bit hash index. Thus, each vector will be found either in the six buckets corresponding to its six Golay hash indices, or in the one bucket at its single Golay hash index. FIG. 3 can be used to illustrate this technique.
The 6-6 scheme has been considered as the basic scheme for the suggested technique of fault-tolerant indexing, as it offers an acceptable trade-off between the space-time redundancy and the retrieval characteristics. Searching with this scheme yields the whole neighborhood of a given binary vector at Hamming distance 2 and a certain portion of the neighborhoods at greater distances (see FIG. 2C). The speed of the searching operations for this basic scheme may be enhanced with some adjustments. Thus, for example, the number of hash indices, either on the storing side or on the accessing side, can be reduced from 6 to 5. This still guarantees the retrieval of binary vectors at Hamming distance 2 but sacrifices the recall of binary vectors at higher Hamming distances.
Utilization of the main 6-6 variant of this searching scheme begins with filling the hash table. The hash table is an array of 2^12 = 4096 pointers to the buckets with the 23-bit keys. Both kinds of binary vectors, the 12-bit indices into the array and the 23-bit keys, may be represented as unsigned integers in 32-bit words. For example, a 23-bit vector (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,1) with "1" in positions 5, 3, 2, and 0 is represented as the unsigned long integer 45 = 2^5 + 2^3 + 2^2 + 2^0. For this 23-bit key we get 6 unsigned long integers as 12-bit hash indices: 0, 47, 61, 493, 1581, and 2093. These 6 numbers are used to select, among the 4096 array pointers, the 6 buckets into each of which the considered key 45 is to be inserted.
In general, searching with this hash table requires traversing all 6 buckets associated with a given key. The performance of searching is determined primarily by the average size of the buckets. With ordinary hashing, a set of 4096 random, uniformly distributed keys would be accommodated in this table in buckets of a small average size. With the suggested fault-tolerant indexing, inserting 4096 keys into the 4096-position hash table results in about sixfold redundancy. Ideally, we would get 4096 buckets of size 6, or more accurately of size 5.32. However, the average bucket size exceeds this redundancy coefficient because of the shrinking of the scattering intervals, which depends on the particular patterns of the incoming binary vectors.
While this technique limits the searching required, so that close matches have hashes overlapping with that of the target, it still requires the search of several lists, particularly in the 86.5% of cases where six indices are produced. These lists may become very long as the number of vectors to be stored in or referenced by the hash table approaches and exceeds the size of the 12-bit hash index space.
The close-matching technique described provides a valuable tool for retrieving items with approximately matching attributes. However, an efficient implementation of the technique is limited to relatively small sets of items. Also, the prior implementation of this technique relies on an exhaustive method of producing distortions of a codeword to identify nearby codewords in adjacent decoding spheres, as shown in FIG. 4.