1. Technical Field
The present disclosure relates to similarity testing by syndromes and decoding.
2. Description of the Related Art
Often, a memory receives a request to store a message identical to another message that is already stored by the memory. Storing multiple copies of the same message wastes memory space, in that less space is available for storing differing messages. One method for eliminating the storage of multiple copies of the same message involves identifying the identical messages and, when a message is received with a request for its storage, substituting a pointer to the stored copy for the received message. A cyclic redundancy code (CRC) check may be applied to each of the messages stored in memory and to the message received for storage to determine whether the received message is identical to a stored message.
For example, a hash function converts each stored sector within the memory to parity bits using a systematic encoding for a high-rate cyclic redundancy code. A newly arriving sector is considered potentially identical to a previously-stored sector if the hash value (i.e., the value of the parity bits) of the arriving sector is equal to that of the previously-stored sector.
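The pointer-substitution scheme described above can be sketched as follows. This is a minimal illustration only: it uses the CRC-32 from Python's `zlib` module as a stand-in for the high-rate cyclic redundancy code, and the `DedupStore` class and its field names are hypothetical. Because an equal hash value marks a sector only as potentially identical, the sketch confirms each candidate with a full comparison.

```python
import zlib


class DedupStore:
    """Toy sector store that keeps one physical copy of each distinct sector."""

    def __init__(self):
        self.sectors = []    # physically stored sectors
        self.by_crc = {}     # CRC value -> list of indices into self.sectors
        self.pointers = []   # logical address -> index into self.sectors

    def write(self, sector: bytes) -> int:
        """Store a sector and return its logical address."""
        crc = zlib.crc32(sector)
        # An equal CRC only marks a candidate; confirm with a full comparison.
        for idx in self.by_crc.get(crc, []):
            if self.sectors[idx] == sector:
                self.pointers.append(idx)   # substitute a pointer to the stored copy
                return len(self.pointers) - 1
        # No identical sector found: store a new physical copy.
        self.sectors.append(sector)
        idx = len(self.sectors) - 1
        self.by_crc.setdefault(crc, []).append(idx)
        self.pointers.append(idx)
        return len(self.pointers) - 1

    def read(self, address: int) -> bytes:
        """Return the sector stored at a logical address."""
        return self.sectors[self.pointers[address]]
```

Writing the same sector twice through `write` yields two logical addresses that refer to a single physical copy.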
Although applying the CRC check to each of the stored messages (e.g., sectors) and to an arriving message is useful for identifying identical messages and reducing their storage, this approach is not suitable for similarity testing. More specifically, the CRC check described above does not identify messages that are similar but not identical.
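A short sketch of this limitation, again assuming the CRC-32 from Python's `zlib` as a stand-in for the cyclic redundancy code: two sectors that differ in a single bit are as similar as non-identical sectors can be, yet a CRC detects every single-bit change, so their hash values differ and an equality test on the hashes reports no match. The helper names are hypothetical.

```python
import zlib


def flip_bit(msg: bytes, bit_index: int) -> bytes:
    """Return a copy of msg with one bit inverted."""
    out = bytearray(msg)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length messages differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


sector = bytes(range(256)) * 2        # 512-byte example sector
near_copy = flip_bit(sector, 1000)    # differs from sector in exactly one bit

# The equality test on CRC values cannot see that the sectors are similar.
crc_match = zlib.crc32(sector) == zlib.crc32(near_copy)
```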
To overcome this deficiency, similarity testing may be achieved by locality sensitive hashing (LSH), defined in P. Indyk and R. Motwani, “Approximate nearest neighbors: Towards removing the curse of dimensionality,” STOC 1998: 604-613. The simplest example of LSH is randomly choosing a single bit. Such similarity testing assures that for two length-n messages of Hamming distance ≦d, the probability of missing the similarity is ≦d/n, while for two messages of distance ≧(1+e)d, the probability of falsely declaring similarity is ≦1−(1+e)d/n. See, e.g., Prop. 4 in the above paper of Indyk and Motwani. The performance of single bit sampling can be improved as follows. First, the false positive probability may be decreased by repeating the bit sampling process multiple times, and declaring similarity only when there is a bit-by-bit agreement between the two resulting sequences of hash bits from the two messages. Alternatively, the false negative probability may be decreased by declaring similarity if the two multi-bit hashes agree in at least a single coordinate. By combining these two methods, one can find a reasonably good tradeoff between false positive probability, false negative probability, and hash size (number of hash bits).
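The bit sampling variants described above can be sketched as follows. All function names, the count `k` of sampled bits, and `band_size` are hypothetical names introduced for illustration. The banded test at the end is one common way of combining the two decision rules and is an assumption of this sketch, not necessarily the combination intended above.

```python
import random


def sample_positions(n_bits: int, k: int, rng: random.Random) -> list:
    """Choose k bit positions uniformly at random (with replacement)."""
    return [rng.randrange(n_bits) for _ in range(k)]


def get_bit(msg: bytes, pos: int) -> int:
    """Return the bit of msg at bit position pos."""
    return (msg[pos // 8] >> (pos % 8)) & 1


def bit_sample_hash(msg: bytes, positions: list) -> tuple:
    """Hash a message to the tuple of its bits at the sampled positions."""
    return tuple(get_bit(msg, p) for p in positions)


def similar_all(h1: tuple, h2: tuple) -> bool:
    """Declare similarity only on bit-by-bit agreement (fewer false positives)."""
    return h1 == h2


def similar_any(h1: tuple, h2: tuple) -> bool:
    """Declare similarity if any coordinate agrees (fewer false negatives)."""
    return any(a == b for a, b in zip(h1, h2))


def similar_banded(h1: tuple, h2: tuple, band_size: int) -> bool:
    """Assumed combination: similar if at least one band agrees bit by bit."""
    return any(
        h1[i:i + band_size] == h2[i:i + band_size]
        for i in range(0, len(h1), band_size)
    )
```

The same sampled positions must be used for every message, so that the hashes of two messages are compared coordinate by coordinate.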
However, with LSH, the probability of a false detection of similarity may be too high unless the number of hash bits is very large. Also, hardware limitations might dictate supporting similarity only within some small fixed Hamming distance d, while a very low false-detection probability is desired for Hamming distances of d+1 and above.
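A rough numerical sketch of this difficulty follows. It assumes the simplest repeated scheme only: k bits sampled independently and uniformly, with similarity declared on bit-by-bit agreement. The sector size, distance, and target probability are illustrative values, not taken from the disclosure, and the function names are hypothetical.

```python
import math


def agree_probability(n: int, dist: int, k: int) -> float:
    """Probability that k independently sampled bits all agree for two
    length-n messages at Hamming distance dist."""
    return (1 - dist / n) ** k


def bits_needed(n: int, dist: int, target: float) -> int:
    """Smallest k for which agree_probability(n, dist, k) <= target."""
    return math.ceil(math.log(target) / math.log(1 - dist / n))


n = 4096        # message length in bits (illustrative)
d = 4           # distance up to which messages count as similar (illustrative)
target = 1e-3   # desired false-detection probability at distance d + 1

k = bits_needed(n, d + 1, target)
false_detection = agree_probability(n, d + 1, k)   # at distance d + 1
missed_similarity = 1 - agree_probability(n, d, k)  # at distance d
```

Under these assumptions, `k` exceeds the message length `n`, and at that hash size most pairs at distance d are no longer declared similar, which illustrates why separating distance d from distance d+1 is hard for plain bit sampling.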