This invention relates to data compression of mostly similar strings.
To store sets of mostly similar strings while supporting fast retrieval of randomly selected members, one method is to represent each string by its difference from a set-fixed reference string (which is usually also a member of the set). This representation results in compression when the string is sufficiently similar to the reference. Thus, choosing a good reference string is central to the quality of compression when such a storage method is used. Given a set of N strings, selecting the best reference string among them requires on the order of N² compression tests. Such a selection technique is lengthy, especially for large sets of strings.
For any compression method, there are a few parameters that may be defined:
CompLength(Sc, Sr): if Sr is the reference string and Sc is a string to be compressed, then CompLength(Sc, Sr) is the length of the compressed representation of Sc with respect to Sr.
TotalLength(Sr): the total length of the compressed representations of all the strings in the set, when they are compressed using Sr as the reference string.
The object of the invention is to easily find a string Sr for which TotalLength(Sr) is minimal.
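The two quantities defined above, together with the exhaustive selection that needs on the order of N² compression tests, can be sketched as follows. This is a minimal sketch only: the difference coder is modeled with Python's `difflib`, where each differing region is assumed to cost a fixed header plus the characters that must be stored literally. The names `comp_length`, `total_length`, `best_reference_exhaustive`, and `HEADER_COST`, as well as the cost model itself, are illustrative assumptions and not the compression scheme of the invention.

```python
from difflib import SequenceMatcher
from typing import List

HEADER_COST = 2  # assumed fixed cost per differing region (hypothetical)


def comp_length(sc: str, sr: str) -> int:
    """Modeled CompLength(Sc, Sr): cost of coding sc as a difference from sr."""
    cost = 0
    matcher = SequenceMatcher(None, sr, sc, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # One header per differing region plus the literal characters of sc.
            cost += HEADER_COST + (j2 - j1)
    return cost


def total_length(strings: List[str], sr: str) -> int:
    """TotalLength(Sr): sum of CompLength(S, Sr) over every string in the set."""
    return sum(comp_length(s, sr) for s in strings)


def best_reference_exhaustive(strings: List[str]) -> str:
    """Baseline: try every member as the reference (about N * N compression tests)."""
    return min(strings, key=lambda sr: total_length(strings, sr))
```

Under this assumed cost model a string compressed against itself costs 0, and the exhaustive baseline calls `comp_length` once for every ordered pair of strings, which is the cost the invention seeks to avoid.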
This object is realized in accordance with a broad aspect of the invention by a computer implemented method for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the method comprising the following steps:
(a) calculating preliminary compression results for every string relative to an initial reference string, and
(b) using the preliminary compression results to find a better reference string without additional compression tests.
According to one embodiment of the invention, there is provided a computer implemented method for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings, the method comprising the following steps:
(a) compressing the set of strings against a selected initial reference string so as to produce a set of compressed strings,
(b) determining a histogram of the costs of all strings in the set of compressed strings showing for each different length of string in the set of compressed strings a frequency of occurrence in the set, and an identity of at least one string whose compression length equals said different length, and
(c) using said histogram to determine a better reference string.
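Steps (a) to (c) of this embodiment might be sketched as below. The text does not state here how the histogram is used in step (c), so the selection rule in `pick_reference` is an assumption: it picks the histogram bin whose length minimizes the frequency-weighted sum of absolute differences to all other bins, and returns the string recorded for that bin. All function names are hypothetical, and `comp_length` is a caller-supplied stand-in for whatever difference coder is in use.

```python
from typing import Callable, Dict, Tuple

# compressed length -> (frequency, id of one string having that length)
Histogram = Dict[int, Tuple[int, str]]


def compress_against(strings: Dict[str, str], ref_id: str,
                     comp_length: Callable[[str, str], int]) -> Dict[str, int]:
    """Step (a): compressed length of every string relative to the initial reference."""
    ref = strings[ref_id]
    return {sid: comp_length(s, ref) for sid, s in strings.items()}


def build_histogram(costs: Dict[str, int]) -> Histogram:
    """Step (b): count strings per compressed length and keep one example id."""
    hist: Histogram = {}
    for sid, length in costs.items():
        count, example = hist.get(length, (0, sid))
        hist[length] = (count + 1, example)
    return hist


def pick_reference(hist: Histogram) -> str:
    """Step (c), assumed rule: minimize sum(freq * |length - candidate_length|)."""
    def predicted_total(candidate_length: int) -> int:
        return sum(freq * abs(length - candidate_length)
                   for length, (freq, _example) in hist.items())

    best_length = min(sorted(hist), key=predicted_total)
    return hist[best_length][1]
```

Only step (a) performs compression tests, one per string. Steps (b) and (c) work on the recorded lengths alone, so their cost depends on the number of distinct lengths rather than on the size of the strings.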
The invention is based on the heuristic assumption that:
if CompLength(S1, Sr) − CompLength(S2, Sr) = δ
then CompLength(S1, S2) ≈ δ
In other words, a subset of strings that differ from a reference string by similar degrees will probably be compressed at a lower cost if one of them is chosen as the reference string instead. It is therefore possible to predict the results of compression with one reference string based on the actual results of compression with another.
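As a rough illustration of this heuristic, the sketch below predicts pairwise and total compressed lengths purely from lengths already measured against one reference Sr. Taking the absolute value of δ is an assumption made here so that a predicted length is never negative. The function names and the sample costs are hypothetical.

```python
from typing import Dict, List


def predicted_comp_length(cost_s1: int, cost_s2: int) -> int:
    """Heuristic estimate of CompLength(S1, S2) from the two costs measured against Sr."""
    return abs(cost_s1 - cost_s2)


def predicted_total_length(costs: Dict[str, int], candidate: str) -> int:
    """Estimated TotalLength(candidate), computed without any new compression test."""
    return sum(predicted_comp_length(c, costs[candidate]) for c in costs.values())


def rank_candidates(costs: Dict[str, int]) -> List[str]:
    """String ids ordered from the lowest to the highest estimated total length."""
    return sorted(costs, key=lambda sid: predicted_total_length(costs, sid))


# Hypothetical costs measured against an initial reference "s0".
costs = {"s0": 0, "s1": 40, "s2": 42}
ranking = rank_candidates(costs)  # "s1" is ranked first for these sample costs
```

The estimate is only as good as the heuristic: two strings that differ from Sr by similar amounts but in unrelated positions would be predicted to compress well against each other even though they might not.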
The invention uses the above heuristic assumption to predict a good reference string or strings. There are several ways of utilizing this idea, but all are based on the same principle: calculating preliminary compression results for every string relative to an arbitrarily chosen string, and then using these results to find a better reference string at very small computational cost and without additional compression tests.