1. Field of the Invention
The present invention relates to an apparatus, system and method for data compression and, in particular, an apparatus, system and method for data compression which use irredundant patterns.
2. Description of the Related Art
Data compression methods are traditionally divided into lossy and lossless. Typically, lossy compression is applied to images and, more generally, to signals that can tolerate some degradation without serious consequence. On the other hand, lossless compression is used in situations where fidelity is of the essence, which applies to high-quality documents and, perhaps most notably, to text files.
Lossy methods rest mostly on transform techniques whereby, for instance, cuts are applied in the frequency domain, rather than in the time domain, of a signal. By contrast, lossless textual substitution methods are applied to the input in its native form, and exploit its redundancy in terms of more or less repetitive segments or patterns.
When textual substitution is applied to digital documents such as fax, image or audio signal data, one could afford some loss of information in exchange for savings in time or space. In fact, even natural language can easily sustain some degree of indeterminacy, where it is left to the reader to fill in the gaps.
For example, FIG. 10 illustrates two versions of the opening passage from Book 1 of the Calgary Corpus. These versions are equally understandable by an average reader, and yet, when applied to the entire book, the first variant requires 163,837 fewer bytes than the second one, out of 764,772.
In practice, the development of optimal lossless textual substitution methods is hampered by the fact that the majority of such schemes are NP-hard. This situation does not improve with lossy schemes. As an approximation, heuristic off-line methods of textual substitution can be based on greedy iterative selection.
For example, at each iteration, a substring w of the text x is identified such that encoding all instances of w in x yields the highest possible contraction of x. This process is repeated on the contracted textstring, until substrings capable of producing contractions can no longer be found. This may be regarded as inferring a “straight line” grammar by repeatedly finding the production or rule that, upon replacing each occurrence of the “definition” by the corresponding “nonterminal”, maximizes the reduction in size of the current textstring representation.
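The greedy iteration described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the encoder of any particular implementation: the function names, the `max_len` bound on candidate length, and the cost model (replacing f non-overlapping occurrences of a substring of length m by one new symbol saves f(m-1) symbols, while storing the rule costs m+1 symbols) are all hypothetical choices made for the example. Candidates are found by brute-force enumeration, which is adequate only for short inputs.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Symbol = Hashable              # a character, or a nonterminal such as ("N", 0)
Rule = Tuple[Symbol, ...]      # the "definition" a nonterminal stands for


def count_occurrences(seq: List[Symbol], w: Rule) -> int:
    """Count non-overlapping occurrences of w in seq, scanning left to right."""
    count, i, m = 0, 0, len(w)
    while i + m <= len(seq):
        if tuple(seq[i:i + m]) == w:
            count += 1
            i += m
        else:
            i += 1
    return count


def best_rule(seq: List[Symbol], max_len: int) -> Optional[Rule]:
    """Return the substring with the largest positive gain, or None."""
    best, best_gain = None, 0
    seen = set()
    for m in range(2, min(max_len, len(seq)) + 1):
        for i in range(len(seq) - m + 1):
            w = tuple(seq[i:i + m])
            if w in seen:
                continue
            seen.add(w)
            f = count_occurrences(seq, w)
            # Assumed cost model: f*(m-1) symbols saved, m+1 spent on the rule.
            gain = f * (m - 1) - (m + 1)
            if gain > best_gain:
                best, best_gain = w, gain
    return best


def substitute(seq: List[Symbol], w: Rule, nt: Symbol) -> List[Symbol]:
    """Replace each non-overlapping occurrence of w by the nonterminal nt."""
    out, i, m = [], 0, len(w)
    while i < len(seq):
        if tuple(seq[i:i + m]) == w:
            out.append(nt)
            i += m
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_compress(text: str, max_len: int = 16):
    """Repeat the greedy selection until no substring yields a contraction."""
    seq: List[Symbol] = list(text)
    rules: Dict[Symbol, Rule] = {}
    while True:
        w = best_rule(seq, max_len)
        if w is None:
            break
        nt = ("N", len(rules))      # fresh nonterminal; cannot equal a character
        rules[nt] = w
        seq = substitute(seq, w, nt)
    return seq, rules


def expand(seq, rules) -> List[Symbol]:
    """Invert the substitution by recursively expanding nonterminals."""
    out: List[Symbol] = []
    for s in seq:
        if s in rules:
            out.extend(expand(rules[s], rules))
        else:
            out.append(s)
    return out
```

For instance, `greedy_compress("abcabcabcabc")` selects the substring `abc` in its first iteration and then stops, since no further substitution yields a positive gain under the assumed cost model; `"".join(expand(seq, rules))` recovers the original string.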
Recent implementations of such greedy off-line strategies compare favorably with other current methods, particularly as applied to ensembles of otherwise hardly compressible inputs such as biosequences. They also appear to be the most promising in terms of the achievable approximation to optimum descriptor sizes.
Off-line methods can be particularly advantageous in applications such as mass production of CD-ROMs, backup archiving, and any other scenario where extra time or parallel implementation may warrant the additional effort imposed by the encoding.
The idea of trading some amount of reconstruction error in exchange for increased compression is ingrained in Rate Distortion Theory, and has recently been revived in a number of papers, mostly dealing with the design and analysis of lossy extensions of Lempel-Ziv on-line schemata.