In digital systems, documents are often compressed to reduce storage costs or transmission time over a communication channel. Lossless compression can achieve very good compression on the computer-rendered regions of a document, such as characters and graphics. Dictionary-based lossless compression is often used in these cases because it adapts well to a variety of input raster data types. Implementing a dictionary compression method requires maintaining and searching a sliding-window history buffer of previous input data to find the best string match for the current input data. For raster data, better matches are often found at scan line intervals in the history buffer, so a dictionary-based lossless compression system needs a history buffer large enough to hold several scan lines. In both software and hardware implementations, enlarging this buffer raises implementation cost or reduces performance. For hardware implementations in particular, this memory is often a specialized memory such as a content addressable memory (CAM), which requires more circuitry than standard memory that is not content addressable.
Dictionary-based compression methods replace substrings in a data stream with a codeword that identifies that substring in a dictionary. The dictionary can be static if the statistics of the input stream are known in advance, or it can be adaptive. Adaptive dictionary schemes are better at handling data streams whose statistics are unknown or vary.
Many adaptive dictionary coders are based on two related techniques developed by Ziv and Lempel. The two methods are often referred to as LZ77 (or LZ1) and LZ78 (or LZ2). Both methods use a simple approach to achieve adaptive compression. A substring of text is replaced with a pointer to a location where the string has occurred previously. Thus the dictionary is all or a portion of the input stream that has been processed previously. Using previous strings from the input stream often makes a good choice for the dictionary, as substrings that have occurred will likely reoccur. The other advantage to this scheme is that the dictionary is transmitted essentially at no cost, because the decoder can generate the dictionary from the previously coded input stream. The many variations of LZ coding differ primarily in how pointers are represented and what pointers are allowed to refer to.
LZ1 is a relatively easy-to-implement dictionary coder. The dictionary in this case is a sliding window containing the previous data from the input stream. The encoder searches this window for the longest match to the current substring in the input stream. Searching can be accelerated by indexing prior substrings with a trie, hash table, or binary search tree. Decoding for LZ1 is very fast: each codeword specifies an array lookup and a length to copy to the output (uncoded) data stream.
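The sliding-window search and the copy-based decoding described above can be sketched as follows. This is a minimal illustration, not any particular published LZ1 variant: the function names, the (offset, length, literal) token layout, and the `window` and `max_len` defaults are assumptions chosen for readability, and the brute-force search stands in for the indexed search a practical encoder would use.

```python
def lz77_encode(data: bytes, window: int = 4096, max_len: int = 18):
    """Greedy LZ1-style encoder; returns (offset, length, literal) triples.

    The token layout and parameter defaults are illustrative assumptions.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)
        # Brute-force search of the window; a practical encoder would
        # index prior substrings (trie, hash table, binary search tree).
        for j in range(start, i):
            length = 0
            # Stop one byte early so a literal always follows the match.
            while (length < max_len
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decode(tokens) -> bytes:
    """Each token is an array lookup plus a length to copy, then one literal."""
    out = bytearray()
    for off, length, literal in tokens:
        for _ in range(length):
            out.append(out[-off])  # byte-wise copy handles overlapping matches
        out.append(literal)
    return bytes(out)
```

Decoding needs no search at all, which is why it is much faster than encoding: the decoder only copies bytes from output it has already produced.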
In contrast to LZ1, where pointers can refer to any substring in the window of prior data, the LZ2 method places restrictions on which substrings can be referenced. However, LZ2 does not have a window to limit how far back substrings can be referenced. Restricting the referable substrings avoids the inefficiency, frequent in LZ1, of having more than one coded representation for the same string.
LZ2 builds the dictionary by matching the current substring from the input stream to a dictionary that is stored. This stored dictionary is adaptively generated based on the contents of the input stream. As each input substring is searched in the dictionary, the longest match will be located, starting at the current symbol in the input stream. So if character “a” were the first part of a substring, then only substrings that started with “a” would be searched. Generally this leads to a good match of input substring to substrings in the dictionary. However, if a substring “bacdef” were in the dictionary, then “acdef” from the input stream would not match this entry since the substring in the dictionary starts with “b”. This is different from LZ1, which is allowed to generate a best match anywhere in the window and could generate a pointer to “acdef”.
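The adaptive dictionary construction described above can be illustrated with a small LZ2-style coder. This is a hedged sketch: the function names and the (phrase index, next character) token format are assumptions for illustration, and practical variants differ in how they bound and encode the dictionary. Matching always begins at the first symbol of a stored phrase, which is why an entry such as "bacdef" cannot be used to code "acdef".

```python
def lz78_encode(data: str):
    """LZ2-style encoder; emits (phrase_index, next_char) pairs.

    Index 0 denotes the empty phrase. The token format is an
    illustrative assumption.
    """
    dictionary = {"": 0}
    tokens = []
    phrase = ""
    for ch in data:
        if phrase + ch in dictionary:
            # Keep extending the longest match from its first symbol.
            phrase += ch
        else:
            tokens.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit its prefix plus last symbol.
        tokens.append((dictionary[phrase[:-1]], phrase[-1]))
    return tokens


def lz78_decode(tokens) -> str:
    """Rebuild the same dictionary from the tokens while decoding."""
    phrases = [""]
    out = []
    for index, ch in tokens:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

The decoder receives no dictionary; it regenerates one from the coded stream, which is the "no cost" dictionary transmission noted earlier.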
The references described herein and above are incorporated by reference for their teachings.
In accordance with the invention, there is provided a method and apparatus for compressing and decompressing electronic documents, with improved compression and reduced requirements on the size of the history buffer.
In accordance with one aspect of the invention, there is provided a method of pre-ordering raster data into a vector of pixels. The specific ordering of this vector of pixels takes advantage of the correlation of pixel data at scan line intervals. Pre-ordering the raster data by this method improves compression and reduces the required size of the history buffer that has to be searched.
In accordance with another aspect of the invention, there is provided a method of compressing a raster image comprising: receiving scan-ordered raster data; ordering the raster data into a vector of pixels that are taken from several scan lines; and compressing the resulting vector of pixels by a dictionary compression method.
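The three steps of this aspect can be sketched as follows. The text at this point does not give the exact pixel ordering, so the sketch assumes one plausible choice: pixels are read column by column across a band of `band` consecutive scan lines, so that vertically adjacent pixels become neighbors in the vector. The function names, the one-byte-per-pixel layout, the `band` parameter, and the use of Python's `zlib` (a DEFLATE coder built on LZ77-style matching) as the dictionary compressor are all illustrative assumptions.

```python
import zlib


def preorder(raster: bytes, width: int, height: int, band: int = 4) -> bytes:
    """Reorder scan-ordered pixels (one byte each) into a vector.

    Assumed ordering: within each band of `band` scan lines, emit the
    pixels of column 0 top to bottom, then column 1, and so on.
    """
    out = bytearray()
    for y0 in range(0, height, band):
        rows = range(y0, min(y0 + band, height))
        for x in range(width):
            for y in rows:
                out.append(raster[y * width + x])
    return bytes(out)


def restore(vector: bytes, width: int, height: int, band: int = 4) -> bytes:
    """Inverse of preorder: return pixels to scan-line order."""
    out = bytearray(width * height)
    k = 0
    for y0 in range(0, height, band):
        rows = range(y0, min(y0 + band, height))
        for x in range(width):
            for y in rows:
                out[y * width + x] = vector[k]
                k += 1
    return bytes(out)


def compress_raster(raster: bytes, width: int, height: int, band: int = 4) -> bytes:
    """Pre-order, then apply a dictionary-based compressor (zlib as a stand-in)."""
    return zlib.compress(preorder(raster, width, height, band))


def decompress_raster(blob: bytes, width: int, height: int, band: int = 4) -> bytes:
    return restore(zlib.decompress(blob), width, height, band)
```

With this ordering, pixels that were a full scan line apart in the input sit next to each other in the vector, so a match that would have required a window of several scan lines can fall within a much shorter history. The reordering itself is plain data copying and needs no content addressable memory.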
A method for digital image compression of a raster image is disclosed which pre-orders the raster data from a scan line ordering to a vector ordering. This vector ordering takes advantage of the two-dimensional correlation that occurs in most raster documents. Pre-ordering the raster data in this manner improves compression and reduces the requirements on the size of the history buffer. The additional pre-ordering step is a simple data-copying operation and does not require specialized memory such as CAM. Software implementations also benefit, because they no longer have to search and pattern match over a larger history buffer.