This invention relates to the field of data compression. In particular, it relates to systems and methods of organizing multiple files to be compressed into one archive file, and systems and methods of compressing the multiple files as one unified file.
A number of compression methods have been utilized to reduce the size of an input data file by encoding symbols or strings of symbols as numeric codes or combinations of numeric codes and symbols. Such encoding reduces repetitions of symbols in the input data file. Major categories of compression methods include dictionary methods and statistical methods. Dictionary methods encode strings of symbols as tokens using a dictionary. A token indicates the location of the string in the dictionary. Dictionary methods include LZ77, LZ78, LZSS, and LZW. Statistical methods use codes of variable lengths to represent symbols or groups of symbols, with the shorter codes representing symbols or groups of symbols that appear or are likely to appear more often. Statistical methods include the Huffman method and the dynamic Huffman method.
The LZ77 method reads part of the input data into a window. The window is divided into a search buffer (i.e., a dictionary) on the left and a look-ahead buffer on the right. The search buffer is the current dictionary that includes symbols that have been read and encoded. The look-ahead buffer includes data yet to be encoded. The encoder scans the search buffer from right to left looking for a match to the longest stream of symbols in the look-ahead buffer. When the longest match is found, the matched stream of symbols plus the next symbol in the look-ahead buffer are encoded as a token containing three parts, the position (i.e., the distance from the end of the search buffer), the length of the longest match, and the next symbol. For a more detailed description of the LZ77 method please refer to pages 154-157 of Data Compression the Complete Reference by David Salomon, Second Edition, 2000. The LZ78 method is a variation of LZ77. For a more detailed description of the LZ78 method please refer to pages 164-168 of Data Compression the Complete Reference by David Salomon, Second Edition, 2000.
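By way of illustration only, the three-part LZ77 token described above may be sketched as follows. The function names, the buffer sizes, and the use of a Python string as input are assumptions made for this example and are not taken from the cited reference; the sketch favors clarity over speed.

```python
def lz77_encode(data, search_size=16, lookahead_size=8):
    """Encode data as a list of (position, length, next_symbol) tokens."""
    tokens = []
    i = 0  # boundary between the search buffer (left) and look-ahead buffer (right)
    while i < len(data):
        best_offset, best_length = 0, 0
        start = max(0, i - search_size)
        # Reserve one symbol of the look-ahead buffer for the "next symbol" field.
        max_len = min(lookahead_size, len(data) - i) - 1
        # Scan the search buffer from right to left for the longest match.
        for j in range(i - 1, start - 1, -1):
            length = 0
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        tokens.append((best_offset, best_length, data[i + best_length]))
        i += best_length + 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the original data from (position, length, next_symbol) tokens."""
    out = []
    for offset, length, symbol in tokens:
        for _ in range(length):
            # Copy one symbol at a time so that overlapping matches work.
            out.append(out[-offset])
        out.append(symbol)
    return "".join(out)
```

A decoder that receives only the token list can rebuild the input because every position refers to data it has already produced.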
The LZSS method is a variation of LZ77. Unlike the LZ77 tokens, LZSS tokens use two fields instead of three. An LZSS token includes a position and a length. If no match is found, the uncompressed code of the next symbol is produced as output, with a flag bit to indicate it is uncompressed. The LZSS method also holds the look-ahead buffer in a circular queue and the search buffer in a binary search tree. For a more detailed description of the LZSS method please refer to pages 158-161 of Data Compression the Complete Reference by David Salomon, Second Edition, 2000.
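The two-field LZSS token and its flag bit may be illustrated with the minimal sketch below. The names, the buffer sizes, and the minimum match length of three symbols are assumptions for this example. For brevity the sketch uses a linear scan rather than the circular queue and binary search tree mentioned above.

```python
def lzss_encode(data, search_size=4096, lookahead_size=18, min_match=3):
    """Encode data as (0, symbol) literals or (1, position, length) tokens.

    The first element of each tuple plays the role of the flag bit.
    """
    out = []
    i = 0
    while i < len(data):
        best_offset, best_length = 0, 0
        start = max(0, i - search_size)
        max_len = min(lookahead_size, len(data) - i)
        for j in range(i - 1, start - 1, -1):
            length = 0
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        if best_length >= min_match:
            out.append((1, best_offset, best_length))  # compressed token
            i += best_length
        else:
            out.append((0, data[i]))  # uncompressed symbol
            i += 1
    return out


def lzss_decode(tokens):
    """Rebuild the original data from the flagged token list."""
    out = []
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])
    return "".join(out)
```

Under these assumptions, a match shorter than `min_match` is emitted as literals because a position-and-length token would not be smaller than the symbols it replaces.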
The LZW method is a variation of LZ78. For a detailed description of the LZW method please refer to U.S. Pat. No. 4,558,302 issued on Dec. 10, 1985.
The Huffman method, also called the static Huffman method, builds a list of all of the symbols in descending order according to their probabilities of appearance. The method builds a tree from the bottom up with each leaf representing a symbol. The two symbols with the smallest probabilities of appearance are added to the top of the tree and deleted from the list. An auxiliary symbol is created to represent these two symbols. This process is repeated until the list is reduced to one auxiliary symbol. The method then assigns codes of variable length to the symbols on the tree. One variation of the Huffman method is the dynamic, or adaptive, Huffman method. This method assumes that the probabilities of appearance are not known prior to reading the input data. The compression process starts with an empty tree and modifies the tree as symbols are being read and compressed. The decompression process works in synchronization with the compression process. For a more detailed description of the Huffman method and the dynamic Huffman method please refer to pages 62-82 of Data Compression the Complete Reference by David Salomon, Second Edition, 2000.
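The bottom-up construction of the static Huffman tree may be sketched as follows. This is an illustrative example only: the function name is hypothetical, symbol counts stand in for probabilities of appearance, and a priority queue replaces the sorted list described above. The dynamic Huffman method is not shown.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each symbol of text to a variable-length bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (count, tie_breaker, node); a node is either a symbol
    # or a pair of child nodes acting as the auxiliary symbol.
    heap = [(count, idx, sym) for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Remove the two least frequent entries and replace them with
        # an auxiliary node that represents both.
        c1, _, n1 = heapq.heappop(heap)
        c2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_id, (n1, n2)))
        next_id += 1
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes
```

Symbols that appear more often end up closer to the root and therefore receive shorter codes.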
Context based methods are another type of compression method. Context based methods use one or more preceding symbols to predict (i.e., to assign the probability of) the appearance of the next symbol. For example, in English, the letter “q” is almost always followed by the letter “u”. When the letter “q” appears, a context based method would assign a high probability of appearance to the letter “u”. One example of a context based method is the Markov model, which may be classified by the number of preceding symbols it uses to predict the next symbol. An Order-N Markov model uses the N preceding symbols to predict the next symbol. For a more detailed description of the Markov model, please refer to page 126 and pages 726-735 of Data Compression the Complete Reference by David Salomon, Second Edition, 2000.
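An Order-N Markov model of the kind described above may be sketched by counting which symbols follow each N-symbol context. The function names and the simple relative-frequency estimate are assumptions for this example; a practical compressor would also need a rule for contexts it has not yet seen.

```python
from collections import Counter, defaultdict


def build_order_n_model(text, n):
    """Count, for every n-symbol context in text, the symbols that follow it."""
    counts = defaultdict(Counter)
    for i in range(n, len(text)):
        counts[text[i - n:i]][text[i]] += 1
    return counts


def probability(model, context, symbol):
    """Estimate the probability that symbol follows context (0.0 if unseen)."""
    seen = model.get(context)
    if not seen:
        return 0.0
    return seen[symbol] / sum(seen.values())


# Usage: in this sample every "q" is followed by "u".
model = build_order_n_model("quick quiet queen quote", 1)
p_u_after_q = probability(model, "q", "u")
```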
Compression methods may be used by programs to compress multiple files to archives. For example, the ARC program compresses multiple files and combines them into one file called an archive. PKZIP is a variation of ARC. ZIP is an open standard created by the maker of PKZIP for compressing files to archives. For details of ARC and PKZIP please refer to pages 206-211 of Data Compression the Complete Reference by David Salomon, Second Edition, 2000. Other compression/archiving programs include ARJ by Robert K. Jung, LHArc by Haruyasu Yoshizaki, and LHZ by Haruhiko Okumura and Haruyasu Yoshizaki.
The systems and methods relate to the compression of multiple files into a single file called an archive. The systems and methods examine the multiple files to determine their data characteristics. The systems and methods then arrange the order of the multiple files according to their data characteristics to increase the potential of data redundancy among neighboring files. The increased potential of redundancy provides potential improvement in compression ratio and compression speed. The ordered multiple files are then combined as one unified file and compressed.
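The examine, order, combine, and compress steps summarized above may be sketched as follows. This is a minimal sketch under stated assumptions: the ranking key (file extension, then size, then name) is a hypothetical stand-in for the data characteristics, the files are held in memory as a mapping from name to bytes, and the `zlib` module of the Python standard library stands in for the compression method.

```python
import os
import zlib


def rank_key(name, data):
    """Hypothetical ranking: group by extension, then by size, then by name."""
    ext = os.path.splitext(name)[1].lower()
    return (ext, len(data), name)


def build_unified(files):
    """Order the files and concatenate them into one unified byte string.

    Returns (directory, unified), where directory lists (name, offset, size)
    for each file so that the files can be separated again later.
    """
    ordered = sorted(files.items(), key=lambda item: rank_key(*item))
    directory, chunks, offset = [], [], 0
    for name, data in ordered:
        directory.append((name, offset, len(data)))
        chunks.append(data)
        offset += len(data)
    return directory, b"".join(chunks)


def compress_archive(files):
    """Compress all files as one unified stream."""
    directory, unified = build_unified(files)
    return directory, zlib.compress(unified, 9)


def extract_archive(directory, payload):
    """Decompress the unified stream and split it back into files."""
    unified = zlib.decompress(payload)
    return {name: unified[off:off + size] for name, off, size in directory}
```

Placing files with similar characteristics next to each other lets a match in one file refer back to data from its neighbor, provided the dictionary or window of the compressor is large enough to span both.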
One embodiment uses a dictionary method to compress the unified file. In addition, a large dictionary is used in one embodiment to take advantage of potential between-file redundancies. In another embodiment, the redundancy characteristics of the multiple files are examined to dynamically determine the dictionary size. After the dictionary compression method produces an intermediary output data file, the intermediary output data may be separated into multiple sections, such that a compression method potentially suitable for the data characteristics of each section is applied to that section. The compressed results of the sections are then combined to produce the final output.
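The separation of the intermediary output into sections may be illustrated with the sketch below. Everything specific in it is an assumption made for the example: the intermediary data is taken to be a list of LZSS-style tokens, the four sections (flags, literals, positions, lengths) are one possible division, and the per-section method is chosen by trying three compressors from the Python standard library and keeping the smallest result.

```python
import bz2
import lzma
import zlib

# Candidate second-stage methods (assumed for illustration).
ENCODERS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
DECODERS = {"zlib": zlib.decompress, "bz2": bz2.decompress, "lzma": lzma.decompress}


def split_sections(tokens):
    """Separate (0, byte) literals and (1, position, length) tokens into sections.

    Assumes byte values and lengths below 256 and positions below 65536.
    """
    flags, literals = bytearray(), bytearray()
    positions, lengths = bytearray(), bytearray()
    for token in tokens:
        if token[0] == 0:
            flags.append(0)
            literals.append(token[1])
        else:
            flags.append(1)
            positions += token[1].to_bytes(2, "big")
            lengths.append(token[2])
    return {
        "flags": bytes(flags),
        "literals": bytes(literals),
        "positions": bytes(positions),
        "lengths": bytes(lengths),
    }


def compress_sections(sections):
    """Compress each section with whichever candidate yields the smallest output."""
    result = {}
    for name, data in sections.items():
        candidates = ((method, encode(data)) for method, encode in ENCODERS.items())
        result[name] = min(candidates, key=lambda pair: len(pair[1]))
    return result


def decompress_sections(compressed):
    """Invert compress_sections using the method name stored with each section."""
    return {name: DECODERS[method](blob) for name, (method, blob) in compressed.items()}
```

Storing the chosen method name alongside each compressed section is what allows the sections to be decoded independently and then recombined.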
One embodiment of the present invention is a method of compressing a plurality of files. The method comprises examining said plurality of files to determine data characteristics that correspond to said plurality of files and determining ranking orders for said plurality of files according to said data characteristics. In addition, the method comprises combining said plurality of files into a unified file at least according to said ranking orders and compressing said unified file.
An additional embodiment of the present invention is a system for compressing a plurality of files. The system comprises an examination module configured to examine said plurality of files to determine data characteristics that correspond to said plurality of files and an ordering module configured to determine ranking orders for said plurality of files. In addition, the system comprises a combining module configured to combine said plurality of files as a unified file at least according to the ranking orders of said plurality of files and a compressing module configured to compress said unified file using a first compression method.
A further embodiment of the present invention is a system for compressing a plurality of files. The system comprises means for examining said plurality of files to determine data characteristics that correspond to said plurality of files and means for determining ranking orders for said plurality of files. In addition, the system comprises means for combining said plurality of files as a unified file at least according to the ranking orders of said plurality of files and means for compressing said unified file using a first compression method.
For purposes of summarizing the invention, certain aspects, advantages, and novel features of the invention are described herein. It is to be understood that not necessarily all such advantages may be achieved in accordance with any particular embodiment of the invention. Those skilled in the art will recognize that the invention may be embodied or carried out in a manner that achieves one advantage or a group of advantages as taught herein without necessarily achieving other advantages as may be taught or suggested herein.