The present invention relates generally to methods and systems for data coding, and specifically to methods of data compression.
Compression algorithms based on textual substitution are widely used in storage and transmission of data files. The classic method in this class is the Lempel-Ziv algorithm, described by Ziv and Lempel in “A Universal Algorithm for Sequential Data Compression,” published in IEEE Transactions on Information Theory 23(3), pages 337-343 (1977), which is incorporated herein by reference. These algorithms are based on finding textual resemblance between different segments in a file. The later occurrence of a given segment is then replaced by a pointer to the earlier occurrence.
Textual substitution algorithms are universally applicable, in the sense that they require no a priori knowledge of the contents of the file. In the asymptotic limit, such algorithms are capable of attaining the best compression possible for a given file. By their nature, however, algorithms known in the art start to gain compression only after an initial portion of the file (sometimes of substantial length) has been processed, and the degree of compression that is achieved grows slowly.
Wyner and Ziv studied the effectiveness of compression using a fixed, predetermined reference string, external to the file that is to be compressed, in a later article, entitled “Fixed Data Base Version of the Lempel-Ziv Data Compression Algorithm,” published in IEEE Transactions on Information Theory 37(3), pages 878-880 (1991), which is incorporated herein by reference. They showed that asymptotic results similar to those of the classic algorithm can be achieved, provided that the reference string is produced by the same source that generates the file to be compressed. The authors make no suggestion, however, as to how the reference string should be chosen.
In preferred embodiments of the present invention, a target file is compressed by matching segments in the file to corresponding segments in a reference file or set of files. The present invention exploits the fact that typically, many computers have the same set of common files already resident on their disks, such as help files, program files and other resources supplied by a given vendor or other source. New files supplied from the same source or based on the same set of resources frequently reuse substantial, identical segments of the earlier files. By using these pre-existing, common resources as reference files, it becomes possible to match very long substrings from the target file to appropriate segments in the reference files and thus to achieve compression superior to methods known in the art.
In some preferred embodiments of the present invention, a server compresses a target file to be conveyed to a client based on a common set of reference files shared by the server and the client. The server is typically aware in advance of the reference files held by the client, which typically include the client's operating system files (even if different from the server's operating system) and other software platform components. Alternatively or additionally, the server may derive this information from a preliminary communication with the client. The server codes the target file as a list of specifiers, or pointers, to segments in the client's reference files that match successive substrings in the target file. Each specifier preferably includes a reference file identifier and an offset and length of the segment in the reference file. Substrings in the target file that do not have a match of sufficient length in any of the reference files are preferably added to the pointer list as is, most preferably with a flag to indicate that they are uncoded.
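The coding step described above can be illustrated by the following minimal Python sketch. It is not taken from the specification: the token layout (`"ref"` and `"lit"` tags), the function names `longest_match` and `encode`, the threshold `MIN_MATCH` and the brute-force search are all assumptions made for illustration, and the reference files are modeled as in-memory byte strings keyed by a file identifier.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical token shapes (assumptions, not defined by the specification):
#   ("ref", file_id, offset, length)  -> specifier pointing into a reference file
#   ("lit", raw_bytes)                -> uncoded substring, flagged by its tag
Specifier = Tuple[str, str, int, int]
Literal = Tuple[str, bytes]
Token = Union[Specifier, Literal]

MIN_MATCH = 4  # assumed minimum segment length worth replacing by a specifier


def longest_match(target: bytes, pos: int, refs: Dict[str, bytes]) -> Tuple[str, int, int]:
    """Return (file_id, offset, length) of the longest segment matching target at pos."""
    best = ("", 0, 0)
    for file_id, data in refs.items():
        # Try to extend the best length found so far using this reference file.
        length = best[2]
        while pos + length < len(target):
            offset = data.find(target[pos:pos + length + 1])
            if offset < 0:
                break
            length += 1
            best = (file_id, offset, length)
    return best


def encode(target: bytes, refs: Dict[str, bytes]) -> List[Token]:
    """Greedily code the target as specifiers, keeping short runs as flagged literals."""
    tokens: List[Token] = []
    literal = bytearray()
    pos = 0
    while pos < len(target):
        file_id, offset, length = longest_match(target, pos, refs)
        if length >= MIN_MATCH:
            if literal:  # flush any pending uncoded bytes first
                tokens.append(("lit", bytes(literal)))
                literal.clear()
            tokens.append(("ref", file_id, offset, length))
            pos += length
        else:
            literal.append(target[pos])
            pos += 1
    if literal:
        tokens.append(("lit", bytes(literal)))
    return tokens
```

The linear search with `bytes.find` is chosen only for brevity; an indexed structure over the reference files would be expected in practice.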
Preferably, the server compresses the coded list by encoding the specifiers in the list. The list of specifiers typically draws on a small subset of the total corpus of reference files. Therefore, the reference file identifiers can be encoded efficiently based on the frequency of their occurrence, using Huffman coding, for example. The server preferably adds a header to the resultant compressed file, identifying the reference files that it has used and their respective codes.
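As an illustration of frequency-based coding of the reference file identifiers, the following Python sketch builds a Huffman code table from the identifiers appearing in a specifier list. The function name `huffman_codes` and the representation of codes as strings of `0` and `1` characters are assumptions for this example; the specification names Huffman coding only as one possible choice.

```python
import heapq
from collections import Counter
from typing import Dict, List


def huffman_codes(file_ids: List[str]) -> Dict[str, str]:
    """Map each reference-file identifier to a prefix-free bit string."""
    counts = Counter(file_ids)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single identifier still needs a code of at least one bit.
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {identifier: code so far}).
    heap = [(freq, i, {fid: ""}) for i, (fid, freq) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f0, _, low = heapq.heappop(heap)
        f1, _, high = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {fid: "0" + code for fid, code in low.items()}
        merged.update({fid: "1" + code for fid, code in high.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]
```

The returned table plays the role of the header information that associates each reference file used with its code, so that frequently used files receive the shortest codes.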
In some preferred embodiments of the present invention, the server maintains multiple sets of reference files, typically corresponding to different client platforms, in order to generate different compressed versions of the target file depending on the client to which the compressed file is to be sent. Alternatively or additionally, the server prepares and caches the different compressed versions in advance. It will be understood that while preferred embodiments are described herein with reference to a client/server architecture, peer computers may also compress, store and/or transmit to one another efficiently-compressed files based on the principles of the present invention.
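One way to picture the preparation and caching of per-platform versions is the Python sketch below. The class name `VersionCache`, its methods and the platform labels are hypothetical, and the compressor is passed in as a callable so that the sketch makes no assumption about the coding scheme itself.

```python
from typing import Callable, Dict, Tuple

# Assumed signature: compress(target, reference_files) -> compressed bytes.
Compressor = Callable[[bytes, Dict[str, bytes]], bytes]


class VersionCache:
    """Holds one compressed version of each target file per client platform."""

    def __init__(self, platforms: Dict[str, Dict[str, bytes]], compress: Compressor):
        self._platforms = platforms          # platform name -> its reference files
        self._compress = compress
        self._cache: Dict[Tuple[str, str], bytes] = {}

    def prepare(self, name: str, target: bytes) -> None:
        """Compress the target once against every platform's reference set."""
        for platform, refs in self._platforms.items():
            self._cache[(name, platform)] = self._compress(target, refs)

    def get(self, name: str, platform: str) -> bytes:
        """Recall the version prepared for the requesting client's platform."""
        return self._cache[(name, platform)]
```

In this arrangement the cost of matching is paid once per platform, and a request is served by a dictionary lookup keyed on the client's platform.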
In preferred embodiments of the present invention, in order to decompress the file, the client reads the header and opens the appropriate reference files on its own disk. It then processes the list of pointers, retrieving the successive segments from the reference files that are indicated by the specifiers. These segments are concatenated in order to reconstruct the target file, along with any uncoded substrings as appropriate.
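The decompression procedure described above can be sketched in Python as follows. The token layout (`("ref", file_id, offset, length)` for a specifier and `("lit", raw_bytes)` for an uncoded substring) and the function name `decode` are assumptions for illustration, and the client's reference files are modeled as in-memory byte strings rather than files opened from disk.

```python
from typing import Dict, Iterable, Tuple


def decode(tokens: Iterable[Tuple], refs: Dict[str, bytes]) -> bytes:
    """Rebuild the target by concatenating referenced segments and literals in order."""
    out = bytearray()
    for token in tokens:
        kind = token[0]
        if kind == "ref":
            _, file_id, offset, length = token
            segment = refs[file_id][offset:offset + length]
            if len(segment) != length:
                # The local reference file is shorter than the specifier expects.
                raise ValueError(f"segment out of range in {file_id!r}")
            out += segment
        elif kind == "lit":
            out += token[1]  # uncoded substring, copied as is
        else:
            raise ValueError(f"unknown token kind {kind!r}")
    return bytes(out)
```

The length check is a design choice of the sketch: it makes a mismatch between the sender's and the receiver's copies of a reference file fail loudly instead of silently producing a truncated target.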
Although preferred embodiments are described herein with reference to compression and transmission of files, it will be understood that the principles of the present invention are equally applicable to compression of other bodies of data that can be characterized as strings of characters or other symbols. Therefore, in the context of the present patent application and in the claims, the terms “file,” “target file” and “reference files” should be taken to refer generally to substantially any suitable bodies of computer-readable data to which the principles of the present invention may be applied. Similarly, the terms “substrings” and “segments” are used herein for convenience and clarity to denote strings of symbols within the target file and reference files, respectively, and should be understood to refer to strings of substantially any length and suitable type within the bodies of data in question.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for compressing a target string of symbols, including:
identifying a set of reference strings stored by a computer;
matching a plurality of successive substrings in the target string to respective segments found in one or more of the reference strings;
assigning respective segment specifiers to the substrings that identify the respective segments to which they are matched; and
outputting an ordered list of the specifiers.
Preferably, identifying the set of reference strings includes identifying a set of files used by the computer. Preferably, the files are associated with an operating platform of the computer. Alternatively or additionally, the files include at least one of a program file and a help file.
Preferably, matching the substrings includes, for each of the substrings, finding the respective one of the segments in the reference strings so as to maximize a length of the substring. Further preferably, matching the substrings includes encoding the reference strings in a tree structure having nodes connected by edges corresponding to the symbols in the segments, and finding the respective one of the segments includes traversing in succession the edges of the tree that correspond to the symbols in the substring. Most preferably, traversing the edges of the tree includes traversing the tree up to one of the nodes reached by a last one of the edges in the succession, and assigning the respective segment specifiers includes returning a node specifier associated with the one of the nodes.
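The tree-based matching described above can be illustrated with a simple suffix trie in Python. This is a sketch under stated assumptions rather than the structure used in the specification: the names `Node`, `build_trie` and `match` are hypothetical, each edge carries a single symbol, the node specifier is taken to be a (file identifier, offset) pair recorded when the node is created, and `max_depth` is an assumed bound that limits memory use by capping the longest match.

```python
from typing import Dict, Optional, Tuple


class Node:
    """Trie node; `where` is the node specifier: one (file_id, offset) spelling its path."""
    __slots__ = ("children", "where")

    def __init__(self, where: Optional[Tuple[str, int]] = None):
        self.children: Dict[int, "Node"] = {}
        self.where = where


def build_trie(refs: Dict[str, bytes], max_depth: int = 64) -> Node:
    """Insert every suffix of every reference string, truncated to max_depth symbols."""
    root = Node()
    for file_id, data in refs.items():
        for start in range(len(data)):
            node = root
            for symbol in data[start:start + max_depth]:
                child = node.children.get(symbol)
                if child is None:
                    # The path from the root to this node spells data[start:...].
                    child = Node((file_id, start))
                    node.children[symbol] = child
                node = child
    return root


def match(root: Node, target: bytes, pos: int) -> Optional[Tuple[str, int, int]]:
    """Follow edges for successive target symbols; return (file_id, offset, length)."""
    node, length = root, 0
    while pos + length < len(target):
        child = node.children.get(target[pos + length])
        if child is None:
            break
        node, length = child, length + 1
    if length == 0:
        return None
    file_id, offset = node.where  # node specifier of the last node reached
    return file_id, offset, length
```

Because traversal stops at the last edge that matches, the returned length is the longest available match (up to `max_depth`), and the specifier is read directly from the node reached, without searching the reference strings again.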
Additionally or alternatively, assigning the respective segment specifiers includes assigning the specifiers only to the substrings that are matched by segments of a length no less than a predefined minimum, and the method includes adding to the ordered list of the specifiers the substrings that are matched only by segments that are shorter than the predefined minimum.
Preferably, assigning the respective segment specifiers includes specifying respective identifiers of the reference strings in which the segments occur, and further includes specifying respective offsets and lengths of the segments within the reference strings in which they occur. Additionally or alternatively, outputting the ordered list includes compressing the list. Preferably, outputting the ordered list includes compressing the identifiers of the reference strings by coding the identifiers responsive to respective frequencies of occurrence of the identifiers in the ordered list.
Preferably, outputting the ordered list of the specifiers includes transmitting an output file containing the list to the computer over a communication link.
In a preferred embodiment, identifying the set of reference strings includes identifying first and second sets of reference strings, and matching the plurality of substrings and assigning the segment specifiers includes associating first and second pluralities of the substrings with respective matching segments in the first and second sets of reference strings, respectively, and assigning first and second sets of the specifiers accordingly, and outputting the ordered list includes outputting first and second lists of the specifiers in the first and second sets, respectively, corresponding to the first and second sets of the reference strings.
Preferably, the first and second sets of the reference strings are respectively stored by first and second computers, and outputting the first and second lists includes sending the first list to the first computer, and the second list to the second computer.
There is also provided, in accordance with a preferred embodiment of the present invention, a method for data communications, including:
identifying at a sending computer at least one reference file that is stored by a receiving computer;
matching one or more substrings in a target file to respective segments of the at least one reference file;
compressing the target file by replacing the one or more substrings with segment specifiers that identify the respective segments; and
transmitting the compressed file from the sending computer to the receiving computer, whereby the receiving computer decompresses the file using the at least one reference file.
In a preferred embodiment, the at least one reference file is associated with a program run by the receiving computer but not by the sending computer.
In another preferred embodiment, identifying the at least one reference file includes receiving the identification of the at least one reference file from the receiving computer via a communication link, and transmitting the compressed file includes choosing the compressed file to transmit responsive to the received identification.
Preferably, identifying the at least one reference file includes recalling a copy of the at least one reference file stored by the sending computer, and matching the one or more substrings includes finding segments that match the substrings at the sending computer using the recalled copy of the at least one reference file.
In a preferred embodiment, identifying the at least one reference file includes identifying first and second reference files stored respectively by first and second receiving computers, and matching and replacing the one or more substrings includes associating first and second pluralities of the substrings with respective matching segments in the first and second reference files, respectively, and replacing the substrings with first and second sets of the specifiers accordingly to generate first and second compressed files, and transmitting the compressed file includes transmitting the first compressed file to the first receiving computer, and the second compressed file to the second receiving computer. Preferably, compressing the target file includes storing the first and second compressed files at the sending computer, and transmitting the first and second compressed files includes recalling the stored, compressed files for transmission thereof.
In another preferred embodiment, the target file includes a first target file, and compressing the target file includes compressing the first target file and inserting an identification of the reference file in a header of the compressed file, and the method includes compressing a second target file by replacing substrings of the second target file with internal pointers to earlier occurrences of the substrings in the second target file, wherein the header of the compressed file includes a field used to indicate that the second target file was compressed using the internal pointers.
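A header carrying such a field might be laid out as in the following Python sketch. The layout is entirely hypothetical: one mode byte (`MODE_REFERENCE` or `MODE_INTERNAL`), a two-byte count of reference file names, and then each name as a two-byte length followed by UTF-8 bytes. The specification does not define these constants, field widths or function names.

```python
import struct
from typing import List, Tuple

# Assumed values of the header field that selects the decompression method.
MODE_REFERENCE = 0  # body holds specifiers into external reference files
MODE_INTERNAL = 1   # body holds internal pointers to earlier parts of the same file


def pack_header(mode: int, ref_names: List[str]) -> bytes:
    """Serialize the mode field and the list of reference files used."""
    header = bytearray(struct.pack(">BH", mode, len(ref_names)))
    for name in ref_names:
        raw = name.encode("utf-8")
        header += struct.pack(">H", len(raw)) + raw
    return bytes(header)


def unpack_header(blob: bytes) -> Tuple[int, List[str], int]:
    """Return (mode, reference file names, offset at which the body starts)."""
    mode, count = struct.unpack_from(">BH", blob, 0)
    pos = struct.calcsize(">BH")
    names = []
    for _ in range(count):
        (size,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        names.append(blob[pos:pos + size].decode("utf-8"))
        pos += size
    return mode, names, pos
```

A receiver reading this header would dispatch on the mode field: with `MODE_INTERNAL` the list of reference files is empty and the body is decoded using pointers into the already reconstructed output.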
There is additionally provided, in accordance with a preferred embodiment of the present invention, a method for decompressing a compressed file, which contains an ordered list of codes specifying respective segments in one or more reference strings stored by a computer, the segments matching respective substrings in a target file, the method including:
reading the codes from the list;
retrieving the segments specified by the codes from the one or more reference strings stored by the computer; and
concatenating the retrieved segments to reconstruct the target file.
Preferably, the one or more reference strings include files used by the computer. Further preferably, reading the codes includes decoding compressed identifiers of the one or more reference strings. Most preferably, retrieving the segments includes reading, from one of the reference strings specified by one of the codes, a sequence of symbols of a length and at an offset within the string specified by the one of the codes. Alternatively or additionally, the compressed file further contains an uncoded substring of characters from the target file, and concatenating the retrieved segments includes concatenating the uncoded substring with the retrieved segments.
There is further provided, in accordance with a preferred embodiment of the present invention, apparatus for compressing a target string of symbols, including a compression processor, adapted to receive an identification of a set of reference strings stored by a computer, to match a plurality of successive substrings in the target string to respective segments found in one or more of the reference strings and to assign respective segment specifiers to the substrings that identify the respective segments to which they are matched, and to output an ordered list of the specifiers.
There is moreover provided, in accordance with a preferred embodiment of the present invention, a server for data communications, including a compressed file processor, which is adapted to receive an identification of at least one reference file that is stored by a receiving computer and to transmit to the receiving computer a compressed file generated by matching one or more substrings in a target file to respective segments of the at least one reference file and by replacing the one or more substrings with segment specifiers that identify the respective segments, whereby the receiving computer decompresses the file using the at least one reference file.
Preferably, the server includes a storage device, which is adapted to store a copy of the at least one reference file, wherein the processor is adapted to recall the at least one reference file from the storage device and to generate the compressed file using the recalled file.
There is furthermore provided, in accordance with a preferred embodiment of the present invention, apparatus for decompressing a compressed file, which contains an ordered list of codes specifying respective segments in one or more reference strings, the segments matching respective substrings in a target file, the apparatus including:
a storage device, adapted to store the one or more reference strings; and
a decompression processor, adapted to read the codes from the list, and coupled to retrieve the segments specified by the codes from the one or more reference strings stored by the storage device and to concatenate the retrieved segments so as to reconstruct the target file.
There is also provided, in accordance with a preferred embodiment of the present invention, a computer software product for compressing a target string of symbols, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive an identification of a set of reference strings stored by the computer, to match a plurality of successive substrings in the target string to respective segments found in one or more of the reference strings, to assign respective segment specifiers to the substrings that identify the respective segments to which they are matched, and to output an ordered list of the specifiers.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a computer software product for data communications, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by a sending computer, cause the sending computer to receive an identification of at least one reference file that is stored by a receiving computer, to match one or more substrings in a target file to respective segments of the at least one reference file, to compress the target file by replacing the one or more substrings with segment specifiers that identify the respective segments, and to transmit the compressed file to the receiving computer, whereby the receiving computer decompresses the file using the at least one reference file.
There is further provided, in accordance with a preferred embodiment of the present invention, a computer software product for decompressing a compressed file, which contains an ordered list of codes specifying respective segments in one or more reference strings stored by a computer, the segments matching respective substrings in a target file, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by the computer, cause the computer to read the codes from the list, to retrieve the segments specified by the codes from the one or more reference strings stored by the computer, and to concatenate the retrieved segments to reconstruct the target file.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: