1. Field of the Invention
The invention relates generally to computer-automated string matching. The invention relates more specifically to string matching applied in the context of a data compression method.
2a. Notice Regarding Copyright Claim to Disclosed Computer Program Listing
This application includes a listing of a computer program.
The assignee of the present application claims certain copyrights in said computer program listing. The assignee has no objection, however, to the reproduction by others of this listing if such reproduction is for the sole purpose of studying it to understand the invention. The assignee reserves all other copyrights in the program listing including the right to reproduce the computer program in machine-executable form.
2b. Cross Reference to Related Applications
The following U.S. patent application(s) is/are assigned to the assignee of the present application, is/are related to the present application and its/their disclosures is/are incorporated herein by reference:
(A) Ser. No. 07/759,226 filed Sep. 13, 1991, by Lloyd L. Chambers IV and entitled FAST DATA COMPRESSOR WITH DIRECT LOOKUP TABLE INDEXING INTO HISTORY BUFFER, now U.S. Pat. No. 5,155,484; and
(B) Ser. No. 07/839,958 filed Feb. 21, 1992, by Lloyd L. Chambers IV and entitled METHOD AND APPARATUS FOR LOCATING LONGEST PRIOR TARGET STRING MATCHING CURRENT STRING IN BUFFER, now U.S. Pat. No. 5,246,779.
3. Description of the Related Art
It is sometimes advantageous to scan through a "history" buffer of a digital computer to find an "old data string" within the buffer which matches the starting and subsequent portions of a current data string to the longest extent possible.
The above-cited U.S. patent application Ser. No. 07/839,958, METHOD AND APPARATUS FOR LOCATING LONGEST PRIOR TARGET STRING MATCHING CURRENT STRING IN BUFFER, now U.S. Pat. No. 5,246,779, provides an example of one useful application of such string matching. In that application, a data compression system tries to replace each successive "current" data string (a set of adjacent bytes or bits) within an input buffer with a "compression vector" of shorter bit length.
The compression vector points back into a history buffer portion of an input buffer, to a matching old string (hereafter also termed "MOS"). The MOS is the same as the current string for a given number of bits or bytes. The compression vector indicates the length of match between the current string and the matching old string (MOS). During decompression, the history buffer is reconstructed from the compressed file in a sequential, front-to-end, boot-strap fashion. Each encountered "compression vector" is replaced by the indicated length of a prior string in the partially-reconstructed history buffer, the prior string being the one that is pointed to by the vector.
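By way of a non-limiting illustration, the boot-strap reconstruction described above may be sketched as follows. The token representation (a literal byte given as an integer, or a compression vector given as an (offset, length) pair), the function name `decompress`, and the allowance for a copied string to overlap the bytes being written are all assumptions made for this sketch; none of them is taken from the cited applications.

```python
from typing import List, Tuple, Union

# Hypothetical token format: an int is a literal byte, a tuple is an
# (offset, length) compression vector. This layout is an assumption.
Token = Union[int, Tuple[int, int]]


def decompress(tokens: List[Token]) -> bytes:
    """Rebuild the history buffer front to end from literals and vectors."""
    history = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            # The offset counts back from the current (not yet written) position.
            start = len(history) - offset
            if offset <= 0 or start < 0:
                raise ValueError("vector points outside the reconstructed history")
            # Copy one byte at a time so that the prior string may overlap
            # the bytes being appended (an assumption of this sketch).
            for i in range(length):
                history.append(history[start + i])
        else:
            history.append(token)
    return bytes(history)
```

For example, the token sequence `[65, 66, 67, (3, 3), 68]` first rebuilds "ABC", then copies three bytes from three positions back, then appends "D".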
Compression efficiency depends on the length of match between each current string and a prior string and on the length of the vector that replaces the current string. Ideally, the compression vector should be as small as possible and the match length should be as large as possible.
Theoretically speaking, a wide variety of algorithms could be employed to realize such an ideal condition. However, in practice, attention has to be paid to physical considerations such as limiting compression time, limiting decompression time, and avoiding excessive waste of system memory space during compression and/or decompression.
The system disclosed in the above-cited U.S. patent application Ser. No. 07/839,958, METHOD AND APPARATUS FOR LOCATING LONGEST PRIOR TARGET STRING MATCHING CURRENT STRING IN BUFFER, now U.S. Pat. No. 5,246,779, searches a history buffer from its front end to its back end, looking first for all possible matches to a current string and then for the longest match.
In so doing, the system first generates an array of sorted "index-pair" lists to help it find matching strings more quickly. Each index-pair list is associated with one of 2^16 possible two-byte combinations. There are thus as many as 2^16 such lists created within the computer's memory. The first two bytes of the current string are combined to produce a reference "index-pair". Each matching two-byte combination within the history buffer is considered the start of an old string which matches the current string for at least two bytes. Examples of "index-pairs" include ASCII combinations such as "AB", "AN", "BU", etc.
Each sorted index-pair list includes one or more pointers that point to the locations of one or more matching index-pairs in the history buffer. Sorting of lists and entries within the array is done first according to index-pair value and next, within each list, according to the position of the index-pair within the history buffer, with the position of the matching index-pair furthest away from a "current" string position appearing first in its corresponding list. The system then uses the sorted index-pairs array as a fast path for locating every old string which starts with the same index-pair as the first two bytes of the current string (the string which is to be potentially replaced with a compression vector).
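The index-pair lookup described above may be illustrated by the following minimal sketch. It is not the implementation of the cited application: a Python dictionary stands in for the array of up to 2^16 lists, the function names `build_index_pair_lists` and `longest_match` are hypothetical, and the restriction that a matching old string must end before the current position is an assumption of the sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index_pair_lists(buf: bytes, current: int) -> Dict[bytes, List[int]]:
    """Map each two-byte pair lying wholly in the history (before `current`)
    to the positions where it starts. Positions are appended in ascending
    order, so the pair furthest from `current` appears first in each list."""
    lists: Dict[bytes, List[int]] = defaultdict(list)
    for pos in range(current - 1):
        lists[buf[pos:pos + 2]].append(pos)
    return lists


def longest_match(buf: bytes, current: int) -> Tuple[int, int]:
    """Return (start, length) of the longest old string that matches the
    current string, or (-1, 0) if no old string shares its first two bytes."""
    lists = build_index_pair_lists(buf, current)
    best_start, best_len = -1, 0
    # Only old strings starting with the same index-pair are examined.
    for start in lists.get(buf[current:current + 2], []):
        length = 0
        while (current + length < len(buf)
               and start + length < current  # assumption: stay inside history
               and buf[start + length] == buf[current + length]):
            length += 1
        if length > best_len:
            best_start, best_len = start, length
    return best_start, best_len
```

With the buffer `b"ABCDABCABCDE"` and a current position of 7, the list for the index-pair "AB" holds positions 0 and 4, and the longest match is the four-byte old string starting at position 0.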
When the longest old string is found that matches a corresponding length of the current string, a compression vector is generated. The compression vector includes an n-bit-long offset-value field which indicates the difference between the start position of the current string and the start position of the matching old string (MOS). The compression vector further includes an m-bit-long length-value field which indicates the number of successively matching bytes in the old string.
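The two fields of such a compression vector may be pictured with the following sketch, which packs an n-bit offset value and an m-bit length value into a single integer of n plus m bits. The field order (offset in the high bits, length in the low bits), the function names, and the example widths n=12 and m=4 are assumptions chosen for illustration; the text above does not specify them.

```python
from typing import Tuple


def pack_vector(offset: int, length: int, n: int, m: int) -> int:
    """Pack an n-bit offset and an m-bit length into one (n + m)-bit value.
    Assumed layout: offset occupies the high n bits, length the low m bits."""
    if not 0 <= offset < (1 << n):
        raise ValueError("offset does not fit in n bits")
    if not 0 <= length < (1 << m):
        raise ValueError("length does not fit in m bits")
    return (offset << m) | length


def unpack_vector(vector: int, n: int, m: int) -> Tuple[int, int]:
    """Recover (offset, length) from a value produced by pack_vector."""
    length = vector & ((1 << m) - 1)
    offset = (vector >> m) & ((1 << n) - 1)
    return offset, length
```

Under these assumed widths, an offset of 7 and a length of 4 pack into a 16-bit value, and `unpack_vector` returns the same pair.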
Other fields may also be included within the compression vector. Of importance here is that the length of the compression vector is, at minimum, n plus m bits (the length of the offset-value field plus the length of the length-value field). Compression efficiency suffers when the sum n+m remains relatively large over a large number of compression vectors.
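The effect of the sum n+m on compression efficiency may be seen from a simple bit count. The sketch below assumes 8-bit bytes and a vector consisting of only the offset-value and length-value fields, ignoring any other fields; the function name `bits_saved` and the example widths are hypothetical.

```python
def bits_saved(match_length_bytes: int, n: int, m: int) -> int:
    """Bits saved by replacing a matched string with an (n + m)-bit vector.
    Assumes 8-bit bytes and no vector fields beyond offset and length."""
    return 8 * match_length_bytes - (n + m)
```

With assumed widths of n=12 and m=4, a two-byte match saves nothing (16 bits replaced by 16 bits), while a three-byte match saves 8 bits. A larger n+m raises the match length needed before any saving is obtained.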