1. Field of the Invention
The present invention relates generally to dictionary-based compression methods and, more particularly, to the compression of a file string by matching the string against a dictionary, parsing according to detected matches, and outputting new data representing, in a reversible manner, the original string as a succession of matches with the dictionary.
2. Background Description
Compression is the coding of data to minimize the number of bits needed to represent an underlying "raw" data file or information. The raw data file can, for example, be the text of a book, with each alphanumeric character of the text being one byte of digital data. Without compression, all subsequent processing, storage and transmission of the file requires eight bits per text character, and the file becomes quite large when, for example, it consists of the entire text of a book. This uncompressed representation is inefficient because text files almost always contain multiple occurrences of a character string within words, and/or multiple occurrences of the same words and phrases. For example, the words "the," "there," "their" and "therefore" all contain the three-character string "t,h,e." Assuming eight bits per character, 24 bits are required for each occurrence of this three-character sequence. If that particular three-character sequence, however, were translated into a new code word having fewer than 24 bits, fewer bits would be needed to represent the underlying text.
An uncompressed raw data representation of course does not exploit this redundancy and instead represents, and hence stores and transfers, each repeated occurrence with all of its constituent characters.
"Dictionary-based encoding" is a compression method which partially removes this redundancy, and thus obtains compression, by replacing repeated strings of consecutive characters, or repeated words or phrases, with indices into a dictionary. Dictionary encoding can use either a static or dynamic dictionary. A static dictionary is preset and does not change according to the input data or text. A dynamic dictionary is updated depending on the input data.
Referring first to a static dictionary, the means by which compression is achieved is readily described. A set of words or phrases W(i), i=1,N, each having a length of L(i) characters, is stored in a dictionary. Input data is compared, on a byte-by-byte or word-by-word basis, against each entry of the dictionary. The term "word-by-word" is conceptual and does not mean one word per machine cycle. When a match is found at, for example, the Bth location, the raw data string is replaced by a pointer to that Bth location. If no match is found, the raw data is output. Each output datum has a flag indicating whether the datum is a pointer or is raw data. Compression is achieved because each time a raw data string is replaced by a pointer there are fewer bits in the pointer than were in the raw data it represents.
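The static-dictionary scheme just described can be illustrated with a minimal sketch. The function names `encode_static` and `decode_static`, the `(flag, value)` token layout, and the choice to try longer dictionary entries first are assumptions made for illustration only; they are not taken from any particular prior art encoder.

```python
def encode_static(text, dictionary):
    """Replace dictionary entries found in `text` with flagged pointers.

    Each output token is (flag, value): flag 1 means `value` is an index
    into the dictionary, flag 0 means `value` is one raw character.
    """
    # Assumption: longer entries are tried first so that "there" wins over "the".
    order = sorted(
        (i for i in range(len(dictionary)) if dictionary[i]),
        key=lambda i: -len(dictionary[i]),
    )
    tokens = []
    pos = 0
    while pos < len(text):
        for i in order:
            if text.startswith(dictionary[i], pos):
                tokens.append((1, i))          # pointer to the i-th entry
                pos += len(dictionary[i])
                break
        else:
            tokens.append((0, text[pos]))      # no match: emit raw character
            pos += 1
    return tokens


def decode_static(tokens, dictionary):
    """Invert encode_static using the same dictionary."""
    return "".join(dictionary[v] if flag else v for flag, v in tokens)


dictionary = ["the", "there", "their"]
tokens = encode_static("there the cat", dictionary)
restored = decode_static(tokens, dictionary)
```

Because the dictionary is fixed in advance, the decoder needs only the token stream and the same dictionary to restore the text.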
Static dictionary methods, however, must pre-store a set of words or character strings which, on average, have a high repetition rate within general text. The dictionary cannot be optimized, however, based on repetition of characters or words within a particular text. Further, if a dictionary is made so large as to encompass all likely string sequences, the pointer to the string location, assuming enough bits to address all locations in the dictionary, will become so large that it may defeat the compression objective.
Accordingly, "dynamic" or "adaptive" dictionary encoding methods, which establish and/or update the dictionary according to the input text, are preferable for many applications.
There is a plurality of known dynamic dictionary encoding methods. One is the Ziv-Lempel LZ1977 method, which is well known in the art and is described, for example, in Ziv et al., "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Information Theory, IT-23(3), pages 337-343, May 1977. Another well-known method is the Ziv-Lempel 1978 method, described, for example, by Ziv et al., "Compression of Individual Sequences Via Variable Rate Coding," IEEE Trans. Information Theory, IT-24(5), pages 530-536, September 1978.
The LZ1977 method operates as follows: A sliding window, also termed the "history buffer," stores the most recent N characters of the input. The sliding window is then used as the dictionary. Each successive input character is compared to each of the N previous characters within the sliding window. The address of each match within the sliding window is stored. A string is detected when an input byte matches a cell in the window and the succeeding input byte matches the succeeding cell in the window. The LZ1977 method then, upon detection of a string of two or more characters, waits for final termination of the string. Final termination occurs when the string does not increase by one byte on the next cycle. When this is detected, the LZ1977 method outputs a pointer to the last character of the string within the N-sample sliding window. If two consecutive characters of the input do not match consecutive characters in the window, the first character is sent as raw data.
As stated above, the LZ1977 method, like other dictionary methods, achieves compression because, on average, each time a raw data string is replaced by a pointer to a location in the sliding window, it results in fewer bits than would have been required to represent the raw data. The amount of compression is therefore determined, in part, by the number of bits in the pointer which, in turn, is Log2(length of the buffer). On the other hand, a longer buffer increases the probability of, and the average length of, a string match. The length of the buffer is therefore an optimization problem discussed extensively in the treatises.
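The trade-off between pointer size and buffer length can be made concrete with a small calculation. The sketch below is illustrative only: the function names, the one-bit flag per output datum, the 4-bit length field and the 8-bit raw character are assumed values, not figures from the LZ1977 literature.

```python
def address_bits(window_size):
    """Bits needed to address any cell of the window: ceil(log2(window_size))."""
    return (window_size - 1).bit_length()


def saved_bits(match_len, window_size, length_bits=4, raw_bits=8, flag_bits=1):
    """Bits saved by sending one pointer instead of `match_len` raw characters.

    Assumed format: every output datum carries `flag_bits`, a raw character
    adds `raw_bits`, and a pointer adds an address plus a length field.
    """
    raw_cost = match_len * (flag_bits + raw_bits)
    pointer_cost = flag_bits + address_bits(window_size) + length_bits
    return raw_cost - pointer_cost


small = saved_bits(3, 512)    # 9-bit address
large = saved_bits(3, 4096)   # 12-bit address
single = saved_bits(1, 512)   # negative: a pointer costs more than one raw byte
```

Under these assumed field widths, a three-character match saves fewer bits with the larger window, which is the cost side of the optimization; the benefit side (more and longer matches) depends on the input data and is not modeled here.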
To attain maximum compression, the LZ1977 method uses what is termed "greedy parsing." Greedy parsing locates the longest from among all previous strings corresponding to an input sequence, and then transmits the pointer to that longest string. The general operation of the LZ1977 greedy parsing can be described by example. Assume that the words "ship" and "shipping" are both stored as a character sequence within the sliding window of N previous characters. Next, assume that the word "shipping" appears again in the input sequence. When the second character, "h," of "shipping" is received, the LZ1977 method will identify two stored strings "sh" as matching--the first corresponding to "ship" and the second to "shipping." This continues through the fourth received input character, since the sequence "ship" is stored at two locations in the window. When the fifth character is received, however, the string corresponding to "ship" terminates. The second string has not yet terminated, since the fifth input character "p" is also the fifth in the string "shipping." The existence of the longer matching string prevents a pointer corresponding to "ship" from being output. This is the basis for the term "greedy parsing." Continuing with the example, if the next character in the input string does not match the character stored after the string "shipping", the match is terminated and a pointer is transmitted. The pointer indicates both where in the sliding window the string for "shipping" terminates, i.e., the address of the last "g" character, and the length of that string.
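The greedy parsing described above can be sketched in software as a brute-force search. This is a simplified illustration, not the hardware method: the names `lz77_greedy` and `lz77_decode`, the token layout, the use of an absolute input index in place of a window address, and the restriction that a match may not run past the current input position are all assumptions made for the sketch.

```python
def lz77_greedy(data, window=32, min_len=2):
    """Greedy parse: at each position emit the longest earlier match, if any.

    Tokens are ("raw", char) or ("ptr", end, length), where `end` is the
    index of the last character of the earlier occurrence.
    """
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_start = 0, -1
        for start in range(max(0, pos - window), pos):
            length = 0
            while (pos + length < len(data)
                   and start + length < pos          # stay inside the history
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:                    # ties keep the lowest start
                best_len, best_start = length, start
        if best_len >= min_len:
            tokens.append(("ptr", best_start + best_len - 1, best_len))
            pos += best_len
        else:
            tokens.append(("raw", data[pos]))
            pos += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the input from the tokens produced by lz77_greedy."""
    out = []
    for tok in tokens:
        if tok[0] == "raw":
            out.append(tok[1])
        else:
            _, end, length = tok
            out.extend(out[end - length + 1:end + 1])
    return "".join(out)


text = "ship shipping shipping!"
tokens = lz77_greedy(text)
```

For this input the second "shipping" is covered by a single pointer of length 9 (the preceding space plus the eight letters) rather than by a shorter pointer to "ship", which is the behavior the term "greedy" refers to.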
The LZ1977 method may be performed in software, but is generally implemented by special purpose input/output (I/O) hardware because of the large number of operations-per-character required for the string matching and pointer generation.
A typical special purpose LZ1977 hardware compressor is shown, in block diagram form, in prior art FIG. 1. The FIG. 1 compressor comprises a history buffer, or array, 10, a priority select logic ("PS logic") unit 12 and a resolver encoder, also termed a priority encoder, unit 14. The array unit 10 contains N one-byte storage/one-byte input content addressable memory (one-byte CAM) cells, each holding one byte of data. A typical one-byte CAM cell (not shown) comprises eight one-bit storage registers, a write select line, a data input line, a match line MATCH[n,t], match enabling logic, and a match output line ML[n,t]. The input to the array 10 is the raw data DI[t], where t is the time or sample index. The array 10 operates as the N-sample sliding window of the LZ1977 method, and performs N-1 comparisons for each input byte DI[t] for identifying, as explained below, the end points of matching strings.
Referring to prior art FIG. 2, array 10 operates as follows, using an array size of N=8 as an example:
Beginning at t=0, input data DI[0] is written into the first of the array's eight storage locations (not shown), referenced as S[0] . . . S[7]. At t=1, the second input data DI[1] is written into S[1] (not shown). Also at t=1 the input data DI[1] is compared against the data stored in S[0], and the CAM cell match line, MATCH[0,1], is enabled and output as ML[0,1]. A value of ML[n,t]="1" means that DI[t] matched the content of S[n] at time t. There is a "0" output, at t=1, of the remaining CAM match lines MATCH[n,1] for n=1 to N-1. A disabling of ML[n,t], i.e., not enabling MATCH[n,t], is shown as an "X." There are two conditions where ML[n,t] is not enabled. The first condition is where no previous sample of the input data DI[t] is at the S[n] location. Accordingly, the first column of the ML[n,t] table of FIG. 2 has an "X" in all locations since, at t=0, there are no previous samples of DI[t] in the array 10. The second condition is where the comparison is against the same S[n] location as DI[t] is being written into, which is why ML[n,t] is an "X" for all n=t (mod N). The reason for this disabling is that where n=t or n=t (mod N) the content of S[n] is DI[t], and it would be meaningless to compare the input data with itself.
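The construction of the match-line table can be sketched as follows. The function name `match_lines` and the list-of-rows layout are assumptions for illustration. The sketch applies the two disabling conditions literally, so every cell that has not yet been written is marked "X", as is the cell being written in the current cycle.

```python
def match_lines(data, n_cells=8):
    """Return table[n][t]: 1 for a match, 0 for a mismatch, "X" if disabled."""
    cells = [None] * n_cells                     # None marks a never-written cell
    table = [[None] * len(data) for _ in range(n_cells)]
    for t, byte in enumerate(data):
        write_cell = t % n_cells                 # circular write position
        for n in range(n_cells):
            if n == write_cell or cells[n] is None:
                table[n][t] = "X"                # comparison disabled
            else:
                table[n][t] = 1 if cells[n] == byte else 0
        cells[write_cell] = byte                 # store DI[t] for later cycles
    return table


table = match_lines("abab")
first_column = [row[0] for row in table]
```

With the input "abab", the third character matches the cell written at t=0 and the fourth matches the cell written at t=1, so the "1" entries fall on a diagonal of the table, which is the pattern a matching string produces.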
Referring again to the overall system of FIG. 1, the N parallel ML[n,t] lines of the array 10 are input to the PS logic block 12. The operation of the PS logic block 12 is described by the following set of coupled recursive equations: ##EQU1## where

ORM[t] = M[0,t] OR M[1,t] OR . . . OR M[N-1,t]   (2)

and

M[n,t] = ML[n,t] AND PS[n-1,t]   (3).
Substituting variables, equation (3) can be rewritten as: ##EQU2##
The variables on the right side of Equations (1)-(3) are known and are used to solve for the left side. Based on the above Equations (1)-(3), the PS logic block 12 sets ML[n,t-1]=1 at the beginning of a string match. As can be seen from Equations (1)-(3), a string match occurs at the first sequence of two characters in the input sequence DI[t] that match a sequence of two characters stored in the array 10, setting M[n,t] to "1". M[n,t]="1" does not mean that the corresponding string will not increase by one byte on the next clock cycle. Final termination of a string is indicated by the ORM[t] signal according to Equation (2).
The generation of ORM[t] is shown by the following example: Assume an eight-sample array 10, having S[0] through S[7]. Assume that at time t=0 the characters "t", "o" and "p" are stored in locations S[1], S[2], and S[3], respectively, and the sequence "t", "o", and "m" is stored in locations S[4], S[5] and S[6]. Next, assume that an input sequence "t", "o", "p", and "0" is input as DI[0], DI[1], DI[2], and DI[3].
The input "t" at t=0 is compared to the contents of each S[n] address except S[0], which is the cell the input byte is written to at t=0. In this example, there is a match with the "t" in S[1] and S[4] and, therefore, ML[1,0]="1" and ML[4,0]="1". The value of M[1,0] and M[4,0] will not be "1", however, because, looking to Equation (3), a match of only one character without a previous carry bit will not establish a string. The carry bits PS[1,0] and PS[4,0], however, will be set to "1", as shown in Equation (1).
At the next clock, t=1, there is a match between the input "o" and the "o" contents of S[2] and S[5], and thus ML[2,1] and ML[5,1] are equal to "1". Based on the equations for M[n,t] shown above, together with the previously calculated carry bits PS[1,0] and PS[4,0], M[2,1] and M[5,1] are set equal to "1". M[2,1] and M[5,1] having a value of "1" results in ORM[1] having a value of "1", since ORM[t] is the logical OR of M[n,t] for all n. Also at t=1, PS[2,1] and PS[5,1] are set to "1", based on Equation (1) above, and DI[1] is stored in location S[1].
On the next clock, t=2, there is a match between the input "p" and the content of S[3], and ML[3,2] is thus equal to "1". However, there is no match between input "p" and the "m" content of S[6]. Based on the equations for M[n,t], therefore, M[3,2] will be equal to "1" but M[6,2] will be equal to "0". The input's match against the "tom" string at S[4] through S[6] has terminated at S[5]. M[3,2] having a value of "1" results in ORM[2] having a value of "1", since ORM[t] is the logical OR of M[n,t] for all n, even though M[6,2] is equal to "0". Therefore, ORM[t] indicates that a string, i.e., the "top" string at S[1] through S[3], is continuing.
On the next clock, t=3, however, the input byte of "0" does not match the "t" character in S[4]. Therefore, ML[4,3] will be equal to "0" and, hence, M[4,3] is equal to "0". Based on Equation (2) above, ORM[3] will have a value of "0". The longest sequence in the array 10 corresponding to the example input sequence "top0", which is the "top" sequence stored at t=0 at S[1] through S[3], is thus indicated as terminated by ORM[t]. The output sequence of ORM[t] for the above example is "0110", with the string termination where ORM[t] changes from a "1" to a "0".
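The "top"/"tom" example can be traced with a short simulation. Because Equation (1) is not reproduced in this text, the carry rule used below is an assumption inferred from the narrative: the carry PS follows M while a string is in progress (ORM is 1) and follows ML otherwise. The sketch also takes the carry from cell n-1 of the previous cycle, as the worked example does, and wraps from cell 0 back to cell N-1, which is a further assumption. The function name `ps_logic` is hypothetical.

```python
def ps_logic(initial_cells, data):
    """Trace ORM[t] and M[n,t] for `data` against a preloaded CAM array."""
    cells = list(initial_cells)
    n_cells = len(cells)
    prev_ps = [0] * n_cells                      # carry bits from the previous cycle
    orm_seq, m_columns = [], []
    for t, byte in enumerate(data):
        write_cell = t % n_cells
        cells[write_cell] = byte                 # DI[t] is stored in S[t mod N]
        # Match lines; the cell being written is excluded from comparison.
        ml = [int(n != write_cell and cells[n] == byte) for n in range(n_cells)]
        # Equation (3) as applied in the example: match AND carry from cell n-1.
        m = [ml[n] & prev_ps[(n - 1) % n_cells] for n in range(n_cells)]
        orm = int(any(m))                        # Equation (2)
        prev_ps = m if orm else ml               # assumed form of Equation (1)
        orm_seq.append(orm)
        m_columns.append(m)
    return orm_seq, m_columns


# S[1..3] hold "top", S[4..6] hold "tom"; S[0] and S[7] start empty.
cells = [None, "t", "o", "p", "t", "o", "m", None]
orm_seq, m_columns = ps_logic(cells, "top0")
```

Under these assumptions the trace yields the ORM sequence 0, 1, 1, 0, with M set at cells 2 and 5 at t=1 and only at cell 3 at t=2, in agreement with the walk-through above.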
When ORM[t] changes from "1" to "0", the PS Logic unit 12 finds the lowest n for which M[n,t] was equal to "1" in the last cycle before the change. This corresponds to the location within the array 10 for the longest string match and, if more than one are of identical length, the lowest n. Accordingly, PS Logic unit 12 identifies the longest matching string within the array buffer 10.
FIG. 3 illustrates the operation of PS Logic block 12 in accordance with equations (1) through (4), as implemented on a prior art one-byte per location and one-byte input content addressable memory CAM processor of FIG. 2.
The example ML[n,t] array depicted in FIG. 3 indexes storage locations as rows, n, and indexes time increments as columns, t. The FIG. 3 matrix values are based on a random input sequence. As described above, cycles in which ORM[t] is "1" before changing to "0" are cycles in which one or more strings terminate. Looking to FIG. 4, the address of the terminating string within the array 10 is the lowest n (highest toward the top) of the M[n,t] matrix in the column t at which ORM[t] is "1" immediately before changing to "0". The string length is one plus the number of cycles of t for which ORM[t]="1" prior to changing to "0". Referring to FIG. 3, a string of length 4 is shown as terminating at t=5, at address n=3. A string of length 2 terminates at t=9, at address n=2. If two or more strings start and stop at the same time, i.e., an input string and two or more strings stored in the array 10 are identical, the Priority Select block 12 will pick the lowest n.
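The rule for recovering a string's address and length from ORM[t] and M[n,t] can be sketched as below. The function name `terminations` and the convention of reporting the last cycle in which ORM was "1" as the termination time are assumptions; the input values are those of the "top" example given earlier, not the contents of FIG. 3.

```python
def terminations(orm, m_columns):
    """List (t, address, length) for every 1-to-0 transition of ORM.

    `m_columns[t][n]` holds M[n,t]. A run still in progress at the end of
    the input is not reported, since no termination has been observed.
    """
    found = []
    run = 0                                      # consecutive cycles with ORM = 1
    for t, flag in enumerate(orm):
        if flag:
            run += 1
        elif run:
            last = t - 1                         # last cycle in which ORM was 1
            address = m_columns[last].index(1)   # lowest n with M[n,last] = 1
            found.append((last, address, run + 1))
            run = 0
    return found


orm = [0, 1, 1, 0]
m_columns = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
result = terminations(orm, m_columns)
```

For the "top" example this reports one string of length 3 ending at address 3, i.e., one plus the two cycles during which ORM was "1".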
As stated above, the prior art LZ1977 methods and hardware process the input stream at one byte per cycle. Therefore, looking to Equation (3) it can be seen that the M[n,t] output represents a match which can happen on only one path from the previous M[n-1,t-1]. More specifically, the only path is where PS[n-1,t-1] is equal to "1" and, at the next clock, t, the match ML[n,t] is equal to "1". This path, or trajectory, is shown as M[n,t] in FIG. 4. Accordingly, the hardware for one-byte-per-cycle string matching must only account for that one trajectory.
The one-byte-per-cycle methods and hardware of the related art, although pipelined and paralleled where possible for the maximum operations-per-cycle, are, by definition, inherently limited in their overall throughput to one character, or byte, per machine cycle. Accordingly, as the present inventor has recognized, a significant throughput advantage can be gained by performing the string match at two or more bytes per cycle. There are, however, several reasons why the related art cannot perform string match processing at more than one byte per cycle. One, which is discussed in greater detail in the Description of the Preferred Embodiments below, is that optimal parsing for LZ1977 and related algorithms requires that the input string be parsed to a resolution of one byte. In other words, even if the parsing is performed at two bytes per cycle, the hardware must discriminate between a termination at the second, or odd, byte and a termination at the first, or even, byte of the input byte pair. The present art, in addition to lacking means for inputting, matching, and storing the input string at two bytes per cycle, lacks any means for discriminating between an even and an odd byte termination.
Another fundamental shortcoming of the related art single-byte-per-cycle methods is that detecting a string termination based, for example, on two consecutive bytes of the input data being matched, in one machine cycle, against the contents of the history buffer, requires tracking of three or more trajectories from the results of the previous input byte-pair being matched against the history buffer contents.
A co-pending application, Ser. No. 228,321, filed Apr. 15, 1994, entitled "A Method and Means For Character String Pattern Matching For Compression and the Like Using Minimal Cycles Per Character", which is referenced herein as "Application '321", identifies what it terms a multi-byte-per-cycle dictionary compression. However, the method and hardware described in Application '321 will not produce a correct or usable compression output for all input sequences. This can be seen by referring to Related Art FIG. 5, which is a redrawing of the match detect logic associated with one of the 8 separate one-byte storage locations contained in the composite system of Application '321 FIGS. 4B.1, 4C.1 and 4C.2. The only output of the FIG. 5 match detect logic is PS[n,t], which is identical to PS[n,t] for the prior art single-byte-per-cycle method of FIG. 1.
Referring back to Equations (3) and (4) describing M[n,t] for the prior art single-byte-per-cycle method, it is clear that the termination identifier M[n,t] is a function of the match line for n, ML[n,t], ANDed with the carry information PS[n-1,t]. The reason is that M[n,t]=1, when parsing one byte per cycle, requires concurrence of only two conditions. One is a match at n at time t. The other is a match at n-1 at the previous time t-1. However, as can be seen from the Description of the Preferred Embodiments below, if the input is processed at two bytes per cycle there are three conditions that can produce a match signal M[n,t] at location n.
Further, when processing at one byte per cycle, a final termination, i.e., where a string matching through location n at time t does not extend to location n+1 at time t+1, is at that byte location n. Referring to related art FIG. 2, this is the n value of M[n,t] occurring where ORM[t] transitions from "1" to "0". However, as stated above, a string from a two-byte-per-cycle input can terminate at either the even or the odd byte. The PS[n,t] information of Application '321 is insufficient to detect all of the three state conditions which, for a two-byte-per-cycle input, produce a string termination, M[n,t], at location n, and is insufficient to determine whether a string terminated at the even or odd byte of the input. Accordingly, Application '321 does not disclose a complete or realizable two-byte-per-cycle compression method.
For the foregoing reasons, a need exists for a two-byte-per-cycle method and means for implementing LZ1977 and other dictionary-based string matching processes.